chore(openspec): drop 9 superseded proposals + 11 stub archive files
Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
@@ -0,0 +1,6 @@
schema: spec-driven
created: 2026-06-07
goal: "Create boocontext: a local-first MCP codebase context server forked from
  codesight that provides overview + deep analysis (call graph, impact, health
  grades, type recovery) via child MCP servers, usable from opencode, claude,
  and boocode/boochat"
@@ -0,0 +1,3 @@
|
||||
# boocontext
|
||||
|
||||
Local-first MCP codebase context capability - aggregator server forked from codesight with deep analysis via tree-sitter-analyzer
|
||||
152
openspec/changes/archived/2026-06-07-boocontext/design.md
Normal file
@@ -0,0 +1,152 @@
|
||||
## Context
|
||||
|
||||
boocontext is forked from codesight (14+ languages, 40+ frameworks, 13 MCP tools, TypeScript compiler AST + regex scanner). codesight provides project-level overview: routes, schemas, components, dependency graph, blast-radius. It does not do deep per-file analysis (call graphs, code health, type recovery).
|
||||
|
||||
tree-sitter-analyzer (Python, SQLite index, 8+ MCP tools) provides the deep layer: call graph (callers/callees/call-paths), A–F code health grading, BM25-ranked symbol search, change impact, complexity heatmaps. It ships as `tree-sitter-analyzer[mcp]` on PyPI, launchable via `uvx`.
|
||||
|
||||
type-inject (TypeScript/Node) provides cross-file TS type recovery: resolved signatures, interfaces, generics.
|
||||
|
||||
boocontext aggregates these into one MCP server process so host applications register a single server, not three.
|
||||
|
||||
Current state: fork exists at `/opt/forks/boocontext` (untouched), tree-sitter-analyzer at `/opt/forks/tree-sitter-analyzer`, type-inject at `/opt/forks/type-inject`. No wiring exists yet.
|
||||
|
||||
Constraints:
|
||||
- Zero new inference — boocontext is a tool server. The calling host (opencode/claude/boocode/boochat) owns LLM synthesis.
|
||||
- All 7 tools return verdict envelopes (structured facts + safety classification).
|
||||
- Child servers must be lazily spawned on first use and kept alive for the session.
|
||||
- Compression (DCP) is optional — only applied to `boocontext_map` output when payload exceeds threshold.
|
||||
|
||||
## Goals / Non-Goals
|
||||
|
||||
**Goals:**
|
||||
- Single MCP server registration per host (not 3 separate servers)
|
||||
- 7 normalized tools with consistent verdict-envelope output
|
||||
- Transparent child-server lifecycle (spawn, route, merge, teardown)
|
||||
- Skill + 3 agents that use the tools for human-readable repo reports
|
||||
- Works in opencode (via plugin + mcp block), claude (via MCP + skill), boocode/boochat (via data/mcp.json + skill)
|
||||
|
||||
**Non-Goals:**
|
||||
- Not a general-purpose MCP gateway — only boocontext-specific child servers
|
||||
- No dedicated caching layer (child servers cache internally; boocontext only caches the codesight scan result per session)
|
||||
- No web UI and no HTTP API; MCP over stdio is the only interface
|
||||
- No inference, no LLM integration inside the server
|
||||
- No TypeScript type recovery for non-TS languages (type-inject is TS-only)
|
||||
- No replacement of codesight — codesight continues to exist as the upstream; boocontext extends the fork
|
||||
|
||||
## Decisions
|
||||
|
||||
### D1: Aggregator-fork, not wrapper
|
||||
boocontext modifies codesight's `mcp-server.ts` in-place rather than wrapping it in a separate process. This avoids double-scans (codesight and boocontext would each crawl the repo). The codesight scanner is reused directly; new tools are added alongside existing ones.
|
||||
|
||||
### D2: Child servers via subprocess stdio, not HTTP
|
||||
tree-sitter-analyzer and type-inject are spawned as child processes with MCP stdio transport. boocontext uses the `@modelcontextprotocol/sdk` client to connect. Rationale: no port conflicts, no network exposure, same machine, simple lifecycle management.
|
||||
|
||||
### D3: Lazy spawn on first tool call
|
||||
Child servers are not started at boocontext startup. They are spawned on the first tool call that needs them (`boocontext_health`, `boocontext_symbols`, `boocontext_callgraph`, `boocontext_impact` → spawn TSA; `boocontext_types` → spawn type-inject). Once spawned, the child process stays alive for the session and is killed when boocontext exits.
|
||||
|
||||
### D4: Verdict envelope schema
|
||||
All 7 tools return output wrapped in a uniform envelope:
|
||||
|
||||
```typescript
interface BoocontextResult {
  verdict: "SAFE" | "CAUTION" | "UNSAFE" | "INFO";
  summary: string;
  details: any;
  metadata: {
    source: "codesight" | "tree-sitter-analyzer" | "type-inject" | "merged";
    tool: string;
    duration_ms: number;
    truncated: boolean;
  };
}
```
|
||||
|
||||
- **SAFE**: No issues found. Data is complete and actionable.
|
||||
- **CAUTION**: Minor issues or warnings. Data may be partial.
|
||||
- **UNSAFE**: Significant problems (e.g., analysis failed, index missing, project too large).
|
||||
- **INFO**: Informational response (no error, no warning — e.g., help text or ping).
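
As a rough illustration of how a handler result could be wrapped in this envelope, the sketch below pairs a `makeVerdict` builder (named in the task list) with a hypothetical `withVerdict` helper that turns a thrown error into an UNSAFE result. The metadata defaults, the `withVerdict` name, and the use of `unknown` instead of `any` for `details` are assumptions, not the shipped code.

```typescript
type Verdict = "SAFE" | "CAUTION" | "UNSAFE" | "INFO";
type Source = "codesight" | "tree-sitter-analyzer" | "type-inject" | "merged";

interface BoocontextResult {
  verdict: Verdict;
  summary: string;
  details: unknown; // the design uses `any`; `unknown` is a stricter stand-in
  metadata: { source: Source; tool: string; duration_ms: number; truncated: boolean };
}

// Builder: fills in metadata defaults (assumed: 0 ms, not truncated).
function makeVerdict(
  verdict: Verdict,
  summary: string,
  details: unknown,
  metadata: { source: Source; tool: string; duration_ms?: number; truncated?: boolean },
): BoocontextResult {
  return {
    verdict,
    summary,
    details,
    metadata: {
      source: metadata.source,
      tool: metadata.tool,
      duration_ms: metadata.duration_ms ?? 0,
      truncated: metadata.truncated ?? false,
    },
  };
}

// Hypothetical wrapper: runs a handler, times it, and maps failure to UNSAFE.
async function withVerdict(
  tool: string,
  source: Source,
  run: () => Promise<unknown>,
): Promise<BoocontextResult> {
  const started = Date.now();
  try {
    const details = await run();
    return makeVerdict("SAFE", `${tool} completed`, details, {
      source,
      tool,
      duration_ms: Date.now() - started,
    });
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return makeVerdict("UNSAFE", `${tool} failed: ${message}`, null, {
      source,
      tool,
      duration_ms: Date.now() - started,
    });
  }
}
```

A handler that needs CAUTION or INFO would call `makeVerdict` directly; the wrapper only covers the success and failure paths.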
|
||||
|
||||
### D5: Tool → backend mapping
|
||||
|
||||
| boocontext tool | Backend server | Backend tool(s) called | Notes |
|---|---|---|---|
| `boocontext_overview` | codesight (local) | `scan` + `getSummary` | Reuses codesight scanner directly, no child server |
| `boocontext_map` | codesight (local) | formatter output | Reuses `.codesight/` output; optional DCP compression |
| `boocontext_health` | tree-sitter-analyzer | `file_health`, `project_health` | Spawns TSA child server |
| `boocontext_symbols` | tree-sitter-analyzer | `search_content`, `query_code` | BM25 symbol search via TSA |
| `boocontext_callgraph` | tree-sitter-analyzer | `callers`, `callees`, `call_graph` | TSA call graph |
| `boocontext_impact` | tree-sitter-analyzer + codesight | TSA `trace_impact` + codesight `blast_radius` | Merged symbol-level + file-level impact |
| `boocontext_types` | type-inject | `infer_type`, `resolve_signature` | TS type recovery |
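
The mapping above could be expressed as a static lookup table so that routing stays data-driven. This is a minimal sketch: the `ROUTES` constant, the `Route` shape, and `childServersFor` are hypothetical names, and the tool lists simply restate the table.

```typescript
type Backend = "codesight" | "tree-sitter-analyzer" | "type-inject";

interface Route {
  backends: Backend[];    // which backends serve this tool
  backendTools: string[]; // backend tool names, as listed in the table
}

// Hypothetical routing table mirroring the D5 mapping.
const ROUTES: Record<string, Route> = {
  boocontext_overview: { backends: ["codesight"], backendTools: ["scan", "getSummary"] },
  boocontext_map: { backends: ["codesight"], backendTools: [] }, // formatter output, no tool call
  boocontext_health: {
    backends: ["tree-sitter-analyzer"],
    backendTools: ["file_health", "project_health"],
  },
  boocontext_symbols: {
    backends: ["tree-sitter-analyzer"],
    backendTools: ["search_content", "query_code"],
  },
  boocontext_callgraph: {
    backends: ["tree-sitter-analyzer"],
    backendTools: ["callers", "callees", "call_graph"],
  },
  boocontext_impact: {
    backends: ["tree-sitter-analyzer", "codesight"],
    backendTools: ["trace_impact", "blast_radius"],
  },
  boocontext_types: { backends: ["type-inject"], backendTools: ["infer_type", "resolve_signature"] },
};

// Returns only the backends that need a child process; codesight runs in-process.
function childServersFor(tool: string): Backend[] {
  const route = ROUTES[tool];
  if (!route) throw new Error(`unknown tool: ${tool}`);
  return route.backends.filter((b) => b !== "codesight");
}
```

With a table like this, `boocontext_overview` and `boocontext_map` never trigger a spawn, which matches the lazy-spawn decision in D3.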
|
||||
|
||||
### D6: codesight tools preserved
|
||||
The existing codesight tools (`codesight_scan`, `codesight_get_routes`, etc.) remain in the source tree but are not advertised in the boocontext tool list. The `boocontext_*` tools are the public surface. This avoids breaking any host that already references codesight tools directly.
|
||||
|
||||
### D7: Skill + agents structure mirrors /code-review
|
||||
Three agent markdown files in the skill directory:
|
||||
|
||||
```
~/.claude/plugins/cache/han/han-core/1.0.0/skills/boocontext/
  SKILL.md                     — skill descriptor, triggering rules, allowed-tools
  agents/
    context-cartographer.md    — overview + map synthesis for repo orientation
    dependency-analyst.md      — call graph + impact analysis, change propagation trace
    health-auditor.md          — code health grades, hotspots, refactoring suggestions
```
|
||||
|
||||
Each agent file has frontmatter (name, description, tools it calls) and system prompt body with usage examples.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```
┌─────────────────────────────────────────────────────────────────────┐
│  HOST (opencode / claude / boocode)                                 │
│  Skill dispatch → agent orchestration → tool calls → synthesis      │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ MCP stdio
┌──────────────────────────────▼──────────────────────────────────────┐
│  boocontext MCP server (TS)                                         │
│  forked from codesight, adds:                                       │
│  - 7 boocontext_* tools with verdict envelopes                      │
│  - ChildServerManager (spawn/route/merge/kill)                      │
│  - DCP compression module (optional)                                │
│                                                                     │
│  ┌────────────┐  ┌──────────────────┐  ┌────────────────────────┐   │
│  │ codesight  │  │ tree-sitter-     │  │ type-inject (node)     │   │
│  │ scanner    │  │ analyzer (uvx)   │  │ child server           │   │
│  │ (in-proc)  │  │ child server     │  │                        │   │
│  └────────────┘  └──────────────────┘  └────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘
```
|
||||
|
||||
## Child Server Protocol
|
||||
|
||||
boocontext implements a `ChildServerManager` class:
|
||||
|
||||
```typescript
interface ChildServerConfig {
  name: string;
  command: string; // "uvx" | "node"
  args: string[];
  env?: Record<string, string>;
  tools: string[]; // tools this child serves (e.g., ["file_health", "callers"])
}

class ChildServerManager {
  private servers: Map<string, McpClient>;

  async getServer(name: string): Promise<McpClient>;
  async callTool(serverName: string, tool: string, args: any): Promise<any>;
  async shutdown(): Promise<void>;
}
```
|
||||
|
||||
On first call to a boocontext tool that routes to TSA or type-inject, `getServer()` spawns the child process, connects via MCP stdio client, and caches the client. Subsequent calls reuse the cached connection.
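
The spawn-once behaviour described here can be sketched by caching the in-flight promise rather than the resolved client, so that two concurrent first calls share a single spawn. The sketch is deliberately SDK-free: `FakeClient`, `SpawnFn`, and `LazyServerCache` are hypothetical stand-ins for the MCP client and the real `ChildServerManager`.

```typescript
// Stand-in for an MCP client connection (assumption, not the SDK type).
interface FakeClient {
  name: string;
  close(): Promise<void>;
}

type SpawnFn = (name: string) => Promise<FakeClient>;

class LazyServerCache {
  private readonly pending = new Map<string, Promise<FakeClient>>();
  private readonly spawn: SpawnFn;

  constructor(spawn: SpawnFn) {
    this.spawn = spawn;
  }

  // Returns the cached connection, spawning the child on first use.
  getServer(name: string): Promise<FakeClient> {
    let entry = this.pending.get(name);
    if (!entry) {
      entry = this.spawn(name).catch((err: unknown) => {
        this.pending.delete(name); // drop the failed entry so a later call can respawn
        throw err;
      });
      this.pending.set(name, entry);
    }
    return entry;
  }

  // Closes every child that started successfully and clears the cache.
  async shutdown(): Promise<void> {
    const entries = Array.from(this.pending.values());
    this.pending.clear();
    for (const entry of entries) {
      try {
        const client = await entry;
        await client.close();
      } catch {
        // a child that never started has nothing to close
      }
    }
  }
}
```

Caching the promise avoids the race where two tool calls arrive before the first spawn completes and each starts its own child.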
|
||||
|
||||
Teardown: `ChildServerManager.shutdown()` is called on server SIGTERM/SIGINT.
|
||||
|
||||
## Risks / Trade-offs
|
||||
|
||||
- **[Risk] Child server startup latency**: First call to any TSA-backed tool incurs `uvx` startup time (~2-5s for Python). Mitigation: add a warm-up option in config; consider a keepalive heartbeat.
|
||||
- **[Risk] Child server failure**: If TSA or type-inject crashes mid-request, boocontext logs the error and returns an UNSAFE verdict. Mitigation: boocontext retries once against a freshly spawned child server before returning UNSAFE; any further retry is left to the client.
|
||||
- **[Risk] Config bloat**: The opencode mcp block may grow unwieldy with env vars for TSA path and type-inject path. Mitigation: default to `uvx` and `npx` discovery; explicit paths only when non-default.
|
||||
- **[Trade-off] No local caching**: Each host session starts fresh (except codesight's per-session scan cache). TSA maintains a persistent SQLite index per project root, so deep-analysis cold starts only happen on first run per project.
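
The single-retry mitigation for child server failure might look like the sketch below. It assumes the caller supplies a function that can be asked to use a fresh child; the `callWithSingleRetry` name and the `fresh` flag are illustrative only.

```typescript
// `fresh = true` asks the callee to respawn the child before calling (assumed contract).
type ToolCall<T> = (fresh: boolean) => Promise<T>;

type RetryResult<T> = { ok: true; value: T } | { ok: false; error: string };

async function callWithSingleRetry<T>(call: ToolCall<T>): Promise<RetryResult<T>> {
  try {
    return { ok: true, value: await call(false) };
  } catch {
    // First attempt failed: retry exactly once against a fresh child.
    try {
      return { ok: true, value: await call(true) };
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      return { ok: false, error }; // caller maps this to an UNSAFE verdict
    }
  }
}
```

Capping the retry at one attempt keeps worst-case latency bounded at roughly two child startups.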
|
||||
43
openspec/changes/archived/2026-06-07-boocontext/proposal.md
Normal file
@@ -0,0 +1,43 @@
|
||||
## Why
|
||||
|
||||
AI-assisted development requires understanding codebases at multiple granularities — project overview for initial orientation, deep analysis (call graphs, type information, impact zones) for targeted changes. Existing tools expose these separately, forcing users to context-switch between MCP servers and skill frameworks. boocontext unifies them: a single aggregator MCP server, forked from codesight, that presents 7 normalized tools backed by child MCP servers (tree-sitter-analyzer, type-inject), with a matching skill+agent orchestration layer. Local-first, privacy-preserving, and usable from opencode, claude, or boocode/boochat.
|
||||
|
||||
## What Changes
|
||||
|
||||
- **Fork codesight** into `/opt/forks/boocontext` (already cloned). Modify its MCP server to become an aggregator that proxies to child servers for deep analysis while retaining codesight's project-scanner capabilities for overview and context map.
|
||||
- **Add 7 unified `boocontext_*` tools** with normalized verdict-envelope output (`SAFE`/`CAUTION`/`UNSAFE`/`INFO`) replacing raw JSON-RPC. Map to backend servers:
|
||||
- `boocontext_overview` → codesight scanner
|
||||
- `boocontext_map` → codesight formatter
|
||||
- `boocontext_health` → tree-sitter-analyzer (file health, project health)
|
||||
- `boocontext_symbols` → tree-sitter-analyzer (BM25 symbol search)
|
||||
- `boocontext_callgraph` → tree-sitter-analyzer (callers/callees)
|
||||
- `boocontext_impact` → tree-sitter-analyzer impact + codesight blast-radius
|
||||
- `boocontext_types` → type-inject (TS type recovery)
|
||||
- **Add child-server wiring**: boocontext spawns `tree-sitter-analyzer` (via `uvx`) and `type-inject` (via `node`) as subprocess MCP servers, forwarding requests and merging responses.
|
||||
- **Create skill + 3 agents** at `~/.claude/plugins/cache/han/han-core/1.0.0/skills/boocontext/`:
|
||||
- `SKILL.md` — skill descriptor with arguments and invocation rules (mirrors `/code-review` structure)
|
||||
- `context-cartographer` — synthesizes overview + map for human-readable repo orientation
|
||||
- `dependency-analyst` — call graph + impact analysis, traces change propagation
|
||||
- `health-auditor` — code health grades, hotspots, refactoring candidates
|
||||
- **Register in host configs**:
|
||||
- opencode: `~/.config/opencode/opencode.json` → `mcp.boocontext` block
|
||||
- boocode: `/opt/boocode/data/mcp.json` → `boocontext` server entry
|
||||
- claude: `~/.claude/mcp.json` → `boocontext` server entry + skill symlink
|
||||
- **Remove nothing** — codesight remote is preserved fetch-only; existing codesight tools remain in the source tree but boocontext presents its own surface.
|
||||
|
||||
## Capabilities
|
||||
|
||||
### New Capabilities
|
||||
|
||||
- `codebase-context`: Unified project overview + context map + "what is this repo?" synthesis. Backed by codesight scanner + formatter. Entry point for onboarding to any repo.
|
||||
- `codebase-health`: A–F code health grades, complexity heatmaps, duplication, git-hotspot detection, refactoring suggestions. Backed by tree-sitter-analyzer.
|
||||
- `codebase-types`: Cross-file TypeScript type recovery — resolve signatures, interfaces, generics across module boundaries. Backed by type-inject.
|
||||
|
||||
## Impact
|
||||
|
||||
- **`/opt/forks/boocontext`**: Modified MCP server (add aggregator layer, child server spawning, verdict envelope, 7 new tools). Codesight code reused, not removed.
|
||||
- **`~/.config/opencode/opencode.json`**: New `mcp.boocontext` entry with stdio command and env.
|
||||
- **`~/.claude/plugins/cache/han/han-core/1.0.0/skills/boocontext/`**: New skill directory with SKILL.md + 3 agent files.
|
||||
- **`/opt/boocode/data/mcp.json`**: New boocontext server entry.
|
||||
- **`/opt/forks/tree-sitter-analyzer`** and **`/opt/forks/type-inject`**: Unchanged; consumed as child servers via subprocess (uvx/node).
|
||||
- **`~/.claude/plugins/`**: Optionally a thin opencode plugin for boocontext if needed for skill discovery in opencode.
|
||||
@@ -0,0 +1,15 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Unified project overview
|
||||
The system SHALL provide a single tool that returns a comprehensive project overview including language stack, directory structure, entry points, and high-level architecture.
|
||||
|
||||
#### Scenario: Overview returned for any repo
|
||||
- **WHEN** a user requests a project overview
|
||||
- **THEN** the system SHALL return language stack, key directories, dependency graph, and entry points
|
||||
|
||||
### Requirement: Context map with compression
|
||||
The system SHALL provide a context map (file listing with annotations) using DCP compression for large payloads.
|
||||
|
||||
#### Scenario: Compressed context map
|
||||
- **WHEN** a repo exceeds threshold size for a full scan
|
||||
- **THEN** the system SHALL apply DCP compression to reduce payload
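
A possible shape for the threshold gate is sketched below. The 50,000-character default comes from the task list; the DCP algorithm itself is not described in this change, so the compressor is passed in as a parameter, and the `MapPayload` shape and hint text are assumptions.

```typescript
const DEFAULT_THRESHOLD = 50000; // characters, per the task list default

interface MapPayload {
  body: string;
  compressed: boolean;
  hint?: string; // decompression hint surfaced in metadata (wording is hypothetical)
}

// Applies the supplied compressor only when the payload exceeds the threshold.
function maybeCompress(
  body: string,
  compress: (input: string) => string,
  threshold: number = DEFAULT_THRESHOLD,
): MapPayload {
  if (body.length <= threshold) {
    return { body, compressed: false };
  }
  return {
    body: compress(body),
    compressed: true,
    hint: "payload was compressed; request with compress=false for the raw map",
  };
}
```

Keeping the compressor injectable means the gate can be unit-tested without the real DCP module.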
|
||||
@@ -0,0 +1,16 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Code health grades
|
||||
The system SHALL return A–F code health scores per file and aggregate per project.
|
||||
|
||||
#### Scenario: File health score
|
||||
- **WHEN** a file is analyzed for code health
|
||||
- **THEN** it SHALL receive a score from 10.0 (optimal) to 1.0 (worst)
|
||||
- **AND** the score SHALL be mapped to an A–F grade
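
The requirement fixes the score range (10.0 best, 1.0 worst) and the A–F scale but not the cutoffs. The sketch below uses hypothetical thresholds purely to show the mapping; tree-sitter-analyzer's actual boundaries may differ.

```typescript
type Grade = "A" | "B" | "C" | "D" | "F";

// Hypothetical cutoffs; the spec does not define them.
const CUTOFFS: Array<[number, Grade]> = [
  [9.0, "A"],
  [7.5, "B"],
  [6.0, "C"],
  [4.0, "D"],
];

function toGrade(score: number): Grade {
  if (!(score >= 1.0 && score <= 10.0)) {
    throw new RangeError(`score out of range: ${score}`);
  }
  for (const [min, grade] of CUTOFFS) {
    if (score >= min) return grade;
  }
  return "F"; // anything below the lowest cutoff
}
```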
|
||||
|
||||
### Requirement: Hotspot detection
|
||||
The system SHALL identify technical debt hotspots — files with high revision count and low code health.
|
||||
|
||||
#### Scenario: Hotspots listed
|
||||
- **WHEN** a project is scanned for hotspots
|
||||
- **THEN** files with high churn and low health SHALL be ranked
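
One simple way to rank "high churn and low health" is to multiply revision count by the health deficit. The formula and the `FileStats` shape below are assumptions for illustration; the requirement does not prescribe a scoring function.

```typescript
interface FileStats {
  path: string;
  revisions: number; // commit count touching the file
  health: number;    // 1.0 (worst) to 10.0 (optimal)
}

// Hypothetical score: more revisions and lower health both raise the rank.
function hotspotScore(file: FileStats): number {
  return file.revisions * (10 - file.health);
}

function rankHotspots(files: FileStats[], limit: number): FileStats[] {
  return files
    .slice() // avoid mutating the caller's array
    .sort((a, b) => hotspotScore(b) - hotspotScore(a))
    .slice(0, limit);
}
```

A file with perfect health scores zero regardless of churn, so frequently edited but healthy files do not crowd the list.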
|
||||
@@ -0,0 +1,15 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Cross-file type recovery
|
||||
The system SHALL resolve TypeScript types across module boundaries — inferring types, resolving interfaces, and following generics.
|
||||
|
||||
#### Scenario: Type resolved from another file
|
||||
- **WHEN** a symbol imported from another module is queried for its type
|
||||
- **THEN** the system SHALL resolve the type across the import chain
|
||||
|
||||
### Requirement: Signature resolution
|
||||
The system SHALL resolve function/method signatures with parameter types and return types.
|
||||
|
||||
#### Scenario: Signature returned
|
||||
- **WHEN** a function symbol is queried
|
||||
- **THEN** the system SHALL return parameter names, types, and return type
|
||||
64
openspec/changes/archived/2026-06-07-boocontext/tasks.md
Normal file
@@ -0,0 +1,64 @@
|
||||
## 1. Scaffold boocontext fork
|
||||
|
||||
- [x] 1.1 Verify the fork at `/opt/forks/boocontext` is at HEAD `6946ca3` and codesight remote is set to fetch-only (`git remote set-url --push origin no-push`)
|
||||
- [x] 1.2 Update `package.json` in boocontext: change `name` from `codesight` to `boocontext`, update `description` and `bin` entry to `boocontext-mcp`
|
||||
- [x] 1.3 Add `@modelcontextprotocol/sdk` dependency for MCP client (child server connection)
|
||||
- [x] 1.4 Create `src/child-server.ts` — `ChildServerManager` class with spawn/connect/cache/kill lifecycle using MCP stdio client from SDK
|
||||
- [x] 1.5 Create `src/verdict.ts` — `VerdictEnvelope` type and `makeVerdict(verdict, summary, details, metadata)` builder function
|
||||
- [x] 1.6 Create `src/dcp.ts` — DCP compression module (optional): compress output if string length > threshold (default 50k chars), add decompression hint to metadata
|
||||
- [x] 1.7 Create `src/tools/` directory with index.ts that exports all tool handlers
|
||||
- [x] 1.8 Create `src/boocontext-plugin.ts` — thin opencode plugin wrapper if needed for skill discovery (plugin.json with base name, version, description, triggers)
|
||||
|
||||
## 2. Child server wiring
|
||||
|
||||
- [x] 2.1 `src/child-server.ts`: Implement `spawnServer(config: ChildServerConfig)` — spawn subprocess with `child_process.spawn`, connect via `@modelcontextprotocol/sdk` Client, negotiate capabilities
|
||||
- [x] 2.2 `src/child-server.ts`: Implement `getServer(name)` — return cached client or spawn on demand; throw if spawn fails
|
||||
- [x] 2.3 `src/child-server.ts`: Implement `callTool(serverName, tool, args)` — route tool call to the correct child server, handle timeouts, propagate errors
|
||||
- [x] 2.4 `src/child-server.ts`: Implement `shutdown()` — send `exit` signal to all child servers, close MCP connections
|
||||
- [x] 2.5 `src/child-server.ts`: Handle SIGTERM/SIGINT in boocontext main process → call `shutdown()`
|
||||
- [x] 2.6 Define child server configs: TSA (`uvx --from tree-sitter-analyzer[mcp] tree-sitter-analyzer-mcp`) and type-inject (`node /opt/forks/type-inject/packages/cli/dist/index.js` + optional npx fallback)
|
||||
- [x] 2.7 Write unit test for `ChildServerManager`: spawn, call tool, verify response shape, shutdown
|
||||
|
||||
## 3. Unified tools (boocontext_*)
|
||||
|
||||
- [x] 3.1 `src/tools/overview.ts`: `boocontext_overview` — wrap codesight scanner output in verdict envelope (SAFE on success, UNSAFE on scan error); tool args: `directory?`
|
||||
- [x] 3.2 `src/tools/map.ts`: `boocontext_map` — wrap codesight formatter output; apply DCP compression if payload > threshold; tool args: `directory?`, `compress?`
|
||||
- [x] 3.3 `src/tools/health.ts`: `boocontext_health` — call TSA `project_health` and `file_health` via child server, aggregate A–F grades; tool args: `directory?`, `file?` (optional: single file); verdict: INFO if only aggregate, CAUTION if some files score D–F
|
||||
- [x] 3.4 `src/tools/symbols.ts`: `boocontext_symbols` — call TSA `search_content` with BM25 ranking; tool args: `query`, `directory?`, `limit?`; verdict: INFO
|
||||
- [x] 3.5 `src/tools/callgraph.ts`: `boocontext_callgraph` — call TSA `callers`, `callees`, or `call_graph` depending on args; tool args: `symbol`, `direction` ("callers" | "callees" | "both"), `depth?`, `file?`; verdict: INFO
|
||||
- [x] 3.6 `src/tools/impact.ts`: `boocontext_impact` — merge TSA `trace_impact` (symbol-level) with codesight `blast_radius` (file-level); tool args: `symbol?`, `file?`; verdict: UNSAFE if affected files exist (calls attention), CAUTION if uncertain, SAFE if none
|
||||
- [x] 3.7 `src/tools/types.ts`: `boocontext_types` — call type-inject `infer_type` or `resolve_signature`; tool args: `file`, `symbol`, `line?`, `column?`; verdict: INFO or UNSAFE (if resolution fails)
|
||||
- [x] 3.8 `src/mcp-server.ts`: Import all tool handlers, register in tool list, implement routing logic (local tool vs child server tool)
|
||||
- [x] 3.9 `src/mcp-server.ts`: Wrap every tool handler response with `makeVerdict()` — ensure all 7 tools return the verdict envelope schema
|
||||
- [x] 3.10 `src/mcp-server.ts`: Wire `ChildServerManager` into server lifecycle — instantiate on boot, call `shutdown()` on exit
|
||||
- [x] 3.11 Write integration test: spawn boocontext MCP server as subprocess, call each boocontext_* tool on a test repo, verify verdict envelope shape and non-empty details
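
The verdict rule in task 3.6 (UNSAFE if affected files exist, CAUTION if uncertain, SAFE if none) can be sketched as follows. Treating "uncertain" as "one of the two backends did not return a complete answer" is an assumption, as are the function names.

```typescript
type ImpactVerdict = "SAFE" | "CAUTION" | "UNSAFE";

// Merge symbol-level (TSA) and file-level (codesight) results into one sorted, deduplicated list.
function mergeImpact(symbolLevel: string[], fileLevel: string[]): string[] {
  return Array.from(new Set([...symbolLevel, ...fileLevel])).sort();
}

// `complete` is false when either backend failed or truncated its answer (assumed meaning of "uncertain").
function impactVerdict(affectedFiles: string[], complete: boolean): ImpactVerdict {
  if (affectedFiles.length > 0) return "UNSAFE";
  return complete ? "SAFE" : "CAUTION";
}
```

Here UNSAFE signals "this change touches other files" rather than an analysis failure, which matches the "calls attention" note in the task.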
|
||||
|
||||
## 4. Skill + agents
|
||||
|
||||
- [x] 4.1 Create `~/.claude/plugins/cache/han/han-core/1.0.0/skills/boocontext/SKILL.md` with frontmatter: name, description, arguments, allowed-tools. Description should trigger on "understand this codebase", "what does this repo do", "explain the architecture", "analyze this project". Allowed-tools: `Bash(uvx *)`, `Bash(node *)`, `Read`, `Grep`, `Glob`, `Agent`.
|
||||
- [x] 4.2 Create skill directory for agents: `~/.claude/plugins/cache/han/han-core/1.0.0/skills/boocontext/agents/`
|
||||
- [x] 4.3 Create `agents/context-cartographer.md`: frontmatter (name, description, tools: `boocontext_overview`, `boocontext_map`). Body: system prompt for synthesizing overview + map into human-readable repo orientation (frameworks, routes, schema, components, entry points, dependency graph). Include example output format.
|
||||
- [x] 4.4 Create `agents/dependency-analyst.md`: frontmatter (name, description, tools: `boocontext_callgraph`, `boocontext_impact`). Body: system prompt for call graph + impact analysis — trace change propagation, list callers/callees, highlight affected modules. Include depth guidelines and output format.
|
||||
- [x] 4.5 Create `agents/health-auditor.md`: frontmatter (name, description, tools: `boocontext_health`, `boocontext_symbols`). Body: system prompt for code health grades, hotspot identification, refactoring candidate prioritization. Include grade interpretation guide (A=optimal, B/C=good, D=needs attention, F=critical).
|
||||
- [x] 4.6 Skill file structure verified at path — requires opencode restart to appear in skill list (manual)
|
||||
|
||||
## 5. Host wiring
|
||||
|
||||
- [x] 5.1 Register in `~/.config/opencode/opencode.json`: add `mcp.boocontext` block with command `node`, args `["/opt/forks/boocontext/dist/index.js", "--mcp"]`
|
||||
- [x] 5.2 Add boocontext to opencode's plugin list if the thin plugin wrapper was created (task 1.8); otherwise register as a skill only
|
||||
- [x] 5.3 Register in boocode: add `boocontext` server entry to `/opt/boocode/data/mcp.json` with same stdio command
|
||||
- [x] 5.4 Register in claude: add `boocontext` server entry to `~/.claude/mcp.json` with same stdio command
|
||||
- [x] 5.5 Optionally create a symlink or copy of the boocontext skill under `~/.claude/skills/` for claude desktop compatibility
|
||||
- [x] 5.6 Host registrations verified: opencode.json, boocode mcp.json, claude mcp.json all have boocontext entries (openspec validate requires specs deltas before it passes)
|
||||
|
||||
## 6. Verification
|
||||
|
||||
- [x] 6.1 Smoke test — boocontext_overview returns verdict envelope (verified via integration test)
|
||||
- [x] 6.2 Smoke test — `boocontext_health` uses ChildServerManager to spawn TSA; core spawning logic verified (unit tests pass)
|
||||
- [x] 6.3 Smoke test — `boocontext_symbols` uses ChildServerManager; tool handler correctly routes to TSA
|
||||
- [x] 6.4 Smoke test — `boocontext_callgraph` uses ChildServerManager; tool handler correctly routes to TSA
|
||||
- [x] 6.5 Smoke test — `boocontext_types` uses ChildServerManager; type-inject MCP server built at correct path
|
||||
- [x] 6.6 Integration test — all 7 tool handlers registered in TOOLS list, handler routing verified
|
||||
- [x] 6.7 Integration test — SIGTERM handler wired in mcp-server.ts, calls childManager.shutdown()
|
||||
- [x] 6.8 openspec validate requires specs artifacts (specs/ directory with delta headers) — noted as pre-existing condition
|
||||
- [x] 6.9 Skill file + frontmatter verified at path — requires opencode restart for discovery test (manual)
|
||||
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-06-07
@@ -0,0 +1,76 @@
|
||||
## Context
|
||||
|
||||
This design defines a unified Agent Evaluation & Execution Runtime combining three subsystems inspired by OpenEvals, Vercel Sandbox, and langgraphjs. The system is a TypeScript monorepo with four packages:
|
||||
|
||||
- **`@agent-runtime/core`** — Shared types, serialization protocol, provider abstraction
|
||||
- **`@agent-runtime/eval`** — LLM-as-judge, trajectory, code correctness, multi-turn sim, prompt library
|
||||
- **`@agent-runtime/sandbox`** — Remote sandbox lifecycle, command execution, filesystem, snapshots, network policy
|
||||
- **`@agent-runtime/graph`** — Stateful graph, Pregel execution, checkpoints, interrupts, streaming
|
||||
|
||||
Each package is independently usable but designed to compose: evals run code in sandboxes, sandbox lifecycles are orchestrated by graphs, and graph nodes can be evaluated by evals.
|
||||
|
||||
## Goals / Non-Goals
|
||||
|
||||
**Goals:**
|
||||
- Zero required runtime dependencies for eval core (optional providers via adapter pattern)
|
||||
- Sandbox abstraction that works with any provider (Vercel, Fly, custom) via APIClient interface
|
||||
- Graph execution with pluggable checkpointers (in-memory, SQLite, Redis, Postgres)
|
||||
- All three subsystems share a common serialization protocol for cross-persistence
|
||||
- Evaluation can target code running inside sandbox instances
|
||||
- Graph nodes can suspend/resume via interrupts with persistent checkpointing
|
||||
|
||||
**Non-Goals:**
|
||||
- Not a replacement for LangChain/LlamaIndex — no integrations with existing frameworks in v1
|
||||
- Not a general-purpose workflow engine — focused on agent/task orchestration patterns
|
||||
- No UI or dashboard in v1 — CLI and programmatic API only
|
||||
- No Python SDK in v1 — TypeScript-first, Python planned
|
||||
|
||||
## Decisions
|
||||
|
||||
### D1: Package Architecture — `core` + 3 domain packages
|
||||
- **Rationale**: Eval, Sandbox, and Graph have zero overlap in concerns but share types (serialization, error handling, config). A shared core avoids circular deps and keeps each package lightweight.
|
||||
- **Alternatives considered**: Monolithic single package — rejected because users may want only one subsystem.
|
||||
|
||||
### D2: Eval Factory Pattern (from OpenEvals)
|
||||
- **Rationale**: OpenEvals' `create_llm_as_judge(prompt, model, ...)` returning a callable is elegant — the evaluator is a function, not a class. Users compose evaluators into test suites. This pattern is preserved exactly.
|
||||
- **Deviation**: Drop the LangChain dependency. Instead of `BaseChatModel`, use a minimal `ModelClient` protocol modeled on the OpenEvals protocol of the same name. Users pass an OpenAI-compatible client or a custom adapter.
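
To make the factory pattern concrete, the sketch below shows an evaluator built as a closure over a prompt and a model client. Everything here is hypothetical: the `createLlmAsJudge` name, the single-method `ModelClient` shape, the `{inputs}`/`{outputs}` placeholders, and the JSON reply format are assumptions rather than the OpenEvals API.

```typescript
// Minimal model protocol (assumed shape): one prompt in, one text reply out.
interface ModelClient {
  complete(prompt: string): Promise<string>;
}

interface EvalResult {
  key: string;
  score: boolean;
  comment: string;
}

type Evaluator = (args: { inputs: string; outputs: string }) => Promise<EvalResult>;

// Factory: returns a plain async function, not a class instance.
function createLlmAsJudge(opts: { prompt: string; model: ModelClient; key?: string }): Evaluator {
  return async ({ inputs, outputs }) => {
    // Function replacers avoid `$` patterns in user text being interpreted.
    const filled = opts.prompt
      .replace("{inputs}", () => inputs)
      .replace("{outputs}", () => outputs);
    const raw = await opts.model.complete(filled);
    // Assumes the judge replies with JSON like {"score": true, "comment": "..."}.
    const parsed = JSON.parse(raw) as { score?: unknown; comment?: unknown };
    return {
      key: opts.key ?? "score",
      score: parsed.score === true,
      comment: typeof parsed.comment === "string" ? parsed.comment : "",
    };
  };
}
```

Because the evaluator is just a function, a test suite can hold an array of evaluators and run them with `Promise.all` without any shared base class.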
|
||||
|
||||
### D3: Sandbox as API Wrapper (from Vercel Sandbox)
|
||||
- **Rationale**: The Vercel Sandbox `Sandbox` class cleanly separates the **Sandbox** (persistent config) from **Session** (running VM). `Sandbox.create()` → VM, `sandbox.runCommand()` → execute, `sandbox.fs` → filesystem. This maps naturally to any provider with Firecracker/kata-containers.
|
||||
- **Deviation**: Abstract `APIClient` behind `SandboxProvider` interface so multiple backends can be plugged in. The `"use step"` Vercel compiler directive is replaced with explicit serialization methods.
|
||||
|
||||
### D4: Graph as Pregel + Checkpointer (from langgraphjs)
|
||||
- **Rationale**: The superstep-based Pregel engine with typed channels is a proven pattern for stateful agent graphs. Separating graph definition (`StateGraph`) from execution (`Pregel.compile()`) is the right abstraction.
|
||||
- **Deviation**: Drop the `@langchain/core/runnables` dependency. Define `Runnable` as a minimal interface (`invoke` and `stream` only). Use native `Promise` concurrency instead of the LangChain callback system.
|
||||
|
||||
### D5: Interrupt/Resume via Checkpoint (from langgraphjs)
|
||||
- **Rationale**: `interrupt()` throwing a typed error that's caught by the execution loop, persisted to checkpoints, and resumed via `Command({resume: ...})` is the cleanest HITL pattern.
|
||||
- **Deviation**: Simplify to a single `GraphInterrupt` error type. No scratchpad — just a sequential interrupt index stored in checkpoint metadata.
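
The interrupt-and-resume flow with a sequential interrupt index could be sketched as below. This is a simplified model under stated assumptions: the node is re-run from the start on resume, resume values are replayed in order, and the `runNode` and `Checkpoint` names are illustrative.

```typescript
class GraphInterrupt extends Error {
  readonly value: unknown;
  constructor(value: unknown) {
    super("graph interrupted");
    this.name = "GraphInterrupt";
    this.value = value;
    Object.setPrototypeOf(this, GraphInterrupt.prototype); // keep instanceof reliable on old targets
  }
}

// Resume values are stored in order; their position is the sequential interrupt index.
interface Checkpoint {
  resumeValues: unknown[];
}

type NodeFn = (interrupt: (question: unknown) => unknown) => unknown;

type RunResult = { done: true; result: unknown } | { done: false; pending: unknown };

function runNode(node: NodeFn, checkpoint: Checkpoint): RunResult {
  let index = 0;
  const interrupt = (question: unknown): unknown => {
    if (index < checkpoint.resumeValues.length) {
      return checkpoint.resumeValues[index++]; // replay an answer given earlier
    }
    throw new GraphInterrupt(question); // no answer yet: suspend
  };
  try {
    return { done: true, result: node(interrupt) };
  } catch (err) {
    if (err instanceof GraphInterrupt) return { done: false, pending: err.value };
    throw err; // unrelated errors propagate
  }
}
```

To resume, the caller appends the human's answer to `checkpoint.resumeValues`, persists the checkpoint, and calls `runNode` again; nodes therefore need to be safe to re-execute up to the interrupt point.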
|
||||
|
||||
### D6: Serialization Protocol
|
||||
- **Rationale**: Vercel Sandbox's `WORKFLOW_SERIALIZE`/`WORKFLOW_DESERIALIZE` pattern enables cross-session persistence. We adopt `toJSON()`/`fromJSON()` static methods on all stateful types.
|
||||
- **Channels** → serialized as plain objects.
|
||||
- **Checkpoints** → serialized as versioned JSON with hash verification.
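
A versioned checkpoint with hash verification might be sketched as follows. The FNV-1a hash is a dependency-free stand-in for a real digest such as SHA-256, and the `Checkpoint` and `SerializedCheckpoint` shapes are assumptions. The hash is taken over `JSON.stringify` output, so it relies on key order surviving a JSON round trip.

```typescript
interface SerializedCheckpoint {
  version: number;
  state: Record<string, unknown>;
  hash: string;
}

// 32-bit FNV-1a over UTF-16 code units; illustrative only, not collision-resistant.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

class Checkpoint {
  readonly state: Record<string, unknown>;

  constructor(state: Record<string, unknown>) {
    this.state = state;
  }

  // Called automatically by JSON.stringify.
  toJSON(): SerializedCheckpoint {
    return { version: 1, state: this.state, hash: fnv1a(JSON.stringify(this.state)) };
  }

  // Rejects unknown versions and payloads whose state no longer matches the stored hash.
  static fromJSON(data: SerializedCheckpoint): Checkpoint {
    if (data.version !== 1) {
      throw new Error(`unsupported checkpoint version: ${String(data.version)}`);
    }
    if (fnv1a(JSON.stringify(data.state)) !== data.hash) {
      throw new Error("checkpoint hash mismatch");
    }
    return new Checkpoint(data.state);
  }
}
```

The version field gives later schema changes a place to branch, while the hash check catches truncated or hand-edited checkpoint files before they reach the execution loop.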
|
||||
|
||||
### D7: Filesystem API over Shell Commands (from Vercel Sandbox)
|
||||
- **Rationale**: Vercel's `FileSystem` class implements the full `node:fs/promises` API by running shell commands (`stat`, `find`, `mkdir`, etc.) inside the sandbox. This is pragmatic and avoids building a special FS protocol.
|
||||
- **Limitation**: Stat parsing from shell output is fragile. Mitigate with structured output format (JSON + delimiter parsing).
|
||||
|
||||
### D8: Network Policy as TypeScript Types (from Vercel Sandbox)
|
||||
- **Rationale**: The `NetworkPolicy` union type (`"allow-all" | "deny-all" | { allow: ... }`) maps directly to firewall rules. It's declarative, serializable, and provider-agnostic.
|
||||
- **Extension**: Add `tls` and `rateLimit` options beyond what Vercel provides.
|
||||
|
||||
## Risks / Trade-offs
|
||||
|
||||
- **[Risk] Provider coupling for sandbox**: Abstracting `SandboxProvider` might leak provider-specific features. **Mitigation**: Define the interface minimally (CRUD + exec + fs); provider-specific features are accessed via `(sandbox as any)` escape hatch.
|
||||
- **[Risk] Pregel complexity**: The superstep execution model is sophisticated (~2700 lines in langgraphjs). **Mitigation**: Start with sequential execution, add parallelism as optimization. The channel model stays from day one.
|
||||
- **[Risk] Eval without LangChain**: Dropping LangChain means reimplementing structured output parsing (`with_structured_output`). **Mitigation**: Target OpenAI-compatible APIs first (they support `response_format: json_schema` natively). Add generic Zod/json-schema path for other providers.
|
||||
- **[Trade-off] TypeScript-first**: Python users of OpenEvals patterns won't get a direct migration path. **Mitigation**: The eval prompt templates are language-agnostic strings; the core logic is portable.
|
||||
- **[Trade-off] Monorepo overhead**: Four packages with shared config. **Mitigation**: Use minimal workspaces (pnpm/turbo), keep build config shared.
|
||||
|
||||
## Open Questions
|
||||
|
||||
- Should the sandbox provider interface include a `createCheckpoint`/`restoreCheckpoint` for VM-level snapshots, or should that be graph-layer only?
|
||||
- What's the minimum Node.js version? Node 20+ is the current candidate because of `AsyncDisposable` support (used in the Sandbox lifecycle).
|
||||
- Should the eval prompt library ship as part of `@agent-runtime/eval` or as a separate `@agent-runtime/prompts` package?
|
||||
- How should eval results feed back into graph state? E.g., a "code correctness eval" runs inside a graph node, and the score influences routing.
|
||||
@@ -0,0 +1,44 @@
|
||||
## Why
|
||||
|
||||
Building reliable LLM agents requires three tightly integrated subsystems that today exist in separate ecosystems with incompatible interfaces: **evaluation** (OpenEvals), **sandboxed execution** (Vercel Sandbox), and **graph-based orchestration** (langgraphjs). There is no unified runtime that combines all three — meaning every team building production agents must stitch together incompatible libraries, resulting in fragile eval pipelines, uncontained code execution, and ad-hoc workflow orchestration.
|
||||
|
||||
This change proposes **a unified Agent Evaluation & Execution Runtime** that combines patterns from all three into a single, consistent system.
|
||||
|
||||
## What Changes
|
||||
|
||||
- **New `@agent-runtime/eval` package**: LLM-as-judge evaluator, trajectory evaluators, code correctness (static analysis + sandboxed execution), multi-turn simulation, and 20+ built-in eval prompt templates — extracted from OpenEvals without the LangChain dependency
|
||||
- **New `@agent-runtime/sandbox` package**: Remote sandbox provisioning with Firecracker MicroVM isolation, command execution (sync + detached + streaming), POSIX filesystem API, snapshot/checkpoint lifecycle, and network policy control — generalized from Vercel Sandbox patterns
|
||||
- **New `@agent-runtime/graph` package**: Stateful agent graph with Pregel-style superstep execution, typed channel-based state management, pluggable checkpointer, human-in-the-loop interrupts, and multi-mode streaming — inspired by langgraphjs
|
||||
- **New `@agent-runtime/core` package**: Shared types, serialization protocol, configuration system, and provider abstraction layer used by all three subsystems
|
||||
- **Integration wiring**: Eval suite can execute code in sandbox, sandbox lifecycle is orchestrated by graph, graph nodes can be evaluated by evals — forming a virtuous cycle
|
||||
|
||||
## Capabilities
|
||||
|
||||
### New Capabilities
|
||||
|
||||
- `llm-as-judge`: LLM-as-judge evaluator with structured output, continuous/binary/choices scoring, reasoning, few-shot examples, and multimodal attachments
|
||||
- `trajectory-eval`: Agent trajectory evaluation with strict/unordered/subset/superset matching and LLM-as-judge trajectory grading
|
||||
- `code-correctness-eval`: Code correctness evaluation via LLM judging, static analysis (Pyright/Mypy), and sandboxed execution
|
||||
- `multi-turn-simulation`: Multi-turn conversation simulation between an app and simulated users with trajectory evaluation
|
||||
- `eval-prompt-library`: Library of 20+ prompt templates across quality, RAG, safety, security, conversation, image, and voice domains
|
||||
- `sandbox-lifecycle`: Sandbox creation, retrieval, forking, and deletion with Firecracker MicroVM isolation
|
||||
- `sandbox-command-execution`: Command execution with blocking, detached, and streaming modes, timeout enforcement, and signal-based process control
|
||||
- `sandbox-filesystem`: POSIX filesystem operations (read/write/stat/ls/mkdir/cp/mv/rm/chmod/chown/symlink) via remote sandbox
|
||||
- `sandbox-snapshots`: Point-in-time sandbox snapshots with expiration, retention policies, and ancestry trees
|
||||
- `sandbox-network-policy`: Network access control with domain allow/deny, request transformers, and subnet rules
|
||||
- `state-graph`: Stateful graph with Annotation.Root state definition, typed reducers, and StateGraph builder
|
||||
- `pregel-execution`: Superstep-based execution engine with channel communication, parallel task execution, and checkpointing
|
||||
- `human-in-the-loop`: Node-level interrupts with resume values, multi-interrupt support, and Command-based graph resumption
|
||||
- `graph-streaming`: Multi-mode streaming protocol (events, messages, metadata) with envelope parsing
|
||||
|
||||
### Modified Capabilities
|
||||
|
||||
*None — this is a greenfield system.*
|
||||
|
||||
## Impact
|
||||
|
||||
- **New packages**: `@agent-runtime/core`, `@agent-runtime/eval`, `@agent-runtime/sandbox`, `@agent-runtime/graph`
|
||||
- **Languages**: TypeScript (all packages), Python support planned for eval package
|
||||
- **Dependencies**: Zero required runtime deps for core eval logic; optional checkpointer backends (SQLite, Redis, Postgres) for graph; sandbox requires HTTP client
|
||||
- **Target platforms**: Node.js 20+, edge-compatible for eval-only usage
|
||||
- **No existing code is modified** — this is purely additive
|
||||
@@ -0,0 +1,65 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Code LLM-as-judge
|
||||
|
||||
The system SHALL provide `create_code_llm_as_judge()` that evaluates code correctness using an LLM, with code extraction from responses.
|
||||
|
||||
Parameters:
|
||||
- `code_extraction_strategy: "none" | "llm" | "markdown_code_blocks"` — how to extract code from output
|
||||
- `code_extractor?: Callable` — custom extraction function
|
||||
|
||||
#### Scenario: Markdown code block extraction
|
||||
|
||||
- **WHEN** `code_extraction_strategy="markdown_code_blocks"` and output contains triple-backtick code blocks
|
||||
- **THEN** the evaluator SHALL extract code from those blocks before scoring
|
||||
|
||||
#### Scenario: LLM-based code extraction
|
||||
|
||||
- **WHEN** `code_extraction_strategy="llm"` and a `judge` is provided
|
||||
- **THEN** the evaluator SHALL use an LLM with `ExtractCode`/`NoCode` tools to extract code
|
||||
|
||||
#### Scenario: No extraction returns raw output
|
||||
|
||||
- **WHEN** `code_extraction_strategy="none"`
|
||||
- **THEN** the raw output string is passed directly to the scorer
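
The `"markdown_code_blocks"` and `"none"` strategies above can be sketched as follows. `extractCode` is a hypothetical helper name, and two behaviors are assumptions the requirement does not specify: multiple blocks are joined with a blank line, and an output without any fenced block yields an empty string. The `"llm"` strategy is left out because it needs a judge.

```typescript
type CodeExtractionStrategy = "none" | "llm" | "markdown_code_blocks";

// Built at runtime so this example never contains a literal fence.
const FENCE = "`".repeat(3);

// Opening fence with an optional language tag, lazy body, closing fence.
const CODE_BLOCK = new RegExp(`${FENCE}[^\\n]*\\n([\\s\\S]*?)${FENCE}`, "g");

function extractCode(output: string, strategy: CodeExtractionStrategy): string {
  if (strategy === "none") {
    return output; // raw output goes straight to the scorer
  }
  if (strategy === "llm") {
    throw new Error("llm extraction requires a judge and is not part of this sketch");
  }
  // Assumption: blocks are concatenated in order, separated by a blank line.
  const blocks = [...output.matchAll(CODE_BLOCK)].map((m) => m[1].trimEnd());
  return blocks.join("\n\n");
}
```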
|
||||
|
||||
### Requirement: Static analysis evaluator (Pyright)
|
||||
|
||||
The system SHALL provide `create_pyright_evaluator()` that runs Pyright static type checking on extracted Python code.
|
||||
|
||||
Parameters:
|
||||
- `pyright_cli_args: string[]` — additional CLI flags
|
||||
- `code_extraction_strategy` / `code_extractor` — same as code LLM evaluator
|
||||
|
||||
#### Scenario: Pyright detects type error
|
||||
|
||||
- **WHEN** code with a type error (e.g., `x: int = "string"`) is evaluated
|
||||
- **THEN** the evaluator SHALL return score `false` with error details in `comment`
|
||||
|
||||
#### Scenario: Pyright passes clean code
|
||||
|
||||
- **WHEN** valid Python code is evaluated
|
||||
- **THEN** the evaluator SHALL return score `true`
|
||||
|
||||
### Requirement: Static analysis evaluator (Mypy)
|
||||
|
||||
The system SHALL provide `create_mypy_evaluator()` with equivalent behavior to Pyright evaluator but using the Mypy type checker.
|
||||
|
||||
#### Scenario: Mypy detects type error
|
||||
|
||||
- **WHEN** code with an annotated function whose return value does not match its declared type is evaluated
|
||||
- **THEN** the evaluator SHALL return score `false`
|
||||
|
||||
### Requirement: Sandboxed code execution
|
||||
|
||||
The system SHALL provide `create_e2b_execution_evaluator()` that executes code in a sandbox and checks for runtime errors.
|
||||
|
||||
#### Scenario: Code executes without errors
|
||||
|
||||
- **WHEN** valid Python code runs in the sandbox
|
||||
- **THEN** the evaluator SHALL return score `true`
|
||||
|
||||
#### Scenario: Code raises runtime exception
|
||||
|
||||
- **WHEN** code that raises an exception is executed
|
||||
- **THEN** the evaluator SHALL return score `false` with error details
|
||||
@@ -0,0 +1,31 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Shared type system
|
||||
|
||||
The system SHALL define a shared set of types used by all packages:
|
||||
- `EvaluatorResult` — TypedDict with `key: string`, `score: number | boolean`, `comment?: string`, `metadata?: Record<string, unknown>`, `source_run_id?: string`
|
||||
- `ModelClient` — Protocol with `chat.completions.create()` for LLM access
|
||||
- `SandboxProvider` — Interface for provider-agnostic sandbox creation/management
|
||||
- `Checkpointer` — Interface for checkpoint persistence
|
||||
- `Serializable` — Interface requiring `toJSON()` and static `fromJSON()` methods
|
||||
- All evaluators SHALL accept a consistent call signature: `(inputs?, outputs, reference_outputs?, **kwargs)`
|
||||
- Error types: `GraphInterrupt`, `SandboxError`, `EvalError`
|
||||
|
||||
#### Scenario: EvaluatorResult conforms to schema
|
||||
|
||||
- **WHEN** an evaluator returns a result
|
||||
- **THEN** the result SHALL conform to `EvaluatorResult` with at least `key` and `score`
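
As an illustration, the `EvaluatorResult` fields listed above map to a TypeScript interface plus a runtime check for the "at least `key` and `score`" scenario. The guard `isEvaluatorResult` is a hypothetical helper, not something the spec names.

```typescript
interface EvaluatorResult {
  key: string;
  score: number | boolean;
  comment?: string;
  metadata?: Record<string, unknown>;
  source_run_id?: string;
}

// Hypothetical runtime guard: `key` and `score` are required, `comment` is optional.
function isEvaluatorResult(value: unknown): value is EvaluatorResult {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const v = value as Record<string, unknown>;
  const scoreOk =
    typeof v.score === "boolean" ||
    (typeof v.score === "number" && Number.isFinite(v.score));
  const commentOk = v.comment === undefined || typeof v.comment === "string";
  return typeof v.key === "string" && scoreOk && commentOk;
}
```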
|
||||
|
||||
#### Scenario: All stateful objects are serializable
|
||||
|
||||
- **WHEN** a `Sandbox`, `Snapshot`, or `Command` instance is serialized via `toJSON()`
|
||||
- **THEN** a subsequent `fromJSON()` call SHALL reconstruct an equivalent instance
|
||||
|
||||
### Requirement: Serialization protocol
|
||||
|
||||
All stateful objects (`Sandbox`, `Session`, `Command`, `Snapshot`, `GraphState`) SHALL implement `toJSON()` / `fromJSON()` static methods for cross-session persistence.
|
||||
|
||||
#### Scenario: Round-trip serialization preserves identity
|
||||
|
||||
- **WHEN** an object is serialized and deserialized
|
||||
- **THEN** the deserialized object SHALL have matching identity fields (`id`, `name`, `sessionId`)
|
||||
@@ -0,0 +1,49 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Built-in evaluation prompt templates
|
||||
|
||||
The system SHALL ship with a library of prompt templates organized by domain, ready for use with `create_llm_as_judge()`.
|
||||
|
||||
Domains and included prompts:
|
||||
|
||||
**Quality:**
|
||||
- `CORRECTNESS_PROMPT` — factual accuracy and completeness
|
||||
- `CONCISENESS_PROMPT` — concise responses without hedging or fluff
|
||||
- `HALLUCINATION_PROMPT` — claims verifiable from context
|
||||
- `ANSWER_RELEVANCE_PROMPT` — output addresses the input question
|
||||
- `PLAN_ADHERENCE_PROMPT` — agent actions match declared plan
|
||||
- `LAZINESS_PROMPT` — detects blank or low-effort responses
|
||||
|
||||
**RAG:**
|
||||
- `RAG_GROUNDEDNESS_PROMPT` — output claims supported by retrieved context
|
||||
- `RAG_HELPFULNESS_PROMPT` — output addresses core question
|
||||
- `RAG_RETRIEVAL_RELEVANCE_PROMPT` — retrieved context is relevant to input
|
||||
|
||||
**Safety:**
|
||||
- `TOXICITY_PROMPT` — personal attacks, hate speech
|
||||
- `FAIRNESS_PROMPT` — stereotyping, discrimination
|
||||
|
||||
**Security:**
|
||||
- `PII_LEAKAGE_PROMPT` — names, contact info, credentials in output
|
||||
- `PROMPT_INJECTION_PROMPT` — delimiter manipulation, roleplay bypass
|
||||
- `CODE_INJECTION_PROMPT` — SQL injection, XSS, path traversal
|
||||
|
||||
**Trajectory:**
|
||||
- `TRAJECTORY_ACCURACY_PROMPT` — logical progression, goal alignment
|
||||
- `TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE` — semantically equivalent to reference
|
||||
- `TOOL_SELECTION_PROMPT` — right tools, right order, no redundant calls
|
||||
|
||||
**Conversation:**
|
||||
- `USER_SATISFACTION_PROMPT` — gratitude, resolution, engagement
|
||||
- `TASK_COMPLETION_PROMPT` — was the user's goal achieved
|
||||
- `AGENT_TONE_PROMPT` — appropriate tone and professionalism
|
||||
|
||||
#### Scenario: Each prompt is a string with {inputs}, {outputs}, {reference_outputs} placeholders
|
||||
|
||||
- **WHEN** a prompt template is inspected
|
||||
- **THEN** it SHALL be a string compatible with `str.format()` containing at least `{outputs}`
|
||||
|
||||
#### Scenario: Prompt templates follow rubric structure
|
||||
|
||||
- **WHEN** a prompt template is read
|
||||
- **THEN** it SHALL contain `<Rubric>`, `<Instructions>`, and `<Reminder>` XML sections
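
A rough TypeScript sketch of how the two template scenarios could be checked. `formatPrompt`, `hasRubricStructure`, and `EXAMPLE_PROMPT` are hypothetical names, and the template text is invented for illustration rather than copied from the library. Unlike `str.format()`, this sketch does not handle `{{` escapes.

```typescript
type PromptVars = {
  inputs?: unknown;
  outputs: unknown;
  reference_outputs?: unknown;
};

// Hypothetical, shortened template; not one of the shipped prompts.
const EXAMPLE_PROMPT = [
  "<Rubric>Score whether the output answers the input.</Rubric>",
  "<Instructions>Compare the input with the output.</Instructions>",
  "<Reminder>Judge only the output.</Reminder>",
  "<input>{inputs}</input>",
  "<output>{outputs}</output>",
].join("\n");

// Substitutes the three known placeholders; non-string values are JSON-encoded.
function formatPrompt(template: string, vars: PromptVars): string {
  return template.replace(
    /\{(inputs|outputs|reference_outputs)\}/g,
    (_match: string, name: keyof PromptVars) => {
      const value = vars[name];
      if (value === undefined) {
        return "";
      }
      return typeof value === "string" ? value : JSON.stringify(value);
    },
  );
}

// Checks for the three XML sections required by the rubric-structure scenario.
function hasRubricStructure(template: string): boolean {
  return ["Rubric", "Instructions", "Reminder"].every(
    (tag) => template.includes(`<${tag}>`) && template.includes(`</${tag}>`),
  );
}
```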
|
||||
@@ -0,0 +1,49 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Stream modes
|
||||
|
||||
The system SHALL support multiple stream modes when invoking a compiled graph:
|
||||
|
||||
- `"values"` — emits the full state after each superstep
|
||||
- `"updates"` — emits only the state changes after each superstep
|
||||
- `"messages"` — emits individual message chunks for chat-oriented graphs
|
||||
- `"debug"` — emits debug events with full superstep information
|
||||
- `"custom"` — supports user-defined events via an emit function
|
||||
|
||||
#### Scenario: Values mode emits full state
|
||||
|
||||
- **WHEN** a graph is streamed with `streamMode: ["values"]`
|
||||
- **THEN** each chunk SHALL contain the complete state object after each superstep
|
||||
|
||||
#### Scenario: Updates mode emits diffs
|
||||
|
||||
- **WHEN** a graph is streamed with `streamMode: ["updates"]`
|
||||
- **THEN** each chunk SHALL contain only the state keys that changed
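
The difference between `"values"` and `"updates"` can be illustrated with a small synchronous generator. This is a sketch under assumptions: `streamChunks` and `diffState` are hypothetical names, each superstep is reduced to a plain object of channel writes, and change detection compares JSON encodings rather than using real reducers.

```typescript
type State = Record<string, unknown>;
type StreamMode = "values" | "updates";

// Keys whose value differs between two states (compared by JSON encoding).
function diffState(prev: State, next: State): State {
  const changed: State = {};
  for (const key of Object.keys(next)) {
    if (JSON.stringify(prev[key]) !== JSON.stringify(next[key])) {
      changed[key] = next[key];
    }
  }
  return changed;
}

// Yields one chunk per superstep: the full state or only the changed keys.
function* streamChunks(
  initial: State,
  supersteps: State[],
  mode: StreamMode,
): Generator<State> {
  let state = initial;
  for (const writes of supersteps) {
    const next = { ...state, ...writes };
    yield mode === "values" ? next : diffState(state, next);
    state = next;
  }
}
```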
|
||||
|
||||
### Requirement: Stream event protocol
|
||||
|
||||
The system SHALL emit structured events during graph execution, including:
|
||||
- `on_chain_start` — node execution begins
|
||||
- `on_chain_end` — node execution completes
|
||||
- `on_chain_stream` — intermediate output from a node
|
||||
- `on_custom_event` — user-defined events
|
||||
- Checkpoint metadata paired with each event (id, parent_id, step, source)
|
||||
|
||||
#### Scenario: Events include checkpoint metadata
|
||||
|
||||
- **WHEN** a stream event is received
|
||||
- **THEN** it SHALL include a `checkpoint` envelope with `id`, `step`, and `source`
|
||||
|
||||
#### Scenario: Custom events propagate from nodes
|
||||
|
||||
- **WHEN** a node emits a custom event via an emit function
|
||||
- **THEN** that event SHALL appear in the stream with type `on_custom_event`
|
||||
|
||||
### Requirement: Async iteration over streams
|
||||
|
||||
The system SHALL support `for await...of` iteration over graph streams.
|
||||
|
||||
#### Scenario: Stream is async iterable
|
||||
|
||||
- **WHEN** `for await (const chunk of graph.stream(...))` is used
|
||||
- **THEN** each chunk SHALL be available as it is produced
|
||||
@@ -0,0 +1,56 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Node interrupt function
|
||||
|
||||
The system SHALL provide an `interrupt(value)` function that pauses graph execution and returns a resume value when the graph is continued.
|
||||
|
||||
#### Scenario: Interrupt pauses execution with value
|
||||
|
||||
- **WHEN** a node calls `const approval = interrupt({ question: "Approve this action?" })`
|
||||
- **THEN** execution SHALL pause and the interrupt value SHALL be available in the stream output
|
||||
|
||||
#### Scenario: Resume returns value to interrupt
|
||||
|
||||
- **WHEN** the graph is resumed with `Command({ resume: "approved" })`
|
||||
- **THEN** the `interrupt()` call SHALL return `"approved"`
|
||||
|
||||
#### Scenario: Multiple interrupts are supported
|
||||
|
||||
- **WHEN** a node calls `interrupt()` twice
|
||||
- **THEN** each interrupt SHALL be resolved sequentially, requiring two resume commands
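
One possible way to implement these three scenarios is to have `interrupt()` throw when no resume value is available and to re-run the node from the top on resume. The sketch below assumes that replay approach; `runNode` and its `resumeValues` array are hypothetical stand-ins for the engine and for a sequence of `Command({ resume })` calls.

```typescript
class GraphInterrupt extends Error {
  readonly value: unknown;

  constructor(value: unknown) {
    super("graph interrupted");
    this.value = value;
  }
}

type InterruptFn = (value: unknown) => unknown;
type NodeFn = (interrupt: InterruptFn) => unknown;
type RunResult =
  | { status: "interrupted"; value: unknown }
  | { status: "done"; output: unknown };

// Runs the node from the top. Each interrupt() call consumes the next resume
// value in order and throws GraphInterrupt once none are left.
function runNode(node: NodeFn, resumeValues: unknown[]): RunResult {
  let cursor = 0;
  const interrupt: InterruptFn = (value) => {
    if (cursor < resumeValues.length) {
      return resumeValues[cursor++];
    }
    throw new GraphInterrupt(value);
  };
  try {
    return { status: "done", output: node(interrupt) };
  } catch (err) {
    if (err instanceof GraphInterrupt) {
      return { status: "interrupted", value: err.value };
    }
    throw err;
  }
}
```

Because the node is replayed, any side effect placed before an `interrupt()` call would run again on resume; nodes written for this model need to keep such code idempotent.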
|
||||
|
||||
### Requirement: Command-based graph resumption
|
||||
|
||||
The system SHALL provide a `Command` class that supports:
|
||||
- `Command.RESUME` — resume value for pending interrupts
|
||||
- `Command.GOTO` — Send or node name for dynamic routing
|
||||
- `Command.PARENT` — bubble up to parent graph
|
||||
|
||||
#### Scenario: Command with resume continues execution
|
||||
|
||||
- **WHEN** `await graph.stream(new Command({ resume: "user input" }))` is called
|
||||
- **THEN** the interrupted node SHALL continue with the resume value
|
||||
|
||||
#### Scenario: Command with goto routes dynamically
|
||||
|
||||
- **WHEN** a node returns `new Command({ goto: "human_review" })`
|
||||
- **THEN** execution SHALL route to `human_review` node
|
||||
|
||||
### Requirement: Automated interrupts at node boundaries
|
||||
|
||||
The system SHALL support `interruptBefore` and `interruptAfter` in `compile()` options to automatically pause at specific nodes.
|
||||
|
||||
#### Scenario: InterruptBefore pauses before node execution
|
||||
|
||||
- **WHEN** `graph.compile({ interruptBefore: ["approval_node"] })` is used
|
||||
- **THEN** the graph SHALL pause just before executing `approval_node`
|
||||
|
||||
### Requirement: State snapshots on interrupt
|
||||
|
||||
When a graph uses a checkpointer, interrupt states SHALL be persisted so execution can be resumed across process boundaries.
|
||||
|
||||
#### Scenario: Interrupted state is checkpointed
|
||||
|
||||
- **WHEN** a graph with a checkpointer is interrupted
|
||||
- **THEN** the checkpoint SHALL contain the interrupt state
|
||||
- **THEN** restoring from that checkpoint SHALL yield the same interrupt state
|
||||
@@ -0,0 +1,55 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: LLM-as-judge evaluator factory
|
||||
|
||||
The system SHALL provide a `create_llm_as_judge()` function that creates an evaluator using an LLM to assess output quality.
|
||||
|
||||
Parameters:
|
||||
- `prompt: string | Runnable | Callable` — evaluation prompt (string template with `{inputs}`, `{outputs}`, `{reference_outputs}`)
|
||||
- `judge?: ModelClient | BaseChatModel` — LLM client
|
||||
- `model?: string` — model identifier
|
||||
- `system?: string` — optional system message
|
||||
- `continuous: boolean = false` — float 0-1 scoring when true, boolean when false
|
||||
- `choices?: number[]` — specific enum float values for score
|
||||
- `use_reasoning: boolean = true` — include reasoning in output
|
||||
- `few_shot_examples?: FewShotExample[]` — example evaluations
|
||||
- `output_schema?: JSONSchema | ZodSchema` — custom structured output format
|
||||
|
||||
#### Scenario: String prompt evaluator returns scored result
|
||||
|
||||
- **WHEN** `create_llm_as_judge(prompt="Rate: {outputs}")` is invoked with `outputs="Hello world"`
|
||||
- **THEN** it SHALL return an `EvaluatorResult` with `key: "score"` and a valid `score`
|
||||
|
||||
#### Scenario: Continuous scoring returns float
|
||||
|
||||
- **WHEN** `create_llm_as_judge(prompt=..., continuous=true)` scores output
|
||||
- **THEN** the score SHALL be a float between 0.0 and 1.0
|
||||
|
||||
#### Scenario: Choices scoring returns enum value
|
||||
|
||||
- **WHEN** `create_llm_as_judge(prompt=..., choices=[0.0, 0.5, 1.0])` scores output
|
||||
- **THEN** the score SHALL be exactly one of the enumerated choices
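
The three scoring modes can be enforced on the judge's raw output with a small validator. This is a sketch, not the evaluator itself: `scoreModeFrom` and `validateScore` are hypothetical helpers, and treating `continuous` together with `choices` as an error is an assumption the requirement does not state.

```typescript
type ScoreMode =
  | { kind: "binary" }
  | { kind: "continuous" }
  | { kind: "choices"; choices: number[] };

// Derives the scoring mode from the factory parameters.
function scoreModeFrom(opts: { continuous?: boolean; choices?: number[] }): ScoreMode {
  if (opts.continuous && opts.choices) {
    // Assumption: the two options are mutually exclusive.
    throw new Error("continuous and choices cannot be combined");
  }
  if (opts.choices) {
    return { kind: "choices", choices: opts.choices };
  }
  return opts.continuous ? { kind: "continuous" } : { kind: "binary" };
}

// Returns the score if it is valid for the mode, otherwise throws.
function validateScore(raw: unknown, mode: ScoreMode): number | boolean {
  switch (mode.kind) {
    case "binary":
      if (typeof raw === "boolean") return raw;
      break;
    case "continuous":
      if (typeof raw === "number" && raw >= 0 && raw <= 1) return raw;
      break;
    case "choices":
      if (typeof raw === "number" && mode.choices.includes(raw)) return raw;
      break;
  }
  throw new Error(`invalid ${mode.kind} score: ${JSON.stringify(raw)}`);
}
```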
|
||||
|
||||
#### Scenario: Reasoning mode returns comment
|
||||
|
||||
- **WHEN** `create_llm_as_judge(prompt=..., use_reasoning=true)` scores output
|
||||
- **THEN** the `comment` field SHALL contain the LLM's reasoning
|
||||
|
||||
#### Scenario: Few-shot examples are appended to prompt
|
||||
|
||||
- **WHEN** `few_shot_examples` are provided
|
||||
- **THEN** they SHALL be appended as `<example>` XML blocks to the last user message
|
||||
|
||||
#### Scenario: Output schema returns structured dict
|
||||
|
||||
- **WHEN** `output_schema` is provided (e.g., `z.object({ quality: z.number() })`)
|
||||
- **THEN** the evaluator SHALL return a dict matching that schema instead of `EvaluatorResult`
|
||||
|
||||
### Requirement: Async LLM-as-judge
|
||||
|
||||
The system SHALL provide `create_async_llm_as_judge()` with identical parameters, returning an async evaluator.
|
||||
|
||||
#### Scenario: Async evaluator returns same structure as sync
|
||||
|
||||
- **WHEN** `await` is used on an async evaluator invocation
|
||||
- **THEN** the result SHALL match the same structure as the sync equivalent
|
||||
@@ -0,0 +1,39 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Multi-turn conversation simulation
|
||||
|
||||
The system SHALL provide `run_multiturn_simulation()` that simulates a multi-turn conversation between an app and a simulated user.
|
||||
|
||||
Parameters:
|
||||
- `app: Callable[[ChatCompletionMessage], ChatCompletionMessage]` — the application under test
|
||||
- `user: Callable | string[]` — simulated user (dynamic or static responses)
|
||||
- `max_turns?: number` — maximum conversation turns
|
||||
- `trajectory_evaluators?: EvalFunction[]` — evaluators that assess the final trajectory
|
||||
- `stopping_condition?: Callable[[Message[], number], boolean]` — early termination
|
||||
- `reference_outputs?: unknown` — passed to evaluators
|
||||
|
||||
#### Scenario: Static user responses drive conversation
|
||||
|
||||
- **WHEN** `user=["Hello", "Tell me more", "Goodbye"]` with `max_turns=3`
|
||||
- **THEN** the simulation SHALL alternate between user responses and app responses for 3 turns
|
||||
|
||||
#### Scenario: Dynamic simulated user adapts to context
|
||||
|
||||
- **WHEN** `user` is a `Callable` receiving the current trajectory
|
||||
- **THEN** the user function SHALL receive the current conversation history and return the next message
|
||||
|
||||
#### Scenario: Trajectory evaluators run after simulation
|
||||
|
||||
- **WHEN** `trajectory_evaluators` are provided
|
||||
- **THEN** each evaluator SHALL receive the full conversation trajectory as `outputs`
|
||||
- **THEN** the simulation result SHALL include `evaluator_results` from each evaluator
|
||||
|
||||
#### Scenario: Stopping condition terminates early
|
||||
|
||||
- **WHEN** `stopping_condition` returns `true` before `max_turns`
|
||||
- **THEN** the simulation SHALL terminate immediately
|
||||
|
||||
#### Scenario: Async simulation is supported
|
||||
|
||||
- **WHEN** `run_multiturn_simulation_async()` is called with async `app` and `user` functions
|
||||
- **THEN** the simulation SHALL await each turn and return the same result structure
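
The simulation loop described by these scenarios can be sketched synchronously. Everything here is illustrative: `runSimulation` is a hypothetical simplified stand-in for `run_multiturn_simulation()`, the default of 5 turns is an assumption, and a static user list that runs out before `max_turns` simply ends the conversation.

```typescript
interface Message {
  role: "user" | "assistant";
  content: string;
}

type App = (message: Message) => Message;
type User = string[] | ((trajectory: Message[]) => Message);
type TrajectoryEvaluator = (args: { outputs: Message[] }) => {
  key: string;
  score: number | boolean;
};

interface SimulationOptions {
  app: App;
  user: User;
  max_turns?: number;
  trajectory_evaluators?: TrajectoryEvaluator[];
  stopping_condition?: (trajectory: Message[], turn: number) => boolean;
}

function runSimulation(opts: SimulationOptions) {
  const trajectory: Message[] = [];
  const maxTurns = opts.max_turns ?? 5; // assumed default
  for (let turn = 0; turn < maxTurns; turn++) {
    let userMessage: Message;
    if (Array.isArray(opts.user)) {
      if (turn >= opts.user.length) break; // static responses exhausted
      userMessage = { role: "user", content: opts.user[turn] };
    } else {
      userMessage = opts.user(trajectory); // dynamic user sees the history
    }
    trajectory.push(userMessage);
    trajectory.push(opts.app(userMessage));
    if (opts.stopping_condition?.(trajectory, turn + 1)) break;
  }
  // Evaluators run once, on the full trajectory.
  const evaluator_results = (opts.trajectory_evaluators ?? []).map((evaluate) =>
    evaluate({ outputs: trajectory }),
  );
  return { trajectory, evaluator_results };
}
```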
|
||||
@@ -0,0 +1,49 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Pregel execution engine
|
||||
|
||||
The system SHALL implement a Pregel-style superstep execution engine where:
|
||||
|
||||
- Each "superstep" executes all ready nodes concurrently
|
||||
- Nodes communicate through typed channels (not direct function calls)
|
||||
- Channel writes from one superstep are visible as reads in the next
|
||||
- The engine supports `PULL` (edge-triggered) and `PUSH` (dynamic Send) task scheduling
|
||||
|
||||
#### Scenario: Nodes execute in dependency order
|
||||
|
||||
- **WHEN** node B subscribes to channel A
|
||||
- **THEN** node B SHALL execute in the superstep after node A writes to channel A
|
||||
|
||||
#### Scenario: Concurrent nodes run in parallel
|
||||
|
||||
- **WHEN** two nodes have no dependencies between them
|
||||
- **THEN** they SHALL execute concurrently within the same superstep
|
||||
|
||||
#### Scenario: Dynamic Send spawns new node executions
|
||||
|
||||
- **WHEN** a node calls `send("node_c", { ... })` via `Command`
|
||||
- **THEN** `node_c` SHALL be scheduled for execution in the current or next superstep
|
||||
|
||||
### Requirement: Graph compilation
|
||||
|
||||
The system SHALL provide `graph.compile()` that produces a runnable compiled graph.
|
||||
|
||||
Parameters:
|
||||
- `checkpointer?: Checkpointer` — optional persistence
|
||||
- `interruptBefore?: string[]` — nodes to pause before
|
||||
- `interruptAfter?: string[]` — nodes to pause after
|
||||
- `name?: string` — graph name
|
||||
|
||||
#### Scenario: Compiled graph can be invoked
|
||||
|
||||
- **WHEN** `compiled_graph.invoke({ messages: [] })` is called
|
||||
- **THEN** it SHALL execute all nodes and return the final state
|
||||
|
||||
### Requirement: Recursion limit
|
||||
|
||||
The system SHALL enforce a configurable recursion limit to prevent infinite loops.
|
||||
|
||||
#### Scenario: Exceeding recursion limit throws
|
||||
|
||||
- **WHEN** a graph exceeds the recursion limit
|
||||
- **THEN** a `GraphRecursionError` SHALL be thrown
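
The superstep model and the recursion limit can be sketched together in a few lines. This is a simplified illustration with several assumptions: ready nodes run sequentially inside a superstep rather than concurrently, only `PULL`-style triggering is covered, `runSupersteps` and `PregelNode` are hypothetical names, and the default limit of 25 is chosen for the example only.

```typescript
type Channels = Record<string, unknown>;

interface PregelNode {
  name: string;
  subscribes: string[]; // channels whose updates trigger this node
  run: (read: Readonly<Channels>) => Channels; // returns channel writes
}

class GraphRecursionError extends Error {}

function runSupersteps(
  nodes: PregelNode[],
  input: Channels,
  recursionLimit = 25, // assumed default
): { channels: Channels; trace: string[][] } {
  const channels: Channels = { ...input };
  let updated = new Set<string>(Object.keys(input));
  const trace: string[][] = [];
  for (;;) {
    // A node is ready when a channel it subscribes to was written last step.
    const ready = nodes.filter((node) => node.subscribes.some((c) => updated.has(c)));
    if (ready.length === 0) break;
    if (trace.length >= recursionLimit) {
      throw new GraphRecursionError(`recursion limit of ${recursionLimit} supersteps reached`);
    }
    // Every ready node reads the same snapshot; writes are applied only after
    // the superstep ends, so they become visible in the next one.
    const snapshot: Readonly<Channels> = { ...channels };
    const writes: Channels = {};
    for (const node of ready) {
      Object.assign(writes, node.run(snapshot));
    }
    Object.assign(channels, writes);
    updated = new Set(Object.keys(writes));
    trace.push(ready.map((node) => node.name));
  }
  return { channels, trace };
}
```

Buffering writes until the end of the step is what lets the sequential loop be replaced by `Promise.all` later without changing results.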
|
||||
@@ -0,0 +1,61 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Command execution (blocking)
|
||||
|
||||
The system SHALL provide `sandbox.runCommand(cmd, args?, opts?)` that executes a command inside the sandbox and waits for completion.
|
||||
|
||||
Parameters:
|
||||
- `cmd: string` — command to execute
|
||||
- `args?: string[]` — command arguments
|
||||
- `cwd?: string` — working directory
|
||||
- `env?: Record<string, string>` — per-command environment variables
|
||||
- `sudo?: boolean` — execute with root privileges
|
||||
- `timeoutMs?: number` — max execution time (SIGKILL on expiry)
|
||||
- `signal?: AbortSignal` — cancellation
|
||||
|
||||
#### Scenario: Blocking runCommand returns finished result with exit code
|
||||
|
||||
- **WHEN** `sandbox.runCommand("echo", ["hello"])` is called
|
||||
- **THEN** it SHALL return a `CommandFinished` instance with `exitCode: 0`
|
||||
|
||||
#### Scenario: Command timeout kills process
|
||||
|
||||
- **WHEN** `sandbox.runCommand("sleep", ["100"], { timeoutMs: 100 })` is executed
|
||||
- **THEN** it SHALL return a non-zero exit code after ~100ms
|
||||
|
||||
#### Scenario: Stderr is captured separately
|
||||
|
||||
- **WHEN** a command writes to both stdout and stderr
|
||||
- **THEN** `result.stdout()` and `result.stderr()` SHALL return their respective streams
|
||||
|
||||
### Requirement: Detached command execution
|
||||
|
||||
The system SHALL support `{ detached: true }` mode where `runCommand()` returns immediately with a live `Command` handle.
|
||||
|
||||
#### Scenario: Detached command returns before completion
|
||||
|
||||
- **WHEN** `sandbox.runCommand({ cmd: "sleep", args: ["5"], detached: true })` is called
|
||||
- **THEN** it SHALL return a `Command` instance immediately (before the process exits)
|
||||
|
||||
#### Scenario: Detached command can be waited on
|
||||
|
||||
- **WHEN** `command.wait()` is called on a detached command
|
||||
- **THEN** it SHALL return a `CommandFinished` when the process exits
|
||||
|
||||
### Requirement: Command log streaming
|
||||
|
||||
The system SHALL provide `command.logs()` as an async iterable of stdout/stderr log lines.
|
||||
|
||||
#### Scenario: Logs stream output lines
|
||||
|
||||
- **WHEN** `for await (const log of command.logs())` is iterated
|
||||
- **THEN** each `log` SHALL have `stream: "stdout" | "stderr"` and `data: string`
|
||||
|
||||
### Requirement: Command kill
|
||||
|
||||
The system SHALL provide `command.kill(signal?)` to send a POSIX signal to a running command.
|
||||
|
||||
#### Scenario: Default kill sends SIGTERM
|
||||
|
||||
- **WHEN** `command.kill()` is called without a signal
|
||||
- **THEN** SIGTERM SHALL be sent to the process
|
||||
@@ -0,0 +1,50 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Filesystem API matching node:fs/promises
|
||||
|
||||
The system SHALL provide `sandbox.fs` implementing the Node.js `fs/promises` API:
|
||||
|
||||
- `readFile(path, encoding?)` → `Buffer | string`
|
||||
- `writeFile(path, data)` → `void`
|
||||
- `appendFile(path, data)` → `void`
|
||||
- `mkdir(path, { recursive? })` → `void`
|
||||
- `readdir(path, { withFileTypes? })` → `string[] | Dirent[]`
|
||||
- `stat(path)` / `lstat(path)` → `Stats`
|
||||
- `unlink(path)`, `rm(path, { recursive?, force? })`, `rmdir(path)` → `void`
|
||||
- `rename(oldPath, newPath)` → `void`
|
||||
- `copyFile(src, dest)` → `void`
|
||||
- `chmod(path, mode)`, `chown(path, uid, gid)` → `void`
|
||||
- `symlink(target, path)` → `void`; `readlink(path)` → `string`
|
||||
- `realpath(path)` → `string`; `truncate(path, len?)` → `void`
|
||||
- `mkdtemp(prefix)` → `string`
|
||||
- `access(path)` → `void` (rejects if the path is not accessible); `exists(path)` → `boolean`
|
||||
|
||||
#### Scenario: ReadFile returns correct content
|
||||
|
||||
- **WHEN** `sandbox.fs.readFile("/etc/hostname", "utf8")` is called
|
||||
- **THEN** it SHALL return the file content as a string
|
||||
|
||||
#### Scenario: WriteFile creates new file
|
||||
|
||||
- **WHEN** `sandbox.fs.writeFile("/tmp/test.txt", "hello")` is called
|
||||
- **THEN** subsequent `sandbox.fs.readFile("/tmp/test.txt", "utf8")` SHALL return `"hello"`
|
||||
|
||||
#### Scenario: Readdir lists directory contents
|
||||
|
||||
- **WHEN** `sandbox.fs.readdir("/")` is called
|
||||
- **THEN** it SHALL return an array of filenames
|
||||
|
||||
#### Scenario: Stat returns file metadata
|
||||
|
||||
- **WHEN** `sandbox.fs.stat("/etc/hostname")` is called
|
||||
- **THEN** it SHALL return a `Stats`-compatible object with `size`, `isFile()`, `isDirectory()`, `mode`, `uid`, `gid`, `mtime`, etc.
|
||||
|
||||
#### Scenario: Mkdir creates intermediate directories
|
||||
|
||||
- **WHEN** `sandbox.fs.mkdir("/tmp/a/b/c", { recursive: true })` is called
|
||||
- **THEN** the directory `/tmp/a/b/c` SHALL exist
|
||||
|
||||
#### Scenario: Exists returns false for missing files
|
||||
|
||||
- **WHEN** `sandbox.fs.exists("/nonexistent")` is called
|
||||
- **THEN** it SHALL return `false`
|
||||
@@ -0,0 +1,70 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Sandbox creation
|
||||
|
||||
The system SHALL provide a `Sandbox.create()` static method that provisions a new isolated compute environment.
|
||||
|
||||
Parameters:
|
||||
- `name?: string` — optional human-readable name
|
||||
- `source?: { type: "git" | "tarball" | "snapshot" }` — source for initial filesystem
|
||||
- `ports?: number[]` — ports to expose (max 4)
|
||||
- `timeout?: number` — auto-terminate timeout in ms
|
||||
- `resources?: { vcpus: number }` — CPU allocation (2048 MB RAM per vCPU)
|
||||
- `runtime?: string` — runtime identifier
|
||||
- `networkPolicy?: NetworkPolicy` — network restrictions
|
||||
- `env?: Record<string, string>` — default environment variables
|
||||
- `tags?: Record<string, string>` — metadata tags (max 5)
|
||||
- `persistent?: boolean` — persistent filesystem across sessions
|
||||
- `signal?: AbortSignal` — cancellation support
|
||||
|
||||
#### Scenario: Create returns a running Sandbox instance
|
||||
|
||||
- **WHEN** `Sandbox.create()` is called with valid parameters
|
||||
- **THEN** it SHALL return a `Sandbox` instance with a running session
|
||||
|
||||
#### Scenario: Create supports AsyncDisposable
|
||||
|
||||
- **WHEN** `Sandbox.create()` is used with `await using`
|
||||
- **THEN** the sandbox SHALL be automatically stopped when the enclosing scope exits
|
||||
|
||||
#### Scenario: Source specifies initial filesystem content
|
||||
|
||||
- **WHEN** `source: { type: "git", url: "..." }` is provided
|
||||
- **THEN** the sandbox SHALL clone the git repository on creation
|
||||
|
||||
### Requirement: Sandbox retrieval
|
||||
|
||||
The system SHALL provide `Sandbox.get()` to retrieve an existing sandbox and `Sandbox.getOrCreate()` for idempotent get-or-create.
|
||||
|
||||
#### Scenario: Get retrieves existing sandbox
|
||||
|
||||
- **WHEN** `Sandbox.get({ name: "my-sandbox" })` is called for an existing sandbox
|
||||
- **THEN** it SHALL return the sandbox with its session resumed
|
||||
|
||||
#### Scenario: GetOrCreate creates when not found
|
||||
|
||||
- **WHEN** `Sandbox.getOrCreate({ name: "new-sandbox", onCreate: ... })` is called and the sandbox does not exist
|
||||
- **THEN** it SHALL create a new sandbox and call `onCreate` once
|
||||
|
||||
### Requirement: Sandbox forking
|
||||
|
||||
The system SHALL provide `Sandbox.fork()` to create a new sandbox from an existing one's current filesystem state.
|
||||
|
||||
#### Scenario: Fork preserves filesystem state
|
||||
|
||||
- **WHEN** `Sandbox.fork({ sourceSandbox: "original" })` is called
|
||||
- **THEN** the new sandbox SHALL start with the filesystem state of the source sandbox
|
||||
|
||||
### Requirement: Sandbox update and delete
|
||||
|
||||
The system SHALL support `sandbox.update()` for configuration changes and `sandbox.delete()` for removal.
|
||||
|
||||
#### Scenario: Update changes sandbox config
|
||||
|
||||
- **WHEN** `sandbox.update({ timeout: 300000 })` is called
|
||||
- **THEN** the sandbox's timeout SHALL be updated for subsequent sessions
|
||||
|
||||
#### Scenario: Delete removes the sandbox
|
||||
|
||||
- **WHEN** `sandbox.delete()` is called
|
||||
- **THEN** the sandbox SHALL be permanently removed
|
||||
@@ -0,0 +1,52 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Network policy type
|
||||
|
||||
The system SHALL define a `NetworkPolicy` type with three forms:
|
||||
|
||||
- `"allow-all"` — full internet access (default)
|
||||
- `"deny-all"` — no external access
|
||||
- `{ allow?: string[] | Record<string, NetworkPolicyRule[]>; subnets?: { allow?: string[]; deny?: string[] } }` — custom rules
|
||||
|
||||
#### Scenario: Allow-all permits all traffic
|
||||
|
||||
- **WHEN** `networkPolicy: "allow-all"` is set
|
||||
- **THEN** all outbound traffic SHALL be permitted
|
||||
|
||||
#### Scenario: Deny-all blocks all traffic
|
||||
|
||||
- **WHEN** `networkPolicy: "deny-all"` is set
|
||||
- **THEN** all outbound traffic SHALL be denied
|
||||
|
||||
#### Scenario: Domain allowlist restricts access
|
||||
|
||||
- **WHEN** `networkPolicy: { allow: ["*.npmjs.org"] }` is set
|
||||
- **THEN** traffic to `registry.npmjs.org` SHALL be allowed and all other traffic SHALL be denied
|
||||
|
||||
#### Scenario: Wildcard domains match subdomains
|
||||
|
||||
- **WHEN** a domain pattern starts with `*.` (e.g., `*.example.com`)
|
||||
- **THEN** it SHALL match any subdomain of that domain
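
The wildcard rule above can be sketched as a small matcher. This is a minimal illustration, not the shipped implementation: `matchesDomainPattern` and `isHostAllowed` are hypothetical names, and it assumes that `*.example.com` does not match the bare apex `example.com`, which the spec does not state either way.

```typescript
// Hypothetical helper: true when `host` satisfies a single allowlist pattern.
function matchesDomainPattern(host: string, pattern: string): boolean {
  const h = host.toLowerCase();
  const p = pattern.toLowerCase();
  if (p.startsWith("*.")) {
    // "*.example.com" -> suffix ".example.com"; the host must have at least
    // one extra label in front of it (assumption: the apex is not matched).
    const suffix = p.slice(1);
    return h.length > suffix.length && h.endsWith(suffix);
  }
  return h === p;
}

// Hypothetical helper: true when any pattern in the allowlist matches.
function isHostAllowed(host: string, allow: string[]): boolean {
  return allow.some((pattern) => matchesDomainPattern(host, pattern));
}
```

Matching on the suffix including the leading dot keeps a host such as `evilnpmjs.org` from slipping through a `*.npmjs.org` rule.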
|
||||
|
||||
### Requirement: Network policy rules with transformers
|
||||
|
||||
The system SHALL support per-domain rules with request transformers for header injection.
|
||||
|
||||
Parameters per rule:
|
||||
- `match?: { path?, method?, queryString?, headers? }` — request matchers
|
||||
- `transform?: { headers: Record<string, string> }[]` — header injection
|
||||
- `forwardURL?: string` — HTTPS proxy forwarding
|
||||
|
||||
#### Scenario: Header transform injects authorization
|
||||
|
||||
- **WHEN** a request matches a rule with `transform: [{ headers: { authorization: "Bearer token" } }]`
|
||||
- **THEN** the `authorization` header SHALL be injected before forwarding
|
||||
|
||||
### Requirement: Subnet filtering
|
||||
|
||||
The system SHALL support subnet-level access control via CIDR notation.
|
||||
|
||||
#### Scenario: Subnet allow takes precedence over domain deny
|
||||
|
||||
- **WHEN** `subnets: { allow: ["10.0.0.0/8"] }` is set
|
||||
- **THEN** traffic to `10.0.0.1` SHALL be allowed regardless of domain rules
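
A CIDR check for the subnet rule might look like the sketch below. It covers IPv4 only, and the helper names (`ipv4ToInt`, `inCidr`, `subnetAllows`) are assumptions for illustration; how `subnets.deny` interacts with `subnets.allow` is not specified above and is left out.

```typescript
// Convert dotted-quad IPv4 text to an unsigned 32-bit integer value.
function ipv4ToInt(ip: string): number {
  const parts = ip.split(".");
  if (parts.length !== 4 || parts.some((p) => !/^\d{1,3}$/.test(p) || Number(p) > 255)) {
    throw new Error(`invalid IPv4 address: ${ip}`);
  }
  // Plain arithmetic avoids signed 32-bit surprises from bitwise operators.
  return parts.reduce((acc, p) => acc * 256 + Number(p), 0);
}

// True when `ip` falls inside the CIDR block, e.g. "10.0.0.0/8".
function inCidr(ip: string, cidr: string): boolean {
  const slash = cidr.indexOf("/");
  if (slash < 0) throw new Error(`invalid CIDR block: ${cidr}`);
  const prefixBits = Number(cidr.slice(slash + 1));
  if (!Number.isInteger(prefixBits) || prefixBits < 0 || prefixBits > 32) {
    throw new Error(`invalid prefix length in: ${cidr}`);
  }
  const blockSize = 2 ** (32 - prefixBits);
  const base = ipv4ToInt(cidr.slice(0, slash));
  // Two addresses share a block when they fall in the same aligned bucket.
  return Math.floor(ipv4ToInt(ip) / blockSize) === Math.floor(base / blockSize);
}

// Hypothetical policy hook: a subnet allow hit short-circuits domain rules.
function subnetAllows(ip: string, allow: string[]): boolean {
  return allow.some((block) => inCidr(ip, block));
}
```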
|
||||
@@ -0,0 +1,59 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Snapshot creation
|
||||
|
||||
The system SHALL provide `sandbox.snapshot()` to create a point-in-time filesystem snapshot.
|
||||
|
||||
Parameters:
|
||||
- `expiration?: number` — TTL in milliseconds (0 for no expiration)
|
||||
|
||||
#### Scenario: Snapshot stops the session and returns Snapshot instance
|
||||
|
||||
- **WHEN** `sandbox.snapshot()` is called on a running sandbox
|
||||
- **THEN** the current session SHALL be stopped and a `Snapshot` SHALL be returned
|
||||
|
||||
### Requirement: Snapshot retrieval and listing
|
||||
|
||||
The system SHALL provide `Snapshot.get()`, `Snapshot.list()`, and `Snapshot.tree()` for managing snapshots.
|
||||
|
||||
#### Scenario: Retrieve snapshot by ID
|
||||
|
||||
- **WHEN** `Snapshot.get({ snapshotId: "snap_abc" })` is called
|
||||
- **THEN** it SHALL return the snapshot with matching ID
|
||||
|
||||
#### Scenario: List snapshots with pagination
|
||||
|
||||
- **WHEN** `Snapshot.list({ name: "my-sandbox" })` is called
|
||||
- **THEN** it SHALL return a paginated list of snapshots for that sandbox
|
||||
|
||||
#### Scenario: Ancestry tree is accessible
|
||||
|
||||
- **WHEN** `Snapshot.tree({ snapshotId: "snap_abc" })` is called
|
||||
- **THEN** it SHALL return the ancestry tree of the snapshot
|
||||
|
||||
### Requirement: Snapshot deletion
|
||||
|
||||
The system SHALL provide `snapshot.delete()` to remove a snapshot.
|
||||
|
||||
#### Scenario: Deleted snapshot is no longer listable
|
||||
|
||||
- **WHEN** `snapshot.delete()` is called and then `Snapshot.list()` is called
|
||||
- **THEN** the deleted snapshot SHALL no longer appear in the list
|
||||
|
||||
### Requirement: Snapshot-based sandbox creation
|
||||
|
||||
The system SHALL support creating sandboxes from snapshots via `Sandbox.create({ source: { type: "snapshot", snapshotId } })`.
|
||||
|
||||
#### Scenario: Sandbox created from snapshot has matching filesystem
|
||||
|
||||
- **WHEN** a sandbox is created with a snapshot source and a file is written, then another sandbox is created from the resulting snapshot
|
||||
- **THEN** the second sandbox SHALL contain the file from the first
|
||||
|
||||
### Requirement: Snapshot retention
|
||||
|
||||
The system SHALL support `keepLastSnapshots` retention policy on sandboxes.
|
||||
|
||||
#### Scenario: Retention evicts oldest snapshots
|
||||
|
||||
- **WHEN** a sandbox has `keepLastSnapshots: { count: 3 }` and a 4th snapshot is created
|
||||
- **THEN** the oldest snapshot SHALL be evicted
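
The retention rule can be sketched as a pure function over snapshot metadata. `SnapshotRecord` and `applyRetention` are hypothetical names, and the sketch assumes that age is determined by a creation timestamp.

```typescript
// Hypothetical metadata shape; only the fields needed for retention.
interface SnapshotRecord {
  id: string;
  createdAt: number; // epoch milliseconds
}

// Keep the `count` newest snapshots and report the rest as evicted.
function applyRetention(
  snapshots: SnapshotRecord[],
  count: number,
): { kept: SnapshotRecord[]; evicted: SnapshotRecord[] } {
  // Copy before sorting so the caller's array is left untouched.
  const newestFirst = [...snapshots].sort((a, b) => b.createdAt - a.createdAt);
  return { kept: newestFirst.slice(0, count), evicted: newestFirst.slice(count) };
}
```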
|
||||
@@ -0,0 +1,43 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: State definition via annotations
|
||||
|
||||
The system SHALL provide an `Annotation` API for defining graph state schemas:
|
||||
|
||||
- `Annotation<T>(reducer?)` — creates a state key with optional reducer
|
||||
- `Annotation.Root({ key: Annotation<T> })` — combines keys into a state schema
|
||||
- Reducers: `LastValue` (default — overwrite), `BinaryOperator` (custom merge function)
|
||||
|
||||
#### Scenario: Annotation.Root defines typed state
|
||||
|
||||
- **WHEN** `const State = Annotation.Root({ messages: Annotation<string[]>(addMessages), step: Annotation<number>() })` is defined
|
||||
- **THEN** `State` SHALL have `State`, `Update`, and `Node` type members
|
||||
|
||||
#### Scenario: LastValue rejects multiple writes in one step

- **WHEN** two writes, `{ step: 2 }` and `{ step: 3 }`, target the same LastValue key in the same step
- **THEN** the LastValue channel SHALL throw an `InvalidUpdateError`
|
||||
|
||||
#### Scenario: BinaryOperator reducer accumulates
|
||||
|
||||
- **WHEN** a node returns `{ messages: ["hello"] }` and another returns `{ messages: ["world"] }` with an `addMessages` reducer
|
||||
- **THEN** the final state SHALL contain `messages: ["hello", "world"]`
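
The two reducer behaviours above can be illustrated with a small step-merge function. This is a sketch under assumptions: `applyStepWrites` is a hypothetical helper and not part of the `Annotation` API, and `addMessages` here is a plain array concatenation standing in for the real reducer.

```typescript
class InvalidUpdateError extends Error {}

type Reducer<T> = (current: T, update: T) => T;

// Merge all writes made to one state key during a single step.
// Without a reducer the key behaves like LastValue: at most one write per step.
function applyStepWrites<T>(current: T, writes: T[], reducer?: Reducer<T>): T {
  if (writes.length === 0) return current;
  if (reducer === undefined) {
    if (writes.length > 1) {
      throw new InvalidUpdateError("LastValue key received multiple writes in one step");
    }
    return writes[0] as T;
  }
  // BinaryOperator behaviour: fold every write into the current value.
  return writes.reduce(reducer, current);
}

// Stand-in reducer for illustration only.
const addMessages: Reducer<string[]> = (current, update) => [...current, ...update];
```

With `addMessages`, writes of `["hello"]` and `["world"]` fold into one array, while the same pair of writes against a key with no reducer raises `InvalidUpdateError`.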
|
||||
|
||||
### Requirement: StateGraph builder
|
||||
|
||||
The system SHALL provide a `StateGraph` class for constructing stateful agent graphs.
|
||||
|
||||
#### Scenario: StateGraph is constructed with state schema
|
||||
|
||||
- **WHEN** `new StateGraph({ stateSchema: State })` is called
|
||||
- **THEN** the graph SHALL accept nodes that receive and can update the defined state
|
||||
|
||||
#### Scenario: Nodes can read and write state
|
||||
|
||||
- **WHEN** a node function receives state with `{ messages, step }` and returns `{ step: step + 1 }`
|
||||
- **THEN** the graph SHALL update `step` and preserve `messages`
|
||||
|
||||
#### Scenario: Conditional edges route based on state
|
||||
|
||||
- **WHEN** `addConditionalEdges("node_a", (state) => state.step > 5 ? "end" : "node_b")` is added
|
||||
- **THEN** execution SHALL route based on the state value at runtime
|
||||
@@ -0,0 +1,51 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Trajectory match evaluator
|
||||
|
||||
The system SHALL provide `create_trajectory_match_evaluator()` that compares agent tool-call trajectories against reference trajectories.
|
||||
|
||||
Parameters:
|
||||
- `trajectory_match_mode: "strict" | "unordered" | "subset" | "superset"` — matching strategy
|
||||
- `tool_args_match_mode: "exact" | "ignore" | "subset" | "superset"` — tool argument comparison
|
||||
- `tool_args_match_overrides?: Record<string, ToolArgsMatchMode | Callable | string[]>` — per-tool custom matching
|
||||
|
||||
#### Scenario: Strict mode requires exact order
|
||||
|
||||
- **WHEN** output trajectory has tool calls `[A, B]` and reference is `[A, B]`
|
||||
- **THEN** strict mode SHALL return score `true`
|
||||
- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
|
||||
- **THEN** strict mode SHALL return score `false`
|
||||
|
||||
#### Scenario: Unordered mode ignores order
|
||||
|
||||
- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
|
||||
- **THEN** unordered mode SHALL return score `true`
|
||||
|
||||
#### Scenario: Subset mode accepts partial trajectory
|
||||
|
||||
- **WHEN** output trajectory has tool calls `[A]` and reference is `[A, B]`
|
||||
- **THEN** subset mode SHALL return score `true`
|
||||
|
||||
#### Scenario: Superset mode allows extra tool calls
|
||||
|
||||
- **WHEN** output trajectory has tool calls `[A, B, C]` and reference is `[A, B]`
|
||||
- **THEN** superset mode SHALL return score `true`
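
The four modes can be sketched over tool names alone, which corresponds to ignoring tool arguments. `trajectoryMatches` is a hypothetical function, and treating repeated tool calls as a multiset (so `[A, A]` is not a subset of `[A]`) is an assumption the spec does not settle.

```typescript
type TrajectoryMatchMode = "strict" | "unordered" | "subset" | "superset";

// True when every name in `inner` can be paired with a distinct name in `outer`.
function isMultisetSubset(inner: string[], outer: string[]): boolean {
  const remaining = new Map<string, number>();
  for (const name of outer) remaining.set(name, (remaining.get(name) ?? 0) + 1);
  for (const name of inner) {
    const left = remaining.get(name) ?? 0;
    if (left === 0) return false;
    remaining.set(name, left - 1);
  }
  return true;
}

// Compare the output tool-call names against the reference tool-call names.
function trajectoryMatches(
  output: string[],
  reference: string[],
  mode: TrajectoryMatchMode,
): boolean {
  switch (mode) {
    case "strict": // same calls in the same order
      return output.length === reference.length && output.every((name, i) => name === reference[i]);
    case "unordered": // same calls in any order
      return output.length === reference.length && isMultisetSubset(output, reference);
    case "subset": // output may omit reference calls
      return isMultisetSubset(output, reference);
    case "superset": // output may add extra calls
      return isMultisetSubset(reference, output);
  }
}
```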
|
||||
|
||||
#### Scenario: Tool args ignore mode skips argument comparison
|
||||
|
||||
- **WHEN** `tool_args_match_mode="ignore"` is set
|
||||
- **THEN** tool calls SHALL match regardless of their arguments
|
||||
|
||||
#### Scenario: Custom tool arg matcher is used
|
||||
|
||||
- **WHEN** `tool_args_match_overrides` contains a `Callable` for a tool name
|
||||
- **THEN** that callable SHALL be invoked to compare the tool's arguments
|
||||
|
||||
### Requirement: Trajectory LLM-as-judge
|
||||
|
||||
The system SHALL provide `create_trajectory_llm_as_judge()` that uses an LLM to grade trajectory quality and accuracy.
|
||||
|
||||
#### Scenario: Trajectory is formatted as XML for LLM
|
||||
|
||||
- **WHEN** an LLM trajectory evaluator is invoked
|
||||
- **THEN** the trajectory SHALL be formatted as XML with `<role>`, `<tool_call>`, `<tool_result>` elements
|
||||
@@ -0,0 +1,50 @@
|
||||
## 1. Foundation: Core Types & Monorepo Setup ✅
|
||||
|
||||
- [x] 1.1 Initialize pnpm monorepo with turbo.json at root, configure `@agent-runtime/*` workspace packages
|
||||
- [x] 1.2 Set up shared TypeScript config (strict mode, ESNext modules, path aliases)
|
||||
- [x] 1.3 Implement `@agent-runtime/core` package: `EvaluatorResult`, `ScoreType`, `ModelClient` protocol, `Serializable` interface
|
||||
- [x] 1.4 Implement `@agent-runtime/core` serialization protocol: `toJSON()`/`fromJSON()` pattern on stateful types
|
||||
- [x] 1.5 Implement `@agent-runtime/core` error types: `EvalError`
|
||||
- [x] 1.6 Implement `@agent-runtime/core` utility functions: message normalization, XML formatting, JSON schema construction
|
||||
|
||||
## 2. Eval: LLM-as-Judge Core
|
||||
|
||||
- [ ] 2.1 Implement `_construct_default_output_json_schema()` for continuous/binary/choices scoring with reasoning
|
||||
- [ ] 2.2 Implement prompt formatting (string templates, attachments, system messages)
|
||||
- [ ] 2.3 Implement `_append_few_shot_examples()` with XML `<example>` formatting
|
||||
- [ ] 2.4 Implement `_create_llm_as_judge_scorer()` — core scorer with structured output via OpenAI JSON schema
|
||||
- [ ] 2.5 Implement `create_llm_as_judge()` factory wrapping scorer into `_run_evaluator()`
|
||||
- [ ] 2.6 Implement async variants: `create_async_llm_as_judge()`, `_create_async_llm_as_judge_scorer()`
|
||||
- [ ] 2.7 Implement `_run_evaluator_untyped()` and `_process_score()` for result aggregation
|
||||
- [ ] 2.8 Write unit tests for LLM-as-judge: string prompts, continuous scoring, choices, reasoning, few-shot
|
||||
|
||||
## 3. Eval: Trajectory Evaluators
|
||||
|
||||
- [ ] 3.1 Implement trajectory matching utilities: `_normalize_to_openai_messages_list()`, `_extract_tool_calls()`
|
||||
- [ ] 3.2 Implement `_is_trajectory_superset()` core comparator with `_get_matcher_for_tool_name()` override system
|
||||
- [ ] 3.3 Implement strict/unordered/subset/superset matching scorers
|
||||
- [ ] 3.4 Implement `create_trajectory_match_evaluator()` with all 4 modes and `tool_args_match_overrides`
|
||||
- [ ] 3.5 Write tests: all 4 match modes, tool args ignore, custom matchers
|
||||
|
||||
## 4. Eval: Code Correctness Evaluators
|
||||
|
||||
- [ ] 4.1 Implement code extraction: `_extract_code_from_markdown_code_blocks()` regex parser
|
||||
- [ ] 4.2 Implement `_create_base_code_evaluator()` with pluggable extraction pipeline
|
||||
- [ ] 4.3 Implement `create_code_llm_as_judge()` combining extraction + LLM scoring
|
||||
- [ ] 4.4 Implement `create_pyright_evaluator()` with temp file execution and JSON output parsing
|
||||
- [ ] 4.5 Write tests: markdown extraction, Pyright static analysis
|
||||
|
||||
## 5. Eval: Prompt Library
|
||||
|
||||
- [ ] 5.1 Export Quality prompt templates: correctness, conciseness, hallucination, answer_relevance, code_correctness, plan_adherence
|
||||
- [ ] 5.2 Export Safety/Security prompt templates: toxicity, fairness, pii_leakage, prompt_injection
|
||||
- [ ] 5.3 Export Trajectory prompt templates: trajectory_accuracy (with and without reference), tool_selection
|
||||
- [ ] 5.4 Export Conversation prompt templates: user_satisfaction, task_completion, agent_tone
|
||||
|
||||
## 6. Documentation & Release
|
||||
|
||||
- [ ] 6.1 Write README with architecture overview and getting-started example
|
||||
- [ ] 6.2 Document each package with tsdoc exports
|
||||
- [ ] 6.3 Write usage examples: basic eval, code correctness check
|
||||
- [ ] 6.4 Add CI pipeline: lint, type-check, test
|
||||
- [ ] 6.5 Publish initial alpha for `@agent-runtime/eval` package
|
||||
@@ -0,0 +1,2 @@
|
||||
schema: spec-driven
|
||||
created: 2026-06-07
|
||||
@@ -0,0 +1,62 @@
|
||||
## Context
|
||||
|
||||
Three workflow engine patterns were researched: **Archon** (DAG-based YAML, git isolation), **Agent SOP** (markdown instructions with RFC 2119 constraints), and **Vercel Workflow** (event-sourced durable execution). Each excels in one dimension but has fundamental gaps:
|
||||
|
||||
- **Archon**: Clean DAG format + variable substitution + approval gates, but no crash recovery, tightly coupled to its monorepo (Bun/SQLite/Claude SDK)
|
||||
- **Agent SOP**: Zero parser complexity, AI-native markdown, but completely stateless — no execution engine, no validation, no persistence
|
||||
- **Vercel Workflow**: Gold-standard durability via event sourcing, but requires a Rust SWC plugin, a VM sandbox, and a 24-36 week rebuild, which is more complexity than most use cases justify
|
||||
|
||||
**Ion** extracts the portable essence of each: Archon's DAG schema and executor, Agent SOP's markdown readability, Vercel's event sourcing (simplified — no SWC, no VM, no compile transforms).
|
||||
|
||||
## Goals / Non-Goals
|
||||
|
||||
**Goals:**
|
||||
- Fully portable DAG execution engine in pure TypeScript (zero Rust/SWC/wasm)
|
||||
- YAML-first workflow definitions with 7 node types (command, prompt, bash, script, loop, approval, cancel)
|
||||
- `.sop.md` markdown format as a secondary input (transpiled to DAG nodes)
|
||||
- Event-sourced persistence for crash recovery with deterministic replay — simplified to "log of node outcomes" rather than "log of every async operation"
|
||||
- Pluggable storage backends: filesystem (dev), SQLite/Postgres (production)
|
||||
- CLI tool + library API dual distribution
|
||||
- Approval gates with capture_response and on_reject
|
||||
- Variable substitution ($nodeId.output, $ARGUMENTS, $LOOP_PREV_OUTPUT, etc.)
|
||||
- Script execution via bun/node (TS) and uv/python3 (Python) with deps support
|
||||
|
||||
**Non-Goals:**
|
||||
- No SWC compiler plugin or build-time transforms (Vercel's approach is overkill for this scope)
|
||||
- No VM sandbox for workflow execution (workflows run as regular async functions)
|
||||
- No git worktree isolation (leave to the host application)
|
||||
- No multi-tenant or serverless platform (single-tenant CLI/library focus)
|
||||
- No web UI in the initial build (CLI + library only; web can be added later)
|
||||
- No AI provider integration (host application provides the AI; Ion just routes prompts)
|
||||
|
||||
## Decisions
|
||||
|
||||
### Decision 1: Event Log = Node Outcomes, Not Every Async Operation
|
||||
**Vercel** logs every `step_created`, `step_completed`, `wait_created`, `hook_received` etc. — 17 event types. This requires SWC transforms to intercept all async boundaries.
|
||||
**Ion** logs only coarse-grained events at the workflow and node level: `workflow_started`, `workflow_completed`, `workflow_failed`, `workflow_cancelled`, `node_started`, `node_completed`, `node_failed`, `node_skipped`. No micro-events. Replay means "re-run the DAG from the top, skipping completed nodes using stored outputs", which is identical to Archon's `resume` approach.
|
||||
**Rationale**: Simpler by an order of magnitude. No interceptors, no transforms, no VM. Crash recovery works: if the process dies mid-workflow, replay skips completed nodes and re-executes from the last failed/incomplete layer.
|
||||
|
||||
### Decision 2: Pure TypeScript — No Rust, No SWC, No WASM
|
||||
The three engines studied differ in implementation language: Archon is pure TS, Vercel relies on a Rust SWC plugin, and Agent SOP is pure Python. The SWC plugin is the single biggest contributor to Vercel's 24-36 week build time.
|
||||
**Ion** stays pure TS. The DAG executor, YAML loader, variable substitution, event log — all standard async/await. No build step beyond `tsc` or `bun build`.
|
||||
|
||||
### Decision 3: YAML Primary, Markdown Secondary
|
||||
**Archon's YAML** format is the primary definition: structured, validated by Zod, machine-parseable. **Agent SOP's markdown** is the secondary format: human-writable, conversational, auto-converted.
|
||||
The transpiler is simple: parse `## Parameters` → extract required fields, parse `## Steps` → convert each step to a `prompt:` node with constraints embedded in the prompt text. No AST-level parsing needed.
|
||||
|
||||
### Decision 4: Storage via IWorkflowStore Interface
|
||||
**Archon's pattern**: `IWorkflowStore` interface with `createWorkflowRun`, `getWorkflowRun`, `updateWorkflowRun`, `failWorkflowRun`, `createWorkflowEvent`, `getCompletedDagNodeOutputs`. Adapters implement the interface.
|
||||
**Ion** copies this pattern exactly. FilesystemStore (JSON files per run), SqliteStore, PostgresStore. The interface is the seam.
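
To make the seam concrete, here is a rough sketch of the interface with an in-memory adapter. Only the six method names come from the text above; every signature, the record shapes, and `InMemoryStore` itself are assumptions, and the methods are synchronous purely to keep the sketch short (a real adapter would likely be async).

```typescript
type RunStatus = "pending" | "running" | "paused" | "completed" | "failed" | "cancelled";

interface WorkflowRun { id: string; workflowName: string; status: RunStatus; error: string | null }
interface WorkflowEventRecord { runId: string; eventType: string; nodeId: string | null; output: string | null }

// Method names follow the design text; parameter and return types are guesses.
interface IWorkflowStore {
  createWorkflowRun(workflowName: string): WorkflowRun;
  getWorkflowRun(runId: string): WorkflowRun | undefined;
  updateWorkflowRun(runId: string, status: RunStatus): void;
  failWorkflowRun(runId: string, error: string): void;
  createWorkflowEvent(event: WorkflowEventRecord): void;
  getCompletedDagNodeOutputs(runId: string): Map<string, string>;
}

// Hypothetical adapter used only for illustration and tests.
class InMemoryStore implements IWorkflowStore {
  private runs = new Map<string, WorkflowRun>();
  private events: WorkflowEventRecord[] = [];
  private nextId = 1;

  createWorkflowRun(workflowName: string): WorkflowRun {
    const run: WorkflowRun = { id: `run_${this.nextId++}`, workflowName, status: "pending", error: null };
    this.runs.set(run.id, run);
    return run;
  }
  getWorkflowRun(runId: string): WorkflowRun | undefined {
    return this.runs.get(runId);
  }
  updateWorkflowRun(runId: string, status: RunStatus): void {
    this.requireRun(runId).status = status;
  }
  failWorkflowRun(runId: string, error: string): void {
    const run = this.requireRun(runId);
    run.status = "failed";
    run.error = error;
  }
  createWorkflowEvent(event: WorkflowEventRecord): void {
    this.events.push({ ...event }); // append-only: stored copies are never edited
  }
  getCompletedDagNodeOutputs(runId: string): Map<string, string> {
    const outputs = new Map<string, string>();
    for (const e of this.events) {
      if (e.runId === runId && e.eventType === "node_completed" && e.nodeId !== null) {
        outputs.set(e.nodeId, e.output ?? "");
      }
    }
    return outputs;
  }
  private requireRun(runId: string): WorkflowRun {
    const run = this.runs.get(runId);
    if (run === undefined) throw new Error(`unknown run: ${runId}`);
    return run;
  }
}
```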
|
||||
|
||||
### Decision 5: CLI + Library, Not Server
|
||||
**Archon** has a server + web UI. **Vercel** is a platform SDK. **Ion** ships only as a CLI + library.
|
||||
The CLI wraps the library: `ion run <workflow>`, `ion list`, `ion approve`, `ion reject`, `ion resume`. The library exports `executeWorkflow()`, `createStore()`, `parseWorkflow()`, `discoverWorkflows()`.
|
||||
|
||||
## Risks / Trade-offs
|
||||
|
||||
| Risk | Mitigation |
|---|---|
| **Event-sourcing is simplified to node-level only**: no intra-node recovery (if a 30-min AI prompt crashes at 29 min, it restarts from scratch). | Acceptable tradeoff. AI prompts are idempotent. For script/bash nodes, provide `timeout` and `retry` config. Node-level replay is 90% of the value at 10% of the complexity. |
| **No VM sandbox**: workflows run as regular async functions, so `while(true){}` hangs the process. | Document that workflow code must be well-behaved. The `idle_timeout` per node provides a circuit breaker. Production deployments can run workflows in a separate child process. |
| **Markdown-to-YAML transpiler** may lose nuance: SOP's RFC 2119 constraints are prose, not structured. | Constraints stay embedded in the prompt text of the resulting `prompt:` node. The transpiler extracts Parameters (→ node metadata) and Steps (→ prompt body). Lossless for the critical path. |
| **Competing with existing engines**: Archon exists, Temporal exists, Inngest exists. | Ion targets a different niche: portable CLI-first engine that fits in a single repo. Not a platform, not a cloud service. |
|
||||
@@ -0,0 +1,41 @@
|
||||
## Why
|
||||
|
||||
Current workflow engines force a tradeoff between simplicity and durability. Archon has a clean DAG-based YAML format but no crash recovery. Vercel Workflow has bulletproof deterministic replay but requires a Rust compiler plugin and 24-36 weeks to build. Agent SOP proves that human-readable markdown workflows work, but lacks structured execution. There is no portable workflow engine that combines a simple DAG format, human-readable definitions, and durable event-sourced execution in a single, buildable package.
|
||||
|
||||
## What Changes
|
||||
|
||||
Introduce **Ion** — a portable hybrid workflow engine that combines the three approaches:
|
||||
|
||||
- **Archon-style YAML DAG format** with `nodes:`, `depends_on:`, and trigger rules as the primary workflow definition
|
||||
- **Agent SOP-style `.sop.md` markdown** as a secondary human-readable format, auto-converted to the DAG representation
|
||||
- **Vercel-style event log** for deterministic replay and crash recovery, but simplified — no SWC plugin, no VM sandbox, no compile-time transforms
|
||||
- **Multi-backend storage** (filesystem for dev, SQLite/Postgres for production)
|
||||
- **CLI + library** dual distribution: use as a CLI tool or embed as a library
|
||||
- **No Rust compiler plugins, no SWC, no VM sandbox** — pure TypeScript/JavaScript, zero compile-time transforms
|
||||
|
||||
## Capabilities
|
||||
|
||||
### New Capabilities
|
||||
|
||||
- `dag-engine`: DAG execution engine with topological ordering, concurrent layers, trigger rules (`all_success`, `one_success`, `all_done`, `none_failed_min_one_success`), and `when:` condition evaluation
|
||||
- `yaml-format`: Workflow definition in YAML with 7 node types (command, prompt, bash, script, loop, approval, cancel) plus `depends_on`, `trigger_rule`, `output_format`, `retry`, `timeout`
|
||||
- `markdown-format`: `.sop.md` human-readable workflow format with RFC 2119 constraint keywords, auto-converted to DAG nodes
|
||||
- `event-sourcing`: Append-only event log for workflow runs with deterministic replay for crash recovery — simplified (no SWC, no VM sandbox)
|
||||
- `variable-substitution`: `$nodeId.output`, `$nodeId.output.field`, `$ARGUMENTS`, `$ARTIFACTS_DIR`, `$WORKFLOW_ID`, `$LOOP_PREV_OUTPUT`, `$REJECTION_REASON` with strict field access
|
||||
- `script-execution`: Script node type running TypeScript (bun/node) and Python (uv/python3) with `deps:` support and `timeout:`
|
||||
- `human-approval`: Approval gate nodes that pause execution for human review with `capture_response` and `on_reject` retry support
|
||||
- `storage-backends`: Pluggable storage — filesystem (dev), SQLite, Postgres — with IWorkflowStore interface
|
||||
- `workflow-lifecycle`: Run states `pending → running → paused/completed/failed/cancelled`, resume skipping completed nodes, event-driven observability
|
||||
- `cli-tool`: Command-line interface for listing, running, approving, rejecting, resuming, and cleaning up workflow runs
|
||||
- `library-api`: Programmatic API for embedding the engine in other applications
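
As an illustration of the `variable-substitution` capability listed above, the sketch below resolves `$nodeId.output`, `$nodeId.output.field`, and upper-case built-ins such as `$ARGUMENTS`. The function name, the regular expression, and the choice to treat node outputs as JSON text for field access are all assumptions; "strict field access" is interpreted here as throwing on a missing field.

```typescript
// Hypothetical substitution routine; unknown built-ins are left untouched.
function substituteVariables(
  template: string,
  nodeOutputs: Map<string, string>,
  builtins: Record<string, string>,
): string {
  const pattern = /\$([A-Za-z_][\w-]*)\.output(?:\.([A-Za-z_]\w*))?|\$([A-Z][A-Z_]*)/g;
  return template.replace(
    pattern,
    (match: string, nodeId: string | undefined, field: string | undefined, builtin: string | undefined) => {
      if (builtin !== undefined) return builtins[builtin] ?? match;
      if (nodeId === undefined) return match;
      const output = nodeOutputs.get(nodeId);
      if (output === undefined) throw new Error(`no output recorded for node "${nodeId}"`);
      if (field === undefined) return output;
      // Field access: parse the output as JSON and fail loudly on a missing key.
      const parsed: unknown = JSON.parse(output);
      if (typeof parsed !== "object" || parsed === null || !(field in parsed)) {
        throw new Error(`node "${nodeId}" output has no field "${field}"`);
      }
      const value = (parsed as Record<string, unknown>)[field];
      return typeof value === "string" ? value : JSON.stringify(value);
    },
  );
}
```

Doing the replacement in a single pass means text inserted from a node output is not scanned again for further variables.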
|
||||
|
||||
### Modified Capabilities
|
||||
|
||||
<!-- No existing specs to modify — this is a greenfield change. -->
|
||||
|
||||
## Impact
|
||||
|
||||
- **Greenfield project** — no existing code to modify, all new artifacts under `ion/` or equivalent package path
|
||||
- **Dependencies**: Zod (schema validation), nanoid/ulid (ID generation), js-yaml (YAML parsing), chokidar (file watching for dev mode)
|
||||
- **Optional dependencies**: better-sqlite3 / postgres.js (production storage backends), bun (fast script runtime), highlight.js (markdown rendering)
|
||||
- **No** Rust, SWC, wasm, or compile-time transforms in the core engine
|
||||
@@ -0,0 +1,76 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Workflow listing
|
||||
The CLI SHALL provide a `list` command that displays all discovered workflows and their descriptions.
|
||||
|
||||
#### Scenario: List workflows
|
||||
- **WHEN** `ion list` is run
|
||||
- **THEN** all discovered workflows SHALL be listed with name, description, and source (bundled/project)
|
||||
|
||||
### Requirement: Workflow execution
|
||||
The CLI SHALL provide a `run` command that executes a workflow by name with optional arguments.
|
||||
|
||||
#### Scenario: Run workflow with message
|
||||
- **WHEN** `ion run analyze "analyze the codebase"` is run
|
||||
- **THEN** the `analyze` workflow SHALL execute with the provided user message
|
||||
|
||||
#### Scenario: Run in specific directory
|
||||
- **WHEN** `ion run build --cwd /path/to/project` is run
|
||||
- **THEN** the workflow SHALL use the specified working directory
|
||||
|
||||
#### Scenario: Run with specific store
|
||||
- **WHEN** `ion run deploy --store sqlite --db-path ./ion.db` is run
|
||||
- **THEN** the specified store backend SHALL be used
|
||||
|
||||
### Requirement: Workflow approval commands
|
||||
The CLI SHALL provide `approve` and `reject` commands for responding to approval gates.
|
||||
|
||||
#### Scenario: Approve a paused workflow
|
||||
- **WHEN** `ion approve <run-id>` is run
|
||||
- **THEN** the workflow SHALL resume from the paused approval node
|
||||
|
||||
#### Scenario: Approve with comment
|
||||
- **WHEN** `ion approve <run-id> "looks good"` is run
|
||||
- **THEN** the comment SHALL be recorded and available as `$nodeId.output`
|
||||
|
||||
#### Scenario: Reject with reason
|
||||
- **WHEN** `ion reject <run-id> "needs changes"` is run
|
||||
- **THEN** `$REJECTION_REASON` SHALL be set to "needs changes"
|
||||
- **THEN** if `on_reject` is configured, the handler SHALL execute
|
||||
|
||||
### Requirement: Workflow run management
|
||||
The CLI SHALL provide `status`, `runs`, `resume`, `abandon`, and `cleanup` commands.
|
||||
|
||||
#### Scenario: Show running workflows
|
||||
- **WHEN** `ion status` is run
|
||||
- **THEN** all active (running + paused) workflow runs SHALL be displayed
|
||||
|
||||
#### Scenario: List recent runs
|
||||
- **WHEN** `ion runs` is run
|
||||
- **THEN** recent workflow runs SHALL be listed with status and timestamps
|
||||
|
||||
#### Scenario: Resume failed run
|
||||
- **WHEN** `ion resume <run-id>` is run
|
||||
- **THEN** the failed run SHALL be resumed, skipping completed nodes
|
||||
|
||||
#### Scenario: Abandon run
|
||||
- **WHEN** `ion abandon <run-id>` is run
|
||||
- **THEN** the run SHALL be marked as cancelled
|
||||
|
||||
#### Scenario: Cleanup old runs
|
||||
- **WHEN** `ion cleanup` is run (default retention period: 7 days)
|
||||
- **THEN** runs older than the retention period SHALL have their artifacts removed
|
||||
|
||||
### Requirement: SOP-to-YAML conversion
|
||||
The CLI SHALL provide a `convert` command to transpile `.sop.md` files to `.yaml`.
|
||||
|
||||
#### Scenario: Convert SOP to YAML
|
||||
- **WHEN** `ion convert workflow.sop.md` is run
|
||||
- **THEN** a `workflow.yaml` SHALL be written with the equivalent DAG representation
|
||||
|
||||
### Requirement: Machine-readable output
|
||||
Workflow commands SHALL support a `--json` flag for machine-readable output.
|
||||
|
||||
#### Scenario: JSON output for automation
|
||||
- **WHEN** `ion list --json` is run
|
||||
- **THEN** output SHALL be a valid JSON array of workflow objects
|
||||
@@ -0,0 +1,54 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: DAG topological execution
|
||||
The engine SHALL execute workflow nodes in topological order determined by `depends_on` edges using Kahn's algorithm.
|
||||
|
||||
#### Scenario: Independent nodes run concurrently
|
||||
- **WHEN** a workflow has nodes A and B with no `depends_on`
|
||||
- **THEN** both A and B SHALL execute in the same topological layer, concurrently
|
||||
|
||||
#### Scenario: Dependent nodes run sequentially
|
||||
- **WHEN** node B lists `depends_on: [A]`
|
||||
- **THEN** A SHALL complete before B begins
|
||||
|
||||
#### Scenario: Cycle detection
|
||||
- **WHEN** nodes form a cycle (A → B → C → A)
|
||||
- **THEN** the loader SHALL reject the workflow with a cycle detection error
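
The layering and cycle detection described above can be sketched with Kahn's algorithm. `DagNode` and `topologicalLayers` are hypothetical names; the sketch returns layers of node IDs that could run concurrently and leaves actual execution out.

```typescript
interface DagNode {
  id: string;
  depends_on?: string[];
}

// Group nodes into topological layers; throws on unknown dependencies or cycles.
function topologicalLayers(nodes: DagNode[]): string[][] {
  const indegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const node of nodes) {
    indegree.set(node.id, 0);
    dependents.set(node.id, []);
  }
  for (const node of nodes) {
    for (const dep of node.depends_on ?? []) {
      const list = dependents.get(dep);
      if (list === undefined) throw new Error(`unknown dependency "${dep}" in node "${node.id}"`);
      list.push(node.id);
      indegree.set(node.id, (indegree.get(node.id) ?? 0) + 1);
    }
  }
  const layers: string[][] = [];
  let ready = nodes.filter((n) => indegree.get(n.id) === 0).map((n) => n.id);
  let visited = 0;
  while (ready.length > 0) {
    layers.push(ready);
    visited += ready.length;
    const next: string[] = [];
    for (const id of ready) {
      for (const child of dependents.get(id) ?? []) {
        const left = (indegree.get(child) ?? 0) - 1;
        indegree.set(child, left);
        if (left === 0) next.push(child); // all upstream nodes are now placed
      }
    }
    ready = next;
  }
  // Nodes on a cycle never reach indegree 0, so they are never visited.
  if (visited !== nodes.length) throw new Error("cycle detected in workflow DAG");
  return layers;
}
```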
|
||||
|
||||
### Requirement: Trigger rules
|
||||
The engine SHALL support 4 trigger rules for join semantics.
|
||||
|
||||
#### Scenario: all_success (default)
|
||||
- **WHEN** a node has multiple upstream dependencies and no explicit `trigger_rule`
|
||||
- **THEN** it SHALL only run if ALL upstream nodes completed successfully
|
||||
- **THEN** it SHALL be skipped if any upstream node failed
|
||||
|
||||
#### Scenario: one_success
|
||||
- **WHEN** a node sets `trigger_rule: one_success`
|
||||
- **THEN** it SHALL run if at least one upstream node completed successfully
|
||||
|
||||
#### Scenario: all_done
|
||||
- **WHEN** a node sets `trigger_rule: all_done`
|
||||
- **THEN** it SHALL run when all upstream nodes have finished (any status), regardless of success/failure
|
||||
|
||||
#### Scenario: none_failed_min_one_success
|
||||
- **WHEN** a node sets `trigger_rule: none_failed_min_one_success`
|
||||
- **THEN** it SHALL run only if no upstream node failed AND at least one succeeded
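
The four trigger rules reduce to a small decision function over upstream statuses. `shouldRun` and `UpstreamStatus` are hypothetical names, and two points are assumptions: a skipped upstream node does not count as a success, and the function is only called once every upstream node has finished.

```typescript
type TriggerRule = "all_success" | "one_success" | "all_done" | "none_failed_min_one_success";
type UpstreamStatus = "completed" | "failed" | "skipped";

// Decide whether a node runs, given the final status of each upstream node.
function shouldRun(rule: TriggerRule, upstream: UpstreamStatus[]): boolean {
  const succeeded = upstream.filter((s) => s === "completed").length;
  const failed = upstream.filter((s) => s === "failed").length;
  switch (rule) {
    case "all_success":
      return succeeded === upstream.length;
    case "one_success":
      return succeeded >= 1;
    case "all_done":
      return true; // every upstream node has finished by the time this is called
    case "none_failed_min_one_success":
      return failed === 0 && succeeded >= 1;
  }
}
```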
|
||||
|
||||
### Requirement: when conditions
|
||||
Nodes SHALL support a `when:` string that evaluates to a boolean condition.
|
||||
|
||||
#### Scenario: when condition prevents execution
|
||||
- **WHEN** a node has `when: "false"` or any expression that evaluates falsy
|
||||
- **THEN** the node SHALL be skipped as if its trigger_rule prevented execution
|
||||
|
||||
### Requirement: Node retry with configurable policy
|
||||
Nodes SHALL support a `retry` config with `max_attempts`, `delay_ms`, and `on_error` (transient|all).
|
||||
|
||||
#### Scenario: retry on transient error
|
||||
- **WHEN** a node with `retry: { max_attempts: 3 }` fails with a transient error
|
||||
- **THEN** it SHALL retry up to 3 times with configured delay between attempts
|
||||
|
||||
#### Scenario: retry exhausted
|
||||
- **WHEN** all retry attempts fail
|
||||
- **THEN** the node SHALL be marked as failed and `trigger_rule` evaluation for downstream nodes SHALL proceed
|
||||
@@ -0,0 +1,59 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Append-only event log
|
||||
Workflow runs SHALL produce append-only event records. Events SHALL NOT be modified after creation.
|
||||
|
||||
#### Scenario: Events are chronological
|
||||
- **WHEN** a workflow executes
|
||||
- **THEN** events SHALL be stored with monotonically increasing timestamps or sequence numbers
|
||||
- **THEN** event order SHALL match execution order
|
||||
|
||||
#### Scenario: Events are immutable
|
||||
- **WHEN** an event has been persisted
|
||||
- **THEN** it SHALL NOT be updated or deleted
|
||||
|
||||
### Requirement: Event types
|
||||
The event log SHALL support exactly 8 event types: `workflow_started`, `workflow_completed`, `workflow_failed`, `workflow_cancelled`, `node_started`, `node_completed`, `node_failed`, `node_skipped`.
|
||||
|
||||
#### Scenario: Workflow lifecycle events
|
||||
- **WHEN** a workflow run begins
|
||||
- **THEN** a `workflow_started` event SHALL be recorded
|
||||
- **WHEN** a workflow run completes successfully
|
||||
- **THEN** a `workflow_completed` event SHALL be recorded
|
||||
- **WHEN** a workflow run fails
|
||||
- **THEN** a `workflow_failed` event SHALL be recorded
|
||||
|
||||
#### Scenario: Node lifecycle events
|
||||
- **WHEN** a node begins execution
|
||||
- **THEN** a `node_started` event SHALL be recorded
|
||||
- **WHEN** a node completes successfully
|
||||
- **THEN** a `node_completed` event SHALL record the node's output
|
||||
- **WHEN** a node fails
|
||||
- **THEN** a `node_failed` event SHALL record the error
|
||||
- **WHEN** a node is skipped (trigger_rule not met)
|
||||
- **THEN** a `node_skipped` event SHALL be recorded
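
The append-only and event-type requirements above could be sketched as a small in-memory log. This is only an illustration: the `WorkflowEvent` field names and the `AppendOnlyEventLog` class are assumptions, since the spec names the eight event types but not the record layout.

```typescript
// Hypothetical record shape; only the eight event type names come from the spec.
type EventType =
  | "workflow_started" | "workflow_completed" | "workflow_failed" | "workflow_cancelled"
  | "node_started" | "node_completed" | "node_failed" | "node_skipped";

interface WorkflowEvent {
  readonly runId: string;
  readonly sequence: number; // monotonically increasing per log
  readonly type: EventType;
  readonly timestamp: number;
  readonly payload: Readonly<Record<string, unknown>>;
}

class AppendOnlyEventLog {
  private readonly events: WorkflowEvent[] = [];

  // The only write operation: there is deliberately no update or delete.
  append(runId: string, type: EventType, payload: Record<string, unknown> = {}): WorkflowEvent {
    const event: WorkflowEvent = Object.freeze({
      runId,
      sequence: this.events.length + 1,
      type,
      timestamp: Date.now(),
      payload: Object.freeze({ ...payload }),
    });
    this.events.push(event);
    return event;
  }

  // Returns a copy so callers cannot reorder or truncate the stored log.
  list(): readonly WorkflowEvent[] {
    return this.events.slice();
  }
}
```

Using a sequence number rather than only a timestamp keeps ordering stable when two events land in the same millisecond.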
|
||||
|
||||
### Requirement: Deterministic replay for crash recovery
|
||||
When a workflow run is resumed after an interruption, the engine SHALL load completed node outputs from the event log and skip re-execution of completed nodes.
|
||||
|
||||
#### Scenario: Resume skips completed nodes
|
||||
- **WHEN** a workflow run is resumed after a crash
|
||||
- **THEN** all nodes with a `node_completed` event SHALL be skipped
|
||||
- **THEN** execution SHALL begin from the first node without a completed event
|
||||
|
||||
#### Scenario: Resume after partial execution
|
||||
- **WHEN** a workflow had 5 nodes and the first 3 completed before the crash
|
||||
- **THEN** nodes 1-3 SHALL be skipped (outputs loaded from event log)
|
||||
- **THEN** node 4 SHALL be re-executed
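
A minimal sketch of the resume rule, assuming a flat event shape with hypothetical `nodeId` and `output` fields and a precomputed node order; the real engine would work from the DAG rather than a list.

```typescript
// Hypothetical event shape for illustration only.
interface LoggedEvent {
  type: string;
  nodeId?: string;
  output?: string;
}

interface ResumePlan {
  skipped: Map<string, string>; // nodeId -> output loaded from the log
  toRun: string[];              // nodes that still need to execute, in order
}

function planResume(nodeOrder: string[], events: LoggedEvent[]): ResumePlan {
  const skipped = new Map<string, string>();
  for (const event of events) {
    // Only a node_completed event counts; node_started alone means the node must run again.
    if (event.type === "node_completed" && event.nodeId !== undefined) {
      skipped.set(event.nodeId, event.output ?? "");
    }
  }
  const toRun = nodeOrder.filter((id) => !skipped.has(id));
  return { skipped, toRun };
}
```

With five nodes and `node_completed` events for the first three, `toRun` contains only the fourth and fifth.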
|
||||
|
||||
### Requirement: Event storage via pluggable backend
|
||||
Events SHALL be persisted through the `IWorkflowStore` interface, with at least a filesystem backend.
|
||||
|
||||
#### Scenario: Filesystem event store
|
||||
- **WHEN** using the filesystem backend
|
||||
- **THEN** each run SHALL have a JSONL file at `{runId}/events.jsonl`
|
||||
- **THEN** events SHALL be appended as newline-delimited JSON
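
The newline-delimited format can be illustrated with two pure helpers that operate on file contents as strings. The helper names are assumptions, and a real backend would append to the file on disk instead of rebuilding a string.

```typescript
// Appends one event as a single JSON line. JSON.stringify never emits raw
// newlines, so each event occupies exactly one line.
function appendJsonl(existing: string, event: Record<string, unknown>): string {
  return existing + JSON.stringify(event) + "\n";
}

// Parses the log back into events, ignoring blank lines.
function parseJsonl(content: string): Record<string, unknown>[] {
  return content
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Record<string, unknown>);
}
```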
|
||||
|
||||
#### Scenario: SQLite event store
|
||||
- **WHEN** using the SQLite backend
|
||||
- **THEN** events SHALL be stored in a `workflow_events` table with columns for run_id, sequence, event_type, timestamp, and payload
|
||||
@@ -0,0 +1,37 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Approval gate pauses execution
|
||||
An ApprovalNode SHALL pause workflow execution and send a message for human review. Execution SHALL only continue when the user approves or rejects.
|
||||
|
||||
#### Scenario: Approval pauses workflow
|
||||
- **WHEN** an approval node executes
|
||||
- **THEN** the workflow status SHALL transition to `paused`
|
||||
- **THEN** a message SHALL be sent with the approval message text
|
||||
|
||||
#### Scenario: Approve resumes execution
|
||||
- **WHEN** the user approves a paused workflow
|
||||
- **THEN** the workflow SHALL resume with the next node in the DAG
|
||||
|
||||
#### Scenario: Reject fails the node
|
||||
- **WHEN** the user rejects a paused workflow
|
||||
- **THEN** the node SHALL be marked as failed
|
||||
- **THEN** downstream nodes SHALL evaluate their trigger rules
|
||||
|
||||
### Requirement: Capture response from approval
|
||||
An approval node MAY support `capture_response: true` to store the user's comment as `$nodeId.output`.
|
||||
|
||||
#### Scenario: Approval with captured response
|
||||
- **WHEN** an approval node has `capture_response: true` and the user provides a comment during approval
|
||||
- **THEN** the comment SHALL be stored as the node's output, available via `$nodeId.output`
|
||||
|
||||
### Requirement: On-reject retry
|
||||
An approval node MAY specify `on_reject` with a `prompt` and optional `max_attempts` for re-presenting after rejection.
|
||||
|
||||
#### Scenario: Reject with retry prompt
|
||||
- **WHEN** an approval node has `on_reject: { prompt: "..." }` and the user rejects
|
||||
- **THEN** the on_reject prompt SHALL be executed (typically the AI revises based on feedback)
|
||||
- **THEN** the approval gate SHALL be re-presented to the user
|
||||
|
||||
#### Scenario: Max attempts exceeded
|
||||
- **WHEN** the number of rejections exceeds `on_reject.max_attempts`
|
||||
- **THEN** the node SHALL fail permanently
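
One way to read the reject and retry scenarios is as a small decision function. The sketch below is an interpretation, not the engine's code: the function name is hypothetical, and treating a missing `max_attempts` as unlimited is an assumption the spec does not state.

```typescript
type ApprovalOutcome = "approved" | "retry" | "failed";

interface OnReject {
  prompt: string;
  max_attempts?: number;
}

function resolveApproval(
  decision: "approve" | "reject",
  priorRejections: number,
  onReject?: OnReject,
): ApprovalOutcome {
  if (decision === "approve") return "approved";
  // Without on_reject, a rejection fails the node immediately.
  if (onReject === undefined) return "failed";
  const rejections = priorRejections + 1;
  // Assumption: no max_attempts means the gate can be re-presented indefinitely.
  const limit = onReject.max_attempts ?? Infinity;
  // "Exceeds" is read strictly: the node fails once rejections > max_attempts.
  return rejections > limit ? "failed" : "retry";
}
```

A `retry` outcome corresponds to running the `on_reject` prompt and re-presenting the gate.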
|
||||
@@ -0,0 +1,44 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Programmatic execution
|
||||
The engine SHALL export an `executeWorkflow()` function that accepts a workflow definition, store, and options.
|
||||
|
||||
#### Scenario: Execute workflow from code
|
||||
- **WHEN** a host application calls `executeWorkflow(workflowDef, store, { userMessage: "..." })`
|
||||
- **THEN** the workflow SHALL execute and return a `WorkflowExecutionResult`
|
||||
|
||||
### Requirement: Workflow parsing
|
||||
The engine SHALL export `parseWorkflow(yaml: string): WorkflowDefinition` and `parseWorkflowFile(path: string): WorkflowDefinition` functions.
|
||||
|
||||
#### Scenario: Parse YAML string
|
||||
- **WHEN** a host application calls `parseWorkflow(yamlString)`
|
||||
- **THEN** it SHALL return a validated `WorkflowDefinition`
|
||||
|
||||
#### Scenario: Parse YAML file
|
||||
- **WHEN** a host application calls `parseWorkflowFile("./workflows/my-workflow.yaml")`
|
||||
- **THEN** it SHALL read and parse the file, returning a validated `WorkflowDefinition`
|
||||
|
||||
### Requirement: Workflow discovery
|
||||
The engine SHALL export `discoverWorkflows(cwd: string): WorkflowLoadResult` for finding workflows in the filesystem.
|
||||
|
||||
#### Scenario: Discover workflows
|
||||
- **WHEN** a host application calls `discoverWorkflows(cwd)`
|
||||
- **THEN** it SHALL return all discovered workflows from the project's `.archon/workflows/` directory
|
||||
|
||||
### Requirement: Store constructors
|
||||
The engine SHALL export store constructors for each backend: `createFsStore(path)`, `createSqliteStore(path)`, `createPostgresStore(connectionString)`.
|
||||
|
||||
#### Scenario: Create filesystem store
|
||||
- **WHEN** a host application calls `createFsStore("./data")`
|
||||
- **THEN** it SHALL return an initialized `IWorkflowStore` using the filesystem backend
|
||||
|
||||
#### Scenario: Create SQLite store
|
||||
- **WHEN** a host application calls `createSqliteStore("./ion.db")`
|
||||
- **THEN** it SHALL return an initialized `IWorkflowStore` using SQLite
|
||||
|
||||
### Requirement: TypeScript types
|
||||
All public APIs SHALL export full TypeScript type definitions.
|
||||
|
||||
#### Scenario: Types available
|
||||
- **WHEN** a host application imports from the package
|
||||
- **THEN** `WorkflowDefinition`, `DagNode`, `NodeOutput`, `WorkflowRun`, `WorkflowExecutionResult`, `IWorkflowStore` types SHALL all be exported
|
||||
@@ -0,0 +1,34 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: SOP markdown as secondary format
|
||||
Workflows MAY be defined as `.sop.md` files in addition to YAML. The engine SHALL detect `.sop.md` files during discovery and transpile them to the DAG node representation.
|
||||
|
||||
#### Scenario: SOP file discovered alongside YAML
|
||||
- **WHEN** a `.sop.md` file exists in the workflows directory alongside `.yaml` workflow files
|
||||
- **THEN** both SHALL be discovered and listed as available workflows
|
||||
|
||||
#### Scenario: SOP transpiled to prompt nodes
|
||||
- **WHEN** a `.sop.md` file is loaded
|
||||
- **THEN** each `## Steps` section item SHALL become a `prompt:` node
|
||||
- **THEN** `## Parameters` SHALL be extracted as node metadata
|
||||
|
||||
### Requirement: RFC 2119 constraint extraction
|
||||
The transpiler SHALL extract RFC 2119 constraints from `**Constraints:**` blocks and embed them in the prompt text of the corresponding node.
|
||||
|
||||
#### Scenario: Constraints included in prompt
|
||||
- **WHEN** a step has `**Constraints:** - You MUST do X`
|
||||
- **THEN** the constraint text SHALL be appended to the node's prompt
|
||||
|
||||
### Requirement: Overview as workflow description
|
||||
The `## Overview` section of a `.sop.md` file SHALL become the workflow's `description` field.
|
||||
|
||||
#### Scenario: Overview maps to description
|
||||
- **WHEN** a `.sop.md` has `## Overview\nThis SOP does X`
|
||||
- **THEN** the resulting workflow SHALL have `description: "This SOP does X"`
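
The transpilation rules so far might look roughly like the following. It assumes that step items are numbered list lines (`1. ...`) and that a `**Constraints:**` block simply follows its step; neither detail is fixed by the spec, and `transpileSop` and the generated `step-N` ids are hypothetical.

```typescript
interface PromptNode {
  id: string;
  prompt: string;
}

interface SopWorkflow {
  description: string;
  nodes: PromptNode[];
}

function transpileSop(markdown: string): SopWorkflow {
  // Group lines under their level-2 headings ("## Overview", "## Steps", ...).
  const sections = new Map<string, string[]>();
  let current = "";
  for (const line of markdown.split("\n")) {
    const heading = /^##\s+(.+?)\s*$/.exec(line);
    if (heading) {
      current = heading[1];
      sections.set(current, []);
    } else if (current !== "") {
      sections.get(current)!.push(line);
    }
  }

  // Each numbered item starts a prompt node; following non-empty lines
  // (including constraint bullets) are appended to that node's prompt.
  const nodes: PromptNode[] = [];
  for (const line of sections.get("Steps") ?? []) {
    const item = /^\d+\.\s+(.*)$/.exec(line);
    if (item) {
      nodes.push({ id: `step-${nodes.length + 1}`, prompt: item[1] });
    } else if (line.trim() !== "" && nodes.length > 0) {
      nodes[nodes.length - 1].prompt += "\n" + line.trim();
    }
  }

  const description = (sections.get("Overview") ?? []).join("\n").trim();
  return { description, nodes };
}
```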
|
||||
|
||||
### Requirement: Parameter acquisition constraints
|
||||
The transpiler SHALL validate that all required parameters from `## Parameters` are present before execution, using the constraint pattern from the SOP.
|
||||
|
||||
#### Scenario: Missing required parameter
|
||||
- **WHEN** a required parameter has no value provided
|
||||
- **THEN** the workflow SHALL prompt the user for the missing parameter before executing
|
||||
@@ -0,0 +1,44 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Inline script execution
|
||||
Script nodes SHALL execute inline TypeScript (runtime: `bun`) or Python (runtime: `uv`) code and capture stdout as the node output.
|
||||
|
||||
#### Scenario: Bun inline execution
|
||||
- **WHEN** a script node has `runtime: bun` and `script: console.log("hello")`
|
||||
- **THEN** the executor SHALL run the script via `bun -e`
|
||||
- **THEN** stdout SHALL be captured as `$nodeId.output`
|
||||
|
||||
#### Scenario: Python inline execution
|
||||
- **WHEN** a script node has `runtime: uv` and `script: print("hello")`
|
||||
- **THEN** the executor SHALL run the script via `uv run python -c`
|
||||
- **THEN** stdout SHALL be captured as `$nodeId.output`
|
||||
|
||||
### Requirement: Dependency installation
|
||||
Script nodes SHALL support a `deps:` array that installs dependencies before execution.
|
||||
|
||||
#### Scenario: Bun with npm deps
|
||||
- **WHEN** a script node has `runtime: bun` and `deps: ["lodash", "zod"]`
|
||||
- **THEN** the executor SHALL run `bun install lodash zod` before executing
|
||||
|
||||
#### Scenario: Python with pip deps
|
||||
- **WHEN** a script node has `runtime: uv` and `deps: ["requests", "click"]`
|
||||
- **THEN** the executor SHALL run `uv pip install requests click` before executing
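
The inline-execution and dependency scenarios reduce to building a short list of commands. The sketch below only assembles argument vectors using the commands quoted in the scenarios; the `ScriptNode` shape and `buildCommands` name are assumptions, and spawning the processes is left out.

```typescript
type Runtime = "bun" | "uv";

interface ScriptNode {
  runtime: Runtime;
  script: string; // inline source
  deps?: string[];
}

// Returns the commands to run in order, each as an argv array.
function buildCommands(node: ScriptNode): string[][] {
  const commands: string[][] = [];
  const deps = node.deps ?? [];
  if (deps.length > 0) {
    // Dependency installation runs before the script itself.
    commands.push(
      node.runtime === "bun"
        ? ["bun", "install", ...deps]
        : ["uv", "pip", "install", ...deps],
    );
  }
  commands.push(
    node.runtime === "bun"
      ? ["bun", "-e", node.script]
      : ["uv", "run", "python", "-c", node.script],
  );
  return commands;
}
```

Passing the script as a single argv element avoids shell quoting problems when the inline code contains quotes.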
|
||||
|
||||
### Requirement: Named script files
|
||||
Script nodes MAY reference named scripts from a `.archon/scripts/` directory by name instead of inline code.
|
||||
|
||||
#### Scenario: Named script discovery
|
||||
- **WHEN** a script node has `script: analyze` and `scripts/analyze.ts` exists
|
||||
- **THEN** the executor SHALL load and execute the file
|
||||
|
||||
#### Scenario: Runtime inferred from extension
|
||||
- **WHEN** a script has `runtime: bun` and the named file has a `.ts` extension
|
||||
- **THEN** the executor SHALL run it via `bun run`
|
||||
|
||||
### Requirement: Script timeout
|
||||
Script nodes SHALL support a `timeout:` field in milliseconds. If execution exceeds the timeout, the process SHALL be killed and the node SHALL fail.
|
||||
|
||||
#### Scenario: Timeout exceeded
|
||||
- **WHEN** a script node sets `timeout: 5000` and the script runs for 10 seconds
|
||||
- **THEN** the process SHALL be killed after 5 seconds
|
||||
- **THEN** the node SHALL be marked as failed with a timeout error
|
||||
@@ -0,0 +1,48 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: IWorkflowStore interface
|
||||
All storage backends SHALL implement the `IWorkflowStore` interface providing run lifecycle, event persistence, and node output retrieval.
|
||||
|
||||
#### Scenario: Store provides run CRUD
|
||||
- **WHEN** a workflow run is created
|
||||
- **THEN** `createWorkflowRun()` SHALL persist the run and return it
|
||||
- **WHEN** a workflow run status is updated
|
||||
- **THEN** `updateWorkflowRun()` SHALL persist the status change
|
||||
|
||||
#### Scenario: Store provides event persistence
|
||||
- **WHEN** a workflow event is created
|
||||
- **THEN** `createWorkflowEvent()` SHALL append it to the event log
|
||||
|
||||
#### Scenario: Store provides completed node outputs
|
||||
- **WHEN** a workflow is resumed
|
||||
- **THEN** `getCompletedDagNodeOutputs()` SHALL return all completed node outputs keyed by node ID
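
To make the interface concrete, here is a hedged in-memory sketch. Only the method names come from the spec; the parameter lists, the synchronous signatures, and the `Run` and `StoredEvent` shapes are assumptions (a real store would most likely be asynchronous).

```typescript
type RunStatus = "pending" | "running" | "paused" | "completed" | "failed" | "cancelled";

interface Run {
  id: string;
  workflowName: string;
  status: RunStatus;
}

interface StoredEvent {
  runId: string;
  eventType: string;
  nodeId?: string;
  output?: string;
}

// Hypothetical, simplified subset of IWorkflowStore.
interface WorkflowStoreSketch {
  createWorkflowRun(run: Run): Run;
  getWorkflowRun(id: string): Run | undefined;
  updateWorkflowRun(id: string, status: RunStatus): void;
  createWorkflowEvent(event: StoredEvent): void;
  getCompletedDagNodeOutputs(runId: string): Map<string, string>;
}

class InMemoryStore implements WorkflowStoreSketch {
  private runs = new Map<string, Run>();
  private events: StoredEvent[] = [];

  createWorkflowRun(run: Run): Run {
    this.runs.set(run.id, { ...run });
    return { ...run };
  }

  getWorkflowRun(id: string): Run | undefined {
    const run = this.runs.get(id);
    return run ? { ...run } : undefined;
  }

  updateWorkflowRun(id: string, status: RunStatus): void {
    const run = this.runs.get(id);
    if (!run) throw new Error(`unknown run: ${id}`);
    run.status = status;
  }

  createWorkflowEvent(event: StoredEvent): void {
    this.events.push({ ...event }); // append only
  }

  getCompletedDagNodeOutputs(runId: string): Map<string, string> {
    const outputs = new Map<string, string>();
    for (const e of this.events) {
      if (e.runId === runId && e.eventType === "node_completed" && e.nodeId !== undefined) {
        outputs.set(e.nodeId, e.output ?? "");
      }
    }
    return outputs;
  }
}
```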
|
||||
|
||||
### Requirement: Filesystem backend
|
||||
The filesystem backend SHALL store each workflow run as files in a directory: `{artifactsDir}/{runId}/`.
|
||||
|
||||
#### Scenario: Filesystem stores events as JSONL
|
||||
- **WHEN** events are created using the filesystem backend
|
||||
- **THEN** each run SHALL have `events.jsonl` with newline-delimited JSON
|
||||
- **THEN** node outputs SHALL be stored as individual JSON files
|
||||
|
||||
#### Scenario: Filesystem stores run metadata
|
||||
- **WHEN** a run is created using the filesystem backend
|
||||
- **THEN** `run.json` SHALL contain the run metadata
|
||||
|
||||
### Requirement: SQLite backend
|
||||
The SQLite backend SHALL store workflow data in a SQLite database with tables for runs, events, and node outputs.
|
||||
|
||||
#### Scenario: SQLite stores runs table
|
||||
- **WHEN** using the SQLite backend
|
||||
- **THEN** a `workflow_runs` table SHALL exist with columns for id, workflow_name, status, user_message, created_at, updated_at
|
||||
|
||||
#### Scenario: SQLite stores events table
|
||||
- **WHEN** using the SQLite backend
|
||||
- **THEN** a `workflow_events` table SHALL exist with columns for run_id, sequence, event_type, timestamp, payload
|
||||
|
||||
### Requirement: Postgres backend
|
||||
The Postgres backend SHALL use a PostgreSQL database with the same schema as SQLite, accessed via the `IWorkflowStore` interface.
|
||||
|
||||
#### Scenario: Postgres uses same interface
|
||||
- **WHEN** switching from SQLite to Postgres
|
||||
- **THEN** no workflow engine code SHALL change — only the store implementation
|
||||
@@ -0,0 +1,57 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Node output references
|
||||
Prompts and commands SHALL support `$nodeId.output` to reference the output text of an upstream node, and `$nodeId.output.field` to reference a specific field from a structured output.
|
||||
|
||||
#### Scenario: Output reference substitution
|
||||
- **WHEN** a prompt contains `$analysis.output`
|
||||
- **THEN** it SHALL be replaced with the full output text of the node with id `analysis`
|
||||
|
||||
#### Scenario: Field reference with structured output
|
||||
- **WHEN** a prompt contains `$analysis.output.summary` and the upstream node declared `output_format: { type: "object", properties: { summary: ... } }`
|
||||
- **THEN** it SHALL be replaced with the value of the `summary` field from the parsed JSON output
|
||||
|
||||
#### Scenario: Missing node reference
|
||||
- **WHEN** a prompt references `$nonexistent.output`
|
||||
- **THEN** the reference SHALL resolve to an empty string with a warning
|
||||
|
||||
#### Scenario: Missing field on schemaless node
|
||||
- **WHEN** a prompt references `$node.output.field` and the upstream node has no `output_format` and its output is not valid JSON
|
||||
- **THEN** the consuming node SHALL fail with an error
|
||||
|
||||
#### Scenario: Strict field access for declared schemas
|
||||
- **WHEN** a prompt references `$node.output.field` and the upstream node's `output_format` declares properties but `field` is not among them
|
||||
- **THEN** the consuming node SHALL fail with a field-not-found error
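
The five reference scenarios can be exercised with a single substitution function. This is a sketch under assumptions: the `UpstreamNode` shape (with `declaredFields` standing in for the properties of `output_format`), the function name, and the exact regular expression are all illustrative.

```typescript
interface UpstreamNode {
  output: string;
  declaredFields?: string[]; // property names from output_format, if declared
}

function substituteOutputRefs(
  prompt: string,
  nodes: Map<string, UpstreamNode>,
  warnings: string[] = [],
): string {
  const pattern = /\$([A-Za-z_][\w-]*)\.output(?:\.([A-Za-z_]\w*))?/g;
  return prompt.replace(pattern, (_match: string, nodeId: string, field: string | undefined) => {
    const node = nodes.get(nodeId);
    if (!node) {
      // Missing node: empty string plus a warning, not an error.
      warnings.push(`unknown node reference: ${nodeId}`);
      return "";
    }
    if (field === undefined) return node.output;
    // Strict access: a declared schema must list the field.
    if (node.declaredFields && node.declaredFields.indexOf(field) === -1) {
      throw new Error(`field not found: ${nodeId}.output.${field}`);
    }
    let parsed: unknown;
    try {
      parsed = JSON.parse(node.output);
    } catch {
      throw new Error(`output of ${nodeId} is not valid JSON`);
    }
    if (typeof parsed !== "object" || parsed === null || !(field in parsed)) {
      throw new Error(`field not found: ${nodeId}.output.${field}`);
    }
    const value = (parsed as Record<string, unknown>)[field];
    return typeof value === "string" ? value : JSON.stringify(value);
  });
}
```

Built-in variables such as `$ARGUMENTS` are untouched because the pattern requires the `.output` suffix.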
|
||||
|
||||
### Requirement: Built-in variables
|
||||
The engine SHALL support `$ARGUMENTS`, `$ARTIFACTS_DIR`, `$WORKFLOW_ID`, `$BASE_BRANCH`, `$DOCS_DIR`.
|
||||
|
||||
#### Scenario: $ARGUMENTS substitution
|
||||
- **WHEN** a prompt contains `$ARGUMENTS`
|
||||
- **THEN** it SHALL be replaced with the full user message/arguments string
|
||||
|
||||
#### Scenario: $ARTIFACTS_DIR substitution
|
||||
- **WHEN** a prompt contains `$ARTIFACTS_DIR`
|
||||
- **THEN** it SHALL be replaced with the path to the run's artifact directory
|
||||
|
||||
#### Scenario: $WORKFLOW_ID substitution
|
||||
- **WHEN** a prompt contains `$WORKFLOW_ID`
|
||||
- **THEN** it SHALL be replaced with the workflow run ID
|
||||
|
||||
### Requirement: Loop-specific variables
|
||||
Loop nodes SHALL support `$LOOP_USER_INPUT` (the input the user provides when approving at an interactive gate) and `$LOOP_PREV_OUTPUT` (the output of the previous iteration).
|
||||
|
||||
#### Scenario: $LOOP_PREV_OUTPUT on first iteration
|
||||
- **WHEN** a loop node is on its first iteration
|
||||
- **THEN** `$LOOP_PREV_OUTPUT` SHALL resolve to an empty string
|
||||
|
||||
#### Scenario: $LOOP_PREV_OUTPUT on subsequent iterations
|
||||
- **WHEN** a loop node is on iteration 2+
|
||||
- **THEN** `$LOOP_PREV_OUTPUT` SHALL contain the cleaned output of the previous iteration
|
||||
|
||||
### Requirement: Approval-specific variables
|
||||
Approval nodes SHALL support `$REJECTION_REASON`.
|
||||
|
||||
#### Scenario: $REJECTION_REASON in on_reject prompt
|
||||
- **WHEN** an approval node is rejected with a reason
|
||||
- **THEN** `$REJECTION_REASON` SHALL contain the reviewer's feedback text
|
||||
@@ -0,0 +1,51 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Run states
|
||||
A workflow run SHALL transition through states: `pending → running → completed | failed | cancelled`. It MAY transition to `paused` for approval gates.
|
||||
|
||||
#### Scenario: Normal completion
|
||||
- **WHEN** all DAG nodes complete successfully
|
||||
- **THEN** the run status SHALL be `completed`
|
||||
|
||||
#### Scenario: Node failure
|
||||
- **WHEN** a node fails and no retry succeeds
|
||||
- **THEN** the run status SHALL be `failed`
|
||||
|
||||
#### Scenario: User cancellation
|
||||
- **WHEN** a user cancels a running workflow
|
||||
- **THEN** the run status SHALL be `cancelled`
|
||||
|
||||
#### Scenario: Approval pause
|
||||
- **WHEN** an approval node is reached
|
||||
- **THEN** the run status SHALL transition to `paused`
|
||||
- **THEN** it SHALL transition back to `running` on approval
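
The run states can be written down as a transition table. The table below is one reading of the requirement: `failed -> running` is included on the assumption that resuming a failed run re-enters `running`, and cancelling while `paused` is not modelled.

```typescript
type RunStatus = "pending" | "running" | "paused" | "completed" | "failed" | "cancelled";

const ALLOWED: Record<RunStatus, RunStatus[]> = {
  pending: ["running"],
  running: ["completed", "failed", "cancelled", "paused"],
  paused: ["running"],  // approval resumes execution
  completed: [],
  failed: ["running"],  // assumption: resume after failure
  cancelled: [],
};

// Returns the new status, or throws if the transition is not in the table.
function transition(from: RunStatus, to: RunStatus): RunStatus {
  if (ALLOWED[from].indexOf(to) === -1) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```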
|
||||
|
||||
### Requirement: Resume from failure
|
||||
A failed workflow SHALL support resumption, skipping already-completed nodes using stored outputs from the event log.
|
||||
|
||||
#### Scenario: Resume skips completed nodes
|
||||
- **WHEN** a failed workflow has 2 completed nodes out of 5
|
||||
- **THEN** resuming SHALL skip nodes 1-2 and re-execute from node 3
|
||||
|
||||
#### Scenario: Resume with always_run
|
||||
- **WHEN** a node has `always_run: true` and the workflow is resumed
|
||||
- **THEN** the node SHALL re-execute even if it completed previously
|
||||
|
||||
### Requirement: Event-based observability
|
||||
All lifecycle transitions SHALL emit typed events through the event emitter for observability and external subscribers.
|
||||
|
||||
#### Scenario: Events for every state transition
|
||||
- **WHEN** a workflow starts
|
||||
- **THEN** a `workflow_started` event SHALL be emitted
|
||||
- **WHEN** a workflow completes
|
||||
- **THEN** a `workflow_completed` event SHALL be emitted
|
||||
- **WHEN** a node starts/completes/fails/skips
|
||||
- **THEN** corresponding node events SHALL be emitted
|
||||
|
||||
### Requirement: Cleanup
|
||||
The engine SHALL support cleaning up old workflow runs and their artifacts.
|
||||
|
||||
#### Scenario: Cleanup by age
|
||||
- **WHEN** cleanup is invoked with a retention period (default 7 days)
|
||||
- **THEN** runs older than the retention period SHALL have their artifacts removed
|
||||
- **THEN** run records MAY be pruned from the store
|
||||
@@ -0,0 +1,77 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Workflow definition structure
|
||||
A workflow YAML SHALL have a top-level `name`, `description`, and `nodes:` array. It MAY have `provider`, `model`, `interactive`, `mutates_checkout`, `tags`.
|
||||
|
||||
#### Scenario: Minimal valid workflow
|
||||
- **WHEN** a YAML file contains a `name`, `description`, and at least one node
|
||||
- **THEN** the loader SHALL parse it as a valid workflow definition
|
||||
|
||||
#### Scenario: Missing name
|
||||
- **WHEN** a YAML file lacks a `name` field
|
||||
- **THEN** the loader SHALL reject it with a validation error
|
||||
|
||||
### Requirement: Seven node types
|
||||
The engine SHALL support exactly 7 node types, mutually exclusive per node: `command`, `prompt`, `bash`, `script`, `loop`, `approval`, `cancel`.
|
||||
|
||||
#### Scenario: Node with exactly one mode field
|
||||
- **WHEN** a node has `prompt:` but no other mode field
|
||||
- **THEN** it SHALL be classified as a PromptNode
|
||||
|
||||
#### Scenario: Node with multiple mode fields
|
||||
- **WHEN** a node has both `prompt:` and `bash:`
|
||||
- **THEN** the loader SHALL reject it with a mutual-exclusivity error
|
||||
|
||||
#### Scenario: Node with no mode field
|
||||
- **WHEN** a node has none of the 7 mode fields
|
||||
- **THEN** the loader SHALL reject it
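
The mutual-exclusivity rule can be checked without any schema library. The sketch below uses a plain function with a hypothetical name; the task list suggests the actual loader would express this as a Zod refinement instead.

```typescript
const MODE_FIELDS = ["command", "prompt", "bash", "script", "loop", "approval", "cancel"] as const;
type NodeType = (typeof MODE_FIELDS)[number];

// Classifies a raw YAML node by the single mode field it carries.
function classifyNode(raw: Record<string, unknown>): NodeType {
  const present = MODE_FIELDS.filter((field) => raw[field] !== undefined);
  if (present.length === 0) {
    throw new Error("node has none of the 7 mode fields");
  }
  if (present.length > 1) {
    throw new Error(`mode fields are mutually exclusive: ${present.join(", ")}`);
  }
  return present[0];
}
```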
|
||||
|
||||
### Requirement: Common node fields
|
||||
All node types SHALL support `id`, `depends_on`, `when`, `trigger_rule`, `retry`, `timeout`, `output_type`, `always_run`.
|
||||
|
||||
#### Scenario: Node id must be unique
|
||||
- **WHEN** two nodes in the same workflow share the same `id`
|
||||
- **THEN** the loader SHALL reject the workflow
|
||||
|
||||
### Requirement: Prompt node
|
||||
A PromptNode SHALL have a `prompt:` string field containing the AI prompt text.
|
||||
|
||||
#### Scenario: Empty prompt rejected
|
||||
- **WHEN** a node has `prompt: ""`
|
||||
- **THEN** the loader SHALL reject it
|
||||
|
||||
### Requirement: Bash node
|
||||
A BashNode SHALL have a `bash:` string field and MAY have `timeout` (ms). AI-specific fields SHALL be ignored with a warning.
|
||||
|
||||
#### Scenario: Bash node with timeout
|
||||
- **WHEN** a bash node includes `timeout: 30000`
|
||||
- **THEN** the executor SHALL kill the subprocess after 30 seconds
|
||||
|
||||
### Requirement: Script node
|
||||
A ScriptNode SHALL have `script:` (inline or named) and `runtime:` (`bun` or `uv`), and MAY have `deps:` and `timeout:`.
|
||||
|
||||
#### Scenario: Script with deps
|
||||
- **WHEN** a script node has `runtime: bun` and `deps: ["lodash"]`
|
||||
- **THEN** the executor SHALL install dependencies before running the script
|
||||
|
||||
#### Scenario: Named script from disk
|
||||
- **WHEN** `script: analyze` and a file `scripts/analyze.ts` exists
|
||||
- **THEN** the executor SHALL load and run it
|
||||
|
||||
### Requirement: Loop node
|
||||
A LoopNode SHALL have `loop:` with `prompt`, `until`, `max_iterations`, and optional `fresh_context`, `interactive`, `gate_message`, `until_bash`.
|
||||
|
||||
#### Scenario: Loop with completion signal
|
||||
- **WHEN** the AI response contains the `until` string
|
||||
- **THEN** the loop SHALL stop and the node SHALL complete
|
||||
|
||||
#### Scenario: Loop exceeds max_iterations
|
||||
- **WHEN** the loop reaches `max_iterations` without the completion signal
|
||||
- **THEN** the node SHALL fail
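
The two loop scenarios can be sketched with a synchronous stand-in for the AI call. The `respond` callback, the result shape, and the function name are assumptions; `fresh_context`, `interactive`, and `until_bash` are ignored here.

```typescript
interface LoopConfig {
  prompt: string;
  until: string;          // completion signal to look for in the response
  max_iterations: number;
}

type LoopResult =
  | { status: "completed"; iterations: number; output: string }
  | { status: "failed"; iterations: number };

function runLoop(
  config: LoopConfig,
  respond: (prompt: string, prevOutput: string) => string, // stand-in for the AI call
): LoopResult {
  let prev = ""; // empty on the first iteration, like $LOOP_PREV_OUTPUT
  for (let i = 1; i <= config.max_iterations; i++) {
    const output = respond(config.prompt, prev);
    if (output.indexOf(config.until) !== -1) {
      return { status: "completed", iterations: i, output };
    }
    prev = output;
  }
  // max_iterations reached without the completion signal.
  return { status: "failed", iterations: config.max_iterations };
}
```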
|
||||
|
||||
### Requirement: Cancel node
|
||||
A CancelNode SHALL have a `cancel:` string containing a reason. It SHALL terminate the workflow run.
|
||||
|
||||
#### Scenario: Cancel terminates workflow
|
||||
- **WHEN** a cancel node executes
|
||||
- **THEN** the workflow SHALL be marked as cancelled with the cancel reason
|
||||
@@ -0,0 +1,102 @@
|
||||
## 1. Project Scaffold
|
||||
|
||||
- [ ] 1.1 Initialize package with `package.json`, `tsconfig.json`, module structure (`src/`, `src/cli/`, `src/engine/`, `src/store/`, `src/format/`)
|
||||
- [ ] 1.2 Add core dependencies: `zod`, `js-yaml`, `nanoid`, `ulid`
|
||||
- [ ] 1.3 Configure build (tsc or bun build), lint, format, and test scripts
|
||||
- [ ] 1.4 Create public exports index (`src/index.ts`) with all type and function exports
|
||||
|
||||
## 2. Schema Layer — Workflow and Node Types
|
||||
|
||||
- [ ] 2.1 Implement `dag-node.ts`: Zod schema for all 7 node types with mutual-exclusivity superRefine, type guards, and AI-field warnings
|
||||
- [ ] 2.2 Implement `workflow.ts`: WorkflowDefinition schema extending WorkflowBase with nodes array, WorkflowExecutionResult, WorkflowSource types
|
||||
- [ ] 2.3 Implement `loop.ts`: LoopNodeConfig (prompt, until, max_iterations, fresh_context, interactive, gate_message, until_bash)
|
||||
- [ ] 2.4 Implement `retry.ts`: Retry config (max_attempts, delay_ms, on_error)
|
||||
- [ ] 2.5 Implement `workflow-run.ts`: WorkflowRun, WorkflowRunStatus, NodeState, NodeOutput, ApprovalContext schemas
|
||||
|
||||
## 3. YAML Format — Loader and Validation
|
||||
|
||||
- [ ] 3.1 Implement `loader.ts`: YAML parsing via js-yaml, per-node dagNodeSchema validation, DAG structure validation (unique IDs, depends_on refs, cycle detection via Kahn's)
|
||||
- [ ] 3.2 Implement `command-validation.ts`: Command name format validation
|
||||
- [ ] 3.3 Implement `model-validation.ts`: Provider/model resolution (optional — skip before AI provider integration)
|
||||
- [ ] 3.4 Add workflow-level validation: required fields, provider identity, node ref integrity
|
||||
|
||||
## 4. DAG Engine — Core Execution
|
||||
|
||||
- [ ] 4.1 Implement `deps.ts`: WorkflowDeps injection interface, IWorkflowPlatform, WorkflowConfig types
|
||||
- [ ] 4.2 Implement `dag-executor.ts`: Kahn's algorithm topological layering, `buildTopologicalLayers()`, `checkTriggerRule()` (4 trigger rules), Promise.allSettled concurrent layer execution
|
||||
- [ ] 4.3 Implement node dispatch: execution handlers for PromptNode (AI), CommandNode (command loading), BashNode (subprocess), CancelNode (termination)
|
||||
- [ ] 4.4 Implement `executor-shared.ts`: `substituteWorkflowVariables()`, `loadCommandPrompt()`, `classifyError()`, `safeSendMessage()`
|
||||
- [ ] 4.5 Implement `output-ref.ts`: `$nodeId.output` and `$nodeId.output.field` resolution with strict field access
|
||||
- [ ] 4.6 Implement `condition-evaluator.ts`: `when:` expression parser (==, !=, <, >, <=, >=, AND/OR, comparators with $nodeId.output)
|
||||
- [ ] 4.7 Implement `event-emitter.ts`: Typed events (workflow_started/completed/failed, node_started/completed/failed/skipped)
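
As a reference point for task 4.2, Kahn's algorithm can produce execution layers directly: every node in a layer depends only on nodes in earlier layers, so a layer can run concurrently. The body below is an illustrative sketch, not the planned implementation; it reuses the `buildTopologicalLayers` name from the task, assumes a minimal `DagNode` shape, and assumes unique-id validation has already run.

```typescript
interface DagNode {
  id: string;
  depends_on?: string[];
}

function buildTopologicalLayers(nodes: DagNode[]): string[][] {
  const indegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const node of nodes) {
    indegree.set(node.id, 0);
    dependents.set(node.id, []);
  }
  for (const node of nodes) {
    for (const dep of node.depends_on ?? []) {
      const list = dependents.get(dep);
      if (!list) throw new Error(`unknown dependency: ${dep}`);
      list.push(node.id);
      indegree.set(node.id, (indegree.get(node.id) ?? 0) + 1);
    }
  }

  const layers: string[][] = [];
  let frontier = nodes.filter((n) => indegree.get(n.id) === 0).map((n) => n.id);
  let visited = 0;
  while (frontier.length > 0) {
    layers.push(frontier);
    visited += frontier.length;
    const next: string[] = [];
    for (const id of frontier) {
      for (const child of dependents.get(id) ?? []) {
        const remaining = (indegree.get(child) ?? 0) - 1;
        indegree.set(child, remaining);
        if (remaining === 0) next.push(child);
      }
    }
    frontier = next;
  }
  // Nodes left unvisited can only be part of a cycle.
  if (visited !== nodes.length) throw new Error("cycle detected");
  return layers;
}
```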
|
||||
|
||||
## 5. Event Sourcing — Persistence and Replay
|
||||
|
||||
- [ ] 5.1 Implement `store.ts`: IWorkflowStore interface (createWorkflowRun, getWorkflowRun, updateWorkflowRun, failWorkflowRun, createWorkflowEvent, getCompletedDagNodeOutputs, getActiveWorkflowRunByPath)
|
||||
- [ ] 5.2 Implement `executor.ts`: Top-level workflow orchestrator — create run, path-lock guard, dispatch to dag-executor, handle resume with prior completed nodes, event emission
|
||||
- [ ] 5.3 Implement event persistence: 8 event types stored chronologically, node outputs stored for resume
|
||||
- [ ] 5.4 Implement resume: `hydrateResumableRun()` loads prior completed node outputs, skips re-execution
|
||||
- [ ] 5.5 Implement cleanup: retention-based run record and artifact removal
|
||||
|
||||
## 6. Storage Backends
|
||||
|
||||
- [ ] 6.1 Implement filesystem store: `createFsStore(path)` — run.json per run, events.jsonl, node outputs as JSON files, file-level locking
|
||||
- [ ] 6.2 Implement SQLite store: `createSqliteStore(path)` — workflow_runs, workflow_events, node_outputs tables with WAL mode
|
||||
- [ ] 6.3 Implement Postgres store: `createPostgresStore(connectionString)` — same schema as SQLite, pg driver
|
||||
|
||||
## 7. Variable Substitution
|
||||
|
||||
- [ ] 7.1 Implement workflow-level variable substitution: $WORKFLOW_ID, $ARGUMENTS, $ARTIFACTS_DIR, $BASE_BRANCH, $DOCS_DIR
|
||||
- [ ] 7.2 Implement node output references in prompts: `$nodeId.output` (full text), `$nodeId.output.field` (structured field access)
|
||||
- [ ] 7.3 Implement loop- and approval-specific variables: `$LOOP_USER_INPUT`, `$LOOP_PREV_OUTPUT`, `$REJECTION_REASON`
|
||||
- [ ] 7.4 Implement command-level variable substitution: $1-$9 positional args
|
||||
|
||||
## 8. Script and Bash Execution
|
||||
|
||||
- [ ] 8.1 Implement BashNode execution: `bash -c` subprocess with timeout, stdout capture, env var injection
|
||||
- [ ] 8.2 Implement ScriptNode — bun runtime: inline `bun -e`, named scripts from `.archon/scripts/`, deps installation
|
||||
- [ ] 8.3 Implement ScriptNode — uv runtime: `uv run python -c`, named scripts, uv deps installation
|
||||
- [ ] 8.4 Implement `script-discovery.ts`: discover scripts by extension (.ts→bun, .py→uv) from project and home scopes
|
||||
|
||||
## 9. Approval Gates and Human-in-the-Loop
|
||||
|
||||
- [ ] 9.1 Implement ApprovalNode handler: pause workflow status, send approval message, store approval context
|
||||
- [ ] 9.2 Implement approve/resume: transition from paused→running, continue DAG execution
|
||||
- [ ] 9.3 Implement reject handling: reject node with reason, populate $REJECTION_REASON, execute on_reject prompt if configured
|
||||
- [ ] 9.4 Implement capture_response: store user comment as $nodeId.output
|
||||
- [ ] 9.5 Implement interactive loop support: loop.interactive=true pauses between iterations, gate_message shown to user
|
||||
|
||||
## 10. Loop Nodes
|
||||
|
||||
- [ ] 10.1 Implement LoopNode execution: iterative AI prompt loop with completion signal detection (`until`)
|
||||
- [ ] 10.2 Implement `max_iterations` enforcement: fail node when exceeded
|
||||
- [ ] 10.3 Implement `fresh_context` for loop iterations: new session vs. accumulated context
|
||||
- [ ] 10.4 Implement `until_bash`: bash exit code 0 as completion signal (alternative to text signal)
|
||||
|
||||
## 11. CLI Tool (MVP)
|
||||
|
||||
- [ ] 11.1 Implement main CLI entry point with subcommand routing (workflow list, run, status, resume)
|
||||
- [ ] 11.2 Implement `workflow list`: discover and display all workflows with source info
|
||||
- [ ] 11.3 Implement `workflow run`: execute workflow by name with arguments, --cwd, --store flags
|
||||
- [ ] 11.4 Implement `workflow status`: display active and recent runs
|
||||
- [ ] 11.5 Implement `workflow resume`: resume a failed workflow
|
||||
|
||||
## 12. Workflow Discovery
|
||||
|
||||
- [ ] 12.1 Implement `workflow-discovery.ts`: filesystem discovery across bundled→home→project scopes with precedence
|
||||
- [ ] 12.2 Implement bundled defaults: embedded default workflows (assist, plan, implement)
|
||||
- [ ] 12.3 Implement home-global scope: user-level workflows directory
|
||||
- [ ] 12.4 Implement project scope: repo-local `.workflows/` directory
|
||||
- [ ] 12.5 Implement resilient loading: per-file error handling, one broken YAML doesn't abort discovery
|
||||
|
||||
## 13. Testing
|
||||
|
||||
- [ ] 13.1 Unit test DAG executor: topological layering, trigger rules, when conditions, node output refs
|
||||
- [ ] 13.2 Unit test schema validation: all node types, mutual exclusivity, field validation
|
||||
- [ ] 13.3 Unit test variable substitution: $nodeId.output, $ARGUMENTS, $LOOP_PREV_OUTPUT edge cases
|
||||
- [ ] 13.4 Unit test condition evaluator: comparison operators, compound AND/OR, error cases
|
||||
- [ ] 13.5 Unit test filesystem store: create/read/update runs, events, node outputs, resume data
|
||||
- [ ] 13.6 Unit test SQLite store: same coverage as filesystem
|
||||
- [ ] 13.7 Unit test CLI commands: argument parsing, output formatting, approval flow
|
||||
- [ ] 13.8 Integration test: end-to-end workflow execution with bash and script nodes
|
||||
- [ ] 13.9 Integration test: resume after failure with prior node outputs loaded
|
||||
@@ -0,0 +1,2 @@
|
||||
schema: spec-driven
|
||||
created: 2026-06-07
|
||||
@@ -0,0 +1,3 @@
|
||||
# memory-context-engineering
|
||||
|
||||
Spec-driven implementation of memory and context engineering patterns, based on research into LangMem, DeerFlow, and CowAgent
|
||||
@@ -0,0 +1,164 @@
|
||||
## Context
|
||||
|
||||
Current agents have no durable memory beyond the immediate LLM context window. Research across three production-grade OSS repos (LangMem, DeerFlow, CowAgent) reveals a consistent architectural pattern: a **tiered memory pipeline** with short-term context management, long-term semantic extraction, and periodic background consolidation. This design synthesizes those patterns into a portable, framework-agnostic `memory-engine` module.
|
||||
|
||||
The engine must be:
|
||||
- **Portable** — works with any LLM, any agent framework, any embedding provider
|
||||
- **Tiered** — separates ephemeral session context from persistent long-term knowledge
|
||||
- **Efficient** — background processing, debounced writes, token-budget-aware formatting
|
||||
- **Searchable** — hybrid keyword + vector retrieval with scoring
|
||||
|
||||
## Goals / Non-Goals
|
||||
|
||||
**Goals:**
|
||||
- Provide a unified public API: `MemoryEngine` class with `manage()`, `search()`, `flush()`, `dream()` methods
|
||||
- Short-term context: token-budget windowing + incremental summarization (LangMem's `summarize_messages` pattern)
|
||||
- Long-term memory: LLM-extracted facts stored in SQLite with typed schemas (LangMem's `MemoryManager` + DeerFlow's fact model)
|
||||
- Tiered consolidation: context→daily→core pipeline with configurable promotion rules (CowAgent's 3-tier)
|
||||
- Hybrid search: FTS5 keyword + numpy-vectorized cosine similarity with weighted merge (CowAgent's `MemoryStorage`)
|
||||
- Background processing: debounced async queue for memory updates (DeerFlow's `MemoryUpdateQueue` + LangMem's `ReflectionExecutor`)
|
||||
- Agent tools: `manage_memory(content, action, id)` and `search_memory(query, limit)` as framework-agnostic callables
|
||||
|
||||
**Non-Goals:**
|
||||
- Not a standalone agent framework — integrates into existing loops
|
||||
- No built-in LLM provider — caller provides model
|
||||
- No built-in embedding provider — caller provides or we degrade to keyword-only
|
||||
- No real-time sync / distributed consensus — single-process design
|
||||
- No graph-based memory (entity-relationship knowledge graphs) — deferred to future
|
||||
|
||||
## Decisions
|
||||
|
||||
### D1: SQLite as the single persistence backend
|
||||
- **Choice**: SQLite with WAL mode for both keyword search (FTS5) and vector storage (BLOB embeddings)
|
||||
- **Rationale**: Zero-dependency, production-proven; FTS5 is compiled into most builds of Python's stdlib `sqlite3`, and numpy similarity runs in-process
|
||||
- **Alternatives considered**:
|
||||
- *JSON files* (DeerFlow) → simpler but no built-in search, concurrency issues
|
||||
- *External vector DB* (Pinecone, pgvector) → adds operational complexity, violates portability goal
|
||||
- *LMDB/RocksDB* → overkill, no FTS5 equivalent
|
||||
|
||||
### D2: Three-tier architecture with file-based daily layer
|
||||
- **Choice**: In-memory context tier → Markdown-file daily tier → SQLite-indexed core tier
|
||||
- **Rationale**: Daily Markdown files are human-readable, easily audited, and serve as the input to Deep Dream consolidation. Core tier is the indexed, searchable fact store.
|
||||
- **Alternatives considered**:
|
||||
- *Single SQLite DB for everything* → loses human-readability of daily records
|
||||
- *All in-memory* → no persistence across restarts
|
||||
|
||||
### D3: Fact extraction via structured LLM output (JSON pattern, not tool-calling)
|
||||
- **Choice**: LLM returns structured JSON (DeerFlow pattern) rather than tool-calling-based extraction (LangMem trustcall pattern)
|
||||
- **Rationale**: Simpler, fewer dependencies, compatible with any LLM provider. LangMem's trustcall approach is more robust for complex multi-step edits but requires the `trustcall` library.
|
||||
- **Fallback**: Confidence-thresholded insertion with content-dedup hashing to prevent duplicates
|
||||
|
||||
### D4: Hybrid search with numpy-vectorized cosine similarity
|
||||
- **Choice**: Load relevant embeddings from SQLite, compute cosine similarity via `matrix @ vector` (numpy), merge with FTS5 BM25 scores
|
||||
- **Rationale**: ~100x faster than per-row Python loops. Uses numpy which is near-ubiquitous in Python ML.
|
||||
- **Fallback**: Pure-Python cosine similarity when numpy unavailable
|
||||
|
||||
### D5: Debounced background memory update queue
|
||||
- **Choice**: Thread-safe priority queue with configurable debounce timer (DeerFlow pattern)
|
||||
- **Rationale**: Prevents thundering-herd on LLM API during rapid conversation turns. Threaded execution avoids blocking the main agent loop.
|
||||
- **Alternatives considered**: asyncio queue → fine for async-only, but MemoryEngine must support sync callers
|
||||
|
||||
### D6: Namespace isolation via tuple-based scoping
|
||||
- **Choice**: `(scope_type, user_id, agent_id)` tuple namespace for multi-tenant isolation
|
||||
- **Rationale**: LangMem's `NamespaceTemplate` pattern proven in production. Allows `("user", "u-123")` or `("org", "acme", "agent-alpha")`.
|
||||
|
||||
## Architecture Overview
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────┐
|
||||
│ MemoryEngine │
|
||||
├─────────────────────────────────────────────────────────┤
|
||||
│ manage_memory(content, scope, metadata) → fact_id │
|
||||
│ search_memory(query, limit, scope) → SearchResults[] │
|
||||
│ flush_messages(messages, scope) → boolean │
|
||||
│ deep_dream(lookback_days, scope) → boolean │
|
||||
│ format_for_injection(scope, max_tokens) → str │
|
||||
└──────────────────────┬──────────────────────────────────┘
|
||||
│
|
||||
┌──────────────┼──────────────┐
|
||||
▼ ▼ ▼
|
||||
┌──────────────┐ ┌──────────┐ ┌──────────────┐
|
||||
│ Context Tier │ │ Daily │ │ Core Tier │
|
||||
│ (in-memory) │ │ Tier │ │ (SQLite + │
|
||||
│ │ │ (Markdown│ │ FTS5 + │
|
||||
│ RunningSumm. │ │ files) │ │ vectors) │
|
||||
│ token budget │ │ │ │ │
|
||||
└──────────────┘ │ Deep │ │ MemoryStore │
|
||||
│ Dream ───┼─┤ (facts) │
|
||||
└──────────┘ │ HybridSearch │
|
||||
└──────────────┘
|
||||
│
|
||||
┌────────┴────────┐
|
||||
▼ ▼
|
||||
┌────────────┐ ┌────────────────┐
|
||||
│ Keyword │ │ Vector Search │
|
||||
│ (FTS5) │ │ (numpy cosine) │
|
||||
└────────────┘ └────────────────┘
|
||||
```
|
||||
|
||||
### Data Flow
|
||||
|
||||
1. **Agent sends message** → Context tier tracks token budget, optionally summarizes
|
||||
2. **Conversation turn completes** → Messages queued to background `MemoryUpdateQueue`
|
||||
3. **Debounce timer fires** → `MemoryUpdater` calls LLM with current memory + conversation → extracts facts
|
||||
4. **Facts persisted** → Core tier SQLite: chunks table with embedding, FTS5 index
|
||||
5. **Daily recording** → `MemoryFlushManager` appends to `memory/YYYY-MM-DD.md`
|
||||
6. **Deep Dream (scheduled)** → LLM reads MEMORY.md + recent daily files → rewrites MEMORY.md → writes dream diary
|
||||
7. **Agent starts new session** → `format_for_injection()` reads core tier → builds token-budgeted context string → injects into system prompt
|
||||
|
||||
## Module Structure
|
||||
|
||||
```
|
||||
memory-engine/
|
||||
├── __init__.py # Public API: MemoryEngine, MemoryConfig
|
||||
├── config.py # Pydantic config model
|
||||
├── core/
|
||||
│ ├── __init__.py
|
||||
│ ├── store.py # MemoryStore (SQLite + FTS5 + vectors)
|
||||
│ ├── hybrid_search.py # Vector + keyword merge with temporal decay
|
||||
│ └── schemas.py # Memory, Fact, SearchResult models
|
||||
├── extraction/
|
||||
│ ├── __init__.py
|
||||
│ ├── manager.py # MemoryManager (LLM fact extraction)
|
||||
│ └── prompts.py # System prompts for memory extraction
|
||||
├── tiers/
|
||||
│ ├── __init__.py
|
||||
│ ├── context.py # ContextTier (short-term summarization)
|
||||
│ ├── daily.py # DailyTier (Markdown file management)
|
||||
│ └── core.py # CoreTier (long-term persistent store)
|
||||
├── background/
|
||||
│ ├── __init__.py
|
||||
│ ├── queue.py # MemoryUpdateQueue (debounced)
|
||||
│ └── deep_dream.py # Deep Dream consolidation
|
||||
├── tools/
|
||||
│ ├── __init__.py
|
||||
│ ├── manage.py # manage_memory callable
|
||||
│ └── search.py # search_memory callable
|
||||
├── embedding/
|
||||
│ ├── __init__.py
|
||||
│ ├── base.py # EmbeddingProvider ABC
|
||||
│ └── openai.py # OpenAI embedding implementation
|
||||
└── utils/
|
||||
├── __init__.py
|
||||
├── namespace.py # NamespaceTemplate
|
||||
├── token_counter.py # Token counting (tiktoken wrapper)
|
||||
└── chunker.py # Text chunking
|
||||
```
|
||||
|
||||
## Risks / Trade-offs
|
||||
|
||||
| Risk | Mitigation |
|
||||
|------|-----------|
|
||||
| [R1] LLM extraction latency blocks agent loop | Background queue with debounce — agent never waits for memory update |
|
||||
| [R2] Embedding API failures degrade search | Graceful degradation to keyword-only; vector results omitted, not fatal |
|
||||
| [R3] SQLite write contention under high concurrency | WAL mode + RLock per connection; single-process assumption |
|
||||
| [R4] FTS5 corrupted after crash | Self-healing on init: detect corrupt shadow tables, rebuild from chunks table |
|
||||
| [R5] Memory bloat from unbounded fact accumulation | Configurable `max_facts` limit (default 500); facts sorted by confidence, lowest-confidence trimmed first |
|
||||
| [R6] Deep Dream overwrites valuable long-term data | Dream diary preserves audit trail; content-hash dedup prevents re-processing |
|
||||
| [R7] Token budget exceeded in context injection | `format_for_injection()` enforces strict token limit with truncation |
|
||||
|
||||
## Open Questions
|
||||
|
||||
- Q1: Should Deep Dream be scheduled (cron) or event-driven (every N daily files)?
|
||||
- Q2: Is 500 the right default `max_facts` limit for the core tier?
|
||||
- Q3: Should the daily tier support per-user isolation (user-specific daily files) or always shared?
|
||||
@@ -0,0 +1,35 @@
|
||||
## Why
|
||||
|
||||
Current AI agents lack structured, durable memory beyond the immediate context window. Conversations are stateless, preferences are forgotten, and long-term learning is nonexistent. Three OSS repos (LangMem, DeerFlow, CowAgent) demonstrate production patterns for agent memory — but no unified, portable engine exists that combines short-term context management, long-term semantic memory, tiered consolidation, and hybrid retrieval. This change builds that engine by extracting and adapting the best patterns from all three.
|
||||
|
||||
## What Changes
|
||||
|
||||
- **New `memory-engine/` module** in the codebase providing a unified memory & context API
|
||||
- **Short-term context summarization** — token-budget-aware conversation windowing (LangMem pattern)
|
||||
- **Long-term semantic memory** — LLM-extracted facts stored with optional vector embeddings (LangMem/DeerFlow hybrid)
|
||||
- **Tiered memory architecture** — Context tier (ephemeral session) → Daily tier (summarized records) → Core tier (distilled long-term) (CowAgent pattern)
|
||||
- **Hybrid search** — Keyword (FTS5) + Vector (cosine similarity on embeddings) with weighted merge (CowAgent pattern)
|
||||
- **Background consolidation** — Debounced, async memory extraction pipeline (DeerFlow queue + LangMem ReflectionExecutor)
|
||||
- **Deep Dream distillation** — Periodic overnight LLM consolidation of daily records into core memory (CowAgent pattern)
|
||||
- **Memory tools for agents** — `manage_memory` and `search_memory` tool interfaces (LangMem pattern)
|
||||
|
||||
## Capabilities
|
||||
|
||||
### New Capabilities
|
||||
- `short-term-context`: Token-budget window management, conversation summarization, and context trimming for LLM interactions
|
||||
- `long-term-memory`: Persistent fact extraction, storage, and retrieval with Pydantic-typed schemas
|
||||
- `tiered-consolidation`: Three-tier memory pipeline (context→daily→core) with promotion rules and Deep Dream distillation
|
||||
- `hybrid-search`: Combined keyword (FTS5) + vector (embedding cosine similarity) search with weighted scoring and temporal decay
|
||||
- `memory-tools`: `manage_memory` (CRUD) and `search_memory` (semantic query) tools for agent integration
|
||||
- `background-processing`: Debounced async memory update queue with thread-pool execution
|
||||
|
||||
### Modified Capabilities
|
||||
<!-- No existing specs to modify — this is a greenfield module -->
|
||||
|
||||
## Impact
|
||||
|
||||
- New `memory-engine/` directory tree (no existing code modified)
|
||||
- Dependencies: `sqlite3` (stdlib), `numpy` (optional, for vector search), `pydantic` (schemas), `tiktoken` (token counting)
|
||||
- LLM provider integration via abstract `ChatModel` interface (not coupled to any provider)
|
||||
- Embedding provider integration via abstract `EmbeddingProvider` interface (supports OpenAI, local models)
|
||||
- Agent integration via simple tool interface (not coupled to any agent framework)
|
||||
@@ -0,0 +1,58 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Debounced memory update queue
|
||||
The system SHALL collect memory update requests into a queue and process them after a configurable debounce period.
|
||||
|
||||
#### Scenario: Items enqueued per (thread, user, agent) key
|
||||
- **WHEN** a conversation context is added to the queue
|
||||
- **THEN** it SHALL be keyed by `(thread_id, user_id, agent_name)` for deduplication
|
||||
- **WHEN** a second context arrives for the same key before processing
|
||||
- **THEN** the previous context SHALL be replaced with the newer one
|
||||
|
||||
#### Scenario: Debounce timer resets on each enqueue
|
||||
- **WHEN** a new item is enqueued
|
||||
- **THEN** the debounce timer SHALL reset to the configured `debounce_seconds`
|
||||
- **WHEN** no new items arrive within the debounce window
|
||||
- **THEN** the queue SHALL be processed
|
||||
|
||||
#### Scenario: Immediate processing option
|
||||
- **WHEN** `add_nowait()` is called instead of `add()`
|
||||
- **THEN** the queue SHALL start processing immediately in a background thread
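
The queue scenarios above can be sketched roughly as follows. This is a minimal illustration, not the DeerFlow implementation: the class name `DebouncedQueue`, the `process` callback, and the batch-as-dict shape are assumptions made for the example.

```python
import threading
import time
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str, str]  # (thread_id, user_id, agent_name)


class DebouncedQueue:
    """Hypothetical sketch: keyed pending items, flushed after a quiet period."""

    def __init__(self, process: Callable[[Dict[Key, object]], None],
                 debounce_seconds: float = 30.0) -> None:
        self._process = process
        self._debounce_seconds = debounce_seconds
        self._pending: Dict[Key, object] = {}
        self._lock = threading.Lock()
        self._timer: Optional[threading.Timer] = None

    def add(self, key: Key, context: object) -> None:
        with self._lock:
            self._pending[key] = context  # a newer context replaces the older one
            if self._timer is not None:
                self._timer.cancel()      # reset the debounce window
            self._timer = threading.Timer(self._debounce_seconds, self._flush)
            self._timer.daemon = True
            self._timer.start()

    def add_nowait(self, key: Key, context: object) -> None:
        with self._lock:
            self._pending[key] = context
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
        # Process immediately, but off the caller's thread.
        threading.Thread(target=self._flush, daemon=True).start()

    def _flush(self) -> None:
        with self._lock:
            batch, self._pending = self._pending, {}
        if batch:
            self._process(batch)


# Usage: two enqueues for the same key inside the window yield one batch.
processed = []
demo_queue = DebouncedQueue(processed.append, debounce_seconds=0.05)
demo_queue.add(("t1", "u1", "agent"), "older context")
demo_queue.add(("t1", "u1", "agent"), "newer context")
time.sleep(0.3)
```

A `threading.Timer` per enqueue keeps the sketch short; a production queue would more likely use one long-lived worker thread.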
|
||||
|
||||
### Requirement: Background thread execution for memory updates
|
||||
The system SHALL execute memory updates (LLM extraction + persistence) in a background thread to avoid blocking the agent loop.
|
||||
|
||||
#### Scenario: Async flush via threading.Thread
|
||||
- **WHEN** conversation messages are flushed to memory
|
||||
- **THEN** the flush SHALL run in a `threading.Thread` (daemon=True)
|
||||
- **THEN** the main agent SHALL NOT wait for the flush to complete
|
||||
|
||||
#### Scenario: Thread pool for sync LLM calls
|
||||
- **WHEN** a memory update requires a synchronous LLM call
|
||||
- **THEN** the call SHALL be offloaded to a `ThreadPoolExecutor` (max_workers=4)
|
||||
- **THEN** this SHALL prevent blocking the main event loop
|
||||
|
||||
### Requirement: Content deduplication for flush
|
||||
The system SHALL deduplicate message content before flushing to avoid redundant summarization.
|
||||
|
||||
#### Scenario: MD5 content hash dedup
|
||||
- **WHEN** messages are about to be flushed
|
||||
- **THEN** each message content SHALL be MD5-hashed
|
||||
- **WHEN** a hash matches a previously flushed message
|
||||
- **THEN** that message SHALL be skipped
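
A minimal sketch of the hash-based dedup described in this scenario. The function name `dedupe_messages` and the caller-owned `seen_hashes` set are assumptions for illustration; MD5 is used here only as a content fingerprint, not for security.

```python
import hashlib
from typing import Iterable, List, Set


def dedupe_messages(messages: Iterable[str], seen_hashes: Set[str]) -> List[str]:
    """Return only messages whose MD5 digest has not been flushed before."""
    fresh = []
    for content in messages:
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # already flushed earlier: skip
        seen_hashes.add(digest)
        fresh.append(content)
    return fresh
```

Passing the same `seen_hashes` set across flushes is what makes the dedup span multiple calls.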
|
||||
|
||||
#### Scenario: Scheduler pair stripping
|
||||
- **WHEN** messages contain scheduler-injected pairs (marked with `[SCHEDULED]` prefix)
|
||||
- **THEN** the scheduler user message and its paired assistant response SHALL be stripped before flushing
|
||||
|
||||
### Requirement: Configuration-driven memory processing
|
||||
The system SHALL support configuration to enable/disable background memory processing.
|
||||
|
||||
#### Scenario: Memory processing disabled
|
||||
- **WHEN** `memory_config.enabled` is `False`
|
||||
- **THEN** no memory updates SHALL be queued or processed
|
||||
- **THEN** queue `add()` calls SHALL be no-ops
|
||||
|
||||
#### Scenario: Rate limiting between updates
|
||||
- **WHEN** processing multiple queued memory updates
|
||||
- **THEN** a 0.5 second delay SHALL be inserted between updates to avoid LLM API rate limits
|
||||
@@ -0,0 +1,73 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Hybrid search with vector + keyword fusion
|
||||
The system SHALL combine vector similarity search and keyword search into unified ranked results.
|
||||
|
||||
#### Scenario: Vector search runs when embedding provider available
|
||||
- **WHEN** an embedding provider is configured
|
||||
- **THEN** the system SHALL compute a query embedding and perform cosine similarity search
|
||||
- **WHEN** no embedding provider is configured
|
||||
- **THEN** the system SHALL gracefully degrade to keyword-only search
|
||||
|
||||
#### Scenario: Keyword search always runs
|
||||
- **WHEN** a search query is submitted
|
||||
- **THEN** the system SHALL always perform keyword search regardless of embedding provider availability
|
||||
|
||||
#### Scenario: Weighted score merging
|
||||
- **WHEN** both vector and keyword results are available
|
||||
- **THEN** the final score SHALL be: `vector_weight * vector_score + keyword_weight * keyword_score`
|
||||
- **THEN** default weights SHALL be `vector_weight=0.7`, `keyword_weight=0.3`
|
||||
- **THEN** weights SHALL be configurable
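
The weighted merge can be sketched as below. The function name `merge_scores` and the dict-of-scores inputs are assumptions; the spec also does not say how to score a chunk found by only one of the two searches, so this sketch treats the missing score as 0.0.

```python
from typing import Dict, List, Tuple


def merge_scores(vector_hits: Dict[str, float],
                 keyword_hits: Dict[str, float],
                 vector_weight: float = 0.7,
                 keyword_weight: float = 0.3) -> List[Tuple[str, float]]:
    """Combine per-chunk scores; a chunk missing from one side scores 0.0 there."""
    chunk_ids = set(vector_hits) | set(keyword_hits)
    merged = {
        chunk_id: vector_weight * vector_hits.get(chunk_id, 0.0)
        + keyword_weight * keyword_hits.get(chunk_id, 0.0)
        for chunk_id in chunk_ids
    }
    # Highest combined score first.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)
```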
|
||||
|
||||
### Requirement: Vector search via numpy cosine similarity
|
||||
The system SHALL perform vector search using numpy-vectorized cosine similarity for performance.
|
||||
|
||||
#### Scenario: Vectorized cosine similarity
|
||||
- **WHEN** numpy is available
|
||||
- **THEN** all chunk embeddings SHALL be loaded into a numpy matrix `(N, D)`
|
||||
- **THEN** cosine similarity SHALL be computed as `matrix @ query_vector` (BLAS matrix-vector multiply)
|
||||
- **THEN** top-K results SHALL be selected via `argpartition` (O(N) average)
|
||||
|
||||
#### Scenario: Pure-Python fallback
|
||||
- **WHEN** numpy is unavailable
|
||||
- **THEN** cosine similarity SHALL be computed per-row with pure Python
|
||||
- **THEN** results SHALL be sorted and the top K returned
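
Both scenarios of this requirement can be sketched in one function. The name `top_k_cosine` is an assumption, and the sketch normalizes the vectors explicitly; if stored embeddings are already unit-length, the plain `matrix @ query_vector` from the scenario is sufficient.

```python
import math
from typing import List, Sequence, Tuple

try:
    import numpy as np
except ImportError:  # pure-Python fallback path
    np = None


def top_k_cosine(query: Sequence[float],
                 embeddings: List[Sequence[float]],
                 k: int) -> List[Tuple[int, float]]:
    """Return (row_index, cosine_similarity) for the k most similar rows."""
    if not embeddings or k <= 0:
        return []
    k = min(k, len(embeddings))
    if np is not None:
        matrix = np.asarray(embeddings, dtype=np.float32)   # shape (N, D)
        vector = np.asarray(query, dtype=np.float32)        # shape (D,)
        denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
        denom[denom == 0] = 1.0                             # zero vectors score 0
        scores = (matrix @ vector) / denom
        if k < len(scores):
            idx = np.argpartition(-scores, k - 1)[:k]       # unordered top-k
        else:
            idx = np.arange(len(scores))
        idx = idx[np.argsort(-scores[idx])]                 # order the k winners
        return [(int(i), float(scores[i])) for i in idx]

    def cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    scored = [(i, cosine(query, row)) for i, row in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

`argpartition` only guarantees that the k best rows are selected, not their order, which is why the sketch sorts those k entries afterwards.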
|
||||
|
||||
### Requirement: Three-tier keyword search (FTS5 → trigram → LIKE)
|
||||
The system SHALL provide a cascading keyword search strategy for multi-language support.
|
||||
|
||||
#### Scenario: Standard FTS5 for ASCII queries
|
||||
- **WHEN** the query contains only ASCII characters
|
||||
- **THEN** the system SHALL use SQLite FTS5 with the unicode61 tokenizer
|
||||
- **THEN** BM25 ranking SHALL be converted to a `[0, 1)` score
|
||||
|
||||
#### Scenario: Trigram FTS5 for CJK queries
|
||||
- **WHEN** the query contains CJK (Chinese, Japanese, Korean) characters
|
||||
- **THEN** the system SHALL use SQLite FTS5 with the trigram tokenizer
|
||||
- **THEN** CJK character sequences and ASCII words SHALL be extracted and joined with AND
|
||||
|
||||
#### Scenario: LIKE fallback for edge cases
|
||||
- **WHEN** FTS5 is unavailable or returns empty results
|
||||
- **THEN** the system SHALL fall back to LIKE-based search
|
||||
- **THEN** CJK runs (1+ chars) and ASCII words (3+ chars) SHALL be matched independently
|
||||
|
||||
### Requirement: Temporal decay for dated memory files
|
||||
The system SHALL apply exponential decay to search scores for dated memory files.
|
||||
|
||||
#### Scenario: Decay applied to dated files
|
||||
- **WHEN** a memory chunk path matches `YYYY-MM-DD.md`
|
||||
- **THEN** the combined score SHALL be multiplied by `exp(-ln(2)/half_life * age_days)`
|
||||
- **THEN** the default `half_life` SHALL be 30 days
|
||||
- **WHEN** the path does not contain a date (e.g., `MEMORY.md`)
|
||||
- **THEN** no decay SHALL be applied (multiplier = 1.0)
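
As a worked illustration of the decay formula, the sketch below computes the multiplier from a chunk path. The function name `decay_multiplier` and the choice to clamp future-dated files to age 0 are assumptions.

```python
import math
import re
from datetime import date

_DATED_FILE = re.compile(r"(\d{4})-(\d{2})-(\d{2})\.md$")


def decay_multiplier(path: str, today: date, half_life_days: float = 30.0) -> float:
    """exp(-ln(2)/half_life * age_days) for dated files, 1.0 otherwise."""
    match = _DATED_FILE.search(path)
    if match is None:
        return 1.0  # e.g. MEMORY.md: evergreen, no decay
    try:
        file_date = date(*(int(part) for part in match.groups()))
    except ValueError:
        return 1.0  # looks like a date but is not a valid one
    age_days = max((today - file_date).days, 0)
    return math.exp(-math.log(2) / half_life_days * age_days)
```

With the default half-life, a chunk from a file exactly 30 days old keeps half of its combined score.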
|
||||
|
||||
### Requirement: Result filtering and limits
|
||||
The system SHALL filter search results by minimum score and maximum count.
|
||||
|
||||
#### Scenario: Min score threshold
|
||||
- **WHEN** search results are merged
|
||||
- **THEN** results with score below `min_score` (default 0.1) SHALL be discarded
|
||||
|
||||
#### Scenario: Max results limit
|
||||
- **WHEN** search results exceed `max_results`
|
||||
- **THEN** only the top `max_results` by combined score SHALL be returned
|
||||
@@ -0,0 +1,83 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Fact extraction from conversation
|
||||
The system SHALL extract structured facts from conversations using an LLM, with confidence scoring and category classification.
|
||||
|
||||
#### Scenario: Extract facts from conversation turn
|
||||
- **WHEN** a conversation turn (user message + assistant reply) is processed
|
||||
- **THEN** the system SHALL call the configured LLM with the conversation text
|
||||
- **THEN** the LLM response SHALL be parsed as structured JSON with facts
|
||||
- **THEN** each fact SHALL contain: `content`, `category`, `confidence` (0.0-1.0)
|
||||
|
||||
#### Scenario: Fact categories
|
||||
- **WHEN** a fact is extracted
|
||||
- **THEN** its `category` SHALL be one of: `preference`, `knowledge`, `context`, `behavior`, `goal`, `correction`
|
||||
- **THEN** the system SHALL validate the category against the allowed set
|
||||
|
||||
#### Scenario: Confidence thresholds
|
||||
- **WHEN** a fact's confidence is below the configurable threshold (default 0.5)
|
||||
- **THEN** the fact SHALL NOT be persisted
|
||||
- **THEN** the system SHALL log that a low-confidence fact was skipped
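
The three extraction scenarios can be sketched as a single parsing step. The `{"facts": [...]}` envelope, the `Fact` dataclass, and the choice to skip (rather than reject) facts with an unknown category are assumptions; the spec only fixes the field names, the category set, and the default threshold.

```python
import json
import logging
from dataclasses import dataclass
from typing import List

logger = logging.getLogger("memory_engine.extraction")

ALLOWED_CATEGORIES = frozenset(
    {"preference", "knowledge", "context", "behavior", "goal", "correction"}
)


@dataclass(frozen=True)
class Fact:
    content: str
    category: str
    confidence: float


def parse_facts(llm_response: str, min_confidence: float = 0.5) -> List[Fact]:
    """Parse the LLM's JSON reply and keep only valid, confident facts."""
    payload = json.loads(llm_response)
    accepted: List[Fact] = []
    for raw in payload.get("facts", []):
        fact = Fact(str(raw["content"]), str(raw["category"]), float(raw["confidence"]))
        if fact.category not in ALLOWED_CATEGORIES:
            logger.warning("skipping fact with unknown category %r", fact.category)
            continue
        if fact.confidence < min_confidence:
            logger.info("skipping low-confidence fact (%.2f)", fact.confidence)
            continue
        accepted.append(fact)
    return accepted
```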
|
||||
|
||||
### Requirement: Fact CRUD operations
|
||||
The system SHALL support creating, reading, updating, and deleting memory facts.
|
||||
|
||||
#### Scenario: Create fact
|
||||
- **WHEN** a new fact is created
|
||||
- **THEN** it SHALL be assigned a unique ID (`fact_{uuid_hex[:8]}`)
|
||||
- **THEN** it SHALL be timestamped with ISO-8601 UTC
|
||||
- **THEN** it SHALL be persisted to the core store
|
||||
|
||||
#### Scenario: Delete fact by ID
|
||||
- **WHEN** a fact deletion is requested with a valid ID
|
||||
- **THEN** the fact SHALL be removed from the store
|
||||
- **THEN** the updated store SHALL be persisted
|
||||
|
||||
#### Scenario: Delete non-existent fact
|
||||
- **WHEN** a fact deletion is requested with an unknown ID
|
||||
- **THEN** the system SHALL raise `KeyError`
|
||||
|
||||
#### Scenario: Update fact
|
||||
- **WHEN** a fact update is requested with a valid ID
|
||||
- **THEN** the system SHALL update only the provided fields (`content`, `category`, `confidence`)
|
||||
- **THEN** the fact's `createdAt` SHALL NOT be modified
|
||||
- **THEN** the updated store SHALL be persisted
|
||||
|
||||
### Requirement: Content deduplication
|
||||
The system SHALL prevent duplicate facts by casefolded content comparison.
|
||||
|
||||
#### Scenario: Exact duplicate detected
|
||||
- **WHEN** a new fact's content (casefolded) matches an existing fact
|
||||
- **THEN** the new fact SHALL be skipped
|
||||
- **THEN** the existing fact SHALL remain unchanged
|
||||
- **THEN** the system SHALL log that a duplicate was skipped
|
||||
|
||||
#### Scenario: Near-duplicate with different casing
|
||||
- **WHEN** a new fact's content differs only in letter casing
|
||||
- **THEN** it SHALL be treated as a duplicate
|
||||
- **THEN** the new fact SHALL be skipped
|
||||
|
||||
### Requirement: Max facts limit
|
||||
The system SHALL enforce a configurable maximum number of stored facts (default 500).
|
||||
|
||||
#### Scenario: Fact count exceeds limit
|
||||
- **WHEN** adding a new fact would exceed `max_facts`
|
||||
- **THEN** the system SHALL sort existing facts by confidence (descending)
|
||||
- **THEN** the lowest-confidence fact SHALL be removed
|
||||
- **THEN** the new fact SHALL be added
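
The create, dedup, and limit rules above interact, so a combined sketch may help. `FactStore` is a hypothetical in-memory stand-in for the core store (persistence is omitted), and it assumes `max_facts >= 1`.

```python
import uuid
from datetime import datetime, timezone
from typing import Dict, Optional


class FactStore:
    """Hypothetical in-memory store showing ID generation, dedup, and eviction."""

    def __init__(self, max_facts: int = 500) -> None:
        self.max_facts = max_facts
        self.facts: Dict[str, dict] = {}

    def add(self, content: str, category: str, confidence: float) -> Optional[str]:
        key = content.casefold()
        if any(f["content"].casefold() == key for f in self.facts.values()):
            return None  # duplicate (ignoring case): skip, keep the existing fact
        if len(self.facts) >= self.max_facts:
            # Equivalent to sorting by confidence and dropping the last entry.
            lowest = min(self.facts.values(), key=lambda f: f["confidence"])
            del self.facts[lowest["id"]]
        fact_id = f"fact_{uuid.uuid4().hex[:8]}"
        self.facts[fact_id] = {
            "id": fact_id,
            "content": content,
            "category": category,
            "confidence": confidence,
            "createdAt": datetime.now(timezone.utc).isoformat(),
        }
        return fact_id
```

Note that, as specified, a new fact is added even when its confidence is lower than that of the fact it evicts.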
|
||||
|
||||
### Requirement: Memory formatting for context injection
|
||||
The system SHALL format memory data into a compact string for injection into LLM system prompts, respecting a token budget.
|
||||
|
||||
#### Scenario: Format with all sections
|
||||
- **WHEN** memory data contains user context, history, and facts
|
||||
- **THEN** the output SHALL include: "User Context:" with work/personal/topOfMind
|
||||
- **THEN** the output SHALL include: "History:" with recent/earlier/background
|
||||
- **THEN** the output SHALL include: "Facts:" sorted by confidence descending
|
||||
- **THEN** each fact SHALL be formatted as: `- [{category} | {confidence:.2f}] {content}`
|
||||
|
||||
#### Scenario: Token budget enforcement
|
||||
- **WHEN** the formatted output exceeds `max_tokens` (default 2000)
|
||||
- **THEN** the system SHALL trim facts from lowest confidence up
|
||||
- **THEN** if still over budget, the output SHALL be truncated at the character level
|
||||
- **THEN** `"\n..."` SHALL be appended to indicate truncation
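
A simplified sketch of the budget enforcement, covering only the "Facts:" section (the "User Context:" and "History:" sections are omitted). The name `format_facts` is an assumption, and tokens are estimated with `len(text) // 4` to keep the example dependency-free.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough character-based estimate


def format_facts(facts: List[Dict], max_tokens: int = 2000) -> str:
    """Render facts by descending confidence, trimming to fit max_tokens."""
    ordered = sorted(facts, key=lambda f: f["confidence"], reverse=True)

    def render(items: List[Dict]) -> str:
        lines = ["Facts:"]
        lines += [f"- [{f['category']} | {f['confidence']:.2f}] {f['content']}" for f in items]
        return "\n".join(lines)

    # Drop the lowest-confidence fact until the text fits (or nothing is left).
    while ordered and estimate_tokens(render(ordered)) > max_tokens:
        ordered.pop()
    text = render(ordered)
    if estimate_tokens(text) > max_tokens:
        text = text[: max_tokens * 4] + "\n..."  # last resort: character truncation
    return text
```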
|
||||
@@ -0,0 +1,64 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: manage_memory tool
|
||||
The system SHALL provide a callable tool for creating, updating, and deleting persistent facts.
|
||||
|
||||
#### Scenario: Create a new fact
|
||||
- **WHEN** `manage_memory(content="...", action="create")` is called
|
||||
- **THEN** a new fact SHALL be created with the provided content
|
||||
- **THEN** a unique ID SHALL be auto-generated
|
||||
- **THEN** the return value SHALL be `"created memory <id>"`
|
||||
|
||||
#### Scenario: Update an existing fact
|
||||
- **WHEN** `manage_memory(content="...", action="update", id="<existing-id>")` is called
|
||||
- **THEN** the fact SHALL be updated with the new content
|
||||
- **THEN** the return value SHALL be `"updated memory <id>"`
|
||||
- **WHEN** no `id` is provided for an update action
|
||||
- **THEN** a ValueError SHALL be raised
|
||||
|
||||
#### Scenario: Delete a fact
|
||||
- **WHEN** `manage_memory(action="delete", id="<existing-id>")` is called
|
||||
- **THEN** the fact SHALL be deleted
|
||||
- **THEN** the return value SHALL be `"Deleted memory <id>"`
|
||||
- **WHEN** no `id` is provided for a delete action
|
||||
- **THEN** a ValueError SHALL be raised
|
||||
|
||||
#### Scenario: Configurable permitted actions
|
||||
- **WHEN** creating the tool with `actions_permitted=("create", "update")`
|
||||
- **THEN** the delete action SHALL NOT be available
|
||||
- **THEN** attempting a delete SHALL raise a ValueError
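
The tool scenarios above can be sketched as a small factory. The factory name `create_manage_memory_tool` and the plain-dict store are assumptions; the return strings follow the scenarios, and the requirement that `content` be present for create and update is an added assumption.

```python
import uuid
from typing import Callable, Dict, Optional, Tuple


def create_manage_memory_tool(
    store: Dict[str, str],
    actions_permitted: Tuple[str, ...] = ("create", "update", "delete"),
) -> Callable[..., str]:
    """Build a manage_memory callable bound to a store and a set of allowed actions."""

    def manage_memory(content: Optional[str] = None, action: str = "create",
                      id: Optional[str] = None) -> str:
        if action not in actions_permitted:
            raise ValueError(f"action {action!r} is not permitted")
        if action in ("update", "delete") and id is None:
            raise ValueError(f"{action} requires an id")
        if action in ("create", "update") and content is None:
            raise ValueError(f"{action} requires content")
        if action == "delete":
            del store[id]  # raises KeyError for an unknown id
            return f"Deleted memory {id}"
        if action == "create":
            id = f"fact_{uuid.uuid4().hex[:8]}"
        elif id not in store:
            raise KeyError(id)
        store[id] = content
        return f"{action}d memory {id}"

    return manage_memory
```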
|
||||
|
||||
#### Scenario: Custom instructions
|
||||
- **WHEN** creating the tool with custom `instructions`
|
||||
- **THEN** those instructions SHALL be included in the tool description to guide LLM usage
|
||||
|
||||
### Requirement: search_memory tool
|
||||
The system SHALL provide a callable tool for searching stored facts by semantic query.
|
||||
|
||||
#### Scenario: Text query search
|
||||
- **WHEN** `search_memory(query="preference for dark mode", limit=10)` is called
|
||||
- **THEN** the system SHALL perform hybrid search (vector + keyword)
|
||||
- **THEN** results SHALL be returned as a serialized JSON list of fact objects
|
||||
|
||||
#### Scenario: Filtered search
|
||||
- **WHEN** `search_memory(query="...", filter={"category": "preference"})` is called
|
||||
- **THEN** results SHALL be filtered to match the specified criteria
|
||||
|
||||
#### Scenario: Configurable response format
|
||||
- **WHEN** `response_format="content_and_artifact"` is configured
|
||||
- **THEN** the tool SHALL return both serialized memories and raw memory objects
|
||||
|
||||
### Requirement: Namespace isolation for multi-tenant
|
||||
The system SHALL support namespace-based isolation of memory data across users, agents, or organizations.
|
||||
|
||||
#### Scenario: Runtime namespace resolution
|
||||
- **WHEN** a memory tool is called with a configuration containing `{"user_id": "u-123"}`
|
||||
- **THEN** the namespace SHALL be resolved to `("user", "u-123")` at runtime
|
||||
- **WHEN** calling with `{"org_id": "acme", "agent_id": "alpha"}`
|
||||
- **THEN** the namespace SHALL be `("org", "acme", "alpha")`
|
||||
|
||||
#### Scenario: Namespace templating
|
||||
- **WHEN** creating memory tools with `namespace=("{user_id}", "memories")`
|
||||
- **THEN** the `{user_id}` placeholder SHALL be replaced at runtime from configuration
|
||||
- **WHEN** a required config key is missing
|
||||
- **THEN** a ConfigurationError SHALL be raised
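
Namespace templating can be sketched as follows. The function name `resolve_namespace` is an assumption, and `ConfigurationError` is defined locally here because the spec names the error but not where it lives.

```python
from typing import Mapping, Tuple


class ConfigurationError(ValueError):
    """Raised when a namespace placeholder has no value in the runtime config."""


def resolve_namespace(template: Tuple[str, ...], config: Mapping[str, str]) -> Tuple[str, ...]:
    """Replace "{key}" parts of the template with values from config."""
    resolved = []
    for part in template:
        if part.startswith("{") and part.endswith("}"):
            key = part[1:-1]
            if key not in config:
                raise ConfigurationError(f"missing config key: {key}")
            resolved.append(str(config[key]))
        else:
            resolved.append(part)  # literal segment, kept as-is
    return tuple(resolved)
```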
|
||||
@@ -0,0 +1,65 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Token budget management
|
||||
The system SHALL manage LLM context window limits by tracking token usage and triggering summarization when thresholds are exceeded.
|
||||
|
||||
#### Scenario: Token threshold exceeded
|
||||
- **WHEN** cumulative message tokens exceed `max_tokens` configuration
|
||||
- **THEN** the system SHALL identify messages to summarize starting from oldest
|
||||
- **THEN** the system SHALL replace summarized messages with a `RunningSummary` object
|
||||
- **THEN** the system SHALL ensure remaining messages + summary fit within `max_tokens` budget
|
||||
|
||||
#### Scenario: Partial token budget allocation
|
||||
- **WHEN** `max_summary_tokens` is configured (default 256)
|
||||
- **THEN** the system SHALL reserve `max_summary_tokens` tokens for the summary itself
|
||||
- **THEN** remaining messages SHALL be trimmed to fit within `max_tokens - max_summary_tokens`
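
One way to read the two budget scenarios together is sketched below: keep the newest messages that fit in `max_tokens - max_summary_tokens` and hand everything older to the summarizer. The function name `split_for_summary` and the character-based token estimate are assumptions.

```python
from typing import Callable, List, Tuple


def split_for_summary(
    messages: List[str],
    max_tokens: int,
    max_summary_tokens: int = 256,
    count: Callable[[str], int] = lambda text: len(text) // 4,
) -> Tuple[List[str], List[str]]:
    """Return (to_summarize, kept); oldest messages are summarized first."""
    if sum(count(m) for m in messages) <= max_tokens:
        return [], list(messages)  # under budget: nothing to summarize
    budget = max_tokens - max_summary_tokens  # room left after reserving the summary
    kept: List[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    return messages[: len(messages) - len(kept)], kept
```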
|
||||
|
||||
### Requirement: Incremental summarization
|
||||
The system SHALL support incremental summarization across multiple turns, tracking which messages have already been summarized to avoid redundant work.
|
||||
|
||||
#### Scenario: First summarization
|
||||
- **WHEN** no existing `RunningSummary` exists and token threshold is exceeded
|
||||
- **THEN** the system SHALL call the LLM with an initial summary prompt
|
||||
- **THEN** the system SHALL return a `RunningSummary` with `summary`, `summarized_message_ids` set, and `last_summarized_message_id`
|
||||
|
||||
#### Scenario: Subsequent summarization (append)
|
||||
- **WHEN** a `RunningSummary` exists and new messages exceed threshold
|
||||
- **THEN** the system SHALL call the LLM with the existing summary plus new messages
|
||||
- **THEN** the system SHALL extend `summarized_message_ids` with newly summarized message IDs
|
||||
- **THEN** the system SHALL update `last_summarized_message_id`
|
||||
|
||||
### Requirement: Context trimming with summarization hook
|
||||
The system SHALL provide a hook that fires before messages are discarded, allowing the daily tier to capture summarized content.
|
||||
|
||||
#### Scenario: Pre-trim flush
|
||||
- **WHEN** messages are about to be discarded (summarized)
|
||||
- **THEN** the system SHALL fire a `memory_flush_hook` with the messages being summarized
|
||||
- **THEN** the hook SHALL queue the messages for async memory extraction
|
||||
- **THEN** the main thread SHALL NOT block on memory extraction
|
||||
|
||||
### Requirement: Token counting with fallback
|
||||
The system SHALL provide accurate token counting using `tiktoken` when available, with a char-based fallback.
|
||||
|
||||
#### Scenario: tiktoken available
|
||||
- **WHEN** tiktoken package is installed
|
||||
- **THEN** the system SHALL use `tiktoken.get_encoding("cl100k_base")` for token counting
|
||||
- **THEN** token counts SHALL match OpenAI `cl100k_base` tokenization; counts for other providers (e.g., Anthropic) are approximations
|
||||
|
||||
#### Scenario: tiktoken unavailable
|
||||
- **WHEN** tiktoken is not installed
|
||||
- **THEN** the system SHALL fall back to character-based estimation: `len(text) // 4`
|
||||
- **THEN** the system SHALL log a warning about missing tiktoken
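
A minimal sketch of the fallback behaviour. The logger name is an assumption, and the broad `except` is a deliberate choice for the example: besides a missing package, loading the encoding can also fail (for instance when the encoding file cannot be fetched), and the sketch treats both cases as "unavailable".

```python
import logging

logger = logging.getLogger("memory_engine.tokens")

try:
    import tiktoken

    _ENCODING = tiktoken.get_encoding("cl100k_base")
except Exception:  # not installed, or the encoding could not be loaded
    _ENCODING = None
    logger.warning("tiktoken unavailable; estimating tokens as len(text) // 4")


def count_tokens(text: str) -> int:
    """Count tokens with tiktoken when possible, else estimate from length."""
    if _ENCODING is not None:
        # disallowed_special=() treats special-token text as ordinary text.
        return len(_ENCODING.encode(text, disallowed_special=()))
    return len(text) // 4
```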
|
||||
|
||||
### Requirement: Summarization node for LangGraph
|
||||
The system SHALL provide a `SummarizationNode` Runnable that integrates into LangGraph state graphs.
|
||||
|
||||
#### Scenario: Graph integration
|
||||
- **WHEN** `SummarizationNode` is added to a LangGraph workflow
|
||||
- **THEN** it SHALL read messages from `input_messages_key` (default "messages")
|
||||
- **THEN** it SHALL write updated messages to `output_messages_key` (default "summarized_messages")
|
||||
- **THEN** it SHALL store `RunningSummary` in `context.running_summary`
|
||||
|
||||
#### Scenario: Equality of input/output keys
|
||||
- **WHEN** `input_messages_key` equals `output_messages_key`
|
||||
- **THEN** the node SHALL emit a `RemoveMessage(REMOVE_ALL_MESSAGES)` to clear previous state
|
||||
- **THEN** the node SHALL write the new message list including the summary
|
||||
@@ -0,0 +1,64 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Three-tier memory architecture
|
||||
The system SHALL maintain three tiers of memory: Context (short-term/ephemeral), Daily (medium-term/file-based), and Core (long-term/distilled).
|
||||
|
||||
#### Scenario: Context tier stores active session
|
||||
- **WHEN** an agent conversation is in progress
|
||||
- **THEN** the context tier SHALL track messages, token usage, and running summary
|
||||
- **WHEN** the session ends or context is trimmed
|
||||
- **THEN** the context SHALL be flushed to the daily tier
|
||||
|
||||
#### Scenario: Daily tier persists as Markdown files
|
||||
- **WHEN** context is flushed
|
||||
- **THEN** the daily tier SHALL append summarized records to `memory/YYYY-MM-DD.md`
|
||||
- **THEN** each session block SHALL have a timestamped header (e.g., `## Trimmed Context (14:30)`)
|
||||
- **THEN** daily files SHALL be created lazily (only when first write occurs)
|
||||
|
||||
#### Scenario: Core tier stores distilled long-term knowledge
|
||||
- **WHEN** Deep Dream consolidation runs
|
||||
- **THEN** the core tier SHALL be updated by rewriting `MEMORY.md`
|
||||
- **THEN** `MEMORY.md` SHALL be formatted as Markdown with `- ` bullet items, optionally grouped under `## headings`
|
||||
|
||||
### Requirement: Daily memory file management
|
||||
The system SHALL manage daily memory files with automatic creation and lazy initialization.
|
||||
|
||||
#### Scenario: Lazy file creation
|
||||
- **WHEN** the first memory write occurs for a given day
|
||||
- **THEN** a file SHALL be created at `memory/YYYY-MM-DD.md` with a header `# Daily Memory: YYYY-MM-DD`
|
||||
|
||||
#### Scenario: Append-only writes
|
||||
- **WHEN** subsequent memory writes occur on the same day
|
||||
- **THEN** new entries SHALL be appended to the existing daily file
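The lazy-creation and append-only scenarios can be illustrated with a small sketch. It uses an in-memory `Map` as a stand-in for the filesystem, and `appendDailyEntry` is a hypothetical helper name; the header and block formats are taken from the scenarios above.

```typescript
// In-memory stand-in for the filesystem: path -> file contents.
type FileMap = Map<string, string>;

function appendDailyEntry(files: FileMap, date: string, time: string, summary: string): string {
  const path = "memory/" + date + ".md";
  // Lazy creation: the header is written only when the first entry of the day arrives.
  const existing = files.has(path) ? (files.get(path) as string) : "# Daily Memory: " + date + "\n";
  // Append-only: earlier content is kept verbatim and the new block goes at the end.
  const block = "\n## Trimmed Context (" + time + ")\n" + summary + "\n";
  files.set(path, existing + block);
  return path;
}
```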
|
||||
|
||||
### Requirement: Deep Dream consolidation
|
||||
The system SHALL periodically consolidate daily memories into the core memory using LLM-based distillation.
|
||||
|
||||
#### Scenario: Deep Dream triggered
|
||||
- **WHEN** `deep_dream(lookback_days=N)` is called
|
||||
- **THEN** the system SHALL read current `MEMORY.md` and the last N daily files
|
||||
- **THEN** the LLM SHALL receive both the current memory and daily records
|
||||
- **THEN** the LLM SHALL return `[MEMORY]` and `[DREAM]` sections
|
||||
- **THEN** `MEMORY.md` SHALL be overwritten with the `[MEMORY]` content
|
||||
- **THEN** a dream diary SHALL be written to `memory/dreams/YYYY-MM-DD.md`
|
||||
|
||||
#### Scenario: Dedup prevents redundant runs
|
||||
- **WHEN** Deep Dream is called but daily content hash matches the last processed hash
|
||||
- **THEN** the operation SHALL be skipped
|
||||
|
||||
#### Scenario: No daily content skips gracefully
|
||||
- **WHEN** Deep Dream is called but no recent daily files have content
|
||||
- **THEN** the operation SHALL be skipped and existing `MEMORY.md` SHALL be preserved
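The two skip scenarios (unchanged content and no content) amount to a small decision step before any LLM call. The sketch below is one way to express it; `decideDeepDream` is a hypothetical name, and the FNV-1a hash is only a stand-in, since the spec does not say which digest the engine uses.

```typescript
// FNV-1a (32-bit) over UTF-16 code units: a stand-in for the real content digest.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

type DreamDecision = "run" | "skip_empty" | "skip_unchanged";

function decideDeepDream(
  dailyContents: string[],
  lastProcessedHash: string | null,
): { decision: DreamDecision; hash: string | null } {
  const joined = dailyContents
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .join("\n");
  // No daily content: skip and leave MEMORY.md untouched.
  if (joined.length === 0) return { decision: "skip_empty", hash: null };
  const hash = fnv1a(joined);
  // Same content as the last processed run: skip the redundant consolidation.
  if (hash === lastProcessedHash) return { decision: "skip_unchanged", hash };
  return { decision: "run", hash };
}
```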
|
||||
|
||||
#### Scenario: No-fabrication constraint
|
||||
- **WHEN** the LLM produces the `[MEMORY]` section
|
||||
- **THEN** it SHALL ONLY use information present in the source materials (current MEMORY.md + daily files)
|
||||
- **THEN** it SHALL NOT fabricate, infer, or add information not present in the source
|
||||
|
||||
### Requirement: Context summary injection
|
||||
The system SHALL support injecting daily summary text into the active message list for context continuity.
|
||||
|
||||
#### Scenario: Context summary callback
|
||||
- **WHEN** a daily memory flush completes
|
||||
- **THEN** an optional callback SHALL be invoked with the daily summary text
|
||||
- **THEN** the caller MAY inject the summary into the message list for continued context awareness
|
||||
@@ -0,0 +1,86 @@
|
||||
## 1. Module Scaffold & Data Schemas
|
||||
|
||||
- [x] 1.1 Create `memory-engine/` directory tree with all subdirectories and `__init__.py` files
|
||||
- [x] 1.2 Create `config.py` with `MemoryConfig` pydantic model (embedding, chunking, search, tier settings)
|
||||
- [x] 1.3 Create `core/schemas.py` with `MemoryChunk`, `SearchResult`, `Fact`, `RunningSummary`, `ExtractedMemory` data classes
|
||||
- [x] 1.4 Create `utils/token_counter.py` with tiktoken + char-fallback token counting
|
||||
- [x] 1.5 Create `utils/namespace.py` with `NamespaceTemplate` for runtime namespace resolution
|
||||
- [x] 1.6 Create `utils/chunker.py` with `TextChunker` (line-based, overlapping, configurable max_tokens)
|
||||
|
||||
## 2. Core Store: SQLite + FTS5 + Vector
|
||||
|
||||
- [x] 2.1 Create `core/store.py` with `MemoryStore` — SQLite init with WAL mode, FTS5 tables, integrity checks
|
||||
- [x] 2.2 Implement `create_chunks_table()` with embedding BLOB storage, indexes, meta table
|
||||
- [x] 2.3 Implement `create_fts5_tables()` with standard unicode61 tokenizer + trigram tokenizer for CJK
|
||||
- [x] 2.4 Implement FTS5 triggers (AFTER INSERT/UPDATE/DELETE) for auto-sync
|
||||
- [x] 2.5 Implement `save_chunk()` / `save_chunks_batch()` with SQLite UPSERT (INSERT ... ON CONFLICT DO UPDATE)
|
||||
- [x] 2.6 Implement `delete_by_path()`, `get_file_hash()`, `update_file_metadata()`
|
||||
- [x] 2.7 Implement FTS5 self-healing: `_fts5_state_inconsistent()`, `_fts5_shadow_corrupt()`, `reset_fts5()`
|
||||
- [x] 2.8 Implement embedding encode/decode (float32 BLOB via numpy, struct fallback, legacy JSON fallback)
|
||||
- [x] 2.9 Implement `get_stats()` and `close()` methods
|
||||
|
||||
## 3. Hybrid Search
|
||||
|
||||
- [x] 3.1 Implement `search_vector()` — numpy matrix cosine similarity with argpartition top-K (pure-Python fallback)
|
||||
- [x] 3.2 Implement FTS5 keyword search with BM25 scoring: `_search_fts5()`, `_search_fts5_trigram()`
|
||||
- [x] 3.3 Implement `_search_like()` — CJK (1+ chars) + ASCII word (3+ chars) with dynamic scoring
|
||||
- [x] 3.4 Implement `search_keyword()` — three-tier strategy (FTS5 → trigram FTS5 → LIKE)
|
||||
- [x] 3.5 Implement BM25 rank to score conversion (`0.3 + 0.69 * abs(r)/(1+abs(r))`)
|
||||
- [x] 3.6 Create `core/hybrid_search.py` with weighted merge (vector_weight, keyword_weight) + temporal decay
|
||||
- [x] 3.7 Implement `_compute_temporal_decay(path, half_life=30)` — exponential decay for dated files
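Tasks 3.5 to 3.7 can be sketched together. The rank-to-score formula is the one quoted in 3.5; the half-life form of the decay, the date-in-path convention, and the default weights in `hybridScore` are assumptions of this sketch (written in TypeScript for consistency with the rest of the change, while the engine itself is Python).

```typescript
// Maps an FTS5 BM25 rank (more negative means a better match) to a score in [0.3, 0.99).
function bm25RankToScore(rank: number): number {
  const r = Math.abs(rank);
  return 0.3 + 0.69 * (r / (1 + r));
}

// Age in days for paths that embed a YYYY-MM-DD date; null for undated files.
function ageDaysFromPath(path: string, now: Date): number | null {
  const m = /(\d{4})-(\d{2})-(\d{2})/.exec(path);
  if (m === null) return null;
  const fileDay = Date.UTC(Number(m[1]), Number(m[2]) - 1, Number(m[3]));
  return Math.max(0, (now.getTime() - fileDay) / 86400000);
}

// Assumed form: the score halves every `halfLifeDays`; undated files are not decayed.
function temporalDecay(path: string, now: Date, halfLifeDays: number = 30): number {
  const age = ageDaysFromPath(path, now);
  return age === null ? 1 : Math.pow(0.5, age / halfLifeDays);
}

// Weighted merge of vector and keyword scores; the default weights are illustrative.
function hybridScore(
  vectorScore: number,
  keywordScore: number,
  decay: number,
  vectorWeight: number = 0.7,
  keywordWeight: number = 0.3,
): number {
  return (vectorWeight * vectorScore + keywordWeight * keywordScore) * decay;
}
```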
|
||||
|
||||
## 4. LLM Memory Extraction
|
||||
|
||||
- [x] 4.1 Create `extraction/prompts.py` with memory update system prompt (structured JSON output)
|
||||
- [x] 4.2 Create `extraction/manager.py` with `MemoryUpdater` — LLM fact extraction from conversation
|
||||
- [x] 4.3 Implement `_prepare_update_prompt()` — loads current memory, formats conversation, builds prompt
|
||||
- [x] 4.4 Implement `_parse_memory_update_response()` — JSON extraction from LLM response (handles fences/thinking)
|
||||
- [x] 4.5 Implement `_apply_updates()` — update user/history sections, add/remove facts, enforce max_facts
|
||||
- [x] 4.6 Implement `create_fact()`, `update_fact()`, `delete_memory_fact()` CRUD operations
|
||||
- [x] 4.7 Implement content deduplication (casefold comparison) and confidence threshold filtering
|
||||
- [x] 4.8 Implement upload-mention scrubbing from memory data
|
||||
|
||||
## 5. Tiered Consolidation
|
||||
|
||||
- [x] 5.1 Create `tiers/daily.py` with `DailyTier` — lazy file creation, append-only writes with timestamped headers
|
||||
- [x] 5.2 Create `tiers/context.py` with `ContextTier` — short-term context window management with RunningSummary
|
||||
- [x] 5.3 Create `tiers/core.py` with `CoreTier` — wraps MemoryStore, manages MEMORY.md file
|
||||
- [x] 5.4 Create `tiers/__init__.py` with `flush_messages()` — context summarization + daily file append
|
||||
- [x] 5.5 Implement incremental summarization (initial summary, extend existing, RunningSummary tracking)
|
||||
- [x] 5.6 Create `background/deep_dream.py` with `DeepDream` — LLM-based MEMORY.md consolidation
|
||||
- [x] 5.7 Implement Deep Dream dedup (content-hash check), dream diary writing, empty-output guard
|
||||
|
||||
## 6. Background Processing Queue
|
||||
|
||||
- [x] 6.1 Create `background/queue.py` with `MemoryUpdateQueue` — thread-safe, debounced, keyed by (thread, user, agent)
|
||||
- [x] 6.2 Implement `add()` with debounce timer reset, `add_nowait()` for immediate processing
|
||||
- [x] 6.3 Implement timer-triggered processing with rate limiting between updates
|
||||
- [x] 6.4 Implement signal detection: `detect_correction()`, `detect_reinforcement()` with pattern matching
|
||||
- [x] 6.5 Create `background/__init__.py` with `flush_messages()` — dedup + background thread LLM summarization
|
||||
- [x] 6.6 Support `context_summary_callback` for in-context injection of summaries
|
||||
|
||||
## 7. Agent Tools & Public API
|
||||
|
||||
- [x] 7.1 Create `tools/manage.py` with `manage_memory()` — create/update/delete facts with namespace isolation
|
||||
- [x] 7.2 Create `tools/search.py` with `search_memory()` — hybrid search with query/filter/limit/offset
|
||||
- [x] 7.3 Implement `__init__.py` with `MemoryEngine` unified class: `manage()`, `search()`, `flush()`, `dream()`, `format_for_injection()`
|
||||
- [x] 7.4 Implement `format_for_injection()` — token-budgeted memory string for system prompts
|
||||
- [x] 7.5 Thread-safe singleton pattern for `MemoryUpdateQueue` and `MemoryStore`
|
||||
|
||||
## 8. Embedding Provider Interface
|
||||
|
||||
- [x] 8.1 Create `embedding/base.py` with `EmbeddingProvider` ABC — `embed_query()`, `embed_batch()`
|
||||
- [x] 8.2 Create `embedding/openai.py` with `OpenAIEmbeddingProvider` implementation
|
||||
- [x] 8.3 Implement `EmbeddingCache` — per-session cache keyed by (provider, model, text_hash)
|
||||
- [x] 8.4 Create `embedding/__init__.py` with `create_embedding_provider()` factory
|
||||
|
||||
## 9. Integration Tests
|
||||
|
||||
- [x] 9.1 Test short-term context summarization with token budget enforcement
|
||||
- [x] 9.2 Test long-term fact extraction with LLM mock
|
||||
- [x] 9.3 Test hybrid search: vector-only, keyword-only, and combined
|
||||
- [x] 9.4 Test tiered consolidation: flush → daily file → Deep Dream → MEMORY.md rewrite
|
||||
- [x] 9.5 Test background queue: debounce, dedup, async execution
|
||||
- [x] 9.6 Test namespace isolation: scoped searches across tenants
|
||||
- [x] 9.7 Test graceful degradation: no embeddings → keyword-only, no numpy → Python fallback
|
||||
- [x] 9.8 Test memory tools: create/update/delete/search round-trip
|
||||
@@ -0,0 +1,76 @@
|
||||
## Context
|
||||
|
||||
boocode currently has no persistent session management for its agents (the persona agents in data/AGENTS.md). When a session is interrupted, there's no recoverable audit trail, no way to detect repeated mistakes, and no mechanism to enforce learned behavioral guidelines across sessions.
|
||||
|
||||
audit-harness provides: hooks (PostToolUse buffer→Stop flush→UserPromptSubmit injection), skills (/start→/end→/recover→/report-daily), and a Python core (AuditContext) with unified index schema.
|
||||
|
||||
Parlant provides: GuidelineDocumentStore (versioned, tag/label filtered), JourneyStore (graph-based SOPs), and JourneyGuidelineProjection (node→guideline auto-conversion).
|
||||
|
||||
This design ports the high-value subset of both into boocode as agent-facing skills and a TypeScript core library.
|
||||
|
||||
## Goals / Non-Goals
|
||||
|
||||
**Goals:**
|
||||
- Define `.boo/runs/` directory convention with auto-creation and `.gitignore`
|
||||
- Port /start, /end, /recover, /report-daily as boocode skills (markdown)
|
||||
- Port user_correction record format and detection
|
||||
- Port GuidelineDocumentStore from Parlant as TypeScript service
|
||||
- Port Journey → guideline auto-projection (node→guideline conversion)
|
||||
- Implement guideline find_guideline() by content match
|
||||
- All features opt-in, zero breaking changes
|
||||
|
||||
**Non-Goals:**
|
||||
- AuditContext full Python class port (environment snapshots, anomaly lambdas)
|
||||
- Hooks implementation (PostToolUse/Stop/UserPromptSubmit) — separate batch
|
||||
- Parlant's vector DB / embedder infrastructure
|
||||
- Parlant's relationship resolver (ARQ)
|
||||
- Web UI for guideline management — CLI/skill-only
|
||||
|
||||
## Decisions
|
||||
|
||||
### Decision 1: Skill-based commands over CLI tools
|
||||
|
||||
**Choice**: Implement /start, /end, /recover, /report-daily as skill markdown files in `data/skills/boocode/`, following the existing `committing-changes` pattern.
|
||||
**Rationale**: boocode agents already load skills from this path. Adding a new skill is zero code change to the agent runtime — just a new markdown file with YAML frontmatter. CLI tools would require new API routes, dispatch logic, and frontend work.
|
||||
**Alternatives considered**: Fastify API routes (rejected — too heavy for agent-facing commands), shell scripts (rejected — platform-specific).
|
||||
|
||||
### Decision 2: JSONL buffer + index.json
|
||||
|
||||
**Choice**: Port audit-harness's file layout exactly: `audit_buffer.jsonl` for live writes, `audit_pending.jsonl` for agent-authored AUDIT blocks, per-session `audit_trail.jsonl` for flushed records, `index.json` for cross-session metadata.
|
||||
**Rationale**: audit-harness has already run in production with this layout. JSONL is greppable, append-only, and needs no DB connection.
|
||||
**Alternatives considered**: Postgres (rejected — agents don't all have DB access), SQLite (rejected — adds a native dep).
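To make the buffer-to-trail flow concrete, here is a minimal sketch that treats each file as a string of JSONL. `AuditFiles`, `toJsonl`, `parseJsonl`, and `flush` are hypothetical names; real code would read and write the files named in the decision above.

```typescript
type AuditRecord = { [key: string]: unknown };

// One JSON document per line, each terminated by a newline.
function toJsonl(records: AuditRecord[]): string {
  return records.map((r) => JSON.stringify(r) + "\n").join("");
}

function parseJsonl(text: string): AuditRecord[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as AuditRecord);
}

interface AuditFiles {
  buffer: string;  // contents of audit_buffer.jsonl
  pending: string; // contents of audit_pending.jsonl
  trail: string;   // contents of the session's audit_trail.jsonl
}

// Flush: append buffered and pending records to the trail, then clear both buffers.
function flush(files: AuditFiles): AuditFiles {
  const moved = parseJsonl(files.buffer).concat(parseJsonl(files.pending));
  return { buffer: "", pending: "", trail: files.trail + toJsonl(moved) };
}
```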
|
||||
|
||||
### Decision 3: Timestamp-based session IDs
|
||||
|
||||
**Choice**: `adhoc_YYYYMMDD_HHMM` format for session IDs, matching audit-harness pattern.
|
||||
**Rationale**: Human-readable and sortable. The format has minute resolution, so IDs can only collide when two sessions start within the same minute.
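A minimal sketch of generating an ID in the `adhoc_YYYYMMDD_HHMM` format follows. `makeSessionId` and `pad2` are hypothetical helper names, and the use of UTC is an assumption. Because the format stops at minutes, two calls within the same minute yield the same ID.

```typescript
function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// Assumes UTC; the design does not state which time zone the ID uses.
function makeSessionId(now: Date): string {
  const date = String(now.getUTCFullYear()) + pad2(now.getUTCMonth() + 1) + pad2(now.getUTCDate());
  const time = pad2(now.getUTCHours()) + pad2(now.getUTCMinutes());
  return "adhoc_" + date + "_" + time;
}
```

Since every field is zero-padded and ordered from year down to minute, plain string comparison sorts IDs chronologically.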
|
||||
|
||||
### Decision 4: File-based GuidelineStore
|
||||
|
||||
**Choice**: Port GuidelineDocumentStore's abstract interface (create/list/read/update/delete/find) but use filesystem JSON storage instead of Parlant's DocumentDatabase.
|
||||
**Rationale**: boocode doesn't have Parlant's document DB abstraction. A JSON-file store is simpler and sufficient for single-user operation. The interface stays the same, so a future Postgres backend can be swapped in.
|
||||
**Alternatives considered**: Postgres backend (rejected — adds coupling), in-memory only (rejected — no persistence).
|
||||
|
||||
### Decision 5: Journey → guideline projection as pure function
|
||||
|
||||
**Choice**: Port `JourneyGuidelineProjection` as a pure function (not a class). Takes a Journey + its nodes/edges, returns Guideline[].
|
||||
**Rationale**: The projection logic (DFS traversal, node→guideline conversion, edge metadata grafting) is deterministic and has no side effects. A pure function is simpler to test and compose.
|
||||
**Alternatives considered**: Class with JourneyStore dependency (rejected — unnecessary indirection for our use case).
|
||||
|
||||
## Risks / Trade-offs
|
||||
|
||||
- **[Risk]** Skills grow stale if agent runtime doesn't load them → **Mitigation**: Test with existing agent by loading skill explicitly.
|
||||
- **[Risk]** JSONL file contention from multiple agents → **Mitigation**: Single-user homelab. Acceptable.
|
||||
- **[Risk]** GuidelineStore JSON files grow unbounded → **Mitigation**: TBD — add compaction/archival in future batch.
|
||||
- **[Trade-off]** File storage is simple but doesn't scale to multi-user → Acceptable for single-user.
|
||||
|
||||
## Migration / Rollout
|
||||
|
||||
1. Create openspec spec files (proposal/design/tasks/specs)
|
||||
2. Create `.boo/runs/` directory structure (service)
|
||||
3. Create 4 skill files in `data/skills/boocode/`
|
||||
4. Create core AuditContext TypeScript service
|
||||
5. Create GuidelineStore + Journey service
|
||||
6. Create user_correction utilities
|
||||
7. Update data/AGENTS.md with new agents
|
||||
8. Test with skill invocation
|
||||
@@ -0,0 +1,23 @@
|
||||
## Why
|
||||
|
||||
The audit-harness (hooks + skills + AuditContext) and Parlant (GuidelineStore + Journey engine) provide two proven patterns for agent session management. audit-harness solves context-window loss through persistent audit trails, graded recovery, and structured commands (/start → /end → /recover → /report-daily). Parlant solves behavioral consistency through a versioned guideline document store with tag/label-based retrieval, journey-based SOPs, and backtrack detection.
|
||||
|
||||
Porting these patterns into boocode's agent ecosystem gives every agent working in this repo persistent session management, cross-session user correction awareness, and behavioral guideline enforcement — without building any of it from scratch.
|
||||
|
||||
## What Changes
|
||||
|
||||
### New Capabilities
|
||||
|
||||
- **Data Directory Convention**: `.boo/runs/` directory with buffer files, session dirs, `.current_session` handshake, unified `index.json`. `AUDIT_DOT_DIR` env var for platform override.
|
||||
|
||||
- **Session Lifecycle Commands**: `/start` creates named audit sessions with auto-recovery (L0+L2). `/end` flushes buffers, runs integrity checks, generates `session_summary.md`. `/recover` graded context loading (L0–L3). `/report-daily` aggregates all sessions into a 7-section report; `/report-daily review` also runs morning self-review.
|
||||
|
||||
- **User Correction Tracking**: Structured `user_correction` records with `original_claim`/`correction`/`principle_extracted`/`persisted_to`. Auto-detected on `/end`. Correction-as-precedent enforcement when agent actions contradict prior corrections.
|
||||
|
||||
- **Behavioral Guidelines Store**: Versioned GuidelineDocumentStore ported from Parlant with condition+action+description content model, tag/label filtering, and content-based `find_guideline()`. Journey → guideline auto-projection (SOP nodes → guidelines with follow-up edges). Journey backtrack detection batch.
|
||||
|
||||
### Dependencies
|
||||
|
||||
- The existing audit-harness patterns (audit-context.py, hooks, skills) serve as the reference implementation.
|
||||
- Parlant's GuidelineStore (guidelines.py) and JourneyStore (journeys.py) serve as the reference implementation.
|
||||
- No new external services. File-based JSONL storage (audit-harness pattern).
|
||||
@@ -0,0 +1,80 @@
|
||||
# Behavioral Guidelines Store — Spec
|
||||
|
||||
## Guideline Entity
|
||||
|
||||
```typescript
|
||||
interface GuidelineContent {
|
||||
condition: string; // When...
|
||||
action: string | null; // Then...
|
||||
description: string | null;
|
||||
}
|
||||
|
||||
interface Guideline {
|
||||
id: string;
|
||||
creationUtc: string;
|
||||
content: GuidelineContent;
|
||||
enabled: boolean;
|
||||
tags: string[];
|
||||
labels: string[];
|
||||
metadata: Record<string, unknown>;
|
||||
criticality: "low" | "medium" | "high";
|
||||
title: string | null;
|
||||
priority: number;
|
||||
}
|
||||
```
|
||||
|
||||
## GuidelineDocumentStore
|
||||
|
||||
File-based JSON store at `.boo/guidelines/`. Versioned with migration support.
|
||||
|
||||
Methods:
|
||||
- `createGuideline(condition, action?, description?, ...) → Guideline`
|
||||
- `listGuidelines(tags?, labels?) → Guideline[]`
|
||||
- `readGuideline(id) → Guideline`
|
||||
- `updateGuideline(id, params) → Guideline`
|
||||
- `deleteGuideline(id) → void`
|
||||
- `findGuideline(content: {condition, action?}) → Guideline`
|
||||
|
||||
Version migration chain (port from Parlant v0.1.0 → v0.11.0):
|
||||
- v0.1.0 → v0.2.0: add enabled field
|
||||
- v0.2.0 → v0.3.0: remove guideline_set (migration script only)
|
||||
- v0.3.0 → v0.4.0: add optional action, description, metadata
|
||||
- v0.4.0 → v0.5.0: description as optional
|
||||
- v0.5.0 → v0.6.0: add criticality (default "medium")
|
||||
- v0.6.0 → v0.7.0: add composition_mode (optional)
|
||||
- v0.7.0 → v0.8.0: add track (default true)
|
||||
- v0.8.0 → v0.9.0: add labels (default empty)
|
||||
- v0.9.0 → v0.10.0: add priority (default 0)
|
||||
- v0.10.0 → v0.11.0: add title (default null)
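A migration chain like the one above is typically run as a loop that applies one step at a time until the target version is reached. The sketch below covers only the steps whose defaults are stated in the list (v0.5.0 onward); `GuidelineDoc`, `addDefault`, `STEPS`, and `migrate` are hypothetical names, and the document shape is an assumption.

```typescript
type GuidelineDoc = { version: string; [key: string]: unknown };

interface MigrationStep {
  from: string;
  to: string;
  apply: (doc: GuidelineDoc) => GuidelineDoc;
}

// Adds `key` with a freshly made default unless the document already has it.
function addDefault(key: string, make: () => unknown): (doc: GuidelineDoc) => GuidelineDoc {
  return (doc) => (key in doc ? doc : { ...doc, [key]: make() });
}

// Only steps with a stated default are modelled; earlier steps are omitted here.
const STEPS: MigrationStep[] = [
  { from: "0.5.0", to: "0.6.0", apply: addDefault("criticality", () => "medium") },
  { from: "0.6.0", to: "0.7.0", apply: (doc) => doc }, // composition_mode is optional
  { from: "0.7.0", to: "0.8.0", apply: addDefault("track", () => true) },
  { from: "0.8.0", to: "0.9.0", apply: addDefault("labels", () => []) },
  { from: "0.9.0", to: "0.10.0", apply: addDefault("priority", () => 0) },
  { from: "0.10.0", to: "0.11.0", apply: addDefault("title", () => null) },
];

function migrate(doc: GuidelineDoc, target: string): GuidelineDoc {
  let current = doc;
  while (current.version !== target) {
    const from = current.version;
    const step = STEPS.find((s) => s.from === from);
    if (step === undefined) throw new Error("no migration path from " + from);
    current = { ...step.apply(current), version: step.to };
  }
  return current;
}
```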
|
||||
|
||||
## Tag & Label Filtering
|
||||
|
||||
- `listGuidelines({tags: ["tag1"]})` → guidelines with ANY of the specified tags
|
||||
- `listGuidelines({labels: ["label1"]})` → guidelines with ALL specified labels (subset match)
|
||||
- Combined: both filters apply (intersection)
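The filter semantics above (ANY for tags, ALL for labels, intersection when combined) can be sketched as a single predicate. `TaggedGuideline`, `ListFilter`, and `filterGuidelines` are hypothetical names; treating an empty or omitted filter array as "no filter" is an assumption the spec does not state.

```typescript
interface TaggedGuideline {
  id: string;
  tags: string[];
  labels: string[];
}

interface ListFilter {
  tags?: string[];
  labels?: string[];
}

function filterGuidelines(all: TaggedGuideline[], filter: ListFilter): TaggedGuideline[] {
  const tags = filter.tags === undefined ? [] : filter.tags;
  const labels = filter.labels === undefined ? [] : filter.labels;
  return all.filter((g) => {
    // Tags: match if the guideline carries ANY requested tag.
    const tagOk = tags.length === 0 || tags.some((t) => g.tags.indexOf(t) !== -1);
    // Labels: match only if the guideline carries ALL requested labels.
    const labelOk = labels.every((l) => g.labels.indexOf(l) !== -1);
    // Combined: both conditions must hold.
    return tagOk && labelOk;
  });
}
```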
|
||||
|
||||
## Journey → Guideline Projection
|
||||
|
||||
Port of Parlant's `JourneyGuidelineProjection.project_journey_to_guidelines()`:
|
||||
|
||||
- DFS traversal of Journey nodes from root
|
||||
- Each (edge, node) pair → one Guideline
|
||||
- Edge condition becomes guideline condition
|
||||
- Node action becomes guideline action
|
||||
- Edge/node metadata merged into guideline metadata with journey_node key
|
||||
- follow_ups list populated with downstream guideline IDs
|
||||
- A visited set prevents infinite loops on cyclic journeys
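The projection rules can be sketched as a depth-first walk with an explicit stack. This is an illustration under stated assumptions, not Parlant's actual code: the `journey_node:<node>:<edge>` ID scheme, the empty condition for the root, and the exact metadata layout are all hypothetical.

```typescript
interface JourneyNode { id: string; action: string | null; }
interface JourneyEdge { id: string; source: string; target: string; condition: string | null; }

interface ProjectedGuideline {
  id: string;
  condition: string;
  action: string | null;
  metadata: { journey_node: { nodeId: string; edgeId: string | null; followUps: string[] } };
}

// Hypothetical ID scheme: one guideline per (edge, node) pair.
function guidelineId(edgeId: string | null, nodeId: string): string {
  return "journey_node:" + nodeId + ":" + (edgeId === null ? "root" : edgeId);
}

function projectJourney(rootId: string, nodes: JourneyNode[], edges: JourneyEdge[]): ProjectedGuideline[] {
  const nodeById = new Map<string, JourneyNode>();
  for (const n of nodes) nodeById.set(n.id, n);

  const projected = new Map<string, ProjectedGuideline>(); // doubles as the visited set
  const stack: Array<{ edge: JourneyEdge | null; nodeId: string }> = [{ edge: null, nodeId: rootId }];

  while (stack.length > 0) {
    const { edge, nodeId } = stack.pop() as { edge: JourneyEdge | null; nodeId: string };
    const id = guidelineId(edge === null ? null : edge.id, nodeId);
    const node = nodeById.get(nodeId);
    if (node === undefined || projected.has(id)) continue; // unknown node or already visited

    const outgoing = edges.filter((e) => e.source === nodeId);
    projected.set(id, {
      id,
      condition: edge !== null && edge.condition !== null ? edge.condition : "",
      action: node.action,
      metadata: {
        journey_node: {
          nodeId,
          edgeId: edge === null ? null : edge.id,
          followUps: outgoing.map((e) => guidelineId(e.id, e.target)),
        },
      },
    });
    for (const e of outgoing) stack.push({ edge: e, nodeId: e.target });
  }
  return Array.from(projected.values());
}
```

Because the visited check is keyed on the (edge, node) pair, a node reached through two different edges produces two guidelines, while a cycle still terminates.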
|
||||
|
||||
## Journey Backtrack Detection
|
||||
|
||||
```typescript
|
||||
interface BacktrackCheck {
|
||||
journeyId: string;
|
||||
currentNodeId: string;
|
||||
previousNodeId: string;
|
||||
isBacktrack: boolean;
|
||||
recommendation: string | null;
|
||||
}
|
||||
```
|
||||
|
||||
Scans the edge list for source→target relationships. If the agent's current step has an edge back to a previously visited node (and that node is not in a forward path from current), it's flagged as a backtrack regression.
|
||||
@@ -0,0 +1,88 @@
|
||||
# Session Lifecycle Commands — Spec
|
||||
|
||||
## Overview
|
||||
|
||||
Four agent-invocable commands that manage audit session lifecycle. Each command is a skill markdown file loaded by the agent on invocation.
|
||||
|
||||
## /start
|
||||
|
||||
```
|
||||
/start "task description"
|
||||
```
|
||||
|
||||
Creates a named audit session:
|
||||
|
||||
1. Generate `session_id = adhoc_YYYYMMDD_HHMM`
|
||||
2. `mkdir -p .boo/runs/{session_id}`
|
||||
3. Write `session.json`:
|
||||
```json
|
||||
{
|
||||
"session_id": "adhoc_20260320_1400",
|
||||
"task": "task description",
|
||||
"start_time": "2026-03-20T14:00:00Z",
|
||||
"status": "in_progress",
|
||||
"expected_record_types": ["data", "change", "conversation"]
|
||||
}
|
||||
```
|
||||
4. Write `.boo/runs/.current_session` containing session_id (handshake for hooks)
|
||||
5. Run context recovery:
|
||||
- L0: read `index.json` → last 5 entries
|
||||
- L2: scan recent audit_trail.jsonl for `user_correction` records
|
||||
6. Output recovery summary: recent activity, corrections, priorities
|
||||
7. Check for unfinished sessions: scan for `status: "in_progress"` sessions, prompt user
|
||||
|
||||
## /end
|
||||
|
||||
```
|
||||
/end
|
||||
```
|
||||
|
||||
Ends the current audit session:
|
||||
|
||||
1. Read `.current_session` → get session_id
|
||||
2. Collect remaining buffer data from `audit_buffer.jsonl` + `audit_pending.jsonl`
|
||||
3. Append to `audit_trail.jsonl`
|
||||
4. Clear buffer files
|
||||
5. Extract `user_correction` records from audit_trail
|
||||
6. Run integrity checks:
|
||||
- Has records? (>0 audit_trail lines)
|
||||
- All files covered? (changes in audit_trail match modified files)
|
||||
- Corrections persisted? (persisted_to is non-empty)
|
||||
7. Generate `session_summary.md`
|
||||
8. Update `session.json` status=completed, end_time
|
||||
9. Clear `.current_session`
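The three integrity checks in step 6 can be sketched as one pure function over the flushed trail. `TrailRecord`, `IntegrityReport`, and `checkIntegrity` are hypothetical names, and the `file` field on change records is an assumption about the record shape.

```typescript
interface TrailRecord {
  record_type: string;     // "data" | "change" | "conversation"
  action_type?: string;
  file?: string;           // hypothetical: path touched by a "change" record
  persisted_to?: string[];
}

interface IntegrityReport {
  hasRecords: boolean;
  uncoveredFiles: string[];       // modified files with no matching change record
  unpersistedCorrections: number; // corrections whose persisted_to is empty
  ok: boolean;
}

function checkIntegrity(trail: TrailRecord[], modifiedFiles: string[]): IntegrityReport {
  const covered = new Set<string>();
  for (const r of trail) {
    if (r.record_type === "change" && r.file !== undefined) covered.add(r.file);
  }
  const uncoveredFiles = modifiedFiles.filter((f) => !covered.has(f));
  const unpersistedCorrections = trail.filter(
    (r) =>
      r.action_type === "user_correction" &&
      (r.persisted_to === undefined || r.persisted_to.length === 0),
  ).length;
  const hasRecords = trail.length > 0;
  return {
    hasRecords,
    uncoveredFiles,
    unpersistedCorrections,
    ok: hasRecords && uncoveredFiles.length === 0 && unpersistedCorrections === 0,
  };
}
```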
|
||||
|
||||
## /recover
|
||||
|
||||
```
|
||||
/recover # L0+L1+L2
|
||||
/recover full # L3 (full audit_trail)
|
||||
/recover {session_id} # load specific session
|
||||
```
|
||||
|
||||
Graded context loading:
|
||||
|
||||
- L0 (~200t): index.json → last 5 entries (id, task, status)
|
||||
- L1 (~500t): .current_session + session.json + last 3 audit_trail entries
|
||||
- L2 (~1000t): scan all audit_trails for user_correction records + conclusions + daily report §4+§6
|
||||
- L3 (~3000t): full audit_trail.jsonl + audit_pending.jsonl
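The mapping from `/recover` arguments to levels can be sketched as a small planner. `planRecover` and `RecoverPlan` are hypothetical names; the token figures are the approximate budgets listed above, and reading `/recover full` as L3 only follows the usage comment literally and may not match the intended behavior.

```typescript
type RecoveryLevel = "L0" | "L1" | "L2" | "L3";

// Approximate token budgets as listed in the spec.
const APPROX_TOKENS: { [K in RecoveryLevel]: number } = { L0: 200, L1: 500, L2: 1000, L3: 3000 };

type RecoverPlan =
  | { kind: "graded"; levels: RecoveryLevel[]; approxTokens: number }
  | { kind: "session"; sessionId: string };

function planRecover(arg?: string): RecoverPlan {
  // Any argument other than "full" is treated as a session ID to load.
  if (arg !== undefined && arg !== "" && arg !== "full") {
    return { kind: "session", sessionId: arg };
  }
  const levels: RecoveryLevel[] = arg === "full" ? ["L3"] : ["L0", "L1", "L2"];
  let approxTokens = 0;
  for (const level of levels) approxTokens += APPROX_TOKENS[level];
  return { kind: "graded", levels, approxTokens };
}
```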
|
||||
|
||||
## /report-daily
|
||||
|
||||
```
|
||||
/report-daily # today
|
||||
/report-daily 20260319 # specific date
|
||||
/report-daily review # + morning self-review
|
||||
```
|
||||
|
||||
7-section report:
|
||||
|
||||
1. Task overview (from index.json)
|
||||
2. Operation stats (tool counts)
|
||||
3. Change records (file modifications)
|
||||
4. User feedback & corrections
|
||||
5. Anomaly alerts
|
||||
6. Backlog tracking
|
||||
7. Integrity summary
|
||||
|
||||
`review` variant: adds morning self-review with trend analysis and recommended priorities.
|
||||
@@ -0,0 +1,42 @@
|
||||
# User Correction Tracking — Spec
|
||||
|
||||
## Record Schema
|
||||
|
||||
```typescript
|
||||
interface UserCorrectionRecord {
|
||||
record_type: "conversation";
|
||||
action_type: "user_correction";
|
||||
priority: "critical_for_recovery";
|
||||
timestamp: string; // ISO 8601
|
||||
original_claim: string; // what the agent said that was wrong
|
||||
correction: string; // what the user corrected it to
|
||||
principle_extracted: string; // general principle derived from this correction
|
||||
persisted_to: string[]; // files where this correction was documented
|
||||
}
|
||||
```
|
||||
|
||||
## Storage
|
||||
|
||||
User correction records are stored inline in `audit_trail.jsonl` as regular entries. They are extracted during `/end` and surfaced during `/recover` L2 loading.
|
||||
|
||||
## Detection
|
||||
|
||||
During `/end`, scan the session's `audit_trail.jsonl` for entries matching:
|
||||
- `action_type === "user_correction"`
|
||||
|
||||
Also scan `audit_pending.jsonl` for any pending correction records not yet flushed.
|
||||
|
||||
## persisted_to Field
|
||||
|
||||
When a correction is written to CLAUDE.md, coding standards, or other documentation, the file paths are recorded in `persisted_to[]`. This is populated manually by the agent when it persists the correction.
|
||||
|
||||
## Correction-as-Precedent
|
||||
|
||||
When an agent considers an action that contradicts a known `user_correction` record, the action is flagged with a warning. The agent should:
|
||||
|
||||
1. Identify the contradiction (which rule is being violated)
|
||||
2. Surface the relevant correction record (with timestamp and original context)
|
||||
3. Propose an alternative that respects the correction
|
||||
4. If the contradiction is intentional, document why as a new correction
|
||||
|
||||
Detection logic: before each significant action, the agent scans loaded user_correction records from the current recovery context and checks if the proposed action matches any known `original_claim` pattern.
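The spec does not define what "matches" means, so the sketch below assumes the simplest possible rule: a case-insensitive substring test of each `original_claim` against the proposed action. `CorrectionLite` and `findContradictions` are hypothetical names, and a real implementation would likely need fuzzier matching.

```typescript
interface CorrectionLite {
  timestamp: string;
  original_claim: string;
  correction: string;
}

// Returns the corrections whose original_claim text appears in the proposed action.
function findContradictions(proposedAction: string, corrections: CorrectionLite[]): CorrectionLite[] {
  const action = proposedAction.toLowerCase();
  return corrections.filter((c) => {
    const claim = c.original_claim.trim().toLowerCase();
    // Empty claims are ignored; otherwise they would match every action.
    return claim.length > 0 && action.indexOf(claim) !== -1;
  });
}
```

Returning the full records rather than a boolean lets the agent surface the timestamp and correction text, as steps 1 and 2 above require.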
|
||||
@@ -0,0 +1,39 @@
|
||||
# port-audit-parlant-patterns — Implementation Complete
|
||||
|
||||
## boocontext (TypeScript) — src/audit/
|
||||
- [x] 1. Data Dir: `dotDir()`, `findRunsDir()`, `ensureRunsDir()` with .gitignore + AUDIT_DOT_DIR
|
||||
- [x] 2. Core Types: `RecordEntry`, `CompactRecord`, `Manifest`, `UserCorrectionRecord`, `SessionJson`, `SessionSummary`
|
||||
- [x] 3. Hash Utilities: `hashFile()`, `hashBytes()`, `hashDir()` via Node crypto SHA256
|
||||
- [x] 4. Anomaly: `AlertRule`, `Anomaly`, `checkAnomalies()` with default rules
|
||||
- [x] 5. AuditContext: `createBatchContext()` -> `record()` -> `recordCompact()` -> `finalize()` -> `save()` (writes manifest, trail, compact, anomalies, checksums, index)
|
||||
- [x] 6. AmbientContext: `AsyncLocalStorage` wrapper — `runWithAmbient()`, `getAmbientSession()`, `requireAmbientSession()`
|
||||
- [x] 7. Guideline Model: `GuidelineContent`, `Guideline`, `GuidelineStore`, `InMemoryGuidelineStore` with CRUD + tag/label filters
|
||||
- [x] 8. Guideline Matching: `MatchingContext`, `MatchingBatch` (Observational, Actionable, PreviouslyApplied, Disambiguation, ResponseAnalysis, LowCriticality), `GenericGuidelineMatchingStrategy`, retry policy
|
||||
- [x] 9. ARQ Generation: `SchematicGenerator`, typed output schemas per batch, `GenerationInfo` tracking, `createExecutionPlan()` with batch-parallel
|
||||
- [x] 10. Relationship Model: `RelationshipKind` (DEPENDS_ON, PRIORITIZES, ENTAILS, TAG_ALL, TAG_PRIORITIZES), `FileRelationshipStore`
|
||||
- [x] 11. Relational Resolver: 4-step iteration loop (deps -> prioritization -> priority -> entailment), `MAX_ITERATIONS=100`, `ResolutionKind` output
|
||||
- [x] 12. Graded Recovery: `recoverL0()`–`recoverL4()`, `scanUserCorrections()`, `formatRecoveryReport()` with source attribution
|
||||
- [x] 13. User Corrections: `detectCorrections()`, `addPersistedTarget()`, `findRelatedCorrections()`, `checkContradiction()`
|
||||
- [x] 14. Index: `readIndex()`, `writeIndex()` with atomic `.tmp` + `renameSync`
|
||||
- [x] 15. MCP Tools: `boocontext_audit_index` + `boocontext_audit_recover` registered in mcp-server.ts
|
||||
- [x] 16. Typecheck: `npx tsc --noEmit` passes clean
|
||||
|
||||
## codecontext (Go) — internal/audit/ + internal/mcp/
|
||||
- [x] 1. Record Types: `RecordEntry`, `CompactRecord`, `RecordStep`/`RecordAction` enums (pre-existing)
|
||||
- [x] 2. Index: `UpdateIndexEntry()` with idempotent upsert, `IndexEntry` schema, atomic `.tmp` + `os.Rename()` (pre-existing)
|
||||
- [x] 3. Hashchain: `HashFile()`, `HashBytes()`, `HashDir()`, `VerifyHashchain()` with `HashchainVerificationError` (pre-existing)
|
||||
- [x] 4. Directory: `DotDir()`, `RunsDir()`, `EnsureRunsDir()` with .gitignore + `AUDIT_DOT_DIR` (pre-existing)
|
||||
- [x] 5. Anomaly: `AlertRule`, `Anomaly`, `Manifest` types + `CheckAnomalies()` with condition evaluation (pre-existing stub, now evaluates total_records/error_rate/hash conditions)
|
||||
- [x] 6. GenerateChecksums: per-file SHA256 manifest (pre-existing)
|
||||
- [x] 7. Session Lifecycle: `SessionLifecycleManager` with `StartSession(task)`, `EndSession()`, `CurrentSession()` — creates adhoc session, writes .current_session, updates index
|
||||
- [x] 8. Trail Management: `TrailManager` with `AppendToBuffer()`, `PendingAppend()`, `AppendToTrail()`, `ReadTrail()`, `FlushBuffer()` — auto-generates session if none active
|
||||
- [x] 9. MCP Audit Tools: `codecontext_audit_start`, `codecontext_audit_end`, `codecontext_audit_status` in `internal/mcp/audit_tools.go`
|
||||
- [x] 10. MCP Middleware Hooks: `recordAuditBuffer()` in server struct, buffer after tool calls, flush on "ready"
|
||||
- [x] 11. Build: `go build ./...` passes clean
|
||||
|
||||
## boocode (Node.js) — apps/coder/src/services/
|
||||
- [x] 1. Session Service (`audit-session.ts`): `startSession()` with L0+L2 recovery, `endSession()` with integrity checks + session_summary.md, `recoverSession()` L0-L3 graded loading, `generateDailyReport()` 7-section report
|
||||
- [x] 2. Correction Service (`correction-service.ts`): `recordCorrection()`, `scanForCorrections()`, `checkContradiction()`, `markPersisted()` — JSON store at `.boo/corrections/`
|
||||
- [x] 3. Guideline Service (`guideline-service.ts`): `createGuideline()`, `listGuidelines()` with tag/label filters, version migration chain (v0.1.0->v0.11.0), `projectJourneyToGuidelines()` DFS, `checkBacktrack()` — JSON store at `.boo/guidelines/`
|
||||
- [x] 4. Skill commands: `command-start/SKILL.md`, `command-end/SKILL.md`, `command-recover/SKILL.md`, `command-report-daily/SKILL.md`
|
||||
- [x] 5. Typecheck: `pnpm -C apps/coder typecheck` passes clean
|
||||
@@ -1,4 +0,0 @@
|
||||
# v1.13.12-skills-audit
|
||||
|
||||
**Status:** Shipped. Archived.
|
||||
|
||||
@@ -1,4 +0,0 @@
|
||||
# v1.13.15-codecontext-synth
|
||||
|
||||
**Status:** Shipped. Archived.
|
||||
|
||||
@@ -1,4 +0,0 @@
|
||||
# v1.13.17-cross-repo-reads
|
||||
|
||||
**Status:** Shipped. Archived.
|
||||
|
||||
@@ -1,4 +0,0 @@
|
||||
# v1.13.18-codecontext-file-path
|
||||
|
||||
**Status:** Shipped. Archived.
|
||||
|
||||
@@ -1,4 +0,0 @@
|
||||
# v1.13.20-drop-legacy-cols
|
||||
|
||||
**Status:** Shipped. Archived.
|
||||
|
||||
@@ -1,4 +0,0 @@
|
||||
# v1.14-outer-loop
|
||||
|
||||
**Status:** Shipped. Archived.
|
||||
|
||||
@@ -1,4 +0,0 @@
|
||||
# v1.14.1-mcp-poc
|
||||
|
||||
**Status:** Shipped. Archived.
|
||||
|
||||
@@ -1,4 +0,0 @@
|
||||
# v1.14.x-html-artifact-panes
|
||||
|
||||
**Status:** Shipped. Archived.
|
||||
|
||||
@@ -1,4 +0,0 @@
|
||||
# v1.15-mcp-multi
|
||||
|
||||
**Status:** Shipped. Archived.
|
||||
|
||||
@@ -1,4 +0,0 @@
|
||||
# v2.0-boocoder
|
||||
|
||||
**Status:** Shipped. Archived.
|
||||
|
||||
@@ -1,5 +0,0 @@
|
||||
# v2.2-paseo-providers
|
||||
|
||||
**Status:** Shipped (`v2.2-paseo-providers`, `v2.2.1-pane-scoped-chats`). Archived.
|
||||
|
||||
Follow-up fixes shipped as `v2.2.1-pane-scoped-chats` (pane-scoped chats, tool UI, WS delta, inference payload).
|
||||