chore(openspec): drop 9 superseded proposals + 11 stub archive files

Drop 9 batch proposals that are superseded by the boocode-lift-analysis
(boocontext-audit, conductor upgrades, self-healing/verify-gate skills):
add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform,
conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul,
agent-reliability.

Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only)
that provide zero documentation value over the existing CHANGELOG.md + git tags.
2026-06-07 22:15:38 +00:00
parent 0d6e9a2413
commit c935687725
119 changed files with 4897 additions and 45 deletions

View File

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-06-07

View File

@@ -0,0 +1,32 @@
## Context
BooCode has no structured behavioral enforcement. Agent behavior is guided by system prompts and CLAUDE.md — advisory, not enforceable. The `boocontext-audit` package (already TypeScript, already in /opt/forks) provides a complete behavioral compliance engine: Guideline model, 6-batch matcher, relational resolver, audit trail, and graded recovery.
## Goals / Non-Goals
**Goals:**
- Import boocontext-audit's Guideline model (condition/action rules with criticality)
- Import multi-batch matcher (Observational, Actionable, PreviouslyApplied, Disambiguation, ResponseAnalysis, LowCriticality)
- Import RelationalResolver (DEPENDS_ON, PRIORITIZES, ENTAILS, TAG_ALL, TAG_PRIORITIZES)
- Import audit middleware (PostToolUse, Stop, UserPromptSubmit hooks)
- Import graded context recovery (L0-L4)
- Wire guideline evaluation into agent's inference loop
**Non-Goals:**
- Journey DAG integration (future scope)
- MCP middleware integration (focus on in-process hooks)
## Decisions
- **Direct import from local fork**: boocontext-audit is at `/opt/forks/boocontext-audit/`. Use workspace dependency or npm link.
- **Guideline storage**: InMemoryGuidelineStore for development, FileRelationshipStore for production.
- **Batch execution**: Run observational + actionable batches in parallel, then disambiguation, then response analysis.
- **SchematicGenerator**: Abstract LLM caller. Configure per-batch model (use cheap model for matching, expensive for disambiguation).
- **Audit hooks**: Wire PostToolUse → appendToBuffer(), Stop → flushBuffer(), UserPromptSubmit → injectSessionContext().
- **Recovery**: Load L0 (index) by default. L2 (user corrections) on /recover. L3 (full) on /recover full.
## Risks / Trade-offs
- **LLM overhead**: Each batch is an LLM call. 6 batches × N guidelines could be expensive. Mitigation: batch size limits, parallel execution.
- **Cold start**: No guidelines exist initially. Users must define them. Ship with 5-10 built-in safety guidelines.
- **boocontext-audit maturity**: v0.1.0. Review code quality before direct import.

View File

@@ -0,0 +1,22 @@
## Why
BooCode has no structured way to enforce agent behavior rules. The `boocontext-audit` package (already TypeScript, zero external deps) provides a complete behavioral compliance engine ported from Parlant: Guideline condition/action model, multi-batch LLM matcher, relational resolver, audit middleware, and graded context recovery. Adding this gives BooCode structured rule enforcement far beyond simple CLAUDE.md guidelines.
## What Changes
- Import boocontext-audit as a dependency in apps/coder/
- Add Guideline model: natural language condition/action rules with criticality
- Add multi-batch matcher: observational, actionable, previously-applied, disambiguation, response analysis batches
- Add RelationalResolver: DEPENDS_ON, PRIORITIZES, ENTAILS, TAG_ALL relationship resolution
- Add audit middleware: PostToolUse/Stop/UserPromptSubmit hooks with JSONL buffer
- Add graded context recovery: L0-L4 recovery levels
- Wire guideline evaluation into agent's inference loop
## Capabilities
### New Capabilities
- `guideline-model`: Natural language condition/action rules with criticality and priority
- `multi-batch-matcher`: 6-batch LLM evaluation for context-relevant rule matching
- `relational-resolver`: Dependency/priority/entailment resolution with iterative convergence
- `audit-middleware`: PostToolUse/Stop/UserPromptSubmit hooks with JSONL trail
- `graded-recovery`: L0-L4 context recovery for session continuity

View File

@@ -0,0 +1,21 @@
## ADDED Requirements
### Requirement: PostToolUse audit logging
- **WHEN** a tool is used
- **THEN** the tool name, input summary, and timestamp are appended to the JSONL audit buffer
### Requirement: Stop hook flush
- **WHEN** a response completes
- **THEN** the audit buffer is flushed to the session audit trail and index is updated
### Requirement: UserPromptSubmit context injection
- **WHEN** a user message is submitted
- **THEN** session context (session ID, record count, critical alerts) is injected into the prompt
### Requirement: Anomaly detection
- **WHEN** audit records are checked against alert rules
- **THEN** anomalies at CRITICAL level are injected into the context
#### Scenario: Full audit trail
- **WHEN** an agent runs 10 tool calls across 3 turns
- **THEN** the audit trail contains 10 JSONL records, a session summary, and an updated index
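The append/flush cycle described by these requirements can be sketched as follows. This is a minimal illustration, not the boocontext-audit API: `AuditRecord`, `AuditBuffer`, and their members are hypothetical names, and the session summary and index update are left out.

```typescript
// Hypothetical record shape: tool name, input summary, timestamp.
interface AuditRecord {
  tool: string;
  inputSummary: string;
  timestamp: string; // ISO 8601
}

class AuditBuffer {
  private lines: string[] = [];

  // PostToolUse: serialize one record as a single JSONL line.
  append(record: AuditRecord): void {
    this.lines.push(JSON.stringify(record));
  }

  // Stop: hand back everything buffered so far and reset the buffer.
  flush(): string {
    const out = this.lines.join("\n");
    this.lines = [];
    return out;
  }

  get size(): number {
    return this.lines.length;
  }
}

// Ten tool calls, then one flush, as in the scenario above.
const buffer = new AuditBuffer();
for (let i = 0; i < 10; i++) {
  buffer.append({
    tool: "Read",
    inputSummary: `file-${i}.ts`,
    timestamp: new Date(0).toISOString(),
  });
}
const trail = buffer.flush();
```

Keeping one JSON object per line means the trail can be appended to a file without re-parsing what is already there.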

View File

@@ -0,0 +1,25 @@
## ADDED Requirements
### Requirement: L0 recovery (index summary)
- **WHEN** /recover is called without arguments
- **THEN** the last 5 index entries are loaded (~200 tokens)
### Requirement: L1 recovery (session state)
- **WHEN** /recover L1 is called
- **THEN** current session.json + last 3 audit trail entries are loaded (~500 tokens)
### Requirement: L2 recovery (user corrections)
- **WHEN** /recover L2 is called
- **THEN** ALL user_correction records across all sessions are loaded (~1000 tokens)
### Requirement: L3 recovery (full context)
- **WHEN** /recover L3 is called
- **THEN** full audit trail + all pending records are loaded (~3000 tokens)
### Requirement: Priority loading
- **WHEN** recovering context
- **THEN** user_correction records are loaded first (highest priority)
#### Scenario: Session crash recovery
- **WHEN** an agent session crashes and restarts with /recover
- **THEN** the agent gets the index summary, last session state, and all user corrections
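One way to picture the L0-L3 requirements is as a lookup from command argument to a load plan. The sketch below assumes hypothetical names (`RecoveryPlan`, `RECOVERY_PLANS`, `parseRecoverCommand`); the sources and token estimates are copied from the requirements above, and a bare `/recover` is mapped to L0 as the first requirement states.

```typescript
type RecoveryLevel = "L0" | "L1" | "L2" | "L3";

interface RecoveryPlan {
  sources: string[];
  approxTokens: number;
}

const RECOVERY_PLANS: Record<RecoveryLevel, RecoveryPlan> = {
  L0: { sources: ["last 5 index entries"], approxTokens: 200 },
  L1: { sources: ["session.json", "last 3 audit trail entries"], approxTokens: 500 },
  L2: { sources: ["all user_correction records"], approxTokens: 1000 },
  L3: { sources: ["full audit trail", "all pending records"], approxTokens: 3000 },
};

const LEVELS: readonly RecoveryLevel[] = ["L0", "L1", "L2", "L3"];

// Returns undefined for anything that is not a valid /recover command.
function parseRecoverCommand(input: string): RecoveryPlan | undefined {
  const [command, arg] = input.trim().split(/\s+/);
  if (command !== "/recover") return undefined;
  const level = LEVELS.find((l) => l === (arg ?? "L0"));
  return level === undefined ? undefined : RECOVERY_PLANS[level];
}
```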

View File

@@ -0,0 +1,17 @@
## ADDED Requirements
### Requirement: Guideline creation
- **WHEN** creating a guideline with condition, action, and criticality
- **THEN** it is stored with unique ID and metadata
### Requirement: Guideline evaluation
- **WHEN** an agent action triggers guideline evaluation
- **THEN** matching guidelines are activated with score and rationale
### Requirement: Criticality levels
- **WHEN** evaluating guidelines
- **THEN** guidelines are filtered by criticality (low/medium/high/critical) with higher-criticality taking precedence
#### Scenario: Security policy enforcement
- **WHEN** an agent attempts to edit a file matching a security guideline condition
- **THEN** the guideline matcher returns the relevant rule with CRITICAL severity
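A guideline with a condition, an action, and a criticality could be modeled roughly as below. The field names and the `filterByCriticality` helper are assumptions for illustration; the actual boocontext-audit types may differ.

```typescript
type Criticality = "low" | "medium" | "high" | "critical";

// Hypothetical shape of a stored guideline.
interface Guideline {
  id: string;
  condition: string; // natural-language trigger
  action: string; // natural-language instruction
  criticality: Criticality;
}

const RANK: Record<Criticality, number> = { low: 0, medium: 1, high: 2, critical: 3 };

// Keep guidelines at or above `min`, highest criticality first.
function filterByCriticality(guidelines: Guideline[], min: Criticality): Guideline[] {
  return guidelines
    .filter((g) => RANK[g.criticality] >= RANK[min])
    .sort((a, b) => RANK[b.criticality] - RANK[a.criticality]);
}

const sample: Guideline[] = [
  { id: "g1", condition: "editing docs", action: "keep headings stable", criticality: "low" },
  { id: "g2", condition: "editing auth code", action: "require review", criticality: "critical" },
  { id: "g3", condition: "changing CI config", action: "explain the change", criticality: "high" },
];
const important = filterByCriticality(sample, "high");
```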

View File

@@ -0,0 +1,17 @@
## ADDED Requirements
### Requirement: Six batch types
- **WHEN** guidelines are evaluated
- **THEN** they are processed through: Observational, Actionable, PreviouslyApplied, Disambiguation, ResponseAnalysis, and LowCriticality batches
### Requirement: Parallel batch execution
- **WHEN** independent batches are ready
- **THEN** they execute in parallel (observational + actionable run concurrently)
### Requirement: Structured LLM output per batch
- **WHEN** a batch calls the LLM
- **THEN** it uses a structured schema specific to the batch type (e.g., applies: boolean for actionable, was_followed: boolean for response analysis)
#### Scenario: Multi-rule evaluation
- **WHEN** an agent action matches 3 guidelines across different criticalities
- **THEN** the matcher returns all applicable matches with scores, with CRITICAL matches flagged
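The parallel-execution requirement can be illustrated with `Promise.all`. In this sketch `evaluateBatch` is a stand-in for an LLM call, and the ordering of the later stages (disambiguation, then response analysis) is an assumption; where PreviouslyApplied and LowCriticality fit is not stated here, so they are omitted.

```typescript
type BatchType = "Observational" | "Actionable" | "Disambiguation" | "ResponseAnalysis";

// Records the order in which batches begin, for illustration only.
const started: BatchType[] = [];

// Stand-in for an LLM call; a real evaluator would return structured output.
async function evaluateBatch(batch: BatchType): Promise<BatchType> {
  started.push(batch);
  return batch;
}

async function runBatches(): Promise<BatchType[]> {
  // Independent batches are started together and awaited as a group.
  const first = await Promise.all([
    evaluateBatch("Observational"),
    evaluateBatch("Actionable"),
  ]);
  // Dependent batches follow in sequence.
  const disambiguation = await evaluateBatch("Disambiguation");
  const analysis = await evaluateBatch("ResponseAnalysis");
  return [...first, disambiguation, analysis];
}

const pending = runBatches();
```

Both independent batches have already started by the time `runBatches()` returns its promise, which is what lets their latencies overlap.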

View File

@@ -0,0 +1,21 @@
## ADDED Requirements
### Requirement: DEPENDS_ON resolution
- **WHEN** guideline A depends on guideline B
- **THEN** B is activated if A is activated
### Requirement: PRIORITIZES resolution
- **WHEN** guideline A prioritizes over guideline B
- **THEN** B is filtered out if both match
### Requirement: ENTAILS resolution
- **WHEN** guideline A entails guideline B
- **THEN** B is automatically activated when A is activated
### Requirement: Iterative convergence
- **WHEN** resolving relationships
- **THEN** the resolver iterates (max 100 iterations) until a pass produces no further changes
#### Scenario: Conflicting guideline resolution
- **WHEN** a HIGH priority guideline matches and a LOW priority guideline also matches
- **THEN** the LOW priority guideline is filtered out via numerical priority resolution
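The resolution rules above amount to a fixed-point loop over the active set. The following is a simplified sketch with hypothetical names (`Relation`, `resolve`), not the RelationalResolver implementation; it follows the requirements as written, so DEPENDS_ON and ENTAILS both activate B when A is active, and it leaves out the TAG_* relations.

```typescript
type RelationKind = "DEPENDS_ON" | "PRIORITIZES" | "ENTAILS";

interface Relation {
  kind: RelationKind;
  source: string; // guideline A
  target: string; // guideline B
}

function resolve(matched: string[], relations: Relation[], maxIterations = 100): string[] {
  const active = new Set(matched);
  for (let i = 0; i < maxIterations; i++) {
    let changed = false;
    for (const r of relations) {
      if (!active.has(r.source)) continue;
      if (r.kind === "PRIORITIZES") {
        // A outranks B: drop B when both are active.
        if (active.delete(r.target)) changed = true;
      } else if (!active.has(r.target)) {
        // DEPENDS_ON and ENTAILS both pull B into the active set.
        active.add(r.target);
        changed = true;
      }
    }
    if (!changed) break; // stable state reached
  }
  return Array.from(active).sort();
}

const chained = resolve(["a"], [
  { kind: "ENTAILS", source: "a", target: "b" },
  { kind: "DEPENDS_ON", source: "b", target: "c" },
]);
const filtered = resolve(["high", "low"], [
  { kind: "PRIORITIZES", source: "high", target: "low" },
]);
```

The iteration cap matters when rules contradict each other (one relation activating a guideline that another keeps removing): the loop would never settle, so it stops after `maxIterations` passes.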

View File

@@ -0,0 +1,56 @@
## 1. Import boocontext-audit as dependency
- [ ] 1.1 Add boocontext-audit as workspace dependency
- [ ] 1.2 Verify Guideline, GuidelineStore, SchematicGenerator exports
## 2. Implement Guideline model
- [ ] 2.1 Create GuidelineManager wrapping GuidelineStore
- [ ] 2.2 Add CRUD operations for guidelines (create, read, update, delete, list)
- [ ] 2.3 Add InMemoryGuidelineStore and FileRelationshipStore backends
- [ ] 2.4 Add criticality filtering and priority sorting
## 3. Implement multi-batch matcher
- [ ] 3.1 Create MatcherService wrapping GenericGuidelineMatchingStrategy
- [ ] 3.2 Add Observational, Actionable, PreviouslyApplied, Disambiguation, ResponseAnalysis, LowCriticality batch types
- [ ] 3.3 Add parallel batch execution for independent batches
- [ ] 3.4 Add SchematicGenerator abstraction for LLM batch calls
## 4. Implement RelationalResolver
- [ ] 4.1 Create ResolverService wrapping RelationalResolver
- [ ] 4.2 Implement DEPENDS_ON, PRIORITIZES, ENTAILS, TAG_ALL, TAG_PRIORITIZES resolution
- [ ] 4.3 Add iterative convergence loop (max 100 iterations)
- [ ] 4.4 Add resolution logging
## 5. Implement audit middleware
- [ ] 5.1 Create AuditService with PostToolUse middleware (JSONL buffer append)
- [ ] 5.2 Add Stop middleware (buffer flush to session trail)
- [ ] 5.3 Add UserPromptSubmit middleware (session context injection + CRITICAL alerts)
- [ ] 5.4 Wire audit middleware into agent's inference lifecycle
## 6. Implement graded context recovery
- [ ] 6.1 Create RecoveryService with L0-L4 recovery methods
- [ ] 6.2 Implement L0: read last 5 index entries
- [ ] 6.3 Implement L1: session.json + last 3 audit trail entries
- [ ] 6.4 Implement L2: all user_correction records
- [ ] 6.5 Implement L3: full audit trail
- [ ] 6.6 Add priority loading (user corrections first)
## 7. Wire into agent inference loop
- [ ] 7.1 Run guideline evaluation before each agent turn
- [ ] 7.2 Inject active guidelines into system prompt
- [ ] 7.3 Record guideline matches in turn metadata
- [ ] 7.4 Add guideline management commands (add-guideline, list-guidelines, remove-guideline)
## 8. Test and verify
- [ ] 8.1 Test guideline creation and storage
- [ ] 8.2 Test multi-batch matching with sample guidelines
- [ ] 8.3 Test relational resolution with dependencies
- [ ] 8.4 Test audit middleware tool logging
- [ ] 8.5 Test graded recovery at all levels

View File

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-06-07

View File

@@ -0,0 +1,28 @@
## Context
BooCode has 0% TypeScript type recovery. When agents read files, they get raw text without type signatures. The type-inject project provides a published MCP server and hooks that extract TypeScript types and inject them contextually.
## Goals / Non-Goals
**Goals:**
- Add `@nick-vi/type-inject-mcp` as MCP server in BooCode config
- Add auto-type-injection on file reads (Read tool hook)
- Add type-check feedback on file writes (Write tool hook)
- Add `lookup_type` and `list_types` tools for agents
**Non-Goals:**
- Type extraction for non-TypeScript languages (future scope)
- Full ts-morph project analysis (type-inject handles this)
## Decisions
- **MCP server registration**: One-line addition to mcpServers config: `npx -y @nick-vi/type-inject-mcp`
- **Read hook**: Register a PostToolUse hook for `Read` tool that pipes content through type-inject
- **Write hook**: Register a PostToolUse hook for `Write`/`Edit` tool that runs type checker
- **Token budget**: Configure `maxTokens: 2000`, `skipBarrelFiles: true`, `onlyUsed: true` defaults
- **Published package**: No local fork needed. Use published npm package.
## Risks / Trade-offs
- **Latency**: Type extraction adds ~200-500ms per file read. Token budget limits prevent runaway costs.
- **Accuracy**: ts-morph-based extraction is accurate but may miss dynamic types. Acceptable trade-off.

View File

@@ -0,0 +1,18 @@
## Why
BooCode's codecontext sidecar has 0% TypeScript type recovery — it cannot provide type signatures when the AI reads files. The `type-inject` project provides a published MCP server (`@nick-vi/type-inject-mcp`) that extracts TypeScript types, interfaces, function signatures from source files and injects them on file reads. Adding it to BooCode's MCP configuration directly solves the type blindness problem.
## What Changes
- Add `@nick-vi/type-inject-mcp` as an MCP server in BooCode's server config
- Add type-inject hooks for PostToolUse on Read (auto-inject types) and Write (type-check feedback)
- Add `lookup_type` and `list_types` tools available to agents
- Configure token budget and filtering options (onlyUsed, maxTokens, skipBarrelFiles)
## Capabilities
### New Capabilities
- `type-inject-mcp-server`: Register type-inject as MCP server in BooCode config
- `auto-type-injection`: Hook type signatures into file reads automatically
- `type-check-on-write`: Run type checker after file edits and report errors
- `type-lookup-tools`: Add `lookup_type` and `list_types` MCP tools for agents

View File

@@ -0,0 +1,17 @@
## ADDED Requirements
### Requirement: Type injection on file Read
- **WHEN** an agent reads a TypeScript file
- **THEN** type signatures for exported types/functions/interfaces are appended to the file content
### Requirement: Configurable injection scope
- **WHEN** configuring type injection
- **THEN** settings control: onlyUsed, skipBarrelFiles, maxTokens, includeJSDoc, importDepth
### Requirement: Token budget enforcement
- **WHEN** type signatures exceed maxTokens
- **THEN** signatures are prioritized (used types first, exported over private) and truncated
#### Scenario: Reading a React component file
- **WHEN** an agent reads a .tsx file
- **THEN** component props interface, exported functions, and type aliases are injected
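Token budget enforcement could look roughly like this: rank signatures, then keep them until the budget is spent. The names (`TypeSignature`, `applyBudget`) and the four-characters-per-token estimate are assumptions for illustration, not type-inject's actual tokenizer or API.

```typescript
interface TypeSignature {
  name: string;
  text: string;
  used: boolean;
  exported: boolean;
}

// Rough estimate of about 4 characters per token (an assumption).
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function applyBudget(signatures: TypeSignature[], maxTokens: number): TypeSignature[] {
  // Used types outrank unused ones; exported outranks private.
  const score = (s: TypeSignature): number => (s.used ? 2 : 0) + (s.exported ? 1 : 0);
  const ranked = signatures.slice().sort((a, b) => score(b) - score(a));

  const kept: TypeSignature[] = [];
  let spent = 0;
  for (const s of ranked) {
    const cost = estimateTokens(s.text);
    if (spent + cost > maxTokens) break; // truncate once the budget is exhausted
    kept.push(s);
    spent += cost;
  }
  return kept;
}

const body = "x".repeat(40); // 40 characters, roughly 10 tokens
const withinBudget = applyBudget(
  [
    { name: "A", text: body, used: false, exported: false },
    { name: "B", text: body, used: true, exported: true },
    { name: "C", text: body, used: false, exported: true },
  ],
  20,
);
```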

View File

@@ -0,0 +1,9 @@
## ADDED Requirements
### Requirement: Type check on file Write/Edit
- **WHEN** an agent writes or edits a TypeScript file
- **THEN** the type checker runs and reports errors to the agent
### Requirement: Error reporting format
- **WHEN** type errors are detected
- **THEN** they are reported with file path, line number, error message, and error code
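A report carrying those four fields might be shaped as below. The `TypeCheckError` interface and the output layout are illustrative assumptions; the format type-inject actually emits is not specified here.

```typescript
interface TypeCheckError {
  file: string;
  line: number;
  message: string;
  code: number; // TypeScript diagnostic code, e.g. 2322
}

// One line per error: path, line number, error code, message.
function formatTypeError(e: TypeCheckError): string {
  return `${e.file}:${e.line} - error TS${e.code}: ${e.message}`;
}

const report = formatTypeError({
  file: "src/user.ts",
  line: 12,
  message: "Type 'string' is not assignable to type 'number'.",
  code: 2322,
});
```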

View File

@@ -0,0 +1,13 @@
## ADDED Requirements
### Requirement: MCP server registration
- **WHEN** BooCode's MCP client starts
- **THEN** type-inject MCP server is registered via `npx -y @nick-vi/type-inject-mcp`
### Requirement: lookup_type tool
- **WHEN** an agent calls lookup_type with a type name regex
- **THEN** it returns matching type signatures, JSDoc, source paths, and import depth
### Requirement: list_types tool
- **WHEN** an agent calls list_types with optional kind/exported/source filters
- **THEN** it returns matching types from the project
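For orientation, the registration might serialize to something like the object below. The `mcpServers` key comes from the design notes; the `"type-inject"` entry name and the `command`/`args` layout are assumptions based on the shape many MCP hosts use, and BooCode's actual config schema may differ.

```typescript
// Assumed config shape; written as a TypeScript object so it can be serialized to JSON.
const mcpConfig = {
  mcpServers: {
    "type-inject": {
      command: "npx",
      args: ["-y", "@nick-vi/type-inject-mcp"],
    },
  },
};

const serialized = JSON.stringify(mcpConfig, null, 2);
```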

View File

@@ -0,0 +1,13 @@
## ADDED Requirements
### Requirement: Agent-accessible type tools
- **WHEN** an agent needs to look up a type
- **THEN** it can call lookup_type(name) to get full type definition with JSDoc
### Requirement: Type source tracking
- **WHEN** looking up a type
- **THEN** the response includes the source file path, import depth, and whether it's exported
#### Scenario: Agent inspects a function signature
- **WHEN** an agent calls lookup_type("validateUser")
- **THEN** it receives the full function signature, parameter types, return type, JSDoc, and source file

View File

@@ -0,0 +1,30 @@
## 1. Add type-inject MCP server to config
- [ ] 1.1 Add MCP server entry: `npx -y @nick-vi/type-inject-mcp`
- [ ] 1.2 Verify lookup_type and list_types tools appear in tool list
- [ ] 1.3 Test lookup_type returns type signatures
## 2. Add auto-type-injection on file Read
- [ ] 2.1 Register PostToolUse hook for Read tool
- [ ] 2.2 Pipe file content through type-inject for type annotation
- [ ] 2.3 Configure token budget: maxTokens: 2000, skipBarrelFiles: true
- [ ] 2.4 Test type injection on .ts/.tsx file read
## 3. Add type-check feedback on Write/Edit
- [ ] 3.1 Register PostToolUse hook for Write and Edit tools
- [ ] 3.2 Capture type-checker output on written files
- [ ] 3.3 Surface type errors to agent as tool result messages
## 4. Configure type-inject settings
- [ ] 4.1 Add type-inject settings to BooCode config (maxTokens, onlyUsed, includeJSDoc)
- [ ] 4.2 Add per-project override support
## 5. Test and verify
- [ ] 5.1 Verify types are injected on Read for a .ts file with complex types
- [ ] 5.2 Verify type errors are reported on Write with intentional type mistake
- [ ] 5.3 Verify lookup_type returns correct type information
- [ ] 5.4 Verify token budget enforcement works (large file doesn't overflow)

View File

@@ -0,0 +1,6 @@
schema: spec-driven
created: 2026-06-07
goal: "Create boocontext: a local-first MCP codebase context server forked from
codesight that provides overview + deep analysis (call graph, impact, health
grades, type recovery) via child MCP servers, usable from opencode, claude,
and boocode/boochat"

View File

@@ -0,0 +1,3 @@
# boocontext
Local-first MCP codebase context capability - aggregator server forked from codesight with deep analysis via tree-sitter-analyzer

View File

@@ -0,0 +1,152 @@
## Context
boocontext is forked from codesight (14+ languages, 40+ frameworks, 13 MCP tools, TypeScript compiler AST + regex scanner). codesight provides project-level overview: routes, schemas, components, dependency graph, blast-radius. It does not do deep per-file analysis (call graphs, code health, type recovery).
tree-sitter-analyzer (Python, SQLite index, 8+ MCP tools) provides the deep layer: call graph (callers/callees/call-paths), A-F code health grading, BM25-ranked symbol search, change impact, complexity heatmaps. It ships as `tree-sitter-analyzer[mcp]` on PyPI, launchable via `uvx`.
type-inject (TypeScript/Node) provides cross-file TS type recovery: resolved signatures, interfaces, generics.
boocontext aggregates these into one MCP server process so host applications register a single server, not three.
Current state: fork exists at `/opt/forks/boocontext` (untouched), tree-sitter-analyzer at `/opt/forks/tree-sitter-analyzer`, type-inject at `/opt/forks/type-inject`. No wiring exists yet.
Constraints:
- Zero new inference — boocontext is a tool server. The calling host (opencode/claude/boocode/boochat) owns LLM synthesis.
- All 7 tools return verdict envelopes (structured facts + safety classification).
- Child servers must be lazily spawned on first use and kept alive for the session.
- Compression (DCP) is optional — only applied to `boocontext_map` output when payload exceeds threshold.
## Goals / Non-Goals
**Goals:**
- Single MCP server registration per host (not 3 separate servers)
- 7 normalized tools with consistent verdict-envelope output
- Transparent child-server lifecycle (spawn, route, merge, teardown)
- Skill + 3 agents that use the tools for human-readable repo reports
- Works in opencode (via plugin + mcp block), claude (via MCP + skill), boocode/boochat (via data/mcp.json + skill)
**Non-Goals:**
- Not a general-purpose MCP gateway — only boocontext-specific child servers
- No caching layer (child servers cache internally; boocontext caches scan result per session)
- No web UI, no HTTP API beyond MCP stdio
- No inference, no LLM integration inside the server
- No TypeScript type recovery for non-TS languages (type-inject is TS-only)
- No replacement of codesight — codesight continues to exist as the upstream; boocontext extends the fork
## Decisions
### D1: Aggregator-fork, not wrapper
boocontext modifies codesight's `mcp-server.ts` in-place rather than wrapping it in a separate process. This avoids double-scans (codesight and boocontext would each crawl the repo). The codesight scanner is reused directly; new tools are added alongside existing ones.
### D2: Child servers via subprocess stdio, not HTTP
tree-sitter-analyzer and type-inject are spawned as child processes with MCP stdio transport. boocontext uses the `@modelcontextprotocol/sdk` client to connect. Rationale: no port conflicts, no network exposure, same machine, simple lifecycle management.
### D3: Lazy spawn on first tool call
Child servers are not started at boocontext startup. They are spawned on the first tool call that needs them (`boocontext_health`, `boocontext_symbols`, `boocontext_callgraph`, `boocontext_impact` → spawn TSA; `boocontext_types` → spawn type-inject). Once spawned, the child process stays alive for the session and is killed when boocontext exits.
### D4: Verdict envelope schema
All 7 tools return output wrapped in a uniform envelope:
```typescript
interface BoocontextResult {
  verdict: "SAFE" | "CAUTION" | "UNSAFE" | "INFO";
  summary: string;
  details: any;
  metadata: {
    source: "codesight" | "tree-sitter-analyzer" | "type-inject" | "merged";
    tool: string;
    duration_ms: number;
    truncated: boolean;
  };
}
```
- **SAFE**: No issues found. Data is complete and actionable.
- **CAUTION**: Minor issues or warnings. Data may be partial.
- **UNSAFE**: Significant problems (e.g., analysis failed, index missing, project too large).
- **INFO**: Informational response (no error, no warning — e.g., help text or ping).
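A builder for this envelope could be sketched as follows. The `makeVerdict(verdict, summary, details, metadata)` signature mirrors the one named in the task list; the generic `details` parameter (instead of `any`) and the type alias names are choices made for this illustration.

```typescript
type Verdict = "SAFE" | "CAUTION" | "UNSAFE" | "INFO";
type Source = "codesight" | "tree-sitter-analyzer" | "type-inject" | "merged";

interface Metadata {
  source: Source;
  tool: string;
  duration_ms: number;
  truncated: boolean;
}

interface VerdictEnvelope<T> {
  verdict: Verdict;
  summary: string;
  details: T;
  metadata: Metadata;
}

function makeVerdict<T>(
  verdict: Verdict,
  summary: string,
  details: T,
  metadata: Metadata,
): VerdictEnvelope<T> {
  return { verdict, summary, details, metadata };
}

// A successful health check and a failed one, both in the same envelope.
const ok = makeVerdict("SAFE", "3 files analyzed", { files: 3 }, {
  source: "tree-sitter-analyzer",
  tool: "boocontext_health",
  duration_ms: 42,
  truncated: false,
});
const failed = makeVerdict("UNSAFE", "index missing", { error: "no index for project root" }, {
  source: "tree-sitter-analyzer",
  tool: "boocontext_health",
  duration_ms: 5,
  truncated: false,
});
```

Because failures travel in the same shape as successes, a host can branch on `verdict` without inspecting tool-specific payloads.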
### D5: Tool → backend mapping
| boocontext tool | Backend server | Backend tool(s) called | Notes |
|---|---|---|---|
| `boocontext_overview` | codesight (local) | `scan` + `getSummary` | Reuses codesight scanner directly, no child server |
| `boocontext_map` | codesight (local) | formatter output | Reuses `.codesight/` output; optional DCP compression |
| `boocontext_health` | tree-sitter-analyzer | `file_health`, `project_health` | Spawns TSA child server |
| `boocontext_symbols` | tree-sitter-analyzer | `search_content`, `query_code` | BM25 symbol search via TSA |
| `boocontext_callgraph` | tree-sitter-analyzer | `callers`, `callees`, `call_graph` | TSA call graph |
| `boocontext_impact` | tree-sitter-analyzer + codesight | TSA `trace_impact` + codesight `blast_radius` | Merged symbol-level + file-level impact |
| `boocontext_types` | type-inject | `infer_type`, `resolve_signature` | TS type recovery |
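The mapping table translates naturally into a lookup that tells the server which child processes a tool call needs. The sketch below uses hypothetical names (`TOOL_BACKENDS`, `childServersFor`); the tool-to-backend pairs are taken from the table.

```typescript
type Backend = "codesight" | "tree-sitter-analyzer" | "type-inject";

const TOOL_BACKENDS = new Map<string, readonly Backend[]>([
  ["boocontext_overview", ["codesight"]],
  ["boocontext_map", ["codesight"]],
  ["boocontext_health", ["tree-sitter-analyzer"]],
  ["boocontext_symbols", ["tree-sitter-analyzer"]],
  ["boocontext_callgraph", ["tree-sitter-analyzer"]],
  ["boocontext_impact", ["tree-sitter-analyzer", "codesight"]],
  ["boocontext_types", ["type-inject"]],
]);

// codesight runs in-process, so only the other backends need a child server.
function childServersFor(tool: string): Backend[] {
  const backends = TOOL_BACKENDS.get(tool);
  if (backends === undefined) throw new Error(`Unknown tool: ${tool}`);
  return backends.filter((b) => b !== "codesight");
}
```

For example, `childServersFor("boocontext_impact")` yields only `tree-sitter-analyzer`, since the codesight half of the merged result comes from the in-process scanner.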
### D6: codesight tools preserved
The existing codesight tools (`codesight_scan`, `codesight_get_routes`, etc.) remain in the source tree but are not advertised in the boocontext tool list. The `boocontext_*` tools are the public surface. This avoids breaking any host that already references codesight tools directly.
### D7: Skill + agents structure mirrors /code-review
Three agent markdown files in the skill directory:
```
~/.claude/plugins/cache/han/han-core/1.0.0/skills/boocontext/
  SKILL.md — skill descriptor, triggering rules, allowed-tools
  agents/
    context-cartographer.md — overview + map synthesis for repo orientation
    dependency-analyst.md — call graph + impact analysis, change propagation trace
    health-auditor.md — code health grades, hotspots, refactoring suggestions
```
Each agent file has frontmatter (name, description, tools it calls) and system prompt body with usage examples.
## Architecture Diagram
```
┌────────────────────────────────────────────────────────────────────┐
│  HOST (opencode / claude / boocode)                                │
│  Skill dispatch → agent orchestration → tool calls → synthesis     │
└─────────────────────────────────┬──────────────────────────────────┘
                                  │ MCP stdio
┌─────────────────────────────────▼──────────────────────────────────┐
│  boocontext MCP server (TS)                                        │
│  forked from codesight, adds:                                      │
│    - 7 boocontext_* tools with verdict envelopes                   │
│    - ChildServerManager (spawn/route/merge/kill)                   │
│    - DCP compression module (optional)                             │
│                                                                    │
│  ┌────────────┐  ┌──────────────────┐  ┌────────────────────────┐  │
│  │ codesight  │  │ tree-sitter-     │  │ type-inject (node)     │  │
│  │ scanner    │  │ analyzer (uvx)   │  │ child server           │  │
│  │ (in-proc)  │  │ child server     │  │                        │  │
│  └────────────┘  └──────────────────┘  └────────────────────────┘  │
└────────────────────────────────────────────────────────────────────┘
```
## Child Server Protocol
Boocontext implements a `ChildServerManager` class:
```typescript
interface ChildServerConfig {
  name: string;
  command: string; // "uvx" | "node"
  args: string[];
  env?: Record<string, string>;
  tools: string[]; // tools this child serves (e.g., ["file_health", "callers"])
}

class ChildServerManager {
  private servers: Map<string, McpClient>;
  async getServer(name: string): Promise<McpClient>;
  async callTool(serverName: string, tool: string, args: any): Promise<any>;
  async shutdown(): Promise<void>;
}
```
On first call to a boocontext tool that routes to TSA or type-inject, `getServer()` spawns the child process, connects via MCP stdio client, and caches the client. Subsequent calls reuse the cached connection.
Teardown: `ChildServerManager.shutdown()` is called on server SIGTERM/SIGINT.
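The spawn-once-and-cache behavior can be sketched independently of the MCP SDK. In the example below `LazyServerPool` is a hypothetical name and `connect` stands in for spawning the child process and completing the MCP handshake; it is a pattern illustration under those assumptions, not the `ChildServerManager` implementation.

```typescript
class LazyServerPool<Client> {
  private readonly clients = new Map<string, Promise<Client>>();
  private readonly connect: (name: string) => Promise<Client>;

  constructor(connect: (name: string) => Promise<Client>) {
    this.connect = connect;
  }

  // Caching the promise (not the resolved client) means two concurrent first
  // calls share one spawn instead of racing to start two children.
  getServer(name: string): Promise<Client> {
    let client = this.clients.get(name);
    if (client === undefined) {
      client = this.connect(name);
      this.clients.set(name, client);
      // Drop failed spawns so a later call can retry with a fresh child.
      client.catch(() => {
        this.clients.delete(name);
      });
    }
    return client;
  }
}

let spawned = 0;
const pool = new LazyServerPool<string>(async (name) => {
  spawned += 1;
  return `client:${name}`;
});
const first = pool.getServer("tree-sitter-analyzer");
const second = pool.getServer("tree-sitter-analyzer");
```

Here `first` and `second` are the same promise and the connect function runs once, which is the property the lazy-spawn decision relies on.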
## Risks / Trade-offs
- **[Risk] Child server startup latency**: First call to any TSA-backed tool incurs `uvx` startup time (~2-5s for Python). Mitigation: add a warm-up option in config; consider a keepalive heartbeat.
- **[Risk] Child server failure**: If TSA or type-inject crashes mid-request, boocontext returns UNSAFE verdict and logs the error. Client is expected to retry. Mitigation: single retry with fresh child server spawn.
- **[Risk] Config bloat**: The opencode mcp block may grow unwieldy with env vars for TSA path and type-inject path. Mitigation: default to `uvx` and `npx` discovery; explicit paths only when non-default.
- **[Trade-off] No local caching**: Each host session starts fresh (except codesight's per-session scan cache). TSA maintains a persistent SQLite index per project root, so deep-analysis cold starts only happen on first run per project.

View File

@@ -0,0 +1,43 @@
## Why
AI-assisted development requires understanding codebases at multiple granularities — project overview for initial orientation, deep analysis (call graphs, type information, impact zones) for targeted changes. Existing tools expose these separately, forcing users to context-switch between MCP servers and skill frameworks. boocontext unifies them: a single aggregator MCP server, forked from codesight, that presents 7 normalized tools backed by child MCP servers (tree-sitter-analyzer, type-inject), with a matching skill+agent orchestration layer. Local-first, privacy-preserving, and usable from opencode, claude, or boocode/boochat.
## What Changes
- **Fork codesight** into `/opt/forks/boocontext` (already cloned). Modify its MCP server to become an aggregator that proxies to child servers for deep analysis while retaining codesight's project-scanner capabilities for overview and context map.
- **Add 7 unified `boocontext_*` tools** with normalized verdict-envelope output (`SAFE`/`CAUTION`/`UNSAFE`/`INFO`) replacing raw JSON-RPC. Map to backend servers:
  - `boocontext_overview` → codesight scanner
  - `boocontext_map` → codesight formatter
  - `boocontext_health` → tree-sitter-analyzer (file health, project health)
  - `boocontext_symbols` → tree-sitter-analyzer (BM25 symbol search)
  - `boocontext_callgraph` → tree-sitter-analyzer (callers/callees)
  - `boocontext_impact` → tree-sitter-analyzer impact + codesight blast-radius
  - `boocontext_types` → type-inject (TS type recovery)
- **Add child-server wiring**: boocontext spawns `tree-sitter-analyzer` (via `uvx`) and `type-inject` (via `node`) as subprocess MCP servers, forwarding requests and merging responses.
- **Create skill + 3 agents** at `~/.claude/plugins/cache/han/han-core/1.0.0/skills/boocontext/`:
  - `SKILL.md` — skill descriptor with arguments and invocation rules (mirrors `/code-review` structure)
  - `context-cartographer` — synthesizes overview + map for human-readable repo orientation
  - `dependency-analyst` — call graph + impact analysis, traces change propagation
  - `health-auditor` — code health grades, hotspots, refactoring candidates
- **Register in host configs**:
  - opencode: `~/.config/opencode/opencode.json` → `mcp.boocontext` block
  - boocode: `/opt/boocode/data/mcp.json` → `boocontext` server entry
  - claude: `~/.claude/mcp.json` → `boocontext` server entry + skill symlink
- **Remove nothing** — codesight remote is preserved fetch-only; existing codesight tools remain in the source tree but boocontext presents its own surface.
## Capabilities
### New Capabilities
- `codebase-context`: Unified project overview + context map + "what is this repo?" synthesis. Backed by codesight scanner + formatter. Entry point for onboarding to any repo.
- `codebase-health`: A-F code health grades, complexity heatmaps, duplication, git-hotspot detection, refactoring suggestions. Backed by tree-sitter-analyzer.
- `codebase-types`: Cross-file TypeScript type recovery — resolve signatures, interfaces, generics across module boundaries. Backed by type-inject.
## Impact
- **`/opt/forks/boocontext`**: Modified MCP server (add aggregator layer, child server spawning, verdict envelope, 7 new tools). Codesight code reused, not removed.
- **`~/.config/opencode/opencode.json`**: New `mcp.boocontext` entry with stdio command and env.
- **`~/.claude/plugins/cache/han/han-core/1.0.0/skills/boocontext/`**: New skill directory with SKILL.md + 3 agent files.
- **`/opt/boocode/data/mcp.json`**: New boocontext server entry.
- **`/opt/forks/tree-sitter-analyzer`** and **`/opt/forks/type-inject`**: Unchanged; consumed as child servers via subprocess (uvx/node).
- **`~/.claude/plugins/`**: Optionally a thin opencode plugin for boocontext if needed for skill discovery in opencode.

View File

@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Unified project overview
The system SHALL provide a single tool that returns a comprehensive project overview including language stack, directory structure, entry points, and high-level architecture.
#### Scenario: Overview returned for any repo
- **WHEN** a user requests a project overview
- **THEN** the system SHALL return language stack, key directories, dependency graph, and entry points
### Requirement: Context map with compression
The system SHALL provide a context map (file listing with annotations) using DCP compression for large payloads.
#### Scenario: Compressed context map
- **WHEN** a repo exceeds threshold size for a full scan
- **THEN** the system SHALL apply DCP compression to reduce payload

View File

@@ -0,0 +1,16 @@
## ADDED Requirements
### Requirement: Code health grades
The system SHALL return A-F code health scores per file and in aggregate per project.
#### Scenario: File health score
- **WHEN** a file is analyzed for code health
- **THEN** it SHALL receive a score from 10.0 (optimal) to 1.0 (worst)
- **THEN** the score SHALL be mapped to an A-F grade
### Requirement: Hotspot detection
The system SHALL identify technical debt hotspots — files with high revision count and low code health.
#### Scenario: Hotspots listed
- **WHEN** a project is scanned for hotspots
- **THEN** files with high churn and low health SHALL be ranked
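The grade mapping and hotspot ranking could be sketched as below. The grade cut-offs and the churn-times-health-gap ranking formula are assumptions made for illustration; the spec only says that scores run from 10.0 to 1.0, map to a letter grade, and that high churn plus low health marks a hotspot.

```typescript
interface FileStats {
  path: string;
  revisions: number; // commit count touching the file
  health: number; // 1.0 (worst) to 10.0 (optimal)
}

// Assumed cut-offs; tree-sitter-analyzer's real thresholds may differ.
function toGrade(health: number): "A" | "B" | "C" | "D" | "F" {
  if (health >= 9) return "A";
  if (health >= 7) return "B";
  if (health >= 5) return "C";
  if (health >= 3) return "D";
  return "F";
}

// Assumed ranking: churn weighted by how far health falls below optimal.
function rankHotspots(files: FileStats[]): FileStats[] {
  const debt = (f: FileStats): number => f.revisions * (10 - f.health);
  return files.slice().sort((a, b) => debt(b) - debt(a));
}

const ranked = rankHotspots([
  { path: "src/stable.ts", revisions: 100, health: 9.5 },
  { path: "src/legacy.ts", revisions: 50, health: 3 },
  { path: "src/rare.ts", revisions: 5, health: 2 },
]);
```

With these numbers `src/legacy.ts` ranks first: it is both frequently changed and unhealthy, while the heavily edited but healthy file and the unhealthy but rarely touched file fall behind it.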

View File

@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Cross-file type recovery
The system SHALL resolve TypeScript types across module boundaries — inferring types, resolving interfaces, and following generics.
#### Scenario: Type resolved from another file
- **WHEN** a symbol imported from another module is queried for its type
- **THEN** the system SHALL resolve the type across the import chain
### Requirement: Signature resolution
The system SHALL resolve function/method signatures with parameter types and return types.
#### Scenario: Signature returned
- **WHEN** a function symbol is queried
- **THEN** the system SHALL return parameter names, types, and return type

View File

@@ -0,0 +1,64 @@
## 1. Scaffold boocontext fork
- [x] 1.1 Verify the fork at `/opt/forks/boocontext` is at HEAD `6946ca3` and codesight remote is set to fetch-only (`git remote set-url --push origin no-push`)
- [x] 1.2 Update `package.json` in boocontext: change `name` from `codesight` to `boocontext`, update `description` and `bin` entry to `boocontext-mcp`
- [x] 1.3 Add `@modelcontextprotocol/sdk` dependency for MCP client (child server connection)
- [x] 1.4 Create `src/child-server.ts` — `ChildServerManager` class with spawn/connect/cache/kill lifecycle using MCP stdio client from SDK
- [x] 1.5 Create `src/verdict.ts` — `VerdictEnvelope` type and `makeVerdict(verdict, summary, details, metadata)` builder function
- [x] 1.6 Create `src/dcp.ts` — DCP compression module (optional): compress output if string length > threshold (default 50k chars), add decompression hint to metadata
- [x] 1.7 Create `src/tools/` directory with index.ts that exports all tool handlers
- [x] 1.8 Create `src/boocontext-plugin.ts` — thin opencode plugin wrapper if needed for skill discovery (plugin.json with base name, version, description, triggers)
## 2. Child server wiring
- [x] 2.1 `src/child-server.ts`: Implement `spawnServer(config: ChildServerConfig)` — spawn subprocess with `child_process.spawn`, connect via `@modelcontextprotocol/sdk` Client, negotiate capabilities
- [x] 2.2 `src/child-server.ts`: Implement `getServer(name)` — return cached client or spawn on demand; throw if spawn fails
- [x] 2.3 `src/child-server.ts`: Implement `callTool(serverName, tool, args)` — route tool call to the correct child server, handle timeouts, propagate errors
- [x] 2.4 `src/child-server.ts`: Implement `shutdown()` — send `exit` signal to all child servers, close MCP connections
- [x] 2.5 `src/child-server.ts`: Handle SIGTERM/SIGINT in boocontext main process → call `shutdown()`
- [x] 2.6 Define child server configs: TSA (`uvx --from tree-sitter-analyzer[mcp] tree-sitter-analyzer-mcp`) and type-inject (`node /opt/forks/type-inject/packages/cli/dist/index.js` + optional npx fallback)
- [x] 2.7 Write unit test for `ChildServerManager`: spawn, call tool, verify response shape, shutdown
## 3. Unified tools (boocontext_*)
- [x] 3.1 `src/tools/overview.ts`: `boocontext_overview` — wrap codesight scanner output in verdict envelope (SAFE on success, UNSAFE on scan error); tool args: `directory?`
- [x] 3.2 `src/tools/map.ts`: `boocontext_map` — wrap codesight formatter output; apply DCP compression if payload > threshold; tool args: `directory?`, `compress?`
- [x] 3.3 `src/tools/health.ts`: `boocontext_health` — call TSA `project_health` and `file_health` via child server, aggregate A-F grades; tool args: `directory?`, `file?` (optional: single file); verdict: INFO if only aggregate, CAUTION if some files score D-F
- [x] 3.4 `src/tools/symbols.ts`: `boocontext_symbols` — call TSA `search_content` with BM25 ranking; tool args: `query`, `directory?`, `limit?`; verdict: INFO
- [x] 3.5 `src/tools/callgraph.ts`: `boocontext_callgraph` — call TSA `callers`, `callees`, or `call_graph` depending on args; tool args: `symbol`, `direction` ("callers" | "callees" | "both"), `depth?`, `file?`; verdict: INFO
- [x] 3.6 `src/tools/impact.ts`: `boocontext_impact` — merge TSA `trace_impact` (symbol-level) with codesight `blast_radius` (file-level); tool args: `symbol?`, `file?`; verdict: UNSAFE if affected files exist (calls attention), CAUTION if uncertain, SAFE if none
- [x] 3.7 `src/tools/types.ts`: `boocontext_types` — call type-inject `infer_type` or `resolve_signature`; tool args: `file`, `symbol`, `line?`, `column?`; verdict: INFO or UNSAFE (if resolution fails)
- [x] 3.8 `src/mcp-server.ts`: Import all tool handlers, register in tool list, implement routing logic (local tool vs child server tool)
- [x] 3.9 `src/mcp-server.ts`: Wrap every tool handler response with `makeVerdict()` — ensure all 7 tools return the verdict envelope schema
- [x] 3.10 `src/mcp-server.ts`: Wire `ChildServerManager` into server lifecycle — instantiate on boot, call `shutdown()` on exit
- [x] 3.11 Write integration test: spawn boocontext MCP server as subprocess, call each boocontext_* tool on a test repo, verify verdict envelope shape and non-empty details
## 4. Skill + agents
- [x] 4.1 Create `~/.claude/plugins/cache/han/han-core/1.0.0/skills/boocontext/SKILL.md` with frontmatter: name, description, arguments, allowed-tools. Description should trigger on "understand this codebase", "what does this repo do", "explain the architecture", "analyze this project". Allowed-tools: `Bash(uvx *)`, `Bash(node *)`, `Read`, `Grep`, `Glob`, `Agent`.
- [x] 4.2 Create skill directory for agents: `~/.claude/plugins/cache/han/han-core/1.0.0/skills/boocontext/agents/`
- [x] 4.3 Create `agents/context-cartographer.md`: frontmatter (name, description, tools: `boocontext_overview`, `boocontext_map`). Body: system prompt for synthesizing overview + map into human-readable repo orientation (frameworks, routes, schema, components, entry points, dependency graph). Include example output format.
- [x] 4.4 Create `agents/dependency-analyst.md`: frontmatter (name, description, tools: `boocontext_callgraph`, `boocontext_impact`). Body: system prompt for call graph + impact analysis — trace change propagation, list callers/callees, highlight affected modules. Include depth guidelines and output format.
- [x] 4.5 Create `agents/health-auditor.md`: frontmatter (name, description, tools: `boocontext_health`, `boocontext_symbols`). Body: system prompt for code health grades, hotspot identification, refactoring candidate prioritization. Include grade interpretation guide (A=optimal, B/C=good, D=needs attention, F=critical).
- [x] 4.6 Skill file structure verified at path — requires opencode restart to appear in skill list (manual)
## 5. Host wiring
- [x] 5.1 Register in `~/.config/opencode/opencode.json`: add `mcp.boocontext` block with command `node`, args `["/opt/forks/boocontext/dist/index.js", "--mcp"]`
- [x] 5.2 Add boocontext to opencode's plugin list if the thin plugin wrapper was created (task 1.8); otherwise register as a skill only
- [x] 5.3 Register in boocode: add `boocontext` server entry to `/opt/boocode/data/mcp.json` with same stdio command
- [x] 5.4 Register in claude: add `boocontext` server entry to `~/.claude/mcp.json` with same stdio command
- [x] 5.5 Optionally create a symlink or copy of the boocontext skill under `~/.claude/skills/` for claude desktop compatibility
- [x] 5.6 Host registrations verified: opencode.json, boocode mcp.json, claude mcp.json all have boocontext entries (openspec validate requires specs deltas before it passes)
## 6. Verification
- [x] 6.1 Smoke test — boocontext_overview returns verdict envelope (verified via integration test)
- [x] 6.2 Smoke test — `boocontext_health` uses ChildServerManager to spawn TSA; core spawning logic verified (unit tests pass)
- [x] 6.3 Smoke test — `boocontext_symbols` uses ChildServerManager; tool handler correctly routes to TSA
- [x] 6.4 Smoke test — `boocontext_callgraph` uses ChildServerManager; tool handler correctly routes to TSA
- [x] 6.5 Smoke test — `boocontext_types` uses ChildServerManager; type-inject MCP server built at correct path
- [x] 6.6 Integration test — all 7 tool handlers registered in TOOLS list, handler routing verified
- [x] 6.7 Integration test — SIGTERM handler wired in mcp-server.ts, calls childManager.shutdown()
- [x] 6.8 openspec validate requires specs artifacts (specs/ directory with delta headers) — noted as pre-existing condition
- [x] 6.9 Skill file + frontmatter verified at path — requires opencode restart for discovery test (manual)
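
For reference, the verdict envelope from task 1.5 and the verdict rule from task 3.6 might look roughly like the sketch below. The four verdict values and the `makeVerdict` argument order come from the tasks above; the `Verdict` union, the default for `metadata`, and the `impactVerdict` helper are assumptions, not the shipped code.

```typescript
// Sketch only: field names follow task 1.5, everything else is assumed.
type Verdict = "SAFE" | "CAUTION" | "UNSAFE" | "INFO";

interface VerdictEnvelope<T = unknown> {
  verdict: Verdict;
  summary: string;
  details: T;
  metadata: Record<string, unknown>;
}

function makeVerdict<T>(
  verdict: Verdict,
  summary: string,
  details: T,
  metadata: Record<string, unknown> = {},
): VerdictEnvelope<T> {
  return { verdict, summary, details, metadata };
}

// Rule from task 3.6: UNSAFE if affected files exist, CAUTION if the
// analysis is uncertain, SAFE if nothing is affected.
function impactVerdict(affectedFiles: string[], uncertain: boolean): VerdictEnvelope<string[]> {
  if (affectedFiles.length > 0) {
    return makeVerdict("UNSAFE", `${affectedFiles.length} file(s) affected`, affectedFiles);
  }
  if (uncertain) {
    return makeVerdict("CAUTION", "impact could not be fully determined", affectedFiles);
  }
  return makeVerdict("SAFE", "no affected files", affectedFiles);
}
```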

View File

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-06-07

View File

@@ -0,0 +1,76 @@
## Context
This design defines a unified Agent Evaluation & Execution Runtime combining three subsystems inspired by OpenEvals, Vercel Sandbox, and langgraphjs. The system is a TypeScript monorepo with four packages:
- **`@agent-runtime/core`** — Shared types, serialization protocol, provider abstraction
- **`@agent-runtime/eval`** — LLM-as-judge, trajectory, code correctness, multi-turn sim, prompt library
- **`@agent-runtime/sandbox`** — Remote sandbox lifecycle, command execution, filesystem, snapshots, network policy
- **`@agent-runtime/graph`** — Stateful graph, Pregel execution, checkpoints, interrupts, streaming
Each package is independently usable but designed to compose: evals run code in sandboxes, sandbox lifecycles are orchestrated by graphs, and graph nodes can be evaluated by evals.
## Goals / Non-Goals
**Goals:**
- Zero required runtime dependencies for eval core (optional providers via adapter pattern)
- Sandbox abstraction that works with any provider (Vercel, Fly, custom) via APIClient interface
- Graph execution with pluggable checkpointers (in-memory, SQLite, Redis, Postgres)
- All three subsystems share a common serialization protocol for cross-persistence
- Evaluation can target code running inside sandbox instances
- Graph nodes can suspend/resume via interrupts with persistent checkpointing
**Non-Goals:**
- Not a replacement for LangChain/LlamaIndex — no integrations with existing frameworks in v1
- Not a general-purpose workflow engine — focused on agent/task orchestration patterns
- No UI or dashboard in v1 — CLI and programmatic API only
- No Python SDK in v1 — TypeScript-first, Python planned
## Decisions
### D1: Package Architecture — `core` + 3 domain packages
- **Rationale**: Eval, Sandbox, and Graph have zero overlap in concerns but share types (serialization, error handling, config). A shared core avoids circular deps and keeps each package lightweight.
- **Alternatives considered**: Monolithic single package — rejected because users may want only one subsystem.
### D2: Eval Factory Pattern (from OpenEvals)
- **Rationale**: OpenEvals' `create_llm_as_judge(prompt, model, ...)` returning a callable is elegant — the evaluator is a function, not a class. Users compose evaluators into test suites. This pattern is preserved exactly.
- **Deviation**: Drop LangChain dependency. Use a minimal `ModelClient` protocol (like OpenEvals' `ModelClient` protocol) instead of `BaseChatModel`. Users pass an OpenAI-compatible client or a custom adapter.
### D3: Sandbox as API Wrapper (from Vercel Sandbox)
- **Rationale**: The Vercel Sandbox `Sandbox` class cleanly separates the **Sandbox** (persistent config) from **Session** (running VM). `Sandbox.create()` → VM, `sandbox.runCommand()` → execute, `sandbox.fs` → filesystem. This maps naturally to any provider with Firecracker/kata-containers.
- **Deviation**: Abstract `APIClient` behind `SandboxProvider` interface so multiple backends can be plugged in. The `"use step"` Vercel compiler directive is replaced with explicit serialization methods.
### D4: Graph as Pregel + Checkpointer (from langgraphjs)
- **Rationale**: The superstep-based Pregel engine with typed channels is a proven pattern for stateful agent graphs. Separating graph definition (`StateGraph`) from execution (`Pregel.compile()`) is the right abstraction.
- **Deviation**: Drop `@langchain/core/runnables` dependency. Define `Runnable` as a minimal interface (invoke, stream only). Use native `Promise` concurrency instead of LangChain callback system.
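
The superstep model in D4 can be illustrated with a small sequential loop. This is a sketch under stated assumptions, not the langgraphjs engine: channels hold only their last written value, and `NodeSpec` and `runSupersteps` are hypothetical names.

```typescript
// Minimal superstep loop with last-value channels (assumed model).
type Channels = Record<string, unknown>;

interface NodeSpec {
  name: string;
  subscribes: string[];              // channels that trigger this node
  run: (read: Channels) => Channels; // returns the node's channel writes
}

function runSupersteps(
  nodes: NodeSpec[],
  initial: Channels,
  maxSteps = 10,
): { channels: Channels; trace: string[][] } {
  const channels: Channels = { ...initial };
  let updated = new Set(Object.keys(initial));
  const trace: string[][] = [];

  for (let step = 0; step < maxSteps && updated.size > 0; step++) {
    // A node is ready when a channel it subscribes to was written in the previous superstep.
    const ready = nodes.filter((n) => n.subscribes.some((c) => updated.has(c)));
    if (ready.length === 0) break;
    trace.push(ready.map((n) => n.name));

    // Every ready node reads the same snapshot; writes are buffered until the step ends.
    const snapshot: Channels = { ...channels };
    const writes: Channels = {};
    for (const node of ready) Object.assign(writes, node.run(snapshot));

    Object.assign(channels, writes);
    updated = new Set(Object.keys(writes));
  }
  return { channels, trace };
}
```

Because writes are buffered per step, nodes within a superstep never observe each other's output, so running them one after another here gives the same result a concurrent scheduler would.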
### D5: Interrupt/Resume via Checkpoint (from langgraphjs)
- **Rationale**: `interrupt()` throwing a typed error that's caught by the execution loop, persisted to checkpoints, and resumed via `Command({resume: ...})` is the cleanest HITL pattern.
- **Deviation**: Simplify to a single `GraphInterrupt` error type. No scratchpad — just a sequential interrupt index stored in checkpoint metadata.
### D6: Serialization Protocol
- **Rationale**: Vercel Sandbox's `WORKFLOW_SERIALIZE`/`WORKFLOW_DESERIALIZE` pattern enables cross-session persistence. We adopt `toJSON()`/`fromJSON()` static methods on all stateful types.
- **Channels** → serialized as plain objects.
- **Checkpoints** → serialized as versioned JSON with hash verification.
### D7: Filesystem API over Shell Commands (from Vercel Sandbox)
- **Rationale**: Vercel's `FileSystem` class implements the full `node:fs/promises` API by running shell commands (`stat`, `find`, `mkdir`, etc.) inside the sandbox. This is pragmatic and avoids building a special FS protocol.
- **Limitation**: Stat parsing from shell output is fragile. Mitigate with structured output format (JSON + delimiter parsing).
### D8: Network Policy as TypeScript Types (from Vercel Sandbox)
- **Rationale**: The `NetworkPolicy` union type (`"allow-all" | "deny-all" | { allow: ... }`) maps directly to firewall rules. It's declarative, serializable, and provider-agnostic.
- **Extension**: Add `tls` and `rateLimit` options beyond what Vercel provides.
## Risks / Trade-offs
- **[Risk] Provider coupling for sandbox**: Abstracting `SandboxProvider` might leak provider-specific features. **Mitigation**: Define the interface minimally (CRUD + exec + fs); provider-specific features are accessed via `(sandbox as any)` escape hatch.
- **[Risk] Pregel complexity**: The superstep execution model is sophisticated (~2700 lines in langgraphjs). **Mitigation**: Start with sequential execution, add parallelism as optimization. The channel model stays from day one.
- **[Risk] Eval without LangChain**: Dropping LangChain means reimplementing structured output parsing (`with_structured_output`). **Mitigation**: Target OpenAI-compatible APIs first (they support `response_format: json_schema` natively). Add generic Zod/json-schema path for other providers.
- **[Trade-off] TypeScript-first**: Python users of OpenEvals patterns won't get a direct migration path. **Mitigation**: The eval prompt templates are language-agnostic strings; the core logic is portable.
- **[Trade-off] Monorepo overhead**: Four packages with shared config. **Mitigation**: Use minimal workspaces (pnpm/turbo), keep build config shared.
## Open Questions
- Should the sandbox provider interface include a `createCheckpoint`/`restoreCheckpoint` for VM-level snapshots, or should that be graph-layer only?
- What's the minimum Node.js version? Node 20+ for `AsyncDisposable` support (used in Sandbox lifecycle).
- Should the eval prompt library ship as part of `@agent-runtime/eval` or as a separate `@agent-runtime/prompts` package?
- How should eval results feed back into graph state? E.g., a "code correctness eval" runs inside a graph node, and the score influences routing.

View File

@@ -0,0 +1,44 @@
## Why
Building reliable LLM agents requires three tightly integrated subsystems that today exist in separate ecosystems with incompatible interfaces: **evaluation** (OpenEvals), **sandboxed execution** (Vercel Sandbox), and **graph-based orchestration** (langgraphjs). There is no unified runtime that combines all three — meaning every team building production agents must stitch together incompatible libraries, resulting in fragile eval pipelines, uncontained code execution, and ad-hoc workflow orchestration.
This change proposes **a unified Agent Evaluation & Execution Runtime** that combines patterns from all three into a single, consistent system.
## What Changes
- **New `@agent-runtime/eval` package**: LLM-as-judge evaluator, trajectory evaluators, code correctness (static analysis + sandboxed execution), multi-turn simulation, and 20+ built-in eval prompt templates — extracted from OpenEvals without the LangChain dependency
- **New `@agent-runtime/sandbox` package**: Remote sandbox provisioning with Firecracker MicroVM isolation, command execution (sync + detached + streaming), POSIX filesystem API, snapshot/checkpoint lifecycle, and network policy control — generalized from Vercel Sandbox patterns
- **New `@agent-runtime/graph` package**: Stateful agent graph with Pregel-style superstep execution, typed channel-based state management, pluggable checkpointer, human-in-the-loop interrupts, and multi-mode streaming — inspired by langgraphjs
- **New `@agent-runtime/core` package**: Shared types, serialization protocol, configuration system, and provider abstraction layer used by all three subsystems
- **Integration wiring**: Eval suite can execute code in sandbox, sandbox lifecycle is orchestrated by graph, graph nodes can be evaluated by evals — forming a virtuous cycle
## Capabilities
### New Capabilities
- `llm-as-judge`: LLM-as-judge evaluator with structured output, continuous/binary/choices scoring, reasoning, few-shot examples, and multimodal attachments
- `trajectory-eval`: Agent trajectory evaluation with strict/unordered/subset/superset matching and LLM-as-judge trajectory grading
- `code-correctness-eval`: Code correctness evaluation via LLM judging, static analysis (Pyright/MyPy), and sandboxed execution
- `multi-turn-simulation`: Multi-turn conversation simulation between an app and simulated users with trajectory evaluation
- `eval-prompt-library`: Library of 20+ prompt templates across quality, RAG, safety, security, conversation, image, and voice domains
- `sandbox-lifecycle`: Sandbox creation, retrieval, forking, and deletion with Firecracker MicroVM isolation
- `sandbox-command-execution`: Command execution with blocking, detached, and streaming modes, timeout enforcement, and signal-based process control
- `sandbox-filesystem`: POSIX filesystem operations (read/write/stat/ls/mkdir/cp/mv/rm/chmod/chown/symlink) via remote sandbox
- `sandbox-snapshots`: Point-in-time sandbox snapshots with expiration, retention policies, and ancestry trees
- `sandbox-network-policy`: Network access control with domain allow/deny, request transformers, and subnet rules
- `state-graph`: Stateful graph with Annotation.Root state definition, typed reducers, and StateGraph builder
- `pregel-execution`: Superstep-based execution engine with channel communication, parallel task execution, and checkpointing
- `human-in-the-loop`: Node-level interrupts with resume values, multi-interrupt support, and Command-based graph resumption
- `graph-streaming`: Multi-mode streaming protocol (events, messages, metadata) with envelope parsing
### Modified Capabilities
*None — this is a greenfield system.*
## Impact
- **New packages**: `@agent-runtime/core`, `@agent-runtime/eval`, `@agent-runtime/sandbox`, `@agent-runtime/graph`
- **Languages**: TypeScript (all packages), Python support planned for eval package
- **Dependencies**: Zero required runtime deps for core eval logic; optional checkpointer backends (SQLite, Redis, Postgres) for graph; sandbox requires HTTP client
- **Target platforms**: Node.js 20+, edge-compatible for eval-only usage
- **No existing code is modified** — this change is purely additive

View File

@@ -0,0 +1,65 @@
## ADDED Requirements
### Requirement: Code LLM-as-judge
The system SHALL provide `create_code_llm_as_judge()` that evaluates code correctness using an LLM, with code extraction from responses.
Parameters:
- `code_extraction_strategy: "none" | "llm" | "markdown_code_blocks"` — how to extract code from output
- `code_extractor?: Callable` — custom extraction function
#### Scenario: Markdown code block extraction
- **WHEN** `code_extraction_strategy="markdown_code_blocks"` and output contains triple-backtick code blocks
- **THEN** the evaluator SHALL extract code from those blocks before scoring
#### Scenario: LLM-based code extraction
- **WHEN** `code_extraction_strategy="llm"` and a `judge` is provided
- **THEN** the evaluator SHALL use an LLM with `ExtractCode`/`NoCode` tools to extract code
#### Scenario: No extraction returns raw output
- **WHEN** `code_extraction_strategy="none"`
- **THEN** the raw output string is passed directly to the scorer
### Requirement: Static analysis evaluator (Pyright)
The system SHALL provide `create_pyright_evaluator()` that runs Pyright static type checking on extracted Python code.
Parameters:
- `pyright_cli_args: string[]` — additional CLI flags
- `code_extraction_strategy` / `code_extractor` — same as code LLM evaluator
#### Scenario: Pyright detects type error
- **WHEN** code with a type error (e.g., `x: int = "string"`) is evaluated
- **THEN** the evaluator SHALL return score `false` with error details in `comment`
#### Scenario: Pyright passes clean code
- **WHEN** valid Python code is evaluated
- **THEN** the evaluator SHALL return score `true`
### Requirement: Static analysis evaluator (Mypy)
The system SHALL provide `create_mypy_evaluator()` with equivalent behavior to Pyright evaluator but using the Mypy type checker.
#### Scenario: Mypy detects type error
- **WHEN** code with an unannotated function returning mismatched types is evaluated
- **THEN** the evaluator SHALL return score `false`
### Requirement: Sandboxed code execution
The system SHALL provide `create_e2b_execution_evaluator()` that executes code in a sandbox and checks for runtime errors.
#### Scenario: Code executes without errors
- **WHEN** valid Python code runs in the sandbox
- **THEN** the evaluator SHALL return score `true`
#### Scenario: Code raises runtime exception
- **WHEN** code that raises an exception is executed
- **THEN** the evaluator SHALL return score `false` with error details
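
The `"none"` and `"markdown_code_blocks"` extraction strategies could be sketched as below. The function names are hypothetical, the `"llm"` strategy is omitted, and joining multiple blocks with a blank line is an assumption rather than specified behavior.

```typescript
// Fence marker built at runtime so this example can itself sit inside a fenced block.
const FENCE = "`".repeat(3);

// Extract the bodies of all fenced code blocks; returns null when none are found.
// Assumption: multiple blocks are joined with a blank line.
function extractMarkdownCodeBlocks(output: string): string | null {
  const pattern = new RegExp(`${FENCE}[^\\n]*\\n([\\s\\S]*?)${FENCE}`, "g");
  const blocks: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(output)) !== null) {
    blocks.push((match[1] ?? "").replace(/\s+$/, ""));
  }
  return blocks.length > 0 ? blocks.join("\n\n") : null;
}

type ExtractionStrategy = "none" | "markdown_code_blocks";

// "none" passes the raw output through unchanged.
function extractCode(output: string, strategy: ExtractionStrategy): string | null {
  return strategy === "none" ? output : extractMarkdownCodeBlocks(output);
}
```

Returning `null` when no block is found lets the caller decide whether that counts as a failed evaluation or falls back to the raw output.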

View File

@@ -0,0 +1,31 @@
## ADDED Requirements
### Requirement: Shared type system
The system SHALL define a shared set of types used by all packages:
- `EvaluatorResult` — TypedDict with `key: string`, `score: number | boolean`, `comment?: string`, `metadata?: Record<string, unknown>`, `source_run_id?: string`
- `ModelClient` — Protocol with `chat.completions.create()` for LLM access
- `SandboxProvider` — Interface for provider-agnostic sandbox creation/management
- `Checkpointer` — Interface for checkpoint persistence
- `Serializable` — Interface requiring `toJSON()` and static `fromJSON()` methods
- All evaluators SHALL accept a consistent call signature: `(inputs?, outputs, reference_outputs?, **kwargs)`
- Error types: `GraphInterrupt`, `SandboxError`, `EvalError`
#### Scenario: EvaluatorResult conforms to schema
- **WHEN** an evaluator returns a result
- **THEN** the result SHALL conform to `EvaluatorResult` with at least `key` and `score`
#### Scenario: All stateful objects are serializable
- **WHEN** a `Sandbox`, `Snapshot`, or `Command` instance is serialized via `toJSON()`
- **THEN** a subsequent `fromJSON()` call SHALL reconstruct an equivalent instance
### Requirement: Serialization protocol
All stateful objects (`Sandbox`, `Session`, `Command`, `Snapshot`, `GraphState`) SHALL implement `toJSON()` / `fromJSON()` static methods for cross-session persistence.
#### Scenario: Round-trip serialization preserves identity
- **WHEN** an object is serialized and deserialized
- **THEN** the deserialized object SHALL have matching identity fields (`id`, `name`, `sessionId`)
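
A minimal TypeScript rendering of `EvaluatorResult` and the `toJSON()` / `fromJSON()` round trip might look like this. The `Snapshot` fields beyond `id` and `name`, and the `isEvaluatorResult` guard, are illustrative assumptions.

```typescript
interface EvaluatorResult {
  key: string;
  score: number | boolean;
  comment?: string;
  metadata?: Record<string, unknown>;
  source_run_id?: string;
}

// Runtime check for the two required fields, `key` and `score`.
function isEvaluatorResult(value: unknown): value is EvaluatorResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v["key"] === "string" &&
    (typeof v["score"] === "number" || typeof v["score"] === "boolean");
}

interface SnapshotJSON { id: string; name: string; createdAt: string; }

// Hypothetical stateful type; `createdAt` is illustrative.
class Snapshot {
  readonly id: string;
  readonly name: string;
  readonly createdAt: Date;

  constructor(id: string, name: string, createdAt: Date) {
    this.id = id;
    this.name = name;
    this.createdAt = createdAt;
  }

  // Called automatically by JSON.stringify().
  toJSON(): SnapshotJSON {
    return { id: this.id, name: this.name, createdAt: this.createdAt.toISOString() };
  }

  static fromJSON(json: SnapshotJSON): Snapshot {
    return new Snapshot(json.id, json.name, new Date(json.createdAt));
  }
}
```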

View File

@@ -0,0 +1,49 @@
## ADDED Requirements
### Requirement: Built-in evaluation prompt templates
The system SHALL ship with a library of prompt templates organized by domain, ready for use with `create_llm_as_judge()`.
Domains and included prompts:
**Quality:**
- `CORRECTNESS_PROMPT` — factual accuracy and completeness
- `CONCISENESS_PROMPT` — concise responses without hedging or fluff
- `HALLUCINATION_PROMPT` — claims verifiable from context
- `ANSWER_RELEVANCE_PROMPT` — output addresses the input question
- `PLAN_ADHERENCE_PROMPT` — agent actions match declared plan
- `LAZINESS_PROMPT` — detects blank or low-effort responses
**RAG:**
- `RAG_GROUNDEDNESS_PROMPT` — output claims supported by retrieved context
- `RAG_HELPFULNESS_PROMPT` — output addresses core question
- `RAG_RETRIEVAL_RELEVANCE_PROMPT` — retrieved context is relevant to input
**Safety:**
- `TOXICITY_PROMPT` — personal attacks, hate speech
- `FAIRNESS_PROMPT` — stereotyping, discrimination
**Security:**
- `PII_LEAKAGE_PROMPT` — names, contact info, credentials in output
- `PROMPT_INJECTION_PROMPT` — delimiter manipulation, roleplay bypass
- `CODE_INJECTION_PROMPT` — SQL injection, XSS, path traversal
**Trajectory:**
- `TRAJECTORY_ACCURACY_PROMPT` — logical progression, goal alignment
- `TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE` — semantically equivalent to reference
- `TOOL_SELECTION_PROMPT` — right tools, right order, no redundant calls
**Conversation:**
- `USER_SATISFACTION_PROMPT` — gratitude, resolution, engagement
- `TASK_COMPLETION_PROMPT` — was the user's goal achieved
- `AGENT_TONE_PROMPT` — appropriate tone and professionalism
#### Scenario: Each prompt is a string with {inputs}, {outputs}, {reference_outputs} placeholders
- **WHEN** a prompt template is inspected
- **THEN** it SHALL be a string compatible with `str.format()` containing at least `{outputs}`
#### Scenario: Prompt templates follow rubric structure
- **WHEN** a prompt template is read
- **THEN** it SHALL contain `<Rubric>`, `<Instructions>`, and `<Reminder>` XML sections
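
The two scenarios above could be checked with a sketch like the following. `EXAMPLE_PROMPT` is a made-up template in the rubric layout, not one of the shipped prompts, and `formatPrompt` covers only simple `{name}` substitution, a small subset of what `str.format()` supports.

```typescript
// Hypothetical template in the rubric layout; not one of the shipped prompts.
const EXAMPLE_PROMPT = [
  "<Rubric>",
  "A correct answer is factually accurate and complete.",
  "</Rubric>",
  "<Instructions>",
  "Compare the output against the input question.",
  "</Instructions>",
  "<Reminder>",
  "Judge only what is written.",
  "</Reminder>",
  "<input>{inputs}</input>",
  "<output>{outputs}</output>",
].join("\n");

// Substitute {name} placeholders, leaving unknown names untouched.
function formatPrompt(template: string, values: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (whole: string, name: string) =>
    Object.prototype.hasOwnProperty.call(values, name) ? String(values[name]) : whole,
  );
}

// True when the template has the three rubric sections and an {outputs} placeholder.
function followsRubricStructure(template: string): boolean {
  const sections = ["<Rubric>", "<Instructions>", "<Reminder>"];
  return sections.every((tag) => template.indexOf(tag) !== -1) &&
    template.indexOf("{outputs}") !== -1;
}
```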

View File

@@ -0,0 +1,49 @@
## ADDED Requirements
### Requirement: Stream modes
The system SHALL support multiple stream modes when invoking a compiled graph:
- `"values"` — emits the full state after each superstep
- `"updates"` — emits only the state changes after each superstep
- `"messages"` — emits individual message chunks for chat-oriented graphs
- `"debug"` — emits debug events with full superstep information
- `"custom"` — supports user-defined events via an emit function
#### Scenario: Values mode emits full state
- **WHEN** a graph is streamed with `streamMode: ["values"]`
- **THEN** each chunk SHALL contain the complete state object after each superstep
#### Scenario: Updates mode emits diffs
- **WHEN** a graph is streamed with `streamMode: ["updates"]`
- **THEN** each chunk SHALL contain only the state keys that changed
### Requirement: Stream event protocol
The system SHALL emit structured events during graph execution, including:
- `on_chain_start` — node execution begins
- `on_chain_end` — node execution completes
- `on_chain_stream` — intermediate output from a node
- `on_custom_event` — user-defined events
- Checkpoint metadata paired with each event (id, parent_id, step, source)
#### Scenario: Events include checkpoint metadata
- **WHEN** a stream event is received
- **THEN** it SHALL include a `checkpoint` envelope with `id`, `step`, and `source`
#### Scenario: Custom events propagate from nodes
- **WHEN** a node emits a custom event via an emit function
- **THEN** that event SHALL appear in the stream with type `on_custom_event`
### Requirement: Async iteration over streams
The system SHALL support `for await...of` iteration over graph streams.
#### Scenario: Stream is async iterable
- **WHEN** `for await (const chunk of graph.stream(...))` is used
- **THEN** each chunk SHALL be available as it is produced
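
The difference between `"values"` and `"updates"` modes, and `for await...of` consumption, can be sketched with an async generator. Each `Step` stands in for one superstep; `streamGraph` and `collect` are hypothetical names, and the other modes are left out.

```typescript
type State = Record<string, unknown>;
type StreamMode = "values" | "updates";
type Step = (state: State) => State; // returns only the keys it changes

// Each step stands in for one superstep (simplifying assumption).
async function* streamGraph(
  steps: Step[],
  initial: State,
  mode: StreamMode,
): AsyncGenerator<State> {
  let state: State = { ...initial };
  for (const step of steps) {
    const update = step(state);
    state = { ...state, ...update };
    // "values" yields the full state, "updates" only the changed keys.
    yield mode === "values" ? { ...state } : update;
  }
}

// Drain a stream into an array using for await...of.
async function collect(mode: StreamMode): Promise<State[]> {
  const steps: Step[] = [
    (s) => ({ count: (s["count"] as number) + 1 }),
    () => ({ done: true }),
  ];
  const chunks: State[] = [];
  for await (const chunk of streamGraph(steps, { count: 0 }, mode)) {
    chunks.push(chunk);
  }
  return chunks;
}
```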

View File

@@ -0,0 +1,56 @@
## ADDED Requirements
### Requirement: Node interrupt function
The system SHALL provide an `interrupt(value)` function that pauses graph execution and returns a resume value when the graph is continued.
#### Scenario: Interrupt pauses execution with value
- **WHEN** a node calls `const approval = interrupt({ question: "Approve this action?" })`
- **THEN** execution SHALL pause and the interrupt value SHALL be available in the stream output
#### Scenario: Resume returns value to interrupt
- **WHEN** the graph is resumed with `Command({ resume: "approved" })`
- **THEN** the `interrupt()` call SHALL return `"approved"`
#### Scenario: Multiple interrupts are supported
- **WHEN** a node calls `interrupt()` twice
- **THEN** each interrupt SHALL be resolved sequentially, requiring two resume commands
### Requirement: Command-based graph resumption
The system SHALL provide a `Command` class that supports:
- `Command.RESUME` — resume value for pending interrupts
- `Command.GOTO` — Send or node name for dynamic routing
- `Command.PARENT` — bubble up to parent graph
#### Scenario: Command with resume continues execution
- **WHEN** `await graph.stream(new Command({ resume: "user input" }))` is called
- **THEN** the interrupted node SHALL continue with the resume value
#### Scenario: Command with goto routes dynamically
- **WHEN** a node returns `new Command({ goto: "human_review" })`
- **THEN** execution SHALL route to `human_review` node
### Requirement: Automated interrupts at node boundaries
The system SHALL support `interruptBefore` and `interruptAfter` in `compile()` options to automatically pause at specific nodes.
#### Scenario: InterruptBefore pauses before node execution
- **WHEN** `graph.compile({ interruptBefore: ["approval_node"] })` is used
- **THEN** the graph SHALL pause just before executing `approval_node`
### Requirement: State snapshots on interrupt
When a graph uses a checkpointer, interrupt states SHALL be persisted so execution can be resumed across process boundaries.
#### Scenario: Interrupted state is checkpointed
- **WHEN** a graph with a checkpointer is interrupted
- **THEN** the checkpoint SHALL contain the interrupt state
- **THEN** restoring from that checkpoint SHALL yield the same interrupt state
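
One way to picture the interrupt and resume flow is a node that is re-run from the top on every resume, with stored resume values replayed by sequential index. The sketch assumes this replay model; the `resumeValues` array stands in for checkpoint storage, and `runNode` and `RunResult` are hypothetical names.

```typescript
class GraphInterrupt extends Error {
  readonly value: unknown;
  constructor(value: unknown) {
    super("graph interrupted");
    this.name = "GraphInterrupt";
    this.value = value;
  }
}

type Interrupt = (value: unknown) => unknown;
type NodeFn = (interrupt: Interrupt) => unknown;

interface RunResult {
  status: "done" | "interrupted";
  output?: unknown;
  pending?: unknown; // the value passed to the unresolved interrupt() call
}

// Re-runs the node from the top, replaying resume values by sequential index.
function runNode(node: NodeFn, resumeValues: unknown[]): RunResult {
  let index = 0;
  const interrupt: Interrupt = (value) => {
    if (index < resumeValues.length) return resumeValues[index++];
    throw new GraphInterrupt(value); // no resume value yet: pause here
  };
  try {
    return { status: "done", output: node(interrupt) };
  } catch (err) {
    if (err instanceof GraphInterrupt) {
      return { status: "interrupted", pending: err.value };
    }
    throw err;
  }
}
```

A node that calls `interrupt()` twice therefore needs two resume values before it completes. Under this model, code placed before an `interrupt()` call runs again on each resume, so it should be free of non-repeatable side effects.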

View File

@@ -0,0 +1,55 @@
## ADDED Requirements
### Requirement: LLM-as-judge evaluator factory
The system SHALL provide a `create_llm_as_judge()` function that creates an evaluator using an LLM to assess output quality.
Parameters:
- `prompt: string | Runnable | Callable` — evaluation prompt (string template with `{inputs}`, `{outputs}`, `{reference_outputs}`)
- `judge?: ModelClient | BaseChatModel` — LLM client
- `model?: string` — model identifier
- `system?: string` — optional system message
- `continuous: boolean = false` — float 0-1 scoring when true, boolean when false
- `choices?: number[]` — specific enum float values for score
- `use_reasoning: boolean = true` — include reasoning in output
- `few_shot_examples?: FewShotExample[]` — example evaluations
- `output_schema?: JSONSchema | ZodSchema` — custom structured output format
#### Scenario: String prompt evaluator returns scored result
- **WHEN** `create_llm_as_judge(prompt="Rate: {outputs}")` is invoked with `outputs="Hello world"`
- **THEN** it SHALL return an `EvaluatorResult` with `key: "score"` and a valid `score`
#### Scenario: Continuous scoring returns float
- **WHEN** `create_llm_as_judge(prompt=..., continuous=true)` scores output
- **THEN** the score SHALL be a float between 0.0 and 1.0
#### Scenario: Choices scoring returns enum value
- **WHEN** `create_llm_as_judge(prompt=..., choices=[0.0, 0.5, 1.0])` scores output
- **THEN** the score SHALL be exactly one of the enumerated choices
#### Scenario: Reasoning mode returns comment
- **WHEN** `create_llm_as_judge(prompt=..., use_reasoning=true)` scores output
- **THEN** the `comment` field SHALL contain the LLM's reasoning
#### Scenario: Few-shot examples are appended to prompt
- **WHEN** `few_shot_examples` are provided
- **THEN** they SHALL be appended as `<example>` XML blocks to the last user message
#### Scenario: Output schema returns structured dict
- **WHEN** `output_schema` is provided (e.g., z.object({ quality: z.number() }))
- **THEN** the evaluator SHALL return a dict matching that schema instead of `EvaluatorResult`
### Requirement: Async LLM-as-judge
The system SHALL provide `create_async_llm_as_judge()` with identical parameters, returning an async evaluator.
#### Scenario: Async evaluator returns same structure as sync
- **WHEN** `await` is used on an async evaluator invocation
- **THEN** the result SHALL match the same structure as the sync equivalent
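
The factory pattern and the three scoring modes could be sketched in TypeScript as below. This is a simplified illustration, not the OpenEvals API: names are camelCased, the `Judge` stub returns a numeric score that the factory coerces (a real judge would instead be constrained by structured output), and the 0.5 boolean threshold and nearest-choice snapping are assumptions.

```typescript
interface EvaluatorResult {
  key: string;
  score: number | boolean;
  comment?: string;
}

// Hypothetical minimal judge: takes the rendered prompt, returns a raw score and reasoning.
type Judge = (prompt: string) => { score: number; reasoning: string };

interface JudgeOptions {
  prompt: string;
  judge: Judge;
  continuous?: boolean;
  choices?: number[];
  useReasoning?: boolean;
}

interface EvalArgs { inputs?: string; outputs: string; referenceOutputs?: string; }

function createLlmAsJudge(options: JudgeOptions): (args: EvalArgs) => EvaluatorResult {
  const { prompt, judge, continuous = false, choices, useReasoning = true } = options;
  return (args) => {
    // Function replacers avoid "$" patterns in user text being interpreted.
    const rendered = prompt
      .replace("{inputs}", () => args.inputs ?? "")
      .replace("{outputs}", () => args.outputs)
      .replace("{reference_outputs}", () => args.referenceOutputs ?? "");
    const raw = judge(rendered);

    let score: number | boolean;
    if (choices && choices.length > 0) {
      // Assumed coercion: snap to the nearest allowed choice.
      score = choices.reduce((best, c) =>
        Math.abs(c - raw.score) < Math.abs(best - raw.score) ? c : best);
    } else if (continuous) {
      score = Math.min(1, Math.max(0, raw.score));
    } else {
      score = raw.score >= 0.5; // assumed threshold for boolean scoring
    }
    return useReasoning
      ? { key: "score", score, comment: raw.reasoning }
      : { key: "score", score };
  };
}
```

Because the evaluator is a plain function, a test suite can swap in a deterministic `Judge` stub without touching any network client.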

View File

@@ -0,0 +1,39 @@
## ADDED Requirements
### Requirement: Multi-turn conversation simulation
The system SHALL provide `run_multiturn_simulation()` that simulates a multi-turn conversation between an app and a simulated user.
Parameters:
- `app: Callable[[ChatCompletionMessage], ChatCompletionMessage]` — the application under test
- `user: Callable | string[]` — simulated user (dynamic or static responses)
- `max_turns?: number` — maximum conversation turns
- `trajectory_evaluators?: EvalFunction[]` — evaluators that assess the final trajectory
- `stopping_condition?: Callable[[Message[], number], boolean]` — early termination
- `reference_outputs?: unknown` — passed to evaluators
#### Scenario: Static user responses drive conversation
- **WHEN** `user=["Hello", "Tell me more", "Goodbye"]` with `max_turns=3`
- **THEN** the simulation SHALL alternate between user responses and app responses for 3 turns
#### Scenario: Dynamic simulated user adapts to context
- **WHEN** `user` is a `Callable` receiving the current trajectory
- **THEN** the user function SHALL receive the current conversation history and return the next message
#### Scenario: Trajectory evaluators run after simulation
- **WHEN** `trajectory_evaluators` are provided
- **THEN** each evaluator SHALL receive the full conversation trajectory as `outputs`
- **THEN** the simulation result SHALL include `evaluator_results` from each evaluator
#### Scenario: Stopping condition terminates early
- **WHEN** `stopping_condition` returns `true` before `max_turns`
- **THEN** the simulation SHALL terminate immediately
#### Scenario: Async simulation is supported
- **WHEN** `run_multiturn_simulation_async()` is called with async `app` and `user` functions
- **THEN** the simulation SHALL await each turn and return the same result structure
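
The turn loop described by these scenarios can be sketched as below. This is a synchronous illustration only: `runSimulationSketch`, the `Message` shape, and the fallback of 5 turns when `max_turns` is omitted are assumptions, and trajectory evaluators are left out.

```typescript
// Hypothetical message shape: the spec only names the parameters, not this type.
interface Message {
  role: "user" | "assistant";
  content: string;
}

interface SimulationOptions {
  app: (input: Message) => Message;
  user: string[] | ((trajectory: Message[]) => Message);
  max_turns?: number;
  stopping_condition?: (trajectory: Message[], turn: number) => boolean;
}

function runSimulationSketch(opts: SimulationOptions): Message[] {
  const trajectory: Message[] = [];
  const user = opts.user;
  // Assumption: 5 turns when `max_turns` is not given.
  const limit = opts.max_turns ?? 5;
  for (let turn = 0; turn < limit; turn++) {
    let userMessage: Message;
    if (Array.isArray(user)) {
      if (turn >= user.length) break; // static responses exhausted
      userMessage = { role: "user", content: user[turn] };
    } else {
      // Dynamic user: receives the conversation so far.
      userMessage = user(trajectory);
    }
    trajectory.push(userMessage);
    trajectory.push(opts.app(userMessage));
    // Early termination once the stopping condition holds.
    if (opts.stopping_condition !== undefined && opts.stopping_condition(trajectory, turn + 1)) break;
  }
  return trajectory;
}
```

One turn here means one user message followed by one app response, which is how the static-user scenario counts its 3 turns.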

View File

@@ -0,0 +1,49 @@
## ADDED Requirements
### Requirement: Pregel execution engine
The system SHALL implement a Pregel-style superstep execution engine where:
- Each "superstep" executes all ready nodes concurrently
- Nodes communicate through typed channels (not direct function calls)
- Channel writes from one superstep are visible as reads in the next
- The engine supports `PULL` (edge-triggered) and `PUSH` (dynamic Send) task scheduling
#### Scenario: Nodes execute in dependency order
- **WHEN** node B subscribes to channel A
- **THEN** node B SHALL execute in the superstep after node A writes to channel A
#### Scenario: Concurrent nodes run in parallel
- **WHEN** two nodes have no dependencies between them
- **THEN** they SHALL execute concurrently within the same superstep
#### Scenario: Dynamic Send spawns new node executions
- **WHEN** a node calls `send("node_c", { ... })` via `Command`
- **THEN** `node_c` SHALL be scheduled for execution in the current or next superstep
### Requirement: Graph compilation
The system SHALL provide `graph.compile()` that produces a runnable compiled graph.
Parameters:
- `checkpointer?: Checkpointer` — optional persistence
- `interruptBefore?: string[]` — nodes to pause before
- `interruptAfter?: string[]` — nodes to pause after
- `name?: string` — graph name
#### Scenario: Compiled graph can be invoked
- **WHEN** `compiled_graph.invoke({ messages: [] })` is called
- **THEN** it SHALL execute all nodes and return the final state
### Requirement: Recursion limit
The system SHALL enforce a configurable recursion limit to prevent infinite loops.
#### Scenario: Exceeding recursion limit throws
- **WHEN** a graph exceeds the recursion limit
- **THEN** a `GraphRecursionError` SHALL be thrown
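
The superstep, channel-visibility, and recursion-limit requirements in this file can be illustrated together. The sketch below is an assumption-heavy simplification: `runSupersteps` and `PregelNode` are hypothetical names, nodes run sequentially rather than concurrently, only PULL scheduling is modelled, and the default limit of 25 is a guess, not a value from the spec.

```typescript
class GraphRecursionError extends Error {}

interface PregelNode {
  subscribe: string[]; // channels whose updates trigger this node
  run: (channels: Map<string, unknown>) => Record<string, unknown>; // returns channel writes
}

function runSupersteps(
  nodes: Record<string, PregelNode>,
  initialWrites: Record<string, unknown>,
  recursionLimit: number = 25, // assumed default
): { channels: Map<string, unknown>; order: string[][] } {
  const channels = new Map<string, unknown>(Object.entries(initialWrites));
  let updated = new Set<string>(Object.keys(initialWrites));
  const order: string[][] = [];
  for (let step = 0; updated.size > 0; step++) {
    if (step >= recursionLimit) {
      throw new GraphRecursionError("exceeded " + recursionLimit + " supersteps");
    }
    // A node is ready when a channel it subscribes to changed in the previous superstep.
    const ready = Object.keys(nodes).filter((name) =>
      nodes[name].subscribe.some((channel) => updated.has(channel)),
    );
    if (ready.length === 0) break;
    order.push(ready);
    // Every ready node reads the same channel state; writes are buffered until all have run.
    const pending = ready.map((name) => nodes[name].run(channels));
    updated = new Set<string>();
    for (const writes of pending) {
      for (const [channel, value] of Object.entries(writes)) {
        channels.set(channel, value);
        updated.add(channel);
      }
    }
  }
  return { channels, order };
}
```

Buffering the writes is what makes a write in one superstep visible only as a read in the next one.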

View File

@@ -0,0 +1,61 @@
## ADDED Requirements
### Requirement: Command execution (blocking)
The system SHALL provide `sandbox.runCommand(cmd, args?, opts?)` that executes a command inside the sandbox and waits for completion.
Parameters:
- `cmd: string` — command to execute
- `args?: string[]` — command arguments
- `cwd?: string` — working directory
- `env?: Record<string, string>` — per-command environment variables
- `sudo?: boolean` — execute with root privileges
- `timeoutMs?: number` — max execution time (SIGKILL on expiry)
- `signal?: AbortSignal` — cancellation
#### Scenario: Blocking runCommand returns finished result with exit code
- **WHEN** `sandbox.runCommand("echo", ["hello"])` is called
- **THEN** it SHALL return a `CommandFinished` instance with `exitCode: 0`
#### Scenario: Command timeout kills process
- **WHEN** `sandbox.runCommand("sleep", ["100"], { timeoutMs: 100 })` is executed
- **THEN** it SHALL return a non-zero exit code after ~100ms
#### Scenario: Stderr is captured separately
- **WHEN** a command writes to both stdout and stderr
- **THEN** `result.stdout()` and `result.stderr()` SHALL return their respective streams
### Requirement: Detached command execution
The system SHALL support `{ detached: true }` mode where `runCommand()` returns immediately with a live `Command` handle.
#### Scenario: Detached command returns before completion
- **WHEN** `sandbox.runCommand({ cmd: "sleep", args: ["5"], detached: true })` is called
- **THEN** it SHALL return a `Command` instance immediately (before the process exits)
#### Scenario: Detached command can be waited on
- **WHEN** `command.wait()` is called on a detached command
- **THEN** it SHALL return a `CommandFinished` when the process exits
### Requirement: Command log streaming
The system SHALL provide `command.logs()` as an async iterable of stdout/stderr log lines.
#### Scenario: Logs stream output lines
- **WHEN** `for await (const log of command.logs())` is iterated
- **THEN** each `log` SHALL have `stream: "stdout" | "stderr"` and `data: string`
### Requirement: Command kill
The system SHALL provide `command.kill(signal?)` to send a POSIX signal to a running command.
#### Scenario: Default kill sends SIGTERM
- **WHEN** `command.kill()` is called without a signal
- **THEN** SIGTERM SHALL be sent to the process

View File

@@ -0,0 +1,50 @@
## ADDED Requirements
### Requirement: Filesystem API matching node:fs/promises
The system SHALL provide `sandbox.fs` implementing the Node.js `fs/promises` API:
- `readFile(path, encoding?)` → `Buffer | string`
- `writeFile(path, data)` → `void`
- `appendFile(path, data)` → `void`
- `mkdir(path, { recursive? })` → `void`
- `readdir(path, { withFileTypes? })` → `string[] | Dirent[]`
- `stat(path)` / `lstat(path)` → `Stats`
- `unlink(path)`, `rm(path, { recursive?, force? })`, `rmdir(path)` → `void`
- `rename(oldPath, newPath)` → `void`
- `copyFile(src, dest)` → `void`
- `chmod(path, mode)`, `chown(path, uid, gid)` → `void`
- `symlink(target, path)` → `void`
- `readlink(path)`, `realpath(path)` → `string`
- `truncate(path, len?)` → `void`
- `mkdtemp(prefix)` → `string`
- `access(path)` → `void` (rejects when the path is not accessible)
- `exists(path)` → `boolean`
#### Scenario: ReadFile returns correct content
- **WHEN** `sandbox.fs.readFile("/etc/hostname", "utf8")` is called
- **THEN** it SHALL return the file content as a string
#### Scenario: WriteFile creates new file
- **WHEN** `sandbox.fs.writeFile("/tmp/test.txt", "hello")` is called
- **THEN** subsequent `sandbox.fs.readFile("/tmp/test.txt", "utf8")` SHALL return `"hello"`
#### Scenario: Readdir lists directory contents
- **WHEN** `sandbox.fs.readdir("/")` is called
- **THEN** it SHALL return an array of filenames
#### Scenario: Stat returns file metadata
- **WHEN** `sandbox.fs.stat("/etc/hostname")` is called
- **THEN** it SHALL return a `Stats`-compatible object with `size`, `isFile()`, `isDirectory()`, `mode`, `uid`, `gid`, `mtime`, etc.
#### Scenario: Mkdir creates intermediate directories
- **WHEN** `sandbox.fs.mkdir("/tmp/a/b/c", { recursive: true })` is called
- **THEN** the directory `/tmp/a/b/c` SHALL exist
#### Scenario: Exists returns false for missing files
- **WHEN** `sandbox.fs.exists("/nonexistent")` is called
- **THEN** it SHALL return `false`

View File

@@ -0,0 +1,70 @@
## ADDED Requirements
### Requirement: Sandbox creation
The system SHALL provide a `Sandbox.create()` static method that provisions a new isolated compute environment.
Parameters:
- `name?: string` — optional human-readable name
- `source?: { type: "git" | "tarball" | "snapshot" }` — source for initial filesystem
- `ports?: number[]` — ports to expose (max 4)
- `timeout?: number` — auto-terminate timeout in ms
- `resources?: { vcpus: number }` — CPU allocation (2048 MB RAM per vCPU)
- `runtime?: string` — runtime identifier
- `networkPolicy?: NetworkPolicy` — network restrictions
- `env?: Record<string, string>` — default environment variables
- `tags?: Record<string, string>` — metadata tags (max 5)
- `persistent?: boolean` — persistent filesystem across sessions
- `signal?: AbortSignal` — cancellation support
#### Scenario: Create returns a running Sandbox instance
- **WHEN** `Sandbox.create()` is called with valid parameters
- **THEN** it SHALL return a `Sandbox` instance with a running session
#### Scenario: Create supports AsyncDisposable
- **WHEN** `Sandbox.create()` is used with `await using`
- **THEN** the sandbox SHALL be automatically stopped when scope exits
#### Scenario: Source specifies initial filesystem content
- **WHEN** `source: { type: "git", url: "..." }` is provided
- **THEN** the sandbox SHALL clone the git repository on creation
### Requirement: Sandbox retrieval
The system SHALL provide `Sandbox.get()` to retrieve an existing sandbox and `Sandbox.getOrCreate()` for idempotent get-or-create.
#### Scenario: Get retrieves existing sandbox
- **WHEN** `Sandbox.get({ name: "my-sandbox" })` is called for an existing sandbox
- **THEN** it SHALL return the sandbox with its session resumed
#### Scenario: GetOrCreate creates when not found
- **WHEN** `Sandbox.getOrCreate({ name: "new-sandbox", onCreate: ... })` is called and the sandbox does not exist
- **THEN** it SHALL create a new sandbox and call `onCreate` once
### Requirement: Sandbox forking
The system SHALL provide `Sandbox.fork()` to create a new sandbox from an existing one's current filesystem state.
#### Scenario: Fork preserves filesystem state
- **WHEN** `Sandbox.fork({ sourceSandbox: "original" })` is called
- **THEN** the new sandbox SHALL start with the filesystem state of the source sandbox
### Requirement: Sandbox update and delete
The system SHALL support `sandbox.update()` for configuration changes and `sandbox.delete()` for removal.
#### Scenario: Update changes sandbox config
- **WHEN** `sandbox.update({ timeout: 300000 })` is called
- **THEN** the sandbox's timeout SHALL be updated for subsequent sessions
#### Scenario: Delete removes the sandbox
- **WHEN** `sandbox.delete()` is called
- **THEN** the sandbox SHALL be permanently removed

View File

@@ -0,0 +1,52 @@
## ADDED Requirements
### Requirement: Network policy type
The system SHALL define a `NetworkPolicy` type with three forms:
- `"allow-all"` — full internet access (default)
- `"deny-all"` — no external access
- `{ allow?: string[] | Record<string, NetworkPolicyRule[]>; subnets?: { allow?: string[]; deny?: string[] } }` — custom rules
#### Scenario: Allow-all permits all traffic
- **WHEN** `networkPolicy: "allow-all"` is set
- **THEN** all outbound traffic SHALL be permitted
#### Scenario: Deny-all blocks all traffic
- **WHEN** `networkPolicy: "deny-all"` is set
- **THEN** all outbound traffic SHALL be denied
#### Scenario: Domain allowlist restricts access
- **WHEN** `networkPolicy: { allow: ["*.npmjs.org"] }` is set
- **THEN** traffic to `registry.npmjs.org` SHALL be allowed and all other traffic SHALL be denied
#### Scenario: Wildcard domains match subdomains
- **WHEN** a domain pattern starts with `*.` (e.g., `*.example.com`)
- **THEN** it SHALL match any subdomain of that domain
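
A minimal sketch of the wildcard matching described above. `matchesDomain` and `isHostAllowed` are hypothetical helpers, and the choice that `*.example.com` does not match the bare apex `example.com` is an assumption, since the spec only says the pattern matches subdomains.

```typescript
// Returns true when `host` matches an allowlist pattern such as "*.npmjs.org".
function matchesDomain(pattern: string, host: string): boolean {
  const p = pattern.toLowerCase();
  const h = host.toLowerCase();
  if (p.startsWith("*.")) {
    const suffix = p.slice(1); // ".npmjs.org"
    // Require at least one label in front of the suffix.
    // Assumption: the bare apex domain is not matched by the wildcard.
    return h.endsWith(suffix) && h.length > suffix.length;
  }
  return h === p;
}

// A host is allowed when any pattern in the allowlist matches it.
function isHostAllowed(allow: string[], host: string): boolean {
  return allow.some((pattern) => matchesDomain(pattern, host));
}
```

Comparing against the suffix with its leading dot keeps look-alike hosts such as `notnpmjs.org` from matching.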
### Requirement: Network policy rules with transformers
The system SHALL support per-domain rules with request transformers for header injection.
Parameters per rule:
- `match?: { path?, method?, queryString?, headers? }` — request matchers
- `transform?: { headers: Record<string, string> }[]` — header injection
- `forwardURL?: string` — HTTPS proxy forwarding
#### Scenario: Header transform injects authorization
- **WHEN** a request matches a rule with `transform: [{ headers: { authorization: "Bearer token" } }]`
- **THEN** the `authorization` header SHALL be injected before forwarding
### Requirement: Subnet filtering
The system SHALL support subnet-level access control via CIDR notation.
#### Scenario: Subnet allow takes precedence over domain deny
- **WHEN** `subnets: { allow: ["10.0.0.0/8"] }` is set
- **THEN** traffic to `10.0.0.1` SHALL be allowed regardless of domain rules
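
Subnet filtering depends on testing whether an address falls inside a CIDR block. The sketch below covers IPv4 only; `ipv4ToInt` and `inCidr` are hypothetical helpers, and the precedence between subnet and domain rules is not modelled.

```typescript
// Converts a dotted-quad IPv4 address to an unsigned 32-bit number, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// Returns true when `ip` lies inside the IPv4 block written as "base/prefixLength".
function inCidr(ip: string, cidr: string): boolean {
  const slash = cidr.indexOf("/");
  if (slash < 0) return false;
  const bitsText = cidr.slice(slash + 1);
  if (!/^\d{1,2}$/.test(bitsText)) return false;
  const bits = Number(bitsText);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(cidr.slice(0, slash));
  if (ipInt === null || baseInt === null || bits > 32) return false;
  // Drop the host bits with integer division and compare the network prefixes.
  const hostSize = Math.pow(2, 32 - bits);
  return Math.floor(ipInt / hostSize) === Math.floor(baseInt / hostSize);
}
```

Integer division is used instead of bit shifts so that values above 2^31 do not turn negative.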

View File

@@ -0,0 +1,59 @@
## ADDED Requirements
### Requirement: Snapshot creation
The system SHALL provide `sandbox.snapshot()` to create a point-in-time filesystem snapshot.
Parameters:
- `expiration?: number` — TTL in milliseconds (0 for no expiration)
#### Scenario: Snapshot stops the session and returns Snapshot instance
- **WHEN** `sandbox.snapshot()` is called on a running sandbox
- **THEN** the current session SHALL be stopped and a `Snapshot` SHALL be returned
### Requirement: Snapshot retrieval and listing
The system SHALL provide `Snapshot.get()`, `Snapshot.list()`, and `Snapshot.tree()` for managing snapshots.
#### Scenario: Retrieve snapshot by ID
- **WHEN** `Snapshot.get({ snapshotId: "snap_abc" })` is called
- **THEN** it SHALL return the snapshot with matching ID
#### Scenario: List snapshots with pagination
- **WHEN** `Snapshot.list({ name: "my-sandbox" })` is called
- **THEN** it SHALL return a paginated list of snapshots for that sandbox
#### Scenario: Ancestry tree is accessible
- **WHEN** `Snapshot.tree({ snapshotId: "snap_abc" })` is called
- **THEN** it SHALL return the ancestry tree of the snapshot
### Requirement: Snapshot deletion
The system SHALL provide `snapshot.delete()` to remove a snapshot.
#### Scenario: Deleted snapshot is no longer listable
- **WHEN** `snapshot.delete()` is called and then `Snapshot.list()` is called
- **THEN** the deleted snapshot SHALL no longer appear in the list
### Requirement: Snapshot-based sandbox creation
The system SHALL support creating sandboxes from snapshots via `Sandbox.create({ source: { type: "snapshot", snapshotId } })`.
#### Scenario: Sandbox created from snapshot has matching filesystem
- **WHEN** a file is written in a sandbox, that sandbox is snapshotted, and a second sandbox is created with the resulting snapshot as its source
- **THEN** the second sandbox SHALL contain the file from the first
### Requirement: Snapshot retention
The system SHALL support `keepLastSnapshots` retention policy on sandboxes.
#### Scenario: Retention evicts oldest snapshots
- **WHEN** a sandbox has `keepLastSnapshots: { count: 3 }` and a 4th snapshot is created
- **THEN** the oldest snapshot SHALL be evicted
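
The eviction rule can be sketched as a pure function. `SnapshotRecord` and `applyRetention` are hypothetical names, and ordering by a numeric `createdAt` timestamp is an assumption about how "oldest" is decided.

```typescript
interface SnapshotRecord {
  id: string;
  createdAt: number; // assumed: epoch milliseconds
}

// Splits snapshots into the newest `count` to keep and the remainder to evict.
function applyRetention(
  snapshots: SnapshotRecord[],
  count: number,
): { kept: SnapshotRecord[]; evicted: SnapshotRecord[] } {
  // Sort a copy, newest first, so the caller's array is left untouched.
  const newestFirst = [...snapshots].sort((a, b) => b.createdAt - a.createdAt);
  return { kept: newestFirst.slice(0, count), evicted: newestFirst.slice(count) };
}
```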

View File

@@ -0,0 +1,43 @@
## ADDED Requirements
### Requirement: State definition via annotations
The system SHALL provide an `Annotation` API for defining graph state schemas:
- `Annotation<T>(reducer?)` — creates a state key with optional reducer
- `Annotation.Root({ key: Annotation<T> })` — combines keys into a state schema
- Reducers: `LastValue` (default — overwrite), `BinaryOperator` (custom merge function)
#### Scenario: Annotation.Root defines typed state
- **WHEN** `const State = Annotation.Root({ messages: Annotation<string[]>(addMessages), step: Annotation<number>() })` is defined
- **THEN** `State` SHALL have `State`, `Update`, and `Node` type members
#### Scenario: LastValue rejects multiple writes in one superstep
- **WHEN** two writes, `{ step: 2 }` and `{ step: 3 }`, reach the same LastValue channel within a single superstep
- **THEN** the LastValue channel SHALL throw an `InvalidUpdateError`
#### Scenario: BinaryOperator reducer accumulates
- **WHEN** a node returns `{ messages: ["hello"] }` and another returns `{ messages: ["world"] }` with an `addMessages` reducer
- **THEN** the final state SHALL contain `messages: ["hello", "world"]`
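
The difference between the two reducer kinds can be sketched with two small channel classes. The class names and the `update(writes)` method are assumptions made for illustration, not the engine's actual interface.

```typescript
class InvalidUpdateError extends Error {}

// LastValue: holds the single value written in a superstep and rejects concurrent writes.
class LastValueChannel<T> {
  value: T | undefined = undefined;

  update(writes: T[]): void {
    if (writes.length === 0) return;
    if (writes.length > 1) {
      throw new InvalidUpdateError("LastValue received multiple writes in one superstep");
    }
    this.value = writes[0];
  }
}

// BinaryOperator: folds every write into the current value with a reducer function.
class BinaryOperatorChannel<T> {
  value: T;
  reducer: (current: T, update: T) => T;

  constructor(reducer: (current: T, update: T) => T, initial: T) {
    this.reducer = reducer;
    this.value = initial;
  }

  update(writes: T[]): void {
    for (const write of writes) {
      this.value = this.reducer(this.value, write);
    }
  }
}
```

A key that several nodes may write in the same superstep therefore needs a reducer, while single-writer keys can stay on the default.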
### Requirement: StateGraph builder
The system SHALL provide a `StateGraph` class for constructing stateful agent graphs.
#### Scenario: StateGraph is constructed with state schema
- **WHEN** `new StateGraph({ stateSchema: State })` is called
- **THEN** the graph SHALL accept nodes that receive and can update the defined state
#### Scenario: Nodes can read and write state
- **WHEN** a node function receives state with `{ messages, step }` and returns `{ step: step + 1 }`
- **THEN** the graph SHALL update `step` and preserve `messages`
#### Scenario: Conditional edges route based on state
- **WHEN** `addConditionalEdges("node_a", (state) => state.step > 5 ? "end" : "node_b")` is added
- **THEN** execution SHALL route based on the state value at runtime

View File

@@ -0,0 +1,51 @@
## ADDED Requirements
### Requirement: Trajectory match evaluator
The system SHALL provide `create_trajectory_match_evaluator()` that compares agent tool-call trajectories against reference trajectories.
Parameters:
- `trajectory_match_mode: "strict" | "unordered" | "subset" | "superset"` — matching strategy
- `tool_args_match_mode: "exact" | "ignore" | "subset" | "superset"` — tool argument comparison
- `tool_args_match_overrides?: Record<string, ToolArgsMatchMode | Callable | string[]>` — per-tool custom matching
#### Scenario: Strict mode requires exact order
- **WHEN** output trajectory has tool calls `[A, B]` and reference is `[A, B]`
- **THEN** strict mode SHALL return score `true`
- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
- **THEN** strict mode SHALL return score `false`
#### Scenario: Unordered mode ignores order
- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
- **THEN** unordered mode SHALL return score `true`
#### Scenario: Subset mode accepts partial trajectory
- **WHEN** output trajectory has tool calls `[A]` and reference is `[A, B]`
- **THEN** subset mode SHALL return score `true`
#### Scenario: Superset mode allows extra tool calls
- **WHEN** output trajectory has tool calls `[A, B, C]` and reference is `[A, B]`
- **THEN** superset mode SHALL return score `true`
#### Scenario: Tool args ignore mode skips argument comparison
- **WHEN** `tool_args_match_mode="ignore"` is set
- **THEN** tool calls SHALL match regardless of their arguments
#### Scenario: Custom tool arg matcher is used
- **WHEN** `tool_args_match_overrides` contains a `Callable` for a tool name
- **THEN** that callable SHALL be invoked to compare the tool's arguments
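
The four trajectory modes reduce to order and containment checks over the sequence of tool calls. The sketch below compares tool names only, which corresponds to `tool_args_match_mode="ignore"`; `matchTrajectory` and `isContained` are hypothetical helpers.

```typescript
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// True when every call in `inner` can be paired with a distinct call in `outer`.
function isContained(inner: string[], outer: string[]): boolean {
  const remaining = [...outer];
  for (const name of inner) {
    const index = remaining.indexOf(name);
    if (index === -1) return false;
    remaining.splice(index, 1); // each reference call can be used once
  }
  return true;
}

function matchTrajectory(outputs: string[], reference: string[], mode: MatchMode): boolean {
  switch (mode) {
    case "strict":
      // Same calls in the same order.
      return outputs.length === reference.length && outputs.every((name, i) => name === reference[i]);
    case "unordered":
      // Same calls, any order.
      return isContained(outputs, reference) && isContained(reference, outputs);
    case "subset":
      // The output may omit reference calls but not add new ones.
      return isContained(outputs, reference);
    case "superset":
      // The output must include every reference call and may add extras.
      return isContained(reference, outputs);
  }
}
```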
### Requirement: Trajectory LLM-as-judge
The system SHALL provide `create_trajectory_llm_as_judge()` that uses an LLM to grade trajectory quality and accuracy.
#### Scenario: Trajectory is formatted as XML for LLM
- **WHEN** an LLM trajectory evaluator is invoked
- **THEN** the trajectory SHALL be formatted as XML with `<role>`, `<tool_call>`, `<tool_result>` elements

View File

@@ -0,0 +1,50 @@
## 1. Foundation: Core Types & Monorepo Setup ✅
- [x] 1.1 Initialize pnpm monorepo with turbo.json at root, configure `@agent-runtime/*` workspace packages
- [x] 1.2 Set up shared TypeScript config (strict mode, ESNext modules, path aliases)
- [x] 1.3 Implement `@agent-runtime/core` package: `EvaluatorResult`, `ScoreType`, `ModelClient` protocol, `Serializable` interface
- [x] 1.4 Implement `@agent-runtime/core` serialization protocol: `toJSON()`/`fromJSON()` pattern on stateful types
- [x] 1.5 Implement `@agent-runtime/core` error types: `EvalError`
- [x] 1.6 Implement `@agent-runtime/core` utility functions: message normalization, XML formatting, JSON schema construction
## 2. Eval: LLM-as-Judge Core
- [ ] 2.1 Implement `_construct_default_output_json_schema()` for continuous/binary/choices scoring with reasoning
- [ ] 2.2 Implement prompt formatting (string templates, attachments, system messages)
- [ ] 2.3 Implement `_append_few_shot_examples()` with XML `<example>` formatting
- [ ] 2.4 Implement `_create_llm_as_judge_scorer()` — core scorer with structured output via OpenAI JSON schema
- [ ] 2.5 Implement `create_llm_as_judge()` factory wrapping scorer into `_run_evaluator()`
- [ ] 2.6 Implement async variants: `create_async_llm_as_judge()`, `_create_async_llm_as_judge_scorer()`
- [ ] 2.7 Implement `_run_evaluator_untyped()` and `_process_score()` for result aggregation
- [ ] 2.8 Write unit tests for LLM-as-judge: string prompts, continuous scoring, choices, reasoning, few-shot
## 3. Eval: Trajectory Evaluators
- [ ] 3.1 Implement trajectory matching utilities: `_normalize_to_openai_messages_list()`, `_extract_tool_calls()`
- [ ] 3.2 Implement `_is_trajectory_superset()` core comparator with `_get_matcher_for_tool_name()` override system
- [ ] 3.3 Implement strict/unordered/subset/superset matching scorers
- [ ] 3.4 Implement `create_trajectory_match_evaluator()` with all 4 modes and `tool_args_match_overrides`
- [ ] 3.5 Write tests: all 4 match modes, tool args ignore, custom matchers
## 4. Eval: Code Correctness Evaluators
- [ ] 4.1 Implement code extraction: `_extract_code_from_markdown_code_blocks()` regex parser
- [ ] 4.2 Implement `_create_base_code_evaluator()` with pluggable extraction pipeline
- [ ] 4.3 Implement `create_code_llm_as_judge()` combining extraction + LLM scoring
- [ ] 4.4 Implement `create_pyright_evaluator()` with temp file execution and JSON output parsing
- [ ] 4.5 Write tests: markdown extraction, Pyright static analysis
## 5. Eval: Prompt Library
- [ ] 5.1 Export Quality prompt templates: correctness, conciseness, hallucination, answer_relevance, code_correctness, plan_adherence
- [ ] 5.2 Export Safety/Security prompt templates: toxicity, fairness, pii_leakage, prompt_injection
- [ ] 5.3 Export Trajectory prompt templates: trajectory_accuracy (with and without reference), tool_selection
- [ ] 5.4 Export Conversation prompt templates: user_satisfaction, task_completion, agent_tone
## 6. Documentation & Release
- [ ] 6.1 Write README with architecture overview and getting-started example
- [ ] 6.2 Document each package with tsdoc exports
- [ ] 6.3 Write usage examples: basic eval, code correctness check
- [ ] 6.4 Add CI pipeline: lint, type-check, test
- [ ] 6.5 Publish initial alpha for `@agent-runtime/eval` package

View File

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-06-07

View File

@@ -0,0 +1,62 @@
## Context
Three workflow engine patterns were researched: **Archon** (DAG-based YAML, git isolation), **Agent SOP** (markdown instructions with RFC 2119 constraints), and **Vercel Workflow** (event-sourced durable execution). Each excels in one dimension but has fundamental gaps:
- **Archon**: Clean DAG format + variable substitution + approval gates, but no crash recovery, tightly coupled to its monorepo (Bun/SQLite/Claude SDK)
- **Agent SOP**: Zero parser complexity, AI-native markdown, but completely stateless — no execution engine, no validation, no persistence
- **Vercel Workflow**: Gold-standard durability via event sourcing, but requires a Rust SWC plugin, a VM sandbox, and a 24-36 week rebuild, which is more complexity than most use cases justify
**Ion** extracts the portable essence of each: Archon's DAG schema and executor, Agent SOP's markdown readability, Vercel's event sourcing (simplified — no SWC, no VM, no compile transforms).
## Goals / Non-Goals
**Goals:**
- Fully portable DAG execution engine in pure TypeScript (zero Rust/SWC/wasm)
- YAML-first workflow definitions with 7 node types (command, prompt, bash, script, loop, approval, cancel)
- `.sop.md` markdown format as a secondary input (transpiled to DAG nodes)
- Event-sourced persistence for crash recovery with deterministic replay — simplified to "log of node outcomes" rather than "log of every async operation"
- Pluggable storage backends: filesystem (dev), SQLite/Postgres (production)
- CLI tool + library API dual distribution
- Approval gates with capture_response and on_reject
- Variable substitution ($nodeId.output, $ARGUMENTS, $LOOP_PREV_OUTPUT, etc.)
- Script execution via bun/node (TS) and uv/python3 (Python) with deps support
**Non-Goals:**
- No SWC compiler plugin or build-time transforms (Vercel's approach is overkill for this scope)
- No VM sandbox for workflow execution (workflows run as regular async functions)
- No git worktree isolation (leave to the host application)
- No multi-tenant or serverless platform (single-tenant CLI/library focus)
- No web UI in the initial build (CLI + library only; web can be added later)
- No AI provider integration (host application provides the AI; Ion just routes prompts)
## Decisions
### Decision 1: Event Log = Node Outcomes, Not Every Async Operation
**Vercel** logs every `step_created`, `step_completed`, `wait_created`, `hook_received` etc. — 17 event types. This requires SWC transforms to intercept all async boundaries.
**Ion** logs only *node-level* events: `node_started`, `node_completed`, `node_failed`, `workflow_started`, `workflow_completed`, `workflow_failed`. No micro-events. Replay means "re-run the DAG from the top, skipping completed nodes using stored outputs" — identical to Archon's `resume` approach.
**Rationale**: Simpler by an order of magnitude. No interceptors, no transforms, no VM. Crash recovery works: if the process dies mid-workflow, replay skips completed nodes and re-executes from the last failed/incomplete layer.
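
Node-level replay as described in this decision can be sketched as follows. Only the event names come from the text above; `replay`, `DagNode`, and `NodeEvent` are hypothetical, nodes are synchronous, and the node list is assumed to be in topological order already.

```typescript
interface NodeEvent {
  type: "node_started" | "node_completed" | "node_failed";
  nodeId: string;
  output?: string;
}

interface DagNode {
  id: string;
  depends_on: string[];
  run: (outputs: Map<string, string>) => string;
}

// Re-runs the DAG from the top; nodes with a `node_completed` event reuse their stored output.
function replay(nodes: DagNode[], log: NodeEvent[]): { outputs: Map<string, string>; executed: string[] } {
  const outputs = new Map<string, string>();
  for (const event of log) {
    if (event.type === "node_completed" && event.output !== undefined) {
      outputs.set(event.nodeId, event.output);
    }
  }
  const executed: string[] = [];
  for (const node of nodes) {
    if (outputs.has(node.id)) continue; // completed before the crash: skip
    if (!node.depends_on.every((dep) => outputs.has(dep))) {
      throw new Error("dependency not satisfied for " + node.id);
    }
    log.push({ type: "node_started", nodeId: node.id });
    try {
      const output = node.run(outputs);
      log.push({ type: "node_completed", nodeId: node.id, output });
      outputs.set(node.id, output);
      executed.push(node.id);
    } catch (error) {
      log.push({ type: "node_failed", nodeId: node.id });
      throw error;
    }
  }
  return { outputs, executed };
}
```

A node that logged `node_started` but never `node_completed` is simply run again, which is the "no intra-node recovery" trade-off noted under Risks.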
### Decision 2: Pure TypeScript — No Rust, No SWC, No WASM
Of the three engines studied, Archon is pure TS, Vercel depends on a Rust SWC plugin, and Agent SOP is pure Python. The SWC plugin is the single biggest contributor to Vercel's 24-36 week build time.
**Ion** stays pure TS. The DAG executor, YAML loader, variable substitution, event log — all standard async/await. No build step beyond `tsc` or `bun build`.
### Decision 3: YAML Primary, Markdown Secondary
**Archon's YAML** format is the primary definition: structured, validated by Zod, machine-parseable. **Agent SOP's markdown** is the secondary format: human-writable, conversational, auto-converted.
The transpiler is simple: parse `## Parameters` → extract required fields, parse `## Steps` → convert each step to a `prompt:` node with constraints embedded in the prompt text. No AST-level parsing needed.
### Decision 4: Storage via IWorkflowStore Interface
**Archon's pattern**: `IWorkflowStore` interface with `createWorkflowRun`, `getWorkflowRun`, `updateWorkflowRun`, `failWorkflowRun`, `createWorkflowEvent`, `getCompletedDagNodeOutputs`. Adapters implement the interface.
**Ion** copies this pattern exactly. FilesystemStore (JSON files per run), SqliteStore, PostgresStore. The interface is the seam.
### Decision 5: CLI + Library, Not Server
**Archon** has a server + web UI. **Vercel** is a platform SDK. **Ion** ships only as a CLI + library.
The CLI wraps the library: `ion run <workflow>`, `ion list`, `ion approve`, `ion reject`, `ion resume`. The library exports `executeWorkflow()`, `createStore()`, `parseWorkflow()`, `discoverWorkflows()`.
## Risks / Trade-offs
| Risk | Mitigation |
|---|---|
| **Event-sourcing is simplified to node-level only** — means no intra-node recovery (if a 30-min AI prompt crashes at 29 min, it restarts from scratch). | Acceptable tradeoff. AI prompts are idempotent. For script/bash nodes, provide `timeout` and `retry` config. Node-level replay is 90% of the value at 10% of the complexity. |
| **No VM sandbox** — workflows run as regular async functions, so `while(true){}` hangs the process. | Document that workflow code must be well-behaved. The `idle_timeout` per node provides a circuit breaker. Production deployments can run workflows in a separate child process. |
| **Markdown-to-YAML transpiler** may lose nuance — SOP's RFC 2119 constraints are prose, not structured. | Constraints stay embedded in the prompt text of the resulting `prompt:` node. The transpiler extracts Parameters (→ node metadata) and Steps (→ prompt body). Lossless for the critical path. |
| **Competing with existing engines** — Archon exists, Temporal exists, Inngest exists. | Ion targets a different niche: portable CLI-first engine that fits in a single repo. Not a platform, not a cloud service. |

View File

@@ -0,0 +1,41 @@
## Why
Current workflow engines force a tradeoff between simplicity and durability. Archon has a clean DAG-based YAML format but no crash recovery. Vercel Workflow has bulletproof deterministic replay but requires a Rust compiler plugin and 24-36 weeks to build. Agent SOP proves that human-readable markdown workflows work, but lacks structured execution. There is no portable workflow engine that combines a simple DAG format, human-readable definitions, and durable event-sourced execution in a single, buildable package.
## What Changes
Introduce **Ion** — a portable hybrid workflow engine that combines the three approaches:
- **Archon-style YAML DAG format** with `nodes:`, `depends_on:`, and trigger rules as the primary workflow definition
- **Agent SOP-style `.sop.md` markdown** as a secondary human-readable format, auto-converted to the DAG representation
- **Vercel-style event log** for deterministic replay and crash recovery, but simplified — no SWC plugin, no VM sandbox, no compile-time transforms
- **Multi-backend storage** (filesystem for dev, SQLite/Postgres for production)
- **CLI + library** dual distribution: use as a CLI tool or embed as a library
- **No Rust compiler plugins, no SWC, no VM sandbox** — pure TypeScript/JavaScript, zero compile-time transforms
## Capabilities
### New Capabilities
- `dag-engine`: DAG execution engine with topological ordering, concurrent layers, trigger rules (`all_success`, `one_success`, `all_done`, `none_failed_min_one_success`), and `when:` condition evaluation
- `yaml-format`: Workflow definition in YAML with 7 node types (command, prompt, bash, script, loop, approval, cancel) plus `depends_on`, `trigger_rule`, `output_format`, `retry`, `timeout`
- `markdown-format`: `.sop.md` human-readable workflow format with RFC 2119 constraint keywords, auto-converted to DAG nodes
- `event-sourcing`: Append-only event log for workflow runs with deterministic replay for crash recovery — simplified (no SWC, no VM sandbox)
- `variable-substitution`: `$nodeId.output`, `$nodeId.output.field`, `$ARGUMENTS`, `$ARTIFACTS_DIR`, `$WORKFLOW_ID`, `$LOOP_PREV_OUTPUT`, `$REJECTION_REASON` with strict field access
- `script-execution`: Script node type running TypeScript (bun/node) and Python (uv/python3) with `deps:` support and `timeout:`
- `human-approval`: Approval gate nodes that pause execution for human review with `capture_response` and `on_reject` retry support
- `storage-backends`: Pluggable storage — filesystem (dev), SQLite, Postgres — with IWorkflowStore interface
- `workflow-lifecycle`: Run states `pending → running → paused/completed/failed/cancelled`, resume skipping completed nodes, event-driven observability
- `cli-tool`: Command-line interface for listing, running, approving, rejecting, resuming, and cleaning up workflow runs
- `library-api`: Programmatic API for embedding the engine in other applications
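
The trigger rules named under `dag-engine` are not defined in this proposal. The sketch below shows one plausible reading, loosely following the semantics of similarly named rules in other schedulers; the `shouldRun` helper, the `UpstreamState` values, and every rule interpretation here are assumptions to be confirmed against the `dag-engine` spec.

```typescript
type TriggerRule = "all_success" | "one_success" | "all_done" | "none_failed_min_one_success";
type UpstreamState = "success" | "failed" | "skipped" | "pending";

// Decides whether a node may run, given the states of the nodes it depends on.
function shouldRun(rule: TriggerRule, upstream: UpstreamState[]): boolean {
  // Assumption: every rule waits until all upstream nodes have settled.
  if (upstream.some((state) => state === "pending")) return false;
  const successes = upstream.filter((state) => state === "success").length;
  const failures = upstream.filter((state) => state === "failed").length;
  switch (rule) {
    case "all_success":
      return successes === upstream.length;
    case "one_success":
      return successes >= 1;
    case "all_done":
      return true; // runs regardless of outcome, e.g. for cleanup nodes
    case "none_failed_min_one_success":
      return failures === 0 && successes >= 1; // skipped upstream nodes are tolerated
  }
}
```

Nodes without dependencies are assumed to bypass this check entirely.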
### Modified Capabilities
<!-- No existing specs to modify — this is a greenfield change. -->
## Impact
- **Greenfield project** — no existing code to modify, all new artifacts under `ion/` or equivalent package path
- **Dependencies**: Zod (schema validation), nanoid/ulid (ID generation), js-yaml (YAML parsing), chokidar (file watching for dev mode)
- **Optional dependencies**: better-sqlite3 / postgres.js (production storage backends), bun (fast script runtime), highlight.js (markdown rendering)
- **No** Rust, SWC, wasm, or compile-time transforms in the core engine

View File

@@ -0,0 +1,76 @@
## ADDED Requirements
### Requirement: Workflow listing
The CLI SHALL provide a `list` command that displays all discovered workflows and their descriptions.
#### Scenario: List workflows
- **WHEN** `ion list` is run
- **THEN** all discovered workflows SHALL be listed with name, description, and source (bundled/project)
### Requirement: Workflow execution
The CLI SHALL provide a `run` command that executes a workflow by name with optional arguments.
#### Scenario: Run workflow with message
- **WHEN** `ion run analyze "analyze the codebase"` is run
- **THEN** the `analyze` workflow SHALL execute with the provided user message
#### Scenario: Run in specific directory
- **WHEN** `ion run build --cwd /path/to/project` is run
- **THEN** the workflow SHALL use the specified working directory
#### Scenario: Run with specific store
- **WHEN** `ion run deploy --store sqlite --db-path ./ion.db` is run
- **THEN** the specified store backend SHALL be used
### Requirement: Workflow approval commands
The CLI SHALL provide `approve` and `reject` commands for responding to approval gates.
#### Scenario: Approve a paused workflow
- **WHEN** `ion approve <run-id>` is run
- **THEN** the workflow SHALL resume from the paused approval node
#### Scenario: Approve with comment
- **WHEN** `ion approve <run-id> "looks good"` is run
- **THEN** the comment SHALL be recorded and available as `$nodeId.output`
#### Scenario: Reject with reason
- **WHEN** `ion reject <run-id> "needs changes"` is run
- **THEN** `$REJECTION_REASON` SHALL be set to "needs changes"
- **THEN** if `on_reject` is configured, the handler SHALL execute
### Requirement: Workflow run management
The CLI SHALL provide `status`, `runs`, `resume`, `abandon`, and `cleanup` commands.
#### Scenario: Show running workflows
- **WHEN** `ion status` is run
- **THEN** all active (running + paused) workflow runs SHALL be displayed
#### Scenario: List recent runs
- **WHEN** `ion runs` is run
- **THEN** recent workflow runs SHALL be listed with status and timestamps
#### Scenario: Resume failed run
- **WHEN** `ion resume <run-id>` is run
- **THEN** the failed run SHALL be resumed, skipping completed nodes
#### Scenario: Abandon run
- **WHEN** `ion abandon <run-id>` is run
- **THEN** the run SHALL be marked as cancelled
#### Scenario: Cleanup old runs
- **WHEN** `ion cleanup` is run with the default retention period (7 days)
- **THEN** runs older than the retention period SHALL have their artifacts removed
### Requirement: SOP-to-YAML conversion
The CLI SHALL provide a `convert` command to transpile `.sop.md` files to `.yaml`.
#### Scenario: Convert SOP to YAML
- **WHEN** `ion convert workflow.sop.md` is run
- **THEN** a `workflow.yaml` SHALL be written with the equivalent DAG representation
### Requirement: Machine-readable output
Workflow commands SHALL support a `--json` flag for machine-readable output.
#### Scenario: JSON output for automation
- **WHEN** `ion list --json` is run
- **THEN** output SHALL be a valid JSON array of workflow objects


@@ -0,0 +1,54 @@
## ADDED Requirements
### Requirement: DAG topological execution
The engine SHALL execute workflow nodes in topological order determined by `depends_on` edges using Kahn's algorithm.
#### Scenario: Independent nodes run concurrently
- **WHEN** a workflow has nodes A and B with no `depends_on`
- **THEN** both A and B SHALL execute in the same topological layer, concurrently
#### Scenario: Dependent nodes run sequentially
- **WHEN** node B lists `depends_on: [A]`
- **THEN** A SHALL complete before B begins
#### Scenario: Cycle detection
- **WHEN** nodes form a cycle (A → B → C → A)
- **THEN** the loader SHALL reject the workflow with a cycle detection error
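
The layering and cycle-detection scenarios above can be sketched as follows. This is a minimal illustration of Kahn's algorithm, not the engine's actual code: the `LayerNode` shape is a hypothetical reduction of the real node type, and only the function name `buildTopologicalLayers` is taken from the task list.

```typescript
// Hypothetical minimal node shape; the real DagNode type carries more fields.
interface LayerNode {
  id: string;
  depends_on?: string[];
}

// Groups node ids into topological layers using Kahn's algorithm.
// Nodes in the same layer do not depend on each other and may run concurrently.
function buildTopologicalLayers(nodes: LayerNode[]): string[][] {
  const inDegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const node of nodes) {
    inDegree.set(node.id, 0);
    dependents.set(node.id, []);
  }
  for (const node of nodes) {
    for (const dep of node.depends_on ?? []) {
      const children = dependents.get(dep);
      if (!children) {
        throw new Error(`Node "${node.id}" depends on unknown node "${dep}"`);
      }
      inDegree.set(node.id, (inDegree.get(node.id) ?? 0) + 1);
      children.push(node.id);
    }
  }

  const layers: string[][] = [];
  let current = nodes.filter((n) => inDegree.get(n.id) === 0).map((n) => n.id);
  let placed = 0;
  while (current.length > 0) {
    layers.push(current);
    placed += current.length;
    const next: string[] = [];
    for (const id of current) {
      for (const child of dependents.get(id) ?? []) {
        const remaining = (inDegree.get(child) ?? 0) - 1;
        inDegree.set(child, remaining);
        if (remaining === 0) next.push(child);
      }
    }
    current = next;
  }
  // Any node never reaching in-degree 0 is part of a cycle.
  if (placed !== nodes.length) {
    throw new Error("Cycle detected in workflow DAG");
  }
  return layers;
}
```

Each inner array is one layer; an executor could run a layer with `Promise.allSettled` before moving to the next.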
### Requirement: Trigger rules
The engine SHALL support 4 trigger rules for join semantics.
#### Scenario: all_success (default)
- **WHEN** a node has multiple upstream dependencies and no explicit `trigger_rule`
- **THEN** it SHALL only run if ALL upstream nodes completed successfully
- **THEN** it SHALL be skipped if any upstream node failed
#### Scenario: one_success
- **WHEN** a node sets `trigger_rule: one_success`
- **THEN** it SHALL run if at least one upstream node completed successfully
#### Scenario: all_done
- **WHEN** a node sets `trigger_rule: all_done`
- **THEN** it SHALL run when all upstream nodes have finished (any status), regardless of success/failure
#### Scenario: none_failed_min_one_success
- **WHEN** a node sets `trigger_rule: none_failed_min_one_success`
- **THEN** it SHALL run only if no upstream node failed AND at least one succeeded
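
A sketch of how the four trigger rules might be evaluated once every upstream node has finished. The `UpstreamStatus` values and the treatment of skipped upstream nodes (they count as neither success nor failure) are assumptions; the spec above does not define them. The name `checkTriggerRule` comes from the task list.

```typescript
type TriggerRule =
  | "all_success"
  | "one_success"
  | "all_done"
  | "none_failed_min_one_success";

// Assumed terminal statuses of upstream nodes.
type UpstreamStatus = "completed" | "failed" | "skipped";

// Returns true if the node should run, false if it should be skipped.
// The caller is expected to invoke this only after all upstream nodes have finished.
function checkTriggerRule(
  rule: TriggerRule | undefined,
  upstream: UpstreamStatus[],
): boolean {
  if (upstream.length === 0) return true; // root nodes have nothing to join on
  const succeeded = upstream.filter((s) => s === "completed").length;
  const failed = upstream.filter((s) => s === "failed").length;
  switch (rule ?? "all_success") {
    case "all_success":
      return succeeded === upstream.length;
    case "one_success":
      return succeeded >= 1;
    case "all_done":
      return true; // any terminal status is acceptable
    case "none_failed_min_one_success":
      return failed === 0 && succeeded >= 1;
  }
}
```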
### Requirement: when conditions
Nodes SHALL support a `when:` string that evaluates to a boolean condition.
#### Scenario: when condition prevents execution
- **WHEN** a node has `when: "false"` or any expression that evaluates falsy
- **THEN** the node SHALL be skipped as if its trigger_rule prevented execution
### Requirement: Node retry with configurable policy
Nodes SHALL support a `retry` config with `max_attempts`, `delay_ms`, and `on_error` (transient|all).
#### Scenario: retry on transient error
- **WHEN** a node with `retry: { max_attempts: 3 }` fails with a transient error
- **THEN** it SHALL retry up to 3 times with configured delay between attempts
#### Scenario: retry exhausted
- **WHEN** all retry attempts fail
- **THEN** the node SHALL be marked as failed and trigger_rule evaluation SHALL proceed for downstream nodes
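
One possible shape for the retry policy is sketched below. Two points are assumptions rather than spec: `max_attempts` is treated as the total number of attempts including the first, and `isTransient` is a hypothetical classifier whose error patterns are purely illustrative.

```typescript
interface RetryConfig {
  max_attempts: number;
  delay_ms?: number;
  on_error?: "transient" | "all";
}

// Hypothetical classifier; which errors count as transient is an assumption.
function isTransient(err: unknown): boolean {
  return err instanceof Error && /timeout|ECONNRESET|rate limit/i.test(err.message);
}

// Runs `task`, retrying according to `retry`. Rethrows the last error when
// attempts are exhausted or the error is not eligible for retry.
async function runWithRetry<T>(task: () => Promise<T>, retry: RetryConfig): Promise<T> {
  const scope = retry.on_error ?? "transient";
  const attempts = Math.max(1, retry.max_attempts);
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      const retryable = scope === "all" || isTransient(err);
      if (!retryable || attempt === attempts) break;
      await new Promise((resolve) => setTimeout(resolve, retry.delay_ms ?? 0));
    }
  }
  throw lastError;
}
```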


@@ -0,0 +1,59 @@
## ADDED Requirements
### Requirement: Append-only event log
Workflow runs SHALL produce append-only event records. Events SHALL NOT be modified after creation.
#### Scenario: Events are chronological
- **WHEN** a workflow executes
- **THEN** events SHALL be stored with monotonically increasing timestamps or sequence numbers
- **THEN** event order SHALL match execution order
#### Scenario: Events are immutable
- **WHEN** an event has been persisted
- **THEN** it SHALL NOT be updated or deleted
### Requirement: Event types
The event log SHALL support exactly 8 event types: `workflow_started`, `workflow_completed`, `workflow_failed`, `workflow_cancelled`, `node_started`, `node_completed`, `node_failed`, `node_skipped`.
#### Scenario: Workflow lifecycle events
- **WHEN** a workflow run begins
- **THEN** a `workflow_started` event SHALL be recorded
- **WHEN** a workflow run completes successfully
- **THEN** a `workflow_completed` event SHALL be recorded
- **WHEN** a workflow run fails
- **THEN** a `workflow_failed` event SHALL be recorded
#### Scenario: Node lifecycle events
- **WHEN** a node begins execution
- **THEN** a `node_started` event SHALL be recorded
- **WHEN** a node completes successfully
- **THEN** a `node_completed` event SHALL record the node's output
- **WHEN** a node fails
- **THEN** a `node_failed` event SHALL record the error
- **WHEN** a node is skipped (trigger_rule not met)
- **THEN** a `node_skipped` event SHALL be recorded
### Requirement: Deterministic replay for crash recovery
When a workflow run is resumed after an interruption, the engine SHALL load completed node outputs from the event log and skip re-execution of completed nodes.
#### Scenario: Resume skips completed nodes
- **WHEN** a workflow run is resumed after a crash
- **THEN** all nodes with a `node_completed` event SHALL be skipped
- **THEN** execution SHALL begin from the first node without a completed event
#### Scenario: Resume after partial execution
- **WHEN** a workflow had 5 nodes and the first 3 completed before the crash
- **THEN** nodes 1-3 SHALL be skipped (outputs loaded from event log)
- **THEN** node 4 SHALL be re-executed
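
The replay rule can be illustrated by folding the event log into a map of completed outputs and filtering the execution order against it. The event field names (`nodeId`, `output`, `error`) are assumptions for this sketch; the spec fixes only the event type names.

```typescript
// Assumed event payload shapes; only the `type` values come from the spec.
type NodeEvent =
  | { type: "node_started"; nodeId: string }
  | { type: "node_completed"; nodeId: string; output: string }
  | { type: "node_failed"; nodeId: string; error: string }
  | { type: "node_skipped"; nodeId: string };

// Collects the output of every node that has a node_completed event.
function completedOutputsFromLog(events: NodeEvent[]): Map<string, string> {
  const outputs = new Map<string, string>();
  for (const event of events) {
    if (event.type === "node_completed") {
      outputs.set(event.nodeId, event.output);
    }
  }
  return outputs;
}

// Returns the nodes that still need to run, preserving execution order.
function nodesToRun(order: string[], events: NodeEvent[]): string[] {
  const done = completedOutputsFromLog(events);
  return order.filter((id) => !done.has(id));
}
```

A node that was started but never completed (for example, interrupted by the crash) has no `node_completed` event, so it is included in the list to run.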
### Requirement: Event storage via pluggable backend
Events SHALL be persisted through the `IWorkflowStore` interface, with at least a filesystem backend.
#### Scenario: Filesystem event store
- **WHEN** using the filesystem backend
- **THEN** each run SHALL have a JSON Lines file at `{runId}/events.jsonl`
- **THEN** events SHALL be appended as newline-delimited JSON
#### Scenario: SQLite event store
- **WHEN** using the SQLite backend
- **THEN** events SHALL be stored in a `workflow_events` table with columns for run_id, sequence, event_type, timestamp, and payload
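
A minimal sketch of the filesystem layout, assuming Node-compatible `fs` APIs. The helper names `appendEvent` and `readEvents` are hypothetical, and the record fields mirror the SQLite column list above. Deriving the sequence by re-reading the file is for illustration only; it is not safe under concurrent writers and a real store would need locking.

```typescript
import { appendFileSync, existsSync, mkdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

interface StoredEvent {
  sequence: number;
  event_type: string;
  timestamp: string;
  payload: unknown;
}

// Reads all events for a run from {rootDir}/{runId}/events.jsonl.
function readEvents(rootDir: string, runId: string): StoredEvent[] {
  const file = join(rootDir, runId, "events.jsonl");
  if (!existsSync(file)) return [];
  return readFileSync(file, "utf8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as StoredEvent);
}

// Appends one event as a single JSON line; existing lines are never rewritten.
function appendEvent(
  rootDir: string,
  runId: string,
  eventType: string,
  payload: unknown,
): StoredEvent {
  const dir = join(rootDir, runId);
  mkdirSync(dir, { recursive: true });
  const event: StoredEvent = {
    sequence: readEvents(rootDir, runId).length + 1,
    event_type: eventType,
    timestamp: new Date().toISOString(),
    payload,
  };
  appendFileSync(join(dir, "events.jsonl"), JSON.stringify(event) + "\n");
  return event;
}
```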


@@ -0,0 +1,37 @@
## ADDED Requirements
### Requirement: Approval gate pauses execution
An ApprovalNode SHALL pause workflow execution and send a message for human review. Execution SHALL only continue when the user approves or rejects.
#### Scenario: Approval pauses workflow
- **WHEN** an approval node executes
- **THEN** the workflow status SHALL transition to `paused`
- **THEN** a message SHALL be sent with the approval message text
#### Scenario: Approve resumes execution
- **WHEN** the user approves a paused workflow
- **THEN** the workflow SHALL resume with the next node in the DAG
#### Scenario: Reject fails the node
- **WHEN** the user rejects a paused workflow
- **THEN** the node SHALL be marked as failed
- **THEN** downstream nodes SHALL evaluate their trigger rules
### Requirement: Capture response from approval
An approval node MAY support `capture_response: true` to store the user's comment as `$nodeId.output`.
#### Scenario: Approval with captured response
- **WHEN** an approval node has `capture_response: true` and the user provides a comment during approval
- **THEN** the comment SHALL be stored as the node's output, available via `$nodeId.output`
### Requirement: On-reject retry
An approval node MAY specify `on_reject` with a `prompt` and optional `max_attempts` for re-presenting after rejection.
#### Scenario: Reject with retry prompt
- **WHEN** an approval node has `on_reject: { prompt: "..." }` and the user rejects
- **THEN** the on_reject prompt SHALL be executed (typically the AI revises based on feedback)
- **THEN** the approval gate SHALL be re-presented to the user
#### Scenario: Max attempts exceeded
- **WHEN** the number of rejections exceeds `on_reject.max_attempts`
- **THEN** the node SHALL fail permanently
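
The approve/reject scenarios above could be modelled as a pure decision function. Everything here is a sketch: `handleApprovalDecision`, `ApprovalOutcome`, and the `rejections` counter are hypothetical names, and an absent `max_attempts` is assumed to mean "no limit", which the spec does not state.

```typescript
interface OnReject {
  prompt: string;
  max_attempts?: number;
}

interface ApprovalConfig {
  capture_response?: boolean;
  on_reject?: OnReject;
}

type ApprovalOutcome =
  | { action: "resume"; output?: string } // continue with the next node
  | { action: "revise"; prompt: string } // run on_reject prompt, then re-present the gate
  | { action: "fail"; reason: string }; // mark the approval node as failed

function handleApprovalDecision(
  decision: { approved: boolean; comment?: string },
  config: ApprovalConfig,
  state: { rejections: number },
): ApprovalOutcome {
  if (decision.approved) {
    return config.capture_response
      ? { action: "resume", output: decision.comment ?? "" }
      : { action: "resume" };
  }
  state.rejections += 1;
  const reason = decision.comment ?? "";
  if (!config.on_reject) return { action: "fail", reason };
  // Assumption: no max_attempts means the gate can be re-presented indefinitely.
  const max = config.on_reject.max_attempts ?? Infinity;
  if (state.rejections > max) return { action: "fail", reason };
  const prompt = config.on_reject.prompt.split("$REJECTION_REASON").join(reason);
  return { action: "revise", prompt };
}
```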


@@ -0,0 +1,44 @@
## ADDED Requirements
### Requirement: Programmatic execution
The engine SHALL export an `executeWorkflow()` function that accepts a workflow definition, store, and options.
#### Scenario: Execute workflow from code
- **WHEN** a host application calls `executeWorkflow(workflowDef, store, { userMessage: "..." })`
- **THEN** the workflow SHALL execute and return a `WorkflowExecutionResult`
### Requirement: Workflow parsing
The engine SHALL export `parseWorkflow(yaml: string): WorkflowDefinition` and `parseWorkflowFile(path: string): WorkflowDefinition` functions.
#### Scenario: Parse YAML string
- **WHEN** a host application calls `parseWorkflow(yamlString)`
- **THEN** it SHALL return a validated `WorkflowDefinition`
#### Scenario: Parse YAML file
- **WHEN** a host application calls `parseWorkflowFile("./workflows/my-workflow.yaml")`
- **THEN** it SHALL read and parse the file, returning a validated `WorkflowDefinition`
### Requirement: Workflow discovery
The engine SHALL export `discoverWorkflows(cwd: string): WorkflowLoadResult` for finding workflows in the filesystem.
#### Scenario: Discover workflows
- **WHEN** a host application calls `discoverWorkflows(cwd)`
- **THEN** it SHALL return all discovered workflows from the project's `.archon/workflows/` directory
### Requirement: Store constructors
The engine SHALL export store constructors for each backend: `createFsStore(path)`, `createSqliteStore(path)`, `createPostgresStore(connectionString)`.
#### Scenario: Create filesystem store
- **WHEN** a host application calls `createFsStore("./data")`
- **THEN** it SHALL return an initialized `IWorkflowStore` using the filesystem backend
#### Scenario: Create SQLite store
- **WHEN** a host application calls `createSqliteStore("./ion.db")`
- **THEN** it SHALL return an initialized `IWorkflowStore` using SQLite
### Requirement: TypeScript types
All public APIs SHALL export full TypeScript type definitions.
#### Scenario: Types available
- **WHEN** a host application imports from the package
- **THEN** `WorkflowDefinition`, `DagNode`, `NodeOutput`, `WorkflowRun`, `WorkflowExecutionResult`, `IWorkflowStore` types SHALL all be exported


@@ -0,0 +1,34 @@
## ADDED Requirements
### Requirement: SOP markdown as secondary format
Workflows MAY be defined as `.sop.md` files in addition to YAML. The engine SHALL detect `.sop.md` files during discovery and transpile them to the DAG node representation.
#### Scenario: SOP file discovered alongside YAML
- **WHEN** a `.sop.md` file exists in the workflows directory alongside `.yaml` workflow files
- **THEN** both SHALL be discovered and listed as available workflows
#### Scenario: SOP transpiled to prompt nodes
- **WHEN** a `.sop.md` file is loaded
- **THEN** each `## Steps` section item SHALL become a `prompt:` node
- **THEN** `## Parameters` SHALL be extracted as node metadata
### Requirement: RFC 2119 constraint extraction
The transpiler SHALL extract RFC 2119 constraints from `**Constraints:**` blocks and embed them in the prompt text of the corresponding node.
#### Scenario: Constraints included in prompt
- **WHEN** a step has `**Constraints:** - You MUST do X`
- **THEN** the constraint text SHALL be appended to the node's prompt
### Requirement: Overview as workflow description
The `## Overview` section of a `.sop.md` file SHALL become the workflow's `description` field.
#### Scenario: Overview maps to description
- **WHEN** a `.sop.md` has `## Overview\nThis SOP does X`
- **THEN** the resulting workflow SHALL have `description: "This SOP does X"`
### Requirement: Parameter acquisition constraints
The transpiler SHALL validate that all required parameters from `## Parameters` are present before execution, using the constraint pattern from the SOP.
#### Scenario: Missing required parameter
- **WHEN** a required parameter has no value provided
- **THEN** the workflow SHALL prompt the user for the missing parameter before executing
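
A rough sketch of the SOP-to-DAG mapping follows. The spec does not pin down how step items are delimited, so this sketch assumes each step is a `### ` subheading under `## Steps`, that steps run sequentially, and that node ids are generated as `step-N`. `transpileSop` is a hypothetical name, and parameter extraction is omitted.

```typescript
interface SopNode {
  id: string;
  prompt: string;
  depends_on: string[];
}

interface SopWorkflow {
  description: string;
  nodes: SopNode[];
}

function transpileSop(markdown: string): SopWorkflow {
  // Split the document into `## ` sections keyed by heading text.
  const sections = new Map<string, string[]>();
  let current: string[] | undefined;
  for (const line of markdown.split("\n")) {
    const heading = /^## (.+)$/.exec(line);
    if (heading) {
      current = [];
      sections.set((heading[1] ?? "").trim(), current);
    } else if (current) {
      current.push(line);
    }
  }

  // Assumption: each `### ` subheading under Steps starts a new prompt node.
  // Body lines, including any **Constraints:** block, are appended to its prompt.
  const nodes: SopNode[] = [];
  for (const line of sections.get("Steps") ?? []) {
    const step = /^### (.+)$/.exec(line);
    const last = nodes[nodes.length - 1];
    if (step) {
      nodes.push({
        id: `step-${nodes.length + 1}`,
        prompt: (step[1] ?? "").trim(),
        depends_on: last ? [last.id] : [],
      });
    } else if (last && line.trim() !== "") {
      last.prompt += "\n" + line.trim();
    }
  }

  const description = (sections.get("Overview") ?? []).join("\n").trim();
  return { description, nodes };
}
```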


@@ -0,0 +1,44 @@
## ADDED Requirements
### Requirement: Inline script execution
Script nodes SHALL execute inline TypeScript (runtime: `bun`) or Python (runtime: `uv`) code and capture stdout as the node output.
#### Scenario: Bun inline execution
- **WHEN** a script node has `runtime: bun` and `script: console.log("hello")`
- **THEN** the executor SHALL run the script via `bun -e`
- **THEN** stdout SHALL be captured as `$nodeId.output`
#### Scenario: Python inline execution
- **WHEN** a script node has `runtime: uv` and `script: print("hello")`
- **THEN** the executor SHALL run the script via `uv run python -c`
- **THEN** stdout SHALL be captured as `$nodeId.output`
### Requirement: Dependency installation
Script nodes SHALL support a `deps:` array that installs dependencies before execution.
#### Scenario: Bun with npm deps
- **WHEN** a script node has `runtime: bun` and `deps: ["lodash", "zod"]`
- **THEN** the executor SHALL run `bun install lodash zod` before executing
#### Scenario: Python with pip deps
- **WHEN** a script node has `runtime: uv` and `deps: ["requests", "click"]`
- **THEN** the executor SHALL run `uv pip install requests click` before executing
### Requirement: Named script files
Script nodes MAY reference named scripts from a `.archon/scripts/` directory by name instead of inline code.
#### Scenario: Named script discovery
- **WHEN** a script node has `script: analyze` and `scripts/analyze.ts` exists
- **THEN** the executor SHALL load and execute the file
#### Scenario: Runtime inferred from extension
- **WHEN** a script has `runtime: bun` and the named file has a `.ts` extension
- **THEN** the executor SHALL run it via `bun run`
### Requirement: Script timeout
Script nodes SHALL support a `timeout:` field in milliseconds. If execution exceeds the timeout, the process SHALL be killed and the node SHALL fail.
#### Scenario: Timeout exceeded
- **WHEN** a script node sets `timeout: 5000` and the script runs for 10 seconds
- **THEN** the process SHALL be killed after 5 seconds
- **THEN** the node SHALL be marked as failed with a timeout error
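
The command lines named in the scenarios above can be assembled as argument vectors, which avoids shell quoting of inline code. `buildScriptCommands` is a hypothetical helper, and the `uv run <file>` form for named Python scripts is an assumption, since the spec only fixes the inline Python form. Timeout enforcement would live in whatever spawns these commands and is not shown.

```typescript
type Runtime = "bun" | "uv";

interface ScriptNode {
  runtime: Runtime;
  script: string;
  deps?: string[];
}

type Argv = string[];

// Returns the commands to run, in order: optional dependency install, then the script.
// `namedFile` is the resolved path of a named script, or undefined for inline code.
function buildScriptCommands(node: ScriptNode, namedFile?: string): Argv[] {
  const commands: Argv[] = [];
  if (node.deps && node.deps.length > 0) {
    commands.push(
      node.runtime === "bun"
        ? ["bun", "install", ...node.deps]
        : ["uv", "pip", "install", ...node.deps],
    );
  }
  if (node.runtime === "bun") {
    commands.push(namedFile ? ["bun", "run", namedFile] : ["bun", "-e", node.script]);
  } else {
    // Assumption: named Python files run via `uv run <file>`.
    commands.push(
      namedFile ? ["uv", "run", namedFile] : ["uv", "run", "python", "-c", node.script],
    );
  }
  return commands;
}
```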


@@ -0,0 +1,48 @@
## ADDED Requirements
### Requirement: IWorkflowStore interface
All storage backends SHALL implement the `IWorkflowStore` interface providing run lifecycle, event persistence, and node output retrieval.
#### Scenario: Store provides run CRUD
- **WHEN** a workflow run is created
- **THEN** `createWorkflowRun()` SHALL persist the run and return it
- **WHEN** a workflow run status is updated
- **THEN** `updateWorkflowRun()` SHALL persist the status change
#### Scenario: Store provides event persistence
- **WHEN** a workflow event is created
- **THEN** `createWorkflowEvent()` SHALL append it to the event log
#### Scenario: Store provides completed node outputs
- **WHEN** a workflow is resumed
- **THEN** `getCompletedDagNodeOutputs()` SHALL return all completed node outputs keyed by node ID
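
To make the interface scenarios concrete, here is an in-memory sketch. The method names come from this spec and the task list; the parameter lists, the Promise-based signatures, and the payload fields `nodeId` and `output` are assumptions. `InMemoryWorkflowStore` is a hypothetical test double, not one of the planned backends.

```typescript
type RunStatus = "pending" | "running" | "paused" | "completed" | "failed" | "cancelled";

interface WorkflowRun {
  id: string;
  workflow_name: string;
  status: RunStatus;
  user_message: string;
}

interface WorkflowEvent {
  run_id: string;
  sequence: number;
  event_type: string;
  payload: Record<string, unknown>;
}

// Assumed signatures; the spec names the methods but not their parameters.
interface IWorkflowStore {
  createWorkflowRun(run: WorkflowRun): Promise<WorkflowRun>;
  getWorkflowRun(id: string): Promise<WorkflowRun | undefined>;
  updateWorkflowRun(id: string, patch: Partial<Omit<WorkflowRun, "id">>): Promise<void>;
  createWorkflowEvent(runId: string, eventType: string, payload: Record<string, unknown>): Promise<void>;
  getCompletedDagNodeOutputs(runId: string): Promise<Map<string, string>>;
}

class InMemoryWorkflowStore implements IWorkflowStore {
  private runs = new Map<string, WorkflowRun>();
  private events: WorkflowEvent[] = [];

  async createWorkflowRun(run: WorkflowRun): Promise<WorkflowRun> {
    this.runs.set(run.id, { ...run });
    return run;
  }

  async getWorkflowRun(id: string): Promise<WorkflowRun | undefined> {
    return this.runs.get(id);
  }

  async updateWorkflowRun(id: string, patch: Partial<Omit<WorkflowRun, "id">>): Promise<void> {
    const run = this.runs.get(id);
    if (!run) throw new Error(`Unknown run "${id}"`);
    this.runs.set(id, { ...run, ...patch });
  }

  async createWorkflowEvent(runId: string, eventType: string, payload: Record<string, unknown>): Promise<void> {
    // Append-only: events are pushed, never edited or removed.
    const sequence = this.events.filter((e) => e.run_id === runId).length + 1;
    this.events.push({ run_id: runId, sequence, event_type: eventType, payload });
  }

  async getCompletedDagNodeOutputs(runId: string): Promise<Map<string, string>> {
    const outputs = new Map<string, string>();
    for (const e of this.events) {
      if (e.run_id !== runId || e.event_type !== "node_completed") continue;
      const { nodeId, output } = e.payload;
      if (typeof nodeId === "string" && typeof output === "string") outputs.set(nodeId, output);
    }
    return outputs;
  }
}
```

Because the engine would depend only on `IWorkflowStore`, swapping this class for a filesystem, SQLite, or Postgres implementation should not require engine changes.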
### Requirement: Filesystem backend
The filesystem backend SHALL store each workflow run as files in a directory: `{artifactsDir}/{runId}/`.
#### Scenario: Filesystem stores events as JSONL
- **WHEN** events are created using the filesystem backend
- **THEN** each run SHALL have `events.jsonl` with newline-delimited JSON
- **THEN** node outputs SHALL be stored as individual JSON files
#### Scenario: Filesystem stores run metadata
- **WHEN** a run is created using the filesystem backend
- **THEN** `run.json` SHALL contain the run metadata
### Requirement: SQLite backend
The SQLite backend SHALL store workflow data in a SQLite database with tables for runs, events, and node outputs.
#### Scenario: SQLite stores runs table
- **WHEN** using the SQLite backend
- **THEN** a `workflow_runs` table SHALL exist with columns for id, workflow_name, status, user_message, created_at, updated_at
#### Scenario: SQLite stores events table
- **WHEN** using the SQLite backend
- **THEN** a `workflow_events` table SHALL exist with columns for run_id, sequence, event_type, timestamp, payload
### Requirement: Postgres backend
The Postgres backend SHALL use a PostgreSQL database with the same schema as SQLite, accessed via the `IWorkflowStore` interface.
#### Scenario: Postgres uses same interface
- **WHEN** switching from SQLite to Postgres
- **THEN** no workflow engine code SHALL change — only the store implementation


@@ -0,0 +1,57 @@
## ADDED Requirements
### Requirement: Node output references
Prompts and commands SHALL support `$nodeId.output` to reference the output text of an upstream node, and `$nodeId.output.field` to reference a specific field from a structured output.
#### Scenario: Output reference substitution
- **WHEN** a prompt contains `$analysis.output`
- **THEN** it SHALL be replaced with the full output text of the node with id `analysis`
#### Scenario: Field reference with structured output
- **WHEN** a prompt contains `$analysis.output.summary` and the upstream node declared `output_format: { type: "object", properties: { summary: ... } }`
- **THEN** it SHALL be replaced with the value of the `summary` field from the parsed JSON output
#### Scenario: Missing node reference
- **WHEN** a prompt references `$nonexistent.output`
- **THEN** the reference SHALL resolve to an empty string with a warning
#### Scenario: Missing field on schemaless node
- **WHEN** a prompt references `$node.output.field` and the upstream node has no `output_format` and its output is not valid JSON
- **THEN** the consuming node SHALL fail with an error
#### Scenario: Strict field access for declared schemas
- **WHEN** a prompt references `$node.output.field` and the upstream node's `output_format` declares properties but `field` is not among them
- **THEN** the consuming node SHALL fail with a field-not-found error
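
The five reference scenarios above can be sketched in one substitution function. `UpstreamOutput` and its `declaredFields` list are hypothetical stand-ins for an upstream node's output and its `output_format` properties. One case is not covered by the spec and is an assumption here: a schemaless node whose output is valid JSON but lacks the requested field resolves to an empty string.

```typescript
interface UpstreamOutput {
  output: string;
  // Property names declared in the upstream node's output_format, if any.
  declaredFields?: string[];
}

function substituteOutputRefs(
  text: string,
  outputs: Map<string, UpstreamOutput>,
  warn: (message: string) => void = console.warn,
): string {
  const pattern = /\$([A-Za-z_][\w-]*)\.output(?:\.([A-Za-z_]\w*))?/g;
  return text.replace(pattern, (_match: string, nodeId: string, field?: string) => {
    const upstream = outputs.get(nodeId);
    if (!upstream) {
      warn(`Reference to unknown node "${nodeId}" resolved to an empty string`);
      return "";
    }
    if (field === undefined) return upstream.output;
    // Strict access: a declared schema must list the requested field.
    if (upstream.declaredFields && !upstream.declaredFields.includes(field)) {
      throw new Error(`Field "${field}" not found in output_format of node "${nodeId}"`);
    }
    let parsed: unknown;
    try {
      parsed = JSON.parse(upstream.output);
    } catch {
      throw new Error(`Output of node "${nodeId}" is not valid JSON; cannot read "${field}"`);
    }
    if (typeof parsed !== "object" || parsed === null) {
      throw new Error(`Output of node "${nodeId}" is not a JSON object; cannot read "${field}"`);
    }
    const value = (parsed as Record<string, unknown>)[field];
    if (value === undefined) return ""; // assumption, see lead-in
    return typeof value === "string" ? value : JSON.stringify(value);
  });
}
```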
### Requirement: Built-in variables
The engine SHALL support `$ARGUMENTS`, `$ARTIFACTS_DIR`, `$WORKFLOW_ID`, `$BASE_BRANCH`, `$DOCS_DIR`.
#### Scenario: $ARGUMENTS substitution
- **WHEN** a prompt contains `$ARGUMENTS`
- **THEN** it SHALL be replaced with the full user message/arguments string
#### Scenario: $ARTIFACTS_DIR substitution
- **WHEN** a prompt contains `$ARTIFACTS_DIR`
- **THEN** it SHALL be replaced with the path to the run's artifact directory
#### Scenario: $WORKFLOW_ID substitution
- **WHEN** a prompt contains `$WORKFLOW_ID`
- **THEN** it SHALL be replaced with the workflow run ID
### Requirement: Loop-specific variables
Loop nodes SHALL support `$LOOP_USER_INPUT` (the user input supplied on approval at interactive gates) and `$LOOP_PREV_OUTPUT` (the output of the previous iteration).
#### Scenario: $LOOP_PREV_OUTPUT on first iteration
- **WHEN** a loop node is on its first iteration
- **THEN** `$LOOP_PREV_OUTPUT` SHALL resolve to an empty string
#### Scenario: $LOOP_PREV_OUTPUT on subsequent iterations
- **WHEN** a loop node is on iteration 2+
- **THEN** `$LOOP_PREV_OUTPUT` SHALL contain the cleaned output of the previous iteration
### Requirement: Approval-specific variables
Approval nodes SHALL support `$REJECTION_REASON`.
#### Scenario: $REJECTION_REASON in on_reject prompt
- **WHEN** an approval node is rejected with a reason
- **THEN** `$REJECTION_REASON` SHALL contain the reviewer's feedback text
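
Built-in, loop, and approval variables are all flat `$NAME` tokens, so one substitution pass can handle them. A sketch, assuming the caller supplies a map of currently defined variables; `substituteBuiltins` is a hypothetical name, and leaving unknown variables untouched is an assumption, not a spec requirement.

```typescript
// Replaces $NAME tokens with values from `vars`.
// Matching the whole identifier prevents one variable name from
// being substituted inside a longer one.
function substituteBuiltins(text: string, vars: Record<string, string>): string {
  return text.replace(/\$([A-Z][A-Z0-9_]*)/g, (match: string, name: string) => {
    if (!Object.prototype.hasOwnProperty.call(vars, name)) return match;
    return vars[name] ?? match;
  });
}
```

On the first loop iteration the caller would pass `LOOP_PREV_OUTPUT: ""`, which yields the empty string the scenario above requires.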


@@ -0,0 +1,51 @@
## ADDED Requirements
### Requirement: Run states
A workflow run SHALL transition through states: `pending → running → completed | failed | cancelled`. It MAY transition to `paused` for approval gates.
#### Scenario: Normal completion
- **WHEN** all DAG nodes complete successfully
- **THEN** the run status SHALL be `completed`
#### Scenario: Node failure
- **WHEN** a node fails and no retry succeeds
- **THEN** the run status SHALL be `failed`
#### Scenario: User cancellation
- **WHEN** a user cancels a running workflow
- **THEN** the run status SHALL be `cancelled`
#### Scenario: Approval pause
- **WHEN** an approval node is reached
- **THEN** the run status SHALL transition to `paused`
- **THEN** it SHALL transition back to `running` on approval
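
The run states can be expressed as a transition table. The edges `pending → running`, `running → paused/completed/failed/cancelled`, and `paused → running` follow the requirement above, and `failed → running` reflects the resume requirement. The remaining edges (`pending → cancelled`, `paused → failed`, `paused → cancelled`) are assumptions made for this sketch.

```typescript
type RunStatus = "pending" | "running" | "paused" | "completed" | "failed" | "cancelled";

const ALLOWED_TRANSITIONS: Record<RunStatus, readonly RunStatus[]> = {
  pending: ["running", "cancelled"],
  running: ["paused", "completed", "failed", "cancelled"],
  paused: ["running", "failed", "cancelled"],
  completed: [],
  failed: ["running"], // resume of a failed run
  cancelled: [],
};

// Returns the new status, or throws if the transition is not allowed.
function transition(from: RunStatus, to: RunStatus): RunStatus {
  if (!ALLOWED_TRANSITIONS[from].includes(to)) {
    throw new Error(`Invalid run transition: ${from} -> ${to}`);
  }
  return to;
}
```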
### Requirement: Resume from failure
A failed workflow SHALL support resumption, skipping already-completed nodes using stored outputs from the event log.
#### Scenario: Resume skips completed nodes
- **WHEN** a failed workflow has 2 completed nodes out of 5
- **THEN** resuming SHALL skip nodes 1-2 and re-execute from node 3
#### Scenario: Resume with always_run
- **WHEN** a node has `always_run: true` and the workflow is resumed
- **THEN** the node SHALL re-execute even if it completed previously
### Requirement: Event-based observability
All lifecycle transitions SHALL emit typed events through the event emitter for observability and external subscribers.
#### Scenario: Events for every state transition
- **WHEN** a workflow starts
- **THEN** a `workflow_started` event SHALL be emitted
- **WHEN** a workflow completes
- **THEN** a `workflow_completed` event SHALL be emitted
- **WHEN** a node starts/completes/fails/skips
- **THEN** corresponding node events SHALL be emitted
### Requirement: Cleanup
The engine SHALL support cleaning up old workflow runs and their artifacts.
#### Scenario: Cleanup by age
- **WHEN** cleanup is invoked with a retention period (default 7 days)
- **THEN** runs older than the retention period SHALL have their artifacts removed
- **THEN** run records MAY be pruned from the store
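
Selecting runs for cleanup reduces to a cutoff comparison. A small sketch, assuming each run record carries an ISO 8601 `updated_at` timestamp (the column name appears in the SQLite schema; using it as the age reference is an assumption). `selectExpiredRuns` is a hypothetical helper.

```typescript
interface RunRecord {
  id: string;
  updated_at: string; // ISO 8601 timestamp
}

// Returns the ids of runs last updated before the retention cutoff.
function selectExpiredRuns(runs: RunRecord[], now: Date, retentionDays = 7): string[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return runs.filter((run) => Date.parse(run.updated_at) < cutoff).map((run) => run.id);
}
```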


@@ -0,0 +1,77 @@
## ADDED Requirements
### Requirement: Workflow definition structure
A workflow YAML SHALL have a top-level `name`, `description`, and `nodes:` array. It MAY have `provider`, `model`, `interactive`, `mutates_checkout`, `tags`.
#### Scenario: Minimal valid workflow
- **WHEN** a YAML file contains a `name`, `description`, and at least one node
- **THEN** the loader SHALL parse it as a valid workflow definition
#### Scenario: Missing name
- **WHEN** a YAML file lacks a `name` field
- **THEN** the loader SHALL reject it with a validation error
### Requirement: Seven node types
The engine SHALL support exactly 7 node types, mutually exclusive per node: `command`, `prompt`, `bash`, `script`, `loop`, `approval`, `cancel`.
#### Scenario: Node with exactly one mode field
- **WHEN** a node has `prompt:` but no other mode field
- **THEN** it SHALL be classified as a PromptNode
#### Scenario: Node with multiple mode fields
- **WHEN** a node has both `prompt:` and `bash:`
- **THEN** the loader SHALL reject it with a mutual-exclusivity error
#### Scenario: Node with no mode field
- **WHEN** a node has none of the 7 mode fields
- **THEN** the loader SHALL reject it
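
The mutual-exclusivity rule can be checked by counting which mode fields are present. The task list plans a Zod `superRefine` for this; the plain TypeScript below is only a sketch of the same check, and `classifyNode` is a hypothetical name.

```typescript
const MODE_FIELDS = ["command", "prompt", "bash", "script", "loop", "approval", "cancel"] as const;
type NodeType = (typeof MODE_FIELDS)[number];

// Returns the node's type, or throws if zero or several mode fields are set.
function classifyNode(node: Record<string, unknown>): NodeType {
  const present = MODE_FIELDS.filter((field) => node[field] !== undefined);
  const [first, ...rest] = present;
  if (first === undefined) {
    throw new Error(`Node must set one of: ${MODE_FIELDS.join(", ")}`);
  }
  if (rest.length > 0) {
    throw new Error(`Node sets mutually exclusive fields: ${present.join(", ")}`);
  }
  return first;
}
```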
### Requirement: Common node fields
All node types SHALL support `id`, `depends_on`, `when`, `trigger_rule`, `retry`, `timeout`, `output_type`, `always_run`.
#### Scenario: Node id must be unique
- **WHEN** two nodes in the same workflow share the same `id`
- **THEN** the loader SHALL reject the workflow
### Requirement: Prompt node
A PromptNode SHALL have a `prompt:` string field containing the AI prompt text.
#### Scenario: Empty prompt rejected
- **WHEN** a node has `prompt: ""`
- **THEN** the loader SHALL reject it
### Requirement: Bash node
A BashNode SHALL have a `bash:` string field and MAY have `timeout` (ms). AI-specific fields SHALL be ignored with a warning.
#### Scenario: Bash node with timeout
- **WHEN** a bash node includes `timeout: 30000`
- **THEN** the executor SHALL kill the subprocess after 30 seconds
### Requirement: Script node
A ScriptNode SHALL have `script:` (inline or named) and `runtime:` (`bun` or `uv`), and MAY have `deps:` and `timeout:`.
#### Scenario: Script with deps
- **WHEN** a script node has `runtime: bun` and `deps: ["lodash"]`
- **THEN** the executor SHALL install dependencies before running the script
#### Scenario: Named script from disk
- **WHEN** `script: analyze` and a file `scripts/analyze.ts` exists
- **THEN** the executor SHALL load and run it
### Requirement: Loop node
A LoopNode SHALL have `loop:` with `prompt`, `until`, `max_iterations`, and optional `fresh_context`, `interactive`, `gate_message`, `until_bash`.
#### Scenario: Loop with completion signal
- **WHEN** the AI response contains the `until` string
- **THEN** the loop SHALL stop and the node SHALL complete
#### Scenario: Loop exceeds max_iterations
- **WHEN** the loop reaches `max_iterations` without the completion signal
- **THEN** the node SHALL fail
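
The two loop scenarios can be sketched as a bounded iteration over an AI call. The `ask` callback is a hypothetical stand-in for the provider call, and `$LOOP_PREV_OUTPUT` is filled with the raw previous response; whatever "cleaning" the engine applies to it is not specified here and is not modelled.

```typescript
interface LoopConfig {
  prompt: string;
  until: string;
  max_iterations: number;
}

// Hypothetical provider call: takes a prompt, resolves to the AI response text.
type Ask = (prompt: string) => Promise<string>;

// Repeats the prompt until the response contains `until`,
// failing once `max_iterations` is reached without the signal.
async function runLoop(
  config: LoopConfig,
  ask: Ask,
): Promise<{ iterations: number; output: string }> {
  let previous = ""; // empty on the first iteration
  for (let i = 1; i <= config.max_iterations; i++) {
    const prompt = config.prompt.split("$LOOP_PREV_OUTPUT").join(previous);
    const response = await ask(prompt);
    if (response.includes(config.until)) {
      return { iterations: i, output: response };
    }
    previous = response;
  }
  throw new Error(
    `Loop reached max_iterations (${config.max_iterations}) without completion signal "${config.until}"`,
  );
}
```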
### Requirement: Cancel node
A CancelNode SHALL have a `cancel:` string containing a reason. It SHALL terminate the workflow run.
#### Scenario: Cancel terminates workflow
- **WHEN** a cancel node executes
- **THEN** the workflow SHALL be marked as cancelled with the cancel reason


@@ -0,0 +1,102 @@
## 1. Project Scaffold
- [ ] 1.1 Initialize package with `package.json`, `tsconfig.json`, module structure (`src/`, `src/cli/`, `src/engine/`, `src/store/`, `src/format/`)
- [ ] 1.2 Add core dependencies: `zod`, `js-yaml`, `nanoid`, `ulid`
- [ ] 1.3 Configure build (tsc or bun build), lint, format, and test scripts
- [ ] 1.4 Create public exports index (`src/index.ts`) with all type and function exports
## 2. Schema Layer — Workflow and Node Types
- [ ] 2.1 Implement `dag-node.ts`: Zod schema for all 7 node types with mutual-exclusivity superRefine, type guards, and AI-field warnings
- [ ] 2.2 Implement `workflow.ts`: WorkflowDefinition schema extending WorkflowBase with nodes array, WorkflowExecutionResult, WorkflowSource types
- [ ] 2.3 Implement `loop.ts`: LoopNodeConfig (prompt, until, max_iterations, fresh_context, interactive, gate_message, until_bash)
- [ ] 2.4 Implement `retry.ts`: Retry config (max_attempts, delay_ms, on_error)
- [ ] 2.5 Implement `workflow-run.ts`: WorkflowRun, WorkflowRunStatus, NodeState, NodeOutput, ApprovalContext schemas
## 3. YAML Format — Loader and Validation
- [ ] 3.1 Implement `loader.ts`: YAML parsing via js-yaml, per-node dagNodeSchema validation, DAG structure validation (unique IDs, depends_on refs, cycle detection via Kahn's)
- [ ] 3.2 Implement `command-validation.ts`: Command name format validation
- [ ] 3.3 Implement `model-validation.ts`: Provider/model resolution (optional — skip before AI provider integration)
- [ ] 3.4 Add workflow-level validation: required fields, provider identity, node ref integrity
## 4. DAG Engine — Core Execution
- [ ] 4.1 Implement `deps.ts`: WorkflowDeps injection interface, IWorkflowPlatform, WorkflowConfig types
- [ ] 4.2 Implement `dag-executor.ts`: Kahn's algorithm topological layering, `buildTopologicalLayers()`, `checkTriggerRule()` (4 trigger rules), Promise.allSettled concurrent layer execution
- [ ] 4.3 Implement node dispatch: execution handlers for PromptNode (AI), CommandNode (command loading), BashNode (subprocess), CancelNode (termination)
- [ ] 4.4 Implement `executor-shared.ts`: `substituteWorkflowVariables()`, `loadCommandPrompt()`, `classifyError()`, `safeSendMessage()`
- [ ] 4.5 Implement `output-ref.ts`: `$nodeId.output` and `$nodeId.output.field` resolution with strict field access
- [ ] 4.6 Implement `condition-evaluator.ts`: `when:` expression parser (==, !=, <, >, <=, >=, AND/OR, comparators with $nodeId.output)
- [ ] 4.7 Implement `event-emitter.ts`: Typed events (workflow_started/completed/failed/cancelled, node_started/completed/failed/skipped)
## 5. Event Sourcing — Persistence and Replay
- [ ] 5.1 Implement `store.ts`: IWorkflowStore interface (createWorkflowRun, getWorkflowRun, updateWorkflowRun, failWorkflowRun, createWorkflowEvent, getCompletedDagNodeOutputs, getActiveWorkflowRunByPath)
- [ ] 5.2 Implement `executor.ts`: Top-level workflow orchestrator — create run, path-lock guard, dispatch to dag-executor, handle resume with prior completed nodes, event emission
- [ ] 5.3 Implement event persistence: 8 event types stored chronologically, node outputs stored for resume
- [ ] 5.4 Implement resume: `hydrateResumableRun()` loads prior completed node outputs, skips re-execution
- [ ] 5.5 Implement cleanup: retention-based run record and artifact removal
## 6. Storage Backends
- [ ] 6.1 Implement filesystem store: `createFsStore(path)` — run.json per run, events.jsonl, node outputs as JSON files, file-level locking
- [ ] 6.2 Implement SQLite store: `createSqliteStore(path)` — workflow_runs, workflow_events, node_outputs tables with WAL mode
- [ ] 6.3 Implement Postgres store: `createPostgresStore(connectionString)` — same schema as SQLite, pg driver
## 7. Variable Substitution
- [ ] 7.1 Implement workflow-level variable substitution: $WORKFLOW_ID, $ARGUMENTS, $ARTIFACTS_DIR, $BASE_BRANCH, $DOCS_DIR
- [ ] 7.2 Implement node output references in prompts: `$nodeId.output` (full text), `$nodeId.output.field` (structured field access)
- [ ] 7.3 Implement loop- and approval-specific variables: `$LOOP_USER_INPUT`, `$LOOP_PREV_OUTPUT`, `$REJECTION_REASON`
- [ ] 7.4 Implement command-level variable substitution: $1-$9 positional args
## 8. Script and Bash Execution
- [ ] 8.1 Implement BashNode execution: `bash -c` subprocess with timeout, stdout capture, env var injection
- [ ] 8.2 Implement ScriptNode — bun runtime: inline `bun -e`, named scripts from `.archon/scripts/`, deps installation
- [ ] 8.3 Implement ScriptNode — uv runtime: `uv run python -c`, named scripts, uv deps installation
- [ ] 8.4 Implement `script-discovery.ts`: discover scripts by extension (.ts→bun, .py→uv) from project and home scopes
## 9. Approval Gates and Human-in-the-Loop
- [ ] 9.1 Implement ApprovalNode handler: pause workflow status, send approval message, store approval context
- [ ] 9.2 Implement approve/resume: transition from paused→running, continue DAG execution
- [ ] 9.3 Implement reject handling: reject node with reason, populate $REJECTION_REASON, execute on_reject prompt if configured
- [ ] 9.4 Implement capture_response: store user comment as $nodeId.output
- [ ] 9.5 Implement interactive loop support: loop.interactive=true pauses between iterations, gate_message shown to user
## 10. Loop Nodes
- [ ] 10.1 Implement LoopNode execution: iterative AI prompt loop with completion signal detection (`until`)
- [ ] 10.2 Implement `max_iterations` enforcement: fail node when exceeded
- [ ] 10.3 Implement `fresh_context` for loop iterations: new session vs. accumulated context
- [ ] 10.4 Implement `until_bash`: bash exit code 0 as completion signal (alternative to text signal)
## 11. CLI Tool (MVP)
- [ ] 11.1 Implement main CLI entry point with subcommand routing (workflow list, run, status, resume)
- [ ] 11.2 Implement `workflow list`: discover and display all workflows with source info
- [ ] 11.3 Implement `workflow run`: execute workflow by name with arguments, --cwd, --store flags
- [ ] 11.4 Implement `workflow status`: display active and recent runs
- [ ] 11.5 Implement `workflow resume`: resume a failed workflow
## 12. Workflow Discovery
- [ ] 12.1 Implement `workflow-discovery.ts`: filesystem discovery across bundled→home→project scopes with precedence
- [ ] 12.2 Implement bundled defaults: embedded default workflows (assist, plan, implement)
- [ ] 12.3 Implement home-global scope: user-level workflows directory
- [ ] 12.4 Implement project scope: repo-local `.workflows/` directory
- [ ] 12.5 Implement resilient loading: per-file error handling, one broken YAML doesn't abort discovery
## 13. Testing
- [ ] 13.1 Unit test DAG executor: topological layering, trigger rules, when conditions, node output refs
- [ ] 13.2 Unit test schema validation: all node types, mutual exclusivity, field validation
- [ ] 13.3 Unit test variable substitution: $nodeId.output, $ARGUMENTS, $LOOP_PREV_OUTPUT edge cases
- [ ] 13.4 Unit test condition evaluator: comparison operators, compound AND/OR, error cases
- [ ] 13.5 Unit test filesystem store: create/read/update runs, events, node outputs, resume data
- [ ] 13.6 Unit test SQLite store: same coverage as filesystem
- [ ] 13.7 Unit test CLI commands: argument parsing, output formatting, approval flow
- [ ] 13.8 Integration test: end-to-end workflow execution with bash and script nodes
- [ ] 13.9 Integration test: resume after failure with prior node outputs loaded

View File

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-06-07

View File

@@ -0,0 +1,3 @@
# memory-context-engineering
Spec-driven implementation of memory & context engineering patterns based on research of LangMem, DeerFlow, and CowAgent

View File

@@ -0,0 +1,164 @@
## Context
Current agents have no durable memory beyond the immediate LLM context window. Research across three production-grade OSS repos (LangMem, DeerFlow, CowAgent) reveals a consistent architectural pattern: a **tiered memory pipeline** with short-term context management, long-term semantic extraction, and periodic background consolidation. This design synthesizes those patterns into a portable, framework-agnostic `memory-engine` module.
The engine must be:
- **Portable** — works with any LLM, any agent framework, any embedding provider
- **Tiered** — separates ephemeral session context from persistent long-term knowledge
- **Efficient** — background processing, debounced writes, token-budget-aware formatting
- **Searchable** — hybrid keyword + vector retrieval with scoring
## Goals / Non-Goals
**Goals:**
- Provide a unified public API: `MemoryEngine` class with `manage()`, `search()`, `flush()`, `dream()` methods
- Short-term context: token-budget windowing + incremental summarization (LangMem's `summarize_messages` pattern)
- Long-term memory: LLM-extracted facts stored in SQLite with typed schemas (LangMem's `MemoryManager` + DeerFlow's fact model)
- Tiered consolidation: context→daily→core pipeline with configurable promotion rules (CowAgent's 3-tier)
- Hybrid search: FTS5 keyword + numpy-vectorized cosine similarity with weighted merge (CowAgent's `MemoryStorage`)
- Background processing: debounced async queue for memory updates (DeerFlow's `MemoryUpdateQueue` + LangMem's `ReflectionExecutor`)
- Agent tools: `manage_memory(content, action, id)` and `search_memory(query, limit)` as framework-agnostic callables
**Non-Goals:**
- Not a standalone agent framework — integrates into existing loops
- No built-in LLM provider — caller provides model
- No built-in embedding provider — caller provides or we degrade to keyword-only
- No real-time sync / distributed consensus — single-process design
- No graph-based memory (entity-relationship knowledge graphs) — deferred to future
## Decisions
### D1: SQLite as the single persistence backend
- **Choice**: SQLite with WAL mode for both keyword search (FTS5) and vector storage (BLOB embeddings)
- **Rationale**: No external service to operate, production-proven, FTS5 is available in most builds of Python's stdlib `sqlite3`, and embeddings can be scored in-process with numpy
- **Alternatives considered**:
- *JSON files* (DeerFlow) → simpler but no built-in search, concurrency issues
- *External vector DB* (Pinecone, pgvector) → adds operational complexity, violates portability goal
- *LMDB/RocksDB* → overkill, no FTS5 equivalent
### D2: Three-tier architecture with file-based daily layer
- **Choice**: In-memory context tier → Markdown-file daily tier → SQLite-indexed core tier
- **Rationale**: Daily Markdown files are human-readable, easily audited, and serve as the input to Deep Dream consolidation. Core tier is the indexed, searchable fact store.
- **Alternatives considered**:
- *Single SQLite DB for everything* → loses human-readability of daily records
- *All in-memory* → no persistence across restarts
### D3: Fact extraction via structured LLM output (JSON pattern)
- **Choice**: LLM returns structured JSON (DeerFlow pattern) rather than tool-calling-based extraction (LangMem trustcall pattern)
- **Rationale**: Simpler, fewer dependencies, compatible with any LLM provider. LangMem's trustcall approach is more robust for complex multi-step edits but requires the `trustcall` library.
- **Fallback**: Confidence-thresholded insertion with content-dedup hashing to prevent duplicates
### D4: Hybrid search with numpy-vectorized cosine similarity
- **Choice**: Load relevant embeddings from SQLite, compute cosine similarity via `matrix @ vector` (numpy), merge with FTS5 BM25 scores
- **Rationale**: Roughly 100x faster than per-row Python loops (an order-of-magnitude estimate). Uses numpy, which is near-ubiquitous in Python ML.
- **Fallback**: Pure-Python cosine similarity when numpy unavailable
### D5: Debounced background memory update queue
- **Choice**: Thread-safe priority queue with configurable debounce timer (DeerFlow pattern)
- **Rationale**: Prevents thundering-herd on LLM API during rapid conversation turns. Threaded execution avoids blocking the main agent loop.
- **Alternatives considered**: asyncio queue → fine for async-only, but MemoryEngine must support sync callers
### D6: Namespace isolation via tuple-based scoping
- **Choice**: `(scope_type, user_id, agent_id)` tuple namespace for multi-tenant isolation
- **Rationale**: LangMem's `NamespaceTemplate` pattern proven in production. Allows `("user", "u-123")` or `("org", "acme", "agent-alpha")`.
## Architecture Overview
```
┌─────────────────────────────────────────────────────────┐
│                      MemoryEngine                       │
├─────────────────────────────────────────────────────────┤
│  manage_memory(content, scope, metadata) → fact_id      │
│  search_memory(query, limit, scope) → SearchResults[]   │
│  flush_messages(messages, scope) → boolean              │
│  deep_dream(lookback_days, scope) → boolean             │
│  format_for_injection(scope, max_tokens) → str          │
└──────────────────────┬──────────────────────────────────┘
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
┌──────────────┐ ┌──────────┐ ┌──────────────┐
│ Context Tier │ │  Daily   │ │  Core Tier   │
│ (in-memory)  │ │  Tier    │ │ (SQLite +    │
│              │ │ (Markdown│ │  FTS5 +      │
│ RunningSumm. │ │  files)  │ │  vectors)    │
│ token budget │ │          │ │              │
└──────────────┘ │  Deep    │ │ MemoryStore  │
                 │ Dream ───┼─┤ (facts)      │
                 └──────────┘ │ HybridSearch │
                              └──────────────┘
                             ┌────────┴────────┐
                             ▼                 ▼
                       ┌────────────┐ ┌────────────────┐
                       │ Keyword    │ │ Vector Search  │
                       │ (FTS5)     │ │ (numpy cosine) │
                       └────────────┘ └────────────────┘
```
### Data Flow
1. **Agent sends message** → Context tier tracks token budget, optionally summarizes
2. **Conversation turn completes** → Messages queued to background `MemoryUpdateQueue`
3. **Debounce timer fires** → `MemoryUpdater` calls LLM with current memory + conversation → extracts facts
4. **Facts persisted** → Core tier SQLite: chunks table with embedding, FTS5 index
5. **Daily recording** → `MemoryFlushManager` appends to `memory/YYYY-MM-DD.md`
6. **Deep Dream (scheduled)** → LLM reads MEMORY.md + recent daily files → rewrites MEMORY.md → writes dream diary
7. **Agent starts new session** → `format_for_injection()` reads core tier → builds token-budgeted context string → injects into system prompt
## Module Structure
```
memory-engine/
├── __init__.py           # Public API: MemoryEngine, MemoryConfig
├── config.py             # Pydantic config model
├── core/
│   ├── __init__.py
│   ├── store.py          # MemoryStore (SQLite + FTS5 + vectors)
│   ├── hybrid_search.py  # Vector + keyword merge with temporal decay
│   └── schemas.py        # Memory, Fact, SearchResult models
├── extraction/
│   ├── __init__.py
│   ├── manager.py        # MemoryManager (LLM fact extraction)
│   └── prompts.py        # System prompts for memory extraction
├── tiers/
│   ├── __init__.py
│   ├── context.py        # ContextTier (short-term summarization)
│   ├── daily.py          # DailyTier (Markdown file management)
│   └── core.py           # CoreTier (long-term persistent store)
├── background/
│   ├── __init__.py
│   ├── queue.py          # MemoryUpdateQueue (debounced)
│   └── deep_dream.py     # Deep Dream consolidation
├── tools/
│   ├── __init__.py
│   ├── manage.py         # manage_memory callable
│   └── search.py         # search_memory callable
├── embedding/
│   ├── __init__.py
│   ├── base.py           # EmbeddingProvider ABC
│   └── openai.py         # OpenAI embedding implementation
└── utils/
    ├── __init__.py
    ├── namespace.py      # NamespaceTemplate
    ├── token_counter.py  # Token counting (tiktoken wrapper)
    └── chunker.py        # Text chunking
## Risks / Trade-offs
| Risk | Mitigation |
|------|-----------|
| [R1] LLM extraction latency blocks agent loop | Background queue with debounce — agent never waits for memory update |
| [R2] Embedding API failures degrade search | Graceful degradation to keyword-only; vector results omitted, not fatal |
| [R3] SQLite write contention under high concurrency | WAL mode + RLock per connection; single-process assumption |
| [R4] FTS5 corrupted after crash | Self-healing on init: detect corrupt shadow tables, rebuild from chunks table |
| [R5] Memory bloat from unbounded fact accumulation | Configurable `max_facts` limit (default 500); sorted by confidence, oldest trimmed |
| [R6] Deep Dream overwrites valuable long-term data | Dream diary preserves audit trail; content-hash dedup prevents re-processing |
| [R7] Token budget exceeded in context injection | `format_for_injection()` enforces strict token limit with truncation |
## Open Questions
- Q1: Should Deep Dream be scheduled (cron) or event-driven (every N daily files)?
- Q2: Is the proposed default `max_facts` limit of 500 (see R5) appropriate for the core tier?
- Q3: Should the daily tier support per-user isolation (user-specific daily files) or always shared?

View File

@@ -0,0 +1,35 @@
## Why
Current AI agents lack structured, durable memory beyond the immediate context window. Conversations are stateless, preferences are forgotten, and long-term learning is nonexistent. Three OSS repos (LangMem, DeerFlow, CowAgent) demonstrate production patterns for agent memory — but no unified, portable engine exists that combines short-term context management, long-term semantic memory, tiered consolidation, and hybrid retrieval. This change builds that engine by extracting and adapting the best patterns from all three.
## What Changes
- **New `memory-engine/` module** in the codebase providing a unified memory & context API
- **Short-term context summarization** — token-budget-aware conversation windowing (LangMem pattern)
- **Long-term semantic memory** — LLM-extracted facts stored with optional vector embeddings (LangMem/DeerFlow hybrid)
- **Tiered memory architecture** — Context tier (ephemeral session) → Daily tier (summarized records) → Core tier (distilled long-term) (CowAgent pattern)
- **Hybrid search** — Keyword (FTS5) + Vector (cosine similarity on embeddings) with weighted merge (CowAgent pattern)
- **Background consolidation** — Debounced, async memory extraction pipeline (DeerFlow queue + LangMem ReflectionExecutor)
- **Deep Dream distillation** — Periodic overnight LLM consolidation of daily records into core memory (CowAgent pattern)
- **Memory tools for agents** — `manage_memory` and `search_memory` tool interfaces (LangMem pattern)
## Capabilities
### New Capabilities
- `short-term-context`: Token-budget window management, conversation summarization, and context trimming for LLM interactions
- `long-term-memory`: Persistent fact extraction, storage, and retrieval with Pydantic-typed schemas
- `tiered-consolidation`: Three-tier memory pipeline (context→daily→core) with promotion rules and Deep Dream distillation
- `hybrid-search`: Combined keyword (FTS5) + vector (embedding cosine similarity) search with weighted scoring and temporal decay
- `memory-tools`: `manage_memory` (CRUD) and `search_memory` (semantic query) tools for agent integration
- `background-processing`: Debounced async memory update queue with thread-pool execution
### Modified Capabilities
<!-- No existing specs to modify — this is a greenfield module -->
## Impact
- New `memory-engine/` directory tree (no existing code modified)
- Dependencies: `sqlite3` (stdlib), `numpy` (optional, for vector search), `pydantic` (schemas), `tiktoken` (token counting)
- LLM provider integration via abstract `ChatModel` interface (not coupled to any provider)
- Embedding provider integration via abstract `EmbeddingProvider` interface (supports OpenAI, local models)
- Agent integration via simple tool interface (not coupled to any agent framework)

View File

@@ -0,0 +1,58 @@
## ADDED Requirements
### Requirement: Debounced memory update queue
The system SHALL collect memory update requests into a queue and process them after a configurable debounce period.
#### Scenario: Items enqueued per (thread, user, agent) key
- **WHEN** a conversation context is added to the queue
- **THEN** it SHALL be keyed by `(thread_id, user_id, agent_name)` for deduplication
- **WHEN** a second context arrives for the same key before processing
- **THEN** the previous context SHALL be replaced with the newer one
#### Scenario: Debounce timer resets on each enqueue
- **WHEN** a new item is enqueued
- **THEN** the debounce timer SHALL reset to the configured `debounce_seconds`
- **WHEN** no new items arrive within the debounce window
- **THEN** the queue SHALL be processed
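
The keyed replacement and timer-reset behaviour in the two scenarios above can be sketched as follows. This is a minimal illustration, not the module's implementation: the class name `DebouncedQueue`, the `process` callback, and the use of `threading.Timer` are assumptions made for the example.

```python
import threading
from typing import Callable, Dict, Hashable, List, Optional


class DebouncedQueue:
    """Hypothetical sketch: one pending context per key, flushed after a quiet period."""

    def __init__(self, process: Callable[[List[object]], None], debounce_seconds: float) -> None:
        self._process = process
        self._debounce_seconds = debounce_seconds
        self._pending: Dict[Hashable, object] = {}
        self._timer: Optional[threading.Timer] = None
        self._lock = threading.Lock()

    def add(self, key: Hashable, context: object) -> None:
        with self._lock:
            # A newer context for the same (thread, user, agent) key replaces the older one.
            self._pending[key] = context
            # Every enqueue restarts the debounce window.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._debounce_seconds, self._flush)
            self._timer.daemon = True
            self._timer.start()

    def _flush(self) -> None:
        with self._lock:
            batch = list(self._pending.values())
            self._pending.clear()
        # Run the (potentially slow) processing outside the lock.
        if batch:
            self._process(batch)
```

Running `process` outside the lock keeps `add()` cheap for the agent loop even while a batch is being handled.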
#### Scenario: Immediate processing option
- **WHEN** `add_nowait()` is called instead of `add()`
- **THEN** the queue SHALL start processing immediately in a background thread
### Requirement: Background thread execution for memory updates
The system SHALL execute memory updates (LLM extraction + persistence) in a background thread to avoid blocking the agent loop.
#### Scenario: Async flush via threading.Thread
- **WHEN** conversation messages are flushed to memory
- **THEN** the flush SHALL run in a `threading.Thread` (daemon=True)
- **THEN** the main agent SHALL NOT wait for the flush to complete
#### Scenario: Thread pool for sync LLM calls
- **WHEN** a memory update requires a synchronous LLM call
- **THEN** the call SHALL be offloaded to a `ThreadPoolExecutor` (max_workers=4)
- **THEN** this SHALL prevent blocking the main event loop
### Requirement: Content deduplication for flush
The system SHALL deduplicate message content before flushing to avoid redundant summarization.
#### Scenario: MD5 content hash dedup
- **WHEN** messages are about to be flushed
- **THEN** each message content SHALL be MD5-hashed
- **WHEN** a hash matches a previously flushed message
- **THEN** that message SHALL be skipped
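
A minimal sketch of the hash-based dedup described above. The function name `filter_unflushed` and the caller-owned `seen_hashes` set are hypothetical; MD5 is used here only as a content fingerprint, not for security.

```python
import hashlib
from typing import Iterable, List, Set


def filter_unflushed(messages: Iterable[str], seen_hashes: Set[str]) -> List[str]:
    """Return messages whose MD5 digest has not been seen; record new digests."""
    fresh: List[str] = []
    for content in messages:
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # already flushed earlier, skip it
        seen_hashes.add(digest)
        fresh.append(content)
    return fresh
```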
#### Scenario: Scheduler pair stripping
- **WHEN** messages contain scheduler-injected pairs (marked with `[SCHEDULED]` prefix)
- **THEN** the scheduler user message and its paired assistant response SHALL be stripped before flushing
### Requirement: Configuration-driven memory processing
The system SHALL support configuration to enable/disable background memory processing.
#### Scenario: Memory processing disabled
- **WHEN** `memory_config.enabled` is `False`
- **THEN** no memory updates SHALL be queued or processed
- **THEN** queue `add()` calls SHALL be no-ops
#### Scenario: Rate limiting between updates
- **WHEN** processing multiple queued memory updates
- **THEN** a 0.5 second delay SHALL be inserted between updates to avoid LLM API rate limits

View File

@@ -0,0 +1,73 @@
## ADDED Requirements
### Requirement: Hybrid search with vector + keyword fusion
The system SHALL combine vector similarity search and keyword search into unified ranked results.
#### Scenario: Vector search runs when embedding provider available
- **WHEN** an embedding provider is configured
- **THEN** the system SHALL compute a query embedding and perform cosine similarity search
- **WHEN** no embedding provider is configured
- **THEN** the system SHALL gracefully degrade to keyword-only search
#### Scenario: Keyword search always runs
- **WHEN** a search query is submitted
- **THEN** the system SHALL always perform keyword search regardless of embedding provider availability
#### Scenario: Weighted score merging
- **WHEN** both vector and keyword results are available
- **THEN** the final score SHALL be: `vector_weight * vector_score + keyword_weight * keyword_score`
- **THEN** default weights SHALL be `vector_weight=0.7`, `keyword_weight=0.3`
- **THEN** weights SHALL be configurable
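
The weighted merge above can be illustrated with a short sketch. It assumes both score maps are keyed by chunk ID and that a chunk missing from one result set contributes 0.0 for that component; the spec does not state how one-sided hits are handled, so that choice is an assumption.

```python
from typing import Dict


def merge_scores(
    vector_scores: Dict[str, float],
    keyword_scores: Dict[str, float],
    vector_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> Dict[str, float]:
    """Combine per-chunk scores as vector_weight * v + keyword_weight * k."""
    merged: Dict[str, float] = {}
    for chunk_id in set(vector_scores) | set(keyword_scores):
        v = vector_scores.get(chunk_id, 0.0)   # assumption: absent means 0.0
        k = keyword_scores.get(chunk_id, 0.0)
        merged[chunk_id] = vector_weight * v + keyword_weight * k
    return merged
```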
### Requirement: Vector search via numpy cosine similarity
The system SHALL perform vector search using numpy-vectorized cosine similarity for performance.
#### Scenario: Vectorized cosine similarity
- **WHEN** numpy is available
- **THEN** all chunk embeddings SHALL be loaded into a numpy matrix `(N, D)`
- **THEN** cosine similarity SHALL be computed as `matrix @ query_vector` (BLAS matrix-vector multiply), with rows and query vector L2-normalized so that the dot product equals cosine similarity
- **THEN** top-K results SHALL be selected via `argpartition` (O(N) average)
#### Scenario: Pure-Python fallback
- **WHEN** numpy is unavailable
- **THEN** cosine similarity SHALL be computed per-row with pure Python
- **THEN** results SHALL be sorted and the top K returned
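
Both scenarios of this requirement can be sketched in one function. The name `top_k_cosine` and its return shape are assumptions for illustration. The sketch normalizes by vector norms explicitly, so it does not rely on embeddings being stored pre-normalized.

```python
import math
from typing import List, Sequence, Tuple

try:
    import numpy as np
except ImportError:  # numpy is optional; fall back to pure Python below
    np = None


def top_k_cosine(
    embeddings: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> List[Tuple[int, float]]:
    """Return (row_index, similarity) pairs for the k most similar rows, best first."""
    if len(embeddings) == 0 or k <= 0:
        return []
    k = min(k, len(embeddings))
    if np is not None:
        matrix = np.asarray(embeddings, dtype=np.float64)   # shape (N, D)
        vector = np.asarray(query, dtype=np.float64)        # shape (D,)
        denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
        denom[denom == 0.0] = 1.0                # zero vectors get similarity 0
        sims = (matrix @ vector) / denom
        top = np.argpartition(-sims, k - 1)[:k]  # unordered top-k, O(N) on average
        top = top[np.argsort(-sims[top])]        # order only the k winners
        return [(int(i), float(sims[i])) for i in top]
    # Pure-Python fallback: score every row, then sort.
    q_norm = math.sqrt(sum(x * x for x in query))
    scored: List[Tuple[int, float]] = []
    for i, row in enumerate(embeddings):
        r_norm = math.sqrt(sum(x * x for x in row))
        dot = sum(a * b for a, b in zip(row, query))
        d = r_norm * q_norm
        scored.append((i, dot / d if d else 0.0))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```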
### Requirement: Three-tier keyword search (FTS5 → trigram → LIKE)
The system SHALL provide a cascading keyword search strategy for multi-language support.
#### Scenario: Standard FTS5 for ASCII queries
- **WHEN** the query contains only ASCII characters
- **THEN** the system SHALL use SQLite FTS5 with the unicode61 tokenizer
- **THEN** BM25 ranking SHALL be converted to a `[0, 1)` score
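
The spec does not say how BM25 ranks are mapped to `[0, 1)`. One possible mapping, shown here purely as an assumption, relies on the fact that SQLite's FTS5 `bm25()` returns numerically smaller (more negative) values for better matches:

```python
def bm25_to_score(rank: float) -> float:
    """Map an FTS5 bm25() rank (more negative = better) into [0, 1).

    Hypothetical mapping: score = r / (1 + r) with r = max(0, -rank).
    """
    relevance = max(0.0, -rank)
    return relevance / (1.0 + relevance)
```

The mapping is monotonic, so it preserves FTS5's ordering while making keyword scores comparable in range to cosine similarities.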
#### Scenario: Trigram FTS5 for CJK queries
- **WHEN** the query contains CJK (Chinese, Japanese, Korean) characters
- **THEN** the system SHALL use SQLite FTS5 with the trigram tokenizer
- **THEN** CJK character sequences and ASCII words SHALL be extracted and joined with AND
#### Scenario: LIKE fallback for edge cases
- **WHEN** FTS5 is unavailable or returns empty results
- **THEN** the system SHALL fall back to LIKE-based search
- **THEN** CJK runs (1+ chars) and ASCII words (3+ chars) SHALL be matched independently
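
Term extraction for the LIKE fallback might look like the sketch below. The Unicode ranges (Hiragana/Katakana, CJK Unified Ideographs, Hangul syllables) are a simplifying assumption and do not cover every CJK block; the function names are hypothetical.

```python
import re
from typing import List

# Assumed ranges: kana, CJK Unified Ideographs, Hangul syllables.
_CJK = r"\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af"
# CJK runs of 1+ chars, or ASCII alphanumeric words of 3+ chars.
_TERM_RE = re.compile(rf"[{_CJK}]+|[A-Za-z0-9]{{3,}}")


def like_terms(query: str) -> List[str]:
    """Extract the terms that the LIKE fallback would match independently."""
    return _TERM_RE.findall(query)


def like_patterns(query: str) -> List[str]:
    """Build one '%term%' parameter per term for a parameterized LIKE query."""
    return [f"%{term}%" for term in like_terms(query)]
```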
### Requirement: Temporal decay for dated memory files
The system SHALL apply exponential decay to search scores for dated memory files.
#### Scenario: Decay applied to dated files
- **WHEN** a memory chunk path matches `YYYY-MM-DD.md`
- **THEN** the combined score SHALL be multiplied by `exp(-ln(2)/half_life * age_days)`
- **THEN** the default `half_life` SHALL be 30 days
- **WHEN** the path does not contain a date (e.g., `MEMORY.md`)
- **THEN** no decay SHALL be applied (multiplier = 1.0)
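
The decay multiplier above can be computed as in this sketch. The helper name and the choice to pass `today` explicitly (to keep the function deterministic) are assumptions.

```python
import math
import re
from datetime import date

_DATED_FILE = re.compile(r"(\d{4})-(\d{2})-(\d{2})\.md$")


def decay_multiplier(path: str, today: date, half_life_days: float = 30.0) -> float:
    """Return exp(-ln(2)/half_life * age_days) for dated files, 1.0 otherwise."""
    match = _DATED_FILE.search(path)
    if match is None:
        return 1.0  # undated files such as MEMORY.md are never decayed
    try:
        file_date = date(int(match[1]), int(match[2]), int(match[3]))
    except ValueError:
        return 1.0  # looks like a date but is not a valid one
    age_days = max(0, (today - file_date).days)
    return math.exp(-math.log(2) / half_life_days * age_days)
```

With the default half-life, a chunk from a file that is exactly 30 days old keeps half of its combined score.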
### Requirement: Result filtering and limits
The system SHALL filter search results by minimum score and maximum count.
#### Scenario: Min score threshold
- **WHEN** search results are merged
- **THEN** results with score below `min_score` (default 0.1) SHALL be discarded
#### Scenario: Max results limit
- **WHEN** search results exceed `max_results`
- **THEN** only the top `max_results` by combined score SHALL be returned

View File

@@ -0,0 +1,83 @@
## ADDED Requirements
### Requirement: Fact extraction from conversation
The system SHALL extract structured facts from conversations using an LLM, with confidence scoring and category classification.
#### Scenario: Extract facts from conversation turn
- **WHEN** a conversation turn (user message + assistant reply) is processed
- **THEN** the system SHALL call the configured LLM with the conversation text
- **THEN** the LLM response SHALL be parsed as structured JSON with facts
- **THEN** each fact SHALL contain: `content`, `category`, `confidence` (0.0-1.0)
#### Scenario: Fact categories
- **WHEN** a fact is extracted
- **THEN** its `category` SHALL be one of: `preference`, `knowledge`, `context`, `behavior`, `goal`, `correction`
- **THEN** the system SHALL validate the category against the allowed set
#### Scenario: Confidence thresholds
- **WHEN** a fact's confidence is below the configurable threshold (default 0.5)
- **THEN** the fact SHALL NOT be persisted
- **THEN** the system SHALL log that a low-confidence fact was skipped
### Requirement: Fact CRUD operations
The system SHALL support creating, reading, updating, and deleting memory facts.
#### Scenario: Create fact
- **WHEN** a new fact is created
- **THEN** it SHALL be assigned a unique ID (`fact_{uuid_hex[:8]}`)
- **THEN** it SHALL be timestamped with ISO-8601 UTC
- **THEN** it SHALL be persisted to the core store
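
A minimal sketch of fact creation under the rules above. The `Fact` dataclass fields are inferred from the scenarios in this spec (`content`, `category`, `confidence`, `createdAt`); persistence is left out.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Fact:
    id: str
    content: str
    category: str
    confidence: float
    createdAt: str  # ISO-8601 timestamp in UTC


def new_fact(content: str, category: str, confidence: float) -> Fact:
    """Build a fact with a `fact_<8 hex chars>` ID and a UTC creation timestamp."""
    return Fact(
        id=f"fact_{uuid.uuid4().hex[:8]}",
        content=content,
        category=category,
        confidence=confidence,
        createdAt=datetime.now(timezone.utc).isoformat(),
    )
```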
#### Scenario: Delete fact by ID
- **WHEN** a fact deletion is requested with a valid ID
- **THEN** the fact SHALL be removed from the store
- **THEN** the updated store SHALL be persisted
#### Scenario: Delete non-existent fact
- **WHEN** a fact deletion is requested with an unknown ID
- **THEN** the system SHALL raise `KeyError`
#### Scenario: Update fact
- **WHEN** a fact update is requested with a valid ID
- **THEN** the system SHALL update only the provided fields (`content`, `category`, `confidence`)
- **THEN** the fact's `createdAt` SHALL NOT be modified
- **THEN** the updated store SHALL be persisted
### Requirement: Content deduplication
The system SHALL prevent duplicate facts by casefolded content comparison.
#### Scenario: Exact duplicate detected
- **WHEN** a new fact's content (casefolded) matches an existing fact
- **THEN** the new fact SHALL be skipped
- **THEN** the existing fact SHALL remain unchanged
- **THEN** the system SHALL log that a duplicate was skipped
#### Scenario: Near-duplicate with different casing
- **WHEN** a new fact's content differs only in letter casing
- **THEN** it SHALL be treated as a duplicate
- **THEN** the new fact SHALL be skipped
### Requirement: Max facts limit
The system SHALL enforce a configurable maximum number of stored facts (default 500).
#### Scenario: Fact count exceeds limit
- **WHEN** adding a new fact would exceed `max_facts`
- **THEN** the system SHALL sort existing facts by confidence (descending)
- **THEN** the lowest-confidence fact SHALL be removed
- **THEN** the new fact SHALL be added
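
The eviction rule can be sketched as below, assuming facts are plain dicts with a `confidence` key and that `max_facts` is at least 1. Note that, as the scenario is written, the new fact is always added, even when its confidence is lower than that of the evicted fact.

```python
from typing import Dict, List


def add_with_limit(facts: List[Dict], new_fact: Dict, max_facts: int = 500) -> List[Dict]:
    """Return a new list with `new_fact` added, evicting lowest-confidence facts if needed."""
    kept = sorted(facts, key=lambda f: f["confidence"], reverse=True)
    while kept and len(kept) + 1 > max_facts:
        kept.pop()  # the list is sorted descending, so the last item is the weakest
    kept.append(new_fact)
    return kept
```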
### Requirement: Memory formatting for context injection
The system SHALL format memory data into a compact string for injection into LLM system prompts, respecting a token budget.
#### Scenario: Format with all sections
- **WHEN** memory data contains user context, history, and facts
- **THEN** the output SHALL include: "User Context:" with work/personal/topOfMind
- **THEN** the output SHALL include: "History:" with recent/earlier/background
- **THEN** the output SHALL include: "Facts:" sorted by confidence descending
- **THEN** each fact SHALL be formatted as: `- [{category} | {confidence:.2f}] {content}`
#### Scenario: Token budget enforcement
- **WHEN** the formatted output exceeds `max_tokens` (default 2000)
- **THEN** the system SHALL trim facts from lowest confidence up
- **THEN** if still over budget, the output SHALL be truncated at the character level
- **THEN** `"\n..."` SHALL be appended to indicate truncation
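
The budget enforcement can be sketched for the "Facts:" section alone. The default token estimate of four characters per token and the dict-shaped facts are assumptions; the "User Context:" and "History:" sections are omitted to keep the example short.

```python
from typing import Callable, Dict, List


def format_facts(
    facts: List[Dict],
    max_tokens: int = 2000,
    count_tokens: Callable[[str], int] = lambda text: len(text) // 4,
) -> str:
    """Render facts by descending confidence, trimming the weakest to fit the budget."""
    ordered = sorted(facts, key=lambda f: f["confidence"], reverse=True)

    def render(items: List[Dict]) -> str:
        lines = ["Facts:"]
        lines += [f"- [{f['category']} | {f['confidence']:.2f}] {f['content']}" for f in items]
        return "\n".join(lines)

    text = render(ordered)
    while ordered and count_tokens(text) > max_tokens:
        ordered.pop()  # drop the lowest-confidence fact first
        text = render(ordered)
    if count_tokens(text) > max_tokens:
        # Last resort: cut at the character level and mark the truncation.
        text = text[: max_tokens * 4] + "\n..."
    return text
```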

View File

@@ -0,0 +1,64 @@
## ADDED Requirements
### Requirement: manage_memory tool
The system SHALL provide a callable tool for creating, updating, and deleting persistent facts.
#### Scenario: Create a new fact
- **WHEN** `manage_memory(content="...", action="create")` is called
- **THEN** a new fact SHALL be created with the provided content
- **THEN** a unique ID SHALL be auto-generated
- **THEN** the return value SHALL be `"created memory <id>"`
#### Scenario: Update an existing fact
- **WHEN** `manage_memory(content="...", action="update", id="<existing-id>")` is called
- **THEN** the fact SHALL be updated with the new content
- **THEN** the return value SHALL be `"updated memory <id>"`
- **WHEN** no `id` is provided for an update action
- **THEN** a ValueError SHALL be raised
#### Scenario: Delete a fact
- **WHEN** `manage_memory(action="delete", id="<existing-id>")` is called
- **THEN** the fact SHALL be deleted
- **THEN** the return value SHALL be `"Deleted memory <id>"`
- **WHEN** no `id` is provided for a delete action
- **THEN** a ValueError SHALL be raised
#### Scenario: Configurable permitted actions
- **WHEN** creating the tool with `actions_permitted=("create", "update")`
- **THEN** the delete action SHALL NOT be available
- **THEN** attempting a delete SHALL raise a ValueError
#### Scenario: Custom instructions
- **WHEN** creating the tool with custom `instructions`
- **THEN** those instructions SHALL be included in the tool description to guide LLM usage
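
The create/update/delete scenarios and the `actions_permitted` restriction can be sketched with a small factory. The factory name `make_manage_memory` and the dict-backed store are hypothetical; return strings follow the scenarios above verbatim.

```python
import uuid
from typing import Callable, Dict, Optional, Tuple


def make_manage_memory(
    store: Dict[str, str],
    actions_permitted: Tuple[str, ...] = ("create", "update", "delete"),
) -> Callable[..., str]:
    """Build a manage_memory callable bound to `store` (hypothetical in-memory backend)."""

    def manage_memory(
        content: Optional[str] = None, action: str = "create", id: Optional[str] = None
    ) -> str:
        if action not in actions_permitted:
            raise ValueError(f"action {action!r} is not permitted")
        if action == "create":
            new_id = f"fact_{uuid.uuid4().hex[:8]}"
            store[new_id] = content or ""
            return f"created memory {new_id}"
        if id is None:
            raise ValueError(f"{action} requires an id")
        if id not in store:
            raise KeyError(id)
        if action == "delete":
            del store[id]
            return f"Deleted memory {id}"
        store[id] = content or ""
        return f"updated memory {id}"

    return manage_memory
```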
### Requirement: search_memory tool
The system SHALL provide a callable tool for searching stored facts by semantic query.
#### Scenario: Text query search
- **WHEN** `search_memory(query="preference for dark mode", limit=10)` is called
- **THEN** the system SHALL perform hybrid search (vector + keyword)
- **THEN** results SHALL be returned as a serialized JSON list of fact objects
#### Scenario: Filtered search
- **WHEN** `search_memory(query="...", filter={"category": "preference"})` is called
- **THEN** results SHALL be filtered to match the specified criteria
#### Scenario: Configurable response format
- **WHEN** `response_format="content_and_artifact"` is configured
- **THEN** the tool SHALL return both serialized memories and raw memory objects
### Requirement: Namespace isolation for multi-tenant
The system SHALL support namespace-based isolation of memory data across users, agents, or organizations.
#### Scenario: Runtime namespace resolution
- **WHEN** a memory tool is called with a configuration containing `{"user_id": "u-123"}`
- **THEN** the namespace SHALL be resolved to `("user", "u-123")` at runtime
- **WHEN** calling with `{"org_id": "acme", "agent_id": "alpha"}`
- **THEN** the namespace SHALL be `("org", "acme", "alpha")`
#### Scenario: Namespace templating
- **WHEN** creating memory tools with `namespace=("{user_id}", "memories")`
- **THEN** the `{user_id}` placeholder SHALL be replaced at runtime from configuration
- **WHEN** a required config key is missing
- **THEN** a ConfigurationError SHALL be raised
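
Placeholder resolution for templated namespaces could work roughly as follows. This is a sketch under the assumption that each tuple element is either a literal or a single `{key}` placeholder; `resolve_namespace` is a hypothetical helper name.

```python
from typing import Mapping, Tuple


class ConfigurationError(Exception):
    """Raised when a namespace placeholder has no value in the runtime config."""


def resolve_namespace(template: Tuple[str, ...], config: Mapping[str, str]) -> Tuple[str, ...]:
    """Replace `{key}` elements of the template with values from the config."""
    resolved = []
    for part in template:
        if part.startswith("{") and part.endswith("}"):
            key = part[1:-1]
            if key not in config:
                raise ConfigurationError(f"missing config key: {key}")
            resolved.append(config[key])
        else:
            resolved.append(part)  # literal element, kept as-is
    return tuple(resolved)
```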

View File

@@ -0,0 +1,65 @@
## ADDED Requirements
### Requirement: Token budget management
The system SHALL manage LLM context window limits by tracking token usage and triggering summarization when thresholds are exceeded.
#### Scenario: Token threshold exceeded
- **WHEN** cumulative message tokens exceed `max_tokens` configuration
- **THEN** the system SHALL identify messages to summarize starting from oldest
- **THEN** the system SHALL replace summarized messages with a `RunningSummary` object
- **THEN** the system SHALL ensure remaining messages + summary fit within `max_tokens` budget
#### Scenario: Partial token budget allocation
- **WHEN** `max_summary_tokens` is configured (default 256)
- **THEN** the system SHALL reserve `max_summary_tokens` tokens for the summary itself
- **THEN** remaining messages SHALL be trimmed to fit within `max_tokens - max_summary_tokens`
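
The split between messages to summarize and messages to keep can be sketched as below. It treats messages as plain strings and estimates tokens as four characters each; both are simplifying assumptions, and `split_for_summary` is a hypothetical name.

```python
from typing import Callable, List, Sequence, Tuple


def split_for_summary(
    messages: Sequence[str],
    max_tokens: int,
    max_summary_tokens: int = 256,
    count_tokens: Callable[[str], int] = lambda text: len(text) // 4,
) -> Tuple[List[str], List[str]]:
    """Return (to_summarize, to_keep), moving oldest messages out until the rest fits."""
    counts = [count_tokens(m) for m in messages]
    if sum(counts) <= max_tokens:
        return [], list(messages)  # under the threshold, nothing to summarize
    budget = max_tokens - max_summary_tokens  # room left once the summary is reserved
    kept_total = sum(counts)
    cut = 0
    while cut < len(messages) and kept_total > budget:
        kept_total -= counts[cut]
        cut += 1
    return list(messages[:cut]), list(messages[cut:])
```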
### Requirement: Incremental summarization
The system SHALL support incremental summarization across multiple turns, tracking which messages have already been summarized to avoid redundant work.
#### Scenario: First summarization
- **WHEN** no existing `RunningSummary` exists and token threshold is exceeded
- **THEN** the system SHALL call the LLM with an initial summary prompt
- **THEN** the system SHALL return a `RunningSummary` with `summary`, the set `summarized_message_ids`, and `last_summarized_message_id` populated
#### Scenario: Subsequent summarization (append)
- **WHEN** a `RunningSummary` exists and new messages exceed threshold
- **THEN** the system SHALL call the LLM with the existing summary plus new messages
- **THEN** the system SHALL extend `summarized_message_ids` with newly summarized message IDs
- **THEN** the system SHALL update `last_summarized_message_id`
### Requirement: Context trimming with summarization hook
The system SHALL provide a hook that fires before messages are discarded, allowing the daily tier to capture summarized content.
#### Scenario: Pre-trim flush
- **WHEN** messages are about to be discarded (summarized)
- **THEN** the system SHALL fire a `memory_flush_hook` with the messages being summarized
- **THEN** the hook SHALL queue the messages for async memory extraction
- **THEN** the main thread SHALL NOT block on memory extraction
### Requirement: Token counting with fallback
The system SHALL provide accurate token counting using `tiktoken` when available, with a char-based fallback.
#### Scenario: tiktoken available
- **WHEN** tiktoken package is installed
- **THEN** the system SHALL use `tiktoken.get_encoding("cl100k_base")` for token counting
- **THEN** token counts SHALL be accurate for OpenAI models that use `cl100k_base`; for other providers (e.g. Anthropic) they are approximations
#### Scenario: tiktoken unavailable
- **WHEN** tiktoken is not installed
- **THEN** the system SHALL fall back to character-based estimation: `len(text) // 4`
- **THEN** the system SHALL log a warning about missing tiktoken
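
A sketch of the fallback logic, with loading and counting separated so the counter stays a pure function. `load_encoding` and the explicit `encoding` parameter are assumptions for the example; `tiktoken.get_encoding("cl100k_base")` is the call named in the scenario above.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def load_encoding() -> Optional[Any]:
    """Return the cl100k_base encoding, or None (with a warning) if tiktoken is unusable."""
    try:
        import tiktoken
        return tiktoken.get_encoding("cl100k_base")
    except Exception:  # missing package or encoding data
        logger.warning("tiktoken unavailable; falling back to len(text) // 4")
        return None


def count_tokens(text: str, encoding: Optional[Any] = None) -> int:
    """Count tokens with the given encoding, or estimate at 4 characters per token."""
    if encoding is not None:
        return len(encoding.encode(text))
    return len(text) // 4
```

A caller would typically run `load_encoding()` once at startup and pass the result to every `count_tokens` call.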
### Requirement: Summarization node for LangGraph
The system SHALL provide a `SummarizationNode` Runnable that integrates into LangGraph state graphs.
#### Scenario: Graph integration
- **WHEN** `SummarizationNode` is added to a LangGraph workflow
- **THEN** it SHALL read messages from `input_messages_key` (default "messages")
- **THEN** it SHALL write updated messages to `output_messages_key` (default "summarized_messages")
- **THEN** it SHALL store `RunningSummary` in `context.running_summary`
#### Scenario: Equality of input/output keys
- **WHEN** `input_messages_key` equals `output_messages_key`
- **THEN** the node SHALL emit a `RemoveMessage(REMOVE_ALL_MESSAGES)` to clear previous state
- **THEN** the node SHALL write the new message list including the summary

View File

@@ -0,0 +1,64 @@
## ADDED Requirements
### Requirement: Three-tier memory architecture
The system SHALL maintain three tiers of memory: Context (short-term/ephemeral), Daily (medium-term/file-based), and Core (long-term/distilled).
#### Scenario: Context tier stores active session
- **WHEN** an agent conversation is in progress
- **THEN** the context tier SHALL track messages, token usage, and running summary
- **WHEN** the session ends or context is trimmed
- **THEN** the context SHALL be flushed to the daily tier
#### Scenario: Daily tier persists as Markdown files
- **WHEN** context is flushed
- **THEN** the daily tier SHALL append summarized records to `memory/YYYY-MM-DD.md`
- **THEN** each session block SHALL have a timestamped header (e.g., `## Trimmed Context (14:30)`)
- **THEN** daily files SHALL be created lazily (only when first write occurs)
#### Scenario: Core tier stores distilled long-term knowledge
- **WHEN** Deep Dream consolidation runs
- **THEN** the core tier SHALL be updated by rewriting `MEMORY.md`
- **THEN** `MEMORY.md` SHALL be formatted as Markdown with `- ` bullet items, optionally grouped under `## headings`
### Requirement: Daily memory file management
The system SHALL manage daily memory files with automatic creation and lazy initialization.
#### Scenario: Lazy file creation
- **WHEN** the first memory write occurs for a given day
- **THEN** a file SHALL be created at `memory/YYYY-MM-DD.md` with a header `# Daily Memory: YYYY-MM-DD`
#### Scenario: Append-only writes
- **WHEN** subsequent memory writes occur on the same day
- **THEN** new entries SHALL be appended to the existing daily file
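
Lazy creation and append-only writes might be implemented along these lines. The helper name and the `## Trimmed Context (HH:MM)` block header (taken from the tier scenario earlier in this spec) are used here for illustration only.

```python
from datetime import datetime
from pathlib import Path


def append_daily_entry(memory_dir: Path, entry: str, now: datetime) -> Path:
    """Append `entry` to memory_dir/YYYY-MM-DD.md, creating the file on first write."""
    memory_dir.mkdir(parents=True, exist_ok=True)
    day = now.strftime("%Y-%m-%d")
    path = memory_dir / f"{day}.md"
    if not path.exists():
        # First write of the day: create the file with its header.
        path.write_text(f"# Daily Memory: {day}\n", encoding="utf-8")
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"\n## Trimmed Context ({now:%H:%M})\n{entry}\n")
    return path
```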
### Requirement: Deep Dream consolidation
The system SHALL periodically consolidate daily memories into the core memory using LLM-based distillation.
#### Scenario: Deep Dream triggered
- **WHEN** `deep_dream(lookback_days=N)` is called
- **THEN** the system SHALL read current `MEMORY.md` and the last N daily files
- **THEN** the LLM SHALL receive both the current memory and daily records
- **THEN** the LLM SHALL return `[MEMORY]` and `[DREAM]` sections
- **THEN** `MEMORY.md` SHALL be overwritten with the `[MEMORY]` content
- **THEN** a dream diary SHALL be written to `memory/dreams/YYYY-MM-DD.md`
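
Splitting the LLM response into its two sections can be sketched as follows. It assumes the `[MEMORY]` marker precedes `[DREAM]` and that each marker appears once; the spec does not fix the ordering, so treat that as an assumption.

```python
import re
from typing import Tuple

_SECTIONS = re.compile(r"\[MEMORY\](.*?)\[DREAM\](.*)", flags=re.DOTALL)


def split_dream_output(text: str) -> Tuple[str, str]:
    """Return (memory_markdown, dream_diary) parsed from the LLM response."""
    match = _SECTIONS.search(text)
    if match is None:
        raise ValueError("expected [MEMORY] and [DREAM] sections in the response")
    return match.group(1).strip(), match.group(2).strip()
```

Raising on a malformed response lets the caller skip the rewrite and leave the existing `MEMORY.md` untouched.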
#### Scenario: Dedup prevents redundant runs
- **WHEN** Deep Dream is called but daily content hash matches the last processed hash
- **THEN** the operation SHALL be skipped
#### Scenario: No daily content skips gracefully
- **WHEN** Deep Dream is called but no recent daily files have content
- **THEN** the operation SHALL be skipped and existing `MEMORY.md` SHALL be preserved
#### Scenario: No-fabrication constraint
- **WHEN** the LLM produces the `[MEMORY]` section
- **THEN** it SHALL ONLY use information present in the source materials (current MEMORY.md + daily files)
- **THEN** it SHALL NOT fabricate, infer, or add information not present in the source
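A rough sketch of the dedup guard and the `[MEMORY]`/`[DREAM]` split described in the scenarios above. The helper names (`daily_digest`, `should_run`, `split_sections`) are hypothetical, SHA-256 is an assumed hash choice, and the reply format assumes each tag sits alone on its own line; none of these details are fixed by the spec.

```python
import hashlib
import re
from typing import Dict, List, Optional


def daily_digest(daily_texts: List[str]) -> str:
    """Hash the concatenated daily content so unchanged input can be skipped."""
    joined = "\n".join(daily_texts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def should_run(daily_texts: List[str], last_hash: Optional[str]) -> bool:
    """Skip when there is no content or the content hash is unchanged."""
    if not any(text.strip() for text in daily_texts):
        return False
    return daily_digest(daily_texts) != last_hash


def split_sections(llm_output: str) -> Dict[str, str]:
    """Split an LLM reply into its [MEMORY] and [DREAM] bodies."""
    parts = re.split(r"^\[(MEMORY|DREAM)\]\s*$", llm_output, flags=re.MULTILINE)
    # re.split with one capture group yields [prefix, tag, body, tag, body, ...].
    return {tag: body.strip() for tag, body in zip(parts[1::2], parts[2::2])}


reply = "[MEMORY]\n- prefers pnpm\n\n[DREAM]\nQuiet day; one new preference.\n"
sections = split_sections(reply)
```

The no-fabrication constraint is a prompt-level rule and cannot be enforced by parsing alone; this sketch only covers the mechanical parts.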
### Requirement: Context summary injection
The system SHALL support injecting daily summary text into the active message list for context continuity.
#### Scenario: Context summary callback
- **WHEN** a daily memory flush completes
- **THEN** an optional callback SHALL be invoked with the daily summary text
- **THEN** the caller MAY inject the summary into the message list for continued context awareness

@@ -0,0 +1,86 @@
## 1. Module Scaffold & Data Schemas
- [x] 1.1 Create `memory-engine/` directory tree with all subdirectories and `__init__.py` files
- [x] 1.2 Create `config.py` with `MemoryConfig` pydantic model (embedding, chunking, search, tier settings)
- [x] 1.3 Create `core/schemas.py` with `MemoryChunk`, `SearchResult`, `Fact`, `RunningSummary`, `ExtractedMemory` data classes
- [x] 1.4 Create `utils/token_counter.py` with tiktoken + char-fallback token counting
- [x] 1.5 Create `utils/namespace.py` with `NamespaceTemplate` for runtime namespace resolution
- [x] 1.6 Create `utils/chunker.py` with `TextChunker` (line-based, overlapping, configurable max_tokens)
## 2. Core Store: SQLite + FTS5 + Vector
- [x] 2.1 Create `core/store.py` with `MemoryStore` — SQLite init with WAL mode, FTS5 tables, integrity checks
- [x] 2.2 Implement `create_chunks_table()` with embedding BLOB storage, indexes, meta table
- [x] 2.3 Implement `create_fts5_tables()` with standard unicode61 tokenizer + trigram tokenizer for CJK
- [x] 2.4 Implement FTS5 triggers (AFTER INSERT/UPDATE/DELETE) for auto-sync
- [x] 2.5 Implement `save_chunk()` / `save_chunks_batch()` with SQLite UPSERT (INSERT ... ON CONFLICT DO UPDATE)
- [x] 2.6 Implement `delete_by_path()`, `get_file_hash()`, `update_file_metadata()`
- [x] 2.7 Implement FTS5 self-healing: `_fts5_state_inconsistent()`, `_fts5_shadow_corrupt()`, `reset_fts5()`
- [x] 2.8 Implement embedding encode/decode (float32 BLOB via numpy, struct fallback, legacy JSON fallback)
- [x] 2.9 Implement `get_stats()` and `close()` methods
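Task 2.8 mentions a `struct` fallback for float32 BLOB storage. A minimal sketch of what that fallback might look like is shown here; the function names and the little-endian byte order are assumptions, since the task list does not specify them.

```python
import struct
from typing import List


def encode_embedding(vector: List[float]) -> bytes:
    """Pack floats as little-endian float32, 4 bytes per dimension."""
    return struct.pack(f"<{len(vector)}f", *vector)


def decode_embedding(blob: bytes) -> List[float]:
    """Inverse of encode_embedding; the dimension is inferred from the length."""
    if len(blob) % 4:
        raise ValueError("BLOB length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


# Values chosen to be exactly representable in float32.
blob = encode_embedding([0.5, -1.0, 0.25])
roundtrip = decode_embedding(blob)
```

Values that are not exactly representable in float32 will come back slightly rounded, so comparisons in real tests would need a tolerance.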
## 3. Hybrid Search
- [x] 3.1 Implement `search_vector()` — numpy matrix cosine similarity with argpartition top-K (pure-Python fallback)
- [x] 3.2 Implement FTS5 keyword search with BM25 scoring: `_search_fts5()`, `_search_fts5_trigram()`
- [x] 3.3 Implement `_search_like()` — CJK (1+ chars) + ASCII word (3+ chars) with dynamic scoring
- [x] 3.4 Implement `search_keyword()` — three-tier strategy (FTS5 → trigram FTS5 → LIKE)
- [x] 3.5 Implement BM25 rank to score conversion (`0.3 + 0.69 * abs(r)/(1+abs(r))`)
- [x] 3.6 Create `core/hybrid_search.py` with weighted merge (vector_weight, keyword_weight) + temporal decay
- [x] 3.7 Implement `_compute_temporal_decay(path, half_life=30)` — exponential decay for dated files
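The scoring formulas in tasks 3.5 to 3.7 can be written out as a small sketch. The BM25 conversion is taken from task 3.5 as written; the decay function takes an age in days rather than a path (the real `_compute_temporal_decay` presumably derives the age from the dated file name), and the default weights of 0.7 and 0.3 are illustrative assumptions.

```python
def bm25_rank_to_score(rank: float) -> float:
    """Map an FTS5 bm25() rank to a score in [0.3, 0.99).

    FTS5 reports better matches as more negative ranks, hence abs().
    """
    r = abs(rank)
    return 0.3 + 0.69 * r / (1.0 + r)


def temporal_decay(age_days: float, half_life: float = 30.0) -> float:
    """Exponential decay: the multiplier halves every `half_life` days."""
    return 0.5 ** (max(age_days, 0.0) / half_life)


def merge_score(vector: float, keyword: float, age_days: float,
                vector_weight: float = 0.7, keyword_weight: float = 0.3) -> float:
    """Weighted merge of both signals, then decay for dated files."""
    combined = vector_weight * vector + keyword_weight * keyword
    return combined * temporal_decay(age_days)
```

With a 30-day half-life, a 60-day-old daily file contributes a quarter of the score of an identical match written today.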
## 4. LLM Memory Extraction
- [x] 4.1 Create `extraction/prompts.py` with memory update system prompt (structured JSON output)
- [x] 4.2 Create `extraction/manager.py` with `MemoryUpdater` — LLM fact extraction from conversation
- [x] 4.3 Implement `_prepare_update_prompt()` — loads current memory, formats conversation, builds prompt
- [x] 4.4 Implement `_parse_memory_update_response()` — JSON extraction from LLM response (handles fences/thinking)
- [x] 4.5 Implement `_apply_updates()` — update user/history sections, add/remove facts, enforce max_facts
- [x] 4.6 Implement `create_fact()`, `update_fact()`, `delete_memory_fact()` CRUD operations
- [x] 4.7 Implement content deduplication (casefold comparison) and confidence threshold filtering
- [x] 4.8 Implement upload-mention scrubbing from memory data
## 5. Tiered Consolidation
- [x] 5.1 Create `tiers/daily.py` with `DailyTier` — lazy file creation, append-only writes with timestamped headers
- [x] 5.2 Create `tiers/context.py` with `ContextTier` — short-term context window management with RunningSummary
- [x] 5.3 Create `tiers/core.py` with `CoreTier` — wraps MemoryStore, manages MEMORY.md file
- [x] 5.4 Create `tiers/__init__.py` with `flush_messages()` — context summarization + daily file append
- [x] 5.5 Implement incremental summarization (initial summary, extend existing, RunningSummary tracking)
- [x] 5.6 Create `background/deep_dream.py` with `DeepDream` — LLM-based MEMORY.md consolidation
- [x] 5.7 Implement Deep Dream dedup (content-hash check), dream diary writing, empty-output guard
## 6. Background Processing Queue
- [x] 6.1 Create `background/queue.py` with `MemoryUpdateQueue` — thread-safe, debounced, keyed by (thread, user, agent)
- [x] 6.2 Implement `add()` with debounce timer reset, `add_nowait()` for immediate processing
- [x] 6.3 Implement timer-triggered processing with rate limiting between updates
- [x] 6.4 Implement signal detection: `detect_correction()`, `detect_reinforcement()` with pattern matching
- [x] 6.5 Create `background/__init__.py` with `flush_messages()` — dedup + background thread LLM summarization
- [x] 6.6 Support `context_summary_callback` for in-context injection of summaries
## 7. Agent Tools & Public API
- [x] 7.1 Create `tools/manage.py` with `manage_memory()` — create/update/delete facts with namespace isolation
- [x] 7.2 Create `tools/search.py` with `search_memory()` — hybrid search with query/filter/limit/offset
- [x] 7.3 Implement `__init__.py` with `MemoryEngine` unified class: `manage()`, `search()`, `flush()`, `dream()`, `format_for_injection()`
- [x] 7.4 Implement `format_for_injection()` — token-budgeted memory string for system prompts
- [x] 7.5 Thread-safe singleton pattern for `MemoryUpdateQueue` and `MemoryStore`
## 8. Embedding Provider Interface
- [x] 8.1 Create `embedding/base.py` with `EmbeddingProvider` ABC — `embed_query()`, `embed_batch()`
- [x] 8.2 Create `embedding/openai.py` with `OpenAIEmbeddingProvider` implementation
- [x] 8.3 Implement `EmbeddingCache` — per-session cache keyed by (provider, model, text_hash)
- [x] 8.4 Create `embedding/__init__.py` with `create_embedding_provider()` factory
## 9. Integration Tests
- [x] 9.1 Test short-term context summarization with token budget enforcement
- [x] 9.2 Test long-term fact extraction with LLM mock
- [x] 9.3 Test hybrid search: vector-only, keyword-only, and combined
- [x] 9.4 Test tiered consolidation: flush → daily file → Deep Dream → MEMORY.md rewrite
- [x] 9.5 Test background queue: debounce, dedup, async execution
- [x] 9.6 Test namespace isolation: scoped searches across tenants
- [x] 9.7 Test graceful degradation: no embeddings → keyword-only, no numpy → Python fallback
- [x] 9.8 Test memory tools: create/update/delete/search round-trip

@@ -0,0 +1,76 @@
## Context
boocode currently has no persistent session management for its agents (the persona agents in data/AGENTS.md). When a session is interrupted, there's no recoverable audit trail, no way to detect repeated mistakes, and no mechanism to enforce learned behavioral guidelines across sessions.
audit-harness provides: hooks (PostToolUse buffer→Stop flush→UserPromptSubmit injection), skills (/start→/end→/recover→/report-daily), and a Python core (AuditContext) with unified index schema.
Parlant provides: GuidelineDocumentStore (versioned, tag/label filtered), JourneyStore (graph-based SOPs), and JourneyGuidelineProjection (node→guideline auto-conversion).
This design ports the high-value subset of both into boocode as agent-facing skills and a TypeScript core library.
## Goals / Non-Goals
**Goals:**
- Define `.boo/runs/` directory convention with auto-creation and `.gitignore`
- Port /start, /end, /recover, /report-daily as boocode skills (markdown)
- Port user_correction record format and detection
- Port GuidelineDocumentStore from Parlant as TypeScript service
- Port Journey → guideline auto-projection (node→guideline conversion)
- Implement guideline find_guideline() by content match
- All features opt-in, zero breaking changes
**Non-Goals:**
- AuditContext full Python class port (environment snapshots, anomaly lambdas)
- Hooks implementation (PostToolUse/Stop/UserPromptSubmit) — separate batch
- Parlant's vector DB / embedder infrastructure
- Parlant's relationship resolver (ARQ)
- Web UI for guideline management — CLI/skill-only
## Decisions
### Decision 1: Skill-based commands over CLI tools
**Choice**: Implement /start, /end, /recover, /report-daily as skill markdown files in `data/skills/boocode/`, following the existing `committing-changes` pattern.
**Rationale**: boocode agents already load skills from this path. Adding a new skill is zero code change to the agent runtime — just a new markdown file with YAML frontmatter. CLI tools would require new API routes, dispatch logic, and frontend work.
**Alternatives considered**: Fastify API routes (rejected — too heavy for agent-facing commands), shell scripts (rejected — platform-specific).
### Decision 2: JSONL buffer + index.json
**Choice**: Port audit-harness's file layout exactly: `audit_buffer.jsonl` for live writes, `audit_pending.jsonl` for agent-authored AUDIT blocks, per-session `audit_trail.jsonl` for flushed records, `index.json` for cross-session metadata.
**Rationale**: audit-harness has used this layout in production. JSONL is grep-able, append-only, and needs no DB connection.
**Alternatives considered**: Postgres (rejected — agents don't all have DB access), SQLite (rejected — adds a native dep).
### Decision 3: Timestamp-based session IDs
**Choice**: `adhoc_YYYYMMDD_HHMM` format for session IDs, matching audit-harness pattern.
**Rationale**: Human-readable and sortable. The format has minute resolution, so two sessions collide only if they start within the same minute.
### Decision 4: File-based GuidelineStore
**Choice**: Port GuidelineDocumentStore's abstract interface (create/list/read/update/delete/find) but use filesystem JSON storage instead of Parlant's DocumentDatabase.
**Rationale**: boocode doesn't have Parlant's document DB abstraction. A JSON-file store is simpler and sufficient for single-user operation. The interface stays the same, so a future Postgres backend can be swapped in.
**Alternatives considered**: Postgres backend (rejected — adds coupling), in-memory only (rejected — no persistence).
### Decision 5: Journey → guideline projection as pure function
**Choice**: Port `JourneyGuidelineProjection` as a pure function (not a class). Takes a Journey + its nodes/edges, returns Guideline[].
**Rationale**: The projection logic (DFS traversal, node→guideline conversion, edge metadata grafting) is deterministic and has no side effects. A pure function is simpler to test and compose.
**Alternatives considered**: Class with JourneyStore dependency (rejected — unnecessary indirection for our use case).
## Risks / Trade-offs
- **[Risk]** Skills grow stale if agent runtime doesn't load them → **Mitigation**: Test with existing agent by loading skill explicitly.
- **[Risk]** JSONL file contention from multiple agents → **Mitigation**: Single-user homelab. Acceptable.
- **[Risk]** GuidelineStore JSON files grow unbounded → **Mitigation**: TBD — add compaction/archival in future batch.
- **[Trade-off]** File storage is simple but doesn't scale to multi-user → Acceptable for single-user.
## Migration / Rollout
1. Create openspec spec files (proposal/design/tasks/specs)
2. Create `.boo/runs/` directory structure (service)
3. Create 4 skill files in `data/skills/boocode/`
4. Create core AuditContext TypeScript service
5. Create GuidelineStore + Journey service
6. Create user_correction utilities
7. Update data/AGENTS.md with new agents
8. Test with skill invocation

@@ -0,0 +1,23 @@
## Why
The audit-harness (hooks + skills + AuditContext) and Parlant (GuidelineStore + Journey engine) provide two proven patterns for agent session management. audit-harness solves context-window loss through persistent audit trails, graded recovery, and structured commands (/start → /end → /recover → /report-daily). Parlant solves behavioral consistency through a versioned guideline document store with tag/label-based retrieval, journey-based SOPs, and backtrack detection.
Porting these patterns into boocode's agent ecosystem gives every agent working in this repo persistent session management, cross-session user correction awareness, and behavioral guideline enforcement — without building any of it from scratch.
## What Changes
### New Capabilities
- **Data Directory Convention**: `.boo/runs/` directory with buffer files, session dirs, `.current_session` handshake, unified `index.json`. `AUDIT_DOT_DIR` env var for platform override.
- **Session Lifecycle Commands**: `/start` creates named audit sessions with auto-recovery (L0+L2). `/end` flushes buffers, runs integrity checks, generates `session_summary.md`. `/recover` performs graded context loading (L0-L3). `/report-daily` aggregates all sessions into a 7-section report; `/report-daily review` also runs morning self-review.
- **User Correction Tracking**: Structured `user_correction` records with `original_claim`/`correction`/`principle_extracted`/`persisted_to`. Auto-detected on `/end`. Correction-as-precedent enforcement when agent actions contradict prior corrections.
- **Behavioral Guidelines Store**: Versioned GuidelineDocumentStore ported from Parlant with condition+action+description content model, tag/label filtering, and content-based `find_guideline()`. Journey → guideline auto-projection (SOP nodes → guidelines with follow-up edges). Journey backtrack detection batch.
### Dependencies
- Existing audit-harness patterns (audit-context.py, hooks, skills) serve as the reference implementation.
- Parlant's GuidelineStore (guidelines.py) and JourneyStore (journeys.py) serve as the reference implementation.
- No new external services. File-based JSONL storage (audit-harness pattern).

@@ -0,0 +1,80 @@
# Behavioral Guidelines Store — Spec
## Guideline Entity
```typescript
interface GuidelineContent {
condition: string; // When...
action: string | null; // Then...
description: string | null;
}
interface Guideline {
id: string;
creationUtc: string;
content: GuidelineContent;
enabled: boolean;
tags: string[];
labels: string[];
metadata: Record<string, unknown>;
criticality: "low" | "medium" | "high";
title: string | null;
priority: number;
}
```
## GuidelineDocumentStore
File-based JSON store at `.boo/guidelines/`. Versioned with migration support.
Methods:
- `createGuideline(condition, action?, description?, ...) → Guideline`
- `listGuidelines(tags?, labels?) → Guideline[]`
- `readGuideline(id) → Guideline`
- `updateGuideline(id, params) → Guideline`
- `deleteGuideline(id) → void`
- `findGuideline(content: {condition, action?}) → Guideline`
Version migration chain (port from Parlant v0.1.0 → v0.11.0):
- v0.1.0 → v0.2.0: add enabled field
- v0.2.0 → v0.3.0: remove guideline_set (migration script only)
- v0.3.0 → v0.4.0: add optional action, description, metadata
- v0.4.0 → v0.5.0: description as optional
- v0.5.0 → v0.6.0: add criticality (default "medium")
- v0.6.0 → v0.7.0: add composition_mode (optional)
- v0.7.0 → v0.8.0: add track (default true)
- v0.8.0 → v0.9.0: add labels (default empty)
- v0.9.0 → v0.10.0: add priority (default 0)
- v0.10.0 → v0.11.0: add title (default null)
## Tag & Label Filtering
- `listGuidelines({tags: ["tag1"]})` → guidelines with ANY of the specified tags
- `listGuidelines({labels: ["label1"]})` → guidelines with ALL specified labels (subset match)
- Combined: both filters apply (intersection)
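The ANY-of-tags and ALL-of-labels semantics are easy to get backwards, so a small sketch may help. It is written in Python for brevity even though the store itself targets TypeScript; the trimmed `Guideline` dataclass and the `list_guidelines` function are illustrative stand-ins, not the ported API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Guideline:
    id: str
    tags: List[str] = field(default_factory=list)
    labels: List[str] = field(default_factory=list)


def list_guidelines(items: List[Guideline],
                    tags: Optional[List[str]] = None,
                    labels: Optional[List[str]] = None) -> List[Guideline]:
    """Tags match on ANY overlap; labels require ALL requested labels."""
    result = []
    for g in items:
        if tags and not set(tags) & set(g.tags):
            continue  # none of the requested tags present
        if labels and not set(labels) <= set(g.labels):
            continue  # at least one requested label missing
        result.append(g)
    return result


items = [
    Guideline("g1", tags=["style"], labels=["ts", "strict"]),
    Guideline("g2", tags=["safety"], labels=["ts"]),
    Guideline("g3"),
]
```

An omitted or empty filter is treated as "no constraint" here, which is an assumption about the intended behavior.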
## Journey → Guideline Projection
Port of Parlant's `JourneyGuidelineProjection.project_journey_to_guidelines()`:
- DFS traversal of Journey nodes from root
- Each (edge, node) pair → one Guideline
- Edge condition becomes guideline condition
- Node action becomes guideline action
- Edge/node metadata merged into guideline metadata with journey_node key
- follow_ups list populated with downstream guideline IDs
- A visited set prevents infinite loops when the journey graph contains cycles
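The projection rules above can be sketched as a pure function. This is an illustration under simplifying assumptions, not Parlant's implementation: nodes are reduced to `{node_id: action}`, edges to `(source, target, condition)` tuples, guideline IDs are synthesized as `edge<N>:<node>`, and the traversal is depth-first with an explicit stack.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source, target, condition)


def project_journey(root: str, nodes: Dict[str, str], edges: List[Edge]) -> Dict[str, dict]:
    """Turn each (edge, target node) pair into one guideline dict."""
    outgoing: Dict[str, list] = {}
    for idx, (src, dst, cond) in enumerate(edges):
        outgoing.setdefault(src, []).append((idx, dst, cond))

    guidelines: Dict[str, dict] = {}
    visited = set()
    stack = list(reversed(outgoing.get(root, [])))
    while stack:
        idx, node, cond = stack.pop()  # LIFO pop gives depth-first order
        if idx in visited:
            continue  # each edge is projected once, so cycles terminate
        visited.add(idx)
        children = outgoing.get(node, [])
        guidelines[f"edge{idx}:{node}"] = {
            "condition": cond,        # edge condition
            "action": nodes[node],    # node action
            "metadata": {"journey_node": {
                "follow_ups": [f"edge{i}:{d}" for i, d, _ in children],
            }},
        }
        stack.extend(reversed(children))
    return guidelines


nodes = {"ask": "Ask for the repo", "scan": "Run the scan", "report": "Summarize"}
edges = [
    ("root", "ask", "user wants an audit"),
    ("ask", "scan", "repo is known"),
    ("scan", "report", "scan finished"),
    ("scan", "ask", "repo path invalid"),  # cycle back to an earlier node
]
projected = project_journey("root", nodes, edges)
```

Keying the visited set on the edge rather than the node lets the same node appear in several guidelines (once per incoming edge) while still guaranteeing termination.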
## Journey Backtrack Detection
```typescript
interface BacktrackCheck {
journeyId: string;
currentNodeId: string;
previousNodeId: string;
isBacktrack: boolean;
recommendation: string | null;
}
```
Scans the edge list for source→target relationships. If the agent's current step has an edge back to a previously visited node (and that node is not in a forward path from current), it's flagged as a backtrack regression.

@@ -0,0 +1,88 @@
# Session Lifecycle Commands — Spec
## Overview
Four agent-invocable commands that manage audit session lifecycle. Each command is a skill markdown file loaded by the agent on invocation.
## /start
```
/start "task description"
```
Creates a named audit session:
1. Generate `session_id = adhoc_YYYYMMDD_HHMM`
2. `mkdir -p .boo/runs/{session_id}`
3. Write `session.json`:
```json
{
"session_id": "adhoc_20260320_1400",
"task": "task description",
"start_time": "2026-03-20T14:00:00Z",
"status": "in_progress",
"expected_record_types": ["data", "change", "conversation"]
}
```
4. Write `.boo/runs/.current_session` containing session_id (handshake for hooks)
5. Run context recovery:
- L0: read `index.json` → last 5 entries
- L2: scan recent audit_trail.jsonl for `user_correction` records
6. Output recovery summary: recent activity, corrections, priorities
7. Check for unfinished sessions: scan for `status: "in_progress"` sessions, prompt user
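Steps 1 to 4 and step 7 of `/start` can be sketched as below. The command itself is a skill markdown file, so this Python is only an illustration of the file operations involved; `start_session` and `unfinished_sessions` are hypothetical names, and `now` is assumed to be a UTC timestamp.

```python
import datetime as dt
import json
import tempfile
from pathlib import Path
from typing import List


def start_session(runs_dir: Path, task: str, now: dt.datetime) -> str:
    """Create the session directory, session.json, and the handshake file."""
    session_id = f"adhoc_{now:%Y%m%d_%H%M}"
    session_dir = runs_dir / session_id
    session_dir.mkdir(parents=True, exist_ok=True)
    session = {
        "session_id": session_id,
        "task": task,
        "start_time": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "status": "in_progress",
        "expected_record_types": ["data", "change", "conversation"],
    }
    (session_dir / "session.json").write_text(json.dumps(session, indent=2), encoding="utf-8")
    # Handshake for hooks: a single file naming the active session.
    (runs_dir / ".current_session").write_text(session_id, encoding="utf-8")
    return session_id


def unfinished_sessions(runs_dir: Path, current_id: str) -> List[str]:
    """Step 7: other sessions whose session.json still says in_progress."""
    stale = []
    for meta in sorted(runs_dir.glob("*/session.json")):
        data = json.loads(meta.read_text(encoding="utf-8"))
        if data.get("status") == "in_progress" and data.get("session_id") != current_id:
            stale.append(data["session_id"])
    return stale


runs = Path(tempfile.mkdtemp()) / ".boo" / "runs"
utc = dt.timezone.utc
first = start_session(runs, "audit parser", dt.datetime(2026, 3, 19, 9, 5, tzinfo=utc))
second = start_session(runs, "task description", dt.datetime(2026, 3, 20, 14, 0, tzinfo=utc))
```

Because the ID has minute resolution, starting two sessions in the same minute would reuse the same directory in this sketch.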
## /end
```
/end
```
Ends the current audit session:
1. Read `.current_session` → get session_id
2. Collect remaining buffer data from `audit_buffer.jsonl` + `audit_pending.jsonl`
3. Append to `audit_trail.jsonl`
4. Clear buffer files
5. Extract `user_correction` records from audit_trail
6. Run integrity checks:
- Has records? (>0 audit_trail lines)
- All files covered? (changes in audit_trail match modified files)
- Corrections persisted? (persisted_to is non-empty)
7. Generate `session_summary.md`
8. Update `session.json` status=completed, end_time
9. Clear `.current_session`
## /recover
```
/recover # L0+L1+L2
/recover full # L3 (full audit_trail)
/recover {session_id} # load specific session
```
Graded context loading:
- L0 (~200t): index.json → last 5 entries (id, task, status)
- L1 (~500t): .current_session + session.json + last 3 audit_trail entries
- L2 (~1000t): scan all audit_trails for user_correction records + conclusions + daily report §4+§6
- L3 (~3000t): full audit_trail.jsonl + audit_pending.jsonl
## /report-daily
```
/report-daily # today
/report-daily 20260319 # specific date
/report-daily review # + morning self-review
```
7-section report:
1. Task overview (from index.json)
2. Operation stats (tool counts)
3. Change records (file modifications)
4. User feedback & corrections
5. Anomaly alerts
6. Backlog tracking
7. Integrity summary
`review` variant: adds morning self-review with trend analysis and recommended priorities.

@@ -0,0 +1,42 @@
# User Correction Tracking — Spec
## Record Schema
```typescript
interface UserCorrectionRecord {
record_type: "conversation";
action_type: "user_correction";
priority: "critical_for_recovery";
timestamp: string; // ISO 8601
original_claim: string; // what the agent said that was wrong
correction: string; // what the user corrected it to
principle_extracted: string; // general principle derived from this correction
persisted_to: string[]; // files where this correction was documented
}
```
## Storage
User correction records are stored inline in `audit_trail.jsonl` as regular entries. They are extracted during `/end` and surfaced during `/recover` L2 loading.
## Detection
During `/end`, scan the session's `audit_trail.jsonl` for entries matching:
- `action_type === "user_correction"`
Also scan `audit_pending.jsonl` for any pending correction records not yet flushed.
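A minimal sketch of the detection pass, assuming one JSON object per line. The function name `scan_corrections` is hypothetical, and silently skipping a malformed line (for example a torn final write) is an assumption rather than something the spec requires.

```python
import json
from typing import List


def scan_corrections(jsonl_text: str) -> List[dict]:
    """Return every entry whose action_type is "user_correction"."""
    found = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a torn or partial line
        if isinstance(entry, dict) and entry.get("action_type") == "user_correction":
            found.append(entry)
    return found


trail = "\n".join([
    json.dumps({"record_type": "change", "action_type": "edit", "file": "a.ts"}),
    json.dumps({
        "record_type": "conversation",
        "action_type": "user_correction",
        "original_claim": "uses npm",
        "correction": "uses pnpm",
        "principle_extracted": "check the lockfile first",
        "persisted_to": [],
    }),
    '{"record_type": "conv',  # deliberately truncated line
])
corrections = scan_corrections(trail)
# Feeds the "/end" integrity check: corrections not yet written anywhere.
unpersisted = [c for c in corrections if not c.get("persisted_to")]
```

The same function can be run over `audit_pending.jsonl` and the results concatenated before the integrity check.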
## persisted_to Field
When a correction is written to CLAUDE.md, coding standards, or other documentation, the file paths are recorded in `persisted_to[]`. This is populated manually by the agent when it persists the correction.
## Correction-as-Precedent
When an agent considers an action that contradicts a known `user_correction` record, the action is flagged with a warning. The agent should:
1. Identify the contradiction (which rule is being violated)
2. Surface the relevant correction record (with timestamp and original context)
3. Propose an alternative that respects the correction
4. If the contradiction is intentional, document why as a new correction
Detection logic: before each significant action, the agent scans loaded user_correction records from the current recovery context and checks if the proposed action matches any known `original_claim` pattern.

@@ -0,0 +1,39 @@
# port-audit-parlant-patterns — Implementation Complete
## boocontext (TypeScript) — src/audit/
- [x] 1. Data Dir: `dotDir()`, `findRunsDir()`, `ensureRunsDir()` with .gitignore + AUDIT_DOT_DIR
- [x] 2. Core Types: `RecordEntry`, `CompactRecord`, `Manifest`, `UserCorrectionRecord`, `SessionJson`, `SessionSummary`
- [x] 3. Hash Utilities: `hashFile()`, `hashBytes()`, `hashDir()` via Node crypto SHA256
- [x] 4. Anomaly: `AlertRule`, `Anomaly`, `checkAnomalies()` with default rules
- [x] 5. AuditContext: `createBatchContext()` -> `record()` -> `recordCompact()` -> `finalize()` -> `save()` (writes manifest, trail, compact, anomalies, checksums, index)
- [x] 6. AmbientContext: `AsyncLocalStorage` wrapper — `runWithAmbient()`, `getAmbientSession()`, `requireAmbientSession()`
- [x] 7. Guideline Model: `GuidelineContent`, `Guideline`, `GuidelineStore`, `InMemoryGuidelineStore` with CRUD + tag/label filters
- [x] 8. Guideline Matching: `MatchingContext`, `MatchingBatch` (Observational, Actionable, PreviouslyApplied, Disambiguation, ResponseAnalysis, LowCriticality), `GenericGuidelineMatchingStrategy`, retry policy
- [x] 9. ARQ Generation: `SchematicGenerator`, typed output schemas per batch, `GenerationInfo` tracking, `createExecutionPlan()` with batch-parallel
- [x] 10. Relationship Model: `RelationshipKind` (DEPENDS_ON, PRIORITIZES, ENTAILS, TAG_ALL, TAG_PRIORITIZES), `FileRelationshipStore`
- [x] 11. Relational Resolver: 4-step iteration loop (deps -> prioritization -> priority -> entailment), `MAX_ITERATIONS=100`, `ResolutionKind` output
- [x] 12. Graded Recovery: `recoverL0()` through `recoverL4()`, `scanUserCorrections()`, `formatRecoveryReport()` with source attribution
- [x] 13. User Corrections: `detectCorrections()`, `addPersistedTarget()`, `findRelatedCorrections()`, `checkContradiction()`
- [x] 14. Index: `readIndex()`, `writeIndex()` with atomic `.tmp` + `renameSync`
- [x] 15. MCP Tools: `boocontext_audit_index` + `boocontext_audit_recover` registered in mcp-server.ts
- [x] 16. Typecheck: `npx tsc --noEmit` passes clean
## codecontext (Go) — internal/audit/ + internal/mcp/
- [x] 1. Record Types: `RecordEntry`, `CompactRecord`, `RecordStep`/`RecordAction` enums (pre-existing)
- [x] 2. Index: `UpdateIndexEntry()` with idempotent upsert, `IndexEntry` schema, atomic `.tmp` + `os.Rename()` (pre-existing)
- [x] 3. Hashchain: `HashFile()`, `HashBytes()`, `HashDir()`, `VerifyHashchain()` with `HashchainVerificationError` (pre-existing)
- [x] 4. Directory: `DotDir()`, `RunsDir()`, `EnsureRunsDir()` with .gitignore + `AUDIT_DOT_DIR` (pre-existing)
- [x] 5. Anomaly: `AlertRule`, `Anomaly`, `Manifest` types + `CheckAnomalies()` with condition evaluation (pre-existing stub, now evaluates total_records/error_rate/hash conditions)
- [x] 6. GenerateChecksums: per-file SHA256 manifest (pre-existing)
- [x] 7. Session Lifecycle: `SessionLifecycleManager` with `StartSession(task)`, `EndSession()`, `CurrentSession()` — creates adhoc session, writes .current_session, updates index
- [x] 8. Trail Management: `TrailManager` with `AppendToBuffer()`, `PendingAppend()`, `AppendToTrail()`, `ReadTrail()`, `FlushBuffer()` — auto-generates session if none active
- [x] 9. MCP Audit Tools: `codecontext_audit_start`, `codecontext_audit_end`, `codecontext_audit_status` in `internal/mcp/audit_tools.go`
- [x] 10. MCP Middleware Hooks: `recordAuditBuffer()` in server struct, buffer after tool calls, flush on "ready"
- [x] 11. Build: `go build ./...` passes clean
## boocode (Node.js) — apps/coder/src/services/
- [x] 1. Session Service (`audit-session.ts`): `startSession()` with L0+L2 recovery, `endSession()` with integrity checks + session_summary.md, `recoverSession()` L0-L3 graded loading, `generateDailyReport()` 7-section report
- [x] 2. Correction Service (`correction-service.ts`): `recordCorrection()`, `scanForCorrections()`, `checkContradiction()`, `markPersisted()` — JSON store at `.boo/corrections/`
- [x] 3. Guideline Service (`guideline-service.ts`): `createGuideline()`, `listGuidelines()` with tag/label filters, version migration chain (v0.1.0->v0.11.0), `projectJourneyToGuidelines()` DFS, `checkBacktrack()` — JSON store at `.boo/guidelines/`
- [x] 4. Skill commands: `command-start/SKILL.md`, `command-end/SKILL.md`, `command-recover/SKILL.md`, `command-report-daily/SKILL.md`
- [x] 5. Typecheck: `pnpm -C apps/coder typecheck` passes clean

@@ -1,4 +0,0 @@
# v1.13.12-skills-audit
**Status:** Shipped. Archived.

@@ -1,4 +0,0 @@
# v1.13.15-codecontext-synth
**Status:** Shipped. Archived.

@@ -1,4 +0,0 @@
# v1.13.17-cross-repo-reads
**Status:** Shipped. Archived.

@@ -1,4 +0,0 @@
# v1.13.18-codecontext-file-path
**Status:** Shipped. Archived.

@@ -1,4 +0,0 @@
# v1.13.20-drop-legacy-cols
**Status:** Shipped. Archived.

@@ -1,4 +0,0 @@
# v1.14-outer-loop
**Status:** Shipped. Archived.

@@ -1,4 +0,0 @@
# v1.14.1-mcp-poc
**Status:** Shipped. Archived.

@@ -1,4 +0,0 @@
# v1.14.x-html-artifact-panes
**Status:** Shipped. Archived.

@@ -1,4 +0,0 @@
# v1.15-mcp-multi
**Status:** Shipped. Archived.

@@ -1,4 +0,0 @@
# v2.0-boocoder
**Status:** Shipped. Archived.

@@ -1,5 +0,0 @@
# v2.2-paseo-providers
**Status:** Shipped (`v2.2-paseo-providers`, `v2.2.1-pane-scoped-chats`). Archived.
Follow-up fixes shipped as `v2.2.1-pane-scoped-chats` (pane-scoped chats, tool UI, WS delta, inference payload).

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-06-07

@@ -0,0 +1,76 @@
## Context
boocontext (TypeScript code scanner, MCP server) and codecontext (Go code graph engine, MCP server) currently lack persistent audit trails for their operations. When a scan or graph analysis is interrupted, context is lost — tool calls have no recoverable log, session state disappears, and there's no mechanism to detect repeated mistakes or anomalies across runs.
The audit-harness repo provides a production-tested 3-layer audit enforcement system. Its hooks (PostToolUse, Stop, UserPromptSubmit) intercept every tool call, buffer to JSONL, flush to session trails, and inject session context on every user turn. Its Python core library (AuditContext, 600 lines) provides environment snapshots, SHA256 verification, configurable anomaly detection, and unified index management.
This design ports the **hook audit pipeline** into both tools as MCP server middleware, and the **session lifecycle** as boocode commands.
## Goals / Non-Goals
**Goals:**
- Port PostToolUse pattern as MCP middleware → auto-log every tool call to JSONL buffer
- Port Stop pattern as MCP middleware → flush buffer to session trail + update index on completion
- Port UserPromptSubmit pattern as MCP middleware → inject session context + CRITICAL alerts on each request
- Port /start, /end, /recover, /report-daily as boocode commands
- Port AuditContext's unified index schema for cross-session tracking
- All patterns opt-in via configuration, zero breaking changes
**Non-Goals:**
- Full AuditContext Python class port (environment snapshots, anomaly lambdas) — Phase 2
- Parlant's relationship resolver (ARQ) — separate change
- Codex/Claude-specific hook formats — only MCP middleware abstraction
## Decisions
### Decision 1: MCP Middleware over Hook Scripts
**Choice**: Implement audit as MCP server middleware (TypeScript for boocontext, Go for codecontext), not shell hooks.
**Rationale**: Shell hooks (like audit-harness uses) are platform-specific (Claude Code vs Codex). MCP middleware is framework-agnostic — any MCP client automatically gets audit trails. Both boocontext and codecontext already have MCP servers.
**Alternatives considered**: Shell hooks (rejected — platform-specific), Python subprocess (rejected — dependency overhead).
### Decision 2: JSONL Buffer + File Rotation
**Choice**: Port audit-harness's JSONL buffer pattern exactly — append-only JSONL with size-limited rotation (1MB per buffer file).
**Rationale**: JSONL is grep-able, pipeable, compressible, and append-only. The 1MB limit prevents unbounded growth of any single buffer file. This is the most battle-tested pattern in audit-harness.
**Alternatives considered**: SQLite (rejected — adds DB dependency for a log), structured logging (rejected — not designed for session replay).
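A sketch of the append-with-rotation pattern from Decision 2. It is written in Python for brevity although the middleware targets TypeScript and Go; the function name, the `.1` suffix, keeping only one previous generation, and reading "1MB" as 1,000,000 bytes are all assumptions.

```python
import json
import os
import tempfile
from pathlib import Path

MAX_BUFFER_BYTES = 1_000_000  # assumed reading of "1MB"


def append_record(buffer: Path, record: dict, max_bytes: int = MAX_BUFFER_BYTES) -> None:
    """Append one JSON line, rotating first if the file would exceed max_bytes."""
    line = json.dumps(record, separators=(",", ":")) + "\n"
    if buffer.exists() and buffer.stat().st_size + len(line.encode("utf-8")) > max_bytes:
        # Rotate: the full buffer becomes "<name>.1", replacing any older one.
        os.replace(buffer, buffer.with_name(buffer.name + ".1"))
    with buffer.open("a", encoding="utf-8") as fh:
        fh.write(line)


# Usage with a tiny limit so rotation is visible after three records.
buf = Path(tempfile.mkdtemp()) / "audit_buffer.jsonl"
for n in range(3):
    append_record(buf, {"tool": "scan", "n": n}, max_bytes=50)
current_lines = buf.read_text(encoding="utf-8").splitlines()
rotated_lines = buf.with_name(buf.name + ".1").read_text(encoding="utf-8").splitlines()
```

Checking the size before writing means a record is never split across two files, at the cost of one `stat` call per append.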
### Decision 3: Session Handoff via Pointer File
**Choice**: Port the `.current_session` handshake file pattern — a single file containing the current session ID, read by all hooks.
**Rationale**: This is the simplest reliable inter-process coordination. No locks, no DB, no race conditions (atomic writes). Works across MCP middleware invocations.
**Alternatives considered**: Environment variable (rejected — not persistent across MCP calls), in-memory state (rejected — lost on restart).
### Decision 4: Unified Index Schema (JSON)
**Choice**: Port the index.json schema with `schema_version`, `entries[]` containing `{id, type, task, created, status, record_count}`.
**Rationale**: audit-harness's index schema is proven across 4 skills + 3 hooks writing to the same file. JSON with version field allows forward-compatible schema evolution.
**Alternatives considered**: SQLite (rejected — overkill for metadata index), binary format (rejected — not human-readable).
### Decision 5: Graded Context Recovery (L0-L4)
**Choice**: Port the tiered loading system — Level 0 (index, ~200t) → Level 1 (task state, ~500t) → Level 2 (corrections, ~1000t) → Level 3 (full, ~3000t) → Level 4 (cross-day, ~5000t+).
**Rationale**: Loading all context every time wastes tokens. Graded loading lets the agent fetch exactly what it needs. The token budgets are tuned to avoid context window exhaustion.
**Alternatives considered**: Load-all (rejected — token waste), agent-decides (rejected — inconsistent).
### Decision 6: Opt-in Configuration
**Choice**: All audit features disabled by default. Enabled via `audit.enabled: true` in the MCP server config.
**Rationale**: Zero behavioral change for existing users. Audit is valuable but has file I/O overhead.
**Alternatives considered**: Always-on (rejected — breaking change), env-var-only (rejected — less discoverable).
## Risks / Trade-offs
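- **[Risk]** JSONL buffer write contention under high concurrency → **Mitigation**: Open the buffer with O_APPEND and emit each record as a single small write, which generally does not interleave on local filesystems (POSIX guarantees PIPE_BUF atomicity only for pipes and FIFOs, not regular files). Use file locking where stronger guarantees are needed, e.g. on NFS.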
- **[Risk]** Disk space from unbounded audit trails → **Mitigation**: Configurable `audit.maxRetentionDays` (default 30), auto-cleanup on session end.
- **[Risk]** Performance overhead from every tool call being logged → **Mitigation**: Buffer writes are async (fire-and-forget). Benchmarked at <0.5ms per write in audit-harness.
- **[Trade-off]** File-based audit is simple but doesn't scale to distributed deployments → Acceptable for single-node code analysis tools. Cluster deployments would need a DB-backed backend in Phase 2.
## Migration Plan
1. **Phase 1a**: Add audit middleware to boocontext's MCP server (PostToolUse + Stop patterns, JSONL buffer to session dirs)
2. **Phase 1b**: Add audit middleware to codecontext's MCP server (same patterns, Go implementation)
3. **Phase 1c**: Add `/start`, `/end`, `/recover`, `/report-daily` commands to boocode
4. **Phase 2a**: Port AuditContext Python class to TypeScript (environment snapshots, hash verification, anomaly detection)
5. **Phase 2b**: Add CRITICAL anomaly alert injection (UserPromptSubmit pattern)
6. **Rollback**: Remove `audit.enabled: true` from config → zero residual effects. Delete `.audit/` directory to purge all data.
## Open Questions
- Should the buffer flush be synchronous (blocking response until written) or async (fire-and-forget, could lose last N records on crash)? audit-harness uses sync flush on Stop hook — recommend same for consistency.
- What is the `index.json` merge strategy when two processes write simultaneously? audit-harness uses atomic file replace (write `.tmp` → `os.replace`), which is adequate for a single-process MCP server.
- Token budget for context injection on UserPromptSubmit? audit-harness uses ~50 tokens for the context prefix. Recommend same default with `audit.maxContextTokens` config.
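The atomic-replace pattern mentioned for `index.json` (write `.tmp`, then replace) could look like this in Node.js. This is a sketch under the assumption that the temporary file sits next to the target on the same filesystem; `writeJsonAtomic` is a hypothetical helper name.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Write JSON to `<target>.tmp` and rename it over the target. A rename within
// one filesystem replaces the destination atomically on POSIX systems, so
// readers see either the old file or the new one, never a partial write.
function writeJsonAtomic(target: string, value: unknown): void {
  const tmp = `${target}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(value, null, 2));
  fs.renameSync(tmp, target);
}

// Usage: write twice to a scratch directory, then read the result back.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "audit-index-"));
const indexPath = path.join(dir, "index.json");
writeJsonAtomic(indexPath, { schema_version: "1.1", entries: [] });
writeJsonAtomic(indexPath, { schema_version: "1.1", entries: [{ id: "s1" }] });
const roundTrip = JSON.parse(fs.readFileSync(indexPath, "utf8"));
```

Note that two processes sharing the same `.tmp` name can still clobber each other's temporary file, so this only settles the single-writer case raised in the question above.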

View File

@@ -0,0 +1,38 @@
## Why
The audit-harness repo provides a production-tested 3-layer audit enforcement system for LLM agents — hooks (PostToolUse, Stop, UserPromptSubmit), skills (/start, /end, /recover, /report-daily), and a Python core library (AuditContext). This system solves the #1 pain point for agent-based code analysis tools: context window loss and lack of traceability. Integrating these patterns into boocontext (code scanner) and codecontext (code graph engine) will provide persistent audit trails, context recovery across sessions, and structured operation logs.
The total effort to port the high-value subset is approximately **25-30 person-days**, with the highest ROI being the buffer/flush pipeline and unified index schema for codecontext.
## What Changes
### New Capabilities
- **context-recovery**: Port audit-harness's graded context loading system (L0-L4) into boocontext for LLM context window persistence across code analysis sessions. Each analysis step is recorded to a JSONL buffer, flushed to session trails on completion, and recoverable on session resume.
- **hook-audit-pipeline**: Port the 3 audit hooks (PostToolUse → buffer, Stop → flush to trail, UserPromptSubmit → context injection) into codecontext as MCP server middleware. Every tool call in codecontext's MCP server is automatically logged to an audit trail with timestamps, summaries, and session IDs.
- **session-lifecycle**: Port the /start → /end → /recover → /report-daily skill flow into boocode's command system. /start creates named analysis sessions with context recovery, /end performs integrity checks and generates summaries, /recover restores lost context through graded loading, /report-daily generates structured work reports from audit data.
### Modified Capabilities
- boocontext's `MCP server` → Add audit middleware layer for automatic tool-call capture
- codecontext's `graph index` → Adopt audit-harness's unified index.json schema for cross-session state tracking
## Capabilities
### New Capabilities
- `context-recovery`: Graded context loading (L0-L4) from persistent audit trails. On session start or `/recover`, loads index summary (~200t), task state (~500t), user corrections (~1000t), full context (~3000t), or cross-day history (~5000t+) depending on need. Prevents rework of already-corrected mistakes.
- `hook-audit-pipeline`: Three MCP middleware hooks: PostToolUse captures tool calls (tool_name, session_id, summary) to audit_buffer.jsonl; Stop flushes buffer + pending to session trail + updates unified index; UserPromptSubmit injects current session context and CRITICAL alerts into every user turn.
- `session-lifecycle`: Named analysis sessions with lifecycle: /start creates session.json + .current_session; /end integrity-checks all records, generates session_summary.md; /recover loads context in increasing detail; /report-daily generates 7-section work reports with operation stats, anomalies, user feedback, backlog.
### Modified Capabilities
- `mcp-server` (boocontext): Add optional audit middleware that intercepts tool calls for logging
- `graph-index` (codecontext): Extend index schema to include audit-compatible session tracking
## Impact
- **boocontext**: Adds ~500 lines of Go/TS for buffer management and hook middleware. No breaking changes — audit is opt-in via configuration.
- **codecontext**: Adds ~300 lines of TypeScript for the unified index schema extension. Backward compatible.
- **boocode**: Adds 4 new commands (/start, /end, /recover, /report-daily), ~400 lines of skill definitions. No existing command breakage.
- **Dependencies**: Python 3.10+ for the audit_context.py core (boocontext already uses Python for tree-sitter); bash for hooks (existing shell tooling). No new external services.

View File

@@ -0,0 +1,42 @@
## ADDED Requirements
### Requirement: Graded context recovery from audit trails
The system SHALL support context recovery at 5 graded levels (L0-L4), each returning progressively more detailed context from persistent audit trail files.
#### Scenario: L0 recovery loads index summary
- **WHEN** user invokes `/recover` with no arguments
- **THEN** system reads `index.json` and returns the last 5 index entries (id + task + status) at approximately 200 tokens
#### Scenario: L1 recovery loads current task state
- **WHEN** user invokes `/recover` with `level: 1`
- **THEN** system reads `.current_session` for session ID, then reads `{session_id}/session.json` for task + start_time, then reads the last 3 audit_trail.jsonl records
#### Scenario: L2 recovery loads user corrections
- **WHEN** user invokes `/recover` with `level: 2`
- **THEN** system searches all `audit_trail.jsonl` files for `user_correction` records and returns them alongside L1 context at approximately 1000 tokens
#### Scenario: L3 recovery loads full session
- **WHEN** user invokes `/recover` with `level: 3` or `/recover full`
- **THEN** system returns complete `audit_trail.jsonl` and `audit_pending.jsonl` at approximately 3000 tokens
#### Scenario: L4 recovery loads cross-day history
- **WHEN** user invokes `/recover` with `level: 4`
- **THEN** system returns cross-day comparison of manifests and historical daily reports at approximately 5000+ tokens
#### Scenario: Session start auto-recovery
- **WHEN** `/start` creates a new session and historical audit data exists
- **THEN** system SHALL load L0 + L2 context automatically (index summary + user corrections) before the new session begins
### Requirement: User correction priority in context loading
The system SHALL mark `user_correction` records with highest priority in all context recovery operations. Corrections SHALL always be loaded before task state or any other context.
#### Scenario: Corrections surfaced first on recovery
- **WHEN** `/recover` loads context
- **THEN** any `user_correction` records in the audit trail SHALL appear before task summaries and SHALL be clearly labeled as "User Corrections — Review These First"
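The correction-first ordering can be sketched as a stable partition over parsed JSONL records. The record shape is an assumption: this spec names `user_correction` records but not the field that carries the record type, so `type` here is hypothetical.

```typescript
// Hypothetical record shape; the `type` field name is an assumption.
interface AuditRecord {
  timestamp: string;
  type: string;
  summary: string;
}

// Parse a JSONL document, skipping blank lines.
function parseJsonl(text: string): AuditRecord[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as AuditRecord);
}

// Stable partition: corrections first, everything else after,
// each group keeping its original (chronological) order.
function correctionsFirst(records: readonly AuditRecord[]): AuditRecord[] {
  const corrections = records.filter((r) => r.type === "user_correction");
  const rest = records.filter((r) => r.type !== "user_correction");
  return [...corrections, ...rest];
}
```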
### Requirement: Structured recovery report
The system SHALL output a structured recovery report containing: current task, user corrections found, key conclusions from context, unresolved issues, and recent actions.
#### Scenario: Recovery report includes all sections
- **WHEN** context recovery completes
- **THEN** output SHALL contain: `**Current Task**`, `**User Corrections**`, `**Conclusions**`, `**Open Issues**`, `**Recent Activity**`
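A renderer for the five required sections might look like the sketch below. The headings are taken from the scenario above; the `RecoveryReport` interface, the bullet formatting, and the `(none)` placeholder are assumptions.

```typescript
// Hypothetical input shape; only the section headings come from the spec.
interface RecoveryReport {
  currentTask: string;
  userCorrections: string[];
  conclusions: string[];
  openIssues: string[];
  recentActivity: string[];
}

// Render a list as markdown bullets, with a placeholder for empty lists.
function bulletList(items: readonly string[]): string {
  return items.length > 0 ? items.map((i) => `- ${i}`).join("\n") : "- (none)";
}

// Emit every section, even when empty, so the output shape is predictable.
function renderRecoveryReport(r: RecoveryReport): string {
  return [
    `**Current Task**\n${r.currentTask}`,
    `**User Corrections**\n${bulletList(r.userCorrections)}`,
    `**Conclusions**\n${bulletList(r.conclusions)}`,
    `**Open Issues**\n${bulletList(r.openIssues)}`,
    `**Recent Activity**\n${bulletList(r.recentActivity)}`,
  ].join("\n\n");
}
```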

View File

@@ -0,0 +1,56 @@
## ADDED Requirements
### Requirement: MCP middleware captures tool calls to buffer
The MCP server SHALL provide injectable middleware that intercepts every tool call response and appends a structured record to `audit_buffer.jsonl` in the configured audit directory.
#### Scenario: PostToolUse captures tool name and summary
- **WHEN** any MCP tool completes execution
- **THEN** middleware SHALL write a JSONL record with `{timestamp, tool, session, summary}` to `audit_buffer.jsonl`
- **THEN** `tool` SHALL be the MCP tool name
- **THEN** `summary` for Bash tools SHALL be the first non-comment command line (truncated to 200 chars)
- **THEN** `summary` for Write/Edit tools SHALL be the file path
#### Scenario: Buffer is size-limited
- **WHEN** tool call output exceeds 1MB
- **THEN** middleware SHALL truncate input to 1MB via `head -c 1048576` before processing
#### Scenario: Buffer directory is auto-created
- **WHEN** first tool call is captured
- **THEN** middleware SHALL create the audit runs directory with `mkdir -p`
#### Scenario: Failures do not block tool execution
- **WHEN** buffer write fails (disk full, permission denied)
- **THEN** middleware SHALL silently skip logging and allow the tool response to proceed
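The summary rules for the PostToolUse record can be sketched as a pure function. The `ToolCall` shape and the `command` / `file_path` input field names are assumptions; the 200-character cap and the `{timestamp, tool, session, summary}` record fields come from the scenarios above.

```typescript
const MAX_SUMMARY_CHARS = 200;

// Hypothetical tool-call shape; the input field names are assumptions.
interface ToolCall {
  tool: string;
  input: { command?: string; file_path?: string };
}

function summarizeToolCall(call: ToolCall): string {
  if (call.tool === "Bash") {
    // First line that is neither blank nor a shell comment, capped at 200 chars.
    const lines = (call.input.command ?? "").split("\n").map((l) => l.trim());
    const first = lines.find((l) => l !== "" && !l.startsWith("#")) ?? "";
    return first.slice(0, MAX_SUMMARY_CHARS);
  }
  if (call.tool === "Write" || call.tool === "Edit") {
    return call.input.file_path ?? "";
  }
  return ""; // other tools: no summary rule is specified
}

// Build one JSONL line for audit_buffer.jsonl.
function toBufferRecord(call: ToolCall, sessionId: string, now: Date): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    tool: call.tool,
    session: sessionId,
    summary: summarizeToolCall(call),
  });
}
```

In line with the failure scenario, a caller would wrap the actual file append in a `try`/`catch` that swallows the error, so a logging failure never blocks the tool response.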
### Requirement: Session flush archives buffer to trail
The MCP middleware SHALL provide a flush mechanism that moves buffered records into session-specific audit trail files.
#### Scenario: Flush moves buffer to session trail
- **WHEN** middleware flush is triggered (on session end or explicit flush call)
- **THEN** system SHALL read `audit_buffer.jsonl` + `audit_pending.jsonl`
- **THEN** system SHALL concatenate them into `{session_id}/audit_trail.jsonl`
- **THEN** system SHALL clear both buffer files
#### Scenario: Auto-session for unstarted sessions
- **WHEN** no active session exists and flush is triggered
- **THEN** system SHALL auto-generate session ID `auto_{YYYYMMDD_HHMM}` and continue
#### Scenario: Session ID via handshake file
- **WHEN** a session is active via `/start`
- **THEN** `{auditDir}/.current_session` SHALL contain the session ID
- **THEN** flush SHALL read this file to determine the target session directory
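The flush decision logic can be separated from file I/O and sketched as follows. Assumptions: timestamps are formatted in UTC (the spec does not name a timezone), and `planFlush` is a hypothetical helper that returns what should be appended rather than touching disk.

```typescript
const pad2 = (n: number): string => String(n).padStart(2, "0");

// auto_{YYYYMMDD_HHMM}, formatted in UTC (assumption: timezone is unspecified).
function autoSessionId(now: Date): string {
  const ymd = `${now.getUTCFullYear()}${pad2(now.getUTCMonth() + 1)}${pad2(now.getUTCDate())}`;
  const hm = `${pad2(now.getUTCHours())}${pad2(now.getUTCMinutes())}`;
  return `auto_${ymd}_${hm}`;
}

// `currentSessionFile` is the content of .current_session, or null if absent.
function resolveSessionId(currentSessionFile: string | null, now: Date): string {
  const id = currentSessionFile === null ? "" : currentSessionFile.trim();
  return id !== "" ? id : autoSessionId(now);
}

interface FlushPlan {
  sessionId: string;
  trailAppend: string;
}

// Concatenate buffer then pending. The caller appends `trailAppend` to
// `{session_id}/audit_trail.jsonl` and then truncates both source files.
function planFlush(
  buffer: string,
  pending: string,
  currentSessionFile: string | null,
  now: Date,
): FlushPlan {
  const parts = [buffer, pending].map((s) => s.trim()).filter((s) => s !== "");
  return {
    sessionId: resolveSessionId(currentSessionFile, now),
    trailAppend: parts.length > 0 ? parts.join("\n") + "\n" : "",
  };
}
```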
### Requirement: Context injection on each request
The MCP middleware SHALL inject current session context into every incoming request's metadata.
#### Scenario: Session context injected at request start
- **WHEN** any MCP request arrives
- **THEN** middleware SHALL add `{audit.session_id, audit.record_count, audit.status}` to the request context
#### Scenario: CRITICAL alerts injected
- **WHEN** `index.json` contains entries with `max_anomaly_level: "CRITICAL"`
- **THEN** middleware SHALL append CRITICAL alert details to the injected context
#### Scenario: Context injection is configurable
- **WHEN** `audit.contextInjection` is set to `false`
- **THEN** middleware SHALL skip context injection entirely
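A minimal sketch of the injection step, assuming the three `audit.*` keys are flat string keys on the request context. The `audit.critical_alerts` key, the `SessionState` shape, and the choice to inject entry IDs as the alert details are all hypothetical.

```typescript
interface IndexEntry {
  id: string;
  max_anomaly_level?: string;
}

// Hypothetical snapshot of the active session.
interface SessionState {
  sessionId: string;
  recordCount: number;
  status: string;
}

function buildInjectedContext(
  contextInjection: boolean,
  session: SessionState,
  index: readonly IndexEntry[],
): Record<string, unknown> | null {
  if (!contextInjection) return null; // audit.contextInjection: false skips injection

  const context: Record<string, unknown> = {
    "audit.session_id": session.sessionId,
    "audit.record_count": session.recordCount,
    "audit.status": session.status,
  };

  // Append CRITICAL entries only when there are any.
  const critical = index
    .filter((e) => e.max_anomaly_level === "CRITICAL")
    .map((e) => e.id);
  if (critical.length > 0) context["audit.critical_alerts"] = critical; // hypothetical key
  return context;
}
```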

View File

@@ -0,0 +1,76 @@
## ADDED Requirements
### Requirement: /start creates named audit session
The system SHALL provide a `/start` command that creates a named audit session with task description, start time, and persistent session ID.
#### Scenario: /start creates session directory
- **WHEN** user runs `/start "fix auth bug"`
- **THEN** system SHALL generate session ID `adhoc_{YYYYMMDD_HHMM}`
- **THEN** system SHALL create `{auditDir}/{session_id}/` directory
- **THEN** system SHALL write `session.json` with `{session_id, task, start_time, status: "in_progress"}`
- **THEN** system SHALL write `{auditDir}/.current_session` with the session ID
#### Scenario: /start recovers context from history
- **WHEN** user runs `/start` and historical audit data exists
- **THEN** system SHALL load Level 0 (index summary) and Level 2 (user corrections) automatically
#### Scenario: /start outputs recovery summary
- **WHEN** `/start` completes
- **THEN** system SHALL output: started session ID, recovery summary with any corrections found, current session directory path
### Requirement: /end finalizes session with integrity check
The system SHALL provide an `/end` command that collects remaining data, validates completeness, generates a summary, and marks the session as completed.
#### Scenario: /end collects and archives remaining data
- **WHEN** user runs `/end`
- **THEN** system SHALL flush audit buffer + pending to session trail
- **THEN** system SHALL extract all `user_correction` records from the session trail
#### Scenario: /end runs integrity checks
- **WHEN** `/end` runs integrity checks
- **THEN** system SHALL verify: records exist, critical files are covered by audit, prompt changes are validated, corrections are persisted
- **THEN** system SHALL record any failed checks as anomalies in `anomalies.json`
#### Scenario: /end generates session summary
- **WHEN** `/end` completes
- **THEN** system SHALL generate `session_summary.md` with operation count, files touched, corrections made, anomalies detected
- **THEN** system SHALL update `session.json` status to `completed`
- **THEN** system SHALL clear `.current_session`
### Requirement: /recover restores lost context
The system SHALL provide a `/recover` command that restores LLM context from audit trails at multiple detail levels.
#### Scenario: /recover with no args loads L1 context
- **WHEN** user runs `/recover`
- **THEN** system SHALL load Level 1 context (current task state, last 3 actions)
#### Scenario: /recover full loads L3 context
- **WHEN** user runs `/recover full`
- **THEN** system SHALL load Level 3 context (full session trail)
#### Scenario: /recover {session_id} loads specific session
- **WHEN** user runs `/recover 20260607_1200`
- **THEN** system SHALL load context from session `20260607_1200`
### Requirement: /report-daily generates structured work report
The system SHALL provide `/report-daily` that aggregates all session data into a structured daily report with traceable metrics.
#### Scenario: /report-daily generates 7-section report
- **WHEN** user runs `/report-daily`
- **THEN** system SHALL generate a report with: task overview, operation statistics, file/resource changes, anomaly summary, user feedback, backlog items, data integrity status
- **THEN** every number in the report SHALL be sourced from `{auditDir}/` files — no inferred or remembered data
#### Scenario: /report-daily for specific date
- **WHEN** user runs `/report-daily 20260607`
- **THEN** system SHALL generate report for June 7, 2026 only
#### Scenario: /report-daily review mode
- **WHEN** user runs `/report-daily review`
- **THEN** system SHALL include morning self-review section with trend analysis and anomaly comparisons
### Requirement: Session lifecycle is opt-in
All session lifecycle commands SHALL be opt-in via configuration. Systems without audit enabled SHALL have no session lifecycle commands available.
#### Scenario: Commands hidden when audit disabled
- **WHEN** `audit.enabled` is `false` or unset
- **THEN** `/start`, `/end`, `/recover`, `/report-daily` SHALL NOT be registered as commands
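The registration gate can be sketched as a pure function over the config. `AppConfig` and `commandsToRegister` are hypothetical names; the four command names and the "false or unset" rule come from the requirement above.

```typescript
const SESSION_COMMANDS = ["/start", "/end", "/recover", "/report-daily"] as const;

// Hypothetical config shape: only `audit.enabled` is named in the spec.
interface AppConfig {
  audit?: { enabled?: boolean };
}

function commandsToRegister(config: AppConfig, baseCommands: readonly string[]): string[] {
  // Only an explicit `true` enables the commands; false or unset leaves them out.
  const auditEnabled = config.audit?.enabled === true;
  return auditEnabled ? [...baseCommands, ...SESSION_COMMANDS] : [...baseCommands];
}
```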

View File

@@ -0,0 +1,57 @@
## 1. Data Directory Convention (cross-cutting)
- [ ] 1.1 Define `.boo/runs/` directory structure — runs_dir/buffer/session dirs/.current_session/index.json
- [ ] 1.2 Implement `.boo/runs/` directory auto-creation with `.gitignore`
- [ ] 1.3 Add `AUDIT_DOT_DIR` environment variable support for platform-specific directory naming
- [ ] 1.4 Implement `find_runs_dir()` — walk up from CWD looking for {AUDIT_DOT_DIR}/runs
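Task 1.4's upward search could be sketched like this in Node.js. The camel-cased name `findRunsDir` and the fallback to `.boo` when `AUDIT_DOT_DIR` is unset are assumptions based on tasks 1.1 and 1.3.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Walk up from `startDir` until `<dir>/<dotDir>/runs` exists, or the
// filesystem root is reached. Returns null when nothing is found.
function findRunsDir(
  startDir: string,
  dotDir: string = process.env.AUDIT_DOT_DIR ?? ".boo",
): string | null {
  let dir = path.resolve(startDir);
  for (;;) {
    const candidate = path.join(dir, dotDir, "runs");
    if (fs.existsSync(candidate) && fs.statSync(candidate).isDirectory()) {
      return candidate;
    }
    const parent = path.dirname(dir);
    if (parent === dir) return null; // reached the root
    dir = parent;
  }
}

// Usage: build a scratch tree and search from a nested directory.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "boo-"));
fs.mkdirSync(path.join(root, ".boo", "runs"), { recursive: true });
const nested = path.join(root, "packages", "app");
fs.mkdirSync(nested, { recursive: true });
const found = findRunsDir(nested, ".boo");
```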
## 2. Buffer + Flush Pipeline (MCP middleware)
- [ ] 2.1 Implement PostToolUse middleware: capture tool_name + summary → append to `audit_buffer.jsonl`
- [ ] 2.2 Implement Stop middleware: read `.current_session`, flush buffer+pending to session trail
- [ ] 2.3 Implement atomic session.json update preserving existing fields
- [ ] 2.4 Implement `.current_session` handshake protocol (create/read/clear)
- [ ] 2.5 Add safe input truncation (1MB cap) for large tool payloads
- [ ] 2.6 Implement UserPromptSubmit middleware: inject session context + CRITICAL alerts
- [ ] 2.7 Register all middleware with opt-in gate (`audit.enabled: true`)
## 3. Unified Index Schema
- [ ] 3.1 Define `INDEX_ENTRY_REQUIRED` and `INDEX_ENTRY_OPTIONAL` field schemas
- [ ] 3.2 Implement `update_index_entry()` with idempotent upsert and atomic `.tmp` + rename
- [ ] 3.3 Implement `schema_version=1.1` tracking in index.json
- [ ] 3.4 Add CLI entry point for hooks to call `update-index --runs-dir X --id Y ...`
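The idempotent upsert from task 3.2 can be sketched as a pure function; the atomic `.tmp` + rename step would wrap the write of its result. The `IndexEntry` fields (`id`, `task`, `status`, `max_anomaly_level`) are assumptions drawn from fields mentioned elsewhere in this change, since the actual required and optional field schemas are not defined yet.

```typescript
// Hypothetical entry shape; the real required/optional fields are task 3.1.
interface IndexEntry {
  id: string;
  task: string;
  status: string;
  max_anomaly_level?: string;
}

interface IndexFile {
  schema_version: string;
  entries: IndexEntry[];
}

type IndexPatch = { id: string } & Partial<Omit<IndexEntry, "id">>;

// Merge `patch` into the entry with the same id, or append a new entry.
// Applying the same patch twice yields the same index (idempotent).
function upsertIndexEntry(index: IndexFile, patch: IndexPatch): IndexFile {
  const exists = index.entries.some((e) => e.id === patch.id);
  const entries: IndexEntry[] = exists
    ? index.entries.map((e) => (e.id === patch.id ? { ...e, ...patch } : e))
    : [...index.entries, { task: "", status: "in_progress", ...patch }];
  return { ...index, entries };
}
```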
## 4. Graded Context Recovery
- [ ] 4.1 Implement L0 recovery: read last 5 index.json entries (~200 tokens)
- [ ] 4.2 Implement L1 recovery: read session.json + last 3 audit_trail entries (~500 tokens)
- [ ] 4.3 Implement L2 recovery: scan all audit trails for user_correction records (~1000 tokens)
- [ ] 4.4 Implement L3 recovery: full audit_trail + all pending records (~3000 tokens)
- [ ] 4.5 Implement recovery report output format: current task, corrections, conclusions, open issues, recent activity
- [ ] 4.6 Implement priority loading: user_correction records always loaded first
## 5. Session Lifecycle Commands
- [ ] 5.1 Implement `/start` command: generate session ID, write session.json + .current_session, auto-recover L0+L2
- [ ] 5.2 Implement `/end` command: flush buffers, run integrity checks, generate session_summary.md, update index
- [ ] 5.3 Implement `/recover` command: graded context loading (L0-L3), support for specific session IDs
- [ ] 5.4 Implement `/report-daily` command: aggregate index + audit trails, 7-section report with task overview, ops stats, changes, anomalies, feedback, backlog, integrity
- [ ] 5.5 Implement `/report-daily review` mode: add morning self-review with trend analysis
- [ ] 5.6 Implement unfinished session detection + continue prompt
- [ ] 5.7 Register all commands behind `audit.enabled` gate
## 6. Ambient Context via AsyncLocalStorage
- [ ] 6.1 Implement `AmbientContext` class wrapping Node.js `AsyncLocalStorage` with `run()`/`get()`/`set()`
- [ ] 6.2 Define `AmbientState` interface: sessionId, sessionDir, runsDir, agentId, toolCall
- [ ] 6.3 Wire context set at MCP handler/command entry point, clear on session end
- [ ] 6.4 Replace explicit parameter threading in audit pipeline with ambient context reads
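Tasks 6.1 and 6.2 could start from a thin wrapper like the one below. The field types on `AmbientState` are assumptions (the tasks list only the field names), as is the choice to make `set()` throw when called outside `run()`.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Field names from task 6.2; the types and optionality are assumptions.
interface AmbientState {
  sessionId: string;
  sessionDir: string;
  runsDir: string;
  agentId?: string;
  toolCall?: string;
}

class AmbientContext {
  private readonly storage = new AsyncLocalStorage<AmbientState>();

  // Run `fn` with `state` visible to everything it calls, sync or async.
  run<T>(state: AmbientState, fn: () => T): T {
    return this.storage.run({ ...state }, fn);
  }

  // Current state, or undefined when called outside run().
  get(): AmbientState | undefined {
    return this.storage.getStore();
  }

  // Mutate the current state in place.
  set(patch: Partial<AmbientState>): void {
    const store = this.storage.getStore();
    if (!store) throw new Error("AmbientContext.set() called outside run()");
    Object.assign(store, patch);
  }
}
```

An MCP handler would call `run()` once at its entry point, after which the audit pipeline can read `get()` instead of receiving the session as an explicit parameter.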
## 7. Testing & Verification
- [ ] 7.1 Unit tests for buffer write, flush, index update
- [ ] 7.2 Unit tests for context recovery at all 4 levels
- [ ] 7.3 Integration test: full session lifecycle (/start → tool calls → /end)
- [ ] 7.4 Integration test: context recovery after mid-session interruption
- [ ] 7.5 Verify zero behavioral change when `audit.enabled` is false

View File

@@ -0,0 +1,64 @@
# Enhanced File Panel — Design Decisions
## D1: Diff preference system
A `useDiffPreferences` hook manages three stored preferences:
```typescript
interface DiffPreferences {
layout: 'unified' | 'split'; // default: 'unified'
wrapLines: boolean; // default: false
hideWhitespace: boolean; // default: false
}
```
Persisted via localStorage key `boocode.diff.preferences`.
**Reference**: `/opt/forks/paseo/hooks/use-changes-preferences/storage.ts`
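The persistence half of the hook can be sketched without React by passing in a storage object. This is not the Paseo implementation: `KeyValueStorage`, `loadDiffPreferences`, and `saveDiffPreferences` are hypothetical names, and falling back to defaults on malformed data is an assumption.

```typescript
interface DiffPreferences {
  layout: "unified" | "split";
  wrapLines: boolean;
  hideWhitespace: boolean;
}

const DIFF_PREFS_KEY = "boocode.diff.preferences";
const DEFAULT_DIFF_PREFS: DiffPreferences = {
  layout: "unified",
  wrapLines: false,
  hideWhitespace: false,
};

// Minimal storage surface so the logic is testable without a browser;
// window.localStorage satisfies it.
interface KeyValueStorage {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

function loadDiffPreferences(storage: KeyValueStorage): DiffPreferences {
  try {
    const raw = storage.getItem(DIFF_PREFS_KEY);
    if (!raw) return { ...DEFAULT_DIFF_PREFS };
    const parsed = JSON.parse(raw) as Partial<DiffPreferences>;
    // Validate field by field so unknown values fall back to defaults.
    return {
      layout: parsed.layout === "split" ? "split" : "unified",
      wrapLines: parsed.wrapLines === true,
      hideWhitespace: parsed.hideWhitespace === true,
    };
  } catch {
    return { ...DEFAULT_DIFF_PREFS }; // malformed JSON or non-object value
  }
}

function saveDiffPreferences(storage: KeyValueStorage, prefs: DiffPreferences): void {
  storage.setItem(DIFF_PREFS_KEY, JSON.stringify(prefs));
}
```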
## D2: Side-by-side diff layout
Split layout renders two columns (left = removals, right = additions) with aligned line numbers.
Algorithm (adapted from Paseo `diff-layout.ts` `buildSplitDiffRows`):
1. Parse unified diff into hunks (reuse existing `splitDiffByFile`)
2. Group remove lines into pendingRemovals, add lines into pendingAdditions
3. Flush paired rows when encountering context lines
4. Unpaired lines get an empty cell on one side
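The pairing steps above can be sketched as follows. This is a simplified illustration of the algorithm as described here, not Paseo's actual `buildSplitDiffRows`: line numbers and hunk boundaries are omitted, and the `DiffLine` / `SplitRow` shapes are assumptions.

```typescript
type DiffLine = { kind: "context" | "add" | "remove"; text: string };

// One visual row: null means an empty cell on that side.
interface SplitRow {
  left: string | null;
  right: string | null;
}

function buildSplitDiffRows(lines: readonly DiffLine[]): SplitRow[] {
  const rows: SplitRow[] = [];
  let pendingRemovals: string[] = [];
  let pendingAdditions: string[] = [];

  // Pair removals with additions index by index; leftovers get an empty cell.
  const flush = (): void => {
    const n = Math.max(pendingRemovals.length, pendingAdditions.length);
    for (let i = 0; i < n; i++) {
      rows.push({
        left: pendingRemovals[i] ?? null,
        right: pendingAdditions[i] ?? null,
      });
    }
    pendingRemovals = [];
    pendingAdditions = [];
  };

  for (const line of lines) {
    if (line.kind === "remove") pendingRemovals.push(line.text);
    else if (line.kind === "add") pendingAdditions.push(line.text);
    else {
      flush(); // a context line closes the current change group
      rows.push({ left: line.text, right: line.text });
    }
  }
  flush(); // trailing changes with no context line after them
  return rows;
}
```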
## D3: Hide whitespace (server-side)
`GET /api/projects/:id/git/diff` gains optional `whitespace=1` query param.
When set, `git diff` appends `-w` (`--ignore-all-space`).
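The mapping from query param to git argument is small enough to sketch directly. `parseWhitespaceParam` and `buildGitDiffArgs` are hypothetical helper names; the real change lives in `getGitDiff()`.

```typescript
// Only the literal value "1" enables whitespace hiding.
function parseWhitespaceParam(query: URLSearchParams): boolean {
  return query.get("whitespace") === "1";
}

// Build the argv passed to git. `-w` is the short form of --ignore-all-space.
function buildGitDiffArgs(opts: { ignoreWhitespace: boolean }): string[] {
  const args = ["diff"];
  if (opts.ignoreWhitespace) args.push("-w");
  return args;
}
```

Passing arguments as an argv array rather than a shell string keeps the query param from ever being interpolated into a command line.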
## D4: Wrap lines (CSS-only)
When `wrapLines` is true: `white-space: pre-wrap; overflow-wrap: anywhere`
replaces `white-space: pre; overflow-x: auto`. Gutter stays unwrapped.
## D5: Expand/Collapse all
`allExpanded` computed as `files.every(f => expandedPaths.has(f.path))`.
Toggle button adds all paths to expandedPaths or clears them.
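As a sketch, the toggle reduces to one pure function over the current set. `toggleAllExpanded` is a hypothetical name; the `allExpanded` expression is the one given above.

```typescript
interface DiffFile {
  path: string;
}

// Returns the next expandedPaths set: clear when everything is already
// expanded, otherwise expand every file.
function toggleAllExpanded(
  files: readonly DiffFile[],
  expandedPaths: ReadonlySet<string>,
): Set<string> {
  const allExpanded = files.every((f) => expandedPaths.has(f.path));
  return allExpanded ? new Set<string>() : new Set<string>(files.map((f) => f.path));
}
```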
## D6: Inline diff comments
Data model:
```typescript
interface DiffComment {
id: string; filePath: string; side: 'old' | 'new';
lineNumber: number; body: string;
createdAt: number; updatedAt: number;
}
```
Zustand store with localStorage persistence, keyed by `${sessionId}:${mode}`.
## D7: File editing
Double-click file in tree → fetch content via `view_file` → textarea →
Save calls `POST /api/projects/:id/write_file` → triggers `git_diff_refresh`.
Path validated via existing `pathGuard`.
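For readers unfamiliar with this kind of check, a containment test in the spirit of `pathGuard` might look like the sketch below. This is not the existing `pathGuard` implementation, whose internals are not shown here; `isInsideRoot` is a hypothetical stand-in, and it does not resolve symlinks.

```typescript
import * as path from "node:path";

// True when `requested` resolves to a location strictly inside `projectRoot`.
// Rejects traversal ("../x"), absolute paths outside the root, and the root itself.
function isInsideRoot(projectRoot: string, requested: string): boolean {
  const root = path.resolve(projectRoot);
  const target = path.resolve(root, requested);
  const rel = path.relative(root, target);
  if (rel === "" || path.isAbsolute(rel)) return false;
  return rel !== ".." && !rel.startsWith(".." + path.sep);
}
```

A production guard would typically also compare real paths (after symlink resolution) before writing.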
## D8: No DB / no WS frames / no contract changes
All new state is client-side (localStorage, React state, Zustand).

View File

@@ -0,0 +1,46 @@
# Enhanced file panel — diff display modes, inline comments, and in-browser file editing
## Why
BooCode's right-rail file panel has a solid foundation: a file tree (read-only), a Git diff tab
with unified display and stage/commit/discard. But it lacks features users expect from a modern
code review surface:
1. **Side-by-side diff** — was deferred under YAGNI in v1. The current unified-only view is hard
to read on wide files and wastes horizontal space on desktop.
2. **Hide whitespace** — meaningless whitespace changes clutter diffs in code-generation workflows.
3. **Wrap long lines** — unified diffs of long lines require horizontal scrolling.
4. **Expand/collapse all** — only per-file expand exists; bulk toggling is missing.
5. **Inline diff comments** — was explicitly out-of-scope in v1. The user cannot annotate diffs.
6. **In-browser file editing** — the file tree is read-only. Editing requires agent tool calls or terminal.
## What Changes
### Frontend
- **GitDiffView.tsx** — toolbar row with: Unified/Split toggle, Hide whitespace toggle, Wrap lines
toggle, Expand/Collapse all, Refresh (existing)
- **Diff rendering** — split-layout renderer (two panels side-by-side, aligned line numbers),
CSS `pre-wrap` for wrapped lines
- **Inline comments** — `InlineReviewGutterCell` (hover → + button), `InlineReviewThread`,
`InlineReviewEditor` (textarea saved via Zustand store)
- **Comment storage** — Zustand store persisted to localStorage, keyed by diff context
- **File editing** — double-click file in tree → inline textarea → Save/Cancel → server write API
- **Preferences hook** — `useDiffPreferences` → `{ layout, wrapLines, hideWhitespace }`
### Server
- **`git_diff.ts`** — add `ignoreWhitespace` param, pass `-w` to `git diff`
- **New route** — `POST /api/projects/:id/write_file` with pathGuard security
### Infrastructure
- No DB changes — comments client-side, file writes to disk
- No new WS frames — uses existing `git_diff_refresh` event
- No `@boocode/contracts` changes
## Risk
- **Side-by-side diff rendering** — biggest rendering risk; Paseo's `diff-layout.ts` is the reference
- **Inline comments** — lost on localStorage clear; acceptable for v1
- **File editing** — last-writer-wins for single-user is acceptable

View File

@@ -0,0 +1,46 @@
# Enhanced File Panel — tasks
## Tasks
- [ ] 1. **Server: whitespace param** — Add `ignoreWhitespace` param to `getGitDiff()` in
`apps/server/src/services/git_diff.ts`. Append `-w` to git diff argv when true.
Add `whitespace=1` query param to `GET /api/projects/:id/git/diff` route.
- [ ] 2. **Server: write_file endpoint** — Add `POST /api/projects/:id/write_file` with
pathGuard validation and atomic file write. Wire client in `api/client.ts`.
- [ ] 3. **Server tests** — Add tests for whitespace param and write_file endpoint.
(`apps/server/src/services/__tests__/git_diff.test.ts` + new write tests)
- [ ] 4. **Frontend: useDiffPreferences hook** — Create `useDiffPreferences.ts` with
localStorage persistence for layout, wrapLines, hideWhitespace.
- [ ] 5. **Frontend: GitDiffView toolbar** — Add toolbar row with Layout toggle,
Hide whitespace toggle, Wrap lines toggle, Expand/Collapse all, Refresh.
- [ ] 6. **Frontend: diff layout utilities** — Create `utils/diff-layout.ts` with
`buildNumberedDiffHunks()`, `buildSplitDiffRows()`, `buildUnifiedDiffLines()`.
- [ ] 7. **Frontend: DiffSplitView component** — Side-by-side renderer with two aligned
columns, Shiki highlighting per side, thin divider.
- [ ] 8. **Frontend: Integrate split layout in GitDiffView** — Branch render path based
on `layout` preference. Wire collapse/expand state to all files.
- [ ] 9. **Frontend: Inline comment store** — Create `stores/useDiffCommentStore.ts`
with Zustand + localStorage persistence. CRUD operations.
- [ ] 10. **Frontend: InlineReviewGutterCell + InlineReviewEditor** — Gutter cells
with "+" on hover, editor textarea anchored below the target line.
- [ ] 11. **Frontend: InlineReviewThread** — Comment thread display below diff lines.
Collapsible, shows body + timestamp + edit/delete.
- [ ] 12. **Frontend: Integrate comments in GitDiffView** — Wire gutter cells,
editor, and thread into the diff line rendering.
- [ ] 13. **Frontend: File editing in RightRail** — Double-click file → inline
textarea → Save/Cancel → write_file API → tree refresh.
- [ ] 14. **Build + smoke test** — Verify `pnpm -C apps/web build` and
`pnpm -C apps/server build`. Run all QA scenarios.

View File

@@ -0,0 +1,62 @@
# llama-cache-and-spec — KV cache quantization + ngram speculative decoding
## Why
BooCode's llama-sidecar runs llama-server with bare-minimum base args:
`-ngl 999 -c 32768 --flash-attn on --no-mmap`. Two high-impact llama.cpp
features are available but not enabled:
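1. **KV cache quantization** (`--cache-type-k q4_0`) — stores the K half of
the KV cache in 4-bit `q4_0` blocks instead of llama.cpp's default f16,
roughly a 3.5× reduction for that half (`q4_0` costs about 4.5 bits per
value). The V half stays f16 unless `--cache-type-v` is also set. The
cache is a large share of memory at 32K context. Quality impact is
usually small but model-dependent.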
2. **Ngram speculative decoding** (`--spec-type ngram-mod`) — uses a
lightweight rolling-hash ngram model (~16MB) to predict tokens ahead of
the main model. The main model verifies them in batch. 2-3× tok/s
speedup on repetitive/code tasks with no accuracy loss and no separate
draft model to load.
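A quick size calculation helps sanity-check the expected savings. The sketch below assumes llama.cpp's `q4_0` block layout (32 values in 18 bytes) and an f16 baseline; the model shape is hypothetical and chosen only to make the arithmetic concrete.

```typescript
// Bytes per stored element. q4_0 packs 32 values into 18 bytes
// (16 bytes of 4-bit values plus one f16 scale), i.e. 4.5 bits per value.
const BYTES_PER_ELEMENT = { f32: 4, f16: 2, q4_0: 18 / 32 } as const;
type CacheType = keyof typeof BYTES_PER_ELEMENT;

interface ModelShape {
  layers: number;
  kvHeads: number;
  headDim: number;
}

// Size in bytes of one half (K or V) of the cache at a given context length.
function cacheHalfBytes(m: ModelShape, ctx: number, type: CacheType): number {
  return m.layers * m.kvHeads * m.headDim * ctx * BYTES_PER_ELEMENT[type];
}

// Hypothetical model shape, not taken from any specific model card.
const shape: ModelShape = { layers: 48, kvHeads: 8, headDim: 128 };
const ctx = 32768;

const kF16 = cacheHalfBytes(shape, ctx, "f16");
const kQ4 = cacheHalfBytes(shape, ctx, "q4_0");

// Reduction for the K half alone: 2 / 0.5625.
const kRatio = kF16 / kQ4;
// Reduction for the whole cache when only K is quantized and V stays f16.
const totalRatio = (2 * kF16) / (kQ4 + kF16);
```

Under these assumptions the K half shrinks by about 3.6×, while the whole cache shrinks by about 1.6× if the V half is left at f16.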
Both are disabled because they're in the **shadowing lists** of both
validators (`llama-args-validator.ts` + sidecar `validator.go`), which
auto-strip them from agent `llama_extra_args`. The fix is to either:
(A) remove them from the shadow lists, or (B) add them directly to the
sidecar's `BASE_ARGS` (which skips validation entirely).
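To illustrate what "auto-strip" means, the sketch below drops shadowed flags (and their values) from an agent's extra args. It is not the code in `llama-args-validator.ts` or `validator.go`: the shadow list contents and the rule that a following non-dash token is the flag's value are both assumptions.

```typescript
// Hypothetical shadow list; the real lists live in the two validators.
const SHADOWED_FLAGS: ReadonlySet<string> = new Set(["--cache-type-k", "--spec-type"]);

function stripShadowedArgs(
  args: readonly string[],
  shadowed: ReadonlySet<string>,
): string[] {
  const kept: string[] = [];
  for (let i = 0; i < args.length; i++) {
    const arg = args[i];
    if (shadowed.has(arg)) {
      // Also drop the flag's value, if the next token is not itself a flag.
      const next = args[i + 1];
      if (next !== undefined && !next.startsWith("-")) i++;
      continue;
    }
    kept.push(arg);
  }
  return kept;
}
```

The value heuristic is deliberately simple and would misread a negative numeric value as a new flag; a real validator would more likely use a per-flag arity table.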
## What Changes
- Sidecar base args gain the full set:
- `--cache-type-k q4_0` — KV cache quantization (~4× VRAM savings)
- `--cache-reuse 256` — KV cache reuse across turns (prompt caching)
- `--slot-save-path /tmp/llama-slots` — disk-persistent KV cache
- `--cache-idle-slots` — auto-save idle slot caches to disk
- `--spec-type ngram-mod --spec-ngram-mod-thsh 2` — spec decoding
- `--ctx-checkpoints 32` — context overflow protection
- `--sleep-idle-seconds 600` — GPU memory reclaim when idle
- `--metrics` — Prometheus metrics endpoint
- Both validators keep existing shadow lists (correct as-is)
- `/tmp/llama-slots` created for slot KV cache persistence
## Dependencies
- llama-sidecar repo (separate git tree, `/opt/forks/llama-sidecar/`)
- BooCode server (`llama-args-validator.ts`, `provider.ts`)
## Routing Change
Previously, the llama-sidecar was only used when an agent had `llama_extra_args`
set in its AGENTS.md frontmatter. The default path was llama-swap (no cache
quant, no spec decoding, no slot save).
Now, when `LLAMA_SIDECAR_URL` is configured (it is in docker-compose.yml),
ALL inference requests route through the sidecar by default, regardless of
whether the agent has `llama_extra_args`. This means:
- Every request gets KV cache quantization, spec decoding, prompt caching
- Agents with explicit `llama_extra_args` still get their overrides on top
- If `LLAMA_SIDECAR_URL` is unset, falls back to llama-swap (backward compat)
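The routing rule can be summarized as a small decision function. The `Backend` type and `chooseBackend` name are hypothetical, and treating an empty `LLAMA_SIDECAR_URL` as unset is an assumption.

```typescript
type Backend =
  | { kind: "sidecar"; url: string; extraArgs: string[] }
  | { kind: "llama-swap" };

function chooseBackend(
  env: { LLAMA_SIDECAR_URL?: string },
  agent: { llama_extra_args?: string[] },
): Backend {
  const url = env.LLAMA_SIDECAR_URL?.trim();
  if (url) {
    // Sidecar is the default path; agent overrides ride on top of base args.
    return { kind: "sidecar", url, extraArgs: agent.llama_extra_args ?? [] };
  }
  return { kind: "llama-swap" }; // backward-compatible fallback
}
```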
## Risk
- `ngram-mod` spec decoding adds ~16MB memory. Trivial vs the 35B model.
- KV cache quant to q4_0 is lossy vs llama.cpp's default f16 cache; the loss is expected to be small on code tasks but should be spot-checked per model.
- Both well-tested in llama.cpp ecosystem. No known regressions.
- If issues, remove from base args and restart — no code change needed.

View File

@@ -0,0 +1,44 @@
# llama-cache-and-spec — tasks
## Files to change
Three files across two repos:
- `/opt/forks/llama-sidecar/internal/config/config.go`
- `/opt/boocode/apps/server/src/services/inference/llama-args-validator.ts`
- `/opt/forks/llama-sidecar/internal/validator/validator.go`
## Tasks
- [x] 1. Update sidecar default base args
`/opt/forks/llama-sidecar/internal/config/config.go` edited.
`defaultBaseArgs()` now includes:
`--cache-type-k q4_0` — KV cache quant → ~4× VRAM savings
`--cache-reuse 256` — KV cache reuse across turns → prompt caching
`--slot-save-path /tmp/llama-slots` — disk-persistent KV cache
`--cache-idle-slots` — auto-save idle slots to disk
`--spec-type ngram-mod --spec-ngram-mod-thsh 2` — spec decoding → 2× tok/s
`--ctx-checkpoints 32` — context overflow protection
`--sleep-idle-seconds 600` — GPU memory reclaim when idle
`--metrics` — Prometheus `/metrics` endpoint
Build verified: `go build ./...` exits 0.
- [x] 2. No change needed — shadow lists are correct
The shadow lists in `llama-args-validator.ts` already prevent agents
from overriding cache/spec/template flags. Adding the flags to
`defaultBaseArgs` + keeping the shadow lists is the correct architecture:
flags are enabled by default, agents can't override them.
- [x] 3. No change needed — same reasoning as task 2
The sidecar `validator.go` shadow lists serve the same purpose.
Both code paths are consistent.
- [ ] 4. Deploy + verify
- Rebuild sidecar binary: `go build -o ... ./...` → ✅ done
- Restart docker compose: needs manual deploy
- Verify `/metrics` endpoint returns data
- Verify `nvidia-smi` shows reduced VRAM (expected: ~4× savings on KV cache)

View File

@@ -0,0 +1,26 @@
## Context
BooCode agents are stateless — they lose all context between sessions. The 3-tier memory architecture (Context→Daily→Core) provides structured persistence. Context compression reduces token waste. Boulder state enables work tracking across sessions. System Context provides typed context lifecycle management.
## Goals / Non-Goals
**Goals:**
- Cross-session agent memory via 3-tier architecture
- Token-efficient context window management via summarization + compression
- Work-in-progress persistence across sessions via boulder state
- Typed context sources with epoch baselines
**Non-Goals:**
- Distributed memory storage (single-user SQLite is sufficient)
- Real-time memory syncing across users
## Decisions
- **Embedding strategy**: Use OpenAI text-embedding-3-small via BooCode's existing provider. Store as BLOB in SQLite.
- **Vector search**: In-process cosine similarity (mathjs) — no external vector DB needed
- **FTS5**: SQLite FTS5 for keyword search with trigram tokenizer for CJK support
- **Memory tiers**: ContextTier (in-memory, per-session, LRU) → DailyTier (files, ~30-90 days) → CoreTier (SQLite, permanent)
- **Deep Dream**: Background cron job (configurable). Uses LLM (configured model) for consolidation
- **Boulder state**: Stored in Postgres `plans` table as JSONB. Auto-resume on session start.
- **Context compression**: DCP-style. Range dedup (remove duplicate messages), purge errors (remove redundant error messages), turn protection (keep turn boundaries intact).
- **SWR caching**: In-memory Map with TTL-based revalidation. Cache key = (tool, params hash). Invalidate on project switch.
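The in-process vector search decision can be illustrated with a brute-force cosine ranking. This sketch uses plain TypeScript arithmetic in place of mathjs to stay dependency-free; `MemoryItem` and `topK` are hypothetical names, and decoding embeddings from SQLite BLOBs is left out.

```typescript
// Cosine similarity of two equal-length vectors; 0 for a zero vector.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

interface MemoryItem {
  id: string;
  embedding: number[];
}

// Brute-force scan: score every item, sort descending, keep the best k ids.
function topK(query: readonly number[], items: readonly MemoryItem[], k: number): string[] {
  return items
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.id);
}
```

A linear scan like this is usually fine for a single-user memory store; its cost grows with the number of stored embeddings times their dimension.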

Some files were not shown because too many files have changed in this diff.