chore(openspec): drop 9 superseded proposals + 11 stub archive files

Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
2026-06-07 22:15:38 +00:00
parent 0d6e9a2413
commit c935687725
119 changed files with 4897 additions and 45 deletions
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/code-correctness-eval/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/code-correctness-eval/spec.md
@@ -0,0 +1,65 @@
+## ADDED Requirements
+
+### Requirement: Code LLM-as-judge
+
+The system SHALL provide `create_code_llm_as_judge()` that evaluates code correctness using an LLM, with code extraction from responses.
+
+Parameters:
+- `code_extraction_strategy: "none" | "llm" | "markdown_code_blocks"` — how to extract code from output
+- `code_extractor?: Callable` — custom extraction function
+
+#### Scenario: Markdown code block extraction
+
+- **WHEN** `code_extraction_strategy="markdown_code_blocks"` and output contains triple-backtick code blocks
+- **THEN** the evaluator SHALL extract code from those blocks before scoring
+
+#### Scenario: LLM-based code extraction
+
+- **WHEN** `code_extraction_strategy="llm"` and a `judge` is provided
+- **THEN** the evaluator SHALL use an LLM with `ExtractCode`/`NoCode` tools to extract code
+
+#### Scenario: No extraction returns raw output
+
+- **WHEN** `code_extraction_strategy="none"`
+- **THEN** the raw output string is passed directly to the scorer
+
+### Requirement: Static analysis evaluator (Pyright)
+
+The system SHALL provide `create_pyright_evaluator()` that runs Pyright static type checking on extracted Python code.
+
+Parameters:
+- `pyright_cli_args: string[]` — additional CLI flags
+- `code_extraction_strategy` / `code_extractor` — same as code LLM evaluator
+
+#### Scenario: Pyright detects type error
+
+- **WHEN** code with a type error (e.g., `x: int = "string"`) is evaluated
+- **THEN** the evaluator SHALL return score `false` with error details in `comment`
+
+#### Scenario: Pyright passes clean code
+
+- **WHEN** valid Python code is evaluated
+- **THEN** the evaluator SHALL return score `true`
+
+### Requirement: Static analysis evaluator (Mypy)
+
+The system SHALL provide `create_mypy_evaluator()` with equivalent behavior to Pyright evaluator but using the Mypy type checker.
+
+#### Scenario: Mypy detects type error
+
+- **WHEN** code with an annotated function whose return value mismatches its declared return type is evaluated
+- **THEN** the evaluator SHALL return score `false`
+
+### Requirement: Sandboxed code execution
+
+The system SHALL provide `create_e2b_execution_evaluator()` that executes code in a sandbox and checks for runtime errors.
+
+#### Scenario: Code executes without errors
+
+- **WHEN** valid Python code runs in the sandbox
+- **THEN** the evaluator SHALL return score `true`
+
+#### Scenario: Code raises runtime exception
+
+- **WHEN** code that raises an exception is executed
+- **THEN** the evaluator SHALL return score `false` with error details
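The `markdown_code_blocks` strategy above could be sketched roughly as follows. `extractMarkdownCodeBlocks` is a hypothetical helper name, and joining multiple blocks with a blank line is an assumption; the spec does not say how several blocks are combined.

```typescript
// Hypothetical helper, not part of the specified API surface.
// The fence marker is built from escapes so this example can itself sit inside a fenced block.
const FENCE = "\u0060\u0060\u0060";

function extractMarkdownCodeBlocks(output: string): string {
  // Match FENCE, an optional language tag, a newline, then the body up to the next FENCE.
  const pattern = new RegExp(FENCE + "[^\\n]*\\n([\\s\\S]*?)" + FENCE, "g");
  const blocks: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(output)) !== null) {
    // Drop trailing whitespace (including the newline before the closing fence).
    blocks.push((match[1] ?? "").replace(/\s+$/, ""));
  }
  // Assumption: multiple blocks are joined with a blank line; no blocks yields "".
  return blocks.join("\n\n");
}

const sampleOutput = [
  "Here is the fix:",
  FENCE + "python",
  "x: int = 1",
  FENCE,
  "And a helper:",
  FENCE,
  "print(x)",
  FENCE,
].join("\n");

const extracted = extractMarkdownCodeBlocks(sampleOutput);
```

With `code_extraction_strategy="none"` this step would simply be skipped and the raw output passed to the scorer.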
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/core-types/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/core-types/spec.md
@@ -0,0 +1,31 @@
+## ADDED Requirements
+
+### Requirement: Shared type system
+
+The system SHALL define a shared set of types used by all packages:
+- `EvaluatorResult` — TypedDict with `key: string`, `score: number | boolean`, `comment?: string`, `metadata?: Record<string, unknown>`, `source_run_id?: string`
+- `ModelClient` — Protocol with `chat.completions.create()` for LLM access
+- `SandboxProvider` — Interface for provider-agnostic sandbox creation/management
+- `Checkpointer` — Interface for checkpoint persistence
+- `Serializable` — Interface requiring `toJSON()` and static `fromJSON()` methods
+- All evaluators SHALL accept a consistent call signature: `(inputs?, outputs, reference_outputs?, **kwargs)`
+- Error types: `GraphInterrupt`, `SandboxError`, `EvalError`
+
+#### Scenario: EvaluatorResult conforms to schema
+
+- **WHEN** an evaluator returns a result
+- **THEN** the result SHALL conform to `EvaluatorResult` with at least `key` and `score`
+
+#### Scenario: All stateful objects are serializable
+
+- **WHEN** a `Sandbox`, `Snapshot`, or `Command` instance is serialized via `toJSON()`
+- **THEN** a subsequent `fromJSON()` call SHALL reconstruct an equivalent instance
+
+### Requirement: Serialization protocol
+
+All stateful objects (`Sandbox`, `Session`, `Command`, `Snapshot`, `GraphState`) SHALL implement an instance `toJSON()` method and a static `fromJSON()` method for cross-session persistence.
+
+#### Scenario: Round-trip serialization preserves identity
+
+- **WHEN** an object is serialized and deserialized
+- **THEN** the deserialized object SHALL have matching identity fields (`id`, `name`, `sessionId`)
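A minimal sketch of the round-trip requirement, assuming a hypothetical `SnapshotRef` class that carries only the identity fields named in the scenario (`id`, `name`, `sessionId`). Real stateful objects would carry more state.

```typescript
// Hypothetical class for illustration only.
interface SnapshotJSON {
  id: string;
  name: string;
  sessionId: string;
}

class SnapshotRef {
  constructor(
    readonly id: string,
    readonly name: string,
    readonly sessionId: string,
  ) {}

  // Instance side: produce a plain, JSON-safe object.
  toJSON(): SnapshotJSON {
    return { id: this.id, name: this.name, sessionId: this.sessionId };
  }

  // Static side: rebuild an equivalent instance from the plain object.
  static fromJSON(json: SnapshotJSON): SnapshotRef {
    return new SnapshotRef(json.id, json.name, json.sessionId);
  }
}

const original = new SnapshotRef("snap_abc", "my-sandbox", "sess_1");
// JSON.stringify calls toJSON() automatically when it is defined.
const wire = JSON.stringify(original);
const restored = SnapshotRef.fromJSON(JSON.parse(wire) as SnapshotJSON);
```

Keeping `fromJSON()` static means a caller can rebuild an object in a fresh process without first holding an instance.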
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/eval-prompt-library/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/eval-prompt-library/spec.md
@@ -0,0 +1,49 @@
+## ADDED Requirements
+
+### Requirement: Built-in evaluation prompt templates
+
+The system SHALL ship with a library of prompt templates organized by domain, ready for use with `create_llm_as_judge()`.
+
+Domains and included prompts:
+
+**Quality:**
+- `CORRECTNESS_PROMPT` — factual accuracy and completeness
+- `CONCISENESS_PROMPT` — concise responses without hedging or fluff
+- `HALLUCINATION_PROMPT` — claims verifiable from context
+- `ANSWER_RELEVANCE_PROMPT` — output addresses the input question
+- `PLAN_ADHERENCE_PROMPT` — agent actions match declared plan
+- `LAZINESS_PROMPT` — detects blank or low-effort responses
+
+**RAG:**
+- `RAG_GROUNDEDNESS_PROMPT` — output claims supported by retrieved context
+- `RAG_HELPFULNESS_PROMPT` — output addresses core question
+- `RAG_RETRIEVAL_RELEVANCE_PROMPT` — retrieved context is relevant to input
+
+**Safety:**
+- `TOXICITY_PROMPT` — personal attacks, hate speech
+- `FAIRNESS_PROMPT` — stereotyping, discrimination
+
+**Security:**
+- `PII_LEAKAGE_PROMPT` — names, contact info, credentials in output
+- `PROMPT_INJECTION_PROMPT` — delimiter manipulation, roleplay bypass
+- `CODE_INJECTION_PROMPT` — SQL injection, XSS, path traversal
+
+**Trajectory:**
+- `TRAJECTORY_ACCURACY_PROMPT` — logical progression, goal alignment
+- `TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE` — semantically equivalent to reference
+- `TOOL_SELECTION_PROMPT` — right tools, right order, no redundant calls
+
+**Conversation:**
+- `USER_SATISFACTION_PROMPT` — gratitude, resolution, engagement
+- `TASK_COMPLETION_PROMPT` — was the user's goal achieved
+- `AGENT_TONE_PROMPT` — appropriate tone and professionalism
+
+#### Scenario: Each prompt is a string with {inputs}, {outputs}, {reference_outputs} placeholders
+
+- **WHEN** a prompt template is inspected
+- **THEN** it SHALL be a string compatible with `str.format()` containing at least `{outputs}`
+
+#### Scenario: Prompt templates follow rubric structure
+
+- **WHEN** a prompt template is read
+- **THEN** it SHALL contain `<Rubric>`, `<Instructions>`, and `<Reminder>` XML sections
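The two scenarios above can be illustrated with a made-up template. `EXAMPLE_PROMPT`, `fillTemplate`, and `hasRubricStructure` are hypothetical names, and the wording of the shipped prompts may differ. `fillTemplate` is only a small stand-in for Python's `str.format()`: it leaves unknown placeholders in place, whereas `str.format()` raises on a missing key.

```typescript
// Hypothetical template; the shipped prompts may be worded differently.
const EXAMPLE_PROMPT = [
  "<Rubric>",
  "A correct answer is factually accurate and complete.",
  "</Rubric>",
  "<Instructions>",
  "Compare the output to the input question.",
  "</Instructions>",
  "<Reminder>",
  "Judge only correctness, not style.",
  "</Reminder>",
  "<input>{inputs}</input>",
  "<output>{outputs}</output>",
].join("\n");

// Replaces {name} with the matching value; unknown placeholders are left untouched.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (whole: string, key: string) => {
    const value = values[key];
    return value === undefined ? whole : value;
  });
}

const REQUIRED_SECTIONS = ["<Rubric>", "<Instructions>", "<Reminder>"];

// Checks the rubric structure required by the second scenario.
function hasRubricStructure(template: string): boolean {
  return REQUIRED_SECTIONS.every((tag) => template.indexOf(tag) !== -1);
}

const filled = fillTemplate(EXAMPLE_PROMPT, { inputs: "What is 2 + 2?", outputs: "4" });
```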
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/graph-streaming/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/graph-streaming/spec.md
@@ -0,0 +1,49 @@
+## ADDED Requirements
+
+### Requirement: Stream modes
+
+The system SHALL support multiple stream modes when invoking a compiled graph:
+
+- `"values"` — emits the full state after each superstep
+- `"updates"` — emits only the state changes after each superstep
+- `"messages"` — emits individual message chunks for chat-oriented graphs
+- `"debug"` — emits debug events with full superstep information
+- `"custom"` — supports user-defined events via an emit function
+
+#### Scenario: Values mode emits full state
+
+- **WHEN** a graph is streamed with `streamMode: ["values"]`
+- **THEN** each chunk SHALL contain the complete state object after each superstep
+
+#### Scenario: Updates mode emits diffs
+
+- **WHEN** a graph is streamed with `streamMode: ["updates"]`
+- **THEN** each chunk SHALL contain only the state keys that changed
+
+### Requirement: Stream event protocol
+
+The system SHALL emit structured events during graph execution, including:
+- `on_chain_start` — node execution begins
+- `on_chain_end` — node execution completes
+- `on_chain_stream` — intermediate output from a node
+- `on_custom_event` — user-defined events
+- Checkpoint metadata paired with each event (id, parent_id, step, source)
+
+#### Scenario: Events include checkpoint metadata
+
+- **WHEN** a stream event is received
+- **THEN** it SHALL include a `checkpoint` envelope with `id`, `step`, and `source`
+
+#### Scenario: Custom events propagate from nodes
+
+- **WHEN** a node emits a custom event via an emit function
+- **THEN** that event SHALL appear in the stream with type `on_custom_event`
+
+### Requirement: Async iteration over streams
+
+The system SHALL support `for await...of` iteration over graph streams.
+
+#### Scenario: Stream is async iterable
+
+- **WHEN** `for await (const chunk of graph.stream(...))` is used
+- **THEN** each chunk SHALL be available as it is produced
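To make the difference between `"values"` and `"updates"` concrete, here is a synchronous sketch. `chunksFor` is a hypothetical helper that computes what each mode would emit from a list of per-superstep writes; a real engine would yield these chunks from an async iterable as supersteps complete.

```typescript
type GraphState = Record<string, unknown>;
type StreamMode = "values" | "updates";

// Hypothetical helper: given the initial state and each superstep's writes,
// compute the chunks a stream in the given mode would emit.
function chunksFor(mode: StreamMode, initial: GraphState, updates: GraphState[]): GraphState[] {
  const chunks: GraphState[] = [];
  let state: GraphState = { ...initial };
  for (const update of updates) {
    state = { ...state, ...update }; // apply this superstep's writes
    // "values" emits the full state; "updates" emits only the changed keys.
    chunks.push(mode === "values" ? { ...state } : { ...update });
  }
  return chunks;
}

const superstepUpdates: GraphState[] = [{ step: 1 }, { answer: "42" }];
const valuesChunks = chunksFor("values", { step: 0, answer: null }, superstepUpdates);
const updatesChunks = chunksFor("updates", { step: 0, answer: null }, superstepUpdates);
```

In this run `valuesChunks` carries both keys in every chunk, while `updatesChunks` carries one key per chunk.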
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/human-in-the-loop/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/human-in-the-loop/spec.md
@@ -0,0 +1,56 @@
+## ADDED Requirements
+
+### Requirement: Node interrupt function
+
+The system SHALL provide an `interrupt(value)` function that pauses graph execution and returns a resume value when the graph is continued.
+
+#### Scenario: Interrupt pauses execution with value
+
+- **WHEN** a node calls `const approval = interrupt({ question: "Approve this action?" })`
+- **THEN** execution SHALL pause and the interrupt value SHALL be available in the stream output
+
+#### Scenario: Resume returns value to interrupt
+
+- **WHEN** the graph is resumed with `new Command({ resume: "approved" })`
+- **THEN** the `interrupt()` call SHALL return `"approved"`
+
+#### Scenario: Multiple interrupts are supported
+
+- **WHEN** a node calls `interrupt()` twice
+- **THEN** each interrupt SHALL be resolved sequentially, requiring two resume commands
+
+### Requirement: Command-based graph resumption
+
+The system SHALL provide a `Command` class that supports:
+- `Command.RESUME` — resume value for pending interrupts
+- `Command.GOTO` — Send or node name for dynamic routing
+- `Command.PARENT` — bubble up to parent graph
+
+#### Scenario: Command with resume continues execution
+
+- **WHEN** `await graph.stream(new Command({ resume: "user input" }))` is called
+- **THEN** the interrupted node SHALL continue with the resume value
+
+#### Scenario: Command with goto routes dynamically
+
+- **WHEN** a node returns `new Command({ goto: "human_review" })`
+- **THEN** execution SHALL route to `human_review` node
+
+### Requirement: Automated interrupts at node boundaries
+
+The system SHALL support `interruptBefore` and `interruptAfter` in `compile()` options to automatically pause at specific nodes.
+
+#### Scenario: InterruptBefore pauses before node execution
+
+- **WHEN** `graph.compile({ interruptBefore: ["approval_node"] })` is used
+- **THEN** the graph SHALL pause just before executing `approval_node`
+
+### Requirement: State snapshots on interrupt
+
+When a graph uses a checkpointer, interrupt states SHALL be persisted so execution can be resumed across process boundaries.
+
+#### Scenario: Interrupted state is checkpointed
+
+- **WHEN** a graph with a checkpointer is interrupted
+- **THEN** the checkpoint SHALL contain the interrupt state
+- **THEN** restoring from that checkpoint SHALL yield the same interrupt state
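One way the sequential-interrupt scenario could work is to re-run the node from the top and answer each `interrupt()` call from a list of stored resume values. The sketch below assumes that replay strategy; `runNode` and `PendingInterrupt` are hypothetical names, and nothing here should be read as the engine's actual implementation.

```typescript
// Hypothetical marker thrown to unwind the node when no resume value is available.
class PendingInterrupt {
  constructor(readonly value: unknown) {}
}

type InterruptFn = (value: unknown) => unknown;
type NodeFn = (interrupt: InterruptFn) => unknown;

type RunResult =
  | { status: "interrupted"; value: unknown }
  | { status: "done"; output: unknown };

// Re-runs the node, answering the i-th interrupt() call with the i-th stored
// resume value. The first unanswered call pauses the run.
function runNode(node: NodeFn, resumes: unknown[]): RunResult {
  let index = 0;
  const interrupt: InterruptFn = (value) => {
    if (index < resumes.length) {
      return resumes[index++];
    }
    throw new PendingInterrupt(value);
  };
  try {
    return { status: "done", output: node(interrupt) };
  } catch (err) {
    if (err instanceof PendingInterrupt) {
      return { status: "interrupted", value: err.value };
    }
    throw err;
  }
}

const approvalNode: NodeFn = (interrupt) => {
  const first = interrupt({ question: "Approve this action?" });
  const second = interrupt({ question: "Are you sure?" });
  return [first, second];
};

const run1 = runNode(approvalNode, []);                  // pauses at the first interrupt
const run2 = runNode(approvalNode, ["approved"]);        // pauses at the second
const run3 = runNode(approvalNode, ["approved", "yes"]); // completes
```

Under this strategy any code placed before an `interrupt()` call runs again on every resume, so side effects there would need to be idempotent.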
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/llm-as-judge/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/llm-as-judge/spec.md
@@ -0,0 +1,55 @@
+## ADDED Requirements
+
+### Requirement: LLM-as-judge evaluator factory
+
+The system SHALL provide a `create_llm_as_judge()` function that creates an evaluator using an LLM to assess output quality.
+
+Parameters:
+- `prompt: string | Runnable | Callable` — evaluation prompt (string template with `{inputs}`, `{outputs}`, `{reference_outputs}`)
+- `judge?: ModelClient | BaseChatModel` — LLM client
+- `model?: string` — model identifier
+- `system?: string` — optional system message
+- `continuous: boolean = false` — float 0-1 scoring when true, boolean when false
+- `choices?: number[]` — specific enum float values for score
+- `use_reasoning: boolean = true` — include reasoning in output
+- `few_shot_examples?: FewShotExample[]` — example evaluations
+- `output_schema?: JSONSchema | ZodSchema` — custom structured output format
+
+#### Scenario: String prompt evaluator returns scored result
+
+- **WHEN** `create_llm_as_judge(prompt="Rate: {outputs}")` is invoked with `outputs="Hello world"`
+- **THEN** it SHALL return an `EvaluatorResult` with `key: "score"` and a valid `score`
+
+#### Scenario: Continuous scoring returns float
+
+- **WHEN** `create_llm_as_judge(prompt=..., continuous=true)` scores output
+- **THEN** the score SHALL be a float between 0.0 and 1.0
+
+#### Scenario: Choices scoring returns enum value
+
+- **WHEN** `create_llm_as_judge(prompt=..., choices=[0.0, 0.5, 1.0])` scores output
+- **THEN** the score SHALL be exactly one of the enumerated choices
+
+#### Scenario: Reasoning mode returns comment
+
+- **WHEN** `create_llm_as_judge(prompt=..., use_reasoning=true)` scores output
+- **THEN** the `comment` field SHALL contain the LLM's reasoning
+
+#### Scenario: Few-shot examples are appended to prompt
+
+- **WHEN** `few_shot_examples` are provided
+- **THEN** they SHALL be appended as `<example>` XML blocks to the last user message
+
+#### Scenario: Output schema returns structured dict
+
+- **WHEN** `output_schema` is provided (e.g., z.object({ quality: z.number() }))
+- **THEN** the evaluator SHALL return a dict matching that schema instead of `EvaluatorResult`
+
+### Requirement: Async LLM-as-judge
+
+The system SHALL provide `create_async_llm_as_judge()` with identical parameters, returning an async evaluator.
+
+#### Scenario: Async evaluator returns same structure as sync
+
+- **WHEN** `await` is used on an async evaluator invocation
+- **THEN** the result SHALL match the same structure as the sync equivalent
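The few-shot scenario above (examples appended as `<example>` blocks to the last user message) might look like the following. The field names on `FewShotExample` and the inner tag names are assumptions; the spec names the type but does not list its fields.

```typescript
// Hypothetical shapes for illustration.
interface FewShotExample {
  inputs: string;
  outputs: string;
  score: number | boolean;
  reasoning?: string;
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

function formatExample(example: FewShotExample): string {
  const lines = [
    "<example>",
    "<input>" + example.inputs + "</input>",
    "<output>" + example.outputs + "</output>",
  ];
  if (example.reasoning !== undefined) {
    lines.push("<reasoning>" + example.reasoning + "</reasoning>");
  }
  lines.push("<score>" + String(example.score) + "</score>", "</example>");
  return lines.join("\n");
}

// Returns a new message list with the examples appended to the last user message.
function appendFewShotExamples(messages: ChatMessage[], examples: FewShotExample[]): ChatMessage[] {
  let lastUser = -1;
  messages.forEach((message, i) => {
    if (message.role === "user") lastUser = i;
  });
  if (lastUser === -1 || examples.length === 0) return messages.slice();
  const block = examples.map(formatExample).join("\n");
  return messages.map((message, i) =>
    i === lastUser ? { ...message, content: message.content + "\n\n" + block } : message,
  );
}

const judged = appendFewShotExamples(
  [
    { role: "system", content: "You are a strict grader." },
    { role: "user", content: "Rate: Hello world" },
  ],
  [{ inputs: "2 + 2", outputs: "4", score: true }],
);
```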
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/multi-turn-simulation/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/multi-turn-simulation/spec.md
@@ -0,0 +1,39 @@
+## ADDED Requirements
+
+### Requirement: Multi-turn conversation simulation
+
+The system SHALL provide `run_multiturn_simulation()` that simulates a multi-turn conversation between an app and a simulated user.
+
+Parameters:
+- `app: Callable[[ChatCompletionMessage], ChatCompletionMessage]` — the application under test
+- `user: Callable | string[]` — simulated user (dynamic or static responses)
+- `max_turns?: number` — maximum conversation turns
+- `trajectory_evaluators?: EvalFunction[]` — evaluators that assess the final trajectory
+- `stopping_condition?: Callable[[Message[], number], boolean]` — early termination
+- `reference_outputs?: unknown` — passed to evaluators
+
+#### Scenario: Static user responses drive conversation
+
+- **WHEN** `user=["Hello", "Tell me more", "Goodbye"]` with `max_turns=3`
+- **THEN** the simulation SHALL alternate between user responses and app responses for 3 turns
+
+#### Scenario: Dynamic simulated user adapts to context
+
+- **WHEN** `user` is a `Callable` receiving the current trajectory
+- **THEN** the user function SHALL receive the current conversation history and return the next message
+
+#### Scenario: Trajectory evaluators run after simulation
+
+- **WHEN** `trajectory_evaluators` are provided
+- **THEN** each evaluator SHALL receive the full conversation trajectory as `outputs`
+- **THEN** the simulation result SHALL include `evaluator_results` from each evaluator
+
+#### Scenario: Stopping condition terminates early
+
+- **WHEN** `stopping_condition` returns `true` before `max_turns`
+- **THEN** the simulation SHALL terminate immediately
+
+#### Scenario: Async simulation is supported
+
+- **WHEN** `run_multiturn_simulation_async()` is called with async `app` and `user` functions
+- **THEN** the simulation SHALL await each turn and return the same result structure
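A simplified, synchronous sketch of the turn loop with static user responses and a stopping condition. `simulate` and its option names are hypothetical stand-ins for `run_multiturn_simulation()`; trajectory evaluators and the dynamic-user case are left out, and ending the run when the scripted responses run out is an assumption.

```typescript
interface Turn {
  role: "user" | "assistant";
  content: string;
}

interface SimulationOptions {
  app: (message: Turn) => Turn;
  user: string[]; // static responses only in this sketch
  maxTurns: number;
  stoppingCondition?: (trajectory: Turn[], turn: number) => boolean;
}

// One "turn" here is a user message followed by the app's reply.
function simulate(options: SimulationOptions): Turn[] {
  const trajectory: Turn[] = [];
  for (let turn = 0; turn < options.maxTurns; turn++) {
    const scripted = options.user[turn];
    if (scripted === undefined) break; // assumption: stop when the script runs out
    const userTurn: Turn = { role: "user", content: scripted };
    trajectory.push(userTurn);
    trajectory.push(options.app(userTurn));
    if (options.stoppingCondition?.(trajectory, turn + 1)) break;
  }
  return trajectory;
}

const echoApp = (message: Turn): Turn => ({
  role: "assistant",
  content: "You said: " + message.content,
});

const fullRun = simulate({ app: echoApp, user: ["Hello", "Tell me more", "Goodbye"], maxTurns: 3 });
const earlyStop = simulate({
  app: echoApp,
  user: ["Hello", "Tell me more", "Goodbye"],
  maxTurns: 3,
  stoppingCondition: (_trajectory, turn) => turn >= 2,
});
```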
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/pregel-execution/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/pregel-execution/spec.md
@@ -0,0 +1,49 @@
+## ADDED Requirements
+
+### Requirement: Pregel execution engine
+
+The system SHALL implement a Pregel-style superstep execution engine where:
+
+- Each "superstep" executes all ready nodes concurrently
+- Nodes communicate through typed channels (not direct function calls)
+- Channel writes from one superstep are visible as reads in the next
+- The engine supports `PULL` (edge-triggered) and `PUSH` (dynamic Send) task scheduling
+
+#### Scenario: Nodes execute in dependency order
+
+- **WHEN** node B subscribes to channel A
+- **THEN** node B SHALL execute in the superstep after node A writes to channel A
+
+#### Scenario: Concurrent nodes run in parallel
+
+- **WHEN** two nodes have no dependencies between them
+- **THEN** they SHALL execute concurrently within the same superstep
+
+#### Scenario: Dynamic Send spawns new node executions
+
+- **WHEN** a node calls `send("node_c", { ... })` via `Command`
+- **THEN** `node_c` SHALL be scheduled for execution in the current or next superstep
+
+### Requirement: Graph compilation
+
+The system SHALL provide `graph.compile()` that produces a runnable compiled graph.
+
+Parameters:
+- `checkpointer?: Checkpointer` — optional persistence
+- `interruptBefore?: string[]` — nodes to pause before
+- `interruptAfter?: string[]` — nodes to pause after
+- `name?: string` — graph name
+
+#### Scenario: Compiled graph can be invoked
+
+- **WHEN** `compiled_graph.invoke({ messages: [] })` is called
+- **THEN** it SHALL execute all nodes and return the final state
+
+### Requirement: Recursion limit
+
+The system SHALL enforce a configurable recursion limit to prevent infinite loops.
+
+#### Scenario: Exceeding recursion limit throws
+
+- **WHEN** a graph exceeds the recursion limit
+- **THEN** a `GraphRecursionError` SHALL be thrown
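The superstep model and the recursion limit can be sketched in a few lines. This is a synchronous toy under stated assumptions: `runSupersteps` and `PregelNode` are hypothetical names, nodes in a superstep run one after another rather than concurrently, and conflicting writes to the same channel are merged by last-writer-wins, which the spec does not specify.

```typescript
type Channels = Record<string, unknown>;

interface PregelNode {
  id: string;
  subscribes: string[];                  // channels that trigger this node
  run: (channels: Channels) => Channels; // returns channel writes
}

function runSupersteps(
  nodes: PregelNode[],
  input: Channels,
  recursionLimit: number,
): { channels: Channels; supersteps: string[][] } {
  let channels: Channels = { ...input };
  let updated = Object.keys(input); // channels written in the previous superstep
  const supersteps: string[][] = [];

  while (updated.length > 0) {
    // A node is ready when a channel it subscribes to was written last superstep.
    const ready = nodes.filter((node) =>
      node.subscribes.some((channel) => updated.indexOf(channel) !== -1),
    );
    if (ready.length === 0) break;
    if (supersteps.length >= recursionLimit) {
      const error = new Error("recursion limit of " + recursionLimit + " reached");
      error.name = "GraphRecursionError";
      throw error;
    }
    // Every ready node reads the same snapshot; writes become visible next superstep.
    const snapshot = channels;
    let merged: Channels = {};
    for (const node of ready) {
      merged = { ...merged, ...node.run(snapshot) };
    }
    channels = { ...channels, ...merged };
    updated = Object.keys(merged);
    supersteps.push(ready.map((node) => node.id));
  }
  return { channels, supersteps };
}

const pipeline: PregelNode[] = [
  { id: "a", subscribes: ["input"], run: (c) => ({ a_out: (c["input"] as number) + 1 }) },
  { id: "b", subscribes: ["a_out"], run: (c) => ({ b_out: (c["a_out"] as number) * 2 }) },
  { id: "c", subscribes: ["a_out"], run: (c) => ({ c_out: (c["a_out"] as number) * 10 }) },
];

const pipelineRun = runSupersteps(pipeline, { input: 1 }, 25);
```

Here `b` and `c` both subscribe to `a_out`, so they land in the same superstep, one step after `a` writes.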
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-command-execution/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-command-execution/spec.md
@@ -0,0 +1,61 @@
+## ADDED Requirements
+
+### Requirement: Command execution (blocking)
+
+The system SHALL provide `sandbox.runCommand(cmd, args?, opts?)` that executes a command inside the sandbox and waits for completion.
+
+Parameters:
+- `cmd: string` — command to execute
+- `args?: string[]` — command arguments
+- `cwd?: string` — working directory
+- `env?: Record<string, string>` — per-command environment variables
+- `sudo?: boolean` — execute with root privileges
+- `timeoutMs?: number` — max execution time (SIGKILL on expiry)
+- `signal?: AbortSignal` — cancellation
+
+#### Scenario: Blocking runCommand returns finished result with exit code
+
+- **WHEN** `sandbox.runCommand("echo", ["hello"])` is called
+- **THEN** it SHALL return a `CommandFinished` instance with `exitCode: 0`
+
+#### Scenario: Command timeout kills process
+
+- **WHEN** `sandbox.runCommand("sleep", ["100"], { timeoutMs: 100 })` is executed
+- **THEN** it SHALL return a non-zero exit code after ~100ms
+
+#### Scenario: Stderr is captured separately
+
+- **WHEN** a command writes to both stdout and stderr
+- **THEN** `result.stdout()` and `result.stderr()` SHALL return their respective streams
+
+### Requirement: Detached command execution
+
+The system SHALL support `{ detached: true }` mode where `runCommand()` returns immediately with a live `Command` handle.
+
+#### Scenario: Detached command returns before completion
+
+- **WHEN** `sandbox.runCommand({ cmd: "sleep", args: ["5"], detached: true })` is called
+- **THEN** it SHALL return a `Command` instance immediately (before the process exits)
+
+#### Scenario: Detached command can be waited on
+
+- **WHEN** `command.wait()` is called on a detached command
+- **THEN** it SHALL return a `CommandFinished` when the process exits
+
+### Requirement: Command log streaming
+
+The system SHALL provide `command.logs()` as an async iterable of stdout/stderr log lines.
+
+#### Scenario: Logs stream output lines
+
+- **WHEN** `for await (const log of command.logs())` is iterated
+- **THEN** each `log` SHALL have `stream: "stdout" | "stderr"` and `data: string`
+
+### Requirement: Command kill
+
+The system SHALL provide `command.kill(signal?)` to send a POSIX signal to a running command.
+
+#### Scenario: Default kill sends SIGTERM
+
+- **WHEN** `command.kill()` is called without a signal
+- **THEN** SIGTERM SHALL be sent to the process
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-filesystem/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-filesystem/spec.md
@@ -0,0 +1,50 @@
+## ADDED Requirements
+
+### Requirement: Filesystem API matching node:fs/promises
+
+The system SHALL provide `sandbox.fs` implementing the Node.js `fs/promises` API:
+
+- `readFile(path, encoding?)` → `Buffer | string`
+- `writeFile(path, data)` → `void`
+- `appendFile(path, data)` → `void`
+- `mkdir(path, { recursive? })` → `void`
+- `readdir(path, { withFileTypes? })` → `string[] | Dirent[]`
+- `stat(path)` / `lstat(path)` → `Stats`
+- `unlink(path)`, `rm(path, { recursive?, force? })`, `rmdir(path)` → `void`
+- `rename(oldPath, newPath)` → `void`
+- `copyFile(src, dest)` → `void`
+- `chmod(path, mode)`, `chown(path, uid, gid)` → `void`
+- `symlink(target, path)` → `void`; `readlink(path)` → `string`
+- `realpath(path)` → `string`; `truncate(path, len?)` → `void`
+- `mkdtemp(prefix)` → `string`
+- `access(path)` → `void` (rejects when the path is not accessible); `exists(path)` → `boolean` (an extension; `fs/promises` has no `exists`)
+
+#### Scenario: ReadFile returns correct content
+
+- **WHEN** `sandbox.fs.readFile("/etc/hostname", "utf8")` is called
+- **THEN** it SHALL return the file content as a string
+
+#### Scenario: WriteFile creates new file
+
+- **WHEN** `sandbox.fs.writeFile("/tmp/test.txt", "hello")` is called
+- **THEN** subsequent `sandbox.fs.readFile("/tmp/test.txt", "utf8")` SHALL return `"hello"`
+
+#### Scenario: Readdir lists directory contents
+
+- **WHEN** `sandbox.fs.readdir("/")` is called
+- **THEN** it SHALL return an array of filenames
+
+#### Scenario: Stat returns file metadata
+
+- **WHEN** `sandbox.fs.stat("/etc/hostname")` is called
+- **THEN** it SHALL return a `Stats`-compatible object with `size`, `isFile()`, `isDirectory()`, `mode`, `uid`, `gid`, `mtime`, etc.
+
+#### Scenario: Mkdir creates intermediate directories
+
+- **WHEN** `sandbox.fs.mkdir("/tmp/a/b/c", { recursive: true })` is called
+- **THEN** the directory `/tmp/a/b/c` SHALL exist
+
+#### Scenario: Exists returns false for missing files
+
+- **WHEN** `sandbox.fs.exists("/nonexistent")` is called
+- **THEN** it SHALL return `false`
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-lifecycle/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-lifecycle/spec.md
@@ -0,0 +1,70 @@
+## ADDED Requirements
+
+### Requirement: Sandbox creation
+
+The system SHALL provide a `Sandbox.create()` static method that provisions a new isolated compute environment.
+
+Parameters:
+- `name?: string` — optional human-readable name
+- `source?: { type: "git" | "tarball" | "snapshot" }` — source for initial filesystem
+- `ports?: number[]` — ports to expose (max 4)
+- `timeout?: number` — auto-terminate timeout in ms
+- `resources?: { vcpus: number }` — CPU allocation (2048 MB RAM per vCPU)
+- `runtime?: string` — runtime identifier
+- `networkPolicy?: NetworkPolicy` — network restrictions
+- `env?: Record<string, string>` — default environment variables
+- `tags?: Record<string, string>` — metadata tags (max 5)
+- `persistent?: boolean` — persistent filesystem across sessions
+- `signal?: AbortSignal` — cancellation support
+
+#### Scenario: Create returns a running Sandbox instance
+
+- **WHEN** `Sandbox.create()` is called with valid parameters
+- **THEN** it SHALL return a `Sandbox` instance with a running session
+
+#### Scenario: Create supports AsyncDisposable
+
+- **WHEN** `Sandbox.create()` is used with `await using`
+- **THEN** the sandbox SHALL be automatically stopped when scope exits
+
+#### Scenario: Source specifies initial filesystem content
+
+- **WHEN** `source: { type: "git", url: "..." }` is provided
+- **THEN** the sandbox SHALL clone the git repository on creation
+
+### Requirement: Sandbox retrieval
+
+The system SHALL provide `Sandbox.get()` to retrieve an existing sandbox and `Sandbox.getOrCreate()` for idempotent get-or-create.
+
+#### Scenario: Get retrieves existing sandbox
+
+- **WHEN** `Sandbox.get({ name: "my-sandbox" })` is called for an existing sandbox
+- **THEN** it SHALL return the sandbox with its session resumed
+
+#### Scenario: GetOrCreate creates when not found
+
+- **WHEN** `Sandbox.getOrCreate({ name: "new-sandbox", onCreate: ... })` is called and the sandbox does not exist
+- **THEN** it SHALL create a new sandbox and call `onCreate` once
+
+### Requirement: Sandbox forking
+
+The system SHALL provide `Sandbox.fork()` to create a new sandbox from an existing one's current filesystem state.
+
+#### Scenario: Fork preserves filesystem state
+
+- **WHEN** `Sandbox.fork({ sourceSandbox: "original" })` is called
+- **THEN** the new sandbox SHALL start with the filesystem state of the source sandbox
+
+### Requirement: Sandbox update and delete
+
+The system SHALL support `sandbox.update()` for configuration changes and `sandbox.delete()` for removal.
+
+#### Scenario: Update changes sandbox config
+
+- **WHEN** `sandbox.update({ timeout: 300000 })` is called
+- **THEN** the sandbox's timeout SHALL be updated for subsequent sessions
+
+#### Scenario: Delete removes the sandbox
+
+- **WHEN** `sandbox.delete()` is called
+- **THEN** the sandbox SHALL be permanently removed
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-network-policy/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-network-policy/spec.md
@@ -0,0 +1,52 @@
+## ADDED Requirements
+
+### Requirement: Network policy type
+
+The system SHALL define a `NetworkPolicy` type with three forms:
+
+- `"allow-all"` — full internet access (default)
+- `"deny-all"` — no external access
+- `{ allow?: string[] | Record<string, NetworkPolicyRule[]>; subnets?: { allow?: string[]; deny?: string[] } }` — custom rules
+
+#### Scenario: Allow-all permits all traffic
+
+- **WHEN** `networkPolicy: "allow-all"` is set
+- **THEN** all outbound traffic SHALL be permitted
+
+#### Scenario: Deny-all blocks all traffic
+
+- **WHEN** `networkPolicy: "deny-all"` is set
+- **THEN** all outbound traffic SHALL be denied
+
+#### Scenario: Domain allowlist restricts access
+
+- **WHEN** `networkPolicy: { allow: ["*.npmjs.org"] }` is set
+- **THEN** traffic to `registry.npmjs.org` SHALL be allowed and all other traffic SHALL be denied
+
+#### Scenario: Wildcard domains match subdomains
+
+- **WHEN** a domain pattern starts with `*.` (e.g., `*.example.com`)
+- **THEN** it SHALL match any subdomain of that domain
+
+### Requirement: Network policy rules with transformers
+
+The system SHALL support per-domain rules with request transformers for header injection.
+
+Parameters per rule:
+- `match?: { path?, method?, queryString?, headers? }` — request matchers
+- `transform?: { headers: Record<string, string> }[]` — header injection
+- `forwardURL?: string` — HTTPS proxy forwarding
+
+#### Scenario: Header transform injects authorization
+
+- **WHEN** a request matches a rule with `transform: [{ headers: { authorization: "Bearer token" } }]`
+- **THEN** the `authorization` header SHALL be injected before forwarding
+
+### Requirement: Subnet filtering
+
+The system SHALL support subnet-level access control via CIDR notation.
+
+#### Scenario: Subnet allow takes precedence over domain deny
+
+- **WHEN** `subnets: { allow: ["10.0.0.0/8"] }` is set
+- **THEN** traffic to `10.0.0.1` SHALL be allowed regardless of domain rules
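The wildcard and allowlist scenarios could be checked with a matcher along these lines. `matchesDomain` and `isAllowed` are hypothetical names. The spec does not say whether `*.example.com` also matches the apex `example.com`; this sketch assumes it does not.

```typescript
// Hypothetical matcher for domain patterns in an allowlist.
function matchesDomain(pattern: string, host: string): boolean {
  const p = pattern.toLowerCase();
  const h = host.toLowerCase();
  if (p.indexOf("*.") === 0) {
    const suffix = p.slice(1); // "*.example.com" -> ".example.com"
    // Require at least one label before the suffix, so the apex does not match.
    return h.length > suffix.length && h.slice(-suffix.length) === suffix;
  }
  return h === p;
}

// With an allowlist, anything not matched is denied.
function isAllowed(allow: string[], host: string): boolean {
  return allow.some((pattern) => matchesDomain(pattern, host));
}
```

Comparing against the suffix with its leading dot is what keeps a look-alike such as `evilnpmjs.org` from matching `*.npmjs.org`.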
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-snapshots/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-snapshots/spec.md
@@ -0,0 +1,59 @@
+## ADDED Requirements
+
+### Requirement: Snapshot creation
+
+The system SHALL provide `sandbox.snapshot()` to create a point-in-time filesystem snapshot.
+
+Parameters:
+- `expiration?: number` — TTL in milliseconds (0 for no expiration)
+
+#### Scenario: Snapshot stops the session and returns Snapshot instance
+
+- **WHEN** `sandbox.snapshot()` is called on a running sandbox
+- **THEN** the current session SHALL be stopped and a `Snapshot` SHALL be returned
+
+### Requirement: Snapshot retrieval and listing
+
+The system SHALL provide `Snapshot.get()`, `Snapshot.list()`, and `Snapshot.tree()` for managing snapshots.
+
+#### Scenario: Retrieve snapshot by ID
+
+- **WHEN** `Snapshot.get({ snapshotId: "snap_abc" })` is called
+- **THEN** it SHALL return the snapshot with matching ID
+
+#### Scenario: List snapshots with pagination
+
+- **WHEN** `Snapshot.list({ name: "my-sandbox" })` is called
+- **THEN** it SHALL return a paginated list of snapshots for that sandbox
+
+#### Scenario: Ancestry tree is accessible
+
+- **WHEN** `Snapshot.tree({ snapshotId: "snap_abc" })` is called
+- **THEN** it SHALL return the ancestry tree of the snapshot
+
+### Requirement: Snapshot deletion
+
+The system SHALL provide `snapshot.delete()` to remove a snapshot.
+
+#### Scenario: Deleted snapshot is no longer listable
+
+- **WHEN** `snapshot.delete()` is called and then `Snapshot.list()` is called
+- **THEN** the deleted snapshot SHALL no longer appear in the list
+
+### Requirement: Snapshot-based sandbox creation
+
+The system SHALL support creating sandboxes from snapshots via `Sandbox.create({ source: { type: "snapshot", snapshotId } })`.
+
+#### Scenario: Sandbox created from snapshot has matching filesystem
+
+- **WHEN** a sandbox is created from a snapshot source, a file is written to it, a new snapshot is taken, and another sandbox is created from that new snapshot
+- **THEN** the second sandbox SHALL contain the file from the first
+
+### Requirement: Snapshot retention
+
+The system SHALL support `keepLastSnapshots` retention policy on sandboxes.
+
+#### Scenario: Retention evicts oldest snapshots
+
+- **WHEN** a sandbox has `keepLastSnapshots: { count: 3 }` and a 4th snapshot is created
+- **THEN** the oldest snapshot SHALL be evicted
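A minimal sketch of the keep-last-N retention rule, assuming snapshots are ordered by a creation timestamp. `applyRetention` and the `createdAt` field are hypothetical.

```typescript
interface SnapshotRecord {
  id: string;
  createdAt: number; // epoch milliseconds (assumed field)
}

// Splits snapshots into those kept and those evicted under a keep-last-N policy.
// Input order does not matter; the input array is not modified.
function applyRetention(
  snapshots: SnapshotRecord[],
  count: number,
): { kept: SnapshotRecord[]; evicted: SnapshotRecord[] } {
  const newestFirst = snapshots.slice().sort((a, b) => b.createdAt - a.createdAt);
  return { kept: newestFirst.slice(0, count), evicted: newestFirst.slice(count) };
}

const retention = applyRetention(
  [
    { id: "snap_1", createdAt: 1000 },
    { id: "snap_2", createdAt: 2000 },
    { id: "snap_3", createdAt: 3000 },
    { id: "snap_4", createdAt: 4000 },
  ],
  3,
);
```

With `count: 3` and four snapshots, only the oldest one ends up in `evicted`, matching the scenario above.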
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/state-graph/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/state-graph/spec.md
@@ -0,0 +1,43 @@
+## ADDED Requirements
+
+### Requirement: State definition via annotations
+
+The system SHALL provide an `Annotation` API for defining graph state schemas:
+
+- `Annotation<T>(reducer?)` — creates a state key with optional reducer
+- `Annotation.Root({ key: Annotation<T> })` — combines keys into a state schema
+- Reducers: `LastValue` (default — overwrite), `BinaryOperator` (custom merge function)
+
+#### Scenario: Annotation.Root defines typed state
+
+- **WHEN** `const State = Annotation.Root({ messages: Annotation<string[]>(addMessages), step: Annotation<number>() })` is defined
+- **THEN** `State` SHALL have `State`, `Update`, and `Node` type members
+
+#### Scenario: LastValue rejects multiple writes in the same step
+
+- **WHEN** the writes `{ step: 2 }` and `{ step: 3 }` both target the LastValue key `step` in the same step
+- **THEN** the LastValue channel SHALL throw an `InvalidUpdateError` (a single write per step overwrites the previous value)
+
+#### Scenario: BinaryOperator reducer accumulates
+
+- **WHEN** a node returns `{ messages: ["hello"] }` and another returns `{ messages: ["world"] }` with an `addMessages` reducer
+- **THEN** the final state SHALL contain `messages: ["hello", "world"]`
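
The two reducer behaviours above can be sketched without the graph runtime. The code below is a minimal model under stated assumptions: `applyLastValue`, `applyBinaryOperator`, and the local `InvalidUpdateError` class are illustrative stand-ins, and plain array concatenation stands in for `addMessages`.

```typescript
// Local stand-in for the error named in the spec.
class InvalidUpdateError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "InvalidUpdateError";
  }
}

// LastValue: at most one write per step; that write overwrites the old value.
function applyLastValue<T>(current: T | undefined, writes: T[]): T | undefined {
  if (writes.length === 0) return current;
  if (writes.length > 1) {
    throw new InvalidUpdateError("LastValue received multiple writes in one step");
  }
  return writes[0];
}

// BinaryOperator: fold every write of the step into the current value.
function applyBinaryOperator<T>(
  current: T,
  writes: T[],
  reducer: (left: T, right: T) => T,
): T {
  return writes.reduce(reducer, current);
}

// Assumption: concatenation is used here in place of `addMessages`.
const concatMessages = (left: string[], right: string[]): string[] => [...left, ...right];

const mergedMessages = applyBinaryOperator<string[]>([], [["hello"], ["world"]], concatMessages);
const nextStep = applyLastValue<number>(1, [2]);
console.log(mergedMessages, nextStep);
```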
+
+### Requirement: StateGraph builder
+
+The system SHALL provide a `StateGraph` class for constructing stateful agent graphs.
+
+#### Scenario: StateGraph is constructed with state schema
+
+- **WHEN** `new StateGraph({ stateSchema: State })` is called
+- **THEN** the graph SHALL accept nodes that receive and can update the defined state
+
+#### Scenario: Nodes can read and write state
+
+- **WHEN** a node function receives state with `{ messages, step }` and returns `{ step: step + 1 }`
+- **THEN** the graph SHALL update `step` and preserve `messages`
+
+#### Scenario: Conditional edges route based on state
+
+- **WHEN** `addConditionalEdges("node_a", (state) => state.step > 5 ? "end" : "node_b")` is added
+- **THEN** execution SHALL route based on the state value at runtime
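
To make the partial-update and routing scenarios concrete, here is a small sequential runner. It is a sketch, not the `StateGraph` implementation: `runGraph`, `GraphState`, the `maxSteps` guard, and the `node_b` back-edge are assumptions made for the example, and `"end"` is treated as the terminal target as in the scenario above.

```typescript
type GraphState = { messages: string[]; step: number };
type NodeFn = (state: GraphState) => Partial<GraphState>;
type Router = (state: GraphState) => string;

// Run one node at a time, merging each partial update into the state
// and asking the node's router (if any) where to go next.
function runGraph(
  nodes: Record<string, NodeFn>,
  routers: Record<string, Router>,
  entry: string,
  initial: GraphState,
  maxSteps = 100, // hypothetical guard against endless loops
): GraphState {
  let state = initial;
  let current = entry;
  for (let i = 0; i < maxSteps && current !== "end"; i++) {
    const node = nodes[current];
    if (!node) throw new Error(`Unknown node: ${current}`);
    // Keys the node does not return (here `messages`) are preserved.
    state = { ...state, ...node(state) };
    const router = routers[current];
    current = router ? router(state) : "end";
  }
  return state;
}

const increment: NodeFn = (state) => ({ step: state.step + 1 });

const finalState = runGraph(
  { node_a: increment, node_b: increment },
  {
    node_a: (state) => (state.step > 5 ? "end" : "node_b"),
    node_b: () => "node_a",
  },
  "node_a",
  { messages: ["hi"], step: 0 },
);
console.log(finalState);
```

The router is evaluated against the state after the node's update has been merged, which is what lets `node_a` observe its own increment.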
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/trajectory-eval/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/trajectory-eval/spec.md
@@ -0,0 +1,51 @@
+## ADDED Requirements
+
+### Requirement: Trajectory match evaluator
+
+The system SHALL provide `create_trajectory_match_evaluator()` that compares agent tool-call trajectories against reference trajectories.
+
+Parameters:
+- `trajectory_match_mode: "strict" | "unordered" | "subset" | "superset"` — matching strategy
+- `tool_args_match_mode: "exact" | "ignore" | "subset" | "superset"` — tool argument comparison
+- `tool_args_match_overrides?: Record<string, ToolArgsMatchMode | Callable | string[]>` — per-tool custom matching
+
+#### Scenario: Strict mode requires exact order
+
+- **WHEN** output trajectory has tool calls `[A, B]` and reference is `[A, B]`
+- **THEN** strict mode SHALL return score `true`
+- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
+- **THEN** strict mode SHALL return score `false`
+
+#### Scenario: Unordered mode ignores order
+
+- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
+- **THEN** unordered mode SHALL return score `true`
+
+#### Scenario: Subset mode accepts partial trajectory
+
+- **WHEN** output trajectory has tool calls `[A]` and reference is `[A, B]`
+- **THEN** subset mode SHALL return score `true`
+
+#### Scenario: Superset mode allows extra tool calls
+
+- **WHEN** output trajectory has tool calls `[A, B, C]` and reference is `[A, B]`
+- **THEN** superset mode SHALL return score `true`
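
The four matching modes above can be sketched over tool names alone. This is an illustrative model, not the evaluator: `matchTrajectory` and `containsAll` are hypothetical helpers, argument comparison is ignored, and duplicate tool calls are counted (multiset containment), which the spec does not state.

```typescript
type TrajectoryMatchMode = "strict" | "unordered" | "subset" | "superset";

// True when every name in `inner` also occurs in `outer`,
// counting duplicates (assumption: multiset semantics).
function containsAll(outer: string[], inner: string[]): boolean {
  const remaining = [...outer];
  for (const name of inner) {
    const index = remaining.indexOf(name);
    if (index === -1) return false;
    remaining.splice(index, 1); // consume one occurrence
  }
  return true;
}

function matchTrajectory(
  output: string[],
  reference: string[],
  mode: TrajectoryMatchMode,
): boolean {
  switch (mode) {
    case "strict": // same calls in the same order
      return (
        output.length === reference.length &&
        output.every((name, i) => name === reference[i])
      );
    case "unordered": // same calls, any order
      return output.length === reference.length && containsAll(reference, output);
    case "subset": // output calls are all found in the reference
      return containsAll(reference, output);
    case "superset": // output covers every reference call, extras allowed
      return containsAll(output, reference);
  }
}

console.log(matchTrajectory(["B", "A"], ["A", "B"], "unordered"));
```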
+
+#### Scenario: Tool args ignore mode skips argument comparison
+
+- **WHEN** `tool_args_match_mode="ignore"` is set
+- **THEN** tool calls SHALL match regardless of their arguments
+
+#### Scenario: Custom tool arg matcher is used
+
+- **WHEN** `tool_args_match_overrides` contains a `Callable` for a tool name
+- **THEN** that callable SHALL be invoked to compare the tool's arguments
+
+### Requirement: Trajectory LLM-as-judge
+
+The system SHALL provide `create_trajectory_llm_as_judge()` that uses an LLM to grade trajectory quality and accuracy.
+
+#### Scenario: Trajectory is formatted as XML for LLM
+
+- **WHEN** an LLM trajectory evaluator is invoked
+- **THEN** the trajectory SHALL be formatted as XML with `<role>`, `<tool_call>`, `<tool_result>` elements
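
One possible shape for that XML is sketched below. Only the `<role>`, `<tool_call>`, and `<tool_result>` element names come from the scenario above. The `TrajectoryEvent` type, the `<trajectory>`, `<message>`, and `<content>` wrappers, the `name` attribute, and `formatTrajectoryAsXml` itself are assumptions made for the example.

```typescript
// Hypothetical event model for one trajectory.
type TrajectoryEvent =
  | { kind: "message"; role: string; content: string }
  | { kind: "tool_call"; name: string; args: Record<string, unknown> }
  | { kind: "tool_result"; name: string; content: string };

// Escape the characters that would break element or attribute content.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function formatTrajectoryAsXml(events: TrajectoryEvent[]): string {
  const lines = events.map((event) => {
    switch (event.kind) {
      case "message":
        return `  <message><role>${escapeXml(event.role)}</role><content>${escapeXml(event.content)}</content></message>`;
      case "tool_call":
        // Arguments are serialized as JSON text inside the element.
        return `  <tool_call name="${escapeXml(event.name)}">${escapeXml(JSON.stringify(event.args))}</tool_call>`;
      case "tool_result":
        return `  <tool_result name="${escapeXml(event.name)}">${escapeXml(event.content)}</tool_result>`;
    }
  });
  return ["<trajectory>", ...lines, "</trajectory>"].join("\n");
}

const trajectoryXml = formatTrajectoryAsXml([
  { kind: "message", role: "user", content: "Is 1 < 2?" },
  { kind: "tool_call", name: "compare", args: { left: 1, right: 2 } },
  { kind: "tool_result", name: "compare", content: "true" },
]);
console.log(trajectoryXml);
```

Escaping matters here because message text and tool output are untrusted and would otherwise be able to inject or close elements in the prompt given to the judge.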