chore(openspec): drop 9 superseded proposals + 11 stub archive files

Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
2026-06-07 22:15:38 +00:00
parent 0d6e9a2413
commit c935687725
119 changed files with 4897 additions and 45 deletions
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/.openspec.yaml
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/.openspec.yaml
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-06-07
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/design.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/design.md
@@ -0,0 +1,76 @@
+## Context
+
+This design defines a unified Agent Evaluation & Execution Runtime combining three subsystems inspired by OpenEvals, Vercel Sandbox, and langgraphjs. The system is a TypeScript monorepo with four packages:
+
+- **`@agent-runtime/core`** — Shared types, serialization protocol, provider abstraction
+- **`@agent-runtime/eval`** — LLM-as-judge, trajectory, code correctness, multi-turn sim, prompt library
+- **`@agent-runtime/sandbox`** — Remote sandbox lifecycle, command execution, filesystem, snapshots, network policy
+- **`@agent-runtime/graph`** — Stateful graph, Pregel execution, checkpoints, interrupts, streaming
+
+Each package is independently usable but designed to compose: evals run code in sandboxes, sandbox lifecycles are orchestrated by graphs, and graph nodes can be evaluated by evals.
+
+## Goals / Non-Goals
+
+**Goals:**
+- Zero required runtime dependencies for eval core (optional providers via adapter pattern)
+- Sandbox abstraction that works with any provider (Vercel, Fly, custom) via APIClient interface
+- Graph execution with pluggable checkpointers (in-memory, SQLite, Redis, Postgres)
+- All three subsystems share a common serialization protocol for cross-persistence
+- Evaluation can target code running inside sandbox instances
+- Graph nodes can suspend/resume via interrupts with persistent checkpointing
+
+**Non-Goals:**
+- Not a replacement for LangChain/LlamaIndex — no integrations with existing frameworks in v1
+- Not a general-purpose workflow engine — focused on agent/task orchestration patterns
+- No UI or dashboard in v1 — CLI and programmatic API only
+- No Python SDK in v1 — TypeScript-first, Python planned
+
+## Decisions
+
+### D1: Package Architecture — `core` + 3 domain packages
+- **Rationale**: Eval, Sandbox, and Graph have zero overlap in concerns but share types (serialization, error handling, config). A shared core avoids circular deps and keeps each package lightweight.
+- **Alternatives considered**: Monolithic single package — rejected because users may want only one subsystem.
+
+### D2: Eval Factory Pattern (from OpenEvals)
+- **Rationale**: OpenEvals' `create_llm_as_judge(prompt, model, ...)` returning a callable is elegant — the evaluator is a function, not a class. Users compose evaluators into test suites. This pattern is preserved exactly.
+- **Deviation**: Drop LangChain dependency. Use a minimal `ModelClient` protocol (like OpenEvals' `ModelClient` protocol) instead of `BaseChatModel`. Users pass an OpenAI-compatible client or a custom adapter.
+
+### D3: Sandbox as API Wrapper (from Vercel Sandbox)
+- **Rationale**: The Vercel Sandbox `Sandbox` class cleanly separates the **Sandbox** (persistent config) from **Session** (running VM). `Sandbox.create()` → VM, `sandbox.runCommand()` → execute, `sandbox.fs` → filesystem. This maps naturally to any provider with Firecracker/kata-containers.
+- **Deviation**: Abstract `APIClient` behind `SandboxProvider` interface so multiple backends can be plugged in. The `"use step"` Vercel compiler directive is replaced with explicit serialization methods.
+
+### D4: Graph as Pregel + Checkpointer (from langgraphjs)
+- **Rationale**: The superstep-based Pregel engine with typed channels is a proven pattern for stateful agent graphs. Separating graph definition (`StateGraph`) from execution (`Pregel.compile()`) is the right abstraction.
+- **Deviation**: Drop `@langchain/core/runnables` dependency. Define `Runnable` as a minimal interface (invoke, stream only). Use native `Promise` concurrency instead of LangChain callback system.
+
+### D5: Interrupt/Resume via Checkpoint (from langgraphjs)
+- **Rationale**: `interrupt()` throwing a typed error that's caught by the execution loop, persisted to checkpoints, and resumed via `Command({resume: ...})` is the cleanest HITL pattern.
+- **Deviation**: Simplify to a single `GraphInterrupt` error type. No scratchpad — just a sequential interrupt index stored in checkpoint metadata.
+
+### D6: Serialization Protocol
+- **Rationale**: Vercel Sandbox's `WORKFLOW_SERIALIZE`/`WORKFLOW_DESERIALIZE` pattern enables cross-session persistence. We adopt an instance `toJSON()` method and a static `fromJSON()` method on all stateful types.
+- **Channels** → serialized as plain objects.
+- **Checkpoints** → serialized as versioned JSON with hash verification.
+
+### D7: Filesystem API over Shell Commands (from Vercel Sandbox)
+- **Rationale**: Vercel's `FileSystem` class implements the full `node:fs/promises` API by running shell commands (`stat`, `find`, `mkdir`, etc.) inside the sandbox. This is pragmatic and avoids building a special FS protocol.
+- **Limitation**: Stat parsing from shell output is fragile. Mitigate with structured output format (JSON + delimiter parsing).
+
+### D8: Network Policy as TypeScript Types (from Vercel Sandbox)
+- **Rationale**: The `NetworkPolicy` union type (`"allow-all" | "deny-all" | { allow: ... }`) maps directly to firewall rules. It's declarative, serializable, and provider-agnostic.
+- **Extension**: Add `tls` and `rateLimit` options beyond what Vercel provides.
+
+## Risks / Trade-offs
+
+- **[Risk] Provider coupling for sandbox**: Abstracting `SandboxProvider` might leak provider-specific features. **Mitigation**: Define the interface minimally (CRUD + exec + fs); provider-specific features are accessed via `(sandbox as any)` escape hatch.
+- **[Risk] Pregel complexity**: The superstep execution model is sophisticated (~2700 lines in langgraphjs). **Mitigation**: Start with sequential execution, add parallelism as optimization. The channel model stays from day one.
+- **[Risk] Eval without LangChain**: Dropping LangChain means reimplementing structured output parsing (`with_structured_output`). **Mitigation**: Target OpenAI-compatible APIs first (they support `response_format: json_schema` natively). Add generic Zod/json-schema path for other providers.
+- **[Trade-off] TypeScript-first**: Python users of OpenEvals patterns won't get a direct migration path. **Mitigation**: The eval prompt templates are language-agnostic strings; the core logic is portable.
+- **[Trade-off] Monorepo overhead**: Four packages with shared config. **Mitigation**: Use minimal workspaces (pnpm/turbo), keep build config shared.
+
+## Open Questions
+
+- Should the sandbox provider interface include a `createCheckpoint`/`restoreCheckpoint` for VM-level snapshots, or should that be graph-layer only?
+- What's the minimum Node.js version? Node 20+ for `AsyncDisposable` support (used in Sandbox lifecycle).
+- Should the eval prompt library ship as part of `@agent-runtime/eval` or as a separate `@agent-runtime/prompts` package?
+- How should eval results feed back into graph state? E.g., a "code correctness eval" runs inside a graph node, and the score influences routing.
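One way to read the D6 line about checkpoints being "versioned JSON with hash verification" is sketched below. The envelope shape, the `CHECKPOINT_VERSION` constant, the choice of SHA-256, and both function names are assumptions made for illustration; the design does not specify any of them.

```typescript
import { createHash } from "node:crypto";

// Hypothetical envelope: the design only says "versioned JSON with hash verification".
interface CheckpointEnvelope {
  version: number;
  hash: string;    // hex SHA-256 of `payload` (assumed algorithm)
  payload: string; // JSON-encoded checkpoint state
}

const CHECKPOINT_VERSION = 1;

function sha256(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

function serializeCheckpoint(state: unknown): string {
  const payload = JSON.stringify(state);
  const envelope: CheckpointEnvelope = {
    version: CHECKPOINT_VERSION,
    hash: sha256(payload),
    payload,
  };
  return JSON.stringify(envelope);
}

function deserializeCheckpoint(text: string): unknown {
  const envelope = JSON.parse(text) as CheckpointEnvelope;
  if (envelope.version !== CHECKPOINT_VERSION) {
    throw new Error(`unsupported checkpoint version: ${envelope.version}`);
  }
  // Hash the stored payload string itself, so key ordering never matters.
  if (sha256(envelope.payload) !== envelope.hash) {
    throw new Error("checkpoint hash mismatch");
  }
  return JSON.parse(envelope.payload);
}
```

Storing the payload as a string inside the envelope avoids having to canonicalize JSON before hashing, at the cost of double encoding.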
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/proposal.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/proposal.md
@@ -0,0 +1,44 @@
+## Why
+
+Building reliable LLM agents requires three tightly integrated subsystems that today exist in separate ecosystems with incompatible interfaces: **evaluation** (OpenEvals), **sandboxed execution** (Vercel Sandbox), and **graph-based orchestration** (langgraphjs). There is no unified runtime that combines all three — meaning every team building production agents must stitch together incompatible libraries, resulting in fragile eval pipelines, uncontained code execution, and ad-hoc workflow orchestration.
+
+This change proposes **a unified Agent Evaluation & Execution Runtime** that combines patterns from all three into a single, consistent system.
+
+## What Changes
+
+- **New `@agent-runtime/eval` package**: LLM-as-judge evaluator, trajectory evaluators, code correctness (static analysis + sandboxed execution), multi-turn simulation, and 20+ built-in eval prompt templates — extracted from OpenEvals without the LangChain dependency
+- **New `@agent-runtime/sandbox` package**: Remote sandbox provisioning with Firecracker MicroVM isolation, command execution (sync + detached + streaming), POSIX filesystem API, snapshot/checkpoint lifecycle, and network policy control — generalized from Vercel Sandbox patterns
+- **New `@agent-runtime/graph` package**: Stateful agent graph with Pregel-style superstep execution, typed channel-based state management, pluggable checkpointer, human-in-the-loop interrupts, and multi-mode streaming — inspired by langgraphjs
+- **New `@agent-runtime/core` package**: Shared types, serialization protocol, configuration system, and provider abstraction layer used by all three subsystems
+- **Integration wiring**: Eval suite can execute code in sandbox, sandbox lifecycle is orchestrated by graph, graph nodes can be evaluated by evals — forming a virtuous cycle
+
+## Capabilities
+
+### New Capabilities
+
+- `llm-as-judge`: LLM-as-judge evaluator with structured output, continuous/binary/choices scoring, reasoning, few-shot examples, and multimodal attachments
+- `trajectory-eval`: Agent trajectory evaluation with strict/unordered/subset/superset matching and LLM-as-judge trajectory grading
+- `code-correctness-eval`: Code correctness evaluation via LLM judging, static analysis (Pyright/MyPy), and sandboxed execution
+- `multi-turn-simulation`: Multi-turn conversation simulation between an app and simulated users with trajectory evaluation
+- `eval-prompt-library`: Library of 20+ prompt templates across quality, RAG, safety, security, conversation, image, and voice domains
+- `sandbox-lifecycle`: Sandbox creation, retrieval, forking, and deletion with Firecracker MicroVM isolation
+- `sandbox-command-execution`: Command execution with blocking, detached, and streaming modes, timeout enforcement, and signal-based process control
+- `sandbox-filesystem`: POSIX filesystem operations (read/write/stat/ls/mkdir/cp/mv/rm/chmod/chown/symlink) via remote sandbox
+- `sandbox-snapshots`: Point-in-time sandbox snapshots with expiration, retention policies, and ancestry trees
+- `sandbox-network-policy`: Network access control with domain allow/deny, request transformers, and subnet rules
+- `state-graph`: Stateful graph with Annotation.Root state definition, typed reducers, and StateGraph builder
+- `pregel-execution`: Superstep-based execution engine with channel communication, parallel task execution, and checkpointing
+- `human-in-the-loop`: Node-level interrupts with resume values, multi-interrupt support, and Command-based graph resumption
+- `graph-streaming`: Multi-mode streaming protocol (events, messages, metadata) with envelope parsing
+
+### Modified Capabilities
+
+*None — this is a greenfield system.*
+
+## Impact
+
+- **New packages**: `@agent-runtime/core`, `@agent-runtime/eval`, `@agent-runtime/sandbox`, `@agent-runtime/graph`
+- **Languages**: TypeScript (all packages), Python support planned for eval package
+- **Dependencies**: Zero required runtime deps for core eval logic; optional checkpointer backends (SQLite, Redis, Postgres) for graph; sandbox requires HTTP client
+- **Target platforms**: Node.js 20+, edge-compatible for eval-only usage
+- **No existing code is modified** — this is purely additive
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/code-correctness-eval/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/code-correctness-eval/spec.md
@@ -0,0 +1,65 @@
+## ADDED Requirements
+
+### Requirement: Code LLM-as-judge
+
+The system SHALL provide `create_code_llm_as_judge()` that evaluates code correctness using an LLM, with code extraction from responses.
+
+Parameters:
+- `code_extraction_strategy: "none" | "llm" | "markdown_code_blocks"` — how to extract code from output
+- `code_extractor?: Callable` — custom extraction function
+
+#### Scenario: Markdown code block extraction
+
+- **WHEN** `code_extraction_strategy="markdown_code_blocks"` and output contains triple-backtick code blocks
+- **THEN** the evaluator SHALL extract code from those blocks before scoring
+
+#### Scenario: LLM-based code extraction
+
+- **WHEN** `code_extraction_strategy="llm"` and a `judge` is provided
+- **THEN** the evaluator SHALL use an LLM with `ExtractCode`/`NoCode` tools to extract code
+
+#### Scenario: No extraction returns raw output
+
+- **WHEN** `code_extraction_strategy="none"`
+- **THEN** the raw output string is passed directly to the scorer
+
+### Requirement: Static analysis evaluator (Pyright)
+
+The system SHALL provide `create_pyright_evaluator()` that runs Pyright static type checking on extracted Python code.
+
+Parameters:
+- `pyright_cli_args: string[]` — additional CLI flags
+- `code_extraction_strategy` / `code_extractor` — same as code LLM evaluator
+
+#### Scenario: Pyright detects type error
+
+- **WHEN** code with a type error (e.g., `x: int = "string"`) is evaluated
+- **THEN** the evaluator SHALL return score `false` with error details in `comment`
+
+#### Scenario: Pyright passes clean code
+
+- **WHEN** valid Python code is evaluated
+- **THEN** the evaluator SHALL return score `true`
+
+### Requirement: Static analysis evaluator (Mypy)
+
+The system SHALL provide `create_mypy_evaluator()` with equivalent behavior to Pyright evaluator but using the Mypy type checker.
+
+#### Scenario: Mypy detects type error
+
+- **WHEN** code with an unannotated function returning mismatched types is evaluated
+- **THEN** the evaluator SHALL return score `false`
+
+### Requirement: Sandboxed code execution
+
+The system SHALL provide `create_e2b_execution_evaluator()` that executes code in a sandbox and checks for runtime errors.
+
+#### Scenario: Code executes without errors
+
+- **WHEN** valid Python code runs in the sandbox
+- **THEN** the evaluator SHALL return score `true`
+
+#### Scenario: Code raises runtime exception
+
+- **WHEN** code that raises an exception is executed
+- **THEN** the evaluator SHALL return score `false` with error details
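The `markdown_code_blocks` strategy from the first requirement could be approximated as follows. The function name `extractMarkdownCodeBlocks`, joining multiple blocks with a blank line, and returning an empty string when no block is found are assumptions; the spec only says that code is extracted from triple-backtick blocks before scoring.

```typescript
// Built at runtime so this example can itself live inside a fenced block.
const FENCE = "`".repeat(3);

// Hypothetical helper: collects the bodies of all fenced blocks in `output`.
function extractMarkdownCodeBlocks(output: string): string {
  // Opening fence, optional info string (e.g. "python"), newline, lazy body, closing fence.
  const pattern = new RegExp(`${FENCE}[^\\n]*\\n([\\s\\S]*?)${FENCE}`, "g");
  const blocks: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(output)) !== null) {
    blocks.push(match[1].trimEnd());
  }
  return blocks.join("\n\n");
}
```

A regex is enough for well-formed model output; nested or unterminated fences would need a real Markdown parser.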
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/core-types/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/core-types/spec.md
@@ -0,0 +1,31 @@
+## ADDED Requirements
+
+### Requirement: Shared type system
+
+The system SHALL define a shared set of types used by all packages:
+- `EvaluatorResult` — TypedDict with `key: string`, `score: number | boolean`, `comment?: string`, `metadata?: Record<string, unknown>`, `source_run_id?: string`
+- `ModelClient` — Protocol with `chat.completions.create()` for LLM access
+- `SandboxProvider` — Interface for provider-agnostic sandbox creation/management
+- `Checkpointer` — Interface for checkpoint persistence
+- `Serializable` — Interface requiring `toJSON()` and static `fromJSON()` methods
+- All evaluators SHALL accept a consistent call signature: `(inputs?, outputs, reference_outputs?, **kwargs)`
+- Error types: `GraphInterrupt`, `SandboxError`, `EvalError`
+
+#### Scenario: EvaluatorResult conforms to schema
+
+- **WHEN** an evaluator returns a result
+- **THEN** the result SHALL conform to `EvaluatorResult` with at least `key` and `score`
+
+#### Scenario: All stateful objects are serializable
+
+- **WHEN** a `Sandbox`, `Snapshot`, or `Command` instance is serialized via `toJSON()`
+- **THEN** a subsequent `fromJSON()` call SHALL reconstruct an equivalent instance
+
+### Requirement: Serialization protocol
+
+All stateful objects (`Sandbox`, `Session`, `Command`, `Snapshot`, `GraphState`) SHALL implement an instance `toJSON()` method and a static `fromJSON()` method for cross-session persistence.
+
+#### Scenario: Round-trip serialization preserves identity
+
+- **WHEN** an object is serialized and deserialized
+- **THEN** the deserialized object SHALL have matching identity fields (`id`, `name`, `sessionId`)
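The round-trip scenario above can be illustrated with a small class. This is a sketch only: the `SnapshotRecord` class, its JSON shape, and the choice of exactly three fields are assumptions taken from the identity fields named in the scenario (`id`, `name`, `sessionId`), not from any real implementation.

```typescript
// Hypothetical wire format for a snapshot.
interface SnapshotJSON {
  id: string;
  name: string;
  sessionId: string;
}

class SnapshotRecord {
  id: string;
  name: string;
  sessionId: string;

  constructor(id: string, name: string, sessionId: string) {
    this.id = id;
    this.name = name;
    this.sessionId = sessionId;
  }

  // Instance method: JSON.stringify() calls this automatically.
  toJSON(): SnapshotJSON {
    return { id: this.id, name: this.name, sessionId: this.sessionId };
  }

  // Static method: rebuilds an equivalent instance from plain data.
  static fromJSON(json: SnapshotJSON): SnapshotRecord {
    return new SnapshotRecord(json.id, json.name, json.sessionId);
  }
}
```

Because `toJSON()` is an instance method, `JSON.stringify(snapshot)` needs no special handling; only the reverse direction requires the explicit static call.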
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/eval-prompt-library/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/eval-prompt-library/spec.md
@@ -0,0 +1,49 @@
+## ADDED Requirements
+
+### Requirement: Built-in evaluation prompt templates
+
+The system SHALL ship with a library of prompt templates organized by domain, ready for use with `create_llm_as_judge()`.
+
+Domains and included prompts:
+
+**Quality:**
+- `CORRECTNESS_PROMPT` — factual accuracy and completeness
+- `CONCISENESS_PROMPT` — concise responses without hedging or fluff
+- `HALLUCINATION_PROMPT` — claims verifiable from context
+- `ANSWER_RELEVANCE_PROMPT` — output addresses the input question
+- `PLAN_ADHERENCE_PROMPT` — agent actions match declared plan
+- `LAZINESS_PROMPT` — detects blank or low-effort responses
+
+**RAG:**
+- `RAG_GROUNDEDNESS_PROMPT` — output claims supported by retrieved context
+- `RAG_HELPFULNESS_PROMPT` — output addresses core question
+- `RAG_RETRIEVAL_RELEVANCE_PROMPT` — retrieved context is relevant to input
+
+**Safety:**
+- `TOXICITY_PROMPT` — personal attacks, hate speech
+- `FAIRNESS_PROMPT` — stereotyping, discrimination
+
+**Security:**
+- `PII_LEAKAGE_PROMPT` — names, contact info, credentials in output
+- `PROMPT_INJECTION_PROMPT` — delimiter manipulation, roleplay bypass
+- `CODE_INJECTION_PROMPT` — SQL injection, XSS, path traversal
+
+**Trajectory:**
+- `TRAJECTORY_ACCURACY_PROMPT` — logical progression, goal alignment
+- `TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE` — semantically equivalent to reference
+- `TOOL_SELECTION_PROMPT` — right tools, right order, no redundant calls
+
+**Conversation:**
+- `USER_SATISFACTION_PROMPT` — gratitude, resolution, engagement
+- `TASK_COMPLETION_PROMPT` — was the user's goal achieved
+- `AGENT_TONE_PROMPT` — appropriate tone and professionalism
+
+#### Scenario: Each prompt is a string with {inputs}, {outputs}, {reference_outputs} placeholders
+
+- **WHEN** a prompt template is inspected
+- **THEN** it SHALL be a string compatible with `str.format()` containing at least `{outputs}`
+
+#### Scenario: Prompt templates follow rubric structure
+
+- **WHEN** a prompt template is read
+- **THEN** it SHALL contain `<Rubric>`, `<Instructions>`, and `<Reminder>` XML sections
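Since the templates are described as `str.format()`-compatible strings, a TypeScript port needs its own placeholder substitution. A minimal sketch, assuming a hypothetical `formatPrompt` helper and an invented example template (neither is part of the spec):

```typescript
// Hypothetical helper: replaces {name} placeholders with supplied values.
// Unknown placeholders are left untouched; Python's str.format() would raise instead.
function formatPrompt(template: string, values: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (whole: string, key: string) =>
    Object.prototype.hasOwnProperty.call(values, key) ? values[key] : whole
  );
}

// Invented template that follows the rubric structure required above.
const EXAMPLE_PROMPT = [
  "<Rubric>A correct answer is factually accurate.</Rubric>",
  "<Instructions>Grade the output: {outputs}</Instructions>",
  "<Reminder>Judge only factual accuracy.</Reminder>",
].join("\n");

const renderedPrompt = formatPrompt(EXAMPLE_PROMPT, { outputs: "Paris is in France." });
```

Leaving unknown placeholders in place makes templates with optional `{inputs}` or `{reference_outputs}` usable without special cases, though a strict mode that throws may be preferable for catching typos.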
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/graph-streaming/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/graph-streaming/spec.md
@@ -0,0 +1,49 @@
+## ADDED Requirements
+
+### Requirement: Stream modes
+
+The system SHALL support multiple stream modes when invoking a compiled graph:
+
+- `"values"` — emits the full state after each superstep
+- `"updates"` — emits only the state changes after each superstep
+- `"messages"` — emits individual message chunks for chat-oriented graphs
+- `"debug"` — emits debug events with full superstep information
+- `"custom"` — supports user-defined events via an emit function
+
+#### Scenario: Values mode emits full state
+
+- **WHEN** a graph is streamed with `streamMode: ["values"]`
+- **THEN** each chunk SHALL contain the complete state object after each superstep
+
+#### Scenario: Updates mode emits diffs
+
+- **WHEN** a graph is streamed with `streamMode: ["updates"]`
+- **THEN** each chunk SHALL contain only the state keys that changed
+
+### Requirement: Stream event protocol
+
+The system SHALL emit structured events during graph execution, including:
+- `on_chain_start` — node execution begins
+- `on_chain_end` — node execution completes
+- `on_chain_stream` — intermediate output from a node
+- `on_custom_event` — user-defined events
+- Checkpoint metadata paired with each event (id, parent_id, step, source)
+
+#### Scenario: Events include checkpoint metadata
+
+- **WHEN** a stream event is received
+- **THEN** it SHALL include a `checkpoint` envelope with `id`, `step`, and `source`
+
+#### Scenario: Custom events propagate from nodes
+
+- **WHEN** a node emits a custom event via an emit function
+- **THEN** that event SHALL appear in the stream with type `on_custom_event`
+
+### Requirement: Async iteration over streams
+
+The system SHALL support `for await...of` iteration over graph streams.
+
+#### Scenario: Stream is async iterable
+
+- **WHEN** `for await (const chunk of graph.stream(...))` is used
+- **THEN** each chunk SHALL be available as it is produced
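The difference between the `"values"` and `"updates"` modes comes down to what is emitted after each superstep. A sketch under assumptions: the `toChunk` name, the flat `GraphStateMap` type, and shallow `!==` comparison of keys are all illustrative choices, and the real streaming API is asynchronous rather than a plain function.

```typescript
type GraphStateMap = Record<string, unknown>;
type ChunkMode = "values" | "updates";

// Hypothetical helper: computes the chunk to emit after one superstep.
function toChunk(mode: ChunkMode, previous: GraphStateMap, next: GraphStateMap): GraphStateMap {
  if (mode === "values") {
    return { ...next }; // full state after the superstep
  }
  // "updates": only keys whose value changed (shallow comparison, an assumption).
  const changed: GraphStateMap = {};
  for (const key of Object.keys(next)) {
    if (previous[key] !== next[key]) {
      changed[key] = next[key];
    }
  }
  return changed;
}
```

Shallow comparison treats a mutated array or object as unchanged, so a real engine would more likely track which channels were written during the step.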
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/human-in-the-loop/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/human-in-the-loop/spec.md
@@ -0,0 +1,56 @@
+## ADDED Requirements
+
+### Requirement: Node interrupt function
+
+The system SHALL provide an `interrupt(value)` function that pauses graph execution and returns a resume value when the graph is continued.
+
+#### Scenario: Interrupt pauses execution with value
+
+- **WHEN** a node calls `const approval = interrupt({ question: "Approve this action?" })`
+- **THEN** execution SHALL pause and the interrupt value SHALL be available in the stream output
+
+#### Scenario: Resume returns value to interrupt
+
+- **WHEN** the graph is resumed with `Command({ resume: "approved" })`
+- **THEN** the `interrupt()` call SHALL return `"approved"`
+
+#### Scenario: Multiple interrupts are supported
+
+- **WHEN** a node calls `interrupt()` twice
+- **THEN** each interrupt SHALL be resolved sequentially, requiring two resume commands
+
+### Requirement: Command-based graph resumption
+
+The system SHALL provide a `Command` class that supports:
+- `Command.RESUME` — resume value for pending interrupts
+- `Command.GOTO` — Send or node name for dynamic routing
+- `Command.PARENT` — bubble up to parent graph
+
+#### Scenario: Command with resume continues execution
+
+- **WHEN** `await graph.stream(new Command({ resume: "user input" }))` is called
+- **THEN** the interrupted node SHALL continue with the resume value
+
+#### Scenario: Command with goto routes dynamically
+
+- **WHEN** a node returns `new Command({ goto: "human_review" })`
+- **THEN** execution SHALL route to `human_review` node
+
+### Requirement: Automated interrupts at node boundaries
+
+The system SHALL support `interruptBefore` and `interruptAfter` in `compile()` options to automatically pause at specific nodes.
+
+#### Scenario: InterruptBefore pauses before node execution
+
+- **WHEN** `graph.compile({ interruptBefore: ["approval_node"] })` is used
+- **THEN** the graph SHALL pause just before executing `approval_node`
+
+### Requirement: State snapshots on interrupt
+
+When a graph uses a checkpointer, interrupt states SHALL be persisted so execution can be resumed across process boundaries.
+
+#### Scenario: Interrupted state is checkpointed
+
+- **WHEN** a graph with a checkpointer is interrupted
+- **THEN** the checkpoint SHALL contain the interrupt state
+- **THEN** restoring from that checkpoint SHALL yield the same interrupt state
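The interrupt and resume scenarios above can be sketched with a thrown error and a sequential interrupt index, as the design's D5 describes. Everything below is a simplified assumption: `runNode`, its result shape, and passing `interrupt` as an argument are hypothetical, and the sketch re-runs the node from the top on every resume rather than persisting anything to a checkpointer.

```typescript
class GraphInterrupt extends Error {
  value: unknown;
  constructor(value: unknown) {
    super("graph interrupted");
    this.name = "GraphInterrupt";
    this.value = value;
    // Keeps `instanceof` working when compiled to older JS targets.
    Object.setPrototypeOf(this, GraphInterrupt.prototype);
  }
}

type InterruptFn = (value: unknown) => unknown;
type NodeFn = (interrupt: InterruptFn) => unknown;
type RunResult =
  | { kind: "interrupted"; value: unknown }
  | { kind: "done"; output: unknown };

// Hypothetical runner: `resumeValues` holds one entry per interrupt already answered.
function runNode(node: NodeFn, resumeValues: unknown[]): RunResult {
  let index = 0; // sequential interrupt index
  const interrupt: InterruptFn = (value) => {
    if (index < resumeValues.length) {
      return resumeValues[index++]; // already answered: replay the resume value
    }
    throw new GraphInterrupt(value); // first unanswered interrupt pauses the node
  };
  try {
    return { kind: "done", output: node(interrupt) };
  } catch (err) {
    if (err instanceof GraphInterrupt) {
      return { kind: "interrupted", value: err.value };
    }
    throw err;
  }
}
```

A node that calls `interrupt()` twice therefore pauses twice and completes only once two resume values are available, which matches the multiple-interrupt scenario.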
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/llm-as-judge/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/llm-as-judge/spec.md
@@ -0,0 +1,55 @@
+## ADDED Requirements
+
+### Requirement: LLM-as-judge evaluator factory
+
+The system SHALL provide a `create_llm_as_judge()` function that creates an evaluator using an LLM to assess output quality.
+
+Parameters:
+- `prompt: string | Runnable | Callable` — evaluation prompt (string template with `{inputs}`, `{outputs}`, `{reference_outputs}`)
+- `judge?: ModelClient | BaseChatModel` — LLM client
+- `model?: string` — model identifier
+- `system?: string` — optional system message
+- `continuous: boolean = false` — float 0-1 scoring when true, boolean when false
+- `choices?: number[]` — specific enum float values for score
+- `use_reasoning: boolean = true` — include reasoning in output
+- `few_shot_examples?: FewShotExample[]` — example evaluations
+- `output_schema?: JSONSchema | ZodSchema` — custom structured output format
+
+#### Scenario: String prompt evaluator returns scored result
+
+- **WHEN** `create_llm_as_judge(prompt="Rate: {outputs}")` is invoked with `outputs="Hello world"`
+- **THEN** it SHALL return an `EvaluatorResult` with `key: "score"` and a valid `score`
+
+#### Scenario: Continuous scoring returns float
+
+- **WHEN** `create_llm_as_judge(prompt=..., continuous=true)` scores output
+- **THEN** the score SHALL be a float between 0.0 and 1.0
+
+#### Scenario: Choices scoring returns enum value
+
+- **WHEN** `create_llm_as_judge(prompt=..., choices=[0.0, 0.5, 1.0])` scores output
+- **THEN** the score SHALL be exactly one of the enumerated choices
+
+#### Scenario: Reasoning mode returns comment
+
+- **WHEN** `create_llm_as_judge(prompt=..., use_reasoning=true)` scores output
+- **THEN** the `comment` field SHALL contain the LLM's reasoning
+
+#### Scenario: Few-shot examples are appended to prompt
+
+- **WHEN** `few_shot_examples` are provided
+- **THEN** they SHALL be appended as `<example>` XML blocks to the last user message
+
+#### Scenario: Output schema returns structured dict
+
+- **WHEN** `output_schema` is provided (e.g., z.object({ quality: z.number() }))
+- **THEN** the evaluator SHALL return a dict matching that schema instead of `EvaluatorResult`
+
+### Requirement: Async LLM-as-judge
+
+The system SHALL provide `create_async_llm_as_judge()` with identical parameters, returning an async evaluator.
+
+#### Scenario: Async evaluator returns same structure as sync
+
+- **WHEN** `await` is used on an async evaluator invocation
+- **THEN** the result SHALL match the same structure as the sync equivalent
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/multi-turn-simulation/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/multi-turn-simulation/spec.md
@@ -0,0 +1,39 @@
+## ADDED Requirements
+
+### Requirement: Multi-turn conversation simulation
+
+The system SHALL provide `run_multiturn_simulation()` that simulates a multi-turn conversation between an app and a simulated user.
+
+Parameters:
+- `app: Callable[[ChatCompletionMessage], ChatCompletionMessage]` — the application under test
+- `user: Callable | string[]` — simulated user (dynamic or static responses)
+- `max_turns?: number` — maximum conversation turns
+- `trajectory_evaluators?: EvalFunction[]` — evaluators that assess the final trajectory
+- `stopping_condition?: Callable[[Message[], number], boolean]` — early termination
+- `reference_outputs?: unknown` — passed to evaluators
+
+#### Scenario: Static user responses drive conversation
+
+- **WHEN** `user=["Hello", "Tell me more", "Goodbye"]` with `max_turns=3`
+- **THEN** the simulation SHALL alternate between user responses and app responses for 3 turns
+
+#### Scenario: Dynamic simulated user adapts to context
+
+- **WHEN** `user` is a `Callable` receiving the current trajectory
+- **THEN** the user function SHALL receive the current conversation history and return the next message
+
+#### Scenario: Trajectory evaluators run after simulation
+
+- **WHEN** `trajectory_evaluators` are provided
+- **THEN** each evaluator SHALL receive the full conversation trajectory as `outputs`
+- **THEN** the simulation result SHALL include `evaluator_results` from each evaluator
+
+#### Scenario: Stopping condition terminates early
+
+- **WHEN** `stopping_condition` returns `true` before `max_turns`
+- **THEN** the simulation SHALL terminate immediately
+
+#### Scenario: Async simulation is supported
+
+- **WHEN** `run_multiturn_simulation_async()` is called with async `app` and `user` functions
+- **THEN** the simulation SHALL await each turn and return the same result structure
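A synchronous sketch of the simulation loop described above, covering static and dynamic users plus the stopping condition. The camelCase option names, the `ChatMessage` shape, the fallback of five turns for a dynamic user without `maxTurns`, and the omission of `trajectory_evaluators` are all assumptions made to keep the example short.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

interface SimulationOptions {
  app: (message: ChatMessage) => ChatMessage;
  user: string[] | ((trajectory: ChatMessage[]) => ChatMessage);
  maxTurns?: number;
  stoppingCondition?: (trajectory: ChatMessage[], turn: number) => boolean;
}

// Hypothetical TypeScript counterpart of run_multiturn_simulation().
function runMultiturnSimulation(options: SimulationOptions): ChatMessage[] {
  const { app, user, stoppingCondition } = options;
  // Assumed defaults: all static responses, or 5 turns for a dynamic user.
  const maxTurns = options.maxTurns ?? (Array.isArray(user) ? user.length : 5);
  const trajectory: ChatMessage[] = [];
  for (let turn = 0; turn < maxTurns; turn++) {
    let userMessage: ChatMessage;
    if (Array.isArray(user)) {
      if (turn >= user.length) break; // static responses exhausted
      userMessage = { role: "user", content: user[turn] };
    } else {
      userMessage = user(trajectory); // dynamic user sees the history so far
    }
    trajectory.push(userMessage);
    trajectory.push(app(userMessage));
    if (stoppingCondition && stoppingCondition(trajectory, turn + 1)) break;
  }
  return trajectory;
}
```

One turn here means one user message followed by one app response, so three turns yield six messages.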
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/pregel-execution/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/pregel-execution/spec.md
@@ -0,0 +1,49 @@
+## ADDED Requirements
+
+### Requirement: Pregel execution engine
+
+The system SHALL implement a Pregel-style superstep execution engine where:
+
+- Each "superstep" executes all ready nodes concurrently
+- Nodes communicate through typed channels (not direct function calls)
+- Channel writes from one superstep are visible as reads in the next
+- The engine supports `PULL` (edge-triggered) and `PUSH` (dynamic Send) task scheduling
+
+#### Scenario: Nodes execute in dependency order
+
+- **WHEN** node B subscribes to channel A
+- **THEN** node B SHALL execute in the superstep after node A writes to channel A
+
+#### Scenario: Concurrent nodes run in parallel
+
+- **WHEN** two nodes have no dependencies between them
+- **THEN** they SHALL execute concurrently within the same superstep
+
+#### Scenario: Dynamic Send spawns new node executions
+
+- **WHEN** a node calls `send("node_c", { ... })` via `Command`
+- **THEN** `node_c` SHALL be scheduled for execution in the current or next superstep
+
+### Requirement: Graph compilation
+
+The system SHALL provide `graph.compile()` that produces a runnable compiled graph.
+
+Parameters:
+- `checkpointer?: Checkpointer` — optional persistence
+- `interruptBefore?: string[]` — nodes to pause before
+- `interruptAfter?: string[]` — nodes to pause after
+- `name?: string` — graph name
+
+#### Scenario: Compiled graph can be invoked
+
+- **WHEN** `compiled_graph.invoke({ messages: [] })` is called
+- **THEN** it SHALL execute all nodes and return the final state
+
+### Requirement: Recursion limit
+
+The system SHALL enforce a configurable recursion limit to prevent infinite loops.
+
+#### Scenario: Exceeding recursion limit throws
+
+- **WHEN** a graph exceeds the recursion limit
+- **THEN** a `GraphRecursionError` SHALL be thrown
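The superstep rules above (writes become visible only in the next step, nodes are triggered by the channels they subscribe to, and a recursion limit bounds the loop) can be sketched in a few lines. This is a sequential toy engine under assumptions: `runPregel`, the `PregelNode` shape, last-write-wins channels, and the default limit of 25 are illustrative, and ready nodes run one after another rather than concurrently.

```typescript
interface PregelNode {
  id: string;
  subscribe: string[]; // channels whose updates trigger this node
  run: (read: (channel: string) => unknown) => Record<string, unknown>;
}

class GraphRecursionError extends Error {
  constructor(limit: number) {
    super(`recursion limit of ${limit} supersteps exceeded`);
    this.name = "GraphRecursionError";
    Object.setPrototypeOf(this, GraphRecursionError.prototype);
  }
}

function runPregel(
  nodes: PregelNode[],
  input: Record<string, unknown>,
  recursionLimit = 25 // assumed default
): Record<string, unknown> {
  const channels: Record<string, unknown> = { ...input };
  let updated = Object.keys(input);
  for (let step = 0; updated.length > 0; step++) {
    if (step >= recursionLimit) throw new GraphRecursionError(recursionLimit);
    // A node is ready if any channel it subscribes to was written last step.
    const ready = nodes.filter((n) => n.subscribe.some((c) => updated.indexOf(c) !== -1));
    // Buffer writes so every node in this superstep reads the previous step's values.
    const writes: Record<string, unknown> = {};
    for (const node of ready) {
      Object.assign(writes, node.run((c) => channels[c]));
    }
    Object.assign(channels, writes);
    updated = Object.keys(writes);
  }
  return channels;
}
```

Buffering writes until the end of the step is what makes the execution order of nodes inside one superstep irrelevant, which is also what allows them to be parallelized later.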
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-command-execution/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-command-execution/spec.md
@@ -0,0 +1,61 @@
+## ADDED Requirements
+
+### Requirement: Command execution (blocking)
+
+The system SHALL provide `sandbox.runCommand(cmd, args?, opts?)` that executes a command inside the sandbox and waits for completion.
+
+Parameters:
+- `cmd: string` — command to execute
+- `args?: string[]` — command arguments
+- `cwd?: string` — working directory
+- `env?: Record<string, string>` — per-command environment variables
+- `sudo?: boolean` — execute with root privileges
+- `timeoutMs?: number` — max execution time (SIGKILL on expiry)
+- `signal?: AbortSignal` — cancellation
+
+#### Scenario: Blocking runCommand returns finished result with exit code
+
+- **WHEN** `sandbox.runCommand("echo", ["hello"])` is called
+- **THEN** it SHALL return a `CommandFinished` instance with `exitCode: 0`
+
+#### Scenario: Command timeout kills process
+
+- **WHEN** `sandbox.runCommand("sleep", ["100"], { timeoutMs: 100 })` is executed
+- **THEN** it SHALL return a non-zero exit code after ~100ms
+
+#### Scenario: Stderr is captured separately
+
+- **WHEN** a command writes to both stdout and stderr
+- **THEN** `result.stdout()` and `result.stderr()` SHALL return their respective streams
+
+### Requirement: Detached command execution
+
+The system SHALL support `{ detached: true }` mode where `runCommand()` returns immediately with a live `Command` handle.
+
+#### Scenario: Detached command returns before completion
+
+- **WHEN** `sandbox.runCommand({ cmd: "sleep", args: ["5"], detached: true })` is called
+- **THEN** it SHALL return a `Command` instance immediately (before the process exits)
+
+#### Scenario: Detached command can be waited on
+
+- **WHEN** `command.wait()` is called on a detached command
+- **THEN** it SHALL return a `CommandFinished` when the process exits
+
+### Requirement: Command log streaming
+
+The system SHALL provide `command.logs()` as an async iterable of stdout/stderr log lines.
+
+#### Scenario: Logs stream output lines
+
+- **WHEN** `for await (const log of command.logs())` is iterated
+- **THEN** each `log` SHALL have `stream: "stdout" | "stderr"` and `data: string`
+
+### Requirement: Command kill
+
+The system SHALL provide `command.kill(signal?)` to send a POSIX signal to a running command.
+
+#### Scenario: Default kill sends SIGTERM
+
+- **WHEN** `command.kill()` is called without a signal
+- **THEN** SIGTERM SHALL be sent to the process
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-filesystem/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-filesystem/spec.md
@@ -0,0 +1,50 @@
+## ADDED Requirements
+
+### Requirement: Filesystem API matching node:fs/promises
+
+The system SHALL provide `sandbox.fs` implementing the Node.js `fs/promises` API:
+
+- `readFile(path, encoding?)` → `Buffer | string`
+- `writeFile(path, data)` → `void`
+- `appendFile(path, data)` → `void`
+- `mkdir(path, { recursive? })` → `void`
+- `readdir(path, { withFileTypes? })` → `string[] | Dirent[]`
+- `stat(path)` / `lstat(path)` → `Stats`
+- `unlink(path)`, `rm(path, { recursive?, force? })`, `rmdir(path)` → `void`
+- `rename(oldPath, newPath)` → `void`
+- `copyFile(src, dest)` → `void`
+- `chmod(path, mode)`, `chown(path, uid, gid)` → `void`
+- `symlink(target, path)` → `void`; `readlink(path)` → `string`
+- `realpath(path)` → `string`; `truncate(path, len?)` → `void`
+- `mkdtemp(prefix)` → `string`
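+- `access(path)` → `void` (rejects if the path is not accessible); `exists(path)` → `boolean` (convenience extension, not part of `node:fs/promises`)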
+
+#### Scenario: ReadFile returns correct content
+
+- **WHEN** `sandbox.fs.readFile("/etc/hostname", "utf8")` is called
+- **THEN** it SHALL return the file content as a string
+
+#### Scenario: WriteFile creates new file
+
+- **WHEN** `sandbox.fs.writeFile("/tmp/test.txt", "hello")` is called
+- **THEN** subsequent `sandbox.fs.readFile("/tmp/test.txt", "utf8")` SHALL return `"hello"`
+
+#### Scenario: Readdir lists directory contents
+
+- **WHEN** `sandbox.fs.readdir("/")` is called
+- **THEN** it SHALL return an array of filenames
+
+#### Scenario: Stat returns file metadata
+
+- **WHEN** `sandbox.fs.stat("/etc/hostname")` is called
+- **THEN** it SHALL return a `Stats`-compatible object with `size`, `isFile()`, `isDirectory()`, `mode`, `uid`, `gid`, `mtime`, etc.
+
+#### Scenario: Mkdir creates intermediate directories
+
+- **WHEN** `sandbox.fs.mkdir("/tmp/a/b/c", { recursive: true })` is called
+- **THEN** the directory `/tmp/a/b/c` SHALL exist
+
+#### Scenario: Exists returns false for missing files
+
+- **WHEN** `sandbox.fs.exists("/nonexistent")` is called
+- **THEN** it SHALL return `false`
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-lifecycle/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-lifecycle/spec.md
@@ -0,0 +1,70 @@
+## ADDED Requirements
+
+### Requirement: Sandbox creation
+
+The system SHALL provide a `Sandbox.create()` static method that provisions a new isolated compute environment.
+
+Parameters:
+- `name?: string` — optional human-readable name
+- `source?: { type: "git" | "tarball" | "snapshot" }` — source for the initial filesystem; type-specific fields such as `url` (git) or `snapshotId` (snapshot) accompany `type`
+- `ports?: number[]` — ports to expose (max 4)
+- `timeout?: number` — auto-terminate timeout in ms
+- `resources?: { vcpus: number }` — CPU allocation (2048 MB RAM per vCPU)
+- `runtime?: string` — runtime identifier
+- `networkPolicy?: NetworkPolicy` — network restrictions
+- `env?: Record<string, string>` — default environment variables
+- `tags?: Record<string, string>` — metadata tags (max 5)
+- `persistent?: boolean` — persistent filesystem across sessions
+- `signal?: AbortSignal` — cancellation support
+
+#### Scenario: Create returns a running Sandbox instance
+
+- **WHEN** `Sandbox.create()` is called with valid parameters
+- **THEN** it SHALL return a `Sandbox` instance with a running session
+
+#### Scenario: Create supports AsyncDisposable
+
+- **WHEN** `Sandbox.create()` is used with `await using`
+- **THEN** the sandbox SHALL be stopped automatically when the enclosing scope exits
+
+#### Scenario: Source specifies initial filesystem content
+
+- **WHEN** `source: { type: "git", url: "..." }` is provided
+- **THEN** the sandbox SHALL clone the git repository on creation
+
+### Requirement: Sandbox retrieval
+
+The system SHALL provide `Sandbox.get()` to retrieve an existing sandbox and `Sandbox.getOrCreate()` for idempotent get-or-create.
+
+#### Scenario: Get retrieves existing sandbox
+
+- **WHEN** `Sandbox.get({ name: "my-sandbox" })` is called for an existing sandbox
+- **THEN** it SHALL return the sandbox with its session resumed
+
+#### Scenario: GetOrCreate creates when not found
+
+- **WHEN** `Sandbox.getOrCreate({ name: "new-sandbox", onCreate: ... })` is called and no sandbox with that name exists
+- **THEN** it SHALL create a new sandbox and call `onCreate` once
+
+### Requirement: Sandbox forking
+
+The system SHALL provide `Sandbox.fork()` to create a new sandbox from an existing one's current filesystem state.
+
+#### Scenario: Fork preserves filesystem state
+
+- **WHEN** `Sandbox.fork({ sourceSandbox: "original" })` is called
+- **THEN** the new sandbox SHALL start with the filesystem state of the source sandbox
+
+### Requirement: Sandbox update and delete
+
+The system SHALL support `sandbox.update()` for configuration changes and `sandbox.delete()` for removal.
+
+#### Scenario: Update changes sandbox config
+
+- **WHEN** `sandbox.update({ timeout: 300000 })` is called
+- **THEN** the sandbox's timeout SHALL be updated for subsequent sessions
+
+#### Scenario: Delete removes the sandbox
+
+- **WHEN** `sandbox.delete()` is called
+- **THEN** the sandbox SHALL be permanently removed
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-network-policy/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-network-policy/spec.md
@@ -0,0 +1,52 @@
+## ADDED Requirements
+
+### Requirement: Network policy type
+
+The system SHALL define a `NetworkPolicy` type with three forms:
+
+- `"allow-all"` — full internet access (default)
+- `"deny-all"` — no external access
+- `{ allow?: string[] | Record<string, NetworkPolicyRule[]>; subnets?: { allow?: string[]; deny?: string[] } }` — custom rules
+
+#### Scenario: Allow-all permits all traffic
+
+- **WHEN** `networkPolicy: "allow-all"` is set
+- **THEN** all outbound traffic SHALL be permitted
+
+#### Scenario: Deny-all blocks all traffic
+
+- **WHEN** `networkPolicy: "deny-all"` is set
+- **THEN** all outbound traffic SHALL be denied
+
+#### Scenario: Domain allowlist restricts access
+
+- **WHEN** `networkPolicy: { allow: ["*.npmjs.org"] }` is set
+- **THEN** traffic to `registry.npmjs.org` SHALL be allowed and all other traffic SHALL be denied
+
+#### Scenario: Wildcard domains match subdomains
+
+- **WHEN** a domain pattern starts with `*.` (e.g., `*.example.com`)
+- **THEN** it SHALL match any subdomain of that domain
+
+### Requirement: Network policy rules with transformers
+
+The system SHALL support per-domain rules with request transformers for header injection.
+
+Parameters per rule:
+- `match?: { path?, method?, queryString?, headers? }` — request matchers
+- `transform?: { headers: Record<string, string> }[]` — header injection
+- `forwardURL?: string` — HTTPS proxy forwarding
+
+#### Scenario: Header transform injects authorization
+
+- **WHEN** a request matches a rule with `transform: [{ headers: { authorization: "Bearer token" } }]`
+- **THEN** the `authorization` header SHALL be injected before forwarding
+
+### Requirement: Subnet filtering
+
+The system SHALL support subnet-level access control via CIDR notation.
+
+#### Scenario: Subnet allow takes precedence over domain rules
+
+- **WHEN** `subnets: { allow: ["10.0.0.0/8"] }` is set
+- **THEN** traffic to `10.0.0.1` SHALL be allowed regardless of domain rules
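The wildcard-domain and subnet scenarios above could be evaluated along these lines. This is a sketch under stated assumptions: only the policy shape comes from the spec, while `matchesDomain`, `inCidr`, `isAllowed`, the IPv4-only handling, checking `subnets.deny` before `subnets.allow`, and denying by default when no domain allowlist is present are all choices made for illustration. Per-domain rule objects and transformers are left out.

```typescript
// Hypothetical sketch: function names and evaluation order are assumptions.
type SimplePolicy =
  | "allow-all"
  | "deny-all"
  | { allow?: string[]; subnets?: { allow?: string[]; deny?: string[] } };

// A leading "*." matches any subdomain, but not the bare domain itself
// (the bare-domain behaviour is an assumption of this sketch).
function matchesDomain(pattern: string, host: string): boolean {
  const p = pattern.toLowerCase();
  const h = host.toLowerCase();
  if (p.startsWith("*.")) {
    return h.endsWith(p.slice(1)) && h.length > p.length - 1;
  }
  return p === h;
}

// Converts dotted-quad IPv4 text to an unsigned 32-bit integer.
function ipv4ToInt(ip: string): number {
  const parts = ip.split(".").map(Number);
  if (parts.length !== 4 || parts.some((n) => !Number.isInteger(n) || n < 0 || n > 255)) {
    throw new Error(`Invalid IPv4 address: ${ip}`);
  }
  return ((parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]) >>> 0;
}

// True when `ip` falls inside the IPv4 CIDR block, e.g. "10.0.0.0/8".
function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split("/");
  const bits = Number(bitsText);
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) {
    throw new Error(`Invalid CIDR block: ${cidr}`);
  }
  // Shifting by 32 is a no-op in JavaScript, so /0 needs a special case.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// Subnet rules are consulted before the domain allowlist, so a subnet
// allow wins regardless of domain rules.
function isAllowed(policy: SimplePolicy, host: string, ip?: string): boolean {
  if (policy === "allow-all") return true;
  if (policy === "deny-all") return false;
  if (ip !== undefined) {
    if ((policy.subnets?.deny ?? []).some((c) => inCidr(ip, c))) return false;
    if ((policy.subnets?.allow ?? []).some((c) => inCidr(ip, c))) return true;
  }
  return (policy.allow ?? []).some((p) => matchesDomain(p, host));
}
```

With `{ allow: ["*.npmjs.org"] }`, a request to `registry.npmjs.org` passes the allowlist, while any host that matches no pattern is denied.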
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-snapshots/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/sandbox-snapshots/spec.md
@@ -0,0 +1,59 @@
+## ADDED Requirements
+
+### Requirement: Snapshot creation
+
+The system SHALL provide `sandbox.snapshot()` to create a point-in-time filesystem snapshot.
+
+Parameters:
+- `expiration?: number` — TTL in milliseconds (0 for no expiration)
+
+#### Scenario: Snapshot stops the session and returns Snapshot instance
+
+- **WHEN** `sandbox.snapshot()` is called on a running sandbox
+- **THEN** the current session SHALL be stopped and a `Snapshot` SHALL be returned
+
+### Requirement: Snapshot retrieval and listing
+
+The system SHALL provide `Snapshot.get()`, `Snapshot.list()`, and `Snapshot.tree()` for managing snapshots.
+
+#### Scenario: Retrieve snapshot by ID
+
+- **WHEN** `Snapshot.get({ snapshotId: "snap_abc" })` is called
+- **THEN** it SHALL return the snapshot with matching ID
+
+#### Scenario: List snapshots with pagination
+
+- **WHEN** `Snapshot.list({ name: "my-sandbox" })` is called
+- **THEN** it SHALL return a paginated list of snapshots for that sandbox
+
+#### Scenario: Ancestry tree is accessible
+
+- **WHEN** `Snapshot.tree({ snapshotId: "snap_abc" })` is called
+- **THEN** it SHALL return the ancestry tree of the snapshot
+
+### Requirement: Snapshot deletion
+
+The system SHALL provide `snapshot.delete()` to remove a snapshot.
+
+#### Scenario: Deleted snapshot is no longer listable
+
+- **WHEN** `snapshot.delete()` is called and then `Snapshot.list()` is called
+- **THEN** the deleted snapshot SHALL no longer appear in the list
+
+### Requirement: Snapshot-based sandbox creation
+
+The system SHALL support creating sandboxes from snapshots via `Sandbox.create({ source: { type: "snapshot", snapshotId } })`.
+
+#### Scenario: Sandbox created from snapshot has matching filesystem
+
+- **WHEN** a file is written in one sandbox, `sandbox.snapshot()` is called, and a second sandbox is created from the resulting snapshot
+- **THEN** the second sandbox SHALL contain the file from the first
+
+### Requirement: Snapshot retention
+
+The system SHALL support `keepLastSnapshots` retention policy on sandboxes.
+
+#### Scenario: Retention evicts oldest snapshots
+
+- **WHEN** a sandbox has `keepLastSnapshots: { count: 3 }` and a 4th snapshot is created
+- **THEN** the oldest snapshot SHALL be evicted
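The retention scenario above amounts to keeping the newest `count` snapshots and evicting the rest. A minimal sketch follows, assuming a hypothetical `SnapshotRecord` shape with a numeric `createdAt` timestamp. Neither the record type nor `applyRetention` is defined by the spec.

```typescript
// Hypothetical sketch: the record shape and function name are assumptions.
interface SnapshotRecord {
  id: string;
  createdAt: number; // creation time in epoch milliseconds
}

// Splits snapshots into the newest `count` to keep and the older ones to evict.
function applyRetention(
  snapshots: SnapshotRecord[],
  count: number,
): { kept: SnapshotRecord[]; evicted: SnapshotRecord[] } {
  const limit = Math.max(0, count);
  // Copy before sorting so the caller's array is left untouched.
  const newestFirst = [...snapshots].sort((a, b) => b.createdAt - a.createdAt);
  return { kept: newestFirst.slice(0, limit), evicted: newestFirst.slice(limit) };
}
```

Returning the evicted records instead of deleting them inside the function keeps the policy decision separate from the deletion call.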
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/state-graph/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/state-graph/spec.md
@@ -0,0 +1,43 @@
+## ADDED Requirements
+
+### Requirement: State definition via annotations
+
+The system SHALL provide an `Annotation` API for defining graph state schemas:
+
+- `Annotation<T>(reducer?)` — creates a state key with optional reducer
+- `Annotation.Root({ key: Annotation<T> })` — combines keys into a state schema
+- Reducers: `LastValue` (default — overwrite), `BinaryOperator` (custom merge function)
+
+#### Scenario: Annotation.Root defines typed state
+
+- **WHEN** `const State = Annotation.Root({ messages: Annotation<string[]>(addMessages), step: Annotation<number>() })` is defined
+- **THEN** `State` SHALL have `State`, `Update`, and `Node` type members
+
+#### Scenario: LastValue rejects multiple writes in the same step
+
+- **WHEN** two nodes write `{ step: 2 }` and `{ step: 3 }` to the same LastValue key in the same step
+- **THEN** the LastValue channel SHALL throw an `InvalidUpdateError`
+
+#### Scenario: BinaryOperator reducer accumulates
+
+- **WHEN** a node returns `{ messages: ["hello"] }` and another returns `{ messages: ["world"] }` with an `addMessages` reducer
+- **THEN** the final state SHALL contain `messages: ["hello", "world"]`
+
+### Requirement: StateGraph builder
+
+The system SHALL provide a `StateGraph` class for constructing stateful agent graphs.
+
+#### Scenario: StateGraph is constructed with state schema
+
+- **WHEN** `new StateGraph({ stateSchema: State })` is called
+- **THEN** the graph SHALL accept nodes that receive and can update the defined state
+
+#### Scenario: Nodes can read and write state
+
+- **WHEN** a node function receives state with `{ messages, step }` and returns `{ step: step + 1 }`
+- **THEN** the graph SHALL update `step` and preserve `messages`
+
+#### Scenario: Conditional edges route based on state
+
+- **WHEN** `addConditionalEdges("node_a", (state) => state.step > 5 ? "end" : "node_b")` is added
+- **THEN** execution SHALL route based on the state value at runtime
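The two reducer kinds in the annotation requirement above differ in how they treat several writes to one key within a single step. A key without a reducer accepts at most one write, while a key with a reducer folds every write into the current value. The sketch below assumes a hypothetical `applyWrites` helper and a simple `appendMessages` reducer standing in for `addMessages`. Only `InvalidUpdateError` is named by the spec.

```typescript
// Hypothetical sketch: only the error name comes from the spec.
class InvalidUpdateError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "InvalidUpdateError";
  }
}

type Reducer<T> = (current: T, update: T) => T;

// Applies all writes made to one state key during a single step.
// Without a reducer the key behaves like a LastValue channel.
function applyWrites<T>(current: T, writes: T[], reducer?: Reducer<T>): T {
  if (writes.length === 0) return current;
  if (reducer === undefined) {
    if (writes.length > 1) {
      throw new InvalidUpdateError("LastValue key received multiple writes in one step");
    }
    return writes[0];
  }
  // Fold each write into the accumulated value, in arrival order.
  return writes.reduce((acc, update) => reducer(acc, update), current);
}

// Stand-in for `addMessages`: appends the update to the existing list.
const appendMessages: Reducer<string[]> = (current, update) => [...current, ...update];
```

For example, `applyWrites([], [["hello"], ["world"]], appendMessages)` accumulates both messages, whereas two writes to a key with no reducer raise `InvalidUpdateError`.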
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/trajectory-eval/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/trajectory-eval/spec.md
@@ -0,0 +1,51 @@
+## ADDED Requirements
+
+### Requirement: Trajectory match evaluator
+
+The system SHALL provide `create_trajectory_match_evaluator()` that compares agent tool-call trajectories against reference trajectories.
+
+Parameters:
+- `trajectory_match_mode: "strict" | "unordered" | "subset" | "superset"` — matching strategy
+- `tool_args_match_mode: "exact" | "ignore" | "subset" | "superset"` — tool argument comparison
+- `tool_args_match_overrides?: Record<string, ToolArgsMatchMode | Callable | string[]>` — per-tool custom matching
+
+#### Scenario: Strict mode requires exact order
+
+- **WHEN** output trajectory has tool calls `[A, B]` and reference is `[A, B]`
+- **THEN** strict mode SHALL return score `true`
+- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
+- **THEN** strict mode SHALL return score `false`
+
+#### Scenario: Unordered mode ignores order
+
+- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
+- **THEN** unordered mode SHALL return score `true`
+
+#### Scenario: Subset mode accepts partial trajectory
+
+- **WHEN** output trajectory has tool calls `[A]` and reference is `[A, B]`
+- **THEN** subset mode SHALL return score `true`
+
+#### Scenario: Superset mode allows extra tool calls
+
+- **WHEN** output trajectory has tool calls `[A, B, C]` and reference is `[A, B]`
+- **THEN** superset mode SHALL return score `true`
+
+#### Scenario: Tool args ignore mode skips argument comparison
+
+- **WHEN** `tool_args_match_mode="ignore"` is set
+- **THEN** tool calls match regardless of their arguments
+
+#### Scenario: Custom tool arg matcher is used
+
+- **WHEN** `tool_args_match_overrides` contains a `Callable` for a tool name
+- **THEN** that callable SHALL be invoked to compare the tool's arguments
+
+### Requirement: Trajectory LLM-as-judge
+
+The system SHALL provide `create_trajectory_llm_as_judge()` that uses an LLM to grade trajectory quality and accuracy.
+
+#### Scenario: Trajectory is formatted as XML for LLM
+
+- **WHEN** an LLM trajectory evaluator is invoked
+- **THEN** the trajectory SHALL be formatted as XML with `<role>`, `<tool_call>`, `<tool_result>` elements
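The four trajectory match modes above can be sketched over tool names alone, which corresponds to `tool_args_match_mode: "ignore"`. `trajectoryMatches` and `isContained` are hypothetical helpers for illustration, and treating repeated tool calls as a multiset (so a tool called twice must appear twice) is an assumption the spec does not settle.

```typescript
// Hypothetical sketch: compares tool names only, ignoring arguments.
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// Counts how often each tool name occurs.
function countNames(names: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const name of names) counts.set(name, (counts.get(name) ?? 0) + 1);
  return counts;
}

// True when every call in `small` also occurs in `big` at least as often.
function isContained(small: string[], big: string[]): boolean {
  const bigCounts = countNames(big);
  for (const [name, needed] of countNames(small)) {
    if ((bigCounts.get(name) ?? 0) < needed) return false;
  }
  return true;
}

function trajectoryMatches(output: string[], reference: string[], mode: MatchMode): boolean {
  switch (mode) {
    case "strict": // same calls in the same order
      return output.length === reference.length && output.every((n, i) => n === reference[i]);
    case "unordered": // same calls in any order
      return output.length === reference.length && isContained(output, reference);
    case "subset": // output may omit reference calls
      return isContained(output, reference);
    case "superset": // output may add extra calls
      return isContained(reference, output);
  }
}
```

Subset and superset are the same containment check with the arguments swapped, which is why a single helper covers both.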
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/tasks.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/tasks.md
@@ -0,0 +1,50 @@
+## 1. Foundation: Core Types & Monorepo Setup ✅
+
+- [x] 1.1 Initialize pnpm monorepo with turbo.json at root, configure `@agent-runtime/*` workspace packages
+- [x] 1.2 Set up shared TypeScript config (strict mode, ESNext modules, path aliases)
+- [x] 1.3 Implement `@agent-runtime/core` package: `EvaluatorResult`, `ScoreType`, `ModelClient` protocol, `Serializable` interface
+- [x] 1.4 Implement `@agent-runtime/core` serialization protocol: `toJSON()`/`fromJSON()` pattern on stateful types
+- [x] 1.5 Implement `@agent-runtime/core` error types: `EvalError`
+- [x] 1.6 Implement `@agent-runtime/core` utility functions: message normalization, XML formatting, JSON schema construction
+
+## 2. Eval: LLM-as-Judge Core
+
+- [ ] 2.1 Implement `_construct_default_output_json_schema()` for continuous/binary/choices scoring with reasoning
+- [ ] 2.2 Implement prompt formatting (string templates, attachments, system messages)
+- [ ] 2.3 Implement `_append_few_shot_examples()` with XML `<example>` formatting
+- [ ] 2.4 Implement `_create_llm_as_judge_scorer()` — core scorer with structured output via OpenAI JSON schema
+- [ ] 2.5 Implement `create_llm_as_judge()` factory wrapping scorer into `_run_evaluator()`
+- [ ] 2.6 Implement async variants: `create_async_llm_as_judge()`, `_create_async_llm_as_judge_scorer()`
+- [ ] 2.7 Implement `_run_evaluator_untyped()` and `_process_score()` for result aggregation
+- [ ] 2.8 Write unit tests for LLM-as-judge: string prompts, continuous scoring, choices, reasoning, few-shot
+
+## 3. Eval: Trajectory Evaluators
+
+- [ ] 3.1 Implement trajectory matching utilities: `_normalize_to_openai_messages_list()`, `_extract_tool_calls()`
+- [ ] 3.2 Implement `_is_trajectory_superset()` core comparator with `_get_matcher_for_tool_name()` override system
+- [ ] 3.3 Implement strict/unordered/subset/superset matching scorers
+- [ ] 3.4 Implement `create_trajectory_match_evaluator()` with all 4 modes and `tool_args_match_overrides`
+- [ ] 3.5 Write tests: all 4 match modes, tool args ignore, custom matchers
+
+## 4. Eval: Code Correctness Evaluators
+
+- [ ] 4.1 Implement code extraction: `_extract_code_from_markdown_code_blocks()` regex parser
+- [ ] 4.2 Implement `_create_base_code_evaluator()` with pluggable extraction pipeline
+- [ ] 4.3 Implement `create_code_llm_as_judge()` combining extraction + LLM scoring
+- [ ] 4.4 Implement `create_pyright_evaluator()` with temp file execution and JSON output parsing
+- [ ] 4.5 Write tests: markdown extraction, Pyright static analysis
+
+## 5. Eval: Prompt Library
+
+- [ ] 5.1 Export Quality prompt templates: correctness, conciseness, hallucination, answer_relevance, code_correctness, plan_adherence
+- [ ] 5.2 Export Safety/Security prompt templates: toxicity, fairness, pii_leakage, prompt_injection
+- [ ] 5.3 Export Trajectory prompt templates: trajectory_accuracy (with and without reference), tool_selection
+- [ ] 5.4 Export Conversation prompt templates: user_satisfaction, task_completion, agent_tone
+
+## 6. Documentation & Release
+
+- [ ] 6.1 Write README with architecture overview and getting-started example
+- [ ] 6.2 Document each package with tsdoc exports
+- [ ] 6.3 Write usage examples: basic eval, code correctness check
+- [ ] 6.4 Add CI pipeline: lint, type-check, test
+- [ ] 6.5 Publish initial alpha for `@agent-runtime/eval` package