chore(openspec): drop 9 superseded proposals + 11 stub archive files

Drop 9 batch proposals that are superseded by the boocode-lift-analysis
(boocontext-audit, conductor upgrades, self-healing/verify-gate skills):
add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform,
conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul,
agent-reliability.

Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only)
that provide zero documentation value over the existing CHANGELOG.md + git tags.
This commit is contained in:
2026-06-07 22:15:38 +00:00
parent 0d6e9a2413
commit c935687725
119 changed files with 4897 additions and 45 deletions

View File

@@ -0,0 +1,65 @@
## ADDED Requirements
### Requirement: Code LLM-as-judge
The system SHALL provide `create_code_llm_as_judge()` that evaluates code correctness using an LLM, with code extraction from responses.
Parameters:
- `code_extraction_strategy: "none" | "llm" | "markdown_code_blocks"` — how to extract code from output
- `code_extractor?: Callable` — custom extraction function
#### Scenario: Markdown code block extraction
- **WHEN** `code_extraction_strategy="markdown_code_blocks"` and output contains triple-backtick code blocks
- **THEN** the evaluator SHALL extract code from those blocks before scoring
#### Scenario: LLM-based code extraction
- **WHEN** `code_extraction_strategy="llm"` and a `judge` is provided
- **THEN** the evaluator SHALL use an LLM with `ExtractCode`/`NoCode` tools to extract code
#### Scenario: No extraction returns raw output
- **WHEN** `code_extraction_strategy="none"`
- **THEN** the raw output string is passed directly to the scorer
### Requirement: Static analysis evaluator (Pyright)
The system SHALL provide `create_pyright_evaluator()` that runs Pyright static type checking on extracted Python code.
Parameters:
- `pyright_cli_args: string[]` — additional CLI flags
- `code_extraction_strategy` / `code_extractor` — same as code LLM evaluator
#### Scenario: Pyright detects type error
- **WHEN** code with a type error (e.g., `x: int = "string"`) is evaluated
- **THEN** the evaluator SHALL return score `false` with error details in `comment`
#### Scenario: Pyright passes clean code
- **WHEN** valid Python code is evaluated
- **THEN** the evaluator SHALL return score `true`
### Requirement: Static analysis evaluator (Mypy)
The system SHALL provide `create_mypy_evaluator()` with equivalent behavior to Pyright evaluator but using the Mypy type checker.
#### Scenario: Mypy detects type error
- **WHEN** code with an unannotated function returning mismatched types is evaluated
- **THEN** the evaluator SHALL return score `false`
### Requirement: Sandboxed code execution
The system SHALL provide `create_e2b_execution_evaluator()` that executes code in a sandbox and checks for runtime errors.
#### Scenario: Code executes without errors
- **WHEN** valid Python code runs in the sandbox
- **THEN** the evaluator SHALL return score `true`
#### Scenario: Code raises runtime exception
- **WHEN** code that raises an exception is executed
- **THEN** the evaluator SHALL return score `false` with error details

View File

@@ -0,0 +1,31 @@
## ADDED Requirements
### Requirement: Shared type system
The system SHALL define a shared set of types used by all packages:
- `EvaluatorResult` — TypedDict with `key: string`, `score: number | boolean`, `comment?: string`, `metadata?: Record<string, unknown>`, `source_run_id?: string`
- `ModelClient` — Protocol with `chat.completions.create()` for LLM access
- `SandboxProvider` — Interface for provider-agnostic sandbox creation/management
- `Checkpointer` — Interface for checkpoint persistence
- `Serializable` — Interface requiring `toJSON()` and static `fromJSON()` methods
- All evaluators SHALL accept a consistent call signature: `(inputs?, outputs, reference_outputs?, **kwargs)`
- Error types: `GraphInterrupt`, `SandboxError`, `EvalError`
#### Scenario: EvaluatorResult conforms to schema
- **WHEN** an evaluator returns a result
- **THEN** the result SHALL conform to `EvaluatorResult` with at least `key` and `score`
#### Scenario: All stateful objects are serializable
- **WHEN** a `Sandbox`, `Snapshot`, or `Command` instance is serialized via `toJSON()`
- **THEN** a subsequent `fromJSON()` call SHALL reconstruct an equivalent instance
### Requirement: Serialization protocol
All stateful objects (`Sandbox`, `Session`, `Command`, `Snapshot`, `GraphState`) SHALL implement `toJSON()` / `fromJSON()` static methods for cross-session persistence.
#### Scenario: Round-trip serialization preserves identity
- **WHEN** an object is serialized and deserialized
- **THEN** the deserialized object SHALL have matching identity fields (`id`, `name`, `sessionId`)

View File

@@ -0,0 +1,49 @@
## ADDED Requirements
### Requirement: Built-in evaluation prompt templates
The system SHALL ship with a library of prompt templates organized by domain, ready for use with `create_llm_as_judge()`.
Domains and included prompts:
**Quality:**
- `CORRECTNESS_PROMPT` — factual accuracy and completeness
- `CONCISENESS_PROMPT` — concise responses without hedging or fluff
- `HALLUCINATION_PROMPT` — claims verifiable from context
- `ANSWER_RELEVANCE_PROMPT` — output addresses the input question
- `PLAN_ADHERENCE_PROMPT` — agent actions match declared plan
- `LAZINESS_PROMPT` — detects blank or low-effort responses
**RAG:**
- `RAG_GROUNDEDNESS_PROMPT` — output claims supported by retrieved context
- `RAG_HELPFULNESS_PROMPT` — output addresses core question
- `RAG_RETRIEVAL_RELEVANCE_PROMPT` — retrieved context is relevant to input
**Safety:**
- `TOXICITY_PROMPT` — personal attacks, hate speech
- `FAIRNESS_PROMPT` — stereotyping, discrimination
**Security:**
- `PII_LEAKAGE_PROMPT` — names, contact info, credentials in output
- `PROMPT_INJECTION_PROMPT` — delimiter manipulation, roleplay bypass
- `CODE_INJECTION_PROMPT` — SQL injection, XSS, path traversal
**Trajectory:**
- `TRAJECTORY_ACCURACY_PROMPT` — logical progression, goal alignment
- `TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE` — semantically equivalent to reference
- `TOOL_SELECTION_PROMPT` — right tools, right order, no redundant calls
**Conversation:**
- `USER_SATISFACTION_PROMPT` — gratitude, resolution, engagement
- `TASK_COMPLETION_PROMPT` — was the user's goal achieved
- `AGENT_TONE_PROMPT` — appropriate tone and professionalism
#### Scenario: Each prompt is a string with {inputs}, {outputs}, {reference_outputs} placeholders
- **WHEN** a prompt template is inspected
- **THEN** it SHALL be a string compatible with `str.format()` containing at least `{outputs}`
#### Scenario: Prompt templates follow rubric structure
- **WHEN** a prompt template is read
- **THEN** it SHALL contain `<Rubric>`, `<Instructions>`, and `<Reminder>` XML sections

View File

@@ -0,0 +1,49 @@
## ADDED Requirements
### Requirement: Stream modes
The system SHALL support multiple stream modes when invoking a compiled graph:
- `"values"` — emits the full state after each superstep
- `"updates"` — emits only the state changes after each superstep
- `"messages"` — emits individual message chunks for chat-oriented graphs
- `"debug"` — emits debug events with full superstep information
- `"custom"` — supports user-defined events via a emit function
#### Scenario: Values mode emits full state
- **WHEN** a graph is streamed with `streamMode: ["values"]`
- **THEN** each chunk SHALL contain the complete state object after each superstep
#### Scenario: Updates mode emits diffs
- **WHEN** a graph is streamed with `streamMode: ["updates"]`
- **THEN** each chunk SHALL contain only the state keys that changed
### Requirement: Stream event protocol
The system SHALL emit structured events during graph execution, including:
- `on_chain_start` — node execution begins
- `on_chain_end` — node execution completes
- `on_chain_stream` — intermediate output from a node
- `on_custom_event` — user-defined events
- Checkpoint metadata paired with each event (id, parent_id, step, source)
#### Scenario: Events include checkpoint metadata
- **WHEN** a stream event is received
- **THEN** it SHALL include a `checkpoint` envelope with `id`, `step`, and `source`
#### Scenario: Custom events propagate from nodes
- **WHEN** a node emits a custom event via an emit function
- **THEN** that event SHALL appear in the stream with type `on_custom_event`
### Requirement: Async iteration over streams
The system SHALL support `for await...of` iteration over graph streams.
#### Scenario: Stream is async iterable
- **WHEN** `for await (const chunk of graph.stream(...))` is used
- **THEN** each chunk SHALL be available as it is produced

View File

@@ -0,0 +1,56 @@
## ADDED Requirements
### Requirement: Node interrupt function
The system SHALL provide an `interrupt(value)` function that pauses graph execution and returns a resume value when the graph is continued.
#### Scenario: Interrupt pauses execution with value
- **WHEN** a node calls `const approval = interrupt({ question: "Approve this action?" })`
- **THEN** execution SHALL pause and the interrupt value SHALL be available in the stream output
#### Scenario: Resume returns value to interrupt
- **WHEN** the graph is resumed with `Command({ resume: "approved" })`
- **THEN** the `interrupt()` call SHALL return `"approved"`
#### Scenario: Multiple interrupts are supported
- **WHEN** a node calls `interrupt()` twice
- **THEN** each interrupt SHALL be resolved sequentially, requiring two resume commands
### Requirement: Command-based graph resumption
The system SHALL provide a `Command` class that supports:
- `Command.RESUME` — resume value for pending interrupts
- `Command.GOTO` — Send or node name for dynamic routing
- `Command.PARENT` — bubble up to parent graph
#### Scenario: Command with resume continues execution
- **WHEN** `await graph.stream(new Command({ resume: "user input" }))` is called
- **THEN** the interrupted node SHALL continue with the resume value
#### Scenario: Command with goto routes dynamically
- **WHEN** a node returns `new Command({ goto: "human_review" })`
- **THEN** execution SHALL route to `human_review` node
### Requirement: Automated interrupts at node boundaries
The system SHALL support `interruptBefore` and `interruptAfter` in `compile()` options to automatically pause at specific nodes.
#### Scenario: InterruptBefore pauses before node execution
- **WHEN** `graph.compile({ interruptBefore: ["approval_node"] })` is used
- **THEN** the graph SHALL pause just before executing `approval_node`
### Requirement: State snapshots on interrupt
When a graph uses a checkpointer, interrupt states SHALL be persisted so execution can be resumed across process boundaries.
#### Scenario: Interrupted state is checkpointed
- **WHEN** a graphed with a checkpointer is interrupted
- **THEN** the checkpoint SHALL contain the interrupt state
- **THEN** restoring from that checkpoint SHALL yield the same interrupt state

View File

@@ -0,0 +1,55 @@
## ADDED Requirements
### Requirement: LLM-as-judge evaluator factory
The system SHALL provide a `create_llm_as_judge()` function that creates an evaluator using an LLM to assess output quality.
Parameters:
- `prompt: string | Runnable | Callable` — evaluation prompt (string template with `{inputs}`, `{outputs}`, `{reference_outputs}`)
- `judge?: ModelClient | BaseChatModel` — LLM client
- `model?: string` — model identifier
- `system?: string` — optional system message
- `continuous: boolean = false` — float 0-1 scoring when true, boolean when false
- `choices?: number[]` — specific enum float values for score
- `use_reasoning: boolean = true` — include reasoning in output
- `few_shot_examples?: FewShotExample[]` — example evaluations
- `output_schema?: JSONSchema | ZodSchema` — custom structured output format
#### Scenario: String prompt evaluator returns scored result
- **WHEN** `create_llm_as_judge(prompt="Rate: {outputs}")` is invoked with `outputs="Hello world"`
- **THEN** it SHALL return an `EvaluatorResult` with `key: "score"` and a valid `score`
#### Scenario: Continuous scoring returns float
- **WHEN** `create_llm_as_judge(prompt=..., continuous=true)` scores output
- **THEN** the score SHALL be a float between 0.0 and 1.0
#### Scenario: Choices scoring returns enum value
- **WHEN** `create_llm_as_judge(prompt=..., choices=[0.0, 0.5, 1.0])` scores output
- **THEN** the score SHALL be exactly one of the enumerated choices
#### Scenario: Reasoning mode returns comment
- **WHEN** `create_llm_as_judge(prompt=..., use_reasoning=true)` scores output
- **THEN** the `comment` field SHALL contain the LLM's reasoning
#### Scenario: Few-shot examples are appended to prompt
- **WHEN** `few_shot_examples` are provided
- **THEN** they SHALL be appended as `<example>` XML blocks to the last user message
#### Scenario: Output schema returns structured dict
- **WHEN** `output_schema` is provided (e.g., z.object({ quality: z.number() }))
- **THEN** the evaluator SHALL return a dict matching that schema instead of `EvaluatorResult`
### Requirement: Async LLM-as-judge
The system SHALL provide `create_async_llm_as_judge()` with identical parameters, returning an async evaluator.
#### Scenario: Async evaluator returns same structure as sync
- **WHEN** `await` is used on an async evaluator invocation
- **THEN** the result SHALL match the same structure as the sync equivalent

View File

@@ -0,0 +1,39 @@
## ADDED Requirements
### Requirement: Multi-turn conversation simulation
The system SHALL provide `run_multiturn_simulation()` that simulates a multi-turn conversation between an app and a simulated user.
Parameters:
- `app: Callable[[ChatCompletionMessage], ChatCompletionMessage]` — the application under test
- `user: Callable | string[]` — simulated user (dynamic or static responses)
- `max_turns?: number` — maximum conversation turns
- `trajectory_evaluators?: EvalFunction[]` — evaluators that assess the final trajectory
- `stopping_condition?: Callable[[Message[], number], boolean]` — early termination
- `reference_outputs?: unknown` — passed to evaluators
#### Scenario: Static user responses drive conversation
- **WHEN** `user=["Hello", "Tell me more", "Goodbye"]` with `max_turns=3`
- **THEN** the simulation SHALL alternate between user responses and app responses for 3 turns
#### Scenario: Dynamic simulated user adapts to context
- **WHEN** `user` is a `Callable` receiving the current trajectory
- **THEN** the user function SHALL receive the current conversation history and return the next message
#### Scenario: Trajectory evaluators run after simulation
- **WHEN** `trajectory_evaluators` are provided
- **THEN** each evaluator SHALL receive the full conversation trajectory as `outputs`
- **THEN** the simulation result SHALL include `evaluator_results` from each evaluator
#### Scenario: Stopping condition terminates early
- **WHEN** `stopping_condition` returns `true` before `max_turns`
- **THEN** the simulation SHALL terminate immediately
#### Scenario: Async simulation is supported
- **WHEN** `run_multiturn_simulation_async()` is called with async `app` and `user` functions
- **THEN** the simulation SHALL await each turn and return the same result structure

View File

@@ -0,0 +1,49 @@
## ADDED Requirements
### Requirement: Pregel execution engine
The system SHALL implement a Pregel-style superstep execution engine where:
- Each "superstep" executes all ready nodes concurrently
- Nodes communicate through typed channels (not direct function calls)
- Channel writes from one superstep are visible as reads in the next
- The engine supports `PULL` (edge-triggered) and `PUSH` (dynamic Send) task scheduling
#### Scenario: Nodes execute in dependency order
- **WHEN** node B subscribes to channel A
- **THEN** node B SHALL execute in the superstep after node A writes to channel A
#### Scenario: Concurrent nodes run in parallel
- **WHEN** two nodes have no dependencies between them
- **THEN** they SHALL execute concurrently within the same superstep
#### Scenario: Dynamic Send spawns new node executions
- **WHEN** a node calls `send("node_c", { ... })` via `Command`
- **THEN** `node_c` SHALL be scheduled for execution in the current or next superstep
### Requirement: Graph compilation
The system SHALL provide `graph.compile()` that produces a runnable compiled graph.
Parameters:
- `checkpointer?: Checkpointer` — optional persistence
- `interruptBefore?: string[]` — nodes to pause before
- `interruptAfter?: string[]` — nodes to pause after
- `name?: string` — graph name
#### Scenario: Compiled graph can be invoked
- **WHEN** `compiled_graph.invoke({ messages: [] })` is called
- **THEN** it SHALL execute all nodes and return the final state
### Requirement: Recursion limit
The system SHALL enforce a configurable recursion limit to prevent infinite loops.
#### Scenario: Exceeding recursion limit throws
- **WHEN** a graph exceeds the recursion limit
- **THEN** a `GraphRecursionError` SHALL be thrown

View File

@@ -0,0 +1,61 @@
## ADDED Requirements
### Requirement: Command execution (blocking)
The system SHALL provide `sandbox.runCommand(cmd, args?, opts?)` that executes a command inside the sandbox and waits for completion.
Parameters:
- `cmd: string` — command to execute
- `args?: string[]` — command arguments
- `cwd?: string` — working directory
- `env?: Record<string, string>` — per-command environment variables
- `sudo?: boolean` — execute with root privileges
- `timeoutMs?: number` — max execution time (SIGKILL on expiry)
- `signal?: AbortSignal` — cancellation
#### Scenario: Blocking runCommand returns finished result with exit code
- **WHEN** `sandbox.runCommand("echo", ["hello"])` is called
- **THEN** it SHALL return a `CommandFinished` instance with `exitCode: 0`
#### Scenario: Command timeout kills process
- **WHEN** `sandbox.runCommand("sleep", ["100"], { timeoutMs: 100 })` is executed
- **THEN** it SHALL return a non-zero exit code after ~100ms
#### Scenario: Stderr is captured separately
- **WHEN** a command writes to both stdout and stderr
- **THEN** `result.stdout()` and `result.stderr()` SHALL return their respective streams
### Requirement: Detached command execution
The system SHALL support `{ detached: true }` mode where `runCommand()` returns immediately with a live `Command` handle.
#### Scenario: Detached command returns before completion
- **WHEN** `sandbox.runCommand({ cmd: "sleep", args: ["5"], detached: true })` is called
- **THEN** it SHALL return a `Command` instance immediately (before the process exits)
#### Scenario: Detached command can be waited on
- **WHEN** `command.wait()` is called on a detached command
- **THEN** it SHALL return a `CommandFinished` when the process exits
### Requirement: Command log streaming
The system SHALL provide `command.logs()` as an async iterable of stdout/stderr log lines.
#### Scenario: Logs stream output lines
- **WHEN** `for await (const log of command.logs())` is iterated
- **THEN** each `log` SHALL have `stream: "stdout" | "stderr"` and `data: string`
### Requirement: Command kill
The system SHALL provide `command.kill(signal?)` to send a POSIX signal to a running command.
#### Scenario: Default kill sends SIGTERM
- **WHEN** `command.kill()` is called without a signal
- **THEN** SIGTERM SHALL be sent to the process

View File

@@ -0,0 +1,50 @@
## ADDED Requirements
### Requirement: Filesystem API matching node:fs/promises
The system SHALL provide `sandbox.fs` implementing the Node.js `fs/promises` API:
- `readFile(path, encoding?)``Buffer | string`
- `writeFile(path, data)``void`
- `appendFile(path, data)``void`
- `mkdir(path, { recursive? })``void`
- `readdir(path, { withFileTypes? })``string[] | Dirent[]`
- `stat(path)` / `lstat(path)``Stats`
- `unlink(path)`, `rm(path, { recursive?, force? })`, `rmdir(path)``void`
- `rename(oldPath, newPath)``void`
- `copyFile(src, dest)``void`
- `chmod(path, mode)`, `chown(path, uid, gid)``void`
- `symlink(target, path)`, `readlink(path)``void`
- `realpath(path)`, `truncate(path, len?)``void`
- `mkdtemp(prefix)``string`
- `access(path)`, `exists(path)``boolean`
#### Scenario: ReadFile returns correct content
- **WHEN** `sandbox.fs.readFile("/etc/hostname", "utf8")` is called
- **THEN** it SHALL return the file content as a string
#### Scenario: WriteFile creates new file
- **WHEN** `sandbox.fs.writeFile("/tmp/test.txt", "hello")` is called
- **THEN** subsequent `sandbox.fs.readFile("/tmp/test.txt", "utf8")` SHALL return `"hello"`
#### Scenario: Readdir lists directory contents
- **WHEN** `sandbox.fs.readdir("/")` is called
- **THEN** it SHALL return an array of filenames
#### Scenario: Stat returns file metadata
- **WHEN** `sandbox.fs.stat("/etc/hostname")` is called
- **THEN** it SHALL return a `Stats`-compatible object with `size`, `isFile()`, `isDirectory()`, `mode`, `uid`, `gid`, `mtime`, etc.
#### Scenario: Mkdir creates intermediate directories
- **WHEN** `sandbox.fs.mkdir("/tmp/a/b/c", { recursive: true })` is called
- **THEN** the directory `/tmp/a/b/c` SHALL exist
#### Scenario: Exists returns false for missing files
- **WHEN** `sandbox.fs.exists("/nonexistent")` is called
- **THEN** it SHALL return `false`

View File

@@ -0,0 +1,70 @@
## ADDED Requirements
### Requirement: Sandbox creation
The system SHALL provide a `Sandbox.create()` static method that provisions a new isolated compute environment.
Parameters:
- `name?: string` — optional human-readable name
- `source?: { type: "git" | "tarball" | "snapshot" }` — source for initial filesystem
- `ports?: number[]` — ports to expose (max 4)
- `timeout?: number` — auto-terminate timeout in ms
- `resources?: { vcpus: number }` — CPU allocation (2048 MB RAM per vCPU)
- `runtime?: string` — runtime identifier
- `networkPolicy?: NetworkPolicy` — network restrictions
- `env?: Record<string, string>` — default environment variables
- `tags?: Record<string, string>` — metadata tags (max 5)
- `persistent?: boolean` — persistent filesystem across sessions
- `signal?: AbortSignal` — cancellation support
#### Scenario: Create returns a running Sandbox instance
- **WHEN** `Sandbox.create()` is called with valid parameters
- **THEN** it SHALL return a `Sandbox` instance with a running session
#### Scenario: Create supports AsyncDisposable
- **WHEN** `Sandbox.create()` is used with `await using`
- **THEN** the sandbox SHALL be automatically stopped when scope exits
#### Scenario: Source specifies initial filesystem content
- **WHEN** `source: { type: "git", url: "..." }` is provided
- **THEN** the sandbox SHALL clone the git repository on creation
### Requirement: Sandbox retrieval
The system SHALL provide `Sandbox.get()` to retrieve an existing sandbox and `Sandbox.getOrCreate()` for idempotent get-or-create.
#### Scenario: Get retrieves existing sandbox
- **WHEN** `Sandbox.get({ name: "my-sandbox" })` is called for an existing sandbox
- **THEN** it SHALL return the sandbox with its session resumed
#### Scenario: GetOrCreate creates when not found
- **WHEN** `Sandbox.getOrCreate({ name: "new-sandbox", onCreate: ... })` is called and sandbox doesn't exist
- **THEN** it SHALL create a new sandbox and call `onCreate` once
### Requirement: Sandbox forking
The system SHALL provide `Sandbox.fork()` to create a new sandbox from an existing one's current filesystem state.
#### Scenario: Fork preserves filesystem state
- **WHEN** `Sandbox.fork({ sourceSandbox: "original" })` is called
- **THEN** the new sandbox SHALL start with the filesystem state of the source sandbox
### Requirement: Sandbox update and delete
The system SHALL support `sandbox.update()` for configuration changes and `sandbox.delete()` for removal.
#### Scenario: Update changes sandbox config
- **WHEN** `sandbox.update({ timeout: 300000 })` is called
- **THEN** the sandbox's timeout SHALL be updated for subsequent sessions
#### Scenario: Delete removes the sandbox
- **WHEN** `sandbox.delete()` is called
- **THEN** the sandbox SHALL be permanently removed

View File

@@ -0,0 +1,52 @@
## ADDED Requirements
### Requirement: Network policy type
The system SHALL define a `NetworkPolicy` type with three forms:
- `"allow-all"` — full internet access (default)
- `"deny-all"` — no external access
- `{ allow?: string[] | Record<string, NetworkPolicyRule[]>; subnets?: { allow?: string[]; deny?: string[] } }` — custom rules
#### Scenario: Allow-all permits all traffic
- **WHEN** `networkPolicy: "allow-all"` is set
- **THEN** all outbound traffic SHALL be permitted
#### Scenario: Deny-all blocks all traffic
- **WHEN** `networkPolicy: "deny-all"` is set
- **THEN** all outbound traffic SHALL be denied
#### Scenario: Domain allowlist restricts access
- **WHEN** `networkPolicy: { allow: ["*.npmjs.org"] }` is set
- **THEN** traffic to `registry.npmjs.org` SHALL be allowed and all other traffic SHALL be denied
#### Scenario: Wildcard domains match subdomains
- **WHEN** a domain pattern starts with `*.` (e.g., `*.example.com`)
- **THEN** it SHALL match any subdomain of that domain
### Requirement: Network policy rules with transformers
The system SHALL support per-domain rules with request transformers for header injection.
Parameters per rule:
- `match?: { path?, method?, queryString?, headers? }` — request matchers
- `transform?: { headers: Record<string, string> }[]` — header injection
- `forwardURL?: string` — HTTPS proxy forwarding
#### Scenario: Header transform injects authorization
- **WHEN** a request matches a rule with `transform: [{ headers: { authorization: "Bearer token" } }]`
- **THEN** the `authorization` header SHALL be injected before forwarding
### Requirement: Subnet filtering
The system SHALL support subnet-level access control via CIDR notation.
#### Scenario: Subnet allow takes precedence over domain deny
- **WHEN** `subnets: { allow: ["10.0.0.0/8"] }` is set
- **THEN** traffic to `10.0.0.1` SHALL be allowed regardless of domain rules

View File

@@ -0,0 +1,59 @@
## ADDED Requirements
### Requirement: Snapshot creation
The system SHALL provide `sandbox.snapshot()` to create a point-in-time filesystem snapshot.
Parameters:
- `expiration?: number` — TTL in milliseconds (0 for no expiration)
#### Scenario: Snapshot stops the session and returns Snapshot instance
- **WHEN** `sandbox.snapshot()` is called on a running sandbox
- **THEN** the current session SHALL be stopped and a `Snapshot` SHALL be returned
### Requirement: Snapshot retrieval and listing
The system SHALL provide `Snapshot.get()`, `Snapshot.list()`, and `Snapshot.tree()` for managing snapshots.
#### Scenario: Retrieve snapshot by ID
- **WHEN** `Snapshot.get({ snapshotId: "snap_abc" })` is called
- **THEN** it SHALL return the snapshot with matching ID
#### Scenario: List snapshots with pagination
- **WHEN** `Snapshot.list({ name: "my-sandbox" })` is called
- **THEN** it SHALL return a paginated list of snapshots for that sandbox
#### Scenario: Ancestry tree is accessible
- **WHEN** `Snapshot.tree({ snapshotId: "snap_abc" })` is called
- **THEN** it SHALL return the ancestry tree of the snapshot
### Requirement: Snapshot deletion
The system SHALL provide `snapshot.delete()` to remove a snapshot.
#### Scenario: Deleted snapshot is no longer listable
- **WHEN** `snapshot.delete()` is called and then `Snapshot.list()` is called
- **THEN** the deleted snapshot SHALL no longer appear in the list
### Requirement: Snapshot-based sandbox creation
The system SHALL support creating sandboxes from snapshots via `Sandbox.create({ source: { type: "snapshot", snapshotId } })`.
#### Scenario: Sandbox created from snapshot has matching filesystem
- **WHEN** a sandbox is created with a snapshot source and a file is written, then another sandbox is created from the resulting snapshot
- **THEN** the second sandbox SHALL contain the file from the first
### Requirement: Snapshot retention
The system SHALL support `keepLastSnapshots` retention policy on sandboxes.
#### Scenario: Retention evicts oldest snapshots
- **WHEN** a sandbox has `keepLastSnapshots: { count: 3 }` and a 4th snapshot is created
- **THEN** the oldest snapshot SHALL be evicted

View File

@@ -0,0 +1,43 @@
## ADDED Requirements
### Requirement: State definition via annotations
The system SHALL provide an `Annotation` API for defining graph state schemas:
- `Annotation<T>(reducer?)` — creates a state key with optional reducer
- `Annotation.Root({ key: Annotation<T> })` — combines keys into a state schema
- Reducers: `LastValue` (default — overwrite), `BinaryOperator` (custom merge function)
#### Scenario: Annotation.Root defines typed state
- **WHEN** `const State = Annotation.Root({ messages: Annotation<string[]>(addMessages), step: Annotation<number>() })` is defined
- **THEN** `State` SHALL have `State`, `Update`, and `Node` type members
#### Scenario: LastValue reducer replaces on each write
- **WHEN** a node writes `{ step: 2 }` and then `{ step: 3 }` in the same step
- **THEN** the LastValue channel SHALL throw an `InvalidUpdateError`
#### Scenario: BinaryOperator reducer accumulates
- **WHEN** a node returns `{ messages: ["hello"] }` and another returns `{ messages: ["world"] }` with an `addMessages` reducer
- **THEN** the final state SHALL contain `messages: ["hello", "world"]`
### Requirement: StateGraph builder
The system SHALL provide a `StateGraph` class for constructing stateful agent graphs.
#### Scenario: StateGraph is constructed with state schema
- **WHEN** `new StateGraph({ stateSchema: State })` is called
- **THEN** the graph SHALL accept nodes that receive and can update the defined state
#### Scenario: Nodes can read and write state
- **WHEN** a node function receives state with `{ messages, step }` and returns `{ step: step + 1 }`
- **THEN** the graph SHALL update `step` and preserve `messages`
#### Scenario: Conditional edges route based on state
- **WHEN** `addConditionalEdges("node_a", (state) => state.step > 5 ? "end" : "node_b")` is added
- **THEN** execution SHALL route based on the state value at runtime

View File

@@ -0,0 +1,51 @@
## ADDED Requirements
### Requirement: Trajectory match evaluator
The system SHALL provide `create_trajectory_match_evaluator()` that compares agent tool-call trajectories against reference trajectories.
Parameters:
- `trajectory_match_mode: "strict" | "unordered" | "subset" | "superset"` — matching strategy
- `tool_args_match_mode: "exact" | "ignore" | "subset" | "superset"` — tool argument comparison
- `tool_args_match_overrides?: Record<string, ToolArgsMatchMode | Callable | string[]>` — per-tool custom matching
#### Scenario: Strict mode requires exact order
- **WHEN** output trajectory has tool calls `[A, B]` and reference is `[A, B]`
- **THEN** strict mode SHALL return score `true`
- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
- **THEN** strict mode SHALL return score `false`
#### Scenario: Unordered mode ignores order
- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
- **THEN** unordered mode SHALL return score `true`
#### Scenario: Subset mode accepts partial trajectory
- **WHEN** output trajectory has tool calls `[A]` and reference is `[A, B]`
- **THEN** subset mode SHALL return score `true`
#### Scenario: Superset mode allows extra tool calls
- **WHEN** output trajectory has tool calls `[A, B, C]` and reference is `[A, B]`
- **THEN** superset mode SHALL return score `true`
#### Scenario: Tool args ignore mode skips argument comparison
- **WHEN** `tool_args_match_mode="ignore"` is set
- **THEN** tool calls match regardless of their arguments
#### Scenario: Custom tool arg matcher is used
- **WHEN** `tool_args_match_overrides` contains a `Callable` for a tool name
- **THEN** that callable SHALL be invoked to compare the tool's arguments
### Requirement: Trajectory LLM-as-judge
The system SHALL provide `create_trajectory_llm_as_judge()` that uses an LLM to grade trajectory quality and accuracy.
#### Scenario: Trajectory is formatted as XML for LLM
- **WHEN** an LLM trajectory evaluator is invoked
- **THEN** the trajectory SHALL be formatted as XML with `<role>`, `<tool_call>`, `<tool_result>` elements