chore(openspec): drop 9 superseded proposals + 11 stub archive files
Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
This commit is contained in:
@@ -0,0 +1,65 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Code LLM-as-judge
|
||||
|
||||
The system SHALL provide `create_code_llm_as_judge()` that evaluates code correctness using an LLM, with code extraction from responses.
|
||||
|
||||
Parameters:
|
||||
- `code_extraction_strategy: "none" | "llm" | "markdown_code_blocks"` — how to extract code from output
|
||||
- `code_extractor?: Callable` — custom extraction function
|
||||
|
||||
#### Scenario: Markdown code block extraction
|
||||
|
||||
- **WHEN** `code_extraction_strategy="markdown_code_blocks"` and output contains triple-backtick code blocks
|
||||
- **THEN** the evaluator SHALL extract code from those blocks before scoring
|
||||
|
||||
#### Scenario: LLM-based code extraction
|
||||
|
||||
- **WHEN** `code_extraction_strategy="llm"` and a `judge` is provided
|
||||
- **THEN** the evaluator SHALL use an LLM with `ExtractCode`/`NoCode` tools to extract code
|
||||
|
||||
#### Scenario: No extraction returns raw output
|
||||
|
||||
- **WHEN** `code_extraction_strategy="none"`
|
||||
- **THEN** the raw output string is passed directly to the scorer
|
||||
|
||||
### Requirement: Static analysis evaluator (Pyright)
|
||||
|
||||
The system SHALL provide `create_pyright_evaluator()` that runs Pyright static type checking on extracted Python code.
|
||||
|
||||
Parameters:
|
||||
- `pyright_cli_args: string[]` — additional CLI flags
|
||||
- `code_extraction_strategy` / `code_extractor` — same as code LLM evaluator
|
||||
|
||||
#### Scenario: Pyright detects type error
|
||||
|
||||
- **WHEN** code with a type error (e.g., `x: int = "string"`) is evaluated
|
||||
- **THEN** the evaluator SHALL return score `false` with error details in `comment`
|
||||
|
||||
#### Scenario: Pyright passes clean code
|
||||
|
||||
- **WHEN** valid Python code is evaluated
|
||||
- **THEN** the evaluator SHALL return score `true`
|
||||
|
||||
### Requirement: Static analysis evaluator (Mypy)
|
||||
|
||||
The system SHALL provide `create_mypy_evaluator()` with equivalent behavior to Pyright evaluator but using the Mypy type checker.
|
||||
|
||||
#### Scenario: Mypy detects type error
|
||||
|
||||
- **WHEN** code with an unannotated function returning mismatched types is evaluated
|
||||
- **THEN** the evaluator SHALL return score `false`
|
||||
|
||||
### Requirement: Sandboxed code execution
|
||||
|
||||
The system SHALL provide `create_e2b_execution_evaluator()` that executes code in a sandbox and checks for runtime errors.
|
||||
|
||||
#### Scenario: Code executes without errors
|
||||
|
||||
- **WHEN** valid Python code runs in the sandbox
|
||||
- **THEN** the evaluator SHALL return score `true`
|
||||
|
||||
#### Scenario: Code raises runtime exception
|
||||
|
||||
- **WHEN** code that raises an exception is executed
|
||||
- **THEN** the evaluator SHALL return score `false` with error details
|
||||
@@ -0,0 +1,31 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Shared type system
|
||||
|
||||
The system SHALL define a shared set of types used by all packages:
|
||||
- `EvaluatorResult` — TypedDict with `key: string`, `score: number | boolean`, `comment?: string`, `metadata?: Record<string, unknown>`, `source_run_id?: string`
|
||||
- `ModelClient` — Protocol with `chat.completions.create()` for LLM access
|
||||
- `SandboxProvider` — Interface for provider-agnostic sandbox creation/management
|
||||
- `Checkpointer` — Interface for checkpoint persistence
|
||||
- `Serializable` — Interface requiring `toJSON()` and static `fromJSON()` methods
|
||||
- All evaluators SHALL accept a consistent call signature: `(inputs?, outputs, reference_outputs?, **kwargs)`
|
||||
- Error types: `GraphInterrupt`, `SandboxError`, `EvalError`
|
||||
|
||||
#### Scenario: EvaluatorResult conforms to schema
|
||||
|
||||
- **WHEN** an evaluator returns a result
|
||||
- **THEN** the result SHALL conform to `EvaluatorResult` with at least `key` and `score`
|
||||
|
||||
#### Scenario: All stateful objects are serializable
|
||||
|
||||
- **WHEN** a `Sandbox`, `Snapshot`, or `Command` instance is serialized via `toJSON()`
|
||||
- **THEN** a subsequent `fromJSON()` call SHALL reconstruct an equivalent instance
|
||||
|
||||
### Requirement: Serialization protocol
|
||||
|
||||
All stateful objects (`Sandbox`, `Session`, `Command`, `Snapshot`, `GraphState`) SHALL implement `toJSON()` / `fromJSON()` static methods for cross-session persistence.
|
||||
|
||||
#### Scenario: Round-trip serialization preserves identity
|
||||
|
||||
- **WHEN** an object is serialized and deserialized
|
||||
- **THEN** the deserialized object SHALL have matching identity fields (`id`, `name`, `sessionId`)
|
||||
@@ -0,0 +1,49 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Built-in evaluation prompt templates
|
||||
|
||||
The system SHALL ship with a library of prompt templates organized by domain, ready for use with `create_llm_as_judge()`.
|
||||
|
||||
Domains and included prompts:
|
||||
|
||||
**Quality:**
|
||||
- `CORRECTNESS_PROMPT` — factual accuracy and completeness
|
||||
- `CONCISENESS_PROMPT` — concise responses without hedging or fluff
|
||||
- `HALLUCINATION_PROMPT` — claims verifiable from context
|
||||
- `ANSWER_RELEVANCE_PROMPT` — output addresses the input question
|
||||
- `PLAN_ADHERENCE_PROMPT` — agent actions match declared plan
|
||||
- `LAZINESS_PROMPT` — detects blank or low-effort responses
|
||||
|
||||
**RAG:**
|
||||
- `RAG_GROUNDEDNESS_PROMPT` — output claims supported by retrieved context
|
||||
- `RAG_HELPFULNESS_PROMPT` — output addresses core question
|
||||
- `RAG_RETRIEVAL_RELEVANCE_PROMPT` — retrieved context is relevant to input
|
||||
|
||||
**Safety:**
|
||||
- `TOXICITY_PROMPT` — personal attacks, hate speech
|
||||
- `FAIRNESS_PROMPT` — stereotyping, discrimination
|
||||
|
||||
**Security:**
|
||||
- `PII_LEAKAGE_PROMPT` — names, contact info, credentials in output
|
||||
- `PROMPT_INJECTION_PROMPT` — delimiter manipulation, roleplay bypass
|
||||
- `CODE_INJECTION_PROMPT` — SQL injection, XSS, path traversal
|
||||
|
||||
**Trajectory:**
|
||||
- `TRAJECTORY_ACCURACY_PROMPT` — logical progression, goal alignment
|
||||
- `TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE` — semantically equivalent to reference
|
||||
- `TOOL_SELECTION_PROMPT` — right tools, right order, no redundant calls
|
||||
|
||||
**Conversation:**
|
||||
- `USER_SATISFACTION_PROMPT` — gratitude, resolution, engagement
|
||||
- `TASK_COMPLETION_PROMPT` — was the user's goal achieved
|
||||
- `AGENT_TONE_PROMPT` — appropriate tone and professionalism
|
||||
|
||||
#### Scenario: Each prompt is a string with {inputs}, {outputs}, {reference_outputs} placeholders
|
||||
|
||||
- **WHEN** a prompt template is inspected
|
||||
- **THEN** it SHALL be a string compatible with `str.format()` containing at least `{outputs}`
|
||||
|
||||
#### Scenario: Prompt templates follow rubric structure
|
||||
|
||||
- **WHEN** a prompt template is read
|
||||
- **THEN** it SHALL contain `<Rubric>`, `<Instructions>`, and `<Reminder>` XML sections
|
||||
@@ -0,0 +1,49 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Stream modes
|
||||
|
||||
The system SHALL support multiple stream modes when invoking a compiled graph:
|
||||
|
||||
- `"values"` — emits the full state after each superstep
|
||||
- `"updates"` — emits only the state changes after each superstep
|
||||
- `"messages"` — emits individual message chunks for chat-oriented graphs
|
||||
- `"debug"` — emits debug events with full superstep information
|
||||
- `"custom"` — supports user-defined events via a emit function
|
||||
|
||||
#### Scenario: Values mode emits full state
|
||||
|
||||
- **WHEN** a graph is streamed with `streamMode: ["values"]`
|
||||
- **THEN** each chunk SHALL contain the complete state object after each superstep
|
||||
|
||||
#### Scenario: Updates mode emits diffs
|
||||
|
||||
- **WHEN** a graph is streamed with `streamMode: ["updates"]`
|
||||
- **THEN** each chunk SHALL contain only the state keys that changed
|
||||
|
||||
### Requirement: Stream event protocol
|
||||
|
||||
The system SHALL emit structured events during graph execution, including:
|
||||
- `on_chain_start` — node execution begins
|
||||
- `on_chain_end` — node execution completes
|
||||
- `on_chain_stream` — intermediate output from a node
|
||||
- `on_custom_event` — user-defined events
|
||||
- Checkpoint metadata paired with each event (id, parent_id, step, source)
|
||||
|
||||
#### Scenario: Events include checkpoint metadata
|
||||
|
||||
- **WHEN** a stream event is received
|
||||
- **THEN** it SHALL include a `checkpoint` envelope with `id`, `step`, and `source`
|
||||
|
||||
#### Scenario: Custom events propagate from nodes
|
||||
|
||||
- **WHEN** a node emits a custom event via an emit function
|
||||
- **THEN** that event SHALL appear in the stream with type `on_custom_event`
|
||||
|
||||
### Requirement: Async iteration over streams
|
||||
|
||||
The system SHALL support `for await...of` iteration over graph streams.
|
||||
|
||||
#### Scenario: Stream is async iterable
|
||||
|
||||
- **WHEN** `for await (const chunk of graph.stream(...))` is used
|
||||
- **THEN** each chunk SHALL be available as it is produced
|
||||
@@ -0,0 +1,56 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Node interrupt function
|
||||
|
||||
The system SHALL provide an `interrupt(value)` function that pauses graph execution and returns a resume value when the graph is continued.
|
||||
|
||||
#### Scenario: Interrupt pauses execution with value
|
||||
|
||||
- **WHEN** a node calls `const approval = interrupt({ question: "Approve this action?" })`
|
||||
- **THEN** execution SHALL pause and the interrupt value SHALL be available in the stream output
|
||||
|
||||
#### Scenario: Resume returns value to interrupt
|
||||
|
||||
- **WHEN** the graph is resumed with `Command({ resume: "approved" })`
|
||||
- **THEN** the `interrupt()` call SHALL return `"approved"`
|
||||
|
||||
#### Scenario: Multiple interrupts are supported
|
||||
|
||||
- **WHEN** a node calls `interrupt()` twice
|
||||
- **THEN** each interrupt SHALL be resolved sequentially, requiring two resume commands
|
||||
|
||||
### Requirement: Command-based graph resumption
|
||||
|
||||
The system SHALL provide a `Command` class that supports:
|
||||
- `Command.RESUME` — resume value for pending interrupts
|
||||
- `Command.GOTO` — Send or node name for dynamic routing
|
||||
- `Command.PARENT` — bubble up to parent graph
|
||||
|
||||
#### Scenario: Command with resume continues execution
|
||||
|
||||
- **WHEN** `await graph.stream(new Command({ resume: "user input" }))` is called
|
||||
- **THEN** the interrupted node SHALL continue with the resume value
|
||||
|
||||
#### Scenario: Command with goto routes dynamically
|
||||
|
||||
- **WHEN** a node returns `new Command({ goto: "human_review" })`
|
||||
- **THEN** execution SHALL route to `human_review` node
|
||||
|
||||
### Requirement: Automated interrupts at node boundaries
|
||||
|
||||
The system SHALL support `interruptBefore` and `interruptAfter` in `compile()` options to automatically pause at specific nodes.
|
||||
|
||||
#### Scenario: InterruptBefore pauses before node execution
|
||||
|
||||
- **WHEN** `graph.compile({ interruptBefore: ["approval_node"] })` is used
|
||||
- **THEN** the graph SHALL pause just before executing `approval_node`
|
||||
|
||||
### Requirement: State snapshots on interrupt
|
||||
|
||||
When a graph uses a checkpointer, interrupt states SHALL be persisted so execution can be resumed across process boundaries.
|
||||
|
||||
#### Scenario: Interrupted state is checkpointed
|
||||
|
||||
- **WHEN** a graphed with a checkpointer is interrupted
|
||||
- **THEN** the checkpoint SHALL contain the interrupt state
|
||||
- **THEN** restoring from that checkpoint SHALL yield the same interrupt state
|
||||
@@ -0,0 +1,55 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: LLM-as-judge evaluator factory
|
||||
|
||||
The system SHALL provide a `create_llm_as_judge()` function that creates an evaluator using an LLM to assess output quality.
|
||||
|
||||
Parameters:
|
||||
- `prompt: string | Runnable | Callable` — evaluation prompt (string template with `{inputs}`, `{outputs}`, `{reference_outputs}`)
|
||||
- `judge?: ModelClient | BaseChatModel` — LLM client
|
||||
- `model?: string` — model identifier
|
||||
- `system?: string` — optional system message
|
||||
- `continuous: boolean = false` — float 0-1 scoring when true, boolean when false
|
||||
- `choices?: number[]` — specific enum float values for score
|
||||
- `use_reasoning: boolean = true` — include reasoning in output
|
||||
- `few_shot_examples?: FewShotExample[]` — example evaluations
|
||||
- `output_schema?: JSONSchema | ZodSchema` — custom structured output format
|
||||
|
||||
#### Scenario: String prompt evaluator returns scored result
|
||||
|
||||
- **WHEN** `create_llm_as_judge(prompt="Rate: {outputs}")` is invoked with `outputs="Hello world"`
|
||||
- **THEN** it SHALL return an `EvaluatorResult` with `key: "score"` and a valid `score`
|
||||
|
||||
#### Scenario: Continuous scoring returns float
|
||||
|
||||
- **WHEN** `create_llm_as_judge(prompt=..., continuous=true)` scores output
|
||||
- **THEN** the score SHALL be a float between 0.0 and 1.0
|
||||
|
||||
#### Scenario: Choices scoring returns enum value
|
||||
|
||||
- **WHEN** `create_llm_as_judge(prompt=..., choices=[0.0, 0.5, 1.0])` scores output
|
||||
- **THEN** the score SHALL be exactly one of the enumerated choices
|
||||
|
||||
#### Scenario: Reasoning mode returns comment
|
||||
|
||||
- **WHEN** `create_llm_as_judge(prompt=..., use_reasoning=true)` scores output
|
||||
- **THEN** the `comment` field SHALL contain the LLM's reasoning
|
||||
|
||||
#### Scenario: Few-shot examples are appended to prompt
|
||||
|
||||
- **WHEN** `few_shot_examples` are provided
|
||||
- **THEN** they SHALL be appended as `<example>` XML blocks to the last user message
|
||||
|
||||
#### Scenario: Output schema returns structured dict
|
||||
|
||||
- **WHEN** `output_schema` is provided (e.g., z.object({ quality: z.number() }))
|
||||
- **THEN** the evaluator SHALL return a dict matching that schema instead of `EvaluatorResult`
|
||||
|
||||
### Requirement: Async LLM-as-judge
|
||||
|
||||
The system SHALL provide `create_async_llm_as_judge()` with identical parameters, returning an async evaluator.
|
||||
|
||||
#### Scenario: Async evaluator returns same structure as sync
|
||||
|
||||
- **WHEN** `await` is used on an async evaluator invocation
|
||||
- **THEN** the result SHALL match the same structure as the sync equivalent
|
||||
@@ -0,0 +1,39 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Multi-turn conversation simulation
|
||||
|
||||
The system SHALL provide `run_multiturn_simulation()` that simulates a multi-turn conversation between an app and a simulated user.
|
||||
|
||||
Parameters:
|
||||
- `app: Callable[[ChatCompletionMessage], ChatCompletionMessage]` — the application under test
|
||||
- `user: Callable | string[]` — simulated user (dynamic or static responses)
|
||||
- `max_turns?: number` — maximum conversation turns
|
||||
- `trajectory_evaluators?: EvalFunction[]` — evaluators that assess the final trajectory
|
||||
- `stopping_condition?: Callable[[Message[], number], boolean]` — early termination
|
||||
- `reference_outputs?: unknown` — passed to evaluators
|
||||
|
||||
#### Scenario: Static user responses drive conversation
|
||||
|
||||
- **WHEN** `user=["Hello", "Tell me more", "Goodbye"]` with `max_turns=3`
|
||||
- **THEN** the simulation SHALL alternate between user responses and app responses for 3 turns
|
||||
|
||||
#### Scenario: Dynamic simulated user adapts to context
|
||||
|
||||
- **WHEN** `user` is a `Callable` receiving the current trajectory
|
||||
- **THEN** the user function SHALL receive the current conversation history and return the next message
|
||||
|
||||
#### Scenario: Trajectory evaluators run after simulation
|
||||
|
||||
- **WHEN** `trajectory_evaluators` are provided
|
||||
- **THEN** each evaluator SHALL receive the full conversation trajectory as `outputs`
|
||||
- **THEN** the simulation result SHALL include `evaluator_results` from each evaluator
|
||||
|
||||
#### Scenario: Stopping condition terminates early
|
||||
|
||||
- **WHEN** `stopping_condition` returns `true` before `max_turns`
|
||||
- **THEN** the simulation SHALL terminate immediately
|
||||
|
||||
#### Scenario: Async simulation is supported
|
||||
|
||||
- **WHEN** `run_multiturn_simulation_async()` is called with async `app` and `user` functions
|
||||
- **THEN** the simulation SHALL await each turn and return the same result structure
|
||||
@@ -0,0 +1,49 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Pregel execution engine
|
||||
|
||||
The system SHALL implement a Pregel-style superstep execution engine where:
|
||||
|
||||
- Each "superstep" executes all ready nodes concurrently
|
||||
- Nodes communicate through typed channels (not direct function calls)
|
||||
- Channel writes from one superstep are visible as reads in the next
|
||||
- The engine supports `PULL` (edge-triggered) and `PUSH` (dynamic Send) task scheduling
|
||||
|
||||
#### Scenario: Nodes execute in dependency order
|
||||
|
||||
- **WHEN** node B subscribes to channel A
|
||||
- **THEN** node B SHALL execute in the superstep after node A writes to channel A
|
||||
|
||||
#### Scenario: Concurrent nodes run in parallel
|
||||
|
||||
- **WHEN** two nodes have no dependencies between them
|
||||
- **THEN** they SHALL execute concurrently within the same superstep
|
||||
|
||||
#### Scenario: Dynamic Send spawns new node executions
|
||||
|
||||
- **WHEN** a node calls `send("node_c", { ... })` via `Command`
|
||||
- **THEN** `node_c` SHALL be scheduled for execution in the current or next superstep
|
||||
|
||||
### Requirement: Graph compilation
|
||||
|
||||
The system SHALL provide `graph.compile()` that produces a runnable compiled graph.
|
||||
|
||||
Parameters:
|
||||
- `checkpointer?: Checkpointer` — optional persistence
|
||||
- `interruptBefore?: string[]` — nodes to pause before
|
||||
- `interruptAfter?: string[]` — nodes to pause after
|
||||
- `name?: string` — graph name
|
||||
|
||||
#### Scenario: Compiled graph can be invoked
|
||||
|
||||
- **WHEN** `compiled_graph.invoke({ messages: [] })` is called
|
||||
- **THEN** it SHALL execute all nodes and return the final state
|
||||
|
||||
### Requirement: Recursion limit
|
||||
|
||||
The system SHALL enforce a configurable recursion limit to prevent infinite loops.
|
||||
|
||||
#### Scenario: Exceeding recursion limit throws
|
||||
|
||||
- **WHEN** a graph exceeds the recursion limit
|
||||
- **THEN** a `GraphRecursionError` SHALL be thrown
|
||||
@@ -0,0 +1,61 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Command execution (blocking)
|
||||
|
||||
The system SHALL provide `sandbox.runCommand(cmd, args?, opts?)` that executes a command inside the sandbox and waits for completion.
|
||||
|
||||
Parameters:
|
||||
- `cmd: string` — command to execute
|
||||
- `args?: string[]` — command arguments
|
||||
- `cwd?: string` — working directory
|
||||
- `env?: Record<string, string>` — per-command environment variables
|
||||
- `sudo?: boolean` — execute with root privileges
|
||||
- `timeoutMs?: number` — max execution time (SIGKILL on expiry)
|
||||
- `signal?: AbortSignal` — cancellation
|
||||
|
||||
#### Scenario: Blocking runCommand returns finished result with exit code
|
||||
|
||||
- **WHEN** `sandbox.runCommand("echo", ["hello"])` is called
|
||||
- **THEN** it SHALL return a `CommandFinished` instance with `exitCode: 0`
|
||||
|
||||
#### Scenario: Command timeout kills process
|
||||
|
||||
- **WHEN** `sandbox.runCommand("sleep", ["100"], { timeoutMs: 100 })` is executed
|
||||
- **THEN** it SHALL return a non-zero exit code after ~100ms
|
||||
|
||||
#### Scenario: Stderr is captured separately
|
||||
|
||||
- **WHEN** a command writes to both stdout and stderr
|
||||
- **THEN** `result.stdout()` and `result.stderr()` SHALL return their respective streams
|
||||
|
||||
### Requirement: Detached command execution
|
||||
|
||||
The system SHALL support `{ detached: true }` mode where `runCommand()` returns immediately with a live `Command` handle.
|
||||
|
||||
#### Scenario: Detached command returns before completion
|
||||
|
||||
- **WHEN** `sandbox.runCommand({ cmd: "sleep", args: ["5"], detached: true })` is called
|
||||
- **THEN** it SHALL return a `Command` instance immediately (before the process exits)
|
||||
|
||||
#### Scenario: Detached command can be waited on
|
||||
|
||||
- **WHEN** `command.wait()` is called on a detached command
|
||||
- **THEN** it SHALL return a `CommandFinished` when the process exits
|
||||
|
||||
### Requirement: Command log streaming
|
||||
|
||||
The system SHALL provide `command.logs()` as an async iterable of stdout/stderr log lines.
|
||||
|
||||
#### Scenario: Logs stream output lines
|
||||
|
||||
- **WHEN** `for await (const log of command.logs())` is iterated
|
||||
- **THEN** each `log` SHALL have `stream: "stdout" | "stderr"` and `data: string`
|
||||
|
||||
### Requirement: Command kill
|
||||
|
||||
The system SHALL provide `command.kill(signal?)` to send a POSIX signal to a running command.
|
||||
|
||||
#### Scenario: Default kill sends SIGTERM
|
||||
|
||||
- **WHEN** `command.kill()` is called without a signal
|
||||
- **THEN** SIGTERM SHALL be sent to the process
|
||||
@@ -0,0 +1,50 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Filesystem API matching node:fs/promises
|
||||
|
||||
The system SHALL provide `sandbox.fs` implementing the Node.js `fs/promises` API:
|
||||
|
||||
- `readFile(path, encoding?)` → `Buffer | string`
|
||||
- `writeFile(path, data)` → `void`
|
||||
- `appendFile(path, data)` → `void`
|
||||
- `mkdir(path, { recursive? })` → `void`
|
||||
- `readdir(path, { withFileTypes? })` → `string[] | Dirent[]`
|
||||
- `stat(path)` / `lstat(path)` → `Stats`
|
||||
- `unlink(path)`, `rm(path, { recursive?, force? })`, `rmdir(path)` → `void`
|
||||
- `rename(oldPath, newPath)` → `void`
|
||||
- `copyFile(src, dest)` → `void`
|
||||
- `chmod(path, mode)`, `chown(path, uid, gid)` → `void`
|
||||
- `symlink(target, path)`, `readlink(path)` → `void`
|
||||
- `realpath(path)`, `truncate(path, len?)` → `void`
|
||||
- `mkdtemp(prefix)` → `string`
|
||||
- `access(path)`, `exists(path)` → `boolean`
|
||||
|
||||
#### Scenario: ReadFile returns correct content
|
||||
|
||||
- **WHEN** `sandbox.fs.readFile("/etc/hostname", "utf8")` is called
|
||||
- **THEN** it SHALL return the file content as a string
|
||||
|
||||
#### Scenario: WriteFile creates new file
|
||||
|
||||
- **WHEN** `sandbox.fs.writeFile("/tmp/test.txt", "hello")` is called
|
||||
- **THEN** subsequent `sandbox.fs.readFile("/tmp/test.txt", "utf8")` SHALL return `"hello"`
|
||||
|
||||
#### Scenario: Readdir lists directory contents
|
||||
|
||||
- **WHEN** `sandbox.fs.readdir("/")` is called
|
||||
- **THEN** it SHALL return an array of filenames
|
||||
|
||||
#### Scenario: Stat returns file metadata
|
||||
|
||||
- **WHEN** `sandbox.fs.stat("/etc/hostname")` is called
|
||||
- **THEN** it SHALL return a `Stats`-compatible object with `size`, `isFile()`, `isDirectory()`, `mode`, `uid`, `gid`, `mtime`, etc.
|
||||
|
||||
#### Scenario: Mkdir creates intermediate directories
|
||||
|
||||
- **WHEN** `sandbox.fs.mkdir("/tmp/a/b/c", { recursive: true })` is called
|
||||
- **THEN** the directory `/tmp/a/b/c` SHALL exist
|
||||
|
||||
#### Scenario: Exists returns false for missing files
|
||||
|
||||
- **WHEN** `sandbox.fs.exists("/nonexistent")` is called
|
||||
- **THEN** it SHALL return `false`
|
||||
@@ -0,0 +1,70 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Sandbox creation
|
||||
|
||||
The system SHALL provide a `Sandbox.create()` static method that provisions a new isolated compute environment.
|
||||
|
||||
Parameters:
|
||||
- `name?: string` — optional human-readable name
|
||||
- `source?: { type: "git" | "tarball" | "snapshot" }` — source for initial filesystem
|
||||
- `ports?: number[]` — ports to expose (max 4)
|
||||
- `timeout?: number` — auto-terminate timeout in ms
|
||||
- `resources?: { vcpus: number }` — CPU allocation (2048 MB RAM per vCPU)
|
||||
- `runtime?: string` — runtime identifier
|
||||
- `networkPolicy?: NetworkPolicy` — network restrictions
|
||||
- `env?: Record<string, string>` — default environment variables
|
||||
- `tags?: Record<string, string>` — metadata tags (max 5)
|
||||
- `persistent?: boolean` — persistent filesystem across sessions
|
||||
- `signal?: AbortSignal` — cancellation support
|
||||
|
||||
#### Scenario: Create returns a running Sandbox instance
|
||||
|
||||
- **WHEN** `Sandbox.create()` is called with valid parameters
|
||||
- **THEN** it SHALL return a `Sandbox` instance with a running session
|
||||
|
||||
#### Scenario: Create supports AsyncDisposable
|
||||
|
||||
- **WHEN** `Sandbox.create()` is used with `await using`
|
||||
- **THEN** the sandbox SHALL be automatically stopped when scope exits
|
||||
|
||||
#### Scenario: Source specifies initial filesystem content
|
||||
|
||||
- **WHEN** `source: { type: "git", url: "..." }` is provided
|
||||
- **THEN** the sandbox SHALL clone the git repository on creation
|
||||
|
||||
### Requirement: Sandbox retrieval
|
||||
|
||||
The system SHALL provide `Sandbox.get()` to retrieve an existing sandbox and `Sandbox.getOrCreate()` for idempotent get-or-create.
|
||||
|
||||
#### Scenario: Get retrieves existing sandbox
|
||||
|
||||
- **WHEN** `Sandbox.get({ name: "my-sandbox" })` is called for an existing sandbox
|
||||
- **THEN** it SHALL return the sandbox with its session resumed
|
||||
|
||||
#### Scenario: GetOrCreate creates when not found
|
||||
|
||||
- **WHEN** `Sandbox.getOrCreate({ name: "new-sandbox", onCreate: ... })` is called and sandbox doesn't exist
|
||||
- **THEN** it SHALL create a new sandbox and call `onCreate` once
|
||||
|
||||
### Requirement: Sandbox forking
|
||||
|
||||
The system SHALL provide `Sandbox.fork()` to create a new sandbox from an existing one's current filesystem state.
|
||||
|
||||
#### Scenario: Fork preserves filesystem state
|
||||
|
||||
- **WHEN** `Sandbox.fork({ sourceSandbox: "original" })` is called
|
||||
- **THEN** the new sandbox SHALL start with the filesystem state of the source sandbox
|
||||
|
||||
### Requirement: Sandbox update and delete
|
||||
|
||||
The system SHALL support `sandbox.update()` for configuration changes and `sandbox.delete()` for removal.
|
||||
|
||||
#### Scenario: Update changes sandbox config
|
||||
|
||||
- **WHEN** `sandbox.update({ timeout: 300000 })` is called
|
||||
- **THEN** the sandbox's timeout SHALL be updated for subsequent sessions
|
||||
|
||||
#### Scenario: Delete removes the sandbox
|
||||
|
||||
- **WHEN** `sandbox.delete()` is called
|
||||
- **THEN** the sandbox SHALL be permanently removed
|
||||
@@ -0,0 +1,52 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Network policy type
|
||||
|
||||
The system SHALL define a `NetworkPolicy` type with three forms:
|
||||
|
||||
- `"allow-all"` — full internet access (default)
|
||||
- `"deny-all"` — no external access
|
||||
- `{ allow?: string[] | Record<string, NetworkPolicyRule[]>; subnets?: { allow?: string[]; deny?: string[] } }` — custom rules
|
||||
|
||||
#### Scenario: Allow-all permits all traffic
|
||||
|
||||
- **WHEN** `networkPolicy: "allow-all"` is set
|
||||
- **THEN** all outbound traffic SHALL be permitted
|
||||
|
||||
#### Scenario: Deny-all blocks all traffic
|
||||
|
||||
- **WHEN** `networkPolicy: "deny-all"` is set
|
||||
- **THEN** all outbound traffic SHALL be denied
|
||||
|
||||
#### Scenario: Domain allowlist restricts access
|
||||
|
||||
- **WHEN** `networkPolicy: { allow: ["*.npmjs.org"] }` is set
|
||||
- **THEN** traffic to `registry.npmjs.org` SHALL be allowed and all other traffic SHALL be denied
|
||||
|
||||
#### Scenario: Wildcard domains match subdomains
|
||||
|
||||
- **WHEN** a domain pattern starts with `*.` (e.g., `*.example.com`)
|
||||
- **THEN** it SHALL match any subdomain of that domain
|
||||
|
||||
### Requirement: Network policy rules with transformers
|
||||
|
||||
The system SHALL support per-domain rules with request transformers for header injection.
|
||||
|
||||
Parameters per rule:
|
||||
- `match?: { path?, method?, queryString?, headers? }` — request matchers
|
||||
- `transform?: { headers: Record<string, string> }[]` — header injection
|
||||
- `forwardURL?: string` — HTTPS proxy forwarding
|
||||
|
||||
#### Scenario: Header transform injects authorization
|
||||
|
||||
- **WHEN** a request matches a rule with `transform: [{ headers: { authorization: "Bearer token" } }]`
|
||||
- **THEN** the `authorization` header SHALL be injected before forwarding
|
||||
|
||||
### Requirement: Subnet filtering
|
||||
|
||||
The system SHALL support subnet-level access control via CIDR notation.
|
||||
|
||||
#### Scenario: Subnet allow takes precedence over domain deny
|
||||
|
||||
- **WHEN** `subnets: { allow: ["10.0.0.0/8"] }` is set
|
||||
- **THEN** traffic to `10.0.0.1` SHALL be allowed regardless of domain rules
|
||||
@@ -0,0 +1,59 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Snapshot creation
|
||||
|
||||
The system SHALL provide `sandbox.snapshot()` to create a point-in-time filesystem snapshot.
|
||||
|
||||
Parameters:
|
||||
- `expiration?: number` — TTL in milliseconds (0 for no expiration)
|
||||
|
||||
#### Scenario: Snapshot stops the session and returns Snapshot instance
|
||||
|
||||
- **WHEN** `sandbox.snapshot()` is called on a running sandbox
|
||||
- **THEN** the current session SHALL be stopped and a `Snapshot` SHALL be returned
|
||||
|
||||
### Requirement: Snapshot retrieval and listing
|
||||
|
||||
The system SHALL provide `Snapshot.get()`, `Snapshot.list()`, and `Snapshot.tree()` for managing snapshots.
|
||||
|
||||
#### Scenario: Retrieve snapshot by ID
|
||||
|
||||
- **WHEN** `Snapshot.get({ snapshotId: "snap_abc" })` is called
|
||||
- **THEN** it SHALL return the snapshot with matching ID
|
||||
|
||||
#### Scenario: List snapshots with pagination
|
||||
|
||||
- **WHEN** `Snapshot.list({ name: "my-sandbox" })` is called
|
||||
- **THEN** it SHALL return a paginated list of snapshots for that sandbox
|
||||
|
||||
#### Scenario: Ancestry tree is accessible
|
||||
|
||||
- **WHEN** `Snapshot.tree({ snapshotId: "snap_abc" })` is called
|
||||
- **THEN** it SHALL return the ancestry tree of the snapshot
|
||||
|
||||
### Requirement: Snapshot deletion
|
||||
|
||||
The system SHALL provide `snapshot.delete()` to remove a snapshot.
|
||||
|
||||
#### Scenario: Deleted snapshot is no longer listable
|
||||
|
||||
- **WHEN** `snapshot.delete()` is called and then `Snapshot.list()` is called
|
||||
- **THEN** the deleted snapshot SHALL no longer appear in the list
|
||||
|
||||
### Requirement: Snapshot-based sandbox creation
|
||||
|
||||
The system SHALL support creating sandboxes from snapshots via `Sandbox.create({ source: { type: "snapshot", snapshotId } })`.
|
||||
|
||||
#### Scenario: Sandbox created from snapshot has matching filesystem
|
||||
|
||||
- **WHEN** a sandbox is created with a snapshot source and a file is written, then another sandbox is created from the resulting snapshot
|
||||
- **THEN** the second sandbox SHALL contain the file from the first
|
||||
|
||||
### Requirement: Snapshot retention
|
||||
|
||||
The system SHALL support `keepLastSnapshots` retention policy on sandboxes.
|
||||
|
||||
#### Scenario: Retention evicts oldest snapshots
|
||||
|
||||
- **WHEN** a sandbox has `keepLastSnapshots: { count: 3 }` and a 4th snapshot is created
|
||||
- **THEN** the oldest snapshot SHALL be evicted
|
||||
@@ -0,0 +1,43 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: State definition via annotations
|
||||
|
||||
The system SHALL provide an `Annotation` API for defining graph state schemas:
|
||||
|
||||
- `Annotation<T>(reducer?)` — creates a state key with optional reducer
|
||||
- `Annotation.Root({ key: Annotation<T> })` — combines keys into a state schema
|
||||
- Reducers: `LastValue` (default — overwrite), `BinaryOperator` (custom merge function)
|
||||
|
||||
#### Scenario: Annotation.Root defines typed state
|
||||
|
||||
- **WHEN** `const State = Annotation.Root({ messages: Annotation<string[]>(addMessages), step: Annotation<number>() })` is defined
|
||||
- **THEN** `State` SHALL have `State`, `Update`, and `Node` type members
|
||||
|
||||
#### Scenario: LastValue reducer replaces on each write
|
||||
|
||||
- **WHEN** a node writes `{ step: 2 }` and then `{ step: 3 }` in the same step
|
||||
- **THEN** the LastValue channel SHALL throw an `InvalidUpdateError`
|
||||
|
||||
#### Scenario: BinaryOperator reducer accumulates
|
||||
|
||||
- **WHEN** a node returns `{ messages: ["hello"] }` and another returns `{ messages: ["world"] }` with an `addMessages` reducer
|
||||
- **THEN** the final state SHALL contain `messages: ["hello", "world"]`
|
||||
|
||||
### Requirement: StateGraph builder
|
||||
|
||||
The system SHALL provide a `StateGraph` class for constructing stateful agent graphs.
|
||||
|
||||
#### Scenario: StateGraph is constructed with state schema
|
||||
|
||||
- **WHEN** `new StateGraph({ stateSchema: State })` is called
|
||||
- **THEN** the graph SHALL accept nodes that receive and can update the defined state
|
||||
|
||||
#### Scenario: Nodes can read and write state
|
||||
|
||||
- **WHEN** a node function receives state with `{ messages, step }` and returns `{ step: step + 1 }`
|
||||
- **THEN** the graph SHALL update `step` and preserve `messages`
|
||||
|
||||
#### Scenario: Conditional edges route based on state
|
||||
|
||||
- **WHEN** `addConditionalEdges("node_a", (state) => state.step > 5 ? "end" : "node_b")` is added
|
||||
- **THEN** execution SHALL route based on the state value at runtime
|
||||
@@ -0,0 +1,51 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Trajectory match evaluator
|
||||
|
||||
The system SHALL provide `create_trajectory_match_evaluator()` that compares agent tool-call trajectories against reference trajectories.
|
||||
|
||||
Parameters:
|
||||
- `trajectory_match_mode: "strict" | "unordered" | "subset" | "superset"` — matching strategy
|
||||
- `tool_args_match_mode: "exact" | "ignore" | "subset" | "superset"` — tool argument comparison
|
||||
- `tool_args_match_overrides?: Record<string, ToolArgsMatchMode | Callable | string[]>` — per-tool custom matching
|
||||
|
||||
#### Scenario: Strict mode requires exact order
|
||||
|
||||
- **WHEN** output trajectory has tool calls `[A, B]` and reference is `[A, B]`
|
||||
- **THEN** strict mode SHALL return score `true`
|
||||
- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
|
||||
- **THEN** strict mode SHALL return score `false`
|
||||
|
||||
#### Scenario: Unordered mode ignores order
|
||||
|
||||
- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
|
||||
- **THEN** unordered mode SHALL return score `true`
|
||||
|
||||
#### Scenario: Subset mode accepts partial trajectory
|
||||
|
||||
- **WHEN** output trajectory has tool calls `[A]` and reference is `[A, B]`
|
||||
- **THEN** subset mode SHALL return score `true`
|
||||
|
||||
#### Scenario: Superset mode allows extra tool calls
|
||||
|
||||
- **WHEN** output trajectory has tool calls `[A, B, C]` and reference is `[A, B]`
|
||||
- **THEN** superset mode SHALL return score `true`
|
||||
|
||||
#### Scenario: Tool args ignore mode skips argument comparison
|
||||
|
||||
- **WHEN** `tool_args_match_mode="ignore"` is set
|
||||
- **THEN** tool calls match regardless of their arguments
|
||||
|
||||
#### Scenario: Custom tool arg matcher is used
|
||||
|
||||
- **WHEN** `tool_args_match_overrides` contains a `Callable` for a tool name
|
||||
- **THEN** that callable SHALL be invoked to compare the tool's arguments
|
||||
|
||||
### Requirement: Trajectory LLM-as-judge
|
||||
|
||||
The system SHALL provide `create_trajectory_llm_as_judge()` that uses an LLM to grade trajectory quality and accuracy.
|
||||
|
||||
#### Scenario: Trajectory is formatted as XML for LLM
|
||||
|
||||
- **WHEN** an LLM trajectory evaluator is invoked
|
||||
- **THEN** the trajectory SHALL be formatted as XML with `<role>`, `<tool_call>`, `<tool_result>` elements
|
||||
Reference in New Issue
Block a user