chore(openspec): drop 9 superseded proposals + 11 stub archive files
Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
@@ -0,0 +1,2 @@
|
||||
schema: spec-driven
|
||||
created: 2026-06-07
|
||||
@@ -0,0 +1,76 @@
|
||||
## Context
|
||||
|
||||
This design defines a unified Agent Evaluation & Execution Runtime combining three subsystems inspired by OpenEvals, Vercel Sandbox, and langgraphjs. The system is a TypeScript monorepo with four packages:
|
||||
|
||||
- **`@agent-runtime/core`** — Shared types, serialization protocol, provider abstraction
|
||||
- **`@agent-runtime/eval`** — LLM-as-judge, trajectory, code correctness, multi-turn sim, prompt library
|
||||
- **`@agent-runtime/sandbox`** — Remote sandbox lifecycle, command execution, filesystem, snapshots, network policy
|
||||
- **`@agent-runtime/graph`** — Stateful graph, Pregel execution, checkpoints, interrupts, streaming
|
||||
|
||||
Each package is independently usable but designed to compose: evals run code in sandboxes, sandbox lifecycles are orchestrated by graphs, and graph nodes can be evaluated by evals.
|
||||
|
||||
## Goals / Non-Goals
|
||||
|
||||
**Goals:**
|
||||
- Zero required runtime dependencies for eval core (optional providers via adapter pattern)
|
||||
- Sandbox abstraction that works with any provider (Vercel, Fly, custom) via APIClient interface
|
||||
- Graph execution with pluggable checkpointers (in-memory, SQLite, Redis, Postgres)
|
||||
- All three subsystems share a common serialization protocol for cross-persistence
|
||||
- Evaluation can target code running inside sandbox instances
|
||||
- Graph nodes can suspend/resume via interrupts with persistent checkpointing
|
||||
|
||||
**Non-Goals:**
|
||||
- Not a replacement for LangChain/LlamaIndex — no integrations with existing frameworks in v1
|
||||
- Not a general-purpose workflow engine — focused on agent/task orchestration patterns
|
||||
- No UI or dashboard in v1 — CLI and programmatic API only
|
||||
- No Python SDK in v1 — TypeScript-first, Python planned
|
||||
|
||||
## Decisions
|
||||
|
||||
### D1: Package Architecture — `core` + 3 domain packages
|
||||
- **Rationale**: Eval, Sandbox, and Graph have zero overlap in concerns but share types (serialization, error handling, config). A shared core avoids circular deps and keeps each package lightweight.
|
||||
- **Alternatives considered**: Monolithic single package — rejected because users may want only one subsystem.
|
||||
|
||||
### D2: Eval Factory Pattern (from OpenEvals)
|
||||
- **Rationale**: OpenEvals' `create_llm_as_judge(prompt, model, ...)` returning a callable is elegant — the evaluator is a function, not a class. Users compose evaluators into test suites. This pattern is preserved exactly.
|
||||
- **Deviation**: Drop LangChain dependency. Use a minimal `ModelClient` protocol (like OpenEvals' `ModelClient` protocol) instead of `BaseChatModel`. Users pass an OpenAI-compatible client or a custom adapter.
|
||||
|
||||
### D3: Sandbox as API Wrapper (from Vercel Sandbox)
|
||||
- **Rationale**: The Vercel Sandbox `Sandbox` class cleanly separates the **Sandbox** (persistent config) from **Session** (running VM). `Sandbox.create()` → VM, `sandbox.runCommand()` → execute, `sandbox.fs` → filesystem. This maps naturally to any provider with Firecracker/kata-containers.
|
||||
- **Deviation**: Abstract `APIClient` behind `SandboxProvider` interface so multiple backends can be plugged in. The `"use step"` Vercel compiler directive is replaced with explicit serialization methods.
|
||||
|
||||
### D4: Graph as Pregel + Checkpointer (from langgraphjs)
|
||||
- **Rationale**: The superstep-based Pregel engine with typed channels is a proven pattern for stateful agent graphs. Separating graph definition (`StateGraph`) from execution (`Pregel.compile()`) is the right abstraction.
|
||||
- **Deviation**: Drop `@langchain/core/runnables` dependency. Define `Runnable` as a minimal interface (invoke, stream only). Use native `Promise` concurrency instead of LangChain callback system.
|
||||
|
||||
### D5: Interrupt/Resume via Checkpoint (from langgraphjs)
|
||||
- **Rationale**: `interrupt()` throwing a typed error that's caught by the execution loop, persisted to checkpoints, and resumed via `Command({resume: ...})` is the cleanest HITL pattern.
|
||||
- **Deviation**: Simplify to a single `GraphInterrupt` error type. No scratchpad — just a sequential interrupt index stored in checkpoint metadata.
|
||||
|
||||
### D6: Serialization Protocol
|
||||
- **Rationale**: Vercel Sandbox's `WORKFLOW_SERIALIZE`/`WORKFLOW_DESERIALIZE` pattern enables cross-session persistence. We adopt `toJSON()`/`fromJSON()` static methods on all stateful types.
|
||||
- **Channels** → serialized as plain objects.
|
||||
- **Checkpoints** → serialized as versioned JSON with hash verification.
|
||||
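As a rough illustration of "versioned JSON with hash verification", the sketch below wraps a checkpoint payload in an envelope and rejects it on load when the version or hash does not match. The envelope fields (`version`, `hash`, `payload`), the function names, and the FNV-1a hash are assumptions made for this example only; a real implementation would more likely use a cryptographic hash such as SHA-256.

```typescript
// Hypothetical envelope shape; none of these field names come from the design.
interface CheckpointEnvelope {
  version: number;
  hash: string;
  payload: string; // JSON-encoded checkpoint state
}

const CHECKPOINT_VERSION = 1;

// FNV-1a (32-bit), chosen only to keep the sketch dependency-free.
// It detects accidental corruption, not deliberate tampering.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

function serializeCheckpoint(state: Record<string, unknown>): string {
  const payload = JSON.stringify(state);
  const envelope: CheckpointEnvelope = {
    version: CHECKPOINT_VERSION,
    hash: fnv1a(payload),
    payload,
  };
  return JSON.stringify(envelope);
}

function deserializeCheckpoint<T>(text: string): T {
  const envelope = JSON.parse(text) as CheckpointEnvelope;
  if (envelope.version !== CHECKPOINT_VERSION) {
    throw new Error(`Unsupported checkpoint version: ${envelope.version}`);
  }
  if (fnv1a(envelope.payload) !== envelope.hash) {
    throw new Error("Checkpoint hash mismatch");
  }
  return JSON.parse(envelope.payload) as T;
}
```

Keeping the payload as an opaque string inside the envelope means the hash is computed over exactly the bytes that are later parsed, so key ordering cannot cause a false mismatch.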
|
||||
### D7: Filesystem API over Shell Commands (from Vercel Sandbox)
|
||||
- **Rationale**: Vercel's `FileSystem` class implements the full `node:fs/promises` API by running shell commands (`stat`, `find`, `mkdir`, etc.) inside the sandbox. This is pragmatic and avoids building a special FS protocol.
|
||||
- **Limitation**: Stat parsing from shell output is fragile. Mitigate with structured output format (JSON + delimiter parsing).
|
||||
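One way to read "structured output format" is to have `stat` print fixed fields joined by a delimiter instead of parsing its human-readable output. The sketch below assumes GNU coreutils `stat` (its `-c` format option) is available inside the sandbox; the `StatInfo` shape, the `statArgs` and `parseStatLine` helpers, and the `|` delimiter are illustrative and not part of the design.

```typescript
interface StatInfo {
  size: number;    // bytes (%s)
  mtimeMs: number; // modification time in milliseconds (%Y is seconds since the epoch)
  mode: number;    // permission bits (%a prints them in octal)
  type: string;    // e.g. "regular file" or "directory" (%F)
  path: string;    // file name as passed to stat (%n)
}

const DELIMITER = "|";
const FIELD_COUNT = 5;

// Arguments for a hypothetical `sandbox.runCommand("stat", statArgs(path))` call.
function statArgs(path: string): string[] {
  return ["-c", ["%s", "%Y", "%a", "%F", "%n"].join(DELIMITER), path];
}

function parseStatLine(line: string): StatInfo {
  const parts = line.replace(/\n$/, "").split(DELIMITER);
  if (parts.length < FIELD_COUNT) {
    throw new Error(`Unexpected stat output: ${line}`);
  }
  const [size, mtime, mode, type] = parts;
  // The name is printed last so that a delimiter inside the path can be rejoined.
  const path = parts.slice(FIELD_COUNT - 1).join(DELIMITER);
  const info: StatInfo = {
    size: Number(size),
    mtimeMs: Number(mtime) * 1000,
    mode: parseInt(mode, 8),
    type,
    path,
  };
  if ([info.size, info.mtimeMs, info.mode].some((n) => Number.isNaN(n))) {
    throw new Error(`Non-numeric field in stat output: ${line}`);
  }
  return info;
}
```

Placing the file name last tolerates delimiters inside paths, but a path containing a newline would still break line-based parsing, so this narrows the fragility rather than removing it.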
|
||||
### D8: Network Policy as TypeScript Types (from Vercel Sandbox)
|
||||
- **Rationale**: The `NetworkPolicy` union type (`"allow-all" | "deny-all" | { allow: ... }`) maps directly to firewall rules. It's declarative, serializable, and provider-agnostic.
|
||||
- **Extension**: Add `tls` and `rateLimit` options beyond what Vercel provides.
|
||||
|
||||
## Risks / Trade-offs
|
||||
|
||||
- **[Risk] Provider coupling for sandbox**: Abstracting `SandboxProvider` might leak provider-specific features. **Mitigation**: Define the interface minimally (CRUD + exec + fs); provider-specific features are accessed via `(sandbox as any)` escape hatch.
|
||||
- **[Risk] Pregel complexity**: The superstep execution model is sophisticated (~2700 lines in langgraphjs). **Mitigation**: Start with sequential execution, add parallelism as optimization. The channel model stays from day one.
|
||||
- **[Risk] Eval without LangChain**: Dropping LangChain means reimplementing structured output parsing (`with_structured_output`). **Mitigation**: Target OpenAI-compatible APIs first (they support `response_format: json_schema` natively). Add generic Zod/json-schema path for other providers.
|
||||
- **[Trade-off] TypeScript-first**: Python users of OpenEvals patterns won't get a direct migration path. **Mitigation**: The eval prompt templates are language-agnostic strings; the core logic is portable.
|
||||
- **[Trade-off] Monorepo overhead**: Four packages with shared config. **Mitigation**: Use minimal workspaces (pnpm/turbo), keep build config shared.
|
||||
|
||||
## Open Questions
|
||||
|
||||
- Should the sandbox provider interface include a `createCheckpoint`/`restoreCheckpoint` for VM-level snapshots, or should that be graph-layer only?
|
||||
- What is the minimum Node.js version? Node 20+ is the working assumption, since the Sandbox lifecycle relies on `AsyncDisposable` support.
|
||||
- Should the eval prompt library ship as part of `@agent-runtime/eval` or as a separate `@agent-runtime/prompts` package?
|
||||
- How should eval results feed back into graph state? E.g., a "code correctness eval" runs inside a graph node, and the score influences routing.
|
||||
@@ -0,0 +1,44 @@
|
||||
## Why
|
||||
|
||||
Building reliable LLM agents requires three tightly integrated subsystems that today exist in separate ecosystems with incompatible interfaces: **evaluation** (OpenEvals), **sandboxed execution** (Vercel Sandbox), and **graph-based orchestration** (langgraphjs). There is no unified runtime that combines all three — meaning every team building production agents must stitch together incompatible libraries, resulting in fragile eval pipelines, uncontained code execution, and ad-hoc workflow orchestration.
|
||||
|
||||
This change proposes **a unified Agent Evaluation & Execution Runtime** that combines patterns from all three into a single, consistent system.
|
||||
|
||||
## What Changes
|
||||
|
||||
- **New `@agent-runtime/eval` package**: LLM-as-judge evaluator, trajectory evaluators, code correctness (static analysis + sandboxed execution), multi-turn simulation, and 20+ built-in eval prompt templates — extracted from OpenEvals without the LangChain dependency
|
||||
- **New `@agent-runtime/sandbox` package**: Remote sandbox provisioning with Firecracker MicroVM isolation, command execution (sync + detached + streaming), POSIX filesystem API, snapshot/checkpoint lifecycle, and network policy control — generalized from Vercel Sandbox patterns
|
||||
- **New `@agent-runtime/graph` package**: Stateful agent graph with Pregel-style superstep execution, typed channel-based state management, pluggable checkpointer, human-in-the-loop interrupts, and multi-mode streaming — inspired by langgraphjs
|
||||
- **New `@agent-runtime/core` package**: Shared types, serialization protocol, configuration system, and provider abstraction layer used by all three subsystems
|
||||
- **Integration wiring**: Eval suite can execute code in sandbox, sandbox lifecycle is orchestrated by graph, graph nodes can be evaluated by evals — forming a virtuous cycle
|
||||
|
||||
## Capabilities
|
||||
|
||||
### New Capabilities
|
||||
|
||||
- `llm-as-judge`: LLM-as-judge evaluator with structured output, continuous/binary/choices scoring, reasoning, few-shot examples, and multimodal attachments
|
||||
- `trajectory-eval`: Agent trajectory evaluation with strict/unordered/subset/superset matching and LLM-as-judge trajectory grading
|
||||
- `code-correctness-eval`: Code correctness evaluation via LLM judging, static analysis (Pyright/MyPy), and sandboxed execution
|
||||
- `multi-turn-simulation`: Multi-turn conversation simulation between an app and simulated users with trajectory evaluation
|
||||
- `eval-prompt-library`: Library of 20+ prompt templates across quality, RAG, safety, security, conversation, image, and voice domains
|
||||
- `sandbox-lifecycle`: Sandbox creation, retrieval, forking, and deletion with Firecracker MicroVM isolation
|
||||
- `sandbox-command-execution`: Command execution with blocking, detached, and streaming modes, timeout enforcement, and signal-based process control
|
||||
- `sandbox-filesystem`: POSIX filesystem operations (read/write/stat/ls/mkdir/cp/mv/rm/chmod/chown/symlink) via remote sandbox
|
||||
- `sandbox-snapshots`: Point-in-time sandbox snapshots with expiration, retention policies, and ancestry trees
|
||||
- `sandbox-network-policy`: Network access control with domain allow/deny, request transformers, and subnet rules
|
||||
- `state-graph`: Stateful graph with Annotation.Root state definition, typed reducers, and StateGraph builder
|
||||
- `pregel-execution`: Superstep-based execution engine with channel communication, parallel task execution, and checkpointing
|
||||
- `human-in-the-loop`: Node-level interrupts with resume values, multi-interrupt support, and Command-based graph resumption
|
||||
- `graph-streaming`: Multi-mode streaming protocol (events, messages, metadata) with envelope parsing
|
||||
|
||||
### Modified Capabilities
|
||||
|
||||
*None — this is a greenfield system.*
|
||||
|
||||
## Impact
|
||||
|
||||
- **New packages**: `@agent-runtime/core`, `@agent-runtime/eval`, `@agent-runtime/sandbox`, `@agent-runtime/graph`
|
||||
- **Languages**: TypeScript (all packages); Python support is planned for the eval package
|
||||
- **Dependencies**: Zero required runtime deps for core eval logic; optional checkpointer backends (SQLite, Redis, Postgres) for graph; sandbox requires HTTP client
|
||||
- **Target platforms**: Node.js 20+, edge-compatible for eval-only usage
|
||||
- **No existing code is modified** — this change is purely additive
|
||||
@@ -0,0 +1,65 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Code LLM-as-judge
|
||||
|
||||
The system SHALL provide `create_code_llm_as_judge()` that evaluates code correctness using an LLM, with code extraction from responses.
|
||||
|
||||
Parameters:
|
||||
- `code_extraction_strategy: "none" | "llm" | "markdown_code_blocks"` — how to extract code from output
|
||||
- `code_extractor?: Callable` — custom extraction function
|
||||
|
||||
#### Scenario: Markdown code block extraction
|
||||
|
||||
- **WHEN** `code_extraction_strategy="markdown_code_blocks"` and output contains triple-backtick code blocks
|
||||
- **THEN** the evaluator SHALL extract code from those blocks before scoring
|
||||
|
||||
#### Scenario: LLM-based code extraction
|
||||
|
||||
- **WHEN** `code_extraction_strategy="llm"` and a `judge` is provided
|
||||
- **THEN** the evaluator SHALL use an LLM with `ExtractCode`/`NoCode` tools to extract code
|
||||
|
||||
#### Scenario: No extraction returns raw output
|
||||
|
||||
- **WHEN** `code_extraction_strategy="none"`
|
||||
- **THEN** the raw output string is passed directly to the scorer
|
||||
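The three extraction strategies can be sketched as a small dispatcher. The function names and the `null` return for "no code found" are assumptions for illustration; the `"llm"` strategy is left unimplemented here because it depends on a judge model.

```typescript
type CodeExtractionStrategy = "none" | "llm" | "markdown_code_blocks";

// Built with repeat() so this example can itself live inside a fenced block.
const FENCE = "`".repeat(3);

// Returns the bodies of all fenced blocks joined by newlines, or null if none exist.
function extractMarkdownCodeBlocks(output: string): string | null {
  const pattern = new RegExp(FENCE + "[^\\n]*\\n([\\s\\S]*?)" + FENCE, "g");
  const blocks: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(output)) !== null) {
    blocks.push(match[1]);
  }
  return blocks.length > 0 ? blocks.join("\n") : null;
}

function extractCode(output: string, strategy: CodeExtractionStrategy): string | null {
  switch (strategy) {
    case "none":
      return output; // the raw output goes straight to the scorer
    case "markdown_code_blocks":
      return extractMarkdownCodeBlocks(output);
    case "llm":
      throw new Error("LLM-based extraction needs a judge model; not covered by this sketch");
  }
}
```

An evaluator built on this would typically treat a `null` result as "nothing to score" and report that in the `comment` field, though the spec above does not say how that case is handled.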
|
||||
### Requirement: Static analysis evaluator (Pyright)
|
||||
|
||||
The system SHALL provide `create_pyright_evaluator()` that runs Pyright static type checking on extracted Python code.
|
||||
|
||||
Parameters:
|
||||
- `pyright_cli_args: string[]` — additional CLI flags
|
||||
- `code_extraction_strategy` / `code_extractor` — same as code LLM evaluator
|
||||
|
||||
#### Scenario: Pyright detects type error
|
||||
|
||||
- **WHEN** code with a type error (e.g., `x: int = "string"`) is evaluated
|
||||
- **THEN** the evaluator SHALL return score `false` with error details in `comment`
|
||||
|
||||
#### Scenario: Pyright passes clean code
|
||||
|
||||
- **WHEN** valid Python code is evaluated
|
||||
- **THEN** the evaluator SHALL return score `true`
|
||||
|
||||
### Requirement: Static analysis evaluator (Mypy)
|
||||
|
||||
The system SHALL provide `create_mypy_evaluator()` with behavior equivalent to the Pyright evaluator, but using the Mypy type checker.
|
||||
|
||||
#### Scenario: Mypy detects type error
|
||||
|
||||
- **WHEN** code with an annotated function whose return value does not match its declared return type is evaluated (Mypy skips the bodies of unannotated functions by default)
|
||||
- **THEN** the evaluator SHALL return score `false`
|
||||
|
||||
### Requirement: Sandboxed code execution
|
||||
|
||||
The system SHALL provide `create_e2b_execution_evaluator()` that executes code in a sandbox and checks for runtime errors.
|
||||
|
||||
#### Scenario: Code executes without errors
|
||||
|
||||
- **WHEN** valid Python code runs in the sandbox
|
||||
- **THEN** the evaluator SHALL return score `true`
|
||||
|
||||
#### Scenario: Code raises runtime exception
|
||||
|
||||
- **WHEN** code that raises an exception is executed
|
||||
- **THEN** the evaluator SHALL return score `false` with error details
|
||||
@@ -0,0 +1,31 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Shared type system
|
||||
|
||||
The system SHALL define a shared set of types used by all packages:
|
||||
- `EvaluatorResult` — TypedDict with `key: string`, `score: number | boolean`, `comment?: string`, `metadata?: Record<string, unknown>`, `source_run_id?: string`
|
||||
- `ModelClient` — Protocol with `chat.completions.create()` for LLM access
|
||||
- `SandboxProvider` — Interface for provider-agnostic sandbox creation/management
|
||||
- `Checkpointer` — Interface for checkpoint persistence
|
||||
- `Serializable` — Interface requiring `toJSON()` and static `fromJSON()` methods
|
||||
- All evaluators SHALL accept a consistent call signature: `(inputs?, outputs, reference_outputs?, **kwargs)`
|
||||
- Error types: `GraphInterrupt`, `SandboxError`, `EvalError`
|
||||
|
||||
#### Scenario: EvaluatorResult conforms to schema
|
||||
|
||||
- **WHEN** an evaluator returns a result
|
||||
- **THEN** the result SHALL conform to `EvaluatorResult` with at least `key` and `score`
|
||||
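The `EvaluatorResult` fields listed above translate directly into a TypeScript interface. The runtime guard `isEvaluatorResult` is a hypothetical helper, shown as one way a test suite might check the "at least `key` and `score`" scenario.

```typescript
interface EvaluatorResult {
  key: string;
  score: number | boolean;
  comment?: string;
  metadata?: Record<string, unknown>;
  source_run_id?: string;
}

// Checks the two required fields and the type of `comment` when present.
// The remaining optional fields are not validated in this sketch.
function isEvaluatorResult(value: unknown): value is EvaluatorResult {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  const score = candidate["score"];
  const comment = candidate["comment"];
  const scoreOk = typeof score === "number" || typeof score === "boolean";
  const commentOk = comment === undefined || typeof comment === "string";
  return typeof candidate["key"] === "string" && scoreOk && commentOk;
}
```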
|
||||
#### Scenario: All stateful objects are serializable
|
||||
|
||||
- **WHEN** a `Sandbox`, `Snapshot`, or `Command` instance is serialized via `toJSON()`
|
||||
- **THEN** a subsequent `fromJSON()` call SHALL reconstruct an equivalent instance
|
||||
|
||||
### Requirement: Serialization protocol
|
||||
|
||||
All stateful objects (`Sandbox`, `Session`, `Command`, `Snapshot`, `GraphState`) SHALL implement `toJSON()` / `fromJSON()` static methods for cross-session persistence.
|
||||
|
||||
#### Scenario: Round-trip serialization preserves identity
|
||||
|
||||
- **WHEN** an object is serialized and deserialized
|
||||
- **THEN** the deserialized object SHALL have matching identity fields (`id`, `name`, `sessionId`)
|
||||
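TypeScript interfaces cannot declare static members, so one possible encoding splits the protocol into an instance side (`toJSON()`) and a static side (`fromJSON()`). The sketch below assumes that split; `SnapshotRef`, `SerializableStatic`, and `roundTrip` are hypothetical names, and the three identity fields mirror the ones named in the scenario above.

```typescript
// Instance side of the protocol.
interface Serializable<J> {
  toJSON(): J;
}

// Static side: an object (usually the class itself) that rebuilds an instance.
interface SerializableStatic<T, J> {
  fromJSON(json: J): T;
}

interface SnapshotJSON {
  id: string;
  name: string;
  sessionId: string;
}

class SnapshotRef implements Serializable<SnapshotJSON> {
  readonly id: string;
  readonly name: string;
  readonly sessionId: string;

  constructor(id: string, name: string, sessionId: string) {
    this.id = id;
    this.name = name;
    this.sessionId = sessionId;
  }

  toJSON(): SnapshotJSON {
    return { id: this.id, name: this.name, sessionId: this.sessionId };
  }

  static fromJSON(json: SnapshotJSON): SnapshotRef {
    return new SnapshotRef(json.id, json.name, json.sessionId);
  }
}

// Compile-time check that the class satisfies the static side.
const snapshotCodec: SerializableStatic<SnapshotRef, SnapshotJSON> = SnapshotRef;

function roundTrip<T extends Serializable<J>, J>(value: T, codec: SerializableStatic<T, J>): T {
  // JSON.stringify calls toJSON() on the value automatically.
  return codec.fromJSON(JSON.parse(JSON.stringify(value)) as J);
}
```

A round trip through `roundTrip(new SnapshotRef("snap_1", "baseline", "sess_9"), snapshotCodec)` yields a distinct instance with the same `id`, `name`, and `sessionId`.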
@@ -0,0 +1,49 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Built-in evaluation prompt templates
|
||||
|
||||
The system SHALL ship with a library of prompt templates organized by domain, ready for use with `create_llm_as_judge()`.
|
||||
|
||||
Domains and included prompts:
|
||||
|
||||
**Quality:**
|
||||
- `CORRECTNESS_PROMPT` — factual accuracy and completeness
|
||||
- `CONCISENESS_PROMPT` — concise responses without hedging or fluff
|
||||
- `HALLUCINATION_PROMPT` — claims verifiable from context
|
||||
- `ANSWER_RELEVANCE_PROMPT` — output addresses the input question
|
||||
- `PLAN_ADHERENCE_PROMPT` — agent actions match declared plan
|
||||
- `LAZINESS_PROMPT` — detects blank or low-effort responses
|
||||
|
||||
**RAG:**
|
||||
- `RAG_GROUNDEDNESS_PROMPT` — output claims supported by retrieved context
|
||||
- `RAG_HELPFULNESS_PROMPT` — output addresses core question
|
||||
- `RAG_RETRIEVAL_RELEVANCE_PROMPT` — retrieved context is relevant to input
|
||||
|
||||
**Safety:**
|
||||
- `TOXICITY_PROMPT` — personal attacks, hate speech
|
||||
- `FAIRNESS_PROMPT` — stereotyping, discrimination
|
||||
|
||||
**Security:**
|
||||
- `PII_LEAKAGE_PROMPT` — names, contact info, credentials in output
|
||||
- `PROMPT_INJECTION_PROMPT` — delimiter manipulation, roleplay bypass
|
||||
- `CODE_INJECTION_PROMPT` — SQL injection, XSS, path traversal
|
||||
|
||||
**Trajectory:**
|
||||
- `TRAJECTORY_ACCURACY_PROMPT` — logical progression, goal alignment
|
||||
- `TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE` — semantically equivalent to reference
|
||||
- `TOOL_SELECTION_PROMPT` — right tools, right order, no redundant calls
|
||||
|
||||
**Conversation:**
|
||||
- `USER_SATISFACTION_PROMPT` — gratitude, resolution, engagement
|
||||
- `TASK_COMPLETION_PROMPT` — was the user's goal achieved
|
||||
- `AGENT_TONE_PROMPT` — appropriate tone and professionalism
|
||||
|
||||
#### Scenario: Each prompt is a string with {inputs}, {outputs}, {reference_outputs} placeholders
|
||||
|
||||
- **WHEN** a prompt template is inspected
|
||||
- **THEN** it SHALL be a string compatible with `str.format()` containing at least `{outputs}`
|
||||
|
||||
#### Scenario: Prompt templates follow rubric structure
|
||||
|
||||
- **WHEN** a prompt template is read
|
||||
- **THEN** it SHALL contain `<Rubric>`, `<Instructions>`, and `<Reminder>` XML sections
|
||||
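The two scenarios above can be checked mechanically. The sketch below is a TypeScript stand-in for `str.format()`-style substitution, with a made-up template in the rubric layout; `EXAMPLE_PROMPT`, `formatPrompt`, and `hasRubricStructure` are hypothetical names, and unlike Python's `str.format()` this version leaves unknown placeholders untouched instead of raising.

```typescript
type PromptVars = { inputs?: string; outputs: string; reference_outputs?: string };

// Hypothetical template following the rubric layout; not one of the shipped prompts.
const EXAMPLE_PROMPT = [
  "<Rubric>",
  "A correct answer is factually accurate and complete.",
  "</Rubric>",
  "<Instructions>",
  "Grade the output against the rubric.",
  "</Instructions>",
  "<Reminder>",
  "Judge only factual correctness.",
  "</Reminder>",
  "<input>{inputs}</input>",
  "<output>{outputs}</output>",
].join("\n");

// Replaces {name} placeholders; names without a supplied value are left as-is.
function formatPrompt(template: string, vars: PromptVars): string {
  const values = vars as Record<string, string | undefined>;
  return template.replace(/\{(\w+)\}/g, (whole: string, name: string) => {
    const value = Object.prototype.hasOwnProperty.call(values, name) ? values[name] : undefined;
    return value ?? whole;
  });
}

function hasRubricStructure(template: string): boolean {
  return ["<Rubric>", "<Instructions>", "<Reminder>"].every((tag) => template.includes(tag));
}
```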
@@ -0,0 +1,49 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Stream modes
|
||||
|
||||
The system SHALL support multiple stream modes when invoking a compiled graph:
|
||||
|
||||
- `"values"` — emits the full state after each superstep
|
||||
- `"updates"` — emits only the state changes after each superstep
|
||||
- `"messages"` — emits individual message chunks for chat-oriented graphs
|
||||
- `"debug"` — emits debug events with full superstep information
|
||||
- `"custom"` — supports user-defined events via an emit function
|
||||
|
||||
#### Scenario: Values mode emits full state
|
||||
|
||||
- **WHEN** a graph is streamed with `streamMode: ["values"]`
|
||||
- **THEN** each chunk SHALL contain the complete state object after each superstep
|
||||
|
||||
#### Scenario: Updates mode emits diffs
|
||||
|
||||
- **WHEN** a graph is streamed with `streamMode: ["updates"]`
|
||||
- **THEN** each chunk SHALL contain only the state keys that changed
|
||||
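The difference between `"values"` and `"updates"` is easiest to see side by side. The sketch below is a simplified model, not the engine itself: `superstepWrites` stands in for the channel writes an engine would produce per superstep, every written key is treated as changed, and the function name is hypothetical.

```typescript
type GraphState = Record<string, unknown>;
type StreamMode = "values" | "updates";

// Yields one chunk per superstep: the merged state for "values",
// or only that superstep's writes for "updates".
function* streamSupersteps(
  initial: GraphState,
  superstepWrites: GraphState[],
  mode: StreamMode,
): Generator<GraphState> {
  let state: GraphState = { ...initial };
  for (const writes of superstepWrites) {
    state = { ...state, ...writes };
    yield mode === "values" ? { ...state } : { ...writes };
  }
}
```

With an initial state of `{ count: 0 }` and writes `[{ count: 1 }, { done: true }]`, the second chunk is `{ count: 1, done: true }` in `"values"` mode and `{ done: true }` in `"updates"` mode.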
|
||||
### Requirement: Stream event protocol
|
||||
|
||||
The system SHALL emit structured events during graph execution, including:
|
||||
- `on_chain_start` — node execution begins
|
||||
- `on_chain_end` — node execution completes
|
||||
- `on_chain_stream` — intermediate output from a node
|
||||
- `on_custom_event` — user-defined events
|
||||
- Checkpoint metadata paired with each event (id, parent_id, step, source)
|
||||
|
||||
#### Scenario: Events include checkpoint metadata
|
||||
|
||||
- **WHEN** a stream event is received
|
||||
- **THEN** it SHALL include a `checkpoint` envelope with `id`, `step`, and `source`
|
||||
|
||||
#### Scenario: Custom events propagate from nodes
|
||||
|
||||
- **WHEN** a node emits a custom event via an emit function
|
||||
- **THEN** that event SHALL appear in the stream with type `on_custom_event`
|
||||
|
||||
### Requirement: Async iteration over streams
|
||||
|
||||
The system SHALL support `for await...of` iteration over graph streams.
|
||||
|
||||
#### Scenario: Stream is async iterable
|
||||
|
||||
- **WHEN** `for await (const chunk of graph.stream(...))` is used
|
||||
- **THEN** each chunk SHALL be available as it is produced
|
||||
@@ -0,0 +1,56 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Node interrupt function
|
||||
|
||||
The system SHALL provide an `interrupt(value)` function that pauses graph execution and returns a resume value when the graph is continued.
|
||||
|
||||
#### Scenario: Interrupt pauses execution with value
|
||||
|
||||
- **WHEN** a node calls `const approval = interrupt({ question: "Approve this action?" })`
|
||||
- **THEN** execution SHALL pause and the interrupt value SHALL be available in the stream output
|
||||
|
||||
#### Scenario: Resume returns value to interrupt
|
||||
|
||||
- **WHEN** the graph is resumed with `Command({ resume: "approved" })`
|
||||
- **THEN** the `interrupt()` call SHALL return `"approved"`
|
||||
|
||||
#### Scenario: Multiple interrupts are supported
|
||||
|
||||
- **WHEN** a node calls `interrupt()` twice
|
||||
- **THEN** each interrupt SHALL be resolved sequentially, requiring two resume commands
|
||||
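A minimal model of the three scenarios above, assuming the design's "sequential interrupt index" approach: the node is re-run from the top on every resume, and each `interrupt()` call either returns a stored resume value or throws. `runNode`, `InterruptFn`, and `NodeResult` are hypothetical names; in the real system the resume values would come from checkpoint metadata rather than a function argument.

```typescript
class GraphInterrupt extends Error {
  readonly value: unknown;
  readonly index: number;

  constructor(value: unknown, index: number) {
    super(`Interrupt #${index}`);
    this.name = "GraphInterrupt";
    this.value = value;
    this.index = index;
    // Keeps instanceof working if the code is compiled down to ES5.
    Object.setPrototypeOf(this, GraphInterrupt.prototype);
  }
}

type InterruptFn = (value: unknown) => unknown;

type NodeResult<T> =
  | { status: "done"; output: T }
  | { status: "interrupted"; value: unknown; index: number };

// Runs the node from the top. Interrupt calls whose index is below
// resumeValues.length return the stored value; the first one beyond that throws.
function runNode<T>(node: (interrupt: InterruptFn) => T, resumeValues: unknown[]): NodeResult<T> {
  let nextIndex = 0;
  const interrupt: InterruptFn = (value) => {
    const index = nextIndex++;
    if (index < resumeValues.length) return resumeValues[index];
    throw new GraphInterrupt(value, index);
  };
  try {
    return { status: "done", output: node(interrupt) };
  } catch (error) {
    if (error instanceof GraphInterrupt) {
      return { status: "interrupted", value: error.value, index: error.index };
    }
    throw error;
  }
}

// Example node with two interrupts, so two resume values are needed to finish.
const approvalNode = (interrupt: InterruptFn): string => {
  const first = interrupt({ question: "Approve this action?" });
  const second = interrupt({ question: "Are you sure?" });
  return `${String(first)}/${String(second)}`;
};
```

Because the node restarts on each resume, any side effects placed before an `interrupt()` call run again, which is a known cost of this pattern.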
|
||||
### Requirement: Command-based graph resumption
|
||||
|
||||
The system SHALL provide a `Command` class that supports:
|
||||
- `Command.RESUME` — resume value for pending interrupts
|
||||
- `Command.GOTO` — Send or node name for dynamic routing
|
||||
- `Command.PARENT` — bubble up to parent graph
|
||||
|
||||
#### Scenario: Command with resume continues execution
|
||||
|
||||
- **WHEN** `await graph.stream(new Command({ resume: "user input" }))` is called
|
||||
- **THEN** the interrupted node SHALL continue with the resume value
|
||||
|
||||
#### Scenario: Command with goto routes dynamically
|
||||
|
||||
- **WHEN** a node returns `new Command({ goto: "human_review" })`
|
||||
- **THEN** execution SHALL route to `human_review` node
|
||||
|
||||
### Requirement: Automated interrupts at node boundaries
|
||||
|
||||
The system SHALL support `interruptBefore` and `interruptAfter` in `compile()` options to automatically pause at specific nodes.
|
||||
|
||||
#### Scenario: InterruptBefore pauses before node execution
|
||||
|
||||
- **WHEN** `graph.compile({ interruptBefore: ["approval_node"] })` is used
|
||||
- **THEN** the graph SHALL pause just before executing `approval_node`
|
||||
|
||||
### Requirement: State snapshots on interrupt
|
||||
|
||||
When a graph uses a checkpointer, interrupt states SHALL be persisted so execution can be resumed across process boundaries.
|
||||
|
||||
#### Scenario: Interrupted state is checkpointed
|
||||
|
||||
- **WHEN** a graph compiled with a checkpointer is interrupted
|
||||
- **THEN** the checkpoint SHALL contain the interrupt state
|
||||
- **THEN** restoring from that checkpoint SHALL yield the same interrupt state
|
||||
@@ -0,0 +1,55 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: LLM-as-judge evaluator factory
|
||||
|
||||
The system SHALL provide a `create_llm_as_judge()` function that creates an evaluator using an LLM to assess output quality.
|
||||
|
||||
Parameters:
|
||||
- `prompt: string | Runnable | Callable` — evaluation prompt (string template with `{inputs}`, `{outputs}`, `{reference_outputs}`)
|
||||
- `judge?: ModelClient | BaseChatModel` — LLM client
|
||||
- `model?: string` — model identifier
|
||||
- `system?: string` — optional system message
|
||||
- `continuous: boolean = false` — float 0-1 scoring when true, boolean when false
|
||||
- `choices?: number[]` — specific enum float values for score
|
||||
- `use_reasoning: boolean = true` — include reasoning in output
|
||||
- `few_shot_examples?: FewShotExample[]` — example evaluations
|
||||
- `output_schema?: JSONSchema | ZodSchema` — custom structured output format
|
||||
|
||||
#### Scenario: String prompt evaluator returns scored result
|
||||
|
||||
- **WHEN** `create_llm_as_judge(prompt="Rate: {outputs}")` is invoked with `outputs="Hello world"`
|
||||
- **THEN** it SHALL return an `EvaluatorResult` with `key: "score"` and a valid `score`
|
||||
|
||||
#### Scenario: Continuous scoring returns float
|
||||
|
||||
- **WHEN** `create_llm_as_judge(prompt=..., continuous=true)` scores output
|
||||
- **THEN** the score SHALL be a float between 0.0 and 1.0
|
||||
|
||||
#### Scenario: Choices scoring returns enum value
|
||||
|
||||
- **WHEN** `create_llm_as_judge(prompt=..., choices=[0.0, 0.5, 1.0])` scores output
|
||||
- **THEN** the score SHALL be exactly one of the enumerated choices
|
||||
|
||||
#### Scenario: Reasoning mode returns comment
|
||||
|
||||
- **WHEN** `create_llm_as_judge(prompt=..., use_reasoning=true)` scores output
|
||||
- **THEN** the `comment` field SHALL contain the LLM's reasoning
|
||||
|
||||
#### Scenario: Few-shot examples are appended to prompt
|
||||
|
||||
- **WHEN** `few_shot_examples` are provided
|
||||
- **THEN** they SHALL be appended as `<example>` XML blocks to the last user message
|
||||
|
||||
#### Scenario: Output schema returns structured dict
|
||||
|
||||
- **WHEN** `output_schema` is provided (e.g., `z.object({ quality: z.number() })`)
|
||||
- **THEN** the evaluator SHALL return a dict matching that schema instead of `EvaluatorResult`
|
||||
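The factory shape described above can be sketched in TypeScript as follows. This is an assumption-heavy illustration: `createLlmAsJudge` is a camelCase rendering of `create_llm_as_judge()`, the `ModelClient` shape is reduced to the one call the spec names, few-shot examples and `output_schema` are omitted, and the judge is assumed to reply with a JSON object containing `reasoning` and `score`. A stub judge is used so the sketch runs without network access.

```typescript
interface ChatMessage { role: "system" | "user"; content: string }
interface ChatResponse { choices: { message: { content: string } }[] }
interface ModelClient {
  chat: { completions: { create(request: { model: string; messages: ChatMessage[] }): Promise<ChatResponse> } };
}
interface EvaluatorResult { key: string; score: number | boolean; comment?: string }
interface JudgeOptions {
  prompt: string;
  judge: ModelClient;
  model: string;
  system?: string;
  continuous?: boolean;
  choices?: number[];
  useReasoning?: boolean;
}
interface EvalArgs { inputs?: string; outputs: string; reference_outputs?: string }

// Checks the raw score against the configured scoring mode.
function validateScore(raw: unknown, opts: Pick<JudgeOptions, "continuous" | "choices">): number | boolean {
  if (opts.choices) {
    if (typeof raw === "number" && opts.choices.includes(raw)) return raw;
    throw new Error(`Score must be one of ${opts.choices.join(", ")}`);
  }
  if (opts.continuous) {
    if (typeof raw === "number" && raw >= 0 && raw <= 1) return raw;
    throw new Error("Score must be a number between 0 and 1");
  }
  if (typeof raw === "boolean") return raw;
  throw new Error("Score must be a boolean");
}

// Substitutes one {name} placeholder; a missing value leaves the template unchanged.
function fill(template: string, name: string, value?: string): string {
  return value === undefined ? template : template.split(`{${name}}`).join(value);
}

function createLlmAsJudge(opts: JudgeOptions): (args: EvalArgs) => Promise<EvaluatorResult> {
  return async (args) => {
    let content = fill(opts.prompt, "inputs", args.inputs);
    content = fill(content, "outputs", args.outputs);
    content = fill(content, "reference_outputs", args.reference_outputs);
    const messages: ChatMessage[] = [];
    if (opts.system !== undefined) messages.push({ role: "system", content: opts.system });
    messages.push({ role: "user", content });
    const response = await opts.judge.chat.completions.create({ model: opts.model, messages });
    // Assumption: the judge was instructed to answer with {"reasoning": ..., "score": ...}.
    const parsed = JSON.parse(response.choices[0].message.content) as { score?: unknown; reasoning?: unknown };
    const result: EvaluatorResult = { key: "score", score: validateScore(parsed.score, opts) };
    if (opts.useReasoning !== false && typeof parsed.reasoning === "string") {
      result.comment = parsed.reasoning;
    }
    return result;
  };
}

// Stand-in judge that always returns the same verdict, so the sketch runs offline.
const stubJudge: ModelClient = {
  chat: {
    completions: {
      create: async () => ({
        choices: [{ message: { content: '{"reasoning":"Looks right.","score":true}' } }],
      }),
    },
  },
};
```

Since every TypeScript evaluator that calls a model is asynchronous, a TypeScript port may not need the separate sync and async factories that the spec lists; that is a design question the sketch does not settle.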
|
||||
### Requirement: Async LLM-as-judge
|
||||
|
||||
The system SHALL provide `create_async_llm_as_judge()` with identical parameters, returning an async evaluator.
|
||||
|
||||
#### Scenario: Async evaluator returns same structure as sync
|
||||
|
||||
- **WHEN** `await` is used on an async evaluator invocation
|
||||
- **THEN** the result SHALL match the same structure as the sync equivalent
|
||||
@@ -0,0 +1,39 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Multi-turn conversation simulation
|
||||
|
||||
The system SHALL provide `run_multiturn_simulation()` that simulates a multi-turn conversation between an app and a simulated user.
|
||||
|
||||
Parameters:
|
||||
- `app: Callable[[ChatCompletionMessage], ChatCompletionMessage]` — the application under test
|
||||
- `user: Callable | string[]` — simulated user (dynamic or static responses)
|
||||
- `max_turns?: number` — maximum conversation turns
|
||||
- `trajectory_evaluators?: EvalFunction[]` — evaluators that assess the final trajectory
|
||||
- `stopping_condition?: Callable[[Message[], number], boolean]` — early termination
|
||||
- `reference_outputs?: unknown` — passed to evaluators
|
||||
|
||||
#### Scenario: Static user responses drive conversation
|
||||
|
||||
- **WHEN** `user=["Hello", "Tell me more", "Goodbye"]` with `max_turns=3`
|
||||
- **THEN** the simulation SHALL alternate between user responses and app responses for 3 turns
|
||||
|
||||
#### Scenario: Dynamic simulated user adapts to context
|
||||
|
||||
- **WHEN** `user` is a `Callable` receiving the current trajectory
|
||||
- **THEN** the user function SHALL receive the current conversation history and return the next message
|
||||
|
||||
#### Scenario: Trajectory evaluators run after simulation
|
||||
|
||||
- **WHEN** `trajectory_evaluators` are provided
|
||||
- **THEN** each evaluator SHALL receive the full conversation trajectory as `outputs`
|
||||
- **THEN** the simulation result SHALL include `evaluator_results` from each evaluator
|
||||
|
||||
#### Scenario: Stopping condition terminates early
|
||||
|
||||
- **WHEN** `stopping_condition` returns `true` before `max_turns`
|
||||
- **THEN** the simulation SHALL terminate immediately
|
||||
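The turn loop behind these scenarios might look like the synchronous sketch below. The option names are camelCase renderings of the parameters above, the default of five turns for a dynamic user is an arbitrary assumption, and the turn counter passed to `stoppingCondition` is assumed to be the number of completed turns.

```typescript
interface Message { role: "user" | "assistant"; content: string }
interface EvaluatorResult { key: string; score: number | boolean }
type TrajectoryEvaluator = (args: { outputs: Message[]; reference_outputs?: unknown }) => EvaluatorResult;

interface SimulationOptions {
  app: (message: Message) => Message;
  user: ((trajectory: Message[]) => Message) | string[];
  maxTurns?: number;
  trajectoryEvaluators?: TrajectoryEvaluator[];
  stoppingCondition?: (trajectory: Message[], turn: number) => boolean;
  referenceOutputs?: unknown;
}

interface SimulationResult { trajectory: Message[]; evaluator_results: EvaluatorResult[] }

function runMultiturnSimulation(opts: SimulationOptions): SimulationResult {
  const { app, user, stoppingCondition } = opts;
  // Assumed default: the script length for static users, five turns otherwise.
  const maxTurns = opts.maxTurns ?? (Array.isArray(user) ? user.length : 5);
  const trajectory: Message[] = [];

  for (let turn = 0; turn < maxTurns; turn++) {
    let userMessage: Message;
    if (Array.isArray(user)) {
      if (turn >= user.length) break; // static script exhausted
      userMessage = { role: "user", content: user[turn] };
    } else {
      userMessage = user(trajectory.slice()); // dynamic user sees the history so far
    }
    trajectory.push(userMessage);
    trajectory.push(app(userMessage));
    if (stoppingCondition !== undefined && stoppingCondition(trajectory, turn + 1)) break;
  }

  // Evaluators run once, after the conversation has ended.
  const evaluator_results = (opts.trajectoryEvaluators ?? []).map((evaluate) =>
    evaluate({ outputs: trajectory, reference_outputs: opts.referenceOutputs }),
  );
  return { trajectory, evaluator_results };
}
```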
|
||||
#### Scenario: Async simulation is supported
|
||||
|
||||
- **WHEN** `run_multiturn_simulation_async()` is called with async `app` and `user` functions
|
||||
- **THEN** the simulation SHALL await each turn and return the same result structure
|
||||
@@ -0,0 +1,49 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Pregel execution engine
|
||||
|
||||
The system SHALL implement a Pregel-style superstep execution engine where:
|
||||
|
||||
- Each "superstep" executes all ready nodes concurrently
|
||||
- Nodes communicate through typed channels (not direct function calls)
|
||||
- Channel writes from one superstep are visible as reads in the next
|
||||
- The engine supports `PULL` (edge-triggered) and `PUSH` (dynamic Send) task scheduling
|
||||
|
||||
#### Scenario: Nodes execute in dependency order
|
||||
|
||||
- **WHEN** node B subscribes to channel A
|
||||
- **THEN** node B SHALL execute in the superstep after node A writes to channel A
|
||||
|
||||
#### Scenario: Concurrent nodes run in parallel
|
||||
|
||||
- **WHEN** two nodes have no dependencies between them
|
||||
- **THEN** they SHALL execute concurrently within the same superstep
|
||||
|
||||
#### Scenario: Dynamic Send spawns new node executions
|
||||
|
||||
- **WHEN** a node calls `send("node_c", { ... })` via `Command`
|
||||
- **THEN** `node_c` SHALL be scheduled for execution in the current or next superstep
|
||||
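The superstep model can be reduced to a short loop. The sketch below is deliberately simplified and makes several assumptions: nodes within a superstep run one after another (they all read the same snapshot, so the result matches a concurrent run), conflicting writes are resolved last-write-wins instead of through typed reducers, dynamic `Send` is omitted, and the default step cap of 25 is arbitrary. `runSupersteps` and `NodeSpec` are hypothetical names.

```typescript
type Channels = Record<string, unknown>;

interface NodeSpec {
  subscribe: string[]; // channels whose updates trigger this node
  run: (channels: Readonly<Channels>) => Channels; // returns channel writes
}

class GraphRecursionError extends Error {
  constructor(limit: number) {
    super(`Recursion limit of ${limit} supersteps reached`);
    this.name = "GraphRecursionError";
    // Keeps instanceof working if the code is compiled down to ES5.
    Object.setPrototypeOf(this, GraphRecursionError.prototype);
  }
}

function runSupersteps(
  nodes: Record<string, NodeSpec>,
  input: Channels,
  recursionLimit = 25,
): Channels {
  let channels: Channels = { ...input };
  let updated = new Set(Object.keys(input));

  for (let step = 0; ; step++) {
    // A node is ready when any channel it subscribes to was written in the previous step.
    const ready = Object.values(nodes).filter((node) =>
      node.subscribe.some((name) => updated.has(name)),
    );
    if (ready.length === 0) return channels;
    if (step >= recursionLimit) throw new GraphRecursionError(recursionLimit);

    // Every ready node reads the same snapshot; writes are buffered and
    // become visible only in the next superstep.
    const writes: Channels = {};
    for (const node of ready) {
      Object.assign(writes, node.run(channels));
    }
    channels = { ...channels, ...writes };
    updated = new Set(Object.keys(writes));
  }
}
```

Buffering writes until the end of the step is what gives the "writes from one superstep are visible as reads in the next" guarantee; swapping the inner `for` loop for `Promise.all` over async nodes would add real concurrency without changing that contract.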
|
||||
### Requirement: Graph compilation
|
||||
|
||||
The system SHALL provide `graph.compile()` that produces a runnable compiled graph.
|
||||
|
||||
Parameters:
|
||||
- `checkpointer?: Checkpointer` — optional persistence
|
||||
- `interruptBefore?: string[]` — nodes to pause before
|
||||
- `interruptAfter?: string[]` — nodes to pause after
|
||||
- `name?: string` — graph name
|
||||
|
||||
#### Scenario: Compiled graph can be invoked
|
||||
|
||||
- **WHEN** `compiled_graph.invoke({ messages: [] })` is called
|
||||
- **THEN** it SHALL execute all nodes and return the final state
|
||||
|
||||
### Requirement: Recursion limit
|
||||
|
||||
The system SHALL enforce a configurable recursion limit to prevent infinite loops.
|
||||
|
||||
#### Scenario: Exceeding recursion limit throws
|
||||
|
||||
- **WHEN** a graph exceeds the recursion limit
|
||||
- **THEN** a `GraphRecursionError` SHALL be thrown

@@ -0,0 +1,61 @@
## ADDED Requirements

### Requirement: Command execution (blocking)

The system SHALL provide `sandbox.runCommand(cmd, args?, opts?)` that executes a command inside the sandbox and waits for completion.

Parameters:
- `cmd: string` — command to execute
- `args?: string[]` — command arguments
- `cwd?: string` — working directory
- `env?: Record<string, string>` — per-command environment variables
- `sudo?: boolean` — execute with root privileges
- `timeoutMs?: number` — max execution time (SIGKILL on expiry)
- `signal?: AbortSignal` — cancellation

#### Scenario: Blocking runCommand returns finished result with exit code

- **WHEN** `sandbox.runCommand("echo", ["hello"])` is called
- **THEN** it SHALL return a `CommandFinished` instance with `exitCode: 0`

#### Scenario: Command timeout kills process

- **WHEN** `sandbox.runCommand("sleep", ["100"], { timeoutMs: 100 })` is executed
- **THEN** it SHALL return a non-zero exit code after ~100ms

#### Scenario: Stderr is captured separately

- **WHEN** a command writes to both stdout and stderr
- **THEN** `result.stdout()` and `result.stderr()` SHALL return their respective streams

### Requirement: Detached command execution

The system SHALL support `{ detached: true }` mode where `runCommand()` returns immediately with a live `Command` handle.

#### Scenario: Detached command returns before completion

- **WHEN** `sandbox.runCommand({ cmd: "sleep", args: ["5"], detached: true })` is called
- **THEN** it SHALL return a `Command` instance immediately (before the process exits)

#### Scenario: Detached command can be waited on

- **WHEN** `command.wait()` is called on a detached command
- **THEN** it SHALL return a `CommandFinished` when the process exits

### Requirement: Command log streaming

The system SHALL provide `command.logs()` as an async iterable of stdout/stderr log lines.

#### Scenario: Logs stream output lines

- **WHEN** `for await (const log of command.logs())` is iterated
- **THEN** each `log` SHALL have `stream: "stdout" | "stderr"` and `data: string`

### Requirement: Command kill

The system SHALL provide `command.kill(signal?)` to send a POSIX signal to a running command.

#### Scenario: Default kill sends SIGTERM

- **WHEN** `command.kill()` is called without a signal
- **THEN** SIGTERM SHALL be sent to the process

@@ -0,0 +1,50 @@
## ADDED Requirements

### Requirement: Filesystem API matching node:fs/promises

The system SHALL provide `sandbox.fs` implementing the Node.js `fs/promises` API:

- `readFile(path, encoding?)` → `Buffer | string`
- `writeFile(path, data)` → `void`
- `appendFile(path, data)` → `void`
- `mkdir(path, { recursive? })` → `void`
- `readdir(path, { withFileTypes? })` → `string[] | Dirent[]`
- `stat(path)` / `lstat(path)` → `Stats`
- `unlink(path)`, `rm(path, { recursive?, force? })`, `rmdir(path)` → `void`
- `rename(oldPath, newPath)` → `void`
- `copyFile(src, dest)` → `void`
- `chmod(path, mode)`, `chown(path, uid, gid)` → `void`
- `symlink(target, path)`, `truncate(path, len?)` → `void`
- `readlink(path)`, `realpath(path)` → `string`
- `mkdtemp(prefix)` → `string`
- `access(path)` → `void` (rejects when the path is not accessible)
- `exists(path)` → `boolean` (convenience extension; `node:fs/promises` has no `exists`)

#### Scenario: ReadFile returns correct content

- **WHEN** `sandbox.fs.readFile("/etc/hostname", "utf8")` is called
- **THEN** it SHALL return the file content as a string

#### Scenario: WriteFile creates new file

- **WHEN** `sandbox.fs.writeFile("/tmp/test.txt", "hello")` is called
- **THEN** subsequent `sandbox.fs.readFile("/tmp/test.txt", "utf8")` SHALL return `"hello"`

#### Scenario: Readdir lists directory contents

- **WHEN** `sandbox.fs.readdir("/")` is called
- **THEN** it SHALL return an array of filenames

#### Scenario: Stat returns file metadata

- **WHEN** `sandbox.fs.stat("/etc/hostname")` is called
- **THEN** it SHALL return a `Stats`-compatible object with `size`, `isFile()`, `isDirectory()`, `mode`, `uid`, `gid`, `mtime`, etc.

#### Scenario: Mkdir creates intermediate directories

- **WHEN** `sandbox.fs.mkdir("/tmp/a/b/c", { recursive: true })` is called
- **THEN** the directory `/tmp/a/b/c` SHALL exist

#### Scenario: Exists returns false for missing files

- **WHEN** `sandbox.fs.exists("/nonexistent")` is called
- **THEN** it SHALL return `false`

@@ -0,0 +1,70 @@
## ADDED Requirements

### Requirement: Sandbox creation

The system SHALL provide a `Sandbox.create()` static method that provisions a new isolated compute environment.

Parameters:
- `name?: string` — optional human-readable name
- `source?: { type: "git" | "tarball" | "snapshot" }` — source for initial filesystem
- `ports?: number[]` — ports to expose (max 4)
- `timeout?: number` — auto-terminate timeout in ms
- `resources?: { vcpus: number }` — CPU allocation (2048 MB RAM per vCPU)
- `runtime?: string` — runtime identifier
- `networkPolicy?: NetworkPolicy` — network restrictions
- `env?: Record<string, string>` — default environment variables
- `tags?: Record<string, string>` — metadata tags (max 5)
- `persistent?: boolean` — persistent filesystem across sessions
- `signal?: AbortSignal` — cancellation support

#### Scenario: Create returns a running Sandbox instance

- **WHEN** `Sandbox.create()` is called with valid parameters
- **THEN** it SHALL return a `Sandbox` instance with a running session

#### Scenario: Create supports AsyncDisposable

- **WHEN** `Sandbox.create()` is used with `await using`
- **THEN** the sandbox SHALL be automatically stopped when the scope exits

#### Scenario: Source specifies initial filesystem content

- **WHEN** `source: { type: "git", url: "..." }` is provided
- **THEN** the sandbox SHALL clone the git repository on creation

### Requirement: Sandbox retrieval

The system SHALL provide `Sandbox.get()` to retrieve an existing sandbox and `Sandbox.getOrCreate()` for idempotent get-or-create.

#### Scenario: Get retrieves existing sandbox

- **WHEN** `Sandbox.get({ name: "my-sandbox" })` is called for an existing sandbox
- **THEN** it SHALL return the sandbox with its session resumed

#### Scenario: GetOrCreate creates when not found

- **WHEN** `Sandbox.getOrCreate({ name: "new-sandbox", onCreate: ... })` is called and the sandbox does not exist
- **THEN** it SHALL create a new sandbox and call `onCreate` once

### Requirement: Sandbox forking

The system SHALL provide `Sandbox.fork()` to create a new sandbox from an existing one's current filesystem state.

#### Scenario: Fork preserves filesystem state

- **WHEN** `Sandbox.fork({ sourceSandbox: "original" })` is called
- **THEN** the new sandbox SHALL start with the filesystem state of the source sandbox

### Requirement: Sandbox update and delete

The system SHALL support `sandbox.update()` for configuration changes and `sandbox.delete()` for removal.

#### Scenario: Update changes sandbox config

- **WHEN** `sandbox.update({ timeout: 300000 })` is called
- **THEN** the sandbox's timeout SHALL be updated for subsequent sessions

#### Scenario: Delete removes the sandbox

- **WHEN** `sandbox.delete()` is called
- **THEN** the sandbox SHALL be permanently removed

@@ -0,0 +1,52 @@
## ADDED Requirements

### Requirement: Network policy type

The system SHALL define a `NetworkPolicy` type with three forms:

- `"allow-all"` — full internet access (default)
- `"deny-all"` — no external access
- `{ allow?: string[] | Record<string, NetworkPolicyRule[]>; subnets?: { allow?: string[]; deny?: string[] } }` — custom rules

#### Scenario: Allow-all permits all traffic

- **WHEN** `networkPolicy: "allow-all"` is set
- **THEN** all outbound traffic SHALL be permitted

#### Scenario: Deny-all blocks all traffic

- **WHEN** `networkPolicy: "deny-all"` is set
- **THEN** all outbound traffic SHALL be denied

#### Scenario: Domain allowlist restricts access

- **WHEN** `networkPolicy: { allow: ["*.npmjs.org"] }` is set
- **THEN** traffic to `registry.npmjs.org` SHALL be allowed and all other traffic SHALL be denied

#### Scenario: Wildcard domains match subdomains

- **WHEN** a domain pattern starts with `*.` (e.g., `*.example.com`)
- **THEN** it SHALL match any subdomain of that domain
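
The allowlist and wildcard scenarios above could be checked with a small matcher like the following. The helper names `matchesDomainPattern` and `isHostAllowed` are hypothetical, and the spec does not say whether `*.example.com` also matches the apex `example.com`; this sketch assumes it does not.

```typescript
// Hypothetical helper: the spec does not define the matching algorithm.
function matchesDomainPattern(pattern: string, host: string): boolean {
  const p = pattern.toLowerCase();
  const h = host.toLowerCase();
  if (p.slice(0, 2) === "*.") {
    const suffix = p.slice(1); // e.g. ".example.com"
    // Assumption: the apex domain itself ("example.com") is not a subdomain.
    // Requiring the leading dot also rejects look-alikes such as "evilexample.com".
    return h.length > suffix.length && h.slice(-suffix.length) === suffix;
  }
  return h === p;
}

// A host is allowed when at least one pattern in the allowlist matches it.
function isHostAllowed(allow: string[], host: string): boolean {
  return allow.some((pattern) => matchesDomainPattern(pattern, host));
}
```

With `allow: ["*.npmjs.org"]`, `isHostAllowed` accepts `registry.npmjs.org` and rejects every host outside that domain, which mirrors the domain allowlist scenario.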

### Requirement: Network policy rules with transformers

The system SHALL support per-domain rules with request transformers for header injection.

Parameters per rule:
- `match?: { path?, method?, queryString?, headers? }` — request matchers
- `transform?: { headers: Record<string, string> }[]` — header injection
- `forwardURL?: string` — HTTPS proxy forwarding

#### Scenario: Header transform injects authorization

- **WHEN** a request matches a rule with `transform: [{ headers: { authorization: "Bearer token" } }]`
- **THEN** the `authorization` header SHALL be injected before forwarding
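
As an illustration of the header-injection scenario, the sketch below applies a rule's `transform` list to a request's headers. `applyTransforms` and `HeaderTransform` are hypothetical names, and the override order (later transforms win, and injected headers replace existing ones) is an assumption that the spec does not state.

```typescript
// Hypothetical rule fragment, reduced to the `transform` field from the spec.
type HeaderTransform = { headers: Record<string, string> };

function applyTransforms(
  requestHeaders: Record<string, string>,
  transforms: HeaderTransform[],
): Record<string, string> {
  // HTTP header names are case-insensitive, so normalize them to lowercase.
  const result: Record<string, string> = {};
  for (const headerName of Object.keys(requestHeaders)) {
    result[headerName.toLowerCase()] = requestHeaders[headerName];
  }
  // Assumption: later transforms override earlier ones and existing headers.
  for (const transform of transforms) {
    for (const headerName of Object.keys(transform.headers)) {
      result[headerName.toLowerCase()] = transform.headers[headerName];
    }
  }
  return result;
}
```

The input object is left untouched, so the original request headers remain available to the caller for logging or comparison.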

### Requirement: Subnet filtering

The system SHALL support subnet-level access control via CIDR notation.

#### Scenario: Subnet allow takes precedence over domain deny

- **WHEN** `subnets: { allow: ["10.0.0.0/8"] }` is set
- **THEN** traffic to `10.0.0.1` SHALL be allowed regardless of domain rules
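
A CIDR membership test is the core of subnet filtering. The following is a minimal IPv4-only sketch; the helper names `ipv4ToInt` and `ipv4InCidr` are hypothetical, and IPv6 support is left out.

```typescript
// Hypothetical helpers; IPv4 only.
function ipv4ToInt(ip: string): number {
  const octets = ip.split(".").map(Number);
  const valid =
    octets.length === 4 && octets.every((o) => o >= 0 && o <= 255 && o % 1 === 0);
  if (!valid) throw new Error(`Invalid IPv4 address: ${ip}`);
  // ">>> 0" keeps the result an unsigned 32-bit integer.
  return ((octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]) >>> 0;
}

function ipv4InCidr(ip: string, cidr: string): boolean {
  const [base, bits] = cidr.split("/");
  const prefix = Number(bits);
  if (!(prefix >= 0 && prefix <= 32) || prefix % 1 !== 0) {
    throw new Error(`Invalid CIDR block: ${cidr}`);
  }
  // JavaScript shift counts wrap at 32, so a /0 prefix needs an explicit zero mask.
  const mask = prefix === 0 ? 0 : (~0 << (32 - prefix)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}
```

For the scenario above, `ipv4InCidr("10.0.0.1", "10.0.0.0/8")` is `true`, so a policy evaluator could consult the subnet allowlist before any domain rule.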

@@ -0,0 +1,59 @@
## ADDED Requirements

### Requirement: Snapshot creation

The system SHALL provide `sandbox.snapshot()` to create a point-in-time filesystem snapshot.

Parameters:
- `expiration?: number` — TTL in milliseconds (0 for no expiration)

#### Scenario: Snapshot stops the session and returns Snapshot instance

- **WHEN** `sandbox.snapshot()` is called on a running sandbox
- **THEN** the current session SHALL be stopped and a `Snapshot` SHALL be returned

### Requirement: Snapshot retrieval and listing

The system SHALL provide `Snapshot.get()`, `Snapshot.list()`, and `Snapshot.tree()` for managing snapshots.

#### Scenario: Retrieve snapshot by ID

- **WHEN** `Snapshot.get({ snapshotId: "snap_abc" })` is called
- **THEN** it SHALL return the snapshot with matching ID

#### Scenario: List snapshots with pagination

- **WHEN** `Snapshot.list({ name: "my-sandbox" })` is called
- **THEN** it SHALL return a paginated list of snapshots for that sandbox

#### Scenario: Ancestry tree is accessible

- **WHEN** `Snapshot.tree({ snapshotId: "snap_abc" })` is called
- **THEN** it SHALL return the ancestry tree of the snapshot

### Requirement: Snapshot deletion

The system SHALL provide `snapshot.delete()` to remove a snapshot.

#### Scenario: Deleted snapshot is no longer listable

- **WHEN** `snapshot.delete()` is called and then `Snapshot.list()` is called
- **THEN** the deleted snapshot SHALL no longer appear in the list

### Requirement: Snapshot-based sandbox creation

The system SHALL support creating sandboxes from snapshots via `Sandbox.create({ source: { type: "snapshot", snapshotId } })`.

#### Scenario: Sandbox created from snapshot has matching filesystem

- **WHEN** a file is written in a sandbox, a snapshot is taken with `sandbox.snapshot()`, and another sandbox is created with that snapshot as its source
- **THEN** the second sandbox SHALL contain the file from the first

### Requirement: Snapshot retention

The system SHALL support `keepLastSnapshots` retention policy on sandboxes.

#### Scenario: Retention evicts oldest snapshots

- **WHEN** a sandbox has `keepLastSnapshots: { count: 3 }` and a 4th snapshot is created
- **THEN** the oldest snapshot SHALL be evicted
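
The retention scenario can be expressed as a pure function over snapshot metadata. This is a sketch under assumptions: `SnapshotInfo`, its `createdAt` field, and `applyRetention` are hypothetical, and "oldest" is taken to mean the smallest creation timestamp.

```typescript
// Hypothetical snapshot record; only the fields needed for retention.
type SnapshotInfo = { id: string; createdAt: number };

function applyRetention(
  snapshots: SnapshotInfo[],
  count: number,
): { kept: SnapshotInfo[]; evicted: SnapshotInfo[] } {
  const limit = Math.max(0, Math.floor(count));
  // Sort a copy newest-first so the caller's array is not mutated.
  const newestFirst = snapshots.slice().sort((a, b) => b.createdAt - a.createdAt);
  return {
    kept: newestFirst.slice(0, limit),
    evicted: newestFirst.slice(limit),
  };
}
```

Running this after each `sandbox.snapshot()` call with `count: 3` would keep the three most recent snapshots and return the fourth-oldest onward for deletion.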

@@ -0,0 +1,43 @@
## ADDED Requirements

### Requirement: State definition via annotations

The system SHALL provide an `Annotation` API for defining graph state schemas:

- `Annotation<T>(reducer?)` — creates a state key with optional reducer
- `Annotation.Root({ key: Annotation<T> })` — combines keys into a state schema
- Reducers: `LastValue` (default — overwrite), `BinaryOperator` (custom merge function)

#### Scenario: Annotation.Root defines typed state

- **WHEN** `const State = Annotation.Root({ messages: Annotation<string[]>(addMessages), step: Annotation<number>() })` is defined
- **THEN** `State` SHALL have `State`, `Update`, and `Node` type members

#### Scenario: LastValue rejects multiple writes in the same step

- **WHEN** a node writes `{ step: 2 }` and then `{ step: 3 }` in the same step
- **THEN** the LastValue channel SHALL throw an `InvalidUpdateError`

#### Scenario: BinaryOperator reducer accumulates

- **WHEN** a node returns `{ messages: ["hello"] }` and another returns `{ messages: ["world"] }` with an `addMessages` reducer
- **THEN** the final state SHALL contain `messages: ["hello", "world"]`
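
The two reducer behaviours above can be sketched as channel classes that receive all writes from one superstep at once. The `Channel` interface and the class names `LastValueChannel` and `BinaryOperatorChannel` are hypothetical; only `InvalidUpdateError` and the reducer semantics come from the scenarios.

```typescript
class InvalidUpdateError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "InvalidUpdateError";
  }
}

// Hypothetical channel interface: `update` receives every write from one superstep.
interface Channel<T> {
  update(writes: T[]): void;
  get(): T | undefined;
}

// Overwrites the stored value, but refuses ambiguous concurrent writes.
class LastValueChannel<T> implements Channel<T> {
  private value: T | undefined;

  update(writes: T[]): void {
    if (writes.length === 0) return;
    if (writes.length > 1) {
      throw new InvalidUpdateError("LastValue can receive only one value per step");
    }
    this.value = writes[0];
  }

  get(): T | undefined {
    return this.value;
  }
}

// Folds every write into the stored value with a caller-supplied merge function.
class BinaryOperatorChannel<T> implements Channel<T> {
  constructor(
    private reducer: (current: T, next: T) => T,
    private value: T,
  ) {}

  update(writes: T[]): void {
    for (const write of writes) this.value = this.reducer(this.value, write);
  }

  get(): T {
    return this.value;
  }
}
```

A `BinaryOperatorChannel<string[]>` built with a concatenating reducer turns the writes `["hello"]` and `["world"]` into `["hello", "world"]`, while a `LastValueChannel` given two writes in one step throws.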

### Requirement: StateGraph builder

The system SHALL provide a `StateGraph` class for constructing stateful agent graphs.

#### Scenario: StateGraph is constructed with state schema

- **WHEN** `new StateGraph({ stateSchema: State })` is called
- **THEN** the graph SHALL accept nodes that receive and can update the defined state

#### Scenario: Nodes can read and write state

- **WHEN** a node function receives state with `{ messages, step }` and returns `{ step: step + 1 }`
- **THEN** the graph SHALL update `step` and preserve `messages`

#### Scenario: Conditional edges route based on state

- **WHEN** `addConditionalEdges("node_a", (state) => state.step > 5 ? "end" : "node_b")` is added
- **THEN** execution SHALL route based on the state value at runtime

@@ -0,0 +1,51 @@
## ADDED Requirements

### Requirement: Trajectory match evaluator

The system SHALL provide `create_trajectory_match_evaluator()` that compares agent tool-call trajectories against reference trajectories.

Parameters:
- `trajectory_match_mode: "strict" | "unordered" | "subset" | "superset"` — matching strategy
- `tool_args_match_mode: "exact" | "ignore" | "subset" | "superset"` — tool argument comparison
- `tool_args_match_overrides?: Record<string, ToolArgsMatchMode | Callable | string[]>` — per-tool custom matching

#### Scenario: Strict mode requires exact order

- **WHEN** output trajectory has tool calls `[A, B]` and reference is `[A, B]`
- **THEN** strict mode SHALL return score `true`
- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
- **THEN** strict mode SHALL return score `false`

#### Scenario: Unordered mode ignores order

- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
- **THEN** unordered mode SHALL return score `true`

#### Scenario: Subset mode accepts partial trajectory

- **WHEN** output trajectory has tool calls `[A]` and reference is `[A, B]`
- **THEN** subset mode SHALL return score `true`

#### Scenario: Superset mode allows extra tool calls

- **WHEN** output trajectory has tool calls `[A, B, C]` and reference is `[A, B]`
- **THEN** superset mode SHALL return score `true`

#### Scenario: Tool args ignore mode skips argument comparison

- **WHEN** `tool_args_match_mode="ignore"` is set
- **THEN** tool calls match regardless of their arguments

#### Scenario: Custom tool arg matcher is used

- **WHEN** `tool_args_match_overrides` contains a `Callable` for a tool name
- **THEN** that callable SHALL be invoked to compare the tool's arguments
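
The four trajectory match modes can be summarized in a few lines when tool arguments are ignored. This sketch compares tool names only, which corresponds to `tool_args_match_mode: "ignore"`; the function names are hypothetical, and treating repeated calls as a multiset (so counts matter) is an assumption the spec does not spell out.

```typescript
type TrajectoryMatchMode = "strict" | "unordered" | "subset" | "superset";

// Count how many times each tool name appears in a trajectory.
function countTools(toolNames: string[]): Record<string, number> {
  // Object.create(null) avoids collisions with inherited keys such as "constructor".
  const counts: Record<string, number> = Object.create(null);
  for (const tool of toolNames) counts[tool] = (counts[tool] || 0) + 1;
  return counts;
}

// True when every call in `inner` also appears in `outer` at least as many times.
function isContained(inner: string[], outer: string[]): boolean {
  const innerCounts = countTools(inner);
  const outerCounts = countTools(outer);
  return Object.keys(innerCounts).every(
    (tool) => innerCounts[tool] <= (outerCounts[tool] || 0),
  );
}

function trajectoryMatches(
  mode: TrajectoryMatchMode,
  output: string[],
  reference: string[],
): boolean {
  switch (mode) {
    case "strict": // same calls in the same order
      return (
        output.length === reference.length &&
        output.every((tool, i) => tool === reference[i])
      );
    case "unordered": // same calls in any order
      return isContained(output, reference) && isContained(reference, output);
    case "subset": // output makes no calls beyond the reference
      return isContained(output, reference);
    case "superset": // output makes at least the reference calls
      return isContained(reference, output);
  }
}
```

A full evaluator would additionally compare arguments per call according to `tool_args_match_mode` and any `tool_args_match_overrides`; that layer is omitted here.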

### Requirement: Trajectory LLM-as-judge

The system SHALL provide `create_trajectory_llm_as_judge()` that uses an LLM to grade trajectory quality and accuracy.

#### Scenario: Trajectory is formatted as XML for LLM

- **WHEN** an LLM trajectory evaluator is invoked
- **THEN** the trajectory SHALL be formatted as XML with `<role>`, `<tool_call>`, `<tool_result>` elements
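
One possible shape for that XML formatting is sketched below. Only the `<role>`, `<tool_call>`, and `<tool_result>` element names come from the scenario; the `TrajectoryMessage` type, the `<trajectory>`, `<message>`, and `<content>` wrappers, and the `name` attribute are assumptions made for illustration.

```typescript
// Hypothetical message shape, loosely modeled on OpenAI-style chat messages.
type ToolCall = { name: string; args: Record<string, unknown> };
type TrajectoryMessage =
  | { role: "user" | "assistant"; content: string; toolCalls?: ToolCall[] }
  | { role: "tool"; content: string };

// Escape the characters that are unsafe in XML text and attribute values.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function formatTrajectoryAsXml(messages: TrajectoryMessage[]): string {
  const lines: string[] = ["<trajectory>"];
  for (const message of messages) {
    if (message.role === "tool") {
      lines.push(`  <tool_result>${escapeXml(message.content)}</tool_result>`);
      continue;
    }
    lines.push("  <message>");
    lines.push(`    <role>${message.role}</role>`);
    if (message.content) {
      lines.push(`    <content>${escapeXml(message.content)}</content>`);
    }
    for (const call of message.toolCalls || []) {
      const args = escapeXml(JSON.stringify(call.args));
      lines.push(`    <tool_call name="${escapeXml(call.name)}">${args}</tool_call>`);
    }
    lines.push("  </message>");
  }
  lines.push("</trajectory>");
  return lines.join("\n");
}
```

Escaping message content before embedding it matters here because agent output frequently contains angle brackets, which would otherwise be indistinguishable from the structural tags the judge prompt relies on.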

@@ -0,0 +1,50 @@
## 1. Foundation: Core Types & Monorepo Setup ✅

- [x] 1.1 Initialize pnpm monorepo with turbo.json at root, configure `@agent-runtime/*` workspace packages
- [x] 1.2 Set up shared TypeScript config (strict mode, ESNext modules, path aliases)
- [x] 1.3 Implement `@agent-runtime/core` package: `EvaluatorResult`, `ScoreType`, `ModelClient` protocol, `Serializable` interface
- [x] 1.4 Implement `@agent-runtime/core` serialization protocol: `toJSON()`/`fromJSON()` pattern on stateful types
- [x] 1.5 Implement `@agent-runtime/core` error types: `EvalError`
- [x] 1.6 Implement `@agent-runtime/core` utility functions: message normalization, XML formatting, JSON schema construction

## 2. Eval: LLM-as-Judge Core

- [ ] 2.1 Implement `_construct_default_output_json_schema()` for continuous/binary/choices scoring with reasoning
- [ ] 2.2 Implement prompt formatting (string templates, attachments, system messages)
- [ ] 2.3 Implement `_append_few_shot_examples()` with XML `<example>` formatting
- [ ] 2.4 Implement `_create_llm_as_judge_scorer()` — core scorer with structured output via OpenAI JSON schema
- [ ] 2.5 Implement `create_llm_as_judge()` factory wrapping scorer into `_run_evaluator()`
- [ ] 2.6 Implement async variants: `create_async_llm_as_judge()`, `_create_async_llm_as_judge_scorer()`
- [ ] 2.7 Implement `_run_evaluator_untyped()` and `_process_score()` for result aggregation
- [ ] 2.8 Write unit tests for LLM-as-judge: string prompts, continuous scoring, choices, reasoning, few-shot

## 3. Eval: Trajectory Evaluators

- [ ] 3.1 Implement trajectory matching utilities: `_normalize_to_openai_messages_list()`, `_extract_tool_calls()`
- [ ] 3.2 Implement `_is_trajectory_superset()` core comparator with `_get_matcher_for_tool_name()` override system
- [ ] 3.3 Implement strict/unordered/subset/superset matching scorers
- [ ] 3.4 Implement `create_trajectory_match_evaluator()` with all 4 modes and `tool_args_match_overrides`
- [ ] 3.5 Write tests: all 4 match modes, tool args ignore, custom matchers

## 4. Eval: Code Correctness Evaluators

- [ ] 4.1 Implement code extraction: `_extract_code_from_markdown_code_blocks()` regex parser
- [ ] 4.2 Implement `_create_base_code_evaluator()` with pluggable extraction pipeline
- [ ] 4.3 Implement `create_code_llm_as_judge()` combining extraction + LLM scoring
- [ ] 4.4 Implement `create_pyright_evaluator()` with temp file execution and JSON output parsing
- [ ] 4.5 Write tests: markdown extraction, Pyright static analysis

## 5. Eval: Prompt Library

- [ ] 5.1 Export Quality prompt templates: correctness, conciseness, hallucination, answer_relevance, code_correctness, plan_adherence
- [ ] 5.2 Export Safety/Security prompt templates: toxicity, fairness, pii_leakage, prompt_injection
- [ ] 5.3 Export Trajectory prompt templates: trajectory_accuracy (with and without reference), tool_selection
- [ ] 5.4 Export Conversation prompt templates: user_satisfaction, task_completion, agent_tone

## 6. Documentation & Release

- [ ] 6.1 Write README with architecture overview and getting-started example
- [ ] 6.2 Document each package with tsdoc exports
- [ ] 6.3 Write usage examples: basic eval, code correctness check
- [ ] 6.4 Add CI pipeline: lint, type-check, test
- [ ] 6.5 Publish initial alpha for `@agent-runtime/eval` package