boocode/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/design.md at a2236e3c57a1e01fa65aeb38f6fe61b564021a15

indifferentketchup c935687725 chore(openspec): drop 9 superseded proposals + 11 stub archive files

Drop 9 batch proposals that are superseded by the boocode-lift-analysis
(boocontext-audit, conductor upgrades, self-healing/verify-gate skills):
add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform,
conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul,
agent-reliability.

Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only)
that provide zero documentation value over the existing CHANGELOG.md + git tags.

2026-06-07 22:15:38 +00:00

6.6 KiB

Context

This design defines a unified Agent Evaluation & Execution Runtime combining three subsystems inspired by OpenEvals, Vercel Sandbox, and langgraphjs. The system is a TypeScript monorepo with four packages:

@agent-runtime/core — Shared types, serialization protocol, provider abstraction
@agent-runtime/eval — LLM-as-judge, trajectory, code correctness, multi-turn sim, prompt library
@agent-runtime/sandbox — Remote sandbox lifecycle, command execution, filesystem, snapshots, network policy
@agent-runtime/graph — Stateful graph, Pregel execution, checkpoints, interrupts, streaming

Each package is independently usable but designed to compose: evals run code in sandboxes, sandbox lifecycles are orchestrated by graphs, and graph nodes can be evaluated by evals.

Goals / Non-Goals

Goals:

Zero required runtime dependencies for eval core (optional providers via adapter pattern)
Sandbox abstraction that works with any provider (Vercel, Fly, custom) via an APIClient interface
Graph execution with pluggable checkpointers (in-memory, SQLite, Redis, Postgres)
All three subsystems share a common serialization protocol for cross-persistence
Evaluation can target code running inside sandbox instances
Graph nodes can suspend/resume via interrupts with persistent checkpointing

Non-Goals:

Not a replacement for LangChain/LlamaIndex — no integrations with existing frameworks in v1
Not a general-purpose workflow engine — focused on agent/task orchestration patterns
No UI or dashboard in v1 — CLI and programmatic API only
No Python SDK in v1 — TypeScript-first, Python planned

Decisions

D1: Package Architecture — `core` + 3 domain packages

Rationale: Eval, Sandbox, and Graph have zero overlap in concerns but share types (serialization, error handling, config). A shared core avoids circular deps and keeps each package lightweight.
Alternatives considered: Monolithic single package — rejected because users may want only one subsystem.

D2: Eval Factory Pattern (from OpenEvals)

Rationale: OpenEvals' create_llm_as_judge(prompt, model, ...) returning a callable is elegant — the evaluator is a function, not a class. Users compose evaluators into test suites. This pattern is preserved exactly.
Deviation: Drop the LangChain dependency. Instead of BaseChatModel, use a minimal ModelClient protocol modeled on the one OpenEvals defines. Users pass an OpenAI-compatible client or a custom adapter.
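
The factory pattern described in D2 could look roughly like the sketch below. All names here (ModelClient, JudgeResult, createLlmAsJudge) and the assumption that the model replies with a small JSON object are illustrative, not taken from OpenEvals or from this project's code.

```typescript
// Hypothetical names throughout; this is a sketch, not the project's API.

/** Minimal model protocol: one method that maps a prompt to raw text. */
interface ModelClient {
  complete(prompt: string): Promise<string>;
}

interface JudgeInput {
  inputs: string;
  outputs: string;
}

interface JudgeResult {
  key: string;
  score: boolean;
  comment: string;
}

type Evaluator = (input: JudgeInput) => Promise<JudgeResult>;

/** Factory: returns a plain async function rather than a class instance. */
function createLlmAsJudge(opts: {
  prompt: string; // template with {inputs} and {outputs} placeholders
  model: ModelClient;
  feedbackKey?: string;
}): Evaluator {
  const key = opts.feedbackKey ?? "score";
  return async ({ inputs, outputs }) => {
    // Function replacers avoid interpreting "$&"-style patterns in user text.
    const filled = opts.prompt
      .replace("{inputs}", () => inputs)
      .replace("{outputs}", () => outputs);
    const raw = await opts.model.complete(filled);
    // Assumes the model replies with JSON like {"score": true, "comment": "..."}.
    const parsed = JSON.parse(raw) as { score?: unknown; comment?: unknown };
    return {
      key,
      score: parsed.score === true,
      comment: typeof parsed.comment === "string" ? parsed.comment : "",
    };
  };
}

// A stub client makes the evaluator testable without network access.
const stubModel: ModelClient = {
  complete: async (prompt) =>
    JSON.stringify({ score: prompt.includes("4"), comment: "stubbed verdict" }),
};

const correctness = createLlmAsJudge({
  prompt: "Question: {inputs}\nAnswer: {outputs}\nIs the answer correct?",
  model: stubModel,
  feedbackKey: "correctness",
});
```

Because the evaluator is just a function, a test suite can hold an array of evaluators and call each one with the same input.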

D3: Sandbox as API Wrapper (from Vercel Sandbox)

Rationale: Vercel Sandbox's Sandbox class cleanly separates the Sandbox (persistent config) from the Session (running VM). Sandbox.create() starts a VM, sandbox.runCommand() executes a command, and sandbox.fs exposes the filesystem. This maps naturally onto any provider built on Firecracker or Kata Containers.
Deviation: Abstract APIClient behind SandboxProvider interface so multiple backends can be plugged in. The "use step" Vercel compiler directive is replaced with explicit serialization methods.
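
As a rough illustration of the SandboxProvider abstraction in D3, the sketch below defines a minimal interface and an in-memory fake backend. The method names and config fields are assumptions made for this example; they are not the Vercel Sandbox SDK's signatures.

```typescript
// Hypothetical shapes; none of these signatures come from a real SDK.
interface CommandResult {
  exitCode: number;
  stdout: string;
  stderr: string;
}

interface SandboxHandle {
  id: string;
  runCommand(cmd: string, args?: string[]): Promise<CommandResult>;
  stop(): Promise<void>;
}

interface SandboxConfig {
  runtime: string;
  timeoutMs?: number;
}

interface SandboxProvider {
  create(config: SandboxConfig): Promise<SandboxHandle>;
  get(id: string): Promise<SandboxHandle | undefined>;
}

/** In-memory fake provider for unit tests; it only understands `echo`. */
class FakeProvider implements SandboxProvider {
  private sandboxes = new Map<string, SandboxHandle>();
  private nextId = 1;

  async create(_config: SandboxConfig): Promise<SandboxHandle> {
    const id = `sbx_${this.nextId++}`;
    const sandboxes = this.sandboxes;
    const handle: SandboxHandle = {
      id,
      async runCommand(cmd: string, args: string[] = []) {
        if (cmd === "echo") {
          return { exitCode: 0, stdout: args.join(" ") + "\n", stderr: "" };
        }
        return { exitCode: 127, stdout: "", stderr: `${cmd}: command not found\n` };
      },
      async stop() {
        sandboxes.delete(id);
      },
    };
    sandboxes.set(id, handle);
    return handle;
  }

  async get(id: string): Promise<SandboxHandle | undefined> {
    return this.sandboxes.get(id);
  }
}

/** Code written against SandboxProvider does not care which backend is used. */
async function smokeTest(provider: SandboxProvider): Promise<string> {
  const sandbox = await provider.create({ runtime: "node22" });
  try {
    const result = await sandbox.runCommand("echo", ["hello"]);
    return result.stdout.trim();
  } finally {
    await sandbox.stop(); // always release the sandbox, even on failure
  }
}
```

A real backend would implement the same two interfaces on top of its HTTP API, and callers such as smokeTest would stay unchanged.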

D4: Graph as Pregel + Checkpointer (from langgraphjs)

Rationale: The superstep-based Pregel engine with typed channels is a proven pattern for stateful agent graphs. Separating graph definition (StateGraph) from execution (the Pregel instance that compile() returns) is the right abstraction.
Deviation: Drop the @langchain/core/runnables dependency. Define Runnable as a minimal interface (invoke and stream only). Use native Promise concurrency instead of the LangChain callback system.
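
To make the superstep idea in D4 concrete, here is a heavily simplified sketch: last-value channels only, no checkpoints, no streaming. The GraphNode shape and runSupersteps are hypothetical and only illustrate the plan, execute, update cycle with Promise.all.

```typescript
// Hypothetical and simplified; not the langgraphjs implementation.
type State = Record<string, unknown>;

interface Runnable<I, O> {
  invoke(input: I): Promise<O>;
}

interface GraphNode {
  name: string;
  triggers: string[]; // channels whose update schedules this node
  run: Runnable<State, State>; // returns a partial state (channel writes)
}

async function runSupersteps(
  nodes: GraphNode[],
  initial: State,
  maxSteps = 25,
): Promise<State> {
  let state: State = { ...initial };
  let updated = new Set<string>(Object.keys(initial));

  for (let step = 0; step < maxSteps && updated.size > 0; step++) {
    // Plan: every node triggered by a channel written in the previous step.
    const active = nodes.filter((n) => n.triggers.some((t) => updated.has(t)));
    if (active.length === 0) break;

    // Execute: nodes in one superstep read the same snapshot, concurrently.
    const snapshot: State = { ...state };
    const writes = await Promise.all(active.map((n) => n.run.invoke(snapshot)));

    // Update: apply writes only after every node in the step has finished.
    updated = new Set<string>();
    for (const write of writes) {
      for (const [channel, value] of Object.entries(write)) {
        state = { ...state, [channel]: value };
        updated.add(channel);
      }
    }
  }
  return state;
}

const double: GraphNode = {
  name: "double",
  triggers: ["input"],
  run: { invoke: async (s) => ({ doubled: (s.input as number) * 2 }) },
};

const label: GraphNode = {
  name: "label",
  triggers: ["doubled"],
  run: { invoke: async (s) => ({ text: `result=${String(s.doubled)}` }) },
};
```

Running the two nodes with an initial input channel takes two supersteps: double fires first, and its write to doubled triggers label in the next step.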

D5: Interrupt/Resume via Checkpoint (from langgraphjs)

Rationale: interrupt() throwing a typed error that's caught by the execution loop, persisted to checkpoints, and resumed via Command({resume: ...}) is the cleanest HITL pattern.
Deviation: Simplify to a single GraphInterrupt error type. No scratchpad — just a sequential interrupt index stored in checkpoint metadata.
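
The single-error-type approach in D5 might be sketched as follows. The checkpoint shape, the runNode and resume helpers, and the way resume values are matched by sequential index are assumptions for illustration; on resume the node is simply re-executed and earlier interrupts are answered from the checkpoint.

```typescript
// Hypothetical sketch; GraphInterrupt's fields and the checkpoint shape are assumptions.
class GraphInterrupt extends Error {
  readonly index: number;
  readonly payload: unknown;

  constructor(index: number, payload: unknown) {
    super(`interrupt #${index}`);
    this.name = "GraphInterrupt";
    this.index = index;
    this.payload = payload;
  }
}

interface Checkpoint {
  resumeValues: unknown[]; // answers supplied so far, in interrupt order
}

type InterruptFn = (payload: unknown) => unknown;
type NodeFn = (interrupt: InterruptFn) => string;

type RunResult =
  | { status: "done"; output: string }
  | { status: "interrupted"; payload: unknown; checkpoint: Checkpoint };

function runNode(node: NodeFn, checkpoint: Checkpoint): RunResult {
  let counter = 0; // sequential interrupt index within this execution
  const interrupt: InterruptFn = (payload) => {
    const index = counter++;
    if (index < checkpoint.resumeValues.length) {
      return checkpoint.resumeValues[index]; // already answered: replay it
    }
    throw new GraphInterrupt(index, payload); // unanswered: suspend here
  };
  try {
    return { status: "done", output: node(interrupt) };
  } catch (err) {
    if (err instanceof GraphInterrupt) {
      return { status: "interrupted", payload: err.payload, checkpoint };
    }
    throw err; // unrelated errors propagate unchanged
  }
}

/** Re-runs the node with one more resume value appended to the checkpoint. */
function resume(node: NodeFn, checkpoint: Checkpoint, value: unknown): RunResult {
  return runNode(node, { resumeValues: [...checkpoint.resumeValues, value] });
}

const approvalNode: NodeFn = (interrupt) => {
  const approved = interrupt({ question: "Deploy to production?" });
  return approved === true ? "deployed" : "skipped";
};
```

One consequence of this design is that code before an interrupt() call runs again on resume, so nodes should keep side effects after the interrupt or make them idempotent.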

D6: Serialization Protocol

Rationale: Vercel Sandbox's WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE pattern enables cross-session persistence. We adopt a toJSON() instance method and a static fromJSON() method on all stateful types.
Channels → serialized as plain objects.
Checkpoints → serialized as versioned JSON with hash verification.
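
A versioned envelope with hash verification, as D6 describes for checkpoints, could be sketched like this. The envelope fields and the CheckpointState class are hypothetical, and the FNV-1a checksum is only a dependency-free stand-in for whatever digest the real implementation would choose (for example SHA-256).

```typescript
// Hypothetical envelope format; FNV-1a stands in for a real digest.
interface CheckpointEnvelope {
  v: number; // format version
  hash: string; // checksum of `data`
  data: string; // JSON-encoded payload
}

/** 32-bit FNV-1a over UTF-16 code units, returned as 8 hex characters. */
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16).padStart(8, "0");
}

class CheckpointState {
  readonly step: number;
  readonly channels: Record<string, unknown>;

  constructor(step: number, channels: Record<string, unknown>) {
    this.step = step;
    this.channels = channels;
  }

  /** Called automatically by JSON.stringify. */
  toJSON(): CheckpointEnvelope {
    const data = JSON.stringify({ step: this.step, channels: this.channels });
    return { v: 1, hash: fnv1a(data), data };
  }

  static fromJSON(envelope: CheckpointEnvelope): CheckpointState {
    if (envelope.v !== 1) {
      throw new Error(`unsupported checkpoint version: ${String(envelope.v)}`);
    }
    if (fnv1a(envelope.data) !== envelope.hash) {
      throw new Error("checkpoint hash mismatch");
    }
    const parsed = JSON.parse(envelope.data) as {
      step: number;
      channels: Record<string, unknown>;
    };
    return new CheckpointState(parsed.step, parsed.channels);
  }
}
```

Hashing the already-encoded data string, rather than the object, sidesteps key-ordering differences between writers.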

D7: Filesystem API over Shell Commands (from Vercel Sandbox)

Rationale: Vercel's FileSystem class implements the full node:fs/promises API by running shell commands (stat, find, mkdir, etc.) inside the sandbox. This is pragmatic and avoids building a special FS protocol.
Limitation: Stat parsing from shell output is fragile. Mitigate with a structured output format (JSON + delimiter parsing).
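
The delimiter-based mitigation in D7 might look like the sketch below. It assumes GNU coreutils stat inside the sandbox (BSD and macOS stat use different flags), and the helper names and StatInfo shape are hypothetical. The file name is deliberately left out of the format string so that a name containing the delimiter cannot corrupt the parse.

```typescript
// Assumes GNU coreutils `stat`; helper names are hypothetical.
// Fields: size in bytes | mtime in epoch seconds | octal mode | file type
const STAT_FORMAT = "%s|%Y|%a|%F";

/** POSIX single-quote escaping for one shell argument. */
function shellQuote(arg: string): string {
  return "'" + arg.replace(/'/g, "'\\''") + "'";
}

function buildStatCommand(path: string): string {
  // `--` stops option parsing so paths starting with "-" are safe.
  return `stat -c ${shellQuote(STAT_FORMAT)} -- ${shellQuote(path)}`;
}

interface StatInfo {
  size: number;
  mtimeMs: number;
  mode: number;
  isFile: boolean;
  isDirectory: boolean;
}

function parseStat(stdout: string): StatInfo {
  const fields = stdout.trim().split("|");
  if (fields.length !== 4) {
    throw new Error(`unexpected stat output: ${JSON.stringify(stdout)}`);
  }
  const [size = "", mtime = "", mode = "", type = ""] = fields;
  const info: StatInfo = {
    size: Number.parseInt(size, 10),
    mtimeMs: Number.parseInt(mtime, 10) * 1000,
    mode: Number.parseInt(mode, 8), // "644" -> 0o644
    isFile: type.startsWith("regular"), // "regular file", "regular empty file"
    isDirectory: type === "directory",
  };
  if ([info.size, info.mtimeMs, info.mode].some((n) => Number.isNaN(n))) {
    throw new Error(`non-numeric field in stat output: ${JSON.stringify(stdout)}`);
  }
  return info;
}
```

Failing loudly on an unexpected field count or a non-numeric value keeps a provider with a different stat implementation from silently producing wrong metadata.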

D8: Network Policy as TypeScript Types (from Vercel Sandbox)

Rationale: The NetworkPolicy union type ("allow-all" | "deny-all" | { allow: ... }) maps directly to firewall rules. It's declarative, serializable, and provider-agnostic.
Extension: Add tls and rateLimit options beyond what Vercel provides.
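
A possible typing for the policy in D8 is sketched below. The string variants come from the text above; the field names inside the object form (allow as a host list, tls.minVersion, rateLimit.requestsPerSecond) and the isHostAllowed helper are assumptions for illustration.

```typescript
// Hypothetical shapes: the fields of the object form are assumptions.
type NetworkPolicy =
  | "allow-all"
  | "deny-all"
  | {
      allow: string[]; // hostnames; a leading "*." matches any subdomain
      tls?: { minVersion: "1.2" | "1.3" };
      rateLimit?: { requestsPerSecond: number };
    };

/** Decides whether an outbound connection to `host` is permitted. */
function isHostAllowed(policy: NetworkPolicy, host: string): boolean {
  if (policy === "allow-all") return true;
  if (policy === "deny-all") return false;
  return policy.allow.some((pattern) =>
    pattern.startsWith("*.")
      ? host.endsWith(pattern.slice(1)) // "*.npmjs.org" matches "registry.npmjs.org"
      : host === pattern,
  );
}

const ciPolicy: NetworkPolicy = {
  allow: ["github.com", "*.npmjs.org"],
  tls: { minVersion: "1.2" },
  rateLimit: { requestsPerSecond: 50 },
};
```

Since every variant is plain data, a policy survives JSON.stringify and JSON.parse unchanged, which is what makes it storable alongside the sandbox config.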

Risks / Trade-offs

[Risk] Provider coupling for sandbox: Abstracting SandboxProvider might leak provider-specific features. Mitigation: Define the interface minimally (CRUD + exec + fs); provider-specific features are accessed via a (sandbox as any) escape hatch.
[Risk] Pregel complexity: The superstep execution model is sophisticated (~2700 lines in langgraphjs). Mitigation: Start with sequential execution, add parallelism as optimization. The channel model stays from day one.
[Risk] Eval without LangChain: Dropping LangChain means reimplementing structured output parsing (with_structured_output). Mitigation: Target OpenAI-compatible APIs first (they support response_format: json_schema natively). Add a generic Zod/JSON Schema path for other providers.
[Trade-off] TypeScript-first: Python users of OpenEvals patterns won't get a direct migration path. Mitigation: The eval prompt templates are language-agnostic strings; the core logic is portable.
[Trade-off] Monorepo overhead: Four packages with shared config. Mitigation: Use minimal workspaces (pnpm/turbo), keep build config shared.
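
For the "Eval without LangChain" risk above, the OpenAI-compatible path might be sketched as follows. The response_format layout follows OpenAI's documented json_schema option; the helper names, the verdict schema, and the hand-written validation are hypothetical stand-ins for whatever the eval package ends up using.

```typescript
// Helper names and the verdict schema are hypothetical.
const verdictSchema = {
  type: "object",
  properties: {
    score: { type: "boolean" },
    reasoning: { type: "string" },
  },
  required: ["score", "reasoning"],
  additionalProperties: false,
} as const;

interface Verdict {
  score: boolean;
  reasoning: string;
}

/** Builds a chat-completions request body that asks for schema-conforming JSON. */
function buildJudgeRequest(model: string, prompt: string) {
  return {
    model,
    messages: [{ role: "user" as const, content: prompt }],
    response_format: {
      type: "json_schema" as const,
      json_schema: { name: "verdict", strict: true, schema: verdictSchema },
    },
  };
}

/** Validates the message content by hand, so no schema library is required. */
function parseVerdict(content: string): Verdict {
  const value: unknown = JSON.parse(content);
  if (typeof value !== "object" || value === null) {
    throw new Error("verdict must be a JSON object");
  }
  const { score, reasoning } = value as Record<string, unknown>;
  if (typeof score !== "boolean" || typeof reasoning !== "string") {
    throw new Error("verdict has missing or mistyped fields");
  }
  return { score, reasoning };
}
```

Validating on the client even when the provider enforces the schema keeps the same parseVerdict usable for providers that only return free-form JSON.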

Open Questions

Should the sandbox provider interface include a createCheckpoint/restoreCheckpoint for VM-level snapshots, or should that be graph-layer only?
What's the minimum Node.js version? Likely Node 20+, for AsyncDisposable support (used in the Sandbox lifecycle).
Should the eval prompt library ship as part of @agent-runtime/eval or as a separate @agent-runtime/prompts package?
How should eval results feed back into graph state? E.g., a "code correctness eval" runs inside a graph node, and the score influences routing.
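
One possible shape for the last question, offered as a sketch rather than a decision: eval scores are appended to graph state, and a routing function reads the latest one. The state fields, the threshold, and the route names are all hypothetical.

```typescript
// Hypothetical: state shape, threshold, and route names are illustrative only.
interface EvalScore {
  key: string;
  score: number; // normalized to the range 0..1
}

interface AgentState {
  code: string;
  attempts: number;
  evalScores: EvalScore[];
}

type Route = "finish" | "revise" | "give_up";

/** Conditional-edge style router driven by the most recent eval score. */
function routeAfterEval(
  state: AgentState,
  threshold = 0.8,
  maxAttempts = 3,
): Route {
  const latest = state.evalScores[state.evalScores.length - 1];
  if (latest !== undefined && latest.score >= threshold) return "finish";
  return state.attempts < maxAttempts ? "revise" : "give_up";
}
```

Keeping the router a pure function of state means the routing decision is reproducible from any checkpoint.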