boocode/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/design.md
indifferentketchup c935687725 chore(openspec): drop 9 superseded proposals + 11 stub archive files
2026-06-07 22:15:38 +00:00

Context

This design defines a unified Agent Evaluation & Execution Runtime combining three subsystems inspired by OpenEvals, Vercel Sandbox, and langgraphjs. The system is a TypeScript monorepo with four packages:

  • @agent-runtime/core — Shared types, serialization protocol, provider abstraction
  • @agent-runtime/eval — LLM-as-judge, trajectory, code correctness, multi-turn sim, prompt library
  • @agent-runtime/sandbox — Remote sandbox lifecycle, command execution, filesystem, snapshots, network policy
  • @agent-runtime/graph — Stateful graph, Pregel execution, checkpoints, interrupts, streaming

Each package is independently usable but designed to compose: evals run code in sandboxes, sandbox lifecycles are orchestrated by graphs, and graph nodes can be evaluated by evals.

Goals / Non-Goals

Goals:

  • Zero required runtime dependencies for eval core (optional providers via adapter pattern)
  • Sandbox abstraction that works with any provider (Vercel, Fly, custom) via a SandboxProvider interface that wraps the provider's APIClient
  • Graph execution with pluggable checkpointers (in-memory, SQLite, Redis, Postgres)
  • All three subsystems share a common serialization protocol for cross-session persistence
  • Evaluation can target code running inside sandbox instances
  • Graph nodes can suspend/resume via interrupts with persistent checkpointing

Non-Goals:

  • Not a replacement for LangChain/LlamaIndex — no integrations with existing frameworks in v1
  • Not a general-purpose workflow engine — focused on agent/task orchestration patterns
  • No UI or dashboard in v1 — CLI and programmatic API only
  • No Python SDK in v1 — TypeScript-first, Python planned

Decisions

D1: Package Architecture — core + 3 domain packages

  • Rationale: Eval, Sandbox, and Graph have zero overlap in concerns but share types (serialization, error handling, config). A shared core avoids circular deps and keeps each package lightweight.
  • Alternatives considered: Monolithic single package — rejected because users may want only one subsystem.

D2: Eval Factory Pattern (from OpenEvals)

  • Rationale: OpenEvals' create_llm_as_judge(prompt, model, ...) returning a callable is elegant — the evaluator is a function, not a class. Users compose evaluators into test suites. This pattern is preserved exactly.
  • Deviation: Drop the LangChain dependency. Use a minimal ModelClient protocol, modeled on the one in OpenEvals, instead of BaseChatModel. Users pass an OpenAI-compatible client or a custom adapter.
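
The factory pattern in D2 could look roughly like the sketch below. It is illustrative only: the ModelClient shape (a single complete() method), the createLlmAsJudge option names, and the assumption that the judge replies with a JSON object are all hypothetical choices for this example, not the OpenEvals API.

```typescript
// Hypothetical minimal client protocol: prompt text in, raw completion text out.
interface ModelClient {
  complete(prompt: string): Promise<string>;
}

interface EvalInput {
  inputs: string;
  outputs: string;
}

interface EvalResult {
  key: string;
  score: number;
  comment: string;
}

// An evaluator is just an async function, so suites can hold them in arrays.
type Evaluator = (input: EvalInput) => Promise<EvalResult>;

function createLlmAsJudge(opts: {
  prompt: string; // template with {inputs} and {outputs} placeholders
  model: ModelClient;
  key?: string;
}): Evaluator {
  const key = opts.key ?? "score";
  return async ({ inputs, outputs }) => {
    // Function replacers keep "$&"-style sequences in user text from expanding.
    const filled = opts.prompt
      .replace("{inputs}", () => inputs)
      .replace("{outputs}", () => outputs);
    const raw = await opts.model.complete(filled);
    // Assumes the judge replies with JSON such as {"score": 1, "comment": "..."}.
    const parsed = JSON.parse(raw) as { score: number; comment?: string };
    return { key, score: parsed.score, comment: parsed.comment ?? "" };
  };
}

// Stub client so the sketch runs offline; a real adapter would call a model API.
const stubModel: ModelClient = {
  async complete(prompt) {
    const ok = prompt.includes("Answer: 4");
    return JSON.stringify({ score: ok ? 1 : 0, comment: ok ? "correct" : "incorrect" });
  },
};

const correctness = createLlmAsJudge({
  prompt: "Question: {inputs}\nAnswer: {outputs}\nIs the answer correct?",
  model: stubModel,
  key: "correctness",
});
```

Calling correctness({ inputs: "2+2?", outputs: "4" }) resolves to a result keyed "correctness". Because the evaluator closes over its configuration, swapping the stub for a real adapter does not change any call site.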

D3: Sandbox as API Wrapper (from Vercel Sandbox)

  • Rationale: The Sandbox class in Vercel Sandbox cleanly separates the Sandbox (persistent config) from the Session (running VM). Sandbox.create() → VM, sandbox.runCommand() → execute, sandbox.fs → filesystem. This maps naturally to any provider built on Firecracker or Kata Containers.
  • Deviation: Abstract APIClient behind a SandboxProvider interface so multiple backends can be plugged in. The "use step" Vercel compiler directive is replaced with explicit serialization methods.
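
A minimal sketch of what the SandboxProvider seam in D3 might look like. Every name and signature here (SandboxConfig, SandboxHandle, CommandResult, InMemoryProvider) is an assumption made for illustration, and the in-memory provider only understands echo; it stands in for a real backend so the interface can be exercised without a VM.

```typescript
interface CommandResult {
  exitCode: number;
  stdout: string;
  stderr: string;
}

// Hypothetical config; a real provider would likely accept more fields.
interface SandboxConfig {
  image: string;
  timeoutMs?: number;
}

interface SandboxHandle {
  id: string;
  runCommand(cmd: string, args?: string[]): Promise<CommandResult>;
  stop(): Promise<void>;
}

// The seam each backend (Vercel, Fly, custom) would implement.
interface SandboxProvider {
  create(config: SandboxConfig): Promise<SandboxHandle>;
  get(id: string): Promise<SandboxHandle | undefined>;
}

// Fake backend for tests: no VM, and it only knows the `echo` command.
class InMemoryProvider implements SandboxProvider {
  private readonly live = new Map<string, SandboxHandle>();
  private nextId = 1;

  async create(_config: SandboxConfig): Promise<SandboxHandle> {
    const id = `sbx-${this.nextId++}`;
    const live = this.live;
    const handle: SandboxHandle = {
      id,
      async runCommand(cmd, args = []) {
        if (!live.has(id)) throw new Error(`sandbox ${id} is stopped`);
        if (cmd === "echo") {
          return { exitCode: 0, stdout: args.join(" ") + "\n", stderr: "" };
        }
        return { exitCode: 127, stdout: "", stderr: `${cmd}: command not found\n` };
      },
      async stop() {
        live.delete(id);
      },
    };
    live.set(id, handle);
    return handle;
  }

  async get(id: string): Promise<SandboxHandle | undefined> {
    return this.live.get(id);
  }
}
```

Code written against SandboxProvider never names a concrete backend, which is the property the deviation is after.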

D4: Graph as Pregel + Checkpointer (from langgraphjs)

  • Rationale: The superstep-based Pregel engine with typed channels is a proven pattern for stateful agent graphs. Separating graph definition (StateGraph) from execution (the Pregel instance produced by compile()) is the right abstraction.
  • Deviation: Drop the @langchain/core/runnables dependency. Define Runnable as a minimal interface (invoke and stream only). Use native Promise concurrency instead of the LangChain callback system.
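
To make the superstep idea in D4 concrete, here is a deliberately small sketch, not the langgraphjs engine. It assumes a hypothetical GraphDef shape, last-write-wins state instead of typed channels, and sequential execution inside each step. What it does keep is the barrier: every node in a step reads the same snapshot, and writes only become visible in the next step.

```typescript
type State = Record<string, unknown>;

// Minimal Runnable: invoke and stream only.
interface Runnable<I, O> {
  invoke(input: I): Promise<O>;
  stream(input: I): AsyncIterable<O>;
}

type StepFn = (state: State) => Promise<Partial<State>> | Partial<State>;

// Hypothetical graph definition: named nodes plus static edges.
interface GraphDef {
  entry: string;
  nodes: Record<string, StepFn>;
  edges: Record<string, string[]>; // node name -> successor names
}

function compile(def: GraphDef, maxSteps = 25): Runnable<State, State> {
  async function* run(input: State): AsyncGenerator<State> {
    let state: State = { ...input };
    let active = [def.entry];
    for (let step = 0; active.length > 0; step++) {
      if (step >= maxSteps) throw new Error("step limit reached");
      // Every node in this superstep reads the same snapshot.
      const snapshot = state;
      const updates: Partial<State>[] = [];
      for (const name of active) updates.push(await def.nodes[name](snapshot));
      // Writes are merged only at the barrier, after all nodes have run.
      state = Object.assign({}, snapshot, ...updates);
      yield state;
      // Next step runs the de-duplicated successors of the nodes that just ran.
      const next = active.flatMap((n) => def.edges[n] ?? []);
      active = Array.from(new Set(next));
    }
  }
  return {
    stream: run,
    async invoke(input) {
      let last = input;
      for await (const s of run(input)) last = s;
      return last;
    },
  };
}

const graph = compile({
  entry: "plan",
  nodes: {
    plan: () => ({ plan: "look things up" }),
    search: () => ({ hits: 2 }),
    // Runs in the same superstep as `search`, so it cannot see `hits` yet.
    draft: (s) => ({ draft: `draft with ${String(s.hits ?? "no")} hits` }),
  },
  edges: { plan: ["search", "draft"] },
});
```

Running the nodes of a step through Promise.all instead of the inner for loop would add parallelism without changing what any node can observe.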

D5: Interrupt/Resume via Checkpoint (from langgraphjs)

  • Rationale: interrupt() throws a typed error that is caught by the execution loop, persisted to checkpoints, and resumed via Command({resume: ...}). This is the cleanest human-in-the-loop (HITL) pattern.
  • Deviation: Simplify to a single GraphInterrupt error type. No scratchpad — just a sequential interrupt index stored in checkpoint metadata.
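
One way the single-error, sequential-index scheme in D5 might work is sketched below. The SavedRun shape, the resumeValues metadata field, and the runNode helper are hypothetical. The idea is that a node is re-run from the top on resume, and each interrupt() call consumes the stored answer at its index, or throws if none exists yet.

```typescript
class GraphInterrupt extends Error {
  readonly value: unknown;
  readonly index: number;
  constructor(value: unknown, index: number) {
    super(`interrupt #${index}`);
    // Keeps instanceof working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, GraphInterrupt.prototype);
    this.name = "GraphInterrupt";
    this.value = value;
    this.index = index;
  }
}

// Hypothetical checkpoint shape: plain JSON, so it can be persisted as-is.
interface SavedRun {
  state: Record<string, unknown>;
  metadata: { resumeValues: unknown[] };
}

interface InterruptContext {
  interrupt(value: unknown): unknown;
}

type InterruptibleNode = (
  state: Record<string, unknown>,
  ctx: InterruptContext,
) => Record<string, unknown>;

type RunResult =
  | { status: "done"; checkpoint: SavedRun }
  | { status: "interrupted"; pending: unknown; checkpoint: SavedRun };

function runNode(
  node: InterruptibleNode,
  checkpoint: SavedRun,
  command?: { resume: unknown },
): RunResult {
  const resumeValues = command
    ? [...checkpoint.metadata.resumeValues, command.resume]
    : checkpoint.metadata.resumeValues;
  let index = 0;
  const ctx: InterruptContext = {
    interrupt(value) {
      const i = index++;
      // Already answered in an earlier run: replay the stored answer.
      if (i < resumeValues.length) return resumeValues[i];
      throw new GraphInterrupt(value, i);
    },
  };
  try {
    const update = node(checkpoint.state, ctx);
    const state = { ...checkpoint.state, ...update };
    return { status: "done", checkpoint: { state, metadata: { resumeValues: [] } } };
  } catch (err) {
    if (!(err instanceof GraphInterrupt)) throw err;
    const saved: SavedRun = { state: checkpoint.state, metadata: { resumeValues } };
    return { status: "interrupted", pending: err.value, checkpoint: saved };
  }
}

const review: InterruptibleNode = (state, ctx) => {
  const approved = ctx.interrupt(`Deploy ${String(state.target)}?`);
  const note = ctx.interrupt("Any note?");
  return { approved, note };
};
```

Because the node re-executes from its first line on every resume, any side effects placed before an interrupt() call would run again; that is a known cost of this replay style.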

D6: Serialization Protocol

  • Rationale: Vercel Sandbox's WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE pattern enables cross-session persistence. We adopt an instance toJSON() method and a static fromJSON() method on all stateful types.
  • Channels → serialized as plain objects.
  • Checkpoints → serialized as versioned JSON with hash verification.
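
As a sketch of the versioned-JSON-with-hash idea in D6, the example below wraps a checkpoint payload in an envelope. The envelope fields, the choice of SHA-256, and hashing the JSON.stringify output are assumptions for illustration. Hashing stringified JSON is stable across a JavaScript round trip, but a cross-language implementation would need a canonical encoding.

```typescript
import { createHash } from "node:crypto";

interface CheckpointPayload {
  id: string;
  channels: Record<string, unknown>; // channels serialized as plain objects
}

// Hypothetical envelope: version for migrations, hash for integrity.
interface CheckpointEnvelope {
  version: number;
  payload: CheckpointPayload;
  hash: string;
}

function digest(payload: CheckpointPayload): string {
  return createHash("sha256").update(JSON.stringify(payload)).digest("hex");
}

class Checkpoint {
  static readonly VERSION = 1;
  readonly id: string;
  readonly channels: Record<string, unknown>;

  constructor(id: string, channels: Record<string, unknown>) {
    this.id = id;
    this.channels = channels;
  }

  // Called automatically by JSON.stringify(checkpoint).
  toJSON(): CheckpointEnvelope {
    const payload: CheckpointPayload = { id: this.id, channels: this.channels };
    return { version: Checkpoint.VERSION, payload, hash: digest(payload) };
  }

  static fromJSON(json: CheckpointEnvelope): Checkpoint {
    if (json.version !== Checkpoint.VERSION) {
      throw new Error(`unsupported checkpoint version ${json.version}`);
    }
    if (digest(json.payload) !== json.hash) {
      throw new Error("checkpoint hash mismatch");
    }
    return new Checkpoint(json.payload.id, json.payload.channels);
  }
}
```

A round trip is Checkpoint.fromJSON(JSON.parse(JSON.stringify(cp))). Editing the stored payload without updating the hash makes fromJSON throw.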

D7: Filesystem API over Shell Commands (from Vercel Sandbox)

  • Rationale: Vercel's FileSystem class implements the full node:fs/promises API by running shell commands (stat, find, mkdir, etc.) inside the sandbox. This is pragmatic and avoids building a special FS protocol.
  • Limitation: Parsing stat fields out of shell output is fragile. Mitigate by emitting a structured output format (JSON plus delimiter parsing) instead of scraping human-readable text.
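
The mitigation in D7 can be illustrated with delimiter-separated fields. This sketch assumes GNU coreutils stat inside the sandbox (its -c format sequences %s, %Y, %a, %F, %n; BSD and macOS stat use different flags), and the FileStat shape and helper names are hypothetical. The file name goes last so it may contain the delimiter, but names containing newlines are not handled.

```typescript
// GNU stat format: size | mtime (epoch seconds) | octal mode | file type | name.
const STAT_FORMAT = "%s|%Y|%a|%F|%n";

// Argument vector to hand to the sandbox's command runner.
function statCommand(path: string): string[] {
  return ["stat", "-c", STAT_FORMAT, "--", path];
}

// Hypothetical result shape, loosely following fs.Stats.
interface FileStat {
  path: string;
  size: number;
  mtimeMs: number;
  mode: number;
  isFile: boolean;
  isDirectory: boolean;
}

function parseStatLine(line: string): FileStat {
  const parts = line.replace(/\n$/, "").split("|");
  if (parts.length < 5) throw new Error(`unexpected stat output: ${line}`);
  const [size, mtime, mode, type] = parts;
  // The name is the final field, so re-join anything the split cut apart.
  const path = parts.slice(4).join("|");
  const sizeNum = Number(size);
  const mtimeNum = Number(mtime);
  const modeNum = parseInt(mode, 8);
  if (!Number.isInteger(sizeNum) || !Number.isInteger(mtimeNum) || Number.isNaN(modeNum)) {
    throw new Error(`non-numeric stat field in: ${line}`);
  }
  return {
    path,
    size: sizeNum,
    mtimeMs: mtimeNum * 1000,
    mode: modeNum,
    // GNU stat prints "regular file" or "regular empty file" for plain files.
    isFile: type.startsWith("regular"),
    isDirectory: type === "directory",
  };
}
```

Rejecting malformed lines outright, rather than returning partial data, keeps a misbehaving command from silently producing wrong file metadata.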

D8: Network Policy as TypeScript Types (from Vercel Sandbox)

  • Rationale: The NetworkPolicy union type ("allow-all" | "deny-all" | { allow: ... }) maps directly to firewall rules. It's declarative, serializable, and provider-agnostic.
  • Extension: Add tls and rateLimit options beyond what Vercel provides.
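
A sketch of how the NetworkPolicy union in D8 might be typed and evaluated. The exact object shape is not given above, so the allow list of host patterns, the "*." wildcard convention, and the fields of the tls and rateLimit extensions are all assumptions made for this example.

```typescript
// Hypothetical shape of the object arm; only `allow` is implied by the design.
interface AllowListPolicy {
  allow: string[]; // exact hosts, or "*.example.com" style wildcards
  tls?: { minVersion: "1.2" | "1.3" };
  rateLimit?: { requestsPerSecond: number };
}

type NetworkPolicy = "allow-all" | "deny-all" | AllowListPolicy;

function isHostAllowed(policy: NetworkPolicy, host: string): boolean {
  if (policy === "allow-all") return true;
  if (policy === "deny-all") return false;
  const h = host.toLowerCase();
  return policy.allow.some((pattern) => {
    const p = pattern.toLowerCase();
    // "*.example.com" matches subdomains only, not the bare domain.
    return p.startsWith("*.") ? h.endsWith(p.slice(1)) : h === p;
  });
}

const policy: NetworkPolicy = {
  allow: ["registry.npmjs.org", "*.example.com"],
  rateLimit: { requestsPerSecond: 10 },
};

// Plain data, so it survives a JSON round trip unchanged.
const restoredPolicy = JSON.parse(JSON.stringify(policy)) as NetworkPolicy;
```

Since the policy is plain data, a provider adapter can translate it into whatever firewall representation its backend expects.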

Risks / Trade-offs

  • [Risk] Provider coupling for sandbox: Abstracting SandboxProvider might leak provider-specific features. Mitigation: Define the interface minimally (CRUD + exec + fs); provider-specific features are accessed via (sandbox as any) escape hatch.
  • [Risk] Pregel complexity: The superstep execution model is sophisticated (~2700 lines in langgraphjs). Mitigation: Start with sequential execution, add parallelism as optimization. The channel model stays from day one.
  • [Risk] Eval without LangChain: Dropping LangChain means reimplementing structured output parsing (with_structured_output). Mitigation: Target OpenAI-compatible APIs first (they support response_format: json_schema natively). Add a generic Zod/JSON Schema path for other providers.
  • [Trade-off] TypeScript-first: Python users of OpenEvals patterns won't get a direct migration path. Mitigation: The eval prompt templates are language-agnostic strings; the core logic is portable.
  • [Trade-off] Monorepo overhead: Four packages with shared config. Mitigation: Use minimal workspaces (pnpm/turbo), keep build config shared.
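
For the structured-output risk above, the OpenAI-compatible path could be as small as the sketch below. It builds a Chat Completions request body with response_format of type json_schema and validates the reply by hand. The JudgeVerdict shape, the schema name, and the helper names are hypothetical, and no network call is made.

```typescript
// Hypothetical verdict shape for an LLM-as-judge evaluator.
interface JudgeVerdict {
  score: number;
  reasoning: string;
}

const verdictSchema = {
  type: "object",
  properties: {
    score: { type: "number" },
    reasoning: { type: "string" },
  },
  required: ["score", "reasoning"],
  additionalProperties: false,
} as const;

// Request body for an OpenAI-compatible chat completions endpoint.
function buildJudgeRequest(model: string, prompt: string) {
  return {
    model,
    messages: [{ role: "user" as const, content: prompt }],
    response_format: {
      type: "json_schema" as const,
      json_schema: { name: "judge_verdict", strict: true, schema: verdictSchema },
    },
  };
}

// Validate the message content instead of trusting the provider blindly.
function parseVerdict(content: string): JudgeVerdict {
  const value: unknown = JSON.parse(content);
  if (typeof value !== "object" || value === null) {
    throw new Error("verdict is not an object");
  }
  const { score, reasoning } = value as Record<string, unknown>;
  if (typeof score !== "number" || typeof reasoning !== "string") {
    throw new Error("verdict is missing score or reasoning");
  }
  return { score, reasoning };
}
```

The hand-written check in parseVerdict is where a generic Zod or JSON Schema validator would slot in for providers without native schema support.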

Open Questions

  • Should the sandbox provider interface include a createCheckpoint/restoreCheckpoint for VM-level snapshots, or should that be graph-layer only?
  • What's the minimum Node.js version? The current candidate is Node 20+, since the Sandbox lifecycle uses AsyncDisposable.
  • Should the eval prompt library ship as part of @agent-runtime/eval or as a separate @agent-runtime/prompts package?
  • How should eval results feed back into graph state? E.g., a "code correctness eval" runs inside a graph node, and the score influences routing.