boocode/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/proposal.md
indifferentketchup c935687725 chore(openspec): drop 9 superseded proposals + 11 stub archive files
2026-06-07 22:15:38 +00:00


Why

Building reliable LLM agents requires three tightly integrated subsystems: evaluation, sandboxed execution, and graph-based orchestration. Today these live in separate ecosystems (OpenEvals, Vercel Sandbox, and langgraphjs, respectively) whose interfaces do not fit together. No unified runtime combines all three, so every team building production agents must stitch the libraries together by hand, which results in fragile eval pipelines, uncontained code execution, and ad-hoc workflow orchestration.

This change proposes a unified Agent Evaluation & Execution Runtime that combines patterns from all three into a single, consistent system.

What Changes

  • New @agent-runtime/eval package: LLM-as-judge evaluator, trajectory evaluators, code correctness (static analysis + sandboxed execution), multi-turn simulation, and 20+ built-in eval prompt templates — extracted from OpenEvals without the LangChain dependency
  • New @agent-runtime/sandbox package: Remote sandbox provisioning with Firecracker MicroVM isolation, command execution (sync + detached + streaming), POSIX filesystem API, snapshot/checkpoint lifecycle, and network policy control — generalized from Vercel Sandbox patterns
  • New @agent-runtime/graph package: Stateful agent graph with Pregel-style superstep execution, typed channel-based state management, pluggable checkpointer, human-in-the-loop interrupts, and multi-mode streaming — inspired by langgraphjs
  • New @agent-runtime/core package: Shared types, serialization protocol, configuration system, and provider abstraction layer used by all three subsystems
  • Integration wiring: Evaluators can execute code in a sandbox, the graph orchestrates the sandbox lifecycle, and graph nodes can themselves be scored by evaluators, so each subsystem exercises the other two
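
To make the Pregel-style superstep model in the graph package concrete, here is a minimal sketch. It assumes a fixed two-field state and per-field reducers; `GraphState`, `NodeFn`, `reducers`, and `runSuperstep` are hypothetical names chosen for illustration, not the proposed `@agent-runtime/graph` API. The point it shows is that all nodes in a superstep read the same state snapshot, and their writes are merged through reducers only after every node has run.

```typescript
// Hypothetical sketch; none of these names come from @agent-runtime/graph.
interface GraphState {
  messages: string[];
  steps: number;
}

// A node reads an immutable snapshot and returns a partial update.
type NodeFn = (state: Readonly<GraphState>) => Partial<GraphState>;

// One reducer per state field decides how concurrent writes are merged.
const reducers: {
  [K in keyof GraphState]: (current: GraphState[K], update: GraphState[K]) => GraphState[K];
} = {
  messages: (current, update) => current.concat(update), // append
  steps: (current, update) => current + update, // sum
};

function applyUpdates(state: GraphState, updates: Partial<GraphState>[]): GraphState {
  const next: GraphState = { ...state };
  for (const update of updates) {
    if (update.messages !== undefined) {
      next.messages = reducers.messages(next.messages, update.messages);
    }
    if (update.steps !== undefined) {
      next.steps = reducers.steps(next.steps, update.steps);
    }
  }
  return next;
}

// One superstep: every active node sees the same snapshot, and no node
// observes another node's write until the step has finished.
function runSuperstep(state: GraphState, activeNodes: NodeFn[]): GraphState {
  const updates = activeNodes.map((node) => node(state));
  return applyUpdates(state, updates);
}

// Usage: two nodes run in the same superstep.
const planner: NodeFn = () => ({ messages: ["plan ready"], steps: 1 });
const researcher: NodeFn = () => ({ messages: ["sources found"], steps: 1 });

const initial: GraphState = { messages: [], steps: 0 };
const result = runSuperstep(initial, [planner, researcher]);
console.log(result);
```

A real engine would also decide which nodes become active in the next superstep and persist a checkpoint between steps; both are left out here to keep the merge semantics visible.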

Capabilities

New Capabilities

  • llm-as-judge: LLM-as-judge evaluator with structured output, continuous/binary/choices scoring, reasoning, few-shot examples, and multimodal attachments
  • trajectory-eval: Agent trajectory evaluation with strict/unordered/subset/superset matching and LLM-as-judge trajectory grading
  • code-correctness-eval: Code correctness evaluation via LLM judging, static analysis (Pyright/mypy), and sandboxed execution
  • multi-turn-simulation: Multi-turn conversation simulation between an app and simulated users with trajectory evaluation
  • eval-prompt-library: Library of 20+ prompt templates across quality, RAG, safety, security, conversation, image, and voice domains
  • sandbox-lifecycle: Sandbox creation, retrieval, forking, and deletion with Firecracker MicroVM isolation
  • sandbox-command-execution: Command execution with blocking, detached, and streaming modes, timeout enforcement, and signal-based process control
  • sandbox-filesystem: POSIX filesystem operations (read/write/stat/ls/mkdir/cp/mv/rm/chmod/chown/symlink) via remote sandbox
  • sandbox-snapshots: Point-in-time sandbox snapshots with expiration, retention policies, and ancestry trees
  • sandbox-network-policy: Network access control with domain allow/deny, request transformers, and subnet rules
  • state-graph: Stateful graph with Annotation.Root state definition, typed reducers, and StateGraph builder
  • pregel-execution: Superstep-based execution engine with channel communication, parallel task execution, and checkpointing
  • human-in-the-loop: Node-level interrupts with resume values, multi-interrupt support, and Command-based graph resumption
  • graph-streaming: Multi-mode streaming protocol (events, messages, metadata) with envelope parsing
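
As an illustration of the four matching modes named under trajectory-eval, the sketch below compares two trajectories reduced to lists of tool-call names. This is an assumption-laden simplification: the real evaluator would likely compare full messages and tool arguments, and `matchTrajectory`, `MatchMode`, and `isContained` are hypothetical names. The mode semantics assumed here are: strict requires the same calls in the same order, unordered requires the same calls in any order, subset requires the actual calls to be contained in the reference, and superset requires the actual calls to contain the reference.

```typescript
// Hypothetical sketch; trajectories are simplified to tool-call names.
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// True if every call in `inner` can be paired with a distinct call in `outer`
// (multiset containment, so repeated calls are counted).
function isContained(inner: string[], outer: string[]): boolean {
  const remaining = outer.slice();
  for (const call of inner) {
    const index = remaining.indexOf(call);
    if (index === -1) return false;
    remaining.splice(index, 1);
  }
  return true;
}

function matchTrajectory(actual: string[], reference: string[], mode: MatchMode): boolean {
  switch (mode) {
    case "strict":
      // Same calls, same order.
      return (
        actual.length === reference.length &&
        actual.every((call, i) => call === reference[i])
      );
    case "unordered":
      // Same calls, any order.
      return actual.length === reference.length && isContained(actual, reference);
    case "subset":
      // The agent made no calls beyond those in the reference.
      return isContained(actual, reference);
    case "superset":
      // The agent made at least every call in the reference.
      return isContained(reference, actual);
  }
}

// Usage: the agent called the right tools, but in a different order.
const referenceCalls = ["search", "fetch", "summarize"];
const actualCalls = ["fetch", "search", "summarize"];
console.log(matchTrajectory(actualCalls, referenceCalls, "strict"));
console.log(matchTrajectory(actualCalls, referenceCalls, "unordered"));
```

Superset is the natural choice when only a few key tool calls matter, while strict suits workflows where ordering is part of the contract.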

Modified Capabilities

None — this is a greenfield system.

Impact

  • New packages: @agent-runtime/core, @agent-runtime/eval, @agent-runtime/sandbox, @agent-runtime/graph
  • Languages: TypeScript for all packages; Python support is planned for the eval package
  • Dependencies: Zero required runtime deps for core eval logic; optional checkpointer backends (SQLite, Redis, Postgres) for graph; sandbox requires HTTP client
  • Target platforms: Node.js 20+, edge-compatible for eval-only usage
  • No existing code is modified; this change is purely additive
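
One way to keep the checkpointer backends optional is a small interface that the graph depends on, with each backend shipped as a separate implementation. The sketch below is a guess at that shape, not the proposed API: `Checkpoint`, `Checkpointer`, and `InMemoryCheckpointer` are hypothetical names, and the in-memory class stands in for a SQLite, Redis, or Postgres backend. It assumes graph state is JSON-serializable.

```typescript
// Hypothetical sketch; not the actual @agent-runtime/graph interface.
interface Checkpoint<S> {
  threadId: string;
  step: number;
  state: S;
}

// The graph would depend only on this interface. Methods are async because
// real backends (SQLite, Redis, Postgres) perform I/O.
interface Checkpointer<S> {
  put(checkpoint: Checkpoint<S>): Promise<void>;
  getLatest(threadId: string): Promise<Checkpoint<S> | undefined>;
}

// Dependency-free default backend that keeps checkpoints in process memory.
class InMemoryCheckpointer<S> implements Checkpointer<S> {
  private readonly history = new Map<string, Checkpoint<S>[]>();

  async put(checkpoint: Checkpoint<S>): Promise<void> {
    // Copy via JSON so later mutation by the caller cannot alter stored
    // history. Assumes the state is JSON-serializable.
    const copy = JSON.parse(JSON.stringify(checkpoint)) as Checkpoint<S>;
    const list = this.history.get(checkpoint.threadId) ?? [];
    list.push(copy);
    this.history.set(checkpoint.threadId, list);
  }

  async getLatest(threadId: string): Promise<Checkpoint<S> | undefined> {
    const list = this.history.get(threadId);
    return list === undefined ? undefined : list[list.length - 1];
  }
}
```

With this shape, a database-backed checkpointer only has to implement the same two methods, so the core packages never import a database driver.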