boocode/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/proposal.md at c9356877257cb482fbcdc9a4d2aa83dd510c3c2d


indifferentketchup c935687725 chore(openspec): drop 9 superseded proposals + 11 stub archive files

Drop 9 batch proposals that are superseded by the boocode-lift-analysis
(boocontext-audit, conductor upgrades, self-healing/verify-gate skills):
add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform,
conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul,
agent-reliability.

Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only)
that provide zero documentation value over the existing CHANGELOG.md + git tags.

2026-06-07 22:15:38 +00:00


Why

Building reliable LLM agents requires three tightly integrated subsystems that today live in separate ecosystems with incompatible interfaces: evaluation (OpenEvals), sandboxed execution (Vercel Sandbox), and graph-based orchestration (langgraphjs). No unified runtime combines all three, so every team building production agents must stitch these libraries together itself, which results in fragile eval pipelines, uncontained code execution, and ad-hoc workflow orchestration.

This change proposes a unified Agent Evaluation & Execution Runtime that combines patterns from all three into a single, consistent system.

What Changes

New @agent-runtime/eval package: LLM-as-judge evaluator, trajectory evaluators, code correctness (static analysis + sandboxed execution), multi-turn simulation, and 20+ built-in eval prompt templates — extracted from OpenEvals without the LangChain dependency
New @agent-runtime/sandbox package: Remote sandbox provisioning with Firecracker MicroVM isolation, command execution (sync + detached + streaming), POSIX filesystem API, snapshot/checkpoint lifecycle, and network policy control — generalized from Vercel Sandbox patterns
New @agent-runtime/graph package: Stateful agent graph with Pregel-style superstep execution, typed channel-based state management, pluggable checkpointer, human-in-the-loop interrupts, and multi-mode streaming — inspired by langgraphjs
New @agent-runtime/core package: Shared types, serialization protocol, configuration system, and provider abstraction layer used by all three subsystems
Integration wiring: The eval suite can execute code in a sandbox, the graph orchestrates sandbox lifecycles, and evals can score graph nodes, so the three subsystems reinforce one another
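
The Pregel-style superstep execution named for the graph package can be sketched as follows. This is a minimal illustration, not the langgraphjs API or a committed @agent-runtime/graph interface: `Graph`, `NodeFn`, `Reducer`, and `runSupersteps` are hypothetical names, and the sketch assumes that every active node reads the same state snapshot, that writes are merged through per-channel reducers only after all nodes in the step finish, and that successors are then scheduled for the next step.

```typescript
// Hypothetical types for illustration; the proposal does not define these shapes.
type State = Record<string, unknown>;
type Reducer = (current: unknown, update: unknown) => unknown;
type NodeFn = (state: Readonly<State>) => Partial<State> | Promise<Partial<State>>;

interface Graph {
  nodes: Record<string, NodeFn>;
  edges: Record<string, string[]>;   // node name -> successor node names
  reducers: Record<string, Reducer>; // channel name -> merge function
}

async function runSupersteps(
  graph: Graph,
  entry: string[],
  initial: State,
  maxSteps = 25,
): Promise<{ state: State; steps: number }> {
  let state: State = { ...initial };
  let active = [...entry];
  let steps = 0;

  while (active.length > 0) {
    if (steps >= maxSteps) throw new Error("superstep limit reached");

    // 1. Execute: all active nodes run in parallel against one frozen snapshot.
    const snapshot = Object.freeze({ ...state });
    const updates = await Promise.all(
      active.map((name) => {
        const fn = graph.nodes[name];
        if (!fn) throw new Error(`unknown node: ${name}`);
        return fn(snapshot);
      }),
    );

    // 2. Update: after the barrier, merge writes channel by channel.
    //    Channels without a reducer are simply overwritten.
    const next: State = { ...state };
    for (const update of updates) {
      for (const key of Object.keys(update)) {
        const reduce = graph.reducers[key];
        next[key] = reduce ? reduce(next[key], update[key]) : update[key];
      }
    }
    state = next; // a pluggable checkpointer could persist `state` here

    // 3. Schedule: collect deduplicated successors for the next superstep.
    const successors: string[] = [];
    for (const name of active) {
      for (const s of graph.edges[name] ?? []) {
        if (successors.indexOf(s) === -1) successors.push(s);
      }
    }
    active = successors;
    steps += 1;
  }
  return { state, steps };
}
```

Because writes are applied only at the end of a step, two nodes that run in the same superstep never observe each other's partial updates, which is what makes a checkpoint taken between steps a consistent resume point.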

Capabilities

New Capabilities

llm-as-judge: LLM-as-judge evaluator with structured output, continuous/binary/choices scoring, reasoning, few-shot examples, and multimodal attachments
trajectory-eval: Agent trajectory evaluation with strict/unordered/subset/superset matching and LLM-as-judge trajectory grading
code-correctness-eval: Code correctness evaluation via LLM judging, static analysis (Pyright/mypy), and sandboxed execution
multi-turn-simulation: Multi-turn conversation simulation between an app and simulated users with trajectory evaluation
eval-prompt-library: Library of 20+ prompt templates across quality, RAG, safety, security, conversation, image, and voice domains
sandbox-lifecycle: Sandbox creation, retrieval, forking, and deletion with Firecracker MicroVM isolation
sandbox-command-execution: Command execution with blocking, detached, and streaming modes, timeout enforcement, and signal-based process control
sandbox-filesystem: POSIX filesystem operations (read/write/stat/ls/mkdir/cp/mv/rm/chmod/chown/symlink) via remote sandbox
sandbox-snapshots: Point-in-time sandbox snapshots with expiration, retention policies, and ancestry trees
sandbox-network-policy: Network access control with domain allow/deny, request transformers, and subnet rules
state-graph: Stateful graph with Annotation.Root state definition, typed reducers, and StateGraph builder
pregel-execution: Superstep-based execution engine with channel communication, parallel task execution, and checkpointing
human-in-the-loop: Node-level interrupts with resume values, multi-interrupt support, and Command-based graph resumption
graph-streaming: Multi-mode streaming protocol (events, messages, metadata) with envelope parsing
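
The four matching modes listed for trajectory-eval can be sketched as a pure function. This is an illustration under stated assumptions rather than the OpenEvals implementation: `ToolCall`, `MatchMode`, and `matchTrajectory` are hypothetical names, a trajectory is reduced to a list of tool calls, "subset" is taken to mean that the actual calls are contained in the reference, "superset" that the reference calls are contained in the actual trajectory, and argument comparison only normalizes top-level key order.

```typescript
// Hypothetical shapes; the proposal does not specify the trajectory format.
type ToolCall = { name: string; args: Record<string, unknown> };
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// Canonical key for a call. Top-level arg keys are sorted so that
// { a: 1, b: 2 } and { b: 2, a: 1 } compare equal (nested objects are not normalized).
function callKey(call: ToolCall): string {
  const sorted = Object.keys(call.args).sort().map((k) => [k, call.args[k]]);
  return JSON.stringify([call.name, sorted]);
}

// Multiset of call keys: key -> number of occurrences.
function countKeys(calls: ToolCall[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const call of calls) {
    const key = callKey(call);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// True when every key in `a` occurs in `b` at least as many times.
function containedIn(a: Map<string, number>, b: Map<string, number>): boolean {
  return Array.from(a).every(([key, n]) => (b.get(key) ?? 0) >= n);
}

function matchTrajectory(
  actual: ToolCall[],
  reference: ToolCall[],
  mode: MatchMode,
): boolean {
  const a = countKeys(actual);
  const r = countKeys(reference);
  switch (mode) {
    case "strict": {
      // Same calls in the same order.
      const ak = actual.map(callKey);
      const rk = reference.map(callKey);
      return ak.length === rk.length && ak.every((k, i) => k === rk[i]);
    }
    case "unordered":
      // Same calls, any order.
      return containedIn(a, r) && containedIn(r, a);
    case "subset":
      // Actual made no calls beyond the reference.
      return containedIn(a, r);
    case "superset":
      // Actual made at least every reference call.
      return containedIn(r, a);
  }
}
```

Counting occurrences instead of using a plain set keeps repeated calls significant, so an agent that calls the same tool twice does not match a reference that calls it once under "unordered".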

Modified Capabilities

None — this is a greenfield system.

Impact

New packages: @agent-runtime/core, @agent-runtime/eval, @agent-runtime/sandbox, @agent-runtime/graph
Languages: TypeScript (all packages); Python support is planned for the eval package
Dependencies: Zero required runtime dependencies for core eval logic; optional checkpointer backends (SQLite, Redis, Postgres) for graph; sandbox requires an HTTP client
Target platforms: Node.js 20+, edge-compatible for eval-only usage
No existing code is modified; this change is purely additive
