## Why

Building reliable LLM agents requires three tightly integrated subsystems that today exist in separate ecosystems with incompatible interfaces: **evaluation** (OpenEvals), **sandboxed execution** (Vercel Sandbox), and **graph-based orchestration** (langgraphjs). There is no unified runtime that combines all three, so every team building production agents must stitch together incompatible libraries, resulting in fragile eval pipelines, uncontained code execution, and ad-hoc workflow orchestration.

This change proposes **a unified Agent Evaluation & Execution Runtime** that combines patterns from all three into a single, consistent system.

## What Changes

- **New `@agent-runtime/eval` package**: LLM-as-judge evaluator, trajectory evaluators, code correctness (static analysis + sandboxed execution), multi-turn simulation, and 20+ built-in eval prompt templates, extracted from OpenEvals without the LangChain dependency
- **New `@agent-runtime/sandbox` package**: Remote sandbox provisioning with Firecracker MicroVM isolation, command execution (sync + detached + streaming), POSIX filesystem API, snapshot/checkpoint lifecycle, and network policy control, generalized from Vercel Sandbox patterns
- **New `@agent-runtime/graph` package**: Stateful agent graph with Pregel-style superstep execution, typed channel-based state management, pluggable checkpointer, human-in-the-loop interrupts, and multi-mode streaming, inspired by langgraphjs
- **New `@agent-runtime/core` package**: Shared types, serialization protocol, configuration system, and provider abstraction layer used by all three subsystems
- **Integration wiring**: The eval suite can execute code in the sandbox, the sandbox lifecycle is orchestrated by the graph, and graph nodes can be evaluated by evals, forming a virtuous cycle

## Capabilities

### New Capabilities

- `llm-as-judge`: LLM-as-judge evaluator with structured output, continuous/binary/choices scoring, reasoning, few-shot examples, and multimodal attachments
- `trajectory-eval`: Agent trajectory evaluation with strict/unordered/subset/superset matching and LLM-as-judge trajectory grading
- `code-correctness-eval`: Code correctness evaluation via LLM judging, static analysis (Pyright/MyPy), and sandboxed execution
- `multi-turn-simulation`: Multi-turn conversation simulation between an app and simulated users with trajectory evaluation
- `eval-prompt-library`: Library of 20+ prompt templates across quality, RAG, safety, security, conversation, image, and voice domains
- `sandbox-lifecycle`: Sandbox creation, retrieval, forking, and deletion with Firecracker MicroVM isolation
- `sandbox-command-execution`: Command execution with blocking, detached, and streaming modes, timeout enforcement, and signal-based process control
- `sandbox-filesystem`: POSIX filesystem operations (read/write/stat/ls/mkdir/cp/mv/rm/chmod/chown/symlink) via remote sandbox
- `sandbox-snapshots`: Point-in-time sandbox snapshots with expiration, retention policies, and ancestry trees
- `sandbox-network-policy`: Network access control with domain allow/deny, request transformers, and subnet rules
- `state-graph`: Stateful graph with Annotation.Root state definition, typed reducers, and StateGraph builder
- `pregel-execution`: Superstep-based execution engine with channel communication, parallel task execution, and checkpointing
- `human-in-the-loop`: Node-level interrupts with resume values, multi-interrupt support, and Command-based graph resumption
- `graph-streaming`: Multi-mode streaming protocol (events, messages, metadata) with envelope parsing

### Modified Capabilities

*None. This is a greenfield system.*

## Impact

- **New packages**: `@agent-runtime/core`, `@agent-runtime/eval`, `@agent-runtime/sandbox`, `@agent-runtime/graph`
- **Languages**: TypeScript (all packages); Python support is planned for the eval package
- **Dependencies**: No required runtime dependencies for core eval logic; optional checkpointer backends (SQLite, Redis, Postgres) for graph; sandbox requires an HTTP client
- **Target platforms**: Node.js 20+, edge-compatible for eval-only usage
- **No existing code is modified**: this change is purely additive
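
To make the `trajectory-eval` matching modes listed under New Capabilities more concrete, here is a minimal sketch of how strict, unordered, subset, and superset matching could behave. Everything in it is an assumption for illustration: the `ToolCall` and `MatchMode` types, the `trajectoryMatch` function, and the exact semantics (here "subset" means the actual trajectory is contained in the reference, and "superset" means it contains the reference) are hypothetical and not the final `@agent-runtime/eval` API.

```typescript
// Hypothetical types for illustration; not the proposed package's real API.
type ToolCall = { name: string; args: Record<string, unknown> };
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// Canonical key for a tool call. Only top-level argument keys are sorted,
// so nested objects with different key order would compare as different.
function callKey(call: ToolCall): string {
  const sortedArgs = Object.keys(call.args)
    .sort()
    .map((k) => [k, call.args[k]]);
  return JSON.stringify([call.name, sortedArgs]);
}

// Count how often each canonical key occurs (a multiset of calls).
function countKeys(keys: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const key of keys) {
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

// True when every key in `inner` occurs at least as often in `outer`.
function containedIn(
  inner: Record<string, number>,
  outer: Record<string, number>
): boolean {
  return Object.keys(inner).every(
    (key) => (outer[key] ?? 0) >= (inner[key] ?? 0)
  );
}

function trajectoryMatch(
  actual: ToolCall[],
  reference: ToolCall[],
  mode: MatchMode
): boolean {
  const aKeys = actual.map(callKey);
  const rKeys = reference.map(callKey);
  const a = countKeys(aKeys);
  const r = countKeys(rKeys);
  switch (mode) {
    case "strict":
      // Same calls in the same order.
      return aKeys.length === rKeys.length && aKeys.every((k, i) => k === rKeys[i]);
    case "unordered":
      // Same multiset of calls, order ignored.
      return aKeys.length === rKeys.length && containedIn(a, r);
    case "subset":
      // Actual makes no calls beyond those in the reference.
      return containedIn(a, r);
    case "superset":
      // Actual makes at least every call in the reference.
      return containedIn(r, a);
  }
}
```

A real evaluator would likely return a structured result (score plus reasoning) rather than a bare boolean, and might allow custom argument comparison; the boolean form is used here only to keep the mode semantics visible.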