## Why

Building reliable LLM agents requires three tightly integrated subsystems that today live in separate ecosystems with incompatible interfaces: **evaluation** (OpenEvals), **sandboxed execution** (Vercel Sandbox), and **graph-based orchestration** (langgraphjs). No single runtime combines all three, so teams building production agents must stitch the libraries together themselves, which tends to produce fragile eval pipelines, uncontained code execution, and ad-hoc workflow orchestration.

This change proposes **a unified Agent Evaluation & Execution Runtime** that combines patterns from all three into a single, consistent system.

## What Changes

- **New `@agent-runtime/eval` package**: LLM-as-judge evaluator, trajectory evaluators, code correctness (static analysis + sandboxed execution), multi-turn simulation, and 20+ built-in eval prompt templates — extracted from OpenEvals without the LangChain dependency
- **New `@agent-runtime/sandbox` package**: Remote sandbox provisioning with Firecracker MicroVM isolation, command execution (sync + detached + streaming), POSIX filesystem API, snapshot/checkpoint lifecycle, and network policy control — generalized from Vercel Sandbox patterns
- **New `@agent-runtime/graph` package**: Stateful agent graph with Pregel-style superstep execution, typed channel-based state management, pluggable checkpointer, human-in-the-loop interrupts, and multi-mode streaming — inspired by langgraphjs
- **New `@agent-runtime/core` package**: Shared types, serialization protocol, configuration system, and provider abstraction layer used by all three subsystems
- **Integration wiring**: The eval suite can execute code in a sandbox, the graph orchestrates sandbox lifecycle, and evals can score graph nodes, so each subsystem exercises and validates the others
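
To make the "Pregel-style superstep execution" and "typed channel-based state management" in the graph package concrete, here is a minimal sketch. It is not the proposed `@agent-runtime/graph` API: the names `GraphNode`, `Reducer`, `Channels`, and `runSupersteps` are hypothetical. The sketch also assumes that a node runs when any of its trigger channels changed in the previous superstep, and that writes are applied through reducers only after every active node has finished.

```typescript
// Hypothetical types for illustration only; the proposal does not define these.
type Reducer<T> = (current: T, update: T) => T;
type Channels = Record<string, unknown>;
type GraphNode = {
  name: string;
  // The node runs when any of these channels changed in the previous superstep.
  triggers: string[];
  // Nodes read a snapshot of state and return partial channel writes.
  run: (state: Readonly<Channels>) => Partial<Channels>;
};

function runSupersteps(
  nodes: GraphNode[],
  reducers: Record<string, Reducer<any>>,
  initial: Channels,
  maxSteps = 25,
): Channels {
  let state: Channels = { ...initial };
  // Every initial channel counts as "changed" for the first superstep.
  let changed: string[] = Object.keys(initial);

  for (let step = 0; step < maxSteps && changed.length > 0; step++) {
    // Plan: select nodes triggered by channels written in the last step.
    const active = nodes.filter((n) =>
      n.triggers.some((t) => changed.indexOf(t) !== -1),
    );

    // Execute: every active node reads the same snapshot, so nodes within
    // one superstep cannot observe each other's writes.
    const writes = active.map((n) => n.run(state));

    // Update: merge writes through each channel's reducer (last write wins
    // when a channel has no reducer).
    const next: Channels = { ...state };
    const updated: string[] = [];
    for (const write of writes) {
      for (const key of Object.keys(write)) {
        const reducer = reducers[key];
        next[key] = reducer ? reducer(next[key], write[key]) : write[key];
        if (updated.indexOf(key) === -1) updated.push(key);
      }
    }

    state = next;
    changed = updated;
    // A pluggable checkpointer would persist `state` at this boundary.
  }
  return state;
}
```

In this sketch, a `log` channel with an append reducer such as `(a, b) => a.concat(b)` collects writes from two parallel nodes in the same superstep without either overwriting the other, which is the property the typed reducers are meant to provide.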

## Capabilities

### New Capabilities

- `llm-as-judge`: LLM-as-judge evaluator with structured output, continuous/binary/choices scoring, reasoning, few-shot examples, and multimodal attachments
- `trajectory-eval`: Agent trajectory evaluation with strict/unordered/subset/superset matching and LLM-as-judge trajectory grading
- `code-correctness-eval`: Code correctness evaluation via LLM judging, static analysis (Pyright/mypy), and sandboxed execution
- `multi-turn-simulation`: Multi-turn conversation simulation between an app and simulated users with trajectory evaluation
- `eval-prompt-library`: Library of 20+ prompt templates across quality, RAG, safety, security, conversation, image, and voice domains
- `sandbox-lifecycle`: Sandbox creation, retrieval, forking, and deletion with Firecracker MicroVM isolation
- `sandbox-command-execution`: Command execution with blocking, detached, and streaming modes, timeout enforcement, and signal-based process control
- `sandbox-filesystem`: POSIX filesystem operations (read/write/stat/ls/mkdir/cp/mv/rm/chmod/chown/symlink) via remote sandbox
- `sandbox-snapshots`: Point-in-time sandbox snapshots with expiration, retention policies, and ancestry trees
- `sandbox-network-policy`: Network access control with domain allow/deny, request transformers, and subnet rules
- `state-graph`: Stateful graph with Annotation.Root state definition, typed reducers, and StateGraph builder
- `pregel-execution`: Superstep-based execution engine with channel communication, parallel task execution, and checkpointing
- `human-in-the-loop`: Node-level interrupts with resume values, multi-interrupt support, and Command-based graph resumption
- `graph-streaming`: Multi-mode streaming protocol (events, messages, metadata) with envelope parsing
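
As an illustration of the four matching modes listed under `trajectory-eval`, the sketch below compares an agent's tool calls against a reference trajectory. The `ToolCall` shape and the `matchTrajectory` function are hypothetical, not the proposed package API. It also assumes particular semantics: `subset` means every actual call appears in the reference, and `superset` means every reference call appears in the actual trajectory, with duplicate calls counted.

```typescript
// Hypothetical shapes for illustration; the proposal does not specify them.
type ToolCall = { name: string; args: Record<string, unknown> };
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// Canonical key for a call. Only top-level argument keys are sorted, so
// nested objects must already have a stable key order to compare equal.
function callKey(call: ToolCall): string {
  const sortedArgs = Object.keys(call.args)
    .sort()
    .map((k) => [k, call.args[k]]);
  return JSON.stringify([call.name, sortedArgs]);
}

// Count how many times each distinct call occurs.
function countCalls(calls: ToolCall[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const call of calls) {
    const key = callKey(call);
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

// True when every call in `inner` occurs at least as often in `outer`.
function isContained(
  inner: Record<string, number>,
  outer: Record<string, number>,
): boolean {
  return Object.keys(inner).every((key) => (outer[key] ?? 0) >= inner[key]);
}

function matchTrajectory(
  actual: ToolCall[],
  reference: ToolCall[],
  mode: MatchMode,
): boolean {
  if (mode === "strict") {
    // Same calls, same order, same length.
    return (
      actual.length === reference.length &&
      actual.every((call, i) => callKey(call) === callKey(reference[i]))
    );
  }
  const a = countCalls(actual);
  const r = countCalls(reference);
  if (mode === "unordered") return isContained(a, r) && isContained(r, a);
  if (mode === "subset") return isContained(a, r);
  return isContained(r, a); // superset
}
```

Under these assumptions, an agent that calls only `search` when the reference is `search` followed by `book` passes in `subset` mode and fails in the other three. An LLM-as-judge trajectory grader would complement this check where exact argument matching is too strict.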

### Modified Capabilities

*None — this is a greenfield system.*

## Impact

- **New packages**: `@agent-runtime/core`, `@agent-runtime/eval`, `@agent-runtime/sandbox`, `@agent-runtime/graph`
- **Languages**: TypeScript for all packages; Python support is planned for the eval package
- **Dependencies**: Zero required runtime deps for core eval logic; optional checkpointer backends (SQLite, Redis, Postgres) for graph; sandbox requires HTTP client
- **Target platforms**: Node.js 20+, edge-compatible for eval-only usage
- **No existing code is modified**: this change is purely additive