Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
Why
Building reliable LLM agents requires three subsystems that must work closely together but today live in separate ecosystems with incompatible interfaces: evaluation (OpenEvals), sandboxed execution (Vercel Sandbox), and graph-based orchestration (langgraphjs). No unified runtime combines all three, so every team building production agents must stitch together incompatible libraries, which leads to fragile eval pipelines, uncontained code execution, and ad-hoc workflow orchestration.
This change proposes a unified Agent Evaluation & Execution Runtime that combines patterns from all three into a single, consistent system.
What Changes
- New `@agent-runtime/eval` package: LLM-as-judge evaluator, trajectory evaluators, code correctness (static analysis + sandboxed execution), multi-turn simulation, and 20+ built-in eval prompt templates, extracted from OpenEvals without the LangChain dependency
- New `@agent-runtime/sandbox` package: Remote sandbox provisioning with Firecracker MicroVM isolation, command execution (sync + detached + streaming), POSIX filesystem API, snapshot/checkpoint lifecycle, and network policy control, generalized from Vercel Sandbox patterns
- New `@agent-runtime/graph` package: Stateful agent graph with Pregel-style superstep execution, typed channel-based state management, pluggable checkpointer, human-in-the-loop interrupts, and multi-mode streaming, inspired by langgraphjs
- New `@agent-runtime/core` package: Shared types, serialization protocol, configuration system, and provider abstraction layer used by all three subsystems
- Integration wiring: Eval suite can execute code in sandbox, sandbox lifecycle is orchestrated by graph, graph nodes can be evaluated by evals, forming a virtuous cycle
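To make "Pregel-style superstep execution" with "typed channel-based state management" concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the proposed `@agent-runtime/graph` API: the names `GraphNode`, `applyUpdates`, and `runSupersteps` are hypothetical, every channel is assumed to hold a number array, and the only reducer shown is "append".

```typescript
// Hypothetical minimal types; not the proposed @agent-runtime/graph API.
type State = Record<string, number[]>;
type Update = Partial<State>;

interface GraphNode {
  name: string;
  run: (state: State) => Update; // reads a snapshot, returns channel writes
  next: string[];                // nodes to activate in the following superstep
}

// Assumed reducer: writes are appended to the channel, so several nodes can
// write to the same channel in one superstep without overwriting each other.
function applyUpdates(state: State, updates: Update[]): State {
  const merged: State = { ...state };
  for (const update of updates) {
    for (const channel of Object.keys(update)) {
      merged[channel] = [...(merged[channel] ?? []), ...(update[channel] ?? [])];
    }
  }
  return merged;
}

function runSupersteps(
  nodes: GraphNode[],
  entry: string[],
  initial: State,
  maxSteps = 10,
): { state: State; checkpoints: State[] } {
  const byName = new Map<string, GraphNode>();
  for (const node of nodes) byName.set(node.name, node);

  let state = initial;
  let active = entry;
  const checkpoints: State[] = [state]; // one snapshot per superstep boundary

  for (let step = 0; step < maxSteps && active.length > 0; step++) {
    const tasks = active.map((name) => {
      const node = byName.get(name);
      if (!node) throw new Error(`unknown node: ${name}`);
      return node;
    });
    // Every active node reads the same snapshot; writes become visible
    // only after the whole superstep has finished.
    const updates = tasks.map((node) => node.run(state));
    state = applyUpdates(state, updates);
    checkpoints.push(state);

    // Collect the next frontier without duplicates.
    const frontier: string[] = [];
    for (const node of tasks) {
      for (const name of node.next) {
        if (frontier.indexOf(name) === -1) frontier.push(name);
      }
    }
    active = frontier;
  }
  return { state, checkpoints };
}
```

The sketch runs nodes sequentially inside a superstep for simplicity; because each node only sees the pre-step snapshot, the same loop could dispatch them in parallel without changing the result.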
Capabilities
New Capabilities
- `llm-as-judge`: LLM-as-judge evaluator with structured output, continuous/binary/choices scoring, reasoning, few-shot examples, and multimodal attachments
- `trajectory-eval`: Agent trajectory evaluation with strict/unordered/subset/superset matching and LLM-as-judge trajectory grading
- `code-correctness-eval`: Code correctness evaluation via LLM judging, static analysis (Pyright/MyPy), and sandboxed execution
- `multi-turn-simulation`: Multi-turn conversation simulation between an app and simulated users with trajectory evaluation
- `eval-prompt-library`: Library of 20+ prompt templates across quality, RAG, safety, security, conversation, image, and voice domains
- `sandbox-lifecycle`: Sandbox creation, retrieval, forking, and deletion with Firecracker MicroVM isolation
- `sandbox-command-execution`: Command execution with blocking, detached, and streaming modes, timeout enforcement, and signal-based process control
- `sandbox-filesystem`: POSIX filesystem operations (read/write/stat/ls/mkdir/cp/mv/rm/chmod/chown/symlink) via remote sandbox
- `sandbox-snapshots`: Point-in-time sandbox snapshots with expiration, retention policies, and ancestry trees
- `sandbox-network-policy`: Network access control with domain allow/deny, request transformers, and subnet rules
- `state-graph`: Stateful graph with Annotation.Root state definition, typed reducers, and StateGraph builder
- `pregel-execution`: Superstep-based execution engine with channel communication, parallel task execution, and checkpointing
- `human-in-the-loop`: Node-level interrupts with resume values, multi-interrupt support, and Command-based graph resumption
- `graph-streaming`: Multi-mode streaming protocol (events, messages, metadata) with envelope parsing
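The four trajectory matching modes named under `trajectory-eval` can be sketched as follows. This is a simplified illustration, not the OpenEvals implementation: a trajectory is reduced to an ordered list of tool names, and `MatchMode`, `countTools`, `containsAll`, and `trajectoryMatch` are hypothetical names. The sketch assumes "subset" means the actual run used only tools from the reference, and "superset" means it used at least all of them.

```typescript
// Hypothetical simplification: a trajectory is the ordered list of tool names called.
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// Multiset view of a trajectory: tool name -> number of calls.
function countTools(calls: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const name of calls) counts.set(name, (counts.get(name) ?? 0) + 1);
  return counts;
}

// True when every tool in `inner` appears at least as often in `outer`.
function containsAll(outer: Map<string, number>, inner: Map<string, number>): boolean {
  return Array.from(inner).every(([name, n]) => (outer.get(name) ?? 0) >= n);
}

function trajectoryMatch(actual: string[], reference: string[], mode: MatchMode): boolean {
  const a = countTools(actual);
  const r = countTools(reference);
  switch (mode) {
    case "strict":
      // Same tools, same order, same length.
      return actual.length === reference.length &&
        actual.every((name, i) => name === reference[i]);
    case "unordered":
      // Same multiset of tools, order ignored.
      return containsAll(a, r) && containsAll(r, a);
    case "subset":
      // Assumed meaning: actual calls are contained in the reference.
      return containsAll(r, a);
    case "superset":
      // Assumed meaning: actual calls cover the whole reference.
      return containsAll(a, r);
  }
}
```

A real evaluator would also compare tool arguments and message content; the sketch isolates only the ordering and containment semantics that distinguish the four modes.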
Modified Capabilities
None — this is a greenfield system.
Impact
- New packages: `@agent-runtime/core`, `@agent-runtime/eval`, `@agent-runtime/sandbox`, `@agent-runtime/graph`
- Languages: TypeScript (all packages), Python support planned for eval package
- Dependencies: Zero required runtime deps for core eval logic; optional checkpointer backends (SQLite, Redis, Postgres) for graph; sandbox requires HTTP client
- Target platforms: Node.js 20+, edge-compatible for eval-only usage
- No existing code is modified; this change is purely additive
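One way to keep the checkpointer backends optional is to define a small interface and ship a dependency-free in-memory default. The proposal does not specify this contract, so everything below is an assumption: `Checkpoint`, `Checkpointer`, and `MemoryCheckpointer` are hypothetical names, and state is assumed to be JSON-serializable.

```typescript
// Hypothetical contract; the proposal does not define the checkpointer interface.
interface Checkpoint<S> {
  threadId: string;
  step: number;
  state: S;
}

interface Checkpointer<S> {
  put(checkpoint: Checkpoint<S>): Promise<void>;
  latest(threadId: string): Promise<Checkpoint<S> | undefined>;
}

// Dependency-free default. A SQLite, Redis, or Postgres backend would
// implement the same two methods and be installed only when needed.
class MemoryCheckpointer<S> implements Checkpointer<S> {
  private readonly byThread = new Map<string, Checkpoint<S>[]>();

  async put(checkpoint: Checkpoint<S>): Promise<void> {
    const history = this.byThread.get(checkpoint.threadId) ?? [];
    // Store a JSON deep copy so later mutation of live state cannot alter
    // the saved checkpoint (assumes state is JSON-serializable).
    history.push(JSON.parse(JSON.stringify(checkpoint)) as Checkpoint<S>);
    this.byThread.set(checkpoint.threadId, history);
  }

  async latest(threadId: string): Promise<Checkpoint<S> | undefined> {
    const history = this.byThread.get(threadId) ?? [];
    return history[history.length - 1];
  }
}
```

The methods are asynchronous even in the in-memory case so that network-backed stores can satisfy the same interface without changing call sites.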