## Context

This design defines a unified Agent Evaluation & Execution Runtime combining three subsystems inspired by OpenEvals, Vercel Sandbox, and langgraphjs.

The system is a TypeScript monorepo with four packages:

- **`@agent-runtime/core`**: Shared types, serialization protocol, provider abstraction
- **`@agent-runtime/eval`**: LLM-as-judge, trajectory, code correctness, multi-turn sim, prompt library
- **`@agent-runtime/sandbox`**: Remote sandbox lifecycle, command execution, filesystem, snapshots, network policy
- **`@agent-runtime/graph`**: Stateful graph, Pregel execution, checkpoints, interrupts, streaming

Each package is independently usable but designed to compose: evals run code in sandboxes, sandbox lifecycles are orchestrated by graphs, and graph nodes can be evaluated by evals.

## Goals / Non-Goals

**Goals:**

- Zero required runtime dependencies for eval core (optional providers via adapter pattern)
- Sandbox abstraction that works with any provider (Vercel, Fly, custom) via APIClient interface
- Graph execution with pluggable checkpointers (in-memory, SQLite, Redis, Postgres)
- All three subsystems share a common serialization protocol for cross-persistence
- Evaluation can target code running inside sandbox instances
- Graph nodes can suspend/resume via interrupts with persistent checkpointing

**Non-Goals:**

- Not a replacement for LangChain/LlamaIndex: no integrations with existing frameworks in v1
- Not a general-purpose workflow engine: focused on agent/task orchestration patterns
- No UI or dashboard in v1: CLI and programmatic API only
- No Python SDK in v1: TypeScript-first, Python planned

## Decisions

### D1: Package Architecture (`core` + 3 domain packages)

- **Rationale**: Eval, Sandbox, and Graph have zero overlap in concerns but share types (serialization, error handling, config). A shared core avoids circular deps and keeps each package lightweight.
- **Alternatives considered**: A monolithic single package, rejected because users may want only one subsystem.

### D2: Eval Factory Pattern (from OpenEvals)

- **Rationale**: OpenEvals' `create_llm_as_judge(prompt, model, ...)` returns a callable, so the evaluator is a function, not a class, and users compose evaluators into test suites. This pattern is preserved exactly.
- **Deviation**: Drop the LangChain dependency. Use a minimal `ModelClient` protocol, modeled on OpenEvals' own `ModelClient`, instead of `BaseChatModel`. Users pass an OpenAI-compatible client or a custom adapter.

### D3: Sandbox as API Wrapper (from Vercel Sandbox)

- **Rationale**: The Vercel Sandbox `Sandbox` class cleanly separates the **Sandbox** (persistent config) from the **Session** (running VM). `Sandbox.create()` starts a VM, `sandbox.runCommand()` executes a command, and `sandbox.fs` exposes the filesystem. This maps naturally to any provider built on Firecracker or Kata Containers.
- **Deviation**: Abstract `APIClient` behind a `SandboxProvider` interface so multiple backends can be plugged in. The `"use step"` Vercel compiler directive is replaced with explicit serialization methods.

### D4: Graph as Pregel + Checkpointer (from langgraphjs)

- **Rationale**: The superstep-based Pregel engine with typed channels is a proven pattern for stateful agent graphs. Separating graph definition (`StateGraph`) from execution (the `Pregel` instance produced by `compile()`) is the right abstraction.
- **Deviation**: Drop the `@langchain/core/runnables` dependency. Define `Runnable` as a minimal interface (`invoke` and `stream` only). Use native `Promise` concurrency instead of the LangChain callback system.

### D5: Interrupt/Resume via Checkpoint (from langgraphjs)

- **Rationale**: `interrupt()` throws a typed error that the execution loop catches and persists to a checkpoint; execution then resumes via `Command({resume: ...})`. This is the cleanest human-in-the-loop (HITL) pattern.
- **Deviation**: Simplify to a single `GraphInterrupt` error type.
  No scratchpad, just a sequential interrupt index stored in checkpoint metadata.

### D6: Serialization Protocol

- **Rationale**: Vercel Sandbox's `WORKFLOW_SERIALIZE`/`WORKFLOW_DESERIALIZE` pattern enables cross-session persistence. We adopt an instance `toJSON()` method and a static `fromJSON()` method on all stateful types.
- **Channels**: serialized as plain objects.
- **Checkpoints**: serialized as versioned JSON with hash verification.

### D7: Filesystem API over Shell Commands (from Vercel Sandbox)

- **Rationale**: Vercel's `FileSystem` class implements the full `node:fs/promises` API by running shell commands (`stat`, `find`, `mkdir`, etc.) inside the sandbox. This is pragmatic and avoids building a special FS protocol.
- **Limitation**: Parsing `stat` results from shell output is fragile. Mitigate with a structured output format (JSON + delimiter parsing).

### D8: Network Policy as TypeScript Types (from Vercel Sandbox)

- **Rationale**: The `NetworkPolicy` union type (`"allow-all" | "deny-all" | { allow: ... }`) maps directly to firewall rules. It is declarative, serializable, and provider-agnostic.
- **Extension**: Add `tls` and `rateLimit` options beyond what Vercel provides.

## Risks / Trade-offs

- **[Risk] Provider coupling for sandbox**: The `SandboxProvider` abstraction might leak provider-specific features. **Mitigation**: Define the interface minimally (CRUD + exec + fs); provider-specific features are accessed via a `(sandbox as any)` escape hatch.
- **[Risk] Pregel complexity**: The superstep execution model is sophisticated (~2700 lines in langgraphjs). **Mitigation**: Start with sequential execution and add parallelism as an optimization. The channel model stays from day one.
- **[Risk] Eval without LangChain**: Dropping LangChain means reimplementing structured output parsing (`with_structured_output`). **Mitigation**: Target OpenAI-compatible APIs first (they support `response_format: { type: "json_schema" }` natively). Add a generic Zod/json-schema path for other providers.
- **[Trade-off] TypeScript-first**: Python users of OpenEvals patterns won't get a direct migration path. **Mitigation**: The eval prompt templates are language-agnostic strings; the core logic is portable.
- **[Trade-off] Monorepo overhead**: Four packages with shared config. **Mitigation**: Use minimal workspaces (pnpm/turbo) and keep build config shared.

## Open Questions

- Should the sandbox provider interface include `createCheckpoint`/`restoreCheckpoint` for VM-level snapshots, or should checkpointing be graph-layer only?
- What is the minimum Node.js version? Node 20+ is the current candidate, because the Sandbox lifecycle relies on `AsyncDisposable`.
- Should the eval prompt library ship as part of `@agent-runtime/eval` or as a separate `@agent-runtime/prompts` package?
- How should eval results feed back into graph state? E.g., a "code correctness eval" runs inside a graph node, and the score influences routing.
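
To make the last question concrete, here is a minimal sketch of one possible answer: a factory-built evaluator (the D2 pattern) whose result is appended to graph state, with a routing function that reads the latest score. Every name here (`EvalResult`, `ModelClient.complete`, `createLLMAsJudge`, `GraphState`, `recordEval`, `routeOnScore`) and the 0.8 threshold are assumptions made for illustration, not settled APIs of this design.

```typescript
// Hypothetical sketch: all types and function names are illustrative assumptions.

/** Assumed result shape shared by all evaluators. */
interface EvalResult {
  key: string;
  score: number; // clamped to [0, 1]
  comment?: string;
}

/** Assumed minimal model protocol (D2): prompt in, raw text out. */
interface ModelClient {
  complete(prompt: string): Promise<string>;
}

/** An evaluator is a plain async function, not a class. */
type Evaluator = (args: { inputs: string; outputs: string }) => Promise<EvalResult>;

/** Factory that closes over the prompt and model and returns a callable. */
function createLLMAsJudge(opts: { prompt: string; model: ModelClient; key?: string }): Evaluator {
  const key = opts.key ?? "score";
  return async ({ inputs, outputs }) => {
    // Function replacers avoid `$`-pattern expansion in user-supplied text.
    const filled = opts.prompt
      .replace("{inputs}", () => inputs)
      .replace("{outputs}", () => outputs);
    const raw = await opts.model.complete(filled);
    // Assumes the model replies with JSON; JSON.parse throws on malformed output.
    const parsed = JSON.parse(raw) as { score?: unknown; comment?: unknown };
    const score =
      typeof parsed.score === "number" ? Math.min(1, Math.max(0, parsed.score)) : 0;
    const result: EvalResult = { key, score };
    if (typeof parsed.comment === "string") result.comment = parsed.comment;
    return result;
  };
}

/** Assumed graph state: the code under test plus accumulated eval results. */
interface GraphState {
  code: string;
  evals: EvalResult[];
}

/** Returns a new state with the result appended; the input state is not mutated. */
function recordEval(state: GraphState, result: EvalResult): GraphState {
  return { ...state, evals: [...state.evals, result] };
}

/** Conditional-edge function: routes on the most recent eval score. */
function routeOnScore(state: GraphState, threshold = 0.8): "accept" | "revise" {
  const last = state.evals[state.evals.length - 1];
  return last !== undefined && last.score >= threshold ? "accept" : "revise";
}
```

In this sketch, a graph node would await the evaluator and return `recordEval(state, result)`, and `routeOnScore` would serve as the conditional edge. Keeping eval results in ordinary state means they are checkpointed with everything else, at the cost of coupling the state schema to the eval result shape.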