Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
Context
This design defines a unified Agent Evaluation & Execution Runtime combining three subsystems inspired by OpenEvals, Vercel Sandbox, and langgraphjs. The system is a TypeScript monorepo with four packages:
- `@agent-runtime/core`: Shared types, serialization protocol, provider abstraction
- `@agent-runtime/eval`: LLM-as-judge, trajectory, code correctness, multi-turn sim, prompt library
- `@agent-runtime/sandbox`: Remote sandbox lifecycle, command execution, filesystem, snapshots, network policy
- `@agent-runtime/graph`: Stateful graph, Pregel execution, checkpoints, interrupts, streaming
Each package is independently usable but designed to compose: evals run code in sandboxes, sandbox lifecycles are orchestrated by graphs, and graph nodes can be evaluated by evals.
Goals / Non-Goals
Goals:
- Zero required runtime dependencies for eval core (optional providers via adapter pattern)
- Sandbox abstraction that works with any provider (Vercel, Fly, custom) via APIClient interface
- Graph execution with pluggable checkpointers (in-memory, SQLite, Redis, Postgres)
- All three subsystems share a common serialization protocol for cross-persistence
- Evaluation can target code running inside sandbox instances
- Graph nodes can suspend/resume via interrupts with persistent checkpointing
Non-Goals:
- Not a replacement for LangChain/LlamaIndex — no integrations with existing frameworks in v1
- Not a general-purpose workflow engine — focused on agent/task orchestration patterns
- No UI or dashboard in v1 — CLI and programmatic API only
- No Python SDK in v1 — TypeScript-first, Python planned
Decisions
D1: Package Architecture — core + 3 domain packages
- Rationale: Eval, Sandbox, and Graph have zero overlap in concerns but share types (serialization, error handling, config). A shared core avoids circular deps and keeps each package lightweight.
- Alternatives considered: Monolithic single package — rejected because users may want only one subsystem.
D2: Eval Factory Pattern (from OpenEvals)
- Rationale: OpenEvals' `create_llm_as_judge(prompt, model, ...)` returning a callable is elegant: the evaluator is a function, not a class. Users compose evaluators into test suites. This pattern is preserved exactly.
- Deviation: Drop the LangChain dependency. Use a minimal `ModelClient` protocol (like OpenEvals' `ModelClient` protocol) instead of `BaseChatModel`. Users pass an OpenAI-compatible client or a custom adapter.
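The factory pattern in D2 can be sketched as follows. This is a minimal illustration, not the package's API: the shape of `ModelClient` (a single `complete` method), the `EvalInput`/`EvalResult` types, the `createLlmAsJudge` name, and the assumption that the judge replies with JSON are all hypothetical.

```typescript
// Hypothetical minimal protocol: the design names ModelClient but does not fix its shape.
interface ModelClient {
  complete(prompt: string): Promise<string>;
}

interface EvalInput {
  inputs: string;
  outputs: string;
}

interface EvalResult {
  key: string;
  score: boolean;
  comment: string;
}

type Evaluator = (input: EvalInput) => Promise<EvalResult>;

// Factory: returns a plain async function rather than a class instance.
function createLlmAsJudge(options: {
  prompt: string;
  model: ModelClient;
  feedbackKey?: string;
}): Evaluator {
  const { prompt, model, feedbackKey = "score" } = options;
  return async ({ inputs, outputs }) => {
    // Fill the {inputs}/{outputs} placeholders; replacer functions avoid "$" patterns.
    const filled = prompt
      .replace("{inputs}", () => inputs)
      .replace("{outputs}", () => outputs);
    const raw = await model.complete(filled);
    // Assumption: the judge answers with JSON of the form {"score": boolean, "comment": string}.
    const parsed = JSON.parse(raw) as { score: boolean; comment?: string };
    return {
      key: feedbackKey,
      score: parsed.score === true,
      comment: parsed.comment ?? "",
    };
  };
}

// Stub client so the sketch runs without a network call.
const stubModel: ModelClient = {
  complete: async (p) =>
    JSON.stringify({ score: p.includes("Answer: 4"), comment: "stub verdict" }),
};

const correctnessJudge = createLlmAsJudge({
  prompt: "Question: {inputs}\nAnswer: {outputs}\nIs the answer correct?",
  model: stubModel,
  feedbackKey: "correctness",
});
```

Because the evaluator is just a function, a test suite can hold an array of evaluators and call each one on the same input. An OpenAI-compatible client would be wrapped in an adapter that implements `complete`.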
D3: Sandbox as API Wrapper (from Vercel Sandbox)
- Rationale: The Vercel Sandbox `Sandbox` class cleanly separates the Sandbox (persistent config) from the Session (running VM). `Sandbox.create()` → VM, `sandbox.runCommand()` → execute, `sandbox.fs` → filesystem. This maps naturally to any provider with Firecracker/kata-containers.
- Deviation: Abstract `APIClient` behind a `SandboxProvider` interface so multiple backends can be plugged in. The `"use step"` Vercel compiler directive is replaced with explicit serialization methods.
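One way the provider abstraction in D3 could look is sketched below. The method names on `SandboxProvider`, the `CommandResult` shape, and the toy `InMemoryProvider` are assumptions made for illustration; they are not the Vercel Sandbox API.

```typescript
interface CommandResult {
  exitCode: number;
  stdout: string;
}

// Hypothetical provider contract (create/destroy + exec + fs).
interface SandboxProvider {
  create(): Promise<string>; // resolves to a session id
  exec(id: string, cmd: string, args: string[]): Promise<CommandResult>;
  writeFile(id: string, path: string, content: string): Promise<void>;
  readFile(id: string, path: string): Promise<string>;
  destroy(id: string): Promise<void>;
}

// Toy backend: keeps files in memory and understands only `cat`.
class InMemoryProvider implements SandboxProvider {
  private sessions = new Map<string, Map<string, string>>();
  private nextId = 1;

  private files(id: string): Map<string, string> {
    const files = this.sessions.get(id);
    if (!files) throw new Error(`No such sandbox: ${id}`);
    return files;
  }

  async create(): Promise<string> {
    const id = `sbx-${this.nextId++}`;
    this.sessions.set(id, new Map());
    return id;
  }

  async exec(id: string, cmd: string, args: string[]): Promise<CommandResult> {
    const files = this.files(id);
    const [path] = args;
    if (cmd === "cat" && path !== undefined) {
      const content = files.get(path);
      return content === undefined
        ? { exitCode: 1, stdout: "" }
        : { exitCode: 0, stdout: content };
    }
    return { exitCode: 127, stdout: "" }; // mimic "command not found"
  }

  async writeFile(id: string, path: string, content: string): Promise<void> {
    this.files(id).set(path, content);
  }

  async readFile(id: string, path: string): Promise<string> {
    const content = this.files(id).get(path);
    if (content === undefined) throw new Error(`ENOENT: ${path}`);
    return content;
  }

  async destroy(id: string): Promise<void> {
    this.sessions.delete(id);
  }
}

// Thin wrapper: user code talks to Sandbox, never to a concrete backend.
class Sandbox {
  private readonly provider: SandboxProvider;
  readonly id: string;

  private constructor(provider: SandboxProvider, id: string) {
    this.provider = provider;
    this.id = id;
  }

  static async create(provider: SandboxProvider): Promise<Sandbox> {
    return new Sandbox(provider, await provider.create());
  }

  runCommand(cmd: string, args: string[] = []): Promise<CommandResult> {
    return this.provider.exec(this.id, cmd, args);
  }

  writeFile(path: string, content: string): Promise<void> {
    return this.provider.writeFile(this.id, path, content);
  }

  stop(): Promise<void> {
    return this.provider.destroy(this.id);
  }
}
```

Swapping backends then means passing a different `SandboxProvider` to `Sandbox.create()`; the calling code is unchanged.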
D4: Graph as Pregel + Checkpointer (from langgraphjs)
- Rationale: The superstep-based Pregel engine with typed channels is a proven pattern for stateful agent graphs. Separating graph definition (`StateGraph`) from execution (the `Pregel` engine produced by `compile()`) is the right abstraction.
- Deviation: Drop the `@langchain/core/runnables` dependency. Define `Runnable` as a minimal interface (`invoke`, `stream` only). Use native `Promise` concurrency instead of the LangChain callback system.
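The split between definition and execution in D4 can be illustrated with a deliberately small sketch. It assumes a last-write-wins state merge in place of typed channels, static edges only, and a hypothetical `maxSteps` guard. It is not the langgraphjs implementation.

```typescript
// Minimal Runnable: invoke and stream only, as proposed in D4.
interface Runnable<I, O> {
  invoke(input: I): Promise<O>;
  stream(input: I): AsyncIterable<O>;
}

type NodeFn<S> = (state: S) => Promise<Partial<S>> | Partial<S>;
const END = "__end__";

class StateGraph<S extends object> {
  private nodes = new Map<string, NodeFn<S>>();
  private edges = new Map<string, string[]>();
  private entry = "";

  addNode(name: string, fn: NodeFn<S>): this {
    this.nodes.set(name, fn);
    return this;
  }

  addEdge(from: string, to: string): this {
    const targets = this.edges.get(from) ?? [];
    targets.push(to);
    this.edges.set(from, targets);
    return this;
  }

  setEntryPoint(name: string): this {
    this.entry = name;
    return this;
  }

  // Definition ends here; the returned Runnable owns execution.
  compile(maxSteps = 25): Runnable<S, S> {
    const nodes = this.nodes;
    const edges = this.edges;
    const entry = this.entry;

    async function* run(input: S): AsyncIterable<S> {
      let state: S = input;
      let active: string[] = [entry];
      for (let step = 0; active.length > 0; step++) {
        if (step >= maxSteps) throw new Error("Step limit reached");
        const snapshot = state;
        // One superstep: all active nodes read the same snapshot, concurrently.
        const updates = await Promise.all(
          active.map(async (name) => {
            const fn = nodes.get(name);
            if (!fn) throw new Error(`Unknown node: ${name}`);
            return fn(snapshot);
          }),
        );
        // Writes become visible only after the step (last write wins).
        let merged: S = { ...snapshot };
        for (const update of updates) merged = { ...merged, ...update };
        state = merged;
        const next = new Set<string>();
        for (const name of active) {
          for (const target of edges.get(name) ?? []) {
            if (target !== END) next.add(target);
          }
        }
        active = Array.from(next);
        yield state;
      }
    }

    return {
      stream: (input) => run(input),
      invoke: async (input) => {
        let last = input;
        for await (const s of run(input)) last = s;
        return last;
      },
    };
  }
}

interface CounterState {
  count: number;
  trace: string[];
}

const counterApp = new StateGraph<CounterState>()
  .addNode("increment", (s) => ({ count: s.count + 1, trace: [...s.trace, "increment"] }))
  .addNode("double", async (s) => ({ count: s.count * 2, trace: [...s.trace, "double"] }))
  .setEntryPoint("increment")
  .addEdge("increment", "double")
  .addEdge("double", END)
  .compile();
```

`stream` yields the state after each superstep, while `invoke` drains the same generator and returns the final state, so both entry points share one execution path.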
D5: Interrupt/Resume via Checkpoint (from langgraphjs)
- Rationale: `interrupt()` throwing a typed error that is caught by the execution loop, persisted to checkpoints, and resumed via `Command({resume: ...})` is the cleanest human-in-the-loop (HITL) pattern.
- Deviation: Simplify to a single `GraphInterrupt` error type. No scratchpad, just a sequential interrupt index stored in checkpoint metadata.
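The interrupt/resume cycle from D5 might work roughly as sketched below. The design does not say where resume values live, so this sketch assumes an ordered `resumeValues` list in the checkpoint metadata, with the interrupt index being the position in that list. `runNode`, `resumeWith`, and `CheckpointMeta` are hypothetical names.

```typescript
// Single typed error used to unwind out of a node.
class GraphInterrupt extends Error {
  readonly index: number;
  readonly value: unknown;

  constructor(index: number, value: unknown) {
    super(`Graph interrupted at index ${index}`);
    Object.setPrototypeOf(this, GraphInterrupt.prototype); // keeps instanceof reliable
    this.name = "GraphInterrupt";
    this.index = index;
    this.value = value;
  }
}

// Assumption: resume values are stored in order; index i answers the i-th interrupt.
interface CheckpointMeta {
  resumeValues: unknown[];
}

type Interrupt = (value: unknown) => unknown;
type NodeBody<T> = (interrupt: Interrupt) => T;

type RunResult<T> =
  | { status: "done"; output: T }
  | { status: "interrupted"; index: number; value: unknown };

// Re-runs the node from the top; already-answered interrupts return immediately.
function runNode<T>(body: NodeBody<T>, meta: CheckpointMeta): RunResult<T> {
  let cursor = 0;
  const interrupt: Interrupt = (value) => {
    const index = cursor++;
    if (index < meta.resumeValues.length) return meta.resumeValues[index];
    throw new GraphInterrupt(index, value);
  };
  try {
    return { status: "done", output: body(interrupt) };
  } catch (err) {
    if (err instanceof GraphInterrupt) {
      return { status: "interrupted", index: err.index, value: err.value };
    }
    throw err;
  }
}

// Produces the metadata to persist before the next attempt.
function resumeWith(meta: CheckpointMeta, value: unknown): CheckpointMeta {
  return { resumeValues: [...meta.resumeValues, value] };
}

// Example node with two human-in-the-loop questions.
const approvalNode: NodeBody<string> = (interrupt) => {
  const who = interrupt("Who is deploying?") as string;
  const approved = interrupt("Approve deploy?") as boolean;
  return approved ? `deploy approved by ${who}` : "deploy cancelled";
};
```

Because the node is replayed from its first line on every resume, code before an `interrupt()` call should be idempotent or cheap. That is the cost of dropping the scratchpad.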
D6: Serialization Protocol
- Rationale: Vercel Sandbox's `WORKFLOW_SERIALIZE`/`WORKFLOW_DESERIALIZE` pattern enables cross-session persistence. We adopt `toJSON()`/`fromJSON()` static methods on all stateful types.
- Channels → serialized as plain objects.
- Checkpoints → serialized as versioned JSON with hash verification.
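A versioned checkpoint envelope with hash verification could look like the sketch below. The envelope fields (`version`, `hash`, `payload`) are assumptions, and FNV-1a over UTF-16 code units is only a dependency-free stand-in; a real implementation would more likely use a cryptographic digest. In this sketch `toJSON()` is an instance method so that `JSON.stringify` picks it up, and `fromJSON()` is static.

```typescript
// Stand-in hash (32-bit FNV-1a over UTF-16 code units); not cryptographically secure.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

const FORMAT_VERSION = 1;

// Hypothetical on-disk envelope. The payload is kept as a string so the hash
// covers the exact serialized text and is unaffected by key ordering.
interface SerializedCheckpoint {
  version: number;
  hash: string;
  payload: string;
}

class Checkpoint {
  readonly step: number;
  readonly channels: Record<string, unknown>;

  constructor(step: number, channels: Record<string, unknown>) {
    this.step = step;
    this.channels = channels;
  }

  toJSON(): SerializedCheckpoint {
    const payload = JSON.stringify({ step: this.step, channels: this.channels });
    return { version: FORMAT_VERSION, hash: fnv1a(payload), payload };
  }

  static fromJSON(data: SerializedCheckpoint): Checkpoint {
    if (data.version !== FORMAT_VERSION) {
      throw new Error(`Unsupported checkpoint version: ${data.version}`);
    }
    if (fnv1a(data.payload) !== data.hash) {
      throw new Error("Checkpoint hash mismatch");
    }
    const body = JSON.parse(data.payload) as {
      step: number;
      channels: Record<string, unknown>;
    };
    return new Checkpoint(body.step, body.channels);
  }
}

// Round trip through a JSON string, as a checkpointer backend would store it.
const savedCheckpoint = JSON.stringify(new Checkpoint(3, { messages: ["hi"] }));
const restoredCheckpoint = Checkpoint.fromJSON(
  JSON.parse(savedCheckpoint) as SerializedCheckpoint,
);
```

Checking the version before the hash lets a future format change the hashing scheme without older readers reporting a misleading corruption error.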
D7: Filesystem API over Shell Commands (from Vercel Sandbox)
- Rationale: Vercel's `FileSystem` class implements the full `node:fs/promises` API by running shell commands (`stat`, `find`, `mkdir`, etc.) inside the sandbox. This is pragmatic and avoids building a special FS protocol.
- Limitation: Stat parsing from shell output is fragile. Mitigate with structured output format (JSON + delimiter parsing).
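The structured-output mitigation in D7 can be sketched as follows. It assumes GNU coreutils `stat` inside the sandbox (its `--format` directives `%s`, `%Y`, `%a`, `%F`), and the marker strings, `StatInfo` shape, and function names are hypothetical. The format string makes `stat` emit a small JSON object between two markers, so the parser never depends on column positions.

```typescript
// Hypothetical markers that fence the JSON off from any other output.
const STAT_BEGIN = "<<STAT>>";
const STAT_END = "<<END>>";

interface StatInfo {
  size: number;
  mtimeMs: number;
  mode: number;
  isFile: boolean;
  isDirectory: boolean;
}

// Builds an argv-style command; passing args as an array avoids shell quoting.
// Assumes GNU stat: %s = size in bytes, %Y = mtime in epoch seconds,
// %a = permission bits in octal, %F = file type as text.
function buildStatCommand(path: string): { cmd: string; args: string[] } {
  const format = `${STAT_BEGIN}{"size":%s,"mtime":%Y,"mode":"%a","type":"%F"}${STAT_END}`;
  return { cmd: "stat", args: ["--format", format, "--", path] };
}

// Extracts the fenced JSON and converts it to a typed record.
function parseStat(stdout: string): StatInfo {
  const start = stdout.indexOf(STAT_BEGIN);
  const end = stdout.indexOf(STAT_END, start + STAT_BEGIN.length);
  if (start === -1 || end === -1) {
    throw new Error("stat markers not found in output");
  }
  const raw = JSON.parse(stdout.slice(start + STAT_BEGIN.length, end)) as {
    size: number;
    mtime: number;
    mode: string;
    type: string;
  };
  return {
    size: raw.size,
    mtimeMs: raw.mtime * 1000,
    mode: parseInt(raw.mode, 8),
    isFile: raw.type.startsWith("regular"),
    isDirectory: raw.type === "directory",
  };
}

// Example output, including unrelated noise before the fenced JSON.
const sampleStdout =
  'warning: something unrelated\n<<STAT>>{"size":42,"mtime":1700000000,"mode":"644","type":"regular file"}<<END>>\n';
const sampleStat = parseStat(sampleStdout);
```

The file name is deliberately left out of the format string, since arbitrary names could contain quotes or the markers themselves. The caller already knows the path it asked about.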
D8: Network Policy as TypeScript Types (from Vercel Sandbox)
- Rationale: The `NetworkPolicy` union type (`"allow-all" | "deny-all" | { allow: ... }`) maps directly to firewall rules. It's declarative, serializable, and provider-agnostic.
- Extension: Add `tls` and `rateLimit` options beyond what Vercel provides.
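To illustrate how the policy type in D8 maps to rules, here is a sketch under stated assumptions. The design elides the contents of `{ allow: ... }`, so the `allow: string[]` host list, the shapes of `tls` and `rateLimit`, and the `FirewallRule` output format are all hypothetical.

```typescript
// Hypothetical shapes for the object branch; the design leaves them unspecified.
interface AllowListPolicy {
  allow: string[]; // assumed: host names or patterns
  tls?: { minVersion: "1.2" | "1.3" };
  rateLimit?: { requestsPerSecond: number };
}

type NetworkPolicy = "allow-all" | "deny-all" | AllowListPolicy;

// Provider-neutral rule; a backend would translate this to its own firewall syntax.
interface FirewallRule {
  action: "allow" | "deny";
  host: string;
}

function toFirewallRules(policy: NetworkPolicy): FirewallRule[] {
  if (policy === "allow-all") return [{ action: "allow", host: "*" }];
  if (policy === "deny-all") return [{ action: "deny", host: "*" }];
  // Allow-list: explicit allows first, then a default deny.
  const rules = policy.allow.map(
    (host): FirewallRule => ({ action: "allow", host }),
  );
  rules.push({ action: "deny", host: "*" });
  return rules;
}

const registryOnly: NetworkPolicy = {
  allow: ["registry.npmjs.org", "api.example.com"],
  tls: { minVersion: "1.2" },
  rateLimit: { requestsPerSecond: 10 },
};

// The policy is plain data, so it survives a JSON round trip unchanged.
const roundTripped = JSON.parse(JSON.stringify(registryOnly)) as NetworkPolicy;
const registryRules = toFirewallRules(roundTripped);
```

Keeping the string literals and the object branch in one union lets the compiler narrow each case, so adding a new policy form makes every translation function that has not handled it fail to type-check.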
Risks / Trade-offs
- [Risk] Provider coupling for sandbox: Abstracting `SandboxProvider` might leak provider-specific features. Mitigation: Define the interface minimally (CRUD + exec + fs); provider-specific features are accessed via the `(sandbox as any)` escape hatch.
- [Risk] Pregel complexity: The superstep execution model is sophisticated (~2700 lines in langgraphjs). Mitigation: Start with sequential execution, add parallelism as optimization. The channel model stays from day one.
- [Risk] Eval without LangChain: Dropping LangChain means reimplementing structured output parsing (`with_structured_output`). Mitigation: Target OpenAI-compatible APIs first (they support `response_format: json_schema` natively). Add a generic Zod/json-schema path for other providers.
- [Trade-off] TypeScript-first: Python users of OpenEvals patterns won't get a direct migration path. Mitigation: The eval prompt templates are language-agnostic strings; the core logic is portable.
- [Trade-off] Monorepo overhead: Four packages with shared config. Mitigation: Use minimal workspaces (pnpm/turbo), keep build config shared.
Open Questions
- Should the sandbox provider interface include `createCheckpoint`/`restoreCheckpoint` methods for VM-level snapshots, or should that be graph-layer only?
- What's the minimum Node.js version? Node 20+ for `AsyncDisposable` support (used in Sandbox lifecycle).
- Should the eval prompt library ship as part of `@agent-runtime/eval` or as a separate `@agent-runtime/prompts` package?
- How should eval results feed back into graph state? E.g., a "code correctness eval" runs inside a graph node, and the score influences routing.