## Context

This design defines a unified Agent Evaluation & Execution Runtime combining three subsystems inspired by OpenEvals, Vercel Sandbox, and langgraphjs. The system is a TypeScript monorepo with four packages:

- **`@agent-runtime/core`**: Shared types, serialization protocol, provider abstraction
- **`@agent-runtime/eval`**: LLM-as-judge, trajectory, code correctness, multi-turn sim, prompt library
- **`@agent-runtime/sandbox`**: Remote sandbox lifecycle, command execution, filesystem, snapshots, network policy
- **`@agent-runtime/graph`**: Stateful graph, Pregel execution, checkpoints, interrupts, streaming

Each package is independently usable but designed to compose: evals run code in sandboxes, sandbox lifecycles are orchestrated by graphs, and graph nodes can be evaluated by evals.

## Goals / Non-Goals

**Goals:**

- Zero required runtime dependencies for eval core (optional providers via adapter pattern)
- Sandbox abstraction that works with any provider (Vercel, Fly, custom) via a `SandboxProvider` interface (see D3)
- Graph execution with pluggable checkpointers (in-memory, SQLite, Redis, Postgres)
- All three subsystems share a common serialization protocol for cross-persistence
- Evaluation can target code running inside sandbox instances
- Graph nodes can suspend/resume via interrupts with persistent checkpointing

**Non-Goals:**

- Not a replacement for LangChain/LlamaIndex; no integrations with existing frameworks in v1
- Not a general-purpose workflow engine; focused on agent/task orchestration patterns
- No UI or dashboard in v1; CLI and programmatic API only
- No Python SDK in v1; TypeScript-first, Python planned

## Decisions

### D1: Package Architecture (`core` + 3 domain packages)

- **Rationale**: Eval, Sandbox, and Graph have zero overlap in concerns but share types (serialization, error handling, config). A shared core avoids circular deps and keeps each package lightweight.
- **Alternatives considered**: Monolithic single package, rejected because users may want only one subsystem.

### D2: Eval Factory Pattern (from OpenEvals)

- **Rationale**: OpenEvals' `create_llm_as_judge(prompt, model, ...)` returning a callable is elegant: the evaluator is a function, not a class. Users compose evaluators into test suites. This design preserves that pattern.
- **Deviation**: Drop the LangChain dependency. Use a minimal `ModelClient` protocol, modeled on the one in OpenEvals, instead of `BaseChatModel`. Users pass an OpenAI-compatible client or a custom adapter.

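The factory shape described in D2 can be sketched as follows. This is a minimal illustration, not a settled API: `ModelClient.complete`, `createLlmAsJudge`, `EvalResult`, and `stubClient` are hypothetical names, and the sketch assumes the judge model replies with a JSON object containing `score` and `comment`.

```typescript
// Hypothetical minimal protocol: anything that turns a prompt into text.
interface ModelClient {
  complete(prompt: string): Promise<string>;
}

interface EvalResult {
  key: string;
  score: number;
  comment: string;
}

type Evaluator = (args: { inputs: string; outputs: string }) => Promise<EvalResult>;

// The factory closes over the prompt and client and returns a plain function.
function createLlmAsJudge(opts: {
  prompt: string; // template with {inputs} and {outputs} placeholders
  client: ModelClient;
  key?: string;
}): Evaluator {
  return async ({ inputs, outputs }) => {
    // Function replacers avoid "$&"-style substitution surprises.
    const rendered = opts.prompt
      .replace("{inputs}", () => inputs)
      .replace("{outputs}", () => outputs);
    const raw = await opts.client.complete(rendered);
    // Assumption: the judge was told to reply as {"score": number, "comment": string}.
    const parsed = JSON.parse(raw) as { score: number; comment: string };
    return { key: opts.key ?? "score", score: parsed.score, comment: parsed.comment };
  };
}

// Stub client so the sketch runs without network access.
const stubClient: ModelClient = {
  async complete(prompt) {
    const pass = prompt.includes("Answer: 4");
    return JSON.stringify({ score: pass ? 1 : 0, comment: pass ? "correct" : "incorrect" });
  },
};

const correctness = createLlmAsJudge({
  prompt: "Question: {inputs}\nAnswer: {outputs}\nReply as JSON with score and comment.",
  client: stubClient,
  key: "correctness",
});
```

Because the evaluator is just an async function, a test suite can hold an array of them and await each one against the same `{ inputs, outputs }` pair.
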
### D3: Sandbox as API Wrapper (from Vercel Sandbox)

- **Rationale**: The Vercel Sandbox `Sandbox` class cleanly separates the **Sandbox** (persistent config) from the **Session** (running VM). `Sandbox.create()` → VM, `sandbox.runCommand()` → execute, `sandbox.fs` → filesystem. This maps naturally to any provider built on Firecracker or Kata Containers.
- **Deviation**: Abstract `APIClient` behind a `SandboxProvider` interface so multiple backends can be plugged in. The `"use step"` Vercel compiler directive is replaced with explicit serialization methods.

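One way the `SandboxProvider` split in D3 could look is sketched below. Every method name and type here is an assumption made for illustration (D3 only names `Sandbox.create()`, `runCommand()`, and the `SandboxProvider` interface itself), and `InMemoryProvider` is a fake backend that exists only so the sketch runs locally.

```typescript
// All shapes here are hypothetical; D3 only fixes the idea of a provider
// interface covering lifecycle, command execution, and filesystem access.
interface SandboxConfig {
  image: string;
  timeoutMs: number;
}

interface CommandResult {
  exitCode: number;
  stdout: string;
  stderr: string;
}

interface SandboxProvider {
  create(config: SandboxConfig): Promise<string>; // resolves to a session id
  exec(sessionId: string, cmd: string, args: string[]): Promise<CommandResult>;
  writeFile(sessionId: string, path: string, content: string): Promise<void>;
  destroy(sessionId: string): Promise<void>;
}

// Sandbox holds the persistent config; the session id names the running VM.
class Sandbox {
  readonly config: SandboxConfig;
  private readonly provider: SandboxProvider;
  private readonly sessionId: string;

  private constructor(provider: SandboxProvider, config: SandboxConfig, sessionId: string) {
    this.provider = provider;
    this.config = config;
    this.sessionId = sessionId;
  }

  static async create(provider: SandboxProvider, config: SandboxConfig): Promise<Sandbox> {
    return new Sandbox(provider, config, await provider.create(config));
  }

  runCommand(cmd: string, args: string[] = []): Promise<CommandResult> {
    return this.provider.exec(this.sessionId, cmd, args);
  }

  writeFile(path: string, content: string): Promise<void> {
    return this.provider.writeFile(this.sessionId, path, content);
  }

  stop(): Promise<void> {
    return this.provider.destroy(this.sessionId);
  }
}

// In-memory fake backend: understands only `cat`, enough to exercise the wrapper.
class InMemoryProvider implements SandboxProvider {
  private readonly sessions = new Map<string, Map<string, string>>();
  private nextId = 0;

  async create(): Promise<string> {
    const id = `session-${this.nextId++}`;
    this.sessions.set(id, new Map());
    return id;
  }

  async exec(sessionId: string, cmd: string, args: string[]): Promise<CommandResult> {
    const files = this.filesFor(sessionId);
    if (cmd !== "cat") return { exitCode: 127, stdout: "", stderr: `${cmd}: command not found` };
    const path = args[0] ?? "";
    const content = files.get(path);
    return content === undefined
      ? { exitCode: 1, stdout: "", stderr: `cat: ${path}: No such file or directory` }
      : { exitCode: 0, stdout: content, stderr: "" };
  }

  async writeFile(sessionId: string, path: string, content: string): Promise<void> {
    this.filesFor(sessionId).set(path, content);
  }

  async destroy(sessionId: string): Promise<void> {
    this.sessions.delete(sessionId);
  }

  private filesFor(sessionId: string): Map<string, string> {
    const files = this.sessions.get(sessionId);
    if (!files) throw new Error(`unknown session: ${sessionId}`);
    return files;
  }
}
```

Calling code only ever touches `Sandbox`, so swapping `InMemoryProvider` for a real backend would not change call sites such as `await sandbox.runCommand("cat", ["/tmp/a.txt"])`.
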
### D4: Graph as Pregel + Checkpointer (from langgraphjs)

- **Rationale**: The superstep-based Pregel engine with typed channels is a proven pattern for stateful agent graphs. Separating graph definition (`StateGraph`) from execution (`Pregel.compile()`) is the right abstraction.
- **Deviation**: Drop the `@langchain/core/runnables` dependency. Define `Runnable` as a minimal interface (`invoke` and `stream` only). Use native `Promise` concurrency instead of the LangChain callback system.

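To make the superstep model in D4 concrete, here is a deliberately small, sequential sketch of a plan/execute/update loop over channels. It is not the langgraphjs engine: `runSupersteps`, `ChannelSpec`, and `NodeSpec` are hypothetical names, nodes run synchronously, and channels without a reducer are assumed to keep the last written value.

```typescript
type State = Record<string, unknown>;
type Writes = Record<string, unknown>;

interface ChannelSpec {
  // Merges a new write into the channel's current value.
  reduce(current: unknown, update: unknown): unknown;
}

interface NodeSpec {
  triggers: string[]; // channels whose update schedules this node
  run(state: Readonly<State>): Writes;
}

function runSupersteps(
  nodes: Record<string, NodeSpec>,
  channels: Record<string, ChannelSpec>,
  input: Writes,
  maxSteps = 25,
): State {
  const state: State = { ...input };
  let updated = new Set<string>(Object.keys(input));

  for (let step = 0; updated.size > 0; step++) {
    if (step >= maxSteps) throw new Error(`step limit of ${maxSteps} reached`);

    // Plan: a node is active if any of its trigger channels changed last step.
    const active = Object.values(nodes).filter((n) => n.triggers.some((t) => updated.has(t)));

    // Execute: every active node reads the same frozen snapshot.
    const snapshot = Object.freeze({ ...state });
    const pending: Array<[string, unknown]> = [];
    for (const node of active) pending.push(...Object.entries(node.run(snapshot)));

    // Update: writes become visible only after all nodes in the step finished.
    updated = new Set<string>();
    for (const [name, value] of pending) {
      const channel = channels[name];
      state[name] = channel ? channel.reduce(state[name], value) : value;
      updated.add(name);
    }
  }
  return state;
}

// Channels without a spec behave as last-value; `log` appends instead.
const channels: Record<string, ChannelSpec> = {
  log: {
    reduce: (current, update) => [
      ...((current as string[] | undefined) ?? []),
      ...(update as string[]),
    ],
  },
};

const nodes: Record<string, NodeSpec> = {
  double: {
    triggers: ["input"],
    run: (s) => ({ value: (s.input as number) * 2, log: ["double"] }),
  },
  increment: {
    triggers: ["value"],
    run: (s) => ({ result: (s.value as number) + 1, log: ["increment"] }),
  },
};

const finalState = runSupersteps(nodes, channels, { input: 3 });
```

With `{ input: 3 }`, `double` runs in the first step and `increment` in the second; a third step finds no triggered nodes and the loop ends. Parallelism could later be added inside the execute phase without touching the channel model.
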
### D5: Interrupt/Resume via Checkpoint (from langgraphjs)

- **Rationale**: `interrupt()` throwing a typed error that's caught by the execution loop, persisted to checkpoints, and resumed via `Command({resume: ...})` is the cleanest human-in-the-loop (HITL) pattern.
- **Deviation**: Simplify to a single `GraphInterrupt` error type. No scratchpad, just a sequential interrupt index stored in checkpoint metadata.

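The interrupt/resume flow in D5 can be sketched as below. This is one possible reading, under stated assumptions: the node is re-executed from the top on resume, resume values are stored in order in checkpoint metadata so that position `i` answers the `i`-th `interrupt()` call, and `runNode`, `RunContext`, and the `{ value }` resume argument are hypothetical stand-ins for the execution loop and `Command({resume: ...})`.

```typescript
class GraphInterrupt extends Error {
  readonly value: unknown;
  readonly index: number;
  constructor(value: unknown, index: number) {
    super(`interrupt #${index}`);
    // Keeps `instanceof` working when compiled to ES5 targets.
    Object.setPrototypeOf(this, GraphInterrupt.prototype);
    this.name = "GraphInterrupt";
    this.value = value;
    this.index = index;
  }
}

interface Checkpoint {
  state: Record<string, unknown>;
  // Assumption: answers are stored in order; position i answers interrupt i.
  metadata: { resumeValues: unknown[] };
}

interface RunContext {
  resumeValues: unknown[];
  nextIndex: number; // sequential interrupt index within this run
}

function interrupt(ctx: RunContext, value: unknown): unknown {
  const index = ctx.nextIndex++;
  if (index < ctx.resumeValues.length) return ctx.resumeValues[index]; // already answered
  throw new GraphInterrupt(value, index);
}

type NodeFn = (state: Record<string, unknown>, ctx: RunContext) => Record<string, unknown>;

type RunResult =
  | { status: "done"; state: Record<string, unknown> }
  | { status: "interrupted"; question: unknown; checkpoint: Checkpoint };

function runNode(node: NodeFn, checkpoint: Checkpoint, resume?: { value: unknown }): RunResult {
  const resumeValues = resume
    ? [...checkpoint.metadata.resumeValues, resume.value]
    : checkpoint.metadata.resumeValues;
  const ctx: RunContext = { resumeValues, nextIndex: 0 };
  try {
    const update = node(checkpoint.state, ctx);
    return { status: "done", state: { ...checkpoint.state, ...update } };
  } catch (err) {
    if (!(err instanceof GraphInterrupt)) throw err;
    // Persist what has been answered so far; state is untouched until the node completes.
    return {
      status: "interrupted",
      question: err.value,
      checkpoint: { state: checkpoint.state, metadata: { resumeValues } },
    };
  }
}

const deployNode: NodeFn = (state, ctx) => {
  const approved = interrupt(ctx, `Deploy ${String(state.target)}?`);
  const region = interrupt(ctx, "Which region?");
  return { approved, region };
};
```

Because the node re-runs from its first line on each resume, any side effects placed before an `interrupt()` call would repeat, so they are best kept after the last interrupt or in a separate node.
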
### D6: Serialization Protocol

- **Rationale**: Vercel Sandbox's `WORKFLOW_SERIALIZE`/`WORKFLOW_DESERIALIZE` pattern enables cross-session persistence. We adopt a `toJSON()` method paired with a static `fromJSON()` on all stateful types.
- **Channels** → serialized as plain objects.
- **Checkpoints** → serialized as versioned JSON with hash verification.

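A versioned, hash-verified checkpoint along the lines of D6 might look like the sketch below. The envelope fields (`version`, `hash`, `payload`) and the `Checkpoint` shape are assumptions for illustration, and the FNV-1a function is a non-cryptographic stand-in chosen to keep the example dependency-free; a real implementation would likely use a cryptographic digest.

```typescript
// Hypothetical wire format for a persisted checkpoint.
interface SerializedCheckpoint {
  version: number;
  hash: string;
  payload: string; // JSON text of the checkpoint body
}

// FNV-1a (32-bit) over UTF-16 code units: a stand-in for a real digest.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

class Checkpoint {
  static readonly VERSION = 1;
  readonly step: number;
  readonly channels: Record<string, unknown>;

  constructor(step: number, channels: Record<string, unknown>) {
    this.step = step;
    this.channels = channels;
  }

  // Called automatically by JSON.stringify(checkpoint).
  toJSON(): SerializedCheckpoint {
    const payload = JSON.stringify({ step: this.step, channels: this.channels });
    return { version: Checkpoint.VERSION, hash: fnv1a(payload), payload };
  }

  static fromJSON(data: SerializedCheckpoint): Checkpoint {
    if (data.version !== Checkpoint.VERSION) {
      throw new Error(`unsupported checkpoint version: ${data.version}`);
    }
    if (fnv1a(data.payload) !== data.hash) {
      throw new Error("checkpoint hash mismatch");
    }
    const body = JSON.parse(data.payload) as { step: number; channels: Record<string, unknown> };
    return new Checkpoint(body.step, body.channels);
  }
}
```

Keeping the payload as a string inside the envelope means the hash is computed over exact bytes rather than over a re-serialized object, which avoids key-ordering differences between writers.
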
### D7: Filesystem API over Shell Commands (from Vercel Sandbox)

- **Rationale**: Vercel's `FileSystem` class implements the full `node:fs/promises` API by running shell commands (`stat`, `find`, `mkdir`, etc.) inside the sandbox. This is pragmatic and avoids building a special FS protocol.
- **Limitation**: Stat parsing from shell output is fragile. Mitigate with a structured output format (JSON plus delimiter parsing).

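The mitigation in D7 could be approached roughly as follows: ask `stat` to print a JSON object framed by sentinel strings, then extract and parse only the framed part. The sketch assumes GNU coreutils `stat` (its `-c` format option with `%s`, `%Y`, `%a`, and `%F`); BSD and macOS `stat` use a different syntax. The sentinel strings, `statCommand`, and `parseStatOutput` are hypothetical names.

```typescript
// Sentinels frame the JSON so stray output before or after it is ignored.
const STAT_BEGIN = "__STAT_BEGIN__";
const STAT_END = "__STAT_END__";

// GNU stat format: %s size in bytes, %Y mtime as epoch seconds,
// %a permissions in octal, %F file type as text. No filename is printed,
// so no field can contain quotes or delimiters.
const STAT_FORMAT =
  `${STAT_BEGIN}{"size":%s,"mtimeSec":%Y,"mode":"%a","type":"%F"}${STAT_END}`;

// Arguments are passed as an array, so the path needs no shell quoting.
function statCommand(path: string): { cmd: string; args: string[] } {
  return { cmd: "stat", args: ["-c", STAT_FORMAT, "--", path] };
}

interface StatInfo {
  size: number;
  mtime: Date;
  mode: number;
  isFile: boolean;
  isDirectory: boolean;
  isSymbolicLink: boolean;
}

function parseStatOutput(stdout: string): StatInfo {
  const start = stdout.indexOf(STAT_BEGIN);
  const end = start === -1 ? -1 : stdout.indexOf(STAT_END, start + STAT_BEGIN.length);
  if (start === -1 || end === -1) throw new Error("stat output is missing its delimiters");

  const raw = JSON.parse(stdout.slice(start + STAT_BEGIN.length, end)) as {
    size: number;
    mtimeSec: number;
    mode: string;
    type: string;
  };
  return {
    size: raw.size,
    mtime: new Date(raw.mtimeSec * 1000),
    mode: parseInt(raw.mode, 8),
    isFile: raw.type.startsWith("regular"), // "regular file" or "regular empty file"
    isDirectory: raw.type === "directory",
    isSymbolicLink: raw.type === "symbolic link",
  };
}
```

Leaving the filename out of the format string is the main design choice here: the remaining fields are numeric or drawn from a fixed vocabulary, so the JSON stays well-formed regardless of how the path is spelled.
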
### D8: Network Policy as TypeScript Types (from Vercel Sandbox)

- **Rationale**: The `NetworkPolicy` union type (`"allow-all" | "deny-all" | { allow: ... }`) maps directly to firewall rules. It's declarative, serializable, and provider-agnostic.
- **Extension**: Add `tls` and `rateLimit` options beyond what Vercel provides.

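As an illustration of the union in D8, the sketch below fills in the parts the design leaves open with clearly hypothetical shapes: `allow` is assumed to be a list of hostnames where `*.example.com` matches subdomains, and the fields under `tls` and `rateLimit` are invented for the example. `isHostAllowed` is likewise a hypothetical helper.

```typescript
// Assumption: `allow` lists hostnames; "*.example.com" matches subdomains only.
// The design does not specify the allow-rule shape or the extension fields.
type NetworkPolicy =
  | "allow-all"
  | "deny-all"
  | {
      allow: string[];
      tls?: { minVersion: "1.2" | "1.3" }; // hypothetical shape for `tls`
      rateLimit?: { requestsPerSecond: number }; // hypothetical shape for `rateLimit`
    };

function isHostAllowed(policy: NetworkPolicy, host: string): boolean {
  if (policy === "allow-all") return true;
  if (policy === "deny-all") return false;
  return policy.allow.some((pattern) =>
    pattern.startsWith("*.") ? host.endsWith(pattern.slice(1)) : host === pattern,
  );
}

const policy: NetworkPolicy = {
  allow: ["registry.npmjs.org", "*.example.com"],
  tls: { minVersion: "1.2" },
  rateLimit: { requestsPerSecond: 10 },
};

// The policy is plain data, so it survives a JSON round trip unchanged.
const restoredPolicy = JSON.parse(JSON.stringify(policy)) as NetworkPolicy;
```

Since every variant is either a string literal or a plain object, a provider adapter can translate the policy into its own firewall rules with a single exhaustive check on the union.
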
## Risks / Trade-offs

- **[Risk] Provider coupling for sandbox**: Abstracting `SandboxProvider` might leak provider-specific features. **Mitigation**: Define the interface minimally (CRUD + exec + fs); provider-specific features are accessed via an explicit `(sandbox as any)` escape hatch.
- **[Risk] Pregel complexity**: The superstep execution model is sophisticated (~2700 lines in langgraphjs). **Mitigation**: Start with sequential execution and add parallelism as an optimization. The channel model is in place from day one.
- **[Risk] Eval without LangChain**: Dropping LangChain means reimplementing structured output parsing (`with_structured_output`). **Mitigation**: Target OpenAI-compatible APIs first (they support `response_format: json_schema` natively). Add a generic Zod/JSON Schema path for other providers.
- **[Trade-off] TypeScript-first**: Python users of OpenEvals patterns won't get a direct migration path. **Mitigation**: The eval prompt templates are language-agnostic strings; the core logic is portable.
- **[Trade-off] Monorepo overhead**: Four packages with shared config. **Mitigation**: Use minimal workspaces (pnpm/turbo) and keep build config shared.

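For the "Eval without LangChain" mitigation, a request using `response_format` with a JSON Schema could be assembled as sketched below. The `response_format: { type: "json_schema", json_schema: { name, strict, schema } }` shape follows the OpenAI chat completions structured-output format; `buildJudgeRequest`, `parseVerdict`, the `judge_verdict` schema name, and the verdict fields are hypothetical, and no network call is made.

```typescript
interface JudgeVerdict {
  score: number;
  comment: string;
}

// Strict mode expects every property to be listed in `required`
// and `additionalProperties` to be false.
const verdictSchema = {
  type: "object",
  properties: {
    score: { type: "number" },
    comment: { type: "string" },
  },
  required: ["score", "comment"],
  additionalProperties: false,
} as const;

// Builds the request body only; sending it is left to the caller's client.
function buildJudgeRequest(model: string, prompt: string) {
  return {
    model,
    messages: [{ role: "user" as const, content: prompt }],
    response_format: {
      type: "json_schema" as const,
      json_schema: { name: "judge_verdict", strict: true, schema: verdictSchema },
    },
  };
}

// Validates the returned message content instead of trusting the provider.
function parseVerdict(content: string): JudgeVerdict {
  const data: unknown = JSON.parse(content);
  if (typeof data !== "object" || data === null) {
    throw new Error("verdict is not an object");
  }
  const { score, comment } = data as Record<string, unknown>;
  if (typeof score !== "number" || typeof comment !== "string") {
    throw new Error("verdict has the wrong shape");
  }
  return { score, comment };
}
```

The manual check in `parseVerdict` is where a generic Zod or JSON Schema validator would slot in for providers that do not enforce the schema server-side.
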
## Open Questions

- Should the sandbox provider interface include `createCheckpoint`/`restoreCheckpoint` methods for VM-level snapshots, or should that be graph-layer only?
- What's the minimum Node.js version? Node 20+ is the current candidate, for `AsyncDisposable` support (used in the Sandbox lifecycle).
- Should the eval prompt library ship as part of `@agent-runtime/eval` or as a separate `@agent-runtime/prompts` package?
- How should eval results feed back into graph state? For example, a "code correctness eval" runs inside a graph node, and the score influences routing.