chore(openspec): drop 9 superseded proposals + 11 stub archive files
Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
@@ -0,0 +1,44 @@
## Why

Building reliable LLM agents requires three tightly integrated subsystems that today live in separate ecosystems with incompatible interfaces: **evaluation** (OpenEvals), **sandboxed execution** (Vercel Sandbox), and **graph-based orchestration** (langgraphjs). No unified runtime combines all three, so every team building production agents must stitch these libraries together itself, which leads to fragile eval pipelines, uncontained code execution, and ad-hoc workflow orchestration.

This change proposes **a unified Agent Evaluation & Execution Runtime** that combines patterns from all three into a single, consistent system.

## What Changes

- **New `@agent-runtime/eval` package**: LLM-as-judge evaluator, trajectory evaluators, code correctness (static analysis + sandboxed execution), multi-turn simulation, and 20+ built-in eval prompt templates. Extracted from OpenEvals without the LangChain dependency.
- **New `@agent-runtime/sandbox` package**: Remote sandbox provisioning with Firecracker MicroVM isolation, command execution (sync + detached + streaming), POSIX filesystem API, snapshot/checkpoint lifecycle, and network policy control. Generalized from Vercel Sandbox patterns.
- **New `@agent-runtime/graph` package**: Stateful agent graph with Pregel-style superstep execution, typed channel-based state management, pluggable checkpointer, human-in-the-loop interrupts, and multi-mode streaming. Inspired by langgraphjs.
- **New `@agent-runtime/core` package**: Shared types, serialization protocol, configuration system, and provider abstraction layer used by all three subsystems.
- **Integration wiring**: The eval suite can execute code in the sandbox, the graph orchestrates the sandbox lifecycle, and evals can score graph nodes, so each subsystem feeds the other two.
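
To make "Pregel-style superstep execution" with channel-based state more concrete, here is a minimal sketch of the idea. It is not the proposed `@agent-runtime/graph` API: `Reducer`, `Channels`, `GraphNode`, and `runSupersteps` are hypothetical names, and the sketch assumes every channel holds the same value type to keep the typing simple. In each superstep all active nodes read the same frozen snapshot, their writes are merged through per-channel reducers, and nodes subscribed to an updated channel are scheduled for the next step.

```typescript
// Illustrative only: all names below are hypothetical, not a published API.
type Reducer<V> = (current: V, update: V) => V;
type Channels<V> = { [channel: string]: V };

interface GraphNode<V> {
  name: string;
  // Channels whose updates schedule this node for the next superstep.
  subscribes: string[];
  run: (state: Readonly<Channels<V>>) => Channels<V>;
}

function runSupersteps<V>(
  initial: Channels<V>,
  reducers: { [channel: string]: Reducer<V> | undefined },
  nodes: GraphNode<V>[],
  entry: string[],
  maxSteps = 25,
): { state: Channels<V>; steps: number } {
  const state: Channels<V> = { ...initial };
  let active = nodes.filter((n) => entry.includes(n.name));
  let steps = 0;

  while (active.length > 0 && steps < maxSteps) {
    // 1. Execute: every active node sees the same (shallowly) frozen snapshot.
    const snapshot: Readonly<Channels<V>> = Object.freeze({ ...state });
    const writes = active.map((n) => n.run(snapshot));

    // 2. Update: merge writes through the channel's reducer, or overwrite
    //    when the channel has no reducer or no previous value.
    const updated = new Set<string>();
    for (const write of writes) {
      for (const [channel, update] of Object.entries(write)) {
        const reduce = reducers[channel];
        state[channel] =
          reduce && channel in state ? reduce(state[channel], update) : update;
        updated.add(channel);
      }
    }

    // 3. Schedule: nodes subscribed to any updated channel run next.
    active = nodes.filter((n) => n.subscribes.some((c) => updated.has(c)));
    steps += 1;
  }
  return { state, steps };
}

// Usage: one planner fans out to two workers that run in the same superstep.
const appendAll: Reducer<string[]> = (current, update) => [...current, ...update];

const demoNodes: GraphNode<string[]>[] = [
  { name: "plan", subscribes: [], run: () => ({ tasks: ["lint", "test"] }) },
  {
    name: "workerA",
    subscribes: ["tasks"],
    run: (s) => ({ results: [`A saw ${s["tasks"].length} tasks`] }),
  },
  {
    name: "workerB",
    subscribes: ["tasks"],
    run: (s) => ({ results: [`B saw ${s["tasks"].length} tasks`] }),
  },
];

const demoRun = runSupersteps<string[]>(
  { tasks: [], results: [] },
  { tasks: appendAll, results: appendAll },
  demoNodes,
  ["plan"],
);
console.log(demoRun.steps, demoRun.state["results"]);
```

A real engine would also persist a checkpoint after each update phase and run the active nodes concurrently; both are left out here to keep the control flow visible.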

## Capabilities

### New Capabilities

- `llm-as-judge`: LLM-as-judge evaluator with structured output, continuous/binary/choices scoring, reasoning, few-shot examples, and multimodal attachments
- `trajectory-eval`: Agent trajectory evaluation with strict/unordered/subset/superset matching and LLM-as-judge trajectory grading
- `code-correctness-eval`: Code correctness evaluation via LLM judging, static analysis (Pyright/MyPy), and sandboxed execution
- `multi-turn-simulation`: Multi-turn conversation simulation between an app and simulated users with trajectory evaluation
- `eval-prompt-library`: Library of 20+ prompt templates across quality, RAG, safety, security, conversation, image, and voice domains
- `sandbox-lifecycle`: Sandbox creation, retrieval, forking, and deletion with Firecracker MicroVM isolation
- `sandbox-command-execution`: Command execution with blocking, detached, and streaming modes, timeout enforcement, and signal-based process control
- `sandbox-filesystem`: POSIX filesystem operations (read/write/stat/ls/mkdir/cp/mv/rm/chmod/chown/symlink) via remote sandbox
- `sandbox-snapshots`: Point-in-time sandbox snapshots with expiration, retention policies, and ancestry trees
- `sandbox-network-policy`: Network access control with domain allow/deny, request transformers, and subnet rules
- `state-graph`: Stateful graph with Annotation.Root state definition, typed reducers, and StateGraph builder
- `pregel-execution`: Superstep-based execution engine with channel communication, parallel task execution, and checkpointing
- `human-in-the-loop`: Node-level interrupts with resume values, multi-interrupt support, and Command-based graph resumption
- `graph-streaming`: Multi-mode streaming protocol (events, messages, metadata) with envelope parsing
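
The four matching modes named under `trajectory-eval` can be sketched as follows. This is one plausible reading, not the proposal's specification: it reduces a trajectory to an ordered list of tool names and compares them as multisets, whereas a real evaluator would likely also compare tool arguments and message roles. `MatchMode`, `countCalls`, `covers`, and `trajectoryMatch` are hypothetical names.

```typescript
// Illustrative only: names and exact semantics are assumptions.
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// Count how many times each tool name appears in a trajectory.
function countCalls(calls: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const call of calls) {
    counts.set(call, (counts.get(call) ?? 0) + 1);
  }
  return counts;
}

// True when `outer` contains every call in `inner` at least as many times.
function covers(outer: Map<string, number>, inner: Map<string, number>): boolean {
  for (const [call, needed] of inner) {
    if ((outer.get(call) ?? 0) < needed) return false;
  }
  return true;
}

function trajectoryMatch(
  actual: string[],
  reference: string[],
  mode: MatchMode,
): boolean {
  const a = countCalls(actual);
  const r = countCalls(reference);
  switch (mode) {
    case "strict": // same calls in the same order
      return (
        actual.length === reference.length &&
        actual.every((call, i) => call === reference[i])
      );
    case "unordered": // same calls, any order
      return covers(a, r) && covers(r, a);
    case "subset": // actual uses only calls found in the reference
      return covers(r, a);
    case "superset": // actual includes every reference call, extras allowed
      return covers(a, r);
  }
}

const referenceCalls = ["search", "read_file", "answer"];
console.log(trajectoryMatch(["read_file", "search", "answer"], referenceCalls, "unordered"));
```

Under this reading, `subset` tolerates an agent that skips steps and `superset` tolerates one that takes extra steps, which is usually the distinction a test author cares about when choosing a mode.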

### Modified Capabilities

*None. This is a greenfield system.*

## Impact

- **New packages**: `@agent-runtime/core`, `@agent-runtime/eval`, `@agent-runtime/sandbox`, `@agent-runtime/graph`
- **Languages**: TypeScript (all packages); Python support is planned for the eval package
- **Dependencies**: Zero required runtime deps for core eval logic; optional checkpointer backends (SQLite, Redis, Postgres) for graph; sandbox requires an HTTP client
- **Target platforms**: Node.js 20+; eval-only usage is edge-compatible
- **No existing code is modified**: this change is purely additive