boocode/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/tasks.md
indifferentketchup c935687725 chore(openspec): drop 9 superseded proposals + 11 stub archive files
Drop 9 batch proposals that are superseded by the boocode-lift-analysis
(boocontext-audit, conductor upgrades, self-healing/verify-gate skills):
add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform,
conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul,
agent-reliability.

Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only)
that provide zero documentation value over the existing CHANGELOG.md + git tags.
2026-06-07 22:15:38 +00:00

## 1. Foundation: Core Types & Monorepo Setup ✅
- [x] 1.1 Initialize pnpm monorepo with turbo.json at root, configure `@agent-runtime/*` workspace packages
- [x] 1.2 Set up shared TypeScript config (strict mode, ESNext modules, path aliases)
- [x] 1.3 Implement `@agent-runtime/core` package: `EvaluatorResult`, `ScoreType`, `ModelClient` protocol, `Serializable` interface
- [x] 1.4 Implement `@agent-runtime/core` serialization protocol: `toJSON()`/`fromJSON()` pattern on stateful types
- [x] 1.5 Implement `@agent-runtime/core` error types: `EvalError`
- [x] 1.6 Implement `@agent-runtime/core` utility functions: message normalization, XML formatting, JSON schema construction
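
Task 1.4's `toJSON()`/`fromJSON()` pattern could look roughly like the sketch below. The `RunState` type, its fields, and the generic shape of `Serializable` are hypothetical and chosen only for illustration; the real `@agent-runtime/core` interface may differ.

```typescript
// Hypothetical shape for illustration; the real `Serializable` interface may differ.
interface Serializable<J> {
  toJSON(): J;
}

// Plain-JSON form of the hypothetical stateful type below.
interface RunStateJSON {
  id: string;
  scores: number[];
}

class RunState implements Serializable<RunStateJSON> {
  constructor(
    public readonly id: string,
    private scores: number[] = [],
  ) {}

  addScore(score: number): void {
    this.scores.push(score);
  }

  mean(): number {
    if (this.scores.length === 0) return 0;
    return this.scores.reduce((a, b) => a + b, 0) / this.scores.length;
  }

  // JSON.stringify calls toJSON() automatically when it is present.
  toJSON(): RunStateJSON {
    return { id: this.id, scores: [...this.scores] };
  }

  // Static counterpart that rebuilds an instance from its plain-JSON form.
  static fromJSON(json: RunStateJSON): RunState {
    return new RunState(json.id, [...json.scores]);
  }
}

// Round trip: instance -> JSON string -> plain object -> instance.
const original = new RunState("run-1");
original.addScore(0.5);
original.addScore(1);
const restored = RunState.fromJSON(JSON.parse(JSON.stringify(original)));
```

Keeping `fromJSON()` static means a caller can rebuild state without first constructing an empty instance.
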
## 2. Eval: LLM-as-Judge Core
- [ ] 2.1 Implement `_construct_default_output_json_schema()` for continuous/binary/choices scoring with reasoning
- [ ] 2.2 Implement prompt formatting (string templates, attachments, system messages)
- [ ] 2.3 Implement `_append_few_shot_examples()` with XML `<example>` formatting
- [ ] 2.4 Implement `_create_llm_as_judge_scorer()`: core scorer with structured output via OpenAI JSON schema
- [ ] 2.5 Implement `create_llm_as_judge()` factory wrapping scorer into `_run_evaluator()`
- [ ] 2.6 Implement async variants: `create_async_llm_as_judge()`, `_create_async_llm_as_judge_scorer()`
- [ ] 2.7 Implement `_run_evaluator_untyped()` and `_process_score()` for result aggregation
- [ ] 2.8 Write unit tests for LLM-as-judge: string prompts, continuous scoring, choices, reasoning, few-shot
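
One possible shape for the schema builder in task 2.1 is sketched below. The camelCase name `constructOutputJsonSchema`, the `ScoreMode` union, the description strings, and the assumption that continuous scores fall between 0 and 1 are all illustrative and do not come from the spec.

```typescript
// Hypothetical scoring modes mirroring "continuous/binary/choices" in task 2.1.
type ScoreMode =
  | { kind: "binary" }
  | { kind: "continuous" }
  | { kind: "choices"; choices: number[] };

interface JsonSchema {
  type: "object";
  additionalProperties: false;
  required: string[];
  properties: Record<string, Record<string, unknown>>;
}

// Build the schema fragment for the `score` field alone.
function scoreProperty(mode: ScoreMode): Record<string, unknown> {
  if (mode.kind === "binary") {
    return { type: "boolean", description: "Whether the output meets the criteria." };
  }
  if (mode.kind === "continuous") {
    // Assumption: the range is stated in the description rather than enforced by the schema.
    return { type: "number", description: "A score from 0 to 1 for how well the output meets the criteria." };
  }
  return { type: "number", enum: mode.choices, description: "One of the allowed scores." };
}

function constructOutputJsonSchema(mode: ScoreMode, useReasoning: boolean): JsonSchema {
  const properties: Record<string, Record<string, unknown>> = {};
  const required: string[] = [];
  if (useReasoning) {
    // Reasoning is listed first so the model explains itself before committing to a score.
    properties["reasoning"] = { type: "string", description: "Step-by-step justification for the score." };
    required.push("reasoning");
  }
  properties["score"] = scoreProperty(mode);
  required.push("score");
  return { type: "object", additionalProperties: false, required, properties };
}

const choicesSchema = constructOutputJsonSchema({ kind: "choices", choices: [0, 0.5, 1] }, true);
const binarySchema = constructOutputJsonSchema({ kind: "binary" }, false);
```

Marking every property as required and setting `additionalProperties: false` keeps the schema within the stricter subset that structured-output APIs tend to expect.
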
## 3. Eval: Trajectory Evaluators
- [ ] 3.1 Implement trajectory matching utilities: `_normalize_to_openai_messages_list()`, `_extract_tool_calls()`
- [ ] 3.2 Implement `_is_trajectory_superset()` core comparator with `_get_matcher_for_tool_name()` override system
- [ ] 3.3 Implement strict/unordered/subset/superset matching scorers
- [ ] 3.4 Implement `create_trajectory_match_evaluator()` with all 4 modes and `tool_args_match_overrides`
- [ ] 3.5 Write tests: all 4 match modes, ignoring tool args, custom matchers
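
The superset comparison from task 3.2 might be sketched as follows. The `ToolCall` shape, the `ArgsMatcher` signature, and the greedy first-fit strategy are assumptions made for this example; the actual comparator may work differently.

```typescript
// Hypothetical tool-call shape, as it might look after extraction from a message list.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Per-tool override deciding whether two argument objects count as a match.
type ArgsMatcher = (actual: Record<string, unknown>, expected: Record<string, unknown>) => boolean;

// Default matcher: structural equality over plain JSON-like values.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const aObj = a as Record<string, unknown>;
  const bObj = b as Record<string, unknown>;
  const aKeys = Object.keys(aObj);
  if (aKeys.length !== Object.keys(bObj).length) return false;
  return aKeys.every(
    (k) => Object.prototype.hasOwnProperty.call(bObj, k) && deepEqual(aObj[k], bObj[k]),
  );
}

// True when every expected call is matched by a distinct actual call.
// Greedy first-fit: sufficient for equality-style matchers, but a loose custom
// matcher could in principle need a full bipartite matching instead.
function isTrajectorySuperset(
  actual: ToolCall[],
  expected: ToolCall[],
  overrides: Record<string, ArgsMatcher> = {},
): boolean {
  const used = new Set<number>();
  return expected.every((want) => {
    const matcher: ArgsMatcher = overrides[want.name] ?? deepEqual;
    const idx = actual.findIndex(
      (got, i) => !used.has(i) && got.name === want.name && matcher(got.args, want.args),
    );
    if (idx === -1) return false;
    used.add(idx);
    return true;
  });
}

const reference: ToolCall[] = [{ name: "search", args: { q: "weather" } }];
const run: ToolCall[] = [
  { name: "search", args: { q: "weather" } },
  { name: "lookup", args: { id: 1 } },
];
const differentArgs: ToolCall[] = [{ name: "search", args: { q: "rain" } }];

const runCoversReference = isTrajectorySuperset(run, reference);
const referenceCoversRun = isTrajectorySuperset(reference, run);
const strictArgs = isTrajectorySuperset(differentArgs, reference);
const ignoredArgs = isTrajectorySuperset(differentArgs, reference, { search: () => true });
```

With symmetric matchers, a subset check is the same function with the two trajectories swapped, and an unordered match is the conjunction of both directions.
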
## 4. Eval: Code Correctness Evaluators
- [ ] 4.1 Implement code extraction: `_extract_code_from_markdown_code_blocks()` regex parser
- [ ] 4.2 Implement `_create_base_code_evaluator()` with pluggable extraction pipeline
- [ ] 4.3 Implement `create_code_llm_as_judge()` combining extraction + LLM scoring
- [ ] 4.4 Implement `create_pyright_evaluator()`: write extracted code to a temp file, run Pyright on it, and parse the JSON output
- [ ] 4.5 Write tests: markdown extraction, Pyright static analysis
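
A regex-based extractor along the lines of task 4.1 could start from the sketch below. The `extractCodeBlocks` name, the `CodeBlock` shape, and the exact pattern are illustrative assumptions; the pattern only handles backtick fences that start at the beginning of a line.

```typescript
interface CodeBlock {
  lang: string; // language tag after the opening fence, or "" when absent
  code: string; // block body without the trailing newline
}

function extractCodeBlocks(markdown: string): CodeBlock[] {
  // Opening fence with optional language tag, lazy body, closing fence on its own line.
  const fence = /^```([\w+-]*)[ \t]*\r?\n([\s\S]*?)^```[ \t]*$/gm;
  const blocks: CodeBlock[] = [];
  let match: RegExpExecArray | null;
  while ((match = fence.exec(markdown)) !== null) {
    const lang = match[1] ?? "";
    const body = match[2] ?? "";
    blocks.push({ lang, code: body.replace(/\r?\n$/, "") });
  }
  return blocks;
}

// Sample input assembled line by line so the fences stay inside string literals.
const sample = [
  "Here is a fix:",
  "```python",
  "def add(a, b):",
  "    return a + b",
  "```",
  "And a shell step:",
  "```",
  "echo done",
  "```",
].join("\n");

const blocks = extractCodeBlocks(sample);
```

Returning the language tag alongside the code lets a later pipeline stage route only Python blocks to a Python-specific check such as Pyright.
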
## 5. Eval: Prompt Library
- [ ] 5.1 Export Quality prompt templates: correctness, conciseness, hallucination, answer_relevance, code_correctness, plan_adherence
- [ ] 5.2 Export Safety/Security prompt templates: toxicity, fairness, pii_leakage, prompt_injection
- [ ] 5.3 Export Trajectory prompt templates: trajectory_accuracy (with and without reference), tool_selection
- [ ] 5.4 Export Conversation prompt templates: user_satisfaction, task_completion, agent_tone
## 6. Documentation & Release
- [ ] 6.1 Write README with architecture overview and getting-started example
- [ ] 6.2 Document each package's exports with TSDoc comments
- [ ] 6.3 Write usage examples: basic eval, code correctness check
- [ ] 6.4 Add CI pipeline: lint, type-check, test
- [ ] 6.5 Publish initial alpha for `@agent-runtime/eval` package