boocode/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/tasks.md

## 1. Foundation: Core Types & Monorepo Setup ✅

- [x] 1.1 Initialize pnpm monorepo with turbo.json at root, configure `@agent-runtime/*` workspace packages
- [x] 1.2 Set up shared TypeScript config (strict mode, ESNext modules, path aliases)
- [x] 1.3 Implement `@agent-runtime/core` package: `EvaluatorResult`, `ScoreType`, `ModelClient` protocol, `Serializable` interface
- [x] 1.4 Implement `@agent-runtime/core` serialization protocol: `toJSON()`/`fromJSON()` pattern on stateful types
- [x] 1.5 Implement `@agent-runtime/core` error types: `EvalError`
- [x] 1.6 Implement `@agent-runtime/core` utility functions: message normalization, XML formatting, JSON schema construction
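
The `toJSON()`/`fromJSON()` pattern from 1.4 could look roughly like the sketch below. The field names on `EvaluatorResult` (`key`, `score`, `comment`) and the `EvaluatorResultJSON` type are assumptions made for illustration, not the confirmed shape of the `@agent-runtime/core` types.

```typescript
// Hypothetical sketch: the field names are assumed, not taken from the package.
interface Serializable<T> {
  toJSON(): T;
}

type EvaluatorResultJSON = {
  key: string;
  score: number | boolean;
  comment?: string;
};

class EvaluatorResult implements Serializable<EvaluatorResultJSON> {
  constructor(
    readonly key: string,
    readonly score: number | boolean,
    readonly comment?: string,
  ) {}

  // JSON.stringify calls toJSON() automatically, so instances serialize cleanly.
  toJSON(): EvaluatorResultJSON {
    const base: EvaluatorResultJSON = { key: this.key, score: this.score };
    if (this.comment !== undefined) base.comment = this.comment;
    return base;
  }

  // Static counterpart that rebuilds an instance from plain parsed JSON.
  static fromJSON(json: EvaluatorResultJSON): EvaluatorResult {
    return new EvaluatorResult(json.key, json.score, json.comment);
  }
}
```

Keeping `fromJSON()` static means a stored result can be restored without constructing a placeholder instance first.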

## 2. Eval: LLM-as-Judge Core

- [ ] 2.1 Implement `_construct_default_output_json_schema()` for continuous/binary/choices scoring with reasoning
- [ ] 2.2 Implement prompt formatting (string templates, attachments, system messages)
- [ ] 2.3 Implement `_append_few_shot_examples()` with XML `<example>` formatting
- [ ] 2.4 Implement `_create_llm_as_judge_scorer()` — core scorer with structured output via OpenAI JSON schema
- [ ] 2.5 Implement `create_llm_as_judge()` factory wrapping scorer into `_run_evaluator()`
- [ ] 2.6 Implement async variants: `create_async_llm_as_judge()`, `_create_async_llm_as_judge_scorer()`
- [ ] 2.7 Implement `_run_evaluator_untyped()` and `_process_score()` for result aggregation
- [ ] 2.8 Write unit tests for LLM-as-judge: string prompts, continuous scoring, choices, reasoning, few-shot
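
As a rough illustration of task 2.1, the default output schema could be assembled as below. This is a sketch under assumptions: the camelCase function name, the `SchemaOptions` shape, and the description strings are hypothetical, and the real implementation may differ.

```typescript
// Hypothetical sketch of 2.1; option names and descriptions are assumptions.
type JsonSchema = {
  type: "object";
  additionalProperties: false;
  properties: Record<string, Record<string, unknown>>;
  required: string[];
};

interface SchemaOptions {
  continuous?: boolean; // score is a number between 0 and 1
  choices?: number[]; // score must be one of these values
  useReasoning?: boolean; // ask the judge to explain before scoring
}

function constructDefaultOutputJsonSchema(opts: SchemaOptions = {}): JsonSchema {
  const { continuous = false, choices, useReasoning = true } = opts;
  if (continuous && choices !== undefined) {
    throw new Error("Specify either `continuous` or `choices`, not both.");
  }

  let score: Record<string, unknown>;
  if (choices !== undefined) {
    score = { type: "number", enum: choices, description: "One of the allowed score values." };
  } else if (continuous) {
    score = { type: "number", description: "A score between 0 and 1." };
  } else {
    score = { type: "boolean", description: "True if the criteria are met, false otherwise." };
  }

  const properties: Record<string, Record<string, unknown>> = {};
  const required: string[] = [];
  // Reasoning is listed first so the model writes its explanation before the score.
  if (useReasoning) {
    properties["reasoning"] = { type: "string", description: "Step-by-step justification for the score." };
    required.push("reasoning");
  }
  properties["score"] = score;
  required.push("score");

  return { type: "object", additionalProperties: false, properties, required };
}
```

Listing every property in `required` and setting `additionalProperties: false` keeps the schema compatible with strict structured-output modes.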

## 3. Eval: Trajectory Evaluators

- [ ] 3.1 Implement trajectory matching utilities: `_normalize_to_openai_messages_list()`, `_extract_tool_calls()`
- [ ] 3.2 Implement `_is_trajectory_superset()` core comparator with `_get_matcher_for_tool_name()` override system
- [ ] 3.3 Implement strict/unordered/subset/superset matching scorers
- [ ] 3.4 Implement `create_trajectory_match_evaluator()` with all 4 modes and `tool_args_match_overrides`
- [ ] 3.5 Write tests: all 4 match modes, tool args ignore, custom matchers
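
One possible shape for the superset comparator in 3.2, with per-tool argument matching overrides, is sketched below. The `ToolCall` and `MatchMode` types, the camelCase names, and the greedy pairing strategy are all assumptions for illustration, not the planned implementation.

```typescript
// Hypothetical sketch of 3.2; types and names are illustrative assumptions.
type Args = Record<string, unknown>;
type ToolCall = { name: string; args: Args };
type ArgsMatcher = (output: Args, reference: Args) => boolean;
type MatchMode = "exact" | "ignore" | ArgsMatcher;

// Structural equality that ignores object key order.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const ra = a as Record<string, unknown>;
  const rb = b as Record<string, unknown>;
  const keys = Object.keys(ra);
  if (keys.length !== Object.keys(rb).length) return false;
  return keys.every((k) => Object.prototype.hasOwnProperty.call(rb, k) && deepEqual(ra[k], rb[k]));
}

// Resolve the matcher for a tool: a per-tool override wins over the default mode.
function getMatcherForToolName(
  name: string,
  defaultMode: MatchMode,
  overrides: Record<string, MatchMode>,
): ArgsMatcher {
  const mode = overrides[name] ?? defaultMode;
  if (mode === "exact") return deepEqual;
  if (mode === "ignore") return () => true;
  return mode;
}

// True when every reference call is matched by a distinct output call.
// Pairing is greedy, which is a simplification: with permissive custom
// matchers a full bipartite matching could succeed where this fails.
function isTrajectorySuperset(
  outputs: ToolCall[],
  reference: ToolCall[],
  defaultMode: MatchMode = "exact",
  overrides: Record<string, MatchMode> = {},
): boolean {
  const used = new Set<number>();
  for (const ref of reference) {
    const matcher = getMatcherForToolName(ref.name, defaultMode, overrides);
    const idx = outputs.findIndex(
      (out, i) => !used.has(i) && out.name === ref.name && matcher(out.args, ref.args),
    );
    if (idx === -1) return false;
    used.add(idx);
  }
  return true;
}
```

With a comparator like this, subset mode can swap the two arguments and unordered mode can require the check to pass in both directions.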

## 4. Eval: Code Correctness Evaluators

- [ ] 4.1 Implement code extraction: `_extract_code_from_markdown_code_blocks()` regex parser
- [ ] 4.2 Implement `_create_base_code_evaluator()` with pluggable extraction pipeline
- [ ] 4.3 Implement `create_code_llm_as_judge()` combining extraction + LLM scoring
- [ ] 4.4 Implement `create_pyright_evaluator()` with temp file execution and JSON output parsing
- [ ] 4.5 Write tests: markdown extraction, Pyright static analysis
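
The regex parser in 4.1 might work along the lines of the sketch below. The camelCase name, the choice to join multiple blocks with a blank line, and the `null` return for input without fences are assumptions; the fence string is built with `repeat` only so the example can sit inside a fenced block itself.

```typescript
// Hypothetical sketch of 4.1; behavior for edge cases is an assumption.
const FENCE = "`".repeat(3);

function extractCodeFromMarkdownCodeBlocks(text: string): string | null {
  // Group 1: optional language tag on the opening fence line.
  // Group 2: block body, matched lazily up to the next closing fence.
  const pattern = new RegExp(FENCE + "([^\\n`]*)\\n([\\s\\S]*?)" + FENCE, "g");
  const blocks: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    // Drop trailing whitespace, including the newline before the closing fence.
    blocks.push((match[2] ?? "").replace(/\s+$/, ""));
  }
  return blocks.length > 0 ? blocks.join("\n\n") : null;
}
```

Returning `null` rather than an empty string lets the base evaluator in 4.2 distinguish "no code found" from "an empty code block".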

## 5. Eval: Prompt Library

- [ ] 5.1 Export Quality prompt templates: correctness, conciseness, hallucination, answer_relevance, code_correctness, plan_adherence
- [ ] 5.2 Export Safety/Security prompt templates: toxicity, fairness, pii_leakage, prompt_injection
- [ ] 5.3 Export Trajectory prompt templates: trajectory_accuracy (with and without reference), tool_selection
- [ ] 5.4 Export Conversation prompt templates: user_satisfaction, task_completion, agent_tone

## 6. Documentation & Release

- [ ] 6.1 Write README with architecture overview and getting-started example
- [ ] 6.2 Document each package's exported API with TSDoc comments
- [ ] 6.3 Write usage examples: basic eval, code correctness check
- [ ] 6.4 Add CI pipeline: lint, type-check, test
- [ ] 6.5 Publish initial alpha for `@agent-runtime/eval` package