## 1. Foundation: Core Types & Monorepo Setup ✅

- [x] 1.1 Initialize pnpm monorepo with turbo.json at root, configure `@agent-runtime/*` workspace packages
- [x] 1.2 Set up shared TypeScript config (strict mode, ESNext modules, path aliases)
- [x] 1.3 Implement `@agent-runtime/core` package: `EvaluatorResult`, `ScoreType`, `ModelClient` protocol, `Serializable` interface
- [x] 1.4 Implement `@agent-runtime/core` serialization protocol: `toJSON()`/`fromJSON()` pattern on stateful types
- [x] 1.5 Implement `@agent-runtime/core` error types: `EvalError`
- [x] 1.6 Implement `@agent-runtime/core` utility functions: message normalization, XML formatting, JSON schema construction

## 2. Eval: LLM-as-Judge Core

- [ ] 2.1 Implement `_construct_default_output_json_schema()` for continuous/binary/choices scoring with reasoning
- [ ] 2.2 Implement prompt formatting (string templates, attachments, system messages)
- [ ] 2.3 Implement `_append_few_shot_examples()` with XML `` formatting
- [ ] 2.4 Implement `_create_llm_as_judge_scorer()`: core scorer with structured output via OpenAI JSON schema
- [ ] 2.5 Implement `create_llm_as_judge()` factory wrapping scorer into `_run_evaluator()`
- [ ] 2.6 Implement async variants: `create_async_llm_as_judge()`, `_create_async_llm_as_judge_scorer()`
- [ ] 2.7 Implement `_run_evaluator_untyped()` and `_process_score()` for result aggregation
- [ ] 2.8 Write unit tests for LLM-as-judge: string prompts, continuous scoring, choices, reasoning, few-shot

## 3. Eval: Trajectory Evaluators

- [ ] 3.1 Implement trajectory matching utilities: `_normalize_to_openai_messages_list()`, `_extract_tool_calls()`
- [ ] 3.2 Implement `_is_trajectory_superset()` core comparator with `_get_matcher_for_tool_name()` override system
- [ ] 3.3 Implement strict/unordered/subset/superset matching scorers
- [ ] 3.4 Implement `create_trajectory_match_evaluator()` with all 4 modes and `tool_args_match_overrides`
- [ ] 3.5 Write tests: all 4 match modes, tool args ignore, custom matchers

## 4. Eval: Code Correctness Evaluators

- [ ] 4.1 Implement code extraction: `_extract_code_from_markdown_code_blocks()` regex parser
- [ ] 4.2 Implement `_create_base_code_evaluator()` with pluggable extraction pipeline
- [ ] 4.3 Implement `create_code_llm_as_judge()` combining extraction + LLM scoring
- [ ] 4.4 Implement `create_pyright_evaluator()` with temp file execution and JSON output parsing
- [ ] 4.5 Write tests: markdown extraction, Pyright static analysis

## 5. Eval: Prompt Library

- [ ] 5.1 Export Quality prompt templates: correctness, conciseness, hallucination, answer_relevance, code_correctness, plan_adherence
- [ ] 5.2 Export Safety/Security prompt templates: toxicity, fairness, pii_leakage, prompt_injection
- [ ] 5.3 Export Trajectory prompt templates: trajectory_accuracy (with and without reference), tool_selection
- [ ] 5.4 Export Conversation prompt templates: user_satisfaction, task_completion, agent_tone

## 6. Documentation & Release

- [ ] 6.1 Write README with architecture overview and getting-started example
- [ ] 6.2 Document each package with tsdoc exports
- [ ] 6.3 Write usage examples: basic eval, code correctness check
- [ ] 6.4 Add CI pipeline: lint, type-check, test
- [ ] 6.5 Publish initial alpha for `@agent-runtime/eval` package
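
Note on tasks 3.3 and 3.4: the four match modes (strict, unordered, subset, superset) are listed without their semantics being spelled out. The sketch below shows one possible reading, assuming "subset" means the output's tool calls are all contained in the reference and "superset" means the reverse. The `ToolCall` shape and the names `callKey`, `countKeys`, `isContained`, and `trajectoryMatches` are hypothetical and only for illustration; per-tool argument overrides (`tool_args_match_overrides`) are left out.

```typescript
// Hypothetical shapes for illustration; the real package may model these differently.
type ToolCall = { name: string; args: Record<string, unknown> };
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// Canonical string key for a tool call. Sorting top-level argument keys makes
// the key insensitive to property order (nested objects are not normalized).
function callKey(call: ToolCall): string {
  const sortedArgs = Object.keys(call.args)
    .sort()
    .map((k) => [k, call.args[k]]);
  return JSON.stringify([call.name, sortedArgs]);
}

// Count occurrences of each key so repeated calls are compared as a multiset.
function countKeys(calls: ToolCall[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const call of calls) {
    const key = callKey(call);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// True when every call in `inner` appears at least as often in `outer`.
function isContained(inner: ToolCall[], outer: ToolCall[]): boolean {
  const outerCounts = countKeys(outer);
  return Array.from(countKeys(inner)).every(
    ([key, n]) => (outerCounts.get(key) ?? 0) >= n,
  );
}

// Assumed semantics: "subset" = outputs contained in reference,
// "superset" = reference contained in outputs.
function trajectoryMatches(
  outputs: ToolCall[],
  reference: ToolCall[],
  mode: MatchMode,
): boolean {
  switch (mode) {
    case "strict":
      // Same calls, same order.
      return (
        outputs.length === reference.length &&
        outputs.every((c, i) => callKey(c) === callKey(reference[i]))
      );
    case "unordered":
      // Same multiset of calls, any order.
      return outputs.length === reference.length && isContained(outputs, reference);
    case "subset":
      return isContained(outputs, reference);
    case "superset":
      return isContained(reference, outputs);
  }
}
```

Comparing calls through a canonical string key keeps all four modes on one containment check, which may be a convenient starting point for the test matrix in 3.5. A custom matcher per tool name would replace `callKey` equality with a user-supplied predicate.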