Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
1. Foundation: Core Types & Monorepo Setup ✅
- 1.1 Initialize pnpm monorepo with turbo.json at root, configure @agent-runtime/* workspace packages
- 1.2 Set up shared TypeScript config (strict mode, ESNext modules, path aliases)
- 1.3 Implement @agent-runtime/core package: EvaluatorResult, ScoreType, ModelClient protocol, Serializable interface
- 1.4 Implement @agent-runtime/core serialization protocol: toJSON()/fromJSON() pattern on stateful types
- 1.5 Implement @agent-runtime/core error types: EvalError
- 1.6 Implement @agent-runtime/core utility functions: message normalization, XML formatting, JSON schema construction
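As a minimal sketch of the toJSON()/fromJSON() pattern from 1.4, the following shows one way a stateful type could round-trip through JSON. The field names (key, score, comment), the union behind ScoreType, and the generic parameter on Serializable are assumptions for illustration; the task list names these types but does not define them.

```typescript
// Hypothetical shapes: only the type names come from the task list.
type ScoreType = number | boolean | string;

// A type is Serializable if it can produce a plain JSON-compatible value.
interface Serializable<J> {
  toJSON(): J;
}

interface EvaluatorResultJSON {
  key: string;
  score: ScoreType;
  comment: string | null;
}

class EvaluatorResult implements Serializable<EvaluatorResultJSON> {
  readonly key: string;
  readonly score: ScoreType;
  readonly comment: string | null;

  constructor(key: string, score: ScoreType, comment: string | null = null) {
    this.key = key;
    this.score = score;
    this.comment = comment;
  }

  // JSON.stringify() calls toJSON() automatically when it is present.
  toJSON(): EvaluatorResultJSON {
    return { key: this.key, score: this.score, comment: this.comment };
  }

  // Static counterpart that rebuilds a class instance from plain data.
  static fromJSON(json: EvaluatorResultJSON): EvaluatorResult {
    return new EvaluatorResult(json.key, json.score, json.comment);
  }
}

const original = new EvaluatorResult("correctness", 0.8, "mostly right");
const wire = JSON.stringify(original);
const restored = EvaluatorResult.fromJSON(JSON.parse(wire));
```

Keeping fromJSON() static means callers get back a real class instance with its methods rather than a bare object, which matters once stateful types carry behavior.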
2. Eval: LLM-as-Judge Core
- 2.1 Implement _construct_default_output_json_schema() for continuous/binary/choices scoring with reasoning
- 2.2 Implement prompt formatting (string templates, attachments, system messages)
- 2.3 Implement _append_few_shot_examples() with XML <example> formatting
- 2.4 Implement _create_llm_as_judge_scorer(): core scorer with structured output via OpenAI JSON schema
- 2.5 Implement create_llm_as_judge() factory wrapping scorer into _run_evaluator()
- 2.6 Implement async variants: create_async_llm_as_judge(), _create_async_llm_as_judge_scorer()
- 2.7 Implement _run_evaluator_untyped() and _process_score() for result aggregation
- 2.8 Write unit tests for LLM-as-judge: string prompts, continuous scoring, choices, reasoning, few-shot
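The schema construction in 2.1 could look roughly like the sketch below. The function name buildOutputJsonSchema, its option names, and the description strings are hypothetical; the only behavior taken from the task list is that the score can be continuous, binary, or one of a fixed set of choices, optionally accompanied by reasoning.

```typescript
// Hypothetical option names; the task list only states the three scoring modes.
interface SchemaOptions {
  continuous?: boolean;   // score is a float instead of a boolean
  choices?: number[];     // score must be one of these values
  useReasoning?: boolean; // ask the model to explain before scoring
}

type JsonSchema = Record<string, unknown>;

function buildOutputJsonSchema(opts: SchemaOptions = {}): JsonSchema {
  const { continuous = false, choices, useReasoning = true } = opts;

  let score: JsonSchema;
  if (choices !== undefined) {
    // Choices take precedence: restrict the score to an enumerated set.
    score = { type: "number", enum: choices, description: "One of the allowed scores." };
  } else if (continuous) {
    score = { type: "number", description: "A score between 0 and 1." };
  } else {
    score = { type: "boolean", description: "True if the criteria are met." };
  }

  // Placing reasoning first nudges the model to justify before it scores.
  const properties: Record<string, JsonSchema> = useReasoning
    ? { reasoning: { type: "string", description: "Step-by-step justification." }, score }
    : { score };

  return {
    type: "object",
    properties,
    // Strict structured output expects every property to be required
    // and additional properties to be disallowed.
    required: Object.keys(properties),
    additionalProperties: false,
  };
}

const choiceSchema = buildOutputJsonSchema({ choices: [0, 0.5, 1] });
const binarySchema = buildOutputJsonSchema({ useReasoning: false });
```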
3. Eval: Trajectory Evaluators
- 3.1 Implement trajectory matching utilities: _normalize_to_openai_messages_list(), _extract_tool_calls()
- 3.2 Implement _is_trajectory_superset() core comparator with _get_matcher_for_tool_name() override system
- 3.3 Implement strict/unordered/subset/superset matching scorers
- 3.4 Implement create_trajectory_match_evaluator() with all 4 modes and tool_args_match_overrides
- 3.5 Write tests: all 4 match modes, tool args ignore, custom matchers
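One possible reading of the four match modes, sketched over tool calls only. All names here (extractToolCalls, covers, trajectoryMatch) are hypothetical, the message shape assumes OpenAI-style tool_calls with JSON-encoded arguments, and the greedy pairing is a simplification that assumes symmetric matchers; the actual comparator may also consider message content.

```typescript
interface ToolCall { name: string; args: Record<string, unknown>; }
interface Message {
  role: string;
  content?: string;
  tool_calls?: { function: { name: string; arguments: string } }[];
}
type MatchMode = "strict" | "unordered" | "subset" | "superset";
// Assumed symmetric: the order of the two argument objects must not matter.
type ArgsMatcher = (a: Record<string, unknown>, b: Record<string, unknown>) => boolean;

// Collect tool calls from assistant messages, in order.
function extractToolCalls(messages: Message[]): ToolCall[] {
  const calls: ToolCall[] = [];
  for (const message of messages) {
    for (const call of message.tool_calls ?? []) {
      calls.push({ name: call.function.name, args: JSON.parse(call.function.arguments) });
    }
  }
  return calls;
}

// Default matcher: exact equality, ignoring top-level key order only.
const exactArgs: ArgsMatcher = (a, b) => {
  const canon = (o: Record<string, unknown>) =>
    JSON.stringify(Object.keys(o).sort().map((k) => [k, o[k]]));
  return canon(a) === canon(b);
};

function callsMatch(a: ToolCall, b: ToolCall, overrides: Record<string, ArgsMatcher>): boolean {
  return a.name === b.name && (overrides[a.name] ?? exactArgs)(a.args, b.args);
}

// True if every call in `inner` pairs with a distinct call in `outer` (greedy).
function covers(outer: ToolCall[], inner: ToolCall[], overrides: Record<string, ArgsMatcher>): boolean {
  const used = new Set<number>();
  for (const want of inner) {
    const idx = outer.findIndex((have, i) => !used.has(i) && callsMatch(have, want, overrides));
    if (idx === -1) return false;
    used.add(idx);
  }
  return true;
}

function trajectoryMatch(
  outputs: ToolCall[],
  reference: ToolCall[],
  mode: MatchMode,
  overrides: Record<string, ArgsMatcher> = {},
): boolean {
  if (mode === "superset") return covers(outputs, reference, overrides);
  if (mode === "subset") return covers(reference, outputs, overrides);
  if (outputs.length !== reference.length) return false;
  if (mode === "unordered") return covers(outputs, reference, overrides);
  // strict: same calls in the same positions
  return outputs.every((call, i) => {
    const ref = reference[i];
    return ref !== undefined && callsMatch(call, ref, overrides);
  });
}
```

Passing an override such as `{ search: () => true }` makes the comparison ignore arguments for that tool, which is one way the "tool args ignore" test case in 3.5 could be expressed.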
4. Eval: Code Correctness Evaluators
- 4.1 Implement code extraction: _extract_code_from_markdown_code_blocks() regex parser
- 4.2 Implement _create_base_code_evaluator() with pluggable extraction pipeline
- 4.3 Implement create_code_llm_as_judge() combining extraction + LLM scoring
- 4.4 Implement create_pyright_evaluator() with temp file execution and JSON output parsing
- 4.5 Write tests: markdown extraction, Pyright static analysis
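The regex-based extraction in 4.1 might be sketched as follows. The function name extractCodeBlocks and the exact pattern are assumptions; this version handles only triple-backtick fences, discards the language tag, and does not attempt nested or tilde fences.

```typescript
// Built programmatically so this example does not contain a literal fence.
const FENCE = "`".repeat(3);

// Return the body of every fenced code block in the input, in document order.
function extractCodeBlocks(markdown: string): string[] {
  // Group 1: optional info string (e.g. "python"); group 2: lazily matched body.
  const pattern = new RegExp(FENCE + "([^\\n]*)\\n([\\s\\S]*?)" + FENCE, "g");
  const blocks: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(markdown)) !== null) {
    blocks.push((match[2] ?? "").trimEnd());
  }
  return blocks;
}

const sampleMarkdown =
  "Intro\n" + FENCE + "python\nprint(1)\n" + FENCE +
  "\nMore text\n" + FENCE + "\nx = 2\n" + FENCE + "\n";
const extractedBlocks = extractCodeBlocks(sampleMarkdown);
```

A pluggable pipeline as described in 4.2 could accept any function with this signature, so callers can swap the regex parser for a stricter markdown parser without touching the evaluator.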
5. Eval: Prompt Library
- 5.1 Export Quality prompt templates: correctness, conciseness, hallucination, answer_relevance, code_correctness, plan_adherence
- 5.2 Export Safety/Security prompt templates: toxicity, fairness, pii_leakage, prompt_injection
- 5.3 Export Trajectory prompt templates: trajectory_accuracy (with and without reference), tool_selection
- 5.4 Export Conversation prompt templates: user_satisfaction, task_completion, agent_tone
6. Documentation & Release
- 6.1 Write README with architecture overview and getting-started example
- 6.2 Document each package with tsdoc exports
- 6.3 Write usage examples: basic eval, code correctness check
- 6.4 Add CI pipeline: lint, type-check, test
- 6.5 Publish initial alpha for @agent-runtime/eval package