boocode/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/tasks.md
indifferentketchup c935687725 chore(openspec): drop 9 superseded proposals + 11 stub archive files
Drop 9 batch proposals that are superseded by the boocode-lift-analysis
(boocontext-audit, conductor upgrades, self-healing/verify-gate skills):
add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform,
conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul,
agent-reliability.

Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only)
that provide zero documentation value over the existing CHANGELOG.md + git tags.
2026-06-07 22:15:38 +00:00

1. Foundation: Core Types & Monorepo Setup

  • 1.1 Initialize pnpm monorepo with turbo.json at root, configure @agent-runtime/* workspace packages
  • 1.2 Set up shared TypeScript config (strict mode, ESNext modules, path aliases)
  • 1.3 Implement @agent-runtime/core package: EvaluatorResult, ScoreType, ModelClient protocol, Serializable interface
  • 1.4 Implement @agent-runtime/core serialization protocol: toJSON()/fromJSON() pattern on stateful types
  • 1.5 Implement @agent-runtime/core error types: EvalError
  • 1.6 Implement @agent-runtime/core utility functions: message normalization, XML formatting, JSON schema construction
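
Tasks 1.3 to 1.5 can be pictured with a minimal sketch like the one below. The task list names `EvaluatorResult`, `ScoreType`, `Serializable`, and `EvalError` but does not define them, so every field shown here is an assumption, and `RunningAverage` is a hypothetical stateful type added only to illustrate the `toJSON()`/`fromJSON()` round trip.

```typescript
// Assumed shapes: the task list names these types but does not define them.
type ScoreType = number | boolean;

interface EvaluatorResult {
  key: string;
  score: ScoreType;
  comment?: string;
}

// Stateful types expose toJSON(); the matching static fromJSON() lives on each class.
interface Serializable<J> {
  toJSON(): J;
}

class EvalError extends Error {
  readonly evaluatorKey: string | undefined;

  constructor(message: string, evaluatorKey?: string) {
    super(message);
    this.name = "EvalError";
    this.evaluatorKey = evaluatorKey;
  }
}

interface RunningAverageJSON {
  total: number;
  count: number;
}

// Hypothetical stateful type: accumulates numeric scores and survives a JSON round trip.
class RunningAverage implements Serializable<RunningAverageJSON> {
  private total = 0;
  private count = 0;

  add(result: EvaluatorResult): void {
    if (typeof result.score !== "number") {
      throw new EvalError("RunningAverage needs a numeric score", result.key);
    }
    this.total += result.score;
    this.count += 1;
  }

  get mean(): number {
    return this.count === 0 ? 0 : this.total / this.count;
  }

  toJSON(): RunningAverageJSON {
    return { total: this.total, count: this.count };
  }

  static fromJSON(json: RunningAverageJSON): RunningAverage {
    const restored = new RunningAverage();
    restored.total = json.total;
    restored.count = json.count;
    return restored;
  }
}
```

Because `JSON.stringify` calls `toJSON()` automatically, a value can be persisted with `JSON.stringify(avg)` and rebuilt with `RunningAverage.fromJSON(JSON.parse(text))`.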

2. Eval: LLM-as-Judge Core

  • 2.1 Implement _construct_default_output_json_schema() for continuous/binary/choices scoring with reasoning
  • 2.2 Implement prompt formatting (string templates, attachments, system messages)
  • 2.3 Implement _append_few_shot_examples() with XML <example> formatting
  • 2.4 Implement _create_llm_as_judge_scorer() — core scorer with structured output via OpenAI JSON schema
  • 2.5 Implement create_llm_as_judge() factory wrapping scorer into _run_evaluator()
  • 2.6 Implement async variants: create_async_llm_as_judge(), _create_async_llm_as_judge_scorer()
  • 2.7 Implement _run_evaluator_untyped() and _process_score() for result aggregation
  • 2.8 Write unit tests for LLM-as-judge: string prompts, continuous scoring, choices, reasoning, few-shot
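
The schema construction in task 2.1 might look roughly like this in TypeScript. This is a sketch under assumptions, not the shipped implementation: the camelCase name `constructDefaultOutputJsonSchema`, the `SchemaOptions` fields, and the description strings are all hypothetical. It sets `additionalProperties: false` and marks every property as required, which is what OpenAI's strict structured-output mode expects.

```typescript
type JsonSchema = Record<string, unknown>;

// Hypothetical option names; the task list only mentions the three scoring styles.
interface SchemaOptions {
  continuous?: boolean;   // score is a free number instead of a boolean
  choices?: number[];     // score must be one of these values
  useReasoning?: boolean; // ask the judge to explain itself before scoring
}

function constructDefaultOutputJsonSchema(options: SchemaOptions = {}): JsonSchema {
  const { continuous = false, choices, useReasoning = true } = options;

  let score: JsonSchema;
  if (choices !== undefined) {
    score = {
      type: "number",
      enum: choices,
      description: "One of the allowed score values.",
    };
  } else if (continuous) {
    score = {
      type: "number",
      description: "How well the criteria are met; higher is better.",
    };
  } else {
    score = {
      type: "boolean",
      description: "Whether the criteria are met.",
    };
  }

  // Reasoning comes first so the model writes its explanation before the score.
  const properties: Record<string, JsonSchema> = useReasoning
    ? {
        reasoning: {
          type: "string",
          description: "Step-by-step explanation for the score.",
        },
        score,
      }
    : { score };

  return {
    type: "object",
    additionalProperties: false,
    properties,
    required: Object.keys(properties),
  };
}
```

Calling `constructDefaultOutputJsonSchema({ choices: [0, 0.5, 1] })` yields a schema whose `score` property is a numeric enum, while calling it with no arguments yields a binary score with reasoning.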

3. Eval: Trajectory Evaluators

  • 3.1 Implement trajectory matching utilities: _normalize_to_openai_messages_list(), _extract_tool_calls()
  • 3.2 Implement _is_trajectory_superset() core comparator with _get_matcher_for_tool_name() override system
  • 3.3 Implement strict/unordered/subset/superset matching scorers
  • 3.4 Implement create_trajectory_match_evaluator() with all 4 modes and tool_args_match_overrides
  • 3.5 Write tests: all 4 match modes, tool args ignore, custom matchers
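
The superset comparator and per-tool override system from tasks 3.1 and 3.2 could be sketched as follows. The names are camelCase stand-ins for the functions listed above, the `Map`-based override table is an assumption, and the greedy one-to-one matching is one simple strategy rather than a confirmed design. Messages follow the OpenAI chat format, where tool-call arguments arrive as a JSON string.

```typescript
// OpenAI-style chat message, reduced to the fields this sketch reads.
interface ChatMessage {
  role: string;
  content?: string | null;
  tool_calls?: { function: { name: string; arguments: string } }[];
}

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

type ArgsMatcher = (
  actual: Record<string, unknown>,
  reference: Record<string, unknown>,
) => boolean;

function extractToolCalls(messages: ChatMessage[]): ToolCall[] {
  const calls: ToolCall[] = [];
  for (const message of messages) {
    for (const call of message.tool_calls ?? []) {
      // Tool arguments are a JSON-encoded string in the OpenAI format.
      calls.push({ name: call.function.name, args: JSON.parse(call.function.arguments) });
    }
  }
  return calls;
}

// Default matcher: structural equality on JSON-like values.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const aKeys = Object.keys(a);
  if (aKeys.length !== Object.keys(b).length) return false;
  return aKeys.every((key) =>
    deepEqual((a as Record<string, unknown>)[key], (b as Record<string, unknown>)[key]),
  );
}

// True when every reference call is matched by a distinct actual call.
function isTrajectorySuperset(
  actual: ToolCall[],
  reference: ToolCall[],
  overrides: Map<string, ArgsMatcher> = new Map(),
): boolean {
  const used = new Set<number>();
  return reference.every((ref) => {
    const matcher = overrides.get(ref.name) ?? deepEqual;
    const index = actual.findIndex(
      (call, i) => !used.has(i) && call.name === ref.name && matcher(call.args, ref.args),
    );
    if (index === -1) return false;
    used.add(index);
    return true;
  });
}
```

Under this sketch, subset mode is the same check with `actual` and `reference` swapped, and unordered mode requires the check to pass in both directions. Ignoring arguments for one tool is a matter of registering `() => true` under that tool's name.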

4. Eval: Code Correctness Evaluators

  • 4.1 Implement code extraction: _extract_code_from_markdown_code_blocks() regex parser
  • 4.2 Implement _create_base_code_evaluator() with pluggable extraction pipeline
  • 4.3 Implement create_code_llm_as_judge() combining extraction + LLM scoring
  • 4.4 Implement create_pyright_evaluator() with temp file execution and JSON output parsing
  • 4.5 Write tests: markdown extraction, Pyright static analysis
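
As a rough illustration of tasks 4.1 and 4.4, the sketch below extracts fenced code with a regex and turns Pyright's JSON report into a pass/fail result. The function names and the returned `{ score, comment }` shape are assumptions. The report fields used (`generalDiagnostics`, `severity`, `message`) are those emitted by `pyright --outputjson`. Running Pyright against a temp file is left out so the sketch stays free of side effects.

```typescript
// Three backticks, written with escapes so this sample can itself sit inside a fence.
const FENCE = "\u0060\u0060\u0060";

function extractCodeFromMarkdownCodeBlocks(text: string): string[] {
  // Opening fence with optional info string, lazily captured body, closing fence.
  const pattern = new RegExp(FENCE + "[^\\n]*\\n([\\s\\S]*?)" + FENCE, "g");
  const blocks: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    blocks.push((match[1] ?? "").trimEnd());
  }
  return blocks;
}

// Subset of the report printed by `pyright --outputjson`.
interface PyrightDiagnostic {
  severity: string;
  message: string;
  rule?: string;
}

interface PyrightReport {
  generalDiagnostics: PyrightDiagnostic[];
}

// Assumed scoring rule: the code passes when Pyright reports no errors.
function scorePyrightOutput(stdout: string): { score: boolean; comment: string } {
  const report = JSON.parse(stdout) as PyrightReport;
  const errors = report.generalDiagnostics.filter((d) => d.severity === "error");
  return {
    score: errors.length === 0,
    comment: errors.map((d) => d.message).join("\n"),
  };
}
```

A full evaluator would join the extracted blocks, write them to a temporary `.py` file, run Pyright on it, and pass the captured stdout to `scorePyrightOutput`.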

5. Eval: Prompt Library

  • 5.1 Export Quality prompt templates: correctness, conciseness, hallucination, answer_relevance, code_correctness, plan_adherence
  • 5.2 Export Safety/Security prompt templates: toxicity, fairness, pii_leakage, prompt_injection
  • 5.3 Export Trajectory prompt templates: trajectory_accuracy (with and without reference), tool_selection
  • 5.4 Export Conversation prompt templates: user_satisfaction, task_completion, agent_tone
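
One way an exported template and its formatter might fit together is sketched below. The prompt wording is placeholder text rather than the shipped prompt, and `formatPrompt`, along with the `{inputs}`/`{outputs}` placeholder syntax, is an assumption made for illustration.

```typescript
// Placeholder wording, not the shipped prompt text.
const CONCISENESS_PROMPT = `You are grading how concise a response is.

<input>
{inputs}
</input>

<output>
{outputs}
</output>

Answer true if the output contains no unnecessary content.`;

// Replace each {name} placeholder; fail loudly when a variable is missing.
function formatPrompt(template: string, variables: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (_match, name: string) => {
    if (!Object.prototype.hasOwnProperty.call(variables, name)) {
      throw new Error(`Missing prompt variable: ${name}`);
    }
    return String(variables[name]);
  });
}
```

Throwing on a missing variable keeps a half-filled prompt from silently reaching the judge model.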

6. Documentation & Release

  • 6.1 Write README with architecture overview and getting-started example
  • 6.2 Document each package's exports with TSDoc comments
  • 6.3 Write usage examples: basic eval, code correctness check
  • 6.4 Add CI pipeline: lint, type-check, test
  • 6.5 Publish an initial alpha of the @agent-runtime/eval package