Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
ADDED Requirements
Requirement: Trajectory match evaluator
The system SHALL provide create_trajectory_match_evaluator() that compares agent tool-call trajectories against reference trajectories.
Parameters:
- trajectory_match_mode: "strict" | "unordered" | "subset" | "superset" (matching strategy)
- tool_args_match_mode: "exact" | "ignore" | "subset" | "superset" (tool argument comparison)
- tool_args_match_overrides?: Record<string, ToolArgsMatchMode | Callable | string[]> (per-tool custom matching)
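The parameter list above could be typed along these lines. This is a minimal sketch, not the shipped API: the `TrajectoryMatchOptions` container, the `ToolArgsMatcher` alias, and the `(output_args, reference_args) -> bool` comparator signature are assumptions made for illustration. The spec only names the three parameters and their allowed values.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Literal, Union, get_args

TrajectoryMatchMode = Literal["strict", "unordered", "subset", "superset"]
ToolArgsMatchMode = Literal["exact", "ignore", "subset", "superset"]

# Assumed comparator shape: (output_args, reference_args) -> bool.
ToolArgsMatcher = Callable[[Dict[str, Any], Dict[str, Any]], bool]

# A per-tool override is a mode name, a custom comparator, or a list of argument names.
ToolArgsOverride = Union[ToolArgsMatchMode, ToolArgsMatcher, List[str]]


@dataclass
class TrajectoryMatchOptions:
    """Hypothetical options container; the spec does not define this class."""

    trajectory_match_mode: TrajectoryMatchMode
    tool_args_match_mode: ToolArgsMatchMode
    # Optional in the spec (note the trailing "?"), so it defaults to empty here.
    tool_args_match_overrides: Dict[str, ToolArgsOverride] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Literal types are not enforced at runtime, so validate explicitly.
        if self.trajectory_match_mode not in get_args(TrajectoryMatchMode):
            raise ValueError(f"unknown trajectory_match_mode: {self.trajectory_match_mode!r}")
        if self.tool_args_match_mode not in get_args(ToolArgsMatchMode):
            raise ValueError(f"unknown tool_args_match_mode: {self.tool_args_match_mode!r}")
```

Validating in `__post_init__` means a misspelled mode fails at construction time rather than silently matching nothing later.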
Scenario: Strict mode requires exact order
- WHEN output trajectory has tool calls [A, B] and reference is [A, B]
- THEN strict mode SHALL return score true
- WHEN output trajectory has tool calls [B, A] and reference is [A, B]
- THEN strict mode SHALL return score false
Scenario: Unordered mode ignores order
- WHEN output trajectory has tool calls [B, A] and reference is [A, B]
- THEN unordered mode SHALL return score true
Scenario: Subset mode accepts partial trajectory
- WHEN output trajectory has tool calls [A] and reference is [A, B]
- THEN subset mode SHALL return score true
Scenario: Superset mode allows extra tool calls
- WHEN output trajectory has tool calls [A, B, C] and reference is [A, B]
- THEN superset mode SHALL return score true
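The four trajectory matching modes from the scenarios above can be sketched on tool names alone. This is an illustrative sketch under stated assumptions, not the evaluator's implementation: `trajectories_match` is a hypothetical helper, arguments are ignored entirely, and repeated tool calls are treated as a multiset (the spec does not say how duplicates are counted).

```python
from collections import Counter
from typing import List


def trajectories_match(output: List[str], reference: List[str], mode: str) -> bool:
    """Compare tool-call names only; argument matching is out of scope here."""
    if mode == "strict":
        # Same tools in the same order.
        return output == reference

    out_counts, ref_counts = Counter(output), Counter(reference)
    if mode == "unordered":
        # Same tools, any order (duplicates must agree in count).
        return out_counts == ref_counts
    if mode == "subset":
        # Every output call appears in the reference; the output may be shorter.
        return all(ref_counts[name] >= n for name, n in out_counts.items())
    if mode == "superset":
        # Every reference call appears in the output; extra calls are allowed.
        return all(out_counts[name] >= n for name, n in ref_counts.items())
    raise ValueError(f"unknown trajectory_match_mode: {mode!r}")
```

Using `Counter` rather than `set` keeps a trajectory that calls the same tool twice distinct from one that calls it once, which seems the safer reading of "unordered".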
Scenario: Tool args ignore mode skips argument comparison
- WHEN tool_args_match_mode="ignore" is set
- THEN tool calls match regardless of their arguments
Scenario: Custom tool arg matcher is used
- WHEN tool_args_match_overrides contains a Callable for a tool name
- THEN that callable SHALL be invoked to compare the tool's arguments
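Argument matching with per-tool overrides might be resolved as in the sketch below. Several details are assumptions rather than spec text: the helper name `tool_args_match`, the comparator signature `(output_args, reference_args) -> bool`, the reading of "subset"/"superset" as referring to the output's arguments relative to the reference, and the treatment of a `string[]` override as "only these argument names must be equal".

```python
from typing import Any, Callable, Dict, List, Optional, Union

Args = Dict[str, Any]
Override = Union[str, Callable[[Args, Args], bool], List[str]]


def tool_args_match(
    tool_name: str,
    output_args: Args,
    reference_args: Args,
    mode: str = "exact",
    overrides: Optional[Dict[str, Override]] = None,
) -> bool:
    # A per-tool override, when present, replaces the global mode for that tool.
    rule = (overrides or {}).get(tool_name, mode)

    if callable(rule):
        # Custom matcher: delegate the whole comparison to the caller's function.
        return bool(rule(output_args, reference_args))
    if isinstance(rule, list):
        # Assumed meaning of a string[] override: compare only the listed fields.
        return all(output_args.get(key) == reference_args.get(key) for key in rule)
    if rule == "ignore":
        return True
    if rule == "exact":
        return output_args == reference_args
    if rule == "subset":
        # Every output argument must appear, with the same value, in the reference.
        return all(k in reference_args and reference_args[k] == v for k, v in output_args.items())
    if rule == "superset":
        # Every reference argument must appear, with the same value, in the output.
        return all(k in output_args and output_args[k] == v for k, v in reference_args.items())
    raise ValueError(f"unknown tool args rule: {rule!r}")
```

For example, an override such as `{"search": lambda out, ref: out["q"].lower() == ref["q"].lower()}` would make the hypothetical `search` tool match case-insensitively while every other tool keeps the global mode.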
Requirement: Trajectory LLM-as-judge
The system SHALL provide create_trajectory_llm_as_judge() that uses an LLM to grade trajectory quality and accuracy.
Scenario: Trajectory is formatted as XML for LLM
- WHEN an LLM trajectory evaluator is invoked
- THEN the trajectory SHALL be formatted as XML with <role>, <tool_call>, <tool_result> elements
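One way the XML formatting could look is sketched below with Python's standard `xml.etree.ElementTree`. Only the `<role>`, `<tool_call>`, and `<tool_result>` element names come from the requirement; the `<trajectory>`, `<message>`, and `<content>` wrappers, the `name` attribute, the JSON-encoded arguments, and the input message shape (dicts with `role`, `content`, `tool_calls`) are all assumptions for illustration.

```python
import json
import xml.etree.ElementTree as ET
from typing import Any, Dict, List


def format_trajectory_as_xml(messages: List[Dict[str, Any]]) -> str:
    """Render a list of chat-style messages as a single XML string (hypothetical helper)."""
    root = ET.Element("trajectory")  # wrapper element is an assumption
    for message in messages:
        node = ET.SubElement(root, "message")  # wrapper element is an assumption
        ET.SubElement(node, "role").text = message["role"]

        if message["role"] == "tool":
            # Tool outputs become <tool_result> elements.
            ET.SubElement(node, "tool_result").text = str(message.get("content", ""))
            continue

        if message.get("content"):
            ET.SubElement(node, "content").text = str(message["content"])
        for call in message.get("tool_calls", []):
            # One <tool_call> per call; arguments are serialized as JSON text.
            call_node = ET.SubElement(node, "tool_call", name=call["name"])
            call_node.text = json.dumps(call.get("args", {}), sort_keys=True)

    # encoding="unicode" returns str; ElementTree escapes <, >, and & in text.
    return ET.tostring(root, encoding="unicode")
```

Building the tree with ElementTree instead of string concatenation means tool output containing `<` or `&` cannot break the markup handed to the judge model.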