boocode/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/trajectory-eval/spec.md at 9e2b0a7dc0a340effa8d06febb435f8e5d58f9ea

indifferentketchup c935687725 chore(openspec): drop 9 superseded proposals + 11 stub archive files

Drop 9 batch proposals that are superseded by the boocode-lift-analysis
(boocontext-audit, conductor upgrades, self-healing/verify-gate skills):
add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform,
conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul,
agent-reliability.

Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only)
that provide zero documentation value over the existing CHANGELOG.md + git tags.

2026-06-07 22:15:38 +00:00

2.1 KiB

ADDED Requirements

Requirement: Trajectory match evaluator

The system SHALL provide create_trajectory_match_evaluator() that compares agent tool-call trajectories against reference trajectories.

Parameters:

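trajectory_match_mode: "strict" | "unordered" | "subset" | "superset". Selects how the output tool-call sequence is compared with the reference sequence.
tool_args_match_mode: "exact" | "ignore" | "subset" | "superset". Selects how the arguments of matched tool calls are compared.
tool_args_match_overrides?: Record<string, ToolArgsMatchMode | Callable | string[]>. Optional per-tool overrides, keyed by tool name, that replace tool_args_match_mode for that tool.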

Scenario: Strict mode requires exact order

WHEN output trajectory has tool calls [A, B] and reference is [A, B]
THEN strict mode SHALL return score true
WHEN output trajectory has tool calls [B, A] and reference is [A, B]
THEN strict mode SHALL return score false

Scenario: Unordered mode ignores order

WHEN output trajectory has tool calls [B, A] and reference is [A, B]
THEN unordered mode SHALL return score true

Scenario: Subset mode accepts partial trajectory

WHEN output trajectory has tool calls [A] and reference is [A, B]
THEN subset mode SHALL return score true

Scenario: Superset mode allows extra tool calls

WHEN output trajectory has tool calls [A, B, C] and reference is [A, B]
THEN superset mode SHALL return score true
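
The four trajectory modes in the scenarios above can be sketched as a comparison over tool names. This is a minimal illustration, not the shipped implementation: the helper name `trajectories_match` is hypothetical, tool calls are reduced to their names, and repeated calls are counted with multiplicity, which the spec does not state.

```python
from collections import Counter
from typing import Literal, Sequence

TrajectoryMatchMode = Literal["strict", "unordered", "subset", "superset"]


def trajectories_match(
    output: Sequence[str],
    reference: Sequence[str],
    mode: TrajectoryMatchMode,
) -> bool:
    """Compare two sequences of tool names (hypothetical helper)."""
    out_counts, ref_counts = Counter(output), Counter(reference)
    if mode == "strict":
        # Same tool calls in the same order.
        return list(output) == list(reference)
    if mode == "unordered":
        # Same tool calls, order ignored.
        return out_counts == ref_counts
    if mode == "subset":
        # Every output call must appear in the reference.
        # Assumption: duplicates are counted, not collapsed.
        return all(ref_counts[name] >= n for name, n in out_counts.items())
    if mode == "superset":
        # Every reference call must appear in the output; extras are allowed.
        return all(out_counts[name] >= n for name, n in ref_counts.items())
    raise ValueError(f"unknown trajectory_match_mode: {mode!r}")
```

For example, `trajectories_match(["B", "A"], ["A", "B"], "strict")` is false while the same inputs in `"unordered"` mode are true, mirroring the strict and unordered scenarios.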

Scenario: Tool args ignore mode skips argument comparison

WHEN tool_args_match_mode="ignore" is set
THEN tool calls to the same tool SHALL match regardless of their arguments

Scenario: Custom tool arg matcher is used

WHEN tool_args_match_overrides contains a Callable for a tool name
THEN that callable SHALL be invoked to compare the tool's arguments
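
One way to read the argument-matching rules is as a per-tool lookup that falls back to the global mode. The sketch below is an assumption-laden illustration: `args_match` is a hypothetical helper, the callable is assumed to receive `(output_args, reference_args)` and return a boolean, a `string[]` override is assumed to list the argument names that must be equal, and the "subset" and "superset" semantics are inferred by analogy with the trajectory modes.

```python
from typing import Any, Callable, Mapping, Optional, Sequence, Union

ArgsMatcher = Callable[[Mapping[str, Any], Mapping[str, Any]], bool]
Override = Union[str, ArgsMatcher, Sequence[str]]


def args_match(
    tool_name: str,
    output_args: Mapping[str, Any],
    reference_args: Mapping[str, Any],
    mode: str = "exact",
    overrides: Optional[Mapping[str, Override]] = None,
) -> bool:
    """Compare one tool call's arguments (hypothetical helper)."""
    # A per-tool override, if present, replaces the global mode.
    rule = (overrides or {}).get(tool_name, mode)
    if callable(rule):
        # Custom matcher: invoked with both argument mappings.
        return bool(rule(output_args, reference_args))
    if isinstance(rule, (list, tuple)):
        # Assumption: a list names the arguments that must be equal.
        return all(output_args.get(k) == reference_args.get(k) for k in rule)
    if rule == "ignore":
        return True
    if rule == "exact":
        return dict(output_args) == dict(reference_args)
    if rule == "subset":
        # Assumption: every output argument appears in the reference.
        return all(
            k in reference_args and reference_args[k] == v
            for k, v in output_args.items()
        )
    if rule == "superset":
        # Assumption: every reference argument appears in the output.
        return all(
            k in output_args and output_args[k] == v
            for k, v in reference_args.items()
        )
    raise ValueError(f"unknown tool_args_match_mode: {rule!r}")
```

A case-insensitive matcher for a hypothetical `search` tool could then be passed as `overrides={"search": lambda out, ref: out["q"].lower() == ref["q"].lower()}`, leaving every other tool on the global mode.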

Requirement: Trajectory LLM-as-judge

The system SHALL provide create_trajectory_llm_as_judge() that uses an LLM to grade trajectory quality and accuracy.

Scenario: Trajectory is formatted as XML for LLM

WHEN an LLM trajectory evaluator is invoked
THEN the trajectory SHALL be formatted as XML with <role>, <tool_call>, and <tool_result> elements
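
The XML formatting step might look like the following sketch. Only the `<role>`, `<tool_call>`, and `<tool_result>` element names come from the scenario; the function name `format_trajectory_as_xml`, the `<trajectory>`, `<message>`, and `<content>` wrappers, the `name` attribute, and the message shape (dicts with `role`, `content`, and optional `tool_calls`) are assumptions made for illustration.

```python
import json
from typing import Any, Mapping, Sequence
from xml.sax.saxutils import escape, quoteattr


def format_trajectory_as_xml(messages: Sequence[Mapping[str, Any]]) -> str:
    """Render a trajectory as XML for an LLM judge prompt (hypothetical helper)."""
    lines = ["<trajectory>"]
    for msg in messages:
        lines.append("  <message>")
        lines.append(f"    <role>{escape(msg['role'])}</role>")
        content = str(msg.get("content") or "")
        if msg["role"] == "tool":
            # Tool outputs are emitted as <tool_result>.
            lines.append(f"    <tool_result>{escape(content)}</tool_result>")
        elif content:
            lines.append(f"    <content>{escape(content)}</content>")
        for call in msg.get("tool_calls", []):
            # Arguments are serialized as JSON text inside <tool_call>.
            args = json.dumps(call.get("args", {}), sort_keys=True)
            name = quoteattr(call["name"])
            lines.append(f"    <tool_call name={name}>{escape(args)}</tool_call>")
        lines.append("  </message>")
    lines.append("</trajectory>")
    return "\n".join(lines)
```

Escaping the content and arguments keeps the output well-formed even when a tool result contains `<` or `&`, so the judge prompt cannot be broken by trajectory data.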
