## ADDED Requirements

### Requirement: Trajectory match evaluator

The system SHALL provide `create_trajectory_match_evaluator()` that compares agent tool-call trajectories against reference trajectories.

Parameters:

- `trajectory_match_mode: "strict" | "unordered" | "subset" | "superset"`: matching strategy
- `tool_args_match_mode: "exact" | "ignore" | "subset" | "superset"`: tool argument comparison
- `tool_args_match_overrides?: Record`: per-tool custom matching

#### Scenario: Strict mode requires exact order

- **WHEN** output trajectory has tool calls `[A, B]` and reference is `[A, B]`
- **THEN** strict mode SHALL return score `true`
- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
- **THEN** strict mode SHALL return score `false`

#### Scenario: Unordered mode ignores order

- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
- **THEN** unordered mode SHALL return score `true`

#### Scenario: Subset mode accepts partial trajectory

- **WHEN** output trajectory has tool calls `[A]` and reference is `[A, B]`
- **THEN** subset mode SHALL return score `true`

#### Scenario: Superset mode allows extra tool calls

- **WHEN** output trajectory has tool calls `[A, B, C]` and reference is `[A, B]`
- **THEN** superset mode SHALL return score `true`

#### Scenario: Tool args ignore mode skips argument comparison

- **WHEN** `tool_args_match_mode="ignore"` is set
- **THEN** tool calls match regardless of their arguments

#### Scenario: Custom tool arg matcher is used

- **WHEN** `tool_args_match_overrides` contains a `Callable` for a tool name
- **THEN** that callable SHALL be invoked to compare the tool's arguments

### Requirement: Trajectory LLM-as-judge

The system SHALL provide `create_trajectory_llm_as_judge()` that uses an LLM to grade trajectory quality and accuracy.

#### Scenario: Trajectory is formatted as XML for LLM

- **WHEN** an LLM trajectory evaluator is invoked
- **THEN** the trajectory SHALL be formatted as XML with ``, ``, `` elements
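
The trajectory and argument matching modes specified under the trajectory match evaluator requirement can be illustrated with a minimal sketch. This is not the implementation of `create_trajectory_match_evaluator()`. The function names `trajectory_match`, `calls_match`, and `args_match`, the `(name, args)` tuple representation of a tool call, and the greedy pairing of duplicate calls are all assumptions made for illustration. The sketch also assumes that "subset" means the output is contained in the reference and "superset" means the output contains the reference, as the scenarios suggest.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical representation: a tool call is (tool name, argument dict).
ToolCall = Tuple[str, Dict[str, Any]]
ArgMatcher = Callable[[Dict[str, Any], Dict[str, Any]], bool]


def args_match(out: Dict[str, Any], ref: Dict[str, Any], mode: str) -> bool:
    """Compare output args against reference args under one tool_args_match_mode."""
    if mode == "ignore":
        return True
    if mode == "exact":
        return out == ref
    if mode == "subset":  # every output argument also appears in the reference
        return all(k in ref and ref[k] == v for k, v in out.items())
    if mode == "superset":  # every reference argument also appears in the output
        return all(k in out and out[k] == v for k, v in ref.items())
    raise ValueError(f"unknown tool_args_match_mode: {mode}")


def calls_match(out: ToolCall, ref: ToolCall, mode: str,
                overrides: Dict[str, ArgMatcher]) -> bool:
    """Two calls match if names are equal and the args pass the (overridden) check."""
    if out[0] != ref[0]:
        return False
    custom = overrides.get(out[0])
    if custom is not None:
        return custom(out[1], ref[1])  # per-tool override takes precedence
    return args_match(out[1], ref[1], mode)


def _covers(needles: List[ToolCall], haystack: List[ToolCall],
            match: Callable[[ToolCall, ToolCall], bool]) -> bool:
    """Greedily pair each needle with a distinct, not yet used haystack item."""
    remaining = list(haystack)
    for needle in needles:
        for i, candidate in enumerate(remaining):
            if match(needle, candidate):
                del remaining[i]
                break
        else:
            return False  # no unused candidate matched this needle
    return True


def trajectory_match(outputs: List[ToolCall], reference: List[ToolCall],
                     trajectory_match_mode: str = "strict",
                     tool_args_match_mode: str = "exact",
                     tool_args_match_overrides: Optional[Dict[str, ArgMatcher]] = None) -> bool:
    overrides = tool_args_match_overrides or {}

    def match(out: ToolCall, ref: ToolCall) -> bool:
        return calls_match(out, ref, tool_args_match_mode, overrides)

    if trajectory_match_mode == "strict":  # same calls in the same order
        return len(outputs) == len(reference) and all(
            match(o, r) for o, r in zip(outputs, reference))
    if trajectory_match_mode == "unordered":  # same calls, any order
        return len(outputs) == len(reference) and _covers(outputs, reference, match)
    if trajectory_match_mode == "subset":  # output contained in reference
        return _covers(outputs, reference, match)
    if trajectory_match_mode == "superset":  # output contains reference
        return _covers(reference, outputs, lambda ref, out: match(out, ref))
    raise ValueError(f"unknown trajectory_match_mode: {trajectory_match_mode}")
```

Greedy pairing keeps the sketch short, but with non-exact argument modes it can reject a trajectory that a full bipartite matching would accept. A production evaluator would need to settle that behavior, along with how duplicate tool calls are counted.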