chore(openspec): drop 9 superseded proposals + 11 stub archive files

Drop 9 batch proposals superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability.

Delete 11 stub archive files (49-66 B each, containing only 'Status: Shipped. Archived.') that add no documentation value beyond the existing CHANGELOG.md and git tags.
2026-06-07 22:15:38 +00:00
parent 0d6e9a2413
commit c935687725
119 changed files with 4897 additions and 45 deletions
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/trajectory-eval/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/trajectory-eval/spec.md
@@ -0,0 +1,51 @@
+## ADDED Requirements
+
+### Requirement: Trajectory match evaluator
+
+The system SHALL provide `create_trajectory_match_evaluator()` that compares agent tool-call trajectories against reference trajectories.
+
+Parameters:
+- `trajectory_match_mode: "strict" | "unordered" | "subset" | "superset"` — matching strategy
+- `tool_args_match_mode: "exact" | "ignore" | "subset" | "superset"` — tool argument comparison
+- `tool_args_match_overrides?: Record<string, ToolArgsMatchMode | Callable | string[]>` — per-tool custom matching
+
+#### Scenario: Strict mode requires exact order
+
+- **WHEN** output trajectory has tool calls `[A, B]` and reference is `[A, B]`
+- **THEN** strict mode SHALL return score `true`
+- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
+- **THEN** strict mode SHALL return score `false`
+
+#### Scenario: Unordered mode ignores order
+
+- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
+- **THEN** unordered mode SHALL return score `true`
+
+#### Scenario: Subset mode accepts partial trajectory
+
+- **WHEN** output trajectory has tool calls `[A]` and reference is `[A, B]`
+- **THEN** subset mode SHALL return score `true`
+
+#### Scenario: Superset mode allows extra tool calls
+
+- **WHEN** output trajectory has tool calls `[A, B, C]` and reference is `[A, B]`
+- **THEN** superset mode SHALL return score `true`
+
+#### Scenario: Tool args ignore mode skips argument comparison
+
+- **WHEN** `tool_args_match_mode="ignore"` is set
+- **THEN** tool calls SHALL match regardless of their arguments
+
+#### Scenario: Custom tool arg matcher is used
+
+- **WHEN** `tool_args_match_overrides` contains a `Callable` for a tool name
+- **THEN** that callable SHALL be invoked to compare the tool's arguments
+
+### Requirement: Trajectory LLM-as-judge
+
+The system SHALL provide `create_trajectory_llm_as_judge()` that uses an LLM to grade trajectory quality and accuracy.
+
+#### Scenario: Trajectory is formatted as XML for LLM
+
+- **WHEN** an LLM trajectory evaluator is invoked
+- **THEN** the trajectory SHALL be formatted as XML with `<role>`, `<tool_call>`, and `<tool_result>` elements
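Reviewer note: the matching scenarios in the spec above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation shipped in this commit. The helper names `names_match` and `args_match` are hypothetical, trajectories are reduced to lists of tool names, and three behaviours are assumptions the spec does not state: duplicate tool calls are counted as a multiset, `subset`/`superset` for arguments follow the same output-versus-reference direction as for trajectories, and a custom matcher receives `(output_args, reference_args)`.

```python
from collections import Counter
from typing import Any, Callable, Mapping, Optional, Sequence

ArgsMatcher = Callable[[Mapping[str, Any], Mapping[str, Any]], bool]


def names_match(output: Sequence[str], reference: Sequence[str], mode: str) -> bool:
    """Compare two trajectories given as ordered lists of tool names."""
    out, ref = Counter(output), Counter(reference)
    if mode == "strict":
        # Same tools in the same order.
        return list(output) == list(reference)
    if mode == "unordered":
        # Same tools, any order (assumption: multiset equality).
        return out == ref
    if mode == "subset":
        # Every output call also appears in the reference.
        return all(out[name] <= ref[name] for name in out)
    if mode == "superset":
        # Every reference call also appears in the output.
        return all(ref[name] <= out[name] for name in ref)
    raise ValueError(f"unknown trajectory_match_mode: {mode!r}")


def args_match(
    output_args: Mapping[str, Any],
    reference_args: Mapping[str, Any],
    mode: str = "exact",
    override: Optional[ArgsMatcher] = None,
) -> bool:
    """Compare the arguments of one tool call against its reference."""
    if override is not None:
        # A per-tool callable takes precedence (hypothetical signature).
        return bool(override(output_args, reference_args))
    if mode == "ignore":
        return True
    if mode == "exact":
        return dict(output_args) == dict(reference_args)
    if mode == "subset":
        return all(
            k in reference_args and reference_args[k] == v
            for k, v in output_args.items()
        )
    if mode == "superset":
        return all(
            k in output_args and output_args[k] == v
            for k, v in reference_args.items()
        )
    raise ValueError(f"unknown tool_args_match_mode: {mode!r}")
```

Each scenario then maps to one call, for example `names_match(["B", "A"], ["A", "B"], "unordered")` for the unordered case. A full evaluator would pair up calls by tool name and apply `args_match` to each pair, which this sketch leaves out.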