chore(openspec): drop 9 superseded proposals + 11 stub archive files
Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
This commit is contained in:
@@ -0,0 +1,51 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Trajectory match evaluator
|
||||
|
||||
The system SHALL provide `create_trajectory_match_evaluator()` that compares agent tool-call trajectories against reference trajectories.
|
||||
|
||||
Parameters:
|
||||
- `trajectory_match_mode: "strict" | "unordered" | "subset" | "superset"` — matching strategy
|
||||
- `tool_args_match_mode: "exact" | "ignore" | "subset" | "superset"` — tool argument comparison
|
||||
- `tool_args_match_overrides?: Record<string, ToolArgsMatchMode | Callable | string[]>` — per-tool custom matching
|
||||
|
||||
#### Scenario: Strict mode requires exact order
|
||||
|
||||
- **WHEN** output trajectory has tool calls `[A, B]` and reference is `[A, B]`
|
||||
- **THEN** strict mode SHALL return score `true`
|
||||
- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
|
||||
- **THEN** strict mode SHALL return score `false`
|
||||
|
||||
#### Scenario: Unordered mode ignores order
|
||||
|
||||
- **WHEN** output trajectory has tool calls `[B, A]` and reference is `[A, B]`
|
||||
- **THEN** unordered mode SHALL return score `true`
|
||||
|
||||
#### Scenario: Subset mode accepts partial trajectory
|
||||
|
||||
- **WHEN** output trajectory has tool calls `[A]` and reference is `[A, B]`
|
||||
- **THEN** subset mode SHALL return score `true`
|
||||
|
||||
#### Scenario: Superset mode allows extra tool calls
|
||||
|
||||
- **WHEN** output trajectory has tool calls `[A, B, C]` and reference is `[A, B]`
|
||||
- **THEN** superset mode SHALL return score `true`
|
||||
|
||||
#### Scenario: Tool args ignore mode skips argument comparison
|
||||
|
||||
- **WHEN** `tool_args_match_mode="ignore"` is set
|
||||
- **THEN** tool calls match regardless of their arguments
|
||||
|
||||
#### Scenario: Custom tool arg matcher is used
|
||||
|
||||
- **WHEN** `tool_args_match_overrides` contains a `Callable` for a tool name
|
||||
- **THEN** that callable SHALL be invoked to compare the tool's arguments
|
||||
|
||||
### Requirement: Trajectory LLM-as-judge
|
||||
|
||||
The system SHALL provide `create_trajectory_llm_as_judge()` that uses an LLM to grade trajectory quality and accuracy.
|
||||
|
||||
#### Scenario: Trajectory is formatted as XML for LLM
|
||||
|
||||
- **WHEN** an LLM trajectory evaluator is invoked
|
||||
- **THEN** the trajectory SHALL be formatted as XML with `<role>`, `<tool_call>`, `<tool_result>` elements
|
||||
Reference in New Issue
Block a user