Files
boocode/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/llm-as-judge/spec.md
indifferentketchup c935687725 chore(openspec): drop 9 superseded proposals + 11 stub archive files
Drop 9 batch proposals that are superseded by the boocode-lift-analysis
(boocontext-audit, conductor upgrades, self-healing/verify-gate skills):
add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform,
conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul,
agent-reliability.

Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only)
that provide zero documentation value over the existing CHANGELOG.md + git tags.
2026-06-07 22:15:38 +00:00

56 lines
2.4 KiB
Markdown

## ADDED Requirements
### Requirement: LLM-as-judge evaluator factory
The system SHALL provide a `create_llm_as_judge()` function that creates an evaluator using an LLM to assess output quality.
Parameters:
- `prompt: string | Runnable | Callable` — evaluation prompt (string template with `{inputs}`, `{outputs}`, `{reference_outputs}`)
- `judge?: ModelClient | BaseChatModel` — LLM client
- `model?: string` — model identifier
- `system?: string` — optional system message
- `continuous: boolean = false` — float 0-1 scoring when true, boolean when false
- `choices?: number[]` — specific enum float values for score
- `use_reasoning: boolean = true` — include reasoning in output
- `few_shot_examples?: FewShotExample[]` — example evaluations
- `output_schema?: JSONSchema | ZodSchema` — custom structured output format
#### Scenario: String prompt evaluator returns scored result
- **WHEN** `create_llm_as_judge(prompt="Rate: {outputs}")` is invoked with `outputs="Hello world"`
- **THEN** it SHALL return an `EvaluatorResult` with `key: "score"` and a valid `score`
#### Scenario: Continuous scoring returns float
- **WHEN** `create_llm_as_judge(prompt=..., continuous=true)` scores output
- **THEN** the score SHALL be a float between 0.0 and 1.0
#### Scenario: Choices scoring returns enum value
- **WHEN** `create_llm_as_judge(prompt=..., choices=[0.0, 0.5, 1.0])` scores output
- **THEN** the score SHALL be exactly one of the enumerated choices
#### Scenario: Reasoning mode returns comment
- **WHEN** `create_llm_as_judge(prompt=..., use_reasoning=true)` scores output
- **THEN** the `comment` field SHALL contain the LLM's reasoning
#### Scenario: Few-shot examples are appended to prompt
- **WHEN** `few_shot_examples` are provided
- **THEN** they SHALL be appended as `<example>` XML blocks to the last user message
#### Scenario: Output schema returns structured dict
- **WHEN** `output_schema` is provided (e.g., z.object({ quality: z.number() }))
- **THEN** the evaluator SHALL return a dict matching that schema instead of `EvaluatorResult`
### Requirement: Async LLM-as-judge
The system SHALL provide `create_async_llm_as_judge()` with identical parameters, returning an async evaluator.
#### Scenario: Async evaluator returns same structure as sync
- **WHEN** `await` is used on an async evaluator invocation
- **THEN** the result SHALL match the same structure as the sync equivalent