boocode/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/llm-as-judge/spec.md

## ADDED Requirements

### Requirement: LLM-as-judge evaluator factory

The system SHALL provide a `create_llm_as_judge()` function that creates an evaluator using an LLM to assess output quality.

Parameters:
- `prompt: string | Runnable | Callable` — evaluation prompt (string template with `{inputs}`, `{outputs}`, `{reference_outputs}`)
- `judge?: ModelClient | BaseChatModel` — LLM client
- `model?: string` — model identifier
- `system?: string` — optional system message
- `continuous: boolean = false` — float 0-1 scoring when true, boolean when false
- `choices?: number[]` — specific enum float values for score
- `use_reasoning: boolean = true` — include reasoning in output
- `few_shot_examples?: FewShotExample[]` — example evaluations
- `output_schema?: JSONSchema | ZodSchema` — custom structured output format

#### Scenario: String prompt evaluator returns scored result

- **WHEN** `create_llm_as_judge(prompt="Rate: {outputs}")` is invoked with `outputs="Hello world"`
- **THEN** it SHALL return an `EvaluatorResult` with `key: "score"` and a valid `score`

#### Scenario: Continuous scoring returns float

- **WHEN** `create_llm_as_judge(prompt=..., continuous=true)` scores output
- **THEN** the score SHALL be a float between 0.0 and 1.0

#### Scenario: Choices scoring returns enum value

- **WHEN** `create_llm_as_judge(prompt=..., choices=[0.0, 0.5, 1.0])` scores output
- **THEN** the score SHALL be exactly one of the enumerated choices

#### Scenario: Reasoning mode returns comment

- **WHEN** `create_llm_as_judge(prompt=..., use_reasoning=true)` scores output
- **THEN** the `comment` field SHALL contain the LLM's reasoning

#### Scenario: Few-shot examples are appended to prompt

- **WHEN** `few_shot_examples` are provided
- **THEN** they SHALL be appended as `<example>` XML blocks to the last user message

#### Scenario: Output schema returns structured dict

- **WHEN** `output_schema` is provided (e.g., z.object({ quality: z.number() }))
- **THEN** the evaluator SHALL return a dict matching that schema instead of `EvaluatorResult`

### Requirement: Async LLM-as-judge

The system SHALL provide `create_async_llm_as_judge()` with identical parameters, returning an async evaluator.

#### Scenario: Async evaluator returns same structure as sync

- **WHEN** `await` is used on an async evaluator invocation
- **THEN** the result SHALL match the same structure as the sync equivalent