boocode/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/llm-as-judge/spec.md at c4ee377dbc2edd411ecc03779dae3aa49b14e7b6


indifferentketchup c935687725 chore(openspec): drop 9 superseded proposals + 11 stub archive files

Drop 9 batch proposals that are superseded by the boocode-lift-analysis
(boocontext-audit, conductor upgrades, self-healing/verify-gate skills):
add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform,
conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul,
agent-reliability.

Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only)
that provide zero documentation value over the existing CHANGELOG.md + git tags.

2026-06-07 22:15:38 +00:00

2.4 KiB


ADDED Requirements

Requirement: LLM-as-judge evaluator factory

The system SHALL provide a create_llm_as_judge() function that creates an evaluator using an LLM to assess output quality.

Parameters:

prompt: string | Runnable | Callable — evaluation prompt (string template with {inputs}, {outputs}, {reference_outputs})
judge?: ModelClient | BaseChatModel — LLM client
model?: string — model identifier
system?: string — optional system message
continuous: boolean = false — float 0-1 scoring when true, boolean when false
choices?: number[] — specific enum float values for score
use_reasoning: boolean = true — include reasoning in output
few_shot_examples?: FewShotExample[] — example evaluations
output_schema?: JSONSchema | ZodSchema — custom structured output format
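The parameter list above can be sketched as a TypeScript options type. This is an illustrative sketch only: the interface name `LLMAsJudgeOptions`, the helper `withDefaults`, and the use of `unknown` as a stand-in for `ModelClient`, `BaseChatModel`, `Runnable`, `JSONSchema`, and `ZodSchema` are assumptions, not part of the spec. Only the field names and the two defaults come from the list.

```typescript
// Hypothetical shape for the options accepted by create_llm_as_judge().
// Concrete client and schema types are replaced by `unknown` placeholders.
interface LLMAsJudgeOptions {
  prompt: string | ((vars: Record<string, unknown>) => string) | unknown;
  judge?: unknown; // ModelClient | BaseChatModel in the spec
  model?: string;
  system?: string;
  continuous?: boolean; // spec default: false
  choices?: number[];
  use_reasoning?: boolean; // spec default: true
  few_shot_examples?: Record<string, unknown>[]; // FewShotExample shape is not specified
  output_schema?: unknown; // JSONSchema | ZodSchema in the spec
}

// Fills in the two defaults stated in the parameter list.
function withDefaults(
  options: LLMAsJudgeOptions,
): LLMAsJudgeOptions & { continuous: boolean; use_reasoning: boolean } {
  return {
    ...options,
    continuous: options.continuous ?? false,
    use_reasoning: options.use_reasoning ?? true,
  };
}
```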

Scenario: String prompt evaluator returns scored result

WHEN create_llm_as_judge(prompt="Rate: {outputs}") is invoked with outputs="Hello world"
THEN it SHALL return an EvaluatorResult with key "score" and a boolean score, since continuous defaults to false and no choices are given
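How the string template might be filled before it is sent to the judge can be sketched as follows. The function name `formatPrompt` and the choice to JSON-encode non-string values are assumptions; the spec only names the three placeholders.

```typescript
// Values that may be substituted into a string prompt template.
type PromptVariables = {
  inputs?: unknown;
  outputs?: unknown;
  reference_outputs?: unknown;
};

// Replaces {inputs}, {outputs}, and {reference_outputs} in the template.
// Placeholders with no supplied value are left untouched (an assumption).
function formatPrompt(template: string, vars: PromptVariables): string {
  return template.replace(
    /\{(inputs|outputs|reference_outputs)\}/g,
    (match: string, name: string) => {
      const value = vars[name as keyof PromptVariables];
      if (value === undefined) return match;
      return typeof value === "string" ? value : JSON.stringify(value);
    },
  );
}

const rendered = formatPrompt("Rate: {outputs}", { outputs: "Hello world" });
// rendered === "Rate: Hello world"
```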

Scenario: Continuous scoring returns float

WHEN create_llm_as_judge(prompt=..., continuous=true) scores output
THEN the score SHALL be a float between 0.0 and 1.0

Scenario: Choices scoring returns enum value

WHEN create_llm_as_judge(prompt=..., choices=[0.0, 0.5, 1.0]) scores output
THEN the score SHALL be exactly one of the enumerated choices
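The scoring modes covered so far (boolean by default, float when continuous is true, enumerated when choices is given) can be checked with a small validator. This is a sketch under assumptions: `isValidScore` is a hypothetical helper, and giving choices precedence over continuous is a design guess, since the spec does not say how the two interact.

```typescript
interface ScoringMode {
  continuous?: boolean;
  choices?: number[];
}

// Returns true when `score` is acceptable for the given scoring mode.
function isValidScore(score: unknown, mode: ScoringMode = {}): boolean {
  if (mode.choices !== undefined) {
    // Enumerated mode: the score must be exactly one of the listed values.
    return typeof score === "number" && mode.choices.includes(score);
  }
  if (mode.continuous) {
    // Continuous mode: a float in the closed interval [0, 1].
    return typeof score === "number" && score >= 0 && score <= 1;
  }
  // Default mode: a boolean pass/fail score.
  return typeof score === "boolean";
}
```

A test harness for the scenarios above could call this on the `score` field of each returned result.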

Scenario: Reasoning mode returns comment

WHEN create_llm_as_judge(prompt=..., use_reasoning=true) scores output
THEN the comment field SHALL contain the LLM's reasoning
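One way the judge's reasoning could be carried into the comment field is sketched below. The `RawJudgeResponse` shape and the `toEvaluatorResult` helper are hypothetical; the spec only states that an EvaluatorResult has a key, a score, and (in reasoning mode) a comment.

```typescript
interface EvaluatorResult {
  key: string;
  score: number | boolean;
  comment?: string;
}

// Assumed shape of the judge's structured reply before post-processing.
interface RawJudgeResponse {
  score: number | boolean;
  reasoning?: string;
}

// Copies the judge's reasoning into `comment` only when reasoning mode is on.
function toEvaluatorResult(
  raw: RawJudgeResponse,
  useReasoning: boolean = true,
  key: string = "score",
): EvaluatorResult {
  const result: EvaluatorResult = { key, score: raw.score };
  if (useReasoning && raw.reasoning !== undefined) {
    result.comment = raw.reasoning;
  }
  return result;
}
```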

Scenario: Few-shot examples are appended to prompt

WHEN few_shot_examples are provided
THEN they SHALL be appended as <example> XML blocks to the last user message
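The appending step can be sketched as below. Only the outer `<example>` wrapper and the "last user message" target come from the spec; the `FewShotExample` fields, the inner tag names, and the function names are assumptions made for illustration.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assumed example shape; the spec does not define FewShotExample.
interface FewShotExample {
  inputs: string;
  outputs: string;
  score: number | boolean;
  reasoning?: string;
}

// Renders one example as an <example> block (inner tags are hypothetical).
function renderExample(ex: FewShotExample): string {
  const lines = ["<example>", `<input>${ex.inputs}</input>`, `<output>${ex.outputs}</output>`];
  if (ex.reasoning !== undefined) lines.push(`<reasoning>${ex.reasoning}</reasoning>`);
  lines.push(`<score>${ex.score}</score>`, "</example>");
  return lines.join("\n");
}

// Returns a new message list with the examples appended to the last user message.
function appendFewShotExamples(messages: ChatMessage[], examples: FewShotExample[]): ChatMessage[] {
  if (examples.length === 0) return messages;
  const lastUserIndex = messages.map((m) => m.role).lastIndexOf("user");
  if (lastUserIndex === -1) throw new Error("No user message to append examples to");
  const blocks = examples.map(renderExample).join("\n");
  return messages.map((m, i) =>
    i === lastUserIndex ? { ...m, content: `${m.content}\n\n${blocks}` } : m,
  );
}
```

The input array is left unmodified, so the same base messages can be reused across evaluator calls.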

Scenario: Output schema returns structured dict

WHEN output_schema is provided (e.g., z.object({ quality: z.number() }))
THEN the evaluator SHALL return a dict matching that schema instead of an EvaluatorResult
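The branch between the default result and a schema-shaped dict might look like the sketch below. To stay dependency-free it uses a hand-written validator behind a hypothetical `SchemaLike` interface with a single `parse` method (Zod schemas also expose `parse`, so one could be passed in its place). `finalizeOutput` and `qualitySchema` are illustrative names, not part of the spec.

```typescript
interface EvaluatorResult {
  key: string;
  score: number | boolean;
  comment?: string;
}

// Minimal validator contract assumed for this sketch.
interface SchemaLike<T> {
  parse(data: unknown): T;
}

// Returns the schema-validated dict when a schema is given, else an EvaluatorResult.
function finalizeOutput<T>(
  raw: Record<string, unknown>,
  outputSchema?: SchemaLike<T>,
): T | EvaluatorResult {
  if (outputSchema !== undefined) return outputSchema.parse(raw);
  const score = raw["score"];
  if (typeof score !== "number" && typeof score !== "boolean") {
    throw new Error("Judge response is missing a score");
  }
  return { key: "score", score };
}

// Hand-written stand-in for z.object({ quality: z.number() }).
const qualitySchema: SchemaLike<{ quality: number }> = {
  parse(data) {
    const quality = (data as { quality?: unknown } | null)?.quality;
    if (typeof quality !== "number") throw new Error("quality must be a number");
    return { quality };
  },
};
```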

Requirement: Async LLM-as-judge

The system SHALL provide create_async_llm_as_judge() with identical parameters, returning an async evaluator.

Scenario: Async evaluator returns same structure as sync

WHEN an async evaluator invocation is awaited
THEN the result SHALL have the same structure as the result of the sync equivalent
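The sync/async parity requirement can be illustrated with a stubbed judge. This sketch makes no LLM call: `stubScore` is a hypothetical stand-in for the judge, and the evaluator signatures are assumptions. It only shows that the awaited result has the same fields as the sync one.

```typescript
interface EvaluatorResult {
  key: string;
  score: number | boolean;
  comment?: string;
}

type EvalArgs = { inputs?: unknown; outputs?: unknown; reference_outputs?: unknown };
type SyncEvaluator = (args: EvalArgs) => EvaluatorResult;
type AsyncEvaluator = (args: EvalArgs) => Promise<EvaluatorResult>;

// Stub judge: passes any non-empty string output. Stands in for a real LLM call.
function stubScore(args: EvalArgs): EvaluatorResult {
  const passed = typeof args.outputs === "string" && args.outputs.length > 0;
  return {
    key: "score",
    score: passed,
    comment: passed ? "Output is non-empty." : "Output is empty.",
  };
}

const syncEvaluator: SyncEvaluator = stubScore;
// The async variant wraps the same logic in a Promise.
const asyncEvaluator: AsyncEvaluator = async (args) => stubScore(args);

// Usage: both calls yield an object with key, score, and comment.
asyncEvaluator({ outputs: "Hello world" }).then((result) => {
  console.log(result.key, result.score);
});
```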
