boocode/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/llm-as-judge/spec.md


ADDED Requirements

Requirement: LLM-as-judge evaluator factory

The system SHALL provide a create_llm_as_judge() function that creates an evaluator using an LLM to assess output quality.

Parameters:

  • prompt: string | Runnable | Callable — evaluation prompt (string template with {inputs}, {outputs}, {reference_outputs})
  • judge?: ModelClient | BaseChatModel — LLM client
  • model?: string — model identifier
  • system?: string — optional system message
  • continuous: boolean = false — float 0-1 scoring when true, boolean when false
  • choices?: number[] — specific enum float values for score
  • use_reasoning: boolean = true — include reasoning in output
  • few_shot_examples?: FewShotExample[] — example evaluations
  • output_schema?: JSONSchema | ZodSchema — custom structured output format
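The parameter list above could be written down as a TypeScript options type. This is only a sketch: the spec names ModelClient, FewShotExample, and JSONSchema without defining them, so the shapes below are placeholders, and withDefaults is a hypothetical helper that only applies the two defaults the spec states.

```typescript
// Placeholder types: the spec names these but does not define their shape.
type ModelClient = unknown;
type JSONSchema = Record<string, unknown>;

// Assumed shape of a few-shot example (not given in the spec).
interface FewShotExample {
  inputs: unknown;
  outputs: unknown;
  score: number | boolean;
  reasoning?: string;
}

// Options mirroring the parameter list. Runnable and ZodSchema are omitted
// to keep the sketch dependency-free.
interface LLMAsJudgeOptions {
  prompt: string | ((args: Record<string, unknown>) => string);
  judge?: ModelClient;
  model?: string;
  system?: string;
  continuous?: boolean; // spec default: false
  choices?: number[];
  use_reasoning?: boolean; // spec default: true
  few_shot_examples?: FewShotExample[];
  output_schema?: JSONSchema;
}

type ResolvedOptions = LLMAsJudgeOptions & {
  continuous: boolean;
  use_reasoning: boolean;
};

// Hypothetical helper: fill in the defaults stated in the spec.
function withDefaults(opts: LLMAsJudgeOptions): ResolvedOptions {
  return {
    ...opts,
    continuous: opts.continuous ?? false,
    use_reasoning: opts.use_reasoning ?? true,
  };
}
```

Resolving defaults once, up front, keeps the later scoring and formatting steps free of repeated undefined checks.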

Scenario: String prompt evaluator returns scored result

  • WHEN the evaluator created by create_llm_as_judge(prompt="Rate: {outputs}") is invoked with outputs="Hello world"
  • THEN it SHALL return an EvaluatorResult with key: "score" and a boolean score (the default, since continuous is false and choices is not set)
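For a string prompt, the {inputs}, {outputs}, and {reference_outputs} placeholders have to be filled in before the judge is called. A minimal sketch of that substitution follows; formatPrompt is a hypothetical helper name, and serializing non-string values as JSON is an assumption the spec does not state.

```typescript
type PromptVars = {
  inputs?: unknown;
  outputs?: unknown;
  reference_outputs?: unknown;
};

// Replace the three placeholders named in the spec. Placeholders without a
// supplied value are left untouched (an assumption, not a spec rule).
function formatPrompt(template: string, vars: PromptVars): string {
  return template.replace(
    /\{(inputs|outputs|reference_outputs)\}/g,
    (match: string, name: keyof PromptVars) => {
      const value = vars[name];
      if (value === undefined) return match;
      // Strings are inserted as-is; other values are serialized as JSON.
      return typeof value === "string" ? value : JSON.stringify(value);
    }
  );
}

// Mirrors the scenario above.
const rendered = formatPrompt("Rate: {outputs}", { outputs: "Hello world" });
// rendered === "Rate: Hello world"
```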

Scenario: Continuous scoring returns float

  • WHEN create_llm_as_judge(prompt=..., continuous=true) scores output
  • THEN the score SHALL be a float between 0.0 and 1.0

Scenario: Choices scoring returns enum value

  • WHEN create_llm_as_judge(prompt=..., choices=[0.0, 0.5, 1.0]) scores output
  • THEN the score SHALL be exactly one of the enumerated choices
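The three scoring modes (boolean by default, float when continuous is true, enumerated when choices is set) can be expressed as a single validity check. The sketch below makes two assumptions that the spec leaves open: choices takes precedence over continuous when both are given, and the 0.0 to 1.0 range is inclusive. isValidScore is a hypothetical name.

```typescript
type Score = number | boolean;

interface ScoringMode {
  continuous?: boolean;
  choices?: number[];
}

// Check a score against the active scoring mode.
function isValidScore(score: Score, mode: ScoringMode): boolean {
  if (mode.choices !== undefined) {
    // Assumption: choices wins over continuous when both are set.
    return typeof score === "number" && mode.choices.indexOf(score) !== -1;
  }
  if (mode.continuous) {
    // Assumption: bounds are inclusive. NaN fails both comparisons.
    return typeof score === "number" && score >= 0 && score <= 1;
  }
  // Default mode: boolean pass/fail.
  return typeof score === "boolean";
}
```

A check like this could run on the judge's raw reply before the evaluator result is built, so that an out-of-range score is rejected rather than passed through.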

Scenario: Reasoning mode returns comment

  • WHEN create_llm_as_judge(prompt=..., use_reasoning=true) scores output
  • THEN the comment field SHALL contain the LLM's reasoning
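One way to read the scenarios so far is that the judge's raw reply is mapped onto an EvaluatorResult with key, score, and comment fields. The sketch below assumes the raw reply carries score and reasoning fields; that shape and the name toEvaluatorResult are illustrative, not taken from the spec.

```typescript
interface EvaluatorResult {
  key: string;
  score: number | boolean;
  comment?: string;
}

// Assumed shape of the judge's structured reply.
interface JudgeReply {
  score: number | boolean;
  reasoning?: string;
}

// Map a judge reply to an EvaluatorResult. The comment is populated only
// when reasoning is enabled and the judge actually supplied some.
function toEvaluatorResult(reply: JudgeReply, useReasoning: boolean = true): EvaluatorResult {
  const result: EvaluatorResult = { key: "score", score: reply.score };
  if (useReasoning && reply.reasoning !== undefined) {
    result.comment = reply.reasoning;
  }
  return result;
}
```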

Scenario: Few-shot examples are appended to prompt

  • WHEN few_shot_examples are provided
  • THEN they SHALL be appended as <example> XML blocks to the last user message
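Appending few-shot examples to the last user message might look like the sketch below. The spec only fixes the outer <example> tag; the inner tag names, the ChatMessage and FewShotExample shapes, and the function names are assumptions made for illustration.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assumed example shape (not defined in the spec).
interface FewShotExample {
  inputs: unknown;
  outputs: unknown;
  score: number | boolean;
  reasoning?: string;
}

// Render one example as an <example> block. Inner tags are hypothetical.
function renderExample(ex: FewShotExample): string {
  const lines = [
    "<example>",
    `<input>${JSON.stringify(ex.inputs)}</input>`,
    `<output>${JSON.stringify(ex.outputs)}</output>`,
  ];
  if (ex.reasoning !== undefined) {
    lines.push(`<reasoning>${ex.reasoning}</reasoning>`);
  }
  lines.push(`<score>${String(ex.score)}</score>`, "</example>");
  return lines.join("\n");
}

// Return a new message list with the examples appended to the last user
// message. The input array is not mutated.
function appendFewShotExamples(messages: ChatMessage[], examples: FewShotExample[]): ChatMessage[] {
  if (examples.length === 0) return messages;
  const lastUser = messages.map((m) => m.role).lastIndexOf("user");
  if (lastUser === -1) throw new Error("no user message to append examples to");
  const blocks = examples.map(renderExample).join("\n");
  return messages.map((m, i) =>
    i === lastUser ? { ...m, content: `${m.content}\n\n${blocks}` } : m
  );
}
```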

Scenario: Output schema returns structured dict

  • WHEN output_schema is provided (e.g., z.object({ quality: z.number() }))
  • THEN the evaluator SHALL return a dict matching that schema instead of EvaluatorResult
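To illustrate what "matching that schema" could mean, the sketch below checks a returned object against a very small subset of JSON Schema (flat object, primitive property types, required keys). It is a hand-rolled stand-in under those assumptions, not Zod and not a full JSON Schema validator; ObjectSchema and matchesSchema are hypothetical names.

```typescript
type PrimitiveType = "number" | "string" | "boolean";

// A deliberately tiny schema subset, for illustration only.
interface ObjectSchema {
  type: "object";
  properties: Record<string, { type: PrimitiveType }>;
  required?: string[];
}

// True when every required key is present and every present, declared
// property has the declared primitive type.
function matchesSchema(value: Record<string, unknown>, schema: ObjectSchema): boolean {
  const required = schema.required ?? [];
  const hasRequired = required.every((key) => key in value);
  const typesOk = Object.keys(schema.properties).every((key) => {
    const prop = schema.properties[key];
    return prop === undefined || !(key in value) || typeof value[key] === prop.type;
  });
  return hasRequired && typesOk;
}

// Roughly the JSON Schema counterpart of z.object({ quality: z.number() }).
const qualitySchema: ObjectSchema = {
  type: "object",
  properties: { quality: { type: "number" } },
  required: ["quality"],
};
```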

Requirement: Async LLM-as-judge

The system SHALL provide create_async_llm_as_judge() with identical parameters, returning an async evaluator.

Scenario: Async evaluator returns same structure as sync

  • WHEN an async evaluator invocation is awaited
  • THEN the result SHALL have the same structure as the result of the sync equivalent
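The sync/async parity requirement can be sketched with two small factories that share one result-building function, so both paths produce the same structure by construction. The factory names, the stub judge signature, and the hard-coded "Rate:" prompt are all hypothetical; they stand in for create_llm_as_judge() and create_async_llm_as_judge(), whose internals the spec does not describe.

```typescript
interface EvaluatorResult {
  key: string;
  score: number | boolean;
  comment?: string;
}

// Assumed shape of a judge reply.
interface JudgeReply {
  score: number | boolean;
  reasoning: string;
}

// Shared mapping used by both the sync and async paths.
function toResult(reply: JudgeReply): EvaluatorResult {
  return { key: "score", score: reply.score, comment: reply.reasoning };
}

// Hypothetical sync factory: the judge is any function from prompt to reply.
function createSyncJudge(judge: (prompt: string) => JudgeReply) {
  return (args: { outputs: string }): EvaluatorResult =>
    toResult(judge(`Rate: ${args.outputs}`));
}

// Hypothetical async factory: same inputs, but the judge returns a Promise.
function createAsyncJudge(judge: (prompt: string) => Promise<JudgeReply>) {
  return async (args: { outputs: string }): Promise<EvaluatorResult> =>
    toResult(await judge(`Rate: ${args.outputs}`));
}
```

Routing both variants through the same toResult function is one simple way to keep the two result structures from drifting apart.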