## ADDED Requirements

### Requirement: LLM-as-judge evaluator factory

The system SHALL provide a `create_llm_as_judge()` function that creates an evaluator using an LLM to assess output quality.

Parameters:

- `prompt: string | Runnable | Callable`: evaluation prompt (string template with `{inputs}`, `{outputs}`, `{reference_outputs}`)
- `judge?: ModelClient | BaseChatModel`: LLM client
- `model?: string`: model identifier
- `system?: string`: optional system message
- `continuous: boolean = false`: float 0-1 scoring when true, boolean when false
- `choices?: number[]`: specific enum float values for score
- `use_reasoning: boolean = true`: include reasoning in output
- `few_shot_examples?: FewShotExample[]`: example evaluations
- `output_schema?: JSONSchema | ZodSchema`: custom structured output format

#### Scenario: String prompt evaluator returns scored result

- **WHEN** `create_llm_as_judge(prompt="Rate: {outputs}")` is invoked with `outputs="Hello world"`
- **THEN** it SHALL return an `EvaluatorResult` with `key: "score"` and a valid `score`

#### Scenario: Continuous scoring returns float

- **WHEN** `create_llm_as_judge(prompt=..., continuous=true)` scores output
- **THEN** the score SHALL be a float between 0.0 and 1.0

#### Scenario: Choices scoring returns enum value

- **WHEN** `create_llm_as_judge(prompt=..., choices=[0.0, 0.5, 1.0])` scores output
- **THEN** the score SHALL be exactly one of the enumerated choices

#### Scenario: Reasoning mode returns comment

- **WHEN** `create_llm_as_judge(prompt=..., use_reasoning=true)` scores output
- **THEN** the `comment` field SHALL contain the LLM's reasoning

#### Scenario: Few-shot examples are appended to prompt

- **WHEN** `few_shot_examples` are provided
- **THEN** they SHALL be appended as XML blocks to the last user message

#### Scenario: Output schema returns structured dict

- **WHEN** `output_schema` is provided (e.g., `z.object({ quality: z.number() })`)
- **THEN** the evaluator SHALL return a dict matching that schema instead of `EvaluatorResult`

### Requirement: Async LLM-as-judge

The system SHALL provide `create_async_llm_as_judge()` with identical parameters, returning an async evaluator.

#### Scenario: Async evaluator returns same structure as sync

- **WHEN** `await` is used on an async evaluator invocation
- **THEN** the result SHALL match the same structure as the sync equivalent
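
The scoring scenarios (boolean, continuous, choices, reasoning) and the async parity requirement can be illustrated with a small Python sketch. This is not the real implementation: `create_llm_as_judge_sketch`, `create_async_llm_as_judge_sketch`, and `fake_judge` are hypothetical names, the judge is assumed to be a plain callable returning a dict with `score` and `reasoning`, and giving `choices` precedence over `continuous` is an assumption the requirements do not state. Few-shot examples and `output_schema` are left out.

```python
import asyncio
from typing import Any, Callable, Optional, Sequence

# Hypothetical judge type: receives the rendered prompt and returns a dict
# holding a raw "score" and, optionally, a "reasoning" string.
Judge = Callable[[str], dict]


def create_llm_as_judge_sketch(
    *,
    prompt: str,
    judge: Judge,
    continuous: bool = False,
    choices: Optional[Sequence[float]] = None,
    use_reasoning: bool = True,
) -> Callable[..., dict]:
    """Build an evaluator that enforces the scoring rules from the scenarios."""

    def evaluator(
        *, inputs: Any = None, outputs: Any = None, reference_outputs: Any = None
    ) -> dict:
        # Fill the string template; placeholders absent from the template are ignored.
        rendered = prompt.format(
            inputs=inputs, outputs=outputs, reference_outputs=reference_outputs
        )
        raw = judge(rendered)
        score = raw["score"]

        if choices is not None:
            # Assumption: choices, when given, take precedence over continuous.
            if score not in choices:
                raise ValueError(f"score {score!r} is not one of {list(choices)}")
        elif continuous:
            score = float(score)
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"score {score} is outside [0.0, 1.0]")
        else:
            score = bool(score)

        comment = raw.get("reasoning") if use_reasoning else None
        return {"key": "score", "score": score, "comment": comment}

    return evaluator


def create_async_llm_as_judge_sketch(**factory_kwargs: Any) -> Callable[..., Any]:
    """Async variant: same parameters, same result structure."""
    sync_evaluator = create_llm_as_judge_sketch(**factory_kwargs)

    async def evaluator(**call_kwargs: Any) -> dict:
        # Run the blocking evaluator in a worker thread (Python 3.9+).
        return await asyncio.to_thread(sync_evaluator, **call_kwargs)

    return evaluator


def fake_judge(rendered_prompt: str) -> dict:
    # Stand-in for a real model call; always returns the same verdict.
    return {"score": 0.5, "reasoning": f"Saw prompt: {rendered_prompt}"}


evaluate = create_llm_as_judge_sketch(
    prompt="Rate: {outputs}", judge=fake_judge, choices=[0.0, 0.5, 1.0]
)
result = evaluate(outputs="Hello world")
print(result)
```

With the stub judge, `result["score"]` is one of the enumerated choices and `result["comment"]` carries the judge's reasoning. Swapping in `continuous=True` or the default boolean mode exercises the other two scoring scenarios against the same stub.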