Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
## ADDED Requirements

### Requirement: LLM-as-judge evaluator factory

The system SHALL provide a `create_llm_as_judge()` function that creates an evaluator using an LLM to assess output quality.

Parameters:
- `prompt: string | Runnable | Callable` — evaluation prompt (string template with `{inputs}`, `{outputs}`, `{reference_outputs}`)
- `judge?: ModelClient | BaseChatModel` — LLM client
- `model?: string` — model identifier
- `system?: string` — optional system message
- `continuous: boolean = false` — float 0-1 scoring when true, boolean when false
- `choices?: number[]` — specific enum float values for score
- `use_reasoning: boolean = true` — include reasoning in output
- `few_shot_examples?: FewShotExample[]` — example evaluations
- `output_schema?: JSONSchema | ZodSchema` — custom structured output format
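The parameter list above can be illustrated with a minimal sketch. It assumes the judge is a plain callable that receives the rendered prompt and replies with `<score>|<reasoning>`; that reply format, the `Judge` alias, and the shape of `EvaluatorResult` beyond `key`, `score`, and `comment` are hypothetical and not part of this requirement. Only string prompts are handled, and `few_shot_examples`, `system`, `model`, and `output_schema` are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Union

# Hypothetical stand-in for a judge client: rendered prompt in, raw text out.
Judge = Callable[[str], str]


@dataclass
class EvaluatorResult:
    key: str
    score: Union[bool, float]
    comment: Optional[str] = None


def create_llm_as_judge(
    prompt: str,
    judge: Judge,
    continuous: bool = False,
    choices: Optional[Sequence[float]] = None,
    use_reasoning: bool = True,
) -> Callable[..., EvaluatorResult]:
    """Build an evaluator around `judge` (string prompts only in this sketch)."""

    def evaluator(inputs: str = "", outputs: str = "",
                  reference_outputs: str = "") -> EvaluatorResult:
        # Fill the {inputs}, {outputs}, {reference_outputs} placeholders.
        rendered = prompt.format(
            inputs=inputs, outputs=outputs, reference_outputs=reference_outputs
        )
        # Assumed reply format: "<score>|<reasoning>".
        score_text, _, reasoning = judge(rendered).partition("|")
        value = float(score_text)
        # Numeric score for continuous/choices modes, boolean otherwise.
        numeric = continuous or choices is not None
        score = value if numeric else value >= 0.5
        comment = reasoning.strip() if use_reasoning else None
        return EvaluatorResult(key="score", score=score, comment=comment)

    return evaluator


# Usage with a fake judge, so no model call is needed.
fake_judge = lambda rendered: "1|greets the world"
evaluate = create_llm_as_judge("Rate: {outputs}", judge=fake_judge)
result = evaluate(outputs="Hello world")
```

Passing the judge as a callable keeps the sketch testable without a model; a real implementation would accept the `ModelClient | BaseChatModel` types listed above.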
#### Scenario: String prompt evaluator returns scored result

- **WHEN** the evaluator returned by `create_llm_as_judge(prompt="Rate: {outputs}")` is invoked with `outputs="Hello world"`
- **THEN** it SHALL return an `EvaluatorResult` with `key: "score"` and a valid `score`

#### Scenario: Continuous scoring returns float

- **WHEN** `create_llm_as_judge(prompt=..., continuous=true)` scores output
- **THEN** the score SHALL be a float between 0.0 and 1.0

#### Scenario: Choices scoring returns enum value

- **WHEN** `create_llm_as_judge(prompt=..., choices=[0.0, 0.5, 1.0])` scores output
- **THEN** the score SHALL be exactly one of the enumerated choices

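The continuous and choices scoring rules can be sketched as a single validation step applied to the judge's numeric output. The helper name `coerce_score` is hypothetical, and treating `continuous` and `choices` as mutually exclusive is an assumption this spec does not state.

```python
from typing import Optional, Sequence, Union


def coerce_score(
    raw: float,
    continuous: bool = False,
    choices: Optional[Sequence[float]] = None,
) -> Union[bool, float]:
    """Validate a judge's numeric score against the configured scoring mode."""
    if continuous and choices is not None:
        # Assumption: the two modes cannot be combined.
        raise ValueError("continuous and choices are mutually exclusive")
    if choices is not None:
        # Choices mode: the score must be exactly one of the listed values.
        if raw not in choices:
            raise ValueError(f"score {raw} is not one of {list(choices)}")
        return raw
    if continuous:
        # Continuous mode: any float in the closed interval [0.0, 1.0].
        if not 0.0 <= raw <= 1.0:
            raise ValueError(f"score {raw} is outside [0.0, 1.0]")
        return float(raw)
    # Default mode: boolean pass/fail.
    return bool(raw)
```

Rejecting out-of-range values instead of clamping them is a design choice of this sketch; it surfaces judge misbehavior rather than hiding it.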
#### Scenario: Reasoning mode returns comment

- **WHEN** `create_llm_as_judge(prompt=..., use_reasoning=true)` scores output
- **THEN** the `comment` field SHALL contain the LLM's reasoning

#### Scenario: Few-shot examples are appended to prompt

- **WHEN** `few_shot_examples` are provided
- **THEN** they SHALL be appended as `<example>` XML blocks to the last user message

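One way to read the few-shot scenario is sketched below. The message shape (`role`/`content` dicts), the helper name, and the inner `<input>`, `<output>`, and `<score>` tags are assumptions; the spec only fixes the outer `<example>` block and the target (the last user message).

```python
from typing import Dict, List


def append_few_shot_examples(
    messages: List[Dict[str, str]],
    examples: List[Dict[str, object]],
) -> List[Dict[str, str]]:
    """Return a copy of `messages` with examples appended to the last user message."""
    updated = [dict(m) for m in messages]  # shallow copies; input stays untouched
    if not examples:
        return updated

    # Render each example as one <example> XML block (inner tags are hypothetical).
    blocks = [
        "<example>\n"
        f"<input>{ex['inputs']}</input>\n"
        f"<output>{ex['outputs']}</output>\n"
        f"<score>{ex['score']}</score>\n"
        "</example>"
        for ex in examples
    ]

    # Walk backwards to find the last user message.
    for message in reversed(updated):
        if message["role"] == "user":
            message["content"] += "\n\n" + "\n".join(blocks)
            return updated
    raise ValueError("no user message to append examples to")
```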
#### Scenario: Output schema returns structured dict

- **WHEN** `output_schema` is provided (e.g., `z.object({ quality: z.number() })`)
- **THEN** the evaluator SHALL return a dict matching that schema instead of `EvaluatorResult`

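The switch between the default result and a schema-shaped dict might look like the sketch below. It makes strong simplifications: the schema is a plain `{field: type}` mapping rather than JSON Schema or Zod, the judge is a hypothetical callable that returns JSON text, and the default result is shown as a dict rather than an `EvaluatorResult` object.

```python
import json
from typing import Any, Callable, Dict, Optional


def make_evaluator(
    judge: Callable[[str], str],
    output_schema: Optional[Dict[str, type]] = None,
) -> Callable[[str], Dict[str, Any]]:
    """Return an evaluator whose result shape depends on `output_schema`."""

    def evaluator(outputs: str) -> Dict[str, Any]:
        parsed = json.loads(judge(outputs))
        if output_schema is None:
            # Default shape, mirroring the key/score pair of EvaluatorResult.
            return {"key": "score", "score": parsed["score"]}
        # Schema mode: check each declared field, drop anything undeclared.
        for field, expected in output_schema.items():
            if not isinstance(parsed.get(field), expected):
                raise TypeError(f"field {field!r} is missing or not {expected.__name__}")
        return {field: parsed[field] for field in output_schema}

    return evaluator
```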
### Requirement: Async LLM-as-judge

The system SHALL provide `create_async_llm_as_judge()`, which accepts the same parameters as `create_llm_as_judge()` and returns an async evaluator.

#### Scenario: Async evaluator returns same structure as sync

- **WHEN** an async evaluator invocation is awaited
- **THEN** the result SHALL have the same structure as the result of the sync equivalent

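The async requirement can be sketched with `asyncio`. The judge here is a hypothetical coroutine that returns a bare numeric string, and the result is shown as a dict with the same `key`/`score` fields a sync evaluator would produce; neither detail is fixed by this spec.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical async judge: rendered prompt in, raw score text out.
AsyncJudge = Callable[[str], Awaitable[str]]


def create_async_llm_as_judge(prompt: str, judge: AsyncJudge):
    """Build an async evaluator; the result shape mirrors the sync variant."""

    async def evaluator(inputs: str = "", outputs: str = "",
                        reference_outputs: str = "") -> Dict[str, object]:
        rendered = prompt.format(
            inputs=inputs, outputs=outputs, reference_outputs=reference_outputs
        )
        raw = await judge(rendered)  # the only await point in this sketch
        return {"key": "score", "score": float(raw) >= 0.5}

    return evaluator


async def fake_judge(rendered: str) -> str:
    await asyncio.sleep(0)  # stands in for network latency
    return "1.0"


evaluate = create_async_llm_as_judge("Rate: {outputs}", fake_judge)
result = asyncio.run(evaluate(outputs="Hello world"))
```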