Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
2.4 KiB
2.4 KiB
ADDED Requirements
Requirement: LLM-as-judge evaluator factory
The system SHALL provide a create_llm_as_judge() function that creates an evaluator using an LLM to assess output quality.
Parameters:
prompt: string | Runnable | Callable— evaluation prompt (string template with{inputs},{outputs},{reference_outputs})judge?: ModelClient | BaseChatModel— LLM clientmodel?: string— model identifiersystem?: string— optional system messagecontinuous: boolean = false— float 0-1 scoring when true, boolean when falsechoices?: number[]— specific enum float values for scoreuse_reasoning: boolean = true— include reasoning in outputfew_shot_examples?: FewShotExample[]— example evaluationsoutput_schema?: JSONSchema | ZodSchema— custom structured output format
Scenario: String prompt evaluator returns scored result
- WHEN
create_llm_as_judge(prompt="Rate: {outputs}")is invoked withoutputs="Hello world" - THEN it SHALL return an
EvaluatorResultwithkey: "score"and a validscore
Scenario: Continuous scoring returns float
- WHEN
create_llm_as_judge(prompt=..., continuous=true)scores output - THEN the score SHALL be a float between 0.0 and 1.0
Scenario: Choices scoring returns enum value
- WHEN
create_llm_as_judge(prompt=..., choices=[0.0, 0.5, 1.0])scores output - THEN the score SHALL be exactly one of the enumerated choices
Scenario: Reasoning mode returns comment
- WHEN
create_llm_as_judge(prompt=..., use_reasoning=true)scores output - THEN the
commentfield SHALL contain the LLM's reasoning
Scenario: Few-shot examples are appended to prompt
- WHEN
few_shot_examplesare provided - THEN they SHALL be appended as
<example>XML blocks to the last user message
Scenario: Output schema returns structured dict
- WHEN
output_schemais provided (e.g., z.object({ quality: z.number() })) - THEN the evaluator SHALL return a dict matching that schema instead of
EvaluatorResult
Requirement: Async LLM-as-judge
The system SHALL provide create_async_llm_as_judge() with identical parameters, returning an async evaluator.
Scenario: Async evaluator returns same structure as sync
- WHEN
awaitis used on an async evaluator invocation - THEN the result SHALL match the same structure as the sync equivalent