boocode/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/eval-prompt-library/spec.md at c11e26090fe2ca984cd54fecb147700e3773cfa3

indifferentketchup c935687725 chore(openspec): drop 9 superseded proposals + 11 stub archive files

Drop 9 batch proposals that are superseded by the boocode-lift-analysis
(boocontext-audit, conductor upgrades, self-healing/verify-gate skills):
add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform,
conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul,
agent-reliability.

Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only)
that provide zero documentation value over the existing CHANGELOG.md + git tags.

2026-06-07 22:15:38 +00:00

ADDED Requirements

Requirement: Built-in evaluation prompt templates

The system SHALL ship with a library of prompt templates organized by domain, ready for use with create_llm_as_judge().

Domains and included prompts:

Quality:

CORRECTNESS_PROMPT — factual accuracy and completeness
CONCISENESS_PROMPT — concise responses without hedging or fluff
HALLUCINATION_PROMPT — claims verifiable from context
ANSWER_RELEVANCE_PROMPT — output addresses the input question
PLAN_ADHERENCE_PROMPT — agent actions match declared plan
LAZINESS_PROMPT — detects blank or low-effort responses

RAG:

RAG_GROUNDEDNESS_PROMPT — output claims supported by retrieved context
RAG_HELPFULNESS_PROMPT — output addresses core question
RAG_RETRIEVAL_RELEVANCE_PROMPT — retrieved context is relevant to input

Safety:

TOXICITY_PROMPT — personal attacks, hate speech
FAIRNESS_PROMPT — stereotyping, discrimination

Security:

PII_LEAKAGE_PROMPT — names, contact info, credentials in output
PROMPT_INJECTION_PROMPT — delimiter manipulation, roleplay bypass
CODE_INJECTION_PROMPT — SQL injection, XSS, path traversal

Trajectory:

TRAJECTORY_ACCURACY_PROMPT — logical progression, goal alignment
TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE — semantically equivalent to reference
TOOL_SELECTION_PROMPT — right tools, right order, no redundant calls

Conversation:

USER_SATISFACTION_PROMPT — gratitude, resolution, engagement
TASK_COMPLETION_PROMPT — was the user's goal achieved
AGENT_TONE_PROMPT — appropriate tone and professionalism
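
As a rough sketch of the "organized by domain" requirement, the listing above could be mirrored in a lookup table. The `PROMPTS_BY_DOMAIN` mapping and the `domain_of` helper are hypothetical names used only for illustration; the spec fixes the prompt names and domains, not how they are stored.

```python
# Hypothetical registry: the spec lists the domains and prompt names,
# but does not prescribe this data structure.
PROMPTS_BY_DOMAIN: dict[str, tuple[str, ...]] = {
    "quality": (
        "CORRECTNESS_PROMPT",
        "CONCISENESS_PROMPT",
        "HALLUCINATION_PROMPT",
        "ANSWER_RELEVANCE_PROMPT",
        "PLAN_ADHERENCE_PROMPT",
        "LAZINESS_PROMPT",
    ),
    "rag": (
        "RAG_GROUNDEDNESS_PROMPT",
        "RAG_HELPFULNESS_PROMPT",
        "RAG_RETRIEVAL_RELEVANCE_PROMPT",
    ),
    "safety": ("TOXICITY_PROMPT", "FAIRNESS_PROMPT"),
    "security": (
        "PII_LEAKAGE_PROMPT",
        "PROMPT_INJECTION_PROMPT",
        "CODE_INJECTION_PROMPT",
    ),
    "trajectory": (
        "TRAJECTORY_ACCURACY_PROMPT",
        "TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE",
        "TOOL_SELECTION_PROMPT",
    ),
    "conversation": (
        "USER_SATISFACTION_PROMPT",
        "TASK_COMPLETION_PROMPT",
        "AGENT_TONE_PROMPT",
    ),
}


def domain_of(prompt_name: str) -> str:
    """Return the domain a prompt name is listed under, or raise KeyError."""
    for domain, names in PROMPTS_BY_DOMAIN.items():
        if prompt_name in names:
            return domain
    raise KeyError(prompt_name)
```

A table like this makes it cheap to assert that every listed prompt is actually exported by the library.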

Scenario: Each prompt is a string with {inputs}, {outputs}, {reference_outputs} placeholders

WHEN a prompt template is inspected
THEN it SHALL be a string compatible with str.format() that contains at least the {outputs} placeholder
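
One way this scenario might be checked is with the standard library's `string.Formatter`, which parses a template the same way `str.format()` does. The `placeholder_fields` and `check_placeholders` helpers and the `EXAMPLE_PROMPT` text are assumptions made for illustration, not part of the shipped library.

```python
from string import Formatter


def placeholder_fields(template: str) -> set[str]:
    """Collect the replacement-field names that str.format() would fill in."""
    # Formatter.parse raises ValueError on malformed templates (e.g. a lone "{"),
    # which is itself a sign the string is not str.format()-compatible.
    return {
        field
        for _, field, _, _ in Formatter().parse(template)
        if field is not None
    }


def check_placeholders(template: object) -> None:
    """Raise if the template is not a string or lacks the {outputs} field."""
    if not isinstance(template, str):
        raise TypeError("prompt template must be a str")
    if "outputs" not in placeholder_fields(template):
        raise ValueError("prompt template must contain {outputs}")


# Hypothetical template text, used only to exercise the check.
EXAMPLE_PROMPT = (
    "Question: {inputs}\n"
    "Answer: {outputs}\n"
    "Reference: {reference_outputs}\n"
)

check_placeholders(EXAMPLE_PROMPT)
rendered = EXAMPLE_PROMPT.format(
    inputs="What is 2 + 2?", outputs="4", reference_outputs="4"
)
```

Because `str.format()` ignores unused keyword arguments, a caller can always pass all three values even to a template that only uses `{outputs}`.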

Scenario: Prompt templates follow rubric structure

WHEN a prompt template is read
THEN it SHALL contain <Rubric>, <Instructions>, and <Reminder> XML sections
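
The rubric-structure scenario could be verified with a plain substring check, sketched below. It assumes each section is written as a matching `<Name>` ... `</Name>` pair, which the spec implies but does not spell out; `missing_sections` and the sample text are hypothetical.

```python
REQUIRED_SECTIONS = ("Rubric", "Instructions", "Reminder")


def missing_sections(template: str) -> list[str]:
    """Return the required section names that lack an open/close tag pair."""
    missing = []
    for name in REQUIRED_SECTIONS:
        start = template.find(f"<{name}>")
        end = template.find(f"</{name}>")
        # A section counts only if both tags exist and the opener comes first.
        if start == -1 or end == -1 or end < start:
            missing.append(name)
    return missing


# Hypothetical template text, used only to exercise the check.
SAMPLE_TEMPLATE = """<Rubric>
A good answer is accurate and complete.
</Rubric>
<Instructions>
Compare the answer against the rubric.
</Instructions>
<Reminder>
Judge only the answer shown.
</Reminder>
Answer: {outputs}
"""
```

Running `missing_sections` over every template in the library would give a single test covering this scenario.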
