chore(openspec): drop 9 superseded proposals + 11 stub archive files

Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
2026-06-07 22:15:38 +00:00
parent 0d6e9a2413
commit c935687725
119 changed files with 4897 additions and 45 deletions
--- a/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/eval-prompt-library/spec.md
+++ b/openspec/changes/archived/2026-06-07-eval-sandbox-agent-runtime/specs/eval-prompt-library/spec.md
@@ -0,0 +1,49 @@
+## ADDED Requirements
+
+### Requirement: Built-in evaluation prompt templates
+
+The system SHALL ship with a library of prompt templates organized by domain, ready for use with `create_llm_as_judge()`.
+
+Domains and included prompts:
+
+**Quality:**
+- `CORRECTNESS_PROMPT` — factual accuracy and completeness
+- `CONCISENESS_PROMPT` — concise responses without hedging or fluff
+- `HALLUCINATION_PROMPT` — claims verifiable from context
+- `ANSWER_RELEVANCE_PROMPT` — output addresses the input question
+- `PLAN_ADHERENCE_PROMPT` — agent actions match declared plan
+- `LAZINESS_PROMPT` — detects blank or low-effort responses
+
+**RAG:**
+- `RAG_GROUNDEDNESS_PROMPT` — output claims supported by retrieved context
+- `RAG_HELPFULNESS_PROMPT` — output addresses core question
+- `RAG_RETRIEVAL_RELEVANCE_PROMPT` — retrieved context is relevant to input
+
+**Safety:**
+- `TOXICITY_PROMPT` — personal attacks, hate speech
+- `FAIRNESS_PROMPT` — stereotyping, discrimination
+
+**Security:**
+- `PII_LEAKAGE_PROMPT` — names, contact info, credentials in output
+- `PROMPT_INJECTION_PROMPT` — delimiter manipulation, roleplay bypass
+- `CODE_INJECTION_PROMPT` — SQL injection, XSS, path traversal
+
+**Trajectory:**
+- `TRAJECTORY_ACCURACY_PROMPT` — logical progression, goal alignment
+- `TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE` — semantically equivalent to reference
+- `TOOL_SELECTION_PROMPT` — right tools, right order, no redundant calls
+
+**Conversation:**
+- `USER_SATISFACTION_PROMPT` — gratitude, resolution, engagement
+- `TASK_COMPLETION_PROMPT` — was the user's goal achieved
+- `AGENT_TONE_PROMPT` — appropriate tone and professionalism
+
+#### Scenario: Each prompt is a string with {inputs}, {outputs}, {reference_outputs} placeholders
+
+- **WHEN** a prompt template is inspected
+- **THEN** it SHALL be a string compatible with `str.format()` containing at least `{outputs}`; `{inputs}` and `{reference_outputs}` are optional
+
+#### Scenario: Prompt templates follow rubric structure
+
+- **WHEN** a prompt template is read
+- **THEN** it SHALL contain `<Rubric>`, `<Instructions>`, and `<Reminder>` XML sections
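
Review note: the two scenarios above could be checked mechanically. The following is a minimal sketch, not the project's implementation. `EXAMPLE_PROMPT`, `placeholder_names`, and `check_template` are hypothetical names introduced here for illustration, and the template text is invented rather than taken from the shipped library. It assumes each rubric section is written as a matching open/close tag pair.

```python
import re
from string import Formatter

# Hypothetical template written only to illustrate the two scenarios;
# it is NOT one of the shipped prompts.
EXAMPLE_PROMPT = """You are an expert evaluator.

<Rubric>
A concise answer states the requested information without filler.
</Rubric>

<Instructions>
Read the input and the output, then grade the output against the rubric.
</Instructions>

<Reminder>
Judge only conciseness, not correctness.
</Reminder>

<input>
{inputs}
</input>

<output>
{outputs}
</output>
"""

# Section names taken from the "rubric structure" scenario.
REQUIRED_SECTIONS = ("Rubric", "Instructions", "Reminder")


def placeholder_names(template: str) -> set[str]:
    """Return the named replacement fields that str.format() would fill."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def check_template(template: str) -> list[str]:
    """Return a list of problems; an empty list means both scenarios hold."""
    try:
        fields = placeholder_names(template)
    except ValueError as exc:  # e.g. an unbalanced brace
        return [f"not a valid format string: {exc}"]

    problems = []
    if "outputs" not in fields:
        problems.append("missing {outputs} placeholder")
    for section in REQUIRED_SECTIONS:
        # Assumption: a section counts only if it has both open and close tags.
        if not re.search(rf"<{section}>.*?</{section}>", template, re.DOTALL):
            problems.append(f"missing <{section}> section")
    return problems


if __name__ == "__main__":
    print(check_template(EXAMPLE_PROMPT))
    print(EXAMPLE_PROMPT.format(inputs="What is 2 + 2?", outputs="4"))
```

A check like this could run as a parametrized test over every exported `*_PROMPT` constant, so a new template that drops `{outputs}` or a rubric section fails before it reaches a judge call.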