v1.13.12: skills audit + token-tracking fix + codecontext + cap50 + UI cleanups

Multi-topic batch. The big-ticket item is the skills audit; the rest are
smaller patches that compounded during the audit work.

## Skills audit (rules→recipes split)

Vendored all 26 skills from /home/samkintop/opt/skills/ into data/skills/
(the boocode-repo-local skill library; see docker-compose change below).
Audited via 5 parallel Claude Code agent-teams running the
mgechev/skills-best-practices 4-step protocol (Discovery → Logic → Edge
Case → self-Architecture-Refinement) per skill, ~2 min wall-clock vs the
~3.7-hour serial estimate.

Result: 14 skills surviving (renamed to gerund form, frontmatter
matched), 11 deleted (duplicates, BooCode-irrelevant patterns,
Claude-already-does-natively), 1 migrated to BOOCHAT.md/BOOCODER.md as
an always-true rule (verification-before-completion). Each surviving
skill had its description refined to fix specific trigger gaps surfaced
by the protocol; 4 real-bug findings landed (dead refs, stale tags,
broken sub-file references in the original vendored content).

Audit decisions documented in
openspec/changes/v1.13.12-skills-audit/audit-notes.md. Convention
codified in BOOCHAT.md/BOOCODER.md "rules vs recipes" sections: future
workflow rules go to those files (100% present), recipes stay in
data/skills/ (~6% invoke rate in multi-turn per the Codeminer42
measurement).

## Token tracking + stale-stream banner fix (same root cause)

ws-frames.ts IsoTimestamp was z.string().min(1) but postgres returns
timestamp columns as JS Date objects. Every message_complete /
session_updated / chat_updated frame was failing the v1.13.11 Zod gate
and being silently dropped.

Symptoms: token tracking blank in the UI (no usage frames landed); the
60s no-token-activity timer tripped the stale-stream banner because the
frontend's local message state never saw status='streaming' flip to
'complete'.

Fix:
z.preprocess(v => v instanceof Date ? v.toISOString() : v, z.string().min(1))
applied to the IsoTimestamp primitive. Centralized, no publisher
changes, works identically server + web (the parity test still passes).

## Codecontext .codecontextignore auto-install

services/codecontext_client.ts now copies the
codecontext/.codecontextignore.template into any project's root on the
first call to that project if no .codecontextignore exists. One file
written per project, idempotent (in-memory Set guard + access-check),
silent fallback on read-only project. Stops the upstream
empty-source-file parser crash on foreign projects' node_modules;
previously required manually copying the template per project.

## Tool-call budget cap 30 → 50

services/inference/budget.ts: BUDGET_READ_ONLY and BUDGET_NO_AGENT
bumped to 50 (from 30). BUDGET_NON_READ_ONLY stays at 10 (no write tools
landed yet). Real recon sessions were hitting 30 with ~3 turns wasted on
codecontext parse failures; legitimate need was ~27, and Architect-class
system overviews want deeper recon. Headroom of 20 absorbs failure-retry
turns without changing the safety floor: the doom-loop guard (3
identical calls → abort) catches the actual failure mode this cap was
guarding against. v1.14 (Phase C outer agent loop) will supersede this
via per-agent agent.steps. Throwaway-ish patch but unblocks deeper recon
today.

## UI cleanups

- ChatPane queued-message dropdown removed. Each queued message now has
  three buttons: edit (pop back into ChatInput via sendToChat event),
  force-send (was the dropdown's only useful action), and cancel.
  Default behavior (send when streaming completes) needs no UI; it's the
  implicit do-nothing path.
- ChatThroughput removed from desktop tab strip (ChatTabBar.tsx). Mobile
  tab switcher still shows it.

## Plumbing

- .gitignore: data/* + !data/AGENTS.md + !data/skills/ negation patterns
  so the vendored skill library + agent registry become git-tracked
  while session DB state stays out.
- docker-compose.yml: removed /opt/skills:/data/skills override mount.
  Skills now live in the boocode repo at data/skills/, auditable
  per-batch. The host-level /opt/skills/ is preserved untouched for any
  other tools that read from it.
- .codecontextignore at repo root: auto-installed when codecontext was
  first called against /opt/boocode itself; matches the template.
- CLAUDE.md: updated to document the v1.13.11 publishFrame wrapper +
  message_parts table + tool_cost_stats view + DB-integration test
  pattern + host-side smoke endpoint quirk. (Pre-existing in working
  tree before this batch; shipped here for completeness.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
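
The IsoTimestamp fix in the commit message above comes down to one normalization step that runs before the string check. The sketch below imitates that step without importing Zod, so it can run on its own. The function names `normalizeIsoTimestamp`, `isNonEmptyString`, and `parseIsoTimestamp` are hypothetical stand-ins, not the identifiers used in ws-frames.ts, and the error message is invented for illustration.

```typescript
// Hypothetical stand-in for the callback passed to z.preprocess:
// Date objects (as the postgres driver returns them) become ISO-8601
// strings; every other value passes through unchanged.
function normalizeIsoTimestamp(value: unknown): unknown {
  return value instanceof Date ? value.toISOString() : value;
}

// Hypothetical stand-in for the z.string().min(1) check that runs
// after preprocessing.
function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.length >= 1;
}

// Normalize first, validate second: the same order z.preprocess enforces.
function parseIsoTimestamp(value: unknown): string {
  const normalized = normalizeIsoTimestamp(value);
  if (!isNonEmptyString(normalized)) {
    throw new Error("IsoTimestamp: expected a non-empty string or a Date");
  }
  return normalized;
}

// A Date from the database and a string from the wire end up identical.
const fromDb = parseIsoTimestamp(new Date(Date.UTC(2026, 4, 22, 18, 58, 30)));
const fromWire = parseIsoTimestamp("2026-05-22T18:58:30.000Z");
console.log(fromDb === fromWire); // prints true
```

Putting the coercion inside the shared primitive, rather than at each publisher, is what lets the server and web sides keep validating against one schema.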
2026-05-22 18:58:30 +00:00
parent bc376c878d
commit 0fa46cd06c
80 changed files with 6950 additions and 39 deletions
--- a/openspec/changes/v1.13.12-skills-audit/audit-notes.md
+++ b/openspec/changes/v1.13.12-skills-audit/audit-notes.md
@@ -0,0 +1,132 @@
+# v1.13.12 — skills audit pass
+
+Audit of 26 skills vendored from `/home/samkintop/opt/skills/` into `/opt/boocode/data/skills/`. Each skill was sorted into one of four buckets per the Codeminer42 rules→recipes split.
+
+## Deviations from the batch spec
+
+| Spec said | Reality | Resolution |
+|---|---|---|
+| `/opt/boocode/skills/` is the audit target | Skills directory is `/opt/boocode/data/skills/` (per `services/skills.ts:19` `SKILLS_ROOT = '/data/skills'`) | Vendored to the correct path |
+| `/opt/boocode/AGENTS.md` for bucket-(a) rule additions | `data/AGENTS.md` is an agent registry (`## H2` per agent with frontmatter), not a rules file | Bucket-(a) rules go to `BOOCHAT.md` (the container guidance file the chat agent reads) instead |
+| "7 vendored v1.12 skills" exist to audit | Zero SKILL.md ever committed; `data/skills/` was empty | Vendored all 26 from `/home/samkintop/opt/skills/` in this batch (vendor + audit combined) |
+| `data/` content tracked in git | `.gitignore` excluded all of `data/` | Added negation patterns (`data/*` + `!data/AGENTS.md` + `!data/skills/`) so audit work shows up in git |
+| Container reads `data/skills/` from the boocode repo | `docker-compose.yml:18` had `- /opt/skills:/data/skills` override mount — container actually read from host-level `/opt/skills/`, ignoring repo `data/skills/` | Removed the override mount. Skill library now lives in `data/skills/` (repo-tracked, per-batch auditable). Host `/opt/skills/` preserved untouched for other tools (Claude Code, etc.). 1-line deviation from spec's "zero code change" claim — necessary to make the spec's intent actually take effect |
+
+## Bucket tally
+
+| Bucket | Action | Count |
+|---|---|---|
+| (a) | Move to BOOCHAT.md as always-true rule | 1 |
+| (b) | Keep as recipe, apply Anthropic conventions | 14 |
+| (c) | Keep + move bulk to `references/` (SKILL.md > 500 lines) | 0 |
+| (d) | Delete (duplicates Claude native capability or doesn't fit BooCode) | 11 |
+| **Total** | | **26** |
+
+No skill exceeded the 500-line ceiling — bucket (c) is empty. Longest survivor: `systematic-debugging` at 296 lines.
+
+## Per-skill decisions
+
+| Skill (path) | Lines | Bucket | Disposition | Rationale |
+|---|---:|:---:|---|---|
+| `anthropics/agent-development` | 196 | (b) | Keep; rename → `developing-agents` | BooCode-specific value (manages `data/AGENTS.md` tier-2 registry) |
+| `anthropics/claude-md-improver` | 180 | (d) | Delete | Overlaps `boocode-guidance-improver` (more specific) |
+| `anthropics/frontend-design` | 42 | (b) | Keep; rename → `designing-frontends` | Concise UI design guidance, no overlap |
+| `anthropics-knowledge-work/code-review` | 118 | (b) | Keep; rename → `reviewing-code` | Generic code review process distinct from `receiving-` / `requesting-code-review` |
+| `anthropics-knowledge-work/task-management` | 91 | (d) | Delete | `user-invocable: false`; duplicates BooCode's TodoWrite/TaskCreate native capability |
+| `asyrafhussin/react-vite-best-practices` | 182 | (b) | Keep; rename → `optimizing-react-vite` | Matches BooCode's stack (Vite, not Next.js) |
+| `boocode/boocode-guidance-improver` | 167 | (b) | Keep; rename → `improving-boocode-guidance` | BooCode-specific 10-dimension rubric for `CLAUDE.md`/`BOOCHAT.md`/`BOOCODER.md`/`AGENTS.md` |
+| `mattpocock/diagnose` | 117 | (b) | Keep; rename → `diagnosing-bugs` | Complement to `systematic-debugging`: focus on building a feedback loop |
+| `mattpocock/grill-me` | 20 | (b) | Keep; rename → `grilling-plans` | Plan stress-testing |
+| `mattpocock/grill-with-docs` | 98 | (d) | Delete | Requires `CONTEXT.md` and `docs/adr/` that BooCode doesn't have |
+| `mattpocock/handoff` | 17 | (d) | Delete | BooCode is single-user; no agent handoff scenario |
+| `mattpocock/improve-codebase-architecture` | 71 | (d) | Delete | Requires `CONTEXT.md` and `docs/adr/` |
+| `mattpocock/to-issues` | 83 | (d) | Delete | BooCode uses `openspec/changes/`, not an issue tracker |
+| `mattpocock/to-prd` | 76 | (d) | Delete | Same — no issue tracker |
+| `mattpocock/write-a-skill` | 121 | (b) | Keep; rename → `writing-skills` | Authoring new skills for this very system |
+| `mattpocock/zoom-out` | 7 | (d) | Delete | Claude does this natively when asked; 7-line skill is overhead |
+| `superpowers/brainstorming` | 164 | (b) | Keep (already gerund) | Before-features creative-work process |
+| `superpowers/receiving-code-review` | 213 | (b) | Keep (already gerund) | Sam reviews everything; process for handling feedback |
+| `superpowers/requesting-code-review` | 103 | (b) | Keep (already gerund) | Before-merge verification |
+| `superpowers/systematic-debugging` | 296 | (b) | Keep (already gerund) | Comprehensive bug-fix discipline (root-cause-first) |
+| `superpowers/using-superpowers` | 117 | (d) | Delete | Meta-skill about skill discovery; Claude does discovery natively |
+| `superpowers/verification-before-completion` | 139 | (a) | Migrate rule to `BOOCHAT.md`, delete skill dir | Always-true rule: evidence before assertions. Belongs 100% present, not 6% invoked |
+| `superpowers/writing-plans` | 152 | (b) | Keep (already gerund) | Maps to BooCode's `openspec/changes/` workflow |
+| `vercel-labs/find-skills` | 142 | (d) | Delete | Skill discovery — Claude does this natively |
+| `vercel-labs/react-best-practices` | 149 | (d) | Delete | Next.js focus; BooCode uses Vite (asyrafhussin's version is the fit) |
+| `vercel-labs/web-design-guidelines` | 39 | (b) | Keep; rename → `reviewing-web-design` | UI compliance review |
+
+## Bucket-(a) migration text
+
+Single rule extracted from `superpowers/verification-before-completion` (139 lines → ~3 lines in BOOCHAT.md):
+
+> **Don't claim work is complete without verifying.** Run the relevant command (test, build, smoke) and confirm the expected output before reporting success. Evidence before assertions catches regressions you'd otherwise miss.
+
+The 139-line process content does not move to BOOCHAT.md — the rule itself is what needs to be 100% present. Process detail is recoverable from the upstream repo if anyone wants to read it later.
+
+## Verification protocol coverage
+
+| Step | Owner | Status |
+|---|---|---|
+| 1. Discovery (paste SKILL.md, check first-200-char triggering) | Sam — fresh Claude.ai chat per skill | Pending |
+| 2. Logic (paste realistic task, check skill recognizes) | Sam — fresh Claude.ai chat | Pending |
+| 3. Edge Case (paste boundary task, check correct invoke/decline) | Sam — fresh Claude.ai chat | Pending |
+| 4. Architecture Refinement (paste skill + chats, ask for critique) | Sam — fresh Claude.ai chat | Pending |
+| 5. `skillgrade --smoke` (5 trials per skill) | Sam — host install `npm i -g skillgrade` first | Pending |
+
+`eval.yaml` files were written for each surviving skill (14 files) so that the `skillgrade --smoke` runs are mechanical once `skillgrade` is installed.
+
+## skillgrade scope correction
+
+The `eval.yaml` stubs authored in the prior session use a flat `tasks: [{prompt, grader: [list]}]` shape that does not validate against skillgrade's canonical schema (canonical needs `name`, `instruction`, `workspace`, structured `graders` with `type: deterministic | llm_rubric`, `run` shell script, `weight`, plus a Docker `provider` block). Rewriting all 14 in the canonical format is out of scope for this batch — each needs Docker workspace setup and grader scripts that capture skill-output correctness. Filed as a follow-up: **v1.13.13 — skillgrade eval.yaml canonical rewrite + first quantitative pass**.
+
+For that reason the findings table below carries no quantitative smoke-results column. The 4-step qualitative protocol still runs (via the agent team in this batch) and surfaces the structural issues that quantitative trials would have caught anyway.
+
+## 4-step protocol findings (agent-team batch)
+
+Each surviving skill was assessed by one of 5 parallel teammates (alpha / bravo / charlie / delta / echo) running the mgechev/skills-best-practices 4-step protocol: Discovery → Logic → Edge Case → self-Architecture-Refinement. Teammates wrote per-agent findings to `/tmp/audit-<name>.md`; the table here aggregates.
+
+Ratings shorthand: D=Discovery, L=Logic, E=Edge Case (each 1-5).
+
+| Skill | Auditor | D / L / E | 4-step verdict | Fix applied |
+|---|---|:---:|---|---|
+| `anthropics/designing-frontends` | alpha | 5 / 5 / 3 | Strong primary triggers; over-broad with "artifacts, posters" (not code targets) | Removed "artifacts, posters" from description trigger list |
+| `anthropics/developing-agents` | alpha | 5 / 5 / 4 | Sharp triggering; stale "(as of v1.11.x)" tag and broken `inference.ts:721-731` reference (actual code is at `stream-phase.ts:403-406`) | Updated stale version tag + cross-reference |
+| `anthropics-knowledge-work/reviewing-code` | alpha | 4 / 4 / 3 | Good for explicit PR/diff triggers; dead `CONNECTORS.md` cross-reference (file doesn't exist) | Removed broken cross-reference |
+| `mattpocock/diagnosing-bugs` | bravo | 5 / 5 / 4 | Strong; missing colloquial phrasings like "not working" / "something wrong" | Added informal trigger phrases |
+| `mattpocock/grilling-plans` | bravo | 3 / 4 / 3 | Trigger coverage too narrow — only "grill me" reliably fires; structural risk: mandatory `ask_user_input` tool may not exist in BooCode's tool registry (flagged but not patched — needs env verification) | Added "poke holes", "challenge my design", "play devil's advocate", "what am I missing" triggers |
+| `mattpocock/writing-skills` | bravo | 4 / 5 / 3 | Missing "create" phrasing; description-length rule (≤1024 chars) not in Review Checklist | Added "create" trigger + checklist item |
+| `superpowers/brainstorming` | charlie | 4 / 4 / 3 | Vague "modifying behavior" causes both over- and under-firing; HARD-GATE wording could be clearer about writing-plans being permitted | Tightened to "non-trivial modifications" + added "refactoring" |
+| `superpowers/receiving-code-review` | charlie | 3 / 4 / 3 | Conditional qualifier "especially if feedback seems unclear or technically questionable" mis-frames as edge-case skill rather than default protocol | Removed conditional + broadened to informal channels |
+| `superpowers/requesting-code-review` | charlie | 3 / 4 / 2 | Scope collision with built-in `code-review` skill: near-identical surface language but different execution (subagent dispatch vs inline). Lowest Edge Case score of the batch | Added "dispatches a separate subagent reviewer" differentiator |
+| `superpowers/systematic-debugging` | delta | 5 / 5 / 4 | Strong; build/compile failures appear in body but missing from frontmatter trigger | Extended description with "build failure, compile error" + "debug/investigate/diagnose" |
+| `superpowers/writing-plans` | delta | 3 / 4 / 3 | Spec-centric framing gatekeeps on pre-existing spec doc; colloquial "write me a plan" misses | Added colloquial planning trigger phrases |
+| `asyrafhussin/optimizing-react-vite` | delta | 4 / 5 / 3 | Over-broad "Vite configuration" triggers on non-perf tasks; body references `rules/*.md` and `AGENTS.md` files that don't exist in the skill dir | Narrowed scope + added broken-reference warnings |
+| `boocode/improving-boocode-guidance` | echo | 5 / 5 / 4 | Strong; "critique" in description prose but missing from examples list | Added `"critique my BOOCODER.md"` to examples |
+| `vercel-labs/reviewing-web-design` | echo | 3 / 4 / 3 | Generic triggers collide with general code-review; delegates substance to external GitHub URL with no fallback on fetch failure | Named "Vercel's live web-interface-guidelines" as differentiator + added 404 fallback |
+
+## Aggregate notes
+
+**Trigger-quality stats (qualitative, n=14):**
+- Discovery 5/5: 5 skills | 4/5: 4 skills | 3/5: 5 skills (avg ~4.0)
+- Edge Case is the weakest dimension across the batch: 9 of 14 skills scored 3/5 (borderline invoke/decline), one scored 2/5, and none scored 5/5. This suggests skills are over- or under-triggering on adjacent-but-different tasks.
+- Every skill had at least one fix applied. None were judged "clean" with zero issues.
+- Zero skills were flagged for retroactive bucket-(a) reclassification — all 14 remain (b) recipes.
+
+**Real bugs surfaced (not just polish):**
+- `anthropics/developing-agents`: stale code reference (inference.ts:721-731 → stream-phase.ts:403-406). Real dead link.
+- `anthropics-knowledge-work/reviewing-code`: dead CONNECTORS.md cross-reference.
+- `asyrafhussin/optimizing-react-vite`: references `rules/*.md` and `AGENTS.md` subfiles that don't exist in the skill directory.
+- `superpowers/requesting-code-review`: scope collision with the built-in `code-review` skill. Skill auto-routing needs review; Sam may want to drop one of the two.
+
+**Structural flags requiring environment verification (not patched):**
+- `mattpocock/grilling-plans`: mandatory `ask_user_input` tool call assumed available. Confirm BooCode's tool registry exposes this tool to the chat-surface model. If it does not, the skill body's MANDATORY instruction can never be satisfied and the skill deadlocks.
+
+**Skillgrade gap remains:**
+- Quantitative trigger rates (the original v1.13.12 N/5 column) require skillgrade with canonical-format eval.yaml. Filed as **v1.13.13** follow-up. The qualitative 4-step protocol catches the same class of issue (and arguably more — the broken-reference bugs above would not have shown up in skillgrade's invoke/decline trials).
+
+**Per-agent artifacts (working files, not part of repo):**
+- `/tmp/audit-alpha.md` — designing-frontends, developing-agents, reviewing-code
+- `/tmp/audit-bravo.md` — diagnosing-bugs, grilling-plans, writing-skills
+- `/tmp/audit-charlie.md` — brainstorming, receiving-code-review, requesting-code-review
+- `/tmp/audit-delta.md` — systematic-debugging, writing-plans, optimizing-react-vite
+- `/tmp/audit-echo.md` — improving-boocode-guidance, reviewing-web-design
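
The Discovery average quoted under "Aggregate notes" in audit-notes.md can be re-derived from the D / L / E column of the findings table. Below is a minimal TypeScript sketch with the ratings transcribed by hand from that table. The names `ratings`, `mean`, and `countAt` are illustrative and not part of the repo.

```typescript
// D / L / E ratings copied from the 4-step findings table, in table order.
const ratings: ReadonlyArray<readonly [number, number, number]> = [
  [5, 5, 3], // designing-frontends
  [5, 5, 4], // developing-agents
  [4, 4, 3], // reviewing-code
  [5, 5, 4], // diagnosing-bugs
  [3, 4, 3], // grilling-plans
  [4, 5, 3], // writing-skills
  [4, 4, 3], // brainstorming
  [3, 4, 3], // receiving-code-review
  [3, 4, 2], // requesting-code-review
  [5, 5, 4], // systematic-debugging
  [3, 4, 3], // writing-plans
  [4, 5, 3], // optimizing-react-vite
  [5, 5, 4], // improving-boocode-guidance
  [3, 4, 3], // reviewing-web-design
];

// Dimension index: 0 = Discovery, 1 = Logic, 2 = Edge Case.
type Dimension = 0 | 1 | 2;

// Arithmetic mean of one dimension across all 14 skills.
function mean(dimension: Dimension): number {
  const total = ratings.reduce((sum, row) => sum + row[dimension], 0);
  return total / ratings.length;
}

// Number of skills that received exactly `score` on one dimension.
function countAt(dimension: Dimension, score: number): number {
  return ratings.filter((row) => row[dimension] === score).length;
}

console.log(mean(0).toFixed(2)); // prints "4.00" (Discovery)
console.log(mean(1).toFixed(2)); // prints "4.50" (Logic)
console.log(mean(2).toFixed(2)); // prints "3.21" (Edge Case)
console.log(countAt(2, 3)); // prints 9
```

The Discovery sum is 56 over 14 skills, which matches the "avg ~4.0" in the notes. The Logic and Edge Case means are not stated in the notes and are shown only as a consequence of the same table.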