v1.13.12: skills audit + token-tracking fix + codecontext + cap50 + UI cleanups

Multi-topic batch. The big-ticket item is the skills audit; the rest are
smaller patches that accumulated during the audit work.

## Skills audit (rules→recipes split)

Vendored all 26 skills from /home/samkintop/opt/skills/ into data/skills/
(the boocode-repo-local skill library — see docker-compose change below).
Audited via 5 parallel Claude Code agent-teams running the
mgechev/skills-best-practices 4-step protocol (Discovery → Logic → Edge
Case → self-Architecture-Refinement) per skill, ~2 min wall-clock vs the
~3.7-hour serial estimate.

Result: 14 skills survive (renamed to gerund form, frontmatter updated
to match), 11 deleted (duplicates, patterns irrelevant to BooCode,
things Claude already does natively), 1 migrated to
BOOCHAT.md/BOOCODER.md as an always-true rule
(verification-before-completion). Each surviving skill had its
description refined to close specific trigger gaps surfaced by the
protocol. The audit also produced 4 real-bug findings in the original
vendored content (dead refs, stale tags, broken sub-file references).

Audit decisions documented in openspec/changes/v1.13.12-skills-audit/
audit-notes.md. Convention codified in BOOCHAT.md/BOOCODER.md "rules vs
recipes" sections — future workflow rules go to those files (100%
present), recipes stay in data/skills/ (~6% invoke rate in multi-turn
per the Codeminer42 measurement).

## Token tracking + stale-stream banner fix (same root cause)

ws-frames.ts IsoTimestamp was z.string().min(1) but postgres returns
timestamp columns as JS Date objects. Every message_complete /
session_updated / chat_updated frame was failing the v1.13.11 Zod gate
and being silently dropped. Symptoms: token tracking blank in the UI
(no usage frames landed); the 60s no-token-activity timer tripped the
stale-stream banner because the frontend's local message state never
saw status='streaming' flip to 'complete'.

Fix: z.preprocess(v => v instanceof Date ? v.toISOString() : v,
z.string().min(1)) applied to the IsoTimestamp primitive. Centralized,
no publisher changes, works identically server + web (the parity test
still passes).
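
A dependency-free sketch of why the old gate dropped frames and why the
preprocess step fixes it. isNonEmptyString and coerceTimestamp are
hypothetical stand-ins for z.string().min(1) and the z.preprocess
callback; this is not the ws-frames.ts code, only the same two steps in
plain TypeScript.

```typescript
// Stand-in for z.string().min(1): accepts only non-empty strings.
function isNonEmptyString(v: unknown): v is string {
  return typeof v === "string" && v.length >= 1;
}

// Stand-in for the z.preprocess callback: Date -> ISO string,
// every other value passes through untouched.
function coerceTimestamp(v: unknown): unknown {
  return v instanceof Date ? v.toISOString() : v;
}

// Old gate: the raw value goes straight into the string check.
function oldGate(v: unknown): boolean {
  return isNonEmptyString(v);
}

// New gate: coerce first, then run the same string check.
function newGate(v: unknown): boolean {
  return isNonEmptyString(coerceTimestamp(v));
}

// A Date object, as the driver returns for a timestamp column.
const fromDb = new Date(Date.UTC(2026, 4, 22, 18, 58, 30));

console.log(oldGate(fromDb)); // false: the frame would be dropped
console.log(newGate(fromDb)); // true: the frame passes
console.log(newGate("2026-05-22T18:58:30.000Z")); // true: strings unaffected
```

Because the coercion sits in the shared primitive, publishers can keep
handing over whatever the database returned.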

## Codecontext .codecontextignore auto-install

services/codecontext_client.ts now copies the
codecontext/.codecontextignore.template into any project's root on the
first call to that project if no .codecontextignore exists. One file
written per project, idempotent (in-memory Set guard + access-check),
silent fallback on read-only project. Stops the upstream empty-source-
file parser crash on foreign projects' node_modules — previously
required manually copying the template per project.
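
A minimal sketch of the two-guard install described above, assuming an
injected filesystem interface so it runs without Node typings. FileOps,
ensureIgnoreFile and the /work/demo path are hypothetical names for
illustration, not the codecontext_client.ts implementation.

```typescript
// Minimal filesystem surface, injected so the sketch has no dependencies.
interface FileOps {
  exists(path: string): boolean;
  copy(src: string, dest: string): void; // may throw, e.g. read-only project
}

// Projects already handled in this process (the in-memory guard).
const handled = new Set<string>();

// Returns true only when this call actually wrote the ignore file.
function ensureIgnoreFile(projectRoot: string, templatePath: string, fs: FileOps): boolean {
  if (handled.has(projectRoot)) return false; // guard 1: seen in this process
  handled.add(projectRoot);

  const target = `${projectRoot}/.codecontextignore`;
  if (fs.exists(target)) return false; // guard 2: file already on disk

  try {
    fs.copy(templatePath, target);
    return true;
  } catch {
    return false; // silent fallback: carry on without the ignore file
  }
}

// In-memory fake filesystem for the demo.
const files = new Set<string>(["codecontext/.codecontextignore.template"]);
const fakeFs: FileOps = {
  exists: (p) => files.has(p),
  copy: (_src, dest) => { files.add(dest); },
};

const template = "codecontext/.codecontextignore.template";
console.log(ensureIgnoreFile("/work/demo", template, fakeFs)); // true: first call writes
console.log(ensureIgnoreFile("/work/demo", template, fakeFs)); // false: guarded
```

In this sketch the project is marked as handled before the copy, so a
read-only project is attempted once per process rather than on every
call. That ordering is an assumption, not something the commit states.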

## Tool-call budget cap 30 → 50

services/inference/budget.ts: BUDGET_READ_ONLY and BUDGET_NO_AGENT
bumped to 50 (from 30). BUDGET_NON_READ_ONLY stays at 10 (no write
tools landed yet). Real recon sessions were hitting 30 with ~3 turns
wasted on codecontext parse failures; legitimate need was ~27, and
Architect-class system overviews want deeper recon. Headroom of 20
absorbs failure-retry turns without changing the safety floor — the
doom-loop guard (3 identical calls → abort) catches the actual
failure mode this cap was guarding against.

v1.14 (Phase C outer agent loop) will supersede this via per-agent
agent.steps. Throwaway-ish patch but unblocks deeper recon today.
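
A rough sketch of how the cap and the doom-loop guard divide the work.
The budget values come from this commit; checkCalls, ToolCall, the tool
names, and the reading of "identical" as three back-to-back calls with
the same name and arguments are assumptions for illustration, not the
budget.ts code.

```typescript
// Caps as stated in this commit.
const BUDGETS = {
  BUDGET_READ_ONLY: 50,
  BUDGET_NO_AGENT: 50,
  BUDGET_NON_READ_ONLY: 10,
} as const;

const DOOM_LOOP_THRESHOLD = 3; // 3 identical calls -> abort

interface ToolCall {
  name: string;
  args: unknown;
}

type Verdict = "continue" | "abort_doom_loop" | "abort_budget";

// Hypothetical checker, run after each tool call is appended to history.
function checkCalls(history: ToolCall[], budget: number): Verdict {
  // Assumption: identical = same name and same serialized args, consecutively.
  const tail = history
    .slice(-DOOM_LOOP_THRESHOLD)
    .map((c) => `${c.name}:${JSON.stringify(c.args)}`);
  if (tail.length === DOOM_LOOP_THRESHOLD && tail.every((k) => k === tail[0])) {
    return "abort_doom_loop";
  }
  // Assumption: the session stops once the call count reaches the cap.
  return history.length >= budget ? "abort_budget" : "continue";
}

// 30 distinct read-only calls: ~27 useful plus ~3 failure retries.
const recon: ToolCall[] = Array.from({ length: 30 }, (_, i) => ({
  name: "read_file", // hypothetical tool name
  args: { i },
}));
console.log(checkCalls(recon, 30)); // "abort_budget" under the old cap
console.log(checkCalls(recon, BUDGETS.BUDGET_READ_ONLY)); // "continue"

// A stuck session repeats one call; the guard fires long before the cap.
const retry: ToolCall = { name: "codecontext_query", args: { path: "src" } };
const stuck: ToolCall[] = [...recon.slice(0, 5), retry, retry, retry];
console.log(checkCalls(stuck, BUDGETS.BUDGET_READ_ONLY)); // "abort_doom_loop"
```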

## UI cleanups

- ChatPane queued-message dropdown removed. Each queued message now
  has three buttons: edit (pop back into ChatInput via sendToChat
  event), force-send (was the dropdown's only useful action), and
  cancel. Default behavior (send when streaming completes) needs no
  UI — it's the implicit do-nothing path.
- ChatThroughput removed from desktop tab strip (ChatTabBar.tsx).
  Mobile tab switcher still shows it.

## Plumbing

- .gitignore: data/* + !data/AGENTS.md + !data/skills/ negation
  patterns so the vendored skill library + agent registry become
  git-tracked while session DB state stays out.
- docker-compose.yml: removed /opt/skills:/data/skills override
  mount. Skills now live in the boocode repo at data/skills/,
  auditable per-batch. The host-level /opt/skills/ is preserved
  untouched for any other tools that read from it.
- .codecontextignore at repo root: auto-installed when codecontext
  was first called against /opt/boocode itself; matches the template.
- CLAUDE.md: updated to document the v1.13.11 publishFrame wrapper +
  message_parts table + tool_cost_stats view + DB-integration test
  pattern + host-side smoke endpoint quirk. (Pre-existing in working
  tree before this batch; shipped here for completeness.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 18:58:30 +00:00
parent bc376c878d
commit 0fa46cd06c
80 changed files with 6950 additions and 39 deletions

@@ -0,0 +1,132 @@
# v1.13.12 — skills audit pass
Audit of 26 skills vendored from `/home/samkintop/opt/skills/` into `/opt/boocode/data/skills/`. Each was sorted into one of four buckets, per the Codeminer42 rules→recipes split.
## Deviations from the batch spec
| Spec said | Reality | Resolution |
|---|---|---|
| `/opt/boocode/skills/` is the audit target | Skills directory is `/opt/boocode/data/skills/` (per `services/skills.ts:19` `SKILLS_ROOT = '/data/skills'`) | Vendored to the correct path |
| `/opt/boocode/AGENTS.md` for bucket-(a) rule additions | `data/AGENTS.md` is an agent registry (`## H2` per agent with frontmatter), not a rules file | Bucket-(a) rules go to `BOOCHAT.md` (the container guidance file the chat agent reads) instead |
| "7 vendored v1.12 skills" exist to audit | Zero SKILL.md ever committed; `data/skills/` was empty | Vendored all 26 from `/home/samkintop/opt/skills/` in this batch (vendor + audit combined) |
| `data/` content tracked in git | `.gitignore` excluded all of `data/` | Added negation patterns (`data/*` + `!data/AGENTS.md` + `!data/skills/`) so audit work shows up in git |
| Container reads `data/skills/` from the boocode repo | `docker-compose.yml:18` had `- /opt/skills:/data/skills` override mount — container actually read from host-level `/opt/skills/`, ignoring repo `data/skills/` | Removed the override mount. Skill library now lives in `data/skills/` (repo-tracked, per-batch auditable). Host `/opt/skills/` preserved untouched for other tools (Claude Code, etc.). 1-line deviation from spec's "zero code change" claim — necessary to make the spec's intent actually take effect |
## Bucket tally
| Bucket | Action | Count |
|---|---|---|
| (a) | Move to BOOCHAT.md as always-true rule | 1 |
| (b) | Keep as recipe, apply Anthropic conventions | 14 |
| (c) | Keep + move bulk to `references/` (SKILL.md > 500 lines) | 0 |
| (d) | Delete (duplicates Claude native capability or doesn't fit BooCode) | 11 |
| **Total** | | **26** |

No skill exceeded the 500-line ceiling, so bucket (c) is empty. Longest survivor: `systematic-debugging` at 296 lines.
## Per-skill decisions
| Skill (path) | Lines | Bucket | Disposition | Rationale |
|---|---:|:---:|---|---|
| `anthropics/agent-development` | 196 | (b) | Keep; rename → `developing-agents` | BooCode-specific value (manages `data/AGENTS.md` tier-2 registry) |
| `anthropics/claude-md-improver` | 180 | (d) | Delete | Overlaps `boocode-guidance-improver` (more specific) |
| `anthropics/frontend-design` | 42 | (b) | Keep; rename → `designing-frontends` | Concise UI design guidance, no overlap |
| `anthropics-knowledge-work/code-review` | 118 | (b) | Keep; rename → `reviewing-code` | Generic code review process distinct from `receiving-` / `requesting-code-review` |
| `anthropics-knowledge-work/task-management` | 91 | (d) | Delete | `user-invocable: false`; duplicates BooCode's TodoWrite/TaskCreate native capability |
| `asyrafhussin/react-vite-best-practices` | 182 | (b) | Keep; rename → `optimizing-react-vite` | Matches BooCode's stack (Vite, not Next.js) |
| `boocode/boocode-guidance-improver` | 167 | (b) | Keep; rename → `improving-boocode-guidance` | BooCode-specific 10-dimension rubric for `CLAUDE.md`/`BOOCHAT.md`/`BOOCODER.md`/`AGENTS.md` |
| `mattpocock/diagnose` | 117 | (b) | Keep; rename → `diagnosing-bugs` | Complement to `systematic-debugging`: focus on building a feedback loop |
| `mattpocock/grill-me` | 20 | (b) | Keep; rename → `grilling-plans` | Plan stress-testing |
| `mattpocock/grill-with-docs` | 98 | (d) | Delete | Requires `CONTEXT.md` and `docs/adr/` that BooCode doesn't have |
| `mattpocock/handoff` | 17 | (d) | Delete | BooCode is single-user; no agent handoff scenario |
| `mattpocock/improve-codebase-architecture` | 71 | (d) | Delete | Requires `CONTEXT.md` and `docs/adr/` |
| `mattpocock/to-issues` | 83 | (d) | Delete | BooCode uses `openspec/changes/`, not an issue tracker |
| `mattpocock/to-prd` | 76 | (d) | Delete | Same — no issue tracker |
| `mattpocock/write-a-skill` | 121 | (b) | Keep; rename → `writing-skills` | Authoring new skills for this very system |
| `mattpocock/zoom-out` | 7 | (d) | Delete | Claude does this natively when asked; 7-line skill is overhead |
| `superpowers/brainstorming` | 164 | (b) | Keep (already gerund) | Before-features creative-work process |
| `superpowers/receiving-code-review` | 213 | (b) | Keep (already gerund) | Sam reviews everything; process for handling feedback |
| `superpowers/requesting-code-review` | 103 | (b) | Keep (already gerund) | Before-merge verification |
| `superpowers/systematic-debugging` | 296 | (b) | Keep (already gerund) | Comprehensive bug-fix discipline (root-cause-first) |
| `superpowers/using-superpowers` | 117 | (d) | Delete | Meta-skill about skill discovery; Claude does discovery natively |
| `superpowers/verification-before-completion` | 139 | (a) | Migrate rule to `BOOCHAT.md`, delete skill dir | Always-true rule: evidence before assertions. Belongs 100% present, not 6% invoked |
| `superpowers/writing-plans` | 152 | (b) | Keep (already gerund) | Maps to BooCode's `openspec/changes/` workflow |
| `vercel-labs/find-skills` | 142 | (d) | Delete | Skill discovery — Claude does this natively |
| `vercel-labs/react-best-practices` | 149 | (d) | Delete | Next.js focus; BooCode uses Vite (asyrafhussin's version is the fit) |
| `vercel-labs/web-design-guidelines` | 39 | (b) | Keep; rename → `reviewing-web-design` | UI compliance review |
## Bucket-(a) migration text
Single rule extracted from `superpowers/verification-before-completion` (139 lines → ~3 lines in BOOCHAT.md):

> **Don't claim work is complete without verifying.** Run the relevant command (test, build, smoke) and confirm the expected output before reporting success. Evidence before assertions catches regressions you'd otherwise miss.

The 139-line process content does not move to BOOCHAT.md; the rule itself is what needs to be 100% present. Process detail is recoverable from the upstream repo if anyone wants to read it later.
## Verification protocol coverage
| Step | Owner | Status |
|---|---|---|
| 1. Discovery (paste SKILL.md, check first-200-char triggering) | Sam — fresh Claude.ai chat per skill | Pending |
| 2. Logic (paste realistic task, check skill recognizes) | Sam — fresh Claude.ai chat | Pending |
| 3. Edge Case (paste boundary task, check correct invoke/decline) | Sam — fresh Claude.ai chat | Pending |
| 4. Architecture Refinement (paste skill + chats, ask for critique) | Sam — fresh Claude.ai chat | Pending |
| 5. `skillgrade --smoke` (5 trials per skill) | Sam — host install `npm i -g skillgrade` first | Pending |

`eval.yaml` files were written for each surviving skill (14 files), so the `skillgrade --smoke` runs are mechanical once `skillgrade` is installed.
## skillgrade scope correction
The `eval.yaml` stubs authored in the prior session use a flat `tasks: [{prompt, grader: [list]}]` shape that does not validate against skillgrade's canonical schema (canonical needs `name`, `instruction`, `workspace`, structured `graders` with `type: deterministic | llm_rubric`, `run` shell script, `weight`, plus a Docker `provider` block). Rewriting all 14 in the canonical format is out of scope for this batch — each needs Docker workspace setup and grader scripts that capture skill-output correctness. Filed as a follow-up: **v1.13.13 — skillgrade eval.yaml canonical rewrite + first quantitative pass**.
The smoke-results column in the table below is `n/a*` for that reason. The 4-step qualitative protocol still runs (via the agent team in this batch) and surfaces the structural issues that quantitative trials would have caught anyway.
## 4-step protocol findings (agent-team batch)
Each surviving skill was assessed by one of 5 parallel teammates (alpha / bravo / charlie / delta / echo) running the mgechev/skills-best-practices 4-step protocol: Discovery → Logic → Edge Case → self-Architecture-Refinement. Teammates wrote per-agent findings to `/tmp/audit-<name>.md`; the table here aggregates.
Ratings shorthand: D=Discovery, L=Logic, E=Edge Case (each 1-5).
| Skill | Auditor | D / L / E | 4-step verdict | Fix applied |
|---|---|:---:|---|---|
| `anthropics/designing-frontends` | alpha | 5 / 5 / 3 | Strong primary triggers; over-broad with "artifacts, posters" (not code targets) | Removed "artifacts, posters" from description trigger list |
| `anthropics/developing-agents` | alpha | 5 / 5 / 4 | Sharp triggering; stale "(as of v1.11.x)" tag and broken `inference.ts:721-731` reference (actual code is at `stream-phase.ts:403-406`) | Updated stale version tag + cross-reference |
| `anthropics-knowledge-work/reviewing-code` | alpha | 4 / 4 / 3 | Good for explicit PR/diff triggers; dead `CONNECTORS.md` cross-reference (file doesn't exist) | Removed broken cross-reference |
| `mattpocock/diagnosing-bugs` | bravo | 5 / 5 / 4 | Strong; missing colloquial phrasings like "not working" / "something wrong" | Added informal trigger phrases |
| `mattpocock/grilling-plans` | bravo | 3 / 4 / 3 | Trigger coverage too narrow — only "grill me" reliably fires; structural risk: mandatory `ask_user_input` tool may not exist in BooCode's tool registry (flagged but not patched — needs env verification) | Added "poke holes", "challenge my design", "play devil's advocate", "what am I missing" triggers |
| `mattpocock/writing-skills` | bravo | 4 / 5 / 3 | Missing "create" phrasing; description-length rule (≤1024 chars) not in Review Checklist | Added "create" trigger + checklist item |
| `superpowers/brainstorming` | charlie | 4 / 4 / 3 | Vague "modifying behavior" causes both over- and under-firing; HARD-GATE wording could be clearer about writing-plans being permitted | Tightened to "non-trivial modifications" + added "refactoring" |
| `superpowers/receiving-code-review` | charlie | 3 / 4 / 3 | Conditional qualifier "especially if feedback seems unclear or technically questionable" mis-frames as edge-case skill rather than default protocol | Removed conditional + broadened to informal channels |
| `superpowers/requesting-code-review` | charlie | 3 / 4 / 2 | Scope collision with built-in `code-review` skill — near-identical surface language but different execution (subagent dispatch vs inline). LOWEST EDGE-CASE SCORE OF THE BATCH | Added "dispatches a separate subagent reviewer" differentiator |
| `superpowers/systematic-debugging` | delta | 5 / 5 / 4 | Strong; build/compile failures appear in body but missing from frontmatter trigger | Extended description with "build failure, compile error" + "debug/investigate/diagnose" |
| `superpowers/writing-plans` | delta | 3 / 4 / 3 | Spec-centric framing gatekeeps on pre-existing spec doc; colloquial "write me a plan" misses | Added colloquial planning trigger phrases |
| `asyrafhussin/optimizing-react-vite` | delta | 4 / 5 / 3 | Over-broad "Vite configuration" triggers on non-perf tasks; body references `rules/*.md` and `AGENTS.md` files that don't exist in the skill dir | Narrowed scope + added broken-reference warnings |
| `boocode/improving-boocode-guidance` | echo | 5 / 5 / 4 | Strong; "critique" in description prose but missing from examples list | Added `"critique my BOOCODER.md"` to examples |
| `vercel-labs/reviewing-web-design` | echo | 3 / 4 / 3 | Generic triggers collide with general code-review; delegates substance to external GitHub URL with no fallback on fetch failure | Named "Vercel's live web-interface-guidelines" as differentiator + added 404 fallback |
## Aggregate notes
**Trigger-quality stats (qualitative, n=14):**
- Discovery 5/5: 5 skills | 4/5: 4 skills | 3/5: 5 skills (avg ~4.0)
- Edge case is the weakest dimension across the batch — most skills hit 3/5 (borderline invoke/decline). Suggests skills are over- or under-triggering on adjacent-but-different tasks.
- Every skill had at least one fix applied. None were judged "clean" with zero issues.
- Zero skills were flagged for retroactive bucket-(a) reclassification — all 14 remain (b) recipes.
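
The tallies above can be re-derived from the findings table. A small sketch, with the ratings copied from the D / L / E column; `rows`, `mean` and `tally` are hypothetical helper names, not part of the repo:

```typescript
// [skill, Discovery, Logic, Edge Case], copied from the findings table.
const rows: Array<[string, number, number, number]> = [
  ["designing-frontends", 5, 5, 3],
  ["developing-agents", 5, 5, 4],
  ["reviewing-code", 4, 4, 3],
  ["diagnosing-bugs", 5, 5, 4],
  ["grilling-plans", 3, 4, 3],
  ["writing-skills", 4, 5, 3],
  ["brainstorming", 4, 4, 3],
  ["receiving-code-review", 3, 4, 3],
  ["requesting-code-review", 3, 4, 2],
  ["systematic-debugging", 5, 5, 4],
  ["writing-plans", 3, 4, 3],
  ["optimizing-react-vite", 4, 5, 3],
  ["improving-boocode-guidance", 5, 5, 4],
  ["reviewing-web-design", 3, 4, 3],
];

// Arithmetic mean of a list of ratings.
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Count how many skills received each rating.
function tally(values: number[]): Record<number, number> {
  const counts: Record<number, number> = {};
  for (const v of values) counts[v] = (counts[v] ?? 0) + 1;
  return counts;
}

const discovery = rows.map((r) => r[1]);
const edge = rows.map((r) => r[3]);

console.log(rows.length); // 14
console.log(mean(discovery)); // 4
console.log(tally(discovery)); // 3 -> 5 skills, 4 -> 4 skills, 5 -> 5 skills
console.log(tally(edge)); // 2 -> 1 skill, 3 -> 9 skills, 4 -> 4 skills
console.log(mean(edge).toFixed(2)); // "3.21"
```

Nine of the 14 skills sit at Edge Case 3, which is the basis for calling it the weakest dimension.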
**Real bugs surfaced (not just polish):**
- `anthropics/developing-agents`: stale code reference (inference.ts:721-731 → stream-phase.ts:403-406). Real dead link.
- `anthropics-knowledge-work/reviewing-code`: dead CONNECTORS.md cross-reference.
- `asyrafhussin/optimizing-react-vite`: references `rules/*.md` and `AGENTS.md` subfiles that don't exist in the skill directory.
- `superpowers/requesting-code-review`: scope collision with the built-in `code-review` skill. Worth reviewing skill auto-routing; Sam may want to drop one of the two.

**Structural flags requiring environment verification (not patched):**
- `mattpocock/grilling-plans`: mandatory `ask_user_input` tool call assumed available. Confirm BooCode's tool registry exposes this to the chat-surface model. If not, the skill body's MANDATORY instruction deadlocks.

**Skillgrade gap remains:**
- Quantitative trigger rates (the original v1.13.12 N/5 column) require skillgrade with canonical-format eval.yaml. Filed as **v1.13.13** follow-up. The qualitative 4-step protocol catches the same class of issue, and arguably more: the broken-reference bugs above would not have shown up in skillgrade's invoke/decline trials.

**Per-agent artifacts (working files, not part of repo):**
- `/tmp/audit-alpha.md` — designing-frontends, developing-agents, reviewing-code
- `/tmp/audit-bravo.md` — diagnosing-bugs, grilling-plans, writing-skills
- `/tmp/audit-charlie.md` — brainstorming, receiving-code-review, requesting-code-review
- `/tmp/audit-delta.md` — systematic-debugging, writing-plans, optimizing-react-vite
- `/tmp/audit-echo.md` — improving-boocode-guidance, reviewing-web-design