v1.13.12: skills audit + token-tracking fix + codecontext + cap50 + UI cleanups

Multi-topic batch. The big-ticket item is the skills audit; the rest are smaller patches that compounded during the audit work.

## Skills audit (rules→recipes split)

Vendored all 26 skills from /home/samkintop/opt/skills/ into data/skills/ (the boocode-repo-local skill library; see docker-compose change below). Audited via 5 parallel Claude Code agent-teams running the mgechev/skills-best-practices 4-step protocol (Discovery → Logic → Edge Case → self-Architecture-Refinement) per skill, ~2 min wall-clock vs the ~3.7-hour serial estimate.

Result: 14 skills surviving (renamed to gerund form, frontmatter matched), 11 deleted (duplicates, BooCode-irrelevant patterns, Claude-already-does-natively), 1 migrated to BOOCHAT.md/BOOCODER.md as an always-true rule (verification-before-completion).

Each surviving skill had its description refined to fix specific trigger gaps surfaced by the protocol; 4 real-bug findings landed (dead refs, stale tags, broken sub-file references in the original vendored content). Audit decisions documented in openspec/changes/v1.13.12-skills-audit/audit-notes.md.

Convention codified in BOOCHAT.md/BOOCODER.md "rules vs recipes" sections: future workflow rules go to those files (100% present), recipes stay in data/skills/ (~6% invoke rate in multi-turn per the Codeminer42 measurement).

## Token tracking + stale-stream banner fix (same root cause)

ws-frames.ts IsoTimestamp was z.string().min(1) but postgres returns timestamp columns as JS Date objects. Every message_complete / session_updated / chat_updated frame was failing the v1.13.11 Zod gate and being silently dropped.

Symptoms: token tracking blank in the UI (no usage frames landed); the 60s no-token-activity timer tripped the stale-stream banner because the frontend's local message state never saw status='streaming' flip to 'complete'.

Fix: z.preprocess(v => v instanceof Date ? v.toISOString() : v, z.string().min(1)) applied to the IsoTimestamp primitive. Centralized, no publisher changes, works identically server + web (the parity test still passes).

## Codecontext .codecontextignore auto-install

services/codecontext_client.ts now copies the codecontext/.codecontextignore.template into any project's root on the first call to that project if no .codecontextignore exists. One file written per project, idempotent (in-memory Set guard + access-check), silent fallback on read-only project. Stops the upstream empty-source-file parser crash on foreign projects' node_modules; previously this required manually copying the template per project.

## Tool-call budget cap 30 → 50

services/inference/budget.ts: BUDGET_READ_ONLY and BUDGET_NO_AGENT bumped to 50 (from 30). BUDGET_NON_READ_ONLY stays at 10 (no write tools landed yet).

Real recon sessions were hitting 30 with ~3 turns wasted on codecontext parse failures; legitimate need was ~27, and Architect-class system overviews want deeper recon. Headroom of 20 absorbs failure-retry turns without changing the safety floor: the doom-loop guard (3 identical calls → abort) catches the actual failure mode this cap was guarding against.

v1.14 (Phase C outer agent loop) will supersede this via per-agent agent.steps. Throwaway-ish patch but unblocks deeper recon today.

## UI cleanups

- ChatPane queued-message dropdown removed. Each queued message now has three buttons: edit (pop back into ChatInput via sendToChat event), force-send (was the dropdown's only useful action), and cancel. Default behavior (send when streaming completes) needs no UI; it's the implicit do-nothing path.
- ChatThroughput removed from desktop tab strip (ChatTabBar.tsx). Mobile tab switcher still shows it.

## Plumbing

- .gitignore: data/* + !data/AGENTS.md + !data/skills/ negation patterns so the vendored skill library + agent registry become git-tracked while session DB state stays out.
- docker-compose.yml: removed /opt/skills:/data/skills override mount. Skills now live in the boocode repo at data/skills/, auditable per-batch. The host-level /opt/skills/ is preserved untouched for any other tools that read from it.
- .codecontextignore at repo root: auto-installed when codecontext was first called against /opt/boocode itself; matches the template.
- CLAUDE.md: updated to document the v1.13.11 publishFrame wrapper + message_parts table + tool_cost_stats view + DB-integration test pattern + host-side smoke endpoint quirk. (Pre-existing in working tree before this batch; shipped here for completeness.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
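
The IsoTimestamp fix in the commit message comes down to normalizing Date values before the non-empty-string check runs. A dependency-free sketch of that idea follows; it deliberately avoids importing Zod, and `normalizeIsoTimestamp` and `parseIsoTimestamp` are hypothetical names for illustration, not identifiers from ws-frames.ts.

```typescript
// Sketch only: mirrors the preprocess-then-validate order described in the
// commit message. Function names here are hypothetical.

/** Convert Date values to ISO-8601 strings; pass everything else through. */
function normalizeIsoTimestamp(value: unknown): unknown {
  return value instanceof Date ? value.toISOString() : value;
}

/** Stand-in for the string().min(1) check: accept only non-empty strings. */
function parseIsoTimestamp(value: unknown): string {
  const normalized = normalizeIsoTimestamp(value);
  if (typeof normalized !== "string" || normalized.length < 1) {
    throw new TypeError("IsoTimestamp must be a non-empty string");
  }
  return normalized;
}

// A Date object, as a database driver might return it, now validates.
const fromDb = parseIsoTimestamp(new Date(Date.UTC(2026, 4, 22, 18, 58, 30)));

// An already-serialized timestamp passes through unchanged.
const fromJson = parseIsoTimestamp("2026-05-22T18:58:30.000Z");
```

Because the normalization sits inside the shared primitive, every frame schema that uses it picks up the fix without touching the publishers, which matches the "centralized" claim above.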
2026-05-22 18:58:30 +00:00
parent bc376c878d
commit 0fa46cd06c
80 changed files with 6950 additions and 39 deletions
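
The interaction between the tool-call cap and the doom-loop guard described in the commit message can be sketched roughly as below. The three cap values are the ones the message names; the `ToolCall` type, the helper names, and the reading of "3 identical calls" as three consecutive identical calls are assumptions, not the contents of services/inference/budget.ts.

```typescript
// Cap values come from the commit message. Everything else is hypothetical.
const BUDGET_READ_ONLY = 50; // was 30
const BUDGET_NO_AGENT = 50; // was 30
const BUDGET_NON_READ_ONLY = 10; // unchanged
const DOOM_LOOP_THRESHOLD = 3; // assumed: 3 consecutive identical calls

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

type Verdict = "continue" | "budget_exhausted" | "doom_loop";

// Simplified identity check; JSON.stringify is sensitive to key order.
function callSignature(call: ToolCall): string {
  return `${call.name}:${JSON.stringify(call.args)}`;
}

function checkToolCalls(calls: ToolCall[], budget: number): Verdict {
  // The guard looks only at the most recent calls, independent of the cap.
  const tail = calls.slice(-DOOM_LOOP_THRESHOLD);
  const first = tail[0];
  if (
    first !== undefined &&
    tail.length === DOOM_LOOP_THRESHOLD &&
    tail.every((c) => callSignature(c) === callSignature(first))
  ) {
    return "doom_loop";
  }
  return calls.length >= budget ? "budget_exhausted" : "continue";
}

// Thirty distinct reads: exhausted under the old cap, fine under the new one.
const recon: ToolCall[] = Array.from({ length: 30 }, (_, i) => ({
  name: "view_file",
  args: { path: `src/file${i}.ts` },
}));
const underOldCap = checkToolCalls(recon, 30);
const underNewCap = checkToolCalls(recon, BUDGET_READ_ONLY);
const noAgentCap = checkToolCalls(recon, BUDGET_NO_AGENT);
const writeCap = checkToolCalls(recon.slice(0, 10), BUDGET_NON_READ_ONLY);

// Three identical calls trip the guard long before any cap is reached.
const stuck: ToolCall[] = Array.from({ length: 3 }, () => ({
  name: "find_files",
  args: { pattern: "*.md" },
}));
const stuckVerdict = checkToolCalls(stuck, BUDGET_READ_ONLY);
```

In this sketch raising the cap does not loosen the guard: a repeating call is stopped at three regardless of whether the budget is 30 or 50.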
--- a/data/skills/boocode/improving-boocode-guidance/SKILL.md
+++ b/data/skills/boocode/improving-boocode-guidance/SKILL.md
@@ -0,0 +1,167 @@
+---
+name: improving-boocode-guidance
+description: This skill should be used when the user asks to audit, review, check, improve, or critique CLAUDE.md, BOOCHAT.md, BOOCODER.md, or AGENTS.md files in a BooCode project. Examples: "audit my CLAUDE.md", "review my container guidance", "check this AGENTS.md for issues", "improve my BOOCHAT.md", "critique my BOOCODER.md".
+---
+
+# BooCode Guidance Improver
+
+Audit guidance files in a BooCode project against a 10-dimension rubric, then propose targeted edits. **Read-only.** Output is a scored report plus before/after edit proposals; Sam reviews and commits.
+
+## Phase 1 — Discovery
+
+Find every guidance file in the project. The expected set:
+
+- `CLAUDE.md` (repo root) — engineering conventions, gotchas, commands
+- `BOOCHAT.md` (repo root) — container guidance for the read-only chat surface
+- `BOOCODER.md` (repo root) — container guidance for the future write-capable surface (currently a stub)
+- `data/AGENTS.md` — single-file tier-2 agent registry, `## H2` per agent
+- `AGENTS.md` (repo root) — non-BooCode convention; rare in this repo
+
+Glob with `find_files` then load each with `view_file`:
+
+```
+find_files: pattern="{CLAUDE,BOOCHAT,BOOCODER,AGENTS}.md", path="."
+find_files: pattern="data/AGENTS.md", path="."
+```
+
+If a file expected by the project's architecture is missing (e.g. BOOCHAT.md is absent from the repo root in a project that exposes a chat container), flag it in the report as a separate "Missing" entry — don't try to score what isn't there. Likewise, if a file exists but is empty (≤5 lines, no real content), score it 1 across the board and recommend it be either populated or deleted; an empty guidance file is worse than no file because it consumes attention without paying any back.
+
+## Phase 2 — Score against the rubric
+
+For each file, score each of the 10 dimensions on 1–5 (1 = absent or actively misleading; 5 = exemplary). Use the rubric below verbatim. Cite a representative line range for each score.
+
+### a. Refusal rails up front
+
+The first ~10 lines name explicit "do not" directives — what the agent must not do, ideally with a one-line reason. Surfacing refusals early prevents the model from acting on a hopeful misread later.
+
+- **5** — first 10 lines contain ≥3 explicit refusals (e.g. *"Do not commit"*, *"Do not push"*, *"Do not write files"*) with brief reasons or contexts
+- **3** — refusals exist but are buried below line 30, or stated only once without context
+- **1** — no refusals anywhere; the agent has to infer constraints from positive instructions only
+
+### b. Version anchor
+
+A concrete version, tag, or date is mentioned near the top so a stale memory becomes obvious to a future reader. Pure "current" / "latest" claims rot silently.
+
+- **5** — version/tag in the first 20 lines, plus a "last meaningful update" date inline somewhere
+- **3** — a version tag exists but only deep in the file (e.g. inside a commit-history block)
+- **1** — no version, no date, no anchor; nothing to detect staleness against
+
+### c. Why-with-what
+
+Every non-obvious convention or rule is followed by a one-line justification (`Why:` / `Reason:` / dash). Rules without reasons can't be reasoned about at the edges; they get either blindly followed or quietly violated.
+
+- **5** — every non-trivial rule has a sentence-level "why" inline
+- **3** — most rules have reasons, but a few load-bearing ones (e.g. "use overflowWrap not wordWrap") are bare
+- **1** — rules read as commandments with no rationale
+
+### d. Authoritative vs misleading sources
+
+Places where a tool can lie (e.g. *"root `tsc --noEmit` uses project references and can miss errors that the per-app tsconfig catches"*) are called out, and the authoritative path is named. Without this, the agent picks the most convenient signal and ships a regression.
+
+- **5** — at least one explicit "X can lie; use Y instead" pair, named with file paths
+- **3** — implicit hints ("CLI is authoritative") without naming what the misleading signal is
+- **1** — no acknowledgement that any tool can lie
+
+### e. Resolution order
+
+For any stacked configuration (system prompts, env vars, agent definitions, schemas), the precedence is documented end-to-end with what wins on conflict. Missing precedence rules force the agent to guess at boundaries.
+
+- **5** — explicit ordered list (e.g. *"base → container guidance → agent.system_prompt → user prompt"*) with "last wins" or "first wins" stated
+- **3** — order is implied by section sequence but not stated; precedence on conflict is unclear
+- **1** — multiple sources mentioned, no order, no winner
+
+### f. Failure modes
+
+Each subsystem has a "what happens when this fails" note — fallbacks, defaults, swallow vs propagate decisions. Without this the agent assumes the happy path and writes brittle code.
+
+- **5** — every major subsystem (DB, broker, LLM call, tool execution) names its failure behavior
+- **3** — some failure paths documented, others implicit
+- **1** — failure modes invisible; reader can't tell what's defensive and what isn't
+
+### g. Don't / refusals (deep)
+
+Beyond the top-of-file refusal rails, the body contains a sustained "don't" thread — anti-patterns the project has burned on. Each "don't" should name what triggered it (PR, incident, refactor) so it can be re-evaluated.
+
+- **5** — multiple "don't" entries scattered through the file, each with a hint at the triggering context
+- **3** — a handful of "don't"s, no context — reader can't tell what's still load-bearing
+- **1** — pure positive instructions; no anti-pattern surface
+
+### h. Concrete call sites
+
+Specific file paths and symbol names are used (e.g. `apps/server/src/services/inference.ts:209-225 buildSystemPrompt`), not vague pointers ("in the service layer", "somewhere in tools"). Vague pointers force the agent into an extra search round-trip per claim.
+
+- **5** — claims about code consistently cite file:line or file:symbol (e.g. *"buildSystemPrompt at apps/server/src/services/system-prompt.ts:42"*)
+- **3** — some claims cite paths but not lines or symbols (*"in apps/server/src/services/inference.ts"*)
+- **1** — claims read like "the broker handles pub/sub" with no path at all
+
+A reliable test for this dimension: pick three random claims about behaviour, and try to land at the named code in two clicks. If you can't, the score drops.
+
+### i. Convention drift guards
+
+Pairs of files that must stay in sync are named explicitly (e.g. *"CHECK constraints in schema.sql ↔ `*_STATUSES` const arrays in `apps/server/src/types/api.ts`"*). Without these guards, one half drifts and the test that would catch it doesn't exist.
+
+- **5** — every cross-file invariant in the project has a "keep in sync" callout
+- **3** — one or two such guards present; obvious sibling files (frontend type ↔ backend type) not mentioned
+- **1** — invariants are invisible; every edit risks silent divergence
+
+### j. No theater
+
+Every line earns its keep. No "be helpful", no "remember to think step by step", no "as an AI assistant" preamble. Theater wastes tokens and trains the model to skim.
+
+- **5** — every line carries either a fact, a rule, or a pointer; reads tight
+- **3** — a few filler sentences ("strive for excellence", "remember to think carefully") but mostly substantive
+- **1** — heavy preamble, motivational platitudes, or restated framework defaults
+
+Worth a separate pass: re-read the file and ask "would removing this line confuse a future reader?" — if the honest answer is no, the line is theater and should go.
+
+## Phase 3 — Propose one concrete edit per dimension scoring ≤3
+
+For every dimension scoring 3 or lower, generate one specific edit proposal. Each proposal must be:
+
+- **File**: full repo-relative path
+- **Anchor**: a quoted ~one-line existing string or `(new section after L<n>)`
+- **Before**: existing text (or `(none)`)
+- **After**: proposed text
+- **Why**: one sentence linking back to the rubric dimension and what the change unlocks
+
+Example proposal:
+
+```
+### Proposed edit 1 — dimension (a) Refusal rails up front
+
+File: BOOCHAT.md
+Anchor: "## Capabilities" (L3)
+Before:
+  ## Capabilities
+After:
+  ## You cannot
+  - Write, edit, or delete files
+  - Run shell commands
+  - Make commits, push, or pull
+
+  ## Capabilities
+Why: the upstream rubric requires explicit "do not" rails in the first 10 lines so the
+model can't reach for a write tool and self-justify after the fact.
+```
+
+Keep proposals minimal. One edit per dimension scoring ≤3 — don't pad. If a single edit would lift two dimensions at once, say so and don't double-count.
+
+Do not propose more than ~10 edits per file. If a file scores ≤3 on all 10 dimensions (rare), the file needs a rewrite, not patches; say that instead, and propose a high-level outline rather than a flood of line-level edits.
+
+## Phase 4 — Output
+
+Output as a single numbered list, in this order:
+
+1. Per-file score table: 10 rows × score column × one-line evidence column
+2. Per-file aggregate (sum out of 50) and overall grade band: A (≥45), B (35–44), C (25–34), D (15–24), F (<15)
+3. Proposed edits, numbered globally across all files
+4. Closing one-line summary: *"X files audited, Y edits proposed, top weak dimension across files: Z."*
+
+Do not edit any file. Do not call any write tool. Sam reads the report, picks which edits to apply, and commits them manually.
+
+## Anti-patterns this skill explicitly avoids
+
+- Auto-generating CLAUDE.md from scratch (different problem — that's `claude-md-improver`'s domain)
+- Scoring the *project's* code quality (out of scope — this rubric is about guidance files only)
+- Padding the report with generic "best practices" not tied to one of the 10 dimensions
+- Restating the rubric in every per-file section (state it once at the top, reference dimensions by letter throughout)
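
The scoring arithmetic in Phases 3 and 4 of the SKILL.md above (ten scores of 1 to 5, a sum out of 50, a grade band, and one edit per dimension scoring 3 or lower) can be sketched as follows. The function names are hypothetical, and treating scores as integers is an assumption; the skill itself describes a report written by the model, not code.

```typescript
type Grade = "A" | "B" | "C" | "D" | "F";

const DIMENSIONS = "abcdefghij"; // rubric letters a through j

/** Sum ten per-dimension scores, rejecting anything outside 1..5. */
function aggregate(scores: number[]): number {
  if (scores.length !== DIMENSIONS.length) {
    throw new RangeError("expected exactly 10 dimension scores");
  }
  for (const s of scores) {
    if (!Number.isInteger(s) || s < 1 || s > 5) {
      throw new RangeError(`score out of range: ${s}`);
    }
  }
  return scores.reduce((sum, s) => sum + s, 0);
}

/** Grade bands as listed in Phase 4: A >= 45, B 35-44, C 25-34, D 15-24, F < 15. */
function gradeBand(total: number): Grade {
  if (total >= 45) return "A";
  if (total >= 35) return "B";
  if (total >= 25) return "C";
  if (total >= 15) return "D";
  return "F";
}

/** Letters of the dimensions that warrant an edit proposal (score <= 3). */
function dimensionsNeedingEdits(scores: number[]): string[] {
  return scores
    .map((score, i) => ({ score, letter: DIMENSIONS.charAt(i) }))
    .filter((d) => d.score <= 3)
    .map((d) => d.letter);
}

// Example: a file that is solid overall but weak on b, d, f, and i.
const exampleScores = [5, 3, 4, 2, 4, 3, 4, 5, 2, 4];
const total = aggregate(exampleScores); // 36
const grade = gradeBand(total); // "B"
const toFix = dimensionsNeedingEdits(exampleScores); // ["b", "d", "f", "i"]
```

Since every dimension scores at least 1, the lowest possible aggregate is 10, so the F band covers totals from 10 to 14.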
--- a/data/skills/boocode/improving-boocode-guidance/eval.yaml
+++ b/data/skills/boocode/improving-boocode-guidance/eval.yaml
@@ -0,0 +1,15 @@
+skill: improving-boocode-guidance
+tasks:
+  - prompt: "Audit my CLAUDE.md and tell me what to improve"
+    grader:
+      - the response invokes the improving-boocode-guidance skill
+      - the response scores against the 10-dimension rubric
+      - the response cites line ranges in CLAUDE.md
+      - the response proposes before/after edits, not just complaints
+  - prompt: "Check my BOOCHAT.md for issues"
+    grader:
+      - the response invokes the improving-boocode-guidance skill
+      - the response evaluates the file against the rubric
+  - prompt: "Explain how Docker layer caching works"
+    grader:
+      - the response does NOT invoke the improving-boocode-guidance skill