v1.13.12: skills audit + token-tracking fix + codecontext + cap50 + UI cleanups
Multi-topic batch. The big-ticket item is the skills audit; the rest are smaller patches that compounded during the audit work.

## Skills audit (rules→recipes split)

Vendored all 26 skills from /home/samkintop/opt/skills/ into data/skills/ (the boocode-repo-local skill library — see docker-compose change below). Audited via 5 parallel Claude Code agent-teams running the mgechev/skills-best-practices 4-step protocol (Discovery → Logic → Edge Case → self-Architecture-Refinement) per skill, ~2 min wall-clock vs the ~3.7-hour serial estimate.

Result: 14 skills surviving (renamed to gerund form, frontmatter matched), 11 deleted (duplicates, BooCode-irrelevant patterns, Claude-already-does-natively), 1 migrated to BOOCHAT.md/BOOCODER.md as an always-true rule (verification-before-completion). Each surviving skill had its description refined to fix specific trigger gaps surfaced by the protocol — 4 real-bug findings landed (dead refs, stale tags, broken sub-file references in the original vendored content).

Audit decisions documented in openspec/changes/v1.13.12-skills-audit/audit-notes.md. Convention codified in BOOCHAT.md/BOOCODER.md "rules vs recipes" sections — future workflow rules go to those files (100% present), recipes stay in data/skills/ (~6% invoke rate in multi-turn per the Codeminer42 measurement).

## Token tracking + stale-stream banner fix (same root cause)

ws-frames.ts IsoTimestamp was z.string().min(1) but postgres returns timestamp columns as JS Date objects. Every message_complete / session_updated / chat_updated frame was failing the v1.13.11 Zod gate and being silently dropped.

Symptoms: token tracking blank in the UI (no usage frames landed); the 60s no-token-activity timer tripped the stale-stream banner because the frontend's local message state never saw status='streaming' flip to 'complete'.

Fix: z.preprocess(v => v instanceof Date ? v.toISOString() : v, z.string().min(1)) applied to the IsoTimestamp primitive. Centralized, no publisher changes, works identically server + web (the parity test still passes).

## Codecontext .codecontextignore auto-install

services/codecontext_client.ts now copies the codecontext/.codecontextignore.template into any project's root on the first call to that project if no .codecontextignore exists. One file written per project, idempotent (in-memory Set guard + access-check), silent fallback on read-only project. Stops the upstream empty-source-file parser crash on foreign projects' node_modules — previously required manually copying the template per project.

## Tool-call budget cap 30 → 50

services/inference/budget.ts: BUDGET_READ_ONLY and BUDGET_NO_AGENT bumped to 50 (from 30). BUDGET_NON_READ_ONLY stays at 10 (no write tools landed yet). Real recon sessions were hitting 30 with ~3 turns wasted on codecontext parse failures; legitimate need was ~27, and Architect-class system overviews want deeper recon. Headroom of 20 absorbs failure-retry turns without changing the safety floor — the doom-loop guard (3 identical calls → abort) catches the actual failure mode this cap was guarding against.

v1.14 (Phase C outer agent loop) will supersede this via per-agent agent.steps. Throwaway-ish patch but unblocks deeper recon today.

## UI cleanups

- ChatPane queued-message dropdown removed. Each queued message now has three buttons: edit (pop back into ChatInput via sendToChat event), force-send (was the dropdown's only useful action), and cancel. Default behavior (send when streaming completes) needs no UI — it's the implicit do-nothing path.
- ChatThroughput removed from desktop tab strip (ChatTabBar.tsx). Mobile tab switcher still shows it.

## Plumbing

- .gitignore: data/* + !data/AGENTS.md + !data/skills/ negation patterns so the vendored skill library + agent registry become git-tracked while session DB state stays out.
- docker-compose.yml: removed /opt/skills:/data/skills override mount. Skills now live in the boocode repo at data/skills/, auditable per-batch. The host-level /opt/skills/ is preserved untouched for any other tools that read from it.
- .codecontextignore at repo root: auto-installed when codecontext was first called against /opt/boocode itself; matches the template.
- CLAUDE.md: updated to document the v1.13.11 publishFrame wrapper + message_parts table + tool_cost_stats view + DB-integration test pattern + host-side smoke endpoint quirk. (Pre-existing in working tree before this batch; shipped here for completeness.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
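The IsoTimestamp fix described in the commit message can be illustrated without Zod. The following is a minimal, dependency-free sketch of the same normalize-then-validate idea. `normalizeTimestamp` and `parseIsoTimestamp` are hypothetical names chosen for illustration, not identifiers from ws-frames.ts; the actual change uses `z.preprocess` as quoted in the message.

```typescript
// Hypothetical stand-in for the preprocess step: a Date (which the commit
// message says postgres returns for timestamp columns) becomes an ISO-8601
// string; any other value passes through unchanged.
function normalizeTimestamp(value: unknown): unknown {
  return value instanceof Date ? value.toISOString() : value;
}

// Hypothetical stand-in for the z.string().min(1) check applied afterwards.
function parseIsoTimestamp(value: unknown): string {
  const normalized = normalizeTimestamp(value);
  if (typeof normalized !== "string" || normalized.length < 1) {
    throw new TypeError("IsoTimestamp must be a non-empty string");
  }
  return normalized;
}

// A Date from the database and an already-serialized string both validate.
const fromDb = parseIsoTimestamp(new Date(Date.UTC(2025, 0, 2, 3, 4, 5)));
const fromJson = parseIsoTimestamp("2025-01-02T03:04:05.000Z");
```

Normalizing inside the shared primitive, rather than at each publisher, is what lets one change cover every frame type that carries a timestamp.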
167
data/skills/boocode/improving-boocode-guidance/SKILL.md
Normal file
@@ -0,0 +1,167 @@
---
name: improving-boocode-guidance
description: This skill should be used when the user asks to audit, review, check, improve, or critique CLAUDE.md, BOOCHAT.md, BOOCODER.md, or AGENTS.md files in a BooCode project. Examples: "audit my CLAUDE.md", "review my container guidance", "check this AGENTS.md for issues", "improve my BOOCHAT.md", "critique my BOOCODER.md".
---

# BooCode Guidance Improver

Audit guidance files in a BooCode project against a 10-dimension rubric, then propose targeted edits. **Read-only.** Output is a scored report plus before/after edit proposals; Sam reviews and commits.

## Phase 1 — Discovery

Find every guidance file in the project. The expected set:

- `CLAUDE.md` (repo root) — engineering conventions, gotchas, commands
- `BOOCHAT.md` (repo root) — container guidance for the read-only chat surface
- `BOOCODER.md` (repo root) — container guidance for the future write-capable surface (currently a stub)
- `data/AGENTS.md` — single-file tier-2 agent registry, `## H2` per agent
- `AGENTS.md` (repo root) — non-BooCode convention; rare in this repo

Glob with `find_files` then load each with `view_file`:

```
find_files: pattern="{CLAUDE,BOOCHAT,BOOCODER,AGENTS}.md", path="."
find_files: pattern="data/AGENTS.md", path="."
```

If a file expected by the project's architecture is missing (e.g. BOOCHAT.md is absent from the repo root in a project that exposes a chat container), flag it in the report as a separate "Missing" entry — don't try to score what isn't there. Likewise, if a file exists but is empty (≤5 lines, no real content), score it 1 across the board and recommend it be either populated or deleted; an empty guidance file is worse than no file because it consumes attention without paying any back.
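
The missing/empty handling above can be sketched as a small classifier. This is an illustrative sketch only: `classifyGuidanceFile` is a hypothetical helper, and reading "no real content" as "no non-blank line other than a heading" is an assumption made for the example, not a rule stated by this skill.

```typescript
type Discovery = "missing" | "empty" | "present";

// `content` is null when discovery did not find the file at all.
function classifyGuidanceFile(content: string | null): Discovery {
  if (content === null) return "missing";

  const lines = content.split("\n");

  // Assumption: a line counts as real content if it is non-blank and is
  // not just a Markdown heading.
  const substantive = lines.filter((line) => {
    const trimmed = line.trim();
    return trimmed.length > 0 && !trimmed.startsWith("#");
  });

  // "Empty" per the text above: at most 5 lines and nothing substantive.
  return lines.length <= 5 && substantive.length === 0 ? "empty" : "present";
}

const missing = classifyGuidanceFile(null);
const stub = classifyGuidanceFile("# BOOCODER\n\n");
const real = classifyGuidanceFile("# BOOCHAT\nDo not write files.\n");
```

A "missing" result would go to the separate Missing entry, an "empty" result would be scored 1 on every dimension, and only a "present" file would proceed to Phase 2 scoring.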

## Phase 2 — Score against the rubric

For each file, score each of the 10 dimensions on 1–5 (1 = absent or actively misleading; 5 = exemplary). Use the rubric below verbatim. Cite a representative line range for each score.

### a. Refusal rails up front

The first ~10 lines name explicit "do not" directives — what the agent must not do, ideally with a one-line reason. Surfacing refusals early prevents the model from acting on a hopeful misread later.

- **5** — first 10 lines contain ≥3 explicit refusals (e.g. *"Do not commit"*, *"Do not push"*, *"Do not write files"*) with brief reasons or contexts
- **3** — refusals exist but are buried below line 30, or stated only once without context
- **1** — no refusals anywhere; the agent has to infer constraints from positive instructions only

### b. Version anchor

A concrete version, tag, or date is mentioned near the top so a stale memory becomes obvious to a future reader. Pure "current" / "latest" claims rot silently.

- **5** — version/tag in the first 20 lines, plus a "last meaningful update" date inline somewhere
- **3** — a version tag exists but only deep in the file (e.g. inside a commit-history block)
- **1** — no version, no date, no anchor; nothing to detect staleness against

### c. Why-with-what

Every non-obvious convention or rule is followed by a one-line justification (`Why:` / `Reason:` / dash). Rules without reasons can't be reasoned about at the edges; they get either blindly followed or quietly violated.

- **5** — every non-trivial rule has a sentence-level "why" inline
- **3** — most rules have reasons, but a few load-bearing ones (e.g. "use overflowWrap not wordWrap") are bare
- **1** — rules read as commandments with no rationale

### d. Authoritative vs misleading sources

Places where a tool can lie (e.g. *"root `tsc --noEmit` uses project references and can miss errors that the per-app tsconfig catches"*) are called out, and the authoritative path is named. Without this, the agent picks the most convenient signal and ships a regression.

- **5** — at least one explicit "X can lie; use Y instead" pair, named with file paths
- **3** — implicit hints ("CLI is authoritative") without naming what the misleading signal is
- **1** — no acknowledgement that any tool can lie

### e. Resolution order

For any stacked configuration (system prompts, env vars, agent definitions, schemas), the precedence is documented end-to-end with what wins on conflict. Missing precedence rules force the agent to guess at boundaries.

- **5** — explicit ordered list (e.g. *"base → container guidance → agent.system_prompt → user prompt"*) with "last wins" or "first wins" stated
- **3** — order is implied by section sequence but not stated; precedence on conflict is unclear
- **1** — multiple sources mentioned, no order, no winner
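
As an illustration of what a stated "last wins" precedence means for a stacked configuration, the sketch below resolves layered settings in a fixed order and records which layer supplied each value. The layer names echo the example in the rubric; `Layer`, `resolveLastWins`, and the setting keys are hypothetical and do not describe how BooCode actually assembles its prompts.

```typescript
interface Layer {
  name: string;
  settings: Record<string, string>;
}

interface Resolved {
  values: Record<string, string>; // final value per key
  source: Record<string, string>; // which layer supplied that value
}

// Apply layers in the documented order. On conflict the later layer
// overrides the earlier one ("last wins").
function resolveLastWins(layers: Layer[]): Resolved {
  const values: Record<string, string> = {};
  const source: Record<string, string> = {};
  for (const layer of layers) {
    for (const key of Object.keys(layer.settings)) {
      values[key] = layer.settings[key];
      source[key] = layer.name;
    }
  }
  return { values, source };
}

const resolved = resolveLastWins([
  { name: "base", settings: { tone: "neutral", tools: "read-only" } },
  { name: "container guidance", settings: { tone: "terse" } },
  { name: "agent.system_prompt", settings: { tone: "detailed" } },
]);
```

Here `tone` ends up supplied by the last layer that sets it, while `tools` still comes from `base`. A guidance file scoring 5 on this dimension would state that ordering and the winner explicitly rather than leaving it to be inferred.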

### f. Failure modes

Each subsystem has a "what happens when this fails" note — fallbacks, defaults, swallow vs propagate decisions. Without this the agent assumes the happy path and writes brittle code.

- **5** — every major subsystem (DB, broker, LLM call, tool execution) names its failure behavior
- **3** — some failure paths documented, others implicit
- **1** — failure modes invisible; reader can't tell what's defensive and what isn't

### g. Don't / refusals (deep)

Beyond the top-of-file refusal rails, the body contains a sustained "don't" thread — anti-patterns the project has burned on. Each "don't" should name what triggered it (PR, incident, refactor) so it can be re-evaluated.

- **5** — multiple "don't" entries scattered through the file, each with a hint at the triggering context
- **3** — a handful of "don't"s, no context — reader can't tell what's still load-bearing
- **1** — pure positive instructions; no anti-pattern surface

### h. Concrete call sites

Specific file paths and symbol names are used (e.g. `apps/server/src/services/inference.ts:209-225 buildSystemPrompt`), not vague pointers ("in the service layer", "somewhere in tools"). Vague pointers force the agent into an extra search round-trip per claim.

- **5** — claims about code consistently cite file:line or file:symbol (e.g. *"buildSystemPrompt at apps/server/src/services/system-prompt.ts:42"*)
- **3** — some claims cite paths but not lines or symbols (*"in apps/server/src/services/inference.ts"*)
- **1** — claims read like "the broker handles pub/sub" with no path at all

A reliable test for this dimension: pick three random claims about behaviour, and try to land at the named code in two clicks. If you can't, the score drops.

### i. Convention drift guards

Pairs of files that must stay in sync are named explicitly (e.g. *"CHECK constraints in schema.sql ↔ `*_STATUSES` const arrays in `apps/server/src/types/api.ts`"*). Without these guards, one half drifts and the test that would catch it doesn't exist.

- **5** — every cross-file invariant in the project has a "keep in sync" callout
- **3** — one or two such guards present; obvious sibling files (frontend type ↔ backend type) not mentioned
- **1** — invariants are invisible; every edit risks silent divergence

### j. No theater

Every line earns its keep. No "be helpful", no "remember to think step by step", no "as an AI assistant" preamble. Theater wastes tokens and trains the model to skim.

- **5** — every line carries either a fact, a rule, or a pointer; reads tight
- **3** — a few filler sentences ("strive for excellence", "remember to think carefully") but mostly substantive
- **1** — heavy preamble, motivational platitudes, or restated framework defaults

Worth a separate pass: re-read the file and ask "would removing this line confuse a future reader?" — if the honest answer is no, the line is theater and should go.

## Phase 3 — Propose one concrete edit per ≤3

For every dimension scoring 3 or lower, generate one specific edit proposal. Each proposal must be:

- **File**: full repo-relative path
- **Anchor**: a quoted ~one-line existing string or `(new section after L<n>)`
- **Before**: existing text (or `(none)`)
- **After**: proposed text
- **Why**: one sentence linking back to the rubric dimension and what the change unlocks

Example proposal:

```
### Proposed edit 1 — dimension (a) Refusal rails up front

File: BOOCHAT.md
Anchor: "## Capabilities" (L3)
Before:
## Capabilities
After:
## You cannot
- Write, edit, or delete files
- Run shell commands
- Make commits, push, or pull

## Capabilities
Why: the upstream rubric requires explicit "do not" rails in the first 10 lines so the
model can't reach for a write tool and self-justify after the fact.
```

Keep proposals minimal. One edit per dimension scoring ≤3 — don't pad. If a single edit would lift two dimensions at once, say so and don't double-count.

Do not propose more than ~10 edits per file. If a file scores ≤3 on more than 10 dimensions (rare), the file needs a rewrite, not patches — say that instead, and propose a high-level outline rather than a flood of line-level edits.
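
The selection rule in this phase (one proposal per dimension scoring 3 or lower, with no double-counting when one edit lifts two dimensions) can be sketched as follows. This is a hedged illustration rather than part of the skill's tooling: `EditProposal`, `weakDimensions`, `uncovered`, and the `lifts` field are hypothetical names, and the other fields simply mirror the five proposal bullets above.

```typescript
// Rubric dimensions are referenced by letter, as in Phase 2.
type Dimension = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j";

// Hypothetical shape mirroring the required proposal fields.
interface EditProposal {
  file: string;       // full repo-relative path
  anchor: string;     // quoted existing line, or "(new section after L<n>)"
  before: string;     // existing text, or "(none)"
  after: string;      // proposed text
  why: string;        // one sentence tied to the rubric dimension
  lifts: Dimension[]; // dimensions this single edit is credited with
}

// Every dimension scoring 3 or lower needs a proposal.
function weakDimensions(scores: Record<Dimension, number>): Dimension[] {
  return (Object.keys(scores) as Dimension[])
    .filter((d) => scores[d] <= 3)
    .sort();
}

// Dimensions already lifted by an existing proposal are skipped, so an edit
// that helps two dimensions is not double-counted.
function uncovered(weak: Dimension[], proposals: EditProposal[]): Dimension[] {
  const covered = new Set<Dimension>();
  for (const proposal of proposals) {
    for (const d of proposal.lifts) covered.add(d);
  }
  return weak.filter((d) => !covered.has(d));
}

const scores: Record<Dimension, number> = {
  a: 2, b: 5, c: 3, d: 4, e: 1, f: 5, g: 4, h: 5, i: 4, j: 5,
};
const weak = weakDimensions(scores);
```

With these illustrative scores, `weak` contains dimensions a, c, and e; if one proposal is credited with both a and c, only e still needs its own edit.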

## Phase 4 — Output

Output as a single numbered list, in this order:

1. Per-file score table: 10 rows × score column × one-line evidence column
2. Per-file aggregate (sum out of 50) and overall grade band: A (≥45), B (35–44), C (25–34), D (15–24), F (<15)
3. Proposed edits, numbered globally across all files
4. Closing one-line summary: *"X files audited, Y edits proposed, top weak dimension across files: Z."*
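
The aggregate and grade band in item 2 reduce to a sum and a threshold lookup. The sketch below shows one way to compute them; `aggregate` and `gradeBand` are hypothetical helper names, and treating each score as an integer from 1 to 5 is an assumption drawn from the Phase 2 scale.

```typescript
type GradeBand = "A" | "B" | "C" | "D" | "F";

// Sum the ten dimension scores into the per-file aggregate (out of 50).
// Assumption: each score is an integer from 1 to 5.
function aggregate(scores: number[]): number {
  if (scores.length !== 10) {
    throw new RangeError("expected exactly 10 dimension scores");
  }
  for (const score of scores) {
    if (!Number.isInteger(score) || score < 1 || score > 5) {
      throw new RangeError("each score must be an integer from 1 to 5");
    }
  }
  return scores.reduce((sum, score) => sum + score, 0);
}

// Map the aggregate to the bands listed above:
// A (>=45), B (35-44), C (25-34), D (15-24), F (<15).
function gradeBand(total: number): GradeBand {
  if (total >= 45) return "A";
  if (total >= 35) return "B";
  if (total >= 25) return "C";
  if (total >= 15) return "D";
  return "F";
}

// Dimensions a through j, in order.
const total = aggregate([2, 5, 3, 4, 1, 5, 4, 5, 4, 5]);
const band = gradeBand(total);
```

For the example scores the aggregate is 38, which falls in band B. Because every dimension scores at least 1, the lowest possible aggregate is 10, so band F covers totals from 10 to 14.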

Do not edit any file. Do not call any write tool. Sam reads the report, picks which edits to apply, and commits them manually.

## Anti-patterns this skill explicitly avoids

- Auto-generating CLAUDE.md from scratch (different problem — that's `claude-md-improver`'s domain)
- Scoring the *project's* code quality (out of scope — this rubric is about guidance files only)
- Padding the report with generic "best practices" not tied to one of the 10 dimensions
- Restating the rubric in every per-file section (state it once at the top, reference dimensions by letter throughout)
15
data/skills/boocode/improving-boocode-guidance/eval.yaml
Normal file
@@ -0,0 +1,15 @@
skill: improving-boocode-guidance
tasks:
  - prompt: "Audit my CLAUDE.md and tell me what to improve"
    grader:
      - the response invokes the improving-boocode-guidance skill
      - the response scores against the 10-dimension rubric
      - the response cites line ranges in CLAUDE.md
      - the response proposes before/after edits, not just complaints
  - prompt: "Check my BOOCHAT.md for issues"
    grader:
      - the response invokes the improving-boocode-guidance skill
      - the response evaluates the file against the rubric
  - prompt: "Explain how Docker layer caching works"
    grader:
      - the response does NOT invoke the improving-boocode-guidance skill
|
||||