boocode/openspec/changes/archived/handoff_v1.13.10_per_tool_cost.md
indifferentketchup 5a3f357ce9 v1.13.15-openspec: reformat batch docs to OpenSpec directory structure
Adopt Fission-AI/OpenSpec's openspec/changes/<change-name>/{proposal,
specs,design,tasks}.md shape for BooCode's own batch docs. Zero-dep
documentation reformat; replaces ad-hoc boocode_batchN.md /
handoff_vN.N.N.md convention.

Existing batch docs moved into openspec/changes/archived/ via git mv
(preserves history):
- boocode_batch10.md
- handoff_v1.13.8_prefix_verify.md
- handoff_v1.13.10_per_tool_cost.md

Pre-v1.13.15 docs were NOT split into proposal/tasks/design files. The
work was already shipped; the originals are preserved as archived
snapshots. New v1.13.15+ batches land directly in
openspec/changes/<slug>/proposal.md (+ tasks.md, + design.md when
applicable) per the convention documented in openspec/README.md.

CLAUDE.md gained a one-line pointer to the convention (workflow
section). File grew from 153 → 154 lines, 27,682 → 27,925 chars; both
remain well under the AgentLint hard caps.

specs/ directory is reserved for future OpenSpec CLI adoption (v1.14+).
No CLI dep added in this batch — directory structure only. If/when the
full OpenSpec lifecycle is adopted, that lands as a separate batch.
2026-05-22 14:54:17 +00:00


#careful #boocode #nofluff

# v1.13.10: per-tool token cost accounting (rolling 100-call window)

Goal: surface per-tool prompt/completion-token rolling averages in AgentPicker for at-a-glance agent-cost hints. Implementation is a SQL view on top of `messages_with_parts` (no new table, no new write site) + a read endpoint + AgentPicker tooltip extension. Estimated ~260 LoC, most of it tests and UI.

## Where we are

- Last tag: v1.13.9 (compaction overflow trigger — `floor(0.85 × ctx_max)` early-trigger). Branch clean.
- v1.13.x cleanup line ✅ through v1.13.9. Queued: v1.13.10 (this) → v1.13.11 (WS Zod) → v1.13.12 (skills audit) → v1.13.2 (column drop, last).
- Dependency (satisfied since v1.13.7 commit `ff29b48`): `includeUsage: true` on `createOpenAICompatible` in `apps/server/src/services/inference/provider.ts`. Without it, `messages.tokens_used`/`ctx_used` were NULL for v1.13.1-A → v1.13.7 (latent regression). Now populated.

## Why this matters

Today: AgentPicker lists agents by name + description. No cost signal. Users pick the architect agent (full tool whitelist, 21k of tool schema) for one-liner questions a refactorer (3 tools, 4k schema) could answer.

Tomorrow: each agent listing shows its mean prompt + completion cost per tool, derived from the last 100 invocations across all chats. Decision aid, not a hard gate.

Why a SQL view instead of a denormalized stats table:
- All the source data already lands in `messages` (tool_calls JSON + tokens_used + ctx_used) and `message_parts` (read via the `messages_with_parts` view). Zero new write sites.
- Rolling 100-call window is a `ROW_NUMBER() OVER (PARTITION BY tool_name ORDER BY created_at DESC) <= 100` — natural fit for a view.
- View is rollback-safe. If the math is wrong, `DROP VIEW` and re-deploy; no orphan rows, no backfill.
- At BooCode scale (single user, ~30 tools, ~100 calls/tool), aggregate-on-read is microseconds. Premature to denormalize.

The roadmap schema row (`tool_cost_stats (tool_name, prompt_tokens_sum, completion_tokens_sum, n_calls, updated_at)`) matches both a table and a view. View is the lighter implementation.

## Canonical column mapping (pinned)

The `messages` columns are named non-obviously. Pinned mapping, confirmed across 5 write sites + 1 read site:

| Column          | Semantic meaning   | AI SDK v6 source name |
|-----------------|--------------------|-----------------------|
| `ctx_used`      | prompt / input tokens   | `usage.inputTokens`   |
| `tokens_used`   | completion / output tokens | `usage.outputTokens`  |

Write sites confirmed: `tool-phase.ts:94-95`, `error-handler.ts:109-110`, `sentinel-summaries.ts:130-131`, `sentinel-summaries.ts:387-388`, `stream-phase.ts:319-320`. Canonical read at `payload.ts:190-191` reverses: `const promptTokens = updated.ctx_used; const completionTokens = updated.tokens_used`.

`tokens_used` reads like "total" but is completion only. Project convention since the columns predate v1.13.x. Do not "fix" the naming inside this batch — out of scope; downstream consumers depend on the current mapping.
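
To keep the non-obvious naming out of new code, the mapping can be wrapped once at the read boundary. A minimal sketch, assuming a row shape with just these two columns; `MessageTokenRow`, `TokenUsage`, and `toUsage` are hypothetical names for illustration, not existing BooCode identifiers:

```typescript
// Hypothetical row shape: only the two token columns from `messages`.
interface MessageTokenRow {
  ctx_used: number | null;    // prompt / input tokens
  tokens_used: number | null; // completion / output tokens (NOT a total)
}

interface TokenUsage {
  promptTokens: number;
  completionTokens: number;
}

// Returns null when either column is NULL, e.g. rows written while
// usage reporting was not populated.
function toUsage(row: MessageTokenRow): TokenUsage | null {
  if (row.ctx_used === null || row.tokens_used === null) return null;
  return { promptTokens: row.ctx_used, completionTokens: row.tokens_used };
}
```

This mirrors the direction of the canonical read (`ctx_used` to prompt, `tokens_used` to completion) without renaming any column.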

## Attribution model

A single assistant turn can emit N tool calls in parallel. llama-swap returns ONE (prompt_tokens, completion_tokens) per turn, not per tool. Attribution requires a split.

**Chosen approach: equal split.** For an assistant turn that emits N tool calls with prompt P and completion C, each tool is attributed P/N prompt + C/N completion. The 100-call rolling mean smooths split noise. Implementation: `tokens_used::float / jsonb_array_length(tool_calls)` at the unnest site.

**Alternatives rejected:**
- "Full turn cost to every tool" (no division). Over-states; a 5-tool turn would 5×-count every tool's cost.
- "Result-size only" (`length(JSON.stringify(output)) / 4`). Loses the LLM's actual usage signal; doesn't capture how expensive a tool's output is to the next prompt.
- "Consuming-turn delta" (next turn prompt_tokens − this turn prompt_tokens, attributed to the tool that emitted the result). Most accurate but requires bubble-back math through the `executeToolPhase → runAssistantTurn` recursion. Over-engineered for the rolling-average use case.

**If Sam wants a different split, change one line in the view definition (the divisor).**
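
The equal split can be checked by hand with a small model. This is an illustrative sketch only: `TurnUsage`, `ToolShare`, and `splitTurnCost` are hypothetical names, and the real attribution happens in SQL inside the view, not in application code.

```typescript
interface TurnUsage {
  promptTokens: number;     // P: prompt tokens reported for the whole turn
  completionTokens: number; // C: completion tokens reported for the whole turn
}

interface ToolShare extends TurnUsage {
  toolName: string;
}

// Equal split: each of the N tool calls in a turn is attributed P/N and C/N.
function splitTurnCost(usage: TurnUsage, toolNames: string[]): ToolShare[] {
  const n = toolNames.length;
  if (n === 0) return []; // nothing to attribute; also avoids dividing by zero
  return toolNames.map((toolName) => ({
    toolName,
    promptTokens: usage.promptTokens / n,
    completionTokens: usage.completionTokens / n,
  }));
}

// A 3-tool turn with P=15000 and C=300.
const shares = splitTurnCost(
  { promptTokens: 15000, completionTokens: 300 },
  ["view_file", "grep", "list_dir"],
);
// Each tool is attributed 5000 prompt and 100 completion tokens.
```

Swapping the split rule in this model means changing the two division lines, which corresponds to changing the divisor in the view.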

## Filtering — sentinel, failure, repair-call semantics

The view excludes rows that aren't real tool-cost signal:

- **Failed and cancelled turns** (`status != 'complete'`). The `error-handler.ts` failed/cancelled paths don't write `tokens_used`/`ctx_used`, so the existing `tokens_used IS NOT NULL` clause already filters these. Adding `status='complete'` is defense in depth and makes intent explicit.
- **Cap-hit and doom-loop sentinel rows** (`metadata->>'kind' IN ('cap_hit', 'doom_loop')`). Sentinels are `role='system'` rows with `tool_calls=NULL`, so the existing `tool_calls IS NOT NULL` clause already filters them. The explicit metadata filter is defense in depth — it survives future schema drift where someone might INSERT a sentinel with a non-null tool_calls.
- **`experimental_repairToolCall` retries.** No special handling needed. Our impl (per `CLAUDE.md`) is pass-through — malformed calls flow to zod-reject → tool_result error → next normal turn handles. No separate rows; the next turn's tokens count naturally.
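
The filter rules above can be restated as a single predicate, which is handy when reasoning about which fixture rows should count. A sketch under assumptions: `CandidateRow` and `isCostSignal` are hypothetical names, and the row shape is reduced to the columns the filters touch.

```typescript
// Hypothetical, reduced row shape: only the columns the filters inspect.
interface CandidateRow {
  status: string;
  tool_calls: unknown[] | null;
  tokens_used: number | null;
  ctx_used: number | null;
  metadata: { kind?: string | null } | null;
}

const SENTINEL_KINDS = new Set(["cap_hit", "doom_loop"]);

// True when a row should contribute to per-tool cost.
function isCostSignal(row: CandidateRow): boolean {
  // Must carry at least one tool call.
  if (row.tool_calls === null || row.tool_calls.length === 0) return false;
  // Must carry both token counts.
  if (row.tokens_used === null || row.ctx_used === null) return false;
  // Defense in depth: only completed turns.
  if (row.status !== "complete") return false;
  // Defense in depth: no sentinel rows, even if one ever carried tool_calls.
  const kind = row.metadata?.kind;
  return kind == null || !SENTINEL_KINDS.has(kind);
}
```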

## Recon (already done; paste for reference)

```bash
cd /opt/boocode
grep -nE "tokens_used|ctx_used|inputTokens|outputTokens" apps/server/src/services/inference/*.ts | head -30
grep -nE "metadata|cap_hit|doom_loop" apps/server/src/services/inference/sentinels.ts apps/server/src/schema.sql | head -10
psql -h localhost -p 5432 -U postgres -d boocode -c "\d messages_with_parts" | head -30
```


Expected: confirms the canonical mapping in the table above; confirms `messages.metadata jsonb` exists at `schema.sql:259`; confirms `messages_with_parts` exposes `m.metadata` at `schema.sql:92`.

## Scope

### 1. schema.sql — `tool_cost_stats` view (~35 LoC)

Append after the `messages_with_parts` view (after line 120):

```sql
-- v1.13.10: per-tool token cost rolling window. Derives from
-- messages_with_parts (the v1.13.1-B view that COALESCEs message_parts over
-- the legacy JSON column) so this works whether the chat predates v1.13.0
-- or postdates v1.13.2 (column drop). No new write site — all source data
-- already lands via the existing tool-phase.ts:94-95 UPDATE.
--
-- Attribution model: equal split. A turn emitting N tool calls divides its
-- prompt/completion tokens by N before attribution. See v1.13.10 dispatch
-- brief for rationale + rejected alternatives.
--
-- Column mapping: messages.ctx_used = prompt (input), messages.tokens_used
-- = completion (output). Non-obvious naming; pinned via canonical writes at
-- tool-phase.ts:94-95 et al.
--
-- Filtering rationale:
--   status='complete'                — exclude failed/cancelled (defense in
--                                      depth; failed-path doesn't write
--                                      tokens_used so they're also filtered
--                                      indirectly).
--   metadata->>'kind' exclusions     — exclude cap_hit / doom_loop sentinels
--                                      (defense in depth; sentinels are
--                                      role='system' with tool_calls=NULL
--                                      so they're filtered indirectly too).
--   experimental_repairToolCall      — no special handling; retries flow
--                                      as normal next-turn tool_result
--                                      errors and count naturally.
--
-- Rolling window: last 100 calls per tool_name, ordered by created_at DESC.
-- Aggregate-on-read is microseconds at BooCode scale (single user, ~30
-- tools, at most 100 calls each in the window). DROP VIEW + recreate to
-- change window size.
CREATE OR REPLACE VIEW tool_cost_stats AS
WITH per_call AS (
  SELECT
    (tc->>'name')::text AS tool_name,
    (m.ctx_used::float / NULLIF(jsonb_array_length(m.tool_calls), 0)) AS prompt_tokens,
    (m.tokens_used::float / NULLIF(jsonb_array_length(m.tool_calls), 0)) AS completion_tokens,
    m.created_at,
    ROW_NUMBER() OVER (
      PARTITION BY (tc->>'name')::text
      ORDER BY m.created_at DESC
    ) AS rn
  FROM messages_with_parts m,
    LATERAL jsonb_array_elements(m.tool_calls) AS tc
  WHERE m.tool_calls IS NOT NULL
    AND jsonb_array_length(m.tool_calls) > 0
    AND m.tokens_used IS NOT NULL
    AND m.ctx_used IS NOT NULL
    AND m.status = 'complete'
    AND (m.metadata IS NULL
         OR m.metadata->>'kind' IS NULL
         OR m.metadata->>'kind' NOT IN ('cap_hit', 'doom_loop'))
)
SELECT
  tool_name,
  ROUND(SUM(prompt_tokens))::int AS prompt_tokens_sum,
  ROUND(SUM(completion_tokens))::int AS completion_tokens_sum,
  COUNT(*)::int AS n_calls,
  MAX(created_at) AS updated_at
FROM per_call
WHERE rn <= 100
GROUP BY tool_name;
```

Notes:

- `NULLIF(..., 0)` guards against div-by-zero on `jsonb_array_length = 0` (should never happen given the WHERE clause, but defensive).
- `ROUND(SUM(...))::int`: frontend doesn't want decimals; sum-then-round is more accurate than per-row round-then-sum.
- View is read from `messages_with_parts` not `messages`, so legacy pre-v1.13.0 rows and post-v1.13.2 rows both resolve.
- No index needed; the underlying `idx_messages_chat` covers the JOIN. The `LATERAL` unnest and `ROW_NUMBER()` run over every qualifying row before `rn <= 100` trims each partition, so read cost grows with total tool-call history, not with the window size. Fine at single-user scale.
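
For sanity-checking the view's output against hand-built fixtures, the window and aggregation can be modelled in memory. This is a reference sketch, not part of the batch: `PerCall`, `ToolCostRow`, and `rollingStats` are hypothetical names, and `createdAt` is an epoch-millis stand-in for `created_at`.

```typescript
interface PerCall {
  toolName: string;
  promptTokens: number;     // already split per tool; may be fractional
  completionTokens: number;
  createdAt: number;        // epoch millis, stands in for created_at
}

interface ToolCostRow {
  toolName: string;
  promptTokensSum: number;
  completionTokensSum: number;
  nCalls: number;
  updatedAt: number;
}

function rollingStats(calls: PerCall[], windowSize = 100): ToolCostRow[] {
  // Group by tool name (the PARTITION BY step).
  const byTool: Record<string, PerCall[]> = {};
  for (const c of calls) {
    (byTool[c.toolName] = byTool[c.toolName] ?? []).push(c);
  }
  return Object.keys(byTool)
    .sort()
    .map((toolName) => {
      // Newest first, then keep the window (ORDER BY ... DESC, rn <= windowSize).
      const recent = [...byTool[toolName]]
        .sort((a, b) => b.createdAt - a.createdAt)
        .slice(0, windowSize);
      return {
        toolName,
        // Sum first, round once (the ROUND(SUM(...)) step).
        promptTokensSum: Math.round(recent.reduce((s, c) => s + c.promptTokens, 0)),
        completionTokensSum: Math.round(recent.reduce((s, c) => s + c.completionTokens, 0)),
        nCalls: recent.length,
        updatedAt: Math.max(...recent.map((c) => c.createdAt)),
      };
    });
}
```

One caveat when comparing against Postgres: `Math.round` rounds halves up, while `ROUND` on `double precision` may break exact `.5` ties differently depending on platform, so fixtures are easier to compare when sums land on whole numbers.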

### 2. apps/server/src/routes/tools.ts (NEW, ~40 LoC)

New route file. Register in `apps/server/src/index.ts` next to the other `register*Routes(app, sql, ...)` calls.

```ts
import type { FastifyInstance } from 'fastify';
import type { Sql } from '../db.js';

export interface ToolCostStat {
  tool_name: string;
  mean_prompt_tokens: number;
  mean_completion_tokens: number;
  n_calls: number;
  updated_at: string;
}

export function registerToolsRoutes(app: FastifyInstance, sql: Sql) {
  app.get('/api/tools/cost_stats', async () => {
    const rows = await sql<{
      tool_name: string;
      prompt_tokens_sum: number;
      completion_tokens_sum: number;
      n_calls: number;
      updated_at: string;
    }[]>`
      SELECT tool_name, prompt_tokens_sum, completion_tokens_sum, n_calls, updated_at
      FROM tool_cost_stats
      ORDER BY tool_name ASC
    `;
    const stats: ToolCostStat[] = rows.map(r => ({
      tool_name: r.tool_name,
      mean_prompt_tokens: Math.round(r.prompt_tokens_sum / r.n_calls),
      mean_completion_tokens: Math.round(r.completion_tokens_sum / r.n_calls),
      n_calls: r.n_calls,
      updated_at: r.updated_at,
    }));
    return { stats };
  });
}
```

Route is bodyless, idempotent, cheap. No pagination (≤30 tools).

### 3. `apps/server/src/services/__tests__/tool_cost_stats.test.ts` (NEW, ~95 LoC)

Integration test against real Postgres (matches the `inference.test.ts` pattern). Fixtures:

```ts
import { describe, it, expect, beforeEach } from 'vitest';
import { connect } from '../../db.js';

describe('tool_cost_stats view (v1.13.10)', () => {
  // ... session + chat + project setup helpers ...

  it('returns empty when no tool calls exist', async () => {
    // fresh chat, only user/assistant text turns
    const stats = await sql`SELECT * FROM tool_cost_stats`;
    expect(stats).toEqual([]);
  });

  it('attributes single-tool turn fully to that tool', async () => {
    // insert one assistant message with tool_calls=[{name: 'view_file', ...}],
    // tokens_used=300, ctx_used=15000, status='complete'
    const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name='view_file'`;
    expect(stats[0]).toMatchObject({
      tool_name: 'view_file',
      prompt_tokens_sum: 15000,
      completion_tokens_sum: 300,
      n_calls: 1,
    });
  });

  it('splits multi-tool turn equally across tools', async () => {
    // insert one assistant turn with 3 tool calls (view_file, grep, list_dir),
    // tokens_used=300, ctx_used=15000 → each tool gets 100 completion, 5000 prompt
    const stats = await sql`SELECT * FROM tool_cost_stats ORDER BY tool_name`;
    expect(stats).toHaveLength(3);
    for (const s of stats) {
      expect(s.completion_tokens_sum).toBe(100);
      expect(s.prompt_tokens_sum).toBe(5000);
      expect(s.n_calls).toBe(1);
    }
  });

  it('limits to last 100 calls per tool (FIFO window)', async () => {
    // insert 150 turns each calling view_file once with monotonically
    // increasing tokens_used; expect only the most recent 100 to count
    const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name='view_file'`;
    expect(stats[0]!.n_calls).toBe(100);
    // mean should reflect the most recent 100 turns (51..150), not 1..150
  });

  it('excludes turns with NULL tokens_used (pre-v1.13.7 latent regression)', async () => {
    // insert a turn with tool_calls but tokens_used=NULL → must not appear
    const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name='view_file'`;
    expect(stats).toEqual([]);
  });

  it('excludes failed and cancelled turns + sentinel metadata rows', async () => {
    // insert five rows for tool_name='view_file', all with tokens_used+ctx_used
    // populated:
    //   row A: status='failed'                                -> excluded
    //   row B: status='cancelled'                             -> excluded
    //   row C: status='complete', metadata={kind:'cap_hit'}   -> excluded
    //   row D: status='complete', metadata={kind:'doom_loop'} -> excluded
    //   row E: status='complete', metadata=null               -> included
    // Expect n_calls=1, attributable to row E only.
    const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name='view_file'`;
    expect(stats[0]!.n_calls).toBe(1);
  });

  it('reads tool_calls via messages_with_parts (parts-authoritative)', async () => {
    // insert a v1.13.0+ row with messages.tool_calls=NULL but
    // message_parts rows containing the tool_call → must still aggregate
    const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name='grep'`;
    expect(stats[0]!.n_calls).toBe(1);
  });
});
```

Pattern: each test resets the `messages` table (`TRUNCATE`, not `DELETE`, because `messages` has FK CASCADE) and inserts hand-crafted rows. Because `tool_cost_stats` is a view, it is recomputed on every `SELECT`.
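
The hand-crafted rows share most of their fields, so a small builder keeps each test focused on the one column it varies. A sketch only: `ToolTurnFixture` and `makeToolTurn` are hypothetical helpers, the defaults reuse the 15000/300 numbers from the tests above, and the actual INSERT columns must follow the real `messages` schema.

```typescript
// Hypothetical fixture shape: only the columns the view reads.
interface ToolTurnFixture {
  role: "assistant";
  status: "complete" | "failed" | "cancelled";
  tool_calls: { name: string }[];
  ctx_used: number | null;    // prompt tokens
  tokens_used: number | null; // completion tokens
  metadata: { kind: string } | null;
}

// Builds one assistant turn calling the given tools; overrides vary one field per test.
function makeToolTurn(
  tools: string[],
  overrides: Partial<Omit<ToolTurnFixture, "role" | "tool_calls">> = {},
): ToolTurnFixture {
  return {
    role: "assistant",
    status: "complete",
    tool_calls: tools.map((name) => ({ name })),
    ctx_used: 15000,
    tokens_used: 300,
    metadata: null,
    ...overrides,
  };
}

// Rows A..E from the exclusion test, expressed with the builder.
const exclusionRows = [
  makeToolTurn(["view_file"], { status: "failed" }),
  makeToolTurn(["view_file"], { status: "cancelled" }),
  makeToolTurn(["view_file"], { metadata: { kind: "cap_hit" } }),
  makeToolTurn(["view_file"], { metadata: { kind: "doom_loop" } }),
  makeToolTurn(["view_file"]),
];
```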

### 4. apps/web/src/api/types.ts + client.ts (~10 LoC)

Add to `types.ts`:

```ts
export interface ToolCostStat {
  tool_name: string;
  mean_prompt_tokens: number;
  mean_completion_tokens: number;
  n_calls: number;
  updated_at: string;
}
```

Add to `client.ts` under the existing `api.*` namespace structure:

```ts
tools: {
  costStats: () => fetch<{ stats: ToolCostStat[] }>('GET', '/api/tools/cost_stats'),
},
```

Match the casing convention of the existing namespaces (`api.agents.list`, `api.chats.archive`, etc.).

### 5. apps/web/src/components/AgentPicker.tsx: tooltip extension (~80 LoC delta)

Currently (line 67): `title={selectedAgent?.description}`, a native HTML `title` attribute on the trigger button.

Replacement: dropdown items get a per-agent cost line in muted text below the description. Format:

```
[Agent name]
[Agent description]
~5.2k prompt / 280 completion · 6/8 tools · last call 3h ago
```

Implementation steps:

1. Fetch `api.tools.costStats()` once on mount (alongside the existing `api.agents.list()`). Cache the result for the lifetime of the picker open state. Re-fetch only on `useEffect` dep change.
2. Compute the per-agent aggregate: for each agent, sum the means of its whitelisted tools. Sum-of-means, not mean-of-sums: we're combining independent rolling averages.
3. Render below the description (one line, muted, truncated). Omit the line when none of the agent's tools has recorded calls yet.
4. Don't break the existing native `title=` for backward compat; layer the cost line additively.

```ts
const [costStats, setCostStats] = useState<ToolCostStat[]>([]);
useEffect(() => {
  api.tools.costStats().then(r => setCostStats(r.stats)).catch(() => setCostStats([]));
}, []);
const costByTool = useMemo(
  () => Object.fromEntries(costStats.map(s => [s.tool_name, s])),
  [costStats],
);
function agentCost(agent: Agent): { prompt: number; completion: number; nTools: number; nWithData: number; mostRecent: string | null } {
  let prompt = 0, completion = 0, nWithData = 0;
  let mostRecent: string | null = null;
  for (const t of agent.tools) {
    const s = costByTool[t];
    if (!s) continue;
    prompt += s.mean_prompt_tokens;
    completion += s.mean_completion_tokens;
    nWithData++;
    if (!mostRecent || s.updated_at > mostRecent) mostRecent = s.updated_at;
  }
  return { prompt, completion, nTools: agent.tools.length, nWithData, mostRecent };
}
```

For the line render: `` `~${formatK(prompt)} prompt / ${completion} completion · ${nWithData}/${nTools} tools · ${formatAgo(mostRecent)}` ``. Skip entirely when `nWithData === 0` to avoid showing "0k / 0 / 0 tools" for fresh-from-deploy state.

`formatK` / `formatAgo`: colocate at the bottom of `AgentPicker.tsx`. Don't extract to a util file in this batch (single use site).
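
The brief names `formatK` and `formatAgo` but does not define them. One possible shape, as a sketch under assumptions: the thresholds, the "just now" and "never" strings, the explicit `now` parameter (added for testability), and the `costLine` wrapper with its "last call" prefix are all choices made here for illustration, not part of the spec.

```typescript
// 5200 -> "5.2k", 21000 -> "21k", 280 -> "280"
function formatK(n: number): string {
  if (n < 1000) return String(Math.round(n));
  return `${(n / 1000).toFixed(1).replace(/\.0$/, "")}k`;
}

// Coarse relative time from an ISO timestamp. `now` is injectable for tests.
function formatAgo(iso: string | null, now: Date = new Date()): string {
  if (iso === null) return "never";
  const seconds = Math.max(0, Math.floor((now.getTime() - new Date(iso).getTime()) / 1000));
  if (seconds < 60) return "just now";
  const minutes = Math.floor(seconds / 60);
  if (minutes < 60) return `${minutes}m ago`;
  const hours = Math.floor(minutes / 60);
  if (hours < 24) return `${hours}h ago`;
  return `${Math.floor(hours / 24)}d ago`;
}

interface AgentCost {
  prompt: number;
  completion: number;
  nTools: number;
  nWithData: number;
  mostRecent: string | null;
}

// Returns null when there is no data, so the caller can skip the line entirely.
function costLine(c: AgentCost, now: Date = new Date()): string | null {
  if (c.nWithData === 0) return null;
  return `~${formatK(c.prompt)} prompt / ${c.completion} completion · ${c.nWithData}/${c.nTools} tools · last call ${formatAgo(c.mostRecent, now)}`;
}
```

Returning `null` from `costLine` keeps the "skip when `nWithData === 0`" rule in one place instead of in the JSX.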

## What NOT to do

- Don't add a new write site at `tool-phase.ts` or `finalizeCompletion`. All source data is already there via existing UPDATEs.
- Don't denormalize. The view is sufficient and rollback-safe at BooCode's single-user scale.
- Don't add per-tool cost to the message bubble. Out of scope. AgentPicker tooltip only.
- Don't fold per-call rows into a moving sum via triggers. Aggregate on read; 100 rows × 30 tools is microseconds in Postgres.
- Don't track `result_chars` (the size of `tool_results.output`). Tempting as a second cost signal but out of scope here. Future batch if Sam wants it.
- Don't add a session-scoped or chat-scoped filter to `tool_cost_stats`. The rolling window is GLOBAL across all chats: the agent picker is a project-level decision aid. Per-chat surfacing is a future v1.14+ design.
- Don't change the attribution model post-deployment without dropping the view first. Mid-flight semantic changes give bogus historical means.
- Don't "fix" the `ctx_used`/`tokens_used` naming inside this batch. Non-obvious but pinned across 5 write sites. Renaming is its own batch.
- Don't rely solely on `tool_calls IS NOT NULL` for sentinel exclusion. It works today (sentinels are `role='system'` with `tool_calls=NULL`) but the explicit `status='complete'` + `metadata->>'kind'` filters are defense in depth and survive future schema drift.

## Backup before edits

```bash
cd /opt/boocode
cp apps/server/src/schema.sql{,.bak-$(date +%Y%m%d-%H%M%S)}
cp apps/web/src/components/AgentPicker.tsx{,.bak-$(date +%Y%m%d-%H%M%S)}
```

(No backup needed for new files in items 2, 3, 4.)

## Verify

```bash
pnpm -C apps/server test
```

Expected: all existing tests pass + 7 new in `tool_cost_stats.test.ts`. Total moves from 195 → 202.

```bash
cd /opt/boocode
docker compose exec boocode_db psql -U postgres -d boocode -c \
  "SELECT * FROM tool_cost_stats ORDER BY n_calls DESC LIMIT 10;"
```

Expected: in any live deployment with v1.13.7+ history, this returns real rows for `view_file`, `grep`, `list_dir`, etc. If empty: `messages.tokens_used`/`ctx_used` were NULL for the v1.13.1-A → v1.13.7 latent regression window, so those rows are filtered out and recovery only begins with v1.13.7+ traffic.

## Build + smoke

```bash
cd /opt/boocode
docker compose up --build -d boocode
docker compose logs --since=30s boocode | tail -20
```

Smoke A (view recompiles on schema apply):

```bash
docker compose logs boocode | grep -i "tool_cost_stats\|applySchema"
```

Expected: clean schema apply, view registered idempotently.

Smoke B (endpoint returns data):

```bash
curl -s http://localhost:3000/api/tools/cost_stats | jq '.stats | length, .stats[0]'
```

Expected: nonzero length if any v1.13.7+ tool calls exist; one stat object with all 5 fields populated.
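
The "all 5 fields populated" check can also be expressed as a runtime guard, for example in a throwaway script or a client-side sanity check. A sketch only: `isToolCostStat` and `parseCostStats` are hypothetical helpers; the field names come from the `ToolCostStat` interface in this brief.

```typescript
interface ToolCostStat {
  tool_name: string;
  mean_prompt_tokens: number;
  mean_completion_tokens: number;
  n_calls: number;
  updated_at: string;
}

// Type guard: true only when all five fields are present with the expected types.
function isToolCostStat(v: unknown): v is ToolCostStat {
  if (typeof v !== "object" || v === null) return false;
  const r = v as Record<string, unknown>;
  return (
    typeof r["tool_name"] === "string" &&
    typeof r["mean_prompt_tokens"] === "number" &&
    typeof r["mean_completion_tokens"] === "number" &&
    typeof r["n_calls"] === "number" &&
    typeof r["updated_at"] === "string"
  );
}

// Validates the `{ stats: [...] }` envelope returned by the endpoint.
function parseCostStats(body: unknown): ToolCostStat[] {
  if (typeof body !== "object" || body === null) {
    throw new Error("response body is not an object");
  }
  const stats = (body as { stats?: unknown }).stats;
  if (!Array.isArray(stats) || !stats.every(isToolCostStat)) {
    throw new Error("response.stats is missing or malformed");
  }
  return stats;
}
```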

Smoke C (UI):

1. Open browser to boocode.indifferentketchup.com.
2. Open AgentPicker dropdown on any session.
3. Each agent row shows a muted cost line below its description: `~5.2k prompt / 280 completion · 6/8 tools · last call 2h ago`.
4. Agents with no tool history show just description (no cost line).
5. Confirm cost line truncates with the existing `text-muted-foreground` / `truncate` pattern; doesn't break the layout at mobile widths (open Vivaldi devtools, set iPhone-13 viewport).

## Files expected to touch

- `apps/server/src/schema.sql`: ~35 LoC delta (view definition + filter comments)
- `apps/server/src/routes/tools.ts`: NEW, ~40 LoC
- `apps/server/src/index.ts`: 1 line (`registerToolsRoutes(app, sql)`)
- `apps/server/src/services/__tests__/tool_cost_stats.test.ts`: NEW, ~95 LoC
- `apps/web/src/api/types.ts`: ~7 LoC (interface)
- `apps/web/src/api/client.ts`: ~3 LoC (namespace + method)
- `apps/web/src/components/AgentPicker.tsx`: ~80 LoC delta (cost line + fetch hook + helpers)

Total ~260 LoC. Matches roadmap estimate.

## Workflow conventions

- Backups before destructive edits (above) on the two MODIFIED files. New files don't need backups.
- Sam reviews diffs. Never `git add` / `git commit` / `git push` / `git pull` on Sam's behalf.
- Build: `docker compose up --build -d boocode`. No `--no-cache` unless layer-cache trap surfaces.
- Tests authoritative: `pnpm -C apps/server test`.
- View definition lives in `schema.sql` (idempotent via `CREATE OR REPLACE VIEW`); no migration shim needed.

## Don't repeat past mistakes

- v1.13.7 stability bundle (`includeUsage:true`, trim guards, payload filter, `BUDGET_NO_AGENT=30`): all live. This batch depends on `includeUsage:true`. If unset, `tool_cost_stats` returns no rows.
- v1.13.8 prefix instrumentation: untouched.
- v1.13.9 ratio-only `usable()`: untouched.
- v1.13.4 two-tier prune: untouched.
- v1.13.5 `truncate.ts` opaque-id pattern: untouched.
- v1.13.1-B `messages_with_parts` view: this view is the source. Don't reach past it to raw `messages`.
- v1.13.2 will DROP the `messages.tool_calls`/`tool_results` columns. The `tool_cost_stats` view reads from `messages_with_parts` not `messages`, so it survives. Verify after v1.13.2 ships.

## Source files to read in project knowledge

- `boocode_roadmap.md` (v1.13.10 row at line 114; schema row at line 474)
- `boocode_code_review.md` (cost-tracking design background)
- `CLAUDE.md` (project conventions; `messages_with_parts` invariant at L80; v1.13.7 `includeUsage` invariant)