boocode/openspec/changes/archived/handoff_v1.13.10_per_tool_cost.md at 5a3f357ce9fbcfc52ab6ece00ca8ace913c958d2

indifferentketchup 5a3f357ce9 v1.13.15-openspec: reformat batch docs to OpenSpec directory structure

Adopt Fission-AI/OpenSpec's openspec/changes/<change-name>/{proposal,
specs,design,tasks}.md shape for BooCode's own batch docs. Zero-dep
documentation reformat; replaces ad-hoc boocode_batchN.md /
handoff_vN.N.N.md convention.

Existing batch docs moved into openspec/changes/archived/ via git mv
(preserves history):
- boocode_batch10.md
- handoff_v1.13.8_prefix_verify.md
- handoff_v1.13.10_per_tool_cost.md

Pre-v1.13.15 docs were NOT split into proposal/tasks/design files. The
work was already shipped; the originals are preserved as archived
snapshots. New v1.13.15+ batches land directly in
openspec/changes/<slug>/proposal.md (+ tasks.md, + design.md when
applicable) per the convention documented in openspec/README.md.

CLAUDE.md gained a one-line pointer to the convention (workflow
section). File grew from 153 → 154 lines, 27,682 → 27,925 chars; both
remain well under the AgentLint hard caps.

specs/ directory is reserved for future OpenSpec CLI adoption (v1.14+).
No CLI dep added in this batch — directory structure only. If/when the
full OpenSpec lifecycle is adopted, that lands as a separate batch.

2026-05-22 14:54:17 +00:00

22 KiB

#careful #boocode #nofluff

# v1.13.10: per-tool token cost accounting (rolling 100-call window)

Goal: surface per-tool prompt/completion-token rolling averages in AgentPicker for at-a-glance agent-cost hints. Implementation is a SQL view on top of `messages_with_parts` (no new table, no new write site) + a read endpoint + AgentPicker tooltip extension. Estimated ~260 LoC per the file-by-file breakdown; the integration test and the AgentPicker changes are the largest pieces.

## Where we are

- Last tag: v1.13.9 (compaction overflow trigger — `floor(0.85 × ctx_max)` early-trigger). Branch clean.
- v1.13.x cleanup line ✅ through v1.13.9. Queued: v1.13.10 (this) → v1.13.11 (WS Zod) → v1.13.12 (skills audit) → v1.13.2 (column drop, last).
- Dependency (satisfied since v1.13.7 commit `ff29b48`): `includeUsage: true` on `createOpenAICompatible` in `apps/server/src/services/inference/provider.ts`. Without it, `messages.tokens_used`/`ctx_used` were NULL for v1.13.1-A → v1.13.7 (latent regression). Now populated.

## Why this matters

Today: AgentPicker lists agents by name + description. No cost signal. Users pick the architect agent (full tool whitelist, 21k of tool schema) for one-liner questions a refactorer (3 tools, 4k schema) could answer.

Tomorrow: each agent listing shows its mean prompt + completion cost per tool, derived from the last 100 invocations across all chats. Decision aid, not a hard gate.

Why a SQL view instead of a denormalized stats table:
- All the source data already lands in `messages` (tool_calls JSON + tokens_used + ctx_used) and `message_parts` (read via the `messages_with_parts` view). Zero new write sites.
- Rolling 100-call window is a `ROW_NUMBER() OVER (PARTITION BY tool_name ORDER BY created_at DESC) <= 100` — natural fit for a view.
- View is rollback-safe. If the math is wrong, `DROP VIEW` and re-deploy; no orphan rows, no backfill.
- At BooCode scale (single user, ~30 tools, ~100 calls/tool), aggregate-on-read is microseconds. Premature to denormalize.

The roadmap schema row (`tool_cost_stats (tool_name, prompt_tokens_sum, completion_tokens_sum, n_calls, updated_at)`) matches both a table and a view. View is the lighter implementation.
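
The window logic can be checked outside SQL. Below is a minimal TypeScript sketch of the same rank, filter, aggregate steps. `PerCall`, `ToolCostRow`, and `rollingStats` are hypothetical names used only for illustration, and `createdAt` is simplified to a number.

```typescript
// One already-attributed tool call (the record shape is an assumption).
interface PerCall {
  toolName: string;
  promptTokens: number;
  completionTokens: number;
  createdAt: number; // stand-in for created_at, e.g. epoch millis
}

// Mirrors the roadmap row: sums, call count, newest timestamp.
interface ToolCostRow {
  toolName: string;
  promptTokensSum: number;
  completionTokensSum: number;
  nCalls: number;
  updatedAt: number;
}

function rollingStats(calls: PerCall[], windowSize: number): ToolCostRow[] {
  // ORDER BY created_at DESC: walk newest first so each tool's first
  // `windowSize` hits are exactly the rows with rn <= windowSize.
  const newestFirst = calls.slice().sort((a, b) => b.createdAt - a.createdAt);
  const rows: ToolCostRow[] = [];
  for (const call of newestFirst) {
    let row: ToolCostRow | undefined;
    for (const r of rows) {
      if (r.toolName === call.toolName) row = r;
    }
    if (row === undefined) {
      // First (newest) call seen for this tool, so it is MAX(created_at).
      row = {
        toolName: call.toolName,
        promptTokensSum: 0,
        completionTokensSum: 0,
        nCalls: 0,
        updatedAt: call.createdAt,
      };
      rows.push(row);
    }
    if (row.nCalls >= windowSize) continue; // older than the window
    row.promptTokensSum += call.promptTokens;
    row.completionTokensSum += call.completionTokens;
    row.nCalls += 1;
  }
  // Round once, after summing.
  for (const r of rows) {
    r.promptTokensSum = Math.round(r.promptTokensSum);
    r.completionTokensSum = Math.round(r.completionTokensSum);
  }
  return rows.sort((a, b) => (a.toolName < b.toolName ? -1 : a.toolName > b.toolName ? 1 : 0));
}
```

With 150 calls to one tool and a window of 100, only the newest 100 contribute to the sums and `nCalls` is 100.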

## Canonical column mapping (pinned)

The `messages` columns are named non-obviously. Pinned mapping, confirmed across 5 write sites + 1 read site:

| Column          | Semantic meaning   | AI SDK v6 source name |
|-----------------|--------------------|-----------------------|
| `ctx_used`      | prompt / input tokens   | `usage.inputTokens`   |
| `tokens_used`   | completion / output tokens | `usage.outputTokens`  |

Write sites confirmed: `tool-phase.ts:94-95`, `error-handler.ts:109-110`, `sentinel-summaries.ts:130-131`, `sentinel-summaries.ts:387-388`, `stream-phase.ts:319-320`. Canonical read at `payload.ts:190-191` reverses: `const promptTokens = updated.ctx_used; const completionTokens = updated.tokens_used`.

`tokens_used` reads like "total" but is completion only. Project convention since the columns predate v1.13.x. Do not "fix" the naming inside this batch — out of scope; downstream consumers depend on the current mapping.
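
To keep the mapping straight at call sites, it can help to spell it out once in code. This is a sketch only: `usageToColumns` and `columnsToCounts` are hypothetical helpers, and the `number | undefined` typing of the usage fields is an assumption about the SDK's shape.

```typescript
// Usage fields as named in the mapping table (typing is an assumption).
interface Usage {
  inputTokens: number | undefined;
  outputTokens: number | undefined;
}

// The two `messages` columns.
interface TokenColumns {
  ctx_used: number | null;
  tokens_used: number | null;
}

// Write direction: SDK usage -> columns.
function usageToColumns(usage: Usage): TokenColumns {
  return {
    ctx_used: usage.inputTokens ?? null, // prompt / input
    tokens_used: usage.outputTokens ?? null, // completion / output, NOT a total
  };
}

// Read direction: columns -> semantic names.
function columnsToCounts(row: TokenColumns): {
  promptTokens: number | null;
  completionTokens: number | null;
} {
  return { promptTokens: row.ctx_used, completionTokens: row.tokens_used };
}
```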

## Attribution model

A single assistant turn can emit N tool calls in parallel. llama-swap returns ONE (prompt_tokens, completion_tokens) per turn, not per tool. Attribution requires a split.

**Chosen approach: equal split.** For an assistant turn that emits N tool calls with prompt P and completion C, each tool is attributed P/N prompt + C/N completion. The 100-call rolling mean smooths split noise. Implementation: `tokens_used::float / jsonb_array_length(tool_calls)` at the unnest site.

**Alternatives rejected:**
- "Full turn cost to every tool" (no division). Over-states; a 5-tool turn would 5×-count every tool's cost.
- "Result-size only" (`length(JSON.stringify(output)) / 4`). Loses the LLM's actual usage signal; doesn't capture how expensive a tool's output is to the next prompt.
- "Consuming-turn delta" (next turn prompt_tokens − this turn prompt_tokens, attribute to the tool that emitted the result). Most accurate but requires bubble-back math through the `executeToolPhase → runAssistantTurn` recursion. Over-engineered for the rolling-average use case.

**If Sam wants a different split, change the divisor in the view definition (it appears once for prompt and once for completion).**
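
As a worked example of the equal split, here is a small TypeScript sketch. `Turn`, `Attribution`, and `splitEqually` are hypothetical names for illustration; the real split lives in the SQL view.

```typescript
// One assistant turn: the tools it called plus the single usage pair
// reported for the whole turn.
interface Turn {
  toolNames: string[];
  promptTokens: number; // P
  completionTokens: number; // C
}

interface Attribution {
  toolName: string;
  promptTokens: number;
  completionTokens: number;
}

// Equal split: each of the N calls is attributed P/N and C/N.
function splitEqually(turn: Turn): Attribution[] {
  const n = turn.toolNames.length;
  if (n === 0) return []; // nothing to attribute, and avoids dividing by zero
  return turn.toolNames.map((toolName) => ({
    toolName,
    promptTokens: turn.promptTokens / n,
    completionTokens: turn.completionTokens / n,
  }));
}

// P = 15000, C = 300, N = 3: each call gets 5000 prompt and 100 completion.
const shares = splitEqually({
  toolNames: ['view_file', 'grep', 'list_dir'],
  promptTokens: 15000,
  completionTokens: 300,
});
```

A tool called twice in the same turn appears twice in the result, so it receives two shares.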

## Filtering — sentinel, failure, repair-call semantics

The view excludes rows that aren't real tool-cost signal:

- **Failed and cancelled turns** (`status != 'complete'`). The `error-handler.ts` failed/cancelled paths don't write `tokens_used`/`ctx_used`, so the existing `tokens_used IS NOT NULL` clause already filters these. Adding `status='complete'` is defense in depth and makes intent explicit.
- **Cap-hit and doom-loop sentinel rows** (`metadata->>'kind' IN ('cap_hit', 'doom_loop')`). Sentinels are `role='system'` rows with `tool_calls=NULL`, so the existing `tool_calls IS NOT NULL` clause already filters them. The explicit metadata filter is defense in depth — it survives future schema drift where someone might INSERT a sentinel with a non-null tool_calls.
- **`experimental_repairToolCall` retries.** No special handling needed. Our impl (per `CLAUDE.md`) is pass-through — malformed calls flow to zod-reject → tool_result error → next normal turn handles. No separate rows; the next turn's tokens count naturally.
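
The filters above amount to one predicate per row. A minimal TypeScript sketch, assuming a flattened row shape: `MessageRow` and `isCostSignal` are hypothetical names, and `toolCallCount` / `metadataKind` stand in for `jsonb_array_length(tool_calls)` and `metadata->>'kind'`.

```typescript
interface MessageRow {
  status: string;
  toolCallCount: number | null; // null when tool_calls IS NULL
  tokens_used: number | null;
  ctx_used: number | null;
  metadataKind: string | null; // null when metadata or its kind is absent
}

const SENTINEL_KINDS: string[] = ['cap_hit', 'doom_loop'];

// True only for rows that carry real tool-cost signal.
function isCostSignal(row: MessageRow): boolean {
  // Failed and cancelled turns.
  if (row.status !== 'complete') return false;
  // Turns without tool calls (this also covers today's sentinel rows).
  if (row.toolCallCount === null || row.toolCallCount === 0) return false;
  // Turns written before usage was populated.
  if (row.tokens_used === null || row.ctx_used === null) return false;
  // Explicit sentinel exclusion, defense in depth.
  if (row.metadataKind !== null && SENTINEL_KINDS.indexOf(row.metadataKind) !== -1) return false;
  return true;
}
```

Each check maps to one filter, so dropping a defense-in-depth clause is a one-line change that is easy to review.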

## Recon (already done; paste for reference)

```bash
cd /opt/boocode
grep -nE "tokens_used|ctx_used|inputTokens|outputTokens" apps/server/src/services/inference/*.ts | head -30
grep -nE "metadata|cap_hit|doom_loop" apps/server/src/services/inference/sentinels.ts apps/server/src/schema.sql | head -10
psql -h localhost -p 5432 -U postgres -d boocode -c "\d messages_with_parts" | head -30
```


Expected: confirms the canonical mapping in the table above; confirms `messages.metadata jsonb` exists at `schema.sql:259`; confirms `messages_with_parts` exposes `m.metadata` at `schema.sql:92`.

## Scope

### 1. schema.sql — `tool_cost_stats` view (~35 LoC)

Append after the `messages_with_parts` view (after line 120):

```sql
-- v1.13.10: per-tool token cost rolling window. Derives from
-- messages_with_parts (the v1.13.1-B view that COALESCEs message_parts over
-- the legacy JSON column) so this works whether the chat predates v1.13.0
-- or postdates v1.13.2 (column drop). No new write site — all source data
-- already lands via the existing tool-phase.ts:94-95 UPDATE.
--
-- Attribution model: equal split. A turn emitting N tool calls divides its
-- prompt/completion tokens by N before attribution. See v1.13.10 dispatch
-- brief for rationale + rejected alternatives.
--
-- Column mapping: messages.ctx_used = prompt (input), messages.tokens_used
-- = completion (output). Non-obvious naming; pinned via canonical writes at
-- tool-phase.ts:94-95 et al.
--
-- Filtering rationale:
--   status='complete'                — exclude failed/cancelled (defense in
--                                      depth; failed-path doesn't write
--                                      tokens_used so they're also filtered
--                                      indirectly).
--   metadata->>'kind' exclusions     — exclude cap_hit / doom_loop sentinels
--                                      (defense in depth; sentinels are
--                                      role='system' with tool_calls=NULL
--                                      so they're filtered indirectly too).
--   experimental_repairToolCall      — no special handling; retries flow
--                                      as normal next-turn tool_result
--                                      errors and count naturally.
--
-- Rolling window: last 100 calls per tool_name, ordered by created_at DESC.
-- Aggregate-on-read is microseconds at BooCode scale (single user, ~30
-- tools, < 100 calls each). DROP VIEW + recreate to change window size.
CREATE OR REPLACE VIEW tool_cost_stats AS
WITH per_call AS (
  SELECT
    (tc->>'name')::text AS tool_name,
    (m.ctx_used::float / NULLIF(jsonb_array_length(m.tool_calls), 0)) AS prompt_tokens,
    (m.tokens_used::float / NULLIF(jsonb_array_length(m.tool_calls), 0)) AS completion_tokens,
    m.created_at,
    ROW_NUMBER() OVER (
      PARTITION BY (tc->>'name')::text
      ORDER BY m.created_at DESC
    ) AS rn
  FROM messages_with_parts m,
    LATERAL jsonb_array_elements(m.tool_calls) AS tc
  WHERE m.tool_calls IS NOT NULL
    AND jsonb_array_length(m.tool_calls) > 0
    AND m.tokens_used IS NOT NULL
    AND m.ctx_used IS NOT NULL
    AND m.status = 'complete'
    AND (m.metadata IS NULL
         OR m.metadata->>'kind' IS NULL
         OR m.metadata->>'kind' NOT IN ('cap_hit', 'doom_loop'))
)
SELECT
  tool_name,
  ROUND(SUM(prompt_tokens))::int AS prompt_tokens_sum,
  ROUND(SUM(completion_tokens))::int AS completion_tokens_sum,
  COUNT(*)::int AS n_calls,
  MAX(created_at) AS updated_at
FROM per_call
WHERE rn <= 100
GROUP BY tool_name;
```

Notes:

- `NULLIF(..., 0)` guards against division by zero when `jsonb_array_length` is 0 (the WHERE clause already rules this out; the guard is defensive).
- `ROUND(SUM(...))::int`: the frontend doesn't want decimals, and sum-then-round is more accurate than per-row round-then-sum.
- The view reads from `messages_with_parts`, not `messages`, so legacy pre-v1.13.0 rows and post-v1.13.2 rows both resolve.
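- No index needed at this scale; the underlying `idx_messages_chat` covers the JOIN. The `rn <= 100` filter applies only after the window function has ranked every qualifying row, so the LATERAL unnest runs over the full tool-call history, not just the newest 100 rows per tool.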
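
The sum-then-round point can be shown with a few lines of arithmetic. A TypeScript sketch with illustrative values (`sumThenRound` and `roundThenSum` are hypothetical helpers):

```typescript
// Round once, after summing (what ROUND(SUM(...)) does).
function sumThenRound(values: number[]): number {
  let total = 0;
  for (const v of values) total += v;
  return Math.round(total);
}

// Round each share first, then sum (the alternative the note rejects).
function roundThenSum(values: number[]): number {
  let total = 0;
  for (const v of values) total += Math.round(v);
  return total;
}

// One tool called in three separate 3-tool turns of 100 completion tokens
// each: every share is 100 / 3, roughly 33.33.
const thirds = [100 / 3, 100 / 3, 100 / 3];
const once = sumThenRound(thirds); // 100
const perRow = roundThenSum(thirds); // 99, one token lost to per-row rounding
```

Per-row rounding error grows with the number of rows, while rounding once keeps it within half a token for the whole window.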

### 2. `apps/server/src/routes/tools.ts` (NEW, ~40 LoC)

New route file. Register in `apps/server/src/index.ts` next to the other `register*Routes(app, sql, ...)` calls.

```ts
import type { FastifyInstance } from 'fastify';
import type { Sql } from '../db.js';

export interface ToolCostStat {
  tool_name: string;
  mean_prompt_tokens: number;
  mean_completion_tokens: number;
  n_calls: number;
  updated_at: string;
}

export function registerToolsRoutes(app: FastifyInstance, sql: Sql) {
  app.get('/api/tools/cost_stats', async () => {
    const rows = await sql<{
      tool_name: string;
      prompt_tokens_sum: number;
      completion_tokens_sum: number;
      n_calls: number;
      updated_at: string;
    }[]>`
      SELECT tool_name, prompt_tokens_sum, completion_tokens_sum, n_calls, updated_at
      FROM tool_cost_stats
      ORDER BY tool_name ASC
    `;
    const stats: ToolCostStat[] = rows.map(r => ({
      tool_name: r.tool_name,
      mean_prompt_tokens: Math.round(r.prompt_tokens_sum / r.n_calls),
      mean_completion_tokens: Math.round(r.completion_tokens_sum / r.n_calls),
      n_calls: r.n_calls,
      updated_at: r.updated_at,
    }));
    return { stats };
  });
}
```

Route is bodyless, idempotent, cheap. No pagination (≤30 tools).

### 3. `apps/server/src/services/__tests__/tool_cost_stats.test.ts` (NEW, ~95 LoC)

Integration test against real Postgres (matches the `inference.test.ts` pattern). Fixtures:

```ts
import { describe, it, expect, beforeEach } from 'vitest';
import { connect } from '../../db.js';

describe('tool_cost_stats view (v1.13.10)', () => {
  // ... session + chat + project setup helpers ...

  it('returns empty when no tool calls exist', async () => {
    // fresh chat, only user/assistant text turns
    const stats = await sql`SELECT * FROM tool_cost_stats`;
    expect(stats).toEqual([]);
  });

  it('attributes single-tool turn fully to that tool', async () => {
    // insert one assistant message with tool_calls=[{name: 'view_file', ...}],
    // tokens_used=300, ctx_used=15000, status='complete'
    const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name='view_file'`;
    expect(stats[0]).toMatchObject({
      tool_name: 'view_file',
      prompt_tokens_sum: 15000,
      completion_tokens_sum: 300,
      n_calls: 1,
    });
  });

  it('splits multi-tool turn equally across tools', async () => {
    // insert one assistant turn with 3 tool calls (view_file, grep, list_dir),
    // tokens_used=300, ctx_used=15000 → each tool gets 100 completion, 5000 prompt
    const stats = await sql`SELECT * FROM tool_cost_stats ORDER BY tool_name`;
    expect(stats).toHaveLength(3);
    for (const s of stats) {
      expect(s.completion_tokens_sum).toBe(100);
      expect(s.prompt_tokens_sum).toBe(5000);
      expect(s.n_calls).toBe(1);
    }
  });

  it('limits to last 100 calls per tool (FIFO window)', async () => {
    // insert 150 turns each calling view_file once with monotonically
    // increasing tokens_used; expect only the most recent 100 to count
    const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name='view_file'`;
    expect(stats[0]!.n_calls).toBe(100);
    // mean should reflect the most recent 100 turns (51..150), not 1..150
  });

  it('excludes turns with NULL tokens_used (pre-v1.13.7 latent regression)', async () => {
    // insert a turn with tool_calls but tokens_used=NULL → must not appear
    const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name='view_file'`;
    expect(stats).toEqual([]);
  });

  it('excludes failed and cancelled turns + sentinel metadata rows', async () => {
    // insert five rows for tool_name='view_file', all with tokens_used+ctx_used
    // populated:
    //   row A: status='failed'                                -> excluded
    //   row B: status='cancelled'                             -> excluded
    //   row C: status='complete', metadata={kind:'cap_hit'}   -> excluded
    //   row D: status='complete', metadata={kind:'doom_loop'} -> excluded
    //   row E: status='complete', metadata=null               -> included
    // Expect n_calls=1, attributable to row E only.
    const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name='view_file'`;
    expect(stats[0]!.n_calls).toBe(1);
  });

  it('reads tool_calls via messages_with_parts (parts-authoritative)', async () => {
    // insert a v1.13.0+ row with messages.tool_calls=NULL but
    // message_parts rows containing the tool_call → must still aggregate
    const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name='grep'`;
    expect(stats[0]!.n_calls).toBe(1);
  });
});
```

Pattern: each test resets the `messages` table for the fixture chat (`TRUNCATE`, not `DELETE`: the Postgres `messages` table has FK CASCADE) and inserts hand-crafted rows. The view is recomputed on every SELECT.

### 4. `apps/web/src/api/types.ts` + `client.ts` (~10 LoC)

Add to `types.ts`:

```ts
export interface ToolCostStat {
  tool_name: string;
  mean_prompt_tokens: number;
  mean_completion_tokens: number;
  n_calls: number;
  updated_at: string;
}
```

Add to `client.ts` under the existing `api.*` namespace structure:

```ts
tools: {
  costStats: () => fetch<{ stats: ToolCostStat[] }>('GET', '/api/tools/cost_stats'),
},
```

Match the casing convention of the existing namespaces (`api.agents.list`, `api.chats.archive`, etc.).

### 5. `apps/web/src/components/AgentPicker.tsx` (~80 LoC)

Currently (line 67): `title={selectedAgent?.description}`, a native HTML `title` attribute on the trigger button.

Replacement: dropdown items get a per-agent cost line in muted text below the description. Format:

```text
[Agent name]
[Agent description]
~5.2k prompt / 280 completion · 6/8 tools · last call 3h ago
```

Implementation steps:

1. Fetch `api.tools.costStats()` once on mount (alongside the existing `api.agents.list()`). Cache the result for the lifetime of the picker open state. Re-fetch only on `useEffect` dep change.
2. Compute the per-agent aggregate: for each agent, sum the means of its whitelisted tools. Sum-of-means, not mean-of-sums, because we're combining independent rolling averages.
3. Render below the description (one line, muted, truncated). Show "—" if no calls are recorded yet for any of the agent's tools.
4. Don't break the existing native `title=` for backward compat; layer the cost line additively.

```tsx
const [costStats, setCostStats] = useState<ToolCostStat[]>([]);
useEffect(() => {
  api.tools.costStats().then(r => setCostStats(r.stats)).catch(() => setCostStats([]));
}, []);
const costByTool = useMemo(
  () => Object.fromEntries(costStats.map(s => [s.tool_name, s])),
  [costStats],
);
function agentCost(agent: Agent): { prompt: number; completion: number; nTools: number; nWithData: number; mostRecent: string | null } {
  let prompt = 0, completion = 0, nWithData = 0;
  let mostRecent: string | null = null;
  for (const t of agent.tools) {
    const s = costByTool[t];
    if (!s) continue;
    prompt += s.mean_prompt_tokens;
    completion += s.mean_completion_tokens;
    nWithData++;
    if (!mostRecent || s.updated_at > mostRecent) mostRecent = s.updated_at;
  }
  return { prompt, completion, nTools: agent.tools.length, nWithData, mostRecent };
}
```

For the line render: `~${formatK(prompt)} prompt / ${completion} completion · ${nWithData}/${nTools} tools · ${formatAgo(mostRecent)}`. Skip entirely when `nWithData === 0` to avoid showing "0k / 0 / 0 tools" for fresh-from-deploy state.

`formatK` / `formatAgo`: colocate at the bottom of `AgentPicker.tsx`. Don't extract to a util file in this batch; there is a single use site.
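
The brief names `formatK` and `formatAgo` without defining them. One possible shape is sketched below in TypeScript. The rounding rules, the `m`/`h`/`d` buckets, the empty-string fallback, and the optional `nowMs` parameter (added for testability) are all assumptions, not confirmed behavior.

```typescript
// 5200 -> "5.2k", 15000 -> "15k", 280 -> "280".
function formatK(n: number): string {
  const rounded = Math.round(n);
  if (rounded < 1000) return String(rounded);
  const k = (rounded / 1000).toFixed(1);
  // Drop a trailing ".0" so 15000 reads "15k" rather than "15.0k".
  return (k.slice(-2) === '.0' ? k.slice(0, -2) : k) + 'k';
}

// ISO timestamp -> "last call 3h ago". Returns "" when there is no usable
// timestamp, so the caller can omit the segment.
function formatAgo(iso: string | null, nowMs: number = Date.now()): string {
  if (iso === null) return '';
  const then = Date.parse(iso);
  if (isNaN(then)) return '';
  const minutes = Math.max(0, Math.floor((nowMs - then) / 60000));
  if (minutes < 1) return 'last call just now';
  if (minutes < 60) return `last call ${minutes}m ago`;
  const hours = Math.floor(minutes / 60);
  if (hours < 24) return `last call ${hours}h ago`;
  return `last call ${Math.floor(hours / 24)}d ago`;
}
```

For example, `formatK(5200)` returns `"5.2k"`, and `formatAgo` with a timestamp three hours before `nowMs` returns `"last call 3h ago"`.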

## What NOT to do

- Don't add a new write site at `tool-phase.ts` or `finalizeCompletion`. All source data is already there via existing UPDATEs.
- Don't denormalize. The view is sufficient and rollback-safe at BooCode's single-user scale.
- Don't add per-tool cost to the message bubble. Out of scope. AgentPicker tooltip only.
- Don't fold per-call rows into a moving sum via triggers. Aggregate on read; 100 rows × 30 tools is microseconds in Postgres.
- Don't track `result_chars` (the size of `tool_results.output`). Tempting as a second cost signal but out of scope here. Future batch if Sam wants it.
- Don't add a session-scoped or chat-scoped filter to `tool_cost_stats`. The rolling window is GLOBAL across all chats; the agent picker is a project-level decision aid. Per-chat surfacing is a future v1.14+ design.
- Don't change the attribution model post-deployment without dropping the view first. Mid-flight semantic changes give bogus historical means.
- Don't "fix" the `ctx_used`/`tokens_used` naming inside this batch. Non-obvious but pinned across 5 write sites. Renaming is its own batch.
- Don't rely solely on `tool_calls IS NOT NULL` for sentinel exclusion. It works today (sentinels are `role='system'` with `tool_calls=NULL`) but the explicit `status='complete'` + `metadata->>'kind'` filters are defense in depth and survive future schema drift.

## Backup before edits

```bash
cd /opt/boocode
cp apps/server/src/schema.sql{,.bak-$(date +%Y%m%d-%H%M%S)}
cp apps/web/src/components/AgentPicker.tsx{,.bak-$(date +%Y%m%d-%H%M%S)}
```

(No backup needed for new files in items 2, 3, 4.)

## Verify

```bash
pnpm -C apps/server test
```

Expected: all existing tests pass + 7 new in `tool_cost_stats.test.ts`. Total moves from 195 → 202.

```bash
cd /opt/boocode
docker compose exec boocode_db psql -U postgres -d boocode -c \
  "SELECT * FROM tool_cost_stats ORDER BY n_calls DESC LIMIT 10;"
```

Expected: in any live deployment with v1.13.7+ history, this returns real rows for `view_file`, `grep`, `list_dir`, etc. If empty: `messages.tokens_used`/`ctx_used` were NULL throughout the v1.13.1-A → v1.13.7 latent regression window, so the view only fills in with v1.13.7+ traffic.

## Build + smoke

```bash
cd /opt/boocode
docker compose up --build -d boocode
docker compose logs --since=30s boocode | tail -20
```

Smoke A (view recompiles on schema apply):

```bash
docker compose logs boocode | grep -i "tool_cost_stats\|applySchema"
```

Expected: clean schema apply, view registered idempotently.

Smoke B (endpoint returns data):

```bash
curl -s http://localhost:3000/api/tools/cost_stats | jq '.stats | length, .stats[0]'
```

Expected: nonzero length if any v1.13.7+ tool calls exist; one stat object with all 5 fields populated.
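
The "all 5 fields populated" check can also be expressed as a type guard, which is handy if the smoke test is ever scripted. A TypeScript sketch: `isToolCostStat` is a hypothetical helper, and the `n_calls >= 1` and parseable-timestamp conditions are assumptions about what "populated" should mean.

```typescript
interface ToolCostStat {
  tool_name: string;
  mean_prompt_tokens: number;
  mean_completion_tokens: number;
  n_calls: number;
  updated_at: string;
}

// Narrow an unknown JSON value to ToolCostStat, field by field.
function isToolCostStat(value: unknown): value is ToolCostStat {
  if (typeof value !== 'object' || value === null) return false;
  const r = value as { [key: string]: unknown };
  const name = r['tool_name'];
  const meanPrompt = r['mean_prompt_tokens'];
  const meanCompletion = r['mean_completion_tokens'];
  const nCalls = r['n_calls'];
  const updatedAt = r['updated_at'];
  return (
    typeof name === 'string' && name.length > 0 &&
    typeof meanPrompt === 'number' && !isNaN(meanPrompt) &&
    typeof meanCompletion === 'number' && !isNaN(meanCompletion) &&
    typeof nCalls === 'number' && nCalls >= 1 && // a grouped row has at least one call
    typeof updatedAt === 'string' && !isNaN(Date.parse(updatedAt))
  );
}
```

Applied to each element of `stats`, the guard fails on a missing field, a wrong type, or an unparseable `updated_at`.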

Smoke C (UI):

1. Open browser to boocode.indifferentketchup.com.
2. Open the AgentPicker dropdown on any session.
3. Each agent row shows a muted cost line below its description: `~5.2k prompt / 280 completion · 6/8 tools · last call 2h ago`.
4. Agents with no tool history show just the description (no cost line).
5. Confirm the cost line truncates with the existing `text-muted-foreground` / `truncate` pattern and doesn't break the layout at mobile widths (open Vivaldi devtools, set iPhone-13 viewport).

## Files expected to touch

- `apps/server/src/schema.sql`: ~35 LoC delta (view definition + filter comments)
- `apps/server/src/routes/tools.ts`: NEW, ~40 LoC
- `apps/server/src/index.ts`: 1 line (`registerToolsRoutes(app, sql)`)
- `apps/server/src/services/__tests__/tool_cost_stats.test.ts`: NEW, ~95 LoC
- `apps/web/src/api/types.ts`: ~7 LoC (interface)
- `apps/web/src/api/client.ts`: ~3 LoC (namespace + method)
- `apps/web/src/components/AgentPicker.tsx`: ~80 LoC delta (cost line + fetch hook + helpers)

Total ~260 LoC. Matches roadmap estimate.

## Workflow conventions

- Backups before destructive edits (above) on the two MODIFIED files. New files don't need backups.
- Sam reviews diffs. Never `git add` / `git commit` / `git push` / `git pull` on Sam's behalf.
- Build: `docker compose up --build -d boocode`. No `--no-cache` unless a layer-cache trap surfaces.
- Tests authoritative: `pnpm -C apps/server test`.
- View definition lives in `schema.sql` (idempotent via `CREATE OR REPLACE VIEW`); no migration shim needed.

## Don't repeat past mistakes

- v1.13.7 stability bundle (`includeUsage: true`, trim guards, payload filter, `BUDGET_NO_AGENT=30`): all live. This batch depends on `includeUsage: true`. If unset, `tool_cost_stats` returns no rows.
- v1.13.8 prefix instrumentation: untouched.
- v1.13.9 ratio-only `usable()`: untouched.
- v1.13.4 two-tier prune: untouched.
- v1.13.5 `truncate.ts` opaque-id pattern: untouched.
- v1.13.1-B `messages_with_parts` view: this view is the source. Don't reach past it to raw `messages`.
- v1.13.2 will DROP the `messages.tool_calls`/`tool_results` columns. The `tool_cost_stats` view reads from `messages_with_parts`, not `messages`, so it survives. Verify after v1.13.2 ships.

## Source files to read in project knowledge

- `boocode_roadmap.md` (v1.13.10 row at line 114; schema row at line 474)
- `boocode_code_review.md` (cost-tracking design background)
- `CLAUDE.md` (project conventions; `messages_with_parts` invariant at L80; v1.13.7 `includeUsage` invariant)
