From 9ce638c916745a74541dcdfb0095b2bd963a4088 Mon Sep 17 00:00:00 2001
From: indifferentketchup
Date: Fri, 22 May 2026 14:42:09 +0000
Subject: [PATCH] v1.13.10: per-tool token cost accounting (rolling 100-call view)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Surfaces per-tool prompt/completion-token rolling averages in AgentPicker
for at-a-glance agent-cost hints. Implementation is a SQL view on top of
messages_with_parts plus a read endpoint and AgentPicker tooltip
extension. No new write site; all source data already lands via the
existing tool-phase.ts:94-95 / error-handler.ts:109-110 /
sentinel-summaries.ts UPDATEs that v1.13.7's includeUsage: true fix made
non-NULL.

(1) schema.sql: new tool_cost_stats view. Window-functions over
messages_with_parts.tool_calls with LATERAL jsonb_array_elements.
Attribution: equal split; a multi-tool turn divides tokens N-ways, and the
100-call rolling mean absorbs split noise. Filters: status='complete' +
metadata.kind NOT IN ('cap_hit','doom_loop') exclude failed turns and
sentinels respectively; tool_calls IS NOT NULL is defense-in-depth since
sentinels are role='system' rows. CREATE OR REPLACE means schema apply is
idempotent.

(2) routes/tools.ts NEW + index.ts wire-in. GET /api/tools/cost_stats
returns { stats: ToolCostStat[] } with mean_prompt_tokens /
mean_completion_tokens computed at read time (sum / n_calls). Sorted by
tool_name ASC. No pagination (≤30 tools).

(3) __tests__/tool_cost_stats.test.ts NEW: 7 integration tests keyed off
the DATABASE_URL env var. Tests skip gracefully when unset (no-DB default).
beforeAll applies the schema via sql.unsafe(readFileSync(schema.sql)) for
self-contained runs. Helper insertAssistantTurn shared across cases.
Covers: empty state, single-tool attribution, multi-tool equal split,
100-call FIFO window, NULL-tokens exclusion, parts-authoritative read via
messages_with_parts, failed/sentinel exclusion.
(4) web/api/types.ts + client.ts: ToolCostStat interface +
api.tools.costStats() method binding.

(5) AgentPicker.tsx: fetch costStats on mount, compute per-agent
sum-of-means across whitelisted tools, render muted cost line below
description:
"~5.2k prompt / 280 completion · 6/8 tools · last call 3h ago".
Skips line entirely when no tool history; preserves existing native
title= for layout backward-compat. formatK/formatAgo colocated.

Tests: 202/202 pass (195 prior + 7 new view-integration). Server + web
tsc clean. Smoke: schema applied cleanly; GET /api/tools/cost_stats
returns canonical JSON; view + endpoint agree. Single-row result expected
given the v1.13.1-A → v1.13.7 NULL latent regression window; new traffic
populates organically.

Roadmap row at boocode_roadmap.md:114 plus schema row at :474 both match.
View vs table decision documented in handoff_v1.13.10_per_tool_cost.md
(rollback-safe, microsecond-fast at BooCode scale).

~270 LoC across 8 files (5 modified + 3 new).
---
 apps/server/src/index.ts                      |   2 +
 apps/server/src/routes/tools.ts               |  40 ++
 apps/server/src/schema.sql                    |  62 +++
 .../__tests__/tool_cost_stats.test.ts         | 228 +++++++++
 apps/web/src/api/client.ts                    |   9 +
 apps/web/src/api/types.ts                     |  12 +
 apps/web/src/components/AgentPicker.tsx       | 123 ++++-
 handoff_v1.13.10_per_tool_cost.md             | 441 ++++++++++++++++++
 8 files changed, 896 insertions(+), 21 deletions(-)
 create mode 100644 apps/server/src/routes/tools.ts
 create mode 100644 apps/server/src/services/__tests__/tool_cost_stats.test.ts
 create mode 100644 handoff_v1.13.10_per_tool_cost.md

diff --git a/apps/server/src/index.ts b/apps/server/src/index.ts
index ca76111..3ec63d3 100644
--- a/apps/server/src/index.ts
+++ b/apps/server/src/index.ts
@@ -16,6 +16,7 @@ import { registerWebSocket } from './routes/ws.js';
 import { registerModelRoutes } from './routes/models.js';
 import { registerAgentRoutes } from './routes/agents.js';
 import { registerSkillsRoutes } from './routes/skills.js';
+import {
registerToolsRoutes } from './routes/tools.js'; import { createInferenceRunner } from './services/inference/index.js'; import { createBroker } from './services/broker.js'; import { listSkills } from './services/skills.js'; @@ -83,6 +84,7 @@ async function main() { registerAgentRoutes(app, sql); registerSidebarRoutes(app, sql); registerChatRoutes(app, sql, broker); + registerToolsRoutes(app, sql); // Batch 9.6: warm the skills cache at boot and surface the count. Empty or // missing /data/skills is non-fatal — the skill tools just return empty. diff --git a/apps/server/src/routes/tools.ts b/apps/server/src/routes/tools.ts new file mode 100644 index 0000000..7930857 --- /dev/null +++ b/apps/server/src/routes/tools.ts @@ -0,0 +1,40 @@ +import type { FastifyInstance } from 'fastify'; +import type { Sql } from '../db.js'; + +export interface ToolCostStat { + tool_name: string; + mean_prompt_tokens: number; + mean_completion_tokens: number; + n_calls: number; + updated_at: string; +} + +// v1.13.10: per-tool token cost rolling window read endpoint. Backed by the +// tool_cost_stats view in schema.sql (last 100 calls per tool, equal-split +// attribution across multi-tool turns, sentinel/failed-turn excluded). +// Consumed by AgentPicker for at-a-glance per-agent cost hints. 
+export function registerToolsRoutes(app: FastifyInstance, sql: Sql): void { + app.get('/api/tools/cost_stats', async () => { + const rows = await sql< + { + tool_name: string; + prompt_tokens_sum: number; + completion_tokens_sum: number; + n_calls: number; + updated_at: string; + }[] + >` + SELECT tool_name, prompt_tokens_sum, completion_tokens_sum, n_calls, updated_at + FROM tool_cost_stats + ORDER BY tool_name ASC + `; + const stats: ToolCostStat[] = rows.map((r) => ({ + tool_name: r.tool_name, + mean_prompt_tokens: Math.round(r.prompt_tokens_sum / r.n_calls), + mean_completion_tokens: Math.round(r.completion_tokens_sum / r.n_calls), + n_calls: r.n_calls, + updated_at: r.updated_at, + })); + return { stats }; + }); +} diff --git a/apps/server/src/schema.sql b/apps/server/src/schema.sql index 9be69f6..6c6bb0e 100644 --- a/apps/server/src/schema.sql +++ b/apps/server/src/schema.sql @@ -119,6 +119,68 @@ SELECT WHERE p.message_id = m.id AND p.kind = 'reasoning' AND p.hidden_at IS NULL) AS reasoning_parts FROM messages m; +-- v1.13.10: per-tool token cost rolling window. Derives from +-- messages_with_parts (the v1.13.1-B view that COALESCEs message_parts over +-- the legacy JSON column) so this works whether the chat predates v1.13.0 +-- or postdates v1.13.2 (column drop). No new write site — all source data +-- already lands via the existing tool-phase.ts:94-95 UPDATE. +-- +-- Attribution model: equal split. A turn emitting N tool calls divides its +-- prompt/completion tokens by N before attribution. See v1.13.10 dispatch +-- brief for rationale + rejected alternatives. +-- +-- Column mapping: messages.ctx_used = prompt (input), messages.tokens_used +-- = completion (output). Non-obvious naming; pinned via canonical writes at +-- tool-phase.ts:94-95 et al. +-- +-- Filtering rationale: +-- status='complete' — exclude failed/cancelled (defense in +-- depth; failed-path doesn't write +-- tokens_used so they're filtered +-- indirectly too). 
+-- metadata->>'kind' exclusions — exclude cap_hit / doom_loop sentinels +-- (defense in depth; sentinels are +-- role='system' with tool_calls=NULL +-- so they're filtered indirectly too). +-- experimental_repairToolCall — no special handling; retries flow +-- as normal next-turn tool_result +-- errors and count naturally. +-- +-- Rolling window: last 100 calls per tool_name, ordered by created_at DESC. +-- Aggregate-on-read is microseconds at BooCode scale (single user, ~30 +-- tools, < 100 calls each). DROP VIEW + recreate to change window size. +CREATE OR REPLACE VIEW tool_cost_stats AS +WITH per_call AS ( + SELECT + (tc->>'name')::text AS tool_name, + (m.ctx_used::float / NULLIF(jsonb_array_length(m.tool_calls), 0)) AS prompt_tokens, + (m.tokens_used::float / NULLIF(jsonb_array_length(m.tool_calls), 0)) AS completion_tokens, + m.created_at, + ROW_NUMBER() OVER ( + PARTITION BY (tc->>'name')::text + ORDER BY m.created_at DESC + ) AS rn + FROM messages_with_parts m, + LATERAL jsonb_array_elements(m.tool_calls) AS tc + WHERE m.tool_calls IS NOT NULL + AND jsonb_array_length(m.tool_calls) > 0 + AND m.tokens_used IS NOT NULL + AND m.ctx_used IS NOT NULL + AND m.status = 'complete' + AND (m.metadata IS NULL + OR m.metadata->>'kind' IS NULL + OR m.metadata->>'kind' NOT IN ('cap_hit', 'doom_loop')) +) +SELECT + tool_name, + ROUND(SUM(prompt_tokens))::int AS prompt_tokens_sum, + ROUND(SUM(completion_tokens))::int AS completion_tokens_sum, + COUNT(*)::int AS n_calls, + MAX(created_at) AS updated_at +FROM per_call +WHERE rn <= 100 +GROUP BY tool_name; + ALTER TABLE messages ADD COLUMN IF NOT EXISTS tokens_used INTEGER; ALTER TABLE messages ADD COLUMN IF NOT EXISTS ctx_used INTEGER; ALTER TABLE messages ADD COLUMN IF NOT EXISTS ctx_max INTEGER; diff --git a/apps/server/src/services/__tests__/tool_cost_stats.test.ts b/apps/server/src/services/__tests__/tool_cost_stats.test.ts new file mode 100644 index 0000000..87ea7ab --- /dev/null +++ 
b/apps/server/src/services/__tests__/tool_cost_stats.test.ts
@@ -0,0 +1,228 @@
+import { describe, it, expect, beforeAll, afterAll } from 'vitest';
+import postgres from 'postgres';
+import { readFileSync } from 'node:fs';
+import { resolve } from 'node:path';
+import { fileURLToPath } from 'node:url';
+
+// v1.13.10: integration tests for the tool_cost_stats view. Skipped unless
+// DATABASE_URL is set so they don't break `pnpm test` on a fresh checkout.
+// Run with:
+//   DATABASE_URL=postgres://boocode:@localhost:5500/boocode pnpm -C apps/server test
+//
+// Isolation: each test uses a unique tool_name built from a per-run ID plus
+// a per-test suffix. The view aggregates globally across all chats, so
+// without unique tool names parallel test runs would interfere. Cleanup
+// deletes the fixture project in afterAll (FK CASCADE removes the rest).
+
+const DB_URL = process.env.DATABASE_URL;
+const describeFn = DB_URL ? describe : describe.skip;
+
+const TEST_RUN_ID = `v13_10_${Date.now()}`;
+const tname = (suffix: string) => `${TEST_RUN_ID}_${suffix}`;
+
+describeFn('tool_cost_stats view (v1.13.10)', () => {
+  let sql: ReturnType<typeof postgres>;
+  let projectId: string;
+  let sessionId: string;
+  let chatId: string;
+
+  beforeAll(async () => {
+    if (!DB_URL) return;
+    sql = postgres(DB_URL, { max: 2, idle_timeout: 5, connect_timeout: 5, onnotice: () => {} });
+
+    // Apply the schema before fixtures so the view exists. Idempotent via
+    // CREATE OR REPLACE VIEW + CREATE TABLE IF NOT EXISTS; safe to run on a
+    // pre-populated DB. Mirrors apps/server/src/db.ts:applySchema.
+    const here = fileURLToPath(import.meta.url);
+    const schemaPath = resolve(here, '../../../schema.sql');
+    const ddl = readFileSync(schemaPath, 'utf8');
+    await sql.unsafe(ddl);
+
+    // Fixture project + session + chat for all inserts in this file.
+    const proj = await sql<{ id: string }[]>`
+      INSERT INTO projects (name, path)
+      VALUES (${`tool_cost_stats_test_${TEST_RUN_ID}`}, ${`/tmp/${TEST_RUN_ID}`})
+      RETURNING id
+    `;
+    projectId = proj[0]!.id;
+    const sess = await sql<{ id: string }[]>`
+      INSERT INTO sessions (project_id, name, model)
+      VALUES (${projectId}, ${'test'}, ${'test-model'})
+      RETURNING id
+    `;
+    sessionId = sess[0]!.id;
+    const chat = await sql<{ id: string }[]>`
+      INSERT INTO chats (session_id, name) VALUES (${sessionId}, ${'test'}) RETURNING id
+    `;
+    chatId = chat[0]!.id;
+  });
+
+  afterAll(async () => {
+    if (!DB_URL) return;
+    // Project FK CASCADE cleans sessions/chats/messages/parts in one shot.
+    await sql`DELETE FROM projects WHERE id = ${projectId}`;
+    await sql.end({ timeout: 5 });
+  });
+
+  async function insertAssistantTurn(opts: {
+    toolNames: string[];
+    tokensUsed: number | null;
+    ctxUsed: number | null;
+    status?: 'streaming' | 'complete' | 'failed' | 'cancelled';
+    metadata?: { kind: string } | null;
+    createdAt?: Date;
+  }): Promise<string> {
+    const toolCalls = opts.toolNames.map((name, i) => ({
+      id: `call_${TEST_RUN_ID}_${name}_${i}`,
+      name,
+      args: {},
+    }));
+    const created = opts.createdAt ?? new Date();
+    const rows = await sql<{ id: string }[]>`
+      INSERT INTO messages (
+        session_id, chat_id, role, content, kind, status,
+        tool_calls, tokens_used, ctx_used,
+        metadata, created_at
+      )
+      VALUES (
+        ${sessionId}, ${chatId}, 'assistant', '', 'message',
+        ${opts.status ?? 'complete'},
+        ${sql.json(toolCalls as never)},
+        ${opts.tokensUsed},
+        ${opts.ctxUsed},
+        ${opts.metadata ?
sql.json(opts.metadata as never) : null}, + ${created} + ) + RETURNING id + `; + return rows[0]!.id; + } + + it('returns empty when no tool calls exist for a tool name', async () => { + const t = tname('absent'); + const stats = await sql<{ tool_name: string }[]>` + SELECT * FROM tool_cost_stats WHERE tool_name = ${t} + `; + expect(stats).toEqual([]); + }); + + it('attributes single-tool turn fully to that tool', async () => { + const t = tname('single'); + await insertAssistantTurn({ toolNames: [t], tokensUsed: 300, ctxUsed: 15000 }); + const stats = await sql<{ + tool_name: string; + prompt_tokens_sum: number; + completion_tokens_sum: number; + n_calls: number; + }[]>`SELECT * FROM tool_cost_stats WHERE tool_name = ${t}`; + expect(stats[0]).toMatchObject({ + tool_name: t, + prompt_tokens_sum: 15000, + completion_tokens_sum: 300, + n_calls: 1, + }); + }); + + it('splits multi-tool turn equally across tools', async () => { + const a = tname('multi_a'); + const b = tname('multi_b'); + const c = tname('multi_c'); + // 3 tools, 300 completion / 15000 prompt → each gets 100 / 5000 + await insertAssistantTurn({ toolNames: [a, b, c], tokensUsed: 300, ctxUsed: 15000 }); + const stats = await sql<{ + tool_name: string; + prompt_tokens_sum: number; + completion_tokens_sum: number; + n_calls: number; + }[]>` + SELECT * FROM tool_cost_stats + WHERE tool_name IN (${a}, ${b}, ${c}) + ORDER BY tool_name + `; + expect(stats).toHaveLength(3); + for (const s of stats) { + expect(s.completion_tokens_sum).toBe(100); + expect(s.prompt_tokens_sum).toBe(5000); + expect(s.n_calls).toBe(1); + } + }); + + it('limits to last 100 calls per tool (FIFO window)', async () => { + const t = tname('window'); + // Insert 110 turns with monotonically-increasing created_at and tokensUsed. + // Expect view to keep only the most recent 100. 
+ const base = Date.now() + 1_000_000; // distant future to avoid colliding with other tests + for (let i = 1; i <= 110; i++) { + await insertAssistantTurn({ + toolNames: [t], + tokensUsed: i, // 1..110 + ctxUsed: i * 10, + createdAt: new Date(base + i), + }); + } + const [stat] = await sql<{ + n_calls: number; + completion_tokens_sum: number; + }[]>`SELECT n_calls, completion_tokens_sum FROM tool_cost_stats WHERE tool_name = ${t}`; + expect(stat!.n_calls).toBe(100); + // Last 100 are tokensUsed=11..110, sum = (11+110)*100/2 = 6050. + expect(stat!.completion_tokens_sum).toBe(6050); + }); + + it('excludes turns with NULL tokens_used (pre-v1.13.7 latent regression)', async () => { + const t = tname('null_tokens'); + await insertAssistantTurn({ toolNames: [t], tokensUsed: null, ctxUsed: 1000 }); + await insertAssistantTurn({ toolNames: [t], tokensUsed: 100, ctxUsed: null }); + const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name = ${t}`; + expect(stats).toEqual([]); + }); + + it('excludes failed/cancelled turns and cap_hit/doom_loop sentinel rows', async () => { + const t = tname('filtered'); + // A: status='failed' — excluded + // B: status='cancelled' — excluded + // C: status='complete', metadata={kind:'cap_hit'} — excluded + // D: status='complete', metadata={kind:'doom_loop'} — excluded + // E: status='complete', metadata=null — included + await insertAssistantTurn({ toolNames: [t], tokensUsed: 100, ctxUsed: 1000, status: 'failed' }); + await insertAssistantTurn({ toolNames: [t], tokensUsed: 100, ctxUsed: 1000, status: 'cancelled' }); + await insertAssistantTurn({ toolNames: [t], tokensUsed: 100, ctxUsed: 1000, metadata: { kind: 'cap_hit' } }); + await insertAssistantTurn({ toolNames: [t], tokensUsed: 100, ctxUsed: 1000, metadata: { kind: 'doom_loop' } }); + await insertAssistantTurn({ toolNames: [t], tokensUsed: 100, ctxUsed: 1000, metadata: null }); + const [stat] = await sql<{ n_calls: number }[]>` + SELECT n_calls FROM tool_cost_stats WHERE 
tool_name = ${t} + `; + expect(stat!.n_calls).toBe(1); + }); + + it('reads tool_calls via messages_with_parts (parts-authoritative)', async () => { + const t = tname('parts'); + // Insert an assistant row with messages.tool_calls=NULL but a + // message_parts row carrying the tool_call. The view reads via + // messages_with_parts, which COALESCEs the parts table over the legacy + // column — so this row should still aggregate. + const rows = await sql<{ id: string }[]>` + INSERT INTO messages ( + session_id, chat_id, role, content, kind, status, + tool_calls, tokens_used, ctx_used + ) + VALUES ( + ${sessionId}, ${chatId}, 'assistant', '', 'message', 'complete', + NULL, 200, 5000 + ) + RETURNING id + `; + const messageId = rows[0]!.id; + await sql` + INSERT INTO message_parts (message_id, sequence, kind, payload) + VALUES ( + ${messageId}, 0, 'tool_call', + ${sql.json({ id: `tc_parts_${TEST_RUN_ID}`, name: t, args: {} } as never)} + ) + `; + const [stat] = await sql<{ n_calls: number }[]>` + SELECT n_calls FROM tool_cost_stats WHERE tool_name = ${t} + `; + expect(stat!.n_calls).toBe(1); + }); +}); diff --git a/apps/web/src/api/client.ts b/apps/web/src/api/client.ts index f3f70ab..3b82e6d 100644 --- a/apps/web/src/api/client.ts +++ b/apps/web/src/api/client.ts @@ -12,6 +12,7 @@ import type { GitMeta, Skill, AskUserAnswer, + ToolCostStat, } from './types'; export class ApiError extends Error { @@ -262,6 +263,14 @@ export const api = { list: () => request<{ skills: Skill[] }>('/api/skills'), }, + // v1.13.10: per-tool cost rolling-window stats (last 100 calls per tool, + // equal-split attribution across multi-tool turns). Read endpoint backed by + // the tool_cost_stats view. AgentPicker consumes this for per-agent cost + // hints. 
+ tools: { + costStats: () => request<{ stats: ToolCostStat[] }>('/api/tools/cost_stats'), + }, + settings: { get: () => request>('/api/settings'), patch: (body: Record) => diff --git a/apps/web/src/api/types.ts b/apps/web/src/api/types.ts index 1e332f3..9fa6378 100644 --- a/apps/web/src/api/types.ts +++ b/apps/web/src/api/types.ts @@ -1,6 +1,18 @@ export const PROJECT_STATUSES = ['open', 'archived'] as const; export type ProjectStatus = typeof PROJECT_STATUSES[number]; +// v1.13.10: per-tool cost rolling-window stat. Returned by +// GET /api/tools/cost_stats — one entry per tool with mean prompt/completion +// tokens over the last 100 invocations. AgentPicker sums across an agent's +// whitelisted tools for per-agent cost hints. +export interface ToolCostStat { + tool_name: string; + mean_prompt_tokens: number; + mean_completion_tokens: number; + n_calls: number; + updated_at: string; +} + export interface Project { id: string; name: string; diff --git a/apps/web/src/components/AgentPicker.tsx b/apps/web/src/components/AgentPicker.tsx index 78181cd..f0cbe69 100644 --- a/apps/web/src/components/AgentPicker.tsx +++ b/apps/web/src/components/AgentPicker.tsx @@ -1,8 +1,8 @@ -import { useEffect, useState } from 'react'; +import { useEffect, useMemo, useState } from 'react'; import { Check, ChevronDown } from 'lucide-react'; import { toast } from 'sonner'; import { api } from '@/api/client'; -import type { Agent, AgentParseError } from '@/api/types'; +import type { Agent, AgentParseError, ToolCostStat } from '@/api/types'; import { DropdownMenu, DropdownMenuContent, @@ -22,6 +22,10 @@ export function AgentPicker({ projectId, value, onChange }: Props) { const [parseErrors, setParseErrors] = useState([]); const [error, setError] = useState(null); const [open, setOpen] = useState(false); + // v1.13.10: per-tool cost rolling window. Fetched once on mount; would + // refresh on remount or page reload. 
Acceptable for a decision aid — the + // 100-call rolling mean doesn't shift fast. + const [costStats, setCostStats] = useState([]); // v1.8.1: per-agent parse errors are non-blocking. Silent if any agents // loaded successfully; a gray warning toast fires only when EVERY agent @@ -52,6 +56,29 @@ export function AgentPicker({ projectId, value, onChange }: Props) { }; }, [projectId]); + // v1.13.10: cost stats are project-independent — the 100-call rolling + // window is global across all chats. Fetch once per mount; tolerate failure + // silently (cost line hides). + useEffect(() => { + let cancelled = false; + api.tools + .costStats() + .then((r) => { + if (!cancelled) setCostStats(r.stats); + }) + .catch(() => { + if (!cancelled) setCostStats([]); + }); + return () => { + cancelled = true; + }; + }, []); + + const costByTool = useMemo( + () => Object.fromEntries(costStats.map((s) => [s.tool_name, s])), + [costStats], + ); + const selectedAgent = agents?.find((a) => a.id === value) ?? null; const triggerLabel = value === null ? 'No agent' @@ -86,25 +113,33 @@ export function AgentPicker({ projectId, value, onChange }: Props) { No agent {agents.length > 0 && } - {agents.map((a) => ( - void onChange(a.id)} - className="text-xs flex-col items-start gap-0.5" - > -
- - {a.name} -
- {a.description && ( - - {a.description} - - )} -
- ))} + {agents.map((a) => { + const cost = agentCost(a, costByTool); + return ( + void onChange(a.id)} + className="text-xs flex-col items-start gap-0.5" + > +
+ + {a.name} +
+ {a.description && ( + + {a.description} + + )} + {cost.nWithData > 0 && ( + + ~{formatK(cost.prompt)} prompt / {cost.completion} completion · {cost.nWithData}/{cost.nTools} tools{cost.mostRecent ? ` · last call ${formatAgo(cost.mostRecent)}` : ''} + + )} +
+ ); + })} {parseErrors.length > 0 && (
   );
 }
+
+// v1.13.10: sum the per-tool means across an agent's whitelisted tools.
+// Sum-of-means, not mean-of-sums: we're combining independent rolling
+// averages. nWithData reflects how many of the agent's tools have any
+// history yet; the line hides entirely when zero so a fresh deploy doesn't
+// render "0k / 0 / 0 tools".
+function agentCost(
+  agent: Agent,
+  costByTool: Record<string, ToolCostStat>,
+): {
+  prompt: number;
+  completion: number;
+  nTools: number;
+  nWithData: number;
+  mostRecent: string | null;
+} {
+  let prompt = 0;
+  let completion = 0;
+  let nWithData = 0;
+  let mostRecent: string | null = null;
+  for (const t of agent.tools) {
+    const s = costByTool[t];
+    if (!s) continue;
+    prompt += s.mean_prompt_tokens;
+    completion += s.mean_completion_tokens;
+    nWithData++;
+    if (!mostRecent || s.updated_at > mostRecent) mostRecent = s.updated_at;
+  }
+  return { prompt, completion, nTools: agent.tools.length, nWithData, mostRecent };
+}
+
+function formatK(n: number): string {
+  if (n < 1000) return String(n);
+  if (n < 10_000) return `${(n / 1000).toFixed(1)}k`;
+  return `${Math.round(n / 1000)}k`;
+}
+
+function formatAgo(iso: string): string {
+  const then = new Date(iso).getTime();
+  if (Number.isNaN(then)) return '—';
+  const diff = Date.now() - then;
+  if (diff < 60_000) return 'just now';
+  if (diff < 3_600_000) return `${Math.round(diff / 60_000)}m ago`;
+  if (diff < 86_400_000) return `${Math.round(diff / 3_600_000)}h ago`;
+  return `${Math.round(diff / 86_400_000)}d ago`;
+}
diff --git a/handoff_v1.13.10_per_tool_cost.md b/handoff_v1.13.10_per_tool_cost.md
new file mode 100644
index 0000000..323c09c
--- /dev/null
+++ b/handoff_v1.13.10_per_tool_cost.md
@@ -0,0 +1,441 @@
+```
+#careful #boocode #nofluff
+
+v1.13.10: per-tool token cost accounting (rolling 100-call window)
+
+Goal: surface per-tool prompt/completion-token rolling averages in AgentPicker for at-a-glance agent-cost hints.
Implementation is a SQL view on top of `messages_with_parts` (no new table, no new write site) + a read endpoint + AgentPicker tooltip extension. Estimated ~240 LoC, mostly UI. + +## Where we are + +- Last tag: v1.13.9 (compaction overflow trigger — `floor(0.85 × ctx_max)` early-trigger). Branch clean. +- v1.13.x cleanup line ✅ through v1.13.9. Queued: v1.13.10 (this) → v1.13.11 (WS Zod) → v1.13.12 (skills audit) → v1.13.2 (column drop, last). +- Dependency (satisfied since v1.13.7 commit `ff29b48`): `includeUsage: true` on `createOpenAICompatible` in `apps/server/src/services/inference/provider.ts`. Without it, `messages.tokens_used`/`ctx_used` were NULL for v1.13.1-A → v1.13.7 (latent regression). Now populated. + +## Why this matters + +Today: AgentPicker lists agents by name + description. No cost signal. Users pick the architect agent (full tool whitelist, 21k of tool schema) for one-liner questions a refactorer (3 tools, 4k schema) could answer. + +Tomorrow: each agent listing shows its mean prompt + completion cost per tool, derived from the last 100 invocations across all chats. Decision aid, not a hard gate. + +Why a SQL view instead of a denormalized stats table: +- All the source data already lands in `messages` (tool_calls JSON + tokens_used + ctx_used) and `message_parts` (read via the `messages_with_parts` view). Zero new write sites. +- Rolling 100-call window is a `ROW_NUMBER() OVER (PARTITION BY tool_name ORDER BY created_at DESC) <= 100` — natural fit for a view. +- View is rollback-safe. If the math is wrong, `DROP VIEW` and re-deploy; no orphan rows, no backfill. +- At BooCode scale (single user, ~30 tools, ~100 calls/tool), aggregate-on-read is microseconds. Premature to denormalize. + +The roadmap schema row (`tool_cost_stats (tool_name, prompt_tokens_sum, completion_tokens_sum, n_calls, updated_at)`) matches both a table and a view. View is the lighter implementation. 
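Editorial sketch, not part of the patch: the rolling window described above (keep the newest 100 calls per tool, then sum) can be mirrored in memory to sanity-check the view's arithmetic. The names `Turn`, `ToolSums`, `Call`, and `rollingToolCost` are hypothetical, the status and sentinel filters are left out for brevity, and `Math.round` is only assumed to be close enough to the view's `ROUND(SUM(...))` for whole-number inputs.

```ts
// Hypothetical in-memory mirror of the tool_cost_stats aggregation.
// Illustration only; the real logic lives in the SQL view.

interface Turn {
  toolNames: string[];       // one entry per tool call emitted by the turn
  ctxUsed: number | null;    // prompt (input) tokens for the whole turn
  tokensUsed: number | null; // completion (output) tokens for the whole turn
  createdAt: number;         // epoch milliseconds
}

interface ToolSums {
  promptTokensSum: number;
  completionTokensSum: number;
  nCalls: number;
}

interface Call {
  prompt: number;
  completion: number;
  createdAt: number;
}

function rollingToolCost(turns: Turn[], windowSize: number = 100): Map<string, ToolSums> {
  // Step 1: unnest each turn into per-tool calls, splitting tokens equally.
  const callsByTool = new Map<string, Call[]>();
  for (const turn of turns) {
    const n = turn.toolNames.length;
    const prompt = turn.ctxUsed;
    const completion = turn.tokensUsed;
    // Turns without tool calls or without usage data carry no cost signal.
    if (n === 0 || prompt === null || completion === null) continue;
    for (const name of turn.toolNames) {
      const calls = callsByTool.get(name) ?? [];
      calls.push({ prompt: prompt / n, completion: completion / n, createdAt: turn.createdAt });
      callsByTool.set(name, calls);
    }
  }

  // Step 2: per tool, keep the newest `windowSize` calls, then sum and round.
  const result = new Map<string, ToolSums>();
  callsByTool.forEach((calls, name) => {
    const recent = calls
      .slice()
      .sort((a, b) => b.createdAt - a.createdAt)
      .slice(0, windowSize);
    let promptSum = 0;
    let completionSum = 0;
    for (const c of recent) {
      promptSum += c.prompt;
      completionSum += c.completion;
    }
    result.set(name, {
      promptTokensSum: Math.round(promptSum),
      completionTokensSum: Math.round(completionSum),
      nCalls: recent.length,
    });
  });
  return result;
}
```

Under these assumptions, dividing each sum by `nCalls` gives the per-tool mean that the read endpoint reports.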
+ +## Canonical column mapping (pinned) + +The `messages` columns are named non-obviously. Pinned mapping, confirmed across 5 write sites + 1 read site: + +| Column | Semantic meaning | AI SDK v6 source name | +|-----------------|--------------------|-----------------------| +| `ctx_used` | prompt / input tokens | `usage.inputTokens` | +| `tokens_used` | completion / output tokens | `usage.outputTokens` | + +Write sites confirmed: `tool-phase.ts:94-95`, `error-handler.ts:109-110`, `sentinel-summaries.ts:130-131`, `sentinel-summaries.ts:387-388`, `stream-phase.ts:319-320`. Canonical read at `payload.ts:190-191` reverses: `const promptTokens = updated.ctx_used; const completionTokens = updated.tokens_used`. + +`tokens_used` reads like "total" but is completion only. Project convention since the columns predate v1.13.x. Do not "fix" the naming inside this batch — out of scope; downstream consumers depend on the current mapping. + +## Attribution model + +A single assistant turn can emit N tool calls in parallel. llama-swap returns ONE (prompt_tokens, completion_tokens) per turn, not per tool. Attribution requires a split. + +**Chosen approach: equal split.** For an assistant turn that emits N tool calls with prompt P and completion C, each tool is attributed P/N prompt + C/N completion. The 100-call rolling mean smooths split noise. Implementation: `tokens_used::float / jsonb_array_length(tool_calls)` at the unnest site. + +**Alternatives rejected:** +- "Full turn cost to every tool" (no division). Over-states; a 5-tool turn would 5×-count every tool's cost. +- "Result-size only" (`length(JSON.stringify(output)) / 4`). Loses the LLM's actual usage signal; doesn't capture how expensive a tool's output is to the next prompt. +- "Consuming-turn delta" (next turn prompt_tokens − this turn prompt_tokens, attribute to the tool that emitted the result). Most accurate but requires bubble-back math through the `executeToolPhase → runAssistantTurn` recursion. 
Over-engineered for the rolling-average use case. + +**If Sam wants a different split, change one line in the view definition (the divisor).** + +## Filtering — sentinel, failure, repair-call semantics + +The view excludes rows that aren't real tool-cost signal: + +- **Failed and cancelled turns** (`status != 'complete'`). The `error-handler.ts` failed/cancelled paths don't write `tokens_used`/`ctx_used`, so the existing `tokens_used IS NOT NULL` clause already filters these. Adding `status='complete'` is defense in depth and makes intent explicit. +- **Cap-hit and doom-loop sentinel rows** (`metadata->>'kind' IN ('cap_hit', 'doom_loop')`). Sentinels are `role='system'` rows with `tool_calls=NULL`, so the existing `tool_calls IS NOT NULL` clause already filters them. The explicit metadata filter is defense in depth — it survives future schema drift where someone might INSERT a sentinel with a non-null tool_calls. +- **`experimental_repairToolCall` retries.** No special handling needed. Our impl (per `CLAUDE.md`) is pass-through — malformed calls flow to zod-reject → tool_result error → next normal turn handles. No separate rows; the next turn's tokens count naturally. + +## Recon (already done; paste for reference) + +``` +cd /opt/boocode +grep -n "tokens_used\|ctx_used\|inputTokens\|outputTokens" apps/server/src/services/inference/*.ts | head -30 +grep -n "metadata\|cap_hit\|doom_loop" apps/server/src/services/inference/sentinels.ts apps/server/src/schema.sql | head -10 +psql -h localhost -p 5432 -U postgres -d boocode -c "\d messages_with_parts" | head -30 +``` + +Expected: confirms the canonical mapping in the table above; confirms `messages.metadata jsonb` exists at `schema.sql:259`; confirms `messages_with_parts` exposes `m.metadata` at `schema.sql:92`. + +## Scope + +### 1. schema.sql — `tool_cost_stats` view (~35 LoC) + +Append after the `messages_with_parts` view (after line 120): + +```sql +-- v1.13.10: per-tool token cost rolling window. 
Derives from +-- messages_with_parts (the v1.13.1-B view that COALESCEs message_parts over +-- the legacy JSON column) so this works whether the chat predates v1.13.0 +-- or postdates v1.13.2 (column drop). No new write site — all source data +-- already lands via the existing tool-phase.ts:94-95 UPDATE. +-- +-- Attribution model: equal split. A turn emitting N tool calls divides its +-- prompt/completion tokens by N before attribution. See v1.13.10 dispatch +-- brief for rationale + rejected alternatives. +-- +-- Column mapping: messages.ctx_used = prompt (input), messages.tokens_used +-- = completion (output). Non-obvious naming; pinned via canonical writes at +-- tool-phase.ts:94-95 et al. +-- +-- Filtering rationale: +-- status='complete' — exclude failed/cancelled (defense in +-- depth; failed-path doesn't write +-- tokens_used so they're also filtered +-- indirectly). +-- metadata->>'kind' exclusions — exclude cap_hit / doom_loop sentinels +-- (defense in depth; sentinels are +-- role='system' with tool_calls=NULL +-- so they're filtered indirectly too). +-- experimental_repairToolCall — no special handling; retries flow +-- as normal next-turn tool_result +-- errors and count naturally. +-- +-- Rolling window: last 100 calls per tool_name, ordered by created_at DESC. +-- Aggregate-on-read is microseconds at BooCode scale (single user, ~30 +-- tools, < 100 calls each). DROP VIEW + recreate to change window size. 
+CREATE OR REPLACE VIEW tool_cost_stats AS +WITH per_call AS ( + SELECT + (tc->>'name')::text AS tool_name, + (m.ctx_used::float / NULLIF(jsonb_array_length(m.tool_calls), 0)) AS prompt_tokens, + (m.tokens_used::float / NULLIF(jsonb_array_length(m.tool_calls), 0)) AS completion_tokens, + m.created_at, + ROW_NUMBER() OVER ( + PARTITION BY (tc->>'name')::text + ORDER BY m.created_at DESC + ) AS rn + FROM messages_with_parts m, + LATERAL jsonb_array_elements(m.tool_calls) AS tc + WHERE m.tool_calls IS NOT NULL + AND jsonb_array_length(m.tool_calls) > 0 + AND m.tokens_used IS NOT NULL + AND m.ctx_used IS NOT NULL + AND m.status = 'complete' + AND (m.metadata IS NULL + OR m.metadata->>'kind' IS NULL + OR m.metadata->>'kind' NOT IN ('cap_hit', 'doom_loop')) +) +SELECT + tool_name, + ROUND(SUM(prompt_tokens))::int AS prompt_tokens_sum, + ROUND(SUM(completion_tokens))::int AS completion_tokens_sum, + COUNT(*)::int AS n_calls, + MAX(created_at) AS updated_at +FROM per_call +WHERE rn <= 100 +GROUP BY tool_name; +``` + +Notes: +- `NULLIF(..., 0)` guards against div-by-zero on `jsonb_array_length=0` (should never happen given the WHERE clause, but defensive). +- `ROUND(SUM(...))::int` — frontend doesn't want decimals; sum-then-round is more accurate than per-row round-then-sum. +- View is read from `messages_with_parts` not `messages`, so legacy pre-v1.13.0 rows and post-v1.13.2 rows both resolve. +- No index needed; the underlying `idx_messages_chat` covers the JOIN; the LATERAL unnest is bounded by the 100-row partition. + +### 2. apps/server/src/routes/tools.ts (NEW, ~40 LoC) + +New route file. Register in `apps/server/src/index.ts` next to the other `register*Routes(app, sql, ...)` calls. 
+ +```ts +import type { FastifyInstance } from 'fastify'; +import type { Sql } from '../db.js'; + +export interface ToolCostStat { + tool_name: string; + mean_prompt_tokens: number; + mean_completion_tokens: number; + n_calls: number; + updated_at: string; +} + +export function registerToolsRoutes(app: FastifyInstance, sql: Sql) { + app.get('/api/tools/cost_stats', async () => { + const rows = await sql<{ + tool_name: string; + prompt_tokens_sum: number; + completion_tokens_sum: number; + n_calls: number; + updated_at: string; + }[]>` + SELECT tool_name, prompt_tokens_sum, completion_tokens_sum, n_calls, updated_at + FROM tool_cost_stats + ORDER BY tool_name ASC + `; + const stats: ToolCostStat[] = rows.map(r => ({ + tool_name: r.tool_name, + mean_prompt_tokens: Math.round(r.prompt_tokens_sum / r.n_calls), + mean_completion_tokens: Math.round(r.completion_tokens_sum / r.n_calls), + n_calls: r.n_calls, + updated_at: r.updated_at, + })); + return { stats }; + }); +} +``` + +Route is bodyless, idempotent, cheap. No pagination (≤30 tools). + +### 3. apps/server/src/services/__tests__/tool_cost_stats.test.ts (NEW, ~95 LoC) + +Integration test against real Postgres (matches `inference.test.ts` pattern). Fixtures: + +```ts +import { describe, it, expect, beforeEach } from 'vitest'; +import { connect } from '../../db.js'; + +describe('tool_cost_stats view (v1.13.10)', () => { + // ... session + chat + project setup helpers ... 
+ + it('returns empty when no tool calls exist', async () => { + // fresh chat, only user/assistant text turns + const stats = await sql`SELECT * FROM tool_cost_stats`; + expect(stats).toEqual([]); + }); + + it('attributes single-tool turn fully to that tool', async () => { + // insert one assistant message with tool_calls=[{name: 'view_file', ...}], + // tokens_used=300, ctx_used=15000, status='complete' + const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name='view_file'`; + expect(stats[0]).toMatchObject({ + tool_name: 'view_file', + prompt_tokens_sum: 15000, + completion_tokens_sum: 300, + n_calls: 1, + }); + }); + + it('splits multi-tool turn equally across tools', async () => { + // insert one assistant turn with 3 tool calls (view_file, grep, list_dir), + // tokens_used=300, ctx_used=15000 → each tool gets 100 completion, 5000 prompt + const stats = await sql`SELECT * FROM tool_cost_stats ORDER BY tool_name`; + expect(stats).toHaveLength(3); + for (const s of stats) { + expect(s.completion_tokens_sum).toBe(100); + expect(s.prompt_tokens_sum).toBe(5000); + expect(s.n_calls).toBe(1); + } + }); + + it('limits to last 100 calls per tool (FIFO window)', async () => { + // insert 150 turns each calling view_file once with monotonically + // increasing tokens_used; expect only the most recent 100 to count + const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name='view_file'`; + expect(stats[0]!.n_calls).toBe(100); + // mean should reflect the most recent 100 turns (51..150), not 1..150 + }); + + it('excludes turns with NULL tokens_used (pre-v1.13.7 latent regression)', async () => { + // insert a turn with tool_calls but tokens_used=NULL → must not appear + const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name='view_file'`; + expect(stats).toEqual([]); + }); + + it('excludes failed and cancelled turns + sentinel metadata rows', async () => { + // insert five rows for tool_name='view_file', all with tokens_used+ctx_used + // 
populated: + // row A: status='failed' — excluded + // row B: status='cancelled' — excluded + // row C: status='complete', metadata={kind:'cap_hit'} — excluded + // row D: status='complete', metadata={kind:'doom_loop'} — excluded + // row E: status='complete', metadata=null — included + // Expect n_calls=1, attributable to row E only. + const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name='view_file'`; + expect(stats[0]!.n_calls).toBe(1); + }); + + it('reads tool_calls via messages_with_parts (parts-authoritative)', async () => { + // insert a v1.13.0+ row with messages.tool_calls=NULL but + // message_parts rows containing the tool_call → must still aggregate + const stats = await sql`SELECT * FROM tool_cost_stats WHERE tool_name='grep'`; + expect(stats[0]!.n_calls).toBe(1); + }); +}); +``` + +Pattern: each test resets the messages table for the fixture chat (TRUNCATE not DELETE — Postgres `messages` has FK CASCADE) and inserts hand-crafted rows. The view is recomputed on every SELECT. + +### 4. apps/web/src/api/types.ts + client.ts (~10 LoC) + +Add to `types.ts`: + +```ts +export interface ToolCostStat { + tool_name: string; + mean_prompt_tokens: number; + mean_completion_tokens: number; + n_calls: number; + updated_at: string; +} +``` + +Add to `client.ts` under the existing `api.*` namespace structure: + +```ts +tools: { + costStats: () => fetch<{ stats: ToolCostStat[] }>('GET', '/api/tools/cost_stats'), +}, +``` + +Match the casing convention of the existing namespaces (`api.agents.list`, `api.chats.archive`, etc.). + +### 5. apps/web/src/components/AgentPicker.tsx — tooltip extension (~80 LoC delta) + +Currently (line 67): `title={selectedAgent?.description}` — native HTML title attribute on the trigger button. + +Replacement: dropdown items get a per-agent cost line in muted text below the description. 
Format: + +``` +[Agent name] +[Agent description] +~5.2k prompt / 280 completion · 6/8 tools · last call 3h ago +``` + +Implementation steps: +1. Fetch `api.tools.costStats()` once on mount (alongside the existing `api.agents.list()`). Cache result for the lifetime of the picker open state. Re-fetch only on `useEffect` dep change. +2. Compute per-agent aggregate: for each agent, sum the means of its whitelisted tools. Sum-of-means, not mean-of-sums — we're combining independent rolling averages. +3. Render below description (one line, muted, truncated). Omit the line when none of the agent's tools has recorded calls. +4. Keep the existing native `title=` for backward compat; the cost line is layered on additively. + +```tsx +const [costStats, setCostStats] = useState<ToolCostStat[]>([]); +useEffect(() => { + api.tools.costStats().then(r => setCostStats(r.stats)).catch(() => setCostStats([])); +}, []); +const costByTool = useMemo( + () => Object.fromEntries(costStats.map(s => [s.tool_name, s])), + [costStats], +); +function agentCost(agent: Agent): { prompt: number; completion: number; nTools: number; nWithData: number; mostRecent: string | null } { + let prompt = 0, completion = 0, nWithData = 0; + let mostRecent: string | null = null; + for (const t of agent.tools) { + const s = costByTool[t]; + if (!s) continue; + prompt += s.mean_prompt_tokens; + completion += s.mean_completion_tokens; + nWithData++; + if (!mostRecent || s.updated_at > mostRecent) mostRecent = s.updated_at; + } + return { prompt, completion, nTools: agent.tools.length, nWithData, mostRecent }; +} +``` + +For the line render: `~${formatK(prompt)} prompt / ${completion} completion · ${nWithData}/${nTools} tools · last call ${formatAgo(mostRecent)}`. Skip the line entirely when `nWithData === 0` so a fresh-from-deploy state doesn't show "0k / 0 / 0 tools". + +**`formatK` / `formatAgo`:** colocate at the bottom of `AgentPicker.tsx`. Don't extract to a util file in this batch — single use site. 
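The brief names `formatK` and `formatAgo` but does not define them. A minimal sketch of what they might look like follows. The rounding rules, the unit thresholds, the `now` parameter, and the `costLine` wrapper are all assumptions made for illustration; the sketch also assumes `updated_at` arrives as an ISO-8601 string and that the "last call" prefix is supplied by the render template rather than by `formatAgo`.

```ts
// Hypothetical helpers; thresholds and rounding are assumptions, not the shipped code.

/** 280 -> "280", 5200 -> "5.2k", 5000 -> "5k". */
function formatK(n: number): string {
  const rounded = Math.round(n);
  if (rounded < 1000) return String(rounded);
  // One decimal place, with a trailing ".0" dropped.
  return `${(rounded / 1000).toFixed(1).replace(/\.0$/, '')}k`;
}

/** Relative age of an ISO timestamp. `now` is injectable so the output is testable. */
function formatAgo(iso: string | null, now: number = Date.now()): string {
  if (!iso) return 'never';
  const then = Date.parse(iso);
  if (Number.isNaN(then)) return 'unknown';
  const minutes = Math.floor(Math.max(0, now - then) / 60_000);
  if (minutes < 1) return 'just now';
  if (minutes < 60) return `${minutes}m ago`;
  const hours = Math.floor(minutes / 60);
  if (hours < 24) return `${hours}h ago`;
  return `${Math.floor(hours / 24)}d ago`;
}

interface AgentCost {
  prompt: number;
  completion: number;
  nTools: number;
  nWithData: number;
  mostRecent: string | null;
}

/** Returns null when no tool has history, so the caller can skip the line. */
function costLine(c: AgentCost, now: number = Date.now()): string | null {
  if (c.nWithData === 0) return null;
  return (
    `~${formatK(c.prompt)} prompt / ${c.completion} completion` +
    ` · ${c.nWithData}/${c.nTools} tools · last call ${formatAgo(c.mostRecent, now)}`
  );
}
```

Passing `now` explicitly keeps the helpers pure; the component would call them with the default. Returning `null` from `costLine` maps directly onto the rule of skipping the line when no tool has history.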
+ +## What NOT to do + +- **Don't add a new write site at `tool-phase.ts` or `finalizeCompletion`.** All source data is already there via existing UPDATEs. +- **Don't denormalize.** The view is sufficient and rollback-safe at BooCode's single-user scale. +- **Don't add per-tool cost to the message bubble.** Out of scope. AgentPicker tooltip only. +- **Don't fold per-call rows into a moving sum via triggers.** Aggregate on read; 100 rows × 30 tools is microseconds in Postgres. +- **Don't track `result_chars` (the size of `tool_results.output`).** Tempting as a second cost signal but out of scope here. Future batch if Sam wants it. +- **Don't add a session-scoped or chat-scoped filter to `tool_cost_stats`.** The rolling window is GLOBAL across all chats — the agent picker is a project-level decision aid. Per-chat surfacing is a future v1.14+ design. +- **Don't change the attribution model post-deployment** without dropping the view first. Mid-flight semantic changes give bogus historical means. +- **Don't "fix" the `ctx_used`/`tokens_used` naming inside this batch.** Non-obvious but pinned across 5 write sites. Renaming is its own batch. +- **Don't rely solely on `tool_calls IS NOT NULL` for sentinel exclusion.** It works today (sentinels are role='system' with tool_calls=NULL) but the explicit `status='complete'` + `metadata->>'kind'` filters are defense in depth and survive future schema drift. + +## Backup before edits + +``` +cd /opt/boocode +cp apps/server/src/schema.sql{,.bak-$(date +%Y%m%d-%H%M%S)} +cp apps/web/src/components/AgentPicker.tsx{,.bak-$(date +%Y%m%d-%H%M%S)} +``` + +(No backup needed for new files in items 2, 3, 4.) + +## Verify + +``` +pnpm -C apps/server test +``` + +Expected: all existing tests pass + 7 new in `tool_cost_stats.test.ts`. Total moves from 195 → 202. 
+ +``` +cd /opt/boocode +docker compose exec boocode_db psql -U postgres -d boocode -c \ + "SELECT * FROM tool_cost_stats ORDER BY n_calls DESC LIMIT 10;" +``` + +Expected: in any live deployment with v1.13.7+ history, this returns real rows for `view_file`, `grep`, `list_dir`, etc. If empty: `messages.tokens_used` / `ctx_used` were NULL for the v1.13.1-A → v1.13.7 latent regression window, so those turns are filtered out and recovery only begins with v1.13.7+ traffic. + +## Build + smoke + +``` +cd /opt/boocode +docker compose up --build -d boocode +docker compose logs --since=30s boocode | tail -20 +``` + +Smoke A — view recompiles on schema apply: +``` +docker compose logs boocode | grep -i "tool_cost_stats\|applySchema" +``` +Expected: clean schema apply, view registered idempotently. + +Smoke B — endpoint returns data: +``` +curl -s http://localhost:3000/api/tools/cost_stats | jq '.stats | length, .stats[0]' +``` +Expected: nonzero length if any v1.13.7+ tool calls exist; one stat object with all 5 fields populated. + +Smoke C — UI: +1. Open a browser to `boocode.indifferentketchup.com`. +2. Open the AgentPicker dropdown on any session. +3. Each agent row shows a muted cost line below its description: `~5.2k prompt / 280 completion · 6/8 tools · last call 2h ago`. +4. Agents with no tool history show just the description (no cost line). +5. Confirm the cost line truncates with the existing text-muted-foreground / truncate pattern and doesn't break the layout at mobile widths (open Vivaldi devtools, set iPhone-13 viewport). 
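Smoke B's "all 5 fields populated" expectation can also be checked mechanically instead of by eye. The sketch below is one possible runtime shape check, assuming the response follows the `ToolCostStat` interface from section 2 and the 100-call window from the view. `isToolCostStat` and `checkCostStatsPayload` are hypothetical names, not part of the codebase.

```ts
// Hypothetical runtime check for the GET /api/tools/cost_stats payload.
interface ToolCostStat {
  tool_name: string;
  mean_prompt_tokens: number;
  mean_completion_tokens: number;
  n_calls: number;
  updated_at: string;
}

function isToolCostStat(x: unknown): x is ToolCostStat {
  if (typeof x !== 'object' || x === null) return false;
  const { tool_name, mean_prompt_tokens, mean_completion_tokens, n_calls, updated_at } =
    x as Record<string, unknown>;
  return (
    typeof tool_name === 'string' && tool_name.length > 0 &&
    typeof mean_prompt_tokens === 'number' && Number.isFinite(mean_prompt_tokens) &&
    typeof mean_completion_tokens === 'number' && Number.isFinite(mean_completion_tokens) &&
    // The view keeps at most 100 calls per tool and only emits tools with data.
    typeof n_calls === 'number' && Number.isInteger(n_calls) && n_calls >= 1 && n_calls <= 100 &&
    typeof updated_at === 'string' && !Number.isNaN(Date.parse(updated_at))
  );
}

/** Returns a list of problems; an empty list means the payload looks sane. */
function checkCostStatsPayload(body: unknown): string[] {
  if (typeof body !== 'object' || body === null) return ['body is not an object'];
  const stats = (body as Record<string, unknown>).stats;
  if (!Array.isArray(stats)) return ['stats is not an array'];
  const problems: string[] = [];
  stats.forEach((s, i) => {
    if (!isToolCostStat(s)) problems.push(`stats[${i}] is malformed`);
  });
  return problems;
}

// Possible usage against a running instance (URL taken from Smoke B):
//   const body = await (await fetch('http://localhost:3000/api/tools/cost_stats')).json();
//   console.log(checkCostStatsPayload(body));
```

An empty `stats` array passes the check, which matches the expected state of a deployment with no v1.13.7+ tool traffic yet.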
+ +## Files expected to touch + +- `apps/server/src/schema.sql` — ~35 LoC delta (view definition + filter comments) +- `apps/server/src/routes/tools.ts` — NEW, ~40 LoC +- `apps/server/src/index.ts` — 1 line (`registerToolsRoutes(app, sql)`) +- `apps/server/src/services/__tests__/tool_cost_stats.test.ts` — NEW, ~95 LoC +- `apps/web/src/api/types.ts` — ~7 LoC (interface) +- `apps/web/src/api/client.ts` — ~3 LoC (namespace + method) +- `apps/web/src/components/AgentPicker.tsx` — ~80 LoC delta (cost line + fetch hook + helpers) + +Total ~260 LoC. Matches roadmap estimate. + +## Workflow conventions + +- Backups before destructive edits (above) on the two MODIFIED files. New files don't need backups. +- Sam reviews diffs. Never `git add` / `git commit` / `git push` / `git pull` on Sam's behalf. +- Build: `docker compose up --build -d boocode`. No `--no-cache` unless layer-cache trap surfaces. +- Tests authoritative: `pnpm -C apps/server test`. +- View definition lives in `schema.sql` (idempotent via `CREATE OR REPLACE VIEW`); no migration shim needed. + +## Don't repeat past mistakes + +- v1.13.7 stability bundle (`includeUsage:true`, trim guards, payload filter, `BUDGET_NO_AGENT=30`): all live. This batch depends on `includeUsage:true`. If unset, `tool_cost_stats` returns empty rows. +- v1.13.8 prefix instrumentation: untouched. +- v1.13.9 ratio-only `usable()`: untouched. +- v1.13.4 two-tier prune: untouched. +- v1.13.5 truncate.ts opaque-id pattern: untouched. +- v1.13.1-B `messages_with_parts` view: this view is the source. Don't reach past it to raw `messages`. +- v1.13.2 will DROP `messages.tool_calls`/`tool_results` columns. The `tool_cost_stats` view reads from `messages_with_parts` not `messages`, so it survives. Verify after v1.13.2 ships. 
+ +## Source files to read in project knowledge + +- `boocode_roadmap.md` (v1.13.10 row at line 114; schema row at line 474) +- `boocode_code_review.md` (cost-tracking design background) +- `CLAUDE.md` (project conventions; messages_with_parts invariant at L80; v1.13.7 includeUsage invariant) +```