v1.13.4: two-tier compaction prune — opencode pattern half-shipped in v1.11.0

- message_parts.hidden_at timestamptz column (NULL by default) with a partial index on (message_id) WHERE hidden_at IS NULL for the common visible-parts filter.
- messages_with_parts view changed from COALESCE(parts, legacy) to CASE WHEN EXISTS(any parts of kind) THEN visible-parts ELSE legacy. COALESCE would have leaked hidden parts back via the legacy fallback when every part was pruned (smoke caught it pre-commit). The CASE distinguishes "no parts at all → fall back to legacy column for pre-v1.13.0 history" from "all parts hidden → return null/empty so the row drops out of the model payload" exactly.
- prune.ts: scans tool_result parts newest-first, protects the last 40k tokens (PROTECTED_TOKENS), marks older candidates hidden when their combined estimate clears 20k (PRUNE_TRIGGER_TOKENS; equal to COMPACTION_BUFFER from v1.11.0, so a successful prune is exactly the budget the summary path would have freed). Stops at chats.tail_start_id so it doesn't double-erase across the last summary boundary. Pure decision helper selectPruneTargets exported separately for unit tests.
- Wired into maybeFlagForCompaction: prune runs synchronously when overflow is detected; if it freed >= PRUNE_TRIGGER_TOKENS, the needs_compaction flag is NOT set and the (expensive) summary inference call is skipped this turn. The next turn's overflow check re-evaluates from scratch.
- 6 new unit tests in prune.test.ts cover: empty input, protection-only (no candidates), candidates below trigger, candidates above trigger, candidates straddling a summary boundary, exactly-protection-tokens. 179 tests total (was 173).

Smoke verified post-rebuild:
- \d message_parts shows hidden_at + partial index.
- View definition shows AND p.hidden_at IS NULL filters on all three subselects.
- Synthetic hide-then-restore confirmed the view drops the tool_result jsonb to null when its only part is hidden, and restores when un-hidden.
- EXPLAIN ANALYZE on the 42-message stress chat: 0.325ms (faster than v1.13.1-B's 1.018ms; EXISTS short-circuits cleanly for the common no-parts case).
- Normal turn (plain text prompt) completes unaffected.

Closes a v1.11.0 design item that was scoped but never implemented. With v1.13's parts table the prune is dramatically cheaper to write: pre-parts it would have meant editing JSON blobs in-place; now it's a hidden_at flag and a view subselect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
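
The prune decision described in the commit message can be sketched as a pure function. This is a minimal sketch, not the contents of prune.ts: the `ToolResultPart` and `PruneDecision` shapes, the `selectPruneTargetsSketch` name, the rule that a part is protected while the running total including it still fits in the protected window, and the choice to examine the `tail_start_id` message last are all assumptions. Only the two constants and the newest-first scan come from the message above.

```typescript
// Hypothetical shapes: the real prune.ts types are not shown in this commit view.
interface ToolResultPart {
  partId: string;
  messageId: string;
  estimatedTokens: number;
}

interface PruneDecision {
  hideIds: string[];
  freedTokens: number;
}

const PROTECTED_TOKENS = 40_000;
const PRUNE_TRIGGER_TOKENS = 20_000;

function selectPruneTargetsSketch(
  newestFirst: ToolResultPart[],
  tailStartId: string | null,
): PruneDecision {
  let protectedSoFar = 0;
  let freedTokens = 0;
  const hideIds: string[] = [];

  for (const part of newestFirst) {
    // Assumption: a part stays protected while the running total, including
    // it, fits in the window. Once one part falls outside, all older parts
    // are candidates too.
    const fits = protectedSoFar + part.estimatedTokens <= PROTECTED_TOKENS;
    if (hideIds.length === 0 && fits) {
      protectedSoFar += part.estimatedTokens;
    } else {
      hideIds.push(part.partId);
      freedTokens += part.estimatedTokens;
    }
    // Assumption: the message at tail_start_id is the last one examined, so
    // the scan never crosses the previous summary boundary.
    if (tailStartId !== null && part.messageId === tailStartId) break;
  }

  // Below the trigger the prune is not worth it: hide nothing.
  if (freedTokens < PRUNE_TRIGGER_TOKENS) return { hideIds: [], freedTokens: 0 };
  return { hideIds, freedTokens };
}
```

Under these assumptions, a caller in the position of maybeFlagForCompaction would skip setting needs_compaction only when the returned `hideIds` is non-empty.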
v1.13.3: cleanup bundle — statement timeout + alpha ordering + stuck-row sweeper + repairToolCall
2026-05-22 07:02:17 +00:00 · 2026-05-22 06:46:03 +00:00 · 2026-05-22 06:34:10 +00:00 · 2026-05-22 06:22:47 +00:00 · 2026-05-22 06:17:56 +00:00 · 2026-05-22 05:46:29 +00:00
34 changed files with 2928 additions and 1826 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -46,7 +46,9 @@ Tests: `pnpm -C apps/server test` runs the vitest suite. No test harness on `app
 - **Zod** for request validation and config parsing.
 Key services:
- **`services/inference.ts`** — Streams LLM responses, executes tool loops (max depth 15, see `MAX_TOOL_LOOP_DEPTH`), flushes to DB every 500ms. Publishes `InferenceFrame` events through the broker. **`TurnArgs`** is the per-turn state envelope threaded through the `executeToolPhase → runAssistantTurn` recursion (`toolsUsed`, `recentToolCalls`, `assistantMessageId`, `signal`); reset to defaults in `runInference` at the user-message boundary. Cap-hit (`toolsUsed >= budget`) and doom-loop (`detectDoomLoop(recentToolCalls)`) checks both read from this envelope. Add new per-turn state here, not in module-level closures.
+- **`services/inference/`** (v1.12.4 split — was a single `inference.ts` file). Public surface re-exported via `inference/index.ts`; callers import from `./services/inference/index.js`. Layout: `turn.ts` (runAssistantTurn / runInference / createInferenceRunner orchestration, plus `InferenceFrame`, `InferenceContext`, `TurnArgs`, `StreamResult` exported), `stream-phase.ts` (streamCompletion + executeStreamPhase + SSE parsing), `tool-phase.ts` (executeToolPhase; back-edges into turn.ts for the runAssistantTurn recursion — cycle is safe because dereferenced at call time, not module top-level), `sentinel-summaries.ts` (runCapHitSummary + runDoomLoopSummary + their sentinel inserters; two near-clones kept side-by-side until a third sentinel justifies factoring out runWrapUpSummary), `error-handler.ts` (handleAbortOrError, finalizeCompletion), `payload.ts` (buildMessagesPayload, loadContext, maybeFlagForCompaction, `OpenAiMessage`), `sentinels.ts` (`detectDoomLoop`, `DOOM_LOOP_THRESHOLD`, sentinel predicates), `budget.ts` (resolveToolBudget), `xml-parser.ts` (Qwen-coder XML tool-call fallback), `types.ts` (`StreamPhaseState`, `DB_FLUSH_INTERVAL_MS` shared between stream-phase and sentinel-summaries). **`TurnArgs`** is the per-turn state envelope threaded through the `executeToolPhase → runAssistantTurn` recursion (`toolsUsed`, `recentToolCalls`, `assistantMessageId`, `signal`); reset to defaults in `runInference` at the user-message boundary. Cap-hit (`toolsUsed >= budget`) and doom-loop (`detectDoomLoop(recentToolCalls)`) checks both read from this envelope. Add new per-turn state to `TurnArgs` in `turn.ts`, not module-level closures.
 - **`chat_status` frame shape** (published via `broker.publishUser`) — `status: 'streaming' | 'tool_running' | 'waiting_for_input' | 'idle' | 'error'` (widened from `working|idle|error` in v1.12.1). Frontend `useChatStatus` derives `idle_warm` (<30s since idle) vs `idle_cold`. `ChatThroughput` renders inline beside `StatusDot` only when streaming or tool_running, fed by 500ms-throttled `'usage'` WS frames (`completion_tokens` + `ctx_used` + `ctx_max`). The `POST /api/chats/:id/discard_stale` endpoint exists to mark a stuck-streaming row as `failed` when the frontend's 60s no-token-activity timer (`ChatPane` content-length watcher) gives up.
 - **Boot-time stale-streaming sweep** in `apps/server/src/index.ts` after `applySchema()`: any `messages.status='streaming'` older than 5 minutes flips to `'failed'`. Logs only on non-zero count. Recovers from container restart while inference was mid-stream (v1.12.1).
 - **`services/broker.ts`** — In-memory pub/sub with two channel types: per-session (message streaming) and per-user (sidebar updates). No persistence; clients reconnect on restart.
 - **`services/tools.ts`** — Tool registry (`ALL_TOOLS`, `READ_ONLY_TOOL_NAMES`, `TOOLS_BY_NAME`). Filesystem tools (view_file/list_dir/grep/find_files) go through three guard layers: `path_guard.ts` (workspace scope), `secret_guard.ts` (filename deny list), `url_guard.ts` (SSRF/private-IP block for web_fetch). v1.11.8+ web tools (`web_search`, `web_fetch`) are opt-in per chat via `session.web_search_enabled` (resolved with `project.default_web_search_enabled` fallback) and filtered out of the LLM's tool schema when false.
 - **`services/compaction.ts`** + **`services/model-context.ts`** — v1.11.0 anchored rolling summary (single `summary=true` assistant row per chat, supersedes itself on each compaction). Triggered when `chats.needs_compaction` is set after an inference turn exceeds `usable(ctx_max) = ctx_max - 20k`. **`ctx_max` comes from `model-context.getModelContext()` which fetches `${LLAMA_SWAP_URL}/upstream/<model>/props`** — NOT from `parsed.timings.n_ctx` (the stream completion's `timings` doesn't carry n_ctx; that read was dead code until v1.11.3 ripped it out).
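
The overflow threshold in the compaction bullet reduces to simple arithmetic. A minimal sketch follows; `usableContext` and `exceedsUsable` are hypothetical names, `COMPACTION_BUFFER` is the 20k constant named in the v1.13.4 commit message, and clamping at zero for very small context windows is an assumption.

```typescript
const COMPACTION_BUFFER = 20_000;

// usable(ctx_max) = ctx_max - 20k; the clamp at 0 is an assumption.
function usableContext(ctxMax: number): number {
  return Math.max(0, ctxMax - COMPACTION_BUFFER);
}

// A turn overflows when the context it used exceeds the usable window.
function exceedsUsable(ctxUsed: number, ctxMax: number): boolean {
  return ctxUsed > usableContext(ctxMax);
}
```

For a model with a 32,000-token window, the usable budget is 12,000 tokens, so a turn that reports `ctx_used` of 12,001 would trip the check while 12,000 would not.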
@@ -87,15 +89,14 @@ Font / CSS pipeline (apps/web):
 ### Multi-pane workspace
-Sessions hold 1–5 panes (chat / empty / placeholder terminal+agent). Workspace pane state is **client-side only** (localStorage key `boocode.workspace.panes.<sessionId>`); the legacy `session_panes` table and its REST endpoints are deprecated — no `/api/panes/*` routes exist. Each chat lives in at most one pane; tab strip is per-pane and tracks `chatIds[]` + `activeChatIdx`. Sessions 1:N chats; chats own messages. Tab reorder via native HTML5 drag events.
+Sessions hold 1–5 panes (chat / empty / placeholder terminal+agent). v1.12.1 moved pane state from per-device localStorage to `sessions.workspace_panes jsonb` for cross-device sync. `PATCH /api/sessions/:id/workspace` persists; `session_workspace_updated` user-channel frame broadcasts to every device watching the session. `useWorkspacePanes` debounces saves 300ms and dedups echoes by JSON string. Legacy localStorage key `boocode.workspace.panes.<sessionId>` is read once on first hydrate (one-time seed-and-delete migration when server is empty but localStorage has data); no longer written. The deprecated `session_panes` table was dropped. `validatePanes(validChatIds)` prunes panes referencing chat IDs that no longer exist (called by `useSessionChats` after the chat list fetch lands). Each chat lives in at most one pane; tab strip is per-pane and tracks `chatIds[]` + `activeChatIdx`. Tab reorder via native HTML5 drag events.
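
A rough illustration of what `validatePanes(validChatIds)` might do. The `PaneSketch` type is a trimmed-down stand-in for the real `Pane` union, and the choices to turn a chat pane with no surviving tabs into an empty pane and to clamp `activeChatIdx` are assumptions, since the text only says that stale chat IDs are pruned.

```typescript
// Hypothetical, reduced pane union; the real Pane type has more kinds.
type PaneSketch =
  | { kind: 'chat'; chatIds: string[]; activeChatIdx: number }
  | { kind: 'empty' };

function validatePanesSketch(
  panes: PaneSketch[],
  validChatIds: Set<string>,
): PaneSketch[] {
  return panes.map((pane): PaneSketch => {
    if (pane.kind !== 'chat') return pane;
    // Drop tabs whose chat no longer exists.
    const chatIds = pane.chatIds.filter((id) => validChatIds.has(id));
    // Assumption: a chat pane with no remaining tabs becomes an empty pane.
    if (chatIds.length === 0) return { kind: 'empty' };
    // Assumption: keep the active index in range after removals.
    const activeChatIdx = Math.min(pane.activeChatIdx, chatIds.length - 1);
    return { kind: 'chat', chatIds, activeChatIdx };
  });
}
```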
 ## Database
-PostgreSQL 16. Tables: `projects`, `sessions`, `chats`, `messages`, `settings`, `session_panes` (deprecated). Schema applied idempotently on startup via `applySchema()`. Use `clock_timestamp()` (not `NOW()`) inside transactions. CHECK constraints in place: `projects_status_chk` ('open'|'archived'), `sessions_status_chk` (same), `chats_status_chk` (same), `messages_role_chk`, `messages_status_chk` — keep in sync with the `*_STATUSES` const arrays in `apps/server/src/types/api.ts`.
+PostgreSQL 16. Tables: `projects`, `sessions`, `chats`, `messages`, `settings`. (`session_panes` was dropped in v1.12.1; workspace pane state lives in `sessions.workspace_panes jsonb`.) Schema applied idempotently on startup via `applySchema()`. Use `clock_timestamp()` (not `NOW()`) inside transactions. CHECK constraints in place: `projects_status_chk` ('open'|'archived'), `sessions_status_chk` (same), `chats_status_chk` (same), `messages_role_chk`, `messages_status_chk` — keep in sync with the `*_STATUSES` const arrays in `apps/server/src/types/api.ts`. The older anonymous `messages_status_check` (without 'cancelled') and `messages_role_check` (without 'system') were dropped in v1.12.1; only the `_chk` variants remain.
 Schema CHECK migration order when renaming allowed values: (1) `ALTER TABLE ... DROP CONSTRAINT IF EXISTS <system_name>` (inline `CREATE TABLE` checks get `<table>_<column>_check`), (2) `UPDATE` rows to new values, (3) wrap new constraint ADD in `DO $$ ... pg_constraint` guard — that block is the only way to get `ADD CONSTRAINT IF NOT EXISTS`.
 Position-shift pattern for panes (legacy `session_panes` table): negate-and-restore to avoid UNIQUE(session_id, position) collisions during reorder/insert/delete. Sentinel value -100 for the moving pane.
 ## Environment
@@ -125,6 +126,7 @@ Required: `DATABASE_URL`, `LLAMA_SWAP_URL`. Optional: `PORT` (3000), `HOST` (0.0
 - TypeScript strict mode. Both apps share `tsconfig.base.json`.
 - Server uses NodeNext module resolution (`.js` extensions in imports).
 - Discriminated unions for type narrowing: `Pane` (by `kind`), `SessionEvent` (by `type`), `InferenceFrame` (by `type`).
 - **Adding a new WS frame type** requires updating BOTH the server's `InferenceFrame` (loose `type:` union + optional fields in `services/inference/turn.ts`) AND the web `WsFrame` (strict discriminated union in `apps/web/src/api/types.ts`). Server publish is permissive; the frontend type is the wire-format gate. The `'usage'` frame added in v1.12.2 needed both sides; missing the web side silently drops the frame at JSON-parse.
 - shadcn primitives live in `components/ui/`. Don't modify them unless adding a new primitive.
 - `inferLanguage()` from `lib/attachments.ts` is the canonical file-extension-to-language map. `CodeBlock.tsx` keeps its own `LANG_MAP` because it also resolves markdown fence names.
 - Two UI event buses: `hooks/sessionEvents.ts` for DB-state events (chat_created, session_updated); `lib/events.ts` for ephemeral UI (`sendToTerminal`, `terminalsRegistry`). Don't merge — different subscriber lifecycles.
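
To illustrate why a frame type missing from the web-side union never reaches the UI, here is a sketch of a strict parse gate. `WsFrameSketch` and `parseFrameSketch` are hypothetical and cover only two frame types; the real `WsFrame` in `apps/web/src/api/types.ts` has more members and its parsing code is not shown here.

```typescript
// Hypothetical, reduced frame union for illustration only.
type WsFrameSketch =
  | { type: 'chat_status'; chat_id: string; status: string }
  | { type: 'usage'; completion_tokens: number; ctx_used: number; ctx_max: number };

function parseFrameSketch(raw: string): WsFrameSketch | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // malformed JSON is dropped
  }
  if (typeof data !== 'object' || data === null) return null;
  const frame = data as Record<string, unknown>;
  const type = frame['type'];
  if (type === 'chat_status') {
    const chatId = frame['chat_id'];
    const status = frame['status'];
    if (typeof chatId !== 'string' || typeof status !== 'string') return null;
    return { type, chat_id: chatId, status };
  }
  if (type === 'usage') {
    const completion = frame['completion_tokens'];
    const used = frame['ctx_used'];
    const max = frame['ctx_max'];
    if (typeof completion !== 'number' || typeof used !== 'number' || typeof max !== 'number') {
      return null;
    }
    return { type, completion_tokens: completion, ctx_used: used, ctx_max: max };
  }
  // Any frame type the union does not list is silently dropped.
  return null;
}
```

With a gate of this shape, a server-side frame type that was never added to the union falls through to the final `return null`, which matches the silent drop described for the `'usage'` frame.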
--- a/apps/server/package.json
+++ b/apps/server/package.json
@@ -11,8 +11,10 @@
    "test": "vitest run"
  },
  "dependencies": {
    "@ai-sdk/openai-compatible": "^2.0.47",
    "@fastify/static": "^7.0.4",
    "@fastify/websocket": "^10.0.1",
    "ai": "^6.0.190",
    "fastify": "^4.28.1",
    "postgres": "^3.4.4",
    "ws": "^8.18.0",
--- a/apps/server/src/index.ts
+++ b/apps/server/src/index.ts
@@ -16,7 +16,7 @@ import { registerWebSocket } from './routes/ws.js';
 import { registerModelRoutes } from './routes/models.js';
 import { registerAgentRoutes } from './routes/agents.js';
 import { registerSkillsRoutes } from './routes/skills.js';
-import { createInferenceRunner } from './services/inference.js';
+import { createInferenceRunner } from './services/inference/index.js';
 import { createBroker } from './services/broker.js';
 import { listSkills } from './services/skills.js';
 import * as compaction from './services/compaction.js';
@@ -201,6 +201,46 @@ async function main() {
    app.log.info(`serving static frontend from ${webDist}`);
  }
  // v1.13.3: periodic in-process sweeper for streaming rows orphaned by a
  // mid-session crash. The boot sweep (above) only fires once at startup;
  // this loop catches the in-flight case. 60s cadence + 5-min threshold
  // matches the boot sweep so behavior is consistent. Publishes
  // chat_status='idle' on the user channel so the UI dot drops without a
  // refresh — same pattern as handleAbortOrError.
  const SWEEP_INTERVAL_MS = 60_000;
  const sweepStaleStreaming = async (): Promise<void> => {
    try {
      const rows = await sql<{ id: string; chat_id: string }[]>`
        UPDATE messages
        SET status = 'failed', finished_at = clock_timestamp()
        WHERE status = 'streaming'
          AND created_at < NOW() - INTERVAL '5 minutes'
        RETURNING id, chat_id
      `;
      if (rows.length === 0) return;
      app.log.warn(
        { swept: rows.length, ids: rows.map((r) => r.id) },
        'swept stale streaming rows',
      );
      const seenChats = new Set<string>();
      const now = new Date().toISOString();
      for (const row of rows) {
        if (seenChats.has(row.chat_id)) continue;
        seenChats.add(row.chat_id);
        broker.publishUser('default', {
          type: 'chat_status',
          chat_id: row.chat_id,
          status: 'idle',
          at: now,
        });
      }
    } catch (err) {
      app.log.error({ err }, 'stuck-row sweeper failed');
    }
  };
  const sweepTimer = setInterval(() => { void sweepStaleStreaming(); }, SWEEP_INTERVAL_MS);
  app.addHook('onClose', async () => { clearInterval(sweepTimer); });
  const shutdown = async (signal: string) => {
    app.log.info(`received ${signal}, shutting down`);
    try {
--- a/apps/server/src/routes/chats.ts
+++ b/apps/server/src/routes/chats.ts
@@ -313,6 +313,28 @@ export function registerChatRoutes(
            AND created_at <= ${target.created_at}::timestamptz
            AND status = 'complete'
        `;
        // v1.13.0: clone message_parts for the forked messages. Source and
        // destination preserve ordering (the INSERT above orders by created_at,
        // id) so a ROW_NUMBER pairing maps source.id → dest.id deterministically.
        await tx`
          WITH src AS (
            SELECT id, ROW_NUMBER() OVER (ORDER BY created_at ASC, id ASC) AS rn
            FROM messages
            WHERE chat_id = ${source.id}
              AND created_at <= ${target.created_at}::timestamptz
              AND status = 'complete'
          ),
          dst AS (
            SELECT id, ROW_NUMBER() OVER (ORDER BY created_at ASC, id ASC) AS rn
            FROM messages
            WHERE chat_id = ${chat!.id}
          )
          INSERT INTO message_parts (message_id, sequence, kind, payload)
          SELECT dst.id, p.sequence, p.kind, p.payload
          FROM message_parts p
          JOIN src ON p.message_id = src.id
          JOIN dst ON dst.rn = src.rn
        `;
        return chat!;
      });
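
The ROW_NUMBER pairing in the fork query can be modelled in memory as a sort followed by an index-wise zip. This is a sketch only: `MessageRowSketch`, `rankOrder`, and `pairByRank` are hypothetical names, timestamps are compared at millisecond precision via `Date.parse` (PostgreSQL keeps microseconds), and IDs are compared as plain strings.

```typescript
interface MessageRowSketch {
  id: string;
  created_at: string; // ISO 8601 timestamp
}

// Mirrors ORDER BY created_at ASC, id ASC.
function rankOrder(rows: MessageRowSketch[]): MessageRowSketch[] {
  return [...rows].sort((a, b) => {
    const dt = Date.parse(a.created_at) - Date.parse(b.created_at);
    if (dt !== 0) return dt;
    return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
  });
}

// Pairs the n-th source row with the n-th destination row, like the
// JOIN on dst.rn = src.rn.
function pairByRank(
  src: MessageRowSketch[],
  dst: MessageRowSketch[],
): Map<string, string> {
  const s = rankOrder(src);
  const d = rankOrder(dst);
  const pairs = new Map<string, string>();
  const n = Math.min(s.length, d.length);
  for (let i = 0; i < n; i++) {
    const from = s[i];
    const to = d[i];
    if (from && to) pairs.set(from.id, to.id);
  }
  return pairs;
}
```

The mapping is only meaningful when the destination chat holds exactly the copied rows in the same relative order as the source; the code comment in the hunk above states that the preceding INSERT guarantees this.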
@@ -401,11 +423,12 @@ export function registerChatRoutes(
        reply.code(404);
        return { error: 'chat not found' };
      }
      // v1.13.1-B: reads tool_calls/tool_results via the parts-merged view.
      const rows = await sql<Message[]>`
        SELECT id, session_id, chat_id, role, content, kind, tool_calls, tool_results, status, last_seq,
               tokens_used, ctx_used, ctx_max, started_at, finished_at, created_at, metadata,
               summary, tail_start_id, compacted_at
-        FROM messages
+        FROM messages_with_parts
        WHERE chat_id = ${req.params.id}
        ORDER BY created_at ASC, id ASC
      `;
--- a/apps/server/src/routes/messages.ts
+++ b/apps/server/src/routes/messages.ts
@@ -91,11 +91,12 @@ export function registerMessageRoutes(
      // SummaryCard) and shows compacted_at-stamped rows inline for context.
      // Internal inference assembly filters compacted_at IS NULL separately —
      // see services/inference.ts loadContext + services/compaction.ts.
      // v1.13.1-B: reads tool_calls/tool_results via the parts-merged view.
      const rows = await sql<Message[]>`
        SELECT id, session_id, chat_id, role, content, kind, tool_calls, tool_results, status, last_seq,
               tokens_used, ctx_used, ctx_max, started_at, finished_at, created_at, metadata,
               summary, tail_start_id, compacted_at
-        FROM messages
+        FROM messages_with_parts
        WHERE session_id = ${req.params.id}
        ORDER BY created_at ASC, id ASC
      `;
@@ -469,30 +470,36 @@ export function registerMessageRoutes(
      const chat = chatRows[0]!;
      const sessionId = chat.session_id;
-      // Find the assistant message that emitted this tool_call. Scoped by
-      // chat_id + role to avoid cross-chat lookups; ordered by created_at DESC
-      // because the most recent issuance wins when an LLM reuses call IDs
-      // across turns (the older, already-answered one is a different row with
-      // populated tool_results downstream).
-      const callerRows = await sql<{ id: string; tool_calls: ToolCall[] | null }[]>`
-        SELECT id, tool_calls FROM messages
-        WHERE chat_id = ${chat.id}
-          AND role = 'assistant'
-          AND tool_calls IS NOT NULL
-        ORDER BY created_at DESC
+      // v1.13.1-C: find the assistant's tool_call by indexing message_parts
+      // directly on payload->>'id'. Scoped by chat_id + role via the JOIN.
+      // Pre-v1.13.0 history has no parts rows — those tool_calls become
+      // unreachable here (404). Acceptable per the dispatch decision: any
+      // pending elicitation from before v1.13.0 is long timed out by now;
+      // promote to a hotfix with a JSON-column fallback if it ever surfaces.
+      const callerRows = await sql<{
+        message_id: string;
+        payload: { id: string; name: string; args: Record<string, unknown> };
+      }[]>`
+        SELECT p.message_id, p.payload
+        FROM message_parts p
+        JOIN messages m ON m.id = p.message_id
+        WHERE m.chat_id = ${chat.id}
+          AND m.role = 'assistant'
+          AND p.kind = 'tool_call'
+          AND p.payload->>'id' = ${tool_call_id}
+        ORDER BY m.created_at DESC
+        LIMIT 1
       `;
-      let foundCall: ToolCall | null = null;
-      for (const row of callerRows) {
-        const match = row.tool_calls?.find((tc) => tc.id === tool_call_id);
-        if (match) {
-          foundCall = match;
-          break;
-        }
-      }
-      if (!foundCall) {
+      const callerRow = callerRows[0];
+      if (!callerRow) {
         reply.code(404);
         return { error: 'unknown_tool_call_id' };
       }
+      const foundCall: ToolCall = {
+        id: callerRow.payload.id,
+        name: callerRow.payload.name,
+        args: callerRow.payload.args,
+      };
      if (foundCall.name !== 'ask_user_input') {
        reply.code(400);
        return { error: 'tool_call_not_ask_user_input' };
@@ -539,18 +546,21 @@ export function registerMessageRoutes(
        }
      }
-      // Find the pending tool row. ORDER BY created_at DESC + LIMIT 1 picks
-      // the most recent row with this tool_call_id; the already-answered
-      // check below guards against UPDATE-ing a stale answer.
+      // v1.13.1-C: find the pending tool row via message_parts on
+      // payload->>'tool_call_id'. Same fallback caveat as the caller lookup
+      // above — pre-v1.13.0 rows are unreachable here.
       const toolRows = await sql<{
-        id: string;
-        tool_results: { tool_call_id: string; output: unknown } | null;
+        message_id: string;
+        payload: { tool_call_id: string; output: unknown };
       }[]>`
-        SELECT id, tool_results FROM messages
-        WHERE chat_id = ${chat.id}
-          AND role = 'tool'
-          AND tool_results->>'tool_call_id' = ${tool_call_id}
-        ORDER BY created_at DESC
+        SELECT p.message_id, p.payload
+        FROM message_parts p
+        JOIN messages m ON m.id = p.message_id
+        WHERE m.chat_id = ${chat.id}
+          AND m.role = 'tool'
+          AND p.kind = 'tool_result'
+          AND p.payload->>'tool_call_id' = ${tool_call_id}
+        ORDER BY m.created_at DESC
         LIMIT 1
       `;
      const toolRow = toolRows[0];
@@ -558,7 +568,7 @@ export function registerMessageRoutes(
        reply.code(404);
        return { error: 'unknown_tool_call_id', detail: 'tool message not found' };
      }
-      if (toolRow.tool_results && toolRow.tool_results.output !== null) {
+      if (toolRow.payload && toolRow.payload.output !== null) {
        reply.code(409);
        return { error: 'tool_call_already_answered' };
      }
@@ -570,11 +580,21 @@ export function registerMessageRoutes(
        truncated: false,
      };
      const toolMessageId = toolRow.message_id;
      const result = await sql.begin(async (tx) => {
        await tx`
          UPDATE messages
          SET tool_results = ${tx.json(newToolResults as never)}
-          WHERE id = ${toolRow.id}
+          WHERE id = ${toolMessageId}
        `;
        // v1.13.0: replace the pending tool_result part inserted at message
        // creation (tool-phase.ts) with the answered one. Delete-then-insert
        // is simpler than UPDATE because parts are append-style elsewhere;
        // the UNIQUE (message_id, sequence) constraint blocks plain insert.
        await tx`DELETE FROM message_parts WHERE message_id = ${toolMessageId} AND kind = 'tool_result'`;
        await tx`
          INSERT INTO message_parts (message_id, sequence, kind, payload)
          VALUES (${toolMessageId}, 0, 'tool_result', ${tx.json(newToolResults as never)})
        `;
        const [assistantMsg] = await tx<{ id: string }[]>`
          INSERT INTO messages (session_id, chat_id, role, content, status, created_at)
@@ -584,7 +604,7 @@ export function registerMessageRoutes(
        await tx`UPDATE sessions SET updated_at = clock_timestamp() WHERE id = ${sessionId}`;
        await tx`UPDATE chats SET updated_at = clock_timestamp() WHERE id = ${chat.id}`;
        return {
-          tool_message_id: toolRow.id,
+          tool_message_id: toolMessageId,
          assistant_message_id: assistantMsg!.id,
        };
      });
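
The status-code ladder that the answer route walks before writing can be summarized as a pure check. This is a sketch: `checkAnswerable` and its types are hypothetical, and the validation that sits between the two lookups in the real handler (not visible in these hunks) is omitted. The error strings and codes are the ones that appear in the diff.

```typescript
interface CallPartSketch {
  id: string;
  name: string;
}

interface ResultPartSketch {
  tool_call_id: string;
  output: unknown;
}

type AnswerCheck =
  | { ok: true }
  | { ok: false; status: 400 | 404 | 409; error: string };

function checkAnswerable(
  call: CallPartSketch | undefined,
  result: ResultPartSketch | undefined,
): AnswerCheck {
  // No tool_call part with this id (includes pre-v1.13.0 rows without parts).
  if (!call) return { ok: false, status: 404, error: 'unknown_tool_call_id' };
  // Only ask_user_input calls can be answered through this route.
  if (call.name !== 'ask_user_input') {
    return { ok: false, status: 400, error: 'tool_call_not_ask_user_input' };
  }
  // The pending tool message must exist.
  if (!result) return { ok: false, status: 404, error: 'unknown_tool_call_id' };
  // A non-null output means the question was already answered.
  if (result.output !== null) {
    return { ok: false, status: 409, error: 'tool_call_already_answered' };
  }
  return { ok: true };
}
```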
--- a/apps/server/src/routes/skills.ts
+++ b/apps/server/src/routes/skills.ts
@@ -90,11 +90,26 @@ export function registerSkillsRoutes(
          VALUES (${sessionId}, ${chat.id}, 'assistant', '', ${sql.json(toolCalls as never)}, 'complete', clock_timestamp())
          RETURNING id
        `;
        // v1.13.0: dual-write the synthetic assistant message's tool_call.
        // Single skill_use tool_call, no text content, so one part at seq 0.
        await tx`
          INSERT INTO message_parts (message_id, sequence, kind, payload)
          VALUES (${synthAssistant!.id}, 0, 'tool_call', ${tx.json({
            id: toolCallId,
            name: 'skill_use',
            args: { name: skill_name },
          } as never)})
        `;
        const [toolMsg] = await tx<{ id: string }[]>`
          INSERT INTO messages (session_id, chat_id, role, content, tool_results, status, created_at)
          VALUES (${sessionId}, ${chat.id}, 'tool', '', ${sql.json(toolResults as never)}, 'complete', clock_timestamp())
          RETURNING id
        `;
        // v1.13.0: dual-write the synthetic tool result (the skill body).
        await tx`
          INSERT INTO message_parts (message_id, sequence, kind, payload)
          VALUES (${toolMsg!.id}, 0, 'tool_result', ${tx.json(toolResults as never)})
        `;
        const [userMsg] = await tx<{ id: string }[]>`
          INSERT INTO messages (session_id, chat_id, role, content, status, created_at)
          VALUES (${sessionId}, ${chat.id}, 'user', ${userText}, 'complete', clock_timestamp())
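
The dual-write above hand-builds a single part at sequence 0. For ordinary assistant messages, the behaviour pinned down by the first five cases in parts.test.ts (text first when content is non-empty, then tool calls in order, nothing for an empty message) can be sketched as follows. `partsFromAssistantSketch` and its types are hypothetical stand-ins for `partsFromAssistantMessage`, and reasoning parts are left out because that test case is cut off in this view.

```typescript
interface ToolCallSketch {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

type PartSketch =
  | { sequence: number; kind: 'text'; payload: { text: string } }
  | { sequence: number; kind: 'tool_call'; payload: ToolCallSketch };

function partsFromAssistantSketch(msg: {
  content: string;
  tool_calls: ToolCallSketch[] | null;
}): PartSketch[] {
  const parts: PartSketch[] = [];
  // Non-empty content becomes the first part.
  if (msg.content !== '') {
    parts.push({ sequence: parts.length, kind: 'text', payload: { text: msg.content } });
  }
  // Tool calls follow in their original order with consecutive sequences.
  for (const tc of msg.tool_calls ?? []) {
    parts.push({
      sequence: parts.length,
      kind: 'tool_call',
      payload: { id: tc.id, name: tc.name, args: tc.args },
    });
  }
  return parts;
}
```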
--- a/apps/server/src/routes/ws.ts
+++ b/apps/server/src/routes/ws.ts
@@ -23,11 +23,12 @@ export function registerWebSocket(
      // v1.11: snapshot includes compaction fields so MessageBubble can
      // render the SummaryCard for summary=true rows on first connect.
      // v1.13.1-B: reads tool_calls/tool_results via the parts-merged view.
      const messages = await sql<Message[]>`
        SELECT id, session_id, chat_id, role, content, kind, tool_calls, tool_results, status, last_seq,
               tokens_used, ctx_used, ctx_max, started_at, finished_at, created_at, metadata,
               summary, tail_start_id, compacted_at
-        FROM messages
+        FROM messages_with_parts
        WHERE session_id = ${sessionId}
        ORDER BY created_at ASC, id ASC
      `;
--- a/apps/server/src/schema.sql
+++ b/apps/server/src/schema.sql
@@ -1,3 +1,10 @@
 -- v1.13.3: statement_timeout is set at database level via:
 --   ALTER DATABASE boocode SET statement_timeout = '30s';
 -- ALTER DATABASE can't run inside a DO block, so this is an operational
 -- step rather than schema. Re-apply after a volume reset (the setting
 -- lives in pg_db which survives `docker compose up --build` but NOT a
 -- `docker volume rm boocode_pgdata`).
 CREATE TABLE IF NOT EXISTS projects (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name TEXT NOT NULL,
@@ -32,6 +39,86 @@ CREATE TABLE IF NOT EXISTS messages (
 CREATE INDEX IF NOT EXISTS idx_messages_session ON messages(session_id, created_at);
 -- v1.13.0: granular message parts table for AI SDK migration. Old
 -- messages.content / tool_calls / tool_results columns stay authoritative
 -- for reads in v1.13.0; this table is dual-written so the swap can happen
 -- in a later dispatch without a backfill window. ON DELETE CASCADE means
 -- removing a message removes its parts in one go.
 CREATE TABLE IF NOT EXISTS message_parts (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  message_id uuid NOT NULL REFERENCES messages(id) ON DELETE CASCADE,
  sequence int NOT NULL,
  kind text NOT NULL,
  payload jsonb NOT NULL,
  created_at timestamptz NOT NULL DEFAULT clock_timestamp(),
  CONSTRAINT message_parts_kind_chk CHECK (kind IN ('text', 'tool_call', 'tool_result', 'reasoning', 'step_start')),
  CONSTRAINT message_parts_seq_uniq UNIQUE (message_id, sequence)
 );
 CREATE INDEX IF NOT EXISTS message_parts_msg_seq_idx ON message_parts (message_id, sequence);
 -- v1.13.4: prune support. hidden_at marks parts that have been pruned out
 -- of the model payload by the two-tier compaction prune (services/inference/
 -- prune.ts). Rows stay in the DB so frontend can still display them with a
 -- "hidden" indicator (out of scope this dispatch). messages_with_parts
 -- view filters these out — see below. Partial index speeds the common
 -- "visible parts only" filter.
 DO $$
 BEGIN
  IF NOT EXISTS (
    SELECT 1 FROM information_schema.columns
    WHERE table_name = 'message_parts' AND column_name = 'hidden_at'
  ) THEN
    ALTER TABLE message_parts ADD COLUMN hidden_at timestamptz NULL;
  END IF;
 END $$;
 CREATE INDEX IF NOT EXISTS message_parts_hidden_idx
  ON message_parts (message_id) WHERE hidden_at IS NULL;
 -- v1.13.1-B: read-path view. Read sites SELECT FROM messages_with_parts
 -- instead of messages so tool_calls / tool_results / reasoning_parts come
 -- from the granular message_parts table. The COALESCE means pre-v1.13.0
 -- history (no parts rows) still resolves via the legacy JSON columns; the
 -- dual-write from v1.13.0 keeps both in sync for all rows written since.
 -- Writes continue to target `messages` directly — the view is read-only.
 -- Shapes match the in-memory ToolCall / ToolResult types: tool_calls is a
 -- jsonb array of {id, name, args}, tool_results is a single jsonb object
 -- {tool_call_id, output, truncated, error?}. reasoning_parts is new — only
 -- consumed by the inference history fetch (payload.ts) so v1.13.1-C can
 -- wire reasoning into the model payload. Not surfaced in external APIs yet.
 CREATE OR REPLACE VIEW messages_with_parts AS
 SELECT
  m.id, m.session_id, m.chat_id, m.role, m.content, m.kind, m.status,
  m.last_seq, m.tokens_used, m.ctx_used, m.ctx_max,
  m.started_at, m.finished_at, m.created_at, m.metadata,
  m.summary, m.tail_start_id, m.compacted_at,
  -- v1.13.4: prune semantics need to distinguish "no parts row exists"
  -- (pre-v1.13.0 fallback to legacy column) from "all parts hidden"
  -- (prune intended — return null/empty so the row drops from the model
  -- payload). A naive COALESCE would fall back to the legacy column when
  -- every part is hidden, undoing the prune. CASE on EXISTS(any kind)
  -- splits the two cases.
  CASE
    WHEN EXISTS (SELECT 1 FROM message_parts pp
                  WHERE pp.message_id = m.id AND pp.kind = 'tool_call')
    THEN (SELECT jsonb_agg(p.payload ORDER BY p.sequence)
            FROM message_parts p
           WHERE p.message_id = m.id AND p.kind = 'tool_call' AND p.hidden_at IS NULL)
    ELSE m.tool_calls
  END AS tool_calls,
  CASE
    WHEN EXISTS (SELECT 1 FROM message_parts pp
                  WHERE pp.message_id = m.id AND pp.kind = 'tool_result')
    THEN (SELECT p.payload
            FROM message_parts p
           WHERE p.message_id = m.id AND p.kind = 'tool_result' AND p.hidden_at IS NULL
           ORDER BY p.sequence LIMIT 1)
    ELSE m.tool_results
  END AS tool_results,
  (SELECT jsonb_agg(p.payload ORDER BY p.sequence)
     FROM message_parts p
    WHERE p.message_id = m.id AND p.kind = 'reasoning' AND p.hidden_at IS NULL) AS reasoning_parts
 FROM messages m;
 ALTER TABLE messages ADD COLUMN IF NOT EXISTS tokens_used INTEGER;
 ALTER TABLE messages ADD COLUMN IF NOT EXISTS ctx_used INTEGER;
 ALTER TABLE messages ADD COLUMN IF NOT EXISTS ctx_max INTEGER;
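
The difference between the old COALESCE and the new CASE in `messages_with_parts` is easiest to see in an in-memory model of the `tool_results` column. This is a sketch under stated assumptions: `PartRowSketch`, `resolveWithCase`, and `resolveWithCoalesce` are hypothetical names, and the COALESCE variant models the pre-v1.13.4 behaviour as the commit message describes it rather than reproducing the earlier view text.

```typescript
interface PartRowSketch {
  kind: 'tool_call' | 'tool_result' | 'reasoning';
  sequence: number;
  payload: unknown;
  hidden_at: string | null;
}

// First visible tool_result payload by sequence, or null when none is visible.
function firstVisibleResult(parts: PartRowSketch[]): unknown {
  const visible = parts
    .filter((p) => p.kind === 'tool_result' && p.hidden_at === null)
    .sort((a, b) => a.sequence - b.sequence);
  const first = visible[0];
  return first ? first.payload : null;
}

// v1.13.4 semantics: fall back to the legacy column only when the message
// has no tool_result parts at all.
function resolveWithCase(parts: PartRowSketch[], legacy: unknown): unknown {
  const hasAny = parts.some((p) => p.kind === 'tool_result');
  return hasAny ? firstVisibleResult(parts) : legacy;
}

// COALESCE semantics: any null from the parts side falls back to legacy,
// which brings a pruned result back.
function resolveWithCoalesce(parts: PartRowSketch[], legacy: unknown): unknown {
  return firstVisibleResult(parts) ?? legacy;
}
```

When the only tool_result part of a message is hidden, `resolveWithCase` yields null while `resolveWithCoalesce` returns the legacy value, which is the leak the commit message says the smoke test caught. For a pre-v1.13.0 message with no parts, both return the legacy value.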
--- a/apps/server/src/services/tests/doom-loop.test.ts
+++ b/apps/server/src/services/tests/doom-loop.test.ts
@@ -1,5 +1,5 @@
 import { describe, it, expect } from 'vitest';
-import { DOOM_LOOP_THRESHOLD, detectDoomLoop } from '../inference.js';
+import { DOOM_LOOP_THRESHOLD, detectDoomLoop } from '../inference/index.js';
 import type { ToolCall } from '../../types/api.js';
 // ---- fixture ----------------------------------------------------------------
--- a/apps/server/src/services/tests/inference.test.ts
+++ b/apps/server/src/services/tests/inference.test.ts
@@ -1,5 +1,5 @@
 import { describe, it, expect } from 'vitest';
-import { buildMessagesPayload } from '../inference.js';
+import { buildMessagesPayload } from '../inference/index.js';
 import type {
  Message,
  MessageRole,
--- a/apps/server/src/services/tests/parts.test.ts
+++ b/apps/server/src/services/tests/parts.test.ts
@@ -0,0 +1,121 @@
 import { describe, it, expect } from 'vitest';
 import { partsFromAssistantMessage, partsFromToolMessage } from '../inference/parts.js';
 import type { ToolCall, ToolResult } from '../../types/api.js';
 describe('partsFromAssistantMessage', () => {
  it('emits one text part for content-only assistant', () => {
    const parts = partsFromAssistantMessage({ content: 'hello world', tool_calls: null });
    expect(parts).toHaveLength(1);
    expect(parts[0]).toEqual({
      sequence: 0,
      kind: 'text',
      payload: { text: 'hello world' },
    });
  });
  it('emits one tool_call part for empty-content + single tool_call', () => {
    const tc: ToolCall = { id: 'call_1', name: 'view_file', args: { path: 'src/a.ts' } };
    const parts = partsFromAssistantMessage({ content: '', tool_calls: [tc] });
    expect(parts).toHaveLength(1);
    expect(parts[0]).toEqual({
      sequence: 0,
      kind: 'tool_call',
      payload: { id: 'call_1', name: 'view_file', args: { path: 'src/a.ts' } },
    });
  });
  it('emits text then tool_call parts in order when both present', () => {
    const tc: ToolCall = { id: 'call_2', name: 'grep', args: { pattern: 'foo' } };
    const parts = partsFromAssistantMessage({ content: 'let me search', tool_calls: [tc] });
    expect(parts.map((p) => [p.sequence, p.kind])).toEqual([
      [0, 'text'],
      [1, 'tool_call'],
    ]);
  });
  it('preserves tool_call order with multiple calls', () => {
    const calls: ToolCall[] = [
      { id: 'a', name: 'list_dir', args: { path: '.' } },
      { id: 'b', name: 'view_file', args: { path: 'x.ts' } },
      { id: 'c', name: 'grep', args: { pattern: 'y' } },
    ];
    const parts = partsFromAssistantMessage({ content: '', tool_calls: calls });
    expect(parts).toHaveLength(3);
    expect(parts.map((p) => p.payload)).toEqual([
      { id: 'a', name: 'list_dir', args: { path: '.' } },
      { id: 'b', name: 'view_file', args: { path: 'x.ts' } },
      { id: 'c', name: 'grep', args: { pattern: 'y' } },
    ]);
    expect(parts.map((p) => p.sequence)).toEqual([0, 1, 2]);
  });
  it('returns empty array for empty content + null tool_calls', () => {
    expect(partsFromAssistantMessage({ content: '', tool_calls: null })).toEqual([]);
  });
  it('v1.13.1-C: reasoning lands at sequence 0 before text + tool_calls', () => {
    const tc: ToolCall = { id: 'call_r', name: 'view_file', args: { path: 'x.ts' } };
    const parts = partsFromAssistantMessage({
      content: 'inspecting now',
      tool_calls: [tc],
      reasoning: 'user asked about x.ts; I should view it',
    });
    expect(parts.map((p) => [p.sequence, p.kind])).toEqual([
      [0, 'reasoning'],
      [1, 'text'],
      [2, 'tool_call'],
    ]);
    expect(parts[0]!.payload).toEqual({
      text: 'user asked about x.ts; I should view it',
    });
  });
  it('v1.13.1-C: reasoning + empty content + tool_calls preserves seq 0 reasoning', () => {
    const tc: ToolCall = { id: 'call_r2', name: 'grep', args: { pattern: 'foo' } };
    const parts = partsFromAssistantMessage({
      content: '',
      tool_calls: [tc],
      reasoning: 'jumping straight to grep',
    });
    expect(parts.map((p) => [p.sequence, p.kind])).toEqual([
      [0, 'reasoning'],
      [1, 'tool_call'],
    ]);
  });
 });
 describe('partsFromToolMessage', () => {
  it('emits a single tool_result part at sequence 0', () => {
    const tr: ToolResult = {
      tool_call_id: 'call_1',
      output: { contents: 'console.log(1)' },
      truncated: false,
    };
    const parts = partsFromToolMessage({ tool_results: tr });
    expect(parts).toHaveLength(1);
    expect(parts[0]).toEqual({
      sequence: 0,
      kind: 'tool_result',
      payload: {
        tool_call_id: 'call_1',
        output: { contents: 'console.log(1)' },
        truncated: false,
      },
    });
  });
  it('includes error in payload when present', () => {
    const tr: ToolResult = {
      tool_call_id: 'call_2',
      output: null,
      truncated: false,
      error: 'permission denied',
    };
    const parts = partsFromToolMessage({ tool_results: tr });
    expect(parts[0]!.payload).toMatchObject({ error: 'permission denied' });
  });
  it('returns empty array when tool_results is null', () => {
    expect(partsFromToolMessage({ tool_results: null })).toEqual([]);
  });
 });
--- a/apps/server/src/services/tests/prune.test.ts
+++ b/apps/server/src/services/tests/prune.test.ts
@@ -0,0 +1,96 @@
 import { describe, it, expect, beforeEach } from 'vitest';
 import {
  selectPruneTargets,
  PROTECTED_TOKENS,
  PRUNE_TRIGGER_TOKENS,
  type PartForPrune,
 } from '../inference/prune.js';
 // Test fixture: build a tool_result part whose payload size yields a known
 // token estimate (chars/4). The decision logic only cares about
 // JSON.stringify(payload).length, so a string payload whose stringified
 // form is `4n` chars long produces exactly `n` tokens.
 let seq = 0;
 function part(tokens: number, createdAt: Date): PartForPrune {
  seq += 1;
   // JSON.stringify("xxx...") wraps the string in quotes (adds 2 chars), so
   // the raw text is built 2 chars short: 4*tokens - 2. The stringified
   // length is then exactly 4*tokens and Math.ceil(4*tokens / 4) === tokens,
   // so per-part estimates are exact, not approximate. Requires tokens >= 1
   // (a negative repeat count throws).
  const text = 'x'.repeat(tokens * 4 - 2);
  return { id: `p${seq}`, payload: text, created_at: createdAt };
 }
 const T_NOW = new Date('2026-05-22T12:00:00Z');
 function ago(secondsBack: number): Date {
  return new Date(T_NOW.getTime() - secondsBack * 1000);
 }
 describe('selectPruneTargets', () => {
  beforeEach(() => {
    seq = 0;
  });
  it('returns nothing when there are no parts', () => {
    expect(selectPruneTargets([], null)).toEqual({ ids: [], freedTokens: 0 });
  });
  it('returns nothing when total tokens are under the protection window', () => {
    const parts: PartForPrune[] = [
      part(10_000, ago(10)),
      part(10_000, ago(20)),
    ]; // 20k total, all protected
    expect(selectPruneTargets(parts, null)).toEqual({ ids: [], freedTokens: 0 });
  });
  it('returns nothing when candidate total is below the prune trigger', () => {
    // Protection fills with ~40k newest, candidates only ~5k. Below 20k trigger.
    const parts: PartForPrune[] = [
      part(20_000, ago(10)),
      part(20_000, ago(20)),
      // Past protection; total ~5k won't trigger.
      part(5_000, ago(30)),
    ];
    const result = selectPruneTargets(parts, null);
    expect(result.ids).toEqual([]);
    expect(result.freedTokens).toBe(0);
  });
  it('hides candidates past protection when their total clears the trigger', () => {
    // Newest 40k protected; older 30k cleanly above the 20k trigger.
    const parts: PartForPrune[] = [
      part(20_000, ago(10)),
      part(20_000, ago(20)),
      // Past protection, total ~30k freed.
      part(15_000, ago(30)),
      part(15_000, ago(40)),
    ];
    const result = selectPruneTargets(parts, null);
    expect(result.ids).toEqual(['p3', 'p4']);
    expect(result.freedTokens).toBeGreaterThanOrEqual(PRUNE_TRIGGER_TOKENS);
  });
  it('stops at the compaction summary boundary', () => {
     // Four 15k parts. The first three (45k) fill the protection window:
     // the part that crosses PROTECTED_TOKENS=40k is itself protected.
     // Boundary sits at ago(35), so the ago(40) part is beyond it and skipped.
     const parts: PartForPrune[] = [
       part(15_000, ago(10)),
       part(15_000, ago(20)),
       part(15_000, ago(30)), // crosses protection threshold; still protected
       part(15_000, ago(40)), // beyond summary boundary; skipped
     ];
     const tailStart = ago(35);
     const result = selectPruneTargets(parts, tailStart);
     // ago(30) crosses the protection threshold and stays protected, and
     // ago(40) is past the boundary, so there are no candidates to hide.
    expect(result.ids).toEqual([]);
  });
  it('does not prune when only protected parts exist (no candidates)', () => {
    // Exactly PROTECTED_TOKENS of newest parts; no older candidates.
    const parts: PartForPrune[] = [part(PROTECTED_TOKENS, ago(10))];
    expect(selectPruneTargets(parts, null)).toEqual({ ids: [], freedTokens: 0 });
  });
 });
--- a/apps/server/src/services/tests/tools.test.ts
+++ b/apps/server/src/services/tests/tools.test.ts
@@ -0,0 +1,14 @@
 import { describe, it, expect } from 'vitest';
 import { ALL_TOOLS } from '../tools.js';
 describe('ALL_TOOLS registry', () => {
  // v1.13.3: tools must be alpha-sorted at module load. llama.cpp's prompt
  // cache hits on byte-identical prefixes; the tool list lives near the
  // top of the system prompt, so any order drift invalidates every cached
  // turn. The registry sort is the single source of truth; downstream
  // helpers (toolJsonSchemas, TOOLS_BY_NAME, buildAiTools) inherit it.
  it('exports tools in alphabetical order by name', () => {
    const names = ALL_TOOLS.map((t) => t.name);
    expect(names).toEqual([...names].sort((a, b) => a.localeCompare(b)));
  });
 });
--- a/apps/server/src/services/auto_name.ts
+++ b/apps/server/src/services/auto_name.ts
@@ -1,4 +1,4 @@
-import type { InferenceContext } from './inference.js';
+import type { InferenceContext } from './inference/index.js';
 const NAMING_SYSTEM_PROMPT =
  'You name chat sessions. Reply directly with no thinking, reasoning, or explanation. Output ONLY the title, 4 words max, no quotes, no punctuation, no prefix like "Title:".';
--- a/apps/server/src/services/compaction.ts
+++ b/apps/server/src/services/compaction.ts
@@ -342,9 +342,11 @@ export async function process(input: ProcessInput): Promise<void> {
  // 2. All currently-active messages in this chat (compacted_at IS NULL).
  // ORDER BY (created_at, id) matches loadContext in inference.ts so the
  // turns() boundary logic sees the same sequence the LLM will.
  // v1.13.1-B: reads tool_calls/tool_results via the parts-merged view so
  // the compaction payload matches what the LLM saw on the original turn.
  const messages = await sql<CompactionMessage[]>`
    SELECT id, role, content, kind, summary, status, tool_calls, tool_results, metadata, created_at
-    FROM messages
+    FROM messages_with_parts
    WHERE chat_id = ${chatId} AND compacted_at IS NULL
    ORDER BY created_at ASC, id ASC
  `;
--- a/apps/server/src/services/inference.ts
+++ b/apps/server/src/services/inference.ts
--- a/apps/server/src/services/inference/budget.ts
+++ b/apps/server/src/services/inference/budget.ts
@@ -0,0 +1,20 @@
 import type { Agent } from '../../types/api.js';
 import { READ_ONLY_TOOL_NAMES } from '../tools.js';
 // v1.8.2: tool-call budget defaults. Resolved per-turn by resolveToolBudget.
 //   - Agent with explicit max_tool_calls: that value.
 //   - Agent with read-only-only tools:    BUDGET_READ_ONLY (30).
 //   - Agent with any non-read-only tool:  BUDGET_NON_READ_ONLY (10).
 //   - No agent (raw chat):                BUDGET_NO_AGENT (15).
 export const BUDGET_READ_ONLY = 30;
 export const BUDGET_NON_READ_ONLY = 10;
 export const BUDGET_NO_AGENT = 15;
 const READ_ONLY_SET: ReadonlySet<string> = new Set(READ_ONLY_TOOL_NAMES);
 export function resolveToolBudget(agent: Agent | null): number {
  if (agent?.max_tool_calls != null) return agent.max_tool_calls;
  if (!agent) return BUDGET_NO_AGENT;
  const allReadOnly = agent.tools.every((t) => READ_ONLY_SET.has(t));
  return allReadOnly ? BUDGET_READ_ONLY : BUDGET_NON_READ_ONLY;
 }
--- a/apps/server/src/services/inference/error-handler.ts
+++ b/apps/server/src/services/inference/error-handler.ts
@@ -0,0 +1,167 @@
 import type { MessageMetadata, Session } from '../../types/api.js';
 import * as modelContext from '../model-context.js';
 import { maybeFlagForCompaction } from './payload.js';
 import { insertParts, partsFromAssistantMessage } from './parts.js';
 import type { InferenceContext, StreamResult, TurnArgs } from './turn.js';
 export async function handleAbortOrError(
  ctx: InferenceContext,
  args: TurnArgs,
  accumulated: string,
  err: unknown
 ): Promise<void> {
  const { sessionId, chatId, assistantMessageId } = args;
  const isAbort = err instanceof Error && err.name === 'AbortError';
  const finalStatus = isAbort ? 'cancelled' : 'failed';
  const errMsg = err instanceof Error ? err.message : String(err);
  // v1.8.2: persist a structured error metadata blob on genuine failures so
  // the bubble can render the reason on reload without re-deriving from the
  // (one-shot) WS error frame. User-initiated abort skips this — there's no
  // "reason" to surface for a stop the user already explicitly chose.
  const errorMetadata: MessageMetadata | null = isAbort
    ? null
    : { kind: 'error', error_reason: 'llm_provider_error', error_text: errMsg };
  if (errorMetadata) {
    await ctx.sql`
      UPDATE messages
      SET status = ${finalStatus},
          content = ${accumulated},
          finished_at = clock_timestamp(),
          metadata = ${ctx.sql.json(errorMetadata as never)}
      WHERE id = ${assistantMessageId}
    `;
  } else {
    await ctx.sql`
      UPDATE messages
      SET status = ${finalStatus},
          content = ${accumulated},
          finished_at = clock_timestamp()
      WHERE id = ${assistantMessageId}
    `;
  }
  const [failSessRow] = await ctx.sql<{ project_id: string; name: string; updated_at: string }[]>`
    UPDATE sessions SET updated_at = clock_timestamp()
    WHERE id = ${sessionId}
    RETURNING project_id, name, updated_at
  `;
  ctx.publishUser({ type: 'session_updated', session_id: sessionId, project_id: failSessRow!.project_id, name: failSessRow!.name, updated_at: failSessRow!.updated_at });
  // v1.8 mobile-tabs: cancellation is a user-initiated stop, treat as idle;
  // genuine errors flip the dot red. v1.8.2: error path also carries a
  // machine-readable `reason` so the UI can render specifics inline.
  if (isAbort) {
    // v1.12.1: defensive cancellation write. The status=${finalStatus} UPDATE
    // above already sets 'cancelled' for the AbortError case, but a row can
    // leak as 'streaming' when the abort fires between the post-tool-phase
    // INSERT (executeToolPhase) and the next runAssistantTurn's stream setup,
    // bypassing the try/catch around executeStreamPhase. The status guard
    // makes this a no-op when the earlier write already landed.
    await ctx.sql`
      UPDATE messages
      SET status = 'cancelled', content = ${accumulated}, finished_at = clock_timestamp()
      WHERE id = ${args.assistantMessageId} AND status = 'streaming'
    `;
    ctx.publishUser({ type: 'chat_status', chat_id: chatId, status: 'idle', at: new Date().toISOString() });
    ctx.publish(sessionId, {
      type: 'message_complete',
      message_id: assistantMessageId,
      chat_id: chatId,
    });
    ctx.log.info({ sessionId, chatId, assistantMessageId }, 'inference cancelled');
  } else {
    ctx.publishUser({
      type: 'chat_status',
      chat_id: chatId,
      status: 'error',
      at: new Date().toISOString(),
      reason: 'llm_provider_error',
    });
    ctx.publish(sessionId, {
      type: 'error',
      message_id: assistantMessageId,
      chat_id: chatId,
      error: errMsg,
      reason: 'llm_provider_error',
    });
    ctx.log.error({ err, sessionId, assistantMessageId }, 'inference failed');
  }
 }
 export async function finalizeCompletion(
  ctx: InferenceContext,
  args: TurnArgs,
  result: StreamResult,
  startedAt: string | null,
  session: Session
 ): Promise<void> {
  const { sessionId, chatId, assistantMessageId } = args;
  const { content, finishReason, promptTokens, completionTokens } = result;
  // v1.11.3: see executeToolPhase for the rationale.
  const mctx = await modelContext.getModelContext(session.model);
  const nCtx = mctx?.n_ctx ?? null;
  const [updated] = await ctx.sql<
    { tokens_used: number | null; ctx_used: number | null; ctx_max: number | null; finished_at: string | null }[]
  >`
    UPDATE messages
    SET content = ${content},
        status = 'complete',
        tokens_used = ${completionTokens},
        ctx_used = ${promptTokens},
        ctx_max = ${nCtx},
        finished_at = clock_timestamp()
    WHERE id = ${assistantMessageId}
    RETURNING tokens_used, ctx_used, ctx_max, finished_at
  `;
  // v1.13.0: dual-write the text part. finalizeCompletion is the terminal
  // path for text-only assistant turns (no tool calls); tool_calls are null
  // here by construction (the tool-bearing path goes through executeToolPhase).
  // v1.13.1-C: include result.reasoning so reasoning-channel models capture
  // a kind='reasoning' part alongside the text.
  // TODO(v1.13.1): wrap the UPDATE above and this insertParts in a single
  // sql.begin before flipping read authority to message_parts.
  await insertParts(
    ctx.sql,
    partsFromAssistantMessage({
      content,
      tool_calls: null,
      reasoning: result.reasoning,
    }).map((p) => ({
      ...p,
      message_id: assistantMessageId,
    })),
  );
  // v1.11: flag for compaction on the terminal turn too. Catches the common
  // case of a turn that hit the limit without invoking tools.
  await maybeFlagForCompaction(ctx, chatId, updated);
  const [completeSessRow] = await ctx.sql<{ project_id: string; name: string; updated_at: string }[]>`
    UPDATE sessions SET updated_at = clock_timestamp()
    WHERE id = ${sessionId}
    RETURNING project_id, name, updated_at
  `;
  ctx.publishUser({ type: 'session_updated', session_id: sessionId, project_id: completeSessRow!.project_id, name: completeSessRow!.name, updated_at: completeSessRow!.updated_at });
  ctx.publishUser({ type: 'chat_status', chat_id: chatId, status: 'idle', at: new Date().toISOString() });
  ctx.publish(sessionId, {
    type: 'message_complete',
    message_id: assistantMessageId,
    chat_id: chatId,
    tokens_used: updated?.tokens_used ?? null,
    ctx_used: updated?.ctx_used ?? null,
    ctx_max: updated?.ctx_max ?? null,
    started_at: startedAt,
    finished_at: updated?.finished_at ?? null,
    model: session.model,
  });
  ctx.log.info(
    {
      sessionId,
      chatId,
      assistantMessageId,
      finishReason,
      chars: content.length,
      tokens_used: updated?.tokens_used,
      ctx_used: updated?.ctx_used,
    },
    'inference complete'
  );
 }
--- a/apps/server/src/services/inference/index.ts
+++ b/apps/server/src/services/inference/index.ts
@@ -0,0 +1,20 @@
 // v1.12.4: re-export shim. Outside callers (apps/server/src/index.ts and the
 // vitest inference tests) import from './services/inference/index.js'. The
 // directory is now the public surface; turn.ts holds runAssistantTurn /
 // runInference / createInferenceRunner while the other inference/*.ts files
 // stay implementation-private.
 export {
  createInferenceRunner,
  runAssistantTurn,
  runInference,
 } from './turn.js';
 export type {
  FramePublisher,
  InferenceContext,
  InferenceFrame,
  StreamResult,
  TurnArgs,
 } from './turn.js';
 export { detectDoomLoop, DOOM_LOOP_THRESHOLD } from './sentinels.js';
 export { buildMessagesPayload } from './payload.js';
--- a/apps/server/src/services/inference/parts.ts
+++ b/apps/server/src/services/inference/parts.ts
@@ -0,0 +1,95 @@
 import type { Sql } from '../../db.js';
 import type { ToolCall, ToolResult } from '../../types/api.js';
 // v1.13.0: dual-write helper. Every site that writes the legacy
 // messages.tool_calls / messages.tool_results JSON columns calls into here
 // to mirror the same data into message_parts rows. As of v1.13.1-B, reads
 // go through the messages_with_parts view, which prefers visible parts and
 // falls back to the legacy JSON columns when a message has no such parts.
 export type PartKind = 'text' | 'tool_call' | 'tool_result' | 'reasoning' | 'step_start';
 export interface PartInsert {
  message_id: string;
  sequence: number;
  kind: PartKind;
  payload: unknown;
 }
 export async function insertParts(sql: Sql, parts: PartInsert[]): Promise<void> {
  if (parts.length === 0) return;
  // postgres-js fans out an array of objects to a multi-row INSERT. Each
  // payload field needs sql.json() so jsonb storage receives a JSON value
  // rather than a quoted string.
  await sql`
    INSERT INTO message_parts ${sql(
      parts.map((p) => ({
        message_id: p.message_id,
        sequence: p.sequence,
        kind: p.kind,
        payload: sql.json(p.payload as never),
      })),
      'message_id',
      'sequence',
      'kind',
      'payload',
    )}
  `;
 }
 // Derive parts from the canonical messages row for an assistant message.
 // reasoning (when non-empty) becomes a 'reasoning' part at sequence 0 —
 // it precedes user-visible content logically. content (when non-empty)
 // becomes a 'text' part next; each tool_call becomes a 'tool_call' part
 // with payload { id, name, args } where args is the parsed object (we
 // use the in-memory ToolCall shape, not the OpenAI stringified one).
 export function partsFromAssistantMessage(args: {
  content: string;
  tool_calls: ToolCall[] | null;
  // v1.13.1-C: optional reasoning text streamed alongside the answer.
  // Most rows have none — only models with separate reasoning channels
  // (qwen3.6 etc.) populate this.
  reasoning?: string;
 }): Omit<PartInsert, 'message_id'>[] {
  const out: Omit<PartInsert, 'message_id'>[] = [];
  let seq = 0;
  if (args.reasoning && args.reasoning.length > 0) {
    out.push({ sequence: seq, kind: 'reasoning', payload: { text: args.reasoning } });
    seq += 1;
  }
  if (args.content && args.content.length > 0) {
    out.push({ sequence: seq, kind: 'text', payload: { text: args.content } });
    seq += 1;
  }
  for (const tc of args.tool_calls ?? []) {
    out.push({
      sequence: seq,
      kind: 'tool_call',
      payload: { id: tc.id, name: tc.name, args: tc.args },
    });
    seq += 1;
  }
  return out;
 }
 // Derive a single tool_result part from a tool message's tool_results JSON.
 // The payload includes the same shape that buildMessagesPayload reads from
 // later: tool_call_id, output, optional error/truncated metadata.
 export function partsFromToolMessage(args: {
  tool_results: ToolResult | null;
 }): Omit<PartInsert, 'message_id'>[] {
  if (!args.tool_results) return [];
  const tr = args.tool_results;
  return [
    {
      sequence: 0,
      kind: 'tool_result',
      payload: {
        tool_call_id: tr.tool_call_id,
        output: tr.output,
        truncated: tr.truncated,
        ...(tr.error ? { error: tr.error } : {}),
      },
    },
  ];
 }
--- a/apps/server/src/services/inference/payload.ts
+++ b/apps/server/src/services/inference/payload.ts
@@ -0,0 +1,192 @@
 import type { Sql } from '../../db.js';
 import type {
  Agent,
  Message,
  Project,
  Session,
 } from '../../types/api.js';
 import * as compaction from '../compaction.js';
 import { buildSystemPrompt } from '../system-prompt.js';
 import { isAnySentinel } from './sentinels.js';
 import { PRUNE_TRIGGER_TOKENS, prune } from './prune.js';
 import type { InferenceContext } from './turn.js';
 export interface OpenAiMessage {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content: string | null;
  tool_calls?: Array<{
    id: string;
    type: 'function';
    function: { name: string; arguments: string };
  }>;
  tool_call_id?: string;
  // v1.13.1-C: reasoning text from a prior assistant turn, sourced from
  // message_parts kind='reasoning' rows joined in via reasoning_parts on
  // the messages_with_parts view. stream-phase.ts/toModelMessages threads
  // this into the AI SDK ReasoningPart when forwarding to the model so
  // reasoning models can resume mid-thought across tool-call boundaries.
  reasoning?: string;
 }
 // v1.12: buildSystemPrompt lives in services/system-prompt.ts. It awaits the
 // container-guidance loader, so this function is async too and every call
 // site under services/inference/ awaits the result.
 export async function buildMessagesPayload(
  session: Session,
  project: Project,
  history: Message[],
  agent: Agent | null = null
 ): Promise<OpenAiMessage[]> {
  const out: OpenAiMessage[] = [];
  const systemPrompt = await buildSystemPrompt(project, session, agent);
  out.push({ role: 'system', content: systemPrompt });
  // Find the latest compact marker — only send messages from that point onwards
  let startIdx = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    if (history[i]!.kind === 'compact') {
      startIdx = i;
      break;
    }
  }
  for (let i = startIdx; i < history.length; i++) {
    const m = history[i]!;
    if (m.kind === 'compact') {
      out.push({ role: 'system', content: m.content });
      continue;
    }
    // v1.8.2 / v1.11.6: cap-hit and doom-loop sentinels are UI-only — never
    // send them to the LLM. The synthetic instruction note lives only inside
    // the summary call's messages array and is never persisted, so on a
    // follow-up turn the model resumes with a clean context.
    if (isAnySentinel(m)) continue;
    if (m.role === 'assistant' && m.status === 'streaming') continue;
    if (m.role === 'assistant' && m.status === 'cancelled') continue;
    if (m.role === 'tool') {
      const tr = m.tool_results;
      if (!tr) continue;
      const outputText = tr.error
        ? `error: ${tr.error}`
        : typeof tr.output === 'string'
          ? tr.output
          : JSON.stringify(tr.output);
      out.push({
        role: 'tool',
        content: outputText,
        tool_call_id: tr.tool_call_id,
      });
      continue;
    }
    if (m.role === 'assistant') {
      const msg: OpenAiMessage = {
        role: 'assistant',
        content: m.content && m.content.length > 0 ? m.content : null,
      };
      if (m.tool_calls && m.tool_calls.length > 0) {
        msg.tool_calls = m.tool_calls.map((tc) => ({
          id: tc.id,
          type: 'function' as const,
          function: { name: tc.name, arguments: JSON.stringify(tc.args) },
        }));
      }
      // v1.13.1-C: collapse reasoning_parts into a single string. The view
      // returns them ordered by sequence; multiple reasoning parts on one
      // message are rare but concat preserves ordering. Skip when absent.
      if (m.reasoning_parts && m.reasoning_parts.length > 0) {
        msg.reasoning = m.reasoning_parts.map((p) => p.text ?? '').join('');
      }
      out.push(msg);
      continue;
    }
    out.push({ role: 'user', content: m.content });
  }
  return out;
 }
 export async function loadContext(
  sql: Sql,
  sessionId: string,
  chatId: string
 ): Promise<{ session: Session; project: Project; history: Message[] } | null> {
  const sessionRows = await sql<Session[]>`
    SELECT id, project_id, name, model, system_prompt, status, created_at, updated_at,
           agent_id, web_search_enabled
    FROM sessions WHERE id = ${sessionId}
  `;
  if (sessionRows.length === 0) return null;
  const session = sessionRows[0]!;
  const projectRows = await sql<Project[]>`
    SELECT id, name, path, added_at, last_session_id, status, gitea_remote,
           default_system_prompt, default_web_search_enabled
    FROM projects WHERE id = ${session.project_id}
  `;
  if (projectRows.length === 0) return null;
  const project = projectRows[0]!;
  // v1.11: filter compacted messages out of the inference assembly. The GET
  // /api/sessions/:id/messages endpoint still returns everything (so the UI
  // can show history with the summary card inline); only LLM payloads skip
  // compacted rows. compacted_at IS NULL keeps the active summary + tail.
  // v1.13.1-B: reads tool_calls/tool_results via the parts-merged view.
  // v1.13.1-C: also pull reasoning_parts so assistant messages from
  // reasoning models can be replayed with their reasoning context preserved.
  const history = await sql<Message[]>`
    SELECT id, session_id, chat_id, role, content, kind, tool_calls, tool_results, status, last_seq,
           tokens_used, ctx_used, ctx_max, started_at, finished_at, created_at, metadata,
           reasoning_parts
    FROM messages_with_parts
    WHERE chat_id = ${chatId} AND compacted_at IS NULL
    ORDER BY created_at ASC, id ASC
  `;
  return { session, project, history };
 }
 // v1.11: shared helper used after both finalizeCompletion and executeToolPhase
 // persist their token counts. Reads tokens off the just-UPDATEd row (which
 // the caller returns from RETURNING), runs compaction.isOverflow, and flips
 // chats.needs_compaction. The next runAssistantTurn invocation acts on it.
 // Silent on missing tokens — llama-swap occasionally omits usage on truncated
 // streams, and we'd rather miss one overflow than crash the inference path.
 export async function maybeFlagForCompaction(
  ctx: InferenceContext,
  chatId: string,
  updated: { tokens_used: number | null; ctx_used: number | null; ctx_max: number | null } | undefined,
 ): Promise<void> {
  if (!updated) return;
  const promptTokens = updated.ctx_used;
  const completionTokens = updated.tokens_used;
  const contextLimit = updated.ctx_max;
  if (typeof promptTokens !== 'number') return;
  if (typeof completionTokens !== 'number') return;
  if (typeof contextLimit !== 'number') return;
  const overflow = compaction.isOverflow(
    { prompt_tokens: promptTokens, completion_tokens: completionTokens },
    contextLimit,
  );
  if (!overflow) return;
   // v1.13.4: try the cheap prune first. If it freed at least the buffer
   // worth of tokens (PRUNE_TRIGGER_TOKENS, identical to COMPACTION_BUFFER),
   // we're below the threshold again, so don't set needs_compaction and skip
   // the summary call. The next turn's overflow check re-evaluates from scratch.
   // Prune failures (DB errors etc.) propagate so the surrounding inference
   // path sees them; the catch in finalizeCompletion / executeToolPhase
   // doesn't shield this. By design, we want to know if prune is broken.
  const pruned = await prune({ sql: ctx.sql, chatId });
  if (pruned.hidden > 0) {
    ctx.log.info(
      { chatId, hidden: pruned.hidden, freedTokens: pruned.freedTokens },
      'inference: prune freed context budget',
    );
  }
  if (pruned.freedTokens >= PRUNE_TRIGGER_TOKENS) {
    // Prune handled it; skip the (expensive) summarize path.
    return;
  }
  await ctx.sql`UPDATE chats SET needs_compaction = true WHERE id = ${chatId}`;
  ctx.log.info({ chatId, promptTokens, completionTokens, contextLimit }, 'inference: flagged for compaction');
 }
--- a/apps/server/src/services/inference/provider.ts
+++ b/apps/server/src/services/inference/provider.ts
@@ -0,0 +1,26 @@
 import { createOpenAICompatible } from '@ai-sdk/openai-compatible';
 import type { LanguageModel } from 'ai';
 // v1.13.1-A: AI SDK provider against llama-swap. baseURL is threaded from
 // config.LLAMA_SWAP_URL at call time (not module-load) so tests can stub the
 // upstream without touching env vars. No apiKey — llama-swap is unauth in our
 // Tailscale topology and exposing it over the public internet is gated by
 // Authelia at the Caddy layer, not by API keys.
 const cache = new Map<string, ReturnType<typeof createOpenAICompatible>>();
 function getProvider(baseURL: string): ReturnType<typeof createOpenAICompatible> {
  let provider = cache.get(baseURL);
  if (!provider) {
    provider = createOpenAICompatible({
      name: 'llama-swap',
      baseURL: baseURL.endsWith('/v1') ? baseURL : `${baseURL}/v1`,
    });
    cache.set(baseURL, provider);
  }
  return provider;
 }
 export function upstreamModel(baseURL: string, modelId: string): LanguageModel {
  return getProvider(baseURL).chatModel(modelId);
 }
--- a/apps/server/src/services/inference/prune.ts
+++ b/apps/server/src/services/inference/prune.ts
@@ -0,0 +1,127 @@
 import type { Sql } from '../../db.js';
 // v1.13.4: two-tier compaction prune. Opencode's prune half (the cheap one);
 // summarize half shipped in v1.11.0 as services/compaction.ts.
 //
 // Algorithm: scan tool_result parts newest-first. Protect parts until their
 // running total reaches PROTECTED_TOKENS (the part that crosses the
 // threshold stays protected; the model recently saw these, and pruning
 // them kills coherence). Older parts are candidates. Set hidden_at on them
 // only if the pool would free at least PRUNE_TRIGGER_TOKENS: pruning 3 small
 // tool_results to recover 500 tokens isn't worth the lost fidelity.
 //
 // Stops at the last compaction summary boundary (chats.tail_start_id). The
 // v1.11.0 summary already encodes everything before that point; pruning
 // across the boundary would double-erase.
 export const PROTECTED_TOKENS = 40_000;
 export const PRUNE_TRIGGER_TOKENS = 20_000;
 // Rough char-to-token estimate (about 4 characters per token). Same
 // heuristic compaction's usable() uses implicitly via the buffer constant.
 function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
 }
 function payloadTokens(payload: unknown): number {
  return estimateTokens(JSON.stringify(payload ?? ''));
 }
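 // Worked example of the estimate (values follow from the two helpers above):
 //   estimateTokens('abcdefgh')  -> ceil(8 / 4) = 2
 //   payloadTokens('abcd')       -> JSON is '"abcd"' (6 chars) -> ceil(6 / 4) = 2
 //   payloadTokens(null)         -> JSON is '""' (2 chars)     -> ceil(2 / 4) = 1
 // The estimate counts the serialized JSON, so quotes, braces, and escapes
 // contribute to the total, and a nullish payload still costs 1 token.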
 export interface PruneResult {
  hidden: number;
  freedTokens: number;
 }
 export interface PartForPrune {
  id: string;
  payload: unknown;
  created_at: Date;
 }
 // Pure algorithmic core, exported for unit-test access. Takes parts already
 // ordered newest-first, plus an optional cutoff (last compaction summary
 // boundary). Returns the part ids to hide and the total token estimate of
 // the candidates. Caller does the DB UPDATE.
 export function selectPruneTargets(
  partsNewestFirst: ReadonlyArray<PartForPrune>,
  tailStartCreatedAt: Date | null,
 ): { ids: string[]; freedTokens: number } {
  let protectedTokens = 0;
  const candidates: { id: string; tokens: number }[] = [];
  let crossedProtection = false;
  for (const part of partsNewestFirst) {
    if (tailStartCreatedAt && part.created_at < tailStartCreatedAt) {
      // Past the last summary boundary; the v1.11.0 anchored summary already
      // covers everything older. Bail rather than double-erase.
      break;
    }
    const tokens = payloadTokens(part.payload);
    if (!crossedProtection) {
      protectedTokens += tokens;
      if (protectedTokens >= PROTECTED_TOKENS) {
        crossedProtection = true;
      }
      continue;
    }
    candidates.push({ id: part.id, tokens });
  }
  const candidateTokens = candidates.reduce((s, c) => s + c.tokens, 0);
  if (candidates.length === 0 || candidateTokens < PRUNE_TRIGGER_TOKENS) {
    return { ids: [], freedTokens: 0 };
  }
  return { ids: candidates.map((c) => c.id), freedTokens: candidateTokens };
 }
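 // Usage sketch (illustrative; the ids, dates, and payload sizes are made up).
 // A string payload serializes to length + 2 chars, so 159_998 chars estimates
 // to exactly 40_000 tokens and 79_998 chars to exactly 20_000:
 //   const now = new Date();
 //   selectPruneTargets(
 //     [
 //       { id: 'newest', payload: 'x'.repeat(159_998), created_at: now },
 //       { id: 'older', payload: 'x'.repeat(79_998), created_at: now },
 //     ],
 //     null,
 //   );
 //   // -> { ids: ['older'], freedTokens: 20_000 }
 // 'newest' fills the protected window on its own and is kept; 'older' is the
 // only candidate and meets PRUNE_TRIGGER_TOKENS exactly. Shrinking 'older' to
 // 79_994 chars (19_999 tokens) would return { ids: [], freedTokens: 0 }.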
 export async function prune(args: {
  sql: Sql;
  chatId: string;
 }): Promise<PruneResult> {
  const { sql, chatId } = args;
  // Newest-first scan of visible tool_result parts in this chat. Pull
  // chats.tail_start_id alongside so we know where the last summary boundary
  // sits (don't prune across it).
  const parts = await sql<{
    id: string;
    payload: unknown;
    created_at: Date;
    tail_start_id: string | null;
  }[]>`
    SELECT p.id, p.payload, m.created_at,
      (SELECT c.tail_start_id FROM chats c WHERE c.id = ${chatId}) AS tail_start_id
    FROM message_parts p
    JOIN messages m ON m.id = p.message_id
    WHERE m.chat_id = ${chatId}
      AND p.kind = 'tool_result'
      AND p.hidden_at IS NULL
    ORDER BY m.created_at DESC, p.sequence DESC
  `;
  if (parts.length === 0) {
    return { hidden: 0, freedTokens: 0 };
  }
  // Read the boundary cutoff timestamp once. Older messages are off-limits.
  let tailStartCreatedAt: Date | null = null;
  const firstTailId = parts[0]?.tail_start_id ?? null;
  if (firstTailId) {
    const tailRow = await sql<{ created_at: Date }[]>`
      SELECT created_at FROM messages WHERE id = ${firstTailId}
    `;
    tailStartCreatedAt = tailRow[0]?.created_at ?? null;
  }
  const decision = selectPruneTargets(parts, tailStartCreatedAt);
  if (decision.ids.length === 0) {
    return { hidden: 0, freedTokens: 0 };
  }
  await sql`
    UPDATE message_parts
    SET hidden_at = clock_timestamp()
    WHERE id = ANY(${decision.ids})
  `;
  return { hidden: decision.ids.length, freedTokens: decision.freedTokens };
 }
--- a/apps/server/src/services/inference/sentinel-summaries.ts
+++ b/apps/server/src/services/inference/sentinel-summaries.ts
@@ -0,0 +1,523 @@
 import type {
  Agent,
  Message,
  MessageMetadata,
  Project,
  Session,
 } from '../../types/api.js';
 import * as modelContext from '../model-context.js';
 import { buildMessagesPayload } from './payload.js';
 import { DOOM_LOOP_THRESHOLD } from './sentinels.js';
 import { streamCompletion } from './stream-phase.js';
 import { DB_FLUSH_INTERVAL_MS } from './types.js';
 import type {
  InferenceContext,
  StreamResult,
  TurnArgs,
 } from './turn.js';
 // Synthetic system note appended to the cap-hit summary call. Verbatim from
 // the v1.8.2 spec — do not paraphrase: the model is more reliable when the
 // instruction is short, declarative, and identical across calls.
 const CAP_HIT_SUMMARY_NOTE = (limit: number) =>
  `You've reached the tool budget (${limit} calls). Produce the best answer you can with what you have. Do not call more tools.`;
 const DOOM_LOOP_NOTE = (name: string) =>
  `You called ${name} with the same arguments ${DOOM_LOOP_THRESHOLD} times in a row. Stop calling it. Produce the best answer you can with what you have.`;
 export async function runCapHitSummary(
  ctx: InferenceContext,
  args: TurnArgs,
  session: Session,
  project: Project,
  history: Message[],
  agent: Agent | null,
  budget: number,
 ): Promise<void> {
  const { sessionId, chatId, assistantMessageId, signal } = args;
  const messages = await buildMessagesPayload(session, project, history, agent);
  messages.push({ role: 'system', content: CAP_HIT_SUMMARY_NOTE(budget) });
  const startedRow = await ctx.sql<{ started_at: string }[]>`
    UPDATE messages
    SET started_at = clock_timestamp()
    WHERE id = ${assistantMessageId}
    RETURNING started_at
  `;
  const startedAt = startedRow[0]?.started_at ?? null;
  ctx.publish(sessionId, {
    type: 'message_started',
    message_id: assistantMessageId,
    chat_id: chatId,
    role: 'assistant',
  });
  let accumulated = '';
  let pendingFlushTimer: NodeJS.Timeout | null = null;
  let flushPromise: Promise<unknown> = Promise.resolve();
  const flushNow = () => {
    if (pendingFlushTimer) {
      clearTimeout(pendingFlushTimer);
      pendingFlushTimer = null;
    }
    const snapshot = accumulated;
    flushPromise = flushPromise.then(() =>
      ctx.sql`UPDATE messages SET content = ${snapshot} WHERE id = ${assistantMessageId}`
    );
  };
  const scheduleFlush = () => {
    if (pendingFlushTimer) return;
    pendingFlushTimer = setTimeout(() => {
      pendingFlushTimer = null;
      flushNow();
    }, DB_FLUSH_INTERVAL_MS);
  };
  let summaryOk = false;
  let summarySoftCancelled = false;
  let summaryError: string | null = null;
  let result: StreamResult | null = null;
  try {
    result = await streamCompletion(
      ctx,
      session.model,
      messages,
      { tools: null, temperature: agent?.temperature },
      (delta) => {
        accumulated += delta;
        ctx.publish(sessionId, {
          type: 'delta',
          message_id: assistantMessageId,
          chat_id: chatId,
          content: delta,
        });
        scheduleFlush();
      },
      undefined,
      signal,
    );
    summaryOk = true;
  } catch (err) {
    if (err instanceof Error && err.name === 'AbortError') {
      summarySoftCancelled = true;
    } else {
      summaryError = err instanceof Error ? err.message : String(err);
    }
  } finally {
    if (pendingFlushTimer) {
      clearTimeout(pendingFlushTimer);
      pendingFlushTimer = null;
    }
    await flushPromise;
  }
  // Finalize the summary message based on the three outcomes. The sentinel
  // is inserted regardless so the user always has the Continue affordance —
  // even on a partial / failed summary the chat history shows where the
  // budget was hit.
  if (summaryOk && result) {
    // v1.11.3: see executeToolPhase for the rationale.
    const mctx = await modelContext.getModelContext(session.model);
    const nCtx = mctx?.n_ctx ?? null;
    const [updated] = await ctx.sql<
      { tokens_used: number | null; ctx_used: number | null; ctx_max: number | null; finished_at: string | null }[]
    >`
      UPDATE messages
      SET content = ${result.content},
          status = 'complete',
          tokens_used = ${result.completionTokens},
          ctx_used = ${result.promptTokens},
          ctx_max = ${nCtx},
          finished_at = clock_timestamp()
      WHERE id = ${assistantMessageId}
      RETURNING tokens_used, ctx_used, ctx_max, finished_at
    `;
    ctx.publish(sessionId, {
      type: 'message_complete',
      message_id: assistantMessageId,
      chat_id: chatId,
      tokens_used: updated?.tokens_used ?? null,
      ctx_used: updated?.ctx_used ?? null,
      ctx_max: updated?.ctx_max ?? null,
      started_at: startedAt,
      finished_at: updated?.finished_at ?? null,
      model: session.model,
    });
  } else if (summarySoftCancelled) {
    await ctx.sql`
      UPDATE messages
      SET content = ${accumulated},
          status = 'cancelled',
          finished_at = clock_timestamp()
      WHERE id = ${assistantMessageId}
    `;
    ctx.publish(sessionId, {
      type: 'message_complete',
      message_id: assistantMessageId,
      chat_id: chatId,
    });
  } else {
    const errMeta: MessageMetadata = {
      kind: 'error',
      error_reason: 'summary_after_cap_failed',
      error_text: summaryError ?? 'summary failed',
    };
    await ctx.sql`
      UPDATE messages
      SET content = ${accumulated},
          status = 'failed',
          finished_at = clock_timestamp(),
          metadata = ${ctx.sql.json(errMeta as never)}
      WHERE id = ${assistantMessageId}
    `;
    ctx.publish(sessionId, {
      type: 'error',
      message_id: assistantMessageId,
      chat_id: chatId,
      error: summaryError ?? 'summary failed',
      reason: 'summary_after_cap_failed',
    });
  }
  // Bump the session's updated_at exactly once for this turn.
  const [sessRow] = await ctx.sql<{ project_id: string; name: string; updated_at: string }[]>`
    UPDATE sessions SET updated_at = clock_timestamp()
    WHERE id = ${sessionId}
    RETURNING project_id, name, updated_at
  `;
  ctx.publishUser({
    type: 'session_updated',
    session_id: sessionId,
    project_id: sessRow!.project_id,
    name: sessRow!.name,
    updated_at: sessRow!.updated_at,
  });
  await insertCapHitSentinel(ctx, sessionId, chatId, agent, budget);
  // Status frame fires last so the dot color reflects the terminal state.
  // Success → idle, abort → idle (user-driven stop), error → error+reason.
  if (summaryOk || summarySoftCancelled) {
    ctx.publishUser({ type: 'chat_status', chat_id: chatId, status: 'idle', at: new Date().toISOString() });
  } else {
    ctx.publishUser({
      type: 'chat_status',
      chat_id: chatId,
      status: 'error',
      at: new Date().toISOString(),
      reason: 'summary_after_cap_failed',
    });
  }
  ctx.log.info(
    { sessionId, chatId, assistantMessageId, budget, summaryOk, summaryCancelled: summarySoftCancelled },
    'inference cap-hit summary finished',
  );
 }
 async function insertCapHitSentinel(
  ctx: InferenceContext,
  sessionId: string,
  chatId: string,
  agent: Agent | null,
  budget: number,
 ): Promise<void> {
  // Hard ceiling: count prior cap_hit sentinels in this chat. After two
  // continues (sentinel count of 2), the next sentinel reports can_continue
  // false and the UI disables the Continue button.
  const priorRows = await ctx.sql<{ count: number }[]>`
    SELECT COUNT(*)::int AS count
    FROM messages
    WHERE chat_id = ${chatId}
      AND role = 'system'
      AND metadata->>'kind' = 'cap_hit'
  `;
  const priorCount = priorRows[0]?.count ?? 0;
  const canContinue = priorCount < 2;
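  // Worked through: priorCount is read before this sentinel is inserted, so
  //   1st cap hit -> priorCount 0 -> can_continue true
  //   2nd cap hit -> priorCount 1 -> can_continue true
  //   3rd cap hit -> priorCount 2 -> can_continue false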
  const metadata: MessageMetadata = {
    kind: 'cap_hit',
    used: budget,
    limit: budget,
    agent_name: agent?.name ?? null,
    can_continue: canContinue,
  };
  const content = `Reached tool budget (${budget}/${budget}). Continue to extend.`;
  const [row] = await ctx.sql<{ id: string }[]>`
    INSERT INTO messages (session_id, chat_id, role, content, status, created_at, metadata)
    VALUES (${sessionId}, ${chatId}, 'system', ${content}, 'complete', clock_timestamp(), ${ctx.sql.json(metadata as never)})
    RETURNING id
  `;
  // The sentinel content is static, but we still walk the standard frame
  // sequence (started → delta → complete) so useSessionStream's reducer
  // appends it via the same path it uses for streaming assistant messages.
  // The delta carries the full text in one chunk.
  ctx.publish(sessionId, {
    type: 'message_started',
    message_id: row!.id,
    chat_id: chatId,
    role: 'system',
  });
  ctx.publish(sessionId, {
    type: 'delta',
    message_id: row!.id,
    chat_id: chatId,
    content,
  });
  ctx.publish(sessionId, {
    type: 'message_complete',
    message_id: row!.id,
    chat_id: chatId,
    metadata,
  });
 }
 // v1.11.6: doom-loop wrap-up. Mirrors runCapHitSummary structurally: same
 // in-flight-slot reuse, same tools-disabled streaming-summary call, same
 // post-finalize sentinel insert + chat_status drop. Differences:
 //   - synthetic note text comes from DOOM_LOOP_NOTE (names the looping tool)
 //   - sentinel metadata is { kind: 'doom_loop', tool_name, args, threshold }
 //     and has no Continue affordance (manual retry would just re-loop)
 // The failure path reuses reason: 'summary_after_cap_failed' (see the error
 // branch below). Kept as a clone rather than refactored into a shared helper
 // because the two summary paths still differ in note text + sentinel shape; a
 // third sentinel would justify factoring out runWrapUpSummary(opts).
 export async function runDoomLoopSummary(
  ctx: InferenceContext,
  args: TurnArgs,
  session: Session,
  project: Project,
  history: Message[],
  agent: Agent | null,
  loop: { name: string; args: Record<string, unknown> },
 ): Promise<void> {
  const { sessionId, chatId, assistantMessageId, signal } = args;
  const messages = await buildMessagesPayload(session, project, history, agent);
  messages.push({ role: 'system', content: DOOM_LOOP_NOTE(loop.name) });
  const startedRow = await ctx.sql<{ started_at: string }[]>`
    UPDATE messages
    SET started_at = clock_timestamp()
    WHERE id = ${assistantMessageId}
    RETURNING started_at
  `;
  const startedAt = startedRow[0]?.started_at ?? null;
  ctx.publish(sessionId, {
    type: 'message_started',
    message_id: assistantMessageId,
    chat_id: chatId,
    role: 'assistant',
  });
  let accumulated = '';
  let pendingFlushTimer: NodeJS.Timeout | null = null;
  let flushPromise: Promise<unknown> = Promise.resolve();
  const flushNow = () => {
    if (pendingFlushTimer) {
      clearTimeout(pendingFlushTimer);
      pendingFlushTimer = null;
    }
    const snapshot = accumulated;
    flushPromise = flushPromise.then(() =>
      ctx.sql`UPDATE messages SET content = ${snapshot} WHERE id = ${assistantMessageId}`
    );
  };
  const scheduleFlush = () => {
    if (pendingFlushTimer) return;
    pendingFlushTimer = setTimeout(() => {
      pendingFlushTimer = null;
      flushNow();
    }, DB_FLUSH_INTERVAL_MS);
  };
  let summaryOk = false;
  let summarySoftCancelled = false;
  let summaryError: string | null = null;
  let result: StreamResult | null = null;
  try {
    result = await streamCompletion(
      ctx,
      session.model,
      messages,
      { tools: null, temperature: agent?.temperature },
      (delta) => {
        accumulated += delta;
        ctx.publish(sessionId, {
          type: 'delta',
          message_id: assistantMessageId,
          chat_id: chatId,
          content: delta,
        });
        scheduleFlush();
      },
      undefined,
      signal,
    );
    summaryOk = true;
  } catch (err) {
    if (err instanceof Error && err.name === 'AbortError') {
      summarySoftCancelled = true;
    } else {
      summaryError = err instanceof Error ? err.message : String(err);
    }
  } finally {
    if (pendingFlushTimer) {
      clearTimeout(pendingFlushTimer);
      pendingFlushTimer = null;
    }
    await flushPromise;
  }
  if (summaryOk && result) {
    const mctx = await modelContext.getModelContext(session.model);
    const nCtx = mctx?.n_ctx ?? null;
    const [updated] = await ctx.sql<
      { tokens_used: number | null; ctx_used: number | null; ctx_max: number | null; finished_at: string | null }[]
    >`
      UPDATE messages
      SET content = ${result.content},
          status = 'complete',
          tokens_used = ${result.completionTokens},
          ctx_used = ${result.promptTokens},
          ctx_max = ${nCtx},
          finished_at = clock_timestamp()
      WHERE id = ${assistantMessageId}
      RETURNING tokens_used, ctx_used, ctx_max, finished_at
    `;
    ctx.publish(sessionId, {
      type: 'message_complete',
      message_id: assistantMessageId,
      chat_id: chatId,
      tokens_used: updated?.tokens_used ?? null,
      ctx_used: updated?.ctx_used ?? null,
      ctx_max: updated?.ctx_max ?? null,
      started_at: startedAt,
      finished_at: updated?.finished_at ?? null,
      model: session.model,
    });
  } else if (summarySoftCancelled) {
    await ctx.sql`
      UPDATE messages
      SET content = ${accumulated},
          status = 'cancelled',
          finished_at = clock_timestamp()
      WHERE id = ${assistantMessageId}
    `;
    ctx.publish(sessionId, {
      type: 'message_complete',
      message_id: assistantMessageId,
      chat_id: chatId,
    });
  } else {
    // Doom-loop summary failure reuses the existing summary_after_cap_failed
    // error reason — the ErrorReason union is shared between sentinel paths
    // and the UI surfaces a generic "summary failed" line for both. We don't
    // add a new reason code because the user-visible failure mode is the
    // same (model gave up mid-summary). Sentinel below still fires.
    const errMeta: MessageMetadata = {
      kind: 'error',
      error_reason: 'summary_after_cap_failed',
      error_text: summaryError ?? 'doom-loop summary failed',
    };
    await ctx.sql`
      UPDATE messages
      SET content = ${accumulated},
          status = 'failed',
          finished_at = clock_timestamp(),
          metadata = ${ctx.sql.json(errMeta as never)}
      WHERE id = ${assistantMessageId}
    `;
    ctx.publish(sessionId, {
      type: 'error',
      message_id: assistantMessageId,
      chat_id: chatId,
      error: summaryError ?? 'doom-loop summary failed',
      reason: 'summary_after_cap_failed',
    });
  }
  const [sessRow] = await ctx.sql<{ project_id: string; name: string; updated_at: string }[]>`
    UPDATE sessions SET updated_at = clock_timestamp()
    WHERE id = ${sessionId}
    RETURNING project_id, name, updated_at
  `;
  ctx.publishUser({
    type: 'session_updated',
    session_id: sessionId,
    project_id: sessRow!.project_id,
    name: sessRow!.name,
    updated_at: sessRow!.updated_at,
  });
  await insertDoomLoopSentinel(ctx, sessionId, chatId, loop);
  if (summaryOk || summarySoftCancelled) {
    ctx.publishUser({ type: 'chat_status', chat_id: chatId, status: 'idle', at: new Date().toISOString() });
  } else {
    ctx.publishUser({
      type: 'chat_status',
      chat_id: chatId,
      status: 'error',
      at: new Date().toISOString(),
      reason: 'summary_after_cap_failed',
    });
  }
  ctx.log.info(
    { sessionId, chatId, assistantMessageId, loopedTool: loop.name, summaryOk, summaryCancelled: summarySoftCancelled },
    'inference doom-loop summary finished',
  );
 }
 async function insertDoomLoopSentinel(
  ctx: InferenceContext,
  sessionId: string,
  chatId: string,
  loop: { name: string; args: Record<string, unknown> },
 ): Promise<void> {
  // No hard-ceiling / can-continue logic here — doom-loop is a different
  // failure mode from cap-hit. Continuing would re-trigger the loop with
  // the same tools available; the user needs to restate their question
  // or switch agents instead.
  const metadata: MessageMetadata = {
    kind: 'doom_loop',
    tool_name: loop.name,
    args: loop.args,
    threshold: DOOM_LOOP_THRESHOLD,
  };
  const content = `Detected ${DOOM_LOOP_THRESHOLD} identical calls to ${loop.name}. Stopping the tool-call loop. Produce the best answer you can with what you have.`;
  const [row] = await ctx.sql<{ id: string }[]>`
    INSERT INTO messages (session_id, chat_id, role, content, status, created_at, metadata)
    VALUES (${sessionId}, ${chatId}, 'system', ${content}, 'complete', clock_timestamp(), ${ctx.sql.json(metadata as never)})
    RETURNING id
  `;
  // Standard frame sequence — same as cap-hit sentinel — so
  // useSessionStream's reducer appends the row via the existing path.
  ctx.publish(sessionId, {
    type: 'message_started',
    message_id: row!.id,
    chat_id: chatId,
    role: 'system',
  });
  ctx.publish(sessionId, {
    type: 'delta',
    message_id: row!.id,
    chat_id: chatId,
    content,
  });
  ctx.publish(sessionId, {
    type: 'message_complete',
    message_id: row!.id,
    chat_id: chatId,
    metadata,
  });
 }
--- a/apps/server/src/services/inference/sentinels.ts
+++ b/apps/server/src/services/inference/sentinels.ts
@@ -0,0 +1,53 @@
 import type { Message, ToolCall } from '../../types/api.js';
 // v1.11.6: doom-loop guard. When the model calls the same tool with the
 // same arguments DOOM_LOOP_THRESHOLD times in a row within one user-message
 // turn, abort the recursion and run the same wrap-up summary path as the
 // cap-hit case. Ported from opencode (DOOM_LOOP_THRESHOLD in
 // session/processor.ts). Threshold of 3 is the smallest value that doesn't
 // false-positive on a model that retries once after a transient error.
 export const DOOM_LOOP_THRESHOLD = 3;
 // Returns the name + args of the looping tool when the LAST
 // DOOM_LOOP_THRESHOLD entries in `recentToolCalls` are identical (same name
 // AND equal args, compared via JSON.stringify, so key order matters).
 // Returns null otherwise. Pure; exported for unit-test access.
 export function detectDoomLoop(
  recentToolCalls: ToolCall[],
 ): { name: string; args: Record<string, unknown> } | null {
  if (recentToolCalls.length < DOOM_LOOP_THRESHOLD) return null;
  const last = recentToolCalls.slice(-DOOM_LOOP_THRESHOLD);
  const ref = last[0]!;
  const refArgs = JSON.stringify(ref.args);
  for (let i = 1; i < last.length; i++) {
    const tc = last[i]!;
    if (tc.name !== ref.name) return null;
    if (JSON.stringify(tc.args) !== refArgs) return null;
  }
  return { name: ref.name, args: ref.args };
 }
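 // Usage sketch (the tool name and args are illustrative):
 //   const call = (id: string, args: Record<string, unknown>) =>
 //     ({ id, name: 'read_file', args });
 //   detectDoomLoop([call('a', { path: 'x' }), call('b', { path: 'x' }), call('c', { path: 'x' })]);
 //   // -> { name: 'read_file', args: { path: 'x' } }
 //   detectDoomLoop([call('a', { p: 1, q: 2 }), call('b', { p: 1, q: 2 }), call('c', { q: 2, p: 1 })]);
 //   // -> null: JSON.stringify is key-order-sensitive, so the third call differs
 // Call ids are ignored; only the name and the serialized args are compared.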
 export function isCapHitSentinel(m: Message): boolean {
  return (
    m.role === 'system' &&
    m.metadata !== null &&
    typeof m.metadata === 'object' &&
    (m.metadata as { kind?: unknown }).kind === 'cap_hit'
  );
 }
 // v1.11.6: parallel predicate. Same UI-only semantics as cap-hit sentinels —
 // never sent to the LLM (filtered by buildMessagesPayload through the
 // isAnySentinel check below).
 export function isDoomLoopSentinel(m: Message): boolean {
  return (
    m.role === 'system' &&
    m.metadata !== null &&
    typeof m.metadata === 'object' &&
    (m.metadata as { kind?: unknown }).kind === 'doom_loop'
  );
 }
 export function isAnySentinel(m: Message): boolean {
  return isCapHitSentinel(m) || isDoomLoopSentinel(m);
 }
--- a/apps/server/src/services/inference/stream-phase.ts
+++ b/apps/server/src/services/inference/stream-phase.ts
@@ -0,0 +1,482 @@
 import type {
  Agent,
  Session,
  ToolCall,
 } from '../../types/api.js';
 import * as modelContext from '../model-context.js';
 import { toolJsonSchemas, type ToolJsonSchema } from '../tools.js';
 import type { OpenAiMessage } from './payload.js';
 import {
  XML_TOOL_CLOSE,
  XML_TOOL_OPEN,
  parseXmlToolCall,
  partialXmlOpenerStart,
 } from './xml-parser.js';
 import { DB_FLUSH_INTERVAL_MS, type StreamPhaseState } from './types.js';
 import type {
  InferenceContext,
  StreamResult,
  TurnArgs,
 } from './turn.js';
 import { upstreamModel } from './provider.js';
 import {
  jsonSchema,
  streamText,
  tool,
  type JSONValue,
  type ModelMessage,
  type ToolCallRepairFunction,
 } from 'ai';
 interface StreamOptions {
  // null = omit tools entirely (compact phase); [] = caller stripped all tools
  // (rare; we still omit from the request body to avoid OpenAI 400).
  tools: ToolJsonSchema[] | null;
  temperature?: number;
 }
 // v1.13.1-A: convert BooCode's OpenAI-shaped history into AI SDK
 // ModelMessage[]. Tool result messages need a `toolName` field that the
 // OpenAI shape doesn't carry; we look it up by scanning earlier assistant
 // `tool_calls` entries for a matching id.
 function toModelMessages(messages: OpenAiMessage[]): ModelMessage[] {
  const toolNameById = new Map<string, string>();
  for (const m of messages) {
    if (m.role === 'assistant' && m.tool_calls) {
      for (const tc of m.tool_calls) {
        toolNameById.set(tc.id, tc.function.name);
      }
    }
  }
  const out: ModelMessage[] = [];
  for (const m of messages) {
    if (m.role === 'system' || m.role === 'user') {
      out.push({ role: m.role, content: m.content ?? '' });
      continue;
    }
    if (m.role === 'assistant') {
      const hasTools = m.tool_calls && m.tool_calls.length > 0;
      const hasReasoning = typeof m.reasoning === 'string' && m.reasoning.length > 0;
      if (!hasTools && !hasReasoning) {
        // Bare text assistant (string content). null content + no tool_calls
        // is degenerate but harmless to forward.
        out.push({ role: 'assistant', content: m.content ?? '' });
        continue;
      }
      // v1.13.1-C: AI SDK ReasoningPart precedes text + tool-calls in the
      // assistant content array. Reasoning models (qwen3.6) consume their
      // prior reasoning context to resume mid-thought across tool boundaries.
      const parts: Array<
        | { type: 'reasoning'; text: string }
        | { type: 'text'; text: string }
        | { type: 'tool-call'; toolCallId: string; toolName: string; input: unknown }
      > = [];
      if (hasReasoning) {
        parts.push({ type: 'reasoning', text: m.reasoning! });
      }
      if (m.content && m.content.length > 0) {
        parts.push({ type: 'text', text: m.content });
      }
      for (const tc of m.tool_calls ?? []) {
        let input: unknown = {};
        try {
          input = tc.function.arguments.length > 0 ? JSON.parse(tc.function.arguments) : {};
        } catch {
          // Malformed args from a prior turn: pass through as a raw blob so
          // the model sees the same shape it emitted. Wraps the string under
          // _raw to match the buildMessagesPayload upstream convention.
          input = { _raw: tc.function.arguments };
        }
        parts.push({ type: 'tool-call', toolCallId: tc.id, toolName: tc.function.name, input });
      }
      out.push({ role: 'assistant', content: parts });
      continue;
    }
    if (m.role === 'tool') {
      const toolCallId = m.tool_call_id ?? '';
      const toolName = toolNameById.get(toolCallId) ?? 'unknown';
      const raw = m.content ?? '';
      let output: { type: 'text'; value: string } | { type: 'json'; value: JSONValue };
      try {
        // JSON.parse returns `any`; cast to JSONValue since the upstream
        // tool_results column is already JSON-serializable by construction.
        output = { type: 'json', value: JSON.parse(raw) as JSONValue };
      } catch {
        output = { type: 'text', value: raw };
      }
      out.push({
        role: 'tool',
        content: [{ type: 'tool-result', toolCallId, toolName, output }],
      });
      continue;
    }
  }
  return out;
 }
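 // Shape sketch (tool name, ids, and payloads are illustrative; fields not
 // read by toModelMessages are omitted). An input of
 //   { role: 'assistant', content: null, tool_calls: [
 //     { id: 'call_1', function: { name: 'read_file', arguments: '{"path":"a.ts"}' } } ] }
 //   { role: 'tool', tool_call_id: 'call_1', content: '{"ok":true}' }
 // converts to
 //   { role: 'assistant', content: [
 //     { type: 'tool-call', toolCallId: 'call_1', toolName: 'read_file', input: { path: 'a.ts' } } ] }
 //   { role: 'tool', content: [
 //     { type: 'tool-result', toolCallId: 'call_1', toolName: 'read_file',
 //       output: { type: 'json', value: { ok: true } } } ] }
 // A tool message whose content is not valid JSON gets output type 'text', and
 // one whose tool_call_id matches no assistant tool call gets toolName 'unknown'.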
 // Build the AI SDK tools record from BooCode's JSON-schema tool definitions.
 // No `execute` field: BooCode runs tools itself in tool-phase.ts; streamText
 // surfaces the tool-call parts via fullStream and we capture them for the
 // outer loop to dispatch.
 function buildAiTools(schemas: ToolJsonSchema[]): Record<string, ReturnType<typeof tool>> {
  const out: Record<string, ReturnType<typeof tool>> = {};
  for (const s of schemas) {
    out[s.function.name] = tool({
      description: s.function.description,
      inputSchema: jsonSchema(s.function.parameters),
    });
  }
  return out;
 }
 // v1.10.5 Qwen-coder XML fallback. Some local models (notably qwen3-coder via
 // llama-swap) emit tool calls as inline XML inside delta.content rather than
 // the structured tool_calls field. We extract them out of the streamed text
 // before flushing it to the client, mirroring the pre-AI-SDK behavior.
 //
 // XML shape:
 //   <tool_call>
 //   <function=NAME>
 //   <parameter=KEY>VALUE</parameter>
 //   ...
 //   </function>
 //   </tool_call>
 // Multiple <tool_call> blocks may appear back-to-back; they never nest.
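 //
 // Extraction sketch, assuming XML_TOOL_OPEN is '<tool_call>', XML_TOOL_CLOSE is
 // '</tool_call>', and partialXmlOpenerStart returns a negative index when the
 // buffer holds no unclosed opener (none of these are defined in this file; the
 // tool name is illustrative). For a single text-delta of
 //   'Reading. <tool_call><function=read_file><parameter=path>a.ts</parameter></function></tool_call> Done.'
 // the text-delta branch below would call onDelta('Reading. '), then
 // onDelta(' Done.'), and, if parseXmlToolCall accepts the block, push one
 // tool call with id 'xml_call_0'.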
 export async function streamCompletion(
  ctx: InferenceContext,
  model: string,
  messages: OpenAiMessage[],
  opts: StreamOptions,
  onDelta: (content: string) => void,
  onUsage: ((prompt: number | null, completion: number | null) => void) | undefined,
  signal?: AbortSignal
 ): Promise<StreamResult> {
  const aiMessages = toModelMessages(messages);
  const hasTools = opts.tools !== null && opts.tools.length > 0;
  const aiTools = hasTools ? buildAiTools(opts.tools!) : undefined;
  const startedAt = Date.now();
  // v1.13.1-C: accumulate reasoning text across reasoning-delta parts.
  // qwen3.6 emits these on a separate channel from text content; we capture
  // them per stream so finalizeCompletion can dual-write a 'reasoning' part.
  // Replaces the v1.13.1-A counter-only diagnostic.
  let reasoningAccumulated = '';
  // v1.13.3: experimental_repairToolCall keeps the stream alive when the
  // model emits a malformed tool call (bad JSON args, unknown name, etc.).
  // Without a repair function streamText throws and the WHOLE stream dies;
  // with one, the SDK invokes us and we route the bad call through normally.
  // Strategy: pass through unmodified. executeToolPhase's existing error
  // path (unknown tool name → "unknown tool: X" result; zod-reject → tool
  // 'X' rejected — fieldname: required) already gives the model a clean
  // recovery surface on the next turn. Logging gives us visibility into
  // how often qwen3.6 actually emits broken calls.
  const repairToolCall: ToolCallRepairFunction<NonNullable<typeof aiTools>> = async ({
    toolCall,
    error,
  }) => {
    ctx.log.warn(
      {
        toolCallId: toolCall.toolCallId,
        toolName: toolCall.toolName,
        error: error.message,
      },
      'malformed tool call surfaced via repairToolCall',
    );
    return toolCall;
  };
  const result = streamText({
    model: upstreamModel(ctx.config.LLAMA_SWAP_URL, model),
    messages: aiMessages,
    ...(aiTools
      ? { tools: aiTools, toolChoice: 'auto' as const, experimental_repairToolCall: repairToolCall }
      : {}),
    ...(typeof opts.temperature === 'number' ? { temperature: opts.temperature } : {}),
    abortSignal: signal,
  });
  let content = '';
  let pendingBuffer = '';
  let finishReason: string | null = null;
  // v1.13.1-A: AI SDK emits one `tool-call` part per fully-aggregated call,
  // so we no longer need the OpenAI-index reassembly map the manual SSE
  // parser used. XML tool calls extracted from text content go into the
  // same flat list and keep the v1.10.5 synthetic id convention.
  const toolCalls: ToolCall[] = [];
  for await (const part of result.fullStream) {
    switch (part.type) {
      case 'text-delta': {
        pendingBuffer += part.text;
        // Extract any complete <tool_call>...</tool_call> blocks before
        // flushing visible text.
        while (true) {
          const startIdx = pendingBuffer.indexOf(XML_TOOL_OPEN);
          if (startIdx === -1) break;
          const closeIdx = pendingBuffer.indexOf(XML_TOOL_CLOSE, startIdx);
          if (closeIdx === -1) break;
          const blockEnd = closeIdx + XML_TOOL_CLOSE.length;
          const block = pendingBuffer.slice(startIdx, blockEnd);
          if (startIdx > 0) {
            const before = pendingBuffer.slice(0, startIdx);
            content += before;
            onDelta(before);
          }
          const parsedCall = parseXmlToolCall(block);
          if (parsedCall) {
            const synthIdx = toolCalls.length;
            toolCalls.push({
              id: `xml_call_${synthIdx}`,
              name: parsedCall.name,
              args: parsedCall.args,
            });
          }
          // Parse failures still drop the block — leaking <tool_call> XML to
          // the chat would look worse than silently swallowing the bad block.
          pendingBuffer = pendingBuffer.slice(blockEnd);
        }
        // Hold back any (partial or full) unclosed opener; flush the rest.
        const partialIdx = partialXmlOpenerStart(pendingBuffer);
        if (partialIdx >= 0) {
          if (partialIdx > 0) {
            const flush = pendingBuffer.slice(0, partialIdx);
            content += flush;
            onDelta(flush);
          }
          pendingBuffer = pendingBuffer.slice(partialIdx);
        } else if (pendingBuffer.length > 0) {
          content += pendingBuffer;
          onDelta(pendingBuffer);
          pendingBuffer = '';
        }
        break;
      }
      case 'tool-call': {
        // AI SDK has already parsed the input into an object. Match the
        // ToolCall shape BooCode passes around in toolCallsBuffer downstream.
        toolCalls.push({
          id: part.toolCallId,
          name: part.toolName,
          args: (part.input ?? {}) as Record<string, unknown>,
        });
        break;
      }
      case 'reasoning-delta': {
        // v1.13.1-C: accumulate; finalizeCompletion / executeToolPhase
        // dual-write the resulting text as a kind='reasoning' part.
        if (typeof part.text === 'string') {
          reasoningAccumulated += part.text;
        }
        break;
      }
      case 'finish': {
        if (typeof part.finishReason === 'string') {
          finishReason = part.finishReason;
        }
        break;
      }
      case 'error': {
        const err = part.error;
        throw err instanceof Error ? err : new Error(String(err));
      }
      // Intentional no-op: start, start-step, text-start, text-end,
      // reasoning-start, reasoning-end, source, file, tool-input-start,
      // tool-input-delta, tool-input-end, tool-result, tool-error,
      // finish-step, raw. We only care about the aggregated tool-call and
      // text-delta paths above; the rest are AI SDK lifecycle/streaming
      // breadcrumbs that don't change BooCode's persistence or WS contract.
      default:
        break;
    }
  }
  // v1.13.1-A: drain any buffered partial XML opener as plain text. The
  // pre-AI-SDK path did this on stream end too — better to leak `<tool_c`
  // than vanish the text.
  if (pendingBuffer.length > 0) {
    content += pendingBuffer;
    onDelta(pendingBuffer);
    pendingBuffer = '';
  }
  // AI SDK v6 fullStream returns normally on abort; check signal explicitly.
  // Without this throw the row would land as status='complete' with partial
  // content instead of going through handleAbortOrError → status='cancelled'.
  // Smoke D caught this in v1.13.1-A — don't refactor it away.
  if (signal?.aborted) {
    const abortErr = new Error('aborted');
    abortErr.name = 'AbortError';
    throw abortErr;
  }
  // Usage lands as a promise on the result; awaiting after fullStream is
  // drained is safe. AI SDK v6 names: `inputTokens` / `outputTokens`.
  let promptTokens: number | null = null;
  let completionTokens: number | null = null;
  try {
    const usage = await result.usage;
    if (typeof usage.inputTokens === 'number') promptTokens = usage.inputTokens;
    if (typeof usage.outputTokens === 'number') completionTokens = usage.outputTokens;
  } catch {
    // Some providers omit usage on partial streams; leave both null.
  }
  if (onUsage && (promptTokens !== null || completionTokens !== null)) {
    onUsage(promptTokens, completionTokens);
  }
  if (reasoningAccumulated.length > 0) {
    ctx.log.debug(
      { reasoningChars: reasoningAccumulated.length, model, elapsed_ms: Date.now() - startedAt },
      'streamCompletion: captured reasoning',
    );
  }
  return {
    finishReason,
    content,
    toolCalls,
    promptTokens,
    completionTokens,
    reasoning: reasoningAccumulated,
  };
 }
 export async function executeStreamPhase(
  ctx: InferenceContext,
  args: TurnArgs,
  session: Session,
  messages: OpenAiMessage[],
  state: StreamPhaseState,
  agent: Agent | null,
  // v1.11.8: when false, web_search and web_fetch are stripped from the
  // tool list sent to the LLM, so the model can't even attempt them.
  webToolsEnabled: boolean,
 ): Promise<StreamResult> {
  const { sessionId, chatId, assistantMessageId, signal } = args;
  const startedRow = await ctx.sql<{ started_at: string }[]>`
    UPDATE messages
    SET started_at = clock_timestamp()
    WHERE id = ${assistantMessageId}
    RETURNING started_at
  `;
  state.startedAt = startedRow[0]?.started_at ?? null;
  ctx.publish(sessionId, {
    type: 'message_started',
    message_id: assistantMessageId,
    chat_id: chatId,
    role: 'assistant',
  });
  let pendingFlushTimer: NodeJS.Timeout | null = null;
  let flushPromise: Promise<unknown> = Promise.resolve();
  const flushNow = () => {
    if (pendingFlushTimer) {
      clearTimeout(pendingFlushTimer);
      pendingFlushTimer = null;
    }
    const snapshot = state.accumulated;
    flushPromise = flushPromise.then(() =>
      ctx.sql`UPDATE messages SET content = ${snapshot} WHERE id = ${assistantMessageId}`
    );
  };
  const scheduleFlush = () => {
    if (pendingFlushTimer) return;
    pendingFlushTimer = setTimeout(() => {
      pendingFlushTimer = null;
      flushNow();
    }, DB_FLUSH_INTERVAL_MS);
  };
  // Tool whitelist: if an agent is set, filter the global tool list to only the
  // tool names it allows. Unknown names in agent.tools are dropped silently
  // (handled here by intersection). When no agent: send all tools.
  // v1.11.8: a second filter strips web_search + web_fetch unless the chat
  // has them explicitly enabled. Counts as an opt-in security boundary: the
  // model can't summon a tool that wasn't offered to it.
  const WEB_TOOL_NAMES: ReadonlySet<string> = new Set(['web_search', 'web_fetch']);
  const effectiveTools: ToolJsonSchema[] = (agent
    ? toolJsonSchemas().filter((t) => agent.tools.includes(t.function.name))
    : toolJsonSchemas()
  ).filter((t) => webToolsEnabled || !WEB_TOOL_NAMES.has(t.function.name));
  const effectiveTemperature = agent?.temperature;
  // v1.12.2: ctx_max lookup is cached after the first hit per model, so this
  // is a Map probe in steady state. We capture nCtx once at the top of the
  // stream so the throttled usage publish doesn't refetch each tick.
  const mctxForStream = await modelContext.getModelContext(session.model);
  const nCtxForStream = mctxForStream?.n_ctx ?? null;
  // v1.12.2 → v1.13.1-A: live usage publishes were throttled to ~500ms when
  // the manual SSE parser saw `parsed.usage` per chunk. AI SDK v6 surfaces
  // usage only at stream end (result.usage promise), so the throttle is
  // effectively a single trailing publish. ChatThroughput will tick once at
  // stream completion rather than mid-stream — known regression vs v1.12.2,
  // recovered if a future dispatch interpolates from delta cadence.
  const USAGE_THROTTLE_MS = 500;
  let lastUsageAt = 0;
  let pendingUsage: { p: number | null; c: number | null } | null = null;
  let usageTimer: NodeJS.Timeout | null = null;
  const flushUsage = () => {
    if (!pendingUsage) return;
    const { p, c } = pendingUsage;
    pendingUsage = null;
    lastUsageAt = Date.now();
    ctx.publish(sessionId, {
      type: 'usage',
      message_id: assistantMessageId,
      chat_id: chatId,
      completion_tokens: c,
      ctx_used: p,
      ctx_max: nCtxForStream,
    });
  };
  try {
    return await streamCompletion(
      ctx,
      session.model,
      messages,
      { tools: effectiveTools, temperature: effectiveTemperature },
      (delta) => {
        state.accumulated += delta;
        ctx.publish(sessionId, {
          type: 'delta',
          message_id: assistantMessageId,
          chat_id: chatId,
          content: delta,
        });
        ctx.log.debug({ sessionId, delta }, 'inference delta');
        scheduleFlush();
      },
      (prompt, completion) => {
        pendingUsage = { p: prompt, c: completion };
        const elapsed = Date.now() - lastUsageAt;
        if (elapsed >= USAGE_THROTTLE_MS) {
          flushUsage();
        } else if (!usageTimer) {
          usageTimer = setTimeout(() => {
            usageTimer = null;
            flushUsage();
          }, USAGE_THROTTLE_MS - elapsed);
        }
      },
      signal
    );
  } finally {
    if (pendingFlushTimer) {
      clearTimeout(pendingFlushTimer);
      pendingFlushTimer = null;
    }
    if (usageTimer) {
      clearTimeout(usageTimer);
      usageTimer = null;
    }
    await flushPromise;
  }
 }
--- a/apps/server/src/services/inference/tool-phase.ts
+++ b/apps/server/src/services/inference/tool-phase.ts
@@ -0,0 +1,256 @@
 import type { Session, ToolCall } from '../../types/api.js';
 import * as modelContext from '../model-context.js';
 import { PathScopeError } from '../path_guard.js';
 import { TOOLS_BY_NAME } from '../tools.js';
 import { maybeFlagForCompaction } from './payload.js';
 import { insertParts, partsFromAssistantMessage, partsFromToolMessage } from './parts.js';
 import type {
  InferenceContext,
  StreamResult,
  TurnArgs,
 } from './turn.js';
 // v1.12.4: ESM value-import cycle. executeToolPhase recurses into
 // runAssistantTurn, which lives in turn.ts. The cycle is safe because
 // the reference is read at call time (inside an async function body), not
 // at module top-level. Node + tsc resolve this cleanly.
 import { runAssistantTurn } from './turn.js';
 async function executeToolCall(
  projectRoot: string,
  toolCall: ToolCall
 ): Promise<{ output: unknown; truncated: boolean; error?: string }> {
  const tool = TOOLS_BY_NAME[toolCall.name];
  if (!tool) {
    return { output: null, truncated: false, error: `unknown tool: ${toolCall.name}` };
  }
  const parsed = tool.inputSchema.safeParse(toolCall.args);
  if (!parsed.success) {
    // v1.12 Track B.2: enrich the zod-reject path so the model sees a
    // one-line, tool-named hint ("tool 'search_symbols' rejected — query:
    // Required") instead of a JSON blob of flatten output. Higher recovery
    // rate on the next turn; doom-loop guard still bounds infinite retries.
    // The cast is because tool.inputSchema is ZodType<unknown>, so zod can't
    // statically narrow flatten()'s fieldErrors key set — but the runtime
    // shape is the standard { formErrors: string[]; fieldErrors: Record<...> }.
    const flatten = parsed.error.flatten() as {
      formErrors: string[];
      fieldErrors: Record<string, string[] | undefined>;
    };
    const fieldErrors = Object.entries(flatten.fieldErrors)
      .map(([field, errs]) => `${field}: ${errs?.[0] ?? 'invalid'}`)
      .join('; ');
    const formError = flatten.formErrors[0];
    const hint = fieldErrors || formError || 'unknown validation error';
    return {
      output: null,
      truncated: false,
      error: `tool '${toolCall.name}' rejected — ${hint}`,
    };
  }
  try {
    const output = await tool.execute(parsed.data, projectRoot);
    const truncated =
      typeof output === 'object' && output !== null && 'truncated' in output
        ? Boolean((output as { truncated: unknown }).truncated)
        : false;
    return { output, truncated };
  } catch (err) {
    if (err instanceof PathScopeError) {
      return { output: null, truncated: false, error: err.message };
    }
    return {
      output: null,
      truncated: false,
      error: err instanceof Error ? err.message : String(err),
    };
  }
 }
 export async function executeToolPhase(
  ctx: InferenceContext,
  args: TurnArgs,
  result: StreamResult,
  startedAt: string | null,
  session: Session,
  projectRoot: string
 ): Promise<void> {
  const { sessionId, chatId, assistantMessageId, toolsUsed, signal } = args;
  const { content, toolCalls, promptTokens, completionTokens } = result;
  // v1.11.3: ctx_max comes from llama-swap /upstream/<model>/props, not the
  // streaming completion (which doesn't emit n_ctx). getModelContext caches
  // the positive lookup for the process lifetime, so this is a single Map
  // hit after the first invocation per model.
  const mctx = await modelContext.getModelContext(session.model);
  const nCtx = mctx?.n_ctx ?? null;
  const [updated] = await ctx.sql<
    { tokens_used: number | null; ctx_used: number | null; ctx_max: number | null; finished_at: string | null }[]
  >`
    UPDATE messages
    SET content = ${content},
        status = 'complete',
        tool_calls = ${ctx.sql.json(toolCalls as never)},
        tokens_used = ${completionTokens},
        ctx_used = ${promptTokens},
        ctx_max = ${nCtx},
        finished_at = clock_timestamp()
    WHERE id = ${assistantMessageId}
    RETURNING tokens_used, ctx_used, ctx_max, finished_at
  `;
  // v1.13.0: dual-write to message_parts. v1.13.1-B made parts authoritative
  // for reads via the messages_with_parts view; the JSON column write above
  // remains for v1.13.1 fallback compatibility (dropped in v1.13.2).
  // v1.13.1-C: include result.reasoning so models with separate reasoning
  // channels (qwen3.6) get a kind='reasoning' part at sequence 0.
  // TODO(v1.13.1): wrap the UPDATE above and this insertParts in a single
  // sql.begin before flipping read authority to message_parts. Without the
  // transaction, a crash between the two leaves an orphan message that
  // becomes invisible in the parts-authoritative read path.
  await insertParts(
    ctx.sql,
    partsFromAssistantMessage({
      content,
      tool_calls: toolCalls,
      reasoning: result.reasoning,
    }).map((p) => ({
      ...p,
      message_id: assistantMessageId,
    })),
  );
  // v1.11: flag for compaction if this turn pushed us over the usable budget.
  // We never compact mid-loop (the recursive runAssistantTurn keeps tools
  // flowing); the flag fires on the NEXT turn's pre-fetch hook above.
  await maybeFlagForCompaction(ctx, chatId, updated);
  const [toolSessRow] = await ctx.sql<{ project_id: string; name: string; updated_at: string }[]>`
    UPDATE sessions SET updated_at = clock_timestamp()
    WHERE id = ${sessionId}
    RETURNING project_id, name, updated_at
  `;
   ctx.publishUser({
     type: 'session_updated',
     session_id: sessionId,
     project_id: toolSessRow!.project_id,
     name: toolSessRow!.name,
     updated_at: toolSessRow!.updated_at,
   });
  for (const tc of toolCalls) {
    ctx.publish(sessionId, {
      type: 'tool_call',
      message_id: assistantMessageId,
      chat_id: chatId,
      tool_call: tc,
    });
  }
  ctx.publish(sessionId, {
    type: 'message_complete',
    message_id: assistantMessageId,
    chat_id: chatId,
    tokens_used: updated?.tokens_used ?? null,
    ctx_used: updated?.ctx_used ?? null,
    ctx_max: updated?.ctx_max ?? null,
    started_at: startedAt,
    finished_at: updated?.finished_at ?? null,
    model: session.model,
  });
  // Batch 9.7: ask_user_input pauses the loop. The tool row is still inserted
  // (the answer endpoint needs a target row to UPDATE), but tool_results is
  // pre-stamped with output=null as a "pending" sentinel and no tool_result
  // frame goes out — the card renders from the tool_call frame alone. Mixed
  // batches still execute the other tools normally.
  ctx.publishUser({ type: 'chat_status', chat_id: chatId, status: 'tool_running', at: new Date().toISOString() });
  let pausingForUserInput = false;
  await Promise.all(
    toolCalls.map(async (tc) => {
      const [toolRow] = await ctx.sql<{ id: string }[]>`
        INSERT INTO messages (session_id, chat_id, role, content, status, created_at)
        VALUES (${sessionId}, ${chatId}, 'tool', '', 'complete', clock_timestamp())
        RETURNING id
      `;
      const toolMessageId = toolRow!.id;
      if (tc.name === 'ask_user_input') {
        pausingForUserInput = true;
        const sentinel = { tool_call_id: tc.id, output: null, truncated: false };
        await ctx.sql`
          UPDATE messages
          SET tool_results = ${ctx.sql.json(sentinel as never)}
          WHERE id = ${toolMessageId}
        `;
        // v1.13.0: mirror the pending sentinel into message_parts. The
        // answer-endpoint UPDATE later (messages.ts:576) will delete and
        // re-insert this part when the user submits their answer.
        // TODO(v1.13.1): wrap the INSERT + UPDATE + insertParts triple in
        // a per-iteration sql.begin before flipping read authority.
        await insertParts(
          ctx.sql,
          partsFromToolMessage({ tool_results: sentinel }).map((p) => ({
            ...p,
            message_id: toolMessageId,
          })),
        );
        return;
      }
      const tres = await executeToolCall(projectRoot, tc);
      const stored = {
        tool_call_id: tc.id,
        output: tres.output,
        truncated: tres.truncated,
        ...(tres.error ? { error: tres.error } : {}),
      };
      await ctx.sql`
        UPDATE messages
        SET tool_results = ${ctx.sql.json(stored as never)}
        WHERE id = ${toolMessageId}
      `;
      // v1.13.0: dual-write the tool_result part.
      // TODO(v1.13.1): wrap the INSERT + UPDATE + insertParts triple in a
      // per-iteration sql.begin before flipping read authority.
      await insertParts(
        ctx.sql,
        partsFromToolMessage({ tool_results: stored }).map((p) => ({
          ...p,
          message_id: toolMessageId,
        })),
      );
      ctx.publish(sessionId, {
        type: 'tool_result',
        tool_message_id: toolMessageId,
        chat_id: chatId,
        tool_call_id: tc.id,
        output: tres.output,
        truncated: tres.truncated,
        ...(tres.error ? { error: tres.error } : {}),
      });
    })
  );
  if (pausingForUserInput) {
    ctx.publishUser({
      type: 'chat_status',
      chat_id: chatId,
      status: 'waiting_for_input',
      at: new Date().toISOString(),
    });
    ctx.log.info(
      { sessionId, chatId, assistantMessageId },
      'inference paused awaiting user input',
    );
    return;
  }
  const [nextAssistant] = await ctx.sql<{ id: string }[]>`
    INSERT INTO messages (session_id, chat_id, role, content, status, created_at)
    VALUES (${sessionId}, ${chatId}, 'assistant', '', 'streaming', clock_timestamp())
    RETURNING id
  `;
  await runAssistantTurn(ctx, {
    sessionId,
    chatId,
    assistantMessageId: nextAssistant!.id,
    // v1.8.2: charge this turn's actual tool invocations against the budget.
    // One assistant message can emit multiple tool_calls, so we add the run
    // count, not 1. The next turn's budget check sees the cumulative total.
    toolsUsed: toolsUsed + result.toolCalls.length,
    // v1.11.6: append the just-executed tool calls to the per-turn history
    // so the next runAssistantTurn's doom-loop check can see them. We don't
    // cap the array length here — per-turn budgets keep it bounded
    // (typically <30 entries), and slicing happens inside detectDoomLoop.
    recentToolCalls: [...args.recentToolCalls, ...result.toolCalls],
    signal,
  });
 }
--- a/apps/server/src/services/inference/turn.ts
+++ b/apps/server/src/services/inference/turn.ts
@@ -0,0 +1,329 @@
 import type { FastifyBaseLogger } from 'fastify';
 import type { Sql } from '../../db.js';
 import type { Config } from '../../config.js';
 import type {
  Agent,
  ErrorReason,
  Message,
  MessageMetadata,
  Project,
  Session,
  ToolCall,
  UserStreamFrame,
 } from '../../types/api.js';
 import { ALL_TOOLS } from '../tools.js';
 import { resolveProjectRoot } from '../path_guard.js';
 import { maybeAutoNameChat } from '../auto_name.js';
 import { getAgentById } from '../agents.js';
 import * as compaction from '../compaction.js';
 import * as modelContext from '../model-context.js';
 import type { Broker } from '../broker.js';
 import { resolveToolBudget } from './budget.js';
 import {
  DOOM_LOOP_THRESHOLD,
  detectDoomLoop,
 } from './sentinels.js';
 import {
  buildMessagesPayload,
  loadContext,
 } from './payload.js';
 import {
  finalizeCompletion,
  handleAbortOrError,
 } from './error-handler.js';
 import {
  executeStreamPhase,
  streamCompletion,
 } from './stream-phase.js';
 import { executeToolPhase } from './tool-phase.js';
 import { DB_FLUSH_INTERVAL_MS, type StreamPhaseState } from './types.js';
 import {
  runCapHitSummary,
  runDoomLoopSummary,
 } from './sentinel-summaries.js';
 // v1.12.4: re-exported so external callers (tests, future consumers) keep
 // importing from services/inference.js as the public surface.
 export { detectDoomLoop, DOOM_LOOP_THRESHOLD } from './sentinels.js';
 export { buildMessagesPayload } from './payload.js';
 export interface InferenceFrame {
  type:
    | 'message_started'
    | 'delta'
    | 'tool_call'
    | 'tool_result'
    | 'message_complete'
    | 'usage'
    | 'messages_deleted'
    | 'session_renamed'
    | 'chat_renamed'
    | 'error';
  message_id?: string;
  message_ids?: string[];
  chat_id?: string;
  tool_message_id?: string;
  tool_call_id?: string;
  // v1.8.2: 'system' added so cap-hit sentinel messages can announce themselves
  // through the normal message_started → delta → message_complete sequence.
  role?: 'assistant' | 'tool' | 'user' | 'system';
  content?: string;
  tool_call?: ToolCall;
  output?: unknown;
  truncated?: boolean;
  error?: string;
  // v1.8.2: structured error reason. Set on `type: 'error'` so the UI can
  // surface a specific message; `error` stays the human-readable text.
  reason?: ErrorReason;
  // v1.8.2: piggybacks on `message_complete` so static or terminally-resolved
  // messages can carry their persisted metadata to the live stream without a
  // refetch (sentinels carry { kind: 'cap_hit', ... }; failed messages carry
  // { kind: 'error', ... }).
  metadata?: MessageMetadata | null;
  tokens_used?: number | null;
  ctx_used?: number | null;
  ctx_max?: number | null;
  completion_tokens?: number | null;
  started_at?: string | null;
  finished_at?: string | null;
  model?: string;
  session_id?: string;
  name?: string;
 }
 export type FramePublisher = (sessionId: string, frame: InferenceFrame) => void;
 export interface InferenceContext {
  sql: Sql;
  config: Config;
  log: FastifyBaseLogger;
  publish: FramePublisher;
  publishUser: (frame: UserStreamFrame) => void;
  // v1.11: passed through so compaction.process can publish 'compacted'
  // frames on the same session WS channel useSessionStream subscribes to.
  // Compaction is the only path that needs the raw broker handle (regular
  // inference goes through `publish`); keeping a separate field avoids
  // tempting other code paths into bypassing the session-id binding.
  broker: Broker;
 }
 // v1.12.4: payload assembly extracted to ./payload.ts (tests import
 // buildMessagesPayload from this module, so the re-export above
 // preserves the public surface). Stream + tool phases extracted to
 // ./stream-phase.ts and ./tool-phase.ts.
 export interface StreamResult {
  finishReason: string | null;
  content: string;
  toolCalls: ToolCall[];
  promptTokens: number | null;
  completionTokens: number | null;
  // v1.13.1-C: reasoning text accumulated across reasoning-delta parts.
  // Empty string when the model doesn't emit reasoning (most cases).
  reasoning: string;
 }
 export interface TurnArgs {
  sessionId: string;
  chatId: string;
  assistantMessageId: string;
  // v1.8.2: cumulative tool calls executed this run. Compared against the
  // resolved budget at the top of each turn. Replaces the older `depth`
  // counter (which counted iterations, not invocations).
  toolsUsed: number;
  // v1.11.6: ordered tool calls executed in this user-message turn (across
  // recursive runAssistantTurn invocations). Reset to [] at user-message
  // boundaries by runInference, same as toolsUsed. Doom-loop check at the
  // top of runAssistantTurn slices the last DOOM_LOOP_THRESHOLD entries.
  recentToolCalls: ToolCall[];
  signal: AbortSignal | undefined;
 }
 export async function runAssistantTurn(
  ctx: InferenceContext,
  args: TurnArgs,
 ): Promise<void> {
  const { sessionId, chatId } = args;
  // v1.11: if the prior turn flagged this chat for compaction, run it first
  // so loadContext below reads the post-compaction history. We swallow
  // compaction failures (clearing the flag so we don't loop) and proceed
  // with the un-compacted history — a slow turn that hits the model's
  // hard limit is recoverable; a dead session is not.
  const chatFlag = await ctx.sql<{ needs_compaction: boolean }[]>`
    SELECT needs_compaction FROM chats WHERE id = ${chatId}
  `;
  if (chatFlag[0]?.needs_compaction) {
    try {
      await compaction.process({
        sql: ctx.sql,
        config: ctx.config,
        log: ctx.log,
        broker: ctx.broker,
        chatId,
      });
    } catch (err) {
      ctx.log.warn({ err, chatId }, 'auto-compaction failed; clearing flag and proceeding');
      await ctx.sql`UPDATE chats SET needs_compaction = false WHERE id = ${chatId}`;
    }
  }
  const loaded = await loadContext(ctx.sql, sessionId, chatId);
  if (!loaded) {
    ctx.log.warn({ sessionId }, 'inference: session or project missing');
    return;
  }
  const { session, project, history } = loaded;
  const projectRoot = await resolveProjectRoot(project.path);
  // Agent resolution is per-turn so PATCH agent_id mid-conversation takes
  // effect on the next message. Unknown agent_id returns null silently —
  // session falls back to base prompt + all tools + default temperature.
  const agent = session.agent_id
    ? await getAgentById(project.path, session.agent_id)
    : null;
  // v1.8.2: cap-hit replaces the older "tool loop depth exceeded" failure.
  // When we've already burned the budget *before* this turn even runs, we
  // skip straight to the summary flow — the in-flight assistant message slot
  // gets reused for the wrap-up reply instead of being marked failed.
  const budget = resolveToolBudget(agent);
  if (args.toolsUsed >= budget) {
    await runCapHitSummary(ctx, args, session, project, history, agent, budget);
    return;
  }
   // v1.11.6: doom-loop guard. Checked after the budget cap in code order,
   // but in practice it trips first (the model can burn through 3 identical
   // calls long before the 15-call budget fires).
   // Same in-flight-slot-reuse pattern as runCapHitSummary: the wrap-up reply
   // lands in args.assistantMessageId, then a doom_loop sentinel is inserted
   // to make the abort visible in the chat history.
  const loop = detectDoomLoop(args.recentToolCalls);
  if (loop) {
    await runDoomLoopSummary(ctx, args, session, project, history, agent, loop);
    return;
  }
  const messages = await buildMessagesPayload(session, project, history, agent);
  // v1.11.8: resolve per-chat web-tools opt-in. Tri-state on the wire:
  //   - session.web_search_enabled = null → inherit project default
  //   - session.web_search_enabled = true/false → explicit
  // Both web_search and web_fetch are gated by this single flag (the UI
  // label is "Enable web search and fetch" — same store, both tools).
  // Default is false unless explicitly opted in, matching the v1.9
  // plumbing intent ("inert until Batch 8 ships the actual tools").
  const webToolsEnabled =
    session.web_search_enabled ?? project.default_web_search_enabled ?? false;
  const state: StreamPhaseState = { accumulated: '', startedAt: null };
  let result: StreamResult;
  try {
    result = await executeStreamPhase(ctx, args, session, messages, state, agent, webToolsEnabled);
  } catch (err) {
    await handleAbortOrError(ctx, args, state.accumulated, err);
    return;
  }
  if (result.toolCalls.length > 0) {
    await executeToolPhase(ctx, args, result, state.startedAt, session, projectRoot);
    return;
  }
  await finalizeCompletion(ctx, args, result, state.startedAt, session);
 }
 export async function runInference(
  ctx: InferenceContext,
  sessionId: string,
  chatId: string,
  assistantMessageId: string,
  signal?: AbortSignal
 ): Promise<void> {
  // v1.8.2: every fresh inference (initial send, regenerate, force_send,
  // continue) starts with a clean budget. Tool-call accumulation across
  // Continue invocations is what the hard ceiling guards against, not the
  // per-call budget.
  // v1.11.6: recentToolCalls also resets — doom-loop detection is scoped
  // to a single user-message turn, so a Continue starts with no history.
  return runAssistantTurn(ctx, {
    sessionId,
    chatId,
    assistantMessageId,
    toolsUsed: 0,
    recentToolCalls: [],
    signal,
  });
 }
 // v1.8.2: cap-hit summary flow (runCapHitSummary, imported above from
 // ./sentinel-summaries.js). Called instead of erroring when the loop
 // hits its budget. Reuses the in-flight assistant message slot to stream a
 // short wrap-up reply with the synthetic note prepended and tools disabled,
 // then always inserts a cap_hit sentinel afterward (regardless of summary
 // outcome) so the UI can show a Continue affordance.
 interface InferenceRegistration {
  controller: AbortController;
  completed: Promise<void>;
 }
 export function createInferenceRunner(
  ctx: Omit<InferenceContext, 'publishUser'>,
  publishUserFn: (user: string, frame: UserStreamFrame) => void
 ) {
  const registry = new Map<string, InferenceRegistration>();
  return {
    enqueue(sessionId: string, chatId: string, assistantMessageId: string, user: string) {
      const callCtx: InferenceContext = {
        ...ctx,
        publishUser: (frame) => publishUserFn(user, frame),
        // v1.11: broker comes in via ctx (set at registration time). Repeated
        // here so the destructure carries it onto the per-call ctx without
        // having to add it to every enqueue/cancel signature individually.
        broker: ctx.broker,
      };
      // v1.8 mobile-tabs: announce working before the async loop starts so
      // every device subscribed to the user channel sees the amber dot.
      callCtx.publishUser({ type: 'chat_status', chat_id: chatId, status: 'streaming', at: new Date().toISOString() });
      const controller = new AbortController();
      let resolveCompleted!: () => void;
      const completed = new Promise<void>((res) => { resolveCompleted = res; });
      const registration: InferenceRegistration = { controller, completed };
      registry.set(chatId, registration);
      void (async () => {
        try {
          await runInference(callCtx, sessionId, chatId, assistantMessageId, controller.signal);
          setImmediate(() => {
            void maybeAutoNameChat(callCtx, chatId, sessionId).catch((err: Error) => {
              callCtx.log.warn({ err, chatId }, 'auto-name failed');
            });
          });
        } catch (err) {
          callCtx.log.error({ err }, 'unhandled inference error');
        } finally {
          resolveCompleted();
          // Only clear our own registration; a force-send may have replaced it.
          if (registry.get(chatId) === registration) {
            registry.delete(chatId);
          }
        }
      })();
    },
    async cancel(_sessionId: string, chatId: string): Promise<boolean> {
      const reg = registry.get(chatId);
      if (!reg) return false;
      reg.controller.abort();
      // Swallow — we just need to wait for the catch/finally to persist state.
      await reg.completed.catch(() => {});
      return true;
    },
    hasActive(chatId: string): boolean {
      return registry.has(chatId);
    },
  };
 }
 export const _toolNames = ALL_TOOLS.map((t) => t.name);
--- a/apps/server/src/services/inference/types.ts
+++ b/apps/server/src/services/inference/types.ts
@@ -0,0 +1,13 @@
 // v1.12.4: shared inter-phase types/constants for the extracted phase files.
 // Lives here so stream-phase, tool-phase, and the summary functions in
 // sentinel-summaries.ts can all reference the same definitions without
 // circular imports.
 export interface StreamPhaseState {
  accumulated: string;
  startedAt: string | null;
 }
 // 500ms keeps the DB UPDATE rate bounded under heavy streaming. Used by
 // executeStreamPhase, runCapHitSummary, and runDoomLoopSummary — every site
 // that does a debounced content flush during streaming.
 export const DB_FLUSH_INTERVAL_MS = 500;
--- a/apps/server/src/services/inference/xml-parser.ts
+++ b/apps/server/src/services/inference/xml-parser.ts
@@ -0,0 +1,53 @@
 // v1.10.5: XML-tag tool-call fallback. Some models emit
 // <tool_call><function=foo><parameter=key>value</parameter></function></tool_call>
 // in plain content instead of using the OpenAI tool_calls JSON channel.
 // The streaming loop in inference.ts extracts these blocks via these helpers.
 export const XML_TOOL_OPEN = '<tool_call>';
 export const XML_TOOL_CLOSE = '</tool_call>';
 export function parseXmlToolCall(
  block: string,
 ): { name: string; args: Record<string, unknown> } | null {
  const nameMatch = block.match(/<function=([^>]+)>/);
  if (!nameMatch || !nameMatch[1]) return null;
  const name = nameMatch[1].trim();
  if (!name) return null;
  const args: Record<string, unknown> = {};
  // Non-greedy body so each <parameter=…>…</parameter> pair is matched
  // independently even when multiple appear in the same block.
  const paramRe = /<parameter=([^>]+)>([\s\S]*?)<\/parameter>/g;
  for (const m of block.matchAll(paramRe)) {
    const key = (m[1] ?? '').trim();
    if (!key) continue;
    const raw = (m[2] ?? '').trim();
    try {
      args[key] = JSON.parse(raw);
    } catch {
      args[key] = raw;
    }
  }
  return { name, args };
 }
 // Locate the first character that begins (or completely contains) an
 // unfinished <tool_call> opener in `s`. Returns -1 when `s` can be flushed
 // to the client in full without risking a partial tag leak.
 //   Case 1: a full `<tool_call>` opener with no matching closer — caller
 //           must keep everything from that index forward until the next
 //           chunk arrives with the closer.
 //   Case 2: `s` ends with a strict prefix of `<tool_call>` (e.g. `<tool_c`).
 //           Caller must keep just that suffix in the buffer.
 // Note: case 1 assumes the calling loop already extracted every complete
 // <tool_call>…</tool_call> pair before reaching this check.
 export function partialXmlOpenerStart(s: string): number {
  const fullOpener = s.indexOf(XML_TOOL_OPEN);
  if (fullOpener !== -1) return fullOpener;
  const lastLt = s.lastIndexOf('<');
  if (lastLt === -1) return -1;
  const suffix = s.slice(lastLt);
  if (XML_TOOL_OPEN.startsWith(suffix) && suffix.length < XML_TOOL_OPEN.length) {
    return lastLt;
  }
  return -1;
 }
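The two-case contract documented on `partialXmlOpenerStart` can be illustrated with a small buffering sketch. The streaming loop itself is not part of this diff, so `drainBuffer` and `DrainResult` below are hypothetical names used only for illustration; the helper and the two constants are copied from the hunk above so the block is self-contained.

```typescript
const XML_TOOL_OPEN = '<tool_call>';
const XML_TOOL_CLOSE = '</tool_call>';

// Copied from xml-parser.ts above.
function partialXmlOpenerStart(s: string): number {
  const fullOpener = s.indexOf(XML_TOOL_OPEN);
  if (fullOpener !== -1) return fullOpener;
  const lastLt = s.lastIndexOf('<');
  if (lastLt === -1) return -1;
  const suffix = s.slice(lastLt);
  if (XML_TOOL_OPEN.startsWith(suffix) && suffix.length < XML_TOOL_OPEN.length) {
    return lastLt;
  }
  return -1;
}

// Hypothetical result shape, not taken from the repository.
interface DrainResult {
  flush: string;    // text that is safe to send to the client now
  blocks: string[]; // complete <tool_call>...</tool_call> blocks, in order
  held: string;     // suffix to keep buffered until the next chunk arrives
}

// Hypothetical helper: one pass over the accumulated buffer.
function drainBuffer(buffer: string): DrainResult {
  const blocks: string[] = [];
  let rest = buffer;
  // Step 1: pull out every complete opener/closer pair, as the comment on
  // partialXmlOpenerStart assumes the calling loop has already done.
  for (;;) {
    const open = rest.indexOf(XML_TOOL_OPEN);
    if (open === -1) break;
    const close = rest.indexOf(XML_TOOL_CLOSE, open + XML_TOOL_OPEN.length);
    if (close === -1) break;
    const end = close + XML_TOOL_CLOSE.length;
    blocks.push(rest.slice(open, end));
    rest = rest.slice(0, open) + rest.slice(end);
  }
  // Step 2: hold back an unfinished opener (case 1) or a strict prefix of
  // the opener at the very end of the buffer (case 2).
  const cut = partialXmlOpenerStart(rest);
  if (cut === -1) return { flush: rest, blocks, held: '' };
  return { flush: rest.slice(0, cut), blocks, held: rest.slice(cut) };
}
```

Under these assumptions, a chunk ending in `<tool_c` flushes everything before the `<` and holds the seven-character suffix, while a stray `<` in ordinary prose (for example `x < y`) is flushed in full because `< y` is not a prefix of the opener.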
--- a/apps/server/src/services/tools.ts
+++ b/apps/server/src/services/tools.ts
@@ -527,6 +527,11 @@ export const askUserInput: ToolDef<AskUserInputInputT> = {
  },
 };
 // v1.13.3: alpha-sorted by tool.name at module load. llama.cpp's prompt
 // cache hits on byte-identical prefixes; the tool list lives near the top
 // of the system prompt, so any order drift would invalidate every cached
 // turn. Single source of truth for ordering lives here — toolJsonSchemas()
 // and TOOLS_BY_NAME inherit it.
 export const ALL_TOOLS: ReadonlyArray<ToolDef<unknown>> = [
  viewFile as ToolDef<unknown>,
  listDir as ToolDef<unknown>,
@@ -553,7 +558,7 @@ export const ALL_TOOLS: ReadonlyArray<ToolDef<unknown>> = [
  watchChanges as ToolDef<unknown>,
  getSemanticNeighborhoods as ToolDef<unknown>,
  getFrameworkAnalysis as ToolDef<unknown>,
-];
+].sort((a, b) => a.name.localeCompare(b.name));
 // v1.8.2: forward-compatible read-only whitelist. An agent whose `tools` is
 // fully contained in this set gets a generous default tool budget (30);
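The ordering invariant added to `ALL_TOOLS` above can be checked with a few lines. This is a minimal sketch, not the contents of `tools.test.ts` (which this diff does not show); `NamedTool`, `sortToolsByName`, and `isAlphaSorted` are hypothetical names, and the three tool names are used purely as sample data.

```typescript
// Hypothetical minimal shape: only the field the sort depends on.
interface NamedTool {
  name: string;
}

// Same comparator as the diff, applied to a copy so the input is untouched.
function sortToolsByName<T extends NamedTool>(tools: readonly T[]): T[] {
  return [...tools].sort((a, b) => a.name.localeCompare(b.name));
}

// True when every element is >= its predecessor under the same comparator.
function isAlphaSorted(tools: readonly NamedTool[]): boolean {
  return tools.every((tool, i) => {
    const prev = tools[i - 1];
    return prev === undefined || prev.name.localeCompare(tool.name) <= 0;
  });
}

const sample: NamedTool[] = [
  { name: 'view_file' },
  { name: 'list_dir' },
  { name: 'ask_user_input' },
];
const sorted = sortToolsByName(sample);
console.log(sorted.map((t) => t.name).join(','));
```

One design note: `localeCompare` called without a locale argument uses the runtime's default locale. If the goal is a byte-identical prefix across hosts, a plain code-unit comparison (`a.name < b.name ? -1 : a.name > b.name ? 1 : 0`) is a locale-independent alternative; whether that distinction matters for this deployment is not stated in the diff.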
--- a/apps/server/src/types/api.ts
+++ b/apps/server/src/types/api.ts
@@ -186,6 +186,11 @@ export interface Message {
  // v1.8.2: per-message metadata. See MessageMetadata for the discriminated
  // shapes currently in use.
  metadata: MessageMetadata | null;
  // v1.13.1-C: reasoning content captured from the model's reasoning stream
  // (qwen3.6 etc.). Populated from message_parts via the messages_with_parts
  // view's reasoning_parts column. Optional — most rows have no reasoning
  // and the API may omit the field on legacy responses.
  reasoning_parts?: Array<{ text: string }> | null;
  // v1.11: anchored rolling compaction. Optional so consumers that SELECT
  // the pre-v1.11 column set still type-check. See compaction.ts +
  // schema.sql for semantics.
--- a/apps/web/src/api/types.ts
+++ b/apps/web/src/api/types.ts
@@ -161,6 +161,11 @@ export interface Message {
  // v1.8.2: per-message metadata; see MessageMetadata. null for the vast
  // majority of messages.
  metadata: MessageMetadata | null;
  // v1.13.1-C: reasoning content captured from models that stream reasoning
  // tokens separately (qwen3.6 etc.). Backend populates from message_parts;
  // optional on the wire — frontend doesn't render this yet (reserved for
  // a v1.14 UI surface).
  reasoning_parts?: Array<{ text: string }> | null;
  // v1.11: anchored rolling compaction fields. Optional on the wire so that
  // older API responses (or test fixtures) parse without explicit nulls.
  //   summary       — true on the assistant row that holds the active
--- a/pnpm-lock.yaml
+++ b/pnpm-lock.yaml
@@ -48,12 +48,18 @@ importers:
  apps/server:
    dependencies:
      '@ai-sdk/openai-compatible':
        specifier: ^2.0.47
        version: 2.0.47(zod@3.25.76)
      '@fastify/static':
        specifier: ^7.0.4
        version: 7.0.4
      '@fastify/websocket':
        specifier: ^10.0.1
        version: 10.0.1
      ai:
        specifier: ^6.0.190
        version: 6.0.190(zod@3.25.76)
      fastify:
        specifier: ^4.28.1
        version: 4.29.1
@@ -179,6 +185,28 @@ importers:
 packages:
  '@ai-sdk/gateway@3.0.119':
    resolution: {integrity: sha512-VAhfRWC+JexZakkVfmjaJKaTj00x7/UHdE8kMWL3NhuQAlf8oXtg9r4dfvFZrByXxchGRBvYE3biEUyibkg0xg==}
    engines: {node: '>=18'}
    peerDependencies:
      zod: ^3.25.76 || ^4.1.8
  '@ai-sdk/openai-compatible@2.0.47':
    resolution: {integrity: sha512-Enm5UlL0zUCrW3792opk5h7hRWxZOZzDe6eQYVFqX9LUOGGCe1h8MZWAGim765nwzgnjlpeYOsuzZmLtRsTPlg==}
    engines: {node: '>=18'}
    peerDependencies:
      zod: ^3.25.76 || ^4.1.8
  '@ai-sdk/provider-utils@4.0.27':
    resolution: {integrity: sha512-ubkAJ+xODouwtmN1tYlvTPphH1hPOBfZaEQe8U7skGvFAnIRs9PPpsq57bC2+Ky/MB4yzhd6YOsxTAx9sGpazw==}
    engines: {node: '>=18'}
    peerDependencies:
      zod: ^3.25.76 || ^4.1.8
  '@ai-sdk/provider@3.0.10':
    resolution: {integrity: sha512-Q3BZ27qfpYqnCYGvE3vt+Qi6LGOF9R5Nmzn+9JoM1lCRsD9mYaIhfJLkSunN48nfGXJ6n+XNV0J/XVpqGQl7Dw==}
    engines: {node: '>=18'}
  '@alloc/quick-lru@5.2.0':
    resolution: {integrity: sha512-UrcABB+4bUrFABwbluTIBErXwvbsU/V7TZWfmbgJfbkwiBuziS9gxdODUyuiecfdGQ85jglMW6juS3+z5TsKLw==}
    engines: {node: '>=10'}
@@ -789,6 +817,10 @@ packages:
  '@open-draft/until@2.1.0':
    resolution: {integrity: sha512-U69T3ItWHvLwGg5eJ0n3I62nWuE6ilHlmz7zM0npLBRvPRd7e6NYmg54vvRtP5mZG7kZqZCFVdsTWo7BPtBujg==}
  '@opentelemetry/api@1.9.1':
    resolution: {integrity: sha512-gLyJlPHPZYdAk1JENA9LeHejZe1Ti77/pTeFm/nMXmQH/HFZlcS/O2XJB+L8fkbrNSqhdtlvjBVjxwUYanNH5Q==}
    engines: {node: '>=8.0.0'}
  '@pinojs/redact@0.4.0':
    resolution: {integrity: sha512-k2ENnmBugE/rzQfEcdWHcCY+/FM3VLzH9cYEsbdsoqrvzAKRhUZeRNhAZvB8OitQJ1TBed3yqWtdjzS6wJKBwg==}
@@ -1646,6 +1678,9 @@ packages:
    resolution: {integrity: sha512-tlqY9xq5ukxTUZBmoOp+m61cqwQD5pHJtFY3Mn8CA8ps6yghLH/Hw8UPdqg4OLmFW3IFlcXnQNmo/dh8HzXYIQ==}
    engines: {node: '>=18'}
  '@standard-schema/spec@1.1.0':
    resolution: {integrity: sha512-l2aFy5jALhniG5HgqrD6jXLi/rUWrKvqN/qJx6yoJsgKhblVd+iqqU4RCXavm/jPityDo5TCvKMnpjKnOriy0w==}
  '@tailwindcss/node@4.3.0':
    resolution: {integrity: sha512-aFb4gUhFOgdh9AXo4IzBEOzBkkAxm9VigwDJnMIYv3lcfXCJVesNfbEaBl4BNgVRyid92AmdviqwBUBRKSeY3g==}
@@ -1811,6 +1846,10 @@ packages:
  '@ungap/structured-clone@1.3.1':
    resolution: {integrity: sha512-mUFwbeTqrVgDQxFveS+df2yfap6iuP20NAKAsBt5jDEoOTDew+zwLAOilHCeQJOVSvmgCX4ogqIrA0mnyr08yQ==}
  '@vercel/oidc@3.2.0':
    resolution: {integrity: sha512-UycprH3T6n3jH0k44NHMa7pnFHGu/N05MjojYr+Mc6I7obkoLIJujSWwin1pCvdy/eOxrI/l3uDLQsmcrOb4ug==}
    engines: {node: '>= 20'}
  '@vitejs/plugin-react@4.7.0':
    resolution: {integrity: sha512-gUu9hwfWvvEDBBmgtAowQCojwZmJ5mcLn3aufeCsitijs3+f2NsrPtlAWIR6OPiqljl96GVCUbLe0HyqIpVaoA==}
    engines: {node: ^14.18.0 || >=16.0.0}
@@ -1878,6 +1917,12 @@ packages:
    resolution: {integrity: sha512-MnA+YT8fwfJPgBx3m60MNqakm30XOkyIoH1y6huTQvC0PwZG7ki8NacLBcrPbNoo8vEZy7Jpuk7+jMO+CUovTQ==}
    engines: {node: '>= 14'}
  ai@6.0.190:
    resolution: {integrity: sha512-T+ixHbWZ6jmHRREpVVJTkFyWJeCekCdzLPan7lp1F32jG5OUw4+odlVYjtMRXVzogU+pWzpMmXdRiHUmdL/q0w==}
    engines: {node: '>=18'}
    peerDependencies:
      zod: ^3.25.76 || ^4.1.8
  ajv-formats@2.1.1:
    resolution: {integrity: sha512-Wx0Kx52hxE7C18hkMEggYlEifqWZtYaRgouJor+WMdPnQyEK13vgEWyVNup7SoeeoLMsr4kf5h6dOW11I15MUA==}
    peerDependencies:
@@ -2694,6 +2739,9 @@ packages:
  json-schema-typed@8.0.2:
    resolution: {integrity: sha512-fQhoXdcvc3V28x7C7BMs4P5+kNlgUURe2jmUT1T//oBRMDrqy1QPelJimwZGo7Hg9VPV3EQV5Bnq4hbFy2vetA==}
  json-schema@0.4.0:
    resolution: {integrity: sha512-es94M3nTIfsEPisRafak+HDLfHXnKBhV3vU5eqPcS3flIWqcxJWgXHXiey3YrpaNsanY5ei1VoYEbOzijuq9BA==}
  json5@2.2.3:
    resolution: {integrity: sha512-XmOWe7eyHYH14cLdVPoyg+GOH3rYX++KpzrylJwSW98t3Nk+U8XOl8FWKOgwtzdb8lXGf6zYwDUzeHMWfxasyg==}
    engines: {node: '>=6'}
@@ -3966,6 +4014,30 @@ packages:
 snapshots:
  '@ai-sdk/gateway@3.0.119(zod@3.25.76)':
    dependencies:
      '@ai-sdk/provider': 3.0.10
      '@ai-sdk/provider-utils': 4.0.27(zod@3.25.76)
      '@vercel/oidc': 3.2.0
      zod: 3.25.76
  '@ai-sdk/openai-compatible@2.0.47(zod@3.25.76)':
    dependencies:
      '@ai-sdk/provider': 3.0.10
      '@ai-sdk/provider-utils': 4.0.27(zod@3.25.76)
      zod: 3.25.76
  '@ai-sdk/provider-utils@4.0.27(zod@3.25.76)':
    dependencies:
      '@ai-sdk/provider': 3.0.10
      '@standard-schema/spec': 1.1.0
      eventsource-parser: 3.0.8
      zod: 3.25.76
  '@ai-sdk/provider@3.0.10':
    dependencies:
      json-schema: 0.4.0
  '@alloc/quick-lru@5.2.0': {}
  '@babel/code-frame@7.29.0':
@@ -4516,6 +4588,8 @@ snapshots:
  '@open-draft/until@2.1.0': {}
  '@opentelemetry/api@1.9.1': {}
  '@pinojs/redact@0.4.0': {}
  '@pkgjs/parseargs@0.11.0':
@@ -5386,6 +5460,8 @@ snapshots:
  '@sindresorhus/merge-streams@4.0.0': {}
  '@standard-schema/spec@1.1.0': {}
  '@tailwindcss/node@4.3.0':
    dependencies:
      '@jridgewell/remapping': 2.3.5
@@ -5548,6 +5624,8 @@ snapshots:
  '@ungap/structured-clone@1.3.1': {}
  '@vercel/oidc@3.2.0': {}
  '@vitejs/plugin-react@4.7.0(vite@5.4.21(@types/node@20.19.41)(lightningcss@1.32.0))':
    dependencies:
      '@babel/core': 7.29.0
@@ -5628,6 +5706,14 @@ snapshots:
  agent-base@7.1.4: {}
  ai@6.0.190(zod@3.25.76):
    dependencies:
      '@ai-sdk/gateway': 3.0.119(zod@3.25.76)
      '@ai-sdk/provider': 3.0.10
      '@ai-sdk/provider-utils': 4.0.27(zod@3.25.76)
      '@opentelemetry/api': 1.9.1
      zod: 3.25.76
  ajv-formats@2.1.1(ajv@8.20.0):
    optionalDependencies:
      ajv: 8.20.0
@@ -6453,6 +6539,8 @@ snapshots:
  json-schema-typed@8.0.2: {}
  json-schema@0.4.0: {}
  json5@2.2.3: {}
  jsonfile@6.2.1:
Author	SHA1	Message	Date
indifferentketchup	ec8593cf77	v1.13.4: two-tier compaction prune — opencode pattern half-shipped in v1.11.0 - message_parts.hidden_at timestamptz column (NULL by default) with a partial index on (message_id) WHERE hidden_at IS NULL for the common visible-parts filter. - messages_with_parts view changed from COALESCE(parts, legacy) to CASE WHEN EXISTS(any parts of kind) THEN visible-parts ELSE legacy. COALESCE would have leaked hidden parts back via the legacy fallback when every part was pruned (smoke caught it pre-commit). The CASE distinguishes "no parts at all → fall back to legacy column for pre-v1.13.0 history" from "all parts hidden → return null/empty so the row drops out of the model payload" exactly. - prune.ts: scans tool_result parts newest-first, protects the last 40k tokens (PROTECTED_TOKENS), marks older candidates hidden when their combined estimate clears 20k (PRUNE_TRIGGER_TOKENS — equal to COMPACTION_BUFFER from v1.11.0, so a successful prune is exactly the budget the summary path would have freed). Stops at chats.tail_start_id so it doesn't double-erase across the last summary boundary. Pure decision helper selectPruneTargets exported separately for unit tests. - Wired into maybeFlagForCompaction: prune runs synchronously when overflow is detected; if it freed >= PRUNE_TRIGGER_TOKENS, the needs_compaction flag is NOT set and the (expensive) summary inference call is skipped this turn. The next turn's overflow check re-evaluates from scratch. - 6 new unit tests in prune.test.ts cover: empty input, protection-only (no candidates), candidates below trigger, candidates above trigger, candidates straddling a summary boundary, exactly-protection-tokens. 179 tests total (was 173). Smoke verified post-rebuild: - \\d message_parts shows hidden_at + partial index. - View definition shows AND p.hidden_at IS NULL filters on all three subselects. 
- Synthetic hide-then-restore confirmed the view drops the tool_result jsonb to null when its only part is hidden, and restores when un-hidden. - EXPLAIN ANALYZE on the 42-message stress chat: 0.325ms (faster than v1.13.1-B's 1.018ms — EXISTS short-circuits cleanly for the common no-parts case). - Normal turn (plain text prompt) completes unaffected. Closes a v1.11.0 design item that was scoped but never implemented. With v1.13's parts table the prune is dramatically cheaper to write — pre-parts it would have meant editing JSON blobs in-place; now it's a hidden_at flag and a view subselect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 07:02:17 +00:00
indifferentketchup	a08d809b73	v1.13.3: cleanup bundle — statement timeout + alpha ordering + stuck-row sweeper + repairToolCall Four independent items, all owed from prior dispatches. - statement_timeout at the database level via: ALTER DATABASE boocode SET statement_timeout = '30s'; Applied operationally; documented as a comment at the top of schema.sql (ALTER DATABASE can't run inside a DO block, so it's not idempotent inside applySchema). Re-apply after a volume reset. - Tool registry alpha-sorted at module load. llama.cpp's prompt cache hits on byte-identical prefixes; any reordering of the tool list near the top of the system prompt would invalidate every cached turn. Single-source sort at the ALL_TOOLS export so toolJsonSchemas() and TOOLS_BY_NAME inherit the order automatically. New tools.test.ts asserts the invariant; total tests 173 (was 172). - Periodic in-process stuck-row sweeper. Runs every 60s, marks 'streaming' rows older than 5 minutes as 'failed', and publishes chat_status='idle' on the user channel so the UI dot drops without a refresh. Closes the mid-session crash UX gap; the v1.12.1 boot sweep only fires once at startup, so sessions used to stay stuck until next reboot. setInterval cleaned up via app.addHook('onClose'). Mirrors handleAbortOrError's publish pattern. - experimental_repairToolCall wired through AI SDK v6 streamText. Pass-through implementation: log + return the original toolCall so the stream keeps going. executeToolPhase's existing error paths (unknown tool name → 'unknown tool: X' result; zod-reject → 'tool X rejected — field: required') already surface bad calls to the model; the value here is preventing the AI SDK from THROWING on parse errors and killing the whole stream. Owed since v1.13.1-A. Smoke verified: - statement_timeout = '30s' confirmed via SHOW. - Tool path normal flow intact (list_dir prompt → tool_call → result → final assistant). 
No malformed tool calls in the test run; repair log will surface them when qwen3.6 actually emits one. - Alpha order verified at runtime via the dist bundle: match: true. - Sweeper logic not traffic-tested (no stuck rows to find), but the SQL UPDATE + broker.publishUser pattern is identical to handleAbort and the boot sweep — synthesis-only verification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 06:46:03 +00:00
indifferentketchup	ac1a71f583	v1.13.1-C: port ask_user_input correlation to parts + wire reasoning_parts end-to-end Pass 1 — ask_user_input correlation port (messages.ts:478, :549): - The two correlation queries that backed the elicitation flow used to scan messages.tool_calls and messages.tool_results JSON columns directly. They now JOIN message_parts on payload->>'id' (for the caller assistant) and payload->>'tool_call_id' (for the pending tool row). Semantics preserved: ORDER BY m.created_at DESC LIMIT 1 still picks the latest issuance, the already-answered 409 guard now reads payload.output, and the UPDATE + parts replace inside sql.begin is unchanged from v1.13.0. - Pre-v1.13.0 history has no parts rows and is unreachable to this lookup path (404). Acceptable per dispatch decision — no pending elicitation from before v1.13.0 will still be open. JSON-column fallback can land as a hotfix if it ever surfaces. Pass 2 — reasoning_parts wired end-to-end: - types.ts/StreamResult gains `reasoning: string`. stream-phase.ts accumulates reasoning-delta text per stream (replacing the v1.13.1-A counter-only diagnostic) and returns it on the result. - parts.ts/partsFromAssistantMessage gains an optional `reasoning` param. When present it emits a kind='reasoning' part at sequence 0, ahead of the text and tool_call parts. - error-handler.ts/finalizeCompletion and tool-phase.ts/executeToolPhase both thread result.reasoning into the dual-write call so reasoning-channel models (qwen3.6) get persistent reasoning rows. - payload.ts: loadContext SELECT pulls reasoning_parts from the v1.13.1-B view; OpenAiMessage gains an optional `reasoning` field; buildMessagesPayload collapses reasoning_parts into a single string per assistant message. - stream-phase.ts/toModelMessages converts assistant messages with reasoning into an AI SDK ModelMessage content array starting with a ReasoningPart, matching the @ai-sdk/provider-utils AssistantContent union. 
Reasoning models can now replay prior reasoning context across tool-call boundaries. - types/api.ts and apps/web/src/api/types.ts Message interface gain reasoning_parts (optional, nullable). Frontend doesn't render this yet — field reserved for a v1.14 UI surface. Tests: 2 new in parts.test.ts cover reasoning-at-sequence-0 with and without text content. 172 tests pass (170 prior + 2 new). Smoke verified against the live container: - A reasoning-prompt ("walk through 17 × 23 step by step") produced one message with kind='reasoning' (361 chars) at sequence 0 and kind='text' (429 chars) at sequence 1. Adapter log confirmed reasoning capture. - The new correlation SQL was validated against existing tool_call / tool_result parts: returns the expected message_id + payload shape with pending state correctly identified via payload.output IS NULL. - ask_user_input end-to-end through the UI is Sam's smoke — the Prompt Builder agent does not always trigger ask_user_input for these prompts, so synthetic verification via SQL substituted for traffic-driven cover. Annotation: the v1.13.1-A abort-throw site in stream-phase.ts got a one-liner comment ("AI SDK v6 fullStream returns normally on abort; check signal explicitly.") to prevent a future refactor removing it. v1.13.2 drops the dual-write + the JSON columns + collapses the view. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 06:34:10 +00:00
indifferentketchup	13c3aa5b4e	v1.13.1-B: read-path flip from tool_calls/tool_results JSON columns to message_parts - schema.sql: new messages_with_parts view. tool_calls aggregates parts with kind='tool_call' as a jsonb array of {id, name, args}; tool_results picks the single sequence=0 part with kind='tool_result' as a jsonb {tool_call_id, output, truncated, error?}. COALESCE against the legacy jsonb columns means pre-v1.13.0 history (no parts rows) still reads correctly via the fallback, and fresh inserts (where parts dual-write follows the row INSERT) hit the legacy columns until the parts land. - reasoning_parts column added to the view but not selected by any caller yet — v1.13.1-C extends the Message type and pulls it into the model payload alongside the type extension. - Read sites switched to FROM messages_with_parts: - routes/chats.ts:427 (chat history GET) - routes/messages.ts:95 (session history GET) - routes/ws.ts:27 (WS snapshot on session connect, resume path) - services/inference/payload.ts (loadContext for model assembly) - services/compaction.ts (compaction's payload assembly) - chats.ts:394 (discard_stale UPDATE RETURNING) unchanged — UPDATEs target messages directly and the returned shape is for a freshly-modified row where the legacy column is dual-written and correct. - messages.ts:478/549 (ask_user_input correlation) intentionally not migrated — those query a different shape, ported in v1.13.1-C. - Writes still target `messages` directly; the view is read-only. Smoke verified against the live container: - Equivalence: 5/5 messages with both legacy column and parts row return identical tool_calls jsonb between FROM messages and FROM messages_with_parts. - Perf: EXPLAIN ANALYZE on the 42-message stress chat returns in ~1ms (50ms threshold). Bitmap Index Scan on message_parts_msg_seq_idx carries the parts lookups. 
- API contract: GET /api/chats/:id/messages returns identical {id, name, args} tool_calls and {tool_call_id, output, truncated, error} tool_results shapes to frontend consumers — no UI changes needed. - Inference path: sent a view_file prompt; assistant turn 1 emitted the tool_call, tool message captured the result, follow-up assistant turn read the result back via loadContext (now view-backed) and answered correctly. End-to-end loop intact. v1.13.2 drops the dual-write + the JSON columns + simplifies the view to just SELECT FROM message_parts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 06:22:47 +00:00
indifferentketchup	c2c4f78a26	v1.13.1-A: install AI SDK v6 + swap streamText into stream-phase.ts adapter - Add ai@^6 and @ai-sdk/openai-compatible@^2 to apps/server. - New services/inference/provider.ts: createOpenAICompatible against llama-swap (baseURL threaded from config.LLAMA_SWAP_URL, cached per baseURL). No apiKey — Authelia + Tailscale gate llama-swap, not keys. - streamCompletion rewritten as an adapter over streamText. AI SDK fullStream parts (text-delta, tool-call, finish, error) map back to the legacy {content?, tool_calls?, finishReason} StreamResult shape that executeStreamPhase already consumes. No layer above streamCompletion changes. - toModelMessages converts BooCode's OpenAI-shaped history to AI SDK ModelMessage[]; tool messages need toolName which we look up by scanning earlier assistant tool_calls for the matching id. - buildAiTools wraps BooCode's JSON-schema tool defs via tool({ inputSchema: jsonSchema(parameters) }) with NO execute — BooCode dispatches tools in tool-phase.ts, not the AI SDK loop. - XML fallback parser preserved as-is — qwen3.6 still emits XML tool calls in text content that the structured tool-call layer misses. - reasoning-delta parts dropped with a debug-level counter — captured properly in v1.13.1-C. - Abort path: streamText({ abortSignal }) wires ctx.signal through, but AI SDK v6 swallows the abort (fullStream iterator exits cleanly rather than throwing). Post-iteration `if (signal?.aborted) throw` so handleAbortOrError owns the row and writes status='cancelled'. Caught by smoke D; would have shipped as status='complete' on stop otherwise. - Usage frame reads result.usage (inputTokens / outputTokens v6 names) AFTER stream drain. Single trailing publish through the existing 500ms throttle. Known regression: ChatThroughput's live mid-stream tick (v1.12.2) is gone — it now shows a single value at stream end. TODO(v1.13.1-followup): interpolate outputTokens during streaming via a delta-cadence counter (e.g. 
part.text.length/4 token proxy) and publish every 500ms; reconcile against result.usage at finish. - Write-path dual-write from v1.13.0 unaffected. Read path stays on JSON columns. v1.13.1-B flips reads to message_parts. Smoke verified end-to-end against running container: - A. Plain text: status='complete', 1 text part. - B. Single tool prompt → multi-tool chain (4 calls): every assistant with tool_calls has 2 parts (text+tool_call), every tool row has 1 part (tool_result). - C. Multi-step covered by B's chain. - D. Stop mid-stream: status='cancelled' written via handleAbortOrError after the post-iteration abort throw. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 06:17:56 +00:00
indifferentketchup	1cb6eee24c	v1.13.0: message_parts table + dual-write at every tool_calls/tool_results site Adds a granular message_parts table (one row per text/tool_call/tool_result chunk) without changing any read path. Old messages.content / tool_calls / tool_results columns remain authoritative for v1.13.0; this dispatch is write-only mirroring so the AI SDK migration in v1.13.1 can flip read authority without a backfill window. Schema: CREATE TABLE message_parts (id, message_id FK ON DELETE CASCADE, sequence int, kind text CHECK (text|tool_call|tool_result|reasoning|step_start), payload jsonb, created_at, UNIQUE (message_id, sequence)) New module services/inference/parts.ts with two pure derive helpers (partsFromAssistantMessage, partsFromToolMessage) and insertParts that fans out a multi-row INSERT via postgres-js. Wired dual-write at every site that writes tool_calls or tool_results: - tool-phase.ts: assistant finalize UPDATE, executed-tool UPDATE, ask_user_input sentinel UPDATE - messages.ts answer flow: DELETE pending tool_result part + INSERT answered one inside the existing sql.begin - skills.ts: synthetic assistant + tool INSERTs both inside existing tx - chats.ts fork: CTE clones parts via ROW_NUMBER pairing (source→dest message id mapping in one statement, no N+1) - error-handler.ts finalizeCompletion: text part for plain text-only assistant turns Deviation: tool-phase.ts finalize UPDATEs and finalizeCompletion text-part write are not wrapped in fresh sql.begin transactions. Safe in v1.13.0 because JSON columns are authoritative for reads. v1.13.1 must wrap these sites before flipping read authority — TODO comments added at each unwrapped site referencing v1.13.1. Tests: 8 new unit tests for the derive helpers in services/__tests__/parts.test.ts. Existing 162 tests untouched. 170 total. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 05:46:29 +00:00
indifferentketchup	ca64bf9f0a	docs: CLAUDE.md updates from /claude-md-management session - services/inference.ts → services/inference/ directory map (v1.12.4 split) - workspace_panes server-side jsonb (was: localStorage-only line) - chat_status 5-state model + ChatThroughput + discard_stale endpoint - boot-time stale-streaming sweep documented - WS frame sync gotcha (server InferenceFrame ↔ web WsFrame) - session_panes table noted as dropped (not deprecated) - messages_status_check/role_check drift cleanup noted Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 05:46:14 +00:00
indifferentketchup	9ef00c0268	v1.12.4: complete inference.ts split into services/inference/ - sentinel-summaries.ts: runCapHitSummary, insertCapHitSentinel, runDoomLoopSummary, insertDoomLoopSentinel - inference.ts → inference/turn.ts: residue is runAssistantTurn, runInference, createInferenceRunner orchestration only - inference/index.ts: re-export shim preserves the public surface (createInferenceRunner, runInference, runAssistantTurn, detectDoomLoop, DOOM_LOOP_THRESHOLD, buildMessagesPayload, plus type-side InferenceContext/InferenceFrame/StreamResult/TurnArgs/FramePublisher) - src/index.ts + auto_name.ts + the two vitest test files updated to import from ./services/inference/index.js explicitly (NodeNext ESM doesn't honor directory-index resolution) Final tally: 11 files under services/inference/, the largest being sentinel-summaries.ts at 523 LoC (two near-clone summary paths kept side-by-side until a third sentinel justifies factoring out a shared runWrapUpSummary). turn.ts is now 326 LoC, the next-largest is stream-phase.ts at 380. Public import surface unchanged. tool-phase.ts → turn.ts back-edge for runAssistantTurn remains (cycle is safe; resolved at call time). Prepares the file structure for v1.13 AI SDK migration — streamText swap targets stream-phase.ts only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 22:36:35 +00:00
indifferentketchup	c87df6981a	v1.12.4-rc3: extract stream-phase + tool-phase from inference.ts - stream-phase.ts: streamCompletion, executeStreamPhase (plus sseLines, StreamOptions, ChatCompletionDelta/Chunk as private helpers) - tool-phase.ts: executeToolPhase + private executeToolCall - types.ts: shared StreamPhaseState + DB_FLUSH_INTERVAL_MS so the summary functions still in inference.ts can reference them without pulling from a phase file Cycle: executeToolPhase recurses into runAssistantTurn, which stays in inference.ts. Resolved by direct value back-edge — tool-phase.ts does `import { runAssistantTurn } from '../inference.js'` and runAssistantTurn is now exported. Safe because the dereference happens inside an async function body, after both modules have fully evaluated. No callback-through-args fallback needed. inference.ts shrinks from ~1401 to ~828 LoC. Final Dispatch D moves the sentinel summaries out and renames the residue to inference/turn.ts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 22:28:23 +00:00
indifferentketchup	8fa7b7fce9	v1.12.4-rc2: extract payload + error-handler from inference.ts - payload.ts: buildMessagesPayload (re-exported), loadContext, maybeFlagForCompaction - error-handler.ts: handleAbortOrError, finalizeCompletion Both new files type-import InferenceContext/StreamResult/TurnArgs from inference.ts; ESM elides type imports so there's no runtime cycle. handleAbortOrError turned out not to call the summary functions, so no back-edge needed. inference.ts shrinks from ~1676 to ~1401 LoC. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 22:09:50 +00:00
indifferentketchup	ea468ca7fb	v1.12.4-rc1: extract budget, sentinels, xml-parser from inference.ts Pure file moves. No behavior change. inference.ts retains createInferenceRunner public surface; new files are internal to services/inference/. - budget.ts: resolveToolBudget - sentinels.ts: detectDoomLoop (re-exported through inference.ts), isCapHitSentinel, isDoomLoopSentinel, isAnySentinel - xml-parser.ts: parseXmlToolCall, partialXmlOpenerStart First of four refactor batches preparing inference.ts for the v1.13 AI SDK migration. inference.ts goes from 1780 LoC to ~1620. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 21:42:41 +00:00
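The prune decision described in the v1.13.4 commit message (newest-first scan, a protected window, a trigger threshold, and a stop at the summary boundary) can be sketched as a pure function. `prune.ts` is not shown in this diff, so everything below is an assumption made for illustration: the `ToolResultPart` shape, the numeric `messageSeq` stand-in for the `chats.tail_start_id` boundary, the name `selectPruneTargetsSketch`, and the choice to treat a part that straddles the protected window as protected. Only the two constants and the `>=` comparison come from the commit message.

```typescript
// Values taken from the v1.13.4 commit message.
const PROTECTED_TOKENS = 40_000;
const PRUNE_TRIGGER_TOKENS = 20_000;

// Hypothetical shape; the real part rows are not shown in this diff.
interface ToolResultPart {
  id: string;
  messageSeq: number; // assumed monotonic position of the owning message
  tokens: number;     // estimated token count of the tool_result payload
}

// Hypothetical sketch of the decision helper. `partsNewestFirst` must be
// ordered newest to oldest; `tailStartSeq` stands in for the summary
// boundary (null when the chat has never been summarised).
function selectPruneTargetsSketch(
  partsNewestFirst: readonly ToolResultPart[],
  tailStartSeq: number | null,
): string[] {
  let protectedSoFar = 0;
  let candidateTokens = 0;
  const candidates: string[] = [];
  for (const part of partsNewestFirst) {
    // Stop at the summary boundary: older parts are already summarised.
    if (tailStartSeq !== null && part.messageSeq < tailStartSeq) break;
    // Assumption: a part that starts inside the protected window stays
    // protected in full, even if it pushes the total past the limit.
    if (protectedSoFar < PROTECTED_TOKENS) {
      protectedSoFar += part.tokens;
      continue;
    }
    candidates.push(part.id);
    candidateTokens += part.tokens;
  }
  // Hide nothing unless the prune would free at least the trigger budget.
  return candidateTokens >= PRUNE_TRIGGER_TOKENS ? candidates : [];
}

const parts: ToolResultPart[] = [
  { id: 'a', messageSeq: 10, tokens: 30_000 },
  { id: 'b', messageSeq: 9, tokens: 10_000 },
  { id: 'c', messageSeq: 8, tokens: 15_000 },
  { id: 'd', messageSeq: 7, tokens: 10_000 },
];
console.log(selectPruneTargetsSketch(parts, null)); // [ 'c', 'd' ]
console.log(selectPruneTargetsSketch(parts, 8));    // []
```

In the first call, parts `a` and `b` fill the protected window exactly and `c` plus `d` total 25k, which clears the trigger. In the second, the boundary excludes `d`, leaving 15k of candidates, so nothing is hidden and the caller would fall through to the summary path.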