Compare commits

...

2 Commits

Author SHA1 Message Date
bcc89d8adc feat: MistakeTracker + file-provenance ledger (v2.7.4)
Two native-inference hardening features from boocode_code_review_v2 §1 #12.

MistakeTracker: new pure mistake-tracker.ts tracks consecutive heterogeneous
tool failures (kinds surfaced per tool from tool-phase.ts). On 3 in a row the
turn loop soft-nudges (model-facing recovery guidance + mistake_recovery
sentinel + reset), then escalates to stopping the turn (cap-hit-style, Continue
affordance) on a re-trip. Complements doom-loop (identical repeats) + cap-hit.

File-provenance ledger: compaction.ts derives a deterministic ## Files Read list
from the head messages' read-tool calls and injects it into the rolling-summary
prompt so provenance survives compaction (no new table; read-only).

mistake_recovery sentinel: MessageMetadata arm (server + web) + MessageBubble
render branch. Built by 2 parallel agents. Server 545 tests passing (23 new);
build + web tsc clean. Native-inference only. Builds on v2.7.3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 13:05:03 +00:00
f53d6a8afd Merge sampling-knobs-streamjson: v2.7.3 sampling knobs + live PTY stream-json + token UI 2026-06-01 12:47:31 +00:00
15 changed files with 816 additions and 20 deletions

View File

@@ -2,6 +2,10 @@
All notable changes per release tag. Most recent on top, ordered by tag creation date (which matches the git history). Tag names follow `vMAJOR.MINOR.PATCH-slug` — the slug describes what shipped, so the tag name alone is enough to recall the batch. All notable changes per release tag. Most recent on top, ordered by tag creation date (which matches the git history). Tag names follow `vMAJOR.MINOR.PATCH-slug` — the slug describes what shipped, so the tag name alone is enough to recall the batch.
## v2.7.4-mistake-tracker-ledger — 2026-06-01
Two native-inference hardening features from `boocode_code_review_v2.md` §1 #12 (cline, algorithm-reimplemented). **MistakeTracker:** complements the doom-loop guard (identical repeats) and cap-hit (budget) by catching a run of consecutive tool *failures*. A new pure `mistake-tracker.ts` tracks heterogeneous failure kinds (`zod_reject`/`tool_not_found`/`exec_error`/`api_error`/`permission_denied`, surfaced per tool from `tool-phase.ts`); after 3 consecutive failures the `turn.ts` loop does a **soft nudge** — injects model-facing recovery guidance into the next step + drops a `mistake_recovery` UI sentinel + resets — then **escalates** to stopping the turn (cap-hit-style, with a Continue affordance) if it re-trips without an intervening success, so heterogeneous failures can't burn the whole step budget. **File-provenance ledger:** `compaction.ts` now derives a deterministic, sorted `## Files Read` list from the head messages' read-tool calls (`view_file`/`grep`/`find_files`/`list_dir`) and injects it into the rolling-summary prompt so file provenance survives compaction (no new table; prompt-driven merge, read-only since BooChat has no write tools). The `mistake_recovery` sentinel adds an arm to `MessageMetadata` in both server + web type copies plus a `MessageBubble` render branch. Built by two parallel agents (backend + frontend sentinel) over disjoint apps; server 545 tests passing (23 new: 12 mistake-tracker + 11 compaction), build + web tsc clean. Native-inference only (external agents run their own loops). Builds on `v2.7.3-sampling-streamjson-tokens`; openspec `mistake-tracker-file-ledger`.
## v2.7.3-sampling-streamjson-tokens — 2026-06-01 ## v2.7.3-sampling-streamjson-tokens — 2026-06-01
Three small BooCode wins from `boocode_code_review_v2.md` §1 #11/#7/#8. **Sampling knobs:** per-agent `top_n_sigma` + the `dry_*` repetition family (`dry_multiplier`/`dry_base`/`dry_allowed_length`/`dry_penalty_last_n`) are now first-class Agent frontmatter fields, parsed in `agents.ts` and threaded into the llama-swap chat-completion body via `providerOptions.openaiCompatible` (the `@ai-sdk/openai-compatible` extra-body channel). This surfaced and fixed a **latent bug**: `top_k` (rejected by the AI-SDK provider as unsupported) and `min_p` (never passed to `streamText` at all) had been dead on the wire — no agent's `top_k`/`min_p` ever affected sampling; both now route through the same channel, so agents that set them will start using them. `--reasoning-budget` is documented in `data/AGENTS.md` (already works via `llama_extra_args`, permitted by the deny-list validator). **Live PTY stream-json:** qwen/claude PTY dispatch sliced stdout opaque; a new `stream-json-parser.ts` line-buffers the Claude-Code-compatible NDJSON and emits text/reasoning/tool frames live as they arrive (mirroring the ACP/opencode paths) + persists the structured parts, with a clean fallback to the old opaque slice when output isn't NDJSON (claude now runs `--output-format stream-json --verbose`). **Token UI:** the per-`(chat,agent)` `agent_sessions.input_tokens`/`output_tokens`/`cost` columns (accumulated since `v2.6.8` but dropped by the read route + wire type) now flow through and render condensed beside the AgentComposerBar session chip. Built by three parallel agents over disjoint subsystems; server 523 + coder 245 tests passing (incl. 11 new stream-json-parser + new agent-parse tests), all builds + web tsc clean. Builds on `v2.7.2-checkpoint-idor`; openspec `sampling-streamjson-tokens`. The qwen-vs-claude `usage` field names in #7 are best-guess pending a live smoke. Three small BooCode wins from `boocode_code_review_v2.md` §1 #11/#7/#8. **Sampling knobs:** per-agent `top_n_sigma` + the `dry_*` repetition family (`dry_multiplier`/`dry_base`/`dry_allowed_length`/`dry_penalty_last_n`) are now first-class Agent frontmatter fields, parsed in `agents.ts` and threaded into the llama-swap chat-completion body via `providerOptions.openaiCompatible` (the `@ai-sdk/openai-compatible` extra-body channel). This surfaced and fixed a **latent bug**: `top_k` (rejected by the AI-SDK provider as unsupported) and `min_p` (never passed to `streamText` at all) had been dead on the wire — no agent's `top_k`/`min_p` ever affected sampling; both now route through the same channel, so agents that set them will start using them. `--reasoning-budget` is documented in `data/AGENTS.md` (already works via `llama_extra_args`, permitted by the deny-list validator). **Live PTY stream-json:** qwen/claude PTY dispatch sliced stdout opaque; a new `stream-json-parser.ts` line-buffers the Claude-Code-compatible NDJSON and emits text/reasoning/tool frames live as they arrive (mirroring the ACP/opencode paths) + persists the structured parts, with a clean fallback to the old opaque slice when output isn't NDJSON (claude now runs `--output-format stream-json --verbose`). **Token UI:** the per-`(chat,agent)` `agent_sessions.input_tokens`/`output_tokens`/`cost` columns (accumulated since `v2.6.8` but dropped by the read route + wire type) now flow through and render condensed beside the AgentComposerBar session chip. Built by three parallel agents over disjoint subsystems; server 523 + coder 245 tests passing (incl. 11 new stream-json-parser + new agent-parse tests), all builds + web tsc clean. Builds on `v2.7.2-checkpoint-idor`; openspec `sampling-streamjson-tokens`. The qwen-vs-claude `usage` field names in #7 are best-guess pending a live smoke.

View File

@@ -7,6 +7,8 @@ import {
select, select,
buildPrompt, buildPrompt,
buildHeadPayload, buildHeadPayload,
deriveFilesRead,
buildFilesReadContext,
type CompactionMessage, type CompactionMessage,
} from '../compaction.js'; } from '../compaction.js';
import { SUMMARY_TEMPLATE } from '../compaction-prompt.js'; import { SUMMARY_TEMPLATE } from '../compaction-prompt.js';
@@ -321,3 +323,105 @@ describe('buildHeadPayload reasoning render', () => {
expect(out[1]!.content).not.toContain('<reasoning>'); expect(out[1]!.content).not.toContain('<reasoning>');
}); });
}); });
// ---- buildHeadPayload sentinel stripping (#12) -------------------------------
describe('buildHeadPayload strips all UI sentinels', () => {
it('drops cap_hit, doom_loop, and mistake_recovery system rows', () => {
const out = buildHeadPayload([
mkMsg('user', 'do the thing'),
mkMsg('system', 'budget reached', { metadata: { kind: 'cap_hit' } }),
mkMsg('system', 'looping', { metadata: { kind: 'doom_loop' } }),
mkMsg('system', 'repeated errors', { metadata: { kind: 'mistake_recovery' } }),
mkMsg('assistant', 'answer'),
]);
// Only the user + assistant rows survive; all three sentinels stripped.
expect(out).toHaveLength(2);
expect(out[0]!.role).toBe('user');
expect(out[1]!.role).toBe('assistant');
});
it('keeps a non-sentinel system row (e.g. compact bridge) untouched', () => {
const out = buildHeadPayload([
mkMsg('system', 'legacy compact', { kind: 'compact', metadata: null }),
mkMsg('user', 'q'),
]);
expect(out[0]!.role).toBe('system');
expect(out[0]!.content).toBe('legacy compact');
});
});
// ---- file-provenance ledger (#12, Part B) -----------------------------------
describe('deriveFilesRead', () => {
it('returns [] when the head has no read-tool calls', () => {
expect(deriveFilesRead([mkMsg('user', 'hi'), mkMsg('assistant', 'hello')])).toEqual([]);
});
it('extracts the path arg from view_file / list_dir / grep / find_files', () => {
const head = [
mkMsg('assistant', '', {
tool_calls: [
{ id: 'c1', name: 'view_file', args: { path: 'src/index.ts' } },
{ id: 'c2', name: 'list_dir', args: { path: 'src' } },
{ id: 'c3', name: 'grep', args: { pattern: 'TODO', path: 'apps' } },
{ id: 'c4', name: 'find_files', args: { pattern: '**/*.ts', path: 'lib' } },
],
}),
];
expect(deriveFilesRead(head)).toEqual(['apps', 'lib', 'src', 'src/index.ts']);
});
it('dedupes and sorts paths across multiple assistant turns', () => {
const head = [
mkMsg('assistant', '', { tool_calls: [{ id: 'c1', name: 'view_file', args: { path: 'b.ts' } }] }),
mkMsg('assistant', '', { tool_calls: [{ id: 'c2', name: 'view_file', args: { path: 'a.ts' } }] }),
mkMsg('assistant', '', { tool_calls: [{ id: 'c3', name: 'view_file', args: { path: 'b.ts' } }] }),
];
expect(deriveFilesRead(head)).toEqual(['a.ts', 'b.ts']);
});
it('ignores non-read tools and grep calls without a path arg', () => {
const head = [
mkMsg('assistant', '', {
tool_calls: [
{ id: 'c1', name: 'web_search', args: { query: 'x' } },
{ id: 'c2', name: 'grep', args: { pattern: 'foo' } }, // no path → root, skipped
{ id: 'c3', name: 'view_file', args: { path: 'kept.ts' } },
],
}),
];
expect(deriveFilesRead(head)).toEqual(['kept.ts']);
});
it('ignores read-tool calls on non-assistant rows', () => {
const head = [
mkMsg('user', '', { tool_calls: [{ id: 'c1', name: 'view_file', args: { path: 'nope.ts' } }] }),
];
expect(deriveFilesRead(head)).toEqual([]);
});
});
describe('buildFilesReadContext', () => {
it('returns null when nothing was read (no empty section injected)', () => {
expect(buildFilesReadContext([mkMsg('user', 'hi')])).toBeNull();
});
it('formats a ## Files Read block with sorted bullet paths', () => {
const head = [
mkMsg('assistant', '', {
tool_calls: [
{ id: 'c1', name: 'view_file', args: { path: 'z.ts' } },
{ id: 'c2', name: 'view_file', args: { path: 'a.ts' } },
],
}),
];
expect(buildFilesReadContext(head)).toBe('## Files Read\n- a.ts\n- z.ts');
});
});
describe('SUMMARY_TEMPLATE includes the Files Read section (#12)', () => {
it('declares a ## Files Read section the model must maintain', () => {
expect(SUMMARY_TEMPLATE).toContain('## Files Read');
});
});

View File

@@ -0,0 +1,164 @@
import { describe, it, expect } from 'vitest';
import {
MISTAKE_THRESHOLD,
freshMistakeState,
recordStep,
detectMistakePattern,
MISTAKE_RECOVERY_NOTE,
type FailureKind,
} from '../inference/mistake-tracker.js';
// ---- helpers ----------------------------------------------------------------
// Replays a sequence of outcomes against a fresh state, returning the final
// state so assertions can read .run / .nudges. The caller mimics turn.ts: after
// each recordStep we consult detectMistakePattern and, if it returns 'nudge',
// bump nudges + reset run (the loop's nudge-handling side effect).
function replay(
outcomes: (FailureKind | 'success')[],
{ applyNudge = false }: { applyNudge?: boolean } = {},
) {
const state = freshMistakeState();
const decisions: (ReturnType<typeof detectMistakePattern>)[] = [];
for (const o of outcomes) {
recordStep(state, o);
const decision = detectMistakePattern(state);
decisions.push(decision);
if (applyNudge && decision === 'nudge') {
// Mirror turn.ts's nudge side effect: bump the counter, reset the streak.
state.nudges += 1;
state.run = [];
}
}
return { state, decisions };
}
// ---- fresh state ------------------------------------------------------------
describe('freshMistakeState', () => {
it('starts with an empty run and zero nudges', () => {
const s = freshMistakeState();
expect(s.run).toEqual([]);
expect(s.nudges).toBe(0);
});
});
// ---- below threshold --------------------------------------------------------
describe('detectMistakePattern — below threshold', () => {
it('returns null on a fresh state', () => {
expect(detectMistakePattern(freshMistakeState())).toBeNull();
});
it('returns null after fewer than MISTAKE_THRESHOLD failures', () => {
const { decisions } = replay(['zod_reject', 'exec_error']);
expect(decisions).toEqual([null, null]);
});
});
// ---- success reset ----------------------------------------------------------
describe('recordStep — success resets', () => {
it("'success' clears both the run streak and the nudge counter", () => {
const state = freshMistakeState();
recordStep(state, 'zod_reject');
recordStep(state, 'exec_error');
state.nudges = 2; // simulate prior nudges
recordStep(state, 'success');
expect(state.run).toEqual([]);
expect(state.nudges).toBe(0);
});
it('a success mid-streak prevents the threshold from tripping', () => {
// fail, fail, success, fail, fail → streak never reaches 3.
const { decisions } = replay([
'zod_reject',
'exec_error',
'success',
'tool_not_found',
'permission_denied',
]);
expect(decisions.every((d) => d === null)).toBe(true);
});
});
// ---- 3-streak nudge ---------------------------------------------------------
describe('detectMistakePattern — nudge on 3-streak', () => {
it("returns 'nudge' the first time the streak reaches MISTAKE_THRESHOLD", () => {
const { decisions } = replay(['zod_reject', 'exec_error', 'tool_not_found']);
expect(decisions).toEqual([null, null, 'nudge']);
});
it("fires 'nudge' for a streak of identical kinds too (kind-agnostic)", () => {
const { decisions } = replay(['exec_error', 'exec_error', 'exec_error']);
expect(decisions[2]).toBe('nudge');
});
});
// ---- re-trip escalate -------------------------------------------------------
describe('detectMistakePattern — escalate on re-trip', () => {
it("escalates when the streak re-trips after a nudge with no intervening success", () => {
// 3 fails → nudge (run reset, nudges=1), then 3 more fails → escalate.
const { decisions } = replay(
[
'zod_reject',
'exec_error',
'tool_not_found',
'permission_denied',
'exec_error',
'zod_reject',
],
{ applyNudge: true },
);
expect(decisions[2]).toBe('nudge');
expect(decisions[5]).toBe('escalate');
});
it("does NOT escalate if a success lands between the nudge and the next streak", () => {
const { decisions } = replay(
[
'zod_reject',
'exec_error',
'tool_not_found', // nudge here
'success', // clears nudges back to 0
'exec_error',
'zod_reject',
'tool_not_found', // 3-streak again → nudge, NOT escalate
],
{ applyNudge: true },
);
expect(decisions[2]).toBe('nudge');
expect(decisions[6]).toBe('nudge');
expect(decisions).not.toContain('escalate');
});
});
// ---- mixed kinds ------------------------------------------------------------
describe('detectMistakePattern — mixed failure kinds', () => {
it('counts a streak of all five distinct kinds toward the threshold', () => {
const { state, decisions } = replay([
'zod_reject',
'tool_not_found',
'exec_error',
]);
expect(decisions[2]).toBe('nudge');
expect(state.run).toEqual(['zod_reject', 'tool_not_found', 'exec_error']);
});
});
// ---- contract ---------------------------------------------------------------
describe('MISTAKE_THRESHOLD + MISTAKE_RECOVERY_NOTE', () => {
it('threshold is a positive integer (tests assume 3)', () => {
expect(MISTAKE_THRESHOLD).toBeGreaterThan(0);
expect(Number.isInteger(MISTAKE_THRESHOLD)).toBe(true);
});
it('recovery note is a non-empty model-facing string', () => {
expect(typeof MISTAKE_RECOVERY_NOTE).toBe('string');
expect(MISTAKE_RECOVERY_NOTE.length).toBeGreaterThan(0);
});
});

View File

@@ -31,10 +31,16 @@ export const SUMMARY_TEMPLATE = `Output exactly the Markdown structure shown ins
## Relevant Files ## Relevant Files
- [file or directory path: why it matters, or "(none)"] - [file or directory path: why it matters, or "(none)"]
## Files Read
- [file or directory path that has been read/searched this session, or "(none)"]
</template> </template>
Rules: Rules:
- Keep every section, even when empty. - Keep every section, even when empty.
- Use terse bullets, not prose paragraphs. - Use terse bullets, not prose paragraphs.
- Preserve exact file paths, commands, error strings, and identifiers when known. - Preserve exact file paths, commands, error strings, and identifiers when known.
- For ## Files Read: this is a cumulative provenance ledger. MERGE the paths
listed in any "## Files Read" block provided below with those already in the
previous summary — never drop a previously-recorded path. Sort and dedupe.
- Do not mention the summary process or that context was compacted.`; - Do not mention the summary process or that context was compacted.`;

View File

@@ -181,6 +181,54 @@ export function select(
}; };
} }
// === file-provenance ledger (#12, Part B) ===
// Read tools whose path/target arg names a file or directory that was read.
// BooChat (apps/server) is read-only — there are no write tools, so the ledger
// only ever has a "Files Read" side (apps/coder can add "Modified" later).
const READ_TOOL_ARG: Record<string, string> = {
view_file: 'path',
list_dir: 'path',
grep: 'path',
find_files: 'path',
};
// Derive a deterministic, deduped, sorted list of file/dir paths read by the
// HEAD messages being summarized. Pure — scans assistant tool_calls only; the
// boundary (which messages are "head") is decided by select() at the call site.
// We derive at compaction time rather than via a live accumulator because
// TurnArgs resets per turn and would miss reads on non-compacting turns; the
// head messages are the authoritative record of what was read in the window
// being summarized. The result propagates forward as summary text across
// compactions (the LLM merges it into ## Files Read), so a path read long ago
// survives even after its originating messages are compacted out.
export function deriveFilesRead(head: CompactionMessage[]): string[] {
const paths = new Set<string>();
for (const m of head) {
if (m.role !== 'assistant') continue;
if (!m.tool_calls) continue;
for (const tc of m.tool_calls) {
const argName = READ_TOOL_ARG[tc.name];
if (!argName) continue;
const raw = (tc.args as Record<string, unknown> | null)?.[argName];
if (typeof raw === 'string' && raw.trim().length > 0) {
paths.add(raw.trim());
}
}
}
return [...paths].sort();
}
// Format the derived paths as a deterministic ## Files Read block for injection
// into buildPrompt's context array. Returns null when nothing was read (so we
// don't inject an empty section). The summarizer merges this into the rolling
// summary's ## Files Read section per the SUMMARY_TEMPLATE instructions.
export function buildFilesReadContext(head: CompactionMessage[]): string | null {
const paths = deriveFilesRead(head);
if (paths.length === 0) return null;
return ['## Files Read', ...paths.map((p) => `- ${p}`)].join('\n');
}
// === prompt assembly === // === prompt assembly ===
// Build the final user message that asks the model to (re)produce the // Build the final user message that asks the model to (re)produce the
@@ -220,15 +268,26 @@ export interface OpenAiMessage {
tool_call_id?: string; tool_call_id?: string;
} }
function isCapHitSentinel(m: CompactionMessage): boolean { // #12: mirror inference/sentinels.ts:isAnySentinel over the CompactionMessage
return m.role === 'system' && m.metadata != null && m.metadata.kind === 'cap_hit'; // shape (which carries metadata as { kind?: string } | null, not the full
// Message type isAnySentinel expects). All UI-only sentinels are stripped from
// the head payload — they never go to the summarizer LLM. Keep the kind list in
// sync with isAnySentinel in sentinels.ts.
const SENTINEL_KINDS = new Set(['cap_hit', 'doom_loop', 'mistake_recovery']);
function isAnySentinel(m: CompactionMessage): boolean {
return (
m.role === 'system' &&
m.metadata != null &&
typeof m.metadata.kind === 'string' &&
SENTINEL_KINDS.has(m.metadata.kind)
);
} }
// v1.13.6: exported for unit-test access (reasoning render coverage). // v1.13.6: exported for unit-test access (reasoning render coverage).
export function buildHeadPayload(head: CompactionMessage[]): OpenAiMessage[] { export function buildHeadPayload(head: CompactionMessage[]): OpenAiMessage[] {
const out: OpenAiMessage[] = []; const out: OpenAiMessage[] = [];
for (const m of head) { for (const m of head) {
if (isCapHitSentinel(m)) continue; if (isAnySentinel(m)) continue;
if (m.role === 'assistant' && (m.status === 'streaming' || m.status === 'cancelled')) continue; if (m.role === 'assistant' && (m.status === 'streaming' || m.status === 'cancelled')) continue;
if (m.kind === 'compact') { if (m.kind === 'compact') {
// Legacy compact row — pass through as system context. The new // Legacy compact row — pass through as system context. The new
@@ -417,7 +476,14 @@ export async function process(input: ProcessInput): Promise<void> {
// user message carrying buildPrompt(previousSummary, []). No system prompt // user message carrying buildPrompt(previousSummary, []). No system prompt
// — matches opencode (`system: []`); the template + anchor are sufficient. // — matches opencode (`system: []`); the template + anchor are sufficient.
const headPayload = buildHeadPayload(sel.head); const headPayload = buildHeadPayload(sel.head);
const finalUser: OpenAiMessage = { role: 'user', content: buildPrompt(previousSummary, []) }; // #12 Part B: derive the file-provenance ledger from the head's read-tool
// calls and inject it as a deterministic ## Files Read context block so the
// summarizer merges it into the rolling summary. Empty → no injection.
const filesReadCtx = buildFilesReadContext(sel.head);
const finalUser: OpenAiMessage = {
role: 'user',
content: buildPrompt(previousSummary, filesReadCtx ? [filesReadCtx] : []),
};
const payload = [...headPayload, finalUser]; const payload = [...headPayload, finalUser];
log.info( log.info(

View File

@@ -19,6 +19,14 @@ export type {
} from './turn.js'; } from './turn.js';
export type { ToolPhaseResult } from './tool-phase.js'; export type { ToolPhaseResult } from './tool-phase.js';
export { detectDoomLoop, DOOM_LOOP_THRESHOLD } from './sentinels.js'; export { detectDoomLoop, DOOM_LOOP_THRESHOLD } from './sentinels.js';
export {
detectMistakePattern,
freshMistakeState,
recordStep,
MISTAKE_THRESHOLD,
MISTAKE_RECOVERY_NOTE,
} from './mistake-tracker.js';
export type { FailureKind, MistakeState } from './mistake-tracker.js';
export { buildMessagesPayload } from './payload.js'; export { buildMessagesPayload } from './payload.js';
export { generateToolUseSummary } from './tool-summaries.js'; export { generateToolUseSummary } from './tool-summaries.js';
export type { ToolInfo } from './tool-summaries.js'; export type { ToolInfo } from './tool-summaries.js';

View File

@@ -0,0 +1,69 @@
// v#12 MistakeTracker: heterogeneous-failure recovery. Complements the
// doom-loop guard (sentinels.ts:detectDoomLoop, which only catches *identical*
// repeats) by catching a run of consecutive tool FAILURES the model isn't
// recovering from — even when each failure is a *different* error. Algorithm
// reimplemented from cline's mistake-counting pattern (NOT vendored).
//
// Pure module — mirrors sentinels.ts:detectDoomLoop. No DB, no I/O. The state
// lives loop-local in TurnArgs (reset per runInference, like recentToolCalls).
// The failure taxonomy already distinguished in tool-phase.ts:executeToolCall.
// 'api_error' is reserved for upstream-model failures surfaced as tool outcomes
// (no current emit site on apps/server, but the union mirrors the design doc
// so a future caller can record it without a type change).
export type FailureKind =
| 'zod_reject'
| 'tool_not_found'
| 'exec_error'
| 'api_error'
| 'permission_denied';
// Smallest streak that doesn't false-positive on a model that retries once
// after a transient error. Matches DOOM_LOOP_THRESHOLD's rationale.
export const MISTAKE_THRESHOLD = 3;
export interface MistakeState {
// The current consecutive-failure streak (any successful tool step clears it).
run: FailureKind[];
// How many recovery nudges have fired without an intervening success. Used to
// escalate (stop the turn) on the second trip rather than nudging forever.
nudges: number;
}
export function freshMistakeState(): MistakeState {
return { run: [], nudges: 0 };
}
// Record one tool step's outcome. A 'success' clears BOTH the streak and the
// nudge counter (the model recovered). A FailureKind pushes onto the streak.
export function recordStep(
state: MistakeState,
outcome: FailureKind | 'success',
): void {
if (outcome === 'success') {
state.run = [];
state.nudges = 0;
return;
}
state.run.push(outcome);
}
// Decide whether to intervene given the current streak. When the streak has
// reached MISTAKE_THRESHOLD: 'nudge' the first time (no nudge fired yet),
// 'escalate' if it trips again while a nudge is already outstanding (no
// intervening success cleared `nudges`). Below threshold → null.
//
// Pure — the caller is responsible for mutating `nudges`/`run` after acting on
// the decision (mirrors how turn.ts consumes detectDoomLoop's result).
export function detectMistakePattern(
state: MistakeState,
): 'nudge' | 'escalate' | null {
if (state.run.length < MISTAKE_THRESHOLD) return null;
return state.nudges === 0 ? 'nudge' : 'escalate';
}
// Model-facing guidance injected (transiently, for the next step only) when a
// nudge fires. Short + declarative for the same reliability reason as the
// cap-hit / doom-loop notes.
export const MISTAKE_RECOVERY_NOTE =
"You've hit several different errors in a row. Stop retrying variations — re-read the tool schemas, verify file paths and arguments exist before calling, and try a fundamentally different approach.";

View File

@@ -717,3 +717,57 @@ async function insertDoomLoopSentinel(
metadata, metadata,
}); });
} }
// #12 MistakeTracker: heterogeneous-failure recovery sentinel. Mirrors
// insertDoomLoopSentinel structurally — a role='system', status='complete' row
// firing the standard message_started → delta → message_complete frame
// sequence. Two variants distinguished by `escalated`:
// - escalated:false → a nudge fired; recovery guidance was injected into the
// model's next step and the loop continued. can_continue is true (the turn
// is still live).
// - escalated:true → the nudge didn't break the failure run; the turn was
// stopped (cap-hit-style). can_continue is true so the UI can still offer a
// Continue affordance — a fresh user turn resets the tracker.
export async function insertMistakeRecoverySentinel(
ctx: InferenceContext,
sessionId: string,
chatId: string,
opts: { failureKinds: string[]; count: number; escalated: boolean; canContinue: boolean },
): Promise<void> {
const metadata: MessageMetadata = {
kind: 'mistake_recovery',
failure_kinds: opts.failureKinds,
count: opts.count,
escalated: opts.escalated,
can_continue: opts.canContinue,
};
const content = opts.escalated
? `Repeated different errors persisted after a recovery nudge (${opts.count} in a row). Stopping the tool-call loop.`
: `Hit ${opts.count} different errors in a row. Injected recovery guidance and continuing.`;
const [row] = await ctx.sql<{ id: string }[]>`
INSERT INTO messages (session_id, chat_id, role, content, status, created_at, metadata)
VALUES (${sessionId}, ${chatId}, 'system', ${content}, 'complete', clock_timestamp(), ${ctx.sql.json(metadata as never)})
RETURNING id
`;
// Standard frame sequence — same as cap-hit / doom-loop sentinels.
ctx.publish(sessionId, {
type: 'message_started',
message_id: row!.id,
chat_id: chatId,
role: 'system',
});
ctx.publish(sessionId, {
type: 'delta',
message_id: row!.id,
chat_id: chatId,
content,
});
ctx.publish(sessionId, {
type: 'message_complete',
message_id: row!.id,
chat_id: chatId,
metadata,
});
}

View File

@@ -48,6 +48,18 @@ export function isDoomLoopSentinel(m: Message): boolean {
); );
} }
export function isAnySentinel(m: Message): boolean { // #12: mistake-recovery sentinel. Same UI-only semantics as cap-hit /
return isCapHitSentinel(m) || isDoomLoopSentinel(m); // doom-loop — never sent to the LLM (filtered via the isAnySentinel check
// below, which buildMessagesPayload + buildHeadPayload both consult).
export function isMistakeRecoverySentinel(m: Message): boolean {
return (
m.role === 'system' &&
m.metadata !== null &&
typeof m.metadata === 'object' &&
(m.metadata as { kind?: unknown }).kind === 'mistake_recovery'
);
}
export function isAnySentinel(m: Message): boolean {
return isCapHitSentinel(m) || isDoomLoopSentinel(m) || isMistakeRecoverySentinel(m);
} }

View File

@@ -17,6 +17,7 @@ import { formatUnknownToolError } from './tool-suggestions.js';
// prompted about paths we couldn't grant anyway (e.g. /etc/passwd). // prompted about paths we couldn't grant anyway (e.g. /etc/passwd).
import { resolveGrantRoot } from '../grant_resolver.js'; import { resolveGrantRoot } from '../grant_resolver.js';
import { stripToolMarkup } from './tool-call-parser.js'; import { stripToolMarkup } from './tool-call-parser.js';
import type { FailureKind } from './mistake-tracker.js';
import type { import type {
InferenceContext, InferenceContext,
StreamResult, StreamResult,
@@ -33,13 +34,18 @@ async function executeToolCall(
toolCall: ToolCall, toolCall: ToolCall,
extraRoots: readonly string[], extraRoots: readonly string[],
toolCtx?: ToolExecCtx, toolCtx?: ToolExecCtx,
): Promise<{ output: unknown; truncated: boolean; error?: string }> { ): Promise<{ output: unknown; truncated: boolean; error?: string; outcome: FailureKind | 'success' }> {
// v#12 MistakeTracker: every return path carries an `outcome` so the turn
// loop can detect a run of heterogeneous failures. The failure taxonomy
// mirrors mistake-tracker.ts:FailureKind. Does NOT alter the existing
// output/truncated/error shape — outcome is purely additive.
const tool = TOOLS_BY_NAME[toolCall.name]; const tool = TOOLS_BY_NAME[toolCall.name];
if (!tool) { if (!tool) {
return { return {
output: null, output: null,
truncated: false, truncated: false,
error: formatUnknownToolError(toolCall.name, Object.keys(TOOLS_BY_NAME)), error: formatUnknownToolError(toolCall.name, Object.keys(TOOLS_BY_NAME)),
outcome: 'tool_not_found',
}; };
} }
const parsed = tool.inputSchema.safeParse(toolCall.args); const parsed = tool.inputSchema.safeParse(toolCall.args);
@@ -64,6 +70,7 @@ async function executeToolCall(
output: null, output: null,
truncated: false, truncated: false,
error: `tool '${toolCall.name}' rejected — ${hint}`, error: `tool '${toolCall.name}' rejected — ${hint}`,
outcome: 'zod_reject',
}; };
} }
try { try {
@@ -72,15 +79,16 @@ async function executeToolCall(
typeof output === 'object' && output !== null && 'truncated' in output typeof output === 'object' && output !== null && 'truncated' in output
? Boolean((output as { truncated: unknown }).truncated) ? Boolean((output as { truncated: unknown }).truncated)
: false; : false;
return { output, truncated }; return { output, truncated, outcome: 'success' };
} catch (err) { } catch (err) {
if (err instanceof PathScopeError) { if (err instanceof PathScopeError) {
return { output: null, truncated: false, error: err.message }; return { output: null, truncated: false, error: err.message, outcome: 'permission_denied' };
} }
return { return {
output: null, output: null,
truncated: false, truncated: false,
error: err instanceof Error ? err.message : String(err), error: err instanceof Error ? err.message : String(err),
outcome: 'exec_error',
}; };
} }
} }
@@ -93,6 +101,12 @@ export interface ToolPhaseResult {
toolCallCount: number; toolCallCount: number;
toolCalls: ToolCall[]; toolCalls: ToolCall[];
nextAssistantId: string | null; nextAssistantId: string | null;
// v#12 MistakeTracker: one outcome per executed tool call, in no particular
// order (filled inside the Promise.all callbacks). The turn loop folds these
// into TurnArgs.mistakeTracker via recordStep. Pause/auto-grant control-flow
// tools record 'success' (they aren't model mistakes); the genuine error
// paths record their FailureKind.
outcomes: (FailureKind | 'success')[];
} }
export async function executeToolPhase( export async function executeToolPhase(
@@ -187,6 +201,10 @@ export async function executeToolPhase(
// for the synthesis input. Race-free under Promise.all because each // for the synthesis input. Race-free under Promise.all because each
// callback pushes its own captured value. // callback pushes its own captured value.
const synthEntries: Array<{ tc: ToolCall; output: unknown; error?: string }> = []; const synthEntries: Array<{ tc: ToolCall; output: unknown; error?: string }> = [];
// v#12 MistakeTracker: collect each tool's outcome. Concurrent pushes under
// Promise.all are safe (each callback appends its own value; order is not
// significant to recordStep which folds them sequentially).
const outcomes: (FailureKind | 'success')[] = [];
await Promise.all( await Promise.all(
toolCalls.map(async (tc) => { toolCalls.map(async (tc) => {
const [toolRow] = await ctx.sql<{ id: string }[]>` const [toolRow] = await ctx.sql<{ id: string }[]>`
@@ -197,6 +215,7 @@ export async function executeToolPhase(
const toolMessageId = toolRow!.id; const toolMessageId = toolRow!.id;
if (tc.name === 'ask_user_input') { if (tc.name === 'ask_user_input') {
pausingForUserInput = true; pausingForUserInput = true;
outcomes.push('success');
const sentinel = { tool_call_id: tc.id, output: null, truncated: false }; const sentinel = { tool_call_id: tc.id, output: null, truncated: false };
// v1.13.20: parts-only. The answer-endpoint UPDATE later // v1.13.20: parts-only. The answer-endpoint UPDATE later
// (messages.ts) will delete and re-insert this part when the user // (messages.ts) will delete and re-insert this part when the user
@@ -227,7 +246,10 @@ export async function executeToolPhase(
); );
if (!resolution.ok) { if (!resolution.ok) {
// Auto-deny without pausing. The model sees the reason on its // Auto-deny without pausing. The model sees the reason on its
// next turn and decides what to do. // next turn and decides what to do. Counts as a permission_denied
// failure for the mistake tracker (the model asked for a path it
// can't have — a recoverable mistake it should learn from).
outcomes.push('permission_denied');
const stored = { const stored = {
tool_call_id: tc.id, tool_call_id: tc.id,
output: `denied: ${resolution.reason}`, output: `denied: ${resolution.reason}`,
@@ -255,6 +277,7 @@ export async function executeToolPhase(
// pause. The grant endpoint re-derives the root at decision time // pause. The grant endpoint re-derives the root at decision time
// (state may have changed in the meantime) so we don't stash it here. // (state may have changed in the meantime) so we don't stash it here.
pausingForUserInput = true; pausingForUserInput = true;
outcomes.push('success');
const sentinel = { tool_call_id: tc.id, output: null, truncated: false }; const sentinel = { tool_call_id: tc.id, output: null, truncated: false };
// v1.13.20: parts-only write. // v1.13.20: parts-only write.
await insertParts( await insertParts(
@@ -267,6 +290,10 @@ export async function executeToolPhase(
return; return;
} }
if (agent && !matchToolGlob(tc.name, agent.tools)) { if (agent && !matchToolGlob(tc.name, agent.tools)) {
// Agent-scope denial — the model called a tool outside its whitelist.
// permission_denied for the mistake tracker (the model should pick a
// tool it's actually allowed to use).
outcomes.push('permission_denied');
const stored = { const stored = {
tool_call_id: tc.id, tool_call_id: tc.id,
output: null, output: null,
@@ -295,6 +322,10 @@ export async function executeToolPhase(
sql: ctx.sql, sql: ctx.sql,
sessionId, sessionId,
}); });
// v#12 MistakeTracker: record the real execution outcome (success or a
// FailureKind). This is the primary signal for heterogeneous-failure
// detection.
outcomes.push(tres.outcome);
if (SYNTHESIS_TOOLS.has(tc.name)) { if (SYNTHESIS_TOOLS.has(tc.name)) {
synthEntries.push({ tc, output: tres.output, ...(tres.error ? { error: tres.error } : {}) }); synthEntries.push({ tc, output: tres.output, ...(tres.error ? { error: tres.error } : {}) });
} }
@@ -340,6 +371,7 @@ export async function executeToolPhase(
toolCallCount: toolCalls.length, toolCallCount: toolCalls.length,
toolCalls, toolCalls,
nextAssistantId: null, nextAssistantId: null,
outcomes,
}; };
} }
@@ -378,6 +410,7 @@ export async function executeToolPhase(
toolCallCount: toolCalls.length, toolCallCount: toolCalls.length,
toolCalls, toolCalls,
nextAssistantId: null, nextAssistantId: null,
outcomes,
}; };
} }
// ran === false → synthesis failed (timeout / model error) → fall through // ran === false → synthesis failed (timeout / model error) → fall through
@@ -397,5 +430,6 @@ export async function executeToolPhase(
toolCallCount: toolCalls.length, toolCallCount: toolCalls.length,
toolCalls, toolCalls,
nextAssistantId: nextAssistant!.id, nextAssistantId: nextAssistant!.id,
outcomes,
}; };
} }

View File

@@ -22,6 +22,13 @@ import { resolveToolBudget } from './budget.js';
import { import {
detectDoomLoop, detectDoomLoop,
} from './sentinels.js'; } from './sentinels.js';
import {
detectMistakePattern,
freshMistakeState,
recordStep,
MISTAKE_RECOVERY_NOTE,
type MistakeState,
} from './mistake-tracker.js';
import { import {
buildMessagesPayload, buildMessagesPayload,
loadContext, loadContext,
@@ -39,6 +46,7 @@ import {
runCapHitSummary, runCapHitSummary,
runDoomLoopSummary, runDoomLoopSummary,
runStepCapSummary, runStepCapSummary,
insertMistakeRecoverySentinel,
} from './sentinel-summaries.js'; } from './sentinel-summaries.js';
// v1.14.0: hard ceiling on the number of stream-and-tool iterations per // v1.14.0: hard ceiling on the number of stream-and-tool iterations per
@@ -144,6 +152,16 @@ export interface TurnArgs {
// boundaries by runInference, same as toolsUsed. Doom-loop check at the // boundaries by runInference, same as toolsUsed. Doom-loop check at the
// top of runAssistantTurn slices the last DOOM_LOOP_THRESHOLD entries. // top of runAssistantTurn slices the last DOOM_LOOP_THRESHOLD entries.
recentToolCalls: ToolCall[]; recentToolCalls: ToolCall[];
// v#12 MistakeTracker: heterogeneous-failure recovery state. Loop-local,
// reset per runInference (user-message boundary) like recentToolCalls. Folds
// tool-phase outcomes via recordStep each iteration; detectMistakePattern
// gates the nudge/escalate decision.
mistakeTracker: MistakeState;
// v#12: transient model-facing recovery note set when a nudge fires. Consumed
// (appended as a role:'system' message + cleared) on the NEXT payload build.
// Never persisted — mirrors how the cap-hit/doom-loop notes live only inside
// the summary call's messages array.
pendingRecoveryNote?: string;
signal: AbortSignal | undefined; signal: AbortSignal | undefined;
} }
@@ -188,6 +206,12 @@ export async function runAssistantTurn(
let toolsUsed = args.toolsUsed; let toolsUsed = args.toolsUsed;
let recentToolCalls = args.recentToolCalls; let recentToolCalls = args.recentToolCalls;
let assistantMessageId = args.assistantMessageId; let assistantMessageId = args.assistantMessageId;
// v#12 MistakeTracker: the tracker state is carried on `args` (mutated in
// place by recordStep). pendingRecoveryNote is a loop-local because it is a
// single-step transient — set when a nudge fires, consumed (injected into the
// next payload) and cleared on the following iteration.
const mistakeTracker = args.mistakeTracker;
let pendingRecoveryNote: string | undefined = args.pendingRecoveryNote;
while (stepNumber < effectiveCap) { while (stepNumber < effectiveCap) {
// ---- doom-loop check (moved from top-of-function) ---- // ---- doom-loop check (moved from top-of-function) ----
@@ -196,7 +220,7 @@ export async function runAssistantTurn(
// Need fresh history for the summary. // Need fresh history for the summary.
const loaded = await loadContext(ctx.sql, sessionId, chatId); const loaded = await loadContext(ctx.sql, sessionId, chatId);
if (loaded) { if (loaded) {
const iterArgs: TurnArgs = { sessionId, chatId, assistantMessageId, toolsUsed, recentToolCalls, signal }; const iterArgs: TurnArgs = { sessionId, chatId, assistantMessageId, toolsUsed, recentToolCalls, mistakeTracker, signal };
await runDoomLoopSummary(ctx, iterArgs, loaded.session, loaded.project, loaded.history, agent, loop); await runDoomLoopSummary(ctx, iterArgs, loaded.session, loaded.project, loaded.history, agent, loop);
} }
break; break;
@@ -206,7 +230,7 @@ export async function runAssistantTurn(
if (toolsUsed >= budget) { if (toolsUsed >= budget) {
const loaded = await loadContext(ctx.sql, sessionId, chatId); const loaded = await loadContext(ctx.sql, sessionId, chatId);
if (loaded) { if (loaded) {
const iterArgs: TurnArgs = { sessionId, chatId, assistantMessageId, toolsUsed, recentToolCalls, signal }; const iterArgs: TurnArgs = { sessionId, chatId, assistantMessageId, toolsUsed, recentToolCalls, mistakeTracker, signal };
await runCapHitSummary(ctx, iterArgs, loaded.session, loaded.project, loaded.history, agent, budget); await runCapHitSummary(ctx, iterArgs, loaded.session, loaded.project, loaded.history, agent, budget);
} }
break; break;
@@ -265,7 +289,16 @@ export async function runAssistantTurn(
} }
} }
const iterArgs: TurnArgs = { sessionId, chatId, assistantMessageId, toolsUsed, recentToolCalls, signal }; // v#12 MistakeTracker: if the prior iteration's nudge fired, append the
// transient recovery note to THIS payload (consumed exactly once, then
// cleared). Never persisted — same lifecycle as the cap-hit/doom-loop
// summary notes, which live only inside the in-memory messages array.
if (pendingRecoveryNote) {
messages.push({ role: 'system', content: pendingRecoveryNote });
pendingRecoveryNote = undefined;
}
const iterArgs: TurnArgs = { sessionId, chatId, assistantMessageId, toolsUsed, recentToolCalls, mistakeTracker, signal };
const state: StreamPhaseState = { accumulated: '', startedAt: null }; const state: StreamPhaseState = { accumulated: '', startedAt: null };
let result: StreamResult; let result: StreamResult;
try { try {
@@ -305,10 +338,78 @@ export async function runAssistantTurn(
recentToolCalls = [...recentToolCalls, ...toolPhaseResult.toolCalls]; recentToolCalls = [...recentToolCalls, ...toolPhaseResult.toolCalls];
stepNumber++; stepNumber++;
// v#12 MistakeTracker: fold this iteration's tool outcomes into the
// tracker, in order. recordStep mutates `mistakeTracker` in place (it is
// the same object referenced by args). A 'success' clears the streak.
for (const o of toolPhaseResult.outcomes) {
recordStep(mistakeTracker, o);
}
if (toolPhaseResult.action !== 'continue') { if (toolPhaseResult.action !== 'continue') {
// 'paused' (user input) or 'synthesis_done' — stop the loop. // 'paused' (user input) or 'synthesis_done' — stop the loop. The turn is
// already ending, so neither a nudge nor an escalate would change the
// control flow; we skip the mistake decision here.
break; break;
} }
// v#12 MistakeTracker: heterogeneous-failure decision. Only evaluated on
// the 'continue' path (the only case where the loop would otherwise
// proceed to another step). Complements the doom-loop check above, which
// only catches *identical* repeats.
const mistake = detectMistakePattern(mistakeTracker);
if (mistake === 'nudge') {
// Soft intervention: inject model-facing recovery guidance into the NEXT
// step's payload, drop a UI sentinel, bump nudges, reset the streak, and
// continue. The note is consumed (and cleared) at the top of the next
// iteration's payload build.
pendingRecoveryNote = MISTAKE_RECOVERY_NOTE;
const failureKinds = [...mistakeTracker.run];
await insertMistakeRecoverySentinel(ctx, sessionId, chatId, {
failureKinds,
count: failureKinds.length,
escalated: false,
canContinue: true,
});
mistakeTracker.nudges += 1;
mistakeTracker.run = [];
ctx.log.info(
{ sessionId, chatId, step: stepNumber, nudges: mistakeTracker.nudges, failureKinds },
'mistake_recovery nudge',
);
assistantMessageId = toolPhaseResult.nextAssistantId!;
continue;
}
if (mistake === 'escalate') {
// The nudge didn't break the failure run — stop the turn (cap-hit-style)
// to avoid burning the whole step budget on heterogeneous failures. The
// next assistant row is still 'streaming'; finalize it as a short note so
// the slot doesn't dangle, then drop the escalate sentinel.
const failureKinds = [...mistakeTracker.run];
assistantMessageId = toolPhaseResult.nextAssistantId!;
await ctx.sql`
UPDATE messages
SET content = '', status = 'complete', finished_at = clock_timestamp()
WHERE id = ${assistantMessageId}
`;
ctx.publish(sessionId, {
type: 'message_complete',
message_id: assistantMessageId,
chat_id: chatId,
});
await insertMistakeRecoverySentinel(ctx, sessionId, chatId, {
failureKinds,
count: failureKinds.length,
escalated: true,
canContinue: true,
});
ctx.publishUser({ type: 'chat_status', chat_id: chatId, status: 'idle', at: new Date().toISOString() });
ctx.log.info(
{ sessionId, chatId, step: stepNumber, failureKinds },
'mistake_recovery escalate — stopping turn',
);
break;
}
// 'continue' — advance to next assistant message. // 'continue' — advance to next assistant message.
assistantMessageId = toolPhaseResult.nextAssistantId!; assistantMessageId = toolPhaseResult.nextAssistantId!;
} }
@@ -320,7 +421,7 @@ export async function runAssistantTurn(
if (stepNumber >= effectiveCap && effectiveCap < Infinity) { if (stepNumber >= effectiveCap && effectiveCap < Infinity) {
const loaded = await loadContext(ctx.sql, sessionId, chatId); const loaded = await loadContext(ctx.sql, sessionId, chatId);
if (loaded) { if (loaded) {
const capArgs: TurnArgs = { sessionId, chatId, assistantMessageId, toolsUsed, recentToolCalls, signal }; const capArgs: TurnArgs = { sessionId, chatId, assistantMessageId, toolsUsed, recentToolCalls, mistakeTracker, signal };
await runStepCapSummary(ctx, capArgs, loaded.session, loaded.project, loaded.history, agent, stepNumber, effectiveCap); await runStepCapSummary(ctx, capArgs, loaded.session, loaded.project, loaded.history, agent, stepNumber, effectiveCap);
} }
} }
@@ -378,12 +479,16 @@ export async function runInference(
// per-call budget. // per-call budget.
// v1.11.6: recentToolCalls also resets — doom-loop detection is scoped // v1.11.6: recentToolCalls also resets — doom-loop detection is scoped
// to a single user-message turn, so a Continue starts with no history. // to a single user-message turn, so a Continue starts with no history.
// v#12 MistakeTracker: fresh per user-message turn, like recentToolCalls.
// Tracks consecutive heterogeneous tool failures across the loop's
// stream-and-tool iterations within this turn.
return runAssistantTurn(ctx, { return runAssistantTurn(ctx, {
sessionId, sessionId,
chatId, chatId,
assistantMessageId, assistantMessageId,
toolsUsed: 0, toolsUsed: 0,
recentToolCalls: [], recentToolCalls: [],
mistakeTracker: freshMistakeState(),
signal, signal,
}); });
} }

View File

@@ -210,6 +210,11 @@ export type ErrorReason =
// cap_hit — system sentinel emitted when tool budget is exhausted // cap_hit — system sentinel emitted when tool budget is exhausted
// doom_loop — system sentinel emitted when the model called the same // doom_loop — system sentinel emitted when the model called the same
// tool with the same args DOOM_LOOP_THRESHOLD times in a row // tool with the same args DOOM_LOOP_THRESHOLD times in a row
// mistake_recovery — system sentinel emitted when a run of consecutive
// *heterogeneous* tool failures is detected (#12). A nudge
// (escalated:false) injects model-facing recovery guidance
// and continues; an escalate (escalated:true) stops the
// turn after the nudge failed to break the failure run.
// error — attached to a failed assistant message so UI can show reason // error — attached to a failed assistant message so UI can show reason
export type MessageMetadata = export type MessageMetadata =
| { | {
@@ -225,6 +230,14 @@ export type MessageMetadata =
args: Record<string, unknown>; args: Record<string, unknown>;
threshold: number; threshold: number;
} }
| {
// PINNED CONTRACT (#12) — mirrored byte-for-byte in apps/web/src/api/types.ts.
kind: 'mistake_recovery';
failure_kinds: string[];
count: number;
escalated: boolean;
can_continue?: boolean;
}
| { | {
kind: 'error'; kind: 'error';
error_reason: ErrorReason; error_reason: ErrorReason;

View File

@@ -155,6 +155,9 @@ export type ErrorReason =
// budget + agent name + whether Continue is still allowed. // budget + agent name + whether Continue is still allowed.
// doom_loop — sentinel emitted when the model called the same tool with // doom_loop — sentinel emitted when the model called the same tool with
// the same arguments threshold times in a row. // the same arguments threshold times in a row.
// mistake_recovery — sentinel emitted when the model hit repeated *different*
// errors; non-escalated means recovery guidance was injected and
// the turn continues, escalated means the turn was stopped.
// error — attached to a failed assistant message so the bubble can show // error — attached to a failed assistant message so the bubble can show
// a specific reason on reload (WS error frame is one-shot). // a specific reason on reload (WS error frame is one-shot).
export type MessageMetadata = export type MessageMetadata =
@@ -171,6 +174,13 @@ export type MessageMetadata =
args: Record<string, unknown>; args: Record<string, unknown>;
threshold: number; threshold: number;
} }
| {
kind: 'mistake_recovery';
failure_kinds: string[];
count: number;
escalated: boolean;
can_continue?: boolean;
}
| { | {
kind: 'error'; kind: 'error';
error_reason: ErrorReason; error_reason: ErrorReason;

View File

@@ -1,6 +1,6 @@
import { useEffect, useState } from 'react'; import { useEffect, useState } from 'react';
import type { ReactNode } from 'react'; import type { ReactNode } from 'react';
import { ChevronDown, ChevronRight, Copy, RefreshCw, Check, Share2, RotateCw, GitFork, Trash2, Brain, History } from 'lucide-react'; import { ChevronDown, ChevronRight, Copy, RefreshCw, Check, Share2, RotateCw, GitFork, Trash2, Brain, History, AlertCircle } from 'lucide-react';
import { toast } from 'sonner'; import { toast } from 'sonner';
import type { Chat, ErrorReason, Message } from '@/api/types'; import type { Chat, ErrorReason, Message } from '@/api/types';
import { api } from '@/api/client'; import { api } from '@/api/client';
@@ -637,6 +637,76 @@ function ReasoningBlock({ text, streaming }: { text: string; streaming: boolean
); );
} }
// feature #12: mistake-recovery sentinel. Inserted by the backend as a
// role='system', metadata.kind='mistake_recovery' row when the model hit
// repeated *different* errors (distinct from doom_loop, which is the same
// call repeated). Visual treatment mirrors CapHitSentinel / DoomLoopSentinel
// (amber card + alert icon). Non-escalated → recovery guidance was injected
// and the turn continues. Escalated → the turn was stopped; if can_continue
// is set, offer the same Continue affordance as the cap-hit sentinel.
// Loose `!= null` guards per the CLAUDE.md coder-message note (coder rows pass
// metadata as undefined, not null).
function MistakeRecoverySentinel({ message }: { message: Message }) {
const meta = message.metadata;
const isMistakeRecovery =
meta != null && typeof meta === 'object' && meta.kind === 'mistake_recovery';
const failureKinds = isMistakeRecovery ? meta.failure_kinds : [];
const escalated = isMistakeRecovery ? meta.escalated : false;
const canContinue = isMistakeRecovery ? meta.can_continue === true : false;
const [continuing, setContinuing] = useState(false);
async function handleContinue() {
if (continuing || !canContinue) return;
setContinuing(true);
try {
await api.chats.continue(message.chat_id, message.id);
} catch (err) {
toast.error(err instanceof Error ? err.message : 'continue failed');
} finally {
setContinuing(false);
}
}
const kindsLabel =
Array.isArray(failureKinds) && failureKinds.length > 0
? failureKinds.join(', ')
: null;
return (
<div className="rounded-md border border-amber-500/40 bg-amber-500/10 text-sm">
<div className="px-3 py-2 flex items-start gap-2">
<AlertCircle className="size-4 text-amber-500 shrink-0 mt-0.5" />
<div className="flex-1 min-w-0 space-y-1">
<div className="text-xs font-medium text-amber-700 dark:text-amber-300">
{escalated ? 'Repeated errors — turn stopped' : 'Recovering from repeated errors'}
</div>
<div className="text-xs text-muted-foreground">
{escalated
? 'Repeated errors persisted — stopped the turn.'
: kindsLabel
? `Hit repeated different errors (${kindsLabel}) — recovery guidance injected, continuing.`
: 'Hit repeated different errors — recovery guidance injected, continuing.'}
</div>
{escalated && canContinue && (
<div className="pt-1">
<Button
type="button"
size="sm"
variant="outline"
onClick={() => void handleContinue()}
disabled={continuing}
>
{continuing ? 'Continuing…' : 'Continue'}
</Button>
</div>
)}
</div>
</div>
</div>
);
}
export function MessageBubble({ export function MessageBubble({
message, message,
sessionChats, sessionChats,
@@ -681,6 +751,13 @@ export function MessageBubble({
return <DoomLoopSentinel message={message} />; return <DoomLoopSentinel message={message} />;
} }
// feature #12: mistake-recovery sentinel. Non-escalated rows narrate that
// recovery guidance was injected mid-turn; escalated rows report the turn
// was stopped and (when can_continue) offer the cap-hit-style Continue.
if (message.role === 'system' && message.metadata?.kind === 'mistake_recovery') {
return <MistakeRecoverySentinel message={message} />;
}
// v1.8.2: tool messages and assistant tool_calls are now rendered by // v1.8.2: tool messages and assistant tool_calls are now rendered by
// MessageList via ToolCallLine / ToolCallGroup. Tool-role messages reach // MessageList via ToolCallLine / ToolCallGroup. Tool-role messages reach
// this point only if MessageList didn't consume them (shouldn't happen, // this point only if MessageList didn't consume them (shouldn't happen,

View File

@@ -0,0 +1,70 @@
# MistakeTracker + file-provenance ledger (#12)
**Status:** in progress (started 2026-06-01)
**Source:** `boocode_code_review_v2.md` §1 #12, §5e (cline — algorithm-reimplemented, not vendored).
Two native-inference (apps/server) hardening features. One cohesive backend change (they share
`TurnArgs` + the tool-phase observation point) + a small frontend sentinel render.
## Part A — MistakeTracker (heterogeneous-failure recovery)
Complements the doom-loop guard (`sentinels.ts:detectDoomLoop`, which only catches *identical*
repeats) by catching a run of consecutive tool **failures** the model isn't recovering from.
- New pure `apps/server/src/services/inference/mistake-tracker.ts` (mirrors `detectDoomLoop`):
- `FailureKind = 'zod_reject' | 'tool_not_found' | 'exec_error' | 'api_error' | 'permission_denied'`
(all already distinguished in `tool-phase.ts:executeToolCall`).
- `MISTAKE_THRESHOLD = 3`.
- State `{ run: FailureKind[]; nudges: number }``run` is the current consecutive-failure streak,
reset on ANY successful tool step; `nudges` counts recovery injections not yet cleared by a success.
- `recordStep(state, outcome)` where outcome is a failure kind or `'success'`.
- `detectMistakePattern(state): 'nudge' | 'escalate' | null``run.length >= 3``'nudge'` the first
time (`nudges === 0`), `'escalate'` if it trips again while `nudges >= 1` (no intervening success).
- Lives in `TurnArgs` (loop-local, reset per `runInference`, like `recentToolCalls`).
- Integration in `turn.ts` loop: after each tool phase, `recordStep` per tool outcome; then
`detectMistakePattern`:
- `'nudge'` (decision: soft + escalate): append a transient **model-facing** recovery-guidance system
message to the NEXT turn's payload (re-read schemas, verify paths exist before acting, try a
different approach — not retry variations), insert a `mistake_recovery` UI sentinel
(`escalated:false`), bump `nudges`, reset `run`. Loop continues.
- `'escalate'`: stop the turn (break), insert a `mistake_recovery` sentinel (`escalated:true`,
`can_continue:true`, cap-hit-style), finalize. Prevents heterogeneous failures from burning the
whole step budget.
## Part B — File-provenance ledger (Read-only)
- Accumulate file paths read by `view_file`/`grep`/`find_files`/`list_dir` into `TurnArgs.filesRead:
Set<string>` (recorded at the tool-phase, like the failure outcomes).
- On compaction (`compaction.ts:buildPrompt`), inject a deterministic, sorted `## Files Read` list into
the summary prompt context so the summarizer merges it into the rolling summary — **no new
table/column**; it propagates as summary text across compactions. `compaction-prompt.ts`'s
`SUMMARY_TEMPLATE` already has a `## Relevant Files` section to extend/merge with.
- BooChat is **read-only** (no write tools on apps/server) → "Files Modified" is N/A here; only
"Files Read". (The apps/coder write side can add "Modified" later.)
## Sentinel contract (pinned — backend + frontend must match)
New sentinel kind on `MessageMetadata` in BOTH `apps/server/src/types/api.ts` AND
`apps/web/src/api/types.ts`:
```
{ kind: 'mistake_recovery'; failure_kinds: string[]; count: number; escalated: boolean; can_continue?: boolean }
```
- `role='system'`, `status='complete'`, stripped from the LLM payload via `isAnySentinel` in
`payload.ts` (UI-only) and `compaction.ts:buildHeadPayload`.
- Frontend render branch in `apps/web/src/components/MessageBubble.tsx`: `escalated:false` →
"Hit repeated different errors — recovery guidance injected, continuing." `escalated:true` →
"Repeated errors persisted — stopped the turn." (mirror the doom-loop/cap-hit branches).
## Decisions (2026-06-01)
- MistakeTracker intervention: **soft nudge + escalate**.
- **UI sentinel** for recovery (`mistake_recovery`).
## Files (backend, one agent) / (frontend, one agent)
- Backend: `mistake-tracker.ts` (new), `turn.ts`, `tool-phase.ts`, `sentinels.ts`,
`sentinel-summaries.ts`, `payload.ts`, `compaction.ts`, `compaction-prompt.ts`, `types/api.ts` +
tests (`mistake-tracker.test.ts`, ledger/compaction assertions).
- Frontend: `apps/web/src/api/types.ts` (MessageMetadata arm) + `MessageBubble.tsx` (render branch).
MUST NOT touch Sam's WIP web files.
## Verify
- `pnpm -C apps/server test`; `pnpm -C apps/server build`; `npx tsc -p apps/web/tsconfig.app.json --noEmit`