docs: archive shipped openspec batches; add feature/plan/research notes

Move 13 shipped openspec change docs under openspec/changes/archived/.
Add docs/features/git-diff-panel, docs/plans/post-review-backlog, and
docs/research/cross-app-contract-ssot.md (the research behind the
@boocode/contracts SSOT work). Update BOOCHAT.md, BOOCODER.md, and
boocode_roadmap.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-02 21:20:33 +00:00
parent e5ce01ae72
commit 2a05d2f9fe
27 changed files with 2210 additions and 17 deletions

View File

@@ -0,0 +1,76 @@
# Discovery notes — post-review-backlog plan
Single source of truth for project context. Specialists: read this first, do NOT re-grep what is here.
Search further only for what your domain needs that is not covered.
## Tech stack
- Monorepo, pnpm workspaces: `apps/server` (BooChat — Fastify + postgres, native inference, read-only tools),
`apps/web` (React + Vite SPA), `apps/coder` (BooCoder — host systemd service, write tools + external-agent
dispatch, port 9502), `apps/booterm` (PTY/tmux). TypeScript strict, NodeNext (`.js` import suffixes) on
server + coder.
- Tests: vitest (pinned ^3). server `pnpm -C apps/server test`; coder `pnpm -C apps/coder test`
(`globals:false` — import describe/it/expect). Include glob `src/**/__tests__/**/*.test.ts`. No web test
harness, no linters. DB-integration tests opt-in via `DATABASE_URL` + `describe.runIf`.
- Deploy: apps/coder → `sudo systemctl restart boocoder`; apps/web|server → `docker compose up --build -d boocode`.
- Postgres 16, DB `boochat`. Two schema files: `apps/server/src/schema.sql` (sessions/chats/messages/
message_parts) + `apps/coder/src/schema.sql` (agent_sessions, worktrees, pending_changes, available_agents,
checkpoints, claude_session_entries, tasks extension).
## ADRs / coding standards found
- No `docs/adr/` directory. Architectural decisions live in `boocode_roadmap.md` (Decisions log) +
per-app `CLAUDE.md` files (auto-loaded when editing that subtree) + `openspec/changes/archived/`.
- Coding standards: `docs/coding-standards/` (canonical), surfaced via `.claude/rules/coding-standards/`
path-scoped index files. Not loaded automatically; open on demand.
- Cross-cutting conventions in root `CLAUDE.md` "Conventions" section (WS-frame parity, sentinels, JSONB
via `sql.json`, event dedup discipline, deploy-by-surface).
## Code touch points (per scope-brief item)
- **F1 task-cancel:** `apps/coder/src/services/dispatcher.ts` (private `ac=new AbortController()` at
~316/655/991/1248; `inflight=Map<sessionId,Promise>` at :56 — no per-task AbortController registry);
`apps/coder/src/routes/tasks.ts:110-148` (cancel route — inference.cancel + DB only); `routes/messages.ts:388`
(session stop); backends honor `ctx.signal`: `pty-dispatch.ts:159` (child.kill), `backends/warm-acp.ts:318`
(session/cancel), `backends/opencode-server.ts:775` (session.abort), `backends/claude-sdk.ts:209` (interrupt).
Frontend `apps/web/src/components/panes/CoderPane.tsx:987` handleStop; `api/client.ts:395` cancelTask.
- **F2 parser prune:** `apps/server/src/services/inference/tool-call-parser.ts` (exports extractToolCallBlocks,
stripToolMarkup, parseXmlToolCall, parseInvokeToolCall, isPlaceholderArgValue, XML_/INVOKE_ consts).
Live consumers: `stream-phase.ts:263-284` (extractToolCallBlocks, text-delta fallback path), `:285-294`
(structured tool-call path — authoritative today), `tool-phase.ts:122` + `error-handler.ts:25,106`
(stripToolMarkup). llama-swap native `--jinja` parsing confirmed ON (external host `:8401`).
- **F3 xml logging:** `tool-call-parser.ts:65` console.debug; one call site in `stream-phase.ts` executeStreamPhase.
- **F4 notify-hook:** agent config paths `~/.claude/settings.json`, `~/.qwen/settings.json`,
`~/.config/goose/`; existing readers `claude-command-discovery.ts:84`, `qwen-settings.ts`,
`provider-registry.ts`. Existing normalize helper `apps/coder/src/services/normalize-agent-status.ts`
(`normalizeAgentEvent`). Existing status publish wired at dispatcher turn boundaries (v2.7.6);
`index.ts:86` references #10. `permission-waiter.ts:47` has a `PermissionHooks` registry.
- **F5 compaction surfacing:** `apps/coder/src/services/backends/opencode-server.ts` SSE arm handling
(~215-311 region); WS frame parity → server `ws-frames.ts` + web `apps/web/src/api/types.ts` (`WsFrame`).
- **F6 resilience:** `apps/server/src/services/inference/stream-phase.ts:261` (`for await ... result.fullStream`),
abort check at :333, usage at :343. Frontend 60s `discard_stale` watchdog is the only stall guard today.
- **F7 session-history MCP tool:** `apps/coder/src/services/mcp-server.ts` (existing BooCoder MCP tools);
read path `messages_with_parts` view.
- **F8 diff-line UX:** diff UI component NOT located by `git ls-files apps/web | grep -i diff` (returned
empty) — UX specialist must locate the actual diff/changes panel component (BOOCODER.md calls it
"DiffPanel"; may be named differently or nested). Routes through dispatcher + AgentComposerBar.
- **F9 retire :9502:** `apps/coder/web/` package + static serve in `apps/coder/src/index.ts` + coder build
scripts/Dockerfile. KEEP WS + REST routes.
## Recent activity / precedent
- HEAD `e5ce01a` (v2.7.11). v2.7.x line (relicense, write-edit-robustness, sampling-streamjson-tokens,
mistake-tracker-ledger, claude-sdk-sessionstore, agent-status-normalize, UI batches) all shipped 2026-06-01.
- Pure-helper + TDD precedent for extraction: `backends/turn-guard.ts`, `backends/lifecycle-decisions.ts`,
`mistake-tracker.ts` (pure module + unit test, then wire). F1/F2/F6 should follow it.
- Parallel-disjoint-file agent precedent: v2.7.0/v2.7.1/v2.7.3 each built by 3 parallel agents over disjoint
files — relevant to decomposition/sequencing.
## Enumerated gaps (searched, not found)
- No `feature-specification.md``scope-brief.md` is the ground-truth stand-in.
- No `docs/adr/`.
- Diff UI component filename not found via naive grep (F8) — needs UX specialist location.
- llama-swap config is NOT in-repo (external host `:8401`); native-jinja state confirmed by live probe only.
- F8 (diff-line UX) and F4 (notify-hook) are the two items most likely to need their own plan-a-feature; they
have no behavioral spec beyond the review-doc pattern description.

View File

@@ -0,0 +1,147 @@
# Implementation Decision Log: Post-Review Backlog (F1F9)
<!--
This file records every implementation decision committed while planning the post-review backlog.
Behavioral and implementation statements live in [../feature-implementation-plan.md](../feature-implementation-plan.md) —
this file captures the question, rationale, evidence, and rejected alternatives for each decision.
Round-by-round history lives in [implementation-iteration-history.md](implementation-iteration-history.md).
-->
Source of truth: [../scope-brief.md](../scope-brief.md) (ground-truth spec stand-in) and
[synthesis-input.md](synthesis-input.md) (the consolidated Round-1 specialist aggregation; its file:line
evidence is treated as verbatim). The D-N counter is shared across the Trivial and Full sections below.
## Trivial decisions
- D-2: F3 logger threading shape — pass an optional `log?: { debug }` param to `extractToolCallBlocks` from its single call site in `stream-phase.ts` `executeStreamPhase`; no interface (one site, one impl). — Referenced in plan: Implementation Approach (F2+F3), Decomposition and Sequencing.
- D-6: F7 query shape — read `messages_with_parts` `WHERE role != 'system'` (strips sentinels), params `session_id` + optional `chat_id` + `limit` (default 50, max 200), `ORDER BY created_at ASC`, returns `{role,content,...}[]`. — Referenced in plan: Implementation Approach (F7), External Interfaces.
- D-9: Patch-tag-per-unit — each ready item ships as its own sequential patch tag (one batch per coherent unit), not a minor bump; Sam declined v2.8.0 twice. — Referenced in plan: Decomposition and Sequencing, Operational Readiness.
- D-10: F1 Stop-button terminal label — render a muted "Stopped" label (not red, not a toast) for `status='cancelled'`. — Referenced in plan: Implementation Approach (F1), Testing Strategy.
## Full decisions
### D-1: F1 cancel registry shape and finalization-fix scope
- **Question:** How does a Stop on an external agent task actually abort the running child, and what message-state corruption does wiring that abort newly expose?
- **Decision:** Add `taskControllers = new Map<string, AbortController>()` inside `createDispatcher`; `taskControllers.set(taskId, ac)` at each of the four run-functions (`dispatcher.ts` ~316/655/991/1248) and `.delete` in the existing `.finally()` (`dispatcher.ts:117`); export `cancelExternalTask(taskId): boolean` (idempotent — `ac.abort()` no-ops when already aborted, so double-Stop and cancel-after-exit are safe). Pass a narrow `ExternalCancelFn` (not the whole dispatcher) into `registerTaskRoutes`, wired in `index.ts:254`. **In the same batch**, fix the two pre-existing finalization bugs this newly makes reachable: (1) the four catch blocks update only `tasks` state and leave the assistant `messages` row `status='streaming'` (the BooChat 5-min sweep is a different process and cannot recover it); (2) the warm-backend success path writes `messages.status='complete'` unconditionally before checking abort (`dispatcher.ts` ~853/1122/1377). Fix via a shared `cancelAndFinalize` helper across all four paths: after `await backend.prompt(...)`, `if (ac.signal.aborted)` → write `status='cancelled'`, publish the terminal `message_complete` frame, emit idle, return; in each catch finalize the message with `WHERE status='streaming'` (idempotent), mapping `AbortError → cancelled` vs `error → failed`.
- **Rationale:** 4-way specialist consensus (on-call B1, behavioral B1, architect A1, junior). The frontend and all four backends already honor the abort signal correctly (PTY `child.kill` `pty-dispatch.ts:159`; warm-ACP `session/cancel` `warm-acp.ts:318`; opencode `session.abort` `opencode-server.ts:775`; claude-sdk interrupt `claude-sdk.ts:209`); the only missing link is the registry + export + route wiring. Shipping the abort wiring **without** the finalization fixes would convert a silent no-op into a new bug (cancelled turns recorded `complete`, or messages stuck `streaming`), so the two are inseparable in one batch (on-call OCE-001/OCE-002, behavioral B2/B3).
- **Evidence:** `routes/tasks.ts:110-148` (cancel route calls `cancelPendingPermission` + `inference.cancel` native-only + DB cancelled; never reaches dispatcher); `dispatcher.ts:316/655/991/1248` (four private `ac` unreachable); `cancelExternalTask` absent anywhere (synthesis-input C1); finalization bugs at `dispatcher.ts` catch blocks + ~853/1122/1377 (synthesis-input C2); backend signal honoring cited above (scope-brief F1).
- **Rejected alternatives:**
- Abort wiring only, defer the finalization fixes — rejected because wiring abort makes the `streaming`/`complete` corruption reachable from the UI for the first time; deferring ships a new bug (synthesis-input C2, on-call/behavioral).
- Recover stuck messages via the BooChat 5-min sweep — rejected because the sweep runs in a different process (BooChat) and does not see BooCoder's `agent_sessions`/`tasks` finalization (on-call OCE-001).
- Pass the whole dispatcher into `registerTaskRoutes` — rejected for over-coupling; a narrow `ExternalCancelFn` is sufficient (architect A1).
- **Specialist owner:** on-call-engineer (resilience) with software-architect (registry shape).
- **Revisit criterion:** a fifth external backend is added whose abort contract differs from signal-based cancellation, or the warm-vs-one-shot worktree-cleanup split changes.
- **Dissent (if any):** none.
- **Driven by rounds:** R1.
- **Dependent decisions:** D-7 (F1 terminal state), D-8 (F1 frame-extension over new frame type), D-10.
- **Referenced in plan:** Implementation Approach (F1), Runtime Behavior, On-Call Resilience Posture, Decomposition and Sequencing, RAID Log (R1).
### D-3: F2 prune scope — option A (prune-now-minimal), keep the load-bearing guard
- **Question:** How far does the tool-call-parser prune go — unexport the dead symbols only, or the full flag-gated retirement of the text-scrape fallback?
- **Decision:** Option A only. KEEP `extractToolCallBlocks` + `stripToolMarkup` and their types (`ToolCallExtraction`, `ParsedCall`) — the load-bearing `<invoke>`-as-text guard. REMOVE only the `export` keyword (not the implementations) from the 8 zero-external-caller symbols: `isPlaceholderArgValue`, `parseXmlToolCall`, `parseInvokeToolCall`, `partialXmlOpenerStart`, and the 4 consts `XML_TOOL_OPEN/CLOSE`, `INVOKE_TOOL_OPEN/CLOSE`. Zero runtime effect; public export surface goes 11 → 4.
- **Rationale:** The TS parser is dormant defense-in-depth but the `<invoke>`-as-text path is the *only* guard for "tool call emitted as plain text" — `experimental_repairToolCall` does not cover that case, and the sidecar `--jinja` state is confirmed only by a live probe, not pinned in-repo, so keeping the guard is correct (architect A2, confirms junior OQ-F2c). Unexporting is pure simplification with no behavior change; the relicense already removed the AGPL-dead exports, so there is no license pressure forcing the larger move.
- **Evidence:** live consumers `stream-phase.ts:263-284` (extractToolCallBlocks text-delta fallback), `tool-phase.ts:122` + `error-handler.ts:25,106` (stripToolMarkup); structured path `stream-phase.ts:285` does all real work today; llama-swap `--jinja` confirmed ON by live probe of `:8401` only (scope-brief F2, synthesis-input C3).
- **Rejected alternatives:**
- Option B — flag-gate the text-scrape fallback, validate native parsing on live qwen3.6 for one release, then delete — rejected (deferred) because there is no evidence qwen3.6 stopped emitting `<invoke>` text on live and the sidecar `--jinja` state is unconfirmed in-repo; deleting the only plain-text-tool-call guard on that basis is unsafe (architect/test-engineer R1). See Deferred (YAGNI).
- Delete the 8 symbols' implementations outright — rejected because three of them (`parseXmlToolCall`, `parseInvokeToolCall`, `isPlaceholderArgValue`) are called internally by the retained `extractToolCallBlocks`; only the `export` keyword is dead.
- **Specialist owner:** software-architect.
- **Revisit criterion:** a documented multi-session live probe shows zero text-delta tool calls from qwen3.6 (then Option B reopens).
- **Dissent (if any):** none.
- **Driven by rounds:** R1.
- **Dependent decisions:** D-2 (F3 logger — safe because F2 keeps `extractToolCallBlocks`), D-4 (F2 gate test).
- **Referenced in plan:** Implementation Approach (F2+F3), Decomposition and Sequencing, Testing Strategy, RAID Log (R2), Deferred (YAGNI).
### D-4: F2 fallback gate test — pin the untested guard before pruning
- **Question:** The `<invoke>`-as-text fallback is currently exercised by no test; how do we prevent the prune from silently removing it?
- **Decision:** Add a gate test before the prune: stub `streamText` to emit a text-delta containing a complete `<invoke>` block; assert the call lands in `result.toolCalls` and the markup is NOT present in `result.content`. The test must stay green through the prune and fail if `extractToolCallBlocks` is ever removed from the text-delta path.
- **Rationale:** Pruning around an untested load-bearing path risks a silent regression; the gate test converts D-3's "keep the guard" commitment into an enforced invariant (test-engineer T6).
- **Evidence:** untested fallback at `stream-phase.ts:263-284` (synthesis-input C4, test-engineer T6).
- **Rejected alternatives:**
- Prune without the gate test, relying on review — rejected because the fallback has no current coverage, so a future removal would pass CI silently (test-engineer T6).
- **Specialist owner:** test-engineer.
- **Revisit criterion:** the fallback path is intentionally retired under Option B (then this test is rewritten, not deleted).
- **Dissent (if any):** none.
- **Driven by rounds:** R1.
- **Dependent decisions:** none.
- **Referenced in plan:** Testing Strategy, Decomposition and Sequencing.
### D-5: F6 stall-timeout via AbortSignal.any; no retry
- **Question:** How does BooChat detect and recover a hung llama-swap stream server-side, and does it retry?
- **Decision:** Wrap the `stream-phase.ts:261` `fullStream` loop with a per-chunk stall deadline. Create a local `stallAc = new AbortController()`, pass `effectiveSignal = AbortSignal.any([signal, stallAc.signal])` to `streamText`, bump a `setTimeout(STALL_TIMEOUT_MS = 90_000)` on each chunk, clear it in the existing `finally`. At the post-loop check (`stream-phase.ts:337`) test `signal?.aborted || stallAc.signal.aborted` and throw `AbortError` (→ `handleAbortOrError` writes `cancelled`). **No retry** at `executeStreamPhase`/`streamCompletion`.
- **Rationale:** Today a hung stream relies entirely on the frontend 60s `discard_stale` watchdog with zero server-side guard; the 90s server stall-timeout closes that gap and reuses the existing abort/finalize path. Retry is deferred (YAGNI): a retry after a partial stream re-emits already-streamed deltas (`state.accumulated` + live `delta` frames are non-idempotent), which is worse than the current behavior; at single-local-instance scale the user re-sending is the correct recovery (on-call, strong).
- **Evidence:** `stream-phase.ts:261` fullStream loop, `:333` abort check, `:343` usage; frontend 60s `discard_stale` is the only stall guard today (scope-brief F6, synthesis-input C6).
- **Rejected alternatives:**
- `Promise.race` of the loop against a timeout — rejected in favor of `AbortSignal.any`, which threads cancellation through `streamText` and the existing finalize path cleanly (OQ-F6a, on-call).
- Retry/backoff classifier (transient-5xx / stall) — rejected (deferred) because partial-stream re-emit is non-idempotent and llama-swap is a single local instance (synthesis-input YAGNI ledger). See Deferred (YAGNI).
- **Specialist owner:** on-call-engineer.
- **Revisit criterion:** llama-swap gains restart-in-place-with-clear-partial, or a second instance is added for failover (then retry reopens).
- **Dissent (if any):** none.
- **Driven by rounds:** R1.
- **Dependent decisions:** none.
- **Referenced in plan:** Implementation Approach (F6), On-Call Resilience Posture, Testing Strategy, RAID Log (R3), Deferred (YAGNI).
### D-7: F1 terminal state for user Stop — `cancelled`, not `failed`
- **Question:** When a user hits Stop, what terminal `messages.status` does the finalized assistant message land in?
- **Decision:** `cancelled` for a user-initiated Stop (`AbortError`); `failed` only for a genuine thrown error in the catch path.
- **Rationale:** A user Stop is a deliberate, non-error outcome; `MessageStatus` already includes `'cancelled'` and the web reducer can map it without a new enum value. Distinguishing `AbortError → cancelled` vs `error → failed` keeps the human-inbox / failure surfaces honest (resolved OQ, on-call/behavioral).
- **Evidence:** `MessageStatus` includes `'cancelled'`; reducer map point `CoderPane.tsx:299` (synthesis-input F1 UX + resolved OQs).
- **Rejected alternatives:**
- Record user Stop as `failed` — rejected because it pollutes failure surfaces with deliberate user actions (resolved OQ F1 terminal state).
- **Specialist owner:** behavioral-analyst.
- **Revisit criterion:** product decides a user Stop should count against a failure/alerting budget.
- **Dissent (if any):** none.
- **Driven by rounds:** R1.
- **Dependent decisions:** D-8, D-10.
- **Referenced in plan:** Implementation Approach (F1), Runtime Behavior.
### D-8: F1 status surfacing — extend the existing frame, no new frame type
- **Question:** How does the cancelled terminal state reach the web reducer — a new WS frame type, or an extension of the existing one?
- **Decision:** Extend the coder `message_complete` frame with an optional `status` field (Option A — minimal); map it in the reducer (`CoderPane.tsx:299`). No new frame type, so the cross-app `WsFrame` parity rule does not force a paired strict-union arm beyond the optional field.
- **Rationale:** Adding a whole new frame type triggers the full cross-app parity dance (server `InferenceFrame`/`ws-frames.ts` + web `WsFrame`) for a single optional value already carried on a terminal frame; an optional field on the existing frame is the smaller change with the same observable result (UX agent).
- **Evidence:** reducer map point `CoderPane.tsx:299`; cross-app frame parity rule (CLAUDE.md Conventions; scope-brief cross-cutting constraints).
- **Rejected alternatives:**
- New `agent_cancelled` frame type — rejected because it forces a paired strict-union arm in two files for a single optional status value (UX agent, Option A vs B).
- **Specialist owner:** user-experience-designer.
- **Revisit criterion:** a second distinct terminal sub-state needs carrying that does not fit `message_complete.status`.
- **Dissent (if any):** none.
- **Driven by rounds:** R1.
- **Dependent decisions:** D-10.
- **Referenced in plan:** Implementation Approach (F1), External Interfaces.
### D-11: F9 retire :9502 SPA — delete the serve block, keep all API/WS routes
- **Question:** What exactly is removed when retiring the :9502 fallback SPA, and what must stay?
- **Decision:** Delete the `if (existsSync(webRoot))` block in `index.ts` (~269-289) which already no-ops when the dist is absent; keep the inline 404 handler (`{error:'not found'}`). Remove `apps/coder/web` from `pnpm-workspace.yaml`, the coder build step, and the Dockerfile copy; remove the now-unused `fastifyStatic` import (verify it is only used there). KEEP all `/api/coder/*` REST + WS + `/api/health` + `--mcp` routes — CoderPane depends on them. Optionally add a 2-line `GET /` redirect-to-BooChat (no `fastifyStatic`).
- **Rationale:** Sam confirmed "I don't use 9502"; primary UI is CoderPane inside BooChat. OQ-F9a resolved: nothing probes `GET /` on :9502 (health is `/api/health`; the compose healthcheck targets the boocode container, not the host-systemd coder), so 404-or-redirect at `/` is safe (architect A5, verified).
- **Evidence:** serve block `index.ts` ~269-289; `GET /` unprobed (synthesis-input C8, OQ-F9a resolved); scope-brief F9 / DEFERRED #5 removal checklist.
- **Rejected alternatives:**
- Keep the SPA — rejected; Sam greenlit removal and the build step is dead weight (scope-brief F9).
- Remove the REST/WS routes too — rejected because CoderPane inside BooChat depends on every `/api/coder/*` route (architect A5).
- **Specialist owner:** software-architect.
- **Revisit criterion:** a standalone :9502 UI is ever wanted again (would be a fresh feature, not a revert).
- **Dissent (if any):** none.
- **Driven by rounds:** R1.
- **Dependent decisions:** none.
- **Referenced in plan:** Implementation Approach (F9), Operational Readiness, Decomposition and Sequencing.
### D-12: F4/F5/F8 disposition under the standing override — document as Blocked, do not halt
- **Question:** The spec-maturity gate tripped on F4/F5/F8; the skill says recommend routing them out. Sam issued a standing override to plan everything. How are they recorded?
- **Decision:** Proceed with the plan but record F4, F5, F8 in a structurally separate **BLOCKED** tier with their exact blocking open question(s) and recommended resolution path, rather than halting synthesis. F4 → route to `plan-a-feature` (hook-firing-in-unattended-mode premise UNVERIFIED + goose hook mechanism unknown). F5 → SDK capability check (pinned `@opencode-ai/sdk` exposes no compaction event arm); UI treatment (sentinel-row vs ephemeral-frame) stays disputed until the event is confirmed to exist. F8 → route to `plan-a-feature` (no line-selection infra exists, diff source ambiguous). These three do NOT block the ready cluster (F1/F2/F3/F6/F7/F9).
- **Rationale:** The spec-maturity gate tripped with ≥5 spec-level findings (C9, C11, C12, C13, OQ-F4b, OQ-F8a) concentrated in F4/F5/F8 across junior, behavioral, and UX. Sam pre-acknowledged the WANT items would be planned more shallowly when choosing scope "everything we discussed," so the honest synthesis records them as Blocked with explicit reopen paths rather than fabricating plan-level resolutions or stalling the ready work.
- **Evidence:** spec-maturity gate TRIPPED (synthesis-input "Spec-maturity gate"); USER OVERRIDE STANDING (same section); blocking OQs OQ-F4a/F4b, OQ-F5a/F5b, OQ-F8a/b/c (synthesis-input Open Questions); claims C9, C11, C12, C13.
- **Rejected alternatives:**
- Halt synthesis and route F4/F5/F8 out before any planning — rejected because Sam's standing override directs the plan to proceed and document (synthesis-input gate section).
- Plan F4/F5/F8 at plan-level alongside the ready cluster — rejected because their core premises are unverified (F4) / capability-blocked (F5) / infra-absent (F8); plan-level decisions would rest on unproven assumptions (C9/C11/C13).
- **Specialist owner:** project-manager.
- **Revisit criterion:** the named blocking OQ for an item resolves (F4: hooks confirmed to fire unattended + goose format known; F5: SDK compaction event confirmed; F8: diff source chosen + line-selection approach specified) — then that item graduates to its own plan.
- **Dissent (if any):** none; the gate-trip is acknowledged rather than overridden silently.
- **Driven by rounds:** R1.
- **Dependent decisions:** none.
- **Referenced in plan:** Implementation Approach (Blocked tier), RAID Log (R4, R5, assumptions), Open Items, Deferred (YAGNI).

View File

@@ -0,0 +1,48 @@
# Implementation Iteration History: Post-Review Backlog (F1F9)
<!--
This file records how the implementation plan for the post-review backlog evolved across discussion rounds.
Committed decisions live in [implementation-decision-log.md](implementation-decision-log.md) and the primary
plan lives in [../feature-implementation-plan.md](../feature-implementation-plan.md). It also consolidates
the project-manager's per-round facilitation output (claim ledger, Open Questions, spec-maturity tags).
The loop converged in one round; remaining unknowns are spec-level, not resolvable by more specialist rounds.
-->
## R1: Parallel six-specialist backlog review
- **Specialists engaged:** on-call-engineer, behavioral-analyst, software-architect, test-engineer, user-experience-designer, junior-developer, project-manager (coordinator). Team size: large (cross-subsystem; user chose scope "everything we discussed"). Round cap 3; converged in 1.
- **New input provided:** Initial inputs — the [scope brief](../scope-brief.md) (ground-truth spec stand-in; two items live-verified 2026-06-02 with file:line evidence) and the [discovery notes](.discovery-notes.md) (per-item code touch points). No prior round; this is the initial sweep.
- **Claim ledger:** (consolidated, deduped — see [synthesis-input.md](synthesis-input.md) for the full table)
| # | Claim | State | Spec-maturity |
|---|-------|-------|---------------|
| C1 | F1 cancel route never aborts external child; no registry/export | Evidenced | plan-level |
| C2 | F1 catch blocks leave message `streaming`; success path writes `complete` on abort — fix same batch | Evidenced | plan-level |
| C3 | F2 = prune-now-minimal: unexport 8 zero-caller symbols, keep extractToolCallBlocks+stripToolMarkup | Evidenced | plan-level |
| C4 | F2 `<invoke>`-text fallback untested → add gate test before prune | Evidenced | plan-level |
| C5 | F3 optional logger param, do with F2 (same file) | Evidenced | plan-level |
| C6 | F6 stall-timeout via AbortSignal.any, 90s; NO retry (non-idempotent deltas) | Evidenced | plan-level |
| C7 | F7 inline MCP tool, messages_with_parts, role!='system', limit 50/200 | Evidenced | plan-level |
| C8 | F9 delete SPA block, keep routes; GET / unprobed → safe | Evidenced | plan-level |
| C9 | F4 hook-firing in unattended mode UNVERIFIED; goose hook mechanism unknown | Anecdotal (premise) | spec-level |
| C10 | F4 dedup rule: confirm running before `blocked`; suppress hook `done` | Evidenced | plan-level |
| C11 | F5 pinned @opencode-ai/sdk exposes no compaction arm → blocked on capability check | Evidenced | spec-level |
| C12 | F5 UI treatment sentinel-row vs ephemeral-frame | Disputed | spec-level |
| C13 | F8 no line-selection infra; diff source ambiguous; needs own spec | Evidenced | spec-level |
- **Open Questions raised:**
- F1: terminal state (→ D-7, `cancelled`); registry key (→ D-1, `taskId`); shared finalize helper (→ D-1, yes); warm re-throw on abort (→ D-1, short-circuit on `ac.signal.aborted`).
- OQ-F2a (sidecar jinja) → moot under D-3 option A; OQ-F2c (a vs b) → D-3 option A.
- OQ-F6a/b/c → D-5 (AbortSignal.any, no retry, 90s).
- OQ-F7a (session vs chat id) → D-6 (both, chat_id optional, + limit).
- OQ-F9a (GET / probe) → D-11 (unprobed, safe).
- OQ-F4a (hooks fire unattended?), OQ-F4b (goose hook format) → UNRESOLVED, spec-level → OI-1, route F4 to plan-a-feature (D-12).
- OQ-F5a (SDK compaction event existence/name) → UNRESOLVED, capability check → OI-2, blocks F5 (D-12).
- OQ-F5b (sentinel vs ephemeral UI) → UNRESOLVED → OI-3, settle once event confirmed (D-12).
- OQ-F8a/b/c (diff source, serialization, new viewer) → UNRESOLVED, spec-level → OI-4, route F8 to plan-a-feature (D-12).
- OI-5 (F1 best-effort session-stop leg) → non-blocking, decided at implementation.
- **Spec-maturity tags:** plan-level — C1-C8, C10 (9 claims). spec-level — C9, C11, C12, C13, plus OQ-F4b and OQ-F8a (≥5 across junior, behavioral, UX). **Spec-maturity gate TRIPPED**, concentrated in the three WANT items F4/F5/F8; F1/F2/F3/F6/F7/F9 are all plan-level and ready. No T#-contradictions.
- **Resolution source:** evidence (Step 6 specialist findings) for every plan-level OQ (F1, F2, F3, F6, F7, F9); user input for the gate disposition (Sam's standing override → D-12); deferred-to-spec for OQ-F4a/F4b, OQ-F5a/F5b, OQ-F8a/b/c (recorded as OI-1..OI-4, routed out rather than resolved in this loop). The YAGNI gate ran during synthesis: F6 retry, F2 option B, F4 interface, F5 extra compaction arms, F7 reader interface all deferred.
- **Decisions produced:** D-1, D-2, D-3, D-4, D-5, D-6, D-7, D-8, D-9, D-10, D-11, D-12 (all 12; the loop converged in one round so every decision originates here).
- **Changed in plan:** all sections (initial authoring) — Source Specification, Outcome, Context, Implementation Approach (TIER 1 READY / TIER 2 BLOCKED), Decomposition and Sequencing, RAID Log, Testing Strategy, Security Posture, Operational Readiness, On-Call Resilience Posture, Definition of Done, Specialist Handoffs, Deferred (YAGNI), Open Items, Summary.
- **Project-manager next-step recommendation:** Go to synthesis (done — this plan). Build the READY cluster in order F1 → F2+F3 → F6 → F7 → F9 as sequential patch tags; route F4 and F8 to `plan-a-feature` and F5 to an `@opencode-ai/sdk` capability check before any build on those three.

View File

@@ -0,0 +1,181 @@
# Synthesis input — Round 1 aggregation + dispositions
Deterministic aggregation of the Round-1 specialist review (on-call-engineer, behavioral-analyst,
software-architect, test-engineer, user-experience-designer, junior-developer). This is the consolidated
record the project-manager synthesizes into the three plan files. Evidence (file:line) is preserved inline.
Team size: large (cross-subsystem, user chose "everything"). Round cap 3; converged in 1 round (the
remaining unknowns are spec-level, not resolvable by more specialist rounds).
---
## Per-feature dispositions (the decisions)
### READY TO BUILD
**F1 — external task cancel kills child + finalizes message.** Strong 4-way consensus (on-call B1, behavioral
B1, architect A1, junior).
- Root cause CONFIRMED: `routes/tasks.ts:130-138` calls `inference.cancel` (native-only); dispatcher has no
`Map<taskId,AbortController>`; the four private `ac` (dispatcher.ts:316/655/991/1248) are unreachable;
`cancelExternalTask` does not exist anywhere.
- Design (architect A1): add `taskControllers = new Map<string,AbortController>()` inside `createDispatcher`;
`taskControllers.set(taskId, ac)` at each of the 4 run-functions; delete in the existing `.finally()` at
dispatcher.ts:117; export `cancelExternalTask(taskId): boolean` (idempotent — `ac.abort()` is a no-op when
already aborted, so double-Stop and cancel-after-exit are safe). Pass a narrow `ExternalCancelFn`
(NOT the whole dispatcher) into `registerTaskRoutes`; wire in `index.ts:254`.
- TWO pre-existing bugs F1 makes reachable, MUST be fixed in the same batch (on-call OCE-001/OCE-002,
behavioral B2/B3): (1) the four catch blocks update only `tasks` state, never the `messages` row → an
aborted/thrown turn leaves the assistant message `status='streaming'` (BooChat's 5-min sweep can't recover
it — different process); (2) the warm-backend success path writes `messages.status='complete'`
unconditionally before checking abort (dispatcher.ts ~853/1122/1377) → a cancelled turn is recorded
`complete`. Fix: after `await backend.prompt(...)`, `if (ac.signal.aborted)` → write `status='cancelled'`,
publish the terminal `message_complete` frame, emit idle, return; and in each catch finalize the message
with `WHERE status='streaming'` (idempotent) distinguishing AbortError→cancelled vs error→failed.
- UX (UX agent): disable the Stop button while the cancel POST is in flight (mobile double-tap); extend the
coder `message_complete` frame with an optional `status` field (Option A — minimal, no new frame type) and
map it in the reducer (`CoderPane.tsx:299`, `MessageStatus` already includes `'cancelled'`); render a muted
"Stopped" label (not red, not a toast).
- Tests (test-engineer T1-T3): extract a pure `CancelRegistry` (register/cancel/delete/has) — 4 unit cases,
no DB/child; one DB-integration test for the route → row lands `'cancelled'`; warm-worktree-preserved held
as a code comment, not a spy.
- Resolved OQs: terminal state = `cancelled` (not `failed`) for user Stop; registry keyed by `taskId`
(route receives taskId); `session/stop` route — CoderPane already calls `cancelTask` for external tasks so
the session-stop path "never fires for external from UI" (on-call) — wire it best-effort via a
`SELECT id FROM tasks WHERE session_id=$ AND state='running'` lookup OR defer that leg (low value); use a
shared `cancelAndFinalize` helper across the 4 paths (TDD precedent).
**F2 — tool-call-parser prune (option a: prune-now-minimal).** DECISION (architect A2, confirms junior
OQ-F2c): do NOT do the flag-gated full retirement (option b). KEEP `extractToolCallBlocks` + `stripToolMarkup`
+ their types (`ToolCallExtraction`, `ParsedCall`) — load-bearing `<invoke>`-as-text guard (the only guard for
that case; `experimental_repairToolCall` doesn't cover it; sidecar `--jinja` unconfirmed so keeping the guard
is correct). REMOVE the `export` keyword (not the implementations) from the 8 zero-external-caller symbols:
`isPlaceholderArgValue`, `parseXmlToolCall`, `parseInvokeToolCall`, `partialXmlOpenerStart`, and the 4 consts
`XML_TOOL_OPEN/CLOSE`, `INVOKE_TOOL_OPEN/CLOSE`. Zero runtime effect; public surface 11→4 exports.
- Test gap (test-engineer T6): the `<invoke>`-text fallback in `stream-phase.ts:263-284` is currently NOT
exercised by any test → add a gate test (stub `streamText` to emit a text-delta containing a complete
`<invoke>` block; assert it lands in `result.toolCalls` and the markup is NOT in `result.content`). Must
stay green through the prune and fail if `extractToolCallBlocks` is ever removed from the text-delta path.
**F3 — xml-parser structured logging.** Trivial. `tool-call-parser.ts:65` `console.debug` → pass an optional
`log?: { debug }` param to `extractToolCallBlocks` from its one call site (`stream-phase.ts` executeStreamPhase)
and use it. No interface (architect: one site, one impl). SEQUENCING: same file as F2; F2 keeps
`extractToolCallBlocks` (decided), so F3 is safe; do F2+F3 in one batch. Confirm `executeStreamPhase`
signature/test-stubs tolerate the param (junior).
**F6 — BooChat stall-timeout ONLY (retry deferred).** on-call: wrap the `stream-phase.ts:261` fullStream loop
with a per-chunk stall deadline: a local `stallAc = new AbortController()`, `effectiveSignal =
AbortSignal.any([signal, stallAc.signal])` passed to `streamText`; bump a `setTimeout(STALL_TIMEOUT_MS=90_000)`
on each chunk; clear it in the existing `finally`; at the post-loop check (stream-phase.ts:337) test
`signal?.aborted || stallAc.signal.aborted` and throw `AbortError` (→ `handleAbortOrError` writes
`cancelled`). Tests (test-engineer T8-T10): pure `classifyStreamError(err)` helper (5 cases, no I/O) + a
`vi.useFakeTimers()` stall test on a fake hanging stream + a regression pin on the existing `signal?.aborted`
post-loop check.
- YAGNI DEFER (on-call, strong): NO retry at `executeStreamPhase`/`streamCompletion`. A retry after partial
stream re-emits already-streamed deltas (`state.accumulated` + live `delta` frames are non-idempotent) —
worse than current. Reopen trigger: llama-swap gains restart-in-place-with-clear-partial, or a second
instance for failover. The user re-sending is the correct recovery at single-instance scale.
**F7 — view_session_history MCP tool.** architect A4: add tool 7 inline in `mcp-server.ts` (follows the
existing 6-tool inline pattern, `textResult` + direct `sql`). Reads `messages_with_parts`, `WHERE role !=
'system'` (strips sentinels), params `session_id` + optional `chat_id` + `limit` (default 50, max 200),
`ORDER BY created_at ASC`. No interface, no pagination beyond limit. Returns `{role,content,...}[]`.
**F9 — retire apps/coder/web :9502 SPA.** architect A5: the `if (existsSync(webRoot))` block in `index.ts`
(~269-289) already no-ops when the dist is absent. Delete that block, keep the inline 404 handler
(`{error:'not found'}`); remove `apps/coder/web` from `pnpm-workspace.yaml`, the coder build step, and the
Dockerfile copy; remove the now-unused `fastifyStatic` import (verify it's only used there). KEEP all
`/api/coder/*` REST + WS + `/api/health` + `--mcp` routes (CoderPane depends on them). OQ-F9a RESOLVED:
nothing probes `GET /` on :9502 (health is `/api/health`; compose healthcheck is the boocode container, not
the host-systemd coder) → safe to 404 or add a 2-line `GET /` redirect-to-BooChat (no fastifyStatic).
### BLOCKED — need a spec or a capability check before building (gate-trip items)
**F4 — notify-hook config injection.** SPEC-LEVEL gaps (junior OQ-F4a-e, behavioral B4, UX). The core premise
is UNVERIFIED: do claude / qwen / goose actually fire their native lifecycle hooks in unattended mode
(`claude -p` / SDK, `qwen --acp` / `--output-format stream-json`, goose)? goose's hook file/format is unknown
(not in repo). Idempotent per-agent settings.json merge strategy unspecified. `boocoder.service` run-user /
`homedir()` resolution unconfirmed. The inbound POST is a new unauthenticated localhost route (acceptable
single-user, note it). Double-publish dedup with the v2.7.6 turn-boundary publish: behavioral B4 +
architect A3 agree on the rule — inbound route calls `normalizeAgentEvent` (returns bucket
`working|blocked|done`), confirms `tasks.state='running'` before publishing `blocked`, and SUPPRESSES `done`
(the dispatcher already emits `idle`); `done`→drop, never re-publish. UI side already exists (AgentStatusDot,
all 4 buckets — UX: F4 is server-side only). RECOMMENDATION: own `plan-a-feature` — the dedup rule + module
shapes are settled, but the hook-firing-in-unattended-mode premise and goose hook mechanism must be verified
first or the whole feature is built on sand.
**F5 — opencode compaction surfacing.** BLOCKED on a capability check. The installed `@opencode-ai/sdk`
exposes NO compaction event arm (current arms confirmed: `session.next.{text,reasoning,tool,step}.*`,
`message.part.*`, `session.idle/error` at opencode-server.ts:379-491). The review's "consume
compaction.{started,delta,ended}" assumed events from opencode's CORE `event.ts`, which the pinned SDK may not
surface. MUST confirm the SDK emits a compaction signal + its exact event name (or an SDK bump is needed)
before building. DISPUTED UI treatment (behavioral B5 = persistent sentinel row `metadata.kind='compaction'`,
survives refresh; UX = ephemeral inline divider via a new `agent_compacted` frame, no DB row) — settle once
the event exists. Only `compaction.ended` is in scope (YAGNI: started/delta/step.failed/tool.progress out).
Cross-app WS-frame parity is certain if a frame is added.
**F8 — diff-line → agent re-prompt.** SPEC-LEVEL (UX + junior, firm). The "DiffPanel" is inline in
`CoderPane.tsx:478-619`, rendering `pending_changes` rows as a static `<pre>` (CoderPane.tsx:607-610) — NO
line-selection infrastructure exists. Diff source ambiguous (`pending_changes.diff` = BooCode write-tools only
vs the external-agent worktree git diff). "Send to new agent" needs coordinated workspace-pane + chat creation
+ pre-population across 3 surfaces with no existing contract. Selection diverges by modality (desktop line-
select vs mobile long-press → bottom sheet). RECOMMENDATION: own `plan-a-feature` (the scope-brief already
hedged this; treat as firm). MVP-if-pushed: "comment to current agent" only, block-level selection,
pre-populate `ChatInput` — still wants a spec.
---
## Claim ledger (consolidated, deduped)
| # | Claim | State | Spec-maturity | Supporting |
|---|-------|-------|---------------|-----------|
| C1 | F1 cancel route never aborts external child; no registry/export | Evidenced | plan-level | on-call,behavioral,architect,junior |
| C2 | F1 catch blocks leave message `streaming`; success path writes `complete` on abort — fix in same batch | Evidenced | plan-level | on-call,behavioral |
| C3 | F2 = prune-now-minimal: unexport 8 zero-caller symbols, keep extractToolCallBlocks+stripToolMarkup | Evidenced | plan-level | architect (test-engineer guard) |
| C4 | F2 `<invoke>`-text fallback is untested → add gate test before prune | Evidenced | plan-level | test-engineer |
| C5 | F3 optional logger param, do with F2 (same file) | Evidenced | plan-level | architect,junior |
| C6 | F6 stall-timeout via AbortSignal.any, 90s; NO retry (non-idempotent deltas) | Evidenced | plan-level | on-call,behavioral,test-engineer |
| C7 | F7 inline MCP tool, messages_with_parts, role!='system', limit 50/200 | Evidenced | plan-level | architect,UX |
| C8 | F9 delete SPA block, keep routes; GET / unprobed → safe | Evidenced | plan-level | architect (+ verified) |
| C9 | F4 hook-firing in unattended mode UNVERIFIED; goose hook mechanism unknown | Anecdotal (premise) | spec-level | junior,behavioral,UX |
| C10 | F4 dedup rule: confirm running before `blocked`; suppress hook `done` | Evidenced | plan-level | behavioral,architect |
| C11 | F5 pinned @opencode-ai/sdk exposes no compaction arm → blocked on capability check | Evidenced | spec-level | (verified) + junior |
| C12 | F5 UI treatment sentinel-row vs ephemeral-frame | Disputed | spec-level | behavioral vs UX |
| C13 | F8 no line-selection infra; diff source ambiguous; needs own spec | Evidenced | spec-level | UX,junior |
## Open Questions — resolutions
- OQ (F1 terminal state) → RESOLVED: `cancelled`. OQ (F1 registry key) → RESOLVED: `taskId`. OQ (F1 shared
finalize helper) → RESOLVED: yes, pure helper. OQ (F1 warm re-throw on abort) → RESOLVED: short-circuit on
`ac.signal.aborted`.
- OQ-F2a (sidecar jinja) → RESOLVED moot: option a keeps the guard. OQ-F2c (a vs b) → RESOLVED: option a.
- OQ-F6a/b/c → RESOLVED: AbortSignal.any (not Promise.race); no retry; 90s.
- OQ-F7a (session vs chat id) → RESOLVED: both (chat_id optional) + limit.
- OQ-F9a (GET / probe) → RESOLVED: unprobed, safe.
- OQ-F4a (hooks fire unattended?), OQ-F4b (goose hook format) → UNRESOLVED, spec-level → route to F4 spec.
- OQ-F5a (SDK compaction event name/existence) → UNRESOLVED, capability check → blocks F5.
- OQ-F5b (sentinel vs ephemeral UI) → UNRESOLVED → settle in F5 once event confirmed.
- OQ-F8a/b/c (diff source, serialization, new viewer) → UNRESOLVED, spec-level → route to F8 spec.
## Spec-maturity gate
TRIPPED (≥5 spec-level findings — C9, C11, C12, C13, plus OQ-F4b/F8a — across ≥3 specialists: junior,
behavioral, UX). The trip is CONCENTRATED in the three WANT items F4/F5/F8; F1/F2/F3/F6/F7/F9 are all
plan-level and ready. Per skill: gate-trip → recommend the user route F4/F8 to `plan-a-feature` and F5 to a
capability check. USER OVERRIDE STANDING: Sam chose scope "everything we discussed" having pre-acknowledged the
WANT items would be planned more shallowly — so the plan proceeds, documenting F4/F5/F8 as Blocked/own-spec
rather than halting. Decision deferred to Step 9 user presentation.
## YAGNI ledger
- F6 retry logic → DEFER (non-idempotent re-emit of streamed deltas). Reopen: llama-swap restart-in-place or
second instance. Source: on-call R1.
- F2 option b (flag-gated full retirement of extractToolCallBlocks/stripToolMarkup) → DEFER (no evidence
qwen3.6 stopped emitting `<invoke>` text on live; sidecar jinja unconfirmed). Reopen: documented multi-
session live probe shows zero text-delta tool calls. Source: architect/test-engineer R1.
- F4 `NotifyHookInjection` interface → REPLACE with one concrete function switching on agent name (3 agents,
identical read-merge-write). Source: architect R1.
- F5 handling of compaction.started/delta + step.failed + tool.progress → DEFER, only compaction.ended is
user-actionable. Source: behavioral R1.
- F7 SessionHistoryReader interface / pagination → REPLACE with inline query + limit. Source: architect R1.
- Provider tier-2 follow-ups (snapshot frame, enabled column, shared types, MCP list_providers) → already
DEFER/DROP per scope-brief; not re-planned.

View File

@@ -0,0 +1,350 @@
# Feature Implementation Plan: Post-Review Backlog (F1F9)
This is a multi-item backlog, not a single feature. It commits to shipping six **READY** items as
independent sequential patch tags ([D-9](artifacts/implementation-decision-log.md#trivial-decisions)), one batch per
coherent unit, honoring deploy-by-surface and stage-commits-by-path; and to documenting three **BLOCKED**
WANT items (F4/F5/F8) with their exact blocking open questions and recommended resolution paths
([D-12](artifacts/implementation-decision-log.md#d-12-f4f5f8-disposition-under-the-standing-override--document-as-blocked-do-not-halt))
rather than halting on the tripped spec-maturity gate.
## Source Specification
- **Feature specification:** No formal `feature-specification.md` exists. The ground-truth source is the [scope brief](scope-brief.md), which captures the items, decisions, and live-verified current state from the 2026-06-02 backlog review conversation plus `boocode_code_review_v2.md`, `docs/DEFERRED-WORK.md`, and `boocode_roadmap.md`. Origin is conversational + review-doc, so the WANT items are intentionally planned more shallowly.
- **Consolidated specialist record:** [artifacts/synthesis-input.md](artifacts/synthesis-input.md) — the Round-1 aggregation of all six specialists (on-call-engineer, behavioral-analyst, software-architect, test-engineer, user-experience-designer, junior-developer) with file:line evidence, the claim ledger, OQ resolutions, the spec-maturity gate, and the YAGNI ledger. There is no separate per-specialist file; this digest is authoritative.
- **Discovery notes:** [artifacts/.discovery-notes.md](artifacts/.discovery-notes.md) — project context and per-item code touch points.
## Outcome
When this plan's READY cluster is executed: hitting Stop on an external agent task actually kills the
running child and finalizes the assistant message in a correct terminal state (no more `streaming`-stuck or
falsely-`complete` rows); the tool-call parser's public surface is pruned to its four load-bearing exports
while the plain-text `<invoke>` guard stays intact and is for the first time test-pinned; the parser's
rejection logging flows through pino/`LOG_LEVEL`; a hung BooChat stream is detected and finalized
server-side after 90s instead of relying solely on the frontend watchdog; an MCP client can read a
session's transcript through a new BooCoder MCP tool; and the unused :9502 fallback SPA and its build step
are gone. F4/F5/F8 leave this plan with a documented, evidence-backed route to their own spec/capability
work.
## Context
- **Driving constraint:** A post-review backlog with two items live-verified as active correctness bugs (F1 cancel is a no-op that leaves messages stuck; F6 has zero server-side stall guard). The review surfaced them; Sam chose scope "everything we discussed." No external deadline — value is correctness and simplification debt paydown.
- **Stakeholders:** Sam (sole user/operator — wants Stop to work, wants the parser simplified, wants resilience and the MCP history tool, greenlit the :9502 removal). On-call posture is single-operator: Sam is the post-ship owner for every item.
- **Future-state concern:** F1 touches the dispatcher's message-finalization paths across four backends — the risk to watch is finalization races/regressions there. F6 introduces a server-side timer that must not fire on a healthy-but-slow stream. F5 is a standing capability dependency on the pinned `@opencode-ai/sdk`.
- **Out-of-scope boundary:** F2 Option B (flag-gated full parser retirement), F6 retry/backoff, F4's superset code (clean-room pattern only), and everything in the scope brief's "Deferred (Sam's explicit dispositions)" list (subagent permission demux, tier-2 provider follow-ups, large-file splits, PR-resolver, etc.). These are recorded, not planned.
## Team Composition and Participation
Round-by-round detail lives in [artifacts/implementation-iteration-history.md](artifacts/implementation-iteration-history.md). All six specialists converged in one round; remaining unknowns are spec-level, not resolvable by more specialist rounds.
| Specialist | Status | Key Input |
|------------|--------|-----------|
| `project-manager` | Coordinator | Facilitated R1, applied the spec-maturity gate + YAGNI gate, synthesized this plan. |
| `on-call-engineer` | Active | F1 finalization bugs (OCE-001/002); F6 stall-timeout design + no-retry YAGNI call. |
| `behavioral-analyst` | Active | F1 message-state corruption (B2/B3); F5 sentinel-row UI position (disputed); F4 dedup rule. |
| `software-architect` | Active | F1 registry/`ExternalCancelFn` shape (A1); F2 option-A prune (A2); F7 inline tool (A4); F9 removal (A5). |
| `test-engineer` | Active | F1 `CancelRegistry` + DB test (T1-T3); F2 fallback gate test (T6); F6 `classifyStreamError` + fake-timer test (T8-T10). |
| `user-experience-designer` | Active | F1 disable-Stop-while-in-flight + Option-A frame extension + muted "Stopped" label; F5 ephemeral-divider UI (disputed); F8 no line-selection infra. |
| `junior-developer` | Active | OQ reframing across all items; F2c (option a), F4a/b unverified premise, F8 spec-level. |
## Implementation Approach
ALTITUDE: this section names code artifacts and inlines only decision-bearing values (flag names, the 90s
stall timeout, the 50/200 limits, the export list to unexport). Full implementations belong in the files
themselves; mechanics live in the cited decision-log entries.
### TIER 1 — READY TO BUILD
**F1 — external task cancel kills the child + finalizes the message (apps/coder, standalone, do first).**
The highest-value correctness fix. A `taskControllers = new Map<string,AbortController>()` registry inside
`createDispatcher`, populated at the four run-functions and deleted in the existing `.finally()`, plus an
exported idempotent `cancelExternalTask(taskId): boolean` wired into `POST /api/tasks/:id/cancel` via a
narrow `ExternalCancelFn`
([D-1](artifacts/implementation-decision-log.md#d-1-f1-cancel-registry-shape-and-finalization-fix-scope)).
This batch **must** also fix the two pre-existing finalization bugs the abort wiring newly makes reachable —
the catch blocks that leave messages `streaming` and the warm success path that writes `complete` on an
aborted turn — via a shared `cancelAndFinalize` helper, or it ships a new bug
([D-1](artifacts/implementation-decision-log.md#d-1-f1-cancel-registry-shape-and-finalization-fix-scope)).
User Stop finalizes to `cancelled`, not `failed`
([D-7](artifacts/implementation-decision-log.md#d-7-f1-terminal-state-for-user-stop--cancelled-not-failed));
the terminal state reaches the web reducer via an optional `status` field on the existing coder
`message_complete` frame, not a new frame type
([D-8](artifacts/implementation-decision-log.md#d-8-f1-status-surfacing--extend-the-existing-frame-no-new-frame-type)),
rendered as a muted "Stopped" label ([D-10](artifacts/implementation-decision-log.md#trivial-decisions)). The
`session/stop` leg "never fires for external from UI" (CoderPane already calls `cancelTask`), so it is wired
best-effort via a `WHERE session_id=$ AND state='running'` lookup or deferred as a low-value leg.
**F2 + F3 together (apps/server, same file).** F2 is option-A prune-now-minimal: KEEP `extractToolCallBlocks`
+ `stripToolMarkup` (the only guard for a tool call emitted as plain text), unexport the 8 zero-caller
symbols — `isPlaceholderArgValue`, `parseXmlToolCall`, `parseInvokeToolCall`, `partialXmlOpenerStart`, and
the consts `XML_TOOL_OPEN/CLOSE`, `INVOKE_TOOL_OPEN/CLOSE` — taking the public surface 11 → 4 with zero
runtime effect
([D-3](artifacts/implementation-decision-log.md#d-3-f2-prune-scope--option-a-prune-now-minimal-keep-the-load-bearing-guard)).
A gate test pins the untested `<invoke>`-as-text fallback before the prune
([D-4](artifacts/implementation-decision-log.md#d-4-f2-fallback-gate-test--pin-the-untested-guard-before-pruning)).
Because F2 keeps `extractToolCallBlocks`, F3 is safe to land in the same file/batch: thread an optional
`log?: { debug }` param into `extractToolCallBlocks` from its one call site so rejection logging flows
through pino/`LOG_LEVEL` instead of `console.debug` ([D-2](artifacts/implementation-decision-log.md#trivial-decisions)).
**F6 — BooChat stall-timeout ONLY (apps/server, standalone).** Wrap the `stream-phase.ts:261` `fullStream`
loop with a per-chunk stall deadline using `AbortSignal.any([signal, stallAc.signal])` and a
`setTimeout(STALL_TIMEOUT_MS = 90_000)` bumped on each chunk, cleared in the existing `finally`; the
post-loop check throws `AbortError` so the existing finalize path writes `cancelled`
([D-5](artifacts/implementation-decision-log.md#d-5-f6-stall-timeout-via-abortsignalany-no-retry)). **No
retry** — partial-stream re-emit is non-idempotent and the instance is single+local (see Deferred (YAGNI)).
**F7 — `view_session_history` MCP tool (apps/coder MCP, standalone).** Tool 7 added inline in
`mcp-server.ts` following the existing 6-tool inline pattern (`textResult` + direct `sql`). Reads
`messages_with_parts` `WHERE role != 'system'`, params `session_id` + optional `chat_id` + `limit` (default
50, max 200), `ORDER BY created_at ASC` ([D-6](artifacts/implementation-decision-log.md#trivial-decisions)).
No interface, no pagination beyond `limit`.
**F9 — retire :9502 SPA (apps/coder, standalone cleanup).** Delete the `if (existsSync(webRoot))` serve
block in `index.ts` (~269-289), keep the inline 404; remove `apps/coder/web` from `pnpm-workspace.yaml`, the
coder build step, and the Dockerfile copy; drop the now-unused `fastifyStatic` import. KEEP every
`/api/coder/*` REST + WS + `/api/health` + `--mcp` route
([D-11](artifacts/implementation-decision-log.md#d-11-f9-retire-9502-spa--delete-the-serve-block-keep-all-apiws-routes)).
### TIER 2 — BLOCKED (need their own spec / a capability check)
These three are recorded, not built. They do **not** block the READY cluster
([D-12](artifacts/implementation-decision-log.md#d-12-f4f5f8-disposition-under-the-standing-override--document-as-blocked-do-not-halt)).
**F4 — notify-hook config injection.** The module shapes and the double-publish dedup rule are settled
(inbound route calls `normalizeAgentEvent` → bucket; confirms `tasks.state='running'` before publishing
`blocked`; SUPPRESSES `done` since the dispatcher already emits `idle`). But the **core premise is
unverified**: it is unknown whether claude / qwen / goose actually fire their native lifecycle hooks in
unattended mode (`claude -p`/SDK, `qwen --acp`/`stream-json`, goose), and goose's hook file/format is
unknown (not in repo). **Recommended path:** route to `plan-a-feature`; verify hook-firing-in-unattended-mode
and the goose hook mechanism first, or the feature is built on sand. Blocking OQs: OQ-F4a (hooks fire
unattended?), OQ-F4b (goose hook format?).
**F5 — opencode compaction surfacing.** **Blocked on a capability check**: the pinned `@opencode-ai/sdk`
exposes NO compaction event arm (confirmed arms: `session.next.{text,reasoning,tool,step}.*`,
`message.part.*`, `session.idle/error`). The review assumed events from opencode's core `event.ts` that the
pinned SDK may not surface. **Recommended path:** confirm the SDK emits a compaction signal and its exact
event name (or an SDK bump is needed) before building. The UI treatment is **disputed** (behavioral:
persistent sentinel row `metadata.kind='compaction'` that survives refresh; UX: ephemeral inline divider via
a new `agent_compacted` frame, no DB row) — settle once the event is confirmed to exist. Only
`compaction.ended` is in scope. Blocking OQs: OQ-F5a (SDK event existence/name), OQ-F5b (sentinel vs
ephemeral UI).
**F8 — diff-line → agent re-prompt UX.** **Spec-level.** The "DiffPanel" is inline in
`CoderPane.tsx:478-619`, rendering `pending_changes` rows as a static `<pre>` — no line-selection
infrastructure exists. The diff source is ambiguous (`pending_changes.diff` = BooCode write-tools only vs the
external-agent worktree git diff). "Send to new agent" needs coordinated workspace-pane + chat creation +
pre-population across three surfaces with no existing contract; selection diverges by modality.
**Recommended path:** route to `plan-a-feature` (the scope brief already hedged this; treat as firm).
MVP-if-pushed: "comment to current agent" only, block-level selection, pre-populate `ChatInput` — still
wants a spec. Blocking OQs: OQ-F8a (diff source), OQ-F8b (selection serialization), OQ-F8c (new-viewer
routing).
### Architecture and Integration Points
F1 lives entirely in apps/coder (`dispatcher.ts`, `routes/tasks.ts`, `index.ts:254` wiring) with a
single-field touch on the web reducer (`CoderPane.tsx:299`). F2/F3 are confined to
`apps/server/.../tool-call-parser.ts` and its one call site `stream-phase.ts`. F6 is confined to
`stream-phase.ts`. F7 is one inline tool in `apps/coder/.../mcp-server.ts`. F9 removes a serve block and a
workspace package. No new interface is introduced anywhere in TIER 1 — every item follows the existing inline
/ pure-helper precedent ([D-2](artifacts/implementation-decision-log.md#trivial-decisions),
[D-6](artifacts/implementation-decision-log.md#trivial-decisions)).
### Runtime Behavior
F1's cancel path: Stop → `POST /api/tasks/:id/cancel``cancelExternalTask(taskId)``ac.abort()`
backend honors the signal (`child.kill` / `session/cancel` / `session.abort` / interrupt) → the run-function
hits its abort short-circuit or catch → `cancelAndFinalize` writes the terminal `messages.status`
(`cancelled` on abort, `failed` on error) `WHERE status='streaming'` and publishes the terminal frame
([D-1](artifacts/implementation-decision-log.md#d-1-f1-cancel-registry-shape-and-finalization-fix-scope),
[D-7](artifacts/implementation-decision-log.md#d-7-f1-terminal-state-for-user-stop--cancelled-not-failed)).
F6's stall path: each chunk bumps the 90s timer; on stall, `stallAc.abort()` flows through `AbortSignal.any`
into the same finalize-as-`cancelled` path
([D-5](artifacts/implementation-decision-log.md#d-5-f6-stall-timeout-via-abortsignalany-no-retry)).
### Data Model and Persistence
No schema changes in any TIER 1 item. F1 and F6 only change *which* terminal value gets written to existing
`messages.status` / `tasks.state` columns. F7 is a read against the existing `messages_with_parts` view.
### External Interfaces
F1 extends the existing coder `message_complete` WS frame with an optional `status` field — no new frame
type, so no paired strict-union arm is forced in `apps/web/src/api/types.ts` beyond the optional field
([D-8](artifacts/implementation-decision-log.md#d-8-f1-status-surfacing--extend-the-existing-frame-no-new-frame-type)).
F7 exposes one new read-only MCP tool on BooCoder's MCP server
([D-6](artifacts/implementation-decision-log.md#trivial-decisions)). F4 (blocked) would add a new
unauthenticated localhost POST route — see Security Posture. F5 (blocked) would add a new WS frame and thus
trigger the full cross-app parity rule.
## Decomposition and Sequencing
Each READY item is its own patch tag, one batch per coherent unit
([D-9](artifacts/implementation-decision-log.md#trivial-decisions)). F2+F3 are a single unit (same file; F2 keeping
`extractToolCallBlocks` unblocks F3). F1 goes first as the highest-value correctness fix.
| # | Work Unit | Surface / Deploy | Delivers | Depends On | Verification |
|---|-----------|------------------|----------|------------|--------------|
| 1 | **F1** cancel registry + finalization fixes | apps/coder → `systemctl restart boocoder` | Stop kills child; message finalizes `cancelled`/`failed`; no stuck `streaming` ([D-1](artifacts/implementation-decision-log.md#d-1-f1-cancel-registry-shape-and-finalization-fix-scope)) | — | `CancelRegistry` unit (4 cases) + DB-integration route test → row `cancelled` (T1-T3) |
| 2 | **F2+F3** parser prune + logger | apps/server → docker rebuild | Public surface 11→4; rejection logging via pino; guard pinned ([D-3](artifacts/implementation-decision-log.md#d-3-f2-prune-scope--option-a-prune-now-minimal-keep-the-load-bearing-guard), [D-2](artifacts/implementation-decision-log.md#trivial-decisions)) | F2's keep-decision unblocks F3 | Fallback gate test green through prune (T6, [D-4](artifacts/implementation-decision-log.md#d-4-f2-fallback-gate-test--pin-the-untested-guard-before-pruning)) |
| 3 | **F6** stall-timeout | apps/server → docker rebuild | 90s server-side stall detection → finalize `cancelled` ([D-5](artifacts/implementation-decision-log.md#d-5-f6-stall-timeout-via-abortsignalany-no-retry)) | — | `classifyStreamError` unit (5 cases) + `vi.useFakeTimers()` stall test + post-loop abort regression pin (T8-T10) |
| 4 | **F7** session-history MCP tool | apps/coder → `systemctl restart boocoder` | Read-only transcript tool; `/api/health` tool count +1 ([D-6](artifacts/implementation-decision-log.md#trivial-decisions)) | — | Manual MCP call + sentinel-strip assertion (`role != 'system'`) |
| 5 | **F9** retire :9502 SPA | apps/coder → `systemctl restart boocoder` | Serve block + build step + workspace pkg removed; routes kept ([D-11](artifacts/implementation-decision-log.md#d-11-f9-retire-9502-spa--delete-the-serve-block-keep-all-apiws-routes)) | — | `/api/coder/*` + `/api/health` still 200; `GET /` 404-or-redirect |
F4/F5/F8 are NOT in this table — they route out per
[D-12](artifacts/implementation-decision-log.md#d-12-f4f5f8-disposition-under-the-standing-override--document-as-blocked-do-not-halt).
## RAID Log
### Risks
| ID | Risk | Likelihood | Severity | Blast Radius | Reversibility | Owner | Mitigation |
|----|------|------------|----------|--------------|---------------|-------|------------|
| R1 | F1 message-finalization races across the 4 backend paths leave a row in the wrong terminal state | Medium | Medium | One assistant message per affected turn | Reversible (re-run) | on-call-engineer | Shared `cancelAndFinalize` helper, `WHERE status='streaming'` idempotency, abort short-circuit before the unconditional `complete` write ([D-1](artifacts/implementation-decision-log.md#d-1-f1-cancel-registry-shape-and-finalization-fix-scope)) |
| R2 | F2 prune accidentally removes the `<invoke>`-as-text fallback guard (regression: plain-text tool calls silently dropped) | Low | High | Any qwen3.6 turn that emits tool-call markup as text | Reversible (re-add export/call) | software-architect | Option-A keeps the impl; gate test fails if `extractToolCallBlocks` leaves the text-delta path ([D-3](artifacts/implementation-decision-log.md#d-3-f2-prune-scope--option-a-prune-now-minimal-keep-the-load-bearing-guard), [D-4](artifacts/implementation-decision-log.md#d-4-f2-fallback-gate-test--pin-the-untested-guard-before-pruning)) |
| R3 | F6 stall-timeout fires on a healthy-but-slow stream (false abort) | Low | Low | One in-flight chat turn | Reversible (re-send) | on-call-engineer | 90s per-*chunk* (not per-turn) deadline, bumped on every chunk; tune only on observed false-fire ([D-5](artifacts/implementation-decision-log.md#d-5-f6-stall-timeout-via-abortsignalany-no-retry)) |
| R4 | F5 premise fails — the pinned SDK never surfaces a compaction event, so the feature is unbuildable without an SDK bump | High | N/A (blocked) | F5 only | N/A | (capability check) | Confirm event existence/name before any build; F5 stays Blocked until then ([D-12](artifacts/implementation-decision-log.md#d-12-f4f5f8-disposition-under-the-standing-override--document-as-blocked-do-not-halt)) |
| R5 | F4 premise fails — agents do not fire native hooks in unattended mode, so injection yields no signals | High | N/A (blocked) | F4 only | N/A | (plan-a-feature) | Verify hook-firing + goose format in the F4 spec before any build ([D-12](artifacts/implementation-decision-log.md#d-12-f4f5f8-disposition-under-the-standing-override--document-as-blocked-do-not-halt)) |
### Assumptions
| ID | Assumption | What Changes If Wrong | Verifier | Status |
|----|------------|-----------------------|----------|--------|
| A1 | llama-swap native `--jinja` parsing stays ON (structured `tool_calls`), so the text-scrape fallback remains dormant defense-in-depth | If off, the kept guard becomes hot — fine; if a future config silently changes parsing, F2 Option B reopens | live probe of `:8401` (not in-repo) | Unverified-in-repo (probe-only); guard kept regardless ([D-3](artifacts/implementation-decision-log.md#d-3-f2-prune-scope--option-a-prune-now-minimal-keep-the-load-bearing-guard)) |
| A2 | All four external backends honor `ctx.signal` as cited (`child.kill` / `session/cancel` / `session.abort` / interrupt) | If one does not, that backend's Stop stays a no-op even after wiring | F1 DB-integration test per backend path | Evidenced by file:line (scope-brief F1); test confirms |
| A3 | 90s is the right stall deadline for the local llama-swap workload shape | Too low → false aborts on slow models; too high → sluggish recovery | Production observation | Committed value, tunable ([D-5](artifacts/implementation-decision-log.md#d-5-f6-stall-timeout-via-abortsignalany-no-retry)) |
| A4 | F1's `session/stop` leg "never fires for external from UI" (CoderPane calls `cancelTask`) | If a code path does hit session-stop for an external task, that leg's finalization is best-effort only | on-call review of CoderPane wiring | Evidenced (synthesis-input F1 resolved OQs) |
### Dependencies
| ID | Dependency | Owner | Status |
|----|------------|-------|--------|
| Dep1 | F5 requires `@opencode-ai/sdk` to surface a compaction event (or an SDK bump) | (capability check) | BLOCKING F5 — unresolved (OQ-F5a) |
| Dep2 | F4 requires confirmed unattended hook-firing for claude/qwen/goose + the goose hook format | (plan-a-feature) | BLOCKING F4 — unresolved (OQ-F4a, OQ-F4b) |
| Dep3 | F8 requires a chosen diff source + a line-selection approach | (plan-a-feature) | BLOCKING F8 — unresolved (OQ-F8a/b/c) |
## Testing Strategy
Sourced from test-engineer (T1-T3, T6, T8-T10) following the established pure-helper-then-wire precedent
(`turn-guard.ts`, `lifecycle-decisions.ts`, `mistake-tracker.ts`). Coder tests are `globals:false` (import
`describe`/`it`/`expect`); include glob `src/**/__tests__/**/*.test.ts`; DB-integration tests are opt-in via
`DATABASE_URL` + `describe.runIf`.
- **Observable behaviors to test:**
- F1: a cancelled external task lands `messages.status='cancelled'` (not `streaming`, not `complete`); the registry register/cancel/delete/has behaves idempotently ([D-1](artifacts/implementation-decision-log.md#d-1-f1-cancel-registry-shape-and-finalization-fix-scope), [D-7](artifacts/implementation-decision-log.md#d-7-f1-terminal-state-for-user-stop--cancelled-not-failed)).
- F2: a text-delta carrying a complete `<invoke>` block lands in `result.toolCalls` with the markup absent from `result.content` — green through the prune, red if the guard is removed ([D-4](artifacts/implementation-decision-log.md#d-4-f2-fallback-gate-test--pin-the-untested-guard-before-pruning)).
- F6: a fake hanging stream is aborted after the 90s deadline under `vi.useFakeTimers()`; the existing post-loop `signal?.aborted` check still passes (regression pin) ([D-5](artifacts/implementation-decision-log.md#d-5-f6-stall-timeout-via-abortsignalany-no-retry)).
- F7: results exclude `role='system'` rows (sentinel strip) and respect the `limit` cap.
- **Test doubles posture:** extract pure helpers (`CancelRegistry`, `classifyStreamError`) with no DB/child/I/O for unit coverage; stub `streamText` for the F2 gate and F6 stall tests; one DB-integration test for the F1 route. Warm-worktree-preserved is held as a code comment, not a spy (test-engineer T1-T3).
- **Edge cases requiring coverage:** double-Stop / cancel-after-exit (idempotent `ac.abort()`); abort vs thrown error in the catch (cancelled vs failed mapping); F6 stall vs healthy-slow stream.
- **Test levels:** unit (registry, classifier, gate) → integration (F1 route → DB row) → manual (F7 MCP call, F9 route-still-200 smoke).
## Security Posture
Thin. The single TIER 1 surface is BooCoder's new read-only `view_session_history` MCP tool (F7), which
reads only `messages_with_parts` `WHERE role != 'system'` — no write verbs, no PII beyond what the operator
already authored, single-user, Authelia at the reverse proxy, no app-layer auth by design.
The one genuinely new auth-relevant surface — F4's inbound unauthenticated localhost POST route for hook
callbacks — belongs to a **BLOCKED** item and is not built in this plan. When F4 is planned, it must be
recorded explicitly as a new unauthenticated localhost route (acceptable under the single-user / Authelia
posture, but called out, not silent).
## Operational Readiness
- **Deploy by surface:** apps/coder items (F1, F7, F9) → `sudo systemctl restart boocoder`; apps/server items (F2+F3, F6) → `docker compose up --build -d boocode`. Stage commits explicitly by path; never `git add -A` (Sam has uncommitted web WIP) ([D-9](artifacts/implementation-decision-log.md#trivial-decisions)).
- **Build-step change (F9):** removing `apps/coder/web` deletes a coder build step and a workspace package; the coder Dockerfile copy and `pnpm-workspace.yaml` entry must be removed in the same batch or the build references a missing package ([D-11](artifacts/implementation-decision-log.md#d-11-f9-retire-9502-spa--delete-the-serve-block-keep-all-apiws-routes)).
- **Health check after F7:** `/api/health` reports a tool count; expect it to increment by one.
- **Tagging:** sequential patch tags, one per unit; no v2.8.0 ([D-9](artifacts/implementation-decision-log.md#trivial-decisions)).
- **Rollback:** each unit is an independent patch tag, so rollback is per-item (revert the tag, redeploy by surface). No schema migrations means no expand/contract sequencing to unwind.
## On-Call Resilience Posture
Application-source resilience, sourced from on-call-engineer. Infrastructure concerns live in Operational
Readiness above.
- **Timeouts and deadlines:** F6 adds a 90s per-chunk stall deadline on the BooChat `fullStream` loop, the first server-side guard against a hung llama-swap stream (today only the frontend 60s `discard_stale` watchdog exists) ([D-5](artifacts/implementation-decision-log.md#d-5-f6-stall-timeout-via-abortsignalany-no-retry)).
- **Retry strategy:** none, deliberately. A retry after a partial stream re-emits already-streamed deltas (non-idempotent); the user re-sending is the correct recovery at single-local-instance scale (see Deferred (YAGNI)) ([D-5](artifacts/implementation-decision-log.md#d-5-f6-stall-timeout-via-abortsignalany-no-retry)).
- **Idempotency:** F1's `cancelExternalTask` is idempotent (`ac.abort()` no-ops when already aborted → double-Stop and cancel-after-exit are safe); message finalization is idempotent via `WHERE status='streaming'` ([D-1](artifacts/implementation-decision-log.md#d-1-f1-cancel-registry-shape-and-finalization-fix-scope)).
- **Kill switches:** F1 is itself the kill switch for a runaway external agent — making Stop actually stop is the operator's manual control.
- **Graceful degradation:** F1's `session/stop` leg is best-effort (does not fire for external tasks from the UI); F6 falls back to finalizing as `cancelled` on stall rather than hanging indefinitely.
- **Observability of failure paths:** F1 distinguishes `cancelled` (user/stall) from `failed` (thrown error) so the human-inbox surface stays honest ([D-7](artifacts/implementation-decision-log.md#d-7-f1-terminal-state-for-user-stop--cancelled-not-failed)); F3 routes parser-rejection logging through pino/`LOG_LEVEL` ([D-2](artifacts/implementation-decision-log.md#trivial-decisions)).
- **Data integrity:** no monetary/rate columns touched; F1/F6 only write existing terminal-state enum values into existing columns.
- **Migration safety:** no schema changes in any TIER 1 item — nothing to expand/contract.
## Definition of Done
- [ ] Hitting Stop on an external task kills the child and the assistant message finalizes `cancelled` (never left `streaming`, never falsely `complete`) ([D-1](artifacts/implementation-decision-log.md#d-1-f1-cancel-registry-shape-and-finalization-fix-scope), [D-7](artifacts/implementation-decision-log.md#d-7-f1-terminal-state-for-user-stop--cancelled-not-failed)).
- [ ] Stop button is disabled while the cancel POST is in flight; a muted "Stopped" label renders on cancel ([D-8](artifacts/implementation-decision-log.md#d-8-f1-status-surfacing--extend-the-existing-frame-no-new-frame-type), [D-10](artifacts/implementation-decision-log.md#trivial-decisions)).
- [ ] Parser public surface is 4 exports; the `<invoke>`-as-text gate test is green and fails if the guard is removed ([D-3](artifacts/implementation-decision-log.md#d-3-f2-prune-scope--option-a-prune-now-minimal-keep-the-load-bearing-guard), [D-4](artifacts/implementation-decision-log.md#d-4-f2-fallback-gate-test--pin-the-untested-guard-before-pruning)).
- [ ] Parser rejection logging appears via pino and respects `LOG_LEVEL` ([D-2](artifacts/implementation-decision-log.md#trivial-decisions)).
- [ ] A fake hanging BooChat stream is finalized `cancelled` after 90s under fake timers; the post-loop abort regression pin passes ([D-5](artifacts/implementation-decision-log.md#d-5-f6-stall-timeout-via-abortsignalany-no-retry)).
- [ ] `view_session_history` returns a session transcript excluding `role='system'` rows, capped at `limit` (default 50, max 200); `/api/health` tool count +1 ([D-6](artifacts/implementation-decision-log.md#trivial-decisions)).
- [ ] :9502 serve block + build step + workspace package removed; all `/api/coder/*` + `/api/health` routes still 200 ([D-11](artifacts/implementation-decision-log.md#d-11-f9-retire-9502-spa--delete-the-serve-block-keep-all-apiws-routes)).
- [ ] Each shipped item has its own sequential patch tag + CHANGELOG entry; deployed by surface ([D-9](artifacts/implementation-decision-log.md#trivial-decisions)).
## Specialist Handoffs for Implementation
- **`test-engineer`** — dispatch alongside each READY unit; needs the touch-point file:lines (already in discovery notes) to author T1-T3 (F1), T6 (F2), T8-T10 (F6) before/with the wiring.
- **`software-architect`** — dispatch if F1's `cancelAndFinalize` helper shape needs settling across the four backend paths during implementation; needs the four catch-block + success-path line ranges.
- **`plan-a-feature` (F4)** — dispatch when Sam wants F4; input: the settled dedup rule + module shapes from synthesis-input, and the two blocking OQs (unattended hook-firing, goose format) to resolve first.
- **`plan-a-feature` (F8)** — dispatch when Sam wants F8; input: `CoderPane.tsx:478-619` DiffPanel location, the diff-source ambiguity, and the cross-surface routing gap.
- **Capability check (F5)** — dispatch before any F5 build; input: the pinned `@opencode-ai/sdk` arm list and OQ-F5a (does it emit a compaction event, and what is its exact name?).
## Deferred (YAGNI)
Items considered during planning and deferred under the YAGNI rule. Each carries a concrete reopen trigger.
### F6 retry / backoff classifier
- **Why deferred:** Evidence test — a retry after a partial stream re-emits already-streamed deltas (`state.accumulated` + live `delta` frames are non-idempotent), which is strictly worse than the current behavior; no second instance exists to fail over to. Single-local-instance scale makes operator re-send the correct recovery.
- **Reopen when:** llama-swap gains restart-in-place-with-clear-partial semantics, OR a second llama-swap instance is added for failover.
- **Source:** R1, on-call-engineer.
### F2 Option B — flag-gated full retirement of `extractToolCallBlocks` / `stripToolMarkup`
- **Why deferred:** Evidence test — no evidence qwen3.6 stopped emitting `<invoke>`-as-text on live, and the sidecar `--jinja` state is unconfirmed in-repo; deleting the only plain-text-tool-call guard on that basis is unsafe (named anti-pattern: removing a load-bearing guard without a measured signal).
- **Reopen when:** a documented multi-session live probe shows zero text-delta tool calls from qwen3.6.
- **Source:** R1, software-architect / test-engineer.
### F4 `NotifyHookInjection` interface
- **Why deferred:** Simpler-version test — three agents with an identical read-merge-write settings.json flow do not need an interface; one concrete function switching on agent name satisfies the same need (named anti-pattern: single-implementation interface before three divergent uses exist).
- **Reopen when:** a fourth agent with a genuinely different injection contract appears (and only after F4's premise is verified).
- **Source:** R1, software-architect.
### F5 handling of `compaction.started` / `compaction.delta` + `step.failed` + `tool.progress`
- **Why deferred:** Evidence test — only `compaction.ended` is user-actionable (it is what closes the silent context gap); the other arms add UI noise with no user-described need. Compounded by F5 being capability-blocked anyway.
- **Reopen when:** the SDK compaction event is confirmed AND a user-described need for in-progress compaction feedback emerges.
- **Source:** R1, behavioral-analyst.
### F7 `SessionHistoryReader` interface / pagination beyond `limit`
- **Why deferred:** Simpler-version test — one read tool with an inline query + a `limit` cap satisfies the read need; an interface and cursor pagination are unused machinery (named anti-pattern: abstraction before a second concrete use).
- **Reopen when:** a second history-read consumer needs a different read path, or transcripts routinely exceed the 200-row cap in a way that breaks the use case.
- **Source:** R1, software-architect.
## Open Items
- **OI-1 (F4 premise):** Do claude / qwen / goose fire native lifecycle hooks in unattended mode, and what is goose's hook file/format?
- **Resolves when:** verified in a `plan-a-feature` spec for F4 (OQ-F4a, OQ-F4b).
- **Blocks implementation:** Yes for F4 — No for the READY cluster (F4 is independent).
- **OI-2 (F5 capability):** Does the pinned `@opencode-ai/sdk` surface a compaction event, and what is its exact name?
- **Resolves when:** a capability check confirms the event (or an SDK bump is scoped) — OQ-F5a.
- **Blocks implementation:** Yes for F5 — No for the READY cluster.
- **OI-3 (F5 UI treatment):** persistent sentinel row vs ephemeral inline divider for "context compacted."
- **Resolves when:** OI-2 resolves and the team picks the UI shape (OQ-F5b); currently disputed (behavioral vs UX).
- **Blocks implementation:** Yes for F5 — No for the READY cluster.
- **OI-4 (F8 spec):** diff source (`pending_changes.diff` vs worktree git diff), selection serialization, and new-viewer routing.
- **Resolves when:** a `plan-a-feature` spec for F8 settles OQ-F8a/b/c.
- **Blocks implementation:** Yes for F8 — No for the READY cluster.
- **OI-5 (F1 session-stop leg):** whether to wire the best-effort `session/stop` lookup or defer it.
- **Resolves when:** decided at F1 implementation; it does not fire for external tasks from the UI today, so it is low-value.
- **Blocks implementation:** No — F1 ships with or without this leg.
## Summary
- **Outcome delivered:** Stop actually stops external agents and finalizes their messages correctly; the parser is pruned to its load-bearing surface with the guard test-pinned; BooChat gains a 90s server-side stall guard; a session-history MCP tool and the :9502 retirement land; F4/F5/F8 have a documented route to their own spec/capability work.
- **Team size:** 7 specialists (incl. project-manager) — see [artifacts/implementation-iteration-history.md](artifacts/implementation-iteration-history.md)
- **Rounds of facilitation:** 1 — see [artifacts/implementation-iteration-history.md](artifacts/implementation-iteration-history.md)
- **Decisions committed:** 12 — see [artifacts/implementation-decision-log.md](artifacts/implementation-decision-log.md)
- **Decisions settled by evidence:** 11 — see [artifacts/implementation-decision-log.md](artifacts/implementation-decision-log.md)
- **Decisions settled by junior-developer reframing:** 0 (junior confirmed several, but each rests on specialist evidence) — see [artifacts/implementation-decision-log.md](artifacts/implementation-decision-log.md)
- **Decisions settled by user input:** 1 (D-12, the standing-override disposition) — see [artifacts/implementation-decision-log.md](artifacts/implementation-decision-log.md)
- **Rejected alternatives recorded:** 13 — see [artifacts/implementation-decision-log.md](artifacts/implementation-decision-log.md)
- **Open items remaining:** 5 (4 blocking only their own BLOCKED item; OI-5 non-blocking)
- **Recommendation:** Ship the READY cluster as planned — build F1 first (apps/coder, highest-value correctness fix, must include the OCE-001/002 finalization fixes), then F2+F3 (apps/server, one batch), F6 (apps/server), F7 (apps/coder MCP), F9 (apps/coder cleanup), each as its own sequential patch tag. Route F4 and F8 to `plan-a-feature` and F5 to an `@opencode-ai/sdk` capability check before any build.

View File

@@ -0,0 +1,169 @@
# Scope brief — post-review backlog (spec stand-in)
**Status:** ground-truth source for `plan-implementation`. No `feature-specification.md` exists; this brief
captures the items, decisions, and verified current state from the 2026-06-02 backlog review conversation
plus `boocode_code_review_v2.md`, `docs/DEFERRED-WORK.md`, and `boocode_roadmap.md`. Sam chose scope =
"everything we discussed" (one plan covering all actionable items; WANT items have no per-feature spec, so
their planning is intentionally shallower).
Two items were verified live during the review (file:line evidence below). The plan must honor those
findings as committed mechanics, not re-debate them.
---
## Items to implement
### F1 — External task cancel actually kills the child (DEFERRED #1)
**Problem (VERIFIED 2026-06-02):** Hitting Stop on an external agent task (opencode/goose/qwen/claude)
marks `tasks.state='cancelled'` but does NOT stop the agent. The frontend is wired
(`CoderPane.tsx:987` `handleStop``api.coder.cancelTask``POST /api/coder/tasks/:id/cancel`, fired only
for external tasks since `activeTaskId` is set only when `messages.ts:238` returns a `task_id`). But the
route (`apps/coder/src/routes/tasks.ts:110-148`) only calls `cancelPendingPermission` + `inference.cancel`
(native-only) + sets DB cancelled. It never reaches the dispatcher. The dispatcher
(`apps/coder/src/services/dispatcher.ts`) has no `Map<taskId, AbortController>` and no exported cancel; each
external run-function creates a private `const ac = new AbortController()` (lines ~316/655/991/1248) and
threads `ac.signal` into every backend, but `ac.abort()` is never called by any route. All four backends
honor the signal correctly (PTY `child.kill` SIGTERM→SIGKILL at `pty-dispatch.ts:159`; warm-ACP
`session/cancel` at `warm-acp.ts:318`; opencode `session.abort` at `opencode-server.ts:775`; claude-sdk
interrupt at `claude-sdk.ts:209`) — but the signal is never fired from Stop, and the assistant message is
left `status='streaming'`.
**Decision / approach:** Implement Phase A + B of the DEFERRED #1 proposal — a `Map<taskId,AbortController>`
registry in the dispatcher, an exported `cancelExternalTask(taskId)`, wire `POST /api/tasks/:id/cancel` (and
`POST /api/sessions/:id/stop`) into it for external tasks, and finalize the streaming assistant message
(`complete`/`cancelled`) + publish terminal frame on cancel. Warm backends keep their persistent worktree
(by design); one-shot PTY cleans its worktree in `finally`. Native boocode cancel path unchanged.
**Surface:** apps/coder (dispatcher, routes/tasks.ts, routes/messages.ts). Sam: implement.
### F2 — Tool-call parser clean/prune (review §4a, optional retirement)
**Problem (VERIFIED 2026-06-02):** The TS tool-call parser
(`apps/server/src/services/inference/tool-call-parser.ts`) is STILL load-bearing in code —
`extractToolCallBlocks` is called in the live text-delta path (`stream-phase.ts:263-284`) and
`stripToolMarkup` in `tool-phase.ts:122` + `error-handler.ts:25,106`. But a live probe of llama-swap
(`http://100.101.41.16:8401`) confirmed native `--jinja` parsing is ON: the server returns structured
`tool_calls` + `finish_reason:tool_calls`, content is just `"\n"`. The structured path
(`stream-phase.ts:285` `case 'tool-call'`) does all real work today; the TS parser is dormant
defense-in-depth. qwen3.6 is specifically the model that drifts to `<invoke>` markup, and the TS fallback is
the only guard for "tool call emitted as plain text" (`experimental_repairToolCall` does NOT cover that
case).
**Decision / approach:** Sam wants to clean/prune. Honor the safe path: this is NOT a zero-risk delete.
The plan must decide between (a) prune only what is provably dead now while keeping the
`<invoke>`-text-fallback guard, vs (b) the full retirement = gate the text-scrape fallback behind a flag,
validate native parsing on live qwen3.6 for one release, then delete. The relicense already removed the
AGPL-dead exports; this is pure simplification, no license pressure. Identify exactly which exports/branches
are safely removable today vs which are the load-bearing guard.
**Surface:** apps/server inference. Sam: clean/prune (safe path).
### F3 — xml-parser structured logging (DEFERRED #6)
**Problem:** `tool-call-parser.ts:65` uses `console.debug` for placeholder-rejection logging instead of the
pino `log.debug({...})` pattern, so it skips the structured pipeline and ignores `LOG_LEVEL`. Cosmetic.
**Decision / approach:** Pass an optional logger into `extractToolCallBlocks` from its one call site
(`stream-phase.ts` executeStreamPhase), or a module-level debug hook. ~5-minute edit. Sam: implement (will
be done directly, included here for completeness).
### F4 — Notify-hook config injection → out-of-band agent status (review #10 follow-on)
**Context:** v2.7.6 shipped the SCOPED status-publish (normalized `working|blocked|idle|error` frame from
BooCoder's OWN turn boundaries via `normalize-agent-status.ts`). The follow-on — the superset clean-room
pattern (ELv2, PATTERN-ONLY) — injects a notify hook into each PTY agent's native config
(`~/.claude/settings.json`, `~/.qwen/settings.json`, `~/.config/goose/`), the hook POSTs
`{terminalId/taskId, eventType, agent}` back to BooCoder, which normalizes ~30 vendor event names → the
existing status buckets via the already-built `normalizeAgentEvent` helper. This gives goose/qwen/claude
real out-of-band working/blocked-on-permission/done signals the turn-boundary publish can't see.
**Decision / approach:** Sam wants it. Clean-room only (no superset code). Reuse the existing
`normalize-agent-status.ts`. Plan the injection mechanism (idempotent settings.json merge per agent),
the inbound POST route, and how it composes with the existing scoped publish without double-publishing.
**Surface:** apps/coder (agent config injection, new route, normalize reuse). Sam: want.
### F5 — opencode compaction surfacing (review §3 #1)
**Problem (VERIFIED earlier):** `opencode-server.ts` consumes ~5 SSE arms but NOT `compaction.{started,delta,ended}`
— so when the warm opencode server auto-compacts mid-conversation it shows as a silent context gap in the
UI. Only `step.ended` (tokens, v2.7.3) was taken from the fuller `session.next.*` set.
**Decision / approach:** Sam wants it. Consume the compaction arms (and consider `step.failed`,
`tool.progress`) in `opencode-server.ts`, map to a UI signal so the user sees "context compacted" instead of
a gap. May need a new WS frame + web `WsFrame` arm (cross-app parity rule applies).
**Surface:** apps/coder (opencode-server.ts) + apps/web (frame + render). Sam: want.
### F6 — BooChat resilience: stall-timeout + retry/backoff (review §3 #2/#3)
**Problem (VERIFIED 2026-06-02):** `apps/server/src/services/inference/stream-phase.ts` has ZERO retry and
NO server-side stall timeout. A hung llama-swap stream relies entirely on the frontend 60s `discard_stale`
watchdog. The `for await (const part of result.fullStream)` loop (`stream-phase.ts:261`) is the wrap point.
**Decision / approach:** Sam wants it. Add a per-chunk stall timeout wrapping the fullStream loop that fires
the existing abort path (~40-60 LoC), and a minimal retry/backoff classifier (transient-5xx / stall;
llama-swap rarely emits retry-after, so value is mostly transient-5xx + stall retry — strip the cloud-billing
arms). YAGNI-gate the retry scope hard: single local instance, so keep it minimal.
**Surface:** apps/server inference (stream-phase.ts). Sam: want.
### F7 — view_session_history MCP tool
**Context:** Roadmap v2.x optional. Expose chat/session history as a read-only MCP tool on BooCoder's MCP
server (`apps/coder/src/services/mcp-server.ts`) so an MCP client can retrieve the transcript of a session.
**Decision / approach:** Sam wants it. Read-only tool over the existing `messages_with_parts` read path,
keyed by session/chat id. Reference shape: memov `snap`/`mem_history`. Keep minimal — one read tool, no
write/validate verbs unless evidence demands.
**Surface:** apps/coder MCP server. Sam: want.
### F8 — diff-line → agent re-prompt UX (review §5j, superset pattern)
**Context:** Select lines in a diff, send them as a comment to the existing agent session or to a new agent
("fix this here"). superset has the pattern (ELv2, PATTERN-ONLY clean-room). This is a review/diff UX
frontier, the heaviest WANT and the one most needing its own feature spec.
**Decision / approach:** Sam wants it. Plan at a coarse level (it likely warrants a follow-on plan-a-feature):
the diff-line selection affordance, the compose-to-session/new-agent action, and how it routes through the
existing dispatcher + AgentComposerBar. The diff UI component must be located (grep for the diff panel did not
surface a `*diff*` filename — UX specialist to locate the actual component).
**Surface:** apps/web (diff UI) + apps/coder (route to existing/new session). Sam: want.
### F9 — Retire apps/coder/web :9502 fallback SPA (DEFERRED #5)
**Context:** Standalone Vite SPA (`@boocode/coder-web`) served by BooCoder at `:9502` when no BooChat proxy
is in front. Sam confirmed: "I don't use 9502." Primary UI is CoderPane inside BooChat.
**Decision / approach:** Remove `apps/coder/web/` + its static serve from `apps/coder/src/index.ts` + the
coder build step; KEEP all WS + REST routes (CoderPane depends on them); optional minimal static
"open in BooChat" page at `/`. Per DEFERRED #5 removal checklist. Low risk. Sam: greenlit.
---
## Deferred (Sam's explicit dispositions — record, do not plan)
- **Subagent permission demux** — DEFER. Condition met (VERIFIED 2026-06-02: `opencode-server.ts` has zero
permission handling; parent and child sub-sessions run without BooCode gating — effectively bypassed).
- **Tier-2 provider follow-ups** — `available_agents.enabled` column and MCP `list_providers` tools DROP
(redundant / moot now that v2.4 external-driving is dead); `provider_snapshot_updated` WS frame + shared
`packages/types` PARK (revisit only if status feels laggy / a third type consumer appears).
- **Large-file splits** — DEFER (refactor-when-touched).
- **PR-resolver, multi-provider LLM, HMAC audit log, verify-gate ensembler, taste-skill, workflow graphs,
record/replay test harness** — DEFER (optional/far-future).
- **v2.4 BooCoder-as-ACP-agent** — DROPPED (won't-do, never drive BooCoder from external editors).
- **Native llama-server tool-call-parser full retirement** — covered by F2's safe-path decision.
## Cross-cutting constraints (from CLAUDE.md — must honor)
- New WS frame type → update BOTH server `InferenceFrame`/`ws-frames.ts` AND web `WsFrame`
(`apps/web/src/api/types.ts`); the web type is the wire-format gate (applies to F5, maybe F4).
- Sentinels are `role='system'` rows with `metadata.kind`; `buildMessagesPayload` strips them (relevant if
F4/F5 add a UI-only signal).
- Deploy by surface: apps/coder change → `systemctl restart boocoder`; apps/web|server → docker rebuild.
- Stage commits explicitly by path; never `git add -A` (Sam has uncommitted web WIP).
- Coder↔web provider-type parity + ws-frames parity tests must stay green.
- Default to next patch tags, one batch per coherent unit (Sam declined v2.8.0 twice).

View File

@@ -0,0 +1,261 @@
# Test Plan: post-review-backlog (F1, F2, F6, F5/F4/F7 brief)
## Scope
`apps/coder/src/services/dispatcher.ts`,
`apps/coder/src/routes/tasks.ts:110-148`,
`apps/server/src/services/inference/tool-call-parser.ts`,
`apps/server/src/services/inference/stream-phase.ts:261-341`,
and their existing test companions.
Branch: main (HEAD e5ce01a, v2.7.11).
---
## Summary
Three features — F1 (task cancel), F2 (parser prune), F6 (stall-timeout + retry) — have behavioral contracts not yet protected by any test. F1 is the highest risk because the whole-point of the feature is a side-effect (child termination) that is trivially skippable in the implementation. F2 is the whole point of a prune: keeping the existing `tool-call-parser.test.ts` suite green IS the contract, but one gap (the `<invoke>`-fallback path exercised from the call site in `stream-phase.ts`) needs a new test before any deletion proceeds. F6 benefits most from a pure-helper extraction (following the `lifecycle-decisions.ts` / `mistake-tracker.ts` precedent) so the classifier logic is unit-testable without a real stream.
| Priority | Count |
|----------|-------|
| High | 4 |
| Medium | 3 |
| Low | 2 |
| Skipped | 3 |
Full analysis written to: /home/samkintop/opt/boocode/docs/plans/post-review-backlog/test-plan.md
---
## Coverage Assessment
**Well-tested:** `tool-call-parser.ts` — every export has thorough unit coverage (`tool-call-parser.test.ts`); the inline fallback behavior of `extractToolCallBlocks` is the best-tested surface in the inference layer. Coder backend lifecycle decisions (`lifecycle-decisions.test.ts`, `turn-guard.test.ts`, `normalize-agent-status.test.ts`) follow the pure-helper extraction pattern and have complete unit coverage.
**Significant gaps:**
- `dispatcher.ts` — the `cancelExternalTask` function does not exist yet. Once written it will have zero tests. The cancel-registry decision logic (the pure "does this taskId map to an AbortController?" portion) is the highest-value extraction target.
- `routes/tasks.ts:110-148` (cancel route) — currently the route never reaches the dispatcher for external tasks; after F1 the new code path has no test.
- `stream-phase.ts` stall / retry behavior — entirely absent; the `for await` loop has no timeout and no retry. Both will be new code with no tests.
- `stream-phase.ts` abort-after-stream behavior — the critical post-loop `signal?.aborted` check at line 337 is already documented as a hard-won correctness property but has no regression test.
**Overall:** the inference layer has solid unit tests for data-transformation helpers but none for control-flow behavior (abort, stall, retry). The coder dispatcher has no tests at all. This is appropriate given the DB-heavy nature of the dispatcher, but the pure decision layers (cancel-registry, retry classifier) can and should be extracted and tested.
---
## Findings
### F1 — Task Cancel
**T1: cancel-registry decisions — pure unit**
- **Priority:** High
- **Test level:** Unit
- **Entry point:** `apps/coder/src/services/dispatcher.ts` — the new `cancelExternalTask(taskId)` export (to be created at F1 implementation time; the registry is the `Map<taskId, AbortController>` that does not exist yet)
- **Gap type:** Untested (code not yet written — test is the contract spec)
- **Test approach:**
- **Behavior:** Extract a pure `CancelRegistry` module (following the `lifecycle-decisions.ts` / `turn-guard.ts` / `mistake-tracker.ts` precedent) that owns register/cancel/has/delete operations over a `Map<taskId, AbortController>`. The dispatcher creates one and calls these; routes call `cancel`. The unit tests pin what the registry commits to.
- **Stubs:** None — the module is pure (no I/O, no DB)
- **Cases:**
1. `register(id, ac)` then `cancel(id)``ac.signal.aborted` is `true`, returns `true`
2. `cancel(id)` on unknown id → returns `false`, no throw
3. `register` + natural completion calls `delete(id)``cancel(id)` now returns `false` (idempotent double-cancel safety)
4. `has(id)` returns `true` while registered, `false` after delete
- **Expected output:** Signal aborted state, return values as above
- **Expected commands:** None (pure)
- **Brittleness assessment:** Durable — asserts the behavioral contract of the new module regardless of how the dispatcher wires it. The four cases directly map to the F1 correctness requirements (signal fired, idempotent, no-op after exit). No mock expectations.
---
**T2: cancel route reaches dispatcher for external task — integration**
- **Priority:** High
- **Test level:** Integration (DB-opt-in via `describe.runIf(!!process.env.DATABASE_URL)`)
- **Entry point:** `apps/coder/src/routes/tasks.ts:110``POST /api/tasks/:id/cancel`
- **Gap type:** Untested
- **Test approach:**
- **Behavior:** After F1, `POST /api/tasks/:id/cancel` for a running external task must call `cancelExternalTask(taskId)` (which fires the AbortController), set `tasks.state='cancelled'`, and leave the previously-streaming assistant message with `status` not `'streaming'` (it must be `'cancelled'` or `'complete'`).
- **Stubs:** Stub the `cancelExternalTask` function (or inject it as a dep) so the test does not require a live agent child. Stub returns `true` (task was in-flight).
- **Input/Action:** Insert a task row in state `'running'` with a known `session_id`. POST to the cancel route.
- **Expected output:** Response `{ cancelled: true }`, DB `tasks.state = 'cancelled'`
- **Expected commands:** `cancelExternalTask(taskId)` called once (spy on injected dep)
- **Brittleness assessment:** The assertion on `cancelExternalTask` being called is legitimate here — it IS the behavioral contract of the route (the whole missing piece in the current implementation). Only assert call happened, not argument order or internal mechanics. The DB state assertion (`tasks.state`) is the durable anchor. Follow the `tool_cost_stats.test.ts` + `describe.runIf` pattern.
---
**T3: natural-exit does not re-cancel — pure unit on registry**
- **Priority:** High
- **Test level:** Unit (part of T1's file, not a separate file)
- **Entry point:** Cancel registry `delete(taskId)` — called in the dispatcher's `finally` block on clean completion
- **Gap type:** Untested
- **Test approach:**
- **Behavior:** When the dispatcher's `runExternalAgent` / `runOpenCodeServerTask` `finally` block deletes the taskId from the registry, a subsequent route-level `cancelExternalTask(taskId)` call is a no-op (returns `false`). This is the "cancel-after-natural-exit no-op" behavior from the F1 scope.
- **Stubs:** None
- **Input/Action:** `registry.register(id, ac)``registry.delete(id)``registry.cancel(id)`
- **Expected output:** returns `false`, `ac.signal.aborted` remains `false`
- **Brittleness assessment:** Trivially durable. This is the critical property that prevents spurious aborts arriving late after a task completes.
---
**T4: warm worktree preserved on cancel (opencode path) — contract assertion on existing behavior**
- **Priority:** Medium
- **Test level:** Unit (pure behavioral assertion — no DB or child needed)
- **Entry point:** `apps/coder/src/services/dispatcher.ts``runOpenCodeServerTask` catch/cancel branch (~line 921-933)
- **Gap type:** Untested
- **Test approach:**
- **Behavior:** The scope-brief (F1 "Warm backends keep their persistent worktree by design") requires that when `ac.abort()` is called during `runOpenCodeServerTask`, the function does NOT call `cleanupWorktree`. The one-shot PTY path (`runExternalAgent`) does call `cleanupWorktree` in its finally. This distinction is the behavioral contract.
- **Stubs:** This is most cleanly verified by reading the implementation post-F1 and confirming via a test that passes a spy `cleanupWorktree` to the cancel path. However — this is better expressed as a **code-review assertion** rather than a test, because the behavior is structural (code path absence), not data-dependent.
- **Recommendation:** Capture this as a doc-comment contract in the implementation ("no cleanupWorktree on cancel — worktree is persistent") rather than a brittle spy assertion. If Sam disagrees, add an integration test that confirms the worktree directory still exists on disk after an abort, using a tempdir fixture (follows `pending_changes_integration.test.ts` pattern).
- **Brittleness assessment:** A spy on `cleanupWorktree` would break if the function is renamed or inlined. The structural guarantee is better held by the code comment + the existing lifecycle tests. Mark as Medium since the scope-brief calls it "by design" — a regression here would be caught by the warm-backend integration path anyway.
---
### F2 — Parser Prune
**T5: existing tool-call-parser.test.ts suite stays green — regression gate**
- **Priority:** High
- **Test level:** Unit
- **Entry point:** `apps/server/src/services/inference/tool-call-parser.ts` — all exports
- **Gap type:** Partially tested (all current behaviors are tested; this is a "must not break" constraint during the prune)
- **Test approach:**
- **Behavior:** Every test in `apps/server/src/services/__tests__/tool-call-parser.test.ts` pins a behavior the prune must not change. The test suite covers: `parseXmlToolCall`, `parseInvokeToolCall`, `partialXmlOpenerStart`, `extractToolCallBlocks` (all cases including placeholder rejection and streaming accumulation), `stripToolMarkup` (closed/unclosed/final), delimiter constants.
- **Action:** No new tests needed here — run `pnpm -C apps/server test` after each prune step and verify 0 failures. This suite is the contract.
- **Gap to close:** The one behavior NOT currently tested in the pure module's test file is the `<invoke>`-fallback being triggered from `stream-phase.ts` when a text-delta chunk arrives. See T6.
- **Brittleness assessment:** These tests are extremely durable — they test pure string-in/struct-out transforms. Zero brittleness risk. They are the reason to keep `tool-call-parser.ts` — they prove the guard is real.
---
**T6: invoke-fallback fires from the stream-phase call site — call-site integration test**
- **Priority:** High
- **Test level:** Unit (test `streamCompletion` or `executeStreamPhase` with a fake `fullStream`)
- **Entry point:** `apps/server/src/services/inference/stream-phase.ts:263-283` — the `text-delta` case calling `extractToolCallBlocks`
- **Gap type:** Untested
- **Test approach:**
- **Behavior:** When `result.fullStream` emits a `text-delta` chunk containing a complete `<invoke>` block, `streamCompletion` must return that block's parsed tool call in `result.toolCalls` (not in `result.content`). This is the live guard for the "qwen3.6 drifts to `<invoke>` markup" scenario from the scope-brief.
- **Stubs:** Stub `streamText` (import spy or dependency injection) to return a fake `fullStream` that yields:
```
{ type: 'text-delta', text: 'Thinking...\n<invoke name="view_file"><parameter name="path">/tmp/x</parameter></invoke>' }
{ type: 'finish', finishReason: 'stop' }
```
Stub `result.usage` resolving to `{ inputTokens: 10, outputTokens: 5 }`. Follow the `ReadableStream test stubs` guidance in `apps/server/CLAUDE.md` (use `pull()` not `start()`). The `upstreamModel` call can be stubbed to return a dummy provider.
- **Input/Action:** Call `streamCompletion(ctx, model, messages, opts, onDelta, onUsage, signal)` with the stubbed `streamText`.
- **Expected output:**
- `result.toolCalls` has length 1 with `name='view_file'`, `args.path='/tmp/x'`
- `result.content` is `'Thinking...\n'` (markup stripped)
- `onDelta` was called with `'Thinking...\n'` and NOT with the `<invoke>` markup
- **Expected commands:** None (no DB, no WS)
- **Brittleness assessment:** Medium brittleness — this test depends on the `streamText` API shape. Inject `streamText` as a dep rather than `vi.spyOn` a module export, so renaming the import doesn't break the test. The behavioral assertion (tool in `toolCalls`, markup NOT in `content`) is durable and directly pinned to the F2 safety concern. This test must stay green during the prune and must FAIL if `extractToolCallBlocks` is removed from the `text-delta` path.
**Important:** If the F2 decision is option (b) — full retirement behind a flag — this test validates the flag-off path (native-parsing-only) for one release. Write the test against the flag-off behavior first, confirm it passes without the parser, then delete the flag and the test in the follow-on batch.
---
**T7: stripToolMarkup still called in tool-phase path — regression guard**
- **Priority:** Medium
- **Test level:** Unit (test `tool-call-parser.ts` export directly — already covered)
- **Entry point:** `apps/server/src/services/inference/tool-phase.ts:122` and `apps/server/src/services/inference/error-handler.ts:25,106`
- **Gap type:** Partially tested
- **Test approach:**
- **Behavior:** `stripToolMarkup` is the second consumer (tool-phase, error-handler). The scope-brief identifies it as load-bearing for the case where the model puts raw `<tool_call>` or `<invoke>` markup in its "final" text content. The existing `stripToolMarkup` tests in `tool-call-parser.test.ts` already cover this behavior at the pure level.
- **Action:** No additional test needed. The existing `stripToolMarkup` tests (lines 297-342 of `tool-call-parser.test.ts`) are the contract. Do not write call-site integration tests for `tool-phase.ts` — they would require mocking the entire inference context and test the wrong level.
- **Note for prune:** Before deleting `stripToolMarkup`, confirm none of `tool-phase.ts:122`, `error-handler.ts:25`, `error-handler.ts:106` call it. If they do, the deletion is not safe regardless of the test.
- **Brittleness assessment:** N/A — this is a "stay green" check, not a new test.
---
### F6 — Stall-Timeout + Retry
**T8: stall-timeout classifier — pure unit**
- **Priority:** High
- **Test level:** Unit
- **Entry point:** New `apps/server/src/services/inference/stream-resilience.ts` (to be created, following `mistake-tracker.ts` / `lifecycle-decisions.ts` pattern)
- **Gap type:** Untested (code not yet written — test is the contract spec)
- **Test approach:**
- **Behavior:** Extract a pure `classifyStreamError(err: unknown): 'stall' | 'transient-5xx' | 'non-retryable'` function (or equivalent enum). The retry loop in `stream-phase.ts` delegates the retry decision to this function. Unit tests pin the classification without touching a real stream or timer.
- **Stubs:** None
- **Cases:**
1. Error whose message contains "stall" or whose name is `'StallTimeoutError'` → `'stall'`
2. Error with HTTP status 500, 502, 503, 504 → `'transient-5xx'`
3. AbortError (name `'AbortError'`) → `'non-retryable'`
4. Error with HTTP status 400 → `'non-retryable'`
5. Error with HTTP status 429 → `'non-retryable'` (YAGNI-gated: scope-brief says "strip the cloud-billing arms"; 429 is a cloud concern, not a single local llama-swap instance)
- **Expected output:** Classification string for each case
- **Brittleness assessment:** Durable — pure function, no I/O. The classification is the decision surface; the retry loop can be refactored freely as long as it consults this function. This follows the exact `mistake-tracker.ts` + `lifecycle-decisions.ts` pattern.
---
**T9: stall-timeout aborts the stream — unit with fake timer and fake stream**
- **Priority:** Medium
- **Test level:** Unit
- **Entry point:** `apps/server/src/services/inference/stream-phase.ts:261` — the `for await (const part of result.fullStream)` loop (post-F6, wrapped with a per-chunk deadline timer)
- **Gap type:** Untested (code not yet written)
- **Test approach:**
- **Behavior:** When `fullStream` emits no chunk for longer than the stall deadline, the wrapping code must throw a `StallTimeoutError` (or equivalent) that propagates out of `streamCompletion`. This is the new abort path distinct from the user-initiated `signal.aborted` path.
- **Stubs:**
- Stub `streamText` to return a fake `fullStream` that yields one chunk then hangs (never yields the next). Use a `ReadableStream` with a `pull()` controller that enqueues one chunk and then never calls the resolve/reject until after the test's fake timer fires.
- Use `vi.useFakeTimers()` (vitest fake timer) to advance time past the stall deadline without waiting real milliseconds.
- **Input/Action:** Call `streamCompletion` with fake timers active and a stream that stalls after the first chunk. Advance fake time by stall-deadline + 1ms.
- **Expected output:** `streamCompletion` rejects with a `StallTimeoutError`
- **Expected commands:** None
- **Brittleness assessment:** Medium — depends on fake-timer interacting correctly with the async stream loop. Vitest fake timers work with `setTimeout`-based deadlines but NOT with `Promise`-based deadlines. The implementation must use `setTimeout` (not `Promise.race` with a raw `sleep`) for the stall timer, or the test will be structurally incompatible. Note this constraint to the implementer. If the implementation uses `Promise.race`, adjust the test to use a `setTimeout`-resolving promise that can be controlled via `vi.runAllTimersAsync()`.
---
**T10: abort-after-stream regression — unit**
- **Priority:** Medium
- **Test level:** Unit
- **Entry point:** `apps/server/src/services/inference/stream-phase.ts:337` — `if (signal?.aborted) throw AbortError`
- **Gap type:** Untested (existing behavior, not new — but this is a documented regression risk per the inline comment "Smoke D caught this in v1.13.1-A — don't refactor it away")
- **Test approach:**
- **Behavior:** When the `AbortSignal` fires during a stream that has already consumed all chunks and exited the `for await` loop normally, `streamCompletion` must throw an `AbortError` rather than returning `StreamResult`. Without this throw the message would finalize as `complete` instead of `cancelled`.
- **Stubs:** Stub `streamText` to return a fake `fullStream` that yields a text chunk then `finish`, then resolves normally. The `AbortSignal` is pre-aborted before the call (or aborted concurrently using a resolved-promise microtask so it fires before the post-loop check).
- **Input/Action:** Create an `AbortController`, abort it, pass `ac.signal` to `streamCompletion`.
- **Expected output:** `streamCompletion` rejects with an error whose `name === 'AbortError'`
- **Expected commands:** None
- **Brittleness assessment:** Very durable — asserts a precise, documented behavioral contract. This test protects the most subtle correctness invariant in the inference layer. It should be written whether or not F6 proceeds, as a regression pin on existing code.
---
## F5, F4, F7 — Brief Notes
**F5 (opencode compaction surfacing):** The behavior being added is consuming new SSE event types in `opencode-server.ts` and emitting new WS frames. The primary test vector is the cross-app parity check: `apps/coder/src/services/__tests__/provider-types-parity.test.ts` is the precedent. A new `ws-frames-parity.test.ts` (or extension of `ws-frames.test.ts`) that asserts the new compaction frame type exists in BOTH `apps/coder` and `apps/web` `WsFrame` schemas is the right test. This is a Medium-priority addition once the frame types are defined. No new unit test is needed for the SSE arm parsing itself — it is structurally identical to the existing `step.ended` arm.
**F4 (notify-hook config injection):** The testable surface is the idempotent `settings.json` merge logic. Extract that as a pure `mergeNotifyHook(existing: object, hookConfig: object): object` function and unit-test it (same pattern as `mistake-tracker.ts`). The inbound POST route is testable via Fastify's `inject()` method if the handler is kept thin. However, F4 has no behavioral spec beyond the review description and is a WANT item — defer test planning until a feature spec exists. Testability is high when the merge logic is extracted.
**F7 (view_session_history MCP tool):** The new MCP tool is read-only over `messages_with_parts`. The testable behavior is the DB query returning the correct shape. Pattern: `describe.runIf(!!process.env.DATABASE_URL)` with a real DB, following `tool_cost_stats.test.ts`. Unit-level: the `mcp-client.test.ts` pattern (mock the DB, verify the query shape). Medium priority, straightforward once the tool implementation exists.
---
## Deferred / Skipped Tests
**S1: dispatcher.ts full integration test (any path)**
- **Entry point:** `apps/coder/src/services/dispatcher.ts:46` — `createDispatcher`
- **Reason:** The dispatcher is DB-heavy, timer-driven, and spawns child processes. An integration test for the full dispatch loop would require a real DB, a real (or fake) agent binary, and process lifecycle management. The value is low relative to the brittleness: the cancel-registry unit tests (T1, T3) and the route integration test (T2) together cover the observable behavioral gap at much lower maintenance cost. The full dispatch loop is better validated by end-to-end smoke testing.
**S2: one-shot PTY worktree cleanup on cancel**
- **Entry point:** `apps/coder/src/services/dispatcher.ts:600-606` — `cleanupWorktree` in the PTY error path
- **Reason:** Testing that `cleanupWorktree` is called requires either a spy (over-specified double) or a real filesystem side effect (requires a real git repo tempdir). The behavioral contract ("PTY worktree cleaned up on cancel") is better held by the existing `write_guard.test.ts` + `pending_changes_integration.test.ts` precedents, which validate the write-side contract. A spy assertion on `cleanupWorktree` call count is an over-specified double that breaks on any rename or refactor of the cleanup path.
**S3: retry backoff timing**
- **Entry point:** `apps/server/src/services/inference/stream-phase.ts` — post-F6 retry loop
- **Reason:** Testing that backoff delays are e.g. 1s, 2s, 4s requires fake timers and multi-attempt orchestration. The scope-brief says "keep it minimal" and YAGNI-gates the retry scope hard for a single local instance. The classifier (T8) pins the correctness decision; the timer values are implementation details. A test that asserts specific delay durations would break on any tuning and has no behavioral anchor in the F6 spec.
---
## Coverage Estimate
After all recommended tests are written:
- **F1 behavioral contract:** Fully covered at unit level (registry decisions) + opt-in DB integration (route → dispatcher). The warm-vs-PTY worktree divergence is held by a code-comment contract rather than a test (see T4 recommendation).
- **F2 behavioral contract:** Fully covered — existing `tool-call-parser.test.ts` suite is the regression gate; T6 closes the one gap (call-site fallback from `stream-phase.ts`). Prune can proceed safely with T6 green.
- **F6 behavioral contract:** Covered for the classifier (T8, unit) and the stall-throw path (T9, unit with fake timers). The existing abort regression (T10) closes a pre-existing gap independent of F6.
- **F5/F4/F7:** Testability is high but deferred to implementation time; no test can be written before the frame types and route handlers exist.
Behaviors intentionally left without tests: full dispatcher orchestration loop (S1), PTY cleanup spy (S2), backoff timing (S3).