Move 13 shipped openspec change docs under openspec/changes/archived/. Add docs/features/git-diff-panel, docs/plans/post-review-backlog, and docs/research/cross-app-contract-ssot.md (the research behind the @boocode/contracts SSOT work). Update BOOCHAT.md, BOOCODER.md, and boocode_roadmap.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
23 KiB
Test Plan: post-review-backlog (F1, F2, F6, F5/F4/F7 brief)
Scope
apps/coder/src/services/dispatcher.ts,
apps/coder/src/routes/tasks.ts:110-148,
apps/server/src/services/inference/tool-call-parser.ts,
apps/server/src/services/inference/stream-phase.ts:261-341,
and their existing test companions.
Branch: main (HEAD e5ce01a, v2.7.11).
Summary
Three features — F1 (task cancel), F2 (parser prune), F6 (stall-timeout + retry) — have behavioral contracts not yet protected by any test. F1 is the highest risk because the whole-point of the feature is a side-effect (child termination) that is trivially skippable in the implementation. F2 is the whole point of a prune: keeping the existing tool-call-parser.test.ts suite green IS the contract, but one gap (the <invoke>-fallback path exercised from the call site in stream-phase.ts) needs a new test before any deletion proceeds. F6 benefits most from a pure-helper extraction (following the lifecycle-decisions.ts / mistake-tracker.ts precedent) so the classifier logic is unit-testable without a real stream.
| Priority | Count |
|---|---|
| High | 4 |
| Medium | 3 |
| Low | 2 |
| Skipped | 3 |
Full analysis written to: /home/samkintop/opt/boocode/docs/plans/post-review-backlog/test-plan.md
Coverage Assessment
Well-tested: tool-call-parser.ts — every export has thorough unit coverage (tool-call-parser.test.ts); the inline fallback behavior of extractToolCallBlocks is the best-tested surface in the inference layer. Coder backend lifecycle decisions (lifecycle-decisions.test.ts, turn-guard.test.ts, normalize-agent-status.test.ts) follow the pure-helper extraction pattern and have complete unit coverage.
Significant gaps:
dispatcher.ts— thecancelExternalTaskfunction does not exist yet. Once written it will have zero tests. The cancel-registry decision logic (the pure "does this taskId map to an AbortController?" portion) is the highest-value extraction target.routes/tasks.ts:110-148(cancel route) — currently the route never reaches the dispatcher for external tasks; after F1 the new code path has no test.stream-phase.tsstall / retry behavior — entirely absent; thefor awaitloop has no timeout and no retry. Both will be new code with no tests.stream-phase.tsabort-after-stream behavior — the critical post-loopsignal?.abortedcheck at line 337 is already documented as a hard-won correctness property but has no regression test.
Overall: the inference layer has solid unit tests for data-transformation helpers but none for control-flow behavior (abort, stall, retry). The coder dispatcher has no tests at all. This is appropriate given the DB-heavy nature of the dispatcher, but the pure decision layers (cancel-registry, retry classifier) can and should be extracted and tested.
Findings
F1 — Task Cancel
T1: cancel-registry decisions — pure unit
- Priority: High
- Test level: Unit
- Entry point:
apps/coder/src/services/dispatcher.ts— the newcancelExternalTask(taskId)export (to be created at F1 implementation time; the registry is theMap<taskId, AbortController>that does not exist yet) - Gap type: Untested (code not yet written — test is the contract spec)
- Test approach:
- Behavior: Extract a pure
CancelRegistrymodule (following thelifecycle-decisions.ts/turn-guard.ts/mistake-tracker.tsprecedent) that owns register/cancel/has/delete operations over aMap<taskId, AbortController>. The dispatcher creates one and calls these; routes callcancel. The unit tests pin what the registry commits to. - Stubs: None — the module is pure (no I/O, no DB)
- Cases:
register(id, ac)thencancel(id)→ac.signal.abortedistrue, returnstruecancel(id)on unknown id → returnsfalse, no throwregister+ natural completion callsdelete(id)→cancel(id)now returnsfalse(idempotent double-cancel safety)has(id)returnstruewhile registered,falseafter delete
- Expected output: Signal aborted state, return values as above
- Expected commands: None (pure)
- Behavior: Extract a pure
- Brittleness assessment: Durable — asserts the behavioral contract of the new module regardless of how the dispatcher wires it. The four cases directly map to the F1 correctness requirements (signal fired, idempotent, no-op after exit). No mock expectations.
T2: cancel route reaches dispatcher for external task — integration
- Priority: High
- Test level: Integration (DB-opt-in via
describe.runIf(!!process.env.DATABASE_URL)) - Entry point:
apps/coder/src/routes/tasks.ts:110—POST /api/tasks/:id/cancel - Gap type: Untested
- Test approach:
- Behavior: After F1,
POST /api/tasks/:id/cancelfor a running external task must callcancelExternalTask(taskId)(which fires the AbortController), settasks.state='cancelled', and leave the previously-streaming assistant message withstatusnot'streaming'(it must be'cancelled'or'complete'). - Stubs: Stub the
cancelExternalTaskfunction (or inject it as a dep) so the test does not require a live agent child. Stub returnstrue(task was in-flight). - Input/Action: Insert a task row in state
'running'with a knownsession_id. POST to the cancel route. - Expected output: Response
{ cancelled: true }, DBtasks.state = 'cancelled' - Expected commands:
cancelExternalTask(taskId)called once (spy on injected dep)
- Behavior: After F1,
- Brittleness assessment: The assertion on
cancelExternalTaskbeing called is legitimate here — it IS the behavioral contract of the route (the whole missing piece in the current implementation). Only assert call happened, not argument order or internal mechanics. The DB state assertion (tasks.state) is the durable anchor. Follow thetool_cost_stats.test.ts+describe.runIfpattern.
T3: natural-exit does not re-cancel — pure unit on registry
- Priority: High
- Test level: Unit (part of T1's file, not a separate file)
- Entry point: Cancel registry
delete(taskId)— called in the dispatcher'sfinallyblock on clean completion - Gap type: Untested
- Test approach:
- Behavior: When the dispatcher's
runExternalAgent/runOpenCodeServerTaskfinallyblock deletes the taskId from the registry, a subsequent route-levelcancelExternalTask(taskId)call is a no-op (returnsfalse). This is the "cancel-after-natural-exit no-op" behavior from the F1 scope. - Stubs: None
- Input/Action:
registry.register(id, ac)→registry.delete(id)→registry.cancel(id) - Expected output: returns
false,ac.signal.abortedremainsfalse
- Behavior: When the dispatcher's
- Brittleness assessment: Trivially durable. This is the critical property that prevents spurious aborts arriving late after a task completes.
T4: warm worktree preserved on cancel (opencode path) — contract assertion on existing behavior
- Priority: Medium
- Test level: Unit (pure behavioral assertion — no DB or child needed)
- Entry point:
apps/coder/src/services/dispatcher.ts—runOpenCodeServerTaskcatch/cancel branch (~line 921-933) - Gap type: Untested
- Test approach:
- Behavior: The scope-brief (F1 "Warm backends keep their persistent worktree by design") requires that when
ac.abort()is called duringrunOpenCodeServerTask, the function does NOT callcleanupWorktree. The one-shot PTY path (runExternalAgent) does callcleanupWorktreein its finally. This distinction is the behavioral contract. - Stubs: This is most cleanly verified by reading the implementation post-F1 and confirming via a test that passes a spy
cleanupWorktreeto the cancel path. However — this is better expressed as a code-review assertion rather than a test, because the behavior is structural (code path absence), not data-dependent. - Recommendation: Capture this as a doc-comment contract in the implementation ("no cleanupWorktree on cancel — worktree is persistent") rather than a brittle spy assertion. If Sam disagrees, add an integration test that confirms the worktree directory still exists on disk after an abort, using a tempdir fixture (follows
pending_changes_integration.test.tspattern).
- Behavior: The scope-brief (F1 "Warm backends keep their persistent worktree by design") requires that when
- Brittleness assessment: A spy on
cleanupWorktreewould break if the function is renamed or inlined. The structural guarantee is better held by the code comment + the existing lifecycle tests. Mark as Medium since the scope-brief calls it "by design" — a regression here would be caught by the warm-backend integration path anyway.
F2 — Parser Prune
T5: existing tool-call-parser.test.ts suite stays green — regression gate
- Priority: High
- Test level: Unit
- Entry point:
apps/server/src/services/inference/tool-call-parser.ts— all exports - Gap type: Partially tested (all current behaviors are tested; this is a "must not break" constraint during the prune)
- Test approach:
- Behavior: Every test in
apps/server/src/services/__tests__/tool-call-parser.test.tspins a behavior the prune must not change. The test suite covers:parseXmlToolCall,parseInvokeToolCall,partialXmlOpenerStart,extractToolCallBlocks(all cases including placeholder rejection and streaming accumulation),stripToolMarkup(closed/unclosed/final), delimiter constants. - Action: No new tests needed here — run
pnpm -C apps/server testafter each prune step and verify 0 failures. This suite is the contract. - Gap to close: The one behavior NOT currently tested in the pure module's test file is the
<invoke>-fallback being triggered fromstream-phase.tswhen a text-delta chunk arrives. See T6.
- Behavior: Every test in
- Brittleness assessment: These tests are extremely durable — they test pure string-in/struct-out transforms. Zero brittleness risk. They are the reason to keep
tool-call-parser.ts— they prove the guard is real.
T6: invoke-fallback fires from the stream-phase call site — call-site integration test
- Priority: High
- Test level: Unit (test
streamCompletionorexecuteStreamPhasewith a fakefullStream) - Entry point:
apps/server/src/services/inference/stream-phase.ts:263-283— thetext-deltacase callingextractToolCallBlocks - Gap type: Untested
- Test approach:
- Behavior: When
result.fullStreamemits atext-deltachunk containing a complete<invoke>block,streamCompletionmust return that block's parsed tool call inresult.toolCalls(not inresult.content). This is the live guard for the "qwen3.6 drifts to<invoke>markup" scenario from the scope-brief. - Stubs: Stub
streamText(import spy or dependency injection) to return a fakefullStreamthat yields:Stub{ type: 'text-delta', text: 'Thinking...\n<invoke name="view_file"><parameter name="path">/tmp/x</parameter></invoke>' } { type: 'finish', finishReason: 'stop' }result.usageresolving to{ inputTokens: 10, outputTokens: 5 }. Follow theReadableStream test stubsguidance inapps/server/CLAUDE.md(usepull()notstart()). TheupstreamModelcall can be stubbed to return a dummy provider. - Input/Action: Call
streamCompletion(ctx, model, messages, opts, onDelta, onUsage, signal)with the stubbedstreamText. - Expected output:
result.toolCallshas length 1 withname='view_file',args.path='/tmp/x'result.contentis'Thinking...\n'(markup stripped)onDeltawas called with'Thinking...\n'and NOT with the<invoke>markup
- Expected commands: None (no DB, no WS)
- Behavior: When
- Brittleness assessment: Medium brittleness — this test depends on the
streamTextAPI shape. InjectstreamTextas a dep rather thanvi.spyOna module export, so renaming the import doesn't break the test. The behavioral assertion (tool intoolCalls, markup NOT incontent) is durable and directly pinned to the F2 safety concern. This test must stay green during the prune and must FAIL ifextractToolCallBlocksis removed from thetext-deltapath.
Important: If the F2 decision is option (b) — full retirement behind a flag — this test validates the flag-off path (native-parsing-only) for one release. Write the test against the flag-off behavior first, confirm it passes without the parser, then delete the flag and the test in the follow-on batch.
T7: stripToolMarkup still called in tool-phase path — regression guard
- Priority: Medium
- Test level: Unit (test
tool-call-parser.tsexport directly — already covered) - Entry point:
apps/server/src/services/inference/tool-phase.ts:122andapps/server/src/services/inference/error-handler.ts:25,106 - Gap type: Partially tested
- Test approach:
- Behavior:
stripToolMarkupis the second consumer (tool-phase, error-handler). The scope-brief identifies it as load-bearing for the case where the model puts raw<tool_call>or<invoke>markup in its "final" text content. The existingstripToolMarkuptests intool-call-parser.test.tsalready cover this behavior at the pure level. - Action: No additional test needed. The existing
stripToolMarkuptests (lines 297-342 oftool-call-parser.test.ts) are the contract. Do not write call-site integration tests fortool-phase.ts— they would require mocking the entire inference context and test the wrong level. - Note for prune: Before deleting
stripToolMarkup, confirm none oftool-phase.ts:122,error-handler.ts:25,error-handler.ts:106call it. If they do, the deletion is not safe regardless of the test.
- Behavior:
- Brittleness assessment: N/A — this is a "stay green" check, not a new test.
F6 — Stall-Timeout + Retry
T8: stall-timeout classifier — pure unit
- Priority: High
- Test level: Unit
- Entry point: New
apps/server/src/services/inference/stream-resilience.ts(to be created, followingmistake-tracker.ts/lifecycle-decisions.tspattern) - Gap type: Untested (code not yet written — test is the contract spec)
- Test approach:
- Behavior: Extract a pure
classifyStreamError(err: unknown): 'stall' | 'transient-5xx' | 'non-retryable'function (or equivalent enum). The retry loop instream-phase.tsdelegates the retry decision to this function. Unit tests pin the classification without touching a real stream or timer. - Stubs: None
- Cases:
- Error whose message contains "stall" or whose name is
'StallTimeoutError'→'stall' - Error with HTTP status 500, 502, 503, 504 →
'transient-5xx' - AbortError (name
'AbortError') →'non-retryable' - Error with HTTP status 400 →
'non-retryable' - Error with HTTP status 429 →
'non-retryable'(YAGNI-gated: scope-brief says "strip the cloud-billing arms"; 429 is a cloud concern, not a single local llama-swap instance)
- Error whose message contains "stall" or whose name is
- Expected output: Classification string for each case
- Behavior: Extract a pure
- Brittleness assessment: Durable — pure function, no I/O. The classification is the decision surface; the retry loop can be refactored freely as long as it consults this function. This follows the exact
mistake-tracker.ts+lifecycle-decisions.tspattern.
T9: stall-timeout aborts the stream — unit with fake timer and fake stream
- Priority: Medium
- Test level: Unit
- Entry point:
apps/server/src/services/inference/stream-phase.ts:261— thefor await (const part of result.fullStream)loop (post-F6, wrapped with a per-chunk deadline timer) - Gap type: Untested (code not yet written)
- Test approach:
- Behavior: When
fullStreamemits no chunk for longer than the stall deadline, the wrapping code must throw aStallTimeoutError(or equivalent) that propagates out ofstreamCompletion. This is the new abort path distinct from the user-initiatedsignal.abortedpath. - Stubs:
- Stub
streamTextto return a fakefullStreamthat yields one chunk then hangs (never yields the next). Use aReadableStreamwith apull()controller that enqueues one chunk and then never calls the resolve/reject until after the test's fake timer fires. - Use
vi.useFakeTimers()(vitest fake timer) to advance time past the stall deadline without waiting real milliseconds.
- Stub
- Input/Action: Call
streamCompletionwith fake timers active and a stream that stalls after the first chunk. Advance fake time by stall-deadline + 1ms. - Expected output:
streamCompletionrejects with aStallTimeoutError - Expected commands: None
- Behavior: When
- Brittleness assessment: Medium — depends on fake-timer interacting correctly with the async stream loop. Vitest fake timers work with
setTimeout-based deadlines but NOT withPromise-based deadlines. The implementation must usesetTimeout(notPromise.racewith a rawsleep) for the stall timer, or the test will be structurally incompatible. Note this constraint to the implementer. If the implementation usesPromise.race, adjust the test to use asetTimeout-resolving promise that can be controlled viavi.runAllTimersAsync().
T10: abort-after-stream regression — unit
- Priority: Medium
- Test level: Unit
- Entry point:
apps/server/src/services/inference/stream-phase.ts:337—if (signal?.aborted) throw AbortError - Gap type: Untested (existing behavior, not new — but this is a documented regression risk per the inline comment "Smoke D caught this in v1.13.1-A — don't refactor it away")
- Test approach:
- Behavior: When the
AbortSignalfires during a stream that has already consumed all chunks and exited thefor awaitloop normally,streamCompletionmust throw anAbortErrorrather than returningStreamResult. Without this throw the message would finalize ascompleteinstead ofcancelled. - Stubs: Stub
streamTextto return a fakefullStreamthat yields a text chunk thenfinish, then resolves normally. TheAbortSignalis pre-aborted before the call (or aborted concurrently using a resolved-promise microtask so it fires before the post-loop check). - Input/Action: Create an
AbortController, abort it, passac.signaltostreamCompletion. - Expected output:
streamCompletionrejects with an error whosename === 'AbortError' - Expected commands: None
- Behavior: When the
- Brittleness assessment: Very durable — asserts a precise, documented behavioral contract. This test protects the most subtle correctness invariant in the inference layer. It should be written whether or not F6 proceeds, as a regression pin on existing code.
F5, F4, F7 — Brief Notes
F5 (opencode compaction surfacing): The behavior being added is consuming new SSE event types in opencode-server.ts and emitting new WS frames. The primary test vector is the cross-app parity check: apps/coder/src/services/__tests__/provider-types-parity.test.ts is the precedent. A new ws-frames-parity.test.ts (or extension of ws-frames.test.ts) that asserts the new compaction frame type exists in BOTH apps/coder and apps/web WsFrame schemas is the right test. This is a Medium-priority addition once the frame types are defined. No new unit test is needed for the SSE arm parsing itself — it is structurally identical to the existing step.ended arm.
F4 (notify-hook config injection): The testable surface is the idempotent settings.json merge logic. Extract that as a pure mergeNotifyHook(existing: object, hookConfig: object): object function and unit-test it (same pattern as mistake-tracker.ts). The inbound POST route is testable via Fastify's inject() method if the handler is kept thin. However, F4 has no behavioral spec beyond the review description and is a WANT item — defer test planning until a feature spec exists. Testability is high when the merge logic is extracted.
F7 (view_session_history MCP tool): The new MCP tool is read-only over messages_with_parts. The testable behavior is the DB query returning the correct shape. Pattern: describe.runIf(!!process.env.DATABASE_URL) with a real DB, following tool_cost_stats.test.ts. Unit-level: the mcp-client.test.ts pattern (mock the DB, verify the query shape). Medium priority, straightforward once the tool implementation exists.
Deferred / Skipped Tests
S1: dispatcher.ts full integration test (any path)
- Entry point:
apps/coder/src/services/dispatcher.ts:46—createDispatcher - Reason: The dispatcher is DB-heavy, timer-driven, and spawns child processes. An integration test for the full dispatch loop would require a real DB, a real (or fake) agent binary, and process lifecycle management. The value is low relative to the brittleness: the cancel-registry unit tests (T1, T3) and the route integration test (T2) together cover the observable behavioral gap at much lower maintenance cost. The full dispatch loop is better validated by end-to-end smoke testing.
S2: one-shot PTY worktree cleanup on cancel
- Entry point:
apps/coder/src/services/dispatcher.ts:600-606—cleanupWorktreein the PTY error path - Reason: Testing that
cleanupWorktreeis called requires either a spy (over-specified double) or a real filesystem side effect (requires a real git repo tempdir). The behavioral contract ("PTY worktree cleaned up on cancel") is better held by the existingwrite_guard.test.ts+pending_changes_integration.test.tsprecedents, which validate the write-side contract. A spy assertion oncleanupWorktreecall count is an over-specified double that breaks on any rename or refactor of the cleanup path.
S3: retry backoff timing
- Entry point:
apps/server/src/services/inference/stream-phase.ts— post-F6 retry loop - Reason: Testing that backoff delays are e.g. 1s, 2s, 4s requires fake timers and multi-attempt orchestration. The scope-brief says "keep it minimal" and YAGNI-gates the retry scope hard for a single local instance. The classifier (T8) pins the correctness decision; the timer values are implementation details. A test that asserts specific delay durations would break on any tuning and has no behavioral anchor in the F6 spec.
Coverage Estimate
After all recommended tests are written:
- F1 behavioral contract: Fully covered at unit level (registry decisions) + opt-in DB integration (route → dispatcher). The warm-vs-PTY worktree divergence is held by a code-comment contract rather than a test (see T4 recommendation).
- F2 behavioral contract: Fully covered — existing
tool-call-parser.test.tssuite is the regression gate; T6 closes the one gap (call-site fallback fromstream-phase.ts). Prune can proceed safely with T6 green. - F6 behavioral contract: Covered for the classifier (T8, unit) and the stall-throw path (T9, unit with fake timers). The existing abort regression (T10) closes a pre-existing gap independent of F6.
- F5/F4/F7: Testability is high but deferred to implementation time; no test can be written before the frame types and route handlers exist.
Behaviors intentionally left without tests: full dispatcher orchestration loop (S1), PTY cleanup spy (S2), backoff timing (S3).