Files
boocode/docs/plans/post-review-backlog/test-plan.md
indifferentketchup 2a05d2f9fe docs: archive shipped openspec batches; add feature/plan/research notes
Move 13 shipped openspec change docs under openspec/changes/archived/.
Add docs/features/git-diff-panel, docs/plans/post-review-backlog, and
docs/research/cross-app-contract-ssot.md (the research behind the
@boocode/contracts SSOT work). Update BOOCHAT.md, BOOCODER.md, and
boocode_roadmap.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 21:20:33 +00:00

23 KiB

Test Plan: post-review-backlog (F1, F2, F6, F5/F4/F7 brief)

Scope

apps/coder/src/services/dispatcher.ts, apps/coder/src/routes/tasks.ts:110-148, apps/server/src/services/inference/tool-call-parser.ts, apps/server/src/services/inference/stream-phase.ts:261-341, and their existing test companions.

Branch: main (HEAD e5ce01a, v2.7.11).


Summary

Three features — F1 (task cancel), F2 (parser prune), F6 (stall-timeout + retry) — have behavioral contracts not yet protected by any test. F1 is the highest risk because the whole-point of the feature is a side-effect (child termination) that is trivially skippable in the implementation. F2 is the whole point of a prune: keeping the existing tool-call-parser.test.ts suite green IS the contract, but one gap (the <invoke>-fallback path exercised from the call site in stream-phase.ts) needs a new test before any deletion proceeds. F6 benefits most from a pure-helper extraction (following the lifecycle-decisions.ts / mistake-tracker.ts precedent) so the classifier logic is unit-testable without a real stream.

Priority Count
High 4
Medium 3
Low 2
Skipped 3

Full analysis written to: /home/samkintop/opt/boocode/docs/plans/post-review-backlog/test-plan.md


Coverage Assessment

Well-tested: tool-call-parser.ts — every export has thorough unit coverage (tool-call-parser.test.ts); the inline fallback behavior of extractToolCallBlocks is the best-tested surface in the inference layer. Coder backend lifecycle decisions (lifecycle-decisions.test.ts, turn-guard.test.ts, normalize-agent-status.test.ts) follow the pure-helper extraction pattern and have complete unit coverage.

Significant gaps:

  • dispatcher.ts — the cancelExternalTask function does not exist yet. Once written it will have zero tests. The cancel-registry decision logic (the pure "does this taskId map to an AbortController?" portion) is the highest-value extraction target.
  • routes/tasks.ts:110-148 (cancel route) — currently the route never reaches the dispatcher for external tasks; after F1 the new code path has no test.
  • stream-phase.ts stall / retry behavior — entirely absent; the for await loop has no timeout and no retry. Both will be new code with no tests.
  • stream-phase.ts abort-after-stream behavior — the critical post-loop signal?.aborted check at line 337 is already documented as a hard-won correctness property but has no regression test.

Overall: the inference layer has solid unit tests for data-transformation helpers but none for control-flow behavior (abort, stall, retry). The coder dispatcher has no tests at all. This is appropriate given the DB-heavy nature of the dispatcher, but the pure decision layers (cancel-registry, retry classifier) can and should be extracted and tested.


Findings

F1 — Task Cancel

T1: cancel-registry decisions — pure unit

  • Priority: High
  • Test level: Unit
  • Entry point: apps/coder/src/services/dispatcher.ts — the new cancelExternalTask(taskId) export (to be created at F1 implementation time; the registry is the Map<taskId, AbortController> that does not exist yet)
  • Gap type: Untested (code not yet written — test is the contract spec)
  • Test approach:
    • Behavior: Extract a pure CancelRegistry module (following the lifecycle-decisions.ts / turn-guard.ts / mistake-tracker.ts precedent) that owns register/cancel/has/delete operations over a Map<taskId, AbortController>. The dispatcher creates one and calls these; routes call cancel. The unit tests pin what the registry commits to.
    • Stubs: None — the module is pure (no I/O, no DB)
    • Cases:
      1. register(id, ac) then cancel(id)ac.signal.aborted is true, returns true
      2. cancel(id) on unknown id → returns false, no throw
      3. register + natural completion calls delete(id)cancel(id) now returns false (idempotent double-cancel safety)
      4. has(id) returns true while registered, false after delete
    • Expected output: Signal aborted state, return values as above
    • Expected commands: None (pure)
  • Brittleness assessment: Durable — asserts the behavioral contract of the new module regardless of how the dispatcher wires it. The four cases directly map to the F1 correctness requirements (signal fired, idempotent, no-op after exit). No mock expectations.

T2: cancel route reaches dispatcher for external task — integration

  • Priority: High
  • Test level: Integration (DB-opt-in via describe.runIf(!!process.env.DATABASE_URL))
  • Entry point: apps/coder/src/routes/tasks.ts:110POST /api/tasks/:id/cancel
  • Gap type: Untested
  • Test approach:
    • Behavior: After F1, POST /api/tasks/:id/cancel for a running external task must call cancelExternalTask(taskId) (which fires the AbortController), set tasks.state='cancelled', and leave the previously-streaming assistant message with status not 'streaming' (it must be 'cancelled' or 'complete').
    • Stubs: Stub the cancelExternalTask function (or inject it as a dep) so the test does not require a live agent child. Stub returns true (task was in-flight).
    • Input/Action: Insert a task row in state 'running' with a known session_id. POST to the cancel route.
    • Expected output: Response { cancelled: true }, DB tasks.state = 'cancelled'
    • Expected commands: cancelExternalTask(taskId) called once (spy on injected dep)
  • Brittleness assessment: The assertion on cancelExternalTask being called is legitimate here — it IS the behavioral contract of the route (the whole missing piece in the current implementation). Only assert call happened, not argument order or internal mechanics. The DB state assertion (tasks.state) is the durable anchor. Follow the tool_cost_stats.test.ts + describe.runIf pattern.

T3: natural-exit does not re-cancel — pure unit on registry

  • Priority: High
  • Test level: Unit (part of T1's file, not a separate file)
  • Entry point: Cancel registry delete(taskId) — called in the dispatcher's finally block on clean completion
  • Gap type: Untested
  • Test approach:
    • Behavior: When the dispatcher's runExternalAgent / runOpenCodeServerTask finally block deletes the taskId from the registry, a subsequent route-level cancelExternalTask(taskId) call is a no-op (returns false). This is the "cancel-after-natural-exit no-op" behavior from the F1 scope.
    • Stubs: None
    • Input/Action: registry.register(id, ac)registry.delete(id)registry.cancel(id)
    • Expected output: returns false, ac.signal.aborted remains false
  • Brittleness assessment: Trivially durable. This is the critical property that prevents spurious aborts arriving late after a task completes.

T4: warm worktree preserved on cancel (opencode path) — contract assertion on existing behavior

  • Priority: Medium
  • Test level: Unit (pure behavioral assertion — no DB or child needed)
  • Entry point: apps/coder/src/services/dispatcher.tsrunOpenCodeServerTask catch/cancel branch (~line 921-933)
  • Gap type: Untested
  • Test approach:
    • Behavior: The scope-brief (F1 "Warm backends keep their persistent worktree by design") requires that when ac.abort() is called during runOpenCodeServerTask, the function does NOT call cleanupWorktree. The one-shot PTY path (runExternalAgent) does call cleanupWorktree in its finally. This distinction is the behavioral contract.
    • Stubs: This is most cleanly verified by reading the implementation post-F1 and confirming via a test that passes a spy cleanupWorktree to the cancel path. However — this is better expressed as a code-review assertion rather than a test, because the behavior is structural (code path absence), not data-dependent.
    • Recommendation: Capture this as a doc-comment contract in the implementation ("no cleanupWorktree on cancel — worktree is persistent") rather than a brittle spy assertion. If Sam disagrees, add an integration test that confirms the worktree directory still exists on disk after an abort, using a tempdir fixture (follows pending_changes_integration.test.ts pattern).
  • Brittleness assessment: A spy on cleanupWorktree would break if the function is renamed or inlined. The structural guarantee is better held by the code comment + the existing lifecycle tests. Mark as Medium since the scope-brief calls it "by design" — a regression here would be caught by the warm-backend integration path anyway.

F2 — Parser Prune

T5: existing tool-call-parser.test.ts suite stays green — regression gate

  • Priority: High
  • Test level: Unit
  • Entry point: apps/server/src/services/inference/tool-call-parser.ts — all exports
  • Gap type: Partially tested (all current behaviors are tested; this is a "must not break" constraint during the prune)
  • Test approach:
    • Behavior: Every test in apps/server/src/services/__tests__/tool-call-parser.test.ts pins a behavior the prune must not change. The test suite covers: parseXmlToolCall, parseInvokeToolCall, partialXmlOpenerStart, extractToolCallBlocks (all cases including placeholder rejection and streaming accumulation), stripToolMarkup (closed/unclosed/final), delimiter constants.
    • Action: No new tests needed here — run pnpm -C apps/server test after each prune step and verify 0 failures. This suite is the contract.
    • Gap to close: The one behavior NOT currently tested in the pure module's test file is the <invoke>-fallback being triggered from stream-phase.ts when a text-delta chunk arrives. See T6.
  • Brittleness assessment: These tests are extremely durable — they test pure string-in/struct-out transforms. Zero brittleness risk. They are the reason to keep tool-call-parser.ts — they prove the guard is real.

T6: invoke-fallback fires from the stream-phase call site — call-site integration test

  • Priority: High
  • Test level: Unit (test streamCompletion or executeStreamPhase with a fake fullStream)
  • Entry point: apps/server/src/services/inference/stream-phase.ts:263-283 — the text-delta case calling extractToolCallBlocks
  • Gap type: Untested
  • Test approach:
    • Behavior: When result.fullStream emits a text-delta chunk containing a complete <invoke> block, streamCompletion must return that block's parsed tool call in result.toolCalls (not in result.content). This is the live guard for the "qwen3.6 drifts to <invoke> markup" scenario from the scope-brief.
    • Stubs: Stub streamText (import spy or dependency injection) to return a fake fullStream that yields:
      { type: 'text-delta', text: 'Thinking...\n<invoke name="view_file"><parameter name="path">/tmp/x</parameter></invoke>' }
      { type: 'finish', finishReason: 'stop' }
      
      Stub result.usage resolving to { inputTokens: 10, outputTokens: 5 }. Follow the ReadableStream test stubs guidance in apps/server/CLAUDE.md (use pull() not start()). The upstreamModel call can be stubbed to return a dummy provider.
    • Input/Action: Call streamCompletion(ctx, model, messages, opts, onDelta, onUsage, signal) with the stubbed streamText.
    • Expected output:
      • result.toolCalls has length 1 with name='view_file', args.path='/tmp/x'
      • result.content is 'Thinking...\n' (markup stripped)
      • onDelta was called with 'Thinking...\n' and NOT with the <invoke> markup
    • Expected commands: None (no DB, no WS)
  • Brittleness assessment: Medium brittleness — this test depends on the streamText API shape. Inject streamText as a dep rather than vi.spyOn a module export, so renaming the import doesn't break the test. The behavioral assertion (tool in toolCalls, markup NOT in content) is durable and directly pinned to the F2 safety concern. This test must stay green during the prune and must FAIL if extractToolCallBlocks is removed from the text-delta path.

Important: If the F2 decision is option (b) — full retirement behind a flag — this test validates the flag-off path (native-parsing-only) for one release. Write the test against the flag-off behavior first, confirm it passes without the parser, then delete the flag and the test in the follow-on batch.


T7: stripToolMarkup still called in tool-phase path — regression guard

  • Priority: Medium
  • Test level: Unit (test tool-call-parser.ts export directly — already covered)
  • Entry point: apps/server/src/services/inference/tool-phase.ts:122 and apps/server/src/services/inference/error-handler.ts:25,106
  • Gap type: Partially tested
  • Test approach:
    • Behavior: stripToolMarkup is the second consumer (tool-phase, error-handler). The scope-brief identifies it as load-bearing for the case where the model puts raw <tool_call> or <invoke> markup in its "final" text content. The existing stripToolMarkup tests in tool-call-parser.test.ts already cover this behavior at the pure level.
    • Action: No additional test needed. The existing stripToolMarkup tests (lines 297-342 of tool-call-parser.test.ts) are the contract. Do not write call-site integration tests for tool-phase.ts — they would require mocking the entire inference context and test the wrong level.
    • Note for prune: Before deleting stripToolMarkup, confirm none of tool-phase.ts:122, error-handler.ts:25, error-handler.ts:106 call it. If they do, the deletion is not safe regardless of the test.
  • Brittleness assessment: N/A — this is a "stay green" check, not a new test.

F6 — Stall-Timeout + Retry

T8: stall-timeout classifier — pure unit

  • Priority: High
  • Test level: Unit
  • Entry point: New apps/server/src/services/inference/stream-resilience.ts (to be created, following mistake-tracker.ts / lifecycle-decisions.ts pattern)
  • Gap type: Untested (code not yet written — test is the contract spec)
  • Test approach:
    • Behavior: Extract a pure classifyStreamError(err: unknown): 'stall' | 'transient-5xx' | 'non-retryable' function (or equivalent enum). The retry loop in stream-phase.ts delegates the retry decision to this function. Unit tests pin the classification without touching a real stream or timer.
    • Stubs: None
    • Cases:
      1. Error whose message contains "stall" or whose name is 'StallTimeoutError''stall'
      2. Error with HTTP status 500, 502, 503, 504 → 'transient-5xx'
      3. AbortError (name 'AbortError') → 'non-retryable'
      4. Error with HTTP status 400 → 'non-retryable'
      5. Error with HTTP status 429 → 'non-retryable' (YAGNI-gated: scope-brief says "strip the cloud-billing arms"; 429 is a cloud concern, not a single local llama-swap instance)
    • Expected output: Classification string for each case
  • Brittleness assessment: Durable — pure function, no I/O. The classification is the decision surface; the retry loop can be refactored freely as long as it consults this function. This follows the exact mistake-tracker.ts + lifecycle-decisions.ts pattern.

T9: stall-timeout aborts the stream — unit with fake timer and fake stream

  • Priority: Medium
  • Test level: Unit
  • Entry point: apps/server/src/services/inference/stream-phase.ts:261 — the for await (const part of result.fullStream) loop (post-F6, wrapped with a per-chunk deadline timer)
  • Gap type: Untested (code not yet written)
  • Test approach:
    • Behavior: When fullStream emits no chunk for longer than the stall deadline, the wrapping code must throw a StallTimeoutError (or equivalent) that propagates out of streamCompletion. This is the new abort path distinct from the user-initiated signal.aborted path.
    • Stubs:
      • Stub streamText to return a fake fullStream that yields one chunk then hangs (never yields the next). Use a ReadableStream with a pull() controller that enqueues one chunk and then never calls the resolve/reject until after the test's fake timer fires.
      • Use vi.useFakeTimers() (vitest fake timer) to advance time past the stall deadline without waiting real milliseconds.
    • Input/Action: Call streamCompletion with fake timers active and a stream that stalls after the first chunk. Advance fake time by stall-deadline + 1ms.
    • Expected output: streamCompletion rejects with a StallTimeoutError
    • Expected commands: None
  • Brittleness assessment: Medium — depends on fake-timer interacting correctly with the async stream loop. Vitest fake timers work with setTimeout-based deadlines but NOT with Promise-based deadlines. The implementation must use setTimeout (not Promise.race with a raw sleep) for the stall timer, or the test will be structurally incompatible. Note this constraint to the implementer. If the implementation uses Promise.race, adjust the test to use a setTimeout-resolving promise that can be controlled via vi.runAllTimersAsync().

T10: abort-after-stream regression — unit

  • Priority: Medium
  • Test level: Unit
  • Entry point: apps/server/src/services/inference/stream-phase.ts:337if (signal?.aborted) throw AbortError
  • Gap type: Untested (existing behavior, not new — but this is a documented regression risk per the inline comment "Smoke D caught this in v1.13.1-A — don't refactor it away")
  • Test approach:
    • Behavior: When the AbortSignal fires during a stream that has already consumed all chunks and exited the for await loop normally, streamCompletion must throw an AbortError rather than returning StreamResult. Without this throw the message would finalize as complete instead of cancelled.
    • Stubs: Stub streamText to return a fake fullStream that yields a text chunk then finish, then resolves normally. The AbortSignal is pre-aborted before the call (or aborted concurrently using a resolved-promise microtask so it fires before the post-loop check).
    • Input/Action: Create an AbortController, abort it, pass ac.signal to streamCompletion.
    • Expected output: streamCompletion rejects with an error whose name === 'AbortError'
    • Expected commands: None
  • Brittleness assessment: Very durable — asserts a precise, documented behavioral contract. This test protects the most subtle correctness invariant in the inference layer. It should be written whether or not F6 proceeds, as a regression pin on existing code.

F5, F4, F7 — Brief Notes

F5 (opencode compaction surfacing): The behavior being added is consuming new SSE event types in opencode-server.ts and emitting new WS frames. The primary test vector is the cross-app parity check: apps/coder/src/services/__tests__/provider-types-parity.test.ts is the precedent. A new ws-frames-parity.test.ts (or extension of ws-frames.test.ts) that asserts the new compaction frame type exists in BOTH apps/coder and apps/web WsFrame schemas is the right test. This is a Medium-priority addition once the frame types are defined. No new unit test is needed for the SSE arm parsing itself — it is structurally identical to the existing step.ended arm.

F4 (notify-hook config injection): The testable surface is the idempotent settings.json merge logic. Extract that as a pure mergeNotifyHook(existing: object, hookConfig: object): object function and unit-test it (same pattern as mistake-tracker.ts). The inbound POST route is testable via Fastify's inject() method if the handler is kept thin. However, F4 has no behavioral spec beyond the review description and is a WANT item — defer test planning until a feature spec exists. Testability is high when the merge logic is extracted.

F7 (view_session_history MCP tool): The new MCP tool is read-only over messages_with_parts. The testable behavior is the DB query returning the correct shape. Pattern: describe.runIf(!!process.env.DATABASE_URL) with a real DB, following tool_cost_stats.test.ts. Unit-level: the mcp-client.test.ts pattern (mock the DB, verify the query shape). Medium priority, straightforward once the tool implementation exists.


Deferred / Skipped Tests

S1: dispatcher.ts full integration test (any path)

  • Entry point: apps/coder/src/services/dispatcher.ts:46createDispatcher
  • Reason: The dispatcher is DB-heavy, timer-driven, and spawns child processes. An integration test for the full dispatch loop would require a real DB, a real (or fake) agent binary, and process lifecycle management. The value is low relative to the brittleness: the cancel-registry unit tests (T1, T3) and the route integration test (T2) together cover the observable behavioral gap at much lower maintenance cost. The full dispatch loop is better validated by end-to-end smoke testing.

S2: one-shot PTY worktree cleanup on cancel

  • Entry point: apps/coder/src/services/dispatcher.ts:600-606cleanupWorktree in the PTY error path
  • Reason: Testing that cleanupWorktree is called requires either a spy (over-specified double) or a real filesystem side effect (requires a real git repo tempdir). The behavioral contract ("PTY worktree cleaned up on cancel") is better held by the existing write_guard.test.ts + pending_changes_integration.test.ts precedents, which validate the write-side contract. A spy assertion on cleanupWorktree call count is an over-specified double that breaks on any rename or refactor of the cleanup path.

S3: retry backoff timing

  • Entry point: apps/server/src/services/inference/stream-phase.ts — post-F6 retry loop
  • Reason: Testing that backoff delays are e.g. 1s, 2s, 4s requires fake timers and multi-attempt orchestration. The scope-brief says "keep it minimal" and YAGNI-gates the retry scope hard for a single local instance. The classifier (T8) pins the correctness decision; the timer values are implementation details. A test that asserts specific delay durations would break on any tuning and has no behavioral anchor in the F6 spec.

Coverage Estimate

After all recommended tests are written:

  • F1 behavioral contract: Fully covered at unit level (registry decisions) + opt-in DB integration (route → dispatcher). The warm-vs-PTY worktree divergence is held by a code-comment contract rather than a test (see T4 recommendation).
  • F2 behavioral contract: Fully covered — existing tool-call-parser.test.ts suite is the regression gate; T6 closes the one gap (call-site fallback from stream-phase.ts). Prune can proceed safely with T6 green.
  • F6 behavioral contract: Covered for the classifier (T8, unit) and the stall-throw path (T9, unit with fake timers). The existing abort regression (T10) closes a pre-existing gap independent of F6.
  • F5/F4/F7: Testability is high but deferred to implementation time; no test can be written before the frame types and route handlers exist.

Behaviors intentionally left without tests: full dispatcher orchestration loop (S1), PTY cleanup spy (S2), backoff timing (S3).