docs: archive shipped openspec batches; add feature/plan/research notes

Move 13 shipped openspec change docs under openspec/changes/archived/.
Add docs/features/git-diff-panel, docs/plans/post-review-backlog, and
docs/research/cross-app-contract-ssot.md (the research behind the
@boocode/contracts SSOT work). Update BOOCHAT.md, BOOCODER.md, and
boocode_roadmap.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-02 21:20:33 +00:00
parent e5ce01ae72
commit 2a05d2f9fe
27 changed files with 2210 additions and 17 deletions

View File

@@ -0,0 +1,261 @@
# Test Plan: post-review-backlog (F1, F2, F6, F5/F4/F7 brief)
## Scope
`apps/coder/src/services/dispatcher.ts`,
`apps/coder/src/routes/tasks.ts:110-148`,
`apps/server/src/services/inference/tool-call-parser.ts`,
`apps/server/src/services/inference/stream-phase.ts:261-341`,
and their existing test companions.
Branch: main (HEAD e5ce01a, v2.7.11).
---
## Summary
Three features — F1 (task cancel), F2 (parser prune), F6 (stall-timeout + retry) — have behavioral contracts not yet protected by any test. F1 is the highest risk because the whole-point of the feature is a side-effect (child termination) that is trivially skippable in the implementation. F2 is the whole point of a prune: keeping the existing `tool-call-parser.test.ts` suite green IS the contract, but one gap (the `<invoke>`-fallback path exercised from the call site in `stream-phase.ts`) needs a new test before any deletion proceeds. F6 benefits most from a pure-helper extraction (following the `lifecycle-decisions.ts` / `mistake-tracker.ts` precedent) so the classifier logic is unit-testable without a real stream.
| Priority | Count |
|----------|-------|
| High | 4 |
| Medium | 3 |
| Low | 2 |
| Skipped | 3 |
Full analysis written to: /home/samkintop/opt/boocode/docs/plans/post-review-backlog/test-plan.md
---
## Coverage Assessment
**Well-tested:** `tool-call-parser.ts` — every export has thorough unit coverage (`tool-call-parser.test.ts`); the inline fallback behavior of `extractToolCallBlocks` is the best-tested surface in the inference layer. Coder backend lifecycle decisions (`lifecycle-decisions.test.ts`, `turn-guard.test.ts`, `normalize-agent-status.test.ts`) follow the pure-helper extraction pattern and have complete unit coverage.
**Significant gaps:**
- `dispatcher.ts` — the `cancelExternalTask` function does not exist yet. Once written it will have zero tests. The cancel-registry decision logic (the pure "does this taskId map to an AbortController?" portion) is the highest-value extraction target.
- `routes/tasks.ts:110-148` (cancel route) — currently the route never reaches the dispatcher for external tasks; after F1 the new code path has no test.
- `stream-phase.ts` stall / retry behavior — entirely absent; the `for await` loop has no timeout and no retry. Both will be new code with no tests.
- `stream-phase.ts` abort-after-stream behavior — the critical post-loop `signal?.aborted` check at line 337 is already documented as a hard-won correctness property but has no regression test.
**Overall:** the inference layer has solid unit tests for data-transformation helpers but none for control-flow behavior (abort, stall, retry). The coder dispatcher has no tests at all. This is appropriate given the DB-heavy nature of the dispatcher, but the pure decision layers (cancel-registry, retry classifier) can and should be extracted and tested.
---
## Findings
### F1 — Task Cancel
**T1: cancel-registry decisions — pure unit**
- **Priority:** High
- **Test level:** Unit
- **Entry point:** `apps/coder/src/services/dispatcher.ts` — the new `cancelExternalTask(taskId)` export (to be created at F1 implementation time; the registry is the `Map<taskId, AbortController>` that does not exist yet)
- **Gap type:** Untested (code not yet written — test is the contract spec)
- **Test approach:**
- **Behavior:** Extract a pure `CancelRegistry` module (following the `lifecycle-decisions.ts` / `turn-guard.ts` / `mistake-tracker.ts` precedent) that owns register/cancel/has/delete operations over a `Map<taskId, AbortController>`. The dispatcher creates one and calls these; routes call `cancel`. The unit tests pin what the registry commits to.
- **Stubs:** None — the module is pure (no I/O, no DB)
- **Cases:**
1. `register(id, ac)` then `cancel(id)``ac.signal.aborted` is `true`, returns `true`
2. `cancel(id)` on unknown id → returns `false`, no throw
3. `register` + natural completion calls `delete(id)``cancel(id)` now returns `false` (idempotent double-cancel safety)
4. `has(id)` returns `true` while registered, `false` after delete
- **Expected output:** Signal aborted state, return values as above
- **Expected commands:** None (pure)
- **Brittleness assessment:** Durable — asserts the behavioral contract of the new module regardless of how the dispatcher wires it. The four cases directly map to the F1 correctness requirements (signal fired, idempotent, no-op after exit). No mock expectations.
---
**T2: cancel route reaches dispatcher for external task — integration**
- **Priority:** High
- **Test level:** Integration (DB-opt-in via `describe.runIf(!!process.env.DATABASE_URL)`)
- **Entry point:** `apps/coder/src/routes/tasks.ts:110``POST /api/tasks/:id/cancel`
- **Gap type:** Untested
- **Test approach:**
- **Behavior:** After F1, `POST /api/tasks/:id/cancel` for a running external task must call `cancelExternalTask(taskId)` (which fires the AbortController), set `tasks.state='cancelled'`, and leave the previously-streaming assistant message with `status` not `'streaming'` (it must be `'cancelled'` or `'complete'`).
- **Stubs:** Stub the `cancelExternalTask` function (or inject it as a dep) so the test does not require a live agent child. Stub returns `true` (task was in-flight).
- **Input/Action:** Insert a task row in state `'running'` with a known `session_id`. POST to the cancel route.
- **Expected output:** Response `{ cancelled: true }`, DB `tasks.state = 'cancelled'`
- **Expected commands:** `cancelExternalTask(taskId)` called once (spy on injected dep)
- **Brittleness assessment:** The assertion on `cancelExternalTask` being called is legitimate here — it IS the behavioral contract of the route (the whole missing piece in the current implementation). Only assert call happened, not argument order or internal mechanics. The DB state assertion (`tasks.state`) is the durable anchor. Follow the `tool_cost_stats.test.ts` + `describe.runIf` pattern.
---
**T3: natural-exit does not re-cancel — pure unit on registry**
- **Priority:** High
- **Test level:** Unit (part of T1's file, not a separate file)
- **Entry point:** Cancel registry `delete(taskId)` — called in the dispatcher's `finally` block on clean completion
- **Gap type:** Untested
- **Test approach:**
- **Behavior:** When the dispatcher's `runExternalAgent` / `runOpenCodeServerTask` `finally` block deletes the taskId from the registry, a subsequent route-level `cancelExternalTask(taskId)` call is a no-op (returns `false`). This is the "cancel-after-natural-exit no-op" behavior from the F1 scope.
- **Stubs:** None
- **Input/Action:** `registry.register(id, ac)``registry.delete(id)``registry.cancel(id)`
- **Expected output:** returns `false`, `ac.signal.aborted` remains `false`
- **Brittleness assessment:** Trivially durable. This is the critical property that prevents spurious aborts arriving late after a task completes.
---
**T4: warm worktree preserved on cancel (opencode path) — contract assertion on existing behavior**
- **Priority:** Medium
- **Test level:** Unit (pure behavioral assertion — no DB or child needed)
- **Entry point:** `apps/coder/src/services/dispatcher.ts``runOpenCodeServerTask` catch/cancel branch (~line 921-933)
- **Gap type:** Untested
- **Test approach:**
- **Behavior:** The scope-brief (F1 "Warm backends keep their persistent worktree by design") requires that when `ac.abort()` is called during `runOpenCodeServerTask`, the function does NOT call `cleanupWorktree`. The one-shot PTY path (`runExternalAgent`) does call `cleanupWorktree` in its finally. This distinction is the behavioral contract.
- **Stubs:** This is most cleanly verified by reading the implementation post-F1 and confirming via a test that passes a spy `cleanupWorktree` to the cancel path. However — this is better expressed as a **code-review assertion** rather than a test, because the behavior is structural (code path absence), not data-dependent.
- **Recommendation:** Capture this as a doc-comment contract in the implementation ("no cleanupWorktree on cancel — worktree is persistent") rather than a brittle spy assertion. If Sam disagrees, add an integration test that confirms the worktree directory still exists on disk after an abort, using a tempdir fixture (follows `pending_changes_integration.test.ts` pattern).
- **Brittleness assessment:** A spy on `cleanupWorktree` would break if the function is renamed or inlined. The structural guarantee is better held by the code comment + the existing lifecycle tests. Mark as Medium since the scope-brief calls it "by design" — a regression here would be caught by the warm-backend integration path anyway.
---
### F2 — Parser Prune
**T5: existing tool-call-parser.test.ts suite stays green — regression gate**
- **Priority:** High
- **Test level:** Unit
- **Entry point:** `apps/server/src/services/inference/tool-call-parser.ts` — all exports
- **Gap type:** Partially tested (all current behaviors are tested; this is a "must not break" constraint during the prune)
- **Test approach:**
- **Behavior:** Every test in `apps/server/src/services/__tests__/tool-call-parser.test.ts` pins a behavior the prune must not change. The test suite covers: `parseXmlToolCall`, `parseInvokeToolCall`, `partialXmlOpenerStart`, `extractToolCallBlocks` (all cases including placeholder rejection and streaming accumulation), `stripToolMarkup` (closed/unclosed/final), delimiter constants.
- **Action:** No new tests needed here — run `pnpm -C apps/server test` after each prune step and verify 0 failures. This suite is the contract.
- **Gap to close:** The one behavior NOT currently tested in the pure module's test file is the `<invoke>`-fallback being triggered from `stream-phase.ts` when a text-delta chunk arrives. See T6.
- **Brittleness assessment:** These tests are extremely durable — they test pure string-in/struct-out transforms. Zero brittleness risk. They are the reason to keep `tool-call-parser.ts` — they prove the guard is real.
---
**T6: invoke-fallback fires from the stream-phase call site — call-site integration test**
- **Priority:** High
- **Test level:** Unit (test `streamCompletion` or `executeStreamPhase` with a fake `fullStream`)
- **Entry point:** `apps/server/src/services/inference/stream-phase.ts:263-283` — the `text-delta` case calling `extractToolCallBlocks`
- **Gap type:** Untested
- **Test approach:**
- **Behavior:** When `result.fullStream` emits a `text-delta` chunk containing a complete `<invoke>` block, `streamCompletion` must return that block's parsed tool call in `result.toolCalls` (not in `result.content`). This is the live guard for the "qwen3.6 drifts to `<invoke>` markup" scenario from the scope-brief.
- **Stubs:** Stub `streamText` (import spy or dependency injection) to return a fake `fullStream` that yields:
```
{ type: 'text-delta', text: 'Thinking...\n<invoke name="view_file"><parameter name="path">/tmp/x</parameter></invoke>' }
{ type: 'finish', finishReason: 'stop' }
```
Stub `result.usage` resolving to `{ inputTokens: 10, outputTokens: 5 }`. Follow the `ReadableStream test stubs` guidance in `apps/server/CLAUDE.md` (use `pull()` not `start()`). The `upstreamModel` call can be stubbed to return a dummy provider.
- **Input/Action:** Call `streamCompletion(ctx, model, messages, opts, onDelta, onUsage, signal)` with the stubbed `streamText`.
- **Expected output:**
- `result.toolCalls` has length 1 with `name='view_file'`, `args.path='/tmp/x'`
- `result.content` is `'Thinking...\n'` (markup stripped)
- `onDelta` was called with `'Thinking...\n'` and NOT with the `<invoke>` markup
- **Expected commands:** None (no DB, no WS)
- **Brittleness assessment:** Medium brittleness — this test depends on the `streamText` API shape. Inject `streamText` as a dep rather than `vi.spyOn` a module export, so renaming the import doesn't break the test. The behavioral assertion (tool in `toolCalls`, markup NOT in `content`) is durable and directly pinned to the F2 safety concern. This test must stay green during the prune and must FAIL if `extractToolCallBlocks` is removed from the `text-delta` path.
**Important:** If the F2 decision is option (b) — full retirement behind a flag — this test validates the flag-off path (native-parsing-only) for one release. Write the test against the flag-off behavior first, confirm it passes without the parser, then delete the flag and the test in the follow-on batch.
---
**T7: stripToolMarkup still called in tool-phase path — regression guard**
- **Priority:** Medium
- **Test level:** Unit (test `tool-call-parser.ts` export directly — already covered)
- **Entry point:** `apps/server/src/services/inference/tool-phase.ts:122` and `apps/server/src/services/inference/error-handler.ts:25,106`
- **Gap type:** Partially tested
- **Test approach:**
- **Behavior:** `stripToolMarkup` is the second consumer (tool-phase, error-handler). The scope-brief identifies it as load-bearing for the case where the model puts raw `<tool_call>` or `<invoke>` markup in its "final" text content. The existing `stripToolMarkup` tests in `tool-call-parser.test.ts` already cover this behavior at the pure level.
- **Action:** No additional test needed. The existing `stripToolMarkup` tests (lines 297-342 of `tool-call-parser.test.ts`) are the contract. Do not write call-site integration tests for `tool-phase.ts` — they would require mocking the entire inference context and test the wrong level.
- **Note for prune:** Before deleting `stripToolMarkup`, confirm none of `tool-phase.ts:122`, `error-handler.ts:25`, `error-handler.ts:106` call it. If they do, the deletion is not safe regardless of the test.
- **Brittleness assessment:** N/A — this is a "stay green" check, not a new test.
---
### F6 — Stall-Timeout + Retry
**T8: stall-timeout classifier — pure unit**
- **Priority:** High
- **Test level:** Unit
- **Entry point:** New `apps/server/src/services/inference/stream-resilience.ts` (to be created, following `mistake-tracker.ts` / `lifecycle-decisions.ts` pattern)
- **Gap type:** Untested (code not yet written — test is the contract spec)
- **Test approach:**
- **Behavior:** Extract a pure `classifyStreamError(err: unknown): 'stall' | 'transient-5xx' | 'non-retryable'` function (or equivalent enum). The retry loop in `stream-phase.ts` delegates the retry decision to this function. Unit tests pin the classification without touching a real stream or timer.
- **Stubs:** None
- **Cases:**
1. Error whose message contains "stall" or whose name is `'StallTimeoutError'` → `'stall'`
2. Error with HTTP status 500, 502, 503, 504 → `'transient-5xx'`
3. AbortError (name `'AbortError'`) → `'non-retryable'`
4. Error with HTTP status 400 → `'non-retryable'`
5. Error with HTTP status 429 → `'non-retryable'` (YAGNI-gated: scope-brief says "strip the cloud-billing arms"; 429 is a cloud concern, not a single local llama-swap instance)
- **Expected output:** Classification string for each case
- **Brittleness assessment:** Durable — pure function, no I/O. The classification is the decision surface; the retry loop can be refactored freely as long as it consults this function. This follows the exact `mistake-tracker.ts` + `lifecycle-decisions.ts` pattern.
---
**T9: stall-timeout aborts the stream — unit with fake timer and fake stream**
- **Priority:** Medium
- **Test level:** Unit
- **Entry point:** `apps/server/src/services/inference/stream-phase.ts:261` — the `for await (const part of result.fullStream)` loop (post-F6, wrapped with a per-chunk deadline timer)
- **Gap type:** Untested (code not yet written)
- **Test approach:**
- **Behavior:** When `fullStream` emits no chunk for longer than the stall deadline, the wrapping code must throw a `StallTimeoutError` (or equivalent) that propagates out of `streamCompletion`. This is the new abort path distinct from the user-initiated `signal.aborted` path.
- **Stubs:**
- Stub `streamText` to return a fake `fullStream` that yields one chunk then hangs (never yields the next). Use a `ReadableStream` with a `pull()` controller that enqueues one chunk and then never calls the resolve/reject until after the test's fake timer fires.
- Use `vi.useFakeTimers()` (vitest fake timer) to advance time past the stall deadline without waiting real milliseconds.
- **Input/Action:** Call `streamCompletion` with fake timers active and a stream that stalls after the first chunk. Advance fake time by stall-deadline + 1ms.
- **Expected output:** `streamCompletion` rejects with a `StallTimeoutError`
- **Expected commands:** None
- **Brittleness assessment:** Medium — depends on fake-timer interacting correctly with the async stream loop. Vitest fake timers work with `setTimeout`-based deadlines but NOT with `Promise`-based deadlines. The implementation must use `setTimeout` (not `Promise.race` with a raw `sleep`) for the stall timer, or the test will be structurally incompatible. Note this constraint to the implementer. If the implementation uses `Promise.race`, adjust the test to use a `setTimeout`-resolving promise that can be controlled via `vi.runAllTimersAsync()`.
---
**T10: abort-after-stream regression — unit**
- **Priority:** Medium
- **Test level:** Unit
- **Entry point:** `apps/server/src/services/inference/stream-phase.ts:337` — `if (signal?.aborted) throw AbortError`
- **Gap type:** Untested (existing behavior, not new — but this is a documented regression risk per the inline comment "Smoke D caught this in v1.13.1-A — don't refactor it away")
- **Test approach:**
- **Behavior:** When the `AbortSignal` fires during a stream that has already consumed all chunks and exited the `for await` loop normally, `streamCompletion` must throw an `AbortError` rather than returning `StreamResult`. Without this throw the message would finalize as `complete` instead of `cancelled`.
- **Stubs:** Stub `streamText` to return a fake `fullStream` that yields a text chunk then `finish`, then resolves normally. The `AbortSignal` is pre-aborted before the call (or aborted concurrently using a resolved-promise microtask so it fires before the post-loop check).
- **Input/Action:** Create an `AbortController`, abort it, pass `ac.signal` to `streamCompletion`.
- **Expected output:** `streamCompletion` rejects with an error whose `name === 'AbortError'`
- **Expected commands:** None
- **Brittleness assessment:** Very durable — asserts a precise, documented behavioral contract. This test protects the most subtle correctness invariant in the inference layer. It should be written whether or not F6 proceeds, as a regression pin on existing code.
---
## F5, F4, F7 — Brief Notes
**F5 (opencode compaction surfacing):** The behavior being added is consuming new SSE event types in `opencode-server.ts` and emitting new WS frames. The primary test vector is the cross-app parity check: `apps/coder/src/services/__tests__/provider-types-parity.test.ts` is the precedent. A new `ws-frames-parity.test.ts` (or extension of `ws-frames.test.ts`) that asserts the new compaction frame type exists in BOTH `apps/coder` and `apps/web` `WsFrame` schemas is the right test. This is a Medium-priority addition once the frame types are defined. No new unit test is needed for the SSE arm parsing itself — it is structurally identical to the existing `step.ended` arm.
**F4 (notify-hook config injection):** The testable surface is the idempotent `settings.json` merge logic. Extract that as a pure `mergeNotifyHook(existing: object, hookConfig: object): object` function and unit-test it (same pattern as `mistake-tracker.ts`). The inbound POST route is testable via Fastify's `inject()` method if the handler is kept thin. However, F4 has no behavioral spec beyond the review description and is a WANT item — defer test planning until a feature spec exists. Testability is high when the merge logic is extracted.
**F7 (view_session_history MCP tool):** The new MCP tool is read-only over `messages_with_parts`. The testable behavior is the DB query returning the correct shape. Pattern: `describe.runIf(!!process.env.DATABASE_URL)` with a real DB, following `tool_cost_stats.test.ts`. Unit-level: the `mcp-client.test.ts` pattern (mock the DB, verify the query shape). Medium priority, straightforward once the tool implementation exists.
---
## Deferred / Skipped Tests
**S1: dispatcher.ts full integration test (any path)**
- **Entry point:** `apps/coder/src/services/dispatcher.ts:46` — `createDispatcher`
- **Reason:** The dispatcher is DB-heavy, timer-driven, and spawns child processes. An integration test for the full dispatch loop would require a real DB, a real (or fake) agent binary, and process lifecycle management. The value is low relative to the brittleness: the cancel-registry unit tests (T1, T3) and the route integration test (T2) together cover the observable behavioral gap at much lower maintenance cost. The full dispatch loop is better validated by end-to-end smoke testing.
**S2: one-shot PTY worktree cleanup on cancel**
- **Entry point:** `apps/coder/src/services/dispatcher.ts:600-606` — `cleanupWorktree` in the PTY error path
- **Reason:** Testing that `cleanupWorktree` is called requires either a spy (over-specified double) or a real filesystem side effect (requires a real git repo tempdir). The behavioral contract ("PTY worktree cleaned up on cancel") is better held by the existing `write_guard.test.ts` + `pending_changes_integration.test.ts` precedents, which validate the write-side contract. A spy assertion on `cleanupWorktree` call count is an over-specified double that breaks on any rename or refactor of the cleanup path.
**S3: retry backoff timing**
- **Entry point:** `apps/server/src/services/inference/stream-phase.ts` — post-F6 retry loop
- **Reason:** Testing that backoff delays are e.g. 1s, 2s, 4s requires fake timers and multi-attempt orchestration. The scope-brief says "keep it minimal" and YAGNI-gates the retry scope hard for a single local instance. The classifier (T8) pins the correctness decision; the timer values are implementation details. A test that asserts specific delay durations would break on any tuning and has no behavioral anchor in the F6 spec.
---
## Coverage Estimate
After all recommended tests are written:
- **F1 behavioral contract:** Fully covered at unit level (registry decisions) + opt-in DB integration (route → dispatcher). The warm-vs-PTY worktree divergence is held by a code-comment contract rather than a test (see T4 recommendation).
- **F2 behavioral contract:** Fully covered — existing `tool-call-parser.test.ts` suite is the regression gate; T6 closes the one gap (call-site fallback from `stream-phase.ts`). Prune can proceed safely with T6 green.
- **F6 behavioral contract:** Covered for the classifier (T8, unit) and the stall-throw path (T9, unit with fake timers). The existing abort regression (T10) closes a pre-existing gap independent of F6.
- **F5/F4/F7:** Testability is high but deferred to implementation time; no test can be written before the frame types and route handlers exist.
Behaviors intentionally left without tests: full dispatcher orchestration loop (S1), PTY cleanup spy (S2), backoff timing (S3).