Files
boocode/test-plan.md
indifferentketchup b18de2a331 chore: snapshot working tree - pty_exited notifications + in-flight inference WIP
feat(booterm): structured pty_exited WS notifications. Plan-validated, impl-validated, code-reviewed green (contracts build clean, contracts test 29/29, booterm + web typecheck clean).

wip: in-progress inference/provider refactor (agents.ts, provider.ts, new llama-providers.ts, removed llama-args-validator), plus arena, dispatcher, compaction, schema changes.

openspec: pty-exit-notifications complete; x-agent-flags planned (not yet implemented).
2026-06-14 12:48:47 +00:00

22 KiB
Raw Permalink Blame History

Test Plan: BooControl P1 Implementation

Scope

Files analyzed:

  • apps/control/src/services/fleet-connector.ts — SSE parsing loop, reconnect loop, jitter, backoff
  • apps/control/src/services/fleet-state.ts — types + helpers
  • apps/control/src/services/retention.ts — retention job functions (runRollup, prune*)
  • apps/control/src/services/host-access.ts — no-op host access grant
  • apps/control/src/index.ts — application orchestrator (event handlers, reconcile, perf poller, main)
  • apps/control/src/routes/ws.ts — WebSocket endpoint, buildSnapshot
  • apps/control/src/config.ts — env var parsing
  • apps/control/src/db.ts — database connection + schema application
  • apps/web/src/hooks/useControlStream.tsx — WS client, frame parsing, snapshot/delta seq filtering

Test files analyzed:

  • apps/control/src/services/__tests__/fleet-connector.test.ts
  • apps/control/src/services/__tests__/fleet-state.test.ts
  • apps/control/src/services/__tests__/liveness.test.ts
  • apps/control/src/services/__tests__/reconcile.test.ts
  • apps/control/src/services/__tests__/retention.test.ts
  • apps/control/src/services/__tests__/seq-logic.test.ts

Branch: untracked (new feature, not yet committed). Recency analysis: N/A — all files are new, no git history.

Summary

BooControl P1 has 6 test files covering only 4 pure utility functions (addJitter, reconnectDecision, trimCapture, fleet-state helpers) plus 2 tests that test non-production code (liveness test doubles a function that doesn't exist in the source; seq-logic tests a local copy not the real handler). The core orchestration — SSE parsing/reconnect loop, LlamaSweep event handling, reconcile gap detection, performance polling, WebSocket snapshot building, and the frontend WS stream hook — have zero behavioral coverage. The SSE parsing loop contains a confirmed bug (line 158 skips all data: lines, preventing any event from being parsed). The reconcile test is a placeholder. Of the 6 real flaws detectable by behavioral tests, only the missing-jitter-in-backoff concern (which prompted addJitter) was preemptively covered.

Priority Count
High 8
Medium 4
Low 0
Skipped 4

Full analysis written to: /home/samkintop/opt/boocode/test-plan.md

Coverage Assessment

Well-tested: addJitter, reconnectDecision, trimCapture, createFleetState/ensureHostState/stampLastSeen. These pure functions have clean bounded-assertion tests that verify the behavioral contract without over-specification.

Placeholder tests (value = 0):

  • reconcile.test.tsexpect(true).toBe(true) with DB context comments. No assertions exercised.
  • liveness.test.ts — tests a transitionLiveness function that does not exist in production code. The real liveness transitions (state.liveness = 'connected' inline in handleLlamaSweepEvent/handleReconcile) have no tests.
  • seq-logic.test.ts — tests a local copy of buffer-then-filter logic that does not correspond to any importable production function. The real seq filtering is inlined in useControlStream.tsx's ws.onmessage handler and has no tests.

Completely untested public API surface:

  • runFleetConnector / startFleetConnector — the entire SSE loop with streaming read, line parsing, error handling, and reconnect
  • handleLlamaSweepEvent — all 4 event type branches (modelStatus, logData, metrics, inflight)
  • handleReconcile — gap detection logic, entry ingestion
  • pollPerformance — watermark recovery, API fetch, sample insertion
  • buildSnapshot — fleet state serialization
  • registerControlWebSocket — WS endpoint lifecycle
  • useControlStream.tsx — WS connect, reconnect, frame dispatch, seq filtering, all 5 frame types
  • buildRetentionConfig, runRollup, pruneRawSamples, pruneActivity, pruneModelEvents
  • loadConfig, getSql, waitForTable, applySchema, pingDb

Known bugs detectable by tests:

  1. SSE parsing loop always skips data: lines (line 158: trimmed.startsWith('data:') continues instead of only skipping non-data lines)
  2. incrementSeq defined but never called — seq stays at 0 for all hosts
  3. control_job frame handler in useControlStream.tsx pushes dummy empty entry instead of actual frame data (line 195)

Findings

T1: SSE parsing loop — data: line filter bug + end-to-end event processing

  • Priority: High
  • Test level: Unit (with mocked fetch/reader)
  • Entry point: apps/control/src/services/fleet-connector.ts:122runFleetConnector
  • Gap type: Untested
  • Test approach:
    • Behavior: Given a mocked fetch returning a ReadableStream with SSE-formatted chunks, verify that onEvent is called with correctly typed events for each SSE envelope type (modelStatus, logData, metrics, inflight). This test would immediately catch the line-158 bug.
    • Stubs: deps.isUp() returns true initially then false after first event; deps.onEvent is a spy; deps.onReconnectGiveUp is a stub; deps.sleep is a fast-resolve stub; deps.log.warn is a spy; deps.sql is unused in the parse path; mock fetch to return a Response with a readable stream that yields SSE-formatted chunks.
    • Input/Action: Call runFleetConnector('test-host', 'http://localhost:8401', abort, deps) with abort that fires after one event.
    • Expected output: deps.onEvent called exactly once per parseable SSE event. Each call receives the correct providerId and a typed LlamaSweepSSEEvent with the correct type and parsed data.
    • Expected commands: Verify onEvent call count and arguments.
  • Brittleness assessment: Low. Stubbing fetch is the standard port in JS. The test verifies the parsing outcome (events dispatched), not internal call ordering. If the SSE chunking strategy changes, the test data changes but the assertion doesn't.

T2: SSE error handling — HTTP error, network error, empty body

  • Priority: High
  • Test level: Unit
  • Entry point: apps/control/src/services/fleet-connector.ts:122runFleetConnector
  • Gap type: Untested
  • Test approach:
    • Behavior: When fetch returns a non-ok response (e.g., 500), the connector should log the error, increment failure count, and enter the reconnect backoff loop. When fetch throws (network error), same flow. When the body is null, throw. When the stream ends cleanly, reset failures to 0 and sleep at base delay.
    • Stubs: Same deps as T1. Mock fetch to reject once then resolve on subsequent calls (so the loop continues). Or mock fetch to return { ok: false, status: 500, statusText: 'Internal Server Error' }.
    • Input/Action: Call runFleetConnector with appropriate mocks. Use abort after the error decision is made.
    • Expected output: deps.log.warn called with relevant context. deps.sleep called with jittered delay. Failure state handled.
    • Expected commands: Verify deps.log.warn was called with error context. Verify deps.sleep was called with a delay >= baseMs.
  • Brittleness assessment: Low. Standard mock patterns. The test checks that errors route to backoff, not the exact error message.

T3: Reconnect give-up — circuit breaker triggers onReconnectGiveUp

  • Priority: High
  • Test level: Unit
  • Entry point: apps/control/src/services/fleet-connector.ts:189-192
  • Gap type: Untested
  • Test approach:
    • Behavior: After maxAttempts consecutive failures, the connector calls deps.onReconnectGiveUp(providerId) and breaks from the loop instead of continuing to reconnect.
    • Stubs: deps.isUp() returns true; deps.sleep resolves instantly; deps.onReconnectGiveUp is a spy; mock fetch to always reject (network error). Set policy to { baseMs: 1, maxMs: 10, maxAttempts: 3 } for fast test.
    • Input/Action: Call runFleetConnector. It will retry 3 times then give up.
    • Expected output: deps.onReconnectGiveUp called exactly once with the providerId. Loop exits.
    • Expected commands: Verify onReconnectGiveUp call and that onEvent was never called.
  • Brittleness assessment: Low. The retry count is explicit in the policy. Clean deterministic test.

T4: handleLlamaSweepEvent — all 4 event type branches

  • Priority: High
  • Test level: Unit (pure logic with stubbed DB)
  • Entry point: apps/control/src/services/../index.ts:44handleLlamaSweepEvent (note: this function is not exported; needs extraction or testing via startFleetConnector's deps.onEvent)
  • Gap type: Untested
  • Test approach:
    • Behavior:
      1. modelStatus event: creates host state if needed, stamps lastSeen, sets liveness to 'connected', creates model state with ttlDeadline, persists model event to DB.
      2. logData event: no-op (no DB writes, no state changes beyond stampLastSeen).
      3. metrics event: trims captures, inserts request rows with capture (or NULL when no capture), ON CONFLICT DO NOTHING.
      4. inflight event: updates existing model state inflight count, no-op if model not in state.
    • Stubs: fleet is a createFleetState(); sql is a mock where each template-literal call resolves to empty; config is a mock with CAPTURE_SIZE_KB.
    • Input/Action: Call handler with each event type and verify state mutations + DB calls.
    • Expected output: Fleet state reflects the event. Liveness, lastSeenAt, models map updated correctly.
    • Expected commands: Verify sql was called with correct SQL template for modelStatus/metrics events. Verify sql NOT called for logData.
  • Brittleness assessment: Medium. The SQL template literal assertions are the brittle part — exact SQL string matching couples to query formatting. Mitigate by using a helper that checks table name and key bindings rather than full SQL string equality. Consider extracting event handlers into a service with injectable sql for cleaner test boundaries.

T5: handleReconcile — gap detection logic

  • Priority: High
  • Test level: Integration (DB required) or Unit (with sql mock)
  • Entry point: apps/control/src/services/../index.ts:102handleReconcile (not exported)
  • Gap type: Partially tested (placeholder test exists but asserts nothing)
  • Test approach:
    • Behavior:
      1. When oldest reconcile entry is newer than newest persisted entry, insert a gap_suspected model event.
      2. When overlap exists (oldest reconcile <= newest persisted), no gap event inserted.
      3. When no persisted entries exist (first reconcile), no gap (no comparison possible).
      4. All reconcile entries are ingested with dedup (ON CONFLICT DO NOTHING).
    • Stubs: Fleet state; sql mock that returns controlled results for the SELECT ts FROM control_requests ORDER BY ts DESC LIMIT 1 query.
    • Input/Action: Call handler with metrics data containing known timestamps.
    • Expected output: When gap exists, sql called to insert gap_suspected. When no gap, no such insertion.
    • Expected commands: Verify sql calls for gap_suspected insertion (or absence). Verify sql calls for each reconcile entry ingestion.
  • Brittleness assessment: Medium (same SQL coupling as T4). The gap detection is a critical business rule — worth it.

T6: pollPerformance — watermark resume + data ingestion

  • Priority: High
  • Test level: Unit (with mocked fetch and sql)
  • Entry point: apps/control/src/services/../index.ts:158pollPerformance (not exported)
  • Gap type: Untested
  • Test approach:
    • Behavior:
      1. Reads MAX(ts) watermark from DB per provider.
      2. Fetches /api/performance with ?after= param when watermark exists, without when null.
      3. Inserts returned samples with dedup.
      4. Stamps lastSeen on success.
      5. Silently handles fetch failures (no throw, no crash).
    • Stubs: sql mock returning controlled watermark; fetch mock returning performance JSON; fleet state.
    • Input/Action: Call pollPerformance with different watermark scenarios (null, existing ts).
    • Expected output: Fetch called with correct URL (with/without ?after=). sql insert called for each sample. state.lastSeenAt updated.
    • Expected commands: Verify fetch URL; verify sql insert calls per sample.
  • Brittleness assessment: Medium-ish. Fetch URL construction is simple string concatenation. The main risk is SQL coupling. Acceptable for this level.

T7: buildSnapshot — fleet state serialization

  • Priority: High
  • Test level: Unit
  • Entry point: apps/control/src/routes/ws.ts:62buildSnapshot (not exported; needs extraction or indirect test)
  • Gap type: Untested
  • Test approach:
    • Behavior: Given a FleetState with hosts, models, dates, produces a serializable SnapshotData object with correct structure. Null lastSeenAt → null. Date objects → ISO strings. Null ttlDeadline → null. Empty fleet → empty hosts array.
    • Stubs: Create FleetState via createFleetState(), populate with hosts and models.
    • Input/Action: Call buildSnapshot(fleet).
    • Expected output: SnapshotData.hosts array with correct length, each host has correct field types, dates are ISO strings or null.
    • Expected commands: None (pure function).
  • Brittleness assessment: Low. Pure data transformation. Tests would break only if the SnapshotData shape changes, which is the behavioral contract.

T8: useControlStream WS frame parsing — snapshot vs delta discrimination

  • Priority: High
  • Test level: Unit (with mocked WebSocket)
  • Entry point: apps/web/src/hooks/useControlStream.tsx:154-172
  • Gap type: Untested
  • Test approach:
    • Behavior:
      1. On control_fleet frame with hosts array where first element lacks a seq field: treated as snapshot, updates snapshot seq map, replaces hosts state.
      2. On control_fleet frame with hosts array where first element has seq > snapshot seq: treated as delta, updates hosts state.
      3. On control_fleet frame with seq <= snapshot seq: discarded.
      4. On ping frame: ignored (no state change).
      5. On control_activity, control_perf, control_log: respective state arrays updated.
    • Stubs: Create a mock WebSocket and mock window.location. Use renderHook from React Testing Library to render useControlStream (or test ControlProvider with controlled message injection).
    • Input/Action: Simulate ws.onmessage events with different frame payloads.
    • Expected output: State updated correctly per frame type. Snapshot seq map tracks host seqs. Deltas with stale seq filtered.
  • Brittleness assessment: Medium. Requires React Testing Library setup. The frame discrimination logic (checking Array.isArray(frame.hosts) and 'providerId' in firstHost) is fragile — any change to the sentinel heuristic breaks it. The test should verify correct behavioral outcome, not the heuristic itself.

T9: runRollup — idempotent upsert behavior

  • Priority: Medium
  • Test level: Integration (requires DB)
  • Entry point: apps/control/src/services/retention.ts:34runRollup
  • Gap type: Untested
  • Test approach:
    • Behavior: Given performance samples in control_perf_samples, produces 5-minute bucket rollups via idempotent upsert. Re-running the same window produces identical results (no duplicate rows, no data loss).
    • Stubs: Requires DATABASE_URL. Follow tool_cost_stats.test.ts pattern: describe.runIf(!!DATABASE_URL) with beforeAll applying schema.
    • Input/Action: Insert test samples, run rollup, assert rollup rows exist. Run rollup again, assert same row count.
    • Expected output: Bucket rows with gpu_agg/sys_agg containing distinct aggregated samples.
  • Brittleness assessment: Low. Standard DB integration test pattern. Schema changes would break it, which is appropriate.

T10: pruneRawSamples — chunked deletion

  • Priority: Medium
  • Test level: Integration (requires DB)
  • Entry point: apps/control/src/services/retention.ts:73pruneRawSamples
  • Gap type: Untested
  • Test approach:
    • Behavior: Deletes raw samples older than the retention window. Uses chunked (1000-row) deletes. Stops when no more rows to delete.
    • Stubs: Same as T9.
    • Input/Action: Insert samples both within and outside the retention window. Run pruneRawSamples with short hours parameter. Assert only old samples deleted.
    • Expected output: Old samples deleted, recent samples preserved.
  • Brittleness assessment: Low.

T11: loadConfig — env var parsing

  • Priority: Medium
  • Test level: Unit
  • Entry point: apps/control/src/config.ts:17loadConfig
  • Gap type: Untested
  • Test approach:
    • Behavior: Parses process.env with Zod schema. Applies defaults for optional vars. Exits process on invalid input.
    • Stubs: Set process.env before each test; restore after.
    • Input/Action: Call loadConfig() with valid env (DATABASE_URL set), missing optional vars, invalid LOG_LEVEL.
    • Expected output: Returns parsed config object with defaults applied. On invalid input, calls process.exit(1).
    • Expected commands: N/A.
  • Brittleness assessment: Low. Standard env-parse test. Spying process.exit is a one-liner.

T12: buildRetentionConfig — config transformation

  • Priority: Medium
  • Test level: Unit
  • Entry point: apps/control/src/services/retention.ts:21buildRetentionConfig
  • Gap type: Untested
  • Test approach:
    • Behavior: Maps Config env values to RetentionConfig fields.
    • Stubs: Mock config object.
    • Input/Action: Call with known config values.
    • Expected output: Returns RetentionConfig with correct mapping.
  • Brittleness assessment: Low. Pure trivial mapping. Low value but zero cost to test.

Deferred / Skipped Tests

S1: Liveness transition state machine (as currently written)

  • Entry point: apps/control/src/services/__tests__/liveness.test.ts:6-101 (test file, no production entry point)
  • Reason: This test verifies a transitionLiveness function that does not exist in production code. The production code uses inline state.liveness = 'connected' assignments. The test provides zero behavioral coverage of any production code path. The liveness transitions are implicitly covered by T4 (handleLlamaSweepEvent sets liveness to 'connected', onReconnectGiveUp sets 'down'). Adding a test for the non-existent function would be YAGNI. If a state machine extraction is performed later, the test belongs with that extracted function. Delete the existing test file and rely on T4 coverage.

S2: seq-logic test (test-local copy)

  • Entry point: apps/control/src/services/__tests__/seq-logic.test.ts:17-107 (test-local functions, no production entry point)
  • Reason: Tests a test-local reimplementation of the buffer-then-filter seq logic that does not correspond to any importable production function. The real seq logic is inlined in useControlStream.tsx:154-172. This test validates the concept is correct but provides no coverage of the actual implementation. Replace with T8 which tests the real handler. The serial-number concept test is YAGNI until the logic is extracted.

S3: reconcile.test.ts placeholder tests (expect(true).toBe(true))

  • Entry point: apps/control/src/services/__tests__/reconcile.test.ts (all 8 lines of assertions)
  • Reason: These tests assert nothing. They are correctly gated on DATABASE_URL for eventual DB integration. The gap detection logic they describe is covered by T5 which uses mocked sql for faster, more deterministic coverage. When real DB integration tests are needed for the reconcile path, replace these placeholders with actual assertions rather than keeping both.

S4: host-access.ts tests

  • Entry point: apps/control/src/services/host-access.ts:13acquireHostAccess
  • Reason: V1 implementation is a no-op returning { ok: true }. Testing this would be Coverage Metric Chasing — there is no meaningful observable behavior to verify. When P8 replaces the body with actual DB lease logic, add tests at that time.

Coverage Estimate

After all High- and Medium-priority tests are written (T1T12):

  • Core SSE loop (fleet-connector.ts): 100% — addJitter, reconnectDecision, the loop, error handling, give-up all tested.
  • Fleet state (fleet-state.ts): 100% — already fully tested at the helpers level.
  • Event handlers (index.ts): ~95% — all 4 event type branches, reconcile gap detection, perf polling tested. The main() function itself remains untested (orchestration is better covered by integration tests).
  • Retention (retention.ts): ~90% — trimCapture tested, runRollup + pruneRawSamples tested via integration, pruneActivity/pruneModelEvents remain lower-priority.
  • WS endpoint (routes/ws.ts): ~80% — buildSnapshot tested. registerControlWebSocket lifecycle (WS open/close/error handlers) remains integration-level.
  • Frontend (useControlStream.tsx): ~80% — frame parsing and seq filtering tested. Reconnect timer cleanup and WS lifecycle remain untested at unit level (better suited for integration/browser test).
  • Config/DB: ~70% — loadConfig tested. getSql singleton, waitForTable, applySchema remain integration-level.

Untested behaviors that remain intentionally deferred:

  • main() orchestrator function (startup sequence, timer setup, graceful shutdown) — best covered by end-to-end or container-level test, not meaningful at unit level
  • registerControlWebSocket full WS lifecycle — requires Fastify integration test harness
  • pruneActivity and pruneModelEvents — trivial single-statement queries, low risk
  • host-access.ts — no-op, testing it is coverage metric chasing