# Test Plan: BooControl P1 Implementation

## Scope

Files analyzed:
- `apps/control/src/services/fleet-connector.ts` — SSE parsing loop, reconnect loop, jitter, backoff
- `apps/control/src/services/fleet-state.ts` — types + helpers
- `apps/control/src/services/retention.ts` — retention job functions (runRollup, prune*)
- `apps/control/src/services/host-access.ts` — no-op host access grant
- `apps/control/src/index.ts` — application orchestrator (event handlers, reconcile, perf poller, main)
- `apps/control/src/routes/ws.ts` — WebSocket endpoint, buildSnapshot
- `apps/control/src/config.ts` — env var parsing
- `apps/control/src/db.ts` — database connection + schema application
- `apps/web/src/hooks/useControlStream.tsx` — WS client, frame parsing, snapshot/delta seq filtering

Test files analyzed:
- `apps/control/src/services/__tests__/fleet-connector.test.ts`
- `apps/control/src/services/__tests__/fleet-state.test.ts`
- `apps/control/src/services/__tests__/liveness.test.ts`
- `apps/control/src/services/__tests__/reconcile.test.ts`
- `apps/control/src/services/__tests__/retention.test.ts`
- `apps/control/src/services/__tests__/seq-logic.test.ts`

Branch: untracked (new feature, not yet committed). Recency analysis: N/A — all files are new, no git history.

## Summary

BooControl P1 has 6 test files covering only 4 pure utility functions (addJitter, reconnectDecision, trimCapture, fleet-state helpers) plus 2 tests that test non-production code (liveness test doubles a function that doesn't exist in the source; seq-logic tests a local copy not the real handler). The core orchestration — SSE parsing/reconnect loop, LlamaSweep event handling, reconcile gap detection, performance polling, WebSocket snapshot building, and the frontend WS stream hook — have zero behavioral coverage. The SSE parsing loop contains a confirmed bug (line 158 skips all `data:` lines, preventing any event from being parsed). The reconcile test is a placeholder. Of the 6 real flaws detectable by behavioral tests, only the missing-jitter-in-backoff concern (which prompted addJitter) was preemptively covered.

| Priority | Count |
|----------|-------|
| High     | 8     |
| Medium   | 4     |
| Low      | 0     |
| Skipped  | 4     |

Full analysis written to: /home/samkintop/opt/boocode/test-plan.md

## Coverage Assessment

**Well-tested:** `addJitter`, `reconnectDecision`, `trimCapture`, `createFleetState`/`ensureHostState`/`stampLastSeen`. These pure functions have clean bounded-assertion tests that verify the behavioral contract without over-specification.

**Placeholder tests (value = 0):**
- `reconcile.test.ts` — `expect(true).toBe(true)` with DB context comments. No assertions exercised.
- `liveness.test.ts` — tests a `transitionLiveness` function that does not exist in production code. The real liveness transitions (`state.liveness = 'connected'` inline in `handleLlamaSweepEvent`/`handleReconcile`) have no tests.
- `seq-logic.test.ts` — tests a local copy of buffer-then-filter logic that does not correspond to any importable production function. The real seq filtering is inlined in `useControlStream.tsx`'s `ws.onmessage` handler and has no tests.

**Completely untested public API surface:**
- `runFleetConnector` / `startFleetConnector` — the entire SSE loop with streaming read, line parsing, error handling, and reconnect
- `handleLlamaSweepEvent` — all 4 event type branches (modelStatus, logData, metrics, inflight)
- `handleReconcile` — gap detection logic, entry ingestion
- `pollPerformance` — watermark recovery, API fetch, sample insertion
- `buildSnapshot` — fleet state serialization
- `registerControlWebSocket` — WS endpoint lifecycle
- `useControlStream.tsx` — WS connect, reconnect, frame dispatch, seq filtering, all 5 frame types
- `buildRetentionConfig`, `runRollup`, `pruneRawSamples`, `pruneActivity`, `pruneModelEvents`
- `loadConfig`, `getSql`, `waitForTable`, `applySchema`, `pingDb`

**Known bugs detectable by tests:**
1. SSE parsing loop always skips `data:` lines (line 158: `trimmed.startsWith('data:')` continues instead of only skipping non-data lines)
2. `incrementSeq` defined but never called — seq stays at 0 for all hosts
3. `control_job` frame handler in `useControlStream.tsx` pushes dummy empty entry instead of actual frame data (line 195)

## Findings

**T1: SSE parsing loop — data: line filter bug + end-to-end event processing**
- **Priority:** High
- **Test level:** Unit (with mocked fetch/reader)
- **Entry point:** `apps/control/src/services/fleet-connector.ts:122` — `runFleetConnector`
- **Gap type:** Untested
- **Test approach:**
  - **Behavior:** Given a mocked fetch returning a ReadableStream with SSE-formatted chunks, verify that `onEvent` is called with correctly typed events for each SSE envelope type (modelStatus, logData, metrics, inflight). This test would immediately catch the line-158 bug.
  - **Stubs:** `deps.isUp()` returns `true` initially then `false` after first event; `deps.onEvent` is a spy; `deps.onReconnectGiveUp` is a stub; `deps.sleep` is a fast-resolve stub; `deps.log.warn` is a spy; `deps.sql` is unused in the parse path; mock `fetch` to return a `Response` with a readable stream that yields SSE-formatted chunks.
  - **Input/Action:** Call `runFleetConnector('test-host', 'http://localhost:8401', abort, deps)` with `abort` that fires after one event.
  - **Expected output:** `deps.onEvent` called exactly once per parseable SSE event. Each call receives the correct providerId and a typed `LlamaSweepSSEEvent` with the correct `type` and parsed `data`.
  - **Expected commands:** Verify onEvent call count and arguments.
- **Brittleness assessment:** Low. Stubbing `fetch` is the standard port in JS. The test verifies the parsing outcome (events dispatched), not internal call ordering. If the SSE chunking strategy changes, the test data changes but the assertion doesn't.

**T2: SSE error handling — HTTP error, network error, empty body**
- **Priority:** High
- **Test level:** Unit
- **Entry point:** `apps/control/src/services/fleet-connector.ts:122` — `runFleetConnector`
- **Gap type:** Untested
- **Test approach:**
  - **Behavior:** When fetch returns a non-ok response (e.g., 500), the connector should log the error, increment failure count, and enter the reconnect backoff loop. When fetch throws (network error), same flow. When the body is `null`, throw. When the stream ends cleanly, reset failures to 0 and sleep at base delay.
  - **Stubs:** Same deps as T1. Mock `fetch` to reject once then resolve on subsequent calls (so the loop continues). Or mock `fetch` to return `{ ok: false, status: 500, statusText: 'Internal Server Error' }`.
  - **Input/Action:** Call `runFleetConnector` with appropriate mocks. Use `abort` after the error decision is made.
  - **Expected output:** `deps.log.warn` called with relevant context. `deps.sleep` called with jittered delay. Failure state handled.
  - **Expected commands:** Verify `deps.log.warn` was called with error context. Verify `deps.sleep` was called with a delay >= baseMs.
- **Brittleness assessment:** Low. Standard mock patterns. The test checks that errors route to backoff, not the exact error message.

**T3: Reconnect give-up — circuit breaker triggers onReconnectGiveUp**
- **Priority:** High
- **Test level:** Unit
- **Entry point:** `apps/control/src/services/fleet-connector.ts:189-192`
- **Gap type:** Untested
- **Test approach:**
  - **Behavior:** After `maxAttempts` consecutive failures, the connector calls `deps.onReconnectGiveUp(providerId)` and breaks from the loop instead of continuing to reconnect.
  - **Stubs:** `deps.isUp()` returns `true`; `deps.sleep` resolves instantly; `deps.onReconnectGiveUp` is a spy; mock `fetch` to always reject (network error). Set policy to `{ baseMs: 1, maxMs: 10, maxAttempts: 3 }` for fast test.
  - **Input/Action:** Call `runFleetConnector`. It will retry 3 times then give up.
  - **Expected output:** `deps.onReconnectGiveUp` called exactly once with the providerId. Loop exits.
  - **Expected commands:** Verify `onReconnectGiveUp` call and that `onEvent` was never called.
- **Brittleness assessment:** Low. The retry count is explicit in the policy. Clean deterministic test.

**T4: handleLlamaSweepEvent — all 4 event type branches**
- **Priority:** High
- **Test level:** Unit (pure logic with stubbed DB)
- **Entry point:** `apps/control/src/services/../index.ts:44` — `handleLlamaSweepEvent` (note: this function is not exported; needs extraction or testing via startFleetConnector's deps.onEvent)
- **Gap type:** Untested
- **Test approach:**
  - **Behavior:** 
    1. `modelStatus` event: creates host state if needed, stamps lastSeen, sets liveness to 'connected', creates model state with ttlDeadline, persists model event to DB.
    2. `logData` event: no-op (no DB writes, no state changes beyond stampLastSeen).
    3. `metrics` event: trims captures, inserts request rows with capture (or NULL when no capture), ON CONFLICT DO NOTHING.
    4. `inflight` event: updates existing model state inflight count, no-op if model not in state.
  - **Stubs:** `fleet` is a `createFleetState()`; `sql` is a mock where each template-literal call resolves to empty; `config` is a mock with CAPTURE_SIZE_KB.
  - **Input/Action:** Call handler with each event type and verify state mutations + DB calls.
  - **Expected output:** Fleet state reflects the event. Liveness, lastSeenAt, models map updated correctly.
  - **Expected commands:** Verify `sql` was called with correct SQL template for modelStatus/metrics events. Verify sql NOT called for logData.
- **Brittleness assessment:** Medium. The SQL template literal assertions are the brittle part — exact SQL string matching couples to query formatting. Mitigate by using a helper that checks table name and key bindings rather than full SQL string equality. Consider extracting event handlers into a service with injectable sql for cleaner test boundaries.

**T5: handleReconcile — gap detection logic**
- **Priority:** High
- **Test level:** Integration (DB required) or Unit (with sql mock)
- **Entry point:** `apps/control/src/services/../index.ts:102` — `handleReconcile` (not exported)
- **Gap type:** Partially tested (placeholder test exists but asserts nothing)
- **Test approach:**
  - **Behavior:** 
    1. When oldest reconcile entry is newer than newest persisted entry, insert a `gap_suspected` model event.
    2. When overlap exists (oldest reconcile <= newest persisted), no gap event inserted.
    3. When no persisted entries exist (first reconcile), no gap (no comparison possible).
    4. All reconcile entries are ingested with dedup (ON CONFLICT DO NOTHING).
  - **Stubs:** Fleet state; sql mock that returns controlled results for the `SELECT ts FROM control_requests ORDER BY ts DESC LIMIT 1` query.
  - **Input/Action:** Call handler with metrics data containing known timestamps.
  - **Expected output:** When gap exists, sql called to insert `gap_suspected`. When no gap, no such insertion.
  - **Expected commands:** Verify sql calls for gap_suspected insertion (or absence). Verify sql calls for each reconcile entry ingestion.
- **Brittleness assessment:** Medium (same SQL coupling as T4). The gap detection is a critical business rule — worth it.

**T6: pollPerformance — watermark resume + data ingestion**
- **Priority:** High
- **Test level:** Unit (with mocked fetch and sql)
- **Entry point:** `apps/control/src/services/../index.ts:158` — `pollPerformance` (not exported)
- **Gap type:** Untested
- **Test approach:**
  - **Behavior:** 
    1. Reads MAX(ts) watermark from DB per provider.
    2. Fetches `/api/performance` with `?after=` param when watermark exists, without when null.
    3. Inserts returned samples with dedup.
    4. Stamps lastSeen on success.
    5. Silently handles fetch failures (no throw, no crash).
  - **Stubs:** sql mock returning controlled watermark; fetch mock returning performance JSON; fleet state.
  - **Input/Action:** Call `pollPerformance` with different watermark scenarios (null, existing ts).
  - **Expected output:** Fetch called with correct URL (with/without `?after=`). sql insert called for each sample. `state.lastSeenAt` updated.
  - **Expected commands:** Verify fetch URL; verify sql insert calls per sample.
- **Brittleness assessment:** Medium-ish. Fetch URL construction is simple string concatenation. The main risk is SQL coupling. Acceptable for this level.

**T7: buildSnapshot — fleet state serialization**
- **Priority:** High
- **Test level:** Unit
- **Entry point:** `apps/control/src/routes/ws.ts:62` — `buildSnapshot` (not exported; needs extraction or indirect test)
- **Gap type:** Untested
- **Test approach:**
  - **Behavior:** Given a FleetState with hosts, models, dates, produces a serializable SnapshotData object with correct structure. Null lastSeenAt → null. Date objects → ISO strings. Null ttlDeadline → null. Empty fleet → empty hosts array.
  - **Stubs:** Create FleetState via `createFleetState()`, populate with hosts and models.
  - **Input/Action:** Call `buildSnapshot(fleet)`.
  - **Expected output:** SnapshotData.hosts array with correct length, each host has correct field types, dates are ISO strings or null.
  - **Expected commands:** None (pure function).
- **Brittleness assessment:** Low. Pure data transformation. Tests would break only if the SnapshotData shape changes, which is the behavioral contract.

**T8: useControlStream WS frame parsing — snapshot vs delta discrimination**
- **Priority:** High
- **Test level:** Unit (with mocked WebSocket)
- **Entry point:** `apps/web/src/hooks/useControlStream.tsx:154-172`
- **Gap type:** Untested
- **Test approach:**
  - **Behavior:** 
    1. On `control_fleet` frame with `hosts` array where first element lacks a `seq` field: treated as snapshot, updates snapshot seq map, replaces hosts state.
    2. On `control_fleet` frame with `hosts` array where first element has seq > snapshot seq: treated as delta, updates hosts state.
    3. On `control_fleet` frame with seq <= snapshot seq: discarded.
    4. On `ping` frame: ignored (no state change).
    5. On `control_activity`, `control_perf`, `control_log`: respective state arrays updated.
  - **Stubs:** Create a mock WebSocket and mock `window.location`. Use `renderHook` from React Testing Library to render `useControlStream` (or test `ControlProvider` with controlled message injection).
  - **Input/Action:** Simulate `ws.onmessage` events with different frame payloads.
  - **Expected output:** State updated correctly per frame type. Snapshot seq map tracks host seqs. Deltas with stale seq filtered.
- **Brittleness assessment:** Medium. Requires React Testing Library setup. The frame discrimination logic (checking `Array.isArray(frame.hosts)` and `'providerId' in firstHost`) is fragile — any change to the sentinel heuristic breaks it. The test should verify correct behavioral outcome, not the heuristic itself.

**T9: runRollup — idempotent upsert behavior**
- **Priority:** Medium
- **Test level:** Integration (requires DB)
- **Entry point:** `apps/control/src/services/retention.ts:34` — `runRollup`
- **Gap type:** Untested
- **Test approach:**
  - **Behavior:** Given performance samples in `control_perf_samples`, produces 5-minute bucket rollups via idempotent upsert. Re-running the same window produces identical results (no duplicate rows, no data loss).
  - **Stubs:** Requires DATABASE_URL. Follow `tool_cost_stats.test.ts` pattern: `describe.runIf(!!DATABASE_URL)` with `beforeAll` applying schema.
  - **Input/Action:** Insert test samples, run rollup, assert rollup rows exist. Run rollup again, assert same row count.
  - **Expected output:** Bucket rows with gpu_agg/sys_agg containing distinct aggregated samples.
- **Brittleness assessment:** Low. Standard DB integration test pattern. Schema changes would break it, which is appropriate.

**T10: pruneRawSamples — chunked deletion**
- **Priority:** Medium
- **Test level:** Integration (requires DB)
- **Entry point:** `apps/control/src/services/retention.ts:73` — `pruneRawSamples`
- **Gap type:** Untested
- **Test approach:**
  - **Behavior:** Deletes raw samples older than the retention window. Uses chunked (1000-row) deletes. Stops when no more rows to delete.
  - **Stubs:** Same as T9.
  - **Input/Action:** Insert samples both within and outside the retention window. Run `pruneRawSamples` with short `hours` parameter. Assert only old samples deleted.
  - **Expected output:** Old samples deleted, recent samples preserved.
- **Brittleness assessment:** Low.

**T11: loadConfig — env var parsing**
- **Priority:** Medium
- **Test level:** Unit
- **Entry point:** `apps/control/src/config.ts:17` — `loadConfig`
- **Gap type:** Untested
- **Test approach:**
  - **Behavior:** Parses `process.env` with Zod schema. Applies defaults for optional vars. Exits process on invalid input.
  - **Stubs:** Set `process.env` before each test; restore after.
  - **Input/Action:** Call `loadConfig()` with valid env (DATABASE_URL set), missing optional vars, invalid LOG_LEVEL.
  - **Expected output:** Returns parsed config object with defaults applied. On invalid input, calls `process.exit(1)`.
  - **Expected commands:** N/A.
- **Brittleness assessment:** Low. Standard env-parse test. Spying `process.exit` is a one-liner.

**T12: buildRetentionConfig — config transformation**
- **Priority:** Medium
- **Test level:** Unit
- **Entry point:** `apps/control/src/services/retention.ts:21` — `buildRetentionConfig`
- **Gap type:** Untested
- **Test approach:**
  - **Behavior:** Maps Config env values to RetentionConfig fields.
  - **Stubs:** Mock config object.
  - **Input/Action:** Call with known config values.
  - **Expected output:** Returns RetentionConfig with correct mapping.
- **Brittleness assessment:** Low. Pure trivial mapping. Low value but zero cost to test.

## Deferred / Skipped Tests

**S1: Liveness transition state machine (as currently written)**
- **Entry point:** `apps/control/src/services/__tests__/liveness.test.ts:6-101` (test file, no production entry point)
- **Reason:** This test verifies a `transitionLiveness` function that does not exist in production code. The production code uses inline `state.liveness = 'connected'` assignments. The test provides zero behavioral coverage of any production code path. The liveness transitions are implicitly covered by T4 (handleLlamaSweepEvent sets liveness to 'connected', onReconnectGiveUp sets 'down'). Adding a test for the non-existent function would be YAGNI. If a state machine extraction is performed later, the test belongs with that extracted function. Delete the existing test file and rely on T4 coverage.

**S2: seq-logic test (test-local copy)**
- **Entry point:** `apps/control/src/services/__tests__/seq-logic.test.ts:17-107` (test-local functions, no production entry point)
- **Reason:** Tests a test-local reimplementation of the buffer-then-filter seq logic that does not correspond to any importable production function. The real seq logic is inlined in `useControlStream.tsx:154-172`. This test validates the concept is correct but provides no coverage of the actual implementation. Replace with T8 which tests the real handler. The serial-number concept test is YAGNI until the logic is extracted.

**S3: reconcile.test.ts placeholder tests (expect(true).toBe(true))**
- **Entry point:** `apps/control/src/services/__tests__/reconcile.test.ts` (all 8 lines of assertions)
- **Reason:** These tests assert nothing. They are correctly gated on `DATABASE_URL` for eventual DB integration. The gap detection logic they describe is covered by T5 which uses mocked sql for faster, more deterministic coverage. When real DB integration tests are needed for the reconcile path, replace these placeholders with actual assertions rather than keeping both.

**S4: host-access.ts tests**
- **Entry point:** `apps/control/src/services/host-access.ts:13` — `acquireHostAccess`
- **Reason:** V1 implementation is a no-op returning `{ ok: true }`. Testing this would be Coverage Metric Chasing — there is no meaningful observable behavior to verify. When P8 replaces the body with actual DB lease logic, add tests at that time.

## Coverage Estimate

After all High- and Medium-priority tests are written (T1–T12):

- **Core SSE loop (fleet-connector.ts):** 100% — `addJitter`, `reconnectDecision`, the loop, error handling, give-up all tested.
- **Fleet state (fleet-state.ts):** 100% — already fully tested at the helpers level.
- **Event handlers (index.ts):** ~95% — all 4 event type branches, reconcile gap detection, perf polling tested. The `main()` function itself remains untested (orchestration is better covered by integration tests).
- **Retention (retention.ts):** ~90% — `trimCapture` tested, `runRollup` + `pruneRawSamples` tested via integration, `pruneActivity`/`pruneModelEvents` remain lower-priority.
- **WS endpoint (routes/ws.ts):** ~80% — `buildSnapshot` tested. `registerControlWebSocket` lifecycle (WS open/close/error handlers) remains integration-level.
- **Frontend (useControlStream.tsx):** ~80% — frame parsing and seq filtering tested. Reconnect timer cleanup and WS lifecycle remain untested at unit level (better suited for integration/browser test).
- **Config/DB:** ~70% — `loadConfig` tested. `getSql` singleton, `waitForTable`, `applySchema` remain integration-level.

**Untested behaviors that remain intentionally deferred:**
- `main()` orchestrator function (startup sequence, timer setup, graceful shutdown) — best covered by end-to-end or container-level test, not meaningful at unit level
- `registerControlWebSocket` full WS lifecycle — requires Fastify integration test harness
- `pruneActivity` and `pruneModelEvents` — trivial single-statement queries, low risk
- `host-access.ts` — no-op, testing it is coverage metric chasing