# Review: BooControl P1 (uncommitted working tree)

## Scope

`apps/control/**` (new Fastify host service: SSE fleet connector w/ backoff+jitter, perf poller, seq-stamped in-memory fleet state, WS endpoint, retention job, schema.sql, db.ts waitForTable, 6 test files), `apps/server/src/routes/control-proxy.ts`, `packages/contracts/src/ws-frames.ts` control_* frames, `apps/web/src/pages/Control.tsx`, `apps/web/src/hooks/useControlStream.tsx`, `apps/web/src/components/control/**` (HostCard, FleetTab, ActivityTab, PerfChart, VramGauge, TtlRing, buildEChartsTheme).

## Size

**Large** -- new host service (5 source files, 6 tests), cross-app WS contract additions (contracts + server proxy + web hook + 7 UI components), touches DB, SSE, WebSocket, and rendering surfaces.

## Summary

The SSE fleet connector's line parser is logic-inverted (skips the lines it tries to match), making the entire ingestion pipeline dead code. Beyond that, three compounding issues make the WS endpoint non-functional: `incrementSeq` is never called (seq stays 0), the WS handler has no delta-publishing mechanism, and the snapshot wire format nests `hosts` under a `snapshot` key the client never reads. The retention job will crash on first execution because `pruneRawSamples` references a non-existent `id` column. The `onEvent` callback drops async errors, meaning a single DB failure crashes the process. In total, the backend pipeline (SSE -> parse -> store -> WS publish) is broken at every link, and the frontend implements a protocol the server does not speak. None of the core data flows work end-to-end.

| Classification | Count |
|----------------|-------|
| Blocking | 8 |
| Advisory | 10 |
| Nit | 5 |

## Findings

### Blocking

**B1: SSE line parser is logic-inverted -- all events silently dropped**

- **Location:** `apps/control/src/services/fleet-connector.ts:158`
- **Evidence:**
  ```typescript
  // Line 158: SKIP any line starting with "data:"
  if (!trimmed || trimmed.startsWith('data:')) continue;
  // Line 160: But THEN require the line to start with "data:" to proceed
  const dataMatch = trimmed.match(/^data:\s*(.+)$/);
  if (!dataMatch) continue;
  ```
- **Standard violated:** SSE parsing correctness. The filter and the regex are contradictory: lines matching the regex are filtered out before reaching it. The `onEvent` callback at line 169 is unreachable dead code.
- **Risk:** This is the root entry point of the entire data pipeline. No SSE events from any llama-swap host ever reach `handleLlamaSweepEvent` or `handleReconcile`. The in-memory fleet state is never populated. The DB is never written to. The WS snapshot is always empty. The entire BooControl cockpit is non-functional at runtime.
- **Fix sketch:** Remove the `startsWith('data:')` filter on line 158. If the format is standard SSE (`event: type\ndata: json`), accumulate event type from `event:` lines and payload from `data:` lines, emit on blank line. If the format is non-standard single-line (`type: json`), use a single regex like `/^(\w+):\s*(.+)$/` and remove the `data:` prefix check entirely. The `eventType = trimmed.split(':')[0]` on line 167 also breaks on JSON payloads containing colons (timestamps).

**B2: `incrementSeq` defined but never called -- seq stays 0 forever**

- **Location:** `apps/control/src/index.ts:33-36`
- **Evidence:**
  ```typescript
  function incrementSeq(state: HostState): number {
    state.seq += 1;
    return state.seq;
  }
  ```
  No call site in the codebase invokes `incrementSeq`. Every `HostState` starts with `seq: 0` and stays there.
  The client-side dedup guard at `useControlStream.tsx:168` (`if (frame.seq > snapshotSeq)`) discards every delta since `0 > 0` is false.
- **Standard violated:** The seq-stamped delta protocol described in `design.md` section 4 ("per-host monotonic seq, incremented on every mutation").
- **Risk:** Even with SSE parsing fixed, no delta would ever pass the client's seq filter. Live updates are structurally impossible.
- **Fix sketch:** Call `incrementSeq(state)` inside `handleLlamaSweepEvent` and `handleReconcile` after every fleet-state mutation, before the DB write. Include the returned seq in the delta published to WS subscribers.

**B3: WS handler has no delta-publishing mechanism -- `onFleetDelta` is dead code**

- **Location:** `apps/control/src/routes/ws.ts:30-39`
- **Evidence:**
  ```typescript
  const onFleetDelta = (delta: unknown) => {
    if (socket.readyState === WebSocket.OPEN) {
      socket.send(JSON.stringify(delta));
    }
  };
  // Comment: "In practice, the fleet service should publish deltas through a channel
  // that this handler subscribes to. For now, we use a simple approach:
  // the fleet state is rebuilt on each snapshot request."
  ```
  The callback is defined but nothing subscribes to it or calls it. There is no event emitter, no pub/sub channel, no polling loop.
- **Standard violated:** design.md section 4: "Fan-out to browser: the control service publishes over its own WS."
- **Risk:** WS clients get a one-shot snapshot at connection time and then go permanently stale. Model state changes, activity events, perf samples, and logs are never pushed to the frontend.
- **Fix sketch:** Add an `EventEmitter` (or a simple `Set` pattern matching `sessionEvents.ts`) to the fleet state. Have `handleLlamaSweepEvent`/`handleReconcile` publish seq-stamped deltas through it. The WS handler registers a listener on connect and removes it on close.

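
The B2 and B3 fix sketches fit together: bump the seq on each mutation, then fan the stamped delta out through a listener set. The following is a minimal sketch of that shape, not the service's actual code. The `HostState` and `FleetDelta` fields and the helper names `createDeltaEmitter` and `publishHostMutation` are assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; the real types live in the control
// service and packages/contracts and may differ.
interface HostState {
  providerId: string;
  seq: number;
  liveness: 'connected' | 'down';
}

interface FleetDelta {
  type: 'control_fleet';
  seq: number;
  hosts: { providerId: string; liveness: string }[];
}

type DeltaListener = (delta: FleetDelta) => void;

// Set-based pub/sub: subscribe returns an unsubscribe function.
function createDeltaEmitter(onListenerError: (err: unknown) => void = () => {}) {
  const listeners = new Set<DeltaListener>();
  return {
    subscribe(listener: DeltaListener): () => void {
      listeners.add(listener);
      return () => {
        listeners.delete(listener);
      };
    },
    publish(delta: FleetDelta): void {
      // Copy the set so unsubscribing during publish is safe, and isolate
      // each listener so one failing socket cannot starve the others.
      Array.from(listeners).forEach((listener) => {
        try {
          listener(delta);
        } catch (err) {
          onListenerError(err);
        }
      });
    },
    listenerCount(): number {
      return listeners.size;
    },
  };
}

type DeltaEmitter = ReturnType<typeof createDeltaEmitter>;

// Mutate first, bump seq, then publish the changed host with its new seq.
function publishHostMutation(
  state: HostState,
  emitter: DeltaEmitter,
  mutate: (s: HostState) => void,
): number {
  mutate(state);
  state.seq += 1;
  emitter.publish({
    type: 'control_fleet',
    seq: state.seq,
    hosts: [{ providerId: state.providerId, liveness: state.liveness }],
  });
  return state.seq;
}
```

In a WS handler, the function returned by `subscribe` would be kept on connect and called on close or error, so listeners do not outlive their sockets.
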
**B4: Snapshot wire format mismatch -- client never receives host data**

- **Location:** `apps/control/src/routes/ws.ts:24-27` vs `apps/web/src/hooks/useControlStream.tsx:157`
- **Evidence:** Server sends:
  ```typescript
  socket.send(JSON.stringify({
    type: 'control_fleet' as const,
    snapshot, // { hosts: [...] } nested under "snapshot" key
  }));
  ```
  Client reads:
  ```typescript
  if (frame.hosts && Array.isArray(frame.hosts)) { // frame.hosts is undefined
  ```
  The `hosts` array is at `frame.snapshot.hosts`, not `frame.hosts`. The client silently ignores the frame.
- **Standard violated:** Wire format contract between `ws.ts` and `useControlStream.tsx`. The `ControlFleetFrame` Zod schema in `ws-frames.ts:492-508` expects `seq` and `hosts` at the top level, which the snapshot does not provide.
- **Risk:** Even if B1-B3 were fixed, the client would never populate the Fleet tab. The page would show "No hosts connected" permanently.
- **Fix sketch:** Change the server to send `{ type: 'control_fleet', seq: host.seq, hosts: [...] }` at the top level (matching the Zod schema). Alternatively, change the client to read `frame.snapshot.hosts`. The former is simpler and aligns with the contracts schema.

**B5: `onEvent` callback drops async errors -- DB failure crashes the process**

- **Location:** `apps/control/src/services/fleet-connector.ts:101,169` + `apps/control/src/index.ts:253`
- **Evidence:**
  ```typescript
  // fleet-connector.ts:101 -- typed as returning void
  onEvent: (providerId: string, event: LlamaSweepSSEEvent) => void;
  // fleet-connector.ts:169 -- called without await
  deps.onEvent(providerId, event);
  // index.ts:253 -- implementation is async
  onEvent: (pid, event) => handleLlamaSweepEvent(fleet, sql, config, pid, event),
  ```
  `handleLlamaSweepEvent` is async and performs SQL INSERTs. The returned Promise is discarded. Any SQL failure (connection timeout, pool exhaustion) becomes an unhandled rejection. Node 15+ crashes on unhandled rejections by default.
- **Standard violated:** Async error handling discipline. The `onReconcile` callback IS typed as returning `Promise<void>`, showing the async pattern was intended.
- **Risk:** A single transient DB error during SSE event processing crashes the entire BooControl process. Under high event throughput, unbounded concurrent DB writes also exhaust the 10-connection pool, causing cascading timeouts.
- **Fix sketch:** Add `.catch()` to the onEvent call: `Promise.resolve(deps.onEvent(providerId, event)).catch((err) => { deps.log.error({ providerId, err }, 'fleet: onEvent failed'); });`. Change the type to `(providerId: string, event: LlamaSweepSSEEvent) => void | Promise<void>`. For backpressure, consider a bounded queue (e.g., p-queue with concurrency capped at pool size minus headroom).

**B6: `pruneRawSamples` references non-existent `id` column -- guaranteed SQL error**

- **Location:** `apps/control/src/services/retention.ts:78-88`
- **Evidence:**
  ```typescript
  const toDelete = await sql<{ id: number }[]>`
    SELECT id FROM control_perf_samples -- no "id" column in this table
    WHERE provider_id = ${providerId} AND ts < ${cutoff.toISOString()}
    ORDER BY ts DESC
    LIMIT ${chunkSize}
  `;
  ```
  `control_perf_samples` schema (`schema.sql:49-55`): `(provider_id TEXT, ts TIMESTAMPTZ, gpu JSONB, sys JSONB)` -- no `id` column. Compare with `control_requests` which has `id BIGSERIAL PRIMARY KEY`.
- **Standard violated:** Schema/code consistency. The retention function was likely written for `control_requests` and copied without adapting to `control_perf_samples`'s composite-key schema.
- **Risk:** The daily retention job throws `column "id" does not exist` on first execution. The error propagates from the `setInterval` callback as an unhandled rejection, crashing the service.
- **Fix sketch:** Rewrite to chunk by `(provider_id, ts)` composite key:
  ```typescript
  const toDelete = await sql<{ provider_id: string; ts: Date }[]>`
    SELECT provider_id, ts FROM control_perf_samples
    WHERE provider_id = ${providerId} AND ts < ${cutoff.toISOString()}
    ORDER BY ts DESC
    LIMIT ${chunkSize}
  `;
  if (toDelete.length === 0) break;
  await sql`DELETE FROM control_perf_samples WHERE (provider_id, ts) = ANY(${sql(toDelete)})`;
  ```
  Or add an `id BIGSERIAL` column to the table (migration needed for existing DBs).

**B7: `onReconcile` wired but never called -- gap detection is dead code**

- **Location:** `apps/control/src/services/fleet-connector.ts:102` + `apps/control/src/index.ts:102-154,254`
- **Evidence:** The `onReconcile` callback is declared in `FleetConnectorDeps` and wired at `index.ts:254`, but the connector loop at `fleet-connector.ts:122-196` never invokes `deps.onReconcile`. The `handleReconcile` function (gap detection + bulk INSERT) is unreachable dead code.
- **Standard violated:** design.md section 4: "On reconnect, reconcile via GET /api/metrics (full ring)." The reconcile-on-reconnect path is the mechanism for detecting ring-buffer wraps and filling data gaps.
- **Risk:** Silent data loss after connector restarts or network interruptions. Metrics ring buffer wraps are never detected, leaving permanent gaps in `control_requests` that are invisible to the user.
- **Fix sketch:** Call `onReconcile` when the SSE `metrics` event arrives (pass the MetricsData through), or add a periodic reconcile timer in `index.ts` that fetches the full metrics ring from each host on a configurable interval.

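
To make the B7 risk concrete, here is one way a ring-wrap check can work. It is a sketch under a strong assumption: that each metrics entry carries a monotonically increasing numeric `id`. The review does not confirm the real MetricsData shape, and `detectRingGap`, `MetricsEntry`, and `GapResult` are hypothetical names, not the service's own gap-detection code.

```typescript
// Hypothetical entry shape: assumes each ring entry carries a monotonically
// increasing numeric id. The real MetricsData fields may differ.
interface MetricsEntry {
  id: number;
}

interface GapResult {
  // Inclusive id range that fell out of the ring before it was stored, if any.
  missing: { from: number; to: number } | null;
  // Entries newer than the last stored id, in ascending id order.
  fresh: MetricsEntry[];
}

function detectRingGap(lastStoredId: number | null, ring: MetricsEntry[]): GapResult {
  // Sort a copy so the caller's array is left untouched.
  const sorted = ring.slice().sort((a, b) => a.id - b.id);
  const oldest = sorted[0];
  if (oldest === undefined) return { missing: null, fresh: [] };

  // Nothing stored yet: everything in the ring is new and no gap is knowable.
  if (lastStoredId === null) return { missing: null, fresh: sorted };

  const fresh = sorted.filter((e) => e.id > lastStoredId);
  // If the oldest surviving entry is more than one past what was stored,
  // the ring wrapped and the ids in between are unrecoverable.
  const missing = oldest.id > lastStoredId + 1
    ? { from: lastStoredId + 1, to: oldest.id - 1 }
    : null;
  return { missing, fresh };
}
```

This sketch does not handle id resets (for example, a host restart that restarts numbering); a real reconcile path would need its own rule for that case.
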
**B8: `control_job` frame handler inserts garbage data into activity feed**

- **Location:** `apps/web/src/hooks/useControlStream.tsx:191-196`
- **Evidence:**
  ```typescript
  } else if (data.type === 'control_job') {
    const frame = data as ControlJobFrame;
    setState((prev) => ({
      ...prev,
      requests: [...prev.requests, { id: 0, providerId: '', ts: '', model: null, reqPath: null, statusCode: null, durationMs: null }].slice(-500),
    }));
  }
  ```
  The frame payload is parsed but ignored. A hardcoded garbage entry is pushed into the `requests` array.
- **Standard violated:** Idempotent event handling. The handler should either use the frame data or be a no-op placeholder.
- **Risk:** Currently moot (no `control_job` frames are sent in P1). When jobs are implemented, every job event pollutes the activity feed with empty phantom entries, displacing real request data from the 500-entry cap.
- **Fix sketch:** Either implement proper job-state tracking (store in a separate `jobs` state field) or replace with a no-op `// TODO: P3 implement job frame handling`.

### Advisory

**A1: No fleet-state rebuild from DB on service restart**

- **Location:** `apps/control/src/index.ts:223`
- **Finding:** `createFleetState()` always returns an empty Map. The ws.ts comment says "On service restart, rebuild fleet state from DB before serving snapshots" but this is unimplemented.
- **YAGNI gate:** Moot while B1 is unfixed (SSE never populates state). Will become blocking once SSE is fixed. A late-joining client during the gap after restart sees all hosts as `down` with no models.

**A2: `pruneActivity` and `pruneModelEvents` are not chunked**

- **Location:** `apps/control/src/services/retention.ts:95-109`
- **Finding:** Both do unbounded `DELETE` in a single statement. Design doc section 6 explicitly calls for "chunked transactions: one transaction per provider per 1-hour window, never one 48h mega-transaction."
- **YAGNI gate:** At 5s poll intervals x 2 hosts, `control_requests` accumulates ~35k rows/day. A 48h unbounded DELETE runs as a single multi-second transaction. Its table-level RowExclusiveLock does not itself block INSERTs, but the long delete competes with the perf poller's concurrent INSERTs for I/O and WAL throughput. The slowdown is measurable but not catastrophic for a single-user setup. Reopen trigger: if retention causes visible perf-poller lag in production.

**A3: No Zod validation on incoming WS frames**

- **Location:** `apps/web/src/hooks/useControlStream.tsx:149-201`
- **Finding:** Frames are parsed with `JSON.parse` and cast directly to types. Sibling `useUserEvents.ts:41-68` validates every frame against `WsFrameSchema` with fail-closed logging.
- **YAGNI gate:** Control frames bypass the broker (raw WS proxy), so the server-side Zod gate does not apply. Without client validation, a malformed frame silently corrupts state. Reopen trigger: any incident where a bad frame causes a UI crash.

**A4: ECharts instances never disposed on component unmount**

- **Location:** `apps/web/src/components/control/PerfChart.tsx:95-97`, `VramGauge.tsx:89-91`, `TtlRing.tsx:98-101`
- **Finding:** Cleanup functions disconnect ResizeObservers and clear intervals but never call `chart.dispose()`. Canvas elements and associated GPU memory are leaked on unmount.
- **YAGNI gate:** The Control page is a single-route SPA; components unmount only on navigation away. The leak is bounded (3 chart instances max). Reopen trigger: memory profiling shows ECharts accumulation after repeated navigation.

**A5: `trimCapture` size estimation uses UTF-16 code-unit count as byte proxy**

- **Location:** `apps/control/src/services/retention.ts:117`
- **Finding:** `captureJson.length * 2` estimates bytes for a UTF-16 JS string. For ASCII-heavy JSON (the common case for HTTP captures), this overestimates by 2x, meaning captures that should be trimmed are not. The trim threshold at line 120 (`sizeKB * 512`) compensates, but the check-and-trim logic is inconsistent.
- **YAGNI gate:** The cap is advisory (256KB default).
  Captures slightly over the cap are not trimmed, but the total budget pruning (not implemented in P1) would catch them. Reopen trigger: capture storage exceeds `CAPTURE_BUDGET_MB`.

**A6: Fixed 5s reconnect delay without exponential backoff**

- **Location:** `apps/web/src/hooks/useControlStream.tsx:205`
- **Finding:** `setTimeout(connect, 5000)` -- fixed delay. Siblings `useUserEvents.ts` and `useSessionStream.ts` both use exponential backoff (1s to 30s).
- **YAGNI gate:** The control WS is a secondary connection; a 5s reconnect cadence is acceptable for a dashboard. Reopen trigger: reconnect storms during extended outages.

**A7: Perf poller has no fetch timeout**

- **Location:** `apps/control/src/index.ts:176`
- **Finding:** `fetch(url)` has no `signal` or timeout. If a host hangs (accepts TCP but never responds), the poll blocks indefinitely. The sequential `for` loop at line 271 means one hung host stalls polling for all subsequent hosts.
- **YAGNI gate:** llama-swap's `/api/performance` is a fast local endpoint. Reopen trigger: any host observed hanging in production.

**A8: Perf poller catch block swallows errors silently**

- **Location:** `apps/control/src/index.ts:190-192`
- **Finding:** `catch { // Poll failure -- handled by the connector's circuit-breaker. }`. The comment references a circuit-breaker that does not exist for the perf poller. The error is silently discarded.
- **YAGNI gate:** Same as A7 -- fast local endpoint, errors are transient. Reopen trigger: silent poll failures observed in logs.

**A9: Response header forwarding without filtering in control-proxy**

- **Location:** `apps/server/src/routes/control-proxy.ts:78-81`
- **Finding:** All upstream response headers are forwarded except `transfer-encoding`. This includes `set-cookie`, `x-powered-by`, and internal headers. The coder-proxy has the same pattern (deliberate clone), but the control service is a new internal service with no auth, making header leakage more concerning.
- **YAGNI gate:** BooControl is an internal dashboard behind Authelia. Header leakage is not exploitable from outside the Tailscale mesh. Reopen trigger: any external exposure of the control endpoint.

**A10: SSRF via unvalidated `ssh_host` in URL construction**

- **Location:** `apps/control/src/index.ts:248`
- **Finding:** ``const baseUrl = `http://${sshHost}:8401` `` -- `ssh_host` from the DB flows directly into `fetch()` URLs with no validation (IP format, private-range check).
- **YAGNI gate:** `control_hosts` is seeded with known hosts and modified only via direct SQL (no admin UI in P1). An attacker with DB write access already has worse options. Reopen trigger: any user-facing host-edit UI.

### Nits

**N1: Duplicate `createFleetState` definition** -- `index.ts:14` defines a local `createFleetState` that shadows the identical export from `fleet-state.ts:60`. Remove the local copy and import from the module.

**N2: `theme as any` cast in ECharts init** -- `PerfChart.tsx:37`, `VramGauge.tsx:25`, `TtlRing.tsx:25`. `buildEChartsTheme()` returns `Record` but `echarts.init()` expects a typed theme. The `as any` bypasses type safety. Low risk since the theme object is simple and validated by visual inspection.

**N3: `window.matchMedia` called in render body** -- `HostCard.tsx:51` and `HostCard.tsx:207`. The `prefersReducedMotion` check runs on every render. Move to a `useMemo` or module-level constant to avoid redundant re-evaluation.

**N4: SSE error logging drops the error object** -- `fleet-connector.ts:185`. The `err` variable from the catch block is captured but not included in the log fields. Distinguishing connection reset from DNS failure requires the error message.

**N5: Sequential N+1 DB inserts for metrics entries** -- `index.ts:79-86`. Each metrics entry triggers an individual `await sql` INSERT. A batch of N entries requires N round-trips. Consider a multi-row INSERT or a transactional batch.

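
If the A6 reopen trigger fires, the delay calculation is small enough to sketch here. The 1s base and 30s cap mirror what the review reports for the sibling hooks; the jitter step is an assumption of this sketch, not something the review attributes to them, and `reconnectDelayMs` is a hypothetical helper name.

```typescript
// Hypothetical helper; parameter names and defaults are illustrative.
function reconnectDelayMs(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 30000,
  random: () => number = Math.random,
): number {
  // Exponential growth from baseMs, clamped to capMs.
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, Math.max(0, attempt)));
  // "Equal jitter": keep half the delay fixed and randomize the other half,
  // so clients that dropped together do not all reconnect at the same instant.
  const half = ceiling / 2;
  return Math.round(half + random() * half);
}
```

A hook using this would pass a counter that increments on each failed attempt and resets to 0 once the socket opens, in place of the fixed `5000`.
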
## Verdict

**Block**

Blocking findings B1-B8 must be resolved before merge. The SSE parser inversion (B1) makes the entire ingestion pipeline dead code. The seq/delta/publish chain (B2-B4) makes the WS endpoint non-functional. The retention crash (B6) will take down the service on first daily tick. The async error handling (B5) means any DB failure is a process crash. The reconcile dead code (B7) means gap detection never runs. The garbage handler (B8) will corrupt the activity feed when jobs ship.

The core recommendation: before fixing individual bugs, establish the end-to-end data flow first. Wire SSE parse -> event handler -> seq increment -> delta publish -> WS broadcast -> client apply in a single pass, with integration tests at each boundary. The current code has the right shapes (backoff+jitter, seq-stamped protocol, chunked retention) but none of the links are connected.

## Claims I did not verify

- Whether llama-swap's `/api/events` SSE format is standard (`event:` + `data:` lines) or non-standard (single-line `type: json`). The fix for B1 depends on this.
- Whether the `control_perf_samples` table exists in any deployed DB (it would fail on `SELECT id` if it does).
- Whether `react-virtuoso`'s `followOutput` prop type accepts `'bottom' as FollowOutput` without runtime issues.
- Whether the ECharts `GaugeChart` import at `VramGauge.tsx:4` and `TtlRing.tsx:4` is tree-shakeable or pulls the full gauge bundle.
- Whether the `postgres` tagged-template library parameterizes `::jsonb` casts correctly (the security analyst concluded it does, but I did not trace the library internals).
- Whether the `setInterval` callbacks at `index.ts:265,277` can overlap if a poll/retention cycle exceeds the interval period (Node's single-threaded model prevents true overlap, but the async callback can be re-entered at `await` points).
- Whether the `onClose` hook at `index.ts:287` fires before or after `sql.end()` in the shutdown sequence.

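
Because the B1 fix hinges on the unverified wire format, it may help to see what the standard-SSE branch of that fix sketch looks like in isolation. The following is a sketch of the accumulate-then-dispatch rule from the SSE specification (field name before the first colon, one optional leading space stripped, dispatch on a blank line). It assumes JSON payloads; `createSseLineParser` and its callbacks are hypothetical names, and whether llama-swap emits this format is still the open question listed above.

```typescript
interface SseEvent {
  type: string;
  data: unknown;
}

// Returns a feed function; call it once per line of the response body.
function createSseLineParser(
  onEvent: (event: SseEvent) => void,
  onMalformed: (payload: string) => void = () => {},
): (rawLine: string) => void {
  let eventType = 'message'; // SSE default when no "event:" field is sent
  let dataLines: string[] = [];

  return (rawLine: string): void => {
    const line = rawLine.replace(/\r$/, '');

    // Blank line: dispatch whatever has accumulated, then reset.
    if (line === '') {
      if (dataLines.length > 0) {
        const payload = dataLines.join('\n');
        let parsed: unknown = undefined;
        let ok = true;
        try {
          parsed = JSON.parse(payload);
        } catch (_err) {
          ok = false;
        }
        if (ok) onEvent({ type: eventType, data: parsed });
        else onMalformed(payload);
      }
      eventType = 'message';
      dataLines = [];
      return;
    }

    if (line.charAt(0) === ':') return; // comment / keep-alive line

    // Split on the FIRST colon only, so colons inside the JSON survive.
    const colon = line.indexOf(':');
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? '' : line.slice(colon + 1);
    if (value.charAt(0) === ' ') value = value.slice(1);

    if (field === 'event') eventType = value;
    else if (field === 'data') dataLines.push(value);
    // Other fields (id, retry) are ignored in this sketch.
  };
}
```

Keeping JSON parsing outside the `onEvent` call means a handler that throws is not mistaken for a malformed payload.
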
---

# Re-review (post-fix)

**Date:** 2026-06-12

**Baseline:** p1-code-review.md (verdict Block, B1-B8 blocking)

**Fix pass:** p1-fix-analysis.md (all B1-B8 claimed fixed, 49 tests passing)

## Scope

Same files as original review. Re-traced the full data chain: SSE line -> parseSseLine -> handleLlamaSweepEvent -> DB insert + incrementSeq -> DeltaEmitter.publish -> ws.ts subscriber -> ControlFleetFrame wire shape -> useControlStream.tsx client application. Verified each blocking finding by reading the current code, not by trusting comments or the fix analysis.

## Size

**Medium** -- fix pass across 7 source files + 1 new test file; no new subsystems or surfaces.

## Summary

All 8 original blocking findings are genuinely fixed at the code level. The SSE parser works, incrementSeq is called on every mutation, the DeltaEmitter pattern connects mutations to WS subscribers, the wire format matches between server and client, async errors are caught, retention uses the composite key, reconcile runs from the metrics case, and the job handler uses frame data.

However, the fix pass introduced a new multi-host regression (deltas replace the full hosts array), `rebuildFleetFromDB` sets liveness to 'connected' when it should be 'down', and the pipeline test simulates the logic inline rather than exercising the real implementation chain.

| Classification | Count |
|----------------|-------|
| Blocking | 1 |
| Advisory | 3 |
| Nit | 1 |

## Blocking findings: B1-B8 confirmation

### B1: SSE line parser inverted

**Verdict: FIXED**

`fleet-connector.ts:116-159`: The contradictory `startsWith('data:')` filter is gone. `parseSseLine` now correctly handles three cases:

1. `event:` lines set the event type (lines 124-126)
2. `data:` lines emit the event using the current event type (lines 129-141)
3. Non-standard `type: json` single-line format (lines 144-156)

The caller loop at `fleet-connector.ts:204-227` tracks `currentEventType` and calls `parseSseLine(line, currentEventType)`.
Standard SSE: `event:` line returns `{event: null, eventType: 'modelStatus'}`, caller stores it. Next `data:` line returns the parsed event with the stored type. Dead code eliminated; the `onEvent` callback is now reachable.

### B2: incrementSeq never called

**Verdict: FIXED**

`incrementSeq` is exported from `fleet-state.ts:83-86`, imported in `index.ts:6`, and called at:

- `index.ts:60` (modelStatus case)
- `index.ts:89` (logData case)
- `index.ts:102` (metrics case)
- `index.ts:237` (pollPerformance, per sample)

Every fleet-state mutation increments seq before publishing. The seq is included in the delta payload.

### B3: WS handler has no delta-publishing mechanism

**Verdict: FIXED**

`DeltaEmitter` (`index.ts:16-34`) is a `Set` pattern with `subscribe` and `publish`. Every mutation path calls `emitter.publish(...)`. `ws.ts:34-37` subscribes on connect, unsubscribes on close/error (lines 48-56). The listener set is iterated in `publish` with per-listener try/catch (line 30). Live updates flow from mutation to WS client.

### B4: Snapshot wire format mismatch

**Verdict: FIXED**

`ws.ts:26-31` sends `{ type: 'control_fleet', seq: maxSeq, hosts: snapshot.hosts }` at the top level, matching the `ControlFleetFrame` Zod schema (`ws-frames.ts:492-508`). The client at `useControlStream.tsx:155` reads `frame.hosts` which now exists. Snapshot uses `maxSeq` across all hosts (line 26). Client distinguishes snapshot from delta via `hasSnapshotRef` flag (lines 156-166).

### B5: onEvent drops async errors

**Verdict: FIXED**

`fleet-connector.ts:101`: Type is `() => void | Promise<void>`. Call site at lines 222-226: `await Promise.resolve(deps.onEvent(providerId, parsed.event))` with `catch` that logs via `deps.log.error`. DB failures no longer produce unhandled rejections.

### B6: pruneRawSamples references non-existent id column

**Verdict: FIXED**

`retention.ts:77-88`: Rewritten to use composite key `(provider_id, ts)`. SELECT returns `{ provider_id, ts }` rows.
DELETE uses `WHERE (provider_id, ts) = ANY(...)`. Chunked in a while-loop with `chunkSize = 1000`.

### B7: onReconcile wired but never called

**Verdict: FIXED (with nit)**

Gap detection now runs via `handleLlamaSweepEvent` -> `handleReconcile` direct call (`index.ts:101-105`), not via `deps.onReconcile`. The `deps.onReconcile` callback at `index.ts:377` is wired but never invoked from the connector loop -- it is dead code. The effect is correct: `metrics` events trigger reconcile. The dead `onReconcile` dep is a nit (see below).

### B8: control_job garbage insert

**Verdict: FIXED**

`useControlStream.tsx:185-191`: Handler reads `frame.jobType`, `frame.jobId`, `frame.status` from the parsed `ControlJobFrame` and pushes a proper entry to the `jobs` array, capped at 200. No hardcoded garbage.

## New finding from fix pass

**B9: Fleet delta replaces entire hosts array -- multi-host regression**

- **Location:** `apps/web/src/hooks/useControlStream.tsx:164`
- **Evidence:**
  ```typescript
  // Delta: apply only if seq > snapshot seq.
  if (frame.seq > snapshotSeqRef.current) {
    setState((prev) => ({ ...prev, hosts: frame.hosts as unknown as ControlFleetHost[] }));
  }
  ```
  Each delta from the server contains only the changed host in `hosts` (e.g., `index.ts:68-84` publishes a single-element array). The client replaces `prev.hosts` wholesale with this single-element array. With 2+ connected hosts, a modelStatus event for host A wipes host B from the UI until the next snapshot.
- **Standard violated:** Idempotent delta application. Deltas should merge by `providerId`, not replace the full array.
- **Risk:** Any multi-host deployment shows flickering/missing hosts in the Fleet tab. Single-host deployments are unaffected.
- **Fix sketch:**
  ```typescript
  if (frame.seq > snapshotSeqRef.current) {
    setState((prev) => {
      // Index existing hosts by providerId; `as const` keeps each pair a tuple for Map.
      const hostMap = new Map(prev.hosts.map((h) => [h.providerId, h] as const));
      // Overlay only the hosts carried by this delta; all others stay untouched.
      for (const h of frame.hosts) hostMap.set(h.providerId, h);
      return { ...prev, hosts: Array.from(hostMap.values()) };
    });
  }
  ```

## A1 rebuildFleetFromDB correctness

**Location:** `index.ts:256-310`

**Finding:** `rebuildFleetFromDB` sets `state.liveness = 'connected'` at line 270 for every host it rebuilds from DB. This runs at startup (lines 355-357), before SSE connectors start (lines 366-385). After a service restart, hosts have no live SSE connection yet. Setting liveness to `'connected'` is incorrect -- the hosts should start as `'down'` (the default from `ensureHostState` at `fleet-state.ts:67`) until the SSE connector establishes a connection.

The correct behavior: `rebuildFleetFromDB` should populate models/lastSeenAt from DB but leave `liveness` at the default `'down'`. The SSE connector loop will update liveness to `'connected'` when connections are established (via `stampLastSeen` + the `modelStatus` case setting `state.liveness = 'connected'` at `index.ts:52`).

- **Severity:** Advisory. A late-joining client during the brief window before connectors start sees hosts as 'connected' with stale data. The window is typically seconds. The hosts will flip to 'down' shortly afterwards if the connector fails to connect, or stay 'connected' if it succeeds -- so the visual glitch is minor. But it violates the liveness semantic.

## HostCard.tsx:56 double-cast

**Location:** `apps/web/src/components/control/HostCard.tsx:56`

```typescript
const gpuData = (host as unknown as Record)['gpu'] as {
  vram_used?: number;
  vram_total?: number;
  temperature?: number;
  power?: number;
} | undefined;
```

The `ControlFleetHost` type has no `gpu` field. The double-cast accesses a property that doesn't exist on the wire type. At runtime, `host.gpu` is always `undefined`, so the GPU gauge always shows "no GPU data".
This is a silent no-op, not a crash.

**Typed fix:** GPU data comes from perf samples, not the fleet snapshot. The HostCard should receive the latest perf sample for its host as a prop (looked up from `ControlStreamState.perfSamples` by `providerId`). Remove the double-cast; add a `perfSample?: ControlPerfSample` prop to `HostCardProps`.

## pipeline.test.ts quality

**Location:** `apps/control/src/services/__tests__/pipeline.test.ts`

The test title says "SSE pipeline: parse -> store -> emit deltas" but it does not exercise the actual `handleLlamaSweepEvent`, `DeltaEmitter`, or SQL code paths. Instead, it reimplements the logic inline (lines 97-132) with mock SQL that always succeeds. This means:

1. The `await + catch` error handling (B5 fix) is never tested -- mock SQL never fails.
2. The `DeltaEmitter.publish` -> subscriber path is never tested.
3. The actual `handleLlamaSweepEvent` function is never called.
4. The `metrics` case with reconcile and per-entry INSERTs is not tested against the real code.

The tests prove the logic can work in isolation but do not prove the wiring is correct. The `reconcile.test.ts` (7 tests on `detectGap`) is solid and well-targeted. The `fleet-connector.test.ts` and `fleet-state.test.ts` test their respective modules. But there is no integration test that calls `handleLlamaSweepEvent` with a mock SQL + DeltaEmitter and asserts the emitted deltas match the wire format.

- **Severity:** Advisory. The unit tests cover the building blocks. An integration test would catch wiring bugs (wrong import, wrong field name, missing await). Reopen trigger: any bug where the individual components pass tests but the pipeline fails at runtime.

## Accepted follow-ups (not re-litigated)

A2, A3, A5, A9, A10 per the fix analysis YAGNI gates.

## Nits

**N6: Dead `onReconcile` dep callback** -- `fleet-connector.ts:102` declares `onReconcile` in `FleetConnectorDeps`, wired at `index.ts:377`, but the connector loop never calls `deps.onReconcile`.
Reconcile runs via the direct `handleLlamaSweepEvent -> handleReconcile` path. Remove the dead callback or have the connector call it on the `metrics` event instead of calling `handleReconcile` directly from `handleLlamaSweepEvent`.

## Verdict

**REQUEST-CHANGES**

B1-B8 from the original review are all genuinely fixed. The data chain works end-to-end for a single host. However, the fix pass introduced a new blocking finding:

- **B9** (blocking): Fleet delta replaces the entire hosts array, breaking multi-host deployments. A delta for one host wipes all other hosts from the UI. Fix: merge deltas by `providerId` instead of replacing `prev.hosts`.

Advisory findings to address before or shortly after merge:

- **A1 rebuild liveness**: `rebuildFleetFromDB` sets liveness to `'connected'` before connectors start. Should leave at `'down'`.
- **HostCard double-cast**: Remove the `as unknown as` cast; pass GPU data from perfSamples as a typed prop.
- **pipeline.test.ts**: Does not exercise the real `handleLlamaSweepEvent` or `DeltaEmitter` chain. Consider an integration test with mock SQL + emitter.

## Claims I did not verify

- Same as original review (llama-swap SSE format, react-virtuoso types, ECharts tree-shaking, postgres parameterization, setInterval overlap, shutdown ordering).
- Whether the `DELETE ... = ANY(${sql(toDelete)})` pattern at `retention.ts:87` works with the `postgres` library when `toDelete` contains objects with Date values (the `ts` field is typed as `Date` but the column is `TIMESTAMPTZ`).
- Whether the batch INSERT at `index.ts:229-231` (`sql.unsafe(inserts.map(s => s.toString()).join(';\n'))`) correctly handles the semicolon-separated multi-statement execution in the `postgres` library.