Review: BooControl P1 (uncommitted working tree)
Scope
apps/control/** (new Fastify host service: SSE fleet connector w/ backoff+jitter, perf poller, seq-stamped in-memory fleet state, WS endpoint, retention job, schema.sql, db.ts waitForTable, 6 test files), apps/server/src/routes/control-proxy.ts, packages/contracts/src/ws-frames.ts control_* frames, apps/web/src/pages/Control.tsx, apps/web/src/hooks/useControlStream.tsx, apps/web/src/components/control/** (HostCard, FleetTab, ActivityTab, PerfChart, VramGauge, TtlRing, buildEChartsTheme).
Size
Large -- new host service (5 source files, 6 tests), cross-app WS contract additions (contracts + server proxy + web hook + 7 UI components), touches DB, SSE, WebSocket, and rendering surfaces.
Summary
The SSE fleet connector's line parser is logic-inverted (skips the lines it tries to match), making the entire ingestion pipeline dead code. Beyond that, three compounding issues make the WS endpoint non-functional: incrementSeq is never called (seq stays 0), the WS handler has no delta-publishing mechanism, and the snapshot wire format nests hosts under a snapshot key the client never reads. The retention job will crash on first execution because pruneRawSamples references a non-existent id column. The onEvent callback drops async errors, meaning a single DB failure crashes the process. In total, the backend pipeline (SSE -> parse -> store -> WS publish) is broken at every link, and the frontend implements a protocol the server does not speak. None of the core data flows work end-to-end.
| Classification | Count |
|---|---|
| Blocking | 8 |
| Advisory | 10 |
| Nit | 5 |
Findings
Blocking
B1: SSE line parser is logic-inverted -- all events silently dropped
- Location: `apps/control/src/services/fleet-connector.ts:158`
- Evidence:

      // Line 158: SKIP any line starting with "data:"
      if (!trimmed || trimmed.startsWith('data:')) continue;
      // Line 160: But THEN require the line to start with "data:" to proceed
      const dataMatch = trimmed.match(/^data:\s*(.+)$/);
      if (!dataMatch) continue;

- Standard violated: SSE parsing correctness. The filter and the regex are contradictory: lines matching the regex are filtered out before reaching it. The `onEvent` callback at line 169 is unreachable dead code.
- Risk: This is the root entry point of the entire data pipeline. No SSE events from any llama-swap host ever reach `handleLlamaSweepEvent` or `handleReconcile`. The in-memory fleet state is never populated. The DB is never written to. The WS snapshot is always empty. The entire BooControl cockpit is non-functional at runtime.
- Fix sketch: Remove the `startsWith('data:')` filter on line 158. If the format is standard SSE (`event: type\ndata: json`), accumulate event type from `event:` lines and payload from `data:` lines, emit on blank line. If the format is non-standard single-line (`type: json`), use a single regex like `/^(\w+):\s*(.+)$/` and remove the `data:` prefix check entirely. The `eventType = trimmed.split(':')[0]` on line 167 also breaks on JSON payloads containing colons (timestamps).
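The standard-SSE branch of the fix sketch could look roughly like the following. This is a minimal sketch, assuming llama-swap emits `event:`/`data:` pairs terminated by a blank line (which this review did not verify) and that the input is a complete chunk with no partial trailing line. `SseEvent` and `parseSseChunk` are hypothetical names, not the connector's actual API.

```typescript
// Hypothetical event shape; the real LlamaSweepSSEEvent type may differ.
interface SseEvent {
  type: string;
  data: unknown;
}

// `event:` sets the type, `data:` lines accumulate the payload,
// and a blank line dispatches the accumulated event.
function parseSseChunk(chunk: string): SseEvent[] {
  const events: SseEvent[] = [];
  let eventType = "message"; // SSE default when no `event:` line is present
  let dataLines: string[] = [];

  for (const rawLine of chunk.split(/\r?\n/)) {
    if (rawLine === "") {
      if (dataLines.length > 0) {
        try {
          events.push({ type: eventType, data: JSON.parse(dataLines.join("\n")) });
        } catch {
          // Malformed JSON: drop this event and keep the stream alive.
        }
      }
      eventType = "message";
      dataLines = [];
      continue;
    }
    if (rawLine.startsWith(":")) continue; // SSE comment / keep-alive line

    // Split on the FIRST colon only, so colons inside the JSON stay intact.
    const idx = rawLine.indexOf(":");
    const field = idx === -1 ? rawLine : rawLine.slice(0, idx);
    let value = idx === -1 ? "" : rawLine.slice(idx + 1);
    if (value.startsWith(" ")) value = value.slice(1);

    if (field === "event") eventType = value;
    else if (field === "data") dataLines.push(value);
  }
  return events;
}

const parsedEvents = parseSseChunk(
  'event: modelStatus\ndata: {"model":"m1","ts":"2026-06-12T10:00:00Z"}\n\n',
);
```

Splitting on the first colon keeps timestamps inside the payload from being mistaken for field separators. A streaming caller would still need to buffer any incomplete final line between reads.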
B2: incrementSeq defined but never called -- seq stays 0 forever
- Location: `apps/control/src/index.ts:33-36`
- Evidence:

      function incrementSeq(state: HostState): number {
        state.seq += 1;
        return state.seq;
      }

  No call site in the codebase invokes `incrementSeq`. Every `HostState` starts with `seq: 0` and stays there. The client-side dedup guard at `useControlStream.tsx:168` (`if (frame.seq > snapshotSeq)`) discards every delta since `0 > 0` is false.
- Standard violated: The seq-stamped delta protocol described in `design.md` section 4 ("per-host monotonic seq, incremented on every mutation").
- Risk: Even with SSE parsing fixed, no delta would ever pass the client's seq filter. Live updates are structurally impossible.
- Fix sketch: Call `incrementSeq(state)` inside `handleLlamaSweepEvent` and `handleReconcile` after every fleet-state mutation, before the DB write. Include the returned seq in the delta published to WS subscribers.
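The interaction between the never-incremented seq and the client guard can be reproduced in isolation. This is a simplified sketch: `HostState` is reduced to its `seq` field, and `acceptsDelta` is a hypothetical wrapper around the quoted `frame.seq > snapshotSeq` comparison.

```typescript
// Simplified; the real HostState carries more fields.
interface HostState {
  seq: number;
}

function incrementSeq(state: HostState): number {
  state.seq += 1;
  return state.seq;
}

// Hypothetical helper mirroring the client guard `frame.seq > snapshotSeq`.
function acceptsDelta(frameSeq: number, snapshotSeq: number): boolean {
  return frameSeq > snapshotSeq;
}

const hostState: HostState = { seq: 0 };
const snapshotSeq = hostState.seq; // snapshot taken while seq is still 0

// Without incrementSeq, every delta carries seq 0 and is discarded.
const acceptedWithoutIncrement = acceptsDelta(hostState.seq, snapshotSeq);

// With incrementSeq on mutation, the first delta carries seq 1 and passes.
const acceptedWithIncrement = acceptsDelta(incrementSeq(hostState), snapshotSeq);
```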
B3: WS handler has no delta-publishing mechanism -- onFleetDelta is dead code
- Location: `apps/control/src/routes/ws.ts:30-39`
- Evidence:

      const onFleetDelta = (delta: unknown) => {
        if (socket.readyState === WebSocket.OPEN) {
          socket.send(JSON.stringify(delta));
        }
      };
      // Comment: "In practice, the fleet service should publish deltas through a channel
      // that this handler subscribes to. For now, we use a simple approach:
      // the fleet state is rebuilt on each snapshot request."

  The callback is defined but nothing subscribes to it or calls it. There is no event emitter, no pub/sub channel, no polling loop.
- Standard violated: design.md section 4: "Fan-out to browser: the control service publishes over its own WS."
- Risk: WS clients get a one-shot snapshot at connection time and then go permanently stale. Model state changes, activity events, perf samples, and logs are never pushed to the frontend.
- Fix sketch: Add an `EventEmitter` (or a simple `Set<callback>` pattern matching `sessionEvents.ts`) to the fleet state. Have `handleLlamaSweepEvent`/`handleReconcile` publish seq-stamped deltas through it. The WS handler registers a listener on connect and removes it on close.
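The `Set<callback>` option from the fix sketch might be shaped like this. It is an illustrative sketch only: `DeltaBus` is a hypothetical name, and the actual `sessionEvents.ts` pattern was not copied here.

```typescript
type DeltaListener<T> = (delta: T) => void;

// Minimal pub/sub: mutation handlers call publish(), each WS connection
// subscribes on connect and calls the returned function on close.
class DeltaBus<T> {
  private readonly listeners = new Set<DeltaListener<T>>();

  subscribe(listener: DeltaListener<T>): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  publish(delta: T): void {
    this.listeners.forEach((listener) => {
      try {
        listener(delta);
      } catch {
        // One failing subscriber (e.g. a closed socket) must not block the rest.
      }
    });
  }
}

const bus = new DeltaBus<number>();
const received: number[] = [];

bus.subscribe(() => {
  throw new Error("simulated broken subscriber");
});
const unsubscribe = bus.subscribe((delta) => received.push(delta));

bus.publish(1); // delivered despite the broken subscriber
unsubscribe();
bus.publish(2); // not delivered: listener was removed
```

Returning the unsubscribe function from `subscribe` keeps the connect/close pairing in the WS handler to two lines and avoids leaking listeners for dead sockets.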
B4: Snapshot wire format mismatch -- client never receives host data
- Location: `apps/control/src/routes/ws.ts:24-27` vs `apps/web/src/hooks/useControlStream.tsx:157`
- Evidence: Server sends:

      socket.send(JSON.stringify({
        type: 'control_fleet' as const,
        snapshot, // { hosts: [...] } nested under "snapshot" key
      }));

  Client reads:

      if (frame.hosts && Array.isArray(frame.hosts)) { // frame.hosts is undefined

  The `hosts` array is at `frame.snapshot.hosts`, not `frame.hosts`. The client silently ignores the frame.
- Standard violated: Wire format contract between `ws.ts` and `useControlStream.tsx`. The `ControlFleetFrame` Zod schema in `ws-frames.ts:492-508` expects `seq` and `hosts` at the top level, which the snapshot does not provide.
- Risk: Even if B1-B3 were fixed, the client would never populate the Fleet tab. The page would show "No hosts connected" permanently.
- Fix sketch: Change the server to send `{ type: 'control_fleet', seq: host.seq, hosts: [...] }` at the top level (matching the Zod schema). Alternatively, change the client to read `data.snapshot.hosts`. The former is simpler and aligns with the contracts schema.
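The top-level shape from the fix sketch can be illustrated as follows. Only `type`, `seq`, and `hosts` come from the schema as described above; the per-host fields and the choice of the maximum per-host seq as the snapshot seq are assumptions made for this sketch.

```typescript
// Hypothetical host shape; the real ControlFleetHost has more fields.
interface FleetHost {
  providerId: string;
  seq: number;
}

interface ControlFleetFrame {
  type: "control_fleet";
  seq: number;
  hosts: FleetHost[];
}

// Builds a snapshot frame with `seq` and `hosts` at the top level.
// Using the highest per-host seq is one possible choice for the snapshot seq.
function buildSnapshotFrame(hosts: FleetHost[]): ControlFleetFrame {
  const seq = hosts.reduce((max, h) => Math.max(max, h.seq), 0);
  return { type: "control_fleet", seq, hosts };
}

const wire = JSON.stringify(
  buildSnapshotFrame([
    { providerId: "host-a", seq: 3 },
    { providerId: "host-b", seq: 7 },
  ]),
);

// What the client sees after JSON.parse: `hosts` is directly on the frame.
const decoded = JSON.parse(wire) as ControlFleetFrame;
```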
B5: onEvent callback drops async errors -- DB failure crashes the process
- Location: `apps/control/src/services/fleet-connector.ts:101,169` + `apps/control/src/index.ts:253`
- Evidence:

      // fleet-connector.ts:101 -- typed as returning void
      onEvent: (providerId: string, event: LlamaSweepSSEEvent) => void;
      // fleet-connector.ts:169 -- called without await
      deps.onEvent(providerId, event);
      // index.ts:253 -- implementation is async
      onEvent: (pid, event) => handleLlamaSweepEvent(fleet, sql, config, pid, event),

  `handleLlamaSweepEvent` is async and performs SQL INSERTs. The returned Promise is discarded. Any SQL failure (connection timeout, pool exhaustion) becomes an unhandled rejection. Node 15+ crashes on unhandled rejections by default.
- Standard violated: Async error handling discipline. The `onReconcile` callback IS typed as `Promise<boolean>` and is properly awaited, showing the pattern was intended.
- Risk: A single transient DB error during SSE event processing crashes the entire BooControl process. Under high event throughput, unbounded concurrent DB writes also exhaust the 10-connection pool, causing cascading timeouts.
- Fix sketch: Add `.catch()` to the onEvent call: `Promise.resolve(deps.onEvent(providerId, event)).catch((err) => { deps.log.error({ providerId, err }, 'fleet: onEvent failed'); });`. Change the type to `(providerId: string, event: LlamaSweepSSEEvent) => void | Promise<void>`. For backpressure, consider a bounded queue (e.g., p-queue with concurrency capped at pool size minus headroom).
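One way to express the catch-and-log part of the fix is sketched below. `safeInvoke`, `Logger`, and `OnEvent` are hypothetical names for illustration, and the event is typed as `unknown` rather than the real `LlamaSweepSSEEvent`. The sketch wraps the call in an async function so that a synchronous throw inside the callback is also converted into a logged error; backpressure is not covered.

```typescript
// Hypothetical logger shape, modelled on the `deps.log.error(fields, msg)` call above.
type Logger = {
  error: (fields: Record<string, unknown>, msg: string) => void;
};

type OnEvent = (providerId: string, event: unknown) => void | Promise<void>;

// Awaits the callback and logs any failure instead of letting it escape
// as an unhandled rejection. Never rejects.
async function safeInvoke(
  onEvent: OnEvent,
  log: Logger,
  providerId: string,
  event: unknown,
): Promise<void> {
  try {
    await onEvent(providerId, event);
  } catch (err) {
    log.error({ providerId, err }, "fleet: onEvent failed");
  }
}

const loggedMessages: string[] = [];
const testLog: Logger = {
  error: (_fields, msg) => {
    loggedMessages.push(msg);
  },
};

// Simulated DB failure inside an async handler.
const failingHandler: OnEvent = async () => {
  throw new Error("simulated connection timeout");
};

const invocation = safeInvoke(failingHandler, testLog, "host-a", {});
```

Awaiting inside the connector loop also serialises event handling per host, which bounds concurrent DB writes at the cost of throughput.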
B6: pruneRawSamples references non-existent id column -- guaranteed SQL error
- Location: `apps/control/src/services/retention.ts:78-88`
- Evidence:

      const toDelete = await sql<{ id: number }[]>`
        SELECT id FROM control_perf_samples -- no "id" column in this table
        WHERE provider_id = ${providerId} AND ts < ${cutoff.toISOString()}
        ORDER BY ts DESC LIMIT ${chunkSize}
      `;

  `control_perf_samples` schema (`schema.sql:49-55`): `(provider_id TEXT, ts TIMESTAMPTZ, gpu JSONB, sys JSONB)` -- no `id` column. Compare with `control_requests` which has `id BIGSERIAL PRIMARY KEY`.
- Standard violated: Schema/code consistency. The retention function was likely written for `control_requests` and copied without adapting to `control_perf_samples`'s composite-key schema.
- Risk: The daily retention job throws `column "id" does not exist` on first execution. The error propagates from the `setInterval` callback as an unhandled rejection, crashing the service.
- Fix sketch: Rewrite to chunk by `(provider_id, ts)` composite key:

      const toDelete = await sql<{ provider_id: string; ts: Date }[]>`
        SELECT provider_id, ts FROM control_perf_samples
        WHERE provider_id = ${providerId} AND ts < ${cutoff.toISOString()}
        ORDER BY ts DESC LIMIT ${chunkSize}
      `;
      if (toDelete.length === 0) break;
      await sql`DELETE FROM control_perf_samples WHERE (provider_id, ts) = ANY(${sql(toDelete)})`;

  Or add an `id BIGSERIAL` column to the table (migration needed for existing DBs).
B7: onReconcile wired but never called -- gap detection is dead code
- Location: `apps/control/src/services/fleet-connector.ts:102` + `apps/control/src/index.ts:102-154,254`
- Evidence: The `onReconcile` callback is declared in `FleetConnectorDeps` and wired at `index.ts:254`, but the connector loop at `fleet-connector.ts:122-196` never invokes `deps.onReconcile`. The `handleReconcile` function (gap detection + bulk INSERT) is unreachable dead code.
- Standard violated: design.md section 4: "On reconnect, reconcile via GET /api/metrics (full ring)." The reconcile-on-reconnect path is the mechanism for detecting ring-buffer wraps and filling data gaps.
- Risk: Silent data loss after connector restarts or network interruptions. Metrics ring buffer wraps are never detected, leaving permanent gaps in `control_requests` that are invisible to the user.
- Fix sketch: Call `onReconcile` when the SSE `metrics` event arrives (pass the MetricsData through), or add a periodic reconcile timer in `index.ts` that fetches the full metrics ring from each host on a configurable interval.
B8: control_job frame handler inserts garbage data into activity feed
- Location: `apps/web/src/hooks/useControlStream.tsx:191-196`
- Evidence:

      } else if (data.type === 'control_job') {
        const frame = data as ControlJobFrame;
        setState((prev) => ({
          ...prev,
          requests: [...prev.requests, { id: 0, providerId: '', ts: '', model: null, reqPath: null, statusCode: null, durationMs: null }].slice(-500),
        }));
      }

  The frame payload is parsed but ignored. A hardcoded garbage entry is pushed into the `requests` array.
- Standard violated: Idempotent event handling. The handler should either use the frame data or be a no-op placeholder.
- Risk: Currently moot (no `control_job` frames are sent in P1). When jobs are implemented, every job event pollutes the activity feed with empty phantom entries, displacing real request data from the 500-entry cap.
- Fix sketch: Either implement proper job-state tracking (store in a separate `jobs` state field) or replace with a no-op `// TODO: P3 implement job frame handling`.
Advisory
A1: No fleet-state rebuild from DB on service restart
- Location: `apps/control/src/index.ts:223`
- Finding: `createFleetState()` always returns an empty Map. The ws.ts comment says "On service restart, rebuild fleet state from DB before serving snapshots" but this is unimplemented.
- YAGNI gate: Moot while B1 is unfixed (SSE never populates state). Will become blocking once SSE is fixed. A late-joining client during the gap after restart sees all hosts as `down` with no models.
A2: pruneActivity and pruneModelEvents are not chunked
- Location: `apps/control/src/services/retention.ts:95-109`
- Finding: Both do unbounded `DELETE` in a single statement. Design doc section 6 explicitly calls for "chunked transactions: one transaction per provider per 1-hour window, never one 48h mega-transaction."
- YAGNI gate: At 5s poll intervals x 2 hosts, `control_requests` accumulates ~35k rows/day. A 48h unbounded DELETE holds a RowExclusiveLock for seconds, blocking the perf poller's concurrent INSERTs. The stall is measurable but not catastrophic for a single-user setup. Reopen trigger: if retention causes visible perf-poller lag in production.
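The 1-hour windowing the design doc asks for reduces to a small pure function. This is a sketch under the assumption that each window maps to one `DELETE ... WHERE ts >= start AND ts < end` in its own transaction; `hourlyWindows` and `TimeWindow` are hypothetical names, and the SQL side is left out.

```typescript
interface TimeWindow {
  start: Date;
  end: Date;
}

const HOUR_MS = 60 * 60 * 1000;

// Splits the half-open range [oldest, cutoff) into consecutive windows
// of at most one hour each. The last window is clipped to the cutoff.
function hourlyWindows(oldest: Date, cutoff: Date): TimeWindow[] {
  const result: TimeWindow[] = [];
  const cutoffMs = cutoff.getTime();
  for (let t = oldest.getTime(); t < cutoffMs; t += HOUR_MS) {
    result.push({
      start: new Date(t),
      end: new Date(Math.min(t + HOUR_MS, cutoffMs)),
    });
  }
  return result;
}

// 2.5 hours of expired data yields three windows: 1h, 1h, 30min.
const deleteWindows = hourlyWindows(
  new Date("2026-06-10T00:00:00Z"),
  new Date("2026-06-10T02:30:00Z"),
);
```

The retention loop would then iterate providers and windows, issuing one bounded delete per pair, so no single statement spans the full 48h range.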
A3: No Zod validation on incoming WS frames
- Location: `apps/web/src/hooks/useControlStream.tsx:149-201`
- Finding: Frames are parsed with `JSON.parse` and cast directly to types. Sibling `useUserEvents.ts:41-68` validates every frame against `WsFrameSchema` with fail-closed logging.
- YAGNI gate: Control frames bypass the broker (raw WS proxy), so the server-side Zod gate does not apply. Without client validation, a malformed frame silently corrupts state. Reopen trigger: any incident where a bad frame causes a UI crash.
A4: ECharts instances never disposed on component unmount
- Location: `apps/web/src/components/control/PerfChart.tsx:95-97`, `VramGauge.tsx:89-91`, `TtlRing.tsx:98-101`
- Finding: Cleanup functions disconnect ResizeObservers and clear intervals but never call `chart.dispose()`. Canvas elements and associated GPU memory are leaked on unmount.
- YAGNI gate: The Control page is a single-route SPA; components unmount only on navigation away. The leak is bounded (3 chart instances max). Reopen trigger: memory profiling shows ECharts accumulation after repeated navigation.
A5: trimCapture size estimation uses UTF-16 code-unit count as byte proxy
- Location: `apps/control/src/services/retention.ts:117`
- Finding: `captureJson.length * 2` estimates bytes for a UTF-16 JS string. For ASCII-heavy JSON (the common case for HTTP captures), this overestimates by 2x, meaning captures that should be trimmed are not. The trim threshold at line 120 (`sizeKB * 512`) compensates, but the check-and-trim logic is inconsistent.
- YAGNI gate: The cap is advisory (256KB default). Captures slightly over the cap are not trimmed, but the total budget pruning (not implemented in P1) would catch them. Reopen trigger: capture storage exceeds `CAPTURE_BUDGET_MB`.
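If the cap is meant to bound stored bytes, counting encoded bytes directly removes the need for a multiplier. The following is a sketch, assuming the capture is stored as UTF-8 text (not verified against the schema); `utf8ByteLength` is a hypothetical helper, written without `TextEncoder` so it has no runtime dependencies.

```typescript
// Returns the number of bytes the string occupies when encoded as UTF-8.
function utf8ByteLength(s: string): number {
  let bytes = 0;
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1; // ASCII
    } else if (code < 0x800) {
      bytes += 2;
    } else if (code >= 0xd800 && code <= 0xdbff && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        bytes += 4; // valid surrogate pair: one code point, four bytes
        i++;
      } else {
        bytes += 3; // lone high surrogate, counted as U+FFFD
      }
    } else {
      bytes += 3; // rest of the BMP, including lone surrogates
    }
  }
  return bytes;
}

const sampleCapture = '{"status":200,"path":"/v1/chat"}';
const utf16Estimate = sampleCapture.length * 2; // the estimate used at line 117
const utf8Actual = utf8ByteLength(sampleCapture);
```

For this all-ASCII sample the UTF-16 estimate is exactly twice the UTF-8 size, which is the 2x discrepancy described in the finding.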
A6: Fixed 5s reconnect delay without exponential backoff
- Location: `apps/web/src/hooks/useControlStream.tsx:205`
- Finding: `setTimeout(connect, 5000)` -- fixed delay. Siblings `useUserEvents.ts` and `useSessionStream.ts` both use exponential backoff (1s to 30s).
- YAGNI gate: The control WS is a secondary connection; a 5s reconnect cadence is acceptable for a dashboard. Reopen trigger: reconnect storms during extended outages.
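For comparison, a 1s-to-30s exponential schedule takes only a few lines. This sketch does not reproduce the sibling hooks' actual code; `reconnectDelayMs` and `withJitter` are hypothetical helpers, and the jitter step is an added assumption rather than something the siblings are known to do.

```typescript
// Doubles the delay per failed attempt, starting at baseMs and capped at capMs.
function reconnectDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Scales the delay into [50%, 100%) of its value so that many clients
// reconnecting after the same outage do not all retry at the same instant.
function withJitter(delayMs: number, random: () => number = Math.random): number {
  return Math.round(delayMs * (0.5 + random() * 0.5));
}

// Delays for attempts 0..5: 1s, 2s, 4s, 8s, 16s, then capped at 30s.
const schedule = [0, 1, 2, 3, 4, 5].map((attempt) => reconnectDelayMs(attempt));
```

The hook would reset the attempt counter to 0 after a successful open and call `setTimeout(connect, withJitter(reconnectDelayMs(attempt)))` on close.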
A7: Perf poller has no fetch timeout
- Location: `apps/control/src/index.ts:176`
- Finding: `fetch(url)` has no `signal` or timeout. If a host hangs (accepts TCP but never responds), the poll blocks indefinitely. The sequential `for` loop at line 271 means one hung host stalls polling for all subsequent hosts.
- YAGNI gate: llama-swap's `/api/performance` is a fast local endpoint. Reopen trigger: any host observed hanging in production.
A8: Perf poller catch block swallows errors silently
- Location: `apps/control/src/index.ts:190-192`
- Finding: `catch { // Poll failure -- handled by the connector's circuit-breaker. }`. The comment references a circuit-breaker that does not exist for the perf poller. The error is silently discarded.
- YAGNI gate: Same as A7 -- fast local endpoint, errors are transient. Reopen trigger: silent poll failures observed in logs.
A9: Response header forwarding without filtering in control-proxy
- Location: `apps/server/src/routes/control-proxy.ts:78-81`
- Finding: All upstream response headers are forwarded except `transfer-encoding`. This includes `set-cookie`, `x-powered-by`, and internal headers. The coder-proxy has the same pattern (deliberate clone), but the control service is a new internal service with no auth, making header leakage more concerning.
- YAGNI gate: BooControl is an internal dashboard behind Authelia. Header leakage is not exploitable from outside the Tailscale mesh. Reopen trigger: any external exposure of the control endpoint.
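Should the reopen trigger fire, a denylist filter is the smallest change. The sketch below blocks only the three headers named in the finding; which "internal headers" should also be dropped is not specified, so the list is an assumption, and an allowlist would be the stricter alternative. `filterResponseHeaders` is a hypothetical helper that treats headers as a plain string map.

```typescript
// Lower-cased names of headers that should not cross the proxy boundary.
const BLOCKED_RESPONSE_HEADERS = new Set<string>([
  "transfer-encoding",
  "set-cookie",
  "x-powered-by",
]);

// Returns a copy of the upstream headers without the blocked entries.
// Header names are compared case-insensitively.
function filterResponseHeaders(upstream: Record<string, string>): Record<string, string> {
  const forwarded: Record<string, string> = {};
  for (const name of Object.keys(upstream)) {
    const value = upstream[name];
    if (value !== undefined && !BLOCKED_RESPONSE_HEADERS.has(name.toLowerCase())) {
      forwarded[name] = value;
    }
  }
  return forwarded;
}

const forwardedHeaders = filterResponseHeaders({
  "Content-Type": "application/json",
  "Set-Cookie": "session=abc",
  "X-Powered-By": "example",
  "transfer-encoding": "chunked",
});
```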
A10: SSRF via unvalidated ssh_host in URL construction
- Location: `apps/control/src/index.ts:248`
- Finding: ``const baseUrl = `http://${sshHost}:8401` `` -- `ssh_host` from the DB flows directly into `fetch()` URLs with no validation (IP format, private-range check).
- YAGNI gate: `control_hosts` is seeded with known hosts and modified only via direct SQL (no admin UI in P1). An attacker with DB write access already has worse options. Reopen trigger: any user-facing host-edit UI.
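A format and range check before URL construction could be sketched as below. The policy is an assumption made for illustration: it accepts only literal IPv4 addresses in the RFC 1918 private ranges or the 100.64.0.0/10 shared range that Tailscale assigns from, and rejects hostnames entirely. `parseIPv4` and `isAllowedHost` are hypothetical helpers.

```typescript
// Parses a dotted-quad IPv4 literal. Rejects anything with extra characters,
// leading zeros, or octets above 255.
function parseIPv4(host: string): [number, number, number, number] | null {
  const parts = host.split(".");
  if (parts.length !== 4) return null;
  for (const part of parts) {
    if (!/^(0|[1-9]\d{0,2})$/.test(part)) return null;
    if (Number(part) > 255) return null;
  }
  return parts.map(Number) as [number, number, number, number];
}

// Assumed policy: private LAN ranges plus the 100.64.0.0/10 shared range.
function isAllowedHost(host: string): boolean {
  const ip = parseIPv4(host);
  if (ip === null) return false;
  const [a, b] = ip;
  if (a === 10) return true; // 10.0.0.0/8
  if (a === 172 && b >= 16 && b <= 31) return true; // 172.16.0.0/12
  if (a === 192 && b === 168) return true; // 192.168.0.0/16
  if (a === 100 && b >= 64 && b <= 127) return true; // 100.64.0.0/10
  return false;
}

const hostChecks = {
  lan: isAllowedHost("192.168.1.10"),
  tailnet: isAllowedHost("100.101.102.103"),
  publicDns: isAllowedHost("8.8.8.8"),
  linkLocalMetadata: isAllowedHost("169.254.169.254"),
  injected: isAllowedHost("10.0.0.1:22@evil.example"),
};
```

Rejecting non-literal hosts sidesteps DNS rebinding but would also block legitimate hostnames, so a real deployment might prefer an explicit allowlist of known hosts instead.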
Nits
N1: Duplicate createFleetState definition -- index.ts:14 defines a local createFleetState that shadows the identical export from fleet-state.ts:60. Remove the local copy and import from the module.
N2: theme as any cast in ECharts init -- PerfChart.tsx:37, VramGauge.tsx:25, TtlRing.tsx:25. buildEChartsTheme() returns Record<string, unknown> but echarts.init() expects a typed theme. The as any bypasses type safety. Low risk since the theme object is simple and validated by visual inspection.
N3: window.matchMedia called in render body -- HostCard.tsx:51 and HostCard.tsx:207. The prefersReducedMotion check runs on every render. Move to a useMemo or module-level constant to avoid redundant re-evaluation.
N4: SSE error logging drops the error object -- fleet-connector.ts:185. The err variable from the catch block is captured but not included in the log fields. Distinguishing connection reset from DNS failure requires the error message.
N5: Sequential N+1 DB inserts for metrics entries -- index.ts:79-86. Each metrics entry triggers an individual await sql INSERT. A batch of N entries requires N round-trips. Consider a multi-row INSERT or a transactional batch.
Verdict
Block
Blocking findings B1-B8 must be resolved before merge. The SSE parser inversion (B1) makes the entire ingestion pipeline dead code. The seq/delta/publish chain (B2-B4) makes the WS endpoint non-functional. The retention crash (B6) will take down the service on first daily tick. The async error handling (B5) means any DB failure is a process crash. The reconcile dead code (B7) means gap detection never runs. The garbage handler (B8) will corrupt the activity feed when jobs ship.
The core recommendation: before fixing individual bugs, establish the end-to-end data flow first. Wire SSE parse -> event handler -> seq increment -> delta publish -> WS broadcast -> client apply in a single pass, with integration tests at each boundary. The current code has the right shapes (backoff+jitter, seq-stamped protocol, chunked retention) but none of the links are connected.
Claims I did not verify
- Whether llama-swap's `/api/events` SSE format is standard (`event:` + `data:` lines) or non-standard (single-line `type: json`). The fix for B1 depends on this.
- Whether the `control_perf_samples` table exists in any deployed DB (it would fail on `SELECT id` if it does).
- Whether `react-virtuoso`'s `followOutput` prop type accepts `'bottom' as FollowOutput` without runtime issues.
- Whether the ECharts `GaugeChart` import at `VramGauge.tsx:4` and `TtlRing.tsx:4` is tree-shakeable or pulls the full gauge bundle.
- Whether the `postgres` tagged-template library parameterizes `::jsonb` casts correctly (the security analyst concluded it does, but I did not trace the library internals).
- Whether the `setInterval` callbacks at `index.ts:265,277` can overlap if a poll/retention cycle exceeds the interval period (Node's single-threaded model prevents true overlap, but the async callback can be re-entered at `await` points).
- Whether the `onClose` hook at `index.ts:287` fires before or after `sql.end()` in the shutdown sequence.
Re-review (post-fix)
Date: 2026-06-12
Baseline: p1-code-review.md (verdict Block, B1-B8 blocking)
Fix pass: p1-fix-analysis.md (all B1-B8 claimed fixed, 49 tests passing)
Scope
Same files as original review. Re-traced the full data chain: SSE line -> parseSseLine -> handleLlamaSweepEvent -> DB insert + incrementSeq -> DeltaEmitter.publish -> ws.ts subscriber -> ControlFleetFrame wire shape -> useControlStream.tsx client application. Verified each blocking finding by reading the current code, not by trusting comments or the fix analysis.
Size
Medium -- fix pass across 7 source files + 1 new test file; no new subsystems or surfaces.
Summary
All 8 original blocking findings are genuinely fixed at the code level. The SSE parser works, incrementSeq is called on every mutation, the DeltaEmitter pattern connects mutations to WS subscribers, the wire format matches between server and client, async errors are caught, retention uses the composite key, reconcile runs from the metrics case, and the job handler uses frame data. However, the fix pass introduced a new multi-host regression (deltas replace the full hosts array), the rebuildFleetFromDB sets liveness to 'connected' when it should be 'down', and the pipeline test simulates the logic inline rather than exercising the real implementation chain.
| Classification | Count |
|---|---|
| Blocking | 1 |
| Advisory | 3 |
| Nit | 1 |
Blocking findings: B1-B8 confirmation
B1: SSE line parser inverted
Verdict: FIXED
fleet-connector.ts:116-159: The contradictory startsWith('data:') filter is gone. parseSseLine now correctly handles three cases:
- `event:` lines set the event type (line 124-126)
- `data:` lines emit the event using the current event type (line 129-141)
- Non-standard `type: json` single-line format (line 144-156)
The caller loop at fleet-connector.ts:204-227 tracks currentEventType and calls parseSseLine(line, currentEventType). Standard SSE: event: line returns {event: null, eventType: 'modelStatus'}, caller stores it. Next data: line returns the parsed event with the stored type. Dead code eliminated; the onEvent callback is now reachable.
B2: incrementSeq never called
Verdict: FIXED
incrementSeq is exported from fleet-state.ts:83-86, imported in index.ts:6, and called at:
- `index.ts:60` (modelStatus case)
- `index.ts:89` (logData case)
- `index.ts:102` (metrics case)
- `index.ts:237` (pollPerformance, per sample)
Every fleet-state mutation increments seq before publishing. The seq is included in the delta payload.
B3: WS handler has no delta-publishing mechanism
Verdict: FIXED
DeltaEmitter (index.ts:16-34) is a Set<callback> pattern with subscribe and publish. Every mutation path calls emitter.publish(...). ws.ts:34-37 subscribes on connect, unsubscribes on close/error (lines 48-56). The listener set is iterated in publish with per-listener try/catch (line 30). Live updates flow from mutation to WS client.
B4: Snapshot wire format mismatch
Verdict: FIXED
ws.ts:26-31 sends { type: 'control_fleet', seq: maxSeq, hosts: snapshot.hosts } at the top level, matching the ControlFleetFrame Zod schema (ws-frames.ts:492-508). The client at useControlStream.tsx:155 reads frame.hosts which now exists. Snapshot uses maxSeq across all hosts (line 26). Client distinguishes snapshot from delta via hasSnapshotRef flag (line 156-166).
B5: onEvent drops async errors
Verdict: FIXED
fleet-connector.ts:101: Type is () => void | Promise<void>. Call site at line 222-226: await Promise.resolve(deps.onEvent(providerId, parsed.event)) with catch that logs via deps.log.error. DB failures no longer produce unhandled rejections.
B6: pruneRawSamples references non-existent id column
Verdict: FIXED
retention.ts:77-88: Rewritten to use composite key (provider_id, ts). SELECT returns { provider_id, ts } rows. DELETE uses WHERE (provider_id, ts) = ANY(...). Chunked in a while-loop with chunkSize = 1000.
B7: onReconcile wired but never called
Verdict: FIXED (with nit)
Gap detection now runs via handleLlamaSweepEvent -> handleReconcile direct call (index.ts:101-105), not via deps.onReconcile. The deps.onReconcile callback at index.ts:377 is wired but never invoked from the connector loop -- it is dead code. The effect is correct: metrics events trigger reconcile. The dead onReconcile dep is a nit (see below).
B8: control_job garbage insert
Verdict: FIXED
useControlStream.tsx:185-191: Handler reads frame.jobType, frame.jobId, frame.status from the parsed ControlJobFrame and pushes a proper entry to the jobs array, capped at 200. No hardcoded garbage.
New finding from fix pass
B9: Fleet delta replaces entire hosts array -- multi-host regression
- Location: `apps/web/src/hooks/useControlStream.tsx:164`
- Evidence:

      // Delta: apply only if seq > snapshot seq.
      if (frame.seq > snapshotSeqRef.current) {
        setState((prev) => ({ ...prev, hosts: frame.hosts as unknown as ControlFleetHost[] }));
      }

  Each delta from the server contains only the changed host in `hosts` (e.g., `index.ts:68-84` publishes a single-element array). The client replaces `prev.hosts` wholesale with this single-element array. With 2+ connected hosts, a modelStatus event for host A wipes host B from the UI until the next snapshot.
- Standard violated: Idempotent delta application. Deltas should merge by `providerId`, not replace the full array.
- Risk: Any multi-host deployment shows flickering/missing hosts in the Fleet tab. Single-host deployments are unaffected.
- Fix sketch:

      if (frame.seq > snapshotSeqRef.current) {
        setState((prev) => {
          const hostMap = new Map(prev.hosts.map((h) => [h.providerId, h]));
          for (const h of frame.hosts) hostMap.set(h.providerId, h);
          return { ...prev, hosts: Array.from(hostMap.values()) };
        });
      }
A1 rebuildFleetFromDB correctness
Location: index.ts:256-310
Finding: rebuildFleetFromDB sets state.liveness = 'connected' at line 270 for every host it rebuilds from DB. This runs at startup (line 355-357), before SSE connectors start (line 366-385). After a service restart, hosts have no live SSE connection yet. Setting liveness to 'connected' is incorrect -- the hosts should start as 'down' (the default from ensureHostState at fleet-state.ts:67) until the SSE connector establishes a connection.
The correct behavior: rebuildFleetFromDB should populate models/lastSeenAt from DB but leave liveness at the default 'down'. The SSE connector loop will update liveness to 'connected' when connections are established (via stampLastSeen + the modelStatus case setting state.liveness = 'connected' at index.ts:52).
- Severity: Advisory. A late-joining client during the brief window before connectors start sees hosts as 'connected' with stale data. The window is typically seconds. The hosts will flip to 'down' momentarily if the connector fails to connect, or stay 'connected' if it succeeds -- so the visual glitch is minor. But it violates the liveness semantic.
HostCard.tsx:56 double-cast
Location: apps/web/src/components/control/HostCard.tsx:56
const gpuData = (host as unknown as Record<string, unknown>)['gpu'] as {
vram_used?: number; vram_total?: number; temperature?: number; power?: number;
} | undefined;
The ControlFleetHost type has no gpu field. The double-cast accesses a property that doesn't exist on the wire type. At runtime, host.gpu is always undefined, so the GPU gauge always shows "no GPU data". This is a silent no-op, not a crash.
Typed fix: GPU data comes from perf samples, not the fleet snapshot. The HostCard should receive the latest perf sample for its host as a prop (looked up from ControlStreamState.perfSamples by providerId). Remove the double-cast; add a perfSample?: ControlPerfSample prop to HostCardProps.
pipeline.test.ts quality
Location: apps/control/src/services/__tests__/pipeline.test.ts
The test title says "SSE pipeline: parse -> store -> emit deltas" but it does not exercise the actual handleLlamaSweepEvent, DeltaEmitter, or SQL code paths. Instead, it reimplements the logic inline (lines 97-132) with mock SQL that always succeeds. This means:
- The `await + catch` error handling (B5 fix) is never tested -- mock SQL never fails.
- The `DeltaEmitter.publish` -> subscriber path is never tested.
- The actual `handleLlamaSweepEvent` function is never called.
- The `metrics` case with reconcile and per-entry INSERTs is not tested against the real code.
The tests prove the logic can work in isolation but do not prove the wiring is correct. The reconcile.test.ts (7 tests on detectGap) is solid and well-targeted. The fleet-connector.test.ts and fleet-state.test.ts test their respective modules. But there is no integration test that calls handleLlamaSweepEvent with a mock SQL + DeltaEmitter and asserts the emitted deltas match the wire format.
- Severity: Advisory. The unit tests cover the building blocks. An integration test would catch wiring bugs (wrong import, wrong field name, missing await). Reopen trigger: any bug where the individual components pass tests but the pipeline fails at runtime.
Accepted follow-ups (not re-litigated)
A2, A3, A5, A9, A10 per the fix analysis YAGNI gates.
Nits
N6: Dead onReconcile dep callback -- fleet-connector.ts:102 declares onReconcile in FleetConnectorDeps, wired at index.ts:377, but the connector loop never calls deps.onReconcile. Reconcile runs via the direct handleLlamaSweepEvent -> handleReconcile path. Remove the dead callback or have the connector call it on the metrics event instead of calling handleReconcile directly from handleLlamaSweepEvent.
Verdict
REQUEST-CHANGES
B1-B8 from the original review are all genuinely fixed. The data chain works end-to-end for a single host. However, the fix pass introduced a new blocking finding:
- B9 (blocking): Fleet delta replaces the entire hosts array, breaking multi-host deployments. A delta for one host wipes all other hosts from the UI. Fix: merge deltas by `providerId` instead of replacing `prev.hosts`.
Advisory findings to address before or shortly after merge:
- A1 rebuild liveness: `rebuildFleetFromDB` sets liveness to `'connected'` before connectors start. Should leave at `'down'`.
- HostCard double-cast: Remove the `as unknown as` cast; pass GPU data from perfSamples as a typed prop.
- pipeline.test.ts: Does not exercise the real `handleLlamaSweepEvent` or `DeltaEmitter` chain. Consider an integration test with mock SQL + emitter.
Claims I did not verify
- Same as original review (llama-swap SSE format, react-virtuoso types, ECharts tree-shaking, postgres parameterization, setInterval overlap, shutdown ordering).
- Whether the `DELETE ... = ANY(${sql(toDelete)})` pattern at `retention.ts:87` works with the `postgres` library when `toDelete` contains objects with Date values (the `ts` field is typed as `Date` but the column is `TIMESTAMPTZ`).
- Whether the batch INSERT at `index.ts:229-231` (`sql.unsafe(inserts.map(s => s.toString()).join(';\n'))`) correctly handles the semicolon-separated multi-statement execution in the `postgres` library.