indifferentketchup b18de2a331 chore: snapshot working tree - pty_exited notifications + in-flight inference WIP

feat(booterm): structured pty_exited WS notifications. Plan-validated, impl-validated, code-reviewed green (contracts build clean, contracts test 29/29, booterm + web typecheck clean).

wip: in-progress inference/provider refactor (agents.ts, provider.ts, new llama-providers.ts, removed llama-args-validator), plus arena, dispatcher, compaction, schema changes.

openspec: pty-exit-notifications complete; x-agent-flags planned (not yet implemented).

2026-06-14 12:48:47 +00:00


Review: BooControl P1 (uncommitted working tree)

Scope

apps/control/** (new Fastify host service: SSE fleet connector w/ backoff+jitter, perf poller, seq-stamped in-memory fleet state, WS endpoint, retention job, schema.sql, db.ts waitForTable, 6 test files), apps/server/src/routes/control-proxy.ts, packages/contracts/src/ws-frames.ts control_* frames, apps/web/src/pages/Control.tsx, apps/web/src/hooks/useControlStream.tsx, apps/web/src/components/control/** (HostCard, FleetTab, ActivityTab, PerfChart, VramGauge, TtlRing, buildEChartsTheme).

Size

Large -- new host service (5 source files, 6 tests), cross-app WS contract additions (contracts + server proxy + web hook + 7 UI components), touches DB, SSE, WebSocket, and rendering surfaces.

Summary

The SSE fleet connector's line parser is logic-inverted (skips the lines it tries to match), making the entire ingestion pipeline dead code. Beyond that, three compounding issues make the WS endpoint non-functional: incrementSeq is never called (seq stays 0), the WS handler has no delta-publishing mechanism, and the snapshot wire format nests hosts under a snapshot key the client never reads. The retention job will crash on first execution because pruneRawSamples references a non-existent id column. The onEvent callback drops async errors, meaning a single DB failure crashes the process. In total, the backend pipeline (SSE -> parse -> store -> WS publish) is broken at every link, and the frontend implements a protocol the server does not speak. None of the core data flows work end-to-end.

Classification	Count
Blocking	8
Advisory	10
Nit	5

Findings

Blocking

B1: SSE line parser is logic-inverted -- all events silently dropped

Location: apps/control/src/services/fleet-connector.ts:158

Evidence:

```
// Line 158: SKIP any line starting with "data:"
if (!trimmed || trimmed.startsWith('data:')) continue;

// Line 160: But THEN require the line to start with "data:" to proceed
const dataMatch = trimmed.match(/^data:\s*(.+)$/);
if (!dataMatch) continue;
```

Standard violated: SSE parsing correctness. The filter and the regex are contradictory: lines matching the regex are filtered out before reaching it. The onEvent callback at line 169 is unreachable dead code.
Risk: This is the root entry point of the entire data pipeline. No SSE events from any llama-swap host ever reach handleLlamaSweepEvent or handleReconcile. The in-memory fleet state is never populated. The DB is never written to. The WS snapshot is always empty. The entire BooControl cockpit is non-functional at runtime.
Fix sketch: Remove the startsWith('data:') filter on line 158. If the format is standard SSE (event: type\ndata: json), accumulate event type from event: lines and payload from data: lines, emit on blank line. If the format is non-standard single-line (type: json), use a single regex like /^(\w+):\s*(.+)$/ and remove the data: prefix check entirely. The eventType = trimmed.split(':')[0] on line 167 also breaks on JSON payloads containing colons (timestamps).
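
If the feed turns out to be standard SSE, the accumulate-and-dispatch approach from the fix sketch could look roughly like this. It is a minimal sketch, not the connector's actual code: SseAccumulator and ParsedSseEvent are hypothetical names, and it assumes each event carries a single JSON document. It splits each line on the first colon only, so colons inside JSON timestamps survive.

```typescript
// Hypothetical event shape; the real LlamaSweepSSEEvent type is not shown in this review.
interface ParsedSseEvent {
  type: string;
  data: unknown;
}

/**
 * Incremental parser for standard SSE framing: `event:` sets the type,
 * `data:` lines accumulate the payload, and a blank line dispatches.
 */
class SseAccumulator {
  private eventType = 'message'; // SSE default when no `event:` line was seen
  private dataLines: string[] = [];

  /** Feed one line without its trailing newline. Returns an event on dispatch, else null. */
  push(line: string): ParsedSseEvent | null {
    if (line === '') return this.flush();
    if (line.startsWith(':')) return null; // comment or keep-alive line

    // Split on the FIRST colon only, so colons inside the JSON payload are kept.
    const idx = line.indexOf(':');
    const field = idx === -1 ? line : line.slice(0, idx);
    let value = idx === -1 ? '' : line.slice(idx + 1);
    if (value.startsWith(' ')) value = value.slice(1);

    if (field === 'event') this.eventType = value;
    else if (field === 'data') this.dataLines.push(value);
    return null;
  }

  private flush(): ParsedSseEvent | null {
    const type = this.eventType;
    const raw = this.dataLines.join('\n');
    this.eventType = 'message';
    this.dataLines = [];
    if (raw === '') return null; // nothing accumulated, nothing to dispatch
    try {
      return { type, data: JSON.parse(raw) };
    } catch {
      return null; // malformed payload: drop it rather than throw mid-stream
    }
  }
}
```

The caller would feed every line of the response body to push() and hand each non-null result to onEvent. Lines arriving with a trailing \r would need to be stripped first, which this sketch leaves out.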

B2: incrementSeq defined but never called -- seq stays 0 forever

Location: apps/control/src/index.ts:33-36
Evidence:
```
function incrementSeq(state: HostState): number {
  state.seq += 1;
  return state.seq;
}
```
No call site in the codebase invokes incrementSeq. Every HostState starts with seq: 0 and stays there. The client-side dedup guard at useControlStream.tsx:168 (if (frame.seq > snapshotSeq)) discards every delta since 0 > 0 is false.
Standard violated: The seq-stamped delta protocol described in design.md section 4 ("per-host monotonic seq, incremented on every mutation").
Risk: Even with SSE parsing fixed, no delta would ever pass the client's seq filter. Live updates are structurally impossible.
Fix sketch: Call incrementSeq(state) inside handleLlamaSweepEvent and handleReconcile after every fleet-state mutation, before the DB write. Include the returned seq in the delta published to WS subscribers.

B3: WS handler has no delta-publishing mechanism -- onFleetDelta is dead code

Location: apps/control/src/routes/ws.ts:30-39

Evidence:

```
const onFleetDelta = (delta: unknown) => {
  if (socket.readyState === WebSocket.OPEN) {
    socket.send(JSON.stringify(delta));
  }
};
// Comment: "In practice, the fleet service should publish deltas through a channel
// that this handler subscribes to. For now, we use a simple approach:
// the fleet state is rebuilt on each snapshot request."
```

The callback is defined but nothing subscribes to it or calls it. There is no event emitter, no pub/sub channel, no polling loop.

Standard violated: design.md section 4: "Fan-out to browser: the control service publishes over its own WS."
Risk: WS clients get a one-shot snapshot at connection time and then go permanently stale. Model state changes, activity events, perf samples, and logs are never pushed to the frontend.
Fix sketch: Add an EventEmitter (or a simple Set<callback> pattern matching sessionEvents.ts) to the fleet state. Have handleLlamaSweepEvent/handleReconcile publish seq-stamped deltas through it. The WS handler registers a listener on connect and removes it on close.
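
The Set<callback> pattern from the fix sketch, combined with the seq increment from B2, might be wired up along these lines. This is a sketch under assumptions: the HostState and FleetDelta shapes are trimmed down for illustration, and publishMutation is a hypothetical helper, not a function in the codebase.

```typescript
// Hypothetical shapes for illustration; the real HostState has more fields.
interface HostState {
  providerId: string;
  seq: number;
}

interface FleetDelta {
  type: 'control_fleet';
  seq: number;
  providerId: string;
}

type Listener<T> = (delta: T) => void;

/** Minimal Set-of-callbacks pub/sub. */
class DeltaEmitter<T> {
  private readonly listeners = new Set<Listener<T>>();

  /** Registers a listener; the returned function removes it again. */
  subscribe(listener: Listener<T>): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  /** Delivers the delta to every listener; one throwing listener does not block the others. */
  publish(delta: T): void {
    this.listeners.forEach((listener) => {
      try {
        listener(delta);
      } catch {
        // A real service would log here instead of swallowing the error.
      }
    });
  }
}

function incrementSeq(state: HostState): number {
  state.seq += 1;
  return state.seq;
}

/** Bumps seq for a mutation that has just been applied, then fans the delta out. */
function publishMutation(state: HostState, emitter: DeltaEmitter<FleetDelta>): void {
  const seq = incrementSeq(state);
  emitter.publish({ type: 'control_fleet', seq, providerId: state.providerId });
}
```

In the WS handler, the connect path would call subscribe with a callback that sends the serialized delta, and the close and error paths would call the returned unsubscribe function so closed sockets do not accumulate in the listener set.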

B4: Snapshot wire format mismatch -- client never receives host data

Location: apps/control/src/routes/ws.ts:24-27 vs apps/web/src/hooks/useControlStream.tsx:157

Evidence: Server sends:

```
socket.send(JSON.stringify({
  type: 'control_fleet' as const,
  snapshot,  // { hosts: [...] } nested under "snapshot" key
}));
```

Client reads:

```
if (frame.hosts && Array.isArray(frame.hosts)) {  // frame.hosts is undefined
```

The hosts array is at frame.snapshot.hosts, not frame.hosts. The client silently ignores the frame.

Standard violated: Wire format contract between ws.ts and useControlStream.tsx. The ControlFleetFrame Zod schema in ws-frames.ts:492-508 expects seq and hosts at the top level, which the snapshot does not provide.
Risk: Even if B1-B3 were fixed, the client would never populate the Fleet tab. The page would show "No hosts connected" permanently.
Fix sketch: Change the server to send { type: 'control_fleet', seq: host.seq, hosts: [...] } at the top level (matching the Zod schema). Alternatively, change the client to read data.snapshot.hosts. The former is simpler and aligns with the contracts schema.
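
A small builder can keep the top-level shape in one place. The sketch below uses hypothetical, trimmed-down types (the real ControlFleetFrame schema lives in ws-frames.ts and is not reproduced here), and it assumes that a snapshot covering several hosts should carry the highest per-host seq, which the fix sketch above does not spell out.

```typescript
// Hypothetical host shape; only the fields this sketch needs.
interface FleetHostEntry {
  providerId: string;
  seq: number;
}

interface ControlFleetFrameSketch {
  type: 'control_fleet';
  seq: number;
  hosts: FleetHostEntry[];
}

/** Builds a snapshot frame with seq and hosts at the top level. */
function buildSnapshotFrame(hosts: FleetHostEntry[]): ControlFleetFrameSketch {
  // Assumption: a multi-host snapshot is stamped with the highest per-host seq.
  const seq = hosts.reduce((max, h) => Math.max(max, h.seq), 0);
  return { type: 'control_fleet', seq, hosts };
}
```

The handler would then send JSON.stringify(buildSnapshotFrame(snapshot.hosts)), so the client's frame.hosts check sees an array.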

B5: onEvent callback drops async errors -- DB failure crashes the process

Location: apps/control/src/services/fleet-connector.ts:101,169 + apps/control/src/index.ts:253

Evidence:

```
// fleet-connector.ts:101 -- typed as returning void
onEvent: (providerId: string, event: LlamaSweepSSEEvent) => void;

// fleet-connector.ts:169 -- called without await
deps.onEvent(providerId, event);

// index.ts:253 -- implementation is async
onEvent: (pid, event) => handleLlamaSweepEvent(fleet, sql, config, pid, event),
```

handleLlamaSweepEvent is async and performs SQL INSERTs. The returned Promise is discarded. Any SQL failure (connection timeout, pool exhaustion) becomes an unhandled rejection. Node 15+ crashes on unhandled rejections by default.

Standard violated: Async error handling discipline. The onReconcile callback IS typed as Promise<boolean> and is properly awaited, showing the pattern was intended.
Risk: A single transient DB error during SSE event processing crashes the entire BooControl process. Under high event throughput, unbounded concurrent DB writes also exhaust the 10-connection pool, causing cascading timeouts.
Fix sketch: Add .catch() to the onEvent call: Promise.resolve(deps.onEvent(providerId, event)).catch((err) => { deps.log.error({ providerId, err }, 'fleet: onEvent failed'); });. Change the type to (providerId: string, event: LlamaSweepSSEEvent) => void | Promise<void>. For backpressure, consider a bounded queue (e.g., p-queue with concurrency capped at pool size minus headroom).
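
One way to package the error handling is a small wrapper that treats synchronous throws and rejected promises alike. This is a minimal sketch: invokeSafely and EventHandler are hypothetical names, and the onError callback stands in for the deps.log.error call quoted above.

```typescript
type EventHandler<E> = (providerId: string, event: E) => void | Promise<void>;

/**
 * Invokes a handler that may be sync or async and routes any failure,
 * thrown or rejected, to onError instead of leaving it unhandled.
 */
async function invokeSafely<E>(
  handler: EventHandler<E>,
  providerId: string,
  event: E,
  onError: (err: unknown) => void,
): Promise<void> {
  try {
    await handler(providerId, event);
  } catch (err) {
    onError(err);
  }
}
```

If the connector loop awaits invokeSafely for each event, events from one connection are processed one at a time, which also bounds that connection's concurrent DB writes. Whether that is enough backpressure, or whether a bounded queue is still needed, depends on event rates this review did not measure.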

B6: pruneRawSamples references non-existent id column -- guaranteed SQL error

Location: apps/control/src/services/retention.ts:78-88

Evidence:

```
const toDelete = await sql<{ id: number }[]>`
  SELECT id FROM control_perf_samples  -- no "id" column in this table
  WHERE provider_id = ${providerId}
    AND ts < ${cutoff.toISOString()}
  ORDER BY ts DESC
  LIMIT ${chunkSize}
`;
```

control_perf_samples schema (schema.sql:49-55): (provider_id TEXT, ts TIMESTAMPTZ, gpu JSONB, sys JSONB) -- no id column. Compare with control_requests which has id BIGSERIAL PRIMARY KEY.

Standard violated: Schema/code consistency. The retention function was likely written for control_requests and copied without adapting to control_perf_samples's composite-key schema.
Risk: The daily retention job throws column "id" does not exist on first execution. The error propagates from the setInterval callback as an unhandled rejection, crashing the service.

Fix sketch: Rewrite to chunk by (provider_id, ts) composite key:

```
const toDelete = await sql<{ provider_id: string; ts: Date }[]>`
  SELECT provider_id, ts FROM control_perf_samples
  WHERE provider_id = ${providerId} AND ts < ${cutoff.toISOString()}
  ORDER BY ts DESC LIMIT ${chunkSize}
`;
if (toDelete.length === 0) break;
await sql`DELETE FROM control_perf_samples WHERE (provider_id, ts) = ANY(${sql(toDelete)})`;
```

Or add an id BIGSERIAL column to the table (migration needed for existing DBs).

B7: onReconcile wired but never called -- gap detection is dead code

Location: apps/control/src/services/fleet-connector.ts:102 + apps/control/src/index.ts:102-154,254
Evidence: The onReconcile callback is declared in FleetConnectorDeps and wired at index.ts:254, but the connector loop at fleet-connector.ts:122-196 never invokes deps.onReconcile. The handleReconcile function (gap detection + bulk INSERT) is unreachable dead code.
Standard violated: design.md section 4: "On reconnect, reconcile via GET /api/metrics (full ring)." The reconcile-on-reconnect path is the mechanism for detecting ring-buffer wraps and filling data gaps.
Risk: Silent data loss after connector restarts or network interruptions. Metrics ring buffer wraps are never detected, leaving permanent gaps in control_requests that are invisible to the user.
Fix sketch: Call onReconcile when the SSE metrics event arrives (pass the MetricsData through), or add a periodic reconcile timer in index.ts that fetches the full metrics ring from each host on a configurable interval.

B8: control_job frame handler inserts garbage data into activity feed

Location: apps/web/src/hooks/useControlStream.tsx:191-196

Evidence:

```
} else if (data.type === 'control_job') {
  const frame = data as ControlJobFrame;
  setState((prev) => ({
    ...prev,
    requests: [...prev.requests, { id: 0, providerId: '', ts: '', model: null,
      reqPath: null, statusCode: null, durationMs: null }].slice(-500),
  }));
}
```

The frame payload is parsed but ignored. A hardcoded garbage entry is pushed into the requests array.

Standard violated: Idempotent event handling. The handler should either use the frame data or be a no-op placeholder.
Risk: Currently moot (no control_job frames are sent in P1). When jobs are implemented, every job event pollutes the activity feed with empty phantom entries, displacing real request data from the 500-entry cap.
Fix sketch: Either implement proper job-state tracking (store in a separate jobs state field) or replace with a no-op // TODO: P3 implement job frame handling.

Advisory

A1: No fleet-state rebuild from DB on service restart

Location: apps/control/src/index.ts:223
Finding: createFleetState() always returns an empty Map. The ws.ts comment says "On service restart, rebuild fleet state from DB before serving snapshots" but this is unimplemented.
YAGNI gate: Moot while B1 is unfixed (SSE never populates state). Will become blocking once SSE is fixed. A late-joining client during the gap after restart sees all hosts as down with no models.

A2: pruneActivity and pruneModelEvents are not chunked

Location: apps/control/src/services/retention.ts:95-109
Finding: Both do unbounded DELETE in a single statement. Design doc section 6 explicitly calls for "chunked transactions: one transaction per provider per 1-hour window, never one 48h mega-transaction."
YAGNI gate: At 5s poll intervals x 2 hosts, control_requests accumulates ~35k rows/day. A 48h unbounded DELETE runs as one long transaction. It takes only a ROW EXCLUSIVE table lock, which does not by itself block the perf poller's concurrent INSERTs, but the burst of WAL and dead tuples it generates can slow them. The stall is measurable but not catastrophic for a single-user setup. Reopen trigger: if retention causes visible perf-poller lag in production.
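
The windowing that the design doc asks for can be computed separately from the SQL. The helper below is a sketch with hypothetical names (hourlyWindows, TimeWindow); it only produces the one-hour ranges and leaves the per-window DELETE to the caller.

```typescript
interface TimeWindow {
  from: Date; // inclusive
  to: Date;   // exclusive
}

const HOUR_MS = 60 * 60 * 1000;

/** Splits [oldest, cutoff) into consecutive windows of at most one hour. */
function hourlyWindows(oldest: Date, cutoff: Date): TimeWindow[] {
  const windows: TimeWindow[] = [];
  const end = cutoff.getTime();
  for (let start = oldest.getTime(); start < end; start += HOUR_MS) {
    windows.push({
      from: new Date(start),
      to: new Date(Math.min(start + HOUR_MS, end)),
    });
  }
  return windows;
}
```

Each provider would then run one short transaction per window, deleting rows with ts >= from and ts < to, so no single transaction spans the full 48h backlog.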

A3: No Zod validation on incoming WS frames

Location: apps/web/src/hooks/useControlStream.tsx:149-201
Finding: Frames are parsed with JSON.parse and cast directly to types. Sibling useUserEvents.ts:41-68 validates every frame against WsFrameSchema with fail-closed logging.
YAGNI gate: Control frames bypass the broker (raw WS proxy), so the server-side Zod gate does not apply. Without client validation, a malformed frame silently corrupts state. Reopen trigger: any incident where a bad frame causes a UI crash.

A4: ECharts instances never disposed on component unmount

Location: apps/web/src/components/control/PerfChart.tsx:95-97, VramGauge.tsx:89-91, TtlRing.tsx:98-101
Finding: Cleanup functions disconnect ResizeObservers and clear intervals but never call chart.dispose(). Canvas elements and associated GPU memory are leaked on unmount.
YAGNI gate: The Control page is a single-route SPA; components unmount only on navigation away. The leak is bounded (3 chart instances max). Reopen trigger: memory profiling shows ECharts accumulation after repeated navigation.

A5: trimCapture size estimation uses UTF-16 code-unit count as byte proxy

Location: apps/control/src/services/retention.ts:117
Finding: captureJson.length * 2 estimates bytes for a UTF-16 JS string. For ASCII-heavy JSON (the common case for HTTP captures), this overestimates by 2x, meaning captures that should be trimmed are not. The trim threshold at line 120 (sizeKB * 512) compensates, but the check-and-trim logic is inconsistent.
YAGNI gate: The cap is advisory (256KB default). Captures slightly over the cap are not trimmed, but the total budget pruning (not implemented in P1) would catch them. Reopen trigger: capture storage exceeds CAPTURE_BUDGET_MB.
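
If the cap is meant in bytes of stored text, measuring the encoded length directly removes the guesswork. The sketch below assumes UTF-8 storage, which this review did not confirm for the capture column; utf8ByteLength and exceedsCapKB are hypothetical helper names.

```typescript
const utf8Encoder = new TextEncoder();

/** Exact size of a string once encoded as UTF-8. */
function utf8ByteLength(text: string): number {
  return utf8Encoder.encode(text).length;
}

/** True when the UTF-8 encoding of text is larger than capKB kibibytes. */
function exceedsCapKB(text: string, capKB: number): boolean {
  return utf8ByteLength(text) > capKB * 1024;
}
```

For pure-ASCII JSON the result equals captureJson.length, half of what the length * 2 estimate reports, and multi-byte characters are counted at their real encoded size.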

A6: Fixed 5s reconnect delay without exponential backoff

Location: apps/web/src/hooks/useControlStream.tsx:205
Finding: setTimeout(connect, 5000) -- fixed delay. Siblings useUserEvents.ts and useSessionStream.ts both use exponential backoff (1s to 30s).
YAGNI gate: The control WS is a secondary connection; a 5s reconnect cadence is acceptable for a dashboard. Reopen trigger: reconnect storms during extended outages.
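
Should the reopen trigger fire, a delay function in the 1s to 30s range used by the sibling hooks could be sketched as below. The function name, the option names, and the jitter behaviour are assumptions made for illustration; the sibling hooks' actual implementation was not copied.

```typescript
interface BackoffOptions {
  baseMs?: number;       // delay for attempt 0
  maxMs?: number;        // upper bound on any delay
  jitter?: number;       // 0..1, fraction of the delay that may be randomly removed
  random?: () => number; // injectable for tests; defaults to Math.random
}

/** Exponential backoff: baseMs * 2^attempt, capped at maxMs, optionally jittered downward. */
function backoffDelayMs(attempt: number, options: BackoffOptions = {}): number {
  const { baseMs = 1000, maxMs = 30000, jitter = 0, random = Math.random } = options;
  const exponential = baseMs * 2 ** Math.max(0, attempt);
  const capped = Math.min(maxMs, exponential);
  return Math.round(capped * (1 - jitter * random()));
}
```

The hook would keep an attempt counter, schedule setTimeout(connect, backoffDelayMs(attempt)) on close, increment the counter after each failure, and reset it to 0 once a connection opens.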

A7: Perf poller has no fetch timeout

Location: apps/control/src/index.ts:176
Finding: fetch(url) has no signal or timeout. If a host hangs (accepts TCP but never responds), the poll blocks indefinitely. The sequential for loop at line 271 means one hung host stalls polling for all subsequent hosts.
YAGNI gate: llama-swap's /api/performance is a fast local endpoint. Reopen trigger: any host observed hanging in production.
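
A timeout can be added with an AbortController without changing the poller's structure. In the sketch below, fetchWithTimeout and FetchLike are hypothetical names; the fetch function is passed in as a parameter so the helper can be exercised without a network, and the global fetch fits the same signature.

```typescript
// Minimal fetch-like signature so the sketch can be exercised without a network.
type FetchLike<T> = (url: string, init: { signal: AbortSignal }) => Promise<T>;

/** Runs fetchImpl and aborts it if no response has arrived after timeoutMs. */
async function fetchWithTimeout<T>(
  fetchImpl: FetchLike<T>,
  url: string,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetchImpl(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer); // do not leave the timer pending after a fast response
  }
}
```

The poller could call fetchWithTimeout(fetch, url, timeoutMs) with a timeout below the poll interval. Polling hosts through Promise.allSettled instead of a sequential loop would additionally keep one slow host from delaying the others.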

A8: Perf poller catch block swallows errors silently

Location: apps/control/src/index.ts:190-192
Finding: catch { // Poll failure -- handled by the connector's circuit-breaker. }. The comment references a circuit-breaker that does not exist for the perf poller. The error is silently discarded.
YAGNI gate: Same as A7 -- fast local endpoint, errors are transient. Reopen trigger: silent poll failures observed in logs.

A9: Response header forwarding without filtering in control-proxy

Location: apps/server/src/routes/control-proxy.ts:78-81
Finding: All upstream response headers are forwarded except transfer-encoding. This includes set-cookie, x-powered-by, and internal headers. The coder-proxy has the same pattern (deliberate clone), but the control service is a new internal service with no auth, making header leakage more concerning.
YAGNI gate: BooControl is an internal dashboard behind Authelia. Header leakage is not exploitable from outside the Tailscale mesh. Reopen trigger: any external exposure of the control endpoint.
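
If filtering is added later, a denylist is the smallest change. The sketch below uses a hypothetical filterResponseHeaders helper; the blocked set combines the headers commonly treated as hop-by-hop with the two named in this finding, and it is an illustrative list, not a vetted one.

```typescript
// Headers commonly treated as hop-by-hop, plus the leak-prone ones named in this finding.
const BLOCKED_RESPONSE_HEADERS = new Set([
  'connection', 'keep-alive', 'proxy-authenticate', 'proxy-authorization',
  'te', 'trailer', 'transfer-encoding', 'upgrade',
  'set-cookie', 'x-powered-by',
]);

/** Returns a copy of the upstream headers without the blocked ones (case-insensitive). */
function filterResponseHeaders(upstream: Record<string, string>): Record<string, string> {
  const forwarded: Record<string, string> = {};
  for (const [headerName, headerValue] of Object.entries(upstream)) {
    if (!BLOCKED_RESPONSE_HEADERS.has(headerName.toLowerCase())) {
      forwarded[headerName] = headerValue;
    }
  }
  return forwarded;
}
```

A denylist cannot catch internal headers whose names are not known in advance. An allowlist (for example content-type, content-length, cache-control) would be stricter, at the cost of having to be extended whenever the UI needs a new header.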

A10: SSRF via unvalidated ssh_host in URL construction

Location: apps/control/src/index.ts:248
Finding: const baseUrl = `http://${sshHost}:8401` -- ssh_host from the DB flows directly into fetch() URLs with no validation (IP format, private-range check).
YAGNI gate: control_hosts is seeded with known hosts and modified only via direct SQL (no admin UI in P1). An attacker with DB write access already has worse options. Reopen trigger: any user-facing host-edit UI.
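
For the day a host-edit UI appears, the IP-format and private-range checks could be sketched as follows. This assumes ssh_host values are IPv4 literals, which the review did not confirm; hostnames would need a resolve-then-check step that is not shown. Both function names are hypothetical.

```typescript
/** Parses a dotted-quad IPv4 literal; returns null for hostnames, IPv6, or padded octets. */
function parseIPv4(host: string): number[] | null {
  const parts = host.split('.');
  if (parts.length !== 4) return null;
  const octets = parts.map((p) => (/^(0|[1-9]\d{0,2})$/.test(p) ? Number(p) : NaN));
  // NaN fails both comparisons, so malformed octets are rejected here too.
  return octets.every((o) => o >= 0 && o <= 255) ? octets : null;
}

/** Accepts only RFC 1918 private ranges and 100.64.0.0/10, the range Tailscale assigns from. */
function isAllowedControlHost(host: string): boolean {
  const octets = parseIPv4(host);
  if (octets === null) return false;
  const [a, b] = octets;
  return (
    a === 10 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 100 && b >= 64 && b <= 127)
  );
}
```

Loopback and link-local addresses are rejected on purpose in this sketch; whether the control service ever needs to reach a host on 127.0.0.1 is a deployment decision.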

Nits

N1: Duplicate createFleetState definition -- index.ts:14 defines a local createFleetState that shadows the identical export from fleet-state.ts:60. Remove the local copy and import from the module.

N2: theme as any cast in ECharts init -- PerfChart.tsx:37, VramGauge.tsx:25, TtlRing.tsx:25. buildEChartsTheme() returns Record<string, unknown> but echarts.init() expects a typed theme. The as any bypasses type safety. Low risk since the theme object is simple and validated by visual inspection.

N3: window.matchMedia called in render body -- HostCard.tsx:51 and HostCard.tsx:207. The prefersReducedMotion check runs on every render. Move to a useMemo or module-level constant to avoid redundant re-evaluation.

N4: SSE error logging drops the error object -- fleet-connector.ts:185. The err variable from the catch block is captured but not included in the log fields. Distinguishing connection reset from DNS failure requires the error message.

N5: Sequential N+1 DB inserts for metrics entries -- index.ts:79-86. Each metrics entry triggers an individual await sql INSERT. A batch of N entries requires N round-trips. Consider a multi-row INSERT or a transactional batch.

Verdict

Block

Blocking findings B1-B8 must be resolved before merge. The SSE parser inversion (B1) makes the entire ingestion pipeline dead code. The seq/delta/publish chain (B2-B4) makes the WS endpoint non-functional. The retention crash (B6) will take down the service on first daily tick. The async error handling (B5) means any DB failure is a process crash. The reconcile dead code (B7) means gap detection never runs. The garbage handler (B8) will corrupt the activity feed when jobs ship.

The core recommendation: before fixing individual bugs, establish the end-to-end data flow first. Wire SSE parse -> event handler -> seq increment -> delta publish -> WS broadcast -> client apply in a single pass, with integration tests at each boundary. The current code has the right shapes (backoff+jitter, seq-stamped protocol, chunked retention) but none of the links are connected.

Claims I did not verify

Whether llama-swap's /api/events SSE format is standard (event: + data: lines) or non-standard (single-line type: json). The fix for B1 depends on this.
Whether the control_perf_samples table exists in any deployed DB (it would fail on SELECT id if it does).
Whether react-virtuoso's followOutput prop type accepts 'bottom' as FollowOutput without runtime issues.
Whether the ECharts GaugeChart import at VramGauge.tsx:4 and TtlRing.tsx:4 is tree-shakeable or pulls the full gauge bundle.
Whether the postgres tagged-template library parameterizes ::jsonb casts correctly (the security analyst concluded it does, but I did not trace the library internals).
Whether the setInterval callbacks at index.ts:265,277 can overlap if a poll/retention cycle exceeds the interval period (Node's single-threaded model prevents true overlap, but the async callback can be re-entered at await points).
Whether the onClose hook at index.ts:287 fires before or after sql.end() in the shutdown sequence.

Re-review (post-fix)

Date: 2026-06-12
Baseline: p1-code-review.md (verdict Block, B1-B8 blocking)
Fix pass: p1-fix-analysis.md (all B1-B8 claimed fixed, 49 tests passing)

Scope

Same files as original review. Re-traced the full data chain: SSE line -> parseSseLine -> handleLlamaSweepEvent -> DB insert + incrementSeq -> DeltaEmitter.publish -> ws.ts subscriber -> ControlFleetFrame wire shape -> useControlStream.tsx client application. Verified each blocking finding by reading the current code, not by trusting comments or the fix analysis.

Size

Medium -- fix pass across 7 source files + 1 new test file; no new subsystems or surfaces.

Summary

All 8 original blocking findings are genuinely fixed at the code level. The SSE parser works, incrementSeq is called on every mutation, the DeltaEmitter pattern connects mutations to WS subscribers, the wire format matches between server and client, async errors are caught, retention uses the composite key, reconcile runs from the metrics case, and the job handler uses frame data. However, the fix pass introduced a new multi-host regression (each delta replaces the full hosts array). In addition, rebuildFleetFromDB sets liveness to 'connected' when it should be 'down', HostCard reads a gpu field that does not exist on the wire type, and the pipeline test simulates the logic inline rather than exercising the real implementation chain.

Classification	Count
Blocking	1
Advisory	3
Nit	1

Blocking findings: B1-B8 confirmation

B1: SSE line parser inverted

Verdict: FIXED

fleet-connector.ts:116-159: The contradictory startsWith('data:') filter is gone. parseSseLine now correctly handles three cases:

event: lines set the event type (line 124-126)
data: lines emit the event using the current event type (line 129-141)
Non-standard type: json single-line format (line 144-156)

The caller loop at fleet-connector.ts:204-227 tracks currentEventType and calls parseSseLine(line, currentEventType). For standard SSE, an event: line returns {event: null, eventType: 'modelStatus'}, which the caller stores; the next data: line then returns the parsed event with the stored type. The dead code is gone and the onEvent callback is now reachable.

B2: incrementSeq never called

Verdict: FIXED

incrementSeq is exported from fleet-state.ts:83-86, imported in index.ts:6, and called at:

index.ts:60 (modelStatus case)
index.ts:89 (logData case)
index.ts:102 (metrics case)
index.ts:237 (pollPerformance, per sample)

Every fleet-state mutation increments seq before publishing. The seq is included in the delta payload.

B3: WS handler has no delta-publishing mechanism

Verdict: FIXED

DeltaEmitter (index.ts:16-34) is a Set<callback> pattern with subscribe and publish. Every mutation path calls emitter.publish(...). ws.ts:34-37 subscribes on connect, unsubscribes on close/error (lines 48-56). The listener set is iterated in publish with per-listener try/catch (line 30). Live updates flow from mutation to WS client.

B4: Snapshot wire format mismatch

Verdict: FIXED

ws.ts:26-31 sends { type: 'control_fleet', seq: maxSeq, hosts: snapshot.hosts } at the top level, matching the ControlFleetFrame Zod schema (ws-frames.ts:492-508). The client at useControlStream.tsx:155 reads frame.hosts which now exists. Snapshot uses maxSeq across all hosts (line 26). Client distinguishes snapshot from delta via hasSnapshotRef flag (line 156-166).

B5: onEvent drops async errors

Verdict: FIXED

fleet-connector.ts:101: Type is () => void | Promise<void>. Call site at line 222-226: await Promise.resolve(deps.onEvent(providerId, parsed.event)) with catch that logs via deps.log.error. DB failures no longer produce unhandled rejections.

B6: pruneRawSamples references non-existent id column

Verdict: FIXED

retention.ts:77-88: Rewritten to use composite key (provider_id, ts). SELECT returns { provider_id, ts } rows. DELETE uses WHERE (provider_id, ts) = ANY(...). Chunked in a while-loop with chunkSize = 1000.

B7: onReconcile wired but never called

Verdict: FIXED (with nit)

Gap detection now runs via handleLlamaSweepEvent -> handleReconcile direct call (index.ts:101-105), not via deps.onReconcile. The deps.onReconcile callback at index.ts:377 is wired but never invoked from the connector loop -- it is dead code. The effect is correct: metrics events trigger reconcile. The dead onReconcile dep is a nit (see below).

B8: control_job garbage insert

Verdict: FIXED

useControlStream.tsx:185-191: Handler reads frame.jobType, frame.jobId, frame.status from the parsed ControlJobFrame and pushes a proper entry to the jobs array, capped at 200. No hardcoded garbage.

New finding from fix pass

B9: Fleet delta replaces entire hosts array -- multi-host regression

Location: apps/web/src/hooks/useControlStream.tsx:164
Evidence:
```
// Delta: apply only if seq > snapshot seq.
if (frame.seq > snapshotSeqRef.current) {
  setState((prev) => ({ ...prev, hosts: frame.hosts as unknown as ControlFleetHost[] }));
}
```
Each delta from the server contains only the changed host in hosts (e.g., index.ts:68-84 publishes a single-element array). The client replaces prev.hosts wholesale with this single-element array. With 2+ connected hosts, a modelStatus event for host A wipes host B from the UI until the next snapshot.
Standard violated: Idempotent delta application. Deltas should merge by providerId, not replace the full array.
Risk: Any multi-host deployment shows flickering/missing hosts in the Fleet tab. Single-host deployments are unaffected.

Fix sketch:

```
if (frame.seq > snapshotSeqRef.current) {
  setState((prev) => {
    const hostMap = new Map(prev.hosts.map((h) => [h.providerId, h]));
    for (const h of frame.hosts) hostMap.set(h.providerId, h);
    return { ...prev, hosts: Array.from(hostMap.values()) };
  });
}
```

A1 rebuildFleetFromDB correctness

Location: index.ts:256-310

Finding: rebuildFleetFromDB sets state.liveness = 'connected' at line 270 for every host it rebuilds from DB. This runs at startup (line 355-357), before SSE connectors start (line 366-385). After a service restart, hosts have no live SSE connection yet. Setting liveness to 'connected' is incorrect -- the hosts should start as 'down' (the default from ensureHostState at fleet-state.ts:67) until the SSE connector establishes a connection.

The correct behavior: rebuildFleetFromDB should populate models/lastSeenAt from DB but leave liveness at the default 'down'. The SSE connector loop will update liveness to 'connected' when connections are established (via stampLastSeen + the modelStatus case setting state.liveness = 'connected' at index.ts:52).

Severity: Advisory. A late-joining client during the brief window before connectors start sees hosts as 'connected' with stale data. The window is typically seconds. The hosts will flip to 'down' shortly afterwards if the connector fails to connect, or stay 'connected' if it succeeds, so the visual glitch is minor. But it violates the liveness semantic.

HostCard.tsx:56 double-cast

Location: apps/web/src/components/control/HostCard.tsx:56

```
const gpuData = (host as unknown as Record<string, unknown>)['gpu'] as {
  vram_used?: number; vram_total?: number; temperature?: number; power?: number;
} | undefined;
```

The ControlFleetHost type has no gpu field. The double-cast accesses a property that doesn't exist on the wire type. At runtime, host.gpu is always undefined, so the GPU gauge always shows "no GPU data". This is a silent no-op, not a crash.

Typed fix: GPU data comes from perf samples, not the fleet snapshot. The HostCard should receive the latest perf sample for its host as a prop (looked up from ControlStreamState.perfSamples by providerId). Remove the double-cast; add a perfSample?: ControlPerfSample prop to HostCardProps.

pipeline.test.ts quality

Location: apps/control/src/services/__tests__/pipeline.test.ts

The test title says "SSE pipeline: parse -> store -> emit deltas" but it does not exercise the actual handleLlamaSweepEvent, DeltaEmitter, or SQL code paths. Instead, it reimplements the logic inline (lines 97-132) with mock SQL that always succeeds. This means:

The await + catch error handling (B5 fix) is never tested -- mock SQL never fails.
The DeltaEmitter.publish -> subscriber path is never tested.
The actual handleLlamaSweepEvent function is never called.
The metrics case with reconcile and per-entry INSERTs is not tested against the real code.

The tests prove the logic can work in isolation but do not prove the wiring is correct. The reconcile.test.ts (7 tests on detectGap) is solid and well-targeted. The fleet-connector.test.ts and fleet-state.test.ts test their respective modules. But there is no integration test that calls handleLlamaSweepEvent with a mock SQL + DeltaEmitter and asserts the emitted deltas match the wire format.

Severity: Advisory. The unit tests cover the building blocks. An integration test would catch wiring bugs (wrong import, wrong field name, missing await). Reopen trigger: any bug where the individual components pass tests but the pipeline fails at runtime.

Accepted follow-ups (not re-litigated)

A2, A3, A5, A9, A10 per the fix analysis YAGNI gates.

Nits

N6: Dead onReconcile dep callback -- fleet-connector.ts:102 declares onReconcile in FleetConnectorDeps, wired at index.ts:377, but the connector loop never calls deps.onReconcile. Reconcile runs via the direct handleLlamaSweepEvent -> handleReconcile path. Remove the dead callback or have the connector call it on the metrics event instead of calling handleReconcile directly from handleLlamaSweepEvent.

Verdict

REQUEST-CHANGES

B1-B8 from the original review are all genuinely fixed. The data chain works end-to-end for a single host. However, the fix pass introduced a new blocking finding:

B9 (blocking): Fleet delta replaces the entire hosts array, breaking multi-host deployments. A delta for one host wipes all other hosts from the UI. Fix: merge deltas by providerId instead of replacing prev.hosts.

Advisory findings to address before or shortly after merge:

A1 rebuild liveness: rebuildFleetFromDB sets liveness to 'connected' before connectors start. Should leave at 'down'.
HostCard double-cast: Remove the as unknown as cast; pass GPU data from perfSamples as a typed prop.
pipeline.test.ts: Does not exercise the real handleLlamaSweepEvent or DeltaEmitter chain. Consider an integration test with mock SQL + emitter.

Claims I did not verify

Same as original review (llama-swap SSE format, react-virtuoso types, ECharts tree-shaking, postgres parameterization, setInterval overlap, shutdown ordering).
Whether the DELETE ... = ANY(${sql(toDelete)}) pattern at retention.ts:87 works with the postgres library when toDelete contains objects with Date values (the ts field is typed as Date but the column is TIMESTAMPTZ).
Whether the batch INSERT at index.ts:229-231 (sql.unsafe(inserts.map(s => s.toString()).join(';\n'))) correctly handles the semicolon-separated multi-statement execution in the postgres library.
