Files
boocode/openspec/changes/boocontrol/artifacts/p3-audit.md
indifferentketchup b18de2a331 chore: snapshot working tree - pty_exited notifications + in-flight inference WIP
feat(booterm): structured pty_exited WS notifications. Plan-validated, impl-validated, code-reviewed green (contracts build clean, contracts test 29/29, booterm + web typecheck clean).

wip: in-progress inference/provider refactor (agents.ts, provider.ts, new llama-providers.ts, removed llama-args-validator), plus arena, dispatcher, compaction, schema changes.

openspec: pty-exit-notifications complete; x-agent-flags planned (not yet implemented).
2026-06-14 12:48:47 +00:00

11 KiB

P3 Audit — Validation + Code Review

Validation: boocontrol P3 (implementation mode)

Verdict: PASS-WITH-FINDINGS

Task claim table

Task Claim Evidence Status
P3.1 Playground tab Model select, param controls, streaming chat, A/B compare, Arena handoff routes/playground.ts:17-238 — GET /api/playground/models, POST /api/playground/chat (SSE relay), POST /api/playground/chat-ab (dual SSE with lane wrapping). PlaygroundTab.tsx:19-494 — grouped model picker, temperature/topP/maxTokens controls, single-stream chat at line 80, A/B compare at line 163, Arena link at line 249. PROVEN
P3.2 Bench engine Suite model, TTFT capture, timings parse, bounded fan-out, aggregates + samples to DB bench-engine.ts:241-393runBenchSuite builds grid at line 252, Promise.allSettled fan-out at line 329, TTFT at line 180-182, parseLlamaTimings at line 63-102, samples INSERT at line 367, aggregates at line 375. Schema: schema.sql:85-136bench_suites, bench_runs, bench_samples with FKs + indexes. PROVEN
P3.3 V1 safety User-initiated only, takeover confirmation, embedding-first defaults, concurrent_foreign_requests routes/bench.ts:182-193checkRecentTraffic at line 380 reads hostState.models inflight totals; returns 409 via acquireHostAccess at line 187. runBenchAsync at line 411 records concurrent_foreign_requests from control_requests last 60s at line 422-427. host-access.ts:13-18 — v1 no-op {ok:true}. PROVEN
P3.4 acquireHostAccess seam Every run gates through acquireHostAccess(providerId, purpose) routes/bench.ts:187const grant = await acquireHostAccess(suite.providerId, 'bench') before runner launch. playground.ts does NOT call it (playground is read-only, not a bench run — correct). host-access.ts:13-18{ok:true} no-op, documented P8 seam. PROVEN
P3.5 Bench UI Run launcher, live progress via control_job, history charts, baseline + regression flags BenchTab.tsx:65-649 — launcher view at line 400, history view at line 524, results view at line 592. control_job frames consumed by useControlStream.tsx:266-271. Baselines: getRegressionFlag at line 223 — delta < -10% -> regression, > +5% -> improvement. History chart with ECharts at line 311. Results chart at line 235. PROVEN

Design section 8 "Speed bench" conformance

Design requirement Implementation Status
HTTP-level via llama-swap bench-engine.ts:140fetch(\${baseUrl}/v1/chat/completions`)` PASS
llama.cpp timings parse parseLlamaTimings at line 63 — reads timings.prompt_per_second etc. PASS
TTFT client-side at first delta bench-engine.ts:180-182 — captures Date.now() on first delta PASS
Bounded fan-out (Promise.allSettled) bench-engine.ts:329Promise.allSettled(promises) with batchSize = concurrency at line 309 PASS
Warmup excluded Not implemented (no warmup pass) FINDING
Baselines + regression (-10% threshold) BenchTab.tsx:223-233 — compares avgGenTps delta < -0.1 PASS (UI only)
User-initiated, manual POST /api/bench/run — no scheduler PASS
Takeover confirmation checkRecentTraffic + acquireHostAccess gate PASS
concurrent_foreign_requests runBenchAsync:422-427 — counts from control_requests last 60s PASS

Review: P3 implementation (APPROVE-WITH-NITS)

Blocking (0)

None. No correctness issues that block merge.

Advisory (6)

A1: Regression baseline comparison has no baseline stored in DB

  • Location: BenchTab.tsx:223-233, routes/bench.ts:348-374
  • Finding: The getRegressionFlag function compares against baselineAggregate passed from state, but the baseline data comes from GET /api/bench/baselines which fetches the latest completed run per (provider_id, model). There is no dedicated bench_baselines table — baselines are implicitly "the latest run." The getRegressionFlag is only called in the history view at line 534 with null as the second argument: getRegressionFlag(run.aggregate, null). This means regression flags are ALWAYS null in the actual UI. The baseline comparison logic exists but is dead code in the history view.
  • Impact: P3.5 claim "baseline + regression flags" is partially unproven — the comparison function works, but the UI never passes a baseline to it. The flag rendering at lines 553-560 is never triggered.
  • YAGNI gate: This is a real usability gap for the speed bench demo. The baseline data IS fetched (line 209) and stored in state (line 217), but never correlated to the run's suite/model for comparison.

A2: jobType not stored in bench_runs table

  • Location: schema.sql:99-111, bench-engine.ts:282,352,388
  • Finding: control_job frames carry jobType: 'bench' (and jobType: 'action' in actions.ts:70), but the bench_runs table has no job_type column. The control_job frame is only a WS event for live progress — there is no persistent job type on the run record. If P5 adds eval runs that also write to bench_runs, there is no way to distinguish bench from eval runs in the DB.
  • YAGNI gate: Bench and eval are separate phases (P3 vs P5). Acceptable for v1.

A3: resolveBaseUrl is hardcoded, not read from control_hosts

  • Location: routes/bench.ts:398-406, routes/playground.ts:232-237
  • Finding: Both resolveBaseUrl in bench.ts and resolveProviderUrl in playground.ts use hardcoded Record<string, string> mappings. The control_hosts table stores ssh_host which should be the source of truth. This means adding a new host requires editing two files.
  • YAGNI gate: Only two hosts exist and are seeded. Low blast radius.

A4: Benchmark requests do not include suite-defined sampling params

  • Location: bench-engine.ts:143-150
  • Finding: runSingleBenchRequest accepts temperature and topP parameters (line 116-117) and passes them to the request body. However, the BenchSuite interface (line 17-27) does NOT include temperature or topP — those come from BenchRunParams (line 29-34) which is the runner-level parameter. The suite definition has metadata?: Record<string, unknown> but no typed sampling params. This means the bench endpoint at routes/bench.ts:139-143 defaults to temperature: 0.7, topP: 0.9 regardless of what the suite was designed with. The suite's params are silently ignored.
  • YAGNI gate: v1 uses fixed params. The design says "v1 sampling-params parity: bench requests should honor suite params, not silently use server defaults." This is a spec gap — the suite schema should include temperature and topP as typed fields.

A5: No warmup pass

  • Location: bench-engine.ts:241-393
  • Finding: The design section 8 says "warmup excluded from results" implying a warmup pass exists. The code has no warmup phase — it runs the full grid directly. For llama.cpp, the first request to a model is typically slower (model loading/prefill), so TTFT values are inflated without a warmup. The comment at line 8 ("Warmup excluded from results") is misleading — there is no warmup at all.
  • YAGNI gate: Bench is manual, results are for Sam's own hardware. Acceptable for v1.

A6: checkRecentTraffic reads from in-memory state, not the activity stream

  • Location: routes/bench.ts:380-392
  • Finding: The design says "concurrent_foreign_requests recorded per run to flag polluted results" and "sourced from the live activity stream during the run window." However, checkRecentTraffic reads hostState.models inflight counts (in-memory SSE state), while runBenchAsync records concurrent_foreign_requests from control_requests DB queries. These measure different things: inflight counts (instantaneous) vs request count in last 60s (windowed). The UI shows concurrentForeignRequests from the DB (the 60s window) but the takeover confirmation uses the in-memory inflight count. This is not a bug — they serve different purposes — but the naming is inconsistent with the design spec which says "sourced from the activity stream."
  • YAGNI gate: Both measurements are valid indicators. The design spec is slightly imprecise.

Nits (5)

N1: BenchTab.tsx:534 — baseline lookup is O(n) per run in history view

  • const suite = suites.find((s) => s.id === run.suiteId) at line 533 — fine for small N but should be a Map for correctness.

N2: BenchTab.tsx:190-197 — polling interval leaks on component unmount

  • pollInterval is created in runBench but clearInterval is only called when status changes or 10 min timeout fires. If the user navigates away from the Bench tab while a run is in progress, the interval keeps firing.

N3: playground.ts:125 — SSE relay drops the data: prefix

  • reply.raw.write(\data: ${trimmed}\n\n`)— thetrimmedline already hasdata: stripped by the SSE parser inbench-engine.ts:66, but the playground relay receives raw SSE lines from llama-swap which may or may not have the prefix. If llama-swap sends data: {...}, trimmedbecomesdata: {...}(after trim) and the relay writesdata: data: {...}— double prefix. However,bench-engine.ts` strips it; the playground is a direct relay so it depends on what llama-swap sends. This is fragile.

N4: bench-engine.ts:211-222 — prompt generation is a rough approximation

  • charsPerToken = 4 is used to generate deterministic prompts. The comment says "~1.3 chars/token is a rough average for English text" but the code uses 4. This is internally inconsistent. The prompt will be ~4x longer than intended token count.

N5: BenchTab.tsx:229 — delta calculation divides by zero risk

  • const delta = (currentGenTps - baselineGenTps) / baselineGenTps; — if baselineGenTps is 0, this produces Infinity. The == null check at line 227 does not guard against 0.

Claims I did not verify

  1. useControlStream integration with Control.tsx — I read the hook and page, but did not verify that ControlProvider wraps the Control page in App.tsx. The routing exists (/control in App.tsx), but the provider placement was not confirmed.
  2. /api/control/playground/models route path — The playground routes are registered at /api/playground/* (route path prefix in registerPlaygroundRoutes), but the web client fetches /api/control/playground/models (PlaygroundTab.tsx:47). The control-proxy at apps/server/src/routes/control-proxy.ts:64 rewrites /api/control/* to /api/*, so this should work. Not verified by reading the proxy rewrite logic end-to-end.
  3. jobType: 'bench' in the WsFrameSchema — The ControlJobFrame has jobType: z.enum(['bench', 'eval', 'action']) (ws-frames.ts:548). This is correct.
  4. BenchRunParams.temperature and topP flow — The bench route at routes/bench.ts:142-143 passes temperature/topP to runBenchAsync, which passes them to runBenchSuite, which passes them to runSingleBenchRequest. The chain is complete.
  5. Contracts drift test coverage — The ws-frames.test.ts passes (11 tests). I did not read the test file to confirm it covers all 5 new control frame types.