boocode/openspec/changes/boocontrol/artifacts/p3-audit.md

# P3 Audit — Validation + Code Review

## Validation: boocontrol P3 (implementation mode)

### Verdict: PASS-WITH-FINDINGS

### Task claim table

| Task | Claim | Evidence | Status |
|------|-------|----------|--------|
| P3.1 Playground tab | Model select, param controls, streaming chat, A/B compare, Arena handoff | `routes/playground.ts:17-238` — GET `/api/playground/models`, POST `/api/playground/chat` (SSE relay), POST `/api/playground/chat-ab` (dual SSE with lane wrapping). `PlaygroundTab.tsx:19-494` — grouped model picker, temperature/topP/maxTokens controls, single-stream chat at line 80, A/B compare at line 163, Arena link at line 249. | PROVEN |
| P3.2 Bench engine | Suite model, TTFT capture, timings parse, bounded fan-out, aggregates + samples to DB | `bench-engine.ts:241-393` — `runBenchSuite` builds grid at line 252, `Promise.allSettled` fan-out at line 329, TTFT at line 180-182, `parseLlamaTimings` at line 63-102, samples INSERT at line 367, aggregates at line 375. Schema: `schema.sql:85-136` — `bench_suites`, `bench_runs`, `bench_samples` with FKs + indexes. | PROVEN |
| P3.3 V1 safety | User-initiated only, takeover confirmation, embedding-first defaults, concurrent_foreign_requests | `routes/bench.ts:182-193` — `checkRecentTraffic` at line 380 reads `hostState.models` inflight totals; returns 409 via `acquireHostAccess` at line 187. `runBenchAsync` at line 411 records `concurrent_foreign_requests` from `control_requests` last 60s at line 422-427. `host-access.ts:13-18` — v1 no-op `{ok:true}`. | PROVEN |
| P3.4 acquireHostAccess seam | Every run gates through `acquireHostAccess(providerId, purpose)` | `routes/bench.ts:187` — `const grant = await acquireHostAccess(suite.providerId, 'bench')` before runner launch. `playground.ts` does NOT call it (playground is read-only, not a bench run — correct). `host-access.ts:13-18` — `{ok:true}` no-op, documented P8 seam. | PROVEN |
| P3.5 Bench UI | Run launcher, live progress via control_job, history charts, baseline + regression flags | `BenchTab.tsx:65-649` — launcher view at line 400, history view at line 524, results view at line 592. `control_job` frames consumed by `useControlStream.tsx:266-271`. Baselines: `getRegressionFlag` at line 223 — delta < -10% -> regression, > +5% -> improvement. History chart with ECharts at line 311. Results chart at line 235. | PROVEN |

### Design section 8 "Speed bench" conformance

| Design requirement | Implementation | Status |
|---|---|---|
| HTTP-level via llama-swap | `bench-engine.ts:140` — `fetch(\`${baseUrl}/v1/chat/completions\`)` | PASS |
| llama.cpp timings parse | `parseLlamaTimings` at line 63 — reads `timings.prompt_per_second` etc. | PASS |
| TTFT client-side at first delta | `bench-engine.ts:180-182` — captures `Date.now()` on first delta | PASS |
| Bounded fan-out (Promise.allSettled) | `bench-engine.ts:329` — `Promise.allSettled(promises)` with `batchSize = concurrency` at line 309 | PASS |
| Warmup excluded | Not implemented (no warmup pass) | FINDING |
| Baselines + regression (-10% threshold) | `BenchTab.tsx:223-233` — compares `avgGenTps` delta < -0.1 | PASS (UI only) |
| User-initiated, manual | POST `/api/bench/run` — no scheduler | PASS |
| Takeover confirmation | `checkRecentTraffic` + `acquireHostAccess` gate | PASS |
| `concurrent_foreign_requests` | `runBenchAsync:422-427` — counts from `control_requests` last 60s | PASS |

## Review: P3 implementation (APPROVE-WITH-NITS)

### Blocking (0)

None. No correctness issues that block merge.

### Advisory (6)

**A1: Regression baseline comparison has no baseline stored in DB**
- **Location:** `BenchTab.tsx:223-233`, `routes/bench.ts:348-374`
- **Finding:** The `getRegressionFlag` function compares against `baselineAggregate` passed from state, but the baseline data comes from `GET /api/bench/baselines` which fetches the latest completed run per (provider_id, model). There is no dedicated `bench_baselines` table — baselines are implicitly "the latest run." The `getRegressionFlag` is only called in the history view at line 534 with `null` as the second argument: `getRegressionFlag(run.aggregate, null)`. This means regression flags are ALWAYS null in the actual UI. The baseline comparison logic exists but is dead code in the history view.
- **Impact:** P3.5 claim "baseline + regression flags" is partially unproven — the comparison function works, but the UI never passes a baseline to it. The flag rendering at lines 553-560 is never triggered.
- **YAGNI gate:** This is a real usability gap for the speed bench demo. The baseline data IS fetched (line 209) and stored in state (line 217), but never correlated to the run's suite/model for comparison.

**A2: `jobType` not stored in `bench_runs` table**
- **Location:** `schema.sql:99-111`, `bench-engine.ts:282,352,388`
- **Finding:** `control_job` frames carry `jobType: 'bench'` (and `jobType: 'action'` in `actions.ts:70`), but the `bench_runs` table has no `job_type` column. The `control_job` frame is only a WS event for live progress — there is no persistent job type on the run record. If P5 adds eval runs that also write to `bench_runs`, there is no way to distinguish bench from eval runs in the DB.
- **YAGNI gate:** Bench and eval are separate phases (P3 vs P5). Acceptable for v1.

**A3: `resolveBaseUrl` is hardcoded, not read from `control_hosts`**
- **Location:** `routes/bench.ts:398-406`, `routes/playground.ts:232-237`
- **Finding:** Both `resolveBaseUrl` in bench.ts and `resolveProviderUrl` in playground.ts use hardcoded `Record<string, string>` mappings. The `control_hosts` table stores `ssh_host` which should be the source of truth. This means adding a new host requires editing two files.
- **YAGNI gate:** Only two hosts exist and are seeded. Low blast radius.

**A4: Benchmark requests do not include suite-defined sampling params**
- **Location:** `bench-engine.ts:143-150`
- **Finding:** `runSingleBenchRequest` accepts `temperature` and `topP` parameters (line 116-117) and passes them to the request body. However, the `BenchSuite` interface (line 17-27) does NOT include `temperature` or `topP` — those come from `BenchRunParams` (line 29-34) which is the runner-level parameter. The suite definition has `metadata?: Record<string, unknown>` but no typed sampling params. This means the bench endpoint at `routes/bench.ts:139-143` defaults to `temperature: 0.7, topP: 0.9` regardless of what the suite was designed with. The suite's params are silently ignored.
- **YAGNI gate:** v1 uses fixed params. The design says "v1 sampling-params parity: bench requests should honor suite params, not silently use server defaults." This is a spec gap — the suite schema should include `temperature` and `topP` as typed fields.

**A5: No warmup pass**
- **Location:** `bench-engine.ts:241-393`
- **Finding:** The design section 8 says "warmup excluded from results" implying a warmup pass exists. The code has no warmup phase — it runs the full grid directly. For llama.cpp, the first request to a model is typically slower (model loading/prefill), so TTFT values are inflated without a warmup. The comment at line 8 ("Warmup excluded from results") is misleading — there is no warmup at all.
- **YAGNI gate:** Bench is manual, results are for Sam's own hardware. Acceptable for v1.

**A6: `checkRecentTraffic` reads from in-memory state, not the activity stream**
- **Location:** `routes/bench.ts:380-392`
- **Finding:** The design says "`concurrent_foreign_requests` recorded per run to flag polluted results" and "sourced from the live activity stream during the run window." However, `checkRecentTraffic` reads `hostState.models` inflight counts (in-memory SSE state), while `runBenchAsync` records `concurrent_foreign_requests` from `control_requests` DB queries. These measure different things: inflight counts (instantaneous) vs request count in last 60s (windowed). The UI shows `concurrentForeignRequests` from the DB (the 60s window) but the takeover confirmation uses the in-memory inflight count. This is not a bug — they serve different purposes — but the naming is inconsistent with the design spec which says "sourced from the activity stream."
- **YAGNI gate:** Both measurements are valid indicators. The design spec is slightly imprecise.

### Nits (5)

**N1: `BenchTab.tsx:534` — baseline lookup is O(n) per run in history view**
- `const suite = suites.find((s) => s.id === run.suiteId)` at line 533 — fine for small N but should be a Map for correctness.

**N2: `BenchTab.tsx:190-197` — polling interval leaks on component unmount**
- `pollInterval` is created in `runBench` but `clearInterval` is only called when status changes or 10 min timeout fires. If the user navigates away from the Bench tab while a run is in progress, the interval keeps firing.

**N3: `playground.ts:125` — SSE relay drops the `data: ` prefix**
- `reply.raw.write(\`data: ${trimmed}\n\n\`)` — the `trimmed` line already has `data: ` stripped by the SSE parser in `bench-engine.ts:66`, but the playground relay receives raw SSE lines from llama-swap which may or may not have the prefix. If llama-swap sends `data: {...}`, `trimmed` becomes `data: {...}` (after trim) and the relay writes `data: data: {...}` — double prefix. However, `bench-engine.ts` strips it; the playground is a direct relay so it depends on what llama-swap sends. This is fragile.

**N4: `bench-engine.ts:211-222` — prompt generation is a rough approximation**
- `charsPerToken = 4` is used to generate deterministic prompts. The comment says "~1.3 chars/token is a rough average for English text" but the code uses 4. This is internally inconsistent. The prompt will be ~4x longer than intended token count.

**N5: `BenchTab.tsx:229` — delta calculation divides by zero risk**
- `const delta = (currentGenTps - baselineGenTps) / baselineGenTps;` — if `baselineGenTps` is 0, this produces `Infinity`. The `== null` check at line 227 does not guard against 0.

## Claims I did not verify

1. **`useControlStream` integration with Control.tsx** — I read the hook and page, but did not verify that `ControlProvider` wraps the Control page in `App.tsx`. The routing exists (`/control` in `App.tsx`), but the provider placement was not confirmed.
2. **`/api/control/playground/models` route path** — The playground routes are registered at `/api/playground/*` (route path prefix in `registerPlaygroundRoutes`), but the web client fetches `/api/control/playground/models` (PlaygroundTab.tsx:47). The control-proxy at `apps/server/src/routes/control-proxy.ts:64` rewrites `/api/control/*` to `/api/*`, so this should work. Not verified by reading the proxy rewrite logic end-to-end.
3. **`jobType: 'bench'` in the `WsFrameSchema`** — The `ControlJobFrame` has `jobType: z.enum(['bench', 'eval', 'action'])` (ws-frames.ts:548). This is correct.
4. **`BenchRunParams.temperature` and `topP` flow** — The bench route at `routes/bench.ts:142-143` passes `temperature`/`topP` to `runBenchAsync`, which passes them to `runBenchSuite`, which passes them to `runSingleBenchRequest`. The chain is complete.
5. **Contracts drift test coverage** — The `ws-frames.test.ts` passes (11 tests). I did not read the test file to confirm it covers all 5 new control frame types.