feat(booterm): structured pty_exited WS notifications. Plan-validated, impl-validated, code-reviewed green (contracts build clean, contracts test 29/29, booterm + web typecheck clean). wip: in-progress inference/provider refactor (agents.ts, provider.ts, new llama-providers.ts, removed llama-args-validator), plus arena, dispatcher, compaction, schema changes. openspec: pty-exit-notifications complete; x-agent-flags planned (not yet implemented).
11 KiB
11 KiB
P3 Audit — Validation + Code Review
Validation: boocontrol P3 (implementation mode)
Verdict: PASS-WITH-FINDINGS
Task claim table
| Task | Claim | Evidence | Status |
|---|---|---|---|
| P3.1 Playground tab | Model select, param controls, streaming chat, A/B compare, Arena handoff | routes/playground.ts:17-238 — GET /api/playground/models, POST /api/playground/chat (SSE relay), POST /api/playground/chat-ab (dual SSE with lane wrapping). PlaygroundTab.tsx:19-494 — grouped model picker, temperature/topP/maxTokens controls, single-stream chat at line 80, A/B compare at line 163, Arena link at line 249. |
PROVEN |
| P3.2 Bench engine | Suite model, TTFT capture, timings parse, bounded fan-out, aggregates + samples to DB | bench-engine.ts:241-393 — runBenchSuite builds grid at line 252, Promise.allSettled fan-out at line 329, TTFT at line 180-182, parseLlamaTimings at line 63-102, samples INSERT at line 367, aggregates at line 375. Schema: schema.sql:85-136 — bench_suites, bench_runs, bench_samples with FKs + indexes. |
PROVEN |
| P3.3 V1 safety | User-initiated only, takeover confirmation, embedding-first defaults, concurrent_foreign_requests | routes/bench.ts:182-193 — checkRecentTraffic at line 380 reads hostState.models inflight totals; returns 409 via acquireHostAccess at line 187. runBenchAsync at line 411 records concurrent_foreign_requests from control_requests last 60s at line 422-427. host-access.ts:13-18 — v1 no-op {ok:true}. |
PROVEN |
| P3.4 acquireHostAccess seam | Every run gates through acquireHostAccess(providerId, purpose) |
routes/bench.ts:187 — const grant = await acquireHostAccess(suite.providerId, 'bench') before runner launch. playground.ts does NOT call it (playground is read-only, not a bench run — correct). host-access.ts:13-18 — {ok:true} no-op, documented P8 seam. |
PROVEN |
| P3.5 Bench UI | Run launcher, live progress via control_job, history charts, baseline + regression flags | BenchTab.tsx:65-649 — launcher view at line 400, history view at line 524, results view at line 592. control_job frames consumed by useControlStream.tsx:266-271. Baselines: getRegressionFlag at line 223 — delta < -10% -> regression, > +5% -> improvement. History chart with ECharts at line 311. Results chart at line 235. |
PROVEN |
Design section 8 "Speed bench" conformance
| Design requirement | Implementation | Status |
|---|---|---|
| HTTP-level via llama-swap | bench-engine.ts:140 — fetch(\${baseUrl}/v1/chat/completions`)` |
PASS |
| llama.cpp timings parse | parseLlamaTimings at line 63 — reads timings.prompt_per_second etc. |
PASS |
| TTFT client-side at first delta | bench-engine.ts:180-182 — captures Date.now() on first delta |
PASS |
| Bounded fan-out (Promise.allSettled) | bench-engine.ts:329 — Promise.allSettled(promises) with batchSize = concurrency at line 309 |
PASS |
| Warmup excluded | Not implemented (no warmup pass) | FINDING |
| Baselines + regression (-10% threshold) | BenchTab.tsx:223-233 — compares avgGenTps delta < -0.1 |
PASS (UI only) |
| User-initiated, manual | POST /api/bench/run — no scheduler |
PASS |
| Takeover confirmation | checkRecentTraffic + acquireHostAccess gate |
PASS |
concurrent_foreign_requests |
runBenchAsync:422-427 — counts from control_requests last 60s |
PASS |
Review: P3 implementation (APPROVE-WITH-NITS)
Blocking (0)
None. No correctness issues that block merge.
Advisory (6)
A1: Regression baseline comparison has no baseline stored in DB
- Location:
BenchTab.tsx:223-233,routes/bench.ts:348-374 - Finding: The
getRegressionFlagfunction compares againstbaselineAggregatepassed from state, but the baseline data comes fromGET /api/bench/baselineswhich fetches the latest completed run per (provider_id, model). There is no dedicatedbench_baselinestable — baselines are implicitly "the latest run." ThegetRegressionFlagis only called in the history view at line 534 withnullas the second argument:getRegressionFlag(run.aggregate, null). This means regression flags are ALWAYS null in the actual UI. The baseline comparison logic exists but is dead code in the history view. - Impact: P3.5 claim "baseline + regression flags" is partially unproven — the comparison function works, but the UI never passes a baseline to it. The flag rendering at lines 553-560 is never triggered.
- YAGNI gate: This is a real usability gap for the speed bench demo. The baseline data IS fetched (line 209) and stored in state (line 217), but never correlated to the run's suite/model for comparison.
A2: jobType not stored in bench_runs table
- Location:
schema.sql:99-111,bench-engine.ts:282,352,388 - Finding:
control_jobframes carryjobType: 'bench'(andjobType: 'action'inactions.ts:70), but thebench_runstable has nojob_typecolumn. Thecontrol_jobframe is only a WS event for live progress — there is no persistent job type on the run record. If P5 adds eval runs that also write tobench_runs, there is no way to distinguish bench from eval runs in the DB. - YAGNI gate: Bench and eval are separate phases (P3 vs P5). Acceptable for v1.
A3: resolveBaseUrl is hardcoded, not read from control_hosts
- Location:
routes/bench.ts:398-406,routes/playground.ts:232-237 - Finding: Both
resolveBaseUrlin bench.ts andresolveProviderUrlin playground.ts use hardcodedRecord<string, string>mappings. Thecontrol_hoststable storesssh_hostwhich should be the source of truth. This means adding a new host requires editing two files. - YAGNI gate: Only two hosts exist and are seeded. Low blast radius.
A4: Benchmark requests do not include suite-defined sampling params
- Location:
bench-engine.ts:143-150 - Finding:
runSingleBenchRequestacceptstemperatureandtopPparameters (line 116-117) and passes them to the request body. However, theBenchSuiteinterface (line 17-27) does NOT includetemperatureortopP— those come fromBenchRunParams(line 29-34) which is the runner-level parameter. The suite definition hasmetadata?: Record<string, unknown>but no typed sampling params. This means the bench endpoint atroutes/bench.ts:139-143defaults totemperature: 0.7, topP: 0.9regardless of what the suite was designed with. The suite's params are silently ignored. - YAGNI gate: v1 uses fixed params. The design says "v1 sampling-params parity: bench requests should honor suite params, not silently use server defaults." This is a spec gap — the suite schema should include
temperatureandtopPas typed fields.
A5: No warmup pass
- Location:
bench-engine.ts:241-393 - Finding: The design section 8 says "warmup excluded from results" implying a warmup pass exists. The code has no warmup phase — it runs the full grid directly. For llama.cpp, the first request to a model is typically slower (model loading/prefill), so TTFT values are inflated without a warmup. The comment at line 8 ("Warmup excluded from results") is misleading — there is no warmup at all.
- YAGNI gate: Bench is manual, results are for Sam's own hardware. Acceptable for v1.
A6: checkRecentTraffic reads from in-memory state, not the activity stream
- Location:
routes/bench.ts:380-392 - Finding: The design says "
concurrent_foreign_requestsrecorded per run to flag polluted results" and "sourced from the live activity stream during the run window." However,checkRecentTrafficreadshostState.modelsinflight counts (in-memory SSE state), whilerunBenchAsyncrecordsconcurrent_foreign_requestsfromcontrol_requestsDB queries. These measure different things: inflight counts (instantaneous) vs request count in last 60s (windowed). The UI showsconcurrentForeignRequestsfrom the DB (the 60s window) but the takeover confirmation uses the in-memory inflight count. This is not a bug — they serve different purposes — but the naming is inconsistent with the design spec which says "sourced from the activity stream." - YAGNI gate: Both measurements are valid indicators. The design spec is slightly imprecise.
Nits (5)
N1: BenchTab.tsx:534 — baseline lookup is O(n) per run in history view
const suite = suites.find((s) => s.id === run.suiteId)at line 533 — fine for small N but should be a Map for correctness.
N2: BenchTab.tsx:190-197 — polling interval leaks on component unmount
pollIntervalis created inrunBenchbutclearIntervalis only called when status changes or 10 min timeout fires. If the user navigates away from the Bench tab while a run is in progress, the interval keeps firing.
N3: playground.ts:125 — SSE relay drops the data: prefix
reply.raw.write(\data: ${trimmed}\n\n`)— thetrimmedline already hasdata:stripped by the SSE parser inbench-engine.ts:66, but the playground relay receives raw SSE lines from llama-swap which may or may not have the prefix. If llama-swap sendsdata: {...},trimmedbecomesdata: {...}(after trim) and the relay writesdata: data: {...}— double prefix. However,bench-engine.ts` strips it; the playground is a direct relay so it depends on what llama-swap sends. This is fragile.
N4: bench-engine.ts:211-222 — prompt generation is a rough approximation
charsPerToken = 4is used to generate deterministic prompts. The comment says "~1.3 chars/token is a rough average for English text" but the code uses 4. This is internally inconsistent. The prompt will be ~4x longer than intended token count.
N5: BenchTab.tsx:229 — delta calculation divides by zero risk
const delta = (currentGenTps - baselineGenTps) / baselineGenTps;— ifbaselineGenTpsis 0, this producesInfinity. The== nullcheck at line 227 does not guard against 0.
Claims I did not verify
useControlStreamintegration with Control.tsx — I read the hook and page, but did not verify thatControlProviderwraps the Control page inApp.tsx. The routing exists (/controlinApp.tsx), but the provider placement was not confirmed./api/control/playground/modelsroute path — The playground routes are registered at/api/playground/*(route path prefix inregisterPlaygroundRoutes), but the web client fetches/api/control/playground/models(PlaygroundTab.tsx:47). The control-proxy atapps/server/src/routes/control-proxy.ts:64rewrites/api/control/*to/api/*, so this should work. Not verified by reading the proxy rewrite logic end-to-end.jobType: 'bench'in theWsFrameSchema— TheControlJobFramehasjobType: z.enum(['bench', 'eval', 'action'])(ws-frames.ts:548). This is correct.BenchRunParams.temperatureandtopPflow — The bench route atroutes/bench.ts:142-143passestemperature/topPtorunBenchAsync, which passes them torunBenchSuite, which passes them torunSingleBenchRequest. The chain is complete.- Contracts drift test coverage — The
ws-frames.test.tspasses (11 tests). I did not read the test file to confirm it covers all 5 new control frame types.