Files

indifferentketchup b18de2a331 chore: snapshot working tree - pty_exited notifications + in-flight inference WIP

feat(booterm): structured pty_exited WS notifications. Plan-validated, impl-validated, code-reviewed green (contracts build clean, contracts test 29/29, booterm + web typecheck clean).

wip: in-progress inference/provider refactor (agents.ts, provider.ts, new llama-providers.ts, removed llama-args-validator), plus arena, dispatcher, compaction, schema changes.

openspec: pty-exit-notifications complete; x-agent-flags planned (not yet implemented).

2026-06-14 12:48:47 +00:00

11 KiB

Raw Blame History

P3 Audit — Validation + Code Review

Validation: boocontrol P3 (implementation mode)

Verdict: PASS-WITH-FINDINGS

Task claim table

Task	Claim	Evidence	Status
P3.1 Playground tab	Model select, param controls, streaming chat, A/B compare, Arena handoff	`routes/playground.ts:17-238` — GET `/api/playground/models`, POST `/api/playground/chat` (SSE relay), POST `/api/playground/chat-ab` (dual SSE with lane wrapping). `PlaygroundTab.tsx:19-494` — grouped model picker, temperature/topP/maxTokens controls, single-stream chat at line 80, A/B compare at line 163, Arena link at line 249.	PROVEN
P3.2 Bench engine	Suite model, TTFT capture, timings parse, bounded fan-out, aggregates + samples to DB	`bench-engine.ts:241-393` — `runBenchSuite` builds grid at line 252, `Promise.allSettled` fan-out at line 329, TTFT at line 180-182, `parseLlamaTimings` at line 63-102, samples INSERT at line 367, aggregates at line 375. Schema: `schema.sql:85-136` — `bench_suites`, `bench_runs`, `bench_samples` with FKs + indexes.	PROVEN
P3.3 V1 safety	User-initiated only, takeover confirmation, embedding-first defaults, concurrent_foreign_requests	`routes/bench.ts:182-193` — `checkRecentTraffic` at line 380 reads `hostState.models` inflight totals; returns 409 via `acquireHostAccess` at line 187. `runBenchAsync` at line 411 records `concurrent_foreign_requests` from `control_requests` last 60s at line 422-427. `host-access.ts:13-18` — v1 no-op `{ok:true}`.	PROVEN
P3.4 acquireHostAccess seam	Every run gates through `acquireHostAccess(providerId, purpose)`	`routes/bench.ts:187` — `const grant = await acquireHostAccess(suite.providerId, 'bench')` before runner launch. `playground.ts` does NOT call it (playground is read-only, not a bench run — correct). `host-access.ts:13-18` — `{ok:true}` no-op, documented P8 seam.	PROVEN
P3.5 Bench UI	Run launcher, live progress via control_job, history charts, baseline + regression flags	`BenchTab.tsx:65-649` — launcher view at line 400, history view at line 524, results view at line 592. `control_job` frames consumed by `useControlStream.tsx:266-271`. Baselines: `getRegressionFlag` at line 223 — delta < -10% -> regression, > +5% -> improvement. History chart with ECharts at line 311. Results chart at line 235.	PROVEN

Design section 8 "Speed bench" conformance

Design requirement	Implementation	Status
HTTP-level via llama-swap	`bench-engine.ts:140` — `fetch(\`${baseUrl}/v1/chat/completions`)`	PASS
llama.cpp timings parse	`parseLlamaTimings` at line 63 — reads `timings.prompt_per_second` etc.	PASS
TTFT client-side at first delta	`bench-engine.ts:180-182` — captures `Date.now()` on first delta	PASS
Bounded fan-out (Promise.allSettled)	`bench-engine.ts:329` — `Promise.allSettled(promises)` with `batchSize = concurrency` at line 309	PASS
Warmup excluded	Not implemented (no warmup pass)	FINDING
Baselines + regression (-10% threshold)	`BenchTab.tsx:223-233` — compares `avgGenTps` delta < -0.1	PASS (UI only)
User-initiated, manual	POST `/api/bench/run` — no scheduler	PASS
Takeover confirmation	`checkRecentTraffic` + `acquireHostAccess` gate	PASS
`concurrent_foreign_requests`	`runBenchAsync:422-427` — counts from `control_requests` last 60s	PASS

Review: P3 implementation (APPROVE-WITH-NITS)

Blocking (0)

None. No correctness issues that block merge.

Advisory (6)

A1: Regression baseline comparison has no baseline stored in DB

Location: BenchTab.tsx:223-233, routes/bench.ts:348-374
Finding: The getRegressionFlag function compares against baselineAggregate passed from state, but the baseline data comes from GET /api/bench/baselines which fetches the latest completed run per (provider_id, model). There is no dedicated bench_baselines table — baselines are implicitly "the latest run." The getRegressionFlag is only called in the history view at line 534 with null as the second argument: getRegressionFlag(run.aggregate, null). This means regression flags are ALWAYS null in the actual UI. The baseline comparison logic exists but is dead code in the history view.
Impact: P3.5 claim "baseline + regression flags" is partially unproven — the comparison function works, but the UI never passes a baseline to it. The flag rendering at lines 553-560 is never triggered.
YAGNI gate: This is a real usability gap for the speed bench demo. The baseline data IS fetched (line 209) and stored in state (line 217), but never correlated to the run's suite/model for comparison.

A2: jobType not stored in bench_runs table

Location: schema.sql:99-111, bench-engine.ts:282,352,388
Finding: control_job frames carry jobType: 'bench' (and jobType: 'action' in actions.ts:70), but the bench_runs table has no job_type column. The control_job frame is only a WS event for live progress — there is no persistent job type on the run record. If P5 adds eval runs that also write to bench_runs, there is no way to distinguish bench from eval runs in the DB.
YAGNI gate: Bench and eval are separate phases (P3 vs P5). Acceptable for v1.

A3: resolveBaseUrl is hardcoded, not read from control_hosts

Location: routes/bench.ts:398-406, routes/playground.ts:232-237
Finding: Both resolveBaseUrl in bench.ts and resolveProviderUrl in playground.ts use hardcoded Record<string, string> mappings. The control_hosts table stores ssh_host which should be the source of truth. This means adding a new host requires editing two files.
YAGNI gate: Only two hosts exist and are seeded. Low blast radius.

A4: Benchmark requests do not include suite-defined sampling params

Location: bench-engine.ts:143-150
Finding: runSingleBenchRequest accepts temperature and topP parameters (line 116-117) and passes them to the request body. However, the BenchSuite interface (line 17-27) does NOT include temperature or topP — those come from BenchRunParams (line 29-34) which is the runner-level parameter. The suite definition has metadata?: Record<string, unknown> but no typed sampling params. This means the bench endpoint at routes/bench.ts:139-143 defaults to temperature: 0.7, topP: 0.9 regardless of what the suite was designed with. The suite's params are silently ignored.
YAGNI gate: v1 uses fixed params. The design says "v1 sampling-params parity: bench requests should honor suite params, not silently use server defaults." This is a spec gap — the suite schema should include temperature and topP as typed fields.

A5: No warmup pass

Location: bench-engine.ts:241-393
Finding: The design section 8 says "warmup excluded from results" implying a warmup pass exists. The code has no warmup phase — it runs the full grid directly. For llama.cpp, the first request to a model is typically slower (model loading/prefill), so TTFT values are inflated without a warmup. The comment at line 8 ("Warmup excluded from results") is misleading — there is no warmup at all.
YAGNI gate: Bench is manual, results are for Sam's own hardware. Acceptable for v1.

A6: checkRecentTraffic reads from in-memory state, not the activity stream

Location: routes/bench.ts:380-392
Finding: The design says "concurrent_foreign_requests recorded per run to flag polluted results" and "sourced from the live activity stream during the run window." However, checkRecentTraffic reads hostState.models inflight counts (in-memory SSE state), while runBenchAsync records concurrent_foreign_requests from control_requests DB queries. These measure different things: inflight counts (instantaneous) vs request count in last 60s (windowed). The UI shows concurrentForeignRequests from the DB (the 60s window) but the takeover confirmation uses the in-memory inflight count. This is not a bug — they serve different purposes — but the naming is inconsistent with the design spec which says "sourced from the activity stream."
YAGNI gate: Both measurements are valid indicators. The design spec is slightly imprecise.

Nits (5)

N1: BenchTab.tsx:534 — baseline lookup is O(n) per run in history view

const suite = suites.find((s) => s.id === run.suiteId) at line 533 — fine for small N but should be a Map for correctness.

N2: BenchTab.tsx:190-197 — polling interval leaks on component unmount

pollInterval is created in runBench but clearInterval is only called when status changes or 10 min timeout fires. If the user navigates away from the Bench tab while a run is in progress, the interval keeps firing.

N3: playground.ts:125 — SSE relay drops the data: prefix

reply.raw.write(\data: ${trimmed}\n\n`)— thetrimmedline already hasdata: stripped by the SSE parser inbench-engine.ts:66, but the playground relay receives raw SSE lines from llama-swap which may or may not have the prefix. If llama-swap sends data: {...}, trimmedbecomesdata: {...}(after trim) and the relay writesdata: data: {...}— double prefix. However,bench-engine.ts` strips it; the playground is a direct relay so it depends on what llama-swap sends. This is fragile.

N4: bench-engine.ts:211-222 — prompt generation is a rough approximation

charsPerToken = 4 is used to generate deterministic prompts. The comment says "~1.3 chars/token is a rough average for English text" but the code uses 4. This is internally inconsistent. The prompt will be ~4x longer than intended token count.

N5: BenchTab.tsx:229 — delta calculation divides by zero risk

const delta = (currentGenTps - baselineGenTps) / baselineGenTps; — if baselineGenTps is 0, this produces Infinity. The == null check at line 227 does not guard against 0.

Claims I did not verify

useControlStream integration with Control.tsx — I read the hook and page, but did not verify that ControlProvider wraps the Control page in App.tsx. The routing exists (/control in App.tsx), but the provider placement was not confirmed.
/api/control/playground/models route path — The playground routes are registered at /api/playground/* (route path prefix in registerPlaygroundRoutes), but the web client fetches /api/control/playground/models (PlaygroundTab.tsx:47). The control-proxy at apps/server/src/routes/control-proxy.ts:64 rewrites /api/control/* to /api/*, so this should work. Not verified by reading the proxy rewrite logic end-to-end.
jobType: 'bench' in the WsFrameSchema — The ControlJobFrame has jobType: z.enum(['bench', 'eval', 'action']) (ws-frames.ts:548). This is correct.
BenchRunParams.temperature and topP flow — The bench route at routes/bench.ts:142-143 passes temperature/topP to runBenchAsync, which passes them to runBenchSuite, which passes them to runSingleBenchRequest. The chain is complete.
Contracts drift test coverage — The ws-frames.test.ts passes (11 tests). I did not read the test file to confirm it covers all 5 new control frame types.

11 KiB Raw Blame History