feat(booterm): structured pty_exited WS notifications. Plan-validated, impl-validated, code-reviewed green (contracts build clean, contracts test 29/29, booterm + web typecheck clean). wip: in-progress inference/provider refactor (agents.ts, provider.ts, new llama-providers.ts, removed llama-args-validator), plus arena, dispatcher, compaction, schema changes. openspec: pty-exit-notifications complete; x-agent-flags planned (not yet implemented).
94 lines
11 KiB
Markdown
94 lines
11 KiB
Markdown
# P3 Audit — Validation + Code Review
|
|
|
|
## Validation: boocontrol P3 (implementation mode)
|
|
|
|
### Verdict: PASS-WITH-FINDINGS
|
|
|
|
### Task claim table
|
|
|
|
| Task | Claim | Evidence | Status |
|
|
|------|-------|----------|--------|
|
|
| P3.1 Playground tab | Model select, param controls, streaming chat, A/B compare, Arena handoff | `routes/playground.ts:17-238` — GET `/api/playground/models`, POST `/api/playground/chat` (SSE relay), POST `/api/playground/chat-ab` (dual SSE with lane wrapping). `PlaygroundTab.tsx:19-494` — grouped model picker, temperature/topP/maxTokens controls, single-stream chat at line 80, A/B compare at line 163, Arena link at line 249. | PROVEN |
|
|
| P3.2 Bench engine | Suite model, TTFT capture, timings parse, bounded fan-out, aggregates + samples to DB | `bench-engine.ts:241-393` — `runBenchSuite` builds grid at line 252, `Promise.allSettled` fan-out at line 329, TTFT at line 180-182, `parseLlamaTimings` at line 63-102, samples INSERT at line 367, aggregates at line 375. Schema: `schema.sql:85-136` — `bench_suites`, `bench_runs`, `bench_samples` with FKs + indexes. | PROVEN |
|
|
| P3.3 V1 safety | User-initiated only, takeover confirmation, embedding-first defaults, concurrent_foreign_requests | `routes/bench.ts:182-193` — `checkRecentTraffic` at line 380 reads `hostState.models` inflight totals; returns 409 via `acquireHostAccess` at line 187. `runBenchAsync` at line 411 records `concurrent_foreign_requests` from `control_requests` last 60s at line 422-427. `host-access.ts:13-18` — v1 no-op `{ok:true}`. | PROVEN |
|
|
| P3.4 acquireHostAccess seam | Every run gates through `acquireHostAccess(providerId, purpose)` | `routes/bench.ts:187` — `const grant = await acquireHostAccess(suite.providerId, 'bench')` before runner launch. `playground.ts` does NOT call it (playground is read-only, not a bench run — correct). `host-access.ts:13-18` — `{ok:true}` no-op, documented P8 seam. | PROVEN |
|
|
| P3.5 Bench UI | Run launcher, live progress via control_job, history charts, baseline + regression flags | `BenchTab.tsx:65-649` — launcher view at line 400, history view at line 524, results view at line 592. `control_job` frames consumed by `useControlStream.tsx:266-271`. Baselines: `getRegressionFlag` at line 223 — delta < -10% -> regression, > +5% -> improvement. History chart with ECharts at line 311. Results chart at line 235. | PROVEN |
|
|
|
|
### Design section 8 "Speed bench" conformance
|
|
|
|
| Design requirement | Implementation | Status |
|
|
|---|---|---|
|
|
| HTTP-level via llama-swap | `bench-engine.ts:140` — `fetch(\`${baseUrl}/v1/chat/completions\`)` | PASS |
|
|
| llama.cpp timings parse | `parseLlamaTimings` at line 63 — reads `timings.prompt_per_second` etc. | PASS |
|
|
| TTFT client-side at first delta | `bench-engine.ts:180-182` — captures `Date.now()` on first delta | PASS |
|
|
| Bounded fan-out (Promise.allSettled) | `bench-engine.ts:329` — `Promise.allSettled(promises)` with `batchSize = concurrency` at line 309 | PASS |
|
|
| Warmup excluded | Not implemented (no warmup pass) | FINDING |
|
|
| Baselines + regression (-10% threshold) | `BenchTab.tsx:223-233` — compares `avgGenTps` delta < -0.1 | PASS (UI only) |
|
|
| User-initiated, manual | POST `/api/bench/run` — no scheduler | PASS |
|
|
| Takeover confirmation | `checkRecentTraffic` + `acquireHostAccess` gate | PASS |
|
|
| `concurrent_foreign_requests` | `runBenchAsync:422-427` — counts from `control_requests` last 60s | PASS |
|
|
|
|
## Review: P3 implementation (APPROVE-WITH-NITS)
|
|
|
|
### Blocking (0)
|
|
|
|
None. No correctness issues that block merge.
|
|
|
|
### Advisory (6)
|
|
|
|
**A1: Regression baseline comparison has no baseline stored in DB**
|
|
- **Location:** `BenchTab.tsx:223-233`, `routes/bench.ts:348-374`
|
|
- **Finding:** The `getRegressionFlag` function compares against `baselineAggregate` passed from state, but the baseline data comes from `GET /api/bench/baselines` which fetches the latest completed run per (provider_id, model). There is no dedicated `bench_baselines` table — baselines are implicitly "the latest run." The `getRegressionFlag` is only called in the history view at line 534 with `null` as the second argument: `getRegressionFlag(run.aggregate, null)`. This means regression flags are ALWAYS null in the actual UI. The baseline comparison logic exists but is dead code in the history view.
|
|
- **Impact:** P3.5 claim "baseline + regression flags" is partially unproven — the comparison function works, but the UI never passes a baseline to it. The flag rendering at lines 553-560 is never triggered.
|
|
- **YAGNI gate:** This is a real usability gap for the speed bench demo. The baseline data IS fetched (line 209) and stored in state (line 217), but never correlated to the run's suite/model for comparison.
|
|
|
|
**A2: `jobType` not stored in `bench_runs` table**
|
|
- **Location:** `schema.sql:99-111`, `bench-engine.ts:282,352,388`
|
|
- **Finding:** `control_job` frames carry `jobType: 'bench'` (and `jobType: 'action'` in `actions.ts:70`), but the `bench_runs` table has no `job_type` column. The `control_job` frame is only a WS event for live progress — there is no persistent job type on the run record. If P5 adds eval runs that also write to `bench_runs`, there is no way to distinguish bench from eval runs in the DB.
|
|
- **YAGNI gate:** Bench and eval are separate phases (P3 vs P5). Acceptable for v1.
|
|
|
|
**A3: `resolveBaseUrl` is hardcoded, not read from `control_hosts`**
|
|
- **Location:** `routes/bench.ts:398-406`, `routes/playground.ts:232-237`
|
|
- **Finding:** Both `resolveBaseUrl` in bench.ts and `resolveProviderUrl` in playground.ts use hardcoded `Record<string, string>` mappings. The `control_hosts` table stores `ssh_host` which should be the source of truth. This means adding a new host requires editing two files.
|
|
- **YAGNI gate:** Only two hosts exist and are seeded. Low blast radius.
|
|
|
|
**A4: Benchmark requests do not include suite-defined sampling params**
|
|
- **Location:** `bench-engine.ts:143-150`
|
|
- **Finding:** `runSingleBenchRequest` accepts `temperature` and `topP` parameters (line 116-117) and passes them to the request body. However, the `BenchSuite` interface (line 17-27) does NOT include `temperature` or `topP` — those come from `BenchRunParams` (line 29-34) which is the runner-level parameter. The suite definition has `metadata?: Record<string, unknown>` but no typed sampling params. This means the bench endpoint at `routes/bench.ts:139-143` defaults to `temperature: 0.7, topP: 0.9` regardless of what the suite was designed with. The suite's params are silently ignored.
|
|
- **YAGNI gate:** v1 uses fixed params. The design says "v1 sampling-params parity: bench requests should honor suite params, not silently use server defaults." This is a spec gap — the suite schema should include `temperature` and `topP` as typed fields.
|
|
|
|
**A5: No warmup pass**
|
|
- **Location:** `bench-engine.ts:241-393`
|
|
- **Finding:** The design section 8 says "warmup excluded from results" implying a warmup pass exists. The code has no warmup phase — it runs the full grid directly. For llama.cpp, the first request to a model is typically slower (model loading/prefill), so TTFT values are inflated without a warmup. The comment at line 8 ("Warmup excluded from results") is misleading — there is no warmup at all.
|
|
- **YAGNI gate:** Bench is manual, results are for Sam's own hardware. Acceptable for v1.
|
|
|
|
**A6: `checkRecentTraffic` reads from in-memory state, not the activity stream**
|
|
- **Location:** `routes/bench.ts:380-392`
|
|
- **Finding:** The design says "`concurrent_foreign_requests` recorded per run to flag polluted results" and "sourced from the live activity stream during the run window." However, `checkRecentTraffic` reads `hostState.models` inflight counts (in-memory SSE state), while `runBenchAsync` records `concurrent_foreign_requests` from `control_requests` DB queries. These measure different things: inflight counts (instantaneous) vs request count in last 60s (windowed). The UI shows `concurrentForeignRequests` from the DB (the 60s window) but the takeover confirmation uses the in-memory inflight count. This is not a bug — they serve different purposes — but the naming is inconsistent with the design spec which says "sourced from the activity stream."
|
|
- **YAGNI gate:** Both measurements are valid indicators. The design spec is slightly imprecise.
|
|
|
|
### Nits (5)
|
|
|
|
**N1: `BenchTab.tsx:534` — baseline lookup is O(n) per run in history view**
|
|
- `const suite = suites.find((s) => s.id === run.suiteId)` at line 533 — fine for small N but should be a Map for correctness.
|
|
|
|
**N2: `BenchTab.tsx:190-197` — polling interval leaks on component unmount**
|
|
- `pollInterval` is created in `runBench` but `clearInterval` is only called when status changes or 10 min timeout fires. If the user navigates away from the Bench tab while a run is in progress, the interval keeps firing.
|
|
|
|
**N3: `playground.ts:125` — SSE relay drops the `data: ` prefix**
|
|
- `reply.raw.write(\`data: ${trimmed}\n\n\`)` — the `trimmed` line already has `data: ` stripped by the SSE parser in `bench-engine.ts:66`, but the playground relay receives raw SSE lines from llama-swap which may or may not have the prefix. If llama-swap sends `data: {...}`, `trimmed` becomes `data: {...}` (after trim) and the relay writes `data: data: {...}` — double prefix. However, `bench-engine.ts` strips it; the playground is a direct relay so it depends on what llama-swap sends. This is fragile.
|
|
|
|
**N4: `bench-engine.ts:211-222` — prompt generation is a rough approximation**
|
|
- `charsPerToken = 4` is used to generate deterministic prompts. The comment says "~1.3 chars/token is a rough average for English text" but the code uses 4. This is internally inconsistent. The prompt will be ~4x longer than intended token count.
|
|
|
|
**N5: `BenchTab.tsx:229` — delta calculation divides by zero risk**
|
|
- `const delta = (currentGenTps - baselineGenTps) / baselineGenTps;` — if `baselineGenTps` is 0, this produces `Infinity`. The `== null` check at line 227 does not guard against 0.
|
|
|
|
## Claims I did not verify
|
|
|
|
1. **`useControlStream` integration with Control.tsx** — I read the hook and page, but did not verify that `ControlProvider` wraps the Control page in `App.tsx`. The routing exists (`/control` in `App.tsx`), but the provider placement was not confirmed.
|
|
2. **`/api/control/playground/models` route path** — The playground routes are registered at `/api/playground/*` (route path prefix in `registerPlaygroundRoutes`), but the web client fetches `/api/control/playground/models` (PlaygroundTab.tsx:47). The control-proxy at `apps/server/src/routes/control-proxy.ts:64` rewrites `/api/control/*` to `/api/*`, so this should work. Not verified by reading the proxy rewrite logic end-to-end.
|
|
3. **`jobType: 'bench'` in the `WsFrameSchema`** — The `ControlJobFrame` has `jobType: z.enum(['bench', 'eval', 'action'])` (ws-frames.ts:548). This is correct.
|
|
4. **`BenchRunParams.temperature` and `topP` flow** — The bench route at `routes/bench.ts:142-143` passes `temperature`/`topP` to `runBenchAsync`, which passes them to `runBenchSuite`, which passes them to `runSingleBenchRequest`. The chain is complete.
|
|
5. **Contracts drift test coverage** — The `ws-frames.test.ts` passes (11 tests). I did not read the test file to confirm it covers all 5 new control frame types.
|