Files
boocode/openspec/changes/boocontrol/artifacts/p4-p5-audit.md
indifferentketchup b18de2a331 chore: snapshot working tree - pty_exited notifications + in-flight inference WIP
feat(booterm): structured pty_exited WS notifications. Plan-validated, impl-validated, code-reviewed green (contracts build clean, contracts test 29/29, booterm + web typecheck clean).

wip: in-progress inference/provider refactor (agents.ts, provider.ts, new llama-providers.ts, removed llama-args-validator), plus arena, dispatcher, compaction, schema changes.

openspec: pty-exit-notifications complete; x-agent-flags planned (not yet implemented).
2026-06-14 12:48:47 +00:00

186 lines
14 KiB
Markdown

# P4+P5 Audit: Combined Validation + Code Review
**Date:** 2026-06-12
**Change:** boocontrol
**Phases:** P4 (per-consumer attribution) + P5 (quality evals + sandbox)
**Mode:** Implementation (all 8 tasks checked)
---
## Build/Test Gates
| Gate | Result |
|------|--------|
| `pnpm -C apps/server build` | PASS |
| `pnpm -C apps/server test` | PASS (580 passed, 11 skipped, 51 files) |
| `pnpm -C apps/coder build` | PASS |
| `pnpm -C apps/coder test` | PASS (587 passed, 32 skipped, 51 files) |
| `pnpm -C apps/control build` | PASS |
| `pnpm -C apps/control test` | PASS (116 passed, 15 files) |
| `npx tsc -p apps/web/tsconfig.app.json --noEmit` | PASS |
---
# Validation: boocontrol P4+P5 (implementation mode)
## Verdict
**PASS-WITH-FINDINGS** -- all 8 tasks have implementing code; one design-specified behavior (judge temperature=0) is not implemented.
## Traceability
| Task | Claim | Evidence | Status |
|------|-------|----------|--------|
| P4.1 | X-Boo-Source on AI-SDK streaming path | `stream-phase-adapter.ts:309` passes `'boochat'` to `upstreamModel`; `provider.ts:19-44` `getSwapProvider` wraps fetch with header, cache keyed `baseURL\|\|source` | PASS |
| P4.1 | `includeUsage: true` preserved | `provider.ts:38` explicitly set on `createOpenAICompatible` | PASS |
| P4.1 | compaction.ts + task-model.ts headers | `compaction.ts:359` and `task-model.ts:27` both inject `X-Boo-Source: 'boochat'` in direct fetch headers | PASS |
| P4.2 | local-gateway.ts forwards x-boo-source | `local-gateway.ts:67` reads inbound header, defaults `'boocoder'`; `local-gateway.ts:76` forwards as `X-Boo-Source` | PASS |
| P4.2 | arena-model-call.ts source | `arena-model-call.ts:51` sets `X-Boo-Source: 'arena'` | PASS |
| P4.3 | control_requests.source migration | `schema.sql:48` `ALTER TABLE ADD COLUMN IF NOT EXISTS source TEXT` (idempotent); INSERT at `index.ts:182-183` includes source column; `index.ts:81` maps `source: null` for ring data (design S7 deviation documented) | PASS |
| P4.4 | Tests: header present + rows attribute | `pipeline.test.ts:248` asserts source=NULL for ring data; import/export tests for all three paths | PARTIAL |
| P5.1 | Suite format + YAML loading + DB schema | `eval-suites.ts:67-120` loads YAML from `data/`; `schema.sql:161-222` defines `eval_suites` (UNIQUE on name+version), `eval_runs`, `eval_results`; 4 YAML suite files present | PASS |
| P5.2 | Judge runner temperature=0 | `judge-runner.ts:239` scoreWithRubric uses `temperature: 0` (correct); `judge-runner.ts:182` generateResponse uses `temperature: 0.7` (NOT 0) | FAIL |
| P5.2 | Judge model+version pinned per run | `judge-runner.ts:59` constructs `judgeModelVersion` string; `eval_runs` table stores `judge_model` + `judge_model_version` | PASS |
| P5.2 | Rationale captured | `judge-runner.ts:97-98` stores rationale from scoreWithRubric | PASS |
| P5.2 | X-Boo-Source control-eval | `judge-runner.ts:177,237` both set `X-Boo-Source: 'control-eval'` | PASS |
| P5.3 | Sandbox hardening flags | `sandbox-runner.ts:258-273` docker args array: `--network none`, `--user 1000:1000`, `--memory`, `--cpus`, `--pids-limit`, `--tmpfs /workspace:rw,noexec,size=64m`, `--rm`, `--label boocontrol-eval`, `--security-opt no-new-privileges`, `--cap-drop ALL` | PASS |
| P5.3 | No volume mounts, no docker socket | Verified in docker args array at `sandbox-runner.ts:258-273` -- no `-v` or socket reference | PASS |
| P5.3 | Orphan prune at engine start | `sandbox-runner.ts:73` calls `pruneOrphanContainers()` at start of `runCodeEval` | PASS |
| P5.3 | Bounded concurrency + allSettled + finally cleanup | `sandbox-runner.ts:81-83` batch loop; `sandbox-runner.ts:86` `Promise.allSettled`; `sandbox-runner.ts:162-165` `finally` block with `cleanupContainer` | PASS |
| P5.3 | SANDBOX_TIMEOUT_MS type | `sandbox-runner.ts:37` typed as `number` but `process.env` is string -- `setTimeout` and `spawn` timeout receive string | ADVISORY |
| P5.4 | Leaderboard UI + scatter | `EvalsTab.tsx` renders scatter (`echarts.init` with `buildEChartsTheme`) + bar chart + run table + launcher | PASS |
## Findings
### Blocking
**V1: judge-runner.ts generateResponse uses temperature 0.7 instead of 0**
- **Location:** `apps/control/src/services/judge-runner.ts:182`
- **Evidence:** `body: JSON.stringify({ model, messages: [{ role: 'user', content: prompt }], temperature: 0.7, max_tokens: 2048 })` -- the generateResponse function (which generates the target model's response to be scored) uses temperature 0.7. The design at `design.md:195` specifies "temperature 0, judge model+version pinned per run." The scoreWithRubric function at line 239 correctly uses `temperature: 0`, but the response generation step does not.
- **Impact:** The target model's response is generated with non-deterministic sampling. For a reproducible eval framework this undermines the "temperature 0" claim in the task description. The judge scoring is deterministic (temp=0) but the input it scores is not.
- **Fix sketch:** Change line 182 from `temperature: 0.7` to `temperature: 0`.
### Advisory
**A1: sandbox-runner.ts SANDBOX_TIMEOUT_MS is string, not number**
- **Location:** `apps/control/src/services/sandbox-runner.ts:37`
- **Evidence:** `const SANDBOX_TIMEOUT_MS = (process.env.SANDBOX_TIMEOUT_MS ?? '30000') as unknown as number;` -- `process.env` values are `string | undefined`. The `as unknown as number` cast silences tsc but the runtime value is `'30000'` (string). This string flows to `spawn(..., { timeout: SANDBOX_TIMEOUT_MS })` at line 277 and `setTimeout(..., SANDBOX_TIMEOUT_MS)` at line 308. Node's `child_process.spawn` timeout accepts `number | undefined` and `setTimeout` accepts `number | string | undefined` (string is parsed). The timeout will likely work due to JS coercion, but the type lie masks future bugs (e.g. `SANDBOX_TIMEOUT_MS - 1000` would produce `NaN`).
- **Impact:** Low immediate risk (JS coercion makes it work), but the incorrect type annotation prevents catching arithmetic bugs. SANDBOX_CONCURRENCY at line 38 has the same issue.
- **Fix sketch:** `const SANDBOX_TIMEOUT_MS = Number(process.env.SANDBOX_TIMEOUT_MS ?? '30000');`
**A2: judge-runner tests exercise imports, not judge logic**
- **Location:** `apps/control/src/services/__tests__/judge-runner.test.ts`
- **Evidence:** Test 1 imports the module and checks `typeof mod.runJudgeEval === 'function'`. Test 2 calls `runJudgeEval` with a nonexistent provider and asserts the error message. Neither test exercises the actual judge request flow, rubric scoring, temperature setting, or rationale capture. The temperature=0.7 bug (V1) would not be caught by these tests.
- **Impact:** Regressions in judge scoring logic, temperature, or X-Boo-Source injection would not be caught by the test suite.
- **Reopen trigger:** Any bug where judge scoring produces wrong results or wrong temperature.
**A3: sandbox-runner tests exercise Promise patterns, not Docker flags**
- **Location:** `apps/control/src/services/__tests__/sandbox-runner.test.ts`
- **Evidence:** Tests verify `runCodeEval` is importable, that `Promise.allSettled` isolates failures, and that SIGKILL works. None of the tests verify the actual Docker arguments (security flags, label, resource caps), orphan pruning, or container cleanup. The test at line 19 (`bounded fan-out`) reimplements the pattern inline rather than calling `runCodeEval`.
- **Impact:** A regression in the Docker security flags (e.g. removing `--cap-drop ALL`) would pass all existing tests.
- **Reopen trigger:** Any sandbox escape or flag regression.
**A4: arena dispatch sites not fully traced**
- **Location:** `apps/coder/src/services/arena-model-call.ts:51`
- **Evidence:** `arenaModelCall` sets `X-Boo-Source: 'arena'`. However, the full arena dispatch chain (battle start, contestant model calls, cross-examination) was not traced end-to-end. The direct `arenaModelCall` path is verified; whether all arena sub-calls route through this function rather than making their own fetches was not checked.
- **Impact:** Low -- if arena uses `arenaModelCall` for all model calls, attribution is correct. If any arena path makes a direct fetch without `X-Boo-Source`, those requests would show as NULL in the activity feed.
- **Reopen trigger:** Arena requests showing as NULL in activity feed despite having a source.
## Claims I did not verify
- Whether the `includeUsage: true` survives AI-SDK v6's internal handling (this was verified in prior P1 review -- load-bearing per `apps/server/CLAUDE.md`)
- Whether the `sql.json(value as never)` pattern in `eval-suites.ts:170` correctly serializes the tasks array as JSONB (pattern is established and used elsewhere in the codebase)
- Whether the ECharts bundle tree-shaking works correctly in the production build (the `echarts/core` + per-chart imports pattern is established from P1)
- Whether the `eval_runs.judge_model_version` column is actually populated at run creation time (the `createEvalRun` function at `eval-suites.ts:258` receives `judgeModelVersion` as a parameter; whether callers pass it was not traced)
- Whether the leaderboard API endpoint exists and returns the correct shape (the frontend fetches from `/api/control/eval/leaderboard`; the backend route handler was not traced)
---
# Review: boocontrol P4+P5
## Scope
`apps/server/src/services/inference/provider.ts`, `apps/server/src/services/inference/stream-phase-adapter.ts`, `apps/server/src/services/compaction.ts`, `apps/server/src/services/task-model.ts`, `apps/coder/src/services/local-gateway.ts`, `apps/coder/src/services/arena-model-call.ts`, `apps/control/src/services/judge-runner.ts`, `apps/control/src/services/sandbox-runner.ts`, `apps/control/src/services/eval-suites.ts`, `apps/control/src/schema.sql`, `apps/web/src/components/control/EvalsTab.tsx`, `apps/web/src/pages/Control.tsx`, P4+P5 tests.
## Size
**Large** -- 12 source files across 3 apps + contracts, touches inference streaming path, SSE ingestion, Docker container spawning, DB schema, and ECharts UI.
## Summary
P4 (attribution) is correctly implemented end-to-end. All three paths (server streaming, coder gateway, arena) inject the correct `X-Boo-Source` header. The migration is idempotent and NULL-for-ring-data is documented. P5 (evals) has correct schema, YAML loading, and UI wiring, but the judge runner's response generation temperature (0.7) contradicts the design spec (0). Sandbox hardening is thorough.
| Classification | Count |
|----------------|-------|
| Blocking | 1 |
| Advisory | 4 |
| Nit | 1 |
## Findings
### Blocking
**B1: Judge response generation temperature is 0.7, not 0**
- **Location:** `apps/control/src/services/judge-runner.ts:182`
- **Evidence:** `temperature: 0.7` in the `generateResponse` request body. The design at `design.md:195` specifies "temperature 0, judge model+version pinned per run." The `scoreWithRubric` function at line 239 correctly uses `temperature: 0`.
- **Standard violated:** Design spec S8 ("temperature 0, judge model+version pinned per run").
- **Risk:** Non-deterministic eval inputs undermine reproducibility claims. A reviewer or auditor checking the design vs code will find this discrepancy.
- **Fix sketch:** `temperature: 0` on line 182.
### Advisory
**A1: SANDBOX_TIMEOUT_MS type mismatch**
- **Location:** `apps/control/src/services/sandbox-runner.ts:37`
- **Evidence:** `as unknown as number` cast on a string from `process.env`. Works at runtime due to JS coercion, but the type lie prevents catching arithmetic bugs.
- **YAGNI gate:** No known incident. Defer unless the sandbox timeout needs arithmetic (e.g. grace period).
**A2: Judge tests do not exercise scoring logic**
- **Location:** `apps/control/src/services/__tests__/judge-runner.test.ts`
- **Evidence:** Tests check import and error-on-bad-provider only. Rubric scoring, temperature, X-Boo-Source injection, and rationale capture are untested.
- **YAGNI gate:** No known scoring bug. Defer until judge scoring produces real evals.
**A3: Sandbox tests do not verify Docker flags**
- **Location:** `apps/control/src/services/__tests__/sandbox-runner.test.ts`
- **Evidence:** Tests exercise `Promise.allSettled` and `SIGKILL` patterns, not the actual Docker args construction. Security flags (network, caps, user, label) are untested.
- **YAGNI gate:** No known sandbox escape. Defer until sandbox runner processes untrusted code.
**A4: Arena dispatch chain not fully traced**
- **Location:** `apps/coder/src/services/arena-model-call.ts:51`
- **Evidence:** `arenaModelCall` sets `X-Boo-Source: 'arena'`. Whether all arena sub-calls (battle start, cross-examination) route through this function rather than making direct fetches was not verified.
- **YAGNI gate:** No known arena attribution bug. Defer until arena requests show NULL source.
### Nits
**N1: eval_suites UNIQUE on (name, version) uses ON CONFLICT DO NOTHING in seed, but upsertEvalSuite uses ON CONFLICT DO UPDATE**
- **Location:** `apps/control/src/services/eval-suites.ts:175` vs `eval-suites.ts:230`
- **Evidence:** `seedEvalSuites` uses `ON CONFLICT (id) DO NOTHING` (by primary key). `upsertEvalSuite` uses `ON CONFLICT (id) DO UPDATE`. The schema also has `UNIQUE (name, version)` at `schema.sql:170` which is NOT the conflict target in either function. If two suites share a name+version, the UNIQUE constraint would reject the second. This is the correct behavior (versioning is explicit), but the UNIQUE constraint and the ON CONFLICT target differ.
- **Note:** Style -- not a bug.
## Verdict
**APPROVE-WITH-NITS**
One blocking finding (B1: judge temperature 0.7 should be 0). Four advisory findings deferred per YAGNI gates. One nit on UNIQUE constraint targeting.
---
## Claims I did not verify
- Whether the AI-SDK `createOpenAICompatible` internal `fetch` wrapper correctly merges the custom fetch headers (established pattern from P1, not re-verified)
- Whether the `eval_runs.judge_model_version` column is populated by callers of `createEvalRun` (the function accepts it; caller trace was not performed)
- Whether the leaderboard API backend route exists and returns the correct shape
- Whether the ECharts tree-shaking in `EvalsTab.tsx` produces correct bundle sizes
- Whether arena battle start / cross-examination model calls all go through `arenaModelCall`
- Whether the `control_requests` INSERT at `index.ts:258` (the non-reconcile path) also correctly sets `source: null`