boocode/openspec/changes/boocontrol/artifacts/p4-p5-audit.md

P4+P5 Audit: Combined Validation + Code Review

Date: 2026-06-12
Change: boocontrol
Phases: P4 (per-consumer attribution) + P5 (quality evals + sandbox)
Mode: Implementation (all 8 tasks checked)


Build/Test Gates

| Gate | Result |
| --- | --- |
| pnpm -C apps/server build | PASS |
| pnpm -C apps/server test | PASS (580 passed, 11 skipped, 51 files) |
| pnpm -C apps/coder build | PASS |
| pnpm -C apps/coder test | PASS (587 passed, 32 skipped, 51 files) |
| pnpm -C apps/control build | PASS |
| pnpm -C apps/control test | PASS (116 passed, 15 files) |
| npx tsc -p apps/web/tsconfig.app.json --noEmit | PASS |

Validation: boocontrol P4+P5 (implementation mode)

Verdict

PASS-WITH-FINDINGS -- all 8 tasks have implementing code; one design-specified behavior (judge temperature=0) is not implemented.

Traceability

| Task | Claim | Evidence | Status |
| --- | --- | --- | --- |
| P4.1 | X-Boo-Source on AI-SDK streaming path | stream-phase-adapter.ts:309 passes 'boochat' to upstreamModel; provider.ts:19-44 getSwapProvider wraps fetch with header, cache keyed baseURL\|\|source | PASS |
| P4.1 | includeUsage: true preserved | provider.ts:38 explicitly set on createOpenAICompatible | PASS |
| P4.1 | compaction.ts + task-model.ts headers | compaction.ts:359 and task-model.ts:27 both inject X-Boo-Source: 'boochat' in direct fetch headers | PASS |
| P4.2 | local-gateway.ts forwards x-boo-source | local-gateway.ts:67 reads inbound header, defaults 'boocoder'; local-gateway.ts:76 forwards as X-Boo-Source | PASS |
| P4.2 | arena-model-call.ts source | arena-model-call.ts:51 sets X-Boo-Source: 'arena' | PASS |
| P4.3 | control_requests.source migration | schema.sql:48 ALTER TABLE ADD COLUMN IF NOT EXISTS source TEXT (idempotent); INSERT at index.ts:182-183 includes source column; index.ts:81 maps source: null for ring data (design S7 deviation documented) | PASS |
| P4.4 | Tests: header present + rows attribute | pipeline.test.ts:248 asserts source=NULL for ring data; import/export tests for all three paths | PARTIAL |
| P5.1 | Suite format + YAML loading + DB schema | eval-suites.ts:67-120 loads YAML from data/; schema.sql:161-222 defines eval_suites (UNIQUE on name+version), eval_runs, eval_results; 4 YAML suite files present | PASS |
| P5.2 | Judge runner temperature=0 | judge-runner.ts:239 scoreWithRubric uses temperature: 0 (correct); judge-runner.ts:182 generateResponse uses temperature: 0.7 (NOT 0) | FAIL |
| P5.2 | Judge model+version pinned per run | judge-runner.ts:59 constructs judgeModelVersion string; eval_runs table stores judge_model + judge_model_version | PASS |
| P5.2 | Rationale captured | judge-runner.ts:97-98 stores rationale from scoreWithRubric | PASS |
| P5.2 | X-Boo-Source control-eval | judge-runner.ts:177,237 both set X-Boo-Source: 'control-eval' | PASS |
| P5.3 | Sandbox hardening flags | sandbox-runner.ts:258-273 docker args array: --network none, --user 1000:1000, --memory, --cpus, --pids-limit, --tmpfs /workspace:rw,noexec,size=64m, --rm, --label boocontrol-eval, --security-opt no-new-privileges, --cap-drop ALL | PASS |
| P5.3 | No volume mounts, no docker socket | Verified in docker args array at sandbox-runner.ts:258-273 -- no -v or socket reference | PASS |
| P5.3 | Orphan prune at engine start | sandbox-runner.ts:73 calls pruneOrphanContainers() at start of runCodeEval | PASS |
| P5.3 | Bounded concurrency + allSettled + finally cleanup | sandbox-runner.ts:81-83 batch loop; sandbox-runner.ts:86 Promise.allSettled; sandbox-runner.ts:162-165 finally block with cleanupContainer | PASS |
| P5.3 | SANDBOX_TIMEOUT_MS type | sandbox-runner.ts:37 typed as number but process.env is string -- setTimeout and spawn timeout receive string | ADVISORY |
| P5.4 | Leaderboard UI + scatter | EvalsTab.tsx renders scatter (echarts.init with buildEChartsTheme) + bar chart + run table + launcher | PASS |
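
The P4.1 evidence describes a provider cache keyed on baseURL||source with a fetch wrapper that injects the header. A minimal sketch of that shape follows, assuming simplified stand-in types. FetchLike, SwapProvider, withSource and getProvider are hypothetical names for illustration, not the actual contents of provider.ts.

```typescript
// Hypothetical minimal types; the real code uses the AI-SDK's provider and fetch types.
type HeaderMap = Record<string, string>;
type FetchLike = (
  url: string,
  init?: { headers?: HeaderMap; body?: string },
) => Promise<unknown>;

interface SwapProvider {
  baseURL: string;
  source: string;
  fetch: FetchLike;
}

const providerCache = new Map<string, SwapProvider>();

// Wrap a fetch so every request carries X-Boo-Source, keeping caller-supplied headers.
function withSource(inner: FetchLike, source: string): FetchLike {
  return (url, init = {}) =>
    inner(url, {
      ...init,
      headers: { ...(init.headers ?? {}), "X-Boo-Source": source },
    });
}

// One provider per (baseURL, source) pair, so two consumers never share a header.
function getProvider(baseURL: string, source: string, inner: FetchLike): SwapProvider {
  const key = `${baseURL}||${source}`;
  let provider = providerCache.get(key);
  if (!provider) {
    provider = { baseURL, source, fetch: withSource(inner, source) };
    providerCache.set(key, provider);
  }
  return provider;
}
```

Keying on the source as well as the base URL matters here: with a cache keyed on base URL alone, the first consumer to create a provider would fix the header for every later consumer of the same endpoint.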

Findings

Blocking

V1: judge-runner.ts generateResponse uses temperature 0.7 instead of 0

  • Location: apps/control/src/services/judge-runner.ts:182
  • Evidence: body: JSON.stringify({ model, messages: [{ role: 'user', content: prompt }], temperature: 0.7, max_tokens: 2048 }) -- the generateResponse function (which generates the target model's response to be scored) uses temperature 0.7. The design at design.md:195 specifies "temperature 0, judge model+version pinned per run." The scoreWithRubric function at line 239 correctly uses temperature: 0, but the response generation step does not.
  • Impact: The target model's response is generated with non-deterministic sampling. For a reproducible eval framework this undermines the "temperature 0" claim in the task description. The judge scoring is deterministic (temp=0) but the input it scores is not.
  • Fix sketch: Change line 182 from temperature: 0.7 to temperature: 0.
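
One way to keep the two call sites from drifting apart again is to build both request bodies through a single helper that owns the sampling parameters. The sketch below assumes the body shape quoted in the evidence above. buildEvalRequestBody and EVAL_TEMPERATURE are hypothetical names, not functions that exist in judge-runner.ts.

```typescript
interface ChatRequestBody {
  model: string;
  messages: { role: "user"; content: string }[];
  temperature: number;
  max_tokens: number;
}

// Single place that fixes the sampling temperature for every eval request.
const EVAL_TEMPERATURE = 0;

// Hypothetical helper: serializes a chat request with the pinned temperature.
function buildEvalRequestBody(model: string, prompt: string, maxTokens = 2048): string {
  const body: ChatRequestBody = {
    model,
    messages: [{ role: "user", content: prompt }],
    temperature: EVAL_TEMPERATURE,
    max_tokens: maxTokens,
  };
  return JSON.stringify(body);
}
```

Temperature 0 removes sampling randomness from the request, though some backends may still show small run-to-run variation.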

Advisory

A1: sandbox-runner.ts SANDBOX_TIMEOUT_MS is string, not number

  • Location: apps/control/src/services/sandbox-runner.ts:37
  • Evidence: const SANDBOX_TIMEOUT_MS = (process.env.SANDBOX_TIMEOUT_MS ?? '30000') as unknown as number; -- process.env values are string | undefined. The as unknown as number cast silences tsc but the runtime value is '30000' (string). This string flows to spawn(..., { timeout: SANDBOX_TIMEOUT_MS }) at line 277 and setTimeout(..., SANDBOX_TIMEOUT_MS) at line 308. setTimeout coerces a numeric string to a number at runtime, so that call works. spawn is less forgiving: Node validates its timeout option as a non-negative integer, so a string may be rejected rather than coerced; this should be confirmed against the Node version in use. The type lie also masks future bugs (e.g. SANDBOX_TIMEOUT_MS + 1000 would concatenate to '300001000' instead of producing 31000).
  • Impact: Low immediate risk on the setTimeout path (JS coercion makes it work); the spawn path should be checked. The incorrect type annotation also prevents tsc from catching arithmetic bugs. SANDBOX_CONCURRENCY at line 38 has the same issue.
  • Fix sketch: const SANDBOX_TIMEOUT_MS = Number(process.env.SANDBOX_TIMEOUT_MS ?? '30000');
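
Number(...) alone turns a malformed value such as 'abc' into NaN without complaint. A slightly stricter variant, sketched below, falls back when the variable is unset and fails fast when it is malformed. parseEnvInt is a hypothetical helper, not something present in sandbox-runner.ts. It takes the raw string as a parameter so it can be tested without touching process.env.

```typescript
// Parse a numeric env value: fall back when unset, throw when malformed.
function parseEnvInt(raw: string | undefined, fallback: number, name: string): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

// Intended use in Node (assumed call site):
//   const SANDBOX_TIMEOUT_MS = parseEnvInt(process.env.SANDBOX_TIMEOUT_MS, 30_000, "SANDBOX_TIMEOUT_MS");
```

The same helper would cover SANDBOX_CONCURRENCY, and the result is a real number for both tsc and the runtime.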

A2: judge-runner tests exercise imports, not judge logic

  • Location: apps/control/src/services/__tests__/judge-runner.test.ts
  • Evidence: Test 1 imports the module and checks typeof mod.runJudgeEval === 'function'. Test 2 calls runJudgeEval with a nonexistent provider and asserts the error message. Neither test exercises the actual judge request flow, rubric scoring, temperature setting, or rationale capture. The temperature=0.7 bug (V1) would not be caught by these tests.
  • Impact: Regressions in judge scoring logic, temperature, or X-Boo-Source injection would not be caught by the test suite.
  • Reopen trigger: Any bug where judge scoring produces wrong results or wrong temperature.
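
A test that would catch V1 needs to see the outbound request. The sketch below shows one approach under the assumption that the judge call can accept an injected fetch. makeCapturingFetch and scoreOnce are hypothetical: scoreOnce is a stand-in for the function under test, since the real runJudgeEval signature is not shown in this audit.

```typescript
interface CapturedCall {
  url: string;
  headers: Record<string, string>;
  body: Record<string, unknown>;
}

type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ json(): Promise<unknown> }>;

// Build a fake fetch that records every request and answers with a canned reply.
function makeCapturingFetch(reply: unknown): { fetchImpl: FetchLike; calls: CapturedCall[] } {
  const calls: CapturedCall[] = [];
  const fetchImpl: FetchLike = (url, init) => {
    calls.push({
      url,
      headers: init.headers,
      body: JSON.parse(init.body) as Record<string, unknown>,
    });
    return Promise.resolve({ json: () => Promise.resolve(reply) });
  };
  return { fetchImpl, calls };
}

// Stand-in for the function under test (hypothetical signature).
async function scoreOnce(
  fetchImpl: FetchLike,
  baseURL: string,
  model: string,
  prompt: string,
): Promise<unknown> {
  const res = await fetchImpl(`${baseURL}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json", "X-Boo-Source": "control-eval" },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }], temperature: 0 }),
  });
  return res.json();
}
```

A test would call the function with fetchImpl and then assert on calls[i].body.temperature and calls[i].headers["X-Boo-Source"] for both the generation and the scoring request.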

A3: sandbox-runner tests exercise Promise patterns, not Docker flags

  • Location: apps/control/src/services/__tests__/sandbox-runner.test.ts
  • Evidence: Tests verify runCodeEval is importable, that Promise.allSettled isolates failures, and that SIGKILL works. None of the tests verify the actual Docker arguments (security flags, label, resource caps), orphan pruning, or container cleanup. The test at line 19 (bounded fan-out) reimplements the pattern inline rather than calling runCodeEval.
  • Impact: A regression in the Docker security flags (e.g. removing --cap-drop ALL) would pass all existing tests.
  • Reopen trigger: Any sandbox escape or flag regression.
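
If the docker arguments were built by a pure function, a unit test could pin the hardening flags without starting a container. The sketch below uses the flags listed in the traceability table. buildDockerArgs, hardeningProblems and SandboxLimits are hypothetical names, and the memory, cpus and pids-limit values are parameters because the audit does not state them.

```typescript
interface SandboxLimits {
  memory: string;
  cpus: string;
  pidsLimit: number;
}

// Pure construction of the `docker run` argument list (no process is spawned here).
function buildDockerArgs(image: string, command: string[], limits: SandboxLimits): string[] {
  return [
    "run", "--rm",
    "--network", "none",
    "--user", "1000:1000",
    "--memory", limits.memory,
    "--cpus", limits.cpus,
    "--pids-limit", String(limits.pidsLimit),
    "--tmpfs", "/workspace:rw,noexec,size=64m",
    "--label", "boocontrol-eval",
    "--security-opt", "no-new-privileges",
    "--cap-drop", "ALL",
    image, ...command,
  ];
}

const REQUIRED: [string, string][] = [
  ["--network", "none"],
  ["--user", "1000:1000"],
  ["--tmpfs", "/workspace:rw,noexec,size=64m"],
  ["--label", "boocontrol-eval"],
  ["--security-opt", "no-new-privileges"],
  ["--cap-drop", "ALL"],
];
const FORBIDDEN = ["-v", "--volume", "--mount"];

// Return a description of every missing hardening flag or forbidden mount flag.
function hardeningProblems(args: string[]): string[] {
  const problems: string[] = [];
  for (const [flag, value] of REQUIRED) {
    const i = args.indexOf(flag);
    if (i === -1 || args[i + 1] !== value) problems.push(`missing ${flag} ${value}`);
  }
  if (!args.includes("--rm")) problems.push("missing --rm");
  for (const flag of FORBIDDEN) {
    if (args.includes(flag)) problems.push(`forbidden ${flag}`);
  }
  return problems;
}
```

A regression test then reduces to asserting that hardeningProblems(buildDockerArgs(...)) is empty, which would fail if --cap-drop ALL were removed.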

A4: arena dispatch sites not fully traced

  • Location: apps/coder/src/services/arena-model-call.ts:51
  • Evidence: arenaModelCall sets X-Boo-Source: 'arena'. However, the full arena dispatch chain (battle start, contestant model calls, cross-examination) was not traced end-to-end. The direct arenaModelCall path is verified; whether all arena sub-calls route through this function rather than making their own fetches was not checked.
  • Impact: Low -- if arena uses arenaModelCall for all model calls, attribution is correct. If any arena path makes a direct fetch without X-Boo-Source, those requests would show as NULL in the activity feed.
  • Reopen trigger: Arena requests showing as NULL in activity feed despite having a source.
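
Tracing the chain by hand is one option. Another is a test-time fetch wrapper that records any outbound request lacking the header, so an arena battle run under test reveals unattributed calls directly. The sketch below assumes arena code can be pointed at an injected fetch. auditSourceHeader is a hypothetical helper.

```typescript
type HeaderMap = Record<string, string>;
type FetchLike = (url: string, init?: { headers?: HeaderMap }) => Promise<unknown>;

interface AuditedFetch {
  fetchImpl: FetchLike;
  unattributed: string[];
}

// Wrap a fetch and collect the URL of every request sent without X-Boo-Source.
function auditSourceHeader(inner: FetchLike): AuditedFetch {
  const unattributed: string[] = [];
  const fetchImpl: FetchLike = (url, init) => {
    const headers = init?.headers ?? {};
    // HTTP header names are case-insensitive, so compare in lower case.
    const hasSource = Object.keys(headers).some((k) => k.toLowerCase() === "x-boo-source");
    if (!hasSource) unattributed.push(url);
    return inner(url, init);
  };
  return { fetchImpl, unattributed };
}
```

After exercising battle start, contestant calls and cross-examination through fetchImpl, an empty unattributed list would close this finding.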

Claims I did not verify

  • Whether includeUsage: true survives AI-SDK v6's internal handling (this was verified in the prior P1 review -- load-bearing per apps/server/CLAUDE.md)
  • Whether the sql.json(value as never) pattern in eval-suites.ts:170 correctly serializes the tasks array as JSONB (pattern is established and used elsewhere in the codebase)
  • Whether the ECharts bundle tree-shaking works correctly in the production build (the echarts/core + per-chart imports pattern is established from P1)
  • Whether the eval_runs.judge_model_version column is actually populated at run creation time (the createEvalRun function at eval-suites.ts:258 receives judgeModelVersion as a parameter; whether callers pass it was not traced)
  • Whether the leaderboard API endpoint exists and returns the correct shape (the frontend fetches from /api/control/eval/leaderboard; the backend route handler was not traced)

Review: boocontrol P4+P5

Scope

apps/server/src/services/inference/provider.ts, apps/server/src/services/inference/stream-phase-adapter.ts, apps/server/src/services/compaction.ts, apps/server/src/services/task-model.ts, apps/coder/src/services/local-gateway.ts, apps/coder/src/services/arena-model-call.ts, apps/control/src/services/judge-runner.ts, apps/control/src/services/sandbox-runner.ts, apps/control/src/services/eval-suites.ts, apps/control/src/schema.sql, apps/web/src/components/control/EvalsTab.tsx, apps/web/src/pages/Control.tsx, P4+P5 tests.

Size

Large -- 12 source files across 3 apps + contracts, touches inference streaming path, SSE ingestion, Docker container spawning, DB schema, and ECharts UI.

Summary

P4 (attribution) is correctly implemented end-to-end. All three paths (server streaming, coder gateway, arena) inject the correct X-Boo-Source header. The migration is idempotent and NULL-for-ring-data is documented. P5 (evals) has correct schema, YAML loading, and UI wiring, but the judge runner's response generation temperature (0.7) contradicts the design spec (0). Sandbox hardening is thorough.

| Classification | Count |
| --- | --- |
| Blocking | 1 |
| Advisory | 4 |
| Nit | 1 |

Findings

Blocking

B1: Judge response generation temperature is 0.7, not 0

  • Location: apps/control/src/services/judge-runner.ts:182
  • Evidence: temperature: 0.7 in the generateResponse request body. The design at design.md:195 specifies "temperature 0, judge model+version pinned per run." The scoreWithRubric function at line 239 correctly uses temperature: 0.
  • Standard violated: Design spec S8 ("temperature 0, judge model+version pinned per run").
  • Risk: Non-deterministic eval inputs undermine reproducibility claims. A reviewer or auditor checking the design vs code will find this discrepancy.
  • Fix sketch: temperature: 0 on line 182.

Advisory

A1: SANDBOX_TIMEOUT_MS type mismatch

  • Location: apps/control/src/services/sandbox-runner.ts:37
  • Evidence: as unknown as number cast on a string from process.env. setTimeout coerces the string at runtime, but the type lie prevents catching arithmetic bugs, and spawn's integer timeout option may not accept a string.
  • YAGNI gate: No known incident. Defer unless the sandbox timeout needs arithmetic (e.g. grace period).

A2: Judge tests do not exercise scoring logic

  • Location: apps/control/src/services/__tests__/judge-runner.test.ts
  • Evidence: Tests check import and error-on-bad-provider only. Rubric scoring, temperature, X-Boo-Source injection, and rationale capture are untested.
  • YAGNI gate: No known scoring bug. Defer until judge scoring produces real evals.

A3: Sandbox tests do not verify Docker flags

  • Location: apps/control/src/services/__tests__/sandbox-runner.test.ts
  • Evidence: Tests exercise Promise.allSettled and SIGKILL patterns, not the actual Docker args construction. Security flags (network, caps, user, label) are untested.
  • YAGNI gate: No known sandbox escape. Defer until sandbox runner processes untrusted code.
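
The batch, allSettled and finally-cleanup pattern those tests exercise could live in a small generic helper, which a test can call directly instead of reimplementing the loop inline. A minimal sketch follows. runBounded is a hypothetical name, and the cleanup callback stands in for whatever cleanupContainer does.

```typescript
type Task<T> = () => Promise<T>;

// Run tasks in batches of `limit`; a failure in one task never aborts its siblings.
async function runBounded<T>(
  tasks: Task<T>[],
  limit: number,
  cleanup: (index: number) => Promise<void>,
): Promise<PromiseSettledResult<T>[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error("limit must be a positive integer");
  }
  const results: PromiseSettledResult<T>[] = [];
  for (let start = 0; start < tasks.length; start += limit) {
    const batch = tasks.slice(start, start + limit).map(async (task, offset) => {
      try {
        return await task();
      } finally {
        // Cleanup runs whether the task resolved or threw.
        await cleanup(start + offset);
      }
    });
    // allSettled waits for the whole batch and reports each outcome separately.
    results.push(...(await Promise.allSettled(batch)));
  }
  return results;
}
```

Batching this way caps concurrency at limit, at the cost of waiting for the slowest task in each batch before the next batch starts.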

A4: Arena dispatch chain not fully traced

  • Location: apps/coder/src/services/arena-model-call.ts:51
  • Evidence: arenaModelCall sets X-Boo-Source: 'arena'. Whether all arena sub-calls (battle start, cross-examination) route through this function rather than making direct fetches was not verified.
  • YAGNI gate: No known arena attribution bug. Defer until arena requests show NULL source.

Nits

N1: seedEvalSuites and upsertEvalSuite both use ON CONFLICT (id), not the UNIQUE (name, version) constraint on eval_suites

  • Location: apps/control/src/services/eval-suites.ts:175 vs eval-suites.ts:230
  • Evidence: seedEvalSuites uses ON CONFLICT (id) DO NOTHING (by primary key). upsertEvalSuite uses ON CONFLICT (id) DO UPDATE. The schema also has UNIQUE (name, version) at schema.sql:170, which is not the conflict target in either function. If a second suite with a different id reuses an existing name+version, the INSERT fails with a unique violation instead of taking the ON CONFLICT branch. This is the correct behavior (versioning is explicit), but the UNIQUE constraint and the ON CONFLICT target differ.
  • Note: Style -- not a bug.

Verdict

APPROVE-WITH-NITS

One blocking finding (B1: judge temperature 0.7 should be 0). Four advisory findings deferred per YAGNI gates. One nit on UNIQUE constraint targeting.


Claims I did not verify

  • Whether the AI-SDK createOpenAICompatible internal fetch wrapper correctly merges the custom fetch headers (established pattern from P1, not re-verified)
  • Whether the eval_runs.judge_model_version column is populated by callers of createEvalRun (the function accepts it; caller trace was not performed)
  • Whether the leaderboard API backend route exists and returns the correct shape
  • Whether the ECharts tree-shaking in EvalsTab.tsx produces correct bundle sizes
  • Whether arena battle start / cross-examination model calls all go through arenaModelCall
  • Whether the control_requests INSERT at index.ts:258 (the non-reconcile path) also correctly sets source: null