P4+P5 Audit: Combined Validation + Code Review
- Date: 2026-06-12
- Change: boocontrol
- Phases: P4 (per-consumer attribution) + P5 (quality evals + sandbox)
- Mode: Implementation (all 8 tasks checked)
Build/Test Gates
| Gate | Result |
|---|---|
| `pnpm -C apps/server build` | PASS |
| `pnpm -C apps/server test` | PASS (580 passed, 11 skipped, 51 files) |
| `pnpm -C apps/coder build` | PASS |
| `pnpm -C apps/coder test` | PASS (587 passed, 32 skipped, 51 files) |
| `pnpm -C apps/control build` | PASS |
| `pnpm -C apps/control test` | PASS (116 passed, 15 files) |
| `npx tsc -p apps/web/tsconfig.app.json --noEmit` | PASS |
Validation: boocontrol P4+P5 (implementation mode)
Verdict
PASS-WITH-FINDINGS -- all 8 tasks have implementing code; one design-specified behavior (judge temperature=0) is not implemented.
Traceability
| Task | Claim | Evidence | Status |
|---|---|---|---|
| P4.1 | X-Boo-Source on AI-SDK streaming path | `stream-phase-adapter.ts:309` passes `'boochat'` to `upstreamModel`; `provider.ts:19-44` `getSwapProvider` wraps fetch with header, cache keyed `baseURL\|\|source` | PASS |
| P4.1 | `includeUsage: true` preserved | `provider.ts:38` explicitly set on `createOpenAICompatible` | PASS |
| P4.1 | `compaction.ts` + `task-model.ts` headers | `compaction.ts:359` and `task-model.ts:27` both inject `X-Boo-Source: 'boochat'` in direct fetch headers | PASS |
| P4.2 | `local-gateway.ts` forwards `x-boo-source` | `local-gateway.ts:67` reads inbound header, defaults `'boocoder'`; `local-gateway.ts:76` forwards as `X-Boo-Source` | PASS |
| P4.2 | `arena-model-call.ts` source | `arena-model-call.ts:51` sets `X-Boo-Source: 'arena'` | PASS |
| P4.3 | `control_requests.source` migration | `schema.sql:48` `ALTER TABLE ADD COLUMN IF NOT EXISTS source TEXT` (idempotent); INSERT at `index.ts:182-183` includes `source` column; `index.ts:81` maps `source: null` for ring data (design S7 deviation documented) | PASS |
| P4.4 | Tests: header present + rows attribute | `pipeline.test.ts:248` asserts `source=NULL` for ring data; import/export tests for all three paths | PARTIAL |
| P5.1 | Suite format + YAML loading + DB schema | `eval-suites.ts:67-120` loads YAML from `data/`; `schema.sql:161-222` defines `eval_suites` (UNIQUE on name+version), `eval_runs`, `eval_results`; 4 YAML suite files present | PASS |
| P5.2 | Judge runner temperature=0 | `judge-runner.ts:239` `scoreWithRubric` uses `temperature: 0` (correct); `judge-runner.ts:182` `generateResponse` uses `temperature: 0.7` (NOT 0) | FAIL |
| P5.2 | Judge model+version pinned per run | `judge-runner.ts:59` constructs `judgeModelVersion` string; `eval_runs` table stores `judge_model` + `judge_model_version` | PASS |
| P5.2 | Rationale captured | `judge-runner.ts:97-98` stores rationale from `scoreWithRubric` | PASS |
| P5.2 | X-Boo-Source control-eval | `judge-runner.ts:177,237` both set `X-Boo-Source: 'control-eval'` | PASS |
| P5.3 | Sandbox hardening flags | `sandbox-runner.ts:258-273` docker args array: `--network none`, `--user 1000:1000`, `--memory`, `--cpus`, `--pids-limit`, `--tmpfs /workspace:rw,noexec,size=64m`, `--rm`, `--label boocontrol-eval`, `--security-opt no-new-privileges`, `--cap-drop ALL` | PASS |
| P5.3 | No volume mounts, no docker socket | Verified in docker args array at `sandbox-runner.ts:258-273` -- no `-v` or socket reference | PASS |
| P5.3 | Orphan prune at engine start | `sandbox-runner.ts:73` calls `pruneOrphanContainers()` at start of `runCodeEval` | PASS |
| P5.3 | Bounded concurrency + allSettled + finally cleanup | `sandbox-runner.ts:81-83` batch loop; `sandbox-runner.ts:86` `Promise.allSettled`; `sandbox-runner.ts:162-165` finally block with `cleanupContainer` | PASS |
| P5.3 | `SANDBOX_TIMEOUT_MS` type | `sandbox-runner.ts:37` typed as `number` but `process.env` is string -- `setTimeout` and `spawn` timeout receive string | ADVISORY |
| P5.4 | Leaderboard UI + scatter | `EvalsTab.tsx` renders scatter (`echarts.init` with `buildEChartsTheme`) + bar chart + run table + launcher | PASS |
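
The P5.3 row on bounded concurrency names three pieces: a batch loop, `Promise.allSettled`, and a `finally` cleanup. A minimal sketch of how those pieces typically fit together follows. The names `runInBatches` and `Job` and the cleanup callback are hypothetical and are not taken from `sandbox-runner.ts`; this illustrates the pattern only, not the audited code.

```typescript
// Hypothetical names: runInBatches and Job are illustrative only.
type Job<T> = { run: () => Promise<T>; cleanup: () => Promise<void> };

async function runInBatches<T>(
  jobs: Job<T>[],
  concurrency: number,
): Promise<PromiseSettledResult<T>[]> {
  // Guard against 0 or fractional values, which would stall the loop.
  const size = Math.max(1, Math.floor(concurrency));
  const results: PromiseSettledResult<T>[] = [];

  for (let i = 0; i < jobs.length; i += size) {
    const batch = jobs.slice(i, i + size);
    // allSettled keeps one rejected job from discarding its siblings' results.
    const settled = await Promise.allSettled(
      batch.map(async (job) => {
        try {
          return await job.run();
        } finally {
          // Cleanup runs whether the job resolved or threw.
          await job.cleanup();
        }
      }),
    );
    results.push(...settled);
  }
  return results;
}
```

A unit test that calls a helper of this shape can assert on the peak number of in-flight jobs and on the cleanup count directly.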
Findings
Blocking
V1: judge-runner.ts generateResponse uses temperature 0.7 instead of 0
- Location: `apps/control/src/services/judge-runner.ts:182`
- Evidence: `body: JSON.stringify({ model, messages: [{ role: 'user', content: prompt }], temperature: 0.7, max_tokens: 2048 })` -- the `generateResponse` function (which generates the target model's response to be scored) uses temperature 0.7. The design at `design.md:195` specifies "temperature 0, judge model+version pinned per run." The `scoreWithRubric` function at line 239 correctly uses `temperature: 0`, but the response generation step does not.
- Impact: The target model's response is generated with non-deterministic sampling. For a reproducible eval framework this undermines the "temperature 0" claim in the task description. The judge scoring is deterministic (temp=0) but the input it scores is not.
- Fix sketch: Change line 182 from `temperature: 0.7` to `temperature: 0`.
Advisory
A1: sandbox-runner.ts SANDBOX_TIMEOUT_MS is string, not number
- Location: `apps/control/src/services/sandbox-runner.ts:37`
- Evidence: `const SANDBOX_TIMEOUT_MS = (process.env.SANDBOX_TIMEOUT_MS ?? '30000') as unknown as number;` -- `process.env` values are `string | undefined`. The `as unknown as number` cast silences tsc but the runtime value is `'30000'` (string). This string flows to `spawn(..., { timeout: SANDBOX_TIMEOUT_MS })` at line 277 and `setTimeout(..., SANDBOX_TIMEOUT_MS)` at line 308. Node's `child_process.spawn` timeout accepts `number | undefined` and `setTimeout` accepts `number | string | undefined` (string is parsed). The timeout will likely work due to JS coercion, but the type lie masks future bugs (e.g. `SANDBOX_TIMEOUT_MS + 1000` would concatenate to `'300001000'` instead of adding).
- Impact: Low immediate risk (JS coercion makes it work), but the incorrect type annotation prevents catching arithmetic bugs. `SANDBOX_CONCURRENCY` at line 38 has the same issue.
- Fix sketch: `const SANDBOX_TIMEOUT_MS = Number(process.env.SANDBOX_TIMEOUT_MS ?? '30000');`
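
The fix sketch above can be taken one step further by rejecting malformed values instead of letting `NaN` through. The helper below is a sketch under stated assumptions: `readPositiveInt` is a hypothetical name, not part of `sandbox-runner.ts`, and it takes the environment as a parameter so it can be tested without touching `process.env`.

```typescript
// Hypothetical helper; not part of sandbox-runner.ts.
function readPositiveInt(
  env: Record<string, string | undefined>,
  name: string,
  fallback: number,
): number {
  const raw = env[name];
  // Unset or blank falls back to the default.
  if (raw === undefined || raw.trim() === "") return fallback;

  const parsed = Number(raw);
  // Number("30s") is NaN and Number("1.5") is not an integer; both are rejected.
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return parsed;
}

// Possible usage in a Node runtime (assumed, not verified against the repo):
// const SANDBOX_TIMEOUT_MS = readPositiveInt(process.env, "SANDBOX_TIMEOUT_MS", 30000);
```

With a real `number` in hand, arithmetic such as adding a grace period behaves as the type annotation promises. The same helper would cover `SANDBOX_CONCURRENCY`.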
A2: judge-runner tests exercise imports, not judge logic
- Location: `apps/control/src/services/__tests__/judge-runner.test.ts`
- Evidence: Test 1 imports the module and checks `typeof mod.runJudgeEval === 'function'`. Test 2 calls `runJudgeEval` with a nonexistent provider and asserts the error message. Neither test exercises the actual judge request flow, rubric scoring, temperature setting, or rationale capture. The temperature=0.7 bug (V1) would not be caught by these tests.
- Impact: Regressions in judge scoring logic, temperature, or X-Boo-Source injection would not be caught by the test suite.
- Reopen trigger: Any bug where judge scoring produces wrong results or wrong temperature.
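
One way to close the gap described in A2 is to move request construction into a pure function and assert on its output. The sketch below assumes that refactor: `buildChatRequest`, `ChatRequest`, and `findViolations` are hypothetical names, since the audited `judge-runner.ts` builds its request body inline.

```typescript
// Hypothetical shapes; judge-runner.ts builds its request inline.
interface ChatRequest {
  headers: Record<string, string>;
  body: {
    model: string;
    messages: { role: string; content: string }[];
    temperature: number;
    max_tokens: number;
  };
}

function buildChatRequest(model: string, prompt: string): ChatRequest {
  return {
    headers: { "Content-Type": "application/json", "X-Boo-Source": "control-eval" },
    body: {
      model,
      messages: [{ role: "user", content: prompt }],
      temperature: 0, // design.md:195 asks for temperature 0
      max_tokens: 2048,
    },
  };
}

// Returns a list of human-readable problems; an empty list means the request
// satisfies both checks a unit test would care about.
function findViolations(req: ChatRequest): string[] {
  const problems: string[] = [];
  if (req.body.temperature !== 0) {
    problems.push(`temperature is ${req.body.temperature}, expected 0`);
  }
  if (req.headers["X-Boo-Source"] !== "control-eval") {
    problems.push("X-Boo-Source header missing or wrong");
  }
  return problems;
}
```

A test built on a check like this would have flagged V1, because a request carrying `temperature: 0.7` produces a non-empty violation list.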
A3: sandbox-runner tests exercise Promise patterns, not Docker flags
- Location: `apps/control/src/services/__tests__/sandbox-runner.test.ts`
- Evidence: Tests verify `runCodeEval` is importable, that `Promise.allSettled` isolates failures, and that SIGKILL works. None of the tests verify the actual Docker arguments (security flags, label, resource caps), orphan pruning, or container cleanup. The test at line 19 (bounded fan-out) reimplements the pattern inline rather than calling `runCodeEval`.
- Impact: A regression in the Docker security flags (e.g. removing `--cap-drop ALL`) would pass all existing tests.
- Reopen trigger: Any sandbox escape or flag regression.
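
A flag regression of the kind A3 describes could be caught by auditing the args array before it reaches `spawn`. The sketch below is illustrative: `auditDockerArgs` is a hypothetical helper, and it assumes each flag and its value are separate array elements (`"--network", "none"`) rather than the `--network=none` form. The required flags are a subset of those listed in the traceability table.

```typescript
// Hypothetical helper; sandbox-runner.ts builds its args inline.
const REQUIRED_FLAGS: string[][] = [
  ["--network", "none"],
  ["--user", "1000:1000"],
  ["--security-opt", "no-new-privileges"],
  ["--cap-drop", "ALL"],
  ["--rm"],
];

// Mount-style flags the audit expects never to appear.
const FORBIDDEN_FLAGS = ["-v", "--volume", "--mount"];

function auditDockerArgs(args: string[]): string[] {
  const problems: string[] = [];

  for (const flag of REQUIRED_FLAGS) {
    const idx = args.indexOf(flag[0]);
    // Value-less flags only need to be present; others must be followed by their value.
    const ok = idx !== -1 && (flag.length === 1 || args[idx + 1] === flag[1]);
    if (!ok) problems.push(`missing ${flag.join(" ")}`);
  }

  for (const arg of args) {
    if (FORBIDDEN_FLAGS.indexOf(arg) !== -1 || arg.indexOf("docker.sock") !== -1) {
      problems.push(`forbidden ${arg}`);
    }
  }
  return problems;
}
```

If the runner exposed its args construction as a function, a test could feed the result to an audit like this and assert that the returned list is empty.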
A4: arena dispatch sites not fully traced
- Location: `apps/coder/src/services/arena-model-call.ts:51`
- Evidence: `arenaModelCall` sets `X-Boo-Source: 'arena'`. However, the full arena dispatch chain (battle start, contestant model calls, cross-examination) was not traced end-to-end. The direct `arenaModelCall` path is verified; whether all arena sub-calls route through this function rather than making their own fetches was not checked.
- Impact: Low -- if arena uses `arenaModelCall` for all model calls, attribution is correct. If any arena path makes a direct fetch without `X-Boo-Source`, those requests would show as NULL in the activity feed.
- Reopen trigger: Arena requests showing as NULL in activity feed despite having a source.
Claims I did not verify
- Whether the `includeUsage: true` survives AI-SDK v6's internal handling (this was verified in prior P1 review -- load-bearing per `apps/server/CLAUDE.md`)
- Whether the `sql.json(value as never)` pattern in `eval-suites.ts:170` correctly serializes the tasks array as JSONB (pattern is established and used elsewhere in the codebase)
- Whether the ECharts bundle tree-shaking works correctly in the production build (the `echarts/core` + per-chart imports pattern is established from P1)
- Whether the `eval_runs.judge_model_version` column is actually populated at run creation time (the `createEvalRun` function at `eval-suites.ts:258` receives `judgeModelVersion` as a parameter; whether callers pass it was not traced)
- Whether the leaderboard API endpoint exists and returns the correct shape (the frontend fetches from `/api/control/eval/leaderboard`; the backend route handler was not traced)
Review: boocontrol P4+P5
Scope
apps/server/src/services/inference/provider.ts, apps/server/src/services/inference/stream-phase-adapter.ts, apps/server/src/services/compaction.ts, apps/server/src/services/task-model.ts, apps/coder/src/services/local-gateway.ts, apps/coder/src/services/arena-model-call.ts, apps/control/src/services/judge-runner.ts, apps/control/src/services/sandbox-runner.ts, apps/control/src/services/eval-suites.ts, apps/control/src/schema.sql, apps/web/src/components/control/EvalsTab.tsx, apps/web/src/pages/Control.tsx, P4+P5 tests.
Size
Large -- 12 source files across 3 apps + contracts, touches inference streaming path, SSE ingestion, Docker container spawning, DB schema, and ECharts UI.
Summary
P4 (attribution) is correctly implemented end-to-end. All three paths (server streaming, coder gateway, arena) inject the correct X-Boo-Source header. The migration is idempotent and NULL-for-ring-data is documented. P5 (evals) has correct schema, YAML loading, and UI wiring, but the judge runner's response generation temperature (0.7) contradicts the design spec (0). Sandbox hardening is thorough.
| Classification | Count |
|---|---|
| Blocking | 1 |
| Advisory | 4 |
| Nit | 1 |
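
The attribution mechanism summarized above comes down to two moves: wrap a fetch function so every outbound call carries `X-Boo-Source`, and, on the gateway side, forward an inbound value or fall back to a default. A minimal sketch follows. `withBooSource`, `resolveSource`, and the reduced `FetchLike` type are hypothetical and kept deliberately small. The real `getSwapProvider` and `local-gateway.ts` code was not reproduced here.

```typescript
// Reduced fetch-like types so the sketch stays self-contained (hypothetical).
type Init = { method?: string; headers?: Record<string, string>; body?: string };
type FetchLike = (url: string, init?: Init) => Promise<unknown>;

// Wraps a fetch function so each call carries the attribution header.
// Caller-supplied headers are preserved; the source header is set last.
function withBooSource(base: FetchLike, source: string): FetchLike {
  return (url, init = {}) =>
    base(url, { ...init, headers: { ...init.headers, "X-Boo-Source": source } });
}

// Gateway side: forward the inbound source if present, else use a default.
// Assumes header names have already been lowercased, as Node's HTTP server does.
function resolveSource(
  inbound: Record<string, string | undefined>,
  fallback: string,
): string {
  return inbound["x-boo-source"] ?? fallback;
}
```

Under these assumptions, the arena path would use `withBooSource(fetch, 'arena')`, and the coder gateway would call `resolveSource(req.headers, 'boocoder')` before forwarding.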
Findings
Blocking
B1: Judge response generation temperature is 0.7, not 0
- Location: `apps/control/src/services/judge-runner.ts:182`
- Evidence: `temperature: 0.7` in the `generateResponse` request body. The design at `design.md:195` specifies "temperature 0, judge model+version pinned per run." The `scoreWithRubric` function at line 239 correctly uses `temperature: 0`.
- Standard violated: Design spec S8 ("temperature 0, judge model+version pinned per run").
- Risk: Non-deterministic eval inputs undermine reproducibility claims. A reviewer or auditor checking the design vs code will find this discrepancy.
- Fix sketch: `temperature: 0` on line 182.
Advisory
A1: SANDBOX_TIMEOUT_MS type mismatch
- Location: `apps/control/src/services/sandbox-runner.ts:37`
- Evidence: `as unknown as number` cast on a string from `process.env`. Works at runtime due to JS coercion, but the type lie prevents catching arithmetic bugs.
- YAGNI gate: No known incident. Defer unless the sandbox timeout needs arithmetic (e.g. grace period).
A2: Judge tests do not exercise scoring logic
- Location: `apps/control/src/services/__tests__/judge-runner.test.ts`
- Evidence: Tests check import and error-on-bad-provider only. Rubric scoring, temperature, X-Boo-Source injection, and rationale capture are untested.
- YAGNI gate: No known scoring bug. Defer until judge scoring produces real evals.
A3: Sandbox tests do not verify Docker flags
- Location: `apps/control/src/services/__tests__/sandbox-runner.test.ts`
- Evidence: Tests exercise `Promise.allSettled` and `SIGKILL` patterns, not the actual Docker args construction. Security flags (network, caps, user, label) are untested.
- YAGNI gate: No known sandbox escape. Defer until sandbox runner processes untrusted code.
A4: Arena dispatch chain not fully traced
- Location: `apps/coder/src/services/arena-model-call.ts:51`
- Evidence: `arenaModelCall` sets `X-Boo-Source: 'arena'`. Whether all arena sub-calls (battle start, cross-examination) route through this function rather than making direct fetches was not verified.
- YAGNI gate: No known arena attribution bug. Defer until arena requests show NULL source.
Nits
N1: eval_suites UNIQUE on (name, version) uses ON CONFLICT DO NOTHING in seed, but upsertEvalSuite uses ON CONFLICT DO UPDATE
- Location: `apps/control/src/services/eval-suites.ts:175` vs `eval-suites.ts:230`
- Evidence: `seedEvalSuites` uses `ON CONFLICT (id) DO NOTHING` (by primary key). `upsertEvalSuite` uses `ON CONFLICT (id) DO UPDATE`. The schema also has `UNIQUE (name, version)` at `schema.sql:170` which is NOT the conflict target in either function. If two suites share a name+version, the UNIQUE constraint would reject the second. This is the correct behavior (versioning is explicit), but the UNIQUE constraint and the ON CONFLICT target differ.
- Note: Style -- not a bug.
Verdict
APPROVE-WITH-NITS
One blocking finding (B1: judge temperature 0.7 should be 0). Four advisory findings deferred per YAGNI gates. One nit on UNIQUE constraint targeting.
Claims I did not verify
- Whether the AI-SDK `createOpenAICompatible` internal `fetch` wrapper correctly merges the custom fetch headers (established pattern from P1, not re-verified)
- Whether the `eval_runs.judge_model_version` column is populated by callers of `createEvalRun` (the function accepts it; caller trace was not performed)
- Whether the leaderboard API backend route exists and returns the correct shape
- Whether the ECharts tree-shaking in `EvalsTab.tsx` produces correct bundle sizes
- Whether arena battle start / cross-examination model calls all go through `arenaModelCall`
- Whether the `control_requests` INSERT at `index.ts:258` (the non-reconcile path) also correctly sets `source: null`