
indifferentketchup b18de2a331 chore: snapshot working tree - pty_exited notifications + in-flight inference WIP

feat(booterm): structured pty_exited WS notifications. Plan-validated, impl-validated, code-reviewed green (contracts build clean, contracts test 29/29, booterm + web typecheck clean).

wip: in-progress inference/provider refactor (agents.ts, provider.ts, new llama-providers.ts, removed llama-args-validator), plus arena, dispatcher, compaction, schema changes.

openspec: pty-exit-notifications complete; x-agent-flags planned (not yet implemented).

2026-06-14 12:48:47 +00:00


P4+P5 Audit: Combined Validation + Code Review

Date: 2026-06-12
Change: boocontrol
Phases: P4 (per-consumer attribution) + P5 (quality evals + sandbox)
Mode: Implementation (all 8 tasks checked)

Build/Test Gates

| Gate | Result |
| --- | --- |
| `pnpm -C apps/server build` | PASS |
| `pnpm -C apps/server test` | PASS (580 passed, 11 skipped, 51 files) |
| `pnpm -C apps/coder build` | PASS |
| `pnpm -C apps/coder test` | PASS (587 passed, 32 skipped, 51 files) |
| `pnpm -C apps/control build` | PASS |
| `pnpm -C apps/control test` | PASS (116 passed, 15 files) |
| `npx tsc -p apps/web/tsconfig.app.json --noEmit` | PASS |

Validation: boocontrol P4+P5 (implementation mode)

Verdict

PASS-WITH-FINDINGS -- all 8 tasks have implementing code; one design-specified behavior (judge temperature=0) is not implemented.

Traceability

| Task | Claim | Evidence | Status |
| --- | --- | --- | --- |
| P4.1 | X-Boo-Source on AI-SDK streaming path | `stream-phase-adapter.ts:309` passes `'boochat'` to `upstreamModel`; `provider.ts:19-44` `getSwapProvider` wraps fetch with header, cache keyed `baseURL\|\|source` | PASS |
| P4.1 | `includeUsage: true` preserved | `provider.ts:38` explicitly set on `createOpenAICompatible` | PASS |
| P4.1 | compaction.ts + task-model.ts headers | `compaction.ts:359` and `task-model.ts:27` both inject `X-Boo-Source: 'boochat'` in direct fetch headers | PASS |
| P4.2 | local-gateway.ts forwards x-boo-source | `local-gateway.ts:67` reads inbound header, defaults `'boocoder'`; `local-gateway.ts:76` forwards as `X-Boo-Source` | PASS |
| P4.2 | arena-model-call.ts source | `arena-model-call.ts:51` sets `X-Boo-Source: 'arena'` | PASS |
| P4.3 | control_requests.source migration | `schema.sql:48` `ALTER TABLE ADD COLUMN IF NOT EXISTS source TEXT` (idempotent); INSERT at `index.ts:182-183` includes source column; `index.ts:81` maps `source: null` for ring data (design S7 deviation documented) | PASS |
| P4.4 | Tests: header present + rows attribute | `pipeline.test.ts:248` asserts source=NULL for ring data; import/export tests for all three paths | PARTIAL |
| P5.1 | Suite format + YAML loading + DB schema | `eval-suites.ts:67-120` loads YAML from `data/`; `schema.sql:161-222` defines `eval_suites` (UNIQUE on name+version), `eval_runs`, `eval_results`; 4 YAML suite files present | PASS |
| P5.2 | Judge runner temperature=0 | `judge-runner.ts:239` scoreWithRubric uses `temperature: 0` (correct); `judge-runner.ts:182` generateResponse uses `temperature: 0.7` (NOT 0) | FAIL |
| P5.2 | Judge model+version pinned per run | `judge-runner.ts:59` constructs `judgeModelVersion` string; `eval_runs` table stores `judge_model` + `judge_model_version` | PASS |
| P5.2 | Rationale captured | `judge-runner.ts:97-98` stores rationale from scoreWithRubric | PASS |
| P5.2 | X-Boo-Source control-eval | `judge-runner.ts:177,237` both set `X-Boo-Source: 'control-eval'` | PASS |
| P5.3 | Sandbox hardening flags | `sandbox-runner.ts:258-273` docker args array: `--network none`, `--user 1000:1000`, `--memory`, `--cpus`, `--pids-limit`, `--tmpfs /workspace:rw,noexec,size=64m`, `--rm`, `--label boocontrol-eval`, `--security-opt no-new-privileges`, `--cap-drop ALL` | PASS |
| P5.3 | No volume mounts, no docker socket | Verified in docker args array at `sandbox-runner.ts:258-273` -- no `-v` or socket reference | PASS |
| P5.3 | Orphan prune at engine start | `sandbox-runner.ts:73` calls `pruneOrphanContainers()` at start of `runCodeEval` | PASS |
| P5.3 | Bounded concurrency + allSettled + finally cleanup | `sandbox-runner.ts:81-83` batch loop; `sandbox-runner.ts:86` `Promise.allSettled`; `sandbox-runner.ts:162-165` `finally` block with `cleanupContainer` | PASS |
| P5.3 | SANDBOX_TIMEOUT_MS type | `sandbox-runner.ts:37` typed as `number` but `process.env` is string -- `setTimeout` and `spawn` timeout receive string | ADVISORY |
| P5.4 | Leaderboard UI + scatter | `EvalsTab.tsx` renders scatter (`echarts.init` with `buildEChartsTheme`) + bar chart + run table + launcher | PASS |
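
The P4.1 row describes a fetch wrapper that injects the attribution header, cached per `baseURL||source`. The following is a minimal sketch of that pattern, not the code in `provider.ts`: the `FetchLike`, `Provider`, `withSource`, and `getProvider` names are hypothetical, and the real implementation passes the wrapped fetch to `createOpenAICompatible` rather than storing it on a plain object.

```typescript
// Hypothetical minimal types; the real code uses the AI SDK's fetch signature.
type HeaderMap = Record<string, string>;
type FetchInit = { method?: string; headers?: HeaderMap; body?: string };
type FetchLike = (url: string, init?: FetchInit) => Promise<unknown>;

interface Provider {
  baseURL: string;
  fetch: FetchLike;
}

// One provider per (baseURL, source) pair, so two consumers that share a
// backend never share a header-injecting wrapper.
const providerCache = new Map<string, Provider>();

function withSource(inner: FetchLike, source: string): FetchLike {
  return (url, init = {}) =>
    inner(url, {
      ...init,
      // Caller headers are kept; the attribution header is written last so it wins.
      headers: { ...(init.headers ?? {}), 'X-Boo-Source': source },
    });
}

function getProvider(baseURL: string, source: string, inner: FetchLike): Provider {
  const key = `${baseURL}||${source}`;
  let provider = providerCache.get(key);
  if (!provider) {
    provider = { baseURL, fetch: withSource(inner, source) };
    providerCache.set(key, provider);
  }
  return provider;
}
```

Keying the cache on the source as well as the URL is what keeps a `'boochat'` wrapper from being reused for an `'arena'` call against the same backend.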

Findings

Blocking

V1: judge-runner.ts generateResponse uses temperature 0.7 instead of 0

Location: apps/control/src/services/judge-runner.ts:182
Evidence: `body: JSON.stringify({ model, messages: [{ role: 'user', content: prompt }], temperature: 0.7, max_tokens: 2048 })` -- the `generateResponse` function (which generates the target model's response to be scored) uses temperature 0.7. The design at design.md:195 specifies "temperature 0, judge model+version pinned per run." The `scoreWithRubric` function at line 239 correctly uses `temperature: 0`, but the response generation step does not.
Impact: The target model's response is generated with non-deterministic sampling. For a reproducible eval framework this undermines the "temperature 0" claim in the task description. The judge scoring is deterministic (temp=0) but the input it scores is not.
Fix sketch: Change line 182 from temperature: 0.7 to temperature: 0.

Advisory

A1: sandbox-runner.ts SANDBOX_TIMEOUT_MS is string, not number

Location: apps/control/src/services/sandbox-runner.ts:37
Evidence: `const SANDBOX_TIMEOUT_MS = (process.env.SANDBOX_TIMEOUT_MS ?? '30000') as unknown as number;` -- `process.env` values are `string | undefined`. The `as unknown as number` cast silences tsc, but the runtime value is the string `'30000'`. This string flows to `spawn(..., { timeout: SANDBOX_TIMEOUT_MS })` at line 277 and `setTimeout(..., SANDBOX_TIMEOUT_MS)` at line 308. The `timeout` option of Node's `child_process.spawn` is typed `number | undefined`, and `setTimeout` coerces a numeric string at runtime. The timeout will likely work due to JS coercion, but the type lie masks future bugs (e.g. `SANDBOX_TIMEOUT_MS + 1000` would concatenate to the string `'300001000'` instead of adding).
Impact: Low immediate risk (JS coercion makes it work), but the incorrect type annotation prevents catching arithmetic bugs. SANDBOX_CONCURRENCY at line 38 has the same issue.
Fix sketch: const SANDBOX_TIMEOUT_MS = Number(process.env.SANDBOX_TIMEOUT_MS ?? '30000');
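
A slightly stricter variant of the fix sketch, shown only as an illustration: the `envInt` helper and `Env` type are hypothetical names, not part of `sandbox-runner.ts`. It parses once at module load and fails loudly on a malformed value, and the two constants at the end show what the cast hides at runtime.

```typescript
// Hypothetical helper; takes the env map as a parameter so it is easy to test.
type Env = Record<string, string | undefined>;

function envInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  // Unset or blank falls back to the default.
  if (raw === undefined || raw.trim() === '') return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

// What the `as unknown as number` cast hides: the value is still a string.
const lied = '30000' as unknown as number;
const viaPlus = lied + 1000;  // string concatenation at runtime: "300001000"
const viaMinus = lied - 1000; // numeric coercion at runtime: 29000
```

Under these assumptions the call site would read `const SANDBOX_TIMEOUT_MS = envInt(process.env, 'SANDBOX_TIMEOUT_MS', 30000);`, and the same helper would cover `SANDBOX_CONCURRENCY`.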

A2: judge-runner tests exercise imports, not judge logic

Location: apps/control/src/services/__tests__/judge-runner.test.ts
Evidence: Test 1 imports the module and checks typeof mod.runJudgeEval === 'function'. Test 2 calls runJudgeEval with a nonexistent provider and asserts the error message. Neither test exercises the actual judge request flow, rubric scoring, temperature setting, or rationale capture. The temperature=0.7 bug (V1) would not be caught by these tests.
Impact: Regressions in judge scoring logic, temperature, or X-Boo-Source injection would not be caught by the test suite.
Reopen trigger: Any bug where judge scoring produces wrong results or wrong temperature.
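
One possible shape for a test that would have caught V1, offered as a sketch: a recording fetch stub captures each outbound request, and a checker flags any request whose temperature is not 0 or whose `X-Boo-Source` is not `'control-eval'`. The `recordingFetch` and `findViolations` names are hypothetical, and how the stub reaches `runJudgeEval` (injected parameter or a stubbed global `fetch`) is an assumption this audit did not trace.

```typescript
type JudgeRequest = {
  url: string;
  headers: Record<string, string>;
  body: { model?: string; temperature?: number };
};
type FetchInit = { method?: string; headers?: Record<string, string>; body?: string };

// Records every request and answers with a canned chat-completion payload.
function recordingFetch() {
  const requests: JudgeRequest[] = [];
  const fetchImpl = async (url: string, init: FetchInit = {}) => {
    requests.push({ url, headers: init.headers ?? {}, body: JSON.parse(init.body ?? '{}') });
    return {
      ok: true,
      json: async () => ({
        choices: [{ message: { content: '{"score": 1, "rationale": "stub"}' } }],
      }),
    };
  };
  return { requests, fetchImpl };
}

// Returns one message per request that breaks the design's determinism or attribution rules.
function findViolations(requests: readonly JudgeRequest[]): string[] {
  const problems: string[] = [];
  requests.forEach((req, i) => {
    if (req.body.temperature !== 0) {
      problems.push(`request ${i}: temperature is ${String(req.body.temperature)}, expected 0`);
    }
    if (req.headers['X-Boo-Source'] !== 'control-eval') {
      problems.push(`request ${i}: X-Boo-Source is not 'control-eval'`);
    }
  });
  return problems;
}
```

A test would run one judge evaluation against `fetchImpl` and assert that `findViolations(requests)` is empty; with the current `generateResponse` body it would report the first request.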

A3: sandbox-runner tests exercise Promise patterns, not Docker flags

Location: apps/control/src/services/__tests__/sandbox-runner.test.ts
Evidence: Tests verify runCodeEval is importable, that Promise.allSettled isolates failures, and that SIGKILL works. None of the tests verify the actual Docker arguments (security flags, label, resource caps), orphan pruning, or container cleanup. The test at line 19 (bounded fan-out) reimplements the pattern inline rather than calling runCodeEval.
Impact: A regression in the Docker security flags (e.g. removing --cap-drop ALL) would pass all existing tests.
Reopen trigger: Any sandbox escape or flag regression.
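
A sketch of the kind of assertion that would close this gap, assuming the docker args array can be exposed to the test (for example by extracting it into a builder function, which the current code may not do). The `auditDockerArgs` and `hasFlag` names are hypothetical; the required flags are the ones listed in the traceability table.

```typescript
// Flags with a fixed value, taken from the traceability table for sandbox-runner.ts:258-273.
const REQUIRED_PAIRS: ReadonlyArray<readonly [string, string]> = [
  ['--network', 'none'],
  ['--user', '1000:1000'],
  ['--label', 'boocontrol-eval'],
  ['--security-opt', 'no-new-privileges'],
  ['--cap-drop', 'ALL'],
  ['--tmpfs', '/workspace:rw,noexec,size=64m'],
];
// Flags whose value is configurable; only their presence is checked.
const REQUIRED_PRESENT = ['--rm', '--memory', '--cpus', '--pids-limit'];
// Flags that would reintroduce host access.
const FORBIDDEN = ['-v', '--volume', '--mount', '--privileged'];

// Accepts both "--flag value" and "--flag=value" spellings.
function hasFlag(args: readonly string[], flag: string, value?: string): boolean {
  if (value === undefined) {
    return args.some((arg) => arg === flag || arg.startsWith(`${flag}=`));
  }
  return args.some(
    (arg, i) => arg === `${flag}=${value}` || (arg === flag && args[i + 1] === value),
  );
}

// Returns a list of problems; an empty list means the args satisfy the hardening checklist.
function auditDockerArgs(args: readonly string[]): string[] {
  const problems: string[] = [];
  for (const [flag, value] of REQUIRED_PAIRS) {
    if (!hasFlag(args, flag, value)) problems.push(`missing ${flag} ${value}`);
  }
  for (const flag of REQUIRED_PRESENT) {
    if (!hasFlag(args, flag)) problems.push(`missing ${flag}`);
  }
  for (const flag of FORBIDDEN) {
    if (hasFlag(args, flag)) problems.push(`forbidden flag ${flag}`);
  }
  if (args.some((arg) => arg.includes('docker.sock'))) problems.push('docker socket referenced');
  return problems;
}
```

A test would then assert that `auditDockerArgs(args)` returns an empty array for the args the runner actually builds, so removing `--cap-drop ALL` or adding a mount fails the suite.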

A4: arena dispatch sites not fully traced

Location: apps/coder/src/services/arena-model-call.ts:51
Evidence: arenaModelCall sets X-Boo-Source: 'arena'. However, the full arena dispatch chain (battle start, contestant model calls, cross-examination) was not traced end-to-end. The direct arenaModelCall path is verified; whether all arena sub-calls route through this function rather than making their own fetches was not checked.
Impact: Low -- if arena uses arenaModelCall for all model calls, attribution is correct. If any arena path makes a direct fetch without X-Boo-Source, those requests would show as NULL in the activity feed.
Reopen trigger: Arena requests showing as NULL in activity feed despite having a source.

Claims I did not verify

Whether the includeUsage: true survives AI-SDK v6's internal handling (this was verified in prior P1 review -- load-bearing per apps/server/CLAUDE.md)
Whether the sql.json(value as never) pattern in eval-suites.ts:170 correctly serializes the tasks array as JSONB (pattern is established and used elsewhere in the codebase)
Whether the ECharts bundle tree-shaking works correctly in the production build (the echarts/core + per-chart imports pattern is established from P1)
Whether the eval_runs.judge_model_version column is actually populated at run creation time (the createEvalRun function at eval-suites.ts:258 receives judgeModelVersion as a parameter; whether callers pass it was not traced)
Whether the leaderboard API endpoint exists and returns the correct shape (the frontend fetches from /api/control/eval/leaderboard; the backend route handler was not traced)

Review: boocontrol P4+P5

Scope

apps/server/src/services/inference/provider.ts, apps/server/src/services/inference/stream-phase-adapter.ts, apps/server/src/services/compaction.ts, apps/server/src/services/task-model.ts, apps/coder/src/services/local-gateway.ts, apps/coder/src/services/arena-model-call.ts, apps/control/src/services/judge-runner.ts, apps/control/src/services/sandbox-runner.ts, apps/control/src/services/eval-suites.ts, apps/control/src/schema.sql, apps/web/src/components/control/EvalsTab.tsx, apps/web/src/pages/Control.tsx, P4+P5 tests.

Size

Large -- 12 source files across 3 apps + contracts, touches inference streaming path, SSE ingestion, Docker container spawning, DB schema, and ECharts UI.

Summary

P4 (attribution) is correctly implemented end-to-end. All three paths (server streaming, coder gateway, arena) inject the correct X-Boo-Source header. The migration is idempotent and NULL-for-ring-data is documented. P5 (evals) has correct schema, YAML loading, and UI wiring, but the judge runner's response generation temperature (0.7) contradicts the design spec (0). Sandbox hardening is thorough.

| Classification | Count |
| --- | --- |
| Blocking | 1 |
| Advisory | 4 |
| Nit | 1 |

Findings

Blocking

B1: Judge response generation temperature is 0.7, not 0

Location: apps/control/src/services/judge-runner.ts:182
Evidence: temperature: 0.7 in the generateResponse request body. The design at design.md:195 specifies "temperature 0, judge model+version pinned per run." The scoreWithRubric function at line 239 correctly uses temperature: 0.
Standard violated: Design spec S8 ("temperature 0, judge model+version pinned per run").
Risk: Non-deterministic eval inputs undermine reproducibility claims. A reviewer or auditor checking the design vs code will find this discrepancy.
Fix sketch: temperature: 0 on line 182.

Advisory

A1: SANDBOX_TIMEOUT_MS type mismatch

Location: apps/control/src/services/sandbox-runner.ts:37
Evidence: as unknown as number cast on a string from process.env. Works at runtime due to JS coercion, but the type lie prevents catching arithmetic bugs.
YAGNI gate: No known incident. Defer unless the sandbox timeout needs arithmetic (e.g. grace period).

A2: Judge tests do not exercise scoring logic

Location: apps/control/src/services/__tests__/judge-runner.test.ts
Evidence: Tests check import and error-on-bad-provider only. Rubric scoring, temperature, X-Boo-Source injection, and rationale capture are untested.
YAGNI gate: No known scoring bug. Defer until judge scoring produces real evals.

A3: Sandbox tests do not verify Docker flags

Location: apps/control/src/services/__tests__/sandbox-runner.test.ts
Evidence: Tests exercise Promise.allSettled and SIGKILL patterns, not the actual Docker args construction. Security flags (network, caps, user, label) are untested.
YAGNI gate: No known sandbox escape. Defer until sandbox runner processes untrusted code.

A4: Arena dispatch chain not fully traced

Location: apps/coder/src/services/arena-model-call.ts:51
Evidence: arenaModelCall sets X-Boo-Source: 'arena'. Whether all arena sub-calls (battle start, cross-examination) route through this function rather than making direct fetches was not verified.
YAGNI gate: No known arena attribution bug. Defer until arena requests show NULL source.

Nits

N1: eval_suites UNIQUE on (name, version) uses ON CONFLICT DO NOTHING in seed, but upsertEvalSuite uses ON CONFLICT DO UPDATE

Location: apps/control/src/services/eval-suites.ts:175 vs eval-suites.ts:230
Evidence: seedEvalSuites uses ON CONFLICT (id) DO NOTHING (by primary key). upsertEvalSuite uses ON CONFLICT (id) DO UPDATE. The schema also has UNIQUE (name, version) at schema.sql:170, which is not the conflict target in either function. If a second suite with a different id shares a name+version, the insert fails with a unique violation instead of being skipped or updated, because ON CONFLICT (id) only handles conflicts on id. Rejecting the duplicate is the correct behavior (versioning is explicit), but the UNIQUE constraint and the ON CONFLICT target differ.
Note: Style -- not a bug.

Verdict

APPROVE-WITH-NITS

One blocking finding (B1: judge temperature 0.7 should be 0). Four advisory findings deferred per YAGNI gates. One nit on UNIQUE constraint targeting.

Claims I did not verify

Whether the AI-SDK createOpenAICompatible internal fetch wrapper correctly merges the custom fetch headers (established pattern from P1, not re-verified)
Whether the eval_runs.judge_model_version column is populated by callers of createEvalRun (the function accepts it; caller trace was not performed)
Whether the leaderboard API backend route exists and returns the correct shape
Whether the ECharts tree-shaking in EvalsTab.tsx produces correct bundle sizes
Whether arena battle start / cross-examination model calls all go through arenaModelCall
Whether the control_requests INSERT at index.ts:258 (the non-reconcile path) also correctly sets source: null
