
indifferentketchup d6d246c15b feat(web,coder): arena pane — compare 2-6 AI competitors on same prompt

Arena is a new pane kind for competitive AI evaluation. A Battle runs
the same prompt against 2-6 Contestants across two concurrent lanes:
local lane (llama-swap models, serial) and cloud lane (parallel).

Added to all three registries: @boocode/contracts WsFrameSchema,
server InferenceFrame, and web WsFrame.

Backend (apps/coder):
- arena-runner: battle scheduler, lane classifier, benchmark, results
  writer, resume, user winner override
- arena-analyzer: two-stage digest→judge analysis on DEFAULT_MODEL
- arena-decisions: status transitions and resume logic (unit-tested)
- arena-analyzer-helpers: pure helper functions (unit-tested)
- arena-model-call: model call utility for analysis
- arena routes: create/get/list/stop/analyze/cross-examine/winner/diff
- schema: battles, contestants, cross_examinations tables (idempotent)
- remove old /api/arena* routes and tasks.arena_id column

Frontend (apps/web):
- ArenaLauncherDialog: battle type, prompt, contestant selection
- ArenaPane: live roster, streaming output, analysis, cross-exam
- DiffView: unified diff with line-by-line color for coding contests
- Winner override per-row dropdown (Trophy icon)
- battle_updated WS handler for live winner/analysis updates
- arena pane kind in Workspace, ChatTabBar, useSidebar

Cross-app:
- ArenaState and ArenaContestantShape/WsFrame types (contracts)
- battle_* frames in WsFrameSchema, InferenceFrame, and web WsFrame
- manifest.json written per battle results folder
- /Arena added to .gitignore

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-06 23:25:29 +00:00


Arena schedules contestants in a local lane (serial) and a cloud lane (parallel)

A Battle runs the same prompt against 2–6 Contestants. The local llama-swap server can hold only one model in memory at a time, so llama-swap-backed Contestants go into a local lane and run strictly one at a time. Cloud-backed Contestants (Claude Code, OpenCode-on-cloud) all run in parallel in a cloud lane. The two lanes run concurrently. We chose this over running everything serially (too slow for cloud) or everything in parallel (impossible for local, and it would corrupt the speed Benchmark). The single-model constraint is physical, and the serial local lane also gives each local model an uncontended, fair tokens/sec measurement.
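The two-lane scheduling described above can be sketched roughly as follows. This is a minimal illustration, not the actual arena-runner code: the `Contestant` shape, the `RunFn` callback, and the `runBattle` name are hypothetical, and error handling, cancellation, and streaming are left out.

```typescript
// Hypothetical shapes for illustration; not the real arena-runner types.
type Lane = "local" | "cloud";

interface Contestant {
  id: string;
  lane: Lane;
}

// Assumed callback: runs one contestant to completion and returns its output.
type RunFn = (c: Contestant) => Promise<string>;

async function runBattle(
  contestants: Contestant[],
  run: RunFn,
): Promise<Map<string, string>> {
  const results = new Map<string, string>();
  const local = contestants.filter((c) => c.lane === "local");
  const cloud = contestants.filter((c) => c.lane === "cloud");

  // Local lane: strictly one at a time, because only one model
  // fits in memory on the llama-swap server.
  const localLane = (async () => {
    for (const c of local) {
      results.set(c.id, await run(c));
    }
  })();

  // Cloud lane: start every cloud contestant at once.
  const cloudLane = Promise.all(
    cloud.map(async (c) => {
      results.set(c.id, await run(c));
    }),
  );

  // The two lanes run concurrently; the battle ends when both are done.
  await Promise.all([localLane, cloudLane]);
  return results;
}
```

Starting the local lane as its own promise before awaiting anything is what lets both lanes make progress at the same time, while the `for` loop with `await` keeps local contestants from overlapping.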

Consequences

A Battle's wall-clock time is roughly max(slowest cloud contestant, sum of local contestants). Deep local lanes (especially all-local Q&A battles) are slow by design; the launcher warns when the local lane is deep.
The speed Benchmark (tokens/sec) is only meaningful for local-lane Contestants, which is acceptable since external CLI agents don't report token usage anyway.
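As a worked example of the wall-clock estimate above, the sketch below computes max(slowest cloud contestant, sum of local contestants) from per-contestant duration guesses. The `LaneEstimate` type and `estimateWallClock` function are hypothetical names, and the durations are made-up inputs, not measurements.

```typescript
// Hypothetical input: an expected duration in seconds for each contestant.
interface LaneEstimate {
  lane: "local" | "cloud";
  seconds: number;
}

// Rough estimate only: ignores model swap time and scheduling overhead.
function estimateWallClock(contestants: LaneEstimate[]): number {
  let localSum = 0; // local contestants run back to back
  let slowestCloud = 0; // cloud contestants overlap, so only the slowest counts
  for (const c of contestants) {
    if (c.lane === "local") {
      localSum += c.seconds;
    } else {
      slowestCloud = Math.max(slowestCloud, c.seconds);
    }
  }
  return Math.max(slowestCloud, localSum);
}

// Three local models (40 s, 55 s, 30 s) and two cloud agents (90 s, 70 s):
// the local lane totals 125 s and dominates the 90 s cloud lane.
const estimate = estimateWallClock([
  { lane: "local", seconds: 40 },
  { lane: "local", seconds: 55 },
  { lane: "local", seconds: 30 },
  { lane: "cloud", seconds: 90 },
  { lane: "cloud", seconds: 70 },
]);
console.log(estimate); // 125
```

A launcher could, under the same assumptions, compare the local sum against some threshold to decide when to show the deep-lane warning.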
