Arena is a new pane kind for competitive AI evaluation. A Battle runs the same prompt against 2-6 Contestants across two concurrent lanes: local lane (llama-swap models, serial) and cloud lane (parallel). Added to all three registries: @boocode/contracts WsFrameSchema, server InferenceFrame, and web WsFrame.

Backend (apps/coder):
- arena-runner: battle scheduler, lane classifier, benchmark, results writer, resume, user winner override
- arena-analyzer: two-stage digest→judge analysis on DEFAULT_MODEL
- arena-decisions: status transitions and resume logic (unit-tested)
- arena-analyzer-helpers: pure helper functions (unit-tested)
- arena-model-call: model call utility for analysis
- arena routes: create/get/list/stop/analyze/cross-examine/winner/diff
- schema: battles, contestants, cross_examinations tables (idempotent)
- remove old /api/arena* routes and tasks.arena_id column

Frontend (apps/web):
- ArenaLauncherDialog: battle type, prompt, contestant selection
- ArenaPane: live roster, streaming output, analysis, cross-exam
- DiffView: unified diff with line-by-line color for coding contests
- Winner override per-row dropdown (Trophy icon)
- battle_updated WS handler for live winner/analysis updates
- arena pane kind in Workspace, ChatTabBar, useSidebar

Cross-app:
- ArenaState and ArenaContestantShape/WsFrame types (contracts)
- battle_* frames in WsFrameSchema, InferenceFrame, and web WsFrame
- manifest.json written per battle results folder
- /Arena added to .gitignore

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Arena schedules contestants in a local lane (serial) and a cloud lane (parallel)
A Battle runs the same prompt against 2–6 Contestants. The local llama-swap server can hold only one model in memory at a time, so llama-swap-backed Contestants are placed in a local lane and run strictly one at a time, while cloud-backed Contestants (Claude Code, OpenCode-on-cloud) all run in parallel in a cloud lane; the two lanes run concurrently. We chose this over running everything serially (too slow for cloud) or everything in parallel (impossible for local, and it would corrupt the speed Benchmark) for two reasons: the single-model constraint is physical, and the serial local lane gives each local model an uncontended, fair tokens/sec measurement.
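The two-lane scheduling described above can be sketched roughly as follows. This is a minimal illustration, not the actual arena-runner code: the `Contestant` shape, the backend labels, and the `classifyLane` and `runBattle` names are all hypothetical, and error handling is reduced to letting the first failure reject the whole battle.

```typescript
// Hypothetical types for illustration; not the real arena-runner API.
type Lane = "local" | "cloud";

interface Contestant {
  id: string;
  // Assumed backend labels; the real classifier may use different fields.
  backend: "llama-swap" | "claude-code" | "opencode-cloud";
}

// llama-swap holds one model in memory at a time, so its Contestants
// go to the serial local lane; everything else goes to the cloud lane.
function classifyLane(c: Contestant): Lane {
  return c.backend === "llama-swap" ? "local" : "cloud";
}

async function runBattle<T>(
  contestants: Contestant[],
  run: (c: Contestant) => Promise<T>,
): Promise<Map<string, T>> {
  const results = new Map<string, T>();
  const local = contestants.filter((c) => classifyLane(c) === "local");
  const cloud = contestants.filter((c) => classifyLane(c) === "cloud");

  // Local lane: strictly one Contestant at a time.
  const localLane = (async () => {
    for (const c of local) {
      results.set(c.id, await run(c));
    }
  })();

  // Cloud lane: every Contestant starts immediately.
  const cloudLane = Promise.all(
    cloud.map(async (c) => {
      results.set(c.id, await run(c));
    }),
  );

  // The two lanes progress concurrently; wait for both to drain.
  await Promise.all([localLane, cloudLane]);
  return results;
}
```

Starting both lanes before awaiting either is what keeps a slow local queue from delaying the cloud Contestants, and the `for ... await` loop is what keeps two local models from ever being loaded at once.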
Consequences
- A Battle's wall-clock is roughly `max(slowest cloud contestant, sum of local contestants)`. Deep local lanes (especially all-local Q&A battles) are slow by design; the launcher warns when the local lane is deep.
- The speed Benchmark (tokens/sec) is only meaningful for local-lane Contestants, which is acceptable since external CLI agents don't report token usage anyway.
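As a worked example of the wall-clock estimate, the sketch below computes the longer of the two lanes from per-Contestant durations. The function name `estimateWallClockMs` and the sample durations are made up for illustration; real durations are not known ahead of a Battle, so this would only ever be a rough guide.

```typescript
// Hypothetical helper: rough Battle duration from per-Contestant durations (ms).
function estimateWallClockMs(localMs: number[], cloudMs: number[]): number {
  // Local lane is serial, so its durations add up.
  const localLane = localMs.reduce((sum, ms) => sum + ms, 0);
  // Cloud lane is parallel, so only the slowest Contestant matters.
  const cloudLane = cloudMs.length > 0 ? Math.max(...cloudMs) : 0;
  // The lanes run concurrently, so the Battle lasts as long as the longer lane.
  return Math.max(localLane, cloudLane);
}

// Three local models (40 s, 55 s, 30 s) and two cloud agents (90 s, 20 s):
// local lane = 125 s, cloud lane = 90 s, so the local lane dominates.
const estimate = estimateWallClockMs([40_000, 55_000, 30_000], [90_000, 20_000]);
console.log(estimate); // 125000
```

With a single local model instead of three, the same cloud agents would dominate and the estimate would fall to 90 s, which is why deep local lanes are the case worth warning about.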