feat(web,coder): arena pane — compare 2-6 AI competitors on same prompt

Arena is a new pane kind for competitive AI evaluation. A Battle runs
the same prompt against 2-6 Contestants across two concurrent lanes: a
local lane (llama-swap models, serial) and a cloud lane (parallel).

Added to all three registries: @boocode/contracts WsFrameSchema,
server InferenceFrame, and web WsFrame.

Backend (apps/coder):
- arena-runner: battle scheduler, lane classifier, benchmark, results
  writer, resume, user winner override
- arena-analyzer: two-stage digest→judge analysis on DEFAULT_MODEL
- arena-decisions: status transitions and resume logic (unit-tested)
- arena-analyzer-helpers: pure helper functions (unit-tested)
- arena-model-call: model call utility for analysis
- arena routes: create/get/list/stop/analyze/cross-examine/winner/diff
- schema: battles, contestants, cross_examinations tables (idempotent)
- remove old /api/arena* routes and tasks.arena_id column

Frontend (apps/web):
- ArenaLauncherDialog: battle type, prompt, contestant selection
- ArenaPane: live roster, streaming output, analysis, cross-exam
- DiffView: unified diff with line-by-line color for coding contests
- Winner override per-row dropdown (Trophy icon)
- battle_updated WS handler for live winner/analysis updates
- arena pane kind in Workspace, ChatTabBar, useSidebar

Cross-app:
- ArenaState and ArenaContestantShape/WsFrame types (contracts)
- battle_* frames in WsFrameSchema, InferenceFrame, and web WsFrame
- manifest.json written per battle results folder
- /Arena added to .gitignore

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 23:25:29 +00:00
parent e04d0fdaa8
commit d6d246c15b
34 changed files with 4581 additions and 146 deletions

@@ -0,0 +1,19 @@
# Arena schedules contestants in a local lane (serial) and a cloud lane (parallel)
A Battle runs the same prompt against 2-6 Contestants. The local llama-swap
server can only hold one model in memory at a time, so llama-swap-backed
Contestants are placed in a **local lane** and run strictly one at a time, while
cloud-backed Contestants (Claude Code, OpenCode-on-cloud) all run in parallel in
a **cloud lane**; the two lanes run concurrently. We chose this over running
everything serially (too slow for cloud) or everything in parallel (impossible
for local, and it would corrupt the speed Benchmark) because the single-model
constraint is physical and the serial local lane also gives each local model an
uncontended, fair tokens/sec measurement.
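
The two-lane scheduling described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual `arena-runner` code: the `Contestant` shape, the `Lane` type, and the functions `runLocalLane`, `runCloudLane`, and `runBattle` are hypothetical names introduced here, and error handling (a single failing contestant rejects the whole battle in this sketch) is deliberately left out.

```typescript
// Hypothetical types: the real arena-runner shapes are not shown in this document.
type Lane = "local" | "cloud";

interface Contestant {
  id: string;
  lane: Lane;
  run: () => Promise<string>; // resolves with the contestant's output
}

type Results = Record<string, string>;

// Local lane: strictly one contestant at a time, because only one
// model can be resident in the local server at once.
async function runLocalLane(contestants: Contestant[]): Promise<Results> {
  const results: Results = {};
  for (const c of contestants) {
    results[c.id] = await c.run(); // the next model starts only after this one ends
  }
  return results;
}

// Cloud lane: start every contestant immediately and wait for all of them.
async function runCloudLane(contestants: Contestant[]): Promise<Results> {
  const pairs = await Promise.all(
    contestants.map(async (c) => [c.id, await c.run()] as const),
  );
  const results: Results = {};
  for (const [id, output] of pairs) {
    results[id] = output;
  }
  return results;
}

// The two lanes themselves run concurrently with each other.
async function runBattle(contestants: Contestant[]): Promise<Results> {
  const local = contestants.filter((c) => c.lane === "local");
  const cloud = contestants.filter((c) => c.lane === "cloud");
  const [localResults, cloudResults] = await Promise.all([
    runLocalLane(local),
    runCloudLane(cloud),
  ]);
  return { ...localResults, ...cloudResults };
}
```

Keeping the serial loop inside the local lane, rather than throttling a shared pool, is what keeps each local model's tokens/sec measurement uncontended.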
## Consequences
- A Battle's wall-clock time is roughly `max(slowest cloud contestant, sum of local
contestants)`. Deep local lanes (especially all-local Q&A battles) are slow by
design; the launcher warns when the local lane is deep.
- The speed Benchmark (tokens/sec) is only meaningful for local-lane Contestants,
which is acceptable since external CLI agents don't report token usage anyway.
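
As a worked illustration of the wall-clock estimate above, here is a small sketch. The function name `estimateWallClockMs` and the per-contestant durations are hypothetical and chosen only for the arithmetic; they are not measurements from the project.

```typescript
// Hypothetical helper: estimates a Battle's wall-clock time from
// per-contestant durations, following max(slowest cloud, sum of local).
function estimateWallClockMs(localMs: number[], cloudMs: number[]): number {
  // The local lane is serial, so its durations add up.
  const localTotal = localMs.reduce((sum, ms) => sum + ms, 0);
  // The cloud lane is parallel, so only its slowest contestant matters.
  const slowestCloud = cloudMs.reduce((max, ms) => Math.max(max, ms), 0);
  // The two lanes run concurrently, so the longer lane dominates.
  return Math.max(slowestCloud, localTotal);
}

// Three local models (40 s, 55 s, 30 s) and two cloud agents (90 s, 70 s):
// the local lane takes 125 s in total, the cloud lane 90 s.
const estimate = estimateWallClockMs([40_000, 55_000, 30_000], [90_000, 70_000]);
console.log(estimate); // 125000
```

In this made-up example the local lane dominates even though every individual local run is shorter than the slowest cloud run, which is the "deep local lane" case the launcher warns about.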

@@ -0,0 +1,22 @@
# Arena gets dedicated battles/contestants tables and replaces the old API-only arena
The Arena feature reuses the dispatcher, the `onTaskTerminal` advance hook, the
streaming→WS-frame pipeline, and the pane pattern from the Orchestrator, but
persists to its **own `battles` + `contestants` tables** rather than the
Orchestrator's `flow_runs`/`flow_steps`. A Battle is not shaped like a flow — it
has two scheduling lanes, per-contestant benchmarks, on-disk results folders, a
two-stage analysis, and cross-examinations — so modelling it as flow steps would
fight the schema. Each Contestant links to a real `tasks` row via `task_id`,
inheriting all worktree/streaming/dispatch machinery. This also **replaces the
earlier v2.0.5 API-only arena** (`POST /api/arena`, `tasks.arena_id`,
select-winner): that feature had no UI and no users, and the new Arena is a
strict superset, so the old routes and the `tasks.arena_id` column are removed
rather than left as a second, competing "arena" concept.
## Consequences
- Analysis and cross-examination run through a small pluggable **Analyzer** seam
(v1 = default-model two-stage judge). A v2 that drives a Han Orchestrator flow
as the analyzer slots in behind that seam without a schema change.
- The `arena` pane kind, `ArenaState`, and `battle_*` WS frames are added
alongside (not folded into) the Orchestrator's, mirroring its patterns.
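
One way to picture the Analyzer seam mentioned in the first consequence is as a single interface with the v1 two-stage (digest, then judge) analyzer as one implementation. The sketch below is an assumption-laden illustration, not the actual `arena-analyzer` API: `Analyzer`, `ContestantOutput`, `Verdict`, `ModelCall`, `createTwoStageAnalyzer`, the prompt strings, and the "first line of the reply is the winner id" convention are all hypothetical.

```typescript
// All names below are hypothetical; the real arena-analyzer API is not shown here.
interface ContestantOutput {
  contestantId: string;
  output: string;
}

interface Verdict {
  winnerId: string | null; // null when the judge's reply names no known contestant
  rationale: string;
}

// The seam: anything that can turn contestant outputs into a verdict.
interface Analyzer {
  analyze(prompt: string, outputs: ContestantOutput[]): Promise<Verdict>;
}

// Stand-in for a call to the default model.
type ModelCall = (input: string) => Promise<string>;

function createTwoStageAnalyzer(callModel: ModelCall): Analyzer {
  return {
    async analyze(prompt, outputs) {
      // Stage 1 (digest): condense each contestant's output separately.
      const digests = await Promise.all(
        outputs.map(async (o) => ({
          contestantId: o.contestantId,
          digest: await callModel(`DIGEST\n${o.output}`),
        })),
      );

      // Stage 2 (judge): compare the digests in a single call.
      const listing = digests
        .map((d) => `${d.contestantId}: ${d.digest}`)
        .join("\n");
      const reply = await callModel(`JUDGE\nPrompt: ${prompt}\n${listing}`);

      // Assumed reply convention: first line is the winner id, the rest is the rationale.
      const [firstLine = "", ...rest] = reply.split("\n");
      const candidate = firstLine.trim();
      const known = outputs.some((o) => o.contestantId === candidate);
      return {
        winnerId: known ? candidate : null,
        rationale: rest.join("\n").trim(),
      };
    },
  };
}
```

Under this shape, a later analyzer that drives an Orchestrator flow would only need to provide another `Analyzer` implementation; callers and stored results would not change.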