feat(web,coder): arena pane — compare 2-6 AI competitors on same prompt
Arena is a new pane kind for competitive AI evaluation. A Battle runs the
same prompt against 2-6 Contestants across two concurrent lanes: local lane
(llama-swap models, serial) and cloud lane (parallel). Added to all three
registries: @boocode/contracts WsFrameSchema, server InferenceFrame, and web
WsFrame.

Backend (apps/coder):
- arena-runner: battle scheduler, lane classifier, benchmark, results writer,
  resume, user winner override
- arena-analyzer: two-stage digest→judge analysis on DEFAULT_MODEL
- arena-decisions: status transitions and resume logic (unit-tested)
- arena-analyzer-helpers: pure helper functions (unit-tested)
- arena-model-call: model call utility for analysis
- arena routes: create/get/list/stop/analyze/cross-examine/winner/diff
- schema: battles, contestants, cross_examinations tables (idempotent)
- remove old /api/arena* routes and tasks.arena_id column

Frontend (apps/web):
- ArenaLauncherDialog: battle type, prompt, contestant selection
- ArenaPane: live roster, streaming output, analysis, cross-exam
- DiffView: unified diff with line-by-line color for coding contests
- Winner override per-row dropdown (Trophy icon)
- battle_updated WS handler for live winner/analysis updates
- arena pane kind in Workspace, ChatTabBar, useSidebar

Cross-app:
- ArenaState and ArenaContestantShape/WsFrame types (contracts)
- battle_* frames in WsFrameSchema, InferenceFrame, and web WsFrame
- manifest.json written per battle results folder
- /Arena added to .gitignore

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
19
docs/adr/0001-arena-two-lane-scheduling.md
Normal file
@@ -0,0 +1,19 @@
# Arena schedules contestants in a local lane (serial) and a cloud lane (parallel)

A Battle runs the same prompt against 2–6 Contestants. The local llama-swap
server can only hold one model in memory at a time, so llama-swap-backed
Contestants are placed in a **local lane** and run strictly one at a time, while
cloud-backed Contestants (Claude Code, OpenCode-on-cloud) run all in parallel in
a **cloud lane**; the two lanes run concurrently. We chose this over running
everything serially (too slow for cloud) or everything in parallel (impossible
for local, and it would corrupt the speed Benchmark) because the single-model
constraint is physical and the serial local lane also gives each local model an
uncontended, fair tokens/sec measurement.
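
The scheduling rule above can be sketched in a few lines. This is a minimal illustration, not the actual `arena-runner` code: the `Contestant` shape, its `backend` discriminator, and the function names `classifyLane`, `runLocalLane`, `runCloudLane`, and `runBattle` are all assumptions made for the example.

```typescript
// Hypothetical shapes: only the two lanes and their ordering rules come from the ADR.
type Lane = "local" | "cloud";

interface Contestant {
  id: string;
  backend: "llama-swap" | "cloud"; // assumed discriminator, not the real schema
}

type RunFn<T> = (c: Contestant) => Promise<T>;

// llama-swap-backed Contestants go to the local lane, everything else to cloud.
function classifyLane(c: Contestant): Lane {
  return c.backend === "llama-swap" ? "local" : "cloud";
}

// Local lane: strictly one at a time, because only one model fits in memory.
async function runLocalLane<T>(cs: Contestant[], run: RunFn<T>): Promise<T[]> {
  const results: T[] = [];
  for (const c of cs) {
    results.push(await run(c));
  }
  return results;
}

// Cloud lane: all Contestants start together.
function runCloudLane<T>(cs: Contestant[], run: RunFn<T>): Promise<T[]> {
  return Promise.all(cs.map((c) => run(c)));
}

// The two lanes themselves run concurrently.
async function runBattle<T>(
  contestants: Contestant[],
  run: RunFn<T>,
): Promise<{ local: T[]; cloud: T[] }> {
  if (contestants.length < 2 || contestants.length > 6) {
    throw new Error("A Battle needs 2-6 Contestants");
  }
  const localLane = contestants.filter((c) => classifyLane(c) === "local");
  const cloudLane = contestants.filter((c) => classifyLane(c) === "cloud");
  const [local, cloud] = await Promise.all([
    runLocalLane(localLane, run),
    runCloudLane(cloudLane, run),
  ]);
  return { local, cloud };
}
```

Note that `Promise.all` rejects as soon as any Contestant fails; a real runner would more likely record per-contestant failures and let the rest of the Battle finish.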
## Consequences

- A Battle's wall-clock is roughly `max(slowest cloud contestant, sum of local
  contestants)`. Deep local lanes (especially all-local Q&A battles) are slow by
  design; the launcher warns when the local lane is deep.
- The speed Benchmark (tokens/sec) is only meaningful for local-lane Contestants,
  which is acceptable since external CLI agents don't report token usage anyway.
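
As a worked example of the wall-clock estimate in the first consequence, the sketch below computes `max(slowest cloud, sum of local)` from per-contestant durations. The function names and the `DEEP_LOCAL_LANE` threshold are hypothetical; the ADR does not say at what depth the launcher warns.

```typescript
// Rough Battle wall-clock: the local lane is serial (durations add up),
// the cloud lane is parallel (only the slowest Contestant matters),
// and the two lanes overlap.
function estimateWallClockMs(localMs: number[], cloudMs: number[]): number {
  const localTotal = localMs.reduce((sum, ms) => sum + ms, 0);
  const slowestCloud = cloudMs.reduce((max, ms) => Math.max(max, ms), 0);
  return Math.max(slowestCloud, localTotal);
}

// Assumed threshold for illustration only; the real launcher's rule is not stated.
const DEEP_LOCAL_LANE = 3;

function isLocalLaneDeep(localCount: number, threshold: number = DEEP_LOCAL_LANE): boolean {
  return localCount >= threshold;
}

// Two local models at 60 s and 90 s, two cloud agents at 40 s and 120 s:
// the local lane (150 s) dominates the slowest cloud Contestant (120 s).
const estimate = estimateWallClockMs([60_000, 90_000], [40_000, 120_000]);
console.log(estimate); // 150000
```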
22
docs/adr/0002-arena-dedicated-tables-not-flow-runner.md
Normal file
@@ -0,0 +1,22 @@
# Arena gets dedicated battles/contestants tables and replaces the old API-only arena

The Arena feature reuses the dispatcher, the `onTaskTerminal` advance hook, the
streaming→WS-frame pipeline, and the pane pattern from the Orchestrator, but
persists to its **own `battles` + `contestants` tables** rather than the
Orchestrator's `flow_runs`/`flow_steps`. A Battle is not shaped like a flow: it
has two scheduling lanes, per-contestant benchmarks, on-disk results folders, a
two-stage analysis, and cross-examinations, so modelling it as flow steps would
fight the schema. Each Contestant links to a real `tasks` row via `task_id`,
inheriting all worktree/streaming/dispatch machinery. This also **replaces the
earlier v2.0.5 API-only arena** (`POST /api/arena`, `tasks.arena_id`,
select-winner): that feature had no UI and no users, and the new Arena is a
strict superset, so the old routes and the `tasks.arena_id` column are removed
rather than left as a second, competing "arena" concept.
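
To make the `task_id` link concrete, here is a small sketch of how a Battle's Contestants might resolve to their `tasks` rows. Only the table names and the `task_id` column come from this ADR; every other field, and the `tasksForBattle` helper, is hypothetical and does not reflect the real schema.

```typescript
// Hypothetical row shapes. Only `task_id` (contestants -> tasks) is from the ADR.
interface ContestantRow {
  id: string;
  battle_id: string; // assumed foreign key to battles
  task_id: string;   // links to a real tasks row
}

interface TaskRow {
  id: string;
  status: string; // assumed column
}

// Resolve every Contestant of one Battle to the task that actually runs it.
function tasksForBattle(
  battleId: string,
  contestants: ContestantRow[],
  tasks: Map<string, TaskRow>,
): TaskRow[] {
  const linked: TaskRow[] = [];
  for (const c of contestants) {
    if (c.battle_id !== battleId) continue;
    const task = tasks.get(c.task_id);
    if (task === undefined) {
      throw new Error(`Contestant ${c.id} points at missing task ${c.task_id}`);
    }
    linked.push(task);
  }
  return linked;
}
```

Because the link goes from Contestant to task, nothing arena-specific needs to live on the `tasks` table, which is consistent with dropping `tasks.arena_id`.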
## Consequences

- Analysis and cross-examination run through a small pluggable **Analyzer** seam
  (v1 = default-model two-stage judge). A v2 that drives a Han Orchestrator flow
  as the analyzer slots in behind that seam without a schema change.
- The `arena` pane kind, `ArenaState`, and `battle_*` WS frames are added
  alongside (not folded into) the Orchestrator's, mirroring its patterns.
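
One way to picture the Analyzer seam is as a single interface with the v1 two-stage judge as one implementation. The sketch below is an assumption-laden illustration: the `Analyzer`, `Verdict`, `ModelCall`, and `TwoStageAnalyzer` names, the `DIGEST`/`JUDGE` prompt prefixes, and the convention that the judge replies with a bare contestant id are all invented for the example and are not taken from `arena-analyzer`.

```typescript
interface ContestantOutput {
  contestantId: string;
  output: string;
}

interface Verdict {
  winnerId: string;
  digests: string[];
}

// The seam: callers depend only on this, so a v2 analyzer can replace v1.
interface Analyzer {
  analyze(prompt: string, outputs: ContestantOutput[]): Promise<Verdict>;
}

// Injected model call, so the analyzer itself stays easy to unit-test.
type ModelCall = (input: string) => Promise<string>;

class TwoStageAnalyzer implements Analyzer {
  private readonly callModel: ModelCall;

  constructor(callModel: ModelCall) {
    this.callModel = callModel;
  }

  async analyze(prompt: string, outputs: ContestantOutput[]): Promise<Verdict> {
    // Stage 1 (digest): condense each Contestant's output, one call at a time.
    const digests: string[] = [];
    for (const o of outputs) {
      const digest = await this.callModel(`DIGEST\n${o.output}`);
      digests.push(`${o.contestantId}: ${digest}`);
    }
    // Stage 2 (judge): pick a winner from the digests only.
    const reply = await this.callModel(
      `JUDGE\nPrompt: ${prompt}\n${digests.join("\n")}`,
    );
    const winnerId = reply.trim();
    if (!outputs.some((o) => o.contestantId === winnerId)) {
      throw new Error(`Judge named unknown contestant: ${winnerId}`);
    }
    return { winnerId, digests };
  }
}
```

A flow-driven v2 would be another class implementing `Analyzer`; since the runner only sees the interface, swapping it in would not touch the `battles` or `contestants` tables.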