
indifferentketchup d6d246c15b feat(web,coder): arena pane — compare 2-6 AI competitors on same prompt

Arena is a new pane kind for competitive AI evaluation. A Battle runs
the same prompt against 2-6 Contestants across two concurrent lanes:
local lane (llama-swap models, serial) and cloud lane (parallel).

The battle_* frames are added to all three registries: @boocode/contracts
WsFrameSchema, server InferenceFrame, and web WsFrame.

Backend (apps/coder):
- arena-runner: battle scheduler, lane classifier, benchmark, results
  writer, resume, user winner override
- arena-analyzer: two-stage digest→judge analysis on DEFAULT_MODEL
- arena-decisions: status transitions and resume logic (unit-tested)
- arena-analyzer-helpers: pure helper functions (unit-tested)
- arena-model-call: model call utility for analysis
- arena routes: create/get/list/stop/analyze/cross-examine/winner/diff
- schema: battles, contestants, cross_examinations tables (idempotent)
- remove old /api/arena* routes and tasks.arena_id column

Frontend (apps/web):
- ArenaLauncherDialog: battle type, prompt, contestant selection
- ArenaPane: live roster, streaming output, analysis, cross-exam
- DiffView: unified diff with line-by-line color for coding contests
- Winner override per-row dropdown (Trophy icon)
- battle_updated WS handler for live winner/analysis updates
- arena pane kind in Workspace, ChatTabBar, useSidebar

Cross-app:
- ArenaState and ArenaContestantShape/WsFrame types (contracts)
- battle_* frames in WsFrameSchema, InferenceFrame, and web WsFrame
- manifest.json written per battle results folder
- /Arena added to .gitignore

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-06 23:25:29 +00:00

3.3 KiB


Context: BooCode

Glossary of the domain language. Terms only — no implementation detail.

Workspace

Pane — one tile in the multi-pane workspace. Each pane has a kind: Chat (BooChat), Coder (BooCoder), Terminal (BooTerm), Orchestrator, Arena, plus artifact/settings kinds.
Backend — an AI engine a task is dispatched to: native (BooChat inference on a local llama-swap model) or an external CLI agent (Claude Code, OpenCode, Qwen, Goose). Code sometimes calls this the "agent" (tasks.agent).
BooChat Agent (a.k.a. persona) — a preset from the data/AGENTS.md registry (e.g. "Code Reviewer", "Debugger"): a system prompt + tool whitelist + sampling knobs that runs on the native backend with a chosen model. Distinct from a Backend: this is the sense of the overloaded word "agent" that the UI's Agent picker selects.

Arena

A way to run the same prompt against several AI competitors at once and pick the best result.

Battle — one Arena run. Dated. Produces a results folder at /<project-root>/Arena/<dated-battle>/. (The earlier API-only feature called this an "arena"; a Battle is one such run.)
Battle Type — what is being compared:
- Coding — Contestants change code; a result is the diff they produced (plus their explanation). Each Contestant works in its own worktree.
- Q&A — Contestants answer a prompt; a result is the text answer. No code changes.
Contestant — one competitor in a Battle, given the Battle's prompt. What defines a Contestant depends on Battle Type:
- Coding — a Backend + Model (e.g. Claude Code + opus, native BooCode + 35b). Each works in its own isolated git worktree (a branched on-disk copy of the project). Contestants do not see each other's work.
- Q&A — a BooChat Agent (persona) + Model (e.g. Debugger + 35b), running on the native backend only. No worktree (no code changes). The same model can appear under two Contestants, so a Contestant's identity is the (backend-or-persona, model) pair, not the model alone.
Benchmark — per-Contestant performance captured during a Battle. Wall-clock duration is recorded for every Contestant; throughput (tokens/sec) is recorded only for local (llama-swap) models, the only ones for which a speed comparison is meaningful.
Arena results folder (/<project-root>/Arena/<dated-battle>/) — where a Battle's results are written (not the working copies — those stay in each Contestant's worktree). Holds the per-Contestant result and the final analysis.
Lane — how a Battle's Contestants are scheduled. The local lane holds every llama-swap-backed Contestant and runs them strictly one at a time (the local server can only load one model at a time, which also keeps their speed Benchmark fair). The cloud lane holds cloud-backed Contestants (Claude Code, OpenCode-on-cloud) and runs them all in parallel. The two lanes run concurrently with each other.
Analysis — an end-of-Battle judgement of the Contestants' results, produced by the default BooChat model, naming a Winner.
Cross-examination — an after-the-Battle step where a chosen model (from any agent) is pointed at the Battle's results to interrogate / compare them.
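
The Lane and Benchmark terms above can be illustrated with a minimal sketch. It is not taken from the BooCode source: the names Contestant, Benchmark, RunFn, timed and runBattle are all hypothetical, and the runner is assumed to resolve with the number of tokens it generated. It shows only the scheduling shape: local Contestants run one at a time, cloud Contestants run in parallel, and the two lanes run concurrently.

```typescript
// Illustrative sketch only: every name here is hypothetical, not BooCode's code.
type Lane = "local" | "cloud";

interface Contestant {
  id: string; // stands in for the (backend-or-persona, model) pair
  lane: Lane;
}

interface Benchmark {
  id: string;
  durationMs: number;    // wall-clock, recorded for every Contestant
  tokensPerSec?: number; // recorded for local-lane Contestants only
}

// Assumed runner signature: resolves with the number of tokens generated.
type RunFn = (c: Contestant) => Promise<number>;

// Run one Contestant and capture its Benchmark.
async function timed(c: Contestant, run: RunFn): Promise<Benchmark> {
  const start = Date.now();
  const tokens = await run(c);
  const durationMs = Date.now() - start;
  const bench: Benchmark = { id: c.id, durationMs };
  if (c.lane === "local" && durationMs > 0) {
    bench.tokensPerSec = tokens / (durationMs / 1000);
  }
  return bench;
}

async function runBattle(contestants: Contestant[], run: RunFn): Promise<Benchmark[]> {
  const local = contestants.filter((c) => c.lane === "local");
  const cloud = contestants.filter((c) => c.lane === "cloud");

  // Local lane: strictly one at a time, since only one model can be loaded.
  const localLane = (async (): Promise<Benchmark[]> => {
    const out: Benchmark[] = [];
    for (const c of local) {
      out.push(await timed(c, run));
    }
    return out;
  })();

  // Cloud lane: every Contestant starts immediately and runs in parallel.
  const cloudLane = Promise.all(cloud.map((c) => timed(c, run)));

  // Both lanes are already in flight here; wait for the two of them together.
  const [localResults, cloudResults] = await Promise.all([localLane, cloudLane]);
  return [...localResults, ...cloudResults];
}
```

Running the local lane serially inside its own async function, rather than awaiting it before starting the cloud lane, is what lets the two lanes overlap while still keeping local runs from competing with each other.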
