Arena is a new pane kind for competitive AI evaluation. A Battle runs the same prompt against 2-6 Contestants across two concurrent lanes: local lane (llama-swap models, serial) and cloud lane (parallel). Added to all three registries: @boocode/contracts WsFrameSchema, server InferenceFrame, and web WsFrame.

Backend (apps/coder):
- arena-runner: battle scheduler, lane classifier, benchmark, results writer, resume, user winner override
- arena-analyzer: two-stage digest→judge analysis on DEFAULT_MODEL
- arena-decisions: status transitions and resume logic (unit-tested)
- arena-analyzer-helpers: pure helper functions (unit-tested)
- arena-model-call: model call utility for analysis
- arena routes: create/get/list/stop/analyze/cross-examine/winner/diff
- schema: battles, contestants, cross_examinations tables (idempotent)
- remove old /api/arena* routes and tasks.arena_id column

Frontend (apps/web):
- ArenaLauncherDialog: battle type, prompt, contestant selection
- ArenaPane: live roster, streaming output, analysis, cross-exam
- DiffView: unified diff with line-by-line color for coding contests
- Winner override per-row dropdown (Trophy icon)
- battle_updated WS handler for live winner/analysis updates
- arena pane kind in Workspace, ChatTabBar, useSidebar

Cross-app:
- ArenaState and ArenaContestantShape/WsFrame types (contracts)
- battle_* frames in WsFrameSchema, InferenceFrame, and web WsFrame
- manifest.json written per battle results folder
- /Arena added to .gitignore

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
68 lines
3.3 KiB
Markdown
# Context: BooCode

Glossary of the domain language. Terms only — no implementation detail.

## Workspace

- **Pane** — one tile in the multi-pane workspace. Each pane has a *kind*:
  Chat (BooChat), Coder (BooCoder), Terminal (BooTerm), Orchestrator, Arena,
  plus artifact/settings kinds.

- **Backend** — an AI engine a task is dispatched to: *native* (BooChat
  inference on a local llama-swap model) or an *external* CLI agent (Claude Code,
  OpenCode, Qwen, Goose). Code sometimes calls this the "agent" (`tasks.agent`).

- **BooChat Agent** (a.k.a. *persona*) — a preset from the `data/AGENTS.md`
  registry (e.g. "Code Reviewer", "Debugger"): a system prompt + tool whitelist +
  sampling knobs that runs **on the native backend** with a chosen model.
  Distinct from a Backend — this is the overloaded sense of "agent" the UI's
  Agent picker selects.

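The difference between a Backend and a BooChat Agent in the Workspace terms above can be sketched as two separate types. This is only an illustration of the vocabulary: every type, field, and function name below is hypothetical, as are the specific sampling knobs and the tool name.

```typescript
// Hypothetical names throughout; only the concepts come from the glossary.

// A Backend is the AI engine a task is dispatched to.
type Backend =
  | { kind: "native"; model: string } // BooChat inference on a local llama-swap model
  | { kind: "external"; cli: string }; // an external CLI agent, e.g. "claude-code"

// A BooChat Agent (persona) is a preset, not an engine.
interface BooChatAgent {
  name: string;
  systemPrompt: string;
  toolWhitelist: string[];
  sampling: { temperature?: number; topP?: number }; // which knobs exist is an assumption
}

// A persona never picks its own engine: it always runs on the native
// backend, paired with whatever model the user chose.
function backendFor(_persona: BooChatAgent, model: string): Backend {
  return { kind: "native", model };
}

const debuggerPersona: BooChatAgent = {
  name: "Debugger",
  systemPrompt: "Placeholder prompt for illustration.",
  toolWhitelist: ["read_file"], // made-up tool name
  sampling: { temperature: 0.2 },
};

const resolved = backendFor(debuggerPersona, "35b");
console.log(resolved.kind); // prints "native"
```

Keeping the two as distinct types means the overloaded word "agent" never has to appear in the model.
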
## Arena

A way to run the **same prompt** against several AI competitors at once and pick
the best result.

- **Battle** — one Arena run. Dated. Produces a results folder at
  `/<project-root>/Arena/<dated-battle>/`. (The earlier API-only feature called
  this an "arena"; a Battle is one such run.)

- **Battle Type** — what is being compared:
  - *Coding* — Contestants change code; a result is the **diff** they produced
    (plus their explanation). Each Contestant works in its own worktree.
  - *Q&A* — Contestants answer a prompt; a result is the **text answer**. No
    code changes.

- **Contestant** — one competitor in a Battle, given the Battle's prompt. What
  defines a Contestant depends on Battle Type:
  - *Coding* — a **Backend + Model** (e.g. Claude Code + opus, native BooCode +
    35b). Each works in its own isolated git **worktree** (a branched on-disk
    copy of the project). Contestants do not see each other's work.
  - *Q&A* — a **BooChat Agent (persona) + Model** (e.g. Debugger + 35b), running
    on the native backend only. No worktree (no code changes).

  The same model can appear under two Contestants, so a Contestant's identity is
  the (backend-or-persona, model) pair, not the model alone.

- **Benchmark** — per-Contestant performance captured during a Battle. Wall-clock
  **duration** is recorded for every Contestant; **throughput** (tokens/sec) is
  recorded only for local (llama-swap) models, which are the ones the speed
  comparison is meaningful for.

- **Arena results folder** (`/<project-root>/Arena/<dated-battle>/`) — where a
  Battle's *results* are written (not the working copies — those stay in each
  Contestant's worktree). Holds the per-Contestant result and the final
  analysis.

- **Lane** — how a Battle's Contestants are scheduled. The *local lane* holds
  every llama-swap-backed Contestant and runs them strictly one at a time (the
  local server can only load one model at a time, which also keeps their speed
  Benchmark fair). The *cloud lane* holds cloud-backed Contestants (Claude Code,
  OpenCode-on-cloud) and runs them all in parallel. The two lanes run
  concurrently with each other.

- **Analysis** — an end-of-Battle judgement of the Contestants' results,
  produced by the default BooChat model, naming a **Winner**.

- **Cross-examination** — an after-the-Battle step where a chosen model (from any
  agent) is pointed at the Battle's results to interrogate / compare them.

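The Lane entry above describes a scheduling rule that is easy to check with a small timeline model: local Contestants run back to back, cloud Contestants all start at once, and both lanes start together. The sketch below is not the real scheduler. It only computes start and end offsets from made-up durations, and `LaneEntry`, `Slot`, `planLanes`, `battleEndMs`, and the "8b" model name are hypothetical.

```typescript
// Hypothetical names and made-up durations; a timeline model, not the real scheduler.
interface LaneEntry {
  name: string;
  servedBy: "llama-swap" | "cloud"; // decides the lane
  durationMs: number; // how long this Contestant's run takes
}

interface Slot {
  name: string;
  lane: "local" | "cloud";
  startMs: number;
  endMs: number;
}

function planLanes(entries: LaneEntry[]): Slot[] {
  let localCursor = 0; // when the local lane is next free
  return entries.map((e): Slot => {
    if (e.servedBy === "llama-swap") {
      // Local lane: start only after the previous local Contestant has finished.
      const startMs = localCursor;
      localCursor += e.durationMs;
      return { name: e.name, lane: "local", startMs, endMs: localCursor };
    }
    // Cloud lane: every Contestant starts at t = 0, alongside the local lane.
    return { name: e.name, lane: "cloud", startMs: 0, endMs: e.durationMs };
  });
}

// The Battle is done when the slower lane finishes.
function battleEndMs(slots: Slot[]): number {
  return slots.reduce((latest, s) => Math.max(latest, s.endMs), 0);
}

const slots = planLanes([
  { name: "native + 35b", servedBy: "llama-swap", durationMs: 4000 },
  { name: "Claude Code + opus", servedBy: "cloud", durationMs: 5000 },
  { name: "native + 8b", servedBy: "llama-swap", durationMs: 3000 },
]);
console.log(battleEndMs(slots)); // 7000: the local lane (4000 + 3000) outlasts the 5000 cloud run
```

Because the local lane is serial, adding a local Contestant lengthens the Battle by its full duration, while adding a cloud Contestant only matters if it is the slowest one.
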
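The Benchmark entry above records duration for every Contestant but throughput only for local ones. A minimal sketch of that rule follows. The glossary does not say how throughput is computed, so dividing output tokens by wall-clock seconds is an assumption here, and `FinishedRun`, `Benchmark`, and `toBenchmark` are hypothetical names.

```typescript
// Hypothetical names; the throughput formula (output tokens / wall-clock seconds) is an assumption.
interface FinishedRun {
  servedBy: "llama-swap" | "cloud";
  startedAtMs: number;
  finishedAtMs: number;
  outputTokens: number;
}

interface Benchmark {
  durationMs: number;
  tokensPerSec?: number; // present only for local (llama-swap) runs
}

function toBenchmark(run: FinishedRun): Benchmark {
  const durationMs = run.finishedAtMs - run.startedAtMs;
  if (run.servedBy === "llama-swap" && durationMs > 0) {
    return { durationMs, tokensPerSec: run.outputTokens / (durationMs / 1000) };
  }
  return { durationMs }; // cloud runs: duration only
}

const localBench = toBenchmark({ servedBy: "llama-swap", startedAtMs: 0, finishedAtMs: 10000, outputTokens: 500 });
const cloudBench = toBenchmark({ servedBy: "cloud", startedAtMs: 0, finishedAtMs: 8000, outputTokens: 500 });
console.log(localBench.tokensPerSec, cloudBench.tokensPerSec); // 50 undefined
```

Leaving `tokensPerSec` absent for cloud runs, rather than storing a zero, keeps a missing measurement from being mistaken for a slow one.
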