chore: snapshot working tree - pty_exited notifications + in-flight inference WIP

feat(booterm): structured pty_exited WS notifications. Plan-validated, impl-validated, code-reviewed green (contracts build clean, contracts test 29/29, booterm + web typecheck clean). wip: in-progress inference/provider refactor (agents.ts, provider.ts, new llama-providers.ts, removed llama-args-validator), plus arena, dispatcher, compaction, schema changes. openspec: pty-exit-notifications complete; x-agent-flags planned (not yet implemented).
2026-06-14 12:48:47 +00:00
parent 0ed506f1da
commit b18de2a331
204 changed files with 25344 additions and 867 deletions
--- a/openspec/changes/boocontrol/design.md
+++ b/openspec/changes/boocontrol/design.md
@@ -0,0 +1,246 @@
+# BooControl — design
+
+**Status:** ACCEPTED — decisions resolved 2026-06-11; architecture-analysis findings folded in; verification-pass fixes applied 2026-06-12 (chart lib decided: ECharts, §9). No open design items.
+
+## 1. Topology
+
+```
+┌─ Tailscale mesh ──────────────────────────────────────────────────────────┐
+│                                                                           │
+│  sam-desktop 100.101.41.16 (Windows, RTX 5090 32GB)                       │
+│    llama-swap v224 :8401  ─ /api/events SSE, /api/performance(GPU),       │
+│    D:\llama-server (CUDA)   /api/metrics, /api/captures, /running,        │
+│                             /logs/stream, POST /api/models/unload         │
+│                                                                           │
+│  embedding 100.90.172.55 (Linux, P104-100 8GB)                            │
+│    llama-swap :8411 ─ same API surface; 39 small models, ttl 1800         │
+│                                                                           │
+│  ubuntu-homelab 100.114.205.53 (no GPU)                                   │
+│    boocode container :9500 (apps/server + apps/web)                       │
+│    booterm container :9501                                                │
+│    boocoder host svc :9502 (apps/coder)                                   │
+│    boocontrol host svc :9503 (apps/control)  ◄── NEW                      │
+│    postgres :5500 (boochat DB)                                            │
+└───────────────────────────────────────────────────────────────────────────┘
+
+Browser ──WS/HTTP──► apps/server (/api/control/* proxy, WS relay)
+                        └────────► apps/control :9503
+                                      ├─ SSE client per provider (events)
+                                      ├─ pollers (/api/performance?after=, /running)
+                                      ├─ per-host action queue (warm/unload serialization)
+                                      ├─ bench + eval engines (manual v1)
+                                      ├─ ssh2 (P9 only: config edit + restart)
+                                      └─ Postgres (third schema owner, ordered startup)
+```
+
+Key fact that shapes everything: **the llama-swap fork exposes GPU/system telemetry, token metrics, request captures, and log streams over HTTP per instance** (`internal/perf/types.go` GpuStat/SysStat; `internal/server/apigroup.go`). The control service needs no agent on the GPU hosts. SSH is required only for config editing + service restart (P9).
+
+Why a host service and not a container: SSH key handling (P9), spawning sandbox containers for code evals (talking to dockerd from inside a container is a privilege escalation we don't need), and parity with the boocoder operational pattern (systemd, `.env.host`, deploy via `pnpm -C packages/contracts build && pnpm -C apps/control build && sudo systemctl restart boocontrol`).
+
+**There is no sidecar.** The llama-sidecar (:8402, per-agent flags) has been removed from the system entirely. No control-plane table, connector, or registry field references it.
+
+## 2. Fleet identity = the provider registry (`LlamaProvider.id`)
+
+The multi-provider batch introduces the shipped contract (`packages/contracts/src/llama-providers.ts`):
+
+```ts
+LlamaProviderSchema = { id, label, baseUrl, kind }   // ids: "sam-desktop", "embedding"
+```
+
+BooControl keys every host-scoped row on **`provider_id` = `LlamaProvider.id`** — the field that actually exists and that `resolveModelProvider` already resolves by. (Earlier drafts said `provider_name` against a `{name, sidecarUrl?}` shape; that shape was never shipped.) Control-plane attributes extend the registry entry rather than inventing a parallel hosts table:
+
+```
+control_hosts
+  provider_id TEXT PK            -- FK-by-convention to LlamaProvider.id ("sam-desktop", "embedding")
+  ssh_host TEXT, ssh_user TEXT, ssh_key_path TEXT      -- nullable: no SSH = no config editing (P9)
+  config_path TEXT               -- D:\llama-swap\config.yaml | ~/llama-swap/config.yaml (P9)
+  restart_cmd TEXT               -- nssm/systemctl invocation (P9)
+  os TEXT, gpu_label TEXT        -- display metadata
+  enabled BOOLEAN DEFAULT true
+```
+
+Lesson imported from stackctl's worst bug: its machines table was dropped + re-seeded on every container rebuild, losing user-added hosts. `control_hosts` rows are durable; seeding is `INSERT ... ON CONFLICT DO NOTHING`.
+
+## 3. Schema ownership + startup ordering (third schema owner)
+
+`apps/control/src/schema.sql`, applied by `apps/control/src/db.ts:applySchema()` on boot — the coder precedent. Two hardening rules the coder precedent lacks:
+
+1. **Startup ordering guard.** The coder schema holds real FKs into server-owned tables (`REFERENCES sessions(id)`, `chats(id)`); today the server-before-coder ordering is an accident of Docker-vs-host start timing. A third concurrent `applySchema` caller widens that race, so `apps/control` makes the ordering explicit:
+
+```ts
+// apps/control/src/index.ts — before applySchema()
+await waitForTable(sql, 'sessions', 30_000);  // poll information_schema; THROWS on timeout
+await applySchema(sql);
+```
+
+   "Fail loud" means **throw → process exits nonzero → systemd (`Restart=on-failure`) retries**. The guard is enforcing, not advisory: `applySchema` is never reached if the server schema is absent, so a partial-DDL state cannot occur.
+
+   (Control tables themselves currently take no FKs into server tables, but the guard costs one query and removes the timing dependency for any future FK.)
+
+2. **Dedup is enforced by the database, not application checks.** Every ingest table whose dedup matters carries a UNIQUE constraint and is written with `INSERT ... ON CONFLICT DO NOTHING` — check-then-act application dedup is racy under concurrent SSE + reconcile writers (analysis C2/C7).
+
+```
+control_requests          -- persisted ActivityLogEntry stream (the thing llama-swap forgets on restart)
+  id BIGSERIAL PK, provider_id TEXT, swap_entry_id INT,   -- llama-swap's ring id
+  ts TIMESTAMPTZ, model TEXT, req_path TEXT, status_code INT,
+  duration_ms INT, cache_tokens INT, input_tokens INT, output_tokens INT,
+  prompt_tps REAL, gen_tps REAL, has_capture BOOLEAN,
+  capture JSONB,                                           -- nullable; fetched-on-demand copy (req/resp, capped)
+  UNIQUE (provider_id, swap_entry_id, ts)                  -- survives ring-id reset; INSERT ... ON CONFLICT DO NOTHING
+  -- NOTE: no `source` column in P1. The X-Boo-Source attribution column is added by the
+  -- P4 migration, when injection actually works end-to-end (see §7). No NULL-forever rows.
+
+control_perf_samples      -- raw SysStat+GpuStat, short retention (48h default)
+  provider_id TEXT, ts TIMESTAMPTZ, gpu JSONB, sys JSONB,
+  UNIQUE (provider_id, ts)                                 -- restart-safe: re-polled samples no-op
+
+control_perf_rollup_5m    -- avg/max per 5min bucket, long retention (90d)
+  provider_id TEXT, bucket TIMESTAMPTZ, gpu_agg JSONB, sys_agg JSONB,
+  UNIQUE (provider_id, bucket)                             -- rollup is an idempotent upsert (§6)
+
+control_model_events      -- state transitions (stopped→starting→ready→stopping), swap durations
+  provider_id, model, state, ts, detail JSONB,
+  UNIQUE (provider_id, model, state, ts)                   -- reconcile can re-deliver model status; same ON CONFLICT DO NOTHING discipline
+
+bench_suites / bench_runs / bench_samples
+  -- suite: {prompt_tokens[], gen_tokens[], concurrency[], repetitions}
+  -- sample: per-request timings (ttft_ms, prompt_tps, gen_tps, total_ms) + run aggregates
+
+eval_suites / eval_runs / eval_results
+  -- suite: kind chat|code, tasks JSONB (prompt, reference, checker), judge_model
+  -- result: per-task score, judge rationale / execution log, sandbox exit info
+
+route_policies            -- P7: name, match rules JSONB, target ordering, fallback
+control_reports           -- generated digests (markdown + JSONB stats)
+  + schedule meta: {interval: 'daily'|'weekly', enabled, last_run_at TIMESTAMPTZ}
+  -- driven by the SAME in-process timer pattern as the retention job (P6): hourly tick
+  -- checks last_run_at vs interval, runs if due (catch-up on boot included). No cron dep,
+  -- no new scheduler abstraction (S7 stays YAGNI-deferred; reopen trigger unchanged).
+```
+
+`clock_timestamp()` inside transactions per repo convention; JSONB via `sql.json(...)`.
+
+## 4. Ingestion semantics
+
+- **SSE consumer** per enabled host: `GET /api/events` → envelopes `modelStatus | logData | metrics | inflight`. Reconnect with backoff + jitter (reconnect/circuit-breaker pattern: `apps/coder/src/services/backends/opencode-sse.ts` — NOTE the source has exponential backoff + circuit breaker but NO jitter; add jitter explicitly here, random 0-50% of the computed delay, per plan finding V1/F3). On reconnect, reconcile via `GET /api/metrics` (full ring). Reconcile and live SSE may both insert the same entry concurrently — that is fine **because dedup is the DB UNIQUE constraint** (`ON CONFLICT DO NOTHING`), not a check-then-act. The dedup key `(provider_id, swap_entry_id, ts)` includes the timestamp because llama-swap's ring ids restart from 0 on its restart.
+  - **Known bound, accepted:** the ring holds 1000 entries. An outage longer than 1000 requests loses the overwritten tail permanently — log a `gap_suspected` model event so the loss is visible rather than silent. **Detection rule (no-overlap heuristic):** if the *oldest* entry in the reconcile fetch is newer than the newest already-persisted entry for that provider, the ring wrapped past our tail; emit `gap_suspected` with both timestamps in `detail`. Overlap present = no gap, no event.
+  - **Second accepted residual:** a genuinely-new post-restart entry whose `(swap_entry_id, ts)` exactly collides with a pre-restart row (same ring slot, same timestamp to llama-swap's `ts` precision) is silently dropped by the UNIQUE constraint. Window = one entry per restart at sub-precision coincidence; accepted, not solvable client-side without a content hash in the key.
+- **Perf poller**: `GET /api/performance?after=<last-ts>` every 5s (llama-swap's own minimum collection interval). The watermark is recovered on restart from `MAX(ts)` per provider in `control_perf_samples` (not in-memory only); duplicate polls no-op on the UNIQUE constraint. **Cold start (`MAX(ts)` = NULL, fresh install):** omit `after` entirely and ingest whatever window the host returns — the UNIQUE constraint makes over-fetch harmless, and the next poll has a watermark.
+- **Host liveness is explicit state, not absence of data.** Each connector runs a small state machine `connected | reconnecting | down` (down after N failed reconnects); transitions publish a `control_fleet` delta and stamp `control_hosts`-adjacent in-memory state with `last_seen_at`. A late-joining browser therefore sees `down + last_seen_at`, never a stale "ready" snapshot (analysis B3).
+- **Snapshot/delta consistency.** The fleet state keeps a per-host monotonic `seq`, incremented on every mutation. The join snapshot carries the current `seq`s; every delta carries its `seq`. Client rule: **buffer (do not apply, do not discard) any delta that arrives before the snapshot**; after applying the snapshot, replay the buffer dropping deltas with `seq <=` the snapshot's per-host seq, and apply the filter to all subsequent deltas. On a single FIFO WS pre-snapshot deltas should not occur, but buffering makes the rule transport-independent. This closes the join race where a delta arrives during snapshot serialization (analysis B4).
+- **Logs are not persisted** by default (volume + low value at rest); they relay live SSE → WS with an in-memory tail buffer (last ~2k lines per host) for late joiners. Optional "record to file" toggle later.
+- **Fan-out to browser**: the control service publishes over its own WS (`/api/ws/control`), relayed by apps/server's proxy as `/api/control/ws`. This is a **second app-level WS connection** in the browser — `useControlStream` gets its own singleton guard + context; it does NOT share `useUserEvents`' `/api/ws/user` channel. Frames (added to `packages/contracts/src/ws-frames.ts` **first**, then the server loose union, then the web strict union — and the contracts drift test extended to cover them, so a partial edit fails the suite):
+  - `control_fleet` — full snapshot on join + seq-stamped state deltas (hosts, liveness, models, states, ttl deadlines, inflight)
+  - `control_activity` — new request rows (the live feed)
+  - `control_perf` — appended samples per host
+  - `control_log` — `{provider_id, source: proxy|upstream, line}` batches
+  - `control_job` — bench/eval run progress events
+
+## 5. Actions
+
+| Action | Mechanism |
+|---|---|
+| Warm/load model | 1-token `POST /v1/chat/completions` with the bare wire ID (stackctl-proven; llama-swap loads on demand — there is no load endpoint) |
+| Unload one/all | `POST /api/models/unload/:model` / `/api/models/unload` |
+| Inspect request | `GET /api/captures/:id` on the host, decode base64, persist trimmed copy, render |
+| Bench/eval runs | engines below (manual v1) |
+| Edit config / restart llama-swap | P9 (SFTP + schema validation + diff + timestamped backup + restart + health-wait) |
+
+**Per-host action queue.** All host-mutating actions (warm, unload, bench warm-up) from BooControl serialize through a single FIFO queue per `provider_id` inside the control service — double-clicks, warm-during-warm, and unload-during-bench from *this* service cannot interleave (analysis C3). An unload request while a bench run holds the host is rejected with a "bench in progress — takeover?" confirmation. Queue discipline (verification C-N1): **submissions are rejected immediately while the host's liveness state is `down`** ("host offline" toast); queue depth is capped (4) with reject-on-full; each action **re-checks liveness on dequeue and skips itself if stale** — a recovered host never replays a backlog of stale warms. (Pattern precedent: `arena-runner.ts` `advanceChain` promise-chain, plus its read-fresh-state-or-skip discipline.) This serializes BooControl's own hands only; BooChat/BooCoder/Arena traffic is uncoordinated until P8.
+
+All mutating actions publish `control_job`/`control_fleet` frames; UI handlers stay idempotent (event-dedup discipline per CLAUDE.md — no local emit after API call).
+
+**Manual op checklist (P2.5):** Before the capture inspector works end-to-end, enable `captureBuffer` and review `metricsMaxInMemory` on both hosts' llama-swap configs. These are per-host settings in `config.yaml` and must be set before captures will be available:
+
+- [ ] sam-desktop: set `captureBuffer: true` and verify `metricsMaxInMemory` (default 1000, sufficient for most workloads)
+- [ ] embedding: set `captureBuffer: true` and verify `metricsMaxInMemory`
+- [ ] Restart llama-swap on both hosts after config changes
+
+## 6. Retention (ships in the same P1 slice as ingestion)
+
+Daily job, crash-safe by construction:
+
+1. **Rollup is an idempotent upsert**: `INSERT INTO control_perf_rollup_5m ... ON CONFLICT (provider_id, bucket) DO UPDATE` recomputed from raw — a re-run after a crash recomputes the same buckets, never double-counts.
+2. **Delete raw only after the covering buckets are committed**, in **chunked transactions: one transaction per provider per 1-hour window** (≤720 rows each), never one 48h mega-transaction — bounds lock hold time so the live 5s poller's inserts into the same table never queue behind a multi-second aggregate+delete (verification C-N2). A crash between chunks leaves whole-hour windows either fully migrated or fully raw; the next run recomputes idempotently.
+3. Activity > 90d pruned; captures capped per-row (256KB) and pruned by total budget. All windows configurable via `.env.host`.
+
+Retention is a **P1 task in the same slice as ingestion**, not a fast-follow — the bloat window between "ingestion starts" and "retention exists" degrades the shared DB that serves all of BooChat (analysis R3).
+
+## 7. Attribution (X-Boo-Source) — own phase (P4), two blockers solved together
+
+The naive plan ("inject a header, small touch") is blocked on both inference paths:
+
+- **apps/server (BooChat streaming)**: `getSwapProvider()` caches `createOpenAICompatible` instances by `baseURL` in `swapCache`; headers are provider-level, baked at construction. Fix: a per-turn **fetch wrapper** — thread the source label through the call site and pass a wrapping `fetch` that injects `X-Boo-Source` (cache keyed by `baseURL+source` since the label set is tiny: `boochat|boocoder|arena|control-bench|control-eval`). **Interface constraint (verification S-N2):** `getSwapProvider` is private (fan-in 1), but the label must travel through the exported `upstreamModel`, whose file has a 28-file/13-route blast radius — the change MUST be additive (`upstreamModel(config, modelId, agent?, source?)` or an options object with optional `source`), never a breaking signature change; all existing call sites compile unchanged. The direct-fetch paths (`compaction.ts`, `task-model.ts`) just extend their existing headers object.
+- **apps/coder (opencode local gateway)**: `local-gateway.ts` builds a fresh headers object and silently strips inbound `X-Boo-Source`. Fix: forward it explicitly when present. Arena/dispatch direct paths set it at their own fetch sites.
+
+P4 lands: both fixes + the `control_requests.source` column migration + the `source` filter in the Activity UI. llama-swap's header capture (`captureBuffer`) must be enabled on the hosts first (P2 op task). Acceptance: a BooChat turn, a BooCoder dispatch, and an Arena battle each show their own label in the Activity feed; nothing shows NULL except genuinely external traffic.
+
+#### Implementation notes
+
+**P6.2 schedule meta lives in its own table, not on `control_reports`.** §3 sketched `control_reports + schedule meta: {interval, enabled, last_run_at}`. In implementation the scheduler state was split into a dedicated single-row `control_schedule_meta` table (keyed by schedule `name`, seeded `report-digest`) so generated `control_reports` rows stay immutable snapshots and the boot catch-up reads/writes one well-known row instead of scanning report history for the latest `last_run_at`. The retention-style hourly tick (`runReportSchedulerTick`) and the `{interval, enabled, last_run_at}` contract are unchanged.
+
+**P7 gateway identity.** The gateway registers as provider id `auto` (kind `boocontrol-gateway`); its virtual models are `auto`, `auto:code`, `auto:fast`, `auto:cheap`, so BooChat composite ids are `auto/auto:code` etc. and the wire model sent to the gateway is the bare virtual token. `getModelContext` reads `n_ctx` from the gateway's own `/upstream/<virtual>/props`, which proxies the first healthy candidate's props. The gateway is reached server-to-server via the registry baseUrl (not the `/api/control` proxy, which buffers responses and would break streaming).
+
+**P7 orphan detection.** An orphaned auto:* session is detected two ways: by registry `kind === 'boocontrol-gateway'` when the gateway is present (→ `gateway`), and by the virtual-model token shape (`auto` / `auto:*`) when the provider is absent (→ `gateway_error`, reason `offline`). The unknown-composite-provider swap fallback is overridden only for that token shape; all other unknown composites keep their existing best-effort swap behavior.
+
+**P9.1 uses shelled `ssh`, not an ssh2/SFTP library.** §5 and the P9 task say "SFTP read ... SFTP write". Implementation shells out to the system `ssh` (`cat <path>` to read, `cp` for the timestamped backup, `cat > <path>` over stdin to write, the configured `restart_cmd` to restart) with an explicit `-i <key> -o IdentitiesOnly=yes -o BatchMode=yes`. This matches the established booterm SSH-via-shell precedent and the Gitea deploy-key lesson (never offer the agent's default key), and avoids adding an `ssh2` native dependency. The exec is injected (`SshExec`) so every failure path (unreadable host, backup fail, write fail, restart fail, health never recovers) is unit-tested without a live host. The fork `config-schema.json` is bundled at `apps/control/data/config-schema.json` and validated with ajv (added as a control dependency). Backup always precedes write, so a failed write leaves the timestamped backup intact. Not live-smoked: there is no reachable Windows SSH target in the implementation session (the documented "Windows SSH fiddliness" risk); the failure-path suite is the standing verification.
+
+**ActivityLogEntry does not carry request headers.** The llama-swap fork's `ActivityLogEntry` struct (`internal/server/metrics.go`) contains `ID`, `Timestamp`, `Model`, `ReqPath`, `RespContentType`, `RespStatusCode`, `Tokens`, `DurationMs`, `HasCapture` -- no `source` field and no request headers. The `X-Boo-Source` header IS captured in `ReqRespCapture.ReqHeaders` (`internal/server/captures.go`), but captures are stored separately in a zstd-compressed cache and fetched on-demand via `GET /api/captures/:id`, not in the metrics ring.
+
+Therefore the `control_requests.source` column is NULL for ring-ingested data. The column exists for: (1) future llama-swap versions that may add source to ActivityLogEntry, (2) manual backfill from captures, (3) non-ring sources (bench/eval direct calls that set source explicitly). The metrics ingest mapper writes NULL for source, matching what the ring provides.
+
+## 8. Benchmark, eval, routing
+
+### Speed bench (P3 — manual, safe-by-construction)
+- HTTP-level, through llama-swap (measures what BooChat actually experiences) with llama.cpp `timings` (`prompt_per_second`, `predicted_per_second`, `cache_n`) parsed from the final stream chunk; TTFT measured client-side at first delta.
+- Suite = grid of (prompt_len × gen_len × concurrency) × N repetitions; warmup excluded; results as aggregates + raw samples. Runner fan-out is **bounded** (suite-declared concurrency only, `Promise.allSettled`, never unbounded `Promise.all`).
+- **v1 safety model**: every run is user-initiated with an explicit takeover confirmation when the target host shows recent traffic; embedding-host-first defaults. The `inflight==0` check is a *courtesy gate*, not a guarantee — BooChat/BooCoder/Arena can race it (TOCTOU, four uncoordinated writers). v1 accepts this because a human clicked "run"; **unattended scheduling is explicitly deferred to P8** (fleet lease). Bench results note `concurrent_foreign_requests` observed during the run (from the activity stream) so polluted runs are flagged, not silently trusted.
+- Baselines + regression: each (provider_id, model) keeps a baseline aggregate; new runs flag deltas beyond threshold (e.g. gen tok/s −10%) → surfaces in Reports and as a fleet-card badge.
+- Later: `llama-bench` over SSH for device-level (no-server) numbers, JSON output ingested alongside (P9, with the SSH plumbing).
+
+### Quality evals (P5)
+- **Suite program** (decided 2026-06-12): four suites measuring Sam's real workloads, in priority order — (1) **agent coding tasks** (TS/code-edit tasks like BooCoder dispatches, sandboxed pass@1), (2) **chat assistant quality** (judge rubrics), (3) **long-context retrieval** (needle/doc-QA for file-heavy sessions), (4) **utility calls** (titles/summaries/compaction — directly tunes the `FAST_MODEL` choice).
+- **Chat**: suite of curated prompts (data/ YAML, editable) scored by LLM-as-judge (rubric single-answer grading, MT-bench style; temperature 0, judge model + version pinned per run). Judge = strongest local model by default. Pairwise comparisons delegate to **Arena** (exists in apps/coder) — BooControl links/launches battles rather than re-implementing.
+- **Code**: HumanEval+/MBPP+-style tasks, executed in ephemeral sandbox containers on the homelab: `--network none`, non-root, mem/cpu/time caps, tmpfs workdir, `--rm`, kill-on-timeout, and a `boocontrol-eval` label so orphans are findable (`docker ps --filter label=...`) and pruned at engine start. Runner: **bounded concurrency** (default 4), `Promise.allSettled`, per-task `finally` cleanup — a single task failure never abandons in-flight containers (analysis C5; the CLAUDE.md child-supervisor lesson applies). `/opt/forks/openevals` is the reference implementation to borrow patterns from (TS).
+- Scorecards: per (provider_id, model, quant) leaderboard with speed × quality scatter — "is the Q4 actually worse for my use?" answered with my own suite, on my own hardware.
+
+### Routing (P6 advisory → P7 live gateway, committed)
+- **P6 — advisory**: routing scores (eval results + live latency + host health) exposed via API; the model picker badges "best code model right now".
+- **P7 — gateway**: control service exposes OpenAI-compatible virtual models (`auto`, `auto:code`, `auto:fast`, `auto:cheap`) implementing policy: rule match → candidate ordering → health/ctx-fit filter → dispatch with failover. BooChat adopts by adding a registry entry (`{id: "auto", baseUrl: "http://100.114.205.53:9503", kind: "boocontrol-gateway"}`) — zero inference-path changes elsewhere. Frontier providers slot in as policy targets when added to the registry.
+  - **Orphaned-session handling (explicit — REQUIRES a `provider.ts` code change, verification S-N1/B-N3)**: today `resolveModelProvider` silently falls back to `LLAMA_SWAP_URL` for any composite id with an unknown provider ("best-effort fallback, config incomplete" branch) — exactly the mis-route this section forbids. P7 must (a) extend the `InferenceRoute` union (currently `'swap' | 'deepseek'`) with a `'gateway'` variant (and an unhealthy/error representation), and (b) change the unknown-provider fallback so a known-`kind` gateway id that is missing/disabled resolves to a clean "routing gateway offline" error, never the swap fallback. All **5 callers** of `resolveModelProvider` must be audited for the new variant: `getModelContext`, `invalidateModelContext` (model-context.ts), `resolveRoute`, `upstreamModel`, `resolveModelEndpoint` (provider.ts). The session keeps its id, the picker flags it. Gateway-dispatched requests carry `X-Boo-Source` through to the target host so attribution survives the extra hop.
+- llama-swap `peers` could federate hosts at the proxy layer instead, but was rejected for the same reasons as the provider-registry research rejected it (flat list, coupled uptime, silent ID collisions).
+
+### Fleet coordination lease (P8 — cross-service)
+The proper fix for the four-writer TOCTOU: a per-host advisory lease in the shared DB (`control_host_leases`: holder, purpose, expires_at, heartbeat) that BooControl's scheduler *requires* and BooChat/BooCoder/Arena *honor* (check-before-dispatch, or queue behind an exclusive bench lease). This touches all four services and is therefore its own batch with its own design pass. **The P3 seam is a named function, not a convention** (verification C1'): the bench runner gates every run through `acquireHostAccess(providerId, purpose): Promise<HostGrant>` — the v1 implementation is the courtesy check (inflight==0 + takeover confirmation); P8 swaps its body for the lease without touching the bench engine. P3 implementers must NOT inline the inflight check in the runner. Unattended/scheduled benches and reproducible concurrency sweeps unlock here.
+
+## 9. UI design direction
+
+Route `/control`, nav entry under Memory (ProjectSidebar bottom cluster). Sub-views as tabs within the page: **Fleet · Activity · Logs · Models · Bench · Evals · Reports**.
+
+- **Aesthetic**: dark mission-control. Host cards as instrument clusters: VRAM arc gauge, GPU temp/power readouts, model chips with state glow (amber pulse `starting`, green steady `ready`, red `error`, grey `down` with last-seen), TTL countdown rings. Orbitron (already in the font pipeline) for numerals only; Inter for prose; JetBrains Mono for logs/JSON.
+- **Motion**: framer-motion (already a dep) — spring layout transitions on model chips during swaps, count-up tweens on token totals, animated activity-feed inserts. Respect `prefers-reduced-motion`.
+- **Charts**: **ECharts** (decided 2026-06-12). Gauges, scatter, heatmaps built in — covers the VRAM arcs, speed×quality scatter, and perf timelines from one lib; dark-theme native; 5s streaming append handled via `appendData`/`setOption`. The <100KB preference is consciously traded for batteries-included breadth; import per-chart modules (`echarts/core` + needed renderers) to keep the bundle sane.
+- **Logs**: react-virtuoso tail-follow viewer (already a dep), per-source filter (proxy/upstream/model), pause-on-scroll.
+- **Inspector**: activity table (virtuoso) → capture drawer: headers table + shiki-highlighted JSON bodies + "Open in Playground" replay.
+- **Playground**: param-tweakable single-model chat + A/B compare; "Battle in Arena" handoff for full cross-examination.
+- Skills to drive the build pass: `frontend-design` (aesthetic direction), `ui-ux-pro-max` (dashboard/chart patterns), `frontend-ui-engineering` (production quality), existing theme tokens (oklch palettes) so BooControl follows the active theme.
+
+## 10. Risks
+
+| Risk | Mitigation |
+|---|---|
+| PG bloat from time-series + captures | raw/rollup split; **retention job ships in the same P1 slice as ingestion**; UNIQUE constraints prevent restart-duplication inflation; capture size caps; measured in Reports (P7) |
+| Bench/eval evicts a model in active use | v1: manual runs + takeover confirmation + embedding-first + per-host action queue. Honest limit: `inflight==0` is a courtesy gate (TOCTOU vs 3 other writers). Real fix is the P8 lease |
+| llama-swap ring-id reset breaks dedup | DB UNIQUE on (provider_id, swap_entry_id, ts) + ON CONFLICT DO NOTHING — enforced at insert, not check-then-act |
+| Ring wraps during long outage | accepted bound; `gap_suspected` event logged with reconcile delta so loss is visible |
+| SSE disconnects / host down | backoff + jitter (opencode-sse pattern); explicit connected/reconnecting/down state machine + last_seen_at in control_fleet; favorites-style "hide, never delete" for offline hosts |
+| Snapshot/delta join race | per-host monotonic seq; client discards deltas ≤ snapshot seq |
+| Perf-poller restart duplicates | watermark recovered from MAX(ts) in DB; UNIQUE (provider_id, ts) |
+| Rollup crash double-count/loss | idempotent upsert + rollup-and-delete in one transaction |
+| Attribution silently NULL | no source column until P4; P4 solves both path blockers (server fetch wrapper + gateway forward) together with the migration |
+| Sandbox escape from generated code | no-network, non-root, caps, tmpfs, --rm, labeled for orphan prune; bounded allSettled runner with finally-cleanup; gVisor as upgrade path. Residual risk accepted for single-user |
+| LLM-judge bias/noise in chat evals | fixed rubrics, temperature 0, judge version pinned per run, pairwise via Arena for tie-breaks |
+| Windows SSH fiddliness (P9 config edit) | pre-apply JSON-schema validation (config-schema.json lives in the fork), timestamped backups before every write, health-wait after restart; stackctl's flow is the reference but gets tests here |
+| Orphaned `auto:*` sessions if gateway removed | resolver treats missing gateway provider as unhealthy-not-absent: clean error, no silent mis-route to LLAMA_SWAP_URL |
+| 5s × 2 hosts perf polling forever | trivial volume (~35k rows/day raw), rolled up + pruned at 48h |
+| Three applySchema callers race on restart | startup ordering guard: control waits for server-owned `sessions` table before applying schema |