Files
boocode/openspec/changes/boocontrol/proposal.md
indifferentketchup b18de2a331 chore: snapshot working tree - pty_exited notifications + in-flight inference WIP
feat(booterm): structured pty_exited WS notifications. Plan-validated, impl-validated, code-reviewed green (contracts build clean, contracts test 29/29, booterm + web typecheck clean).

wip: in-progress inference/provider refactor (agents.ts, provider.ts, new llama-providers.ts, removed llama-args-validator), plus arena, dispatcher, compaction, schema changes.

openspec: pty-exit-notifications complete; x-agent-flags planned (not yet implemented).
2026-06-14 12:48:47 +00:00

63 lines
10 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# BooControl — a cockpit for the local AI fleet
**Status:** ACCEPTED — open decisions resolved 2026-06-11 (see "Decisions" below). Implementation gated only on P0 completion (commit + review of the multi-provider registry batch). Architecture analysis findings (S/B/C/R series) are folded into `design.md`.
## Why
BooCode talks to a fleet of llama-swap instances (Sam-desktop `100.101.41.16:8401` on the RTX 5090, embedding `100.90.172.55:8411` on the P104-100) but has zero visibility into it. Today the answers to "what model is loaded, how fast is it, what did that request actually send, why is the GPU pinned" live in three places: llama-swap's own single-instance Svelte UI (per-host, ephemeral, utilitarian), stackctl (Python, separate stack, ephemeral machines table, zero tests), and ssh + nvidia-smi. Nothing persists: llama-swap's activity log is a 1000-entry in-memory ring that dies on restart.
Meanwhile the llama-swap fork at `/opt/forks/llama-swap` already exposes everything a cockpit needs **over plain HTTP per instance**: SSE event stream (`/api/events`: model status, logs, per-request token metrics, in-flight count), system+GPU telemetry (`/api/performance`: CPU, RAM, GPU temp/VRAM/util/power), request/response captures (`/api/captures/:id`), load state (`/running`), unload (`POST /api/models/unload[/:model]`), Prometheus `/metrics`. The per-instance hard part is done. What does not exist anywhere — in llama-swap, stackctl, or any tool surveyed — is the **fleet layer**: aggregation across instances, persistent history, benchmarking (speed and quality), routing intelligence, and reports.
BooControl is that layer: a left-nav page in BooCode backed by a new host service, that matches llama-swap's UI per-instance and exceeds it fleet-wide.
## What changes
1. **`apps/control`** — new host service (Fastify + TS, port 9503, systemd `boocontrol.service`, `.env.host` pattern — the `apps/coder` precedent). Owns:
- **Fleet connectors**: one per provider from the provider registry; consumes each llama-swap's `/api/events` SSE, polls `/api/performance?after=`, `/running`.
- **Persistence** (third schema owner on the shared `boochat` DB, coder precedent, with a startup ordering guard — design §3): request activity, perf samples (with retention + rollups), model state transitions, benchmark and eval results, reports. Dedup enforced by DB UNIQUE constraints, not application checks (design §4).
- **Actions**: warm (load-via-1-token-request, the stackctl trick — llama-swap has no explicit load endpoint), unload, capture fetch. All host-mutating actions serialize through a per-host action queue (design §5). Config view/edit over SSH lands in a late phase (P9).
- **Benchmark engine**: speed sweeps (TTFT, prompt/gen tok/s vs concurrency from llama.cpp `timings`). v1 is **manual, safe-by-construction**: explicit takeover confirmation, embedding-host-first defaults, no unattended scheduling. Unattended scheduling requires the fleet coordination lease (P8).
- **Eval engine**: chat quality (LLM-as-judge suites; Arena handles pairwise battles already) and code quality (sandboxed execution of generated code in ephemeral no-network containers).
- **Routing layer** (late phases): advisory scoring feeding the model picker (P6), then OpenAI-compatible `auto:*` policy gateway models (P7).
2. **`apps/server`** — `registerControlProxy` (`/api/control/*` HTTP + WS relay to :9503; deliberate clone of `routes/coder-proxy.ts` — Rule of Three unmet, both files carry a keep-in-sync comment).
3. **`packages/contracts`** — new WS frame types for fleet status / activity / perf / log streaming. Three-location sync (contracts schema → server loose union → web strict union) executed in that order, with the contracts drift test extended to cover the new frames.
4. **`apps/web`** — `/control` route + nav entry (Memory-page precedent: `App.tsx`, `ProjectSidebar.tsx`, `pages/Control.tsx`), with sub-views: Fleet, Activity, Logs, Models, Benchmarks, Evals, Reports. Dark "mission control" aesthetic; Orbitron (already in the font pipeline) for instrumentation numerals; framer-motion (already a dep) for state-transition animation; react-virtuoso (already a dep) for live logs. The control stream is a **second app-level WS singleton** (`useControlStream` targets the proxied `/api/control/ws`, not the `/api/ws/user` channel) with its own context + connection guard. Chart library: see design.md §9.
5. **Per-consumer attribution**: BooChat / BooCoder / Arena inject an `X-Boo-Source` header on inference requests so the cockpit can attribute tokens and load per consumer. **This is its own phase (P4), not a P1 column**: the server's AI-SDK provider cache bakes headers in at construction (needs a per-turn fetch wrapper) and the coder's local gateway strips unknown headers (needs explicit forwarding). The `control_requests.source` column is added by the P4 migration, when it can actually be populated — no NULL-forever rows.
## Prerequisite batch
**Multi-llama-swap provider registry** (`openspec/changes/multi-llama-swap-providers-model-favorites/`) — implemented in the working tree (P0P8 of that batch checked off; UI/route tests and smoke tests remain). BooControl keys every host-scoped row on **`LlamaProvider.id`** (`"sam-desktop"`, `"embedding"` — the actual shipped contract `{id, label, baseUrl, kind}` in `packages/contracts/src/llama-providers.ts`). That batch must be **committed and reviewed** before BooControl P1 starts; this proposal does not duplicate its scope.
> Historical note: earlier drafts of this proposal assumed a `{name, baseUrl, sidecarUrl?}` registry shape. The shipped contract uses `id` (not `name`), and the llama-sidecar has since been removed entirely — there is no sidecar URL, port 8402, or per-agent-flags concept anywhere in the system. All control-plane keys are `provider_id`.
## The two options considered
- **Option A — built into BooCode (monorepo `apps/control` + `apps/web` page).** Chosen. Reuses: theme system (18 palettes), WS broker + contracts, coder-proxy pattern, Postgres + schema-owner precedent, framer-motion/virtuoso/shiki/lucide, Arena for playground battles, the provider registry itself, deploy muscle memory. One click from where Sam already lives.
- **Option B — standalone dockerized app at `/opt/boocontrol` → boocontrol.indifferentketchup.com.** Rejected as the *starting point*. The service boundary keeps a weaker form of Option B alive: `apps/control` has its own HTTP API and own schema, **but it does have a compile-time dependency on `@boocode/contracts`** (provider registry types + WS frames) — genuine extraction to a standalone repo would require extracting or vendoring the contracts package too. The domain itself is achievable cheaply at any time: point a Caddy/Authelia vhost at the boocode container with a rewrite to `/control` (P9).
## Non-goals
- Replacing stackctl wholesale (its Bifrost/agents/flows/personas serve other projects; only its llama-swap management is superseded).
- Managing non-llama-swap inference engines in v1 (vLLM, Ollama, infinity-emb — the connector interface should not preclude them; reopen when a second engine kind is actually added).
- Multi-user/auth (Authelia at the proxy, as everywhere else).
- Prometheus/Grafana — BooControl persists its own samples; the `/metrics` endpoints stay available for an external stack if ever wanted.
- Solving cross-process GPU arbitration in v1. BooChat, BooCoder, Arena, and BooControl are four uncoordinated writers to the same hosts; v1 bench/eval is manual + confirmed precisely because the `inflight==0` gate alone is a TOCTOU race. The real fix (fleet lease) is P8.
## Decisions (resolved 2026-06-11)
1. **Page vs pane** → page first. A slim `control` pane kind is cheap later once components exist (P9).
2. **Separate `apps/control` vs fold into `apps/coder`****separate service.** Blast-radius isolation from agent dispatch; Arena stays in coder and is reused, not moved. Cost accepted: third `applySchema` caller (mitigated by startup ordering guard, design §3) and a proxy clone (deliberate, S4/A6).
3. **SSH config-editing scope** → deferred to P9. Key lives in `secrets/` (gitignored), per the Gitea deploy-key precedent. Pre-apply schema validation + timestamped backup + health-wait are mandatory parts of that design.
4. **Eval suites** → both chat (LLM-as-judge, MT-bench-style rubrics) and code (sandboxed pass@1) are in scope (P5). Suite program (resolved 2026-06-12): agent coding tasks, chat assistant quality, long-context retrieval, utility calls (titles/summaries) — in that priority order. Judge = strongest local model by default, frontier judge optional later. Sandbox = hardened Docker (`--network none`, non-root, caps, tmpfs); gVisor is the upgrade path.
5. **Routing** → advisory scores first (P6), then **commit to the live `auto:*` gateway** (P7). BooChat adopts via a registry entry; orphaned `auto:*` session rows are explicitly handled (design §8).
6. **llama-swap host config changes** → enable `captureBuffer` and review `metricsMaxInMemory` as a documented manual op task in P2. No apiKeys (single-user Tailscale mesh).
7. **Retention windows** → raw perf 48h → 5m rollups 90d; activity 90d; captures 256KB/row cap + total budget prune. All configurable via `.env.host`.
8. **Standalone domain** → later (P9, optional). The service boundary is kept clean enough to allow it.
## Known hard parts (called out, not hand-waved)
- **Attribution is not a "small touch"** — it has its own phase (P4) because both inference paths block it today (design §7).
- **Bench results under live traffic are not reproducible** — `inflight==0` is a start gate, not a hold gate. v1 accepts this (manual runs, takeover confirmation, embedding-first); P8 fixes it properly.
- **Snapshot/delta consistency** on the control WS needs explicit sequencing (design §4) — without it, a late-joining browser can apply a stale snapshot over a newer delta.
- **Code-eval sandboxing runs LLM-generated code on the Tailscale hub.** Hardened Docker is the v1 posture; the residual risk is accepted for a single-user system, gVisor if that ever changes (design §10 risks).