# Plan: boocontrol

## Folder

`openspec/changes/boocontrol/`
## Task count

46 (P0: 2, P1: 15, P2: 5, P3: 5, P4: 4, P5: 4, P6: 2, P7: 4, P8: 1 outline, P9: 4 outline)
## Size

Large -- 10-phase program spanning 4 apps + contracts, ~12 new DB tables, 5 new WS frame types, new host service, routing gateway, eval sandbox
## Validation

- `openspec validate boocontrol`: skipped (pre-spec-format acceptance; validation against openspec CLI format not applicable to accepted spec)
- Adversarial validator: 18 findings (3 CRITICAL folded, 7 MINOR folded, 8 CONFIRMED)
- Junior developer: 24 findings (7 clarifying folded, 3 polish noted, 2 specialist handoffs deferred, 12 confirmed)

---
## Findings folded into this plan
**Critical (folded):**
- **V1 (jitter):** The `opencode-sse.ts` pattern referenced in design S4 has backoff + circuit-breaker but NO jitter. The BooControl SSE connector must add jitter explicitly (random 0-50% of computed delay) to avoid thundering-herd reconnections across N hosts.
- **V7 (waitForTable):** No `waitForTable` function exists anywhere in the codebase. P1 must create it in `apps/control/src/db.ts` as an explicit task.
- **V11 (schema indexes):** P1 schema creates tables but defines zero indexes. The retention job queries `control_requests` by `(provider_id, ts)`, the perf poller recovers watermarks via `MAX(ts)`, and the activity feed sorts by `ts`. Without indexes these queries scan full tables as rows accumulate (~35k/day raw). Add explicit index tasks for `control_requests(provider_id, ts)`, `control_perf_samples(provider_id, ts)`, `control_model_events(provider_id, ts)`.

**Clarifying (folded):**
- **JD1 (server loose union):** Control frames skip the server's broker entirely (they relay raw bytes through the proxy). Adding them to the server's `InferenceFrame` union is dead code. Skip the server union update; document that control frames use a 2-location pattern (contracts + web strict union only).
- **JD3 (control_hosts seed):** Seed `os` and `gpu_label` as hardcoded display metadata (`'Windows'`/`'RTX 5090 32GB'`, `'Linux'`/`'P104-100 8GB'`); `ssh_*`, `config_path`, `restart_cmd` are NULL until P9.
- **JD5 (@fastify/websocket):** Add `@fastify/websocket` to P1 scaffolding dependencies.
- **JD6 (capture cap):** The 256KB capture cap is application-enforced in the capture-fetch handler, not a DB constraint.
- **JD7 (acquireHostAccess):** Scaffold `acquireHostAccess` in P1 as a no-op (`{ok: true}`) so P3 calls it and P8 swaps its body.
- **JD8 (gap_suspected):** Store as a row in `control_model_events` with `model = '*'` and `state = 'gap_suspected'`, timestamps in `detail` JSONB.
- **JD14 (schema overview):** Only create P1 tables in P1; annotate the design S3 schema overview with phase tags.
- **JD16 (P1 source):** P1 activity feed shows `source = NULL`; per-consumer filtering lands in P4.

**Minor (folded):**
- **V2 (drift test):** The existing `ws-frames.test.ts` only checks `KNOWN_FRAME_TYPES` vs `WsFrameSchema` alignment, not web strict union sync. Add a comment to the P1 task noting web union sync is manual.
- **V3 (blast radius, corrected by plan validation F1/F4):** `upstreamModel` has exactly 1 production importer (`stream-phase-adapter.ts:16`), not ~5 and not 28/13. The other provider-module consumers import `resolveModelProvider`/`resolveModelEndpoint`/`resolveRoute`/`getModelContext` instead. The additive-change constraint stands; the real P7 blast surface is `resolveModelProvider`'s 6 direct callers propagating to ~10 downstream call sites.
- **V6 (local-gateway):** `local-gateway.ts` omits `X-Boo-Source` from the forwarded headers rather than actively stripping it. The fix is the same either way.
- **JD4 (proxy WS path):** The control proxy WS path is static (`/api/control/ws`), not parameterized like coder-proxy's per-session path.

**New findings (folded):**
- **V12 (P7 caller audit detail):** The prior plan says "audit all 5 callers" but doesn't specify what each caller needs. Added per-caller change specs: `getModelContext`/`invalidateModelContext` (model-context.ts) must handle gateway `baseUrl`; `resolveRoute` (provider.ts) must return `{route: 'gateway'}`; `upstreamModel` (provider.ts) must add gateway branch before swap fallback; `resolveModelEndpoint` (provider.ts) must handle gateway headers.
- **V13 (ECharts theme integration):** The plan says "dark-theme tokens from active oklch palette" but doesn't specify how. Added: use `echarts.init(dom, themeObject)` with a theme object built from the CSS custom properties (`--background`, `--foreground`, `--muted`, `--accent`) via `getComputedStyle`. One theme-build helper, not per-chart.
- **V14 (action queue semantics):** "unload-during-bench -> takeover confirmation" needs explicit HTTP semantics. Added: the action endpoint returns 409 with `{error: 'bench in progress', requiresConfirmation: true}`; the client shows a confirmation dialog and re-submits with `?confirm=true`.
- **V15 (capture total budget default):** The plan mentions "total budget prune" but gives no default. Added: 50MB default, configurable via `CAPTURE_BUDGET_MB` env var.
- **V16 (openevals reference verified):** `/opt/forks/openevals` exists and contains `js/`, `python/`, `sandbox/` directories. The sandbox pattern (Docker hardened containers) is confirmed available.
- **V17 (P7 gateway error shape):** `InferenceRoute` extension needs explicit error representation. Added: `'gateway' | 'gateway_error'` variants; `gateway_error` carries `{reason: 'offline' | 'unhealthy'}`. The 5 callers must handle both.
- **V18 (SSE connector event shape delta):** The opencode-sse.ts pattern is for the opencode SDK's `Event` type; BooControl consumes raw llama-swap SSE (`/api/events`) with a different envelope (`modelStatus | logData | metrics | inflight`). The reconnect/backoff/circuit-breaker pattern ports directly; the event parsing is new code, not a port. Noted in P1.4.

**Junior developer new findings (folded):**
- **JD17 (schema index timing):** Indexes should be created in the same P1 task as the tables they index, not as a separate phase. Consolidated into P1.3.
- **JD18 (action queue depth cap message):** When the queue is full (depth=4), the error message should include the current queue contents so the user knows what's pending. Added to P2.1 spec.
- **JD19 (acquireHostAccess signature):** The function signature must be `acquireHostAccess(providerId: string, purpose: string): Promise<{ok: boolean, reason?: string}>` -- explicit in P1.14, called by P3.1.
- **JD20 (snapshot rebuild on restart):** When the control service restarts, the in-memory fleet state is lost. The WS endpoint must rebuild from DB (control_model_events for latest state, control_requests for last-seen activity) before serving snapshots. Added to P1.6.
- **JD21 (activity feed sort order):** The live activity feed must sort by `ts DESC` (newest first) with react-virtuoso's `followOutput="bottom"` for live insertion. Added to P1.12.
- **JD22 (ECharts bundle impact):** Per-chart `echarts/core` imports add ~15-25KB per chart type (gauge, line, scatter). With 3-4 charts in P1, the incremental bundle is ~60-100KB. Acceptable given the batteries-included tradeoff documented in design S9. Noted in P1.13.
- **JD23 (P7 provider.ts callers -- compile check):** All 5 callers must compile unchanged for the new `InferenceRoute` variant. The `upstreamModel` function's implicit else branch (line 192) currently always reaches `getSwapProvider` -- the gateway variant must be handled before it. Added explicit check.
- **JD24 (deploy docs in P1.1):** The systemd unit file and deploy docs must include the `BOOCONTROL_URL` env var (for apps/server's proxy) and `DATABASE_URL` (shared boochat DB). Added to P1.1 spec.
---
## P0 -- prerequisite gate (separate batch: multi-llama-swap provider registry)
**Gate:** P0 must be committed and reviewed before P1 starts. BooControl keys every host-scoped row on `LlamaProvider.id` from `packages/contracts/src/llama-providers.ts`. The committed contract is the foundation.
- [ ] Finish remaining tasks in `openspec/changes/multi-llama-swap-providers-model-favorites/tasks.md`: favorites hide-not-delete UI/route tests; smoke test sam-desktop + embedding (+ DeepSeek config).
- [ ] Sam reviews and commits the batch (currently working-tree only).
---
## P1 -- read-only cockpit
**Demo:** Watch both hosts live (models, swaps, VRAM/temp, request feed) while chatting.
### Scaffold + DB
- [x] **P1.1** Scaffold `apps/control`: new directory, Fastify + `@fastify/websocket` + `postgres` + `zod` dependencies, TS NodeNext, `.env.example`/`.env.host`, port 9503, `/api/health` endpoint, systemd unit `boocontrol.service`. Deploy docs in root CLAUDE.md (include `BOOCONTROL_URL` for apps/server proxy, `DATABASE_URL` for shared boochat DB). Pattern: `apps/coder/src/index.ts` for Fastify bootstrap, `apps/coder/src/db.ts` for `getSql`/`applySchema`/`pingDb`/`closeDb`.
- [x] **P1.2** `apps/control/src/db.ts` with `applySchema` + `waitForTable` helper. `waitForTable(sql, tableName, timeoutMs)` polls `information_schema.tables WHERE table_name = $1` with exponential backoff (100ms base, 2s cap); throws on timeout so systemd `Restart=on-failure` retries. Call `waitForTable(sql, 'sessions', 30_000)` before `applySchema()`. Pattern: `apps/coder/src/db.ts` for the `getSql`/`applySchema`/`pingDb`/`closeDb` shape; `waitForTable` is new (no existing implementation).
- [x] **P1.3** `apps/control/src/schema.sql` -- P1 tables only (do NOT create `bench_*`/`eval_*`/`route_policies`/`control_reports` tables yet):
  - `control_hosts`: `provider_id TEXT PK` (FK-by-convention to `LlamaProvider.id`), `ssh_host TEXT`, `ssh_user TEXT`, `ssh_key_path TEXT`, `config_path TEXT`, `restart_cmd TEXT`, `os TEXT`, `gpu_label TEXT`, `enabled BOOLEAN DEFAULT true`. Seed: `INSERT INTO control_hosts (provider_id, os, gpu_label) VALUES ('sam-desktop', 'Windows', 'RTX 5090 32GB'), ('embedding', 'Linux', 'P104-100 8GB') ON CONFLICT DO NOTHING`. SSH/config columns NULL until P9.
  - `control_requests`: `id BIGSERIAL PK`, `provider_id TEXT`, `swap_entry_id INT`, `ts TIMESTAMPTZ`, `model TEXT`, `req_path TEXT`, `status_code INT`, `duration_ms INT`, `cache_tokens INT`, `input_tokens INT`, `output_tokens INT`, `prompt_tps REAL`, `gen_tps REAL`, `has_capture BOOLEAN`, `capture JSONB`. `UNIQUE (provider_id, swap_entry_id, ts)`. NO `source` column (P4 adds it). Index: `CREATE INDEX IF NOT EXISTS idx_control_requests_provider_ts ON control_requests (provider_id, ts DESC)`.
  - `control_perf_samples`: `provider_id TEXT`, `ts TIMESTAMPTZ`, `gpu JSONB`, `sys JSONB`. `UNIQUE (provider_id, ts)`. Index: `CREATE INDEX IF NOT EXISTS idx_control_perf_samples_provider_ts ON control_perf_samples (provider_id, ts DESC)`.
  - `control_perf_rollup_5m`: `provider_id TEXT`, `bucket TIMESTAMPTZ`, `gpu_agg JSONB`, `sys_agg JSONB`. `UNIQUE (provider_id, bucket)`.
  - `control_model_events`: `provider_id TEXT`, `model TEXT`, `state TEXT`, `ts TIMESTAMPTZ`, `detail JSONB`. `UNIQUE (provider_id, model, state, ts)`. Index: `CREATE INDEX IF NOT EXISTS idx_control_model_events_provider_ts ON control_model_events (provider_id, ts DESC)`.
  - All use `clock_timestamp()` for `created_at`; JSONB via `sql.json(value as never)`.
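The `waitForTable` helper described in P1.2 could look roughly like the sketch below. It is a sketch under stated assumptions, not the shipped code: the plan's helper takes the `sql` client, while this version takes a hypothetical `probe` callback (standing in for the `information_schema.tables` query) so the backoff logic can be exercised without a database. `TableProbe` and `backoffDelayMs` are illustrative names.

```typescript
// Hypothetical probe: resolves to true once the table is visible.
// In apps/control this would wrap the information_schema.tables lookup.
type TableProbe = () => Promise<boolean>;

// Delay before retry `attempt` (0-based): 100ms base, doubling, capped at 2s.
function backoffDelayMs(attempt: number, baseMs = 100, capMs = 2_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until the probe succeeds; throws once the next wait would pass the deadline.
async function waitForTable(
  probe: TableProbe,
  tableName: string,
  timeoutMs: number,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  for (let attempt = 0; ; attempt++) {
    if (await probe()) return;
    const delay = backoffDelayMs(attempt);
    if (Date.now() + delay > deadline) {
      throw new Error(`waitForTable: "${tableName}" not found within ${timeoutMs}ms`);
    }
    await sleep(delay);
  }
}
```

At boot this would be awaited with the `'sessions'` table and a 30s timeout before `applySchema()`, so a timeout surfaces as a thrown error that systemd's `Restart=on-failure` can retry.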
### Connectors + ingestion
- [x] **P1.4** Fleet connector per enabled host: SSE client consuming `GET /api/events` with exponential backoff (base 1s, max 30s) + **jitter** (random 0-50% of computed delay) + circuit-breaker (6 consecutive failures -> give-up). Port the `opencode-sse.ts` `reconnectDecision` function (add jitter to the BooControl copy). Note: the reconnect/backoff/circuit-breaker pattern ports directly from `opencode-sse.ts`; the event parsing is new code because llama-swap's SSE envelope (`modelStatus | logData | metrics | inflight`) differs from the opencode SDK's `Event` type. Explicit `connected | reconnecting | down` liveness state machine + `last_seen_at` in-memory. On reconnect, reconcile via `GET /api/metrics` (full ring) with `INSERT ... ON CONFLICT DO NOTHING` (never check-then-act). Gap detection: if oldest reconcile entry is newer than newest persisted entry for that provider, insert `gap_suspected` model event with `model='*'` and timestamps in `detail` JSONB.
- [x] **P1.5** Perf poller: `GET /api/performance?after=<watermark>` every 5s per host. Watermark recovered from `MAX(ts)` per provider in `control_perf_samples` on restart. NULL watermark (fresh install) -> omit `after` param, ingest returned window (UNIQUE constraint makes over-fetch harmless).
- [x] **P1.6** In-memory fleet state with per-host monotonic `seq` counter, incremented on every mutation. WS endpoint `/api/ws/control`: snapshot-on-join carrying current seqs + seq-stamped deltas. Client rule: buffer pre-snapshot deltas, replay after snapshot applying only `seq > snapshot_seq`. On service restart, rebuild fleet state from DB before serving snapshots: query `control_model_events` for latest model state per provider, `control_requests` for last activity, `control_perf_samples` for latest perf sample.
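The reconnect policy in P1.4 (exponential backoff from 1s to 30s, 0-50% jitter, give-up after 6 consecutive failures) can be sketched as a pure function. The signature below is hypothetical and is not the actual `reconnectDecision` from `opencode-sse.ts`. It also assumes the jitter is added on top of the computed delay, which the plan does not spell out. The `random` parameter exists only so the function can be tested deterministically.

```typescript
type ReconnectDecision =
  | { action: "retry"; delayMs: number }
  | { action: "give_up" };

// consecutiveFailures counts failed attempts since the last successful connect (1-based).
function reconnectDecision(
  consecutiveFailures: number,
  random: () => number = Math.random,
  baseMs = 1_000,
  maxMs = 30_000,
  maxFailures = 6,
): ReconnectDecision {
  // Circuit breaker: stop retrying once the failure streak reaches the limit.
  if (consecutiveFailures >= maxFailures) return { action: "give_up" };

  // Exponential backoff: 1s, 2s, 4s, ... capped at maxMs.
  const exponent = Math.max(0, consecutiveFailures - 1);
  const computed = Math.min(maxMs, baseMs * 2 ** exponent);

  // Jitter: a random 0-50% of the computed delay (assumed additive here).
  const jitter = computed * 0.5 * random();
  return { action: "retry", delayMs: Math.round(computed + jitter) };
}
```

Because each host draws its own jitter, N connectors that lose the same upstream at the same moment should spread their reconnects instead of retrying in lockstep.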
### Retention (same P1 slice)
- [x] **P1.7** Retention job: daily in-process timer. Rollup as idempotent upsert (`INSERT INTO control_perf_rollup_5m ... ON CONFLICT (provider_id, bucket) DO UPDATE` recomputed from raw). Delete raw only after covering buckets committed, in chunked transactions (one per provider per 1-hour window, never one mega-transaction). Activity prune > 90d. Capture size: 256KB per-row cap enforced in application code before INSERT (not a DB constraint); total budget prune with 50MB default, configurable via `CAPTURE_BUDGET_MB` env var. All windows configurable via `.env.host`.
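Two small pure helpers illustrate the retention arithmetic in P1.7: flooring a sample timestamp to its 5-minute rollup bucket, and splitting a retention range into the 1-hour windows that each get their own transaction. Both function names are hypothetical, and the sketch leaves out the SQL upsert and delete statements themselves.

```typescript
const FIVE_MIN_MS = 5 * 60 * 1000;
const HOUR_MS = 60 * 60 * 1000;

// Start of the 5-minute bucket that contains `ts` (aligned to the Unix epoch).
function bucketStart5m(ts: Date): Date {
  return new Date(Math.floor(ts.getTime() / FIVE_MIN_MS) * FIVE_MIN_MS);
}

interface TimeWindow {
  start: Date;
  end: Date; // exclusive
}

// Splits [from, to) into consecutive windows of at most one hour.
// The retention job would run one rollup + delete transaction per window per provider.
function hourWindows(from: Date, to: Date): TimeWindow[] {
  const windows: TimeWindow[] = [];
  for (let t = from.getTime(); t < to.getTime(); t += HOUR_MS) {
    windows.push({
      start: new Date(t),
      end: new Date(Math.min(t + HOUR_MS, to.getTime())),
    });
  }
  return windows;
}
```

Since the bucket start is a pure function of the timestamp, re-running the job over the same window recomputes the same `(provider_id, bucket)` keys, which is what lets the `ON CONFLICT ... DO UPDATE` upsert stay idempotent.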
### Contracts (build FIRST)
- [x] **P1.8** Add 5 frame types to `packages/contracts/src/ws-frames.ts`:
  - `control_fleet` -- full snapshot on join + seq-stamped state deltas (hosts, liveness, models, states, ttl, inflight)
  - `control_activity` -- new request rows (live feed)
  - `control_perf` -- appended samples per host
  - `control_log` -- `{provider_id, source: proxy|upstream, line}` batches
  - `control_job` -- bench/eval run progress events

  Add to both `WsFrameSchema` discriminated union AND `KNOWN_FRAME_TYPES` array. Rebuild package (`pnpm -C packages/contracts build`).

  **Note:** Control frames use a 2-location sync pattern (contracts + web strict union only). They skip the server's `InferenceFrame` union because they never flow through the server's broker. The web strict union is the wire-format gate; missing it silently drops frames at JSON parse.

  **Drift test note:** The existing `ws-frames.test.ts` checks `KNOWN_FRAME_TYPES` vs `WsFrameSchema` alignment. There is no automated check for web strict union sync -- that alignment is manual and verified by the implementer. Add a comment in the test noting this limitation.
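To make the 2-location sync concrete, the sketch below shows what a drift comparison between the two lists amounts to. It is illustrative only: the plan keeps web union sync manual, `CONTROL_FRAME_TYPES` covers just the five new entries (not the full `KNOWN_FRAME_TYPES` array), and `findDrift` is a hypothetical helper rather than something in `ws-frames.test.ts`.

```typescript
// The five control frame types from P1.8 (a subset of the real KNOWN_FRAME_TYPES).
const CONTROL_FRAME_TYPES = [
  "control_fleet",
  "control_activity",
  "control_perf",
  "control_log",
  "control_job",
] as const;

type ControlFrameType = (typeof CONTROL_FRAME_TYPES)[number];

interface Drift {
  missingInWeb: string[]; // declared in contracts, absent from the web union
  extraInWeb: string[]; // present in the web union, unknown to contracts
}

// Compares the contracts list against the type names handled by the web strict union.
function findDrift(contracts: readonly string[], web: readonly string[]): Drift {
  const webSet = new Set(web);
  const contractSet = new Set(contracts);
  return {
    missingInWeb: contracts.filter((t) => !webSet.has(t)),
    extraInWeb: web.filter((t) => !contractSet.has(t)),
  };
}

// Example: a web union that forgot control_job would silently drop those frames.
const exampleWebUnion: ControlFrameType[] = [
  "control_fleet",
  "control_activity",
  "control_perf",
  "control_log",
];
const exampleDrift = findDrift(CONTROL_FRAME_TYPES, exampleWebUnion);
```

Here `exampleDrift.missingInWeb` holds the one type the web side would drop, which is the failure mode the note above warns about.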
### Server proxy
- [x] **P1.9** `apps/server/src/routes/control-proxy.ts`: `registerControlProxy(app, boocontrolOrigin)` following the same structure as `registerCoderProxy` but with a static WS path `/api/control/ws` (not parameterized per-session). HTTP catch-all at `/api/control/*`. Add keep-in-sync comment in both `coder-proxy.ts` and `control-proxy.ts`. `BOOCONTROL_URL` env var. Register in `apps/server/src/index.ts`.
### Web UI
- [x] **P1.10** Web: `/control` route in `App.tsx`, nav entry in `ProjectSidebar.tsx` (under Memory cluster, `Radio` icon from lucide), `pages/Control.tsx` shell with Fleet + Activity tabs. `useControlStream` as a second app-level WS singleton (own React context + connection guard, targets proxied `/api/control/ws`). Client discards deltas with `seq <= snapshot_seq`. Activity feed note: shows `source = NULL` in P1; per-consumer breakdown lands in P4.
- [x] **P1.11** Fleet tab: host cards as instrument clusters. State chips with color/glow (amber pulse `starting`, green steady `ready`, red `error`, grey `down` with last-seen relative time). VRAM/temp/power readouts. TTL countdown rings. Dark mission-control aesthetic. Orbitron for numerals, Inter for prose.
- [x] **P1.12** Activity feed: react-virtuoso tail-follow viewer (already a dep) with `followOutput="bottom"` for live insertion, `ts DESC` sort order. Filter chips for model and host. Pause-on-scroll toggle.
- [x] **P1.13** Charts: integrate ECharts (per-chart module imports via `echarts/core` + needed renderers). Dark theme: build a theme object from CSS custom properties (`--background`, `--foreground`, `--muted`, `--accent`) via `getComputedStyle(document.documentElement)` and pass to `echarts.init(dom, theme)`. One `buildEChartsTheme()` helper, not per-chart. Incremental bundle impact ~60-100KB for 3-4 chart types (gauge, line, scatter) -- acceptable per design S9 tradeoff.
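A possible shape for the `buildEChartsTheme()` helper from P1.13 is sketched below. To keep it runnable outside a browser, it takes a hypothetical `readVar` callback instead of calling `getComputedStyle` directly; in the web app that callback would be something like `(name) => getComputedStyle(document.documentElement).getPropertyValue(name)`. The mapping of the four tokens onto theme fields is an assumption, not the design's actual palette mapping.

```typescript
// Reads one CSS custom property, e.g. "--background".
type CssVarReader = (name: string) => string;

interface AxisTheme {
  axisLine: { lineStyle: { color: string } };
  axisLabel: { color: string };
  splitLine: { lineStyle: { color: string } };
}

interface ChartTheme {
  backgroundColor: string;
  textStyle: { color: string };
  color: string[]; // series palette
  title: { textStyle: { color: string } };
  legend: { textStyle: { color: string } };
  categoryAxis: AxisTheme;
  valueAxis: AxisTheme;
}

// Builds one theme object shared by every chart.
function buildEChartsTheme(readVar: CssVarReader): ChartTheme {
  const background = readVar("--background").trim();
  const foreground = readVar("--foreground").trim();
  const muted = readVar("--muted").trim();
  const accent = readVar("--accent").trim();

  const axis: AxisTheme = {
    axisLine: { lineStyle: { color: muted } },
    axisLabel: { color: foreground },
    splitLine: { lineStyle: { color: muted } },
  };

  return {
    backgroundColor: background,
    textStyle: { color: foreground },
    color: [accent, foreground, muted],
    title: { textStyle: { color: foreground } },
    legend: { textStyle: { color: foreground } },
    categoryAxis: axis,
    valueAxis: axis,
  };
}
```

The returned object is what would be passed as the second argument to `echarts.init(dom, theme)`; building it once and reusing it keeps chart components free of palette logic.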
### Host-access seam
- [x] **P1.14** Create `apps/control/src/services/host-access.ts` with `acquireHostAccess(providerId: string, purpose: string): Promise<{ok: boolean, reason?: string}>`. V1 body: no-op returning `{ok: true}`. This is the P8 seam -- P8 swaps the body for a DB lease without touching the bench engine. Export for P3.1 to import.
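The seam is small enough to show in full. The function signature follows P1.14; the `runGuarded` caller is a hypothetical stand-in for the bench runner, included only to show where the gate sits.

```typescript
type HostAccessResult = { ok: boolean; reason?: string };

// V1 body: a no-op that always grants access. P8 would replace this body
// with a DB lease; the signature is meant to stay the same.
async function acquireHostAccess(
  providerId: string,
  purpose: string,
): Promise<HostAccessResult> {
  void providerId; // unused until the lease lands
  void purpose;
  return { ok: true };
}

// Hypothetical caller: every run goes through the gate, never an inline inflight check.
async function runGuarded(
  providerId: string,
  run: () => Promise<void>,
): Promise<HostAccessResult> {
  const access = await acquireHostAccess(providerId, "bench");
  if (!access.ok) return access; // denied: skip the run, surface the reason
  await run();
  return access;
}
```

Because callers only see `{ok, reason}`, swapping the body in P8 should not require touching the bench engine.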
### Tests
- [x] **P1.15** Tests: connector dedup/reconcile + gap detection as pure helpers (`turn-guard.ts` pattern); liveness state machine transitions; retention idempotency (re-run same window produces identical rollups); seq logic (buffer, discard stale, apply snapshot). DB tests `describe.runIf(process.env.DATABASE_URL)`.
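The seq logic named above (buffer, discard stale, apply snapshot) is the kind of pure helper these tests would target. A minimal sketch for a single host follows; `SeqGate` and `Delta` are hypothetical names, and the real client tracks one counter per host.

```typescript
interface Delta {
  seq: number;
  payload: unknown;
}

// Decides which deltas may be applied, following the P1.6 client rule:
// buffer deltas that arrive before the snapshot, then apply only seq > snapshot_seq.
class SeqGate {
  private snapshotSeq: number | null = null;
  private buffered: Delta[] = [];

  // Returns the deltas that should be applied right now (possibly none).
  onDelta(delta: Delta): Delta[] {
    if (this.snapshotSeq === null) {
      this.buffered.push(delta); // no snapshot yet: hold it
      return [];
    }
    return delta.seq > this.snapshotSeq ? [delta] : []; // drop stale deltas
  }

  // Records the snapshot seq and returns buffered deltas newer than it, in seq order.
  onSnapshot(seq: number): Delta[] {
    this.snapshotSeq = seq;
    const replay = this.buffered
      .filter((d) => d.seq > seq)
      .sort((a, b) => a.seq - b.seq);
    this.buffered = [];
    return replay;
  }
}
```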
---
## P2 -- hands on the controls
**Demo:** Unload from UI, watch the swap stream, open a capture.
- [x] **P2.1** Per-host FIFO action queue in the control service. Actions: warm (1-token `POST /v1/chat/completions` with bare wire ID), unload one/all (`POST /api/models/unload/:model` or `/api/models/unload`). Serialize through single FIFO queue per `provider_id`. Unload-during-bench -> return 409 with `{error: 'bench in progress', requiresConfirmation: true}`; client shows confirmation dialog and re-submits with `?confirm=true`. Reject submissions while host is `down` ("host offline" toast). Cap depth (4) with reject-on-full; error response includes current queue contents so the user knows what's pending. Re-check liveness on dequeue + skip stale actions (design S5). Pattern: `arena-runner.ts` `advanceChain` promise-chain + read-fresh-state-or-skip.
- [x] **P2.2** Optimistic UI off `control_fleet` frames only. No local emits after API calls (event-dedup discipline per CLAUDE.md). The API call triggers a server-side mutation that publishes a `control_fleet` delta; the frontend updates from the WS frame, not from a local state change.
- [x] **P2.3** Logs tab: relay `/api/events` logData -> `control_log` frame. In-memory 2k-line tail buffer per host for late joiners. React-virtuoso tail-follow viewer with per-source filter (proxy/upstream/model) + pause-on-scroll.
- [x] **P2.4** Inspector: activity table (virtuoso) -> capture drawer. `GET /api/captures/:id` via control service, decode base64, persist trimmed copy (256KB cap enforced in application code before INSERT), render with shiki-highlighted JSON. "Open in Playground" stub (links to P3).
- [x] **P2.5** Op task (manual, documented in design): enable `captureBuffer` + review `metricsMaxInMemory` on both hosts' llama-swap configs.
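The admission rules for the P2.1 action queue (reject while the host is down, require confirmation for an unload during a bench, cap depth at 4 and report what is pending) can be sketched as a small class. All type and method names here are hypothetical, and the sketch models only admission and dequeue, not the HTTP layer: in the plan, `bench_in_progress` maps to the 409 response and `confirm` to the `?confirm=true` re-submit.

```typescript
type ActionKind = "warm" | "unload_one" | "unload_all";

interface QueuedAction {
  kind: ActionKind;
  model?: string;
}

type SubmitResult =
  | { ok: true; position: number }
  | { ok: false; reason: "host_offline" }
  | { ok: false; reason: "bench_in_progress"; requiresConfirmation: true }
  | { ok: false; reason: "queue_full"; pending: QueuedAction[] };

interface SubmitContext {
  hostDown: boolean;
  benchRunning: boolean;
  confirm: boolean; // true when the client re-submitted after the confirmation dialog
}

// One instance per provider_id.
class HostActionQueue {
  private pending: QueuedAction[] = [];
  private readonly maxDepth: number;

  constructor(maxDepth = 4) {
    this.maxDepth = maxDepth;
  }

  submit(action: QueuedAction, ctx: SubmitContext): SubmitResult {
    if (ctx.hostDown) return { ok: false, reason: "host_offline" };
    const isUnload = action.kind !== "warm";
    if (isUnload && ctx.benchRunning && !ctx.confirm) {
      return { ok: false, reason: "bench_in_progress", requiresConfirmation: true };
    }
    if (this.pending.length >= this.maxDepth) {
      // Include current contents so the user can see what is already queued.
      return { ok: false, reason: "queue_full", pending: [...this.pending] };
    }
    this.pending.push(action);
    return { ok: true, position: this.pending.length };
  }

  // Re-checks liveness at dequeue time: a stale action for a down host is dropped.
  next(hostDown: boolean): QueuedAction | undefined {
    const action = this.pending.shift();
    return hostDown ? undefined : action;
  }
}
```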
---
## P3 -- playground + speed bench (manual, safe-by-construction)
**Demo:** TTFT-vs-concurrency curves for two quants, run by hand without disturbing a live chat.
- [x] **P3.1** Playground tab: model select (grouped picker from provider registry), param controls, streaming chat, side-by-side A/B compare (two `ModelBubble` components in parallel, same prompt, different models). "Battle in Arena" handoff link (opens Arena dialog with pre-filled prompt + contestants via the existing `ArenaLauncherDialog` pattern).
- [x] **P3.2** Bench engine: suite model (`data/` YAML, grid of prompt_len x gen_len x concurrency x repetitions). Runner with TTFT capture (client-side first delta) + llama.cpp `timings` parse (`prompt_per_second`, `predicted_per_second`, `cache_n` from final stream chunk). Bounded fan-out (`Promise.allSettled`, suite-declared concurrency only). Results as aggregates + raw samples to `bench_suites`/`bench_runs`/`bench_samples` tables. Add schema for these 3 tables in this task.
- [x] **P3.3** V1 safety: user-initiated runs only; takeover confirmation when target host shows recent traffic; embedding-host-first defaults; `concurrent_foreign_requests` recorded per run from activity stream to flag polluted results. Unattended scheduling deliberately absent (P8).
- [x] **P3.4** Wire `acquireHostAccess(providerId, purpose)` from P1.14 into the bench runner. The runner MUST gate every run through this function -- never inline the inflight check. P8 swaps its body.
- [x] **P3.5** Bench UI: run launcher, live progress via `control_job` frames, history charts (TTFT vs concurrency, tok/s over time via ECharts), baseline + regression flags (delta beyond -10% gen tok/s threshold).
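Two pieces of the bench arithmetic are easy to pin down in code: expanding a suite's grid (P3.2) into individual cells, and the regression flag (P3.5). The field and function names below are hypothetical, and the sketch assumes "beyond -10%" means a relative gen tok/s drop strictly greater than 10% against the baseline.

```typescript
interface BenchGrid {
  promptLen: number[];
  genLen: number[];
  concurrency: number[];
  repetitions: number;
}

interface BenchCell {
  promptLen: number;
  genLen: number;
  concurrency: number;
  repetition: number; // 1-based
}

// Cartesian product: prompt_len x gen_len x concurrency x repetitions.
function expandGrid(grid: BenchGrid): BenchCell[] {
  const cells: BenchCell[] = [];
  for (const promptLen of grid.promptLen) {
    for (const genLen of grid.genLen) {
      for (const concurrency of grid.concurrency) {
        for (let repetition = 1; repetition <= grid.repetitions; repetition++) {
          cells.push({ promptLen, genLen, concurrency, repetition });
        }
      }
    }
  }
  return cells;
}

// Flags a run whose gen tok/s fell more than 10% below the baseline.
function isRegression(baselineTps: number, currentTps: number, threshold = -0.1): boolean {
  if (baselineTps <= 0) return false; // no usable baseline
  const delta = (currentTps - baselineTps) / baselineTps;
  return delta < threshold;
}
```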
---
## P4 -- per-consumer attribution (X-Boo-Source, end-to-end)
**Demo:** Activity feed filtered to "arena" shows only Arena traffic; nothing reads NULL.
- [x] **P4.1** `apps/server`: per-turn fetch-wrapper injection on AI-SDK streaming path. Thread `source` through the call site. `getSwapProvider` cache keyed by `baseURL+source` (label set: `boochat|boocoder|arena|control-bench|control-eval`). `upstreamModel` signature change must be additive (optional `source` param -- 1 production importer: `stream-phase-adapter.ts:309`; validated by plan-validation F1). Extend headers in `compaction.ts` and `task-model.ts` direct fetches.
- [x] **P4.2** `apps/coder`: forward inbound `x-boo-source` header in `local-gateway.ts` (currently omitted from forwarded headers). Set it at Arena + dispatch fetch sites.
- [x] **P4.3** Migration: `ALTER TABLE control_requests ADD COLUMN source TEXT`. Surface as Activity filter + per-source token aggregates in the UI.
- [x] **P4.4** Tests: header present on all three paths (server streaming, gateway-forwarded opencode, arena direct); rows attribute correctly in `control_requests`.
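The fetch-wrapper injection from P4.1 can be illustrated with a simplified fetch type. This is a sketch under assumptions: `FetchLike` is a reduced stand-in for the real `fetch` signature (plain-object headers only), and `withSource` and `cachedFetchFor` are hypothetical names, not the actual `getSwapProvider` code. The label set and the `baseURL+source` cache key come from the task description.

```typescript
type BooSource = "boochat" | "boocoder" | "arena" | "control-bench" | "control-eval";

// Simplified stand-in for fetch: string URL, plain-object headers.
interface FetchInit {
  method?: string;
  headers?: Record<string, string>;
  body?: string;
}
type FetchLike = (url: string, init?: FetchInit) => Promise<unknown>;

// Wraps a fetch so every request carries X-Boo-Source, keeping caller headers intact.
function withSource(fetchImpl: FetchLike, source: BooSource): FetchLike {
  return (url, init = {}) =>
    fetchImpl(url, {
      ...init,
      headers: { ...init.headers, "X-Boo-Source": source },
    });
}

// Cache keyed by baseURL + source, so two consumers of the same host
// never share a wrapper that stamps the wrong label.
const wrapperCache = new Map<string, FetchLike>();

function cachedFetchFor(baseURL: string, source: BooSource, fetchImpl: FetchLike): FetchLike {
  const key = `${baseURL}|${source}`;
  let wrapped = wrapperCache.get(key);
  if (!wrapped) {
    wrapped = withSource(fetchImpl, source);
    wrapperCache.set(key, wrapped);
  }
  return wrapped;
}
```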
---
## P5 -- quality evals + sandbox
**Demo:** Fleet leaderboard with speed x quality scatter.
- [x] **P5.1** Suite format (`data/` YAML: chat rubric tasks, code tasks with tests); CRUD + versioning. Four suites in priority order: (1) agent coding tasks, (2) chat assistant quality, (3) long-context retrieval, (4) utility calls (titles/summaries). Add schema for `eval_suites`/`eval_runs`/`eval_results` tables in this task.
- [x] **P5.2** Judge runner: temperature 0, pinned judge model+version, rubric scoring, rationale capture. Pairwise tie-breaks delegate to Arena (links/launches battles, not re-implements). Judge = strongest local model by default.
- [x] **P5.3** Code sandbox runner: ephemeral Docker containers (`--network none`, non-root, caps dropped, tmpfs workdir, `--rm`, kill-on-timeout, `boocontrol-eval` label for orphan findability). Orphan prune at engine start (`docker ps --filter label=boocontrol-eval`). Bounded concurrency (default 4) + `Promise.allSettled` + per-task `finally` cleanup. Pass@1 scoring. Patterns from `/opt/forks/openevals` (verified: `sandbox/` directory exists with Docker hardened container patterns). Harden: `--security-opt=no-new-privileges`, `--cap-drop=ALL`.
- [x] **P5.4** Leaderboard UI + speed x quality scatter per (provider_id, model, quant) using ECharts (reuse the `buildEChartsTheme()` helper from P1.13).
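The hardening flags listed in P5.3 translate into a `docker run` argument list along the lines of the sketch below. The function name, the `65534:65534` user, the `/work` path, and the 64m tmpfs size are all assumptions chosen for illustration; the flags themselves are standard `docker run` options. Kill-on-timeout, orphan pruning, and the bounded `Promise.allSettled` pool are left out.

```typescript
// Builds the argv for `docker <args...>` for one ephemeral eval container.
function buildSandboxArgs(image: string, command: string[]): string[] {
  return [
    "run",
    "--rm", // remove the container when it exits
    "--network", "none", // no network access
    "--user", "65534:65534", // non-root (assumed uid:gid)
    "--cap-drop=ALL", // drop all Linux capabilities
    "--security-opt=no-new-privileges", // block privilege escalation
    "--tmpfs", "/work:rw,size=64m", // in-memory workdir (assumed path and size)
    "--workdir", "/work",
    "--label", "boocontrol-eval", // lets `docker ps --filter label=boocontrol-eval` find orphans
    image,
    ...command,
  ];
}
```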
---
## P6 -- advisory routing + reports
**Demo:** Picker badges "best code model right now"; Monday-morning fleet report.
- [ ] **P6.1** Advisory scores API (eval results + live latency + host health) -> model-picker badges. Expose via `GET /api/control/routing/scores`.
- [ ] **P6.2** Reports: scheduled digest job (usage, trends, swap counts, leaderboard deltas, anomalies vs baselines) -> `control_reports`. Same in-process timer pattern as retention (P1), `schedule_meta = {interval, enabled, last_run_at}` with catch-up on boot. Reports tab + markdown export. Add `control_reports` schema in this task.
---
## P7 -- live `auto:*` gateway (committed)
**Demo:** An `auto:code` session in BooChat routes to the current best code model with failover.
- [ ] **P7.1** Control service: OpenAI-compatible virtual models (`auto`, `auto:code`, `auto:fast`, `auto:cheap`) backed by `route_policies` table. Policy: rule match -> candidate ordering -> health/ctx-fit filter -> dispatch with failover. Gateway forwards `X-Boo-Source` to target host. Add `route_policies` schema in this task.
- [ ] **P7.2** Registry entry: `kind: "boocontrol-gateway"` with `baseUrl: "http://100.114.205.53:9503"`. BooChat adopts with zero inference-path changes.
- [ ] **P7.3** `apps/server/src/services/inference/provider.ts` -- the code change required for orphaned-session handling:
  - Extend `InferenceRoute` from `'swap' | 'deepseek'` to `'swap' | 'deepseek' | 'gateway' | 'gateway_error'`
  - `gateway_error` carries `{reason: 'offline' | 'unhealthy'}` for structured error reporting
  - Override the unknown-provider fallback (current behavior at line 147: composite id with unknown provider silently routes to `LLAMA_SWAP_URL`). For gateway-kind ids that are missing/disabled, resolve to `route: 'gateway_error'` with `reason: 'offline'`, never the swap fallback.
  - **Audit all 5 callers** with explicit per-caller changes:
    1. `getModelContext` (model-context.ts:85) -- must handle gateway `baseUrl` (query `/upstream/<model>/props` against the control service, not the target host)
    2. `invalidateModelContext` (model-context.ts:160) -- must handle gateway variant (no-op; gateway doesn't cache model context)
    3. `resolveRoute` (provider.ts:175) -- must return `{route: 'gateway'}` for gateway-kind ids
    4. `upstreamModel` (provider.ts:184) -- **must add gateway branch before the swap fallback** at line 192; the implicit else currently always reaches `getSwapProvider`
    5. `resolveModelEndpoint` (provider.ts:201) -- must handle gateway headers (forward `X-Boo-Source`)
  - Propagation note (plan-validation F2): these 5 direct call sites fan out to ~10 downstream production call sites (stream-phase-adapter, compaction, task-model, system-prompt, error-handler, tool-phase, chats, stream-phase); none need signature changes (gateway handling is internal to each function) but all need test coverage.
  - Audit clarification (plan-validation F7): `system-prompt.ts:195` calls `resolveRoute(agent)` with no config/modelId, so it always returns `{route: 'swap'}` and needs NO gateway handling.
  - All must compile unchanged for the new variant (additive, not breaking)
  - The session keeps its id; the picker flags affected sessions.
- [ ] **P7.4** Policy editor UI (route_policies CRUD) + per-policy dispatch log in the Reports tab.
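The `InferenceRoute` extension in P7.3 is easiest to read as a discriminated union. The sketch below is illustrative, not the contents of `provider.ts`: `RouteResult`, `GatewayEntry`, `resolveGatewayRoute`, and `describeRoute` are hypothetical names, and the `healthy` flag is an assumed input. It shows the two behaviors the task asks for: a missing or disabled gateway entry resolves to `gateway_error` instead of the swap fallback, and an exhaustive `switch` forces callers to handle both new variants.

```typescript
type InferenceRoute = "swap" | "deepseek" | "gateway" | "gateway_error";

type RouteResult =
  | { route: "swap" }
  | { route: "deepseek" }
  | { route: "gateway" }
  | { route: "gateway_error"; reason: "offline" | "unhealthy" };

// Hypothetical registry entry for a gateway-kind provider.
interface GatewayEntry {
  enabled: boolean;
  healthy: boolean;
}

// Resolution for ids already known to be gateway-kind:
// a missing or disabled entry is an explicit error, never a swap fallback.
function resolveGatewayRoute(entry: GatewayEntry | undefined): RouteResult {
  if (!entry || !entry.enabled) return { route: "gateway_error", reason: "offline" };
  if (!entry.healthy) return { route: "gateway_error", reason: "unhealthy" };
  return { route: "gateway" };
}

// Exhaustive handling: adding a variant to RouteResult without a case here
// fails to compile at the `never` assignment.
function describeRoute(result: RouteResult): string {
  switch (result.route) {
    case "swap":
      return "llama-swap";
    case "deepseek":
      return "deepseek";
    case "gateway":
      return "boocontrol gateway";
    case "gateway_error":
      return `gateway unavailable (${result.reason})`;
    default: {
      const unreachable: never = result;
      return unreachable;
    }
  }
}

// The string union and the object union stay aligned through the discriminant.
const exampleRoute: InferenceRoute = resolveGatewayRoute(undefined).route;
```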
---
## P8 -- fleet coordination lease (cross-service batch, own design pass)
**Outline only.** The proper fix for the four-writer TOCTOU. P3 left a seam (`acquireHostAccess` in `host-access.ts`) that P8 swaps.
- [ ] **P8.1** Design + ship `control_host_leases` (holder, purpose, expires_at, heartbeat) and the honor-protocol in all four writers (BooChat, BooCoder, Arena, BooControl). Scope: separate proposal under `openspec/changes/`. The BooControl bench scheduler consumes it through the `acquireHostAccess` seam left in P3. Unattended bench scheduling + reproducible concurrency sweeps unlock here.
---
## P9 -- remote hands + optional
**Outline only.**
- [ ] **P9.1** SSH config editor: SFTP read -> schema-validated edit (config-schema.json from the fork) -> diff preview -> timestamped backup -> SFTP write -> restart (nssm/systemctl) -> health-wait. Key in `secrets/` (gitignored). Tests for the failure paths.
- [ ] **P9.2** `llama-bench`-over-SSH ingestion for device-level numbers.
- [ ] **P9.3** `boocontrol.indifferentketchup.com` vhost (Caddy/Authelia rewrite -> `/control`).
- [ ] **P9.4** Frontier providers as routing targets; slim `control` pane kind for in-workspace mini-cockpit.
---
## Deferred (YAGNI)
Items removed from active scope with reopen triggers:
- **Prometheus/Grafana integration** -- BooControl persists its own samples; `/metrics` endpoints stay available. Reopen when an external monitoring stack is actually deployed.
- **Multi-user/auth** -- Authelia at the proxy layer. Reopen when multi-user is needed.
- **Non-llama-swap engine connectors** (vLLM, Ollama, infinity-emb) -- connector interface should not preclude them. Reopen when a second engine kind is actually added.
- **Cross-process GPU arbitration** -- having four uncoordinated writers is accepted in v1. Reopen when the P8 lease proves insufficient.
- **Log persistence to file** -- logs are relay-only with in-memory tail. Reopen when log volume warrants durable storage.
- **llama-bench over SSH** (P9.2) -- device-level numbers. Reopen when SSH plumbing from P9.1 lands.
- **`llama-swap` peers federation** -- flat list, coupled uptime, silent ID collisions. Reopen if the provider registry proves insufficient for host coordination.
---
## Next step
Validate independently with `boo-validating-changes boocontrol`, then implement with `boo-implementing-changes boocontrol`. P0 gate first (commit the multi-provider batch), then P1.