Files
boocode/openspec/changes/boocontrol/tasks.md
indifferentketchup b18de2a331 chore: snapshot working tree - pty_exited notifications + in-flight inference WIP
feat(booterm): structured pty_exited WS notifications. Plan-validated, impl-validated, code-reviewed green (contracts build clean, contracts test 29/29, booterm + web typecheck clean).

wip: in-progress inference/provider refactor (agents.ts, provider.ts, new llama-providers.ts, removed llama-args-validator), plus arena, dispatcher, compaction, schema changes.

openspec: pty-exit-notifications complete; x-agent-flags planned (not yet implemented).
2026-06-14 12:48:47 +00:00

76 lines
13 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# BooControl — tasks
**Status:** READY (decisions resolved 2026-06-11). Gate: P0 must be **committed and reviewed** before P1 starts. Each phase is a vertical slice with a demo; the whole idea ships eventually — P1→P3 are the cockpit, P4→P7 are intelligence, P8→P9 are coordination + remote hands.
## P0 — prerequisite gate (separate batch: multi-llama-swap provider registry)
- [ ] Finish remaining tasks in `openspec/changes/multi-llama-swap-providers-model-favorites/tasks.md`: favorites hide-not-delete UI/route tests; smoke test sam-desktop + embedding (+ DeepSeek config); opencode duplicate-name routing smoke if in scope.
- [ ] Sam reviews and **commits** the batch (currently working-tree only). BooControl keys on `LlamaProvider.id` — the committed contract is the foundation.
## P1 — read-only cockpit
**Demo: watch both hosts live (models, swaps, VRAM/temp, request feed) while chatting.**
- [ ] Scaffold `apps/control`: Fastify, TS NodeNext, `.env.example`/`.env.host`, port 9503, `/api/health`, systemd unit `boocontrol.service`, deploy docs in root CLAUDE.md.
- [ ] `db.ts` with `applySchema` + **startup ordering guard** (`waitForTable(sql, 'sessions')` before DDL — design §3).
- [ ] `schema.sql`: `control_hosts` seed (sam-desktop, embedding) `ON CONFLICT DO NOTHING`; `control_requests` (NO source column — that's P4) with `UNIQUE (provider_id, swap_entry_id, ts)`; `control_perf_samples` with `UNIQUE (provider_id, ts)`; `control_perf_rollup_5m` with `UNIQUE (provider_id, bucket)`; `control_model_events` with `UNIQUE (provider_id, model, state, ts)`.
- [ ] Fleet connector per enabled host: SSE client w/ backoff+jitter+circuit-breaker (port the `opencode-sse.ts` pattern); explicit `connected|reconnecting|down` liveness state machine + `last_seen_at`; reconcile via `/api/metrics` on reconnect with `INSERT ... ON CONFLICT DO NOTHING` (never check-then-act); `gap_suspected` via the no-overlap heuristic (design §4).
- [ ] Perf poller (5s, `/api/performance?after=`); watermark recovered from `MAX(ts)` on restart; NULL watermark (fresh install) → omit `after`, ingest returned window (design §4).
- [ ] In-memory fleet state with per-host monotonic `seq`; WS endpoint `/api/ws/control`: snapshot-on-join carrying seqs + seq-stamped deltas.
- [ ] **Retention job in this slice** (not a fast-follow): rollup as idempotent upsert + raw delete in chunked per-provider-per-hour transactions (design §6); activity prune; configurable windows.
- [ ] Contracts: add `control_fleet`, `control_activity`, `control_perf`, `control_log`, `control_job` to `WsFrameSchema` + `KNOWN_FRAME_TYPES`; rebuild package; mirror in the web strict union; extend the contracts drift test to cover the five new frames. (Server loose union NOT needed — control frames bypass the broker via the raw proxy relay, so this is a 2-location sync; plan finding JD1.)
- [ ] `apps/server`: `registerControlProxy` (`/api/control/*` HTTP + `/api/control/ws` WS relay; clone of `routes/coder-proxy.ts` with keep-in-sync comments in both files); `BOOCONTROL_URL` env.
- [ ] Web: `/control` route (`App.tsx`), nav entry (`ProjectSidebar.tsx`), `pages/Control.tsx` shell with Fleet + Activity tabs; `useControlStream` as a **second app-level WS singleton** (own context + connection guard; client discards deltas ≤ snapshot seq); host cards (state chips incl. grey `down`+last-seen, VRAM/temp/power readouts, TTL countdowns); live activity feed (virtuoso).
- [ ] Charts: integrate ECharts (per-chart module imports via `echarts/core`) for perf timelines; dark-theme tokens from active palette.
- [ ] Tests: connector dedup/reconcile + seq logic as pure helpers (`turn-guard.ts` pattern); liveness state machine; retention idempotency (re-run same window → identical rollups); DB tests `describe.runIf(DATABASE_URL)`.
## P2 — hands on the controls
**Demo: unload from UI, watch the swap stream, open a capture.**
- [x] Per-host FIFO action queue in the control service; warm (1-token completion w/ bare wire ID) + unload one/all routed through it; unload-during-bench → takeover confirmation; reject submissions while host is `down`, cap depth (4), re-check liveness on dequeue + skip stale actions (design §5).
- [x] Optimistic UI off `control_fleet` frames only (no local emits, per event-dedup discipline).
- [x] Logs tab: relay `/api/events` logData → `control_log`; in-memory 2k-line tail for late joiners; virtuoso tail-follow viewer w/ source filters + pause-on-scroll.
- [x] Inspector: activity table → capture drawer (`GET /api/captures/:id` via control svc, trimmed persist, shiki JSON, headers); "Open in Playground" stub.
- [x] Op task (manual, documented in design): enable `captureBuffer` + review `metricsMaxInMemory` on both hosts' llama-swap configs.
## P3 — playground + speed bench (manual, safe-by-construction)
**Demo: TTFT-vs-concurrency curves for two quants, run by hand without disturbing a live chat.**
- [x] Playground tab: model select (grouped picker from P0), param controls, streaming chat, side-by-side A/B; "Battle in Arena" handoff link.
- [x] Bench engine: suite model (grid + repetitions), runner w/ TTFT capture + `timings` parse; bounded fan-out (`Promise.allSettled`, suite-declared concurrency only); aggregates + raw samples to `bench_*` tables.
- [x] v1 safety: user-initiated runs only; takeover confirmation when target host shows recent traffic; embedding-host-first defaults; `concurrent_foreign_requests` recorded per run to flag polluted results. (Unattended scheduling deliberately absent — P8.)
- [x] The P8 seam: every run gates through `acquireHostAccess(providerId, purpose)` (v1 body = courtesy check + confirmation); never inline the inflight check in the runner (design §8).
- [x] Bench UI: run launcher, live progress via `control_job`, history charts (TTFT vs concurrency, tok/s over time), baseline + regression flags.
## P4 — per-consumer attribution (X-Boo-Source, end-to-end)
**Demo: Activity feed filtered to "arena" shows only Arena traffic; nothing reads NULL.**
- [x] `apps/server`: per-turn fetch-wrapper injection on the AI-SDK streaming path (thread source through the call site; wrapper-aware `getSwapProvider`, cache keyed by baseURL+source). **`upstreamModel` change must be additive** (optional `source` param/options — its file has 28-file/13-route blast radius, design §7); extend headers in `compaction.ts` + `task-model.ts` direct fetches.
- [x] `apps/coder`: forward inbound `x-boo-source` in `local-gateway.ts`; set it at arena + dispatch fetch sites.
- [x] Migration: add `source TEXT` to `control_requests`; surface as Activity filter + per-source token aggregates.
- [x] Tests: header present on all three paths (server streaming, gateway-forwarded opencode, arena direct); rows attribute correctly.
## P5 — quality evals + sandbox
**Demo: fleet leaderboard with speed×quality scatter.**
- [x] Suite format (data/ YAML: chat rubric tasks; code tasks with tests); CRUD + versioning.
- [x] Judge runner (temperature 0, pinned judge model+version, rubric scoring, rationale capture); pairwise tie-breaks delegate to Arena.
- [x] Code sandbox runner: ephemeral containers (`--network none`, non-root, mem/cpu/time caps, tmpfs, `--rm`, `boocontrol-eval` label); orphan prune at engine start; bounded concurrency (default 4) + `Promise.allSettled` + per-task `finally` cleanup; pass@1 scoring; borrow patterns from `/opt/forks/openevals`.
- [x] Leaderboard UI + speed×quality scatter per (provider_id, model, quant).
## P6 — advisory routing + reports
**Demo: picker badges "best code model right now"; Monday-morning fleet report.**
- [x] Advisory scores API (evals + live latency + host health) → model-picker badges. `services/routing-scores.ts` (`assignBadges` pure helper, unit-tested), `GET /api/control/routing/scores`; `ModelPicker.tsx` fetches badges (non-fatal) and renders best-code/best-chat/best-fast chips. Verify: `pnpm -C apps/control test` (routing-scores 4), `npx tsc -p apps/web/tsconfig.app.json --noEmit`.
- [x] Reports: scheduled digest job (usage, trends, swap counts, leaderboard deltas, anomalies vs baselines) → `control_reports`; same in-process timer pattern as retention, schedule meta in `control_schedule_meta` table (`{interval, enabled, last_run_at}`) w/ catch-up on boot; Reports tab + markdown export (`renderReportMarkdown`/`isReportDue` pure, unit-tested). See design `## Implementation notes` for the schedule-meta-table deviation. Verify: `pnpm -C apps/control test` (reports 7).
## P7 — live `auto:*` gateway (committed)
**Demo: an `auto:code` session in BooChat routes to the current best code model with failover.**
- [x] OpenAI-compatible virtual models (`auto`, `auto:code`, `auto:fast`, `auto:cheap`) backed by `route_policies`: rule match → candidate ordering → health/ctx-fit filter → dispatch w/ failover; gateway forwards `X-Boo-Source` to the target host. `routes/gateway.ts` (`/v1/models`, `/v1/chat/completions`, `/upstream/:model/props`) + `services/gateway.ts` (`orderCandidates` pure, unit-tested). Reached server-to-server (registry baseUrl), not via the buffering /api/control proxy, so streaming survives. Verify: `pnpm -C apps/control test` (gateway 11) + live smoke.
- [x] Registry entry (`kind: "boocontrol-gateway"`) so BooChat adopts with zero inference-path changes. Added to `data/llama-providers.example.json`; control service filters gateway-kind providers out of fleet connectors/pollers/retention (`fleetProviders` in `index.ts`) so it never SSE-connects to itself.
- [x] **Orphaned-session handling — `provider.ts` code change** (design §8): `InferenceRoute` extended to `'swap' | 'deepseek' | 'gateway' | 'gateway_error'` (gateway_error carries `gatewayReason`); known gateway-kind id → `'gateway'`; orphaned auto:* id (provider missing) → `'gateway_error'` reason `offline`, NEVER the swap fallback. All callers audited: `upstreamModel`/`resolveModelEndpoint` add gateway branch + throw on gateway_error; `getModelContext` proxies gateway props / null on gateway_error; `resolveRoute` returns the new variant (system-prompt.ts `ObservedInputs.route` widened to `InferenceRoute`); `invalidateModelContext` unchanged (composite-key path covers it). Picker flags orphaned sessions (`isOrphanedGatewayValue` banner in `ModelPicker.tsx`). Verify: `pnpm -C apps/server test` (provider gateway tests), `pnpm -C apps/server build`.
- [x] Policy editor UI (route_policies CRUD) + per-policy dispatch log. `routes/policies.ts` (CRUD + `/dispatch-log`); `ReportsTab.tsx` Policies + Dispatch Log sub-views. Verify: `npx tsc -p apps/web/tsconfig.app.json --noEmit`.
## P8 — fleet coordination lease (cross-service batch, own design pass)
**Demo: a scheduled overnight bench runs unattended without ever evicting a live model.**
- [x] Outlined, see `openspec/changes/fleet-coordination-lease/` (proposal + tasks, OUTLINE status). Design + ship `control_host_leases` (holder, purpose, expires_at, heartbeat) and the honor-protocol in all four writers (BooChat, BooCoder, Arena, BooControl); BooControl consumes it through the `acquireHostAccess` seam left in P3. NOT implemented here — outline only per the program decision.
- [x] Outlined, see `openspec/changes/fleet-coordination-lease/` (tasks L4). Unattended bench scheduling + reproducible concurrency sweeps unlock behind the lease.
## P9 — remote hands + optional
- [x] SSH config editor: SSH read → schema-validated edit (config-schema.json from the fork, bundled at `apps/control/data/config-schema.json`, ajv-validated) → diff preview → timestamped backup → write → restart → health-wait. `services/ssh-config.ts` (pure `validateLlamaConfig`/`computeDiff`/`backupFilename` + injectable-exec `applyRemoteConfig` pipeline) + `routes/ssh-config.ts` (`GET/PATCH /api/hosts`, `/config`, `/config/validate`, `/config/diff`, `/config/apply`) + `HostConfigEditor.tsx` (gear button on each Fleet card). SSH via shelled `ssh` (booterm precedent, key from `control_hosts.ssh_key_path``secrets/`, gitignored) instead of an ssh2 dependency. Failure-path tests for every pipeline step (`ssh-config.test.ts`, 15 tests). NOTE deviation: SFTP replaced by `ssh cat`/`cat >` (no ssh2 dep); recorded in design `## Implementation notes`. Verify: `pnpm -C apps/control test` (ssh-config 15). Not live-smoked — no reachable Windows SSH target in this session (the "Windows SSH fiddliness" risk); the failure-path test suite stands in.
- [ ] DEFERRED — `llama-bench`-over-SSH ingestion for device-level numbers. Reason: depends on the SSH plumbing from P9.1 *landing + a live host to run `llama-bench` on*; it is also explicitly YAGNI-deferred in the implementation-plan ("Reopen when SSH plumbing from P9.1 lands"). The P9.1 exec seam (`SshExec`) is the hook a follow-up reuses.
- [ ] DEFERRED — boocontrol.indifferentketchup.com vhost (Caddy/Authelia rewrite → `/control`). Reason: pure reverse-proxy/ops config (Caddyfile + Authelia rules) on the homelab host, no repo code; `/control` already works behind the existing boocode origin via the `registerControlProxy` relay. Out of scope for a code batch.
- [ ] DEFERRED — Frontier providers as routing targets; slim `control` pane kind for in-workspace mini-cockpit. Reason: two sizeable independent features (frontier-provider routing belongs with the registry/provider work; a new workspace pane kind is its own UI batch). Marked optional in the implementation-plan Deferred section; out of reach for an additive P6P9 pass without dedicated design.