docs: changelog for v2.7.17-orchestrator + orchestration research

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 15:25:59 +00:00
parent 1937af8df9
commit edc348baf3
2 changed files with 179 additions and 0 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,10 @@

 All notable changes per release tag. Most recent on top, ordered by tag creation date (which matches the git history). Tag names follow `vMAJOR.MINOR.PATCH-slug` — the slug describes what shipped, so the tag name alone is enough to recall the batch.

+## v2.7.17-orchestrator — 2026-06-03
+
+Brings the deterministic multi-agent "conductor" into the app as the **Orchestrator**: launch any read-only Han flow (research, code-review, investigate, architectural-analysis, security-review, …) from BooChat or BooCoder and watch each specialist agent stream live in a Paseo-style run pane, ending with an evidence-disciplined, adversarially-validated report, all on free local Qwen, persisted and resumable. Built and audited end-to-end via `paseo-epic` in an isolated worktree, on top of the prior `/opt/boocode/conductor` standalone CLI: the conductor's 22 flow definitions, Spine factory, and Han evidence/YAGNI contracts were re-homed into `apps/coder/src/conductor`, and a new DB-backed flow-runner (`flow_runs`/`flow_steps`) dispatches each step as a real BooCoder task through the existing dispatcher, reusing its streaming→WS-frame pipeline and worktree-as-read-snapshot, with an `onTaskTerminal` hook that advances the wave and a startup resume that re-dispatches in-flight steps after a coder restart. Read-only is enforced hard: every step is dispatched with `qwen --approval-mode plan`, an adversarial-security review caught and closed a bypass where a qwen-unavailable task silently fell through to write-capable native inference (now fails closed), and the ACP path's mode-set was made fail-closed too. The UI adds a fourth `orchestrator` pane kind (collapsed agent roster, expand-one live stream, report on top), a Workflow button + slash flows on the shared `ChatInput` for full BooChat/BooCoder parity, a "New Orchestrator" entry in the + and split menus, a category-grouped launcher dialog, runs history, and export (copy / save-to-file / send-to-chat), fed by two new `flow_run_*` WS frames on a coder user channel. Qwen-only by design (Claude Code remains the Claude path); the existing model-competition Arena stays a separate feature. Spec/plan in `openspec/changes/orchestrator`; coder: 373 tests green (42 new scheduler/resume/read-only decision tests), contracts/coder/server builds + web tsc clean. Built on `v2.7.16-container-git-safedir`; pairs conceptually with the earlier `v2.7.12-audit-cleanup` multi-agent orchestration.
+
 ## v2.7.16-container-git-safedir — 2026-06-03

 Hotfix that makes the `v2.7.15-git-diff-panel` work in production. The `boocode` container runs as root but bind-mounts host project repos owned by uid 1000, so git rejected them with "detected dubious ownership" and the diff route reported every project as not-a-repo — which hid the Git tab entirely (and had been silently nulling the existing branch indicator too). Adds `git config --system --add safe.directory '*'` to the Dockerfile runtime stage so the container's git trusts the mounted repos; applied live to the running container and baked into the image for future rebuilds. Surfaced by a live smoke immediately after the v2.7.14/v2.7.15 deploy.
--- a/docs/research/2026-06-03-boocode-orchestration-integration.md
+++ b/docs/research/2026-06-03-boocode-orchestration-integration.md
@@ -0,0 +1,175 @@
+# Research: Can we lift Paseo's orchestration into BooCode so it can orchestrate different agents?
+
+Open-ended question: *Can we lift Paseo's multi-agent orchestration into BooCode, and can we get this done?* — Evidence mode: **strict**.
+
+## Summary
+
+Yes, this is achievable — but "lifting Paseo's orchestrator" turns out to be the wrong mental model, because **Paseo has no orchestration engine to lift**. Paseo's daemon is a process supervisor; the actual sequencing is done by the parent *model*, which calls `create_agent` / `wait_for_agent` / `archive_agent` as MCP tools inside its own reasoning loop. So the real choice is between two different things: (a) copy that model-drives-itself pattern, which is simple and Paseo-proven but fragile when a weak local model is the conductor, or (b) build a small deterministic sequencing layer in BooCode's own code, which is more work but reliable regardless of model.
+
+BooCode already owns roughly the worker half of this — dispatch, four agent backends, parallel fan-out, per-agent git worktrees, resumable sessions. What it lacks is any notion of "do step B after step A and feed A's output into B." Which path is right hinges on one unanswered question: **who conducts — a strong model like Claude, or a weak local model like Qwen?** With Claude, the simple copy-Paseo path wins and a deterministic engine would be over-engineering. With Qwen, the deterministic-code path is necessary, because a 35B local model self-orchestrating is exactly what failed live in this session. Until that conductor question is answered, there is no single winner — there is a fork with a clear deciding criterion.
+
+- **Confidence:** Medium
+
+## Research Results
+
+**BooCode already owns the worker layer (A15–A19).** Task dispatch runs off a Postgres `LISTEN/NOTIFY 'tasks_new'` fast path plus a poll fallback (A15). Four execution backends sit behind a common `AgentBackend` interface — opencode-server, warm-ACP, claude-SDK, and one-shot ACP/PTY (A18). Arena already fans the same prompt out to 2–5 agents in parallel, each with its own task and worktree (A16). Per-task and per-session git worktrees, resumable `agent_sessions`, and output capture (full text to `messages.content`, a diff to `pending_changes`) all exist. This half is real, working code.
+
+**But the sequencing substrate genuinely does not exist (A16, A17, A19).** The dispatcher's poll selects on `state='pending'` alone — it never reads `parent_task_id`, and there is no `depends_on`, `step_index`, `flows` table, or fold/synthesis code anywhere (A17, verified by the validator against `dispatcher.ts:105-110`). Arena is parallel-only and single-step; its "winner selection" is a `[SELECTED]` string prefix that no code downstream consumes (A16). The one inter-task channel a parent can cheaply read — `tasks.output_summary` — is capped at **500 characters** in every completion path (A19), far too small to pass a real artifact (a research finding, a diff) from one step into the next. So the existing substrate is a head-start on *dispatch* but actively wrong-shaped for *data flow between steps*.
+
+**Paseo's orchestrator is liftable as plumbing, but it isn't a conductor (A20–A24).** `AgentManager` is a plain TypeScript class with no React-Native/Zustand dependency (A20, confirmed), and the MCP tool server exposing `create_agent`/`wait_for_agent`/`archive_agent` sits cleanly on top (A21). Agent state is JSON files on disk, parent/child is a single label on a flat map (A23, A24) — trivial to replicate. **The decisive finding:** Paseo has *no* deterministic DAG or sequencing engine (A22). The parent agent — itself a running ACP process — decides the order by making MCP tool calls in its own loop. Paseo's daemon only spawns, waits, and bookkeeps.
+
+**The web evidence argues that model-self-orchestration is fragile for weak models, and recommends deterministic code sequencing (A4, A5, A6, A9); this is corroborated live (A25).** Disinterested academic work reports that a deterministic engine calling the model only for bounded sub-tasks beats self-orchestration by about 10 percentage points (+10.1pp) with large reductions in turns and tool calls (A6), and that flat-context steering accuracy falls from ~60% to ~21% as agent count climbs (A9, though this is the 3→10-agent regime, not a 2–3-step bounded flow). In this very session, opencode/Qwen-35B ran the han research skill, dispatched the first specialist, then **dropped the adversarial-validator step entirely** (A25): a live, n=1 instance of exactly this failure. The lowest-ops way to host deterministic sequencing on a single-user Postgres stack is in-process Postgres-backed execution: Temporal is too heavy for one user (A1, A12), Restate is lighter but adds a second stateful service (A2, A3), and Inngest isn't truly self-hostable (A13, A14).
+
+**Conflict surfaced:** the prior-art angle pushes toward a deterministic engine; the Paseo codebase proves the *opposite* pattern (model self-orchestration) ships and works — with a capable parent model. The two only reconcile once you fix who the conductor is.
+
+## Options to Consider
+
+### O-A: Copy Paseo's pattern — model self-orchestrates via an MCP toolbox
+
+- **What it is:** Expose `create_agent` / `wait_for_agent` / `archive_agent` (and friends) as MCP tools over BooCode's existing backends; a parent agent sequences sub-agents from inside its own reasoning loop. Optionally lift Paseo's `AgentManager` + MCP server (~5 files, no RN deps) rather than writing the toolbox fresh.
+- **Trade-offs:** Lowest effort and Paseo-proven — *with a strong parent model*. Fragile with a weak local conductor: non-deterministic, workflow state hidden in the model's context, no crash-replay. This is the pattern that dropped a step on Qwen in this session (A25).
+- **Rests on:** (A20, A21, A22, A24) for feasibility; (A4, A5, A6, A9, A25) for the fragility caveat
+- **Evidence status:** corroborated
+
+### O-B: Build a deterministic code flow-runner on BooCode's own primitives
+
+- **What it is:** Add step sequencing in code — a `depends_on` / step-state column on the existing `tasks` table, dispatched by the existing `LISTEN/NOTIFY` poll; a real result-passing channel (replacing the 500-char `output_summary`); and a fold step. Agents stay bounded single-task workers.
+- **Trade-offs:** Reliable regardless of conductor model and crash-resumable. But the missing sequencing + result-passing + fold is **greenfield**: the reused worker layer is plumbing, and this new layer is where the schedule risk and the distributed-systems hazards live. The hand-rolled variant is genuinely near-zero new infra; a library (DBOS Transact) is **not**, given BooCode's host-systemd/Docker split and dual-schema DB (see V6). DBOS is an unvalidated option, not a co-equal one.
+- **Rests on:** (A15, A16, A17, A19) for the substrate; (A6) for the architecture; (A1, A7, A8, A10) for the tooling angle — the last two vendor-sourced and caveated.
+- **Evidence status:** architecture corroborated; specific tooling (DBOS/pg-workflows) single-source (caveated)
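The sequencing rule O-B describes can be sketched in a few lines. This is a minimal illustration, not BooCode's scheduler: the `Step` shape, the `dependsOn` field, and the `readySteps` helper are hypothetical names chosen for the example, and the in-memory array stands in for rows of the `tasks` table.

```typescript
// Hypothetical step shape; `dependsOn` and `state` are assumed names,
// not columns confirmed to exist in BooCode's schema.
type StepState = "pending" | "running" | "done" | "failed";

interface Step {
  id: string;
  dependsOn: string[];
  state: StepState;
}

// Return the next wave: pending steps whose dependencies have all completed.
function readySteps(steps: Step[]): Step[] {
  const done = new Set(
    steps.filter((s) => s.state === "done").map((s) => s.id),
  );
  return steps.filter(
    (s) => s.state === "pending" && s.dependsOn.every((dep) => done.has(dep)),
  );
}

// A three-step flow in which the analyst has already finished.
const flow: Step[] = [
  { id: "analyst", dependsOn: [], state: "done" },
  { id: "validator", dependsOn: ["analyst"], state: "pending" },
  { id: "fold", dependsOn: ["analyst", "validator"], state: "pending" },
];

// Only the validator is dispatchable; the fold step waits for it.
const wave = readySteps(flow).map((s) => s.id);
```

Because the readiness check is plain code over stored state, the same call after a restart yields the same wave, which is the crash-resumable property the trade-offs bullet refers to.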
+
+### O-C: Hybrid — lift Paseo's ACP supervisor as the worker layer, code orchestrator on top
+
+- **What it is:** Replace BooCode's backends with Paseo's lifted ACP client + `AgentManager`, then put an O-B deterministic conductor over it.
+- **Trade-offs:** Net-negative. BooCode already has four working backends; importing Paseo's provider tree (heavy intra-package `@getpaseo/protocol` coupling, per V7) swaps working code for a foreign dependency. Recommended **against**.
+- **Rests on:** (A18, A20) — and V7
+- **Evidence status:** corroborated (against)
+
+## Recommendation
+
+- **Recommendation:** **No single winner until the conductor model is fixed — this is the deciding criterion.**
+  - **If the conductor is a strong model (Claude, in-stack today):** choose **O-A**. It is the simpler, Paseo-proven path, and building a deterministic engine would be over-engineering beyond the scope that was asked. Expose the MCP toolbox over BooCode's existing backends; lifting Paseo's `AgentManager`/MCP server is an optional convenience, not a necessity.
+  - **If the conductor must be a weak local model (Qwen) — i.e. the goal is free/local multi-agent flows:** choose **O-B, hand-rolled** (`depends_on` + step-state on `tasks`, dispatched by the existing `LISTEN/NOTIFY`, plus a real result channel and a fold step). Determinism is not optional here; the model cannot be trusted to sequence itself (A25). Treat DBOS as an unvalidated alternative, not the default.
+  - **In neither case O-C**, and in neither case lift Paseo's *conductor* — there isn't one to lift (A22).
+- **Evidence basis:** The codebase findings (A15–A24) are current-state anchors, all independently verified by the validator against live source — the worker-layer-exists / sequencing-absent / Paseo-has-no-engine conclusions are solid. The "weak models can't self-orchestrate" direction rests on one disinterested source (A6) plus general degradation work (A9) and a single live anecdote (A25) — enough to make O-A *fragile* on Qwen, not enough to call it impossible. The specific DBOS tooling pick rests only on vendor marketing (A7, A8) and a single-source library (A10), so it is explicitly demoted. The fork itself — which option wins — rests on a constraint (conductor model) that the question never stated and the operator must supply.
+
+## Validation
+
+### V1: The "reuse 70%, build 30%" split is inverted on effort
+
+- **Strategy:** Challenge the Recommendation
+- **Investigation:** Read `dispatcher.ts` end-to-end; poll query dispatches on `state='pending'` only, no dependency awareness; no fold/synthesis code exists.
+- **Result:** Partially Refuted
+- **Impact:** The reused 70% is already-debugged plumbing; the missing piece (crash-resumable step state machine, result-passing, fold) is 100% greenfield and is where the risk lives. Direction survives; the *effort framing* was misleading and is corrected in O-B.
+
+### V2: The 500-char `output_summary` cap makes the substrate hostile to result-passing
+
+- **Strategy:** Challenge the Evidence
+- **Investigation:** The 500-char cap is applied in every completion path (`dispatcher.ts:249/440/509/855/1120/1375`); full output goes to `messages.content` (50k) but the cheap parent-readable field is 500 chars.
+- **Result:** Confirmed
+- **Impact:** Result-passing must *replace* this primitive, not reuse it. Folded into O-B's scope.
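To make the V2 finding concrete, the sketch below contrasts a 500-character summary with passing the full upstream artifact by step id. It assumes a hypothetical `ArtifactStore` (an in-memory `Map` standing in for whatever table would hold step results) and hypothetical helper names; only the 500-character cap comes from the finding itself.

```typescript
const SUMMARY_CAP = 500; // the cap described in V2

// Mimics the cheap parent-readable field: everything past the cap is lost.
function summarize(output: string): string {
  return output.slice(0, SUMMARY_CAP);
}

// Hypothetical in-memory stand-in for a result channel keyed by step id.
class ArtifactStore {
  private readonly artifacts = new Map<string, string>();

  put(stepId: string, content: string): void {
    this.artifacts.set(stepId, content);
  }

  get(stepId: string): string {
    const content = this.artifacts.get(stepId);
    if (content === undefined) {
      throw new Error(`no artifact stored for step ${stepId}`);
    }
    return content;
  }
}

// Build the next step's prompt from the full upstream artifacts.
function buildStepPrompt(
  instruction: string,
  upstream: string[],
  store: ArtifactStore,
): string {
  const inputs = upstream.map(
    (id) => `--- output of ${id} ---\n${store.get(id)}`,
  );
  return [instruction, ...inputs].join("\n\n");
}

// A finding whose key detail sits past the 500-character mark.
const finding = "x".repeat(1200) + " KEY-FINDING";
const store = new ArtifactStore();
store.put("analyst", finding);

const analystSummary = summarize(finding);
const validatorPrompt = buildStepPrompt("Validate the finding.", ["analyst"], store);
```

Here `analystSummary` no longer contains the key detail, while `validatorPrompt` does, which is why the impact line says the primitive must be replaced rather than reused.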
+
+### V3: "O-A fails on Qwen" is over-weighted by one live anecdote (A25)
+
+- **Strategy:** Challenge the Evidence-Gathering Integrity
+- **Investigation:** A25 is n=1, self-collected this session; A9's degradation figures are for 3→10 agents in flat context, not a 2–3-step bounded flow; no test of Qwen on a *tightened* skill.
+- **Result:** Partially Refuted
+- **Impact:** Reworded to "O-A is *fragile* with a weak local conductor," not "fails." A stronger local model or a step-gated skill could flip it.
+
+### V4: The synthesis assumes BooCode *wants* deterministic flows
+
+- **Strategy:** Challenge the Recommendation
+- **Investigation:** The question was "lift Paseo's orchestration," which Paseo proves works via MCP self-orchestration with a strong parent. Nothing in the question mandates a weak local conductor; BooCode already runs Claude-SDK and opencode backends.
+- **Result:** Refuted (as an unstated assumption)
+- **Impact:** **Load-bearing.** The recommendation was rewritten from "O-B" into the conditional fork above, with conductor-model as the explicit deciding criterion.
+
+### V5: Discounting interested-party web sources (A4, A7, A8, A13, A14)
+
+- **Strategy:** Challenge the Evidence-Gathering Integrity
+- **Investigation:** Sensitivity test. Remove A4 (Praetorian): architecture claim still holds on A6 (disinterested arXiv). Remove A7/A8/A10 (DBOS/pg-workflows): the *specific tooling* recommendation loses its entire basis. Remove A13/A14 (Inngest): irrelevant.
+- **Result:** Partially Refuted
+- **Impact:** Architectural recommendation survives on A6 alone; the DBOS tooling pick was demoted to an unvalidated option in O-B.
+
+### V6: "No new infra, just Postgres" ignores the host/container/dual-schema split
+
+- **Strategy:** Challenge the Fix
+- **Investigation:** BooCoder runs as a host systemd service; `apps/server` runs in Docker; the coder schema is applied separately. DBOS keeps its own system tables and owns transaction boundaries — a third schema-owner in the shared DB plus a second durable-execution engine overlapping the existing poll machinery.
+- **Result:** Confirmed
+- **Impact:** "No new infra" is true only for the **hand-rolled** variant, false for DBOS. O-B now recommends hand-rolled and separates the two.
+
+### V7: The Paseo "5 files, 2–3 days" lift drags in the provider tree
+
+- **Strategy:** Challenge the Evidence
+- **Investigation:** `agent-manager.ts` imports 30+ types from `agent-sdk-types`, plus `@getpaseo/protocol/*` and the full provider implementations. The "no React-Native deps" claim is true; the *implied* cheap isolated lift is not.
+- **Result:** Partially Refuted
+- **Impact:** O-C (reuse Paseo's worker supervisor) is net-negative since BooCode already has four working backends. O-C recommended against.
+
+### V8: Provenance of the two untracked artifact files
+
+- **Strategy:** Challenge the Evidence-Gathering Integrity
+- **Investigation:** Untracked `docs/features/git-diff-panel/artifacts/*.md` are the synthesis's own working files, not fetched external content; low injection risk but unversioned.
+- **Result:** Confirmed (low severity)
+- **Impact:** Codebase claims (A15–A24) are reproducible and were checked; web claims (A1–A14) are not re-fetchable from here and were taken on the retrieval's word.
+
+### Adjustments Made
+
+The recommendation did **not** survive in its original "O-B as the spine" form. It was rewritten into the conditional fork above (deciding criterion: conductor model), per V4. O-C was dropped (V7); DBOS was demoted from co-equal to unvalidated (V5, V6); the "O-A fails on Qwen" claim was softened to "fragile" (V3); the effort framing and the 500-char result-passing problem were folded into O-B's scope (V1, V2).
+
+### Confidence Assessment
+
+- **Confidence:** Medium
+- **Remaining Risks:** The web tier (A1–A14) is unverifiable from this environment and the DBOS/pg-workflows specifics are vendor/single-sourced. A25 is n=1. The single assumption that flips the entire recommendation — whether the conductor is Claude or Qwen — is the operator's to confirm; everything downstream depends on it.
+
+## Sources
+
+| ID | Source | Link / location | Retrieved | Trust class | Summary (one line) | Evidence status |
+|---|---|---|---|---|---|---|
+| A1 | Nango: left Temporal for Postgres orchestration | https://nango.dev/blog/migrating-from-temporal-to-a-postgres-based-task-orchestrator/ | 2026-06-03 | web | Temporal ops overhead drove a move to Postgres-backed orchestration | corroborated by A12 |
+| A2 | Restate self-hosted overview | https://docs.restate.dev/server/overview | 2026-06-03 | web | Single binary, embedded RocksDB, no external DB | corroborated by A3 |
+| A3 | Show HN: Restate | https://news.ycombinator.com/item?id=40659160 | 2026-06-03 | web | Confirms single-binary lightweight deploy | corroborated by A2 |
+| A4 | Praetorian: Deterministic AI Orchestration | https://www.praetorian.com/blog/deterministic-ai-orchestration-a-platform-architecture-for-autonomous-development/ | 2026-06-03 | web | External orchestration beats self-orchestration; small models do better as bounded workers | corroborated by A6, A9 (interested party) |
+| A5 | Hatchworks: Orchestrating AI Agents | https://hatchworks.com/blog/ai-agents/orchestrating-ai-agents/ | 2026-06-03 | web | Self-orchestration failure modes; external control plane as default | corroborated by A4, A6 |
+| A6 | arXiv 2508.02721 Blueprint-First | https://arxiv.org/abs/2508.02721 | 2026-06-03 | web | Deterministic engine + LLM for bounded sub-tasks; +10.1pp, fewer turns | corroborated by A4, A5, A9 |
+| A7 | Show HN: DBOS TypeScript | https://news.ycombinator.com/item?id=42727970 | 2026-06-03 | web | In-process Postgres durable execution, decorator steps | corroborated by A8 |
+| A8 | DBOS Transact | https://www.dbos.dev/dbos-transact | 2026-06-03 | web | "Just your program and Postgres," no orchestration server | corroborated by A7 (interested party) |
+| A9 | arXiv 2604.07911 context scoping | https://arxiv.org/pdf/2604.07911 | 2026-06-03 | web | Flat-context steering accuracy 60%→21% from 3→10 agents | corroborated by A4, A6 |
+| A10 | pg-workflows | https://sokratisvidros.github.io/pg-workflows/ | 2026-06-03 | web | Pure-Postgres TS workflow library, step exactly-once | single source (caveated) |
+| A11 | Hatchet v1 HN | https://news.ycombinator.com/item?id=43572733 | 2026-06-03 | web | Postgres + RabbitMQ; separate server process | corroborated by A1 |
+| A12 | Temporal self-host guide | https://docs.temporal.io/self-hosted-guide/deployment | 2026-06-03 | web | Multi-service self-host overhead | corroborated by A1 |
+| A13 | Inngest vs Trigger vs Restate | https://www.pkgpulse.com/guides/inngest-vs-trigger-dev-v3-vs-restate-2026 | 2026-06-03 | web | Inngest cloud-first; not truly self-hostable | corroborated by A14 |
+| A14 | Inngest self-hosting docs | https://www.inngest.com/docs/self-hosting | 2026-06-03 | web | Engine proprietary to Inngest Cloud | corroborated by A13 (interested party) |
+| A15 | BooCode dispatcher + LISTEN/NOTIFY | `apps/coder/src/services/dispatcher.ts:46` | n/a | codebase | Task dispatch via `tasks_new` notify + poll; backend routing | corroborated by A18 |
+| A16 | BooCode Arena | `apps/coder/src/routes/arena.ts:34` | n/a | codebase | Parallel fan-out 2–5 contestants; selection is `[SELECTED]` prefix, no consumer; sessionless | single source (codebase anchor) |
+| A17 | BooCode tasks table | `apps/coder/src/schema.sql:18` | n/a | codebase | `parent_task_id` FK written-but-not-dispatched-on; no `depends_on`/step/flows | single source (codebase anchor) |
+| A18 | AgentBackend + 4 backends | `apps/coder/src/services/agent-backend.ts:97` | n/a | codebase | Common ensureSession/prompt surface; opencode/warm-acp/claude-sdk/one-shot | corroborated by A15 |
+| A19 | new_task / output_summary cap | `apps/coder/src/services/tools/new_task.ts:13` | n/a | codebase | Native-only tools; parent reads only 500-char `output_summary` | single source (codebase anchor) |
+| A20 | Paseo AgentManager | `/opt/forks/paseo/packages/server/src/server/agent/agent-manager.ts:413` | n/a | codebase | Plain TS class, no RN deps; create/stream/run/wait/cascade-archive | corroborated by A22 |
+| A21 | Paseo createAgentMcpServer | `/opt/forks/paseo/.../agent/mcp-server.ts:479` | n/a | codebase | create_agent/wait_for_agent/archive_agent as MCP tools; child gets parent MCP URL | corroborated by A20 |
+| A22 | Paseo has no sequencing engine | `/opt/forks/paseo/.../agent-manager.ts` (verdict) | n/a | codebase | Parent model self-orchestrates via MCP; daemon supervises only | corroborated by A20, A24 |
+| A23 | Paseo AgentStorage | `/opt/forks/paseo/.../agent/agent-storage.ts:84` | n/a | codebase | JSON files on disk + in-memory timeline, no DB | single source (codebase anchor) |
+| A24 | Paseo parent/child label | `/opt/forks/paseo/packages/protocol/src/agent-labels.ts` | n/a | codebase | Relationship is one label on a flat map | corroborated by A22 |
+| A25 | Live smoke test (this session) | provided: opencode/Qwen-35B han research run | n/a | provided | Qwen dispatched analyst, dropped the validator step, skipped template | single source (live, n=1) |
+
+### A22: Paseo has no deterministic sequencing engine — recommendation-bearing
+
+- **Link / location:** `/opt/forks/paseo/packages/server/src/server/agent/agent-manager.ts:413` (+ explorer verdict)
+- **Retrieved:** n/a
+- **Trust class:** codebase (current-state anchor)
+- **Summary:** Paseo's daemon spawns, waits on, and bookkeeps agents; it contains no DAG or workflow engine. The parent agent — itself an ACP process — does all sequencing by calling `create_agent`/`wait_for_agent`/`archive_agent` as MCP tools in its own reasoning loop. This is why "lift Paseo's orchestrator" is a category error: the conductor is the model, not Paseo. It reframes the entire recommendation into "build a conductor (O-B) vs adopt model-as-conductor (O-A)."
+- **Evidence status:** corroborated by A20, A24
+
+### A17: BooCode tasks table lacks sequencing columns — recommendation-bearing
+
+- **Link / location:** `apps/coder/src/schema.sql:18`
+- **Retrieved:** n/a
+- **Trust class:** codebase (current-state anchor)
+- **Summary:** The `tasks` table has `parent_task_id` (written by `new_task`, read only by `list_tasks`, never by the dispatcher) but no `depends_on`, `step_index`, or `flows` definition. The dispatcher poll selects on `state='pending'` alone. This is the concrete gap O-B must fill, and it confirms the deterministic sequencing substrate genuinely does not exist today.
+- **Evidence status:** single source (codebase anchor), verified by the validator against live source
+
+### A6: Blueprint-First deterministic workflow (arXiv) — recommendation-bearing
+
+- **Link / location:** https://arxiv.org/abs/2508.02721
+- **Retrieved:** 2026-06-03
+- **Trust class:** web (arXiv preprint, disinterested)
+- **Summary:** A deterministic engine executes an expert-defined blueprint and calls the LLM only for bounded sub-tasks, never to decide workflow path; reports ~10-point gains and large reductions in turns and tool calls versus self-orchestrating agents. This is the one disinterested source carrying the "deterministic code beats model self-orchestration" direction after the interested parties (A4) are discounted.
+- **Evidence status:** corroborated by A4, A5, A9
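
The pattern A6 summarizes, where code walks a fixed blueprint and the model only handles bounded sub-tasks, might look roughly like the sketch below. All names here (`StepWorker`, `BlueprintStep`, `runBlueprint`) are hypothetical, and the worker is a stub standing in for a real model call; this illustrates the control-flow split, not the paper's implementation or BooCode's flow-runner.

```typescript
// A worker handles one bounded sub-task; here it stands in for a model call.
type StepWorker = (stepName: string, input: string) => string;

interface BlueprintStep {
  name: string;
  instruction: string;
}

// Run every step in blueprint order. The engine, not the worker,
// decides what runs next, so no step can be silently skipped.
function runBlueprint(
  steps: BlueprintStep[],
  worker: StepWorker,
  question: string,
): string[] {
  const trace: string[] = [];
  let carry = question;
  for (const step of steps) {
    // Each step receives its instruction plus the previous step's output.
    carry = worker(step.name, `${step.instruction}\n\n${carry}`);
    trace.push(step.name);
  }
  return trace;
}

const blueprint: BlueprintStep[] = [
  { name: "analyst", instruction: "Gather evidence." },
  { name: "validator", instruction: "Attack the findings." },
  { name: "synthesis", instruction: "Write the report." },
];

// Stub worker: it can return anything, but it cannot change the step order.
const stubWorker: StepWorker = (stepName, input) =>
  `[${stepName}] saw ${input.length} chars`;

const trace = runBlueprint(
  blueprint,
  stubWorker,
  "Can we lift Paseo's orchestration?",
);
```

With this split, the failure recorded as A25 (a dropped validator step) is ruled out by construction, since the loop over `blueprint` never consults the worker about which step comes next.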