boocode/docs/research/2026-06-03-boocode-orchestration-integration.md
2026-06-03 15:25:59 +00:00
# Research: Can we lift Paseo's orchestration into BooCode so it can orchestrate different agents?
Open-ended question: *Can we lift Paseo's multi-agent orchestration into BooCode, and can we get this done?* — Evidence mode: **strict**.
## Summary
Yes, this is achievable — but "lifting Paseo's orchestrator" turns out to be the wrong mental model, because **Paseo has no orchestration engine to lift**. Paseo's daemon is a process supervisor; the actual sequencing is done by the parent *model*, which calls `create_agent` / `wait_for_agent` / `archive_agent` as MCP tools inside its own reasoning loop. So the real choice is between two different things: (a) copy that model-drives-itself pattern, which is simple and Paseo-proven but fragile when a weak local model is the conductor, or (b) build a small deterministic sequencing layer in BooCode's own code, which is more work but reliable regardless of model.
BooCode already owns roughly the worker half of this — dispatch, four agent backends, parallel fan-out, per-agent git worktrees, resumable sessions. What it lacks is any notion of "do step B after step A and feed A's output into B." Which path is right hinges on one unanswered question: **who conducts — a strong model like Claude, or a weak local model like Qwen?** With Claude, the simple copy-Paseo path wins and a deterministic engine would be over-engineering. With Qwen, the deterministic-code path is necessary, because a 35B local model self-orchestrating is exactly what failed live in this session. Until that conductor question is answered, there is no single winner — there is a fork with a clear deciding criterion.
- **Confidence:** Medium
## Research Results
**BooCode already owns the worker layer (A15–A19).** Task dispatch runs off a Postgres `LISTEN/NOTIFY 'tasks_new'` fast path plus a poll fallback (A15). Four execution backends sit behind a common `AgentBackend` interface: opencode-server, warm-ACP, claude-SDK, and one-shot ACP/PTY (A18). Arena already fans the same prompt out to 2–5 agents in parallel, each with its own task and worktree (A16). Per-task and per-session git worktrees, resumable `agent_sessions`, and output capture (full text to `messages.content`, a diff to `pending_changes`) all exist. This half is real, working code.
**But the sequencing substrate genuinely does not exist (A16, A17, A19).** The dispatcher's poll selects on `state='pending'` alone — it never reads `parent_task_id`, and there is no `depends_on`, `step_index`, `flows` table, or fold/synthesis code anywhere (A17, verified by the validator against `dispatcher.ts:105-110`). Arena is parallel-only and single-step; its "winner selection" is a `[SELECTED]` string prefix that no code downstream consumes (A16). The one inter-task channel a parent can cheaply read — `tasks.output_summary` — is capped at **500 characters** in every completion path (A19), far too small to pass a real artifact (a research finding, a diff) from one step into the next. So the existing substrate is a head-start on *dispatch* but actively wrong-shaped for *data flow between steps*.
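The gap described above can be illustrated with a small in-memory sketch. It is not BooCode's dispatcher code: the `dependsOn` field and both function names are hypothetical, introduced only to contrast selecting on state alone with a dependency-aware selection.

```typescript
// Hypothetical task shape: `dependsOn` does not exist in BooCode's schema today.
type TaskState = "pending" | "running" | "done" | "failed";

interface Task {
  id: string;
  state: TaskState;
  dependsOn: string[]; // assumed column: ids that must be "done" first
}

// Mirrors the behaviour described above: select on state alone.
function selectPendingOnly(tasks: Task[]): Task[] {
  return tasks.filter((t) => t.state === "pending");
}

// Dependency-aware variant: a pending task is ready only when every
// task it depends on has finished successfully.
function selectReady(tasks: Task[]): Task[] {
  const stateById: Record<string, TaskState> = {};
  tasks.forEach((t) => {
    stateById[t.id] = t.state;
  });
  return tasks.filter(
    (t) => t.state === "pending" && t.dependsOn.every((dep) => stateById[dep] === "done")
  );
}

const tasks: Task[] = [
  { id: "research", state: "running", dependsOn: [] },
  { id: "validate", state: "pending", dependsOn: ["research"] },
  { id: "lint", state: "pending", dependsOn: [] },
];

// State-only selection would start "validate" while "research" is still running.
const naiveIds = selectPendingOnly(tasks).map((t) => t.id); // ["validate", "lint"]
const readyIds = selectReady(tasks).map((t) => t.id); // ["lint"]
```

In a Postgres-backed version the same check would presumably live in the poll query rather than in application code; that placement is an assumption, not something the findings above establish.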
**Paseo's orchestrator is liftable as plumbing, but it isn't a conductor (A20–A24).** `AgentManager` is a plain TypeScript class with no React-Native/Zustand dependency (A20, confirmed), and the MCP tool server exposing `create_agent`/`wait_for_agent`/`archive_agent` sits cleanly on top (A21). Agent state is JSON files on disk, and parent/child is a single label on a flat map (A23, A24); both are trivial to replicate. **The decisive finding:** Paseo has *no* deterministic DAG or sequencing engine (A22). The parent agent, itself a running ACP process, decides the order by making MCP tool calls in its own loop. Paseo's daemon only spawns, waits, and bookkeeps.
**The web evidence argues that model self-orchestration is fragile for weak models and recommends deterministic code sequencing (A4, A5, A6, A9), and this was corroborated live (A25).** Disinterested academic work reports that a deterministic engine calling the model only for bounded sub-tasks beats self-orchestration by about 10 percentage points (+10.1pp), with large reductions in turns and tool calls (A6). Separate work reports that flat-context steering accuracy falls from ~60% to ~21% as agent count climbs (A9, though this is the 3→10-agent regime, not a 2–3-step bounded flow). In this very session, opencode/Qwen-35B ran the han research skill, dispatched the first specialist, then **dropped the adversarial-validator step entirely** (A25): a live, n=1 instance of exactly this failure. The lowest-ops way to host deterministic sequencing on a single-user Postgres stack is in-process Postgres-backed execution; Temporal is too heavy for one user (A1, A12), Restate is lighter but adds a second stateful service (A2, A3), and Inngest isn't truly self-hostable (A13, A14).
**Conflict surfaced:** the prior-art angle pushes toward a deterministic engine; the Paseo codebase proves the *opposite* pattern (model self-orchestration) ships and works — with a capable parent model. The two only reconcile once you fix who the conductor is.
## Options to Consider
### O-A: Copy Paseo's pattern — model self-orchestrates via an MCP toolbox
- **What it is:** Expose `create_agent` / `wait_for_agent` / `archive_agent` (and friends) as MCP tools over BooCode's existing backends; a parent agent sequences sub-agents from inside its own reasoning loop. Optionally lift Paseo's `AgentManager` + MCP server (~5 files, no RN deps) rather than writing the toolbox fresh.
- **Trade-offs:** Lowest effort and Paseo-proven — *with a strong parent model*. Fragile with a weak local conductor: non-deterministic, workflow state hidden in the model's context, no crash-replay. This is the pattern that dropped a step on Qwen in this session (A25).
- **Rests on:** (A20, A21, A22, A24) — and the con on (A4, A5, A6, A9, A25)
- **Evidence status:** corroborated
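To make the O-A shape concrete, here is a minimal in-memory sketch. The tool names follow the ones cited above, but the signatures, the `Supervisor` class, and the `conduct` function are assumptions for illustration; they are not Paseo's API. The `forgetsValidator` flag simulates a conductor that skips a step, and the supervisor has no way to notice.

```typescript
// In-memory stand-in for a supervisor that only spawns, waits, and bookkeeps.
// Tool names follow the ones cited above; the signatures are assumptions.
interface AgentRecord {
  id: string;
  role: string;
  parentId: string | null;
  archived: boolean;
  output: string;
}

class Supervisor {
  private agents: AgentRecord[] = [];

  createAgent(role: string, prompt: string, parentId: string | null): string {
    const id = "agent-" + (this.agents.length + 1);
    // A real worker would run a model; this stub just echoes the prompt.
    this.agents.push({ id, role, parentId, archived: false, output: role + " saw: " + prompt });
    return id;
  }

  waitForAgent(id: string): string {
    const agent = this.agents.filter((a) => a.id === id)[0];
    if (!agent) throw new Error("unknown agent " + id);
    return agent.output;
  }

  archiveAgent(id: string): void {
    this.agents.forEach((a) => {
      if (a.id === id) a.archived = true;
    });
  }

  rolesSpawned(): string[] {
    return this.agents.map((a) => a.role);
  }
}

// The "conductor" is whatever calls the tools. The flag simulates a
// weaker model that forgets the validation step.
function conduct(tools: Supervisor, forgetsValidator: boolean): string {
  const analyst = tools.createAgent("analyst", "research the question", "parent");
  let result = tools.waitForAgent(analyst);
  tools.archiveAgent(analyst);
  if (!forgetsValidator) {
    const validator = tools.createAgent("validator", result, "parent");
    result = tools.waitForAgent(validator);
    tools.archiveAgent(validator);
  }
  return result;
}

const carefulRun = new Supervisor();
conduct(carefulRun, false);
const forgetfulRun = new Supervisor();
conduct(forgetfulRun, true);

const carefulRoles = carefulRun.rolesSpawned(); // ["analyst", "validator"]
const forgetfulRoles = forgetfulRun.rolesSpawned(); // ["analyst"]
```

Both runs complete without error. Nothing in the supervisor encodes that a validator was expected, which is the sense in which workflow state is hidden in the conductor.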
### O-B: Build a deterministic code flow-runner on BooCode's own primitives
- **What it is:** Add step sequencing in code — a `depends_on` / step-state column on the existing `tasks` table, dispatched by the existing `LISTEN/NOTIFY` poll; a real result-passing channel (replacing the 500-char `output_summary`); and a fold step. Agents stay bounded single-task workers.
- **Trade-offs:** Reliable regardless of conductor model and crash-resumable. But the missing sequencing + result-passing + fold is **greenfield** — the reused worker layer is plumbing, this is where the schedule and the distributed-systems hazards live. The hand-rolled variant is genuinely near-zero new infra; a library (DBOS Transact) is **not**, given BooCode's host-systemd/Docker split and dual-schema DB (see V6) — DBOS is an unvalidated option, not a co-equal one.
- **Rests on:** (A15, A16, A17, A19) for the substrate; (A6) for the architecture; (A1, A7, A8, A10) for the tooling angle — the last two vendor-sourced and caveated.
- **Evidence status:** architecture corroborated; specific tooling (DBOS/pg-workflows) single-source (caveated)
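A minimal sketch of the O-B idea, under stated assumptions: `Step`, `runFlow`, and the example flow are hypothetical names, the workers are plain functions standing in for bounded agents, and everything lives in memory. It shows code (not a model) deciding what runs next, full outputs flowing between steps, and a final fold step.

```typescript
// A step's worker receives the full outputs of its dependencies, keyed by step id.
type Worker = (inputs: Record<string, string>) => string;

interface Step {
  id: string;
  dependsOn: string[];
  run: Worker; // stands in for a bounded single-task agent
}

function has(map: Record<string, string>, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(map, key);
}

// Runs steps in dependency order. Code, not a model, decides what runs next.
function runFlow(steps: Step[]): Record<string, string> {
  const outputs: Record<string, string> = {};
  let remaining = steps.slice();
  while (remaining.length > 0) {
    const ready = remaining.filter((s) => s.dependsOn.every((d) => has(outputs, d)));
    if (ready.length === 0) {
      const stuck = remaining.map((s) => s.id).join(", ");
      throw new Error("cycle or missing dependency among: " + stuck);
    }
    ready.forEach((s) => {
      const inputs: Record<string, string> = {};
      s.dependsOn.forEach((d) => {
        inputs[d] = outputs[d];
      });
      outputs[s.id] = s.run(inputs); // full artifact, no truncation
    });
    remaining = remaining.filter((s) => !has(outputs, s.id));
  }
  return outputs;
}

// Declared out of order on purpose: the runner, not the list order, sequences them.
const flow: Step[] = [
  { id: "fold", dependsOn: ["research", "validate"], run: (i) => i["research"] + " | " + i["validate"] },
  { id: "validate", dependsOn: ["research"], run: (i) => "checked(" + i["research"] + ")" },
  { id: "research", dependsOn: [], run: () => "finding" },
];

const results = runFlow(flow);
// results["fold"] is "finding | checked(finding)"
```

This sketch is not crash-resumable. The option as described would persist step state on the `tasks` table so a restart can pick up where it left off.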
### O-C: Hybrid — lift Paseo's ACP supervisor as the worker layer, code orchestrator on top
- **What it is:** Replace BooCode's backends with Paseo's lifted ACP client + `AgentManager`, then put an O-B deterministic conductor over it.
- **Trade-offs:** Net-negative. BooCode already has four working backends; importing Paseo's provider tree (heavy intra-package `@getpaseo/protocol` coupling, per V7) swaps working code for a foreign dependency. Recommended **against**.
- **Rests on:** (A18, A20) — and V7
- **Evidence status:** corroborated (against)
## Recommendation
- **Recommendation:** **No single winner until the conductor model is fixed — this is the deciding criterion.**
- **If the conductor is a strong model (Claude, in-stack today):** choose **O-A**. It is the simpler, Paseo-proven path, and building a deterministic engine would be over-engineering scope that wasn't asked for. Expose the MCP toolbox over BooCode's existing backends; lifting Paseo's `AgentManager`/MCP server is optional convenience, not necessity.
- **If the conductor must be a weak local model (Qwen) — i.e. the goal is free/local multi-agent flows:** choose **O-B, hand-rolled** (`depends_on` + step-state on `tasks`, dispatched by the existing `LISTEN/NOTIFY`, plus a real result channel and a fold step). Determinism is not optional here; the model cannot be trusted to sequence itself (A25). Treat DBOS as an unvalidated alternative, not the default.
- **In neither case O-C**, and in neither case lift Paseo's *conductor* — there isn't one to lift (A22).
- **Evidence basis:** The codebase findings (A15–A24) are current-state anchors, all independently verified by the validator against live source: the worker-layer-exists / sequencing-absent / Paseo-has-no-engine conclusions are solid. The "weak models can't self-orchestrate" direction rests on one disinterested source (A6) plus general degradation work (A9) and a single live anecdote (A25), which is enough to make O-A *fragile* on Qwen but not enough to call it impossible. The specific DBOS tooling pick rests only on vendor marketing (A7, A8) and a single-source library (A10), so it is explicitly demoted. The fork itself (which option wins) rests on a constraint, the conductor model, that the question never stated and the operator must supply.
## Validation
### V1: The "reuse 70%, build 30%" split is inverted on effort
- **Strategy:** Challenge the Recommendation
- **Investigation:** Read `dispatcher.ts` end-to-end; poll query dispatches on `state='pending'` only, no dependency awareness; no fold/synthesis code exists.
- **Result:** Partially Refuted
- **Impact:** The reused 70% is already-debugged plumbing; the missing piece (crash-resumable step state machine, result-passing, fold) is 100% greenfield and is where the risk lives. Direction survives; the *effort framing* was misleading and is corrected in O-B.
### V2: The 500-char `output_summary` cap makes the substrate hostile to result-passing
- **Strategy:** Challenge the Evidence
- **Investigation:** The 500-char cap is applied in every completion path (`dispatcher.ts:249/440/509/855/1120/1375`); full output goes to `messages.content` (50k) but the cheap parent-readable field is 500 chars.
- **Result:** Confirmed
- **Impact:** Result-passing must *replace* this primitive, not reuse it. Folded into O-B's scope.
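The effect of the cap can be shown with a short sketch. The 500-character limit comes from the finding above; `toSummary`, `ResultChannel`, and the 2,000-character stand-in artifact are hypothetical and only illustrate why a reference to a stored artifact behaves differently from a truncated summary.

```typescript
const SUMMARY_CAP = 500; // cap on the summary field, as stated above

// What a parent can read cheaply today: a truncated summary.
function toSummary(fullOutput: string): string {
  return fullOutput.slice(0, SUMMARY_CAP);
}

// Hypothetical replacement: store the full artifact, hand the next step a reference.
class ResultChannel {
  private artifacts: Record<string, string> = {};

  put(taskId: string, artifact: string): string {
    this.artifacts[taskId] = artifact;
    return taskId; // the reference passed downstream
  }

  get(ref: string): string {
    if (!Object.prototype.hasOwnProperty.call(this.artifacts, ref)) {
      throw new Error("no artifact for " + ref);
    }
    return this.artifacts[ref];
  }
}

// A made-up 2,000-character artifact standing in for a real diff or finding.
const fullDiff = new Array(2001).join("x");

const summary = toSummary(fullDiff); // 500 characters survive
const channel = new ResultChannel();
const ref = channel.put("task-a", fullDiff);
const received = channel.get(ref); // all 2,000 characters survive
```

Where the artifact would actually be stored (a new column, a new table, or the existing `messages.content`) is left open here; the sketch only fixes the interface shape.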
### V3: "O-A fails on Qwen" is over-weighted by one live anecdote (A25)
- **Strategy:** Challenge the Evidence-Gathering Integrity
- **Investigation:** A25 is n=1, self-collected this session; A9's degradation figures are for 3→10 agents in flat context, not a 2–3-step bounded flow; no test of Qwen on a *tightened* skill.
- **Result:** Partially Refuted
- **Impact:** Reworded to "O-A is *fragile* with a weak local conductor," not "fails." A stronger local model or a step-gated skill could flip it.
### V4: The synthesis assumes BooCode *wants* deterministic flows
- **Strategy:** Challenge the Recommendation
- **Investigation:** The question was "lift Paseo's orchestration," which Paseo proves works via MCP self-orchestration with a strong parent. Nothing in the question mandates a weak local conductor; BooCode already runs Claude-SDK and opencode backends.
- **Result:** Refuted (as an unstated assumption)
- **Impact:** **Load-bearing.** The recommendation was rewritten from "O-B" into the conditional fork above, with conductor-model as the explicit deciding criterion.
### V5: Discounting interested-party web sources (A4, A7, A8, A13, A14)
- **Strategy:** Challenge the Evidence-Gathering Integrity
- **Investigation:** Sensitivity test. Remove A4 (Praetorian): architecture claim still holds on A6 (disinterested arXiv). Remove A7/A8/A10 (DBOS/pg-workflows): the *specific tooling* recommendation loses its entire basis. Remove A13/A14 (Inngest): irrelevant.
- **Result:** Partially Refuted
- **Impact:** Architectural recommendation survives on A6 alone; the DBOS tooling pick was demoted to an unvalidated option in O-B.
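The sensitivity test above amounts to removing discounted sources and seeing what support remains. A minimal sketch follows, using the source IDs from this report; the `Claim` shape and the `surviving` helper are hypothetical, and the grouping of sources per claim is a simplification of the prose above.

```typescript
interface Claim {
  name: string;
  supportedBy: string[];
}

// Sources treated as interested parties, taken from the V5 heading.
const interested = ["A4", "A7", "A8", "A13", "A14"];

const claims: Claim[] = [
  { name: "architecture: deterministic sequencing", supportedBy: ["A4", "A5", "A6", "A9"] },
  { name: "tooling: DBOS / pg-workflows", supportedBy: ["A7", "A8", "A10"] },
];

// Returns the sources that still support a claim once the discounted ones are removed.
function surviving(claim: Claim, discounted: string[]): string[] {
  return claim.supportedBy.filter((id) => discounted.indexOf(id) === -1);
}

const architectureSupport = surviving(claims[0], interested); // ["A5", "A6", "A9"]
const toolingSupport = surviving(claims[1], interested); // ["A10"]
```

The architecture claim keeps several sources, including the disinterested A6, while the tooling claim is left with a single source, consistent with its demotion to an unvalidated option.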
### V6: "No new infra, just Postgres" ignores the host/container/dual-schema split
- **Strategy:** Challenge the Fix
- **Investigation:** BooCoder runs as a host systemd service; `apps/server` runs in Docker; the coder schema is applied separately. DBOS keeps its own system tables and owns transaction boundaries — a third schema-owner in the shared DB plus a second durable-execution engine overlapping the existing poll machinery.
- **Result:** Confirmed
- **Impact:** "No new infra" is true only for the **hand-rolled** variant, false for DBOS. O-B now recommends hand-rolled and separates the two.
### V7: The Paseo "5 files, 2–3 days" lift drags in the provider tree
- **Strategy:** Challenge the Evidence
- **Investigation:** `agent-manager.ts` imports 30+ types from `agent-sdk-types`, plus `@getpaseo/protocol/*` and the full provider implementations. The "no React-Native deps" claim is true; the *implied* cheap isolated lift is not.
- **Result:** Partially Refuted
- **Impact:** O-C (reuse Paseo's worker supervisor) is net-negative since BooCode already has four working backends. O-C recommended against.
### V8: Provenance of the two untracked artifact files
- **Strategy:** Challenge the Evidence-Gathering Integrity
- **Investigation:** Untracked `docs/features/git-diff-panel/artifacts/*.md` are the synthesis's own working files, not fetched external content; low injection risk but unversioned.
- **Result:** Confirmed (low severity)
- **Impact:** Codebase claims (A15–A24) are reproducible and were checked; web claims (A1–A14) are not re-fetchable from here and were taken on the retrieval's word.
### Adjustments Made
The recommendation did **not** survive in its original "O-B as the spine" form. It was rewritten into the conditional fork above (deciding criterion: conductor model), per V4. O-C was dropped (V7); DBOS was demoted from co-equal to unvalidated (V5, V6); the "O-A fails on Qwen" claim was softened to "fragile" (V3); the effort framing and the 500-char result-passing problem were folded into O-B's scope (V1, V2).
### Confidence Assessment
- **Confidence:** Medium
- **Remaining Risks:** The web tier (A1–A14) is unverifiable from this environment and the DBOS/pg-workflows specifics are vendor/single-sourced. A25 is n=1. The single assumption that flips the entire recommendation, whether the conductor is Claude or Qwen, is the operator's to confirm; everything downstream depends on it.
## Sources
| ID | Source | Link / location | Retrieved | Trust class | Summary (one line) | Evidence status |
|---|---|---|---|---|---|---|
| A1 | Nango: left Temporal for Postgres orchestration | https://nango.dev/blog/migrating-from-temporal-to-a-postgres-based-task-orchestrator/ | 2026-06-03 | web | Temporal ops overhead drove a move to Postgres-backed orchestration | corroborated by A12 |
| A2 | Restate self-hosted overview | https://docs.restate.dev/server/overview | 2026-06-03 | web | Single binary, embedded RocksDB, no external DB | corroborated by A3 |
| A3 | Show HN: Restate | https://news.ycombinator.com/item?id=40659160 | 2026-06-03 | web | Confirms single-binary lightweight deploy | corroborated by A2 |
| A4 | Praetorian: Deterministic AI Orchestration | https://www.praetorian.com/blog/deterministic-ai-orchestration-a-platform-architecture-for-autonomous-development/ | 2026-06-03 | web | External orchestration beats self-orchestration; small models do better as bounded workers | corroborated by A6, A9 (interested party) |
| A5 | Hatchworks: Orchestrating AI Agents | https://hatchworks.com/blog/ai-agents/orchestrating-ai-agents/ | 2026-06-03 | web | Self-orchestration failure modes; external control plane as default | corroborated by A4, A6 |
| A6 | arXiv 2508.02721 Blueprint-First | https://arxiv.org/abs/2508.02721 | 2026-06-03 | web | Deterministic engine + LLM for bounded sub-tasks; +10.1pp, fewer turns | corroborated by A4, A5, A9 |
| A7 | Show HN: DBOS TypeScript | https://news.ycombinator.com/item?id=42727970 | 2026-06-03 | web | In-process Postgres durable execution, decorator steps | corroborated by A8 |
| A8 | DBOS Transact | https://www.dbos.dev/dbos-transact | 2026-06-03 | web | "Just your program and Postgres," no orchestration server | corroborated by A7 (interested party) |
| A9 | arXiv 2604.07911 context scoping | https://arxiv.org/pdf/2604.07911 | 2026-06-03 | web | Flat-context steering accuracy 60%→21% from 3→10 agents | corroborated by A4, A6 |
| A10 | pg-workflows | https://sokratisvidros.github.io/pg-workflows/ | 2026-06-03 | web | Pure-Postgres TS workflow library, step exactly-once | single source (caveated) |
| A11 | Hatchet v1 HN | https://news.ycombinator.com/item?id=43572733 | 2026-06-03 | web | Postgres + RabbitMQ; separate server process | corroborated by A1 |
| A12 | Temporal self-host guide | https://docs.temporal.io/self-hosted-guide/deployment | 2026-06-03 | web | Multi-service self-host overhead | corroborated by A1 |
| A13 | Inngest vs Trigger vs Restate | https://www.pkgpulse.com/guides/inngest-vs-trigger-dev-v3-vs-restate-2026 | 2026-06-03 | web | Inngest cloud-first; not truly self-hostable | corroborated by A14 |
| A14 | Inngest self-hosting docs | https://www.inngest.com/docs/self-hosting | 2026-06-03 | web | Engine proprietary to Inngest Cloud | corroborated by A13 (interested party) |
| A15 | BooCode dispatcher + LISTEN/NOTIFY | `apps/coder/src/services/dispatcher.ts:46` | n/a | codebase | Task dispatch via `tasks_new` notify + poll; backend routing | corroborated by A18 |
| A16 | BooCode Arena | `apps/coder/src/routes/arena.ts:34` | n/a | codebase | Parallel fan-out to 2–5 contestants; selection is `[SELECTED]` prefix, no consumer; sessionless | single source (codebase anchor) |
| A17 | BooCode tasks table | `apps/coder/src/schema.sql:18` | n/a | codebase | `parent_task_id` FK written-but-not-dispatched-on; no `depends_on`/step/flows | single source (codebase anchor) |
| A18 | AgentBackend + 4 backends | `apps/coder/src/services/agent-backend.ts:97` | n/a | codebase | Common ensureSession/prompt surface; opencode/warm-acp/claude-sdk/one-shot | corroborated by A15 |
| A19 | new_task / output_summary cap | `apps/coder/src/services/tools/new_task.ts:13` | n/a | codebase | Native-only tools; parent reads only 500-char `output_summary` | single source (codebase anchor) |
| A20 | Paseo AgentManager | `/opt/forks/paseo/packages/server/src/server/agent/agent-manager.ts:413` | n/a | codebase | Plain TS class, no RN deps; create/stream/run/wait/cascade-archive | corroborated by A22 |
| A21 | Paseo createAgentMcpServer | `/opt/forks/paseo/.../agent/mcp-server.ts:479` | n/a | codebase | create_agent/wait_for_agent/archive_agent as MCP tools; child gets parent MCP URL | corroborated by A20 |
| A22 | Paseo has no sequencing engine | `/opt/forks/paseo/.../agent-manager.ts` (verdict) | n/a | codebase | Parent model self-orchestrates via MCP; daemon supervises only | corroborated by A20, A24 |
| A23 | Paseo AgentStorage | `/opt/forks/paseo/.../agent/agent-storage.ts:84` | n/a | codebase | JSON files on disk + in-memory timeline, no DB | single source (codebase anchor) |
| A24 | Paseo parent/child label | `/opt/forks/paseo/packages/protocol/src/agent-labels.ts` | n/a | codebase | Relationship is one label on a flat map | corroborated by A22 |
| A25 | Live smoke test (this session) | provided: opencode/Qwen-35B han research run | n/a | provided | Qwen dispatched analyst, dropped the validator step, skipped template | single source (live, n=1) |
### A22: Paseo has no deterministic sequencing engine — recommendation-bearing
- **Link / location:** `/opt/forks/paseo/packages/server/src/server/agent/agent-manager.ts:413` (+ explorer verdict)
- **Retrieved:** n/a
- **Trust class:** codebase (current-state anchor)
- **Summary:** Paseo's daemon spawns, waits on, and bookkeeps agents; it contains no DAG or workflow engine. The parent agent — itself an ACP process — does all sequencing by calling `create_agent`/`wait_for_agent`/`archive_agent` as MCP tools in its own reasoning loop. This is why "lift Paseo's orchestrator" is a category error: the conductor is the model, not Paseo. It reframes the entire recommendation into "build a conductor (O-B) vs adopt model-as-conductor (O-A)."
- **Evidence status:** corroborated by A20, A24
### A17: BooCode tasks table lacks sequencing columns — recommendation-bearing
- **Link / location:** `apps/coder/src/schema.sql:18`
- **Retrieved:** n/a
- **Trust class:** codebase (current-state anchor)
- **Summary:** The `tasks` table has `parent_task_id` (written by `new_task`, read only by `list_tasks`, never by the dispatcher) but no `depends_on`, `step_index`, or `flows` definition. The dispatcher poll selects on `state='pending'` alone. This is the concrete gap O-B must fill, and it confirms the deterministic sequencing substrate genuinely does not exist today.
- **Evidence status:** single source (codebase anchor), verified by the validator against live source
### A6: Blueprint-First deterministic workflow (arXiv) — recommendation-bearing
- **Link / location:** https://arxiv.org/abs/2508.02721
- **Retrieved:** 2026-06-03
- **Trust class:** web (peer-reviewed preprint, disinterested)
- **Summary:** A deterministic engine executes an expert-defined blueprint and calls the LLM only for bounded sub-tasks, never to decide workflow path; reports ~10-point gains and large reductions in turns and tool calls versus self-orchestrating agents. This is the one disinterested source carrying the "deterministic code beats model self-orchestration" direction after the interested parties (A4) are discounted.
- **Evidence status:** corroborated by A4, A5, A9