boocode/docs/research/2026-06-03-boocode-orchestration-integration.md
2026-06-03 15:25:59 +00:00

Research: Can we lift Paseo's orchestration into BooCode so it can orchestrate different agents?

Open-ended question: Can we lift Paseo's multi-agent orchestration into BooCode, and can we get this done? — Evidence mode: strict.

Summary

Yes, this is achievable — but "lifting Paseo's orchestrator" turns out to be the wrong mental model, because Paseo has no orchestration engine to lift. Paseo's daemon is a process supervisor; the actual sequencing is done by the parent model, which calls create_agent / wait_for_agent / archive_agent as MCP tools inside its own reasoning loop. So the real choice is between two different things: (a) copy that model-drives-itself pattern, which is simple and Paseo-proven but fragile when a weak local model is the conductor, or (b) build a small deterministic sequencing layer in BooCode's own code, which is more work but reliable regardless of model.

BooCode already owns roughly the worker half of this — dispatch, four agent backends, parallel fan-out, per-agent git worktrees, resumable sessions. What it lacks is any notion of "do step B after step A and feed A's output into B." Which path is right hinges on one unanswered question: who conducts — a strong model like Claude, or a weak local model like Qwen? With Claude, the simple copy-Paseo path wins and a deterministic engine would be over-engineering. With Qwen, the deterministic-code path is necessary, because a 35B local model self-orchestrating is exactly what failed live in this session. Until that conductor question is answered, there is no single winner — there is a fork with a clear deciding criterion.

  • Confidence: Medium

Research Results

BooCode already owns the worker layer (A15-A19). Task dispatch runs off a Postgres LISTEN/NOTIFY 'tasks_new' fast path plus a poll fallback (A15). Four execution backends sit behind a common AgentBackend interface: opencode-server, warm-ACP, claude-SDK, and one-shot ACP/PTY (A18). Arena already fans the same prompt out to 2-5 agents in parallel, each with its own task and worktree (A16). Per-task and per-session git worktrees, resumable agent_sessions, and output capture (full text to messages.content, a diff to pending_changes) all exist. This half is real, working code.
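The worker layer described above can be pictured with a minimal sketch. Only the names AgentBackend, ensureSession and prompt come from the findings (A18); every signature, the echo backend, and the worktree path format are assumptions made for illustration, and the loop runs sequentially where Arena runs in parallel.

```typescript
// Hypothetical shapes: the method names come from A18, the signatures do not.
interface AgentBackend {
  readonly name: string;
  ensureSession(taskId: string): string; // returns a session id
  prompt(sessionId: string, text: string): string; // returns the agent's output
}

// Stand-in backend that echoes its input, so the sketch runs without real agents.
function echoBackend(name: string): AgentBackend {
  return {
    name,
    ensureSession: (taskId) => `${name}:${taskId}`,
    prompt: (sessionId, text) => `[${sessionId}] ${text}`,
  };
}

interface Contestant {
  taskId: string;
  worktree: string; // illustrative path format
  output: string;
}

// Arena-style fan-out: the same prompt goes to each backend,
// with one task and one worktree per contestant.
function fanOut(prompt: string, backends: AgentBackend[]): Contestant[] {
  return backends.map((backend, i) => {
    const taskId = `task-${i + 1}`;
    const session = backend.ensureSession(taskId);
    return {
      taskId,
      worktree: `worktrees/${taskId}`,
      output: backend.prompt(session, prompt),
    };
  });
}

const contestants = fanOut("fix the failing test", [
  echoBackend("opencode-server"),
  echoBackend("claude-sdk"),
]);
```

Nothing in this shape knows about ordering between tasks, which is the gap the next finding describes.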

But the sequencing substrate genuinely does not exist (A16, A17, A19). The dispatcher's poll selects on state='pending' alone — it never reads parent_task_id, and there is no depends_on, step_index, flows table, or fold/synthesis code anywhere (A17, verified by the validator against dispatcher.ts:105-110). Arena is parallel-only and single-step; its "winner selection" is a [SELECTED] string prefix that no code downstream consumes (A16). The one inter-task channel a parent can cheaply read — tasks.output_summary — is capped at 500 characters in every completion path (A19), far too small to pass a real artifact (a research finding, a diff) from one step into the next. So the existing substrate is a head-start on dispatch but actively wrong-shaped for data flow between steps.
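To make the gap concrete, here is a small in-memory sketch, not the dispatcher's actual code. The first function mirrors the reported behaviour (filter on state alone); the second shows what a dependency-aware poll would need. The dependsOn field is hypothetical, since the findings say no such column exists.

```typescript
type TaskState = "pending" | "running" | "done";

interface TaskRow {
  id: string;
  state: TaskState;
  parentTaskId: string | null; // exists today, but the dispatcher never reads it
  dependsOn: string[]; // hypothetical: no such column exists yet
}

// Mirrors the reported poll: filter on state alone.
function pollPendingOnly(rows: TaskRow[]): string[] {
  return rows.filter((r) => r.state === "pending").map((r) => r.id);
}

// A dependency-aware poll: pending AND every dependency already done.
function pollReady(rows: TaskRow[]): string[] {
  const done = new Set(rows.filter((r) => r.state === "done").map((r) => r.id));
  return rows
    .filter((r) => r.state === "pending" && r.dependsOn.every((id) => done.has(id)))
    .map((r) => r.id);
}

const tasks: TaskRow[] = [
  { id: "step-a", state: "running", parentTaskId: null, dependsOn: [] },
  { id: "step-b", state: "pending", parentTaskId: "step-a", dependsOn: ["step-a"] },
];

// step-b is dispatched while step-a is still running.
const dispatchedToday = pollPendingOnly(tasks);
// Nothing is ready until step-a finishes.
const dispatchedGated = pollReady(tasks);
```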

Paseo's orchestrator is liftable as plumbing, but it isn't a conductor (A20-A24). AgentManager is a plain TypeScript class with no React-Native/Zustand dependency (A20, confirmed), and the MCP tool server exposing create_agent/wait_for_agent/archive_agent sits cleanly on top (A21). Agent state is JSON files on disk, and parent/child is a single label on a flat map (A23, A24), so both are trivial to replicate. The decisive finding: Paseo has no deterministic DAG or sequencing engine (A22). The parent agent, itself a running ACP process, decides the order by making MCP tool calls in its own loop. Paseo's daemon only spawns, waits, and bookkeeps.

The web evidence argues that model-self-orchestration is fragile for weak models and recommends deterministic code sequencing (A4, A5, A6, A9); this was corroborated live (A25). Disinterested academic work reports that a deterministic engine calling the model only for bounded sub-tasks beats self-orchestration by ~10 points with large reductions in turns/tool-calls (A6), and that flat-context steering accuracy falls from ~60% to ~21% as agent count climbs (A9, though this is the 3→10-agent regime, not a 2-3-step bounded flow). In this very session, opencode/Qwen-35B ran the han research skill, dispatched the first specialist, then dropped the adversarial-validator step entirely (A25): a live, n=1 instance of exactly this failure. The lowest-ops way to host deterministic sequencing on a single-user Postgres stack is in-process Postgres-backed execution; Temporal is too heavy for one user (A1, A12), Restate is lighter but a second stateful service (A2, A3), and Inngest isn't truly self-hostable (A13, A14).
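The "deterministic engine, model as bounded worker" idea can be sketched in a few lines. This is an illustration of the general pattern, not the method of any cited source: the step names and the Worker signature are assumptions, and the stub worker stands in for a real model call.

```typescript
// A worker is any bounded single-task call: step name and input in, text out.
type Worker = (step: string, input: string) => string;

// The blueprint is plain data owned by code; the model never edits it.
// Step names are illustrative.
const blueprint: readonly string[] = ["analyst", "validator", "synthesis"];

interface StepTrace {
  step: string;
  output: string;
}

function runBlueprint(question: string, worker: Worker): StepTrace[] {
  const trace: StepTrace[] = [];
  let input = question;
  for (const step of blueprint) {
    // The model does the work for this step; it does not choose the next step.
    const output = worker(step, input);
    trace.push({ step, output });
    input = output; // each step's output feeds the next step
  }
  return trace;
}

// Stub worker so the sketch runs without a model.
const stubWorker: Worker = (step, input) => `${step}(${input})`;
const trace = runBlueprint("q", stubWorker);
```

Because the loop, not the worker, walks the blueprint, a step such as the validator cannot be silently skipped; a worker can still return poor output for it.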

Conflict surfaced: the prior-art angle pushes toward a deterministic engine; the Paseo codebase proves the opposite pattern (model self-orchestration) ships and works — with a capable parent model. The two only reconcile once you fix who the conductor is.

Options to Consider

O-A: Copy Paseo's pattern — model self-orchestrates via an MCP toolbox

  • What it is: Expose create_agent / wait_for_agent / archive_agent (and friends) as MCP tools over BooCode's existing backends; a parent agent sequences sub-agents from inside its own reasoning loop. Optionally lift Paseo's AgentManager + MCP server (~5 files, no RN deps) rather than writing the toolbox fresh.
  • Trade-offs: Lowest effort and Paseo-proven — with a strong parent model. Fragile with a weak local conductor: non-deterministic, workflow state hidden in the model's context, no crash-replay. This is the pattern that dropped a step on Qwen in this session (A25).
  • Rests on: (A20, A21, A22, A24) — and the con on (A4, A5, A6, A9, A25)
  • Evidence status: corroborated
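A minimal sketch of the O-A shape follows, under loud assumptions: the Supervisor class, its method signatures, and the synchronous execution are all invented for illustration (real agents run asynchronously, and the real tools are exposed over MCP). What it tries to show is where the sequence lives: in the conductor function, which stands in for the parent model, and not in the supervisor.

```typescript
type Runner = (prompt: string) => string;

interface AgentRecord {
  id: string;
  labels: Record<string, string>; // parent/child is one label on a flat map
  result: string;
  archived: boolean;
}

// Supervisor only spawns, waits, and bookkeeps. It holds no workflow.
class Supervisor {
  private readonly agents = new Map<string, AgentRecord>();
  private readonly run: Runner;
  private nextId = 1;

  constructor(run: Runner) {
    this.run = run;
  }

  createAgent(prompt: string, parentId?: string): string {
    const id = `agent-${this.nextId++}`;
    const labels: Record<string, string> = parentId ? { parent: parentId } : {};
    // Simplification: the agent runs to completion synchronously.
    this.agents.set(id, { id, labels, result: this.run(prompt), archived: false });
    return id;
  }

  waitForAgent(id: string): string {
    const agent = this.agents.get(id);
    if (!agent) throw new Error(`unknown agent ${id}`);
    return agent.result;
  }

  // Archiving a parent cascades to every agent labelled as its child.
  archiveAgent(id: string): void {
    const agent = this.agents.get(id);
    if (!agent) throw new Error(`unknown agent ${id}`);
    agent.archived = true;
    for (const other of this.agents.values()) {
      if (other.labels["parent"] === id && !other.archived) this.archiveAgent(other.id);
    }
  }

  isArchived(id: string): boolean {
    return this.agents.get(id)?.archived ?? false;
  }
}

// Stand-in for the parent model: the ordering exists only in this function.
function conductor(tools: Supervisor): string {
  const root = tools.createAgent("plan");
  const research = tools.createAgent("research", root);
  const finding = tools.waitForAgent(research);
  const review = tools.createAgent(`review: ${finding}`, root);
  const verdict = tools.waitForAgent(review);
  tools.archiveAgent(root);
  return verdict;
}

const supervisor = new Supervisor((p) => `done(${p})`);
const verdict = conductor(supervisor);
```

If the conductor omits the review call, the supervisor has no way to notice, which is the fragility the trade-offs above describe.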

O-B: Build a deterministic code flow-runner on BooCode's own primitives

  • What it is: Add step sequencing in code — a depends_on / step-state column on the existing tasks table, dispatched by the existing LISTEN/NOTIFY poll; a real result-passing channel (replacing the 500-char output_summary); and a fold step. Agents stay bounded single-task workers.
  • Trade-offs: Reliable regardless of conductor model and crash-resumable. But the missing sequencing, result-passing, and fold are greenfield: the reused worker layer is plumbing, and the new pieces are where the schedule risk and the distributed-systems hazards live. The hand-rolled variant is genuinely near-zero new infra; a library (DBOS Transact) is not, given BooCode's host-systemd/Docker split and dual-schema DB (see V6). DBOS is an unvalidated option, not a co-equal one.
  • Rests on: (A15, A16, A17, A19) for the substrate; (A6) for the architecture; (A1, A7, A8, A10) for the tooling angle — the last two vendor-sourced and caveated.
  • Evidence status: architecture corroborated; specific tooling (DBOS/pg-workflows) single-source (caveated)
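The hand-rolled variant of O-B might look roughly like the sketch below. It is an in-memory illustration under stated assumptions: dependsOn and result are hypothetical columns, the array stands in for the tasks table, and a JSON round-trip stands in for a process restart that reloads state from Postgres.

```typescript
type StepState = "pending" | "done";

interface TaskRow {
  id: string;
  state: StepState;
  dependsOn: string[]; // hypothetical column
  result?: string; // hypothetical full-size result channel
}

type Worker = (taskId: string, inputs: string[]) => string;

// Runs at most one ready task per call; returns false when nothing is ready.
function runNext(rows: TaskRow[], worker: Worker): boolean {
  const byId = new Map<string, TaskRow>();
  for (const r of rows) byId.set(r.id, r);
  for (const row of rows) {
    if (row.state !== "pending") continue;
    const deps = row.dependsOn.map((id) => byId.get(id));
    if (!deps.every((d) => d !== undefined && d.state === "done")) continue;
    // Upstream results are handed to the step in dependency order.
    row.result = worker(row.id, deps.map((d) => d?.result ?? ""));
    row.state = "done";
    return true;
  }
  return false;
}

let calls = 0;
const worker: Worker = (id, inputs) => {
  calls++;
  return `${id}[${inputs.join("+")}]`;
};

let rows: TaskRow[] = [
  { id: "research", state: "pending", dependsOn: [] },
  { id: "review", state: "pending", dependsOn: ["research"] },
  { id: "fold", state: "pending", dependsOn: ["research", "review"] },
];

runNext(rows, worker); // first step completes
rows = JSON.parse(JSON.stringify(rows)) as TaskRow[]; // simulated restart
while (runNext(rows, worker)) {
  // keep going until no task is ready
}
```

The sketch resumes only at step boundaries. A step interrupted mid-run would need something like a running state plus a retry policy, which is part of the greenfield work noted above and is omitted here.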

O-C: Hybrid — lift Paseo's ACP supervisor as the worker layer, code orchestrator on top

  • What it is: Replace BooCode's backends with Paseo's lifted ACP client + AgentManager, then put an O-B deterministic conductor over it.
  • Trade-offs: Net-negative. BooCode already has four working backends; importing Paseo's provider tree (heavy intra-package @getpaseo/protocol coupling, per V7) swaps working code for a foreign dependency. Recommended against.
  • Rests on: (A18, A20) — and V7
  • Evidence status: corroborated (against)

Recommendation

  • Recommendation: No single winner until the conductor model is fixed — this is the deciding criterion.
    • If the conductor is a strong model (Claude, in-stack today): choose O-A. It is the simpler, Paseo-proven path, and building a deterministic engine would be over-engineering: scope that wasn't asked for. Expose the MCP toolbox over BooCode's existing backends; lifting Paseo's AgentManager/MCP server is an optional convenience, not a necessity.
    • If the conductor must be a weak local model (Qwen) — i.e. the goal is free/local multi-agent flows: choose O-B, hand-rolled (depends_on + step-state on tasks, dispatched by the existing LISTEN/NOTIFY, plus a real result channel and a fold step). Determinism is not optional here; the model cannot be trusted to sequence itself (A25). Treat DBOS as an unvalidated alternative, not the default.
    • In neither case O-C, and in neither case lift Paseo's conductor — there isn't one to lift (A22).
  • Evidence basis: The codebase findings (A15-A24) are current-state anchors, all independently verified by the validator against live source; the worker-layer-exists / sequencing-absent / Paseo-has-no-engine conclusions are solid. The "weak models can't self-orchestrate" direction rests on one disinterested source (A6) plus general degradation work (A9) and a single live anecdote (A25): enough to make O-A fragile on Qwen, not enough to call it impossible. The specific DBOS tooling pick rests only on vendor marketing (A7, A8) and a single-source library (A10), so it is explicitly demoted. The fork itself (which option wins) rests on a constraint (conductor model) that the question never stated and the operator must supply.

Validation

V1: The "reuse 70%, build 30%" split is inverted on effort

  • Strategy: Challenge the Recommendation
  • Investigation: Read dispatcher.ts end-to-end; poll query dispatches on state='pending' only, no dependency awareness; no fold/synthesis code exists.
  • Result: Partially Refuted
  • Impact: The reused 70% is already-debugged plumbing; the missing piece (crash-resumable step state machine, result-passing, fold) is 100% greenfield and is where the risk lives. Direction survives; the effort framing was misleading and is corrected in O-B.

V2: The 500-char output_summary cap makes the substrate hostile to result-passing

  • Strategy: Challenge the Evidence
  • Investigation: The 500-char cap is applied in every completion path (dispatcher.ts:249/440/509/855/1120/1375); full output goes to messages.content (50k) but the cheap parent-readable field is 500 chars.
  • Result: Confirmed
  • Impact: Result-passing must replace this primitive, not reuse it. Folded into O-B's scope.
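A small sketch of why the cap matters for result-passing. The 500-character figure is from the finding above; modelling the cap as a plain prefix slice is an assumption, and ResultChannel is a hypothetical replacement, not existing BooCode code.

```typescript
const SUMMARY_CAP = 500; // the cap reported for tasks.output_summary

// Assumption: the cap is modelled as a plain prefix slice.
function toOutputSummary(fullOutput: string): string {
  return fullOutput.slice(0, SUMMARY_CAP);
}

// Hypothetical replacement channel: the full artifact, keyed by task id.
class ResultChannel {
  private readonly artifacts = new Map<string, string>();

  put(taskId: string, artifact: string): void {
    this.artifacts.set(taskId, artifact);
  }

  get(taskId: string): string {
    const artifact = this.artifacts.get(taskId);
    if (artifact === undefined) throw new Error(`no result for ${taskId}`);
    return artifact;
  }
}

// A stand-in diff of 40 changed lines, well over the cap.
const diffLines: string[] = [];
for (let i = 1; i <= 40; i++) diffLines.push(`+ changed line number ${i}`);
const artifact = diffLines.join("\n");

const summary = toOutputSummary(artifact); // tail of the diff is lost
const channel = new ResultChannel();
channel.put("step-a", artifact); // next step can read the whole artifact
```

A downstream step reading only the summary would see roughly the first half of this diff, which is why the channel has to be replaced and not reused.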

V3: "O-A fails on Qwen" is over-weighted by one live anecdote (A25)

  • Strategy: Challenge the Evidence-Gathering Integrity
  • Investigation: A25 is n=1, self-collected this session; A9's degradation figures are for 3→10 agents in flat context, not a 2-3-step bounded flow; no test of Qwen on a tightened skill.
  • Result: Partially Refuted
  • Impact: Reworded to "O-A is fragile with a weak local conductor," not "fails." A stronger local model or a step-gated skill could flip it.

V4: The synthesis assumes BooCode wants deterministic flows

  • Strategy: Challenge the Recommendation
  • Investigation: The question was "lift Paseo's orchestration," which Paseo proves works via MCP self-orchestration with a strong parent. Nothing in the question mandates a weak local conductor; BooCode already runs Claude-SDK and opencode backends.
  • Result: Refuted (as an unstated assumption)
  • Impact: Load-bearing. The recommendation was rewritten from "O-B" into the conditional fork above, with conductor-model as the explicit deciding criterion.

V5: Discounting interested-party web sources (A4, A7, A8, A13, A14)

  • Strategy: Challenge the Evidence-Gathering Integrity
  • Investigation: Sensitivity test. Remove A4 (Praetorian): architecture claim still holds on A6 (disinterested arXiv). Remove A7/A8/A10 (DBOS/pg-workflows): the specific tooling recommendation loses its entire basis. Remove A13/A14 (Inngest): irrelevant.
  • Result: Partially Refuted
  • Impact: Architectural recommendation survives on A6 alone; the DBOS tooling pick was demoted to an unvalidated option in O-B.
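The sensitivity test above amounts to a simple set operation, sketched here. The claim labels are made up for the example; the source IDs per claim follow the groupings stated in this validation item and in the Research Results.

```typescript
// Which sources each claim leans on, per the groupings in this document.
const support = new Map<string, string[]>([
  ["deterministic architecture", ["A4", "A5", "A6", "A9"]],
  ["DBOS/pg-workflows tooling", ["A7", "A8", "A10"]],
]);

// Sources still backing a claim once the given ones are discounted.
function remainingSupport(claim: string, removed: string[]): string[] {
  const sources = support.get(claim) ?? [];
  return sources.filter((id) => !removed.includes(id));
}

const archWithoutPraetorian = remainingSupport("deterministic architecture", ["A4"]);
const toolingWithoutVendors = remainingSupport("DBOS/pg-workflows tooling", ["A7", "A8", "A10"]);
```

The first result still contains A6, so the architectural claim stands; the second is empty, which is why the tooling pick was demoted.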

V6: "No new infra, just Postgres" ignores the host/container/dual-schema split

  • Strategy: Challenge the Fix
  • Investigation: BooCoder runs as a host systemd service; apps/server runs in Docker; the coder schema is applied separately. DBOS keeps its own system tables and owns transaction boundaries — a third schema-owner in the shared DB plus a second durable-execution engine overlapping the existing poll machinery.
  • Result: Confirmed
  • Impact: "No new infra" is true only for the hand-rolled variant, false for DBOS. O-B now recommends hand-rolled and separates the two.

V7: The Paseo "5 files, 2-3 days" lift drags in the provider tree

  • Strategy: Challenge the Evidence
  • Investigation: agent-manager.ts imports 30+ types from agent-sdk-types, plus @getpaseo/protocol/* and the full provider implementations. The "no React-Native deps" claim is true; the implied cheap isolated lift is not.
  • Result: Partially Refuted
  • Impact: O-C (reuse Paseo's worker supervisor) is net-negative since BooCode already has four working backends. O-C recommended against.

V8: Provenance of the two untracked artifact files

  • Strategy: Challenge the Evidence-Gathering Integrity
  • Investigation: Untracked docs/features/git-diff-panel/artifacts/*.md are the synthesis's own working files, not fetched external content; low injection risk but unversioned.
  • Result: Confirmed (low severity)
  • Impact: Codebase claims (A15-A24) are reproducible and were checked; web claims (A1-A14) are not re-fetchable from here and were taken on the retrieval's word.

Adjustments Made

The recommendation did not survive in its original "O-B as the spine" form. It was rewritten into the conditional fork above (deciding criterion: conductor model), per V4. O-C was dropped (V7); DBOS was demoted from co-equal to unvalidated (V5, V6); the "O-A fails on Qwen" claim was softened to "fragile" (V3); the effort framing and the 500-char result-passing problem were folded into O-B's scope (V1, V2).

Confidence Assessment

  • Confidence: Medium
  • Remaining Risks: The web tier (A1-A14) is unverifiable from this environment and the DBOS/pg-workflows specifics are vendor/single-sourced. A25 is n=1. The single assumption that flips the entire recommendation (whether the conductor is Claude or Qwen) is the operator's to confirm; everything downstream depends on it.

Sources

| ID | Source | Link / location | Retrieved | Trust class | Summary (one line) | Evidence status |
|---|---|---|---|---|---|---|
| A1 | Nango: left Temporal for Postgres orchestration | https://nango.dev/blog/migrating-from-temporal-to-a-postgres-based-task-orchestrator/ | 2026-06-03 | web | Temporal ops overhead drove a move to Postgres-backed orchestration | corroborated by A12 |
| A2 | Restate self-hosted overview | https://docs.restate.dev/server/overview | 2026-06-03 | web | Single binary, embedded RocksDB, no external DB | corroborated by A3 |
| A3 | Show HN: Restate | https://news.ycombinator.com/item?id=40659160 | 2026-06-03 | web | Confirms single-binary lightweight deploy | corroborated by A2 |
| A4 | Praetorian: Deterministic AI Orchestration | https://www.praetorian.com/blog/deterministic-ai-orchestration-a-platform-architecture-for-autonomous-development/ | 2026-06-03 | web | External orchestration beats self-orchestration; small models do better as bounded workers | corroborated by A6, A9 (interested party) |
| A5 | Hatchworks: Orchestrating AI Agents | https://hatchworks.com/blog/ai-agents/orchestrating-ai-agents/ | 2026-06-03 | web | Self-orchestration failure modes; external control plane as default | corroborated by A4, A6 |
| A6 | arXiv 2508.02721 Blueprint-First | https://arxiv.org/abs/2508.02721 | 2026-06-03 | web | Deterministic engine + LLM for bounded sub-tasks; +10.1pp, fewer turns | corroborated by A4, A5 |
| A7 | Show HN: DBOS TypeScript | https://news.ycombinator.com/item?id=42727970 | 2026-06-03 | web | In-process Postgres durable execution, decorator steps | corroborated by A8 |
| A8 | DBOS Transact | https://www.dbos.dev/dbos-transact | 2026-06-03 | web | "Just your program and Postgres," no orchestration server | corroborated by A7 (interested party) |
| A9 | arXiv 2604.07911 context scoping | https://arxiv.org/pdf/2604.07911 | 2026-06-03 | web | Flat-context steering accuracy 60%→21% from 3→10 agents | corroborated by A4, A6 |
| A10 | pg-workflows | https://sokratisvidros.github.io/pg-workflows/ | 2026-06-03 | web | Pure-Postgres TS workflow library, step exactly-once | single source (caveated) |
| A11 | Hatchet v1 HN | https://news.ycombinator.com/item?id=43572733 | 2026-06-03 | web | Postgres + RabbitMQ; separate server process | corroborated by A1 |
| A12 | Temporal self-host guide | https://docs.temporal.io/self-hosted-guide/deployment | 2026-06-03 | web | Multi-service self-host overhead | corroborated by A1 |
| A13 | Inngest vs Trigger vs Restate | https://www.pkgpulse.com/guides/inngest-vs-trigger-dev-v3-vs-restate-2026 | 2026-06-03 | web | Inngest cloud-first; not truly self-hostable | corroborated by A14 |
| A14 | Inngest self-hosting docs | https://www.inngest.com/docs/self-hosting | 2026-06-03 | web | Engine proprietary to Inngest Cloud | corroborated by A13 (interested party) |
| A15 | BooCode dispatcher + LISTEN/NOTIFY | apps/coder/src/services/dispatcher.ts:46 | n/a | codebase | Task dispatch via tasks_new notify + poll; backend routing | corroborated by A18 |
| A16 | BooCode Arena | apps/coder/src/routes/arena.ts:34 | n/a | codebase | Parallel fan-out 2-5 contestants; selection is [SELECTED] prefix, no consumer; sessionless | single source (codebase anchor) |
| A17 | BooCode tasks table | apps/coder/src/schema.sql:18 | n/a | codebase | parent_task_id FK written-but-not-dispatched-on; no depends_on/step/flows | single source (codebase anchor) |
| A18 | AgentBackend + 4 backends | apps/coder/src/services/agent-backend.ts:97 | n/a | codebase | Common ensureSession/prompt surface; opencode/warm-acp/claude-sdk/one-shot | corroborated by A15 |
| A19 | new_task / output_summary cap | apps/coder/src/services/tools/new_task.ts:13 | n/a | codebase | Native-only tools; parent reads only 500-char output_summary | single source (codebase anchor) |
| A20 | Paseo AgentManager | /opt/forks/paseo/packages/server/src/server/agent/agent-manager.ts:413 | n/a | codebase | Plain TS class, no RN deps; create/stream/run/wait/cascade-archive | corroborated by A22 |
| A21 | Paseo createAgentMcpServer | /opt/forks/paseo/.../agent/mcp-server.ts:479 | n/a | codebase | create_agent/wait_for_agent/archive_agent as MCP tools; child gets parent MCP URL | corroborated by A20 |
| A22 | Paseo has no sequencing engine | /opt/forks/paseo/.../agent-manager.ts (verdict) | n/a | codebase | Parent model self-orchestrates via MCP; daemon supervises only | single source (codebase anchor) |
| A23 | Paseo AgentStorage | /opt/forks/paseo/.../agent/agent-storage.ts:84 | n/a | codebase | JSON files on disk + in-memory timeline, no DB | single source (codebase anchor) |
| A24 | Paseo parent/child label | /opt/forks/paseo/packages/protocol/src/agent-labels.ts | n/a | codebase | Relationship is one label on a flat map | corroborated by A22 |
| A25 | Live smoke test (this session) | provided: opencode/Qwen-35B han research run | n/a | provided | Qwen dispatched analyst, dropped the validator step, skipped template | single source (live, n=1) |

A22: Paseo has no deterministic sequencing engine — recommendation-bearing

  • Link / location: /opt/forks/paseo/packages/server/src/server/agent/agent-manager.ts:413 (+ explorer verdict)
  • Retrieved: n/a
  • Trust class: codebase (current-state anchor)
  • Summary: Paseo's daemon spawns, waits on, and bookkeeps agents; it contains no DAG or workflow engine. The parent agent — itself an ACP process — does all sequencing by calling create_agent/wait_for_agent/archive_agent as MCP tools in its own reasoning loop. This is why "lift Paseo's orchestrator" is a category error: the conductor is the model, not Paseo. It reframes the entire recommendation into "build a conductor (O-B) vs adopt model-as-conductor (O-A)."
  • Evidence status: corroborated by A20, A24

A17: BooCode tasks table lacks sequencing columns — recommendation-bearing

  • Link / location: apps/coder/src/schema.sql:18
  • Retrieved: n/a
  • Trust class: codebase (current-state anchor)
  • Summary: The tasks table has parent_task_id (written by new_task, read only by list_tasks, never by the dispatcher) but no depends_on, step_index, or flows definition. The dispatcher poll selects on state='pending' alone. This is the concrete gap O-B must fill, and it confirms the deterministic sequencing substrate genuinely does not exist today.
  • Evidence status: single source (codebase anchor), verified by the validator against live source

A6: Blueprint-First deterministic workflow (arXiv) — recommendation-bearing

  • Link / location: https://arxiv.org/abs/2508.02721
  • Retrieved: 2026-06-03
  • Trust class: web (arXiv preprint, disinterested)
  • Summary: A deterministic engine executes an expert-defined blueprint and calls the LLM only for bounded sub-tasks, never to decide workflow path; reports ~10-point gains and large reductions in turns and tool calls versus self-orchestrating agents. This is the one disinterested source carrying the "deterministic code beats model self-orchestration" direction after the interested parties (A4) are discounted.
  • Evidence status: corroborated by A4, A5, A9