boocode/openspec/changes/orchestrator/artifacts/implementation-decision-log.md
indifferentketchup 1937af8df9 feat: in-app Orchestrator (Phase 2) — multi-agent conductor
Brings the deterministic Han-flow conductor into BooCode: launch any read-only
flow from BooChat or BooCoder, watch each agent stream live in a Paseo-style
run pane, get an evidence-disciplined report — on local Qwen, persisted and
resumable. Read-only enforced hard via qwen --approval-mode plan (orchestrator
tasks fail closed if qwen is unavailable; never fall to write-capable native).

Backend (apps/coder): re-homed conductor defs, flow_runs/flow_steps schema,
flow-runner + dispatcher onTaskTerminal hook, restart-resume, runs routes
(launch/list/get/cancel), user-channel WS. Contracts: two flow_run_* frames.
Web: orchestrator pane kind + OrchestratorPane, Workflow button + slash flows
(BooChat/BooCoder parity), FlowLauncherDialog, "New Orchestrator" in the + and
split menus, runs history + export. Plan: openspec/changes/orchestrator.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 15:22:48 +00:00

Implementation Decision Log — Orchestrator (Phase 2)

Han synthesis output. Each decision is committed: it cites evidence, records rejected alternatives, and names an owner and a revisit criterion. Cross-reference invariant: every D-N here is referenced by design.md and/or tasks.md, and produced by a round recorded in implementation-iteration-history.md.

Source: a conversational grill-me design session. The settled behavioral spec is captured in design-context.md (12 decisions; decision 5 is REVISED by D-3 / D-4 below). Specialist findings are in the claim ledger C1-C16 of the iteration history.

Trust class of evidence below: codebase (file:line in this repo) unless noted. No single-source web claims underpin any committed decision.


D-1 — Re-home the pure conductor definitions into apps/coder/src/conductor/

Decision. Copy the pure (dispatch-free) conductor definition files (spine.ts, flows/*, contracts.ts, types.ts, render.ts) into apps/coder/src/conductor/, plus the 23 Han personas (conductor/agents/*.md). The Phase-1 standalone CLI (conductor/) stays alive and unchanged. Sever the flows/code-review.ts → dispatch.ts coupling by adding a DispatchFn to StepContext, injected by the flow-runner. Switch spine.ts's model source from process.env.CONDUCTOR_MODEL to the run's configured model.

Rationale. The flow definitions are pure data + closures; only dispatch.ts (the opencode run subprocess path) and flow.ts (the in-memory scheduler) are Phase-1-specific. Copying the pure files avoids a workspace-package extraction (YAGNI — only two consumers) while keeping the Phase-1 CLI as a regression oracle. The evidence/yagni contracts are preserved because the flow-runner calls step.run(ctx) in-process to build each prompt BEFORE inserting the task — the closures execute in the coder process; prompts are never serialized to DB.
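The DispatchFn seam can be pictured with a minimal sketch. The shapes below are assumptions for illustration, not the repo's actual types.ts: the field names on StepContext, the reviewByDimension and renderReportHeader helpers, and the "code-reviewer" agent name are all hypothetical.

```typescript
// Hypothetical shapes; the real definitions live in the re-homed types.ts.
type DispatchFn = (agent: string, prompt: string) => Promise<string>;

interface StepContext {
  question: string;
  model: string; // the run's configured model, replacing process.env.CONDUCTOR_MODEL
  dispatch: DispatchFn; // injected by the flow-runner, so flows never import dispatch.ts
}

// A code-review style step: one dispatch per review dimension, all through ctx.dispatch.
async function reviewByDimension(
  ctx: StepContext,
  dimensions: string[],
): Promise<Record<string, string>> {
  const findings: Record<string, string> = {};
  for (const dimension of dimensions) {
    const prompt = `Review for ${dimension}: ${ctx.question}`;
    // "code-reviewer" is a placeholder agent name, not one of the real personas.
    findings[dimension] = await ctx.dispatch("code-reviewer", prompt);
  }
  return findings;
}

// Report header rendered from the run's model rather than an environment variable.
function renderReportHeader(flowName: string, ctx: StepContext): string {
  return `${flowName} (model: ${ctx.model})`;
}
```

Because the flow only sees ctx.dispatch, the Phase-1 CLI and the coder flow-runner could each inject their own implementation without the flow file changing.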

Evidence. code-review.ts:10 (import { dispatchAgent } from '../dispatch.js') and :62 (the per-dimension dispatch call) — the only flow→dispatch coupling (C1). spine.ts:122 renders process.env.CONDUCTOR_MODEL into the report header (C14). spine.ts:73 — contracts injected via the step closure, in-process (C11). 23 personas confirmed at conductor/agents/*.md.

Rejected alternatives.

  • A @boocode/conductor workspace package — rejected: only two consumers (Phase-1 CLI + coder); a shared package is premature abstraction (YAGNI). Deferred with a reopen trigger (a 3rd app needing conductor types). See Deferred (YAGNI).
  • Importing conductor/src/* directly from apps/coder across the workspace boundary — rejected: couples the coder build to the standalone CLI tree and its opencode-flavored dispatch import graph.

Specialist owner. software-architect. Revisit criterion. A third app needs the conductor types (then extract the workspace package). Driven by rounds: R1. Referenced in plan: design.md §Re-home & DispatchFn seam; tasks.md group 1.


D-2 — DB-driven flow-runner with an onTaskTerminal dispatcher hook

Decision. Add apps/coder/src/services/flow-runner.ts: a DB-backed scheduler that owns flow_runs/flow_steps, computes the ready wave from the loaded flow def, INSERTs each ready agent step as a tasks row, runs code steps inline, and advances. Fan-out is driven by ONE new hook — an onTaskTerminal(taskId, state) callback on createDispatcher — invoked when any task reaches a terminal state. No third poll loop; no modification to the dispatcher's internal run functions.

Rationale. The dispatcher already has the LISTEN/NOTIFY + poll machinery and the terminal-state transitions; a single callback at those transition points lets the flow-runner react without duplicating the dispatch loop. The flow-runner stays a pure scheduler; execution stays in the dispatcher.
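As a rough sketch of the two pieces named above, the following shows a ready-wave computation and a callback seam. Everything here is illustrative: the deps field name, the createDispatcherSketch stand-in, and the rule that a skipped dependency counts as satisfied are assumptions, not the actual flow-runner or dispatcher code.

```typescript
type StepStatus = "pending" | "running" | "completed" | "failed" | "skipped";

interface FlowStepDef {
  id: string;
  kind: "agent" | "code";
  deps: string[]; // hypothetical field name for a step's dependencies
}

// A step is ready when it has no recorded status yet and every dependency
// has completed (or, by assumption, been skipped).
function computeReadyWave(
  steps: FlowStepDef[],
  statuses: Map<string, StepStatus>,
): FlowStepDef[] {
  return steps.filter(
    (step) =>
      !statuses.has(step.id) &&
      step.deps.every((dep) => {
        const depStatus = statuses.get(dep);
        return depStatus === "completed" || depStatus === "skipped";
      }),
  );
}

type TaskTerminalState = "completed" | "failed";
type OnTaskTerminal = (taskId: string, state: TaskTerminalState) => void;

// Stand-in for createDispatcher: it only shows where the single hook would fire.
function createDispatcherSketch(onTaskTerminal: OnTaskTerminal) {
  return {
    finish(taskId: string, state: TaskTerminalState): void {
      // The existing terminal-state bookkeeping would run before this call.
      onTaskTerminal(taskId, state);
    },
  };
}
```

In this picture the flow-runner's callback would record the step outcome, recompute the wave, and insert the next pending tasks, while the dispatcher stays unaware of flows.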

Evidence. dispatcher.ts:46-179 (the loop + runTask), :279-286 (the notify_tasks_new trigger) (C2). Terminal transitions the hook attaches to: external completed dispatcher.ts:642-646, external failed :659-661. Full step output must persist in flow_steps.output TEXT because tasks.output_summary is ≤500 char and cannot reconstruct ctx.results for render/resume (schema.sql:26, flow.ts:49,59) (C3).

Rejected alternatives.

  • A standalone third poll loop in the flow-runner — rejected: duplicates the dispatcher's LISTEN/poll, two writers racing on tasks.
  • Modifying the dispatcher's runTask internals to know about flows — rejected: couples the generic dispatcher to the orchestrator; the callback seam keeps the dispatcher flow-agnostic.

Specialist owner. software-architect. Revisit criterion. Step throughput requires batching beyond what one callback per terminal task supports. Driven by rounds: R1. Referenced in plan: design.md §Flow-runner & onTaskTerminal; tasks.md group 4.


D-3 — Reuse the existing dispatcher (insert pending task), not a direct-PTY bypass

Decision. The flow-runner INSERTs each ready step as a normal state='pending' tasks row; the existing dispatcher picks it up via LISTEN 'tasks_new', runs it through the existing external-agent path (creating a git worktree as a stable HEAD read-checkout), and streams AgentEvents → WS frames unchanged. The new onTaskTerminal hook (D-2) notifies the flow-runner on terminal state. No direct-PTY bypass; the dispatcher is reused with exactly one new hook.

This REVISES design-context decision 5 ("no worktree") to: a worktree IS created, but it is a harmless read snapshot — read-only is enforced by plan mode (D-4), not by the absence of a worktree.

Rationale. Reuse (architect's A2) gets streaming, persistence, resume, cancellation, and AgentEvent→WS mapping for free. The only objection to A2 was that it creates a worktree the "no worktree" decision-5 wanted to avoid; once read-only is enforced at the tool level by plan mode (D-4), the worktree is inert (a checkout the agent cannot write to), so the objection dissolves. This was the user's explicit choice over the architect's leaning-toward-bypass (A4).

Evidence. The external-agent path with worktree creation and AgentEvent→WS streaming: dispatcher.ts external branch (worktree create → run → terminal at :642-646/:659-661). Task-as-dispatch precedents the flow-runner copies: routes/skills.ts:94 (a skill is already dispatched as a task), routes/arena.ts:49, tools/new_task.ts:54. Dispatch tension recorded as C12 (A2 vs A4, architect self-flagged Disputed); resolved here by user choice.

Rejected alternatives.

  • A4 direct-dispatchViaPty bypass (insert running task + call PTY directly to skip worktree creation) — rejected: duplicates streaming/persistence/resume wiring, and a restart kills the PTY child outside the dispatcher's lifecycle (worsening resume, C15). The worktree it was avoiding is harmless under D-4.
  • design-context decision 5's "no worktree, read project dir directly" — rejected: reusing the dispatcher means reusing its worktree creation; under D-4 the worktree is a read snapshot, so avoiding it bought nothing and cost the reuse.

Specialist owner. software-architect (execution path); devops-engineer (operational behavior of the reused dispatcher under flow load). Revisit criterion. Worktree creation per step becomes a measured throughput or disk-cost problem under real flow concurrency. Driven by rounds: R1 (C12), R2 (read-only finding that made the worktree inert). Referenced in plan: design.md §Execution via dispatcher reuse; tasks.md group 4.


D-4 — Read-only enforced HARD by mode_id='plan' (qwen --approval-mode plan)

Decision. Every orchestrator step task is dispatched with mode_id = 'plan', which the PTY dispatcher passes to qwen as --approval-mode plan — a built-in tool-level gate: reads allowed, writes blocked. The flow-runner hardcodes mode_id='plan' for every step task; it is never user-overridable. This is the sole read-only enforcement mechanism. BOOCODE_TOOLS and persona prompts are NOT relied upon (they do not govern external CLI agents).

Rationale. Read-only is a safety-critical invariant of the whole feature (flows never write the repo). Prompt-level intent and BOOCODE_TOOLS ceilings govern BooChat's in-process tools, not an external qwen CLI child — so they are not watertight. qwen's --approval-mode plan is a tool-level gate inside the agent binary itself, which the adversarial-security-analyst (R2) identified as the only enforcement that actually binds the external agent. Qwen-only (decision 6) makes a single hardcoded flag sufficient.

Evidence. The wiring already exists: pty-dispatch.ts:75 has if (modeId) args.push('--approval-mode', modeId) in the qwen spawn spec. R2 security finding recorded as C13 (the R1 claim that prompt-level + BOOCODE_TOOLS enforcement was sufficient was Anecdotal/unproven; R2 refuted it and named plan mode as the binding control).
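A minimal sketch of how the hardcoded mode could flow from the step task to the qwen argument list. Only the args.push line mirrors the cited source; buildStepTask, planStepSpawn, the requestedModeId field, and the agent column are hypothetical names introduced for illustration.

```typescript
interface StepTaskInput {
  prompt: string;
  requestedModeId?: string; // hypothetical: whatever a caller might try to pass
}

interface PendingStepTask {
  state: "pending";
  agent: "qwen"; // assumed column; flows are qwen-only per decision 6
  mode_id: "plan";
  prompt: string;
}

// The flow-runner ignores any requested mode: every step task is plan mode.
function buildStepTask(input: StepTaskInput): PendingStepTask {
  return { state: "pending", agent: "qwen", mode_id: "plan", prompt: input.prompt };
}

// Mirrors the existing wiring: a mode id becomes --approval-mode <id>.
function buildQwenArgs(modeId: string | undefined): string[] {
  const args: string[] = [];
  if (modeId) args.push("--approval-mode", modeId);
  return args;
}

// Fail closed: with qwen unavailable the step errors instead of falling back
// to a write-capable agent.
function planStepSpawn(qwenAvailable: boolean, task: PendingStepTask): string[] {
  if (!qwenAvailable) {
    throw new Error("qwen unavailable: orchestrator step fails closed");
  }
  return buildQwenArgs(task.mode_id);
}
```

Typing mode_id as the literal "plan" means a non-plan orchestrator task would not even type-check in this sketch, which is one way to keep the invariant visible in code.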

Rejected alternatives.

  • Prompt-level read-only intent (personas tell the agent not to write) — rejected (C13, R2): an instruction, not a gate; a model can ignore or be steered past it.
  • BOOCODE_TOOLS=core as the gate — rejected (C13, R2): governs BooChat's in-process tool registry, does not constrain the external qwen CLI's own tools.
  • A read_only boolean flag on tasks — rejected: superseded by mode_id='plan', which is an existing column already plumbed to the binary. See Deferred (YAGNI).

Specialist owner. adversarial-security-analyst. Revisit criterion. A non-qwen agent is added to flows (re-verify that agent's equivalent of --approval-mode plan before allowing it), or qwen changes --approval-mode plan semantics. Driven by rounds: R1 (C13 flagged), R2 (resolved). Referenced in plan: design.md §Read-only via plan mode; tasks.md group 4.


D-5 — flow_runs + flow_steps schema in the coder schema

Decision. Add two tables to apps/coder/src/schema.sql:

  • flow_runs(id, project_id [no FK, matches tasks.project_id], flow_name, band [CHECK small|medium|large], model, status [CHECK-named], input JSONB [CHECK (input ? 'question')], report TEXT [nullable], error, timestamps).
  • flow_steps(id, run_id [FK → flow_runs ON DELETE CASCADE], step_id, kind [CHECK agent|code], agent, status [CHECK-named], task_id [UUID → tasks(id) ON DELETE SET NULL; nullable, code steps NULL], chat_id [UUID → chats(id) ON DELETE SET NULL], input TEXT, output TEXT [FULL output], error, timestamps, UNIQUE(run_id, step_id)).

No depends_on column (derive from the loaded flow def). Do NOT insert skipped-step rows (when() is pure on stored input). Indexes: flow_steps(run_id, status), flow_runs(project_id, created_at DESC). Explicit CHECK constraint names + the repo's DROP-IF-EXISTS → guarded-ADD migration discipline.

Rationale. A run spans multiple tasks; existing tables (tasks, agent_sessions) model single dispatches, not a DAG. flow_steps.task_id → tasks(id) (not a column on tasks) keeps tasks generic. output TEXT is FULL because tasks.output_summary is ≤500 char and cannot reconstruct ctx.results. project_id has no FK to match tasks.project_id's existing convention.
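To illustrate why the full output matters, here is a sketch of a flow_steps row type and a helper that rebuilds a results map from stored rows. The nullability choices and the reconstructResults helper are assumptions; timestamp columns are omitted, and the real column types are defined by schema.sql.

```typescript
type FlowStepStatus = "pending" | "running" | "completed" | "failed" | "skipped";

// Assumed row shape for flow_steps (timestamps omitted).
interface FlowStepRow {
  id: string;
  run_id: string;
  step_id: string;
  kind: "agent" | "code";
  agent: string | null;
  status: FlowStepStatus;
  task_id: string | null; // NULL for code steps
  chat_id: string | null;
  input: string | null;
  output: string | null; // FULL output, not the <=500 char task summary
  error: string | null;
}

// Rebuild a step_id -> output map from completed rows, as a render or resume
// pass would need in order to repopulate ctx.results.
function reconstructResults(rows: FlowStepRow[]): Record<string, string> {
  const results: Record<string, string> = {};
  for (const row of rows) {
    if (row.status === "completed" && row.output !== null) {
      results[row.step_id] = row.output;
    }
  }
  return results;
}
```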

Evidence. tasks shape and output_summary ≤500 char: schema.sql:18-34, :26 (C3, C4). flow.ts:49,59 (results reconstruction needs full output, C3). flow.ts:28-41, types.ts:27 (deps + when() derivable from flow def — omit depends_on and skipped rows, C6). schema.sql:19,32 (project_id no-FK pattern; CHECK-named discipline, C5). Migration discipline: root CLAUDE.md schema section.

Rejected alternatives.

  • A depends_on column on flow_steps — rejected (C6, YAGNI): deps are in the loaded flow def; storing them duplicates the source of truth. Deferred.
  • Persisting skipped-step rows — rejected (C6, YAGNI): when() is pure on stored input, so a skip is reconstructable. Deferred.
  • A column on tasks (e.g. flow_step_id) — rejected (C4): pollutes the generic tasks table; the FK belongs on flow_steps.

Specialist owner. data-engineer. Revisit criterion. A stored-run DAG visualization needs deps without loading the flow def (then add depends_on); the UI must explain a skip without the flow def (then persist skipped rows). Driven by rounds: R1. Referenced in plan: design.md §Schema; tasks.md group 2.


D-6 — Two new WS frames; per-agent stream reuses existing frames by chat_id

Decision. Add two frames to packages/contracts/src/ws-frames.ts:

  • flow_run_started: run_id, flow_name, band, steps[] (each step_id, agent, kind, chat_id, label).
  • flow_run_step_updated: run_id, step_id, status, run_status?, report?.

The per-agent content stream REUSES the existing delta / tool_call / message_complete frames keyed by the step's chat_id. Each agent step gets a synthetic chats row for stream attribution. Register in all THREE frame registries: contracts WsFrameSchema, the server InferenceFrame union (services/inference/turn.ts), and the web strict WsFrame union (apps/web/src/api/types.ts) — the web type is the wire-format gate.

Rationale. The run-level lifecycle (which agents exist, their status, the final report) needs new frames; the per-agent token stream is exactly what the existing delta/tool_call/message_complete pipeline already carries, so keying it by a synthetic chat_id reuses the whole broker→WS path with no new streaming code. The report rides on flow_run_step_updated rather than its own frame (one fewer frame type; revisit only if reports exceed the frame size limit).
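The two frames could be typed roughly as below. This is a sketch only: the type discriminator field name, the string type for run_status, the nullable agent and chat_id on steps, and the applyFlowFrame reducer are assumptions; the authoritative shapes are whatever the three frame registries declare.

```typescript
type StepStatus = "pending" | "running" | "completed" | "failed" | "skipped";
type Band = "small" | "medium" | "large";

interface FlowRunStartedFrame {
  type: "flow_run_started"; // assumed discriminator field
  run_id: string;
  flow_name: string;
  band: Band;
  steps: Array<{
    step_id: string;
    agent: string | null;
    kind: "agent" | "code";
    chat_id: string | null; // keys the reused delta / tool_call / message_complete stream
    label: string;
  }>;
}

interface FlowRunStepUpdatedFrame {
  type: "flow_run_step_updated";
  run_id: string;
  step_id: string;
  status: StepStatus;
  run_status?: string;
  report?: string; // the final report rides on this frame
}

type FlowRunFrame = FlowRunStartedFrame | FlowRunStepUpdatedFrame;

// Roster state a pane might keep: step_id -> latest status.
function applyFlowFrame(
  roster: Record<string, StepStatus>,
  frame: FlowRunFrame,
): Record<string, StepStatus> {
  switch (frame.type) {
    case "flow_run_started": {
      const next: Record<string, StepStatus> = {};
      for (const step of frame.steps) next[step.step_id] = "pending";
      return next;
    }
    case "flow_run_step_updated":
      return { ...roster, [frame.step_id]: frame.status };
  }
}
```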

Evidence. Existing broker→WS frame pipeline and frame list: ws-frames.ts (snapshot…error). Three-registry rule + web-type-is-wire-gate: root CLAUDE.md "Adding a new WS frame type" + discovery notes §packages/contracts. Stream-by-chat reuse precedent: the dispatcher publishes delta/tool_call/message_complete keyed by chat already (C7).

Rejected alternatives.

  • New per-agent stream frames (flow_agent_delta, etc.) — rejected: the existing delta/tool_call/message_complete already stream by chat; new frames duplicate them.
  • A separate flow_run_report frame — rejected (YAGNI): the report fits on flow_run_step_updated. Deferred with a reopen trigger (reports exceed ~50KB).

Specialist owner. software-architect. Revisit criterion. Reports exceed the frame size limit (~50KB) → split the report onto its own frame. Driven by rounds: R1. Referenced in plan: design.md §WS frames; tasks.md group 3.


D-7 — orchestrator pane kind + OrchestratorPane

Decision. Add an orchestrator pane kind (following the markdown_artifact/html_artifact precedent), touching WorkspacePaneKind, useWorkspacePanes, Workspace, NewPaneMenu, ChatTabBar, PaneHeaderActions. OrchestratorPane.tsx: run header; report-at-top on completion; collapsed agent roster reusing AgentStatusDot; expand-one-at-a-time detail well reusing CoderPane stream rendering; mobile single-column inline expand; auto-expand-follows-active. Runs history in NewPaneMenu. Export (copy / save-file / send-to-chat via the existing sendToChat) in the pane header, conditional on a completed report.

Rationale. A fourth pane kind is already a precedented extension point; the pane reuses AgentStatusDot and the CoderPane stream renderer, so the new surface is composition, not new streaming UI. Expand-one-at-a-time avoids the crowding the grill rejected.

Evidence. Pane-kind precedent: api/types.ts:386 WorkspacePaneKind (with markdown_artifact/html_artifact). Roster/status reuse: AgentComposerBar.tsx:204 (AgentStatusDot), CoderPane stream rendering (C8). Launcher categories from the flow registry: flows/index.ts; runs history host NewPaneMenu.tsx; export via lib/events.ts sendToChat (C10).

Rejected alternatives.

  • Rendering runs inside the existing coder pane — rejected: a run is a parent-with-nested-children view, not a single agent session; conflating them crowds both.
  • All-agents-expanded simultaneously — rejected (C8): the crowding the design session explicitly rejected.

Specialist owner. user-experience-designer. Revisit criterion. Users cannot follow multiple concurrent runs from the roster (then revisit the expand model). Driven by rounds: R1. Referenced in plan: design.md §Orchestrator pane; tasks.md groups 7, 10.


D-8 — Workflow toolbar button + slash launch, BooChat/BooCoder parity

Decision. Add a Workflow (lucide) button on ChatInput's controls row, between the SquareSlash chip and the Globe pill, yielding parity in BooChat (ChatPane) and BooCoder (CoderPane) for free. Label "Flows" on desktop, icon-only on mobile (toolbar confirmed to fit one line). Slash launches instantly with defaults (band small, current pane's project, text-after-command = focus), opening the pane. The button opens FlowLauncherDialog.tsx first: 5 category tabs (Analysis/Discovery/Planning/Authoring/Review) → filtered flow list + size + focus + fast toggle; defaults Analysis/Small/off.

Rationale. ChatInput is the shared composer rendered by both panes, so a single button gives both doors with parity at no extra cost. The toolbar fits one line at ≤5 elements, so adding the button does not force scroll/wrap (a standing mobile constraint).
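The slash-launch defaults from the decision (band small, current pane's project, text after the command as focus) can be sketched as a small parser. The /flow-name syntax, the parseSlashLaunch helper, and the FlowLaunch shape are assumptions for illustration; the actual slash handling in ChatInput may differ.

```typescript
type Band = "small" | "medium" | "large";

interface FlowLaunch {
  flowName: string;
  band: Band;
  projectId: string;
  focus: string;
}

// Parse "/<flow> <focus text>" into a launch request using the stated defaults.
// Returns null when the input is not a slash command for a known flow.
function parseSlashLaunch(
  input: string,
  knownFlows: string[],
  currentProjectId: string,
): FlowLaunch | null {
  const match = /^\/(\S+)\s*([\s\S]*)$/.exec(input.trim());
  if (!match) return null;
  const flowName = match[1] ?? "";
  const rest = match[2] ?? "";
  if (knownFlows.indexOf(flowName) === -1) return null;
  return {
    flowName,
    band: "small", // default band for slash launches
    projectId: currentProjectId, // the current pane's project
    focus: rest.trim(), // text after the command
  };
}
```

For example, under these assumptions "/code-review auth module" in a pane on project p1 would launch code-review at band small with focus "auth module", while the dialog path lets the user change band and category first.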

Evidence. ChatInput.tsx:648-732, :673 — the controls row is ≤5 elements; adding the Workflow icon between SquareSlash and Globe keeps it one line; refutes junior Q13's crowding worry (C9). Launcher categories from flows/index.ts (C10). Shared-composer fact: discovery notes §apps/web (ChatInput rendered by ChatPane + CoderPane).

Rejected alternatives.

  • Separate buttons in ChatPane and CoderPane — rejected: duplicates wiring; the shared composer already gives parity from one button.
  • A launcher search box instead of category tabs — rejected (YAGNI): 22 flows in 5 categories are browsable; a search box is unproven need. Deferred.

Specialist owner. user-experience-designer. Revisit criterion. Category grouping fails users at the 22-flow catalog size (then add the search box). Driven by rounds: R1. Referenced in plan: design.md §Toolbar button & launcher; tasks.md groups 8, 9.


D-9 — Resumable runs via initResume on coder startup

Decision. On coder startup, an initResume re-advances every flow_runs WHERE status='running': a step whose task completed → mark the step done + advance the run; a step whose task is lost/failed (PTY died on restart) → re-dispatch; completed steps are kept. (design-context decision 4 commits to "resumable".)

Rationale. A restart can land mid-flight. Because execution goes through the dispatcher with persisted task state (D-3), a step's outcome is recoverable from the DB; the run-level scheduler just has to re-derive the wave and re-dispatch only the steps that did not finish. Reconcile-and-advance (architect A3) beats mark-run-failed (data's conservative option) because decision 4 already committed to resumable and the task state is durable.
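The reconcile-and-advance rule can be sketched as a pure planning function over the agent steps of one running run. The names here (ResumeStep, planResume, the action labels) are hypothetical, the task state values are assumed, and leaving a still pending or running task to the dispatcher is an assumption rather than something the design states.

```typescript
type StepStatus = "pending" | "running" | "completed" | "failed" | "skipped";
type TaskState = "pending" | "running" | "completed" | "failed"; // assumed tasks.state values

interface ResumeStep {
  step_id: string;
  status: StepStatus;
  task_state: TaskState | null; // null when the task row is gone (lost)
}

interface ResumeAction {
  step_id: string;
  action: "keep" | "mark_completed" | "redispatch";
}

// Decide, per agent step of a running run, what startup resume should do.
function planResume(steps: ResumeStep[]): ResumeAction[] {
  return steps.map((step): ResumeAction => {
    // Completed steps are kept as they are.
    if (step.status === "completed") return { step_id: step.step_id, action: "keep" };
    // The task finished while the runner was down: record it and advance.
    if (step.task_state === "completed") {
      return { step_id: step.step_id, action: "mark_completed" };
    }
    // The task is lost or failed (for example the PTY died on restart).
    if (step.task_state === null || step.task_state === "failed") {
      return { step_id: step.step_id, action: "redispatch" };
    }
    // Assumption: a task still pending or running is left to the dispatcher.
    return { step_id: step.step_id, action: "keep" };
  });
}
```

After applying these actions, the runner would recompute the ready wave from the flow def, so only unfinished work reaches the model again.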

Evidence. No run-level resume exists today (single tasks resume via agent_sessions; a run spanning tasks does not) — discovery notes §Enumerated gaps. Resume tension recorded as C15 (architect reconcile-and-advance vs data mark-failed); resolved toward reconcile-and-advance by decision 4 + durable task state under D-3.

Rejected alternatives.

  • Mark a running run failed on restart — rejected (C15): contradicts decision 4 (resumable) and discards recoverable completed-step work.
  • Re-running the whole flow from step 0 — rejected: re-does completed steps, burning the local model on work already persisted.

Specialist owner. software-architect (scheduler); data-engineer (recovery query). Revisit criterion. A step-level idempotency hazard surfaces where re-dispatch of a "lost" step double-counts side effects (none expected under read-only plan mode). Driven by rounds: R1. Referenced in plan: design.md §Resume; tasks.md group 5.


D-10 — Concurrency: multiple runs, no queued status, single model per run

Decision. Multiple runs are allowed; each gets its own pane + flow_runs row, no shared state. Step statuses are pending / running / completed / failed / skipped — there is NO separate queued status (the dispatcher's pending covers a step waiting on the busy model or on deps). Model is a single config value per run, default qwen3.6-35b-a3b-mxfp4.

Rationale. Each run is independent state, so concurrency needs no coordination beyond the dispatcher's existing per-session serialization. A queued status is not observable: with the model busy, a task is simply pending/running and llama-swap does not expose queue position, so a distinct queued state would be a label the system cannot honestly populate (revising decision-11's "panes show queued honestly").

Evidence. queued unobservability recorded as C16 (junior Q11, data DATA-005): llama-swap does not report queue position; the status reduces to pending(dep/model-wait)/running. Single-model-per-run carried from decision 6/11.

Rejected alternatives.

  • A distinct queued step status — rejected (C16): nothing can populate it honestly; pending already means "waiting". Deferred (reopen if llama-swap exposes queue position).
  • Serializing runs (one at a time) — rejected: runs are independent; serialization adds coordination for no benefit and hurts the multi-pane UX (decision 11).

Specialist owner. data-engineer (status set), devops-engineer (model-busy behavior under concurrent runs). Revisit criterion. llama-swap exposes queue position → add an observable queued status. Driven by rounds: R1. Referenced in plan: design.md §Schema (status sets) + §Concurrency; tasks.md group 2.


Cross-reference index

| Decision | Driven by | Design.md section | Tasks.md group |
| --- | --- | --- | --- |
| D-1 Re-home + DispatchFn | R1 (C1, C11, C14) | Re-home & DispatchFn seam | 1 |
| D-2 Flow-runner + onTaskTerminal | R1 (C2, C3) | Flow-runner & onTaskTerminal | 4 |
| D-3 Dispatcher reuse (not bypass) | R1 (C12), R2 | Execution via dispatcher reuse | 4 |
| D-4 Read-only via plan mode | R1 (C13), R2 | Read-only via plan mode | 4 |
| D-5 Schema flow_runs/flow_steps | R1 (C3-C6) | Schema | 2 |
| D-6 WS frames | R1 (C7) | WS frames | 3 |
| D-7 Orchestrator pane | R1 (C8) | Orchestrator pane | 7, 10 |
| D-8 Toolbar button + slash | R1 (C9, C10) | Toolbar button & launcher | 8, 9 |
| D-9 Resume | R1 (C15) | Resume | 5 |
| D-10 Concurrency / no-queued | R1 (C16) | Schema + Concurrency | 2 |

Deferred (YAGNI)

These were considered and deferred under the evidence rule. Each names the trigger that would justify reopening.

@boocode/conductor workspace package

  • Why deferred: only two consumers (Phase-1 CLI + coder); copy-in (D-1) avoids premature shared-package abstraction.
  • Reopen when: a third app needs the conductor types.
  • Source: architect (D-1 rejected alternative).

flow_steps.depends_on column

  • Why deferred: deps are derivable from the loaded flow def (flow.ts:28-41, types.ts:27); a column duplicates the source of truth.
  • Reopen when: a stored-run DAG visualization must show deps without loading the flow def.
  • Source: data-engineer C6 (D-5 rejected alternative).

Persisted skipped-step rows

  • Why deferred: when() is pure on stored input, so a skip is reconstructable from the flow def + run input.
  • Reopen when: the UI must explain a skip without the flow def.
  • Source: data-engineer C6 (D-5 rejected alternative).

read_only flag on tasks

  • Why deferred: superseded by mode_id='plan' (D-4), an existing column already plumbed to qwen's --approval-mode.
  • Reopen when: a non-qwen agent without a --approval-mode plan equivalent is added to flows.
  • Source: D-4 rejected alternative.

Explicit queued step status

  • Why deferred: llama-swap does not expose queue position; nothing can populate the status honestly (C16). pending covers waiting.
  • Reopen when: llama-swap exposes queue position.
  • Source: junior Q11 / data DATA-005 (D-10 rejected alternative).

Launcher search box

  • Why deferred: 22 flows in 5 category tabs are browsable; a search box is unproven need.
  • Reopen when: category grouping fails users at the catalog size.
  • Source: UX C10 (D-8 rejected alternative).

Separate report-stored WS frame

  • Why deferred: the report rides on flow_run_step_updated (D-6).
  • Reopen when: reports exceed the ~50KB frame size limit.
  • Source: architect C7 (D-6 rejected alternative).