feat: in-app Orchestrator (Phase 2) — multi-agent conductor

Brings the deterministic Han-flow conductor into BooCode: launch any read-only flow from BooChat or BooCoder, watch each agent stream live in a Paseo-style run pane, get an evidence-disciplined report — on local Qwen, persisted and resumable. Read-only enforced hard via qwen --approval-mode plan (orchestrator tasks fail closed if qwen is unavailable; never fall to write-capable native). Backend (apps/coder): re-homed conductor defs, flow_runs/flow_steps schema, flow-runner + dispatcher onTaskTerminal hook, restart-resume, runs routes (launch/list/get/cancel), user-channel WS. Contracts: two flow_run_* frames. Web: orchestrator pane kind + OrchestratorPane, Workflow button + slash flows (BooChat/BooCoder parity), FlowLauncherDialog, "New Orchestrator" in the + and split menus, runs history + export. Plan: openspec/changes/orchestrator. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 14:59:07 +00:00
parent 519b1d2ca1
commit 1937af8df9
118 changed files with 15723 additions and 27 deletions
--- a/conductor/references/evidence-rule.md
+++ b/conductor/references/evidence-rule.md
@@ -0,0 +1,74 @@
+# Evidence Rule (Evidence-Based)
+
+This rule defines what evidence means in Han, how to characterize how strong it is, and what to do when no evidence exists at all. The rule supplements [`yagni-rule.md`](yagni-rule.md). YAGNI's categories answer *is there any evidence to include this item?* This rule answers *once an item passes that test, how confident should you be in the evidence, and what is the response when no evidence is available?*
+
+The vocabulary and the corroboration gate here originated in `/research`; this file is the canonical extraction so other skills and agents can apply the same primitives.
+
+## Trust classes
+
+Every artifact a skill or agent cites carries one of three trust classes:
+
+- **Codebase** is the trusted current-state anchor. The current source code, current tests, current configuration, current build output. When codebase evidence contradicts other evidence, treat the codebase as authoritative on what the system does today.
+- **Web** sits outside the trust boundary. Documentation, blog posts, Stack Overflow, GitHub issues, RFCs, vendor whitepapers, LLM-generated content. Web sources can be wrong, stale, adversarially shaped, or contextually misapplied.
+- **Provided** is operator-supplied material. Files pasted in, links handed to a skill, screenshots, transcripts. Apply interested-party scrutiny; hold to the same standard as web sources.
+
+## The three principles
+
+### Principle 1: Proximity to origin (heuristic, not ranked tier list)
+
+Evidence drawn from closer to the originating event or data carries more weight than evidence at greater remove. Apply this as a heuristic, not as a ranked tier list. A numbered ordering of source types looks operational but breaks at the first tier boundary.
+
+The principle inverts in three contexts: formal-methods or specification-compliance contexts (the specification is the authoritative artifact); regulatory or contractual contexts (the regulation wins); and pre-incident observation of intended behavior (a passing test proves only that tested inputs behaved correctly for tested code paths; passing and failing tests are not symmetric evidence). See [`docs/evidence.md#principle-1-proximity-to-origin`](../../docs/evidence.md#principle-1-proximity-to-origin) for the inversion conditions.
+
+### Principle 2: Independent corroboration (web-source scope)
+
+A claim corroborated by two or more independent sources carries more weight than a claim resting on one. Applied as a gate to web sources:
+
+**A web claim that bears on a recommendation and has no independent corroboration is marked single-source and cannot be the sole basis for the recommendation.**
+
+The gate does not apply to codebase evidence. A single file path at a specific line number is not weakened by being a single citation; the current source code is the current state of the system. Extending the gate to codebase evidence is deferred work, opened only when a specific failure forces the adaptation.
+
+When sources contradict each other, surface the conflict. Record both, name the disagreement, and let the reader judge. When codebase evidence and web evidence disagree, the codebase wins on what the system does today; add "continue with the current approach" as a named alternative.
+
+### Principle 3: Explicit no-evidence labeling
+
+When a claim has no evidence at any tier, label it. Defer the dependent decision. Name the concrete trigger that would justify revisiting.
+
+Do not collapse "no evidence" into "very weak evidence." They are different states. The response pattern is the same one [YAGNI](yagni-rule.md) uses for deferred items: a labeled defer with a concrete reopen trigger (a measured metric, an incident class, a customer commitment, a regulation taking effect, a dependency landing). Aspirational triggers do not qualify.
+
+## How to apply the rule
+
+### When producing a judgment (research, investigation, plan, review)
+
+For every claim that drives a conclusion:
+
+1. Name the trust class (codebase, web, provided).
+2. For web claims that bear on the recommendation, apply the corroboration gate. Single-source web claims get marked and cannot stand alone.
+3. For codebase claims, cite the file path and line number; the single-source caveat does not apply.
+4. For claims with no evidence at any tier, label the claim, defer the dependent decision, and record the reopen trigger.
+
+### When reviewing a judgment (review skills, review agents)
+
+For every committed claim in the artifact:
+
+1. Check that the trust class is named or inferable.
+2. Check that single-source web claims are marked and do not stand alone as the basis for a recommendation.
+3. Check that no-evidence claims are labeled and deferred with a trigger, not silently treated as weak evidence.
+4. Surface contradictions between sources rather than picking the agreeable one.
+
+### When the rule and YAGNI both apply
+
+Apply YAGNI Gate 1 first. If the item fails the YAGNI evidence test (none of YAGNI's five categories of acceptable evidence apply), defer the item per YAGNI regardless of any quality consideration this rule would raise. If the item passes YAGNI Gate 1, then characterize the quality of the evidence using this rule: name the trust class, apply the corroboration gate to web claims, and label no-evidence states.
+
+YAGNI gates inclusion. This rule characterizes quality once inclusion is justified. The two rules do not collapse into one.
+
+## Escalation
+
+Claims that fail the corroboration gate and cannot be corroborated are **never silently accepted**. They surface to the user with the single-source label so the choice to act on them is conscious. The user always wins; they may direct a single-source web claim to be acted on against the gate, and the override is recorded with rationale so the choice stays visible.
+
+## What this rule is not
+
+- **Not a replacement for YAGNI's evidence test.** YAGNI's five categories of acceptable evidence remain the gate for inclusion. This rule applies after YAGNI passes.
+- **Not a ranked tier list.** The proximity-to-origin principle is a heuristic. A numbered ordering ("production > tests > codebase > docs > blogs") will produce inconsistent results across skill invocations.
+- **Not a codebase-evidence corroboration gate.** The gate applies to web sources only. Single-file codebase findings stand on their citation.
+- **Not a bar for academic rigor.** The bar is operational. "You can tell where this came from and how strongly it rests" is the standard.
--- a/conductor/references/yagni-rule.md
+++ b/conductor/references/yagni-rule.md
@@ -0,0 +1,102 @@
+# YAGNI Rule (Evidence-Based)
+
+YAGNI — "You Aren't Gonna Need It" — is the rule this project uses to keep specs, plans, code, and operational machinery from accreting work that isn't needed yet. The rule is evidence-based, not absolute. Items survive when evidence justifies them. Items without evidence get deferred — recorded for later, not silently dropped.
+
+Every line of code, every section of a spec, every runbook, every abstraction, every configuration knob, every observability hook is ongoing maintenance cost — and is also a pattern future agents will treat as load-bearing and copy. The bar for inclusion is "we need this now and have evidence to prove it," not "we might want this someday."
+
+The categories below answer whether evidence exists at all (the inclusion gate). For how strong the evidence is once it exists — trust classes, the corroboration gate for web sources, the no-evidence label — see the companion [`evidence-rule.md`](evidence-rule.md). The two rules work together; this one gates inclusion, that one characterizes quality.
+
+## The two gates
+
+### Gate 1: The evidence test (gate for inclusion)
+
+Any committed item — feature behavior, spec section, code change, abstraction, configuration option, ADR, coding standard, runbook, observability hook, alert, test, plan step, build phase — must cite **at least one** piece of evidence that it is needed *now*. Acceptable evidence:
+
+1. **A user-described need** in the source artifact (PRD, feature spec, ticket, conversation with the user, stakeholder commitment).
+2. **A named direct dependency** — another in-scope item literally cannot work without it. The dependent item must itself pass the evidence test.
+3. **An existing production code path or contract that will break without it** — cite the file/path/function or external consumer that depends on the current behavior.
+4. **A regulatory or compliance rule that demonstrably applies to this project today** — cite the specific regulation and how it touches the change. "Compliance might require…" is not evidence.
+5. **A documented incident, real production alert that has fired, real customer report, or measured metric** showing the problem exists. Hypothetical alerts don't qualify; alerts that have actually fired do.
+
+If no evidence in this list applies, the item is **YAGNI** — defer it. Record the deferral with the trigger that would justify reopening it.
+
+### Gate 2: The simpler-version test (gate for shape)
+
+When evidence justifies an item, ask: **is there a strictly simpler version that satisfies the same evidence?**
+
+- A simpler version uses fewer files, fewer abstractions, fewer configuration surfaces, fewer code paths, fewer tests, fewer phases.
+- A single function beats a class. A class beats a class hierarchy. A class hierarchy beats a framework.
+- One concrete implementation beats an interface with one implementation.
+- A literal value beats a configurable value beats a configurable value with a default.
+- An inline check beats a helper beats a middleware beats a framework.
+- One end-to-end test beats one end-to-end test plus three integration tests plus twelve unit tests, when the end-to-end test catches every realistic failure mode.
+
+If a simpler version satisfies the same evidence, the simpler version replaces the larger one. The larger version is YAGNI until the simpler one demonstrably falls short.
+
+## Named anti-patterns (auto-flag as YAGNI candidates)
+
+Any of the following, when found in a spec / plan / code change / ADR / runbook, is a YAGNI candidate by default. The evidence test must affirmatively justify keeping it.
+
+- **"We might need…" / "for future flexibility" / "in case we want to…"** — pure speculation.
+- **"When we scale" / "at scale" / "for performance"** — scaling work without measured pressure that the change actually addresses.
+- **"Best practice says…" / "the standard is…"** — best practices that don't solve a problem this project actually has.
+- **Symmetry / completeness** — "we have create, so we should have delete," "we have one endpoint, so we should have all the CRUD verbs," "this enum has three values, so the test should cover all three" when only one is reachable.
+- **Single-implementation interfaces / abstract base classes** — abstractions introduced before three concrete uses exist (the Rule of Three).
+- **Speculative configuration knobs** — config options no caller sets, env vars no environment overrides, feature flags wrapping a single code path with no rollout plan that uses them.
+- **Defensive code at trusted internal boundaries** — null checks, type checks, and validation for inputs that internal callers fully control. Validate at system boundaries (user input, external APIs); trust internal contracts.
+- **Speculative observability** — instrumentation, dashboards, or log fields for systems whose telemetry isn't reaching the destination yet, or for failure modes that have never occurred.
+- **Runbooks for alerts that have never fired** and have no signal data flowing — the canonical example from this project's history (Sentry runbooks for staging-only Sentry where data isn't reaching production).
+- **SLOs and error budgets for traffic the system doesn't yet receive.**
+- **Multi-region / HA infrastructure** for a workload that hasn't proven single-region pressure.
+- **Indexes for queries that don't run, audit columns nobody reads, denormalization for read patterns that don't exist, partitioning for data volumes the project doesn't have.**
+- **Tests for code paths that don't exist yet, or for hypothetical adversaries the change doesn't touch.**
+- **ADRs about decisions that don't have a forcing function today.**
+- **Coding standards about patterns the project doesn't actually use yet.**
+- **Phases of work whose only justification is "completeness of the roadmap."**
+
+## How to apply YAGNI in skills and agents
+
+### When producing artifacts (specs, plans, code recommendations)
+
+For every item you are about to commit:
+
+1. State the evidence that justifies the item, citing the source per the evidence test.
+2. If no evidence applies, do not commit the item — record it under the artifact's `## Deferred (YAGNI)` section with the trigger that would justify reopening it.
+3. If evidence applies, ask the simpler-version test. Replace with the simpler version when one satisfies the same evidence.
+
+### When reviewing artifacts (review skills, review agents)
+
+For every committed item in the artifact:
+
+1. Run the evidence test. If no evidence is cited, raise a `Category: YAGNI candidate` finding.
+2. Resolution paths for a YAGNI candidate finding:
+   - **Cite the missing evidence** — the item is justified, finding closes.
+   - **Replace with a simpler version** — record the rationale, the larger version moves to deferred.
+   - **Move to `## Deferred (YAGNI)`** — record the trigger that would reopen it.
+3. Anti-patterns from the named list above force a finding regardless of severity rules.
+
+### Escalation
+
+YAGNI candidates are **never silently dropped**. They surface to the user as deferrals with the reopening trigger named. The user always wins — they may direct an item to be kept against the rule. The rule's job is to make the cost of including the item visible so the choice is conscious.
+
+## Deferred (YAGNI) section format
+
+Every artifact that may produce YAGNI deferrals adds a section with this structure:
+
+```
+## Deferred (YAGNI)
+
+### {item name}
+**Why deferred:** {which gate failed — evidence test or simpler-version test, with the specific reason}
+**Reopen when:** {the concrete trigger that would justify revisiting — a metric, an incident class, a customer commitment, a dependency landing, a regulation taking effect}
+**Source:** {where the item was originally proposed — review finding ID, agent name, conversation context}
+```
+
+When no items are deferred, the section is omitted entirely (don't write empty stub sections).
+
+## What YAGNI is not
+
+- **Not "never plan ahead."** Plan ahead when planning ahead is the evidence — a regulatory deadline, a customer commitment, a dependency that requires lead time. The evidence test welcomes that.
+- **Not "skip security / data integrity / correctness."** Critical-path correctness work passes the evidence test trivially. YAGNI applies to *speculative* security hardening, not to addressing actual exploit paths or actual data corruption.
+- **Not "never refactor."** Refactor when the existing structure demonstrably impedes the change being made — that's evidence. Refactor for "cleanliness" alone is YAGNI.
+- **Not an excuse to skip user-described requirements.** If the user said they want it, that is evidence. The rule challenges what *agents and skills* add on top of what the user asked for.