boocode/openspec/changes/archived/2026-06-07-hybrid-workflow-engine/design.md
indifferentketchup c935687725 chore(openspec): drop 9 superseded proposals + 11 stub archive files
Drop 9 batch proposals that are superseded by the boocode-lift-analysis
(boocontext-audit, conductor upgrades, self-healing/verify-gate skills):
add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform,
conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul,
agent-reliability.

Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only)
that provide zero documentation value over the existing CHANGELOG.md + git tags.
2026-06-07 22:15:38 +00:00


## Context
Three workflow engine patterns were researched: **Archon** (DAG-based YAML, git isolation), **Agent SOP** (markdown instructions with RFC 2119 constraints), and **Vercel Workflow** (event-sourced durable execution). Each excels in one dimension but has fundamental gaps:
- **Archon**: Clean DAG format + variable substitution + approval gates, but no crash recovery, tightly coupled to its monorepo (Bun/SQLite/Claude SDK)
- **Agent SOP**: Zero parser complexity, AI-native markdown, but completely stateless — no execution engine, no validation, no persistence
- **Vercel Workflow**: Gold-standard durability via event sourcing, but requires Rust SWC plugin, VM sandbox, 24-36 week rebuild — extreme complexity for the value in most use cases
**Ion** extracts the portable essence of each: Archon's DAG schema and executor, Agent SOP's markdown readability, Vercel's event sourcing (simplified — no SWC, no VM, no compile transforms).
## Goals / Non-Goals
**Goals:**
- Fully portable DAG execution engine in pure TypeScript (zero Rust/SWC/wasm)
- YAML-first workflow definitions with 7 node types (command, prompt, bash, script, loop, approval, cancel)
- `.sop.md` markdown format as a secondary input (transpiled to DAG nodes)
- Event-sourced persistence for crash recovery with deterministic replay — simplified to "log of node outcomes" rather than "log of every async operation"
- Pluggable storage backends: filesystem (dev), SQLite/Postgres (production)
- CLI tool + library API dual distribution
- Approval gates with capture_response and on_reject
- Variable substitution ($nodeId.output, $ARGUMENTS, $LOOP_PREV_OUTPUT, etc.)
- Script execution via bun/node (TS) and uv/python3 (Python) with deps support
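
The variable substitution goal above can be illustrated with a small sketch. It assumes a hypothetical `SubstitutionContext` shape and a hypothetical `substitute` helper; the source names the variables (`$nodeId.output`, `$ARGUMENTS`, `$LOOP_PREV_OUTPUT`) but not the data structures behind them, so everything else here is an assumption.

```typescript
// Hypothetical context shape: not taken from the design, only the variable names are.
interface SubstitutionContext {
  args: string;                      // value for $ARGUMENTS
  nodeOutputs: Map<string, string>;  // values for $<nodeId>.output
  loopPrevOutput?: string;           // value for $LOOP_PREV_OUTPUT
}

// Replace known variables in a template. Unknown references are left untouched
// so that a typo stays visible instead of silently becoming an empty string.
function substitute(template: string, ctx: SubstitutionContext): string {
  return template.replace(
    /\$(ARGUMENTS|LOOP_PREV_OUTPUT|([A-Za-z_][\w-]*)\.output)/g,
    (match: string, name: string, nodeId: string | undefined): string => {
      if (name === "ARGUMENTS") return ctx.args;
      if (name === "LOOP_PREV_OUTPUT") return ctx.loopPrevOutput ?? match;
      if (nodeId !== undefined) return ctx.nodeOutputs.get(nodeId) ?? match;
      return match;
    },
  );
}

const ctx: SubstitutionContext = {
  args: "fix login bug",
  nodeOutputs: new Map([["plan", "1. reproduce 2. patch"]]),
};
const rendered = substitute(
  "Task: $ARGUMENTS\nPlan: $plan.output\nPrev: $missing.output",
  ctx,
);
console.log(rendered);
```

Leaving unresolved references in place is one possible policy; failing the node on an unknown reference would be an equally reasonable choice.
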
**Non-Goals:**
- No SWC compiler plugin or build-time transforms (Vercel's approach is overkill for this scope)
- No VM sandbox for workflow execution (workflows run as regular async functions)
- No git worktree isolation (leave to the host application)
- No multi-tenant or serverless platform (single-tenant CLI/library focus)
- No web UI in the initial build (CLI + library only; web can be added later)
- No AI provider integration (host application provides the AI; Ion just routes prompts)
## Decisions
### Decision 1: Event Log = Node Outcomes, Not Every Async Operation
**Vercel** logs every `step_created`, `step_completed`, `wait_created`, `hook_received` etc. — 17 event types. This requires SWC transforms to intercept all async boundaries.
**Ion** logs only *node-level* events: `node_started`, `node_completed`, `node_failed`, `workflow_started`, `workflow_completed`, `workflow_failed`. No micro-events. Replay means "re-run the DAG from the top, skipping completed nodes using stored outputs" — identical to Archon's `resume` approach.
**Rationale**: Simpler by an order of magnitude. No interceptors, no transforms, no VM. Crash recovery works: if the process dies mid-workflow, replay skips completed nodes and re-executes from the last failed/incomplete layer.
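
A minimal sketch of node-level replay, under stated assumptions: the six event names come from this decision, while the payload fields, the `completedOutputs` and `replay` helpers, and the synchronous `run` callback are hypothetical simplifications (a real executor would be async and would also record `node_failed`).

```typescript
// Event names are from the decision above; payload fields are assumptions.
type WorkflowEvent =
  | { type: "workflow_started"; runId: string }
  | { type: "node_started"; runId: string; nodeId: string }
  | { type: "node_completed"; runId: string; nodeId: string; output: string }
  | { type: "node_failed"; runId: string; nodeId: string; error: string }
  | { type: "workflow_completed"; runId: string }
  | { type: "workflow_failed"; runId: string; error: string };

// Fold the log into the stored outputs of nodes that completed.
function completedOutputs(log: readonly WorkflowEvent[]): Map<string, string> {
  const outputs = new Map<string, string>();
  for (const event of log) {
    if (event.type === "node_completed") outputs.set(event.nodeId, event.output);
  }
  return outputs;
}

// Replay: walk nodes in order, reuse stored outputs, re-execute everything else.
function replay(
  runId: string,
  nodeOrder: readonly string[],
  log: WorkflowEvent[],
  run: (nodeId: string, outputs: ReadonlyMap<string, string>) => string,
): Map<string, string> {
  const outputs = completedOutputs(log);
  for (const nodeId of nodeOrder) {
    if (outputs.has(nodeId)) continue; // completed before the crash: skip
    log.push({ type: "node_started", runId, nodeId });
    const output = run(nodeId, outputs);
    log.push({ type: "node_completed", runId, nodeId, output });
    outputs.set(nodeId, output);
  }
  log.push({ type: "workflow_completed", runId });
  return outputs;
}

// A log left behind by a process that died while node "b" was running.
const log: WorkflowEvent[] = [
  { type: "workflow_started", runId: "r1" },
  { type: "node_started", runId: "r1", nodeId: "a" },
  { type: "node_completed", runId: "r1", nodeId: "a", output: "A" },
  { type: "node_started", runId: "r1", nodeId: "b" },
];
const executed: string[] = [];
const finalOutputs = replay("r1", ["a", "b", "c"], log, (nodeId, outputs) => {
  executed.push(nodeId);
  return `${nodeId}(${Array.from(outputs.keys()).join(",")})`;
});
console.log(executed, finalOutputs.get("c"));
```

In this sketch a `node_started` event without a matching `node_completed` is simply treated as incomplete, so node `b` runs again from scratch.
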
### Decision 2: Pure TypeScript — No Rust, No SWC, No WASM
Of the three engines studied, Archon is pure TS, Vercel depends on a Rust SWC plugin, and Agent SOP is pure Python. The SWC plugin is the single biggest contributor to Vercel's 24-36 week build time.
**Ion** stays pure TS. The DAG executor, YAML loader, variable substitution, event log — all standard async/await. No build step beyond `tsc` or `bun build`.
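
To show that a DAG executor needs nothing beyond standard async/await, here is a rough sketch of layered execution. The `DagNode` shape, the `dependsOn` field name, and the `toLayers` and `runLayers` helpers are hypothetical; the design does not specify them.

```typescript
// Hypothetical minimal node shape: an id plus the ids it depends on.
interface DagNode {
  id: string;
  dependsOn: string[];
}

// Group nodes into layers so that every dependency sits in an earlier layer.
function toLayers(nodes: readonly DagNode[]): string[][] {
  const done = new Set<string>();
  let remaining = nodes.slice();
  const layers: string[][] = [];
  while (remaining.length > 0) {
    const ready = remaining.filter((n) => n.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) {
      throw new Error("cycle or unknown dependency in workflow graph");
    }
    layers.push(ready.map((n) => n.id));
    ready.forEach((n) => done.add(n.id));
    remaining = remaining.filter((n) => !done.has(n.id));
  }
  return layers;
}

// Run layer by layer; nodes inside one layer run concurrently via Promise.all.
async function runLayers(
  nodes: readonly DagNode[],
  runNode: (id: string, outputs: ReadonlyMap<string, string>) => Promise<string>,
): Promise<Map<string, string>> {
  const outputs = new Map<string, string>();
  for (const layer of toLayers(nodes)) {
    const results = await Promise.all(layer.map((id) => runNode(id, outputs)));
    layer.forEach((id, i) => outputs.set(id, results[i]));
  }
  return outputs;
}

const layers = toLayers([
  { id: "report", dependsOn: ["lint", "test"] },
  { id: "plan", dependsOn: [] },
  { id: "lint", dependsOn: ["plan"] },
  { id: "test", dependsOn: ["plan"] },
]);
console.log(JSON.stringify(layers));
```

Because `Promise.all` rejects as soon as one node fails, a failed node stops the run at its layer boundary in this sketch, which lines up with resuming from the last incomplete layer.
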
### Decision 3: YAML Primary, Markdown Secondary
**Archon's YAML** format is the primary definition: structured, validated by Zod, machine-parseable. **Agent SOP's markdown** is the secondary format: human-writable, conversational, auto-converted.
The transpiler is simple: parse `## Parameters` → extract required fields, parse `## Steps` → convert each step to a `prompt:` node with constraints embedded in the prompt text. No AST-level parsing needed.
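
The transpiler described above could look roughly like the following line-based sketch. Several details are assumptions rather than facts from the source: parameters are taken to be written as `- **name** (required): ...`, steps as `### ` headings under `## Steps`, each step is assumed to depend on the previous one, and the `PromptNode`, `TranspiledSop`, `section`, and `transpileSop` names are hypothetical.

```typescript
// Hypothetical output shapes for the transpiled workflow.
interface PromptNode {
  id: string;
  prompt: string;       // step title plus body, constraints kept verbatim
  dependsOn: string[];
}
interface TranspiledSop {
  requiredParams: string[];
  nodes: PromptNode[];
}

// Return the lines under "## <title>", up to the next "## " heading.
function section(lines: string[], title: string): string[] {
  const start = lines.findIndex((l) => l.trim() === `## ${title}`);
  if (start === -1) return [];
  const rest = lines.slice(start + 1);
  const end = rest.findIndex((l) => l.startsWith("## "));
  return end === -1 ? rest : rest.slice(0, end);
}

function transpileSop(markdown: string): TranspiledSop {
  const lines = markdown.split(/\r?\n/);

  // "## Parameters": collect the names of entries marked as required.
  const requiredParams: string[] = [];
  for (const line of section(lines, "Parameters")) {
    const m = /^\s*-\s+\*\*(\w+)\*\*\s+\(required\b/.exec(line);
    if (m) requiredParams.push(m[1]);
  }

  // "## Steps": each "### " heading opens a new prompt node; the lines that
  // follow (including RFC 2119 constraints) are appended to its prompt text.
  const nodes: PromptNode[] = [];
  for (const line of section(lines, "Steps")) {
    const heading = /^###\s+(.+)$/.exec(line);
    if (heading) {
      const dependsOn = nodes.length > 0 ? [nodes[nodes.length - 1].id] : [];
      nodes.push({ id: `step-${nodes.length + 1}`, prompt: heading[1].trim(), dependsOn });
    } else if (nodes.length > 0) {
      nodes[nodes.length - 1].prompt += "\n" + line;
    }
  }
  for (const node of nodes) node.prompt = node.prompt.trim();
  return { requiredParams, nodes };
}

const sop = [
  "# Code Review",
  "## Parameters",
  "- **target_branch** (required): branch to review",
  "- **style_guide** (optional): path to a style guide",
  "## Steps",
  "### 1. Collect the diff",
  "**Constraints:**",
  "- You MUST diff against target_branch",
  "### 2. Summarize findings",
  "- You SHOULD group findings by file",
].join("\n");
const result = transpileSop(sop);
console.log(JSON.stringify(result, null, 2));
```

The constraints are never interpreted here; they travel inside the prompt text, which matches the claim that no AST-level parsing is needed.
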
### Decision 4: Storage via IWorkflowStore Interface
**Archon's pattern**: `IWorkflowStore` interface with `createWorkflowRun`, `getWorkflowRun`, `updateWorkflowRun`, `failWorkflowRun`, `createWorkflowEvent`, `getCompletedDagNodeOutputs`. Adapters implement the interface.
**Ion** copies this pattern exactly. FilesystemStore (JSON files per run), SqliteStore, PostgresStore. The interface is the seam.
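
As an illustration of the seam, the sketch below declares the interface and a throwaway in-memory adapter. The six method names are the ones listed above; the parameter and return types, the `WorkflowRun` and `StoredEvent` shapes, and the `InMemoryStore` class are assumptions and do not reproduce Archon's actual signatures.

```typescript
type RunStatus = "running" | "completed" | "failed";
interface WorkflowRun { id: string; workflowName: string; status: RunStatus; error?: string }
interface StoredEvent { runId: string; type: string; nodeId?: string; output?: string }

// Method names come from the design; every signature here is an assumption.
interface IWorkflowStore {
  createWorkflowRun(workflowName: string): Promise<WorkflowRun>;
  getWorkflowRun(runId: string): Promise<WorkflowRun | undefined>;
  updateWorkflowRun(runId: string, patch: Partial<Pick<WorkflowRun, "status" | "error">>): Promise<void>;
  failWorkflowRun(runId: string, error: string): Promise<void>;
  createWorkflowEvent(event: StoredEvent): Promise<void>;
  getCompletedDagNodeOutputs(runId: string): Promise<Map<string, string>>;
}

// Hypothetical adapter for tests; a FilesystemStore would persist the same data as JSON.
class InMemoryStore implements IWorkflowStore {
  private runs = new Map<string, WorkflowRun>();
  private events: StoredEvent[] = [];
  private nextId = 1;

  private mustGet(runId: string): WorkflowRun {
    const run = this.runs.get(runId);
    if (!run) throw new Error(`unknown run: ${runId}`);
    return run;
  }
  async createWorkflowRun(workflowName: string): Promise<WorkflowRun> {
    const run: WorkflowRun = { id: `run-${this.nextId++}`, workflowName, status: "running" };
    this.runs.set(run.id, run);
    return { ...run };
  }
  async getWorkflowRun(runId: string): Promise<WorkflowRun | undefined> {
    const run = this.runs.get(runId);
    return run ? { ...run } : undefined;
  }
  async updateWorkflowRun(runId: string, patch: Partial<Pick<WorkflowRun, "status" | "error">>): Promise<void> {
    Object.assign(this.mustGet(runId), patch);
  }
  async failWorkflowRun(runId: string, error: string): Promise<void> {
    Object.assign(this.mustGet(runId), { status: "failed", error });
  }
  async createWorkflowEvent(event: StoredEvent): Promise<void> {
    this.events.push({ ...event });
  }
  async getCompletedDagNodeOutputs(runId: string): Promise<Map<string, string>> {
    const outputs = new Map<string, string>();
    this.events.forEach((e) => {
      if (e.runId === runId && e.type === "node_completed" && e.nodeId !== undefined) {
        outputs.set(e.nodeId, e.output ?? "");
      }
    });
    return outputs;
  }
}

// The caller only sees the interface, so adapters can be swapped freely.
async function storeDemo(store: IWorkflowStore): Promise<string[]> {
  const run = await store.createWorkflowRun("release");
  await store.createWorkflowEvent({ runId: run.id, type: "node_completed", nodeId: "build", output: "dist/" });
  await store.createWorkflowEvent({ runId: run.id, type: "node_started", nodeId: "publish" });
  await store.failWorkflowRun(run.id, "process crashed");
  const outputs = await store.getCompletedDagNodeOutputs(run.id);
  const reloaded = await store.getWorkflowRun(run.id);
  return [reloaded?.status ?? "missing", ...Array.from(outputs.keys())];
}
const storeDemoResult = storeDemo(new InMemoryStore());
storeDemoResult.then((summary) => console.log(summary));
```
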
### Decision 5: CLI + Library, Not Server
**Archon** has a server + web UI. **Vercel** is a platform SDK. **Ion** ships only as a CLI + library.
The CLI wraps the library: `ion run <workflow>`, `ion list`, `ion approve`, `ion reject`, `ion resume`. The library exports `executeWorkflow()`, `createStore()`, `parseWorkflow()`, `discoverWorkflows()`.
## Risks / Trade-offs
| Risk | Mitigation |
|---|---|
| **Event sourcing is simplified to node-level only**, so there is no intra-node recovery (if a 30-min AI prompt crashes at 29 min, it restarts from scratch). | Acceptable trade-off. AI prompt nodes are treated as safe to re-run. For script/bash nodes, provide `timeout` and `retry` config. Node-level replay is 90% of the value at 10% of the complexity. |
| **No VM sandbox** — workflows run as regular async functions, so `while(true){}` hangs the process. | Document that workflow code must be well-behaved. The `idle_timeout` per node provides a circuit breaker. Production deployments can run workflows in a separate child process. |
| **Markdown-to-YAML transpiler** may lose nuance — SOP's RFC 2119 constraints are prose, not structured. | Constraints stay embedded in the prompt text of the resulting `prompt:` node. The transpiler extracts Parameters (→ node metadata) and Steps (→ prompt body). Lossless for the critical path. |
| **Competing with existing engines** — Archon exists, Temporal exists, Inngest exists. | Ion targets a different niche: portable CLI-first engine that fits in a single repo. Not a platform, not a cloud service. |
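
The `timeout` and `retry` mitigations in the table can be sketched as below. This is an assumption-laden illustration: `withTimeout` and `runWithRetry` are hypothetical helper names, the option names `retries` and `timeoutMs` are invented for the example, and the timeout is a plain wall-clock limit, which is simpler than the `idle_timeout` named in the table.

```typescript
// Reject if `work` has not settled within `ms`. This only bounds awaited work:
// a synchronous `while (true) {}` never yields to the timer, which is why the
// table suggests a separate child process for production deployments.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Run `attempt` once, then up to `retries` more times; each try is bounded by `timeoutMs`.
async function runWithRetry<T>(
  attempt: (n: number) => Promise<T>,
  opts: { retries: number; timeoutMs: number },
): Promise<T> {
  let lastError: unknown;
  for (let n = 0; n <= opts.retries; n++) {
    try {
      return await withTimeout(attempt(n), opts.timeoutMs);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}

// A flaky node that fails twice before succeeding.
const attemptsSeen: number[] = [];
const retryDemo = runWithRetry(async (n) => {
  attemptsSeen.push(n);
  if (n < 2) throw new Error(`attempt ${n} failed`);
  return "ok";
}, { retries: 3, timeoutMs: 1000 });
retryDemo.then((value) => console.log(value, attemptsSeen));
```

A timed-out attempt is abandoned rather than cancelled in this sketch; cancelling a running script or bash node would additionally require killing its process.
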