## Context

Three workflow engine patterns were researched: **Archon** (DAG-based YAML, git isolation), **Agent SOP** (markdown instructions with RFC 2119 constraints), and **Vercel Workflow** (event-sourced durable execution). Each excels in one dimension but has fundamental gaps:

- **Archon**: Clean DAG format + variable substitution + approval gates, but no crash recovery, and tightly coupled to its monorepo (Bun/SQLite/Claude SDK)
- **Agent SOP**: Zero parser complexity, AI-native markdown, but completely stateless: no execution engine, no validation, no persistence
- **Vercel Workflow**: Gold-standard durability via event sourcing, but requires a Rust SWC plugin, a VM sandbox, and a 24-36 week rebuild, which is extreme complexity for the value in most use cases

**Ion** extracts the portable essence of each: Archon's DAG schema and executor, Agent SOP's markdown readability, and Vercel's event sourcing (simplified: no SWC, no VM, no compile transforms).

## Goals / Non-Goals

**Goals:**

- Fully portable DAG execution engine in pure TypeScript (zero Rust/SWC/WASM)
- YAML-first workflow definitions with 7 node types (command, prompt, bash, script, loop, approval, cancel)
- `.sop.md` markdown format as a secondary input (transpiled to DAG nodes)
- Event-sourced persistence for crash recovery with deterministic replay, simplified to "log of node outcomes" rather than "log of every async operation"
- Pluggable storage backends: filesystem (dev), SQLite/Postgres (production)
- CLI tool + library API dual distribution
- Approval gates with `capture_response` and `on_reject`
- Variable substitution (`$nodeId.output`, `$ARGUMENTS`, `$LOOP_PREV_OUTPUT`, etc.)
- Script execution via bun/node (TS) and uv/python3 (Python) with deps support

**Non-Goals:**

- No SWC compiler plugin or build-time transforms (Vercel's approach is overkill for this scope)
- No VM sandbox for workflow execution (workflows run as regular async functions)
- No git worktree isolation (leave to the host application)
- No multi-tenant or serverless platform (single-tenant CLI/library focus)
- No web UI in the initial build (CLI + library only; web can be added later)
- No AI provider integration (host application provides the AI; Ion just routes prompts)

## Decisions

### Decision 1: Event Log = Node Outcomes, Not Every Async Operation

**Vercel** logs every `step_created`, `step_completed`, `wait_created`, `hook_received`, etc., for 17 event types in total. This requires SWC transforms to intercept all async boundaries.

**Ion** logs only *node-level* events: `node_started`, `node_completed`, `node_failed`, `workflow_started`, `workflow_completed`, `workflow_failed`. No micro-events. Replay means "re-run the DAG from the top, skipping completed nodes using stored outputs", which is identical to Archon's `resume` approach.

**Rationale**: Simpler by an order of magnitude. No interceptors, no transforms, no VM. Crash recovery still works: if the process dies mid-workflow, replay skips completed nodes and re-executes from the last failed/incomplete layer.

### Decision 2: Pure TypeScript (No Rust, No SWC, No WASM)

Of the three engines studied, Archon is pure TS, Vercel depends on a Rust SWC plugin, and Agent SOP is pure Python. The SWC plugin is the single biggest contributor to Vercel's 24-36 week rebuild estimate.

**Ion** stays pure TS. The DAG executor, YAML loader, variable substitution, and event log are all standard async/await. No build step beyond `tsc` or `bun build`.

### Decision 3: YAML Primary, Markdown Secondary

**Archon's YAML** format is the primary definition: structured, validated by Zod, machine-parseable.

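
To make "structured, validated" concrete, here is a minimal sketch of the kind of checks a schema would run on a parsed YAML definition. It is a dependency-free stand-in, not the Zod schema itself, and the field names (`name`, `nodes`, `id`, `type`, `depends_on`) are assumptions for illustration; only the seven node types come from the goals above.

```typescript
// Hypothetical shapes: field names are assumed, not Ion's confirmed schema.
const NODE_TYPES = [
  "command", "prompt", "bash", "script", "loop", "approval", "cancel",
] as const;
type NodeType = (typeof NODE_TYPES)[number];

interface WorkflowNode {
  id: string;
  type: NodeType;
  depends_on?: string[];
}

interface WorkflowDefinition {
  name: string;
  nodes: WorkflowNode[];
}

// Returns a list of human-readable problems; an empty list means "valid".
function validateWorkflow(input: unknown): string[] {
  if (typeof input !== "object" || input === null) {
    return ["workflow must be an object"];
  }
  const errors: string[] = [];
  const wf = input as Partial<WorkflowDefinition>;
  if (typeof wf.name !== "string" || wf.name.length === 0) {
    errors.push("name is required");
  }
  if (!Array.isArray(wf.nodes)) {
    return [...errors, "nodes must be an array"];
  }

  // First pass: ids must be present and unique, types must be known.
  const ids = new Set<string>();
  for (const node of wf.nodes) {
    if (typeof node?.id !== "string" || node.id.length === 0) {
      errors.push("node id is required");
      continue;
    }
    if (ids.has(node.id)) errors.push(`duplicate node id: ${node.id}`);
    ids.add(node.id);
    if (!NODE_TYPES.includes(node.type)) {
      errors.push(`unknown node type on ${node.id}: ${String(node.type)}`);
    }
  }

  // Second pass: every dependency must point at a declared node.
  for (const node of wf.nodes) {
    for (const dep of node?.depends_on ?? []) {
      if (!ids.has(dep)) errors.push(`${node.id} depends on unknown node: ${dep}`);
    }
  }
  return errors;
}

// Usage: a two-node workflow where "b" waits for "a".
const problems = validateWorkflow({
  name: "demo",
  nodes: [
    { id: "a", type: "bash" },
    { id: "b", type: "prompt", depends_on: ["a"] },
  ],
});
console.log(problems.length); // 0
```

A real schema would also check per-type fields and reject dependency cycles; both are left out here to keep the sketch short.
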
**Agent SOP's markdown** is the secondary format: human-writable, conversational, auto-converted. The transpiler is simple: parse `## Parameters` → extract required fields, parse `## Steps` → convert each step to a `prompt:` node with constraints embedded in the prompt text. No AST-level parsing needed.

### Decision 4: Storage via IWorkflowStore Interface

**Archon's pattern**: an `IWorkflowStore` interface with `createWorkflowRun`, `getWorkflowRun`, `updateWorkflowRun`, `failWorkflowRun`, `createWorkflowEvent`, `getCompletedDagNodeOutputs`. Adapters implement the interface.

**Ion** copies this pattern exactly, with three adapters: FilesystemStore (JSON files per run), SqliteStore, and PostgresStore. The interface is the seam.

### Decision 5: CLI + Library, Not Server

**Archon** has a server + web UI. **Vercel** is a platform SDK.

**Ion** ships only as a CLI + library. The CLI wraps the library: `ion run`, `ion list`, `ion approve`, `ion reject`, `ion resume`. The library exports `executeWorkflow()`, `createStore()`, `parseWorkflow()`, `discoverWorkflows()`.

## Risks / Trade-offs

| Risk | Mitigation |
|---|---|
| **Event sourcing is simplified to node-level only**, so there is no intra-node recovery (if a 30-min AI prompt crashes at 29 min, it restarts from scratch). | Acceptable trade-off. AI prompts are idempotent. For script/bash nodes, provide `timeout` and `retry` config. Node-level replay is 90% of the value at 10% of the complexity. |
| **No VM sandbox**: workflows run as regular async functions, so `while(true){}` hangs the process. | Document that workflow code must be well-behaved. The `idle_timeout` per node provides a circuit breaker. Production deployments can run workflows in a separate child process. |
| **Markdown-to-YAML transpiler** may lose nuance, because SOP's RFC 2119 constraints are prose, not structured. | Constraints stay embedded in the prompt text of the resulting `prompt:` node. The transpiler extracts Parameters (→ node metadata) and Steps (→ prompt body). Lossless for the critical path. |
| **Competing with existing engines**: Archon exists, Temporal exists, Inngest exists. | Ion targets a different niche: a portable CLI-first engine that fits in a single repo. Not a platform, not a cloud service. |
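
The first risk row rests on the node-level replay from Decision 1, so a small sketch of that loop may help. It assumes an in-memory event array and nodes already listed in topological order; the `DagNode` shape, the `depends_on` field, and the `replay` function are hypothetical names for illustration. Only the event names (`node_started`, `node_completed`, `node_failed`) come from the design above.

```typescript
// Hypothetical types: only the event names are taken from the design.
type NodeEvent =
  | { type: "node_started"; nodeId: string }
  | { type: "node_completed"; nodeId: string; output: string }
  | { type: "node_failed"; nodeId: string; error: string };

interface DagNode {
  id: string;
  depends_on: string[];
  run: (inputs: Record<string, string>) => Promise<string>;
}

// Rebuild the outputs of finished nodes from the event log.
function completedOutputs(log: NodeEvent[]): Map<string, string> {
  const outputs = new Map<string, string>();
  for (const event of log) {
    if (event.type === "node_completed") outputs.set(event.nodeId, event.output);
  }
  return outputs;
}

// Re-run the DAG from the top, skipping nodes the log already marks complete.
// Assumes `nodes` is in a valid topological order.
async function replay(
  nodes: DagNode[],
  log: NodeEvent[],
): Promise<Map<string, string>> {
  const outputs = completedOutputs(log);
  for (const node of nodes) {
    if (outputs.has(node.id)) continue; // stored output is reused, node is not re-run

    const inputs: Record<string, string> = {};
    for (const dep of node.depends_on) {
      const value = outputs.get(dep);
      if (value === undefined) {
        throw new Error(`dependency ${dep} of ${node.id} has no output`);
      }
      inputs[dep] = value;
    }

    log.push({ type: "node_started", nodeId: node.id });
    try {
      const output = await node.run(inputs);
      log.push({ type: "node_completed", nodeId: node.id, output });
      outputs.set(node.id, output);
    } catch (err) {
      log.push({ type: "node_failed", nodeId: node.id, error: String(err) });
      throw err; // the next replay will retry from this node
    }
  }
  return outputs;
}
```

In a real store the in-memory array would be replaced by calls such as `createWorkflowEvent` and `getCompletedDagNodeOutputs` from Decision 4. A node that crashes mid-run leaves only a `node_started` event, so it is re-executed in full on the next replay, which is exactly the intra-node limitation the table describes.
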