Context
Three workflow engine patterns were researched: Archon (DAG-based YAML, git isolation), Agent SOP (markdown instructions with RFC 2119 constraints), and Vercel Workflow (event-sourced durable execution). Each excels in one dimension but has fundamental gaps:
- Archon: Clean DAG format + variable substitution + approval gates, but no crash recovery, tightly coupled to its monorepo (Bun/SQLite/Claude SDK)
- Agent SOP: Zero parser complexity, AI-native markdown, but completely stateless — no execution engine, no validation, no persistence
- Vercel Workflow: Gold-standard durability via event sourcing, but requires a Rust SWC plugin and a VM sandbox, with an estimated 24-36 week rebuild; extreme complexity for the value in most use cases
Ion extracts the portable essence of each: Archon's DAG schema and executor, Agent SOP's markdown readability, Vercel's event sourcing (simplified — no SWC, no VM, no compile transforms).
Goals / Non-Goals
Goals:
- Fully portable DAG execution engine in pure TypeScript (zero Rust/SWC/wasm)
- YAML-first workflow definitions with 7 node types (command, prompt, bash, script, loop, approval, cancel)
- .sop.md markdown format as a secondary input (transpiled to DAG nodes)
- Event-sourced persistence for crash recovery with deterministic replay, simplified to "log of node outcomes" rather than "log of every async operation"
- Pluggable storage backends: filesystem (dev), SQLite/Postgres (production)
- CLI tool + library API dual distribution
- Approval gates with capture_response and on_reject
- Variable substitution ($nodeId.output, $ARGUMENTS, $LOOP_PREV_OUTPUT, etc.)
- Script execution via bun/node (TS) and uv/python3 (Python) with deps support
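The variable forms listed in the goals can be resolved in a single pass over a template string. The following is a minimal sketch, not Ion's actual implementation: the SubstitutionContext shape, the substitute function name, and the choice to leave unknown node ids untouched are all assumptions made for illustration.

```typescript
// Hypothetical context shape; the document does not specify how Ion stores these values.
interface SubstitutionContext {
  args: string;                       // value for $ARGUMENTS
  nodeOutputs: Map<string, string>;   // values for $<nodeId>.output
  loopPrevOutput?: string;            // value for $LOOP_PREV_OUTPUT
}

// One regex, one pass: substituted text is never rescanned, so a node output
// that happens to contain "$ARGUMENTS" is not expanded a second time.
const VARIABLE =
  /\$(?:([A-Za-z_][A-Za-z0-9_-]*)\.output\b|(ARGUMENTS|LOOP_PREV_OUTPUT)\b)/g;

function substitute(template: string, ctx: SubstitutionContext): string {
  return template.replace(
    VARIABLE,
    (match: string, nodeId?: string, builtin?: string): string => {
      if (nodeId !== undefined) {
        // Assumption: unknown node ids are left as-is so the gap stays visible.
        return ctx.nodeOutputs.get(nodeId) ?? match;
      }
      if (builtin === "ARGUMENTS") return ctx.args;
      if (builtin === "LOOP_PREV_OUTPUT") return ctx.loopPrevOutput ?? "";
      return match;
    },
  );
}
```

A real implementation would probably also want an escape syntax for literal dollar signs; that is out of scope for this sketch.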
Non-Goals:
- No SWC compiler plugin or build-time transforms (Vercel's approach is overkill for this scope)
- No VM sandbox for workflow execution (workflows run as regular async functions)
- No git worktree isolation (leave to the host application)
- No multi-tenant or serverless platform (single-tenant CLI/library focus)
- No web UI in the initial build (CLI + library only; web can be added later)
- No AI provider integration (host application provides the AI; Ion just routes prompts)
Decisions
Decision 1: Event Log = Node Outcomes, Not Every Async Operation
Vercel logs every step_created, step_completed, wait_created, hook_received etc. — 17 event types. This requires SWC transforms to intercept all async boundaries.
Ion logs only node-level events: node_started, node_completed, node_failed, workflow_started, workflow_completed, workflow_failed. No micro-events. Replay means "re-run the DAG from the top, skipping completed nodes using stored outputs" — identical to Archon's resume approach.
Rationale: Simpler by an order of magnitude. No interceptors, no transforms, no VM. Crash recovery works: if the process dies mid-workflow, replay skips completed nodes and re-executes from the last failed/incomplete layer.
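The replay rule described above ("re-run from the top, skipping completed nodes using stored outputs") can be sketched in a few lines. This is an illustration under stated assumptions, not Ion's executor: the NodeEvent and DagNode shapes and the runWithReplay name are hypothetical, the log is a plain in-memory array, and nodes are assumed to arrive already in topological order and run one at a time rather than layer by layer.

```typescript
// Hypothetical event shapes, using the node-level event names from this decision.
type NodeEvent =
  | { type: "node_started"; nodeId: string }
  | { type: "node_completed"; nodeId: string; output: string }
  | { type: "node_failed"; nodeId: string; error: string };

interface DagNode {
  id: string;
  // Receives the outputs of all nodes that finished before it.
  run: (outputs: Map<string, string>) => Promise<string>;
}

// Rebuild "which nodes are done, and with what output" purely from the log.
function completedOutputs(log: NodeEvent[]): Map<string, string> {
  const outputs = new Map<string, string>();
  for (const event of log) {
    if (event.type === "node_completed") outputs.set(event.nodeId, event.output);
  }
  return outputs;
}

// Assumes `nodes` is in topological order; runs sequentially for simplicity.
async function runWithReplay(
  nodes: DagNode[],
  log: NodeEvent[],
): Promise<Map<string, string>> {
  const outputs = completedOutputs(log);
  for (const node of nodes) {
    if (outputs.has(node.id)) continue; // finished in an earlier run: reuse stored output
    log.push({ type: "node_started", nodeId: node.id });
    try {
      const output = await node.run(outputs);
      outputs.set(node.id, output);
      log.push({ type: "node_completed", nodeId: node.id, output });
    } catch (err) {
      log.push({ type: "node_failed", nodeId: node.id, error: String(err) });
      throw err;
    }
  }
  return outputs;
}
```

A node that logged node_started but never node_completed (the crash case) is simply run again, which is the node-level granularity this decision accepts.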
Decision 2: Pure TypeScript — No Rust, No SWC, No WASM
Of the three engines studied, Archon is pure TS, Vercel depends on a Rust SWC plugin, and Agent SOP is pure Python. The SWC plugin is the single biggest contributor to the estimated 24-36 weeks needed to rebuild Vercel's engine.
Ion stays pure TS. The DAG executor, YAML loader, variable substitution, event log — all standard async/await. No build step beyond tsc or bun build.
Decision 3: YAML Primary, Markdown Secondary
Archon's YAML format is the primary definition: structured, validated by Zod, machine-parseable. Agent SOP's markdown is the secondary format: human-writable, conversational, auto-converted.
The transpiler is simple: parse ## Parameters → extract required fields, parse ## Steps → convert each step to a prompt: node with constraints embedded in the prompt text. No AST-level parsing needed.
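A line-based version of that transpiler might look like the sketch below. It is illustrative only and rests on assumptions about the input: parameters are written as "- **name** (required): ..." bullets, steps are "### N. Title" headings followed by prose, and steps run in document order (each node depends on the previous one). The PromptNode shape, the step-N ids, and the transpileSop name are hypothetical.

```typescript
interface PromptNode {
  id: string;
  prompt: string;        // step title plus its prose, constraints included verbatim
  depends_on: string[];
}

interface TranspiledSop {
  requiredParams: string[];
  nodes: PromptNode[];
}

// Return the body of a "## <name>" section, up to the next "## " heading.
function section(markdown: string, name: string): string {
  const lines = markdown.split(/\r?\n/);
  const start = lines.findIndex((line) => line.trim() === `## ${name}`);
  if (start === -1) return "";
  const rest = lines.slice(start + 1);
  const end = rest.findIndex((line) => line.startsWith("## "));
  return (end === -1 ? rest : rest.slice(0, end)).join("\n");
}

function transpileSop(markdown: string): TranspiledSop {
  const requiredParams: string[] = [];
  for (const line of section(markdown, "Parameters").split("\n")) {
    // Assumed bullet shape: - **name** (required): description
    const name = /^- \*\*([\w-]+)\*\* \(required\)/.exec(line.trim())?.[1];
    if (name) requiredParams.push(name);
  }

  const nodes: PromptNode[] = [];
  let current: PromptNode | undefined;
  for (const line of section(markdown, "Steps").split("\n")) {
    const title = /^### \d+\.\s*(.+)$/.exec(line.trim())?.[1];
    if (title) {
      const next: PromptNode = {
        id: `step-${nodes.length + 1}`,
        prompt: title,
        depends_on: current ? [current.id] : [], // assumption: strictly sequential
      };
      nodes.push(next);
      current = next;
    } else if (current && line.trim() !== "") {
      // RFC 2119 constraints stay as plain prose inside the prompt text.
      current.prompt += `\n${line.trim()}`;
    }
  }
  return { requiredParams, nodes };
}
```

Because constraints are copied through as text, nothing in this sketch needs to understand MUST or SHOULD; that interpretation is left to the model receiving the prompt.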
Decision 4: Storage via IWorkflowStore Interface
Archon's pattern: IWorkflowStore interface with createWorkflowRun, getWorkflowRun, updateWorkflowRun, failWorkflowRun, createWorkflowEvent, getCompletedDagNodeOutputs. Adapters implement the interface.
Ion copies this pattern exactly. FilesystemStore (JSON files per run), SqliteStore, PostgresStore. The interface is the seam.
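To make the seam concrete, here is a hedged sketch of what the interface and one adapter could look like. Only the six method names come from the text above; every parameter list, return type, and record shape (WorkflowRun, WorkflowEvent) is an assumption, and InMemoryStore is a hypothetical stand-in for the adapters named in this decision, used here so the example stays self-contained.

```typescript
type RunStatus = "running" | "completed" | "failed";

// Hypothetical record shapes; the real fields are not specified in this document.
interface WorkflowRun {
  id: string;
  workflowName: string;
  status: RunStatus;
  error?: string;
}

interface WorkflowEvent {
  runId: string;
  type: string;      // e.g. "node_completed"
  nodeId?: string;
  output?: string;
}

// Method names from the document; signatures are assumptions.
interface IWorkflowStore {
  createWorkflowRun(workflowName: string): Promise<WorkflowRun>;
  getWorkflowRun(runId: string): Promise<WorkflowRun | undefined>;
  updateWorkflowRun(runId: string, patch: Partial<Omit<WorkflowRun, "id">>): Promise<void>;
  failWorkflowRun(runId: string, error: string): Promise<void>;
  createWorkflowEvent(event: WorkflowEvent): Promise<void>;
  getCompletedDagNodeOutputs(runId: string): Promise<Map<string, string>>;
}

class InMemoryStore implements IWorkflowStore {
  private runs = new Map<string, WorkflowRun>();
  private events: WorkflowEvent[] = [];
  private nextId = 1;

  async createWorkflowRun(workflowName: string): Promise<WorkflowRun> {
    const run: WorkflowRun = { id: `run-${this.nextId++}`, workflowName, status: "running" };
    this.runs.set(run.id, run);
    return run;
  }

  async getWorkflowRun(runId: string): Promise<WorkflowRun | undefined> {
    return this.runs.get(runId);
  }

  async updateWorkflowRun(runId: string, patch: Partial<Omit<WorkflowRun, "id">>): Promise<void> {
    const run = this.runs.get(runId);
    if (!run) throw new Error(`Unknown run: ${runId}`);
    Object.assign(run, patch);
  }

  async failWorkflowRun(runId: string, error: string): Promise<void> {
    await this.updateWorkflowRun(runId, { status: "failed", error });
  }

  async createWorkflowEvent(event: WorkflowEvent): Promise<void> {
    this.events.push(event);
  }

  async getCompletedDagNodeOutputs(runId: string): Promise<Map<string, string>> {
    const outputs = new Map<string, string>();
    for (const event of this.events) {
      if (event.runId === runId && event.type === "node_completed" && event.nodeId !== undefined) {
        outputs.set(event.nodeId, event.output ?? "");
      }
    }
    return outputs;
  }
}
```

The executor would depend only on IWorkflowStore, so swapping this class for a filesystem or SQL adapter should not require changes outside the adapter.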
Decision 5: CLI + Library, Not Server
Archon has a server + web UI. Vercel is a platform SDK. Ion ships only as a CLI + library.
The CLI wraps the library: ion run <workflow>, ion list, ion approve, ion reject, ion resume. The library exports executeWorkflow(), createStore(), parseWorkflow(), discoverWorkflows().
Risks / Trade-offs
| Risk | Mitigation |
|---|---|
| Event-sourcing is simplified to node-level only — means no intra-node recovery (if a 30-min AI prompt crashes at 29 min, it restarts from scratch). | Acceptable tradeoff. AI prompts are idempotent. For script/bash nodes, provide timeout and retry config. Node-level replay is 90% of the value at 10% of the complexity. |
| No VM sandbox: workflows run as regular async functions, so while(true){} hangs the process. | Document that workflow code must be well-behaved. The idle_timeout per node provides a circuit breaker. Production deployments can run workflows in a separate child process. |
| Markdown-to-YAML transpiler may lose nuance — SOP's RFC 2119 constraints are prose, not structured. | Constraints stay embedded in the prompt text of the resulting prompt: node. The transpiler extracts Parameters (→ node metadata) and Steps (→ prompt body). Lossless for the critical path. |
| Competing with existing engines — Archon exists, Temporal exists, Inngest exists. | Ion targets a different niche: portable CLI-first engine that fits in a single repo. Not a platform, not a cloud service. |
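
The idle_timeout mitigation in the table above could be implemented as a timer that is reset whenever a node reports activity. The sketch below is one possible shape, not Ion's implementation: the withIdleTimeout name and the touch callback are hypothetical, and how a node signals activity (streamed output, heartbeats) is left open.

```typescript
// Reject if `work` goes longer than `idleMs` without calling `touch()`.
// `touch` is a hypothetical activity signal, e.g. called on each output chunk.
function withIdleTimeout<T>(
  work: (touch: () => void) => Promise<T>,
  idleMs: number,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let settled = false;
    const onIdle = (): void => {
      settled = true;
      reject(new Error(`node idle for more than ${idleMs} ms`));
    };
    let timer = setTimeout(onIdle, idleMs);

    // Restart the countdown; ignored once the promise has settled.
    const touch = (): void => {
      if (settled) return;
      clearTimeout(timer);
      timer = setTimeout(onIdle, idleMs);
    };

    // Settle at most once and always clear the pending timer.
    const finish = (settle: () => void): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      settle();
    };

    work(touch).then(
      (value) => finish(() => resolve(value)),
      (err) => finish(() => reject(err)),
    );
  });
}
```

One caveat worth stating: a timer like this only fires when the event loop is free, so it catches a node that is awaiting something that never arrives but not a synchronous while(true){} in the same process. That case is what the separate child process option covers.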