Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
63 lines
5.7 KiB
Markdown
63 lines
5.7 KiB
Markdown
## Context
|
|
|
|
Three workflow engine patterns were researched: **Archon** (DAG-based YAML, git isolation), **Agent SOP** (markdown instructions with RFC 2119 constraints), and **Vercel Workflow** (event-sourced durable execution). Each excels in one dimension but has fundamental gaps:
|
|
|
|
- **Archon**: Clean DAG format + variable substitution + approval gates, but no crash recovery, tightly coupled to its monorepo (Bun/SQLite/Claude SDK)
|
|
- **Agent SOP**: Zero parser complexity, AI-native markdown, but completely stateless — no execution engine, no validation, no persistence
|
|
- **Vercel Workflow**: Gold-standard durability via event sourcing, but requires Rust SWC plugin, VM sandbox, 24-36 week rebuild — extreme complexity for the value in most use cases
|
|
|
|
**Ion** extracts the portable essence of each: Archon's DAG schema and executor, Agent SOP's markdown readability, Vercel's event sourcing (simplified — no SWC, no VM, no compile transforms).
|
|
|
|
## Goals / Non-Goals
|
|
|
|
**Goals:**
|
|
- Fully portable DAG execution engine in pure TypeScript (zero Rust/SWC/wasm)
|
|
- YAML-first workflow definitions with 7 node types (command, prompt, bash, script, loop, approval, cancel)
|
|
- `.sop.md` markdown format as a secondary input (transpiled to DAG nodes)
|
|
- Event-sourced persistence for crash recovery with deterministic replay — simplified to "log of node outcomes" rather than "log of every async operation"
|
|
- Plugable storage backends: filesystem (dev), SQLite/Postgres (production)
|
|
- CLI tool + library API dual distribution
|
|
- Approval gates with capture_response and on_reject
|
|
- Variable substitution ($nodeId.output, $ARGUMENTS, $LOOP_PREV_OUTPUT, etc.)
|
|
- Script execution via bun/node (TS) and uv/python3 (Python) with deps support
|
|
|
|
**Non-Goals:**
|
|
- No SWC compiler plugin or build-time transforms (Vercel's approach is overkill for this scope)
|
|
- No VM sandbox for workflow execution (workflows run as regular async functions)
|
|
- No git worktree isolation (leave to the host application)
|
|
- No multi-tenant or serverless platform (single-tenant CLI/library focus)
|
|
- No web UI in the initial build (CLI + library only; web can be added later)
|
|
- No AI provider integration (host application provides the AI; Ion just routes prompts)
|
|
|
|
## Decisions
|
|
|
|
### Decision 1: Event Log = Node Outcomes, Not Every Async Operation
|
|
**Vercel** logs every `step_created`, `step_completed`, `wait_created`, `hook_received` etc. — 17 event types. This requires SWC transforms to intercept all async boundaries.
|
|
**Ion** logs only *node-level* events: `node_started`, `node_completed`, `node_failed`, `workflow_started`, `workflow_completed`, `workflow_failed`. No micro-events. Replay means "re-run the DAG from the top, skipping completed nodes using stored outputs" — identical to Archon's `resume` approach.
|
|
**Rationale**: Simpler by an order of magnitude. No interceptors, no transforms, no VM. Crash recovery works: if the process dies mid-workflow, replay skips completed nodes and re-executes from the last failed/incomplete layer.
|
|
|
|
### Decision 2: Pure TypeScript — No Rust, No SWC, No WASM
|
|
All three engines studied: Archon (pure TS), Vercel (Rust SWC plugin), Agent SOP (pure Python). The SWC plugin is the single biggest contributor to Vercel's 24-36 week build time.
|
|
**Ion** stays pure TS. The DAG executor, YAML loader, variable substitution, event log — all standard async/await. No build step beyond `tsc` or `bun build`.
|
|
|
|
### Decision 3: YAML Primary, Markdown Secondary
|
|
**Archon's YAML** format is the primary definition: structured, validated by Zod, machine-parseable. **Agent SOP's markdown** is the secondary format: human-writable, conversational, auto-converted.
|
|
The transpiler is simple: parse `## Parameters` → extract required fields, parse `## Steps` → convert each step to a `prompt:` node with constraints embedded in the prompt text. No AST-level parsing needed.
|
|
|
|
### Decision 4: Storage via IWorkflowStore Interface
|
|
**Archon's pattern**: `IWorkflowStore` interface with `createWorkflowRun`, `getWorkflowRun`, `updateWorkflowRun`, `failWorkflowRun`, `createWorkflowEvent`, `getCompletedDagNodeOutputs`. Adapters implement the interface.
|
|
**Ion** copies this pattern exactly. FilesystemStore (JSON files per run), SqliteStore, PostgresStore. The interface is the seam.
|
|
|
|
### Decision 5: CLI + Library, Not Server
|
|
**Archon** has a server + web UI. **Vercel** is a platform SDK. **Ion** ships only as a CLI + library.
|
|
The CLI wraps the library: `ion run <workflow>`, `ion list`, `ion approve`, `ion reject`, `ion resume`. The library exports `executeWorkflow()`, `createStore()`, `parseWorkflow()`, `discoverWorkflows()`.
|
|
|
|
## Risks / Trade-offs
|
|
|
|
| Risk | Mitigation |
|
|
|---|---|
|
|
| **Event-sourcing is simplified to node-level only** — means no intra-node recovery (if a 30-min AI prompt crashes at 29 min, it restarts from scratch). | Acceptable tradeoff. AI prompts are idempotent. For script/bash nodes, provide `timeout` and `retry` config. Node-level replay is 90% of the value at 10% of the complexity. |
|
|
| **No VM sandbox** — workflows run as regular async functions, so `while(true){}` hangs the process. | Document that workflow code must be well-behaved. The `idle_timeout` per node provides a circuit breaker. Production deployments can run workflows in a separate child process. |
|
|
| **Markdown-to-YAML transpiler** may lose nuance — SOP's RFC 2119 constraints are prose, not structured. | Constraints stay embedded in the prompt text of the resulting `prompt:` node. The transpiler extracts Parameters (→ node metadata) and Steps (→ prompt body). Lossless for the critical path. |
|
|
| **Competing with existing engines** — Archon exists, Temporal exists, Inngest exists. | Ion targets a different niche: portable CLI-first engine that fits in a single repo. Not a platform, not a cloud service. |
|