boocode/openspec/changes/archived/2026-06-07-memory-context-engineering/design.md

## Context

Current agents have no durable memory beyond the immediate LLM context window. Research across three production-grade OSS repos (LangMem, DeerFlow, CowAgent) reveals a consistent architectural pattern: a **tiered memory pipeline** with short-term context management, long-term semantic extraction, and periodic background consolidation. This design synthesizes those patterns into a portable, framework-agnostic `memory-engine` module.

The engine must be:
- **Portable** — works with any LLM, any agent framework, any embedding provider
- **Tiered** — separates ephemeral session context from persistent long-term knowledge
- **Efficient** — background processing, debounced writes, token-budget-aware formatting
- **Searchable** — hybrid keyword + vector retrieval with scoring

## Goals / Non-Goals

**Goals:**
- Provide a unified public API: a `MemoryEngine` class exposing `manage_memory()`, `search_memory()`, `flush_messages()`, `deep_dream()`, and `format_for_injection()` (signatures in the Architecture Overview)
- Short-term context: token-budget windowing + incremental summarization (LangMem's `summarize_messages` pattern)
- Long-term memory: LLM-extracted facts stored in SQLite with typed schemas (LangMem's `MemoryManager` + DeerFlow's fact model)
- Tiered consolidation: context→daily→core pipeline with configurable promotion rules (CowAgent's 3-tier)
- Hybrid search: FTS5 keyword + numpy-vectorized cosine similarity with weighted merge (CowAgent's `MemoryStorage`)
- Background processing: debounced async queue for memory updates (DeerFlow's `MemoryUpdateQueue` + LangMem's `ReflectionExecutor`)
- Agent tools: `manage_memory(content, action, id)` and `search_memory(query, limit)` as framework-agnostic callables

**Non-Goals:**
- Not a standalone agent framework — integrates into existing loops
- No built-in LLM provider — caller provides model
- No built-in embedding provider — caller provides or we degrade to keyword-only
- No real-time sync / distributed consensus — single-process design
- No graph-based memory (entity-relationship knowledge graphs) — deferred to future

## Decisions

### D1: SQLite as the single persistence backend
- **Choice**: SQLite with WAL mode for both keyword search (FTS5) and vector storage (BLOB embeddings)
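- **Rationale**: No external service or extra dependency (Python's stdlib `sqlite3` module is enough), production-proven, FTS5 is available in most stdlib `sqlite3` builds, and embeddings load into numpy in-process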
- **Alternatives considered**:
  - *JSON files* (DeerFlow) → simpler but no built-in search, concurrency issues
  - *External vector DB* (Pinecone, pgvector) → adds operational complexity, violates portability goal
  - *LMDB/RocksDB* → overkill, no FTS5 equivalent
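
A minimal sketch of how D1 could look with Python's stdlib `sqlite3` module. The table names (`chunks`, `chunks_fts`), the helper functions, and the float32 packing of embeddings are illustrative assumptions rather than the module's confirmed schema. The sketch also assumes the local SQLite build includes FTS5.

```python
import sqlite3
import struct
from typing import List, Optional, Sequence, Tuple


def open_store(db_path: str) -> sqlite3.Connection:
    """Open the database in WAL mode and create a hypothetical schema."""
    conn = sqlite3.connect(db_path)
    # WAL only takes effect for file-backed databases; ":memory:" ignores it.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS chunks (
            id        INTEGER PRIMARY KEY,
            content   TEXT NOT NULL,
            embedding BLOB  -- packed float32 values, NULL without an embedder
        );
        CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts
            USING fts5(content, content='chunks', content_rowid='id');
        """
    )
    return conn


def add_chunk(conn: sqlite3.Connection, content: str,
              embedding: Optional[Sequence[float]] = None) -> int:
    """Insert one chunk and keep the external-content FTS5 index in sync."""
    blob = struct.pack(f"<{len(embedding)}f", *embedding) if embedding else None
    cur = conn.execute(
        "INSERT INTO chunks (content, embedding) VALUES (?, ?)", (content, blob)
    )
    conn.execute(
        "INSERT INTO chunks_fts (rowid, content) VALUES (?, ?)",
        (cur.lastrowid, content),
    )
    conn.commit()
    return cur.lastrowid


def keyword_search(conn: sqlite3.Connection, query: str,
                   limit: int = 5) -> List[Tuple[int, float]]:
    """Return (chunk id, BM25 score) pairs, best match first."""
    rows = conn.execute(
        "SELECT rowid, bm25(chunks_fts) FROM chunks_fts "
        "WHERE chunks_fts MATCH ? ORDER BY bm25(chunks_fts) LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

Using an external-content FTS5 table keeps `chunks` as the single source of truth, so the keyword index can be rebuilt from it if needed.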

### D2: Three-tier architecture with file-based daily layer
- **Choice**: In-memory context tier → Markdown-file daily tier → SQLite-indexed core tier
- **Rationale**: Daily Markdown files are human-readable, easily audited, and serve as the input to Deep Dream consolidation. Core tier is the indexed, searchable fact store.
- **Alternatives considered**:
  - *Single SQLite DB for everything* → loses human-readability of daily records
  - *All in-memory* → no persistence across restarts

### D3: Fact extraction via structured LLM output (JSON pattern)
- **Choice**: LLM returns structured JSON (DeerFlow pattern) rather than tool-calling-based extraction (LangMem trustcall pattern)
- **Rationale**: Simpler, fewer dependencies, compatible with any LLM provider. LangMem's trustcall approach is more robust for complex multi-step edits but requires the `trustcall` library.
- **Safeguards**: Insert only facts above a confidence threshold, and hash fact content to skip duplicates
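
The extraction step in D3 can be sketched as follows. The JSON shape (`{"facts": [{"content": ..., "confidence": ...}]}`), the 0.7 threshold, and the whitespace/case normalization before hashing are assumptions chosen for illustration, not the confirmed prompt contract.

```python
import hashlib
import json
from typing import Dict, List, Set


def content_hash(text: str) -> str:
    """Hash a fact after collapsing whitespace and case, so near-identical text dedups."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def ingest_facts(raw_llm_output: str, seen_hashes: Set[str],
                 min_confidence: float = 0.7) -> List[Dict]:
    """Parse the LLM's JSON reply and keep confident, previously unseen facts."""
    try:
        payload = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        return []  # malformed output: skip this update rather than crash
    if not isinstance(payload, dict):
        return []
    accepted = []
    for fact in payload.get("facts", []):
        text = str(fact.get("content", "")).strip()
        confidence = float(fact.get("confidence", 0.0))
        if not text or confidence < min_confidence:
            continue
        digest = content_hash(text)
        if digest in seen_hashes:
            continue  # duplicate of a fact that is already stored
        seen_hashes.add(digest)
        accepted.append({"content": text, "confidence": confidence, "hash": digest})
    return accepted
```

In a real store, `seen_hashes` would presumably be backed by a unique column in SQLite rather than an in-memory set.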

### D4: Hybrid search with numpy-vectorized cosine similarity
- **Choice**: Load relevant embeddings from SQLite, compute cosine similarity via `matrix @ vector` (numpy), merge with FTS5 BM25 scores
- **Rationale**: One vectorized matrix-vector product replaces per-row Python loops, which is roughly 100x faster. numpy is near-ubiquitous in Python ML environments.
- **Fallback**: Pure-Python cosine similarity when numpy unavailable
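
A sketch of the scoring path in D4, including the pure-Python fallback. The function names, the 0.7 vector weight, and the requirement that both score sets are already scaled to [0, 1] are assumptions. FTS5's `bm25()` returns lower-is-better values, so keyword scores would need rescaling before a merge like this one.

```python
import math

try:
    import numpy as np
except ImportError:  # numpy unavailable: use the pure-Python path below
    np = None


def cosine_scores(matrix, query):
    """Cosine similarity of each row of a non-empty `matrix` against `query`."""
    if np is not None:
        m = np.asarray(matrix, dtype=np.float32)
        q = np.asarray(query, dtype=np.float32)
        norms = np.linalg.norm(m, axis=1) * np.linalg.norm(q)
        norms[norms == 0] = 1.0  # zero vectors score 0 instead of dividing by zero
        return ((m @ q) / norms).tolist()
    q_norm = math.sqrt(sum(x * x for x in query))
    scores = []
    for row in matrix:
        denom = math.sqrt(sum(x * x for x in row)) * q_norm
        dot = sum(a * b for a, b in zip(row, query))
        scores.append(dot / denom if denom else 0.0)
    return scores


def hybrid_merge(vector_scores, keyword_scores, vector_weight=0.7):
    """Weighted merge of two {id: score} dicts; returns (id, score) sorted best first."""
    merged = {}
    for doc_id in set(vector_scores) | set(keyword_scores):
        merged[doc_id] = (
            vector_weight * vector_scores.get(doc_id, 0.0)
            + (1.0 - vector_weight) * keyword_scores.get(doc_id, 0.0)
        )
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

An id missing from one result set simply contributes 0 for that component, which is also how keyword-only degradation falls out when no embeddings exist.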

### D5: Debounced background memory update queue
- **Choice**: Thread-safe priority queue with configurable debounce timer (DeerFlow pattern)
- **Rationale**: Prevents thundering-herd on LLM API during rapid conversation turns. Threaded execution avoids blocking the main agent loop.
- **Alternatives considered**: asyncio queue → fine for async-only, but MemoryEngine must support sync callers
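
The debounce behaviour in D5 can be illustrated with `threading.Timer`. This is a simplified sketch under stated assumptions: the class name `DebouncedQueue` is hypothetical, items are batched in arrival order (the priority ordering mentioned above is omitted), and the handler stands in for the LLM-backed updater.

```python
import threading


class DebouncedQueue:
    """Collect items and call `handler(batch)` once submissions go quiet."""

    def __init__(self, handler, debounce_seconds=2.0):
        self._handler = handler
        self._delay = debounce_seconds
        self._lock = threading.Lock()
        self._pending = []
        self._timer = None

    def submit(self, item):
        """Queue an item and restart the debounce timer."""
        with self._lock:
            self._pending.append(item)
            if self._timer is not None:
                self._timer.cancel()  # a newer submission postpones the flush
            self._timer = threading.Timer(self._delay, self._fire)
            self._timer.daemon = True
            self._timer.start()

    def _fire(self):
        # Runs on the timer thread, so the caller's loop is never blocked.
        with self._lock:
            batch, self._pending = self._pending, []
            self._timer = None
        if batch:
            self._handler(batch)
```

Because the work runs on a plain thread, both sync and async callers can call `submit()` without awaiting anything.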

### D6: Namespace isolation via tuple-based scoping
- **Choice**: Variable-length tuple namespace for multi-tenant isolation: a scope type followed by one or more identifiers, typically `(scope_type, user_id, agent_id)`
- **Rationale**: LangMem's `NamespaceTemplate` pattern is proven in production. It allows namespaces such as `("user", "u-123")` or `("org", "acme", "agent-alpha")`.
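
A small sketch of tuple-based scoping. The helper names and the prefix-matching rule (a query namespace matches every fact stored at or below it) are assumptions for illustration; the design does not say whether lookups are exact or prefix-based.

```python
from typing import Tuple

Namespace = Tuple[str, ...]


def make_namespace(scope_type: str, *parts: str) -> Namespace:
    """Build a namespace tuple such as ("user", "u-123")."""
    if not scope_type or any(not p for p in parts):
        raise ValueError("namespace components must be non-empty strings")
    return (scope_type, *parts)


def in_scope(fact_namespace: Namespace, query_namespace: Namespace) -> bool:
    """True when the fact lives at or below the queried namespace prefix."""
    return fact_namespace[: len(query_namespace)] == query_namespace
```

With this rule, querying `("org", "acme")` would see facts from every agent in that organization, while `("org", "acme", "agent-alpha")` narrows to a single agent.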

## Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│                    MemoryEngine                          │
├─────────────────────────────────────────────────────────┤
│  manage_memory(content, scope, metadata) → fact_id       │
│  search_memory(query, limit, scope) → SearchResults[]    │
│  flush_messages(messages, scope) → boolean               │
│  deep_dream(lookback_days, scope) → boolean              │
│  format_for_injection(scope, max_tokens) → str           │
└──────────────────────┬──────────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
┌──────────────┐ ┌──────────┐ ┌──────────────┐
│ Context Tier  │ │ Daily    │ │  Core Tier   │
│ (in-memory)  │ │ Tier     │ │  (SQLite +   │
│              │ │ (Markdown│ │   FTS5 +     │
│ RunningSumm. │ │  files)  │ │   vectors)   │
│ token budget │ │          │ │              │
└──────────────┘ │ Deep     │ │ MemoryStore  │
                 │ Dream ───┼─┤ (facts)      │
                 └──────────┘ │ HybridSearch │
                              └──────────────┘
                                       │
                              ┌────────┴────────┐
                              ▼                  ▼
                       ┌────────────┐  ┌────────────────┐
                       │ Keyword    │  │ Vector Search   │
                       │ (FTS5)     │  │ (numpy cosine)  │
                       └────────────┘  └────────────────┘
```

### Data Flow

1. **Agent sends message** → Context tier tracks token budget, optionally summarizes
2. **Conversation turn completes** → Messages queued to background `MemoryUpdateQueue`
3. **Debounce timer fires** → `MemoryUpdater` calls LLM with current memory + conversation → extracts facts
4. **Facts persisted** → Core tier SQLite: chunks table with embedding, FTS5 index
5. **Daily recording** → `MemoryFlushManager` appends to `memory/YYYY-MM-DD.md`
6. **Deep Dream (scheduled)** → LLM reads MEMORY.md + recent daily files → rewrites MEMORY.md → writes dream diary
7. **Agent starts new session** → `format_for_injection()` reads core tier → builds token-budgeted context string → injects into system prompt
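
Step 7 can be sketched as a greedy fill under a token budget. The whitespace-based `count_tokens` is a crude stand-in for the tiktoken wrapper listed under Module Structure, and the header text, the dict shape of a fact, and the confidence-first ordering are assumptions. This version drops whole facts that do not fit rather than truncating text mid-fact.

```python
from typing import Dict, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: count whitespace-separated words."""
    return len(text.split())


def format_for_injection(facts: List[Dict], max_tokens: int) -> str:
    """Build a context string from the highest-confidence facts within the budget."""
    header = "## Known facts"
    lines = [header]
    used = count_tokens(header)
    for fact in sorted(facts, key=lambda f: f["confidence"], reverse=True):
        line = f"- {fact['content']}"
        cost = count_tokens(line)
        if used + cost > max_tokens:
            break  # budget exhausted: stop before exceeding the limit
        lines.append(line)
        used += cost
    return "\n".join(lines)
```

The returned string is what would be injected into the system prompt at session start.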

## Module Structure

```
memory-engine/
├── __init__.py               # Public API: MemoryEngine, MemoryConfig
├── config.py                 # Pydantic config model
├── core/
│   ├── __init__.py
│   ├── store.py              # MemoryStore (SQLite + FTS5 + vectors)
│   ├── hybrid_search.py      # Vector + keyword merge with temporal decay
│   └── schemas.py            # Memory, Fact, SearchResult models
├── extraction/
│   ├── __init__.py
│   ├── manager.py            # MemoryManager (LLM fact extraction)
│   └── prompts.py            # System prompts for memory extraction
├── tiers/
│   ├── __init__.py
│   ├── context.py            # ContextTier (short-term summarization)
│   ├── daily.py              # DailyTier (Markdown file management)
│   └── core.py               # CoreTier (long-term persistent store)
├── background/
│   ├── __init__.py
│   ├── queue.py              # MemoryUpdateQueue (debounced)
│   └── deep_dream.py         # Deep Dream consolidation
├── tools/
│   ├── __init__.py
│   ├── manage.py             # manage_memory callable
│   └── search.py             # search_memory callable
├── embedding/
│   ├── __init__.py
│   ├── base.py               # EmbeddingProvider ABC
│   └── openai.py             # OpenAI embedding implementation
└── utils/
    ├── __init__.py
    ├── namespace.py          # NamespaceTemplate
    ├── token_counter.py      # Token counting (tiktoken wrapper)
    └── chunker.py            # Text chunking
```

## Risks / Trade-offs

| Risk | Mitigation |
|------|-----------|
| [R1] LLM extraction latency blocks agent loop | Background queue with debounce — agent never waits for memory update |
| [R2] Embedding API failures degrade search | Graceful degradation to keyword-only; vector results omitted, not fatal |
| [R3] SQLite write contention under high concurrency | WAL mode + RLock per connection; single-process assumption |
| [R4] FTS5 corrupted after crash | Self-healing on init: detect corrupt shadow tables, rebuild from chunks table |
| [R5] Memory bloat from unbounded fact accumulation | Configurable `max_facts` limit (default 500); sorted by confidence, oldest trimmed |
| [R6] Deep Dream overwrites valuable long-term data | Dream diary preserves audit trail; content-hash dedup prevents re-processing |
| [R7] Token budget exceeded in context injection | `format_for_injection()` enforces strict token limit with truncation |
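
One possible reading of the R5 mitigation, sketched below: rank facts by confidence, break ties in favour of newer facts, and keep the top `max_facts`. The `confidence` and `created_at` field names and the tie-breaking rule are assumptions, since the table does not spell out the exact trimming order.

```python
from typing import Dict, List


def trim_facts(facts: List[Dict], max_facts: int = 500) -> List[Dict]:
    """Keep the `max_facts` best facts: highest confidence first, newest on ties."""
    ranked = sorted(
        facts,
        key=lambda f: (f["confidence"], f["created_at"]),
        reverse=True,
    )
    return ranked[:max_facts]
```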

## Open Questions

- Q1: Should Deep Dream be scheduled (cron) or event-driven (every N daily files)?
- Q2: Is the provisional default `max_facts` limit of 500 (see R5) the right value for the core tier?
- Q3: Should the daily tier support per-user isolation (user-specific daily files) or always shared?