## Context

Current agents have no durable memory beyond the immediate LLM context window. Research across three production-grade OSS repos (LangMem, DeerFlow, CowAgent) reveals a consistent architectural pattern: a **tiered memory pipeline** with short-term context management, long-term semantic extraction, and periodic background consolidation. This design synthesizes those patterns into a portable, framework-agnostic `memory-engine` module.

The engine must be:

- **Portable**: works with any LLM, any agent framework, any embedding provider
- **Tiered**: separates ephemeral session context from persistent long-term knowledge
- **Efficient**: background processing, debounced writes, token-budget-aware formatting
- **Searchable**: hybrid keyword + vector retrieval with scoring

## Goals / Non-Goals

**Goals:**

- Provide a unified public API: `MemoryEngine` class with `manage()`, `search()`, `flush()`, `dream()` methods
- Short-term context: token-budget windowing + incremental summarization (LangMem's `summarize_messages` pattern)
- Long-term memory: LLM-extracted facts stored in SQLite with typed schemas (LangMem's `MemoryManager` + DeerFlow's fact model)
- Tiered consolidation: context→daily→core pipeline with configurable promotion rules (CowAgent's 3-tier)
- Hybrid search: FTS5 keyword + numpy-vectorized cosine similarity with weighted merge (CowAgent's `MemoryStorage`)
- Background processing: debounced async queue for memory updates (DeerFlow's `MemoryUpdateQueue` + LangMem's `ReflectionExecutor`)
- Agent tools: `manage_memory(content, action, id)` and `search_memory(query, limit)` as framework-agnostic callables

**Non-Goals:**

- Not a standalone agent framework: integrates into existing agent loops
- No built-in LLM provider: the caller provides the model
- No built-in embedding provider: the caller provides one, or the engine degrades to keyword-only search
- No real-time sync or distributed consensus: single-process design
- No graph-based memory (entity-relationship knowledge graphs): deferred to future work

## Decisions

### D1: SQLite as the single persistence backend

- **Choice**: SQLite with WAL mode for both keyword search (FTS5) and vector storage (BLOB embeddings)
- **Rationale**: Zero-dependency and production-proven; FTS5 is available through Python's standard-library `sqlite3` module on most builds; embeddings stored as BLOBs can be scored in-process with numpy
- **Alternatives considered**:
  - *JSON files* (DeerFlow) → simpler, but no built-in search and prone to concurrency issues
  - *External vector DB* (Pinecone, pgvector) → adds operational complexity, violates the portability goal
  - *LMDB/RocksDB* → overkill, no FTS5 equivalent

### D2: Three-tier architecture with file-based daily layer

- **Choice**: In-memory context tier → Markdown-file daily tier → SQLite-indexed core tier
- **Rationale**: Daily Markdown files are human-readable, easily audited, and serve as the input to Deep Dream consolidation. The core tier is the indexed, searchable fact store.
- **Alternatives considered**:
  - *Single SQLite DB for everything* → loses human-readability of daily records
  - *All in-memory* → no persistence across restarts

### D3: Fact extraction via structured LLM output (JSON, not tool-calling)

- **Choice**: LLM returns structured JSON (DeerFlow pattern) rather than tool-calling-based extraction (LangMem trustcall pattern)
- **Rationale**: Simpler, fewer dependencies, compatible with any LLM provider. LangMem's trustcall approach is more robust for complex multi-step edits but requires the `trustcall` library.
- **Fallback**: Confidence-thresholded insertion with content-dedup hashing to prevent duplicates

### D4: Hybrid search with numpy-vectorized cosine similarity

- **Choice**: Load relevant embeddings from SQLite, compute cosine similarity via `matrix @ vector` (numpy), merge with FTS5 BM25 scores
- **Rationale**: Roughly 100x faster than per-row Python loops. numpy is near-ubiquitous in Python ML environments.
- **Fallback**: Pure-Python cosine similarity when numpy is unavailable

### D5: Debounced background memory update queue

- **Choice**: Thread-safe priority queue with a configurable debounce timer (DeerFlow pattern)
- **Rationale**: Prevents a thundering herd of LLM API calls during rapid conversation turns. Threaded execution avoids blocking the main agent loop.
- **Alternatives considered**: asyncio queue → fine for async-only callers, but `MemoryEngine` must also support sync callers

### D6: Namespace isolation via tuple-based scoping

- **Choice**: `(scope_type, user_id, agent_id)` tuple namespace for multi-tenant isolation
- **Rationale**: LangMem's `NamespaceTemplate` pattern is proven in production. Allows `("user", "u-123")` or `("org", "acme", "agent-alpha")`.

## Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│                      MemoryEngine                       │
├─────────────────────────────────────────────────────────┤
│  manage_memory(content, scope, metadata) → fact_id      │
│  search_memory(query, limit, scope) → SearchResults[]   │
│  flush_messages(messages, scope) → boolean              │
│  deep_dream(lookback_days, scope) → boolean             │
│  format_for_injection(scope, max_tokens) → str          │
└──────────────────────┬──────────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
┌──────────────┐ ┌──────────┐ ┌──────────────┐
│ Context Tier │ │  Daily   │ │  Core Tier   │
│ (in-memory)  │ │  Tier    │ │  (SQLite +   │
│              │ │ (Markdown│ │  FTS5 +      │
│ RunningSumm. │ │  files)  │ │  vectors)    │
│ token budget │ │          │ │              │
└──────────────┘ │  Deep    │ │ MemoryStore  │
                 │ Dream ───┼─┤ (facts)      │
                 └──────────┘ │ HybridSearch │
                              └──────────────┘
                                      │
                             ┌────────┴────────┐
                             ▼                 ▼
                       ┌────────────┐ ┌────────────────┐
                       │ Keyword    │ │ Vector Search  │
                       │ (FTS5)     │ │ (numpy cosine) │
                       └────────────┘ └────────────────┘
```

### Data Flow

1. **Agent sends message** → Context tier tracks the token budget and optionally summarizes
2. **Conversation turn completes** → Messages are queued to the background `MemoryUpdateQueue`
3. **Debounce timer fires** → `MemoryUpdater` calls the LLM with current memory + conversation → extracts facts
4. **Facts persisted** → Core tier SQLite: chunks table with embedding, FTS5 index
5. **Daily recording** → `MemoryFlushManager` appends to `memory/YYYY-MM-DD.md`
6. **Deep Dream (scheduled)** → LLM reads MEMORY.md + recent daily files → rewrites MEMORY.md → writes dream diary
7. **Agent starts new session** → `format_for_injection()` reads the core tier → builds a token-budgeted context string → injects it into the system prompt

## Module Structure

```
memory-engine/
├── __init__.py              # Public API: MemoryEngine, MemoryConfig
├── config.py                # Pydantic config model
├── core/
│   ├── __init__.py
│   ├── store.py             # MemoryStore (SQLite + FTS5 + vectors)
│   ├── hybrid_search.py     # Vector + keyword merge with temporal decay
│   └── schemas.py           # Memory, Fact, SearchResult models
├── extraction/
│   ├── __init__.py
│   ├── manager.py           # MemoryManager (LLM fact extraction)
│   └── prompts.py           # System prompts for memory extraction
├── tiers/
│   ├── __init__.py
│   ├── context.py           # ContextTier (short-term summarization)
│   ├── daily.py             # DailyTier (Markdown file management)
│   └── core.py              # CoreTier (long-term persistent store)
├── background/
│   ├── __init__.py
│   ├── queue.py             # MemoryUpdateQueue (debounced)
│   └── deep_dream.py        # Deep Dream consolidation
├── tools/
│   ├── __init__.py
│   ├── manage.py            # manage_memory callable
│   └── search.py            # search_memory callable
├── embedding/
│   ├── __init__.py
│   ├── base.py              # EmbeddingProvider ABC
│   └── openai.py            # OpenAI embedding implementation
└── utils/
    ├── __init__.py
    ├── namespace.py         # NamespaceTemplate
    ├── token_counter.py     # Token counting (tiktoken wrapper)
    └── chunker.py           # Text chunking
```

## Risks / Trade-offs

| Risk | Mitigation |
|------|-----------|
| [R1] LLM extraction latency blocks agent loop | Background queue with debounce; the agent never waits for a memory update |
| [R2] Embedding API failures degrade search | Graceful degradation to keyword-only; vector results omitted, not fatal |
| [R3] SQLite write contention under high concurrency | WAL mode + RLock per connection; single-process assumption |
| [R4] FTS5 corrupted after crash | Self-healing on init: detect corrupt shadow tables, rebuild from chunks table |
| [R5] Memory bloat from unbounded fact accumulation | Configurable `max_facts` limit (default 500); facts sorted by confidence, oldest trimmed first |
| [R6] Deep Dream overwrites valuable long-term data | Dream diary preserves audit trail; content-hash dedup prevents re-processing |
| [R7] Token budget exceeded in context injection | `format_for_injection()` enforces strict token limit with truncation |

## Open Questions

- Q1: Should Deep Dream be scheduled (cron) or event-driven (every N daily files)?
- Q2: Is 500 the right default `max_facts` limit for the core tier?
- Q3: Should the daily tier support per-user isolation (user-specific daily files) or always be shared?
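
To make D4 and R2 concrete, here is a minimal sketch of the weighted merge, assuming keyword and vector scores arrive as `{fact_id: score}` maps where higher is better. The names `cosine_scores`, `normalize`, `hybrid_merge`, the `vector_weight` default of 0.7, and the min-max normalization are illustrative assumptions, not part of the design. Temporal decay is left out. Note that FTS5's `bm25()` gives better matches smaller values, so a caller would negate it before passing it in.

```python
import math

try:
    import numpy as np  # fast path named in D4
except ImportError:  # D4 fallback: pure-Python cosine similarity
    np = None


def cosine_scores(query, rows):
    """Cosine similarity of `query` against each embedding in `rows`."""
    if not rows:
        return []
    if np is not None:
        matrix = np.asarray(rows, dtype=np.float32)
        vector = np.asarray(query, dtype=np.float32)
        norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
        norms[norms == 0] = 1.0  # zero vectors score 0 instead of dividing by 0
        return (matrix @ vector / norms).tolist()
    q_norm = math.sqrt(sum(x * x for x in query))
    scores = []
    for row in rows:
        r_norm = math.sqrt(sum(x * x for x in row))
        dot = sum(a * b for a, b in zip(row, query))
        scores.append(dot / (r_norm * q_norm) if r_norm and q_norm else 0.0)
    return scores


def normalize(scores):
    """Min-max normalize a {fact_id: score} map into [0, 1] (assumed scheme)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_merge(keyword_scores, vector_scores, vector_weight=0.7, limit=5):
    """Weighted sum of normalized scores; a missing score counts as 0."""
    kw = normalize(keyword_scores)
    vec = normalize(vector_scores)  # empty when embeddings fail (R2)
    merged = {
        fact_id: (1 - vector_weight) * kw.get(fact_id, 0.0)
        + vector_weight * vec.get(fact_id, 0.0)
        for fact_id in set(kw) | set(vec)
    }
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:limit]


# Hypothetical data: three facts with 2-d embeddings and two keyword hits.
embeddings = {"f1": [1.0, 0.0], "f2": [0.0, 1.0], "f3": [0.7, 0.7]}
ids = list(embeddings)
vector_scores = dict(zip(ids, cosine_scores([1.0, 0.0], [embeddings[i] for i in ids])))
keyword_scores = {"f2": 2.0, "f3": 1.0}  # e.g. negated bm25() values
results = hybrid_merge(keyword_scores, vector_scores)
print(results)
```

With these inputs the fact whose embedding matches the query (`f1`) ranks first even though it has no keyword hit. Passing an empty `vector_scores` map reduces the ranking to keyword order, which is the degradation path described in R2.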
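
The debounce behaviour in D5 can be sketched with the standard library alone. This is a simplified illustration rather than DeerFlow's `MemoryUpdateQueue`. The class name `DebouncedQueue`, its `submit`/`flush` methods, and the use of a plain list instead of a priority queue are assumptions made for brevity. Each `submit` restarts a `threading.Timer`, so the handler (which would call the LLM) runs once per burst of turns, on a background thread.

```python
import threading


class DebouncedQueue:
    """Batches items and calls `handler` once submissions go quiet."""

    def __init__(self, handler, debounce_seconds=2.0):
        self._handler = handler
        self._debounce = debounce_seconds
        self._lock = threading.Lock()
        self._pending = []
        self._timer = None
        self._generation = 0  # identifies the most recently started timer

    def submit(self, item):
        """Safe to call from synchronous code; returns immediately."""
        with self._lock:
            self._pending.append(item)
            self._generation += 1
            if self._timer is not None:
                self._timer.cancel()  # restart the quiet period
            self._timer = threading.Timer(
                self._debounce, self._fire, args=(self._generation,)
            )
            self._timer.daemon = True
            self._timer.start()

    def _fire(self, generation):
        with self._lock:
            if generation != self._generation:
                return  # a newer submit or flush superseded this timer
            batch, self._pending = self._pending, []
            self._timer = None
        if batch:
            self._handler(batch)  # runs on the timer thread, not the caller's

    def flush(self):
        """Process pending items now, e.g. at shutdown."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
            self._generation += 1  # invalidate a timer that already started
            batch, self._pending = self._pending, []
        if batch:
            self._handler(batch)


# Usage: three rapid turns collapse into a single handler call.
batches = []
done = threading.Event()


def record_batch(batch):
    batches.append(batch)
    done.set()


updates = DebouncedQueue(record_batch, debounce_seconds=0.2)
for turn in ["hi", "my name is Ada", "I prefer Python"]:
    updates.submit(turn)
done.wait(timeout=5.0)
print(batches)
```

The generation counter guards against a timer that has already started firing when a newer `submit` arrives. Without it, a stale timer could flush a batch before the quiet period ends.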
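
As an illustration of the strict token limit in R7, the sketch below fills a context string with the highest-confidence facts until the budget is reached. The `Fact` dataclass, the header text, and the whitespace-based `count_tokens` are stand-ins chosen for the example. The design itself names a tiktoken wrapper for counting, and it does not specify whether truncation stops at the first fact that does not fit (as here) or skips it and keeps trying smaller ones.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    content: str
    confidence: float


def count_tokens(text):
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def format_for_injection(facts, max_tokens, header="Known facts about the user:"):
    """Build a context string that never exceeds `max_tokens`."""
    used = count_tokens(header)
    if used > max_tokens:
        return ""  # not even the header fits
    lines = [header]
    for fact in sorted(facts, key=lambda f: f.confidence, reverse=True):
        line = f"- {fact.content}"
        cost = count_tokens(line)
        if used + cost > max_tokens:
            break  # stop at the first fact that would exceed the budget
        lines.append(line)
        used += cost
    return "\n".join(lines)


# Hypothetical facts; the lowest-confidence one does not fit in 18 tokens.
facts = [
    Fact("Prefers Python over Go", 0.9),
    Fact("Lives in Lisbon", 0.6),
    Fact("Works on a memory engine for agents", 0.8),
]
context = format_for_injection(facts, max_tokens=18)
print(context)
```

Sorting by confidence before truncating means the budget is spent on the facts the extractor was most sure of. A recency-weighted ordering would be an equally plausible choice.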