boocode/openspec/changes/archived/2026-06-07-memory-context-engineering/design.md at c860b6c4b76d49350e83aeca3647478c3d0bae37

Context

Current agents have no durable memory beyond the immediate LLM context window. Research across three production-grade OSS repos (LangMem, DeerFlow, CowAgent) reveals a consistent architectural pattern: a tiered memory pipeline with short-term context management, long-term semantic extraction, and periodic background consolidation. This design synthesizes those patterns into a portable, framework-agnostic memory-engine module.

The engine must be:

- Portable: works with any LLM, any agent framework, any embedding provider
- Tiered: separates ephemeral session context from persistent long-term knowledge
- Efficient: background processing, debounced writes, token-budget-aware formatting
- Searchable: hybrid keyword + vector retrieval with scoring

Goals / Non-Goals

Goals:

- Provide a unified public API: MemoryEngine class with manage(), search(), flush(), dream() methods
- Short-term context: token-budget windowing + incremental summarization (LangMem's summarize_messages pattern)
- Long-term memory: LLM-extracted facts stored in SQLite with typed schemas (LangMem's MemoryManager + DeerFlow's fact model)
- Tiered consolidation: context→daily→core pipeline with configurable promotion rules (CowAgent's 3-tier)
- Hybrid search: FTS5 keyword + numpy-vectorized cosine similarity with weighted merge (CowAgent's MemoryStorage)
- Background processing: debounced async queue for memory updates (DeerFlow's MemoryUpdateQueue + LangMem's ReflectionExecutor)
- Agent tools: manage_memory(content, action, id) and search_memory(query, limit) as framework-agnostic callables

Non-Goals:

- Not a standalone agent framework: integrates into existing loops
- No built-in LLM provider: caller provides model
- No built-in embedding provider: caller provides or we degrade to keyword-only
- No real-time sync / distributed consensus: single-process design
- No graph-based memory (entity-relationship knowledge graphs): deferred to future

Decisions

D1: SQLite as the single persistence backend

Choice: SQLite with WAL mode for both keyword search (FTS5) and vector storage (BLOB embeddings)
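Rationale: No external service or extra dependency (Python ships sqlite3 in the standard library), production-proven, FTS5 is compiled into most sqlite3 builds, and numpy can score embeddings in-process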
Alternatives considered:
- JSON files (DeerFlow) → simpler but no built-in search, concurrency issues
- External vector DB (Pinecone, pgvector) → adds operational complexity, violates portability goal
- LMDB/RocksDB → overkill, no FTS5 equivalent
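
A minimal sketch of the D1 layout, assuming the local sqlite3 build includes FTS5 (common, but not guaranteed). The table names (`chunks`, `chunks_fts`), the float32 BLOB encoding, and the helper names are illustrative assumptions, not part of this design.

```python
import sqlite3
import struct


def open_store(path=":memory:"):
    """Open a store with WAL mode, a chunks table, and an FTS5 keyword index."""
    conn = sqlite3.connect(path)
    # In-memory databases ignore WAL and keep journal_mode=memory; files switch to WAL.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        " id INTEGER PRIMARY KEY, content TEXT NOT NULL, embedding BLOB)"
    )
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts USING fts5(content)")
    return conn


def add_chunk(conn, content, embedding):
    """Store text plus its embedding (packed as float32) and index the text."""
    blob = struct.pack(f"{len(embedding)}f", *embedding)
    cur = conn.execute(
        "INSERT INTO chunks (content, embedding) VALUES (?, ?)", (content, blob)
    )
    conn.execute(
        "INSERT INTO chunks_fts (rowid, content) VALUES (?, ?)",
        (cur.lastrowid, content),
    )
    conn.commit()
    return cur.lastrowid


def keyword_search(conn, query, limit=5):
    """Return (chunk_id, bm25) pairs; FTS5 bm25() is lower-is-better."""
    rows = conn.execute(
        "SELECT rowid, bm25(chunks_fts) FROM chunks_fts"
        " WHERE chunks_fts MATCH ? ORDER BY bm25(chunks_fts) LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

Keeping the FTS5 rowid equal to `chunks.id` lets keyword hits join back to the stored embedding without a second lookup table.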

D2: Three-tier architecture with file-based daily layer

Choice: In-memory context tier → Markdown-file daily tier → SQLite-indexed core tier
Rationale: Daily Markdown files are human-readable, easily audited, and serve as the input to Deep Dream consolidation. Core tier is the indexed, searchable fact store.
Alternatives considered:
- Single SQLite DB for everything → loses human-readability of daily records
- All in-memory → no persistence across restarts
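
The daily tier can be sketched as an append-only Markdown writer. This assumes one file per calendar day under a `memory/` directory; the function name `append_daily`, the date heading, and the bullet format are hypothetical.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def append_daily(root: Path, text: str, day: Optional[date] = None) -> Path:
    """Append one bullet to memory/YYYY-MM-DD.md, creating the file if needed."""
    day = day or date.today()
    path = root / "memory" / f"{day.isoformat()}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists()
    with path.open("a", encoding="utf-8") as fh:
        if is_new:
            # A date heading keeps each file readable on its own.
            fh.write(f"# {day.isoformat()}\n\n")
        fh.write(f"- {text.strip()}\n")
    return path
```

Because entries are only ever appended, the files stay easy to audit and diff before consolidation reads them.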

D3: Fact extraction via structured LLM output (structured-JSON pattern)

Choice: LLM returns structured JSON (DeerFlow pattern) rather than tool-calling-based extraction (LangMem trustcall pattern)
Rationale: Simpler, fewer dependencies, compatible with any LLM provider. LangMem's trustcall approach is more robust for complex multi-step edits but requires the trustcall library.
Fallback: Confidence-thresholded insertion with content-dedup hashing to prevent duplicates
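
A sketch of the insertion guard described in D3, under stated assumptions: the LLM reply is a JSON object with a `facts` array of `{content, confidence}` items, the threshold is 0.7, and the dedup hash is SHA-256 over whitespace- and case-normalized text. None of these specifics are fixed by this design.

```python
import hashlib
import json


def content_hash(text):
    """Hash normalized text so trivially different duplicates collide."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def accept_facts(raw_json, seen_hashes, min_confidence=0.7):
    """Parse an LLM reply and keep confident facts that were not seen before."""
    try:
        payload = json.loads(raw_json)
    except json.JSONDecodeError:
        return []  # malformed output is dropped rather than raised
    if not isinstance(payload, dict):
        return []
    accepted = []
    for fact in payload.get("facts", []):
        if not isinstance(fact, dict):
            continue
        content = str(fact.get("content", "")).strip()
        try:
            confidence = float(fact.get("confidence", 0.0))
        except (TypeError, ValueError):
            continue
        digest = content_hash(content)
        if not content or confidence < min_confidence or digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        accepted.append({"content": content, "confidence": confidence, "hash": digest})
    return accepted
```

In a real store the `seen_hashes` set would likely be replaced by a unique index on the hash column, so dedup survives restarts.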

D4: Hybrid search with numpy-vectorized cosine similarity

Choice: Load relevant embeddings from SQLite, compute cosine similarity via matrix @ vector (numpy), merge with FTS5 BM25 scores
Rationale: A single matrix-vector product replaces per-row Python loops, which this design estimates to be roughly 100x faster. numpy is near-ubiquitous in Python ML stacks.
Fallback: Pure-Python cosine similarity when numpy unavailable
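
The scoring path in D4 might look like the sketch below. It assumes both score maps are already normalized to [0, 1] before merging (raw FTS5 BM25 values are unbounded and lower-is-better, so they would need rescaling first); the 0.7 vector weight and the function names are placeholders.

```python
import math

try:
    import numpy as np
except ImportError:  # D4 fallback: pure-Python path when numpy is unavailable
    np = None


def cosine_scores(matrix, query):
    """Cosine similarity of every row in `matrix` against `query`."""
    if len(matrix) == 0:
        return []
    if np is not None:
        m = np.asarray(matrix, dtype=np.float32)
        q = np.asarray(query, dtype=np.float32)
        norms = np.linalg.norm(m, axis=1) * np.linalg.norm(q)
        norms[norms == 0] = 1.0  # zero vectors score 0 instead of dividing by zero
        return (m @ q / norms).tolist()
    q_norm = math.sqrt(sum(x * x for x in query))
    scores = []
    for row in matrix:
        denom = math.sqrt(sum(x * x for x in row)) * q_norm
        dot = sum(a * b for a, b in zip(row, query))
        scores.append(dot / denom if denom else 0.0)
    return scores


def merge_scores(vector, keyword, vector_weight=0.7):
    """Weighted merge of two {id: score} maps, best result first."""
    ids = set(vector) | set(keyword)
    merged = {
        i: vector_weight * vector.get(i, 0.0)
        + (1.0 - vector_weight) * keyword.get(i, 0.0)
        for i in ids
    }
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

An id missing from one map contributes 0 from that side, so keyword-only operation falls out naturally when no embeddings exist.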

D5: Debounced background memory update queue

Choice: Thread-safe priority queue with configurable debounce timer (DeerFlow pattern)
Rationale: Prevents thundering-herd on LLM API during rapid conversation turns. Threaded execution avoids blocking the main agent loop.
Alternatives considered: asyncio queue → fine for async-only, but MemoryEngine must support sync callers
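
A simplified sketch of the debounce behaviour, built on `threading.Timer`. It omits the priority ordering mentioned above and batches items in arrival order; the class name `DebouncedQueue` and the 30-second default are assumptions.

```python
import threading


class DebouncedQueue:
    """Collect items and hand them to `handler` once the queue has been quiet."""

    def __init__(self, handler, delay_seconds=30.0):
        self._handler = handler
        self._delay = delay_seconds
        self._lock = threading.Lock()
        self._pending = []
        self._timer = None

    def add(self, item):
        """Queue an item and restart the debounce timer."""
        with self._lock:
            self._pending.append(item)
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._delay, self._flush)
            self._timer.daemon = True  # do not keep the process alive
            self._timer.start()

    def _flush(self):
        # Swap the batch out under the lock, then call the handler outside it
        # so a slow LLM call never blocks producers.
        with self._lock:
            batch, self._pending = self._pending, []
            self._timer = None
        if batch:
            self._handler(batch)
```

Several turns arriving within the delay window produce a single handler call, which is what keeps rapid conversations from fanning out into many LLM requests.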

D6: Namespace isolation via tuple-based scoping

Choice: Variable-length tuple namespace, (scope_type, owner_id[, agent_id]), for multi-tenant isolation
Rationale: LangMem's NamespaceTemplate pattern is proven in production. It allows namespaces such as ("user", "u-123") or ("org", "acme", "agent-alpha").
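
One way to make tuple namespaces usable as storage keys is sketched here. The `/` separator and the prefix-matching rule (a shorter scope sees everything nested under it) are assumptions for illustration and are not specified by this design.

```python
def namespace_key(namespace):
    """Join a namespace tuple into a single string key for storage."""
    if not namespace or any(not part or "/" in part for part in namespace):
        raise ValueError("namespace parts must be non-empty and contain no '/'")
    return "/".join(namespace)


def in_scope(candidate, scope):
    """True when `candidate` equals `scope` or is nested beneath it."""
    return tuple(candidate[: len(scope)]) == tuple(scope)
```

For example, `namespace_key(("org", "acme", "agent-alpha"))` yields `org/acme/agent-alpha`, and a search scoped to `("org", "acme")` would match it under the prefix rule.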

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                    MemoryEngine                          │
├─────────────────────────────────────────────────────────┤
│  manage_memory(content, scope, metadata) → fact_id       │
│  search_memory(query, limit, scope) → SearchResults[]    │
│  flush_messages(messages, scope) → boolean               │
│  deep_dream(lookback_days, scope) → boolean              │
│  format_for_injection(scope, max_tokens) → str           │
└──────────────────────┬──────────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
┌──────────────┐ ┌──────────┐ ┌──────────────┐
│ Context Tier  │ │ Daily    │ │  Core Tier   │
│ (in-memory)  │ │ Tier     │ │  (SQLite +   │
│              │ │ (Markdown│ │   FTS5 +     │
│ RunningSumm. │ │  files)  │ │   vectors)   │
│ token budget │ │          │ │              │
└──────────────┘ │ Deep     │ │ MemoryStore  │
                 │ Dream ───┼─┤ (facts)      │
                 └──────────┘ │ HybridSearch │
                              └──────────────┘
                                       │
                              ┌────────┴────────┐
                              ▼                  ▼
                       ┌────────────┐  ┌────────────────┐
                       │ Keyword    │  │ Vector Search   │
                       │ (FTS5)     │  │ (numpy cosine)  │
                       └────────────┘  └────────────────┘

Data Flow

1. Agent sends message → Context tier tracks token budget, optionally summarizes
2. Conversation turn completes → Messages queued to background MemoryUpdateQueue
3. Debounce timer fires → MemoryUpdater calls LLM with current memory + conversation → extracts facts
4. Facts persisted → Core tier SQLite: chunks table with embedding, FTS5 index
5. Daily recording → MemoryFlushManager appends to memory/YYYY-MM-DD.md
6. Deep Dream (scheduled) → LLM reads MEMORY.md + recent daily files → rewrites MEMORY.md → writes dream diary
7. Agent starts new session → format_for_injection() reads core tier → builds token-budgeted context string → injects into system prompt
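
The final injection step can be sketched as a greedy, budget-bounded formatter. The whitespace token counter is a stand-in for a real tokenizer, and the confidence-first ordering, the fact dictionary shape, and the standalone function signature are assumptions rather than the engine's actual interface.

```python
def format_for_injection(facts, max_tokens, count_tokens=lambda s: len(s.split())):
    """Build a bullet list of facts that never exceeds `max_tokens`."""
    lines = []
    used = 0
    # Highest-confidence facts get first claim on the budget.
    for fact in sorted(facts, key=lambda f: f["confidence"], reverse=True):
        line = f"- {fact['content']}"
        cost = count_tokens(line)
        if used + cost > max_tokens:
            break  # stop at the first fact that would overflow the budget
        lines.append(line)
        used += cost
    return "\n".join(lines)
```

Stopping at the first overflow keeps the output strictly ordered by confidence; a variant could instead skip the oversized fact and keep trying smaller ones.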

Module Structure

memory_engine/
├── __init__.py               # Public API: MemoryEngine, MemoryConfig
├── config.py                 # Pydantic config model
├── core/
│   ├── __init__.py
│   ├── store.py              # MemoryStore (SQLite + FTS5 + vectors)
│   ├── hybrid_search.py      # Vector + keyword merge with temporal decay
│   └── schemas.py            # Memory, Fact, SearchResult models
├── extraction/
│   ├── __init__.py
│   ├── manager.py            # MemoryManager (LLM fact extraction)
│   └── prompts.py            # System prompts for memory extraction
├── tiers/
│   ├── __init__.py
│   ├── context.py            # ContextTier (short-term summarization)
│   ├── daily.py              # DailyTier (Markdown file management)
│   └── core.py               # CoreTier (long-term persistent store)
├── background/
│   ├── __init__.py
│   ├── queue.py              # MemoryUpdateQueue (debounced)
│   └── deep_dream.py         # Deep Dream consolidation
├── tools/
│   ├── __init__.py
│   ├── manage.py             # manage_memory callable
│   └── search.py             # search_memory callable
├── embedding/
│   ├── __init__.py
│   ├── base.py               # EmbeddingProvider ABC
│   └── openai.py             # OpenAI embedding implementation
└── utils/
    ├── __init__.py
    ├── namespace.py          # NamespaceTemplate
    ├── token_counter.py      # Token counting (tiktoken wrapper)
    └── chunker.py            # Text chunking

Risks / Trade-offs

| Risk | Mitigation |
| --- | --- |
| [R1] LLM extraction latency blocks agent loop | Background queue with debounce; the agent never waits for a memory update |
| [R2] Embedding API failures degrade search | Graceful degradation to keyword-only; vector results omitted, not fatal |
| [R3] SQLite write contention under high concurrency | WAL mode + RLock per connection; single-process assumption |
| [R4] FTS5 corrupted after crash | Self-healing on init: detect corrupt shadow tables, rebuild from chunks table |
| [R5] Memory bloat from unbounded fact accumulation | Configurable `max_facts` limit (default 500); sorted by confidence, oldest trimmed |
| [R6] Deep Dream overwrites valuable long-term data | Dream diary preserves audit trail; content-hash dedup prevents re-processing |
| [R7] Token budget exceeded in context injection | `format_for_injection()` enforces strict token limit with truncation |
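
The R5 mitigation could be read as the trimming rule sketched below: rank by confidence, break ties in favour of newer facts, and drop whatever falls beyond `max_facts`. This is one interpretation of "sorted by confidence, oldest trimmed"; the `created_at` field and the function name are hypothetical.

```python
def trim_facts(facts, max_facts=500):
    """Keep the `max_facts` best facts: highest confidence, newest on ties."""
    ranked = sorted(
        facts,
        key=lambda f: (f["confidence"], f["created_at"]),
        reverse=True,
    )
    return ranked[:max_facts]
```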

Open Questions

Q1: Should Deep Dream be scheduled (cron) or event-driven (every N daily files)?
Q2: Is 500 the right default for the core tier's max_facts limit (see R5), or should it scale with deployment size?
Q3: Should the daily tier support per-user isolation (user-specific daily files) or always shared?
