chore(openspec): drop 9 superseded proposals + 11 stub archive files

Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
2026-06-07 22:15:38 +00:00
parent 0d6e9a2413
commit c935687725
119 changed files with 4897 additions and 45 deletions
--- a/openspec/changes/llama-cache-and-spec/proposal.md
+++ b/openspec/changes/llama-cache-and-spec/proposal.md
@@ -0,0 +1,62 @@
+# llama-cache-and-spec — KV cache quantization + ngram speculative decoding
+
+## Why
+
+BooCode's llama-sidecar runs llama-server with bare-minimum base args:
+`-ngl 999 -c 32768 --flash-attn on --no-mmap`. Two high-impact llama.cpp
+features are available but not enabled:
+
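+1. **KV cache quantization** (`--cache-type-k q4_0`): stores the K half of
+   the KV cache in 4-bit `q4_0` (about 4.5 bits per value) instead of
+   llama.cpp's default f16. That is roughly a 3.5× VRAM reduction for the
+   K cache; the V cache stays f16 unless `--cache-type-v` is also set. The
+   KV cache is a large share of memory usage at 32K context. Quantization
+   is lossy, but the quality impact is small for most models.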
+2. **Ngram speculative decoding** (`--spec-type ngram-mod`) — uses a
+   lightweight rolling-hash ngram model (~16MB) to predict tokens ahead of
+   the main model. The main model verifies them in batch. 2-3× tok/s
+   speedup on repetitive/code tasks with no accuracy loss and no separate
+   draft model to load.
+
+Both are disabled because they're in the **shadowing lists** of both
+validators (`llama-args-validator.ts` + sidecar `validator.go`), which
+auto-strip them from agent `llama_extra_args`. There are two possible fixes:
+(A) remove them from the shadow lists, or (B) add them directly to the
+sidecar's `BASE_ARGS`, which skips validation entirely. This proposal
+takes option (B).
+
+## What Changes
+
+- Sidecar base args gain the full set:
+  - `--cache-type-k q4_0` — KV cache quantization (~4× VRAM savings)
+  - `--cache-reuse 256` — KV cache reuse across turns (prompt caching)
+  - `--slot-save-path /tmp/llama-slots` — disk-persistent KV cache
+  - `--cache-idle-slots` — auto-save idle slot caches to disk
+  - `--spec-type ngram-mod --spec-ngram-mod-thsh 2` — spec decoding
+  - `--ctx-checkpoints 32` — context overflow protection
+  - `--sleep-idle-seconds 600` — GPU memory reclaim when idle
+  - `--metrics` — Prometheus metrics endpoint
+- Both validators keep their existing shadow lists unchanged (correct as-is),
+  so these flags are still stripped from agent `llama_extra_args`
+- `/tmp/llama-slots` is created for slot KV cache persistence
+
+## Dependencies
+
+- llama-sidecar repo (separate git tree, `/opt/forks/llama-sidecar/`)
+- BooCode server (`llama-args-validator.ts`, `provider.ts`)
+
+## Routing Change
+
+Previously, the llama-sidecar was only used when an agent had `llama_extra_args`
+set in its AGENTS.md frontmatter. The default path was llama-swap (no cache
+quant, no spec decoding, no slot save).
+
+Now, when `LLAMA_SIDECAR_URL` is configured (it is in docker-compose.yml),
+ALL inference requests route through the sidecar by default, regardless of
+whether the agent has `llama_extra_args`. This means:
+
+- Every request gets KV cache quantization, spec decoding, prompt caching
+- Agents with explicit `llama_extra_args` still get their overrides on top
+- If `LLAMA_SIDECAR_URL` is unset, routing falls back to llama-swap (backward compat)
+
+## Risk
+
+- `ngram-mod` spec decoding adds ~16MB memory. Trivial vs the 35B model.
+- KV cache quant to q4_0 is lossy vs the default f16. The difference is
+  expected to be undetectable on code tasks.
+- Both well-tested in llama.cpp ecosystem. No known regressions.
+- If issues arise, remove the flags from base args and restart; no code change needed.
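
The arg handling and routing described in the proposal can be sketched roughly as follows. This is a minimal illustration, not the actual code in `provider.ts` or `llama-args-validator.ts`: the names `SHADOWED_FLAGS`, `stripShadowed`, `buildArgs`, and `pickBackend` are hypothetical, the shadow list shown is an assumed subset, and the flag values are copied from the proposal text rather than checked against a llama-server build.

```typescript
// Hypothetical sketch; names and the shadow-list contents are assumptions.

// Base args the sidecar always passes: the existing set plus the proposed additions.
const BASE_ARGS: string[] = [
  "-ngl", "999", "-c", "32768", "--flash-attn", "on", "--no-mmap",
  "--cache-type-k", "q4_0",
  "--cache-reuse", "256",
  "--slot-save-path", "/tmp/llama-slots",
  "--cache-idle-slots",
  "--spec-type", "ngram-mod", "--spec-ngram-mod-thsh", "2",
  "--ctx-checkpoints", "32",
  "--sleep-idle-seconds", "600",
  "--metrics",
];

// Assumed subset of the shadow list. The boolean records whether the flag
// consumes the following argument as its value.
const SHADOWED_FLAGS = new Map<string, boolean>([
  ["--cache-type-k", true],
  ["--spec-type", true],
  ["--cache-idle-slots", false],
]);

// Drop shadowed flags (and their values) from an agent's llama_extra_args.
function stripShadowed(extraArgs: string[]): string[] {
  const kept: string[] = [];
  let skipNext = false;
  for (const arg of extraArgs) {
    if (skipNext) {
      skipNext = false; // this arg was the value of a shadowed flag
      continue;
    }
    const eq = arg.indexOf("=");
    const flag = eq === -1 ? arg : arg.slice(0, eq); // handle --flag=value
    const takesValue = SHADOWED_FLAGS.get(flag);
    if (takesValue === undefined) {
      kept.push(arg); // not shadowed: keep as an override
      continue;
    }
    if (takesValue && eq === -1) skipNext = true;
  }
  return kept;
}

// Base args first, then whatever agent overrides survive validation.
function buildArgs(extraArgs: string[] = []): string[] {
  return [...BASE_ARGS, ...stripShadowed(extraArgs)];
}

type Backend = "sidecar" | "llama-swap";

// Route everything through the sidecar when LLAMA_SIDECAR_URL is set;
// otherwise fall back to llama-swap.
function pickBackend(sidecarUrl: string | undefined): Backend {
  return sidecarUrl ? "sidecar" : "llama-swap";
}
```

For example, `buildArgs(["--cache-type-k", "f16", "--temp", "0.2"])` would drop the shadowed `--cache-type-k f16` pair and append only `--temp 0.2` after the base args, while `pickBackend(process.env.LLAMA_SIDECAR_URL)` would select the sidecar whenever the variable is non-empty.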