# llama-cache-and-spec: KV cache quantization + ngram speculative decoding

## Why

BooCode's llama-sidecar runs llama-server with bare-minimum base args: `-ngl 999 -c 32768 --flash-attn on --no-mmap`. Two high-impact llama.cpp features are available but not enabled:

1. **KV cache quantization** (`--cache-type-k q4_0`): stores the K half of the KV cache in 4-bit blocks instead of llama.cpp's default f16. q4_0 uses about 4.5 bits per element, so the K cache shrinks roughly 3.5×; the V cache stays at f16 unless `--cache-type-v` is also set. The cache dominates memory usage at 32K context. Little to no quality impact for most models.
2. **Ngram speculative decoding** (`--spec-type ngram-mod`): uses a lightweight rolling-hash ngram model (~16MB) to predict tokens ahead of the main model, which then verifies them in a batch. 2-3× tok/s speedup on repetitive/code tasks with no accuracy loss and no separate draft model to load.

Both are disabled because they are in the **shadowing lists** of both validators (`llama-args-validator.ts` and the sidecar's `validator.go`), which auto-strip them from agent `llama_extra_args`. The fix is to either:

- (A) remove them from the shadow lists, or
- (B) add them directly to the sidecar's `BASE_ARGS` (which skips validation entirely).

This change takes option (B): the validators are left untouched.
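The cache-size arithmetic behind the quantization item above can be sketched as follows. This is a rough estimate, not a measurement: the model shape (`layers`, `kvHeads`, `headDim`) is hypothetical and chosen only for illustration, and the formula assumes a plain attention cache in every layer. The per-element sizes reflect f16 (2 bytes) and the q4_0 block layout (32 elements in 18 bytes).

```typescript
// Bytes per cached element for the two cache types discussed above.
// q4_0 packs 32 elements into an 18-byte block: a 2-byte scale plus 16 bytes of 4-bit values.
const BYTES_PER_ELEMENT = {
  f16: 2,
  q4_0: 18 / 32,
} as const;

type CacheType = keyof typeof BYTES_PER_ELEMENT;

// Hypothetical description of a model's attention layout.
interface ModelShape {
  layers: number;
  kvHeads: number;
  headDim: number;
}

// Estimated KV cache size in bytes: K and V each hold layers * ctx * kvHeads * headDim elements.
function cacheBytes(shape: ModelShape, ctx: number, kType: CacheType, vType: CacheType): number {
  const elementsPerTensor = shape.layers * ctx * shape.kvHeads * shape.headDim;
  return elementsPerTensor * (BYTES_PER_ELEMENT[kType] + BYTES_PER_ELEMENT[vType]);
}

// Assumed shape, for illustration only; substitute the real model's values.
const shape: ModelShape = { layers: 48, kvHeads: 8, headDim: 128 };
const ctx = 32768; // matches `-c 32768` in the base args

const baseline = cacheBytes(shape, ctx, "f16", "f16");   // no cache quantization
const kOnly = cacheBytes(shape, ctx, "q4_0", "f16");     // --cache-type-k q4_0 only
const both = cacheBytes(shape, ctx, "q4_0", "q4_0");     // K and V both quantized

const GiB = 1024 ** 3;
console.log(`f16/f16:   ${(baseline / GiB).toFixed(2)} GiB`);
console.log(`q4_0/f16:  ${(kOnly / GiB).toFixed(2)} GiB`);
console.log(`q4_0/q4_0: ${(both / GiB).toFixed(2)} GiB`);
```

For this assumed shape the estimate comes to about 6.0 GiB unquantized, 3.84 GiB with only the K cache at q4_0, and 1.69 GiB with both K and V at q4_0. Whole-cache savings therefore depend on whether the V cache is quantized as well.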
## What Changes

- Sidecar base args gain the full set:
  - `--cache-type-k q4_0`: K cache quantization (~3.5× smaller K cache than f16)
  - `--cache-reuse 256`: KV cache reuse across turns (prompt caching)
  - `--slot-save-path /tmp/llama-slots`: disk-persistent KV cache
  - `--cache-idle-slots`: auto-save idle slot caches to disk
  - `--spec-type ngram-mod --spec-ngram-mod-thsh 2`: spec decoding
  - `--ctx-checkpoints 32`: context overflow protection
  - `--sleep-idle-seconds 600`: GPU memory reclaim when idle
  - `--metrics`: Prometheus metrics endpoint
- Both validators keep their existing shadow lists (correct as-is)
- `/tmp/llama-slots` is created for slot KV cache persistence

## Dependencies

- llama-sidecar repo (separate git tree, `/opt/forks/llama-sidecar/`)
- BooCode server (`llama-args-validator.ts`, `provider.ts`)

## Routing Change

Previously, the llama-sidecar was only used when an agent had `llama_extra_args` set in its AGENTS.md frontmatter. The default path was llama-swap (no cache quant, no spec decoding, no slot save).

Now, when `LLAMA_SIDECAR_URL` is configured (it is in docker-compose.yml), ALL inference requests route through the sidecar by default, regardless of whether the agent has `llama_extra_args`. This means:

- Every request gets KV cache quantization, spec decoding, and prompt caching
- Agents with explicit `llama_extra_args` still get their overrides on top
- If `LLAMA_SIDECAR_URL` is unset, requests fall back to llama-swap (backward compat)

## Risk

- `ngram-mod` spec decoding adds ~16MB of memory, which is trivial next to the 35B model.
- KV cache quantization to q4_0 is lossy relative to the default f16 cache; the difference is expected to be undetectable on code tasks.
- Both features are well-tested in the llama.cpp ecosystem, with no known regressions.
- If issues arise, remove the flags from base args and restart; no code change is needed.
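The routing rule described under Routing Change can be sketched roughly as below. This is a minimal illustration, not the actual `provider.ts` code: the `Agent` and `Route` types, the `pickRoute` function, and the example URLs are all hypothetical names introduced here. Dropping per-agent args on the llama-swap fallback is also an assumption of the sketch.

```typescript
// Hypothetical shapes; the real BooCode types may differ.
interface Agent {
  name: string;
  llamaExtraArgs?: string[]; // stands in for `llama_extra_args` from AGENTS.md frontmatter
}

interface Route {
  backend: "sidecar" | "llama-swap";
  baseUrl: string;
  extraArgs: string[]; // per-agent overrides layered on top of the sidecar's base args
}

// Choose a backend for one request. The environment is passed in explicitly
// so the function stays pure and easy to test.
function pickRoute(
  agent: Agent,
  env: Record<string, string | undefined>,
  llamaSwapUrl: string,
): Route {
  const sidecarUrl = env["LLAMA_SIDECAR_URL"]?.trim();
  if (sidecarUrl) {
    // Sidecar is the default path whenever it is configured,
    // whether or not the agent sets extra args.
    return { backend: "sidecar", baseUrl: sidecarUrl, extraArgs: agent.llamaExtraArgs ?? [] };
  }
  // Backward-compatible fallback. Assumption: llama-swap takes no per-agent args.
  return { backend: "llama-swap", baseUrl: llamaSwapUrl, extraArgs: [] };
}

// Example usage with placeholder URLs.
const exampleEnv = { LLAMA_SIDECAR_URL: "http://sidecar.example:9000" };
const plainAgent: Agent = { name: "plain" };
const tunedAgent: Agent = { name: "tuned", llamaExtraArgs: ["--temp", "0.2"] };

console.log(pickRoute(plainAgent, exampleEnv, "http://swap.example:8080").backend);
console.log(pickRoute(tunedAgent, exampleEnv, "http://swap.example:8080").extraArgs);
console.log(pickRoute(plainAgent, {}, "http://swap.example:8080").backend);
```

Keeping the decision in one pure function makes the rollback path easy to reason about: unsetting `LLAMA_SIDECAR_URL` is enough to send every request back to llama-swap.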