Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
llama-cache-and-spec — KV cache quantization + ngram speculative decoding
Why
BooCode's llama-sidecar runs llama-server with bare-minimum base args:
-ngl 999 -c 32768 --flash-attn on --no-mmap. Two high-impact llama.cpp
features are available but not enabled:
- KV cache quantization (--cache-type-k q4_0): stores the K half of the KV cache in 4-bit q4_0 blocks instead of llama.cpp's default f16. q4_0 costs 4.5 bits per value, so the K cache shrinks about 3.5×; the V half stays at f16 unless --cache-type-v is set as well. The cache is a major consumer of VRAM at 32K context. Quality impact is small for most models.
- Ngram speculative decoding (--spec-type ngram-mod): uses a lightweight rolling-hash ngram model (~16MB) to predict tokens ahead of the main model, which then verifies them in one batch. 2-3× tok/s speedup on repetitive/code tasks with no accuracy loss (only verified tokens are kept) and no separate draft model to load.
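The memory effect of the cache type can be estimated with simple arithmetic. The sketch below is illustrative only: the model shape (48 layers, 8 KV heads, head dimension 128) is a made-up example, not the shape of the model BooCode actually runs, and the formula assumes a plain attention stack where every layer caches the full context. The byte costs are the standard ones: 2 bytes per f16 value, and 18 bytes per 32-value q4_0 block.

```typescript
// Bytes per cached value. f16 is 2 bytes; q4_0 packs 32 values into an
// 18-byte block (16 bytes of 4-bit values plus a 2-byte scale).
const BYTES_PER_VALUE = { f16: 2, q4_0: 18 / 32 } as const;
type CacheType = keyof typeof BYTES_PER_VALUE;

// Hypothetical model shape, for illustration only.
interface ModelShape {
  layers: number;
  kvHeads: number;
  headDim: number;
}

// Total KV cache size: one K tensor and one V tensor, each holding
// layers * kvHeads * headDim values per context position.
function kvCacheBytes(
  shape: ModelShape,
  ctx: number,
  kType: CacheType,
  vType: CacheType,
): number {
  const valuesPerTensor = shape.layers * shape.kvHeads * shape.headDim * ctx;
  return valuesPerTensor * (BYTES_PER_VALUE[kType] + BYTES_PER_VALUE[vType]);
}

const GiB = 1024 ** 3;
const shape: ModelShape = { layers: 48, kvHeads: 8, headDim: 128 }; // assumed, not measured
const ctx = 32768;

const baseline = kvCacheBytes(shape, ctx, "f16", "f16");
const kOnly = kvCacheBytes(shape, ctx, "q4_0", "f16");
const both = kvCacheBytes(shape, ctx, "q4_0", "q4_0");

console.log((baseline / GiB).toFixed(2), (kOnly / GiB).toFixed(2), (both / GiB).toFixed(2));
```

For this assumed shape the three cases come to roughly 6.0 GiB, 3.8 GiB and 1.7 GiB. Quantizing only K saves about a third of the cache; getting close to the full 3.5× would also require quantizing V.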
Both are disabled because they're in the shadowing lists of both
validators (llama-args-validator.ts + sidecar validator.go), which
auto-strip them from agent llama_extra_args. The fix is to either:
(A) remove them from the shadow lists, or (B) add them directly to the
sidecar's BASE_ARGS (which skips validation entirely).
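The difference between options (A) and (B) can be sketched as follows. This is not the code in llama-args-validator.ts or validator.go; SHADOWED_FLAGS, stripShadowed and buildArgs are hypothetical names, and the argument parsing assumes each flag is followed by zero or more value tokens that do not look like flags.

```typescript
// Base args the sidecar always passes (from the current configuration).
const BASE_ARGS: string[] = ["-ngl", "999", "-c", "32768", "--flash-attn", "on", "--no-mmap"];

// Hypothetical shadow list: flags that agents may not set themselves.
const SHADOWED_FLAGS = new Set<string>(["--cache-type-k", "--spec-type"]);

// A token is treated as a flag if it starts with "-" or "--" followed by a letter,
// so negative numbers such as "-1" still count as values.
const isFlag = (tok: string): boolean => /^--?[A-Za-z]/.test(tok);

// Drop every shadowed flag together with the value tokens that follow it.
function stripShadowed(extraArgs: string[]): string[] {
  const kept: string[] = [];
  let dropping = false;
  for (const tok of extraArgs) {
    if (isFlag(tok)) dropping = SHADOWED_FLAGS.has(tok);
    if (!dropping) kept.push(tok);
  }
  return kept;
}

// Base args are trusted and never validated; only agent-supplied args are filtered.
function buildArgs(extraArgs: string[]): string[] {
  return [...BASE_ARGS, ...stripShadowed(extraArgs)];
}
```

Under this model, an agent that passes --cache-type-k q8_0 has it silently removed, while the same flag placed in BASE_ARGS reaches llama-server untouched. That is why option (B) needs no validator change.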
What Changes
- Sidecar base args gain the full set:
  - --cache-type-k q4_0: KV cache quantization (K cache about 3.5× smaller than f16)
  - --cache-reuse 256: KV cache reuse across turns (prompt caching)
  - --slot-save-path /tmp/llama-slots: disk-persistent KV cache
  - --cache-idle-slots: auto-save idle slot caches to disk
  - --spec-type ngram-mod --spec-ngram-mod-thsh 2: spec decoding
  - --ctx-checkpoints 32: context overflow protection
  - --sleep-idle-seconds 600: GPU memory reclaim when idle
  - --metrics: Prometheus metrics endpoint
- Both validators keep their existing shadow lists (correct as-is); this is option (B) from the Why section
- /tmp/llama-slots created for slot KV cache persistence
Dependencies
- llama-sidecar repo (separate git tree, /opt/forks/llama-sidecar/)
- BooCode server (llama-args-validator.ts, provider.ts)
Routing Change
Previously, the llama-sidecar was only used when an agent had llama_extra_args
set in its AGENTS.md frontmatter. The default path was llama-swap (no cache
quant, no spec decoding, no slot save).
Now, when LLAMA_SIDECAR_URL is configured (it is in docker-compose.yml),
ALL inference requests route through the sidecar by default, regardless of
whether the agent has llama_extra_args. This means:
- Every request gets KV cache quantization, spec decoding, prompt caching
- Agents with explicit llama_extra_args still get their overrides on top
- If LLAMA_SIDECAR_URL is unset, falls back to llama-swap (backward compat)
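The new routing rule reduces to a single check on the environment. A minimal sketch, assuming a hypothetical pickBackend helper and Backend type (the real logic lives in provider.ts and may be shaped differently):

```typescript
// Hypothetical result type: either the sidecar at a given URL, or the llama-swap fallback.
type Backend =
  | { kind: "sidecar"; url: string; extraArgs: string[] }
  | { kind: "llama-swap" };

// Route to the sidecar whenever LLAMA_SIDECAR_URL is set to a non-blank value.
// Whether the agent defines llama_extra_args no longer affects the choice;
// the args are simply forwarded as overrides when present.
function pickBackend(
  env: Record<string, string | undefined>,
  agentExtraArgs: string[] = [],
): Backend {
  const url = env.LLAMA_SIDECAR_URL?.trim();
  if (url) return { kind: "sidecar", url, extraArgs: agentExtraArgs };
  return { kind: "llama-swap" };
}
```

In a Node process the first argument would typically be process.env. Treating a blank value as unset is a choice made for this sketch, not something the proposal specifies.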
Risk
- ngram-mod spec decoding adds ~16MB memory. Trivial vs the 35B model.
- KV cache quant to q4_0 is lossy vs the default f16; undetectable on code tasks.
- Both well-tested in llama.cpp ecosystem. No known regressions.
- If issues, remove from base args and restart — no code change needed.
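To see why speculative decoding carries no accuracy risk under greedy sampling, it helps to look at a generic draft-and-verify loop. The sketch below is a plain n-gram lookup drafter, not llama.cpp's ngram-mod implementation (which uses a rolling hash and verifies the draft in a single batch). All names are hypothetical, and the model is a stand-in function that returns the next greedy token.

```typescript
type Token = number;
// Stand-in for the main model: returns its greedy next token for a context.
type NextToken = (context: Token[]) => Token;

// Propose up to maxDraft tokens by finding the most recent earlier occurrence
// of the last n tokens and copying whatever followed it.
function draftFromHistory(history: Token[], n: number, maxDraft: number): Token[] {
  if (history.length <= n) return [];
  const tail = history.slice(-n);
  for (let start = history.length - n - 1; start >= 0; start--) {
    const matches = tail.every((t, i) => history[start + i] === t);
    if (matches) return history.slice(start + n, start + n + maxDraft);
  }
  return [];
}

// Accept drafted tokens only while they equal what the model would have produced.
// On the first mismatch the model's own token is kept and the rest of the draft
// is discarded, so the result always matches plain greedy decoding.
function speculativeStep(
  history: Token[],
  model: NextToken,
  n: number,
  maxDraft: number,
): Token[] {
  const draft = draftFromHistory(history, n, maxDraft);
  const accepted: Token[] = [];
  for (const guess of draft) {
    const actual = model([...history, ...accepted]);
    accepted.push(actual);
    if (actual !== guess) return accepted;
  }
  // The whole draft matched (or was empty): the model still yields one more token.
  accepted.push(model([...history, ...accepted]));
  return accepted;
}
```

A wrong guess costs some wasted verification work but never changes the output, which is why the speedup depends on how repetitive the text is while correctness does not.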