boocode/openspec/changes/llama-cache-and-spec/proposal.md

llama-cache-and-spec — KV cache quantization + ngram speculative decoding

Why

BooCode's llama-sidecar runs llama-server with bare-minimum base args: -ngl 999 -c 32768 --flash-attn on --no-mmap. Two high-impact llama.cpp features are available but not enabled:

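  1. KV cache quantization (--cache-type-k q4_0): stores the K half of the KV cache in 4-bit q4_0 blocks (4.5 bits per value, including the per-block scale) instead of llama.cpp's default f16. That is roughly a 3.6× reduction for the K cache; the V cache stays at f16 unless --cache-type-v is also set. At 32K context the cache is a large share of VRAM, so the saving is significant. Quality impact is small for most models, but the quantization is lossy and worth spot-checking per model.
  2. Ngram speculative decoding (--spec-type ngram-mod): uses a lightweight rolling-hash ngram model (~16MB) to predict tokens ahead of the main model, which then verifies them in a single batch. This gives a 2-3× tok/s speedup on repetitive/code tasks with no accuracy loss, and there is no separate draft model to load.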

Neither flag can be enabled per agent today: both are in the shadow lists of both validators (llama-args-validator.ts and the sidecar's validator.go), which auto-strip them from an agent's llama_extra_args. There are two ways to enable them: (A) remove them from the shadow lists, or (B) add them directly to the sidecar's BASE_ARGS, which bypasses validation entirely. This proposal takes option (B).
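
Option (B) can be pictured as follows. This is a minimal sketch, not the actual validator code: the names SHADOWED_FLAGS, stripShadowed, and buildArgv are hypothetical, the shadow list shows only the two flags named above, and BASE_ARGS is abbreviated to the existing base args plus those two flags.

```typescript
// Hypothetical excerpt of a shadow list: flags the sidecar owns and that
// agents may not set through llama_extra_args.
const SHADOWED_FLAGS = new Set<string>(["--cache-type-k", "--spec-type"]);

// Base args are appended by the sidecar itself and never pass through the
// validator (assumption: this mirrors how BASE_ARGS is described above).
const BASE_ARGS: string[] = [
  "-ngl", "999", "-c", "32768", "--flash-attn", "on", "--no-mmap",
  "--cache-type-k", "q4_0",
  "--spec-type", "ngram-mod",
];

// Drop shadowed flags (and their values) from agent-supplied args.
// Handles both "--flag value" and "--flag=value" spellings.
function stripShadowed(extraArgs: string[]): string[] {
  const kept: string[] = [];
  for (let i = 0; i < extraArgs.length; i++) {
    const arg = extraArgs[i] ?? "";
    const flag = arg.split("=")[0] ?? arg;
    if (!SHADOWED_FLAGS.has(flag)) {
      kept.push(arg);
      continue;
    }
    // Shadowed: skip the flag, plus its value when passed as a separate token.
    const next = extraArgs[i + 1];
    if (!arg.includes("=") && next !== undefined && !next.startsWith("-")) {
      i++;
    }
  }
  return kept;
}

// Final command line: trusted base args first, validated agent args after.
function buildArgv(extraArgs: string[]): string[] {
  return [...BASE_ARGS, ...stripShadowed(extraArgs)];
}
```

With this shape, an agent that sets `--cache-type-k q8_0` still has it stripped, while the q4_0 value in BASE_ARGS is always applied.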

What Changes

  • Sidecar base args gain the full set:
    • --cache-type-k q4_0: K cache quantization (~3.6× smaller K cache than the default f16)
    • --cache-reuse 256 — KV cache reuse across turns (prompt caching)
    • --slot-save-path /tmp/llama-slots — disk-persistent KV cache
    • --cache-idle-slots — auto-save idle slot caches to disk
    • --spec-type ngram-mod --spec-ngram-mod-thsh 2 — spec decoding
    • --ctx-checkpoints 32 — context overflow protection
    • --sleep-idle-seconds 600 — GPU memory reclaim when idle
    • --metrics — Prometheus metrics endpoint
  • Both validators keep existing shadow lists (correct as-is)
  • /tmp/llama-slots created for slot KV cache persistence
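
To put the cache saving in numbers, here is a rough size calculation. It assumes llama.cpp's default f16 cache as the baseline and the q4_0 block layout (32 values in 18 bytes, i.e. 4.5 bits per value). The model shape is hypothetical and chosen only to make the arithmetic concrete.

```typescript
// Bytes per cached value. f16 is 2 bytes; q4_0 packs 32 values into 18 bytes
// (16 bytes of 4-bit data plus one 2-byte f16 scale per block).
const BYTES_PER_VALUE = { f16: 2, q4_0: 18 / 32 } as const;
type CacheType = keyof typeof BYTES_PER_VALUE;

interface ModelShape {
  layers: number;
  kvHeads: number;
  headDim: number;
}

// Size in bytes of ONE cache tensor set (K or V) at a given context length.
function cacheBytes(m: ModelShape, contextLen: number, t: CacheType): number {
  return m.layers * m.kvHeads * m.headDim * contextLen * BYTES_PER_VALUE[t];
}

// Hypothetical shape, NOT taken from the model BooCode actually runs.
const exampleShape: ModelShape = { layers: 48, kvHeads: 8, headDim: 128 };
const contextLen = 32768; // matches -c 32768 in the base args
const MiB = 1024 * 1024;

const kF16 = cacheBytes(exampleShape, contextLen, "f16") / MiB;
const kQ4 = cacheBytes(exampleShape, contextLen, "q4_0") / MiB;
const vF16 = kF16; // V is untouched by --cache-type-k

console.log(`K cache: ${kF16} MiB (f16) vs ${kQ4} MiB (q4_0)`);
console.log(`K + V:   ${kF16 + vF16} MiB before, ${kQ4 + vF16} MiB after`);
```

For this shape the K cache drops from 3072 MiB to 864 MiB (about 3.6×), while the V cache stays at 3072 MiB because --cache-type-k only affects K.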

Dependencies

  • llama-sidecar repo (separate git tree, /opt/forks/llama-sidecar/)
  • BooCode server (llama-args-validator.ts, provider.ts)

Routing Change

Previously, the llama-sidecar was only used when an agent had llama_extra_args set in its AGENTS.md frontmatter. The default path was llama-swap (no cache quant, no spec decoding, no slot save).

Now, when LLAMA_SIDECAR_URL is configured (it is in docker-compose.yml), ALL inference requests route through the sidecar by default, regardless of whether the agent has llama_extra_args. This means:

  • Every request gets KV cache quantization, spec decoding, prompt caching
  • Agents with explicit llama_extra_args still get their overrides on top
  • If LLAMA_SIDECAR_URL is unset, requests fall back to llama-swap (backward compatible)
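
The before/after routing rule can be sketched as two small functions. This is one reading of the description above, not the code in provider.ts: the type and function names are hypothetical, and the "previous" rule assumes the sidecar URL was already configured.

```typescript
type Backend = "sidecar" | "llama-swap";

// Hypothetical view of the parsed AGENTS.md frontmatter.
interface AgentConfig {
  llamaExtraArgs?: string[];
}

// Previous behavior: the sidecar is used only for agents with llama_extra_args.
function previousBackend(agent: AgentConfig): Backend {
  const hasExtraArgs = (agent.llamaExtraArgs ?? []).length > 0;
  return hasExtraArgs ? "sidecar" : "llama-swap";
}

// New behavior: a configured LLAMA_SIDECAR_URL routes every request to the
// sidecar; an unset or empty value falls back to llama-swap.
function newBackend(sidecarUrl: string | undefined, _agent: AgentConfig): Backend {
  const configured = sidecarUrl !== undefined && sidecarUrl.trim() !== "";
  return configured ? "sidecar" : "llama-swap";
}
```

The agent argument is deliberately unused in newBackend: under the new rule the agent's llama_extra_args only affect which overrides are layered on top, not which backend serves the request.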

Risk

  • ngram-mod spec decoding adds ~16MB of memory, which is trivial next to the 35B model.
  • KV cache quantization to q4_0 is lossy relative to the default f16 cache; the loss is expected to be negligible on code tasks.
  • Both features are well tested in the llama.cpp ecosystem, with no known regressions.
  • If issues arise, remove the flags from the base args and restart; no code change is needed.
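
The "no accuracy loss" property of speculative decoding comes from the verify step: drafted tokens are kept only when the main model would have produced them anyway. The toy below illustrates this with greedy decoding and a plain n-gram lookup table. It is not llama.cpp's ngram-mod implementation (which uses a rolling hash); the model function, token values, and function names are all made up for illustration.

```typescript
// Stand-in for the main model's greedy "next token" step (hypothetical).
type NextToken = (ctx: number[]) => number;

// Draft up to k tokens by matching the last n tokens against n-grams
// already seen in the context.
function draftFromNgrams(ctx: number[], n: number, k: number): number[] {
  const table = new Map<string, number>();
  for (let i = 0; i + n < ctx.length; i++) {
    const next = ctx[i + n];
    if (next !== undefined) table.set(ctx.slice(i, i + n).join(","), next);
  }
  const work = ctx.slice();
  const draft: number[] = [];
  while (draft.length < k && work.length >= n) {
    const guess = table.get(work.slice(-n).join(","));
    if (guess === undefined) break;
    draft.push(guess);
    work.push(guess);
  }
  return draft;
}

// Baseline: one model pass per generated token.
function greedyDecode(model: NextToken, start: number[], maxNew: number): number[] {
  const ctx = start.slice();
  for (let i = 0; i < maxNew; i++) ctx.push(model(ctx));
  return ctx.slice(start.length);
}

// Speculative: each pass verifies a whole draft. A real engine checks the
// draft in one batched forward pass; here that is simulated per token.
function speculativeDecode(
  model: NextToken, start: number[], maxNew: number, n = 2, k = 4,
): { tokens: number[]; passes: number } {
  const ctx = start.slice();
  const limit = start.length + maxNew;
  let passes = 0;
  while (ctx.length < limit) {
    passes++;
    for (const guess of draftFromNgrams(ctx, n, k)) {
      // Accept a drafted token only if the model agrees with it.
      if (ctx.length >= limit || model(ctx) !== guess) break;
      ctx.push(guess);
    }
    // The model's own token after the accepted prefix (or at the mismatch).
    if (ctx.length < limit) ctx.push(model(ctx));
  }
  return { tokens: ctx.slice(start.length), passes };
}

// Toy model that cycles 0, 1, 2, 0, 1, 2, ... (highly repetitive on purpose).
const cyclicModel: NextToken = (ctx) => ((ctx[ctx.length - 1] ?? 0) + 1) % 3;
const seedTokens = [0, 1, 2, 0, 1, 2];
const plainTokens = greedyDecode(cyclicModel, seedTokens, 10);
const specResult = speculativeDecode(cyclicModel, seedTokens, 10);
```

Here specResult.tokens is identical to plainTokens, but it is produced in fewer verification passes than the ten model passes the baseline needs. On non-repetitive text the drafts are rejected more often and the speedup shrinks, while the output still matches.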