boocode/openspec/changes/llama-cache-and-spec/proposal.md
indifferentketchup c935687725 chore(openspec): drop 9 superseded proposals + 11 stub archive files
Drop 9 batch proposals that are superseded by the boocode-lift-analysis
(boocontext-audit, conductor upgrades, self-healing/verify-gate skills):
add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform,
conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul,
agent-reliability.

Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only)
that provide zero documentation value over the existing CHANGELOG.md + git tags.
2026-06-07 22:15:38 +00:00

# llama-cache-and-spec — KV cache quantization + ngram speculative decoding
## Why
BooCode's llama-sidecar runs llama-server with bare-minimum base args:
`-ngl 999 -c 32768 --flash-attn on --no-mmap`. Two high-impact llama.cpp
features are available but not enabled:
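1. **KV cache quantization** (`--cache-type-k q4_0`): stores the K half of the
   KV cache as 4-bit `q4_0` (about 4.5 bits per value) instead of llama.cpp's
   default f16. The K tensors shrink by roughly 3.5×; the V half stays at f16
   unless `--cache-type-v` is also set. The KV cache is a major VRAM consumer
   at 32K context. Quality impact is expected to be small for most models.
2. **Ngram speculative decoding** (`--spec-type ngram-mod`): uses a
   lightweight rolling-hash ngram model (~16MB) to predict tokens ahead of
   the main model, which then verifies them in a single batch. Expected
   speedup is 2-3× tok/s on repetitive or code-heavy output. Accuracy is
   unaffected because every drafted token is checked by the main model, and
   there is no separate draft model to load.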
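
The draft-then-verify loop behind item 2 can be illustrated with a minimal TypeScript sketch. This is not llama.cpp's `ngram-mod` implementation, which the description above says uses a rolling hash. The function names `draftFromNgram` and `verifyDraft`, the linear backward scan, and the greedy `nextToken` callback are all assumptions made for illustration.

```typescript
// Simplified illustration of ngram-based speculative decoding.
// Assumption: greedy decoding, where the model is a function returning the next token.
type Token = number;

/** Propose up to `maxDraft` tokens by reusing what followed the most recent
 *  earlier occurrence of the last `n` tokens in the history. */
function draftFromNgram(history: Token[], n: number, maxDraft: number): Token[] {
  if (history.length < n + 1) return [];
  const tail = history.slice(-n);
  // Scan backwards, skipping the tail itself, for an earlier copy of the tail.
  for (let start = history.length - n - 1; start >= 0; start--) {
    let match = true;
    for (let j = 0; j < n; j++) {
      if (history[start + j] !== tail[j]) {
        match = false;
        break;
      }
    }
    if (match) {
      const from = start + n;
      return history.slice(from, from + maxDraft);
    }
  }
  return [];
}

/** Keep the longest draft prefix the main model agrees with, then append one
 *  token chosen by the model itself. A real server verifies the draft in one
 *  batched forward pass; this sketch calls the model once per position. */
function verifyDraft(
  history: Token[],
  draft: Token[],
  nextToken: (context: Token[]) => Token,
): Token[] {
  const accepted: Token[] = [];
  const context = history.slice();
  for (const drafted of draft) {
    const expected = nextToken(context);
    if (expected !== drafted) {
      accepted.push(expected); // first disagreement: take the model's token and stop
      return accepted;
    }
    accepted.push(drafted);
    context.push(drafted);
  }
  accepted.push(nextToken(context)); // whole draft accepted: one extra token from the model
  return accepted;
}

// Toy "model" that cycles 1, 2, 3, 4, 1, 2, ...
const cyclicModel = (context: Token[]): Token => ((context[context.length - 1] ?? 0) % 4) + 1;
const exampleHistory: Token[] = [1, 2, 3, 4, 1, 2];
const exampleDraft = draftFromNgram(exampleHistory, 2, 3); // [3, 4, 1]
const exampleAccepted = verifyDraft(exampleHistory, exampleDraft, cyclicModel); // [3, 4, 1, 2]
```

A drafted token is kept only when the main model would have produced it anyway, so under these assumptions the output matches plain greedy decoding. The speedup comes from checking several positions in one batch.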
Neither can currently be enabled per agent, because both flags are on the
**shadow lists** of both validators (`llama-args-validator.ts` and the
sidecar's `validator.go`), which strip them from an agent's
`llama_extra_args`. There are two ways to fix this: (A) remove the flags from
the shadow lists, or (B) add them directly to the sidecar's `BASE_ARGS`,
which are not passed through validation. This proposal takes option (B).
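
To make the stripping behaviour concrete, here is a rough sketch of how a shadow list might filter agent-supplied arguments. It is not the actual contents of `llama-args-validator.ts` or `validator.go`: the names `SHADOWED_FLAGS` and `stripShadowed` are hypothetical, and the set holds only the two flags discussed above, whereas the real lists presumably contain more.

```typescript
// Hypothetical shadow list: only the two flags named in this proposal.
const SHADOWED_FLAGS: ReadonlySet<string> = new Set(["--cache-type-k", "--spec-type"]);

/** Drop every shadowed flag, together with its value if one follows. */
function stripShadowed(extraArgs: string[]): string[] {
  const kept: string[] = [];
  for (let i = 0; i < extraArgs.length; i++) {
    const arg = extraArgs[i] ?? "";
    if (SHADOWED_FLAGS.has(arg)) {
      const next = extraArgs[i + 1];
      // Assumption: a following token that does not start with "-" is the flag's value.
      if (next !== undefined && !next.startsWith("-")) i++;
      continue;
    }
    kept.push(arg);
  }
  return kept;
}

// An agent asking for cache quantization loses that flag but keeps the rest.
const agentArgs = ["--cache-type-k", "q4_0", "--temp", "0.2"];
const validatedArgs = stripShadowed(agentArgs); // ["--temp", "0.2"]
```

Arguments placed in `BASE_ARGS` never enter a function like this, which is why option (B) works without touching either validator.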
## What Changes
- Sidecar base args gain the full set:
  - `--cache-type-k q4_0`: KV cache quantization (K tensors roughly 3.5× smaller than f16)
  - `--cache-reuse 256`: KV cache reuse across turns (prompt caching)
  - `--slot-save-path /tmp/llama-slots`: disk-persistent KV cache
  - `--cache-idle-slots`: auto-save idle slot caches to disk
  - `--spec-type ngram-mod --spec-ngram-mod-thsh 2`: speculative decoding
  - `--ctx-checkpoints 32`: context overflow protection
  - `--sleep-idle-seconds 600`: GPU memory reclaim when idle
  - `--metrics`: Prometheus metrics endpoint
- Both validators keep existing shadow lists (correct as-is)
- `/tmp/llama-slots` created for slot KV cache persistence
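
Put together, the resulting argument vector could look like the sketch below. The sidecar itself lives in a separate repo and may well be written in Go, so this TypeScript is only an illustration. `EXISTING_BASE_ARGS`, `sidecarBaseArgs`, and `renderCommand` are hypothetical names, and the flag spellings are copied from the list above rather than checked against a particular llama.cpp build.

```typescript
// Base args quoted in the "Why" section.
const EXISTING_BASE_ARGS: string[] = ["-ngl", "999", "-c", "32768", "--flash-attn", "on", "--no-mmap"];

/** Existing base args followed by the additions proposed in this change. */
function sidecarBaseArgs(slotSavePath: string): string[] {
  return [
    ...EXISTING_BASE_ARGS,
    "--cache-type-k", "q4_0",
    "--cache-reuse", "256",
    "--slot-save-path", slotSavePath,
    "--cache-idle-slots",
    "--spec-type", "ngram-mod",
    "--spec-ngram-mod-thsh", "2",
    "--ctx-checkpoints", "32",
    "--sleep-idle-seconds", "600",
    "--metrics",
  ];
}

/** Join a binary name and argv into one line, e.g. for a startup log message. */
function renderCommand(binary: string, args: string[]): string {
  return [binary, ...args].join(" ");
}

const startupLine = renderCommand("llama-server", sidecarBaseArgs("/tmp/llama-slots"));
```

Passing the slot directory as a parameter keeps the path in one place, so whatever creates `/tmp/llama-slots` and whatever launches the server cannot drift apart.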
## Dependencies
- llama-sidecar repo (separate git tree, `/opt/forks/llama-sidecar/`)
- BooCode server (`llama-args-validator.ts`, `provider.ts`)
## Routing Change
Previously, the llama-sidecar was only used when an agent had `llama_extra_args`
set in its AGENTS.md frontmatter. The default path was llama-swap (no cache
quant, no spec decoding, no slot save).
Now, when `LLAMA_SIDECAR_URL` is configured (it is in docker-compose.yml),
ALL inference requests route through the sidecar by default, regardless of
whether the agent has `llama_extra_args`. This means:
- Every request gets KV cache quantization, speculative decoding, and prompt caching
- Agents with explicit `llama_extra_args` still get their overrides on top
- If `LLAMA_SIDECAR_URL` is unset, falls back to llama-swap (backward compat)
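
The routing rule described above can be summarised in a few lines. This is a sketch under stated assumptions, not the code in `provider.ts`: the `AgentConfig` and `Route` shapes, the `chooseRoute` function, and the `LLAMA_SWAP_URL` variable are hypothetical. Only `LLAMA_SIDECAR_URL` and `llama_extra_args` come from the proposal.

```typescript
interface AgentConfig {
  llamaExtraArgs?: string[]; // mirrors `llama_extra_args` in AGENTS.md frontmatter
}

interface RoutingEnv {
  LLAMA_SIDECAR_URL?: string;
  LLAMA_SWAP_URL: string; // hypothetical name for the llama-swap endpoint
}

interface Route {
  backend: "sidecar" | "llama-swap";
  baseUrl: string;
  extraArgs: string[];
}

/** New behaviour: the sidecar is the default whenever its URL is configured. */
function chooseRoute(agent: AgentConfig, env: RoutingEnv): Route {
  const sidecarUrl = env.LLAMA_SIDECAR_URL?.trim();
  if (sidecarUrl) {
    // Agent overrides, if any, are applied on top of the sidecar's base args.
    return { backend: "sidecar", baseUrl: sidecarUrl, extraArgs: agent.llamaExtraArgs ?? [] };
  }
  // Backward-compatible fallback when no sidecar is configured.
  return { backend: "llama-swap", baseUrl: env.LLAMA_SWAP_URL, extraArgs: [] };
}

const exampleRoute = chooseRoute({}, {
  LLAMA_SIDECAR_URL: "http://llama-sidecar:8080",
  LLAMA_SWAP_URL: "http://llama-swap:8080",
});
// exampleRoute.backend is "sidecar" even though the agent set no extra args
```

Under the previous behaviour, the first branch would also have required `agent.llamaExtraArgs` to be set. Dropping that condition is the whole routing change.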
## Risk
- `ngram-mod` speculative decoding adds ~16MB of memory, which is trivial next to the 35B model.
- KV cache quantization to q4_0 is lossy relative to the default f16 cache; the difference is expected to be undetectable on code tasks.
- Both features are well tested in the llama.cpp ecosystem, with no known regressions.
- If issues arise, remove the flags from base args and restart; no code change is needed.
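
To put the memory side of these trade-offs in numbers, the KV cache size can be estimated with simple arithmetic. The sketch below assumes a plain transformer where every layer keeps a full K and V tensor. The `ModelShape` values are hypothetical and are not the dimensions of the 35B model mentioned above. The bits-per-value figures follow from ggml's block layouts (`q4_0` packs 32 values into 18 bytes, `q8_0` into 34 bytes).

```typescript
// Storage cost per cached value, in bits.
const BITS_PER_VALUE = { f32: 32, f16: 16, q8_0: 8.5, q4_0: 4.5 } as const;
type CacheType = keyof typeof BITS_PER_VALUE;

interface ModelShape {
  nLayers: number;
  nKvHeads: number;
  headDim: number;
}

/** Total bytes for the K and V caches at a given context length. */
function kvCacheBytes(shape: ModelShape, nCtx: number, typeK: CacheType, typeV: CacheType): number {
  const valuesPerTensor = shape.nLayers * shape.nKvHeads * shape.headDim * nCtx;
  return (valuesPerTensor * (BITS_PER_VALUE[typeK] + BITS_PER_VALUE[typeV])) / 8;
}

// Hypothetical shape, chosen only to make the arithmetic concrete.
const exampleShape: ModelShape = { nLayers: 48, nKvHeads: 8, headDim: 128 };
const GIB = 1024 ** 3;

const baselineGiB = kvCacheBytes(exampleShape, 32768, "f16", "f16") / GIB;   // 6
const kOnlyGiB = kvCacheBytes(exampleShape, 32768, "q4_0", "f16") / GIB;     // 3.84375
const bothGiB = kvCacheBytes(exampleShape, 32768, "q4_0", "q4_0") / GIB;     // 1.6875
```

Under these assumptions, quantizing only K with `--cache-type-k q4_0` cuts the cache by about 1.6×. Reaching the roughly 3.5× figure for the whole cache would also require quantizing V, which is a separate llama.cpp flag (`--cache-type-v`).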