chore(openspec): drop 9 superseded proposals + 11 stub archive files
Drop 9 batch proposals that are superseded by the boocode-lift-analysis (boocontext-audit, conductor upgrades, self-healing/verify-gate skills): add-3tier-memory, import-llm-evaluator, import-pregel-engine, plugin-platform, conductor-evolution, code-intelligence-upgrade, dev-workflow, ui-overhaul, agent-reliability. Delete 11 stub archive files (49-66B each, 'Status: Shipped. Archived.' only) that provide zero documentation value over the existing CHANGELOG.md + git tags.
62
openspec/changes/llama-cache-and-spec/proposal.md
Normal file
@@ -0,0 +1,62 @@
# llama-cache-and-spec: KV cache quantization + ngram speculative decoding

## Why

BooCode's llama-sidecar runs llama-server with bare-minimum base args:
`-ngl 999 -c 32768 --flash-attn on --no-mmap`. Two high-impact llama.cpp
features are available but not enabled:

1. **KV cache quantization** (`--cache-type-k q4_0`): stores the cached
   keys in 4-bit blocks instead of llama.cpp's default f16. Up to ~4× less
   VRAM for the quantized part of the cache; the KV cache dominates memory
   usage at 32K context. No noticeable quality impact is expected for most
   models.
2. **Ngram speculative decoding** (`--spec-type ngram-mod`): uses a
   lightweight rolling-hash ngram model (~16MB) to predict tokens ahead of
   the main model, which then verifies them in a batch. 2-3× tok/s
   speedup on repetitive/code tasks with no accuracy loss and no separate
   draft model to load.

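The draft-then-verify idea behind ngram speculation can be sketched roughly as follows. This is a minimal illustration and not llama.cpp's `ngram-mod` implementation: it uses a plain map keyed on the last two token IDs rather than a rolling hash, and `buildTable`, `proposeDraft`, `acceptPrefix`, and the token values are all hypothetical.

```go
package main

import "fmt"

// ngramKey is the lookup context: the last two token IDs.
type ngramKey [2]int

// buildTable records, for every pair of adjacent tokens in the history,
// the token that most recently followed that pair.
func buildTable(history []int) map[ngramKey]int {
	table := make(map[ngramKey]int)
	for i := 0; i+2 < len(history); i++ {
		table[ngramKey{history[i], history[i+1]}] = history[i+2]
	}
	return table
}

// proposeDraft guesses up to maxDraft future tokens by repeatedly looking up
// the last two tokens in the table. It stops at the first miss.
func proposeDraft(history []int, table map[ngramKey]int, maxDraft int) []int {
	if len(history) < 2 {
		return nil
	}
	a, b := history[len(history)-2], history[len(history)-1]
	var draft []int
	for len(draft) < maxDraft {
		next, ok := table[ngramKey{a, b}]
		if !ok {
			break
		}
		draft = append(draft, next)
		a, b = b, next
	}
	return draft
}

// acceptPrefix returns how many drafted tokens agree with the tokens the
// main model produced when verifying the draft; only that prefix is kept.
func acceptPrefix(draft, verified []int) int {
	n := 0
	for n < len(draft) && n < len(verified) && draft[n] == verified[n] {
		n++
	}
	return n
}

func main() {
	history := []int{1, 2, 3, 4, 1, 2}
	draft := proposeDraft(history, buildTable(history), 4)
	fmt.Println("draft:", draft)
	fmt.Println("accepted:", acceptPrefix(draft, []int{3, 4, 9, 9}))
}
```

Because the main model checks every drafted token, a wrong guess costs some wasted work but does not change the output, which is why the speedup depends on how repetitive the text is.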
Both are disabled because they are on the **shadow lists** of both
validators (`llama-args-validator.ts` + sidecar `validator.go`), which
auto-strip them from agent `llama_extra_args`. There are two ways to
enable them: (A) remove them from the shadow lists, or (B) add them
directly to the sidecar's `BASE_ARGS`, which skips validation entirely.
This proposal takes option (B).

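As a rough picture of what "auto-strip" means here, the sketch below filters shadowed flags out of an agent's extra args. It is not the code in either validator: the `shadowed` map contains only the two flags named above, `stripShadowed` is a hypothetical helper, and it assumes every shadowed flag takes exactly one value.

```go
package main

import (
	"fmt"
	"strings"
)

// shadowed lists flags that agents may not set themselves.
// Only the two flags discussed above are included; the real lists
// live in the validators and may contain more entries.
var shadowed = map[string]bool{
	"--cache-type-k": true,
	"--spec-type":    true,
}

// stripShadowed removes shadowed flags, together with their value, from an
// agent's extra args. It accepts both "--flag value" and "--flag=value".
func stripShadowed(args []string) (kept, dropped []string) {
	for i := 0; i < len(args); i++ {
		name, _, hasEq := strings.Cut(args[i], "=")
		if !shadowed[name] {
			kept = append(kept, args[i])
			continue
		}
		dropped = append(dropped, name)
		// Skip the separate value token, if there is one.
		if !hasEq && i+1 < len(args) && !strings.HasPrefix(args[i+1], "--") {
			i++
		}
	}
	return kept, dropped
}

func main() {
	extra := []string{"--temp", "0.2", "--cache-type-k", "q8_0", "--spec-type=ngram-mod"}
	kept, dropped := stripShadowed(extra)
	fmt.Println("kept:", kept)
	fmt.Println("dropped:", dropped)
}
```

Base args never pass through a filter like this, which is why option (B) works without touching the shadow lists.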
## What Changes

- Sidecar base args gain the full set:
  - `--cache-type-k q4_0`: KV cache quantization (~4× VRAM savings)
  - `--cache-reuse 256`: KV cache reuse across turns (prompt caching)
  - `--slot-save-path /tmp/llama-slots`: disk-persistent KV cache
  - `--cache-idle-slots`: auto-save idle slot caches to disk
  - `--spec-type ngram-mod --spec-ngram-mod-thsh 2`: spec decoding
  - `--ctx-checkpoints 32`: context overflow protection
  - `--sleep-idle-seconds 600`: GPU memory reclaim when idle
  - `--metrics`: Prometheus metrics endpoint
- Both validators keep their existing shadow lists (correct as-is)
- `/tmp/llama-slots` is created for slot KV cache persistence

## Dependencies

- llama-sidecar repo (separate git tree, `/opt/forks/llama-sidecar/`)
- BooCode server (`llama-args-validator.ts`, `provider.ts`)

## Routing Change

Previously, the llama-sidecar was only used when an agent had `llama_extra_args`
set in its AGENTS.md frontmatter. The default path was llama-swap (no cache
quant, no spec decoding, no slot save).

Now, when `LLAMA_SIDECAR_URL` is configured (it is in docker-compose.yml),
ALL inference requests route through the sidecar by default, regardless of
whether the agent has `llama_extra_args`. This means:

- Every request gets KV cache quantization, spec decoding, and prompt caching
- Agents with explicit `llama_extra_args` still get their overrides on top
- If `LLAMA_SIDECAR_URL` is unset, requests fall back to llama-swap (backward compatible)

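The routing rule above reduces to a small decision, sketched here under stated assumptions: `pickBackend` and `mergeArgs` are hypothetical helpers, the llama-swap URL is a placeholder, and the real logic lives in the BooCode server rather than in Go.

```go
package main

import (
	"fmt"
	"os"
)

// pickBackend returns the sidecar URL whenever one is configured and falls
// back to llama-swap otherwise. Whether the agent sets llama_extra_args no
// longer influences the choice.
func pickBackend(sidecarURL, swapURL string) string {
	if sidecarURL != "" {
		return sidecarURL
	}
	return swapURL
}

// mergeArgs places an agent's validated extra args after the base args.
// It copies into a new slice so the base args are never modified.
func mergeArgs(base, extra []string) []string {
	merged := make([]string, 0, len(base)+len(extra))
	merged = append(merged, base...)
	return append(merged, extra...)
}

func main() {
	// Placeholder fallback address, not taken from the deployment.
	const swapURL = "http://llama-swap:8080"
	backend := pickBackend(os.Getenv("LLAMA_SIDECAR_URL"), swapURL)
	fmt.Println("backend:", backend)
	fmt.Println("args:", mergeArgs([]string{"-c", "32768"}, []string{"--temp", "0.2"}))
}
```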
## Risk

- `ngram-mod` spec decoding adds ~16MB of memory, which is negligible next to the 35B model.
- KV cache quant to q4_0 is lossy vs f16; the difference is expected to be undetectable on code tasks.
- Both are well-tested in the llama.cpp ecosystem, with no known regressions.
- If issues arise, remove the flags from base args and restart; no code change is needed.

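To put rough numbers on the memory side of this trade-off, the sketch below computes the K-cache size at 32K context for f16 and `q4_0`. The model shape (`layers`, `kvHeads`, `headDim`) is hypothetical and not taken from the 35B model mentioned above. The per-element sizes follow from f16 being 2 bytes per value and a `q4_0` block storing 32 values in 18 bytes.

```go
package main

import "fmt"

// Hypothetical model shape, chosen only to make the arithmetic concrete.
const (
	layers  = 48
	kvHeads = 8
	headDim = 128
	ctxLen  = 32768
)

// Storage cost per cached value.
const (
	f16BytesPerElem = 2.0
	// q4_0 packs 32 values into 16 bytes of 4-bit quants plus a 2-byte scale.
	q40BytesPerElem = 18.0 / 32.0
)

// cacheBytes returns the size of one cache tensor set (K or V) in bytes.
func cacheBytes(bytesPerElem float64) float64 {
	return float64(layers*kvHeads*headDim*ctxLen) * bytesPerElem
}

func main() {
	const gib = 1024 * 1024 * 1024
	kF16 := cacheBytes(f16BytesPerElem)
	kQ40 := cacheBytes(q40BytesPerElem)
	fmt.Printf("K cache f16:  %.2f GiB\n", kF16/gib)
	fmt.Printf("K cache q4_0: %.2f GiB\n", kQ40/gib)
	fmt.Printf("reduction:    %.2fx\n", kF16/kQ40)
}
```

Under these assumptions the K cache shrinks by a factor of about 3.6. `--cache-type-k` affects only the keys; the values keep their own type unless `--cache-type-v` is set as well.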
44
openspec/changes/llama-cache-and-spec/tasks.md
Normal file
@@ -0,0 +1,44 @@
# llama-cache-and-spec: tasks

## Files to change

Three files across two repos:

- `/opt/forks/llama-sidecar/internal/config/config.go`
- `/opt/boocode/apps/server/src/services/inference/llama-args-validator.ts`
- `/opt/forks/llama-sidecar/internal/validator/validator.go`

## Tasks

- [x] 1. Update sidecar default base args

  `/opt/forks/llama-sidecar/internal/config/config.go` edited.
  `defaultBaseArgs()` now includes:
  - `--cache-type-k q4_0`: KV cache quant → ~4× VRAM savings
  - `--cache-reuse 256`: KV cache reuse across turns → prompt caching
  - `--slot-save-path /tmp/llama-slots`: disk-persistent KV cache
  - `--cache-idle-slots`: auto-save idle slots to disk
  - `--spec-type ngram-mod --spec-ngram-mod-thsh 2`: spec decoding → 2-3× tok/s
  - `--ctx-checkpoints 32`: context overflow protection
  - `--sleep-idle-seconds 600`: GPU memory reclaim when idle
  - `--metrics`: Prometheus `/metrics` endpoint

  Build verified: `go build ./...` exits 0.

- [x] 2. No change needed: shadow lists are correct

  The shadow lists in `llama-args-validator.ts` already prevent agents
  from overriding cache/spec/template flags. Adding the flags to
  `defaultBaseArgs` + keeping the shadow lists is the correct architecture:
  flags are enabled by default, agents can't override them.

- [x] 3. No change needed: same reasoning as task 2

  The sidecar `validator.go` shadow lists serve the same purpose.
  Both code paths are consistent.

- [ ] 4. Deploy + verify

  - Rebuild sidecar binary: `go build -o ... ./...` → ✅ done
  - Restart docker compose: needs manual deploy
  - Verify `/metrics` endpoint returns data
  - Verify `nvidia-smi` shows reduced VRAM (expected: ~4× savings on KV cache)
