# llama-cache-and-spec — tasks

## Files to change

Three files across two repos were reviewed; only `config.go` needed an edit (see tasks 2 and 3):

- `/opt/forks/llama-sidecar/internal/config/config.go`
- `/opt/boocode/apps/server/src/services/inference/llama-args-validator.ts`
- `/opt/forks/llama-sidecar/internal/validator/validator.go`

## Tasks

- [x] 1. Update sidecar default base args

  Edited `/opt/forks/llama-sidecar/internal/config/config.go`.
  `defaultBaseArgs()` now includes:

  - `--cache-type-k q4_0`: quantizes the K half of the KV cache (up to ~4× smaller K cache; the V cache is unchanged unless `--cache-type-v` is also set)
  - `--cache-reuse 256`: reuses the KV cache across turns (prompt caching)
  - `--slot-save-path /tmp/llama-slots`: disk-persistent KV cache
  - `--cache-idle-slots`: auto-saves idle slots to disk
  - `--spec-type ngram-mod --spec-ngram-mod-thsh 2`: speculative decoding (target: ~2× tok/s)
  - `--ctx-checkpoints 32`: context overflow protection
  - `--sleep-idle-seconds 600`: reclaims GPU memory when idle
  - `--metrics`: exposes the Prometheus `/metrics` endpoint

  Build verified: `go build ./...` exits 0.
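
  For reference, the flag set above could be expressed roughly as follows. This is a minimal sketch that assumes `defaultBaseArgs()` returns a flat `[]string` in flag/value order; the real function in `config.go` may be structured differently.

  ```go
  package main

  import "fmt"

  // defaultBaseArgs returns the flags applied to every launch.
  // Assumption: a flat slice of flag/value strings; the actual signature
  // and construction in config.go may differ.
  func defaultBaseArgs() []string {
  	return []string{
  		"--cache-type-k", "q4_0", // K cache quantization
  		"--cache-reuse", "256", // KV cache reuse across turns
  		"--slot-save-path", "/tmp/llama-slots", // disk-persistent KV cache
  		"--cache-idle-slots", // auto-save idle slots to disk
  		"--spec-type", "ngram-mod", // speculative decoding
  		"--spec-ngram-mod-thsh", "2",
  		"--ctx-checkpoints", "32", // context overflow protection
  		"--sleep-idle-seconds", "600", // GPU memory reclaim when idle
  		"--metrics", // Prometheus /metrics endpoint
  	}
  }

  func main() {
  	fmt.Println(defaultBaseArgs())
  }
  ```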

- [x] 2. `llama-args-validator.ts`: no change needed, shadow lists are correct

  The shadow lists in `llama-args-validator.ts` already prevent agents
  from overriding cache/spec/template flags. Adding the flags to
  `defaultBaseArgs` + keeping the shadow lists is the correct architecture:
  flags are enabled by default, agents can't override them.

- [x] 3. `validator.go`: no change needed, same reasoning as task 2

  The sidecar `validator.go` shadow lists serve the same purpose.
  Both code paths are consistent.
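
  To illustrate the shadow-list idea from tasks 2 and 3, here is a minimal sketch in Go. The names `shadowedFlags` and `validateAgentArgs` are hypothetical, the list contents are only examples, and the sketch assumes the validator rejects a shadowed flag outright; the real validators may strip or handle such flags differently.

  ```go
  package main

  import (
  	"fmt"
  	"strings"
  )

  // shadowedFlags is a hypothetical shadow list: flags owned by the sidecar
  // defaults that agent-supplied args must not set. Contents are examples only.
  var shadowedFlags = map[string]bool{
  	"--cache-type-k": true,
  	"--cache-reuse":  true,
  	"--spec-type":    true,
  }

  // validateAgentArgs returns an error for the first shadowed flag it finds.
  // Both "--flag value" and "--flag=value" spellings are checked.
  func validateAgentArgs(args []string) error {
  	for _, arg := range args {
  		name, _, _ := strings.Cut(arg, "=")
  		if shadowedFlags[name] {
  			return fmt.Errorf("flag %s is managed by the sidecar and cannot be overridden", name)
  		}
  	}
  	return nil
  }

  func main() {
  	fmt.Println(validateAgentArgs([]string{"--temp", "0.7"}))
  	fmt.Println(validateAgentArgs([]string{"--cache-type-k=f16"}))
  }
  ```

  With the defaults set in `defaultBaseArgs` and a check like this on agent input, the flags are always on and the agent-facing surface stays unchanged.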

- [ ] 4. Deploy + verify

  - [x] Rebuild sidecar binary: `go build -o ... ./...`
  - [ ] Restart docker compose (needs manual deploy)
  - [ ] Verify the `/metrics` endpoint returns data
  - [ ] Verify `nvidia-smi` shows reduced VRAM (expected: up to ~4× savings on the K cache; the V cache is unaffected by `--cache-type-k`)
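
  As a sanity check on what to expect from `nvidia-smi`, the arithmetic below estimates the KV cache size before and after K quantization. It relies on the `q4_0` block layout (32 values in an 18-byte block: 16 bytes of 4-bit values plus a 2-byte scale) and an f16 baseline of 2 bytes per value. The model shape is a made-up example, not the deployed model, and the estimate ignores weights and compute buffers.

  ```go
  package main

  import "fmt"

  const (
  	f16BytesPerElem = 2.0         // 16-bit float
  	q40BytesPerElem = 18.0 / 32.0 // q4_0: 18-byte block per 32 values
  )

  // kvBytes returns the combined K+V cache size for n elements in each half.
  func kvBytes(n, kBytesPerElem, vBytesPerElem float64) float64 {
  	return n*kBytesPerElem + n*vBytesPerElem
  }

  func main() {
  	// Hypothetical model shape, for illustration only.
  	const layers, kvHeads, headDim, ctxTokens = 32, 8, 128, 32768
  	n := float64(layers * kvHeads * headDim * ctxTokens)

  	before := kvBytes(n, f16BytesPerElem, f16BytesPerElem)
  	after := kvBytes(n, q40BytesPerElem, f16BytesPerElem) // only K is quantized

  	fmt.Printf("K cache alone: %.2fx smaller\n", f16BytesPerElem/q40BytesPerElem)
  	fmt.Printf("K+V total: %.2f GiB -> %.2f GiB (%.2fx smaller)\n",
  		before/(1<<30), after/(1<<30), before/after)
  }
  ```

  The K-only ratio comes out a little under 4×, and the combined K+V reduction is smaller because the V half stays at f16, so the drop visible in `nvidia-smi` will be less than 4× of the total KV footprint.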