KV Cache Quantization

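Storage format for the attention KV cache. Lower-precision formats use less VRAM; q4_0 shrinks the cache roughly 3.5x compared with f16 (4 bits per value plus a per-block scale).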
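To make the savings concrete, here is a minimal sketch of the arithmetic. It assumes the ggml-style block layout (32 values per block plus a 2-byte scale) and a hypothetical model shape; the names kvCacheBytes, ModelShape, and BYTES_PER_ELEMENT are illustrative and not part of this page's code.

```typescript
// Bytes per cached element. f16 is 2 bytes; the quantized formats are
// assumed to pack 32 values per block together with a 2-byte scale.
const BYTES_PER_ELEMENT = {
  f16: 2,
  q8_0: 34 / 32, // 32 one-byte values + 2-byte scale
  q4_0: 18 / 32, // 32 four-bit values (16 bytes) + 2-byte scale
} as const;

type KvCacheType = keyof typeof BYTES_PER_ELEMENT;

// Hypothetical description of the attention layout of a model.
interface ModelShape {
  nLayers: number;
  nKvHeads: number;
  headDim: number;
}

// Total KV cache size: keys and values (factor 2) for every layer,
// KV head, head dimension, and context position.
function kvCacheBytes(shape: ModelShape, nCtx: number, type: KvCacheType): number {
  const elements = 2 * shape.nLayers * shape.nKvHeads * shape.headDim * nCtx;
  return elements * BYTES_PER_ELEMENT[type];
}

// Example shape (an assumption, not a specific model).
const exampleShape: ModelShape = { nLayers: 32, nKvHeads: 8, headDim: 128 };
const f16Bytes = kvCacheBytes(exampleShape, 8192, "f16");
const q4Bytes = kvCacheBytes(exampleShape, 8192, "q4_0");
const savings = f16Bytes / q4Bytes; // about 3.56
```

Under these assumptions an 8192-token cache drops from 1 GiB in f16 to 288 MiB in q4_0, which is where the rounded savings figure quoted above comes from.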

Prompt Caching

cache_reuse (number input): On (min chunk size in tokens) when greater than 0, otherwise Disabled.

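Reuses the KV cache across turns when the new prompt shares a prefix with the cached one. The value is the minimum chunk size, in tokens, that qualifies for reuse: 256 is a good default and 0 disables the feature. This is the local equivalent of prompt caching on hosted APIs.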
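The prefix check can be pictured with a short sketch. This is a simplified illustration and not the server's actual algorithm: it assumes prompts are already tokenized into number arrays and that reuse is all-or-nothing on the shared prefix. The names commonPrefixLength and planReuse are hypothetical.

```typescript
// Length of the token prefix shared by the cached prompt and the new one.
function commonPrefixLength(a: readonly number[], b: readonly number[]): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

// Decide how many tokens can be served from the cache.
// cacheReuse mirrors the setting: 0 disables reuse, otherwise it is the
// minimum number of matching tokens worth reusing.
function planReuse(
  cached: readonly number[],
  prompt: readonly number[],
  cacheReuse: number,
): { reused: number; toProcess: number } {
  if (cacheReuse <= 0) {
    return { reused: 0, toProcess: prompt.length };
  }
  const shared = commonPrefixLength(cached, prompt);
  const reused = shared >= cacheReuse ? shared : 0;
  return { reused, toProcess: prompt.length - reused };
}
```

With a long system prompt and chat history, most of each new turn matches the previous one, so only the newest tokens need to be processed.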

Speculative Decoding

spec_ngram_mod_thsh (number input, shown only when spec_type is 'ngram-mod'): Match threshold (2 = default)

Drafts several tokens ahead with a small model; the main model then verifies the draft in a single batch. Roughly 2-3x speedup on repetitive or code-heavy tasks.
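The draft-then-verify loop can be sketched as follows. This is a minimal greedy version under stated assumptions: both models are represented as plain functions from a token context to the next token, and the batched verification pass is simulated with sequential calls. NextToken and speculativeStep are hypothetical names, not an API of any particular runtime.

```typescript
// A model reduced to "given the context, return the next token" (greedy).
type NextToken = (context: number[]) => number;

// One speculative step: the draft model proposes k tokens, the target model
// checks them in order, and the longest agreeing prefix is accepted.
// The result always ends with one token chosen by the target model.
function speculativeStep(
  context: number[],
  draft: NextToken,
  target: NextToken,
  k: number,
): number[] {
  // 1. Draft k tokens autoregressively with the cheap model.
  const proposed: number[] = [];
  const draftCtx = [...context];
  for (let i = 0; i < k; i++) {
    const token = draft(draftCtx);
    proposed.push(token);
    draftCtx.push(token);
  }

  // 2. Verify. A real runtime scores all k positions in one batched forward
  //    pass; calling the target once per position gives the same result.
  const accepted: number[] = [];
  const verifyCtx = [...context];
  for (const token of proposed) {
    const expected = target(verifyCtx);
    if (expected !== token) {
      accepted.push(expected); // first mismatch: keep the target's token and stop
      return accepted;
    }
    accepted.push(token);
    verifyCtx.push(token);
  }

  // 3. Every draft token matched, so the target contributes one extra token.
  accepted.push(target(verifyCtx));
  return accepted;
}
```

The output is identical to what the main model would produce on its own; the speedup comes from accepting several tokens per main-model pass whenever the draft is right, which happens more often on repetitive text and code.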

Context Checkpoints

ctx_checkpoints (number input): Max N checkpoints per slot when greater than 0, otherwise Disabled.

Prevents context overflow on long conversations. Default: 32.
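The per-slot cap can be illustrated with a bounded list. The text does not say what a checkpoint contains or how old ones are discarded, so this sketch assumes oldest-first eviction and treats the checkpoint payload as an opaque type; SlotCheckpoints is a hypothetical name.

```typescript
// Holds at most `max` checkpoints for one slot.
// Assumption: when the cap is exceeded, the oldest checkpoint is dropped.
class SlotCheckpoints<T> {
  private readonly max: number;
  private items: T[] = [];

  constructor(max: number) {
    this.max = max;
  }

  // Store a checkpoint; a cap of 0 or less means the feature is disabled.
  push(checkpoint: T): void {
    if (this.max <= 0) return;
    this.items.push(checkpoint);
    if (this.items.length > this.max) {
      this.items.shift(); // evict the oldest entry
    }
  }

  // Most recent checkpoint, or undefined if none is stored.
  latest(): T | undefined {
    return this.items[this.items.length - 1];
  }

  get size(): number {
    return this.items.length;
  }
}
```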

Auto-sleep Timeout

sleep_idle_seconds (number input): timeout in seconds

The GPU auto-sleeps after N seconds of idle time. -1 disables auto-sleep; 600 is 10 minutes.
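The timeout rule is simple enough to state as a function. This is a sketch of the rule as described here, with idle time measured in milliseconds by the caller; shouldSleep is a hypothetical helper name.

```typescript
// Returns true once the idle time reaches the configured timeout.
// A negative timeout (the -1 setting) means auto-sleep is disabled.
function shouldSleep(idleMs: number, sleepIdleSeconds: number): boolean {
  if (sleepIdleSeconds < 0) return false;
  return idleMs >= sleepIdleSeconds * 1000;
}

// Ten minutes of inactivity with the 600-second setting.
const sleepsAtTenMinutes = shouldSleep(10 * 60 * 1000, 600); // true
```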

Prometheus Metrics

metrics_enabled (toggle)

Enables the /metrics endpoint for Prometheus monitoring (token rates, latency).
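For a quick check without a full Prometheus setup, the endpoint's text output can be read directly. The sketch below parses the Prometheus text exposition format for unlabeled samples only (lines of the form "name value"); parseMetrics is a hypothetical helper and the metric names in the sample are made up for illustration.

```typescript
// Parse unlabeled samples from Prometheus text exposition output.
// Comment lines (# HELP, # TYPE) and blank lines are skipped.
function parseMetrics(text: string): Map<string, number> {
  const samples = new Map<string, number>();
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const parts = line.split(/\s+/);
    const name = parts[0];
    const value = Number(parts[1]);
    if (name !== undefined && parts.length >= 2 && !Number.isNaN(value)) {
      samples.set(name, value);
    }
  }
  return samples;
}

// Made-up sample output; real metric names depend on the server.
const sampleOutput = [
  "# HELP example_tokens_per_second Generation throughput.",
  "# TYPE example_tokens_per_second gauge",
  "example_tokens_per_second 42.5",
  "example_requests_total 7",
].join("\n");

const parsedSample = parseMetrics(sampleOutput);
```

Samples that carry labels in braces need a fuller parser; this version is only meant for eyeballing simple gauges and counters.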

Slot KV Cache Path

slot_save_path (text input): directory path

Directory for disk-persistent KV cache. Idle slot caches are saved here.
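Taken together, the controls on this page write to a single config object through update(key, value). The sketch below shows one way such an object and a type-safe update could look. The field names come from the page; the field types, the example values, the 'none' value for spec_type, the example path, and the explicit config argument are assumptions for illustration.

```typescript
// Settings referenced on this page. Types are inferred from how each
// control is used (number inputs, a toggle, a text field).
interface ServerConfig {
  cache_reuse: number;
  spec_type: string;
  spec_ngram_mod_thsh: number;
  ctx_checkpoints: number;
  sleep_idle_seconds: number;
  metrics_enabled: boolean;
  slot_save_path: string;
}

// Return a copy of the config with one field replaced. The generic
// parameter ties the value type to the chosen key, so passing a string
// for cache_reuse is a compile-time error.
function updateConfig<K extends keyof ServerConfig>(
  config: ServerConfig,
  key: K,
  value: ServerConfig[K],
): ServerConfig {
  return { ...config, [key]: value };
}

// Example values only; 'none' and the path are hypothetical.
const exampleConfig: ServerConfig = {
  cache_reuse: 256,
  spec_type: "none",
  spec_ngram_mod_thsh: 2,
  ctx_checkpoints: 32,
  sleep_idle_seconds: 600,
  metrics_enabled: false,
  slot_save_path: "/tmp/slot-cache",
};

const withMetrics = updateConfig(exampleConfig, "metrics_enabled", true);
```

Returning a new object instead of mutating the old one fits the usual React state pattern, where a changed reference is what triggers a re-render.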