v0.1.0
Go daemon (cmd/llama-sidecar): per-agent llama-server process pool with LRU eviction, OpenAI-compatible proxy, flag validation (Unsloth port), deterministic hash-keyed sidecar reuse. Windows service support via schtasks/NSSM with DETACHED_PROCESS, stdout pipe drain, and request-ctx decoupled child lifetime. Bug fixes (3b.1–3b5): -c flag drop from StripShadowingFlags, UTF-8 BOM in JSON config, -fa → --flash-attn on default, child process exit after one request (stdin devnull, stdout pipe, CREATE_NO_WINDOW → DETACHED, context.Background for child lifetime, background reaper goroutine). bench/: MTP on/off throughput sweep across 8 GGUFs via SSH+schtasks automation to sam-desktop. Per-GGUF production flags from llama-swap config with --ctx-size 32768 override. eval/: accuracy benchmarks (MMLU 100q, GSM8K 50q, HumanEval 164) + A/B model comparison (14 agent-typed prompts × 8 models). All scripts resumable at individual question level. 94 Go tests, race detector clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
llama-sidecar
Per-agent llama-server process pool daemon. Runs on sam-desktop alongside llama-swap. Spawns or reuses llama-server processes keyed on (modelID, flags) hash.
License
AGPL-3.0-only.
The validator package (internal/validator/) is ported from Unsloth Studio (AGPL-3.0). BooCode's TypeScript port (apps/server/src/services/inference/llama-args-validator.ts) is the sibling — update both when upstream changes.
Build
# Linux (development)
make build
# Windows AMD64 (production target — cross-compile from Linux)
make build-windows
# Copy to sam-desktop
# scp bin/llama-sidecar.exe sam-desktop:C:\llama-sidecar\
Configuration
All via environment variables (no CLI flags):
| Variable | Required | Default | Description |
|---|---|---|---|
LLAMA_SERVER_BIN |
yes | — | Path to llama-server.exe |
MODEL_DIR_MAP_FILE |
yes | — | JSON file mapping model IDs to GGUF paths |
LLAMA_SIDECAR_BIND |
no | 127.0.0.1:8402 |
Listen address |
PORT_RANGE |
no | 8500-8599 |
Port range for sidecar processes |
MAX_SIDECARS |
no | 2 |
Max concurrent sidecar processes |
LOG_LEVEL |
no | info |
Log level (debug, info, warn, error) |
BASE_ARGS |
no | ["-ngl","999","-c","32768","--flash-attn","on","--no-mmap"] |
JSON array of base llama-server args |
HEALTH_TIMEOUT_SECONDS |
no | 60 |
Max wait for sidecar health check |
HEALTH_INTERVAL_SECONDS |
no | 30 |
Background health check interval |
API
GET /health
Returns daemon status.
GET /sidecars
Returns list of active sidecar processes.
DELETE /sidecars/{hash}
Kill and remove a sidecar process.
POST /v1/chat/completions
OpenAI-compatible proxy. Routes to a sidecar process based on model + flags.
Headers:
X-Agent-Flags: --top-k 20 --cache-type-k q8_0(optional)X-Model-Id: qwen3.6-35b-a3b-mxfp4(optional, overrides body.model)
Test
make test # unit tests
make test-integration # requires real llama-server + GGUF
make lint # vet + gofmt
NSSM Service
Pre-configured on sam-desktop as llama-sidecar. Start/stop via:
C:\Tools\nssm\nssm.exe start llama-sidecar
C:\Tools\nssm\nssm.exe stop llama-sidecar
C:\Tools\nssm\nssm.exe status llama-sidecar
Description
Languages
Python
41.3%
Go
39.8%
Shell
18.6%
Makefile
0.3%