
indifferentketchup fe7f36ae98 llama-sidecar v0.1.0: daemon + benchmarks + eval suite

Go daemon (cmd/llama-sidecar): per-agent llama-server process pool with
LRU eviction, OpenAI-compatible proxy, flag validation (Unsloth port),
deterministic hash-keyed sidecar reuse. Windows service support via
schtasks/NSSM with DETACHED_PROCESS, stdout pipe drain, and request-ctx
decoupled child lifetime.

Bug fixes (3b.1–3b.5): -c flag drop from StripShadowingFlags, UTF-8 BOM
in JSON config, -fa → --flash-attn on default, child process exit after
one request (stdin devnull, stdout pipe, CREATE_NO_WINDOW → DETACHED,
context.Background for child lifetime, background reaper goroutine).

bench/: MTP on/off throughput sweep across 8 GGUFs via SSH+schtasks
automation to sam-desktop. Per-GGUF production flags from llama-swap
config with --ctx-size 32768 override.

eval/: accuracy benchmarks (MMLU 100q, GSM8K 50q, HumanEval 164) +
A/B model comparison (14 agent-typed prompts × 8 models). All scripts
resumable at individual question level.

94 Go tests, race detector clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-28 01:55:13 +00:00

2.4 KiB


llama-sidecar

Per-agent llama-server process pool daemon. Runs on sam-desktop alongside llama-swap. Spawns llama-server processes, or reuses existing ones, keyed on a hash of (modelID, flags).
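As an illustration of hash-keyed reuse, here is a minimal sketch of deriving a deterministic key from a model ID and a flag list. The function name `sidecarKey`, the choice of SHA-256, the 16-character truncation, and the decision to treat flag order as significant are all assumptions made for this example; the daemon's actual hashing and flag normalization may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// sidecarKey returns a deterministic hex key for a (modelID, flags) pair.
// Hypothetical helper: the real daemon's hash scheme is not documented here.
// Flag order is treated as significant, and inputs are assumed to contain
// no NUL bytes, since NUL is used as the field separator.
func sidecarKey(modelID string, flags []string) string {
	h := sha256.New()
	h.Write([]byte(modelID))
	h.Write([]byte{0}) // separate the model ID from the flags
	h.Write([]byte(strings.Join(flags, "\x00")))
	return hex.EncodeToString(h.Sum(nil))[:16]
}

func main() {
	a := sidecarKey("qwen3.6-35b-a3b-mxfp4", []string{"--top-k", "20"})
	b := sidecarKey("qwen3.6-35b-a3b-mxfp4", []string{"--top-k", "20"})
	c := sidecarKey("qwen3.6-35b-a3b-mxfp4", []string{"--top-k", "40"})

	fmt.Println("same inputs, same key:", a == b)
	fmt.Println("different flags, different key:", a != c)
}
```

Under this scheme, two requests with identical model and flags map to the same key and could share one process, while any change in flags yields a separate key.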

License

AGPL-3.0-only.

The validator package (internal/validator/) is ported from Unsloth Studio (AGPL-3.0). BooCode's TypeScript port (apps/server/src/services/inference/llama-args-validator.ts) is its sibling; update both when upstream changes.

Build

# Linux (development)
make build

# Windows AMD64 (production target — cross-compile from Linux)
make build-windows

# Copy to sam-desktop
# scp bin/llama-sidecar.exe sam-desktop:C:\llama-sidecar\

Configuration

All configuration is via environment variables (the daemon takes no CLI flags):

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `LLAMA_SERVER_BIN` | yes | (none) | Path to llama-server.exe |
| `MODEL_DIR_MAP_FILE` | yes | (none) | JSON file mapping model IDs to GGUF paths |
| `LLAMA_SIDECAR_BIND` | no | `127.0.0.1:8402` | Listen address |
| `PORT_RANGE` | no | `8500-8599` | Port range for sidecar processes |
| `MAX_SIDECARS` | no | `2` | Max concurrent sidecar processes |
| `LOG_LEVEL` | no | `info` | Log level (debug, info, warn, error) |
| `BASE_ARGS` | no | `["-ngl","999","-c","32768","--flash-attn","on","--no-mmap"]` | JSON array of base llama-server args |
| `HEALTH_TIMEOUT_SECONDS` | no | `60` | Max wait for sidecar health check |
| `HEALTH_INTERVAL_SECONDS` | no | `30` | Background health check interval |
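To make the value formats concrete, the following sketch shows one way such variables could be parsed. The helper names (`envOr`, `parsePortRange`, `parseBaseArgs`, `parseModelMap`) are hypothetical and not taken from the daemon's source. The model map is assumed to be a flat JSON object of model ID to GGUF path, and the example path is made up. The BOM stripping reflects the UTF-8 BOM issue noted in the commit message.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// envOr returns the value of an environment variable, or def if unset or empty.
func envOr(name, def string) string {
	if v, ok := os.LookupEnv(name); ok && v != "" {
		return v
	}
	return def
}

// parsePortRange parses a "LOW-HIGH" string such as "8500-8599".
func parsePortRange(s string) (lo, hi int, err error) {
	a, b, ok := strings.Cut(s, "-")
	if !ok {
		return 0, 0, fmt.Errorf("port range %q: want LOW-HIGH", s)
	}
	if lo, err = strconv.Atoi(strings.TrimSpace(a)); err != nil {
		return 0, 0, err
	}
	if hi, err = strconv.Atoi(strings.TrimSpace(b)); err != nil {
		return 0, 0, err
	}
	if lo < 1 || hi > 65535 || lo > hi {
		return 0, 0, fmt.Errorf("port range %q: out of bounds", s)
	}
	return lo, hi, nil
}

// parseBaseArgs decodes a JSON array of strings, the format used by BASE_ARGS.
func parseBaseArgs(s string) ([]string, error) {
	var args []string
	if err := json.Unmarshal([]byte(s), &args); err != nil {
		return nil, fmt.Errorf("BASE_ARGS: %w", err)
	}
	return args, nil
}

// parseModelMap decodes the model map, assumed here to be a flat JSON object.
// A leading UTF-8 BOM is stripped first, since encoding/json rejects it.
func parseModelMap(data []byte) (map[string]string, error) {
	data = bytes.TrimPrefix(data, []byte{0xEF, 0xBB, 0xBF})
	m := map[string]string{}
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, fmt.Errorf("model map: %w", err)
	}
	return m, nil
}

func main() {
	lo, hi, err := parsePortRange(envOr("PORT_RANGE", "8500-8599"))
	fmt.Println(lo, hi, err)

	args, err := parseBaseArgs(envOr("BASE_ARGS",
		`["-ngl","999","-c","32768","--flash-attn","on","--no-mmap"]`))
	fmt.Println(args, err)

	// Illustrative map content with a BOM prefix; the path is invented.
	raw := append([]byte{0xEF, 0xBB, 0xBF}, []byte(`{"example-model":"C:\\models\\example.gguf"}`)...)
	m, err := parseModelMap(raw)
	fmt.Println(m, err)
}
```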

API

`GET /health`

Returns daemon status.

`GET /sidecars`

Returns list of active sidecar processes.

`DELETE /sidecars/{hash}`

Kills and removes a sidecar process.

`POST /v1/chat/completions`

OpenAI-compatible proxy. Routes to a sidecar process based on model + flags.

Headers:

X-Agent-Flags: --top-k 20 --cache-type-k q8_0 (optional)
X-Model-Id: qwen3.6-35b-a3b-mxfp4 (optional, overrides body.model)
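A client call to the proxy might look like the sketch below. It assumes the daemon listens on the default bind address `127.0.0.1:8402` and that the request body follows the usual OpenAI chat completions shape (`model` plus `messages`). The helper `newChatRequest` and its types are hypothetical names for this example. The program only builds and prints the request, so it runs without a live daemon.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// newChatRequest builds a POST to /v1/chat/completions.
// Hypothetical helper: agentFlags, if non-empty, is sent as X-Agent-Flags.
func newChatRequest(baseURL, modelID, agentFlags, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    modelID,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if agentFlags != "" {
		req.Header.Set("X-Agent-Flags", agentFlags)
	}
	return req, nil
}

func main() {
	req, err := newChatRequest(
		"http://127.0.0.1:8402", // default LLAMA_SIDECAR_BIND
		"qwen3.6-35b-a3b-mxfp4",
		"--top-k 20 --cache-type-k q8_0",
		"Say hello.",
	)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println("X-Agent-Flags:", req.Header.Get("X-Agent-Flags"))
}
```

To actually send the request, pass it to `http.DefaultClient.Do(req)` and close the response body afterwards. An `X-Model-Id` header could be set the same way when the model in the body should be overridden.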

Test

make test                  # unit tests
make test-integration      # requires real llama-server + GGUF
make lint                  # vet + gofmt

NSSM Service

Pre-configured on sam-desktop as the NSSM service llama-sidecar. Start, stop, or check its status via:

C:\Tools\nssm\nssm.exe start llama-sidecar
C:\Tools\nssm\nssm.exe stop llama-sidecar
C:\Tools\nssm\nssm.exe status llama-sidecar
