llama-sidecar v0.1.0: daemon + benchmarks + eval suite

Go daemon (cmd/llama-sidecar): per-agent llama-server process pool with LRU eviction, OpenAI-compatible proxy, flag validation (Unsloth port), deterministic hash-keyed sidecar reuse. Windows service support via schtasks/NSSM with DETACHED_PROCESS, stdout pipe drain, and request-ctx decoupled child lifetime.

Bug fixes (3b.1–3b.5): -c flag drop from StripShadowingFlags, UTF-8 BOM in JSON config, -fa → --flash-attn on default, child process exit after one request (stdin devnull, stdout pipe, CREATE_NO_WINDOW → DETACHED, context.Background for child lifetime, background reaper goroutine).

bench/: MTP on/off throughput sweep across 8 GGUFs via SSH+schtasks automation to sam-desktop. Per-GGUF production flags from llama-swap config with --ctx-size 32768 override.

eval/: accuracy benchmarks (MMLU 100q, GSM8K 50q, HumanEval 164) + A/B model comparison (14 agent-typed prompts × 8 models). All scripts resumable at individual question level.

94 Go tests, race detector clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-28 01:55:13 +00:00
parent babbb4f39b
commit fe7f36ae98
39 changed files with 4228 additions and 0 deletions
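
The commit message mentions deterministic hash-keyed sidecar reuse. A minimal sketch of how such a key could be derived is shown here, assuming SHA-256 over the model ID and the flag list in their given order. The helper name `sidecarKey` and the length-prefix framing are illustrative assumptions, not taken from this commit.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// sidecarKey is a hypothetical helper: it hashes the model ID and the flag
// list into a stable hex key. Each field is length-prefixed so that
// ("a", ["bc"]) and ("ab", ["c"]) cannot collide through plain concatenation.
func sidecarKey(modelID string, flags []string) string {
	h := sha256.New()
	var n [8]byte
	write := func(s string) {
		binary.BigEndian.PutUint64(n[:], uint64(len(s)))
		h.Write(n[:])
		h.Write([]byte(s))
	}
	write(modelID)
	for _, f := range flags {
		write(f) // flag order is preserved, so reordering changes the key
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := sidecarKey("qwen3.6-35b-a3b-mxfp4", []string{"--top-k", "20"})
	b := sidecarKey("qwen3.6-35b-a3b-mxfp4", []string{"--top-k", "20"})
	c := sidecarKey("qwen3.6-35b-a3b-mxfp4", []string{"--top-k", "40"})
	fmt.Println(a == b) // identical inputs give the same key
	fmt.Println(a == c) // a changed flag value gives a different key
}
```

The sketch deliberately does not sort the flags: flags carry positional values, so sorting could pair a value with the wrong option. The cost is that two semantically equal flag lists in a different order would map to separate sidecars.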
--- a/README.md
+++ b/README.md
@@ -0,0 +1,77 @@
+# llama-sidecar
+
+Per-agent llama-server process pool daemon. Runs on sam-desktop alongside llama-swap. Spawns or reuses llama-server processes keyed on a hash of (modelID, flags).
+
+## License
+
+AGPL-3.0-only.
+
+The validator package (`internal/validator/`) is ported from [Unsloth Studio](https://github.com/unslothai/unsloth/blob/main/studio/backend/core/inference/llama_server_args.py) (AGPL-3.0). BooCode's TypeScript port (`apps/server/src/services/inference/llama-args-validator.ts`) is the sibling — update both when upstream changes.
+
+## Build
+
+```bash
+# Linux (development)
+make build
+
+# Windows AMD64 (production target — cross-compile from Linux)
+make build-windows
+
+# Copy to sam-desktop
+# scp bin/llama-sidecar.exe sam-desktop:C:\llama-sidecar\
+```
+
+## Configuration
+
+All configuration is via environment variables; there are no CLI flags:
+
+| Variable | Required | Default | Description |
+|----------|----------|---------|-------------|
+| `LLAMA_SERVER_BIN` | yes | — | Path to llama-server.exe |
+| `MODEL_DIR_MAP_FILE` | yes | — | JSON file mapping model IDs to GGUF paths |
+| `LLAMA_SIDECAR_BIND` | no | `127.0.0.1:8402` | Listen address |
+| `PORT_RANGE` | no | `8500-8599` | Port range for sidecar processes |
+| `MAX_SIDECARS` | no | `2` | Max concurrent sidecar processes |
+| `LOG_LEVEL` | no | `info` | Log level (debug, info, warn, error) |
+| `BASE_ARGS` | no | `["-ngl","999","-c","32768","--flash-attn","on","--no-mmap"]` | JSON array of base llama-server args |
+| `HEALTH_TIMEOUT_SECONDS` | no | `60` | Max wait for sidecar health check |
+| `HEALTH_INTERVAL_SECONDS` | no | `30` | Background health check interval |
+
+## API
+
+### `GET /health`
+
+Returns daemon status.
+
+### `GET /sidecars`
+
+Returns list of active sidecar processes.
+
+### `DELETE /sidecars/{hash}`
+
+Kills and removes the sidecar process identified by `{hash}`.
+
+### `POST /v1/chat/completions`
+
+OpenAI-compatible proxy. Routes each request to a sidecar process selected by model ID and flags.
+
+Headers:
+- `X-Agent-Flags: --top-k 20 --cache-type-k q8_0` (optional)
+- `X-Model-Id: qwen3.6-35b-a3b-mxfp4` (optional, overrides body.model)
+
+## Test
+
+```bash
+make test                  # unit tests
+make test-integration      # requires real llama-server + GGUF
+make lint                  # vet + gofmt
+```
+
+## NSSM Service
+
+Pre-configured on sam-desktop as `llama-sidecar`. Start, stop, or check status via:
+```
+C:\Tools\nssm\nssm.exe start llama-sidecar
+C:\Tools\nssm\nssm.exe stop llama-sidecar
+C:\Tools\nssm\nssm.exe status llama-sidecar
+```
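
The README's `POST /v1/chat/completions` section describes routing through the optional `X-Agent-Flags` and `X-Model-Id` headers. A minimal Go client sketch of building such a request follows. The helper `newChatRequest` and the trimmed-down request types are hypothetical, and the base URL assumes the default `LLAMA_SIDECAR_BIND` value from the configuration table.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// chatMessage and chatRequest cover only a small subset of the
// OpenAI chat completions body, enough for illustration.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// newChatRequest is a hypothetical helper that builds a request for the proxy.
// An empty modelID or agentFlags leaves the matching optional header unset.
func newChatRequest(baseURL, modelID, agentFlags string, body chatRequest) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if modelID != "" {
		req.Header.Set("X-Model-Id", modelID) // overrides body.model per the README
	}
	if agentFlags != "" {
		req.Header.Set("X-Agent-Flags", agentFlags)
	}
	return req, nil
}

func main() {
	req, err := newChatRequest(
		"http://127.0.0.1:8402", // assumed: default LLAMA_SIDECAR_BIND
		"qwen3.6-35b-a3b-mxfp4",
		"--top-k 20 --cache-type-k q8_0",
		chatRequest{
			Model:    "qwen3.6-35b-a3b-mxfp4",
			Messages: []chatMessage{{Role: "user", Content: "Hello"}},
		},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println("X-Agent-Flags:", req.Header.Get("X-Agent-Flags"))
	// To send it against a running daemon: resp, err := http.DefaultClient.Do(req)
}
```

The example only constructs and prints the request so it runs without a daemon. Sending it requires llama-sidecar to be listening on the assumed address.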