llama-sidecar v0.1.0: daemon + benchmarks + eval suite
Go daemon (cmd/llama-sidecar): per-agent llama-server process pool with LRU eviction, OpenAI-compatible proxy, flag validation (Unsloth port), and deterministic hash-keyed sidecar reuse. Windows service support via schtasks/NSSM with DETACHED_PROCESS, stdout pipe drain, and a child lifetime decoupled from the request context.

Bug fixes (3b.1–3b.5): -c flag drop from StripShadowingFlags; UTF-8 BOM in JSON config; -fa → --flash-attn on default; child process exit after one request (stdin devnull, stdout pipe, CREATE_NO_WINDOW → DETACHED, context.Background for child lifetime, background reaper goroutine).

bench/: MTP on/off throughput sweep across 8 GGUFs via SSH+schtasks automation to sam-desktop. Per-GGUF production flags from the llama-swap config with a --ctx-size 32768 override.

eval/: accuracy benchmarks (MMLU 100q, GSM8K 50q, HumanEval 164) + A/B model comparison (14 agent-typed prompts × 8 models). All scripts are resumable at the individual-question level.

94 Go tests, race detector clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
18
.gitignore
vendored
Normal file
@@ -0,0 +1,18 @@
bin/
*.exe
eval/.venv/
eval/results/
eval/scores.csv
eval/SUMMARY.md
eval/eval.log
eval/ab/results/
eval/ab/COMPARE.md
eval/ab/timing.csv
eval/ab/run.log
bench/results/
bench/SUMMARY.md
bench/results.csv
bench/llama-swap-recommendations.md
internal/pool/*.bak-*
internal/pool/sidecar_windows.go.bak-*
__pycache__/
19
Makefile
Normal file
@@ -0,0 +1,19 @@
.PHONY: build build-windows test test-integration lint

GO = /snap/go/current/bin/go

build:
	$(GO) build -o bin/llama-sidecar ./cmd/llama-sidecar

build-windows:
	GOOS=windows GOARCH=amd64 $(GO) build -o bin/llama-sidecar.exe ./cmd/llama-sidecar

test:
	$(GO) test ./internal/...

test-integration:
	$(GO) test -tags=integration ./internal/...

lint:
	$(GO) vet ./...
	gofmt -l .
77
README.md
Normal file
@@ -0,0 +1,77 @@
# llama-sidecar

Per-agent llama-server process pool daemon. It runs on sam-desktop alongside llama-swap and spawns or reuses llama-server processes keyed on a hash of (modelID, flags).
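
The hash-keyed reuse described above can be illustrated with a minimal Python sketch. This is not the daemon's Go implementation: the function name `sidecar_key`, the SHA-256 digest, the 16-character truncation, and the JSON canonical form are all assumptions made for illustration. It only shows why identical (modelID, flags) pairs can map to the same sidecar.

```python
import hashlib
import json
import shlex


def sidecar_key(model_id: str, flags: str) -> str:
    """Return a stable hex key for a (model ID, flags) pair.

    Hypothetical scheme: tokenize the flag string, serialize it with the
    model ID as canonical JSON, and hash the result with SHA-256.
    Flag order is treated as significant in this sketch.
    """
    tokens = shlex.split(flags)  # collapses incidental whitespace differences
    payload = json.dumps(
        {"model": model_id, "flags": tokens},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Two requests whose flag strings differ only in spacing produce the same key under this scheme, so they would share one process; any change to a flag value produces a different key.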

## License

AGPL-3.0-only.

The validator package (`internal/validator/`) is ported from [Unsloth Studio](https://github.com/unslothai/unsloth/blob/main/studio/backend/core/inference/llama_server_args.py) (AGPL-3.0). BooCode's TypeScript port (`apps/server/src/services/inference/llama-args-validator.ts`) is its sibling; update both when upstream changes.

## Build

```bash
# Linux (development)
make build

# Windows AMD64 (production target; cross-compile from Linux)
make build-windows

# Copy to sam-desktop
# scp bin/llama-sidecar.exe sam-desktop:C:\llama-sidecar\
```

## Configuration

All configuration is through environment variables; there are no CLI flags.

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `LLAMA_SERVER_BIN` | yes | — | Path to llama-server.exe |
| `MODEL_DIR_MAP_FILE` | yes | — | JSON file mapping model IDs to GGUF paths |
| `LLAMA_SIDECAR_BIND` | no | `127.0.0.1:8402` | Listen address |
| `PORT_RANGE` | no | `8500-8599` | Port range for sidecar processes |
| `MAX_SIDECARS` | no | `2` | Max concurrent sidecar processes |
| `LOG_LEVEL` | no | `info` | Log level (debug, info, warn, error) |
| `BASE_ARGS` | no | `["-ngl","999","-c","32768","--flash-attn","on","--no-mmap"]` | JSON array of base llama-server args |
| `HEALTH_TIMEOUT_SECONDS` | no | `60` | Max wait for sidecar health check |
| `HEALTH_INTERVAL_SECONDS` | no | `30` | Background health check interval |
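
To make the defaults in the table concrete, here is a small Python sketch of how such an environment could be parsed and validated. The variable names and default values come from the table; the `Config` class, the `load_config` function, and the validation rules are hypothetical and do not describe how the Go daemon actually reads its settings.

```python
import json
import os
from dataclasses import dataclass

# Default copied from the BASE_ARGS row of the table.
DEFAULT_BASE_ARGS = '["-ngl","999","-c","32768","--flash-attn","on","--no-mmap"]'


@dataclass(frozen=True)
class Config:
    llama_server_bin: str
    model_dir_map_file: str
    bind: str
    port_range: tuple[int, int]
    max_sidecars: int
    log_level: str
    base_args: list[str]
    health_timeout_seconds: int
    health_interval_seconds: int


def load_config(env=None) -> Config:
    """Build a Config from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [n for n in ("LLAMA_SERVER_BIN", "MODEL_DIR_MAP_FILE") if not env.get(n)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))

    # PORT_RANGE is written as "low-high", e.g. "8500-8599".
    low, high = (int(p) for p in env.get("PORT_RANGE", "8500-8599").split("-", 1))
    if low > high:
        raise ValueError("PORT_RANGE must be written as low-high")

    base_args = json.loads(env.get("BASE_ARGS", DEFAULT_BASE_ARGS))
    if not (isinstance(base_args, list) and all(isinstance(a, str) for a in base_args)):
        raise ValueError("BASE_ARGS must be a JSON array of strings")

    return Config(
        llama_server_bin=env["LLAMA_SERVER_BIN"],
        model_dir_map_file=env["MODEL_DIR_MAP_FILE"],
        bind=env.get("LLAMA_SIDECAR_BIND", "127.0.0.1:8402"),
        port_range=(low, high),
        max_sidecars=int(env.get("MAX_SIDECARS", "2")),
        log_level=env.get("LOG_LEVEL", "info"),
        base_args=base_args,
        health_timeout_seconds=int(env.get("HEALTH_TIMEOUT_SECONDS", "60")),
        health_interval_seconds=int(env.get("HEALTH_INTERVAL_SECONDS", "30")),
    )
```

Passing a plain dictionary instead of the process environment, as in `load_config({"LLAMA_SERVER_BIN": "...", "MODEL_DIR_MAP_FILE": "..."})`, makes it easy to check a deployment's settings before starting the service.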

## API

### `GET /health`

Returns daemon status.

### `GET /sidecars`

Returns the list of active sidecar processes.

### `DELETE /sidecars/{hash}`

Kills and removes a sidecar process.
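
As a rough illustration of the three management endpoints, the following Python sketch builds the corresponding requests with the standard library. The base URL assumes the default `LLAMA_SIDECAR_BIND`; the helper name `management_request` and its `action` argument are hypothetical. Response bodies are not documented here, so the sketch does not try to parse them.

```python
import urllib.request
from urllib.parse import quote

# Assumes the default listen address; adjust if LLAMA_SIDECAR_BIND is set.
BASE_URL = "http://127.0.0.1:8402"


def management_request(action: str, sidecar_hash: str = "") -> urllib.request.Request:
    """Build (but do not send) a request for a management endpoint.

    action is one of "health", "list", or "delete" (names chosen for this sketch).
    """
    if action == "health":
        return urllib.request.Request(f"{BASE_URL}/health", method="GET")
    if action == "list":
        return urllib.request.Request(f"{BASE_URL}/sidecars", method="GET")
    if action == "delete":
        if not sidecar_hash:
            raise ValueError("delete requires a sidecar hash")
        # Percent-encode the hash so it is safe as a single path segment.
        path = quote(sidecar_hash, safe="")
        return urllib.request.Request(f"{BASE_URL}/sidecars/{path}", method="DELETE")
    raise ValueError(f"unknown action: {action}")
```

A request is sent with `urllib.request.urlopen(management_request("list"), timeout=5)`; building and sending are kept separate so the URL and method can be inspected without a running daemon.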

### `POST /v1/chat/completions`

OpenAI-compatible proxy. Routes each request to a sidecar process based on model and flags.

Headers:
- `X-Agent-Flags: --top-k 20 --cache-type-k q8_0` (optional)
- `X-Model-Id: qwen3.6-35b-a3b-mxfp4` (optional, overrides body.model)
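
A client-side sketch of a proxied request, using only the Python standard library, might look like the following. The two header names are taken from the list above; the function name `chat_request` and the minimal request body are assumptions, and the body is limited to fields common to OpenAI-style chat endpoints.

```python
import json
import urllib.request


def chat_request(base_url, model, prompt, agent_flags=None, model_id=None):
    """Build (but do not send) a chat completion request for the proxy."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if agent_flags:
        # Per-agent llama-server flags, e.g. "--top-k 20 --cache-type-k q8_0".
        headers["X-Agent-Flags"] = agent_flags
    if model_id:
        # Overrides the "model" field of the body.
        headers["X-Model-Id"] = model_id
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers=headers,
        method="POST",
    )
```

For example, `chat_request("http://127.0.0.1:8402", "qwen3.6-35b-a3b-mxfp4", "Hello", agent_flags="--top-k 20 --cache-type-k q8_0")` can be passed to `urllib.request.urlopen`. Note that `urllib` normalizes the capitalization of header names, which is harmless because HTTP header names are case-insensitive.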

## Test

```bash
make test              # unit tests
make test-integration  # requires real llama-server + GGUF
make lint              # vet + gofmt
```

## NSSM Service

Pre-configured on sam-desktop as the `llama-sidecar` service. Start, stop, or check it with:
```
C:\Tools\nssm\nssm.exe start llama-sidecar
C:\Tools\nssm\nssm.exe stop llama-sidecar
C:\Tools\nssm\nssm.exe status llama-sidecar
```
215
bench/analyze.py
Normal file
@@ -0,0 +1,215 @@
#!/usr/bin/env python3
"""Analyze MTP on/off benchmark results → CSV + SUMMARY.md + recommendations."""

import csv
import json
import os
import re
import statistics
from pathlib import Path

RESULTS_DIR = Path(__file__).parent / "results"
CSV_PATH = Path(__file__).parent / "results.csv"
SUMMARY_PATH = Path(__file__).parent / "SUMMARY.md"
RECO_PATH = Path(__file__).parent / "llama-swap-recommendations.md"

FNAME_RE = re.compile(
    r"^(?P<stem>.+?)__mtp-(?P<mtp>on|off)__len(?P<len>\d+)__run(?P<run>\d+)\.json$"
)


def parse_result(path: Path) -> dict | None:
    m = FNAME_RE.match(path.name)
    if not m:
        return None
    try:
        data = json.loads(path.read_text())
    except (json.JSONDecodeError, OSError):
        return None
    t = data.get("timings", {})
    return {
        "gguf": m.group("stem"),
        "mtp": m.group("mtp"),
        "prompt_len": int(m.group("len")),
        "run": int(m.group("run")),
        "prompt_tps": t.get("prompt_per_second"),
        "predicted_tps": t.get("predicted_per_second"),
        "cache_n": t.get("cache_n"),
        "draft_n": t.get("draft_n"),
        "accepted_n": t.get("draft_n_accepted"),
        "total_ms": (t.get("prompt_ms", 0) or 0) + (t.get("predicted_ms", 0) or 0),
    }


def load_all() -> list[dict]:
    rows = []
    for f in sorted(RESULTS_DIR.glob("*.json")):
        r = parse_result(f)
        if r:
            rows.append(r)
    return rows


def write_csv(rows: list[dict]) -> None:
    fields = ["gguf", "mtp", "prompt_len", "run", "prompt_tps", "predicted_tps",
              "cache_n", "draft_n", "accepted_n", "total_ms"]
    with open(CSV_PATH, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        w.writerows(rows)
    print(f"Wrote {len(rows)} rows to {CSV_PATH}")


def median_of(values: list[float]) -> float:
    return statistics.median(values) if values else 0.0


def write_summary(rows: list[dict]) -> None:
    ggufs = sorted(set(r["gguf"] for r in rows))
    lens = sorted(set(r["prompt_len"] for r in rows))
    lines = ["# MTP On/Off Benchmark Results\n"]
    lines.append(f"**{len(rows)} measurements across {len(ggufs)} GGUFs.**\n")
    lines.append(f"Runs 2 & 3 used for median (run 1 = warmup, discarded).\n")

    verdicts = []

    for gguf in ggufs:
        lines.append(f"\n## {gguf}\n")
        header_parts = ["prompt_len"]
        for state in ["off", "on"]:
            header_parts.append(f"MTP-{state} tok/s")
        header_parts.extend(["delta %", "accept %"])
        lines.append("| " + " | ".join(header_parts) + " |")
        lines.append("|" + "|".join("---" for _ in header_parts) + "|")

        any_above_10 = False
        for pl in lens:
            off_vals = [r["predicted_tps"] for r in rows
                        if r["gguf"] == gguf and r["mtp"] == "off"
                        and r["prompt_len"] == pl and r["run"] >= 2
                        and r["predicted_tps"] is not None]
            on_vals = [r["predicted_tps"] for r in rows
                       if r["gguf"] == gguf and r["mtp"] == "on"
                       and r["prompt_len"] == pl and r["run"] >= 2
                       and r["predicted_tps"] is not None]

            off_med = median_of(off_vals)
            on_med = median_of(on_vals)

            if off_med > 0:
                delta = ((on_med - off_med) / off_med) * 100
            else:
                delta = 0.0

            # Only a speed-up of at least 10% counts toward keeping MTP.
            # Using abs(delta) here would also report "KEEP MTP" for a 10%
            # slowdown, or when every MTP-on run failed (on_med == 0).
            if delta >= 10:
                any_above_10 = True

            draft_rows = [r for r in rows
                          if r["gguf"] == gguf and r["mtp"] == "on"
                          and r["prompt_len"] == pl and r["run"] >= 2
                          and r.get("draft_n")]
            total_draft = sum(r.get("draft_n") or 0 for r in draft_rows)
            # accepted_n may be missing (None) even when draft_n is present.
            total_accepted = sum(r.get("accepted_n") or 0 for r in draft_rows)
            accept_pct = f"{(total_accepted / total_draft * 100):.0f}%" if total_draft > 0 else "—"

            lines.append(
                f"| {pl} | {off_med:.1f} | {on_med:.1f} | {delta:+.1f}% | {accept_pct} |"
            )

        if any_above_10:
            verdict = "KEEP MTP"
        else:
            verdict = "DROP MTP"
        verdicts.append((gguf, verdict))
        lines.append(f"\n**Verdict: {verdict}**\n")

    lines.append("\n---\n")
    lines.append("## Verdict Summary\n")
    lines.append("| GGUF | Verdict |")
    lines.append("|------|---------|")
    for gguf, verdict in verdicts:
        lines.append(f"| {gguf} | {verdict} |")

    summary = "\n".join(lines) + "\n"
    SUMMARY_PATH.write_text(summary)
    print(f"Wrote {SUMMARY_PATH}")
    print(summary)


def write_recommendations(rows: list[dict]) -> None:
    ggufs = sorted(set(r["gguf"] for r in rows))
    lens = sorted(set(r["prompt_len"] for r in rows))

    lines = ["# llama-swap Config Recommendations\n"]
    lines.append("Based on MTP on/off benchmark results.\n")
    lines.append("**Read-only reference** — do NOT edit D:\\llama-swap\\config.yaml directly.\n")
    lines.append("```yaml")
    lines.append("# Commented diff against current config.yaml")
    lines.append("# Lines starting with + should be added, - should be removed")
    lines.append("")

    model_map = {
        "Qwen3.6-35B-A3B-MXFP4_MOE": "qwen3.6-35b-a3b-mxfp4",
        "Qwen3.6-27B-Q6_K": "qwen3.6-27b-mtp",
        "Qwopus3.5-4B-v3-MTP-Q8_0": "qwopus3.5-4b-mtp",
        "Qwen3.5-9B-DeepSeek-V4-Flash-MTP-Q8_0": "qwen3.5-9b-deepseek-v4-mtp",
        "Qwopus3.6-35B-A3B-v1-MTP-Q4_K_M": "qwopus3.6-35b-a3b-v1-mtp",
        "Qwopus3.6-35B-A3B-v1-MTP-MXFP4_MOE_BF16": "qwopus3.6-35b-a3b-mxfp4-mtp",
        "Qwopus3.6-27B-v2-MTP-Q6_K": "qwopus3.6-27b-v2-mtp",
        "Qwopus3.5-9B-Coder-MTP-Q8_0": "qwopus3.5-9b-coder-mtp",
    }

    currently_mtp = {
        "Qwen3.6-35B-A3B-MXFP4_MOE": False,
        "Qwen3.6-27B-Q6_K": True,
        "Qwopus3.5-4B-v3-MTP-Q8_0": True,
        "Qwen3.5-9B-DeepSeek-V4-Flash-MTP-Q8_0": True,
        "Qwopus3.6-35B-A3B-v1-MTP-Q4_K_M": True,
        "Qwopus3.6-35B-A3B-v1-MTP-MXFP4_MOE_BF16": True,
        "Qwopus3.6-27B-v2-MTP-Q6_K": True,
        "Qwopus3.5-9B-Coder-MTP-Q8_0": True,
    }

    for gguf in ggufs:
        model_id = model_map.get(gguf, gguf)
        is_mtp_now = currently_mtp.get(gguf, False)

        off_vals = [r["predicted_tps"] for r in rows
                    if r["gguf"] == gguf and r["mtp"] == "off" and r["run"] >= 2
                    and r["predicted_tps"] is not None]
        on_vals = [r["predicted_tps"] for r in rows
                   if r["gguf"] == gguf and r["mtp"] == "on" and r["run"] >= 2
                   and r["predicted_tps"] is not None]
        off_med = median_of(off_vals)
        on_med = median_of(on_vals)
        delta = ((on_med - off_med) / off_med * 100) if off_med > 0 else 0

        should_mtp = delta >= 10
        lines.append(f" # {model_id}: MTP {'on' if is_mtp_now else 'off'} → {'on' if should_mtp else 'off'} (delta {delta:+.1f}%)")

        if should_mtp and not is_mtp_now:
            lines.append(f" # + --spec-type draft-mtp --spec-draft-n-max 2")
        elif not should_mtp and is_mtp_now:
            lines.append(f" # - --spec-type draft-mtp --spec-draft-n-max 2")
        else:
            lines.append(f" # (no change)")
        lines.append("")

    lines.append("```\n")
    reco = "\n".join(lines)
    RECO_PATH.write_text(reco)
    print(f"Wrote {RECO_PATH}")


def main() -> None:
    rows = load_all()
    if not rows:
        print("No results found in", RESULTS_DIR)
        return
    write_csv(rows)
    write_summary(rows)
    write_recommendations(rows)


if __name__ == "__main__":
    main()
192
bench/bench.sh
Executable file
@@ -0,0 +1,192 @@
#!/usr/bin/env bash
set -euo pipefail

ENDPOINT="http://100.101.41.16:8650"
SSH_HOST="samki@100.101.41.16"
TASK_NAME="bench_llama"
BAT_PATH='%TEMP%\bench_run.bat'
RESULTS_DIR="$(cd "$(dirname "$0")" && pwd)/results"
PROMPTS_DIR="$(cd "$(dirname "$0")" && pwd)/prompts"
MAX_TOKENS=200
HEALTH_TIMEOUT=120
LLAMA_BIN='D:\llama-server\llama-server.exe'

mkdir -p "$RESULTS_DIR"

# ── Config matrix: STEM|MTP_STATE|FULL_ARGS ───────────────────────────

CONFIGS=(
'Qwen3.6-35B-A3B-MXFP4_MOE|off|--host 0.0.0.0 --port 8650 -m D:\models\Qwen3.6-35B-A3B-MXFP4_MOE.gguf --mmproj D:\models\Qwen3.6-35B-A3B-MXFP4_MOE\mmproj.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q8_0 --cache-type-v q8_0 --jinja --chat-template-file D:\models\qwen3.6.jinja --keep -1 --cache-reuse 2048 --parallel 1 --batch-size 4096 --ubatch-size 1024 --threads 8 --no-mmap --mlock --seed 42 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
'Qwen3.6-35B-A3B-MXFP4_MOE|on|--host 0.0.0.0 --port 8650 -m D:\models\Qwen3.6-35B-A3B-MXFP4_MOE.gguf --mmproj D:\models\Qwen3.6-35B-A3B-MXFP4_MOE\mmproj.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q8_0 --cache-type-v q8_0 --jinja --chat-template-file D:\models\qwen3.6.jinja --keep -1 --cache-reuse 2048 --parallel 1 --batch-size 4096 --ubatch-size 1024 --threads 8 --no-mmap --mlock --seed 42 --spec-type draft-mtp --spec-draft-n-max 2 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
'Qwen3.6-27B-Q6_K|off|--host 0.0.0.0 --port 8650 -m D:\models\Qwen3.6-27B-Q6_K.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q4_0 --cache-type-v q4_0 --jinja --chat-template-file D:\models\qwen3.6.jinja --keep -1 --cache-reuse 1024 --parallel 1 --batch-size 2048 --ubatch-size 512 --threads 8 --no-mmap --mlock --seed 42 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
'Qwen3.6-27B-Q6_K|on|--host 0.0.0.0 --port 8650 -m D:\models\Qwen3.6-27B-Q6_K.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q4_0 --cache-type-v q4_0 --jinja --chat-template-file D:\models\qwen3.6.jinja --keep -1 --cache-reuse 1024 --parallel 1 --batch-size 2048 --ubatch-size 512 --threads 8 --no-mmap --mlock --seed 42 --spec-type draft-mtp --spec-draft-n-max 2 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
'Qwopus3.5-4B-v3-MTP-Q8_0|off|--host 0.0.0.0 --port 8650 -m D:\models\Qwopus3.5-4B-v3-MTP-Q8_0.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q8_0 --cache-type-v q8_0 --jinja --keep -1 --cache-reuse 1024 --parallel 1 --batch-size 2048 --ubatch-size 512 --threads 8 --no-mmap --mlock --seed 42 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
'Qwopus3.5-4B-v3-MTP-Q8_0|on|--host 0.0.0.0 --port 8650 -m D:\models\Qwopus3.5-4B-v3-MTP-Q8_0.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q8_0 --cache-type-v q8_0 --jinja --keep -1 --cache-reuse 1024 --parallel 1 --batch-size 2048 --ubatch-size 512 --threads 8 --no-mmap --mlock --seed 42 --spec-type draft-mtp --spec-draft-n-max 2 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
'Qwen3.5-9B-DeepSeek-V4-Flash-MTP-Q8_0|off|--host 0.0.0.0 --port 8650 -m D:\models\Qwen3.5-9B-DeepSeek-V4-Flash-MTP-Q8_0.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q8_0 --cache-type-v q8_0 --jinja --keep -1 --cache-reuse 1024 --parallel 1 --batch-size 2048 --ubatch-size 512 --threads 8 --no-mmap --mlock --seed 42 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
'Qwen3.5-9B-DeepSeek-V4-Flash-MTP-Q8_0|on|--host 0.0.0.0 --port 8650 -m D:\models\Qwen3.5-9B-DeepSeek-V4-Flash-MTP-Q8_0.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q8_0 --cache-type-v q8_0 --jinja --keep -1 --cache-reuse 1024 --parallel 1 --batch-size 2048 --ubatch-size 512 --threads 8 --no-mmap --mlock --seed 42 --spec-type draft-mtp --spec-draft-n-max 2 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
'Qwopus3.6-35B-A3B-v1-MTP-Q4_K_M|off|--host 0.0.0.0 --port 8650 -m D:\models\Qwopus3.6-35B-A3B-v1-MTP-Q4_K_M.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q8_0 --cache-type-v q8_0 --jinja --chat-template-file D:\models\qwen3.6.jinja --keep -1 --cache-reuse 2048 --parallel 1 --batch-size 4096 --ubatch-size 1024 --threads 8 --no-mmap --mlock --seed 42 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
'Qwopus3.6-35B-A3B-v1-MTP-Q4_K_M|on|--host 0.0.0.0 --port 8650 -m D:\models\Qwopus3.6-35B-A3B-v1-MTP-Q4_K_M.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q8_0 --cache-type-v q8_0 --jinja --chat-template-file D:\models\qwen3.6.jinja --keep -1 --cache-reuse 2048 --parallel 1 --batch-size 4096 --ubatch-size 1024 --threads 8 --no-mmap --mlock --seed 42 --spec-type draft-mtp --spec-draft-n-max 2 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
'Qwopus3.6-35B-A3B-v1-MTP-MXFP4_MOE_BF16|off|--host 0.0.0.0 --port 8650 -m D:\models\Qwopus3.6-35B-A3B-v1-MTP-MXFP4_MOE_BF16.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q8_0 --cache-type-v q8_0 --jinja --chat-template-file D:\models\qwen3.6.jinja --keep -1 --cache-reuse 2048 --parallel 1 --batch-size 4096 --ubatch-size 1024 --threads 8 --no-mmap --mlock --seed 42 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
'Qwopus3.6-35B-A3B-v1-MTP-MXFP4_MOE_BF16|on|--host 0.0.0.0 --port 8650 -m D:\models\Qwopus3.6-35B-A3B-v1-MTP-MXFP4_MOE_BF16.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q8_0 --cache-type-v q8_0 --jinja --chat-template-file D:\models\qwen3.6.jinja --keep -1 --cache-reuse 2048 --parallel 1 --batch-size 4096 --ubatch-size 1024 --threads 8 --no-mmap --mlock --seed 42 --spec-type draft-mtp --spec-draft-n-max 2 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
'Qwopus3.6-27B-v2-MTP-Q6_K|off|--host 0.0.0.0 --port 8650 -m D:\models\Qwopus3.6-27B-v2-MTP-Q6_K.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q4_0 --cache-type-v q4_0 --jinja --chat-template-file D:\models\qwen3.6.jinja --keep -1 --cache-reuse 1024 --parallel 1 --batch-size 2048 --ubatch-size 512 --threads 8 --no-mmap --mlock --seed 42 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
'Qwopus3.6-27B-v2-MTP-Q6_K|on|--host 0.0.0.0 --port 8650 -m D:\models\Qwopus3.6-27B-v2-MTP-Q6_K.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q4_0 --cache-type-v q4_0 --jinja --chat-template-file D:\models\qwen3.6.jinja --keep -1 --cache-reuse 1024 --parallel 1 --batch-size 2048 --ubatch-size 512 --threads 8 --no-mmap --mlock --seed 42 --spec-type draft-mtp --spec-draft-n-max 2 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
'Qwopus3.5-9B-Coder-MTP-Q8_0|off|--host 0.0.0.0 --port 8650 -m D:\models\Qwopus3.5-9B-Coder-MTP-Q8_0.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q8_0 --cache-type-v q8_0 --jinja --keep -1 --cache-reuse 1024 --parallel 1 --batch-size 2048 --ubatch-size 512 --threads 8 --no-mmap --mlock --seed 42 --temp 0.4 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
'Qwopus3.5-9B-Coder-MTP-Q8_0|on|--host 0.0.0.0 --port 8650 -m D:\models\Qwopus3.5-9B-Coder-MTP-Q8_0.gguf -ngl 99 --ctx-size 32768 --flash-attn on --cont-batching --cache-type-k q8_0 --cache-type-v q8_0 --jinja --keep -1 --cache-reuse 1024 --parallel 1 --batch-size 2048 --ubatch-size 512 --threads 8 --no-mmap --mlock --seed 42 --spec-type draft-mtp --spec-draft-n-max 2 --temp 0.4 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.0'
)

PROMPT_LENS=(256 1024 4096)

# ── Helper functions ──────────────────────────────────────────────────

kill_bench_server() {
    local pids
    pids=$(ssh "$SSH_HOST" 'for /f "tokens=5" %a in ('"'"'netstat -aon ^| findstr :8650 ^| findstr LISTENING'"'"') do @echo %a' 2>/dev/null || true)
    for pid in $pids; do
        if [ -n "$pid" ] && [ "$pid" != "0" ]; then
            ssh "$SSH_HOST" "taskkill /F /PID $pid" 2>/dev/null || true
        fi
    done
    ssh "$SSH_HOST" "schtasks /Delete /TN ${TASK_NAME} /F" 2>/dev/null || true
    sleep 3
}

start_bench_server() {
    local args="$1"
    # Write a batch file, then run it via schtasks
    ssh "$SSH_HOST" "echo ${LLAMA_BIN} ${args} > ${BAT_PATH}" 2>/dev/null
    ssh "$SSH_HOST" "schtasks /Create /TN ${TASK_NAME} /TR ${BAT_PATH} /SC ONCE /ST 00:00 /F /RL HIGHEST" 2>/dev/null
    ssh "$SSH_HOST" "schtasks /Run /TN ${TASK_NAME}" 2>/dev/null
}

poll_health() {
    local elapsed=0
    while [ $elapsed -lt $HEALTH_TIMEOUT ]; do
        if curl -sf "${ENDPOINT}/health" >/dev/null 2>&1; then
            echo " health OK (${elapsed}s)"
            return 0
        fi
        sleep 3
        elapsed=$((elapsed + 3))
        if [ $((elapsed % 15)) -eq 0 ]; then
            echo " waiting... (${elapsed}s)"
        fi
    done
    echo " HEALTH TIMEOUT after ${HEALTH_TIMEOUT}s"
    return 1
}

send_request() {
    local prompt_file="$1"
    local output_file="$2"
    local body
    # Pass the path and token limit as arguments rather than interpolating
    # them into the Python source, so quotes in a path cannot break the code.
    body=$(python3 -c '
import json
import sys

prompt = open(sys.argv[1], encoding="utf-8").read()
print(json.dumps({
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": int(sys.argv[2]),
    "temperature": 0,
    "seed": 42,
    "stream": False,
}))
' "$prompt_file" "$MAX_TOKENS")
    local http_code
    http_code=$(curl -s -w '%{http_code}' -o "$output_file" \
        --max-time 300 \
        -X POST "${ENDPOINT}/v1/chat/completions" \
        -H "Content-Type: application/json" \
        -d "$body" 2>/dev/null)
    if [ "$http_code" != "200" ]; then
        echo "HTTP ${http_code}"
        return 1
    fi
    return 0
}

print_metrics() {
    # Summarize one result file: prompt/eval throughput plus draft acceptance
    # counts when the server reported them.
    python3 -c '
import json
import sys

d = json.load(open(sys.argv[1]))
t = d.get("timings", {})
ptps = t.get("prompt_per_second", 0)
etps = t.get("predicted_per_second", 0)
dn = t.get("draft_n", "")
da = t.get("draft_n_accepted", "")
draft = ""
if dn != "":
    draft = f" draft={da}/{dn}"
print(f"prompt={ptps:.1f} eval={etps:.1f} tok/s{draft}")
' "$1" 2>/dev/null || echo "(parse error)"
}

# ── Main ──────────────────────────────────────────────────────────────

total=${#CONFIGS[@]}
echo "================================================================"
echo " MTP ON/OFF BENCHMARK SWEEP"
echo " ${total} configs x 3 prompts x 3 runs"
echo " Endpoint: ${ENDPOINT}"
echo "================================================================"

t_start=$(date +%s)
config_idx=0

for config_entry in "${CONFIGS[@]}"; do
    config_idx=$((config_idx + 1))
    IFS='|' read -r stem mtp_state args <<< "$config_entry"

    echo ""
    echo "================================================================"
    echo " [${config_idx}/${total}] ${stem} MTP=${mtp_state}"
    echo "================================================================"

    kill_bench_server
    echo " Starting llama-server..."
    start_bench_server "$args"

    if ! poll_health; then
        echo " SKIPPING"
        kill_bench_server
        continue
    fi

    for len in "${PROMPT_LENS[@]}"; do
        prompt_file="${PROMPTS_DIR}/p${len}.txt"
        [ -f "$prompt_file" ] || { echo " Missing p${len}.txt"; continue; }
        echo " -- p${len} --"
        for run in 1 2 3; do
            outfile="${RESULTS_DIR}/${stem}__mtp-${mtp_state}__len${len}__run${run}.json"
            printf " run %d: " "$run"
            if send_request "$prompt_file" "$outfile"; then
                print_metrics "$outfile"
            fi
            sleep 1
        done
    done

    echo " Killing..."
    kill_bench_server
done

t_end=$(date +%s)
elapsed=$(( t_end - t_start ))
echo ""
echo "================================================================"
echo " SWEEP COMPLETE in $(( elapsed / 60 ))m $(( elapsed % 60 ))s"
echo " Run: python3 $(dirname "$0")/analyze.py"
echo "================================================================"
67
bench/prompts/p1024.txt
Normal file
@@ -0,0 +1,67 @@
You will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings. I arrived here yesterday, and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking.

I am already far north of London, and as I walk in the streets of
Petersburgh, I feel a cold northern breeze play upon my cheeks, which
braces my nerves and fills me with delight. Do you understand this
feeling? This breeze, which has travelled from the regions towards
which I am advancing, gives me a foretaste of those icy climes.
Inspirited by this wind of promise, my daydreams become more fervent
and vivid. I try in vain to be persuaded that the pole is the seat of
frost and desolation; it ever presents itself to my imagination as the
region of beauty and delight. There, Margaret, the sun is for ever
visible, its broad disk just skirting the horizon and diffusing a
perpetual splendour. There—for with your leave, my sister, I will put
some trust in preceding navigators—there snow and frost are banished;
and, sailing over a calm sea, we may be wafted to a land surpassing in
wonders and in beauty every region hitherto discovered on the habitable
globe. Its productions and features may be without example, as the
phenomena of the heavenly bodies undoubtedly are in those undiscovered
solitudes. What may not be expected in a country of eternal light? I
may there discover the wondrous power which attracts the needle and may
regulate a thousand celestial observations that require only this
voyage to render their seeming eccentricities consistent for ever. I
shall satiate my ardent curiosity with the sight of a part of the world
never before visited, and may tread a land never before imprinted by
the foot of man. These are my enticements, and they are sufficient to
conquer all fear of danger or death and to induce me to commence this
laborious voyage with the joy a child feels when he embarks in a little
boat, with his holiday mates, on an expedition of discovery up his
native river. But supposing all these conjectures to be false, you
cannot contest the inestimable benefit which I shall confer on all
mankind, to the last generation, by discovering a passage near the pole
to those countries, to reach which at present so many months are
requisite; or by ascertaining the secret of the magnet, which, if at
all possible, can only be effected by an undertaking such as mine.

These reflections have dispelled the agitation with which I began my
letter, and I feel my heart glow with an enthusiasm which elevates me
to heaven, for nothing contributes so much to tranquillise the mind as
a steady purpose—a point on which the soul may fix its intellectual
eye. This expedition has been the favourite dream of my early years. I
have read with ardour the accounts of the various voyages which have
been made in the prospect of arriving at the North Pacific Ocean
through the seas which surround the pole. You may remember that a
history of all the voyages made for purposes of discovery composed the
whole of our good Uncle Thomas’ library. My education was neglected,
yet I was passionately fond of reading. These volumes were my study
day and night, and my familiarity with them increased that regret which
I had felt, as a child, on learning that my father’s dying injunction
had forbidden my uncle to allow me to embark in a seafaring life.

These visions faded when I perused, for the first time, those poets
whose effusions entranced my soul and lifted it to heaven. I also
became a poet and for one year lived in a paradise of my own creation;
I imagined that I also might obtain a niche in the temple where the
names of Homer and Shakespeare are consecrated. You are well
acquainted with my failure and how heavily I bore the disappointment.
But just at that time I inherited the fortune of my cousin, and my
thoughts were turned into the channel of their earlier bent.

Six years have passed since I resolved on my present undertaking. I
can, even now, remember the hour from which I dedicated myself to this
great enterprise. I commenced by inuring my body to hardship.
Continue this passage in exactly 200 tokens of prose.
18
bench/prompts/p256.txt
Normal file
@@ -0,0 +1,18 @@
You will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings. I arrived here yesterday, and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking.

I am already far north of London, and as I walk in the streets of
Petersburgh, I feel a cold northern breeze play upon my cheeks, which
braces my nerves and fills me with delight. Do you understand this
feeling? This breeze, which has travelled from the regions towards
which I am advancing, gives me a foretaste of those icy climes.
Inspirited by this wind of promise, my daydreams become more fervent
and vivid. I try in vain to be persuaded that the pole is the seat of
frost and desolation; it ever presents itself to my imagination as the
region of beauty and delight. There, Margaret, the sun is for ever
visible, its broad disk just skirting the horizon and diffusing a
perpetual splendour.
Continue this passage in exactly 200 tokens of prose.
319
bench/prompts/p4096.txt
Normal file
@@ -0,0 +1,319 @@
You will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings. I arrived here yesterday, and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking.

I am already far north of London, and as I walk in the streets of
Petersburgh, I feel a cold northern breeze play upon my cheeks, which
braces my nerves and fills me with delight. Do you understand this
feeling? This breeze, which has travelled from the regions towards
which I am advancing, gives me a foretaste of those icy climes.
Inspirited by this wind of promise, my daydreams become more fervent
and vivid. I try in vain to be persuaded that the pole is the seat of
frost and desolation; it ever presents itself to my imagination as the
region of beauty and delight. There, Margaret, the sun is for ever
visible, its broad disk just skirting the horizon and diffusing a
perpetual splendour. There—for with your leave, my sister, I will put
some trust in preceding navigators—there snow and frost are banished;
and, sailing over a calm sea, we may be wafted to a land surpassing in
wonders and in beauty every region hitherto discovered on the habitable
globe. Its productions and features may be without example, as the
phenomena of the heavenly bodies undoubtedly are in those undiscovered
solitudes. What may not be expected in a country of eternal light? I
may there discover the wondrous power which attracts the needle and may
regulate a thousand celestial observations that require only this
voyage to render their seeming eccentricities consistent for ever. I
shall satiate my ardent curiosity with the sight of a part of the world
never before visited, and may tread a land never before imprinted by
the foot of man. These are my enticements, and they are sufficient to
conquer all fear of danger or death and to induce me to commence this
laborious voyage with the joy a child feels when he embarks in a little
boat, with his holiday mates, on an expedition of discovery up his
native river. But supposing all these conjectures to be false, you
cannot contest the inestimable benefit which I shall confer on all
mankind, to the last generation, by discovering a passage near the pole
to those countries, to reach which at present so many months are
requisite; or by ascertaining the secret of the magnet, which, if at
all possible, can only be effected by an undertaking such as mine.

These reflections have dispelled the agitation with which I began my
letter, and I feel my heart glow with an enthusiasm which elevates me
to heaven, for nothing contributes so much to tranquillise the mind as
a steady purpose—a point on which the soul may fix its intellectual
eye. This expedition has been the favourite dream of my early years. I
have read with ardour the accounts of the various voyages which have
been made in the prospect of arriving at the North Pacific Ocean
through the seas which surround the pole. You may remember that a
history of all the voyages made for purposes of discovery composed the
whole of our good Uncle Thomas’ library. My education was neglected,
yet I was passionately fond of reading. These volumes were my study
day and night, and my familiarity with them increased that regret which
I had felt, as a child, on learning that my father’s dying injunction
had forbidden my uncle to allow me to embark in a seafaring life.

These visions faded when I perused, for the first time, those poets
whose effusions entranced my soul and lifted it to heaven. I also
became a poet and for one year lived in a paradise of my own creation;
I imagined that I also might obtain a niche in the temple where the
names of Homer and Shakespeare are consecrated. You are well
acquainted with my failure and how heavily I bore the disappointment.
But just at that time I inherited the fortune of my cousin, and my
thoughts were turned into the channel of their earlier bent.

Six years have passed since I resolved on my present undertaking. I
can, even now, remember the hour from which I dedicated myself to this
great enterprise. I commenced by inuring my body to hardship. I
accompanied the whale-fishers on several expeditions to the North Sea;
I voluntarily endured cold, famine, thirst, and want of sleep; I often
worked harder than the common sailors during the day and devoted my
nights to the study of mathematics, the theory of medicine, and those
branches of physical science from which a naval adventurer might derive
the greatest practical advantage. Twice I actually hired myself as an
under-mate in a Greenland whaler, and acquitted myself to admiration. I
must own I felt a little proud when my captain offered me the second
dignity in the vessel and entreated me to remain with the greatest
earnestness, so valuable did he consider my services.

And now, dear Margaret, do I not deserve to accomplish some great purpose?
My life might have been passed in ease and luxury, but I preferred glory to
every enticement that wealth placed in my path. Oh, that some encouraging
voice would answer in the affirmative! My courage and my resolution is
firm; but my hopes fluctuate, and my spirits are often depressed. I am
about to proceed on a long and difficult voyage, the emergencies of which
will demand all my fortitude: I am required not only to raise the spirits
of others, but sometimes to sustain my own, when theirs are failing.

This is the most favourable period for travelling in Russia. They fly
quickly over the snow in their sledges; the motion is pleasant, and, in
my opinion, far more agreeable than that of an English stagecoach. The
cold is not excessive, if you are wrapped in furs—a dress which I have
already adopted, for there is a great difference between walking the
deck and remaining seated motionless for hours, when no exercise
prevents the blood from actually freezing in your veins. I have no
ambition to lose my life on the post-road between St. Petersburgh and
Archangel.

I shall depart for the latter town in a fortnight or three weeks; and my
intention is to hire a ship there, which can easily be done by paying the
insurance for the owner, and to engage as many sailors as I think necessary
among those who are accustomed to the whale-fishing. I do not intend to
sail until the month of June; and when shall I return? Ah, dear sister, how
can I answer this question? If I succeed, many, many months, perhaps years,
will pass before you and I may meet. If I fail, you will see me again soon,
or never.

Farewell, my dear, excellent Margaret. Heaven shower down blessings on you,
and save me, that I may again and again testify my gratitude for all your
love and kindness.

Your affectionate brother,

R. Walton




Letter 2

_To Mrs. Saville, England._

Archangel, 28th March, 17—.


How slowly the time passes here, encompassed as I am by frost and snow!
Yet a second step is taken towards my enterprise. I have hired a
vessel and am occupied in collecting my sailors; those whom I have
already engaged appear to be men on whom I can depend and are certainly
possessed of dauntless courage.

But I have one want which I have never yet been able to satisfy, and the
absence of the object of which I now feel as a most severe evil, I have no
friend, Margaret: when I am glowing with the enthusiasm of success, there
will be none to participate my joy; if I am assailed by disappointment, no
one will endeavour to sustain me in dejection. I shall commit my thoughts
to paper, it is true; but that is a poor medium for the communication of
feeling. I desire the company of a man who could sympathise with me, whose
eyes would reply to mine. You may deem me romantic, my dear sister, but I
bitterly feel the want of a friend. I have no one near me, gentle yet
courageous, possessed of a cultivated as well as of a capacious mind, whose
tastes are like my own, to approve or amend my plans. How would such a
friend repair the faults of your poor brother! I am too ardent in execution
and too impatient of difficulties. But it is a still greater evil to me
that I am self-educated: for the first fourteen years of my life I ran wild
on a common and read nothing but our Uncle Thomas’ books of voyages.
At that age I became acquainted with the celebrated poets of our own
country; but it was only when it had ceased to be in my power to derive its
most important benefits from such a conviction that I perceived the
necessity of becoming acquainted with more languages than that of my native
country. Now I am twenty-eight and am in reality more illiterate than many
schoolboys of fifteen. It is true that I have thought more and that my
daydreams are more extended and magnificent, but they want (as the painters
call it) _keeping;_ and I greatly need a friend who would have sense
enough not to despise me as romantic, and affection enough for me to
endeavour to regulate my mind.

Well, these are useless complaints; I shall certainly find no friend on the
wide ocean, nor even here in Archangel, among merchants and seamen. Yet
some feelings, unallied to the dross of human nature, beat even in these
rugged bosoms. My lieutenant, for instance, is a man of wonderful courage
and enterprise; he is madly desirous of glory, or rather, to word my phrase
more characteristically, of advancement in his profession. He is an
Englishman, and in the midst of national and professional prejudices,
unsoftened by cultivation, retains some of the noblest endowments of
humanity. I first became acquainted with him on board a whale vessel;
finding that he was unemployed in this city, I easily engaged him to assist
in my enterprise.

The master is a person of an excellent disposition and is remarkable in the
ship for his gentleness and the mildness of his discipline. This
circumstance, added to his well-known integrity and dauntless courage, made
me very desirous to engage him. A youth passed in solitude, my best years
spent under your gentle and feminine fosterage, has so refined the
groundwork of my character that I cannot overcome an intense distaste to
the usual brutality exercised on board ship: I have never believed it to be
necessary, and when I heard of a mariner equally noted for his kindliness
of heart and the respect and obedience paid to him by his crew, I felt
myself peculiarly fortunate in being able to secure his services. I heard
of him first in rather a romantic manner, from a lady who owes to him the
happiness of her life. This, briefly, is his story. Some years ago he loved
a young Russian lady of moderate fortune, and having amassed a considerable
sum in prize-money, the father of the girl consented to the match. He saw
his mistress once before the destined ceremony; but she was bathed in
tears, and throwing herself at his feet, entreated him to spare her,
confessing at the same time that she loved another, but that he was poor,
and that her father would never consent to the union. My generous friend
reassured the suppliant, and on being informed of the name of her lover,
instantly abandoned his pursuit. He had already bought a farm with his
money, on which he had designed to pass the remainder of his life; but he
bestowed the whole on his rival, together with the remains of his
prize-money to purchase stock, and then himself solicited the young
woman’s father to consent to her marriage with her lover. But the old
man decidedly refused, thinking himself bound in honour to my friend, who,
when he found the father inexorable, quitted his country, nor returned
until he heard that his former mistress was married according to her
inclinations. “What a noble fellow!” you will exclaim. He is
so; but then he is wholly uneducated: he is as silent as a Turk, and a kind
of ignorant carelessness attends him, which, while it renders his conduct
the more astonishing, detracts from the interest and sympathy which
otherwise he would command.

Yet do not suppose, because I complain a little or because I can
conceive a consolation for my toils which I may never know, that I am
wavering in my resolutions. Those are as fixed as fate, and my voyage
is only now delayed until the weather shall permit my embarkation. The
winter has been dreadfully severe, but the spring promises well, and it
is considered as a remarkably early season, so that perhaps I may sail
sooner than I expected. I shall do nothing rashly: you know me
sufficiently to confide in my prudence and considerateness whenever the
safety of others is committed to my care.

I cannot describe to you my sensations on the near prospect of my
undertaking. It is impossible to communicate to you a conception of
the trembling sensation, half pleasurable and half fearful, with which
I am preparing to depart. I am going to unexplored regions, to “the
land of mist and snow,” but I shall kill no albatross; therefore do not
be alarmed for my safety or if I should come back to you as worn and
woeful as the “Ancient Mariner.” You will smile at my allusion, but I
will disclose a secret. I have often attributed my attachment to, my
passionate enthusiasm for, the dangerous mysteries of ocean to that
production of the most imaginative of modern poets. There is something
at work in my soul which I do not understand. I am practically
industrious—painstaking, a workman to execute with perseverance and
labour—but besides this there is a love for the marvellous, a belief
in the marvellous, intertwined in all my projects, which hurries me out
of the common pathways of men, even to the wild sea and unvisited
regions I am about to explore.

But to return to dearer considerations. Shall I meet you again, after
having traversed immense seas, and returned by the most southern cape of
Africa or America? I dare not expect such success, yet I cannot bear to
look on the reverse of the picture. Continue for the present to write to
me by every opportunity: I may receive your letters on some occasions when
I need them most to support my spirits. I love you very tenderly.
Remember me with affection, should you never hear from me again.

Your affectionate brother,
Robert Walton




Letter 3

_To Mrs. Saville, England._

July 7th, 17—.


My dear Sister,

I write a few lines in haste to say that I am safe—and well advanced
on my voyage. This letter will reach England by a merchantman now on
its homeward voyage from Archangel; more fortunate than I, who may not
see my native land, perhaps, for many years. I am, however, in good
spirits: my men are bold and apparently firm of purpose, nor do the
floating sheets of ice that continually pass us, indicating the dangers
of the region towards which we are advancing, appear to dismay them. We
have already reached a very high latitude; but it is the height of
summer, and although not so warm as in England, the southern gales,
which blow us speedily towards those shores which I so ardently desire
to attain, breathe a degree of renovating warmth which I had not
expected.

No incidents have hitherto befallen us that would make a figure in a
letter. One or two stiff gales and the springing of a leak are
accidents which experienced navigators scarcely remember to record, and
I shall be well content if nothing worse happen to us during our voyage.

Adieu, my dear Margaret. Be assured that for my own sake, as well as
yours, I will not rashly encounter danger. I will be cool,
persevering, and prudent.

But success _shall_ crown my endeavours. Wherefore not? Thus far I
have gone, tracing a secure way over the pathless seas, the very stars
themselves being witnesses and testimonies of my triumph. Why not
still proceed over the untamed yet obedient element? What can stop the
determined heart and resolved will of man?

My swelling heart involuntarily pours itself out thus. But I must
finish. Heaven bless my beloved sister!

R.W.




Letter 4


_To Mrs. Saville, England._

August 5th, 17—.

So strange an accident has happened to us that I cannot forbear
recording it, although it is very probable that you will see me before
these papers can come into your possession.

Last Monday (July 31st) we were nearly surrounded by ice, which closed
in the ship on all sides, scarcely leaving her the sea-room in which
she floated. Our situation was somewhat dangerous, especially as we
were compassed round by a very thick fog. We accordingly lay to,
hoping that some change would take place in the atmosphere and weather.

About two o’clock the mist cleared away, and we beheld, stretched out
in every direction, vast and irregular plains of ice, which seemed to
have no end. Some of my comrades groaned, and my own mind began to
grow watchful with anxious thoughts, when a strange sight suddenly
attracted our attention and diverted our solicitude from our own
situation. We perceived a low carriage, fixed on a sledge and drawn by
dogs, pass on towards the north, at the distance of half a mile; a
being which had the shape of a man, but apparently of gigantic stature,
sat in the sledge and guided the dogs. We watched the rapid progress
of the traveller with our telescopes until he was lost among the
distant inequalities of the ice.

This appearance excited our unqualified wonder. We were, as we believed,
many hundred miles from any land; but this apparition seemed to denote that
it was not, in reality, so distant as we had supposed.
Continue this passage in exactly 200 tokens of prose.
109
benchmarks/3d/analyze.py
Normal file
@@ -0,0 +1,109 @@
#!/usr/bin/env python3
"""Analyze MTP n_max sweep results and produce summary.md."""

import json
from pathlib import Path

RESULTS_PATH = Path(__file__).parent / "results.json"
SUMMARY_PATH = Path(__file__).parent / "summary.md"


def load_results() -> list[dict]:
    data = json.loads(RESULTS_PATH.read_text())
    return [r for r in data if r.get("eval_tok_s") is not None and r.get("error") is None]


def main() -> None:
    rows = load_results()
    if not rows:
        print("No valid results found.")
        return

    models = sorted(set(r["model"] for r in rows))
    lines = ["# MTP n_max Sweep Results\n"]
    lines.append(f"**{len(rows)} valid measurements across {len(models)} models.**\n")

    recommendations = []

    for model in models:
        model_rows = [r for r in rows if r["model"] == model]
        n_max_values = sorted(set(r["n_max"] for r in model_rows))
        prompt_names = sorted(set(r["prompt"] for r in model_rows))

        lines.append(f"\n## {model}\n")

        header = "| n_max | " + " | ".join(f"{p} tok/s" for p in prompt_names) + " | avg tok/s | vs n_max=0 |"
        sep = "|-------|" + "|".join("-" * (len(p) + 7) for p in prompt_names) + "|-----------|------------|"
        lines.append(header)
        lines.append(sep)

        baseline_avg = None
        best_avg = 0
        best_n = 0

        for n in n_max_values:
            cells = []
            vals = []
            for p in prompt_names:
                matching = [r for r in model_rows if r["n_max"] == n and r["prompt"] == p]
                if matching:
                    v = matching[0]["eval_tok_s"]
                    cells.append(f"{v:.1f}")
                    vals.append(v)
                else:
                    cells.append("—")

            avg = sum(vals) / len(vals) if vals else 0
            if n == 0:
                baseline_avg = avg
                delta = "baseline"
            elif baseline_avg and baseline_avg > 0:
                pct = ((avg - baseline_avg) / baseline_avg) * 100
                delta = f"{pct:+.1f}%"
            else:
                delta = "—"

            if avg > best_avg:
                best_avg = avg
                best_n = n

            draft_info = ""
            draft_rows = [r for r in model_rows if r["n_max"] == n and r.get("draft_n")]
            if draft_rows:
                total_draft = sum(r.get("draft_n", 0) for r in draft_rows)
                total_accepted = sum(r.get("draft_n_accepted", 0) for r in draft_rows)
                if total_draft > 0:
                    accept_pct = (total_accepted / total_draft) * 100
                    draft_info = f" (accept {accept_pct:.0f}%)"

            row_str = f"| {n} | " + " | ".join(cells) + f" | {avg:.1f} | {delta}{draft_info} |"
            lines.append(row_str)

        if baseline_avg and baseline_avg > 0 and best_avg > 0:
            improvement = ((best_avg - baseline_avg) / baseline_avg) * 100
            lines.append(f"\n**Optimal n_max: {best_n}** (avg {best_avg:.1f} tok/s, {improvement:+.1f}% vs baseline)\n")
            recommendations.append((model, best_n, best_avg, improvement))
        else:
            lines.append(f"\n**Optimal n_max: {best_n}** (avg {best_avg:.1f} tok/s)\n")

    # Recommendations section
    lines.append("\n---\n")
    lines.append("## Recommended `llama_extra_args` per model\n")
    lines.append("| Model | n_max | avg tok/s | vs baseline | suggested flags |")
    lines.append("|-------|-------|-----------|-------------|-----------------|")
    for model, n, avg, imp in recommendations:
        if n > 0:
            flags = f'`["--spec-type", "draft-mtp", "--spec-draft-n-max", "{n}"]`'
        else:
            flags = "_(none — MTP not beneficial)_"
        lines.append(f"| {model} | {n} | {avg:.1f} | {imp:+.1f}% | {flags} |")

    lines.append("")
    summary = "\n".join(lines)
    SUMMARY_PATH.write_text(summary)
    print(summary)
    print(f"\nWritten to: {SUMMARY_PATH}")


if __name__ == "__main__":
    main()
248
benchmarks/3d/run_sweep.py
Normal file
@@ -0,0 +1,248 @@
#!/usr/bin/env python3
"""MTP n_max sweep across MTP-capable models via llama-sidecar.

Usage:
    python3 run_sweep.py              # full sweep
    python3 run_sweep.py --dry-run    # print matrix, no API calls
    python3 run_sweep.py --limit 1    # run first combo only (smoke)
"""

import argparse
import json
import os
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

SIDECAR_URL = os.environ.get("SIDECAR_URL", "http://100.101.41.16:8402")
RESULTS_PATH = Path(__file__).parent / "results.json"

MATRIX = [
    ("qwen3.6-35b-a3b-mxfp4", [0, 1, 2, 3]),
    ("qwen3.6-27b-mtp", [0, 1, 2, 3, 4]),
    ("qwopus3.6-27b-v2-mtp", [0, 2]),
    ("qwopus3.5-9b-coder-mtp", [0, 2]),
]

PROMPTS = {
    "short": {
        "content": "Reply with exactly five words: a haiku-like greeting.",
        "max_tokens": 100,
    },
    "medium": {
        "content": (
            "Explain how multi-token prediction speculative decoding works in transformer "
            "inference. Cover: 1) the draft model role, 2) the verification mechanism, "
            "3) acceptance rate dynamics, 4) why MoE models gain less than dense models. "
            "Aim for 400-500 words."
        ),
        "max_tokens": 700,
    },
    "long": {
        "content": (
            "Write a complete Python implementation of a simple HTTP server that "
            "accepts POST requests on /v1/chat/completions, validates JSON bodies "
            "against a basic OpenAI schema, logs each request to stdout in JSON "
            "format, and returns a hardcoded streaming response. Include error "
            "handling for malformed JSON, missing required fields, and unsupported "
            "methods. Add docstrings and type hints throughout. Show full file."
        ),
        "max_tokens": 2500,
    },
}


def build_flags(n_max: int) -> str:
    if n_max > 0:
        return f"--spec-type draft-mtp --spec-draft-n-max {n_max} --repeat-penalty 1.0"
    return "--repeat-penalty 1.0"


def sidecar_request(method: str, path: str, body: dict | None = None,
                    headers: dict | None = None, timeout: int = 180) -> dict | None:
    url = f"{SIDECAR_URL}{path}"
    data = json.dumps(body).encode() if body else None
    hdrs = {"Content-Type": "application/json"}
    if headers:
        hdrs.update(headers)
    req = Request(url, data=data, headers=hdrs, method=method)
    try:
        with urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read())
    except HTTPError as e:
        body_text = e.read().decode(errors="replace")
        try:
            return json.loads(body_text)
        except json.JSONDecodeError:
            return {"error": f"HTTP {e.code}", "body": body_text[:500]}
    except URLError as e:
        return {"error": str(e)}


def send_completion(model: str, flags: str, prompt: str, max_tokens: int) -> dict:
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    headers = {
        "X-Agent-Flags": flags,
        "X-Model-Id": model,
    }
    t0 = time.perf_counter()
    resp = sidecar_request("POST", "/v1/chat/completions", body=body, headers=headers)
    wall_ms = (time.perf_counter() - t0) * 1000
    if resp is None:
        return {"error": "no response", "wall_clock_ms": wall_ms}
    resp["wall_clock_ms"] = wall_ms
    return resp


def extract_metrics(resp: dict, model: str, n_max: int, prompt_name: str) -> dict:
    timings = resp.get("timings", {})
    usage = resp.get("usage", {})
    sidecars = sidecar_request("GET", "/sidecars") or []
    sidecar_hash = ""
    sidecar_port = 0
    if isinstance(sidecars, list):
        for s in sidecars:
            if s.get("model_id") == model:
                sidecar_hash = s.get("hash", "")
                sidecar_port = s.get("port", 0)
                break

    return {
        "model": model,
        "n_max": n_max,
        "prompt": prompt_name,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "completion_tokens": usage.get("completion_tokens"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "eval_tok_s": timings.get("predicted_per_second"),
        "prompt_tok_s": timings.get("prompt_per_second"),
        "eval_ms": timings.get("predicted_ms"),
        "prompt_ms": timings.get("prompt_ms"),
        "draft_n": timings.get("draft_n"),
        "draft_n_accepted": timings.get("draft_n_accepted"),
        "wall_clock_ms": resp.get("wall_clock_ms"),
        "sidecar_hash": sidecar_hash,
        "sidecar_port": sidecar_port,
        "error": resp.get("error"),
    }


def append_result(row: dict) -> None:
    results = []
    if RESULTS_PATH.exists():
        try:
            results = json.loads(RESULTS_PATH.read_text())
        except (json.JSONDecodeError, OSError):
            pass
|
results.append(row)
|
||||||
|
RESULTS_PATH.write_text(json.dumps(results, indent=2) + "\n")
|
||||||
|
|
||||||
|
|
||||||
|
def evict_all_sidecars() -> None:
|
||||||
|
sidecars = sidecar_request("GET", "/sidecars")
|
||||||
|
if not isinstance(sidecars, list):
|
||||||
|
return
|
||||||
|
for s in sidecars:
|
||||||
|
h = s.get("hash", "")
|
||||||
|
if h:
|
||||||
|
sidecar_request("DELETE", f"/sidecars/{h}")
|
||||||
|
|
||||||
|
|
||||||
|
def run_combo(model: str, n_max: int, combo_idx: int, total_combos: int,
|
||||||
|
prompt_names: list[str]) -> None:
|
||||||
|
flags = build_flags(n_max)
|
||||||
|
label = f"[{combo_idx}/{total_combos}] {model} n_max={n_max}"
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print(f"{label}")
|
||||||
|
print(f" flags: {flags}")
|
||||||
|
print(f"{'='*60}")
|
||||||
|
|
||||||
|
for pname in prompt_names:
|
||||||
|
p = PROMPTS[pname]
|
||||||
|
# Warmup
|
||||||
|
print(f" {pname}: warmup...", end="", flush=True)
|
||||||
|
send_completion(model, flags, p["content"], p["max_tokens"])
|
||||||
|
print(" done.", flush=True)
|
||||||
|
time.sleep(2)
|
||||||
|
|
||||||
|
# Record
|
||||||
|
print(f" {pname}: recording...", end="", flush=True)
|
||||||
|
resp = send_completion(model, flags, p["content"], p["max_tokens"])
|
||||||
|
row = extract_metrics(resp, model, n_max, pname)
|
||||||
|
append_result(row)
|
||||||
|
|
||||||
|
tok_s = row.get("eval_tok_s")
|
||||||
|
draft = row.get("draft_n")
|
||||||
|
err = row.get("error")
|
||||||
|
if err:
|
||||||
|
print(f" ERROR: {err}")
|
||||||
|
elif tok_s:
|
||||||
|
draft_str = f" draft_n={draft}" if draft else ""
|
||||||
|
print(f" {tok_s:.1f} tok/s{draft_str}")
|
||||||
|
else:
|
||||||
|
print(" (no timings in response)")
|
||||||
|
|
||||||
|
# Evict this sidecar to free VRAM
|
||||||
|
evict_all_sidecars()
|
||||||
|
print(f" evicted sidecars, sleeping 5s for VRAM release...")
|
||||||
|
time.sleep(5)
|
||||||
|
|
||||||
|
|
||||||
|
def dry_run() -> None:
|
||||||
|
combos = [(model, n) for model, ns in MATRIX for n in ns]
|
||||||
|
print(f"Dry run: {len(combos)} combos × 3 prompts × 2 calls = {len(combos)*6} API calls")
|
||||||
|
print(f"Estimated runtime: 60-90 minutes\n")
|
||||||
|
for i, (model, n_max) in enumerate(combos, 1):
|
||||||
|
flags = build_flags(n_max)
|
||||||
|
print(f" [{i}/{len(combos)}] {model} n_max={n_max}")
|
||||||
|
print(f" flags: {flags}")
|
||||||
|
for pname in PROMPTS:
|
||||||
|
p = PROMPTS[pname]
|
||||||
|
print(f" {pname}: max_tokens={p['max_tokens']}")
|
||||||
|
print(f"\nResults would be written to: {RESULTS_PATH}")
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> None:
|
||||||
|
parser = argparse.ArgumentParser(description="MTP n_max sweep benchmark")
|
||||||
|
parser.add_argument("--dry-run", action="store_true", help="Print matrix without running")
|
||||||
|
parser.add_argument("--limit", type=int, default=0, help="Run only first N combos")
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
if args.dry_run:
|
||||||
|
dry_run()
|
||||||
|
return
|
||||||
|
|
||||||
|
# Check sidecar health
|
||||||
|
health = sidecar_request("GET", "/health")
|
||||||
|
if not health or health.get("status") != "ok":
|
||||||
|
print(f"Sidecar unhealthy: {health}", file=sys.stderr)
|
||||||
|
sys.exit(1)
|
||||||
|
print(f"Sidecar healthy: {health}")
|
||||||
|
|
||||||
|
# Clear existing sidecars
|
||||||
|
evict_all_sidecars()
|
||||||
|
|
||||||
|
combos = [(model, n) for model, ns in MATRIX for n in ns]
|
||||||
|
if args.limit > 0:
|
||||||
|
combos = combos[:args.limit]
|
||||||
|
prompt_names = list(PROMPTS.keys())
|
||||||
|
|
||||||
|
t_start = time.perf_counter()
|
||||||
|
for i, (model, n_max) in enumerate(combos, 1):
|
||||||
|
run_combo(model, n_max, i, len(combos), prompt_names)
|
||||||
|
|
||||||
|
elapsed = time.perf_counter() - t_start
|
||||||
|
print(f"\nSweep complete. {len(combos)} combos in {elapsed/60:.1f} minutes.")
|
||||||
|
print(f"Results: {RESULTS_PATH}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
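The rows written by `append_result` carry enough fields (`n_max`, `eval_tok_s`, `draft_n`, `draft_n_accepted`) to derive a draft acceptance rate and a speedup over the non-speculative run. The following is a minimal sketch of such a post-processing step, not part of the commit: `summarize` is a hypothetical helper, and it assumes the sweep matrix includes an `n_max == 0` run to serve as the baseline.

```python
from collections import defaultdict


def summarize(rows: list[dict]) -> dict[tuple[str, str], list[dict]]:
    """Group result rows by (model, prompt) and derive per-n_max stats.

    Hypothetical helper: assumes rows shaped like the output of
    extract_metrics, and that n_max == 0 marks the run without MTP.
    """
    grouped: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for row in rows:
        if row.get("error") or not row.get("eval_tok_s"):
            continue  # skip failed runs and runs without timings
        grouped[(row["model"], row["prompt"])].append(row)

    summary: dict[tuple[str, str], list[dict]] = {}
    for key, group in grouped.items():
        baseline = next((r["eval_tok_s"] for r in group if r["n_max"] == 0), None)
        stats = []
        for r in sorted(group, key=lambda r: r["n_max"]):
            drafted = r.get("draft_n") or 0
            accepted = r.get("draft_n_accepted") or 0
            stats.append({
                "n_max": r["n_max"],
                "tok_s": r["eval_tok_s"],
                # Fraction of drafted tokens the target model accepted.
                "acceptance": accepted / drafted if drafted else None,
                # Throughput relative to the n_max == 0 run, if present.
                "speedup": r["eval_tok_s"] / baseline if baseline else None,
            })
        summary[key] = stats
    return summary
```

Feeding it the parsed contents of the results file gives one list per model and prompt, ordered by `n_max`, which makes it easy to spot where a larger draft window stops paying off.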
74
cmd/llama-sidecar/main.go
Normal file
@@ -0,0 +1,74 @@
package main

import (
	"context"
	"fmt"
	"log/slog"
	"net/http"
	"os"
	"time"

	"github.com/indifferentketchup/llama-sidecar/internal/config"
	"github.com/indifferentketchup/llama-sidecar/internal/pool"
	"github.com/indifferentketchup/llama-sidecar/internal/server"
	"github.com/indifferentketchup/llama-sidecar/internal/winsvc"
)

func main() {
	cfg, err := config.Load()
	if err != nil {
		fmt.Fprintf(os.Stderr, "config error: %v\n", err)
		os.Exit(1)
	}

	initLogger(cfg.LogLevel)
	slog.Info("starting llama-sidecar",
		"bind", cfg.Bind,
		"max_sidecars", cfg.MaxSidecars,
		"port_range", fmt.Sprintf("%d-%d", cfg.PortRangeLo, cfg.PortRangeHi),
		"models", len(cfg.ModelDirMap),
		"base_args", cfg.BaseArgs,
	)

	startedAt := time.Now()
	spawner := &pool.RealSpawner{}
	p := pool.New(cfg, spawner)
	srv := server.New(cfg, p, startedAt)

	go func() {
		slog.Info("listening", "addr", cfg.Bind)
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			slog.Error("server error", "err", err)
			os.Exit(1)
		}
	}()

	winsvc.RegisterShutdownHandler(context.Background(), func(ctx context.Context) error {
		slog.Info("draining HTTP server")
		drainCtx, drainCancel := context.WithTimeout(ctx, 10*time.Second)
		defer drainCancel()
		if err := srv.Shutdown(drainCtx); err != nil {
			slog.Error("HTTP drain failed", "err", err)
		}
		slog.Info("shutting down sidecar pool")
		poolCtx, poolCancel := context.WithTimeout(ctx, 30*time.Second)
		defer poolCancel()
		return p.Shutdown(poolCtx)
	})
}

func initLogger(level string) {
	var lvl slog.Level
	switch level {
	case "debug":
		lvl = slog.LevelDebug
	case "warn":
		lvl = slog.LevelWarn
	case "error":
		lvl = slog.LevelError
	default:
		lvl = slog.LevelInfo
	}
	handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: lvl})
	slog.SetDefault(slog.New(handler))
}
72
eval/ab/prompts.json
Normal file
@@ -0,0 +1,72 @@
[
  {
    "id": "review-1",
    "agent": "Code Reviewer",
    "prompt": "Review the `buildHeadPayload` function in `apps/server/src/services/compaction.ts`. It was recently patched in v1.13.6 to embed `reasoning_parts` as a `<reasoning>...</reasoning>` prose prefix on the assistant content for tool-bearing turns. Check: does the current implementation handle the case where `reasoning_parts` is an empty array? Does it handle turns that have both reasoning_parts AND non-empty text content (not just tool calls)? Cite file:line for any issues."
  },
  {
    "id": "review-2",
    "agent": "Code Reviewer",
    "prompt": "Review the path guard layer in `apps/coder/services/path_guard.ts`. It enforces per-project scoping with a blanket `/opt:rw` mount and policy at the tool layer. Check for: symlink traversal (does it resolve symlinks before checking?), double-encoding attacks on path components, race conditions between check and use (TOCTOU), and whether `extraRoots` from `request_read_access` grants could be abused to escape the project scope. Cite file:line."
  },
  {
    "id": "debug-1",
    "agent": "Debugger",
    "prompt": "Bug report: after a long BooCode chat session (~40 messages), the compaction trigger fires but the resulting summary is empty — the assistant message with `summary=true` has blank content. The `ctx_max` is correctly fetched from `/upstream/<model>/props` (verified in logs). The `needs_compaction` flag is being set. But the summary inference returns an empty string. This started happening after the v1.13.7 compaction trigger change that lowered the threshold to `floor(0.85 * ctx_max)`. Diagnose: what code path could produce an empty summary, and what would you check first?"
  },
  {
    "id": "debug-2",
    "agent": "Debugger",
    "prompt": "Bug report: BooTerm terminal pane shows garbled output past column 66 on initial open, but corrects itself after manually resizing the browser window. The `stty size` inside the terminal reports `82 66` even though the pane is visually ~132 columns wide. tmux `list-windows` confirms the session was created at 66 columns. This only happens when opening a terminal pane via the split-pane button, not when opening it as the sole pane. Diagnose the root cause in `apps/web/src/components/panes/TerminalPane.tsx`."
  },
  {
    "id": "refactor-1",
    "agent": "Refactorer",
    "prompt": "The `streamCompletion` function in `apps/server/src/services/provider.ts` has grown to handle: AI SDK v6 streaming, XML fallback parsing for qwen3.6 tool-call emissions, abort signal handling (the explicit `if (signal?.aborted) throw` patch), reasoning-delta counting, and usage extraction. It's now ~200 lines. Propose a refactor that separates concerns without breaking the streaming contract. The function must remain a single entry point for callers."
  },
  {
    "id": "refactor-2",
    "agent": "Refactorer",
    "prompt": "The WebSocket frame publishing in BooCode went through two batches (v1.13.12 + v1.13.13) that converted ~80 publish sites to typed `publishFrame`/`publishUserFrame` wrappers with Zod validation. The schemas are duplicated byte-identical between `apps/server/src/types/ws-frames.ts` and `apps/web/src/api/ws-frames.ts` with a parity test. Propose a refactor to share the schema definition from a single source instead of maintaining the duplication + parity test."
  },
  {
    "id": "architect-1",
    "agent": "Architect",
    "prompt": "Design the system-prompt prefix cache for BooCode. Context: `buildSystemPromptWithFingerprint` already computes a SHA-256 of the assembled prefix and logs drift. The prefix is rebuilt on every inference turn from: project settings, agent instructions (AGENTS.md), skills, session-level overrides, and web_search_enabled flag. Most of these don't change between turns in the same session. Design a cache that avoids rebuilding+rehashing on every turn. Consider: process-memory vs DB-backed, invalidation strategy, cache key shape, and whether the fingerprint can serve as the cache key itself."
  },
  {
    "id": "architect-2",
    "agent": "Architect",
    "prompt": "Design the v2.5 task model integration with BooCoder's ACP dispatch. Context: v2.5.0-task-model just shipped a `tasks` table and lightweight task model services. BooCoder dispatches external agents (opencode, goose, claude) via ACP or PTY. Design how a task created in BooChat should flow through to a BooCoder dispatch: task creation → agent selection → ACP session → status updates back to the task row → completion. Consider: which fields from the task row map to ACP session params, how task status syncs with the agent's exit code, and how the UI surfaces progress."
  },
  {
    "id": "security-1",
    "agent": "Security Auditor",
    "prompt": "Audit the `web_fetch` tool implementation in BooCode. It fetches arbitrary URLs on behalf of the LLM agent. Check for: SSRF against internal Tailscale IPs (100.x.x.x), DNS rebinding, redirect following to internal hosts, response size limits, content-type validation, and whether the `url_guard.ts` layer covers all cases. The tool is gated by `session.web_search_enabled` but once enabled, the URL is user-agent-controlled (the LLM decides what to fetch)."
  },
  {
    "id": "security-2",
    "agent": "Security Auditor",
    "prompt": "Audit the `request_read_access` tool and `allowed_read_paths` grant mechanism (v1.13.17). When an agent needs to read files outside its project scope, it calls `request_read_access(path)` which triggers an `ask_user_input` elicitation for approval. On approval, the path is added to `allowed_read_paths` for that session, and `pathGuard` is extended with `extraRoots`. Check: can the agent request a path like `/etc/shadow` or `/opt/boocode/.env`? Is the grant scoped to the session or persistent? Can the path be a symlink that resolves to a sensitive location after the grant?"
  },
  {
    "id": "prompt-1",
    "agent": "Prompt Builder",
    "prompt": "Write a Claude Code dispatch prompt for: adding a new BooCode agent called 'Documenter' to AGENTS.md. The agent should read source files and produce inline JSDoc/TSDoc comments. It should use the read-only tool set. Temperature 0.4, steps 10. The prompt should include pre-flight checks, the exact file to modify, backup instructions, and verification steps."
  },
  {
    "id": "prompt-2",
    "agent": "Prompt Builder",
    "prompt": "Write an OpenCode dispatch prompt for: fixing the codecontext sidecar to handle projects with more than 10,000 files without OOMing. The fork is at /opt/forks/codecontext/. The agent should investigate the memory profile of the graph analysis pass, identify the allocation hotspot, and propose a streaming or chunked alternative. Include #careful hashtag, backup rules, and stop conditions."
  },
  {
    "id": "recon-1",
    "agent": "Recon",
    "prompt": "Map the BooCode monorepo at /opt/boocode/. I need: top-level directory structure, the three apps and their roles, how they share the database, the Docker container topology, and the key service files in apps/server/src/services/. Identify the data flow from a user message in BooChat through to the LLM inference call and back."
  },
  {
    "id": "recon-2",
    "agent": "Recon",
    "prompt": "Map the codecontext fork at /opt/forks/codecontext/. I need: the MCP tool surface (what tools are exposed), the parser architecture (how tree-sitter grammars are registered), the graph analysis pipeline (how dependencies and call graphs are built), and the codesight-merge additions (blast radius, hot files, routes, middleware). Identify the main entry points and the caching layer."
  }
]
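Because `run.sh` keys its result directories on each prompt's `id`, a duplicate or missing field in `prompts.json` would silently overwrite or skip results. A minimal sketch of a pre-flight check follows; `validate_prompts` is a hypothetical helper that is not part of the commit, and the three required keys are taken from the file above.

```python
from collections import Counter

REQUIRED_KEYS = {"id", "agent", "prompt"}


def validate_prompts(prompts: list[dict]) -> Counter:
    """Check required keys and unique ids; return the prompt count per agent.

    Hypothetical helper for illustration only.
    """
    for index, entry in enumerate(prompts):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"prompt #{index} is missing keys: {sorted(missing)}")

    id_counts = Counter(entry["id"] for entry in prompts)
    duplicates = sorted(pid for pid, n in id_counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate prompt ids: {duplicates}")

    return Counter(entry["agent"] for entry in prompts)
```

Calling it on `json.load(open("prompts.json"))` before a run would surface a malformed entry up front rather than partway through a multi-hour sweep.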
242
eval/ab/run.sh
Executable file
@@ -0,0 +1,242 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
ENDPOINT="http://100.101.41.16:8401/v1"
PROMPTS_FILE="${SCRIPT_DIR}/prompts.json"
RESULTS_DIR="${SCRIPT_DIR}/results"
COMPARE_FILE="${SCRIPT_DIR}/COMPARE.md"
TIMING_FILE="${SCRIPT_DIR}/timing.csv"

MODELS=(
    qwen3.6-35b-a3b-mxfp4
    qwen3-coder-30b-apex
    qwen3.6-27b-mtp
    qwopus3.5-4b-mtp
    qwen3.5-9b-deepseek-v4-mtp
    qwopus3.6-35b-a3b-v1
    qwopus3.6-27b-v2-mtp
    qwopus3.5-9b-coder-mtp
)

mkdir -p "$RESULTS_DIR"

# ── Parse prompts ─────────────────────────────────────────────────────

PROMPT_COUNT=$(python3 -c "import json; print(len(json.load(open('${PROMPTS_FILE}'))))")
TOTAL=$((PROMPT_COUNT * ${#MODELS[@]}))
EST_MIN=$(( TOTAL * 30 / 60 ))

echo "================================================================"
echo " A/B MODEL COMPARISON"
echo " ${PROMPT_COUNT} prompts × ${#MODELS[@]} models = ${TOTAL} requests"
echo " Estimated runtime: ~${EST_MIN} minutes"
echo " Endpoint: ${ENDPOINT}"
echo "================================================================"
echo ""

# ── Main loop: models (outer) × prompts (inner) ──────────────────────
# One model load per model, all prompts answered, then swap.

t_start=$(date +%s)
done_count=0

for model in "${MODELS[@]}"; do
    echo ""
    echo "================================================================"
    echo " MODEL: ${model}"
    echo "================================================================"

    # Skip the warmup (and the model load) if every prompt is already cached
    all_cached=true
    for pidx in $(seq 0 $((PROMPT_COUNT - 1))); do
        PID=$(python3 -c "import json; print(json.load(open('${PROMPTS_FILE}'))[${pidx}]['id'])")
        if [ ! -f "${RESULTS_DIR}/${PID}/${model}.json" ] || [ ! -s "${RESULTS_DIR}/${PID}/${model}.json" ]; then
            all_cached=false
            break
        fi
    done

    if [ "$all_cached" = "true" ]; then
        echo " All ${PROMPT_COUNT} prompts cached, skipping model"
        for pidx in $(seq 0 $((PROMPT_COUNT - 1))); do
            done_count=$((done_count + 1))
        done
        continue
    fi

    echo " Warming up..."
    curl -s -X POST "${ENDPOINT}/chat/completions" \
        -H "Content-Type: application/json" \
        -d "{\"model\":\"${model}\",\"messages\":[{\"role\":\"user\",\"content\":\"Say OK.\"}],\"max_tokens\":10,\"temperature\":0}" \
        --max-time 300 > /dev/null 2>&1 || true  # a failed warmup must not abort under set -e
    echo " Warm."

    for pidx in $(seq 0 $((PROMPT_COUNT - 1))); do
        PROMPT_ID=$(python3 -c "import json; print(json.load(open('${PROMPTS_FILE}'))[${pidx}]['id'])")
        AGENT=$(python3 -c "import json; print(json.load(open('${PROMPTS_FILE}'))[${pidx}]['agent'])")

        mkdir -p "${RESULTS_DIR}/${PROMPT_ID}"
        OUT_JSON="${RESULTS_DIR}/${PROMPT_ID}/${model}.json"
        OUT_MD="${RESULTS_DIR}/${PROMPT_ID}/${model}.md"

        # Resume: skip if already done
        if [ -f "$OUT_JSON" ] && [ -s "$OUT_JSON" ]; then
            done_count=$((done_count + 1))
            echo " [${PROMPT_ID}] cached (${done_count}/${TOTAL})"
            continue
        fi

        BODY=$(python3 -c "
import json
p = json.load(open('${PROMPTS_FILE}'))[${pidx}]
print(json.dumps({
    'model': '${model}',
    'messages': [{'role': 'user', 'content': p['prompt']}],
    'temperature': 0.6,
    'max_tokens': 2048,
    'seed': 42,
    'stream': False
}))
")

        SUCCESS=0
        for attempt in 1 2; do
            HTTP_CODE=$(curl -s -w '%{http_code}' -o "$OUT_JSON" \
                --max-time 300 \
                -X POST "${ENDPOINT}/chat/completions" \
                -H "Content-Type: application/json" \
                -d "$BODY" 2>/dev/null) || true  # keep going on curl errors so the retry can run

            if [ "$HTTP_CODE" = "200" ]; then
                SUCCESS=1
                break
            else
                if [ "$attempt" = "1" ]; then
                    echo " [${PROMPT_ID}] HTTP ${HTTP_CODE}, retrying in 10s..."
                    sleep 10
                else
                    echo "ERROR: HTTP ${HTTP_CODE}" > "$OUT_MD"; rm -f "$OUT_JSON"  # drop the error body so a resume retries
                    echo " [${PROMPT_ID}] FAILED (HTTP ${HTTP_CODE})"
                fi
            fi
        done

        if [ "$SUCCESS" = "1" ]; then
            python3 -c "
import json
d = json.load(open('${OUT_JSON}'))
msg = d.get('choices', [{}])[0].get('message', {})
content = msg.get('content', '') or ''
reasoning = msg.get('reasoning_content', '') or ''
out = ''
if reasoning:
    out += '<think>\n' + reasoning + '\n</think>\n\n'
out += content
open('${OUT_MD}', 'w').write(out)
" 2>/dev/null
            done_count=$((done_count + 1))
            METRICS=$(python3 -c "
import json
d = json.load(open('${OUT_JSON}'))
t = d.get('timings', {})
tps = t.get('predicted_per_second', 0)
tok = d.get('usage', {}).get('completion_tokens', 0)
print(f'{tps:.1f}tok/s {tok}tok')
" 2>/dev/null || echo "?")
            echo " [${PROMPT_ID}] done (${METRICS}) [${done_count}/${TOTAL}]"
        fi

        sleep 2
    done
done

# ── Generate COMPARE.md ──────────────────────────────────────────────

echo ""
echo "Generating COMPARE.md..."

MODELS_JSON=$(printf '%s\n' "${MODELS[@]}" | python3 -c "import json,sys; print(json.dumps([l.strip() for l in sys.stdin if l.strip()]))")

python3 -c "
import json
from pathlib import Path

prompts = json.load(open('${PROMPTS_FILE}'))
results_dir = Path('${RESULTS_DIR}')
models = json.loads('${MODELS_JSON}')

lines = ['# A/B Model Comparison\n']

timing_rows = []

for p in prompts:
    pid = p['id']
    agent = p['agent']
    short = p['prompt'][:80]
    lines.append(f'## [{pid}] {agent}\n')
    lines.append(f'> {short}...\n')

    for model in models:
        md_path = results_dir / pid / f'{model}.md'
        json_path = results_dir / pid / f'{model}.json'
        lines.append(f'### {model}\n')
        if md_path.exists():
            content = md_path.read_text().strip()
            lines.append(f'{content}\n')
        else:
            lines.append('*(no response)*\n')

        if json_path.exists():
            try:
                d = json.loads(json_path.read_text())
                t = d.get('timings', {})
                u = d.get('usage', {})
                timing_rows.append({
                    'prompt_id': pid,
                    'model_id': model,
                    'prompt_tps': t.get('prompt_per_second', 0),
                    'predicted_tps': t.get('predicted_per_second', 0),
                    'total_tokens': u.get('total_tokens', 0),
                    'latency_ms': round((t.get('prompt_ms', 0) or 0) + (t.get('predicted_ms', 0) or 0), 1),
                })
            except Exception:
                pass
    lines.append('---\n')

# Timing table
lines.append('## Timing Summary\n')
pids = list(dict.fromkeys(r['prompt_id'] for r in timing_rows))
lines.append('| prompt | ' + ' | '.join(models) + ' |')
lines.append('|--------' + '|------' * len(models) + '|')
for pid in pids:
    cells = []
    for model in models:
        match = [r for r in timing_rows if r['prompt_id'] == pid and r['model_id'] == model]
        if match:
            cells.append(f\"{match[0]['predicted_tps']:.0f}\")
        else:
            cells.append('—')
    lines.append(f'| {pid} | ' + ' | '.join(cells) + ' |')

Path('${COMPARE_FILE}').write_text('\n'.join(lines) + '\n')
print(f'Wrote ${COMPARE_FILE}')

# timing.csv
import csv
with open('${TIMING_FILE}', 'w', newline='') as f:
    w = csv.DictWriter(f, fieldnames=['prompt_id', 'model_id', 'prompt_tps', 'predicted_tps', 'total_tokens', 'latency_ms'])
    w.writeheader()
    w.writerows(timing_rows)
print(f'Wrote ${TIMING_FILE}')
"

t_end=$(date +%s)
elapsed=$(( t_end - t_start ))
echo ""
echo "================================================================"
echo " COMPLETE in $(( elapsed / 60 ))m $(( elapsed % 60 ))s"
echo " Results: ${RESULTS_DIR}/"
echo " Compare: ${COMPARE_FILE}"
echo " Timing: ${TIMING_FILE}"
echo "================================================================"
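The resume check in `run.sh` treats any non-empty `.json` file as a finished request. A stricter test would also confirm that the file parses and contains a completion. The sketch below shows one way to express that; `is_complete` is a hypothetical helper, not part of the commit, and it assumes the OpenAI-style response shape (`choices[0].message`) that the script itself reads.

```python
import json
from pathlib import Path


def is_complete(result_path: Path) -> bool:
    """Return True only if the file holds a parseable chat completion.

    Hypothetical helper: a missing file, a non-JSON body, or a JSON error
    payload without 'choices' all count as "not done".
    """
    try:
        data = json.loads(result_path.read_text())
    except (OSError, ValueError):
        # ValueError covers json.JSONDecodeError and decoding errors.
        return False
    if not isinstance(data, dict):
        return False
    choices = data.get("choices")
    if not isinstance(choices, list) or not choices:
        return False
    first = choices[0]
    return isinstance(first, dict) and isinstance(first.get("message"), dict)
```

With a check like this, an error body saved by a failed request would be retried on the next run instead of being counted as cached.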
125
eval/analyze.py
Normal file
@@ -0,0 +1,125 @@
#!/usr/bin/env python3
"""Generate SUMMARY.md from scores.csv."""

import csv
from collections import defaultdict
from pathlib import Path

CSV_PATH = Path(__file__).parent / "scores.csv"
SUMMARY_PATH = Path(__file__).parent / "SUMMARY.md"


def load_scores() -> list[dict]:
    rows = []
    with open(CSV_PATH, newline="") as f:
        for row in csv.DictReader(f):
            row["correct"] = row["correct"].lower() in ("true", "1", "yes")
            row["latency_ms"] = float(row.get("latency_ms", 0) or 0)
            rows.append(row)
    return rows


def main() -> None:
    rows = load_scores()
    if not rows:
        print("No data in scores.csv")
        return

    models = sorted(set(r["model"] for r in rows))
    benchmarks = ["mmlu", "gsm8k", "humaneval"]

    # Compute scores
    scores = {}  # (model, bench) -> [correct, total]
    for r in rows:
        key = (r["model"], r["benchmark"])
        if key not in scores:
            scores[key] = [0, 0]
        scores[key][1] += 1
        if r["correct"]:
            scores[key][0] += 1

    # MMLU per-category
    cat_scores = defaultdict(lambda: [0, 0])
    for r in rows:
        if r["benchmark"] == "mmlu" and r.get("category"):
            key = (r["model"], r["category"])
            cat_scores[key][1] += 1
            if r["correct"]:
                cat_scores[key][0] += 1

    categories = sorted(set(r.get("category", "") for r in rows if r.get("category")))

    lines = ["# Eval Results\n"]

    # Main table
    lines.append("## Overall Scores\n")
    header = "| Model | MMLU (%) | GSM8K (%) | HumanEval (%) | Avg (%) |"
    sep = "|-------|---------|---------|--------------|---------|"
    lines.append(header)
    lines.append(sep)

    model_avgs = []
    for model in models:
        cells = []
        pcts = []
        for bench in benchmarks:
            key = (model, bench)
            if key in scores:
                c, t = scores[key]
                pct = c / t * 100 if t > 0 else 0
                cells.append(f"{pct:.1f}")
                pcts.append(pct)
            else:
                cells.append("—")
        avg = sum(pcts) / len(pcts) if pcts else 0
        model_avgs.append((model, avg))
        cells.append(f"{avg:.1f}")
        lines.append(f"| {model} | " + " | ".join(cells) + " |")

    # Sort summary
    model_avgs.sort(key=lambda x: -x[1])
    lines.append(f"\n**Best overall: {model_avgs[0][0]}** ({model_avgs[0][1]:.1f}% avg)\n")

    # MMLU category breakdown
    if categories:
        lines.append("\n## MMLU Per-Category Breakdown\n")
        header = "| Model | " + " | ".join(c.replace("_", " ").title() for c in categories) + " |"
        sep = "|-------" + "|-------" * len(categories) + "|"
        lines.append(header)
        lines.append(sep)
        for model in models:
            cells = []
            for cat in categories:
                key = (model, cat)
                if key in cat_scores:
                    c, t = cat_scores[key]
                    cells.append(f"{c}/{t}")
                else:
                    cells.append("—")
            lines.append(f"| {model} | " + " | ".join(cells) + " |")

    # Latency summary
    lines.append("\n## Median Latency (ms)\n")
    lines.append("| Model | MMLU | GSM8K | HumanEval |")
    lines.append("|-------|------|-------|-----------|")
    for model in models:
        cells = []
        for bench in benchmarks:
            lats = sorted([r["latency_ms"] for r in rows
                           if r["model"] == model and r["benchmark"] == bench
                           and r["latency_ms"] > 0])
            if lats:
                med = (lats[(len(lats) - 1) // 2] + lats[len(lats) // 2]) / 2  # true median, also for even counts
                cells.append(f"{med:.0f}")
            else:
                cells.append("—")
        lines.append(f"| {model} | " + " | ".join(cells) + " |")

    summary = "\n".join(lines) + "\n"
    SUMMARY_PATH.write_text(summary)
    print(summary)
    print(f"\nWritten to: {SUMMARY_PATH}")


if __name__ == "__main__":
    main()
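For orientation, the aggregation in `analyze.py` can be exercised on a small in-memory sample. This is a sketch under stated assumptions: the column names (`model`, `benchmark`, `category`, `correct`, `latency_ms`) come from the script, while the column order, the sample values, and the helper name `accuracy_by_benchmark` are invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the shape analyze.py expects; values are made up.
SAMPLE = """model,benchmark,category,correct,latency_ms
model-a,mmlu,physics,True,812.5
model-a,mmlu,physics,False,790.0
model-a,gsm8k,,1,2100.0
"""


def accuracy_by_benchmark(csv_text: str) -> dict[tuple[str, str], float]:
    """Return percent correct per (model, benchmark) pair."""
    correct: Counter = Counter()
    total: Counter = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["model"], row["benchmark"])
        total[key] += 1
        # Same truthiness rule as load_scores(): "true", "1", or "yes".
        if row["correct"].lower() in ("true", "1", "yes"):
            correct[key] += 1
    return {key: 100 * correct[key] / total[key] for key in total}


if __name__ == "__main__":
    for (model, bench), pct in accuracy_by_benchmark(SAMPLE).items():
        print(f"{model} {bench}: {pct:.1f}%")
```

The same grouping key, `(model, benchmark)`, drives the overall table; the MMLU breakdown only swaps `benchmark` for `category`.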
164
eval/gsm8k.py
Normal file
@@ -0,0 +1,164 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""GSM8K 50-question subset benchmark (seed=42)."""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import random
|
||||||
|
import re
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from datasets import load_dataset
|
||||||
|
from openai import OpenAI
|
||||||
|
from tqdm import tqdm
|
||||||
|
|
||||||
|
ENDPOINT = os.environ.get("LLAMA_SWAP_URL", "http://100.101.41.16:8401/v1")
|
||||||
|
RESULTS_DIR = Path(__file__).parent / "results"
|
||||||
|
MAX_TOKENS = 512
|
||||||
|
SEED = 42
|
||||||
|
TEMPERATURE = 0
|
||||||
|
N_QUESTIONS = 50
|
||||||
|
|
||||||
|
|
||||||
|
def load_questions() -> list[dict]:
|
||||||
|
rng = random.Random(SEED)
|
||||||
|
ds = load_dataset("openai/gsm8k", "main", split="test", trust_remote_code=True)
|
||||||
|
indices = list(range(len(ds)))
|
||||||
|
rng.shuffle(indices)
|
||||||
|
questions = []
|
||||||
|
for idx in indices[:N_QUESTIONS]:
|
||||||
|
row = ds[idx]
|
||||||
|
answer_text = row["answer"]
|
||||||
|
# GSM8K answer format: "#### <number>" at end
|
||||||
|
match = re.search(r"####\s*([0-9,.-]+)", answer_text)
|
||||||
|
expected = int(match.group(1).replace(",", "")) if match else 0
|
||||||
|
questions.append({
|
||||||
|
"id": f"gsm8k_{idx}",
|
||||||
|
"question": row["question"],
|
||||||
|
"expected": expected,
|
||||||
|
})
|
||||||
|
return questions
|
||||||
|
|
||||||
|
|
||||||
|
def format_prompt(q: dict) -> str:
|
||||||
|
return (
|
||||||
|
"Solve this problem step by step, then on the final line write "
|
||||||
|
"'ANSWER: <number>'.\n\n" + q["question"]
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def parse_answer(text: str) -> int | None:
|
||||||
|
matches = re.findall(r"ANSWER:\s*([0-9,.-]+)", text, re.IGNORECASE)
|
||||||
|
if matches:
|
||||||
|
try:
|
||||||
|
return int(matches[-1].replace(",", "").rstrip("."))
|
||||||
|
except ValueError:
|
||||||
|
return None
|
||||||
|
# Fallback: last number in the response
|
||||||
|
nums = re.findall(r"-?\d[\d,]*", text)
|
||||||
|
if nums:
|
||||||
|
try:
|
||||||
|
return int(nums[-1].replace(",", ""))
|
||||||
|
except ValueError:
|
||||||
|
return None
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def run_gsm8k(model: str, client: OpenAI, questions: list[dict]) -> list[dict]:
|
||||||
|
model_dir = RESULTS_DIR / model / "gsm8k"
|
||||||
|
model_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
results = []
|
||||||
|
correct = 0
|
||||||
|
total = 0
|
||||||
|
|
||||||
|
skipped = 0
|
||||||
|
for i, q in enumerate(tqdm(questions, desc=f" GSM8K {model}", file=sys.stderr)):
|
||||||
|
expected = q["expected"]
|
||||||
|
out_path = model_dir / f"{q['id']}.json"
|
||||||
|
|
||||||
|
if out_path.exists():
|
||||||
|
try:
|
||||||
|
cached = json.loads(out_path.read_text())
|
||||||
|
raw = ""
|
||||||
|
if "choices" in cached:
|
||||||
|
msg = cached["choices"][0].get("message", {})
|
||||||
|
raw = msg.get("content", "") or msg.get("reasoning_content", "") or ""
|
||||||
|
parsed = parse_answer(raw)
|
||||||
|
is_correct = parsed is not None and parsed == expected
|
||||||
|
if is_correct:
|
||||||
|
correct += 1
|
||||||
|
total += 1
|
||||||
|
results.append({
|
||||||
|
"model": model, "benchmark": "gsm8k", "question_id": q["id"],
|
||||||
|
"correct": is_correct, "raw_answer": raw[:200],
|
||||||
|
"parsed_answer": str(parsed) if parsed is not None else "",
|
||||||
|
"expected": str(expected), "latency_ms": 0,
|
||||||
|
})
|
||||||
|
skipped += 1
|
||||||
|
continue
|
||||||
|
except (json.JSONDecodeError, KeyError):
|
||||||
|
pass
|
||||||
|
|
||||||
|
prompt = format_prompt(q)
|
||||||
|
t0 = time.time()
|
||||||
|
resp_json = None
|
||||||
|
for attempt in range(2):
|
||||||
|
try:
|
||||||
|
resp = client.chat.completions.create(
|
||||||
|
model=model,
|
||||||
|
messages=[{"role": "user", "content": prompt}],
|
||||||
|
max_tokens=MAX_TOKENS,
|
||||||
|
temperature=TEMPERATURE,
|
||||||
|
seed=SEED,
|
||||||
|
)
|
||||||
|
resp_json = resp.model_dump()
|
||||||
|
break
|
||||||
|
except Exception as e:
|
||||||
|
if attempt == 0:
|
||||||
|
time.sleep(5)
|
||||||
|
else:
|
||||||
|
resp_json = {"error": str(e)}
|
||||||
|
latency = (time.time() - t0) * 1000
|
||||||
|
|
||||||
|
raw = ""
|
||||||
|
if resp_json and "choices" in resp_json:
|
||||||
|
msg = resp_json["choices"][0].get("message", {})
|
||||||
|
raw = msg.get("content", "") or msg.get("reasoning_content", "") or ""
|
||||||
|
|
||||||
|
parsed = parse_answer(raw)
|
||||||
|
is_correct = parsed is not None and parsed == expected
|
||||||
|
if is_correct:
|
||||||
|
correct += 1
|
||||||
|
total += 1
|
||||||
|
|
||||||
|
out_path.write_text(json.dumps(resp_json, indent=2, default=str))
|
||||||
|
|
||||||
|
results.append({
|
||||||
|
"model": model,
|
||||||
|
"benchmark": "gsm8k",
|
||||||
|
"question_id": q["id"],
|
||||||
|
"correct": is_correct,
|
||||||
|
"raw_answer": raw[:200],
|
||||||
|
"parsed_answer": str(parsed) if parsed is not None else "",
|
||||||
|
"expected": str(expected),
|
||||||
|
"latency_ms": round(latency, 1),
|
||||||
|
})
|
||||||
|
|
||||||
|
if (i + 1) % 10 == 0:
|
||||||
|
print(f" [{model}] GSM8K {i+1}/{len(questions)} — {correct}/{total} ({correct/total*100:.0f}%)", file=sys.stderr)
|
||||||
|
|
||||||
|
if skipped:
|
||||||
|
print(f" [{model}] GSM8K resumed: {skipped} cached, {total-skipped} new", file=sys.stderr)
|
||||||
|
print(f" [{model}] GSM8K FINAL: {correct}/{total} ({correct/total*100:.1f}%)", file=sys.stderr)
|
||||||
|
return results
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
model = sys.argv[1] if len(sys.argv) > 1 else "qwen3.6-35b-a3b-mxfp4"
|
||||||
|
client = OpenAI(base_url=ENDPOINT, api_key="dummy")
|
||||||
|
questions = load_questions()
|
||||||
|
results = run_gsm8k(model, client, questions)
|
||||||
|
for r in results:
|
||||||
|
print(json.dumps(r))
|
||||||
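GSM8K scoring depends on pulling an integer out of a free-text `ANSWER: <number>` line, and replies often decorate the number with a currency sign, thousands separators, or a sentence-ending period. A minimal sketch of a more tolerant extractor follows. The name `parse_numeric_answer` is hypothetical and not part of gsm8k.py, and unlike the script it does not fall back to the last number in the reply.

```python
import re


def parse_numeric_answer(text):
    """Extract the last 'ANSWER: <number>' value as an int, or None.

    Tolerates thousands separators, a leading '$', and a trailing sentence
    period. Decimal values are accepted only when they are whole numbers.
    """
    matches = re.findall(
        r"ANSWER:\s*\$?\s*(-?\d[\d,]*(?:\.\d+)?)", text, re.IGNORECASE
    )
    if not matches:
        return None
    value = float(matches[-1].replace(",", ""))
    return int(value) if value.is_integer() else None


print(parse_numeric_answer("So the total is 42.\nANSWER: 42."))  # 42
print(parse_numeric_answer("ANSWER: $1,234"))  # 1234
print(parse_numeric_answer("ANSWER: 3.5"))  # None
```

Rejecting non-integer values keeps the comparison against the integer `expected` field exact rather than silently truncating.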
201
eval/humaneval.py
Normal file
@@ -0,0 +1,201 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""HumanEval benchmark — 164 problems with sandboxed execution."""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
import subprocess
|
||||||
|
import sys
|
||||||
|
import tempfile
|
||||||
|
import textwrap
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from datasets import load_dataset
|
||||||
|
from openai import OpenAI
|
||||||
|
from tqdm import tqdm
|
||||||
|
|
||||||
|
ENDPOINT = os.environ.get("LLAMA_SWAP_URL", "http://100.101.41.16:8401/v1")
|
||||||
|
RESULTS_DIR = Path(__file__).parent / "results"
|
||||||
|
MAX_TOKENS = 1024
|
||||||
|
SEED = 42
|
||||||
|
TEMPERATURE = 0
|
||||||
|
EXEC_TIMEOUT = 30
|
||||||
|
|
||||||
|
|
||||||
|
def load_problems() -> list[dict]:
|
||||||
|
ds = load_dataset("openai/openai_humaneval", split="test", trust_remote_code=True)
|
||||||
|
problems = []
|
||||||
|
for row in ds:
|
||||||
|
problems.append({
|
||||||
|
"id": row["task_id"],
|
||||||
|
"prompt": row["prompt"],
|
||||||
|
"canonical": row["canonical_solution"],
|
||||||
|
"test": row["test"],
|
||||||
|
"entry_point": row["entry_point"],
|
||||||
|
})
|
||||||
|
return problems
|
||||||
|
|
||||||
|
|
||||||
|
def extract_code(response: str, prompt: str) -> str:
|
||||||
|
# Try to find a code block
|
||||||
|
blocks = re.findall(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
|
||||||
|
if blocks:
|
||||||
|
code = blocks[0]
|
||||||
|
# If the code block contains the function signature, use it directly
|
||||||
|
if "def " in code:
|
||||||
|
return code
|
||||||
|
# Otherwise prepend the prompt (function signature)
|
||||||
|
return prompt + code
|
||||||
|
|
||||||
|
# No code block — try to extract everything from the first def onwards
|
||||||
|
lines = response.split("\n")
|
||||||
|
in_code = False
|
||||||
|
code_lines = []
|
||||||
|
for line in lines:
|
||||||
|
if line.strip().startswith("def ") or in_code:
|
||||||
|
in_code = True
|
||||||
|
code_lines.append(line)
|
||||||
|
elif in_code and line.strip() == "":
|
||||||
|
code_lines.append(line)
|
||||||
|
|
||||||
|
if code_lines:
|
||||||
|
return "\n".join(code_lines)
|
||||||
|
|
||||||
|
# Last resort: prepend prompt to raw response
|
||||||
|
return prompt + response
|
||||||
|
|
||||||
|
|
||||||
|
def run_test(code: str, test_code: str, entry_point: str) -> tuple[bool, str]:
|
||||||
|
full = code + "\n\n" + test_code + f"\n\ncheck({entry_point})\n"
|
||||||
|
|
||||||
|
with tempfile.NamedTemporaryFile(
|
||||||
|
mode="w", suffix=".py", dir="/tmp", delete=False
|
||||||
|
) as f:
|
||||||
|
f.write(full)
|
||||||
|
f.flush()
|
||||||
|
fpath = f.name
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Best-effort isolation only (not a real sandbox): cwd=/tmp, minimal env with limited PATH
|
||||||
|
env = {"PATH": "/usr/bin:/usr/local/bin", "HOME": "/tmp"}
|
||||||
|
result = subprocess.run(
|
||||||
|
[sys.executable, fpath],
|
||||||
|
capture_output=True, text=True,
|
||||||
|
timeout=EXEC_TIMEOUT,
|
||||||
|
cwd="/tmp",
|
||||||
|
env=env,
|
||||||
|
)
|
||||||
|
passed = result.returncode == 0
|
||||||
|
output = result.stderr[:500] if result.stderr else result.stdout[:500]
|
||||||
|
return passed, output
|
||||||
|
except subprocess.TimeoutExpired:
|
||||||
|
return False, "TIMEOUT"
|
||||||
|
except Exception as e:
|
||||||
|
return False, str(e)[:500]
|
||||||
|
finally:
|
||||||
|
try:
|
||||||
|
os.unlink(fpath)
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
def run_humaneval(model: str, client: OpenAI, problems: list[dict]) -> list[dict]:
|
||||||
|
model_dir = RESULTS_DIR / model / "humaneval"
|
||||||
|
model_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
results = []
|
||||||
|
correct = 0
|
||||||
|
total = 0
|
||||||
|
|
||||||
|
skipped = 0
|
||||||
|
for i, p in enumerate(tqdm(problems, desc=f" HumanEval {model}", file=sys.stderr)):
|
||||||
|
out_path = model_dir / f"{p['id'].replace('/', '_')}.json"
|
||||||
|
|
||||||
|
if out_path.exists():
|
||||||
|
try:
|
||||||
|
cached = json.loads(out_path.read_text())
|
||||||
|
passed = cached.get("passed", False)
|
||||||
|
if passed:
|
||||||
|
correct += 1
|
||||||
|
total += 1
|
||||||
|
results.append({
|
||||||
|
"model": model, "benchmark": "humaneval",
|
||||||
|
"question_id": p["id"], "correct": passed,
|
||||||
|
"raw_answer": "", "parsed_answer": "pass" if passed else "fail",
|
||||||
|
"expected": "pass", "latency_ms": 0,
|
||||||
|
})
|
||||||
|
skipped += 1
|
||||||
|
continue
|
||||||
|
except (json.JSONDecodeError, KeyError):
|
||||||
|
pass
|
||||||
|
|
||||||
|
t0 = time.time()
|
||||||
|
resp_json = None
|
||||||
|
for attempt in range(2):
|
||||||
|
try:
|
||||||
|
resp = client.chat.completions.create(
|
||||||
|
model=model,
|
||||||
|
messages=[{"role": "user", "content": (
|
||||||
|
"Complete the following Python function. "
|
||||||
|
"Return ONLY the complete function implementation.\n\n"
|
||||||
|
+ p["prompt"]
|
||||||
|
)}],
|
||||||
|
max_tokens=MAX_TOKENS,
|
||||||
|
temperature=TEMPERATURE,
|
||||||
|
seed=SEED,
|
||||||
|
)
|
||||||
|
resp_json = resp.model_dump()
|
||||||
|
break
|
||||||
|
except Exception as e:
|
||||||
|
if attempt == 0:
|
||||||
|
time.sleep(5)
|
||||||
|
else:
|
||||||
|
resp_json = {"error": str(e)}
|
||||||
|
latency = (time.time() - t0) * 1000
|
||||||
|
|
||||||
|
raw = ""
|
||||||
|
if resp_json and "choices" in resp_json:
|
||||||
|
msg = resp_json["choices"][0].get("message", {})
|
||||||
|
raw = msg.get("content", "") or msg.get("reasoning_content", "") or ""
|
||||||
|
|
||||||
|
code = extract_code(raw, p["prompt"])
|
||||||
|
passed, exec_output = run_test(code, p["test"], p["entry_point"])
|
||||||
|
if passed:
|
||||||
|
correct += 1
|
||||||
|
total += 1
|
||||||
|
|
||||||
|
out_path.write_text(json.dumps({
|
||||||
|
"response": resp_json,
|
||||||
|
"extracted_code": code[:2000],
|
||||||
|
"passed": passed,
|
||||||
|
"exec_output": exec_output,
|
||||||
|
}, indent=2, default=str))
|
||||||
|
|
||||||
|
results.append({
|
||||||
|
"model": model,
|
||||||
|
"benchmark": "humaneval",
|
||||||
|
"question_id": p["id"],
|
||||||
|
"correct": passed,
|
||||||
|
"raw_answer": raw[:200],
|
||||||
|
"parsed_answer": "pass" if passed else "fail",
|
||||||
|
"expected": "pass",
|
||||||
|
"latency_ms": round(latency, 1),
|
||||||
|
})
|
||||||
|
|
||||||
|
if (i + 1) % 10 == 0:
|
||||||
|
print(f" [{model}] HumanEval {i+1}/{len(problems)} — {correct}/{total} ({correct/total*100:.0f}%)", file=sys.stderr)
|
||||||
|
|
||||||
|
if skipped:
|
||||||
|
print(f" [{model}] HumanEval resumed: {skipped} cached, {total-skipped} new", file=sys.stderr)
|
||||||
|
print(f" [{model}] HumanEval FINAL: {correct}/{total} ({correct/total*100:.1f}%)", file=sys.stderr)
|
||||||
|
return results
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
model = sys.argv[1] if len(sys.argv) > 1 else "qwen3.6-35b-a3b-mxfp4"
|
||||||
|
client = OpenAI(base_url=ENDPOINT, api_key="dummy")
|
||||||
|
problems = load_problems()
|
||||||
|
results = run_humaneval(model, client, problems)
|
||||||
|
for r in results:
|
||||||
|
print(json.dumps(r))
|
||||||
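`run_test` above executes model-generated code in a child Python process with a timeout, a trimmed environment, and `/tmp` as the working directory. A minimal sketch of the same pattern follows. The helper `run_candidate` is hypothetical and not part of humaneval.py. It uses a throwaway directory from `tempfile.TemporaryDirectory` and the interpreter's `-I` (isolated mode) flag. Like the original, this limits accidents rather than providing a security boundary; untrusted code would still need a container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(source, timeout=30):
    """Run `source` in a child interpreter and return (passed, output)."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(source)
        try:
            result = subprocess.run(
                # -I ignores PYTHON* environment variables and the user site dir.
                [sys.executable, "-I", str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising.
            return False, "TIMEOUT"
    output = result.stderr or result.stdout
    return result.returncode == 0, output[:500]


print(run_candidate("print('ok')"))
print(run_candidate("assert 1 == 2")[0])
```

A per-run temporary directory means files written by one candidate cannot leak into the next test, and the directory is removed even when the run times out.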
166
eval/mmlu.py
Normal file
@@ -0,0 +1,166 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""MMLU 100-question subset benchmark (20 per category, seed=42)."""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import random
|
||||||
|
import re
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from datasets import load_dataset
|
||||||
|
from openai import OpenAI
|
||||||
|
from tqdm import tqdm
|
||||||
|
|
||||||
|
ENDPOINT = os.environ.get("LLAMA_SWAP_URL", "http://100.101.41.16:8401/v1")
|
||||||
|
RESULTS_DIR = Path(__file__).parent / "results"
|
||||||
|
MAX_TOKENS = 512
|
||||||
|
SEED = 42
|
||||||
|
TEMPERATURE = 0
|
||||||
|
|
||||||
|
CATEGORIES = [
|
||||||
|
"high_school_mathematics",
|
||||||
|
"college_computer_science",
|
||||||
|
"professional_medicine",
|
||||||
|
"formal_logic",
|
||||||
|
"miscellaneous",
|
||||||
|
]
|
||||||
|
PER_CATEGORY = 20
|
||||||
|
|
||||||
|
CHOICES = ["A", "B", "C", "D"]
|
||||||
|
|
||||||
|
|
||||||
|
def load_questions() -> list[dict]:
|
||||||
|
rng = random.Random(SEED)
|
||||||
|
questions = []
|
||||||
|
for cat in CATEGORIES:
|
||||||
|
ds = load_dataset("cais/mmlu", cat, split="test", trust_remote_code=True)
|
||||||
|
indices = list(range(len(ds)))
|
||||||
|
rng.shuffle(indices)
|
||||||
|
for idx in indices[:PER_CATEGORY]:
|
||||||
|
row = ds[idx]
|
||||||
|
questions.append({
|
||||||
|
"id": f"{cat}_{idx}",
|
||||||
|
"category": cat,
|
||||||
|
"question": row["question"],
|
||||||
|
"choices": row["choices"],
|
||||||
|
"answer_idx": row["answer"],
|
||||||
|
})
|
||||||
|
return questions
|
||||||
|
|
||||||
|
|
||||||
|
def format_prompt(q: dict) -> str:
|
||||||
|
lines = [f"Question: {q['question']}"]
|
||||||
|
for i, choice in enumerate(q["choices"]):
|
||||||
|
lines.append(f"{CHOICES[i]}) {choice}")
|
||||||
|
lines.append("Answer with a single letter: ")
|
||||||
|
return "\n".join(lines)
|
||||||
|
|
||||||
|
|
||||||
|
def parse_answer(text: str) -> str | None:
|
||||||
|
for ch in re.findall(r"\b[A-D]\b", text) or re.findall(r"^\s*([a-d])\b", text):
|
||||||
|
if ch.upper() in CHOICES:
|
||||||
|
return ch.upper()
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def run_mmlu(model: str, client: OpenAI, questions: list[dict]) -> list[dict]:
|
||||||
|
model_dir = RESULTS_DIR / model / "mmlu"
|
||||||
|
model_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
results = []
|
||||||
|
correct = 0
|
||||||
|
total = 0
|
||||||
|
|
||||||
|
skipped = 0
|
||||||
|
for i, q in enumerate(tqdm(questions, desc=f" MMLU {model}", file=sys.stderr)):
|
||||||
|
expected = CHOICES[q["answer_idx"]]
|
||||||
|
out_path = model_dir / f"{q['id']}.json"
|
||||||
|
|
||||||
|
# Resume: skip if result file exists
|
||||||
|
if out_path.exists():
|
||||||
|
try:
|
||||||
|
cached = json.loads(out_path.read_text())
|
||||||
|
raw = ""
|
||||||
|
if "choices" in cached:
|
||||||
|
msg = cached["choices"][0].get("message", {})
|
||||||
|
raw = msg.get("content", "") or msg.get("reasoning_content", "") or ""
|
||||||
|
parsed = parse_answer(raw)
|
||||||
|
is_correct = parsed == expected
|
||||||
|
if is_correct:
|
||||||
|
correct += 1
|
||||||
|
total += 1
|
||||||
|
results.append({
|
||||||
|
"model": model, "benchmark": "mmlu", "question_id": q["id"],
|
||||||
|
"category": q["category"], "correct": is_correct,
|
||||||
|
"raw_answer": raw[:200], "parsed_answer": parsed or "",
|
||||||
|
"expected": expected, "latency_ms": 0,
|
||||||
|
})
|
||||||
|
skipped += 1
|
||||||
|
continue
|
||||||
|
except (json.JSONDecodeError, KeyError):
|
||||||
|
pass
|
||||||
|
|
||||||
|
prompt = format_prompt(q)
|
||||||
|
t0 = time.time()
|
||||||
|
resp_json = None
|
||||||
|
for attempt in range(2):
|
||||||
|
try:
|
||||||
|
resp = client.chat.completions.create(
|
||||||
|
model=model,
|
||||||
|
messages=[{"role": "user", "content": prompt}],
|
||||||
|
max_tokens=MAX_TOKENS,
|
||||||
|
temperature=TEMPERATURE,
|
||||||
|
seed=SEED,
|
||||||
|
)
|
||||||
|
resp_json = resp.model_dump()
|
||||||
|
break
|
||||||
|
except Exception as e:
|
||||||
|
if attempt == 0:
|
||||||
|
time.sleep(5)
|
||||||
|
else:
|
||||||
|
resp_json = {"error": str(e)}
|
||||||
|
latency = (time.time() - t0) * 1000
|
||||||
|
|
||||||
|
raw = ""
|
||||||
|
if resp_json and "choices" in resp_json:
|
||||||
|
msg = resp_json["choices"][0].get("message", {})
|
||||||
|
raw = msg.get("content", "") or msg.get("reasoning_content", "") or ""
|
||||||
|
|
||||||
|
parsed = parse_answer(raw)
|
||||||
|
is_correct = parsed == expected
|
||||||
|
if is_correct:
|
||||||
|
correct += 1
|
||||||
|
total += 1
|
||||||
|
|
||||||
|
out_path.write_text(json.dumps(resp_json, indent=2, default=str))
|
||||||
|
|
||||||
|
results.append({
|
||||||
|
"model": model,
|
||||||
|
"benchmark": "mmlu",
|
||||||
|
"question_id": q["id"],
|
||||||
|
"category": q["category"],
|
||||||
|
"correct": is_correct,
|
||||||
|
"raw_answer": raw[:200],
|
||||||
|
"parsed_answer": parsed or "",
|
||||||
|
"expected": expected,
|
||||||
|
"latency_ms": round(latency, 1),
|
||||||
|
})
|
||||||
|
|
||||||
|
if (i + 1) % 10 == 0:
|
||||||
|
print(f" [{model}] MMLU {i+1}/{len(questions)} — {correct}/{total} ({correct/total*100:.0f}%)", file=sys.stderr)
|
||||||
|
|
||||||
|
if skipped:
|
||||||
|
print(f" [{model}] MMLU resumed: {skipped} cached, {total-skipped} new", file=sys.stderr)
|
||||||
|
print(f" [{model}] MMLU FINAL: {correct}/{total} ({correct/total*100:.1f}%)", file=sys.stderr)
|
||||||
|
return results
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
model = sys.argv[1] if len(sys.argv) > 1 else "qwen3.6-35b-a3b-mxfp4"
|
||||||
|
client = OpenAI(base_url=ENDPOINT, api_key="dummy")
|
||||||
|
questions = load_questions()
|
||||||
|
results = run_mmlu(model, client, questions)
|
||||||
|
for r in results:
|
||||||
|
print(json.dumps(r))
|
||||||
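MMLU replies are scored by pulling a single choice letter out of free text, which is easy to get wrong when the reply contains words such as "Answer" that themselves include the letters A to D. A minimal sketch of a stricter extractor follows. The name `parse_choice` and its three-stage order are assumptions for illustration, not the script's own logic.

```python
import re


def parse_choice(text):
    """Return the answer letter (A-D) from a model reply, or None."""
    stripped = text.strip()
    # Case 1: the reply is only a letter, optionally wrapped or punctuated,
    # for example "B", "b.", or "(C)".
    m = re.fullmatch(r"\(?([A-Da-d])\)?[.:]?", stripped)
    if m:
        return m.group(1).upper()
    # Case 2: an explicit "answer is X" or "answer: X" phrase; take the last.
    hits = re.findall(
        r"answer\s*(?:is|:)?\s*\(?([A-D])\b", stripped, re.IGNORECASE
    )
    if hits:
        return hits[-1].upper()
    # Case 3: the first standalone capital letter A-D.
    m = re.search(r"\b([A-D])\b", stripped)
    return m.group(1) if m else None


print(parse_choice("Answer: B"))  # B
print(parse_choice("The answer is C."))  # C
print(parse_choice("no idea"))  # None
```

Checking the most constrained pattern first keeps short, well-formed replies unambiguous and leaves the looser patterns for verbose replies.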
117
eval/run_all.py
Normal file
@@ -0,0 +1,117 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""Orchestrate MMLU, GSM8K, HumanEval across all models."""
|
||||||
|
|
||||||
|
import csv
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from openai import OpenAI
|
||||||
|
|
||||||
|
ENDPOINT = os.environ.get("LLAMA_SWAP_URL", "http://100.101.41.16:8401/v1")
|
||||||
|
RESULTS_DIR = Path(__file__).parent / "results"
|
||||||
|
CSV_PATH = Path(__file__).parent / "scores.csv"
|
||||||
|
|
||||||
|
MODELS = [
|
||||||
|
"qwen3.6-35b-a3b-mxfp4",
|
||||||
|
"qwen3-coder-30b-apex",
|
||||||
|
"qwen3.6-27b-mtp",
|
||||||
|
"qwopus3.5-4b-mtp",
|
||||||
|
"qwen3.5-9b-deepseek-v4-mtp",
|
||||||
|
"qwopus3.6-35b-a3b-v1",
|
||||||
|
"qwopus3.6-27b-v2-mtp",
|
||||||
|
"qwopus3.5-9b-coder-mtp",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def warmup_model(client: OpenAI, model: str) -> bool:
|
||||||
|
print(f"\n{'='*60}", file=sys.stderr)
|
||||||
|
print(f" Loading model: {model}", file=sys.stderr)
|
||||||
|
print(f"{'='*60}", file=sys.stderr)
|
||||||
|
for attempt in range(3):
|
||||||
|
try:
|
||||||
|
resp = client.chat.completions.create(
|
||||||
|
model=model,
|
||||||
|
messages=[{"role": "user", "content": "Say OK."}],
|
||||||
|
max_tokens=10,
|
||||||
|
temperature=0,
|
||||||
|
)
|
||||||
|
print(f" Warmup OK", file=sys.stderr)
|
||||||
|
return True
|
||||||
|
except Exception as e:
|
||||||
|
print(f" Warmup attempt {attempt+1} failed: {e}", file=sys.stderr)
|
||||||
|
time.sleep(10)
|
||||||
|
print(f" WARNING: warmup failed for {model}, continuing anyway", file=sys.stderr)
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def run_benchmark(module_name: str, model: str, client: OpenAI) -> list[dict]:
|
||||||
|
if module_name == "mmlu":
|
||||||
|
from mmlu import load_questions, run_mmlu
|
||||||
|
questions = load_questions()
|
||||||
|
return run_mmlu(model, client, questions)
|
||||||
|
elif module_name == "gsm8k":
|
||||||
|
from gsm8k import load_questions, run_gsm8k
|
||||||
|
questions = load_questions()
|
||||||
|
return run_gsm8k(model, client, questions)
|
||||||
|
elif module_name == "humaneval":
|
||||||
|
from humaneval import load_problems, run_humaneval
|
||||||
|
problems = load_problems()
|
||||||
|
return run_humaneval(model, client, problems)
|
||||||
|
else:
|
||||||
|
raise ValueError(f"Unknown benchmark: {module_name}")
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> None:
|
||||||
|
client = OpenAI(base_url=ENDPOINT, api_key="dummy")
|
||||||
|
|
||||||
|
# Check connectivity
|
||||||
|
try:
|
||||||
|
client.models.list()
|
||||||
|
print("Connected to llama-swap", file=sys.stderr)
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Cannot connect to {ENDPOINT}: {e}", file=sys.stderr)
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
RESULTS_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
all_results: list[dict] = []
|
||||||
|
benchmarks = ["mmlu", "gsm8k", "humaneval"]
|
||||||
|
|
||||||
|
t_start = time.time()
|
||||||
|
|
||||||
|
for model in MODELS:
|
||||||
|
warmup_model(client, model)
|
||||||
|
|
||||||
|
for bench in benchmarks:
|
||||||
|
print(f"\n --- {model} / {bench} ---", file=sys.stderr)
|
||||||
|
try:
|
||||||
|
results = run_benchmark(bench, model, client)
|
||||||
|
all_results.extend(results)
|
||||||
|
write_csv(all_results)
|
||||||
|
except Exception as e:
|
||||||
|
print(f" ERROR in {model}/{bench}: {e}", file=sys.stderr)
|
||||||
|
|
||||||
|
elapsed = time.time() - t_start
|
||||||
|
print(f"\nAll benchmarks complete in {elapsed/60:.0f} minutes", file=sys.stderr)
|
||||||
|
print(f"Results: {CSV_PATH}", file=sys.stderr)
|
||||||
|
|
||||||
|
|
||||||
|
def write_csv(results: list[dict]) -> None:
|
||||||
|
if not results:
|
||||||
|
return
|
||||||
|
fields = ["model", "benchmark", "question_id", "correct", "raw_answer",
|
||||||
|
"parsed_answer", "expected", "latency_ms"]
|
||||||
|
# Also include category if present (MMLU)
|
||||||
|
if any("category" in r for r in results):
|
||||||
|
fields.insert(3, "category")
|
||||||
|
|
||||||
|
with open(CSV_PATH, "w", newline="") as f:
|
||||||
|
w = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
|
||||||
|
w.writeheader()
|
||||||
|
w.writerows(results)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
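The benchmark modules that this orchestrator runs resume at the level of a single question: each response is stored as one JSON file, and a file that parses cleanly is reused instead of calling the model again. A minimal sketch of that cache-or-call pattern follows. The helper `cached_call` is hypothetical. The write-to-temp-then-rename step is an addition that the scripts themselves do not perform; it is there so an interrupted run cannot leave a truncated file under the final name.

```python
import json
import tempfile
from pathlib import Path


def cached_call(out_path, fetch):
    """Return (payload, was_cached); call `fetch` only without a valid cache."""
    if out_path.exists():
        try:
            return json.loads(out_path.read_text()), True
        except json.JSONDecodeError:
            pass  # corrupt or partial file: fall through and refetch
    payload = fetch()
    # Write under a temporary name, then rename into place.
    tmp_path = out_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(payload, indent=2, default=str))
    tmp_path.replace(out_path)
    return payload, False


with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "q1.json"
    calls = []

    def fetch():
        # Stand-in for a chat completion request.
        calls.append(1)
        return {"choices": [{"message": {"content": "B"}}]}

    first = cached_call(path, fetch)
    second = cached_call(path, fetch)
    print(first[1], second[1], len(calls))  # False True 1
```

Because the cache key is just the output path, deleting one result file is enough to force that single question to be re-run.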
20
eval/run_all.sh
Executable file
@@ -0,0 +1,20 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
EVAL_DIR="$(cd "$(dirname "$0")" && pwd)"
|
||||||
|
VENV="${EVAL_DIR}/.venv/bin/python3"
|
||||||
|
|
||||||
|
cd "$EVAL_DIR"
|
||||||
|
|
||||||
|
echo "Starting eval sweep at $(date)"
|
||||||
|
echo "Using venv: ${VENV}"
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
"$VENV" run_all.py 2>&1 | tee eval.log
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "Generating summary..."
|
||||||
|
"$VENV" analyze.py
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "Done at $(date)"
|
||||||
3
go.mod
Normal file
@@ -0,0 +1,3 @@
|
|||||||
|
module github.com/indifferentketchup/llama-sidecar
|
||||||
|
|
||||||
|
go 1.26.3
|
||||||
139
internal/config/config.go
Normal file
@@ -0,0 +1,139 @@
|
|||||||
|
package config
|
||||||
|
|
||||||
|
import (
|
||||||
|
"bytes"
|
||||||
|
"encoding/json"
|
||||||
|
"fmt"
|
||||||
|
"os"
|
||||||
|
"strconv"
|
||||||
|
"strings"
|
||||||
|
)
|
||||||
|
|
||||||
|
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}
|
||||||
|
|
||||||
|
type Config struct {
|
||||||
|
Bind string
|
||||||
|
LlamaServerBin string
|
||||||
|
ModelDirMap map[string]string
|
||||||
|
PortRangeLo int
|
||||||
|
PortRangeHi int
|
||||||
|
MaxSidecars int
|
||||||
|
LogLevel string
|
||||||
|
BaseArgs []string
|
||||||
|
HealthTimeoutSeconds int
|
||||||
|
HealthIntervalSeconds int
|
||||||
|
}
|
||||||
|
|
||||||
|
func Load() (*Config, error) {
|
||||||
|
bin := os.Getenv("LLAMA_SERVER_BIN")
|
||||||
|
if bin == "" {
|
||||||
|
return nil, fmt.Errorf("LLAMA_SERVER_BIN is required")
|
||||||
|
}
|
||||||
|
if _, err := os.Stat(bin); err != nil {
|
||||||
|
return nil, fmt.Errorf("LLAMA_SERVER_BIN %q: %w", bin, err)
|
||||||
|
}
|
||||||
|
|
||||||
|
mapFile := os.Getenv("MODEL_DIR_MAP_FILE")
|
||||||
|
if mapFile == "" {
|
||||||
|
return nil, fmt.Errorf("MODEL_DIR_MAP_FILE is required")
|
||||||
|
}
|
||||||
|
modelMap, err := loadModelMap(mapFile)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("MODEL_DIR_MAP_FILE: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
bind := envOr("LLAMA_SIDECAR_BIND", "127.0.0.1:8402")
|
||||||
|
logLevel := envOr("LOG_LEVEL", "info")
|
||||||
|
maxSidecars := envIntOr("MAX_SIDECARS", 2)
|
||||||
|
healthTimeout := envIntOr("HEALTH_TIMEOUT_SECONDS", 60)
|
||||||
|
healthInterval := envIntOr("HEALTH_INTERVAL_SECONDS", 30)
|
||||||
|
|
||||||
|
lo, hi, err := parsePortRange(envOr("PORT_RANGE", "8500-8599"))
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("PORT_RANGE: %w", err)
|
||||||
|
}
|
||||||
|
if hi-lo+1 < maxSidecars {
|
||||||
|
return nil, fmt.Errorf("PORT_RANGE %d-%d has %d ports but MAX_SIDECARS is %d", lo, hi, hi-lo+1, maxSidecars)
|
||||||
|
}
|
||||||
|
|
||||||
|
baseArgs := defaultBaseArgs()
|
||||||
|
if env := os.Getenv("BASE_ARGS"); env != "" {
|
||||||
|
var parsed []string
|
||||||
|
envBytes := bytes.TrimPrefix([]byte(env), utf8BOM)
|
||||||
|
if err := json.Unmarshal(envBytes, &parsed); err != nil {
|
||||||
|
return nil, fmt.Errorf("BASE_ARGS: invalid JSON array: %w", err)
|
||||||
|
}
|
||||||
|
baseArgs = parsed
|
||||||
|
}
|
||||||
|
|
||||||
|
return &Config{
|
||||||
|
Bind: bind,
|
||||||
|
LlamaServerBin: bin,
|
||||||
|
ModelDirMap: modelMap,
|
||||||
|
PortRangeLo: lo,
|
||||||
|
PortRangeHi: hi,
|
||||||
|
MaxSidecars: maxSidecars,
|
||||||
|
LogLevel: logLevel,
|
||||||
|
BaseArgs: baseArgs,
|
||||||
|
HealthTimeoutSeconds: healthTimeout,
|
||||||
|
HealthIntervalSeconds: healthInterval,
|
||||||
|
}, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
func defaultBaseArgs() []string {
|
||||||
|
return []string{"-ngl", "999", "-c", "32768", "--flash-attn", "on", "--no-mmap"}
|
||||||
|
}
|
||||||
|
|
||||||
|
func loadModelMap(path string) (map[string]string, error) {
|
||||||
|
data, err := os.ReadFile(path)
|
||||||
|
if err != nil {
|
||||||
|
return nil, err
|
||||||
|
}
|
||||||
|
data = bytes.TrimPrefix(data, utf8BOM)
|
||||||
|
var m map[string]string
|
||||||
|
if err := json.Unmarshal(data, &m); err != nil {
|
||||||
|
return nil, fmt.Errorf("invalid JSON: %w", err)
|
||||||
|
}
|
||||||
|
if len(m) == 0 {
|
||||||
|
return nil, fmt.Errorf("model map is empty")
|
||||||
|
}
|
||||||
|
return m, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
func parsePortRange(s string) (int, int, error) {
|
||||||
|
parts := strings.SplitN(s, "-", 2)
|
||||||
|
if len(parts) != 2 {
|
||||||
|
return 0, 0, fmt.Errorf("expected lo-hi format, got %q", s)
|
||||||
|
}
|
||||||
|
lo, err := strconv.Atoi(strings.TrimSpace(parts[0]))
|
||||||
|
if err != nil {
|
||||||
|
return 0, 0, fmt.Errorf("invalid lo port: %w", err)
|
||||||
|
}
|
||||||
|
hi, err := strconv.Atoi(strings.TrimSpace(parts[1]))
|
||||||
|
if err != nil {
|
||||||
|
return 0, 0, fmt.Errorf("invalid hi port: %w", err)
|
||||||
|
}
|
||||||
|
if hi <= lo {
|
||||||
|
return 0, 0, fmt.Errorf("hi (%d) must be > lo (%d)", hi, lo)
|
||||||
|
}
|
||||||
|
return lo, hi, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
func envOr(key, fallback string) string {
|
||||||
|
if v := os.Getenv(key); v != "" {
|
||||||
|
return v
|
||||||
|
}
|
||||||
|
return fallback
|
||||||
|
}
|
||||||
|
|
||||||
|
func envIntOr(key string, fallback int) int {
|
||||||
|
v := os.Getenv(key)
|
||||||
|
if v == "" {
|
||||||
|
return fallback
|
||||||
|
}
|
||||||
|
n, err := strconv.Atoi(v)
|
||||||
|
if err != nil {
|
||||||
|
return fallback
|
||||||
|
}
|
||||||
|
return n
|
||||||
|
}
|
||||||
79
internal/config/config_test.go
Normal file
@@ -0,0 +1,79 @@
|
|||||||
|
package config
|
||||||
|
|
||||||
|
import (
|
||||||
|
"os"
|
||||||
|
"path/filepath"
|
||||||
|
"testing"
|
||||||
|
)
|
||||||
|
|
||||||
|
func TestLoad_MissingRequired(t *testing.T) {
|
||||||
|
t.Setenv("LLAMA_SERVER_BIN", "")
|
||||||
|
t.Setenv("MODEL_DIR_MAP_FILE", "")
|
||||||
|
_, err := Load()
|
||||||
|
if err == nil {
|
||||||
|
t.Fatal("expected error for missing LLAMA_SERVER_BIN")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestParsePortRange(t *testing.T) {
|
||||||
|
lo, hi, err := parsePortRange("8500-8599")
|
||||||
|
if err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
if lo != 8500 || hi != 8599 {
|
||||||
|
t.Fatalf("got %d-%d", lo, hi)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestParsePortRange_Bad(t *testing.T) {
|
||||||
|
_, _, err := parsePortRange("abc")
|
||||||
|
if err == nil {
|
||||||
|
t.Fatal("expected error")
|
||||||
|
}
|
||||||
|
_, _, err = parsePortRange("100-50")
|
||||||
|
if err == nil {
|
||||||
|
t.Fatal("expected error for hi <= lo")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestLoadModelMap_BOM(t *testing.T) {
|
||||||
|
dir := t.TempDir()
|
||||||
|
path := filepath.Join(dir, "model_map.json")
|
||||||
|
content := append([]byte{0xEF, 0xBB, 0xBF}, []byte(`{"test-model": "/fake/path.gguf"}`)...)
|
||||||
|
if err := os.WriteFile(path, content, 0644); err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
m, err := loadModelMap(path)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("BOM-prefixed JSON should parse: %v", err)
|
||||||
|
}
|
||||||
|
if m["test-model"] != "/fake/path.gguf" {
|
||||||
|
t.Fatalf("unexpected map: %v", m)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestDefaultBaseArgs_FlashAttn(t *testing.T) {
|
||||||
|
args := defaultBaseArgs()
|
||||||
|
for i, a := range args {
|
||||||
|
if a == "--flash-attn" && i+1 < len(args) && args[i+1] == "on" {
|
||||||
|
return
|
||||||
|
}
|
||||||
|
}
|
||||||
|
t.Fatal("expected --flash-attn on in default args")
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestDefaultBaseArgs(t *testing.T) {
|
||||||
|
args := defaultBaseArgs()
|
||||||
|
if len(args) == 0 {
|
||||||
|
t.Fatal("expected non-empty default args")
|
||||||
|
}
|
||||||
|
found := false
|
||||||
|
for _, a := range args {
|
||||||
|
if a == "--no-mmap" {
|
||||||
|
found = true
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if !found {
|
||||||
|
t.Fatal("expected --no-mmap in default args")
|
||||||
|
}
|
||||||
|
}
|
||||||
53
internal/pool/hash.go
Normal file
@@ -0,0 +1,53 @@
|
|||||||
|
package pool
|
||||||
|
|
||||||
|
import (
|
||||||
|
"crypto/sha256"
|
||||||
|
"fmt"
|
||||||
|
"sort"
|
||||||
|
"strings"
|
||||||
|
|
||||||
|
"github.com/indifferentketchup/llama-sidecar/internal/validator"
|
||||||
|
)
|
||||||
|
|
||||||
|
// Hash computes a deterministic hash for a (modelID, flags) pair.
|
||||||
|
// Flag order does not affect the result.
|
||||||
|
func Hash(modelID string, flags []string) string {
|
||||||
|
type pair struct {
|
||||||
|
key, val string
|
||||||
|
}
|
||||||
|
|
||||||
|
var pairs []pair
|
||||||
|
i := 0
|
||||||
|
for i < len(flags) {
|
||||||
|
tok := flags[i]
|
||||||
|
key := validator.FlagName(tok)
|
||||||
|
if key == "" {
|
||||||
|
i++
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
if idx := strings.IndexByte(tok, '='); idx >= 0 {
|
||||||
|
pairs = append(pairs, pair{key: tok[:idx], val: tok[idx+1:]})
|
||||||
|
i++
|
||||||
|
} else if i+1 < len(flags) && validator.FlagName(flags[i+1]) == "" {
|
||||||
|
pairs = append(pairs, pair{key: key, val: flags[i+1]})
|
||||||
|
i += 2
|
||||||
|
} else {
|
||||||
|
pairs = append(pairs, pair{key: key, val: ""})
|
||||||
|
i++
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
sort.SliceStable(pairs, func(a, b int) bool { // stable: repeated flags keep their input order
|
||||||
|
return pairs[a].key < pairs[b].key
|
||||||
|
})
|
||||||
|
|
||||||
|
var parts []string
|
||||||
|
for _, p := range pairs {
|
||||||
|
parts = append(parts, p.key+"\x1f"+p.val)
|
||||||
|
}
|
||||||
|
serialized := strings.Join(parts, "\x1e")
|
||||||
|
input := modelID + "\x1d" + serialized
|
||||||
|
|
||||||
|
sum := sha256.Sum256([]byte(input))
|
||||||
|
return fmt.Sprintf("%x", sum[:8])
|
||||||
|
}
|
||||||
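The order-independence that Hash promises can be illustrated with a minimal, self-contained sketch. It is not the package's code: isFlag is a hypothetical stand-in for validator.FlagName (here any token starting with "-" counts as a flag, so a negative number would be misread), and canonicalKey is an illustrative name. Unlike Hash, it sorts the joined key/value strings rather than the keys alone.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// isFlag is a hypothetical stand-in for validator.FlagName: any token that
// starts with "-" is treated as a flag. The real validator may differ.
func isFlag(tok string) bool { return strings.HasPrefix(tok, "-") }

// canonicalKey hashes modelID plus flags so that the order of flag pairs
// does not change the result.
func canonicalKey(modelID string, flags []string) string {
	var parts []string
	for i := 0; i < len(flags); i++ {
		tok := flags[i]
		if !isFlag(tok) {
			continue // stray value with no preceding flag
		}
		key, val := tok, ""
		if idx := strings.IndexByte(tok, '='); idx >= 0 {
			key, val = tok[:idx], tok[idx+1:] // --flag=value form
		} else if i+1 < len(flags) && !isFlag(flags[i+1]) {
			val = flags[i+1] // --flag value form
			i++
		}
		parts = append(parts, key+"\x1f"+val)
	}
	sort.Strings(parts) // canonical order, independent of input order
	sum := sha256.Sum256([]byte(modelID + "\x1d" + strings.Join(parts, "\x1e")))
	return fmt.Sprintf("%x", sum[:8]) // 8 bytes -> 16 hex characters
}

func main() {
	a := canonicalKey("foo", []string{"--top-k", "20", "--ctx-size=4096"})
	b := canonicalKey("foo", []string{"--ctx-size=4096", "--top-k", "20"})
	fmt.Println(a == b, len(a))
}
```

The control characters 0x1d, 0x1e and 0x1f act as separators that are unlikely to appear in a flag or value, which is the property TestHash_SeparatorCollision exercises.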
53
internal/pool/hash_test.go
Normal file
@@ -0,0 +1,53 @@
|
|||||||
|
package pool
|
||||||
|
|
||||||
|
import (
|
||||||
|
"math/rand"
|
||||||
|
"testing"
|
||||||
|
)
|
||||||
|
|
||||||
|
func TestHash_OrderIndependence(t *testing.T) {
|
||||||
|
flags1 := []string{"--a", "1", "--b", "2", "--c", "3"}
|
||||||
|
h1 := Hash("foo", flags1)
|
||||||
|
|
||||||
|
for i := 0; i < 5; i++ {
|
||||||
|
shuffled := make([]string, len(flags1))
|
||||||
|
copy(shuffled, flags1)
|
||||||
|
// Shuffle pairs (each pair is 2 tokens)
|
||||||
|
pairs := make([][2]string, 0)
|
||||||
|
for j := 0; j < len(shuffled); j += 2 {
|
||||||
|
pairs = append(pairs, [2]string{shuffled[j], shuffled[j+1]})
|
||||||
|
}
|
||||||
|
rand.Shuffle(len(pairs), func(a, b int) { pairs[a], pairs[b] = pairs[b], pairs[a] })
|
||||||
|
var flat []string
|
||||||
|
for _, p := range pairs {
|
||||||
|
flat = append(flat, p[0], p[1])
|
||||||
|
}
|
||||||
|
h := Hash("foo", flat)
|
||||||
|
if h != h1 {
|
||||||
|
t.Errorf("iteration %d: hash %s != %s for order %v", i, h, h1, flat)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestHash_SeparatorCollision(t *testing.T) {
|
||||||
|
h1 := Hash("foo", []string{"--a\x1eb", "1"})
|
||||||
|
h2 := Hash("foo", []string{"--ab", "1"})
|
||||||
|
if h1 == h2 {
|
||||||
|
t.Error("separator collision: hashes should differ")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestHash_Length(t *testing.T) {
|
||||||
|
h := Hash("model", []string{"--top-k", "20"})
|
||||||
|
if len(h) != 16 {
|
||||||
|
t.Errorf("expected 16 hex chars, got %d: %s", len(h), h)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestHash_DifferentModels(t *testing.T) {
|
||||||
|
h1 := Hash("model-a", []string{"--top-k", "20"})
|
||||||
|
h2 := Hash("model-b", []string{"--top-k", "20"})
|
||||||
|
if h1 == h2 {
|
||||||
|
t.Error("different models should produce different hashes")
|
||||||
|
}
|
||||||
|
}
|
||||||
188
internal/pool/pool.go
Normal file
@@ -0,0 +1,188 @@
|
|||||||
|
package pool
|
||||||
|
|
||||||
|
import (
|
||||||
|
"container/list"
|
||||||
|
"context"
|
||||||
|
"fmt"
|
||||||
|
"log/slog"
|
||||||
|
"sync"
|
||||||
|
"time"
|
||||||
|
|
||||||
|
"github.com/indifferentketchup/llama-sidecar/internal/config"
|
||||||
|
"github.com/indifferentketchup/llama-sidecar/internal/validator"
|
||||||
|
)
|
||||||
|
|
||||||
|
type SidecarInfo struct {
|
||||||
|
Hash string `json:"hash"`
|
||||||
|
ModelID string `json:"model_id"`
|
||||||
|
Flags []string `json:"flags"`
|
||||||
|
Port int `json:"port"`
|
||||||
|
Pid int `json:"pid"`
|
||||||
|
StartedAt time.Time `json:"started_at"`
|
||||||
|
LastUsed time.Time `json:"last_used"`
|
||||||
|
Healthy bool `json:"healthy"`
|
||||||
|
}
|
||||||
|
|
||||||
|
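// Pool tracks running sidecars by hash and evicts the least recently used one once MaxSidecars is reached.
|
||||||
|
type Pool struct {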
|
||||||
|
mu sync.Mutex
|
||||||
|
cfg *config.Config
|
||||||
|
sidecars map[string]*Sidecar
|
||||||
|
lru *list.List
|
||||||
|
lruIdx map[string]*list.Element
|
||||||
|
ports *PortAllocator
|
||||||
|
spawner Spawner
|
||||||
|
}
|
||||||
|
|
||||||
|
func New(cfg *config.Config, spawner Spawner) *Pool {
|
||||||
|
return &Pool{
|
||||||
|
cfg: cfg,
|
||||||
|
sidecars: make(map[string]*Sidecar),
|
||||||
|
lru: list.New(),
|
||||||
|
lruIdx: make(map[string]*list.Element),
|
||||||
|
ports: NewPortAllocator(cfg.PortRangeLo, cfg.PortRangeHi),
|
||||||
|
spawner: spawner,
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
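// Acquire returns a healthy sidecar for (modelID, flags), reusing the one with the same hash if present
|
||||||
|
// and spawning a new one otherwise. The pool lock is released while the child starts.
|
||||||
|
func (p *Pool) Acquire(ctx context.Context, modelID string, flags []string) (*Sidecar, error) {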
|
||||||
|
if _, err := validator.ValidateExtraArgs(flags); err != nil {
|
||||||
|
return nil, fmt.Errorf("validation: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
modelPath, ok := p.cfg.ModelDirMap[modelID]
|
||||||
|
if !ok {
|
||||||
|
return nil, fmt.Errorf("unknown model: %s", modelID)
|
||||||
|
}
|
||||||
|
|
||||||
|
hash := Hash(modelID, flags)
|
||||||
|
|
||||||
|
p.mu.Lock()
|
||||||
|
defer p.mu.Unlock()
|
||||||
|
|
||||||
|
if s, ok := p.sidecars[hash]; ok {
|
||||||
|
if s.Healthy() {
|
||||||
|
if el, ok := p.lruIdx[hash]; ok {
|
||||||
|
p.lru.MoveToFront(el)
|
||||||
|
}
|
||||||
|
s.TouchLastUsed()
|
||||||
|
return s, nil
|
||||||
|
}
|
||||||
|
p.removeLocked(hash)
|
||||||
|
}
|
||||||
|
|
||||||
|
if len(p.sidecars) >= p.cfg.MaxSidecars {
|
||||||
|
if err := p.evictLRULocked(); err != nil {
|
||||||
|
return nil, fmt.Errorf("eviction failed: %w", err)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
port, err := p.ports.Allocate()
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("port allocation: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
p.mu.Unlock()
|
||||||
|
s, err := p.spawner.Spawn(ctx, p.cfg, modelID, modelPath, flags, port, hash)
|
||||||
|
p.mu.Lock()
|
||||||
|
|
||||||
|
if err != nil {
|
||||||
|
p.ports.Release(port)
|
||||||
|
return nil, fmt.Errorf("spawn: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
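// Another caller may have spawned the same key while the lock was released.
|
||||||
|
if existing, ok := p.sidecars[hash]; ok {
|
||||||
|
if existing.Healthy() {
|
||||||
|
_ = p.spawner.Kill(s)
|
||||||
|
p.ports.Release(port)
|
||||||
|
return existing, nil
|
||||||
|
}
|
||||||
|
p.removeLocked(hash)
|
||||||
|
}
|
||||||
|
p.sidecars[hash] = s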
|
||||||
|
el := p.lru.PushFront(hash)
|
||||||
|
p.lruIdx[hash] = el
|
||||||
|
return s, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
func (p *Pool) List() []SidecarInfo {
|
||||||
|
p.mu.Lock()
|
||||||
|
defer p.mu.Unlock()
|
||||||
|
out := make([]SidecarInfo, 0, len(p.sidecars))
|
||||||
|
for _, s := range p.sidecars {
|
||||||
|
out = append(out, SidecarInfo{
|
||||||
|
Hash: s.Hash,
|
||||||
|
ModelID: s.ModelID,
|
||||||
|
Flags: s.Flags,
|
||||||
|
Port: s.Port,
|
||||||
|
Pid: s.Pid,
|
||||||
|
StartedAt: s.StartedAt,
|
||||||
|
LastUsed: time.Unix(0, s.LastUsed.Load()),
|
||||||
|
Healthy: s.Healthy(),
|
||||||
|
})
|
||||||
|
}
|
||||||
|
return out
|
||||||
|
}
|
||||||
|
|
||||||
|
func (p *Pool) Remove(hash string) error {
|
||||||
|
p.mu.Lock()
|
||||||
|
defer p.mu.Unlock()
|
||||||
|
if _, ok := p.sidecars[hash]; !ok {
|
||||||
|
return fmt.Errorf("sidecar %s not found", hash)
|
||||||
|
}
|
||||||
|
return p.removeLocked(hash)
|
||||||
|
}
|
||||||
|
|
||||||
|
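// Shutdown kills every sidecar concurrently and returns ctx.Err() if ctx expires first.
|
||||||
|
// It does not remove entries from the pool or release their ports.
|
||||||
|
func (p *Pool) Shutdown(ctx context.Context) error {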
|
||||||
|
p.mu.Lock()
|
||||||
|
hashes := make([]string, 0, len(p.sidecars))
|
||||||
|
for h := range p.sidecars {
|
||||||
|
hashes = append(hashes, h)
|
||||||
|
}
|
||||||
|
p.mu.Unlock()
|
||||||
|
|
||||||
|
var wg sync.WaitGroup
|
||||||
|
for _, h := range hashes {
|
||||||
|
wg.Add(1)
|
||||||
|
go func(hash string) {
|
||||||
|
defer wg.Done()
|
||||||
|
p.mu.Lock()
|
||||||
|
s, ok := p.sidecars[hash]
|
||||||
|
p.mu.Unlock()
|
||||||
|
if !ok {
|
||||||
|
return
|
||||||
|
}
|
||||||
|
if err := p.spawner.Kill(s); err != nil {
|
||||||
|
slog.Error("shutdown kill failed", "hash", hash, "err", err)
|
||||||
|
}
|
||||||
|
}(h)
|
||||||
|
}
|
||||||
|
|
||||||
|
done := make(chan struct{})
|
||||||
|
go func() { wg.Wait(); close(done) }()
|
||||||
|
select {
|
||||||
|
case <-done:
|
||||||
|
case <-ctx.Done():
|
||||||
|
return ctx.Err()
|
||||||
|
}
|
||||||
|
slog.Info("pool shutdown complete", "count", len(hashes))
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
func (p *Pool) removeLocked(hash string) error {
|
||||||
|
s, ok := p.sidecars[hash]
|
||||||
|
if !ok {
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
delete(p.sidecars, hash)
|
||||||
|
if el, ok := p.lruIdx[hash]; ok {
|
||||||
|
p.lru.Remove(el)
|
||||||
|
delete(p.lruIdx, hash)
|
||||||
|
}
|
||||||
|
if err := p.spawner.Kill(s); err != nil {
|
||||||
|
slog.Error("kill failed during remove", "hash", hash, "err", err)
|
||||||
|
}
|
||||||
|
p.ports.Release(s.Port)
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
func (p *Pool) evictLRULocked() error {
|
||||||
|
back := p.lru.Back()
|
||||||
|
if back == nil {
|
||||||
|
return fmt.Errorf("pool full but LRU empty")
|
||||||
|
}
|
||||||
|
hash := back.Value.(string)
|
||||||
|
slog.Info("evicting LRU sidecar", "hash", hash)
|
||||||
|
return p.removeLocked(hash)
|
||||||
|
}
|
||||||
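The LRU bookkeeping in Pool (a container/list for recency plus a map for O(1) lookup) can be sketched in isolation. This is a simplified illustration, not the pool itself: lruSet and touch are hypothetical names, and the sketch tracks keys only, with no processes or ports attached.

```go
package main

import (
	"container/list"
	"fmt"
)

// lruSet is a hypothetical, simplified stand-in for the pool's bookkeeping.
type lruSet struct {
	max   int
	order *list.List               // front = most recently used
	index map[string]*list.Element // key -> its element in order
}

func newLRUSet(max int) *lruSet {
	return &lruSet{max: max, order: list.New(), index: make(map[string]*list.Element)}
}

// touch marks key as used and reports the key it evicted, if any.
func (l *lruSet) touch(key string) (evicted string, ok bool) {
	if el, hit := l.index[key]; hit {
		l.order.MoveToFront(el) // reuse: only the recency changes
		return "", false
	}
	if back := l.order.Back(); back != nil && l.order.Len() >= l.max {
		evicted = back.Value.(string)
		l.order.Remove(back)
		delete(l.index, evicted)
		ok = true
	}
	l.index[key] = l.order.PushFront(key)
	return evicted, ok
}

func main() {
	l := newLRUSet(2)
	l.touch("a")
	l.touch("b")
	l.touch("a")              // "a" becomes the most recent entry
	fmt.Println(l.touch("c")) // evicts "b", the least recently used key
}
```

Keeping the map and the list in step is the invariant that matters: every key in the map must have exactly one element in the list, otherwise a later eviction can pick an element whose key is no longer tracked.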
151
internal/pool/pool_test.go
Normal file
@@ -0,0 +1,151 @@
|
|||||||
|
package pool
|
||||||
|
|
||||||
|
import (
|
||||||
|
"context"
|
||||||
|
"sync"
|
||||||
|
"sync/atomic"
|
||||||
|
"testing"
|
||||||
|
"time"
|
||||||
|
|
||||||
|
"github.com/indifferentketchup/llama-sidecar/internal/config"
|
||||||
|
)
|
||||||
|
|
||||||
|
type fakeSpawner struct {
|
||||||
|
spawnCount atomic.Int32
|
||||||
|
killCount atomic.Int32
|
||||||
|
}
|
||||||
|
|
||||||
|
func (f *fakeSpawner) Spawn(ctx context.Context, cfg *config.Config, modelID, modelPath string, flags []string, port int, hash string) (*Sidecar, error) {
|
||||||
|
f.spawnCount.Add(1)
|
||||||
|
s := &Sidecar{
|
||||||
|
Hash: hash,
|
||||||
|
ModelID: modelID,
|
||||||
|
ModelPath: modelPath,
|
||||||
|
Flags: flags,
|
||||||
|
Port: port,
|
||||||
|
Pid: 99999,
|
||||||
|
StartedAt: time.Now(),
|
||||||
|
stderr: newRingBuffer(8),
|
||||||
|
cancel: func() {},
|
||||||
|
}
|
||||||
|
s.healthy.Store(true)
|
||||||
|
s.LastUsed.Store(time.Now().UnixNano())
|
||||||
|
return s, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
func (f *fakeSpawner) Kill(s *Sidecar) error {
|
||||||
|
f.killCount.Add(1)
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
func testConfig() *config.Config {
|
||||||
|
return &config.Config{
|
||||||
|
Bind: "127.0.0.1:0",
|
||||||
|
LlamaServerBin: "/fake/llama-server",
|
||||||
|
ModelDirMap: map[string]string{
|
||||||
|
"model-a": "/fake/model-a.gguf",
|
||||||
|
"model-b": "/fake/model-b.gguf",
|
||||||
|
},
|
||||||
|
PortRangeLo: 8500,
|
||||||
|
PortRangeHi: 8509,
|
||||||
|
MaxSidecars: 2,
|
||||||
|
BaseArgs: []string{"-ngl", "999"},
|
||||||
|
HealthTimeoutSeconds: 60,
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestPool_AcquireSameKey(t *testing.T) {
|
||||||
|
fs := &fakeSpawner{}
|
||||||
|
p := New(testConfig(), fs)
|
||||||
|
ctx := context.Background()
|
||||||
|
|
||||||
|
s1, err := p.Acquire(ctx, "model-a", []string{"--top-k", "20"})
|
||||||
|
if err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
s2, err := p.Acquire(ctx, "model-a", []string{"--top-k", "20"})
|
||||||
|
if err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
if s1.Hash != s2.Hash {
|
||||||
|
t.Fatalf("expected same sidecar, got different hashes: %s vs %s", s1.Hash, s2.Hash)
|
||||||
|
}
|
||||||
|
if fs.spawnCount.Load() != 1 {
|
||||||
|
t.Fatalf("expected 1 spawn, got %d", fs.spawnCount.Load())
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestPool_EvictLRU(t *testing.T) {
|
||||||
|
cfg := testConfig()
|
||||||
|
cfg.MaxSidecars = 1
|
||||||
|
fs := &fakeSpawner{}
|
||||||
|
p := New(cfg, fs)
|
||||||
|
ctx := context.Background()
|
||||||
|
|
||||||
|
_, err := p.Acquire(ctx, "model-a", []string{"--top-k", "20"})
|
||||||
|
if err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
_, err = p.Acquire(ctx, "model-b", []string{"--top-k", "40"})
|
||||||
|
if err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
|
||||||
|
if fs.spawnCount.Load() != 2 {
|
||||||
|
t.Fatalf("expected 2 spawns, got %d", fs.spawnCount.Load())
|
||||||
|
}
|
||||||
|
if fs.killCount.Load() != 1 {
|
||||||
|
t.Fatalf("expected 1 kill (eviction), got %d", fs.killCount.Load())
|
||||||
|
}
|
||||||
|
list := p.List()
|
||||||
|
if len(list) != 1 {
|
||||||
|
t.Fatalf("expected 1 sidecar, got %d", len(list))
|
||||||
|
}
|
||||||
|
if list[0].ModelID != "model-b" {
|
||||||
|
t.Fatalf("expected model-b, got %s", list[0].ModelID)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestPool_ValidatorReject(t *testing.T) {
|
||||||
|
fs := &fakeSpawner{}
|
||||||
|
p := New(testConfig(), fs)
|
||||||
|
_, err := p.Acquire(context.Background(), "model-a", []string{"--model", "evil.gguf"})
|
||||||
|
if err == nil {
|
||||||
|
t.Fatal("expected validation error")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestPool_UnknownModel(t *testing.T) {
|
||||||
|
fs := &fakeSpawner{}
|
||||||
|
p := New(testConfig(), fs)
|
||||||
|
_, err := p.Acquire(context.Background(), "nonexistent", nil)
|
||||||
|
if err == nil {
|
||||||
|
t.Fatal("expected unknown model error")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestPool_ConcurrentAcquire(t *testing.T) {
|
||||||
|
cfg := testConfig()
|
||||||
|
cfg.MaxSidecars = 10
|
||||||
|
cfg.PortRangeHi = 8599
|
||||||
|
fs := &fakeSpawner{}
|
||||||
|
p := New(cfg, fs)
|
||||||
|
ctx := context.Background()
|
||||||
|
|
||||||
|
var wg sync.WaitGroup
|
||||||
|
for i := 0; i < 10; i++ {
|
||||||
|
wg.Add(1)
|
||||||
|
go func() {
|
||||||
|
defer wg.Done()
|
||||||
|
for j := 0; j < 50; j++ {
|
||||||
|
_, _ = p.Acquire(ctx, "model-a", []string{"--top-k", "20"})
|
||||||
|
}
|
||||||
|
}()
|
||||||
|
}
|
||||||
|
wg.Wait()
|
||||||
|
|
||||||
|
list := p.List()
|
||||||
|
if len(list) != 1 {
|
||||||
|
t.Fatalf("expected 1 sidecar (same key), got %d", len(list))
|
||||||
|
}
|
||||||
|
}
|
||||||
28
internal/pool/ports.go
Normal file
@@ -0,0 +1,28 @@
|
|||||||
|
package pool
|
||||||
|
|
||||||
|
import "fmt"
|
||||||
|
|
||||||
|
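// PortAllocator hands out ports from a fixed range, using a buffered channel as the free list.
|
||||||
|
type PortAllocator struct {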
|
||||||
|
ports chan int
|
||||||
|
}
|
||||||
|
|
||||||
|
func NewPortAllocator(lo, hi int) *PortAllocator {
|
||||||
|
ch := make(chan int, hi-lo+1)
|
||||||
|
for p := lo; p <= hi; p++ {
|
||||||
|
ch <- p
|
||||||
|
}
|
||||||
|
return &PortAllocator{ports: ch}
|
||||||
|
}
|
||||||
|
|
||||||
|
func (pa *PortAllocator) Allocate() (int, error) {
|
||||||
|
select {
|
||||||
|
case p := <-pa.ports:
|
||||||
|
return p, nil
|
||||||
|
default:
|
||||||
|
return 0, fmt.Errorf("port allocator exhausted")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
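// Release returns port to the free list. It blocks if the list is already full,
|
||||||
|
// so each allocated port must be released exactly once.
|
||||||
|
func (pa *PortAllocator) Release(port int) {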
|
||||||
|
pa.ports <- port
|
||||||
|
}
|
||||||
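A possible variation on the channel-backed free list is a release that never blocks. The sketch below is an assumption about how that could look, not the allocator in this commit: portPool, allocate and release are hypothetical names, and lo <= hi is assumed. It reports a full free list instead of blocking, but it cannot detect a double release while the list still has room.

```go
package main

import (
	"errors"
	"fmt"
)

// portPool is a hypothetical variant of the allocator above.
type portPool struct{ free chan int }

// newPortPool fills the free list with every port in [lo, hi]; lo <= hi is assumed.
func newPortPool(lo, hi int) *portPool {
	ch := make(chan int, hi-lo+1)
	for p := lo; p <= hi; p++ {
		ch <- p
	}
	return &portPool{free: ch}
}

// allocate takes a port without blocking and fails when none are free.
func (pp *portPool) allocate() (int, error) {
	select {
	case p := <-pp.free:
		return p, nil
	default:
		return 0, errors.New("port pool exhausted")
	}
}

// release puts port back and returns false, instead of blocking, when the
// free list is already full (for example after a double release).
func (pp *portPool) release(port int) bool {
	select {
	case pp.free <- port:
		return true
	default:
		return false
	}
}

func main() {
	pp := newPortPool(8500, 8500)
	p, _ := pp.allocate()
	_, err := pp.allocate()
	fmt.Println(p, err)
	fmt.Println(pp.release(p), pp.release(p))
}
```

Because the pool calls Release while holding its mutex, a release that can block would stall every other caller; returning a bool trades that risk for a result the caller has to check.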
74
internal/pool/ports_test.go
Normal file
@@ -0,0 +1,74 @@
|
|||||||
|
package pool
|
||||||
|
|
||||||
|
import (
|
||||||
|
"sync"
|
||||||
|
"testing"
|
||||||
|
)
|
||||||
|
|
||||||
|
func TestPortAllocator_AllocateRelease(t *testing.T) {
|
||||||
|
pa := NewPortAllocator(8500, 8502)
|
||||||
|
p1, err := pa.Allocate()
|
||||||
|
if err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
p2, err := pa.Allocate()
|
||||||
|
if err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
p3, err := pa.Allocate()
|
||||||
|
if err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// All three ports should be distinct
|
||||||
|
if p1 == p2 || p2 == p3 || p1 == p3 {
|
||||||
|
t.Fatalf("expected distinct ports: %d, %d, %d", p1, p2, p3)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Exhausted
|
||||||
|
_, err = pa.Allocate()
|
||||||
|
if err == nil {
|
||||||
|
t.Fatal("expected error when exhausted")
|
||||||
|
}
|
||||||
|
|
||||||
|
// Release and re-allocate
|
||||||
|
pa.Release(p2)
|
||||||
|
p4, err := pa.Allocate()
|
||||||
|
if err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
if p4 != p2 {
|
||||||
|
t.Fatalf("expected released port %d, got %d", p2, p4)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestPortAllocator_Concurrent(t *testing.T) {
|
||||||
|
pa := NewPortAllocator(8500, 8599)
|
||||||
|
var wg sync.WaitGroup
|
||||||
|
allocated := make(chan int, 100)
|
||||||
|
|
||||||
|
for i := 0; i < 100; i++ {
|
||||||
|
wg.Add(1)
|
||||||
|
go func() {
|
||||||
|
defer wg.Done()
|
||||||
|
p, err := pa.Allocate()
|
||||||
|
if err != nil {
|
||||||
|
return
|
||||||
|
}
|
||||||
|
allocated <- p
|
||||||
|
}()
|
||||||
|
}
|
||||||
|
wg.Wait()
|
||||||
|
close(allocated)
|
||||||
|
|
||||||
|
seen := make(map[int]bool)
|
||||||
|
for p := range allocated {
|
||||||
|
if seen[p] {
|
||||||
|
t.Fatalf("duplicate port %d", p)
|
||||||
|
}
|
||||||
|
seen[p] = true
|
||||||
|
}
|
||||||
|
if len(seen) != 100 {
|
||||||
|
t.Fatalf("expected 100 ports, got %d", len(seen))
|
||||||
|
}
|
||||||
|
}
|
||||||
313
internal/pool/sidecar.go
Normal file
@@ -0,0 +1,313 @@
|
|||||||
|
package pool
|
||||||
|
|
||||||
|
import (
|
||||||
|
"bytes"
|
||||||
|
"context"
|
||||||
|
"fmt"
|
||||||
|
"io"
|
||||||
|
"log/slog"
|
||||||
|
"net/http"
|
||||||
|
"os"
|
||||||
|
"os/exec"
|
||||||
|
"strconv"
|
||||||
|
"strings"
|
||||||
|
"sync"
|
||||||
|
"sync/atomic"
|
||||||
|
"time"
|
||||||
|
|
||||||
|
"github.com/indifferentketchup/llama-sidecar/internal/config"
|
||||||
|
"github.com/indifferentketchup/llama-sidecar/internal/validator"
|
||||||
|
)
|
||||||
|
|
||||||
|
type Sidecar struct {
|
||||||
|
Hash string
|
||||||
|
ModelID string
|
||||||
|
ModelPath string
|
||||||
|
Flags []string
|
||||||
|
Port int
|
||||||
|
Pid int
|
||||||
|
StartedAt time.Time
|
||||||
|
LastUsed atomic.Int64
|
||||||
|
healthy atomic.Bool
|
||||||
|
cmd *exec.Cmd
|
||||||
|
cancel context.CancelFunc
|
||||||
|
done chan error
|
||||||
|
stderr *ringBuffer
|
||||||
|
stopMon context.CancelFunc
|
||||||
|
stdinFile *os.File
|
||||||
|
stdoutR *os.File
|
||||||
|
stdoutFile *os.File
|
||||||
|
}
|
||||||
|
|
||||||
|
func (s *Sidecar) Healthy() bool {
|
||||||
|
return s.healthy.Load()
|
||||||
|
}
|
||||||
|
|
||||||
|
func (s *Sidecar) TouchLastUsed() {
|
||||||
|
s.LastUsed.Store(time.Now().UnixNano())
|
||||||
|
}
|
||||||
|
|
||||||
|
func (s *Sidecar) LastStderr() string {
|
||||||
|
return s.stderr.String()
|
||||||
|
}
|
||||||
|
|
||||||
|
// Spawner abstracts sidecar creation for testing.
|
||||||
|
type Spawner interface {
|
||||||
|
Spawn(ctx context.Context, cfg *config.Config, modelID, modelPath string, flags []string, port int, hash string) (*Sidecar, error)
|
||||||
|
Kill(s *Sidecar) error
|
||||||
|
}
|
||||||
|
|
||||||
|
type RealSpawner struct{}
|
||||||
|
|
||||||
|
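// Spawn starts llama-server on port and waits up to cfg.HealthTimeoutSeconds for /health to return 200.
|
||||||
|
// The child runs under its own context, so it outlives the request that triggered the spawn.
|
||||||
|
func (rs *RealSpawner) Spawn(ctx context.Context, cfg *config.Config, modelID, modelPath string, flags []string, port int, hash string) (*Sidecar, error) {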
|
||||||
|
args := buildArgs(cfg.BaseArgs, modelPath, port, flags)
|
||||||
|
_ = ctx
|
||||||
|
childCtx, cancel := context.WithCancel(context.Background())
|
||||||
|
cmd := exec.CommandContext(childCtx, cfg.LlamaServerBin, args...)
|
||||||
|
setPlatformAttrs(cmd)
|
||||||
|
|
||||||
|
devNull, err := os.Open(os.DevNull)
|
||||||
|
if err != nil {
|
||||||
|
cancel()
|
||||||
|
return nil, fmt.Errorf("open devnull: %w", err)
|
||||||
|
}
|
||||||
|
cmd.Stdin = devNull
|
||||||
|
|
||||||
|
stderr := newRingBuffer(64)
|
||||||
|
prefix := fmt.Sprintf("[sidecar:%s:%d] ", hash[:8], port)
|
||||||
|
cmd.Stderr = io.MultiWriter(stderr, &prefixWriter{prefix: prefix})
|
||||||
|
stdoutR, stdoutW, err := os.Pipe()
|
||||||
|
if err != nil {
|
||||||
|
cancel()
|
||||||
|
devNull.Close()
|
||||||
|
return nil, fmt.Errorf("stdout pipe: %w", err)
|
||||||
|
}
|
||||||
|
go io.Copy(io.Discard, stdoutR)
|
||||||
|
cmd.Stdout = stdoutW
|
||||||
|
|
||||||
|
slog.Info("spawning sidecar", "hash", hash, "model", modelID, "port", port, "args", strings.Join(args, " "))
|
||||||
|
if err := cmd.Start(); err != nil {
|
||||||
|
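cancel()
|
||||||
|
// Release the handles opened above; nothing else will close them on this path.
|
||||||
|
devNull.Close()
|
||||||
|
stdoutR.Close()
|
||||||
|
stdoutW.Close()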
|
||||||
|
return nil, fmt.Errorf("spawn failed: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
s := &Sidecar{
|
||||||
|
Hash: hash,
|
||||||
|
ModelID: modelID,
|
||||||
|
ModelPath: modelPath,
|
||||||
|
Flags: flags,
|
||||||
|
Port: port,
|
||||||
|
Pid: cmd.Process.Pid,
|
||||||
|
StartedAt: time.Now(),
|
||||||
|
cmd: cmd,
|
||||||
|
cancel: cancel,
|
||||||
|
done: make(chan error, 1),
|
||||||
|
stderr: stderr,
|
||||||
|
stdinFile: devNull,
|
||||||
|
stdoutR: stdoutR,
|
||||||
|
stdoutFile: stdoutW,
|
||||||
|
}
|
||||||
|
s.LastUsed.Store(time.Now().UnixNano())
|
||||||
|
|
||||||
|
go func() {
|
||||||
|
err := cmd.Wait()
|
||||||
|
s.healthy.Store(false)
|
||||||
|
exitCode := -1
|
||||||
|
if cmd.ProcessState != nil {
|
||||||
|
exitCode = cmd.ProcessState.ExitCode()
|
||||||
|
}
|
||||||
|
slog.Error("sidecar child exited",
|
||||||
|
"hash", hash,
|
||||||
|
"port", port,
|
||||||
|
"pid", s.Pid,
|
||||||
|
"exit_code", exitCode,
|
||||||
|
"wait_err", fmt.Sprintf("%v", err),
|
||||||
|
"uptime", time.Since(s.StartedAt).Round(time.Millisecond),
|
||||||
|
"stderr_tail", stderr.String(),
|
||||||
|
)
|
||||||
|
s.done <- err
|
||||||
|
close(s.done)
|
||||||
|
}()
|
||||||
|
|
||||||
|
// Wait for health
|
||||||
|
healthURL := fmt.Sprintf("http://127.0.0.1:%d/health", port)
|
||||||
|
deadline := time.Now().Add(time.Duration(cfg.HealthTimeoutSeconds) * time.Second)
|
||||||
|
for time.Now().Before(deadline) {
|
||||||
|
resp, err := http.Get(healthURL)
|
||||||
|
if err == nil {
|
||||||
|
resp.Body.Close()
|
||||||
|
if resp.StatusCode == 200 {
|
||||||
|
s.healthy.Store(true)
|
||||||
|
slog.Info("sidecar healthy", "hash", hash, "port", port, "elapsed", time.Since(s.StartedAt).Round(time.Millisecond))
|
||||||
|
monCtx, monCancel := context.WithCancel(childCtx) // follow the child's lifetime, not the request's
|
||||||
|
s.stopMon = monCancel
|
||||||
|
go s.healthMonitor(monCtx, cfg.HealthIntervalSeconds)
|
||||||
|
return s, nil
|
||||||
|
}
|
||||||
|
}
|
||||||
|
select {
|
||||||
|
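case <-s.done: // closed by the reaper goroutine once the child exits; childCtx is only cancelled by Kill
|
||||||
|
_ = rs.Kill(s) // child already exited; this only releases the context and file handles
|
||||||
|
return nil, fmt.Errorf("sidecar process exited during health check")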
|
||||||
|
case <-time.After(500 * time.Millisecond):
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
_ = rs.Kill(s)
|
||||||
|
return nil, fmt.Errorf("health check timed out after %ds, last stderr: %s", cfg.HealthTimeoutSeconds, s.stderr.LastLine())
|
||||||
|
}
|
||||||
|
|
||||||
|
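// Kill stops the health monitor, cancels the child's context, waits up to 5 seconds for the
|
||||||
|
// process to exit (calling Process.Kill as a fallback), and then closes the stdio handles.
|
||||||
|
func (rs *RealSpawner) Kill(s *Sidecar) error {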
|
||||||
|
if s.stopMon != nil {
|
||||||
|
s.stopMon()
|
||||||
|
}
|
||||||
|
s.cancel()
|
||||||
|
select {
|
||||||
|
case <-s.done:
|
||||||
|
case <-time.After(5 * time.Second):
|
||||||
|
if s.cmd.Process != nil {
|
||||||
|
_ = s.cmd.Process.Kill()
|
||||||
|
}
|
||||||
|
<-s.done
|
||||||
|
}
|
||||||
|
if s.stdinFile != nil {
|
||||||
|
s.stdinFile.Close()
|
||||||
|
}
|
||||||
|
if s.stdoutFile != nil {
|
||||||
|
s.stdoutFile.Close()
|
||||||
|
}
|
||||||
|
if s.stdoutR != nil {
|
||||||
|
s.stdoutR.Close()
|
||||||
|
}
|
||||||
|
slog.Info("sidecar killed", "hash", s.Hash, "port", s.Port)
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
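// healthMonitor polls /health every intervalSec seconds and marks the sidecar unhealthy
|
||||||
|
// after three consecutive failures.
|
||||||
|
func (s *Sidecar) healthMonitor(ctx context.Context, intervalSec int) {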
|
||||||
|
ticker := time.NewTicker(time.Duration(intervalSec) * time.Second)
|
||||||
|
defer ticker.Stop()
|
||||||
|
failures := 0
|
||||||
|
url := fmt.Sprintf("http://127.0.0.1:%d/health", s.Port)
|
||||||
|
client := &http.Client{Timeout: 5 * time.Second}
|
||||||
|
for {
|
||||||
|
select {
|
||||||
|
case <-ctx.Done():
|
||||||
|
return
|
||||||
|
case <-ticker.C:
|
||||||
|
resp, err := client.Get(url)
|
||||||
|
if err != nil || resp.StatusCode != 200 {
|
||||||
|
if resp != nil {
|
||||||
|
resp.Body.Close()
|
||||||
|
}
|
||||||
|
failures++
|
||||||
|
if failures >= 3 {
|
||||||
|
slog.Warn("sidecar unhealthy, marking for eviction", "hash", s.Hash, "port", s.Port)
|
||||||
|
s.healthy.Store(false)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
resp.Body.Close()
|
||||||
|
failures = 0
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func buildArgs(baseArgs []string, modelPath string, port int, userFlags []string) []string {
|
||||||
|
deduped := dedupFlags(baseArgs, userFlags)
|
||||||
|
args := make([]string, 0, len(deduped)+len(userFlags)+4)
|
||||||
|
args = append(args, deduped...)
|
||||||
|
args = append(args, "--model", modelPath)
|
||||||
|
args = append(args, "--port", strconv.Itoa(port))
|
||||||
|
args = append(args, userFlags...)
|
||||||
|
return args
|
||||||
|
}
|
||||||
|
|
||||||
|
// dedupFlags removes from autoArgs any flag that the user also supplied,
|
||||||
|
// so the user's value wins via llama.cpp's last-wins CLI parsing.
|
||||||
|
func dedupFlags(autoArgs, userArgs []string) []string {
|
||||||
|
userNames := make(map[string]bool)
|
||||||
|
for _, tok := range userArgs {
|
||||||
|
if name := validator.FlagName(tok); name != "" {
|
||||||
|
userNames[name] = true
|
||||||
|
}
|
||||||
|
}
|
||||||
|
out := make([]string, 0, len(autoArgs))
|
||||||
|
i := 0
|
||||||
|
for i < len(autoArgs) {
|
||||||
|
tok := autoArgs[i]
|
||||||
|
name := validator.FlagName(tok)
|
||||||
|
if name == "" || !userNames[name] {
|
||||||
|
out = append(out, tok)
|
||||||
|
i++
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
if strings.Contains(tok, "=") {
|
||||||
|
i++
|
||||||
|
} else if i+1 < len(autoArgs) && validator.FlagName(autoArgs[i+1]) == "" {
|
||||||
|
i += 2
|
||||||
|
} else {
|
||||||
|
i++
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return out
|
||||||
|
}
|
||||||
|
|
||||||
|
// Ring buffer for last N lines of stderr
|
||||||
|
type ringBuffer struct {
|
||||||
|
mu sync.Mutex
|
||||||
|
lines []string
|
||||||
|
max int
|
||||||
|
}
|
||||||
|
|
||||||
|
func newRingBuffer(max int) *ringBuffer {
|
||||||
|
return &ringBuffer{lines: make([]string, 0, max), max: max}
|
||||||
|
}
|
||||||
|
|
||||||
|
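// Write splits p on newlines and keeps the last max non-empty lines.
|
||||||
|
// A line that spans two Write calls is stored as two separate entries.
|
||||||
|
func (rb *ringBuffer) Write(p []byte) (int, error) {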
|
||||||
|
rb.mu.Lock()
|
||||||
|
defer rb.mu.Unlock()
|
||||||
|
for _, line := range strings.Split(string(p), "\n") {
|
||||||
|
line = strings.TrimRight(line, "\r\n")
|
||||||
|
if line == "" {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
if len(rb.lines) >= rb.max {
|
||||||
|
rb.lines = rb.lines[1:]
|
||||||
|
}
|
||||||
|
rb.lines = append(rb.lines, line)
|
||||||
|
}
|
||||||
|
return len(p), nil
|
||||||
|
}
|
||||||
|
|
||||||
|
func (rb *ringBuffer) String() string {
|
||||||
|
rb.mu.Lock()
|
||||||
|
defer rb.mu.Unlock()
|
||||||
|
return strings.Join(rb.lines, "\n")
|
||||||
|
}
|
||||||
|
|
||||||
|
func (rb *ringBuffer) LastLine() string {
|
||||||
|
rb.mu.Lock()
|
||||||
|
defer rb.mu.Unlock()
|
||||||
|
if len(rb.lines) == 0 {
|
||||||
|
return ""
|
||||||
|
}
|
||||||
|
return rb.lines[len(rb.lines)-1]
|
||||||
|
}
|
||||||
|
|
||||||
|
type prefixWriter struct {
|
||||||
|
prefix string
|
||||||
|
buf bytes.Buffer
|
||||||
|
}
|
||||||
|
|
||||||
|
func (pw *prefixWriter) Write(p []byte) (int, error) {
|
||||||
|
pw.buf.Write(p)
|
||||||
|
for {
|
||||||
|
line, err := pw.buf.ReadString('\n')
|
||||||
|
if err != nil {
|
||||||
|
pw.buf.WriteString(line)
|
||||||
|
break
|
||||||
|
}
|
||||||
|
fmt.Fprint(os.Stderr, pw.prefix+line)
|
||||||
|
}
|
||||||
|
return len(p), nil
|
||||||
|
}
|
||||||
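The stderr ring buffer above splits each write on newlines independently. One possible refinement, sketched here under assumptions, is to carry an unfinished line over to the next write so that a line split across two writes is stored once. lineRing and its fields are hypothetical names, not part of this commit, and a final line that never receives its newline stays in the carry-over and is not reported.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// lineRing keeps the last max complete lines written to it.
type lineRing struct {
	mu      sync.Mutex
	lines   []string
	max     int
	partial string // text after the last newline, waiting for the rest of its line
}

func newLineRing(max int) *lineRing {
	if max < 1 {
		max = 1 // keep at least one line so trimming never slices an empty buffer
	}
	return &lineRing{max: max}
}

func (r *lineRing) Write(p []byte) (int, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	chunks := strings.Split(r.partial+string(p), "\n")
	r.partial = chunks[len(chunks)-1] // empty when p ended with a newline
	for _, line := range chunks[:len(chunks)-1] {
		line = strings.TrimRight(line, "\r")
		if line == "" {
			continue
		}
		if len(r.lines) >= r.max {
			r.lines = r.lines[1:] // drop the oldest line
		}
		r.lines = append(r.lines, line)
	}
	return len(p), nil
}

func (r *lineRing) String() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return strings.Join(r.lines, "\n")
}

func main() {
	r := newLineRing(2)
	fmt.Fprint(r, "load")
	fmt.Fprint(r, "ing model\nwarm")
	fmt.Fprint(r, "up done\nlistening\n")
	fmt.Println(r.String())
}
```

Since lineRing implements io.Writer, it could be combined with io.MultiWriter in the same way the spawner combines its ring buffer with the prefixing writer.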
96
internal/pool/sidecar_test.go
Normal file
@@ -0,0 +1,96 @@
|
|||||||
|
package pool
|
||||||
|
|
||||||
|
import (
|
||||||
|
"reflect"
|
||||||
|
"testing"
|
||||||
|
)
|
||||||
|
|
||||||
|
func TestBuildArgs_PreservesNonOverlapping(t *testing.T) {
|
||||||
|
base := []string{"-ngl", "999", "-c", "32768", "--flash-attn", "on", "--no-mmap"}
|
||||||
|
user := []string{"--top-k", "20"}
|
||||||
|
got := buildArgs(base, "/model.gguf", 8500, user)
|
||||||
|
|
||||||
|
// -c 32768 must survive (user didn't supply -c)
|
||||||
|
if !containsSeq(got, "-c", "32768") {
|
||||||
|
t.Errorf("-c 32768 missing from args: %v", got)
|
||||||
|
}
|
||||||
|
// --top-k 20 must be present (user flag)
|
||||||
|
if !containsSeq(got, "--top-k", "20") {
|
||||||
|
t.Errorf("--top-k 20 missing from args: %v", got)
|
||||||
|
}
|
||||||
|
// --model and --port injected
|
||||||
|
if !containsSeq(got, "--model", "/model.gguf") {
|
||||||
|
t.Errorf("--model missing: %v", got)
|
||||||
|
}
|
||||||
|
if !containsSeq(got, "--port", "8500") {
|
||||||
|
t.Errorf("--port missing: %v", got)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestBuildArgs_UserOverridesBase(t *testing.T) {
|
||||||
|
base := []string{"-ngl", "999", "-c", "32768"}
|
||||||
|
user := []string{"-c", "131072"}
|
||||||
|
got := buildArgs(base, "/model.gguf", 8500, user)
|
||||||
|
|
||||||
|
// base -c should be dropped, user -c should be present
|
||||||
|
count := 0
|
||||||
|
for i, tok := range got {
|
||||||
|
if tok == "-c" && i+1 < len(got) {
|
||||||
|
count++
|
||||||
|
if got[i+1] == "32768" {
|
||||||
|
t.Errorf("base -c 32768 should have been deduped: %v", got)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if count != 1 {
|
||||||
|
t.Errorf("expected exactly 1 -c flag, got %d in %v", count, got)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestBuildArgs_NoUserFlags(t *testing.T) {
|
||||||
|
base := []string{"-ngl", "999", "-c", "32768", "--no-mmap"}
|
||||||
|
got := buildArgs(base, "/model.gguf", 8500, nil)
|
||||||
|
|
||||||
|
if !containsSeq(got, "-c", "32768") {
|
||||||
|
t.Errorf("-c 32768 missing when no user flags: %v", got)
|
||||||
|
}
|
||||||
|
if !containsSeq(got, "--no-mmap") {
|
||||||
|
t.Errorf("--no-mmap missing: %v", got)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestDedupFlags_Mixed(t *testing.T) {
|
||||||
|
auto := []string{"--top-k", "40", "-c", "32768", "--no-mmap"}
|
||||||
|
user := []string{"--top-k", "20"}
|
||||||
|
got := dedupFlags(auto, user)
|
||||||
|
want := []string{"-c", "32768", "--no-mmap"}
|
||||||
|
if !reflect.DeepEqual(got, want) {
|
||||||
|
t.Errorf("dedupFlags = %v, want %v", got, want)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestDedupFlags_EqualsForm(t *testing.T) {
|
||||||
|
auto := []string{"--ctx-size=4096", "--no-mmap"}
|
||||||
|
user := []string{"--ctx-size", "8192"}
|
||||||
|
got := dedupFlags(auto, user)
|
||||||
|
want := []string{"--no-mmap"}
|
||||||
|
if !reflect.DeepEqual(got, want) {
|
||||||
|
t.Errorf("dedupFlags = %v, want %v", got, want)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func containsSeq(args []string, seq ...string) bool {
|
||||||
|
for i := 0; i <= len(args)-len(seq); i++ {
|
||||||
|
match := true
|
||||||
|
for j, s := range seq {
|
||||||
|
if args[i+j] != s {
|
||||||
|
match = false
|
||||||
|
break
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if match {
|
||||||
|
return true
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return false
|
||||||
|
}
|
||||||
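The override behavior these tests check (a user flag removes the matching base flag, then user flags are appended so they win) can be sketched as one self-contained function. This is an illustration under assumptions: flagName is a hypothetical stand-in for validator.FlagName that does not resolve aliases such as -c and --ctx-size, and mergeArgs is an illustrative name that omits the --model and --port injection done by buildArgs.

```go
package main

import (
	"fmt"
	"strings"
)

// flagName is a hypothetical stand-in for validator.FlagName: it returns the
// part before "=" for tokens that start with "-", and "" for anything else.
func flagName(tok string) string {
	if !strings.HasPrefix(tok, "-") {
		return ""
	}
	if i := strings.IndexByte(tok, '='); i >= 0 {
		return tok[:i]
	}
	return tok
}

// mergeArgs drops base flags that the user also set, then appends user flags.
func mergeArgs(base, user []string) []string {
	overridden := make(map[string]bool)
	for _, tok := range user {
		if n := flagName(tok); n != "" {
			overridden[n] = true
		}
	}
	var out []string
	for i := 0; i < len(base); i++ {
		n := flagName(base[i])
		// A flag owns the next token when it has no "=" and that token is not a flag.
		hasValue := n != "" && !strings.Contains(base[i], "=") &&
			i+1 < len(base) && flagName(base[i+1]) == ""
		if n != "" && overridden[n] {
			if hasValue {
				i++ // skip the value that belongs to the dropped flag
			}
			continue
		}
		out = append(out, base[i])
	}
	return append(out, user...)
}

func main() {
	base := []string{"-ngl", "999", "-c", "32768", "--no-mmap"}
	user := []string{"-c", "131072"}
	fmt.Println(mergeArgs(base, user))
}
```

Dropping the base flag rather than relying only on last-wins parsing keeps the final command line free of duplicates, which is what TestBuildArgs_UserOverridesBase counts.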
7
internal/pool/sidecar_unix.go
Normal file
@@ -0,0 +1,7 @@
|
|||||||
|
//go:build !windows
|
||||||
|
|
||||||
|
package pool
|
||||||
|
|
||||||
|
import "os/exec"
|
||||||
|
|
||||||
|
func setPlatformAttrs(_ *exec.Cmd) {}
|
||||||
15
internal/pool/sidecar_windows.go
Normal file
@@ -0,0 +1,15 @@
|
|||||||
|
//go:build windows
|
||||||
|
|
||||||
|
package pool
|
||||||
|
|
||||||
|
import (
|
||||||
|
"os/exec"
|
||||||
|
"syscall"
|
||||||
|
)
|
||||||
|
|
||||||
|
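// setPlatformAttrs starts the child hidden, detached from the daemon's console, and in its own process group.
|
||||||
|
func setPlatformAttrs(cmd *exec.Cmd) {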
|
||||||
|
cmd.SysProcAttr = &syscall.SysProcAttr{
|
||||||
|
HideWindow: true,
|
||||||
|
CreationFlags: 0x00000008 | 0x00000200, // DETACHED_PROCESS | CREATE_NEW_PROCESS_GROUP
|
||||||
|
}
|
||||||
|
}
|
||||||
42
internal/server/admin.go
Normal file
@@ -0,0 +1,42 @@
package server

import (
	"net/http"
	"time"

	"github.com/indifferentketchup/llama-sidecar/internal/config"
	"github.com/indifferentketchup/llama-sidecar/internal/pool"
)

func healthHandler(p *pool.Pool, cfg *config.Config, startedAt time.Time) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		sidecars := p.List()
		writeJSON(w, http.StatusOK, map[string]any{
			"status":         "ok",
			"sidecars":       len(sidecars),
			"max":            cfg.MaxSidecars,
			"uptime_seconds": int(time.Since(startedAt).Seconds()),
		})
	}
}

func listSidecarsHandler(p *pool.Pool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		writeJSON(w, http.StatusOK, p.List())
	}
}

func deleteSidecarHandler(p *pool.Pool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		hash := r.PathValue("hash")
		if hash == "" {
			writeJSON(w, http.StatusBadRequest, map[string]string{"error": "hash required"})
			return
		}
		if err := p.Remove(hash); err != nil {
			writeJSON(w, http.StatusNotFound, map[string]string{"error": err.Error()})
			return
		}
		writeJSON(w, http.StatusOK, map[string]string{"status": "removed"})
	}
}
111
internal/server/proxy.go
Normal file
@@ -0,0 +1,111 @@
package server

import (
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"

	"github.com/indifferentketchup/llama-sidecar/internal/pool"
)

var shellUnsafe = strings.NewReplacer(
	"`", "", "$", "", "|", "", ";", "", "&", "", "\n", "",
)

func parseFlags(raw string) ([]string, error) {
	cleaned := shellUnsafe.Replace(raw)
	if cleaned != raw {
		return nil, fmt.Errorf("flags contain unsafe characters")
	}
	return splitArgs(strings.TrimSpace(raw)), nil
}

func splitArgs(s string) []string {
	if s == "" {
		return nil
	}
	return strings.Fields(s)
}

func proxyHandler(p *pool.Pool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flagsRaw := r.Header.Get("X-Agent-Flags")
		var flags []string
		if flagsRaw != "" {
			var err error
			flags, err = parseFlags(flagsRaw)
			if err != nil {
				writeJSON(w, http.StatusBadRequest, map[string]string{
					"error": err.Error(),
				})
				return
			}
		}

		modelID := r.Header.Get("X-Model-Id")
		if modelID == "" {
			body, err := io.ReadAll(io.LimitReader(r.Body, 1<<20))
			if err != nil {
				writeJSON(w, http.StatusBadRequest, map[string]string{"error": "failed to read body"})
				return
			}
			var req struct {
				Model string `json:"model"`
			}
			if err := json.Unmarshal(body, &req); err == nil && req.Model != "" {
				modelID = req.Model
			}
			r.Body = io.NopCloser(strings.NewReader(string(body)))
			r.ContentLength = int64(len(body))
		}
		if modelID == "" {
			writeJSON(w, http.StatusBadRequest, map[string]string{"error": "model not specified (X-Model-Id header or body.model)"})
			return
		}

		sidecar, err := p.Acquire(r.Context(), modelID, flags)
		if err != nil {
			errMsg := err.Error()
			status := http.StatusInternalServerError
			if strings.Contains(errMsg, "validation:") {
				status = http.StatusBadRequest
			} else if strings.Contains(errMsg, "unknown model:") {
				status = http.StatusNotFound
			} else if strings.Contains(errMsg, "port allocation:") {
				status = http.StatusServiceUnavailable
			}
			writeJSON(w, status, map[string]string{"error": errMsg})
			return
		}

		target := &url.URL{
			Scheme: "http",
			Host:   fmt.Sprintf("127.0.0.1:%d", sidecar.Port),
		}
		proxy := httputil.NewSingleHostReverseProxy(target)
		proxy.ErrorHandler = func(rw http.ResponseWriter, req *http.Request, err error) {
			slog.Error("upstream error", "hash", sidecar.Hash, "port", sidecar.Port, "err", err)
			writeJSON(rw, http.StatusBadGateway, map[string]any{
				"error":        "upstream unavailable",
				"error_detail": err.Error(),
				"sidecar_hash": sidecar.Hash,
				"sidecar_port": sidecar.Port,
				"last_stderr":  sidecar.LastStderr(),
			})
		}

		sidecar.TouchLastUsed()
		proxy.ServeHTTP(w, r)
	}
}

func writeJSON(w http.ResponseWriter, status int, v any) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	json.NewEncoder(w).Encode(v)
}
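proxyHandler picks its HTTP status by searching the error text for prefixes such as "validation:" and "unknown model:". A possible alternative, sketched below, is to match on sentinel errors with errors.Is so that rewording a message cannot silently change the status code. The sentinel variables and statusFor are hypothetical; nothing in this commit shows the pool package exporting such values.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Hypothetical sentinel errors; the pool package in this commit does not
// necessarily define anything like them.
var (
	errValidation     = errors.New("validation")
	errUnknownModel   = errors.New("unknown model")
	errPortAllocation = errors.New("port allocation")
)

// statusFor maps an acquire error to the same HTTP statuses proxyHandler uses.
func statusFor(err error) int {
	switch {
	case errors.Is(err, errValidation):
		return http.StatusBadRequest
	case errors.Is(err, errUnknownModel):
		return http.StatusNotFound
	case errors.Is(err, errPortAllocation):
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	// Wrapping with %w keeps the sentinel reachable through errors.Is.
	err := fmt.Errorf("%w: no GGUF registered for %q", errUnknownModel, "demo-model")
	fmt.Println(statusFor(err), err)
}
```

Because the wrapped message still begins with the sentinel's text, the human-readable error stays in the same "unknown model: ..." shape that the string-matching version expects.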
56
internal/server/server.go
Normal file
@@ -0,0 +1,56 @@
package server

import (
	"log/slog"
	"net/http"
	"time"

	"github.com/indifferentketchup/llama-sidecar/internal/config"
	"github.com/indifferentketchup/llama-sidecar/internal/pool"
)

func New(cfg *config.Config, p *pool.Pool, startedAt time.Time) *http.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("GET /health", healthHandler(p, cfg, startedAt))
	mux.HandleFunc("GET /sidecars", listSidecarsHandler(p))
	mux.HandleFunc("DELETE /sidecars/{hash}", deleteSidecarHandler(p))
	mux.HandleFunc("POST /v1/chat/completions", proxyHandler(p))
	mux.HandleFunc("POST /v1/completions", proxyHandler(p))

	handler := requestLogger(mux)

	return &http.Server{
		Addr:    cfg.Bind,
		Handler: handler,
	}
}

func requestLogger(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rw := &statusRecorder{ResponseWriter: w, status: 200}
		next.ServeHTTP(rw, r)
		slog.Info("request",
			"method", r.Method,
			"path", r.URL.Path,
			"status", rw.status,
			"duration_ms", time.Since(start).Milliseconds(),
		)
	})
}

type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (sr *statusRecorder) WriteHeader(code int) {
	sr.status = code
	sr.ResponseWriter.WriteHeader(code)
}

func (sr *statusRecorder) Flush() {
	if f, ok := sr.ResponseWriter.(http.Flusher); ok {
		f.Flush()
	}
}
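The statusRecorder above is initialised with status 200 because a handler that only calls Write never triggers WriteHeader on the wrapper. The sketch below illustrates that behaviour with net/http/httptest. The recorder type is copied from the listing (Flush omitted for brevity); recordStatus and the two demo handlers are hypothetical helpers added for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusRecorder is copied from the listing above (Flush omitted for brevity).
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (sr *statusRecorder) WriteHeader(code int) {
	sr.status = code
	sr.ResponseWriter.WriteHeader(code)
}

// recordStatus runs h against a synthetic GET request and returns the status
// the recorder observed.
func recordStatus(h func(http.ResponseWriter, *http.Request)) int {
	rw := &statusRecorder{ResponseWriter: httptest.NewRecorder(), status: 200}
	h(rw, httptest.NewRequest(http.MethodGet, "/demo", nil))
	return rw.status
}

// implicitOK writes a body without calling WriteHeader; net/http implies 200.
func implicitOK(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("ok"))
}

// explicitNotFound sets the status explicitly.
func explicitNotFound(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusNotFound)
}

func main() {
	fmt.Println(recordStatus(implicitOK), recordStatus(explicitNotFound))
}
```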
156
internal/validator/validator.go
Normal file
@@ -0,0 +1,156 @@
// SPDX-License-Identifier: AGPL-3.0-only
// Copyright 2026-present the Unsloth AI Inc. team. All rights reserved.
// Ported from studio/backend/core/inference/llama_server_args.py.
// Original: https://github.com/unslothai/unsloth/blob/main/studio/backend/core/inference/llama_server_args.py

package validator

import (
	"fmt"
	"strings"
)

var denylistGroups = [][]string{
	// Model identity
	{"-m", "--model"},
	{"-mu", "--model-url"},
	{"-dr", "--docker-repo"},
	{"-hf", "-hfr", "--hf-repo"},
	{"-hff", "--hf-file"},
	{"-hfv", "-hfrv", "--hf-repo-v"},
	{"-hffv", "--hf-file-v"},
	{"-hft", "--hf-token"},
	{"-mm", "--mmproj"},
	{"-mmu", "--mmproj-url"},
	// Networking
	{"--host"},
	{"--port"},
	{"--path"},
	{"--api-prefix"},
	{"--reuse-port"},
	// Auth / TLS
	{"--api-key"},
	{"--api-key-file"},
	{"--ssl-key-file"},
	{"--ssl-cert-file"},
	// Server UI / multi-model
	{"--webui", "--no-webui"},
	{"--ui", "--no-ui"},
	{"--ui-config"},
	{"--ui-config-file"},
	{"--ui-mcp-proxy", "--no-ui-mcp-proxy"},
	{"--models-dir"},
	{"--models-preset"},
	{"--models-max"},
	{"--models-autoload", "--no-models-autoload"},
}

var denylist map[string]bool

func init() {
	denylist = make(map[string]bool)
	for _, group := range denylistGroups {
		for _, flag := range group {
			denylist[flag] = true
		}
	}
}

// FlagName returns the flag name for a CLI token, or "" if it isn't a flag.
// Peels --key=value to the bare --key. Numeric values like -1 or -0.5
// (e.g. --seed -1) are treated as values, not flags.
func FlagName(token string) string {
	if !strings.HasPrefix(token, "-") || token == "-" || token == "--" {
		return ""
	}
	if len(token) >= 2 && (token[1] >= '0' && token[1] <= '9' || token[1] == '.') {
		return ""
	}
	if idx := strings.IndexByte(token, '='); idx >= 0 {
		return token[:idx]
	}
	return token
}

// ValidateExtraArgs validates user-supplied llama-server args. Returns the
// args as a flat slice. Returns an error with the offending flag if any
// token resolves to a managed flag.
func ValidateExtraArgs(args []string) ([]string, error) {
	if len(args) == 0 {
		return nil, nil
	}
	out := make([]string, 0, len(args))
	for _, raw := range args {
		flag := FlagName(raw)
		if flag != "" && denylist[flag] {
			return nil, fmt.Errorf("llama-server flag '%s' is managed and cannot be passed as an extra arg", flag)
		}
		out = append(out, raw)
	}
	return out, nil
}

// IsManagedFlag returns true if flag is a managed llama-server flag.
func IsManagedFlag(flag string) bool {
	return denylist[flag]
}

var contextFlags = setOf("-c", "--ctx-size")
var cacheFlags = setOf("-ctk", "--cache-type-k", "-ctv", "--cache-type-v")
var specFlags = setOf(
	"--spec-default", "--spec-type", "--spec-ngram-size-n", "--spec-ngram-size",
	"--draft-min", "--draft-max",
	"--spec-draft-n-max", "--spec-draft-n-min", "--spec-draft-p-min", "--spec-draft-p-split",
	"--spec-ngram-mod-n-match", "--spec-ngram-mod-n-min", "--spec-ngram-mod-n-max",
)
var templateFlags = setOf(
	"--chat-template", "--chat-template-file", "--chat-template-kwargs",
	"--jinja", "--no-jinja",
)
var booleanShadowingFlags = setOf("--spec-default", "--jinja", "--no-jinja")

func setOf(vals ...string) map[string]bool {
	m := make(map[string]bool, len(vals))
	for _, v := range vals {
		m[v] = true
	}
	return m
}

// StripShadowingFlags removes flags that shadow first-class settings from
// the arg list. By default all shadowing groups are stripped.
func StripShadowingFlags(args []string) []string {
	shadowing := make(map[string]bool)
	for k, v := range contextFlags {
		shadowing[k] = v
	}
	for k, v := range cacheFlags {
		shadowing[k] = v
	}
	for k, v := range specFlags {
		shadowing[k] = v
	}
	for k, v := range templateFlags {
		shadowing[k] = v
	}

	out := make([]string, 0, len(args))
	i, n := 0, len(args)
	for i < n {
		tok := args[i]
		flag := FlagName(tok)
		if flag == "" || !shadowing[flag] {
			out = append(out, tok)
			i++
			continue
		}
		if booleanShadowingFlags[flag] || strings.Contains(tok, "=") {
			i++
		} else if i+1 < n && FlagName(args[i+1]) == "" {
			i += 2
		} else {
			i++
		}
	}
	return out
}
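The validator relies on two rules: a token of the form --key=value is reduced to --key before the denylist lookup, and tokens such as -1 or -0.5 are treated as values rather than flags. The condensed sketch below restates those rules in a standalone program. The flagName body follows FlagName above; the three-entry managed map and the firstManaged helper are hypothetical simplifications, not the package's API.

```go
package main

import (
	"fmt"
	"strings"
)

// managed is a deliberately tiny subset of the denylist, for illustration only.
var managed = map[string]bool{"-m": true, "--model": true, "--port": true}

// flagName follows the same rules as FlagName above: "--key=value" becomes
// "--key", and tokens such as "-1" or "-0.5" count as values, not flags.
func flagName(token string) string {
	if !strings.HasPrefix(token, "-") || token == "-" || token == "--" {
		return ""
	}
	if c := token[1]; (c >= '0' && c <= '9') || c == '.' {
		return ""
	}
	if idx := strings.IndexByte(token, '='); idx >= 0 {
		return token[:idx]
	}
	return token
}

// firstManaged returns the first managed flag found in args, or "" if none.
func firstManaged(args []string) string {
	for _, raw := range args {
		if name := flagName(raw); name != "" && managed[name] {
			return name
		}
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", firstManaged([]string{"--seed", "-1", "--top-k", "40"}))
	fmt.Printf("%q\n", firstManaged([]string{"--top-k", "40", "--model=evil.gguf"}))
}
```

The first call finds nothing to reject because -1 is classified as a value; the second reports --model even though the caller tried to attach the value with an equals sign.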
150
internal/validator/validator_test.go
Normal file
@@ -0,0 +1,150 @@
package validator

import (
	"testing"
)

func TestValidateExtraArgs_DenyList(t *testing.T) {
	denied := []string{
		"-m", "--model",
		"-mu", "--model-url",
		"-dr", "--docker-repo",
		"-hf", "-hfr", "--hf-repo",
		"-hff", "--hf-file",
		"-hfv", "-hfrv", "--hf-repo-v",
		"-hffv", "--hf-file-v",
		"-hft", "--hf-token",
		"-mm", "--mmproj",
		"-mmu", "--mmproj-url",
		"--host", "--port", "--path", "--api-prefix", "--reuse-port",
		"--api-key", "--api-key-file",
		"--ssl-key-file", "--ssl-cert-file",
		"--webui", "--no-webui", "--ui", "--no-ui",
		"--ui-config", "--ui-config-file",
		"--ui-mcp-proxy", "--no-ui-mcp-proxy",
		"--models-dir", "--models-preset", "--models-max",
		"--models-autoload", "--no-models-autoload",
	}
	for _, flag := range denied {
		t.Run(flag, func(t *testing.T) {
			_, err := ValidateExtraArgs([]string{flag})
			if err == nil {
				t.Fatalf("expected error for %s", flag)
			}
		})
	}
}

func TestValidateExtraArgs_SafeFlags(t *testing.T) {
	safe := []string{
		"-c", "--ctx-size", "-ngl", "--gpu-layers",
		"--top-k", "--cache-type-k", "--jinja", "--no-jinja",
		"--spec-draft-n-max", "-fa", "--flash-attn",
		"-t", "--threads", "-np", "--parallel", "--no-mmap",
	}
	for _, flag := range safe {
		t.Run(flag, func(t *testing.T) {
			out, err := ValidateExtraArgs([]string{flag})
			if err != nil {
				t.Fatalf("unexpected error for %s: %v", flag, err)
			}
			if len(out) != 1 || out[0] != flag {
				t.Fatalf("expected [%s], got %v", flag, out)
			}
		})
	}
}

func TestValidateExtraArgs_FlagEqualsValue(t *testing.T) {
	_, err := ValidateExtraArgs([]string{"--model=evil.gguf"})
	if err == nil {
		t.Fatal("expected error for --model=evil.gguf")
	}
	out, err := ValidateExtraArgs([]string{"--ctx-size=4096"})
	if err != nil {
		t.Fatal(err)
	}
	if len(out) != 1 || out[0] != "--ctx-size=4096" {
		t.Fatalf("expected [--ctx-size=4096], got %v", out)
	}
}

func TestValidateExtraArgs_NegativeNumber(t *testing.T) {
	out, err := ValidateExtraArgs([]string{"--seed", "-1"})
	if err != nil {
		t.Fatal(err)
	}
	if len(out) != 2 {
		t.Fatalf("expected 2 tokens, got %d", len(out))
	}
}

func TestValidateExtraArgs_Empty(t *testing.T) {
	out, err := ValidateExtraArgs(nil)
	if err != nil {
		t.Fatal(err)
	}
	if out != nil {
		t.Fatalf("expected nil, got %v", out)
	}
}

func TestIsManagedFlag(t *testing.T) {
	if !IsManagedFlag("--model") {
		t.Fatal("--model should be managed")
	}
	if !IsManagedFlag("-m") {
		t.Fatal("-m should be managed")
	}
	if IsManagedFlag("-c") {
		t.Fatal("-c should not be managed")
	}
}

func TestFlagName(t *testing.T) {
	tests := []struct {
		in, want string
	}{
		{"--model=foo", "--model"},
		{"-c", "-c"},
		{"--top-k", "--top-k"},
		{"-1", ""},
		{"-0.5", ""},
		{"-", ""},
		{"--", ""},
		{"hello", ""},
	}
	for _, tt := range tests {
		got := FlagName(tt.in)
		if got != tt.want {
			t.Errorf("FlagName(%q) = %q, want %q", tt.in, got, tt.want)
		}
	}
}

func TestStripShadowingFlags(t *testing.T) {
	t.Run("strips context flag with value", func(t *testing.T) {
		out := StripShadowingFlags([]string{"-c", "4096", "--top-k", "40"})
		if len(out) != 2 || out[0] != "--top-k" || out[1] != "40" {
			t.Fatalf("got %v", out)
		}
	})
	t.Run("retains non-shadowing flags", func(t *testing.T) {
		out := StripShadowingFlags([]string{"--top-k", "40", "--top-p", "0.95"})
		if len(out) != 4 {
			t.Fatalf("got %v", out)
		}
	})
	t.Run("strips boolean jinja flag", func(t *testing.T) {
		out := StripShadowingFlags([]string{"--jinja", "--top-k", "40"})
		if len(out) != 2 || out[0] != "--top-k" {
			t.Fatalf("got %v", out)
		}
	})
	t.Run("strips equals form", func(t *testing.T) {
		out := StripShadowingFlags([]string{"--ctx-size=4096"})
		if len(out) != 0 {
			t.Fatalf("got %v", out)
		}
	})
}
26
internal/winsvc/winsvc_unix.go
Normal file
@@ -0,0 +1,26 @@
//go:build !windows

package winsvc

import (
	"context"
	"log/slog"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func RegisterShutdownHandler(ctx context.Context, shutdownFunc func(context.Context) error) {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
	<-sigCh
	slog.Info("shutdown signal received")
	shutdownCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	if err := shutdownFunc(shutdownCtx); err != nil {
		slog.Error("shutdown error", "err", err)
		os.Exit(1)
	}
	os.Exit(0)
}
25
internal/winsvc/winsvc_windows.go
Normal file
@@ -0,0 +1,25 @@
//go:build windows

package winsvc

import (
	"context"
	"log/slog"
	"os"
	"os/signal"
	"time"
)

func RegisterShutdownHandler(ctx context.Context, shutdownFunc func(context.Context) error) {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, os.Interrupt)
	<-sigCh
	slog.Info("shutdown signal received")
	shutdownCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	if err := shutdownFunc(shutdownCtx); err != nil {
		slog.Error("shutdown error", "err", err)
		os.Exit(1)
	}
	os.Exit(0)
}
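Both RegisterShutdownHandler variants block on a signal channel and then call os.Exit, which makes them awkward to exercise in a test. The sketch below shows one possible restructuring under stated assumptions: the wait is driven by a context (in a real daemon it could come from signal.NotifyContext in os/signal), and the shutdown error is returned to the caller rather than turned into an exit code. The names waitAndShutdown and demo are hypothetical and do not appear in the commit.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitAndShutdown blocks until ctx is cancelled, then calls shutdown with a
// fresh context bounded by timeout. It returns the shutdown error instead of
// calling os.Exit, which leaves the exit code decision to main.
func waitAndShutdown(ctx context.Context, timeout time.Duration, shutdown func(context.Context) error) error {
	<-ctx.Done()
	// Derived from Background so the already-cancelled ctx does not cut the
	// grace period short.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return shutdown(shutdownCtx)
}

// demo simulates a signal by cancelling the context by hand and reports
// whether the shutdown callback ran with a still-live context.
func demo() (called bool, err error) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // stands in for a delivered SIGINT or SIGTERM
	err = waitAndShutdown(ctx, time.Second, func(c context.Context) error {
		called = true
		return c.Err() // nil while the grace period is still running
	})
	return called, err
}

func main() {
	fmt.Println(demo())
}
```

Note the difference from the listing: there the 30-second timeout context is derived from the ctx argument, so it inherits any cancellation of that parent, whereas this sketch deliberately starts the grace period from a background context.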