llama-sidecar v0.1.0: daemon + benchmarks + eval suite
Go daemon (cmd/llama-sidecar): per-agent llama-server process pool with LRU eviction, OpenAI-compatible proxy, flag validation (Unsloth port), deterministic hash-keyed sidecar reuse. Windows service support via schtasks/NSSM with DETACHED_PROCESS, stdout pipe drain, and request-ctx decoupled child lifetime.

Bug fixes (3b.1–3b.5): -c flag drop from StripShadowingFlags, UTF-8 BOM in JSON config, -fa → --flash-attn on default, child process exit after one request (stdin devnull, stdout pipe, CREATE_NO_WINDOW → DETACHED, context.Background for child lifetime, background reaper goroutine).

bench/: MTP on/off throughput sweep across 8 GGUFs via SSH+schtasks automation to sam-desktop. Per-GGUF production flags from llama-swap config with --ctx-size 32768 override.

eval/: accuracy benchmarks (MMLU 100q, GSM8K 50q, HumanEval 164) + A/B model comparison (14 agent-typed prompts × 8 models). All scripts resumable at individual question level.

94 Go tests, race detector clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
72
eval/ab/prompts.json
Normal file
@@ -0,0 +1,72 @@
[
  {
    "id": "review-1",
    "agent": "Code Reviewer",
    "prompt": "Review the `buildHeadPayload` function in `apps/server/src/services/compaction.ts`. It was recently patched in v1.13.6 to embed `reasoning_parts` as a `<reasoning>...</reasoning>` prose prefix on the assistant content for tool-bearing turns. Check: does the current implementation handle the case where `reasoning_parts` is an empty array? Does it handle turns that have both reasoning_parts AND non-empty text content (not just tool calls)? Cite file:line for any issues."
  },
  {
    "id": "review-2",
    "agent": "Code Reviewer",
    "prompt": "Review the path guard layer in `apps/coder/services/path_guard.ts`. It enforces per-project scoping with a blanket `/opt:rw` mount and policy at the tool layer. Check for: symlink traversal (does it resolve symlinks before checking?), double-encoding attacks on path components, race conditions between check and use (TOCTOU), and whether `extraRoots` from `request_read_access` grants could be abused to escape the project scope. Cite file:line."
  },
  {
    "id": "debug-1",
    "agent": "Debugger",
    "prompt": "Bug report: after a long BooCode chat session (~40 messages), the compaction trigger fires but the resulting summary is empty — the assistant message with `summary=true` has blank content. The `ctx_max` is correctly fetched from `/upstream/<model>/props` (verified in logs). The `needs_compaction` flag is being set. But the summary inference returns an empty string. This started happening after the v1.13.7 compaction trigger change that lowered the threshold to `floor(0.85 * ctx_max)`. Diagnose: what code path could produce an empty summary, and what would you check first?"
  },
  {
    "id": "debug-2",
    "agent": "Debugger",
    "prompt": "Bug report: BooTerm terminal pane shows garbled output past column 66 on initial open, but corrects itself after manually resizing the browser window. The `stty size` inside the terminal reports `82 66` even though the pane is visually ~132 columns wide. tmux `list-windows` confirms the session was created at 66 columns. This only happens when opening a terminal pane via the split-pane button, not when opening it as the sole pane. Diagnose the root cause in `apps/web/src/components/panes/TerminalPane.tsx`."
  },
  {
    "id": "refactor-1",
    "agent": "Refactorer",
    "prompt": "The `streamCompletion` function in `apps/server/src/services/provider.ts` has grown to handle: AI SDK v6 streaming, XML fallback parsing for qwen3.6 tool-call emissions, abort signal handling (the explicit `if (signal?.aborted) throw` patch), reasoning-delta counting, and usage extraction. It's now ~200 lines. Propose a refactor that separates concerns without breaking the streaming contract. The function must remain a single entry point for callers."
  },
  {
    "id": "refactor-2",
    "agent": "Refactorer",
    "prompt": "The WebSocket frame publishing in BooCode went through two batches (v1.13.12 + v1.13.13) that converted ~80 publish sites to typed `publishFrame`/`publishUserFrame` wrappers with Zod validation. The schemas are duplicated byte-identical between `apps/server/src/types/ws-frames.ts` and `apps/web/src/api/ws-frames.ts` with a parity test. Propose a refactor to share the schema definition from a single source instead of maintaining the duplication + parity test."
  },
  {
    "id": "architect-1",
    "agent": "Architect",
    "prompt": "Design the system-prompt prefix cache for BooCode. Context: `buildSystemPromptWithFingerprint` already computes a SHA-256 of the assembled prefix and logs drift. The prefix is rebuilt on every inference turn from: project settings, agent instructions (AGENTS.md), skills, session-level overrides, and web_search_enabled flag. Most of these don't change between turns in the same session. Design a cache that avoids rebuilding+rehashing on every turn. Consider: process-memory vs DB-backed, invalidation strategy, cache key shape, and whether the fingerprint can serve as the cache key itself."
  },
  {
    "id": "architect-2",
    "agent": "Architect",
    "prompt": "Design the v2.5 task model integration with BooCoder's ACP dispatch. Context: v2.5.0-task-model just shipped a `tasks` table and lightweight task model services. BooCoder dispatches external agents (opencode, goose, claude) via ACP or PTY. Design how a task created in BooChat should flow through to a BooCoder dispatch: task creation → agent selection → ACP session → status updates back to the task row → completion. Consider: which fields from the task row map to ACP session params, how task status syncs with the agent's exit code, and how the UI surfaces progress."
  },
  {
    "id": "security-1",
    "agent": "Security Auditor",
    "prompt": "Audit the `web_fetch` tool implementation in BooCode. It fetches arbitrary URLs on behalf of the LLM agent. Check for: SSRF against internal Tailscale IPs (100.x.x.x), DNS rebinding, redirect following to internal hosts, response size limits, content-type validation, and whether the `url_guard.ts` layer covers all cases. The tool is gated by `session.web_search_enabled` but once enabled, the URL is user-agent-controlled (the LLM decides what to fetch)."
  },
  {
    "id": "security-2",
    "agent": "Security Auditor",
    "prompt": "Audit the `request_read_access` tool and `allowed_read_paths` grant mechanism (v1.13.17). When an agent needs to read files outside its project scope, it calls `request_read_access(path)` which triggers an `ask_user_input` elicitation for approval. On approval, the path is added to `allowed_read_paths` for that session, and `pathGuard` is extended with `extraRoots`. Check: can the agent request a path like `/etc/shadow` or `/opt/boocode/.env`? Is the grant scoped to the session or persistent? Can the path be a symlink that resolves to a sensitive location after the grant?"
  },
  {
    "id": "prompt-1",
    "agent": "Prompt Builder",
    "prompt": "Write a Claude Code dispatch prompt for: adding a new BooCode agent called 'Documenter' to AGENTS.md. The agent should read source files and produce inline JSDoc/TSDoc comments. It should use the read-only tool set. Temperature 0.4, steps 10. The prompt should include pre-flight checks, the exact file to modify, backup instructions, and verification steps."
  },
  {
    "id": "prompt-2",
    "agent": "Prompt Builder",
    "prompt": "Write an OpenCode dispatch prompt for: fixing the codecontext sidecar to handle projects with more than 10,000 files without OOMing. The fork is at /opt/forks/codecontext/. The agent should investigate the memory profile of the graph analysis pass, identify the allocation hotspot, and propose a streaming or chunked alternative. Include #careful hashtag, backup rules, and stop conditions."
  },
  {
    "id": "recon-1",
    "agent": "Recon",
    "prompt": "Map the BooCode monorepo at /opt/boocode/. I need: top-level directory structure, the three apps and their roles, how they share the database, the Docker container topology, and the key service files in apps/server/src/services/. Identify the data flow from a user message in BooChat through to the LLM inference call and back."
  },
  {
    "id": "recon-2",
    "agent": "Recon",
    "prompt": "Map the codecontext fork at /opt/forks/codecontext/. I need: the MCP tool surface (what tools are exposed), the parser architecture (how tree-sitter grammars are registered), the graph analysis pipeline (how dependencies and call graphs are built), and the codesight-merge additions (blast radius, hot files, routes, middleware). Identify the main entry points and the caching layer."
  }
]
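Review note: run.sh reads the `id`, `agent`, and `prompt` fields of every entry and uses `id` as a directory name under results/. A minimal pre-flight check could look like the sketch below. It assumes those three string fields are the whole schema; `validate_prompts` is a hypothetical helper and is not part of this commit.

```python
import json

REQUIRED_KEYS = ("id", "agent", "prompt")  # the fields eval/ab/run.sh reads


def validate_prompts(entries: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the file is usable."""
    problems = []
    seen_ids = set()
    for index, entry in enumerate(entries):
        for key in REQUIRED_KEYS:
            value = entry.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {index}: missing or empty '{key}'")
        prompt_id = entry.get("id")
        if isinstance(prompt_id, str):
            # The id becomes a directory name, so it must not contain a slash.
            if "/" in prompt_id:
                problems.append(f"entry {index}: id '{prompt_id}' contains '/'")
            if prompt_id in seen_ids:
                problems.append(f"entry {index}: duplicate id '{prompt_id}'")
            seen_ids.add(prompt_id)
    return problems


if __name__ == "__main__":
    sample = json.loads(
        '[{"id": "review-1", "agent": "Code Reviewer", "prompt": "Review X."}]'
    )
    print(validate_prompts(sample))
```

Running it against the real file would amount to `validate_prompts(json.load(open("prompts.json")))` before the first request is sent.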
242
eval/ab/run.sh
Executable file
@@ -0,0 +1,242 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
ENDPOINT="http://100.101.41.16:8401/v1"
PROMPTS_FILE="${SCRIPT_DIR}/prompts.json"
RESULTS_DIR="${SCRIPT_DIR}/results"
COMPARE_FILE="${SCRIPT_DIR}/COMPARE.md"
TIMING_FILE="${SCRIPT_DIR}/timing.csv"

MODELS=(
  qwen3.6-35b-a3b-mxfp4
  qwen3-coder-30b-apex
  qwen3.6-27b-mtp
  qwopus3.5-4b-mtp
  qwen3.5-9b-deepseek-v4-mtp
  qwopus3.6-35b-a3b-v1
  qwopus3.6-27b-v2-mtp
  qwopus3.5-9b-coder-mtp
)

mkdir -p "$RESULTS_DIR"

# ── Parse prompts ─────────────────────────────────────────────────────

PROMPT_COUNT=$(python3 -c "import json; print(len(json.load(open('${PROMPTS_FILE}'))))")
TOTAL=$((PROMPT_COUNT * ${#MODELS[@]}))
EST_MIN=$(( TOTAL * 30 / 60 ))

echo "================================================================"
echo " A/B MODEL COMPARISON"
echo " ${PROMPT_COUNT} prompts × ${#MODELS[@]} models = ${TOTAL} requests"
echo " Estimated runtime: ~${EST_MIN} minutes"
echo " Endpoint: ${ENDPOINT}"
echo "================================================================"
echo ""

# ── Main loop: models (outer) × prompts (inner) ──────────────────────
# One model load per model, all prompts answered, then swap.

t_start=$(date +%s)
done_count=0

for model in "${MODELS[@]}"; do
  echo ""
  echo "================================================================"
  echo " MODEL: ${model}"
  echo "================================================================"

  # Skip the model entirely (no load, no warmup) when every prompt
  # already has a non-empty result file.
  all_cached=true
  for pidx in $(seq 0 $((PROMPT_COUNT - 1))); do
    PID=$(python3 -c "import json; print(json.load(open('${PROMPTS_FILE}'))[${pidx}]['id'])")
    # -s is false for both missing and empty files
    if [ ! -s "${RESULTS_DIR}/${PID}/${model}.json" ]; then
      all_cached=false
      break
    fi
  done

  if [ "$all_cached" = "true" ]; then
    echo " All ${PROMPT_COUNT} prompts cached, skipping model"
    done_count=$((done_count + PROMPT_COUNT))
    continue
  fi

  # Warmup: load the model with a trivial request. "|| true" keeps a
  # failed warmup from aborting the whole run under `set -e`.
  echo " Warming up..."
  curl -s -X POST "${ENDPOINT}/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"${model}\",\"messages\":[{\"role\":\"user\",\"content\":\"Say OK.\"}],\"max_tokens\":10,\"temperature\":0}" \
    --max-time 300 > /dev/null 2>&1 || true
  echo " Warm."

  for pidx in $(seq 0 $((PROMPT_COUNT - 1))); do
    PROMPT_ID=$(python3 -c "import json; print(json.load(open('${PROMPTS_FILE}'))[${pidx}]['id'])")
    AGENT=$(python3 -c "import json; print(json.load(open('${PROMPTS_FILE}'))[${pidx}]['agent'])")

    mkdir -p "${RESULTS_DIR}/${PROMPT_ID}"
    OUT_JSON="${RESULTS_DIR}/${PROMPT_ID}/${model}.json"
    OUT_MD="${RESULTS_DIR}/${PROMPT_ID}/${model}.md"

    # Resume: skip if already done
    if [ -f "$OUT_JSON" ] && [ -s "$OUT_JSON" ]; then
      done_count=$((done_count + 1))
      echo " [${PROMPT_ID}] cached (${done_count}/${TOTAL})"
      continue
    fi

    BODY=$(python3 -c "
import json
p = json.load(open('${PROMPTS_FILE}'))[${pidx}]
print(json.dumps({
    'model': '${model}',
    'messages': [{'role': 'user', 'content': p['prompt']}],
    'temperature': 0.6,
    'max_tokens': 2048,
    'seed': 42,
    'stream': False
}))
")

    SUCCESS=0
    for attempt in 1 2; do
      # "|| true": under `set -e` a curl failure (timeout, refused
      # connection) would abort the script before the retry. curl still
      # prints 000 as the HTTP code when no response arrives.
      HTTP_CODE=$(curl -s -w '%{http_code}' -o "$OUT_JSON" \
        --max-time 300 \
        -X POST "${ENDPOINT}/chat/completions" \
        -H "Content-Type: application/json" \
        -d "$BODY" 2>/dev/null) || true

      if [ "$HTTP_CODE" = "200" ]; then
        SUCCESS=1
        break
      else
        if [ "$attempt" = "1" ]; then
          echo " [${PROMPT_ID}] HTTP ${HTTP_CODE}, retrying in 10s..."
          sleep 10
        else
          # Drop the error body so the resume check retries this prompt
          # on the next run instead of treating it as cached.
          rm -f "$OUT_JSON"
          echo "ERROR: HTTP ${HTTP_CODE}" > "$OUT_MD"
          echo " [${PROMPT_ID}] FAILED (HTTP ${HTTP_CODE})"
        fi
      fi
    done

    if [ "$SUCCESS" = "1" ]; then
      python3 -c "
import json
d = json.load(open('${OUT_JSON}'))
msg = d.get('choices', [{}])[0].get('message', {})
content = msg.get('content', '') or ''
reasoning = msg.get('reasoning_content', '') or ''
out = ''
if reasoning:
    out += '<think>\n' + reasoning + '\n</think>\n\n'
out += content
open('${OUT_MD}', 'w').write(out)
" 2>/dev/null
      done_count=$((done_count + 1))
      METRICS=$(python3 -c "
import json
d = json.load(open('${OUT_JSON}'))
t = d.get('timings', {})
tps = t.get('predicted_per_second', 0)
tok = d.get('usage', {}).get('completion_tokens', 0)
print(f'{tps:.1f}tok/s {tok}tok')
" 2>/dev/null || echo "?")
      echo " [${PROMPT_ID}] done (${METRICS}) [${done_count}/${TOTAL}]"
    fi

    sleep 2
  done
done

# ── Generate COMPARE.md ──────────────────────────────────────────────

echo ""
echo "Generating COMPARE.md..."

MODELS_JSON=$(printf '%s\n' "${MODELS[@]}" | python3 -c "import json,sys; print(json.dumps([l.strip() for l in sys.stdin if l.strip()]))")

python3 -c "
import json
from pathlib import Path

prompts = json.load(open('${PROMPTS_FILE}'))
results_dir = Path('${RESULTS_DIR}')
models = json.loads('${MODELS_JSON}')

lines = ['# A/B Model Comparison\n']

timing_rows = []

for p in prompts:
    pid = p['id']
    agent = p['agent']
    short = p['prompt'][:80]
    lines.append(f'## [{pid}] {agent}\n')
    lines.append(f'> {short}...\n')

    for model in models:
        md_path = results_dir / pid / f'{model}.md'
        json_path = results_dir / pid / f'{model}.json'
        lines.append(f'### {model}\n')
        if md_path.exists():
            content = md_path.read_text().strip()
            lines.append(f'{content}\n')
        else:
            lines.append('*(no response)*\n')

        if json_path.exists():
            try:
                d = json.loads(json_path.read_text())
                t = d.get('timings', {})
                u = d.get('usage', {})
                timing_rows.append({
                    'prompt_id': pid,
                    'model_id': model,
                    'prompt_tps': t.get('prompt_per_second', 0),
                    'predicted_tps': t.get('predicted_per_second', 0),
                    'total_tokens': u.get('total_tokens', 0),
                    'latency_ms': round((t.get('prompt_ms', 0) or 0) + (t.get('predicted_ms', 0) or 0), 1),
                })
            except Exception:
                # Skip unreadable or malformed result files
                pass
    lines.append('---\n')

# Timing table
lines.append('## Timing Summary\n')
pids = list(dict.fromkeys(r['prompt_id'] for r in timing_rows))
lines.append('| prompt | ' + ' | '.join(models) + ' |')
lines.append('|--------' + '|------' * len(models) + '|')
for pid in pids:
    cells = []
    for model in models:
        match = [r for r in timing_rows if r['prompt_id'] == pid and r['model_id'] == model]
        if match:
            cells.append(f\"{match[0]['predicted_tps']:.0f}\")
        else:
            cells.append('—')
    lines.append(f'| {pid} | ' + ' | '.join(cells) + ' |')

Path('${COMPARE_FILE}').write_text('\n'.join(lines) + '\n')
print(f'Wrote ${COMPARE_FILE}')

# timing.csv
import csv
with open('${TIMING_FILE}', 'w', newline='') as f:
    w = csv.DictWriter(f, fieldnames=['prompt_id', 'model_id', 'prompt_tps', 'predicted_tps', 'total_tokens', 'latency_ms'])
    w.writeheader()
    w.writerows(timing_rows)
print(f'Wrote ${TIMING_FILE}')
"

t_end=$(date +%s)
elapsed=$(( t_end - t_start ))
echo ""
echo "================================================================"
echo " COMPLETE in $(( elapsed / 60 ))m $(( elapsed % 60 ))s"
echo " Results: ${RESULTS_DIR}/"
echo " Compare: ${COMPARE_FILE}"
echo " Timing: ${TIMING_FILE}"
echo "================================================================"
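Review note: the resume check in run.sh treats any non-empty `<model>.json` as done. A stricter test would also confirm that the file holds an OpenAI-style completion with at least one choice. The sketch below assumes that response shape; `is_complete_result` is a hypothetical helper and is not part of this commit.

```python
import json
import tempfile
from pathlib import Path


def is_complete_result(path: Path) -> bool:
    """True if path holds a JSON object with a non-empty 'choices' list."""
    try:
        data = json.loads(path.read_text())
    except (OSError, ValueError):  # missing file, bad encoding, or invalid JSON
        return False
    return isinstance(data, dict) and bool(data.get("choices"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "good.json"
        good.write_text(json.dumps({"choices": [{"message": {"content": "OK"}}]}))
        bad = Path(tmp) / "bad.json"
        bad.write_text(json.dumps({"error": "model not found"}))
        print(is_complete_result(good), is_complete_result(bad))
```

A saved error body fails this check, so the prompt would be requested again on the next run.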
125
eval/analyze.py
Normal file
@@ -0,0 +1,125 @@
#!/usr/bin/env python3
"""Generate SUMMARY.md from scores.csv."""

import csv
from collections import defaultdict
from pathlib import Path

CSV_PATH = Path(__file__).parent / "scores.csv"
SUMMARY_PATH = Path(__file__).parent / "SUMMARY.md"


def load_scores() -> list[dict]:
    rows = []
    with open(CSV_PATH) as f:
        for row in csv.DictReader(f):
            row["correct"] = row["correct"].lower() in ("true", "1", "yes")
            row["latency_ms"] = float(row.get("latency_ms", 0) or 0)
            rows.append(row)
    return rows


def main() -> None:
    rows = load_scores()
    if not rows:
        print("No data in scores.csv")
        return

    models = sorted(set(r["model"] for r in rows))
    benchmarks = ["mmlu", "gsm8k", "humaneval"]

    # Compute scores
    scores = {}  # (model, bench) -> (correct, total)
    for r in rows:
        key = (r["model"], r["benchmark"])
        if key not in scores:
            scores[key] = [0, 0]
        scores[key][1] += 1
        if r["correct"]:
            scores[key][0] += 1

    # MMLU per-category
    cat_scores = defaultdict(lambda: [0, 0])
    for r in rows:
        if r["benchmark"] == "mmlu" and r.get("category"):
            key = (r["model"], r["category"])
            cat_scores[key][1] += 1
            if r["correct"]:
                cat_scores[key][0] += 1

    categories = sorted(set(r.get("category", "") for r in rows if r.get("category")))

    lines = ["# Eval Results\n"]

    # Main table
    lines.append("## Overall Scores\n")
    header = "| Model | MMLU (%) | GSM8K (%) | HumanEval (%) | Avg (%) |"
    sep = "|-------|---------|---------|--------------|---------|"
    lines.append(header)
    lines.append(sep)

    model_avgs = []
    for model in models:
        cells = []
        pcts = []
        for bench in benchmarks:
            key = (model, bench)
            if key in scores:
                c, t = scores[key]
                pct = c / t * 100 if t > 0 else 0
                cells.append(f"{pct:.1f}")
                pcts.append(pct)
            else:
                cells.append("—")
        avg = sum(pcts) / len(pcts) if pcts else 0
        model_avgs.append((model, avg))
        cells.append(f"{avg:.1f}")
        lines.append(f"| {model} | " + " | ".join(cells) + " |")

    # Sort summary
    model_avgs.sort(key=lambda x: -x[1])
    lines.append(f"\n**Best overall: {model_avgs[0][0]}** ({model_avgs[0][1]:.1f}% avg)\n")

    # MMLU category breakdown
    if categories:
        lines.append("\n## MMLU Per-Category Breakdown\n")
        header = "| Model | " + " | ".join(c.replace("_", " ").title() for c in categories) + " |"
        sep = "|-------" + "|-------" * len(categories) + "|"
        lines.append(header)
        lines.append(sep)
        for model in models:
            cells = []
            for cat in categories:
                key = (model, cat)
                if key in cat_scores:
                    c, t = cat_scores[key]
                    cells.append(f"{c}/{t}")
                else:
                    cells.append("—")
            lines.append(f"| {model} | " + " | ".join(cells) + " |")

    # Latency summary
    lines.append("\n## Median Latency (ms)\n")
    lines.append("| Model | MMLU | GSM8K | HumanEval |")
    lines.append("|-------|------|-------|-----------|")
    for model in models:
        cells = []
        for bench in benchmarks:
            lats = sorted([r["latency_ms"] for r in rows
                           if r["model"] == model and r["benchmark"] == bench
                           and r["latency_ms"] > 0])
            if lats:
                # True median: average the two middle values when the
                # sample count is even (both indices coincide when odd).
                med = (lats[(len(lats) - 1) // 2] + lats[len(lats) // 2]) / 2
                cells.append(f"{med:.0f}")
            else:
                cells.append("—")
        lines.append(f"| {model} | " + " | ".join(cells) + " |")

    summary = "\n".join(lines) + "\n"
    SUMMARY_PATH.write_text(summary)
    print(summary)
    print(f"\nWritten to: {SUMMARY_PATH}")


if __name__ == "__main__":
    main()
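Review note: the latency table reports one median per model and benchmark. With an even number of samples, two common conventions give different answers: the upper-middle element, or the mean of the two middle elements (what `statistics.median` returns). A small sketch with made-up latency values:

```python
import statistics


def upper_middle(values: list[float]) -> float:
    """Element at index len // 2 of the sorted values."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


latencies_ms = [120.0, 180.0, 240.0, 900.0]  # illustrative values only

print(upper_middle(latencies_ms))        # 240.0
print(statistics.median(latencies_ms))   # 210.0
```

For odd sample counts the two agree, so the choice only matters for even-sized or very small groups.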
164
eval/gsm8k.py
Normal file
@@ -0,0 +1,164 @@
#!/usr/bin/env python3
"""GSM8K 50-question subset benchmark (seed=42)."""

import json
import os
import random
import re
import sys
import time
from pathlib import Path

from datasets import load_dataset
from openai import OpenAI
from tqdm import tqdm

ENDPOINT = os.environ.get("LLAMA_SWAP_URL", "http://100.101.41.16:8401/v1")
RESULTS_DIR = Path(__file__).parent / "results"
MAX_TOKENS = 512
SEED = 42
TEMPERATURE = 0
N_QUESTIONS = 50


def load_questions() -> list[dict]:
    rng = random.Random(SEED)
    ds = load_dataset("openai/gsm8k", "main", split="test", trust_remote_code=True)
    indices = list(range(len(ds)))
    rng.shuffle(indices)
    questions = []
    for idx in indices[:N_QUESTIONS]:
        row = ds[idx]
        answer_text = row["answer"]
        # GSM8K answer format: "#### <number>" at end
        match = re.search(r"####\s*([0-9,.-]+)", answer_text)
        expected = int(match.group(1).replace(",", "")) if match else 0
        questions.append({
            "id": f"gsm8k_{idx}",
            "question": row["question"],
            "expected": expected,
        })
    return questions


def format_prompt(q: dict) -> str:
    return (
        "Solve this problem step by step, then on the final line write "
        "'ANSWER: <number>'.\n\n" + q["question"]
    )


def parse_answer(text: str) -> int | None:
    # The pattern must start with a digit and only takes a "." when digits
    # follow it, so a sentence-ending period ("ANSWER: 42.") is not captured.
    matches = re.findall(r"ANSWER:\s*(-?\d[\d,]*(?:\.\d+)?)", text, re.IGNORECASE)
    if matches:
        try:
            value = float(matches[-1].replace(",", ""))
        except ValueError:
            return None
        # Accept "42" and "42.0"; reject non-integer answers such as "42.5".
        return int(value) if value.is_integer() else None
    # Fallback: last number in the response
    nums = re.findall(r"-?\d[\d,]*", text)
    if nums:
        try:
            return int(nums[-1].replace(",", ""))
        except ValueError:
            return None
    return None


def run_gsm8k(model: str, client: OpenAI, questions: list[dict]) -> list[dict]:
    model_dir = RESULTS_DIR / model / "gsm8k"
    model_dir.mkdir(parents=True, exist_ok=True)

    results = []
    correct = 0
    total = 0

    skipped = 0
    for i, q in enumerate(tqdm(questions, desc=f" GSM8K {model}", file=sys.stderr)):
        expected = q["expected"]
        out_path = model_dir / f"{q['id']}.json"

        if out_path.exists():
            try:
                cached = json.loads(out_path.read_text())
                raw = ""
                if "choices" in cached:
                    msg = cached["choices"][0].get("message", {})
                    raw = msg.get("content", "") or msg.get("reasoning_content", "") or ""
                parsed = parse_answer(raw)
                is_correct = parsed is not None and parsed == expected
                if is_correct:
                    correct += 1
                total += 1
                results.append({
                    "model": model, "benchmark": "gsm8k", "question_id": q["id"],
                    "correct": is_correct, "raw_answer": raw[:200],
                    "parsed_answer": str(parsed) if parsed is not None else "",
                    "expected": str(expected), "latency_ms": 0,
                })
                skipped += 1
                continue
            except (json.JSONDecodeError, KeyError):
                pass

        prompt = format_prompt(q)
        t0 = time.time()
        resp_json = None
        for attempt in range(2):
            try:
                resp = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=MAX_TOKENS,
                    temperature=TEMPERATURE,
                    seed=SEED,
                )
                resp_json = resp.model_dump()
                break
            except Exception as e:
                if attempt == 0:
                    time.sleep(5)
                else:
                    resp_json = {"error": str(e)}
        latency = (time.time() - t0) * 1000

        raw = ""
        if resp_json and "choices" in resp_json:
            msg = resp_json["choices"][0].get("message", {})
            raw = msg.get("content", "") or msg.get("reasoning_content", "") or ""

        parsed = parse_answer(raw)
        is_correct = parsed is not None and parsed == expected
        if is_correct:
            correct += 1
        total += 1

        # Persist only real responses: a cached error would be scored as
        # wrong and never retried on resume.
        if "error" not in resp_json:
            out_path.write_text(json.dumps(resp_json, indent=2, default=str))

        results.append({
            "model": model,
            "benchmark": "gsm8k",
            "question_id": q["id"],
            "correct": is_correct,
            "raw_answer": raw[:200],
            "parsed_answer": str(parsed) if parsed is not None else "",
            "expected": str(expected),
            "latency_ms": round(latency, 1),
        })

        if (i + 1) % 10 == 0:
            print(f" [{model}] GSM8K {i+1}/{len(questions)} — {correct}/{total} ({correct/total*100:.0f}%)", file=sys.stderr)

    if skipped:
        print(f" [{model}] GSM8K resumed: {skipped} cached, {total-skipped} new", file=sys.stderr)
    print(f" [{model}] GSM8K FINAL: {correct}/{total} ({correct/total*100:.1f}%)", file=sys.stderr)
    return results


if __name__ == "__main__":
    model = sys.argv[1] if len(sys.argv) > 1 else "qwen3.6-35b-a3b-mxfp4"
    client = OpenAI(base_url=ENDPOINT, api_key="dummy")
    questions = load_questions()
    results = run_gsm8k(model, client, questions)
    for r in results:
        print(json.dumps(r))
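Review note: `load_questions` seeds a private `random.Random` instance, so the 50-question subset is the same on every run and for every model. That is what makes the per-question result files reusable. The sketch below shows the same idea without the dataset dependency; `pick_indices` is a hypothetical name, and 1319 is used as the size of the GSM8K test split.

```python
import random


def pick_indices(n_total: int, n_pick: int, seed: int) -> list[int]:
    """Shuffle range(n_total) with a private RNG and keep the first n_pick."""
    rng = random.Random(seed)  # private instance: independent of global random state
    indices = list(range(n_total))
    rng.shuffle(indices)
    return indices[:n_pick]


first = pick_indices(1319, 50, seed=42)
random.seed(7)  # perturbing the global RNG does not change the selection
second = pick_indices(1319, 50, seed=42)
print(first == second, len(set(first)))
```

Changing `SEED` or `N_QUESTIONS` changes the subset, so result files cached under the old settings would no longer line up with the new question ids.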
201
eval/humaneval.py
Normal file
@@ -0,0 +1,201 @@
#!/usr/bin/env python3
"""HumanEval benchmark — 164 problems with sandboxed execution."""

import json
import os
import re
import subprocess
import sys
import tempfile
import textwrap
import time
from pathlib import Path

from datasets import load_dataset
from openai import OpenAI
from tqdm import tqdm

ENDPOINT = os.environ.get("LLAMA_SWAP_URL", "http://100.101.41.16:8401/v1")
RESULTS_DIR = Path(__file__).parent / "results"
MAX_TOKENS = 1024
SEED = 42
TEMPERATURE = 0
EXEC_TIMEOUT = 30


def load_problems() -> list[dict]:
    ds = load_dataset("openai/openai_humaneval", split="test", trust_remote_code=True)
    problems = []
    for row in ds:
        problems.append({
            "id": row["task_id"],
            "prompt": row["prompt"],
            "canonical": row["canonical_solution"],
            "test": row["test"],
            "entry_point": row["entry_point"],
        })
    return problems


def extract_code(response: str, prompt: str) -> str:
    # Try to find a code block
    blocks = re.findall(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    if blocks:
        code = blocks[0]
        # If the code block contains the function signature, use it directly
        if "def " in code:
            return code
        # Otherwise prepend the prompt (function signature)
        return prompt + code

    # No code block — try to extract everything from the first def onwards
    lines = response.split("\n")
    in_code = False
    code_lines = []
    for line in lines:
        # Once the first "def" is seen, keep every following line
        if in_code or line.strip().startswith("def "):
            in_code = True
            code_lines.append(line)

    if code_lines:
        return "\n".join(code_lines)

    # Last resort: prepend prompt to raw response
    return prompt + response


def run_test(code: str, test_code: str, entry_point: str) -> tuple[bool, str]:
    full = code + "\n\n" + test_code + f"\n\ncheck({entry_point})\n"

    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".py", dir="/tmp", delete=False
    ) as f:
        f.write(full)
        f.flush()
        fpath = f.name

    try:
        # Best-effort isolation only: minimal environment, cwd=/tmp, and a
        # timeout. The generated code still runs with this user's privileges.
        env = {"PATH": "/usr/bin:/usr/local/bin", "HOME": "/tmp"}
        result = subprocess.run(
            [sys.executable, fpath],
            capture_output=True, text=True,
            timeout=EXEC_TIMEOUT,
            cwd="/tmp",
            env=env,
        )
        passed = result.returncode == 0
        output = result.stderr[:500] if result.stderr else result.stdout[:500]
        return passed, output
    except subprocess.TimeoutExpired:
        return False, "TIMEOUT"
    except Exception as e:
        return False, str(e)[:500]
    finally:
        try:
            os.unlink(fpath)
        except OSError:
            pass


def run_humaneval(model: str, client: OpenAI, problems: list[dict]) -> list[dict]:
    model_dir = RESULTS_DIR / model / "humaneval"
    model_dir.mkdir(parents=True, exist_ok=True)

    results = []
    correct = 0
    total = 0

    skipped = 0
    for i, p in enumerate(tqdm(problems, desc=f" HumanEval {model}", file=sys.stderr)):
        out_path = model_dir / f"{p['id'].replace('/', '_')}.json"

        if out_path.exists():
            try:
                cached = json.loads(out_path.read_text())
                passed = cached.get("passed", False)
                if passed:
                    correct += 1
                total += 1
                results.append({
                    "model": model, "benchmark": "humaneval",
                    "question_id": p["id"], "correct": passed,
                    "raw_answer": "", "parsed_answer": "pass" if passed else "fail",
                    "expected": "pass", "latency_ms": 0,
                })
                skipped += 1
                continue
            except (json.JSONDecodeError, KeyError):
                pass

        t0 = time.time()
        resp_json = None
        for attempt in range(2):
            try:
                resp = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": (
                        "Complete the following Python function. "
                        "Return ONLY the complete function implementation.\n\n"
                        + p["prompt"]
                    )}],
                    max_tokens=MAX_TOKENS,
                    temperature=TEMPERATURE,
                    seed=SEED,
                )
                resp_json = resp.model_dump()
                break
            except Exception as e:
                if attempt == 0:
                    time.sleep(5)
                else:
                    resp_json = {"error": str(e)}
        latency = (time.time() - t0) * 1000

raw = ""
|
||||
if resp_json and "choices" in resp_json:
|
||||
msg = resp_json["choices"][0].get("message", {})
|
||||
raw = msg.get("content", "") or msg.get("reasoning_content", "") or ""
|
||||
|
||||
code = extract_code(raw, p["prompt"])
|
||||
passed, exec_output = run_test(code, p["test"], p["entry_point"])
|
||||
if passed:
|
||||
correct += 1
|
||||
total += 1
|
||||
|
||||
out_path.write_text(json.dumps({
|
||||
"response": resp_json,
|
||||
"extracted_code": code[:2000],
|
||||
"passed": passed,
|
||||
"exec_output": exec_output,
|
||||
}, indent=2, default=str))
|
||||
|
||||
results.append({
|
||||
"model": model,
|
||||
"benchmark": "humaneval",
|
||||
"question_id": p["id"],
|
||||
"correct": passed,
|
||||
"raw_answer": raw[:200],
|
||||
"parsed_answer": "pass" if passed else "fail",
|
||||
"expected": "pass",
|
||||
"latency_ms": round(latency, 1),
|
||||
})
|
||||
|
||||
if (i + 1) % 10 == 0:
|
||||
print(f" [{model}] HumanEval {i+1}/{len(problems)} — {correct}/{total} ({correct/total*100:.0f}%)", file=sys.stderr)
|
||||
|
||||
if skipped:
|
||||
print(f" [{model}] HumanEval resumed: {skipped} cached, {total-skipped} new", file=sys.stderr)
|
||||
print(f" [{model}] HumanEval FINAL: {correct}/{total} ({correct/total*100:.1f}%)", file=sys.stderr)
|
||||
return results


if __name__ == "__main__":
    model = sys.argv[1] if len(sys.argv) > 1 else "qwen3.6-35b-a3b-mxfp4"
    client = OpenAI(base_url=ENDPOINT, api_key="dummy")
    problems = load_problems()
    results = run_humaneval(model, client, problems)
    for r in results:
        print(json.dumps(r))
166
eval/mmlu.py
Normal file
@@ -0,0 +1,166 @@
#!/usr/bin/env python3
"""MMLU 100-question subset benchmark (20 per category, seed=42)."""

import json
import os
import random
import re
import sys
import time
from pathlib import Path

from datasets import load_dataset
from openai import OpenAI
from tqdm import tqdm

ENDPOINT = os.environ.get("LLAMA_SWAP_URL", "http://100.101.41.16:8401/v1")
RESULTS_DIR = Path(__file__).parent / "results"
MAX_TOKENS = 512
SEED = 42
TEMPERATURE = 0

CATEGORIES = [
    "high_school_mathematics",
    "college_computer_science",
    "professional_medicine",
    "formal_logic",
    "miscellaneous",
]
PER_CATEGORY = 20

CHOICES = ["A", "B", "C", "D"]


def load_questions() -> list[dict]:
    rng = random.Random(SEED)
    questions = []
    for cat in CATEGORIES:
        ds = load_dataset("cais/mmlu", cat, split="test", trust_remote_code=True)
        indices = list(range(len(ds)))
        rng.shuffle(indices)
        for idx in indices[:PER_CATEGORY]:
            row = ds[idx]
            questions.append({
                "id": f"{cat}_{idx}",
                "category": cat,
                "question": row["question"],
                "choices": row["choices"],
                "answer_idx": row["answer"],
            })
    return questions


def format_prompt(q: dict) -> str:
    lines = [f"Question: {q['question']}"]
    for i, choice in enumerate(q["choices"]):
        lines.append(f"{CHOICES[i]}) {choice}")
    lines.append("Answer with a single letter: ")
    return "\n".join(lines)
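

def parse_answer(text: str) -> str | None:
    # Match the letter as a standalone token. Scanning character by character
    # would return "A" for a reply such as "Answer: C", because the "a" inside
    # "Answer" is itself a valid choice letter. Uppercase matches are preferred;
    # a standalone lowercase letter is accepted as a fallback.
    match = re.search(r"\b([A-D])\b", text) or re.search(r"\b([a-d])\b", text)
    return match.group(1).upper() if match else None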


def run_mmlu(model: str, client: OpenAI, questions: list[dict]) -> list[dict]:
    model_dir = RESULTS_DIR / model / "mmlu"
    model_dir.mkdir(parents=True, exist_ok=True)

    results = []
    correct = 0
    total = 0

    skipped = 0
    for i, q in enumerate(tqdm(questions, desc=f" MMLU {model}", file=sys.stderr)):
        expected = CHOICES[q["answer_idx"]]
        out_path = model_dir / f"{q['id']}.json"

        # Resume: skip if result file exists
        if out_path.exists():
            try:
                cached = json.loads(out_path.read_text())
                raw = ""
                if "choices" in cached:
                    msg = cached["choices"][0].get("message", {})
                    raw = msg.get("content", "") or msg.get("reasoning_content", "") or ""
                parsed = parse_answer(raw)
                is_correct = parsed == expected
                if is_correct:
                    correct += 1
                total += 1
                results.append({
                    "model": model, "benchmark": "mmlu", "question_id": q["id"],
                    "category": q["category"], "correct": is_correct,
                    "raw_answer": raw[:200], "parsed_answer": parsed or "",
                    "expected": expected, "latency_ms": 0,
                })
                skipped += 1
                continue
            except (json.JSONDecodeError, KeyError):
                pass

        prompt = format_prompt(q)
        t0 = time.time()
        resp_json = None
        # One retry after a 5 s pause; the second failure is recorded as an error
        for attempt in range(2):
            try:
                resp = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=MAX_TOKENS,
                    temperature=TEMPERATURE,
                    seed=SEED,
                )
                resp_json = resp.model_dump()
                break
            except Exception as e:
                if attempt == 0:
                    time.sleep(5)
                else:
                    resp_json = {"error": str(e)}
        latency = (time.time() - t0) * 1000

        raw = ""
        if resp_json and "choices" in resp_json:
            msg = resp_json["choices"][0].get("message", {})
            raw = msg.get("content", "") or msg.get("reasoning_content", "") or ""

        parsed = parse_answer(raw)
        is_correct = parsed == expected
        if is_correct:
            correct += 1
        total += 1

        out_path.write_text(json.dumps(resp_json, indent=2, default=str))

        results.append({
            "model": model,
            "benchmark": "mmlu",
            "question_id": q["id"],
            "category": q["category"],
            "correct": is_correct,
            "raw_answer": raw[:200],
            "parsed_answer": parsed or "",
            "expected": expected,
            "latency_ms": round(latency, 1),
        })

        if (i + 1) % 10 == 0:
            print(f" [{model}] MMLU {i+1}/{len(questions)} — {correct}/{total} ({correct/total*100:.0f}%)", file=sys.stderr)

    if skipped:
        print(f" [{model}] MMLU resumed: {skipped} cached, {total-skipped} new", file=sys.stderr)
    # Guard against division by zero when the question list is empty
    pct = correct / total * 100 if total else 0.0
    print(f" [{model}] MMLU FINAL: {correct}/{total} ({pct:.1f}%)", file=sys.stderr)
    return results


if __name__ == "__main__":
    model = sys.argv[1] if len(sys.argv) > 1 else "qwen3.6-35b-a3b-mxfp4"
    client = OpenAI(base_url=ENDPOINT, api_key="dummy")
    questions = load_questions()
    results = run_mmlu(model, client, questions)
    for r in results:
        print(json.dumps(r))
117
eval/run_all.py
Normal file
@@ -0,0 +1,117 @@
#!/usr/bin/env python3
"""Orchestrate MMLU, GSM8K, HumanEval across all models."""

import csv
import json
import os
import sys
import time
from pathlib import Path

from openai import OpenAI

ENDPOINT = os.environ.get("LLAMA_SWAP_URL", "http://100.101.41.16:8401/v1")
RESULTS_DIR = Path(__file__).parent / "results"
CSV_PATH = Path(__file__).parent / "scores.csv"

MODELS = [
    "qwen3.6-35b-a3b-mxfp4",
    "qwen3-coder-30b-apex",
    "qwen3.6-27b-mtp",
    "qwopus3.5-4b-mtp",
    "qwen3.5-9b-deepseek-v4-mtp",
    "qwopus3.6-35b-a3b-v1",
    "qwopus3.6-27b-v2-mtp",
    "qwopus3.5-9b-coder-mtp",
]


def warmup_model(client: OpenAI, model: str) -> bool:
    print(f"\n{'='*60}", file=sys.stderr)
    print(f" Loading model: {model}", file=sys.stderr)
    print(f"{'='*60}", file=sys.stderr)
    for attempt in range(3):
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": "Say OK."}],
                max_tokens=10,
                temperature=0,
            )
            print(f" Warmup OK", file=sys.stderr)
            return True
        except Exception as e:
            print(f" Warmup attempt {attempt+1} failed: {e}", file=sys.stderr)
            time.sleep(10)
    print(f" WARNING: warmup failed for {model}, continuing anyway", file=sys.stderr)
    return False


def run_benchmark(module_name: str, model: str, client: OpenAI) -> list[dict]:
    if module_name == "mmlu":
        from mmlu import load_questions, run_mmlu
        questions = load_questions()
        return run_mmlu(model, client, questions)
    elif module_name == "gsm8k":
        from gsm8k import load_questions, run_gsm8k
        questions = load_questions()
        return run_gsm8k(model, client, questions)
    elif module_name == "humaneval":
        from humaneval import load_problems, run_humaneval
        problems = load_problems()
        return run_humaneval(model, client, problems)
    else:
        raise ValueError(f"Unknown benchmark: {module_name}")


def main() -> None:
    client = OpenAI(base_url=ENDPOINT, api_key="dummy")

    # Check connectivity
    try:
        client.models.list()
        print("Connected to llama-swap", file=sys.stderr)
    except Exception as e:
        print(f"Cannot connect to {ENDPOINT}: {e}", file=sys.stderr)
        sys.exit(1)

    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
    all_results: list[dict] = []
    benchmarks = ["mmlu", "gsm8k", "humaneval"]

    t_start = time.time()

    for model in MODELS:
        warmup_model(client, model)

        for bench in benchmarks:
            print(f"\n --- {model} / {bench} ---", file=sys.stderr)
            try:
                results = run_benchmark(bench, model, client)
                all_results.extend(results)
                write_csv(all_results)
            except Exception as e:
                print(f" ERROR in {model}/{bench}: {e}", file=sys.stderr)

    elapsed = time.time() - t_start
    print(f"\nAll benchmarks complete in {elapsed/60:.0f} minutes", file=sys.stderr)
    print(f"Results: {CSV_PATH}", file=sys.stderr)


def write_csv(results: list[dict]) -> None:
    if not results:
        return
    fields = ["model", "benchmark", "question_id", "correct", "raw_answer",
              "parsed_answer", "expected", "latency_ms"]
    # Also include category if present (MMLU)
    if any("category" in r for r in results):
        fields.insert(3, "category")

    with open(CSV_PATH, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        w.writeheader()
        w.writerows(results)


if __name__ == "__main__":
    main()
20
eval/run_all.sh
Executable file
@@ -0,0 +1,20 @@
#!/usr/bin/env bash
set -euo pipefail

EVAL_DIR="$(cd "$(dirname "$0")" && pwd)"
VENV="${EVAL_DIR}/.venv/bin/python3"

cd "$EVAL_DIR"

echo "Starting eval sweep at $(date)"
echo "Using venv: ${VENV}"
echo ""

# Quote the interpreter path so a checkout directory containing spaces still works
"$VENV" run_all.py 2>&1 | tee eval.log

echo ""
echo "Generating summary..."
"$VENV" analyze.py

echo ""
echo "Done at $(date)"