llama-sidecar v0.1.0: daemon + benchmarks + eval suite

Go daemon (cmd/llama-sidecar): per-agent llama-server process pool with
LRU eviction, OpenAI-compatible proxy, flag validation (Unsloth port),
deterministic hash-keyed sidecar reuse.

Windows service support via schtasks/NSSM with DETACHED_PROCESS, stdout
pipe drain, and request-ctx decoupled child lifetime.

Bug fixes (3b.1–3b.5): -c flag drop from StripShadowingFlags, UTF-8 BOM
in JSON config, -fa → --flash-attn on default, child process exit after
one request (stdin devnull, stdout pipe, CREATE_NO_WINDOW → DETACHED,
context.Background for child lifetime, background reaper goroutine).

bench/: MTP on/off throughput sweep across 8 GGUFs via SSH+schtasks
automation to sam-desktop. Per-GGUF production flags from llama-swap
config with --ctx-size 32768 override.

eval/: accuracy benchmarks (MMLU 100q, GSM8K 50q, HumanEval 164) + A/B
model comparison (14 agent-typed prompts × 8 models). All scripts
resumable at individual question level.

94 Go tests, race detector clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-28 01:55:13 +00:00
parent babbb4f39b
commit fe7f36ae98
39 changed files with 4228 additions and 0 deletions
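The commit message mentions "deterministic hash-keyed sidecar reuse" without showing how a key might be derived. The daemon is written in Go and its real key derivation is not part of this excerpt, so the following is only an illustrative sketch, in Python to match the eval scripts. The names `sidecar_key`, `acquire`, and `running`, the choice of SHA-256, and the 16-character truncation are all assumptions made for the example.

```python
import hashlib
import json


def sidecar_key(model_path: str, flags: list[str]) -> str:
    """Return a stable hex key for a (model, flags) launch configuration."""
    # Canonical JSON: fixed key order and separators so the bytes never vary.
    payload = json.dumps(
        {"model": model_path, "flags": flags},
        sort_keys=True,
        separators=(",", ":"),
    )
    # Truncation to 16 hex chars is an assumption, chosen only for readability.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Hypothetical pool: key -> port of an already-running llama-server child.
running: dict[str, int] = {}


def acquire(model_path: str, flags: list[str], next_port: int) -> tuple[int, bool]:
    """Return (port, reused); reuse the child when the key is already known."""
    key = sidecar_key(model_path, flags)
    if key in running:
        return running[key], True
    running[key] = next_port
    return next_port, False
```

In this sketch the flag order is hashed as given, so the same flags in a different order produce a different key. Whether the real daemon normalizes flag order before hashing is not stated in the commit.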
--- a/eval/ab/prompts.json
+++ b/eval/ab/prompts.json
@@ -0,0 +1,72 @@
+[
+  {
+    "id": "review-1",
+    "agent": "Code Reviewer",
+    "prompt": "Review the `buildHeadPayload` function in `apps/server/src/services/compaction.ts`. It was recently patched in v1.13.6 to embed `reasoning_parts` as a `<reasoning>...</reasoning>` prose prefix on the assistant content for tool-bearing turns. Check: does the current implementation handle the case where `reasoning_parts` is an empty array? Does it handle turns that have both reasoning_parts AND non-empty text content (not just tool calls)? Cite file:line for any issues."
+  },
+  {
+    "id": "review-2",
+    "agent": "Code Reviewer",
+    "prompt": "Review the path guard layer in `apps/coder/services/path_guard.ts`. It enforces per-project scoping with a blanket `/opt:rw` mount and policy at the tool layer. Check for: symlink traversal (does it resolve symlinks before checking?), double-encoding attacks on path components, race conditions between check and use (TOCTOU), and whether `extraRoots` from `request_read_access` grants could be abused to escape the project scope. Cite file:line."
+  },
+  {
+    "id": "debug-1",
+    "agent": "Debugger",
+    "prompt": "Bug report: after a long BooCode chat session (~40 messages), the compaction trigger fires but the resulting summary is empty — the assistant message with `summary=true` has blank content. The `ctx_max` is correctly fetched from `/upstream/<model>/props` (verified in logs). The `needs_compaction` flag is being set. But the summary inference returns an empty string. This started happening after the v1.13.7 compaction trigger change that lowered the threshold to `floor(0.85 * ctx_max)`. Diagnose: what code path could produce an empty summary, and what would you check first?"
+  },
+  {
+    "id": "debug-2",
+    "agent": "Debugger",
+    "prompt": "Bug report: BooTerm terminal pane shows garbled output past column 66 on initial open, but corrects itself after manually resizing the browser window. The `stty size` inside the terminal reports `82 66` even though the pane is visually ~132 columns wide. tmux `list-windows` confirms the session was created at 66 columns. This only happens when opening a terminal pane via the split-pane button, not when opening it as the sole pane. Diagnose the root cause in `apps/web/src/components/panes/TerminalPane.tsx`."
+  },
+  {
+    "id": "refactor-1",
+    "agent": "Refactorer",
+    "prompt": "The `streamCompletion` function in `apps/server/src/services/provider.ts` has grown to handle: AI SDK v6 streaming, XML fallback parsing for qwen3.6 tool-call emissions, abort signal handling (the explicit `if (signal?.aborted) throw` patch), reasoning-delta counting, and usage extraction. It's now ~200 lines. Propose a refactor that separates concerns without breaking the streaming contract. The function must remain a single entry point for callers."
+  },
+  {
+    "id": "refactor-2",
+    "agent": "Refactorer",
+    "prompt": "The WebSocket frame publishing in BooCode went through two batches (v1.13.12 + v1.13.13) that converted ~80 publish sites to typed `publishFrame`/`publishUserFrame` wrappers with Zod validation. The schemas are duplicated byte-identical between `apps/server/src/types/ws-frames.ts` and `apps/web/src/api/ws-frames.ts` with a parity test. Propose a refactor to share the schema definition from a single source instead of maintaining the duplication + parity test."
+  },
+  {
+    "id": "architect-1",
+    "agent": "Architect",
+    "prompt": "Design the system-prompt prefix cache for BooCode. Context: `buildSystemPromptWithFingerprint` already computes a SHA-256 of the assembled prefix and logs drift. The prefix is rebuilt on every inference turn from: project settings, agent instructions (AGENTS.md), skills, session-level overrides, and web_search_enabled flag. Most of these don't change between turns in the same session. Design a cache that avoids rebuilding+rehashing on every turn. Consider: process-memory vs DB-backed, invalidation strategy, cache key shape, and whether the fingerprint can serve as the cache key itself."
+  },
+  {
+    "id": "architect-2",
+    "agent": "Architect",
+    "prompt": "Design the v2.5 task model integration with BooCoder's ACP dispatch. Context: v2.5.0-task-model just shipped a `tasks` table and lightweight task model services. BooCoder dispatches external agents (opencode, goose, claude) via ACP or PTY. Design how a task created in BooChat should flow through to a BooCoder dispatch: task creation → agent selection → ACP session → status updates back to the task row → completion. Consider: which fields from the task row map to ACP session params, how task status syncs with the agent's exit code, and how the UI surfaces progress."
+  },
+  {
+    "id": "security-1",
+    "agent": "Security Auditor",
+    "prompt": "Audit the `web_fetch` tool implementation in BooCode. It fetches arbitrary URLs on behalf of the LLM agent. Check for: SSRF against internal Tailscale IPs (100.x.x.x), DNS rebinding, redirect following to internal hosts, response size limits, content-type validation, and whether the `url_guard.ts` layer covers all cases. The tool is gated by `session.web_search_enabled` but once enabled, the URL is user-agent-controlled (the LLM decides what to fetch)."
+  },
+  {
+    "id": "security-2",
+    "agent": "Security Auditor",
+    "prompt": "Audit the `request_read_access` tool and `allowed_read_paths` grant mechanism (v1.13.17). When an agent needs to read files outside its project scope, it calls `request_read_access(path)` which triggers an `ask_user_input` elicitation for approval. On approval, the path is added to `allowed_read_paths` for that session, and `pathGuard` is extended with `extraRoots`. Check: can the agent request a path like `/etc/shadow` or `/opt/boocode/.env`? Is the grant scoped to the session or persistent? Can the path be a symlink that resolves to a sensitive location after the grant?"
+  },
+  {
+    "id": "prompt-1",
+    "agent": "Prompt Builder",
+    "prompt": "Write a Claude Code dispatch prompt for: adding a new BooCode agent called 'Documenter' to AGENTS.md. The agent should read source files and produce inline JSDoc/TSDoc comments. It should use the read-only tool set. Temperature 0.4, steps 10. The prompt should include pre-flight checks, the exact file to modify, backup instructions, and verification steps."
+  },
+  {
+    "id": "prompt-2",
+    "agent": "Prompt Builder",
+    "prompt": "Write an OpenCode dispatch prompt for: fixing the codecontext sidecar to handle projects with more than 10,000 files without OOMing. The fork is at /opt/forks/codecontext/. The agent should investigate the memory profile of the graph analysis pass, identify the allocation hotspot, and propose a streaming or chunked alternative. Include #careful hashtag, backup rules, and stop conditions."
+  },
+  {
+    "id": "recon-1",
+    "agent": "Recon",
+    "prompt": "Map the BooCode monorepo at /opt/boocode/. I need: top-level directory structure, the three apps and their roles, how they share the database, the Docker container topology, and the key service files in apps/server/src/services/. Identify the data flow from a user message in BooChat through to the LLM inference call and back."
+  },
+  {
+    "id": "recon-2",
+    "agent": "Recon",
+    "prompt": "Map the codecontext fork at /opt/forks/codecontext/. I need: the MCP tool surface (what tools are exposed), the parser architecture (how tree-sitter grammars are registered), the graph analysis pipeline (how dependencies and call graphs are built), and the codesight-merge additions (blast radius, hot files, routes, middleware). Identify the main entry points and the caching layer."
+  }
+]
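Because `run.sh` uses each prompt's `id` as a directory name under `results/` and reads `agent` and `prompt` by key, a malformed entry would only surface partway through a long run. A pre-flight check along the following lines could catch that earlier. It is a sketch and not part of the commit; `validate_prompts` is a hypothetical helper name.

```python
import json
from collections import Counter


def validate_prompts(prompts: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    required = ("id", "agent", "prompt")
    for i, p in enumerate(prompts):
        for field in required:
            value = p.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: missing or empty '{field}'")
    # IDs become directory names, so they must be unique and free of '/'.
    ids = Counter(p.get("id") for p in prompts if isinstance(p.get("id"), str))
    for pid, n in ids.items():
        if n > 1:
            problems.append(f"duplicate id '{pid}' ({n} entries)")
        if "/" in pid:
            problems.append(f"id '{pid}' contains '/', unsafe as a directory name")
    return problems


if __name__ == "__main__":
    sample = json.loads(
        '[{"id": "review-1", "agent": "Code Reviewer", "prompt": "Review X."},'
        ' {"id": "review-1", "agent": "Code Reviewer", "prompt": ""}]'
    )
    for problem in validate_prompts(sample):
        print(problem)
```

Run against the real file, the same function would be called as `validate_prompts(json.load(open("prompts.json")))`.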
--- a/eval/ab/run.sh
+++ b/eval/ab/run.sh
@@ -0,0 +1,242 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+ENDPOINT="http://100.101.41.16:8401/v1"
+PROMPTS_FILE="${SCRIPT_DIR}/prompts.json"
+RESULTS_DIR="${SCRIPT_DIR}/results"
+COMPARE_FILE="${SCRIPT_DIR}/COMPARE.md"
+TIMING_FILE="${SCRIPT_DIR}/timing.csv"
+
+MODELS=(
+  qwen3.6-35b-a3b-mxfp4
+  qwen3-coder-30b-apex
+  qwen3.6-27b-mtp
+  qwopus3.5-4b-mtp
+  qwen3.5-9b-deepseek-v4-mtp
+  qwopus3.6-35b-a3b-v1
+  qwopus3.6-27b-v2-mtp
+  qwopus3.5-9b-coder-mtp
+)
+
+mkdir -p "$RESULTS_DIR"
+
+# ── Parse prompts ─────────────────────────────────────────────────────
+
+PROMPT_COUNT=$(python3 -c "import json; print(len(json.load(open('${PROMPTS_FILE}'))))")
+TOTAL=$((PROMPT_COUNT * ${#MODELS[@]}))
+EST_MIN=$(( TOTAL * 30 / 60 ))  # rough estimate, assuming ~30 s per request
+
+echo "================================================================"
+echo "  A/B MODEL COMPARISON"
+echo "  ${PROMPT_COUNT} prompts × ${#MODELS[@]} models = ${TOTAL} requests"
+echo "  Estimated runtime: ~${EST_MIN} minutes"
+echo "  Endpoint: ${ENDPOINT}"
+echo "================================================================"
+echo ""
+
+# ── Main loop: models (outer) × prompts (inner) ──────────────────────
+# One model load per model, all prompts answered, then swap.
+
+t_start=$(date +%s)
+done_count=0
+
+for model in "${MODELS[@]}"; do
+  echo ""
+  echo "================================================================"
+  echo "  MODEL: ${model}"
+  echo "================================================================"
+
+  # Resume check: skip this model entirely if every prompt already has a non-empty result
+  all_cached=true
+  for pidx in $(seq 0 $((PROMPT_COUNT - 1))); do
+    PID=$(python3 -c "import json; print(json.load(open('${PROMPTS_FILE}'))[${pidx}]['id'])")
+    if [ ! -f "${RESULTS_DIR}/${PID}/${model}.json" ] || [ ! -s "${RESULTS_DIR}/${PID}/${model}.json" ]; then
+      all_cached=false
+      break
+    fi
+  done
+
+  if [ "$all_cached" = "true" ]; then
+    echo "  All ${PROMPT_COUNT} prompts cached, skipping model"
+    for pidx in $(seq 0 $((PROMPT_COUNT - 1))); do
+      done_count=$((done_count + 1))
+    done
+    continue
+  fi
+
+  echo "  Warming up..."
+  curl -s -X POST "${ENDPOINT}/chat/completions" \
+    -H "Content-Type: application/json" \
+    -d "{\"model\":\"${model}\",\"messages\":[{\"role\":\"user\",\"content\":\"Say OK.\"}],\"max_tokens\":10,\"temperature\":0}" \
+    --max-time 300 > /dev/null 2>&1 || true  # a failed warmup must not abort the run under set -e
+  echo "  Warm."
+
+  for pidx in $(seq 0 $((PROMPT_COUNT - 1))); do
+    PROMPT_ID=$(python3 -c "import json; print(json.load(open('${PROMPTS_FILE}'))[${pidx}]['id'])")
+    AGENT=$(python3 -c "import json; print(json.load(open('${PROMPTS_FILE}'))[${pidx}]['agent'])")
+
+    mkdir -p "${RESULTS_DIR}/${PROMPT_ID}"
+    OUT_JSON="${RESULTS_DIR}/${PROMPT_ID}/${model}.json"
+    OUT_MD="${RESULTS_DIR}/${PROMPT_ID}/${model}.md"
+
+    # Resume: skip if already done
+    if [ -f "$OUT_JSON" ] && [ -s "$OUT_JSON" ]; then
+      done_count=$((done_count + 1))
+      echo "  [${PROMPT_ID}] cached (${done_count}/${TOTAL})"
+      continue
+    fi
+
+    BODY=$(python3 -c "
+import json
+p = json.load(open('${PROMPTS_FILE}'))[${pidx}]
+print(json.dumps({
+    'model': '${model}',
+    'messages': [{'role': 'user', 'content': p['prompt']}],
+    'temperature': 0.6,
+    'max_tokens': 2048,
+    'seed': 42,
+    'stream': False
+}))
+")
+
+    SUCCESS=0
+    for attempt in 1 2; do
+      HTTP_CODE=$(curl -s -w '%{http_code}' -o "$OUT_JSON" \
+        --max-time 300 \
+        -X POST "${ENDPOINT}/chat/completions" \
+        -H "Content-Type: application/json" \
+        -d "$BODY" 2>/dev/null) || true  # on curl failure HTTP_CODE is 000, so the retry path still runs under set -e
+
+      if [ "$HTTP_CODE" = "200" ]; then
+        SUCCESS=1
+        break
+      else
+        if [ "$attempt" = "1" ]; then
+          echo "  [${PROMPT_ID}] HTTP ${HTTP_CODE}, retrying in 10s..."
+          sleep 10
+        else
+          echo "ERROR: HTTP ${HTTP_CODE}" > "$OUT_MD"; rm -f "$OUT_JSON"  # drop the error body so a rerun retries this prompt
+          echo "  [${PROMPT_ID}] FAILED (HTTP ${HTTP_CODE})"
+        fi
+      fi
+    done
+
+    if [ "$SUCCESS" = "1" ]; then
+      python3 -c "
+import json
+d = json.load(open('${OUT_JSON}'))
+msg = d.get('choices', [{}])[0].get('message', {})
+content = msg.get('content', '') or ''
+reasoning = msg.get('reasoning_content', '') or ''
+out = ''
+if reasoning:
+    out += '<think>\n' + reasoning + '\n</think>\n\n'
+out += content
+open('${OUT_MD}', 'w').write(out)
+" 2>/dev/null || echo "  [${PROMPT_ID}] warning: could not write ${OUT_MD}"
+      done_count=$((done_count + 1))
+      METRICS=$(python3 -c "
+import json
+d = json.load(open('${OUT_JSON}'))
+t = d.get('timings', {})
+tps = t.get('predicted_per_second', 0)
+tok = d.get('usage', {}).get('completion_tokens', 0)
+print(f'{tps:.1f}tok/s {tok}tok')
+" 2>/dev/null || echo "?")
+      echo "  [${PROMPT_ID}] done (${METRICS}) [${done_count}/${TOTAL}]"
+    fi
+
+    sleep 2
+  done
+done
+
+# ── Generate COMPARE.md ──────────────────────────────────────────────
+
+echo ""
+echo "Generating COMPARE.md..."
+
+MODELS_JSON=$(printf '%s\n' "${MODELS[@]}" | python3 -c "import json,sys; print(json.dumps([l.strip() for l in sys.stdin if l.strip()]))")
+
+python3 -c "
+import json
+from pathlib import Path
+
+prompts = json.load(open('${PROMPTS_FILE}'))
+results_dir = Path('${RESULTS_DIR}')
+models = json.loads('${MODELS_JSON}')
+
+lines = ['# A/B Model Comparison\n']
+
+timing_rows = []
+
+for p in prompts:
+    pid = p['id']
+    agent = p['agent']
+    short = p['prompt'][:80]
+    lines.append(f'## [{pid}] {agent}\n')
+    lines.append(f'> {short}...\n')
+
+    for model in models:
+        md_path = results_dir / pid / f'{model}.md'
+        json_path = results_dir / pid / f'{model}.json'
+        lines.append(f'### {model}\n')
+        if md_path.exists():
+            content = md_path.read_text().strip()
+            lines.append(f'{content}\n')
+        else:
+            lines.append('*(no response)*\n')
+
+        if json_path.exists():
+            try:
+                d = json.loads(json_path.read_text())
+                t = d.get('timings', {})
+                u = d.get('usage', {})
+                timing_rows.append({
+                    'prompt_id': pid,
+                    'model_id': model,
+                    'prompt_tps': t.get('prompt_per_second', 0),
+                    'predicted_tps': t.get('predicted_per_second', 0),
+                    'total_tokens': u.get('total_tokens', 0),
+                    'latency_ms': round((t.get('prompt_ms', 0) or 0) + (t.get('predicted_ms', 0) or 0), 1),
+                })
+            except Exception:  # skip unreadable or malformed result files
+                pass
+    lines.append('---\n')
+
+# Timing table
+lines.append('## Timing Summary (predicted tok/s)\n')
+pids = list(dict.fromkeys(r['prompt_id'] for r in timing_rows))
+lines.append('| prompt | ' + ' | '.join(models) + ' |')
+lines.append('|--------' + '|------' * len(models) + '|')
+for pid in pids:
+    cells = []
+    for model in models:
+        match = [r for r in timing_rows if r['prompt_id'] == pid and r['model_id'] == model]
+        if match:
+            cells.append(f\"{match[0]['predicted_tps']:.0f}\")
+        else:
+            cells.append('—')
+    lines.append(f'| {pid} | ' + ' | '.join(cells) + ' |')
+
+Path('${COMPARE_FILE}').write_text('\n'.join(lines) + '\n')
+print(f'Wrote ${COMPARE_FILE}')
+
+# timing.csv
+import csv
+with open('${TIMING_FILE}', 'w', newline='') as f:
+    w = csv.DictWriter(f, fieldnames=['prompt_id', 'model_id', 'prompt_tps', 'predicted_tps', 'total_tokens', 'latency_ms'])
+    w.writeheader()
+    w.writerows(timing_rows)
+print(f'Wrote ${TIMING_FILE}')
+"
+
+t_end=$(date +%s)
+elapsed=$(( t_end - t_start ))
+echo ""
+echo "================================================================"
+echo "  COMPLETE in $(( elapsed / 60 ))m $(( elapsed % 60 ))s"
+echo "  Results: ${RESULTS_DIR}/"
+echo "  Compare: ${COMPARE_FILE}"
+echo "  Timing:  ${TIMING_FILE}"
+echo "================================================================"
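The resume logic in `run.sh` treats any non-empty `${model}.json` as a finished result, and `curl -o` writes directly to that path. One way to make that test more robust is to validate the response and only then rename it into place, so an error body or a partially written file never counts as done. The sketch below illustrates that idea. `save_result` and `is_done` are hypothetical helpers, and requiring a non-empty `choices` field is an assumption about what a usable response looks like.

```python
import json
import os
import tempfile
from pathlib import Path


def save_result(out_path: Path, body: str) -> bool:
    """Write body to out_path only if it is a JSON object with 'choices'.

    Returns True when the file was written, False when the body was rejected.
    """
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or not data.get("choices"):
        return False

    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=str(out_path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(body)
        os.replace(tmp_name, out_path)  # replaces the target in one step
    except BaseException:
        # Remove the temp file on any failure, then re-raise.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return True


def is_done(out_path: Path) -> bool:
    """Mirror of the script's resume test: file exists and is non-empty."""
    return out_path.is_file() and out_path.stat().st_size > 0
```

With this shape, an interrupted or failed request leaves at most a stray `.tmp` file, which the resume test ignores because it only looks at the final `.json` path.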