feat(booterm): structured pty_exited WS notifications. Plan-validated, impl-validated, code-reviewed green (contracts build clean, contracts test 29/29, booterm + web typecheck clean). wip: in-progress inference/provider refactor (agents.ts, provider.ts, new llama-providers.ts, removed llama-args-validator), plus arena, dispatcher, compaction, schema changes. openspec: pty-exit-notifications complete; x-agent-flags planned (not yet implemented).
Why
Per-agent llama-server tuning today is limited to the sampler fields that flow through providerOptions.openaiCompatible in the request body (top_k, min_p, dry_*, etc.). Flags that affect server startup configuration -- KV cache quantization (--cache-type-k), context size (-c), flash attention (--flash-attn), GPU layer count (-ngl) -- cannot be overridden per-agent without spawning a separate sidecar process with different BASE_ARGS.
The llama-sidecar already parses an X-Agent-Flags: --top-k 20 --cache-type-k q8_0 header and applies those flags when routing to a sidecar process. BooCode just needs to emit this header from agent config.
What Changes
- Add a llama_flags field to the Agent type (raw llama CLI args string)
- Parse llama_flags from AGENTS.md frontmatter
- Build and emit the X-Agent-Flags header on inference requests routed to the sidecar
- Leave deny/shadow flag validation to the sidecar
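The changes above could be sketched roughly as follows. This is a minimal illustration, not the actual apps/server code: the `Agent` shape beyond `llama_flags`, the helper names `parseLlamaFlags` and `buildAgentFlagsHeaders`, and the whitespace handling are all assumptions. It also assumes the frontmatter has already been parsed into a plain object.

```typescript
// Hypothetical Agent shape; only `llama_flags` comes from this proposal.
interface Agent {
  name: string;
  llama_flags?: string; // raw llama CLI args, e.g. "--cache-type-k q8_0 -c 8192"
}

// Read llama_flags from already-parsed AGENTS.md frontmatter.
// Non-string or blank values are treated as "not set".
// Assumption: runs of whitespace (including newlines from multi-line YAML
// strings) carry no meaning, so they are collapsed to single spaces to keep
// the value on one header line.
function parseLlamaFlags(frontmatter: Record<string, unknown>): string | undefined {
  const raw = frontmatter["llama_flags"];
  if (typeof raw !== "string") return undefined;
  const flags = raw.trim().split(/\s+/).join(" ");
  return flags.length > 0 ? flags : undefined;
}

// Build the extra headers for a sidecar-routed request.
// No validation happens here: deny/shadow checks stay in the sidecar.
function buildAgentFlagsHeaders(agent: Agent): Record<string, string> {
  const headers: Record<string, string> = {};
  if (agent.llama_flags) {
    headers["X-Agent-Flags"] = agent.llama_flags;
  }
  return headers;
}
```

An agent without `llama_flags` yields an empty header object, so requests for existing agents would be unchanged under this sketch.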
Scope
apps/server only. The sidecar (/opt/forks/llama-sidecar) already supports X-Agent-Flags -- no out-of-repo changes needed.
Non-goals
- No new typed fields for individual llama-server flags (use llama_flags for raw args)
- No changes to the sampler body path (top_k, min_p, etc. continue via providerOptions.openaiCompatible)
- No changes to compaction or task-model direct-fetch paths (they don't need per-agent flags)
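To illustrate the second non-goal, here is a hedged sketch of how the two paths might stay separate when a request is assembled. The `sampler` field, the `RequestParts` type, and `buildRequestParts` are hypothetical names introduced for illustration. Only `llama_flags`, `X-Agent-Flags`, and `providerOptions.openaiCompatible` come from the proposal.

```typescript
// Hypothetical Agent shape for illustration only.
interface Agent {
  name: string;
  sampler?: Record<string, number>; // hypothetical holder for top_k, min_p, ...
  llama_flags?: string;             // raw llama CLI args from this proposal
}

// Hypothetical view of the two channels a request uses.
interface RequestParts {
  headers: Record<string, string>;
  providerOptions: { openaiCompatible: Record<string, number> };
}

function buildRequestParts(agent: Agent): RequestParts {
  // Startup-style flags travel only in the header.
  const headers: Record<string, string> = {};
  if (agent.llama_flags) {
    headers["X-Agent-Flags"] = agent.llama_flags;
  }
  // Sampler fields keep flowing through the request body, untouched by llama_flags.
  const openaiCompatible: Record<string, number> = { ...(agent.sampler ?? {}) };
  return { headers, providerOptions: { openaiCompatible } };
}
```

In this sketch, setting `llama_flags` changes only the headers. The sampler options sent in the body are the same with or without it.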