chore: snapshot working tree - pty_exited notifications + in-flight inference WIP

feat(booterm): structured pty_exited WS notifications. Plan-validated, impl-validated, code-reviewed green (contracts build clean, contracts test 29/29, booterm + web typecheck clean). wip: in-progress inference/provider refactor (agents.ts, provider.ts, new llama-providers.ts, removed llama-args-validator), plus arena, dispatcher, compaction, schema changes. openspec: pty-exit-notifications complete; x-agent-flags planned (not yet implemented).
2026-06-14 12:48:47 +00:00
parent 0ed506f1da
commit b18de2a331
204 changed files with 25344 additions and 867 deletions
--- a/openspec/changes/x-agent-flags/design.md
+++ b/openspec/changes/x-agent-flags/design.md
@@ -0,0 +1,127 @@
+## Overview
+
+Add a `llama_flags` string field to the Agent type. On each inference request, if the agent has `llama_flags` set, emit an `X-Agent-Flags` HTTP header with the raw CLI args. The llama-sidecar parses this header and applies the flags when routing to a sidecar process.
+
+## Header injection point
+
+AI SDK v6 `streamText()` accepts a `headers` option (`Record<string, string | undefined>`) via `CallSettings`. The `@ai-sdk/openai-compatible` provider merges these with static headers via `combineHeaders()` at request time. This is the cleanest injection point -- no modification to the cached provider or fetch wrapper needed.
+
+File: `apps/server/src/services/inference/stream-phase-adapter.ts`
+
+```typescript
+// In streamCompletion(), add headers to the streamText() call:
+const agentFlagsHeader = buildAgentFlagsHeader(agent);
+const result = streamText({
+  model: upstreamModel(ctx.config, model, agent ?? null, 'boochat'),
+  messages: aiMessages,
+  // ...existing options...
+  headers: agentFlagsHeader
+    ? { 'X-Agent-Flags': agentFlagsHeader }
+    : undefined,
+});
+```
+
+## Builder function
+
+New pure helper `buildAgentFlagsHeader(agent: Agent | null): string | undefined` in `stream-phase-adapter.ts`:
+
+```typescript
+export function buildAgentFlagsHeader(agent: Agent | null): string | undefined {
+  if (!agent?.llama_flags) return undefined;
+  const trimmed = agent.llama_flags.trim();
+  return trimmed.length > 0 ? trimmed : undefined;
+}
+```
+
+The function is trivial because the sidecar does all validation (denylist, shadow flags). BooCode only trims the string and passes it through unchanged.
+
+## Agent type change
+
+File: `apps/server/src/types/api.ts`
+
+Add to the `Agent` interface:
+
+```typescript
+llama_flags: string | null;  // raw llama CLI args sent as X-Agent-Flags header
+```
+
+`null` means no header emitted (default).
+
+## Frontmatter parsing (V1 fix)
+
+File: `apps/server/src/services/agents.ts`
+
+The `parseFrontmatter()` function has an explicit if/else-if chain for known keys. Unknown keys are silently ignored (line 258: `// Unknown keys silently ignored`). An explicit branch MUST be added:
+
+```typescript
+} else if (key === 'llama_flags') {
+  data.llama_flags = stripQuotes(valueRaw);
+}
+```
+
+Add to `ParsedFrontmatter`:
+
+```typescript
+llama_flags?: string;
+```
+
+## Agent return-object wiring (V2 fix)
+
+File: `apps/server/src/services/agents.ts`
+
+`parseAgentSection()` explicitly constructs every field of the returned agent object. An explicit line must be added:
+
+```typescript
+llama_flags: typeof fm.llama_flags === 'string' ? fm.llama_flags : null,
+```
+
+## Sentinel summaries (V3 fix)
+
+File: `apps/server/src/services/inference/sentinel-summaries.ts`
+
+`runWrapUpSummary()` calls `streamCompletion()` at lines 96-113 but omits the 8th argument, `agent`. Two options:
+
+**Option A (recommended):** Add `agent` to the call so sentinel summaries also get agent flags. This is consistent -- the summary uses the same model as the conversation.
+
+**Option B:** Document that sentinel summaries intentionally don't use agent flags (e.g., "summaries use FAST_MODEL, a separate slot"). This requires verifying that compaction/summaries actually use FAST_MODEL.
+
+The plan recommends Option A for consistency. Add `, agent` after `signal` in the `streamCompletion` call.
+
+## Provider scope (JD-003 note)
+
+The `streamText({ headers })` approach sends the header to ALL providers (DeepSeek, gateway, llama-swap). This is acceptable because:
+- The DeepSeek API is expected to ignore unknown headers (typical HTTP server behavior)
+- The gateway re-forwards headers to the chosen backend
+- Only the sidecar parses `X-Agent-Flags`
+
+If this becomes an issue, provider-aware filtering can be added later by checking `isDeepSeekModel(model)` before emitting the header.
+
+## Why not extend the fetch wrapper
+
+The existing `getSwapProvider()` fetch wrapper (`provider.ts:23-33`) is cached per baseURL. Agent flags are per-agent, not per-provider. Extending the wrapper would either:
+- Create N cached providers per baseURL (one per unique flags combination) -- wasteful
+- Use a mutable closure variable -- racy when requests for different agents are in flight concurrently
+
+The `streamText({ headers })` approach is the AI SDK's intended per-request header mechanism and avoids both problems.
+
+## Why not forward existing sampler fields as X-Agent-Flags
+
+The existing sampler fields (top_k, min_p, etc.) already flow through `providerOptions.openaiCompatible` in the request body, and llama-server applies them per request. `X-Agent-Flags` is for startup args that cannot be changed per request (context size, cache quantization, GPU layers). Because the sidecar routes to a process keyed by (model, flags), forwarding sampler fields through the header would be redundant and would add process-spawn overhead for no benefit.
+
+## Compaction scope
+
+Compaction (`compaction.ts`) uses `resolveModelEndpoint()` for direct `fetch()` calls and does not go through `streamCompletion()`. It does not need agent flags because:
+1. Compaction uses `FAST_MODEL` (a cheaper model per CLAUDE.md), which is a separate model slot with its own startup flags
+2. Compaction is a background maintenance task, not a user-facing agent interaction
+
+## Data flow
+
+```
+Agent.llama_flags (from AGENTS.md)
+  -> buildAgentFlagsHeader(agent)
+  -> streamText({ headers: { 'X-Agent-Flags': '...' } })
+  -> @ai-sdk/openai-compatible combineHeaders()
+  -> fetch() request to llama-swap/sidecar
+  -> sidecar parseFlags() + ValidateExtraArgs()
+  -> sidecar routes to process with matching (model, flags) hash
+```
--- a/openspec/changes/x-agent-flags/proposal.md
+++ b/openspec/changes/x-agent-flags/proposal.md
@@ -0,0 +1,22 @@
+## Why
+
+Per-agent llama-server tuning today is limited to the sampler fields that flow through `providerOptions.openaiCompatible` in the request body (top_k, min_p, dry_*, etc.). Flags that affect server startup configuration -- KV cache quantization (`--cache-type-k`), context size (`-c`), flash attention (`--flash-attn`), GPU layer count (`-ngl`) -- cannot be overridden per-agent without spawning a separate sidecar process with different BASE_ARGS.
+
+The llama-sidecar already parses an `X-Agent-Flags: --top-k 20 --cache-type-k q8_0` header and applies those flags when routing to a sidecar process. BooCode just needs to emit this header from agent config.
+
+## What Changes
+
+- Add a `llama_flags` field to the Agent type (raw llama CLI args string)
+- Parse `llama_flags` from AGENTS.md frontmatter
+- Build and emit the `X-Agent-Flags` header on inference requests (the header reaches every provider, but only the sidecar parses it)
+- Leave deny/shadow flag validation to the sidecar; BooCode does not validate the flags
+
+## Scope
+
+apps/server only. The sidecar (`/opt/forks/llama-sidecar`) already supports `X-Agent-Flags` -- no out-of-repo changes needed.
+
+## Non-goals
+
+- No new typed fields for individual llama-server flags (use `llama_flags` for raw args)
+- No changes to the sampler body path (top_k, min_p, etc. continue via providerOptions.openaiCompatible)
+- No changes to compaction or task-model direct-fetch paths (they don't need per-agent flags)
--- a/openspec/changes/x-agent-flags/specs/agent-flags-header/spec.md
+++ b/openspec/changes/x-agent-flags/specs/agent-flags-header/spec.md
@@ -0,0 +1,46 @@
+## ADDED Requirements
+
+### Requirement: Agent llama_flags frontmatter field
+The system SHALL parse a `llama_flags` string field from agent AGENTS.md frontmatter.
+
+#### Scenario: Agent with llama_flags set
+- **GIVEN** an agent with `llama_flags: "--cache-type-k q8_0 -c 16384"`
+- **WHEN** the agent is parsed from AGENTS.md
+- **THEN** `agent.llama_flags` equals `"--cache-type-k q8_0 -c 16384"`
+
+#### Scenario: Agent without llama_flags
+- **GIVEN** an agent with no `llama_flags` field in frontmatter
+- **WHEN** the agent is parsed from AGENTS.md
+- **THEN** `agent.llama_flags` equals `null`
+
+### Requirement: X-Agent-Flags header emission
+The inference pipeline SHALL emit an `X-Agent-Flags` HTTP header when the agent has `llama_flags` set.
+
+#### Scenario: Header emitted for agent with flags
+- **GIVEN** an agent with `llama_flags: "--cache-type-k q8_0"`
+- **WHEN** `streamCompletion()` is called with that agent
+- **THEN** the `streamText()` call receives `headers: { 'X-Agent-Flags': '--cache-type-k q8_0' }`
+
+#### Scenario: No header when agent has no flags
+- **GIVEN** an agent with `llama_flags: null`
+- **WHEN** `streamCompletion()` is called with that agent
+- **THEN** no `X-Agent-Flags` header is included in the request
+
+#### Scenario: No header when agent is null
+- **GIVEN** no agent (raw chat session)
+- **WHEN** `streamCompletion()` is called
+- **THEN** no `X-Agent-Flags` header is included in the request
+
+#### Scenario: Whitespace-only flags produce no header
+- **GIVEN** an agent with `llama_flags: "   "`
+- **WHEN** `streamCompletion()` is called with that agent
+- **THEN** no `X-Agent-Flags` header is included in the request
+
+### Requirement: Existing sampler fields unchanged
+The existing sampler fields (top_k, min_p, etc.) SHALL continue to flow through `providerOptions.openaiCompatible` in the request body, independent of the `X-Agent-Flags` header channel.
+
+#### Scenario: Dual-channel sampling
+- **GIVEN** an agent with `top_k: 20` and `llama_flags: "--cache-type-k q8_0"`
+- **WHEN** an inference request is made
+- **THEN** the request body contains `top_k: 20` via providerOptions
+- **AND** the request header contains `X-Agent-Flags: --cache-type-k q8_0`
--- a/openspec/changes/x-agent-flags/tasks.md
+++ b/openspec/changes/x-agent-flags/tasks.md
@@ -0,0 +1,35 @@
+## 1. Add llama_flags to Agent type
+
+- [ ] 1.1 Add `llama_flags: string | null` to `Agent` interface in `apps/server/src/types/api.ts`
+- [ ] 1.2 Verify no downstream type errors (tsc --noEmit)
+
+## 2. Parse llama_flags from AGENTS.md frontmatter
+
+- [ ] 2.1 Add `llama_flags?: string` to `ParsedFrontmatter` in `apps/server/src/services/agents.ts`
+- [ ] 2.2 Add explicit `else if (key === 'llama_flags')` branch in `parseFrontmatter()` before the "Unknown keys silently ignored" fallthrough (agents.ts ~line 258)
+- [ ] 2.3 Add `llama_flags: typeof fm.llama_flags === 'string' ? fm.llama_flags : null` to the return object in `parseAgentSection()` (agents.ts ~line 364)
+
+## 3. Build X-Agent-Flags header
+
+- [ ] 3.1 Add `buildAgentFlagsHeader(agent: Agent | null): string | undefined` to `apps/server/src/services/inference/stream-phase-adapter.ts`
+- [ ] 3.2 Export the function for testability
+
+## 4. Emit header on inference requests
+
+- [ ] 4.1 In `streamCompletion()`, compute `agentFlagsHeader` from the agent parameter
+- [ ] 4.2 Pass `headers: { 'X-Agent-Flags': agentFlagsHeader }` to `streamText()` when `agentFlagsHeader` is defined; otherwise pass `headers: undefined`
+- [ ] 4.3 Verify the header is NOT emitted when agent is null or llama_flags is null/empty
+
+## 5. Fix sentinel summaries (V3)
+
+- [ ] 5.1 In `sentinel-summaries.ts`, add `agent` as the 8th argument to the `streamCompletion()` call in `runWrapUpSummary()` (after `signal`)
+
+## 6. Write tests
+
+- [ ] 6.1 Add unit test for `buildAgentFlagsHeader` in `stream-phase-adapter.test.ts` (null agent, null llama_flags, empty string, whitespace-only, valid flags)
+- [ ] 6.2 Add test verifying `streamText` receives `headers: { 'X-Agent-Flags': '...' }` when agent has llama_flags
+
+## 7. Verify end-to-end
+
+- [ ] 7.1 Run `pnpm -C apps/server build` to confirm typecheck passes
+- [ ] 7.2 Run `pnpm -C apps/server test` to confirm no regressions