v1.13.9: compaction overflow trigger — 0.85 × ctx_max early trigger

Opencode pattern (session/overflow.ts): fire compaction at 85% of ctx_max, replacing the v1.11.0-era `ctx_max - 20_000` formula. Old formula: usable = ctx_max - 20_000 - ctx=262144 → trigger at 242144 (92.4%) — only 7.6% headroom - ctx=100000 → trigger at 80000 (80.0%) - ctx= 32000 → trigger at 12000 (37.5%) — over-eager - ctx<=20000 → trigger at 0 — never fires New formula: usable = floor(0.85 * ctx_max) - ctx=262144 → trigger at 222822 (85.0%) — 15% headroom for summarizer - ctx=100000 → trigger at 85000 (85.0%) - ctx= 32000 → trigger at 27200 (85.0%) - ctx= 8192 → trigger at 6963 (85.0%) Ratio gives consistent headroom at any context scale. The qwen3.6 daily driver gets ~19k tokens more breathing room before overflow; small-ctx models no longer degenerate to never-triggering. usable() is the only consumer of COMPACTION_BUFFER → constant deleted. New EARLY_TRIGGER_RATIO constant takes its place. isOverflow() and the maybeFlagForCompaction() call site at payload.ts:184 are unchanged — formula swap is internal to compaction.ts. payload.ts comment touched only to drop the stale COMPACTION_BUFFER reference (PRUNE_TRIGGER_TOKENS stays at 20k as the prune-freed threshold; independent of the overflow formula). Tests: 4 new usable() corner cases (262k/100k/8k/zero+negative), plus 5 isOverflow() numbers shifted to match the 85k budget at ctx=100k. 195/195 server tests pass (was 194). Smoke: ratio math verified by unit tests at all four corners. Live cap-hit verification deferred — requires accumulating >222k tokens in a session under qwen3.6-35b-a3b-mxfp4 (was >242k pre-fix); will surface organically in extended use.
2026-05-22 13:59:14 +00:00
parent a0c8d212cb
commit b06a4a8e55
3 changed files with 53 additions and 34 deletions
--- a/apps/server/src/services/inference/payload.ts
+++ b/apps/server/src/services/inference/payload.ts
@@ -199,10 +199,13 @@ export async function maybeFlagForCompaction(
  );
  if (!overflow) return;

-  // v1.13.4: try the cheap prune first. If it freed at least the buffer
-  // worth of tokens (PRUNE_TRIGGER_TOKENS, identical to COMPACTION_BUFFER),
-  // we're below the threshold again — skip flagging summarize for the next
-  // turn. The next turn's overflow check will re-evaluate from scratch.
+  // v1.13.4: try the cheap prune first. If it freed at least
+  // PRUNE_TRIGGER_TOKENS (20k) worth of context, we're below the threshold
+  // again — skip flagging summarize for the next turn. The next turn's
+  // overflow check will re-evaluate from scratch.
+  // v1.13.9: the overflow trigger above is now 85% of ctx_max (was
+  // ctx_max - 20k). PRUNE_TRIGGER_TOKENS stays at 20k as the prune-freed
+  // threshold — independent of the overflow formula.
  // Prune failures (DB errors etc.) propagate so the surrounding inference
  // path sees them; the catch in finalizeCompletion / executeToolPhase
  // doesn't shield this — by design, we want to know if prune is broken.