Merge phase3-lifecycle: v2.6.10 lifecycle hardening (completes v2.6 persistent agent sessions)

This commit is contained in:
2026-06-01 01:10:16 +00:00
18 changed files with 1803 additions and 46 deletions

View File

@@ -2,6 +2,10 @@
All notable changes per release tag. Most recent on top, ordered by tag creation date (which matches the git history). Tag names follow `vMAJOR.MINOR.PATCH-slug` — the slug describes what shipped, so the tag name alone is enough to recall the batch. All notable changes per release tag. Most recent on top, ordered by tag creation date (which matches the git history). Tag names follow `vMAJOR.MINOR.PATCH-slug` — the slug describes what shipped, so the tag name alone is enough to recall the batch.
## v2.6.10-lifecycle-hardening — 2026-06-01
v2.6 Phase 3 (the last phase) — lifecycle hardening of the warm-process backends. **Idle eviction + LRU cap:** the agent pool runs a 60s sweep that evicts backends/sessions idle past `AGENT_POOL_IDLE_TTL_MS` (30 min default) and any beyond `AGENT_POOL_MAX_LIVE` (10, LRU) — **never a busy one** (in-flight turn, double-checked via a new `isBusy()` backend hook); the worktree persists (DB-backed) and the next turn re-spawns + reattaches. The eviction/LRU/restart decisions are factored into a pure `lifecycle-decisions.ts` (modeled on the inference `selectPruneTargets` pattern). **Crash recovery:** lifts openchamber's health-monitor + busy-aware-restart + consecutive-failure + stale-busy-grace state machine into `opencode-server.ts` (with port reclaim) and `warm-acp.ts` — an opencode server crash settles in-flight turns as failed, marks the rows `crashed`, and recreates fresh sessions (a fresh server can't hold the old in-memory id), while a warm-ACP child crash re-`session/new`s next turn; the F.1 turn-guard and U.6 usage are preserved (their tests still pass). **Worktree reaper:** a periodic reaper removes orphan on-disk worktrees (no live `worktrees` row, 1h grace) behind a superset-style preflight that skips dirty/unpushed/unmerged work, with Paseo-style soft-delete (`status='archived'`). Plus close hooks (`/api/chats/:id/close`, `/api/sessions/:id/close`, awaiting the apps/server caller) and diff re-baseline after `apply_pending`. Built test-first — 35 new tests (`lifecycle-decisions` 22, `agent-pool` 13) + a DB-opt-in reconnect integration test; 215 coder tests pass, tsc + build clean. **This completes v2.6** (Phase 03 + F.1 + Phase 1-UX). Remaining follow-ups (out of v2.6 scope): the apps/server close-hook caller, the 3.7 DiffPanel staging-boundary hint (frontend), and live Smoke 2/2b/3.
## v2.6.9-warm-acp — 2026-05-31 ## v2.6.9-warm-acp — 2026-05-31
v2.6 Phase 2: goose and qwen now run as **warm ACP backends** instead of one-shot-per-task. A new `WarmAcpBackend` (`backends/warm-acp.ts`, implementing the same `AgentBackend` interface as the opencode warm server) holds one persistent `goose acp` / `qwen --acp` child + `ClientSideConnection` + ACP session per `(chat, agent)`, running `initialize` + `session/new` once and reusing the connection across turns; per-turn abort cancels the in-flight prompt (`session/cancel`) without killing the child, and a child exit marks `agent_sessions.status='crashed'` for re-spawn on the next turn. The dispatcher routes `goose`/`qwen` chat-tab tasks to the pooled warm backend via a pure `shouldUseWarmBackend(task)` predicate (warm only when both `session_id` and `chat_id` are set), keeping the one-shot `runExternalAgent` path as the fallback for session-less creators (arena, MCP, `new_task`); broker frames + `persistExternalAgentTurn` + the latest-wins `pending_changes` diff are identical to the opencode path. The `acp-dispatch.ts` `handleSessionUpdate` switch was extracted into a pure shared `acp-event-map.ts` mapper used by both the one-shot and warm paths (one-shot behavior byte-identical, all existing acp tests green). The design's `unstable_resumeSession` concern is resolved — the installed `@agentclientprotocol/sdk@^0.22.1` exposes stable `resumeSession`/`loadSession`, but resume is moot in the hot path (warm reuse needs none); cross-restart resume + idle eviction are deferred to Phase 3. Built test-first (15 new tests: `warm-acp-routing`, `acp-event-map`); 180 coder tests pass, tsc + build clean. **Smoke 2/2b (live two-message warm reuse + the opencode→boocode→opencode switch round-trip) to be run post-deploy.** Phase 3 (lifecycle hardening) is the last v2.6 phase. v2.6 Phase 2: goose and qwen now run as **warm ACP backends** instead of one-shot-per-task. A new `WarmAcpBackend` (`backends/warm-acp.ts`, implementing the same `AgentBackend` interface as the opencode warm server) holds one persistent `goose acp` / `qwen --acp` child + `ClientSideConnection` + ACP session per `(chat, agent)`, running `initialize` + `session/new` once and reusing the connection across turns; per-turn abort cancels the in-flight prompt (`session/cancel`) without killing the child, and a child exit marks `agent_sessions.status='crashed'` for re-spawn on the next turn. The dispatcher routes `goose`/`qwen` chat-tab tasks to the pooled warm backend via a pure `shouldUseWarmBackend(task)` predicate (warm only when both `session_id` and `chat_id` are set), keeping the one-shot `runExternalAgent` path as the fallback for session-less creators (arena, MCP, `new_task`); broker frames + `persistExternalAgentTurn` + the latest-wins `pending_changes` diff are identical to the opencode path. The `acp-dispatch.ts` `handleSessionUpdate` switch was extracted into a pure shared `acp-event-map.ts` mapper used by both the one-shot and warm paths (one-shot behavior byte-identical, all existing acp tests green). The design's `unstable_resumeSession` concern is resolved — the installed `@agentclientprotocol/sdk@^0.22.1` exposes stable `resumeSession`/`loadSession`, but resume is moot in the hot path (warm reuse needs none); cross-restart resume + idle eviction are deferred to Phase 3. Built test-first (15 new tests: `warm-acp-routing`, `acp-event-map`); 180 coder tests pass, tsc + build clean. **Smoke 2/2b (live two-message warm reuse + the opencode→boocode→opencode switch round-trip) to be run post-deploy.** Phase 3 (lifecycle hardening) is the last v2.6 phase.

View File

@@ -35,6 +35,21 @@ const ConfigSchema = z.object({
// SSH access to the host for external agent dispatch (Phase 5) // SSH access to the host for external agent dispatch (Phase 5)
BOOCODER_SSH_HOST: z.string().default('100.114.205.53'), BOOCODER_SSH_HOST: z.string().default('100.114.205.53'),
BOOCODER_SSH_USER: z.string().default('samkintop'), BOOCODER_SSH_USER: z.string().default('samkintop'),
// v2.6 Phase 3 (lifecycle hardening). Idle TTL: evict a non-busy warm backend
// (opencode server / warm-ACP child) after this long with no turn — its worktree
// + agent_sessions row persist, so the next turn re-spawns + reattaches. 30 min
// default (design §6).
AGENT_POOL_IDLE_TTL_MS: z.coerce.number().int().positive().default(1_800_000),
// LRU cap: max live warm backends before the least-recently-used (non-busy) ones
// are evicted. Bounds the long-lived-daemon's per-(chat,agent) Map growth.
AGENT_POOL_MAX_LIVE: z.coerce.number().int().positive().default(10),
// Periodic sweep cadence (idle/LRU pool eviction + orphan-worktree reap). 60s
// mirrors the apps/server truncation/stale-streaming sweeper.
LIFECYCLE_SWEEP_INTERVAL_MS: z.coerce.number().int().positive().default(60_000),
// Orphan-worktree grace: an on-disk worktree dir with no live `worktrees` row is
// only reaped after it's been untouched this long (avoids sweeping a dir mid
// ensureSessionWorktree create). 1h default.
ORPHAN_WORKTREE_GRACE_MS: z.coerce.number().int().positive().default(3_600_000),
}); });
export type Config = z.infer<typeof ConfigSchema>; export type Config = z.infer<typeof ConfigSchema>;

View File

@@ -32,10 +32,12 @@ import { registerStatsRoutes } from './routes/stats.js';
import { registerArenaRoutes } from './routes/arena.js'; import { registerArenaRoutes } from './routes/arena.js';
import { registerProviderRoutes } from './routes/providers.js'; import { registerProviderRoutes } from './routes/providers.js';
import { registerWorktreeSafetyRoutes } from './routes/worktree-safety.js'; import { registerWorktreeSafetyRoutes } from './routes/worktree-safety.js';
import { registerLifecycleRoutes } from './routes/lifecycle.js';
import { registerWebSocket } from './routes/ws.js'; import { registerWebSocket } from './routes/ws.js';
// Phase 4: dispatcher + agent probe // Phase 4: dispatcher + agent probe
import { createDispatcher } from './services/dispatcher.js'; import { createDispatcher } from './services/dispatcher.js';
import { agentPool } from './services/agent-pool.js'; import { agentPool } from './services/agent-pool.js';
import { createOrphanWorktreeReaper } from './services/orphan-worktree-reaper.js';
import { probeAgents } from './services/agent-probe.js'; import { probeAgents } from './services/agent-probe.js';
import { getProviderSnapshot, persistProbedModels } from './services/provider-snapshot.js'; import { getProviderSnapshot, persistProbedModels } from './services/provider-snapshot.js';
import { setPermissionHooks } from './services/permission-waiter.js'; import { setPermissionHooks } from './services/permission-waiter.js';
@@ -181,10 +183,30 @@ async function main() {
// Phase 4: dispatcher — polls tasks table and runs inference // Phase 4: dispatcher — polls tasks table and runs inference
const dispatcher = createDispatcher({ sql, inference: inferenceApi, broker, log: app.log, config }); const dispatcher = createDispatcher({ sql, inference: inferenceApi, broker, log: app.log, config });
dispatcher.start(); dispatcher.start();
// v2.6 Phase 3: configure + start the agent-pool lifecycle sweep (idle-TTL +
// LRU-cap eviction of warm backends, plus each backend's proactive health probe)
// and the orphan-worktree reaper. Both run on the same periodic timer.
agentPool.configure({
idleTtlMs: config.AGENT_POOL_IDLE_TTL_MS,
maxLive: config.AGENT_POOL_MAX_LIVE,
sweepIntervalMs: config.LIFECYCLE_SWEEP_INTERVAL_MS,
log: app.log,
});
agentPool.startReaper(app.log);
const orphanReaper = createOrphanWorktreeReaper({
sql,
log: app.log,
intervalMs: config.LIFECYCLE_SWEEP_INTERVAL_MS,
graceMs: config.ORPHAN_WORKTREE_GRACE_MS,
});
orphanReaper.start();
app.addHook('onClose', async () => { app.addHook('onClose', async () => {
// stop() first so in-flight dispatcher turns settle, then drain the pool. // stop() first so in-flight dispatcher turns settle, then stop the reapers and
// Pool is empty in Phase 0 (nothing spawns yet) — dispose() is inert. // drain the pool (kills opencode server + warm ACP children).
await dispatcher.stop(); await dispatcher.stop();
orphanReaper.stop();
await agentPool.dispose(); await agentPool.dispose();
}); });
@@ -199,6 +221,7 @@ async function main() {
registerArenaRoutes(app, sql); registerArenaRoutes(app, sql);
registerProviderRoutes(app, sql, config); registerProviderRoutes(app, sql, config);
registerWorktreeSafetyRoutes(app, sql); registerWorktreeSafetyRoutes(app, sql);
registerLifecycleRoutes(app, sql);
registerWebSocket(app, sql, broker); registerWebSocket(app, sql, broker);
// Serve static frontend (built web app). In production, the dist/ is // Serve static frontend (built web app). In production, the dist/ is

View File

@@ -0,0 +1,122 @@
/**
* v2.6 Phase 3 (3.3) — chat/session close-or-archive cleanup hook (coder side).
*
* Chat/session close + archive + delete all live in apps/server (Docker), which
* cannot see the host worktree dirs (/tmp/booworktrees), run git on them, or reach
* the warm agent processes the dispatcher pooled in THIS (host systemd) process. So
* — exactly like the `worktree-risk` guard — the server signals the coder when a
* chat/session closes, and the coder does the real teardown:
* 1. dispose the chat's warm-ACP backends (`agentPool.closeChat`) — kills the
* goose/qwen child processes for that chat,
* 2. close the chat's opencode session on the shared server (`closeSession`),
* 3. mark every `agent_sessions` row for the chat 'closed' + (when the session's
* last open chat closes) remove the shared session worktree, preflighting
* work-at-risk so uncommitted/unmerged work is never silently dropped
* (`closeChatBackendState`).
*
* Idempotent: closing an already-closed chat is a no-op (0 rows, no backend).
*
* SERVER WIRING (not done here — apps/server, out of this batch's scope): the
* server's `POST /api/chats/:id/archive`, `DELETE /api/chats/:id`, and the
* session archive/delete routes should fire-and-forget
* fetch(`${BOOCODER_URL}/api/chats/${id}/close`, { method: 'POST' })
* after publishing their WS frame (best-effort; the orphan-worktree reaper +
* idle-pool eviction are the backstop if the call is missed).
*/
import type { FastifyInstance } from 'fastify';
import type { Sql } from '../db.js';
import { agentPool, OPENCODE_POOL_KEY } from '../services/agent-pool.js';
import { closeChatBackendState } from '../services/worktrees.js';
import type { AgentSessionHandle } from '../services/agent-backend.js';
export function registerLifecycleRoutes(app: FastifyInstance, sql: Sql): void {
// POST /api/chats/:chatId/close — tear down all warm state for a chat tab.
app.post<{ Params: { chatId: string }; Querystring: { force?: string } }>(
'/api/chats/:chatId/close',
async (req) => {
const chatId = req.params.chatId;
const force = req.query.force === 'true' || req.query.force === '1';
// 1. Close the chat's opencode session on the SHARED server (the server is
// not chat-keyed, so agentPool.closeChat won't touch it). Resolve the
// stored opencode session id and ask the backend to drop it.
const ocRows = await sql<{ agent: string; agent_session_id: string | null; worktree_id: string | null; session_id: string | null }[]>`
SELECT agent, agent_session_id, worktree_id, session_id
FROM agent_sessions
WHERE chat_id = ${chatId} AND backend = 'opencode_server'
`;
const ocBackend = agentPool.peek(OPENCODE_POOL_KEY, 'opencode');
if (ocBackend) {
for (const row of ocRows) {
if (!row.agent_session_id) continue;
const handle: AgentSessionHandle = {
sessionId: row.session_id ?? '',
agent: row.agent,
backend: 'opencode_server',
chatId,
worktreeId: row.worktree_id ?? '',
agentSessionId: row.agent_session_id,
serverPort: null,
};
await ocBackend.closeSession(handle).catch((err) => {
app.log.warn({ err: err instanceof Error ? err.message : String(err), chatId }, 'lifecycle: opencode closeSession threw');
});
}
}
// 2. Dispose any warm-ACP backends pooled under this chat (kills the
// goose/qwen child + marks its agent row closed via the backend).
const disposed = await agentPool.closeChat(chatId);
// 3. DB + worktree truth: mark agent rows closed; remove the shared session
// worktree iff this was the session's last open chat (preflight at-risk).
const result = await closeChatBackendState(sql, chatId, { force });
app.log.info({ chatId, disposed, ...result }, 'lifecycle: chat closed');
return { ok: true, disposed, ...result };
},
);
// POST /api/sessions/:sessionId/close — close every open chat in a session
// (session archive/delete). Loops the chat-close path so the same preflight +
// teardown applies per chat; the worktree is removed on the last one.
app.post<{ Params: { sessionId: string }; Querystring: { force?: string } }>(
'/api/sessions/:sessionId/close',
async (req) => {
const sessionId = req.params.sessionId;
const force = req.query.force === 'true' || req.query.force === '1';
const chats = await sql<{ id: string }[]>`
SELECT id FROM chats WHERE session_id = ${sessionId}
`;
const results: { chatId: string; disposed: string[]; worktreeRemoved: boolean; worktreeAtRisk: boolean }[] = [];
for (const c of chats) {
const ocBackend = agentPool.peek(OPENCODE_POOL_KEY, 'opencode');
if (ocBackend) {
const ocRows = await sql<{ agent: string; agent_session_id: string | null; worktree_id: string | null; session_id: string | null }[]>`
SELECT agent, agent_session_id, worktree_id, session_id
FROM agent_sessions WHERE chat_id = ${c.id} AND backend = 'opencode_server'
`;
for (const row of ocRows) {
if (!row.agent_session_id) continue;
await ocBackend.closeSession({
sessionId: row.session_id ?? '',
agent: row.agent,
backend: 'opencode_server',
chatId: c.id,
worktreeId: row.worktree_id ?? '',
agentSessionId: row.agent_session_id,
serverPort: null,
}).catch(() => {});
}
}
const disposed = await agentPool.closeChat(c.id);
const r = await closeChatBackendState(sql, c.id, { force });
results.push({ chatId: c.id, disposed, worktreeRemoved: r.worktreeRemoved, worktreeAtRisk: r.worktreeAtRisk });
}
app.log.info({ sessionId, chats: results.length }, 'lifecycle: session closed');
return { ok: true, results };
},
);
}

View File

@@ -10,6 +10,7 @@ import {
queueCreate, queueCreate,
} from '../services/pending_changes.js'; } from '../services/pending_changes.js';
import { WriteGuardError } from '../services/write_guard.js'; import { WriteGuardError } from '../services/write_guard.js';
import { rebaselineWorktreeAfterApply } from '../services/worktrees.js';
const CreateBody = z.object({ const CreateBody = z.object({
file_path: z.string().min(1), file_path: z.string().min(1),
@@ -117,6 +118,15 @@ export function registerPendingRoutes(app: FastifyInstance, sql: Sql): void {
} }
const results = await applyAll(sql, sessionId, projectRoot); const results = await applyAll(sql, sessionId, projectRoot);
// v2.6 Phase 3 (3.5): re-baseline the session worktree's diff to the applied
// state, so the next external-agent turn diffs against applied-not-original
// and doesn't re-surface the just-applied changes. Best-effort: a worktree
// session may not exist (native-only chat), and a re-baseline hiccup must not
// fail the apply the user just requested.
if (results.some((r) => r.success)) {
await rebaselineWorktreeAfterApply(sql, sessionId).catch(() => {});
}
return { results }; return { results };
}, },
); );
@@ -136,6 +146,15 @@ export function registerPendingRoutes(app: FastifyInstance, sql: Sql): void {
const result = await applyOne(sql, changeId, projectRoot); const result = await applyOne(sql, changeId, projectRoot);
if (!result.success) { if (!result.success) {
reply.code(422); reply.code(422);
} else {
// v2.6 Phase 3 (3.5): re-baseline the session worktree after a successful
// apply so the next external-agent turn diffs against applied-not-original.
// Resolve the change's session; best-effort, never fails the apply.
const sessRows = await sql<{ session_id: string }[]>`
SELECT session_id FROM pending_changes WHERE id = ${changeId}
`;
const sessionId = sessRows[0]?.session_id;
if (sessionId) await rebaselineWorktreeAfterApply(sql, sessionId).catch(() => {});
} }
return result; return result;
}, },

View File

@@ -0,0 +1,233 @@
import { describe, it, expect, vi } from 'vitest';
import { AgentPool, OPENCODE_POOL_KEY } from '../agent-pool.js';
import type {
AgentBackend,
AgentSessionHandle,
EnsureSessionOpts,
PromptCtx,
TurnResult,
} from '../agent-backend.js';
/**
* v2.6 Phase 3 — AgentPool lifecycle unit test (T.1). No DB / no child process:
* a fake AgentBackend records dispose + reports busy/health, so we exercise
* get-or-create, idle eviction, the LRU cap, the busy-never-evict rule, closeChat,
* and dispose-drains directly. The pure decisions are covered separately in
* backends/__tests__/lifecycle-decisions.test.ts; this verifies the wiring.
*/
class FakeBackend implements AgentBackend {
disposed = 0;
closedSessions = 0;
private busyFlag = false;
tickHealthCalls = 0;
constructor(public readonly name = 'fake') {}
setBusy(b: boolean): void {
this.busyFlag = b;
}
// — AgentBackend —
async ensureSession(sessionId: string, opts: EnsureSessionOpts): Promise<AgentSessionHandle> {
return {
sessionId,
agent: opts.agent,
backend: 'acp_warm',
chatId: opts.chatId,
worktreeId: opts.worktreeId,
agentSessionId: 'fake-session',
serverPort: null,
};
}
async prompt(_h: AgentSessionHandle, _input: string, _ctx: PromptCtx): Promise<TurnResult> {
return { ok: true };
}
async closeSession(): Promise<void> {
this.closedSessions++;
}
async dispose(): Promise<void> {
this.disposed++;
}
health(): 'up' | 'down' {
return 'up';
}
isBusy(): boolean {
return this.busyFlag;
}
async tickHealth(): Promise<void> {
this.tickHealthCalls++;
}
}
describe('AgentPool — get/register/touch (3.1)', () => {
it('register then get returns the same backend', () => {
const pool = new AgentPool();
const b = new FakeBackend();
pool.register('chat-1', 'goose', b);
expect(pool.get('chat-1', 'goose')).toBe(b);
expect(pool.get('chat-1', 'qwen')).toBeUndefined();
});
it('peek does NOT exist for a missing key', () => {
const pool = new AgentPool();
expect(pool.peek('nope', 'goose')).toBeUndefined();
});
it('health reports size + busy count', () => {
const pool = new AgentPool();
const a = new FakeBackend();
const b = new FakeBackend();
b.setBusy(true);
pool.register('c1', 'goose', a);
pool.register('c2', 'qwen', b);
expect(pool.health()).toEqual({ size: 2, busy: 1 });
});
});
describe('AgentPool.sweep — idle TTL eviction (3.1)', () => {
it('evicts an idle backend past the TTL and disposes it', async () => {
const pool = new AgentPool({ idleTtlMs: 1_000, maxLive: 100 });
const b = new FakeBackend();
pool.register('c1', 'goose', b);
// Sweep with now far past the registration → idle → evicted.
const { evicted } = await pool.sweep(Date.now() + 10_000);
expect(evicted).toEqual(['c1:goose']);
expect(b.disposed).toBe(1);
expect(pool.get('c1', 'goose')).toBeUndefined();
});
it('never evicts a busy backend even past the TTL', async () => {
const pool = new AgentPool({ idleTtlMs: 1_000, maxLive: 100 });
const b = new FakeBackend();
b.setBusy(true);
pool.register('c1', 'goose', b);
const { evicted } = await pool.sweep(Date.now() + 10_000);
expect(evicted).toEqual([]);
expect(b.disposed).toBe(0);
expect(pool.get('c1', 'goose')).toBe(b);
});
it('touch keeps a backend warm so the TTL measures from the last turn', async () => {
const pool = new AgentPool({ idleTtlMs: 5_000, maxLive: 100 });
const b = new FakeBackend();
pool.register('c1', 'goose', b);
const base = Date.now();
// 4s later, touch — resets activity. A sweep at +6s from base is only +2s from
// the touch → still within TTL → not evicted.
vi.spyOn(Date, 'now').mockReturnValue(base + 4_000);
pool.touch('c1', 'goose');
vi.restoreAllMocks();
const { evicted } = await pool.sweep(base + 6_000);
expect(evicted).toEqual([]);
});
});
describe('AgentPool.sweep — LRU cap (3.4)', () => {
it('evicts the least-recently-used beyond the cap', async () => {
const pool = new AgentPool({ idleTtlMs: 1_000_000, maxLive: 2 });
const base = 1_000_000;
const mk = (key: string, regAt: number) => {
vi.spyOn(Date, 'now').mockReturnValue(regAt);
const b = new FakeBackend(key);
const [chat, agent] = key.split(':');
pool.register(chat!, agent!, b);
vi.restoreAllMocks();
return b;
};
const a = mk('c1:goose', base + 100);
const b = mk('c2:goose', base + 300);
const c = mk('c3:goose', base + 200);
// 3 entries, cap 2, all within idle TTL → LRU (oldest = a@+100) evicted.
const { evicted } = await pool.sweep(base + 1_000);
expect(evicted).toEqual(['c1:goose']);
expect(a.disposed).toBe(1);
expect(b.disposed).toBe(0);
expect(c.disposed).toBe(0);
});
});
describe('AgentPool.sweep — proactive health probe (3.2)', () => {
it('drives each backend tickHealth before eviction', async () => {
const pool = new AgentPool({ idleTtlMs: 1_000_000, maxLive: 100 });
const b = new FakeBackend();
pool.register('c1', 'opencode', b);
await pool.sweep(Date.now());
expect(b.tickHealthCalls).toBe(1);
});
});
describe('AgentPool.closeChat — chat-close teardown (3.3)', () => {
it('disposes only the matching chat keys, leaving others + the shared server', async () => {
const pool = new AgentPool();
const goose = new FakeBackend('goose');
const qwen = new FakeBackend('qwen');
const other = new FakeBackend('other-chat');
const ocServer = new FakeBackend('opencode-server');
pool.register('chat-1', 'goose', goose);
pool.register('chat-1', 'qwen', qwen);
pool.register('chat-2', 'goose', other);
pool.register(OPENCODE_POOL_KEY, 'opencode', ocServer);
const removed = await pool.closeChat('chat-1');
expect(removed.sort()).toEqual(['chat-1:goose', 'chat-1:qwen']);
expect(goose.disposed).toBe(1);
expect(qwen.disposed).toBe(1);
// other chat + shared opencode server untouched.
expect(other.disposed).toBe(0);
expect(ocServer.disposed).toBe(0);
expect(pool.peek('chat-2', 'goose')).toBe(other);
expect(pool.peek(OPENCODE_POOL_KEY, 'opencode')).toBe(ocServer);
});
it('does not dispose a busy backend on closeChat', async () => {
const pool = new AgentPool();
const b = new FakeBackend();
b.setBusy(true);
pool.register('chat-1', 'goose', b);
const removed = await pool.closeChat('chat-1');
expect(removed).toEqual([]);
expect(b.disposed).toBe(0);
});
it('does not match a chat id that is a prefix of another', async () => {
// 'chat-1' must not match 'chat-10' — keys are `${chatId}:${agent}` so the
// colon delimiter prevents the prefix collision.
const pool = new AgentPool();
const a = new FakeBackend();
const b = new FakeBackend();
pool.register('chat-1', 'goose', a);
pool.register('chat-10', 'goose', b);
await pool.closeChat('chat-1');
expect(a.disposed).toBe(1);
expect(b.disposed).toBe(0);
expect(pool.peek('chat-10', 'goose')).toBe(b);
});
});
describe('AgentPool.dispose — drain all (T.1)', () => {
it('disposes every backend and clears the map', async () => {
const pool = new AgentPool();
const a = new FakeBackend();
const b = new FakeBackend();
pool.register('c1', 'goose', a);
pool.register('c2', 'qwen', b);
await pool.dispose();
expect(a.disposed).toBe(1);
expect(b.disposed).toBe(1);
expect(pool.health()).toEqual({ size: 0, busy: 0 });
});
it('tolerates a backend whose dispose throws', async () => {
const pool = new AgentPool();
const good = new FakeBackend();
const bad = new FakeBackend();
bad.dispose = async () => {
throw new Error('boom');
};
pool.register('c1', 'goose', bad);
pool.register('c2', 'qwen', good);
await expect(pool.dispose()).resolves.toBeUndefined();
expect(good.disposed).toBe(1);
});
});

View File

@@ -0,0 +1,170 @@
import { describe, it, expect, beforeAll, afterAll } from 'vitest';
import { readFileSync, existsSync } from 'node:fs';
import { rm, mkdir } from 'node:fs/promises';
import { resolve } from 'node:path';
import postgres from 'postgres';
import {
ensureSessionWorktree,
closeChatBackendState,
rebaselineWorktreeAfterApply,
} from '../worktrees.js';
import { reapOrphanWorktrees } from '../orphan-worktree-reaper.js';
import { hostExec } from '../host-exec.js';
/**
* v2.6 Phase 3 (3.6) — reconnect-after-restart integration test.
*
* Proves the DB-truth side of crash/restart recovery: a BooCoder restart wipes the
* in-memory pool, but the persistent `worktrees` + `agent_sessions` rows survive,
* so the "next turn" re-resolves the SAME worktree (reattach, no new dir) and the
* agent-session row is still there to resume from. Also exercises the chat-close
* hook (3.3), the apply re-baseline (3.5), and the orphan reaper (3.4) end-to-end
* against a real git repo + postgres.
*
* Requires DATABASE_URL (DB-opt-in; skips cleanly otherwise) AND git on PATH. Runs:
* DATABASE_URL='postgres://boocode:devpass@localhost:5500/boochat' pnpm -C apps/coder test
*/
describe.runIf(!!process.env.DATABASE_URL)('reconnect after restart (Phase 3)', () => {
let sql: ReturnType<typeof postgres>;
const stamp = Date.now();
const projectDir = `/tmp/boocode-reconnect-proj-${stamp}`;
let projectId: string;
let sessionId: string;
let chatId: string;
beforeAll(async () => {
sql = postgres(process.env.DATABASE_URL!, { max: 3 });
// Both schemas land in the one boochat DB: server owns sessions/chats/projects,
// coder owns worktrees/agent_sessions (FK targets must pre-exist → server first).
const serverSchema = resolve(__dirname, '../../../../server/src/schema.sql');
const coderSchema = resolve(__dirname, '../../schema.sql');
await sql.unsafe(readFileSync(serverSchema, 'utf8'));
await sql.unsafe(readFileSync(coderSchema, 'utf8'));
// A real git repo with one commit so worktree add / diff / rev-parse work.
await mkdir(projectDir, { recursive: true });
await hostExec(
`cd ${projectDir} && git init -q && git config user.email t@t && git config user.name t ` +
`&& echo hello > README.md && git add -A && git commit -qm init`,
{ timeoutMs: 20_000 },
);
const [project] = await sql<{ id: string }[]>`
INSERT INTO projects (name, path, status) VALUES ('reconnect-test', ${projectDir}, 'open') RETURNING id
`;
projectId = project!.id;
const [session] = await sql<{ id: string }[]>`
INSERT INTO sessions (project_id, name, model, status)
VALUES (${projectId}, 'recon', 'm', 'open') RETURNING id
`;
sessionId = session!.id;
const [chat] = await sql<{ id: string }[]>`
INSERT INTO chats (session_id, name, status) VALUES (${sessionId}, 'tab', 'open') RETURNING id
`;
chatId = chat!.id;
});
afterAll(async () => {
if (sql) {
// Best-effort worktree cleanup before dropping rows.
const rows = await sql<{ path: string }[]>`SELECT path FROM worktrees WHERE session_id = ${sessionId}`.catch(() => []);
for (const r of rows) {
await hostExec(`git -C ${projectDir} worktree remove ${r.path} --force`, { timeoutMs: 10_000 }).catch(() => {});
}
await sql`DELETE FROM agent_sessions WHERE chat_id = ${chatId}`.catch(() => {});
await sql`DELETE FROM worktrees WHERE session_id = ${sessionId}`.catch(() => {});
await sql`DELETE FROM chats WHERE id = ${chatId}`.catch(() => {});
await sql`DELETE FROM sessions WHERE id = ${sessionId}`.catch(() => {});
await sql`DELETE FROM projects WHERE id = ${projectId}`.catch(() => {});
await sql.end({ timeout: 5 });
}
await rm(projectDir, { recursive: true, force: true });
});
it('reattaches the SAME worktree across a simulated restart (no new dir)', async () => {
// "Turn 1" — first ensureSessionWorktree creates the worktree + row.
const first = await ensureSessionWorktree(sql, projectDir, sessionId);
expect(existsSync(first.worktreePath)).toBe(true);
expect(first.baseCommit).toBeTruthy();
// Simulate an agent_sessions row written by turn 1 (opencode).
await sql`
INSERT INTO agent_sessions (chat_id, session_id, worktree_id, agent, backend, agent_session_id, status, last_active_at)
VALUES (${chatId}, ${sessionId}, ${first.worktreeId}, 'opencode', 'opencode_server', 'oc-sess-1', 'active', clock_timestamp())
ON CONFLICT (chat_id, agent) DO NOTHING
`;
// "Restart" = brand-new resolution with NO in-memory state. ensureSessionWorktree
// must return the EXISTING row (same id + path), proving reattach not re-create.
const second = await ensureSessionWorktree(sql, projectDir, sessionId);
expect(second.worktreeId).toBe(first.worktreeId);
expect(second.worktreePath).toBe(first.worktreePath);
expect(second.baseCommit).toBe(first.baseCommit);
// The agent_sessions row survived the "restart" with its resume handle intact.
const [row] = await sql<{ agent_session_id: string; status: string }[]>`
SELECT agent_session_id, status FROM agent_sessions WHERE chat_id = ${chatId} AND agent = 'opencode'
`;
expect(row!.agent_session_id).toBe('oc-sess-1');
});
it('re-baselines the worktree diff after apply (3.5)', async () => {
const wt = await ensureSessionWorktree(sql, projectDir, sessionId);
const baseBefore = wt.baseCommit;
// Make a change in the worktree (as an external agent would).
await hostExec(`cd ${wt.worktreePath} && echo change >> README.md`, { timeoutMs: 10_000 });
const r = await rebaselineWorktreeAfterApply(sql, sessionId);
expect(r.rebaselined).toBe(true);
expect(r.newBaseCommit).toBeTruthy();
expect(r.newBaseCommit).not.toBe(baseBefore);
const [row] = await sql<{ base_commit: string }[]>`
SELECT base_commit FROM worktrees WHERE session_id = ${sessionId} AND status = 'active'
`;
expect(row!.base_commit).toBe(r.newBaseCommit);
// Idempotent: a second re-baseline with no new edits is a no-op.
const r2 = await rebaselineWorktreeAfterApply(sql, sessionId);
expect(r2.rebaselined).toBe(false);
});
it('chat-close hook closes agent rows + removes the worktree on the last chat (3.3)', async () => {
// Sanity: an active worktree + agent row exist from the prior tests.
const beforeWt = await sql<{ id: string }[]>`SELECT id FROM worktrees WHERE session_id = ${sessionId} AND status = 'active'`;
expect(beforeWt.length).toBe(1);
const result = await closeChatBackendState(sql, chatId);
expect(result.agentRowsClosed).toBeGreaterThanOrEqual(1);
// chatId is the session's only chat → worktree removed (it was clean after the
// re-baseline commit), not at-risk.
expect(result.worktreeAtRisk).toBe(false);
expect(result.worktreeRemoved).toBe(true);
const [agentRow] = await sql<{ status: string }[]>`
SELECT status FROM agent_sessions WHERE chat_id = ${chatId} AND agent = 'opencode'
`;
expect(agentRow!.status).toBe('closed');
const activeWt = await sql<{ id: string }[]>`SELECT id FROM worktrees WHERE session_id = ${sessionId} AND status = 'active'`;
expect(activeWt.length).toBe(0); // archived, no longer active
});
it('orphan reaper leaves a live worktree alone and reaps a row-less dir (3.4)', async () => {
// Recreate a live worktree for this session (the close test archived the old one).
const live = await ensureSessionWorktree(sql, projectDir, sessionId);
expect(existsSync(live.worktreePath)).toBe(true);
// A live worktree (active row) with grace 0 must NOT be reaped.
const r1 = await reapOrphanWorktrees(sql, console as never, 0, Date.now());
expect(r1.reaped).not.toContain(live.worktreePath);
// Now archive its row (simulating a leaked dir) and reap again — it becomes an
// orphan and is reclaimed (it's clean → not at-risk).
await sql`UPDATE worktrees SET status = 'archived' WHERE id = ${live.worktreeId}`;
const r2 = await reapOrphanWorktrees(sql, console as never, 0, Date.now());
expect(r2.reaped).toContain(live.worktreePath);
expect(existsSync(live.worktreePath)).toBe(false);
});
});

View File

@@ -99,4 +99,21 @@ export interface AgentBackend {
dispose(): Promise<void>; dispose(): Promise<void>;
/** Liveness for health endpoint + dispatcher fallback decision. §2 */ /** Liveness for health endpoint + dispatcher fallback decision. §2 */
health(): 'up' | 'down'; health(): 'up' | 'down';
/**
* v2.6 Phase 3: true iff a turn is in flight on this backend. The pool's idle
* eviction + LRU cap NEVER evict a busy backend (design §6 busy rule); the
* health-monitor defers a restart while busy (stale-grace). Optional so the
* Phase-0 scaffold and any test double stay compatible — absent ⇒ treated as
* not busy. opencode-server (multi-session) is busy iff ANY session has an
* active turn; warm-acp (single session) iff its one slot is active.
*/
isBusy?(): boolean;
/**
* v2.6 Phase 3: optional proactive health probe + busy-aware self-restart, run
* by the pool's periodic sweep. The opencode-server backend implements it
* (detects a hung-but-not-exited server and restarts when non-busy). Backends
* with no long-lived shared process (warm-ACP recovers lazily on its own child
* exit) can omit it. Must never throw — the sweep ignores rejections.
*/
tickHealth?(now?: number): Promise<void>;
} }

View File

@@ -1,44 +1,246 @@
/** /**
* v2.6 — AgentPool (Phase 0 scaffold). * v2.6 — AgentPool.
* *
* Lazy get-or-create registry of `AgentBackend` instances keyed by * Lazy get-or-create registry of `AgentBackend` instances keyed by
* `${sessionId}:${agent}`. Phase 0 ships the skeleton only: an in-memory Map, * `${primary}:${agent}` (primary = chatId for warm-ACP, a fixed sentinel for the
* lookup / register / health, and clean disposal wired to the server's onClose. * single shared opencode server). Phase 0 shipped the skeleton (Map + health +
* Spawning lands in Phase 1/2; nothing populates the map yet. * dispose). Phase 3 adds the LIFECYCLE: per-entry idle tracking, a periodic
* idle-TTL + LRU-cap sweep (the pure decisions live in
* `backends/lifecycle-decisions.ts`), and a `closeChat` helper for the chat-close
* hook. Reattach after eviction is implicit — the next turn's `ensureSession`
* rebuilds the backend from `agent_sessions` / `worktrees` (DB is the source of
* truth; the in-memory pool is a warm cache).
* *
* Spec: openspec/changes/v2-6-persistent-agent-sessions/design.md §2. * The hard rule (design §6): NEVER evict a busy backend (one with an in-flight
* turn). `selectIdleEvictionTargets` / `selectLruEvictionTargets` enforce it via
* `backend.isBusy()`; a long turn that outlives the TTL is left alone.
*
* Spec: openspec/changes/v2-6-persistent-agent-sessions/design.md §2 / §6.
*/ */
import type { FastifyBaseLogger } from 'fastify';
import type { AgentBackend } from './agent-backend.js'; import type { AgentBackend } from './agent-backend.js';
import {
selectIdleEvictionTargets,
selectLruEvictionTargets,
DEFAULT_IDLE_TTL_MS,
DEFAULT_MAX_LIVE_BACKENDS,
} from './backends/lifecycle-decisions.js';
interface PoolEntry {
primary: string;
agent: string;
backend: AgentBackend;
/** Epoch ms of the last turn boundary (register or touch). Drives idle/LRU. */
lastActiveAt: number;
}
export interface AgentPoolOpts {
/** Idle TTL before a non-busy backend is evicted. Default 30 min. */
idleTtlMs?: number;
/** Max live backends before the LRU cap evicts the least-recently-used. */
maxLive?: number;
/** Sweep cadence. Default 60s (mirrors the server's periodic sweeper). */
sweepIntervalMs?: number;
log?: FastifyBaseLogger;
}
const DEFAULT_SWEEP_INTERVAL_MS = 60_000;
export class AgentPool { export class AgentPool {
private readonly backends = new Map<string, AgentBackend>(); private readonly backends = new Map<string, PoolEntry>();
private idleTtlMs: number;
private maxLive: number;
private sweepIntervalMs: number;
private log: FastifyBaseLogger | undefined;
private sweepTimer: ReturnType<typeof setInterval> | null = null;
/** Serializes sweep runs so a slow eviction can't overlap the next tick. */
private sweeping = false;
private key(sessionId: string, agent: string): string { constructor(opts: AgentPoolOpts = {}) {
return `${sessionId}:${agent}`; this.idleTtlMs = opts.idleTtlMs ?? DEFAULT_IDLE_TTL_MS;
this.maxLive = opts.maxLive ?? DEFAULT_MAX_LIVE_BACKENDS;
this.sweepIntervalMs = opts.sweepIntervalMs ?? DEFAULT_SWEEP_INTERVAL_MS;
this.log = opts.log;
} }
/** Map lookup only. Spawning is Phase 1/2 — never creates here. */ /** Apply env-derived knobs to the module singleton at bootstrap (before
get(sessionId: string, agent: string): AgentBackend | undefined { * startReaper). Only overrides explicitly-provided fields. */
return this.backends.get(this.key(sessionId, agent)); configure(opts: AgentPoolOpts): void {
if (opts.idleTtlMs != null) this.idleTtlMs = opts.idleTtlMs;
if (opts.maxLive != null) this.maxLive = opts.maxLive;
if (opts.sweepIntervalMs != null) this.sweepIntervalMs = opts.sweepIntervalMs;
if (opts.log) this.log = opts.log;
} }
/** Store a backend instance for this (session, agent). */ private key(primary: string, agent: string): string {
register(sessionId: string, agent: string, backend: AgentBackend): void { return `${primary}:${agent}`;
this.backends.set(this.key(sessionId, agent), backend); }
/** Map lookup only. Spawning happens in the dispatcher (Phase 1/2). A hit also
* marks the entry recently-active so a resolve-without-prompt doesn't get it
* evicted out from under an imminent turn. */
get(primary: string, agent: string): AgentBackend | undefined {
const entry = this.backends.get(this.key(primary, agent));
if (entry) entry.lastActiveAt = Date.now();
return entry?.backend;
}
/** Store a backend instance for this (primary, agent). */
register(primary: string, agent: string, backend: AgentBackend): void {
this.backends.set(this.key(primary, agent), { primary, agent, backend, lastActiveAt: Date.now() });
}
/** Mark a backend recently-active (call at turn start AND settle so a long turn
* keeps its slot warm). No-op if the key isn't pooled. */
touch(primary: string, agent: string): void {
const entry = this.backends.get(this.key(primary, agent));
if (entry) entry.lastActiveAt = Date.now();
}
/** Snapshot for the decision helpers (busy is read live from the backend). */
private snapshots(): { key: string; lastActiveAt: number; busy: boolean }[] {
const out: { key: string; lastActiveAt: number; busy: boolean }[] = [];
for (const [key, e] of this.backends) {
out.push({ key, lastActiveAt: e.lastActiveAt, busy: e.backend.isBusy?.() ?? false });
}
return out;
} }
/** Summary for the health endpoint. */ /** Summary for the health endpoint. */
health(): { size: number } { health(): { size: number; busy: number } {
return { size: this.backends.size }; let busy = 0;
for (const e of this.backends.values()) if (e.backend.isBusy?.()) busy++;
return { size: this.backends.size, busy };
}
// ─── Phase 3: idle-TTL + LRU eviction sweep ──────────────────────────────────
/** Start the periodic idle + LRU sweep. Idempotent; unref'd so it never holds
* the process open on its own. */
startReaper(log?: FastifyBaseLogger): void {
if (log) this.log = log;
if (this.sweepTimer) return;
this.sweepTimer = setInterval(() => {
void this.sweep().catch((err) => {
this.log?.warn({ err: errMsg(err) }, 'agent-pool: sweep error');
});
}, this.sweepIntervalMs);
this.sweepTimer.unref?.();
}
stopReaper(): void {
if (this.sweepTimer) {
clearInterval(this.sweepTimer);
this.sweepTimer = null;
}
}
/**
* One sweep pass: evict idle-past-TTL backends, then enforce the LRU cap.
* Deduped (a key can't appear in both lists for one pass). Busy backends are
* excluded by the decision helpers — a live turn is never torn down.
*/
async sweep(now: number = Date.now()): Promise<{ evicted: string[] }> {
if (this.sweeping) return { evicted: [] };
this.sweeping = true;
try {
// Phase 3: drive each backend's optional proactive health probe first (the
// opencode server's busy-aware hung-detect + self-restart). Best-effort —
// a probe must never fail the sweep.
for (const e of this.backends.values()) {
if (e.backend.tickHealth) {
await e.backend.tickHealth(now).catch((err) => {
this.log?.warn({ key: this.key(e.primary, e.agent), err: errMsg(err) }, 'agent-pool: tickHealth threw');
});
}
}
const snaps = this.snapshots();
const idle = selectIdleEvictionTargets(snaps, now, this.idleTtlMs);
// LRU runs on what remains after idle eviction, so the two never double-evict.
const idleSet = new Set(idle);
const remaining = snaps.filter((s) => !idleSet.has(s.key));
const lru = selectLruEvictionTargets(remaining, this.maxLive);
const targets = [...idle, ...lru];
if (targets.length === 0) return { evicted: [] };
const evicted: string[] = [];
for (const key of targets) {
const entry = this.backends.get(key);
if (!entry) continue;
// Re-check busy right before teardown — a turn may have started since the
// snapshot. Defensive; the decision already excluded busy at snapshot time.
if (entry.backend.isBusy?.()) continue;
this.backends.delete(key);
try {
await entry.backend.dispose();
} catch (err) {
this.log?.warn({ key, err: errMsg(err) }, 'agent-pool: backend dispose threw during eviction');
}
evicted.push(key);
}
if (evicted.length > 0) {
this.log?.info({ evicted, size: this.backends.size }, 'agent-pool: evicted idle/over-cap backends');
}
return { evicted };
} finally {
this.sweeping = false;
}
}
// ─── Phase 3: chat-close cleanup (3.3) ───────────────────────────────────────
/**
* Tear down every pooled backend whose key is for this chat. Used by the
* chat-close hook. The opencode server is shared (keyed on a sentinel, not the
* chat), so it is NOT disposed here — only its session is closed via
* `closeSession`, which the hook calls directly with the per-(chat,agent)
* handle. Returns the keys it removed. Skips busy entries (a close mid-turn is
* rare but must not kill a live stream — the idle sweep reaps it shortly after).
*/
async closeChat(chatId: string): Promise<string[]> {
const removed: string[] = [];
const prefix = `${chatId}:`;
for (const [key, entry] of [...this.backends]) {
if (!key.startsWith(prefix)) continue;
if (entry.backend.isBusy?.()) continue;
this.backends.delete(key);
try {
await entry.backend.dispose();
} catch (err) {
this.log?.warn({ key, err: errMsg(err) }, 'agent-pool: dispose threw during closeChat');
}
removed.push(key);
}
return removed;
}
/** Look up a backend by exact key without bumping its activity (for closeSession). */
peek(primary: string, agent: string): AgentBackend | undefined {
return this.backends.get(this.key(primary, agent))?.backend;
} }
/** Dispose every backend and clear the map. Tolerates throwing backends. */ /** Dispose every backend and clear the map. Tolerates throwing backends. */
async dispose(): Promise<void> { async dispose(): Promise<void> {
this.stopReaper();
const entries = [...this.backends.values()]; const entries = [...this.backends.values()];
this.backends.clear(); this.backends.clear();
await Promise.allSettled(entries.map((b) => b.dispose())); await Promise.allSettled(entries.map((e) => e.backend.dispose()));
} }
} }
/** Single shared instance — referenced only by the server's onClose hook in Phase 0. */ function errMsg(e: unknown): string {
return e instanceof Error ? e.message : String(e);
}
/**
* The shared opencode server is pooled under a FIXED sentinel (one server per
* BooCoder process, multiplexing all opencode sessions internally) rather than a
* chat id — so it is NOT torn down by `closeChat(chatId)` (only its per-chat
* session is closed). Exported so the dispatcher + the lifecycle close-hook agree
* on the key without drift.
*/
export const OPENCODE_POOL_KEY = '__opencode_server__';
/** Single shared instance — registered by the dispatcher, swept + drained by the
* server's onClose hook. */
export const agentPool = new AgentPool(); export const agentPool = new AgentPool();

View File

@@ -0,0 +1,176 @@
import { describe, it, expect } from 'vitest';
import {
selectIdleEvictionTargets,
selectLruEvictionTargets,
decideRestart,
selectOrphanWorktreeTargets,
DEFAULT_IDLE_TTL_MS,
DEFAULT_MAX_LIVE_BACKENDS,
type PoolEntrySnapshot,
} from '../lifecycle-decisions.js';
/**
* v2.6 Phase 3 — pure lifecycle decisions. No DB, no children, no timers; `now`
* is injected. Models prune.ts:selectPruneTargets — the caller acts on the keys.
*/
const NOW = 1_000_000_000_000;
function entry(key: string, ageMs: number, busy = false): PoolEntrySnapshot {
return { key, lastActiveAt: NOW - ageMs, busy };
}
describe('selectIdleEvictionTargets (3.1)', () => {
it('evicts entries idle past the TTL', () => {
const entries = [
entry('a:opencode', DEFAULT_IDLE_TTL_MS + 1),
entry('b:goose', DEFAULT_IDLE_TTL_MS - 1),
];
expect(selectIdleEvictionTargets(entries, NOW)).toEqual(['a:opencode']);
});
it('never evicts a busy entry even when idle past the TTL', () => {
const entries = [entry('a:opencode', DEFAULT_IDLE_TTL_MS * 10, /* busy */ true)];
expect(selectIdleEvictionTargets(entries, NOW)).toEqual([]);
});
it('respects a custom TTL', () => {
const entries = [entry('a:goose', 5_000), entry('b:qwen', 500)];
expect(selectIdleEvictionTargets(entries, NOW, 1_000)).toEqual(['a:goose']);
});
it('treats exactly-at-TTL as evictable (>=)', () => {
expect(selectIdleEvictionTargets([entry('a:x', 1_000)], NOW, 1_000)).toEqual(['a:x']);
});
it('returns empty for an empty pool', () => {
expect(selectIdleEvictionTargets([], NOW)).toEqual([]);
});
});
describe('selectLruEvictionTargets (3.4)', () => {
it('returns nothing when at or under the cap', () => {
const entries = [entry('a:x', 10), entry('b:y', 20)];
expect(selectLruEvictionTargets(entries, 2)).toEqual([]);
expect(selectLruEvictionTargets(entries, 5)).toEqual([]);
});
it('evicts the least-recently-used beyond the cap', () => {
// oldest first: c (300ms ago) is LRU, then a (100ms), then b (10ms).
const entries = [entry('a:x', 100), entry('b:y', 10), entry('c:z', 300)];
expect(selectLruEvictionTargets(entries, 2)).toEqual(['c:z']);
});
it('evicts multiple LRU entries to reach the cap', () => {
const entries = [
entry('a:x', 100),
entry('b:y', 10),
entry('c:z', 300),
entry('d:w', 200),
];
// cap 1: must remove 3, oldest-first c(300), d(200), a(100).
expect(selectLruEvictionTargets(entries, 1)).toEqual(['c:z', 'd:w', 'a:x']);
});
it('never evicts a busy entry even if it is the LRU', () => {
// c is LRU but busy → it cannot be evicted; fall to the next-oldest (a).
const entries = [entry('a:x', 100), entry('b:y', 10), entry('c:z', 300, true)];
expect(selectLruEvictionTargets(entries, 2)).toEqual(['a:x']);
});
it('can transiently exceed the cap when too many are busy', () => {
// cap 1, but both old entries busy → only the single idle one is evictable.
const entries = [entry('a:x', 100, true), entry('c:z', 300, true), entry('b:y', 10)];
expect(selectLruEvictionTargets(entries, 1)).toEqual(['b:y']);
});
it('uses the default cap when omitted', () => {
const entries = Array.from({ length: DEFAULT_MAX_LIVE_BACKENDS + 1 }, (_, i) =>
entry(`k${String(i).padStart(2, '0')}:a`, (i + 1) * 1000),
);
const evicted = selectLruEvictionTargets(entries);
// exactly one over the default cap → evict the single LRU (largest age).
expect(evicted).toHaveLength(1);
expect(evicted[0]).toBe(`k${String(DEFAULT_MAX_LIVE_BACKENDS).padStart(2, '0')}:a`);
});
});
describe('decideRestart (3.2, busy-aware)', () => {
const base = {
consecutiveFailures: 0,
busy: false,
unhealthyBusySince: 0,
now: NOW,
failureThreshold: 3,
staleBusyGraceMs: 120_000,
};
it('does nothing when healthy', () => {
expect(decideRestart({ ...base, processExited: false, healthy: true }))
.toEqual({ action: 'none', reason: 'healthy' });
});
it('restarts immediately when the process exited', () => {
expect(decideRestart({ ...base, processExited: true, busy: true }))
.toEqual({ action: 'restart', reason: 'process-exited' });
});
it('waits below the failure threshold', () => {
expect(decideRestart({ ...base, processExited: false, consecutiveFailures: 2 }))
.toEqual({ action: 'wait', reason: 'below-threshold' });
});
it('restarts at the threshold when idle', () => {
expect(decideRestart({ ...base, processExited: false, consecutiveFailures: 3 }))
.toEqual({ action: 'restart', reason: 'threshold' });
});
it('defers a restart while busy within the grace window', () => {
expect(decideRestart({
...base, processExited: false, consecutiveFailures: 5, busy: true,
unhealthyBusySince: NOW - 1_000,
})).toEqual({ action: 'wait', reason: 'busy-grace' });
});
it('force-restarts a busy backend after the stale-busy grace', () => {
expect(decideRestart({
...base, processExited: false, consecutiveFailures: 5, busy: true,
unhealthyBusySince: NOW - 120_001,
})).toEqual({ action: 'restart', reason: 'stale-busy-grace' });
});
it('waits (busy-grace) when busy + threshold but the window just started', () => {
// unhealthyBusySince === 0 means the caller is about to stamp it this cycle.
expect(decideRestart({
...base, processExited: false, consecutiveFailures: 5, busy: true,
unhealthyBusySince: 0,
})).toEqual({ action: 'wait', reason: 'busy-grace' });
});
});
describe('selectOrphanWorktreeTargets (3.4)', () => {
it('skips dirs tracked by a live worktrees row', () => {
const onDisk = [{ path: '/wt/sess-a', mtimeMs: NOW - 10_000_000 }];
expect(selectOrphanWorktreeTargets(onDisk, new Set(['/wt/sess-a']), NOW, 1000)).toEqual([]);
});
it('reaps an untracked dir older than the grace', () => {
const onDisk = [{ path: '/wt/sess-orphan', mtimeMs: NOW - 5000 }];
expect(selectOrphanWorktreeTargets(onDisk, new Set(), NOW, 1000)).toEqual(['/wt/sess-orphan']);
});
it('never reaps a dir younger than the grace (mid-create race)', () => {
const onDisk = [{ path: '/wt/sess-fresh', mtimeMs: NOW - 500 }];
expect(selectOrphanWorktreeTargets(onDisk, new Set(), NOW, 1000)).toEqual([]);
});
it('mixes tracked, fresh, and orphaned correctly', () => {
const onDisk = [
{ path: '/wt/sess-live', mtimeMs: NOW - 10_000 },
{ path: '/wt/sess-fresh', mtimeMs: NOW - 100 },
{ path: '/wt/sess-orphan', mtimeMs: NOW - 10_000 },
];
expect(selectOrphanWorktreeTargets(onDisk, new Set(['/wt/sess-live']), NOW, 1000))
.toEqual(['/wt/sess-orphan']);
});
});

View File

@@ -0,0 +1,197 @@
/**
* v2.6 Phase 3 — pure lifecycle decision helpers.
*
* The eviction / LRU-cap / busy-aware-restart / reaper-target logic, factored out
* of AgentPool + the backends + the periodic sweeper so it's unit-testable with no
* DB, no child processes, no timers (modeled on
* apps/server/src/services/inference/prune.ts:selectPruneTargets — a pure decision
* core the caller acts on).
*
* Three decisions live here:
* 1. selectIdleEvictionTargets — which warm backends to evict for being idle.
* 2. selectLruEvictionTargets — which warm backends to evict to honour a max-live
* cap (least-recently-used beyond the cap), NEVER a busy one.
* 3. shouldRestartCrashedBackend (busy-aware) — openchamber's skip-while-busy +
* stale-grace state machine, re-implemented for BooCode's per-(chat,agent) pool.
*
* "Busy" = the backend has an in-flight turn. The hard rule (design §6, decisions):
* never evict or force-restart a busy backend; defer with a stale-grace.
*/
// ─── Idle TTL eviction (3.1) ─────────────────────────────────────────────────
/** Default idle TTL before a warm backend/session is evicted (design §6 ~30 min). */
export const DEFAULT_IDLE_TTL_MS = 30 * 60 * 1000;
/** A pool entry as the decision helpers see it (no backend internals). */
export interface PoolEntrySnapshot {
/** Pool key `${primary}:${agent}` — opaque to the decision, used for selection. */
key: string;
/** Epoch ms of the last turn activity (start or settle) on this backend. */
lastActiveAt: number;
/** True iff a turn is in flight right now. Busy entries are never evicted. */
busy: boolean;
}
/**
* Idle eviction: an entry is evictable when it has been idle (no turn) for longer
* than `ttlMs` AND is not currently busy. Returns the keys to evict.
*
* Pure: `now` is injected so tests don't depend on wall-clock. Busy entries are
* categorically excluded — a long-running turn that exceeds the TTL must NOT be
* torn down mid-stream (the §6 / openchamber busy rule).
*/
export function selectIdleEvictionTargets(
entries: ReadonlyArray<PoolEntrySnapshot>,
now: number,
ttlMs: number = DEFAULT_IDLE_TTL_MS,
): string[] {
const out: string[] = [];
for (const e of entries) {
if (e.busy) continue;
if (now - e.lastActiveAt >= ttlMs) out.push(e.key);
}
return out;
}
// ─── LRU cap (3.4) ───────────────────────────────────────────────────────────
/** Default max live warm backends/worktrees before the LRU cap evicts (env-overridable). */
export const DEFAULT_MAX_LIVE_BACKENDS = 10;
/**
* LRU cap: when more than `cap` non-busy entries are live, evict the
* least-recently-used ones (oldest `lastActiveAt` first) until at most `cap`
* remain. Busy entries are never evicted AND are not counted toward the cap's
* "kept" budget being freed — i.e. we only ever evict idle entries, so a burst of
* concurrent busy turns can transiently exceed the cap rather than kill live work.
*
* Returns the keys to evict, least-recently-used first. Pure / deterministic:
* ties broken by key for stable test output.
*/
export function selectLruEvictionTargets(
entries: ReadonlyArray<PoolEntrySnapshot>,
cap: number = DEFAULT_MAX_LIVE_BACKENDS,
): string[] {
if (cap < 0) cap = 0;
if (entries.length <= cap) return [];
// Only idle entries are eligible to be evicted.
const evictable = entries
.filter((e) => !e.busy)
.sort((a, b) => a.lastActiveAt - b.lastActiveAt || (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
// We must shrink total live count down to `cap`. Busy entries can't be evicted,
// so the number we CAN remove is bounded by the evictable pool; evict the oldest
// (total - cap) of them, never more than exist.
const overBy = entries.length - cap;
const toEvict = evictable.slice(0, Math.max(0, overBy));
return toEvict.map((e) => e.key);
}
// ─── Busy-aware crash restart (3.2) — openchamber lift ───────────────────────
/**
* Default grace after which a backend that has stayed unhealthy WHILE busy is
* force-restarted anyway (openchamber's STALE_BUSY_GRACE_MS = 2 min). Guards
* against a permanently-stuck "busy" turn wedging recovery forever.
*/
export const DEFAULT_STALE_BUSY_GRACE_MS = 2 * 60 * 1000;
/** Default consecutive health-check failures before a restart is attempted. */
export const DEFAULT_HEALTH_FAILURE_THRESHOLD = 3;
export interface RestartDecisionInput {
/** True iff the process is actually dead (exited). A dead process restarts
* immediately regardless of busy/threshold — there's nothing to protect. */
processExited: boolean;
/** Consecutive failed health probes so far (including the current one). */
consecutiveFailures: number;
/** Whether the backend currently has an in-flight turn. */
busy: boolean;
/** Epoch ms when the unhealthy-while-busy window started, or 0 if not in one. */
unhealthyBusySince: number;
/** Injected clock. */
now: number;
failureThreshold?: number;
staleBusyGraceMs?: number;
}
export type RestartDecision =
| { action: 'restart'; reason: 'process-exited' | 'threshold' | 'stale-busy-grace' }
| { action: 'wait'; reason: 'below-threshold' | 'busy-grace' }
| { action: 'none'; reason: 'healthy' };
/**
* Decide whether to restart a backend after a health probe. Mirrors
* openchamber's `runHealthCheckCycle` + `shouldSkipRestartForBusySessions`,
* re-implemented as a pure function over injected state (the caller owns the
* mutable counters + the actual restart side-effect).
*
* Order (matches openchamber):
* - process exited → restart now (nothing live to protect).
* - below failure threshold → wait (transient blip; the next probe re-checks).
* - threshold reached + idle → restart now.
* - threshold reached + busy → skip UNLESS the unhealthy-busy window exceeded
* the stale grace, then force restart.
*
* `healthy: true` callers don't reach here; included for completeness so the
* caller can pass through and reset counters on a single code path.
*/
export function decideRestart(input: RestartDecisionInput & { healthy?: boolean }): RestartDecision {
if (input.healthy) return { action: 'none', reason: 'healthy' };
if (input.processExited) return { action: 'restart', reason: 'process-exited' };
const threshold = input.failureThreshold ?? DEFAULT_HEALTH_FAILURE_THRESHOLD;
if (input.consecutiveFailures < threshold) {
return { action: 'wait', reason: 'below-threshold' };
}
if (!input.busy) {
return { action: 'restart', reason: 'threshold' };
}
// Busy + unhealthy at/over threshold: defer, but not forever.
const grace = input.staleBusyGraceMs ?? DEFAULT_STALE_BUSY_GRACE_MS;
if (input.unhealthyBusySince > 0 && input.now - input.unhealthyBusySince >= grace) {
return { action: 'restart', reason: 'stale-busy-grace' };
}
return { action: 'wait', reason: 'busy-grace' };
}
// ─── Orphan worktree reaper target selection (3.4) ───────────────────────────
/** Default TTL: an on-disk worktree dir with no live `worktrees` row is reaped
* only after it's been orphaned at least this long (mtime-based grace so a
* just-created dir mid-`ensureSessionWorktree` race is never swept). */
export const DEFAULT_ORPHAN_WORKTREE_GRACE_MS = 60 * 60 * 1000; // 1h
export interface OnDiskWorktree {
/** Absolute path of the worktree dir on disk. */
path: string;
/** Last-modified epoch ms of the dir (newest of dir + contents, caller's choice). */
mtimeMs: number;
}
/**
* Reaper target selection: which on-disk worktree dirs are orphans safe to
* inspect-and-reap. An orphan is a dir under the worktree base that has NO live
* `worktrees` row (path not in `liveWorktreePaths`) AND whose mtime is older than
* the grace window (so an in-flight create isn't swept).
*
* Pure — the caller (the sweeper) then runs the at-risk preflight (dirty/unpushed)
* on each returned path and only physically removes the SAFE ones. This helper
* never decides to remove work-at-risk; it only narrows the candidate set.
*/
export function selectOrphanWorktreeTargets(
onDisk: ReadonlyArray<OnDiskWorktree>,
liveWorktreePaths: ReadonlySet<string>,
now: number,
graceMs: number = DEFAULT_ORPHAN_WORKTREE_GRACE_MS,
): string[] {
const out: string[] = [];
for (const w of onDisk) {
if (liveWorktreePaths.has(w.path)) continue; // tracked → not an orphan
if (now - w.mtimeMs < graceMs) continue; // too fresh → could be mid-create
out.push(w.path);
}
return out;
}

View File

@@ -21,9 +21,9 @@
* - promptAsync is fire-and-forget (204); the turn completes via a * - promptAsync is fire-and-forget (204); the turn completes via a
* 'session.idle' event for that opencode session id. * 'session.idle' event for that opencode session id.
*/ */
import { spawn, type ChildProcess } from 'node:child_process'; import { spawn, spawnSync, type ChildProcess } from 'node:child_process';
import { createHash } from 'node:crypto'; import { createHash } from 'node:crypto';
import { createServer } from 'node:net'; import { createServer, connect as netConnect } from 'node:net';
import type { FastifyBaseLogger } from 'fastify'; import type { FastifyBaseLogger } from 'fastify';
import { import {
createOpencodeClient, createOpencodeClient,
@@ -39,6 +39,7 @@ import type { Sql } from '../../db.js';
import type { AcpToolSnapshot } from '../acp-tool-snapshot.js'; import type { AcpToolSnapshot } from '../acp-tool-snapshot.js';
import { armAbortGuard, noteTurnActivity, consumeTerminal } from './turn-guard.js'; import { armAbortGuard, noteTurnActivity, consumeTerminal } from './turn-guard.js';
import { stepEndedToUsage, type StepUsage } from './opencode-usage.js'; import { stepEndedToUsage, type StepUsage } from './opencode-usage.js';
import { decideRestart, DEFAULT_HEALTH_FAILURE_THRESHOLD } from './lifecycle-decisions.js';
import type { import type {
AgentBackend, AgentBackend,
AgentEvent, AgentEvent,
@@ -104,6 +105,11 @@ export class OpenCodeServerBackend implements AgentBackend {
private port: number | null = null; private port: number | null = null;
private up = false; private up = false;
private serverStarting: Promise<void> | null = null; private serverStarting: Promise<void> | null = null;
// Phase 3 busy-aware health monitor (openchamber lift): consecutive failed
// probes + the start of an unhealthy-while-busy window feed `decideRestart`.
private consecutiveHealthFailures = 0;
private unhealthyBusySince = 0;
private restarting: Promise<void> | null = null;
/** opencode session id → demux state. Maintained by ensureSession; read by the SSE loop. */ /** opencode session id → demux state. Maintained by ensureSession; read by the SSE loop. */
private readonly byOpencodeId = new Map<string, SessionState>(); private readonly byOpencodeId = new Map<string, SessionState>();
@@ -119,11 +125,30 @@ export class OpenCodeServerBackend implements AgentBackend {
return this.up ? 'up' : 'down'; return this.up ? 'up' : 'down';
} }
// ─── Server lifecycle (1.2: spawn once + client + ready) ───────────────────── /** Phase 3: busy iff ANY pooled opencode session has an in-flight turn. The
* pool reads this to skip idle/LRU eviction and the health-monitor to defer a
* restart (never tear down a session mid-stream). */
isBusy(): boolean {
for (const st of this.byOpencodeId.values()) {
if (st.activeTurn) return true;
}
return false;
}
/** Lazy: start the single server on first use. Idempotent — one server per backend. */ // ─── Server lifecycle (1.2: spawn once + client + ready; Phase 3 crash-restart) ──
/**
* Lazy: start the single server on first use; re-spawn after a crash. Idempotent
* within one live server — `serverStarting` caches the in-flight start, and is
* reset to null by the crash handler so the NEXT ensureServer re-spawns a fresh
* server (Phase 3 crash recovery). A dead-but-not-yet-reaped child (exit handler
* raced) is also treated as needing a restart.
*/
private ensureServer(): Promise<void> { private ensureServer(): Promise<void> {
if (!this.serverStarting) this.serverStarting = this.startServer(); const childDead = this.child != null && (this.child.exitCode !== null || this.child.signalCode !== null);
if (!this.serverStarting || (!this.up && childDead)) {
this.serverStarting = this.startServer();
}
return this.serverStarting; return this.serverStarting;
} }
@@ -143,11 +168,15 @@ export class OpenCodeServerBackend implements AgentBackend {
this.port = port; this.port = port;
// Child lifetime is the backend's (the pool's), NOT a request's. We never tie // Child lifetime is the backend's (the pool's), NOT a request's. We never tie
// it to a per-turn abort signal. On unexpected exit we mark down + log; crash // it to a per-turn abort signal. Phase 3: on unexpected exit we recover —
// recovery is Phase 3. // settle any in-flight turns as failed, mark their agent_sessions rows crashed,
// and reset `serverStarting` so the next ensureServer re-spawns. opencode keeps
// sessions on disk, but a fresh server's in-memory state is gone, so the next
// turn's ensureSession (rows now 'crashed') creates fresh opencode sessions.
child.on('exit', (code, signal) => { child.on('exit', (code, signal) => {
this.up = false; // Only react to THIS child's exit (a restart may have swapped in a new one).
this.log.warn({ code, signal, port }, 'opencode-server: child exited (recovery is Phase 3)'); if (this.child !== child) return;
this.handleServerCrash(code, signal, port);
}); });
await waitForReady(child, READY_TIMEOUT_MS); await waitForReady(child, READY_TIMEOUT_MS);
@@ -157,6 +186,136 @@ export class OpenCodeServerBackend implements AgentBackend {
this.log.info({ port }, 'opencode-server: ready'); this.log.info({ port }, 'opencode-server: ready');
} }
/**
* Crash handler (Phase 3, lift of openchamber's restart-on-exit path). The
* server died with N live opencode sessions; we can't restart it here (the next
* turn does, lazily — avoids a restart storm if the binary is broken). We:
* 1. fail every in-flight turn so its dispatcher unblocks + publishes an error,
* 2. mark each session's agent_sessions row 'crashed' so ensureSession won't
* resume a now-dead native session id (it creates fresh),
* 3. tear down the SSE loops + demux state (stale against the dead server),
* 4. reclaim the port + reset state so the next ensureServer re-spawns.
*/
private handleServerCrash(code: number | null, signal: NodeJS.Signals | null, port: number): void {
this.up = false;
const states = [...this.byOpencodeId.values()];
this.log.warn(
{ code, signal, port, liveSessions: states.length },
'opencode-server: child exited — recovering (fail in-flight, mark crashed, re-spawn next turn)',
);
const crashedIds: string[] = [];
for (const st of states) {
st.sseAbort?.abort();
if (st.activeTurn) {
st.activeTurn.settle({ ok: false, error: 'opencode server crashed mid-turn' });
st.activeTurn = null;
}
if (st.watchdog) {
clearTimeout(st.watchdog);
st.watchdog = null;
}
crashedIds.push(st.agentSessionId);
}
// Drop the demux map: every session id is stale against a fresh server.
this.byOpencodeId.clear();
this.client = null;
this.serverStarting = null; // force a re-spawn on the next ensureServer
if (crashedIds.length > 0) {
this.sql`
UPDATE agent_sessions SET status = 'crashed'
WHERE agent_session_id = ANY(${crashedIds}) AND status <> 'closed'
`.catch((err) => {
this.log.warn({ err: errMsg(err) }, 'opencode-server: failed to mark crashed sessions (non-fatal)');
});
}
// Reclaim the port so a re-spawn on a fixed/leaked port isn't blocked. Best
// effort; the next start uses a fresh ephemeral port anyway.
reclaimPort(port);
}
/**
* Phase 3 proactive health monitor (openchamber `runHealthCheckCycle` lift,
* busy-aware). Probes the server's /global/health; on a sustained failure of a
* NON-busy server, force a restart so the next turn isn't blocked by a wedged
* (hung-but-not-exited) process. Busy servers are deferred via the stale-grace in
* `decideRestart` — never tear down live work. Driven by the pool's periodic
* sweep (best-effort; a crash-exit is already handled by `handleServerCrash` +
* lazy `ensureServer` re-spawn, so this only catches the hung case). No-op when
* the server was never started or a restart is already in flight.
*/
async tickHealth(now: number = Date.now()): Promise<void> {
if (!this.child || this.restarting) return;
const childExited = this.child.exitCode !== null || this.child.signalCode !== null;
// An exited child is recovered lazily by ensureServer; don't double-restart it.
if (childExited) return;
const healthy = await this.probeHealth();
if (healthy) {
this.consecutiveHealthFailures = 0;
this.unhealthyBusySince = 0;
return;
}
this.consecutiveHealthFailures += 1;
const busy = this.isBusy();
const decision = decideRestart({
processExited: false,
consecutiveFailures: this.consecutiveHealthFailures,
busy,
unhealthyBusySince: this.unhealthyBusySince,
now,
failureThreshold: DEFAULT_HEALTH_FAILURE_THRESHOLD,
});
// Stamp the start of an unhealthy-while-busy window so the stale-grace can fire.
if (busy && this.unhealthyBusySince === 0) this.unhealthyBusySince = now;
if (decision.action === 'restart') {
this.log.warn(
{ failures: this.consecutiveHealthFailures, busy, reason: decision.reason },
'opencode-server: health monitor forcing restart',
);
this.consecutiveHealthFailures = 0;
this.unhealthyBusySince = 0;
await this.restartServer();
}
}
private async probeHealth(): Promise<boolean> {
if (!this.client) return false;
try {
const res = await this.client.global.health();
return !res.error;
} catch {
return false;
}
}
/** Force-kill the current server + reclaim its port; the next ensureServer
* re-spawns (lazy). Mirrors handleServerCrash's state reset but is initiated by
* the health monitor rather than the OS. */
private async restartServer(): Promise<void> {
if (this.restarting) return this.restarting;
this.restarting = (async () => {
const child = this.child;
const port = this.port;
this.up = false;
// Fail in-flight turns + mark sessions crashed via the same path as a crash.
if (child) {
this.handleServerCrash(null, null, port ?? 0);
if (!child.killed) child.kill('SIGTERM');
}
if (port) {
reclaimPort(port);
await waitForPortRelease(port, 3_000);
}
this.child = null;
})().finally(() => {
this.restarting = null;
});
return this.restarting;
}
// ─── SSE read loop + demux + translate (1.3) + dedup (1.4) ─────────────────── // ─── SSE read loop + demux + translate (1.3) + dedup (1.4) ───────────────────
/** Per-session SSE subscription, scoped to the session's worktree directory. /** Per-session SSE subscription, scoped to the session's worktree directory.
@@ -756,6 +915,67 @@ function mapToolStatus(s: ToolState['status'] | undefined): ToolCallStatus | nul
} }
} }
/**
* Reclaim a loopback port a dead opencode child may still hold (lift of
* openchamber `killProcessOnPort`). Best-effort, POSIX-only (`lsof`/`kill`); a
* failure is harmless because the next spawn allocates a fresh ephemeral port.
* Never kills this process. Synchronous + short-timeout so the crash handler
* doesn't block.
*/
function reclaimPort(port: number | null): void {
if (!port || process.platform === 'win32') return;
try {
const res = spawnSync('lsof', ['-ti', `:${port}`], { encoding: 'utf8', timeout: 3_000, windowsHide: true });
const out = res.stdout || '';
const myPid = process.pid;
for (const pidStr of out.split(/\s+/)) {
const pid = parseInt(pidStr.trim(), 10);
if (pid && pid !== myPid) {
try {
spawnSync('kill', ['-9', String(pid)], { stdio: 'ignore', timeout: 2_000 });
} catch {
// ignore — best effort
}
}
}
} catch {
// lsof absent or failed — the fresh-ephemeral-port spawn doesn't need this.
}
}
/**
* Resolve true once nothing is listening on `port` (lift of openchamber
* `waitForPortRelease`). Used before re-spawning on a fixed port; with ephemeral
* ports it's a fast no-op. Probes 127.0.0.1; resolves false at the deadline.
*/
function waitForPortRelease(port: number, timeoutMs: number): Promise<boolean> {
const deadline = Date.now() + timeoutMs;
return new Promise((resolve) => {
const attempt = () => {
const socket = netConnect({ port, host: '127.0.0.1' });
let settled = false;
const finish = (released: boolean) => {
if (settled) return;
settled = true;
socket.removeAllListeners();
socket.destroy();
if (released || Date.now() >= deadline) {
resolve(released);
return;
}
setTimeout(attempt, 150);
};
socket.once('connect', () => finish(false));
socket.once('error', (err: NodeJS.ErrnoException) => {
if (err && (err.code === 'ECONNREFUSED' || err.code === 'EHOSTUNREACH')) finish(true);
else finish(false);
});
socket.setTimeout(500, () => finish(true));
};
attempt();
});
}
/** Bind-probe an ephemeral port on loopback. */ /** Bind-probe an ephemeral port on loopback. */
function freePort(): Promise<number> { function freePort(): Promise<number> {
return new Promise((resolve, reject) => { return new Promise((resolve, reject) => {

View File

@@ -132,6 +132,12 @@ export class WarmAcpBackend implements AgentBackend {
return this.up ? 'up' : 'down'; return this.up ? 'up' : 'down';
} }
/** Phase 3: busy iff this backend's single session has an in-flight turn. The
* pool reads this to skip idle/LRU eviction (never kill the child mid-prompt). */
isBusy(): boolean {
return this.activeTurn != null;
}
// ─── warm-process lifecycle (2.1 spawn + initialize + session/new ONCE) ─────── // ─── warm-process lifecycle (2.1 spawn + initialize + session/new ONCE) ───────
/** Lazy: spawn the warm process on first use. Idempotent — one process per backend. */ /** Lazy: spawn the warm process on first use. Idempotent — one process per backend. */

View File

@@ -12,7 +12,7 @@ import { clearTaskCommands, setTaskCommands } from './agent-commands-cache.js';
import { getManifestCommands } from './provider-commands.js'; import { getManifestCommands } from './provider-commands.js';
import { persistExternalAgentTurn } from './agent-turn-persist.js'; import { persistExternalAgentTurn } from './agent-turn-persist.js';
import { snapshotToWireToolCall, type AcpToolSnapshot } from './acp-tool-snapshot.js'; import { snapshotToWireToolCall, type AcpToolSnapshot } from './acp-tool-snapshot.js';
import { agentPool } from './agent-pool.js'; import { agentPool, OPENCODE_POOL_KEY } from './agent-pool.js';
import { OpenCodeServerBackend } from './backends/opencode-server.js'; import { OpenCodeServerBackend } from './backends/opencode-server.js';
import { WarmAcpBackend } from './backends/warm-acp.js'; import { WarmAcpBackend } from './backends/warm-acp.js';
import { shouldUseWarmBackend } from './backends/warm-acp-routing.js'; import { shouldUseWarmBackend } from './backends/warm-acp-routing.js';
@@ -499,9 +499,8 @@ export function createDispatcher(deps: Deps): { start(): void; stop(): Promise<v
// OpenCode runs ONE server per BooCoder process, shared across all sessions // OpenCode runs ONE server per BooCoder process, shared across all sessions
// (the backend multiplexes sessions internally), so it's pooled under a fixed // (the backend multiplexes sessions internally), so it's pooled under a fixed
// key rather than per-session. Warm ACP backends (Phase 2) will be per-session. // key (OPENCODE_POOL_KEY, shared with the lifecycle close-hook) rather than
const OPENCODE_POOL_KEY = '__opencode_server__'; // per-session. Warm ACP backends (Phase 2) are per (chat, agent).
function getOpenCodeBackend(installPath: string | null): AgentBackend { function getOpenCodeBackend(installPath: string | null): AgentBackend {
let backend = agentPool.get(OPENCODE_POOL_KEY, 'opencode'); let backend = agentPool.get(OPENCODE_POOL_KEY, 'opencode');
if (!backend) { if (!backend) {
@@ -710,6 +709,9 @@ export function createDispatcher(deps: Deps): { start(): void; stop(): Promise<v
signal: ac.signal, signal: ac.signal,
onEvent, onEvent,
}); });
// Phase 3: keep the pooled backend's slot warm across this (possibly long)
// turn so the idle sweep measures from turn END, not start.
agentPool.touch(OPENCODE_POOL_KEY, agent);
// Flush any text held back mid-tag at stream end (complete tags stripped). // Flush any text held back mid-tag at stream end (complete tags stripped).
const dcpTail = dcp.flush(); const dcpTail = dcp.flush();
@@ -962,6 +964,8 @@ export function createDispatcher(deps: Deps): { start(): void; stop(): Promise<v
taskId, taskId,
modeId: task.mode_id ?? undefined, modeId: task.mode_id ?? undefined,
}); });
// Phase 3: keep the pooled (chat,agent) backend warm across the turn.
agentPool.touch(chatId, agent);
const assistantContent = textChunks.join('').slice(0, 50_000); const assistantContent = textChunks.join('').slice(0, 50_000);
const reasoningText = reasoningChunks.join('').slice(0, 200_000); const reasoningText = reasoningChunks.join('').slice(0, 200_000);

View File

@@ -0,0 +1,170 @@
/**
* v2.6 Phase 3 (3.4) — orphan worktree reaper.
*
* Reclaims on-disk session worktree dirs under WORKTREE_BASE that have NO live
* (`status='active'`) row in the `worktrees` table — leaks from a crash between
* `git worktree add` and the DB insert, a missed chat-close hook, or a manual rm
* of the DB row. Extends the periodic-sweeper pattern (apps/server's truncation +
* stale-streaming reaper).
*
* SAFETY (Paseo worktree-archive cascade + superset destroy-saga lift): before
* removing ANY dir, run `checkWorktreeWorkAtRisk` — a dirty / unpushed / unmerged
* worktree is SKIPPED (logged), never force-removed. The pure orphan-target
* selection (which dirs are candidates) lives in
* `backends/lifecycle-decisions.ts:selectOrphanWorktreeTargets` and is unit-tested;
* this module does the DB read + fs stat + git preflight + removal side-effects.
*
* The mtime grace (default 1h) means a dir mid-`ensureSessionWorktree` (created on
* disk, row not yet committed) is never swept — the grace window covers the gap.
*/
import { readdir, stat } from 'node:fs/promises';
import { join } from 'node:path';
import type { FastifyBaseLogger } from 'fastify';
import type { Sql } from '../db.js';
import { WORKTREE_BASE, checkWorktreeWorkAtRisk } from './worktrees.js';
import { hostExec } from './host-exec.js';
import {
selectOrphanWorktreeTargets,
DEFAULT_ORPHAN_WORKTREE_GRACE_MS,
} from './backends/lifecycle-decisions.js';
export interface OrphanWorktreeReaperDeps {
sql: Sql;
log: FastifyBaseLogger;
intervalMs: number;
graceMs?: number;
}
export interface OrphanReaperResult {
scanned: number;
candidates: number;
reaped: string[];
skippedAtRisk: string[];
}
/** Single-pass reap: select orphan candidates, preflight at-risk, remove the safe. */
export async function reapOrphanWorktrees(
sql: Sql,
log: FastifyBaseLogger,
graceMs: number = DEFAULT_ORPHAN_WORKTREE_GRACE_MS,
now: number = Date.now(),
): Promise<OrphanReaperResult> {
// Enumerate on-disk session worktree dirs (`sess-*`). Per-task worktrees
// (arena/new_task/MCP) are cleaned up inline by the one-shot path, so we only
// own the persistent session dirs the warm paths leave behind.
let dirents: string[];
try {
dirents = await readdir(WORKTREE_BASE);
} catch {
return { scanned: 0, candidates: 0, reaped: [], skippedAtRisk: [] }; // base absent → nothing to do
}
const onDisk: { path: string; mtimeMs: number }[] = [];
for (const name of dirents) {
if (!name.startsWith('sess-')) continue; // only persistent session worktrees
const path = join(WORKTREE_BASE, name);
try {
const s = await stat(path);
if (!s.isDirectory()) continue;
onDisk.push({ path, mtimeMs: s.mtimeMs });
} catch {
// vanished between readdir and stat — skip
}
}
// Live worktree paths from the DB (active rows only — archived/removed rows are
// not "live", so their leftover dirs are reapable orphans).
const liveRows = await sql<{ path: string }[]>`
SELECT path FROM worktrees WHERE status = 'active'
`;
const live = new Set(liveRows.map((r) => r.path));
const candidates = selectOrphanWorktreeTargets(onDisk, live, now, graceMs);
const reaped: string[] = [];
const skippedAtRisk: string[] = [];
for (const path of candidates) {
// Preflight: never reap work at risk. A git error forces atRisk=true (fail
// closed), so a half-broken worktree is kept, not silently destroyed.
const risk = await checkWorktreeWorkAtRisk(path);
if (risk.atRisk) {
skippedAtRisk.push(path);
log.warn({ path, dirty: risk.dirty, unmerged: risk.unmerged, error: risk.error }, 'orphan-reaper: skipping at-risk orphan worktree');
continue;
}
const removed = await removeOrphanDir(path);
if (removed) reaped.push(path);
}
if (reaped.length > 0 || skippedAtRisk.length > 0) {
log.info({ scanned: onDisk.length, candidates: candidates.length, reaped, skippedAtRisk }, 'orphan-reaper: pass complete');
}
return { scanned: onDisk.length, candidates: candidates.length, reaped, skippedAtRisk };
}
/**
* Remove a single orphan worktree dir. Resolve its main repo via the git
* common-dir, run `worktree remove --force` from there + prune, then rm the dir as
* a backstop. Best-effort: every step is independently fault-tolerant so a partial
* state (dir present, git untracked) still gets reclaimed.
*/
async function removeOrphanDir(path: string): Promise<boolean> {
// Find the owning repo (the common git dir's parent). When the dir isn't a valid
// worktree anymore, this fails and we fall back to a plain rm.
const common = await hostExec(
`git -C ${shellEscape(path)} rev-parse --path-format=absolute --git-common-dir`,
{ timeoutMs: 10_000 },
).catch(() => null);
const commonDir = common && common.exitCode === 0 ? common.stdout.trim() : '';
// The repo worktree root is the parent of the .git common dir (strip trailing /.git).
const repoRoot = commonDir.replace(/\/\.git\/?$/, '').replace(/\/\.git$/, '');
if (repoRoot && repoRoot !== commonDir) {
await hostExec(
`git -C ${shellEscape(repoRoot)} worktree remove ${shellEscape(path)} --force`,
{ timeoutMs: 15_000 },
).catch(() => {});
await hostExec(
`git -C ${shellEscape(repoRoot)} worktree prune`,
{ timeoutMs: 10_000 },
).catch(() => {});
}
// Backstop: ensure the dir is gone even if the git remove no-op'd.
const rm = await hostExec(`rm -rf ${shellEscape(path)}`, { timeoutMs: 15_000 }).catch(() => null);
return rm != null && rm.exitCode === 0;
}
/** Minimal single-quote shell escape (mirrors worktrees.ts). */
function shellEscape(s: string): string {
return "'" + s.replace(/'/g, "'\\''") + "'";
}
/** Periodic orphan-worktree reaper, started/stopped by the bootstrap. Unref'd. */
export function createOrphanWorktreeReaper(deps: OrphanWorktreeReaperDeps): { start(): void; stop(): void } {
const { sql, log, intervalMs } = deps;
const graceMs = deps.graceMs ?? DEFAULT_ORPHAN_WORKTREE_GRACE_MS;
let timer: ReturnType<typeof setInterval> | null = null;
let running = false;
return {
start() {
if (timer) return;
timer = setInterval(() => {
if (running) return; // a slow pass must not overlap the next tick
running = true;
void reapOrphanWorktrees(sql, log, graceMs)
.catch((err) => log.warn({ err: err instanceof Error ? err.message : String(err) }, 'orphan-reaper: pass error'))
.finally(() => {
running = false;
});
}, intervalMs);
timer.unref?.();
log.info({ intervalMs, graceMs }, 'orphan-reaper: started');
},
stop() {
if (timer) {
clearInterval(timer);
timer = null;
}
},
};
}

View File

@@ -9,7 +9,7 @@
import type { Sql } from '../db.js'; import type { Sql } from '../db.js';
import { hostExec } from './host-exec.js'; import { hostExec } from './host-exec.js';
const WORKTREE_BASE = '/tmp/booworktrees'; export const WORKTREE_BASE = '/tmp/booworktrees';
/** /**
* Create a git worktree for a task on the host. * Create a git worktree for a task on the host.
@@ -197,6 +197,187 @@ export async function ensureSessionWorktree(
}; };
} }
/**
* v2.6 Phase 3 (3.3 / 3.4): physically remove a session's persistent worktree —
* the git worktree dir + its branch — and archive its `worktrees` row. Used by the
* chat/session-close hook (when the last chat in a session closes) and the orphan
* reaper. Best-effort on the git side (a dir already gone is not an error); the DB
* row is flipped to 'archived' (soft-delete, Paseo's worktree-archive pattern) so
* history/attribution survives and a re-run is idempotent.
*
* SAFETY: callers MUST run `checkWorktreeWorkAtRisk` first and skip at-risk
* worktrees — this function force-removes (`--force`), so it never silently drops
* uncommitted/unmerged work unless the caller already cleared/accepted the risk.
*/
export async function removeSessionWorktree(
sql: Sql,
projectPath: string,
worktree: { id: string; path: string; branch?: string | null },
opts?: { signal?: AbortSignal },
): Promise<void> {
await hostExec(
`git -C ${shellEscape(projectPath)} worktree remove ${shellEscape(worktree.path)} --force`,
{ signal: opts?.signal, timeoutMs: 15_000 },
).catch(() => {});
const branch = worktree.branch ?? null;
if (branch) {
await hostExec(
`git -C ${shellEscape(projectPath)} branch -D ${shellEscape(branch)}`,
{ signal: opts?.signal, timeoutMs: 10_000 },
).catch(() => {});
}
// Prune any stale worktree administrative entries left behind by a partial remove.
await hostExec(
`git -C ${shellEscape(projectPath)} worktree prune`,
{ signal: opts?.signal, timeoutMs: 10_000 },
).catch(() => {});
await sql`UPDATE worktrees SET status = 'archived' WHERE id = ${worktree.id}`.catch(() => {});
}
/**
* v2.6 Phase 3 (3.3): the chat-close cleanup. Mark every `agent_sessions` row for
* the chat 'closed', then — only if this was the session's LAST open chat — remove
* the shared session worktree (a worktree is one-per-session, shared across the
* session's chat tabs, so closing one tab must not pull the rug from sibling tabs).
*
* Returns what it did so the route can report it. The actual backend (process /
* server-session) teardown is the pool's job (`agentPool.closeChat` +
* `backend.closeSession`); this owns the DB + git truth.
*
* `worktreeRemoved` is false when other open chats remain (worktree kept) OR when
* the worktree held work at risk (preflight blocked it — never silently dropped).
*/
export interface ChatCloseResult {
agentRowsClosed: number;
worktreeRemoved: boolean;
worktreeAtRisk: boolean;
}
export async function closeChatBackendState(
sql: Sql,
chatId: string,
opts?: { signal?: AbortSignal; force?: boolean },
): Promise<ChatCloseResult> {
// Resolve the chat's session (and that session's project path) before we touch
// anything — a deleted chat row leaves agent_sessions/worktrees pointing nowhere.
const [chatRow] = await sql<{ session_id: string | null }[]>`
SELECT session_id FROM chats WHERE id = ${chatId}
`;
// chat row may already be gone (delete fired first); fall back to agent_sessions'
// session_id link, which SET NULLs only on session delete, not chat delete.
let sessionId = chatRow?.session_id ?? null;
if (!sessionId) {
const [as] = await sql<{ session_id: string | null }[]>`
SELECT session_id FROM agent_sessions WHERE chat_id = ${chatId} AND session_id IS NOT NULL LIMIT 1
`;
sessionId = as?.session_id ?? null;
}
// Mark this chat's (chat,agent) backend rows closed (idempotent).
const closedRows = await sql<{ agent: string }[]>`
UPDATE agent_sessions SET status = 'closed'
WHERE chat_id = ${chatId} AND status <> 'closed'
RETURNING agent
`;
let worktreeRemoved = false;
let worktreeAtRisk = false;
if (sessionId) {
// Other open chats still sharing the session worktree? If so, keep it.
const openRows = await sql<{ open_count: number }[]>`
SELECT COUNT(*)::int AS open_count FROM chats
WHERE session_id = ${sessionId} AND status = 'open' AND id <> ${chatId}
`;
const openCount = openRows[0]?.open_count ?? 0;
if (openCount === 0) {
const [wt] = await sql<{ id: string; path: string; branch: string | null }[]>`
SELECT id, path, branch FROM worktrees
WHERE session_id = ${sessionId} AND status = 'active' LIMIT 1
`;
if (wt) {
const projRows = await sql<{ path: string | null }[]>`
SELECT p.path FROM sessions s JOIN projects p ON p.id = s.project_id WHERE s.id = ${sessionId}
`;
const projectPath = projRows[0]?.path ?? null;
// Preflight (close-hook semantics): a DELIBERATE chat/session close — the
// server's session-delete already ran the full work-at-risk gate
// (dirty/unpushed/unmerged) before calling us, and chat-close discards the
// tab's staged review intentionally. So here we only block on UNCOMMITTED
// working-tree changes (`dirty`) — work the user never even staged into the
// review diff. The session branch's own commits (the diff-staging
// mechanism) are NOT a block; treating them as "unmerged risk" would make
// the worktree un-removable on every real session (the orphan reaper keeps
// the full at-risk gate because it runs unattended). `force` skips this.
if (!opts?.force) {
const risk = await checkWorktreeWorkAtRisk(wt.path, opts);
worktreeAtRisk = risk.dirty || risk.error != null;
}
if (projectPath && (opts?.force || !worktreeAtRisk)) {
await removeSessionWorktree(sql, projectPath, wt, opts);
worktreeRemoved = true;
}
}
}
}
return { agentRowsClosed: closedRows.length, worktreeRemoved, worktreeAtRisk };
}
/**
* v2.6 Phase 3 (3.5): re-baseline a session's worktree diff after a successful
* `apply_pending`. The applied changes were written to the PROJECT ROOT; the
* worktree branch still holds the same delta against the ORIGINAL `base_commit`,
* so the next turn's `diffWorktree(base_commit...worktree-HEAD)` would re-surface
* the already-applied changes as "pending" — a confusing double-count.
*
* Fix: advance the stored `base_commit` to the worktree's CURRENT HEAD (the
* `diffWorktree` path commits the worktree's accumulated changes before diffing,
* so HEAD already encodes the applied state). The next turn then diffs against
* that, surfacing only edits made AFTER the apply. Idempotent: if the worktree has
* no new commits, the base is unchanged.
*
* Diff-baseline-correctness note (design §7): we re-baseline to the worktree's own
* HEAD, NOT to a moving project HEAD — so an out-of-band edit to the project root
* after apply doesn't corrupt the baseline. The trade-off is that a manual project
* edit isn't reflected as "already there"; acceptable, and matches the stored-base
* (not moving-target) decision in §7.
*/
export async function rebaselineWorktreeAfterApply(
sql: Sql,
sessionId: string,
opts?: { signal?: AbortSignal },
): Promise<{ rebaselined: boolean; newBaseCommit: string | null }> {
const [wt] = await sql<{ id: string; path: string; base_commit: string | null }[]>`
SELECT id, path, base_commit FROM worktrees
WHERE session_id = ${sessionId} AND status = 'active' LIMIT 1
`;
if (!wt) return { rebaselined: false, newBaseCommit: null };
// Make sure the worktree's accumulated edits are committed so HEAD encodes the
// just-applied state (the diff path normally does this, but apply may run with no
// prior diff this turn). Commit ONLY when something is staged — NO --allow-empty,
// so a re-baseline with no new edits doesn't advance HEAD and stays idempotent.
await hostExec(
`cd ${shellEscape(wt.path)} && git add -A && ` +
`git diff --cached --quiet || ` +
`git -c user.email=boocoder@local -c user.name=BooCoder commit -q -m "rebaseline after apply"`,
{ signal: opts?.signal, timeoutMs: 15_000 },
).catch(() => {});
const headRes = await hostExec(
`git -C ${shellEscape(wt.path)} rev-parse HEAD`,
{ signal: opts?.signal, timeoutMs: 10_000 },
).catch(() => null);
const newBase = headRes && headRes.exitCode === 0 ? headRes.stdout.trim() || null : null;
if (!newBase || newBase === wt.base_commit) {
return { rebaselined: false, newBaseCommit: wt.base_commit };
}
await sql`UPDATE worktrees SET base_commit = ${newBase} WHERE id = ${wt.id}`;
return { rebaselined: true, newBaseCommit: newBase };
}
// ─── Session-delete work-loss guard ───────────────────────────────────────── // ─── Session-delete work-loss guard ─────────────────────────────────────────
/** /**

View File

@@ -348,7 +348,7 @@ Per-session Docker sandbox spawned by BooCoder on first write. Only project path
----- -----
## Shipped (v2.2.2v2.6.9 — interactive ACP, provider lifecycle, persistent agent sessions, workspace UX) ## Shipped (v2.2.2v2.6.10 — interactive ACP, provider lifecycle, persistent agent sessions, workspace UX)
All tags `vMAJOR.MINOR.PATCH-slug`, monotonic per minor, assigned at ship time (planning slugs differ — see the numbering-discipline note below). `CHANGELOG.md` is the canonical per-tag record. **Note on numbering divergence:** the *planned-feature* "v2.3 — Provider lifecycle" actually shipped under the **v2.5.4v2.5.13** tags; the *planned-feature* "v2.4 — BooCoder as ACP agent" remains **unshipped** even though v2.4.0/v2.4.1 *tags* shipped unrelated content (Unsloth lifts, sidecar routing). The patch-tag thread and the conceptual-milestone thread have diverged — read tags as the ship record, the `## v2.x` feature sections below as the milestone plan. The v2.3.0v2.5.1 tags were never CHANGELOG-backfilled; summarized here from commit bodies. All tags `vMAJOR.MINOR.PATCH-slug`, monotonic per minor, assigned at ship time (planning slugs differ — see the numbering-discipline note below). `CHANGELOG.md` is the canonical per-tag record. **Note on numbering divergence:** the *planned-feature* "v2.3 — Provider lifecycle" actually shipped under the **v2.5.4v2.5.13** tags; the *planned-feature* "v2.4 — BooCoder as ACP agent" remains **unshipped** even though v2.4.0/v2.4.1 *tags* shipped unrelated content (Unsloth lifts, sidecar routing). The patch-tag thread and the conceptual-milestone thread have diverged — read tags as the ship record, the `## v2.x` feature sections below as the milestone plan. The v2.3.0v2.5.1 tags were never CHANGELOG-backfilled; summarized here from commit bodies.
@@ -384,6 +384,7 @@ All tags `vMAJOR.MINOR.PATCH-slug`, monotonic per minor, assigned at ship time (
- `v2.6.7-interrupt-guard`**F.1 fix:** post-interrupt stale-terminal bug in the opencode warm-server backend (one-click reachable since `v2.6.5`'s Stop button). opencode emits one trailing `session.idle`/`session.error` for a cancelled turn (sessionID only, no turn id) that settled the *next* turn early as success. Pure per-session guard (`backends/turn-guard.ts` — arm-on-abort / swallow-one-orphan / self-heal-on-activity) wired into `opencode-server.ts`; 3 regression tests (TDD). First item of the v2.6 openspec "remaining" plan; Phase 1-UX / 2 / 3 still open - `v2.6.7-interrupt-guard`**F.1 fix:** post-interrupt stale-terminal bug in the opencode warm-server backend (one-click reachable since `v2.6.5`'s Stop button). opencode emits one trailing `session.idle`/`session.error` for a cancelled turn (sessionID only, no turn id) that settled the *next* turn early as success. Pure per-session guard (`backends/turn-guard.ts` — arm-on-abort / swallow-one-orphan / self-heal-on-activity) wired into `opencode-server.ts`; 3 regression tests (TDD). First item of the v2.6 openspec "remaining" plan; Phase 1-UX / 2 / 3 still open
- `v2.6.8-agent-attribution`**v2.6 Phase 1-UX** (U.1U.6), built by 3 parallel subagents over disjoint files. Backend: `pending_changes.agent` stamped at every queue site + flows through `listPending`; new `GET /api/sessions/:id/agent-sessions` route; opencode warm-server consumes `session.next.step.ended` → accumulates `input_tokens`/`output_tokens`/`cost` on `agent_sessions`. Frontend: DiffPanel per-row agent badges + multi-agent note; AgentComposerBar resumed/history/new-session chip (gated on optional `sessionId`, BooChat unaffected); shared `providerIcons.tsx` + `useAgentSessions` hook. 9 new tests; web+coder tsc clean. Both surfaces deployed (boocoder restart + `boocode` Docker rebuild). Phase 2/3 remain - `v2.6.8-agent-attribution`**v2.6 Phase 1-UX** (U.1U.6), built by 3 parallel subagents over disjoint files. Backend: `pending_changes.agent` stamped at every queue site + flows through `listPending`; new `GET /api/sessions/:id/agent-sessions` route; opencode warm-server consumes `session.next.step.ended` → accumulates `input_tokens`/`output_tokens`/`cost` on `agent_sessions`. Frontend: DiffPanel per-row agent badges + multi-agent note; AgentComposerBar resumed/history/new-session chip (gated on optional `sessionId`, BooChat unaffected); shared `providerIcons.tsx` + `useAgentSessions` hook. 9 new tests; web+coder tsc clean. Both surfaces deployed (boocoder restart + `boocode` Docker rebuild). Phase 2/3 remain
- `v2.6.9-warm-acp`**v2.6 Phase 2:** goose/qwen run as **warm ACP backends** (one persistent `goose acp`/`qwen --acp` child + `ClientSideConnection` + ACP session per `(chat,agent)`, `initialize`+`session/new` once, reused across turns) instead of one-shot. New `WarmAcpBackend` (same `AgentBackend` interface as opencode); abort = `session/cancel` the prompt only (never kills the child); dispatcher routes goose/qwen chat-tab tasks via pure `shouldUseWarmBackend` (one-shot fallback kept for arena/MCP/`new_task`); `handleSessionUpdate` extracted to a shared pure `acp-event-map.ts` (one-shot path byte-identical). SDK concern resolved (`@agentclientprotocol/sdk@^0.22.1` has stable resume; moot warm, deferred to Phase 3). 15 new tests, 180 coder tests pass. Backend-only deploy (boocoder restart). **Smoke 2/2b pending live.** Phase 3 (lifecycle hardening) is the last v2.6 phase - `v2.6.9-warm-acp`**v2.6 Phase 2:** goose/qwen run as **warm ACP backends** (one persistent `goose acp`/`qwen --acp` child + `ClientSideConnection` + ACP session per `(chat,agent)`, `initialize`+`session/new` once, reused across turns) instead of one-shot. New `WarmAcpBackend` (same `AgentBackend` interface as opencode); abort = `session/cancel` the prompt only (never kills the child); dispatcher routes goose/qwen chat-tab tasks via pure `shouldUseWarmBackend` (one-shot fallback kept for arena/MCP/`new_task`); `handleSessionUpdate` extracted to a shared pure `acp-event-map.ts` (one-shot path byte-identical). SDK concern resolved (`@agentclientprotocol/sdk@^0.22.1` has stable resume; moot warm, deferred to Phase 3). 15 new tests, 180 coder tests pass. Backend-only deploy (boocoder restart). **Smoke 2/2b pending live.** Phase 3 (lifecycle hardening) is the last v2.6 phase
- `v2.6.10-lifecycle-hardening`**v2.6 Phase 3 (final phase — completes v2.6).** Idle TTL eviction (`AGENT_POOL_IDLE_TTL_MS`=30min) + LRU cap (`AGENT_POOL_MAX_LIVE`=10), busy backends never evicted; pure `lifecycle-decisions.ts`. Crash recovery via openchamber's health-monitor + busy-aware-restart + stale-grace state machine in `opencode-server.ts` (+ port reclaim) + `warm-acp.ts` (opencode → fresh sessions; ACP → re-`session/new`; F.1 guard + U.6 usage preserved). Orphan worktree reaper (1h grace, superset-style dirty/unpushed preflight, Paseo soft-delete) + close hooks + re-baseline after apply. 35 new tests + DB-opt-in reconnect test; 215 coder tests pass. Backend-only deploy. **Follow-ups (out of v2.6 scope): apps/server close-hook caller, 3.7 DiffPanel staging hint (frontend), live Smoke 2/2b/3.** With this, **v2.6 persistent agent sessions is complete** (Phase 03 + F.1 + Phase 1-UX)
----- -----

View File

@@ -54,20 +54,17 @@ ACP follows; hardening last.
resumes the SAME `agent_session_id` (memory intact), boocode saw opencode's turns as resumes the SAME `agent_session_id` (memory intact), boocode saw opencode's turns as
history, all three shared the one worktree, and no agent was locked to the chat. history, all three shared the one worktree, and no agent was locked to the chat.
## Phase 3 — Lifecycle hardening — ⬜ REMAINING ## Phase 3 — Lifecycle hardening — ✅ SHIPPED `v2.6.10-lifecycle-hardening` (3.13.6; 3.7 frontend + apps/server close-hook caller are follow-ups)
> **Lift (design §10):** hardening from **openchamber** (MIT, same warm-opencode-server architecture) — health-monitor + crash auto-restart + busy-aware restart + port reclaim (`killProcessOnPort`/`waitForPortRelease`) + stall-SSE = a concrete state machine for 3.1/3.2/3.6. Reaper (3.3/3.4): Paseo worktree-archive cascade + superset destroy-saga (preflight dirty/unpushed inspect) + LRU cap on warm-server Maps. Do crash-recovery + reaper together (shared supervision loop). > **Lift (design §10):** hardening from **openchamber** (MIT, same warm-opencode-server architecture) — health-monitor + crash auto-restart + busy-aware restart + port reclaim (`killProcessOnPort`/`waitForPortRelease`) + stall-SSE = a concrete state machine for 3.1/3.2/3.6. Reaper (3.3/3.4): Paseo worktree-archive cascade + superset destroy-saga (preflight dirty/unpushed inspect) + LRU cap on warm-server Maps. Do crash-recovery + reaper together (shared supervision loop).
- [ ] 3.1 Idle TTL eviction keyed per `(chat, agent)`; reattach-on-next-turn from `agent_sessions`. - [x] 3.1 Idle TTL eviction per `(chat, agent)` (`AGENT_POOL_IDLE_TTL_MS`=30min) + LRU cap (`AGENT_POOL_MAX_LIVE`=10), busy never evicted; reattach next turn. Pure `lifecycle-decisions.ts` (TDD).
- [ ] 3.2 Crash recovery: opencode server restart recreates sessions; ACP re-`session/new`. - [x] 3.2 Crash recovery: openchamber health-monitor + busy-aware-restart + stale-grace state machine in `opencode-server.ts` (+ port reclaim) + `warm-acp.ts`. opencode → fresh sessions; ACP re-`session/new`. F.1 guard + U.6 usage preserved.
- [ ] 3.3 Chat close/archive hook → `closeSession` for every `(chat, agent)` + remove the - [x] 3.3 Close hooks (`/api/chats/:id/close`, `/api/sessions/:id/close`) → `closeChat` evicts backends + archives the `worktrees` row + removes the worktree. *(apps/server caller is a follow-up; idle-evict + reaper backstop it.)*
chat's **`worktrees`** row + worktree (NOT `session_worktrees` — superseded P1.5-b); mark agent rows `status='closed'`. - [x] 3.4 Orphan worktree reaper (periodic, 1h grace, superset-style dirty/unpushed preflight, Paseo soft-delete) + LRU cap on the pool.
- [ ] 3.4 Orphan worktree reaper (extend periodic sweeper) + max-live-worktrees LRU cap. - [x] 3.5 Re-baseline `worktrees.base_commit` after a successful `apply_pending` (both apply routes).
- [ ] 3.5 Re-baseline worktree diff after `apply_pending`. - [x] 3.6 Reconnect integration test (DB-opt-in): restart mid-session → next turn reattaches/recreates from `agent_sessions`/`worktrees`.
- [ ] 3.6 Reconnect test: restart BooCoder mid-session → next turn reattaches/recreates cleanly. - [ ] 3.7 Staging-boundary hint in DiffPanel (§9c) — **frontend follow-up** (apps/web; deferred — Sam has uncommitted web work).
- [ ] 3.7 Staging-boundary hint in DiffPanel (§9c): muted one-liner when the selected
provider can't see another agent's unapplied worktree edits (derived from per-change
`agent` + current provider; no new state).
## Tests — ⬜ REMAINING (none of T.1T.3 exist yet) ## Tests — ⬜ REMAINING (none of T.1T.3 exist yet)