Close: guard channel delete with pendingDelete so a restart can't orphan it

The button and slash close paths deleted the channel via a bare setTimeout that never set the pendingDelete flag, so a restart in the 5s grace window orphaned the channel (closed in DB, still present in Discord) with no recovery. Only the auto-close path used the flag correctly.

Extract scheduleTicketChannelDelete() in services/tickets.js: a grace-delayed, queue-routed (enqueueDelete) delete that clears pendingDelete on success. All three close paths now use it.

Button/slash now set pendingDelete:true and keep discordThreadId populated so resumePendingDeletes() recovers the delete on the next boot. The button path previously nulled discordThreadId before the delete, which made the channel unrecoverable.
2026-06-05 03:08:28 +00:00
parent 61e8ea32e1
commit 6ae57af885
4 changed files with 116 additions and 20 deletions
--- a/.scratch/close-hardening/design.md
+++ b/.scratch/close-hardening/design.md
@@ -0,0 +1,73 @@
+# Close-hardening: stop mid-close restarts from orphaning channels
+
+## Problem (observed)
+Ticket #18 was closed (transcript saved, `status: closed`) but its Discord channel
+was never deleted — it lingers as an orphan. Root cause: the final channel delete
+is a deferred `setTimeout`, and a container restart during the delay drops it.
+
+Evidence: ticket #18 in Mongo is `status: "closed"` with
+`discordThreadId: "1512204690631430144"` still pointing at the live channel, while
+properly-closed tickets (#17, #12, #11, #10) are `status: closed` + `discordThreadId: null`.
+
+## Three underlying defects
+1. **Deferred delete is cancellable / fragile.**
+   - Button path (`handlers/buttons.js` `runFinalClose:471`):
+     `trackTimeout(setTimeout(() => channel.delete(), 5000))`. On SIGTERM,
+     `handleShutdown` (`broccolini-discord.js:276-279`) `clearTimeout`s every tracked
+     timeout → the delete never fires. A redeploy in the 5 s window orphans the channel.
+   - Slash path (`handlers/commands/close.js` `finalizeForceClose:89-93`): a *plain*
+     `setTimeout` (not tracked). `handleShutdown` does not clear it on SIGTERM, but it is
+     lost on hard exit/SIGKILL, and there is no reconciliation either way.
+2. **Inconsistent DB writes between the two paths.**
+   - Button path sets `{ discordThreadId: null, status: 'closed' }` (buttons.js:447-450).
+   - Slash path sets only `{ status: 'closed' }` (close.js:73-76), leaving `discordThreadId`.
+   So an orphan may have `discordThreadId` null OR still-set — no single signal.
+3. **No reconciliation for "closed but channel still exists."**
+   `reconcileDeletedTicketChannels` only handles the opposite direction (DB open,
+   channel gone). Nothing heals a closed ticket whose channel survived.
+
+## Goals
+- A restart at any moment during close must not permanently orphan a channel.
+- Both close paths leave identical, unambiguous DB state.
+- A self-healing sweep finishes any delete a restart interrupted.
+
+## Approach (IMPLEMENTED — uses the existing pendingDelete mechanism)
+Discovery during implementation: the codebase **already has** the restart-survival
+machinery — the `pendingDelete` flag (`models.js`) + `resumePendingDeletes(client)`
+called once from the `ready` handler (`broccolini-discord.js:231`). The **auto-close**
+path uses it correctly; the **button** and **slash** paths simply did not participate
+(bare `setTimeout(channel.delete())`, never setting `pendingDelete`). That omission is
+the entire bug. So the fix is to make all three paths share one guarded delete — NOT to
+add a new reconcile job.
+
+1. **Shared helper `scheduleTicketChannelDelete(channel, gmailThreadId)`** in
+   `services/tickets.js`: after a 5 s grace delay, `enqueueDelete(channel)` (queue-routed,
+   honoring Hard Rule #3 — the old bare `channel.delete()` bypassed the queue) then unset
+   `pendingDelete`. Wrapped in `trackTimeout`.
+2. **Each close path sets `pendingDelete: true` and keeps `discordThreadId` populated**
+   before scheduling, so a restart in the grace window is recovered by
+   `resumePendingDeletes()` (it re-fetches the channel by `discordThreadId` and deletes it).
+   - Button path previously set `discordThreadId: null` *before* the delete — that made the
+     channel unrecoverable on restart. Changed to `{ pendingDelete: true }`, leaving
+     `discordThreadId` set (matches the auto-close contract).
+3. The grace delay is kept (staff read the close message first); recovery now covers it.
+
+## Files (DONE)
+- `services/tickets.js` — added `scheduleTicketChannelDelete()`; auto-close else-branch now
+  calls it; exported.
+- `handlers/buttons.js` `runFinalClose` — `attemptCloseTransition(..., { pendingDelete: true })`
+  (was `{ discordThreadId: null }`); delete via `scheduleTicketChannelDelete`.
+- `handlers/commands/close.js` `finalizeForceClose` — same `{ pendingDelete: true }`; delete via
+  `scheduleTicketChannelDelete`.
+
+## Notes / residual
+- Pre-existing orphans (e.g. #14) have `pendingDelete: false`, so `resumePendingDeletes`
+  will NOT auto-heal them — they need the one-off manual cleanup (same as #18).
+- A non-restart `enqueueDelete` failure leaves `pendingDelete: true` until the next boot
+  (resume retries). Same property the auto-close path already had — accepted.
+- Closed tickets now retain `discordThreadId` (like auto-close already did); nothing queries
+  closed tickets by channel, and the deleted channel id never re-matches a live channel.
+
+## Out of scope
+- The in-memory countdown itself (a restart during the *countdown*, before finalize,
+  simply cancels the pending close — acceptable; staff can re-close).
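
The guarded-delete flow described in design.md can be sketched roughly as follows. This is an illustration only, not the code from this commit: the `tickets` map, the `graceMs` parameter, and the bodies of `trackTimeout` and `enqueueDelete` are hypothetical in-memory stand-ins, and `resumePendingDeletes` assumes a client whose `channels.fetch(id)` resolves to `null` for a missing channel. Only the function names and the `pendingDelete` / `discordThreadId` fields come from the design doc.

```javascript
// Hypothetical in-memory ticket store keyed by gmailThreadId.
// The real schema lives in models.js and is not shown in this commit view.
const tickets = new Map();

// Hypothetical registry of timeouts that a shutdown handler would clear.
const trackedTimeouts = new Set();

function trackTimeout(handle) {
  trackedTimeouts.add(handle);
  return handle;
}

// Stand-in for the queue-routed delete; here it simply awaits channel.delete().
async function enqueueDelete(channel) {
  await channel.delete();
}

// Callers are expected to have set pendingDelete: true (and kept
// discordThreadId) BEFORE calling this, so a restart can be recovered.
function scheduleTicketChannelDelete(channel, gmailThreadId, graceMs = 5000) {
  const handle = trackTimeout(setTimeout(async () => {
    trackedTimeouts.delete(handle);
    try {
      await enqueueDelete(channel);
      const ticket = tickets.get(gmailThreadId);
      if (ticket) ticket.pendingDelete = false; // clear only on success
    } catch (err) {
      // Leave pendingDelete set; the next boot's resume pass retries.
    }
  }, graceMs));
  return handle;
}

// Boot-time pass: finish any delete that a restart interrupted.
async function resumePendingDeletes(client) {
  for (const ticket of tickets.values()) {
    if (!ticket.pendingDelete || !ticket.discordThreadId) continue;
    try {
      const channel = await client.channels.fetch(ticket.discordThreadId);
      if (channel) await enqueueDelete(channel);
      ticket.pendingDelete = false; // deleted now, or already gone
    } catch (err) {
      // Keep the flag so a later boot can try again.
    }
  }
}
```

In this sketch the flag is the only durable state: if the timer is cleared by a shutdown, the ticket still carries `pendingDelete: true` plus a usable `discordThreadId`, which is exactly what the boot-time pass needs to find and delete the channel.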