Close: guard channel delete with pendingDelete so a restart can't orphan it

The button and slash close paths deleted the channel via a bare setTimeout that
never set the pendingDelete flag, so a restart in the 5s grace window orphaned
the channel (closed in DB, still present in Discord) with no recovery — only the
auto-close path used the flag correctly.

Extract scheduleTicketChannelDelete() in services/tickets.js: a grace-delayed,
queue-routed (enqueueDelete) delete that clears pendingDelete on success. All
three close paths now use it. Button/slash set pendingDelete:true and keep
discordThreadId populated so resumePendingDeletes() recovers the delete on the
next boot. The button path previously nulled discordThreadId before the delete,
which made the channel unrecoverable.
2026-06-05 03:08:28 +00:00
parent 61e8ea32e1
commit 6ae57af885
4 changed files with 116 additions and 20 deletions

@@ -0,0 +1,73 @@
# Close-hardening: stop mid-close restarts from orphaning channels
## Problem (observed)
Ticket #18 was closed (transcript saved, `status: closed`) but its Discord channel
was never deleted — it lingers as an orphan. Root cause: the final channel delete
is a deferred `setTimeout`, and a container restart during the delay drops it.
Evidence: ticket #18 in Mongo is `status: "closed"` with
`discordThreadId: "1512204690631430144"` still pointing at the live channel, while
properly-closed tickets (#17, #12, #11, #10) are `status: closed` + `discordThreadId: null`.
## Three underlying defects
1. **Deferred delete is cancellable / fragile.**
- Button path (`handlers/buttons.js` `runFinalClose:471`):
`trackTimeout(setTimeout(() => channel.delete(), 5000))`. On SIGTERM,
`handleShutdown` (`broccolini-discord.js:276-279`) `clearTimeout`s every tracked
timeout → the delete never fires. A redeploy in the 5 s window orphans the channel.
- Slash path (`handlers/commands/close.js` `finalizeForceClose:89-93`): a *plain*
`setTimeout` (not tracked) — survives SIGTERM but dies on hard exit/SIGKILL, and
there is no reconciliation either way.
2. **Inconsistent DB writes between the two paths.**
- Button path sets `{ discordThreadId: null, status: 'closed' }` (buttons.js:447-450).
- Slash path sets only `{ status: 'closed' }` (close.js:73-76), leaving `discordThreadId`.
So an orphan may have `discordThreadId` null OR still-set — no single signal.
3. **No reconciliation for "closed but channel still exists."**
`reconcileDeletedTicketChannels` only handles the opposite direction (DB open,
channel gone). Nothing heals a closed ticket whose channel survived.
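
The button-path failure in defect 1 can be reproduced in isolation. The sketch below is a minimal illustration, not the project's code: `trackTimeout` and `handleShutdown` are named in this note, but their bodies here are assumptions, and `channel` is a hypothetical stand-in for a Discord channel.

```javascript
// Assumed shape of the timeout registry: every tracked timer is cleared on shutdown.
const tracked = new Set();

function trackTimeout(timer) {
  tracked.add(timer);
  return timer;
}

function handleShutdown() {
  for (const timer of tracked) clearTimeout(timer);
  tracked.clear();
}

// Hypothetical stand-in for a Discord channel.
const channel = {
  deleted: false,
  delete() {
    this.deleted = true;
  },
};

// Old button path: the delete lives only in an in-memory timer.
trackTimeout(setTimeout(() => channel.delete(), 5000));

// SIGTERM arrives inside the 5 s grace window.
handleShutdown();

// The timer is gone, the channel was never deleted, and nothing in the DB
// records that a delete was still owed.
console.log(channel.deleted);
```

Because the only record of the pending delete was the timer itself, clearing it loses the intent entirely.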
## Goals
- A restart at any moment during close must not permanently orphan a channel.
- Both close paths leave identical, unambiguous DB state.
- A self-healing sweep finishes any delete a restart interrupted.
## Approach (IMPLEMENTED — uses the existing pendingDelete mechanism)
Discovery during implementation: the codebase **already has** the restart-survival
machinery — the `pendingDelete` flag (`models.js`) + `resumePendingDeletes(client)`
called once from the `ready` handler (`broccolini-discord.js:231`). The **auto-close**
path uses it correctly; the **button** and **slash** paths simply did not participate
(bare `setTimeout(channel.delete())`, never setting `pendingDelete`). That omission is
the entire bug. So the fix is to make all three paths share one guarded delete — NOT to
add a new reconcile job.
1. **Shared helper `scheduleTicketChannelDelete(channel, gmailThreadId)`** in
`services/tickets.js`: after a 5 s grace delay, `enqueueDelete(channel)` (queue-routed,
honoring Hard Rule #3 — the old bare `channel.delete()` bypassed the queue) then unset
`pendingDelete`. Wrapped in `trackTimeout`.
2. **Each close path sets `pendingDelete: true` and keeps `discordThreadId` populated**
before scheduling, so a restart in the grace window is recovered by
`resumePendingDeletes()` (it re-fetches the channel by `discordThreadId` and deletes it).
- Button path previously set `discordThreadId: null` *before* the delete — that made the
channel unrecoverable on restart. Changed to `{ pendingDelete: true }`, leaving
`discordThreadId` set (matches the auto-close contract).
3. The grace delay is kept (staff read the close message first); recovery now covers it.
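
The steps above can be sketched end to end with in-memory stand-ins. This is an illustration under stated assumptions, not the code in `services/tickets.js`: the `tickets` and `liveChannels` maps replace Mongo and Discord, `enqueueDelete` is reduced to a direct call, `graceMs` is a hypothetical parameter, and this `resumePendingDeletes` takes no `client` argument.

```javascript
// Hypothetical in-memory stand-ins for the ticket collection and Discord.
const tickets = new Map(); // gmailThreadId -> ticket document
const liveChannels = new Map(); // channel id -> channel

function makeChannel(id) {
  const channel = {
    id,
    delete: async () => {
      liveChannels.delete(id);
    },
  };
  liveChannels.set(id, channel);
  return channel;
}

// Stand-in for the real delete queue.
async function enqueueDelete(channel) {
  await channel.delete();
}

const trackedTimeouts = new Set();
function trackTimeout(timer) {
  trackedTimeouts.add(timer);
  return timer;
}

// Grace-delayed, queue-routed delete that clears pendingDelete only on success.
function scheduleTicketChannelDelete(channel, gmailThreadId, graceMs = 5000) {
  return trackTimeout(
    setTimeout(async () => {
      try {
        await enqueueDelete(channel);
        tickets.get(gmailThreadId).pendingDelete = false;
      } catch {
        // pendingDelete stays true, so the next boot retries the delete.
      }
    }, graceMs)
  );
}

// Boot-time recovery: finish any delete that a restart interrupted.
async function resumePendingDeletes() {
  for (const ticket of tickets.values()) {
    if (!ticket.pendingDelete) continue;
    const channel = liveChannels.get(ticket.discordThreadId);
    if (channel) await enqueueDelete(channel);
    ticket.pendingDelete = false;
  }
}

// Close, restart inside the grace window, then boot again.
async function demo() {
  const channel = makeChannel('chan-1');
  tickets.set('gmail-1', {
    status: 'closed',
    discordThreadId: 'chan-1', // kept populated so recovery can find the channel
    pendingDelete: true,
  });
  const timer = scheduleTicketChannelDelete(channel, 'gmail-1');
  clearTimeout(timer); // the restart drops the in-memory timer
  await resumePendingDeletes(); // next boot
  return {
    channelGone: !liveChannels.has('chan-1'),
    pendingDelete: tickets.get('gmail-1').pendingDelete,
  };
}
```

In this sketch `demo()` resolves to `{ channelGone: true, pendingDelete: false }`: the timer is lost, but the flag and the retained `discordThreadId` are enough for the boot-time pass to finish the delete.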
## Files (DONE)
- `services/tickets.js` — added `scheduleTicketChannelDelete()`; auto-close else-branch now
calls it; exported.
- `handlers/buttons.js` `runFinalClose`: `attemptCloseTransition(..., { pendingDelete: true })`
(was `{ discordThreadId: null }`); delete via `scheduleTicketChannelDelete`.
- `handlers/commands/close.js` `finalizeForceClose` — same `{ pendingDelete: true }`; delete via
`scheduleTicketChannelDelete`.
## Notes / residual
- Pre-existing orphans (e.g. #14) have `pendingDelete: false`, so `resumePendingDeletes`
will NOT auto-heal them — they need the one-off manual cleanup (same as #18).
- A non-restart `enqueueDelete` failure leaves `pendingDelete: true` until the next boot
(resume retries). Same property the auto-close path already had — accepted.
- Closed tickets now retain `discordThreadId` (like auto-close already did); nothing queries
closed tickets by channel, and the deleted channel id never re-matches a live channel.
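
For the one-off manual cleanup, candidates can be listed by cross-checking closed tickets against the channels that still exist. The helper below is a hypothetical sketch: `findOrphanedClosedTickets` is not part of the codebase, the sample tickets are made up, and `liveChannelIds` stands in for whatever lookup of existing Discord channels is used.

```javascript
// Hypothetical helper: closed tickets that are not flagged for delete but whose
// channel still exists. After this change a closed ticket legitimately keeps
// discordThreadId, so the live-channel check is what identifies an orphan.
function findOrphanedClosedTickets(tickets, liveChannelIds) {
  return tickets.filter(
    (ticket) =>
      ticket.status === 'closed' &&
      !ticket.pendingDelete &&
      ticket.discordThreadId != null &&
      liveChannelIds.has(ticket.discordThreadId)
  );
}

// Made-up sample data for illustration only.
const sampleTickets = [
  { number: 1, status: 'closed', discordThreadId: 'chan-a', pendingDelete: false }, // orphan
  { number: 2, status: 'closed', discordThreadId: 'chan-b', pendingDelete: false }, // channel already gone
  { number: 3, status: 'closed', discordThreadId: null, pendingDelete: false }, // old-style clean close
  { number: 4, status: 'open', discordThreadId: 'chan-c', pendingDelete: false }, // still open
];

const orphans = findOrphanedClosedTickets(sampleTickets, new Set(['chan-a', 'chan-c']));
console.log(orphans.map((ticket) => ticket.number));
```

Only the first sample ticket is reported, since it is the only closed one whose channel id is still live.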
## Out of scope
- The in-memory countdown itself (a restart during the *countdown*, before finalize,
simply cancels the pending close — acceptable; staff can re-close).