diff --git a/.scratch/close-hardening/design.md b/.scratch/close-hardening/design.md new file mode 100644 index 0000000..dda43da --- /dev/null +++ b/.scratch/close-hardening/design.md @@ -0,0 +1,73 @@ +# Close-hardening: stop mid-close restarts from orphaning channels + +## Problem (observed) +Ticket #18 was closed (transcript saved, `status: closed`) but its Discord channel +was never deleted — it lingers as an orphan. Root cause: the final channel delete +is a deferred `setTimeout`, and a container restart during the delay drops it. + +Evidence: ticket #18 in Mongo is `status: "closed"` with +`discordThreadId: "1512204690631430144"` still pointing at the live channel, while +properly-closed tickets (#17, #12, #11, #10) are `status: closed` + `discordThreadId: null`. + +## Three underlying defects +1. **Deferred delete is cancellable / fragile.** + - Button path (`handlers/buttons.js` `runFinalClose:471`): + `trackTimeout(setTimeout(() => channel.delete(), 5000))`. On SIGTERM, + `handleShutdown` (`broccolini-discord.js:276-279`) `clearTimeout`s every tracked + timeout → the delete never fires. A redeploy in the 5 s window orphans the channel. + - Slash path (`handlers/commands/close.js` `finalizeForceClose:89-93`): a *plain* + `setTimeout` (not tracked) — survives SIGTERM but dies on hard exit/SIGKILL, and + there is no reconciliation either way. +2. **Inconsistent DB writes between the two paths.** + - Button path sets `{ discordThreadId: null, status: 'closed' }` (buttons.js:447-450). + - Slash path sets only `{ status: 'closed' }` (close.js:73-76), leaving `discordThreadId`. + So an orphan may have `discordThreadId` null OR still-set — no single signal. +3. **No reconciliation for "closed but channel still exists."** + `reconcileDeletedTicketChannels` only handles the opposite direction (DB open, + channel gone). Nothing heals a closed ticket whose channel survived. + +## Goals +- A restart at any moment during close must not permanently orphan a channel. +- Both close paths leave identical, unambiguous DB state. +- A self-healing sweep finishes any delete a restart interrupted. + +## Approach (IMPLEMENTED — uses the existing pendingDelete mechanism) +Discovery during implementation: the codebase **already has** the restart-survival +machinery — the `pendingDelete` flag (`models.js`) + `resumePendingDeletes(client)` +called once from the `ready` handler (`broccolini-discord.js:231`). The **auto-close** +path uses it correctly; the **button** and **slash** paths simply did not participate +(bare `setTimeout(channel.delete())`, never setting `pendingDelete`). That omission is +the entire bug. So the fix is to make all three paths share one guarded delete — NOT to +add a new reconcile job. + +1. **Shared helper `scheduleTicketChannelDelete(channel, gmailThreadId)`** in + `services/tickets.js`: after a 5 s grace delay, `enqueueDelete(channel)` (queue-routed, + honoring Hard Rule #3 — the old bare `channel.delete()` bypassed the queue) then unset + `pendingDelete`. Wrapped in `trackTimeout`. +2. **Each close path sets `pendingDelete: true` and keeps `discordThreadId` populated** + before scheduling, so a restart in the grace window is recovered by + `resumePendingDeletes()` (it re-fetches the channel by `discordThreadId` and deletes it). + - Button path previously set `discordThreadId: null` *before* the delete — that made the + channel unrecoverable on restart. Changed to `{ pendingDelete: true }`, leaving + `discordThreadId` set (matches the auto-close contract). +3. The grace delay is kept (staff read the close message first); recovery now covers it. + +## Files (DONE) +- `services/tickets.js` — added `scheduleTicketChannelDelete()`; auto-close else-branch now + calls it; exported. +- `handlers/buttons.js` `runFinalClose` — `attemptCloseTransition(..., { pendingDelete: true })` + (was `{ discordThreadId: null }`); delete via `scheduleTicketChannelDelete`. +- `handlers/commands/close.js` `finalizeForceClose` — same `{ pendingDelete: true }`; delete via + `scheduleTicketChannelDelete`. + +## Notes / residual +- Pre-existing orphans (e.g. #14) have `pendingDelete: false`, so `resumePendingDeletes` + will NOT auto-heal them — they need the one-off manual cleanup (same as #18). +- A non-restart `enqueueDelete` failure leaves `pendingDelete: true` until the next boot + (resume retries). Same property the auto-close path already had — accepted. +- Closed tickets now retain `discordThreadId` (like auto-close already did); nothing queries + closed tickets by channel, and the deleted channel id never re-matches a live channel. + +## Out of scope +- The in-memory countdown itself (a restart during the *countdown*, before finalize, + simply cancels the pending close — acceptable; staff can re-close). diff --git a/handlers/buttons.js b/handlers/buttons.js index 37e850f..e3d6cb6 100644 --- a/handlers/buttons.js +++ b/handlers/buttons.js @@ -22,7 +22,7 @@ const { } = require('discord.js'); const { mongoose } = require('../db-connection'); const { CONFIG } = require('../config'); -const { makeTicketName, resolveCreatorNickname, getOrCreateTicketCategory, cleanupEmptyOverflowCategory, checkTicketCreationRateLimit, toDiscordSafeName, attemptCloseTransition } = require('../services/tickets'); +const { makeTicketName, resolveCreatorNickname, getOrCreateTicketCategory, cleanupEmptyOverflowCategory, checkTicketCreationRateLimit, toDiscordSafeName, attemptCloseTransition, scheduleTicketChannelDelete } = require('../services/tickets'); const { sendTicketClosedEmail } = require('../services/gmail'); const { moveThreadToFolder } = require('../services/gmailLabels'); const { getTicketActionRow, ticketChannelOverwrites } = require('../utils/ticketComponents'); @@ -453,7 +453,9 @@ async function runFinalClose(interaction, ticket, sendEmail = true) { await sendTicketClosedEmail(ticket, closerDisplayName, interaction.user.id); } - const { transitioned, ticket: closedTicket } = await attemptCloseTransition(ticket.gmailThreadId, { discordThreadId: null }, { welcomeMessageId: '' }); + // Keep discordThreadId set and mark pendingDelete so a restart during the + // grace window before the channel delete is recovered by resumePendingDeletes(). + const { transitioned, ticket: closedTicket } = await attemptCloseTransition(ticket.gmailThreadId, { pendingDelete: true }, { welcomeMessageId: '' }); if (transitioned) { const closerType = isStaff(interaction.member) ? 'staff' : 'user'; recordAction(interaction.user.id, 'close', { @@ -482,9 +484,12 @@ async function runFinalClose(interaction, ticket, sendEmail = true) { const parentCatId = ticket.parentCategoryId; const guildRef = interaction.guild; + // Queue-routed, pendingDelete-guarded delete (shared with auto-close + slash + // close) so a mid-close restart can't orphan the channel. + scheduleTicketChannelDelete(interaction.channel, ticket.gmailThreadId); + // Lazy require — same cycle reason as in handleConfirmCloseRequest above. const { trackTimeout } = require('../broccolini-discord'); - trackTimeout(setTimeout(() => interaction.channel.delete().catch(() => {}), 5000)); trackTimeout(setTimeout(() => { if (parentCatId && guildRef) { cleanupEmptyOverflowCategory(guildRef, parentCatId, CONFIG.TICKET_CATEGORY_NAME).catch(() => {}); diff --git a/handlers/commands/close.js b/handlers/commands/close.js index 3296059..091ca7e 100644 --- a/handlers/commands/close.js +++ b/handlers/commands/close.js @@ -15,7 +15,7 @@ const { logTicketEvent, logError } = require('../../services/debugLog'); const { moveThreadToFolder } = require('../../services/gmailLabels'); const { pendingCloses } = require('../pendingCloses'); const { findTicketForChannel } = require('../sharedHelpers'); -const { attemptCloseTransition } = require('../../services/tickets'); +const { attemptCloseTransition, scheduleTicketChannelDelete } = require('../../services/tickets'); const { buildTranscriptText, formatDateForTranscript, renderTranscriptHeader } = require('../../services/transcript'); const { recordAction } = require('../../services/staffStats'); @@ -75,7 +75,9 @@ async function finalizeForceClose(channelRef, clientRef, _TicketModel, _recordAc if (!freshTicket || freshTicket.status === 'closed') return; try { - const { transitioned, ticket: closedTicket } = await attemptCloseTransition(freshTicket.gmailThreadId, {}, { welcomeMessageId: '' }, T); + // pendingDelete (with discordThreadId left set) lets resumePendingDeletes() + // recover the channel delete if a restart interrupts the grace window. + const { transitioned, ticket: closedTicket } = await attemptCloseTransition(freshTicket.gmailThreadId, { pendingDelete: true }, { welcomeMessageId: '' }, T); if (transitioned) { record(closerId ?? 'system', 'close', { ticket: closedTicket, @@ -97,11 +99,9 @@ async function finalizeForceClose(channelRef, clientRef, _TicketModel, _recordAc console.error('Transcript error (force-close):', tErr) ); - setTimeout(() => { - channelRef.delete('Ticket force-closed').catch(e => - console.error('Failed to delete channel:', e) - ); - }, 5000); + // Queue-routed, pendingDelete-guarded delete (shared with auto-close + button + // close) so a mid-close restart can't orphan the channel. + scheduleTicketChannelDelete(channelRef, freshTicket.gmailThreadId); } catch (err) { console.error('Force close error:', err); } diff --git a/services/tickets.js b/services/tickets.js index cd60e06..9d3ef09 100644 --- a/services/tickets.js +++ b/services/tickets.js @@ -293,6 +293,32 @@ async function attemptCloseTransition(gmailThreadId, extraSet = {}, extraUnset = return { transitioned, ticket }; } +/** + * Schedule the final ticket-channel delete after a short grace period (so staff + * read the close message first), routed through the channel queue. + * + * The delete is guarded by the `pendingDelete` flag: the caller MUST have already + * set `pendingDelete: true` on the ticket AND left `discordThreadId` populated, so + * that a restart during the grace window is recovered on boot by + * resumePendingDeletes() (which re-fetches the channel and deletes it). The flag + * is cleared once enqueueDelete resolves; if the doc is gone the unset is a no-op. + * + * Shared by all three close paths (auto-close, button, slash) so they behave + * identically and none can orphan a channel on a mid-close restart. + */ +function scheduleTicketChannelDelete(channel, gmailThreadId, delayMs = 5000) { + // Lazy require — broccolini-discord re-exports trackTimeout; cycle-safe at call time. + const { trackTimeout } = require('../broccolini-discord'); + trackTimeout(setTimeout(() => { + enqueueDelete(channel).then(() => { + withRetry(() => Ticket.updateOne( + { gmailThreadId }, + { $unset: { pendingDelete: '' } } + )).catch(() => {}); + }).catch(() => {}); + }, delayMs)); +} + // --- SCHEDULED CHECKS --- // These accept `client` and optionally `sendTicketClosedEmail` to avoid circular deps. @@ -340,16 +366,7 @@ async function checkAutoClose(client, sendTicketClosedEmail, _TicketModel, _reco if (_deps && _deps.scheduleDelete) { _deps.scheduleDelete(channel, ticket); } else { - // Lazy require — broccolini-discord re-exports trackTimeout; cycle-safe. - const { trackTimeout } = require('../broccolini-discord'); - trackTimeout(setTimeout(() => { - enqueueDelete(channel).then(() => { - withRetry(() => Ticket.updateOne( - { gmailThreadId: ticket.gmailThreadId }, - { $unset: { pendingDelete: '' } } - )).catch(() => {}); - }).catch(() => {}); - }, 5000)); + scheduleTicketChannelDelete(channel, ticket.gmailThreadId); } } } catch (error) { @@ -467,6 +484,7 @@ module.exports = { checkTicketCreationRateLimit, checkTicketLimits, attemptCloseTransition, + scheduleTicketChannelDelete, checkAutoClose, checkAutoUnclaim, reconcileDeletedTicketChannels,