sortof/docs/specs/2026-05-01-collection-expansion.md


Spec B+F - Collection URL/ID expansion + live drain progress

Date: 2026-05-01
Status: Draft (awaiting review)
Sibling specs: A multi-branch picker (shipped); C+D build-context + dep-add (next); E precacher (parallel); G cleanups + patch tier.
Folds: Original Spec F (live drain progress) merges in here - a 50+ mod cold load is exactly when live counters matter, and both features share the polling endpoint.

Schema notes (corrections to design source text):

  • download_jobs.status enum is queued | downloading | done | failed. The design text used running; this spec uses the actual value downloading. UX label may render as "draining" for cohesion with the lifecycle vocabulary; the SQL keys off downloading.
  • The existing collections table (init/01_schema.sql) has columns collection_id PK, title, child_workshop_ids TEXT[], last_fetched_at TIMESTAMPTZ. There is no expires_at column. TTL is computed at read time as last_fetched_at + interval '6 hours'; no schema change for that.

§1 Overview

Today, sortof accepts one input shape: a blob of newline- or ;-delimited workshop IDs. Anything that isn't a 7-12 digit number is dropped by parse.parse_workshop_input. A pasted Steam Workshop collection URL embeds exactly one ID, so it currently surfaces as a single mod, fails parse (process_one=no_mod_info), and lands in the non_mod bucket added by the recent unknown/non-mod feature. The user is then expected to copy every child mod's ID out by hand.

This spec adds:

  1. Collection URL/ID expansion. The API recognizes Steam Workshop URLs and resolves collection IDs to their child wsids via ISteamRemoteStorage/GetCollectionDetails. Cached in the existing collections table.
  2. Async job pipeline. Any input containing a collection or any uncached wsid creates a sort_jobs row, returns a job_id, and the frontend polls GET /api/jobs/{job_id} every 2.5s until done|failed.
  3. Live counters. During expanding | queued | draining, the poll response carries fresh cached / queued / draining counts plus an incremental result_json. The status strip animates instead of going stale.

Synchronous response is preserved for the all-cached fast path (Open Q1, §10).

§2 API contract

2.1 POST /api/sort - polymorphic on input

Request body unchanged: { "input": str, "rules": str? }. Response shape branches on what's in input:

// Path A: bare wsid list, all in cache (current behavior, unchanged)
{ "status": "success", "MOD_DB": [...], "MODS_LINE": "...", ... }

// Path B: bare wsid list with ≥1 uncached, OR ≥1 collection URL
{ "status": "queued" | "expanding", "job_id": "<uuid>" }

The frontend branches on the presence of job_id. Old clients that don't poll still receive the original synchronous response whenever their input is fully warm.
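The branching rule above can be sketched as a small pure function. This is a minimal illustration, not the actual handler: `choose_route` and its parameters are hypothetical names, and it assumes the server already knows whether the input contains a collection URL and how many bare wsids are uncached.

```python
from typing import Literal

Route = Literal["sync", "queued", "expanding"]


def choose_route(has_collection: bool, uncached_wsids: int) -> Route:
    """Pick the response shape for POST /api/sort (hypothetical helper)."""
    if has_collection:
        # Any collection URL forces the job path; the job starts in 'expanding'.
        return "expanding"
    if uncached_wsids > 0:
        # Bare wsids with at least one cache miss: job path, starts in 'queued'.
        return "queued"
    # Fully warm bare-wsid input keeps the original synchronous response.
    return "sync"
```

Only the "sync" route returns the Path A body; the other two return `{status, job_id}`. With a warm collections cache the job may leave `expanding` almost immediately, so a client should not rely on ever observing that phase.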

2.2 GET /api/jobs/{job_id} - polling endpoint

Response (any phase):

{
  "job_id": "<uuid>",
  "phase":  "expanding" | "queued" | "draining" | "done" | "failed",
  "counts": { "cached": int, "queued": int, "draining": int },
  "wsids":  [str, ...] | null,        // null while phase=expanding; populated after
  "result": { ...SORTOF_DATA... } | null,   // partial during draining; final on done
  "failure_reason": str | null         // populated only on phase=failed
}

404 if the job_id is unknown or expired (TTL in §3).
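A client of this endpoint might look like the sketch below. It is an illustration under assumptions: `poll_job` is a hypothetical name, `fetch` stands in for whatever performs the GET and is assumed to return `None` on a 404, and `max_polls` is a safety cap that this spec does not define.

```python
import time
from typing import Callable, Optional

TERMINAL_PHASES = {"done", "failed"}


def poll_job(
    fetch: Callable[[], Optional[dict]],
    interval_s: float = 2.5,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 1000,  # assumption: safety cap, not part of the spec
) -> Optional[dict]:
    """Poll until the job reaches a terminal phase.

    Returns the final response body, or None if the job is unknown or
    expired (the 404 case).
    """
    for _ in range(max_polls):
        body = fetch()  # hypothetical: GET /api/jobs/{job_id}, None on 404
        if body is None:
            return None
        if body["phase"] in TERMINAL_PHASES:
            return body
        sleep(interval_s)
    raise TimeoutError("job did not reach a terminal phase")
```

Injecting `fetch` and `sleep` keeps the loop testable without a network or real delays.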

2.3 DELETE /api/jobs/{job_id} - cancel

Marks the job failed with failure_reason="cancelled". Returns 204. Idempotent: deleting an already-terminal job is a no-op 204. Does not cancel underlying download_jobs rows (Open Q6, §10).

§3 Schema

New table:

CREATE TABLE IF NOT EXISTS sort_jobs (
    job_id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    phase            TEXT NOT NULL CHECK (phase IN ('expanding','queued','draining','done','failed')),
    phase_started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    created_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    input_raw        TEXT NOT NULL,
    collection_ids   TEXT[] NOT NULL DEFAULT '{}',
    wsids            TEXT[],                              -- null until expansion resolves
    rules_raw        TEXT,
    result_json      JSONB,                               -- null until the first partial; partials during draining, final on done
    failure_reason   TEXT
);
CREATE INDEX IF NOT EXISTS sort_jobs_phase_idx ON sort_jobs (phase);
CREATE INDEX IF NOT EXISTS sort_jobs_updated_idx ON sort_jobs (updated_at);
  • TTL: rows with phase ∈ (done, failed) whose updated_at is more than 24h old are eligible for deletion. The cleanup script lives in Spec G; this spec only requires that the schema support it.
  • updated_at trigger: mirror the existing download_jobs.touch_updated_at pattern.
  • Migration plan: init/02_sort_jobs.sql for fresh deploys + a one-shot psql -f for the live DB. No data migration; pure additive.

The existing collections table is reused as-is (4 columns, see corrections at top). No expires_at column; freshness derived from last_fetched_at.

§4 Phase state machine

                    ┌──────────────────────────────────┐
                    │ /api/sort with collections only  │
                    ▼                                   │
          ┌──────────────┐  GetCollectionDetails OK    │
          │  expanding   │ ────────────────────────────┘
          └──────┬───────┘
                 │ wsids = collections + bare ids
                 ▼
          ┌──────────────┐  ←── /api/sort with bare uncached wsids
          │   queued     │ ─────────── all wsids in mod_parsed (skip drain)
          └──────┬───────┘                              │
                 │ first download_jobs row → downloading
                 ▼                                       │
          ┌──────────────┐                               │
          │   draining   │                               │
          └──────┬───────┘                               │
                 │ all wsids resolved (mod_parsed has rows)
                 │                                       │
                 ▼                                       ▼
          ┌──────────────┐               ┌──────────────┐
          │     done     │               │     done     │
          └──────────────┘               └──────────────┘

Failure terminal at any phase: failed (with phase_at_failure stored in failure_reason prefix).

Phase transitions are monotonic: expanding → queued → draining → done. No backward transitions. A job's phase only advances; the API computes phase fresh on each GET rather than mutating it on every event (simpler, no leader needed).

Phase computation rule (executed inside GET /api/jobs/{job_id}):

if phase in (done, failed):           return as-stored
if wsids is null:                     phase = expanding
elif counts.draining > 0:             phase = draining
elif counts.queued > 0:               phase = queued
elif counts.cached >= len(wsids):     phase = done; persist result_json
else:                                 phase = queued      # transient gap between rows
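The rule above translates directly into a pure function. The following is a minimal sketch: `Counts` and `derive_phase` are hypothetical names, and persisting result_json on the transition to done is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Counts:
    cached: int
    queued: int
    draining: int


def derive_phase(
    stored_phase: str,
    wsids: Optional[Sequence[str]],
    counts: Counts,
) -> str:
    """Derive the phase reported by GET /api/jobs/{job_id}."""
    if stored_phase in ("done", "failed"):
        return stored_phase      # terminal phases are returned as stored
    if wsids is None:
        return "expanding"       # expansion has not resolved yet
    if counts.draining > 0:
        return "draining"
    if counts.queued > 0:
        return "queued"
    if counts.cached >= len(wsids):
        return "done"            # caller persists result_json here
    return "queued"              # transient gap between rows
```

For example, `derive_phase("queued", ["1", "2"], Counts(cached=1, queued=0, draining=0))` falls through to the last branch and reports `queued` until the remaining row appears.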

§5 Steam expansion

5.1 Detection

The current parse.parse_workshop_input strips ini-style prefixes and extracts \b\d{7,12}\b. We add a sibling parse.parse_with_collections(text) -> (wsids: list, collection_ids: list):

  • Match Steam URLs https?://steamcommunity\.com/(?:sharedfiles|workshop)/filedetails/\?id=(\d{7,12}) and capture the ID.
  • Bare numeric IDs (the existing pattern) remain wsids.
  • A URL-form ID is classified as a candidate collection. Syntax alone cannot tell a collection from a mod, so candidate collection IDs are sent to GetCollectionDetails first; if the API reports them as actual mods (not collections), they fall back to the wsids list.
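The detection rules above could be sketched as follows. This is an illustration only: it omits the ini-style prefix stripping that parse.parse_workshop_input performs, and it assumes that an ID appearing both in a URL and as a bare number is kept only as a candidate collection.

```python
import re

# Patterns taken from the rules above.
URL_RE = re.compile(
    r"https?://steamcommunity\.com/(?:sharedfiles|workshop)/filedetails/\?id=(\d{7,12})"
)
ID_RE = re.compile(r"\b\d{7,12}\b")


def parse_with_collections(text: str) -> tuple[list[str], list[str]]:
    """Return (wsids, collection_ids), each deduplicated in input order."""
    # IDs captured from Steam URLs are candidate collections.
    collection_ids = list(dict.fromkeys(URL_RE.findall(text)))
    # Blank out the URLs so their IDs are not counted again as bare wsids.
    remainder = URL_RE.sub(" ", text)
    seen = set(collection_ids)
    wsids: list[str] = []
    for wsid in ID_RE.findall(remainder):
        if wsid not in seen:  # assumption: URL form wins over bare form
            seen.add(wsid)
            wsids.append(wsid)
    return wsids, collection_ids
```

The fallback of a candidate to the wsids list happens later, once GetCollectionDetails has answered, so it is not part of this function.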

5.2 Resolution

Single batched call per /api/sort with ≥1 candidate:

POST https://api.steampowered.com/ISteamRemoteStorage/GetCollectionDetails/v1/
  collectioncount=N
  publishedfileids[0..N-1]=...

Per-collection in the response: result==1 and children[] populated → expand to [c.publishedfileid for c in children]. result!=1 → mark in result warnings as {tag:"collection-partial", level:"warning", msg:"collection X could not be fetched"}; keep the job alive with whatever resolved. (Open Q3, §10.)
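A minimal sketch of building the form body and folding the response into child wsids plus warnings. The helper names `build_form` and `expand` are hypothetical, and the code assumes the JSON response nests per-collection entries under `response.collectiondetails`. An entry with result==1 but no children is left unclassified here, because this section does not say how such an entry should be treated.

```python
from typing import Any


def build_form(collection_ids: list[str]) -> dict[str, str]:
    """Form fields for one batched GetCollectionDetails call."""
    form = {"collectioncount": str(len(collection_ids))}
    for i, cid in enumerate(collection_ids):
        form[f"publishedfileids[{i}]"] = cid
    return form


def expand(response: dict[str, Any]) -> tuple[list[str], list[dict[str, str]]]:
    """Return (child wsids, warnings) from a decoded response."""
    wsids: list[str] = []
    warnings: list[dict[str, str]] = []
    # Assumption: entries live under response -> collectiondetails.
    details = response.get("response", {}).get("collectiondetails", [])
    for entry in details:
        cid = entry.get("publishedfileid")
        children = entry.get("children") or []
        if entry.get("result") == 1 and children:
            wsids.extend(c["publishedfileid"] for c in children)
        elif entry.get("result") != 1:
            warnings.append({
                "tag": "collection-partial",
                "level": "warning",
                "msg": f"collection {cid} could not be fetched",
            })
        # result == 1 with no children: intentionally not classified here.
    return list(dict.fromkeys(wsids)), warnings
```

Deduplicating with `dict.fromkeys` keeps first-seen order, which matters if two collections share children.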

5.3 Caching

Hit on collections row where last_fetched_at > now() - interval '6 hours':

  • Skip the API call entirely.
  • Use cached child_workshop_ids directly.

Miss / stale → call API, UPSERT into collections, then proceed. The last_fetched_at = now() write is the cache write.
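The freshness test is a single comparison. A minimal sketch, where `is_collection_fresh` is a hypothetical helper and a missing row is represented as `None`:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

COLLECTION_TTL = timedelta(hours=6)


def is_collection_fresh(last_fetched_at: Optional[datetime], now: datetime) -> bool:
    """True when a collections row exists and is younger than 6 hours."""
    if last_fetched_at is None:  # no row yet: treat as a miss
        return False
    return last_fetched_at > now - COLLECTION_TTL
```

A row fetched exactly 6 hours ago counts as stale, matching the strict `>` in the SQL condition above.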

5.4 Flakiness

One internal retry with 2s backoff on HTTP error or result!=1 for a candidate. If the retry also fails, the candidate is reported as collection-partial (warning) but the job continues with whatever else resolved. (Open Q4, §10.)
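The retry policy can be sketched with an injected attempt function. `fetch_with_retry` is a hypothetical name, and `attempt` is assumed to return `None` for both an HTTP error and a result!=1 answer; how retries map onto the single batched call is not settled by this sketch.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    attempt: Callable[[], Optional[T]],
    backoff_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Try once, wait backoff_s, try once more; None means give up."""
    result = attempt()
    if result is not None:
        return result
    sleep(backoff_s)   # single fixed backoff, no exponential growth
    return attempt()   # a None here becomes a collection-partial warning
```

A `None` from the second attempt is what the caller would turn into the collection-partial warning.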

§6 Counts contract

Computed live on every GET /api/jobs/{job_id} against the job's wsids[]:

-- counts.cached
SELECT COUNT(DISTINCT mp.workshop_id)
  FROM mod_parsed mp
  JOIN workshop_meta wm ON wm.workshop_id = mp.workshop_id
 WHERE mp.workshop_id = ANY($1::text[])
   AND mp.parsed_at_time_updated = wm.time_updated;

-- counts.queued
SELECT COUNT(DISTINCT workshop_id)
  FROM download_jobs
 WHERE workshop_id = ANY($1::text[]) AND status = 'queued';

-- counts.draining   (status='downloading' in DB; surfaced as 'draining' in API/UI)
SELECT COUNT(DISTINCT workshop_id)
  FROM download_jobs
 WHERE workshop_id = ANY($1::text[]) AND status = 'downloading';

Ownership precedent (Spec A §8): once a job is created, wsids[] is locked. WORKSHOP_ITEMS_LINE in the final result_json is computed from sort_jobs.wsids[], not recomputed against current mod_parsed. This means a wsid that was in the input but is currently non_mod or unknown still appears in WORKSHOP_ITEMS_LINE in the same position - matching the locked contract from Spec A.

§7 Frontend behavior

Status strip during polling:

| Phase | Strip text |
| --- | --- |
| expanding | expanding collection… (animated dot, no counts visible) |
| queued | X cached · Y queued · 0 draining (animated dots on queued) |
| draining | X cached · Y queued · Z draining (animated dots on queued + draining) |
| done | strip collapses, full result rendered |
| failed | red banner with failure_reason + Retry button |

Polling: setInterval at 2.5s, started on receiving job_id. Stops on phase ∈ (done, failed). On 404 (job expired/garbage-collected): show "this job expired - re-submit?" toast; offer one-click resubmit using cached input (the textarea is still populated).

Cancel button: shown during expanding | queued | draining. Issues DELETE /api/jobs/{job_id}, stops polling on success, clears the strip.

The synchronous code path (no job_id in response) renders unchanged - old picker behavior, immediate result.

Owned-fields contract (Spec A §8 precedent): WORKSHOP_ITEMS_LINE, counts.queued (the picker's internal counter), unknown[], non_mod[] are still owned by the first /api/sort (or final result_json). /api/resort ignores them. The poll's counts object is purely the live drain progress and does not feed the picker's internal queued counter.

§8 Cancellation

DELETE /api/jobs/{job_id} semantics:

  • Marks sort_jobs.phase = 'failed', failure_reason = 'cancelled'. Idempotent.
  • Does not touch download_jobs. Workshop downloads in flight continue and populate mod_parsed, benefiting subsequent users via cache. Aborting them would waste partial progress and potentially trip the drain's STALE_RECLAIM_MIN reclaim path. (Open Q6, §10.)
  • Frontend stops polling, hides the strip, shows a small "cancelled" toast. The textarea retains the input.

Re-submitting the same input after cancel creates a new job. Collection-cache hits make the second submission instant if the cache hasn't expired.
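The idempotency rule can be shown with an in-memory model of the row. This is a sketch only: `SortJob` is a hypothetical stand-in for a sort_jobs row, and download_jobs is deliberately not modelled because cancel never touches it.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SortJob:
    phase: str
    failure_reason: Optional[str] = None


def cancel(job: SortJob) -> SortJob:
    """Apply DELETE /api/jobs/{job_id}; the HTTP status is 204 either way."""
    if job.phase in ("done", "failed"):
        return job  # already terminal: no-op
    return replace(job, phase="failed", failure_reason="cancelled")
```

Calling `cancel` twice yields the same row as calling it once, and a job that already finished keeps its `done` phase and its result.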

§9 Restart resilience

uvicorn boot sweep (idempotent, runs in lifespan startup):

-- Time out long-stuck expansion jobs
UPDATE sort_jobs
   SET phase = 'failed', failure_reason = 'expansion timed out',
       updated_at = now()
 WHERE phase = 'expanding'
   AND phase_started_at < now() - interval '10 minutes';

Jobs in queued / draining need no special handling - they resume polling against download_jobs on the next client GET. The phase derives live from current counts (§4 phase computation rule), so a restart in the middle of a drain is invisible to the client beyond a brief window where counts may shift.

§10 Open questions resolved

  1. Bare wsid + all-cached: synchronous or job-routed? Synchronous. The cached path is sub-100ms today; routing it through a job adds polling latency and a UI flash. Frontend branches cheaply on job_id presence.
  2. Mixed input (bare wsids + collection URLs). Treat as collection input. Job created in expanding phase immediately. Bare wsids merge into wsids[] after GetCollectionDetails resolves. No partial-sync hybrid - keeps the response shape rule clean.
  3. Partial expansion failure. Succeed with the resolvable subset. Each unresolvable collection adds a warning {tag:"collection-partial", level:"warning", msg:"collection X could not be fetched"} to result_json.WARNINGS. Job completes normally; user sees the result with one or more amber warnings.
  4. GetCollectionDetails flakiness. One internal retry with 2s backoff before reporting collection-partial. No frontend-driven retry on the GET poll - it would mask transient failures and give the user no recovery affordance. Job marked failed only if every candidate collection fails.
  5. Concurrent expansion of the same collection. Independent jobs; cache deduplicates. User A and User B paste the same collection URL near-simultaneously; both create separate sort_jobs rows. The first one's GetCollectionDetails call populates collections; the second's hits cache. Worst case (race within the cache miss window) costs one duplicate API call. In-flight cache key (e.g., collections.fetching_until) deferred to Spec G.
  6. Cancel semantics. Abandon sort_job; leave download_jobs running. Three reasons. (a) Workshop downloads benefit other users via the shared mod_parsed cache - wasting them is anti-social. (b) The drain's STALE_RECLAIM_MIN=30 reclaim path treats half-killed downloading rows as candidates for retry; introducing client-driven cancellation creates a class of races where the row is killed mid-write. (c) Worker-side cancellation requires SIGTERM-of-DD-subprocess plumbing that doesn't exist; staying out of that codepath is much cheaper.

§11 Acceptance criteria

  • POST /api/sort with all-cached bare wsids returns the synchronous shape with no job_id.
  • POST /api/sort with any uncached wsid OR any collection URL returns {status, job_id} and persists a sort_jobs row.
  • GET /api/jobs/{job_id} returns live counts and the current phase per the §4 derivation rule.
  • GET /api/jobs/{nonexistent} returns 404.
  • DELETE /api/jobs/{job_id} flips phase to failed with failure_reason="cancelled". Idempotent.
  • Collection URL https://steamcommunity.com/sharedfiles/filedetails/?id=N is detected by the parser and routed through GetCollectionDetails.
  • A collections cache hit (row younger than 6h) skips the Steam API call.
  • A collection that returns result!=1 produces a collection-partial amber warning in result_json.WARNINGS but does not fail the job (unless all collections in the input are unresolvable).
  • uvicorn restart with a job in expanding > 10min flips it to failed with failure_reason="expansion timed out".
  • uvicorn restart with a job in queued/draining is invisible to the client beyond next-poll-window jitter.
  • Frontend polls every 2.5s when phase ∈ (expanding, queued, draining); stops on terminal phase.
  • Status strip text matches the §7 table for each phase.
  • Cancel button issues DELETE, stops polling, hides strip, retains input in textarea.
  • WORKSHOP_ITEMS_LINE in result_json matches sort_jobs.wsids[] regardless of which wsids ended up in non_mod / unknown (Spec A §8 ownership preserved).

§12 Test recipes

  1. Synchronous fast path - POST /api/sort with {"input":"2169435993;2392709985;2487022075"}. Expect: response has MODS_LINE, no job_id. ~50ms.
  2. Collection URL, cold cache - clear collections row for the test ID; POST /api/sort with a known PZ collection URL. Expect: {status:"expanding", job_id:"…"} immediately. Poll: phase progresses expanding → queued → draining → done. Final result.MODS_LINE populated.
  3. Collection URL, warm cache - re-submit the same URL within 6h. Expect: phase skips expanding, goes straight to queued (or done if all children cached). One Steam API call total across both runs (verify via /var/log/... or journalctl -u sortof-api | grep GetCollectionDetails).
  4. Mixed bare + collection - POST /api/sort with "<URL>\n2169435993". Expect: job created in expanding; on resolve, wsids[] contains both the collection's children and the bare wsid; deduped.
  5. Partial collection failure - input contains two collection URLs, one valid, one to a deleted collection. Expect: job phase progresses normally; result_json.WARNINGS contains exactly one collection-partial entry; wsids[] contains only the valid collection's children.
  6. All collections fail - input contains only unresolvable collection URLs. Expect: job phase=failed, failure_reason="all input collections unresolvable".
  7. Cancel during draining - submit a 50-mod cold collection, wait until phase=draining, DELETE /api/jobs/{id}. Expect: phase=failed reason=cancelled. Verify download_jobs rows for the wsids are still in queued/downloading/done (not nuked).
  8. Restart mid-drain - submit a job, wait for phase=draining, sudo systemctl restart sortof-api. Wait 5s, GET the job. Expect: phase still derives correctly (computed from live counts), client polling resumes.
  9. Restart mid-expansion - submit a collection job, kill sortof-api mid-expansion. The race window is hard to hit deliberately; it can be simulated by running UPDATE sort_jobs SET phase='expanding', phase_started_at=now()-interval '15 minutes' on the job row, then restarting. Expect: lifespan sweep flips it to failed with failure_reason="expansion timed out".
  10. 404 on expired job - manually DELETE FROM sort_jobs WHERE job_id=…; client poll. Expect: 404. Frontend shows the expired-toast with re-submit affordance.
  11. Counts contract - at each poll during a 50-mod cold drain, sum counts.cached + counts.queued + counts.draining and compare to len(wsids). The sum never exceeds len(wsids), and equals it when every wsid resolves to a mod. (Some wsids may be non_mod post-drain; they count toward none of cached, queued, or draining because mod_parsed has no row for them. Being missing from all three buckets is the expected steady state for non-mods.)
  12. Concurrent collection submit - open two browser tabs simultaneously and submit the same URL. Expect: two distinct job_ids, but only one GetCollectionDetails call lands at Steam (verify journal). Worst case (cache-miss race): two API calls; this is acceptable.