Spec B+F - Collection URL/ID expansion + live drain progress
Date: 2026-05-01
Status: Draft (awaiting review)
Sibling specs: A multi-branch picker (shipped); C+D build-context + dep-add (next); E precacher (parallel); G cleanups + patch tier.
Folds: Original Spec F (live drain progress) merges in here - a 50+ mod cold load is exactly when live counters matter, and both features share the polling endpoint.
Schema notes (corrections to design source text):
- download_jobs.status enum is queued | downloading | done | failed. The design text used running; this spec uses the actual value downloading. UX label may render as "draining" for cohesion with the lifecycle vocabulary; the SQL keys off downloading.
- The existing collections table (init/01_schema.sql) has columns collection_id PK, title, child_workshop_ids TEXT[], last_fetched_at TIMESTAMPTZ. There is no expires_at column. TTL is computed at read time as last_fetched_at + interval '6 hours'; no schema change for that.
§1 Overview
Today, sortof accepts one input shape: a blob of newline/;-delimited workshop IDs. Anything that isn't a 7–12 digit number is dropped by parse.parse_workshop_input. A pasted Steam Workshop collection URL embeds exactly one ID, so the parser currently surfaces that ID as a single mod; it then fails parse (process_one=no_mod_info) and lands in the non_mod bucket added by the recent unknown/non-mod feature. The user is expected to copy every child mod's ID out by hand.
This spec adds:
- Collection URL/ID expansion. The API recognizes Steam Workshop URLs and resolves collection IDs to their child wsids via ISteamRemoteStorage/GetCollectionDetails. Cached in the existing collections table.
- Async job pipeline. Any input containing a collection or any uncached wsid creates a sort_jobs row, returns a job_id, and the frontend polls GET /api/jobs/{job_id} every 2.5s until done|failed.
- Live counters. During expanding | queued | draining, the poll response carries fresh cached / queued / draining counts plus an incremental result_json. The status strip animates instead of going stale.
Synchronous response is preserved for the all-cached fast path (Open Q1, §10).
§2 API contract
2.1 POST /api/sort - polymorphic on input
Request body unchanged: { "input": str, "rules": str? }. Response shape branches on what's in input:
// Path A: bare wsid list, all in cache (current behavior, unchanged)
{ "status": "success", "MOD_DB": [...], "MODS_LINE": "...", ... }
// Path B: bare wsid list with ≥1 uncached, OR ≥1 collection URL
{ "status": "queued" | "expanding", "job_id": "<uuid>" }
The frontend branches on the presence of job_id. Old clients that don't poll are unaffected as long as their input is fully warm: they still get the original sync response.
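The branching rule above can be sketched server-side as a small pure function. This is a minimal illustration, not the actual handler: route_sort and its arguments are hypothetical names, and cached stands in for whatever lookup decides that a wsid is warm.

```python
from typing import Sequence, Set


def route_sort(wsids: Sequence[str], collection_ids: Sequence[str],
               cached: Set[str]) -> str:
    """Return 'sync', 'queued', or 'expanding' for an already-parsed input.

    Hypothetical helper: `cached` is assumed to hold the wsids that are warm.
    """
    if collection_ids:
        # Any collection URL forces the job path, starting in `expanding`.
        return "expanding"
    if all(w in cached for w in wsids):
        # Path A: every bare wsid is warm, so answer synchronously.
        return "sync"
    # Path B: at least one uncached wsid, so create a job and let the client poll.
    return "queued"
```

Mixed input (bare wsids plus a collection URL) lands in expanding, which matches the resolution of Open Q2 in §10.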
2.2 GET /api/jobs/{job_id} - polling endpoint
Response (any phase):
{
  "job_id": "<uuid>",
  "phase": "expanding" | "queued" | "draining" | "done" | "failed",
  "counts": { "cached": int, "queued": int, "draining": int },
  "wsids": [str, ...] | null,              // null while phase=expanding; populated after
  "result": { ...SORTOF_DATA... } | null,  // partial during draining; final on done
  "failure_reason": str | null             // populated only on phase=failed
}
404 if the job_id is unknown or expired (TTL in §3).
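A client-side view of this contract, sketched in Python for illustration (the real frontend is browser code). The fetch callable is an assumption: it is taken to return the parsed JSON body, or None when the endpoint answers 404. The sleep function is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Optional

TERMINAL = {"done", "failed"}


def poll_job(fetch: Callable[[str], Optional[dict]], job_id: str,
             interval: float = 2.5,
             sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Poll until a terminal phase; return the last body, or None on 404."""
    while True:
        body = fetch(job_id)  # hypothetical: parsed JSON, or None for a 404
        if body is None:
            return None       # job unknown or expired; caller may offer a re-submit
        if body["phase"] in TERMINAL:
            return body
        sleep(interval)       # non-terminal phase: wait one poll window
```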
2.3 DELETE /api/jobs/{job_id} - cancel
Marks the job failed with failure_reason="cancelled". Returns 204. Idempotent: deleting an already-terminal job is a no-op 204. Does not cancel underlying download_jobs rows (Open Q6, §10).
§3 Schema
New table:
CREATE TABLE IF NOT EXISTS sort_jobs (
  job_id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  phase             TEXT NOT NULL CHECK (phase IN ('expanding','queued','draining','done','failed')),
  phase_started_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  created_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
  input_raw         TEXT NOT NULL,
  collection_ids    TEXT[] NOT NULL DEFAULT '{}',
  wsids             TEXT[],   -- null until expansion resolves
  rules_raw         TEXT,
  result_json       JSONB,    -- null until the first partial; incremental partials during draining, final on done
  failure_reason    TEXT
);
CREATE INDEX IF NOT EXISTS sort_jobs_phase_idx ON sort_jobs (phase);
CREATE INDEX IF NOT EXISTS sort_jobs_updated_idx ON sort_jobs (updated_at);
- TTL: rows with phase ∈ (done, failed) whose updated_at is more than 24h old are eligible for deletion. The cleanup script lives in Spec G; this spec only requires that the schema support it.
- updated_at trigger: mirror the existing download_jobs.touch_updated_at pattern.
- Migration plan: init/02_sort_jobs.sql for fresh deploys + a one-shot psql -f for the live DB. No data migration; pure additive.
The existing collections table is reused as-is (4 columns, see corrections at top). No expires_at column; freshness derived from last_fetched_at.
§4 Phase state machine
       ┌─────────────────────────────────────┐
       │ /api/sort with collections only     │
       ▼                                     │
┌──────────────┐  GetCollectionDetails OK    │
│  expanding   │ ────────────────────────────┘
└──────┬───────┘
       │ wsids = collections + bare ids
       ▼
┌──────────────┐  ←── /api/sort with bare uncached wsids
│    queued    │ ─────────── all wsids in mod_parsed (skip drain)
└──────┬───────┘                        │
       │ first download_jobs row → downloading
       ▼                                │
┌──────────────┐                        │
│   draining   │                        │
└──────┬───────┘                        │
       │ all wsids resolved (mod_parsed has rows)
       │                                │
       ▼                                ▼
┌──────────────┐                 ┌──────────────┐
│     done     │                 │     done     │
└──────────────┘                 └──────────────┘
Failure terminal at any phase: failed (with phase_at_failure stored in failure_reason prefix).
Phase transitions are monotonic: expanding → queued → draining → done. No backward transitions. A job's phase only advances; the API computes phase fresh on each GET rather than mutating it on every event (simpler, no leader needed).
Phase computation rule (executed inside GET /api/jobs/{job_id}):
if phase in (done, failed): return as-stored
if wsids is null: phase = expanding
elif counts.draining > 0: phase = draining
elif counts.queued > 0: phase = queued
elif counts.cached >= len(wsids): phase = done; persist result_json
else: phase = queued # transient gap between rows
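The rule above translates almost line for line into Python. A minimal sketch, assuming the stored phase, the wsids array, and the live counts have already been loaded; derive_phase is a hypothetical name, and persisting result_json on the transition to done is left to the caller.

```python
from typing import Mapping, Optional, Sequence


def derive_phase(stored: str, wsids: Optional[Sequence[str]],
                 counts: Mapping[str, int]) -> str:
    """Derive the phase reported by GET /api/jobs/{job_id}."""
    if stored in ("done", "failed"):
        return stored          # terminal phases are returned as stored
    if wsids is None:
        return "expanding"     # expansion has not resolved yet
    if counts["draining"] > 0:
        return "draining"
    if counts["queued"] > 0:
        return "queued"
    if counts["cached"] >= len(wsids):
        return "done"          # caller persists result_json at this point
    return "queued"            # transient gap between rows
```

Because the function is pure, the same inputs always give the same phase, which is what makes a restart mid-drain invisible to the client.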
§5 Steam expansion
5.1 Detection
The current parse.parse_workshop_input strips ini-style prefixes and extracts \b\d{7,12}\b. We add a sibling parse.parse_with_collections(text) -> (wsids: list, collection_ids: list):
- Match Steam URLs https?://steamcommunity\.com/(?:sharedfiles|workshop)/filedetails/\?id=(\d{7,12}) and capture the ID.
- Bare numeric IDs (the existing pattern) remain wsids.
- A URL-form ID is classified as a candidate collection. We don't know syntactically whether a wsid is a collection vs a mod - so candidate collection IDs are sent to GetCollectionDetails first; if the API reports them as actual mods (not collections), they fall back to the wsids list.
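A minimal sketch of the detection step, using the two patterns quoted in this section. It is an illustration under assumptions: the ini-style prefix stripping done by the existing parser is omitted, and order-preserving de-duplication is a choice made here, not something the spec states.

```python
import re
from typing import List, Tuple

_URL_RE = re.compile(
    r"https?://steamcommunity\.com/(?:sharedfiles|workshop)/filedetails/\?id=(\d{7,12})"
)
_ID_RE = re.compile(r"\b\d{7,12}\b")


def parse_with_collections(text: str) -> Tuple[List[str], List[str]]:
    """Split raw input into (bare wsids, candidate collection IDs)."""
    # IDs captured from Steam URLs are candidate collections.
    collection_ids = list(dict.fromkeys(_URL_RE.findall(text)))
    # Blank out the URLs so their IDs are not counted again as bare wsids.
    remainder = _URL_RE.sub(" ", text)
    wsids = list(dict.fromkeys(_ID_RE.findall(remainder)))
    return wsids, collection_ids
```

For example, a URL followed by two ;-separated IDs yields one candidate collection and two bare wsids; candidates that turn out to be plain mods are moved to the wsids list later, during resolution.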
5.2 Resolution
Single batched call per /api/sort with ≥1 candidate:
POST https://api.steampowered.com/ISteamRemoteStorage/GetCollectionDetails/v1/
collectioncount=N
publishedfileids[0..N-1]=...
Per-collection in the response: result==1 and children[] populated → expand to [c.publishedfileid for c in children]. result!=1 → mark in result warnings as {tag:"collection-partial", level:"warning", msg:"collection X could not be fetched"}; keep the job alive with whatever resolved. (Open Q3, §10.)
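The request shape and the per-collection handling can be sketched without touching the network. Two assumptions are made here and should be checked against the Steam Web API reference: that the JSON body nests entries under response.collectiondetails, and that a candidate which is really a mod comes back with result==1 and no children (this spec does not say how Steam signals that case). The function names are hypothetical.

```python
from typing import Dict, List, Tuple


def build_collection_request(ids: List[str]) -> Dict[str, str]:
    """Form fields for one batched GetCollectionDetails call."""
    payload = {"collectioncount": str(len(ids))}
    for i, cid in enumerate(ids):
        payload[f"publishedfileids[{i}]"] = cid
    return payload


def expand_collections(response: dict) -> Tuple[List[str], List[dict]]:
    """Return (wsids, warnings) from an already-decoded response body."""
    wsids: List[str] = []
    warnings: List[dict] = []
    # Assumed shape: {"response": {"collectiondetails": [ {...}, ... ]}}
    for detail in response.get("response", {}).get("collectiondetails", []):
        cid = detail.get("publishedfileid")
        children = detail.get("children") or []
        if detail.get("result") == 1 and children:
            wsids.extend(c["publishedfileid"] for c in children)
        elif detail.get("result") == 1:
            wsids.append(cid)  # assumption: resolved but childless means a plain mod
        else:
            warnings.append({"tag": "collection-partial", "level": "warning",
                             "msg": f"collection {cid} could not be fetched"})
    return wsids, warnings
```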
5.3 Caching
Hit on collections row where last_fetched_at > now() - interval '6 hours':
- Skip the API call entirely.
- Use cached child_workshop_ids directly.
Miss / stale → call API, UPSERT into collections, then proceed. The last_fetched_at = now() write is the cache write.
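The freshness test is the same comparison the SQL performs, shown here in Python for the case where the row has already been loaded. is_fresh and COLLECTION_TTL are hypothetical names; timestamps are assumed to be timezone-aware, as TIMESTAMPTZ values are.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

COLLECTION_TTL = timedelta(hours=6)


def is_fresh(last_fetched_at: Optional[datetime],
             now: Optional[datetime] = None) -> bool:
    """True when a collections row may be used without calling Steam."""
    if last_fetched_at is None:
        return False  # no row yet: treat as a miss
    now = now or datetime.now(timezone.utc)
    return last_fetched_at > now - COLLECTION_TTL
```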
5.4 Flakiness
One internal retry with 2s backoff on HTTP error or result!=1 for a candidate. If the retry also fails, the candidate is reported as collection-partial (warning) but the job continues with whatever else resolved. (Open Q4, §10.)
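A sketch of the retry policy, kept independent of any HTTP client. The call argument is a hypothetical zero-argument function that returns the resolved children, returns None for result!=1, or raises on an HTTP error; catching every Exception is a simplification for the sketch.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_one_retry(call: Callable[[], Optional[T]], backoff: float = 2.0,
                   sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Run call(); on an exception or a None result, back off and try once more."""
    for attempt in range(2):
        try:
            result = call()
            if result is not None:
                return result
        except Exception:
            pass  # an HTTP error is handled like an unresolved result
        if attempt == 0:
            sleep(backoff)  # single 2s pause before the only retry
    return None  # caller reports the candidate as collection-partial
```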
§6 Counts contract
Computed live on every GET /api/jobs/{job_id} against the job's wsids[]:
-- counts.cached
SELECT COUNT(DISTINCT mp.workshop_id)
FROM mod_parsed mp
JOIN workshop_meta wm ON wm.workshop_id = mp.workshop_id
WHERE mp.workshop_id = ANY($1::text[])
AND mp.parsed_at_time_updated = wm.time_updated;
-- counts.queued
SELECT COUNT(DISTINCT workshop_id)
FROM download_jobs
WHERE workshop_id = ANY($1::text[]) AND status = 'queued';
-- counts.draining (status='downloading' in DB; surfaced as 'draining' in API/UI)
SELECT COUNT(DISTINCT workshop_id)
FROM download_jobs
WHERE workshop_id = ANY($1::text[]) AND status = 'downloading';
Ownership precedent (Spec A §8): once a job is created, wsids[] is locked. WORKSHOP_ITEMS_LINE in the final result_json is computed from sort_jobs.wsids[], not recomputed against current mod_parsed. This means a wsid that was in the input but is currently non_mod or unknown still appears in WORKSHOP_ITEMS_LINE in the same position - matching the locked contract from Spec A.
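For unit tests it can help to have an in-memory mirror of the three queries. The sketch below is such a mirror under simplifying assumptions: mod_parsed and workshop_meta are modeled as dicts from workshop_id to a time_updated value (one row per mod), and download_jobs as (workshop_id, status) pairs. All names are hypothetical.

```python
from typing import Dict, Iterable, Mapping, Sequence, Tuple


def compute_counts(wsids: Sequence[str],
                   mod_parsed: Mapping[str, int],
                   workshop_meta: Mapping[str, int],
                   download_jobs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Mirror counts.cached / counts.queued / counts.draining for one job."""
    wanted = set(wsids)
    jobs = list(download_jobs)
    # cached: parsed at the same time_updated that workshop_meta reports
    cached = {w for w in wanted
              if w in mod_parsed and w in workshop_meta
              and mod_parsed[w] == workshop_meta[w]}
    queued = {w for w, status in jobs if w in wanted and status == "queued"}
    # the DB status 'downloading' is surfaced as 'draining'
    draining = {w for w, status in jobs if w in wanted and status == "downloading"}
    return {"cached": len(cached), "queued": len(queued), "draining": len(draining)}
```

Sets give the same de-duplication as COUNT(DISTINCT ...), and a wsid with no row anywhere (for example a non-mod) simply appears in none of the three buckets.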
§7 Frontend behavior
Status strip during polling:
| Phase | Strip text |
|---|---|
| expanding | expanding collection… (animated dot, no counts visible) |
| queued | X cached · Y queued · 0 draining (animated dots on queued) |
| draining | X cached · Y queued · Z draining (animated dots on queued + draining) |
| done | strip collapses, full result rendered |
| failed | red banner with failure_reason + Retry button |
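The table maps directly onto a small text function, sketched here in Python for illustration (the real strip is frontend code). strip_text is a hypothetical name; animation, the red banner styling, and the Retry button are outside what it models.

```python
from typing import Mapping, Optional


def strip_text(phase: str, counts: Mapping[str, int],
               failure_reason: Optional[str] = None) -> Optional[str]:
    """Text for the status strip, or None when the strip collapses."""
    if phase == "expanding":
        return "expanding collection…"  # no counts visible yet
    if phase == "queued":
        return f"{counts['cached']} cached · {counts['queued']} queued · 0 draining"
    if phase == "draining":
        return (f"{counts['cached']} cached · {counts['queued']} queued · "
                f"{counts['draining']} draining")
    if phase == "failed":
        return failure_reason  # shown in the banner
    return None  # done: strip collapses, full result rendered
```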
Polling: setInterval at 2.5s, started on receiving job_id. Stops on phase ∈ (done, failed). On 404 (job expired/garbage-collected): show "this job expired - re-submit?" toast; offer one-click resubmit using cached input (the textarea is still populated).
Cancel button: shown during expanding | queued | draining. Issues DELETE /api/jobs/{job_id}, stops polling on success, clears the strip.
The synchronous code path (no job_id in response) renders unchanged - old picker behavior, immediate result.
Owned-fields contract (Spec A §8 precedent): WORKSHOP_ITEMS_LINE, counts.queued (the picker's internal counter), unknown[], non_mod[] are still owned by the first /api/sort (or final result_json). /api/resort ignores them. The poll's counts object is purely the live drain progress and does not feed the picker's internal queued counter.
§8 Cancellation
DELETE /api/jobs/{job_id} semantics:
- Marks sort_jobs.phase = 'failed', failure_reason = 'cancelled'. Idempotent.
- Does not touch download_jobs. Workshop downloads in flight continue and populate mod_parsed, benefiting subsequent users via cache. Aborting them would waste partial progress and potentially trip the drain's STALE_RECLAIM_MIN reclaim path. (Open Q6, §10.)
- Frontend stops polling, hides the strip, shows a small "cancelled" toast. The textarea retains the input.
Re-submitting the same input after cancel creates a new job. Collection-cache hits make the second submission instant if the cache hasn't expired.
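The idempotency rule can be made concrete with a sketch over an in-memory job row. cancel_job is a hypothetical helper; a real handler would perform the same check and update against sort_jobs and would not touch download_jobs.

```python
TERMINAL = ("done", "failed")


def cancel_job(job: dict) -> int:
    """Apply DELETE /api/jobs/{job_id} semantics to a job row; returns the HTTP status."""
    if job["phase"] not in TERMINAL:
        job["phase"] = "failed"
        job["failure_reason"] = "cancelled"
    # Already-terminal jobs are left untouched: a no-op that still answers 204.
    return 204
```

Calling it twice on the same job leaves the row exactly as the first call left it, and a job that already finished as done keeps its result.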
§9 Restart resilience
uvicorn boot sweep (idempotent, runs in lifespan startup):
-- Time out long-stuck expansion jobs
UPDATE sort_jobs
SET phase = 'failed', failure_reason = 'expansion timed out',
updated_at = now()
WHERE phase = 'expanding'
AND phase_started_at < now() - interval '10 minutes';
Jobs in queued / draining need no special handling - they resume polling against download_jobs on the next client GET. The phase derives live from current counts (§4 phase computation rule), so a restart in the middle of a drain is invisible to the client beyond a brief window where counts may shift.
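One way to wire the sweep into startup, sketched under the assumption that the app uses a Starlette/FastAPI-style lifespan context manager (the spec names only uvicorn). The execute coroutine is a hypothetical stand-in for the project's database helper, and smoke_check exists only so the hook can be exercised outside a server.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Awaitable, Callable

SWEEP_SQL = """
UPDATE sort_jobs
SET phase = 'failed', failure_reason = 'expansion timed out',
    updated_at = now()
WHERE phase = 'expanding'
  AND phase_started_at < now() - interval '10 minutes'
"""


def make_lifespan(execute: Callable[[str], Awaitable[object]]):
    """Build a lifespan hook that runs the boot sweep once before serving."""
    @asynccontextmanager
    async def lifespan(app):
        await execute(SWEEP_SQL)  # idempotent: a second run matches no rows
        yield                     # the app serves requests while suspended here
    return lifespan


def smoke_check(lifespan) -> None:
    """Enter and exit the lifespan once, e.g. from a test."""
    async def _run():
        async with lifespan(None):
            pass
    asyncio.run(_run())
```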
§10 Open questions resolved
1. Bare wsid + all-cached: synchronous or job-routed? Synchronous. The cached path is sub-100ms today; routing it through a job adds polling latency and a UI flash. Frontend branches cheaply on job_id presence.
2. Mixed input (bare wsids + collection URLs). Treat as collection input. Job created in expanding phase immediately. Bare wsids merge into wsids[] after GetCollectionDetails resolves. No partial-sync hybrid - keeps the response shape rule clean.
3. Partial expansion failure. Succeed with the resolvable subset. Each unresolvable collection adds a warning {tag:"collection-partial", level:"warning", msg:"collection X could not be fetched"} to result_json.WARNINGS. Job completes normally; user sees the result with one or more amber warnings.
4. GetCollectionDetails flakiness. One internal retry with 2s backoff before reporting collection-partial. No frontend-driven retry on the GET poll - it would mask transient failures and give the user no recovery affordance. Job marked failed only if every candidate collection fails.
5. Concurrent expansion of the same collection. Independent jobs; cache deduplicates. User A and User B paste the same collection URL near-simultaneously; both create separate sort_jobs rows. The first one's GetCollectionDetails call populates collections; the second's hits cache. Worst case (race within the cache miss window) costs one duplicate API call. In-flight cache key (e.g., collections.fetching_until) deferred to Spec G.
6. Cancel semantics. Abandon sort_job; leave download_jobs running. Three reasons. (a) Workshop downloads benefit other users via the shared mod_parsed cache - wasting them is anti-social. (b) The drain's STALE_RECLAIM_MIN=30 reclaim path treats half-killed downloading rows as candidates for retry; introducing client-driven cancellation creates a class of races where the row is killed mid-write. (c) Worker-side cancellation requires SIGTERM-of-DD-subprocess plumbing that doesn't exist; staying out of that codepath is much cheaper.
§11 Acceptance criteria
- POST /api/sort with all-cached bare wsids returns the synchronous shape with no job_id.
- POST /api/sort with any uncached wsid OR any collection URL returns {status, job_id} and persists a sort_jobs row.
- GET /api/jobs/{job_id} returns live counts and the current phase per the §4 derivation rule.
- GET /api/jobs/{nonexistent} returns 404.
- DELETE /api/jobs/{job_id} flips phase to failed with failure_reason="cancelled". Idempotent.
- Collection URL https://steamcommunity.com/sharedfiles/filedetails/?id=N is detected by the parser and routed through GetCollectionDetails.
- A collections cache hit (row younger than 6h) skips the Steam API call.
- A collection that returns result!=1 produces a collection-partial amber warning in result_json.WARNINGS but does not fail the job (unless all collections in the input are unresolvable).
- uvicorn restart with a job in expanding > 10min flips it to failed with failure_reason="expansion timed out".
- uvicorn restart with a job in queued/draining is invisible to the client beyond next-poll-window jitter.
- Frontend polls every 2.5s when phase ∈ (expanding, queued, draining); stops on terminal phase.
- Status strip text matches the §7 table for each phase.
- Cancel button issues DELETE, stops polling, hides strip, retains input in textarea.
- WORKSHOP_ITEMS_LINE in result_json matches sort_jobs.wsids[] regardless of which wsids ended up in non_mod/unknown (Spec A §8 ownership preserved).
§12 Test recipes
- Synchronous fast path - POST /api/sort with {"input":"2169435993;2392709985;2487022075"}. Expect: response has MODS_LINE, no job_id. ~50ms.
- Collection URL, cold cache - clear collections row for the test ID; POST /api/sort with a known PZ collection URL. Expect: {status:"expanding", job_id:"…"} immediately. Poll: phase progresses expanding → queued → draining → done. Final result.MODS_LINE populated.
- Collection URL, warm cache - re-submit the same URL within 6h. Expect: phase skips expanding, goes straight to queued (or done if all children cached). One Steam API call total across both runs (verify via /var/log/... or journalctl -u sortof-api | grep GetCollectionDetails).
- Mixed bare + collection - POST /api/sort with "<URL>\n2169435993". Expect: job created in expanding; on resolve, wsids[] contains both the collection's children and the bare wsid; deduped.
- Partial collection failure - input contains two collection URLs, one valid, one to a deleted collection. Expect: job phase progresses normally; result_json.WARNINGS contains exactly one collection-partial entry; wsids[] contains only the valid collection's children.
- All collections fail - input contains only unresolvable collection URLs. Expect: job phase=failed, failure_reason="all input collections unresolvable".
- Cancel during draining - submit a 50-mod cold collection, wait until phase=draining, DELETE /api/jobs/{id}. Expect: phase=failed reason=cancelled. Verify download_jobs rows for the wsids are still in queued/downloading/done (not nuked).
- Restart mid-drain - submit a job, wait for phase=draining, sudo systemctl restart sortof-api. Wait 5s, GET the job. Expect: phase still derives correctly (computed from live counts), client polling resumes.
- Restart mid-expansion - submit a collection job, kill sortof-api mid-expansion (race window: hard to hit deliberately; can simulate by directly SET phase='expanding', phase_started_at=now()-interval '15 minutes' then restart). Expect: lifespan sweep flips it to failed with failure_reason="expansion timed out".
- 404 on expired job - manually DELETE FROM sort_jobs WHERE job_id=…; client poll. Expect: 404. Frontend shows the expired-toast with re-submit affordance.
- Counts contract - at each poll during a 50-mod cold drain, sum counts.cached + counts.queued + counts.draining and compare to len(wsids). Expect: equal at every snapshot, except for wsids that turn out to be non_mod post-drain. Those have no mod_parsed row, so they show up as cached=0, queued=0, draining=0; being missing from all three buckets is the expected steady state for non-mods.
- Concurrent collection submit - open two browser tabs simultaneously and submit the same URL. Expect: two distinct job_ids, but only one GetCollectionDetails call lands at Steam (verify journal). Worst case (cache-miss race): two API calls; this is acceptable.