Add full sortof codebase: API, drain workers, frontend, schema, specs

2026-05-04 03:27:54 +00:00
parent acda2c90f8
commit 55d3794bfb
43 changed files with 13375 additions and 53 deletions
--- a/docs/specs/2026-05-01-collection-expansion.md
+++ b/docs/specs/2026-05-01-collection-expansion.md
@@ -0,0 +1,270 @@
+# Spec B+F - Collection URL/ID expansion + live drain progress
+
+**Date:** 2026-05-01
+**Status:** Draft (awaiting review)
+**Sibling specs:** A multi-branch picker (shipped); C+D build-context + dep-add (next); E precacher (parallel); G cleanups + patch tier.
+**Folds:** Original Spec F (live drain progress) merges in here - a 50+ mod cold load is exactly when live counters matter, and both features share the polling endpoint.
+
+**Schema notes (corrections to design source text):**
+- `download_jobs.status` enum is `queued | downloading | done | failed`. The design text used `running`; this spec uses the actual value `downloading`. UX label may render as "draining" for cohesion with the lifecycle vocabulary; the SQL keys off `downloading`.
+- The existing `collections` table (`init/01_schema.sql`) has columns `collection_id PK, title, child_workshop_ids TEXT[], last_fetched_at TIMESTAMPTZ`. There is **no `expires_at` column**. TTL is computed at read time as `last_fetched_at + interval '6 hours'`; no schema change for that.
+
+---
+
+## §1 Overview
+
+Today, sortof accepts one input shape: a blob of newline/`;`-delimited workshop IDs. Anything that isn't a 7–12 digit number is dropped by `parse.parse_workshop_input`. A pasted Steam Workshop *collection* URL embeds exactly one ID, so that ID is currently treated as a single mod: it fails parse (`process_one=no_mod_info`) and lands in the `non_mod` bucket added by the recent unknown/non-mod feature. The user is expected to copy every child mod's ID out by hand.
+
+This spec adds:
+1. **Collection URL/ID expansion.** The API recognizes Steam Workshop URLs and resolves collection IDs to their child wsids via `ISteamRemoteStorage/GetCollectionDetails`. Cached in the existing `collections` table.
+2. **Async job pipeline.** Any input containing a collection or any uncached wsid creates a `sort_jobs` row, returns a `job_id`, and the frontend polls `GET /api/jobs/{job_id}` every 2.5s until `done|failed`.
+3. **Live counters.** During `expanding | queued | draining`, the poll response carries fresh `cached / queued / draining` counts plus an incremental `result_json`. The status strip animates instead of going stale.
+
+The synchronous response is preserved for the all-cached fast path (Open Q1, §10).
+
+## §2 API contract
+
+### 2.1 `POST /api/sort` - polymorphic on input
+
+Request body unchanged: `{ "input": str, "rules": str? }`. Response shape branches on what's in `input`:
+
+```jsonc
+// Path A: bare wsid list, all in cache (current behavior, unchanged)
+{ "status": "success", "MOD_DB": [...], "MODS_LINE": "...", ... }
+
+// Path B: bare wsid list with ≥1 uncached, OR ≥1 collection URL
+{ "status": "queued" | "expanding", "job_id": "<uuid>" }
+```
+
+The frontend branches on the presence of `job_id`. Old clients that don't poll still get the original sync response, unchanged, when their input is fully warm.
+
+### 2.2 `GET /api/jobs/{job_id}` - polling endpoint
+
+Response (any phase):
+```jsonc
+{
+  "job_id": "<uuid>",
+  "phase":  "expanding" | "queued" | "draining" | "done" | "failed",
+  "counts": { "cached": int, "queued": int, "draining": int },
+  "wsids":  [str, ...] | null,        // null while phase=expanding; populated after
+  "result": { ...SORTOF_DATA... } | null,   // partial during draining; final on done
+  "failure_reason": str | null         // populated only on phase=failed
+}
+```
+
+`404` if the `job_id` is unknown or expired (TTL in §3).
+
+### 2.3 `DELETE /api/jobs/{job_id}` - cancel
+
+Marks the job `failed` with `failure_reason="cancelled"`. Returns `204`. Idempotent: deleting an already-terminal job is a no-op `204`. Does **not** cancel underlying `download_jobs` rows (Open Q6, §10).
+
+## §3 Schema
+
+New table:
+
+```sql
+CREATE TABLE IF NOT EXISTS sort_jobs (
+    job_id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    phase            TEXT NOT NULL CHECK (phase IN ('expanding','queued','draining','done','failed')),
+    phase_started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
+    created_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
+    updated_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
+    input_raw        TEXT NOT NULL,
+    collection_ids   TEXT[] NOT NULL DEFAULT '{}',
+    wsids            TEXT[],                              -- null until expansion resolves
+    rules_raw        TEXT,
+    result_json      JSONB,                               -- null until the first partial is written; partial while draining, final on done
+    failure_reason   TEXT
+);
+CREATE INDEX IF NOT EXISTS sort_jobs_phase_idx ON sort_jobs (phase);
+CREATE INDEX IF NOT EXISTS sort_jobs_updated_idx ON sort_jobs (updated_at);
+```
+
+- **TTL:** rows with `phase ∈ (done, failed)` whose `updated_at` is more than 24h old are eligible for deletion. The cleanup script lives in Spec G; this spec only requires that the schema support it.
+- **`updated_at` trigger:** mirror the existing `download_jobs.touch_updated_at` pattern.
+- **Migration plan:** `init/02_sort_jobs.sql` for fresh deploys + a one-shot `psql -f` for the live DB. No data migration; pure additive.
+
+The existing `collections` table is reused as-is (4 columns, see corrections at top). No `expires_at` column; freshness derived from `last_fetched_at`.
+
+## §4 Phase state machine
+
+```
+                    ┌──────────────────────────────────┐
+                    │ /api/sort with collections only  │
+                    ▼                                   │
+          ┌──────────────┐  GetCollectionDetails OK    │
+          │  expanding   │ ────────────────────────────┘
+          └──────┬───────┘
+                 │ wsids = collections + bare ids
+                 ▼
+          ┌──────────────┐  ←── /api/sort with bare uncached wsids
+          │   queued     │ ─────────── all wsids in mod_parsed (skip drain)
+          └──────┬───────┘                              │
+                 │ first download_jobs row → downloading
+                 ▼                                       │
+          ┌──────────────┐                               │
+          │   draining   │                               │
+          └──────┬───────┘                               │
+                 │ all wsids resolved (mod_parsed has rows)
+                 │                                       │
+                 ▼                                       ▼
+          ┌──────────────┐               ┌──────────────┐
+          │     done     │               │     done     │
+          └──────────────┘               └──────────────┘
+
+Failure terminal at any phase: failed (with phase_at_failure stored in failure_reason prefix).
+```
+
+Phase transitions are **monotonic**: `expanding → queued → draining → done`. No backward transitions. A job's phase only advances; the API computes phase fresh on each `GET` rather than mutating it on every event (simpler, no leader needed).
+
+Phase computation rule (executed inside `GET /api/jobs/{job_id}`):
+
+```
+if phase in (done, failed):           return as-stored
+if wsids is null:                     phase = expanding
+elif counts.draining > 0:             phase = draining
+elif counts.queued > 0:               phase = queued
+elif counts.cached >= len(wsids):     phase = done; persist result_json
+else:                                 phase = queued      # transient gap between rows
+```
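+
+The rule reads top to bottom and the first match wins. A minimal Python sketch of that ordering, assuming a hypothetical helper `derive_phase` that receives the stored phase plus the three live counts (neither the function name nor its signature comes from the codebase):
+
+```python
+from typing import Optional, Sequence
+
+# Phases that are returned exactly as stored.
+TERMINAL = {"done", "failed"}
+
+
+def derive_phase(
+    stored_phase: str,
+    wsids: Optional[Sequence[str]],
+    cached: int,
+    queued: int,
+    draining: int,
+) -> str:
+    """Apply the phase computation rule; the first matching branch wins."""
+    if stored_phase in TERMINAL:
+        return stored_phase
+    if wsids is None:
+        return "expanding"
+    if draining > 0:
+        return "draining"
+    if queued > 0:
+        return "queued"
+    if cached >= len(wsids):
+        return "done"  # the caller would persist result_json at this point
+    return "queued"  # transient gap between rows
+```
+
+Keeping the rule a pure function of the stored phase and the counts would let each branch be unit-tested without a database.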
+
+## §5 Steam expansion
+
+### 5.1 Detection
+The current `parse.parse_workshop_input` strips ini-style prefixes and extracts `\b\d{7,12}\b`. We add a sibling `parse.parse_with_collections(text) -> (wsids: list, collection_ids: list)`:
+
+- Match Steam URLs `https?://steamcommunity\.com/(?:sharedfiles|workshop)/filedetails/\?id=(\d{7,12})` and capture the ID.
+- Bare numeric IDs (the existing pattern) remain `wsids`.
+- A URL-form ID is classified as a *candidate collection*. Syntax alone cannot tell a collection ID from a mod ID, so candidate collection IDs are sent to `GetCollectionDetails` first; if the API reports them as mods (not collections), they fall back to the wsids list.
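+
+A rough sketch of the detection step, using the two patterns quoted above. It assumes URL matches are removed from the text before the bare-ID scan so a URL's ID is not counted twice; the ini-prefix stripping done by `parse.parse_workshop_input` is omitted, and the helper `_dedupe` is hypothetical:
+
+```python
+import re
+from typing import Iterable, List, Tuple
+
+# Patterns copied from the spec text.
+URL_RE = re.compile(
+    r"https?://steamcommunity\.com/(?:sharedfiles|workshop)/filedetails/\?id=(\d{7,12})"
+)
+BARE_ID_RE = re.compile(r"\b\d{7,12}\b")
+
+
+def _dedupe(items: Iterable[str]) -> List[str]:
+    """Drop duplicates while keeping first-seen order."""
+    return list(dict.fromkeys(items))
+
+
+def parse_with_collections(text: str) -> Tuple[List[str], List[str]]:
+    """Return (wsids, candidate collection_ids) found in the raw input."""
+    collection_ids = _dedupe(URL_RE.findall(text))
+    # Blank out the URLs so their IDs are not re-read as bare wsids.
+    remainder = URL_RE.sub(" ", text)
+    wsids = _dedupe(BARE_ID_RE.findall(remainder))
+    return wsids, collection_ids
+
+
+text = "https://steamcommunity.com/sharedfiles/filedetails/?id=2169435993\n2392709985;2487022075"
+print(parse_with_collections(text))
+# (['2392709985', '2487022075'], ['2169435993'])
+```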
+
+### 5.2 Resolution
+Single batched call per `/api/sort` with ≥1 candidate:
+
+```
+POST https://api.steampowered.com/ISteamRemoteStorage/GetCollectionDetails/v1/
+  collectioncount=N
+  publishedfileids[0..N-1]=...
+```
+
+Per collection in the response: if `result==1` and `children[]` is populated, expand to `[c.publishedfileid for c in children]`. If `result!=1`, record a warning `{tag:"collection-partial", level:"warning", msg:"collection X could not be fetched"}` in the result warnings and keep the job alive with whatever resolved. (Open Q3, §10.)
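+
+As an illustration of the per-collection handling, the sketch below walks a decoded response. It assumes the JSON nests the entries under `response.collectiondetails`; the function name `expand_collections` is hypothetical. The case of `result==1` with no children is left unhandled because this spec does not say how a plain mod is reported:
+
+```python
+from typing import Any, Dict, List, Tuple
+
+
+def expand_collections(payload: Dict[str, Any]) -> Tuple[List[str], List[Dict[str, str]]]:
+    """Return (child wsids, warnings) from a decoded GetCollectionDetails response."""
+    wsids: List[str] = []
+    warnings: List[Dict[str, str]] = []
+    # Assumed nesting: {"response": {"collectiondetails": [...]}}
+    for detail in payload.get("response", {}).get("collectiondetails", []):
+        collection_id = str(detail.get("publishedfileid", ""))
+        children = detail.get("children") or []
+        if detail.get("result") == 1 and children:
+            wsids.extend(str(child["publishedfileid"]) for child in children)
+        elif detail.get("result") != 1:
+            warnings.append({
+                "tag": "collection-partial",
+                "level": "warning",
+                "msg": f"collection {collection_id} could not be fetched",
+            })
+        # result == 1 with no children: intentionally not handled in this sketch.
+    # Dedupe children shared by several collections, keeping first-seen order.
+    return list(dict.fromkeys(wsids)), warnings
+```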
+
+### 5.3 Caching
+Hit on `collections` row where `last_fetched_at > now() - interval '6 hours'`:
+- Skip the API call entirely.
+- Use cached `child_workshop_ids` directly.
+
+Miss / stale → call API, UPSERT into `collections`, then proceed. The `last_fetched_at = now()` write is the cache write.
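+
+The freshness test is a single comparison. A small Python mirror of the SQL predicate, useful for unit tests; the names `COLLECTION_TTL` and `is_collection_fresh` are illustrative, not taken from the codebase:
+
+```python
+from datetime import datetime, timedelta, timezone
+from typing import Optional
+
+COLLECTION_TTL = timedelta(hours=6)
+
+
+def is_collection_fresh(
+    last_fetched_at: Optional[datetime],
+    now: Optional[datetime] = None,
+) -> bool:
+    """True when the cached row is younger than the 6-hour TTL."""
+    if last_fetched_at is None:  # no row yet: treat as a miss
+        return False
+    now = now or datetime.now(timezone.utc)
+    # Same strict comparison as: last_fetched_at > now() - interval '6 hours'
+    return last_fetched_at > now - COLLECTION_TTL
+```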
+
+### 5.4 Flakiness
+One internal retry with 2s backoff on HTTP error or `result!=1` for a candidate. If the retry also fails, the candidate is reported as collection-partial (warning) and the job continues with whatever else resolved. (Open Q4, §10.)
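+
+One way the retry could be structured, as a sketch only. It assumes a caller-supplied `fetch` callable that returns `None` for a failed attempt (HTTP error or `result!=1`) or raises; the helper name and the injectable `sleep` parameter are hypothetical, the latter so tests need not wait:
+
+```python
+import time
+from typing import Callable, Optional, TypeVar
+
+T = TypeVar("T")
+
+
+def call_with_one_retry(
+    fetch: Callable[[], Optional[T]],
+    backoff_s: float = 2.0,
+    sleep: Callable[[float], None] = time.sleep,
+) -> Optional[T]:
+    """Try fetch once, then once more after backoff_s; None if both attempts fail."""
+    for attempt in range(2):
+        try:
+            result = fetch()
+        except Exception:  # broad on purpose in this sketch; narrow in real code
+            result = None
+        if result is not None:
+            return result
+        if attempt == 0:
+            sleep(backoff_s)  # back off only before the single retry
+    return None
+```
+
+A `None` return is the signal to record the `collection-partial` warning for that candidate.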
+
+## §6 Counts contract
+
+Computed live on every `GET /api/jobs/{job_id}` against the job's `wsids[]`:
+
+```sql
+-- counts.cached
+SELECT COUNT(DISTINCT mp.workshop_id)
+  FROM mod_parsed mp
+  JOIN workshop_meta wm ON wm.workshop_id = mp.workshop_id
+ WHERE mp.workshop_id = ANY($1::text[])
+   AND mp.parsed_at_time_updated = wm.time_updated;
+
+-- counts.queued
+SELECT COUNT(DISTINCT workshop_id)
+  FROM download_jobs
+ WHERE workshop_id = ANY($1::text[]) AND status = 'queued';
+
+-- counts.draining   (status='downloading' in DB; surfaced as 'draining' in API/UI)
+SELECT COUNT(DISTINCT workshop_id)
+  FROM download_jobs
+ WHERE workshop_id = ANY($1::text[]) AND status = 'downloading';
+```
+
+Ownership precedent (Spec A §8): once a job is created, `wsids[]` is **locked**. `WORKSHOP_ITEMS_LINE` in the final `result_json` is computed from `sort_jobs.wsids[]`, **not** recomputed against current `mod_parsed`. This means a wsid that was in the input but is currently `non_mod` or `unknown` still appears in `WORKSHOP_ITEMS_LINE` in the same position - matching the locked contract from Spec A.
+
+## §7 Frontend behavior
+
+Status strip during polling:
+
+| Phase | Strip text |
+|---|---|
+| `expanding` | `expanding collection…` (animated dot, no counts visible) |
+| `queued` | `X cached · Y queued · 0 draining` (animated dots on queued) |
+| `draining` | `X cached · Y queued · Z draining` (animated dots on queued + draining) |
+| `done` | strip collapses, full result rendered |
+| `failed` | red banner with `failure_reason` + Retry button |
+
+Polling: `setInterval` at 2.5s, started on receiving `job_id`. Stops on `phase ∈ (done, failed)`. On `404` (job expired/garbage-collected): show "this job expired - re-submit?" toast; offer one-click resubmit using cached input (the textarea is still populated).
+
+Cancel button: shown during `expanding | queued | draining`. Issues `DELETE /api/jobs/{job_id}`, stops polling on success, clears the strip.
+
+The synchronous code path (no `job_id` in response) renders unchanged - old picker behavior, immediate result.
+
+Owned-fields contract (Spec A §8 precedent): `WORKSHOP_ITEMS_LINE`, `counts.queued` (the picker's internal counter), `unknown[]`, `non_mod[]` are still owned by the **first** `/api/sort` (or final `result_json`). `/api/resort` ignores them. The poll's `counts` object is purely the live drain progress and does not feed the picker's internal queued counter.
+
+## §8 Cancellation
+
+`DELETE /api/jobs/{job_id}` semantics:
+
+- Marks `sort_jobs.phase = 'failed'`, `failure_reason = 'cancelled'`. Idempotent.
+- **Does not** touch `download_jobs`. Workshop downloads in flight continue and populate `mod_parsed`, benefiting subsequent users via cache. Aborting them would waste partial progress and potentially trip the drain's `STALE_RECLAIM_MIN` reclaim path. (Open Q6, §10.)
+- Frontend stops polling, hides the strip, shows a small "cancelled" toast. The textarea retains the input.
+
+Re-submitting the same input after cancel creates a *new* job. Collection-cache hits make the second submission instant if the cache hasn't expired.
+
+## §9 Restart resilience
+
+uvicorn boot sweep (idempotent, runs in lifespan startup):
+
+```sql
+-- Time out long-stuck expansion jobs
+UPDATE sort_jobs
+   SET phase = 'failed', failure_reason = 'expansion timed out',
+       updated_at = now()
+ WHERE phase = 'expanding'
+   AND phase_started_at < now() - interval '10 minutes';
+```
+
+Jobs in `queued` / `draining` need no special handling - they resume polling against `download_jobs` on the next client `GET`. The phase derives live from current counts (§4 phase computation rule), so a restart in the middle of a drain is invisible to the client beyond a brief window where counts may shift.
+
+## §10 Open questions resolved
+
+1. **Bare wsid + all-cached: synchronous or job-routed?** *Synchronous.* The cached path is sub-100ms today; routing it through a job adds polling latency and a UI flash. Frontend branches cheaply on `job_id` presence.
+2. **Mixed input (bare wsids + collection URLs).** *Treat as collection input.* Job created in `expanding` phase immediately. Bare wsids merge into `wsids[]` after `GetCollectionDetails` resolves. No partial-sync hybrid - keeps the response shape rule clean.
+3. **Partial expansion failure.** *Succeed with the resolvable subset.* Each unresolvable collection adds a warning `{tag:"collection-partial", level:"warning", msg:"collection X could not be fetched"}` to `result_json.WARNINGS`. Job completes normally; user sees the result with one or more amber warnings.
+4. **`GetCollectionDetails` flakiness.** *One internal retry with 2s backoff* before reporting collection-partial. No frontend-driven retry on the GET poll - it would mask transient failures and give the user no recovery affordance. Job marked `failed` only if **every** candidate collection fails.
+5. **Concurrent expansion of the same collection.** *Independent jobs; cache deduplicates.* User A and User B paste the same collection URL near-simultaneously; both create separate `sort_jobs` rows. The first one's `GetCollectionDetails` call populates `collections`; the second's hits cache. Worst case (race within the cache miss window) costs one duplicate API call. In-flight cache key (e.g., `collections.fetching_until`) deferred to Spec G.
+6. **Cancel semantics.** *Abandon `sort_job`; leave `download_jobs` running.* Three reasons. (a) Workshop downloads benefit other users via the shared `mod_parsed` cache - wasting them is anti-social. (b) The drain's `STALE_RECLAIM_MIN=30` reclaim path treats half-killed `downloading` rows as candidates for retry; introducing client-driven cancellation creates a class of races where the row is killed mid-write. (c) Worker-side cancellation requires SIGTERM-of-DD-subprocess plumbing that doesn't exist; staying out of that codepath is much cheaper.
+
+## §11 Acceptance criteria
+
+- [ ] `POST /api/sort` with all-cached bare wsids returns the synchronous shape with no `job_id`.
+- [ ] `POST /api/sort` with any uncached wsid OR any collection URL returns `{status, job_id}` and persists a `sort_jobs` row.
+- [ ] `GET /api/jobs/{job_id}` returns live counts and the current phase per the §4 derivation rule.
+- [ ] `GET /api/jobs/{nonexistent}` returns `404`.
+- [ ] `DELETE /api/jobs/{job_id}` flips phase to `failed` with `failure_reason="cancelled"`. Idempotent.
+- [ ] Collection URL `https://steamcommunity.com/sharedfiles/filedetails/?id=N` is detected by the parser and routed through `GetCollectionDetails`.
+- [ ] A `collections` cache hit (row younger than 6h) skips the Steam API call.
+- [ ] A collection that returns `result!=1` produces a `collection-partial` amber warning in `result_json.WARNINGS` but does not fail the job (unless **all** collections in the input are unresolvable).
+- [ ] uvicorn restart with a job that has been in `expanding` for more than 10 minutes flips it to `failed` with `failure_reason="expansion timed out"`.
+- [ ] uvicorn restart with a job in `queued`/`draining` is invisible to the client beyond next-poll-window jitter.
+- [ ] Frontend polls every 2.5s when `phase ∈ (expanding, queued, draining)`; stops on terminal phase.
+- [ ] Status strip text matches the §7 table for each phase.
+- [ ] Cancel button issues `DELETE`, stops polling, hides strip, retains input in textarea.
+- [ ] `WORKSHOP_ITEMS_LINE` in `result_json` matches `sort_jobs.wsids[]` regardless of which wsids ended up in `non_mod` / `unknown` (Spec A §8 ownership preserved).
+
+## §12 Test recipes
+
+1. **Synchronous fast path** - `POST /api/sort` with `{"input":"2169435993;2392709985;2487022075"}`. Expect: response has `MODS_LINE`, no `job_id`. ~50ms.
+2. **Collection URL, cold cache** - clear `collections` row for the test ID; `POST /api/sort` with a known PZ collection URL. Expect: `{status:"expanding", job_id:"…"}` immediately. Poll: phase progresses `expanding → queued → draining → done`. Final `result.MODS_LINE` populated.
+3. **Collection URL, warm cache** - re-submit the same URL within 6h. Expect: phase skips `expanding`, goes straight to `queued` (or `done` if all children cached). One Steam API call total across both runs (verify via `/var/log/...` or `journalctl -u sortof-api | grep GetCollectionDetails`).
+4. **Mixed bare + collection** - `POST /api/sort` with `"<URL>\n2169435993"`. Expect: job created in `expanding`; on resolve, `wsids[]` contains both the collection's children and the bare wsid; deduped.
+5. **Partial collection failure** - input contains two collection URLs, one valid, one to a deleted collection. Expect: job phase progresses normally; `result_json.WARNINGS` contains exactly one `collection-partial` entry; `wsids[]` contains only the valid collection's children.
+6. **All collections fail** - input contains only unresolvable collection URLs. Expect: job `phase=failed`, `failure_reason="all input collections unresolvable"`.
+7. **Cancel during draining** - submit a 50-mod cold collection, wait until `phase=draining`, `DELETE /api/jobs/{id}`. Expect: `phase=failed`, `failure_reason="cancelled"`. Verify `download_jobs` rows for the wsids are still in `queued`/`downloading`/`done` (not nuked).
+8. **Restart mid-drain** - submit a job, wait for `phase=draining`, `sudo systemctl restart sortof-api`. Wait 5s, GET the job. Expect: phase still derives correctly (computed from live counts), client polling resumes.
+9. **Restart mid-expansion** - submit a collection job and kill `sortof-api` mid-expansion. The race window is hard to hit deliberately; simulate it with a direct `UPDATE sort_jobs SET phase='expanding', phase_started_at=now()-interval '15 minutes' WHERE job_id=…`, then restart. Expect: lifespan sweep flips it to `failed` with `failure_reason="expansion timed out"`.
+10. **404 on expired job** - manually `DELETE FROM sort_jobs WHERE job_id=…`; client poll. Expect: `404`. Frontend shows the expired-toast with re-submit affordance.
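+11. **Counts contract** - at each poll during a 50-mod cold drain, sum `counts.cached + counts.queued + counts.draining` and compare to `len(wsids)`. Expect: the sum never exceeds `len(wsids)`, and once the drain settles it equals `len(wsids)` minus the number of wsids that turned out to be `non_mod`. (Some wsids may be `non_mod` post-drain; they appear as `cached=0, queued=0, draining=0` because `mod_parsed` has no row for them. Being missing from all three buckets is the expected steady state for non-mods. A wsid can also be briefly missing during the transient gap between rows noted in §4.)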
+12. **Concurrent collection submit** - open two browser tabs simultaneously and submit the same URL. Expect: two distinct `job_id`s, but only one `GetCollectionDetails` call lands at Steam (verify journal). Worst case (cache-miss race): two API calls; this is acceptable.