# Spec B+F - Collection URL/ID expansion + live drain progress
**Date:** 2026-05-01
**Status:** Draft (awaiting review)
**Sibling specs:** A multi-branch picker (shipped); C+D build-context + dep-add (next); E precacher (parallel); G cleanups + patch tier.
**Folds:** Original Spec F (live drain progress) merges in here - a 50+ mod cold load is exactly when live counters matter, and both features share the polling endpoint.
**Schema notes (corrections to design source text):**
- `download_jobs.status` enum is `queued | downloading | done | failed`. The design text used `running`; this spec uses the actual value `downloading`. UX label may render as "draining" for cohesion with the lifecycle vocabulary; the SQL keys off `downloading`.
- The existing `collections` table (`init/01_schema.sql`) has columns `collection_id PK, title, child_workshop_ids TEXT[], last_fetched_at TIMESTAMPTZ`. There is **no `expires_at` column**. TTL is computed at read time as `last_fetched_at + interval '6 hours'`; no schema change for that.
---
## §1 Overview
Today, sortof accepts one input shape: a blob of newline/`;`-delimited workshop IDs. Anything that isn't a 7-12 digit number is dropped by `parse.parse_workshop_input`. A pasted Steam Workshop *collection* URL embeds exactly one ID, so the parser currently surfaces that ID as a single mod; it then fails parse (`process_one=no_mod_info`) and lands in the `non_mod` bucket added by the recent unknown/non-mod feature. The user is expected to drag every child mod's ID out by hand.
This spec adds:
1. **Collection URL/ID expansion.** The API recognizes Steam Workshop URLs and resolves collection IDs to their child wsids via `ISteamRemoteStorage/GetCollectionDetails`. Cached in the existing `collections` table.
2. **Async job pipeline.** Any input containing a collection or any uncached wsid creates a `sort_jobs` row, returns a `job_id`, and the frontend polls `GET /api/jobs/{job_id}` every 2.5s until `done|failed`.
3. **Live counters.** During `expanding | queued | draining`, the poll response carries fresh `cached / queued / draining` counts plus an incremental `result_json`. The status strip animates instead of going stale.
Synchronous response is preserved for the all-cached fast path (Open Q1, §10).
## §2 API contract
### 2.1 `POST /api/sort` - polymorphic on input
Request body unchanged: `{ "input": str, "rules": str? }`. Response shape branches on what's in `input`:
```jsonc
// Path A: bare wsid list, all in cache (current behavior, unchanged)
{ "status": "success", "MOD_DB": [...], "MODS_LINE": "...", ... }
// Path B: bare wsid list with ≥1 uncached, OR ≥1 collection URL
{ "status": "queued" | "expanding", "job_id": "<uuid>" }
```
The frontend branches on the presence of `job_id`. Old clients that don't poll silently get the original sync response when their input is fully warm.
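As a rough illustration of the branching rule above, the server-side decision can be sketched as a pure function. The function name and its three arguments are hypothetical, and the sketch ignores the warm-collection-cache shortcut; it only encodes "collections present → `expanding`, any uncached wsid → `queued`, otherwise synchronous".

```python
from typing import Collection, Sequence


def initial_response_kind(
    wsids: Sequence[str],
    collection_ids: Sequence[str],
    cached_wsids: Collection[str],
) -> str:
    """Pick the /api/sort response path (hypothetical helper, not the real handler).

    Returns "sync" for Path A, or the initial job status for Path B.
    """
    if collection_ids:
        # Any collection candidate forces a job that starts in `expanding`.
        return "expanding"
    if any(w not in cached_wsids for w in wsids):
        # At least one bare wsid is uncached: job starts in `queued`.
        return "queued"
    # Everything is warm: keep the original synchronous response.
    return "sync"
```

A client never needs this function; it only checks whether `job_id` is present in the response body.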
### 2.2 `GET /api/jobs/{job_id}` - polling endpoint
Response (any phase):
```jsonc
{
"job_id": "<uuid>",
"phase": "expanding" | "queued" | "draining" | "done" | "failed",
"counts": { "cached": int, "queued": int, "draining": int },
"wsids": [str, ...] | null, // null while phase=expanding; populated after
"result": { ...SORTOF_DATA... } | null, // partial during draining; final on done
"failure_reason": str | null // populated only on phase=failed
}
```
`404` if the `job_id` is unknown or expired (TTL in §3).
### 2.3 `DELETE /api/jobs/{job_id}` - cancel
Marks the job `failed` with `failure_reason="cancelled"`. Returns `204`. Idempotent: deleting an already-terminal job is a no-op `204`. Does **not** cancel underlying `download_jobs` rows (Open Q6, §10).
## §3 Schema
New table:
```sql
CREATE TABLE IF NOT EXISTS sort_jobs (
job_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
phase TEXT NOT NULL CHECK (phase IN ('expanding','queued','draining','done','failed')),
phase_started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
input_raw TEXT NOT NULL,
collection_ids TEXT[] NOT NULL DEFAULT '{}',
wsids TEXT[], -- null until expansion resolves
rules_raw TEXT,
result_json JSONB, -- null until the first partial is written; holds incremental partials, then the final result on done
failure_reason TEXT
);
CREATE INDEX IF NOT EXISTS sort_jobs_phase_idx ON sort_jobs (phase);
CREATE INDEX IF NOT EXISTS sort_jobs_updated_idx ON sort_jobs (updated_at);
```
- **TTL:** rows with `phase ∈ (done, failed)` whose `updated_at` is more than 24h in the past are eligible for deletion. The cleanup script lives in Spec G; this spec only requires that the schema support it.
- **`updated_at` trigger:** mirror the existing `download_jobs.touch_updated_at` pattern.
- **Migration plan:** `init/02_sort_jobs.sql` for fresh deploys + a one-shot `psql -f` for the live DB. No data migration; pure additive.
The existing `collections` table is reused as-is (4 columns, see corrections at top). No `expires_at` column; freshness derived from `last_fetched_at`.
## §4 Phase state machine
```
/api/sort with collections only
       │
       ▼
┌──────────────┐
│  expanding   │
└──────┬───────┘
       │ GetCollectionDetails OK
       │ wsids = collections + bare ids
       ▼
┌──────────────┐ ←── /api/sort with bare uncached wsids
│    queued    │ ─── all wsids in mod_parsed (skip drain) ───┐
└──────┬───────┘                                             │
       │ first download_jobs row → downloading               │
       ▼                                                     │
┌──────────────┐                                             │
│   draining   │                                             │
└──────┬───────┘                                             │
       │ all wsids resolved (mod_parsed has rows)            │
       ▼                                                     ▼
┌──────────────┐                                      ┌──────────────┐
│     done     │                                      │     done     │
└──────────────┘                                      └──────────────┘
Failure terminal at any phase: failed (with phase_at_failure stored in failure_reason prefix).
```
Phase transitions are **monotonic**: `expanding → queued → draining → done`. No backward transitions. A job's phase only advances; the API computes phase fresh on each `GET` rather than mutating it on every event (simpler, no leader needed).
Phase computation rule (executed inside `GET /api/jobs/{job_id}`):
```
if phase in (done, failed): return as-stored
if wsids is null: phase = expanding
elif counts.draining > 0: phase = draining
elif counts.queued > 0: phase = queued
elif counts.cached >= len(wsids): phase = done; persist result_json
else: phase = queued # transient gap between rows
```
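The rule above translates almost line for line into Python. The following is a minimal sketch, assuming the three counts have already been fetched; `Counts` and `derive_phase` are illustrative names, and persisting `result_json` on the `done` branch is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

TERMINAL_PHASES = ("done", "failed")


@dataclass(frozen=True)
class Counts:
    cached: int
    queued: int
    draining: int


def derive_phase(
    stored_phase: str,
    wsids: Optional[Sequence[str]],
    counts: Counts,
) -> str:
    """Compute the phase reported by GET /api/jobs/{job_id}."""
    if stored_phase in TERMINAL_PHASES:
        return stored_phase  # terminal phases are returned as stored
    if wsids is None:
        return "expanding"  # expansion has not resolved yet
    if counts.draining > 0:
        return "draining"
    if counts.queued > 0:
        return "queued"
    if counts.cached >= len(wsids):
        return "done"  # caller persists result_json at this point
    return "queued"  # transient gap between rows
```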
## §5 Steam expansion
### 5.1 Detection
The current `parse.parse_workshop_input` strips ini-style prefixes and extracts `\b\d{7,12}\b`. We add a sibling `parse.parse_with_collections(text) -> (wsids: list, collection_ids: list)`:
- Match Steam URLs `https?://steamcommunity\.com/(?:sharedfiles|workshop)/filedetails/\?id=(\d{7,12})` and capture the ID.
- Bare numeric IDs (the existing pattern) remain `wsids`.
- A URL-form ID is classified as a *candidate collection*. We don't know syntactically whether a wsid is a collection vs a mod - so candidate collection IDs are sent to `GetCollectionDetails` first; if the API reports them as actual mods (not collections), they fall back to the wsids list.
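The detection rules above could look roughly like this. It is a sketch only: it omits the ini-style prefix stripping that the existing parser performs, and dropping a bare ID that also appears in URL form is an assumption made here to avoid duplicates, not something the spec states.

```python
import re

# URL pattern and bare-ID pattern as given in this section.
STEAM_URL_RE = re.compile(
    r"https?://steamcommunity\.com/(?:sharedfiles|workshop)/filedetails/\?id=(\d{7,12})"
)
BARE_ID_RE = re.compile(r"\b\d{7,12}\b")


def parse_with_collections(text: str) -> tuple[list[str], list[str]]:
    """Split input into (bare wsids, candidate collection IDs), order-preserving."""
    # dict.fromkeys de-duplicates while keeping first-seen order.
    candidates = list(dict.fromkeys(STEAM_URL_RE.findall(text)))
    # Blank out the URLs so their embedded IDs are not re-matched as bare wsids.
    remainder = STEAM_URL_RE.sub(" ", text)
    wsids = [
        w for w in dict.fromkeys(BARE_ID_RE.findall(remainder))
        if w not in candidates  # assumption: URL form wins over bare form
    ]
    return wsids, candidates
```

Candidates that turn out to be ordinary mods would be appended to `wsids` after the `GetCollectionDetails` round trip, which this sketch does not cover.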
### 5.2 Resolution
Single batched call per `/api/sort` with ≥1 candidate:
```
POST https://api.steampowered.com/ISteamRemoteStorage/GetCollectionDetails/v1/
collectioncount=N
publishedfileids[0..N-1]=...
```
Per-collection in the response: `result==1` and `children[]` populated → expand to `[c.publishedfileid for c in children]`. `result!=1` → mark in result warnings as `{tag:"collection-partial", level:"warning", msg:"collection X could not be fetched"}`; keep the job alive with whatever resolved. (Open Q3, §10.)
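A sketch of the request form and the per-collection handling described above. The helper names are hypothetical, and the response envelope (`response.collectiondetails[]`) is an assumption about the Steam payload that should be checked against a real response. Only the two cases the text names are handled; an entry with `result==1` but no children is left untouched.

```python
from typing import Any, Mapping, Sequence


def build_collection_form(collection_ids: Sequence[str]) -> dict[str, str]:
    """Form fields for one batched GetCollectionDetails POST."""
    form = {"collectioncount": str(len(collection_ids))}
    for i, cid in enumerate(collection_ids):
        form[f"publishedfileids[{i}]"] = cid
    return form


def expand_collections(
    payload: Mapping[str, Any],
) -> tuple[dict[str, list[str]], list[dict[str, str]]]:
    """Return (children keyed by collection ID, collection-partial warnings)."""
    expanded: dict[str, list[str]] = {}
    warnings: list[dict[str, str]] = []
    # Assumed envelope: {"response": {"collectiondetails": [...]}}
    for entry in payload.get("response", {}).get("collectiondetails", []):
        cid = str(entry.get("publishedfileid"))
        children = entry.get("children") or []
        if entry.get("result") == 1 and children:
            expanded[cid] = [str(c["publishedfileid"]) for c in children]
        elif entry.get("result") != 1:
            warnings.append({
                "tag": "collection-partial",
                "level": "warning",
                "msg": f"collection {cid} could not be fetched",
            })
        # result == 1 with no children: not specified here, so left alone.
    return expanded, warnings
```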
### 5.3 Caching
Hit on `collections` row where `last_fetched_at > now() - interval '6 hours'`:
- Skip the API call entirely.
- Use cached `child_workshop_ids` directly.
Miss / stale → call API, UPSERT into `collections`, then proceed. The `last_fetched_at = now()` write is the cache write.
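The hit/miss split can be sketched without touching the database, assuming the relevant `collections` rows have already been loaded into a dict of `collection_id -> (child_workshop_ids, last_fetched_at)`. That in-memory shape and both function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional, Sequence

COLLECTION_TTL = timedelta(hours=6)


def is_fresh(last_fetched_at: Optional[datetime], now: datetime) -> bool:
    """Mirror of `last_fetched_at > now() - interval '6 hours'`."""
    return last_fetched_at is not None and last_fetched_at > now - COLLECTION_TTL


def split_by_cache(
    collection_ids: Sequence[str],
    rows: Mapping[str, tuple[list[str], datetime]],
    now: datetime,
) -> tuple[dict[str, list[str]], list[str]]:
    """Return (cache hits with their children, IDs that still need the API)."""
    hits: dict[str, list[str]] = {}
    misses: list[str] = []
    for cid in collection_ids:
        row = rows.get(cid)
        if row is not None and is_fresh(row[1], now):
            hits[cid] = row[0]  # use cached child_workshop_ids directly
        else:
            misses.append(cid)  # missing or stale: fetch, then UPSERT
    return hits, misses
```

Only the `misses` list would go into the batched `GetCollectionDetails` call.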
### 5.4 Flakiness
One internal retry with 2s backoff on HTTP error or `result!=1` for a candidate. Once that retry is exhausted, the candidate is reported as collection-partial (warning) and the job continues with whatever else resolved. (Open Q4, §10.)
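A minimal sketch of the retry policy, assuming the fetch is wrapped in a callable that returns `None` when Steam reports `result!=1` and raises `OSError` (the base class of `urllib.error.URLError`) on transport failure. The wrapper name and the injectable `sleep` are conveniences for illustration and testing.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_one_retry(
    fetch: Callable[[], Optional[T]],
    backoff_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Try `fetch` at most twice; return None if both attempts fail."""
    for attempt in range(2):
        try:
            result = fetch()
        except OSError:
            result = None  # treat HTTP/transport errors like result != 1
        if result is not None:
            return result
        if attempt == 0:
            sleep(backoff_s)  # single 2s backoff before the one retry
    return None  # caller records a collection-partial warning
```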
## §6 Counts contract
Computed live on every `GET /api/jobs/{job_id}` against the job's `wsids[]`:
```sql
-- counts.cached
SELECT COUNT(DISTINCT mp.workshop_id)
FROM mod_parsed mp
JOIN workshop_meta wm ON wm.workshop_id = mp.workshop_id
WHERE mp.workshop_id = ANY($1::text[])
AND mp.parsed_at_time_updated = wm.time_updated;
-- counts.queued
SELECT COUNT(DISTINCT workshop_id)
FROM download_jobs
WHERE workshop_id = ANY($1::text[]) AND status = 'queued';
-- counts.draining (status='downloading' in DB; surfaced as 'draining' in API/UI)
SELECT COUNT(DISTINCT workshop_id)
FROM download_jobs
WHERE workshop_id = ANY($1::text[]) AND status = 'downloading';
```
Ownership precedent (Spec A §8): once a job is created, `wsids[]` is **locked**. `WORKSHOP_ITEMS_LINE` in the final `result_json` is computed from `sort_jobs.wsids[]`, **not** recomputed against current `mod_parsed`. This means a wsid that was in the input but is currently `non_mod` or `unknown` still appears in `WORKSHOP_ITEMS_LINE` in the same position - matching the locked contract from Spec A.
## §7 Frontend behavior
Status strip during polling:
| Phase | Strip text |
|---|---|
| `expanding` | `expanding collection…` (animated dot, no counts visible) |
| `queued` | `X cached · Y queued · 0 draining` (animated dots on queued) |
| `draining` | `X cached · Y queued · Z draining` (animated dots on queued + draining) |
| `done` | strip collapses, full result rendered |
| `failed` | red banner with `failure_reason` + Retry button |
Polling: `setInterval` at 2.5s, started on receiving `job_id`. Stops on `phase ∈ (done, failed)`. On `404` (job expired/garbage-collected): show "this job expired - re-submit?" toast; offer one-click resubmit using cached input (the textarea is still populated).
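The real frontend uses `setInterval`; the control flow is the same in any client and is sketched here in Python for consistency with the API side. `fetch_job` is a hypothetical callable that returns the parsed poll body, or `None` when the server answers `404`.

```python
import time
from typing import Any, Callable, Mapping, Optional

POLL_INTERVAL_S = 2.5
TERMINAL_PHASES = {"done", "failed"}


def poll_job(
    fetch_job: Callable[[], Optional[Mapping[str, Any]]],
    on_update: Callable[[Mapping[str, Any]], None],
    sleep: Callable[[float], None] = time.sleep,
    interval_s: float = POLL_INTERVAL_S,
) -> str:
    """Poll until a terminal phase or a 404; return the final phase or "expired"."""
    while True:
        body = fetch_job()
        if body is None:
            return "expired"  # 404: show the re-submit toast
        on_update(body)  # refresh the status strip with live counts
        if body["phase"] in TERMINAL_PHASES:
            return body["phase"]
        sleep(interval_s)
```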
Cancel button: shown during `expanding | queued | draining`. Issues `DELETE /api/jobs/{job_id}`, stops polling on success, clears the strip.
The synchronous code path (no `job_id` in response) renders unchanged - old picker behavior, immediate result.
Owned-fields contract (Spec A §8 precedent): `WORKSHOP_ITEMS_LINE`, `counts.queued` (the picker's internal counter), `unknown[]`, `non_mod[]` are still owned by the **first** `/api/sort` (or final `result_json`). `/api/resort` ignores them. The poll's `counts` object is purely the live drain progress and does not feed the picker's internal queued counter.
## §8 Cancellation
`DELETE /api/jobs/{job_id}` semantics:
- Marks `sort_jobs.phase = 'failed'`, `failure_reason = 'cancelled'`. Idempotent.
- **Does not** touch `download_jobs`. Workshop downloads in flight continue and populate `mod_parsed`, benefiting subsequent users via cache. Aborting them would waste partial progress and potentially trip the drain's `STALE_RECLAIM_MIN` reclaim path. (Open Q6, §10.)
- Frontend stops polling, hides the strip, shows a small "cancelled" toast. The textarea retains the input.
Re-submitting the same input after cancel creates a *new* job. Collection-cache hits make the second submission instant if the cache hasn't expired.
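The idempotency requirement is easiest to see on an in-memory stand-in for the row. This sketch assumes a job is represented as a plain dict; the real handler would express the same rule as a guarded `UPDATE` on `sort_jobs`.

```python
from typing import Any

TERMINAL_PHASES = ("done", "failed")


def cancel_job(job: dict[str, Any]) -> dict[str, Any]:
    """Return the job as it should look after DELETE /api/jobs/{job_id}."""
    if job["phase"] in TERMINAL_PHASES:
        return job  # already terminal: no-op, the endpoint still answers 204
    # download_jobs rows are deliberately not touched here.
    return {**job, "phase": "failed", "failure_reason": "cancelled"}
```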
## §9 Restart resilience
uvicorn boot sweep (idempotent, runs in lifespan startup):
```sql
-- Time out long-stuck expansion jobs
UPDATE sort_jobs
SET phase = 'failed', failure_reason = 'expansion timed out',
updated_at = now()
WHERE phase = 'expanding'
AND phase_started_at < now() - interval '10 minutes';
```
Jobs in `queued` / `draining` need no special handling - they resume polling against `download_jobs` on the next client `GET`. The phase derives live from current counts (§4 phase computation rule), so a restart in the middle of a drain is invisible to the client beyond a brief window where counts may shift.
## §10 Open questions resolved
1. **Bare wsid + all-cached: synchronous or job-routed?** *Synchronous.* The cached path is sub-100ms today; routing it through a job adds polling latency and a UI flash. Frontend branches cheaply on `job_id` presence.
2. **Mixed input (bare wsids + collection URLs).** *Treat as collection input.* Job created in `expanding` phase immediately. Bare wsids merge into `wsids[]` after `GetCollectionDetails` resolves. No partial-sync hybrid - keeps the response shape rule clean.
3. **Partial expansion failure.** *Succeed with the resolvable subset.* Each unresolvable collection adds a warning `{tag:"collection-partial", level:"warning", msg:"collection X could not be fetched"}` to `result_json.WARNINGS`. Job completes normally; user sees the result with one or more amber warnings.
4. **`GetCollectionDetails` flakiness.** *One internal retry with 2s backoff* before reporting collection-partial. No frontend-driven retry on the GET poll - it would mask transient failures and give the user no recovery affordance. Job marked `failed` only if **every** candidate collection fails.
5. **Concurrent expansion of the same collection.** *Independent jobs; cache deduplicates.* User A and User B paste the same collection URL near-simultaneously; both create separate `sort_jobs` rows. The first one's `GetCollectionDetails` call populates `collections`; the second's hits cache. Worst case (race within the cache miss window) costs one duplicate API call. In-flight cache key (e.g., `collections.fetching_until`) deferred to Spec G.
6. **Cancel semantics.** *Abandon `sort_job`; leave `download_jobs` running.* Three reasons. (a) Workshop downloads benefit other users via the shared `mod_parsed` cache - wasting them is anti-social. (b) The drain's `STALE_RECLAIM_MIN=30` reclaim path treats half-killed `downloading` rows as candidates for retry; introducing client-driven cancellation creates a class of races where the row is killed mid-write. (c) Worker-side cancellation requires SIGTERM-of-DD-subprocess plumbing that doesn't exist; staying out of that codepath is much cheaper.
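Resolutions 2 to 4 combine into one merge step once expansion returns. The sketch below is one possible reading: the function name is hypothetical, the ordering (collection children first, then bare IDs) follows the §4 diagram label, and the failure string is the one used in the test recipes. It follows resolution 4 literally, so a job whose collections all fail is failed even if bare wsids were also present.

```python
from typing import Mapping, Optional, Sequence


def merge_expansion(
    bare_wsids: Sequence[str],
    candidate_ids: Sequence[str],
    expanded: Mapping[str, Sequence[str]],
) -> tuple[Optional[list[str]], Optional[str]]:
    """Return (wsids, failure_reason); exactly one of the two is None."""
    if candidate_ids and not any(cid in expanded for cid in candidate_ids):
        # Every candidate collection failed: the job itself fails.
        return None, "all input collections unresolvable"
    merged: list[str] = []
    for cid in candidate_ids:
        merged.extend(expanded.get(cid, []))  # unresolved ones add nothing
    merged.extend(bare_wsids)
    # De-duplicate while preserving first-seen order.
    return list(dict.fromkeys(merged)), None
```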
## §11 Acceptance criteria
- [ ] `POST /api/sort` with all-cached bare wsids returns the synchronous shape with no `job_id`.
- [ ] `POST /api/sort` with any uncached wsid OR any collection URL returns `{status, job_id}` and persists a `sort_jobs` row.
- [ ] `GET /api/jobs/{job_id}` returns live counts and the current phase per the §4 derivation rule.
- [ ] `GET /api/jobs/{nonexistent}` returns `404`.
- [ ] `DELETE /api/jobs/{job_id}` flips phase to `failed` with `failure_reason="cancelled"`. Idempotent.
- [ ] Collection URL `https://steamcommunity.com/sharedfiles/filedetails/?id=N` is detected by the parser and routed through `GetCollectionDetails`.
- [ ] A `collections` cache hit (row younger than 6h) skips the Steam API call.
- [ ] A collection that returns `result!=1` produces a `collection-partial` amber warning in `result_json.WARNINGS` but does not fail the job (unless **all** collections in the input are unresolvable).
- [ ] uvicorn restart with a job in `expanding > 10min` flips it to `failed` with `failure_reason="expansion timed out"`.
- [ ] uvicorn restart with a job in `queued`/`draining` is invisible to the client beyond next-poll-window jitter.
- [ ] Frontend polls every 2.5s when `phase ∈ (expanding, queued, draining)`; stops on terminal phase.
- [ ] Status strip text matches the §7 table for each phase.
- [ ] Cancel button issues `DELETE`, stops polling, hides strip, retains input in textarea.
- [ ] `WORKSHOP_ITEMS_LINE` in `result_json` matches `sort_jobs.wsids[]` regardless of which wsids ended up in `non_mod` / `unknown` (Spec A §8 ownership preserved).
## §12 Test recipes
1. **Synchronous fast path** - `POST /api/sort` with `{"input":"2169435993;2392709985;2487022075"}`. Expect: response has `MODS_LINE`, no `job_id`. ~50ms.
2. **Collection URL, cold cache** - clear `collections` row for the test ID; `POST /api/sort` with a known PZ collection URL. Expect: `{status:"expanding", job_id:"…"}` immediately. Poll: phase progresses `expanding → queued → draining → done`. Final `result.MODS_LINE` populated.
3. **Collection URL, warm cache** - re-submit the same URL within 6h. Expect: phase skips `expanding`, goes straight to `queued` (or `done` if all children cached). One Steam API call total across both runs (verify via `/var/log/...` or `journalctl -u sortof-api | grep GetCollectionDetails`).
4. **Mixed bare + collection** - `POST /api/sort` with `"<URL>\n2169435993"`. Expect: job created in `expanding`; on resolve, `wsids[]` contains both the collection's children and the bare wsid; deduped.
5. **Partial collection failure** - input contains two collection URLs, one valid, one to a deleted collection. Expect: job phase progresses normally; `result_json.WARNINGS` contains exactly one `collection-partial` entry; `wsids[]` contains only the valid collection's children.
6. **All collections fail** - input contains only unresolvable collection URLs. Expect: job `phase=failed`, `failure_reason="all input collections unresolvable"`.
7. **Cancel during draining** - submit a 50-mod cold collection, wait until `phase=draining`, `DELETE /api/jobs/{id}`. Expect: phase=failed reason=cancelled. Verify `download_jobs` rows for the wsids are still in `queued`/`downloading`/`done` (not nuked).
8. **Restart mid-drain** - submit a job, wait for `phase=draining`, `sudo systemctl restart sortof-api`. Wait 5s, GET the job. Expect: phase still derives correctly (computed from live counts), client polling resumes.
9. **Restart mid-expansion** - submit a collection job, kill `sortof-api` mid-expansion (race window: hard to hit deliberately; simulate it by running a direct `UPDATE sort_jobs SET phase='expanding', phase_started_at=now()-interval '15 minutes'` against the job's row, then restarting). Expect: lifespan sweep flips it to `failed` with `failure_reason="expansion timed out"`.
10. **404 on expired job** - manually `DELETE FROM sort_jobs WHERE job_id=…`; client poll. Expect: `404`. Frontend shows the expired-toast with re-submit affordance.
11. **Counts contract** - at each poll during a 50-mod cold drain, sum `counts.cached + counts.queued + counts.draining` and compare to `len(wsids)`. Expect: equal at every snapshot when every wsid is a real mod. Wsids that turn out to be `non_mod` post-drain contribute to none of the three counts because `mod_parsed` has no row for them; being "missing from all three buckets" is their expected steady state, so for such inputs the sum falls short of `len(wsids)` by the number of non-mods.
12. **Concurrent collection submit** - open two browser tabs simultaneously and submit the same URL. Expect: two distinct `job_id`s, but only one `GetCollectionDetails` call lands at Steam (verify journal). Worst case (cache-miss race): two API calls; this is acceptable.