Add full sortof codebase: API, drain workers, frontend, schema, specs

2026-05-04 03:27:54 +00:00
parent acda2c90f8
commit 55d3794bfb
43 changed files with 13375 additions and 53 deletions


@@ -0,0 +1,270 @@
# Spec B+F - Collection URL/ID expansion + live drain progress
**Date:** 2026-05-01
**Status:** Draft (awaiting review)
**Sibling specs:** A multi-branch picker (shipped); C+D build-context + dep-add (next); E precacher (parallel); G cleanups + patch tier.
**Folds:** Original Spec F (live drain progress) merges in here - a 50+ mod cold load is exactly when live counters matter, and both features share the polling endpoint.
**Schema notes (corrections to design source text):**
- `download_jobs.status` enum is `queued | downloading | done | failed`. The design text used `running`; this spec uses the actual value `downloading`. UX label may render as "draining" for cohesion with the lifecycle vocabulary; the SQL keys off `downloading`.
- The existing `collections` table (`init/01_schema.sql`) has columns `collection_id PK, title, child_workshop_ids TEXT[], last_fetched_at TIMESTAMPTZ`. There is **no `expires_at` column**. TTL is computed at read time as `last_fetched_at + interval '6 hours'`; no schema change for that.
---
## §1 Overview
Today, sortof accepts one input shape: a blob of newline/`;`-delimited workshop IDs. Anything that isn't a 7-12 digit number is dropped by `parse.parse_workshop_input`. A pasted Steam Workshop *collection* URL embeds exactly one ID, so sortof currently treats that ID as a single mod; it fails parse (`process_one=no_mod_info`) and lands in the `non_mod` bucket added by the recent unknown/non-mod feature. The user is expected to copy out every child mod's ID by hand.
This spec adds:
1. **Collection URL/ID expansion.** The API recognizes Steam Workshop URLs and resolves collection IDs to their child wsids via `ISteamRemoteStorage/GetCollectionDetails`. Cached in the existing `collections` table.
2. **Async job pipeline.** Any input containing a collection or any uncached wsid creates a `sort_jobs` row, returns a `job_id`, and the frontend polls `GET /api/jobs/{job_id}` every 2.5s until `done|failed`.
3. **Live counters.** During `expanding | queued | draining`, the poll response carries fresh `cached / queued / draining` counts plus an incremental `result_json`. The status strip animates instead of going stale.
Synchronous response is preserved for the all-cached fast path (Open Q1, §10).
## §2 API contract
### 2.1 `POST /api/sort` - polymorphic on input
Request body unchanged: `{ "input": str, "rules": str? }`. Response shape branches on what's in `input`:
```jsonc
// Path A: bare wsid list, all in cache (current behavior, unchanged)
{ "status": "success", "MOD_DB": [...], "MODS_LINE": "...", ... }
// Path B: bare wsid list with ≥1 uncached, OR ≥1 collection URL
{ "status": "queued" | "expanding", "job_id": "<uuid>" }
```
The frontend branches on the presence of `job_id`. Old clients that don't poll silently get the original sync response when their input is fully warm.
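The branching rule above can be sketched as a small pure function. This is a hedged illustration, not the actual handler: `choose_sort_path` and its arguments are hypothetical names, `cached` stands in for whatever lookup decides that a wsid is warm, and the sketch always reports `expanding` for collection input without considering a warm `collections` cache.

```python
from typing import Iterable, Set


def choose_sort_path(
    wsids: Iterable[str],
    collection_ids: Iterable[str],
    cached: Set[str],
) -> str:
    """Return 'sync' (Path A), or the initial job status for Path B."""
    # Any collection candidate forces the async path, starting in 'expanding'.
    if list(collection_ids):
        return "expanding"
    # Bare wsids with at least one cache miss start in 'queued'.
    if any(w not in cached for w in wsids):
        return "queued"
    # Everything is warm: keep the original synchronous response.
    return "sync"
```

A caller would return the full `SORTOF_DATA` payload for `"sync"` and `{status, job_id}` otherwise.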
### 2.2 `GET /api/jobs/{job_id}` - polling endpoint
Response (any phase):
```jsonc
{
"job_id": "<uuid>",
"phase": "expanding" | "queued" | "draining" | "done" | "failed",
"counts": { "cached": int, "queued": int, "draining": int },
"wsids": [str, ...] | null, // null while phase=expanding; populated after
"result": { ...SORTOF_DATA... } | null, // partial during draining; final on done
"failure_reason": str | null // populated only on phase=failed
}
```
`404` if the `job_id` is unknown or expired (TTL in §3).
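A minimal sketch of a client-side polling loop against this endpoint, assuming the transport is injected: `fetch` is a hypothetical callable that returns the decoded JSON body, or `None` when the server answers `404`. The `max_polls` guard is an addition for the sketch, not part of the contract.

```python
import time
from typing import Callable, Optional

TERMINAL_PHASES = {"done", "failed"}


def poll_job(
    fetch: Callable[[], Optional[dict]],
    interval_s: float = 2.5,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 1000,
) -> Optional[dict]:
    """Poll until the job is terminal; return the final body, or None on 404."""
    for _ in range(max_polls):
        body = fetch()
        if body is None:
            # Unknown or expired job: the caller decides whether to resubmit.
            return None
        if body["phase"] in TERMINAL_PHASES:
            return body
        # Non-terminal phase: wait one interval before asking again.
        sleep(interval_s)
    raise TimeoutError("job did not reach a terminal phase")
```

Injecting `sleep` keeps the loop testable without real delays.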
### 2.3 `DELETE /api/jobs/{job_id}` - cancel
Marks the job `failed` with `failure_reason="cancelled"`. Returns `204`. Idempotent: deleting an already-terminal job is a no-op `204`. Does **not** cancel underlying `download_jobs` rows (Open Q6, §10).
## §3 Schema
New table:
```sql
CREATE TABLE IF NOT EXISTS sort_jobs (
job_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
phase TEXT NOT NULL CHECK (phase IN ('expanding','queued','draining','done','failed')),
phase_started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
input_raw TEXT NOT NULL,
collection_ids TEXT[] NOT NULL DEFAULT '{}',
wsids TEXT[], -- null until expansion resolves
rules_raw TEXT,
result_json JSONB, -- null until the first partial; incremental partials while draining, final result on done
failure_reason TEXT
);
CREATE INDEX IF NOT EXISTS sort_jobs_phase_idx ON sort_jobs (phase);
CREATE INDEX IF NOT EXISTS sort_jobs_updated_idx ON sort_jobs (updated_at);
```
- **TTL:** rows with `phase ∈ (done, failed)` whose `updated_at` is more than 24h old are eligible for deletion. The cleanup script lives in Spec G; this spec only requires that the schema support it.
- **`updated_at` trigger:** mirror the existing `download_jobs.touch_updated_at` pattern.
- **Migration plan:** `init/02_sort_jobs.sql` for fresh deploys + a one-shot `psql -f` for the live DB. No data migration; pure additive.
The existing `collections` table is reused as-is (4 columns, see corrections at top). No `expires_at` column; freshness derived from `last_fetched_at`.
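The `sort_jobs` TTL rule can be expressed as a predicate. This is a sketch for reasoning about the rule only; the real cleanup belongs to Spec G and would more likely be a single SQL `DELETE`. `is_deletable` and `SORT_JOB_TTL` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

SORT_JOB_TTL = timedelta(hours=24)
TERMINAL_PHASES = frozenset({"done", "failed"})


def is_deletable(phase: str, updated_at: datetime, now: datetime) -> bool:
    """True when a sort_jobs row is terminal and untouched for more than 24h."""
    return phase in TERMINAL_PHASES and now - updated_at > SORT_JOB_TTL
```

Non-terminal jobs are never eligible, however old they are.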
## §4 Phase state machine
```
   /api/sort with collections only
               │
               ▼
        ┌──────────────┐
        │  expanding   │
        └──────┬───────┘
               │ GetCollectionDetails OK
               │ wsids = collections + bare ids
               ▼
        ┌──────────────┐ ←── /api/sort with bare uncached wsids
        │    queued    │ ── all wsids in mod_parsed (skip drain) ──┐
        └──────┬───────┘                                           │
               │ first download_jobs row → downloading             │
               ▼                                                   │
        ┌──────────────┐                                           │
        │   draining   │                                           │
        └──────┬───────┘                                           │
               │ all wsids resolved (mod_parsed has rows)          │
               ▼                                                   ▼
        ┌──────────────┐                                    ┌──────────────┐
        │     done     │                                    │     done     │
        └──────────────┘                                    └──────────────┘

Failure terminal at any phase: failed (with phase_at_failure stored in failure_reason prefix).
```
Phase transitions are **monotonic**: `expanding → queued → draining → done`. No backward transitions. A job's phase only advances; the API computes phase fresh on each `GET` rather than mutating it on every event (simpler, no leader needed).
Phase computation rule (executed inside `GET /api/jobs/{job_id}`):
```
if phase in (done, failed): return as-stored
if wsids is null: phase = expanding
elif counts.draining > 0: phase = draining
elif counts.queued > 0: phase = queued
elif counts.cached >= len(wsids): phase = done; persist result_json
else: phase = queued # transient gap between rows
```
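The computation rule above translates almost line for line into Python. This is a sketch under the assumption that `counts` has already been computed (§6); persisting `result_json` on the transition to `done` is left to the caller, and `derive_phase` is a hypothetical name.

```python
from typing import Mapping, Optional, Sequence


def derive_phase(
    stored_phase: str,
    wsids: Optional[Sequence[str]],
    counts: Mapping[str, int],
) -> str:
    """Derive the phase reported by GET /api/jobs/{job_id}."""
    # Terminal phases are returned exactly as stored.
    if stored_phase in ("done", "failed"):
        return stored_phase
    # No wsids yet means collection expansion has not resolved.
    if wsids is None:
        return "expanding"
    if counts["draining"] > 0:
        return "draining"
    if counts["queued"] > 0:
        return "queued"
    if counts["cached"] >= len(wsids):
        return "done"
    # Transient gap between download_jobs rows.
    return "queued"
```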
## §5 Steam expansion
### 5.1 Detection
The current `parse.parse_workshop_input` strips ini-style prefixes and extracts `\b\d{7,12}\b`. We add a sibling `parse.parse_with_collections(text) -> (wsids: list, collection_ids: list)`:
- Match Steam URLs `https?://steamcommunity\.com/(?:sharedfiles|workshop)/filedetails/\?id=(\d{7,12})` and capture the ID.
- Bare numeric IDs (the existing pattern) remain `wsids`.
- A URL-form ID is classified as a *candidate collection*. An ID's syntax does not reveal whether it names a collection or a mod, so candidate collection IDs are sent to `GetCollectionDetails` first; any that the API reports as plain mods (not collections) fall back to the `wsids` list.
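The detection rules above can be sketched as follows. This is an illustration, not the shipped parser: it omits the ini-style prefix stripping that `parse.parse_workshop_input` performs, and it assumes that results are de-duplicated in first-seen order and that a bare ID repeating a URL-form ID is dropped from `wsids`.

```python
import re
from typing import List, Tuple

# Both patterns are taken from the spec text.
URL_RE = re.compile(
    r"https?://steamcommunity\.com/(?:sharedfiles|workshop)/filedetails/\?id=(\d{7,12})"
)
BARE_ID_RE = re.compile(r"\b\d{7,12}\b")


def parse_with_collections(text: str) -> Tuple[List[str], List[str]]:
    """Return (wsids, candidate_collection_ids), each in first-seen order."""
    candidates: List[str] = []
    for match in URL_RE.finditer(text):
        if match.group(1) not in candidates:
            candidates.append(match.group(1))
    # Blank out the URLs so their IDs are not counted again as bare wsids.
    remainder = URL_RE.sub(" ", text)
    wsids: List[str] = []
    for match in BARE_ID_RE.finditer(remainder):
        wsid = match.group(0)
        if wsid not in wsids and wsid not in candidates:
            wsids.append(wsid)
    return wsids, candidates
```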
### 5.2 Resolution
Single batched call per `/api/sort` with ≥1 candidate:
```
POST https://api.steampowered.com/ISteamRemoteStorage/GetCollectionDetails/v1/
collectioncount=N
publishedfileids[0..N-1]=...
```
Per-collection in the response: `result==1` and `children[]` populated → expand to `[c.publishedfileid for c in children]`. `result!=1` → mark in result warnings as `{tag:"collection-partial", level:"warning", msg:"collection X could not be fetched"}`; keep the job alive with whatever resolved. (Open Q3, §10.)
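A sketch of building the batched form body and classifying each entry of the response. The `response.collectiondetails[]` envelope reflects the Steam Web API response shape as commonly documented and should be verified against a live call. The `unclassified` bucket (for `result==1` with no children) is an assumption of this sketch, since the spec does not say how the API reports a plain mod.

```python
from typing import Dict, List, Tuple


def build_collection_form(collection_ids: List[str]) -> Dict[str, str]:
    """Form fields for one batched GetCollectionDetails POST."""
    form = {"collectioncount": str(len(collection_ids))}
    for i, cid in enumerate(collection_ids):
        form[f"publishedfileids[{i}]"] = cid
    return form


def expand_collections(payload: dict) -> Tuple[Dict[str, List[str]], List[str], List[dict]]:
    """Return (expanded children by collection, unclassified ids, warnings)."""
    expanded: Dict[str, List[str]] = {}
    unclassified: List[str] = []
    warnings: List[dict] = []
    for detail in payload.get("response", {}).get("collectiondetails", []):
        cid = detail.get("publishedfileid")
        children = detail.get("children") or []
        if detail.get("result") == 1 and children:
            expanded[cid] = [c["publishedfileid"] for c in children]
        elif detail.get("result") == 1:
            # Assumption: success without children is left for the caller to classify.
            unclassified.append(cid)
        else:
            warnings.append({
                "tag": "collection-partial",
                "level": "warning",
                "msg": f"collection {cid} could not be fetched",
            })
    return expanded, unclassified, warnings
```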
### 5.3 Caching
Hit on `collections` row where `last_fetched_at > now() - interval '6 hours'`:
- Skip the API call entirely.
- Use cached `child_workshop_ids` directly.
Miss / stale → call API, UPSERT into `collections`, then proceed. The `last_fetched_at = now()` write is the cache write.
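The hit/miss logic can be sketched with an in-memory dict standing in for the `collections` table. `resolve_children`, `cache`, and `fetch` are hypothetical names; in the real service the lookup and the UPSERT would be SQL statements.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List

COLLECTION_TTL = timedelta(hours=6)


def resolve_children(
    collection_id: str,
    cache: Dict[str, dict],
    fetch: Callable[[str], List[str]],
    now: datetime,
) -> List[str]:
    """Return child wsids, calling `fetch` only on a cache miss or stale row."""
    row = cache.get(collection_id)
    if row is not None and row["last_fetched_at"] > now - COLLECTION_TTL:
        # Fresh hit: skip the API call entirely.
        return row["child_workshop_ids"]
    children = fetch(collection_id)
    # Writing last_fetched_at = now is the cache write (the UPSERT).
    cache[collection_id] = {"child_workshop_ids": children, "last_fetched_at": now}
    return children
```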
### 5.4 Flakiness
One internal retry with 2s backoff on HTTP error or `result!=1` for a candidate. Once the retry is exhausted, the candidate is reported as `collection-partial` (warning) and the job continues with whatever else resolved. (Open Q4, §10.)
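A minimal sketch of the retry policy, assuming a hypothetical `fetch_once` callable that returns the child list on success, returns `None` when the API reports `result!=1`, and raises on transport failure. `OSError` stands in for whatever error type the real HTTP client raises.

```python
import time
from typing import Callable, List, Optional


def fetch_with_retry(
    fetch_once: Callable[[], Optional[List[str]]],
    retries: int = 1,
    backoff_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[List[str]]:
    """Try once plus `retries` more times; None means report collection-partial."""
    for attempt in range(retries + 1):
        try:
            children = fetch_once()
        except OSError:
            # Treat a transport error like a failed result.
            children = None
        if children is not None:
            return children
        # Back off only if another attempt remains.
        if attempt < retries:
            sleep(backoff_s)
    return None
```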
## §6 Counts contract
Computed live on every `GET /api/jobs/{job_id}` against the job's `wsids[]`:
```sql
-- counts.cached
SELECT COUNT(DISTINCT mp.workshop_id)
FROM mod_parsed mp
JOIN workshop_meta wm ON wm.workshop_id = mp.workshop_id
WHERE mp.workshop_id = ANY($1::text[])
AND mp.parsed_at_time_updated = wm.time_updated;
-- counts.queued
SELECT COUNT(DISTINCT workshop_id)
FROM download_jobs
WHERE workshop_id = ANY($1::text[]) AND status = 'queued';
-- counts.draining (status='downloading' in DB; surfaced as 'draining' in API/UI)
SELECT COUNT(DISTINCT workshop_id)
FROM download_jobs
WHERE workshop_id = ANY($1::text[]) AND status = 'downloading';
```
Ownership precedent (Spec A §8): once a job is created, `wsids[]` is **locked**. `WORKSHOP_ITEMS_LINE` in the final `result_json` is computed from `sort_jobs.wsids[]`, **not** recomputed against current `mod_parsed`. This means a wsid that was in the input but is currently `non_mod` or `unknown` still appears in `WORKSHOP_ITEMS_LINE` in the same position - matching the locked contract from Spec A.
## §7 Frontend behavior
Status strip during polling:
| Phase | Strip text |
|---|---|
| `expanding` | `expanding collection…` (animated dot, no counts visible) |
| `queued` | `X cached · Y queued · 0 draining` (animated dots on queued) |
| `draining` | `X cached · Y queued · Z draining` (animated dots on queued + draining) |
| `done` | strip collapses, full result rendered |
| `failed` | red banner with `failure_reason` + Retry button |
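The table's text rules can be captured in a small pure function, shown here in Python for illustration even though the actual frontend code would live in the browser. `strip_text` is a hypothetical name; it returns `None` when the strip collapses and hands back `failure_reason` for the banner.

```python
from typing import Mapping, Optional


def strip_text(
    phase: str,
    counts: Mapping[str, int],
    failure_reason: Optional[str] = None,
) -> Optional[str]:
    """Text for the status strip; None means the strip is collapsed."""
    if phase == "expanding":
        return "expanding collection…"
    if phase == "queued":
        # The queued row of the table always shows 0 draining.
        return f"{counts['cached']} cached · {counts['queued']} queued · 0 draining"
    if phase == "draining":
        return (
            f"{counts['cached']} cached · {counts['queued']} queued · "
            f"{counts['draining']} draining"
        )
    if phase == "done":
        return None
    if phase == "failed":
        return failure_reason
    raise ValueError(f"unknown phase: {phase}")
```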
Polling: `setInterval` at 2.5s, started on receiving `job_id`. Stops on `phase ∈ (done, failed)`. On `404` (job expired/garbage-collected): show "this job expired - re-submit?" toast; offer one-click resubmit using cached input (the textarea is still populated).
Cancel button: shown during `expanding | queued | draining`. Issues `DELETE /api/jobs/{job_id}`, stops polling on success, clears the strip.
The synchronous code path (no `job_id` in response) renders unchanged - old picker behavior, immediate result.
Owned-fields contract (Spec A §8 precedent): `WORKSHOP_ITEMS_LINE`, `counts.queued` (the picker's internal counter), `unknown[]`, `non_mod[]` are still owned by the **first** `/api/sort` (or final `result_json`). `/api/resort` ignores them. The poll's `counts` object is purely the live drain progress and does not feed the picker's internal queued counter.
## §8 Cancellation
`DELETE /api/jobs/{job_id}` semantics:
- Marks `sort_jobs.phase = 'failed'`, `failure_reason = 'cancelled'`. Idempotent.
- **Does not** touch `download_jobs`. Workshop downloads in flight continue and populate `mod_parsed`, benefiting subsequent users via cache. Aborting them would waste partial progress and potentially trip the drain's `STALE_RECLAIM_MIN` reclaim path. (Open Q6, §10.)
- Frontend stops polling, hides the strip, shows a small "cancelled" toast. The textarea retains the input.
Re-submitting the same input after cancel creates a *new* job. Collection-cache hits make the second submission instant if the cache hasn't expired.
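The idempotency rule can be sketched against an in-memory job row. `cancel_job` is a hypothetical helper; the real handler would issue an `UPDATE` on `sort_jobs` and never touch `download_jobs`.

```python
TERMINAL_PHASES = {"done", "failed"}


def cancel_job(job: dict) -> int:
    """Apply DELETE /api/jobs/{job_id} semantics; always returns 204."""
    if job["phase"] not in TERMINAL_PHASES:
        job["phase"] = "failed"
        job["failure_reason"] = "cancelled"
    # An already-terminal job is left untouched: a no-op 204.
    return 204
```

Calling it twice on the same row yields the same state, and a job that already finished as `done` keeps its result.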
## §9 Restart resilience
uvicorn boot sweep (idempotent, runs in lifespan startup):
```sql
-- Time out long-stuck expansion jobs
UPDATE sort_jobs
SET phase = 'failed', failure_reason = 'expansion timed out',
updated_at = now()
WHERE phase = 'expanding'
AND phase_started_at < now() - interval '10 minutes';
```
Jobs in `queued` / `draining` need no special handling - they resume polling against `download_jobs` on the next client `GET`. The phase derives live from current counts (§4 phase computation rule), so a restart in the middle of a drain is invisible to the client beyond a brief window where counts may shift.
## §10 Open questions resolved
1. **Bare wsid + all-cached: synchronous or job-routed?** *Synchronous.* The cached path is sub-100ms today; routing it through a job adds polling latency and a UI flash. Frontend branches cheaply on `job_id` presence.
2. **Mixed input (bare wsids + collection URLs).** *Treat as collection input.* Job created in `expanding` phase immediately. Bare wsids merge into `wsids[]` after `GetCollectionDetails` resolves. No partial-sync hybrid - keeps the response shape rule clean.
3. **Partial expansion failure.** *Succeed with the resolvable subset.* Each unresolvable collection adds a warning `{tag:"collection-partial", level:"warning", msg:"collection X could not be fetched"}` to `result_json.WARNINGS`. Job completes normally; user sees the result with one or more amber warnings.
4. **`GetCollectionDetails` flakiness.** *One internal retry with 2s backoff* before reporting collection-partial. No frontend-driven retry on the GET poll - it would mask transient failures and give the user no recovery affordance. Job marked `failed` only if **every** candidate collection fails.
5. **Concurrent expansion of the same collection.** *Independent jobs; cache deduplicates.* User A and User B paste the same collection URL near-simultaneously; both create separate `sort_jobs` rows. The first one's `GetCollectionDetails` call populates `collections`; the second's hits cache. Worst case (race within the cache miss window) costs one duplicate API call. In-flight cache key (e.g., `collections.fetching_until`) deferred to Spec G.
6. **Cancel semantics.** *Abandon `sort_job`; leave `download_jobs` running.* Three reasons. (a) Workshop downloads benefit other users via the shared `mod_parsed` cache - wasting them is anti-social. (b) The drain's `STALE_RECLAIM_MIN=30` reclaim path treats half-killed `downloading` rows as candidates for retry; introducing client-driven cancellation creates a class of races where the row is killed mid-write. (c) Worker-side cancellation requires SIGTERM-of-DD-subprocess plumbing that doesn't exist; staying out of that codepath is much cheaper.
## §11 Acceptance criteria
- [ ] `POST /api/sort` with all-cached bare wsids returns the synchronous shape with no `job_id`.
- [ ] `POST /api/sort` with any uncached wsid OR any collection URL returns `{status, job_id}` and persists a `sort_jobs` row.
- [ ] `GET /api/jobs/{job_id}` returns live counts and the current phase per the §4 derivation rule.
- [ ] `GET /api/jobs/{nonexistent}` returns `404`.
- [ ] `DELETE /api/jobs/{job_id}` flips phase to `failed` with `failure_reason="cancelled"`. Idempotent.
- [ ] Collection URL `https://steamcommunity.com/sharedfiles/filedetails/?id=N` is detected by the parser and routed through `GetCollectionDetails`.
- [ ] A `collections` cache hit (row younger than 6h) skips the Steam API call.
- [ ] A collection that returns `result!=1` produces a `collection-partial` amber warning in `result_json.WARNINGS` but does not fail the job (unless **all** collections in the input are unresolvable).
- [ ] uvicorn restart with a job in `expanding > 10min` flips it to `failed` with `failure_reason="expansion timed out"`.
- [ ] uvicorn restart with a job in `queued`/`draining` is invisible to the client beyond next-poll-window jitter.
- [ ] Frontend polls every 2.5s when `phase ∈ (expanding, queued, draining)`; stops on terminal phase.
- [ ] Status strip text matches the §7 table for each phase.
- [ ] Cancel button issues `DELETE`, stops polling, hides strip, retains input in textarea.
- [ ] `WORKSHOP_ITEMS_LINE` in `result_json` matches `sort_jobs.wsids[]` regardless of which wsids ended up in `non_mod` / `unknown` (Spec A §8 ownership preserved).
## §12 Test recipes
1. **Synchronous fast path** - `POST /api/sort` with `{"input":"2169435993;2392709985;2487022075"}`. Expect: response has `MODS_LINE`, no `job_id`. ~50ms.
2. **Collection URL, cold cache** - clear `collections` row for the test ID; `POST /api/sort` with a known PZ collection URL. Expect: `{status:"expanding", job_id:"…"}` immediately. Poll: phase progresses `expanding → queued → draining → done`. Final `result.MODS_LINE` populated.
3. **Collection URL, warm cache** - re-submit the same URL within 6h. Expect: phase skips `expanding`, goes straight to `queued` (or `done` if all children cached). One Steam API call total across both runs (verify via `/var/log/...` or `journalctl -u sortof-api | grep GetCollectionDetails`).
4. **Mixed bare + collection** - `POST /api/sort` with `"<URL>\n2169435993"`. Expect: job created in `expanding`; on resolve, `wsids[]` contains both the collection's children and the bare wsid; deduped.
5. **Partial collection failure** - input contains two collection URLs, one valid, one to a deleted collection. Expect: job phase progresses normally; `result_json.WARNINGS` contains exactly one `collection-partial` entry; `wsids[]` contains only the valid collection's children.
6. **All collections fail** - input contains only unresolvable collection URLs. Expect: job `phase=failed`, `failure_reason="all input collections unresolvable"`.
7. **Cancel during draining** - submit a 50-mod cold collection, wait until `phase=draining`, `DELETE /api/jobs/{id}`. Expect: phase=failed reason=cancelled. Verify `download_jobs` rows for the wsids are still in `queued`/`downloading`/`done` (not nuked).
8. **Restart mid-drain** - submit a job, wait for `phase=draining`, `sudo systemctl restart sortof-api`. Wait 5s, GET the job. Expect: phase still derives correctly (computed from live counts), client polling resumes.
9. **Restart mid-expansion** - submit a collection job, kill `sortof-api` mid-expansion (the race window is hard to hit deliberately; simulate it by directly setting `phase='expanding', phase_started_at=now()-interval '15 minutes'` on the job row, then restart). Expect: lifespan sweep flips it to `failed` with `failure_reason="expansion timed out"`.
10. **404 on expired job** - manually `DELETE FROM sort_jobs WHERE job_id=…`; client poll. Expect: `404`. Frontend shows the expired-toast with re-submit affordance.
11. **Counts contract** - at each poll during a 50-mod cold drain, sum `counts.cached + counts.queued + counts.draining` and compare to `len(wsids)`. Expect: equal at every snapshot while every wsid is a real mod. Wsids that end up `non_mod` post-drain are the exception: `mod_parsed` has no row for them, so they contribute to none of the three counts and the sum falls short of `len(wsids)` by their number. Being "missing from all three buckets" is the expected steady state for non-mods.
12. **Concurrent collection submit** - open two browser tabs simultaneously and submit the same URL. Expect: two distinct `job_id`s, but only one `GetCollectionDetails` call lands at Steam (verify journal). Worst case (cache-miss race): two API calls; this is acceptable.