Add full sortof codebase: API, drain workers, frontend, schema, specs
270
docs/specs/2026-05-01-collection-expansion.md
Normal file
@@ -0,0 +1,270 @@

# Spec B+F - Collection URL/ID expansion + live drain progress

**Date:** 2026-05-01
**Status:** Draft (awaiting review)
**Sibling specs:** A multi-branch picker (shipped); C+D build-context + dep-add (next); E precacher (parallel); G cleanups + patch tier.
**Folds:** Original Spec F (live drain progress) merges in here - a 50+ mod cold load is exactly when live counters matter, and both features share the polling endpoint.

**Schema notes (corrections to design source text):**

- `download_jobs.status` enum is `queued | downloading | done | failed`. The design text used `running`; this spec uses the actual value `downloading`. UX label may render as "draining" for cohesion with the lifecycle vocabulary; the SQL keys off `downloading`.
- The existing `collections` table (`init/01_schema.sql`) has columns `collection_id PK, title, child_workshop_ids TEXT[], last_fetched_at TIMESTAMPTZ`. There is **no `expires_at` column**. TTL is computed at read time as `last_fetched_at + interval '6 hours'`; no schema change for that.

---

## §1 Overview

Today, sortof accepts one input shape: a blob of newline/`;`-delimited workshop IDs. Anything that isn't a 7–12 digit number is dropped by `parse.parse_workshop_input`. A pasted Steam Workshop *collection* URL embeds exactly one ID, so today that ID is surfaced as a single mod, fails parse (`process_one=no_mod_info`), and lands in the `non_mod` bucket added by the recent unknown/non-mod feature. The user is expected to copy every child mod's ID out by hand.

This spec adds:

1. **Collection URL/ID expansion.** The API recognizes Steam Workshop URLs and resolves collection IDs to their child wsids via `ISteamRemoteStorage/GetCollectionDetails`. Cached in the existing `collections` table.
2. **Async job pipeline.** Any input containing a collection or any uncached wsid creates a `sort_jobs` row, returns a `job_id`, and the frontend polls `GET /api/jobs/{job_id}` every 2.5s until `done|failed`.
3. **Live counters.** During `expanding | queued | draining`, the poll response carries fresh `cached / queued / draining` counts plus an incremental `result_json`. The status strip animates instead of going stale.

Synchronous response is preserved for the all-cached fast path (Open Q1, §10).

## §2 API contract

### 2.1 `POST /api/sort` - polymorphic on input

Request body unchanged: `{ "input": str, "rules": str? }`. Response shape branches on what's in `input`:

```jsonc
// Path A: bare wsid list, all in cache (current behavior, unchanged)
{ "status": "success", "MOD_DB": [...], "MODS_LINE": "...", ... }

// Path B: bare wsid list with ≥1 uncached, OR ≥1 collection URL
{ "status": "queued" | "expanding", "job_id": "<uuid>" }
```

The frontend branches on the presence of `job_id`. Old clients that don't poll silently get the original sync response when their input is fully warm.
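
The branch on `job_id` can be sketched in a few lines. This is a minimal illustration for a test client, not the frontend's actual code; the function name `classify_sort_response` is hypothetical.

```python
from typing import Any, Dict, Optional, Tuple


def classify_sort_response(body: Dict[str, Any]) -> Tuple[str, Optional[str]]:
    """Decide which path a POST /api/sort response took.

    Returns ("async", job_id) when the server created a job (Path B),
    otherwise ("sync", None) for the all-cached fast path (Path A).
    """
    job_id = body.get("job_id")
    if job_id:
        return ("async", job_id)
    return ("sync", None)
```

Keying off `job_id` rather than `status` keeps the check stable if further status values are added later.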

### 2.2 `GET /api/jobs/{job_id}` - polling endpoint

Response (any phase):

```jsonc
{
  "job_id": "<uuid>",
  "phase": "expanding" | "queued" | "draining" | "done" | "failed",
  "counts": { "cached": int, "queued": int, "draining": int },
  "wsids": [str, ...] | null,             // null while phase=expanding; populated after
  "result": { ...SORTOF_DATA... } | null, // partial during draining; final on done
  "failure_reason": str | null            // populated only on phase=failed
}
```

`404` if the `job_id` is unknown or expired (TTL in §3).

### 2.3 `DELETE /api/jobs/{job_id}` - cancel

Marks the job `failed` with `failure_reason="cancelled"`. Returns `204`. Idempotent: deleting an already-terminal job is a no-op `204`. Does **not** cancel underlying `download_jobs` rows (Open Q6, §10).

## §3 Schema

New table:

```sql
CREATE TABLE IF NOT EXISTS sort_jobs (
    job_id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    phase             TEXT NOT NULL CHECK (phase IN ('expanding','queued','draining','done','failed')),
    phase_started_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    created_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
    input_raw         TEXT NOT NULL,
    collection_ids    TEXT[] NOT NULL DEFAULT '{}',
    wsids             TEXT[],   -- null until expansion resolves
    rules_raw         TEXT,
    result_json       JSONB,    -- null until the first partial is written; incremental partials during draining, final on done
    failure_reason    TEXT
);
CREATE INDEX IF NOT EXISTS sort_jobs_phase_idx   ON sort_jobs (phase);
CREATE INDEX IF NOT EXISTS sort_jobs_updated_idx ON sort_jobs (updated_at);
```

- **TTL:** rows with `phase ∈ (done, failed)` whose `updated_at` is more than 24h old are eligible for deletion. The cleanup script lives in Spec G; this spec only requires that the schema support it.
- **`updated_at` trigger:** mirror the existing `download_jobs.touch_updated_at` pattern.
- **Migration plan:** `init/02_sort_jobs.sql` for fresh deploys + a one-shot `psql -f` for the live DB. No data migration; pure additive.

The existing `collections` table is reused as-is (4 columns, see corrections at top). No `expires_at` column; freshness derived from `last_fetched_at`.

## §4 Phase state machine

```
        ┌──────────────────────────────────┐
        │ /api/sort with collections only  │
        ▼                                  │
┌──────────────┐  GetCollectionDetails OK  │
│  expanding   │ ──────────────────────────┘
└──────┬───────┘
       │ wsids = collections + bare ids
       ▼
┌──────────────┐ ←── /api/sort with bare uncached wsids
│   queued     │ ─────────── all wsids in mod_parsed (skip drain)
└──────┬───────┘                  │
       │ first download_jobs row → downloading
       ▼                          │
┌──────────────┐                  │
│  draining    │                  │
└──────┬───────┘                  │
       │ all wsids resolved (mod_parsed has rows)
       │                          │
       ▼                          ▼
┌──────────────┐          ┌──────────────┐
│    done      │          │    done      │
└──────────────┘          └──────────────┘

Failure terminal at any phase: failed (with phase_at_failure stored in failure_reason prefix).
```

Phase transitions are **monotonic**: `expanding → queued → draining → done`. No backward transitions. A job's phase only advances; the API computes phase fresh on each `GET` rather than mutating it on every event (simpler, no leader needed).

Phase computation rule (executed inside `GET /api/jobs/{job_id}`):

```
if phase in (done, failed):        return as-stored
if wsids is null:                  phase = expanding
elif counts.draining > 0:          phase = draining
elif counts.queued > 0:            phase = queued
elif counts.cached >= len(wsids):  phase = done; persist result_json
else:                              phase = queued   # transient gap between rows
```
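
The rule above translates almost line for line into a pure function. The sketch below assumes the caller has already loaded the stored phase, the locked `wsids[]`, and the three live counts; the name `derive_phase` is hypothetical, and persisting `result_json` on the transition to `done` is left to the caller.

```python
from typing import Dict, List, Optional

TERMINAL_PHASES = {"done", "failed"}


def derive_phase(
    stored_phase: str,
    wsids: Optional[List[str]],
    counts: Dict[str, int],
) -> str:
    """Compute the phase reported by GET /api/jobs/{job_id}.

    `counts` carries the live `cached`, `queued` and `draining` values (§6).
    """
    if stored_phase in TERMINAL_PHASES:
        return stored_phase          # terminal phases are returned as stored
    if wsids is None:
        return "expanding"           # expansion has not resolved yet
    if counts["draining"] > 0:
        return "draining"
    if counts["queued"] > 0:
        return "queued"
    if counts["cached"] >= len(wsids):
        return "done"                # caller persists result_json here
    return "queued"                  # transient gap between rows
```

Keeping the derivation free of database access makes each branch straightforward to unit-test.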

## §5 Steam expansion

### 5.1 Detection
The current `parse.parse_workshop_input` strips ini-style prefixes and extracts `\b\d{7,12}\b`. We add a sibling `parse.parse_with_collections(text) -> (wsids: list, collection_ids: list)`:

- Match Steam URLs `https?://steamcommunity\.com/(?:sharedfiles|workshop)/filedetails/\?id=(\d{7,12})` and capture the ID.
- Bare numeric IDs (the existing pattern) remain `wsids`.
- A URL-form ID is classified as a *candidate collection*. We don't know syntactically whether a wsid is a collection vs a mod - so candidate collection IDs are sent to `GetCollectionDetails` first; if the API reports them as actual mods (not collections), they fall back to the wsids list.
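
A minimal sketch of the detection step, using the two patterns quoted above. It is illustrative only: the ini-prefix stripping done by the existing `parse.parse_workshop_input` is not reproduced, and the order-preserving de-duplication is an assumption rather than a stated requirement.

```python
import re
from typing import List, Tuple

# Both patterns are taken from the spec text.
URL_RE = re.compile(
    r"https?://steamcommunity\.com/(?:sharedfiles|workshop)/filedetails/\?id=(\d{7,12})"
)
BARE_ID_RE = re.compile(r"\b\d{7,12}\b")


def parse_with_collections(text: str) -> Tuple[List[str], List[str]]:
    """Split raw input into (wsids, candidate collection_ids)."""
    # IDs that arrive in URL form are candidate collections.
    collection_ids = list(dict.fromkeys(URL_RE.findall(text)))
    # Blank out the URLs so their IDs are not counted again as bare wsids.
    remainder = URL_RE.sub(" ", text)
    wsids = list(dict.fromkeys(BARE_ID_RE.findall(remainder)))
    return wsids, collection_ids
```

For example, an input with one URL followed by two bare IDs yields one candidate collection and two wsids; the fallback of non-collection candidates into `wsids` happens later, after `GetCollectionDetails` responds.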

### 5.2 Resolution
Single batched call per `/api/sort` with ≥1 candidate:

```
POST https://api.steampowered.com/ISteamRemoteStorage/GetCollectionDetails/v1/
  collectioncount=N
  publishedfileids[0..N-1]=...
```

Per-collection in the response: `result==1` and `children[]` populated → expand to `[c.publishedfileid for c in children]`. `result!=1` → mark in result warnings as `{tag:"collection-partial", level:"warning", msg:"collection X could not be fetched"}`; keep the job alive with whatever resolved. (Open Q3, §10.)
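
The request body and the per-collection handling can be sketched without touching the network. The helper names are hypothetical, and the response envelope (`response.collectiondetails[]`) is an assumption about the Steam payload that should be checked against a real response; only the `result`, `children[]` and `publishedfileid` fields come from the text above.

```python
from typing import Any, Dict, List, Tuple


def build_collection_form(collection_ids: List[str]) -> Dict[str, str]:
    """Form fields for one batched GetCollectionDetails POST."""
    form = {"collectioncount": str(len(collection_ids))}
    for i, cid in enumerate(collection_ids):
        form[f"publishedfileids[{i}]"] = cid
    return form


def expand_collections(payload: Dict[str, Any]) -> Tuple[List[str], List[Dict[str, str]]]:
    """Return (child wsids, warnings) from a decoded response body."""
    wsids: List[str] = []
    warnings: List[Dict[str, str]] = []
    # Assumed envelope: {"response": {"collectiondetails": [...]}}
    details = payload.get("response", {}).get("collectiondetails", [])
    for entry in details:
        cid = str(entry.get("publishedfileid", "?"))
        children = entry.get("children") or []
        if entry.get("result") == 1 and children:
            wsids.extend(str(c["publishedfileid"]) for c in children)
        elif entry.get("result") != 1:
            warnings.append({
                "tag": "collection-partial",
                "level": "warning",
                "msg": f"collection {cid} could not be fetched",
            })
        # result == 1 with no children is not covered by the rule above;
        # this sketch contributes nothing for that case.
    return list(dict.fromkeys(wsids)), warnings
```

A caller could then mark the job `failed` when every candidate produced a warning, and otherwise attach the warnings to `result_json.WARNINGS`.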

### 5.3 Caching
Hit on `collections` row where `last_fetched_at > now() - interval '6 hours'`:
- Skip the API call entirely.
- Use cached `child_workshop_ids` directly.

Miss / stale → call API, UPSERT into `collections`, then proceed. The `last_fetched_at = now()` write is the cache write.
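
The freshness test is a single comparison. A sketch in Python, mirroring the SQL predicate `last_fetched_at > now() - interval '6 hours'`; the names `COLLECTION_TTL` and `is_cache_fresh` are hypothetical, and timestamps are assumed to be timezone-aware (as `TIMESTAMPTZ` values are when read through a typical driver).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

COLLECTION_TTL = timedelta(hours=6)


def is_cache_fresh(
    last_fetched_at: Optional[datetime],
    now: Optional[datetime] = None,
) -> bool:
    """True when a `collections` row is young enough to skip the Steam call."""
    if last_fetched_at is None:
        return False                      # no row: treat as a cache miss
    if now is None:
        now = datetime.now(timezone.utc)
    return last_fetched_at > now - COLLECTION_TTL
```

Passing `now` explicitly keeps the check deterministic in tests.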

### 5.4 Flakiness
One internal retry with 2s backoff on HTTP error or `result!=1` for a candidate. Once that retry is exhausted, the candidate is reported as collection-partial (warning) but the job continues with whatever else resolved. (Open Q4, §10.)
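
A minimal sketch of the one-retry policy. The helper `call_with_one_retry` is hypothetical, and it assumes the wrapped `fetch` callable returns `None` (or raises) on HTTP error or `result!=1`, and a value on success. The `sleep` parameter is injectable so tests do not have to wait.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_one_retry(
    fetch: Callable[[], Optional[T]],
    backoff_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `fetch`; on failure wait `backoff_s` and try exactly once more."""
    for attempt in range(2):
        try:
            result = fetch()
        except Exception:
            result = None            # treat any raised error as a failed attempt
        if result is not None:
            return result
        if attempt == 0:
            sleep(backoff_s)         # back off only before the single retry
    return None                      # caller reports collection-partial
```

In an async handler the same shape would use `asyncio.sleep` instead of a blocking sleep; that variant is not shown here.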

## §6 Counts contract

Computed live on every `GET /api/jobs/{job_id}` against the job's `wsids[]`:

```sql
-- counts.cached
SELECT COUNT(DISTINCT mp.workshop_id)
  FROM mod_parsed mp
  JOIN workshop_meta wm ON wm.workshop_id = mp.workshop_id
 WHERE mp.workshop_id = ANY($1::text[])
   AND mp.parsed_at_time_updated = wm.time_updated;

-- counts.queued
SELECT COUNT(DISTINCT workshop_id)
  FROM download_jobs
 WHERE workshop_id = ANY($1::text[]) AND status = 'queued';

-- counts.draining (status='downloading' in DB; surfaced as 'draining' in API/UI)
SELECT COUNT(DISTINCT workshop_id)
  FROM download_jobs
 WHERE workshop_id = ANY($1::text[]) AND status = 'downloading';
```

Ownership precedent (Spec A §8): once a job is created, `wsids[]` is **locked**. `WORKSHOP_ITEMS_LINE` in the final `result_json` is computed from `sort_jobs.wsids[]`, **not** recomputed against current `mod_parsed`. This means a wsid that was in the input but is currently `non_mod` or `unknown` still appears in `WORKSHOP_ITEMS_LINE` in the same position - matching the locked contract from Spec A.

## §7 Frontend behavior

Status strip during polling:

| Phase | Strip text |
|---|---|
| `expanding` | `expanding collection…` (animated dot, no counts visible) |
| `queued` | `X cached · Y queued · 0 draining` (animated dots on queued) |
| `draining` | `X cached · Y queued · Z draining` (animated dots on queued + draining) |
| `done` | strip collapses, full result rendered |
| `failed` | red banner with `failure_reason` + Retry button |

Polling: `setInterval` at 2.5s, started on receiving `job_id`. Stops on `phase ∈ (done, failed)`. On `404` (job expired/garbage-collected): show "this job expired - re-submit?" toast; offer one-click resubmit using cached input (the textarea is still populated).
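
The same stop conditions can be exercised from a script, for instance when walking through the §12 recipes. The sketch below is a Python stand-in for the polling loop, not the frontend code (which uses `setInterval`); `poll_job` and its `get_job` callable are hypothetical, with `get_job` assumed to return the decoded JSON body, or `None` when the server answers `404`.

```python
import time
from typing import Any, Callable, Dict, Optional

TERMINAL_PHASES = {"done", "failed"}


def poll_job(
    get_job: Callable[[str], Optional[Dict[str, Any]]],
    job_id: str,
    interval_s: float = 2.5,
    sleep: Callable[[float], None] = time.sleep,
    on_update: Callable[[Dict[str, Any]], None] = lambda job: None,
) -> Optional[Dict[str, Any]]:
    """Poll until the job is terminal; return None if it expired (404)."""
    while True:
        job = get_job(job_id)
        if job is None:
            return None                  # expired / garbage-collected job
        on_update(job)                   # e.g. refresh the status strip
        if job["phase"] in TERMINAL_PHASES:
            return job
        sleep(interval_s)
```

The `on_update` hook receives every snapshot, so a test can record the observed phase sequence and compare it with the expected progression.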

Cancel button: shown during `expanding | queued | draining`. Issues `DELETE /api/jobs/{job_id}`, stops polling on success, clears the strip.

The synchronous code path (no `job_id` in response) renders unchanged - old picker behavior, immediate result.

Owned-fields contract (Spec A §8 precedent): `WORKSHOP_ITEMS_LINE`, `counts.queued` (the picker's internal counter), `unknown[]`, `non_mod[]` are still owned by the **first** `/api/sort` (or final `result_json`). `/api/resort` ignores them. The poll's `counts` object is purely the live drain progress and does not feed the picker's internal queued counter.

## §8 Cancellation

`DELETE /api/jobs/{job_id}` semantics:

- Marks `sort_jobs.phase = 'failed'`, `failure_reason = 'cancelled'`. Idempotent.
- **Does not** touch `download_jobs`. Workshop downloads in flight continue and populate `mod_parsed`, benefiting subsequent users via cache. Aborting them would waste partial progress and potentially trip the drain's `STALE_RECLAIM_MIN` reclaim path. (Open Q6, §10.)
- Frontend stops polling, hides the strip, shows a small "cancelled" toast. The textarea retains the input.

Re-submitting the same input after cancel creates a *new* job. Collection-cache hits make the second submission instant if the cache hasn't expired.

## §9 Restart resilience

uvicorn boot sweep (idempotent, runs in lifespan startup):

```sql
-- Time out long-stuck expansion jobs
UPDATE sort_jobs
   SET phase = 'failed', failure_reason = 'expansion timed out',
       updated_at = now()
 WHERE phase = 'expanding'
   AND phase_started_at < now() - interval '10 minutes';
```

Jobs in `queued` / `draining` need no special handling - they resume polling against `download_jobs` on the next client `GET`. The phase derives live from current counts (§4 phase computation rule), so a restart in the middle of a drain is invisible to the client beyond a brief window where counts may shift.

## §10 Open questions resolved

1. **Bare wsid + all-cached: synchronous or job-routed?** *Synchronous.* The cached path is sub-100ms today; routing it through a job adds polling latency and a UI flash. Frontend branches cheaply on `job_id` presence.
2. **Mixed input (bare wsids + collection URLs).** *Treat as collection input.* Job created in `expanding` phase immediately. Bare wsids merge into `wsids[]` after `GetCollectionDetails` resolves. No partial-sync hybrid - keeps the response shape rule clean.
3. **Partial expansion failure.** *Succeed with the resolvable subset.* Each unresolvable collection adds a warning `{tag:"collection-partial", level:"warning", msg:"collection X could not be fetched"}` to `result_json.WARNINGS`. Job completes normally; user sees the result with one or more amber warnings.
4. **`GetCollectionDetails` flakiness.** *One internal retry with 2s backoff* before reporting collection-partial. No frontend-driven retry on the GET poll - it would mask transient failures and give the user no recovery affordance. Job marked `failed` only if **every** candidate collection fails.
5. **Concurrent expansion of the same collection.** *Independent jobs; cache deduplicates.* User A and User B paste the same collection URL near-simultaneously; both create separate `sort_jobs` rows. The first one's `GetCollectionDetails` call populates `collections`; the second's hits cache. Worst case (race within the cache miss window) costs one duplicate API call. In-flight cache key (e.g., `collections.fetching_until`) deferred to Spec G.
6. **Cancel semantics.** *Abandon `sort_job`; leave `download_jobs` running.* Three reasons. (a) Workshop downloads benefit other users via the shared `mod_parsed` cache - wasting them is anti-social. (b) The drain's `STALE_RECLAIM_MIN=30` reclaim path treats half-killed `downloading` rows as candidates for retry; introducing client-driven cancellation creates a class of races where the row is killed mid-write. (c) Worker-side cancellation requires SIGTERM-of-DD-subprocess plumbing that doesn't exist; staying out of that codepath is much cheaper.

## §11 Acceptance criteria

- [ ] `POST /api/sort` with all-cached bare wsids returns the synchronous shape with no `job_id`.
- [ ] `POST /api/sort` with any uncached wsid OR any collection URL returns `{status, job_id}` and persists a `sort_jobs` row.
- [ ] `GET /api/jobs/{job_id}` returns live counts and the current phase per the §4 derivation rule.
- [ ] `GET /api/jobs/{nonexistent}` returns `404`.
- [ ] `DELETE /api/jobs/{job_id}` flips phase to `failed` with `failure_reason="cancelled"`. Idempotent.
- [ ] Collection URL `https://steamcommunity.com/sharedfiles/filedetails/?id=N` is detected by the parser and routed through `GetCollectionDetails`.
- [ ] A `collections` cache hit (row younger than 6h) skips the Steam API call.
- [ ] A collection that returns `result!=1` produces a `collection-partial` amber warning in `result_json.WARNINGS` but does not fail the job (unless **all** collections in the input are unresolvable).
- [ ] uvicorn restart with a job in `expanding > 10min` flips it to `failed` with `failure_reason="expansion timed out"`.
- [ ] uvicorn restart with a job in `queued`/`draining` is invisible to the client beyond next-poll-window jitter.
- [ ] Frontend polls every 2.5s when `phase ∈ (expanding, queued, draining)`; stops on terminal phase.
- [ ] Status strip text matches the §7 table for each phase.
- [ ] Cancel button issues `DELETE`, stops polling, hides strip, retains input in textarea.
- [ ] `WORKSHOP_ITEMS_LINE` in `result_json` matches `sort_jobs.wsids[]` regardless of which wsids ended up in `non_mod` / `unknown` (Spec A §8 ownership preserved).

## §12 Test recipes

1. **Synchronous fast path** - `POST /api/sort` with `{"input":"2169435993;2392709985;2487022075"}`. Expect: response has `MODS_LINE`, no `job_id`. ~50ms.
2. **Collection URL, cold cache** - clear `collections` row for the test ID; `POST /api/sort` with a known PZ collection URL. Expect: `{status:"expanding", job_id:"…"}` immediately. Poll: phase progresses `expanding → queued → draining → done`. Final `result.MODS_LINE` populated.
3. **Collection URL, warm cache** - re-submit the same URL within 6h. Expect: phase skips `expanding`, goes straight to `queued` (or `done` if all children cached). One Steam API call total across both runs (verify via `/var/log/...` or `journalctl -u sortof-api | grep GetCollectionDetails`).
4. **Mixed bare + collection** - `POST /api/sort` with `"<URL>\n2169435993"`. Expect: job created in `expanding`; on resolve, `wsids[]` contains both the collection's children and the bare wsid; deduped.
5. **Partial collection failure** - input contains two collection URLs, one valid, one to a deleted collection. Expect: job phase progresses normally; `result_json.WARNINGS` contains exactly one `collection-partial` entry; `wsids[]` contains only the valid collection's children.
6. **All collections fail** - input contains only unresolvable collection URLs. Expect: job `phase=failed`, `failure_reason="all input collections unresolvable"`.
7. **Cancel during draining** - submit a 50-mod cold collection, wait until `phase=draining`, `DELETE /api/jobs/{id}`. Expect: phase=failed reason=cancelled. Verify `download_jobs` rows for the wsids are still in `queued`/`downloading`/`done` (not nuked).
8. **Restart mid-drain** - submit a job, wait for `phase=draining`, `sudo systemctl restart sortof-api`. Wait 5s, GET the job. Expect: phase still derives correctly (computed from live counts), client polling resumes.
9. **Restart mid-expansion** - submit a collection job, kill `sortof-api` mid-expansion (the race window is hard to hit deliberately; simulate it by running `UPDATE sort_jobs SET phase='expanding', phase_started_at=now()-interval '15 minutes'` against the job's row, then restart). Expect: lifespan sweep flips it to `failed` with `failure_reason="expansion timed out"`.
10. **404 on expired job** - manually `DELETE FROM sort_jobs WHERE job_id=…`; client poll. Expect: `404`. Frontend shows the expired-toast with re-submit affordance.
11. **Counts contract** - at each poll during a 50-mod cold drain, sum `counts.cached + counts.queued + counts.draining` and compare to `len(wsids)`. Expect: the sum never exceeds `len(wsids)`, and any shortfall is explained by wsids that sit in none of the three buckets. (Some wsids may be `non_mod` post-drain; they count as `cached=0, queued=0, draining=0` because `mod_parsed` has no row - being "missing from all three buckets" is the expected steady state for non-mods.)
12. **Concurrent collection submit** - open two browser tabs simultaneously and submit the same URL. Expect: two distinct `job_id`s, but only one `GetCollectionDetails` call lands at Steam (verify journal). Worst case (cache-miss race): two API calls; this is acceptable.
