sortof/docs/plans/2026-05-01-collection-expansion.md

# Collection Expansion + Live Drain Progress Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Accept Steam Workshop collection URLs in `/api/sort`, expand them server-side via `GetCollectionDetails`, and drive a polling endpoint (`/api/jobs/{job_id}`) that gives the frontend live `cached / queued / draining` counters during cold loads.

**Architecture:** A new `sort_jobs` table tracks asynchronous expansion + drain lifecycles. `/api/sort` becomes polymorphic: all-cached bare wsids return synchronously (unchanged); anything that needs work returns a `job_id`. The frontend polls `GET /api/jobs/{job_id}` every 2.5s and renders phase-specific status strip text. Phase is **derived live** from `download_jobs` and `mod_parsed` counts on every poll - no event log, no leader, restart-resilient by construction.

**Tech Stack:** Postgres (new table + indexes via additive migration), FastAPI (two new routes + background `asyncio.create_task` for expansion), asyncpg parameterized queries (mirroring existing patterns), httpx for `GetCollectionDetails`, vanilla React + Babel-standalone on the frontend (no build step - same as Spec A).

> **Spec dependency:** Read `/opt/sortof/docs/specs/2026-05-01-collection-expansion.md` (270 lines, all decisions locked in §10) before starting. The acceptance criteria in §11 and the test recipes in §12 are what Task 11 verifies.

---

## File structure

| Path | Action | Responsibility |
|---|---|---|
| `/opt/sortof/init/02_sort_jobs.sql` | **Create** | Schema for fresh deploys (idempotent `CREATE TABLE IF NOT EXISTS`). Identical DDL also applied to the live DB via one-shot `psql` in Task 1. |
| `/opt/sortof/api/parse.py` | Modify | Add `parse_with_collections(text) -> (wsids, collection_ids)`. Reuses the existing wsid extractor; classifies URL-form IDs as candidate collections. |
| `/opt/sortof/api/steam.py` | Modify | Add `async fetch_collection_details(client, ids)` mirroring the existing `fetch_workshop_details` pattern. |
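| `/opt/sortof/api/jobs.py` | **Create** | `sort_jobs` row CRUD, phase derivation (the §4 rule executed inside `GET`), live counts SQL, lifespan-startup stale-expansion sweep. |
| `/opt/sortof/api/expansion.py` | **Create** | Background expansion runner (`run_expansion`): resolves collection IDs via the `collections` cache and `GetCollectionDetails`, writes `wsids`, advances the job to `queued`. Written in Task 5. |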
| `/opt/sortof/api/app.py` | Modify | Polymorphic `/api/sort`; new `GET /api/jobs/{job_id}`; new `DELETE /api/jobs/{job_id}`; lifespan sweep wired in. |
| `/opt/sortof/frontend/sortof-app.jsx` | Modify | Detect `job_id` in `/api/sort` response; `pollJob()` async loop @ 2.5s; phase-specific status-strip text; cancel button; 404 expired-job toast. |
| `/opt/sortof/frontend/index.html` | Modify | CSS for new phase indicators (e.g. `.status-pill.expanding`, `.cancel-btn`). |

No changes to `/opt/sortof/worker/` - drain stays exactly as-is. Collections expand at API time; the resulting wsids flow into `download_jobs` via the existing queueing path.

**Verification fixtures** (referenced throughout):
- All-cached bare wsids: `2169435993;2392709985;2487022075` → MODS_LINE="modoptions;tsarslib;TMC_TrueActions" (canonical sync regression).
- Synthetic collection (no Steam round-trip): direct `INSERT INTO collections` with known children. Used for cache-hit verification.
- Real Steam collection: Task 11 step 2 instructs the implementer to find a public PZ collection URL on `https://steamcommunity.com/workshop/browse/?appid=108600&section=collections` and use its ID. Required for cold-expansion path.

---

## Task 1: Schema migration - `sort_jobs` table

**Files:**
- Create: `/opt/sortof/init/02_sort_jobs.sql`
- One-shot apply to live DB via `docker exec sortof_db psql`

- [ ] **Step 1: Write the schema file**

Create `/opt/sortof/init/02_sort_jobs.sql` with:

```sql
-- Async sort jobs: lifecycle + result for collection expansion + cold drains.
-- Created 2026-05-01 (Spec B+F).

CREATE TABLE IF NOT EXISTS sort_jobs (
    job_id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    phase            TEXT NOT NULL CHECK (phase IN ('expanding','queued','draining','done','failed')),
    phase_started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    created_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    input_raw        TEXT NOT NULL,
    collection_ids   TEXT[] NOT NULL DEFAULT '{}',
    wsids            TEXT[],
    rules_raw        TEXT,
    result_json      JSONB,
    failure_reason   TEXT
);

CREATE INDEX IF NOT EXISTS sort_jobs_phase_idx ON sort_jobs (phase);
CREATE INDEX IF NOT EXISTS sort_jobs_updated_idx ON sort_jobs (updated_at);

DROP TRIGGER IF EXISTS sort_jobs_touch ON sort_jobs;
CREATE TRIGGER sort_jobs_touch
    BEFORE UPDATE ON sort_jobs
    FOR EACH ROW
    EXECUTE FUNCTION touch_updated_at();
```

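The `touch_updated_at()` function already exists (defined in `init/01_schema.sql` for `download_jobs`). `gen_random_uuid()` is built into PostgreSQL 13 and later; on an older server it requires the `pgcrypto` extension.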

Note: `init/` is owned by root. Use `sudo tee` to write the file:

```bash
sudo tee /opt/sortof/init/02_sort_jobs.sql > /dev/null <<'SQL'
-- Async sort jobs: lifecycle + result for collection expansion + cold drains.
-- Created 2026-05-01 (Spec B+F).

CREATE TABLE IF NOT EXISTS sort_jobs (
    job_id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    phase            TEXT NOT NULL CHECK (phase IN ('expanding','queued','draining','done','failed')),
    phase_started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    created_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    input_raw        TEXT NOT NULL,
    collection_ids   TEXT[] NOT NULL DEFAULT '{}',
    wsids            TEXT[],
    rules_raw        TEXT,
    result_json      JSONB,
    failure_reason   TEXT
);

CREATE INDEX IF NOT EXISTS sort_jobs_phase_idx ON sort_jobs (phase);
CREATE INDEX IF NOT EXISTS sort_jobs_updated_idx ON sort_jobs (updated_at);

DROP TRIGGER IF EXISTS sort_jobs_touch ON sort_jobs;
CREATE TRIGGER sort_jobs_touch
    BEFORE UPDATE ON sort_jobs
    FOR EACH ROW
    EXECUTE FUNCTION touch_updated_at();
SQL
```

- [ ] **Step 2: Apply DDL to the live DB**

```bash
sudo docker exec -i sortof_db psql -U sortof -d sortof < /opt/sortof/init/02_sort_jobs.sql
```

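Expected: the command tags `CREATE TABLE`, `CREATE INDEX` (twice), `DROP TRIGGER`, and `CREATE TRIGGER`. On a first run, `DROP TRIGGER IF EXISTS` also emits a `NOTICE` that the trigger does not exist; on a re-run, the `IF NOT EXISTS` statements emit `already exists, skipping` notices instead. Either way there should be no `ERROR` lines.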

- [ ] **Step 3: Verify the table exists with the right columns**

```bash
sudo docker exec -i sortof_db psql -U sortof -d sortof -c "\d sort_jobs"
```

Expected: 11 columns matching the schema, 3 indexes (PK + phase_idx + updated_idx), trigger present. The `\d` output should mention `gen_random_uuid()` as the default for `job_id` and the CHECK constraint on `phase`.

- [ ] **Step 4: Smoke insert / select**

```bash
sudo docker exec -i sortof_db psql -U sortof -d sortof -c "
INSERT INTO sort_jobs (phase, input_raw) VALUES ('expanding', 'smoke') RETURNING job_id, phase;
DELETE FROM sort_jobs WHERE input_raw='smoke';"
```

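Expected: one row returned with a UUID and phase='expanding', followed by `DELETE 1`. psql releases before 15 print only the last statement's result when `-c` carries several statements, so on those versions only `DELETE 1` appears; that still confirms the insert succeeded.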

- [ ] **Step 5: Checkpoint** - schema is live; foundation for Tasks 4+ ready. No backup needed (DDL is idempotent and the table was empty).

---

## Task 2: Parser extension - `parse_with_collections()`

**Files:**
- Modify: `/opt/sortof/api/parse.py`

- [ ] **Step 1: Backup**

```bash
cp /opt/sortof/api/parse.py /opt/sortof/api/parse.py.bak-$(date +%Y%m%d-%H%M)
```

- [ ] **Step 2: Add the new function and a Steam-URL regex**

Open `/opt/sortof/api/parse.py`. The current file defines only `parse_workshop_input(text)`. Add the following at the bottom. Do **not** modify the existing function: `app.py` keeps importing it (Task 6 Step 2), and Step 4 below asserts that it still behaves as before:

```python
import re as _re_module  # already imported at top; alias avoided if duplicate

# Steam Workshop URL form: https://steamcommunity.com/{sharedfiles,workshop}/filedetails/?id=NNNNNNN
_STEAM_URL_RE = re.compile(
    r"https?://steamcommunity\.com/(?:sharedfiles|workshop)/filedetails/\?id=(\d{7,12})",
    re.IGNORECASE,
)


def parse_with_collections(text: str) -> tuple[List[str], List[str]]:
    """Split an input blob into bare wsids and candidate collection IDs.

    A "candidate collection" is any 7-12-digit ID that appears inside a
    Steam Workshop URL. Bare numeric IDs in the same blob are treated as
    mod wsids (current behavior). Steam doesn't syntactically distinguish
    collection IDs from mod IDs; the candidate list is sent to
    GetCollectionDetails to confirm. If a candidate isn't actually a
    collection, the caller falls back to treating it as a wsid.

    Returns (wsids, collection_ids), each deduped and in first-seen order.
    """
    if not text:
        return ([], [])

    # 1. Find URL-form IDs FIRST (so they don't get double-counted as bare).
    url_ids: List[str] = []
    seen_url: set[str] = set()
    for m in _STEAM_URL_RE.finditer(text):
        i = m.group(1)
        if i not in seen_url:
            seen_url.add(i)
            url_ids.append(i)

    # 2. Strip the URLs out before extracting bare numbers.
    text_minus_urls = _STEAM_URL_RE.sub("", text)

    # 3. Bare wsids: same regex as parse_workshop_input.
    cleaned = re.sub(
        r"^\s*(WorkshopItems|Mods|Map)\s*=\s*",
        "",
        text_minus_urls,
        flags=re.MULTILINE | re.IGNORECASE,
    )
    bare_ids = re.findall(r"\b\d{7,12}\b", cleaned)
    seen_bare: set[str] = set()
    bare_unique: List[str] = []
    for i in bare_ids:
        if i not in seen_bare and i not in seen_url:
            seen_bare.add(i)
            bare_unique.append(i)

    return (bare_unique, url_ids)
```

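(The `import re as _re_module` line is only a paste-safe stub: `re` is already imported at the top of the file, and the alias `_re_module` is never referenced, so the line can be deleted outright. The block also relies on `List` being imported from `typing` at the top of the file. The `tuple[...]` return annotation is evaluated at definition time, so it needs Python 3.9 or later unless the file starts with `from __future__ import annotations`.)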
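
To make the matching rules concrete, the sketch below re-declares the same pattern and probes a few URL shapes. The sample URLs and the `url_ids` helper are illustrative only. The point is that `?id=` must immediately follow `filedetails/`: a URL that puts another query parameter first is not recognised as URL-form, so its ID is picked up by the bare-number pass instead and treated as a mod wsid.

```python
import re

# Same pattern as _STEAM_URL_RE in parse.py, repeated so the snippet runs standalone.
STEAM_URL_RE = re.compile(
    r"https?://steamcommunity\.com/(?:sharedfiles|workshop)/filedetails/\?id=(\d{7,12})",
    re.IGNORECASE,
)


def url_ids(text: str) -> list[str]:
    """Return every ID captured from a URL-form match, in order of appearance."""
    return [m.group(1) for m in STEAM_URL_RE.finditer(text)]


# Illustrative inputs (hypothetical, not taken from the spec).
samples = {
    "plain": "https://steamcommunity.com/sharedfiles/filedetails/?id=2200148440",
    "workshop path": "http://steamcommunity.com/workshop/filedetails/?id=2200148440",
    "trailing param": "https://steamcommunity.com/sharedfiles/filedetails/?id=2200148440&searchtext=x",
    "leading param": "https://steamcommunity.com/sharedfiles/filedetails/?l=english&id=2200148440",
}

for label, url in samples.items():
    print(f"{label:15} -> {url_ids(url)}")
```

Only the "leading param" sample yields an empty list. If such URLs need to count as collection candidates, the pattern would have to be loosened; the plan as written does not require it.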

- [ ] **Step 3: py_compile**

```bash
/opt/sortof/api/.venv/bin/python -m py_compile /opt/sortof/api/parse.py && echo PY_OK
```

- [ ] **Step 4: Functional smoke test with the venv interpreter**

```bash
cd /opt/sortof/api && .venv/bin/python -c "
from parse import parse_with_collections, parse_workshop_input

# Pure bare wsids - backwards compat.
assert parse_with_collections('2169435993;2392709985') == (['2169435993','2392709985'], [])

# Pure URL.
assert parse_with_collections('https://steamcommunity.com/sharedfiles/filedetails/?id=2200148440') == ([], ['2200148440'])

# Mixed: URL + bare.
assert parse_with_collections('https://steamcommunity.com/sharedfiles/filedetails/?id=2200148440\n2169435993') == (['2169435993'], ['2200148440'])

# Same ID appearing both as URL AND as bare → URL wins, bare side dedups.
assert parse_with_collections('2200148440\nhttps://steamcommunity.com/sharedfiles/filedetails/?id=2200148440') == ([], ['2200148440'])

# Empty.
assert parse_with_collections('') == ([], [])
assert parse_with_collections(None) == ([], [])

# Existing parse_workshop_input still works.
assert parse_workshop_input('2169435993;2392709985') == ['2169435993','2392709985']

print('ALL_OK')
"
```

Expected: `ALL_OK`. Any AssertionError stops the task - fix the regex/dedupe logic before proceeding.
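
The docstring leaves the "not actually a collection" fallback to the caller. A minimal sketch of one way to do that follows. `reclassify_candidates` is a hypothetical helper, not part of the plan's modules. It assumes the `{"result": int, "children": [...]}` shape that `fetch_collection_details` returns in Task 3, and it assumes that any non-1 result means "treat the ID as an ordinary mod", which cannot distinguish a mod from a deleted or private collection.

```python
from typing import Dict, List, Tuple


def reclassify_candidates(
    bare_wsids: List[str],
    candidates: List[str],
    details: Dict[str, Dict],
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split URL-form candidates into confirmed
    collections and plain mods, keeping first-seen order and deduping."""
    wsids = list(bare_wsids)
    seen = set(wsids)
    collections: List[str] = []
    for cid in candidates:
        rec = details.get(cid)
        if rec is not None and rec.get("result") == 1:
            collections.append(cid)
        elif cid not in seen:
            # Assumption: a non-1 (or missing) result means "ordinary mod".
            seen.add(cid)
            wsids.append(cid)
    return wsids, collections


# Hand-written details; the result code 9 is an arbitrary non-1 value.
details = {
    "2200148440": {"result": 1, "children": ["2169435993"]},
    "2392709985": {"result": 9, "children": []},
}
wsids, collections = reclassify_candidates(
    ["2487022075"], ["2200148440", "2392709985"], details
)
print(wsids, collections)
```

Note that `run_expansion` in Task 5, as written, reports non-1 candidates as unresolvable rather than re-queuing them as mods, so a helper like this would only matter if that behaviour is changed.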

- [ ] **Step 5: Checkpoint** - parser ready for the API to consume.

---

## Task 3: Steam helper - `fetch_collection_details()`

**Files:**
- Modify: `/opt/sortof/api/steam.py`

- [ ] **Step 1: Backup**

```bash
cp /opt/sortof/api/steam.py /opt/sortof/api/steam.py.bak-$(date +%Y%m%d-%H%M)
```

- [ ] **Step 2: Add the helper, mirroring `fetch_workshop_details`**

Open `/opt/sortof/api/steam.py`. The current file has one async helper. Add a sibling at the bottom:

```python
COLLECTION_URL = (
    "https://api.steampowered.com/ISteamRemoteStorage/GetCollectionDetails/v1/"
)


async def fetch_collection_details(
    client: httpx.AsyncClient,
    collection_ids: List[str],
) -> Dict[str, Dict]:
    """Resolve candidate collection IDs to their child wsids.

    Returns a dict keyed by collection_id with shape:
        { "result": int, "children": List[str] }

    Anonymous endpoint; no API key needed. result==1 means valid collection;
    result!=1 means the ID isn't a collection (could be a mod, deleted, or
    private). Caller decides what to do with non-1 results - see Spec B+F
    §10 Q3 "Partial expansion failure" and Q4 "Flakiness".
    """
    if not collection_ids:
        return {}
    data: Dict[str, str] = {"collectioncount": str(len(collection_ids))}
    for i, cid in enumerate(collection_ids):
        data[f"publishedfileids[{i}]"] = cid
    r = await client.post(COLLECTION_URL, data=data)
    r.raise_for_status()
    body = r.json()
    out: Dict[str, Dict] = {}
    for item in body.get("response", {}).get("collectiondetails", []) or []:
        cid = item.get("publishedfileid")
        if not cid:
            continue
        out[cid] = {
            "result": int(item.get("result", 0)),
            "children": [
                c.get("publishedfileid", "")
                for c in (item.get("children") or [])
                if c.get("publishedfileid")
            ],
        }
    return out
```

- [ ] **Step 3: py_compile + smoke import**

```bash
/opt/sortof/api/.venv/bin/python -m py_compile /opt/sortof/api/steam.py && echo PY_OK
cd /opt/sortof/api && .venv/bin/python -c "import app; from steam import fetch_collection_details; print(fetch_collection_details.__doc__.split(chr(10))[0])"
```

Expected: `PY_OK` and the first line of the helper's docstring.

- [ ] **Step 4: Functional smoke test against real Steam (one collection ID)**

The implementer should pick a known PZ collection - search `https://steamcommunity.com/workshop/browse/?appid=108600&section=collections` for any active collection, copy its ID from the URL bar, and use it here. Substitute below:

```bash
COLL_ID="<paste-real-PZ-collection-id-here>"
curl -sS -X POST 'https://api.steampowered.com/ISteamRemoteStorage/GetCollectionDetails/v1/' \
  --data-urlencode 'collectioncount=1' \
  --data-urlencode "publishedfileids[0]=$COLL_ID" \
  | jq '.response.collectiondetails[0] | {result, n_children: (.children | length)}'
```

Expected: `result: 1`, `n_children > 0`.

Then call our helper through the venv:

```bash
cd /opt/sortof/api && .venv/bin/python -c "
import asyncio, httpx
from steam import fetch_collection_details

async def main():
    async with httpx.AsyncClient(timeout=30.0) as c:
        out = await fetch_collection_details(c, ['$COLL_ID'])
    print(out)

asyncio.run(main())
"
```

Expected: a dict like `{'<COLL_ID>': {'result': 1, 'children': ['<wsid1>', '<wsid2>', ...]}}`.
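
For checking the request and response handling without a Steam round-trip, the sketch below runs the same form-building and extraction logic over a canned body. `build_form` and `extract_children` are hypothetical names that copy the helper's logic, and the sample payload is hand-written to contain only the fields the helper reads; it is not a captured Steam response.

```python
from typing import Any, Dict, List


def build_form(collection_ids: List[str]) -> Dict[str, str]:
    """Form fields in the shape fetch_collection_details posts."""
    data = {"collectioncount": str(len(collection_ids))}
    for i, cid in enumerate(collection_ids):
        data[f"publishedfileids[{i}]"] = cid
    return data


def extract_children(body: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Same extraction loop as the helper, minus the HTTP call."""
    out: Dict[str, Dict[str, Any]] = {}
    for item in body.get("response", {}).get("collectiondetails", []) or []:
        cid = item.get("publishedfileid")
        if not cid:
            continue
        out[cid] = {
            "result": int(item.get("result", 0)),
            "children": [
                c["publishedfileid"]
                for c in (item.get("children") or [])
                if c.get("publishedfileid")
            ],
        }
    return out


# Hand-written sample, NOT a captured Steam response; 9 is an arbitrary non-1 code.
sample = {
    "response": {
        "collectiondetails": [
            {
                "publishedfileid": "2200148440",
                "result": 1,
                "children": [
                    {"publishedfileid": "2169435993"},
                    {"publishedfileid": "2392709985"},
                ],
            },
            {"publishedfileid": "2487022075", "result": 9},
        ]
    }
}

print(build_form(["2200148440", "2487022075"]))
print(extract_children(sample))
```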

- [ ] **Step 5: Checkpoint** - Steam helper verified end-to-end.

---

## Task 4: `jobs.py` - CRUD, phase derivation, live counts, lifespan sweep

**Files:**
- Create: `/opt/sortof/api/jobs.py`

- [ ] **Step 1: Write the module**

Create `/opt/sortof/api/jobs.py` with:

```python
"""sort_jobs persistence + phase derivation.

Phase is *derived* on every GET (Spec B+F §4): never stored as the source
of truth except for terminal states. The function `derive_phase` reads
live counts from download_jobs and decides expanding/queued/draining/done.
This makes the system restart-resilient by construction - there is no
event log to replay.
"""

from __future__ import annotations

import json
from typing import Any, Dict, List, Optional, Tuple
from uuid import UUID

import asyncpg


# ── CRUD ────────────────────────────────────────────────────────────────────

async def create_job(
    conn: asyncpg.Connection,
    *,
    input_raw: str,
    collection_ids: List[str],
    wsids: Optional[List[str]],
    rules_raw: Optional[str],
    initial_phase: str,
) -> str:
    """Insert a sort_jobs row and return the job_id (UUID as string).

    initial_phase: 'expanding' if collections still need resolving,
                   'queued' if wsids are already resolved at submit time.
    """
    row = await conn.fetchrow(
        """
        INSERT INTO sort_jobs (phase, input_raw, collection_ids, wsids, rules_raw)
        VALUES ($1, $2, $3, $4, $5)
        RETURNING job_id
        """,
        initial_phase, input_raw, collection_ids, wsids, rules_raw,
    )
    return str(row["job_id"])


async def get_job_row(conn: asyncpg.Connection, job_id: str) -> Optional[Dict[str, Any]]:
    """Fetch a sort_jobs row by id. Returns None if not found.

    job_id may be either a string UUID or asyncpg-native UUID.
    """
    try:
        uid = UUID(job_id) if isinstance(job_id, str) else job_id
    except ValueError:
        return None
    row = await conn.fetchrow(
        "SELECT * FROM sort_jobs WHERE job_id = $1",
        uid,
    )
    return dict(row) if row else None


async def update_phase(
    conn: asyncpg.Connection,
    job_id: str,
    phase: str,
    *,
    wsids: Optional[List[str]] = None,
    result_json: Optional[Dict[str, Any]] = None,
    failure_reason: Optional[str] = None,
) -> None:
    """Advance a job's phase. wsids/result_json/failure_reason are optional
    column updates that pair with phase transitions."""
    sets = ["phase = $2", "phase_started_at = now()"]
    params: List[Any] = [UUID(job_id), phase]
    idx = 3
    if wsids is not None:
        sets.append(f"wsids = ${idx}::text[]")
        params.append(wsids)
        idx += 1
    if result_json is not None:
        sets.append(f"result_json = ${idx}::jsonb")
        params.append(json.dumps(result_json))
        idx += 1
    if failure_reason is not None:
        sets.append(f"failure_reason = ${idx}")
        params.append(failure_reason)
        idx += 1
    await conn.execute(
        f"UPDATE sort_jobs SET {', '.join(sets)} WHERE job_id = $1",
        *params,
    )


# ── live counts (Spec B+F §6) ───────────────────────────────────────────────

async def compute_counts(conn: asyncpg.Connection, wsids: List[str]) -> Dict[str, int]:
    """Compute live cached/queued/draining counts for a set of wsids.
    Empty wsids → all zeros."""
    if not wsids:
        return {"cached": 0, "queued": 0, "draining": 0}
    rows = await conn.fetch(
        """
        SELECT
            (SELECT COUNT(DISTINCT mp.workshop_id)
             FROM mod_parsed mp
             JOIN workshop_meta wm ON wm.workshop_id = mp.workshop_id
             WHERE mp.workshop_id = ANY($1::text[])
               AND mp.parsed_at_time_updated = wm.time_updated) AS cached,
            (SELECT COUNT(DISTINCT workshop_id)
             FROM download_jobs
             WHERE workshop_id = ANY($1::text[]) AND status = 'queued') AS queued,
            (SELECT COUNT(DISTINCT workshop_id)
             FROM download_jobs
             WHERE workshop_id = ANY($1::text[]) AND status = 'downloading') AS draining
        """,
        wsids,
    )
    r = rows[0]
    return {"cached": int(r["cached"]), "queued": int(r["queued"]), "draining": int(r["draining"])}


# ── phase derivation (Spec B+F §4) ──────────────────────────────────────────

def derive_phase(
    stored_phase: str,
    wsids: Optional[List[str]],
    counts: Dict[str, int],
) -> str:
    """Decide the live phase from the row's stored phase + current counts.

    Terminal phases (done/failed) are never demoted. Non-terminal phases
    are recomputed from current state.
    """
    if stored_phase in ("done", "failed"):
        return stored_phase
    if wsids is None:
        return "expanding"
    if counts["draining"] > 0:
        return "draining"
    if counts["queued"] > 0:
        return "queued"
    if counts["cached"] >= len(wsids):
        return "done"
    # Transient gap: a row just left 'queued' and hasn't shown up in
    # mod_parsed yet. Most likely just-failed and not yet re-queued.
    return "queued"


# ── stale-expansion sweep (Spec B+F §9) ─────────────────────────────────────

STALE_EXPANSION_SQL = """
UPDATE sort_jobs
   SET phase = 'failed',
       failure_reason = 'expansion timed out',
       updated_at = now()
 WHERE phase = 'expanding'
   AND phase_started_at < now() - interval '10 minutes'
RETURNING job_id;
"""


async def sweep_stale_expansions(conn: asyncpg.Connection) -> int:
    """Run on uvicorn lifespan startup. Returns the number of jobs reaped."""
    rows = await conn.fetch(STALE_EXPANSION_SQL)
    return len(rows)
```

- [ ] **Step 2: py_compile + smoke import**

```bash
/opt/sortof/api/.venv/bin/python -m py_compile /opt/sortof/api/jobs.py && echo PY_OK
cd /opt/sortof/api && .venv/bin/python -c "import inspect, jobs; print(sorted(n for n, f in inspect.getmembers(jobs, inspect.isfunction) if f.__module__ == 'jobs'))"
```

Expected: `PY_OK` followed by a list including `compute_counts`, `create_job`, `derive_phase`, `get_job_row`, `sweep_stale_expansions`, `update_phase`.
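
`update_phase` builds its `SET` clause dynamically and numbers the optional placeholders from `$3` upward. The sketch below isolates that logic as a pure function so the generated SQL can be inspected without a database. `build_update` is a hypothetical name; the column names and casts are copied from the module above.

```python
import json
from typing import Any, Dict, List, Optional, Tuple
from uuid import UUID


def build_update(
    job_id: str,
    phase: str,
    *,
    wsids: Optional[List[str]] = None,
    result_json: Optional[Dict[str, Any]] = None,
    failure_reason: Optional[str] = None,
) -> Tuple[str, List[Any]]:
    """Return (sql, params) equivalent to what update_phase would execute."""
    sets = ["phase = $2", "phase_started_at = now()"]
    params: List[Any] = [UUID(job_id), phase]
    optional = (
        ("wsids", wsids, "::text[]"),
        ("result_json", None if result_json is None else json.dumps(result_json), "::jsonb"),
        ("failure_reason", failure_reason, ""),
    )
    for column, value, cast in optional:
        if value is not None:
            params.append(value)
            # The placeholder index equals the parameter's 1-based position.
            sets.append(f"{column} = ${len(params)}{cast}")
    sql = f"UPDATE sort_jobs SET {', '.join(sets)} WHERE job_id = $1"
    return sql, params


sql, params = build_update(
    "00000000-0000-0000-0000-000000000001",
    "failed",
    failure_reason="expansion crashed",
)
print(sql)
print(params)
```

Because unset columns are skipped, `failure_reason` lands on `$3` here but would land on `$4` or `$5` when `wsids` or `result_json` are also supplied.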

- [ ] **Step 3: Phase derivation unit smoke**

Phase derivation is pure (no DB), so it's testable without a connection:

```bash
cd /opt/sortof/api && .venv/bin/python -c "
from jobs import derive_phase

# Terminal preserved.
assert derive_phase('done',   ['a'],  {'cached':1,'queued':0,'draining':0}) == 'done'
assert derive_phase('failed', ['a'],  {'cached':0,'queued':0,'draining':0}) == 'failed'

# wsids null → expanding.
assert derive_phase('expanding', None, {'cached':0,'queued':0,'draining':0}) == 'expanding'

# Active drain.
assert derive_phase('queued', ['a','b'], {'cached':0,'queued':1,'draining':1}) == 'draining'

# Just queued.
assert derive_phase('queued', ['a','b'], {'cached':0,'queued':2,'draining':0}) == 'queued'

# All cached.
assert derive_phase('queued', ['a','b'], {'cached':2,'queued':0,'draining':0}) == 'done'

# Transient gap (between queued exit and parsed entry).
assert derive_phase('queued', ['a','b'], {'cached':1,'queued':0,'draining':0}) == 'queued'

print('PHASE_OK')
"
```

Expected: `PHASE_OK`.
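
To see how the precedence rules play out across a whole cold load, the sketch below feeds `derive_phase` a sequence of count snapshots, as successive polls might observe them. The function body is copied from `jobs.py` so the snippet runs standalone; the timeline values are made up for illustration.

```python
from typing import Dict, List, Optional


def derive_phase(stored_phase: str, wsids: Optional[List[str]], counts: Dict[str, int]) -> str:
    """Copy of jobs.derive_phase, repeated here so the snippet is self-contained."""
    if stored_phase in ("done", "failed"):
        return stored_phase
    if wsids is None:
        return "expanding"
    if counts["draining"] > 0:
        return "draining"
    if counts["queued"] > 0:
        return "queued"
    if counts["cached"] >= len(wsids):
        return "done"
    return "queued"


mods = ["a", "b", "c"]
# (wsids, counts) as five successive polls might see them. Illustrative values.
timeline = [
    (None, {"cached": 0, "queued": 0, "draining": 0}),  # collections not resolved yet
    (mods, {"cached": 1, "queued": 2, "draining": 0}),  # worker has not picked anything up
    (mods, {"cached": 1, "queued": 1, "draining": 1}),  # one download in flight
    (mods, {"cached": 2, "queued": 0, "draining": 1}),  # last download in flight
    (mods, {"cached": 3, "queued": 0, "draining": 0}),  # everything parsed
]

phases = [derive_phase("expanding", wsids, counts) for wsids, counts in timeline]
print(phases)  # ['expanding', 'queued', 'draining', 'draining', 'done']
```

The stored phase stays `'expanding'` throughout this run, which illustrates that non-terminal stored values do not influence the result once `wsids` is populated.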

- [ ] **Step 4: Live-counts smoke (DB round-trip)**

```bash
cd /opt/sortof/api && .venv/bin/python -c "
import asyncio, db
from jobs import compute_counts

async def main():
    pool = await db.create_pool()
    async with pool.acquire() as conn:
        c1 = await compute_counts(conn, [])
        assert c1 == {'cached':0,'queued':0,'draining':0}
        # canonical 3-mod test set is fully cached.
        c2 = await compute_counts(conn, ['2169435993','2392709985','2487022075'])
        assert c2['cached'] == 3
    await pool.close()
    print('COUNTS_OK')

asyncio.run(main())
"
```

Expected: `COUNTS_OK`.

- [ ] **Step 5: Checkpoint** - module reusable from `app.py`.

---

## Task 5: Background expansion task

**Files:**
- Create: `/opt/sortof/api/expansion.py`

- [ ] **Step 1: Write the expansion runner**

Create `/opt/sortof/api/expansion.py` with:

```python
"""Background async task: take a freshly-created sort_jobs row in 'expanding'
phase, resolve its collection_ids via Steam, populate wsids[], advance phase
to 'queued' (and drop wsids into download_jobs as needed)."""

from __future__ import annotations

import asyncio
import logging
from typing import Any, Dict, List

import asyncpg
import httpx

from jobs import update_phase
from steam import fetch_collection_details

log = logging.getLogger("sortof.expansion")

COLLECTION_TTL_SECONDS = 6 * 3600  # Spec B+F §5.3; the cache query below hardcodes the same 6 hours


async def _resolve_collections(
    conn: asyncpg.Connection,
    http: httpx.AsyncClient,
    collection_ids: List[str],
) -> tuple[Dict[str, List[str]], List[str]]:
    """Returns (resolved, unresolvable). resolved maps collection_id ->
    [child_wsids]. unresolvable lists collection_ids that GetCollectionDetails
    couldn't fetch (after one retry)."""
    if not collection_ids:
        return ({}, [])

    # Cache lookup (TTL = 6h via last_fetched_at).
    cache_rows = await conn.fetch(
        """
        SELECT collection_id, child_workshop_ids
          FROM collections
         WHERE collection_id = ANY($1::text[])
           AND last_fetched_at > now() - interval '6 hours'
        """,
        collection_ids,
    )
    resolved: Dict[str, List[str]] = {
        r["collection_id"]: list(r["child_workshop_ids"])
        for r in cache_rows
    }
    miss = [cid for cid in collection_ids if cid not in resolved]

    unresolvable: List[str] = []
    if miss:
        for attempt in (1, 2):
            try:
                api_out = await fetch_collection_details(http, miss)
            except httpx.HTTPError as e:
                log.warning("GetCollectionDetails attempt %d failed: %s", attempt, e)
                if attempt == 1:
                    await asyncio.sleep(2.0)
                    continue
                unresolvable = list(miss)
                api_out = {}
            for cid in miss:
                rec = api_out.get(cid)
                if rec is None or rec.get("result") != 1:
                    unresolvable.append(cid)
                    continue
                children = rec.get("children") or []
                resolved[cid] = list(children)
                await conn.execute(
                    """
                    INSERT INTO collections (collection_id, child_workshop_ids, last_fetched_at)
                    VALUES ($1, $2, now())
                    ON CONFLICT (collection_id) DO UPDATE
                       SET child_workshop_ids = EXCLUDED.child_workshop_ids,
                           last_fetched_at    = now()
                    """,
                    cid, children,
                )
            break  # success - stop retrying
    # Dedupe: when both attempts fail, the except branch and the loop above
    # each append every missed cid, so entries appear twice.
    seen: set[str] = set()
    out_unres: List[str] = []
    for u in unresolvable:
        if u not in seen:
            seen.add(u)
            out_unres.append(u)
    return (resolved, out_unres)


async def run_expansion(
    pool: asyncpg.Pool,
    http: httpx.AsyncClient,
    job_id: str,
    bare_wsids: List[str],
    collection_ids: List[str],
) -> None:
    """Top-level expansion task. Logs and persists; never raises out."""
    try:
        async with pool.acquire() as conn:
            resolved, unresolvable = await _resolve_collections(conn, http, collection_ids)

            # Compose wsids: collections (in input order) + bare wsids, deduped.
            seen: set[str] = set()
            wsids: List[str] = []
            for cid in collection_ids:
                for w in resolved.get(cid, []):
                    if w and w not in seen:
                        seen.add(w)
                        wsids.append(w)
            for w in bare_wsids:
                if w not in seen:
                    seen.add(w)
                    wsids.append(w)

            if not wsids:
                # All collections unresolvable AND no bare wsids. Job dies.
                await update_phase(
                    conn, job_id, "failed",
                    failure_reason="all input collections unresolvable",
                )
                log.info("expansion %s: failed - all collections unresolvable", job_id)
                return

            partial_warnings = [
                {
                    "tag": "collection-partial",
                    "level": "warning",
                    "msg": f"collection {cid} could not be fetched",
                }
                for cid in unresolvable
            ]
            seed_result: Dict[str, Any] | None = {"WARNINGS": partial_warnings} if partial_warnings else None

            await update_phase(
                conn, job_id, "queued",
                wsids=wsids,
                result_json=seed_result,
            )
            log.info(
                "expansion %s: queued (wsids=%d unresolvable=%d)",
                job_id, len(wsids), len(unresolvable),
            )
    except Exception:
        log.exception("expansion %s: crashed", job_id)
        try:
            async with pool.acquire() as conn:
                await update_phase(conn, job_id, "failed", failure_reason="expansion crashed")
        except Exception:
            log.exception("expansion %s: cleanup failed", job_id)
```

- [ ] **Step 2: py_compile + smoke import**

```bash
/opt/sortof/api/.venv/bin/python -m py_compile /opt/sortof/api/expansion.py && echo PY_OK
cd /opt/sortof/api && .venv/bin/python -c "from expansion import run_expansion; print('IMPORT_OK')"
```

Expected: `PY_OK` and `IMPORT_OK`.
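
The ordering rule inside `run_expansion` (collection children first, in input order, then bare wsids, all deduped) is easy to check in isolation. The sketch below extracts it as a pure function; `compose_wsids` is a hypothetical name and the IDs are placeholders.

```python
from typing import Dict, List


def compose_wsids(
    collection_ids: List[str],
    resolved: Dict[str, List[str]],
    bare_wsids: List[str],
) -> List[str]:
    """Children of each collection in input order, then bare wsids;
    duplicates keep their first position."""
    seen = set()
    out: List[str] = []
    for cid in collection_ids:
        for w in resolved.get(cid, []):
            if w and w not in seen:
                seen.add(w)
                out.append(w)
    for w in bare_wsids:
        if w not in seen:
            seen.add(w)
            out.append(w)
    return out


# Placeholder IDs: "20" is shared by both collections, "30" is also a bare wsid.
merged = compose_wsids(
    ["c1", "c2", "c3"],
    {"c1": ["10", "20"], "c2": ["20", "30"]},  # c3 was unresolvable
    ["30", "40"],
)
print(merged)  # ['10', '20', '30', '40']
```

An unresolvable collection (`c3` above) simply contributes nothing; the job only fails when the merged list ends up empty.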

- [ ] **Step 3: End-to-end smoke against the live DB + Steam**

```bash
cd /opt/sortof/api && .venv/bin/python -c "
import asyncio, httpx, db, jobs, expansion

COLL_ID = '<paste-real-PZ-collection-id>'  # same one you used in Task 3 step 4

async def main():
    pool = await db.create_pool()
    async with pool.acquire() as conn:
        # Pre-clear any cached row to test the cold path.
        await conn.execute('DELETE FROM collections WHERE collection_id=\$1', COLL_ID)
        jid = await jobs.create_job(
            conn, input_raw=f'https://steamcommunity.com/sharedfiles/filedetails/?id={COLL_ID}',
            collection_ids=[COLL_ID], wsids=None, rules_raw=None,
            initial_phase='expanding',
        )
    async with httpx.AsyncClient(timeout=30.0) as http:
        await expansion.run_expansion(pool, http, jid, [], [COLL_ID])
    async with pool.acquire() as conn:
        row = await jobs.get_job_row(conn, jid)
        assert row['phase'] == 'queued', row
        assert row['wsids'] is not None and len(row['wsids']) > 0
        # Cleanup.
        await conn.execute('DELETE FROM sort_jobs WHERE job_id=\$1', row['job_id'])
    await pool.close()
    print('EXPANSION_OK')

asyncio.run(main())
"
```

Expected: `EXPANSION_OK`. Substitute `COLL_ID` with the real ID from Task 3.
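
The retry policy in `_resolve_collections` (at most two attempts, a short pause between them) can also be exercised without network access. The sketch below is an alternative formulation, not the module's code: `fetch_with_one_retry`, `split_results`, `FlakyFetch`, and `TransportError` are all hypothetical names, and `TransportError` stands in for `httpx.HTTPError` so that only the standard library is needed.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Tuple


class TransportError(Exception):
    """Stand-in for httpx.HTTPError so the sketch needs only the stdlib."""


async def fetch_with_one_retry(
    fetch: Callable[[List[str]], Awaitable[Dict[str, Dict]]],
    ids: List[str],
    *,
    pause: float = 2.0,
) -> Optional[Dict[str, Dict]]:
    """Call fetch(ids); on a transport error wait `pause` seconds and try
    once more. Returns None if both attempts fail."""
    for attempt in (1, 2):
        try:
            return await fetch(ids)
        except TransportError:
            if attempt == 1:
                await asyncio.sleep(pause)
    return None


def split_results(
    ids: List[str], api_out: Optional[Dict[str, Dict]]
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Partition ids into resolved children and unresolvable ids (no duplicates)."""
    resolved: Dict[str, List[str]] = {}
    unresolvable: List[str] = []
    for cid in ids:
        rec = (api_out or {}).get(cid)
        if rec is None or rec.get("result") != 1:
            unresolvable.append(cid)
        else:
            resolved[cid] = list(rec.get("children") or [])
    return resolved, unresolvable


class FlakyFetch:
    """Test double: raises `failures` times, then returns a canned result."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    async def __call__(self, ids: List[str]) -> Dict[str, Dict]:
        self.calls += 1
        if self.calls <= self.failures:
            raise TransportError("simulated outage")
        return {cid: {"result": 1, "children": ["111"]} for cid in ids}


flaky = FlakyFetch(failures=1)
out = asyncio.run(fetch_with_one_retry(flaky, ["c1"], pause=0))
print(flaky.calls, split_results(["c1"], out))
```

Separating "fetch with retry" from "partition the result" means each ID is classified exactly once, so no dedupe pass is needed afterwards.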

- [ ] **Step 4: Checkpoint** - expansion runner ready to be triggered from `/api/sort`.

---

## Task 6: Polymorphic `/api/sort` + lifespan sweep wiring

**Files:**
- Modify: `/opt/sortof/api/app.py`

- [ ] **Step 1: Backup**

```bash
cp /opt/sortof/api/app.py /opt/sortof/api/app.py.bak-$(date +%Y%m%d-%H%M)
```

- [ ] **Step 2: Add new imports**

Find the existing import block (around lines 1-30). Add:

```python
import asyncio
import jobs
import expansion
from parse import parse_with_collections  # existing parse_workshop_input import stays
```

- [ ] **Step 3: Wire stale-expansion sweep into lifespan startup**

Find the existing `@asynccontextmanager async def lifespan(app: FastAPI):` block (around line 38). Inside the body, after the pool/http are created and before `yield`, add:

```python
    async with pool.acquire() as conn:
        n_reaped = await jobs.sweep_stale_expansions(conn)
        if n_reaped:
            log.info("lifespan startup: reaped %d stale expansion job(s)", n_reaped)
```
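
For reasoning about which rows the startup sweep touches, the `WHERE` clause of `STALE_EXPANSION_SQL` can be mirrored in plain Python. This is only an illustration of the predicate; `is_stale` is a hypothetical helper and the timestamps are made up.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=10)  # matches interval '10 minutes' in the SQL


def is_stale(phase: str, phase_started_at: datetime, now: datetime) -> bool:
    """True when the sweep's WHERE clause would match this row."""
    return phase == "expanding" and phase_started_at < now - STALE_AFTER


now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    ("expanding", now - timedelta(minutes=25)),  # orphaned long ago
    ("expanding", now - timedelta(minutes=3)),   # started shortly before the restart
    ("queued", now - timedelta(hours=2)),        # old, but not in 'expanding'
]
print([is_stale(phase, started, now) for phase, started in rows])
```

As written, the sweep runs only at lifespan startup, so an `expanding` job younger than ten minutes at restart is left alone until a later startup.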

- [ ] **Step 4: Make `/api/sort` polymorphic**

Find the `async def sort_endpoint(...)` function. The current body parses input via `parse_workshop_input`, hits Steam, queues misses, runs `mlos_sort`, returns sync. Replace the parsing line:

```python
    input_ids = parse_workshop_input(req.input or "")
```

with:

```python
    bare_wsids, collection_ids = parse_with_collections(req.input or "")
    input_ids = bare_wsids  # used by existing code paths below
```

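Then, immediately after the validation `raise HTTPException(...)` checks (so we still 400 on empty input and 413 on >MAX_IDS), but **before** the Steam metadata fetch, insert a fork. Because `input_ids` now holds only the bare wsids, an empty-input check that tests `input_ids` alone would reject a collection-only request; if that is how the existing check is written, widen it to `if not input_ids and not collection_ids` first: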

```python
    # ── B+F: route to async job if collections present OR uncached wsids
    # require drain time ───────────────────────────────────────────────────
    if collection_ids or len(input_ids) > 0:
        # Fast-path probe: are ALL bare wsids already cache-fresh? If so AND
        # there are no collections, fall through to the existing sync path.
        # (Spec B+F §10 Q1: "Bare wsid + all-cached → synchronous".)
        if not collection_ids and input_ids:
            try:
                steam_details = await steam.fetch_workshop_details(
                    request.app.state.http, input_ids,
                )
            except httpx.HTTPError as e:
                log.warning("steam api error: %s", e)
                elapsed_ms = int((time.monotonic() - t0) * 1000)
                log.info(
                    "sort done hits=0 misses=%d status=error ms=%d",
                    len(input_ids), elapsed_ms,
                )
                return _empty_payload(input_ids, "error")
            # Cache check: are all input_ids in mod_parsed and fresh?
            pool = request.app.state.db
            async with pool.acquire() as conn:
                fresh = 0
                for wid in input_ids:
                    d = steam_details.get(wid)
                    if not d or d.get("result") != 1:
                        break  # bail out - there's a non-cacheable id, route to job
                    tu = int(d.get("time_updated", 0))
                    row = await conn.fetchrow(
                        "SELECT 1 FROM mod_parsed "
                        "WHERE workshop_id = $1 AND parsed_at_time_updated = $2 LIMIT 1",
                        wid, tu,
                    )
                    if row is not None:
                        fresh += 1
                    else:
                        break
                if fresh == len(input_ids):
                    # All cache-fresh - sync path. Re-use the existing flow
                    # by NOT routing to a job. Fall through.
                    pass
                else:
                    # Async path.
                    return await _route_to_job(
                        request, conn, req.input or "", req.rules,
                        bare_wsids, collection_ids,
                    )
        elif collection_ids:
            pool = request.app.state.db
            async with pool.acquire() as conn:
                return await _route_to_job(
                    request, conn, req.input or "", req.rules,
                    bare_wsids, collection_ids,
                )
```

This is the routing fork. The fast-path probe lets all-cached bare-wsid input fall through to the existing sync code unchanged. Anything else (uncached wsids OR any collection) returns a job_id.
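
The fork reduces to a small decision table. The following is a minimal sketch of that table, assuming the cache probe has already produced a set of fresh wsids; `choose_route` and `fresh_wsids` are hypothetical names for illustration and do not exist in `app.py`:

```python
from typing import List, Set


def choose_route(bare_wsids: List[str], collection_ids: List[str],
                 fresh_wsids: Set[str]) -> str:
    """Return "sync" or "job" for an already-parsed /api/sort input.

    fresh_wsids is a hypothetical stand-in for the cache probe's outcome:
    the wsids whose mod_parsed row matches Steam's current time_updated.
    """
    if collection_ids:
        return "job"   # any collection forces the async path
    if bare_wsids and all(w in fresh_wsids for w in bare_wsids):
        return "sync"  # all-cached bare wsids: existing sync flow, unchanged
    if bare_wsids:
        return "job"   # at least one uncached or non-cacheable wsid
    return "sync"      # empty input: the earlier validation already 400s
```

In the plan's code the probe stops at the first miss instead of building the full set; the routing outcome is the same.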

- [ ] **Step 5: Add `_route_to_job` helper near the route definition**

Just above the `@app.post("/api/sort")` decorator, insert:

```python
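# Strong references to in-flight expansion tasks. The event loop keeps only a
# weak reference, so an unreferenced task can be garbage-collected mid-run.
_expansion_tasks: set = set()


async def _route_to_job(
    request: Request,
    conn,
    input_raw: str,
    rules_raw: Optional[str],
    bare_wsids: List[str],
    collection_ids: List[str],
) -> Dict[str, Any]:
    """Create a sort_jobs row and (if needed) kick off background expansion.
    Returns {status, job_id} for the client to start polling."""
    if collection_ids:
        # Will resolve in the background.
        job_id = await jobs.create_job(
            conn,
            input_raw=input_raw,
            collection_ids=collection_ids,
            wsids=None,
            rules_raw=rules_raw,
            initial_phase="expanding",
        )
        task = asyncio.create_task(expansion.run_expansion(
            request.app.state.db,
            request.app.state.http,
            job_id,
            bare_wsids,
            collection_ids,
        ))
        # Hold the task until it finishes, then drop the reference.
        _expansion_tasks.add(task)
        task.add_done_callback(_expansion_tasks.discard)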
        return {"status": "expanding", "job_id": job_id}
    else:
        # Bare wsids that include uncached. Kick off cold drain by queueing.
        # We dedupe wsids before storing them on the job (the existing
        # /api/sort flow does this for bare input lists).
        seen: set = set()
        wsids: List[str] = []
        for w in bare_wsids:
            if w not in seen:
                seen.add(w)
                wsids.append(w)
        job_id = await jobs.create_job(
            conn,
            input_raw=input_raw,
            collection_ids=[],
            wsids=wsids,
            rules_raw=rules_raw,
            initial_phase="queued",
        )
        # Queue any wsids not already in download_jobs (mirrors the existing
        # flow at the bottom of sort_endpoint, but we don't need Steam validation
        # here since the GET poll will surface unknowns/non-mods naturally
        # via the counts contract).
        for wid in wsids:
            existing = await conn.fetchval(
                "SELECT 1 FROM download_jobs "
                "WHERE workshop_id = $1 AND status IN ('queued','downloading') LIMIT 1",
                wid,
            )
            if existing is None:
                await conn.execute(
                    "INSERT INTO download_jobs (workshop_id, status) VALUES ($1, 'queued')",
                    wid,
                )
        return {"status": "queued", "job_id": job_id}
```
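
The dedupe loop in the bare-wsid branch keeps the first occurrence of each wsid and preserves input order. A shorter equivalent is sketched below, assuming Python 3.7+ where `dict` preserves insertion order; `dedupe_preserving_order` is a hypothetical helper name:

```python
from typing import List


def dedupe_preserving_order(wsids: List[str]) -> List[str]:
    """Drop repeated wsids, keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(wsids))
```

Either form works; the explicit loop in the plan makes the "first occurrence wins" rule slightly more visible to a reader.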

- [ ] **Step 6: py_compile + smoke import**

```bash
/opt/sortof/api/.venv/bin/python -m py_compile /opt/sortof/api/app.py && echo PY_OK
cd /opt/sortof/api && .venv/bin/python -c "import app" && echo IMPORT_OK
```

- [ ] **Step 7: Restart API**

```bash
sudo systemctl restart sortof-api && sleep 2 && sudo systemctl is-active sortof-api
```

Expected: `active`.

- [ ] **Step 8: Verify the sync fast path is unchanged**

```bash
curl -sS -X POST http://100.114.205.53:8801/api/sort \
  -H 'Content-Type: application/json' \
  -d '{"input":"2169435993;2392709985;2487022075"}' \
  | jq '{status, MODS_LINE, has_job_id: (has("job_id"))}'
```

Expected: `{"status":"success","MODS_LINE":"modoptions;tsarslib;TMC_TrueActions","has_job_id":false}`.

- [ ] **Step 9: Verify the bare-uncached path returns a job_id**

```bash
# This step needs a wsid that is NOT cached. The HellDrinx wsid is classified
# non_mod, so it is a poor test subject. Either pick a real PZ mod that is not
# in mod_parsed yet, or (simpler) delete a cached row to force a miss:
sudo docker exec -i sortof_db psql -U sortof -d sortof -c "
DELETE FROM mod_parsed WHERE workshop_id='2196102849';
DELETE FROM workshop_meta WHERE workshop_id='2196102849';"

curl -sS -X POST http://100.114.205.53:8801/api/sort \
  -H 'Content-Type: application/json' \
  -d '{"input":"2196102849"}' | jq '{status, has_job_id: (has("job_id")), job_id_preview: (.job_id // "" | .[0:8])}'
```

Expected: `{"status":"queued","has_job_id":true,"job_id_preview":"<8-hex-chars>"}`.

The drain will reprocess `2196102849` (Raven Creek) and re-cache it; that's fine.

- [ ] **Step 10: Verify the collection path returns expanding + job_id**

```bash
COLL_ID="<real-PZ-collection-id>"
curl -sS -X POST http://100.114.205.53:8801/api/sort \
  -H 'Content-Type: application/json' \
  -d "{\"input\":\"https://steamcommunity.com/sharedfiles/filedetails/?id=$COLL_ID\"}" \
  | jq '{status, has_job_id: (has("job_id"))}'
```

Expected: `{"status":"expanding","has_job_id":true}`.

- [ ] **Step 11: Checkpoint** - `/api/sort` polymorphism live; jobs being created. Polling endpoint is Task 7.

---

## Task 7: `GET` + `DELETE /api/jobs/{job_id}`

**Files:**
- Modify: `/opt/sortof/api/app.py` (add two endpoints near the existing routes)

- [ ] **Step 1: Backup (if more than ~15 minutes since the Task 6 backup)**

```bash
cp /opt/sortof/api/app.py /opt/sortof/api/app.py.bak-$(date +%Y%m%d-%H%M)
```

- [ ] **Step 2: Add the GET endpoint**

Find the end of `sort_endpoint` (the function body ends with `return payload`). Below it, insert:

```python
@app.get("/api/jobs/{job_id}")
async def get_job_endpoint(job_id: str, request: Request) -> Dict[str, Any]:
    pool = request.app.state.db
    async with pool.acquire() as conn:
        row = await jobs.get_job_row(conn, job_id)
        if row is None:
            raise HTTPException(status_code=404, detail="job not found or expired")
        wsids = list(row["wsids"]) if row["wsids"] else None
        counts = await jobs.compute_counts(conn, wsids or [])
        phase = jobs.derive_phase(row["phase"], wsids, counts)

        # If we just transitioned a non-terminal job to 'done', persist the
        # final result for future polls (and for the §3 24h TTL artifact).
        result_json = row["result_json"]
        if phase == "done" and row["phase"] != "done":
            result_json = await _build_result_for_job(conn, wsids, row["rules_raw"])
            await jobs.update_phase(
                conn, job_id, "done", result_json=result_json,
            )
        elif phase != "done" and wsids:
            # Compute a fresh partial result on every poll - cheap, avoids
            # staleness. Don't persist; only `done` writes result_json.
            result_json = await _build_result_for_job(conn, wsids, row["rules_raw"])

    return {
        "job_id":         str(row["job_id"]),
        "phase":          phase,
        "counts":         counts,
        "wsids":          wsids,
        "result":         result_json,
        "failure_reason": row["failure_reason"],
    }
```

- [ ] **Step 3: Add the `_build_result_for_job` helper**

Just above the `_empty_payload` helper (around line 100), insert:

```python
async def _build_result_for_job(
    conn,
    wsids: List[str],
    rules_raw: Optional[str],
) -> Dict[str, Any]:
    """Compute the SORTOF_DATA payload from currently-cached mod_parsed rows
    for the given wsids. Used both for partial results during draining and
    for the final result on phase transition to 'done'."""
    if not wsids:
        return _empty_payload([], "success")
    rows = await conn.fetch(
        """
        SELECT mp.workshop_id, mp.mod_id, mp.name, mp.category,
               mp.requirements, mp.load_after, mp.load_before,
               mp.incompatible_mods, mp.load_first, mp.load_last,
               mp.tags, mp.maps
          FROM mod_parsed mp
          JOIN workshop_meta wm ON wm.workshop_id = mp.workshop_id
         WHERE mp.workshop_id = ANY($1::text[])
           AND mp.parsed_at_time_updated = wm.time_updated
         ORDER BY mp.workshop_id, mp.mod_id
        """,
        wsids,
    )
    mods = [_row_to_modinfo(r) for r in rows]
    rules: Dict[str, Any] = {}
    if rules_raw:
        try:
            rules = parse_sorting_rules(rules_raw)
        except Exception:
            log.warning("job result: failed to parse sorting_rules")
    sort_result = sort_mods(mods, rules)
    cached_ids = list({r["workshop_id"] for r in rows})
    payload = adapters.build_response(
        input_ids=wsids,    # contract: WORKSHOP_ITEMS_LINE = wsids[] at job creation
        hit_ids=cached_ids,
        mods=mods,
        sort_result=sort_result,
        status="success" if len(cached_ids) >= len(wsids) else "partial",
    )
    # Forced override: WORKSHOP_ITEMS_LINE locked to the original wsids[]
    # regardless of which are currently cached (Spec A §8 / Spec B+F §6).
    payload["WORKSHOP_ITEMS_LINE"] = ";".join(wsids) + ";" if wsids else ""
    cached_set = set(cached_ids)   # build the lookup set once, not once per wsid
    payload["pending"] = [w for w in wsids if w not in cached_set]
    payload["unknown"] = []   # this endpoint doesn't compute Steam-result-9
    payload["non_mod"] = []   # nor non-mod classification - those are sync-path concerns
    return payload
```
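
The three fields that depend only on `wsids` and the cache hits can be checked in isolation. This is a sketch under that assumption; `job_payload_overrides` is a hypothetical name, and the real helper also runs the sort and calls `adapters.build_response`:

```python
from typing import Any, Dict, List


def job_payload_overrides(wsids: List[str], cached_ids: List[str]) -> Dict[str, Any]:
    """Fields derived purely from the job's wsids and the cached workshop ids."""
    cached = set(cached_ids)  # build the lookup set once
    return {
        # Same length comparison the helper uses for success vs partial.
        "status": "success" if len(cached) >= len(wsids) else "partial",
        # Locked to the job's wsids, with the trailing ';' the helper appends.
        "WORKSHOP_ITEMS_LINE": (";".join(wsids) + ";") if wsids else "",
        # Order follows wsids.
        "pending": [w for w in wsids if w not in cached],
    }
```

For example, `job_payload_overrides(["a", "b", "c"], ["c", "a"])` reports `b` as pending while the workshop line still lists all three ids.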

- [ ] **Step 4: Add the DELETE endpoint**

Below the GET endpoint, insert:

```python
@app.delete("/api/jobs/{job_id}", status_code=204)
async def delete_job_endpoint(job_id: str, request: Request):
    """Cancel a job. Idempotent: cancelling a terminal job is a no-op 204.
    Does NOT touch download_jobs (Spec B+F §8)."""
    pool = request.app.state.db
    async with pool.acquire() as conn:
        row = await jobs.get_job_row(conn, job_id)
        if row is None:
            raise HTTPException(status_code=404, detail="job not found")
        if row["phase"] not in ("done", "failed"):
            await jobs.update_phase(conn, job_id, "failed", failure_reason="cancelled")
    return None
```
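
The idempotency rule in the DELETE handler is a one-line state transition. A minimal sketch of it as a pure function follows; `phase_after_cancel` is a hypothetical name, and the real handler persists the change through `jobs.update_phase`:

```python
from typing import Optional, Tuple

TERMINAL_PHASES = ("done", "failed")


def phase_after_cancel(phase: str,
                       failure_reason: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (phase, failure_reason) as they stand after a DELETE."""
    if phase in TERMINAL_PHASES:
        # Terminal jobs are left untouched, which is what makes DELETE idempotent.
        return phase, failure_reason
    return "failed", "cancelled"
```

Applying the function twice gives the same result as applying it once, which is the property Step 7 checks with the repeated `DELETE`.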

- [ ] **Step 5: py_compile + restart**

```bash
/opt/sortof/api/.venv/bin/python -m py_compile /opt/sortof/api/app.py && echo PY_OK
cd /opt/sortof/api && .venv/bin/python -c "import app" && echo IMPORT_OK
sudo systemctl restart sortof-api && sleep 2 && sudo systemctl is-active sortof-api
```

- [ ] **Step 6: Verify GET on a fresh collection job**

```bash
COLL_ID="<real-PZ-collection-id>"
JOB_RESP=$(curl -sS -X POST http://100.114.205.53:8801/api/sort \
  -H 'Content-Type: application/json' \
  -d "{\"input\":\"https://steamcommunity.com/sharedfiles/filedetails/?id=$COLL_ID\"}")
JID=$(echo "$JOB_RESP" | jq -r '.job_id')
echo "job_id=$JID"

# First poll, likely expanding
curl -sS http://100.114.205.53:8801/api/jobs/$JID | jq '{phase, counts, has_wsids: (.wsids != null)}'

# Wait for expansion + initial drain
sleep 3
curl -sS http://100.114.205.53:8801/api/jobs/$JID | jq '{phase, counts, n_wsids: (.wsids|length)}'

# 404 on garbage id
curl -sS -o /dev/null -w 'http=%{http_code}\n' http://100.114.205.53:8801/api/jobs/00000000-0000-0000-0000-000000000000
```

Expected: the first poll typically has `phase: "expanding"` (a very fast expansion may already have moved on); the second poll has `phase` in `(queued, draining, done)` with `n_wsids > 0`; the garbage id returns `http=404`.

- [ ] **Step 7: Verify DELETE on an active job**

```bash
# Submit a fresh job so we can cancel it before it drains (reuses $COLL_ID from Step 6)
JID=$(curl -sS -X POST http://100.114.205.53:8801/api/sort \
  -H 'Content-Type: application/json' \
  -d "{\"input\":\"https://steamcommunity.com/sharedfiles/filedetails/?id=$COLL_ID\"}" \
  | jq -r '.job_id')

# Cancel immediately
curl -sS -o /dev/null -w 'cancel=%{http_code}\n' -X DELETE http://100.114.205.53:8801/api/jobs/$JID

# Idempotent re-cancel
curl -sS -o /dev/null -w 'recancel=%{http_code}\n' -X DELETE http://100.114.205.53:8801/api/jobs/$JID

# Confirm phase
curl -sS http://100.114.205.53:8801/api/jobs/$JID | jq '{phase, failure_reason}'
```

Expected: `cancel=204`, `recancel=204`, `phase: "failed"`, `failure_reason: "cancelled"`.

- [ ] **Step 8: Checkpoint** - backend complete. Frontend wiring is Tasks 8-10.

---

## Task 8: Frontend - detect `job_id`, polling loop, partial-result rendering

**Files:**
- Modify: `/opt/sortof/frontend/sortof-app.jsx`

- [ ] **Step 1: Backup**

```bash
cp /opt/sortof/frontend/sortof-app.jsx /opt/sortof/frontend/sortof-app.jsx.bak-$(date +%Y%m%d-%H%M)
```

- [ ] **Step 2: Add a polling helper near other top-of-file helpers**

Find the `buildModsLine` function (around line 32). Below it (and **above** the `isRadioMode` / `defaultSelectionForBranches` block), add:

```jsx
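// Spec B+F: poll a job. pollJobLoop resolves with the last poll outcome
// ({ kind: 'ok', body } | { kind: 'expired' } | { kind: 'error', status }) on a
// terminal phase, or with { kind: 'aborted' } once the AbortSignal fires. Caller
// controls the AbortSignal to cancel polling on unmount / cancel-button / new-sort.
const POLL_INTERVAL_MS = 2500;

async function pollJobOnce(jobId) {
  try {
    const res = await fetch(`/api/jobs/${jobId}`);
    if (res.status === 404) return { kind: 'expired' };
    if (!res.ok) return { kind: 'error', status: res.status };
    const json = await res.json();
    return { kind: 'ok', body: json };
  } catch (e) {
    // Network failure or unparseable body: report it instead of rejecting, so
    // the pollJobLoop promise always settles.
    return { kind: 'error', status: 0 };
  }
}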

function pollJobLoop(jobId, signal, onTick) {
  // Returns a Promise that resolves on terminal phase or AbortSignal.
  return new Promise((resolve) => {
    let timer = null;
    async function tick() {
      if (signal.aborted) { if (timer) clearTimeout(timer); resolve({ kind: 'aborted' }); return; }
      const r = await pollJobOnce(jobId);
      if (signal.aborted) { resolve({ kind: 'aborted' }); return; }
      onTick(r);
      if (r.kind === 'expired' || r.kind === 'error') { resolve(r); return; }
      const phase = r.body.phase;
      if (phase === 'done' || phase === 'failed') { resolve(r); return; }
      timer = setTimeout(tick, POLL_INTERVAL_MS);
    }
    tick();
  });
}
```

- [ ] **Step 3: Add an AbortController ref + cancel-job state in App**

Find the App function's state declarations (search for `const [pzBuild, setPzBuild]`). Below the existing useState/useRef block but above the existing useEffects, add:

```jsx
const pollAbortRef = useRef(null);
const [activeJobId, setActiveJobId] = useState(null);
```

- [ ] **Step 4: Update `onSort` to branch on `job_id`**

Find the `async function onSort()` body. The current body POSTs to `/api/sort` and applies the response. Find the line that receives the response:

```jsx
const json = await res.json();
_liveSortData = json;
```

Insert a branch immediately before `_liveSortData = json`:

```jsx
const json = await res.json();
if (json.job_id) {
  // Async path - start polling and let the loop drive state.
  // Abort any in-flight previous poll.
  if (pollAbortRef.current) { pollAbortRef.current.abort(); }
  const ctrl = new AbortController();
  pollAbortRef.current = ctrl;
  setActiveJobId(json.job_id);

  pollJobLoop(json.job_id, ctrl.signal, (r) => {
    if (r.kind !== 'ok') return;
    const b = r.body;
    if (b.result) {
      _liveSortData = b.result;
      sortContextRef.current = {
        workshopItemsLine: (b.result.WORKSHOP_ITEMS_LINE) || '',
        originalQueued: (b.result.pending || []).length,
        unknown: b.result.unknown || [],
        nonMod:  b.result.non_mod || [],
      };
    }
    setProgress(b.phase === 'expanding' ? 5 : Math.min(95, 10 + b.counts.cached));
    setCounts({
      cached: b.counts.cached,
      queued: b.counts.queued,
      parsing: b.counts.draining,
      warnings: ((b.result && b.result.WARNINGS) || []).length,
      unknown: ((b.result && b.result.unknown) || []).length,
      nonMod:  ((b.result && b.result.non_mod) || []).length,
    });
    setState(b.phase);   // 'expanding' | 'queued' | 'draining' | 'done' | 'failed'
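  }).then((final) => {
    // An aborted loop was superseded (new sort) or cancelled. Whoever aborted
    // it owns the UI state, so don't clear the newer job's id from here.
    if (final.kind === 'aborted') return;
    setActiveJobId(null);
    if (final.kind === 'error') {
      // Polling stopped on an HTTP or network error; don't leave the strip
      // frozen on the last live phase.
      setState('error');
    }
    if (final.kind === 'expired') {
      setState('error');
      _liveSortData = {
        ...(_liveSortData || {}),
        WARNINGS: [
          ...((_liveSortData?.WARNINGS) || []),
          { tag: 'retry', level: 'red', msg: 'this job expired - re-submit' },
        ],
      };
    }
  });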
  return;
}
// Sync fast path - existing code follows.
_liveSortData = json;
```

- [ ] **Step 5: Verify served file picks up the new symbols**

```bash
curl -sS http://100.114.205.53:8801/sortof-app.jsx | grep -cE 'pollJobLoop|pollJobOnce|activeJobId|pollAbortRef|/api/jobs/'
```

Expected: ≥ 6.

- [ ] **Step 6: Manual browser smoke (or curl-driven simulation)**

The implementer should open `https://sortof.indifferentketchup.com/` and submit a real PZ collection URL. Expect: status strip text changes from `expanding…` to `queued`/`draining` to `done`. Network tab shows `GET /api/jobs/<uuid>` calls every 2.5s. Output panel populates as mods land in `mod_parsed`.

If headless: replicate by curling `/api/sort` with a collection URL, capturing the `job_id`, and curling `/api/jobs/<id>` repeatedly until `phase` is `done`. Confirm the response's `result` field (the persisted `result_json`) populates with `MODS_LINE` etc. once draining completes.
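
The headless check can also be scripted. Below is a sketch using only the standard library, assuming the GET endpoint returns the JSON shape from Task 7; `fetch_job`, `poll_until_terminal`, and the `max_polls` cap are hypothetical and not part of the plan:

```python
import json
import time
import urllib.error
import urllib.request
from typing import Any, Callable, Dict

TERMINAL = {"done", "failed", "expired"}


def fetch_job(base_url: str, job_id: str) -> Dict[str, Any]:
    """GET /api/jobs/{job_id}; a 404 is reported as a synthetic 'expired' phase."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/jobs/{job_id}") as resp:
            return json.load(resp)
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return {"phase": "expired"}
        raise


def poll_until_terminal(
    get_job: Callable[[], Dict[str, Any]],
    interval_s: float = 2.5,               # same cadence as the frontend
    max_polls: int = 240,                  # assumed safety cap, not from the spec
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call get_job() until the job reaches a terminal phase, then return the body."""
    for _ in range(max_polls):
        body = get_job()
        if body["phase"] in TERMINAL:
            return body
        sleep(interval_s)
    raise TimeoutError("job did not reach a terminal phase")
```

A typical call would be `poll_until_terminal(lambda: fetch_job("http://100.114.205.53:8801", jid))`, followed by a look at `body["result"]["MODS_LINE"]`. Injecting `get_job` and `sleep` keeps the loop testable without a live API.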

- [ ] **Step 7: Checkpoint** - polling drives `_liveSortData` updates. Phase-specific status strip is Task 9.

---

## Task 9: Frontend - phase-specific status strip

**Files:**
- Modify: `/opt/sortof/frontend/sortof-app.jsx`

- [ ] **Step 1: Backup**

```bash
cp /opt/sortof/frontend/sortof-app.jsx /opt/sortof/frontend/sortof-app.jsx.bak-$(date +%Y%m%d-%H%M)
```

- [ ] **Step 2: Update `StatusStrip` to render phase-specific text**

Find the existing `function StatusStrip({ state, counts, progress })`. The existing function returns either an idle strip or a counts strip based on `state`. Replace its body with:

```jsx
function StatusStrip({ state, counts, progress }) {
  // Idle / terminal states - single pill summary.
  if (state === 'idle' || state === 'success' || state === 'error' || state === 'cold' || state === 'done' || state === 'failed') {
    return (
      <div className="status-strip">
        <span className="status-pill idle">
          <span className="dot-led"></span>
          {state === 'idle'    && 'ready when you are'}
          {state === 'success' && `done. ${counts.cached} mods, ${counts.warnings} warnings`}
          {state === 'done'    && `done. ${counts.cached} mods, ${counts.warnings} warnings`}
          {state === 'error'   && 'something went sideways'}
          {state === 'failed'  && 'job failed'}
          {state === 'cold'    && 'cache miss - be patient'}
        </span>
      </div>
    );
  }

  // 'expanding' phase - no useful counts yet.
  if (state === 'expanding') {
    return (
      <div className="status-strip">
        <span className="status-pill expanding">
          <span className="dot-led"></span>expanding collection…
        </span>
        <div className="progress-bar"><i style={{ width: `${progress}%` }}></i></div>
      </div>
    );
  }

  // 'queued' or 'draining' - live counts. Existing 'partial'/'loading' too.
  return (
    <div className="status-strip">
      <span className="status-pill cached">
        <span className="dot-led"></span>{counts.cached} cached
      </span>
      <span className="status-pill queued">
        <span className="dot-led"></span>{counts.queued} queued
      </span>
      <span className="status-pill parse">
        <span className="dot-led"></span>{counts.parsing} draining
      </span>
      {counts.unknown > 0 && (
        <span className="status-pill unknown" title="Steam doesn't recognize these IDs (deleted, typo'd, or private)">
          <span className="dot-led"></span>{counts.unknown} unknown
        </span>
      )}
      {counts.nonMod > 0 && (
        <span className="status-pill nonmod" title="Workshop items that aren't loadable mods (collections, art, etc.)">
          <span className="dot-led"></span>{counts.nonMod} non-mod
        </span>
      )}
      <div className="progress-bar"><i style={{ width: `${progress}%` }}></i></div>
    </div>
  );
}
```

- [ ] **Step 3: Verify served**

```bash
curl -sS http://100.114.205.53:8801/sortof-app.jsx | grep -cE 'expanding collection|state === .expanding|state === .draining|state === .done|state === .failed'
```

Expected: ≥ 4.

- [ ] **Step 4: Manual browser smoke**

Submit a collection URL. Expect: strip starts with `expanding collection…`, transitions to live counts, ends with `done. N mods, …` summary.

---

## Task 10: Frontend - Cancel button + 404 expired-job handling + CSS

**Files:**
- Modify: `/opt/sortof/frontend/sortof-app.jsx`
- Modify: `/opt/sortof/frontend/index.html`

- [ ] **Step 1: Backup**

```bash
cp /opt/sortof/frontend/sortof-app.jsx /opt/sortof/frontend/sortof-app.jsx.bak-$(date +%Y%m%d-%H%M)
cp /opt/sortof/frontend/index.html /opt/sortof/frontend/index.html.bak-$(date +%Y%m%d-%H%M)
```

- [ ] **Step 2: Render a Cancel button when a job is active**

Find where the Sort button is rendered (search for `sort-btn` and `onClick={onSort}`). It's inside the left column. Below the existing Sort button JSX, add:

```jsx
{activeJobId && (
  <button
    className="cancel-btn"
    onClick={async () => {
      if (pollAbortRef.current) pollAbortRef.current.abort();
      try {
        await fetch(`/api/jobs/${activeJobId}`, { method: 'DELETE' });
      } catch {}
      setActiveJobId(null);
      setState('idle');
      setProgress(0);
    }}
  >cancel</button>
)}
```

- [ ] **Step 3: CSS for `.status-pill.expanding` and `.cancel-btn`**

Open `/opt/sortof/frontend/index.html`. Find the `.status-pill.nonmod` rule (added during the unknown/non-mod feature). Below it, add:

```css
  .status-pill.expanding { color: var(--acc-blue); }
  .status-pill.expanding .dot-led { background: var(--acc-blue); animation: bl 1.2s ease-in-out infinite; }

  .cancel-btn {
    appearance: none;
    width: 100%;
    height: 32px;
    margin-top: 6px;
    border: 1px solid var(--line);
    background: transparent;
    color: var(--fg-2);
    border-radius: var(--radius);
    font-family: var(--mono);
    font-size: 12px;
    cursor: pointer;
    transition: color .12s, border-color .12s, background .12s;
  }
  .cancel-btn:hover { color: var(--acc-red); border-color: var(--acc-red); background: var(--acc-red-bg); }
```

- [ ] **Step 4: Verify served**

```bash
curl -sS http://100.114.205.53:8801/sortof-app.jsx | grep -c 'cancel-btn'
curl -sS http://100.114.205.53:8801/ | grep -cE '\.status-pill\.expanding|\.cancel-btn'
```

Expected: ≥ 1 (jsx), ≥ 2 (CSS).

- [ ] **Step 5: Manual browser smoke**

Submit a fresh cold collection. While the strip reads `expanding` or `draining`, click cancel. Expect: strip clears, `setActiveJobId(null)` fires, no further GET polls in the network tab.

---

## Task 11: Spec §11 acceptance + §12 test recipes

For each item, document expected vs actual. If any fails, return to the relevant task.

- [ ] **Step 1: §11.1 Sync fast path** - bare wsids, all cached.

```bash
curl -sS -X POST http://100.114.205.53:8801/api/sort \
  -H 'Content-Type: application/json' \
  -d '{"input":"2169435993;2392709985;2487022075"}' | jq '{has_job_id: (has("job_id")), MODS_LINE}'
```

Expected: `{"has_job_id": false, "MODS_LINE": "modoptions;tsarslib;TMC_TrueActions"}`.

- [ ] **Step 2: §11.2 Async path on uncached bare wsid**

```bash
sudo docker exec -i sortof_db psql -U sortof -d sortof -c "DELETE FROM mod_parsed WHERE workshop_id='2196102849'; DELETE FROM workshop_meta WHERE workshop_id='2196102849';"
curl -sS -X POST http://100.114.205.53:8801/api/sort -H 'Content-Type: application/json' -d '{"input":"2196102849"}' | jq '{status, has_job_id: (has("job_id"))}'
```

Expected: `{"status":"queued","has_job_id":true}`.

- [ ] **Step 3: §11.3 Collection URL → expanding**

```bash
COLL_ID="<paste-real-PZ-collection-id>"
curl -sS -X POST http://100.114.205.53:8801/api/sort \
  -H 'Content-Type: application/json' \
  -d "{\"input\":\"https://steamcommunity.com/sharedfiles/filedetails/?id=$COLL_ID\"}" \
  | jq '{status, has_job_id: (has("job_id"))}'
```

Expected: `{"status":"expanding","has_job_id":true}`.

- [ ] **Step 4: §11.4 GET on bogus job → 404**

```bash
curl -sS -o /dev/null -w 'http=%{http_code}\n' http://100.114.205.53:8801/api/jobs/00000000-0000-0000-0000-000000000000
```

Expected: `http=404`.

- [ ] **Step 5: §11.5 DELETE → idempotent 204**

```bash
JID=$(curl -sS -X POST http://100.114.205.53:8801/api/sort -H 'Content-Type: application/json' -d "{\"input\":\"https://steamcommunity.com/sharedfiles/filedetails/?id=$COLL_ID\"}" | jq -r '.job_id')
curl -sS -o /dev/null -w 'cancel1=%{http_code}\n' -X DELETE http://100.114.205.53:8801/api/jobs/$JID
curl -sS -o /dev/null -w 'cancel2=%{http_code}\n' -X DELETE http://100.114.205.53:8801/api/jobs/$JID
curl -sS http://100.114.205.53:8801/api/jobs/$JID | jq '{phase, failure_reason}'
```

Expected: `cancel1=204`, `cancel2=204`, `{"phase":"failed","failure_reason":"cancelled"}`.

- [ ] **Step 6: §11.6 Steam URL detection + GetCollectionDetails routing**

Already verified by step 3 of this task. Confirm via journal:

```bash
sudo journalctl -u sortof-api --since "2 min ago" | grep -E 'GetCollectionDetails|expansion'
```

Expected: at least one entry naming the collection ID.

- [ ] **Step 7: §11.7 Cache hit on second submit (within 6h)**

Re-submit the same collection URL. Confirm: the response is fast, and `journalctl` for the new request does NOT show a fresh `GetCollectionDetails` call (the expansion task's cache hit short-circuits the API call). Note that a second submit still creates a new `sort_jobs` row; it only reuses the cached children.

- [ ] **Step 8: §11.8 Partial-collection failure**

```bash
# Combine real + bogus collection URL
curl -sS -X POST http://100.114.205.53:8801/api/sort \
  -H 'Content-Type: application/json' \
  -d "{\"input\":\"https://steamcommunity.com/sharedfiles/filedetails/?id=$COLL_ID\nhttps://steamcommunity.com/sharedfiles/filedetails/?id=99999999\"}" | jq '.job_id'
```

Then poll the returned job until `phase=done`, and check `result.WARNINGS`:

```bash
curl -sS http://100.114.205.53:8801/api/jobs/<jid> | jq '.result.WARNINGS[] | select(.tag=="collection-partial")'
```

Expected: one entry with `msg` mentioning `99999999`.

- [ ] **Step 9: §11.9 All collections fail**

```bash
curl -sS -X POST http://100.114.205.53:8801/api/sort \
  -H 'Content-Type: application/json' \
  -d '{"input":"https://steamcommunity.com/sharedfiles/filedetails/?id=99999999"}' | jq -r '.job_id' | tee /tmp/jid_test
sleep 3
curl -sS http://100.114.205.53:8801/api/jobs/$(cat /tmp/jid_test) | jq '{phase, failure_reason}'
```

Expected: `{"phase":"failed","failure_reason":"all input collections unresolvable"}`.

- [ ] **Step 10: §11.10 Stale-expansion sweep on restart**

```bash
# Manually create a stale expansion row
# -q suppresses the "INSERT 0 1" tag, -A/-t print just the bare job_id value
sudo docker exec -i sortof_db psql -U sortof -d sortof -qAt -c "
INSERT INTO sort_jobs (phase, phase_started_at, input_raw, collection_ids)
VALUES ('expanding', now() - interval '15 minutes', 'sweep test', ARRAY['99999999'])
RETURNING job_id;" | xargs -I{} echo "stale_jid={}"
sudo systemctl restart sortof-api && sleep 3
sudo docker exec -i sortof_db psql -U sortof -d sortof -c "
SELECT phase, failure_reason FROM sort_jobs WHERE input_raw='sweep test';"
```

Expected: phase=`failed`, failure_reason=`expansion timed out`.
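
The sweep's selection rule can be pictured as a predicate over one row. This is a sketch only: `is_stale_expansion` is a hypothetical name, and the age threshold is passed in explicitly because its real value lives in `jobs.sweep_stale_expansions` and the spec, not in this section:

```python
from datetime import datetime, timedelta


def is_stale_expansion(phase: str, phase_started_at: datetime,
                       now: datetime, max_age: timedelta) -> bool:
    """True when a row is still 'expanding' and has been for longer than max_age."""
    return phase == "expanding" and (now - phase_started_at) > max_age
```

The test row above is backdated 15 minutes, so it is reaped under any threshold shorter than that, while rows in other phases are never touched by the sweep.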

- [ ] **Step 11: §11.11 Counts contract - sum equals total minus non_mod**

After a cold collection drains, sum the live counts and compare to wsids count:

```bash
JID=<some-active-job>
curl -sS http://100.114.205.53:8801/api/jobs/$JID | jq '
  .counts as $c | .wsids as $w | {
    sum: ($c.cached + $c.queued + $c.draining),
    total_wsids: ($w | length),
    delta: (($w | length) - ($c.cached + $c.queued + $c.draining))
  }'
```

Expected: `delta` is the count of wsids that ended up `non_mod` or `unknown` (not in any of the three count buckets) - typically 0 or a small integer.
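
The same arithmetic in Python, for use against a saved GET response. A sketch, assuming the response shape from Task 7; `counts_delta` is a hypothetical helper name:

```python
from typing import Any, Dict


def counts_delta(job_body: Dict[str, Any]) -> int:
    """wsids not accounted for by the cached / queued / draining buckets."""
    counts = job_body["counts"]
    wsids = job_body.get("wsids") or []   # wsids is null while expanding
    bucketed = counts["cached"] + counts["queued"] + counts["draining"]
    return len(wsids) - bucketed
```

A non-negative result is the number of wsids that fell outside the three buckets; a negative result would indicate the counts contract is broken.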

- [ ] **Step 12: §11.12 WORKSHOP_ITEMS_LINE locked at job creation**

After cancellation or partial failure, the final `result.WORKSHOP_ITEMS_LINE` from `/api/jobs/<id>` must equal `wsids[]` joined by `;` (plus the trailing `;` that `_build_result_for_job` appends), regardless of how many landed in `non_mod` / `unknown`. Spot-check by comparing the two.
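
The spot-check can be automated against a saved GET response. A sketch, assuming the Task 7 response shape; `workshop_line_matches` is a hypothetical helper, and it deliberately tolerates a trailing `;`:

```python
from typing import Any, Dict


def workshop_line_matches(job_body: Dict[str, Any]) -> bool:
    """True when result.WORKSHOP_ITEMS_LINE lists exactly wsids[], in order."""
    wsids = job_body.get("wsids") or []
    result = job_body.get("result") or {}
    line = result.get("WORKSHOP_ITEMS_LINE", "")
    # Split on ';' and drop the empty piece left by a trailing separator.
    ids = [part for part in line.split(";") if part]
    return ids == wsids
```

Order matters in the comparison, since the line is meant to be locked to `wsids[]` as stored at job creation.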

- [ ] **Step 13: Public-hostname mirror**

```bash
curl -sS -X POST https://sortof.indifferentketchup.com/api/sort \
  -H 'Content-Type: application/json' \
  -d '{"input":"2169435993;2392709985;2487022075"}' | jq '{status, MODS_LINE}'
```

Expected: success, canonical MODS_LINE. Public side mirrors all backend behavior.

- [ ] **Step 14: Final regression** - re-run the canonical 3-mod sync sort once more and confirm the response shape did not gain a `job_id` field.

---

## Self-review (already applied)

- **Spec coverage:** §1 (overview) - Tasks 1-7 cover backend, 8-10 cover frontend. §2 (API contract) - Tasks 6-7. §3 (schema) - Task 1. §4 (phase machine) - `derive_phase` in Task 4. §5 (Steam expansion) - Tasks 3, 5. §6 (counts contract) - `compute_counts` in Task 4, applied in Task 7. §7 (frontend) - Tasks 8-10. §8 (cancellation) - Task 7 step 4 + Task 10. §9 (restart resilience) - `sweep_stale_expansions` in Task 4 + lifespan wiring in Task 6. §10 (open questions) - locked in spec; plan implements verbatim. §11/§12 - Task 11.
- **Placeholders:** every angle-bracket marker (`<real-PZ-collection-id>`, `<paste-real-PZ-collection-id>`, `<jid>`, `<some-active-job>`) is an explicit implementer action; no TBDs.
- **Type consistency:** `wsids` is `List[str] | None` everywhere; `counts` is `{cached: int, queued: int, draining: int}` everywhere; `phase` is the locked enum throughout. The `pollJobLoop` callback receives `r.body`, whose shape matches the GET endpoint's return value.
- **No git, by design:** every code-changing task starts with a `cp file file.bak-$(date)` step in lieu of a commit. The schema migration in Task 1 is idempotent (`CREATE TABLE IF NOT EXISTS`) so no rollback file is needed.
- **Restart-vs-no-restart:** backend tasks (1, 4, 5, 6, 7) end with `sudo systemctl restart sortof-api`. Frontend tasks (8, 9, 10) end with `curl-grep` only - `StaticFiles` serves from disk; hard-refresh in browser. Worker is unchanged across the entire feature.