# Plan: pzmm conflict detection + content-type categorization

**Date:** 2026-05-04
**Branch:** `feat/pzmm-conflict-typing`
**Status:** Approved (Sam, 2026-05-04)

**Sources read:**
- `/tmp/pzmm-src/pzmm-main/core/scanner.py` — `scan_file_conflicts`, `solve_load_order`, `FileConflict`
- `/tmp/pzmm-src/pzmm-main/core/mods.py` — `detect_mod_types`, `ModInfo`
- `/tmp/pzmm-src/pzmm-main/core/bundle.py` — debug bundle (read for context, not integrated)
- `/opt/sortof/init/01_schema.sql` and migrations 02..08
- `/opt/sortof/api/app.py` — `/api/sort`, `_build_result_for_job`, `_row_to_modinfo`
- `/opt/sortof/api/mlos_sort.py` — `CATEGORY_ORDER`, `derive_category`
- `/opt/sortof/api/adapters.py` — `CAT_MAP`
- `/opt/sortof/worker/worker.py` — `process_one`

**Open questions resolved at approval:**
- Manifest scope: walk all `media/` subtrees under the mod_id root, last-wins on duplicate rel_paths, **no per-branch column**.
- `mod_files.size_bytes` column: keep.
- Module split: `api/diagnostics.py` and `api/categorize.py` are **separate files**.
- `/api/conflicts` v1: **bare wsids only**, return HTTP 400 on collection input. Defer async-job/collection-expansion plumbing to a follow-up plan.

---

## 1. Context

pzmm ships two pieces sortof doesn't have today:

1. **File-conflict detection** — when two mods both ship `media/scripts/items_food.txt` with byte-different content, the later one silently overrides the earlier one at runtime. PZ never reports this; the player only sees the symptom (broken food, duplicate item ids, etc.). pzmm walks each mod's `media/` tree, hashes the conflict-prone extensions (`.lua`, `.txt`, `.xml`, `.json`, `.ini`), and reports rel-paths claimed by ≥2 mods with non-equal content. Sortof currently only detects `mod_id` collisions (one mod_id under multiple wsids). File-level overrides are invisible to us.
2. **Content-type detection** — pzmm walks `media/` paths plus the contents of `lua/` and `scripts/*.txt|xml` files to fingerprint what a mod actually ships (Weapons, Vehicles, Maps, Traits, Professions, Recipes, etc.). Sortof's `derive_category` infers category from `workshop_meta.tags` + name regex + `mod.info` tags. Authors who tag poorly (or skip tagging) end up in `other`/`undefined`. Detection from media/ contents is more reliable for those.

Both pzmm functions assume on-disk media trees. Sortof's worker uses `tempfile.TemporaryDirectory` (`worker/worker.py:472`) — the entire DD extraction is destroyed at the end of `process_one`'s `with` block. **Only `mod.info` (as `raw_mod_info`), discovered map folder names, and a few derived columns persist.**

This plan keeps the existing model: parse once, serve from DB. We **persist a manifest at parse time**. Re-fetching on demand was rejected: every conflict check would queue N DD pulls, which costs minutes per request and makes the endpoint unusable interactively.

We **do not import pzmm's `solve_load_order`**. Sortof's `mlos_sort.py` is strictly more correct (preorder, loadFirst/loadLast tiers, category buckets, patch G-axis, multi-branch picker, addon injection). pzmm's solver is a plain Kahn topo sort with no tie-breakers.

---

## 2. Integration A — File conflict detection

### 2.1 New schema (`init/09_mod_files.sql`)

```sql
CREATE TABLE IF NOT EXISTS mod_files (
    workshop_id  TEXT NOT NULL,
    mod_id       TEXT NOT NULL,
    rel_path     TEXT NOT NULL,            -- lowercased, posix-style, relative to mod_id root
    sha1         TEXT NOT NULL,
    size_bytes   INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (workshop_id, mod_id, rel_path),
    FOREIGN KEY (workshop_id, mod_id) REFERENCES mod_parsed (workshop_id, mod_id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS mod_files_rel_path_idx ON mod_files (rel_path);
CREATE INDEX IF NOT EXISTS mod_files_mod_idx ON mod_files (workshop_id, mod_id);
```

Plus additions to `mod_parsed`:

```sql
ALTER TABLE mod_parsed
  ADD COLUMN IF NOT EXISTS mod_types TEXT[] NOT NULL DEFAULT '{}',
  ADD COLUMN IF NOT EXISTS files_manifest_built BOOLEAN NOT NULL DEFAULT FALSE;
```

The flag lets `derive_category` and `/api/conflicts` know whether a mod has a manifest yet (graceful degradation while the cache backfills organically).

### 2.2 Worker changes (`worker/worker.py`)

In `process_one`, **inside the existing `with tempfile.TemporaryDirectory` block** (after `discover_mod_infos`, before the `with` exits):

**Single-pass requirement:** the manifest build (Integration A) and `detect_mod_types` content sniffing (Integration B) **share one pass over the tempdir**. No two-pass implementations. The walk reads each file's bytes once: hash → manifest insert, and in the same iteration inspect path + content for type signals. The output is the `mod_files` rows for that mod_id and the ordered `mod_types` list, both committed in the same transaction as the existing `UPSERT_MOD_PARSED`.

For each `(workshop_id, mod_id)` pair we just upserted:

1. Compute `mod_id_root`: the directory whose name equals `mod.id`. For B41 (`mods/<modId>/mod.info`) that's `mip.parent`; for B42 (`mods/<modId>/<branch>/mod.info`) that's `mip.parent.parent`. Detect via `mip.parent.name == mod.id`.
2. Single recursive walk under `mod_id_root` covering every `media/` subtree (handles B42 `<branch>/media/` + `common/media/` together). For each file:
   - If suffix matches `_CONFLICT_EXTS = {".lua", ".txt", ".xml", ".json", ".ini"}` (verbatim from pzmm `scanner.py:21`), compute sha1 (chunked reader, mirrors pzmm `_sha1`) and accumulate `(rel_path, sha1, size_bytes)`. **Last-wins** on duplicate rel_paths across branches.
   - In the same loop iteration, accumulate the path-based signals from pzmm `mods.py:detect_mod_types` (lines 88–115): `Maps`, `Tiles`, `Textures`, `Vehicles`, `Clothing`, `Sounds`, `UI`, `Animations`, `Translations`, `Lua`, plus collected `lua_text_parts` and `script_text_parts` blobs (capped at 60 lua × 64 KB and 80 script × 96 KB per pzmm).
3. After the walk, run pzmm's content-blob checks (lines 117–136): weapon/vehicle/item/recipe/clothing/trait/profession signals from concatenated blobs. Resolve to `mod_types` ordered list (lines 138–145).
4. DELETE existing `mod_files` rows for `(workshop_id, mod_id)` then bulk INSERT new rows.
5. UPSERT `mod_parsed.mod_types` and set `files_manifest_built = true` for the row.

The whole step only adds a disk walk plus hashing of small text files: a typical mod has 20–200 files in scope, each ≤100 KB, and sha1 runs at roughly 500 MB/s, so hashing is cheap. Estimated cost: <500 ms per mod, well under the DD pull cost we're already paying.
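The manifest half of the walk can be sketched as follows. This is a minimal illustration, not the worker code: `sha1_file`, `resolve_mod_id_root`, and `build_manifest` are hypothetical names, the type-signal accumulation from step 2 is left out, and two points are assumptions rather than decisions taken in this plan: `rel_path` is cut so that it starts at `media/` (which is what makes duplicates across branches possible), and "last wins" is resolved by sorted path order.

```python
import hashlib
from pathlib import Path

CONFLICT_EXTS = {".lua", ".txt", ".xml", ".json", ".ini"}


def sha1_file(path: Path, chunk_size: int = 64 * 1024) -> str:
    """Hash a file in fixed-size chunks so it is never loaded whole."""
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve_mod_id_root(mod_info_path: Path, mod_id: str) -> Path:
    """B41 keeps mod.info directly under <modId>/; B42 nests it one level down."""
    parent = mod_info_path.parent
    return parent if parent.name == mod_id else parent.parent


def build_manifest(mod_id_root: Path) -> dict[str, tuple[str, int]]:
    """Map rel_path -> (sha1, size_bytes) for conflict-prone files under any media/ subtree."""
    manifest: dict[str, tuple[str, int]] = {}
    # ASSUMPTION: sorted order decides which branch is "last"; the plan only says last-wins.
    for path in sorted(mod_id_root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in CONFLICT_EXTS:
            continue
        parts = [part.lower() for part in path.relative_to(mod_id_root).parts]
        if "media" not in parts:
            continue
        # ASSUMPTION: drop any branch prefix so rel_path starts at "media/".
        rel_path = "/".join(parts[parts.index("media"):])
        manifest[rel_path] = (sha1_file(path), path.stat().st_size)
    return manifest
```

The returned dict maps directly onto the `(rel_path, sha1, size_bytes)` tuples that step 4 bulk-inserts; because it is keyed by `rel_path`, a later duplicate simply overwrites the earlier entry.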

### 2.3 New module: `api/diagnostics.py`

Port of pzmm `scan_file_conflicts` adapted to read from `mod_files` instead of walking disk:

```python
async def scan_file_conflicts(conn, mods: list[ModInfo]) -> list[FileConflict]:
    """For the given (already-loaded) ModInfos, report rel_paths claimed
    by ≥2 mods with non-equal sha1. Returns list ordered by rel_path."""
```

Implementation:
1. `SELECT workshop_id, mod_id, rel_path, sha1 FROM mod_files WHERE (workshop_id, mod_id) IN (...)`.
2. Group rows in Python by `rel_path`.
3. For each group with ≥2 distinct mods, count distinct sha1s. If >1, emit a `FileConflict`.
4. Winner = last in input order (mirrors pzmm's "last in load order wins").

Dataclass:
```python
@dataclass
class FileConflict:
    rel_path: str
    providers: list[str]   # mod_ids (not ModInfo, to keep payload small)
    winner: str            # mod_id
```

Extension filtering against `pzmm.scanner._CONFLICT_EXTS` happens at manifest-build time, so this read path doesn't need to repeat it.
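The grouping in steps 2–4 can be illustrated on in-memory rows, without a database. This is a sketch under simplifying assumptions: `find_conflicts` is a hypothetical helper, rows are keyed by `mod_id` alone (the real query also carries `workshop_id`), and the dataclass is repeated only so the snippet runs on its own.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FileConflict:  # same shape as the dataclass above
    rel_path: str
    providers: list[str]
    winner: str


def find_conflicts(rows, load_order):
    """rows: (mod_id, rel_path, sha1) tuples; load_order: mod_ids, earliest first."""
    rank = {mod_id: i for i, mod_id in enumerate(load_order)}
    by_path = defaultdict(dict)  # rel_path -> {mod_id: sha1}
    for mod_id, rel_path, sha1 in rows:
        if mod_id in rank:  # ignore rows for mods outside the request
            by_path[rel_path][mod_id] = sha1
    conflicts = []
    for rel_path in sorted(by_path):  # output ordered by rel_path
        owners = by_path[rel_path]
        # Identical copies shipped by several mods are harmless;
        # only paths with differing content count as conflicts.
        if len(owners) < 2 or len(set(owners.values())) < 2:
            continue
        providers = sorted(owners, key=rank.__getitem__)
        conflicts.append(FileConflict(rel_path, providers, winner=providers[-1]))
    return conflicts
```

Sorting `providers` by input rank keeps the "last in load order wins" rule explicit: the winner is always the final element of the list.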

### 2.4 New endpoint: `POST /api/conflicts`

Same input shape as `/api/sort`, **bare wsids only** (Q4 resolved):
```json
{"input": "wsid1;wsid2;wsid3", "rules": "...", "pz_build": "B42"}
```

If `parse_with_collections` returns any `collection_ids`, return HTTP 400 with `detail="conflict scan does not support collection input; resolve via /api/sort first"`.

Response:
```json
{
  "conflicts": [
    {"rel_path": "media/scripts/items_food.txt",
     "providers": ["FoodModA", "FoodModB"],
     "winner": "FoodModB"}
  ],
  "missing_manifests": ["wsid1", "wsid2"]
}
```

`missing_manifests` lists mods we couldn't analyze because `files_manifest_built=false`. The frontend can show a banner ("X mods haven't been re-fetched since this feature shipped; file conflicts unavailable for them"). Manifests fill in as those mods are re-parsed, which happens when a workshop update bumps `time_updated` (see §4.2).

Reuse path: `_build_result_for_job` already loads ModInfos via `_row_to_modinfo` — the conflicts endpoint follows the same load pattern, then calls `scan_file_conflicts(conn, mods)` instead of `sort_mods`.
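The request-handling rules for this endpoint (reject collections, split the rest into scannable vs. missing) can be sketched independently of the web framework. Everything named here is hypothetical: `BadRequest` stands in for whatever the API layer raises to produce an HTTP 400, `plan_conflict_scan` is an illustrative helper, and `manifest_built` is assumed to be a per-wsid lookup derived from `mod_parsed.files_manifest_built`.

```python
COLLECTION_DETAIL = (
    "conflict scan does not support collection input; resolve via /api/sort first"
)


class BadRequest(Exception):
    """Stand-in for the framework exception that maps to HTTP 400."""

    status_code = 400


def plan_conflict_scan(wsids, collection_ids, manifest_built):
    """Return (scannable, missing) wsids, or raise BadRequest on collection input.

    manifest_built: dict wsid -> bool (assumed shape, one flag per wsid).
    """
    if collection_ids:
        raise BadRequest(COLLECTION_DETAIL)
    scannable = [w for w in wsids if manifest_built.get(w, False)]
    missing = [w for w in wsids if not manifest_built.get(w, False)]
    return scannable, missing
```

The `missing` list would feed `missing_manifests` in the response, while only the `scannable` mods are passed on to `scan_file_conflicts`.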

### 2.5 Frontend (out of scope for this plan)

A follow-up plan can wire a "File conflicts" warnings section. For now `/api/conflicts` is consumable from curl and lays the groundwork.

---

## 3. Integration B — Content-type detection feeding category derivation

### 3.1 Schema additions

Already covered by §2.1's `mod_parsed` ALTER TABLE (`mod_types` + `files_manifest_built`). One migration file (`init/09_mod_files.sql`) ships both A and B because they share the worker walk.

### 3.2 Worker changes

Folded into §2.2's single-pass walk. No additional file I/O.

### 3.3 New module: `api/categorize.py`

```python
def types_to_category(mod_types: list[str], name: str) -> str | None:
    """First mod_type that maps to a sortof CATEGORY_ORDER bucket wins.
    Returns None if mod_types is empty or holds only unmapped (*skip*)
    types, so the caller falls through to the existing derive_category
    cascade."""
```

### 3.4 Tag→category mapping (explicit)

| pzmm `mod_type` | sortof `CATEGORY_ORDER` | notes |
|---|---|---|
| `Maps`         | `map`        | already covered by `mod.maps non-empty`; types-derived is a fallback |
| `Vehicles`     | `vehicle`    | name regex `"spawn zone"` already routes to `vehicle_spawn` upstream |
| `Weapons`      | `weapon`     | wins over `Items` (pzmm prefers list ordering) |
| `Items`        | *skip*       | too generic — almost every mod has Items; would mis-trigger |
| `Clothing`     | `wearable`   | armor name-hint check still runs after, can override to `armor` |
| `Traits`       | `code`       | no dedicated `trait` bucket; `code` is the gameplay-axis fallback |
| `Professions`  | `profession` | |
| `Recipes`      | `crafting`   | |
| `Tiles`        | `tile`       | |
| `Textures`     | `texture`    | |
| `Sounds`       | `sound`      | already handled by `Audio` ws_tag; types-derived is a fallback |
| `Animations`   | *skip*       | no bucket; falls through |
| `UI`           | `ui`         | |
| `Translations` | `translation` | |
| `Lua`          | *skip*       | too generic; falls through |
| `Patch`        | `patch`      | already detected by `_PATCH_NAME_RE`; types-derived is a fallback |
| `Dependency`   | `tweaks`     | maps to existing `lib` pill |
| `Framework`    | `tweaks`     | same |
| `Unknown`      | *skip*       | falls through |

"*skip*" means: don't return a category; let `derive_category` continue its cascade.
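One way to express the table is a plain lookup dict with the *skip* rows simply absent. The sketch below transcribes the mapping as written above; the name-based overrides mentioned in the notes column (e.g. the armor hint) depend on regexes this plan does not spell out, so they are deliberately left out and `name` goes unused.

```python
from typing import Optional

# pzmm mod_type -> sortof category, transcribed from the table above.
# Types marked *skip* (Items, Animations, Lua, Unknown) are intentionally absent.
TYPE_TO_CATEGORY = {
    "Maps": "map",
    "Vehicles": "vehicle",
    "Weapons": "weapon",
    "Clothing": "wearable",
    "Traits": "code",
    "Professions": "profession",
    "Recipes": "crafting",
    "Tiles": "tile",
    "Textures": "texture",
    "Sounds": "sound",
    "UI": "ui",
    "Translations": "translation",
    "Patch": "patch",
    "Dependency": "tweaks",
    "Framework": "tweaks",
}


def types_to_category(mod_types: list[str], name: str) -> Optional[str]:
    """Return the bucket for the first mapped type, or None to fall through."""
    # `name` matches the planned signature; name-hint overrides are not
    # specified in the plan and are omitted from this sketch.
    for mod_type in mod_types:
        category = TYPE_TO_CATEGORY.get(mod_type)
        if category is not None:
            return category
    return None
```

For example, `types_to_category(["Items", "Weapons", "Lua"], "Some Guns")` skips `Items` and returns `"weapon"`, while `["Lua", "Unknown"]` yields `None` and leaves the decision to the existing cascade.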

### 3.5 `derive_category` integration

Insert a single new check in `api/mlos_sort.py:derive_category` after the explicit-category early return at line 412, **before** the patch/lib name regex at lines 416–419:

```python
if mod.mod_types:
    cat = types_to_category(mod.mod_types, name)
    if cat:
        return cat
```

`mod.mod_types` is added to the `ModInfo` dataclass (`mlos_sort.py:113`). `_row_to_modinfo` (`api/app.py:176`) is updated to read the new column. **Both `mlos_sort.py` copies must change in lockstep.**

**Position rationale:** `mod_types` comes from media-content fingerprinting, more reliable than name regex but less reliable than an explicit `category=` field in `mod.info`. So it sits between (1) explicit category and (2) name regex. The patch/lib regexes that come after still catch true patches/libraries whose `mod_types` are empty or unmapped (such mods usually return `Patch`/`Dependency` from `detect_mod_types` anyway, but the regex covers cases where a "patch mod" hasn't shipped enough media to fingerprint).

Empty `mod_types` (e.g. older rows where `files_manifest_built=false`) means the new check returns `None` and the existing cascade runs unchanged. **Graceful degradation is built in.**

---

## 4. Blockers / risks

### 4.1 Schema migration cost
- Current cache: **3,123 `workshop_meta` rows, 3,298 `mod_parsed` rows**.
- New `mod_files` rows estimate: median mod ships ~50 conflict-eligible files (light mods 5–10, heavy framework/map mods 200–500). At 50 avg × 3,298 mods = **~165 k rows**. With sha1 (40 chars) + rel_path (avg 80 chars) + overhead ≈ 200 bytes/row, that's ~33 MB before indexes. Postgres handles this trivially.
- `ALTER TABLE mod_parsed ADD COLUMN mod_types TEXT[]` and `files_manifest_built BOOLEAN` are additive and metadata-only on Postgres 16 (no rewrite). Instant.
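The row and size estimate above is simple enough to re-check as arithmetic. The figures are the ones quoted in this section; the per-row byte count is the plan's own rough estimate, not a measured value.

```python
MOD_PARSED_ROWS = 3_298   # current mod_parsed row count
AVG_FILES_PER_MOD = 50    # assumed average of conflict-eligible files
BYTES_PER_ROW = 200       # sha1 (40) + rel_path (~80) + overhead, rough estimate

rows = MOD_PARSED_ROWS * AVG_FILES_PER_MOD
heap_mb = rows * BYTES_PER_ROW / 1_000_000

print(f"{rows:,} rows, ~{heap_mb:.0f} MB before indexes")
# prints: 164,900 rows, ~33 MB before indexes
```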

### 4.2 Backfill feasibility
- The `/tmp/sortof_steam_throttle` flock + `/tmp/sortof_steam_cooldown` 1h kill-switch (worker.py — `fetch_required_wsids`) protect us from Steam metadata 429s. **DD itself does not hit the metadata API**; it hits Steam content servers, which are not part of the rate-limited path. So mass re-DD does not trip the cooldown.
- Mass re-DD still costs real time: typical DD pull is 20–60 s wall-clock. 3,123 wsids × 30 s avg ÷ 4 drains = **~6.5 hours wall-clock for a full backfill**. Doable but disruptive.
- **Recommendation: do not run a bulk backfill.** Let the cache populate organically — every workshop update bumps `time_updated`, which triggers a re-parse and now also a manifest build. The `missing_manifests` field in `/api/conflicts` and the empty-`mod_types` graceful-degrade path together mean the feature works on day 1 (empty results for old rows) and improves as authors push updates.
- Per-mod manual trigger pattern still works (operator-only):
  ```sql
  DELETE FROM mod_parsed WHERE workshop_id='<wsid>';
  INSERT INTO download_jobs (workshop_id, status) VALUES ('<wsid>','queued');
  ```
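For reference, the backfill estimate follows from a one-line formula. The sketch below evaluates it across the quoted 20–60 s pull range, assuming the four drains stay evenly loaded (an idealization; `backfill_hours` is an illustrative helper, not existing code).

```python
def backfill_hours(wsids: int, pull_seconds: float, drains: int) -> float:
    """Wall-clock hours when `drains` workers share `wsids` pulls evenly."""
    return wsids * pull_seconds / drains / 3600


for pull_seconds in (20, 30, 60):
    hours = backfill_hours(3_123, pull_seconds, 4)
    print(f"{pull_seconds} s/pull -> {hours:.1f} h")
# prints:
# 20 s/pull -> 4.3 h
# 30 s/pull -> 6.5 h
# 60 s/pull -> 13.0 h
```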

### 4.3 Inline detection at sort time
- Rejected. `detect_mod_types` reads up to ~11 MB per mod from disk (60 lua × 64 KB + 80 script × 96 KB of blobs). With the tempdir destroyed (the actual case), we'd need to re-DD inline, which costs minutes per sort.
- **All detection runs at parse time** in `process_one`. `derive_category` and `/api/conflicts` are pure DB reads.

---

## 5. Files touched (summary)

**New:**
- `init/09_mod_files.sql` — `mod_files` table, `mod_parsed.mod_types`, `mod_parsed.files_manifest_built`
- `api/diagnostics.py` — port of `scan_file_conflicts`, `FileConflict` dataclass
- `api/categorize.py` — `types_to_category` helper

**Modified:**
- `worker/worker.py` — extend `process_one`'s `with` block: single-pass walk, manifest + detect_mod_types, upsert rows
- `worker/worker.py` (top-level) — port `detect_mod_types` from pzmm `mods.py:57–145` (sortof-side copy; do not import from pzmm at runtime)
- `api/mlos_sort.py` — add `mod_types: List[str]` to `ModInfo` dataclass; add `mod_types` check in `derive_category`, right after the explicit-category early return (§3.5)
- `worker/mlos_sort.py` — mirror the `ModInfo` and `derive_category` change (worker/api dual-edit rule)
- `api/app.py` — `_row_to_modinfo` reads new `mod_types` column; `_build_result_for_job` SELECT list adds `mp.mod_types`; register `POST /api/conflicts`

**Out of scope (deferred to follow-up plan):**
- Frontend conflicts panel — `/api/conflicts` endpoint only, no UI
- Integration of `pzmm/core/bundle.py` (debug bundle export) — read for context, not ported
- Backfill orchestration — relying on organic backfill

---

## 6. Rollback

Before applying the migration:

```bash
# Backup mod_parsed (the only existing table we ALTER)
sudo docker exec -i sortof_db pg_dump -U sortof -d sortof -t mod_parsed \
  > /opt/sortof/backups/mod_parsed-pre-09.sql.$(date +%Y%m%d-%H%M)
ls -la /opt/sortof/backups/ | tail -3
```

Down SQL (paste into psql to revert the schema half of this plan):

```sql
DROP TABLE IF EXISTS mod_files;
ALTER TABLE mod_parsed
  DROP COLUMN IF EXISTS mod_types,
  DROP COLUMN IF EXISTS files_manifest_built;
```

To revert code, `git checkout main` and restart services:
```bash
sudo systemctl restart sortof-api sortof-drain@1 sortof-drain@2 sortof-drain@3 sortof-drain@4
```

The migration is additive only (new table + new columns with safe defaults), so the rollback is a clean drop. No data is destroyed in `mod_parsed`'s existing columns.

---

## 7. Verification

1. **Migration applies cleanly:**
   ```bash
   sudo docker exec -i sortof_db psql -U sortof -d sortof < /opt/sortof/init/09_mod_files.sql
   sudo docker exec -i sortof_db psql -U sortof -d sortof -c "\d mod_files"
   sudo docker exec -i sortof_db psql -U sortof -d sortof -c "\d mod_parsed" | grep -E "mod_types|files_manifest_built"
   ```

2. **Compile checks** (after every Python edit):
   ```bash
   /opt/sortof/api/.venv/bin/python -m py_compile /opt/sortof/api/app.py /opt/sortof/api/mlos_sort.py /opt/sortof/api/diagnostics.py /opt/sortof/api/categorize.py
   /opt/sortof/worker/.venv/bin/python -m py_compile /opt/sortof/worker/worker.py /opt/sortof/worker/mlos_sort.py
   cd /opt/sortof/api    && .venv/bin/python -c "import app"   && echo OK
   cd /opt/sortof/worker && .venv/bin/python -c "import drain" && echo OK
   ```

3. **Dual-edit consistency check** (worker/api `mlos_sort.py` lockstep rule):
   ```bash
   diff /opt/sortof/api/mlos_sort.py /opt/sortof/worker/mlos_sort.py | grep -E "^[<>]" | head -20
   ```
   Logic must match; only comments / docstrings may differ. If any logic line shows up in the diff, fix the lockstep before continuing.

4. **Restart services:**
   ```bash
   sudo systemctl restart sortof-api sortof-drain@1 sortof-drain@2 sortof-drain@3 sortof-drain@4
   sudo systemctl is-active sortof-api sortof-drain@{1..4}
   ```

5. **Force a fresh parse on a known multi-file mod and verify manifest:**
   ```bash
   sudo docker exec -i sortof_db psql -U sortof -d sortof -c \
     "DELETE FROM mod_parsed WHERE workshop_id='2169435993';
      INSERT INTO download_jobs (workshop_id, status) VALUES ('2169435993','queued');"
   sleep 60
   sudo docker exec -i sortof_db psql -U sortof -d sortof -c \
     "SELECT mod_id, mod_types, files_manifest_built FROM mod_parsed WHERE workshop_id='2169435993';
      SELECT count(*) AS file_count FROM mod_files WHERE workshop_id='2169435993';"
   ```
   Expected: `files_manifest_built=t`, `mod_types` populated, `file_count > 0`.

6. **Conflict endpoint smoke:**
   ```bash
   curl -sS -X POST http://100.114.205.53:8801/api/conflicts \
     -H 'Content-Type: application/json' \
     -d '{"input":"2169435993;2392709985;2487022075"}' | jq .
   ```
   Expected: `{"conflicts": [], "missing_manifests": [<wsids without manifests yet>]}`.

7. **Collection-input rejection (Q4):**
   ```bash
   curl -sS -i -X POST http://100.114.205.53:8801/api/conflicts \
     -H 'Content-Type: application/json' \
     -d '{"input":"https://steamcommunity.com/sharedfiles/filedetails/?id=999999999"}' | head -5
   ```
   Expected: HTTP 400 with the documented `detail` message (when the URL is detected as a collection ref).

8. **Category-from-types smoke:**
   - Find a mod whose Steam tags don't reflect its content (e.g. a weapon mod tagged only `Realistic`); `/api/sort` currently classifies it as `code` / `other` / `undefined`.
   - Re-queue it through the new pipeline (delete+insert).
   - Re-run `/api/sort`; confirm category is now `weapon`.

9. **Graceful-degradation check:** confirm a mod with `files_manifest_built=false` still sorts correctly through the existing cascade (no exceptions, category falls back to current behavior).
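For step 3, a plain `diff` cannot tell a comment-only change from a logic change, so the output still has to be read by hand. A possible complement, offered as a sketch rather than part of the approved procedure, is to compare the two files' syntax trees with docstrings stripped; `same_logic` is a hypothetical helper built only on the standard-library `ast` module.

```python
import ast


def _strip_docstrings(tree: ast.AST) -> ast.AST:
    """Remove leading docstrings from the module, classes, and functions."""
    scopes = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if not isinstance(node, scopes):
            continue
        body = node.body
        if (
            body
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        ):
            # Keep the body non-empty so the node stays well-formed.
            node.body = body[1:] or [ast.Pass()]
    return tree


def same_logic(source_a: str, source_b: str) -> bool:
    """True when two sources differ only in comments, docstrings, or layout."""
    # Comments never reach the AST, and ast.dump omits line/column
    # attributes by default, so only the code structure is compared.
    dump_a = ast.dump(_strip_docstrings(ast.parse(source_a)))
    dump_b = ast.dump(_strip_docstrings(ast.parse(source_b)))
    return dump_a == dump_b
```

Usage would be along the lines of `same_logic(Path("/opt/sortof/api/mlos_sort.py").read_text(), Path("/opt/sortof/worker/mlos_sort.py").read_text())` (with `Path` from `pathlib`); a `False` result means some logic line has drifted and the lockstep needs fixing before continuing.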