indifferentketchup b73325882e feat: pzmm conflict detection + content-type categorization

- mod_files manifest table populated at parse time
- POST /api/conflicts endpoint
- mod_types fingerprinting feeds derive_category
- DD filelist regex broadened to cover conflict-eligible exts
- media/maps/<*>/* excluded from manifest (per-mod namespaced,
  no conflict value, can be tens of MB per mod)

Plan: docs/plans/2026-05-04-pzmm-conflict-and-typing.md

2026-05-04 15:22:35 +00:00


Plan: pzmm conflict detection + content-type categorization

Date: 2026-05-04 Branch: feat/pzmm-conflict-typing Status: Approved (Sam, 2026-05-04)

Sources read:

/tmp/pzmm-src/pzmm-main/core/scanner.py — scan_file_conflicts, solve_load_order, FileConflict
/tmp/pzmm-src/pzmm-main/core/mods.py — detect_mod_types, ModInfo
/tmp/pzmm-src/pzmm-main/core/bundle.py — debug bundle (read for context, not integrated)
/opt/sortof/init/01_schema.sql and migrations 02..08
/opt/sortof/api/app.py — /api/sort, _build_result_for_job, _row_to_modinfo
/opt/sortof/api/mlos_sort.py — CATEGORY_ORDER, derive_category
/opt/sortof/api/adapters.py — CAT_MAP
/opt/sortof/worker/worker.py — process_one

Open questions resolved at approval:

Manifest scope: walk all media/ subtrees under the mod_id root, last-wins on duplicate rel_paths, no per-branch column.
mod_files.size_bytes column: keep.
Module split: api/diagnostics.py and api/categorize.py are separate files.
/api/conflicts v1: bare wsids only, return HTTP 400 on collection input. Defer async-job/collection-expansion plumbing to a follow-up plan.

1. Context

pzmm ships two pieces sortof doesn't have today:

- File-conflict detection: when two mods both ship media/scripts/items_food.txt with byte-different content, the later one silently overrides the earlier one at runtime. PZ never reports this; the player only sees the symptom (broken food, duplicate item ids, etc.). pzmm walks each mod's media/ tree, hashes the conflict-prone extensions (.lua, .txt, .xml, .json, .ini), and reports rel-paths claimed by ≥2 mods with non-equal content. Sortof currently detects only mod_id collisions (one mod_id under multiple wsids), so file-level overrides are invisible to us.
- Content-type detection: pzmm walks media/ paths plus the contents of lua/ and scripts/*.txt|xml files to fingerprint what a mod actually ships (Weapons, Vehicles, Maps, Traits, Professions, Recipes, etc.). Sortof's derive_category infers category from workshop_meta.tags + name regex + mod.info tags, so authors who tag poorly (or skip tagging) end up in other/undefined. Detection from media/ contents is more reliable for those mods.

Both pzmm functions assume on-disk media trees. Sortof's worker uses tempfile.TemporaryDirectory (worker/worker.py:472) — the entire DD extraction is destroyed at the end of process_one's with block. Only mod.info (as raw_mod_info), discovered map folder names, and a few derived columns persist.

This plan keeps the existing model: parse once, serve from DB. We persist a manifest at parse time. Re-fetching on demand was rejected: every conflict check would queue N DD pulls and take minutes per request, which is unusable.

We do not import pzmm's solve_load_order. Sortof's mlos_sort.py is strictly more correct (preorder, loadFirst/loadLast tiers, category buckets, patch G-axis, multi-branch picker, addon injection). pzmm's solver is a plain Kahn topo sort with no tie-breakers.

2. Integration A — File conflict detection

2.1 New schema (`init/09_mod_files.sql`)

CREATE TABLE IF NOT EXISTS mod_files (
    workshop_id  TEXT NOT NULL,
    mod_id       TEXT NOT NULL,
    rel_path     TEXT NOT NULL,            -- lowercased, posix-style, relative to mod_id root
    sha1         TEXT NOT NULL,
    size_bytes   INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (workshop_id, mod_id, rel_path),
    FOREIGN KEY (workshop_id, mod_id) REFERENCES mod_parsed (workshop_id, mod_id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS mod_files_rel_path_idx ON mod_files (rel_path);
CREATE INDEX IF NOT EXISTS mod_files_mod_idx ON mod_files (workshop_id, mod_id);

Plus additions to mod_parsed:

ALTER TABLE mod_parsed
  ADD COLUMN IF NOT EXISTS mod_types TEXT[] NOT NULL DEFAULT '{}',
  ADD COLUMN IF NOT EXISTS files_manifest_built BOOLEAN NOT NULL DEFAULT FALSE;

The files_manifest_built flag lets derive_category and /api/conflicts know whether a mod has a manifest yet (graceful degradation while the cache backfills organically).

2.2 Worker changes (`worker/worker.py`)

In process_one, inside the existing with tempfile.TemporaryDirectory block (after discover_mod_infos, before the with exits):

Single-pass requirement: the manifest build (Integration A) and detect_mod_types content sniffing (Integration B) share one pass over the tempdir. No two-pass implementations. The walk reads each file's bytes once: hash → manifest insert; concurrently inspect path + content for type signals. The output is the mod_files rows for that mod_id and the ordered mod_types list, both committed in the same transaction as the existing UPSERT_MOD_PARSED.

For each (workshop_id, mod_id) pair we just upserted:

Compute mod_id_root: the directory whose name equals mod.id. For B41 (mods/<modId>/mod.info) that's mip.parent; for B42 (mods/<modId>/<branch>/mod.info) that's mip.parent.parent. Detect via mip.parent.name == mod.id.
Single recursive walk under mod_id_root covering every media/ subtree (handles B42 <branch>/media/ + common/media/ together). For each file:
- If suffix matches _CONFLICT_EXTS = {".lua", ".txt", ".xml", ".json", ".ini"} (verbatim from pzmm scanner.py:21), compute sha1 (chunked reader, mirrors pzmm _sha1) and accumulate (rel_path, sha1, size_bytes). Last-wins on duplicate rel_paths across branches.
- Concurrently, in the same loop, accumulate the path-based signals from pzmm mods.py:detect_mod_types (lines 88–115): Maps, Tiles, Textures, Vehicles, Clothing, Sounds, UI, Animations, Translations, Lua, plus collected lua_text_parts and script_text_parts blobs (capped at 60 lua × 64 KB and 80 script × 96 KB per pzmm).
After the walk, run pzmm's content-blob checks (lines 117–136): weapon/vehicle/item/recipe/clothing/trait/profession signals from concatenated blobs. Resolve to mod_types ordered list (lines 138–145).
DELETE existing mod_files rows for (workshop_id, mod_id) then bulk INSERT new rows.
UPSERT mod_parsed.mod_types and set files_manifest_built = true for the row.
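
The manifest half of the walk above can be sketched as follows. This is a minimal illustration, not the worker code: the helper names (find_mod_id_root, sha1_file, build_manifest) are hypothetical, the type-signal accumulation is left out, and two points are assumptions. First, rel_path is keyed from media/ downward so that `<branch>/media/x` and `common/media/x` collide. Second, "last wins" is taken to mean the last path in sorted walk order.

```python
import hashlib
from pathlib import Path

# Extension set quoted in the plan (pzmm scanner.py:21).
CONFLICT_EXTS = {".lua", ".txt", ".xml", ".json", ".ini"}


def find_mod_id_root(mod_info_path: Path, mod_id: str) -> Path:
    """B41 (mods/<modId>/mod.info) -> parent; B42 (<modId>/<branch>/mod.info) -> grandparent."""
    parent = mod_info_path.parent
    return parent if parent.name == mod_id else parent.parent


def sha1_file(path: Path, chunk_size: int = 65536) -> tuple[str, int]:
    """Hash a file in chunks; return (hex digest, size in bytes)."""
    digest = hashlib.sha1()
    size = 0
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def build_manifest(mod_id_root: Path) -> dict[str, tuple[str, int]]:
    """Walk every media/ subtree once; map rel_path -> (sha1, size_bytes)."""
    manifest: dict[str, tuple[str, int]] = {}
    for path in sorted(mod_id_root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in CONFLICT_EXTS:
            continue
        parts = [p.lower() for p in path.relative_to(mod_id_root).parts]
        if "media" not in parts:
            continue
        # Assumption: key on the path from media/ down so branches collide.
        rel_parts = parts[parts.index("media"):]
        # media/maps/<name>/... is per-mod namespaced, so it is left out.
        if rel_parts[1:2] == ["maps"] and len(rel_parts) >= 4:
            continue
        # Later paths in sorted order overwrite earlier ones (last wins).
        manifest["/".join(rel_parts)] = sha1_file(path)
    return manifest
```

Each dict entry corresponds to one mod_files row for the (workshop_id, mod_id) pair being processed; the real walk would feed the same bytes to the type-signal checks instead of reading the file a second time.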

The whole step adds only a disk walk plus hashing of small text files: a typical mod has 20–200 files in scope, and hashing is cheap (≤100 KB each, sha1 ≈ 500 MB/s). Estimated cost: <500 ms per mod, well under the DD pull cost we're already paying.

2.3 New module: `api/diagnostics.py`

Port of pzmm scan_file_conflicts adapted to read from mod_files instead of walking disk:

async def scan_file_conflicts(conn, mods: list[ModInfo]) -> list[FileConflict]:
    """For the given (already-loaded) ModInfos, report rel_paths claimed
    by ≥2 mods with non-equal sha1. Returns list ordered by rel_path."""

Implementation:

SELECT workshop_id, mod_id, rel_path, sha1 FROM mod_files WHERE (workshop_id, mod_id) IN (...).
Group rows in Python by rel_path.
For each group with ≥2 distinct mods, count distinct sha1s. If >1, emit a FileConflict.
Winner = last in input order (mirrors pzmm's "last in load order wins").

Dataclass:

@dataclass
class FileConflict:
    rel_path: str
    providers: list[str]   # mod_ids (not ModInfo, to keep payload small)
    winner: str            # mod_id

The pzmm.scanner._CONFLICT_EXTS filtering already happened at manifest-build time, so this read path doesn't need to repeat it.
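
The grouping steps from the implementation list can be sketched in pure Python. This is an illustration under stated assumptions: rows arrive as plain (mod_id, rel_path, sha1) tuples rather than from a database cursor, workshop_id is dropped for brevity, and find_conflicts is a hypothetical name standing in for the body of scan_file_conflicts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FileConflict:
    rel_path: str
    providers: list[str]   # mod_ids in load order
    winner: str            # mod_id that loads last


def find_conflicts(rows, load_order):
    """rows: iterable of (mod_id, rel_path, sha1); load_order: mod_ids, earliest first."""
    position = {mod_id: i for i, mod_id in enumerate(load_order)}
    by_path = defaultdict(dict)  # rel_path -> {mod_id: sha1}
    for mod_id, rel_path, sha1 in rows:
        if mod_id in position:
            by_path[rel_path][mod_id] = sha1

    conflicts = []
    for rel_path in sorted(by_path):
        claims = by_path[rel_path]
        # Need at least two mods AND at least two distinct hashes.
        if len(claims) < 2 or len(set(claims.values())) < 2:
            continue
        providers = sorted(claims, key=position.__getitem__)
        conflicts.append(FileConflict(rel_path, providers, providers[-1]))
    return conflicts
```

Two mods shipping byte-identical copies of the same file produce no conflict, because the distinct-sha1 check filters them out before a FileConflict is emitted.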

2.4 New endpoint: `POST /api/conflicts`

Same input shape as /api/sort, bare wsids only (Q4 resolved):

{"input": "wsid1;wsid2;wsid3", "rules": "...", "pz_build": "B42"}

If parse_with_collections returns any collection_ids, return HTTP 400 with detail="conflict scan does not support collection input; resolve via /api/sort first".

Response:

{
  "conflicts": [
    {"rel_path": "media/scripts/items_food.txt",
     "providers": ["FoodModA", "FoodModB"],
     "winner": "FoodModB"}
  ],
  "missing_manifests": ["wsid1", "wsid2"]
}

missing_manifests lists mods we couldn't analyze because files_manifest_built=false. The frontend can show a banner ("X mods haven't been re-fetched since this feature shipped — file conflicts unavailable for them"), and re-clicking sort eventually triggers re-parse on workshop updates.
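
How the endpoint might separate analysable mods from those still waiting on a manifest is sketched below. The helper names (split_by_manifest, build_response) and the row shape are hypothetical; only the response keys come from the plan.

```python
def split_by_manifest(rows):
    """rows: iterable of (workshop_id, mod_id, files_manifest_built).

    Returns (ready, missing): ready holds (workshop_id, mod_id) pairs that
    can be scanned, missing holds each wsid without a manifest, once.
    """
    ready, missing = [], []
    for workshop_id, mod_id, built in rows:
        if built:
            ready.append((workshop_id, mod_id))
        elif workshop_id not in missing:
            missing.append(workshop_id)
    return ready, missing


def build_response(conflicts, missing):
    """Assemble the JSON-serialisable body described above."""
    return {"conflicts": conflicts, "missing_manifests": missing}
```

Only the ready pairs would be passed on to the conflict scan; the missing list goes straight into the response so the caller can tell "no conflicts" apart from "not analysed yet".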

Reuse path: _build_result_for_job already loads ModInfos via _row_to_modinfo — the conflicts endpoint follows the same load pattern, then calls scan_file_conflicts(conn, mods) instead of sort_mods.

2.5 Frontend (out of scope for this plan)

A follow-up plan can wire a "File conflicts" warnings section. For now /api/conflicts is consumable from curl and lays the groundwork.

3. Integration B — Content-type detection feeding category derivation

3.1 Schema additions

Already covered by §2.1's mod_parsed ALTER TABLE (mod_types + files_manifest_built). One migration file (init/09_mod_files.sql) ships both A and B because they share the worker walk.

3.2 Worker changes

Folded into §2.2's single-pass walk. No additional file I/O.

3.3 New module: `api/categorize.py`

def types_to_category(mod_types: list[str], name: str) -> str | None:
    """First mod_type that maps to a sortof CATEGORY_ORDER bucket wins.
    Returns None if mod_types is empty / Unknown / Dependency-only and we
    should fall through to the existing derive_category cascade."""

3.4 Tag→category mapping (explicit)

| pzmm `mod_type` | sortof `CATEGORY_ORDER` | notes |
| --- | --- | --- |
| `Maps` | `map` | already covered by `mod.maps non-empty`; types-derived is a fallback |
| `Vehicles` | `vehicle` | name regex `"spawn zone"` already routes to `vehicle_spawn` upstream |
| `Weapons` | `weapon` | wins over `Items` (pzmm prefers list ordering) |
| `Items` | skip | too generic; almost every mod has Items, so it would mis-trigger |
| `Clothing` | `wearable` | armor name-hint check still runs after, can override to `armor` |
| `Traits` | `code` | no dedicated `trait` bucket; `code` is the gameplay-axis fallback |
| `Professions` | `profession` | |
| `Recipes` | `crafting` | |
| `Tiles` | `tile` | |
| `Textures` | `texture` | |
| `Sounds` | `sound` | already handled by `Audio` ws_tag; types-derived is a fallback |
| `Animations` | skip | no bucket; falls through |
| `UI` | `ui` | |
| `Translations` | `translation` | |
| `Lua` | skip | too generic; falls through |
| `Patch` | `patch` | already detected by `_PATCH_NAME_RE`; types-derived is a fallback |
| `Dependency` | `tweaks` | maps to existing `lib` pill |
| `Framework` | `tweaks` | same |
| `Unknown` | skip | falls through |
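
One way to express the table as code is a plain dict lookup. This is a sketch, assuming the table is the source of truth: the skipped types are simply absent from the dict, the armor name-hint override is not modelled, and `name` is accepted only to match the signature in §3.3.

```python
from typing import Optional

# Rows marked "skip" in the table (Items, Animations, Lua, Unknown) are
# deliberately absent, so they fall through to the next mod_type.
TYPE_TO_CATEGORY = {
    "Maps": "map",
    "Vehicles": "vehicle",
    "Weapons": "weapon",
    "Clothing": "wearable",
    "Traits": "code",
    "Professions": "profession",
    "Recipes": "crafting",
    "Tiles": "tile",
    "Textures": "texture",
    "Sounds": "sound",
    "UI": "ui",
    "Translations": "translation",
    "Patch": "patch",
    "Dependency": "tweaks",
    "Framework": "tweaks",
}


def types_to_category(mod_types: list[str], name: str) -> Optional[str]:
    """Return the category for the first mapped mod_type, else None."""
    for mod_type in mod_types:
        category = TYPE_TO_CATEGORY.get(mod_type)
        if category is not None:
            return category
    return None
```

Because mod_types is an ordered list, a mod fingerprinted as Items, Weapons, Lua resolves to `weapon`: Items is skipped and Weapons is the first entry with a bucket.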

"skip" means: don't return a category; let derive_category continue its cascade.

3.5 `derive_category` integration

Insert a single new check in api/mlos_sort.py:derive_category after the explicit-category early return at line 412, before the patch/lib name regex at lines 416–419:

if mod.mod_types:
    cat = types_to_category(mod.mod_types, name)
    if cat:
        return cat

mod.mod_types is added to the ModInfo dataclass (mlos_sort.py:113). _row_to_modinfo (api/app.py:176) is updated to read the new column. Both mlos_sort.py copies must change in lockstep.

Position rationale: mod_types comes from media-content fingerprinting, more reliable than name regex but less reliable than an explicit category= field in mod.info. So it sits between (1) explicit category and (2) name regex. The patch/lib regexes that come after still catch true patches/libraries whenever the types check returns None (detect_mod_types would usually return Patch/Dependency for them anyway, but the regex must still fire when a "patch mod" hasn't shipped enough media to fingerprint).

Empty mod_types (e.g. older rows where files_manifest_built=false) means the new check returns None and the existing cascade runs unchanged. Graceful degradation is built in.
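
The ordering argument can be illustrated with a stripped-down cascade. Everything here is a stand-in: the ModInfo fields, the regex pattern, the two-entry type map, and the "undefined" fallback are hypothetical simplifications, not the contents of mlos_sort.py.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical pattern; the real _PATCH_NAME_RE lives in mlos_sort.py.
_PATCH_NAME_RE = re.compile(r"\bpatch\b", re.IGNORECASE)
_TYPE_TO_CATEGORY = {"Weapons": "weapon", "Vehicles": "vehicle"}


@dataclass
class ModInfo:
    name: str
    category: Optional[str] = None          # explicit category= from mod.info
    mod_types: list[str] = field(default_factory=list)


def types_to_category(mod_types, name):
    """First mapped mod_type wins; None means fall through."""
    return next((_TYPE_TO_CATEGORY[t] for t in mod_types if t in _TYPE_TO_CATEGORY), None)


def derive_category(mod: ModInfo) -> str:
    if mod.category:                          # (1) explicit category
        return mod.category
    if mod.mod_types:                         # new check: content fingerprint
        cat = types_to_category(mod.mod_types, mod.name)
        if cat:
            return cat
    if _PATCH_NAME_RE.search(mod.name):       # (2) name regex
        return "patch"
    return "undefined"                        # rest of the cascade, elided
```

A row with empty mod_types skips the new branch entirely, so its result is whatever the pre-existing checks would have returned.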

4. Blockers / risks

4.1 Schema migration cost

Current cache: 3,123 workshop_meta rows, 3,298 mod_parsed rows.
New mod_files rows estimate: a typical mod ships ~50 conflict-eligible files (light mods 5–10, heavy framework/map mods 200–500). At an average of 50 × 3,298 mods ≈ 165 k rows. With sha1 (40 chars) + rel_path (avg 80 chars) + overhead ≈ 200 bytes/row, that's ~33 MB before indexes. Postgres handles this trivially.
ALTER TABLE mod_parsed ADD COLUMN mod_types TEXT[] and files_manifest_built BOOLEAN are additive and, because both defaults are constants, metadata-only on Postgres 16 (no table rewrite). Instant.
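
The row and size estimate can be reproduced with a few lines of arithmetic. The constants are the figures quoted above; the per-row size is the plan's own rough estimate, not a measurement.

```python
MOD_PARSED_ROWS = 3_298      # current mod_parsed row count
AVG_FILES_PER_MOD = 50       # assumed average of conflict-eligible files
BYTES_PER_ROW = 200          # sha1 + rel_path + overhead, rough estimate

rows = MOD_PARSED_ROWS * AVG_FILES_PER_MOD
size_mb = rows * BYTES_PER_ROW / 1_000_000

print(f"{rows:,} rows, ~{size_mb:.0f} MB before indexes")
# prints "164,900 rows, ~33 MB before indexes"
```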

4.2 Backfill feasibility

The /tmp/sortof_steam_throttle flock + /tmp/sortof_steam_cooldown 1h kill-switch (worker.py — fetch_required_wsids) protect us from Steam metadata 429s. DD itself does not hit the metadata API; it hits Steam content servers, which are not part of the rate-limited path. So mass re-DD does not trip the cooldown.
Mass re-DD still costs real time: typical DD pull is 20–60 s wall-clock. 3,123 wsids × 30 s avg ÷ 4 drains = ~6.5 hours wall-clock for a full backfill. Doable but disruptive.
Recommendation: do not run a bulk backfill. Let the cache populate organically — every workshop update bumps time_updated, which triggers a re-parse and now also a manifest build. The missing_manifests field in /api/conflicts and the empty-mod_types graceful-degrade path together mean the feature works on day 1 (empty results for old rows) and improves as authors push updates.
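
For reference, the backfill wall-clock figure follows from simple division. The sketch assumes the four drains run perfectly in parallel with no idle time, which is optimistic.

```python
WSIDS = 3_123   # workshop_meta rows to re-pull
DRAINS = 4      # parallel drain workers


def backfill_hours(seconds_per_pull: float) -> float:
    """Wall-clock hours for a full backfill at a given average pull time."""
    return WSIDS * seconds_per_pull / DRAINS / 3600


for seconds in (20, 30, 60):
    print(f"{seconds} s/pull -> {backfill_hours(seconds):.1f} h")
```

Across the quoted 20–60 s range this gives roughly 4.3 to 13.0 hours, with 6.5 hours at the 30 s average.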

Per-mod manual trigger pattern still works (operator-only):

DELETE FROM mod_parsed WHERE workshop_id='<wsid>';
INSERT INTO download_jobs (workshop_id, status) VALUES ('<wsid>','queued');

4.3 Inline detection at sort time

Rejected. detect_mod_types reads up to ~11 MB per mod from disk (lua/script blobs). With the tempdir destroyed (the actual case), we'd need to re-DD inline — minutes per sort.
All detection runs at parse time in process_one. derive_category and /api/conflicts are pure DB reads.

5. Files touched (summary)

New:

init/09_mod_files.sql — mod_files table, mod_parsed.mod_types, mod_parsed.files_manifest_built
api/diagnostics.py — port of scan_file_conflicts, FileConflict dataclass
api/categorize.py — types_to_category helper

Modified:

worker/worker.py — extend process_one's with block: single-pass walk, manifest + detect_mod_types, upsert rows
worker/worker.py (top-level) — port detect_mod_types from pzmm mods.py:57–145 (sortof-side copy; do not import from pzmm at runtime)
api/mlos_sort.py — add mod_types: List[str] to ModInfo dataclass; add mod_types check in derive_category, after the explicit-category early return and before the patch/lib name regex (see §3.5)
worker/mlos_sort.py — mirror the ModInfo and derive_category change (worker/api dual-edit rule)
api/app.py — _row_to_modinfo reads new mod_types column; _build_result_for_job SELECT list adds mp.mod_types; register POST /api/conflicts

Out of scope (deferred to follow-up plan):

Frontend conflicts panel — /api/conflicts endpoint only, no UI
Integration of pzmm/core/bundle.py (debug bundle export) — read for context, not ported
Backfill orchestration — relying on organic backfill

6. Rollback

Before applying the migration:

# Backup mod_parsed (the only existing table we ALTER)
sudo docker exec -i sortof_db pg_dump -U sortof -d sortof -t mod_parsed \
  > /opt/sortof/backups/mod_parsed-pre-09.sql.$(date +%Y%m%d-%H%M)
ls -la /opt/sortof/backups/ | tail -3

Down SQL (paste into psql to revert the schema half of this plan):

DROP TABLE IF EXISTS mod_files;
ALTER TABLE mod_parsed
  DROP COLUMN IF EXISTS mod_types,
  DROP COLUMN IF EXISTS files_manifest_built;

To revert code, git checkout main and restart services:

sudo systemctl restart sortof-api sortof-drain@1 sortof-drain@2 sortof-drain@3 sortof-drain@4

The migration is additive only (new table + new columns with safe defaults), so the rollback is a clean drop. No data is destroyed in mod_parsed's existing columns.

7. Verification

Migration applies cleanly:

sudo docker exec -i sortof_db psql -U sortof -d sortof < /opt/sortof/init/09_mod_files.sql
sudo docker exec -i sortof_db psql -U sortof -d sortof -c "\d mod_files"
sudo docker exec -i sortof_db psql -U sortof -d sortof -c "\d mod_parsed" | grep -E "mod_types|files_manifest_built"

Compile checks (after every Python edit):

/opt/sortof/api/.venv/bin/python -m py_compile /opt/sortof/api/app.py /opt/sortof/api/mlos_sort.py /opt/sortof/api/diagnostics.py /opt/sortof/api/categorize.py
/opt/sortof/worker/.venv/bin/python -m py_compile /opt/sortof/worker/worker.py /opt/sortof/worker/mlos_sort.py
cd /opt/sortof/api    && .venv/bin/python -c "import app"   && echo OK
cd /opt/sortof/worker && .venv/bin/python -c "import drain" && echo OK

Dual-edit consistency check (worker/api mlos_sort.py lockstep rule):
```
diff /opt/sortof/api/mlos_sort.py /opt/sortof/worker/mlos_sort.py | grep -E "^[<>]" | head -20
```
Logic must match; only comments / docstrings may differ. If any logic line shows up in the diff, fix the lockstep before continuing.
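
Since a textual diff also flags comment and docstring changes, a stricter "logic only" comparison could be built on the ast module. This is an optional sketch, not part of the plan's tooling; logic_fingerprint is a hypothetical helper name.

```python
import ast


def logic_fingerprint(source: str) -> str:
    """Dump the AST with docstrings removed; comments never reach the AST."""
    tree = ast.parse(source)
    holders = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if not isinstance(node, holders):
            continue
        body = node.body
        if (
            body
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        ):
            # Drop the docstring; keep the body non-empty for consistency.
            node.body = body[1:] or [ast.Pass()]
    # ast.dump omits line/column attributes by default, so layout is ignored.
    return ast.dump(tree)
```

Reading both mlos_sort.py copies and comparing their fingerprints for equality would then report a mismatch only when actual logic diverges.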

Restart services:

sudo systemctl restart sortof-api sortof-drain@1 sortof-drain@2 sortof-drain@3 sortof-drain@4
sudo systemctl is-active sortof-api sortof-drain@{1..4}

Force a fresh parse on a known multi-file mod and verify manifest:

sudo docker exec -i sortof_db psql -U sortof -d sortof -c \
  "DELETE FROM mod_parsed WHERE workshop_id='2169435993';
   INSERT INTO download_jobs (workshop_id, status) VALUES ('2169435993','queued');"
sleep 60
sudo docker exec -i sortof_db psql -U sortof -d sortof -c \
  "SELECT mod_id, mod_types, files_manifest_built FROM mod_parsed WHERE workshop_id='2169435993';
   SELECT count(*) AS file_count FROM mod_files WHERE workshop_id='2169435993';"

Expected: files_manifest_built=t, mod_types populated, file_count > 0.

Conflict endpoint smoke:

curl -sS -X POST http://100.114.205.53:8801/api/conflicts \
  -H 'Content-Type: application/json' \
  -d '{"input":"2169435993;2392709985;2487022075"}' | jq .

Expected: {"conflicts": [], "missing_manifests": [<wsids without manifests yet>]}.

Collection-input rejection (Q4):

curl -sS -i -X POST http://100.114.205.53:8801/api/conflicts \
  -H 'Content-Type: application/json' \
  -d '{"input":"https://steamcommunity.com/sharedfiles/filedetails/?id=999999999"}' | head -5

Expected: HTTP 400 with the documented detail message (when the URL is detected as a collection ref).

Category-from-types smoke:
- Find a mod whose Steam tags don't reflect content (e.g. weapon mod tagged only Realistic); /api/sort currently classifies it as code / other / undefined.
- Re-queue it through the new pipeline (delete+insert).
- Re-run /api/sort; confirm category is now weapon.
Graceful-degradation check: confirm a mod with files_manifest_built=false still sorts correctly through the existing cascade (no exceptions, category falls back to current behavior).
