Files
ik-codex/docs/superpowers/plans/2026-05-01-redactor.md
indifferentketchup 409de16003 docs: add Redactor implementation plan
Forward-looking plan on the redactor branch covering all eight design
questions called out in the careful-protocol kickoff: render-time
filter (raw is canonical), standalone string utility (not a Printer
decorator), regex-based detection with lexical anchors per PII
category, per-category placeholder replacement matching synthetic
fixture conventions, thin generic interface plus per-game
implementation under src/Util/ProjectZomboid/, hybrid fixture strategy
(unit-level synthetic plus integration against existing PZ fixtures).

Branch off master aec835e. backup/pre-redactor tag pins start.
No code is written by this commit. Implementation pass kicks off
separately after plan review.
2026-05-01 14:28:44 +00:00

212 lines
20 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Redactor Utility Implementation Plan
> Forward-looking. No code is written by this document.
> Branch: `redactor` (off master `aec835e`). Backup tag: `backup/pre-redactor`.
> Spec: `docs/superpowers/specs/2026-04-30-redactor-design.md`.
**Goal:** Land the `RedactorInterface` plus a concrete `ProjectZomboidRedactor` implementation so iblogs (and any other downstream consumer) can scrub Project Zomboid log content of Steam IDs, player names, and world coordinates with a single call. The Redactor is a render-time filter on raw string content; raw stays canonical at the storage layer.
**Architecture:** Standalone string-in/string-out utility under a new top-level `src/Util/` directory, with per-game implementations under `src/Util/<Game>/`. Each implementation owns the lexical regex anchors for its game's PII shapes. Three independent toggles per implementation (`redactSteamIds`, `redactPlayerNames`, `redactCoordinates`); defaults all on; "all toggles off" yields verbatim passthrough.
**Tech stack:** PHP 8.4+, PHPUnit 12, Composer (`indifferentketchup/codex` v0.1.0+). All command invocations wrap in the `composer:latest` Docker image per `CLAUDE.md`.
---
## Design questions — resolved
### a. Render-time vs ingest-time
**Decision: render-time. Confirm spec's lean.**
Raw log content is canonical. Redaction is a view filter that consumers apply when they want to display, export, or analyse a redacted projection. iblogs's storage layer holds the unredacted upload (subject to iblogs's own upload-time `Filter` chain for IPs/access-tokens, which is a different layer of defence); the codex Redactor runs on the way *out* of storage, not on the way in.
**Why:** the alternative (ingest-time, where storage holds redacted content) is destructive — once stored, the original cannot be recovered for legitimate operator use. Render-time leaves the original in place and lets each render path opt in. iblogs gets a per-session toggle without needing to keep two copies of every paste.
**Implication for iblogs schema:** iblogs stores raw content; the redaction toggle in the iblogs UI invokes `ProjectZomboidRedactor::redact()` at render time (server-side) or at fetch time (API consumers' choice). No schema migration required for the redaction feature.
### b. Redactor as standalone class vs Printer decorator
**Decision: standalone utility (option iii from the question).**
The Redactor is a `string → string` function. It does not know about `Insight`, `Printer`, or any other codex type. Three options were considered:
- **(i) Printer wrapper.** Cleanly composable but ties the Redactor to the Printer abstraction. Doesn't help iblogs's most common case: redacting raw log content for display in a non-Printer rendering path (HTML page rendered server-side, raw download served to API client).
- **(ii) Pre-Printer pass on Insights.** Heavy. Insights are typed objects with structured fields; redacting them means per-Insight code that knows which fields are PII-bearing. Against the YAGNI line for v1.
- **(iii) Standalone string utility.** Simple, generic, works on any string input — raw log content, JSON-serialised analysis output, rendered Printer output piped through. Doesn't know about Insights.
The spec describes (iii). v1 ships (iii) only. If a Printer-wrapper convenience is later wanted, it can be added as a thin adapter that calls the standalone Redactor on the Printer's output; it doesn't require restructuring the core.
### c. PII field taxonomy for PZ
**Decision: regex-based with lexical context anchors. No structured-field detection in v1.**
PZ-specific PII categories observed in the in-tree fixtures and the `.scratch/pz/Logs/` reference corpus:
| Field | Detection | Rationale |
|---|---|---|
| Steam ID | regex with `76561198\d{9}` prefix anchor and word-boundary classes | Steam's `76561198` SteamID64 universe prefix lets us cleanly distinguish from other long numbers (timestamps, build numbers). |
| Player name | regex with multi-context lexical anchors (after-Steam-ID-quoted, ChatMessage author, `Combat:`/`Safety:` subsystem) | Names are arbitrary strings — not detectable without context. The contexts are well-defined by the parser-side pattern classes. |
| World coordinate triple | regex with bracket / paren / `at`-clause anchors | Generic `\d+,\d+,\d+` would over-redact server metadata (`f:0, t:NNNN, st:48,648,157,584`). Lexical context disambiguates. |
**Not redacted in v1:**
- **IP addresses.** PZ logs do not normally include IPs in any of the eleven file types observed. iblogs's upload-side `IPv4Filter` / `IPv6Filter` (ported from upstream mclogs) covers the rare case where a mod might log them.
- **Server-side usernames distinct from player names.** PZ uses Steam display name as the player identity; there's no separate auth username layer. Mclogs's `UsernameFilter` is Minecraft-specific and isn't mirrored here.
- **BurdJournals scientific-notation Steam IDs** (`7.65611…E16`). Spec open-question 2 explicitly defers this to v2; the `[BurdJournals]` tag already disambiguates them as mod-internal.
**Hybrid (regex + structured-field) deferred.** A v2 enhancement could redact specific Insight fields at JSON-serialisation time (e.g. `ConnectionFailureProblem::$steamId` → placeholder when serialised). Useful only if iblogs starts shipping the structured analysis JSON to redacted views — a real but currently hypothetical need.
### d. Replacement strategy
**Decision: per-category placeholder strings matching the synthetic-fixture conventions. Configurable replacement style is YAGNI for v1.**
Per the spec:
| Category | Replacement |
|---|---|
| Steam ID | `76561198000000000` (zeroed placeholder, still a syntactically valid Steam ID) |
| Player name | `<player>` |
| Coordinates | `0,0,0` (with shape preserved per anchor — bracketed, parenthesised, or `at` clause) |
Why these specifically and not `[REDACTED]` / `[STEAM_ID]` / hashed:
- The placeholders **match the existing synthetic test fixtures** (`76561198000000001``76561198000000004` collapse to `76561198000000000`; player names `Player1`/`Player2`/`AdminUser` collapse to `<player>`). Tests can verify "redacted output looks like a synthetic fixture."
- Shape preservation means downstream consumers can still parse the redacted output with the same Pattern classes — a redacted log is still a syntactically valid PZ log, it just contains no identities.
- Type-tagged replacements (`[STEAM_ID]`) break shape preservation: a Pattern looking for `\d{17}` would fail. Worth offering as a config option if a consumer specifically wants type-visibility, but v1 ships placeholder-only.
- Hashing breaks shape preservation similarly and adds determinism / collision concerns.
If a consumer later needs `[STEAM_ID]`-style output, a `setReplacementStyle('typed' | 'placeholder' | 'redacted')` setter can be added without breaking the v1 API. v1 ships placeholder-only.
### e. Game-agnostic vs PZ-specific layout
**Decision: thin generic interface in `src/Util/` plus PZ-specific implementation in `src/Util/ProjectZomboid/`.**
```
src/Util/
├── RedactorInterface.php (1 method: redact(string): string)
└── ProjectZomboid/
└── ProjectZomboidRedactor.php (toggles + regex passes)
```
**YAGNI tradeoff stated:** the interface has one method and currently one implementation. Strictly, YAGNI says collapse to just `ProjectZomboidRedactor` and skip the interface. The interface earns its keep because **iblogs's call sites will type-hint against `RedactorInterface`**, not the concrete class — that's the architectural payoff. Consumer code stays loosely coupled; when Minecraft or another game ships a redactor, iblogs swaps the implementation by changing one DI binding rather than touching call sites.
The cost is two files instead of one. Acceptable given the dependency-inversion benefit. The directory layout (`src/Util/<Game>/`) mirrors the components-outer-with-game-suffix convention used everywhere else in the tree (Analyser, Analysis, Detective, Log, Parser, Pattern).
**Note on the new `src/Util/` directory.** Codex currently has no `src/Util/` (the Phase A scaffolding established Analyser / Analysis / Detective / Log / Parser / Pattern / Printer; Phase B.3 added Analyser/ProjectZomboid content but not Util). The Redactor introduces this new top-level. This is an additive change — no existing code is modified.
### f. Test strategy
**Decision: hybrid — small dedicated synthetic fixtures under `test/src/Util/Redactor/` for direct unit tests, plus an integration test that runs the Redactor over an existing PZ fixture and asserts idempotence.**
**Dedicated unit fixtures** (small string constants in test classes, not separate files): per spec test plan #1#5. Each test class owns its input/expected pairs. Keeps unit tests self-contained and fast.
**Integration test** that re-uses an existing PZ fixture (e.g. `test/src/Games/ProjectZomboid/fixtures/admin-minimal.txt`). Two assertions:
- The Redactor's output is a syntactically valid log (still parses cleanly through the corresponding `ProjectZomboidAdminLog`).
- Idempotence: `redact(redact($x)) === redact($x)`. Existing fixture content is already placeholder-shaped, so the redactor should leave it byte-for-byte identical OR apply the canonical normalisation once and then no-op.
**False-positive avoidance.** The synthetic fixtures use `76561198000000001` etc. as placeholder Steam IDs. The Redactor's Steam ID regex matches the `76561198\d{9}` prefix and replaces with `76561198000000000` — so `76561198000000001` becomes `76561198000000000` (a normalisation, not a corruption). Tests verify this normalisation is correct and that legitimate-non-PII data (e.g. server metadata triples like `f:0, t:1776297642406, st:48,648,157,584`) is **not** touched.
---
## Tasks
Tasks are intended for the `redactor` branch. Each is a single logical commit. Test-running between commits uses the standard Docker invocation. Work proceeds only after Step 0 sign-off (this plan reviewed).
### Task 0 — Plan doc commit
- [ ] **Step 0.1.** Already done out-of-band: `git checkout -b redactor` off master `aec835e`; `git tag backup/pre-redactor` at branch tip; this plan written.
- [ ] **Step 0.2.** Commit this plan: `docs: add Redactor implementation plan` on branch `redactor`. Push branch to origin for review.
### Task 1 — Scaffold (interface + skeleton class with toggles)
- [ ] **Step 1.1.** Create `src/Util/RedactorInterface.php`. Single method: `public function redact(string $content): string;` PHPDoc describing the contract: stateless from the caller's perspective; configuration happens via implementation-specific setters before `redact()`.
- [ ] **Step 1.2.** Create `src/Util/ProjectZomboid/ProjectZomboidRedactor.php` that implements the interface. Class structure: three private bool properties (`$redactSteamIds`, `$redactPlayerNames`, `$redactCoordinates`) all defaulting to `true`; three fluent setters (`redactSteamIds(bool): static`, etc.); `redact(string): string` body that returns input unchanged when all toggles are off (for now — regex passes added in subsequent tasks).
- [ ] **Step 1.3.** Run `composer test` — expect 195 tests still green (no Redactor tests yet).
- [ ] **Step 1.4.** Commit: `feat: scaffold RedactorInterface and ProjectZomboidRedactor with toggles`.
### Task 2 — Steam ID redaction pass
- [ ] **Step 2.1.** Add `STEAM_ID_REGEX` and `STEAM_ID_REPLACEMENT` constants on `ProjectZomboidRedactor`. Regex uses the `76561198\d{9}` prefix anchor with word-boundary classes (per spec). The `/u` flag is added to all regexes for Unicode safety even though Steam IDs themselves are ASCII.
- [ ] **Step 2.2.** Implement the Steam ID branch of `redact()`: when `$redactSteamIds` is true, run `preg_replace` against the input.
- [ ] **Step 2.3.** Create `test/tests/Util/Redactor/ProjectZomboidRedactorSteamIdTest.php`. Tests: redaction of various distinct synthetic Steam IDs collapses all to `76561198000000000`; non-Steam-ID 17-digit numbers (e.g. timestamps) are not touched; toggle-off leaves Steam IDs intact.
- [ ] **Step 2.4.** Run `composer test`. Expect new tests pass; old 195 unaffected.
- [ ] **Step 2.5.** Commit: `feat: add Steam ID redaction pass`.
### Task 3 — Player name redaction pass
- [ ] **Step 3.1.** Add three regex constants on `ProjectZomboidRedactor` for the three player-name lexical contexts: `PLAYER_AFTER_STEAMID_REGEX`, `PLAYER_IN_CHATMESSAGE_REGEX`, `PLAYER_IN_PVP_SUBSYSTEM_REGEX`. Replacement is `<player>` for all. **Order constraint:** the after-Steam-ID context anchors on the post-redaction Steam ID `76561198000000000`, so the player-name pass must run *after* the Steam ID pass. Document this in a class-level docblock.
- [ ] **Step 3.2.** Implement the player-name branch of `redact()`: three sequential `preg_replace` calls when `$redactPlayerNames` is true.
- [ ] **Step 3.3.** Create `test/tests/Util/Redactor/ProjectZomboidRedactorPlayerNameTest.php`. Tests: each of the three contexts redacts correctly when paired with its anchor; a bare quoted string (e.g. `"foo"` not preceded by a Steam ID) is **not** touched; toggle-off leaves names intact; the after-Steam-ID context works correctly when the Steam ID has already been redacted to the zeroed placeholder.
- [ ] **Step 3.4.** Run `composer test`. Expect new tests pass.
- [ ] **Step 3.5.** Commit: `feat: add player name redaction pass`.
### Task 4 — Coordinates redaction pass
- [ ] **Step 4.1.** Add three regex constants on `ProjectZomboidRedactor` for the three coordinate contexts: `COORDS_AT_CLAUSE_REGEX`, `COORDS_BRACKETED_REGEX`, `COORDS_PARENTHESISED_REGEX`. Replacements preserve shape (`0,0,0` inside whatever bracket/paren wrapper).
- [ ] **Step 4.2.** Implement the coords branch of `redact()`: three sequential `preg_replace_callback` (or `preg_replace`) calls when `$redactCoordinates` is true.
- [ ] **Step 4.3.** Create `test/tests/Util/Redactor/ProjectZomboidRedactorCoordinatesTest.php`. Tests: each of the three contexts redacts correctly; **negative test** — server metadata `f:0, t:1776297642406, st:48,648,157,584` is not touched; basement Z-coordinates (`-1`) are handled; toggle-off leaves coords intact.
- [ ] **Step 4.4.** Run `composer test`. Expect new tests pass.
- [ ] **Step 4.5.** Commit: `feat: add coordinates redaction pass`.
### Task 5 — Combined / toggle / idempotence tests
- [ ] **Step 5.1.** Create `test/tests/Util/Redactor/ProjectZomboidRedactorCombinedTest.php`. Tests cover: combined input with all three PII categories present produces fully-scrubbed output when all toggles on; each toggle off in isolation produces partial scrubbing matching the toggle's category; all toggles off returns input byte-for-byte identical (`===` equality).
- [ ] **Step 5.2.** Create `test/tests/Util/Redactor/ProjectZomboidRedactorIdempotenceTest.php`. Tests: `redact(redact($x)) === redact($x)` for several input shapes including all three PII categories.
- [ ] **Step 5.3.** Run `composer test`. Expect new tests pass.
- [ ] **Step 5.4.** Commit: `test: add Redactor combined and idempotence coverage`.
### Task 6 — Existing-fixture integration tests
- [ ] **Step 6.1.** Create `test/tests/Util/Redactor/ProjectZomboidRedactorIntegrationTest.php`. Loads each existing PZ fixture (`admin-minimal.txt`, `chat-minimal.txt`, etc.) via `PathLogFile`, calls `redact()` on the content, and asserts: (a) the redacted content still parses cleanly through the corresponding `ProjectZomboid<X>Log`'s parser without throwing; (b) the synthetic Steam IDs `76561198000000001``76561198000000004` all collapse to `76561198000000000`; (c) the synthetic player names (`Player1`, `Player2`, `AdminUser`, `PlayerSuspect`) all collapse to `<player>`.
- [ ] **Step 6.2.** Run `composer test`. Expect all integration assertions pass without modifying any existing test or fixture.
- [ ] **Step 6.3.** Commit: `test: add Redactor integration coverage against existing PZ fixtures`.
### Task 7 — Documentation updates
- [ ] **Step 7.1.** Update `CLAUDE.md`: add a one-line `src/Util/` mention to the framework architecture section; one-line note in the ProjectZomboid specifics section pointing at `ProjectZomboidRedactor` for downstream PII scrubbing; update the "Scaffolded games" line to mention that `ProjectZomboid` now also has a Redactor implementation under `src/Util/ProjectZomboid/`.
- [ ] **Step 7.2.** Update `README.md`: add a short usage block showing `(new ProjectZomboidRedactor())->redact($logContent)` as a render-time scrub option, alongside the existing worked example.
- [ ] **Step 7.3.** Update `CHANGELOG.md`: move Redactor out of the **Deferred** section under `[0.1.0]`, OR add a new `[Unreleased]` section if the v0.1.0 line should remain accurate as-shipped. Decision: **add `[Unreleased]`** — v0.1.0 was tagged without the Redactor and the changelog should reflect the historical truth.
- [ ] **Step 7.4.** Run `composer test` once more for safety; confirm 195+(redactor tests) green.
- [ ] **Step 7.5.** Commit: `docs: document Redactor utility in CLAUDE.md, README, CHANGELOG`.
### Task 8 — Final verification
- [ ] **Step 8.1.** Run `composer test`. All tests green.
- [ ] **Step 8.2.** Re-run `vendor/bin/phpunit --display-deprecations --display-warnings --display-notices --display-errors`. Expect zero output beyond the standard pass summary.
- [ ] **Step 8.3.** Sanity-check the branch with `git log --oneline master..redactor`. Should be the plan-doc commit plus 7 implementation commits = 8 commits total.
- [ ] **Step 8.4.** Push final state: `git push origin redactor`. **Do NOT merge to master.** User reviews diff and approves merge separately.
---
## Open questions / spec gaps
The spec is generally tight. Items worth flagging while implementing:
1. **`/u` flag for Unicode safety.** Spec doesn't specify regex flags. PZ player names can contain non-ASCII characters (Steam display names are Unicode-permissive). The implementation will use `/u` on all regexes to avoid mangling multi-byte sequences. Documenting in the class docblock.
2. **Replacement order.** Spec says "Redaction order matters: SIDs first, names second" because the after-Steam-ID player-name regex anchors on the redacted Steam ID. The implementation will enforce this order in `redact()` (Steam ID pass first, then names, then coords). The class docblock will document the ordering invariant.
3. **HTML / JSON-encoded input.** Spec assumes plain log text. If a consumer feeds HTML-escaped content (e.g. `&quot;` instead of `"`), the player-name regex won't match. Document as a v2 concern: callers feed plain text in, render afterwards. v1 does not implement HTML/JSON-aware mode.
4. **Future PII categories.** v1 ships exactly the three toggles per spec. New categories (emails, IPs from mods, etc.) extend the toggle set in a future release; v1 does not pre-build extension points beyond what the interface already provides.
5. **`src/Util/` is a new top-level directory** in this codebase. The Redactor is the first occupant. Future utilities (e.g. a tokenizing variant per spec open-question 1) would also live here. No existing-code modification is needed; the new directory is purely additive.
6. **The empty `src/Printer/<Game>/.gitkeep` situation.** Phase A scaffolding chose not to create `Printer/<Game>/` directories at all (only Analyser/Detective/Log/Parser/Pattern got per-game subdirs). The Redactor's home in `src/Util/<Game>/` mirrors that — `src/Util/` is created with PZ as its first occupant; no stub `Hytale/`/`Minecraft/`/`SevenDaysToDie/` placeholders are scaffolded. When other games' redactors land, they create their own subdirectories at that point.
No spec contradictions found. No existing-code modifications required (additive-only design).
---
## Branch / commit invariants
- All commits land on the `redactor` branch.
- Master is not touched until the user explicitly approves merge after reviewing the diff.
- Conventional commit prefixes: `docs:`, `feat:`, `test:`, `refactor:`. (No `fix:` expected — this is greenfield work.)
- One logical concept per commit. Tasks 1, 2, 3, 4 each ship implementation + per-pass tests in one commit; Task 5 / 6 / 7 are pure-test or pure-docs commits.
- Backup tag `backup/pre-redactor` at `aec835e` lets us discard the branch and recover if the implementation goes sideways.
- Branch can be pushed to origin freely for visibility / review checkpoints.
## Pointers
- Spec: `docs/superpowers/specs/2026-04-30-redactor-design.md`.
- Synthetic fixtures the integration test will reuse: `test/src/Games/ProjectZomboid/fixtures/*.txt`.
- Existing per-game layout precedent: `src/Analyser/ProjectZomboid/`, `src/Pattern/ProjectZomboid/`, `src/Log/ProjectZomboid/`.
- Workflow conventions and pitfalls: `CLAUDE.md`.