PHI scrubber for DICOM Structured Report (SR) content trees. The piece dcm-anon deliberately does not ship.
dicom-sr-scrubber
Parses and scrubs PHI from DICOM Structured Report (SR) content trees. The piece dcm-anon deliberately does not ship — single pip install, single command, recursive walk over the SR ContentSequence, audit log of every item touched.
pip install dicom-sr-scrubber
dicom-sr-scrub input.dcm output.dcm
Pairs with dcm-anon — the recommended pipeline is:
dcm-anon scrub raw.dcm clean.dcm # top-level tags + nested sequences
dicom-sr-scrub clean.dcm final.dcm # SR content tree
Why this exists
dcm-anon ships PHI scrubbing for top-level DICOM tags and nested
sequences (PatientName, PatientID, AccessionNumber, the standard PS3.15
Basic De-identification Profile set). Its README.md documents the
explicit limitation:
DICOM SR / Structured Report content scanning. Free-text inside SR sequences may contain PHI; we do not parse SR semantics.
That gap is real. DICOM Structured Reports (SOP Classes Basic Text SR,
Enhanced SR, Comprehensive SR, Mammography CAD SR,
Radiation Dose SR, etc.) carry their payload as a recursive tree of
content items under ContentSequence (0040,A730). Each content item
has a ValueType (0040,A040) (TEXT, NUM, CODE, PNAME, DATE, TIME,
UIDREF, COMPOSITE, IMAGE, WAVEFORM, CONTAINER, SCOORD, TCOORD), a
RelationshipType (0040,A010) (CONTAINS, HAS PROPERTIES, HAS OBS
CONTEXT, …), and either a value or another ContentSequence. PHI lives
inside this tree: free-text findings, observer names (PNAME),
acquisition dates (DATE), patient identifiers as text (TEXT).
A naive dcm-anon-style top-level tag scrub leaves all of that intact.
No widely used OSS DICOM tool walks the SR content tree for PHI today.
dcmtk's dsr2html parses but does not scrub. pydicom exposes the
tree but ships no PHI-aware walker. gdcm's anonymizer skips SR
semantics. CTP (Clinical Trial Processor) can be scripted but requires
hand-written profiles per institution.
dicom-sr-scrubber is the missing walker.
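To make the recursive structure concrete, here is a minimal sketch of a depth-first walk over `ContentSequence`. It is not the package's own walker: the function name `walk_content_tree` and the `root/0/1` path format are assumptions for illustration, and the tree is modelled with plain dicts keyed by DICOM keyword so the example runs without any dependency. pydicom's `Dataset.get` also accepts keywords, so the same function is expected to work on a parsed dataset, but that is not exercised here.

```python
from typing import Any, Iterator, Mapping


def walk_content_tree(
    item: Mapping[str, Any], path: str = "root"
) -> Iterator[tuple[str, Mapping[str, Any]]]:
    """Yield (path, content_item) for an item and all descendants, depth-first."""
    yield path, item
    # The SR root and CONTAINER items hold their children in ContentSequence.
    for index, child in enumerate(item.get("ContentSequence", []) or []):
        yield from walk_content_tree(child, f"{path}/{index}")


# Toy SR tree: plain dicts stand in for a parsed dataset (assumption).
report = {
    "ValueType": "CONTAINER",
    "ContentSequence": [
        {
            "ValueType": "PNAME",
            "RelationshipType": "HAS OBS CONTEXT",
            "PersonName": "Doe^Jane",
        },
        {
            "ValueType": "CONTAINER",
            "RelationshipType": "CONTAINS",
            "ContentSequence": [
                {
                    "ValueType": "TEXT",
                    "RelationshipType": "CONTAINS",
                    "TextValue": "No acute findings.",
                },
            ],
        },
    ],
}

visited = [(path, item["ValueType"]) for path, item in walk_content_tree(report)]
for path, value_type in visited:
    print(path, value_type)
```

A top-level tag scrub only ever sees the root of this structure; the `PNAME` and `TEXT` items two levels down are reached only by recursing.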
What it does
- `pip install dicom-sr-scrubber`: pure Python, single dependency (`pydicom>=2.4`).
- `dicom-sr-scrub scrub input.dcm output.dcm`: recursively walks `ContentSequence`, applies per-`ValueType` PHI rules, writes a new DICOM file with the SR content tree cleaned, leaves all non-SR pixel/metadata untouched.
- `dicom-sr-scrub verify output.dcm`: re-parses the scrubbed file and reports whether any PHI pattern survived in the SR content tree. Exit `0` = clean, exit `1` = residual PHI.
- Every scrub run emits an audit log (JSON) listing every content item visited, its tree path, its `ValueType`, the rule that fired (or `PASS`), and the action taken (`REDACT`, `GENERALIZE_DATE_YEAR`, `STRIP`, `KEEP`). CI-friendly: pipe to `jq`, fail builds on surprises.
Per-ValueType rules (v0.1)
| ValueType | Default rule | Rationale |
|---|---|---|
| `TEXT` | Pattern-match for PHI tokens (names, MRNs, free-form dates, phone, email). Redact span; replace with `[REDACTED]`. | Free-text is the highest-risk surface in SR. |
| `PNAME` | Always replace with `Anonymous^Anonymous^^^`. | A PNAME is a person name by definition. |
| `DATE` | Generalize to year-only (`YYYY0101`). Configurable: `--date-policy={year,strip,keep}`. | HIPAA Safe Harbor permits year for non-elderly subjects; year-only is the common research-grade choice. |
| `TIME` | Strip (`000000.000000`). | Time-of-day is rarely scientifically necessary; high re-identification risk when combined with date. |
| `CODE` | Keep (coded values are dictionary entries, not PHI). | SNOMED CT / LOINC / RadLex codes are public. |
| `NUM` | Keep (measurement values are not PHI). | Body temperature 37.0 is not identifying. |
| `UIDREF` | Replace with deterministic hash-derived UID (same input → same output across runs in the same session). | Preserves referential integrity inside the report; breaks linkability to the source archive. |
| `COMPOSITE` | Strip the `SOPInstanceUID` reference (set to placeholder UID). | A reference to the source image series can leak the patient through the receiving PACS. |
| `IMAGE` / `WAVEFORM` / `SCOORD` / `TCOORD` | Keep coordinate / reference fields, strip embedded annotation text if any. | Geometry is not PHI; text overlays may be. |
| `CONTAINER` | Recurse into child `ContentSequence`. | Containers are structural, not data. |
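As a rough illustration of the `PNAME`, `DATE`, and `TIME` rows in the table above, the sketch below maps a value type and raw value to a replacement value plus an action label. The function names are hypothetical. Pairing `REDACT` with `PNAME`, and treating the `strip` date policy as "empty value", are assumptions; the table does not spell either out.

```python
def generalize_date(da: str, policy: str = "year") -> str:
    """Apply a --date-policy style choice to a DICOM DA value (YYYYMMDD)."""
    if policy == "keep":
        return da
    if policy == "strip":
        # Assumption: "strip" empties the value; the README does not define it.
        return ""
    if policy == "year":
        # Keep the year, pin month and day to January 1st (YYYY0101).
        return da[:4] + "0101"
    raise ValueError(f"unknown date policy: {policy!r}")


def apply_default_rule(value_type: str, value: str) -> tuple[str, str]:
    """Return (new_value, action) for the simple, non-recursive value types."""
    if value_type == "PNAME":
        # Action label for PNAME is an assumption.
        return "Anonymous^Anonymous^^^", "REDACT"
    if value_type == "DATE":
        return generalize_date(value), "GENERALIZE_DATE_YEAR"
    if value_type == "TIME":
        return "000000.000000", "STRIP"
    # CODE and NUM are kept as-is per the table; other types need their own logic.
    return value, "KEEP"


print(apply_default_rule("DATE", "19870423"))
print(apply_default_rule("TIME", "142233.120000"))
print(apply_default_rule("PNAME", "Doe^Jane"))
```

Returning the action alongside the new value keeps the audit entry and the rewrite in one place, so the two cannot drift apart.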
Rules are pluggable — drop a Python module implementing the
PhiRule protocol in ~/.config/dicom-sr-scrubber/rules.d/ and it is
loaded at startup.
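The `PhiRule` protocol's actual signature is not documented in this README, so the following is only a sketch of the kind of logic a `TEXT` rule might contain, written as a plain function rather than as a conforming plug-in. The pattern set, the `redact_text` name, and the `NAME` trigger label are assumptions, and the two regular expressions are deliberately simple rather than production-grade.

```python
import re

# Illustrative patterns only; a real rule set would need many more.
PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact_text(
    text: str, extra_tokens: tuple[str, ...] = ()
) -> tuple[str, list[str]]:
    """Replace matched PHI spans with [REDACTED]; return the text and triggers."""
    triggers: list[str] = []
    for label, pattern in PHI_PATTERNS.items():
        text, count = pattern.subn("[REDACTED]", text)
        if count:
            triggers.append(label)
    # Institution-specific name tokens, matched as whole words, case-insensitively.
    for token in extra_tokens:
        token_pattern = re.compile(rf"\b{re.escape(token)}\b", re.IGNORECASE)
        text, count = token_pattern.subn("[REDACTED]", text)
        if count and "NAME" not in triggers:
            triggers.append("NAME")
    return text, triggers


scrubbed, fired = redact_text(
    "Patient (SSN: 123-45-6789), mail jane@example.org, seen by DrSmith.",
    extra_tokens=("DrSmith",),
)
print(scrubbed)
print(fired)
```

Pattern matching of this kind has no notion of context, which is why names that are not on a token list can survive.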
What it does NOT do
- Not a replacement for `dcm-anon`. It only touches the SR content tree. Run `dcm-anon` first to scrub top-level tags and nested non-SR sequences.
- No semantic understanding of the report. It does not "read the finding"; it pattern-matches PHI tokens. False negatives are possible on adversarial free-text (e.g., a name spelled phonetically). The audit log makes residual review tractable.
- No DICOM network transport. This is a file-in, file-out CLI. Pair with `dcmtk`'s `storescu` / `storescp` for transport.
- No re-identification. UID remapping is per-session only; the map is not persisted unless you pass `--uid-map-out path.json`.
Differentiation
| Tool | SR content-tree walker | PHI-aware | OSS | Bundled rules |
|---|---|---|---|---|
| `dcm-anon` | no (documented gap) | yes | yes | yes |
| `dcmtk dsr2html` | yes | no (read-only) | yes | n/a |
| `dcmtk dcmodify` | no | partial | yes | manual |
| `pydicom` | tree access only | no | yes | n/a |
| `gdcm` anonymizer | no | partial | yes | basic |
| CTP (RSNA) | scriptable | yes | yes | per-institution scripts |
| dicom-sr-scrubber | yes | yes | yes | yes (per-ValueType) |
Pricing
- CLI: MIT, free, forever.
- Hosted add-on on the `dcm-anon` Phase 2 plan: €19–29/mo, bundles SR scrubbing into the same hosted batch pipeline (drop a DICOM folder, get a scrubbed folder plus audit log back). Stripe billing once the demand signal justifies it.
Roadmap
- v0.1 (this release): Walker, per-`ValueType` rules, CLI, audit log, `verify` subcommand, synthetic fixture tests.
- v0.2: Configurable rule plug-ins, structured-error JSON identical to `dcm-anon`'s.
- v0.3: Optional LLM-backed free-text PHI detection for `TEXT` items (opt-in, local model only, no cloud).
- v1.0: Stable rule-protocol API; semver guarantees.
Audience
- Radiology research groups submitting de-identified SR cohorts to IRB / ethics committees.
- Hospital IT departments running PACS pipelines that need defensible SR-level scrubbing for secondary use.
- IRB / ethics submitters needing an audit log per study.
- Distribution: `r/medicalimaging`, `r/healthIT`, dev.to, `awesome-dicom` lists, the `dcm-anon` user channel (cross-promo).
Install
pip install dicom-sr-scrubber # PyPI (once published)
# or from source:
pip install git+https://github.com/plusultra/dicom-sr-scrubber.git
Requirements: Python 3.10+, pydicom>=2.4, pyyaml>=6.0, pydantic>=2.0.
Quickstart
# Default profile: redact only TEXT items that match PHI patterns
dicom-sr-scrubber --input study_sr/ --out clean_sr/
# Conservative: redact ALL text items unconditionally + strip COMPOSITE refs
dicom-sr-scrubber --input study_sr/ --out clean_sr/ --profile conservative
# Dry-run: emit audit only, no DICOM files written
dicom-sr-scrubber --input study_sr/ --out audit_only/ --dry-run
# Add institution-specific name tokens to the blacklist
dicom-sr-scrubber --input study_sr/ --out clean_sr/ \
--blacklist "DrSmith,JohnDoe,ClinicA"
CLI reference
dicom-sr-scrubber [options]
--input PATH[,PATH...] DICOM SR file(s) or directory (recursive .dcm scan)
--out DIR Output directory for scrubbed files + manifests
--profile {default,conservative}
Scrubbing aggressiveness (default: default)
--dry-run Emit audit only; do not write scrubbed files
--continue-on-error Log per-file errors; do not abort the batch
--uid-salt STRING Per-project UID/PNAME hash salt (default: "dicom-sr-scrubber-v1")
--blacklist TOKEN,... Additional name tokens to treat as PHI in TEXT fields
--version Show version and exit
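The README does not specify how `--uid-salt` feeds into UID remapping. One common way to get deterministic, salted replacement UIDs is sketched below. It is an assumption, not the tool's documented derivation: hash the salt and the original UID with SHA-256, then express the first 128 bits as a decimal under the `2.25.` UUID-derived root so the result stays within DICOM's 64-character UID limit. The name `remap_uid` and the example UID are made up.

```python
import hashlib


def remap_uid(uid: str, salt: str = "dicom-sr-scrubber-v1") -> str:
    """Derive a replacement UID that is stable for a given (salt, uid) pair."""
    digest = hashlib.sha256(f"{salt}:{uid}".encode("utf-8")).digest()
    # 128 bits as a decimal integer has at most 39 digits, so the UID has
    # at most 44 characters including the "2.25." prefix.
    return "2.25." + str(int.from_bytes(digest[:16], "big"))


original = "1.2.840.99999.1.2.3"  # made-up UID for illustration
same_project = remap_uid(original)
other_project = remap_uid(original, salt="another-project")

print(same_project)
print(same_project == remap_uid(original))  # same salt gives the same output
print(same_project == other_project)        # different salt gives a different UID
```

Under this scheme, two references to the same instance inside one report still point at the same replacement UID, while anyone without the salt cannot recompute the mapping from the original UID alone.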
Output files
After each run, --out contains:
| File | Description |
|---|---|
| `*.dcm` | Scrubbed DICOM SR objects (unless `--dry-run`) |
| `sr_evidence.json` | Machine-readable audit manifest (one entry per content item) |
| `sr_evidence.md` | Human-readable rendering for IRB / DPIA documentation |
| `audit.sha256` | SHA-256 chain over inputs + outputs + manifest |
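The exact layout of `audit.sha256` is not described here. As a generic illustration of what a SHA-256 chain over several artifacts can look like, the sketch below folds each artifact's name and content hash into a running digest, so that changing, removing, or reordering any artifact changes the final value. The function name, the all-zero starting value, and the byte layout are assumptions.

```python
import hashlib


def chain_digest(artifacts: list[tuple[str, bytes]]) -> str:
    """Fold (name, content) pairs into one running SHA-256 digest, in order."""
    running = bytes(32)  # assumed starting value: 32 zero bytes
    for name, payload in artifacts:
        content_hash = hashlib.sha256(payload).digest()
        # Each link commits to the previous link, the name, and the content.
        link = running + name.encode("utf-8") + b"\x00" + content_hash
        running = hashlib.sha256(link).digest()
    return running.hex()


# In-memory stand-ins for an input file, an output file, and the manifest.
run = [
    ("input/study_sr.dcm", b"raw bytes"),
    ("output/study_sr.dcm", b"scrubbed bytes"),
    ("sr_evidence.json", b'{"total_items": 42}'),
]
baseline = chain_digest(run)
tampered = chain_digest(run[:2] + [("sr_evidence.json", b'{"total_items": 41}')])

print(baseline)
print(baseline == tampered)
```

A reviewer who recomputes the chain from the delivered files and gets the recorded value has some assurance that the manifest describes exactly those files.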
sr_evidence.json format (excerpt)
{
  "tool": "dicom-sr-scrubber",
  "tool_version": "0.1.0",
  "profile": "default",
  "total_items": 42,
  "redacted_items": 7,
  "entries": [
    {
      "file": "study_sr.dcm",
      "content_item_path": "root/0",
      "value_type": "TEXT",
      "action": "REDACT",
      "trigger": "SSN,EMAIL",
      "source_clause_citation": "DICOM PS3.3 C.17.3.3.5; HIPAA Safe Harbor identifiers 1,3-8 (45 CFR 164.514(b)(2)); GDPR Art. 4(1)",
      "original_snippet": "Patient John Doe (SSN: 123-45-6789)...",
      "scrubbed_snippet": "Patient John Doe ([REDACTED:SSN])..."
    }
  ]
}
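A manifest in this shape lends itself to a simple CI gate. The sketch below assumes only the fields shown in the excerpt (`entries`, `action`, `content_item_path`) and the four action names listed earlier in this README; the function name and the sample data are made up. It flags any entry whose action is not one of the expected values.

```python
import json
from collections import Counter

# Action names as listed in this README's audit-log description.
ALLOWED_ACTIONS = {"REDACT", "GENERALIZE_DATE_YEAR", "STRIP", "KEEP"}


def unexpected_entries(manifest: dict) -> list[dict]:
    """Return manifest entries whose action is not one of the known actions."""
    return [
        entry
        for entry in manifest.get("entries", [])
        if entry.get("action") not in ALLOWED_ACTIONS
    ]


# Small made-up manifest; in practice this would be read from sr_evidence.json.
manifest = json.loads(
    """
    {
      "tool": "dicom-sr-scrubber",
      "entries": [
        {"content_item_path": "root/0", "value_type": "TEXT", "action": "REDACT"},
        {"content_item_path": "root/1", "value_type": "NUM", "action": "KEEP"},
        {"content_item_path": "root/2", "value_type": "DATE", "action": "SKIPPED"}
      ]
    }
    """
)

surprises = unexpected_entries(manifest)
action_counts = Counter(entry["action"] for entry in manifest["entries"])
print(action_counts)
print([entry["content_item_path"] for entry in surprises])
```

In a build script, a non-empty result would typically translate into a non-zero exit code so the pipeline stops before the files are released.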
Citation for IRB / DPIA submissions
dicom-sr-scrubber v0.1.0 (2026). PHI scrubber for DICOM Structured Report
content trees. Implements HIPAA Safe Harbor (45 CFR 164.514(b)(2)) 18-identifier
redaction and GDPR Art. 35 audit documentation for DICOM SR SOP Classes.
https://github.com/plusultra/dicom-sr-scrubber
Regulatory citation coverage
| Regulation | Coverage |
|---|---|
| DICOM PS3.3 (2024c) | Per-tag and per-ValueType citations (ContentSequence, TextValue, PersonName, Date, Time, UID, Composite) |
| HIPAA Safe Harbor (45 CFR 164.514(b)(2)) | All 18 identifiers verbatim |
| GDPR | Art. 4(1), Art. 9(1), Art. 35, Recital 26 |
Full citation map: docs/sr-citation-map.md.
Known gaps and out-of-scope items (honest)
- Enhanced SR multi-frame: ACQUISITION CONTEXT sequences and WAVEFORM items with embedded annotation text are not scanned for PHI content in v0.1. Geometry coordinates are preserved; text overlays in IMAGE/WAVEFORM items are not inspected.
- NER not bundled: The NER detection layer is a plug-in interface only. No NLP model ships with this package (opt-in; see `phi_detect.py` `NerHook` type for the integration contract).
- UID session scope: UID remapping is per-run; the map is not persisted between runs unless the caller manages `--uid-salt` consistency.
- Not a complete anonymiser: must be combined with a top-level tag scrubber (e.g. `dcm-anon`) for full DICOM anonymisation.
- Age-check for >89 rule: HIPAA Safe Harbor requires year suppression for subjects aged >89; this tool generalises all dates to year-only but does not implement the age-computation check.
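Since the age check is left to the caller, a pre-filter along the following lines could run before scrubbing. This is a sketch under stated assumptions, not part of the package: the function names are hypothetical, and it presumes the caller has access to the subject's birth date and the date of the event being reported.

```python
from datetime import date


def age_on(birth: date, on: date) -> int:
    """Age in completed years on the given date."""
    years = on.year - birth.year
    # Subtract one if the birthday has not yet occurred in the target year.
    if (on.month, on.day) < (birth.month, birth.day):
        years -= 1
    return years


def year_may_be_kept(birth: date, event: date) -> bool:
    """True if the subject was 89 or younger at the event, so the year can stay."""
    return age_on(birth, event) <= 89


birth = date(1930, 6, 15)
print(year_may_be_kept(birth, date(2020, 6, 14)))  # day before the 90th birthday
print(year_may_be_kept(birth, date(2020, 6, 15)))  # on the 90th birthday
```

Files for which the check returns `False` would need the stricter `--date-policy=strip` setting, or separate handling, rather than year-only generalisation.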
Contributing
Open an issue with a real SR (already anonymised, please) in which the
scrubber either missed PHI or over-redacted. PRs are welcome,
especially for additional ValueType rules and locale-specific PHI
patterns (Spanish DNI, French INS, German Versichertennummer).
License
MIT. See LICENSE.