DICOM anonymizer (PS3.15 Basic Profile) with regulator-clause-cited compliance manifest and independent output verification

dcm-anon

OSS DICOM anonymizer with a verbatim-cited, machine-verifiable compliance manifest — for research-data sharing under GDPR Art. 35 DPIA and HIPAA Safe Harbor.


Why this exists

GDPR Art. 35 makes a DPIA mandatory for large-scale processing of health data, and best practice — endorsed by the EDPS, the EDPB pseudonymisation guidelines (01/2025), and HHS OCR — is to de-identify at the source site before moving research data off-prem. Doing that defensibly means:

  1. A traceable mapping from each tag you removed to the specific clause that obliges you to remove it.
  2. An independent second opinion that the output does not still contain PHI.
  3. An audit trail that a regulator, IRB, or ethics committee can verify without your help.

Most DICOM anonymizers do step 1 implicitly and skip steps 2 and 3. dcm-anon emits all three as first-class artifacts of every run.

Context. This tool is open-sourced as a research artifact accompanying ongoing work on fairness-aware Software-as-a-Medical-Device (SaMD); see the author's TFG (final-year degree thesis) on inter-hospital fairness in dermatology AI (UPV RiuNet). The compliance-manifest layer was built because cross-hospital data preparation kept tripping on the same legal-traceability gap.


What it does

Implements the DICOM PS3.15 Basic Application Level Confidentiality Profile (Table E.1-1, 2024 edition). The tag table lists 125 explicit tags: the mandatory Basic Profile tags plus retired tags still common in legacy archives. Curve groups (50xx,xxxx) and overlay groups (60xx,xxxx) are handled by range mask rather than enumerated. Five properties:

  1. UID consistency across files. Anonymize a CT study directory and the Study/Series/SOP UIDs remain coherent — slices are still a usable study, not 200 orphan files. file_meta.MediaStorageSOPInstanceUID is remapped to match the dataset-level SOPInstanceUID, so DICOMDIR and WADO-RS references stay intact.

  2. Audit log out-of-the-box. Every modified tag is recorded with its PS3.15 action code (X/Z/U/D), source SHA-256, and UTC timestamp. Drop it in your IRB folder.

  3. Nested PHI in Sequence items is scrubbed. Tags inside RequestAttributesSequence, ReferencedStudySequence, and any other SQ element are recursed into and cleaned — not silently skipped.

  4. Compliance manifest — --manifest-mode [gdpr|hipaa|eu-ai-act]. Emits a tamper-evident JSON + Markdown artifact that maps each PS3.15 action to the specific regulatory clauses it implements (verbatim citations re-verified against EUR-Lex / eCFR / gdpr-info.eu on 2026-05-13). Each regime carries a defensive disclosure tailored to the failure mode regulators actually pursue first: GDPR → Art. 9(2) lawful-basis disclosure (controller establishes it independently); HIPAA → Safe-Harbor-only declaration (does NOT substitute for Expert Determination); EU AI Act → Digital Omnibus enforcement-date context (deferred to 2027-12-02 / 2028-08-02 for SaMD embedded in MDR/IVDR).

  5. Independent output verification — --verify-output. After the run, re-reads the anonymized files using a separate PHI tag list (curated from HIPAA Safe Harbor §164.514(b)(2) + TCIA checklist, NOT derived from the internal table). Result embedded in the manifest, covered by the SHA chain. Defeats the "tool vouches for itself" problem.
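Properties 3 and 5 can be pictured together: one pass recurses into sequence items to scrub, and a second pass checks the result against a tag list that is maintained separately. The sketch below is not the tool's code. It models a dataset as a plain dict keyed by (group, element) instead of a pydicom object, and the names `scrub`, `residuals`, `SCRUB_TAGS`, and `VERIFY_TAGS` are hypothetical, as are the tiny tag lists.

```python
from typing import Any

Tag = tuple[int, int]
# Hypothetical stand-in for a DICOM dataset. An SQ value is a list of Datasets.
Dataset = dict[Tag, Any]

# Illustrative lists only. The point is that they are maintained separately.
SCRUB_TAGS: set[Tag] = {(0x0010, 0x0010), (0x0008, 0x0050)}  # PatientName, AccessionNumber
VERIFY_TAGS: set[Tag] = SCRUB_TAGS | {(0x0010, 0x0030)}       # adds PatientBirthDate


def scrub(ds: Dataset, tags: set[Tag]) -> int:
    """Delete listed tags at every nesting level; return the number removed."""
    removed = 0
    for tag in list(ds):  # copy the keys because we delete while iterating
        value = ds[tag]
        if tag in tags:
            del ds[tag]
            removed += 1
        elif isinstance(value, list):  # sequence: recurse into each item
            for item in value:
                if isinstance(item, dict):
                    removed += scrub(item, tags)
    return removed


def residuals(ds: Dataset, tags: set[Tag], path: tuple[Tag, ...] = ()) -> list[tuple[Tag, ...]]:
    """Return the tag path of every non-empty listed tag still present."""
    found: list[tuple[Tag, ...]] = []
    for tag, value in ds.items():
        here = path + (tag,)
        if tag in tags and value not in ("", None):
            found.append(here)
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    found.extend(residuals(item, tags, here))
    return found


study: Dataset = {
    (0x0010, 0x0010): "DOE^JANE",
    (0x0010, 0x0030): "19700101",
    # RequestAttributesSequence with one item holding a nested AccessionNumber
    (0x0040, 0x0275): [{(0x0008, 0x0050): "ACC123", (0x0040, 0x1001): "RP1"}],
}
removed = scrub(study, SCRUB_TAGS)
leftover = residuals(study, VERIFY_TAGS)
print(removed, leftover)
```

In this toy run the scrubber removes two values, one of them nested, and the verifier still flags the birth date because its list is wider than the scrubber's. That disagreement is exactly what a verifier derived from the scrubber's own table could never report.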

# Single file
python anonymize.py input.dcm out/

# Directory (all *.dcm, preserving subdirectory structure)
python anonymize.py /data/study_0001 /data/anon/study_0001

# Deterministic UIDs — same salt + same source = same output every run
python anonymize.py /data/study_0001 /data/anon/study_0001 --salt cohort-A-2024

# Preview without writing files (audit log still emitted)
python anonymize.py /data/study_0001 /data/anon/study_0001 --dry-run

# Continue past malformed DICOMs; collect them in the audit "errors" list
python anonymize.py /data/study_0001 /data/anon/study_0001 --continue-on-error

# Whitelist tags (use sparingly — kept tags break the strict-profile claim)
python anonymize.py input.dcm out/ --keep 0010,0010 --keep 0008,0090

# Markdown summary alongside the JSON audit log
python anonymize.py /data/study out/ --report-md report.md
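The `--salt` behaviour can be pictured as a pure function of the salt and the source UID, which is why every file in a study agrees on the remapped Study and Series UIDs. A minimal sketch, assuming the SHA-256 digest is cut to 128 bits so the result stays inside DICOM's 64-character UID limit; the function name `remap_uid`, the truncation, and the sample UID are assumptions rather than the tool's actual implementation.

```python
import hashlib


def remap_uid(original_uid: str, salt: str) -> str:
    """Map a source UID to a stable 2.25.<integer> UID for a given salt."""
    digest = hashlib.sha256((salt + original_uid).encode("utf-8")).digest()
    # Keep 128 bits: at most 39 decimal digits, so "2.25." plus the digits
    # is at most 44 characters, inside the 64-character DICOM UID limit.
    value = int.from_bytes(digest[:16], "big")
    return f"2.25.{value}"


study_uid = "1.2.840.113619.2.55.3.604688119.971.1449832156.1"  # made-up UID
first = remap_uid(study_uid, "cohort-A-2024")
second = remap_uid(study_uid, "cohort-A-2024")
other = remap_uid(study_uid, "cohort-B-2024")

print(first == second)  # same salt and source give the same output
print(first == other)   # a different salt gives an unrelated UID
```

Because the mapping needs no lookup table, two sites that share a salt can anonymize their halves of a cohort independently and still produce matching UIDs. The salt therefore has to be protected like a key.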

Install

pip install dcm-anonymizer
# The CLI command stays `dcm-anon`. The distribution name on PyPI is `dcm-anonymizer`
# because `dcm-anon` collides with a similarly named legacy project.
# Or, from source:
pip install -e ".[dev]"

Runtime dependency: pydicom>=2.4. Optional: pytesseract for burned-in text detection (--scan-burned-in).


Compliance manifest

The manifest answers the single question every reviewer asks: "how do I prove this de-identification step satisfies the regulation?" It maps every PS3.15 action that ran on your data to the specific regulatory clauses it implements, with verbatim citations and links to canonical text.

Usage

# GDPR Art. 4(5) pseudonymisation + Art. 32(1)(a) technical safeguard
python anonymize.py /data/study out/ --manifest-mode gdpr --verify-output

# HIPAA Safe Harbor (45 CFR 164.514(b)(2))
python anonymize.py /data/study out/ --manifest-mode hipaa --verify-output

# EU AI Act Art. 10 data governance
# (enforcement deferred to 2027-12-02 / 2028-08-02 by Digital Omnibus 2026-05-07)
python anonymize.py /data/study out/ --manifest-mode eu-ai-act --verify-output

# Verify an existing manifest against its audit (e.g. on the auditor's machine)
python anonymize.py --verify-manifest compliance_manifest.json \
                    --audit anonymization_audit.json

The output directory then holds three files: the two manifest artifacts plus anonymization_audit.json.

out/
├── COMPLIANCE_MANIFEST.md       Human-readable. Attach to your tech file.
├── compliance_manifest.json     Structured + SHA-chained. For auditors / CI.
└── anonymization_audit.json     The per-tag log the manifest signs over.

What the manifest contains

  • Tool + PS3.15 profile + generation timestamp (post-Cegedim defensive stamp).
  • Regulatory regime metadata + enforcement-date context (live counter for AI Act).
  • Output classification: explicitly pseudonymous (NOT anonymous) under GDPR Art. 4(5), with a risk statement addressing the CNIL / Cegedim Santé enforcement pattern (€800K fine, September 2024).
  • Per-action clauses. For each PS3.15 action (X/Z/U/D) used in the run: count, citation, short title, verbatim regulatory summary. Examples:
    • Action U (UID remap) under HIPAA cites 45 CFR 164.514(c) — "re-identification code".
    • Action Z (zero) under GDPR cites Art. 32(1)(a) + Art. 4(5).
    • Action X (remove) under EU AI Act cites Art. 10(2)(b) + 10(2)(c) + 10(3), not Art. 10(5), which is the narrow bias-detection exception.
  • Audit-trail clauses. Clauses that justify the existence of the signed log itself: AI Act Art. 12 + Art. 18, HIPAA 164.312(b), GDPR Art. 30 + Art. 5(2).
  • Authoritative guidance applied. Post-2024 docs that regulators apply in audits: EDPB Guidelines 01/2025 (pseudonymisation-domain model), MDCG 2025-6 (MDR ↔ AI Act interplay for SaMD), NIST SP 800-66r2, GPAI Code of Practice, HHS OCR de-id guidance, ENISA health-sector pseudonymisation.
  • Independent output verification (when --verify-output is set): files scanned, tag list size, residuals found. Counted in the SHA chain.
  • Two SHA-256 hashes: audit_sha256 (over the per-file log) and manifest_sha256 (over the manifest payload including audit_sha256 and the verification block). Tampering either layer fails verification.
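The two-hash chain in the last bullet can be sketched with the standard library. This is an illustration under assumptions, not the tool's manifest format: the canonical JSON encoding, the field names other than `audit_sha256` and `manifest_sha256`, and the helpers `canonical_sha256`, `build_manifest`, and `verify_chain` are all hypothetical.

```python
import copy
import hashlib
import json


def canonical_sha256(payload: dict) -> str:
    """Hash a JSON-serialisable payload in a key-order-independent way."""
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def build_manifest(audit: dict, regime: str, verification: dict) -> dict:
    """The manifest hash covers the audit hash and the verification block."""
    body = {
        "regime": regime,
        "audit_sha256": canonical_sha256(audit),
        "verification": verification,
    }
    return {**body, "manifest_sha256": canonical_sha256(body)}


def verify_chain(manifest: dict, audit: dict) -> list[str]:
    """Return itemised mismatch reasons; an empty list means the chain holds."""
    problems = []
    body = {k: v for k, v in manifest.items() if k != "manifest_sha256"}
    if canonical_sha256(body) != manifest.get("manifest_sha256"):
        problems.append("manifest payload does not match manifest_sha256")
    if canonical_sha256(audit) != manifest.get("audit_sha256"):
        problems.append("audit log does not match audit_sha256")
    return problems


audit = {"records": [{"source": "a.dcm", "tags_modified": ["0010,0010"]}]}
manifest = build_manifest(audit, "gdpr", {"files_scanned": 1, "residuals": 0})

tampered_audit = copy.deepcopy(audit)
tampered_audit["records"][0]["tags_modified"] = []

tampered_manifest = copy.deepcopy(manifest)
tampered_manifest["verification"]["residuals"] = 5

print(verify_chain(manifest, audit))
print(verify_chain(manifest, tampered_audit))
print(verify_chain(tampered_manifest, audit))
```

Editing the audit log breaks the first link, and editing the verification block breaks the second, so neither layer can be changed on its own without the check noticing.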

Disclaimer

The manifest is an engineering artifact, not legal advice. It does not certify compliance and does not replace review by your Quality Management System and legal counsel. Cited regulatory text must be independently verified against the canonical source before submission to any regulator or notified body.

Note for Spanish public-sector hospitals (ENS)

Public-sector data controllers in Spain (including SNS hospitals) are subject to the Esquema Nacional de Seguridad (Real Decreto 311/2022) in addition to GDPR. Health data is Category 3, which mandates Nivel ALTO (CAT-ALTA). The signed audit log and verbatim-cited manifest produced by this tool can support the CAT-ALTA technical security measures op.exp.8 (registro de actividad), mp.info.3 (cifrado) and mp.info.6 (limpieza de documentos), but only in combination with the controller's organisational measures. ENS does NOT replace GDPR or any third-country regime (HIPAA, etc.); it is the domestic-law backstop AEPD-supervised entities are audited against.

Verification (auditor workflow)

# Independent party, with only the JSON files, can verify integrity:
python anonymize.py \
  --verify-manifest path/to/compliance_manifest.json \
  --audit path/to/anonymization_audit.json
# → PASS: manifest ... matches audit ...
# (exit code 0; non-zero on any tamper / mismatch with itemised reasons)

Architecture

Single-responsibility modules; anonymize.py re-exports the public API.

phi_table.py          PS3.15 Table E.1-1 reference data.
actions.py            Action(str, Enum) — X / Z / U / D. ActionRegistry.
uid_mapper.py         Random or salted-deterministic UID remap (SHA-256(salt+orig) → 2.25.xxx).
audit.py              Frozen dataclasses (AuditRecord / AuditSummary / ProcessingError).
                      audit_sha256 — tamper-evident hash. render_markdown_report.
pipeline.py           AnonymizationConfig dataclass; anonymize_file / anonymize_path.
                      Point-tag actions + curve/overlay range scrub + SQ recursion.
cli.py                Argparse + main. All user-facing flags.
regulatory_mapping.py Verbatim-cited regulatory clause data per regime.
manifest.py           COMPLIANCE_MANIFEST.{md,json} builder + SHA chain.
verify_output.py      Independent PHI scanner (separate tag list).
anonymize.py          Public API surface.
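The "curve/overlay range scrub" in pipeline.py refers to repeating groups, which cannot be listed tag by tag. A minimal sketch of a range mask, assuming tags are handled as 32-bit integers (group in the high 16 bits) and that any group matching 50xx or 60xx is in scope; the function names are hypothetical.

```python
CURVE_BASE = 0x5000    # curve groups (50xx,xxxx)
OVERLAY_BASE = 0x6000  # overlay groups (60xx,xxxx)
GROUP_MASK = 0xFF00    # keep the high byte of the group, ignore the "xx"


def in_repeating_group(tag: int, base_group: int) -> bool:
    """True if the tag's group number falls in the masked range."""
    group = tag >> 16
    return (group & GROUP_MASK) == base_group


def is_curve_or_overlay(tag: int) -> bool:
    return in_repeating_group(tag, CURVE_BASE) or in_repeating_group(tag, OVERLAY_BASE)


samples = {
    0x60003000: "OverlayData, first overlay plane",
    0x60023000: "OverlayData, second overlay plane",
    0x50000010: "element in a curve group",
    0x00100010: "PatientName",
    0x7FE00010: "PixelData",
}
for tag, label in samples.items():
    print(f"({tag >> 16:04X},{tag & 0xFFFF:04X}) {label}: {is_curve_or_overlay(tag)}")
```

One mask check covers every plane in the group range, which is why the tag table can stay at a fixed number of explicit entries.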

Python API

from anonymize import anonymize_path, AnonymizationConfig

cfg = AnonymizationConfig(salt="cohort-A", continue_on_error=True)
summary = anonymize_path("/data/study", "/data/anon", config=cfg)

print(summary.files_processed, summary.audit_sha256)
for record in summary.records:
    print(record.source, len(record.tags_modified))
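The loop above reads `record.source` and `record.tags_modified`. For orientation, here is a guess at how such a frozen record could be shaped, combining the fields the audit log is said to carry (action code, source SHA-256, UTC timestamp). Only the names `Action`, `AuditRecord`, `source`, and `tags_modified` come from this README; `TagChange`, `make_record`, and the remaining field names are hypothetical, and the real classes in audit.py may differ.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from enum import Enum


class Action(str, Enum):
    X = "X"  # remove the attribute
    Z = "Z"  # replace with a zero-length (or dummy) value
    U = "U"  # replace with an internally consistent UID
    D = "D"  # replace with a non-zero-length dummy value


@dataclass(frozen=True)
class TagChange:  # hypothetical helper type
    tag: str
    action: Action


@dataclass(frozen=True)
class AuditRecord:  # field layout is an assumption
    source: str
    source_sha256: str
    timestamp_utc: str
    tags_modified: tuple[TagChange, ...]


def make_record(source: str, raw: bytes, changes: list[TagChange]) -> AuditRecord:
    """Build one immutable record for a processed file."""
    return AuditRecord(
        source=source,
        source_sha256=hashlib.sha256(raw).hexdigest(),
        timestamp_utc=datetime.now(timezone.utc).isoformat(),
        tags_modified=tuple(changes),
    )


record = make_record(
    "study_0001/slice_001.dcm",
    b"placeholder bytes, not a real DICOM file",
    [TagChange("0010,0010", Action.Z), TagChange("0020,000D", Action.U)],
)
as_json = json.dumps(asdict(record), sort_keys=True)
print(record.source, len(record.tags_modified))
```

Freezing the dataclass means a record cannot be edited after the file is processed, which fits a log that is later hashed. Making `Action` a `str` subclass lets the enum serialise to plain "X"/"Z"/"U"/"D" strings in JSON.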

Comparison with other tools

| Feature | dcm-anon | pydicom example script | dcm4che deidentify | dicom-anon (chop-dbhi) | Kitware/dicom-anonymizer |
|---|---|---|---|---|---|
| PS3.15 Table E.1-1 coverage | 125 tags + range masks, 2024 ed. | ~10 tags (example only) | Full (Java, complex) | Partial (varies) | Full |
| UID consistency across files | Yes | No | Yes | Partial | Yes |
| file_meta UID consistency | Yes | No | Yes | Unknown | Unknown |
| Sequence (SQ) recursion | Yes | No | Yes | No | Yes |
| Deterministic UID remapping | --salt | No | Config hash | No | No |
| Audit log with action codes | JSON, per-tag | No | XML logs | No | No |
| Verbatim-cited compliance manifest | Yes (GDPR / HIPAA / AI Act) | No | No | No | No |
| Independent output verification | Yes | No | No | No | No |
| Burned-in PHI detection | Flag + optional OCR | No | Flag only | No | No |
| Zero runtime dependencies | pydicom only | pydicom | Java 11+ | Yes | Yes |
| License | MIT | BSD | Apache 2.0 | Apache 2.0 | Apache 2.0 |

What we do NOT do (explicit limits)

These are documented gaps, not hidden bugs:

  • Pixel-level OCR redaction. If BurnedInAnnotation = YES, the audit log warns you. Pixel data is NOT modified by default. Pixel OCR is on the hosted roadmap.
  • Private tag scrubbing. The standard says remove private attributes (X), but identifying which private groups contain PHI requires vendor-specific knowledge. We do not claim to handle private tags. See SECURITY.md.
  • DICOM SR / Structured Report content scanning. Free-text inside SR sequences may contain PHI; we do not parse SR semantics.
  • DICOMDIR update. Directory records in a DICOMDIR are not updated after UID remapping. Regenerate the DICOMDIR after anonymization.
  • Big-endian transfer syntaxes. Rare in practice; not tested.

Tests

pytest -v --cov=. --cov-report=term-missing

132 tests, coverage ≥80% gated in CI. Suite covers: per-tag PHI removal, UID consistency, file-meta SOP UID parity, sequence recursion, deterministic remap, multi-frame DICOM, burned-in flag detection, batch directory processing, cross-file UID linkage, manifest SHA-chain integrity, manifest tamper detection, independent verification correctness.


Examples

# 1. Download public test DICOMs (pydicom test data, no real PHI)
python examples/download_test_dicom.py

# 2. Run the annotated example (before/after comparison + audit log)
python examples/run_example.py

# 3. Same with deterministic UIDs
python examples/run_example.py --salt my-project-2024

A hosted interactive demo runs at huggingface.co/spaces/cpereiro/dcm-anon (synthetic DICOM only — please do not upload real patient data to the public demo).


Citing

If you use this tool in a publication, please cite via the Zenodo DOI:

@software{dcm_anon,
  author       = {Pereiro García, César},
  title        = {{dcm-anon: DICOM anonymizer with verbatim-cited compliance manifest}},
  year         = {2026},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.20267652},
  url          = {https://github.com/Ces107/dcm-anon},
}

Hosted service — preparing

A managed version with batch processing, S3-source support, private-tag handling, SR scanning, SLA, and retained audit logs is in preparation for single-team research labs.

Reserve early access (free)


License

MIT.
