medimagecleaner

Comprehensive Python package for removing Protected Health Information (PHI) from medical images

Python 3.8+ | License: MIT

Overview

medimagecleaner is a production-ready package for de-identifying medical images with a focus on DICOM files. It provides comprehensive tools for metadata removal, burned-in text detection, face detection, risk assessment, and compliance validation.

Key Features

🔒 Complete De-identification

  • Remove/anonymize 50+ PHI tags from DICOM metadata
  • Detect and remove burned-in text using OCR
  • Detect and remove faces from clinical photos
  • Convert DICOM to standard formats (PNG, JPEG, TIFF, NumPy)

Validation & Compliance

  • Automated PHI detection and validation
  • Re-identification risk assessment (K-anonymity, L-diversity)
  • HIPAA-compliant de-identification (Safe Harbor & Expert Determination)
  • Complete audit trails for regulatory compliance

Performance & UX

  • Batch processing with progress tracking
  • Real-time ETA and status updates
  • Comprehensive reporting and analytics
  • Command-line interface for easy integration

Quick Start

Installation

# Basic installation
pip install medimagecleaner

# With OCR support (recommended); quotes keep shells such as zsh from expanding the brackets
pip install "medimagecleaner[ocr]"

# Full installation (all features)
pip install "medimagecleaner[all]"

Note: OCR features require Tesseract. Install via:

  • Ubuntu/Debian: sudo apt-get install tesseract-ocr
  • macOS: brew install tesseract
  • Windows: Download and run a Tesseract installer

Basic Usage

from medimagecleaner import BatchProcessor

# Initialize processor
processor = BatchProcessor(
    log_dir="./logs",
    enable_logging=True,
    enable_validation=True
)

# Process entire directory
results = processor.process_directory(
    input_dir="./raw_dicoms",
    output_dir="./deidentified",
    remove_metadata=True,
    remove_burned_text=True,
    validate_output=True
)

# Generate comprehensive report
report = processor.generate_complete_report(
    results,
    output_path="./deidentification_report.txt"
)

print(f"Processed: {results['successful']}/{results['total_files']} files")

Command Line

# Basic de-identification
medimagecleaner --input ./raw --output ./clean

# With all features
medimagecleaner \
  --input ./raw \
  --output ./clean \
  --remove-text \
  --validate \
  --format png \
  --log-dir ./logs

Core Features

1. DICOM Metadata De-identification

Remove or anonymize 50+ PHI tags including patient info, physician names, dates, and device identifiers.

from medimagecleaner import DicomDeidentifier

deidentifier = DicomDeidentifier(
    hash_patient_id=True,      # Hash instead of removing
    date_offset_days=365,      # Offset dates by 1 year
    preserve_age=True,         # Keep age information
    preserve_sex=True          # Keep sex information
)

result = deidentifier.deidentify(
    input_path="scan.dcm",
    output_path="anonymized.dcm",
    remove_private_tags=True
)
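
To illustrate what the hash_patient_id and date_offset_days options imply, here is a minimal standard-library sketch. It is not medimagecleaner's implementation: the function names, the keyed HMAC-SHA256 scheme, and the 16-character truncation are assumptions made for illustration only.

```python
import hashlib
import hmac
from datetime import datetime, timedelta


def hash_patient_id(patient_id: str, salt: str) -> str:
    """Return a stable pseudonym: keyed HMAC-SHA256, truncated to 16 hex chars.

    Hypothetical helper; the package may derive pseudonyms differently.
    """
    digest = hmac.new(salt.encode("utf-8"), patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def shift_dicom_date(value: str, offset_days: int) -> str:
    """Shift a DICOM DA value (YYYYMMDD) by a fixed number of days."""
    shifted = datetime.strptime(value, "%Y%m%d") + timedelta(days=offset_days)
    return shifted.strftime("%Y%m%d")


pseudonym = hash_patient_id("MRN-001234", salt="project-secret")
print(pseudonym)                           # same ID and salt always give the same value
print(shift_dicom_date("20240115", 365))   # 20250114 (2024 is a leap year)
```

Using one offset for every date belonging to a patient keeps the intervals between studies intact, which is usually why a shift is preferred over deleting dates outright.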

2. Burned-in Text Removal

Detect and remove patient information embedded in image pixels.

from medimagecleaner import TextRemover

text_remover = TextRemover(ocr_enabled=True)

# OCR-based detection
result = text_remover.process_dicom(
    "input.dcm",
    "output.dcm",
    method="ocr"
)

# Region-based cropping
result = text_remover.process_dicom(
    "input.dcm",
    "output.dcm",
    method="crop",
    crop_top=0.1  # Remove top 10%
)
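
The region-based option can be pictured as blanking a fixed band of rows. The sketch below is an assumption about what crop_top=0.1 means (it masks the top 10% of rows rather than shrinking the image), uses plain Python lists instead of real pixel arrays, and mask_top is a hypothetical helper, not part of the package.

```python
from typing import List

Image = List[List[int]]


def mask_top(pixels: Image, fraction: float, fill: int = 0) -> Image:
    """Return a copy of the image with the top `fraction` of rows set to `fill`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rows_to_mask = int(len(pixels) * fraction)
    return [
        [fill] * len(row) if index < rows_to_mask else list(row)
        for index, row in enumerate(pixels)
    ]


image = [[100 + r] * 4 for r in range(10)]   # 10 rows x 4 columns of dummy pixels
masked = mask_top(image, 0.1)
print(masked[0])   # [0, 0, 0, 0]
print(masked[1])   # [101, 101, 101, 101]
```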

3. Face Detection & Removal

Protect patient privacy by detecting and removing faces from clinical images.

from medimagecleaner import FaceRemover

face_remover = FaceRemover(method="blur", blur_strength=25)

result = face_remover.process_image(
    "patient_photo.jpg",
    "deidentified.jpg"
)

print(f"Detected {result['faces_detected']} faces")

4. Re-identification Risk Assessment

Assess the risk that de-identified data could be re-identified.

from medimagecleaner import RiskAssessment

risk = RiskAssessment(strict_mode=True)

# Assess entire dataset
assessment = risk.assess_dataset("./deidentified")

print(f"Risk Level: {assessment['overall_risk_level']}")
print(f"K-anonymity: {assessment['k_anonymity']['k_value']}")

# Generate detailed report
report = risk.generate_report(assessment, "risk_report.txt")
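
For readers unfamiliar with the two metrics, the sketch below computes them on a toy table. It shows the general definitions (k is the size of the smallest group sharing the same quasi-identifier values, l is the smallest number of distinct sensitive values within any such group); the field names are invented, and how RiskAssessment selects quasi-identifiers is not described here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, str]


def group_records(records: Sequence[Record], quasi_ids: Sequence[str]) -> Dict[Tuple[str, ...], List[Record]]:
    """Group records that share the same quasi-identifier values."""
    groups: Dict[Tuple[str, ...], List[Record]] = defaultdict(list)
    for record in records:
        groups[tuple(record[field] for field in quasi_ids)].append(record)
    return groups


def k_anonymity(records: Sequence[Record], quasi_ids: Sequence[str]) -> int:
    """Size of the smallest equivalence class (0 for an empty dataset)."""
    groups = group_records(records, quasi_ids)
    return min((len(g) for g in groups.values()), default=0)


def l_diversity(records: Sequence[Record], quasi_ids: Sequence[str], sensitive: str) -> int:
    """Fewest distinct sensitive values found in any equivalence class."""
    groups = group_records(records, quasi_ids)
    return min((len({r[sensitive] for r in g}) for g in groups.values()), default=0)


# Hypothetical de-identified metadata rows
records = [
    {"age_band": "60-69", "sex": "F", "modality": "CT", "diagnosis": "A"},
    {"age_band": "60-69", "sex": "F", "modality": "CT", "diagnosis": "B"},
    {"age_band": "30-39", "sex": "M", "modality": "MR", "diagnosis": "A"},
    {"age_band": "30-39", "sex": "M", "modality": "MR", "diagnosis": "A"},
    {"age_band": "30-39", "sex": "M", "modality": "MR", "diagnosis": "C"},
]
quasi_ids = ("age_band", "sex", "modality")
print(k_anonymity(records, quasi_ids))               # 2
print(l_diversity(records, quasi_ids, "diagnosis"))  # 2
```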

5. Format Conversion

Convert de-identified DICOM files to standard image formats.

from medimagecleaner import FormatConverter

converter = FormatConverter(
    normalize=True,
    apply_windowing=True
)

# Convert to PNG
converter.dicom_to_png("scan.dcm", "scan.png")

# Batch conversion
results = converter.batch_convert(
    input_dir="./dicoms",
    output_dir="./images",
    output_format="png"
)
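
The normalize and apply_windowing options matter because DICOM pixel values often span a far wider range than an 8-bit PNG can hold. The sketch below shows a simplified linear window (clip to center ± width/2, then scale to 0-255). It is an illustration under that assumption, not the converter's actual code, and the DICOM standard's exact windowing formula differs slightly at the edges.

```python
from typing import List


def apply_window(value: float, center: float, width: float) -> int:
    """Map one raw pixel value to 0-255 using a simplified linear window."""
    if width <= 0:
        raise ValueError("width must be positive")
    lower = center - width / 2.0
    upper = center + width / 2.0
    clipped = min(max(value, lower), upper)
    return round((clipped - lower) / (upper - lower) * 255)


def window_pixels(values: List[float], center: float, width: float) -> List[int]:
    """Apply the same window to every pixel value in a list."""
    return [apply_window(v, center, width) for v in values]


# Example window: center 40, width 400, covering -160..240
print(window_pixels([-1000, -160, 140, 240, 3000], center=40, width=400))
# [0, 0, 191, 255, 255]
```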

6. Validation

Automated validation ensures PHI has been properly removed.

from medimagecleaner import DeidentificationValidator

validator = DeidentificationValidator(strict_mode=True)

# Validate single file
validation = validator.validate_dicom("output.dcm")

# Batch validation
results = validator.validate_batch(
    input_dir="./deidentified",
    sample_rate=0.2  # Validate 20%
)

# Generate report
report = validator.generate_report(results, "validation_report.txt")
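
A sample_rate of 0.2 means only a subset of files is inspected. One plausible way to draw that subset reproducibly is sketched below; sample_files and its seed parameter are hypothetical, and the validator's own selection strategy may differ.

```python
import random
from typing import List, Optional, Sequence


def sample_files(paths: Sequence[str], sample_rate: float, seed: Optional[int] = None) -> List[str]:
    """Pick roughly `sample_rate` of the files at random, always at least one."""
    if not 0.0 < sample_rate <= 1.0:
        raise ValueError("sample_rate must be in (0, 1]")
    if not paths:
        return []
    count = max(1, round(len(paths) * sample_rate))
    rng = random.Random(seed)   # a fixed seed makes the sample repeatable for audits
    return sorted(rng.sample(list(paths), count))


paths = [f"deidentified/img_{i:03d}.dcm" for i in range(50)]
subset = sample_files(paths, 0.2, seed=42)
print(len(subset))   # 10
```

Sampling trades coverage for speed: a file with residual PHI that falls outside the sample goes unnoticed, so a rate of 1.0 is the safer choice before data leaves your control.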

7. Progress Tracking

Real-time progress updates for long-running operations.

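from medimagecleaner import ProgressTracker, with_progress

files = ["scan1.dcm", "scan2.dcm"]  # placeholder list of input files


def process(path):
    """Placeholder for your own per-file work."""


# Progress bar
with ProgressTracker(100, "Processing") as tracker:
    for _ in range(100):
        # Do work
        tracker.update()

# Iterator wrapper
for file in with_progress(files, "Converting"):
    process(file)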
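
An ETA of this kind is typically the average time per completed item multiplied by the number of items left. The sketch below shows that arithmetic with an injectable clock so it can be checked deterministically. SimpleProgress is a hypothetical class, not the package's ProgressTracker.

```python
import time
from typing import Callable, Optional


class SimpleProgress:
    """Track completed items and estimate the remaining time."""

    def __init__(self, total: int, clock: Callable[[], float] = time.monotonic):
        self.total = total
        self.done = 0
        self._clock = clock
        self._start = clock()

    def update(self, step: int = 1) -> None:
        """Record that `step` more items have finished."""
        self.done = min(self.total, self.done + step)

    def eta_seconds(self) -> Optional[float]:
        """Average seconds per finished item times the items remaining."""
        if self.done == 0:
            return None   # no data yet, so no estimate
        elapsed = self._clock() - self._start
        return elapsed / self.done * (self.total - self.done)


fake_now = [0.0]   # simulated clock so the example is deterministic
progress = SimpleProgress(total=100, clock=lambda: fake_now[0])
for _ in range(25):
    progress.update()
fake_now[0] = 50.0   # pretend 50 seconds have passed
print(progress.eta_seconds())   # 150.0
```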

Module Reference

Module                      Description
--------------------------  ----------------------------------
DicomDeidentifier           Remove PHI from DICOM metadata
TextRemover                 Remove burned-in text from images
FaceRemover                 Detect and remove faces
FormatConverter             Convert DICOM to standard formats
DeidentificationValidator   Validate PHI removal
RiskAssessment              Assess re-identification risk
AuditLogger                 Maintain compliance audit trails
BatchProcessor              Orchestrate complete workflows
ProgressTracker             Real-time progress tracking

Compliance

This package supports HIPAA de-identification requirements:

  • Safe Harbor Method (45 CFR §164.514(b)(2)): Removes the 18 identifier types listed in the rule
  • Expert Determination (45 CFR §164.514(b)(1)): Provides validation framework and risk assessment
  • Audit Requirements: Comprehensive logging for regulatory compliance

Disclaimer: This software is provided as-is. Users are responsible for validating that de-identification meets their specific compliance requirements.

Best Practices

  1. ✅ Never overwrite originals - Always save to different location
  2. ✅ Enable validation - Always validate de-identified files
  3. ✅ Use audit logging - Maintain compliance trails
  4. ✅ Test on samples - Verify workflow before full batch
  5. ✅ Manual review - Spot-check validation failures
  6. ✅ Assess risk - Run risk assessment before sharing
  7. ✅ Secure storage - Keep originals and mappings separate
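
The first practice can be enforced with a small guard run before any batch job. The sketch below is one possible check using only pathlib; check_output_dir is a hypothetical helper and not part of medimagecleaner.

```python
from pathlib import Path


def check_output_dir(input_dir: str, output_dir: str) -> Path:
    """Refuse an output directory that is the input directory or lies inside it."""
    src = Path(input_dir).resolve()
    dst = Path(output_dir).resolve()
    if dst == src or src in dst.parents:
        raise ValueError(f"output directory {dst} overlaps input directory {src}")
    return dst


print(check_output_dir("./raw_dicoms", "./deidentified").name)   # deidentified

try:
    check_output_dir("./raw_dicoms", "./raw_dicoms/clean")
except ValueError as error:
    print(f"Rejected: {error}")
```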

License

MIT License - See LICENSE file for details

Changelog

v0.2.0 (2025-12-28)

  • ✨ Added face detection and removal
  • ✨ Added re-identification risk assessment (K-anonymity, L-diversity)
  • ✨ Added progress tracking and status logging
  • 📚 Enhanced documentation with feature roadmap
  • 🎯 Improved PyPI packaging

v0.1.0 (2025-12-22)

  • 🎉 Initial release
  • ✅ DICOM metadata de-identification
  • ✅ Burned-in text removal
  • ✅ Format conversion
  • ✅ Validation and audit logging

