
DICOM validation for the Laboratory Catalog and Archive Service of the Early Detection Research Network


🛂 EDRN DICOM Validation

A validation tool for DICOM files used by the Laboratory Catalog and Archive Service (LabCAS) of the Early Detection Research Network (EDRN). This program ensures that DICOM files:

  • Contain little-to-no PHI/PII — Scans both DICOM headers and pixel data for protected health information (PHI) and personally identifiable information (PII)
  • Adhere to EDRN requirements — Validates DICOM tags against the EDRN core and MR requirements

This tool was developed in response to EDRN/EDRN-metadata#160.

🎯 Features

The following subsections describe the program's features.

🔍 PHI/PII Detection

  • Header-based detection: Scans DICOM metadata tags for identifiers including:
    • Patient names, birth dates, addresses
    • Physician and operator names
    • Email addresses, phone numbers, SSNs
    • Medical record numbers (MRNs)
  • Pixel-based detection: Uses OCR (Tesseract) to detect text embedded in DICOM images
  • Multiple recognizers: Choose between different PHI/PII detection algorithms:
    • simple-scoring (default): Pattern-based detection with configurable scoring
    • accepting: Accepts all files (testing only)
    • rejecting: Rejects all files (testing only)
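
To illustrate what pattern-based detection with scoring can look like, here is a minimal sketch. It is not the package's implementation: the `PATTERNS` table, its weights, and the `score_text` function are all hypothetical names chosen for this example, and the regular expressions only cover common US-style formats.

```python
import re

# Hypothetical patterns and weights, chosen only for illustration.
# Each entry maps a label to (compiled pattern, score if the pattern matches).
PATTERNS = {
    "email": (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), 0.9),
    "ssn": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 1.0),
    "phone": (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), 0.7),
}


def score_text(text: str) -> tuple[float, list[str]]:
    """Return the highest weight among matching patterns and the matching labels."""
    hits = [
        (name, weight)
        for name, (pattern, weight) in PATTERNS.items()
        if pattern.search(text)
    ]
    score = max((weight for _, weight in hits), default=0.0)
    return score, [name for name, _ in hits]
```

Taking the maximum weight rather than a sum keeps the score within 0.0 to 1.0, which matches the range used by the `--score` option.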

✅ DICOM Tag Validation

Validates over 40 DICOM tags against EDRN requirements including:

  • Study/Series/Image Identification: UIDs, instance numbers, SOP class
  • Acquisition Modality and Equipment: Modality codes, manufacturer info, device details
  • Temporal Data: Dates and times in proper format
  • Image Data: Dimensions, pixel data, display parameters
  • MR-specific: Spacing between slices validation
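
As an example of the "dates and times in proper format" check, the sketch below validates strings against the DICOM DA (`YYYYMMDD`) and TM (`HH[MM[SS[.FFFFFF]]]`) value representations using only the standard library. The function names `is_valid_da` and `is_valid_tm` are hypothetical and are not taken from this package, whose validators may work differently.

```python
import re
from datetime import datetime

# TM: hours, then optional minutes, optional seconds (60 permits a leap second),
# and an optional fraction of one to six digits.
_TM = re.compile(r"([01]\d|2[0-3])(([0-5]\d)(([0-5]\d|60)(\.\d{1,6})?)?)?")


def is_valid_da(value: str) -> bool:
    """True if value is eight digits (YYYYMMDD) naming a real calendar date."""
    if not re.fullmatch(r"\d{8}", value):
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


def is_valid_tm(value: str) -> bool:
    """True if value matches the TM layout; trailing space padding is ignored."""
    return _TM.fullmatch(value.rstrip(" ")) is not None
```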

📊 Reporting

Generates detailed Markdown reports organized by:

  • Site ID
  • Event ID
  • File name
  • Finding type and severity score

📦 Installation

This section describes how to install the software.

⚙️ Prerequisites

Requires Python 3.12 or higher and Tesseract OCR for pixel-based PHI/PII detection.

🔤 Tesseract

Tesseract provides optical character recognition features for this program and must be installed separately.

macOS:

brew install tesseract

Linux (Ubuntu/Debian):

sudo apt-get install tesseract-ocr

Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
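
Before running the tool, it can help to confirm that both prerequisites are met. The following is a small sketch of such a check; `missing_prerequisites` is a hypothetical helper, not part of this package, and it assumes the Tesseract executable is named `tesseract` and is reachable on `PATH` (some Windows installers may not add it automatically).

```python
import shutil
import sys


def missing_prerequisites(version_info=sys.version_info, which=shutil.which) -> list[str]:
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    # Compare only the (major, minor) pair against the documented minimum.
    if tuple(version_info[:2]) < (3, 12):
        problems.append("Python 3.12 or higher is required")
    # shutil.which returns None when the executable cannot be found on PATH.
    if which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```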

📥 Install the Package

It's best to set up a Python virtual environment and use pip to install the package into that environment:

pip install jpl.labcas.validation

Or install from source:

git clone https://github.com/EDRN/jpl.labcas.validation.git
cd jpl.labcas.validation
pip install --editable .

🚀 Usage

The following describes how to use this program.

💻 Basic Usage

The easiest way to run this is:

validate-dicom-files <directory>

The <directory> should contain, somewhere beneath it, the following directory hierarchy:

<directory>
    … (sub-directories)
    collection-folder (such as Prostate_MRI)
        event-ID-folder (such as 1234567)
        … (sub-folders)
            DICOM file 1
            DICOM file 2
            …
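
To check that a directory has this shape before validating it, a sketch like the following can list the files found under each event ID folder. It assumes that event ID folders are named with exactly seven digits and treats every regular file beneath them as a candidate DICOM file; `files_by_event` is a hypothetical helper and does not reflect how the tool itself discovers files.

```python
import re
from pathlib import Path

# Assumption for this sketch: an event ID folder name is exactly seven digits.
EVENT_ID = re.compile(r"\d{7}")


def files_by_event(root: Path) -> dict[str, list[Path]]:
    """Map each event ID folder name under root to the regular files beneath it."""
    found: dict[str, list[Path]] = {}
    for folder in sorted(root.rglob("*")):
        if folder.is_dir() and EVENT_ID.fullmatch(folder.name):
            files = sorted(p for p in folder.rglob("*") if p.is_file())
            found.setdefault(folder.name, []).extend(files)
    return found
```

For example, `files_by_event(Path("/path/to/dicom/files"))` returns an empty dictionary when no seven-digit folder exists anywhere beneath the given directory.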

⚡ Command-Line Options

Use --help to get more details, but summarizing:

  • -s, --score <value>: Maximum PHI/PII score threshold (0.0-1.0, default: 0.8)
  • -c, --concurrency <num>: Number of concurrent processes (default: CPU count)
  • -r, --recognizer <name>: PHI/PII recognizer to use:
    • simple-scoring (default): Pattern-based detection
    • accepting: Accept all files
    • rejecting: Reject all files
  • -o, --output <file>: Output file for report (default: report.md)
  • -v, --verbose: Verbose logging
  • -q, --quiet: Quiet logging

📝 Examples

Validate a directory with default settings:

validate-dicom-files /path/to/dicom/files

Use a different PHI/PII threshold (a lower value reports more findings, so it is stricter):

validate-dicom-files --score 0.5 /path/to/dicom/files

Generate a custom report filename:

validate-dicom-files --output validation_results.md /path/to/dicom/files

Use a specific number of workers:

validate-dicom-files --concurrency 4 /path/to/dicom/files

In general, use a --concurrency equal to at least the number of CPU cores available. Some recommend using twice that number.
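
The idea behind the `--concurrency` default can be sketched with the standard library as follows. This is an illustration only: `default_concurrency` and `map_files` are hypothetical names, and the tool's own process management may differ.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def default_concurrency() -> int:
    """Use the CPU count as the default, falling back to 1 if it is unknown."""
    return os.cpu_count() or 1


def map_files(worker, paths, concurrency=None, executor_cls=ProcessPoolExecutor):
    """Apply worker to each path in a pool and return the results in input order."""
    workers = concurrency or default_concurrency()
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(worker, paths))
```

With `ProcessPoolExecutor`, the `worker` must be a picklable, module-level function, and on platforms that spawn new interpreters the call should sit under an `if __name__ == "__main__":` guard.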

📖 Understanding the Report

The tool generates a Markdown report with findings organized hierarchically:

  1. By Site ID: Grouped by blinded site identifier
  2. By Event ID: Grouped by 7-digit event ID
  3. By File: Individual DICOM files within each event
  4. By Finding: Each finding includes:
    • Score: Severity from 0.0 (low) to 1.0 (high)
    • Kind: Type of finding:
      • 🙈 Header: PHI/PII found in DICOM metadata
      • 🖼️ Pixels: PHI/PII found in image data via OCR
      • ⚠️ Validation: Tag compliance issue
      • ❌ Error: File reading or processing error
    • Details: Specific information about the finding

Only findings with scores above the threshold are included in the report.
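
The hierarchy and threshold rule described above can be sketched as follows. The `Finding` fields, the heading levels, and the `render_report` function are assumptions made for this example; the actual report layout produced by the tool may differ.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Finding:
    site_id: str
    event_id: str
    file_name: str
    kind: str
    score: float
    details: str


def render_report(findings, threshold: float = 0.8) -> str:
    """Render findings scoring above threshold as Markdown grouped by site, event, file."""
    kept = sorted(
        (f for f in findings if f.score > threshold),
        key=lambda f: (f.site_id, f.event_id, f.file_name),
    )
    lines = []
    # groupby needs sorted input; each nested loop consumes its group before moving on.
    for site, by_site in groupby(kept, key=lambda f: f.site_id):
        lines.append(f"# Site {site}")
        for event, by_event in groupby(by_site, key=lambda f: f.event_id):
            lines.append(f"## Event {event}")
            for name, by_file in groupby(by_event, key=lambda f: f.file_name):
                lines.append(f"### {name}")
                for f in by_file:
                    lines.append(f"- {f.kind} (score {f.score:.2f}): {f.details}")
    return "\n".join(lines)
```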

🏗️ Architecture

The validation framework is modular and extensible:

  • PHI/PII Recognizers: Plug-in system for different detection algorithms
  • Validators: Individual validators for each DICOM tag requirement
  • Findings: Structured representation of all issues discovered
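
One common way to build such a plug-in system is a name-to-callable registry, sketched below. The `register` decorator, the `RECOGNIZERS` table, and `get_recognizer` are hypothetical and are not the package's API. The sketch also assumes that an "accepting" recognizer always scores 0.0 and a "rejecting" one always scores 1.0.

```python
from typing import Callable

# In this sketch a recognizer is any callable mapping text to a score in 0.0..1.0.
Recognizer = Callable[[str], float]
RECOGNIZERS: dict[str, Recognizer] = {}


def register(name: str):
    """Decorator that records a recognizer under the given name."""
    def decorator(func: Recognizer) -> Recognizer:
        RECOGNIZERS[name] = func
        return func
    return decorator


@register("accepting")
def accepting(text: str) -> float:
    return 0.0  # never exceeds any threshold, so every file passes


@register("rejecting")
def rejecting(text: str) -> float:
    return 1.0  # the maximum score, so every file is flagged


def get_recognizer(name: str) -> Recognizer:
    """Look up a recognizer by name, with a helpful error for unknown names."""
    try:
        return RECOGNIZERS[name]
    except KeyError:
        choices = ", ".join(sorted(RECOGNIZERS))
        raise ValueError(f"unknown recognizer {name!r}; choose from: {choices}") from None
```

A new algorithm then only needs a decorated function; the lookup code stays unchanged.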

🧪 Development Status

Development Status: Pre-Alpha

CT requirements may be added in the future, pending completion of the spreadsheet's CT tab.

📄 License

Apache 2.0 - See LICENSE.md for details

🤝 Contributing

Issues and pull requests welcome on GitHub: https://github.com/EDRN/jpl.labcas.validation/issues. See also the EDRN Code of Conduct and Contributors' Guide.

👤 Authors

  • Sean Kelly @nutjob4life

©️ Copyright

Copyright © 2025 California Institute of Technology. U.S. Government sponsorship acknowledged.
