DICOM validation for the Laboratory Catalog and Archive Service of the Early Detection Research Network
🛂 EDRN DICOM Validation
A validation tool for DICOM files used by the Laboratory Catalog and Archive Service (LabCAS) of the Early Detection Research Network (EDRN). This program ensures that DICOM files:
- Contain little-to-no PHI/PII — Scans both DICOM headers and pixel data for protected health information (PHI) and personally identifiable information (PII)
- Adhere to EDRN requirements — Validates DICOM tags against the EDRN core and MR requirements
This tool was developed in response to EDRN/EDRN-metadata#160.
🎯 Features
This program has features described in the following subsections.
🔍 PHI/PII Detection
- Header-based detection: Scans DICOM metadata tags for identifiers including:
  - Patient names, birth dates, addresses
  - Physician and operator names
  - Email addresses, phone numbers, SSNs
  - Medical record numbers (MRNs)
- Pixel-based detection: Uses OCR (Tesseract) to detect text embedded in DICOM images
- Multiple recognizers: Choose between different PHI/PII detection algorithms:
  - simple-scoring (default): Pattern-based detection with configurable scoring
  - accepting: Accepts all files (testing only)
  - rejecting: Rejects all files (testing only)
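As a rough illustration of what pattern-based detection with scoring can look like, here is a minimal sketch using only the standard library. The pattern set, the weights, and the `score_text` function are all hypothetical and chosen for illustration; the package's own simple-scoring recognizer may work quite differently.

```python
import re

# Hypothetical patterns and weights, for illustration only.
# Each entry maps a name to (compiled pattern, severity weight).
PATTERNS = {
    "email": (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), 0.9),
    "ssn": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 1.0),
    "phone": (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), 0.7),
}


def score_text(text: str) -> tuple[float, list[str]]:
    """Return the highest matching weight and the names of the patterns that matched."""
    matched = [name for name, (pattern, _) in PATTERNS.items() if pattern.search(text)]
    score = max((PATTERNS[name][1] for name in matched), default=0.0)
    return score, matched


if __name__ == "__main__":
    for sample in ("Contact jane@example.org", "T2 axial"):
        print(sample, "->", score_text(sample))
```

A score produced this way could then be compared against a threshold such as the one set with `--score`.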
✅ DICOM Tag Validation
Validates over 40 DICOM tags against EDRN requirements including:
- Study/Series/Image Identification: UIDs, instance numbers, SOP class
- Acquisition Modality and Equipment: Modality codes, manufacturer info, device details
- Temporal Data: Dates and times in proper format
- Image Data: Dimensions, pixel data, display parameters
- MR-specific: Spacing between slices validation
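To make the "dates and times in proper format" check concrete, the sketch below validates strings against the DICOM standard's DA (`YYYYMMDD`) and TM (`HHMMSS.FFFFFF`, with trailing components optional) value representations. This is an assumption about what "proper format" means here; the EDRN requirements themselves are not reproduced, and the function names are hypothetical. Leap seconds are deliberately ignored.

```python
import re
from datetime import datetime

# TM: hours are required; minutes, seconds, and a fractional part are optional.
_TM = re.compile(r"([01][0-9]|2[0-3])([0-5][0-9]([0-5][0-9](\.[0-9]{1,6})?)?)?")


def is_valid_da(value: str) -> bool:
    """True when value is a real calendar date written as YYYYMMDD."""
    if not re.fullmatch(r"[0-9]{8}", value):
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


def is_valid_tm(value: str) -> bool:
    """True when value looks like a DICOM time (trailing padding spaces allowed)."""
    return _TM.fullmatch(value.rstrip(" ")) is not None


if __name__ == "__main__":
    print(is_valid_da("20250115"), is_valid_da("20250230"))
    print(is_valid_tm("142530.123"), is_valid_tm("2460"))
```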
📊 Reporting
Generates detailed Markdown reports organized by:
- Site ID
- Event ID
- File name
- Finding type and severity score
📦 Installation
Details on installing this software follow in this section.
⚙️ Prerequisites
Requires Python 3.12 or higher and Tesseract OCR for pixel-based PHI/PII detection.
🔤 Tesseract
Tesseract provides optical character recognition features for this program and must be installed separately.
macOS:
brew install tesseract
Linux (Ubuntu/Debian):
sudo apt-get install tesseract-ocr
Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
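Before installing the package, it can help to confirm that both prerequisites are in place. The following is a small sketch, not part of the package, that checks the running Python version and looks for a `tesseract` executable on the `PATH`; the function name `check_prerequisites` is hypothetical.

```python
import shutil
import sys


def check_prerequisites(min_python: tuple[int, int] = (3, 12)) -> list[str]:
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted} or higher is required")
    # shutil.which returns None when the executable is not on the PATH
    if shutil.which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```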
📥 Install the Package
It's best to set up a Python virtual environment and use pip to install the package into that environment:
pip install jpl.labcas.validation
Or install from source:
git clone https://github.com/EDRN/jpl.labcas.validation.git
cd jpl.labcas.validation
pip install --editable .
🚀 Usage
The following describes how to use this program.
💻 Basic Usage
The easiest way to run this is:
validate-dicom-files <directory>
The <directory> should contain, possibly beneath intermediate sub-directories, the following hierarchy:
<directory>
    … (sub-directories)
        collection-folder (such as Prostate_MRI)
            event-ID-folder (such as 1234567)
                … (sub-folders)
                    DICOM file 1
                    DICOM file 2
                    …
⚡ Command-Line Options
Use --help to get more details, but summarizing:
- -s, --score <value>: Maximum PHI/PII score threshold (0.0-1.0, default: 0.8)
- -c, --concurrency <num>: Number of concurrent processes (default: CPU count)
- -r, --recognizer <name>: PHI/PII recognizer to use:
  - simple-scoring (default): Pattern-based detection
  - accepting: Accept all files
  - rejecting: Reject all files
- -o, --output <file>: Output file for report (default: report.md)
- -v, --verbose: Verbose logging
- -q, --quiet: Quiet logging
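For readers who want to see how the options and their defaults fit together, here is an `argparse` sketch that mirrors the summary above. It is an illustrative reconstruction and not the package's actual parser; in particular, whether `--verbose` and `--quiet` are mutually exclusive is unknown, so this sketch does not enforce it.

```python
import argparse
import os


def unit_interval(text: str) -> float:
    """Parse a float and require it to lie between 0.0 and 1.0."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(f"{text} is not between 0.0 and 1.0")
    return value


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the documented options and defaults (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="validate-dicom-files")
    parser.add_argument("directory")
    parser.add_argument("-s", "--score", type=unit_interval, default=0.8)
    parser.add_argument("-c", "--concurrency", type=int, default=os.cpu_count())
    parser.add_argument(
        "-r", "--recognizer",
        choices=["simple-scoring", "accepting", "rejecting"],
        default="simple-scoring",
    )
    parser.add_argument("-o", "--output", default="report.md")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("-q", "--quiet", action="store_true")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args(["--score", "0.5", "/path/to/dicom/files"]))
```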
📝 Examples
Validate a directory with default settings:
validate-dicom-files /path/to/dicom/files
Use a different PHI/PII threshold (a lower threshold reports more findings, so it is stricter):
validate-dicom-files --score 0.5 /path/to/dicom/files
Generate a custom report filename:
validate-dicom-files --output validation_results.md /path/to/dicom/files
Use a specific number of workers:
validate-dicom-files --concurrency 4 /path/to/dicom/files
In general, use a --concurrency equal to at least the number of CPU cores available. Some recommend using twice that number.
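When the tool is launched from a script, the worker count can be derived from the machine's core count. The sketch below only builds the command line; `build_command` and its `workers_per_core` parameter are hypothetical helpers, not part of the package.

```python
import os


def build_command(directory: str, workers_per_core: int = 1) -> list[str]:
    """Build a validate-dicom-files command sized to the available CPU cores."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    workers = cores * workers_per_core
    return ["validate-dicom-files", "--concurrency", str(workers), directory]


if __name__ == "__main__":
    # One worker per core, then two workers per core.
    print(build_command("/path/to/dicom/files"))
    print(build_command("/path/to/dicom/files", workers_per_core=2))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the package is installed.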
📖 Understanding the Report
The tool generates a Markdown report with findings organized hierarchically:
- By Site ID: Grouped by blinded site identifier
- By Event ID: Grouped by 7-digit event ID
- By File: Individual DICOM files within each event
- By Finding: Each finding includes:
  - Score: Severity from 0.0 (low) to 1.0 (high)
  - Kind: Type of finding:
    - 🙈 Header: PHI/PII found in DICOM metadata
    - 🖼️ Pixels: PHI/PII found in image data via OCR
    - ⚠️ Validation: Tag compliance issue
    - ❌ Error: File reading or processing error
  - Details: Specific information about the finding
Only findings with scores above the threshold are included in the report.
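The threshold filter and the site, event, and file grouping described above can be sketched as follows. The `Finding` fields, the function names, and the Markdown layout are assumptions made for illustration; the report the tool actually produces may be laid out differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record; field names are illustrative."""
    site_id: str
    event_id: str
    file_name: str
    kind: str
    score: float
    details: str


def group_findings(findings, threshold=0.8):
    """Keep findings scoring above the threshold, nested by site, event, and file."""
    grouped = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for finding in findings:
        if finding.score > threshold:
            grouped[finding.site_id][finding.event_id][finding.file_name].append(finding)
    return grouped


def render_markdown(grouped) -> str:
    """Render the nested grouping as Markdown headings and bullet items."""
    lines = []
    for site_id in sorted(grouped):
        lines.append(f"# Site {site_id}")
        for event_id in sorted(grouped[site_id]):
            lines.append(f"## Event {event_id}")
            for file_name in sorted(grouped[site_id][event_id]):
                lines.append(f"### {file_name}")
                for finding in grouped[site_id][event_id][file_name]:
                    lines.append(
                        f"- {finding.kind} (score {finding.score:.2f}): {finding.details}"
                    )
    return "\n".join(lines)
```

Note that a finding whose score equals the threshold is excluded, matching the "above the threshold" wording.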
🏗️ Architecture
The validation framework is modular and extensible:
- PHI/PII Recognizers: Plug-in system for different detection algorithms
- Validators: Individual validators for each DICOM tag requirement
- Findings: Structured representation of all issues discovered
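One common way to build a plug-in system like the one listed above is an abstract base class plus a name-to-class registry. The sketch below shows that general technique with two trivial recognizers named after the `accepting` and `rejecting` options; the class names, the `score` method, and the registry are hypothetical and do not describe the package's real interfaces.

```python
from abc import ABC, abstractmethod


class Recognizer(ABC):
    """Hypothetical plug-in interface: score a piece of text from 0.0 to 1.0."""

    @abstractmethod
    def score(self, text: str) -> float:
        ...


class AcceptingRecognizer(Recognizer):
    """Treats everything as clean, so every file is accepted."""

    def score(self, text: str) -> float:
        return 0.0


class RejectingRecognizer(Recognizer):
    """Treats everything as PHI/PII, so every file is rejected."""

    def score(self, text: str) -> float:
        return 1.0


# Registry keyed by the name a user would select.
REGISTRY: dict[str, type[Recognizer]] = {
    "accepting": AcceptingRecognizer,
    "rejecting": RejectingRecognizer,
}


def make_recognizer(name: str) -> Recognizer:
    """Instantiate the recognizer registered under the given name."""
    try:
        return REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown recognizer: {name}") from None
```

Adding a new algorithm in this design means writing one subclass and adding one registry entry, which keeps the calling code unchanged.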
🧪 Development Status
Development Status: Pre-Alpha
CT requirements may be added in the future, pending completion of the spreadsheet's CT tab.
📄 License
Apache 2.0 - See LICENSE.md for details
🤝 Contributing
Issues and pull requests welcome on GitHub: https://github.com/EDRN/jpl.labcas.validation/issues. See also the EDRN Code of Conduct and Contributors' Guide.
👤 Authors
- Sean Kelly (@nutjob4life)
©️ Copyright
Copyright © 2025 California Institute of Technology. U.S. Government sponsorship acknowledged.