Presidio Image Redactor (presidio-image-redactor)
1. Overview
1.1. QuickLinks
1.2. Summary
This gear is under active development, and the current Release Candidate is subject to change. At present, running the Presidio Image Redactor as a Gear Rule is not supported; support will be added in a future release.
PLEASE NOTE: The methodologies used in this gear for identifying text & PHI entities in medical images rely heavily on statistics-based models and algorithms. These methodologies are not foolproof, and it is highly recommended that human-in-the-loop workflows be implemented to verify the identification of PHI or text entities.
This gear builds upon Microsoft's open source Presidio SDK to scan DICOM images for potential Personally Identifiable Information (PII), report on PII findings, generate example images with bounding boxes embedded, generate ReaderTasks with annotated PHI entities, and optionally redact PII stored within DICOM pixel data.
1.3. Cite
Additional information on Microsoft's Presidio SDK can be found on their website and through their GitHub Page.
1.4. License
MIT
1.5. Classification
Category: Converter
Gear Level:
- Project
- Subject
- Session
- Acquisition
- Analysis
1.6. Inputs
-
DICOM image or series to be scanned/redacted
- Name: image_file
- Type: DICOM or archive (.zip)
- Optional: false
- Classification: DICOM
- Modalities: US, CT, MR, XRay
- Description: A single or multi-frame DICOM file, provided either as an isolated file or as a zipped DICOM series.
-
Coordinates of bounding boxes encapsulating PII
- Name: bbox_coords
- Type: source code (json)
- Optional: true
- Classification: source code
- Description: JSON file containing the bounding box coordinates from a previous scanning run.
1.7. ConfigSettings
-
Debug
- Name: debug
- Type: boolean
- Description: Log debug messages
- Default: false
-
Assignees
- Name: Assignees
- Type: string
- Description: Comma-separated list of Flywheel user emails to assign ReaderTasks to, e.g. bob@flywheel.io, mary@flywheel.io. If empty & Operating Mode=Detection+ReaderTasks, the gear will fail.
- Optional: true
-
Baseline Operating Mode
- Name: Baseline Operating Mode
- Type: string
- Description: Selects the operating mode for the gear. Detection only: scans images for PHI & reports on findings. Detection+ReaderTasks: scans images for PHI & creates ReaderTasks with found PHI. Dynamic PHI Redaction: scans images for PHI & redacts them. RedactAllText: scans for all text within images & redacts all of it.
- Default: true
-
Transformer Score Threshold
- Name: Transformer Score Threshold
- Type: integer
- Description: The minimum confidence score (0 to 100) required for an entity identified by the transformer to be considered PHI. Default=30
- Default: 30
- Minimum: 0
- Maximum: 100
-
Entity Frequency Threshold
- Name: Entity Frequency Threshold
- Type: integer
- Description: Applied only to multi-frame files. Specifies the minimum share of frames (as a percentage, 0 to 100) in which an entity must appear for it to be included in all frames. Default=30. Does not impact single-frame files.
- Default: 30
- Minimum: 0
- Maximum: 100
-
Use DICOM Metadata
- Name: Use DICOM Metadata
- Type: boolean
- Description: If true, creates a regex recognizer from DICOM metadata to facilitate identifying PHI text in DICOM pixel data. Default=False.
- Default: false
-
Entities to Find
- Name: Entities to Find
- Type: string
- Description: List of entities the gear should look for. Current list shows all possible entities; remove any entity not needed.
- Default: PERSON,DATE_TIME,LOCATION,AGE,ID,PROFESSION,ORGANIZATION,PHONE_NUMBER,ZIP,USERNAME,EMAIL
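The Entity Frequency Threshold described above can be read as a per-entity vote across frames. The following is a minimal sketch of that idea, not the gear's implementation: the function name `propagate_frequent_entities`, the per-frame set-of-strings data shape, and the "at least" comparison at the boundary are all assumptions made for illustration.

```python
from collections import Counter


def propagate_frequent_entities(frames, frequency_threshold=30):
    """Return the entities to apply to every frame of a multi-frame file.

    frames: list of sets, one per frame, holding the entity texts found there
            (hypothetical shape, chosen for illustration).
    frequency_threshold: minimum share of frames (0 to 100) in which an
            entity must appear to be kept.
    """
    if not 0 <= frequency_threshold <= 100:
        raise ValueError("frequency_threshold must be between 0 and 100")
    if len(frames) <= 1:
        # Single-frame files are not affected by the threshold.
        return set().union(*frames) if frames else set()

    # Count in how many frames each entity appears.
    counts = Counter(entity for frame in frames for entity in set(frame))
    total = len(frames)
    # Integer comparison avoids floating-point rounding at the boundary.
    return {e for e, n in counts.items() if n * 100 >= frequency_threshold * total}


frames = [{"JOHN DOE", "01/02/1970"}, {"JOHN DOE"}, {"JOHN DOE", "L"}, set()]
print(sorted(propagate_frequent_entities(frames, 30)))  # ['JOHN DOE']
```

In this example "JOHN DOE" appears in 3 of 4 frames (75%) and is kept, while the two entities seen in a single frame (25%) fall below the threshold of 30 and are dropped.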
1.8. Outputs
1.8.1. Modes
There are four operating modes for the image redactor gear. Regardless of the selected operating mode, the presidio-image-redactor will tag files that it runs on with its gear name: presidio-image-redactor.
- Running the gear with Detection Only will solely use the gear's scanning capabilities. In this mode the gear will scan the image for PHI and generate three review documents:
- A csv denoting PII entities found alongside corresponding bounding box coordinates
- A duplicate DICOM image with bounding boxes overlaid on the image
- A .json file containing the coordinates for the bounding boxes
Lastly, the gear will tag files and acquisition containers with PHI-Found if PII was identified and PHI-Not-Found if no PHI was identified.
- Running the gear in Detection+ReaderTasks mode will use the gear's scanning capabilities & produce the same three outputs as stated above. Additionally, the gear will create:
- A Reader Protocol (default name presidio_default_protocol) for assigning ReaderTasks to
- A ReaderTask for the image that is being processed
- Annotations of the returned bounding boxes, overlaying them on the ReaderTask image
Only 1 ReaderTask is created for a given input_file and is assigned using the Assignees configuration option.
- Dynamic PHI Redaction mode will utilize Optical Character Recognition (OCR) & Named Entity Recognition (NER) via the Transformer model Deid-Roberta-i2b2 to extract text from the input image, determine if it is a PHI entity, and redact that area of the image.
This operating mode accepts an optional input called bbox_coords. It allows the user to pass the bounding box coordinates from a previous Detection Only job to the gear, so the gear skips scanning a second time and proceeds directly to redacting the image.
- The final operating mode RedactAllText uses the same OCR method as the method above, but does not use NER or the Transformer.
Operating the gear in this mode will cause the gear to redact any and all text that it finds in the image, regardless of whether it is PHI.
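To make the redaction step concrete, here is a minimal sketch of masking bounding boxes in pixel data. It is an illustration under stated assumptions, not the gear's code: the `redact_boxes` helper is hypothetical, the frame is a plain nested list standing in for one grayscale frame, and the `left`/`top`/`width`/`height` keys are an assumed layout for the bounding box JSON.

```python
import json


def redact_boxes(pixels, boxes, fill=0):
    """Return a copy of a 2-D pixel grid with each box filled in.

    pixels: list of rows (lists of ints), a stand-in for one grayscale frame.
    boxes: dicts with left, top, width, height in pixels (assumed layout).
    """
    redacted = [row[:] for row in pixels]
    n_rows = len(redacted)
    n_cols = len(redacted[0]) if redacted else 0
    for box in boxes:
        # Clamp each box to the image so partly out-of-bounds boxes are safe.
        top = max(0, box["top"])
        left = max(0, box["left"])
        bottom = min(n_rows, box["top"] + box["height"])
        right = min(n_cols, box["left"] + box["width"])
        for r in range(top, bottom):
            for c in range(left, right):
                redacted[r][c] = fill
    return redacted


# Hypothetical bounding box JSON, as if saved from an earlier detection run.
bbox_json = '[{"left": 1, "top": 0, "width": 2, "height": 2}]'
image = [[9, 9, 9, 9], [9, 9, 9, 9], [9, 9, 9, 9]]
for row in redact_boxes(image, json.loads(bbox_json)):
    print(row)
```

Because the boxes are read from JSON rather than recomputed, the same function covers both cases: boxes produced in the current run and boxes supplied from a previous scan.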
1.8.2. Files
-
Identified PHI
- Name: PHI_INFO.presidio-image-redactor.<gear_version>.csv
- Type: csv
- Optional: true
- Classification: file
- Description: A csv file containing located PII, which entity type, and location in pixel data. Example documentation can be found in the Example Documents folder nested under the docs folder.
-
Bounding box DICOM(s)
- Name: bbox_.dcm or bbox.zip
- Type: DICOM or archive (.zip)
- Optional: true
- Classification: file
- Description: A single DICOM or DICOM series with burned in bounding boxes surrounding identified PII.
-
Redacted DICOM(s)
- Name: redacted_image-name_.dcm or redacted.zip
- Type: DICOM or archive (.zip)
- Optional: true
- Classification: file
- Description: A single DICOM or DICOM series with burned in redaction mask covering identified PII.
1.8.3. Metadata
-
Gear Tag
- Name: presidio-image-redactor
- Type: tag
- Optional: false
- Classification: string
- Description: A Flywheel tag added to input file to denote that this gear was run on it.
-
PHI Tag
- Name: PHI-Found
- Type: tag
- Optional: false
- Classification: string
- Description: A Flywheel tag added to files and their containers to denote that PII was found in the image.
-
No PHI Tag
- Name: PHI-Not-Found
- Type: tag
- Optional: false
- Classification: string
- Description: A Flywheel tag added to the input file to denote that no PHI was found by this gear.
1.9. Pre-requisites
There are no specific pre-requisites in order to run this gear. All that is needed is a DICOM image or series. However, it is recommended that users have some pre-existing knowledge of de-identification processes to effectively identify which PII entities to look for and obscure.
2. Usage
2.1. Description
This gear runs Optical Character Recognition (OCR), Named Entity Recognition (NER), and regex operations in order to identify PII entities in DICOM pixel data. PII identified by these algorithms is then cataloged for review by the user, consolidated into a ReaderTask for human review, or redacted to ensure subject privacy during research.
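As an illustration of the regex step, the sketch below builds a single pattern from DICOM metadata values and applies it to OCR text, in the spirit of the Use DICOM Metadata option. It is a sketch under assumptions rather than the gear's recognizer: `build_metadata_pattern` is a hypothetical helper, and the metadata values and OCR string are made up.

```python
import re


def build_metadata_pattern(metadata_values):
    """Compile one case-insensitive regex matching any metadata token."""
    tokens = set()
    for value in metadata_values:
        # DICOM person names separate components with "^"; split them so
        # "DOE^JOHN" also matches "JOHN" or "DOE" on their own.
        for part in re.split(r"[\^\s]+", value.strip()):
            if len(part) > 1:  # skip empty strings and single characters
                tokens.add(part)
    if not tokens:
        return None
    # Longest first so longer tokens win over their prefixes.
    alternatives = sorted(tokens, key=lambda t: (-len(t), t))
    return re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in alternatives) + r")\b",
        re.IGNORECASE,
    )


pattern = build_metadata_pattern(["DOE^JOHN", "MRN12345"])
ocr_text = "Patient: John Doe  ID MRN12345  Depth 12cm"
print(pattern.findall(ocr_text))  # ['John', 'Doe', 'MRN12345']
```

Escaping each token with `re.escape` keeps metadata values that contain regex metacharacters from changing the pattern's meaning, and the word boundaries prevent "DOE" from matching inside unrelated words.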
2.1.1. FileSpecification
DICOM Images
At this time, DICOM images or series must have a photometric interpretation metadata value of MONOCHROME1, MONOCHROME2, or RGB. It is highly recommended to first run the dicom-fixer on all DICOM files prior to running Presidio Image Redactor. Improper metadata formatting or alternative pixel compression formats can impair or terminate the gear run.
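A pre-flight check for the photometric interpretation requirement could look like the sketch below. The helper name `is_supported_photometric` is hypothetical, and the value is passed in as a plain string; with a DICOM library such as pydicom it would typically come from the dataset's `PhotometricInterpretation` attribute.

```python
# Values listed as supported in this section.
SUPPORTED_PHOTOMETRIC = {"MONOCHROME1", "MONOCHROME2", "RGB"}


def is_supported_photometric(value):
    """Return True if the photometric interpretation is one the gear accepts."""
    if not isinstance(value, str):
        # Missing or malformed metadata is treated as unsupported.
        return False
    return value.strip().upper() in SUPPORTED_PHOTOMETRIC


for candidate in ["MONOCHROME2", "rgb ", "YBR_FULL_422", None]:
    print(repr(candidate), is_supported_photometric(candidate))
```

Running a check like this before submitting a batch makes it easier to spot files that still need dicom-fixer or a pixel data conversion.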
2.2. Workflow
A picture and description of the workflow
graph LR;
A["Input<br>DICOM Image"]:::start;
A --> X[Detection+ReaderTasks]:::input --> H;
A --> Y[DetectionOnly]:::input --> D;
A --> C[RedactAllText]:::input --> L;
H[Human-in-the-loop <br>ReaderTask annotations review]:::container-->I;
D[Review any found PII <br> Decide if further scanning required]:::container-->E;
L[Review images to determine if sufficient text removed]:::container --> K
E((Run gear in <br> Dynamic PHI Redaction)):::gear --> F;
I((Run<br>image-redaction-exporter)):::gear --> J;
K[Review redacted outputs <br> Move redacted files to deid project]:::output
F[Review redacted outputs <br> Move redacted files to deid project]:::output;
J[Review redacted outputs <br> Move redacted files to deid project]:::output;
classDef start fill:#415e9a,color:#fff
classDef container fill:#415e9a,color:#fff
classDef input fill:#008080,color:#fff
classDef gear fill:#659,color:#fff
classDef output fill:#005851
2.3. UseCases
2.3.1. UseCase1
PHI Detection + ReaderTask Pipeline: You need to conduct PHI identification and redaction on your data set & require human-in-the-loop verification of the gear's identification performance.
- Prep the images by ensuring dicom-fixer has been run on all your images.
- Enter the Flywheel emails of the individuals that will be reviewing the ReaderTasks.
- Select the Detection+ReaderTasks operating mode in the configuration options.
- Run the gear & have your Readers complete their assigned ReaderTasks. Ensure Readers add or remove annotations on the image as needed.
- Once satisfied that your dataset has been de-identified, run the image-redaction-exporter to redact all areas indicated by ReaderTask annotations.
- Export data to clean project or instance, or simply begin data analytics.
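The assignee requirement in the steps above can be sketched as a small validation routine. This is an assumption-laden illustration, not the gear's code: `parse_assignees` is a hypothetical helper, the mode string is taken from the configuration description, and the "contains @" check is only a rough sanity test rather than full email validation.

```python
def parse_assignees(raw, operating_mode):
    """Split a comma-separated email string and enforce the mode requirement."""
    emails = [part.strip() for part in (raw or "").split(",") if part.strip()]

    # Rough sanity check only; real email validation is more involved.
    invalid = [email for email in emails if "@" not in email]
    if invalid:
        raise ValueError(f"Not valid email addresses: {invalid}")

    # Detection+ReaderTasks needs at least one reader to assign tasks to.
    if operating_mode == "Detection+ReaderTasks" and not emails:
        raise ValueError("Assignees must not be empty in Detection+ReaderTasks mode")
    return emails


print(parse_assignees("bob@flywheel.io, mary@flywheel.io", "Detection+ReaderTasks"))
```

Checking this before launching a batch avoids a failed gear run caused by an empty Assignees field.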
2.3.2. UseCase2
Simple PHI Scan & Redact
- Prep the images by ensuring dicom-fixer has been run on all your images.
- Select the DetectionOnly operating mode in the configuration options.
- Run the gear & inspect output files showcasing identified PHI.
- Once satisfied with the identified PHI, run the gear again and set the operating mode to Dynamic PHI Redaction. The gear will run and redact the entities that were found. You may choose to provide the bounding box JSON as an additional input.
- Export data to clean project or instance, or simply begin data analytics.
2.3.3. UseCase3
Complete Text Removal
- Prep the images by ensuring dicom-fixer has been run on all your images.
- Select the RedactAllText operating mode in the configuration options.
- Run the gear & inspect output to determine if sufficient text has been removed from the images.
- Export data to clean project or instance, or simply begin data analytics.
2.4. Logging
Logging implemented for this gear aims to provide the user with an understanding of what flags were passed into the gear, what mode of operation the gear is currently running, and what outputs are provided upon completion.
To facilitate troubleshooting, raw OCR results can be created when running the gear in debug mode.
3. FAQ
4. Contributing
For more information about how to get started contributing to this gear, check out CONTRIBUTING.md.