Presidio Image Redactor (presidio-image-redactor)

1. Overview

1.1. QuickLinks

1.2. Summary

This gear is under active development, and the current release candidate is subject to change. At present, running the Presidio Image Redactor as a Gear Rule is not supported; support will be added in a future release.

PLEASE NOTE: The methodologies used in this gear for identifying text & PHI entities in medical images rely heavily on statistics-based models and algorithms. These methodologies are not foolproof, and it is highly recommended that human-in-the-loop workflows are implemented to verify the identification of PHI or text entities.

This gear builds upon Microsoft's open source Presidio SDK to scan DICOM images for potential Personally Identifiable Information (PII), report on PII findings, generate example images with bounding boxes embedded, generate ReaderTasks with annotated PHI entities, and optionally redact PII stored within DICOM pixel data.

1.3. Cite

Additional information on Microsoft's Presidio SDK can be found on their website and through their GitHub Page.

1.4. License

MIT

1.5. Classification

Category: Converter

Gear Level:

  • Project
  • Subject
  • Session
  • Acquisition
  • Analysis

1.6. Inputs

  • DICOM image or series to be scanned/redacted

    • Name: image_file
    • Type: DICOM or archive (.zip)
    • Optional: false
    • Classification: DICOM
    • Modalities: US, CT, MR, XRay
    • Description: A single or multi-frame DICOM file, provided either as an isolated file or as a zipped DICOM series.
  • Coordinates of bounding boxes encapsulating PII

    • Name: bbox_coords
    • Type: source code (json)
    • Optional: true
    • Classification: source code
    • Description: JSON file containing the bounding box coordinates from a previous scanning run.

1.7. ConfigSettings

  • Debug

    • Name: debug
    • Type: boolean
    • Description: Log debug messages
    • Default: false
  • Assignees

    • Name: Assignees
    • Type: string
    • Description: Comma separated list of Flywheel user emails to assign ReaderTasks to, e.g. bob@flywheel.io, mary@flywheel.io. If empty and Operating Mode=Detection+ReaderTasks, the gear will fail.
    • Optional: true
  • Baseline Operating Mode

    • Name: Baseline Operating Mode
    • Type: string
    • Description: Selects the operating mode for the gear. Detection only: scans images for PHI & reports on findings. Detection+ReaderTasks: scans images for PHI & creates ReaderTasks with found PHI. Dynamic PHI Redaction: scans images for PHI & redacts them. RedactAllText: scans for all text within images & redacts all of it.
    • Default: true
  • Transformer Score Threshold

    • Name: Transformer Score Threshold
    • Type: integer
    • Description: The minimum confidence score (0 to 100) required for an entity identified by the transformer to be considered PHI. Default=30
    • Default: 30
    • Minimum: 0
    • Maximum: 100
  • Entity Frequency Threshold

    • Name: Entity Frequency Threshold
    • Type: integer
    • Description: Applied only to multi-frame files. Specifies the minimum share of frames (as a percentage, 0 to 100) in which an entity must appear for it to be included in all frames. Default=30. Does not affect single-frame files.
    • Default: 30
    • Minimum: 0
    • Maximum: 100
  • Use DICOM Metadata

    • Name: Use DICOM Metadata
    • Type: boolean
    • Description: If true, creates a regex recognizer from DICOM metadata to facilitate identifying PHI text in DICOM pixel data. Default=False.
    • Default: false
  • Entities to Find

    • Name: Entities to Find
    • Type: string
    • Description: List of entities the gear should look for. The default list shows all possible entities; remove any entity not needed.
    • Default: PERSON,DATE_TIME,LOCATION,AGE,ID,PROFESSION,ORGANIZATION,PHONE_NUMBER,ZIP,USERNAME,EMAIL
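
One possible reading of the Entity Frequency Threshold option can be sketched as follows. This is a minimal illustration and not the gear's implementation: the function name entities_for_all_frames and the per-frame input structure are assumptions made for the example.

```python
from collections import Counter


def entities_for_all_frames(frame_entities, frequency_threshold=30):
    """Return entity texts seen in at least `frequency_threshold` percent of frames.

    frame_entities: one collection of detected entity texts per frame
    (hypothetical structure, chosen for this sketch).
    """
    if not 0 <= frequency_threshold <= 100:
        raise ValueError("frequency_threshold must be between 0 and 100")
    n_frames = len(frame_entities)
    if n_frames <= 1:
        # Single-frame files are not affected by the threshold.
        return set().union(*frame_entities) if frame_entities else set()
    # Count each entity once per frame, even if it was detected twice in a frame.
    counts = Counter(text for frame in frame_entities for text in set(frame))
    return {t for t, c in counts.items() if 100 * c / n_frames >= frequency_threshold}


frames = [{"JOHN DOE", "1980-01-01"}, {"JOHN DOE"}, {"JOHN DOE"}, {"J0HN D0E"}]
kept = entities_for_all_frames(frames, frequency_threshold=30)
```

Here "JOHN DOE" appears in 3 of 4 frames (75%) and is kept, while the two entities that appear in a single frame (25%) fall below the default threshold of 30.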

1.8. Outputs

1.8.1. Modes

There are four operating modes for the image redactor gear. Regardless of the selected operating mode, the presidio-image-redactor tags the files it runs on with its gear name: presidio-image-redactor.

  1. Running the gear in Detection Only mode uses only the gear's scanning capabilities. In this mode the gear scans the image for PHI and generates three review documents:
     1. A csv file listing the PII entities found alongside their bounding box coordinates
     2. A duplicate DICOM image with bounding boxes overlaid on the image
     3. A .json file containing the coordinates of the bounding boxes

Lastly, the gear will tag files and acquisition containers with PHI-Found if PII was identified and PHI-Not-Found if no PHI was identified.

  2. Running the gear in Detection+ReaderTasks mode uses the gear's scanning capabilities and produces the same three outputs as Detection Only mode. Additionally, the gear creates:
     1. A Reader Protocol (default name presidio_default_protocol) to which ReaderTasks are assigned
     2. A ReaderTask for the image that is being processed
     3. Annotations of the returned bounding boxes, overlaid on the ReaderTask image

Only one ReaderTask is created for a given input file, and it is assigned using the Assignees configuration option.

  3. Dynamic PHI Redaction mode uses Optical Character Recognition (OCR) and Named Entity Recognition (NER), via the Transformer model Deid-Roberta-i2b2, to extract text from the input image, determine whether it is a PHI entity, and redact that area of the image.

This operating mode accepts an optional input called bbox_coords. It lets the user supply the bounding box coordinates from a previous Detection Only job, so that the gear does not scan the image a second time and proceeds directly to redacting it.

  4. The final operating mode, RedactAllText, uses the same OCR method as Dynamic PHI Redaction but does not use NER or the Transformer.

Operating the gear in this mode causes it to redact all text that it finds in the image, regardless of whether it is PHI.
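
The four modes described above can be summarized as a small dispatch table. The sketch below paraphrases the mode descriptions and is not the gear's code: the step names ("ocr", "ner", "report", and so on) and the plan function are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    DETECTION_ONLY = "Detection Only"
    DETECTION_READER_TASKS = "Detection+ReaderTasks"
    DYNAMIC_PHI_REDACTION = "Dynamic PHI Redaction"
    REDACT_ALL_TEXT = "RedactAllText"


# Steps per mode, paraphrased from the mode descriptions (step names are hypothetical).
STEPS = {
    Mode.DETECTION_ONLY: ("ocr", "ner", "report"),
    Mode.DETECTION_READER_TASKS: ("ocr", "ner", "report", "create_reader_task"),
    Mode.DYNAMIC_PHI_REDACTION: ("ocr", "ner", "redact"),
    Mode.REDACT_ALL_TEXT: ("ocr", "redact"),
}


def plan(mode_name, has_bbox_coords=False):
    """Return the ordered steps for a mode; raises ValueError for unknown modes."""
    mode = Mode(mode_name)
    if mode is Mode.DYNAMIC_PHI_REDACTION and has_bbox_coords:
        # Supplied bounding boxes let the run skip scanning and go straight to redaction.
        return ("redact",)
    return STEPS[mode]
```

For example, plan("RedactAllText") yields OCR followed by redaction with no NER step, matching the description of that mode.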

1.8.2. Files

  • Identified PHI

    • Name: PHI_INFO.presidio-image-redactor.<gear_version>.csv
    • Type: csv
    • Optional: true
    • Classification: file
    • Description: A csv file listing each located PII entity, its entity type, and its location in the pixel data. Example documentation can be found in the Example Documents folder nested under the docs folder.
  • Bounding box DICOM(s)

    • Name: bbox_.dcm or bbox.zip
    • Type: DICOM or archive (.zip)
    • Optional: true
    • Classification: file
    • Description: A single DICOM or DICOM series with burned-in bounding boxes surrounding identified PII.
  • Redacted DICOM(s)

    • Name: redacted_image-name_.dcm or redacted.zip
    • Type: DICOM or archive (.zip)
    • Optional: true
    • Classification: file
    • Description: A single DICOM or DICOM series with a burned-in redaction mask covering identified PII.
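
To illustrate what a burned-in redaction mask means, the sketch below fills bounding boxes in a small 2D pixel grid. It is a simplified stand-in, not the gear's implementation: the redact_boxes function, the plain nested-list image, and the box keys (left, top, width, height) are assumptions for the example, and the actual layout of the bounding box JSON may differ.

```python
def redact_boxes(pixels, boxes, fill=0):
    """Return a copy of a 2D pixel grid with each box filled with `fill`.

    pixels: list of rows (lists of ints).
    boxes: dicts with left, top, width, height (hypothetical keys).
    """
    redacted = [row[:] for row in pixels]
    n_rows = len(redacted)
    n_cols = len(redacted[0]) if redacted else 0
    for box in boxes:
        # Clip each box to the image bounds so out-of-range coordinates are harmless.
        top = max(0, box["top"])
        left = max(0, box["left"])
        bottom = min(n_rows, box["top"] + box["height"])
        right = min(n_cols, box["left"] + box["width"])
        for r in range(top, bottom):
            for c in range(left, right):
                redacted[r][c] = fill
    return redacted


image = [[9] * 4 for _ in range(3)]
masked = redact_boxes(image, [{"left": 1, "top": 0, "width": 2, "height": 2}])
```

The input grid is left untouched; only the returned copy carries the mask, which mirrors how the gear writes a separate redacted output file.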

1.8.3. Metadata

  • Gear Tag

    • Name: presidio-image-redactor
    • Type: tag
    • Optional: false
    • Classification: string
    • Description: A Flywheel tag added to input file to denote that this gear was run on it.
  • PHI Tag

    • Name: PHI-Found
    • Type: tag
    • Optional: false
    • Classification: string
    • Description: A Flywheel tag added to file containers indicating that PII was found in the image.
  • No PHI Tag

    • Name: PHI-Not-Found
    • Type: tag
    • Optional: false
    • Classification: string
    • Description: A Flywheel tag added to the input file to denote that no PHI was found by this gear.
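
The tagging behavior listed above can be sketched as a small pure function. This is an illustration only: the tags_for_file helper is hypothetical, and dropping a stale PHI tag from an earlier run is an assumption rather than documented gear behavior.

```python
GEAR_TAG = "presidio-image-redactor"


def tags_for_file(existing_tags, phi_found):
    """Return the updated tag list for a scanned file."""
    result_tag = "PHI-Found" if phi_found else "PHI-Not-Found"
    # Assumption: a result tag from an earlier run is replaced, not kept alongside.
    stale = {"PHI-Found", "PHI-Not-Found"} - {result_tag}
    tags = [t for t in existing_tags if t not in stale]
    for tag in (GEAR_TAG, result_tag):
        if tag not in tags:
            tags.append(tag)
    return tags
```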

1.9. Pre-requisites

There are no specific pre-requisites in order to run this gear. All that is needed is a DICOM image or series. However, it is recommended that users have some pre-existing knowledge of de-identification processes to effectively identify which PII entities to look for and obscure.

2. Usage

2.1. Description

This gear runs Optical Character Recognition (OCR), Named Entity Recognition (NER), and regex operations in order to identify PII entities in DICOM pixel data. PII identified by these algorithms is then cataloged for review by the user, consolidated into a ReaderTask for human review, or redacted to ensure subject privacy during research.
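
As a rough illustration of the regex step when the Use DICOM Metadata option is enabled, the sketch below builds a single pattern from metadata values and applies it to OCR text. It is an assumption about how such a recognizer could work, not the gear's or Presidio's actual code; build_metadata_pattern and its min_length parameter are hypothetical.

```python
import re


def build_metadata_pattern(metadata_values, min_length=3):
    """Compile one case-insensitive regex matching any of the given metadata values."""
    tokens = set()
    for value in metadata_values:
        # DICOM person names separate components with '^', e.g. "DOE^JOHN".
        for token in re.split(r"[\^\s]+", str(value)):
            if len(token) >= min_length:
                tokens.add(re.escape(token))
    if not tokens:
        return None
    # Longest first so longer values win over their prefixes.
    alternation = "|".join(sorted(tokens, key=lambda t: (-len(t), t)))
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


pattern = build_metadata_pattern(["DOE^JOHN", "MRN12345"])
hits = pattern.findall("Pt: John Doe  ID MRN12345  CT CHEST")
```

In this example the pattern picks out "John", "Doe", and "MRN12345" from the OCR text and ignores the remaining words.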

2.1.1. FileSpecification

DICOM Images

At this time, DICOM images or series must have a photometric interpretation metadata value of MONOCHROME1, MONOCHROME2, or RGB. It is highly recommended to first run the dicom-fixer on all DICOM files prior to running the Presidio Image Redactor. Improper metadata formatting or alternative pixel compression formats can impair or terminate the gear run.
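
A pre-flight check for the photometric interpretation requirement could look like the sketch below. The helper name is_supported_photometric is hypothetical, and the check only covers the three values named above.

```python
SUPPORTED_PHOTOMETRIC = {"MONOCHROME1", "MONOCHROME2", "RGB"}


def is_supported_photometric(value):
    """True when the PhotometricInterpretation value is one the gear accepts."""
    return str(value).strip().upper() in SUPPORTED_PHOTOMETRIC
```

The value would typically come from the PhotometricInterpretation attribute (0028,0004) of the DICOM header, for instance as read with pydicom.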

2.2. Workflow

A picture and description of the workflow

graph LR;
    A["Input<br>DICOM Image"]:::start;
    A --> X[Detection+ReaderTasks]:::input --> H;
    A --> Y[DetectionOnly]:::input --> D;
    A --> C[RedactAllText]:::input --> L; 
    
    H[Human-in-the-loop <br>ReaderTask annotations review]:::container-->I;
    D[Review any found PII <br> Decide if further scanning required]:::container-->E;
    L[Review images to determine if sufficient text removed]:::container --> K
    
    E((Run gear in <br> Dynamic PHI Redaction)):::gear --> F;
    I((Run<br>image-redaction-exporter)):::gear --> J;
    K[Review redacted outputs <br> Move redacted files to deid project]:::output
    
    
    F[Review redacted outputs <br> Move redacted files to deid project]:::output;
    J[Review redacted outputs <br> Move redacted files to deid project]:::output;

    classDef start fill:#415e9a,color:#fff
    classDef container fill:#415e9a,color:#fff
    classDef input fill:#008080,color:#fff
    classDef gear fill:#659,color:#fff
    classDef output fill:#005851

2.3. UseCases

2.3.1. UseCase1

PHI Detection + ReaderTask Pipeline: You need to conduct PHI identification and redaction on your data set and require human-in-the-loop verification of the gear's identification performance.

  1. Prep the images by ensuring dicom-fixer has been run on all your images.
  2. Enter the Flywheel emails of the individuals that will be reviewing the ReaderTasks.
  3. Select the Detection+ReaderTasks operating mode in the configuration options.
  4. Run the gear & have your Readers complete their Assigned ReaderTasks. Ensure Readers add or remove annotations on the image as needed.
  5. Once satisfied that all PHI in your dataset has been annotated, run the image-redaction-exporter to redact all areas indicated by ReaderTask annotations.
  6. Export data to clean project or instance, or simply begin data analytics.
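
Step 2 depends on the Assignees string being well formed. A minimal sketch of how that value could be parsed and checked is shown below; parse_assignees is a hypothetical helper, and the failure on an empty list mirrors the behavior described for the Assignees option.

```python
def parse_assignees(raw, mode):
    """Split the comma separated Assignees string into a list of emails."""
    emails = [part.strip() for part in (raw or "").split(",") if part.strip()]
    if mode == "Detection+ReaderTasks" and not emails:
        # The gear is documented to fail in this mode when no assignee is given.
        raise ValueError("Assignees must be set when running Detection+ReaderTasks")
    return emails
```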

2.3.2. UseCase2

Simple PHI Scan & Redact

  1. Prep the images by ensuring dicom-fixer has been run on all your images.
  2. Select the DetectionOnly operating mode in the configuration options.
  3. Run the gear & inspect output files showcasing identified PHI.
  4. Once satisfied that the PHI in your dataset has been correctly identified, run the gear again with the operating mode set to Dynamic PHI Redaction. The gear will redact the entities that were found. You may choose to provide the bounding box JSON as an additional input.
  5. Export data to clean project or instance, or simply begin data analytics.

2.3.3. UseCase3

Complete Text Removal

  1. Prep the images by ensuring dicom-fixer has been run on all your images.
  2. Select the RedactAllText operating mode in the configuration options.
  3. Run the gear & inspect output to determine if sufficient text has been removed from the images.
  4. Export data to clean project or instance, or simply begin data analytics.

2.4. Logging

Logging implemented for this gear aims to provide the user with an understanding of what flags were passed into the gear, what mode of operation the gear is currently running, and what outputs are provided upon completion.

To facilitate troubleshooting, raw OCR results can be created when running the gear in debug mode.
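
The effect of the debug configuration option on log verbosity can be sketched with the standard logging module. This is a generic illustration: the configure_logging helper and the logger name are assumptions, not taken from the gear's source.

```python
import logging


def configure_logging(debug=False):
    """Return a logger whose level follows the gear's debug option."""
    log = logging.getLogger("presidio-image-redactor")  # hypothetical logger name
    log.setLevel(logging.DEBUG if debug else logging.INFO)
    return log


log = configure_logging(debug=True)
# Only emitted in debug mode, e.g. for raw OCR results.
log.debug("raw OCR results: %s", [{"text": "EXAMPLE", "conf": 91}])
```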

3. FAQ

FAQ.md

4. Contributing

For more information about how to get started contributing to this gear, check out CONTRIBUTING.md.

Built Distribution

fw_presidio_image_redactor-0.1.3-py3-none-any.whl (44.9 kB view details)

Uploaded Python 3

File details

Details for the file fw_presidio_image_redactor-0.1.3-py3-none-any.whl.

File metadata

File hashes

Hashes for fw_presidio_image_redactor-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 9e06940210b49ab945ec567e49da51d6d6183fcede144054d364eb278bfd791e
MD5 d670d27337c5da6174783bb5dcede89b
BLAKE2b-256 5f8f00e4371bd4820ee9e85dadc7fbc6f7ebdf2f48991f08db79f73cc2a689bb
