Pseudonymization of medical reports using a named entity recognition model and the Faker library

PseudoCare: A Python Library for consistent and realistic Pseudonymization of clinical text entities

Description

This Python package enables the automatic pseudonymization of medical reports and records using a deep learning-based Named Entity Recognition (NER) NLP model. It leverages the NER model's results to identify and replace sensitive information with synthetic data generated using the Faker library.
The targeted entities are:

Label           Description
ADRESSE         Postal address, e.g., 33 Boulevard de la Paix
DATE            Any absolute date other than a birth date
DATE_NAISSANCE  Patient's birth date
HOPITAL         Hospital name, e.g., Hôpital Robert Debré
IPP             Permanent patient identifier, a number assigned during the patient's first hospital visit
PRENOM          Any first name (patients, doctors, etc.)
NOM             Any last name (patients, doctors, etc.)
SECU            Social security number
TEL             Any phone number
VILLE           Any city
ZIP             Any postal code
NDA             Identifier for visits
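
As an illustration of how detected entities can be swapped for synthetic values, here is a minimal sketch. It assumes the NER step yields character spans with a label; the Span tuple, the span_of and replace_entities helpers, and the fixed placeholder values are hypothetical and are not part of PseudoCare's API, which generates its replacements with Faker.

```python
from typing import Callable, NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # entity label, e.g. "NOM" or "VILLE"


def span_of(text: str, word: str, label: str) -> Span:
    """Build a span for the first occurrence of word (stands in for NER output)."""
    start = text.index(word)
    return Span(start, start + len(word), label)


def replace_entities(text: str, spans: list[Span], fake: Callable[[str, str], str]) -> str:
    """Replace each span with fake(label, original_text).

    Spans are processed from right to left so that the offsets of the
    remaining spans stay valid while the text changes length.
    """
    result = text
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        original = result[span.start:span.end]
        result = result[:span.start] + fake(span.label, original) + result[span.end:]
    return result


# Hypothetical stand-in for Faker-generated values.
PLACEHOLDERS = {"NOM": "MARTIN", "PRENOM": "Claire", "VILLE": "Lyon"}

report = "Monsieur Jean Dupont, résidant à Paris."
spans = [
    span_of(report, "Jean", "PRENOM"),
    span_of(report, "Dupont", "NOM"),
    span_of(report, "Paris", "VILLE"),
]
pseudonymized = replace_entities(report, spans, lambda label, _: PLACEHOLDERS[label])
print(pseudonymized)
```

Working from the last span to the first avoids recomputing offsets after each substitution, which matters because a synthetic value rarely has the same length as the text it replaces.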

Functionalities

  • Detection of sensitive entities using a Named Entity Recognition (NER) model.
  • Automatic pseudonymization of medical reports by replacing detected entities with fictitious data generated by Faker.
  • Customization of generated data with two custom Faker providers:
    • A dedicated provider handles date formats frequently found in medical reports, such as janv.12, 13 05.2015, mars2020, or mi-mai. The pseudonymization of dates is performed through offsetting, ensuring that for the same IPP, the dates across different documents are pseudonymized consistently. The user can define the maximum offset value via the PseudoCare class constructor. By default, birth dates are shifted by a random number of days between 1 and 30, while other dates are shifted by a random value between 1 and 100. These parameters can be customized according to specific needs.
    • A provider dedicated to handling email addresses ensures pseudonymization while preserving the format used by the CHU de Reims (e.g., example@chu-reims.fr).
  • Extensibility: the user can add custom Faker providers as well as their own NER models for entity detection.
  • Default model used: the package utilizes the eds-pseudo model from AP-HP, specifically trained on medical documents, including reports.
  • Generation of a results file: After executing this package, a results.html file is generated, allowing the user to view both the original predicted document and the pseudonymized document.
    Here is an example of execution on a fictitious medical report:
    [Figure: pseudo-test]
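
The date offsetting described above can be sketched as follows. This is a minimal illustration, not PseudoCare's implementation: the patient_offset and shift_date helpers are hypothetical, deriving the offset from a hash of the seed and the IPP is an assumption, and so is shifting dates forward rather than backward. The point is only that one offset per patient keeps dates consistent across that patient's documents.

```python
import hashlib
import random
from datetime import date, timedelta


def patient_offset(ipp: str, seed: int, max_offset: int) -> int:
    """Derive a reproducible offset in days (1..max_offset) for one patient."""
    digest = hashlib.sha256(f"{seed}:{ipp}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.randint(1, max_offset)


def shift_date(original: date, ipp: str, seed: int, max_offset: int = 100) -> date:
    """Shift a date by the patient's offset (forward shift is an assumption)."""
    return original + timedelta(days=patient_offset(ipp, seed, max_offset))


# Two dates from two different documents of the same patient.
surgery = date(2018, 7, 1)
follow_up = date(2019, 9, 1)
shifted_surgery = shift_date(surgery, ipp="12845673", seed=214)
shifted_follow_up = shift_date(follow_up, ipp="12845673", seed=214)

# The same offset is applied to both, so the interval between them is preserved.
print(shifted_follow_up - shifted_surgery == follow_up - surgery)
```

Because both dates move by the same number of days, the clinical timeline (for example, the delay between an operation and a follow-up) survives pseudonymization.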

Structure of files

  • pseudocare/ : Contains Python script files
  • pseudocare/providers/ : Contains all customised providers
  • examples/ : Contains test scripts
  • Results/ : Contains the results (html and txt files)

Launch

Using the GitLab repo

First, clone the project locally. Our package relies on the edsnlp model for entity detection, which is hosted on Hugging Face. Therefore, you need to create a Hugging Face access token [https://huggingface.co/settings/tokens?new_token=true], and register it on your machine. This step only needs to be done once by running the following script:

import huggingface_hub

# Replace YOUR_TOKEN with your Hugging Face access token, passed as a string.
huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)

Once completed, you'll be able to use the model.

Next, install uv, an ultra-fast tool for managing virtual environments and Python dependencies. It's compatible with pip, venv, setuptools, and poetry, but significantly faster. Once uv is installed, run:

uv sync

This command creates a virtual environment and installs all the dependencies required by the project.

Finally, launch the main pipeline.
If you have several documents for the same patient, run this command from the project root directory:
uv run python -m examples.batch_script --input "your/folder/path" --seed "seed" --is_folder

  • --input is the path of the folder containing the medical reports (CRs, "comptes rendus").
  • --seed is the seed associated with the patient; reusing it keeps the pseudonymization of that patient's documents consistent.
  • --is_folder indicates that --input is a folder and not plain text.
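
To make the three options concrete, here is a hedged sketch of how a script could parse them and turn --input into a list of documents. It is not the code of examples.batch_script: the build_parser and load_documents helpers are hypothetical, and reading only *.txt files from the folder is an assumption.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Pseudonymize one or more reports.")
    parser.add_argument("--input", required=True,
                        help="Folder path, .txt file path, or raw report text.")
    parser.add_argument("--seed", required=True,
                        help="Seed shared by all documents of one patient.")
    parser.add_argument("--is_folder", action="store_true",
                        help="Treat --input as a folder of reports.")
    return parser


def load_documents(input_value: str, is_folder: bool) -> list[str]:
    """Return the report texts designated by --input."""
    if is_folder:
        # Assumption: every .txt file in the folder is one report.
        files = sorted(Path(input_value).glob("*.txt"))
        return [f.read_text(encoding="utf-8") for f in files]
    path = Path(input_value)
    if path.suffix == ".txt" and path.is_file():
        return [path.read_text(encoding="utf-8")]
    return [input_value]  # otherwise, treat the value as the report itself


# Simulate: --input "Compte rendu du 06/01/2025" --seed "214"
args = build_parser().parse_args(["--input", "Compte rendu du 06/01/2025", "--seed", "214"])
documents = load_documents(args.input, args.is_folder)
print(len(documents), args.seed, args.is_folder)
```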

Alternatively, if your report is a raw text string or a .txt file, run one of these commands:
uv run python -m examples.batch_script --input "your CR" --seed "seed"

OR

uv run python -m examples.batch_script --input "your/txt file/path" --seed "seed"

However, before that, if you wish to use your own providers, make sure to add them in the providers folder and then include them in the pseudocare.py file during the initialization of the PseudoCare class. Each added provider must include a function named pseudonymize_{entity_type}. If no provider is added, the default providers will be used.
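
The naming convention can be illustrated with a small dispatch sketch. Everything in it is hypothetical: the CityProvider class, the pseudonymize_entity helper, the lowercase form of the entity type in the method name, and the single-argument method signature are assumptions made for illustration, and a real provider would build on Faker instead of returning a fixed value.

```python
class CityProvider:
    """Hypothetical provider for the VILLE label."""

    def pseudonymize_ville(self, original: str) -> str:
        # A fixed value keeps the sketch deterministic.
        return "Lyon"


def pseudonymize_entity(providers: list[object], entity_type: str, original: str) -> str:
    """Call the first provider method named pseudonymize_<entity_type>."""
    method_name = f"pseudonymize_{entity_type.lower()}"
    for provider in providers:
        method = getattr(provider, method_name, None)
        if callable(method):
            return method(original)
    return original  # no provider handles this label


providers = [CityProvider()]
print(pseudonymize_entity(providers, "VILLE", "Paris"))
print(pseudonymize_entity(providers, "TEL", "04.10.14.10.14"))
```

Looking methods up by name is one way a pipeline can accept new entity types without changing its core loop, which is why a predictable method name matters.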

The user must specify the data and seed, and they have the option to test on one or more documents (a list of .txt files).

A quick_start.py file has been added, demonstrating how to easily use the package in just two lines of code if you prefer not to go through the command line.

Using pip

You can install Pseudocare locally using pip. We recommend using uv, an ultra-fast tool for managing virtual environments and Python dependencies. Follow the steps below to set up the package:

  1. Create a local folder for your project:
mkdir your_folder
cd your_folder
  2. Create a virtual environment with Python 3.10 or higher:
uv venv --python 3.10
  3. Activate the virtual environment:
source .venv/bin/activate
  4. (Optional) If pip is not available in your environment, install it:
uv run python -m ensurepip --upgrade
  5. Install Pseudocare:
uv run python -m pip install pseudocare

Once the installation is complete, you can start using Pseudocare to pseudonymize your files. Here's a quick example to get started:

from pseudocare.providers.custom_mail_provider import CustomMailProvider
from pseudocare.pseudocare import PseudoCare

if __name__ == "__main__":
    # Import user providers
    custom_providers = {
         'MAIL': CustomMailProvider,
    }

    DOC = "Docteur BERNARD François, Tel: 04.10.14.10.14 Tel: 04.10.14.10.14, \
          Mail: fbernard@test.fr,\
          ipp: 12845673, \
          iep: 147085237, \
          Fait le mercredi 06/01/2025Aujourd'hui, le 6 janvier 2025, j'ai eu l'opportunité de recevoir en \
          consultation Monsieur Jean Dupont né le 15/12/1922, un patient résidant à Paris. Monsieur Dupont \
          est venu pour une consultation médicale afin de discuter de son état de santé général. Après un \
          entretien approfondi, nous avons examiné ses antécédents médicaux ainsi que ses préoccupations actuelles.\
          Le patient a été opéré en 07/2018 pour des problèmes cardiaques, puis à nouveau en sept-2019. En juin 1996,\
          il a subi une intervention chirurgicale pour une pathologie pulmonaire liée au tabagisme, et en sept.22, pour\
          une pathologie intestinale. La consultation du 06/01 a permis d'évaluer plusieurs aspects de son bien-être,\
          notamment en ce qui concerne ses habitudes de vie et ses symptômes. Nous avons convenu de plusieurs\
          recommandations pour améliorer sa santé et avons programmé une consultation pour mi-mai."
    # Instantiate the package
    pseudo_faker = PseudoCare(custom_providers=custom_providers)
    # Run the pseudonymization process
    pseudo_document = pseudo_faker.run(DOC, 214)
    print(f"{pseudo_document = }")

Credits

Youcef Anis DAHLOUK
