Pseudonymization of medical reports using a named entity recognition model and the Faker library

PseudoCare: A Python Library for Consistent and Realistic Pseudonymization of Clinical Text Entities

Description

This Python package enables the automatic pseudonymization of medical reports and records using a deep learning-based Named Entity Recognition (NER) NLP model. It leverages the NER model's results to identify and replace sensitive information with synthetic data generated using the Faker library.
The targeted entities are:

Label           Description
ADRESSE         Postal address, e.g., 33 Boulevard de la Paix
DATE            Any absolute date other than a birth date
DATE_NAISSANCE  Patient's birth date
HOPITAL         Hospital name, e.g., Hôpital Robert Debré
IPP             Permanent patient identifier, a number assigned during the patient's first hospital visit
PRENOM          Any first name (patients, doctors, etc.)
NOM             Any last name (patients, doctors, etc.)
SECU            Social security number
TEL             Any phone number
VILLE           Any city
ZIP             Any postal code
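
To make the replacement step concrete, here is a minimal sketch, assuming the NER model returns character spans as (start, end, label) tuples. The span format, the replace_entities helper and the fixed FAKE_VALUES table are hypothetical: they stand in for the model output and for Faker, and are not the package's API.

```python
from typing import Callable

# Hypothetical stand-ins for Faker generators, keyed by entity label.
FAKE_VALUES: dict[str, Callable[[], str]] = {
    "PRENOM": lambda: "Camille",
    "NOM": lambda: "MARTIN",
    "VILLE": lambda: "Lyon",
}


def replace_entities(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) span with a synthetic value."""
    # Process spans from right to left so that earlier offsets stay valid
    # even when a replacement has a different length than the original.
    for start, end, label in sorted(spans, reverse=True):
        generator = FAKE_VALUES.get(label)
        if generator is not None:
            text = text[:start] + generator() + text[end:]
    return text


report = "Monsieur Jean Dupont habite à Paris."
spans = [(9, 13, "PRENOM"), (14, 20, "NOM"), (30, 35, "VILLE")]
print(replace_entities(report, spans))
# → Monsieur Camille MARTIN habite à Lyon.
```

Replacing from the end of the text backwards avoids having to recompute offsets after each substitution.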

Functionalities

  • Detection of sensitive entities using a Named Entity Recognition (NER) model.
  • Automatic pseudonymization of medical reports by replacing detected entities with fictitious data generated by Faker.
  • Customization of generated data with two custom Faker providers:
    • A dedicated provider handles date formats frequently found in medical reports, such as janv.12, 13 05.2015, mars2020, or mi-mai. The pseudonymization of dates is performed through offsetting, ensuring that for the same IPP, the dates across different documents are pseudonymized consistently. The user can define the maximum offset value via the Pseudonymization class constructor. By default, birth dates are shifted by a random number of days between 1 and 30, while other dates are shifted by a random value between 1 and 100. These parameters can be customized according to specific needs.
    • A provider dedicated to handling email addresses ensures pseudonymization while preserving the format used by the CHU de Reims (e.g., example@chu-reims.fr).
  • Extensibility: the user can add custom Faker providers as well as their own NER models for entity detection.
  • Default model used: the package utilizes the eds-pseudo model from AP-HP, specifically trained on medical documents, including reports.
  • Generation of a results file: After executing this package, a results.html file is generated, allowing the user to view both the original predicted document and the pseudonymized document.
    Here is an example of execution on a fictitious medical report:
    [Figure: pseudo-test]
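
The date offsetting described above can be illustrated with a short sketch. It assumes the offset is drawn from a random generator seeded with the seed and the IPP, so that every document of the same patient gets the same shift. The function names, the way the seed and IPP are combined, and the direction of the shift are assumptions made for illustration, not the package's implementation. The defaults of 30 and 100 days come from the description above.

```python
import random
from datetime import date, timedelta


def offset_for(ipp: str, seed: int, max_days: int) -> int:
    """Draw a reproducible offset in [1, max_days] for one patient."""
    # Seeding with a string is deterministic across runs, so the same
    # (seed, IPP) pair always yields the same offset.
    return random.Random(f"{seed}:{ipp}").randint(1, max_days)


def shift(d: date, ipp: str, seed: int, max_days: int = 100) -> date:
    """Shift a date backwards by the patient's offset (direction assumed)."""
    return d - timedelta(days=offset_for(ipp, seed, max_days))


# Two dates from different documents of the same patient.
visit_1 = shift(date(2018, 7, 1), ipp="12845673", seed=214)
visit_2 = shift(date(2019, 9, 1), ipp="12845673", seed=214)

# The interval between the two dates is preserved after pseudonymization.
print(visit_2 - visit_1 == date(2019, 9, 1) - date(2018, 7, 1))

# Birth dates would use the smaller default range.
birth = shift(date(1922, 12, 15), ipp="12845673", seed=214, max_days=30)
```

Because the offset depends only on the seed and the IPP, intervals between events in a patient's history stay intact, which is usually what matters clinically.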

Structure of files

  • scripts/ : Contains the Python script files
  • scripts/providers/ : Contains all custom providers
  • tests/ : Contains test notebooks
  • Results/ : Contains the results (HTML and .txt files)

Launch

Using the GitLab repository

First, clone the project locally. The package relies on the edsnlp model for entity detection, which is hosted on Hugging Face, so you need to create a Hugging Face access token (https://huggingface.co/settings/tokens?new_token=true) and register it on your machine. This only needs to be done once, by running the following script:

import huggingface_hub

# Placeholder: replace with your own Hugging Face access token.
YOUR_TOKEN = "hf_..."

huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)

Once completed, you'll be able to use the model.

Next, install uv, a fast tool for managing Python virtual environments and dependencies that provides a pip-compatible interface. Then, from the project root directory, run:

uv sync

This command creates a virtual environment and installs all the dependencies required by the project.

Finally, launch the main pipeline. If you have several documents for the same patient, run this command from the project root directory:
uv run python -m tests.pseudo_test --input "your/folder/path" --seed "seed" --is_folder

  • --input gives the path of the folder containing the medical reports (comptes rendus, abbreviated CR).
  • --seed gives a seed value for the patient, from which the patient-specific seed is derived.
  • --is_folder indicates that --input is a folder and not plain text.

Alternatively, if your report (CR) is plain text or a .txt file, run one of these commands:
uv run python -m tests.pseudo_test --input "your CR" --seed "seed"

OR

uv run python -m tests.pseudo_test --input "your/txt file/path" --seed "seed"
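
The three ways of passing --input can be summarized with a small sketch. It only illustrates how such flags could be interpreted with argparse. The load_documents helper and its fallback logic are hypothetical and are not taken from tests.pseudo_test.

```python
import argparse
from pathlib import Path


def is_existing_file(value: str) -> bool:
    """Return True if value is the path of an existing file."""
    try:
        return Path(value).is_file()
    except (OSError, ValueError):
        # Raw report text can be too long or malformed to be a path.
        return False


def load_documents(argv: list[str]) -> tuple[list[str], str]:
    """Parse the flags and return the documents plus the seed."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--seed", required=True)
    parser.add_argument("--is_folder", action="store_true")
    args = parser.parse_args(argv)

    if args.is_folder:
        # Several documents for the same patient: read every .txt file.
        folder = Path(args.input)
        docs = [p.read_text(encoding="utf-8") for p in sorted(folder.glob("*.txt"))]
    elif is_existing_file(args.input):
        # A single .txt file.
        docs = [Path(args.input).read_text(encoding="utf-8")]
    else:
        # Anything else is treated as the report text itself.
        docs = [args.input]
    return docs, args.seed


docs, seed = load_documents(["--input", "Patient vu le 06/01/2025", "--seed", "214"])
print(docs, seed)
```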

Before running the pipeline, if you wish to use your own providers, add them to the providers folder and pass them in the main.py file when initializing the Pseudonymization class. Each added provider must include a function named pseudonymize_{entity_type}. If no provider is added, the default providers are used.
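
The pseudonymize_{entity_type} naming convention can be sketched as follows. This is a dependency-free illustration: a real provider would be a Faker provider, and the ExampleMailProvider class, the pseudonymize dispatcher, the lowercasing of the entity label and the single-argument method signature are all assumptions, not the package's code.

```python
import random
import string


class ExampleMailProvider:
    """Hypothetical provider: keeps the domain, replaces the local part."""

    def __init__(self, seed: int) -> None:
        self._rng = random.Random(seed)

    def pseudonymize_mail(self, value: str) -> str:
        # Keep everything after "@" so the address format is preserved.
        _, _, domain = value.partition("@")
        local = "".join(self._rng.choice(string.ascii_lowercase) for _ in range(8))
        return f"{local}@{domain}"


def pseudonymize(provider: object, entity_type: str, value: str) -> str:
    """Call provider.pseudonymize_<entity_type> if the provider defines it."""
    method = getattr(provider, f"pseudonymize_{entity_type.lower()}", None)
    if method is None:
        return value  # no handler for this entity type in this sketch
    return method(value)


provider = ExampleMailProvider(seed=214)
print(pseudonymize(provider, "MAIL", "example@chu-reims.fr"))
```

Looking the method up by name is what makes the naming rule necessary: a provider whose method is named differently would simply never be called.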

The user must specify the data and seed, and they have the option to test on one or more documents (a list of .txt files).

A quick_start.py file has been added, demonstrating how to easily use the package in just two lines of code if you prefer not to go through the command line.

Using pip

You can install PseudoCare locally using pip. We recommend using uv to manage the virtual environment and dependencies. Follow the steps below to set up the package:

  1. Create a local folder for your project:
mkdir your_folder
cd your_folder
  2. Create a virtual environment with Python 3.10 or higher:
uv venv --python 3.10
  3. Activate the virtual environment:
source .venv/bin/activate
  4. (Optional) If pip is not available in your environment, install it:
uv run python -m ensurepip --upgrade
  5. Install PseudoCare:
uv run python -m pip install -i https://test.pypi.org/simple/ pseudocare --extra-index-url https://pypi.org/simple/

Once the installation is complete, you can start using PseudoCare to pseudonymize your files. Here's a quick example to get started:

from pseudocare.providers.custom_mail_provider import CustomMailProvider
from pseudocare.model.pseudo_faker import Pseudonymization

if __name__ == "__main__":
    # Map entity labels to user-defined providers
    custom_providers = {
         'MAIL': CustomMailProvider,
    }

    DOC = "Docteur BERNARD François, Tel: 04.10.14.10.14 Tel: 04.10.14.10.14, \
          Mail: fbernard@test.fr,\
          ipp: 12845673, \
          iep: 147085237, \
          Fait le mercredi 06/01/2025Aujourd'hui, le 6 janvier 2025, j'ai eu l'opportunité de recevoir en \
          consultation Monsieur Jean Dupont né le 15/12/1922, un patient résidant à Paris. Monsieur Dupont \
          est venu pour une consultation médicale afin de discuter de son état de santé général. Après un \
          entretien approfondi, nous avons examiné ses antécédents médicaux ainsi que ses préoccupations actuelles.\
          Le patient a été opéré en 07/2018 pour des problèmes cardiaques, puis à nouveau en sept-2019. En juin 1996,\
          il a subi une intervention chirurgicale pour une pathologie pulmonaire liée au tabagisme, et en sept.22, pour\
          une pathologie intestinale. La consultation du 06/01 a permis d'évaluer plusieurs aspects de son bien-être,\
          notamment en ce qui concerne ses habitudes de vie et ses symptômes. Nous avons convenu de plusieurs\
          recommandations pour améliorer sa santé et avons programmé une consultation pour mi-mai."
    # Instantiate the Pseudonymization class
    pseudo_faker = Pseudonymization(custom_providers=custom_providers)
    # Run the pseudonymization process
    pseudo_document = pseudo_faker.run(DOC, 214)
    print(f"{pseudo_document = }")

Credits

Youcef Anis DAHLOUK

