Pseudonymization of medical reports using a named entity recognition model and the Faker library
PseudoCare: A Python Library for consistent and realistic Pseudonymization of clinical text entities
Description
This Python package enables the automatic pseudonymization of medical reports and records using a deep learning-based Named Entity Recognition (NER) NLP model. It leverages the NER model's results to identify and replace sensitive information with synthetic data generated using the Faker library.
The targeted entities are:
| Label | Description |
|---|---|
| ADRESSE | Postal address, e.g., 33 Boulevard de la Paix |
| DATE | Any absolute date other than a birth date |
| DATE_NAISSANCE | Patient's birth date |
| HOPITAL | Hospital name, e.g., Hôpital Robert Debré |
| IPP | Permanent patient identifier, a number assigned during the patient's first hospital visit |
| PRENOM | Any first name (patients, doctors, etc.) |
| NOM | Any last name (patients, doctors, etc.) |
| SECU | Social security number |
| TEL | Any phone number |
| VILLE | Any city |
| ZIP | Any postal code |
| NDA | Identifier for visits |
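To make the replacement step concrete, here is a minimal sketch of how labelled spans could be swapped for surrogate values. It is not the package's implementation: the `replace_entities` function, the `(start, end, label)` span format, and the fixed surrogate values are assumptions chosen for illustration. In the package itself the surrogates come from Faker providers.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical span format: (start, end, label) character offsets,
# similar to what an NER model could return.
Span = Tuple[int, int, str]


def replace_entities(
    text: str,
    spans: List[Span],
    generators: Dict[str, Callable[[str], str]],
) -> str:
    """Replace each labelled span with a surrogate value."""
    # Process spans from the end of the text so that earlier offsets
    # remain valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        generator = generators.get(label)
        if generator is None:
            continue  # no generator for this label: leave the text as-is
        text = text[:start] + generator(text[start:end]) + text[end:]
    return text


report = "Patient Jean Dupont, Paris."
spans = [(8, 12, "PRENOM"), (13, 19, "NOM"), (21, 26, "VILLE")]
generators = {
    "PRENOM": lambda _original: "Paul",
    "NOM": lambda _original: "Martin",
    "VILLE": lambda _original: "Lyon",
}
print(replace_entities(report, spans, generators))
```

Replacing from right to left avoids recomputing offsets when a surrogate is longer or shorter than the text it replaces.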
Functionalities
- Detection of sensitive entities using a Named Entity Recognition (NER) model.
- Automatic pseudonymization of medical reports by replacing detected entities with fictitious data generated by Faker.
- Customization of generated data with two custom Faker providers:
- A dedicated provider handles date formats frequently found in medical reports, such as janv.12, 13 05.2015, mars2020, or mi-mai. The pseudonymization of dates is performed through offsetting, ensuring that for the same IPP, the dates across different documents are pseudonymized consistently. The user can define the maximum offset value via the Pseudonymization class constructor. By default, birth dates are shifted by a random number of days between 1 and 30, while other dates are shifted by a random value between 1 and 100. These parameters can be customized according to specific needs.
- A provider dedicated to handling email addresses ensures pseudonymization while preserving the format used by the CHU de Reims (e.g., example@chu-reims.fr).
- Extensibility: the user can add custom Faker providers as well as their own NER models for entity detection.
- Default model used: the package utilizes the eds-pseudo model from AP-HP, specifically trained on medical documents, including reports.
- Generation of a results file: after the package runs, a results.html file is generated, allowing the user to view both the original document with its predicted entities and the pseudonymized document.
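The date-offsetting idea described above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the `day_offset` and `shift_date` helpers, the use of a hash of the seed and IPP, and the forward direction of the shift are all hypothetical. Only the default ranges (1 to 30 days for birth dates, 1 to 100 days for other dates) come from the description.

```python
import hashlib
import random
from datetime import date, timedelta


def day_offset(ipp: str, seed: int, max_offset: int) -> int:
    """Derive a stable offset in [1, max_offset] from the patient id and seed."""
    # Hashing seed and IPP together gives the same offset for the same
    # patient in every document (assumption: this is one way to do it).
    digest = hashlib.sha256(f"{seed}:{ipp}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.randint(1, max_offset)


def shift_date(original: date, ipp: str, seed: int, max_offset: int = 100) -> date:
    """Shift a date forward by the patient's offset."""
    return original + timedelta(days=day_offset(ipp, seed, max_offset))


first = shift_date(date(2018, 7, 1), ipp="12845673", seed=214)
second = shift_date(date(2019, 9, 1), ipp="12845673", seed=214)
# Both dates move by the same number of days, so the interval between
# the two events is unchanged.
print(first, second, second - first)

# Birth dates would use the smaller default range.
print(shift_date(date(1922, 12, 15), ipp="12845673", seed=214, max_offset=30))
```

Because the offset depends only on the seed and the IPP, intervals between events for one patient are preserved across documents, which keeps the clinical timeline readable.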
Here is an example of execution on a fictitious medical report:
Structure of files
- scripts/ : Contains Python script files
- scripts/providers/ : Contains all customised providers
- tests/ : Contains test notebooks
- Results/ : Contains the results (html and txt files)
Launch
Using Gitlab repo
First, clone the project locally. Our package relies on the edsnlp model for entity detection, which is hosted on Hugging Face. Therefore, you need to create a Hugging Face access token [https://huggingface.co/settings/tokens?new_token=true], and register it on your machine. This step only needs to be done once by running the following script:
import huggingface_hub
huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)
Once completed, you'll be able to use the model.
Next, install uv, an ultra-fast tool for managing virtual environments and Python dependencies. It's compatible with pip, venv, setuptools, and poetry, but significantly faster. Once uv is installed, run the following command from the project root directory:
uv sync
This command creates a virtual environment and installs all the dependencies required by the project.
Finally, launch the main pipeline. If you have several documents for the same patient, run this command from the project root directory:
uv run python -m tests.pseudo_test --input "your/folder/path" --seed "seed" --is_folder
- --input: the path of the folder containing the reports (CRs).
- --seed: a value from which the seed for this patient is derived.
- --is_folder: indicates that --input is a folder rather than plain text.
Alternatively, if your report is a raw text string or a .txt file, run one of these commands:
uv run python -m tests.pseudo_test --input "your CR" --seed "seed"
OR
uv run python -m tests.pseudo_test --input "your/txt file/path" --seed "seed"
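As a rough illustration of how the three invocation forms above could be told apart, here is a hypothetical argument parser. It is not the actual code of tests.pseudo_test: the `build_parser` and `load_documents` helpers and the rule used to distinguish a .txt path from raw text are assumptions. Only the flag names --input, --seed, and --is_folder come from the commands above.

```python
import argparse
from pathlib import Path
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Pseudonymize medical reports.")
    parser.add_argument("--input", required=True,
                        help="Report text, path to a .txt file, or folder path.")
    parser.add_argument("--seed", required=True,
                        help="Seed value associated with the patient.")
    parser.add_argument("--is_folder", action="store_true",
                        help="Treat --input as a folder of reports.")
    return parser


def load_documents(input_value: str, is_folder: bool) -> List[str]:
    """Return the report texts designated by --input (hypothetical rule)."""
    if is_folder:
        folder = Path(input_value)
        return [p.read_text(encoding="utf-8") for p in sorted(folder.glob("*.txt"))]
    if input_value.endswith(".txt") and Path(input_value).is_file():
        return [Path(input_value).read_text(encoding="utf-8")]
    return [input_value]  # otherwise treat the value as the report itself


args = build_parser().parse_args(["--input", "Patient vu le 06/01/2025", "--seed", "214"])
print(args.seed, load_documents(args.input, args.is_folder))
```

Using `action="store_true"` means --is_folder takes no value, which matches the way the flag appears in the folder command.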
However, before that, if you wish to use your own providers, make sure to add them in the providers folder and then include them in the main.py file during the initialization of the pseudonymization class. Each added provider must include a function named pseudonymize_{entity_type}. If no provider is added, the default providers will be used.
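The naming convention can be illustrated with a small standalone sketch. Everything here is hypothetical: `CustomZipProvider`, `find_handler`, the lowercase form of the method name, and the single-argument method signature are assumptions, and a real provider would follow the structure of the package's own providers (for example CustomMailProvider) rather than this plain class.

```python
import random


class CustomZipProvider:
    """Hypothetical provider for the ZIP label (not part of the package)."""

    def __init__(self, seed: int) -> None:
        self._rng = random.Random(seed)

    def pseudonymize_zip(self, value: str) -> str:
        # Replace every digit and keep all other characters, so the
        # surrogate has the same shape as the original postal code.
        return "".join(
            str(self._rng.randint(0, 9)) if ch.isdigit() else ch
            for ch in value
        )


def find_handler(provider: object, entity_type: str):
    """Look up pseudonymize_<entity_type> on a provider, or return None."""
    # Assumption: the method name uses the lowercase entity label.
    return getattr(provider, f"pseudonymize_{entity_type.lower()}", None)


provider = CustomZipProvider(seed=214)
handler = find_handler(provider, "ZIP")
if handler is not None:
    print(handler("51100"))
print(find_handler(provider, "TEL"))  # no such method on this provider
```

A lookup by name like this is one way a pipeline could check that a provider supplies the expected function before using it.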
The user must specify the data and seed, and they have the option to test on one or more documents (a list of .txt files).
A quick_start.py file has been added, demonstrating how to easily use the package in just two lines of code if you prefer not to go through the command line.
Using pip
You can install Pseudocare locally using pip. We recommend using uv, an ultra-fast tool for managing virtual environments and Python dependencies.
Follow the steps below to set up the package:
- Create a local folder for your project:
mkdir your_folder
cd your_folder
- Create a virtual environment with Python 3.10 or higher:
uv venv --python 3.10
- Activate the virtual environment:
source .venv/bin/activate
- (Optional) If pip is not available in your environment, install it:
uv run python -m ensurepip --upgrade
- Install Pseudocare:
uv run python -m pip install pseudocare
Once the installation is complete, you can start using Pseudocare to pseudonymize your files. Here's a quick example to get started:
from pseudocare.providers.custom_mail_provider import CustomMailProvider
from pseudocare.pseudocare import PseudoCare
if __name__ == "__main__":
    # Import user providers
    custom_providers = {
        'MAIL': CustomMailProvider,
    }
    # Fictitious report used as input; the backslashes continue the string literal
    DOC = "Docteur BERNARD François, Tel: 04.10.14.10.14 Tel: 04.10.14.10.14, \
Mail: fbernard@test.fr,\
ipp: 12845673, \
iep: 147085237, \
Fait le mercredi 06/01/2025Aujourd'hui, le 6 janvier 2025, j'ai eu l'opportunité de recevoir en \
consultation Monsieur Jean Dupont né le 15/12/1922, un patient résidant à Paris. Monsieur Dupont \
est venu pour une consultation médicale afin de discuter de son état de santé général. Après un \
entretien approfondi, nous avons examiné ses antécédents médicaux ainsi que ses préoccupations actuelles.\
Le patient a été opéré en 07/2018 pour des problèmes cardiaques, puis à nouveau en sept-2019. En juin 1996,\
il a subi une intervention chirurgicale pour une pathologie pulmonaire liée au tabagisme, et en sept.22, pour\
une pathologie intestinale. La consultation du 06/01 a permis d'évaluer plusieurs aspects de son bien-être,\
notamment en ce qui concerne ses habitudes de vie et ses symptômes. Nous avons convenu de plusieurs\
recommandations pour améliorer sa santé et avons programmé une consultation pour mi-mai."
    # Instantiate the package
    pseudo_faker = PseudoCare(custom_providers=custom_providers)
    # Run the pseudonymization process (second argument: the patient seed)
    pseudo_document = pseudo_faker.run(DOC, 214)
    print(f"{pseudo_document = }")
Credits
Youcef Anis DAHLOUK