A module to anonymize French text data
Incognito
Description
Incognito is a Python module for anonymizing French text. It uses regular expressions and other strategies to mask names and other personal information supplied by the user.
This module was specifically designed for medical reports, ensuring that disease names remain unaltered.
Installation
From pip
pip install incognito-anonymizer
From this repository
- Clone the repository:
git clone https://github.com/Micropot/incognito
- Install the dependencies (defined in pyproject.toml):
pip install .
Usage
Python API
Example: Providing Personal Information Directly in Code
from incognito import anonymizer
# Initialize the anonymizer
ano = anonymizer.Anonymizer()
# Define personal information
infos = {
"first_name": "Bob",
"last_name": "Jungels",
"birth_name": "",
"birthdate": "1992-09-22",
"ipp": "0987654321",
"postal_code": "01000",
"adress": ""
}
# Configure the anonymizer
ano.set_info(infos)
ano.add_analyser('pii')
ano.add_analyser('regex')
ano.add_analyser('lossy')  # triggers a warning; see the docstring for details
ano.set_mask('placeholder')
# Read and anonymize text
text_to_anonymize = ano.open_text_file("/path/to/file.txt")
anonymized_text = ano.anonymize(text_to_anonymize)
print(anonymized_text)
Example: Using JSON File for Personal Information
from incognito import anonymizer
# Initialize the anonymizer
ano = anonymizer.Anonymizer()
# Load personal information from JSON
infos_json = ano.open_json_file("/path/to/infofile.json")
# Configure the anonymizer
ano.set_info(infos_json)
ano.add_analyser('pii')
ano.add_analyser('regex')
ano.set_mask('placeholder')
# Read and anonymize text
text_to_anonymize = ano.open_text_file("/path/to/file.txt")
anonymized_text = ano.anonymize(text_to_anonymize)
print(anonymized_text)
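The JSON file presumably holds the same fields as the dictionary in the first example. The exact schema expected by open_json_file is not documented here, so the key set below is an assumption copied from that example, and write_info_file and read_info_file are hypothetical helpers rather than part of Incognito. A minimal sketch for producing such a file with the standard library:

```python
import json
from pathlib import Path

# Keys copied from the in-code example; treating them as the JSON schema is an assumption.
infos = {
    "first_name": "Bob",
    "last_name": "Jungels",
    "birth_name": "",
    "birthdate": "1992-09-22",
    "ipp": "0987654321",
    "postal_code": "01000",
    "adress": "",  # spelled as in the example above
}


def write_info_file(path, data):
    """Serialize the personal-information dict as UTF-8 JSON (hypothetical helper)."""
    Path(path).write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def read_info_file(path):
    """Load the dict back, e.g. to check the file before handing it to the anonymizer."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Keeping ensure_ascii=False preserves accented characters in French names instead of escaping them.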
Example: Annotating a File
from incognito import anonymizer
# Initialize the anonymizer
ano = anonymizer.Anonymizer()
# Load personal information from JSON
infos_json = ano.open_json_file("/path/to/infofile.json")
# Configure the annotator
ano.set_info(infos_json)
ano.add_analyser('pii')
ano.add_analyser('regex')
ano.set_annotator('placeholder')
# Read and annotate text
text_to_anonymize = ano.open_text_file("/path/to/file.txt")
annotated_text = ano.annotate(text_to_anonymize)
print(annotated_text)
Command-Line Interface (CLI)
Basic Usage
python -m incognito --input myinputfile.txt --output myanonymizedfile.txt --strategies mystrategies --mask mymasks
Find Available Strategies, Masks and Annotators
python -m incognito --help
Anonymization with JSON File
python -m incognito --input myinputfile.txt --output myanonymizedfile.txt --strategies mystrategies --mask mymasks json --json myjsonfile.json
To view help options for the JSON submodule:
python -m incognito json --help
Anonymization with Personal Information in CLI
python -m incognito --input myinputfile.txt --output myanonymizedfile.txt --strategies mystrategies --mask mymasks infos --first_name Bob --last_name Dylan --birthdate 1800-01-01 --ipp 0987654312 --postal_code 75001
To view help options for the "infos" submodule:
python -m incognito infos --help
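When many files have to be processed, the command line shown above can be assembled from Python. The sketch below only builds the argument list, using the flags that appear in this section. build_command is a hypothetical helper, and the values passed for the strategy and mask are left to the caller because the accepted names are listed by --help.

```python
import sys


def build_command(input_path, output_path, strategies, mask, infos):
    """Build the argument list for the "infos" submodule (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "incognito",
        "--input", input_path,
        "--output", output_path,
        "--strategies", strategies,
        "--mask", mask,
        "infos",
    ]
    # Only pass the fields that actually have a value.
    for key, value in infos.items():
        if value:
            cmd += [f"--{key}", value]
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True) once Incognito is installed in the same interpreter.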
Annotation
python -m incognito --input myinputfile.txt --output annotationfile.ann --strategies mystrategies --annotate myannotator infos --first_name Bob --last_name Dylan --birthdate 1800-01-01 --ipp 0987654312 --postal_code 75001
Unit Tests
Unit tests are included to ensure the module's functionality. You can modify them based on your needs.
To run the tests:
make test
To check code coverage:
make cov
Anonymization Process Details
Regex Strategy
One available anonymization strategy is Regex. It can extract and mask specific information from the input text, such as:
- Email addresses
- Phone numbers
- French NIR (social security number)
- First and last names (if preceded by titles like "Monsieur", "Madame", "Mr", "Mme", "Docteur", "Professeur", etc.)
For more details, see the RegexStrategy class and the self.title_regex variable.
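To give an idea of how title-based matching can work, here is a minimal sketch using Python's re module. The patterns, the helper name mask_with_regex and the <EMAIL> placeholder are illustrative assumptions. They are not the patterns used by RegexStrategy, which may be broader or stricter.

```python
import re

# Illustrative patterns only; the library's own expressions may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
TITLE_RE = re.compile(
    r"\b(Monsieur|Madame|Mme|Mr|Docteur|Professeur)\.?\s+"
    r"([A-ZÀ-Ý][\w'-]*(?:\s+[A-ZÀ-Ý][\w'-]*)?)"
)


def mask_with_regex(text):
    """Mask e-mail addresses, then one or two capitalized words after a title."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    # Keep the title itself and replace only the name that follows it.
    return TITLE_RE.sub(lambda m: f"{m.group(1)} <NOM>", text)
```

Because the name is only recognized after a title, a capitalized disease name elsewhere in the sentence is left untouched by this sketch.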
PII Strategy
This strategy catches the patient's personal information.
You can supply that information on the command line with the infos submodule or in a JSON file.
For further examples, see the CLI section.
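The idea of masking values that are already known can be sketched as follows. This is an assumption about the general approach and not the library's implementation. mask_known_values and the placeholder names are hypothetical, and the sketch does not handle alternative date formats.

```python
import re


def mask_known_values(text, infos, placeholders):
    """Replace each non-empty known value with its placeholder (hypothetical helper)."""
    for key, value in infos.items():
        if not value:
            continue  # skip fields the user left empty
        # Whole-word, case-insensitive match of the literal value.
        pattern = re.compile(rf"\b{re.escape(value)}\b", re.IGNORECASE)
        text = pattern.sub(placeholders.get(key, "<PII>"), text)
    return text
```

Escaping the value with re.escape keeps characters such as dots or hyphens in names from being read as regex syntax.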
Lossy Strategy
Another available anonymization strategy is Lossy. It masks patterns such as DUPONT Marc or Marc DUPONT.
Warning:
This strategy can produce false positives. It may match unexpected text and cause loss of information in your document.
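The sketch below shows one way such a pattern could be written, and why it is lossy. The expressions and the helper mask_lossy are assumptions for illustration, not the code of LossyStrategy.

```python
import re

# An all-caps word (e.g. DUPONT) and a capitalized word (e.g. Marc), in either order.
UPPER = r"[A-ZÀ-Ý]{2,}(?:-[A-ZÀ-Ý]{2,})*"
CAPITALIZED = r"[A-ZÀ-Ý][a-zà-ÿ]+(?:-[A-ZÀ-Ý][a-zà-ÿ]+)*"
LOSSY_RE = re.compile(
    rf"\b(?:{UPPER}\s+{CAPITALIZED}|{CAPITALIZED}\s+{UPPER})\b"
)


def mask_lossy(text, placeholder="<NOM>"):
    """Mask every 'SURNAME Firstname' or 'Firstname SURNAME' look-alike."""
    return LOSSY_RE.sub(placeholder, text)
```

With this sketch, "Scanner IRM normal" becomes "<NOM> normal": a capitalized word followed by an acronym looks exactly like a name, which is the kind of false positive the warning refers to.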
Get the matched entities
To check what the anonymizer did, you can retrieve the matched entities with the get_entities() method:
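from incognito import anonymizer
ano = anonymizer.Anonymizer()
ano.add_analyzer("regex")
ano.add_analyzer("lossy")
ano.set_mask("placeholder")
# Read the text, anonymize it, then inspect what was matched
text = ano.open_text_file("/path/to/file.txt")
anonymized_text = ano.anonymize(text)
entities = ano.get_entities()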
The output is a list of this form:
[
    {"original": "DUPONT", "replacement": "<NOM>", "type": "NOM", "start": 42, "end": 49},
    {"original": "01/01/1970", "replacement": "<DATE>", "type": "DATE", "start": 80, "end": 90},
]
For more details, see the LossyStrategy class.
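Once the list is available, it can be summarized with a few lines of standard-library Python. The sketch below assumes the dictionary keys shown in the sample output. count_by_type and format_report are hypothetical helpers, not part of Incognito.

```python
from collections import Counter

# Sample copied from the output format shown above.
entities = [
    {"original": "DUPONT", "replacement": "<NOM>", "type": "NOM", "start": 42, "end": 49},
    {"original": "01/01/1970", "replacement": "<DATE>", "type": "DATE", "start": 80, "end": 90},
]


def count_by_type(entities):
    """Count how many entities of each type were matched."""
    return Counter(e["type"] for e in entities)


def format_report(entities):
    """One human-readable line per entity, in the order returned."""
    return [
        f'{e["start"]}-{e["end"]} {e["type"]}: {e["original"]} -> {e["replacement"]}'
        for e in entities
    ]
```

A per-type count is a quick way to spot a strategy that matches far more than expected, for instance after enabling the lossy analyzer.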
Annotation Process Details
Standoff Strategy
You can create an annotation file in the Standoff format.
The file is generated automatically from the matched entities.
Examples are available in the CLI and Python API sections.
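For orientation, a text-bound annotation in the brat standoff format is a line made of an identifier, a tab, the type with start and end offsets, a tab, and the covered text. The sketch below builds such lines from entity dictionaries. Whether Incognito writes exactly these lines is an assumption, and locate_entity and to_standoff are hypothetical helpers.

```python
def locate_entity(text, value, entity_type):
    """Find the first occurrence of value and describe it as an entity dict."""
    start = text.index(value)
    return {
        "original": value,
        "type": entity_type,
        "start": start,
        "end": start + len(value),  # end offset is exclusive
    }


def to_standoff(entities):
    """Render entities as brat-style text-bound annotation lines (T1, T2, ...)."""
    lines = []
    for i, e in enumerate(entities, start=1):
        lines.append(f'T{i}\t{e["type"]} {e["start"]} {e["end"]}\t{e["original"]}')
    return "\n".join(lines)
```

Because the offsets refer to the original text, the annotation file stays valid only as long as the source file is left unchanged.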
License
This project is licensed under the terms of the MIT License.
Contributors
- Maintainer: Micropot
Feel free to open issues or contribute via pull requests!
File details
Details for the file incognito_anonymizer-1.4.6.tar.gz.
File metadata
- Download URL: incognito_anonymizer-1.4.6.tar.gz
- Upload date:
- Size: 123.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f36cb2bea48419100959b5442f3586b512a4fcf55b8995c3ebcc56410db98fcf |
| MD5 | da21a9eaf986bee366712ac413b6296d |
| BLAKE2b-256 | bad590d169de3f22995941e7993d935fd99beb8b93221563871108160118ca5e |
File details
Details for the file incognito_anonymizer-1.4.6-py3-none-any.whl.
File metadata
- Download URL: incognito_anonymizer-1.4.6-py3-none-any.whl
- Upload date:
- Size: 16.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c7ccc6e1aa179956be99d2e73b88650a4c367418ecc8028dcc07edbb60b5b155 |
| MD5 | aee43aaae9df280f95b712a3b4b2e75e |
| BLAKE2b-256 | b4589f307e422cd4274f38d95ab5f50d1bbd4c7e9cac5bf5f3568c1606ee6f22 |