High-performance French medical text anonymization
unpii
High-performance French medical text anonymization library. Rust core with Python bindings.
Designed to process millions of documents efficiently. Inspired by Micropot/incognito.
Installation
pip install unpii
# With Polars support
pip install unpii[polars]
Quick Start
import unpii
text = "Dr Martin au 06 12 34 56 78, email: martin@chu-brest.fr"
# Anonymize with placeholders (default)
unpii.anonymize(text)
# → "Dr <NOM> au <TELEPHONE>, email: <EMAIL>"
# Anonymize with stars
unpii.anonymize(text, style="stars")
# → "Dr ***** au *****, email: *****"
Detection Modes
Two detection levels: standard (reliable patterns) and paranoid (aggressive).
# Standard: titles, known patterns, blacklisted names
unpii.anonymize("Dr Martin est ici")
# → "Dr <NOM> est ici"
unpii.anonymize("DUPONT Jean est ici")
# → "DUPONT Jean est ici" (not detected in standard)
# Paranoid: also catches UPPERCASE Titlecase patterns, 5+ digit sequences, loose emails
unpii.anonymize("DUPONT Jean est ici", mode="paranoid")
# → "<NOM> est ici"
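To make the difference between the two modes concrete, here is a minimal sketch of what two of the paranoid rules could look like as plain regular expressions. The patterns and the `paranoid_sketch` helper are assumptions written for illustration; they are not unpii's actual rules, which live in the Rust core.

```python
import re

# Illustrative approximations only; these are NOT unpii's actual rules.
UPPER_TITLE = re.compile(r"\b[A-Z]{2,} [A-Z][a-z]+\b")  # e.g. "DUPONT Jean"
LONG_DIGITS = re.compile(r"\d{5,}")                      # 5+ consecutive digits


def paranoid_sketch(text: str) -> str:
    """Apply the two approximate paranoid rules in sequence."""
    text = UPPER_TITLE.sub("<NOM>", text)
    return LONG_DIGITS.sub("<NUMBER>", text)


print(paranoid_sketch("DUPONT Jean est ici"))
print(paranoid_sketch("Dossier 1234567 ouvert"))
print(paranoid_sketch("Dr Martin est ici"))
```

The ASCII-only character classes keep the sketch short; a real rule set would also have to handle accented letters and hyphenated names.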
Custom Words to Mask
Pass additional words to mask per call. Useful when you know the patient's name:
unpii.anonymize("bob dylan est ici", mask=["bob", "dylan"])
# → "<PII> <PII> est ici"
Case-insensitive with word boundary checks:
unpii.anonymize("Bonjour Bob", mask=["bob"])
# → "Bonjour <PII>"
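The matching behaviour described above (case-insensitive, whole words only) can be approximated in pure Python. The `mask_words` helper below is a hypothetical stand-in built on the standard `re` module, not the library's implementation, and its exact boundary rules may differ from unpii's.

```python
import re


def mask_words(text: str, words: list[str], placeholder: str = "<PII>") -> str:
    """Replace whole-word, case-insensitive occurrences of `words`."""
    words = [w for w in words if w]  # ignore empty entries
    if not words:
        return text
    # Longest first so multi-word entries win over their own prefixes.
    alternatives = "|".join(
        re.escape(w) for w in sorted(words, key=len, reverse=True)
    )
    # Lookarounds reject matches glued to other word characters.
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)
    return pattern.sub(placeholder, text)


print(mask_words("Bonjour Bob", ["bob"]))
print(mask_words("bob dylan est ici", ["bob", "dylan"]))
print(mask_words("Le bobsleigh", ["bob"]))  # "bob" is inside a longer word
```

In this sketch the word-boundary check is what keeps "bobsleigh" intact while "Bob" is masked.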
Ignore Groups
Skip specific categories:
unpii.anonymize("Dr Martin au 06 12 34 56 78", ignore_groups=["TELEPHONE"])
# → "Dr <NOM> au 06 12 34 56 78"
Inspect Detected Spans
Dry-run mode to see what would be masked:
for span in unpii.find_spans("Dr Martin au 06 12 34 56 78"):
    print(span)
# Span(start=3, end=9, category="NOM")
# Span(start=13, end=27, category="TELEPHONE")
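The offsets in the output above line up with Python slicing: text[3:9] is "Martin" and text[13:27] is the phone number. The sketch below shows how such spans could be applied by hand, for instance to use custom replacement text. The `Span` dataclass is a stand-in that mirrors the documented `.start`, `.end`, and `.category` attributes, and `apply_spans` is a hypothetical helper. It assumes end-exclusive character offsets and non-overlapping spans; whether offsets are characters or bytes for non-ASCII text is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Stand-in mirroring the documented .start, .end, .category attributes."""
    start: int
    end: int
    category: str


def apply_spans(text: str, spans: list[Span]) -> str:
    """Replace each [start, end) slice with <CATEGORY>, assuming no overlaps."""
    out, cursor = [], 0
    for span in sorted(spans, key=lambda s: s.start):
        out.append(text[cursor:span.start])  # untouched text before the span
        out.append(f"<{span.category}>")
        cursor = span.end
    out.append(text[cursor:])  # untouched tail
    return "".join(out)


text = "Dr Martin au 06 12 34 56 78"
spans = [Span(3, 9, "NOM"), Span(13, 27, "TELEPHONE")]
print(apply_spans(text, spans))
```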
DataFrame Integration
anonymize_dataframe anonymizes a column in a Polars DataFrame:
import polars as pl
import unpii
df = pl.DataFrame({"text": [
"Dr Martin au 06 12 34 56 78",
"Email: joe@chu-brest.fr",
"Maladie de Parkinson",
]})
# Anonymize in place (overwrites the column)
df = unpii.anonymize_dataframe(df, "text")
# ┌─────────────────────────┐
# │ text │
# ╞═════════════════════════╡
# │ Dr <NOM> au <TELEPHONE> │
# │ Email: <EMAIL> │
# │ Maladie de Parkinson │ ← protected by whitelist
# └─────────────────────────┘
# Write to a new column
df = unpii.anonymize_dataframe(df, "text", new_column="text_anonymized")
# With options
df = unpii.anonymize_dataframe(df, "text", style="stars", mode="paranoid", ignore_groups=["TELEPHONE"])
Per-row words to mask (mask_from_columns)
Pass column names whose values are added as words to mask, per row. Useful when the patient's name or city is stored in structured columns:
df = pl.DataFrame({
"text": ["bob est ici", "alice va bien"],
"nom": ["bob", "alice"],
})
df = unpii.anonymize_dataframe(df, "text", mask_from_columns=["nom"])
# ┌─────────────────┬───────┐
# │ text ┆ nom │
# ╞═════════════════╪═══════╡
# │ <PII> est ici ┆ bob │
# │ <PII> va bien ┆ alice │
# └─────────────────┴───────┘
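The per-row scoping can be illustrated without Polars. In the sketch below each text is masked only with the words from its own row, so "bob" survives in the second row. The `mask_row` helper, the `villes` column, and the choice to skip missing values are assumptions made for illustration, not documented unpii behaviour.

```python
import re

texts = ["bob est ici", "alice a vu bob"]
noms = ["bob", "alice"]
villes = ["Brest", None]  # hypothetical second column with a missing value


def mask_row(text: str, words: list) -> str:
    """Mask whole-word, case-insensitive matches of this row's words only."""
    for word in words:
        if word:  # assumption: missing or empty values are skipped
            pattern = rf"(?<!\w){re.escape(word)}(?!\w)"
            text = re.sub(pattern, "<PII>", text, flags=re.IGNORECASE)
    return text


# zip pairs each text with the values from the same row of every column.
masked = [mask_row(t, row_words) for t, *row_words in zip(texts, noms, villes)]
print(masked)
```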
Global words to mask (mask)
Words to mask on every row (e.g. the doctor who wrote all reports):
df = unpii.anonymize_dataframe(df, "text", mask=["Dupont", "Cabinet Santé Plus"])
Both combined
df = unpii.anonymize_dataframe(
    df,
    "text",
    mask_from_columns=["nom", "ville"],
    mask=["Dupont"],
    style="stars",
)
Low-level: anonymize_series
Operates on a Polars Series directly:
masked = unpii.anonymize_series(
    df["text"],
    mask_from_columns=[df["nom"], df["ville"]],
    mask=["Dupont"],
)
df = df.with_columns(masked.alias("text_anonymized"))
Batch processing: anonymize_batch
Operates on plain Python lists (no Polars dependency):
results = unpii.anonymize_batch(["Dr Martin ici", "Email: a@b.fr"])
# → ["Dr <NOM> ici", "Email: <EMAIL>"]
Threading
Control the number of threads used by anonymize_batch, anonymize_series, and anonymize_dataframe:
unpii.set_max_threads(4) # Use 4 threads
unpii.get_max_threads() # → 4
unpii.set_max_threads(0) # Use all available cores (default)
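Because the thread count is a global setting, it can be convenient to change it only for one block of work. The sketch below wraps the two documented functions in a context manager. `thread_limit` is a hypothetical helper, and `fake_unpii` is a stand-in object so the example runs without the package installed. It also assumes that passing the value returned by `get_max_threads()` back to `set_max_threads()` restores the earlier behaviour.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def thread_limit(lib, n: int):
    """Temporarily call lib.set_max_threads(n), then restore the prior value."""
    previous = lib.get_max_threads()
    lib.set_max_threads(n)
    try:
        yield
    finally:
        lib.set_max_threads(previous)  # runs even if the block raises


# Stand-in object so the sketch runs without unpii installed.
_state = {"threads": 0}
fake_unpii = SimpleNamespace(
    get_max_threads=lambda: _state["threads"],
    set_max_threads=lambda n: _state.update(threads=n),
)

with thread_limit(fake_unpii, 4):
    inside = fake_unpii.get_max_threads()  # value while the block runs
after = fake_unpii.get_max_threads()       # value once the block has exited
print(inside, after)
```

With the real package, the call would presumably be `with thread_limit(unpii, 4): ...` around the batch calls.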
Categories
| Group | Placeholder | Standard | Paranoid |
|---|---|---|---|
| NOM | <NOM> | Titles + name, blacklist | UPPERCASE/Titlecase patterns, initials |
| TELEPHONE | <TELEPHONE> | French phone numbers | — |
| EMAIL | <EMAIL> | Valid emails | Anything with @ |
| DATE | <DATE> | DD/MM/YYYY, literal months, ISO | — |
| BIRTHDATE | <BIRTHDATE> | né(e) le + date | — |
| ADRESSE | <ADRESSE> | Street number + type + name | — |
| CODE_POSTAL | <CODE_POSTAL> | 5 digits + city name | — |
| NIR | <NIR> | French social security number | — |
| IBAN | <IBAN> | French IBAN | — |
| NUMBER | <NUMBER> | — | 5+ consecutive digits |
| PII | <PII> | Custom words passed via mask= | — |
Whitelist
Medical eponyms (Parkinson, Alzheimer, Verneuil...) are protected from masking by a global whitelist, regardless of which group detected them.
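One way to picture the whitelist is as a filter applied to candidate spans after detection. The sketch below is an assumption about the general idea, not the library's code: `WHITELIST` is a tiny illustrative subset, spans are plain `(start, end, category)` tuples, and `drop_whitelisted` is a hypothetical helper.

```python
WHITELIST = {"parkinson", "alzheimer", "verneuil"}  # tiny illustrative subset


def drop_whitelisted(text: str, spans: list[tuple[int, int, str]]):
    """Keep only the spans whose covered text is not a whitelisted eponym."""
    return [
        (start, end, category)
        for start, end, category in spans
        if text[start:end].lower() not in WHITELIST
    ]


text = "Dr Martin, maladie de Parkinson"
candidates = [(3, 9, "NOM"), (22, 31, "NOM")]  # "Martin" and "Parkinson"
print(drop_whitelisted(text, candidates))
```

Filtering on the covered text rather than on the category mirrors the statement that the whitelist applies regardless of which group produced the match.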
API Reference
# Single text
def anonymize(text, *, style="placeholder", mode="standard", ignore_groups=None, mask=None) -> str
def find_spans(text, *, mode="standard", ignore_groups=None, mask=None) -> list[Span]
# Batch (plain Python lists, no Polars needed)
def anonymize_batch(texts, *, mask_from_columns=None, mask=None, style="placeholder", mode="standard", ignore_groups=None) -> list[str | None]
# DataFrame (requires polars)
def anonymize_dataframe(df, column, *, mask_from_columns=None, mask=None, new_column=None, style="placeholder", mode="standard", ignore_groups=None) -> DataFrame
def anonymize_series(series, *, mask_from_columns=None, mask=None, style="placeholder", mode="standard", ignore_groups=None) -> Series
# Threading
def set_max_threads(n: int) -> None # 0 = all cores (default)
def get_max_threads() -> int
# Span attributes: .start, .end, .category
Performance
Rust core with compiled regex and Aho-Corasick automata. All rules and dictionaries are embedded in the binary — zero I/O at runtime.
anonymize_batch, anonymize_series, and anonymize_dataframe use rayon for automatic parallelization across all cores.
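For corpora too large to hold in one list, a common pattern is to feed the batch function fixed-size chunks from an iterator. The sketch below shows that pattern in isolation. `anonymize_stream` and the chunk size are hypothetical, and `fake_batch` stands in for `unpii.anonymize_batch` so the example runs without the package installed.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def anonymize_stream(
    texts: Iterable[str],
    batch_fn: Callable[[list[str]], list],
    chunk_size: int = 10_000,
) -> Iterator:
    """Yield results of batch_fn applied to successive chunks of `texts`."""
    iterator = iter(texts)
    # islice pulls at most chunk_size items; an empty list ends the loop.
    while chunk := list(islice(iterator, chunk_size)):
        yield from batch_fn(chunk)


def fake_batch(chunk: list[str]) -> list[str]:
    """Stand-in for a real batch anonymizer: just upper-cases each text."""
    return [t.upper() for t in chunk]


results = list(anonymize_stream(["a", "b", "c"], fake_batch, chunk_size=2))
print(results)
```

With the real package, `batch_fn` would presumably be `unpii.anonymize_batch`, so each chunk still benefits from the library's own parallelism.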
License
MIT
See also
- https://github.com/micropot/incognito
- https://github.com/microsoft/presidio
File details
Details for the file unpii-0.3.1.tar.gz.
File metadata
- Download URL: unpii-0.3.1.tar.gz
- Upload date:
- Size: 200.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | adc98555c369b98a5afbce1dd2edfc7b9abdea2f768cc6b98123edbe8d4f5bc3 |
| MD5 | 6ec4b175d4ed178b624167b268f2902d |
| BLAKE2b-256 | b38a599247eeef8663aac4e7d07d85ce0ee1c3aa799959bb162682f237174e09 |
File details
Details for the file unpii-0.3.1-cp310-abi3-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: unpii-0.3.1-cp310-abi3-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 1.5 MB
- Tags: CPython 3.10+, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 22c3c52d7c43aaa868c1ccf247dd449e4650da35464e8b6af408af1efb0d944b |
| MD5 | aa5475b3b5ecfd004aa16c1cf59389b7 |
| BLAKE2b-256 | 2d74e588a1e25524a0442fa10232a5ecb7554ea5227d012b26efb52a83e58d32 |