High-performance French medical text anonymization
unpii
High-performance French medical text anonymization library. Rust core with Python bindings and native Polars integration.
Designed to process millions of documents efficiently.
Installation
pip install unpii
Quick Start
import unpii
text = "Dr Martin au 06 12 34 56 78, email: martin@chu-brest.fr"
# Mask with placeholders (default)
unpii.mask(text)
# → "Dr <NOM> au <TELEPHONE>, email: <EMAIL>"
# Mask with stars
unpii.mask(text, mask="stars")
# → "Dr ***** au *****, email: *****"
Detection Modes
Two detection levels: standard (reliable patterns) and paranoid (aggressive).
# Standard: titles, known patterns, blacklisted names
unpii.mask("Dr Martin est ici")
# → "Dr <NOM> est ici"
unpii.mask("DUPONT Jean est ici")
# → "DUPONT Jean est ici" (not detected in standard)
# Paranoid: also catches UPPERCASE Titlecase patterns, 5+ digit sequences, loose emails
unpii.mask("DUPONT Jean est ici", mode="paranoid")
# → "<NOM> est ici"
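To make the difference between the two modes concrete, here is a minimal sketch of what an "UPPERCASE Titlecase" rule could look like as a regular expression. The pattern and the helper `mask_upper_title` are hypothetical illustrations; the actual rules live in unpii's Rust core and may well differ.

```python
import re

# Hypothetical approximation of the paranoid "UPPERCASE Titlecase" rule;
# this is an assumption for illustration, not unpii's real pattern.
UPPER = "A-ZÀ-ÖØ-Ý"
LOWER = "a-zà-öø-ÿ"
UPPER_TITLE = re.compile(
    rf"\b[{UPPER}]{{2,}}(?:-[{UPPER}]{{2,}})*\s+[{UPPER}][{LOWER}]+\b"
)


def mask_upper_title(text: str) -> str:
    """Replace every 'SURNAME Firstname' match with the <NOM> placeholder."""
    return UPPER_TITLE.sub("<NOM>", text)


print(mask_upper_title("DUPONT Jean est ici"))  # <NOM> est ici
print(mask_upper_title("Dr Martin est ici"))    # unchanged: no all-caps word
```

A rule this loose also fires on non-name text such as an acronym followed by a capitalized word, which is presumably why it is reserved for paranoid mode.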
Ignore Groups
Skip specific categories:
unpii.mask("Dr Martin au 06 12 34 56 78", ignore_groups=["TELEPHONE"])
# → "Dr <NOM> au 06 12 34 56 78"
Inspect Detected Spans
Dry-run mode to see what would be masked:
for span in unpii.find_spans("Dr Martin au 06 12 34 56 78"):
    print(span)
# Span(start=3, end=9, category="NOM")
# Span(start=13, end=27, category="TELEPHONE")
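Spans like these can also be applied by hand, for example to use a custom replacement. The sketch below uses a stand-in `Span` tuple that mirrors the documented attributes, and a hypothetical helper `apply_spans`. It assumes offsets are character offsets with an exclusive end and that spans do not overlap; the example above is consistent with that, but the library does not state it.

```python
from typing import Iterable, NamedTuple


class Span(NamedTuple):
    """Stand-in mirroring the documented .start, .end, .category attributes."""
    start: int
    end: int
    category: str


def apply_spans(text: str, spans: Iterable[Span], mask: str = "placeholder") -> str:
    """Rebuild the text, replacing each span (assumed non-overlapping)."""
    out = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        out.append(text[cursor:span.start])  # keep text before the span
        out.append(f"<{span.category}>" if mask == "placeholder" else "*****")
        cursor = span.end                    # skip the masked characters
    out.append(text[cursor:])                # keep the tail
    return "".join(out)


text = "Dr Martin au 06 12 34 56 78"
spans = [Span(3, 9, "NOM"), Span(13, 27, "TELEPHONE")]
print(apply_spans(text, spans))                # Dr <NOM> au <TELEPHONE>
print(apply_spans(text, spans, mask="stars"))  # Dr ***** au *****
```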
Polars Integration
Native expression plugin — Polars handles parallelization automatically:
import polars as pl
import unpii
df = pl.DataFrame({"text": [
    "Dr Martin au 06 12 34 56 78",
    "Email: joe@chu-brest.fr",
    "Maladie de Parkinson",
]})
df.with_columns(pl.col("text").unpii.mask())
# ┌─────────────────────────┐
# │ text                    │
# ╞═════════════════════════╡
# │ Dr <NOM> au <TELEPHONE> │
# │ Email: <EMAIL>          │
# │ Maladie de Parkinson    │ ← protected by whitelist
# └─────────────────────────┘
# With options
df.with_columns(
    pl.col("text").unpii.mask(mask="stars", mode="paranoid", ignore_groups=["TELEPHONE"])
)
Categories
| Group | Placeholder | Standard | Paranoid |
|---|---|---|---|
| NOM | <NOM> | Titles + name, blacklist | UPPERCASE/Titlecase patterns, initials |
| TELEPHONE | <TELEPHONE> | French phone numbers | - |
| EMAIL | <EMAIL> | Valid emails | Anything with @ |
| DATE | <DATE> | DD/MM/YYYY, literal months, ISO | - |
| BIRTHDATE | <BIRTHDATE> | né(e) le + date | - |
| ADRESSE | <ADRESSE> | Street number + type + name | - |
| CODE_POSTAL | <CODE_POSTAL> | 5 digits + city name | - |
| NIR | <NIR> | French social security number | - |
| IBAN | <IBAN> | French IBAN | - |
| NUMBER | <NUMBER> | - | 5+ consecutive digits |
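As a rough illustration of how two of these categories could be expressed, the sketch below writes the TELEPHONE and NUMBER rules as Python regular expressions. Both patterns are assumptions made for this example; they are not the rules embedded in unpii.

```python
import re

# Illustrative patterns only (assumptions, not unpii's embedded rules).
# TELEPHONE: "0X" or "+33 X" followed by four digit pairs, optional separators.
TELEPHONE = re.compile(r"(?<!\d)(?:\+33\s?|0)[1-9](?:[\s.-]?\d{2}){4}(?!\d)")
# NUMBER (paranoid only): any run of five or more consecutive digits.
NUMBER = re.compile(r"\d{5,}")

phone = TELEPHONE.search("Dr Martin au 06 12 34 56 78")
print(phone.span())                       # (13, 27)
print(NUMBER.findall("Dossier 1234567"))  # ['1234567']
print(NUMBER.findall("06 12 34 56 78"))   # [] since no run reaches five digits
```

The NUMBER pattern matches anything numeric and long enough, including record identifiers and lab values, which fits its placement in the Paranoid column only.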
Whitelist
Medical eponyms (Parkinson, Alzheimer, Verneuil...) are protected from masking by a global whitelist, regardless of which group detected them.
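The idea of a global whitelist can be sketched as a filter applied after detection: any candidate span whose text is a protected term is dropped before masking. The set contents, the tuple layout, and the helper name `drop_whitelisted` are hypothetical; how unpii actually matches whitelist entries (case handling, multi-word eponyms) is not described here.

```python
# Hypothetical whitelist; unpii embeds its own, larger dictionary.
WHITELIST = {"parkinson", "alzheimer", "verneuil"}


def drop_whitelisted(text, spans):
    """Keep only (start, end, category) spans whose text is not whitelisted."""
    return [
        (start, end, category)
        for (start, end, category) in spans
        if text[start:end].lower() not in WHITELIST
    ]


# A name detector might flag "Parkinson" as NOM; the whitelist removes it.
print(drop_whitelisted("Maladie de Parkinson", [(11, 20, "NOM")]))  # []
print(drop_whitelisted("Dr Martin est ici", [(3, 9, "NOM")]))       # [(3, 9, 'NOM')]
```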
API Reference
def mask(
    text: str,
    *,
    mask: str = "placeholder",  # "placeholder" or "stars"
    mode: str = "standard",  # "standard" or "paranoid"
    ignore_groups: list[str] | None = None,
) -> str: ...
def find_spans(
    text: str,
    *,
    mode: str = "standard",
    ignore_groups: list[str] | None = None,
) -> list[Span]: ...
# Span attributes: .start, .end, .category
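Because each span exposes `.category`, the result of `find_spans` lends itself to simple auditing, such as counting detections per group across a document. The sketch below uses a stand-in `Span` type with the documented attributes so that it runs without the library; `count_categories` is a hypothetical helper, not part of unpii.

```python
from collections import Counter
from typing import NamedTuple


class Span(NamedTuple):
    """Stand-in with the documented attributes: .start, .end, .category."""
    start: int
    end: int
    category: str


def count_categories(spans):
    """Tally how many spans were detected per category."""
    return Counter(span.category for span in spans)


# In real use, the list would come from unpii.find_spans(text).
detected = [Span(3, 9, "NOM"), Span(13, 27, "TELEPHONE"), Span(40, 46, "NOM")]
counts = count_categories(detected)
print(counts["NOM"], counts["TELEPHONE"])  # 2 1
```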
Performance
Rust core with compiled regex and Aho-Corasick automata. All rules and dictionaries are embedded in the binary, so there is no file I/O at runtime.
Single-threaded: ~10 ms per 7 KB document (~100 docs/sec). With Polars, the work is parallelized automatically across all cores.
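For capacity planning, the single-threaded figure translates into a back-of-the-envelope estimate. The helper below is a hypothetical sketch that assumes ideal linear scaling across cores, which real workloads rarely achieve, so treat the multi-core result as a lower bound on wall-clock time.

```python
def estimated_hours(n_docs: int, docs_per_sec: float = 100.0, cores: int = 1) -> float:
    """Wall-clock hours to process n_docs, assuming perfect scaling over cores."""
    seconds = n_docs / (docs_per_sec * cores)
    return seconds / 3600


# One million ~7 KB documents at ~100 docs/sec per core.
print(round(estimated_hours(1_000_000), 2))           # 2.78
print(round(estimated_hours(1_000_000, cores=8), 2))  # 0.35
```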
License
MIT
File details
Details for the file unpii-0.1.0.tar.gz.
File metadata
- Download URL: unpii-0.1.0.tar.gz
- Upload date:
- Size: 211.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cf37c055557cdd5bbfa0885460f5a9effd5a0bb01627f7b6c32272d5d170c9a9 |
| MD5 | d3ea7c6407043f6cc5e0404aa00654bf |
| BLAKE2b-256 | b8c0fb008c3cfe23e9d4b200c41346a73699400a09881b3bf4b5e038aa38ed97 |
File details
Details for the file unpii-0.1.0-cp310-abi3-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: unpii-0.1.0-cp310-abi3-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 5.1 MB
- Tags: CPython 3.10+, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 12a98d9fe5f47f7f8d3e5c1ee8a9830d728c4fca8331ee0276705ce1546d39bd |
| MD5 | 798aa352d2ffba0960509cb352f5fa9d |
| BLAKE2b-256 | 06bf98163f588a640b086296b1b2842f3b02757983c86556e5058e5261a19d38 |