Skip to main content

High-performance French medical text anonymization

Project description

unpii

High-performance French medical text anonymization library. Rust core with Python bindings.

Designed to process millions of documents efficiently. Inspired by Micropot/incognito

Installation

pip install unpii

# With Polars support
pip install unpii[polars]

Quick Start

import unpii

text = "Dr Martin au 06 12 34 56 78, email: martin@chu-brest.fr"

# Anonymize with placeholders (default)
unpii.anonymize(text)
# → "Dr <NOM> au <TELEPHONE>, email: <EMAIL>"

# Anonymize with stars
unpii.anonymize(text, style="stars")
# → "Dr ***** au *****, email: *****"

Detection Modes

Two detection levels: standard (reliable patterns) and paranoid (aggressive).

# Standard: titles, known patterns, blacklisted names
unpii.anonymize("Dr Martin est ici")
# → "Dr <NOM> est ici"

unpii.anonymize("DUPONT Jean est ici")
# → "DUPONT Jean est ici"  (not detected in standard)

# Paranoid: also catches UPPERCASE Titlecase patterns, 5+ digit sequences, loose emails
unpii.anonymize("DUPONT Jean est ici", mode="paranoid")
# → "<NOM> est ici"

Custom Words to Mask

Pass additional words to mask per call. Useful when you know the patient's name:

unpii.anonymize("bob dylan est ici", mask=["bob", "dylan"])
# → "<PII> <PII> est ici"

Case-insensitive with word boundary checks:

unpii.anonymize("Bonjour Bob", mask=["bob"])
# → "Bonjour <PII>"

Ignore Groups

Skip specific categories:

unpii.anonymize("Dr Martin au 06 12 34 56 78", ignore_groups=["TELEPHONE"])
# → "Dr <NOM> au 06 12 34 56 78"

Inspect Detected Spans

Dry-run mode to see what would be masked:

for span in unpii.find_spans("Dr Martin au 06 12 34 56 78"):
    print(span)
# Span(start=3, end=9, category="NOM")
# Span(start=13, end=27, category="TELEPHONE")

DataFrame Integration

anonymize_dataframe anonymizes a column in a Polars DataFrame:

import polars as pl
import unpii

df = pl.DataFrame({"text": [
    "Dr Martin au 06 12 34 56 78",
    "Email: joe@chu-brest.fr",
    "Maladie de Parkinson",
]})

# Anonymize in place (overwrites the column)
df = unpii.anonymize_dataframe(df, "text")
# ┌─────────────────────────┐
# │ text                    │
# ╞═════════════════════════╡
# │ Dr <NOM> au <TELEPHONE> │
# │ Email: <EMAIL>          │
# │ Maladie de Parkinson    │  ← protected by whitelist
# └─────────────────────────┘

# Write to a new column
df = unpii.anonymize_dataframe(df, "text", new_column="text_anonymized")

# With options
df = unpii.anonymize_dataframe(df, "text", style="stars", mode="paranoid", ignore_groups=["TELEPHONE"])

Per-row words to mask (mask_from_columns)

Pass column names whose values are added as words to mask, per row. Useful when patient name/city are in structured columns:

df = pl.DataFrame({
    "text": ["bob est ici", "alice va bien"],
    "nom": ["bob", "alice"],
})

df = unpii.anonymize_dataframe(df, "text", mask_from_columns=["nom"])
# ┌─────────────────┬───────┐
# │ text            ┆ nom   │
# ╞═════════════════╪═══════╡
# │ <PII> est ici   ┆ bob   │
# │ <PII> va bien   ┆ alice │
# └─────────────────┴───────┘

Global words to mask (mask)

Words to mask on every row (e.g. the doctor who wrote all reports):

df = unpii.anonymize_dataframe(df, "text", mask=["Dupont", "Cabinet Santé Plus"])

Both combined

df = unpii.anonymize_dataframe(df, "text",
    mask_from_columns=["nom", "ville"],
    mask=["Dupont"],
    style="stars",
)

Low-level: anonymize_series

Operates on a Polars Series directly:

masked = unpii.anonymize_series(
    df["text"],
    mask_from_columns=[df["nom"], df["ville"]],
    mask=["Dupont"],
)
df = df.with_columns(masked.alias("text_anonymized"))

Batch processing: anonymize_batch

Operates on plain Python lists (no Polars dependency):

results = unpii.anonymize_batch(["Dr Martin ici", "Email: a@b.fr"])
# → ["Dr <NOM> ici", "Email: <EMAIL>"]

Threading

Control the number of threads used by anonymize_batch, anonymize_series, and anonymize_dataframe:

unpii.set_max_threads(4)     # Use 4 threads
unpii.get_max_threads()      # → 4
unpii.set_max_threads(0)     # Use all available cores (default)

Categories

Group Placeholder Standard Paranoid
NOM <NOM> Titles + name, blacklist UPPERCASE/Titlecase patterns, initials
TELEPHONE <TELEPHONE> French phone numbers
EMAIL <EMAIL> Valid emails Anything with @
DATE <DATE> DD/MM/YYYY, literal months, ISO
BIRTHDATE <BIRTHDATE> né(e) le + date
ADRESSE <ADRESSE> Street number + type + name
CODE_POSTAL <CODE_POSTAL> 5 digits + city name
NIR <NIR> French social security number
IBAN <IBAN> French IBAN
NUMBER <NUMBER> 5+ consecutive digits
PII <PII> Custom words passed via mask=

Whitelist

Medical eponyms (Parkinson, Alzheimer, Verneuil...) are protected from masking by a global whitelist, regardless of which group detected them.

API Reference

# Single text
def anonymize(text, *, style="placeholder", mode="standard", ignore_groups=None, mask=None) -> str
def find_spans(text, *, mode="standard", ignore_groups=None, mask=None) -> list[Span]

# Batch (plain Python lists, no Polars needed)
def anonymize_batch(texts, *, mask_from_columns=None, mask=None, style="placeholder", mode="standard", ignore_groups=None) -> list[str | None]

# DataFrame (requires polars)
def anonymize_dataframe(df, column, *, mask_from_columns=None, mask=None, new_column=None, style="placeholder", mode="standard", ignore_groups=None) -> DataFrame
def anonymize_series(series, *, mask_from_columns=None, mask=None, style="placeholder", mode="standard", ignore_groups=None) -> Series

# Threading
def set_max_threads(n: int) -> None   # 0 = all cores (default)
def get_max_threads() -> int

# Span attributes: .start, .end, .category

Performance

Rust core with compiled regex and Aho-Corasick automata. All rules and dictionaries are embedded in the binary — zero I/O at runtime.

anonymize_batch, anonymize_series, and anonymize_dataframe use rayon for automatic parallelization across all cores.

License

MIT

See also

https://github.com/micropot/incognito https://github.com/microsoft/presidio

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

unpii-0.3.0.tar.gz (199.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

unpii-0.3.0-cp310-abi3-manylinux_2_34_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.10+manylinux: glibc 2.34+ x86-64

File details

Details for the file unpii-0.3.0.tar.gz.

File metadata

  • Download URL: unpii-0.3.0.tar.gz
  • Upload date:
  • Size: 199.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.12.6

File hashes

Hashes for unpii-0.3.0.tar.gz
Algorithm Hash digest
SHA256 27e090bc208d9f98cf0296d953d2f9ec625544efa56ab8165cbaf57a6e83dcb1
MD5 13b5dbf223d04a32ab1b59ca6d7732be
BLAKE2b-256 bb368692a4136928aa5cdbabcce55f62cc9dcd3e44a0fc6393980a0a724d4b10

See more details on using hashes here.

File details

Details for the file unpii-0.3.0-cp310-abi3-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for unpii-0.3.0-cp310-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 222c19c22cbc313ef2bc24405e3c1a1550a28421320980abf14101fa9b3c657f
MD5 602e404af17da66dac1b98167850837c
BLAKE2b-256 3633ea19c6ce0ebbdc4c59e6133af53a54d498047664e5064737fa224341ce51

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page