High-performance French medical text anonymization

unpii

High-performance French medical text anonymization library. Rust core with Python bindings.

Designed to process millions of documents efficiently. Inspired by Micropot/incognito.

Installation

pip install unpii

# With Polars support
pip install unpii[polars]

Quick Start

import unpii

text = "Dr Martin au 06 12 34 56 78, email: martin@chu-brest.fr"

# Anonymize with placeholders (default)
unpii.anonymize(text)
# → "Dr <NOM> au <TELEPHONE>, email: <EMAIL>"

# Anonymize with stars
unpii.anonymize(text, style="stars")
# → "Dr ***** au *****, email: *****"

Detection Modes

Two detection modes are available: standard (the default), which uses only reliable patterns, and paranoid, which adds more aggressive heuristics.

# Standard: titles, known patterns, blacklisted names
unpii.anonymize("Dr Martin est ici")
# → "Dr <NOM> est ici"

unpii.anonymize("DUPONT Jean est ici")
# → "DUPONT Jean est ici"  (not detected in standard)

# Paranoid: also catches UPPERCASE/Titlecase name patterns (e.g. "DUPONT Jean"), sequences of 5+ digits, loose emails
unpii.anonymize("DUPONT Jean est ici", mode="paranoid")
# → "<NOM> est ici"
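To make the difference between the two modes more concrete, here is a rough Python sketch of the kind of extra patterns paranoid mode is described as adding. The regular expressions and the name `paranoid_candidates` are illustrative assumptions, not unpii's actual rules, which live in the Rust core.

```python
import re

# Hypothetical approximations of two paranoid-mode heuristics described above.
# They are NOT the library's real rules.
UPPER_TITLE = re.compile(r"\b[A-ZÀ-ÖØ-Þ]{2,}\s+[A-ZÀ-ÖØ-Þ][a-zß-öø-ÿ]+\b")
LONG_DIGITS = re.compile(r"\d{5,}")


def paranoid_candidates(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, category) for spans matched by the sketch patterns."""
    found = [(m.start(), m.end(), "NOM") for m in UPPER_TITLE.finditer(text)]
    found += [(m.start(), m.end(), "NUMBER") for m in LONG_DIGITS.finditer(text)]
    return sorted(found)


print(paranoid_candidates("DUPONT Jean est ici, dossier 1234567"))
# [(0, 11, 'NOM'), (29, 36, 'NUMBER')]
```

A title-prefixed name such as "Dr Martin" does not match either sketch pattern, which mirrors the idea that such cases are left to the standard rules.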

Custom Words to Mask

Pass additional words to mask per call. Useful when you know the patient's name:

unpii.anonymize("bob dylan est ici", mask=["bob", "dylan"])
# → "<PII> <PII> est ici"

Matching is case-insensitive and respects word boundaries:

unpii.anonymize("Bonjour Bob", mask=["bob"])
# → "Bonjour <PII>"
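The matching behaviour described here (case-insensitive, whole words only) can be approximated in pure Python as follows. This is a sketch using the standard `re` module, not the library's implementation; `mask_words` is a hypothetical helper, and treating "bobsleigh" as a non-match is an assumption drawn from the phrase "word boundary checks".

```python
import re


def mask_words(text: str, words: list[str], placeholder: str = "<PII>") -> str:
    """Replace whole-word, case-insensitive occurrences of `words`."""
    words = [w for w in words if w]  # ignore empty entries
    if not words:
        return text
    # Longest first so multi-word entries win over their own prefixes.
    alternatives = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    # (?<!\w) and (?!\w) act as word boundaries around the whole entry.
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)
    return pattern.sub(placeholder, text)


print(mask_words("Bonjour Bob", ["bob"]))                  # Bonjour <PII>
print(mask_words("bob dylan est ici", ["bob", "dylan"]))   # <PII> <PII> est ici
print(mask_words("Le bobsleigh est ici", ["bob"]))         # Le bobsleigh est ici
```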

Ignore Groups

Pass ignore_groups to leave specific categories unmasked:

unpii.anonymize("Dr Martin au 06 12 34 56 78", ignore_groups=["TELEPHONE"])
# → "Dr <NOM> au 06 12 34 56 78"

Inspect Detected Spans

find_spans is a dry run: it returns the spans that would be masked without modifying the text:

for span in unpii.find_spans("Dr Martin au 06 12 34 56 78"):
    print(span)
# Span(start=3, end=9, category="NOM")
# Span(start=13, end=27, category="TELEPHONE")
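Spans like these are enough to rebuild the masked text yourself, for example to apply a custom rendering. The sketch below assumes that `start` and `end` are character offsets into the Python string (the source does not say whether they are characters or bytes) and uses a local `Span` stand-in rather than the library's class; `render` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Local stand-in mirroring the documented .start/.end/.category attributes."""
    start: int
    end: int
    category: str


def render(text, spans, *, style="placeholder", ignore_groups=()):
    """Rebuild `text` with each span replaced by <CATEGORY> or *****."""
    out, cursor = [], 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.category in ignore_groups or span.start < cursor:
            continue  # skipped group, or overlaps a span already rendered
        out.append(text[cursor:span.start])
        out.append("*****" if style == "stars" else f"<{span.category}>")
        cursor = span.end
    out.append(text[cursor:])
    return "".join(out)


text = "Dr Martin au 06 12 34 56 78"
spans = [Span(3, 9, "NOM"), Span(13, 27, "TELEPHONE")]
print(render(text, spans))                               # Dr <NOM> au <TELEPHONE>
print(render(text, spans, style="stars"))                # Dr ***** au *****
print(render(text, spans, ignore_groups=["TELEPHONE"]))  # Dr <NOM> au 06 12 34 56 78
```

With the real library, the `spans` list would come from `unpii.find_spans(text)` instead of being written by hand.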

DataFrame Integration

anonymize_dataframe anonymizes a column in a Polars DataFrame:

import polars as pl
import unpii

df = pl.DataFrame({"text": [
    "Dr Martin au 06 12 34 56 78",
    "Email: joe@chu-brest.fr",
    "Maladie de Parkinson",
]})

# Overwrite the column with its anonymized version (the default when new_column is not given)
df = unpii.anonymize_dataframe(df, "text")
# ┌─────────────────────────┐
# │ text                    │
# ╞═════════════════════════╡
# │ Dr <NOM> au <TELEPHONE> │
# │ Email: <EMAIL>          │
# │ Maladie de Parkinson    │  ← protected by whitelist
# └─────────────────────────┘

# Write to a new column
df = unpii.anonymize_dataframe(df, "text", new_column="text_anonymized")

# With options
df = unpii.anonymize_dataframe(df, "text", style="stars", mode="paranoid", ignore_groups=["TELEPHONE"])

Per-row words to mask (`mask_from_columns`)

Pass column names whose values are added as words to mask, per row. Useful when patient name/city are in structured columns:

df = pl.DataFrame({
    "text": ["bob est ici", "alice va bien"],
    "nom": ["bob", "alice"],
})

df = unpii.anonymize_dataframe(df, "text", mask_from_columns=["nom"])
# ┌─────────────────┬───────┐
# │ text            ┆ nom   │
# ╞═════════════════╪═══════╡
# │ <PII> est ici   ┆ bob   │
# │ <PII> va bien   ┆ alice │
# └─────────────────┴───────┘
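The per-row semantics can be pictured with plain Python lists: each row gets its own word list, built from that row's values in the given columns. This is only a sketch of the behaviour described above, using `re` as a stand-in for the library's matcher; `mask_rows` and its parameters are hypothetical names.

```python
import re


def mask_rows(texts, columns, mask=()):
    """Mask per-row column values plus global `mask` words in each text."""
    results = []
    for i, text in enumerate(texts):
        # Words for this row: the row's value in every column, then global words.
        words = [col[i] for col in columns if col[i]] + list(mask)
        if text is None or not words:
            results.append(text)
            continue
        alternatives = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
        pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)
        results.append(pattern.sub("<PII>", text))
    return results


texts = ["bob est ici", "alice va bien"]
nom = ["bob", "alice"]
print(mask_rows(texts, [nom]))
# ['<PII> est ici', '<PII> va bien']
```

Note that a value from one row is not applied to other rows: "alice" is only masked in the row where the `nom` column contains "alice".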

Global words to mask (`mask`)

Words to mask on every row (e.g. the doctor who wrote all reports):

df = unpii.anonymize_dataframe(df, "text", mask=["Dupont", "Cabinet Santé Plus"])

Both combined

# Assumes df also has a "ville" column (not present in the example above)
df = unpii.anonymize_dataframe(df, "text",
    mask_from_columns=["nom", "ville"],
    mask=["Dupont"],
    style="stars",
)

Low-level: `anonymize_series`

Operates on a Polars Series directly:

# Assumes df has "nom" and "ville" columns
masked = unpii.anonymize_series(
    df["text"],
    mask_from_columns=[df["nom"], df["ville"]],
    mask=["Dupont"],
)
df = df.with_columns(masked.alias("text_anonymized"))

Batch processing: `anonymize_batch`

Operates on plain Python lists (no Polars dependency):

results = unpii.anonymize_batch(["Dr Martin ici", "Email: a@b.fr"])
# → ["Dr <NOM> ici", "Email: <EMAIL>"]

Threading

Control the number of threads used by anonymize_batch, anonymize_series, and anonymize_dataframe:

unpii.set_max_threads(4)     # Use 4 threads
unpii.get_max_threads()      # → 4
unpii.set_max_threads(0)     # Use all available cores (default)
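The "0 means all cores" convention can be illustrated with a small order-preserving parallel map. The library itself parallelizes in Rust, so the Python thread pool below is only a stand-in, and `resolve_threads` and `parallel_map` are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_threads(n: int) -> int:
    """Map the documented convention (0 = all cores) to a worker count."""
    return n if n > 0 else (os.cpu_count() or 1)


def parallel_map(func, items, max_threads=0):
    """Apply `func` to every item in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=resolve_threads(max_threads)) as pool:
        return list(pool.map(func, items))


print(resolve_threads(4))                                       # 4
print(parallel_map(str.upper, ["a", "b", "c"], max_threads=2))  # ['A', 'B', 'C']
```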

Categories

Group	Placeholder	Standard	Paranoid
NOM	`<NOM>`	Titles + name, blacklist	UPPERCASE/Titlecase patterns, initials
TELEPHONE	`<TELEPHONE>`	French phone numbers	—
EMAIL	`<EMAIL>`	Valid emails	Anything with `@`
DATE	`<DATE>`	DD/MM/YYYY, literal months, ISO	—
BIRTHDATE	`<BIRTHDATE>`	né(e) le + date	—
ADRESSE	`<ADRESSE>`	Street number + type + name	—
CODE_POSTAL	`<CODE_POSTAL>`	5 digits + city name	—
NIR	`<NIR>`	French social security number	—
IBAN	`<IBAN>`	French IBAN	—
NUMBER	`<NUMBER>`	—	5+ consecutive digits
PII	`<PII>`	Custom words passed via `mask=`	—

Whitelist

Medical eponyms (Parkinson, Alzheimer, Verneuil...) are protected from masking by a global whitelist, regardless of which group detected them.
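Conceptually, the whitelist acts as a final filter over candidate spans, whatever rule produced them. The sketch below illustrates that idea with a three-word whitelist taken from the examples above; the case-insensitive comparison and the `drop_whitelisted` helper are assumptions, not the library's code.

```python
WHITELIST = {"parkinson", "alzheimer", "verneuil"}  # tiny illustrative subset


def drop_whitelisted(text, spans):
    """Keep only (start, end, category) spans whose text is not whitelisted."""
    return [
        (start, end, category)
        for start, end, category in spans
        if text[start:end].casefold() not in WHITELIST
    ]


text = "Maladie de Parkinson suivie par Dr Martin"
candidates = [(11, 20, "NOM"), (35, 41, "NOM")]
print(drop_whitelisted(text, candidates))  # [(35, 41, 'NOM')]
```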

API Reference

# Single text
def anonymize(text, *, style="placeholder", mode="standard", ignore_groups=None, mask=None) -> str
def find_spans(text, *, mode="standard", ignore_groups=None, mask=None) -> list[Span]

# Batch (plain Python lists, no Polars needed)
def anonymize_batch(texts, *, mask_from_columns=None, mask=None, style="placeholder", mode="standard", ignore_groups=None) -> list[str | None]

# DataFrame (requires polars)
def anonymize_dataframe(df, column, *, mask_from_columns=None, mask=None, new_column=None, style="placeholder", mode="standard", ignore_groups=None) -> DataFrame
def anonymize_series(series, *, mask_from_columns=None, mask=None, style="placeholder", mode="standard", ignore_groups=None) -> Series

# Threading
def set_max_threads(n: int) -> None   # 0 = all cores (default)
def get_max_threads() -> int

# Span attributes: .start, .end, .category

Performance

Rust core with compiled regex and Aho-Corasick automata. All rules and dictionaries are embedded in the binary — zero I/O at runtime.

anonymize_batch, anonymize_series, and anonymize_dataframe use rayon for automatic parallelization across all cores.

License

MIT

Release history

0.3.1

Mar 27, 2026

This version

0.3.0

Mar 26, 2026

0.2.0

Mar 24, 2026

0.1.0

Mar 24, 2026

Download files

Source Distribution

unpii-0.3.0.tar.gz (199.7 kB)

Uploaded Mar 26, 2026 Source

Built Distribution

unpii-0.3.0-cp310-abi3-manylinux_2_34_x86_64.whl (1.5 MB)

Uploaded Mar 26, 2026 CPython 3.10+manylinux: glibc 2.34+ x86-64

File details

Details for the file unpii-0.3.0.tar.gz.

File metadata

Download URL: unpii-0.3.0.tar.gz
Upload date: Mar 26, 2026
Size: 199.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.12.6

File hashes

Hashes for unpii-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`27e090bc208d9f98cf0296d953d2f9ec625544efa56ab8165cbaf57a6e83dcb1`
MD5	`13b5dbf223d04a32ab1b59ca6d7732be`
BLAKE2b-256	`bb368692a4136928aa5cdbabcce55f62cc9dcd3e44a0fc6393980a0a724d4b10`

File details

Details for the file unpii-0.3.0-cp310-abi3-manylinux_2_34_x86_64.whl.

File metadata

Download URL: unpii-0.3.0-cp310-abi3-manylinux_2_34_x86_64.whl
Upload date: Mar 26, 2026
Size: 1.5 MB
Tags: CPython 3.10+, manylinux: glibc 2.34+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.12.6

File hashes

Hashes for unpii-0.3.0-cp310-abi3-manylinux_2_34_x86_64.whl
Algorithm	Hash digest
SHA256	`222c19c22cbc313ef2bc24405e3c1a1550a28421320980abf14101fa9b3c657f`
MD5	`602e404af17da66dac1b98167850837c`
BLAKE2b-256	`3633ea19c6ce0ebbdc4c59e6133af53a54d498047664e5064737fa224341ce51`

unpii 0.3.0
