High-performance French medical text anonymization

unpii

High-performance French medical text anonymization library. Rust core with Python bindings and native Polars integration.

Designed to process millions of documents efficiently.

Installation

pip install unpii

Quick Start

import unpii

text = "Dr Martin au 06 12 34 56 78, email: martin@chu-brest.fr"

# Mask with placeholders (default)
unpii.mask(text)
# → "Dr <NOM> au <TELEPHONE>, email: <EMAIL>"

# Mask with stars
unpii.mask(text, mask="stars")
# → "Dr ***** au *****, email: *****"

Detection Modes

Two detection modes are available: standard (reliable patterns only) and paranoid (adds more aggressive patterns on top of standard).

# Standard: titles, known patterns, blacklisted names
unpii.mask("Dr Martin est ici")
# → "Dr <NOM> est ici"

unpii.mask("DUPONT Jean est ici")
# → "DUPONT Jean est ici"  (not detected in standard)

# Paranoid: also catches UPPERCASE + Titlecase name patterns (e.g. DUPONT Jean), 5+ digit sequences, loose emails
unpii.mask("DUPONT Jean est ici", mode="paranoid")
# → "<NOM> est ici"
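
To give a feel for the kind of extra patterns paranoid mode describes, here is a minimal sketch using two plain Python regular expressions. The patterns and the names `UPPER_TITLE`, `LONG_DIGITS` and `paranoid_candidates` are assumptions made for this illustration; they are not the rules shipped in unpii, which live in the Rust core.

```python
import re

# Hypothetical patterns, for illustration only (not unpii's actual rules).
# An all-caps word followed by a capitalized word, e.g. "DUPONT Jean".
UPPER_TITLE = re.compile(r"\b[A-ZÀ-ÖØ-Þ]{2,}\s+[A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]+\b")
# Five or more consecutive digits.
LONG_DIGITS = re.compile(r"\d{5,}")


def paranoid_candidates(text: str) -> list[str]:
    """Return the substrings matched by either illustrative pattern."""
    hits = [m.group(0) for m in UPPER_TITLE.finditer(text)]
    hits += [m.group(0) for m in LONG_DIGITS.finditer(text)]
    return hits


print(paranoid_candidates("DUPONT Jean est ici"))  # ['DUPONT Jean']
print(paranoid_candidates("Dr Martin est ici"))    # []
```

Patterns this loose will also match non-names such as an all-caps heading followed by a capitalized word, which is presumably why they are reserved for the aggressive mode.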

Ignore Groups

Skip specific categories:

unpii.mask("Dr Martin au 06 12 34 56 78", ignore_groups=["TELEPHONE"])
# → "Dr <NOM> au 06 12 34 56 78"

Inspect Detected Spans

Dry-run mode to see what would be masked:

for span in unpii.find_spans("Dr Martin au 06 12 34 56 78"):
    print(span)
# Span(start=3, end=9, category="NOM")
# Span(start=13, end=27, category="TELEPHONE")
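
The spans above carry enough information to rebuild the masked string by hand, which can be useful for custom replacement schemes. The sketch below assumes that `start` and `end` are character offsets into the Python string and that spans do not overlap; the `Span` dataclass is a stand-in for the library's object and `apply_spans` is a hypothetical helper, not part of unpii.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Stand-in for the library's span (assumed fields: start, end, category)."""
    start: int
    end: int
    category: str


def apply_spans(text: str, spans: list[Span], mask: str = "placeholder") -> str:
    """Replace each span with <CATEGORY> or a fixed run of five stars."""
    out = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        out.append(text[cursor:span.start])  # untouched text before the span
        out.append(f"<{span.category}>" if mask == "placeholder" else "*****")
        cursor = span.end
    out.append(text[cursor:])  # tail after the last span
    return "".join(out)


text = "Dr Martin au 06 12 34 56 78"
spans = [Span(3, 9, "NOM"), Span(13, 27, "TELEPHONE")]
print(apply_spans(text, spans))                # Dr <NOM> au <TELEPHONE>
print(apply_spans(text, spans, mask="stars"))  # Dr ***** au *****
```

Filtering the `spans` list by `category` before calling the helper gives the same effect as ignoring a group.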

Polars Integration

Native expression plugin — Polars handles parallelization automatically:

import polars as pl
import unpii

df = pl.DataFrame({"text": [
    "Dr Martin au 06 12 34 56 78",
    "Email: joe@chu-brest.fr",
    "Maladie de Parkinson",
]})

df.with_columns(pl.col("text").unpii.mask())
# ┌─────────────────────────┐
# │ text                    │
# ╞═════════════════════════╡
# │ Dr <NOM> au <TELEPHONE> │
# │ Email: <EMAIL>          │
# │ Maladie de Parkinson    │  ← protected by whitelist
# └─────────────────────────┘

# With options
df.with_columns(
    pl.col("text").unpii.mask(mask="stars", mode="paranoid", ignore_groups=["TELEPHONE"])
)

Categories

| Group       | Placeholder   | Standard                        | Paranoid                               |
|-------------|---------------|---------------------------------|----------------------------------------|
| NOM         | <NOM>         | Titles + name, blacklist        | UPPERCASE/Titlecase patterns, initials |
| TELEPHONE   | <TELEPHONE>   | French phone numbers            |                                        |
| EMAIL       | <EMAIL>       | Valid emails                    | Anything with @                        |
| DATE        | <DATE>        | DD/MM/YYYY, literal months, ISO |                                        |
| BIRTHDATE   | <BIRTHDATE>   | né(e) le + date                 |                                        |
| ADRESSE     | <ADRESSE>     | Street number + type + name     |                                        |
| CODE_POSTAL | <CODE_POSTAL> | 5 digits + city name            |                                        |
| NIR         | <NIR>         | French social security number   |                                        |
| IBAN        | <IBAN>        | French IBAN                     |                                        |
| NUMBER      | <NUMBER>      |                                 | 5+ consecutive digits                  |

Whitelist

Medical eponyms (Parkinson, Alzheimer, Verneuil...) are protected from masking by a global whitelist, regardless of which group detected them.
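
Conceptually, a global whitelist acts as a filter applied after detection: any candidate whose text is a protected term is dropped, whatever category produced it. The sketch below illustrates that idea under stated assumptions. `WHITELIST` holds only the three eponyms named above, spans are plain `(start, end, category)` tuples, and `drop_whitelisted` is a hypothetical helper; the library's actual matching logic may differ.

```python
# Hypothetical whitelist; the real one is embedded in the library and is larger.
WHITELIST = {"parkinson", "alzheimer", "verneuil"}


def drop_whitelisted(text: str, spans: list[tuple[int, int, str]]) -> list[tuple[int, int, str]]:
    """Keep only the spans whose covered text is not a whitelisted eponym."""
    return [
        (start, end, category)
        for start, end, category in spans
        if text[start:end].lower() not in WHITELIST
    ]


text = "Maladie de Parkinson"
candidates = [(11, 20, "NOM")]  # "Parkinson" flagged as a possible name
print(drop_whitelisted(text, candidates))  # []
```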

API Reference

def mask(
    text: str,
    *,
    mask: str = "placeholder",       # "placeholder" or "stars"
    mode: str = "standard",          # "standard" or "paranoid"
    ignore_groups: list[str] | None = None,
) -> str: ...

def find_spans(
    text: str,
    *,
    mode: str = "standard",
    ignore_groups: list[str] | None = None,
) -> list[Span]: ...

# Span attributes: .start, .end, .category

Performance

Rust core with compiled regex and Aho-Corasick automata. All rules and dictionaries are embedded in the binary, so nothing is loaded from disk at runtime.

Single-threaded throughput: about 10 ms per 7 KB document (about 100 docs/sec). With Polars, work is parallelized automatically across all cores.
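
For capacity planning, the single-threaded figure can be turned into a back-of-the-envelope estimate. The sketch below assumes documents of roughly 7 KB and perfect linear scaling across cores, which is optimistic; `estimated_hours` is a hypothetical helper, and real throughput should be measured on your own data.

```python
def estimated_hours(n_docs: int, ms_per_doc: float = 10.0, cores: int = 1) -> float:
    """Wall-clock estimate in hours, assuming ideal linear scaling across cores."""
    total_seconds = n_docs * ms_per_doc / 1000
    return total_seconds / cores / 3600


# One million ~7 KB documents at ~10 ms each.
print(round(estimated_hours(1_000_000), 2))           # 2.78 (single thread)
print(round(estimated_hours(1_000_000, cores=8), 2))  # 0.35 (8 cores, ideal scaling)
```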

License

MIT
