High-performance French medical text anonymization

unpii

High-performance French medical text anonymization library. Rust core with Python bindings and native Polars integration.

Designed to process millions of documents efficiently.

Installation

pip install unpii

Quick Start

import unpii

text = "Dr Martin au 06 12 34 56 78, email: martin@chu-brest.fr"

# Mask with placeholders (default)
unpii.mask(text)
# → "Dr <NOM> au <TELEPHONE>, email: <EMAIL>"

# Mask with stars
unpii.mask(text, mask="stars")
# → "Dr ***** au *****, email: *****"

Detection Modes

unpii has two detection modes: standard (reliable patterns only) and paranoid (standard plus more aggressive patterns).

# Standard: titles, known patterns, blacklisted names
unpii.mask("Dr Martin est ici")
# → "Dr <NOM> est ici"

unpii.mask("DUPONT Jean est ici")
# → "DUPONT Jean est ici"  (not detected in standard)

# Paranoid: also catches UPPERCASE Titlecase patterns, 5+ digit sequences, loose emails
unpii.mask("DUPONT Jean est ici", mode="paranoid")
# → "<NOM> est ici"
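
To make the difference between the two modes more concrete, here is a rough sketch of what an "UPPERCASE Titlecase" rule and a "5+ digit" rule could look like as plain regular expressions. The patterns and the `paranoid_sketch` helper are assumptions invented for illustration. The real rules live in unpii's Rust core and are likely more elaborate.

```python
import re

# Hypothetical patterns, NOT taken from unpii's rule set.
# A surname in capitals followed by a Titlecase given name, e.g. "DUPONT Jean".
UPPER_TITLE = re.compile(
    r"\b[A-ZÀ-ÖØ-Þ]{2,}(?:-[A-ZÀ-ÖØ-Þ]{2,})*\s+[A-ZÀ-ÖØ-Þ][a-zß-öø-ÿ]+\b"
)
# Any run of five or more consecutive digits.
LONG_NUMBER = re.compile(r"\d{5,}")


def paranoid_sketch(text: str) -> str:
    """Apply the two illustrative rules: names first, then long numbers."""
    text = UPPER_TITLE.sub("<NOM>", text)
    return LONG_NUMBER.sub("<NUMBER>", text)


print(paranoid_sketch("DUPONT Jean est ici"))     # <NOM> est ici
print(paranoid_sketch("Dossier 1234567 ouvert"))  # Dossier <NUMBER> ouvert
print(paranoid_sketch("Dr Martin est ici"))       # unchanged: neither rule matches
```

Rules this loose will also match non-identifying text, such as an acronym followed by a capitalized word, which is presumably why they are kept out of the standard mode.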

Ignore Groups

Pass ignore_groups to leave specific categories unmasked:

unpii.mask("Dr Martin au 06 12 34 56 78", ignore_groups=["TELEPHONE"])
# → "Dr <NOM> au 06 12 34 56 78"

Inspect Detected Spans

Use find_spans as a dry run to see what would be masked, without modifying the text:

for span in unpii.find_spans("Dr Martin au 06 12 34 56 78"):
    print(span)
# Span(start=3, end=9, category="NOM")
# Span(start=13, end=27, category="TELEPHONE")
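
Because each span carries offsets and a category, the output of find_spans can drive custom redaction. The sketch below uses a stand-in `Span` dataclass (hypothetical, mirroring only the `.start`, `.end` and `.category` attributes) so that it runs without unpii installed. It assumes that offsets index into the Python string and that `end` is exclusive, as the printed example suggests. The `redact` helper is not part of unpii.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Stand-in for unpii's Span; carries only the three documented attributes."""
    start: int
    end: int
    category: str


def redact(text: str, spans: list[Span], template: str = "[{category}]") -> str:
    """Replace each span with a custom label (hypothetical helper)."""
    # Work from the rightmost span so that earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        label = template.format(category=span.category)
        text = text[:span.start] + label + text[span.end:]
    return text


text = "Dr Martin au 06 12 34 56 78"
# Spans copied by hand from the find_spans output shown above.
spans = [Span(3, 9, "NOM"), Span(13, 27, "TELEPHONE")]

print(redact(text, spans))  # Dr [NOM] au [TELEPHONE]
```

With the library installed, the hand-written list would presumably be replaced by the result of `unpii.find_spans(text)`.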

Polars Integration

unpii ships a native Polars expression plugin, so Polars handles parallelization automatically:

import polars as pl
import unpii

df = pl.DataFrame({"text": [
    "Dr Martin au 06 12 34 56 78",
    "Email: joe@chu-brest.fr",
    "Maladie de Parkinson",
]})

df.with_columns(pl.col("text").unpii.mask())
# ┌─────────────────────────┐
# │ text                    │
# ╞═════════════════════════╡
# │ Dr <NOM> au <TELEPHONE> │
# │ Email: <EMAIL>          │
# │ Maladie de Parkinson    │  ← protected by whitelist
# └─────────────────────────┘

# With options
df.with_columns(
    pl.col("text").unpii.mask(mask="stars", mode="paranoid", ignore_groups=["TELEPHONE"])
)

Categories

| Group       | Placeholder   | Standard                         | Paranoid                               |
|-------------|---------------|----------------------------------|----------------------------------------|
| NOM         | <NOM>         | Titles + name, blacklist         | UPPERCASE/Titlecase patterns, initials |
| TELEPHONE   | <TELEPHONE>   | French phone numbers             |                                        |
| EMAIL       | <EMAIL>       | Valid emails                     | Anything with @                        |
| DATE        | <DATE>        | DD/MM/YYYY, literal months, ISO  |                                        |
| BIRTHDATE   | <BIRTHDATE>   | né(e) le + date                  |                                        |
| ADRESSE     | <ADRESSE>     | Street number + type + name      |                                        |
| CODE_POSTAL | <CODE_POSTAL> | 5 digits + city name             |                                        |
| NIR         | <NIR>         | French social security number    |                                        |
| IBAN        | <IBAN>        | French IBAN                      |                                        |
| NUMBER      | <NUMBER>      |                                  | 5+ consecutive digits                  |

Whitelist

Medical eponyms (Parkinson, Alzheimer, Verneuil...) are protected from masking by a global whitelist, regardless of which group detected them.
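
The effect of a global whitelist can be pictured as a filter applied after detection: any candidate span whose text is a protected term is dropped, whichever group produced it. The sketch below illustrates that idea only. The `WHITELIST` set, the tuple-based candidates and `drop_whitelisted` are hypothetical and say nothing about how unpii implements this internally.

```python
# Hypothetical whitelist, built from the eponyms named in the text above.
WHITELIST = {"parkinson", "alzheimer", "verneuil"}

# A candidate span as (start, end, category); end is exclusive.
Candidate = tuple[int, int, str]


def drop_whitelisted(text: str, candidates: list[Candidate]) -> list[Candidate]:
    """Keep only candidates whose covered text is not a whitelisted term."""
    return [
        (start, end, category)
        for start, end, category in candidates
        if text[start:end].lower() not in WHITELIST
    ]


text = "Dr Martin suit une maladie de Parkinson"
# Pretend an aggressive name rule flagged both capitalized words.
candidates = [(3, 9, "NOM"), (30, 39, "NOM")]

print(drop_whitelisted(text, candidates))  # [(3, 9, 'NOM')]
```

Filtering after detection, rather than inside each rule, is what lets a single list protect a term no matter which group matched it.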

API Reference

def mask(
    text: str,
    *,
    mask: str = "placeholder",       # "placeholder" or "stars"
    mode: str = "standard",          # "standard" or "paranoid"
    ignore_groups: list[str] | None = None,
) -> str: ...

def find_spans(
    text: str,
    *,
    mode: str = "standard",
    ignore_groups: list[str] | None = None,
) -> list[Span]: ...

# Span attributes: .start, .end, .category

Performance

The Rust core uses compiled regular expressions and Aho-Corasick automata. All rules and dictionaries are embedded in the binary, so no file I/O is needed at runtime.

Single-threaded: ~10 ms per 7 KB document (~100 docs/sec). With Polars, work is parallelized automatically across all cores.
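
As a back-of-the-envelope check on these figures, the sketch below converts the quoted per-document latency into throughput and into a wall-clock estimate for a large corpus. The helper names, the corpus size and the core count are illustrative assumptions. Perfect linear scaling across cores is an idealization that real workloads rarely reach.

```python
def docs_per_second(ms_per_doc: float) -> float:
    """Convert a per-document latency in milliseconds to throughput."""
    return 1000.0 / ms_per_doc


def estimate_hours(n_docs: int, ms_per_doc: float = 10.0, cores: int = 1) -> float:
    """Wall-clock hours for n_docs, assuming ideal linear scaling over cores."""
    seconds = n_docs * ms_per_doc / 1000.0 / cores
    return seconds / 3600.0


print(docs_per_second(10.0))                         # 100.0
print(round(estimate_hours(1_000_000), 2))           # 2.78 (single thread)
print(round(estimate_hours(1_000_000, cores=8), 2))  # 0.35 (8 cores, ideal)
```

At the quoted rate, one million 7 KB documents take just under three hours on a single thread, which is why parallel execution matters at that scale.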

License

MIT
