High-speed PII masking as a Polars plugin — powered by Rust

MaskOps

MaskOps extends Polars with zero-overhead PII detection and masking expressions. No NLP models. No intermediate files. Just regex + Rust running directly on Arrow buffers.

How It Works

flowchart LR
    A[🐍 Python\nPolars DataFrame] -->|mask_pii / contains_pii| B[Polars\nExpression Engine]
    B -->|Arrow buffer\nzero-copy| C[🦀 Rust Core\nmaskops]
    C -->|IBAN / VAT| D[Masked\nSeries]
    C -->|Email / Phone| D
    C -->|IP / Credit Card| D
    C -->|DNI / NIE / NIN| D
    C -->|Personalausweis| D
    C -->|RUT / CPF / CURP| D
    D -->|back to Python| A
    style A fill:#306998,color:#fff
    style C fill:#CE422B,color:#fff
    style B fill:#2E2E2E,color:#fff
    style D fill:#2E7D32,color:#fff

No Python objects created per row. No NLP model loaded. No intermediate files.

  • Presidio is heavy — it spins up NLP models for structured CSV data that doesn't need them.
  • Pure Python regex on large DataFrames is slow.
  • MaskOps compiles to a native extension module (a .so on Linux) that Polars calls directly as an expression plugin, with no per-row Python overhead.

Architecture

maskops/
├── Cargo.toml               # Rust dependencies
├── pyproject.toml           # maturin build backend + PyPI metadata
├── src/
│   ├── lib.rs               # Polars expression registration (mask_pii, contains_pii)
│   └── patterns/
│       ├── mod.rs           # mask_all() and contains_any_pii() aggregators
│       ├── iban.rs          # IBAN regex + masking
│       ├── vat.rs           # EU VAT regex + masking
│       ├── email.rs         # Email regex + masking (local part)
│       ├── phone.rs         # E.164 phone regex + masking
│       ├── ip.rs            # IPv4/IPv6 regex + masking
│       ├── latam_id.rs      # RUT (Chile), CPF (Brazil), CURP (Mexico)
│       ├── european_id.rs   # DNI/NIE (Spain), NIN (UK), Personalausweis (Germany)
│       ├── credit_card.rs   # Visa, Mastercard, Amex, Discover, Maestro + Luhn
│       └── country_codes.rs # Country prefix lookup table
├── maskops/
│   └── __init__.py          # Python API via register_plugin_function
├── benchmarks/
│   └── benchmark.py         # Per-family throughput benchmarks (1M rows)
└── tests/
    ├── test_masking.py      # pytest suite (66 tests)
    ├── generate_fixtures.py # Faker-based EU test data generator
    └── fixtures/            # Generated CSVs (gitignored)

The Rust layer operates directly on Arrow buffers — zero Python object overhead per row. Each PII type is its own module: adding a new pattern = new file + one line in mod.rs.
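
The "one module per pattern, one aggregator" layout can be pictured with a small pure-Python analogue. This is only a sketch under stated assumptions: the regexes are simplified, the names `PATTERNS`, `mask_email` and `mask_ipv4` are hypothetical, and the real code is Rust operating on Arrow buffers rather than Python strings.

```python
import re

# Simplified, illustrative regexes (assumptions); the real patterns live in src/patterns/*.rs.
EMAIL_RE = re.compile(r"\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b")
IPV4_RE = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")


def mask_email(text: str) -> str:
    """Replace the local part with asterisks and keep the domain."""
    return EMAIL_RE.sub(lambda m: "*" * len(m.group(1)) + "@" + m.group(2), text)


def mask_ipv4(text: str) -> str:
    """Keep the first two octets and mask the last two."""
    return IPV4_RE.sub(lambda m: f"{m.group(1)}.{m.group(2)}.*.*", text)


# Hypothetical registry: a new pattern is one new (regex, masker) entry,
# mirroring "new file + one line in mod.rs".
PATTERNS = [
    (EMAIL_RE, mask_email),
    (IPV4_RE, mask_ipv4),
]


def mask_all(text: str) -> str:
    """Apply every registered masker in turn."""
    for _, mask in PATTERNS:
        text = mask(text)
    return text


def contains_any_pii(text: str) -> bool:
    """Return True as soon as any registered pattern matches."""
    return any(regex.search(text) for regex, _ in PATTERNS)
```

In this sketch `contains_any_pii` only searches and never builds a new string, which is one plausible reason a detection pass is cheaper than a masking pass.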

Install

pip install maskops

Usage

import polars as pl
import maskops

df = pl.read_csv("payments.csv")

# Mask all PII in the "notes" column (with_columns returns a new DataFrame)
masked = df.with_columns(maskops.mask_pii("notes"))

# Keep only the rows whose "free_text" column contains PII
flagged = df.filter(maskops.contains_pii("free_text"))

Supported patterns (v0.1.4)

| Pattern | Example input | Masked output |
|---|---|---|
| IBAN | DE89370400440532013000 | DE89****************** |
| EU VAT | DE123456789 | DE********* |
| Email | john.doe@example.com | ********@example.com |
| Phone | +14155552671 | +1********** |
| IP Address | 192.168.1.100 | 192.168.*.* |
| RUT (Chile) | 76.354.771-K | **********-K |
| CPF (Brazil) | 529.982.247-25 | *********-25 |
| CURP (Mexico) | BADD110313HCMLNS09 | ****************** |
| DNI (Spain) | 12345678Z | ********Z |
| NIE (Spain) | X1234567L | ********L |
| NIN (UK) | AB 12 34 56 C | *********** C |
| Personalausweis (Germany) | T220001293 | ********** |
| Credit Card (Visa/MC/Amex/Discover/Maestro) | 4111111111111111 | 411111******1111 |

Tested against 8 EU locales: DE, FR, ES, IT, NL, PL, PT, SE. Email and phone follow RFC 5322 and E.164 respectively. RUT and CPF include Módulo 11 check digit validation. DNI and NIE include modulo 23 check letter validation. Credit cards include Luhn validation — format-only matches are rejected. Personalausweis and NIN: format-only matching; check digit validation pending (v0.2.0+).
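
The Luhn and modulo 23 checks mentioned above are standard public algorithms. The following is a minimal Python sketch of them, not the Rust code from this project; the function names are hypothetical and the input handling is deliberately simple.

```python
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"  # standard modulo 23 letter table
NIE_PREFIX = {"X": "0", "Y": "1", "Z": "2"}  # leading NIE letter mapped to a digit
ASCII_DIGITS = "0123456789"


def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    digits = [int(c) for c in number if c in ASCII_DIGITS]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def dni_valid(dni: str) -> bool:
    """Spanish DNI: 8 digits followed by the letter at index (number mod 23)."""
    dni = dni.upper()
    if len(dni) != 9 or not all(c in ASCII_DIGITS for c in dni[:8]):
        return False
    return DNI_LETTERS[int(dni[:8]) % 23] == dni[8]


def nie_valid(nie: str) -> bool:
    """Spanish NIE: replace the leading X/Y/Z with 0/1/2, then validate as a DNI."""
    nie = nie.upper()
    if len(nie) != 9 or nie[0] not in NIE_PREFIX:
        return False
    return dni_valid(NIE_PREFIX[nie[0]] + nie[1:])
```

The example values from the table pass these checks: `luhn_valid("4111111111111111")`, `dni_valid("12345678Z")` and `nie_valid("X1234567L")` all return `True`.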

Roadmap

Done (as of v0.1.4):

  • Email, phone patterns
  • IP address patterns
  • Latin American IDs (RUT, CPF, CURP)
  • European IDs (DNI/NIE Spain, NIN UK, Personalausweis Germany)
  • Credit cards (Visa, Mastercard, Amex, Discover, Maestro) with Luhn validation
  • PyPI publish via GitHub Actions

Planned:

  • Format-Preserving Encryption (FPE/FF3-1) for reversible masking
  • Check digit validation for Personalausweis (Germany) and NIN (UK)
  • Benchmark vs Presidio
  • Parquet streaming support

Build from source

Windows (PowerShell)

python -m venv .venv
.venv\Scripts\activate
pip install maturin faker polars pytest
maturin develop --release
python tests/generate_fixtures.py
pytest tests/ -v

Linux / macOS

python -m venv .venv
source .venv/bin/activate
pip install maturin faker polars pytest
maturin develop --release
python tests/generate_fixtures.py
pytest tests/ -v

Key dependency versions

| Package | Version |
|---|---|
| pyo3 | 0.21 |
| pyo3-polars | 0.18 |
| polars | 0.46 |
| maturin | >=1.7,<2.0 |

Note: pyo3 must be 0.21 to match pyo3-polars 0.18. Do not bump pyo3 independently.

License

MIT

Benchmarks

Tested on 1,000,000 rows, Intel i-series CPU, Python 3.14, Windows.

Median of 3 runs per benchmark. Baseline uses equivalent regex coverage to maskops per family.

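Benchmarks are broken down by pattern family. Note that maskops itself always runs the full pattern set, so the per-family numbers include the cost of checking every other family as well (see the note on per-family benchmarks).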
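
The measurement method described here (median of 3 runs, rows per second, speedup against a baseline) can be sketched in a few lines of Python. This is not the project's benchmarks/benchmark.py; the helper names and the toy baseline are hypothetical.

```python
import re
import statistics
import time


def median_seconds(fn, runs: int = 3) -> float:
    """Call fn() `runs` times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def rows_per_second(n_rows: int, seconds: float) -> float:
    """Throughput for n_rows processed in the given time."""
    return n_rows / seconds


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Toy pure-Python baseline on a small synthetic column (illustrative only).
    rows = ["contact john.doe@example.com"] * 10_000
    email = re.compile(r"[A-Za-z0-9._%+-]+@")
    elapsed = median_seconds(lambda: [email.sub("***@", r) for r in rows])
    print(f"{rows_per_second(len(rows), elapsed):,.0f} rows/s")
```

For instance, `speedup(2.779, 1.285)` rounds to 2.2, matching the first row of the EU table below.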

EU patterns (IBAN, VAT, Email, Phone)

| Profile | Expression | Time | Rows/s | Python re | Speedup |
|---|---|---|---|---|---|
| clean | mask_pii | 1.285s | 778,053 | 2.779s | 2.2× |
| clean | contains_pii | 0.378s | 2,645,946 | - | - |
| dense | mask_pii | 1.989s | 502,750 | 1.745s | 0.9× |
| dense | contains_pii | 0.130s | 7,690,710 | - | - |
| mixed | mask_pii | 1.869s | 535,082 | 1.943s | 1.0× |
| mixed | contains_pii | 0.180s | 5,546,207 | - | - |

LatAm patterns (RUT, CPF, CURP)

| Profile | Expression | Time | Rows/s | Python re | Speedup |
|---|---|---|---|---|---|
| clean | mask_pii | 1.339s | 746,721 | 2.283s | 1.7× |
| clean | contains_pii | 0.375s | 2,668,890 | - | - |
| dense | mask_pii | 2.033s | 491,792 | 1.618s | 0.8× |
| dense | contains_pii | 0.632s | 1,582,181 | - | - |
| mixed | mask_pii | 1.884s | 530,879 | 1.783s | 0.9× |
| mixed | contains_pii | 0.588s | 1,701,419 | - | - |

RUT and CPF include Módulo 11 check digit validation per row. That extra work is the cost of rejecting matches that only look right by format.
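
To make the per-row cost concrete, here is a minimal Python sketch of the two Módulo 11 checks, using the publicly documented CPF and RUT algorithms. It is not the Rust code in latam_id.rs, and the function names are hypothetical.

```python
ASCII_DIGITS = "0123456789"


def cpf_valid(cpf: str) -> bool:
    """Brazilian CPF: 9 base digits followed by two Módulo 11 check digits."""
    digits = [int(c) for c in cpf if c in ASCII_DIGITS]
    if len(digits) != 11 or len(set(digits)) == 1:
        return False  # wrong length, or a repeated-digit sequence such as 111.111.111-11
    for n in (9, 10):
        # Weights run from n + 1 down to 2 over the first n digits.
        total = sum(d * w for d, w in zip(digits[:n], range(n + 1, 1, -1)))
        check = (total * 10) % 11 % 10  # a remainder of 10 maps to 0
        if check != digits[n]:
            return False
    return True


def rut_valid(rut: str) -> bool:
    """Chilean RUT: body digits plus a check character in 0-9 or K."""
    cleaned = rut.replace(".", "").replace("-", "").upper()
    body, dv = cleaned[:-1], cleaned[-1:]
    if not body or not all(c in ASCII_DIGITS for c in body):
        return False
    # Weights cycle 2, 3, 4, 5, 6, 7 starting from the rightmost body digit.
    total = sum(int(c) * (2 + i % 6) for i, c in enumerate(reversed(body)))
    remainder = 11 - total % 11
    expected = {11: "0", 10: "K"}.get(remainder, str(remainder))
    return dv == expected
```

Both example values from the supported patterns table validate: `cpf_valid("529.982.247-25")` and `rut_valid("76.354.771-K")` return `True`, while changing either check digit makes them fail.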

Network patterns (IP)

| Profile | Expression | Time | Rows/s | Python re | Speedup |
|---|---|---|---|---|---|
| clean | mask_pii | 1.413s | 707,909 | 2.969s | 2.1× |
| clean | contains_pii | 0.376s | 2,658,206 | - | - |
| dense | mask_pii | 2.346s | 426,330 | 2.192s | 0.9× |
| dense | contains_pii | 0.324s | 3,081,799 | - | - |
| mixed | mask_pii | 2.213s | 451,958 | 2.423s | 1.1× |
| mixed | contains_pii | 0.385s | 2,596,104 | - | - |

All patterns active

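| Profile | Expression | maskops | Python re | Speedup |
|---|---|---|---|---|
| clean | mask_pii | 1.997s | 6.350s | 3.2× |
| clean | contains_pii | 0.562s | - | - |
| dense | mask_pii | 1.910s | 3.232s | 1.7× |
| dense | contains_pii | 0.319s | - | - |
| mixed | mask_pii | 1.869s | 3.594s | 1.9× |
| mixed | contains_pii | 0.326s | - | - |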

Note on per-family benchmarks: maskops always runs the full pattern set — there is no per-family dispatch. A "Credit Card only" benchmark still pays for IBAN, VAT, email, phone, LatAm ID, and EU ID checks. The Python baseline only runs one regex. This is why maskops underperforms on isolated families with dense PII. The advantage emerges when all patterns are active simultaneously, which is the realistic production case.

With all patterns active, which is the realistic production workload, maskops is 1.7× to 3.2× faster in the table above than an equivalent pure Python approach covering all 13 pattern types simultaneously. contains_pii processes the 1M mixed rows in 0.326s (about 3.1M rows/s); use it to pre-filter before masking.

vs Microsoft Presidio (estimated)

Presidio processes structured DataFrames via presidio-structured, which runs a spaCy NLP pipeline per row. Based on community reports and the architecture:

| Tool | Throughput (structured data) | Requires NLP model |
|---|---|---|
| maskops | ~427K–7.7M rows/s (measured) | No |
| Presidio (regex-only recognizers) | ~10–50K rows/s* | No |
| Presidio (spaCy NER) | ~1–5K rows/s* | Yes (250MB+) |

* Estimated from community benchmarks and Presidio's own documentation noting it is "not optimized for bulk structured data." Microsoft confirmed no official throughput benchmarks exist.

maskops is purpose-built for structured data pipelines where Presidio's NLP overhead is unnecessary.
