maskops

High-speed PII masking as a native Polars plugin — powered by Rust.

maskops extends Polars with PII detection and masking expressions that add no per-row Python overhead. No NLP models. No intermediate files. Just regex + Rust running directly on Arrow buffers.

Why

  • Presidio is heavy — it spins up NLP models for structured CSV data that doesn't need them.
  • Pure Python regex on large DataFrames is slow.
  • maskops compiles to a native extension module (.so on Linux and macOS, .pyd on Windows) that Polars calls directly, so its expressions run inside the Polars engine like built-in ones.

Architecture

maskops/
├── Cargo.toml               # Rust dependencies (pyo3 0.21, pyo3-polars 0.18, polars 0.46)
├── pyproject.toml           # maturin build backend + PyPI metadata
├── src/
│   ├── lib.rs               # Polars expression registration (mask_pii, contains_pii)
│   └── patterns/
│       ├── mod.rs           # mask_all() and contains_any_pii() aggregators
│       ├── iban.rs          # IBAN regex + masking
│       └── vat.rs           # EU VAT regex + masking
├── maskops/
│   └── __init__.py          # Python API via register_plugin_function
└── tests/
    ├── test_masking.py      # pytest suite
    ├── generate_fixtures.py # Faker-based EU test data generator
    └── fixtures/            # Generated CSVs (gitignored)

The Rust layer operates directly on Arrow buffers, so there is no Python object overhead per row. Each PII type lives in its own module: adding a new pattern takes one new file plus one line in mod.rs.
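The one-module-per-pattern layout with two aggregators can be sketched in plain Python. This is only an illustration of the design, not the Rust code in src/patterns/: the `PiiPattern` type, the `_keep_prefix` helper, and both regexes are hypothetical and far looser than a production IBAN or VAT matcher. The prefix lengths (4 for IBAN, 2 for VAT) are inferred from the masked examples in the patterns table.

```python
import re
from typing import Callable, NamedTuple


class PiiPattern(NamedTuple):
    """One PII type: a compiled regex plus a function that masks a match."""
    name: str
    regex: re.Pattern
    mask: Callable[[re.Match], str]


def _keep_prefix(n: int) -> Callable[[re.Match], str]:
    """Build a masker that keeps the first n characters and stars the rest."""
    def mask(match: re.Match) -> str:
        text = match.group(0)
        return text[:n] + "*" * (len(text) - n)
    return mask


# Hypothetical, simplified patterns (assumptions, not the library's regexes).
IBAN = PiiPattern("iban", re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"), _keep_prefix(4))
VAT = PiiPattern("vat", re.compile(r"\b[A-Z]{2}\d{8,12}\b"), _keep_prefix(2))

# The registry plays the role of mod.rs: a new pattern is one more entry here.
# Order matters: the longer IBAN pattern runs before the shorter VAT pattern.
PATTERNS = [IBAN, VAT]


def mask_all(text: str) -> str:
    """Apply every registered pattern's masker in turn."""
    for pattern in PATTERNS:
        text = pattern.regex.sub(pattern.mask, text)
    return text


def contains_any_pii(text: str) -> bool:
    """Return True as soon as any registered pattern matches."""
    return any(pattern.regex.search(text) for pattern in PATTERNS)


if __name__ == "__main__":
    print(mask_all("Refund to DE89370400440532013000, VAT DE123456789"))
    print(contains_any_pii("no identifiers here"))
```

Keeping the regex and its masker together in one record means the aggregators never need to know about individual PII types, which is what makes the "new file plus one line" workflow possible.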

Install

pip install maskops

Usage

import polars as pl
import maskops

df = pl.read_csv("payments.csv")

# Mask all PII in a column (with_columns returns a new DataFrame)
masked = df.with_columns(maskops.mask_pii("notes"))

# Keep only the rows that contain PII
flagged = df.filter(maskops.contains_pii("free_text"))

Supported patterns (v0.1)

| Pattern | Example input | Masked output |
|---|---|---|
| IBAN | DE89370400440532013000 | DE89****************** |
| EU VAT | DE123456789 | DE********* |

Tested against 8 EU locales: DE, FR, ES, IT, NL, PL, PT, SE.

Roadmap

  • Email, phone, IP address patterns
  • Format-Preserving Encryption (FPE/FF3-1) for reversible masking
  • Latin American IDs (RUT, CPF, CURP)
  • Benchmark vs Presidio
  • Parquet streaming support
  • PyPI publish via GitHub Actions

Build from source

Windows (PowerShell)

python -m venv .venv
.venv\Scripts\activate
pip install maturin faker polars pytest
maturin develop --release
python tests/generate_fixtures.py
pytest tests/ -v

Linux / macOS

python -m venv .venv
source .venv/bin/activate
pip install maturin faker polars pytest
maturin develop --release
python tests/generate_fixtures.py
pytest tests/ -v

Key dependency versions

| Package | Version |
|---|---|
| pyo3 | 0.21 |
| pyo3-polars | 0.18 |
| polars | 0.46 |
| maturin | >=1.7,<2.0 |

Note: pyo3 must be 0.21 to match pyo3-polars 0.18. Do not bump pyo3 independently.

License

MIT

Benchmarks

Tested on 1,000,000 rows, Intel i-series CPU, Python 3.14, Windows.

maskops throughput

| Profile | Expression | Time | Rows/s | MB/s |
|---|---|---|---|---|
| clean (no PII) | mask_pii | 0.379s | 2,636,105 | 58.0 |
| clean (no PII) | contains_pii | 0.170s | 5,872,663 | 129.2 |
| dense (all PII) | mask_pii | 1.462s | 684,035 | 15.0 |
| dense (all PII) | contains_pii | 0.059s | 16,858,176 | 370.9 |
| mixed (50/50) | mask_pii | 0.742s | 1,347,927 | 29.7 |
| mixed (50/50) | contains_pii | 0.119s | 8,401,603 | 184.8 |

vs pure Python regex (same machine)

| Profile | maskops mask_pii | Python re | Speedup |
|---|---|---|---|
| clean | 0.379s | 0.907s | 2.4× |
| dense | 1.462s | 1.481s | 1.0× |
| mixed | 0.742s | 1.253s | 1.7× |

On clean and mixed data maskops is consistently faster. On dense data (every row is a full IBAN) both are regex-bound — the bottleneck is the pattern itself, not Python overhead.
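The figures in the two tables above are related by simple arithmetic, which the following sketch reproduces from the published timings. The helper names are illustrative. Because the times are rounded to three decimals, the recomputed rows/s values differ slightly from the reported ones, which were presumably derived from unrounded timings.

```python
ROWS = 1_000_000  # benchmark size stated above

# mask_pii wall-clock seconds per profile, copied from the tables above.
MASKOPS_SECONDS = {"clean": 0.379, "dense": 1.462, "mixed": 0.742}
PYTHON_RE_SECONDS = {"clean": 0.907, "dense": 1.481, "mixed": 1.253}


def rows_per_second(seconds: float, rows: int = ROWS) -> float:
    """Throughput: rows processed divided by elapsed time."""
    return rows / seconds


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    for profile, seconds in MASKOPS_SECONDS.items():
        ratio = speedup(PYTHON_RE_SECONDS[profile], seconds)
        print(f"{profile}: {rows_per_second(seconds):,.0f} rows/s, {ratio:.1f}x vs Python re")
```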

vs Microsoft Presidio (estimated)

Presidio processes structured DataFrames via presidio-structured, which runs a spaCy NLP pipeline per row. Based on community reports and the architecture:

| Tool | Throughput (structured data) | Requires NLP model |
|---|---|---|
| maskops | ~0.7–17M rows/s (measured above) | No |
| Presidio (regex-only recognizers) | ~10–50K rows/s* | No |
| Presidio (spaCy NER) | ~1–5K rows/s* | Yes (250MB+) |

* Estimated from community benchmarks and Presidio's own documentation noting it is "not optimized for bulk structured data." Microsoft confirmed no official throughput benchmarks exist.

maskops is purpose-built for structured data pipelines where Presidio's NLP overhead is unnecessary.
