Lightweight PII detection and redaction for structured identifiers using regex and checksum validation

These details have not been verified by PyPI

Project links

Project description

piigex

A small PII detection and redaction library for structured identifiers. It uses regex plus checksum validation. There is no ML model, no NLP pipeline, and no large dependency tree.

Use it to sanitize chatbot input before it hits an LLM, scrub logs, or redact customer support transcripts. The 7 major EU countries are covered, along with international identifiers like IBAN, BIC, credit cards, email, IP addresses, and MAC addresses.

Quickstart

from piigex import clean, scan

clean("Send payment to ES91 2100 0418 4502 0005 1332 by Friday")
# → "Send payment to {{IBAN}} by Friday"

scan("IBAN: DE89370400440532013000")
# → [Match(name='intl_iban', token='IBAN', start=6, end=28,
#          value='DE89370400440532013000', valid=True)]

# Fine-grained control
from piigex import Scrubber
s = Scrubber(regions=["es", "intl"], stable_tokens=True)
s.clean("DNI 12345678Z and card 4111 1111 1111 1111")
# → "DNI {{ES_DNI_1}} and card {{CREDIT_CARD_1}}"

JSON / dict payloads

clean_json and scan_json walk nested dicts, lists, and tuples. They only touch string values; keys, numbers, booleans, and the overall structure are left untouched. scan_json returns each Match with a dotted path showing where the PII was found.

import piigex

payload = {
    "user": {"email": "alice@example.com", "iban": "GB29NWBK60161331926819"},
    "amount": 100,
}

piigex.clean_json(payload)
# → {"user": {"email": "{{EMAIL}}", "iban": "{{IBAN}}"}, "amount": 100}

for m in piigex.scan_json(payload):
    print(f"{m.path}: {m.name}")
# user.email: intl_email
# user.iban:  intl_iban

The CLI exposes the same behaviour with --json:

cat payload.json | piigex scrub --json
cat payload.json | piigex scan  --json    # match list with "path" field

Coverage

72 detectors across 27 regions. 58 Tier-1 identifiers are on by default. The other 14 (phone numbers and low-risk shape-only IDs) are off, since they produce more false positives. Turn them on explicitly with detectors=[...] or regions=[...].

Region	Default-on detectors
ES	`es_dni`, `es_nie`, `es_cif`, `es_nss`, `es_ccc`, `es_referencia_catastral`
IT	`it_codice_fiscale`, `it_partita_iva`
FR	`fr_nir`, `fr_nif`, `fr_siren`, `fr_siret`, `fr_tva`
DE	`de_idnr`, `de_vat`, `de_svnr`
PT	`pt_nif`, `pt_cc`, `pt_niss`
NL	`nl_bsn`, `nl_btw`
BE	`be_nn`, `be_bis`, `be_vat`, `be_eid`
AT	`at_vnr`
BG	`bg_egn`, `bg_pnf`
HR	`hr_oib`
CZ	`cz_rc`, `cz_dic`
DK	`dk_cpr`, `dk_cvr`
EE	`ee_ik`
FI	`fi_hetu`, `fi_ytunnus`
GR	`gr_amka`
HU	`hu_anum`
IE	`ie_pps`
LT	`lt_asmens`
PL	`pl_pesel`, `pl_nip`, `pl_regon`
RO	`ro_cnp`, `ro_cf`
SK	`sk_rc`
SI	`si_emso`, `si_maticna`
SE	`se_personnummer`, `se_orgnr`
intl	`intl_iban`, `intl_eu_vat`, `intl_bic`, `intl_credit_card`, `intl_email`, `intl_ipv4`, `intl_ipv6`, `intl_mac`

Countries with VAT coverage only (via intl_eu_vat): CY, LV, LU, MT.

Opt-in detectors (default disabled, feasibility="medium"):

Phone: intl_phone_e164, es_phone, it_phone, fr_phone, de_phone, pt_phone, nl_phone, be_phone
Shape-only IDs: es_passport, es_matricula, fr_cni, pt_passport, nl_passport, be_ogm_vcs_delimited

Enable them by name (detectors=["es_passport", ...]) or by region. They stay off by default because phone numbers and shape-only IDs are noisier.

US support is planned for v2. The internals were written to be country-agnostic, so adding new regions does not require restructuring the core.

Comparison

Install sizes measured with du -sh against a clean venv (baseline pip/setuptools excluded), Python 3.13. Presidio spaCy model (en_core_web_lg, downloaded separately) adds a further 560 MB.

Library	Approach	Net install size	EU structured IDs	Requires ML
piigex	regex + checksum	~6 MB	72 (58 default + 14 opt-in)	No
commonregex	regex only	~6 KB	None	No
piiregex	regex only	~4 KB	None	No
scrubadub	regex + optional NLP	~335 MB	Limited (IBAN only)	Optional
Microsoft Presidio	NLP + ML	~200 MB + model	Via custom recognizers	Yes (default)

Size breakdown for piigex: python-stdnum is 5.7 MB and the piigex source itself is about 150 KB. scrubadub 2.x pulls in scipy, numpy, scikit-learn, phonenumbers, faker, nltk, regex, and dateparser as transitive dependencies. Most of its install weight is that ML/data stack, not the detection code.

API reference

from piigex import Scrubber, clean, scan, Match

# Module-level convenience (uses the default Scrubber: all high/medium detectors)
scan(text: str) -> list[Match]
clean(text: str) -> str

# Configurable scrubber
s = Scrubber(
    detectors=None,            # None = default set; or ["es_dni", "intl_iban"]
    exclude=None,              # detector names to exclude from the default set
    regions=None,              # ["es", "intl"]: None = all regions
    min_feasibility="medium",  # "high" | "medium" | "low"
    validate=True,             # False = shape-only, skip checksum
    stable_tokens=False,       # True → same value → same numbered token per call
    token_format="{{{name}}}", # produces {{TOKEN}} by default
    token_map=None,            # pass a persistent TokenMap to share state across calls
)

# Match fields (frozen dataclass)
Match(name, token, start, end, value, valid, path)
#     str   str    int   int  str   bool   str   # path populated by scan_json only

Validation behaviour

validate=True (default): regex locates, checksum confirms. Matches with invalid checksums are returned with valid=False and are not replaced by clean().

validate=False: shape-match only. This is useful for catching likely PII even when the value is truncated or encoded.

Stable tokens

s = Scrubber(stable_tokens=True)
s.clean("NIE X1234567L appears again as X1234567L")
# → "NIE {{ES_NIE_1}} appears again as {{ES_NIE_1}}"

Normalized equivalents (e.g. spaced vs. compact IBAN) map to the same token index. The counter resets each clean() call unless you pass a persistent TokenMap:

from piigex.tokens import TokenMap
tm = TokenMap()
s.clean(doc1, token_map=tm)
s.clean(doc2, token_map=tm)  # same values → same indices across both docs

When NOT to use this library

piigex only detects structured identifiers: tax codes, account numbers, social security numbers, and other formats that have a defined shape and a checksum algorithm.

It will not detect:

Names, organizations, or postal addresses (no NER)
Dates of birth written in natural language
Free-form sensitive content

If you need any of that, look at Microsoft Presidio or a spaCy-based NER pipeline. Those tools require ML models. piigex deliberately does not.

Documentation

docs/index.md: landing page and design overview
docs/coverage.md: every detector grouped by region
docs/api.md: Python API reference
docs/cli.md: command-line usage
docs/extending.md: how to add a new detector

Scope

Structured identifiers only. No NER, no ML. EU country coverage in v1, US planned for v2. The only runtime dependency is python-stdnum.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.0

May 18, 2026

This version

0.1.0

May 14, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

piigex-0.1.0.tar.gz (67.3 kB view details)

Uploaded May 14, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

piigex-0.1.0-py3-none-any.whl (64.4 kB view details)

Uploaded May 14, 2026 Python 3

File details

Details for the file piigex-0.1.0.tar.gz.

File metadata

Download URL: piigex-0.1.0.tar.gz
Upload date: May 14, 2026
Size: 67.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for piigex-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`32b5ca577b14f2affc3a4a7a61d5ce2e827c4aad1c26299ee5c41b67451da989`
MD5	`16ded6ed31e1ca22a0efe44a82264cc0`
BLAKE2b-256	`738d4f87234f14bd8c24fe5e2b9e309c6c1d08c6d4571a0db1466b89662ffb98`

See more details on using hashes here.

File details

Details for the file piigex-0.1.0-py3-none-any.whl.

File metadata

Download URL: piigex-0.1.0-py3-none-any.whl
Upload date: May 14, 2026
Size: 64.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for piigex-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6682a8c5e543d2c08d0c984a8b3ad9f68b263f53711ab995d177da77c2772443`
MD5	`bff15f812b5033b90494a687ea66df4b`
BLAKE2b-256	`ca48119326d5774fb7c61a84b912bc11312f07f7477283276800849c0ee6bcb5`

See more details on using hashes here.

piigex 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

piigex

Quickstart

JSON / dict payloads

Coverage

Comparison

API reference

Validation behaviour

Stable tokens

When NOT to use this library

Documentation

Scope

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes