Lightweight PII detection and redaction for structured identifiers using regex and checksum validation

piigex

A small PII detection and redaction library for structured identifiers. It uses regex plus checksum validation. There is no ML model, no NLP pipeline, and no large dependency tree.

Use it to sanitize chatbot input before it reaches an LLM, scrub logs, or redact customer support transcripts. National identifiers for 23 EU countries are covered (see Coverage), along with international identifiers such as IBAN, BIC, credit cards, email, IP addresses, and MAC addresses.
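
To make the "regex plus checksum" idea concrete, here is a minimal sketch for a single identifier type, the IBAN. It is not piigex's implementation: `IBAN_SHAPE`, `iban_checksum_ok`, and `find_ibans` are hypothetical names, and the pattern is deliberately rough (it ignores per-country lengths). The check itself is the standard mod-97 test used for IBANs.

```python
import re

# Rough IBAN shape (illustrative only): country code, two check digits,
# then 11-30 uppercase alphanumerics, optionally split by single spaces.
IBAN_SHAPE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]){11,30}\b")


def iban_checksum_ok(candidate: str) -> bool:
    """Mod-97 check: move the first four characters to the end, map letters
    to numbers (A=10 ... Z=35), and require a remainder of 1."""
    compact = candidate.replace(" ", "").upper()
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_ibans(text: str) -> list[tuple[int, int, str, bool]]:
    """Return (start, end, value, valid) for every IBAN-shaped substring."""
    return [
        (m.start(), m.end(), m.group(), iban_checksum_ok(m.group()))
        for m in IBAN_SHAPE.finditer(text)
    ]


for hit in find_ibans("Send payment to ES91 2100 0418 4502 0005 1332 by Friday"):
    print(hit)
```

The regex alone accepts any string of the right shape. The checksum step is what rejects most typos and random digit runs, which is why this approach can keep false positives low without a model.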


Quickstart

from piigex import clean, scan

clean("Send payment to ES91 2100 0418 4502 0005 1332 by Friday")
# → "Send payment to {{IBAN}} by Friday"

scan("IBAN: DE89370400440532013000")
# → [Match(name='intl_iban', token='IBAN', start=6, end=28,
#          value='DE89370400440532013000', valid=True)]

# Fine-grained control
from piigex import Scrubber
s = Scrubber(regions=["es", "intl"], stable_tokens=True)
s.clean("DNI 12345678Z and card 4111 1111 1111 1111")
# → "DNI {{ES_DNI_1}} and card {{CREDIT_CARD_1}}"
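
The DNI in the example above is redacted because its check letter is valid. As an illustration of what such a check involves, the sketch below follows the publicly documented Spanish algorithm: the number modulo 23 indexes into a fixed letter table, and for a NIE the leading X, Y, or Z is first replaced by 0, 1, or 2. This is not piigex's code, and the function names are hypothetical.

```python
import re

# Check-letter table for Spanish DNI/NIE numbers (index = number mod 23).
CHECK_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"

DNI_SHAPE = re.compile(r"(\d{8})([A-Z])")
NIE_SHAPE = re.compile(r"([XYZ])(\d{7})([A-Z])")


def dni_ok(candidate: str) -> bool:
    """True if candidate is eight digits plus the matching check letter."""
    m = DNI_SHAPE.fullmatch(candidate)
    if m is None:
        return False
    return CHECK_LETTERS[int(m.group(1)) % 23] == m.group(2)


def nie_ok(candidate: str) -> bool:
    """True if candidate is X/Y/Z, seven digits, and the matching check letter."""
    m = NIE_SHAPE.fullmatch(candidate)
    if m is None:
        return False
    number = int(str("XYZ".index(m.group(1))) + m.group(2))
    return CHECK_LETTERS[number % 23] == m.group(3)


print(dni_ok("12345678Z"), dni_ok("12345678A"))
print(nie_ok("X1234567L"))
```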

JSON / dict payloads

clean_json and scan_json walk nested dicts, lists, and tuples. They only touch string values; keys, numbers, booleans, and the overall structure are left untouched. scan_json returns each Match with a dotted path showing where the PII was found.

import piigex

payload = {
    "user": {"email": "alice@example.com", "iban": "GB29NWBK60161331926819"},
    "amount": 100,
}

piigex.clean_json(payload)
# → {"user": {"email": "{{EMAIL}}", "iban": "{{IBAN}}"}, "amount": 100}

for m in piigex.scan_json(payload):
    print(f"{m.path}: {m.name}")
# user.email: intl_email
# user.iban: intl_iban
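
For readers curious how a walker like this can leave structure intact, here is a minimal sketch in plain Python. It is an assumption about the general technique, not the library's code: `map_strings` and `iter_strings` are hypothetical helpers, and the dotted-path format for list indices (`tags.0`) is a guess, since the example above only shows dict keys.

```python
from typing import Any, Callable, Iterator


def map_strings(obj: Any, fn: Callable[[str], str]) -> Any:
    """Return a copy of obj with fn applied to every string value.
    Dict keys, numbers, booleans, and None pass through unchanged."""
    if isinstance(obj, str):
        return fn(obj)
    if isinstance(obj, dict):
        return {key: map_strings(value, fn) for key, value in obj.items()}
    if isinstance(obj, list):
        return [map_strings(item, fn) for item in obj]
    if isinstance(obj, tuple):
        return tuple(map_strings(item, fn) for item in obj)
    return obj


def iter_strings(obj: Any, path: str = "") -> Iterator[tuple[str, str]]:
    """Yield (dotted_path, string_value) pairs for every string in obj."""
    if isinstance(obj, str):
        yield path, obj
    elif isinstance(obj, dict):
        for key, value in obj.items():
            yield from iter_strings(value, f"{path}.{key}" if path else str(key))
    elif isinstance(obj, (list, tuple)):
        for index, item in enumerate(obj):
            yield from iter_strings(item, f"{path}.{index}" if path else str(index))


payload = {"user": {"email": "alice@example.com"}, "amount": 100}
# str.upper stands in for a real redaction function here.
print(map_strings(payload, str.upper))
print(list(iter_strings(payload)))
```

Building a new structure instead of mutating the input means the caller's original payload is never modified, which matters when the unredacted data is still needed elsewhere.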

The CLI exposes the same behaviour with --json:

cat payload.json | piigex scrub --json
cat payload.json | piigex scan  --json    # match list with "path" field

Coverage

72 detectors across 27 regions. 58 Tier-1 identifiers are on by default. The other 14 (phone numbers and low-risk shape-only IDs) are off, since they produce more false positives. Turn them on explicitly with detectors=[...] or regions=[...].

| Region | Default-on detectors |
| --- | --- |
| ES | es_dni, es_nie, es_cif, es_nss, es_ccc, es_referencia_catastral |
| IT | it_codice_fiscale, it_partita_iva |
| FR | fr_nir, fr_nif, fr_siren, fr_siret, fr_tva |
| DE | de_idnr, de_vat, de_svnr |
| PT | pt_nif, pt_cc, pt_niss |
| NL | nl_bsn, nl_btw |
| BE | be_nn, be_bis, be_vat, be_eid |
| AT | at_vnr |
| BG | bg_egn, bg_pnf |
| HR | hr_oib |
| CZ | cz_rc, cz_dic |
| DK | dk_cpr, dk_cvr |
| EE | ee_ik |
| FI | fi_hetu, fi_ytunnus |
| GR | gr_amka |
| HU | hu_anum |
| IE | ie_pps |
| LT | lt_asmens |
| PL | pl_pesel, pl_nip, pl_regon |
| RO | ro_cnp, ro_cf |
| SK | sk_rc |
| SI | si_emso, si_maticna |
| SE | se_personnummer, se_orgnr |
| intl | intl_iban, intl_eu_vat, intl_bic, intl_credit_card, intl_email, intl_ipv4, intl_ipv6, intl_mac |

Countries with VAT coverage only (via intl_eu_vat): CY, LV, LU, MT.

Opt-in detectors (default disabled, feasibility="medium"):

  • Phone: intl_phone_e164, es_phone, it_phone, fr_phone, de_phone, pt_phone, nl_phone, be_phone
  • Shape-only IDs: es_passport, es_matricula, fr_cni, pt_passport, nl_passport, be_ogm_vcs_delimited

Enable them by name (detectors=["es_passport", ...]) or by region. They stay off by default because phone numbers and shape-only IDs are noisier.

US support is planned for v2. The internals were written to be country-agnostic, so adding new regions does not require restructuring the core.


Comparison

Install sizes were measured with du -sh against a clean venv (baseline pip/setuptools excluded) on Python 3.13. Presidio's spaCy model (en_core_web_lg, downloaded separately) adds a further 560 MB.

| Library | Approach | Net install size | EU structured IDs | Requires ML |
| --- | --- | --- | --- | --- |
| piigex | regex + checksum | ~6 MB | 72 (58 default + 14 opt-in) | No |
| commonregex | regex only | ~6 KB | None | No |
| piiregex | regex only | ~4 KB | None | No |
| scrubadub | regex + optional NLP | ~335 MB | Limited (IBAN only) | Optional |
| Microsoft Presidio | NLP + ML | ~200 MB + model | Via custom recognizers | Yes (default) |

Size breakdown for piigex: python-stdnum is 5.7 MB and the piigex source itself is about 150 KB. scrubadub 2.x pulls in scipy, numpy, scikit-learn, phonenumbers, faker, nltk, regex, and dateparser as transitive dependencies. Most of its install weight is that ML/data stack, not the detection code.


API reference

from piigex import Scrubber, clean, scan, Match

# Module-level convenience (uses a default Scrubber with the default-on detectors)
scan(text: str) -> list[Match]
clean(text: str) -> str

# Configurable scrubber
s = Scrubber(
    detectors=None,            # None = default set; or ["es_dni", "intl_iban"]
    exclude=None,              # detector names to exclude from the default set
    regions=None,              # e.g. ["es", "intl"]; None = all regions
    min_feasibility="medium",  # "high" | "medium" | "low"
    validate=True,             # False = shape-only, skip checksum
    stable_tokens=False,       # True → same value → same numbered token per call
    token_format="{{{name}}}", # produces {{TOKEN}} by default
    token_map=None,            # pass a persistent TokenMap to share state across calls
)

# Match fields (frozen dataclass)
Match(name, token, start, end, value, valid, path)
#     str   str    int   int  str   bool   str   # path populated by scan_json only
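
Going by the field list above, a Match behaves like an immutable record. The stand-in below shows what "frozen dataclass" implies in practice. `MatchSketch` is a hypothetical class that mirrors the documented fields, and the `None` default for `path` is an assumption; it is not the library's own definition.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)
class MatchSketch:
    """Stand-in mirroring the documented Match fields; not the library's class."""
    name: str                   # detector name, e.g. "intl_iban"
    token: str                  # placeholder label, e.g. "IBAN"
    start: int                  # offset of the first matched character
    end: int                    # offset one past the last matched character
    value: str                  # the matched text
    valid: bool                 # whether the checksum passed
    path: Optional[str] = None  # assumption: unset outside scan_json


text = "IBAN: DE89370400440532013000"
m = MatchSketch("intl_iban", "IBAN", 6, 28, "DE89370400440532013000", True)

# start/end slice the original text back to the matched value.
assert text[m.start:m.end] == m.value

# Frozen instances reject attribute assignment.
try:
    m.valid = False
except FrozenInstanceError:
    print("MatchSketch instances are read-only")
```

Because frozen dataclasses are hashable by default, records like this can also be deduplicated with a set or used as dict keys.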

Validation behaviour

validate=True (default): regex locates, checksum confirms. Matches with invalid checksums are returned with valid=False and are not replaced by clean().

validate=False: shape-match only. This is useful for catching likely PII even when the value is truncated or encoded.
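
The difference between the two modes can be sketched with credit card numbers and the Luhn checksum. This is an illustration under stated assumptions rather than the library's code: `CARD_SHAPE`, `luhn_ok`, and `clean_cards` are hypothetical names, and the `{{CREDIT_CARD}}` placeholder is inferred from the token style shown in the Quickstart.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_SHAPE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(number: str) -> bool:
    """Luhn check: double every second digit from the right, subtract 9 from
    any result above 9, and require the total to be divisible by 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def clean_cards(text: str, validate: bool = True) -> str:
    """Replace card-shaped substrings. With validate=True, candidates that
    fail the Luhn check are left in place."""
    def replace(m: re.Match) -> str:
        if validate and not luhn_ok(m.group()):
            return m.group()
        return "{{CREDIT_CARD}}"

    return CARD_SHAPE.sub(replace, text)


print(clean_cards("card 4111 1111 1111 1111"))
print(clean_cards("order 4111 1111 1111 1112"))
print(clean_cards("order 4111 1111 1111 1112", validate=False))
```

With validation on, the second string passes through untouched because its checksum fails. With validation off, anything card-shaped is replaced, which trades precision for recall.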

Stable tokens

s = Scrubber(stable_tokens=True)
s.clean("NIE X1234567L appears again as X1234567L")
# → "NIE {{ES_NIE_1}} appears again as {{ES_NIE_1}}"

Normalized equivalents (e.g. spaced vs. compact IBAN) map to the same token index. The counter resets each clean() call unless you pass a persistent TokenMap:

from piigex.tokens import TokenMap

tm = TokenMap()
# doc1 and doc2 stand for any two input strings; s is the Scrubber created above
s.clean(doc1, token_map=tm)
s.clean(doc2, token_map=tm)  # same values → same indices across both docs
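
Conceptually, a persistent token map is a dictionary from normalized values to indices. The sketch below shows that idea. `TokenMapSketch` and `normalize` are hypothetical, and the normalization (strip whitespace, uppercase) is an assumption about what "normalized equivalents" means. The per-name counters follow the Quickstart output, where {{ES_DNI_1}} and {{CREDIT_CARD_1}} both start at 1.

```python
def normalize(value: str) -> str:
    """Collapse formatting differences: drop whitespace and uppercase."""
    return "".join(value.split()).upper()


class TokenMapSketch:
    """Assigns a stable, per-token-name index to each distinct value."""

    def __init__(self) -> None:
        self._index: dict[tuple[str, str], int] = {}  # (name, value) -> index
        self._last: dict[str, int] = {}               # name -> highest index used

    def token_for(self, name: str, value: str) -> str:
        key = (name, normalize(value))
        if key not in self._index:
            self._last[name] = self._last.get(name, 0) + 1
            self._index[key] = self._last[name]
        return "{{" + f"{name}_{self._index[key]}" + "}}"


tm_sketch = TokenMapSketch()
print(tm_sketch.token_for("IBAN", "ES91 2100 0418 4502 0005 1332"))
print(tm_sketch.token_for("IBAN", "ES9121000418450200051332"))
print(tm_sketch.token_for("IBAN", "DE89370400440532013000"))
```

Keeping one map alive across documents is what lets the same person or account be followed through a redacted transcript without exposing the underlying value.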

When NOT to use this library

piigex only detects structured identifiers: tax codes, account numbers, social security numbers, and other formats that have a defined shape and, in most cases, a checksum algorithm.

It will not detect:

  • Names, organizations, or postal addresses (no NER)
  • Dates of birth written in natural language
  • Free-form sensitive content

If you need any of that, look at Microsoft Presidio or a spaCy-based NER pipeline. Those tools require ML models. piigex deliberately does not.


Scope

Structured identifiers only. No NER, no ML. EU country coverage in v1, US planned for v2. The only runtime dependency is python-stdnum.
