pii-shield

A pluggable, on-premise-first PII masking, unmasking, and redaction library for Python.

pii-shield detects personally identifiable and sensitive business information in free text (emails, phone numbers, GST/PAN/IBAN numbers, person and organisation names, bank details, invoice/PO numbers, and more), and gives you three ways to handle it:

  • mask it into a reversible placeholder token, encrypted at rest
  • unmask a previously masked token back to its original value
  • redact it permanently, with no way to recover the original

There is no server, no API, and no hard dependency on any particular database; it is a library you import and call directly. Where your encrypted PII values live is a pluggable choice: in-memory, a local file, Redis, PostgreSQL, or a backend you write yourself.


Install

pip install pii-shield                      # core: regex detection + in-memory/filesystem storage
pip install "pii-shield[postgres]"           # + PostgreSQL storage backend
pip install "pii-shield[redis]"              # + Redis storage backend
pip install "pii-shield[spacy]"              # + spaCy NER layer (PERSON/ORG/GPE)
pip install "pii-shield[privacy-filter]"     # + transformer token-classification layer
pip install "pii-shield[all]"                # everything

Only cryptography is a hard dependency. asyncpg, redis, spacy, and transformers/torch are all opt-in extras. If you use a backend or detection layer without installing its extra, you get a clear OptionalDependencyMissingError telling you exactly what to install — never a bare ImportError or a silent failure.
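
The guard pattern behind that error can be sketched roughly as below. This is an illustration only, not pii-shield's actual code: the require_extra helper is hypothetical, and the exception class is a local stand-in defined so the snippet runs without the library installed.

```python
import importlib


class OptionalDependencyMissingError(Exception):
    """Local stand-in; the real class lives in pii_shield and derives from PIIShieldError."""


def require_extra(module_name, extra):
    """Import module_name, or raise an error that names the extra to install.

    Hypothetical helper, shown only to illustrate the pattern.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise OptionalDependencyMissingError(
            f"'{module_name}' is not installed. "
            f'Run: pip install "pii-shield[{extra}]"'
        ) from exc


# json ships with Python, so this import succeeds.
json_module = require_extra("json", "all")

# A module that does not exist triggers the descriptive error instead of a bare ImportError.
try:
    require_extra("no_such_module_for_demo", "redis")
except OptionalDependencyMissingError as err:
    message = str(err)
```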


Quick start

import asyncio
from pii_shield import PIIMaskingEngine
from pii_shield.storage import InMemoryStorage

async def main():
    async with PIIMaskingEngine(storage=InMemoryStorage()) as engine:
        result = await engine.mask("Contact john@acme.com about GST 27AAPFU0939F1ZV")
        print(result.masked_text)
        # "Contact {{EMAIL:abcc2}} about GST {{GST:9a03b}}"

        original = await engine.unmask(result.masked_text)
        print(original)
        # "Contact john@acme.com about GST 27AAPFU0939F1ZV"

        print(engine.redact("Contact john@acme.com about GST 27AAPFU0939F1ZV"))
        # "Contact [REDACTED:EMAIL] about GST [REDACTED:GST]"  — irreversible, nothing stored

asyncio.run(main())

PIIMaskingEngine is an async context manager — initialise() connects the storage backend, close() releases it. Use async with unless you need to control that lifecycle yourself.
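
If you do manage the lifecycle yourself, the usual shape is initialise(), then try/finally around the work, then close(). The sketch below uses a hypothetical ToyEngine stand-in (not the real PIIMaskingEngine) so it runs on its own; it only demonstrates that the manual form and async with produce the same sequence of calls.

```python
import asyncio


class ToyEngine:
    """Hypothetical stand-in that records lifecycle calls; not the real engine."""

    def __init__(self):
        self.events = []

    async def initialise(self):
        self.events.append("initialise")

    async def close(self):
        self.events.append("close")

    async def __aenter__(self):
        await self.initialise()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.close()


async def manual_lifecycle():
    engine = ToyEngine()
    await engine.initialise()
    try:
        engine.events.append("work")
    finally:
        # Runs even if the work above raises, mirroring what async with does.
        await engine.close()
    return engine.events


async def managed_lifecycle():
    async with ToyEngine() as engine:
        engine.events.append("work")
    return engine.events


manual_events = asyncio.run(manual_lifecycle())
managed_events = asyncio.run(managed_lifecycle())
```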


The three operations

Method                            Reversible?         Touches storage?   Use for
mask(text, scope=None)            Yes, via unmask()   Yes                Sending documents to an LLM/cloud service while keeping raw PII on-premise
unmask(masked_text, scope=None)   n/a                 Yes (read)         Restoring original values before writing back to source systems
redact(text)                      No                  No                 Logs, analytics exports, anything that must never contain recoverable PII

result = await engine.mask(text, scope="invoice-2026-00417")
# result.masked_text     -> text with {{TYPE:xxxxx}} placeholders
# result.token_count     -> number of PII spans masked
# result.entity_counts   -> {"EMAIL": 1, "GST": 1, ...}
# result.entities        -> per-span detail (type, offsets, token, confidence, source)

text_back = await engine.unmask(result.masked_text, scope="invoice-2026-00417")

scrubbed = engine.redact(text)   # synchronous — no storage or encryption involved

mask_dict() / unmask_dict() do the same over JSON-serialisable dicts, masking all string leaf values in one pass so a repeated value across fields still maps to the same token.
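
The idea of one pass over all string leaves with a shared token table can be sketched as follows. This is not the library's implementation: map_string_leaves and make_masker are hypothetical helpers, detection is reduced to a single simplified email regex, and tokens are numbered with a counter rather than derived from a hash.

```python
import re

# Simplified email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def map_string_leaves(node, fn):
    """Return a copy of node with fn applied to every string leaf."""
    if isinstance(node, str):
        return fn(node)
    if isinstance(node, dict):
        return {key: map_string_leaves(value, fn) for key, value in node.items()}
    if isinstance(node, list):
        return [map_string_leaves(item, fn) for item in node]
    return node  # numbers, booleans and None pass through unchanged


def make_masker():
    """Build a masking function whose token table is shared across all leaves."""
    tokens = {}

    def mask_text(text):
        def replace(match):
            value = match.group(0)
            # Reuse the token if this value was already seen in another field.
            if value not in tokens:
                tokens[value] = "{{EMAIL:" + format(len(tokens), "05x") + "}}"
            return tokens[value]

        return EMAIL.sub(replace, text)

    return mask_text, tokens


payload = {
    "vendor": {"email": "ap@acme.com"},
    "notes": ["Send copy to ap@acme.com", 42],
}
mask_text, tokens = make_masker()
masked = map_string_leaves(payload, mask_text)
# Both occurrences of ap@acme.com now carry the same placeholder.
```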

scope is a free-form string (e.g. a document or invoice ID). The same PII value repeated within one scope deduplicates to a single stored token instead of being encrypted and stored twice — pass the same scope to mask() and the matching unmask() call.
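
A rough model of scope-based deduplication is a lookup keyed on (scope, value hash) that is consulted before anything new is stored. The store_once function and both dictionaries below are hypothetical and stand in for the storage backend; the real library stores ciphertext, not plaintext, and derives tokens from a hash rather than a counter.

```python
import hashlib

store = {}   # token -> stored value (plaintext here only to keep the sketch short)
index = {}   # (scope, value_hash) -> token


def store_once(value, entity_type, scope):
    """Return (token, created); created is False when the scope already holds the value."""
    value_hash = hashlib.sha256(value.encode("utf-8")).hexdigest()
    key = (scope, value_hash)
    if key in index:
        return index[key], False
    token = "{{" + entity_type + ":" + format(len(store), "05x") + "}}"
    store[token] = value
    index[key] = token
    return token, True


first = store_once("john@acme.com", "EMAIL", "invoice-1")
second = store_once("john@acme.com", "EMAIL", "invoice-1")   # same scope: reused
other = store_once("john@acme.com", "EMAIL", "invoice-2")    # in this sketch, a new scope stores again
```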


Architecture

                    ┌───────────────────────────────────────────┐
                    │             PIIMaskingEngine              │
                    │            (pii_shield.engine)            │
                    │                                           │
                    │   mask()    unmask()    redact()          │
                    └───────┬───────────┬───────────┬───────────┘
                            │           │           │
              ┌─────────────┘           │           └─── (redact never
              │                         │                 leaves this box:
              ▼                         │                 no encrypt, no store)
   ┌──────────────────────────────────┐ │
   │            NEREngine             │ │
   │         (pii_shield.ner)         │ │
   │                                  │ │
   │  RegexNERLayer      (always on)  │ │
   │  SpacyNERLayer      (optional)   │ │
   │  PrivacyFilterLayer (optional)   │ │
   │          │                       │ │
   │  TokenizerSafeSpanMerger         │ │
   │  SpanConflictResolver            │ │
   └──────────┬───────────────────────┘ │
              │ DetectedSpan[]          │
              ▼                         │
   ┌──────────────────────────────────┐ │
   │   DeterministicTokenGenerator    │ │
   │       (pii_shield.tokens)        │ │
   │                                  │ │
   │  {{TYPE:xxxxx}}                  │ │
   │  find_tokens_in_text()           │◄┘
   └──────────┬───────────────────────┘
              │ token, value_hash
              ▼
   ┌──────────────────────────────────┐      ┌──────────────────────────────────┐
   │           AESGCMCipher           │      │          StorageBackend          │
   │       (pii_shield.crypto)        │─────▶│       (pii_shield.storage)       │
   │                                  │      │                                  │
   │  encrypt() / decrypt()           │      │  InMemoryStorage                 │
   │  AES-256-GCM                     │      │  FileSystemStorage               │
   └──────────────────────────────────┘      │  RedisStorage      (extra)       │
                                             │  PostgresStorage   (extra)       │
                                             └──────────────────────────────────┘

Components

PIIMaskingEngine (pii_shield.engine) is the single public entry point. It owns one NEREngine, one DeterministicTokenGenerator, one AESGCMCipher, and one StorageBackend, and wires them together for mask() / unmask() / redact(). This is the only class most callers need to import.

NEREngine (pii_shield.ner) does detection only — it never touches encryption or storage. It runs one or more layers over the input text and merges their output into a single non-overlapping span list:

  • RegexNERLayer — always on, no extra dependencies. High-precision patterns for structured PII: GST, PAN, TAN, ABN, VAT, IBAN, SWIFT, account/sort-code/routing numbers, credit cards, email, phone (India + international), invoice and PO references.
  • SpacyNERLayer — optional. Adds PERSON / ORGANISATION / ADDRESS detection via a local spaCy model. Requires pii-shield[spacy].
  • PrivacyFilterLayer — optional. Adds detection via any HuggingFace token-classification model you point it at, run entirely on-premise through transformers.pipeline. Requires pii-shield[privacy-filter].

When layers disagree or overlap, SpanConflictResolver picks a winner (regex-validated spans win first, then financial-entity-over-phone, then higher confidence, then longer span), and TokenizerSafeSpanMerger stitches back together sub-word fragments that some transformer models emit at token boundaries.
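
One way to express that ordering is a sort key followed by a greedy pass that keeps a span only if it does not overlap an already-kept one. The sketch below is an assumption about how such a resolver could look, not the library's code: the Span fields, priority() and resolve() are all hypothetical, and the financial-over-phone rule is approximated by ranking PHONE spans last at that step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical span record; the real DetectedSpan may carry different fields."""
    start: int
    end: int
    entity_type: str
    confidence: float
    regex_validated: bool


def priority(span):
    """Sort key where smaller tuples win, one element per tie-breaker."""
    return (
        not span.regex_validated,        # 1. regex-validated spans first
        span.entity_type == "PHONE",     # 2. phone loses to a competing (financial) span
        -span.confidence,                # 3. higher confidence
        -(span.end - span.start),        # 4. longer span
    )


def resolve(spans):
    """Keep the best span in each overlapping group, returned in text order."""
    kept = []
    for span in sorted(spans, key=priority):
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


candidates = [
    Span(5, 15, "PHONE", 0.90, False),
    Span(5, 15, "ACCOUNT_NUMBER", 0.80, False),   # beats PHONE despite lower confidence
    Span(20, 33, "EMAIL", 1.00, True),
    Span(20, 24, "PERSON", 0.95, False),          # overlaps the regex-validated EMAIL
]
resolved = resolve(candidates)
```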

DeterministicTokenGenerator (pii_shield.tokens) turns a detected span into a {{ENTITY_TYPE:xxxxx}} placeholder — a 5-hex-character suffix derived from SHA-256(value | entity_type | salt). Same value + same entity type always produces the same token within one salted instance, which is what makes within-document deduplication and find_tokens_in_text() (used by unmask() to locate placeholders) work. It also computes an unsalted value_hash used purely for storage-side deduplication, so multiple engine instances backed by the same storage can recognise a value they've each seen before, even though their salted tokens differ.
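
A minimal sketch of that derivation, using only hashlib and re, is shown below. The "|" separator, UTF-8 encoding, the token regex, and the function names make_token, value_hash, and find_tokens are assumptions made for illustration; the library's exact byte layout may differ.

```python
import hashlib
import re

# Assumed placeholder shape: {{ENTITY_TYPE:5 lowercase hex chars}}
TOKEN_PATTERN = re.compile(r"\{\{([A-Z_]+):([0-9a-f]{5})\}\}")


def make_token(value, entity_type, salt):
    """Salted token: the same inputs always give the same placeholder."""
    material = f"{value}|{entity_type}|{salt}".encode("utf-8")
    digest = hashlib.sha256(material).hexdigest()
    return "{{" + entity_type + ":" + digest[:5] + "}}"


def value_hash(value):
    """Unsalted hash, usable for deduplication across differently salted instances."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def find_tokens(text):
    """Return every placeholder found in text, in order of appearance."""
    return [match.group(0) for match in TOKEN_PATTERN.finditer(text)]


token_a = make_token("john@acme.com", "EMAIL", "salt-a")
token_a_again = make_token("john@acme.com", "EMAIL", "salt-a")
found = find_tokens(f"Contact {token_a} about the invoice")
```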

AESGCMCipher (pii_shield.crypto) is the only component that ever sees plaintext PII outside of the NEREngine. Each value is encrypted with AES-256-GCM using a fresh 96-bit IV, with the entity type bound in as additional authenticated data (AAD) — so a stored ciphertext can't be replayed under a different entity type. Storage backends only ever receive ciphertext, IV, and tag; they never see plaintext.

StorageBackend (pii_shield.storage) is an abstract interface with five methods a backend must implement: put, get, get_many, find_by_value_hash, touch (plus an optional log_access audit hook). PIIMaskingEngine depends only on this interface, which is what makes storage swappable without touching detection, tokenisation, or encryption code. Four implementations ship out of the box:

Backend             Persistence                              Extra required         Notes
InMemoryStorage     None (process lifetime)                  none                   tests, short scripts
FileSystemStorage   Single JSON file, atomic writes          none                   single-process, no external infra
RedisStorage        Redis hashes per token                   pii-shield[redis]      shared across processes/hosts
PostgresStorage     Relational table, auto-migrated schema   pii-shield[postgres]   shared, queryable, audit-loggable

Writing a fifth backend (S3, DynamoDB, Vault, etc.) means subclassing StorageBackend and implementing those five methods — PIIMaskingEngine needs no changes.

Data flow

mask(): NEREngine.detect() finds spans → for each span, compute value_hash and check the backend for an existing token in this scope (dedup) → if new, AESGCMCipher.encrypt() the value → StorageBackend.put() the ciphertext/IV/tag → splice the {{TYPE:xxxxx}} token into the text in place of the original span.

unmask(): DeterministicTokenGenerator.find_tokens_in_text() locates every placeholder → StorageBackend.get_many() fetches all matching records in one round trip → AESGCMCipher.decrypt() each → splice the decrypted values back into the text. Tokens with no matching record are left in place with a [UNRESOLVED] suffix rather than raising, so a partially-available vault degrades instead of failing the whole call.
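
The degrade-instead-of-fail behaviour can be modelled with a single regex substitution over the masked text. In this sketch a plain dict stands in for the decrypted records, and the exact placement of the [UNRESOLVED] suffix is an assumption; unmask_text is a hypothetical name.

```python
import re

# Assumed placeholder shape, matching the examples in this README.
TOKEN_PATTERN = re.compile(r"\{\{[A-Z_]+:[0-9a-f]{5}\}\}")


def unmask_text(masked_text, records):
    """Replace each token with its value; flag tokens that have no record."""

    def restore(match):
        token = match.group(0)
        if token in records:
            return records[token]
        # Leave the token visible and mark it rather than raising.
        return token + "[UNRESOLVED]"

    return TOKEN_PATTERN.sub(restore, masked_text)


records = {"{{EMAIL:abcc2}}": "john@acme.com"}   # the GST record is deliberately missing
restored = unmask_text("Contact {{EMAIL:abcc2}} about GST {{GST:9a03b}}", records)
```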

redact(): NEREngine.detect() finds spans → each span is replaced in-place with [REDACTED:ENTITY_TYPE]. Nothing downstream of detection is invoked — no cipher, no storage — which is what makes it genuinely irreversible rather than just "not currently reversed."
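
A stripped-down version of the redact path is sketched below. It assumes two simplified regexes in place of the full NEREngine and does not handle overlapping spans; detect and redact here are local illustrations, not the library's functions. The one detail worth noting is that replacements are spliced from the end of the text backwards, so earlier offsets stay valid as the string changes length.

```python
import re

# Simplified patterns for illustration; the library's own patterns are more thorough.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "GST": re.compile(r"\b\d{2}[A-Z]{5}\d{4}[A-Z][A-Z\d]Z[A-Z\d]\b"),
}


def detect(text):
    """Return (start, end, entity_type) for every pattern match."""
    spans = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), entity_type))
    return spans


def redact(text):
    """Replace each detected span with [REDACTED:TYPE]; nothing is stored."""
    # Work right to left so the offsets of spans not yet replaced remain correct.
    for start, end, entity_type in sorted(detect(text), reverse=True):
        text = text[:start] + f"[REDACTED:{entity_type}]" + text[end:]
    return text


scrubbed = redact("Contact john@acme.com about GST 27AAPFU0939F1ZV")
# scrubbed == "Contact [REDACTED:EMAIL] about GST [REDACTED:GST]"
```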

Design choices worth knowing about

  • Encryption keys are supplied by the caller, not derived from or stored alongside vault data. This keeps a compromised storage backend from being sufficient on its own to decrypt anything, and keeps key rotation an application-level concern independent of which storage backend you choose.
  • Detection is fully separated from storage. You can swap InMemoryStorage for PostgresStorage without changing anything about how PII is found, and you can add spaCy/transformer layers without touching storage at all.
  • All storage backend methods are async, including InMemoryStorage and FileSystemStorage — so the same calling code works unmodified whether the backend is a Python dict or a networked database.

Configuring detection

from pii_shield import NEREngine, PIIMaskingEngine
from pii_shield.storage import InMemoryStorage

ner = NEREngine(
    enable_spacy=True,                 # PERSON / ORGANISATION / ADDRESS
    spacy_model="en_core_web_sm",
    enable_privacy_filter=True,        # any HF token-classification model
    privacy_filter_model="your/token-classification-model",
    privacy_filter_threshold=0.5,
    privacy_filter_device="cpu",
)

engine = PIIMaskingEngine(storage=InMemoryStorage(), ner_engine=ner)

NEREngine() with no arguments runs regex detection only, with zero extra dependencies.


Encryption key

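from pii_shield import PIIMaskingEngine
from pii_shield.crypto import AESGCMCipher

# Pass an existing key as a 64-character hex string
engine = PIIMaskingEngine(storage=..., encryption_key="<64-char hex string>")
# or generate a fresh key (persist it somewhere safe if you need to unmask later)
engine = PIIMaskingEngine(storage=..., encryption_key=AESGCMCipher.generate_key())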

If you omit encryption_key, an ephemeral key is generated and a warning is logged; anything masked in that session becomes unrecoverable once the process exits. Outside of quick experiments, always pass a stable key. Losing the key makes every previously masked value permanently unrecoverable, by design.
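
One way to keep a stable key is to generate it once and load it from the environment on every start. The sketch below uses only the standard library; the PII_SHIELD_KEY variable name and the load_key_hex helper are hypothetical, not something pii-shield reads on its own. A 64-character hex string decodes to 32 bytes, which is the key size AES-256 requires.

```python
import os
import secrets


def load_key_hex(env_var="PII_SHIELD_KEY"):
    """Read a 64-character hex key from the environment and validate it."""
    key_hex = os.environ.get(env_var)
    if key_hex is None:
        raise RuntimeError(f"{env_var} is not set")
    key_bytes = bytes.fromhex(key_hex)   # raises ValueError on non-hex input
    if len(key_bytes) != 32:
        raise ValueError("expected 32 bytes (64 hex characters) for AES-256")
    return key_hex


# One-off generation: 32 random bytes rendered as 64 hex characters.
new_key = secrets.token_hex(32)

# Set here only so the sketch runs; in practice a secret manager would provide it.
os.environ["PII_SHIELD_KEY"] = new_key
loaded_key = load_key_hex()
```

The loaded string could then be passed as encryption_key when constructing the engine.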


Storage backend examples

from pii_shield.storage import InMemoryStorage, FileSystemStorage, RedisStorage, PostgresStorage

InMemoryStorage()
FileSystemStorage("./vault.json")
RedisStorage("redis://localhost:6379/0")
PostgresStorage("postgresql://user:pass@host:5432/mydb")   # creates its own schema on connect()

Custom backend:

from pii_shield.storage import StorageBackend
from pii_shield.types import TokenRecord

class MyBackend(StorageBackend):
    async def put(self, record: TokenRecord) -> None: ...
    async def get(self, token_value: str) -> TokenRecord | None: ...
    async def get_many(self, token_values: list[str]) -> dict[str, TokenRecord]: ...
    async def find_by_value_hash(self, value_hash: str, scope: str | None) -> str | None: ...
    async def touch(self, token_value: str) -> None: ...
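
For a sense of what filled-in method bodies might look like, here is a dict-backed sketch. It is self-contained rather than built on the real base class: TokenRecord below is a stand-in whose fields are assumptions, DictBackend does not actually subclass StorageBackend, and treating touch() as an access counter is a guess at its purpose.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenRecord:
    """Stand-in record; the real pii_shield.types.TokenRecord may differ."""
    token_value: str
    value_hash: str
    scope: Optional[str]
    ciphertext: bytes
    access_count: int = 0


class DictBackend:
    """In real use this would subclass pii_shield.storage.StorageBackend."""

    def __init__(self):
        self._records = {}

    async def put(self, record):
        self._records[record.token_value] = record

    async def get(self, token_value):
        return self._records.get(token_value)

    async def get_many(self, token_values):
        # Missing tokens are simply absent from the result.
        return {t: self._records[t] for t in token_values if t in self._records}

    async def find_by_value_hash(self, value_hash, scope):
        for record in self._records.values():
            if record.value_hash == value_hash and record.scope == scope:
                return record.token_value
        return None

    async def touch(self, token_value):
        record = self._records.get(token_value)
        if record is not None:
            record.access_count += 1   # assumed meaning: note that the token was used


async def demo():
    backend = DictBackend()
    await backend.put(TokenRecord("{{EMAIL:abcc2}}", "hash-1", "invoice-1", b"\x00\x01"))
    await backend.touch("{{EMAIL:abcc2}}")
    found = await backend.find_by_value_hash("hash-1", "invoice-1")
    many = await backend.get_many(["{{EMAIL:abcc2}}", "{{GST:9a03b}}"])
    record = await backend.get("{{EMAIL:abcc2}}")
    return found, sorted(many), record.access_count


found, resolved_tokens, access_count = asyncio.run(demo())
```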

Error handling

from pii_shield import (
    PIIShieldError,               # base class for everything below
    EngineNotInitialisedError,     # mask()/unmask() called before initialise()
    DecryptionError,               # AES-GCM tag verification failed
    StorageBackendError,           # backend-specific I/O/connection failure
    OptionalDependencyMissingError, # used a backend/layer without its extra installed
)

Running the tests

pip install -e ".[dev]"
pytest
