Pluggable, on-premise-first PII masking, unmasking, and redaction library.

pii-shield

A pluggable, on-premise-first PII masking, unmasking, and redaction library for Python.

pii-shield detects personally identifiable and sensitive business information in free text (emails, phone numbers, GST/PAN/IBAN numbers, person and organisation names, bank details, invoice/PO numbers, and more), and gives you three ways to handle it:

mask it into a reversible placeholder token, encrypted at rest
unmask a previously masked token back to its original value
redact it permanently, with no way to recover the original

There is no server, no API, and no hard dependency on any particular database: it's a library you import and call directly. Where your encrypted PII values live is a pluggable choice: in-memory, a local file, Redis, PostgreSQL, or a backend you write yourself.

Install

pip install pii-shield                      # core: regex detection + in-memory/filesystem storage
pip install "pii-shield[postgres]"           # + PostgreSQL storage backend
pip install "pii-shield[redis]"              # + Redis storage backend
pip install "pii-shield[spacy]"              # + spaCy NER layer (PERSON/ORG/GPE)
pip install "pii-shield[privacy-filter]"     # + transformer token-classification layer
pip install "pii-shield[all]"                # everything

Only cryptography is a hard dependency. asyncpg, redis, spacy, and transformers/torch are all opt-in extras. If you use a backend or detection layer without installing its extra, you get a clear OptionalDependencyMissingError telling you exactly what to install — never a bare ImportError or a silent failure.
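The opt-in behaviour described above is commonly built as a guarded import. The following is a minimal sketch of that pattern, not pii-shield's actual code: the `require` helper is hypothetical, and the exception class here is a local stand-in that only shares its name with the library's error.

```python
import importlib


class OptionalDependencyMissingError(Exception):
    """Local stand-in; the library's real class may carry more detail."""


def require(module_name: str, extra: str):
    """Import module_name, or raise an error that names the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise OptionalDependencyMissingError(
            f"'{module_name}' is not installed. "
            f'Run: pip install "pii-shield[{extra}]"'
        ) from exc


# A stdlib module always imports, so this call succeeds.
json_module = require("json", "core")

# A missing module surfaces as the descriptive error instead of a bare ImportError.
try:
    require("module_that_does_not_exist_xyz", "redis")
except OptionalDependencyMissingError as err:
    message = str(err)
    print(message)
```

Deferring the import until a backend or layer is actually used is what lets the core install stay small.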

Quick start

import asyncio
from pii_shield import PIIMaskingEngine
from pii_shield.storage import InMemoryStorage

async def main():
    async with PIIMaskingEngine(storage=InMemoryStorage()) as engine:
        result = await engine.mask("Contact john@acme.com about GST 27AAPFU0939F1ZV")
        print(result.masked_text)
        # "Contact {{EMAIL:abcc2}} about GST {{GST:9a03b}}"

        original = await engine.unmask(result.masked_text)
        print(original)
        # "Contact john@acme.com about GST 27AAPFU0939F1ZV"

        print(engine.redact("Contact john@acme.com about GST 27AAPFU0939F1ZV"))
        # "Contact [REDACTED:EMAIL] about GST [REDACTED:GST]"  — irreversible, nothing stored

asyncio.run(main())

PIIMaskingEngine is an async context manager — initialise() connects the storage backend, close() releases it. Use async with unless you need to control that lifecycle yourself.

The three operations

Method	Reversible?	Touches storage?	Use for
`mask(text, scope=None)`	Yes, via `unmask()`	Yes	Sending documents to an LLM/cloud service while keeping raw PII on-premise
`unmask(masked_text, scope=None)`	—	Yes (read)	Restoring original values before writing back to source systems
`redact(text)`	No	No	Logs, analytics exports, anything that must never contain recoverable PII

result = await engine.mask(text, scope="invoice-2026-00417")
# result.masked_text     -> text with {{TYPE:xxxxx}} placeholders
# result.token_count     -> number of PII spans masked
# result.entity_counts   -> {"EMAIL": 1, "GST": 1, ...}
# result.entities        -> per-span detail (type, offsets, token, confidence, source)

text_back = await engine.unmask(result.masked_text, scope="invoice-2026-00417")

scrubbed = engine.redact(text)   # synchronous — no storage or encryption involved

mask_dict() / unmask_dict() do the same over JSON-serialisable dicts, masking all string leaf values in one pass so a repeated value across fields still maps to the same token.

scope is a free-form string (e.g. a document or invoice ID). The same PII value repeated within one scope deduplicates to a single stored token instead of being encrypted and stored twice — pass the same scope to mask() and the matching unmask() call.
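To illustrate the "one pass, same token for a repeated value" idea behind mask_dict(), here is a self-contained sketch that walks a nested structure and masks string leaves through a shared value-to-token map. It is an assumption-laden illustration: `mask_leaves`, `token_for`, the email-only regex, and the counter-based token suffix are all hypothetical, and the real library derives its suffix from a hash.

```python
import re

# Simplified email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_leaves(node, token_for):
    """Recursively mask string leaves of dicts and lists; other values pass through."""
    if isinstance(node, dict):
        return {key: mask_leaves(value, token_for) for key, value in node.items()}
    if isinstance(node, list):
        return [mask_leaves(value, token_for) for value in node]
    if isinstance(node, str):
        return EMAIL.sub(lambda match: token_for(match.group()), node)
    return node


seen = {}  # value -> token, shared across the whole pass


def token_for(value):
    """Return the existing token for a value, or mint a new one."""
    if value not in seen:
        # Counter-based suffix is an illustrative assumption.
        seen[value] = "{{EMAIL:" + format(len(seen), "05x") + "}}"
    return seen[value]


doc = {
    "to": "john@acme.com",
    "notes": ["cc john@acme.com", 3],
    "nested": {"from": "amy@acme.com"},
}
masked = mask_leaves(doc, token_for)
print(masked)
```

Because the map is shared across the traversal, the address that appears in both `to` and `notes` ends up with one token and one stored value.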

Architecture

                    ┌─────────────────────────────────────────┐
                    │             PIIMaskingEngine             │
                    │        (pii_shield.engine)               │
                    │                                           │
                    │   mask()    unmask()    redact()          │
                    └───────┬───────────┬───────────┬───────────┘
                            │           │           │
              ┌─────────────┘           │           └─── (redact never
              │                         │                  leaves this box —
              ▼                         │                  no encrypt, no store)
   ┌─────────────────────┐              │
   │      NEREngine        │            │
   │   (pii_shield.ner)     │            │
   │                        │            │
   │  RegexNERLayer   (always on)        │
   │  SpacyNERLayer   (optional)         │
   │  PrivacyFilterLayer (optional)      │
   │        │                            │
   │  TokenizerSafeSpanMerger            │
   │  SpanConflictResolver               │
   └──────────┬─────────────┘            │
              │ DetectedSpan[]           │
              ▼                          │
   ┌─────────────────────┐               │
   │ DeterministicToken-   │             │
   │ Generator              │             │
   │ (pii_shield.tokens)    │             │
   │                        │             │
   │  {{TYPE:xxxxx}}        │             │
   │  find_tokens_in_text() │◄────────────┘
   └──────────┬─────────────┘
              │ token, value_hash
              ▼
   ┌─────────────────────┐        ┌───────────────────────────────┐
   │   AESGCMCipher        │       │        StorageBackend           │
   │  (pii_shield.crypto)   │──────▶       (pii_shield.storage)       │
   │                        │       │                                 │
   │  encrypt() / decrypt() │       │  InMemoryStorage                │
   │  AES-256-GCM            │      │  FileSystemStorage               │
   └────────────────────────┘       │  RedisStorage       (extra)      │
                                     │  PostgresStorage    (extra)      │
                                     └───────────────────────────────┘

Components

PIIMaskingEngine (pii_shield.engine) is the single public entry point. It owns one NEREngine, one DeterministicTokenGenerator, one AESGCMCipher, and one StorageBackend, and wires them together for mask() / unmask() / redact(). This is the only class most callers need to import.

NEREngine (pii_shield.ner) does detection only — it never touches encryption or storage. It runs one or more layers over the input text and merges their output into a single non-overlapping span list:

RegexNERLayer — always on, no extra dependencies. High-precision patterns for structured PII: GST, PAN, TAN, ABN, VAT, IBAN, SWIFT, account/sort-code/routing numbers, credit cards, email, phone (India + international), invoice and PO references.
SpacyNERLayer — optional. Adds PERSON / ORGANISATION / ADDRESS detection via a local spaCy model. Requires pii-shield[spacy].
PrivacyFilterLayer — optional. Adds detection via any HuggingFace token-classification model you point it at, run entirely on-premise through transformers.pipeline. Requires pii-shield[privacy-filter].

When layers disagree or overlap, SpanConflictResolver picks a winner (regex-validated spans win first, then financial-entity-over-phone, then higher confidence, then longer span), and TokenizerSafeSpanMerger stitches back together sub-word fragments that some transformer models emit at token boundaries.
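The precedence order above can be expressed as a sort key followed by a greedy non-overlap pass. This is a sketch under stated assumptions rather than the library's resolver: the `Span` dataclass, the `FINANCIAL` set, and the treatment of "financial over phone" as a general ranking boost are all illustrative choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int  # exclusive
    entity_type: str
    confidence: float
    source: str  # e.g. "regex", "spacy", "model"


# Illustrative set of financial entity types (an assumption).
FINANCIAL = {"GST", "PAN", "IBAN", "ACCOUNT"}


def priority(span):
    """Sort key: a larger tuple wins. Fields are compared left to right."""
    return (
        span.source == "regex",          # regex-validated spans first
        span.entity_type in FINANCIAL,   # approximates financial-over-phone
        span.confidence,                 # then higher confidence
        span.end - span.start,           # then longer span
    )


def resolve(spans):
    """Keep the highest-priority spans that do not overlap an already kept span."""
    kept = []
    for span in sorted(spans, key=priority, reverse=True):
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


candidates = [
    Span(0, 10, "PHONE", 0.9, "spacy"),
    Span(0, 15, "GST", 0.8, "regex"),
    Span(20, 28, "PERSON", 0.7, "spacy"),
    Span(22, 28, "PERSON", 0.95, "model"),
]
resolved = resolve(candidates)
print([(s.start, s.end, s.entity_type) for s in resolved])
```

Here the regex GST span beats the overlapping phone guess despite lower confidence, and between the two PERSON candidates the higher-confidence one is kept.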

DeterministicTokenGenerator (pii_shield.tokens) turns a detected span into a {{ENTITY_TYPE:xxxxx}} placeholder — a 5-hex-character suffix derived from SHA-256(value | entity_type | salt). Same value + same entity type always produces the same token within one salted instance, which is what makes within-document deduplication and find_tokens_in_text() (used by unmask() to locate placeholders) work. It also computes an unsalted value_hash used purely for storage-side deduplication, so multiple engine instances backed by the same storage can recognise a value they've each seen before, even though their salted tokens differ.
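A rough sketch of the token scheme described above, using only hashlib and re. The exact byte layout fed to SHA-256 (a literal "|" separator, UTF-8 encoding), the choice of the first five hex characters, and the helper names are assumptions made for illustration; the library's real derivation may differ.

```python
import hashlib
import re

# Matches placeholders of the form {{TYPE:xxxxx}} with a 5-hex-character suffix.
TOKEN_RE = re.compile(r"\{\{([A-Z_]+):([0-9a-f]{5})\}\}")


def make_token(value, entity_type, salt):
    """Salted, deterministic placeholder for one (value, entity_type) pair."""
    material = f"{value}|{entity_type}|{salt}".encode("utf-8")
    digest = hashlib.sha256(material).hexdigest()
    return "{{" + entity_type + ":" + digest[:5] + "}}"


def value_hash(value):
    """Unsalted hash, usable for deduplication across engine instances."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def find_tokens(text):
    """Return every placeholder found in the text, in order of appearance."""
    return [match.group(0) for match in TOKEN_RE.finditer(text)]


email_token = make_token("john@acme.com", "EMAIL", "salt-a")
gst_token = make_token("27AAPFU0939F1ZV", "GST", "salt-a")
masked = f"Contact {email_token} about GST {gst_token}"
print(find_tokens(masked))
```

Since the suffix depends on the salt, two instances with different salts will usually produce different tokens for the same value, while `value_hash` stays identical for both.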

AESGCMCipher (pii_shield.crypto) is the only component that ever sees plaintext PII outside of the NEREngine. Each value is encrypted with AES-256-GCM using a fresh 96-bit IV, with the entity type bound in as additional authenticated data (AAD) — so a stored ciphertext can't be replayed under a different entity type. Storage backends only ever receive ciphertext, IV, and tag; they never see plaintext.
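The AAD binding can be demonstrated directly with the `cryptography` package's `AESGCM` primitive. This sketch is not the library's AESGCMCipher: the `encrypt` and `decrypt` helpers are hypothetical, and `AESGCM.encrypt` returns the ciphertext with the 16-byte tag appended, whereas the text above describes ciphertext, IV, and tag as separate stored values.

```python
import os

from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # 32 random bytes
aead = AESGCM(key)


def encrypt(value, entity_type):
    """Encrypt one value, binding the entity type in as AAD."""
    iv = os.urandom(12)  # fresh 96-bit IV per value
    sealed = aead.encrypt(iv, value.encode("utf-8"), entity_type.encode("utf-8"))
    return iv, sealed  # sealed = ciphertext followed by the 16-byte tag


def decrypt(iv, sealed, entity_type):
    """Decrypt; raises InvalidTag if the data or the AAD does not match."""
    return aead.decrypt(iv, sealed, entity_type.encode("utf-8")).decode("utf-8")


iv, sealed = encrypt("john@acme.com", "EMAIL")
roundtrip = decrypt(iv, sealed, "EMAIL")

# Presenting the same ciphertext under a different entity type fails authentication.
try:
    decrypt(iv, sealed, "PHONE")
    replay_rejected = False
except InvalidTag:
    replay_rejected = True

print(roundtrip, replay_rejected)
```

The AAD is authenticated but not encrypted, so the entity type can be stored in the clear next to the record without weakening the check.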

StorageBackend (pii_shield.storage) is an abstract interface with five methods a backend must implement: put, get, get_many, find_by_value_hash, touch (plus an optional log_access audit hook). PIIMaskingEngine depends only on this interface, which is what makes storage swappable without touching detection, tokenisation, or encryption code. Four implementations ship out of the box:

Backend	Persistence	Extra required	Notes
`InMemoryStorage`	None (process lifetime)	—	tests, short scripts
`FileSystemStorage`	Single JSON file, atomic writes	—	single-process, no external infra
`RedisStorage`	Redis hashes per token	`pii-shield[redis]`	shared across processes/hosts
`PostgresStorage`	Relational table, auto-migrated schema	`pii-shield[postgres]`	shared, queryable, audit-loggable

Writing a fifth backend (S3, DynamoDB, Vault, etc.) means subclassing StorageBackend and implementing those five methods — PIIMaskingEngine needs no changes.
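For anyone writing a file-based backend of their own, the "atomic writes" property listed for FileSystemStorage is typically achieved by writing to a temporary file and renaming it over the target. The sketch below shows that general technique with the standard library. It is an assumption about the approach, not a copy of the library's code, and `atomic_write_json` is a hypothetical helper.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write data as JSON so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    vault_path = os.path.join(workdir, "vault.json")
    atomic_write_json(vault_path, {"{{EMAIL:abcc2}}": "ciphertext-goes-here"})
    with open(vault_path, encoding="utf-8") as handle:
        loaded = json.load(handle)

print(loaded)
```

A crash midway leaves at worst a stray `.tmp` file, never a half-written vault.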

Data flow

mask(): NEREngine.detect() finds spans → for each span, compute value_hash and check the backend for an existing token in this scope (dedup) → if new, AESGCMCipher.encrypt() the value → StorageBackend.put() the ciphertext/IV/tag → splice the {{TYPE:xxxxx}} token into the text in place of the original span.

unmask(): DeterministicTokenGenerator.find_tokens_in_text() locates every placeholder → StorageBackend.get_many() fetches all matching records in one round trip → AESGCMCipher.decrypt() each → splice the decrypted values back into the text. Tokens with no matching record are left in place with a [UNRESOLVED] suffix rather than raising, so a partially-available vault degrades instead of failing the whole call.
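The degrade-instead-of-fail behaviour can be sketched with a regex substitution over a plain dict. This leaves out decryption entirely, and both the `unmask_text` helper and the exact placement of the [UNRESOLVED] marker are assumptions for illustration.

```python
import re

TOKEN_RE = re.compile(r"\{\{[A-Z_]+:[0-9a-f]{5}\}\}")


def unmask_text(masked_text, vault):
    """Replace known tokens with their values; flag unknown ones instead of raising."""
    def restore(match):
        token = match.group(0)
        if token in vault:
            return vault[token]
        # Marker format is an assumption; the token itself is left in place.
        return token + "[UNRESOLVED]"

    return TOKEN_RE.sub(restore, masked_text)


# Only the email token has a record; the GST token is missing from the vault.
vault = {"{{EMAIL:abcc2}}": "john@acme.com"}
restored = unmask_text("Contact {{EMAIL:abcc2}} about GST {{GST:9a03b}}", vault)
print(restored)
# Contact john@acme.com about GST {{GST:9a03b}}[UNRESOLVED]
```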

redact(): NEREngine.detect() finds spans → each span is replaced in-place with [REDACTED:ENTITY_TYPE]. Nothing downstream of detection is invoked — no cipher, no storage — which is what makes it genuinely irreversible rather than just "not currently reversed."
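As a rough illustration of detection followed by in-place replacement, the sketch below redacts two entity types using simplified regexes. The patterns are illustrative assumptions, not the library's, and this toy version does not resolve overlapping matches.

```python
import re

# Simplified patterns for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "GST": re.compile(r"\b\d{2}[A-Z]{5}\d{4}[A-Z][0-9A-Z]Z[0-9A-Z]\b"),
}


def redact(text):
    """Replace every detected span with [REDACTED:TYPE]; nothing is stored."""
    spans = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), entity_type))
    # Splice from right to left so earlier offsets remain valid.
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + f"[REDACTED:{entity_type}]" + text[end:]
    return text


scrubbed = redact("Contact john@acme.com about GST 27AAPFU0939F1ZV")
print(scrubbed)
# Contact [REDACTED:EMAIL] about GST [REDACTED:GST]
```

No mapping from placeholder to value is ever created, so there is nothing to look up later.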

Design choices worth knowing about

Encryption keys are supplied by the caller, not derived from or stored alongside vault data. This keeps a compromised storage backend from being sufficient on its own to decrypt anything, and keeps key rotation an application-level concern independent of which storage backend you choose.
Detection is fully separated from storage. You can swap InMemoryStorage for PostgresStorage without changing anything about how PII is found, and you can add spaCy/transformer layers without touching storage at all.
All storage backend methods are async, including InMemoryStorage and FileSystemStorage — so the same calling code works unmodified whether the backend is a Python dict or a networked database.

Configuring detection

from pii_shield import NEREngine, PIIMaskingEngine
from pii_shield.storage import InMemoryStorage

ner = NEREngine(
    enable_spacy=True,                 # PERSON / ORGANISATION / ADDRESS
    spacy_model="en_core_web_sm",
    enable_privacy_filter=True,        # any HF token-classification model
    privacy_filter_model="your/token-classification-model",
    privacy_filter_threshold=0.5,
    privacy_filter_device="cpu",
)

engine = PIIMaskingEngine(storage=InMemoryStorage(), ner_engine=ner)

NEREngine() with no arguments runs regex detection only, with zero extra dependencies.

Encryption key

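from pii_shield import PIIMaskingEngine
from pii_shield.crypto import AESGCMCipher

# Supply a stable key that you manage yourself
engine = PIIMaskingEngine(storage=..., encryption_key="<64-char hex string>")
# or generate a new key (persist it if masked values must outlive the process)
engine = PIIMaskingEngine(storage=..., encryption_key=AESGCMCipher.generate_key())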

If you omit encryption_key, an ephemeral key is generated and a warning is logged; anything masked in that session becomes unrecoverable once the process exits. Outside of quick experiments, always pass a stable key and keep it safe: by design, losing the key makes every previously masked value permanently unrecoverable.
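A 64-character hex string decodes to 32 bytes, which is the 256-bit key size AES-256 expects. The sketch below generates and validates such a string with the standard library only. `parse_key` is a hypothetical helper, and whether AESGCMCipher.generate_key() returns this exact format is not something this example confirms.

```python
import secrets


def parse_key(hex_key):
    """Decode a hex key and check that it is exactly 32 bytes long."""
    key = bytes.fromhex(hex_key)  # raises ValueError on non-hex input
    if len(key) != 32:
        raise ValueError(f"expected 32 bytes (64 hex chars), got {len(key)} bytes")
    return key


hex_key = secrets.token_hex(32)  # 32 random bytes rendered as 64 hex characters
key = parse_key(hex_key)
print(len(hex_key), len(key))

# A key of the wrong length is rejected up front.
try:
    parse_key("abcd")
    short_key_rejected = False
except ValueError:
    short_key_rejected = True
```

In practice the hex string would come from a secret manager or an environment variable rather than being generated on each start.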

Storage backend examples

from pii_shield.storage import InMemoryStorage, FileSystemStorage, RedisStorage, PostgresStorage

InMemoryStorage()
FileSystemStorage("./vault.json")
RedisStorage("redis://localhost:6379/0")
PostgresStorage("postgresql://user:pass@host:5432/mydb")   # creates its own schema on connect()

Custom backend:

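from __future__ import annotations  # lets `X | None` annotations work on Python < 3.10

from pii_shield.storage import StorageBackend
from pii_shield.types import TokenRecord

class MyBackend(StorageBackend):
    """Skeleton backend: replace each `...` with calls into your own store."""

    # Persist one encrypted record.
    async def put(self, record: TokenRecord) -> None: ...
    # Fetch a single record by its token, or None if it is missing.
    async def get(self, token_value: str) -> TokenRecord | None: ...
    # Fetch several records in one round trip (used by unmask()).
    async def get_many(self, token_values: list[str]) -> dict[str, TokenRecord]: ...
    # Return the existing token for this value hash within the scope, if any (dedup).
    async def find_by_value_hash(self, value_hash: str, scope: str | None) -> str | None: ...
    async def touch(self, token_value: str) -> None: ...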
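To show the shape of a complete backend without depending on the library, here is a standalone dict-backed sketch that mirrors the five method names. It does not subclass the real StorageBackend. The `Record` dataclass is a hypothetical stand-in for TokenRecord with guessed fields, and treating touch() as an access counter is an assumption.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for TokenRecord; the real fields may differ."""
    token_value: str
    value_hash: str
    scope: Optional[str]
    ciphertext: bytes
    access_count: int = 0


class DictBackend:
    """Mirrors the five-method interface using an in-process dict."""

    def __init__(self):
        self._records = {}

    async def put(self, record):
        self._records[record.token_value] = record

    async def get(self, token_value):
        return self._records.get(token_value)

    async def get_many(self, token_values):
        # Missing tokens are simply absent from the result.
        return {t: self._records[t] for t in token_values if t in self._records}

    async def find_by_value_hash(self, value_hash, scope):
        for record in self._records.values():
            if record.value_hash == value_hash and record.scope == scope:
                return record.token_value
        return None

    async def touch(self, token_value):
        record = self._records.get(token_value)
        if record is not None:
            record.access_count += 1  # assumed meaning of "touch"


async def demo():
    backend = DictBackend()
    await backend.put(Record("{{EMAIL:abcc2}}", "hash-1", "invoice-1", b"..."))
    found = await backend.find_by_value_hash("hash-1", "invoice-1")
    missing = await backend.find_by_value_hash("hash-1", "other-scope")
    many = await backend.get_many(["{{EMAIL:abcc2}}", "{{GST:9a03b}}"])
    await backend.touch("{{EMAIL:abcc2}}")
    record = await backend.get("{{EMAIL:abcc2}}")
    return found, missing, sorted(many), record.access_count


result = asyncio.run(demo())
print(result)
```

A real backend would keep the same method signatures and swap the dict for calls into its own store.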

Error handling

from pii_shield import (
    PIIShieldError,               # base class for everything below
    EngineNotInitialisedError,     # mask()/unmask() called before initialise()
    DecryptionError,               # AES-GCM tag verification failed
    StorageBackendError,           # backend-specific I/O/connection failure
    OptionalDependencyMissingError, # used a backend/layer without its extra installed
)

Running the tests

pip install -e ".[dev]"
pytest

Project details

Release history

0.1.2

Jul 3, 2026

0.1.1

Jul 3, 2026

This version

0.1.0

Jul 3, 2026

Download files

Source Distribution

pii_protect-0.1.0.tar.gz (30.5 kB view details)

Uploaded Jul 3, 2026 Source

Built Distribution

pii_protect-0.1.0-py3-none-any.whl (35.2 kB view details)

Uploaded Jul 3, 2026 Python 3

File details

Details for the file pii_protect-0.1.0.tar.gz.

File metadata

Download URL: pii_protect-0.1.0.tar.gz
Upload date: Jul 3, 2026
Size: 30.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for pii_protect-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`7e2515dd04aed31528e05a08f47297285cefc9f5ba3ace8a15b1153a43fcb00c`
MD5	`6fde76ac85437797d1c9a6b67ca26a6d`
BLAKE2b-256	`ecc93101e11337848507193ed7bb5c455e7814985a191ef0e6162c3419229ffa`

File details

Details for the file pii_protect-0.1.0-py3-none-any.whl.

File metadata

Download URL: pii_protect-0.1.0-py3-none-any.whl
Upload date: Jul 3, 2026
Size: 35.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for pii_protect-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`097ee41ddd67ee151f8b0caf43854d178c1b0a0d21f41fc22e82894c261419bf`
MD5	`0afa06202925fc74214a179a2170872f`
BLAKE2b-256	`5876074d391f1218b673ca9b3978de07ad03a83c9d3c578b07c252a4f1b62d94`
