pii-presidio
Microsoft Presidio plugin: multi-language PII recognizers with reversible anonymization, built on pii-core and pii-veil.
Install
pip install pii-presidio
python -m spacy download pl_core_news_sm # required for Polish NLP analysis
pii-presidio pulls in presidio-analyzer, presidio-anonymizer, pii-core, and pii-veil. spaCy itself comes via Presidio; the Polish language model has to be downloaded separately (Presidio's standard pattern).
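Because the model is a separate download, engine creation can fail on a fresh environment. The following is a minimal pre-flight sketch, assuming the model is installed as an importable Python package (which is how `spacy download` installs it). `model_installed` is a hypothetical helper for illustration and is not part of pii-presidio.

```python
import importlib.util


def model_installed(name: str = "pl_core_news_sm") -> bool:
    """Return True if a package with this name can be imported."""
    return importlib.util.find_spec(name) is not None


if not model_installed():
    # Hypothetical handling: tell the user what to run instead of failing later.
    print("Missing model; run: python -m spacy download pl_core_news_sm")
```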
Recognizers
from pii_presidio import get_recognizers
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Build a spaCy-backed NLP engine for Polish.
nlp_engine = NlpEngineProvider(nlp_configuration={
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "pl", "model_name": "pl_core_news_sm"}],
}).create_engine()

# Register the pii-core based recognizers for Polish.
registry = RecognizerRegistry(supported_languages=["pl"])
for r in get_recognizers(["pl"]):
    registry.add_recognizer(r)

analyzer = AnalyzerEngine(registry=registry, nlp_engine=nlp_engine, supported_languages=["pl"])
results = analyzer.analyze(text="PESEL 44051401359, email jan@example.pl", language="pl")
Each pii_core detector becomes one PatternRecognizer. Confidence scores are 0.85 for checksum-validated detectors (PESEL, NIP, REGON, IBAN, credit card) and 0.4 for regex-only ones (ID card, passport, phone, email). Per-detector context words are pre-set to common Polish keywords; pass your own via PiiCoreRecognizer(detector, context=[...]) if you need different boosts.
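To illustrate why checksum-validated detectors can carry a higher score than regex-only ones, here is a sketch of the publicly documented PESEL check-digit rule (weights 1, 3, 7, 9 repeated over the first ten digits). It is an independent illustration, not pii-core's implementation, and `pesel_checksum_ok` is a hypothetical name.

```python
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def pesel_checksum_ok(value: str) -> bool:
    """Check the 11th digit of a PESEL against the weighted sum of the first ten."""
    if len(value) != 11 or not (value.isascii() and value.isdigit()):
        return False
    digits = [int(c) for c in value]
    # zip stops after the ten weights, so the check digit itself is excluded.
    total = sum(w * d for w, d in zip(PESEL_WEIGHTS, digits))
    return (10 - total % 10) % 10 == digits[10]


print(pesel_checksum_ok("44051401359"))  # True
print(pesel_checksum_ok("44051401358"))  # False
```

A random 11-digit string passes this check only about one time in ten, which is why a match that also validates is stronger evidence than a bare pattern match.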
KRS and postal-code detectors are excluded by default (their raw regexes match ordinary 10-digit and XX-XXX strings); enable them with include_opt_in=True and pair with strict context filtering.
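The false-positive problem with XX-XXX strings can be seen in a small plain-Python sketch. It keeps a postal-code match only when a keyword appears shortly before it. The keyword list, the 30-character window, and the function name are all assumptions for illustration; this is not how pii-presidio or Presidio's context handling is implemented.

```python
import re

POSTAL_RE = re.compile(r"\b\d{2}-\d{3}\b")
CONTEXT_WORDS = ("kod pocztowy", "adres", "ul.")  # hypothetical keyword list


def postal_codes_with_context(text: str, window: int = 30) -> list[str]:
    """Keep XX-XXX matches only if a context word occurs within `window` characters before them."""
    hits = []
    for m in POSTAL_RE.finditer(text):
        before = text[max(0, m.start() - window):m.start()].lower()
        if any(word in before for word in CONTEXT_WORDS):
            hits.append(m.group())
    return hits


print(postal_codes_with_context("Adres: ul. Prosta 1, 00-950 Warszawa"))  # ['00-950']
print(postal_codes_with_context("Wynik 12-345 punktów"))                  # []
```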
Reversible anonymization
from pii_veil import Mapping, Shield
from pii_presidio import ReversibleReplaceOperator, reversible_operators
from presidio_anonymizer import AnonymizerEngine

mapping = Mapping()
engine = AnonymizerEngine()
engine.add_anonymizer(ReversibleReplaceOperator)

# `results` is the analyzer output from the Recognizers example.
result = engine.anonymize(
    text="PESEL 44051401359, email jan@example.pl",
    analyzer_results=results,
    operators=reversible_operators(mapping),
)
# result.text -> "PESEL [PL_PESEL_001], email [EMAIL_001]"

# Send result.text to an LLM, get a response back, then:
restored = Shield(mapping=mapping).deanonymize(llm_response_text)
The Mapping is the round-trip handle. It uses the same JSON format as standalone pii-veil, so you can mix the two: anonymize via Presidio and deanonymize via Shield, or vice versa.
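The round-trip idea can be sketched without either library. The `ToyMapping` class and the two functions below are hypothetical stand-ins written for illustration; they do not reproduce pii-veil's `Mapping` format or API. They only show how a placeholder store lets text be masked and later restored.

```python
class ToyMapping:
    """Hypothetical placeholder store: remembers which token stands for which value."""

    def __init__(self):
        self.values = {}    # token -> original value
        self.tokens = {}    # (entity_type, value) -> token
        self.counters = {}  # entity_type -> last index used

    def placeholder_for(self, entity_type: str, value: str) -> str:
        key = (entity_type, value)
        if key not in self.tokens:
            self.counters[entity_type] = self.counters.get(entity_type, 0) + 1
            token = f"[{entity_type}_{self.counters[entity_type]:03d}]"
            self.tokens[key] = token
            self.values[token] = value
        return self.tokens[key]


def anonymize(text, spans, mapping):
    """spans are (start, end, entity_type); replace right to left so offsets stay valid."""
    for start, end, entity_type in sorted(spans, reverse=True):
        token = mapping.placeholder_for(entity_type, text[start:end])
        text = text[:start] + token + text[end:]
    return text


def deanonymize(text, mapping):
    """Substitute every known token back to its original value."""
    for token, original in mapping.values.items():
        text = text.replace(token, original)
    return text


text = "PESEL 44051401359, email jan@example.pl"
pesel, email = "44051401359", "jan@example.pl"
spans = [
    (text.index(pesel), text.index(pesel) + len(pesel), "PL_PESEL"),
    (text.index(email), text.index(email) + len(email), "EMAIL"),
]
mapping = ToyMapping()
masked = anonymize(text, spans, mapping)
print(masked)                                # PESEL [PL_PESEL_001], email [EMAIL_001]
print(deanonymize(masked, mapping) == text)  # True
```

Because restoration is a plain token lookup, it also works on text the masked input never contained verbatim, such as an LLM reply that quotes `[EMAIL_001]` in a new sentence.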
Entity name mapping
| pii_core.PIIType | Presidio entity name |
|---|---|
| PL_PESEL, PL_NIP, PL_REGON, PL_ID_CARD, PL_PASSPORT, PL_KRS, PL_POSTAL_CODE | same string (country-prefixed) |
| PL_PHONE | PHONE_NUMBER |
| PL_IBAN | IBAN_CODE |
| EMAIL | EMAIL_ADDRESS |
| CREDIT_CARD | CREDIT_CARD |
Cross-language types use Presidio's standard names so existing pipelines that filter entities=["EMAIL_ADDRESS"] pick our recognizers up unchanged.
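The table above can be read as a small lookup with an identity fallback. The sketch below transcribes it into a dict; `RENAMED` and `presidio_entity_name` are hypothetical names for illustration and are not exported by pii-presidio.

```python
# pii_core type -> Presidio entity name, transcribed from the table above.
RENAMED = {
    "PL_PHONE": "PHONE_NUMBER",
    "PL_IBAN": "IBAN_CODE",
    "EMAIL": "EMAIL_ADDRESS",
}


def presidio_entity_name(pii_type: str) -> str:
    """Types without an entry keep their own name (e.g. PL_PESEL, CREDIT_CARD)."""
    return RENAMED.get(pii_type, pii_type)


print(presidio_entity_name("EMAIL"))     # EMAIL_ADDRESS
print(presidio_entity_name("PL_PESEL"))  # PL_PESEL
```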
Sibling packages
- pii-core: multi-language detection primitives this plugin reuses.
- pii-veil: non-Presidio reversible anonymization with the same Mapping format.
License
Apache-2.0. See LICENSE.