PII Firewall 🛡️
Open-source PII firewall for LLM apps: detect, anonymize and rehydrate sensitive data before it reaches OpenAI, Anthropic or any LLM provider.
Why PII Firewall?
Most PII tools were built for data pipelines, not for LLM calls. PII Firewall is designed specifically around the detect → sanitize → LLM → rehydrate round-trip:
- Domain awareness: keep relevant data (medical diagnoses in healthcare, transaction amounts in finance) so the LLM still has context, while stripping what must not leave your system
- Auto language detection: 55+ languages detected automatically, with thread-level caching (0 ms after the first call)
- Locale-specific patterns: country-specific ID formats such as Spanish DNI, US SSN, French INSEE, German Steuernummer, Italian Codice Fiscale, Portuguese NIF, and more
- 7 detection backends: regex, Presidio, Hybrid, GLiNER, Transformers, OPF, Nemotron; switch with one parameter
- 7 disposition actions: Keep, Redact, Pseudonymize, Generalize, Mask, Hash, Suppress
- Reversible pseudonymization: the vault stores original-to-token mappings; real names are restored in LLM responses
- Streaming support: secure_call_stream() yields rehydrated tokens in real time
- GDPR Art. 17 right to be forgotten: firewall.forget() wipes all mappings for a thread or case
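The streaming feature is worth picturing concretely: a placeholder like PERSON_1 can be split across two streamed chunks, so rehydration has to buffer a possible partial token before substituting. The sketch below is illustrative only, not the library's implementation; `rehydrate_stream` and its chunking are assumptions.

```python
import re

def rehydrate_stream(chunks, mapping):
    """Yield text with placeholders like PERSON_1 restored,
    even when a placeholder is split across two chunks."""
    token_re = re.compile(r"[A-Z]+_\d+")
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Hold back a possible partial placeholder at the end of the buffer
        safe_end = len(buffer)
        partial = re.search(r"[A-Z_0-9]+$", buffer)
        if partial:
            safe_end = partial.start()
        out, buffer = buffer[:safe_end], buffer[safe_end:]
        yield token_re.sub(lambda m: mapping.get(m.group(), m.group()), out)
    if buffer:  # flush whatever remains after the stream ends
        yield token_re.sub(lambda m: mapping.get(m.group(), m.group()), buffer)

mapping = {"PERSON_1": "John Doe", "EMAIL_1": "john@example.com"}
chunks = ["Contact PERS", "ON_1 at EMA", "IL_1 today"]
print("".join(rehydrate_stream(chunks, mapping)))
# Contact John Doe at john@example.com today
```

The point of the buffer is correctness at chunk boundaries; the real `secure_call_stream()` handles this for you.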
📦 Quick Start
Installation
# From PyPI (basic, pattern-based)
pip install pii-firewall
# Recommended: With Presidio and language detection
pip install "pii-firewall[presidio,langdetect]"
# Full features (includes transformers, OPF, GLiNER)
pip install "pii-firewall[all]"
# Local development install
pip install -e .
# Focused installs
pip install "pii-firewall[opf]" # OPF runtime (or install from source if your environment requires it)
pip install "pii-firewall[gliner]" # GLiNER PII models
Basic Usage
from privacy_firewall import create_firewall
# Create healthcare firewall (auto-detects language)
firewall = create_firewall("healthcare")
# Process text
result = firewall.process(
    text="Ana García, 43 años, hipertensión. Prescripción: enalapril 10mg.",
    context={
        "tenant_id": "hospital-001",
        "case_id": "patient-123",
        "thread_id": "consultation-1",
        "actor_id": "doctor-456",
    },
)
print(result.sanitized_text)
# Output: "PERSON_1, [AGE_40-49], hipertensión. enalapril 10mg."
# Notice: Medical terms (hipertensión, enalapril) are KEPT!
🎯 Domain Profiles
Healthcare
Keeps medical data relevant for diagnosis while protecting patient identity:
firewall = create_firewall("healthcare")
# Keeps: diagnoses, medications, procedures, lab values
# Redacts: names, IDs, addresses
# Generalizes: ages (43 → 40-49), dates (specific → month/year)
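The age generalization shown above (43 → [AGE_40-49]) amounts to decade bucketing. A minimal illustrative sketch of the idea, not the library's code:

```python
def generalize_age(age: int) -> str:
    """Replace an exact age with a decade bucket, e.g. 43 -> [AGE_40-49]."""
    low = (age // 10) * 10
    return f"[AGE_{low}-{low + 9}]"

print(generalize_age(43))  # [AGE_40-49]
```

Bucketing keeps clinical context (roughly how old the patient is) while removing a quasi-identifier.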
Finance
Protects customer PII and financial identifiers. Amounts and transaction context pass through without detection (not regulated PII):
firewall = create_firewall("finance")
# Keeps: company names, transaction context (amounts pass through as non-PII)
# Masks: credit card numbers (4111...1111)
# Pseudonymizes: account numbers, IBANs, tax IDs (reversible)
# Redacts: customer PII (names, addresses) and medical data
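Masking, as used for card numbers, keeps just enough digits to stay recognizable (4111...1111). A sketch of the idea, assuming first-4/last-4 retention; not the library's masker:

```python
def mask_card(number: str, keep: int = 4) -> str:
    """Keep the first and last `keep` digits and elide the middle."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) <= 2 * keep:
        return "..."  # too short to mask meaningfully
    return f"{digits[:keep]}...{digits[-keep:]}"

print(mask_card("4111 1111 1111 1111"))  # 4111...1111
```

Unlike redaction, the masked value still supports "which card was this?" conversations without exposing the full PAN.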
Legal
High anonymity for legal documents:
firewall = create_firewall("legal")
# Keeps: company/firm names (courts, agencies are public record)
# Note: statutes, case numbers, legal citations are public record and pass through
# Pseudonymizes: party names (reversible for case management)
# Generalizes: all dates to month/year
# Redacts: strong identifiers and cross-domain medical data
🌍 Multi-Language Support
Auto-detects 55+ languages with 0 ms overhead after the first detection:
firewall = create_firewall("healthcare")
# Spanish - detected automatically
result_es = firewall.process(
    text="Paciente con diabetes tipo 2, DNI 12345678A",
    context={...}
)
# English - detected automatically
result_en = firewall.process(
    text="Patient with type 2 diabetes, SSN 123-45-6789",
    context={...}
)
# French - detected automatically
result_fr = firewall.process(
    text="Patient avec diabète, INSEE 1234567890123",
    context={...}
)
Supported locales: ES, US, FR, DE, IT, PT, plus global patterns
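The "0 ms after the first call" behaviour comes from caching the detected language per thread, so a conversation pays the detection cost once. Schematically, it is memoization keyed by thread ID; the sketch below uses a stand-in detector and is not the library's internals:

```python
_lang_cache: dict[str, str] = {}

def detect_language_cached(thread_id: str, text: str, detect) -> str:
    """Detect the language once per thread, then reuse the cached result."""
    if thread_id not in _lang_cache:
        _lang_cache[thread_id] = detect(text)
    return _lang_cache[thread_id]

calls = []
def fake_detect(text: str) -> str:  # stand-in for a real detector
    calls.append(text)
    return "es"

detect_language_cached("consultation-1", "Paciente con diabetes", fake_detect)
detect_language_cached("consultation-1", "Otra consulta", fake_detect)
print(len(calls))  # 1 -- the second call hit the cache
```

The cache assumes one language per thread, which holds for most single-user conversations.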
🔧 Advanced Usage
Custom Profiles
from privacy_firewall import (
PrivacyFirewall,
create_custom_profile,
EntityDisposition,
DispositionAction,
)
# Create custom profile
profile = create_custom_profile("legal_discovery")
# Add entity dispositions
profile.add_disposition(EntityDisposition(
    entity_type="PERSON",
    action=DispositionAction.PSEUDONYMIZE,
    confidence_threshold=0.8,
))
profile.add_disposition(EntityDisposition(
    entity_type="CASE_NUMBER",
    action=DispositionAction.KEEP,
    confidence_threshold=0.9,
))
firewall = PrivacyFirewall(profile=profile)
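Conceptually, a profile is a lookup from entity type to an (action, threshold) pair, and a detection only triggers its action when the detector's confidence clears the threshold. A schematic of that resolution step (illustrative names; not the library's internals):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    entity_type: str
    text: str
    confidence: float

# Mirrors the dispositions registered above
dispositions = {
    "PERSON": ("pseudonymize", 0.8),
    "CASE_NUMBER": ("keep", 0.9),
}

def resolve(det: Detection, default=("redact", 0.5)) -> str:
    """Pick the action for a detection; low-confidence hits are left alone."""
    action, threshold = dispositions.get(det.entity_type, default)
    return action if det.confidence >= threshold else "keep"

print(resolve(Detection("PERSON", "Ana", 0.95)))   # pseudonymize
print(resolve(Detection("PERSON", "maybe", 0.6)))  # keep (below threshold)
```

The `default` fallback stands in for whatever the profile does with unconfigured entity types; that choice is an assumption here.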
Adding Your Own Custom PII Detectors
There are two approaches depending on whether you need regex rules or a full ML/NLP model.
Option A โ Regex pattern (no ML, any backend)
Add patterns directly to the catalog at runtime. Works with all detection backends.
import re
from privacy_firewall.patterns.catalog import EntityPattern
# Quick one-liner helper
firewall.add_custom_regex(
    entity_type="EMPLOYEE_ID",
    regex=r"\bEMP-\d{6}\b",
    locales=["GLOBAL"],  # or ["US"], ["ES"], etc.
    confidence=0.95,
    context_words=["employee", "staff"],
    disposition_action="redact",  # keep / redact / pseudonymize / mask ...
)
# Or build the full EntityPattern object for more control
firewall.add_custom_pattern(EntityPattern(
    entity_type="CASE_NUMBER",
    locale="ES",
    pattern=re.compile(r"\bEXP-\d{4}/\d{6}\b"),
    confidence=0.98,
    context_words=("expediente", "exp"),
    description="Spanish legal case number",
))
Option B โ Custom NLP/ML recognizer (Presidio backend)
Pass your own Presidio EntityRecognizer (or PatternRecognizer) when creating the firewall.
This is the right approach when you want to use a spaCy model, a transformer, or any custom heuristic.
from privacy_firewall import create_firewall
from privacy_firewall.presidio_integration import create_custom_recognizer
# Helper that wraps a regex list into a Presidio PatternRecognizer
employee_recognizer = create_custom_recognizer(
    entity_type="EMPLOYEE_ID",
    patterns=[r"\bEMP\d{6}\b"],
    context_words=["employee", "badge"],
    score=0.9,
)
firewall = create_firewall(
    domain="generic",
    detector_backend="presidio",  # required for this approach
    custom_recognizers=[employee_recognizer],
)
For a fully custom ML-based recognizer, subclass Presidio's EntityRecognizer and pass the instance the same way:
from presidio_analyzer import EntityRecognizer, RecognizerResult
class MyModelRecognizer(EntityRecognizer):
    """Example: wraps any ML model as a Presidio recognizer."""

    def load(self): ...

    def analyze(self, text, entities, nlp_artifacts):
        results = []
        # Call your model here and build RecognizerResult objects
        for span in my_model.predict(text):
            results.append(RecognizerResult(
                entity_type="CUSTOM_ENTITY",
                start=span.start,
                end=span.end,
                score=span.confidence,
            ))
        return results
firewall = create_firewall(
    domain="generic",
    detector_backend="presidio",
    custom_recognizers=[MyModelRecognizer(supported_entities=["CUSTOM_ENTITY"])],
)
Which option to use?
| Scenario | Approach |
|---|---|
| Regex or rule-based custom entity | Option A โ add_custom_regex / add_custom_pattern |
| Locale-specific ID format (new country) | Option A with the matching locale code |
| Existing HuggingFace / spaCy NER model | Option B โ wrap in EntityRecognizer subclass |
| Complex heuristic or external API call | Option B โ implement analyze() freely |
Testing a HuggingFace PII Model
The library has a built-in transformers backend. The quickest way to try any HuggingFace NER model is:
pip install "pii-firewall[transformers]"
from privacy_firewall import create_firewall
# Pass any HuggingFace model ID โ downloaded automatically on first call
firewall = create_firewall(
    "healthcare",
    detector_backend="transformers",
    transformer_model_id="dslim/bert-base-NER",  # swap for any HF model ID
)
result = firewall.process(
    text="John Doe, SSN 123-45-6789, prescribed enalapril 10mg",
    context={"tenant_id": "t1", "case_id": "c1", "thread_id": "th1", "actor_id": "a1"},
)
print(result.sanitized_text)
Curated model catalog
The library ships a pre-vetted catalog of models in transformers_ner/models.py:
from privacy_firewall.transformers_ner.models import get_model_for_domain
config = get_model_for_domain("medical", "en")
firewall = create_firewall("healthcare", detector_backend="transformers", transformer_model_id=config.model_id)
| Domain | Language | Model |
|---|---|---|
| General | en | dslim/bert-base-NER |
| General | multilingual | Davlan/xlm-roberta-base-ner-hrl |
| General | fr | Jean-Baptiste/camembert-ner |
| Medical | en | d4data/biomedical-ner-all |
| Medical | es | PlanTL-GOB-ES/bsc-bio-ehr-es |
Run on GPU
firewall = create_firewall(
    "healthcare",
    detector_backend="transformers",
    transformer_model_id="d4data/biomedical-ner-all",
    transformer_device=0,  # 0 = first GPU, -1 = CPU (default)
)
Combine with regex patterns (Presidio hybrid)
If you need to mix the HF model with regex patterns in the same pipeline, wrap it as a Presidio recognizer:
from presidio_analyzer import EntityRecognizer, RecognizerResult
from transformers import pipeline
class HFPIIRecognizer(EntityRecognizer):
    def __init__(self, model_id: str):
        super().__init__(supported_entities=["PERSON", "ORGANIZATION", "LOCATION"])
        self._pipe = pipeline("ner", model=model_id, aggregation_strategy="simple")

    def load(self): pass

    def analyze(self, text, entities, nlp_artifacts):
        return [
            RecognizerResult(
                entity_type=span["entity_group"],
                start=span["start"],
                end=span["end"],
                score=span["score"],
            )
            for span in self._pipe(text)
        ]
firewall = create_firewall(
    "healthcare",
    detector_backend="presidio",
    custom_recognizers=[HFPIIRecognizer("dslim/bert-base-NER")],
)
Reversible Pseudonymization
# Anonymize
result = firewall.process(text="Contact John Doe at john@example.com", context={...})
print(result.sanitized_text)
# "Contact PERSON_1 at EMAIL_1"
# LLM processes anonymized text
llm_response = "PERSON_1 should verify EMAIL_1 is correct"
# Rehydrate (restore original values)
from privacy_firewall.anonymization_engine import rehydrate_text
mapping = firewall.vault.get_case_mapping(
    tenant_id="...",
    case_id="...",
    thread_id="...",
)
final = rehydrate_text(llm_response, mapping)
print(final)
# "John Doe should verify john@example.com is correct"
Provider-Agnostic SDK Flow
from privacy_firewall import PrivacyFirewallSDK
sdk = PrivacyFirewallSDK.create(domain="healthcare", detector_backend="presidio")
context = {
    "tenant_id": "hospital-001",
    "case_id": "patient-123",
    "thread_id": "consultation-1",
    "actor_id": "doctor-456",
}
# 1) Anonymize input
anon = sdk.anonymize_text(text="Contact John Doe at john@example.com", context=context)
# 2) Call any model client (callable or object with .generate)
def my_llm(prompt: str) -> str:
    return f"Please verify PERSON_1 at EMAIL_1. Input was: {prompt}"
# 3) Rehydrate output
result = sdk.secure_call(
    text="Contact John Doe at john@example.com",
    context=context,
    llm_client=my_llm,
)
print(result.final_text)
GDPR Compliance (Right to be Forgotten)
# Forget all data for a case
deleted = firewall.forget(
    tenant_id="hospital-001",
    case_id="patient-123",
    thread_id="consultation-1",
)
print(f"Deleted {deleted} mappings")
🌐 Web API
Run the FastAPI web server:
cd pii-firewall
uvicorn privacy_firewall.web.app:create_app --factory --reload --port 8080
Access the API at http://127.0.0.1:8080/docs
API Example
curl -X POST "http://localhost:8080/api/run" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Ana García, 43 años, hipertensión",
    "tenant_id": "hospital-001",
    "case_id": "patient-123",
    "thread_id": "thread-1",
    "actor_id": "doctor-456",
    "profile": "healthcare",
    "detector_backend": "gliner"
  }'
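The same request from Python using only the standard library. The endpoint path and payload fields mirror the curl example; adjust the URL to wherever you run the server (the uvicorn command above uses port 8080):

```python
import json
import urllib.request

payload = {
    "text": "Ana García, 43 años, hipertensión",
    "tenant_id": "hospital-001",
    "case_id": "patient-123",
    "thread_id": "thread-1",
    "actor_id": "doctor-456",
    "profile": "healthcare",
    "detector_backend": "gliner",
}

def run_firewall(url: str = "http://localhost:8080/api/run") -> dict:
    """POST the payload as JSON and return the decoded response body."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Any HTTP client works the same way; only the JSON body shape matters.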
Web UI
The project includes a Next.js web interface:
cd ../pii-web-next
npm install
npm run dev
Access at http://127.0.0.1:3010
📊 Performance
- Language detection: 1-2 ms (first message), 0 ms (cached)
- Pattern matching (regex mode): < 1 ms
- Presidio NER: 50-200 ms (depends on text length)
- OPF / Nemotron: 50-300 ms
- Transformer NER: 100-500 ms (use for accuracy, not latency)
- Overall round-trip (Presidio mode): ~50-250 ms per request
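These numbers depend heavily on hardware and text length, so it is worth measuring in your own environment. A generic timing helper (the stand-in workload is an assumption; swap in your `firewall.process(...)` call):

```python
import time

def time_call(fn, *args, repeats: int = 10) -> float:
    """Return the average wall-clock milliseconds for fn(*args)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) * 1000 / repeats

# Example with a stand-in workload; replace with a lambda calling firewall.process(...)
avg_ms = time_call(lambda: sum(range(10_000)))
print(f"{avg_ms:.3f} ms")
```

Run it once to warm caches (model load, language detection) before trusting the average.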
Detection backend comparison
| Backend | Install | Best for | Latency |
|---|---|---|---|
| regex | (none) | Structured IDs, emails, phones | < 1 ms |
| presidio | [presidio,langdetect] | Named entities, best speed/accuracy balance | 50-200 ms |
| hybrid | [presidio,langdetect] | Regex + Presidio for max coverage | 50-250 ms |
| gliner | [gliner] | Zero-shot NER, no fine-tuning needed | 100-400 ms |
| transformers | [transformers] | Biomedical NER (d4data, BC5CDR) | 100-500 ms |
| opf | [opf] | Token-level classifier, language-agnostic | 50-200 ms |
| nemotron | [opf] | NVIDIA fine-tune, high recall on free text | 100-300 ms |
Optimization tips:
- Use thread-level language caching (enabled by default)
- Use detector_backend="presidio" for the best speed/accuracy balance
🏗️ Architecture
src/privacy_firewall/
├── language/                 # Auto-detection & routing
│   ├── detector.py           # LanguageDetector (langdetect/fasttext)
│   └── router.py             # LanguageRouter (spaCy model selection)
├── patterns/                 # Locale-aware patterns
│   ├── catalog.py            # PatternCatalog
│   └── locales/              # ONE FILE PER LANGUAGE ✨
│       ├── global_patterns.py
│       ├── es_patterns.py
│       ├── us_patterns.py
│       ├── fr_patterns.py
│       ├── de_patterns.py
│       ├── it_patterns.py
│       └── pt_patterns.py
├── profiles/                 # Domain profiles
│   ├── profiles.py           # DomainProfile, EntityDisposition
│   └── presets.py            # HEALTHCARE, FINANCE, LEGAL
├── presidio_integration/     # Full Presidio capabilities
│   ├── engine.py             # Analyzer + Anonymizer
│   └── recognizers.py        # Custom recognizers
├── transformers_ner/         # Domain-specific models
│   ├── engine.py             # TransformerNEREngine
│   └── models.py             # Biomedical NER model catalog
├── unified_detector.py       # Multi-backend orchestration
├── anonymization_engine.py   # Disposition-based anonymization
├── firewall.py               # Next-gen PrivacyFirewall
└── web/                      # FastAPI web interface
    └── app.py                # REST API
🔍 Comparison
| Feature | Privacy Firewall | Presidio | scrubadub | AWS Comprehend |
|---|---|---|---|---|
| Domain awareness | ✅ Keep relevant data | ❌ | ❌ | ⚠️ Healthcare only |
| Multi-language | ✅ 55+ auto-detect | ⚠️ Manual | ❌ English only | ✅ Some |
| Locale patterns | ✅ Per-country | ❌ | ❌ | ❌ |
| Multiple dispositions | ✅ | ⚠️ Basic | ❌ | ❌ |
| Transformers | ✅ BioBERT, biomedical NER | ❌ | ❌ | ✅ Proprietary |
| Reversibility | ✅ Vault | ❌ | ❌ | ❌ |
| Custom patterns | ✅ Runtime | ⚠️ Code | ⚠️ Code | ❌ |
| Thread caching | ✅ 0 ms after first | ❌ | ❌ | N/A |
| Open source | ✅ | ✅ | ✅ | ❌ |
🌍 Extending with New Locales
Add support for a new country in 3 steps:
- Create pattern file (patterns/locales/nl_patterns.py):
import re
from ..catalog import EntityPattern

NL_BSN = EntityPattern(
    entity_type="NATIONAL_ID",
    locale="NL",
    pattern=re.compile(r"\b\d{9}\b"),
    confidence=0.9,
    context_words=("bsn", "burgerservicenummer"),
    description="Dutch BSN",
)
NL_PATTERNS = [NL_BSN]
- Import in patterns/locales/__init__.py:
from .nl_patterns import NL_PATTERNS
LOCALE_PATTERNS = [...] + NL_PATTERNS
- Add language config (optional, for spaCy models):
# In language/router.py
"nl": LanguageConfig(
    language_code="nl",
    spacy_model="nl_core_news_sm",
    patterns_locale="NL",
),
Done! Dutch patterns now available automatically.
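Since `\b\d{9}\b` matches any nine digits, a checksum pass can cut false positives before you rely on the pattern. The Dutch BSN uses the "elfproef" variant that weights digits 9..2 and -1 and requires the weighted sum to be divisible by 11; the validator below is an illustrative add-on, not part of the library:

```python
import re

BSN_RE = re.compile(r"\b\d{9}\b")

def is_valid_bsn(digits: str) -> bool:
    """BSN 'elfproef': weighted digit sum must be divisible by 11."""
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    total = sum(int(d) * w for d, w in zip(digits, weights))
    return total % 11 == 0 and total > 0

text = "BSN 111222333 en BSN 123456789"
hits = [m.group() for m in BSN_RE.finditer(text) if is_valid_bsn(m.group())]
print(hits)  # ['111222333']
```

Pairing a loose regex with a checksum is a common way to raise effective confidence without tightening the pattern itself.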
📚 Documentation
- Developer Guide (HTML) - Complete implementation and usage guide
- tests_integration/README.md - Integration test notes
To show the guide in a panel in VS Code:
- Open docs/guide.html
- Select Open Preview (or use Ctrl+Shift+V)
🧪 Testing
# Unit tests
pytest tests/
# Integration tests
pytest tests_integration/
# Quick package smoke test
python -c "import privacy_firewall; print('ok')"
🔒 Security & Privacy
- ✅ Simple end-to-end anonymize → LLM → rehydrate flow
- ✅ Reversible pseudonymization with vault
- ✅ Pluggable vault storage (in-memory and SQLite)
- ✅ GDPR "right to be forgotten"
- ✅ Audit trails in result.trace
- ✅ No data leaves your infrastructure
📄 License
Apache 2.0; see LICENSE for details.
🤝 Contributing
Contributions welcome! Areas to contribute:
- New locale patterns (add your country!)
- Domain profiles (education, government, etc.)
- Custom recognizers
- Performance optimizations
- Documentation improvements
🙏 Acknowledgments
Built with:
- Presidio - Microsoft's PII detection library
- spaCy - Industrial-strength NLP
- langdetect - Fast language detection
- transformers - State-of-the-art NLP models
Built with ❤️ for privacy-first AI applications