
PII Firewall 🛡️

Open-source PII firewall for LLM apps: detect, anonymize and rehydrate sensitive data before it reaches OpenAI, Anthropic or any LLM provider

Python 3.10+ License: Apache 2.0

Why PII Firewall?

Most PII tools were built for data pipelines, not for LLM calls. PII Firewall is designed specifically around the detect → sanitize → LLM → rehydrate round-trip:

  • Domain awareness: keeps relevant data (medical diagnoses in healthcare, transaction amounts in finance) so the LLM still has context, while stripping what must not leave your system
  • Auto language detection: 55+ languages detected automatically, with thread-level caching (0 ms after the first call)
  • Locale-specific patterns: country-specific ID formats such as the Spanish DNI, US SSN, French INSEE, German Steuernummer, Italian Codice Fiscale, and Portuguese NIF
  • 7 detection backends: regex, Presidio, Hybrid, GLiNER, Transformers, OPF, Nemotron; switch with one parameter
  • 7 disposition actions: Keep, Redact, Pseudonymize, Generalize, Mask, Hash, Suppress
  • Reversible pseudonymization: the vault stores original↔token mappings, so real names are restored in LLM responses
  • Streaming support: secure_call_stream() yields rehydrated tokens in real time
  • GDPR Art. 17 right to be forgotten: firewall.forget() wipes all mappings for a thread or case

📦 Quick Start

Installation

# From PyPI (basic, pattern-based)
pip install pii-firewall

# Recommended: With Presidio and language detection
pip install "pii-firewall[presidio,langdetect]"

# Full features (includes transformers, OPF, GLiNER)
pip install "pii-firewall[all]"

# Local development install
pip install -e .

# Focused installs
pip install "pii-firewall[opf]"       # OPF runtime (or install from source if your environment requires it)
pip install "pii-firewall[gliner]"    # GLiNER PII models

Basic Usage

from privacy_firewall import create_firewall

# Create healthcare firewall (auto-detects language)
firewall = create_firewall("healthcare")

# Process text
result = firewall.process(
    text="Ana García, 43 años, hipertensión. Prescripción: enalapril 10mg.",
    context={
        "tenant_id": "hospital-001",
        "case_id": "patient-123",
        "thread_id": "consultation-1",
        "actor_id": "doctor-456",
    },
)

print(result.sanitized_text)
# Output: "PERSON_1, [AGE_40-49], hipertensión. enalapril 10mg."
# Notice: Medical terms (hipertensión, enalapril) are KEPT!

🎯 Domain Profiles

Healthcare

Keeps medical data relevant for diagnosis while protecting patient identity:

firewall = create_firewall("healthcare")

# Keeps: diagnoses, medications, procedures, lab values
# Redacts: names, IDs, addresses
# Generalizes: ages (43 → 40-49), dates (specific → month/year)
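The generalization step shown above amounts to simple bucketing. A minimal standalone sketch of the idea (independent of the library's actual implementation):

```python
from datetime import date

def generalize_age(age: int, bucket: int = 10) -> str:
    """Bucket an exact age into a decade range, e.g. 43 -> "40-49"."""
    low = (age // bucket) * bucket
    return f"{low}-{low + bucket - 1}"

def generalize_date(d: date) -> str:
    """Reduce a specific date to month/year, e.g. 2024-03-17 -> "2024-03"."""
    return f"{d.year}-{d.month:02d}"

print(generalize_age(43))                  # "40-49"
print(generalize_date(date(2024, 3, 17)))  # "2024-03"
```

The point of generalization is that the LLM still receives useful signal ("a patient in their 40s") without an exact quasi-identifier.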

Finance

Protects customer PII and financial identifiers. Amounts and transaction context pass through without detection (not regulated PII):

firewall = create_firewall("finance")

# Keeps: company names, transaction context (amounts pass through as non-PII)
# Masks: credit card numbers (4111...1111)
# Pseudonymizes: account numbers, IBANs, tax IDs (reversible)
# Redacts: customer PII (names, addresses) and medical data
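The masking disposition keeps only the leading and trailing digits of a card number (4111...1111). A standalone sketch of that transformation, not the library's actual code:

```python
def mask_card(number: str, keep: int = 4) -> str:
    """Mask a card number, keeping the first and last `keep` digits.

    Separators (spaces, dashes) are stripped before masking.
    """
    digits = "".join(c for c in number if c.isdigit())
    return f"{digits[:keep]}...{digits[-keep:]}"

print(mask_card("4111 1111 1111 1111"))  # "4111...1111"
```

Unlike pseudonymization, masking is not reversible: the middle digits are gone, which is exactly what you want for card data that must never be restored.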

Legal

High anonymity for legal documents:

firewall = create_firewall("legal")

# Keeps: company/firm names (courts, agencies: public record)
# Note: statutes, case numbers, legal citations are public record and pass through
# Pseudonymizes: party names (reversible for case management)
# Generalizes: all dates to month/year
# Redacts: strong identifiers and cross-domain medical data

๐ŸŒ Multi-Language Support

Auto-detects 55+ languages with 0 ms overhead after first detection:

firewall = create_firewall("healthcare")

# Spanish - detected automatically
result_es = firewall.process(
    text="Paciente con diabetes tipo 2, DNI 12345678A",
    context={...}
)

# English - detected automatically  
result_en = firewall.process(
    text="Patient with type 2 diabetes, SSN 123-45-6789",
    context={...}
)

# French - detected automatically
result_fr = firewall.process(
    text="Patient avec diabète, INSEE 1234567890123",
    context={...}
)

Supported locales: ES, US, FR, DE, IT, PT, + global patterns
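Locale patterns can go beyond raw regex: several national IDs carry a checksum. The Spanish DNI, for instance, ends in a control letter computed as the number modulo 23 indexed into a fixed table, which a detector can use to discard false positives. A standalone sketch of that check (not necessarily what the library's ES patterns do internally):

```python
# Standard DNI control-letter table (index = number mod 23)
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"

def dni_check_letter(number: int) -> str:
    """Return the control letter for an 8-digit Spanish DNI number."""
    return DNI_LETTERS[number % 23]

def is_valid_dni(dni: str) -> bool:
    """Validate a DNI string: 8 digits plus the correct control letter."""
    if len(dni) != 9 or not dni[:8].isdigit():
        return False
    return dni[8].upper() == dni_check_letter(int(dni[:8]))

print(is_valid_dni("12345678Z"))  # True
```

A pattern match plus a passing checksum justifies a much higher confidence score than the regex alone.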

🔧 Advanced Usage

Custom Profiles

from privacy_firewall import (
    PrivacyFirewall,
    create_custom_profile,
    EntityDisposition,
    DispositionAction,
)

# Create custom profile
profile = create_custom_profile("legal_discovery")

# Add entity dispositions
profile.add_disposition(EntityDisposition(
    entity_type="PERSON",
    action=DispositionAction.PSEUDONYMIZE,
    confidence_threshold=0.8,
))

profile.add_disposition(EntityDisposition(
    entity_type="CASE_NUMBER",
    action=DispositionAction.KEEP,
    confidence_threshold=0.9,
))

firewall = PrivacyFirewall(profile=profile)

Adding Your Own Custom PII Detectors

There are two approaches depending on whether you need regex rules or a full ML/NLP model.

Option A: Regex pattern (no ML, any backend)

Add patterns directly to the catalog at runtime. Works with all detection backends.

import re
from privacy_firewall.patterns.catalog import EntityPattern

# Quick one-liner helper
firewall.add_custom_regex(
    entity_type="EMPLOYEE_ID",
    regex=r"\bEMP-\d{6}\b",
    locales=["GLOBAL"],          # or ["US"], ["ES"], etc.
    confidence=0.95,
    context_words=["employee", "staff"],
    disposition_action="redact", # keep / redact / pseudonymize / mask …
)

# Or build the full EntityPattern object for more control
firewall.add_custom_pattern(EntityPattern(
    entity_type="CASE_NUMBER",
    locale="ES",
    pattern=re.compile(r"\bEXP-\d{4}/\d{6}\b"),
    confidence=0.98,
    context_words=("expediente", "exp"),
    description="Spanish legal case number",
))
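Before registering patterns like these, it is worth sanity-checking the regexes on their own; the two expressions above can be exercised with nothing but the standard library:

```python
import re

EMPLOYEE_ID = re.compile(r"\bEMP-\d{6}\b")
CASE_NUMBER = re.compile(r"\bEXP-\d{4}/\d{6}\b")

assert EMPLOYEE_ID.search("Badge EMP-123456 issued")     # matches
assert not EMPLOYEE_ID.search("EMP-12345 is too short")  # only 5 digits
assert CASE_NUMBER.search("expediente EXP-2024/000123")  # matches
```

The `\b` word boundaries matter: without them, `EMP-123456` would also match inside a longer identifier and produce spurious detections.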

Option B: Custom NLP/ML recognizer (Presidio backend)

Pass your own Presidio EntityRecognizer (or PatternRecognizer) when creating the firewall. This is the right approach when you want to use a spaCy model, a transformer, or any custom heuristic.

from privacy_firewall import create_firewall
from privacy_firewall.presidio_integration import create_custom_recognizer

# Helper that wraps a regex list into a Presidio PatternRecognizer
employee_recognizer = create_custom_recognizer(
    entity_type="EMPLOYEE_ID",
    patterns=[r"\bEMP\d{6}\b"],
    context_words=["employee", "badge"],
    score=0.9,
)

firewall = create_firewall(
    domain="generic",
    detector_backend="presidio",   # required for this approach
    custom_recognizers=[employee_recognizer],
)

For a fully custom ML-based recognizer, subclass Presidio's EntityRecognizer and pass the instance the same way:

from presidio_analyzer import EntityRecognizer, RecognizerResult

class MyModelRecognizer(EntityRecognizer):
    """Example: wraps any ML model as a Presidio recognizer."""

    def load(self): ...

    def analyze(self, text, entities, nlp_artifacts):
        results = []
        # call your model here and yield RecognizerResult objects
        for span in my_model.predict(text):
            results.append(RecognizerResult(
                entity_type="CUSTOM_ENTITY",
                start=span.start,
                end=span.end,
                score=span.confidence,
            ))
        return results

firewall = create_firewall(
    domain="generic",
    detector_backend="presidio",
    custom_recognizers=[MyModelRecognizer(supported_entities=["CUSTOM_ENTITY"])],
)

Which option to use?

| Scenario | Approach |
| --- | --- |
| Regex or rule-based custom entity | Option A: add_custom_regex / add_custom_pattern |
| Locale-specific ID format (new country) | Option A with the matching locale code |
| Existing HuggingFace / spaCy NER model | Option B: wrap in an EntityRecognizer subclass |
| Complex heuristic or external API call | Option B: implement analyze() freely |

Testing a HuggingFace PII Model

The library has a built-in transformers backend. The quickest way to try any HuggingFace NER model is:

pip install "pii-firewall[transformers]"

from privacy_firewall import create_firewall

# Pass any HuggingFace model ID; it is downloaded automatically on first call
firewall = create_firewall(
    "healthcare",
    detector_backend="transformers",
    transformer_model_id="dslim/bert-base-NER",  # swap for any HF model ID
)

result = firewall.process(
    text="John Doe, SSN 123-45-6789, prescribed enalapril 10mg",
    context={"tenant_id": "t1", "case_id": "c1", "thread_id": "th1", "actor_id": "a1"},
)
print(result.sanitized_text)

Curated model catalog

The library ships a pre-vetted catalog of models in transformers_ner/models.py:

from privacy_firewall.transformers_ner.models import get_model_for_domain

config = get_model_for_domain("medical", "en")
firewall = create_firewall("healthcare", detector_backend="transformers", transformer_model_id=config.model_id)

| Domain | Language | Model |
| --- | --- | --- |
| General | en | dslim/bert-base-NER |
| General | multilingual | Davlan/xlm-roberta-base-ner-hrl |
| General | fr | Jean-Baptiste/camembert-ner |
| Medical | en | d4data/biomedical-ner-all |
| Medical | es | PlanTL-GOB-ES/bsc-bio-ehr-es |

Run on GPU

firewall = create_firewall(
    "healthcare",
    detector_backend="transformers",
    transformer_model_id="d4data/biomedical-ner-all",
    transformer_device=0,   # 0 = first GPU, -1 = CPU (default)
)

Combine with regex patterns (Presidio hybrid)

If you need to mix the HF model with regex patterns in the same pipeline, wrap it as a Presidio recognizer:

from presidio_analyzer import EntityRecognizer, RecognizerResult
from transformers import pipeline

class HFPIIRecognizer(EntityRecognizer):
    def __init__(self, model_id: str):
        super().__init__(supported_entities=["PERSON", "ORGANIZATION", "LOCATION"])
        self._pipe = pipeline("ner", model=model_id, aggregation_strategy="simple")

    def load(self): pass

    def analyze(self, text, entities, nlp_artifacts):
        return [
            RecognizerResult(
                entity_type=span["entity_group"],
                start=span["start"],
                end=span["end"],
                score=span["score"],
            )
            for span in self._pipe(text)
        ]

firewall = create_firewall(
    "healthcare",
    detector_backend="presidio",
    custom_recognizers=[HFPIIRecognizer("dslim/bert-base-NER")],
)

Reversible Pseudonymization

# Anonymize
result = firewall.process(text="Contact John Doe at john@example.com", context={...})
print(result.sanitized_text)
# "Contact PERSON_1 at EMAIL_1"

# LLM processes anonymized text
llm_response = "PERSON_1 should verify EMAIL_1 is correct"

# Rehydrate (restore original values)
from privacy_firewall.anonymization_engine import rehydrate_text
mapping = firewall.vault.get_case_mapping(
    tenant_id="...",
    case_id="...",
    thread_id="...",
)
final = rehydrate_text(llm_response, mapping)
print(final)
# "John Doe should verify john@example.com is correct"
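Conceptually, rehydration is the reverse substitution over the vault's token→original mapping. A simplified standalone illustration of the idea (a stand-in for the library's rehydrate_text, not its actual implementation):

```python
def rehydrate(text: str, mapping: dict[str, str]) -> str:
    """Replace placeholder tokens with their original values.

    Longer tokens are substituted first so that e.g. PERSON_10
    is not partially rewritten by PERSON_1.
    """
    for token in sorted(mapping, key=len, reverse=True):
        text = text.replace(token, mapping[token])
    return text

mapping = {"PERSON_1": "John Doe", "EMAIL_1": "john@example.com"}
print(rehydrate("PERSON_1 should verify EMAIL_1 is correct", mapping))
# "John Doe should verify john@example.com is correct"
```

The longest-token-first ordering is the one subtlety worth noting: naive iteration over an unordered mapping can corrupt output once entity counters pass 9.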

Provider-Agnostic SDK Flow

from privacy_firewall import PrivacyFirewallSDK

sdk = PrivacyFirewallSDK.create(domain="healthcare", detector_backend="presidio")

context = {
    "tenant_id": "hospital-001",
    "case_id": "patient-123",
    "thread_id": "consultation-1",
    "actor_id": "doctor-456",
}

# 1) Anonymize input
anon = sdk.anonymize_text(text="Contact John Doe at john@example.com", context=context)

# 2) Call any model client (callable or object with .generate)
def my_llm(prompt: str) -> str:
    return f"Please verify PERSON_1 at EMAIL_1. Input was: {prompt}"

# 3) Rehydrate output
result = sdk.secure_call(
    text="Contact John Doe at john@example.com",
    context=context,
    llm_client=my_llm,
)
print(result.final_text)

GDPR Compliance (Right to be Forgotten)

# Forget all data for a case
deleted = firewall.forget(
    tenant_id="hospital-001",
    case_id="patient-123",
    thread_id="consultation-1",
)
print(f"Deleted {deleted} mappings")

🚀 Web API

Run the FastAPI web server:

cd pii-firewall
uvicorn privacy_firewall.web.app:create_app --factory --reload --port 8080

Access the API at http://127.0.0.1:8080/docs

API Example

curl -X POST "http://localhost:8080/api/run" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Ana García, 43 años, hipertensión",
    "tenant_id": "hospital-001",
    "case_id": "patient-123",
    "thread_id": "thread-1",
    "actor_id": "doctor-456",
    "profile": "healthcare",
    "detector_backend": "gliner"
  }'

Web UI

The project includes a Next.js web interface:

cd ../pii-web-next
npm install
npm run dev

Access at http://127.0.0.1:3010

📊 Performance

  • Language detection: 1–2 ms (first message), 0 ms (cached)
  • Pattern matching (regex mode): < 1 ms
  • Presidio NER: 50–200 ms (depends on text length)
  • OPF / Nemotron: 50–300 ms
  • Transformer NER: 100–500 ms (use for accuracy, not latency)
  • Overall round-trip (Presidio mode): ~50–250 ms per request

Detection backend comparison

| Backend | Install extra | Best for | Latency |
| --- | --- | --- | --- |
| regex | (none) | Structured IDs, emails, phones | < 1 ms |
| presidio | [presidio,langdetect] | Named entities; best speed/accuracy balance | 50–200 ms |
| hybrid | [presidio,langdetect] | Regex + Presidio for max coverage | 50–250 ms |
| gliner | [gliner] | Zero-shot NER, no fine-tuning needed | 100–400 ms |
| transformers | [transformers] | Biomedical NER (d4data, BC5CDR) | 100–500 ms |
| opf | [opf] | Token-level classifier, language-agnostic | 50–200 ms |
| nemotron | [opf] | NVIDIA fine-tune, high recall on free text | 100–300 ms |

Optimization tips:

  • Use thread-level language caching (enabled by default)
  • Use detector_backend="presidio" for best speed/accuracy balance

๐Ÿ—๏ธ Architecture

src/privacy_firewall/
├── language/               # Auto-detection & routing
│   ├── detector.py         # LanguageDetector (langdetect/fasttext)
│   └── router.py           # LanguageRouter (spaCy model selection)
├── patterns/               # Locale-aware patterns
│   ├── catalog.py          # PatternCatalog
│   └── locales/            # ONE FILE PER LANGUAGE ✨
│       ├── global_patterns.py
│       ├── es_patterns.py
│       ├── us_patterns.py
│       ├── fr_patterns.py
│       ├── de_patterns.py
│       ├── it_patterns.py
│       └── pt_patterns.py
├── profiles/               # Domain profiles
│   ├── profiles.py         # DomainProfile, EntityDisposition
│   └── presets.py          # HEALTHCARE, FINANCE, LEGAL
├── presidio_integration/   # Full Presidio capabilities
│   ├── engine.py           # Analyzer + Anonymizer
│   └── recognizers.py      # Custom recognizers
├── transformers_ner/       # Domain-specific models
│   ├── engine.py           # TransformerNEREngine
│   └── models.py           # Biomedical NER model catalog
├── unified_detector.py     # Multi-backend orchestration
├── anonymization_engine.py # Disposition-based anonymization
├── firewall.py             # Next-gen PrivacyFirewall
└── web/                    # FastAPI web interface
    └── app.py              # REST API

🆚 Comparison

| Feature | PII Firewall | Presidio | scrubadub | AWS Comprehend |
| --- | --- | --- | --- | --- |
| Domain awareness | ✅ Keeps relevant data | ❌ | ❌ | ⚠️ Healthcare only |
| Multi-language | ✅ 55+ auto-detected | ✅ Manual | ❌ English only | ✅ Some |
| Locale patterns | ✅ Per-country | ❌ | ❌ | ❌ |
| Multiple dispositions | ✅ | ❌ Basic | ❌ | ❌ |
| Transformers | ✅ BioBERT, biomedical NER | ❌ | ❌ | ✅ Proprietary |
| Reversibility | ✅ Vault | ❌ | ❌ | ❌ |
| Custom patterns | ✅ Runtime | ⚠️ Code | ⚠️ Code | ❌ |
| Thread caching | ✅ 0 ms after first | ❌ | ❌ | N/A |
| Open source | ✅ | ✅ | ✅ | ❌ |

🔌 Extending with New Locales

Add support for a new country in 3 steps:

  1. Create pattern file (patterns/locales/nl_patterns.py):
import re
from ..catalog import EntityPattern

NL_BSN = EntityPattern(
    entity_type="NATIONAL_ID",
    locale="NL",
    pattern=re.compile(r"\b\d{9}\b"),
    confidence=0.9,
    context_words=("bsn", "burgerservicenummer"),
    description="Dutch BSN",
)

NL_PATTERNS = [NL_BSN]
  2. Import in patterns/locales/__init__.py:
from .nl_patterns import NL_PATTERNS
LOCALE_PATTERNS = [...] + NL_PATTERNS
  3. Add language config (optional, for spaCy models):
# In language/router.py
"nl": LanguageConfig(
    language_code="nl",
    spacy_model="nl_core_news_sm",
    patterns_locale="NL",
),

Done! Dutch patterns now available automatically.
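The \b\d{9}\b pattern above matches any nine digits, which is why its confidence stays at 0.9. A real BSN also satisfies the Dutch "elfproef" checksum, which a custom recognizer could apply to cut false positives. A standalone sketch, assuming the standard BSN weights (not part of the library):

```python
def is_valid_bsn(bsn: str) -> bool:
    """Dutch BSN "elfproef": weights 9..2 on the first eight digits,
    -1 on the ninth; the weighted sum must be divisible by 11."""
    if len(bsn) != 9 or not bsn.isdigit():
        return False
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    total = sum(w * int(d) for w, d in zip(weights, bsn))
    return total % 11 == 0

print(is_valid_bsn("111222333"))  # True
print(is_valid_bsn("123456789"))  # False
```

Pairing the regex with a checksum like this is the same trick used for other national IDs (e.g. the Spanish DNI control letter), and justifies raising the pattern's confidence.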

📚 Documentation

To show the guide in a panel in VS Code:

  1. Open docs/guide.html
  2. Select Open Preview (or use Ctrl+Shift+V)

🧪 Testing

# Unit tests
pytest tests/

# Integration tests
pytest tests_integration/

# Quick package smoke test
python -c "import privacy_firewall; print('ok')"

๐Ÿ” Security & Privacy

  • ✅ Simple end-to-end anonymize → LLM → rehydrate flow
  • ✅ Reversible pseudonymization with vault
  • ✅ Pluggable vault storage (in-memory and SQLite)
  • ✅ GDPR "right to be forgotten"
  • ✅ Audit trails in result.trace
  • ✅ No data leaves your infrastructure

๐Ÿ“ License

Apache 2.0 โ€” see LICENSE for details.

๐Ÿค Contributing

Contributions welcome! Areas to contribute:

  • New locale patterns (add your country!)
  • Domain profiles (education, government, etc.)
  • Custom recognizers
  • Performance optimizations
  • Documentation improvements

๐Ÿ™ Acknowledgments

Built with:


Built with โค๏ธ for privacy-first AI applications
