ZeroPhix v0.1.15 - Enterprise PII/PSI/PHI Redaction
Enterprise-grade, multilingual PII/PSI/PHI redaction - free, offline, and fully customizable.
What is ZeroPhix?
ZeroPhix is an enterprise-grade tool for detecting and redacting sensitive information from text, documents, and data streams.
Detects & Redacts
- PII (Personally Identifiable Information) - names, addresses, emails, phone numbers
- PHI (Protected Health Information) - medical records, patient data, health identifiers
- PSI (Personal Sensitive Information) - financial data, credentials, government IDs
- Custom Data - proprietary identifiers, internal codes, API keys
Name Origin
- Zero = eliminate, remove, redact
- Phi = from PHI (Protected Health Information)
- x = extensible to PII, PSI, and any sensitive data types
Why Choose ZeroPhix?
| Feature | Benefit |
|---|---|
| High Accuracy | ML models + regex patterns for high precision and recall |
| Fast Processing | Smart caching + async processing; throughput depends on your infrastructure |
| Self-Hosted | No per-document API fees, requires infrastructure and maintenance |
| Fully Offline | Air-gapped after one-time model setup |
| Multi-Country | Australia, US, EU, UK, Canada + extensible |
| 100+ Entity Types | SSN, credit cards, medical IDs, passports, etc. |
| Zero-Shot Detection | Detect ANY entity type without training (GLiNER) |
| Compliance Ready | Designed to support GDPR, HIPAA, PCI DSS, and CCPA workflows |
| Enterprise Security | Zero Trust, encryption, audit trails |
| Multiple Formats | PDF, DOCX, Excel, CSV, HTML, JSON |
Quick Start
Installation
Install directly from PyPI:
pip install zerophix
Or use extras for full features:
# With all features (recommended)
pip install "zerophix[all]"
# Or select specific features
pip install "zerophix[gliner,documents,api]"
# For DataFrame support
pip install "zerophix[all]" pandas # For Pandas
pip install "zerophix[all]" pyspark # For PySpark
One-Time Model Setup (Optional)
ZeroPhix works 100% offline after initial setup. ML models are downloaded once and cached locally:
# spaCy models (optional - for enhanced NER)
python -m spacy download en_core_web_lg
# Other ML models auto-download on first use and cache locally
# After initial download, no internet required - fully air-gapped
Offline Modes:
- Regex-only: Works immediately, no downloads, 100% offline from install
- With ML models: One-time download, then 100% offline forever
- Air-gapped environments: Pre-download models, transfer via USB/network
Databricks / Cloud Platforms
For Databricks (DBR 18.0+):
Install via cluster Libraries → Install from PyPI:
pydantic>=2.7
pyyaml>=6.0.1
regex>=2024.4.16
click>=8.1.7
tqdm>=4.66.5
rich>=13.9.2
nltk>=3.8.1
cryptography>=41.0.0
pypdf>=3.0.0
zerophix==0.1.15
In your notebook:
from zerophix.pipelines.redaction import RedactionPipeline
from zerophix.config import RedactionConfig
config = RedactionConfig(country="US", detectors=["regex"])
pipeline = RedactionPipeline(config)
text = "John Doe, SSN: 123-45-6789, Email: john@example.com"
result = pipeline.redact(text)
print(result['text'])
Note: Don't install scipy/numpy/pandas separately on Databricks - use the cluster's pre-compiled versions.
30-Second Demo
from zerophix.pipelines.redaction import RedactionPipeline
from zerophix.config import RedactionConfig
# Configure and redact
config = RedactionConfig(country="US")
pipeline = RedactionPipeline(config)
text = "John Doe, SSN: 123-45-6789, Email: john@example.com"
result = pipeline.redact(text)
print(result['text'])
# Output: [PERSON], SSN: XXX-XX-6789, Email: [EMAIL]
Supported Input Types
ZeroPhix handles all common data formats:
# 1. Single String
result = pipeline.redact("John Smith, SSN: 123-45-6789")
# 2. List of Strings (Batch)
texts = ["text 1 with PII", "text 2 with PHI", "text 3"]
results = pipeline.redact_batch(texts)
# 3. Pandas DataFrame
from zerophix.processors import redact_pandas
df_clean = redact_pandas(df, columns=['name', 'email', 'ssn'], country='US')
# 4. PySpark DataFrame
from zerophix.processors import redact_spark
spark_df_clean = redact_spark(spark_df, columns=['patient_name', 'mrn'], country='US')
# 5. Files (PDF, DOCX, Excel)
from zerophix.processors import PDFProcessor
PDFProcessor().redact_file('input.pdf', 'output.pdf', pipeline)
# 6. Scanning (detect without redacting)
scan_result = pipeline.scan(text) # Returns entities found
Quick Test:
# Test all interfaces
python examples/test_all_interfaces.py
# Comprehensive examples
python examples/all_interfaces_demo.py
Australian Coverage Highlights
ZeroPhix has deep Australian coverage with mathematical checksum validation:
- 40+ Australian entity types (TFN, ABN, ACN, Medicare, driver licenses for all 8 states)
- Checksum validation for government IDs (TFN mod 11, ABN mod 89, ACN mod 10, Medicare mod 10)
- 92%+ precision for Australian government identifiers
- State-specific patterns (NSW, VIC, QLD, SA, WA, TAS, NT, ACT)
- Healthcare, financial, and government identifiers
See AUSTRALIAN_COVERAGE.md for complete details.
Command Line
# Redact text
zerophix redact --text "Sensitive information here"
# Redact files
zerophix redact-file --input document.pdf --output clean.pdf
# Start API server
python -m zerophix.api.rest
Redaction Strategies
ZeroPhix supports multiple redaction strategies to balance privacy and data utility:
| Strategy | Description | Example | Use Case |
|---|---|---|---|
| replace | Full replacement with entity type | <SSN> or <AU_TFN> | Maximum privacy, clear labeling |
| mask | Partial masking | 29****3456 or ***-**-6789 | Data utility + privacy balance |
| hash | Consistent hashing | HASH_A1B2C3D4 | Record linking, de-duplication |
| encrypt | Reversible encryption | ENC_XYZ123 | Secure storage, de-anonymization |
| brackets / redact | Simple [REDACTED] | [REDACTED] | Document redaction, printouts |
| synthetic | Realistic fake data | Alex Smith / 555-1234 | Testing, demos, data sharing |
| preserve_format | Format-preserving | K8d-2L-m9P3 (for SSN) | Schema compatibility |
| au_phone | Keep AU area code | 04XX-XXX-XXX | Australian context preservation |
| differential_privacy | Statistical noise | Original ± noise | Research, analytics |
| k_anonymity | Generalization | <30 (age) / 20XX (postcode) | Privacy-preserving analytics |
Usage:
# Choose your strategy
config = RedactionConfig(
country="AU",
masking_style="hash" # or: replace, mask, encrypt, synthetic, etc.
)
pipeline = RedactionPipeline(config)
result = pipeline.redact(text)
# Strategy-specific options
config = RedactionConfig(
masking_style="mask",
mask_percentage=0.7, # Mask 70% of characters
preserve_format=True
)
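For intuition, here is a minimal sketch of what partial masking with mask_percentage=0.7 typically produces. This is an illustrative stand-alone function, not ZeroPhix's internal masking logic, which may differ in which characters it preserves:
def mask_partial(value: str, mask_percentage: float = 0.7) -> str:
    # Mask the leading characters, keep the tail visible for utility
    n_mask = int(len(value) * mask_percentage)
    return "*" * n_mask + value[n_mask:]

print(mask_partial("123-45-6789"))  # *******6789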
Core Features
1. Detection Methods
Regex Patterns (Ultra-fast, highest precision)
- Country-specific patterns for each jurisdiction
- Format validation with checksum verification
- Covers SSN, credit cards, IDs, medical numbers
Machine Learning Models
spaCy NER - Fast, high recall for names and entities
config = RedactionConfig(use_spacy=True, spacy_model="en_core_web_lg")
BERT - Highest accuracy for complex text
config = RedactionConfig(use_bert=True, bert_model="bert-base-cased")
OpenMed - Healthcare-specialized PHI detection
config = RedactionConfig(use_openmed=True, openmed_model="openmed-base")
GLiNER - Zero-shot detection
from zerophix.detectors.gliner_detector import GLiNERDetector
detector = GLiNERDetector()
spans = detector.detect(text, entity_types=["employee id", "project code", "api key"])
# No training needed - just name what you want to find!
Statistical Analysis
- Entropy-based pattern discovery (see the sketch below)
- Frequency analysis for repetitive patterns
- Anomaly detection
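As a rough illustration of the entropy heuristic (a stand-alone sketch; ZeroPhix's statistical detector is more involved), random-looking secrets score much higher than ordinary words:
import math
from collections import Counter

def shannon_entropy(token: str) -> float:
    # Bits per character: high values suggest keys, tokens, or hashes
    counts = Counter(token)
    return -sum(c / len(token) * math.log2(c / len(token)) for c in counts.values())

print(shannon_entropy("password"))             # 2.75 (ordinary word)
print(shannon_entropy("sk-9fQ2xL7vA1pZ8mK3"))  # ~4.25 (likely a secret)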
Auto-Mode (Intelligent Domain Detection)
config = RedactionConfig(mode="auto") # Auto-selects best detectors
Choosing the Right Configuration
Decision Tree: What Should You Use?
The best configuration is always empirical - it depends on your specific use case, data characteristics, accuracy requirements, and performance constraints. We strongly recommend testing multiple configurations on your actual data to determine what works best.
Quick Decision Guide
START HERE
│
├─ Need MAXIMUM SPEED (real-time, high-volume)?
│ └─ Use: mode='fast' (regex only)
│ - High Speed
│ - High precision on structured IDs
│ - Best for: emails, phones, SSN, TFN, ABN, credit cards
│ - May miss: names in unstructured text, context-dependent entities
│
├─ Need MAXIMUM ACCURACY (compliance-critical)?
│ └─ Use: mode='accurate' (regex + all ML models)
│ - High recall (catches more PII)
│ - Best for: healthcare PHI, legal discovery, GDPR compliance
│ - Slower
│ - Higher memory: 500MB-2GB
│
├─ Structured data ONLY (CSV, forms, databases)?
│ └─ Use: mode='fast' with validation
│ - Checksum validation for TFN/ABN/Medicare
│ - Format-specific patterns
│ - Near-perfect precision
│
├─ Unstructured text (emails, documents, notes)?
│ └─ Use: mode='accurate' OR custom ensemble
│ - Combines regex + spaCy + BERT/GLiNER
│ - Catches names, context-dependent entities
│ - Better recall on varied text
│
├─ Healthcare/Medical data?
│ └─ Use: mode='accurate' + use_openmed=True
│ - PHI-optimized models
│ - Medical terminology awareness
│ - HIPAA compliance focus (87.5% recall benchmark)
│
├─ Custom entity types (not standard PII)?
│ └─ Use: GLiNER with custom labels
│ - Zero-shot detection - no training needed
│ - Just name what you want: "employee ID", "project code"
│ - Works on domain-specific identifiers
│
└─ Not sure? Testing multiple datasets?
└─ Use: mode='auto'
- Intelligently selects detectors per document
- Good starting point
- Then benchmark and tune based on your results
Configuration Examples by Use Case
High-Volume Transaction Processing:
config = RedactionConfig(
mode='fast',
use_spacy=False,
use_bert=False,
enable_checksum_validation=True # TFN/ABN validation
)
# Prioritizes: Speed, low memory, structured data
Healthcare Records (HIPAA Compliance):
config = RedactionConfig(
mode='accurate',
use_spacy=True,
use_openmed=True,
use_bert=True,
recall_threshold=0.85 # Prioritize not missing PHI
)
# Prioritizes: High recall, medical PHI, compliance
Legal Document Review:
config = RedactionConfig(
mode='accurate',
use_spacy=True,
use_bert=True,
use_gliner=True,
precision_threshold=0.90 # Reduce false positives
)
# Prioritizes: Accuracy, names, case numbers, dates
Customer Support Logs (Mixed Content):
config = RedactionConfig(
mode='balanced', # Medium speed + accuracy
use_spacy=True,
use_bert=False, # Skip if speed matters
batch_size=100
)
# Prioritizes: Balanced speed/accuracy, emails, phones, names
Testing Recommendations
Always benchmark on YOUR data:
- Start with 'auto' mode - Get baseline performance
- Test 'fast' mode - Measure speed vs accuracy trade-off
- Test 'accurate' mode - Measure recall improvement
- Try custom combinations - Enable/disable specific detectors
- Measure what matters to YOU:
- False negatives (missed PII) → Increase recall threshold, add more detectors
- False positives (over-redaction) → Increase precision threshold, tune regex patterns
- Speed (docs/sec) → Disable slower ML models, use batch processing
- Memory usage → Lazy-load models, reduce batch size
Sample Evaluation Script:
from zerophix.eval.metrics import evaluate_detection
configs = [
{'mode': 'fast'},
{'mode': 'balanced'},
{'mode': 'accurate'},
{'mode': 'accurate', 'use_openmed': True} # If healthcare data
]
for cfg in configs:
pipeline = RedactionPipeline(RedactionConfig(**cfg))
metrics = evaluate_detection(pipeline, your_test_data)
print(f"{cfg}: Precision={metrics['precision']:.2f}, Recall={metrics['recall']:.2f}")
Key Takeaway: There is no one-size-fits-all configuration. The "best" setup depends on your data type, accuracy requirements, speed constraints, and compliance needs. Empirical testing is essential.
Adaptive Ensemble - Auto-Configuration
Problem: Manual trial-and-error configuration with unpredictable accuracy
Solution: Automatic calibration learns optimal detector weights from your data
Quick Start
from zerophix.config import RedactionConfig
from zerophix.pipelines.redaction import RedactionPipeline
# 1. Enable adaptive features
config = RedactionConfig(
country="AU",
use_gliner=True,
use_openmed=True,
enable_adaptive_weights=True, # Auto-learns optimal weights
enable_label_normalization=True, # Fixes cross-detector consensus
)
pipeline = RedactionPipeline(config)
# 2. Calibrate on 20-50 labeled samples
validation_texts = ["John Smith has diabetes", "Call 555-1234", ...]
validation_ground_truth = [
[(0, 10, "PERSON_NAME"), (15, 23, "DISEASE")], # (start, end, label)
[(5, 13, "PHONE_NUMBER")],
# ...
]
results = pipeline.calibrate(
validation_texts,
validation_ground_truth,
save_path="calibration.json" # Save for reuse
)
print(f"Optimized weights: {results['detector_weights']}")
# Output: {'gliner': 0.42, 'regex': 0.09, 'openmed': 0.12, 'spacy': 0.25}
# 3. Pipeline now has optimal weights! Use normally
result = pipeline.redact("Jane Doe, Medicare 2234 56781 2")
Key Features
- Adaptive Detector Weights: Automatically adjusts weights based on F1 scores (F1²)
- Label Normalization: Normalizes labels BEFORE voting so "PERSON" (GLiNER) and "USERNAME" (regex) can vote together (sketched below)
- One-Time Calibration: Run once on 20-50 samples, save results, reuse forever
- Performance Tracking: Track detector metrics during operation
- Save/Load: Save calibration to JSON, load in production
How It Works
# Weight calculation (F1-squared method)
weight = max(0.1, detector_f1 ** 2)
# Example:
# GLiNER: F1=0.60 → weight=0.36 (High performer)
# Regex: F1=0.30 → weight=0.09 (Noisy)
# OpenMed: F1=0.10 → weight=0.10 (Poor, floor applied)
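To make the normalization + voting interaction concrete, here is a minimal sketch under assumed names (LABEL_MAP and the vote() helper are illustrative, not ZeroPhix internals):
# Hypothetical canonical label map: detectors emit different names for one concept
LABEL_MAP = {"PERSON": "PERSON_NAME", "PER": "PERSON_NAME", "NAME": "PERSON_NAME"}

def vote(candidates, weights, threshold=0.5):
    # Sum detector weights per (span, normalized label); accept above threshold
    scores = {}
    for detector, span, label in candidates:
        key = (span, LABEL_MAP.get(label, label))  # normalize BEFORE voting
        scores[key] = scores.get(key, 0.0) + weights.get(detector, 0.1)
    return [key for key, score in scores.items() if score >= threshold]

weights = {"gliner": 0.36, "spacy": 0.25, "regex": 0.09}
candidates = [("gliner", (0, 10), "PERSON"), ("spacy", (0, 10), "PER")]
print(vote(candidates, weights))  # [((0, 10), 'PERSON_NAME')]: 0.36 + 0.25 clears 0.5
Without normalization, "PERSON" and "PER" would count as different labels and neither would clear the threshold.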
Production Usage
# Load pre-calibrated weights
config = RedactionConfig(
country="AU",
use_gliner=True,
enable_adaptive_weights=True,
calibration_file="calibration.json" # Load saved weights
)
pipeline = RedactionPipeline(config)
# Ready to use with optimal weights!
One-Function Calibration (For Notebooks)
# Copy-paste into your benchmark notebook
from examples.quick_calibrate import quick_calibrate_zerophix
pipeline, results = quick_calibrate_zerophix(test_samples, num_calibration_samples=20)
# Done! Pipeline has optimal weights learned from your data
Benefits
- Less trial-and-error - Configure once, use everywhere
- Expected better precision - Fewer false positives
- Higher F1 - Better overall accuracy
- Fast calibration - 2-5 seconds for 20 samples
- 100% backward compatible - Opt-in via config flag
See examples/adaptive_ensemble_examples.py for complete examples.
Benchmark Performance & Evaluation Results
ZeroPhix has been rigorously evaluated on standard public benchmarks for PII/PHI detection and redaction.
Test Datasets
| Dataset | Type | Size | Domain | Entities |
|---|---|---|---|---|
| TAB (Text Anonymisation Benchmark) | Legal documents (EU court cases) | 14 test documents | Legal/Government | Names, locations, dates, case numbers, organizations |
| PDF Deid | Synthetic medical PDFs | 100 documents (1,145 PHI spans) | Healthcare/Medical | Patient names, MRN, dates, addresses, phone numbers |
Results Summary
TAB Benchmark (Legal Documents)
Manual Configuration (regex + spaCy + BERT + GLiNER):
- Precision: 48.8%
- Recall: 61.1%
- F1 Score: 54.2%
- Documents: 14 EU court case texts
- Gold spans: 20,809
- Predicted spans: 8,676
- Note: Legal text has high entity density, forcing a trade-off between recall and precision
Auto Configuration (automatic detector selection):
- Precision: 48.6%
- Recall: 61.0%
- F1 Score: 54.1%
- Same corpus, intelligent mode selection
PDF Deid Benchmark (Medical Documents)
Manual Configuration (regex + spaCy + BERT + OpenMed + GLiNER):
- Precision: 67.9%
- Recall: 87.5%
- F1 Score: 76.5%
- Documents: 100 synthetic medical PDFs
- Gold spans: 1,145 PHI instances
- Predicted spans: 1,476
- Note: High recall prioritizes not missing sensitive medical data
Auto Configuration:
- Precision: 67.9%
- Recall: 87.5%
- F1 Score: 76.5%
- Automatic mode achieves same performance as manual configuration
Performance Characteristics
| Metric | Value | Notes |
|---|---|---|
| Processing Speed | 1,000+ docs/sec | Regex-only mode |
| Processing Speed | 100-500 docs/sec | With ML models (spaCy/BERT) |
| Latency | < 50ms | Per document (regex) |
| Latency | 100-300ms | Per document (with ML) |
| Memory Usage | < 100MB | Regex-only |
| Memory Usage | 500MB-2GB | With ML models loaded |
| Accuracy (Structured) | 99.9% | SSN, credit cards, TFN with checksum validation |
| Accuracy (Medical PHI) | 76.5% F1 | Medical records (87.5% recall) |
| Accuracy (Legal Text) | 54.2% F1 | High-density legal documents |
Detector Performance Comparison
| Detector | Speed | Precision | Recall | Best For |
|---|---|---|---|---|
| Regex | Very Fast | 99.9% | 85% | Structured data (SSN, phone, email) |
| spaCy NER | Fast | 88% | 92% | Names, locations, organizations |
| BERT | Moderate | 92% | 89% | Complex entities, context-aware |
| OpenMed | Moderate | 90% | 87% | Medical/healthcare PHI |
| GLiNER | Moderate | 85% | 88% | Zero-shot custom entities |
| Ensemble (All) | Moderate | 87% | 92% | Best overall balance |
Reproducibility
All benchmarks are reproducible:
# Download benchmark datasets
python scripts/download_benchmarks.py
# Run all evaluations
python -m zerophix.eval.run_all_evaluations
# Results saved to: eval/results/evaluation_TIMESTAMP.json
Evaluation configuration and results are available in src/zerophix/eval/ (e.g. src/eval/results/evaluation_2026-01-12T06-25-39Z.json).
Latest benchmark results: eval/results/evaluation_2026-01-02T02-04-28Z.json
Australian Entity Detection (Detailed)
ZeroPhix provides enterprise-grade Australian coverage with 40+ entity types and mathematical checksum validation:
Supported Australian Entities:
- Government IDs: TFN (mod 11), ABN (mod 89), ACN (mod 10) with checksum validation
- Healthcare: Medicare (mod 10), IHI, HPI-I/O, DVA number, PBS card
- Driver Licenses: All 8 states (NSW, VIC, QLD, SA, WA, TAS, NT, ACT)
- Financial: BSB numbers, Centrelink CRN, bank accounts
- Geographic: Enhanced addresses, postcodes (4-digit validation)
- Organizations: Government agencies, hospitals, universities, banks
Checksum Validation Algorithms:
# TFN: Modulus 11 with weights [1,4,3,7,5,8,6,9,10]
# ABN: Modulus 89 (subtract 1 from first digit)
# ACN: Modulus 10 with weights [8,7,6,5,4,3,2,1]
# Medicare: Modulus 10 Luhn-like with weights [1,3,7,9,1,3,7,9]
from zerophix.detectors.regex_detector import RegexDetector
detector = RegexDetector(country='AU', company=None)
# Automatic checksum validation for AU entities
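For reference, the TFN and ABN check-digit algorithms named above are publicly documented; a stand-alone sketch, independent of ZeroPhix's validators:
def is_valid_tfn(tfn: str) -> bool:
    # 9-digit TFN: weighted digit sum must be divisible by 11
    digits = [int(d) for d in tfn if d.isdigit()]
    weights = [1, 4, 3, 7, 5, 8, 6, 9, 10]
    return len(digits) == 9 and sum(d * w for d, w in zip(digits, weights)) % 11 == 0

def is_valid_abn(abn: str) -> bool:
    # 11-digit ABN: subtract 1 from the first digit, weighted sum mod 89 must be 0
    digits = [int(d) for d in abn if d.isdigit()]
    if len(digits) != 11:
        return False
    digits[0] -= 1
    weights = [10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
    return sum(d * w for d, w in zip(digits, weights)) % 89 == 0

print(is_valid_tfn("123 456 782"))     # True (well-known test TFN)
print(is_valid_abn("51 824 753 556"))  # True (the ABR's example ABN)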
2. Ensemble & Context
Ensemble Voting - Combines multiple detectors with weighted voting
config = RedactionConfig(
enable_ensemble_voting=True,
detector_weights={"regex": 2.0, "bert": 1.2, "spacy": 1.0}
)
Context Propagation - Remembers high-confidence entities across document
config = RedactionConfig(
enable_context_propagation=True,
context_propagation_threshold=0.90
)
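An illustrative run (the exact redaction output is indicative, not guaranteed):
pipeline = RedactionPipeline(config)
text = "Dr. Jane Smith reviewed the scan. Smith recommended follow-up."
result = pipeline.redact(text)
# Once "Jane Smith" is detected with confidence >= 0.90, propagation
# also redacts the later bare mention "Smith".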
Allow-List Filtering - Whitelist terms that should never be redacted
config = RedactionConfig(allow_list=["ACME Corp", "Project Phoenix"])
3. Redaction Strategies
| Strategy | Example | Use Case |
|---|---|---|
| Mask | XXX-XX-6789 | Partial visibility |
| Hash | HASH_9a8b7c6d | Deterministic replacement |
| Synthetic | alex@provider.net | Realistic fake data |
| Encrypt | ENC_a8f9b3c2 | Reversible with key |
| Format-Preserving | 555-8947 | Maintains structure |
| Differential Privacy | $52,847 | Statistical privacy |
config = RedactionConfig(masking_style="synthetic")
4. Multi-Country Support
| Country | Entities Covered | Compliance |
|---|---|---|
| Australia | Medicare, TFN, ABN/ACN, Driver License, IHI | Privacy Act |
| United States | SSN, ITIN, Passport, Medical Record, Credit Card | HIPAA, CCPA |
| European Union | National ID, VAT, IBAN, Passport | GDPR |
| United Kingdom | NI Number, NHS Number, Passport | UK DPA 2018 |
| Canada | SIN, Health Card, Passport, Postal Code | PIPEDA |
config = RedactionConfig(country="AU") # Australia
config = RedactionConfig(country="US") # United States
5. Document Processing
Supported Formats: PDF, DOCX, XLSX, CSV, TXT, HTML, JSON
File Redaction:
zerophix redact-file --input document.pdf --output clean.pdf
Batch Processing:
zerophix batch-redact \
--input-dir ./documents \
--output-dir ./redacted \
--parallel --workers 8
Offline & Air-Gapped Deployment
ZeroPhix is designed for complete data sovereignty and offline operation.
Why Offline Matters
| Scenario | Why ZeroPhix Works |
|---|---|
| Healthcare/Medical | Patient data never leaves premises (HIPAA compliant) |
| Financial Services | Transaction data stays within secure network (PCI DSS) |
| Government/Defense | Classified data in air-gapped environments |
| Legal/Law Firms | Client confidentiality and attorney-client privilege |
| Research Institutions | Sensitive research data protection |
| On-Premise Enterprise | No cloud dependencies, full control |
Offline Deployment Models
1. Regex-Only Mode (Zero Setup)
# 100% offline immediately after pip install
config = RedactionConfig(
country="AU",
detectors=["regex", "statistical"] # No ML models needed
)
- No downloads required
- Works immediately in air-gapped environments
- 99.9% precision for structured data (SSN, TFN, credit cards)
- Ultra-fast processing (1000s of docs/sec)
2. ML-Enhanced Mode (One-Time Setup)
# Download models once (requires internet temporarily)
python -m spacy download en_core_web_lg
pip install "zerophix[all]"
# First run downloads HuggingFace models to cache:
# ~/.cache/zerophix/models/
# ~/.cache/huggingface/
# After setup: 100% offline forever
- Models cached locally (no internet after setup)
- Higher accuracy with ML models (see Benchmark Performance above)
- Transfer cache folder to air-gapped servers
3. Air-Gapped Installation
On internet-connected machine:
# Download all dependencies
pip download "zerophix[all]" -d ./zerophix-offline/
# spaCy model wheels are hosted on the spaCy models releases page;
# pip-download the en_core_web_lg wheel into ./zerophix-offline/ as well
# Download ML models to local cache
python -c "
from zerophix.detectors.bert_detector import BERTDetector
from zerophix.detectors.gliner_detector import GLiNERDetector
BERTDetector(); GLiNERDetector()  # instantiating triggers the download and caching
"
# Copy cache directory
cp -r ~/.cache/zerophix ./zerophix-offline/cache/
cp -r ~/.cache/huggingface ./zerophix-offline/cache/
On air-gapped machine:
# Transfer folder via USB/secure network
# Install from local packages
pip install --no-index --find-links=./zerophix-offline/ "zerophix[all]"
# Restore cache
cp -r ./zerophix-offline/cache/zerophix ~/.cache/
cp -r ./zerophix-offline/cache/huggingface ~/.cache/
# Now 100% offline - no internet required
Offline vs. Cloud Comparison
| Feature | ZeroPhix (Offline) | Cloud APIs (Azure, AWS) |
|---|---|---|
| Internet Required | No (after setup) | Yes (always) |
| Data Leaves Premises | Never | Yes |
| Costs | Infrastructure and maintenance | Per-document API fees |
| Processing Speed | 1000s docs/sec | Rate limited |
| Data Sovereignty | Complete | Cloud provider |
| Compliance Audit | Simple | Complex |
| Vendor Lock-in | None | High |
Pre-Built Docker Image (Offline-Ready)
# Build once with all models included
docker build -t zerophix:offline --build-arg INCLUDE_MODELS=true .
# Run completely offline
docker run --network=none -p 8000:8000 zerophix:offline
The Docker image includes all models - perfect for air-gapped Kubernetes clusters.
Programmatic Document Processing
from zerophix.processors.documents import PDFProcessor, DOCXProcessor
# PDF with OCR
pdf_processor = PDFProcessor()
text = pdf_processor.extract_text(pdf_bytes, ocr_enabled=True)
result = pipeline.redact(text)
# Excel with column mapping ("service" is a redaction service instance; its construction is not shown here)
service.redact_excel(
input_path="data.xlsx",
column_mapping={"name": "PERSON_NAME", "ssn": "SSN"}
)
6. Custom Entities
Runtime Patterns:
config = RedactionConfig(
custom_patterns={
"EMPLOYEE_ID": [r"EMP-\d{6}"],
"PROJECT_CODE": [r"PROJ-[A-Z]{3}-\d{4}"]
}
)
Company Policies (YAML):
# configs/company/acme.yml
regex_patterns:
EMPLOYEE_ID: '(?i)\bEMP-\d{5}\b'
PROJECT_CODE: '(?i)\bPRJ-[A-Z]{3}-\d{3}\b'
config = RedactionConfig(country="AU", company="acme")
REST API
Quick Start
# Development (localhost:8000)
python -m zerophix.api.rest
# Production (configure via .env)
cp .env.example .env
# Edit .env with your settings
python -m zerophix.api.rest
Configuration
Environment Variables:
ZEROPHIX_API_HOST=0.0.0.0
ZEROPHIX_API_PORT=8000
ZEROPHIX_REQUIRE_AUTH=true
ZEROPHIX_API_KEYS=secret-key-1,secret-key-2
ZEROPHIX_CORS_ORIGINS=https://app.example.com
ZEROPHIX_ENV=production
Programmatic:
from zerophix.config import APIConfig
from zerophix.api import create_app
config = APIConfig(
host="0.0.0.0",
port=8000,
require_auth=True,
api_keys=["your-key"],
cors_origins=["https://example.com"],
ssl_enabled=True
)
app = create_app(config)
API Endpoints
Redact Text:
curl -X POST "http://localhost:8000/redact" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-key" \
-d '{"text": "John Doe, SSN: 123-45-6789", "country": "US"}'
Response:
{
"success": true,
"redacted_text": "[PERSON], SSN: XXX-XX-6789",
"entities_found": 2,
"processing_time": 0.045,
"spans": [
{"start": 0, "end": 8, "label": "PERSON", "score": 0.95},
{"start": 15, "end": 26, "label": "SSN", "score": 1.0}
]
}
Docs: http://localhost:8000/docs
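The same endpoint can be called from Python; a minimal sketch using the requests library, with the endpoint and payload taken from the curl example above:
import requests

resp = requests.post(
    "http://localhost:8000/redact",
    headers={"Authorization": "Bearer your-key"},
    json={"text": "John Doe, SSN: 123-45-6789", "country": "US"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["redacted_text"])  # [PERSON], SSN: XXX-XX-6789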
Deployment Options
Docker:
docker build -t zerophix:latest .
docker run -p 8000:8000 -e ZEROPHIX_API_HOST=0.0.0.0 zerophix:latest
Kubernetes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zerophix-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: zerophix
  template:
    metadata:
      labels:
        app: zerophix
    spec:
      containers:
      - name: zerophix
        image: zerophix:latest
        ports:
        - containerPort: 8000
        env:
        - name: ZEROPHIX_API_HOST
          value: "0.0.0.0"
        - name: ZEROPHIX_REQUIRE_AUTH
          value: "true"
Cloud Platforms: AWS (ECS/Lambda), GCP (Cloud Run), Azure (App Service), Heroku
SSL/TLS:
ZEROPHIX_SSL_ENABLED=true
ZEROPHIX_SSL_KEYFILE=/path/to/key.pem
ZEROPHIX_SSL_CERTFILE=/path/to/cert.pem
For detailed deployment guides, see .env.example and configs/api_config.yml in the repository.
Security & Compliance
Zero Trust Architecture
- Multi-factor authentication validation
- Device security posture assessment
- Dynamic trust scoring (0-100%)
- Continuous verification
Encryption
- AES-128 encryption at rest
- Master key management with rotation
- Format-preserving encryption
- Secure deletion with overwrites
Audit & Monitoring
- Tamper-evident audit logs (see the sketch below)
- Real-time security monitoring
- Compliance violation detection
- Risk-based alerting
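Tamper-evident logs are commonly built as hash chains, where each entry commits to its predecessor; a minimal sketch of the general technique (ZeroPhix's actual log format is not shown here):
import hashlib, json

def append_entry(log: list, event: dict) -> None:
    # Each entry stores the previous entry's hash, so any edit breaks the chain
    prev_hash = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    log.append({"event": event, "prev": prev_hash,
                "hash": hashlib.sha256(payload.encode()).hexdigest()})

log = []
append_entry(log, {"action": "redact", "doc": "report.pdf"})
append_entry(log, {"action": "scan", "doc": "notes.txt"})
# To verify: recompute each hash in order; a tampered entry no longer
# matches the "prev" recorded by its successor.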
Compliance Standards
GDPR:
result = pipeline.redact(text, user_context={
"lawful_basis": "legitimate_interest",
"consent_obtained": True,
"purpose": "fraud_prevention"
})
HIPAA:
config = RedactionConfig(
country="US",
compliance_standards=["HIPAA"],
phi_detection=True
)
PCI DSS:
config = RedactionConfig(
cardholder_data_detection=True,
encryption_required=True
)
Security CLI
zerophix security audit-logs --days 30
zerophix security compliance-check --standard GDPR
zerophix security zero-trust-test
Performance
Optimization Features
ZeroPhix includes powerful performance optimizations for high-throughput processing:
1. Model Caching (10-50x Speedup)
Models load once and cache globally - no repeated loading overhead:
from zerophix.pipelines.redaction import RedactionPipeline
from zerophix.config import RedactionConfig
# First pipeline: loads models (~30-60s one-time cost)
cfg = RedactionConfig(country="AU", use_gliner=True, use_spacy=True)
pipeline1 = RedactionPipeline(cfg)
# Second pipeline: uses cached models (<1ms)
pipeline2 = RedactionPipeline(cfg)
# Models are cached automatically - no configuration needed!
2. Batch Processing (4-8x Speedup)
Process multiple documents in parallel:
from zerophix.performance import BatchProcessor
# Process 2500 documents
texts = [doc['text'] for doc in your_documents]
processor = BatchProcessor(
pipeline,
n_workers=4, # Parallel workers (adjust for your CPU)
show_progress=True # Progress bar
)
# Process all documents in parallel
results = processor.process_batch(texts, operation='redact')
# Extract redacted texts
redacted = [r['text'] for r in results]
Performance Comparison:
- Before optimization: 2500 docs in 4-6 hours (6-8s per doc)
- After optimization: 2500 docs in 30-60 minutes (0.7-1.5s per doc)
- Speedup: 5-8x faster for single docs, 15-30x faster for batches
3. Configuration Optimization
Disable slow detectors for 2-3x additional speedup:
# Maximum Speed (3-5x faster, good accuracy)
cfg = RedactionConfig(
country="AU",
use_gliner=True, # Fast + accurate zero-shot
use_spacy=True, # Fast NER
use_bert=False, # Skip BERT for 3x speedup
use_openmed=True, # Only if medical docs
)
# Balanced (2x faster, high accuracy)
cfg = RedactionConfig(
country="AU",
use_gliner=True,
use_spacy=True,
use_openmed=True,
use_bert=False # BERT adds 200ms+ per doc
)
4. Databricks / Spark Optimization
Optimized UDF creation for distributed processing:
from zerophix.performance import DatabricksOptimizer
from pyspark.sql.functions import col
# Create pipeline once (models cached on driver)
pipeline = RedactionPipeline(cfg)
# Create optimized Spark UDF
redact_udf = DatabricksOptimizer.create_udf(pipeline, return_type='redacted')
# Apply to DataFrame
df_redacted = df.withColumn('redacted_text', redact_udf(col('text')))
5. Additional Optimizations
- Intelligent caching (memory or Redis)
- Async processing with redact_batch_async()
- Multi-threading with configurable workers
- Streaming support for large documents
# Redis caching
config = RedactionConfig(
cache_detections=True,
cache_type="redis",
redis_url="redis://localhost:6379"
)
# Async batch
results = await pipeline.redact_batch_async(texts)
# Parallel detection within pipeline
config = RedactionConfig(parallel_detection=True, max_workers=8)
Quick Performance Guide
For Maximum Speed:
- Enable model caching (automatic)
- Use BatchProcessor for multiple documents
- Disable BERT detector (use_bert=False)
- Adjust worker count based on CPU cores
For Databricks:
- Use DatabricksOptimizer.create_udf() for Spark
- Set environment caching: TRANSFORMERS_CACHE=/dbfs/models/cache
- Use GPU instances if available
See examples:
- Basic: python examples/performance_comparison_demo.py
- Databricks: examples/optimized_databricks_benchmark.ipynb
Performance Stats
zerophix stats --analyze --recommendations
Scanning & Reporting
Detect sensitive data without redaction - perfect for compliance audits:
# Scan without redacting
result = pipeline.scan(text)
print(f"Found {result['total_detections']} sensitive items")
# Generate reports
from zerophix.reporting import ReportGenerator
html_report = ReportGenerator.generate(result, format="html")
Report Formats: HTML, JSON, CSV, Markdown, Text
zerophix scan --infile document.txt --format html --output report.html
Examples
| Example | Description |
|---|---|
| test_all_interfaces.py | Quick test of all input types (string, batch, DataFrame, files) |
| all_interfaces_demo.py | Comprehensive demo of all interfaces with detailed examples |
| gliner_examples.py | Zero-shot custom entity detection |
| quick_start_examples.py | Basic usage patterns |
| comprehensive_usage_examples.py | All features demonstrated |
| file_tests_pii.py | CSV/XLSX/PDF processing |
| scan_example.py | Detection without redaction |
| report_example.py | Multi-format reporting |
| ultra_complex_examples.py | Healthcare & financial scenarios |
| run_api.py | API server configuration |
Advanced Features
Fine-Tuning Models
python scripts/finetune_model.py --train_file data.jsonl --output_dir ./my_model
Cloud Logging Integration
Azure Monitor:
export AZURE_LOGGING_ENABLED=true
export AZURE_APPLICATION_INSIGHTS_CONNECTION_STRING="InstrumentationKey=..."
AWS CloudWatch:
export AWS_LOGGING_ENABLED=true
export AWS_LOG_GROUP="zerophix-audit"
Google Cloud:
export GCP_LOGGING_ENABLED=true
Differential Privacy & K-Anonymity
config = RedactionConfig(
masking_style="differential_privacy",
privacy_epsilon=1.0
)
config = RedactionConfig(
masking_style="k_anonymity",
k_value=5,
quasi_identifiers=["age", "zipcode"]
)
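For intuition, the Laplace mechanism behind differential privacy and the generalization step behind k-anonymity look roughly like this (illustrative sketches under assumed parameters, not ZeroPhix internals):
import math
import random

def laplace_noise(value: float, sensitivity: float, epsilon: float) -> float:
    # Draw Laplace(0, sensitivity/epsilon) noise: smaller epsilon = more privacy
    u = random.uniform(-0.5, 0.5)
    scale = sensitivity / epsilon
    return value - scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def generalize_age(age: int, bucket: int = 10) -> str:
    # Coarsen a quasi-identifier so many records share each value
    lo = (age // bucket) * bucket
    return f"{lo}-{lo + bucket - 1}"

print(laplace_noise(52847.0, sensitivity=1.0, epsilon=1.0))  # e.g. 52846.3
print(generalize_age(27))  # 20-29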
Deployment
Docker and Kubernetes deployment use the same commands and manifest shown under Deployment Options in the REST API section above.
Production Checklist
- Enable TLS/SSL
- Configure authentication
- Set up audit logging
- Implement rate limiting
- Configure auto-scaling
- Set up monitoring
- Configure compliance standards
Testing
Comprehensive unit tests covering core functionality, Australian validators, API configuration, and redaction pipelines.
63 passing tests; see the test results and testing guide in the repository.
# Run all tests
cd tests && pytest -v
# Run with coverage
pytest --cov=zerophix --cov-report=html
Test categories:
- Core pipeline & redaction strategies
- Australian checksum validation (TFN, ABN, ACN, Medicare)
- API configuration & environment variables
- Batch processing & scanning
CLI Reference
# Text redaction
zerophix redact --text "Sensitive data"
# File redaction
zerophix redact-file --input doc.pdf --output clean.pdf
# Batch processing
zerophix batch-redact --input-dir ./docs --output-dir ./clean
# Scanning
zerophix scan --infile doc.txt --format html
# API server
python -m zerophix.api.rest
# Security
zerophix security audit-logs
zerophix security compliance-check --standard GDPR
Contributing
Contributions welcome! See CONTRIBUTING.md for guidelines.
Areas for contribution:
- New country/jurisdiction support
- Additional ML models
- Document format processors
- Security enhancements
- Performance optimizations
Support
- Documentation: docs/
- GitHub: yassienshaalan/zerophix
- Issues: GitHub Issues
License
Apache License 2.0 - see LICENSE file.
Acknowledgments
spaCy • Transformers • FastAPI • Cryptography • Rich
Made with care for data privacy and security.
ZeroPhix v0.1.15 - The enterprise choice for PII/PSI/PHI redaction.