ZeroPhix v0.1.15 - Enterprise PII/PSI/PHI Redaction
Enterprise-grade, multilingual PII/PSI/PHI redaction - free, offline, and fully customizable.
What is ZeroPhix?
ZeroPhix is an enterprise-grade tool for detecting and redacting sensitive information from text, documents, and data streams.
Detects & Redacts
- PII (Personally Identifiable Information) - names, addresses, emails, phone numbers
- PHI (Protected Health Information) - medical records, patient data, health identifiers
- PSI (Personal Sensitive Information) - financial data, credentials, government IDs
- Custom Data - proprietary identifiers, internal codes, API keys
Name Origin
- Zero = eliminate, remove, redact
- Phi = from PHI (Protected Health Information)
- x = extensible to PII, PSI, and any sensitive data types
Why Choose ZeroPhix?
| Feature | Benefit |
|---|---|
| High Accuracy | ML models + regex patterns for high precision and recall |
| Fast Processing | Smart caching + async processing; throughput depends on your infrastructure |
| Self-Hosted | No per-document API fees, requires infrastructure and maintenance |
| Fully Offline | Air-gapped after one-time model setup |
| Multi-Country | Australia, US, EU, UK, Canada + extensible |
| 100+ Entity Types | SSN, credit cards, medical IDs, passports, etc. |
| Zero-Shot Detection | Detect ANY entity type without training (GLiNER) |
| Compliance Ready | Designed to support GDPR, HIPAA, PCI DSS, and CCPA workflows |
| Enterprise Security | Zero Trust, encryption, audit trails |
| Multiple Formats | PDF, DOCX, Excel, CSV, HTML, JSON |
Quick Start
Installation
Install directly from PyPI:
pip install zerophix
Or use extras for full features:
# With all features (recommended)
pip install "zerophix[all]"
# Or select specific features
pip install "zerophix[gliner,documents,api]"
# For DataFrame support
pip install "zerophix[all]" pandas # For Pandas
pip install "zerophix[all]" pyspark # For PySpark
One-Time Model Setup (Optional)
ZeroPhix works 100% offline after initial setup. ML models are downloaded once and cached locally:
# spaCy models (optional - for enhanced NER)
python -m spacy download en_core_web_lg
# Other ML models auto-download on first use and cache locally
# After initial download, no internet required - fully air-gapped
Offline Modes:
- Regex-only: Works immediately, no downloads, 100% offline from install
- With ML models: One-time download, then 100% offline forever
- Air-gapped environments: Pre-download models, transfer via USB/network
Databricks / Cloud Platforms
For Databricks (DBR 18.0+):
Install via cluster Libraries → Install from PyPI:
pydantic>=2.7
pyyaml>=6.0.1
regex>=2024.4.16
click>=8.1.7
tqdm>=4.66.5
rich>=13.9.2
nltk>=3.8.1
cryptography>=41.0.0
pypdf>=3.0.0
zerophix==0.1.15
In your notebook:
from zerophix.pipelines.redaction import RedactionPipeline
from zerophix.config import RedactionConfig
config = RedactionConfig(country="US", detectors=["regex"])
pipeline = RedactionPipeline(config)
text = "John Doe, SSN: 123-45-6789, Email: john@example.com"
result = pipeline.redact(text)
print(result['text'])
Note: Don't install scipy/numpy/pandas separately on Databricks - use the cluster's pre-compiled versions.
30-Second Demo
from zerophix.pipelines.redaction import RedactionPipeline
from zerophix.config import RedactionConfig
# Configure and redact
config = RedactionConfig(country="US")
pipeline = RedactionPipeline(config)
text = "John Doe, SSN: 123-45-6789, Email: john@example.com"
result = pipeline.redact(text)
print(result['text'])
# Output: [PERSON], SSN: XXX-XX-6789, Email: [EMAIL]
Supported Input Types
ZeroPhix handles all common data formats:
# 1. Single String
result = pipeline.redact("John Smith, SSN: 123-45-6789")
# 2. List of Strings (Batch)
texts = ["text 1 with PII", "text 2 with PHI", "text 3"]
results = pipeline.redact_batch(texts)
# 3. Pandas DataFrame
from zerophix.processors import redact_pandas
df_clean = redact_pandas(df, columns=['name', 'email', 'ssn'], country='US')
# 4. PySpark DataFrame
from zerophix.processors import redact_spark
spark_df_clean = redact_spark(spark_df, columns=['patient_name', 'mrn'], country='US')
# 5. Files (PDF, DOCX, Excel)
from zerophix.processors import PDFProcessor
PDFProcessor().redact_file('input.pdf', 'output.pdf', pipeline)
# 6. Scanning (detect without redacting)
scan_result = pipeline.scan(text) # Returns entities found
Quick Test:
# Test all interfaces
python examples/test_all_interfaces.py
# Comprehensive examples
python examples/all_interfaces_demo.py
Australian Coverage Highlights
ZeroPhix has deep Australian coverage with mathematical checksum validation:
- 40+ Australian entity types (TFN, ABN, ACN, Medicare, driver licenses for all 8 states)
- Checksum validation for government IDs (TFN mod 11, ABN mod 89, ACN mod 10, Medicare mod 10)
- 92%+ precision for Australian government identifiers
- State-specific patterns (NSW, VIC, QLD, SA, WA, TAS, NT, ACT)
- Healthcare, financial, and government identifiers
See AUSTRALIAN_COVERAGE.md for complete details.
Command Line
# Redact text
zerophix redact --text "Sensitive information here"
# Redact files
zerophix redact-file --input document.pdf --output clean.pdf
# Start API server
python -m zerophix.api.rest
Redaction Strategies
ZeroPhix supports multiple redaction strategies to balance privacy and data utility:
| Strategy | Description | Example | Use Case |
|---|---|---|---|
| replace | Full replacement with entity type | <SSN> or <AU_TFN> | Maximum privacy, clear labeling |
| mask | Partial masking | 29****3456 or ***-**-6789 | Data utility + privacy balance |
| hash | Consistent hashing | HASH_A1B2C3D4 | Record linking, de-duplication |
| encrypt | Reversible encryption | ENC_XYZ123 | Secure storage, de-anonymization |
| brackets / redact | Simple [REDACTED] | [REDACTED] | Document redaction, printouts |
| synthetic | Realistic fake data | Alex Smith / 555-1234 | Testing, demos, data sharing |
| preserve_format | Format-preserving | K8d-2L-m9P3 (for SSN) | Schema compatibility |
| au_phone | Keep AU area code | 04XX-XXX-XXX | Australian context preservation |
| differential_privacy | Statistical noise | Original ± noise | Research, analytics |
| k_anonymity | Generalization | <30 (age) / 20XX (postcode) | Privacy-preserving analytics |
Usage:
# Choose your strategy
config = RedactionConfig(
country="AU",
masking_style="hash" # or: replace, mask, encrypt, synthetic, etc.
)
pipeline = RedactionPipeline(config)
result = pipeline.redact(text)
# Strategy-specific options
config = RedactionConfig(
masking_style="mask",
mask_percentage=0.7, # Mask 70% of characters
preserve_format=True
)
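For intuition, here is a minimal sketch of what partial masking with mask_percentage=0.7 typically produces. This is an illustrative stand-alone function, not ZeroPhix's internal masking logic, which may differ in which characters it preserves:
def mask_partial(value: str, mask_percentage: float = 0.7) -> str:
    # Mask the leading characters, keep the tail visible for utility
    n_mask = int(len(value) * mask_percentage)
    return "*" * n_mask + value[n_mask:]

print(mask_partial("123-45-6789"))  # *******6789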
Core Features
1. Detection Methods
Regex Patterns (Ultra-fast, highest precision)
- Country-specific patterns for each jurisdiction
- Format validation with checksum verification
- Covers SSN, credit cards, IDs, medical numbers
Machine Learning Models
spaCy NER - Fast, high recall for names and entities
config = RedactionConfig(use_spacy=True, spacy_model="en_core_web_lg")
BERT - Highest accuracy for complex text
config = RedactionConfig(use_bert=True, bert_model="bert-base-cased")
OpenMed - Healthcare-specialized PHI detection
config = RedactionConfig(use_openmed=True, openmed_model="openmed-base")
GLiNER - Zero-shot detection
from zerophix.detectors.gliner_detector import GLiNERDetector
detector = GLiNERDetector()
spans = detector.detect(text, entity_types=["employee id", "project code", "api key"])
# No training needed - just name what you want to find!
Statistical Analysis
- Entropy-based pattern discovery (see the sketch below)
- Frequency analysis for repetitive patterns
- Anomaly detection
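As a rough illustration of the entropy heuristic (a stand-alone sketch; ZeroPhix's statistical detector is more involved), random-looking secrets score much higher than ordinary words:
import math
from collections import Counter

def shannon_entropy(token: str) -> float:
    # Bits per character: high values suggest keys, tokens, or hashes
    counts = Counter(token)
    return -sum(c / len(token) * math.log2(c / len(token)) for c in counts.values())

print(shannon_entropy("password"))             # 2.75 (ordinary word)
print(shannon_entropy("sk-9fQ2xL7vA1pZ8mK3"))  # ~4.25 (likely a secret)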
Auto-Mode (Intelligent Domain Detection)
config = RedactionConfig(mode="auto") # Auto-selects best detectors
Choosing the Right Configuration
Decision Tree: What Should You Use?
The best configuration is always empirical - it depends on your specific use case, data characteristics, accuracy requirements, and performance constraints. We strongly recommend testing multiple configurations on your actual data to determine what works best.
Quick Decision Guide
START HERE
│
├─ Need MAXIMUM SPEED (real-time, high-volume)?
│ └─ Use: mode='fast' (regex only)
│ - High Speed
│ - High precision on structured IDs
│ - Best for: emails, phones, SSN, TFN, ABN, credit cards
│ - May miss: names in unstructured text, context-dependent entities
│
├─ Need MAXIMUM ACCURACY (compliance-critical)?
│ └─ Use: mode='accurate' (regex + all ML models)
│ - High recall (catches more PII)
│ - Best for: healthcare PHI, legal discovery, GDPR compliance
│ - Slower
│ - Higher memory: 500MB-2GB
│
├─ Structured data ONLY (CSV, forms, databases)?
│ └─ Use: mode='fast' with validation
│ - Checksum validation for TFN/ABN/Medicare
│ - Format-specific patterns
│ - Near-perfect precision
│
├─ Unstructured text (emails, documents, notes)?
│ └─ Use: mode='accurate' OR custom ensemble
│ - Combines regex + spaCy + BERT/GLiNER
│ - Catches names, context-dependent entities
│ - Better recall on varied text
│
├─ Healthcare/Medical data?
│ └─ Use: mode='accurate' + use_openmed=True
│ - PHI-optimized models
│ - Medical terminology awareness
│ - HIPAA compliance focus (87.5% recall benchmark)
│
├─ Custom entity types (not standard PII)?
│ └─ Use: GLiNER with custom labels
│ - Zero-shot detection - no training needed
│ - Just name what you want: "employee ID", "project code"
│ - Works on domain-specific identifiers
│
└─ Not sure? Testing multiple datasets?
└─ Use: mode='auto'
- Intelligently selects detectors per document
- Good starting point
- Then benchmark and tune based on your results
Configuration Examples by Use Case
High-Volume Transaction Processing:
config = RedactionConfig(
mode='fast',
use_spacy=False,
use_bert=False,
enable_checksum_validation=True # TFN/ABN validation
)
# Prioritizes: Speed, low memory, structured data
Healthcare Records (HIPAA Compliance):
config = RedactionConfig(
mode='accurate',
use_spacy=True,
use_openmed=True,
use_bert=True,
recall_threshold=0.85 # Prioritize not missing PHI
)
# Prioritizes: High recall, medical PHI, compliance
Legal Document Review:
config = RedactionConfig(
mode='accurate',
use_spacy=True,
use_bert=True,
use_gliner=True,
precision_threshold=0.90 # Reduce false positives
)
# Prioritizes: Accuracy, names, case numbers, dates
Customer Support Logs (Mixed Content):
config = RedactionConfig(
mode='balanced', # Medium speed + accuracy
use_spacy=True,
use_bert=False, # Skip if speed matters
batch_size=100
)
# Prioritizes: Balanced speed/accuracy, emails, phones, names
Testing Recommendations
Always benchmark on YOUR data:
- Start with 'auto' mode - Get baseline performance
- Test 'fast' mode - Measure speed vs accuracy trade-off
- Test 'accurate' mode - Measure recall improvement
- Try custom combinations - Enable/disable specific detectors
- Measure what matters to YOU:
- False negatives (missed PII) → Increase recall threshold, add more detectors
- False positives (over-redaction) → Increase precision threshold, tune regex patterns
- Speed (docs/sec) → Disable slower ML models, use batch processing
- Memory usage → Lazy-load models, reduce batch size
Sample Evaluation Script:
from zerophix.eval.metrics import evaluate_detection
configs = [
{'mode': 'fast'},
{'mode': 'balanced'},
{'mode': 'accurate'},
{'mode': 'accurate', 'use_openmed': True} # If healthcare data
]
for cfg in configs:
pipeline = RedactionPipeline(RedactionConfig(**cfg))
metrics = evaluate_detection(pipeline, your_test_data)
print(f"{cfg}: Precision={metrics['precision']:.2f}, Recall={metrics['recall']:.2f}")
Key Takeaway: There is no one-size-fits-all configuration. The "best" setup depends on your data type, accuracy requirements, speed constraints, and compliance needs. Empirical testing is essential.
Adaptive Ensemble - Auto-Configuration
Problem: Manual trial-and-error configuration with unpredictable accuracy
Solution: Automatic calibration learns optimal detector weights from your data
Quick Start
from zerophix.config import RedactionConfig
from zerophix.pipelines.redaction import RedactionPipeline
# 1. Enable adaptive features
config = RedactionConfig(
country="AU",
use_gliner=True,
use_openmed=True,
enable_adaptive_weights=True, # Auto-learns optimal weights
enable_label_normalization=True, # Fixes cross-detector consensus
)
pipeline = RedactionPipeline(config)
# 2. Calibrate on 20-50 labeled samples
validation_texts = ["John Smith has diabetes", "Call 555-1234", ...]
validation_ground_truth = [
[(0, 10, "PERSON_NAME"), (15, 23, "DISEASE")], # (start, end, label)
[(5, 13, "PHONE_NUMBER")],
# ...
]
results = pipeline.calibrate(
validation_texts,
validation_ground_truth,
save_path="calibration.json" # Save for reuse
)
print(f"Optimized weights: {results['detector_weights']}")
# Output: {'gliner': 0.42, 'regex': 0.09, 'openmed': 0.12, 'spacy': 0.25}
# 3. Pipeline now has optimal weights! Use normally
result = pipeline.redact("Jane Doe, Medicare 2234 56781 2")
Key Features
- Adaptive Detector Weights: Automatically adjusts weights based on F1 scores (F1²)
- Label Normalization: Normalizes labels BEFORE voting so "PERSON" (GLiNER) and "USERNAME" (regex) can vote together (sketched below)
- One-Time Calibration: Run once on 20-50 samples, save results, reuse forever
- Performance Tracking: Track detector metrics during operation
- Save/Load: Save calibration to JSON, load in production
How It Works
# Weight calculation (F1-squared method)
weight = max(0.1, detector_f1 ** 2)
# Example:
# GLiNER: F1=0.60 → weight=0.36 (High performer)
# Regex: F1=0.30 → weight=0.09 (Noisy)
# OpenMed: F1=0.10 → weight=0.10 (Poor, floor applied)
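To make the normalization + voting interaction concrete, here is a minimal sketch under assumed names (LABEL_MAP and the vote() helper are illustrative, not ZeroPhix internals):
# Hypothetical canonical label map: detectors emit different names for one concept
LABEL_MAP = {"PERSON": "PERSON_NAME", "PER": "PERSON_NAME", "NAME": "PERSON_NAME"}

def vote(candidates, weights, threshold=0.5):
    # Sum detector weights per (span, normalized label); accept above threshold
    scores = {}
    for detector, span, label in candidates:
        key = (span, LABEL_MAP.get(label, label))  # normalize BEFORE voting
        scores[key] = scores.get(key, 0.0) + weights.get(detector, 0.1)
    return [key for key, score in scores.items() if score >= threshold]

weights = {"gliner": 0.36, "spacy": 0.25, "regex": 0.09}
candidates = [("gliner", (0, 10), "PERSON"), ("spacy", (0, 10), "PER")]
print(vote(candidates, weights))  # [((0, 10), 'PERSON_NAME')]: 0.36 + 0.25 clears 0.5
Without normalization, "PERSON" and "PER" would count as different labels and neither would clear the threshold.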
Production Usage
# Load pre-calibrated weights
config = RedactionConfig(
country="AU",
use_gliner=True,
enable_adaptive_weights=True,
calibration_file="calibration.json" # Load saved weights
)
pipeline = RedactionPipeline(config)
# Ready to use with optimal weights!
One-Function Calibration (For Notebooks)
# Copy-paste into your benchmark notebook
from examples.quick_calibrate import quick_calibrate_zerophix
pipeline, results = quick_calibrate_zerophix(test_samples, num_calibration_samples=20)
# Done! Pipeline has optimal weights learned from your data
Benefits
- Less trial-and-error - Configure once, use everywhere
- Expected better precision - Fewer false positives
- Higher F1 - Better overall accuracy
- Fast calibration - 2-5 seconds for 20 samples
- 100% backward compatible - Opt-in via config flag
See examples/adaptive_ensemble_examples.py for complete examples.
Benchmark Performance & Evaluation Results
ZeroPhix has been rigorously evaluated on standard public benchmarks for PII/PHI detection and redaction.
Test Datasets
| Dataset | Type | Size | Domain | Entities |
|---|---|---|---|---|
| TAB (Text Anonymisation Benchmark) | Legal documents (EU court cases) | 14 test documents | Legal/Government | Names, locations, dates, case numbers, organizations |
| PDF Deid | Synthetic medical PDFs | 100 documents (1,145 PHI spans) | Healthcare/Medical | Patient names, MRN, dates, addresses, phone numbers |
Results Summary
TAB Benchmark (Legal Documents)
Manual Configuration (regex + spaCy + BERT + GLiNER):
- Precision: 48.8%
- Recall: 61.1%
- F1 Score: 54.2%
- Documents: 14 EU court case texts
- Gold spans: 20,809
- Predicted spans: 8,676
- Note: Legal text has high entity density, forcing a trade-off between recall and precision
Auto Configuration (automatic detector selection):
- Precision: 48.6%
- Recall: 61.0%
- F1 Score: 54.1%
- Same corpus, intelligent mode selection
PDF Deid Benchmark (Medical Documents)
Manual Configuration (regex + spaCy + BERT + OpenMed + GLiNER):
- Precision: 67.9%
- Recall: 87.5%
- F1 Score: 76.5%
- Documents: 100 synthetic medical PDFs
- Gold spans: 1,145 PHI instances
- Predicted spans: 1,476
- Note: High recall prioritizes not missing sensitive medical data
Auto Configuration:
- Precision: 67.9%
- Recall: 87.5%
- F1 Score: 76.5%
- Automatic mode achieves same performance as manual configuration
Performance Characteristics
| Metric | Value | Notes |
|---|---|---|
| Processing Speed | 1,000+ docs/sec | Regex-only mode |
| Processing Speed | 100-500 docs/sec | With ML models (spaCy/BERT) |
| Latency | < 50ms | Per document (regex) |
| Latency | 100-300ms | Per document (with ML) |
| Memory Usage | < 100MB | Regex-only |
| Memory Usage | 500MB-2GB | With ML models loaded |
| Accuracy (Structured) | 99.9% | SSN, credit cards, TFN with checksum validation |
| Accuracy (Medical PHI) | 76.5% F1 | Medical records (87.5% recall) |
| Accuracy (Legal Text) | 54.2% F1 | High-density legal documents |
Detector Performance Comparison
| Detector | Speed | Precision | Recall | Best For |
|---|---|---|---|---|
| Regex | Very Fast | 99.9% | 85% | Structured data (SSN, phone, email) |
| spaCy NER | Fast | 88% | 92% | Names, locations, organizations |
| BERT | Moderate | 92% | 89% | Complex entities, context-aware |
| OpenMed | Moderate | 90% | 87% | Medical/healthcare PHI |
| GLiNER | Moderate | 85% | 88% | Zero-shot custom entities |
| Ensemble (All) | Moderate | 87% | 92% | Best overall balance |
Reproducibility
All benchmarks are reproducible:
# Download benchmark datasets
python scripts/download_benchmarks.py
# Run all evaluations
python -m zerophix.eval.run_all_evaluations
# Results saved to: eval/results/evaluation_TIMESTAMP.json
Evaluation configuration and results are available in src/zerophix/eval/ (e.g. src/eval/results/evaluation_2026-01-12T06-25-39Z.json).
Latest benchmark results: eval/results/evaluation_2026-01-02T02-04-28Z.json
Australian Entity Detection (Detailed)
ZeroPhix provides enterprise-grade Australian coverage with 40+ entity types and mathematical checksum validation:
Supported Australian Entities:
- Government IDs: TFN (mod 11), ABN (mod 89), ACN (mod 10) with checksum validation
- Healthcare: Medicare (mod 10), IHI, HPI-I/O, DVA number, PBS card
- Driver Licenses: All 8 states (NSW, VIC, QLD, SA, WA, TAS, NT, ACT)
- Financial: BSB numbers, Centrelink CRN, bank accounts
- Geographic: Enhanced addresses, postcodes (4-digit validation)
- Organizations: Government agencies, hospitals, universities, banks
Checksum Validation Algorithms:
# TFN: Modulus 11 with weights [1,4,3,7,5,8,6,9,10]
# ABN: Modulus 89 (subtract 1 from first digit)
# ACN: Modulus 10 with weights [8,7,6,5,4,3,2,1]
# Medicare: Modulus 10 Luhn-like with weights [1,3,7,9,1,3,7,9]
from zerophix.detectors.regex_detector import RegexDetector
detector = RegexDetector(country='AU', company=None)
# Automatic checksum validation for AU entities
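For reference, the TFN and ABN check-digit algorithms named above are publicly documented; a stand-alone sketch, independent of ZeroPhix's validators:
def is_valid_tfn(tfn: str) -> bool:
    # 9-digit TFN: weighted digit sum must be divisible by 11
    digits = [int(d) for d in tfn if d.isdigit()]
    weights = [1, 4, 3, 7, 5, 8, 6, 9, 10]
    return len(digits) == 9 and sum(d * w for d, w in zip(digits, weights)) % 11 == 0

def is_valid_abn(abn: str) -> bool:
    # 11-digit ABN: subtract 1 from the first digit, weighted sum mod 89 must be 0
    digits = [int(d) for d in abn if d.isdigit()]
    if len(digits) != 11:
        return False
    digits[0] -= 1
    weights = [10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
    return sum(d * w for d, w in zip(digits, weights)) % 89 == 0

print(is_valid_tfn("123 456 782"))     # True (well-known test TFN)
print(is_valid_abn("51 824 753 556"))  # True (the ABR's example ABN)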
2. Ensemble & Context
Ensemble Voting - Combines multiple detectors with weighted voting
config = RedactionConfig(
enable_ensemble_voting=True,
detector_weights={"regex": 2.0, "bert": 1.2, "spacy": 1.0}
)
Context Propagation - Remembers high-confidence entities across document
config = RedactionConfig(
enable_context_propagation=True,
context_propagation_threshold=0.90
)
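An illustrative run (the exact redaction output is indicative, not guaranteed):
pipeline = RedactionPipeline(config)
text = "Dr. Jane Smith reviewed the scan. Smith recommended follow-up."
result = pipeline.redact(text)
# Once "Jane Smith" is detected with confidence >= 0.90, propagation
# also redacts the later bare mention "Smith".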
Allow-List Filtering - Whitelist terms that should never be redacted
config = RedactionConfig(allow_list=["ACME Corp", "Project Phoenix"])
3. Redaction Strategies
| Strategy | Example | Use Case |
|---|---|---|
| Mask | XXX-XX-6789 | Partial visibility |
| Hash | HASH_9a8b7c6d | Deterministic replacement |
| Synthetic | alex@provider.net | Realistic fake data |
| Encrypt | ENC_a8f9b3c2 | Reversible with key |
| Format-Preserving | 555-8947 | Maintains structure |
| Differential Privacy | $52,847 | Statistical privacy |
config = RedactionConfig(masking_style="synthetic")
4. Multi-Country Support
| Country | Entities Covered | Compliance |
|---|---|---|
| Australia | Medicare, TFN, ABN/ACN, Driver License, IHI | Privacy Act |
| United States | SSN, ITIN, Passport, Medical Record, Credit Card | HIPAA, CCPA |
| European Union | National ID, VAT, IBAN, Passport | GDPR |
| United Kingdom | NI Number, NHS Number, Passport | UK DPA 2018 |
| Canada | SIN, Health Card, Passport, Postal Code | PIPEDA |
config = RedactionConfig(country="AU") # Australia
config = RedactionConfig(country="US") # United States
5. Document Processing
Supported Formats: PDF, DOCX, XLSX, CSV, TXT, HTML, JSON
File Redaction:
zerophix redact-file --input document.pdf --output clean.pdf
Batch Processing:
zerophix batch-redact \
--input-dir ./documents \
--output-dir ./redacted \
--parallel --workers 8
Offline & Air-Gapped Deployment
ZeroPhix is designed for complete data sovereignty and offline operation.
Why Offline Matters
| Scenario | Why ZeroPhix Works |
|---|---|
| Healthcare/Medical | Patient data never leaves premises (HIPAA compliant) |
| Financial Services | Transaction data stays within secure network (PCI DSS) |
| Government/Defense | Classified data in air-gapped environments |
| Legal/Law Firms | Client confidentiality and attorney-client privilege |
| Research Institutions | Sensitive research data protection |
| On-Premise Enterprise | No cloud dependencies, full control |
Offline Deployment Models
1. Regex-Only Mode (Zero Setup)
# 100% offline immediately after pip install
config = RedactionConfig(
country="AU",
detectors=["regex", "statistical"] # No ML models needed
)
- No downloads required
- Works immediately in air-gapped environments
- 99.9% precision for structured data (SSN, TFN, credit cards)
- Ultra-fast processing (1000s of docs/sec)
2. ML-Enhanced Mode (One-Time Setup)
# Download models once (requires internet temporarily)
python -m spacy download en_core_web_lg
pip install "zerophix[all]"
# First run downloads HuggingFace models to cache:
# ~/.cache/zerophix/models/
# ~/.cache/huggingface/
# After setup: 100% offline forever
- Models cached locally (no internet after setup)
- Higher accuracy with ML models (see Benchmark Performance above)
- Transfer cache folder to air-gapped servers
3. Air-Gapped Installation
On internet-connected machine:
# Download all dependencies
pip download "zerophix[all]" -d ./zerophix-offline/
# spaCy model wheels are hosted on the spaCy models releases page;
# pip-download the en_core_web_lg wheel into ./zerophix-offline/ as well
# Download ML models to local cache
python -c "
from zerophix.detectors.bert_detector import BERTDetector
from zerophix.detectors.gliner_detector import GLiNERDetector
BERTDetector(); GLiNERDetector()  # instantiating triggers the download and caching
"
# Copy cache directory
cp -r ~/.cache/zerophix ./zerophix-offline/cache/
cp -r ~/.cache/huggingface ./zerophix-offline/cache/
On air-gapped machine:
# Transfer folder via USB/secure network
# Install from local packages
pip install --no-index --find-links=./zerophix-offline/ "zerophix[all]"
# Restore cache
cp -r ./zerophix-offline/cache/zerophix ~/.cache/
cp -r ./zerophix-offline/cache/huggingface ~/.cache/
# Now 100% offline - no internet required
Offline vs. Cloud Comparison
| Feature | ZeroPhix (Offline) | Cloud APIs (Azure, AWS) |
|---|---|---|
| Internet Required | No (after setup) | Yes (always) |
| Data Leaves Premises | Never | Yes |
| Costs | Infrastructure and maintenance | Per-document API fees |
| Processing Speed | 1000s docs/sec | Rate limited |
| Data Sovereignty | Complete | Cloud provider |
| Compliance Audit | Simple | Complex |
| Vendor Lock-in | None | High |
Pre-Built Docker Image (Offline-Ready)
# Build once with all models included
docker build -t zerophix:offline --build-arg INCLUDE_MODELS=true .
# Run completely offline
docker run --network=none -p 8000:8000 zerophix:offline
The Docker image includes all models - perfect for air-gapped Kubernetes clusters.
Programmatic Document Processing
from zerophix.processors.documents import PDFProcessor, DOCXProcessor
# PDF with OCR
pdf_processor = PDFProcessor()
text = pdf_processor.extract_text(pdf_bytes, ocr_enabled=True)
result = pipeline.redact(text)
# Excel with column mapping ("service" is a redaction service instance; its construction is not shown here)
service.redact_excel(
input_path="data.xlsx",
column_mapping={"name": "PERSON_NAME", "ssn": "SSN"}
)
6. Custom Entities
Runtime Patterns:
config = RedactionConfig(
custom_patterns={
"EMPLOYEE_ID": [r"EMP-\d{6}"],
"PROJECT_CODE": [r"PROJ-[A-Z]{3}-\d{4}"]
}
)
Company Policies (YAML):
# configs/company/acme.yml
regex_patterns:
EMPLOYEE_ID: '(?i)\bEMP-\d{5}\b'
PROJECT_CODE: '(?i)\bPRJ-[A-Z]{3}-\d{3}\b'
config = RedactionConfig(country="AU", company="acme")
REST API
Quick Start
# Development (localhost:8000)
python -m zerophix.api.rest
# Production (configure via .env)
cp .env.example .env
# Edit .env with your settings
python -m zerophix.api.rest
Configuration
Environment Variables:
ZEROPHIX_API_HOST=0.0.0.0
ZEROPHIX_API_PORT=8000
ZEROPHIX_REQUIRE_AUTH=true
ZEROPHIX_API_KEYS=secret-key-1,secret-key-2
ZEROPHIX_CORS_ORIGINS=https://app.example.com
ZEROPHIX_ENV=production
Programmatic:
from zerophix.config import APIConfig
from zerophix.api import create_app
config = APIConfig(
host="0.0.0.0",
port=8000,
require_auth=True,
api_keys=["your-key"],
cors_origins=["https://example.com"],
ssl_enabled=True
)
app = create_app(config)
API Endpoints
Redact Text:
curl -X POST "http://localhost:8000/redact" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-key" \
-d '{"text": "John Doe, SSN: 123-45-6789", "country": "US"}'
Response:
{
"success": true,
"redacted_text": "[PERSON], SSN: XXX-XX-6789",
"entities_found": 2,
"processing_time": 0.045,
"spans": [
{"start": 0, "end": 8, "label": "PERSON", "score": 0.95},
{"start": 15, "end": 26, "label": "SSN", "score": 1.0}
]
}
Docs: http://localhost:8000/docs
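The same endpoint can be called from Python; a minimal sketch using the requests library, with the endpoint and payload taken from the curl example above:
import requests

resp = requests.post(
    "http://localhost:8000/redact",
    headers={"Authorization": "Bearer your-key"},
    json={"text": "John Doe, SSN: 123-45-6789", "country": "US"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["redacted_text"])  # [PERSON], SSN: XXX-XX-6789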
Deployment Options
Docker:
docker build -t zerophix:latest .
docker run -p 8000:8000 -e ZEROPHIX_API_HOST=0.0.0.0 zerophix:latest
Kubernetes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zerophix-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: zerophix
  template:
    metadata:
      labels:
        app: zerophix
    spec:
      containers:
      - name: zerophix
        image: zerophix:latest
        ports:
        - containerPort: 8000
        env:
        - name: ZEROPHIX_API_HOST
          value: "0.0.0.0"
        - name: ZEROPHIX_REQUIRE_AUTH
          value: "true"
Cloud Platforms: AWS (ECS/Lambda), GCP (Cloud Run), Azure (App Service), Heroku
SSL/TLS:
ZEROPHIX_SSL_ENABLED=true
ZEROPHIX_SSL_KEYFILE=/path/to/key.pem
ZEROPHIX_SSL_CERTFILE=/path/to/cert.pem
For detailed deployment guides, see .env.example and configs/api_config.yml in the repository.
Security & Compliance
Zero Trust Architecture
- Multi-factor authentication validation
- Device security posture assessment
- Dynamic trust scoring (0-100%)
- Continuous verification
Encryption
- AES-128 encryption at rest
- Master key management with rotation
- Format-preserving encryption
- Secure deletion with overwrites
Audit & Monitoring
- Tamper-evident audit logs (see the sketch below)
- Real-time security monitoring
- Compliance violation detection
- Risk-based alerting
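Tamper-evident logs are commonly built as hash chains, where each entry commits to its predecessor; a minimal sketch of the general technique (ZeroPhix's actual log format is not shown here):
import hashlib, json

def append_entry(log: list, event: dict) -> None:
    # Each entry stores the previous entry's hash, so any edit breaks the chain
    prev_hash = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    log.append({"event": event, "prev": prev_hash,
                "hash": hashlib.sha256(payload.encode()).hexdigest()})

log = []
append_entry(log, {"action": "redact", "doc": "report.pdf"})
append_entry(log, {"action": "scan", "doc": "notes.txt"})
# To verify: recompute each hash in order; a tampered entry no longer
# matches the "prev" recorded by its successor.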
Compliance Standards
GDPR:
result = pipeline.redact(text, user_context={
"lawful_basis": "legitimate_interest",
"consent_obtained": True,
"purpose": "fraud_prevention"
})
HIPAA:
config = RedactionConfig(
country="US",
compliance_standards=["HIPAA"],
phi_detection=True
)
PCI DSS:
config = RedactionConfig(
cardholder_data_detection=True,
encryption_required=True
)
Security CLI
zerophix security audit-logs --days 30
zerophix security compliance-check --standard GDPR
zerophix security zero-trust-test
Performance
Optimization Features
ZeroPhix includes powerful performance optimizations for high-throughput processing:
1. Model Caching (10-50x Speedup)
Models load once and cache globally - no repeated loading overhead:
from zerophix.pipelines.redaction import RedactionPipeline
from zerophix.config import RedactionConfig
# First pipeline: loads models (~30-60s one-time cost)
cfg = RedactionConfig(country="AU", use_gliner=True, use_spacy=True)
pipeline1 = RedactionPipeline(cfg)
# Second pipeline: uses cached models (<1ms)
pipeline2 = RedactionPipeline(cfg)
# Models are cached automatically - no configuration needed!
2. Batch Processing (4-8x Speedup)
Process multiple documents in parallel:
from zerophix.performance import BatchProcessor
# Process 2500 documents
texts = [doc['text'] for doc in your_documents]
processor = BatchProcessor(
pipeline,
n_workers=4, # Parallel workers (adjust for your CPU)
show_progress=True # Progress bar
)
# Process all documents in parallel
results = processor.process_batch(texts, operation='redact')
# Extract redacted texts
redacted = [r['text'] for r in results]
Performance Comparison:
- Before optimization: 2500 docs in 4-6 hours (6-8s per doc)
- After optimization: 2500 docs in 30-60 minutes (0.7-1.5s per doc)
- Speedup: 5-8x faster for single docs, 15-30x faster for batches
3. Configuration Optimization
Disable slow detectors for 2-3x additional speedup:
# Maximum Speed (3-5x faster, good accuracy)
cfg = RedactionConfig(
country="AU",
use_gliner=True, # Fast + accurate zero-shot
use_spacy=True, # Fast NER
use_bert=False, # Skip BERT for 3x speedup
use_openmed=True, # Only if medical docs
)
# Balanced (2x faster, high accuracy)
cfg = RedactionConfig(
country="AU",
use_gliner=True,
use_spacy=True,
use_openmed=True,
use_bert=False # BERT adds 200ms+ per doc
)
4. Databricks / Spark Optimization
Optimized UDF creation for distributed processing:
from zerophix.performance import DatabricksOptimizer
from pyspark.sql.functions import col
# Create pipeline once (models cached on driver)
pipeline = RedactionPipeline(cfg)
# Create optimized Spark UDF
redact_udf = DatabricksOptimizer.create_udf(pipeline, return_type='redacted')
# Apply to DataFrame
df_redacted = df.withColumn('redacted_text', redact_udf(col('text')))
5. Additional Optimizations
- Intelligent caching (memory or Redis)
- Async processing with redact_batch_async()
- Multi-threading with configurable workers
- Streaming support for large documents
# Redis caching
config = RedactionConfig(
cache_detections=True,
cache_type="redis",
redis_url="redis://localhost:6379"
)
# Async batch
results = await pipeline.redact_batch_async(texts)
# Parallel detection within pipeline
config = RedactionConfig(parallel_detection=True, max_workers=8)
Quick Performance Guide
For Maximum Speed:
- Enable model caching (automatic)
- Use BatchProcessor for multiple documents
- Disable BERT detector (use_bert=False)
- Adjust worker count based on CPU cores
For Databricks:
- Use DatabricksOptimizer.create_udf() for Spark
- Set environment caching: TRANSFORMERS_CACHE=/dbfs/models/cache
- Use GPU instances if available
See examples:
- Basic: python examples/performance_comparison_demo.py
- Databricks: examples/optimized_databricks_benchmark.ipynb
Performance Stats
zerophix stats --analyze --recommendations
Scanning & Reporting
Detect sensitive data without redaction - perfect for compliance audits:
# Scan without redacting
result = pipeline.scan(text)
print(f"Found {result['total_detections']} sensitive items")
# Generate reports
from zerophix.reporting import ReportGenerator
html_report = ReportGenerator.generate(result, format="html")
Report Formats: HTML, JSON, CSV, Markdown, Text
zerophix scan --infile document.txt --format html --output report.html
Examples
| Example | Description |
|---|---|
| test_all_interfaces.py | Quick test of all input types (string, batch, DataFrame, files) |
| all_interfaces_demo.py | Comprehensive demo of all interfaces with detailed examples |
| gliner_examples.py | Zero-shot custom entity detection |
| quick_start_examples.py | Basic usage patterns |
| comprehensive_usage_examples.py | All features demonstrated |
| file_tests_pii.py | CSV/XLSX/PDF processing |
| scan_example.py | Detection without redaction |
| report_example.py | Multi-format reporting |
| ultra_complex_examples.py | Healthcare & financial scenarios |
| run_api.py | API server configuration |
Advanced Features
Fine-Tuning Models
python scripts/finetune_model.py --train_file data.jsonl --output_dir ./my_model
Cloud Logging Integration
Azure Monitor:
export AZURE_LOGGING_ENABLED=true
export AZURE_APPLICATION_INSIGHTS_CONNECTION_STRING="InstrumentationKey=..."
AWS CloudWatch:
export AWS_LOGGING_ENABLED=true
export AWS_LOG_GROUP="zerophix-audit"
Google Cloud:
export GCP_LOGGING_ENABLED=true
Differential Privacy & K-Anonymity
config = RedactionConfig(
masking_style="differential_privacy",
privacy_epsilon=1.0
)
config = RedactionConfig(
masking_style="k_anonymity",
k_value=5,
quasi_identifiers=["age", "zipcode"]
)
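For intuition, the Laplace mechanism behind differential privacy and the generalization step behind k-anonymity look roughly like this (illustrative sketches under assumed parameters, not ZeroPhix internals):
import math
import random

def laplace_noise(value: float, sensitivity: float, epsilon: float) -> float:
    # Draw Laplace(0, sensitivity/epsilon) noise: smaller epsilon = more privacy
    u = random.uniform(-0.5, 0.5)
    scale = sensitivity / epsilon
    return value - scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def generalize_age(age: int, bucket: int = 10) -> str:
    # Coarsen a quasi-identifier so many records share each value
    lo = (age // bucket) * bucket
    return f"{lo}-{lo + bucket - 1}"

print(laplace_noise(52847.0, sensitivity=1.0, epsilon=1.0))  # e.g. 52846.3
print(generalize_age(27))  # 20-29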
Deployment
Docker and Kubernetes deployment use the same commands and manifest shown under Deployment Options in the REST API section above.
Production Checklist
- Enable TLS/SSL
- Configure authentication
- Set up audit logging
- Implement rate limiting
- Configure auto-scaling
- Set up monitoring
- Configure compliance standards
Testing
Comprehensive unit tests covering core functionality, Australian validators, API configuration, and redaction pipelines.
63 passing tests; see the test results and testing guide in the repository.
# Run all tests
cd tests && pytest -v
# Run with coverage
pytest --cov=zerophix --cov-report=html
Test categories:
- Core pipeline & redaction strategies
- Australian checksum validation (TFN, ABN, ACN, Medicare)
- API configuration & environment variables
- Batch processing & scanning
CLI Reference
# Text redaction
zerophix redact --text "Sensitive data"
# File redaction
zerophix redact-file --input doc.pdf --output clean.pdf
# Batch processing
zerophix batch-redact --input-dir ./docs --output-dir ./clean
# Scanning
zerophix scan --infile doc.txt --format html
# API server
python -m zerophix.api.rest
# Security
zerophix security audit-logs
zerophix security compliance-check --standard GDPR
Contributing
Contributions welcome! See CONTRIBUTING.md for guidelines.
Areas for contribution:
- New country/jurisdiction support
- Additional ML models
- Document format processors
- Security enhancements
- Performance optimizations
Support
- Documentation: docs/
- GitHub: yassienshaalan/zerophix
- Issues: GitHub Issues
License
Apache License 2.0 - see LICENSE file.
Acknowledgments
spaCy • Transformers • FastAPI • Cryptography • Rich
Made with care for data privacy and security.
ZeroPhix v0.1.15 - The enterprise choice for PII/PSI/PHI redaction.