Lightning-fast PII detection and anonymization library with 190x performance advantage
Project description
DataFog: PII Detection & Anonymization
Fast processing • Production-ready • Simple configuration
Overview
DataFog provides efficient PII detection using a pattern-first approach that processes text significantly faster than traditional NLP methods while maintaining high accuracy.
```python
# Basic usage example
from datafog import DataFog

results = DataFog().scan_text("John's email is john@example.com and SSN is 123-45-6789")
```
Performance Comparison
| Engine | 10KB Text Processing | Relative Speed | Accuracy |
|---|---|---|---|
| DataFog (Regex) | ~2.4ms | 190x faster | High (structured) |
| DataFog (GLiNER) | ~15ms | 32x faster | Very High |
| DataFog (Smart) | ~3-15ms | 60x faster | Highest |
| spaCy | ~459ms | baseline | Good |
Performance measured on a 13.3 KB business document. GLiNER provides excellent accuracy for named entities while maintaining a clear speed advantage over spaCy.
Supported PII Types
| Type | Examples | Use Cases |
|---|---|---|
| Email | john@company.com | Contact scrubbing |
| Phone | (555) 123-4567 | Call log anonymization |
| SSN | 123-45-6789 | HR data protection |
| Credit Cards | 4111-1111-1111-1111 | Payment processing |
| IP Addresses | 192.168.1.1 | Network log cleaning |
| Dates | 01/01/1990 | Birthdate removal |
| ZIP Codes | 12345-6789 | Location anonymization |
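Detection can also be narrowed to just the types relevant to a workflow. Below is a minimal sketch using the entities option shown in the Anonymization Options section further down; the exact entity labels are assumed to match that section, and the output format may vary by version.

```python
from datafog import DataFog

# Redact only the types you care about; other findings are left alone.
# Entity labels are assumed to match the Anonymization Options section.
hr_redactor = DataFog(operations=["scan", "redact"], entities=["EMAIL", "SSN"])

text = "Jane Doe, jane@corp.com, SSN 123-45-6789, ZIP 12345-6789"
print(hr_redactor.process_text(text))
# Expected: the email and SSN are redacted, the ZIP code is kept.
```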
Quick Start
Installation
```bash
# Lightweight core (fast regex-based PII detection)
pip install datafog

# With advanced ML models for better accuracy
pip install datafog[nlp]           # spaCy for advanced NLP
pip install datafog[nlp-advanced]  # GLiNER for modern NER
pip install datafog[ocr]           # Image processing with OCR
pip install datafog[all]           # Everything included
```
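Not sure which extras are present in a given environment? A quick check like the one below can help; the underlying package names (spacy, gliner, pytesseract) are assumptions about what each extra installs.

```python
import importlib.util

# Report which optional engines are importable in this environment.
# The module names here are assumptions about what each extra pulls in.
for extra, module in [("nlp", "spacy"), ("nlp-advanced", "gliner"), ("ocr", "pytesseract")]:
    status = "installed" if importlib.util.find_spec(module) else "missing"
    print(f"datafog[{extra}] -> {module}: {status}")
```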
Basic Usage
Detect PII in text:
```python
from datafog import DataFog

# Simple detection (uses fast regex engine)
detector = DataFog()
text = "Contact John Doe at john.doe@company.com or (555) 123-4567"
results = detector.scan_text(text)
print(results)
# Finds: emails, phone numbers, and more

# Modern NER with GLiNER (requires: pip install datafog[nlp-advanced])
from datafog.services import TextService

gliner_service = TextService(engine="gliner")
result = gliner_service.annotate_text_sync("Dr. John Smith works at General Hospital")
# Detects: PERSON, ORGANIZATION with high accuracy

# Best of both worlds: smart cascading (recommended for production)
smart_service = TextService(engine="smart")
result = smart_service.annotate_text_sync("Contact john@company.com or call (555) 123-4567")
# Uses regex for structured PII (fast), GLiNER for entities (accurate)
```
Anonymize on the fly:
```python
# Redact sensitive data
redacted = DataFog(operations=["scan", "redact"]).process_text(
    "My SSN is 123-45-6789 and email is john@example.com"
)
print(redacted)
# Output: "My SSN is [REDACTED] and email is [REDACTED]"

# Replace with fake data
replaced = DataFog(operations=["scan", "replace"]).process_text(
    "Call me at (555) 123-4567"
)
print(replaced)
# Output: "Call me at [PHONE_A1B2C3]"
```
Process images with OCR:
```python
import asyncio
from datafog import DataFog

async def scan_document():
    ocr_scanner = DataFog(operations=["extract", "scan"])
    results = await ocr_scanner.run_ocr_pipeline([
        "https://example.com/document.png"
    ])
    return results

# Extract text and find PII in images
results = asyncio.run(scan_document())
```
Advanced Features
Engine Selection
Choose the appropriate engine for your needs:
```python
from datafog.services import TextService

# Regex: Fast, pattern-based (recommended for speed)
regex_service = TextService(engine="regex")

# spaCy: Traditional NLP with broad entity recognition
spacy_service = TextService(engine="spacy")

# GLiNER: Modern ML model optimized for NER (requires nlp-advanced extra)
gliner_service = TextService(engine="gliner")

# Smart: Cascading approach - regex → GLiNER → spaCy (best accuracy/speed balance)
smart_service = TextService(engine="smart")

# Auto: Regex → spaCy fallback (legacy)
auto_service = TextService(engine="auto")
```
Performance & Accuracy Guide:
| Engine | Speed | Accuracy | Use Case | Install Requirements |
|---|---|---|---|---|
| regex | 🚀 Fastest | Good | Structured PII (emails, phones) | Core only |
| gliner | ⚡ Fast | Better | Modern NER, custom entities | pip install datafog[nlp-advanced] |
| spacy | 🐌 Slower | Good | Traditional NLP entities | pip install datafog[nlp] |
| smart | ⚡ Balanced | Best | Combines all approaches | pip install datafog[nlp-advanced] |
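In services where the ML extras may or may not be installed, the engine choice can be made configurable with a regex fallback. The helper and the DATAFOG_ENGINE variable below are illustrative conventions, not part of the DataFog API, and the sketch assumes TextService raises when a selected engine's dependencies are missing.

```python
import os
from datafog.services import TextService

def build_text_service() -> TextService:
    """Pick the engine from configuration, falling back to regex."""
    preferred = os.getenv("DATAFOG_ENGINE", "smart")
    try:
        return TextService(engine=preferred)
    except Exception:
        # GLiNER/spaCy models may be unavailable; regex only needs the core install.
        return TextService(engine="regex")

service = build_text_service()
print(service.annotate_text_sync("Contact john@company.com"))
```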
Model Management:
```bash
# Download specific GLiNER models

# PII-specialized model (recommended)
datafog download-model urchade/gliner_multi_pii-v1 --engine gliner

# General-purpose model
datafog download-model urchade/gliner_base --engine gliner

# List available models
datafog list-models --engine gliner
```
Anonymization Options
```python
from datafog import DataFog
from datafog.models.anonymizer import AnonymizerType, HashType

# Hash with different algorithms
hasher = DataFog(
    operations=["scan", "hash"],
    hash_type=HashType.SHA256  # or MD5, SHA3_256
)

# Target specific entity types only
selective = DataFog(
    operations=["scan", "redact"],
    entities=["EMAIL", "PHONE"]  # Only process these types
)
```
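A quick usage sketch for the hashing configuration above; the exact output format is version-dependent, but detected values are replaced by their hashes rather than removed.

```python
from datafog import DataFog
from datafog.models.anonymizer import HashType

hasher = DataFog(operations=["scan", "hash"], hash_type=HashType.SHA256)

# Hashing keeps fields pseudonymous rather than removing them outright.
print(hasher.process_text("Email: john@company.com"))
```

If the hashing is unsalted, equal inputs produce equal hashes, so pseudonymized fields can still be joined or de-duplicated across records.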
Batch Processing
```python
from datafog import DataFog

documents = [
    "Document 1 with PII...",
    "Document 2 with more data...",
    "Document 3...",
]

# Process multiple documents efficiently
results = DataFog().batch_process(documents)
```
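For corpora too large to hold in memory, input can be streamed in fixed-size chunks. This sketch assumes batch_process returns one result per input document, in order; app.log is a placeholder path.

```python
from datafog import DataFog

def process_in_chunks(lines, chunk_size=1000):
    """Yield per-document results chunk by chunk instead of loading everything."""
    detector = DataFog()
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield from detector.batch_process(chunk)
            chunk = []
    if chunk:
        yield from detector.batch_process(chunk)

# Example: stream a large log file line by line.
with open("app.log", encoding="utf-8") as fh:
    for result in process_in_chunks(fh):
        ...  # persist or inspect each document's findings
```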
Performance Benchmarks
Performance comparison with alternatives:
Speed Comparison (10KB text)
```
DataFog Pattern:   4ms  ████████████████████████████████  123x faster
spaCy:           480ms  ██                                 baseline
```
Engine Selection Guide
| Scenario | Recommended Engine | Why |
|---|---|---|
| High-volume processing | regex | Maximum speed, consistent performance |
| Unknown entity types | spacy | Broader entity recognition |
| General purpose | auto | Smart fallback, best of both worlds |
| Real-time applications | regex | Sub-millisecond processing |
CLI Usage
DataFog includes a command-line interface:
```bash
# Scan text for PII
datafog scan-text "John's email is john@example.com"

# Process images
datafog scan-image document.png --operations extract,scan

# Anonymize data
datafog redact-text "My phone is (555) 123-4567"
datafog replace-text "SSN: 123-45-6789"
datafog hash-text "Email: john@company.com" --hash-type sha256

# Utility commands
datafog health
datafog list-entities
datafog show-config
```
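The same commands can be driven from scripts. A minimal sketch that shells out to the CLI shown above, assuming the redacted text is written to stdout:

```python
import subprocess

def redact_via_cli(text: str) -> str:
    """Run `datafog redact-text` and return its stdout."""
    completed = subprocess.run(
        ["datafog", "redact-text", text],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()

print(redact_via_cli("My phone is (555) 123-4567"))
```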
Features
Security & Compliance
- Detection of regulated data types for GDPR/CCPA compliance
- Audit trails for tracking detection and anonymization
- Configurable detection thresholds
Scalability
- Batch processing for handling multiple documents
- Memory-efficient processing for large files
- Async support for non-blocking operations
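If an application is already async, the synchronous scanner can be kept off the event loop with a thread offload (Python 3.9+). This is a sketch using only scan_text from the core API:

```python
import asyncio
from datafog import DataFog

detector = DataFog()

async def scan_async(text: str):
    # Run the synchronous scanner in a worker thread so the event loop
    # stays responsive while documents are processed.
    return await asyncio.to_thread(detector.scan_text, text)

async def main():
    results = await asyncio.gather(
        scan_async("Email a@example.com"),
        scan_async("Call (555) 123-4567"),
    )
    print(results)

asyncio.run(main())
```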
Integration Example
```python
# FastAPI middleware example
from fastapi import FastAPI
from datafog import DataFog

app = FastAPI()
detector = DataFog()

@app.middleware("http")
async def redact_pii_middleware(request, call_next):
    # Inspect/redact request data here before it reaches your handlers.
    response = await call_next(request)
    return response
```
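A more concrete pattern is to redact at the endpoint level, where the request body is easy to read and rewrite. The /redact route and its payload model are illustrative, not part of DataFog:

```python
from fastapi import FastAPI
from pydantic import BaseModel

from datafog import DataFog

app = FastAPI()
redactor = DataFog(operations=["scan", "redact"])

class TextIn(BaseModel):
    text: str

@app.post("/redact")
async def redact(payload: TextIn):
    # Return the anonymized text instead of the original.
    return {"redacted": redactor.process_text(payload.text)}
```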
Common Use Cases
Enterprise
- Log sanitization
- Data migration with PII handling
- Compliance reporting and audits
Data Science
- Dataset preparation and anonymization
- Privacy-preserving analytics
- Research compliance
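For dataset preparation, a common pattern is to anonymize a free-text column before sharing it. A sketch assuming pandas is available and that process_text returns the redacted string:

```python
import pandas as pd
from datafog import DataFog

redactor = DataFog(operations=["scan", "redact"])

df = pd.DataFrame({"notes": [
    "Customer john@example.com called from (555) 123-4567",
    "Follow-up: SSN 123-45-6789 on file",
]})

# Replace the free-text column with a redacted copy before export.
df["notes"] = df["notes"].apply(redactor.process_text)
print(df)
```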
Development
- Test data generation
- Code review for PII detection
- API security validation
Installation & Setup
Basic Installation
```bash
pip install datafog
```
Development Setup
```bash
git clone https://github.com/datafog/datafog-python
cd datafog-python
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements-dev.txt
just setup
```
Docker Usage
```dockerfile
FROM python:3.10-slim
WORKDIR /app
RUN pip install datafog
COPY . .
CMD ["python", "your_script.py"]
```
Contributing
Contributions are welcome in the form of:
- Bug reports
- Feature requests
- Documentation improvements
- New regex patterns for PII detection
- Performance improvements
Quick Contribution Guide
```bash
# Setup development environment
git clone https://github.com/datafog/datafog-python
cd datafog-python
just setup

# Run tests
just test

# Format code
just format

# Submit PR
git checkout -b feature/your-improvement
# Make your changes
git commit -m "Add your improvement"
git push origin feature/your-improvement
```
See CONTRIBUTING.md for detailed guidelines.
Benchmarking & Performance
Run Benchmarks Locally
```bash
# Install benchmark dependencies
pip install pytest-benchmark

# Run performance tests
pytest tests/benchmark_text_service.py -v

# Compare with baseline
bash scripts/run_benchmark_locally.sh
```
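A benchmark for your own corpus can be added in the same style. This sketch uses the standard pytest-benchmark fixture; the file path and sample text are illustrative:

```python
# tests/benchmark_my_corpus.py
from datafog import DataFog

SAMPLE = "Contact John Doe at john.doe@company.com or (555) 123-4567. " * 100

def test_scan_text_speed(benchmark):
    detector = DataFog()
    # pytest-benchmark runs the callable repeatedly and records timing stats.
    result = benchmark(detector.scan_text, SAMPLE)
    assert result is not None
```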
Continuous Performance Monitoring
Our CI pipeline:
- Runs benchmarks on every PR
- Compares against baseline performance
- Fails builds if performance degrades >10%
- Tracks performance trends over time
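The >10% gate can be approximated locally by comparing two pytest-benchmark JSON exports (saved with --benchmark-json). compare_benchmarks.py is a hypothetical helper, not part of the repository:

```python
# compare_benchmarks.py baseline.json current.json  (hypothetical helper)
import json
import sys

def mean_times(path):
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return {b["name"]: b["stats"]["mean"] for b in data["benchmarks"]}

baseline = mean_times(sys.argv[1])
current = mean_times(sys.argv[2])

failed = False
for name, base_mean in baseline.items():
    if name in current and current[name] > base_mean * 1.10:
        print(f"REGRESSION: {name} is >10% slower than baseline")
        failed = True

sys.exit(1 if failed else 0)
```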
Documentation & Support
| Resource | Link |
|---|---|
| Documentation | docs.datafog.ai |
| Community Discord | Join here |
| Bug Reports | GitHub Issues |
| Feature Requests | GitHub Discussions |
| Support | hi@datafog.ai |
License & Acknowledgments
DataFog is released under the MIT License.
Built with:
- Pattern optimization for efficient processing
- spaCy integration for NLP capabilities
- Tesseract & Donut for OCR capabilities
- Pydantic for data validation
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file datafog-4.2.0.tar.gz.
File metadata
- Download URL: datafog-4.2.0.tar.gz
- Upload date:
- Size: 61.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 967b3e076d1762322ca3187cc97513ead4d2f080ed6c4d2725255221b02fb7d8 |
| MD5 | a22c04fe9f4c64545edadccbd8b8238f |
| BLAKE2b-256 | 762fe86d6ac3a11f943d7b0cd763826cfc483a632de23c53257400b7378d9737 |
File details
Details for the file datafog-4.2.0-py3-none-any.whl.
File metadata
- Download URL: datafog-4.2.0-py3-none-any.whl
- Upload date:
- Size: 54.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 48534790ab4b0ecd551a7c312b62163a0b97cd1a9dc1e3e2d69e7df183fff99b |
| MD5 | a1188ba055b560730b4b93eacfddfb4a |
| BLAKE2b-256 | edca4319ba57d5693d66a71c73ec5b426b8d770417bb0c26f65b61a066609f25 |