Lightning-fast PII detection and anonymization library, up to 190x faster than spaCy in the project's benchmarks

DataFog: PII Detection & Anonymization

Fast processing • Production-ready • Simple configuration


Overview

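DataFog detects PII with a pattern-first approach: regular expressions handle structured identifiers such as emails, phone numbers, and SSNs, and optional NLP engines (spaCy, GLiNER) cover free-form entities such as names and organizations. In the benchmark below, the regex engine processes a document in about 2.4ms, compared with about 459ms for spaCy.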

# Basic usage example
from datafog import DataFog
results = DataFog().scan_text("John's email is john@example.com and SSN is 123-45-6789")

Performance Comparison

| Engine | 10KB Text Processing | Relative Speed | Accuracy |
|---|---|---|---|
| DataFog (Regex) | ~2.4ms | 190x faster | High (structured) |
| DataFog (GLiNER) | ~15ms | 32x faster | Very High |
| DataFog (Smart) | ~3-15ms | 60x faster | Highest |
| spaCy | ~459ms | baseline | Good |

Performance was measured on a 13.3KB business document. GLiNER provides excellent accuracy for named entities while keeping a speed advantage over spaCy.
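
The relative-speed column is the ratio of the spaCy baseline time to each engine's time. A minimal sketch of that arithmetic, using the approximate timings from the table above. The function name `relative_speed` is illustrative, and because the published timings are rounded, the computed ratios only approximate the table's figures.

```python
# Approximate timings copied from the comparison table (milliseconds).
SPACY_BASELINE_MS = 459.0
ENGINE_TIMES_MS = {
    "regex": 2.4,
    "gliner": 15.0,
}


def relative_speed(engine_ms: float, baseline_ms: float = SPACY_BASELINE_MS) -> float:
    """Return how many times faster an engine is than the baseline."""
    return baseline_ms / engine_ms


for name, ms in ENGINE_TIMES_MS.items():
    print(f"{name}: {relative_speed(ms):.1f}x faster than spaCy")
```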

Supported PII Types

| Type | Examples | Use Cases |
|---|---|---|
| Email | john@company.com | Contact scrubbing |
| Phone | (555) 123-4567 | Call log anonymization |
| SSN | 123-45-6789 | HR data protection |
| Credit Cards | 4111-1111-1111-1111 | Payment processing |
| IP Addresses | 192.168.1.1 | Network log cleaning |
| Dates | 01/01/1990 | Birthdate removal |
| ZIP Codes | 12345-6789 | Location anonymization |
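
To make the pattern-first idea concrete, here is a minimal standard-library sketch of regex-based detection for a few of the types listed above. The patterns and the `scan` helper are assumptions written for illustration; they are not DataFog's internal regexes or API, and real-world patterns usually need to be stricter.

```python
import re

# Illustrative patterns only; assumptions, not DataFog's internal regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scan(text: str) -> dict[str, list[str]]:
    """Return every match per label, omitting labels with no matches."""
    found = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = matches
    return found


print(scan("John's email is john@example.com and SSN is 123-45-6789"))
```

Because each pattern is compiled once and applied in a single pass, the cost grows roughly with text length, which is why this style of detection tends to be much cheaper than running an NLP model.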

Quick Start

Installation

# Lightweight core (fast regex-based PII detection)
pip install datafog

# With advanced ML models for better accuracy
pip install datafog[nlp]                # spaCy for advanced NLP
pip install datafog[nlp-advanced]       # GLiNER for modern NER
pip install datafog[ocr]                # Image processing with OCR
pip install datafog[all]                # Everything included

Basic Usage

Detect PII in text:

from datafog import DataFog

# Simple detection (uses fast regex engine)
detector = DataFog()
text = "Contact John Doe at john.doe@company.com or (555) 123-4567"
results = detector.scan_text(text)
print(results)
# Finds: emails, phone numbers, and more

# Modern NER with GLiNER (requires: pip install datafog[nlp-advanced])
from datafog.services import TextService
gliner_service = TextService(engine="gliner")
result = gliner_service.annotate_text_sync("Dr. John Smith works at General Hospital")
# Detects: PERSON, ORGANIZATION with high accuracy

# Best of both worlds: Smart cascading (recommended for production)
smart_service = TextService(engine="smart")
result = smart_service.annotate_text_sync("Contact john@company.com or call (555) 123-4567")
# Uses regex for structured PII (fast), GLiNER for entities (accurate)

Anonymize on the fly:

# Redact sensitive data
redacted = DataFog(operations=["scan", "redact"]).process_text(
    "My SSN is 123-45-6789 and email is john@example.com"
)
print(redacted)
# Output: "My SSN is [REDACTED] and email is [REDACTED]"

# Replace with pseudonymized placeholder tokens
replaced = DataFog(operations=["scan", "replace"]).process_text(
    "Call me at (555) 123-4567"
)
print(replaced)
# Output: "Call me at [PHONE_A1B2C3]"
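
The difference between the two operations can be sketched with the standard library alone. This is an illustration under stated assumptions: the patterns, the `redact` and `replace` helpers, and the way the token suffix is derived (first six hex characters of a SHA-256 digest) are all hypothetical and may differ from what DataFog does internally.

```python
import hashlib
import re

# Illustrative patterns; assumptions, not DataFog's own.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
}


def redact(text: str) -> str:
    """Replace every match with a fixed [REDACTED] marker."""
    for pattern in PATTERNS.values():
        text = pattern.sub("[REDACTED]", text)
    return text


def replace(text: str) -> str:
    """Replace every match with its label plus a short hash-derived suffix."""
    for label, pattern in PATTERNS.items():
        def to_token(match: re.Match, label: str = label) -> str:
            digest = hashlib.sha256(match.group().encode("utf-8")).hexdigest()
            return f"[{label}_{digest[:6].upper()}]"
        text = pattern.sub(to_token, text)
    return text


print(redact("My SSN is 123-45-6789 and email is john@example.com"))
print(replace("Call me at (555) 123-4567"))
```

Redaction discards the value entirely, while a hash-derived token keeps the entity type and maps equal inputs to equal tokens, so records can still be correlated after anonymization.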

Process images with OCR:

import asyncio
from datafog import DataFog

async def scan_document():
    ocr_scanner = DataFog(operations=["extract", "scan"])
    results = await ocr_scanner.run_ocr_pipeline([
        "https://example.com/document.png"
    ])
    return results

# Extract text and find PII in images
results = asyncio.run(scan_document())

Advanced Features

Engine Selection

Choose the appropriate engine for your needs:

from datafog.services import TextService

# Regex: Fast, pattern-based (recommended for speed)
regex_service = TextService(engine="regex")

# spaCy: Traditional NLP with broad entity recognition
spacy_service = TextService(engine="spacy")

# GLiNER: Modern ML model optimized for NER (requires nlp-advanced extra)
gliner_service = TextService(engine="gliner")

# Smart: Cascading approach - regex → GLiNER → spaCy (best accuracy/speed balance)
smart_service = TextService(engine="smart")

# Auto: Regex → spaCy fallback (legacy)
auto_service = TextService(engine="auto")
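
The cascading idea behind the smart and auto engines can be sketched as a list of stages tried in order, cheapest first. This is a simplified illustration, not DataFog's implementation: the stage functions, the `cascade` helper, and the stopping rule (stop at the first stage that finds at least `min_entities` entities) are assumptions, and the "NER" stage is a regex stub standing in for an ML model.

```python
import re
from typing import Callable

Findings = dict[str, list[str]]
Engine = Callable[[str], Findings]

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TITLED_NAME = re.compile(r"\bDr\. [A-Z][a-z]+ [A-Z][a-z]+")


def fast_regex_engine(text: str) -> Findings:
    """Cheap first stage: structured PII only."""
    emails = EMAIL.findall(text)
    return {"EMAIL": emails} if emails else {}


def slow_ner_stub(text: str) -> Findings:
    """Hypothetical stand-in for an ML model such as GLiNER or spaCy."""
    names = TITLED_NAME.findall(text)
    return {"PERSON": names} if names else {}


def cascade(text: str, stages: list[tuple[str, Engine]], min_entities: int = 1):
    """Run stages in order; stop at the first one that finds enough entities."""
    for name, engine in stages:
        findings = engine(text)
        if sum(len(values) for values in findings.values()) >= min_entities:
            return name, findings
    return None, {}


STAGES = [("regex", fast_regex_engine), ("ner", slow_ner_stub)]
print(cascade("Contact john@company.com", STAGES))
print(cascade("Dr. John Smith works at General Hospital", STAGES))
```

With this ordering the slower stage only runs when the fast one comes back empty, which is where the speed advantage of a cascade comes from.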

Performance & Accuracy Guide:

| Engine | Speed | Accuracy | Use Case | Install Requirements |
|---|---|---|---|---|
| regex | 🚀 Fastest | Good | Structured PII (emails, phones) | Core only |
| gliner | ⚡ Fast | Better | Modern NER, custom entities | pip install datafog[nlp-advanced] |
| spacy | 🐌 Slower | Good | Traditional NLP entities | pip install datafog[nlp] |
| smart | ⚡ Balanced | Best | Combines all approaches | pip install datafog[nlp-advanced] |

Model Management:

# Download specific GLiNER models
import subprocess

# PII-specialized model (recommended)
subprocess.run(["datafog", "download-model", "urchade/gliner_multi_pii-v1", "--engine", "gliner"])

# General-purpose model
subprocess.run(["datafog", "download-model", "urchade/gliner_base", "--engine", "gliner"])

# List available models
subprocess.run(["datafog", "list-models", "--engine", "gliner"])

Anonymization Options

from datafog import DataFog
from datafog.models.anonymizer import AnonymizerType, HashType

# Hash with different algorithms
hasher = DataFog(
    operations=["scan", "hash"],
    hash_type=HashType.SHA256  # or MD5, SHA3_256
)

# Target specific entity types only
selective = DataFog(
    operations=["scan", "redact"],
    entities=["EMAIL", "PHONE"]  # Only process these types
)
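
For intuition about what the hash operation produces, here is a standard-library sketch that hashes matched values with the three algorithms named above. The `hash_value` and `hash_emails` helpers and the email pattern are illustrative assumptions; DataFog's own output format may differ.

```python
import hashlib
import re

# Illustrative pattern; an assumption, not DataFog's own.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_value(value: str, algorithm: str = "sha256") -> str:
    """Hex digest of value using any algorithm name that hashlib accepts."""
    return hashlib.new(algorithm, value.encode("utf-8")).hexdigest()


def hash_emails(text: str, algorithm: str = "sha256") -> str:
    """Replace each email address in text with its hex digest."""
    return EMAIL.sub(lambda match: hash_value(match.group(), algorithm), text)


for algorithm in ("sha256", "md5", "sha3_256"):
    print(algorithm, hash_emails("Email: john@company.com", algorithm))
```

Hashing is deterministic, so the same input always maps to the same digest; that keeps hashed columns joinable across datasets.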

Batch Processing

documents = [
    "Document 1 with PII...",
    "Document 2 with more data...",
    "Document 3..."
]

# Process multiple documents efficiently
results = DataFog().batch_process(documents)
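
One way to think about batch processing is as a map of a per-document scanner over a list, with results returned in input order. The sketch below uses a thread pool from the standard library; `scan_one` and `scan_batch` are hypothetical helpers, not DataFog functions, and how `batch_process` is actually implemented is not stated here.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Illustrative pattern; an assumption, not DataFog's own.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scan_one(document: str) -> list[str]:
    """Hypothetical per-document scanner; returns SSN-like matches."""
    return SSN.findall(document)


def scan_batch(documents: list[str], max_workers: int = 4) -> list[list[str]]:
    """Scan documents concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scan_one, documents))


documents = [
    "Employee SSN 123-45-6789 on file",
    "No identifiers in this one",
    "Two here: 111-22-3333 and 444-55-6666",
]
print(scan_batch(documents))
```

Threads mainly help when the per-document work waits on I/O (for example, fetching files). For purely CPU-bound regex scanning in CPython, a plain loop or a process pool may perform as well or better.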

Performance Benchmarks

Performance comparison with alternatives:

Speed Comparison (10KB text)

DataFog Pattern:  4ms   ████████████████████████████████ 123x faster
spaCy:         480ms   ██ baseline

Engine Selection Guide

| Scenario | Recommended Engine | Why |
|---|---|---|
| High-volume processing | regex | Maximum speed, consistent performance |
| Unknown entity types | spacy | Broader entity recognition |
| General purpose | auto | Regex first, with spaCy fallback |
| Real-time applications | regex | Lowest latency of the available engines |

CLI Usage

DataFog includes a command-line interface:

# Scan text for PII
datafog scan-text "John's email is john@example.com"

# Process images
datafog scan-image document.png --operations extract,scan

# Anonymize data
datafog redact-text "My phone is (555) 123-4567"
datafog replace-text "SSN: 123-45-6789"
datafog hash-text "Email: john@company.com" --hash-type sha256

# Utility commands
datafog health
datafog list-entities
datafog show-config

Features

Security & Compliance

  • Detection of regulated data types for GDPR/CCPA compliance
  • Audit trails for tracking detection and anonymization
  • Configurable detection thresholds

Scalability

  • Batch processing for handling multiple documents
  • Memory-efficient processing for large files
  • Async support for non-blocking operations

Integration Example

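# FastAPI middleware example
from fastapi import FastAPI, Request
from datafog import DataFog

app = FastAPI()
detector = DataFog()

@app.middleware("http")
async def redact_pii_middleware(request: Request, call_next):
    # Hook point: scan or redact request data here before it reaches
    # the route handlers (for example, with detector.scan_text).
    response = await call_next(request)
    # A middleware must return the response; returning None fails the request.
    return response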
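
The scanning step inside such a middleware could be factored into a plain function that walks a decoded JSON body and redacts every string it finds. A minimal sketch, assuming illustrative regex patterns and a hypothetical helper name `redact_payload` (neither is part of DataFog's API):

```python
import re
from typing import Any

# Illustrative patterns; assumptions, not DataFog's own.
PATTERNS = [
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),  # email
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-like
]


def redact_text(text: str) -> str:
    """Replace every pattern match in a string with [REDACTED]."""
    for pattern in PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def redact_payload(payload: Any) -> Any:
    """Recursively redact strings inside dicts and lists; leave other types alone."""
    if isinstance(payload, str):
        return redact_text(payload)
    if isinstance(payload, dict):
        return {key: redact_payload(value) for key, value in payload.items()}
    if isinstance(payload, list):
        return [redact_payload(item) for item in payload]
    return payload


body = {"user": {"email": "john@example.com", "age": 30}, "notes": ["SSN 123-45-6789"]}
print(redact_payload(body))
```

Keeping this logic in a pure function makes it testable without starting a web server, and the middleware only has to decode the body, call the function, and pass the result on.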

Common Use Cases

Enterprise

  • Log sanitization
  • Data migration with PII handling
  • Compliance reporting and audits

Data Science

  • Dataset preparation and anonymization
  • Privacy-preserving analytics
  • Research compliance

Development

  • Test data generation
  • Code review for PII detection
  • API security validation

Installation & Setup

Basic Installation

pip install datafog

Development Setup

git clone https://github.com/datafog/datafog-python
cd datafog-python
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements-dev.txt
just setup

Docker Usage

FROM python:3.10-slim
RUN pip install datafog
COPY . .
CMD ["python", "your_script.py"]

Contributing

Contributions are welcome in the form of:

  • Bug reports
  • Feature requests
  • Documentation improvements
  • New patterns for PII detection
  • Performance improvements

Quick Contribution Guide

# Setup development environment
git clone https://github.com/datafog/datafog-python
cd datafog-python
just setup

# Run tests
just test

# Format code
just format

# Submit PR
git checkout -b feature/your-improvement
# Make your changes
git commit -m "Add your improvement"
git push origin feature/your-improvement

See CONTRIBUTING.md for detailed guidelines.


Benchmarking & Performance

Run Benchmarks Locally

# Install benchmark dependencies
pip install pytest-benchmark

# Run performance tests
pytest tests/benchmark_text_service.py -v

# Compare with baseline
bash scripts/run_benchmark_locally.sh
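
For a quick sanity check outside the test suite, a timing harness along these lines can give a rough per-call figure. It is a sketch under assumptions: the `median_ms` helper, the email pattern, and the synthetic ~10KB document are illustrative and are not the project's benchmark code or corpus.

```python
import re
import statistics
import time

# Illustrative pattern; an assumption, not DataFog's own.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def median_ms(func, text: str, repeats: int = 20) -> float:
    """Median wall-clock time of func(text) in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(text)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


# Synthetic document of roughly 10KB built by repeating one line.
line = "Contact john.doe@company.com about the quarterly report.\n"
sample_text = line * (10_000 // len(line))

print(f"regex scan: {median_ms(EMAIL.findall, sample_text):.3f} ms")
```

The median is used rather than the mean so that a single slow run (for example, one interrupted by the scheduler) does not skew the result.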

Continuous Performance Monitoring

Our CI pipeline:

  • Runs benchmarks on every PR
  • Compares against baseline performance
  • Fails builds if performance degrades >10%
  • Tracks performance trends over time
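
The 10% rule above amounts to a simple ratio check. A minimal sketch of how such a gate could be expressed, assuming hypothetical helper names and timings in milliseconds (the project's actual CI script is not shown here):

```python
def regression_ratio(baseline_ms: float, current_ms: float) -> float:
    """Fractional slowdown relative to the baseline (positive means slower)."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return (current_ms - baseline_ms) / baseline_ms


def exceeds_threshold(baseline_ms: float, current_ms: float, threshold: float = 0.10) -> bool:
    """True when the slowdown is larger than the allowed threshold."""
    return regression_ratio(baseline_ms, current_ms) > threshold


# Example: a 2.4 ms baseline that slows to 2.7 ms is a 12.5% regression.
print(exceeds_threshold(2.4, 2.7))
print(exceeds_threshold(2.4, 2.5))
```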

Documentation & Support

| Resource | Link |
|---|---|
| Documentation | docs.datafog.ai |
| Community Discord | Join here |
| Bug Reports | GitHub Issues |
| Feature Requests | GitHub Discussions |
| Support | hi@datafog.ai |

License & Acknowledgments

DataFog is released under the MIT License.

Built with:

  • Pattern optimization for efficient processing
  • spaCy integration for NLP capabilities
  • Tesseract & Donut for OCR capabilities
  • Pydantic for data validation
