
Lightning-fast PII detection and anonymization library with 190x performance advantage


DataFog Python

DataFog is a Python library for detecting and redacting personally identifiable information (PII).

It provides:

  • Fast structured PII detection via regex
  • Optional NER support via spaCy and GLiNER
  • A simple agent-oriented API for LLM applications
  • Backward-compatible DataFog and TextService classes

4.5 Focus

DataFog 4.5 is focused on lightweight text PII screening: a small core install, fast regex-based scan/redact helpers, explicit optional extras, and a clearer path toward future middleware use cases. Dedicated Sentry, OpenTelemetry, logging-framework, and cloud DLP adapters are future-facing work and are not part of the 4.5 release.

Installation

# Core install (regex engine)
pip install datafog

# Add spaCy support (quotes keep shells such as zsh from expanding the brackets)
pip install "datafog[nlp]"

# Add GLiNER + spaCy support
pip install "datafog[nlp-advanced]"

# Add local OCR support
pip install "datafog[ocr]"

# Add Spark/distributed support
pip install "datafog[distributed]"

# Everything
pip install "datafog[all]"

Python 3.13 support is certified for the core SDK, the CLI, and the nlp, nlp-advanced, and ocr install profiles. Donut OCR still requires a model that is already available locally before it is used at runtime. The distributed and all profiles are not newly certified on Python 3.13 in the 4.5 line.

Quick Start

import datafog

text = "Contact john@example.com or call (555) 123-4567"
clean = datafog.sanitize(text, engine="regex")
print(clean)
# Contact [EMAIL_1] or call [PHONE_1]
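
To make the placeholder format above concrete, here is a minimal standard-library sketch of regex redaction with numbered placeholders. It is not DataFog's implementation: the two patterns are simplified stand-ins, and the rule that a repeated value reuses its first index is an assumption made for illustration.

```python
import re

# Simplified, illustrative patterns (assumptions, not DataFog's real ones).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
}


def redact(text: str) -> str:
    """Replace each match with a [TYPE_n] placeholder, numbered per type."""
    for label, pattern in PATTERNS.items():
        seen = {}  # distinct matched value -> placeholder index

        def placeholder(match, label=label, seen=seen):
            # Assumption: a value seen before keeps its original index.
            index = seen.setdefault(match.group(0), len(seen) + 1)
            return f"[{label}_{index}]"

        text = pattern.sub(placeholder, text)
    return text


print(redact("Contact john@example.com or call (555) 123-4567"))
# Contact [EMAIL_1] or call [PHONE_1]
```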

For LLM Applications

import datafog

# 1) Scan prompt text before sending to an LLM
prompt = "My SSN is 123-45-6789"
scan_result = datafog.scan_prompt(prompt, engine="regex")
if scan_result.entities:
    print(f"Detected {len(scan_result.entities)} PII entities")

# 2) Redact model output before returning it
output = "Email me at jane.doe@example.com"
safe_result = datafog.filter_output(output, engine="regex")
print(safe_result.redacted_text)
# Email me at [EMAIL_1]

# 3) One-liner redaction
print(datafog.sanitize("Card: 4111-1111-1111-1111", engine="regex"))
# Card: [CREDIT_CARD_1]
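
The two steps above (scan the prompt, then redact the output) can be combined into one wrapper around a model call. The sketch below shows one possible shape. The function safe_completion and the fake_* stand-ins are hypothetical names, and the choice to reject a prompt containing PII is just one policy. In real use, scan and redact could wrap datafog.scan_prompt and datafog.filter_output.

```python
from typing import Callable, List


def safe_completion(
    prompt: str,
    llm: Callable[[str], str],
    scan: Callable[[str], List[str]],
    redact: Callable[[str], str],
) -> str:
    """Refuse prompts that contain PII, then redact whatever the model returns."""
    found = scan(prompt)
    if found:
        raise ValueError(f"Prompt contains {len(found)} PII entities; not sending")
    return redact(llm(prompt))


# Hypothetical stand-ins so the sketch runs without any dependencies.
def fake_scan(text: str) -> List[str]:
    return ["SSN"] if "123-45-6789" in text else []


def fake_redact(text: str) -> str:
    return text.replace("jane.doe@example.com", "[EMAIL_1]")


def fake_llm(prompt: str) -> str:
    return "Email me at jane.doe@example.com"


print(safe_completion("How do I reach support?", fake_llm, fake_scan, fake_redact))
# Email me at [EMAIL_1]
```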

German Structured PII

German structured PII is country-specific and opt-in. Use explicit locale selection or entity-type filtering when you want German VAT IDs, German IBANs, tax IDs, postal codes, passports, or residence permits.

import datafog

text = "Steuer-ID 12345678901 liegt vor."

print(datafog.scan(text, engine="regex").entities)
# []

print(datafog.scan(text, engine="regex", locales=["de"]).entities)
# [Entity(type='DE_TAX_ID', text='12345678901', ...)]
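
The opt-in behavior can be pictured as a default pattern set plus per-locale pattern sets that are merged in only on request. The following is a sketch of that idea under stated assumptions, not DataFog's code: the registry names are hypothetical, and the 11-digit pattern is deliberately loose (a real tax ID check would also validate the check digit).

```python
import re
from typing import Iterable, List, Tuple

# Patterns that are always active (illustrative only).
DEFAULT_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}

# Hypothetical locale registry; the pattern matches any standalone 11-digit number.
LOCALE_PATTERNS = {
    "de": {"DE_TAX_ID": re.compile(r"\b\d{11}\b")},
}


def scan(text: str, locales: Iterable[str] = ()) -> List[Tuple[str, str]]:
    """Return (entity_type, matched_text) pairs for the active patterns."""
    patterns = dict(DEFAULT_PATTERNS)
    for locale in locales:
        # Locale-specific patterns are added only when explicitly requested.
        patterns.update(LOCALE_PATTERNS.get(locale, {}))
    return [
        (label, match.group(0))
        for label, pattern in patterns.items()
        for match in pattern.finditer(text)
    ]


text = "Steuer-ID 12345678901 liegt vor."
print(scan(text))                  # []
print(scan(text, locales=["de"]))  # [('DE_TAX_ID', '12345678901')]
```

Keeping country-specific patterns out of the default set avoids false positives, since a bare 11-digit number is not a tax ID in most contexts.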

Guardrails

import datafog

# Reusable guardrail object
guard = datafog.create_guardrail(engine="regex", on_detect="redact")

@guard
def call_llm() -> str:
    return "Send to admin@example.com"

print(call_llm())
# Send to [EMAIL_1]
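
For readers unfamiliar with the decorator pattern used here, the sketch below shows how a redacting decorator could be built with the standard library alone. It only illustrates the pattern and makes no claim about how create_guardrail works internally. The name redact_output and the single email pattern are assumptions.

```python
import functools
import re

# Simplified email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_output(func):
    """Decorator that redacts email addresses in a function's string result."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        # Non-string results pass through untouched.
        if isinstance(result, str):
            result = EMAIL.sub("[EMAIL]", result)
        return result

    return wrapper


@redact_output
def call_llm() -> str:
    return "Send to admin@example.com"


print(call_llm())
# Send to [EMAIL]
```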

Engines

Use the engine that matches your accuracy and dependency constraints:

  • regex:
    • Fastest and always available.
    • Best for default structured entities: EMAIL, PHONE, SSN, CREDIT_CARD, IP_ADDRESS, DATE, ZIP_CODE.
    • Use locales=["de"] for German structured IDs such as DE_VAT_ID, DE_IBAN, DE_TAX_ID, DE_POSTAL_CODE, and passport or residence permit numbers.
  • spacy:
    • Requires pip install datafog[nlp].
    • Useful for unstructured entities like person and organization names.
  • gliner:
    • Requires pip install datafog[nlp-advanced].
    • Stronger NER coverage than regex for unstructured text.
  • smart:
    • Cascades regex with optional NER engines.
    • If optional deps are missing, it degrades gracefully and warns.
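
The "degrades gracefully and warns" behavior can be sketched as a dependency probe that drops engines whose modules are not importable. This is an illustration, not the smart engine's actual logic: the mapping and the function name available_engines are hypothetical, and the module names assume each optional engine is importable under its package name.

```python
import importlib.util
import warnings
from typing import List

# Engine name -> module it needs (None means no optional dependency).
ENGINE_MODULES = {"regex": None, "spacy": "spacy", "gliner": "gliner"}


def available_engines(requested: List[str]) -> List[str]:
    """Keep engines whose dependency is importable; warn about the rest."""
    usable = []
    for name in requested:
        module = ENGINE_MODULES[name]
        if module is None or importlib.util.find_spec(module) is not None:
            usable.append(name)
        else:
            warnings.warn(f"Engine '{name}' skipped: module '{module}' is not installed")
    return usable


# The result depends on which optional packages are installed locally.
print(available_engines(["regex", "spacy", "gliner"]))
```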

Optional OCR And Spark Surfaces

DataFog 4.5 keeps the main package story centered on lightweight text PII screening. OCR and Spark remain supported optional surfaces for users who already rely on them, but they are not required for the core import, default scan/redact helpers, or guardrail helpers.

  • OCR:
    • Install datafog[ocr] for local image OCR helpers.
    • URL-based image downloading also needs datafog[web,ocr].
    • Tesseract usage requires the system tesseract binary.
    • Python 3.13 is validated for the OCR install profile, Pillow, pytesseract, and system Tesseract smoke checks.
    • Donut OCR requires datafog[nlp-advanced,ocr] and a model already available locally.
  • Spark:
    • Install datafog[distributed] for SparkService.
    • Spark PII UDF helpers also require datafog[nlp] and an installed spaCy model.
    • A Java runtime is required by PySpark.

OCR and Spark are not deprecated. Their broader API and packaging overhaul is deferred; the 4.5 goal is to keep them explicit, documented, and isolated from the lightweight core path.
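
Because Tesseract usage depends on a system binary, a quick preflight check can give a clearer error than a failure deep inside an OCR call. A minimal sketch using only the standard library follows. The helper name find_tesseract is hypothetical, and the check assumes the executable is named tesseract and sits on PATH.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the path of the tesseract executable on PATH, or None if missing."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("tesseract not found on PATH; install it before using Tesseract-based OCR")
else:
    print(f"tesseract found at {path}")
```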

Backward-Compatible APIs

The existing public API remains available.

DataFog class

from datafog import DataFog

result = DataFog().scan_text("Email john@example.com")
print(result["EMAIL"])

TextService class

from datafog.services import TextService

service = TextService(engine="regex")
result = service.annotate_text_sync("Call (555) 123-4567")
print(result["PHONE"])

CLI

# Scan text
datafog scan-text "john@example.com"

# Redact text
datafog redact-text "john@example.com"

# Replace text with pseudonyms
datafog replace-text "john@example.com"

# Hash detected entities
datafog hash-text "john@example.com"

# Enable German regex identifiers
datafog redact-text "Steuer-ID 12345678901" --locale de

Telemetry

DataFog telemetry is disabled by default.

To opt in:

export DATAFOG_TELEMETRY=1

To force telemetry off:

export DATAFOG_NO_TELEMETRY=1
# or
export DO_NOT_TRACK=1

Telemetry does not include input text or detected PII values.
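
One way to read the three variables together is sketched below. The precedence is an assumption inferred from the phrase "force telemetry off" (either opt-out variable overrides the opt-in), and only the value "1" is treated as set. The function name telemetry_enabled is hypothetical and is not part of DataFog's API.

```python
from typing import Mapping


def telemetry_enabled(env: Mapping[str, str]) -> bool:
    """Resolve the telemetry setting from environment-style variables."""
    # Assumption: either opt-out variable wins over the opt-in variable.
    if env.get("DATAFOG_NO_TELEMETRY") == "1" or env.get("DO_NOT_TRACK") == "1":
        return False
    # Disabled by default; enabled only on explicit opt-in.
    return env.get("DATAFOG_TELEMETRY") == "1"


print(telemetry_enabled({}))                                              # False
print(telemetry_enabled({"DATAFOG_TELEMETRY": "1"}))                      # True
print(telemetry_enabled({"DATAFOG_TELEMETRY": "1", "DO_NOT_TRACK": "1"}))  # False
```

Passing the mapping in, rather than reading os.environ directly, keeps the function easy to test. Call it with os.environ in real use.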

Development

git clone https://github.com/datafog/datafog-python
cd datafog-python
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e ".[all,dev]"
pip install -r requirements-dev.txt
pytest tests/


Download files


Source Distribution

datafog-4.5.0b4.tar.gz (84.6 kB view details)

Uploaded Source

Built Distribution


datafog-4.5.0b4-py3-none-any.whl (66.1 kB view details)

Uploaded Python 3

File details

Details for the file datafog-4.5.0b4.tar.gz.

File metadata

  • Download URL: datafog-4.5.0b4.tar.gz
  • Upload date:
  • Size: 84.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for datafog-4.5.0b4.tar.gz
Algorithm Hash digest
SHA256 eab3a0d89ce406cff3c56274d1a131966296333bef1af20f4a61802a237e1c1f
MD5 05122a1a5c23e0ead7bdfbe90b476734
BLAKE2b-256 1cdb9c78b66b1a091592ef95c2680b3634966fe9865fc5f20f3a89a57529e527


File details

Details for the file datafog-4.5.0b4-py3-none-any.whl.

File metadata

  • Download URL: datafog-4.5.0b4-py3-none-any.whl
  • Upload date:
  • Size: 66.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for datafog-4.5.0b4-py3-none-any.whl
Algorithm Hash digest
SHA256 f63c50590199f32848c37065e21644b45771b66a417fdc0f62d59dce3f38ccff
MD5 00f8af6798f270164c4ba1f76af838d7
BLAKE2b-256 945207ae06f7e47da7c6c2da6faee54d555a6f6f28b1d88e2f0d51b47a8fa8a3
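
A downloaded file can be checked against the SHA256 digest listed above with Python's hashlib. The sketch below assumes the wheel was saved in the current directory under its published name. The helper name sha256_of is hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 digest listed for datafog-4.5.0b4-py3-none-any.whl.
EXPECTED = "f63c50590199f32848c37065e21644b45771b66a417fdc0f62d59dce3f38ccff"

wheel = Path("datafog-4.5.0b4-py3-none-any.whl")  # assumed download location
if wheel.exists():
    print("OK" if sha256_of(wheel) == EXPECTED else "MISMATCH")
else:
    print(f"{wheel} not found; download it first")
```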

