Local-first PII detection engine using Presidio and Apple Vision OCR

Hush Engine

Local-first PII detection for images, PDFs, and spreadsheets. Uses Microsoft Presidio for text detection and Apple Vision for OCR. Runs on your machine; nothing is uploaded.

Prefer a GUI? hushbee.app ships a free macOS app built on this engine. Drop files in, get redacted versions out.

Features

Formats: Images (PNG, JPEG, HEIC), PDFs, Excel and CSV. Apple Vision OCR runs at 400 DPI.

Detection: 27 PII types out of the box: names, emails, phone numbers, SSN, credit cards, IBAN, API keys, crypto wallets, passports, medical identifiers, and more. The full table is below.

NER stack: LightGBM classifiers handle token-level PERSON, LOCATION, ORGANIZATION, DATE_TIME, and ADDRESS at ~10MB total. A 7,500-name curated database across 54 locales provides fallback coverage. Optional heavyweight models (Flair, Transformers/BERT, GLiNER, OpenAI Privacy Filter) slot into the cascade for workloads where recall matters more than throughput.

International: 116 IBAN countries via python-stdnum. 249 phone country codes via phonenumbers. 35+ national ID formats. 800+ cities for LOCATION disambiguation.

Validation: Luhn for credit cards. Verhoeff for Aadhaar. Mod-11 and Mod-97 for other IDs. Context-aware thresholds boost confidence when headers or labels disambiguate.

Extras: Face detection (OpenCV Haar). QR and barcode decoding (Apple Vision). Runtime toggle for each NER backend.
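
The checksums named under Validation are standard, public algorithms. As a rough illustration (this is not Hush's own code, and the function names are made up for the example), the Luhn check and the IBAN mod-97 check can be written with the standard library alone:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_mod97_valid(iban: str) -> bool:
    """Return True if the IBAN check digits satisfy the mod-97 rule."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    # Move country code and check digits to the end.
    rearranged = s[4:] + s[:4]
    # Letters map to 10..35; digits stay as they are.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


card_ok = luhn_valid("4111 1111 1111 1111")                 # common test number
iban_ok = iban_mod97_valid("GB82 WEST 1234 5698 7654 32")   # textbook example IBAN
```

A checksum pass only says the number is well-formed. It says nothing about whether the card or account exists, and the sketch does not check country-specific IBAN lengths.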

Install

pip install hush-engine
python -m spacy download en_core_web_lg
brew install poppler  # for PDFs

Optional extras:

pip install hush-engine[accurate]         # Flair + Transformers + GLiNER (~2GB)
pip install hush-engine[medical]          # Disease + drug NER
pip install hush-engine[address]          # libpostal bindings (99.45% accuracy, requires brew install libpostal)
pip install hush-engine[names]            # names-dataset (GPL-3.0, opt-in)
pip install hush-engine[privacy-filter]   # OpenAI Privacy Filter add-on backend (~3GB, Apache-2.0)
pip install hush-engine[full]             # medical + address + accurate + privacy-filter

Quick start

from hush_engine import FileRouter

router = FileRouter()

# Image
result = router.detect_pii_image("screenshot.png")
for d in result["detections"]:
    print(f"{d['entity_type']}: {d['text']} ({d['confidence']:.2f})")

# PDF
result = router.detect_pii_pdf("document.pdf")
print(f"{result['total_pages']} pages, {len(result['detections'])} detections")
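
The snippet above shows each detection as a dict with entity_type, text, and confidence keys. A small post-processing helper, sketched here against a hand-written sample rather than real engine output (group_by_type and the 0.6 cut-off are illustrative, not part of the package), groups detections by type and drops low-confidence hits:

```python
from collections import defaultdict


def group_by_type(detections, min_confidence=0.0):
    """Group detection texts by entity type, skipping low-confidence hits."""
    grouped = defaultdict(list)
    for d in detections:
        if d["confidence"] >= min_confidence:
            grouped[d["entity_type"]].append(d["text"])
    return dict(grouped)


# Hand-written stand-in for the dict returned by the router calls above.
sample = {
    "detections": [
        {"entity_type": "EMAIL_ADDRESS", "text": "john@example.com", "confidence": 0.99},
        {"entity_type": "PERSON", "text": "John Doe", "confidence": 0.85},
        {"entity_type": "PERSON", "text": "Acme", "confidence": 0.40},
    ]
}

grouped = group_by_type(sample["detections"], min_confidence=0.6)
```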

Direct use of the detector:

from hush_engine import PIIDetector
detector = PIIDetector()
for e in detector.analyze_text("John Doe's email is john@example.com"):
    print(f"{e.entity_type}: {e.text}")

Entity types

Category | Type | Notes
Personal | PERSON | Multi-NER cascade with 7,500-name database
Personal | EMAIL_ADDRESS | Regex with validation
Personal | PHONE_NUMBER | 249 countries via libphonenumber
Personal | DATE_TIME | Multiple formats including DD/MM/YYYY and card expiry (MM/YY)
Personal | AGE | "25 years old", "Age: 45"
Personal | GENDER, NRP | Demographic references
Financial | CREDIT_CARD | Luhn-validated, reconstructs fragmented OCR blocks
Financial | FINANCIAL | SWIFT/BIC, IBAN (116 countries), crypto wallets, salaries ($128k/yr), labeled balances, masked accounts (****7823)
Financial | AWS_ACCESS_KEY, STRIPE_KEY | Pattern-matched API keys
Government | NATIONAL_ID | SSN, passport, driver's license across 35+ countries
Medical | MEDICAL | ICD-10, conditions, medications (pattern-based by default)
Technical | CREDENTIAL | Passwords, tokens, keys (Shannon entropy)
Technical | IP_ADDRESS | IPv4/IPv6 with version-string disambiguation
Technical | URL | via urlextract
Technical | NETWORK | MAC, IMEI, UUID, cookies, device IDs
Location | LOCATION | Addresses, cities, countries (libpostal optional)
Location | COORDINATES | Lat/long
Visual | FACE, QR_CODE, BARCODE | OpenCV Haar cascade for faces; Apple Vision for QR codes and barcodes
Organization | COMPANY, ORGANIZATION | S&P 500 + international database
Vehicle | VEHICLE | VIN, license plates
Biometric | BIOMETRIC | Fingerprint IDs
Generic | ID | Employee ID, customer ID, generic identifiers

See docs/PII_REFERENCE.md for regulatory mapping (HIPAA, GDPR, CCPA).
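
The CREDENTIAL row mentions Shannon entropy. As a minimal sketch of that idea (not the engine's CredentialEntropy implementation, whose thresholds are not documented here), character-level entropy separates random-looking tokens from ordinary words:

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Bits per character of the string's empirical character distribution."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


word_bits = shannon_entropy("password")        # repeated letters lower the score
token_bits = shannon_entropy("k9X#mP2$vL7q")   # 12 distinct characters
```

A detector built this way would compare the score against a tuned cut-off, usually combined with length and context checks, since short strings cap out at low entropy regardless of how random they are.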

Architecture

Component | Role
FileRouter | Entry point for file-level processing
PIIDetector | Presidio analyzer with 50+ custom recognizers
PersonRecognizer | NER cascade: NLTagger → LightGBM → name database → (optional) spaCy/Flair/Transformers/GLiNER
VisionOCR | Apple Vision wrapper at 400 DPI
PDFProcessor | PDF-to-image with parallel page processing
TableDetector | Context-aware detection for spreadsheets and tables
ImageAnonymizer, SpreadsheetAnonymizer | Redaction output
FaceDetector | OpenCV Haar cascade
AddressVerifier, CompanyVerifier, CredentialEntropy, HeuristicVerifier | Precision verifiers
DetectionConfig | Runtime thresholds and toggles

PERSON cascade

pattern match → LightGBM NER → NLTagger → names database → [spaCy] → [Flair] → [Transformers] → [GLiNER] → [Privacy Filter]

Lightweight engines run first. The cascade exits early when a high-confidence match is found. Heavy engines are skipped unless installed and explicitly enabled.
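
The early-exit behaviour can be pictured with a toy cascade. Everything in this sketch is hypothetical (the engine callables, the names, and the 0.85 threshold are stand-ins, not Hush internals); it only shows the control flow of "cheap engines first, stop once a score is high enough":

```python
from typing import Callable, List, Optional, Tuple

Engine = Callable[[str], Optional[float]]


def run_cascade(text: str,
                engines: List[Tuple[str, Engine]],
                early_exit: float = 0.85) -> Tuple[Optional[float], List[str]]:
    """Run engines in order; stop once the best score reaches early_exit.

    Returns the best score seen and the names of the engines that ran.
    """
    best: Optional[float] = None
    ran: List[str] = []
    for name, engine in engines:
        score = engine(text)
        ran.append(name)
        if score is not None and (best is None or score > best):
            best = score
        if best is not None and best >= early_exit:
            break  # later, heavier engines are never called
    return best, ran


KNOWN_NAMES = {"alice", "bob"}
engines = [
    ("pattern", lambda t: None),
    ("name_db", lambda t: 0.9 if t.lower() in KNOWN_NAMES else None),
    ("heavy_model", lambda t: 0.7),
]

hit = run_cascade("Alice", engines)
miss = run_cascade("Zorblat", engines)
```

For "Alice" the name lookup clears the threshold, so the heavy model is skipped; for "Zorblat" every engine runs.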

When openai_privacy_filter_authoritative=True, Privacy Filter runs before anything else and its verdict replaces the rest of the cascade for PERSON.

Custom recognizers

from hush_engine import PIIDetector
from presidio_analyzer import Pattern, PatternRecognizer

detector = PIIDetector()
detector.analyzer.registry.add_recognizer(
    PatternRecognizer(
        supported_entity="CUSTOM_ID",
        patterns=[Pattern("custom", r"[A-Z]{3}-\d{6}", 0.8)],
    )
)
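
Before registering a pattern, it can help to check what the regex matches on its own. This standalone check uses only the re module and the same pattern string as above; the sample text is invented:

```python
import re

CUSTOM_ID = re.compile(r"[A-Z]{3}-\d{6}")
CUSTOM_ID_STRICT = re.compile(r"\b[A-Z]{3}-\d{6}\b")

text = "Ticket ABC-123456 was merged; abc-123456 and AB-123456 were not."
matches = CUSTOM_ID.findall(text)

# Without word boundaries the pattern also fires inside longer tokens.
loose = CUSTOM_ID.findall("XABC-1234567")
strict = CUSTOM_ID_STRICT.findall("XABC-1234567")
```

If partial matches inside longer identifiers are unwanted, the \b-anchored variant is one option to pass to Pattern instead.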

Configuration

from hush_engine import DetectionConfig

config = DetectionConfig()
config.set_threshold("PERSON", 0.60)
config.set_enabled_entity("FACE", False)
config.set_enabled_integration("flair", False)

Thresholds persist to ~/.hush/detection_config.json. Integrations: lgbm_ner, spacy, flair, transformers, gliner, name_dataset, libpostal, urlextract, phonenumbers, openai_privacy_filter, openai_privacy_filter_authoritative.

Add-on backend: OpenAI Privacy Filter

OpenAI released Privacy Filter on 2026-04-22 as an open-weight PII-redaction model: Apache-2.0, 1.5B parameters total with 50M active (mixture-of-experts), 128K context, bidirectional token classifier with constrained Viterbi span decoding. Weights sit on HuggingFace; source is at github.com/openai/privacy-filter; the full methodology is in the model card PDF.

Hush 1.11.0 ships an opt-in integration. Install the extra, then enable it through the config:

pip install hush-engine[privacy-filter]

from hush_engine import DetectionConfig
cfg = DetectionConfig()
cfg.set_enabled_integration("openai_privacy_filter", True)
# Optional: set this to True to let Privacy Filter's PERSON verdict short-circuit the cascade.
cfg.set_enabled_integration("openai_privacy_filter_authoritative", False)

Two gating modes:

  • candidate (default when enabled): Privacy Filter votes in the ensemble alongside LightGBM, spaCy, Flair, Transformers. The cascade's early-exit threshold still applies, so it runs only when lighter engines haven't produced a high-confidence hit.
  • authoritative: Privacy Filter's PERSON decision replaces the cascade output. Verifiers skip.

Privacy Filter covers 8 span categories: private_person, private_email, private_phone, private_address, private_url, private_date, account_number, secret. The non-PERSON categories register as a Presidio recognizer that feeds into Hush's standard entity-type pipeline. To load weights from disk instead of HuggingFace Hub, set HUSH_PRIVACY_FILTER_MODEL=/path/to/dir.

Cascade modes (1.12.0+)

privacy_filter_mode replaces the authoritative boolean with five options. Default is off, so 1.11.x configs keep working unchanged.

Mode | Trigger | Action
off (default) | n/a | Skip PF entirely.
candidate | Cascade didn't hit the early-exit confidence | PF votes in the ensemble alongside LightGBM, spaCy, Flair, Transformers.
authoritative | Always | PF's PERSON verdict short-circuits the cascade; verifiers skip.
tiebreaker | Any ensemble span in privacy_filter_contested_band (default [0.45, 0.75]) and no early-exit winner | PF runs once. Matching spans get boosted; PF-only spans are added.
veto | Every Hush detection with score < 0.75 | PF scans the document. Hush spans PF doesn't corroborate are dropped (unless an arbiter keeps them).

from hush_engine import DetectionConfig
cfg = DetectionConfig()
cfg.set_privacy_filter_mode("tiebreaker")
cfg.set_privacy_filter_contested_band([0.40, 0.80])
cfg.set_privacy_filter_excluded_entities(["PHONE_NUMBER"])  # default
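
The tiebreaker trigger in the table can be read as a simple predicate over the ensemble's span scores. The sketch below is an interpretation, not the engine's code: the function name, the inclusive band edges, and the 0.85 early-exit value are all assumptions made for illustration.

```python
from typing import Iterable, Sequence


def should_run_tiebreaker(scores: Iterable[float],
                          band: Sequence[float] = (0.45, 0.75),
                          early_exit: float = 0.85) -> bool:
    """True when some score sits in the contested band and none already won.

    Assumes inclusive band edges and a hypothetical early-exit threshold.
    """
    scores = list(scores)
    low, high = band
    if any(s >= early_exit for s in scores):
        return False  # an early-exit winner exists, so PF is not consulted
    return any(low <= s <= high for s in scores)


contested = should_run_tiebreaker([0.50, 0.30])
already_won = should_run_tiebreaker([0.90, 0.50])
outside_band = should_run_tiebreaker([0.20, 0.80])
```

Widening the band, as in the config snippet above with [0.40, 0.80], makes the last case trigger as well.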

Per-entity exclude

privacy_filter_excluded_entities removes specific hush-mapped entity types from PF output. The default ships with ["PHONE_NUMBER"] because the 2026-04-23 Kaggle ablation showed PF dropping PHONE F1 by 5.71 pp versus Hush's libphonenumber-validated spans. Empty the list to let PF contribute phones when your document mix benefits.

Arbiter callback

PersonRecognizer(privacy_filter_arbiter=callable) accepts a callback that fires on tiebreaker/veto disagreements:

def arbiter(text, span_text, start, end, hush_score, pf_score) -> float | None:
    """Return the new confidence, or None to drop the span."""

Scores are None when that engine didn't produce a hit. Plug in a local LLM, an external rules engine, or any custom heuristic to resolve hard cases.
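
A toy arbiter that follows the signature above might look like this. The heuristic itself (average when both engines agree, keep capitalized multi-word spans at a discount, drop the rest) is invented for the example and is not a recommended policy:

```python
from typing import Optional


def simple_arbiter(text: str, span_text: str, start: int, end: int,
                   hush_score: Optional[float],
                   pf_score: Optional[float]) -> Optional[float]:
    """Return a new confidence, or None to drop the span."""
    # Both engines produced a score: average them.
    if hush_score is not None and pf_score is not None:
        return (hush_score + pf_score) / 2
    # Only one engine fired (or neither).
    only = hush_score if hush_score is not None else pf_score
    if only is None:
        return None
    # Keep capitalized multi-word spans at a reduced score; drop everything else.
    tokens = span_text.split()
    if len(tokens) >= 2 and all(t[:1].isupper() for t in tokens):
        return only * 0.9
    return None


both = simple_arbiter("Hi John Doe", "John Doe", 3, 11, 0.6, 0.8)
pf_only = simple_arbiter("Hi John Doe", "John Doe", 3, 11, None, 0.8)
dropped = simple_arbiter("my essay", "essay", 3, 8, 0.5, None)
```

Per the text above, such a function would be passed as privacy_filter_arbiter when constructing PersonRecognizer.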

License compatibility: Privacy Filter ships under Apache-2.0, which the AGPL-3.0 engine can link against. See the LICENSE for Hush and COMMERCIAL-LICENSING.md for proprietary-deployment terms. The add-on does not change either.

Release privacy gates

Set the HUSH_AUDIT=1 environment variable to opt into internal audit logging (dev + calibration use). Release builds should leave it unset, which:

  • Attaches a NullHandler to hush.audit, so ~/.hush/audit.log never gets created.
  • Removes ingestTrainingFeedback from the RPC allow-list, so the Swift UI has no path to read ~/.hush/training_feedback.jsonl on end-user machines.
  • Hashes filenames in any audit line that does emit (defense-in-depth), so a 10-char SHA-256 prefix takes the place of the filename.
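
Two of these gates map onto standard-library calls. The following sketch shows one way they could be wired; the helper names are hypothetical, and whether Hush hashes the bare filename or the full path is an assumption here:

```python
import hashlib
import logging
import os


def configure_audit_logger(env=None) -> logging.Logger:
    """Silence the hush.audit logger unless HUSH_AUDIT=1 is set."""
    env = os.environ if env is None else env
    logger = logging.getLogger("hush.audit")
    if env.get("HUSH_AUDIT") != "1":
        # NullHandler swallows records, so no log file is ever opened.
        logger.addHandler(logging.NullHandler())
        logger.propagate = False
    return logger


def hashed_filename(filename: str) -> str:
    """Replace a filename with the first 10 hex chars of its SHA-256 digest."""
    return hashlib.sha256(filename.encode("utf-8")).hexdigest()[:10]


audit = configure_audit_logger({})           # simulate a release build
token = hashed_filename("tax_return_2025.pdf")
```

A truncated hash keeps audit lines correlatable across runs without exposing the original name. It is not a strong anonymizer for guessable filenames.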

~/.hush/config.json and ~/.hush/detection_config.json stay unchanged. Those are user settings (locale, thresholds, enabled libraries), not telemetry.

FileRouter also sweeps stragglers out of ~/.hush/tmp on startup and wraps every temp-file caller in try/finally unlink, so preview JPEGs don't accumulate between runs.
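
The cleanup pattern described here is a general one. A minimal sketch, using made-up helper names and a throwaway directory instead of ~/.hush/tmp, could look like this:

```python
import os
import tempfile
from pathlib import Path


def sweep_tmp(tmp_dir: Path) -> int:
    """Delete leftover files in tmp_dir; return how many were removed."""
    removed = 0
    if tmp_dir.is_dir():
        for p in tmp_dir.iterdir():
            if p.is_file():
                p.unlink()
                removed += 1
    return removed


def with_temp_preview(tmp_dir: Path, data: bytes, use):
    """Write data to a temp file, hand the path to use(), always delete it."""
    tmp_dir.mkdir(parents=True, exist_ok=True)
    fd, name = tempfile.mkstemp(suffix=".jpg", dir=tmp_dir)
    path = Path(name)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        return use(path)
    finally:
        # Runs on success and on exceptions alike.
        path.unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as d:
    root = Path(d) / "tmp"
    size = with_temp_preview(root, b"fake-jpeg-bytes", lambda p: p.stat().st_size)
    leftovers = list(root.iterdir())
    (root / "stale.jpg").write_bytes(b"x")   # simulate a straggler from a crash
    swept = sweep_tmp(root)
```

The startup sweep covers the case try/finally cannot: a process killed before its finally block runs.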

Performance

Synthetic golden set (1,000 samples generated with Faker):

Metric Score
F1 97.2%
Precision 98.3%
Recall 96.2%

Kaggle PII Detection 2024 (1,000 student essays, 1,606 ground-truth entities):

Metric Score
F1 93.2%
Precision 94.4%
Recall 91.9%

Per-entity on the Kaggle set: PERSON 93.7% F1, EMAIL 98.7%, ID 88.6%, URL 88.8%, PHONE 85.7%. Latency: 289 ms/doc with libpostal enabled.
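
F1 is the harmonic mean of precision and recall, so the headline figures can be sanity-checked from the tables above. This is a plain arithmetic check, not the benchmark harness:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


synthetic_f1 = f1(98.3, 96.2)   # synthetic golden set
kaggle_f1 = f1(94.4, 91.9)      # Kaggle set
```

From the rounded inputs this gives roughly 97.2 for the synthetic set and 93.1 for Kaggle. The small gap to the reported 93.2% is consistent with precision and recall having been rounded before publication.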

Hush vs LLMs

Same Kaggle set, 1,000 samples. The Privacy Filter rows come from the same benchmark harness, run with [privacy-filter] installed and openai_privacy_filter enabled.

Model | F1 | Precision | Recall | Latency | RAM
Hush Engine v1.11.0 | 93.2% | 94.4% | 91.9% | 289ms | ~15MB
Hush + OpenAI Privacy Filter | 93.0% | 94.2% | 91.9% | 5,017ms | ~3GB
OpenAI Privacy Filter (standalone) | 86.9% | 77.2% | 99.4% | 5,386ms | ~3GB
Mistral 7B | 77.8% | 64.6% | 97.9% | 3,486ms | 10.2GB
Phi-4 (14B) | 75.3% | 65.0% | 89.5% | 6,046ms | 14.3GB
Qwen 2.5 (7B) | 65.7% | 49.8% | 96.5% | 3,105ms | 8.4GB
Gemma 2 (9B) | 63.7% | 47.2% | 97.9% | 4,250ms | 9.0GB
Llama 3.2 (1B) | 21.2% | 11.9% | 95.3% | 4,208ms | 4.7GB

Two results stand out.

OpenAI Privacy Filter alone catches almost every PII span (99.4% recall), but about 23% of the spans it flags are false positives (77.2% precision). In a redaction pipeline, each false positive deletes text the user wants kept. The 17-point precision gap to Hush (94.4%) translates into real content loss.

Adding Privacy Filter to Hush in candidate mode does not lift F1 (93.0% vs 93.2% baseline) and costs 17x the runtime. Hush sits at the ceiling its validators produce on this set. A learned model cannot push past it for entities that already pass Luhn, mod-97, or similar arithmetic.

Reproduce:

# LLM comparison: Hush vs LLMs (includes openai-privacy-filter as a row)
python tests/benchmark_llm_comparison.py --samples 1000 \
  --models mistral:7b,phi4:latest,openai-privacy-filter

# Ablation: baseline vs Hush + Privacy Filter
python tests/benchmark_accuracy.py --samples 1000 \
  --datasets kaggle_golden_1000.json --privacy-filter-ablation --no-pdf

Development

git clone https://github.com/NewMediaStudio/hush-engine.git
cd hush-engine
pip install -e ".[dev]"
python -m spacy download en_core_web_lg
pytest tests/

Benchmarks

python tests/benchmark_accuracy.py --samples 100
python tests/benchmark_accuracy.py --samples 1000
python tests/benchmark_server.py  # dashboard at http://localhost:8000

Bootstrap 95% confidence intervals:

python tools/bootstrap_ci.py --dataset tests/data/synthetic_golden.json
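
For readers unfamiliar with the technique, a percentile bootstrap can be sketched in a few lines. This is a generic illustration on made-up per-sample hit flags. It makes no claim about how tools/bootstrap_ci.py is implemented or which statistic it resamples.

```python
import random
from statistics import fmean


def bootstrap_ci(values, stat=fmean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, recompute the statistic, sort the results.
    stats = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))
    return stats[lo_idx], stats[hi_idx]


# Invented data: 93 correct detections out of 100 samples.
hits = [1] * 93 + [0] * 7
low, high = bootstrap_ci(hits)
```

The interval width shrinks roughly with the square root of the sample count, which is one reason to prefer the 1,000-sample runs over the 100-sample ones when comparing configurations.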

Training LightGBM classifiers

python tools/train_lgbm_ner.py --entity-type PERSON --samples 5000
python tools/train_lgbm_ner.py --all --ai4privacy --augment --samples 10000
python tools/train_lgbm_ner.py --entity-type PERSON --custom-dataset path/to/data.json

Kaggle dataset (optional)

The Kaggle PII Detection 2024 set requires a Kaggle account. After downloading train.json:

python tools/create_kaggle_golden.py  # 1,000-sample golden set for benchmarks
python tools/kaggle_pii_adapter.py --input tests/data/kaggle_train.json

Requirements

  • macOS 10.15+ (Apple Vision OCR)
  • Python 3.10+

Windows and Linux support is on the roadmap but not yet available.

Contributing

See CONTRIBUTING.md. Report security issues per SECURITY.md instead of the public tracker.

Maintainers

Built and maintained by Valentine Makhouleen at New Media Studio.

License

Hush Engine is dual-licensed.

Open source: AGPL-3.0. Free to use, modify, and distribute under AGPL terms. If you run Hush over a network (for example, inside a SaaS), AGPL § 13 requires you to open-source the service that uses it.

Commercial: a paid commercial license is available for proprietary products, closed-source SaaS, or any use where AGPL obligations don't fit. See COMMERCIAL-LICENSING.md or email studio@newmediastudio.com.

Acknowledgments

Built on Presidio, Apple Vision, spaCy, Flair, GLiNER, libpostal, and python-stdnum. Optional add-on: OpenAI Privacy Filter.

