Local-first PII detection engine using Presidio and Apple Vision OCR

Hush Engine

Local-first PII detection for images, PDFs, and spreadsheets. Uses Microsoft Presidio for text detection and Apple Vision for OCR. Runs on your machine; nothing is uploaded.

Prefer a GUI? hushbee.app ships a free macOS app built on this engine. Drop files in, get redacted versions out.

Features

Formats: Images (PNG, JPEG, HEIC), PDFs, Excel and CSV. Apple Vision OCR runs at 400 DPI.

Detection: 27 PII types out of the box: names, emails, phone numbers, SSN, credit cards, IBAN, API keys, crypto wallets, passports, medical identifiers, and more. The full table is below.

NER stack: LightGBM classifiers handle token-level PERSON, LOCATION, ORGANIZATION, DATE_TIME, and ADDRESS at ~10MB total. A 7,500-name curated database across 54 locales provides fallback coverage. Optional heavyweight models (Flair, Transformers/BERT, GLiNER, OpenAI Privacy Filter) slot into the cascade for workloads where recall matters more than throughput.

International: 116 IBAN countries via python-stdnum. 249 phone country codes via phonenumbers. 35+ national ID formats. 800+ cities for LOCATION disambiguation.

Validation: Luhn for credit cards. Verhoeff for Aadhaar. Mod-11 and Mod-97 for other IDs. Context-aware thresholds boost confidence when headers or labels disambiguate.

Extras: Face detection (OpenCV Haar). QR and barcode decoding (Apple Vision). Runtime toggle for each NER backend.
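
To make the checksum validation concrete, here is a minimal standalone sketch of two of the checks named above: Luhn for card numbers and Mod-97 for IBANs. It uses only the standard library and is not Hush's own code; the function names luhn_valid and iban_mod97_valid are hypothetical, and the IBAN check deliberately skips the per-country length rules that a library such as python-stdnum applies.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def iban_mod97_valid(iban: str) -> bool:
    """Return True if `iban` passes the Mod-97 check (no length rules)."""
    s = "".join(iban.split()).upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    # Move country code and check digits to the end, then map A..Z to 10..35.
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(luhn_valid("4111 1111 1111 1111"))
print(iban_mod97_valid("GB82 WEST 1234 5698 7654 32"))
```

A candidate that fails its checksum can be dropped before any model sees it, which is why arithmetic validators tend to raise precision at almost no runtime cost.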

Install

pip install hush-engine
python -m spacy download en_core_web_lg
brew install poppler  # for PDFs

Optional extras:

pip install hush-engine[accurate]         # Flair + Transformers + GLiNER (~2GB)
pip install hush-engine[medical]          # Disease + drug NER
pip install hush-engine[address]          # libpostal bindings (99.45% accuracy, requires brew install libpostal)
pip install hush-engine[names]            # names-dataset (GPL-3.0, opt-in)
pip install hush-engine[privacy-filter]   # OpenAI Privacy Filter add-on backend (~3GB, Apache-2.0)
pip install hush-engine[full]             # medical + address + accurate + privacy-filter

Quick start

from hush_engine import FileRouter

router = FileRouter()

# Image
result = router.detect_pii_image("screenshot.png")
for d in result["detections"]:
    print(f"{d['entity_type']}: {d['text']} ({d['confidence']:.2f})")

# PDF
result = router.detect_pii_pdf("document.pdf")
print(f"{result['total_pages']} pages, {len(result['detections'])} detections")
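
Downstream code often wants the detections grouped rather than printed. The sketch below assumes only the three keys used above (entity_type, text, confidence) and that confidence is a 0 to 1 score; the summarize helper and the sample list are hypothetical and stand in for a real result["detections"].

```python
from collections import defaultdict


def summarize(detections, min_confidence=0.5):
    """Group detected texts by entity type, dropping low-confidence hits."""
    grouped = defaultdict(list)
    for d in detections:
        if d["confidence"] >= min_confidence:
            grouped[d["entity_type"]].append(d["text"])
    return dict(grouped)


# Hypothetical sample shaped like result["detections"].
sample = [
    {"entity_type": "EMAIL_ADDRESS", "text": "john@example.com", "confidence": 0.99},
    {"entity_type": "PERSON", "text": "John Doe", "confidence": 0.85},
    {"entity_type": "PERSON", "text": "Page", "confidence": 0.30},
]

print(summarize(sample, min_confidence=0.6))
```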

Direct use of the detector:

from hush_engine import PIIDetector
detector = PIIDetector()
for e in detector.analyze_text("John Doe's email is john@example.com"):
    print(f"{e.entity_type}: {e.text}")

Entity types

| Category | Type | Notes |
| Personal | PERSON | Multi-NER cascade with 7,500-name database |
| | EMAIL_ADDRESS | Regex with validation |
| | PHONE_NUMBER | 249 countries via libphonenumber |
| | DATE_TIME | Multiple formats including DD/MM/YYYY and card expiry (MM/YY) |
| | AGE | "25 years old", "Age: 45" |
| | GENDER, NRP | Demographic references |
| Financial | CREDIT_CARD | Luhn-validated, reconstructs fragmented OCR blocks |
| | FINANCIAL | SWIFT/BIC, IBAN (116 countries), crypto wallets, salaries ($128k/yr), labeled balances, masked accounts (****7823) |
| | AWS_ACCESS_KEY, STRIPE_KEY | Pattern-matched API keys |
| Government | NATIONAL_ID | SSN, passport, driver's license across 35+ countries |
| Medical | MEDICAL | ICD-10, conditions, medications (pattern-based by default) |
| Technical | CREDENTIAL | Passwords, tokens, keys (Shannon entropy) |
| | IP_ADDRESS | IPv4/IPv6 with version-string disambiguation |
| | URL | via urlextract |
| | NETWORK | MAC, IMEI, UUID, cookies, device IDs |
| Location | LOCATION | Addresses, cities, countries (libpostal optional) |
| | COORDINATES | Lat/long |
| Visual | FACE, QR_CODE, BARCODE | Apple Vision framework |
| Organization | COMPANY, ORGANIZATION | S&P 500 + international database |
| Vehicle | VEHICLE | VIN, license plates |
| Biometric | BIOMETRIC | Fingerprint IDs |
| Generic | ID | Employee ID, customer ID, generic identifiers |

See docs/PII_REFERENCE.md for regulatory mapping (HIPAA, GDPR, CCPA).
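
The CREDENTIAL row mentions Shannon entropy. As a rough illustration of the idea, the sketch below computes per-character entropy in bits and applies a cutoff. The looks_like_secret helper and its 3.0-bit threshold are hypothetical, chosen only for this example; the thresholds Hush actually uses are not stated here.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Per-character Shannon entropy of `s`, in bits."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def looks_like_secret(s: str, threshold: float = 3.0) -> bool:
    """Hypothetical cutoff: flag strings whose entropy exceeds `threshold`."""
    return shannon_entropy(s) > threshold


for candidate in ("password", "x7Kq9ZpL2vTa"):
    print(candidate, round(shannon_entropy(candidate), 2), looks_like_secret(candidate))
```

Dictionary words repeat characters and score low, while random tokens spread evenly over their alphabet and score high. Entropy alone cannot tell a random token from any other string with few repeated characters, so it works best next to a label or context check.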

Architecture

| Component | Role |
| FileRouter | Entry point for file-level processing |
| PIIDetector | Presidio analyzer with 50+ custom recognizers |
| PersonRecognizer | NER cascade: NLTagger → LightGBM → name database → (optional) spaCy/Flair/Transformers/GLiNER |
| VisionOCR | Apple Vision wrapper at 400 DPI |
| PDFProcessor | PDF-to-image with parallel page processing |
| TableDetector | Context-aware detection for spreadsheets and tables |
| ImageAnonymizer, SpreadsheetAnonymizer | Redaction output |
| FaceDetector | OpenCV Haar cascade |
| AddressVerifier, CompanyVerifier, CredentialEntropy, HeuristicVerifier | Precision verifiers |
| DetectionConfig | Runtime thresholds and toggles |

PERSON cascade

pattern match → LightGBM NER → NLTagger → names database → [spaCy] → [Flair] → [Transformers] → [GLiNER] → [Privacy Filter]

Lightweight engines run first. The cascade exits early when a high-confidence match is found. Heavy engines are skipped unless installed and explicitly enabled.

When openai_privacy_filter_authoritative=True, Privacy Filter runs before anything else and its verdict replaces the rest of the cascade for PERSON.
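
The early-exit behaviour can be sketched in a few lines. This is an illustration of the control flow only, not PersonRecognizer's implementation: the engines are hypothetical stand-ins that return a fixed confidence, the 0.85 exit threshold is an assumed value, and authoritative mode is not modelled.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# (engine name, scorer returning a confidence between 0 and 1)
Engine = Tuple[str, Callable[[str], float]]


def run_cascade(
    text: str, engines: Sequence[Engine], exit_threshold: float = 0.85
) -> Tuple[Optional[str], float, List[str]]:
    """Run engines cheapest-first; stop at the first high-confidence score."""
    best_name, best_score = None, 0.0
    ran = []
    for name, scorer in engines:
        ran.append(name)
        score = scorer(text)
        if score > best_score:
            best_name, best_score = name, score
        if score >= exit_threshold:
            break  # later, heavier engines never run
    return best_name, best_score, ran


# Hypothetical stand-ins, ordered from cheapest to most expensive.
engines = [
    ("pattern", lambda t: 0.30),
    ("lgbm", lambda t: 0.90),
    ("heavy_model", lambda t: 0.99),
]

print(run_cascade("Contact John Doe", engines))
```

Because the loop breaks on the first score at or above the threshold, the heavy model in this example is never called, which is where the latency saving comes from.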

Custom recognizers

from hush_engine import PIIDetector
from presidio_analyzer import Pattern, PatternRecognizer

detector = PIIDetector()
detector.analyzer.registry.add_recognizer(
    PatternRecognizer(
        supported_entity="CUSTOM_ID",
        patterns=[Pattern("custom", r"[A-Z]{3}-\d{6}", 0.8)],
    )
)
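
Before registering a pattern it can help to check what the regex matches on its own. The sketch below uses only the standard re module, not Presidio, and the sample text is made up. It shows that the pattern above has no word boundaries, so it also matches inside longer tokens; the \b-anchored variant is a suggestion, not something the engine requires.

```python
import re

# Same expression as in the recognizer above.
unanchored = re.compile(r"[A-Z]{3}-\d{6}")
# Variant with word boundaries, so it only matches whole tokens.
anchored = re.compile(r"\b[A-Z]{3}-\d{6}\b")

text = "Ticket ABC-123456 was merged into XABC-1234567."

print(unanchored.findall(text))
print(anchored.findall(text))
```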

Configuration

from hush_engine import DetectionConfig

config = DetectionConfig()
config.set_threshold("PERSON", 0.60)
config.set_enabled_entity("FACE", False)
config.set_enabled_integration("flair", False)

Thresholds persist to ~/.hush/detection_config.json. Integrations: lgbm_ner, spacy, flair, transformers, gliner, name_dataset, libpostal, urlextract, phonenumbers, openai_privacy_filter, openai_privacy_filter_authoritative.

Add-on backend: OpenAI Privacy Filter

OpenAI released Privacy Filter on 2026-04-22 as an open-weight PII-redaction model: Apache-2.0, 1.5B parameters total with 50M active (mixture-of-experts), 128K context, bidirectional token classifier with constrained Viterbi span decoding. Weights sit on HuggingFace; source is at github.com/openai/privacy-filter; the full methodology is in the model card PDF.

Hush 1.11.0 ships an opt-in integration. Install the extra, then enable it through the config:

pip install hush-engine[privacy-filter]

from hush_engine import DetectionConfig

cfg = DetectionConfig()
cfg.set_enabled_integration("openai_privacy_filter", True)
# Optional: set this to True to let Privacy Filter's PERSON verdict
# short-circuit the cascade (authoritative mode). False keeps candidate mode.
cfg.set_enabled_integration("openai_privacy_filter_authoritative", False)

Two gating modes:

  • candidate (default when enabled): Privacy Filter votes in the ensemble alongside LightGBM, spaCy, Flair, and Transformers. The cascade's early-exit threshold still applies, so it runs only when lighter engines haven't produced a high-confidence hit.
  • authoritative: Privacy Filter's PERSON decision replaces the cascade output, and the verifiers are skipped.

Privacy Filter covers 8 span categories: private_person, private_email, private_phone, private_address, private_url, private_date, account_number, secret. The non-PERSON categories (everything except private_person) register as a Presidio recognizer that feeds into Hush's standard entity-type pipeline. To load weights from disk instead of HuggingFace Hub, set HUSH_PRIVACY_FILTER_MODEL=/path/to/dir.

License compatibility: Privacy Filter ships under Apache-2.0, which the AGPL-3.0 engine can link against. See the LICENSE for Hush and COMMERCIAL-LICENSING.md for proprietary-deployment terms. The add-on does not change either.

Release privacy gates

Set the HUSH_AUDIT=1 environment variable to opt into internal audit logging (dev + calibration use). Release builds should leave it unset, which:

  • Attaches a NullHandler to hush.audit, so ~/.hush/audit.log never gets created.
  • Removes ingestTrainingFeedback from the RPC allow-list, so the Swift UI has no path to read ~/.hush/training_feedback.jsonl on end-user machines.
  • Hashes filenames in any audit line that does emit (defense-in-depth), so a 10-char SHA-256 prefix takes the place of the filename.

~/.hush/config.json and ~/.hush/detection_config.json stay unchanged. Those are user settings (locale, thresholds, enabled libraries), not telemetry.
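
A minimal sketch of how such a gate could look, assuming the behaviour described above: a NullHandler on hush.audit unless HUSH_AUDIT=1, and a 10-character SHA-256 prefix in place of a filename. The function names are hypothetical, the UTF-8 encoding of the name is an assumption, and a StreamHandler stands in for the real file handler so the example writes nothing to disk.

```python
import hashlib
import logging
import os


def hashed_filename(name: str, length: int = 10) -> str:
    """Replace a filename with the first `length` hex chars of its SHA-256."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()[:length]


def configure_audit_logger(env=None) -> logging.Logger:
    """Attach a real handler only when HUSH_AUDIT=1; otherwise a NullHandler."""
    env = os.environ if env is None else env
    logger = logging.getLogger("hush.audit")
    logger.handlers.clear()
    logger.propagate = False  # keep audit lines out of the root logger
    if env.get("HUSH_AUDIT") == "1":
        # Stand-in for a file handler; the sketch avoids touching ~/.hush.
        logger.addHandler(logging.StreamHandler())
    else:
        logger.addHandler(logging.NullHandler())
    return logger


audit = configure_audit_logger({})  # release-style: variable unset
audit.info("processed %s", hashed_filename("payroll_2024.pdf"))  # discarded
```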

FileRouter also sweeps stragglers out of ~/.hush/tmp on startup and wraps every temp-file caller in try/finally unlink, so preview JPEGs don't accumulate between runs.
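
The two cleanup steps can be illustrated with standard-library calls. This is a sketch under stated assumptions, not FileRouter's code: sweep_tmp and with_temp_file are hypothetical names, and the example runs against a throwaway directory instead of ~/.hush/tmp.

```python
import os
import tempfile
from pathlib import Path


def sweep_tmp(tmp_dir: Path) -> int:
    """Delete leftover files in `tmp_dir`; return how many were removed."""
    removed = 0
    if tmp_dir.is_dir():
        for p in tmp_dir.iterdir():
            if p.is_file():
                p.unlink()
                removed += 1
    return removed


def with_temp_file(tmp_dir: Path, suffix: str, work):
    """Create a temp file, pass its path to `work`, always delete it after."""
    tmp_dir.mkdir(parents=True, exist_ok=True)
    fd, path = tempfile.mkstemp(suffix=suffix, dir=tmp_dir)
    os.close(fd)
    try:
        return work(Path(path))
    finally:
        Path(path).unlink(missing_ok=True)  # runs even if `work` raises


with tempfile.TemporaryDirectory() as d:
    tmp = Path(d) / "tmp"
    tmp.mkdir()
    (tmp / "stale_preview.jpg").write_bytes(b"")  # simulated straggler
    print(sweep_tmp(tmp))
    print(with_temp_file(tmp, ".jpg", lambda p: p.exists()))
    print(list(tmp.iterdir()))
```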

Performance

Synthetic golden set (1,000 samples generated with Faker):

Metric Score
F1 97.2%
Precision 98.3%
Recall 96.2%

Kaggle PII Detection 2024 (1,000 student essays, 1,606 ground-truth entities):

Metric Score
F1 93.2%
Precision 94.4%
Recall 91.9%

Per-entity on the Kaggle set: PERSON 93.7% F1, EMAIL 98.7%, ID 88.6%, URL 88.8%, PHONE 85.7%. Latency: 289 ms/doc with libpostal enabled.

Hush vs LLMs

Same Kaggle set, 1,000 samples. The Privacy Filter rows come from the same benchmark harness, run with [privacy-filter] installed and openai_privacy_filter enabled.

| Model | F1 | Precision | Recall | Latency | RAM |
| Hush Engine v1.11.0 | 93.2% | 94.4% | 91.9% | 289ms | ~15MB |
| Hush + OpenAI Privacy Filter | 93.0% | 94.2% | 91.9% | 5,017ms | ~3GB |
| OpenAI Privacy Filter (standalone) | 86.9% | 77.2% | 99.4% | 5,386ms | ~3GB |
| Mistral 7B | 77.8% | 64.6% | 97.9% | 3,486ms | 10.2GB |
| Phi-4 (14B) | 75.3% | 65.0% | 89.5% | 6,046ms | 14.3GB |
| Qwen 2.5 (7B) | 65.7% | 49.8% | 96.5% | 3,105ms | 8.4GB |
| Gemma 2 (9B) | 63.7% | 47.2% | 97.9% | 4,250ms | 9.0GB |
| Llama 3.2 (1B) | 21.2% | 11.9% | 95.3% | 4,208ms | 4.7GB |

Two results stand out.

OpenAI Privacy Filter alone catches almost every PII span (99.4% recall), but about 23% of the spans it flags are false positives (77.2% precision). In a redaction pipeline, each false positive deletes text the user wants kept. The 17-point precision gap against Hush (94.4%) translates into real content loss.

Adding Privacy Filter to Hush in candidate mode does not lift F1 (93.0% vs 93.2% baseline) and costs 17x the runtime. Hush sits at the ceiling its validators produce on this set. A learned model cannot push past it for entities that already pass Luhn, mod-97, or similar arithmetic.
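
The figures quoted in these two paragraphs follow from the table by simple arithmetic. The sketch below recomputes them with the standard F1 formula (harmonic mean of precision and recall); it is independent of the benchmark harness, and the inputs are the rounded percentages from the table, so a recomputed value can differ from a reported one in the last digit.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Privacy Filter standalone: precision 77.2%, recall 99.4%.
print(round(f1(77.2, 99.4), 1))      # F1 in percent
print(round(100 - 77.2, 1))          # share of flagged spans that are wrong
print(round(94.4 - 77.2, 1))         # precision gap against Hush, in points
print(round(5017 / 289, 1))          # latency ratio, Hush + PF vs Hush alone
```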

Reproduce:

# LLM comparison: Hush vs LLMs (includes openai-privacy-filter as a row)
python tests/benchmark_llm_comparison.py --samples 1000 \
  --models mistral:7b,phi4:latest,openai-privacy-filter

# Ablation: baseline vs Hush + Privacy Filter
python tests/benchmark_accuracy.py --samples 1000 \
  --datasets kaggle_golden_1000.json --privacy-filter-ablation --no-pdf

Development

git clone https://github.com/NewMediaStudio/hush-engine.git
cd hush-engine
pip install -e ".[dev]"
python -m spacy download en_core_web_lg
pytest tests/

Benchmarks

python tests/benchmark_accuracy.py --samples 100
python tests/benchmark_accuracy.py --samples 1000
python tests/benchmark_server.py  # dashboard at http://localhost:8000

Bootstrap 95% confidence intervals:

python tools/bootstrap_ci.py --dataset tests/data/synthetic_golden.json
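
For readers unfamiliar with the method, a percentile bootstrap is one common way to get such an interval: resample the per-sample outcomes with replacement, recompute the metric each time, and read off the 2.5th and 97.5th percentiles. The sketch below shows that idea with the standard library only. It is not a description of what tools/bootstrap_ci.py does internally, and the outcome list is made-up data.

```python
import random
from statistics import mean


def bootstrap_ci(values, stat=mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` over `values`."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    n = len(values)
    stats = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Made-up outcomes: 1 = entity detected correctly, 0 = missed.
outcomes = [1] * 90 + [0] * 10
low, high = bootstrap_ci(outcomes)
print(f"point estimate {mean(outcomes):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```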

Training LightGBM classifiers

python tools/train_lgbm_ner.py --entity-type PERSON --samples 5000
python tools/train_lgbm_ner.py --all --ai4privacy --augment --samples 10000
python tools/train_lgbm_ner.py --entity-type PERSON --custom-dataset path/to/data.json

Kaggle dataset (optional)

The Kaggle PII Detection 2024 set requires a Kaggle account. After downloading train.json:

python tools/create_kaggle_golden.py  # 1,000-sample golden set for benchmarks
python tools/kaggle_pii_adapter.py --input tests/data/kaggle_train.json

Requirements

  • macOS 10.15+ (Apple Vision OCR)
  • Python 3.10+

Windows and Linux support is on the roadmap but not yet available.

Contributing

See CONTRIBUTING.md. Report security issues per SECURITY.md instead of the public tracker.

Maintainers

Built and maintained by Valentine Makhouleen at New Media Studio.

License

Hush Engine is dual-licensed.

Open source: AGPL-3.0. Free to use, modify, and distribute under AGPL terms. If you run Hush over a network (for example, inside a SaaS), AGPL § 13 requires you to open-source the service that uses it.

Commercial: a paid commercial license is available for proprietary products, closed-source SaaS, or any use where AGPL obligations don't fit. See COMMERCIAL-LICENSING.md or email studio@newmediastudio.com.

Acknowledgments

Built on Presidio, Apple Vision, spaCy, Flair, GLiNER, libpostal, and python-stdnum. Optional add-on: OpenAI Privacy Filter.
