Local-first PII detection engine using Presidio and Apple Vision OCR
Hush Engine
Local-first PII detection for images, PDFs, and spreadsheets. Uses Microsoft Presidio for text detection and Apple Vision for OCR. Runs on your machine; nothing is uploaded.
Prefer a GUI? hushbee.app ships a free macOS app built on this engine. Drop files in, get redacted versions out.
Features
- Formats: Images (PNG, JPEG, HEIC), PDFs, Excel and CSV. Apple Vision OCR runs at 400 DPI.
- Detection: 27 PII types out of the box: names, emails, phone numbers, SSN, credit cards, IBAN, API keys, crypto wallets, passports, medical identifiers, and more. The full table is below.
- NER stack: LightGBM classifiers handle token-level PERSON, LOCATION, ORGANIZATION, DATE_TIME, and ADDRESS at ~10MB total. A 7,500-name curated database across 54 locales provides fallback coverage. Optional heavyweight models (Flair, Transformers/BERT, GLiNER, OpenAI Privacy Filter) slot into the cascade for workloads where recall matters more than throughput.
- International: 116 IBAN countries via python-stdnum. 249 phone country codes via phonenumbers. 35+ national ID formats. 800+ cities for LOCATION disambiguation.
- Validation: Luhn for credit cards. Verhoeff for Aadhaar. Mod-11 and Mod-97 for other IDs. Context-aware thresholds boost confidence when headers or labels disambiguate.
- Extras: Face detection (OpenCV Haar). QR and barcode decoding (Apple Vision). Runtime toggle for each NER backend.
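To illustrate the checksum validation mentioned above, here is a minimal standalone sketch of a Luhn check and an IBAN Mod-97 check in plain Python. It is not the engine's own code (the engine relies on python-stdnum for IBANs). The function names `luhn_valid` and `iban_mod97_valid` are made up for this example, and the IBAN check deliberately skips the per-country length and format rules a full validator applies.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


def iban_mod97_valid(iban: str) -> bool:
    """Return True if the IBAN passes the Mod-97 check (checksum only)."""
    cleaned = iban.replace(" ", "").upper()
    if len(cleaned) < 5 or not (cleaned.isascii() and cleaned.isalnum()):
        return False
    # Move country code and check digits to the end, then map A..Z to 10..35.
    rearranged = cleaned[4:] + cleaned[:4]
    numeric = "".join(str(int(char, 36)) for char in rearranged)
    return int(numeric) % 97 == 1


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))
    print(iban_mod97_valid("GB82 WEST 1234 5698 7654 32"))
```

A candidate that matches the pattern but fails the arithmetic can be dropped or down-weighted, which is why checksum-backed entity types tend to have few false positives.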
Install
pip install hush-engine
python -m spacy download en_core_web_lg
brew install poppler # for PDFs
Optional extras:
# Quote the extras so zsh (the default macOS shell) does not treat the brackets as a glob pattern.
pip install "hush-engine[accurate]" # Flair + Transformers + GLiNER (~2GB)
pip install "hush-engine[medical]" # Disease + drug NER
pip install "hush-engine[address]" # libpostal bindings (99.45% accuracy, requires brew install libpostal)
pip install "hush-engine[names]" # names-dataset (GPL-3.0, opt-in)
pip install "hush-engine[privacy-filter]" # OpenAI Privacy Filter add-on backend (~3GB, Apache-2.0)
pip install "hush-engine[full]" # medical + address + accurate + privacy-filter
Quick start
from hush_engine import FileRouter
router = FileRouter()
# Image
result = router.detect_pii_image("screenshot.png")
for d in result["detections"]:
    print(f"{d['entity_type']}: {d['text']} ({d['confidence']:.2f})")
# PDF
result = router.detect_pii_pdf("document.pdf")
print(f"{result['total_pages']} pages, {len(result['detections'])} detections")
Direct use of the detector:
from hush_engine import PIIDetector
detector = PIIDetector()
for e in detector.analyze_text("John Doe's email is john@example.com"):
    print(f"{e.entity_type}: {e.text}")
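Once you have a result, a common next step is to group detections by type before deciding what to redact. The helper below is a small sketch that works on plain dictionaries shaped like the `detections` entries in the Quick start (`entity_type`, `text`, `confidence`). The function name `group_by_type`, the `min_confidence` parameter, and the sample data are illustrative and not part of the hush_engine API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_type(
    detections: Iterable[dict], min_confidence: float = 0.0
) -> Dict[str, List[str]]:
    """Group detected text by entity type, dropping low-confidence hits."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for detection in detections:
        if detection["confidence"] >= min_confidence:
            grouped[detection["entity_type"]].append(detection["text"])
    return dict(grouped)


# Hand-written sample data in the same shape as result["detections"].
sample = [
    {"entity_type": "EMAIL_ADDRESS", "text": "john@example.com", "confidence": 0.99},
    {"entity_type": "PERSON", "text": "John Doe", "confidence": 0.85},
    {"entity_type": "PERSON", "text": "Acme", "confidence": 0.40},
]

if __name__ == "__main__":
    for entity_type, texts in group_by_type(sample, min_confidence=0.6).items():
        print(entity_type, texts)
```

With real output you would pass `result["detections"]` instead of `sample`.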
Entity types
| Category | Type | Notes |
|---|---|---|
| Personal | PERSON | Multi-NER cascade with 7,500-name database |
| | EMAIL_ADDRESS | Regex with validation |
| | PHONE_NUMBER | 249 countries via libphonenumber |
| | DATE_TIME | Multiple formats including DD/MM/YYYY and card expiry (MM/YY) |
| | AGE | "25 years old", "Age: 45" |
| | GENDER, NRP | Demographic references |
| Financial | CREDIT_CARD | Luhn-validated, reconstructs fragmented OCR blocks |
| | FINANCIAL | SWIFT/BIC, IBAN (116 countries), crypto wallets, salaries ($128k/yr), labeled balances, masked accounts (****7823) |
| | AWS_ACCESS_KEY, STRIPE_KEY | Pattern-matched API keys |
| Government | NATIONAL_ID | SSN, passport, driver's license across 35+ countries |
| Medical | MEDICAL | ICD-10, conditions, medications (pattern-based by default) |
| Technical | CREDENTIAL | Passwords, tokens, keys (Shannon entropy) |
| | IP_ADDRESS | IPv4/IPv6 with version-string disambiguation |
| | URL | via urlextract |
| | NETWORK | MAC, IMEI, UUID, cookies, device IDs |
| Location | LOCATION | Addresses, cities, countries (libpostal optional) |
| | COORDINATES | Lat/long |
| Visual | FACE, QR_CODE, BARCODE | OpenCV Haar cascade for faces; Apple Vision for QR codes and barcodes |
| Organization | COMPANY, ORGANIZATION | S&P 500 + international database |
| Vehicle | VEHICLE | VIN, license plates |
| Biometric | BIOMETRIC | Fingerprint IDs |
| Generic | ID | Employee ID, customer ID, generic identifiers |
See docs/PII_REFERENCE.md for regulatory mapping (HIPAA, GDPR, CCPA).
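The CREDENTIAL row mentions Shannon entropy. As a rough illustration of the idea, the sketch below computes character-level entropy in bits and flags long, high-entropy tokens as likely secrets. The function names and the thresholds (`min_length=16`, `min_entropy=3.5`) are assumptions chosen for the example, not the values Hush uses.

```python
import math
from collections import Counter


def shannon_entropy(token: str) -> float:
    """Character-level Shannon entropy of a string, in bits per character."""
    if not token:
        return 0.0
    counts = Counter(token)
    length = len(token)
    return -sum(
        (count / length) * math.log2(count / length) for count in counts.values()
    )


def looks_like_secret(
    token: str, min_length: int = 16, min_entropy: float = 3.5
) -> bool:
    """Heuristic: long tokens with high entropy resemble keys or passwords."""
    return len(token) >= min_length and shannon_entropy(token) >= min_entropy


if __name__ == "__main__":
    for candidate in ("aaaaaaaaaaaaaaaa", "q7Xz2LpR9vTb4KdM"):
        print(candidate, round(shannon_entropy(candidate), 2), looks_like_secret(candidate))
```

Entropy alone cannot tell a random key from any string with many distinct characters, so a heuristic like this is best combined with context such as a nearby "password" or "token" label.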
Architecture
| Component | Role |
|---|---|
| FileRouter | Entry point for file-level processing |
| PIIDetector | Presidio analyzer with 50+ custom recognizers |
| PersonRecognizer | NER cascade: NLTagger → LightGBM → name database → (optional) spaCy/Flair/Transformers/GLiNER |
| VisionOCR | Apple Vision wrapper at 400 DPI |
| PDFProcessor | PDF-to-image with parallel page processing |
| TableDetector | Context-aware detection for spreadsheets and tables |
| ImageAnonymizer, SpreadsheetAnonymizer | Redaction output |
| FaceDetector | OpenCV Haar cascade |
| AddressVerifier, CompanyVerifier, CredentialEntropy, HeuristicVerifier | Precision verifiers |
| DetectionConfig | Runtime thresholds and toggles |
PERSON cascade
pattern match → LightGBM NER → NLTagger → names database → [spaCy] → [Flair] → [Transformers] → [GLiNER] → [Privacy Filter]
Lightweight engines run first. The cascade exits early when a high-confidence match is found. Heavy engines are skipped unless installed and explicitly enabled.
When openai_privacy_filter_authoritative=True, Privacy Filter runs before anything else and its verdict replaces the rest of the cascade for PERSON.
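The early-exit behaviour can be pictured with a small sketch. Everything here is hypothetical: `run_person_cascade`, the `exit_threshold` value, and the two toy engines stand in for the real recognizers and are not the engine's implementation. Engines are tried cheapest first, the loop stops at the first high-confidence score, and an optional authoritative engine bypasses the loop entirely.

```python
from typing import Callable, List, Optional, Tuple

# An engine is a (name, scoring function) pair; the score is a confidence in [0, 1].
Engine = Tuple[str, Callable[[str], float]]


def run_person_cascade(
    text: str,
    engines: List[Engine],
    exit_threshold: float = 0.85,
    authoritative: Optional[Engine] = None,
) -> Tuple[str, float]:
    """Return (engine_name, score) for the engine that decided the outcome."""
    if authoritative is not None:
        # Authoritative mode: one engine's verdict replaces the cascade.
        name, score_fn = authoritative
        return name, score_fn(text)
    best: Tuple[str, float] = ("none", 0.0)
    for name, score_fn in engines:
        score = score_fn(text)
        if score > best[1]:
            best = (name, score)
        if score >= exit_threshold:
            break  # early exit: later (heavier) engines never run
    return best


# Toy engines for demonstration only.
KNOWN_NAMES = {"john", "maria"}
calls: List[str] = []


def name_lookup(text: str) -> float:
    calls.append("name_lookup")
    words = (word.strip(".,").lower() for word in text.split())
    return 0.9 if any(word in KNOWN_NAMES for word in words) else 0.1


def heavy_model(text: str) -> float:
    calls.append("heavy_model")
    return 0.7


if __name__ == "__main__":
    cascade = [("name_lookup", name_lookup), ("heavy_model", heavy_model)]
    print(run_person_cascade("John called yesterday.", cascade), calls)
```

Ordering engines by cost means most inputs never reach the slow models, which is the trade-off the cascade is built around.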
Custom recognizers
from hush_engine import PIIDetector
from presidio_analyzer import Pattern, PatternRecognizer
detector = PIIDetector()
detector.analyzer.registry.add_recognizer(
    PatternRecognizer(
        supported_entity="CUSTOM_ID",
        patterns=[Pattern("custom", r"[A-Z]{3}-\d{6}", 0.8)],
    )
)
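Before registering a pattern, it can help to try the regular expression on sample text with the standard library. The sketch below uses the same expression as the recognizer above; the helper name `find_custom_ids` is made up for the example. Presidio applies its own regex flags when it compiles patterns, so treat this as a quick sanity check only: results inside the analyzer may differ, for example in case sensitivity.

```python
import re
from typing import List

# Same expression as the CUSTOM_ID pattern: three capitals, a hyphen, six digits.
CUSTOM_ID = re.compile(r"[A-Z]{3}-\d{6}")


def find_custom_ids(text: str) -> List[str]:
    """Return every substring that matches the CUSTOM_ID expression."""
    return [match.group(0) for match in CUSTOM_ID.finditer(text)]


if __name__ == "__main__":
    print(find_custom_ids("Ticket ABC-123456 was merged into abc-654321."))
```

Because the expression has no word boundaries, it also matches inside longer strings such as `XABC-1234567`; wrapping it in `\b...\b` is one way to tighten it if that matters for your data.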
Configuration
from hush_engine import DetectionConfig
config = DetectionConfig()
config.set_threshold("PERSON", 0.60)
config.set_enabled_entity("FACE", False)
config.set_enabled_integration("flair", False)
Thresholds persist to ~/.hush/detection_config.json. Integrations: lgbm_ner, spacy, flair, transformers, gliner, name_dataset, libpostal, urlextract, phonenumbers, openai_privacy_filter, openai_privacy_filter_authoritative.
Add-on backend: OpenAI Privacy Filter
Hush ships an opt-in integration with OpenAI Privacy Filter (Apache-2.0, 1.5B parameters, 50M active, bidirectional token classifier). Install the extra, enable the integration, and optionally switch on authoritative mode:
pip install "hush-engine[privacy-filter]"
from hush_engine import DetectionConfig
cfg = DetectionConfig()
cfg.set_enabled_integration("openai_privacy_filter", True)
# Optional: set this to True to let Privacy Filter's PERSON verdict short-circuit the cascade.
cfg.set_enabled_integration("openai_privacy_filter_authoritative", False)
Two gating modes:
- candidate (default when enabled): Privacy Filter votes in the ensemble alongside LightGBM, spaCy, Flair, and Transformers. The cascade's early-exit threshold still applies, so it runs only when lighter engines haven't produced a high-confidence hit.
- authoritative: Privacy Filter's PERSON decision replaces the cascade output, and the precision verifiers are skipped.
Privacy Filter covers 8 span categories: private_person, private_email, private_phone, private_address, private_url, private_date, account_number, secret. The seven non-PERSON categories register as a Presidio recognizer that feeds into Hush's standard entity-type pipeline. To load weights from disk instead of HuggingFace Hub, set HUSH_PRIVACY_FILTER_MODEL=/path/to/dir.
Release privacy gates
Set the HUSH_AUDIT=1 environment variable to opt into internal audit logging (dev + calibration use). Release builds should leave it unset, which:
- Attaches a NullHandler to hush.audit, so ~/.hush/audit.log never gets created.
- Removes ingestTrainingFeedback from the RPC allow-list, so the Swift UI has no path to read ~/.hush/training_feedback.jsonl on end-user machines.
- Hashes filenames in any audit line that does emit (defense-in-depth), so a 10-char SHA-256 prefix takes the place of the filename.
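The first and third gates follow a familiar standard-library pattern, sketched below. This is an illustration, not the engine's code: `configure_audit_logger` and `hashed_name` are hypothetical names, the choice to hash only the base name is an assumption, and the opt-in branch logs to stderr here rather than to ~/.hush/audit.log so the example has no side effects on disk.

```python
import hashlib
import logging
import os
from typing import Mapping


def hashed_name(path: str) -> str:
    """Replace a filename with the first 10 hex chars of its SHA-256 digest."""
    base = os.path.basename(path)
    return hashlib.sha256(base.encode("utf-8")).hexdigest()[:10]


def configure_audit_logger(env: Mapping[str, str] = os.environ) -> logging.Logger:
    """Attach a real handler only when HUSH_AUDIT=1; otherwise a NullHandler."""
    logger = logging.getLogger("hush.audit")
    logger.handlers.clear()
    logger.propagate = False  # keep audit lines out of the root logger
    if env.get("HUSH_AUDIT") == "1":
        # Assumption for this sketch: stderr instead of the on-disk audit log.
        logger.addHandler(logging.StreamHandler())
        logger.setLevel(logging.INFO)
    else:
        logger.addHandler(logging.NullHandler())
    return logger


if __name__ == "__main__":
    audit = configure_audit_logger({"HUSH_AUDIT": "1"})
    audit.info("processed file=%s", hashed_name("/Users/me/passport_scan.png"))
```

With the variable unset, every `audit.info(...)` call becomes a no-op, and even with it set the log line carries only the hash prefix.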
~/.hush/config.json and ~/.hush/detection_config.json stay unchanged. Those are user settings (locale, thresholds, enabled libraries), not telemetry.
FileRouter also sweeps stragglers out of ~/.hush/tmp on startup and wraps every temp-file caller in try/finally unlink, so preview JPEGs don't accumulate between runs.
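The sweep-on-startup and try/finally cleanup described above can be sketched as follows. The helper names `sweep_tmp` and `with_temp_file` are hypothetical, and the directory is passed in explicitly so the example never touches ~/.hush/tmp.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def sweep_tmp(tmp_dir: Path) -> int:
    """Delete leftover files in tmp_dir and return how many were removed."""
    removed = 0
    if tmp_dir.is_dir():
        for entry in tmp_dir.iterdir():
            if entry.is_file():
                entry.unlink()
                removed += 1
    return removed


def with_temp_file(
    tmp_dir: Path, data: bytes, consume: Callable[[Path], T], suffix: str = ".jpg"
) -> T:
    """Write data to a temp file, hand it to consume, and always delete it."""
    tmp_dir.mkdir(parents=True, exist_ok=True)
    fd, raw_path = tempfile.mkstemp(suffix=suffix, dir=tmp_dir)
    path = Path(raw_path)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        return consume(path)
    finally:
        # Runs on success and on exceptions alike.
        path.unlink(missing_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        size = with_temp_file(Path(scratch), b"preview", lambda p: p.stat().st_size)
        print(size, sweep_tmp(Path(scratch)))
```

The startup sweep covers the one case try/finally cannot: a process that was killed before the finally block ran.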
Performance
Synthetic golden set (1,000 samples generated with Faker):
| Metric | Score |
|---|---|
| F1 | 97.2% |
| Precision | 98.3% |
| Recall | 96.2% |
Kaggle PII Detection 2024 (1,000 student essays, 1,606 ground-truth entities):
| Metric | Score |
|---|---|
| F1 | 93.2% |
| Precision | 94.4% |
| Recall | 91.9% |
Per-entity on the Kaggle set: PERSON 93.7% F1, EMAIL 98.7%, ID 88.6%, URL 88.8%, PHONE 85.7%. Latency: 289 ms/doc with libpostal enabled.
Hush vs LLMs
Same Kaggle set, 1,000 samples. The Privacy Filter rows come from the same benchmark harness, run with [privacy-filter] installed and openai_privacy_filter enabled.
| Model | F1 | Precision | Recall | Latency | RAM |
|---|---|---|---|---|---|
| Hush Engine v1.11.0 | 93.2% | 94.4% | 91.9% | 289ms | ~15MB |
| Hush + OpenAI Privacy Filter | 93.0% | 94.2% | 91.9% | 5,017ms | ~3GB |
| OpenAI Privacy Filter (standalone) | 86.9% | 77.2% | 99.4% | 5,386ms | ~3GB |
| Mistral 7B | 77.8% | 64.6% | 97.9% | 3,486ms | 10.2GB |
| Phi-4 (14B) | 75.3% | 65.0% | 89.5% | 6,046ms | 14.3GB |
| Qwen 2.5 (7B) | 65.7% | 49.8% | 96.5% | 3,105ms | 8.4GB |
| Gemma 2 (9B) | 63.7% | 47.2% | 97.9% | 4,250ms | 9.0GB |
| Llama 3.2 (1B) | 21.2% | 11.9% | 95.3% | 4,208ms | 4.7GB |
Two results stand out.
OpenAI Privacy Filter alone catches almost every PII span (99.4% recall), but roughly 23% of what it flags is a false positive (77.2% precision). In a redaction pipeline, each false positive deletes text the user wants kept. The 17-point precision gap to Hush (94.4% vs 77.2%) translates into real content loss.
Adding Privacy Filter to Hush in candidate mode does not lift F1 (93.0% vs 93.2% baseline) and costs 17x the runtime. Hush sits at the ceiling its validators produce on this set. A learned model cannot push past it for entities that already pass Luhn, mod-97, or similar arithmetic.
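The figures quoted in this comparison can be cross-checked from the table rows. The sketch below uses only numbers from the table; the helper names are illustrative. F1 is the harmonic mean of precision and recall, the false-positive share of flagged spans is one minus precision, and the slowdown is the ratio of latencies.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def false_positive_share(precision: float) -> float:
    """Fraction of flagged spans that are not real PII."""
    return 1.0 - precision


def slowdown(latency_ms: float, baseline_ms: float) -> float:
    """How many times slower a configuration is than the baseline."""
    return latency_ms / baseline_ms


if __name__ == "__main__":
    # Values copied from the comparison table above.
    print(f"Privacy Filter standalone F1: {f1_score(0.772, 0.994):.3f}")
    print(f"Privacy Filter false-positive share: {false_positive_share(0.772):.3f}")
    print(f"Hush + Privacy Filter slowdown: {slowdown(5017, 289):.1f}x")
```

Small differences in the last digit against the table are expected, because the published precision and recall are themselves rounded.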
Reproduce:
# LLM comparison: Hush vs LLMs (includes openai-privacy-filter as a row)
python tests/benchmark_llm_comparison.py --samples 1000 \
--models mistral:7b,phi4:latest,openai-privacy-filter
# Ablation: baseline vs Hush + Privacy Filter
python tests/benchmark_accuracy.py --samples 1000 \
--datasets kaggle_golden_1000.json --privacy-filter-ablation --no-pdf
Development
git clone https://github.com/NewMediaStudio/hush-engine.git
cd hush-engine
pip install -e ".[dev]"
python -m spacy download en_core_web_lg
pytest tests/
Benchmarks
python tests/benchmark_accuracy.py --samples 100
python tests/benchmark_accuracy.py --samples 1000
python tests/benchmark_server.py # dashboard at http://localhost:8000
Bootstrap 95% confidence intervals:
python tools/bootstrap_ci.py --dataset tests/data/synthetic_golden.json
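For readers unfamiliar with the technique, a percentile bootstrap resamples the per-sample scores with replacement many times and reads the interval off the sorted resampled statistics. The sketch below shows the general idea with the standard library only. It is not the implementation in tools/bootstrap_ci.py, and the function name, defaults, and toy data are assumptions for the example.

```python
import random
from statistics import mean
from typing import Callable, Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    statistic: Callable[[Sequence[float]], float] = mean,
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for a statistic."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    size = len(values)
    estimates = sorted(
        statistic(rng.choices(values, k=size)) for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    # Toy data: 1 = sample scored correctly, 0 = sample scored incorrectly.
    outcomes = [1] * 90 + [0] * 10
    print(bootstrap_ci(outcomes))
```

The same approach extends to F1 by resampling whole documents and recomputing the metric on each resample, rather than averaging per-sample flags.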
Training LightGBM classifiers
python tools/train_lgbm_ner.py --entity-type PERSON --samples 5000
python tools/train_lgbm_ner.py --all --ai4privacy --augment --samples 10000
python tools/train_lgbm_ner.py --entity-type PERSON --custom-dataset path/to/data.json
Kaggle dataset (optional)
The Kaggle PII Detection 2024 set requires a Kaggle account. After downloading train.json:
python tools/create_kaggle_golden.py # 1,000-sample golden set for benchmarks
python tools/kaggle_pii_adapter.py --input tests/data/kaggle_train.json
Requirements
- macOS 10.15+ (Apple Vision OCR)
- Python 3.10+
Windows and Linux support is on the roadmap but not yet available.
Contributing
See CONTRIBUTING.md. Report security issues per SECURITY.md instead of the public tracker.
Maintainers
Built and maintained by Valentine Makhouleen at New Media Studio.
License
Hush Engine is dual-licensed.
Open source: AGPL-3.0. Free to use, modify, and distribute under AGPL terms. If you run Hush over a network (for example, inside a SaaS), AGPL § 13 requires you to open-source the service that uses it.
Commercial: a paid commercial license is available for proprietary products, closed-source SaaS, or any use where AGPL obligations don't fit. See COMMERCIAL-LICENSING.md or email studio@newmediastudio.com.
Related
- Hushbee — free macOS app built on this engine. Download there for a drag-and-drop GUI over the same detection pipeline.
- Microsoft Presidio — the detection framework Hush builds on.
Acknowledgments
Built on Presidio, Apple Vision, spaCy, Flair, GLiNER, libpostal, and python-stdnum.