scrubfile
Scrub PII from PDFs, images, and DOCX files. Local-only. One command.
Install
```bash
pip install scrubfile
```
For auto-detection (--auto), also download the spaCy model (~560MB, one-time):
```bash
python -m spacy download en_core_web_lg
```
Quick Start
```bash
# Redact specific terms
scrubfile document.pdf -r "John Doe" -r "123-45-6789"

# Auto-detect all PII (names, SSNs, emails, phones, addresses, ...)
scrubfile document.pdf --auto

# Preview what would be redacted (no changes made)
scrubfile document.pdf --auto --preview

# Machine-readable output
scrubfile document.pdf --auto --json
```
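For scripting, the same flags can be assembled from Python. The following is a minimal sketch, assuming the `scrubfile` executable is on `PATH` and that `--json` writes its result to standard output. `build_command` and `run_json` are hypothetical helpers, not part of scrubfile, and because the JSON schema is not described here, the parsed result is returned without assuming any field names.

```python
import json
import subprocess


def build_command(path, terms=(), auto=False, preview=False, as_json=False):
    """Assemble a scrubfile argument list from the flags shown in Quick Start."""
    cmd = ["scrubfile", path]
    for term in terms:
        cmd += ["-r", term]  # -r is repeatable, one per term
    if auto:
        cmd.append("--auto")
    if preview:
        cmd.append("--preview")
    if as_json:
        cmd.append("--json")
    return cmd


def run_json(path, **options):
    """Run scrubfile with --json and return the parsed output.

    Assumption: the JSON document is printed to stdout.
    """
    cmd = build_command(path, as_json=True, **options)
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with terms that contain spaces, such as "John Doe".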
Python API
```python
from scrubfile import redact

result = redact("document.pdf", terms=["John Doe", "123-45-6789"])
print(result.total_redactions)  # 5
print(result.output_path)       # document_redacted_20260330_120000.pdf
```
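The call above can be applied to a whole folder. This is a rough sketch rather than a documented scrubfile feature: `redact_folder` is a hypothetical helper that takes the redaction function as a parameter, relies only on the `redact(path, terms=...)` call shown above, and skips files whose names contain `_redacted_` on the assumption that those are earlier outputs using the default naming.

```python
from pathlib import Path


def redact_folder(folder, terms, redact_fn, suffixes=(".pdf", ".docx")):
    """Call redact_fn on each matching file in folder; return {filename: result}."""
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # Assumption: outputs use the default <name>_redacted_<timestamp> pattern.
        if "_redacted_" in path.stem:
            continue
        results[path.name] = redact_fn(str(path), terms=terms)
    return results
```

With scrubfile installed, usage would look like `redact_folder("inbox", ["John Doe"], redact)` after `from scrubfile import redact`.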
Features
| Feature | Details |
|---|---|
| Multi-format | PDF, PNG, JPG, TIFF, BMP, DOCX |
| Auto-detect PII | Names, SSNs, emails, phones, addresses, credit cards, IBANs, and 20+ entity types via Presidio + spaCy |
| Local-only | No cloud APIs. No data leaves your machine. Zero network calls after model download. |
| Permanent redaction | Text removed from PDF content stream, not just visual overlay |
| Metadata scrubbing | PDF metadata, XMP, EXIF, DOCX properties — all cleared |
| OCR support | Redact scanned documents and images via EasyOCR or Tesseract |
| Thorough mode | --thorough also redacts name fragments ("John", "J. Doe") to prevent inference |
| Term expansion | Provide one SSN/phone format, all variants searched automatically |
| JSON output | Machine-readable output for pipelines and automation |
| MCP server | AI agents can call scrubfile directly (see below) |
| Privacy-safe output | Detected PII is never echoed in CLI, JSON, or MCP output |
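To illustrate what term expansion means in practice, here is a small sketch for US Social Security numbers. It is not scrubfile's implementation: `ssn_variants` is a hypothetical function, and the set of separators it generates is an assumption about which written forms are worth searching.

```python
import re


def ssn_variants(ssn):
    """Return common written forms of a 9-digit US SSN given in any one format."""
    digits = re.sub(r"\D", "", ssn)  # keep digits only
    if len(digits) != 9:
        raise ValueError("expected exactly 9 digits")
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    return [
        f"{area}-{group}-{serial}",
        f"{area} {group} {serial}",
        f"{area}.{group}.{serial}",
        digits,
    ]
```

Each variant would then be searched as its own term, so a document that writes the number with spaces is still caught when the user supplied the dashed form.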
Comparison
| | scrubfile | Adobe Acrobat | Google Cloud DLP | Presidio (standalone) |
|---|---|---|---|---|
| Local-only | Yes | Yes | No (cloud) | Yes |
| Multi-format | PDF, images, DOCX | PDF only | Text/images | Text only |
| CLI | Yes | No | No | No |
| Auto-detect PII | Yes | No | Yes | Yes |
| Agent-ready (MCP) | Yes | No | No | No |
| Metadata scrubbing | Yes | Partial | No | No |
| Free | Yes | No ($240/yr) | No (pay per API call) | Yes |
Supported Formats
| Format | Redaction method | Notes |
|---|---|---|
| PDF (.pdf) | Text search + content stream removal | Permanent, not visual overlay |
| PNG, JPG, JPEG, TIFF, BMP | OCR + bounding box blackout | EXIF metadata stripped |
| DOCX (.docx) | Paragraph/table/header/footer search | Unicode block chars (████) |
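The DOCX approach of replacing matched text with block characters can be sketched on a plain string. This is illustrative only: `mask_terms` is a hypothetical function, and emitting one block per original character is an assumption, since the table above does not say how many blocks scrubfile writes.

```python
import re

BLOCK = "\u2588"  # the full block character used in the table above


def mask_terms(text, terms):
    """Replace every occurrence of each term with block characters."""
    terms = [t for t in terms if t]
    if not terms:
        return text
    # Longest terms first so "John Doe" wins over a shorter overlapping "John".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    return pattern.sub(lambda m: BLOCK * len(m.group(0)), text)
```

Note that a length-preserving mask reveals how long the redacted value was; a fixed-width mask would avoid that at the cost of shifting the surrounding layout.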
MCP Server (for AI Agents)
scrubfile includes an MCP server so AI agents (Claude Code, Cursor, etc.) can redact documents directly.
Setup — add to your MCP config:
```json
{
  "mcpServers": {
    "scrubfile": {
      "command": "python",
      "args": ["-m", "scrubfile.mcp_server"]
    }
  }
}
```
Available tools:
| Tool | Description |
|---|---|
| redact_file | Redact PII from a file (explicit terms or auto-detect) |
| detect_pii | Scan a file for PII without modifying it |
| preview_redactions | Preview what would be redacted (no file changes) |
All MCP tool responses use masked labels ([TERM-1], [DETECTED-1]). Raw PII is never included in responses.
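The masked-label idea can be sketched as a simple mapping from raw values to numbered placeholders. The code below is an illustration under assumptions, not scrubfile's internals: `mask_labels` and `safe_report` are hypothetical names, and numbering labels in first-seen order is a guess at the scheme.

```python
def mask_labels(terms, prefix="TERM"):
    """Map each distinct term to a label such as [TERM-1], in first-seen order."""
    labels = {}
    for term in terms:
        if term not in labels:
            labels[term] = f"[{prefix}-{len(labels) + 1}]"
    return labels


def safe_report(counts, labels):
    """Rewrite {term: count} as {label: count} so raw PII stays out of the output."""
    return {labels[term]: n for term, n in counts.items()}
```

An agent reading such a report learns how many redactions each term produced without ever seeing the term itself.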
CLI Reference
```
scrubfile <file> [OPTIONS]
```
| Option | Short | Description |
|---|---|---|
| --redact TEXT | -r | PII term to redact (repeatable) |
| --redact-file PATH | -f | File with terms, one per line |
| --output PATH | -o | Output path (default: <name>_redacted_<timestamp>.<ext>) |
| --auto | | Auto-detect PII using NLP |
| --threshold FLOAT | | Confidence threshold for auto-detect (default: 0.7) |
| --types TEXT | | Comma-separated entity types (e.g., PERSON,US_SSN) |
| --thorough | | Also redact name fragments and initials |
| --preview | | Show detections without redacting |
| --json | | Machine-readable JSON output |
| --ocr-engine TEXT | | OCR engine: easyocr (default) or tesseract |
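When a pipeline needs to predict where the output will land, the default name for --output can be approximated as follows. This is a sketch: `default_output_path` is a hypothetical helper, and the timestamp format `%Y%m%d_%H%M%S` is inferred from the example filename `document_redacted_20260330_120000.pdf` rather than taken from scrubfile's source.

```python
from datetime import datetime
from pathlib import Path


def default_output_path(input_path, now=None):
    """Build <name>_redacted_<timestamp>.<ext> next to the input file."""
    now = now or datetime.now()
    src = Path(input_path)
    stamp = now.strftime("%Y%m%d_%H%M%S")  # inferred format, see note above
    return src.with_name(f"{src.stem}_redacted_{stamp}{src.suffix}")
```

In practice it is safer to read the actual path from `result.output_path` or to pass --output explicitly, since the timestamp depends on when the run starts.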
Model Requirements
| Component | Size | When needed | How to install |
|---|---|---|---|
| Python packages | ~200MB | Always | pip install scrubfile |
| spaCy model | ~560MB | --auto mode | python -m spacy download en_core_web_lg |
| EasyOCR models | ~300MB | Image files | Auto-downloads on first use |
| PyTorch | ~1.5GB | Image files | Installed with pip install scrubfile |
If models are missing, scrubfile fails with a clear error message — it will not silently download during redaction.
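A pipeline can check for the optional components up front instead of waiting for that error. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper, and it relies on the fact that spaCy models installed via `spacy download` are importable Python packages. It checks package presence only and says nothing about whether EasyOCR's model weights have been downloaded.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, `missing_packages(["en_core_web_lg", "easyocr"])` returns an empty list when both are installed, so a script can fail early with its own message before calling scrubfile with --auto or on image files.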
Best-Effort Redaction
scrubfile redacts PII from text content in supported formats. Some document elements cannot be reliably redacted:
- Excel formulas that reference cells containing PII
- Embedded objects (charts, SmartArt, OLE objects)
- Non-text elements (form fields, annotations, JavaScript)
These locations are flagged with warnings when detected. Always perform a manual review for highly sensitive documents.
Contributing
See CONTRIBUTING.md for setup and guidelines.
License
AGPL-3.0-only — required by the PyMuPDF dependency.
Links
- Website: scrubfile.com
- PyPI: pypi.org/project/scrubfile
- GitHub: github.com/scrubfile/scrubfile
- Issues: github.com/scrubfile/scrubfile/issues