Scrub PII from PDFs, images, and DOCX files. Local-only. One command.

scrubfile

Install

pip install scrubfile

For auto-detection (--auto), also download the spaCy model (~560MB, one-time):

python -m spacy download en_core_web_lg

Quick Start

# Redact specific terms
scrubfile document.pdf -r "John Doe" -r "123-45-6789"

# Auto-detect all PII (names, SSNs, emails, phones, addresses, ...)
scrubfile document.pdf --auto

# Preview what would be redacted (no changes made)
scrubfile document.pdf --auto --preview

# Machine-readable output
scrubfile document.pdf --auto --json

Python API

from scrubfile import redact

result = redact("document.pdf", terms=["John Doe", "123-45-6789"])
print(result.total_redactions)  # 5
print(result.output_path)       # document_redacted_20260330_120000.pdf
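
The output filename in the example follows the `<name>_redacted_<timestamp>.<ext>` pattern listed in the CLI reference. The sketch below reproduces that pattern so a pipeline can predict or glob for output files. `default_output_name` is a hypothetical helper, not part of the scrubfile API, and the `YYYYMMDD_HHMMSS` timestamp layout is inferred from the example filename rather than documented.

```python
from datetime import datetime
from pathlib import Path


def default_output_name(input_path: str, now: datetime) -> str:
    """Build '<name>_redacted_<timestamp>.<ext>' alongside the input file.

    Assumption: the timestamp is formatted as YYYYMMDD_HHMMSS, as the
    example filename suggests.
    """
    path = Path(input_path)
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return str(path.with_name(f"{path.stem}_redacted_{stamp}{path.suffix}"))


print(default_output_name("document.pdf", datetime(2026, 3, 30, 12, 0, 0)))
# document_redacted_20260330_120000.pdf
```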

Features

| Feature | Details |
| --- | --- |
| Multi-format | PDF, PNG, JPG, TIFF, BMP, DOCX |
| Auto-detect PII | Names, SSNs, emails, phones, addresses, credit cards, IBANs, and 20+ entity types via Presidio + spaCy |
| Local-only | No cloud APIs. No data leaves your machine. Zero network calls after model download. |
| Permanent redaction | Text removed from PDF content stream, not just visual overlay |
| Metadata scrubbing | PDF metadata, XMP, EXIF, and DOCX properties are all cleared |
| OCR support | Redact scanned documents and images via EasyOCR or Tesseract |
| Thorough mode | --thorough also redacts name fragments ("John", "J. Doe") to prevent inference |
| Term expansion | Provide one SSN/phone format, all variants searched automatically |
| JSON output | Machine-readable output for pipelines and automation |
| MCP server | AI agents can call scrubfile directly (see below) |
| Privacy-safe output | Detected PII is never echoed in CLI, JSON, or MCP output |
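
To make the term expansion row concrete, here is a minimal sketch of what expanding one SSN spelling into several search variants could look like. It is an illustration only: `ssn_variants` is a hypothetical function, and the separator set shown is an assumption, not scrubfile's actual expansion list.

```python
import re


def ssn_variants(term: str) -> list[str]:
    """Return common spellings of a 9-digit US SSN (illustrative only)."""
    digits = re.sub(r"\D", "", term)
    if len(digits) != 9:
        return [term]  # not SSN-shaped, so leave the term untouched
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    # Assumed separators: dash, space, dot, and no separator at all.
    return [f"{area}{sep}{group}{sep}{serial}" for sep in ("-", " ", ".", "")]


print(ssn_variants("123-45-6789"))
# ['123-45-6789', '123 45 6789', '123.45.6789', '123456789']
```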

Comparison

| | scrubfile | Adobe Acrobat | Google Cloud DLP | Presidio (standalone) |
| --- | --- | --- | --- | --- |
| Local-only | Yes | Yes | No (cloud) | Yes |
| Multi-format | PDF, images, DOCX | PDF only | Text/images | Text only |
| CLI | Yes | No | No | No |
| Auto-detect PII | Yes | No | Yes | Yes |
| Agent-ready (MCP) | Yes | No | No | No |
| Metadata scrubbing | Yes | Partial | No | No |
| Free | Yes | No ($240/yr) | No (pay per API call) | Yes |

Supported Formats

| Format | Redaction method | Notes |
| --- | --- | --- |
| PDF (.pdf) | Text search + content stream removal | Permanent, not visual overlay |
| PNG, JPG, JPEG, TIFF, BMP | OCR + bounding box blackout | EXIF metadata stripped |
| DOCX (.docx) | Paragraph/table/header/footer search | Unicode block chars (████) |

MCP Server (for AI Agents)

scrubfile includes an MCP server so AI agents (Claude Code, Cursor, etc.) can redact documents directly.

Setup — add to your MCP config:

{
  "mcpServers": {
    "scrubfile": {
      "command": "python",
      "args": ["-m", "scrubfile.mcp_server"]
    }
  }
}
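
If your client already has an MCP config with other servers, the entry above has to be merged in rather than pasted over the file. A minimal sketch of that merge follows. `add_scrubfile_server` and `update_config_file` are hypothetical helpers, and the config file location depends on your client, so it is left as a parameter.

```python
import copy
import json
from pathlib import Path

# The server entry shown in the setup snippet above.
SCRUBFILE_SERVER = {"command": "python", "args": ["-m", "scrubfile.mcp_server"]}


def add_scrubfile_server(config: dict) -> dict:
    """Return a copy of an MCP config with the scrubfile entry added."""
    merged = copy.deepcopy(config)
    servers = merged.setdefault("mcpServers", {})
    servers["scrubfile"] = copy.deepcopy(SCRUBFILE_SERVER)
    return merged


def update_config_file(path: Path) -> None:
    """Read, merge, and rewrite a JSON config file at a client-specific path."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    merged = add_scrubfile_server(config)
    path.write_text(json.dumps(merged, indent=2) + "\n", encoding="utf-8")
```

Existing servers are preserved, and the input dictionary is left unmodified.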

Available tools:

| Tool | Description |
| --- | --- |
| redact_file | Redact PII from a file (explicit terms or auto-detect) |
| detect_pii | Scan a file for PII without modifying it |
| preview_redactions | Preview what would be redacted (no file changes) |

All MCP tool responses use masked labels ([TERM-1], [DETECTED-1]). Raw PII is never included in responses.
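
The idea behind masked labels can be sketched in a few lines: each distinct term gets a stable placeholder, and any report is keyed by the placeholder instead of the raw text. This is an illustration of the concept, not scrubfile's implementation. `mask_terms` and `safe_summary` are hypothetical names.

```python
def mask_terms(terms: list[str], prefix: str = "TERM") -> dict[str, str]:
    """Map each distinct term to a placeholder such as [TERM-1]."""
    labels: dict[str, str] = {}
    for term in terms:
        if term not in labels:
            labels[term] = f"[{prefix}-{len(labels) + 1}]"
    return labels


def safe_summary(counts: dict[str, int], labels: dict[str, str]) -> dict[str, int]:
    """Report per-term redaction counts keyed by label, never by raw text."""
    return {labels[term]: n for term, n in counts.items()}


labels = mask_terms(["John Doe", "123-45-6789", "John Doe"])
print(safe_summary({"John Doe": 3, "123-45-6789": 2}, labels))
# {'[TERM-1]': 3, '[TERM-2]': 2}
```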

CLI Reference

scrubfile <file> [OPTIONS]
| Option | Short | Description |
| --- | --- | --- |
| --redact TEXT | -r | PII term to redact (repeatable) |
| --redact-file PATH | -f | File with terms, one per line |
| --output PATH | -o | Output path (default: <name>_redacted_<timestamp>.<ext>) |
| --auto | | Auto-detect PII using NLP |
| --threshold FLOAT | | Confidence threshold for auto-detect (default: 0.7) |
| --types TEXT | | Comma-separated entity types (e.g., PERSON,US_SSN) |
| --thorough | | Also redact name fragments and initials |
| --preview | | Show detections without redacting |
| --json | | Machine-readable JSON output |
| --ocr-engine TEXT | | OCR engine: easyocr (default) or tesseract |
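
When calling the CLI from a script, it helps to assemble the argument list programmatically rather than building a shell string. The sketch below uses only the options listed in the table above. `build_command` and `run` are hypothetical helpers, and nothing is assumed about scrubfile's exit codes or the shape of its JSON output.

```python
import subprocess


def build_command(path, terms=(), terms_file=None, output=None, auto=False,
                  threshold=None, types=(), thorough=False, preview=False,
                  json_output=False, ocr_engine=None):
    """Assemble a scrubfile argument list from the documented options."""
    cmd = ["scrubfile", str(path)]
    for term in terms:
        cmd += ["--redact", term]  # repeatable option, one per term
    if terms_file:
        cmd += ["--redact-file", str(terms_file)]
    if output:
        cmd += ["--output", str(output)]
    if auto:
        cmd.append("--auto")
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    if types:
        cmd += ["--types", ",".join(types)]
    if thorough:
        cmd.append("--thorough")
    if preview:
        cmd.append("--preview")
    if json_output:
        cmd.append("--json")
    if ocr_engine:
        cmd += ["--ocr-engine", ocr_engine]
    return cmd


def run(cmd):
    """Run the command without a shell, so terms with spaces need no quoting."""
    return subprocess.run(cmd, capture_output=True, text=True)


print(build_command("document.pdf", auto=True, types=["PERSON", "US_SSN"], preview=True))
```

Passing a list to `subprocess.run` avoids shell interpretation of terms such as "John Doe".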

Model Requirements

| Component | Size | When needed | How to install |
| --- | --- | --- | --- |
| Python packages | ~200MB | Always | pip install scrubfile |
| spaCy model | ~560MB | --auto mode | python -m spacy download en_core_web_lg |
| EasyOCR models | ~300MB | Image files | Auto-downloads on first use |
| PyTorch | ~1.5GB | Image files | Installed with pip install scrubfile |

If a required model is missing, scrubfile fails with a clear error message rather than downloading it silently during redaction.
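
A pipeline can check for the spaCy model up front instead of waiting for that error. spaCy models installed with `python -m spacy download` are ordinary Python packages, so the standard library can test for them. `spacy_model_installed` is a hypothetical helper, not part of scrubfile.

```python
import importlib.util


def spacy_model_installed(name: str = "en_core_web_lg") -> bool:
    """Return True if the named spaCy model package can be imported."""
    return importlib.util.find_spec(name) is not None


if not spacy_model_installed():
    print("Missing model. Run: python -m spacy download en_core_web_lg")
```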

Best-Effort Redaction

scrubfile redacts PII from text content in supported formats. Some document elements cannot be reliably redacted:

  • Excel formulas that reference cells containing PII
  • Embedded objects (charts, SmartArt, OLE objects)
  • Non-text elements (form fields, annotations, JavaScript)

These locations are flagged with warnings when detected. Always perform a manual review for highly sensitive documents.
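
As one aid to manual review, a redacted DOCX can be scanned for leftover raw terms using only the standard library, since a .docx file is a ZIP archive of XML parts. This is a rough sketch, not a scrubfile feature, and `find_leftovers` is a hypothetical helper. A clean result is not proof of absence: Word may split text across runs, and characters such as `&` are XML-escaped.

```python
import io
import zipfile


def find_leftovers(docx_bytes: bytes, terms: list[str]) -> dict[str, list[str]]:
    """Return {term: [archive parts that still contain it verbatim]}."""
    hits: dict[str, list[str]] = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith((".xml", ".rels")):
                continue  # skip embedded media and other binary parts
            text = archive.read(name).decode("utf-8", errors="replace")
            for term in terms:
                if term in text:
                    hits.setdefault(term, []).append(name)
    return hits
```

Typical use is `find_leftovers(Path(output_path).read_bytes(), terms)`, treating any non-empty result as a reason to inspect the listed parts by hand.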

Contributing

See CONTRIBUTING.md for setup and guidelines.

License

AGPL-3.0-only — required by the PyMuPDF dependency.

