scrubfile

Scrub PII from PDFs, images, and DOCX files. Local-only. One command.

Install

pip install scrubfile

For auto-detection (--auto), also download the spaCy model (~560MB, one-time):

python -m spacy download en_core_web_lg

Quick Start

# Redact specific terms
scrubfile document.pdf -r "John Doe" -r "123-45-6789"

# Auto-detect all PII (names, SSNs, emails, phones, addresses, ...)
scrubfile document.pdf --auto

# Preview what would be redacted (no changes made)
scrubfile document.pdf --auto --preview

# Machine-readable output
scrubfile document.pdf --auto --json

Python API

from scrubfile import redact

# Redact every occurrence of the listed terms
result = redact("document.pdf", terms=["John Doe", "123-45-6789"])
print(result.total_redactions)  # number of redactions made, e.g. 5
print(result.output_path)       # e.g. document_redacted_20260330_120000.pdf
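
The sample output path follows the default naming pattern listed in the CLI reference, `<name>_redacted_<timestamp>.<ext>`. As a rough sketch of how such a name could be built: the `%Y%m%d_%H%M%S` timestamp layout is inferred from the sample above, and `default_output_path` is a hypothetical helper, not part of scrubfile's API.

```python
from datetime import datetime
from pathlib import Path


def default_output_path(source: str, now: datetime) -> Path:
    """Hypothetical helper: build <name>_redacted_<timestamp>.<ext> beside the source file."""
    src = Path(source)
    stamp = now.strftime("%Y%m%d_%H%M%S")  # assumed layout, inferred from the sample output
    return src.with_name(f"{src.stem}_redacted_{stamp}{src.suffix}")


print(default_output_path("document.pdf", datetime(2026, 3, 30, 12, 0, 0)))
# document_redacted_20260330_120000.pdf
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the naming logic easy to test.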

Features

Feature	Details
Multi-format	PDF, PNG, JPG, TIFF, BMP, DOCX
Auto-detect PII	Names, SSNs, emails, phones, addresses, credit cards, IBANs, and 20+ entity types via Presidio + spaCy
Local-only	No cloud APIs. No data leaves your machine. Zero network calls after model download.
Permanent redaction	Text removed from PDF content stream, not just visual overlay
Metadata scrubbing	PDF metadata, XMP, EXIF, DOCX properties — all cleared
OCR support	Redact scanned documents and images via EasyOCR or Tesseract
Thorough mode	`--thorough` also redacts name fragments ("John", "J. Doe") to prevent inference
Term expansion	Provide an SSN or phone number in one format; all variants are searched automatically
JSON output	Machine-readable output for pipelines and automation
MCP server	AI agents can call scrubfile directly (see below)
Privacy-safe output	Detected PII is never echoed in CLI, JSON, or MCP output
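
The term-expansion row above says that supplying an SSN in one format is enough. To illustrate the idea only, here is a minimal sketch that derives a few common spellings of a nine-digit SSN. The function name `ssn_variants` and the particular set of separators are assumptions; scrubfile's actual expansion rules are not documented here and may differ.

```python
import re


def ssn_variants(term: str) -> list[str]:
    """Illustrative only: common spellings of a 9-digit US SSN."""
    digits = re.sub(r"\D", "", term)  # keep digits, drop any separators
    if len(digits) != 9:
        return [term]  # not SSN-shaped; search for the term as given
    a, b, c = digits[:3], digits[3:5], digits[5:]
    return [f"{a}-{b}-{c}", f"{a} {b} {c}", f"{a}.{b}.{c}", digits]


print(ssn_variants("123-45-6789"))
# ['123-45-6789', '123 45 6789', '123.45.6789', '123456789']
```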

Comparison

Capability	scrubfile	Adobe Acrobat	Google Cloud DLP	Presidio (standalone)
Local-only	Yes	Yes	No (cloud)	Yes
Multi-format	PDF, images, DOCX	PDF only	Text/images	Text only
CLI	Yes	No	No	No
Auto-detect PII	Yes	No	Yes	Yes
Agent-ready (MCP)	Yes	No	No	No
Metadata scrubbing	Yes	Partial	No	No
Free	Yes	No ($240/yr)	No (pay per API call)	Yes

Supported Formats

Format	Redaction method	Notes
PDF (.pdf)	Text search + content stream removal	Permanent, not visual overlay
PNG, JPG, JPEG, TIFF, BMP	OCR + bounding box blackout	EXIF metadata stripped
DOCX (.docx)	Paragraph/table/header/footer search	Matches replaced with Unicode block characters (████)
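
The DOCX row describes replacing matched text with block characters. A minimal sketch of that idea on a plain string follows. `black_out` is a hypothetical helper; case-insensitive matching and one block per original character are assumptions made for the example, not confirmed scrubfile behavior.

```python
import re


def black_out(text: str, terms: list[str]) -> tuple[str, int]:
    """Replace each occurrence of each term with block characters; return (text, count)."""
    total = 0
    for term in terms:
        if not term:
            continue  # an empty pattern would match at every position
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        text, n = pattern.subn(lambda m: "█" * len(m.group()), text)
        total += n
    return text, total


redacted, count = black_out("Contact John Doe. SSN: 123-45-6789.", ["John Doe", "123-45-6789"])
print(redacted)  # Contact ████████. SSN: ███████████.
print(count)     # 2
```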

MCP Server (for AI Agents)

scrubfile includes an MCP server so AI agents (Claude Code, Cursor, etc.) can redact documents directly.

Setup — add to your MCP config:

{
  "mcpServers": {
    "scrubfile": {
      "command": "python",
      "args": ["-m", "scrubfile.mcp_server"]
    }
  }
}
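
If an MCP config already lists other servers, the scrubfile entry has to be merged in rather than pasted over them. A small sketch of that merge follows. `add_scrubfile_server` and the `other-tool` entry are hypothetical, and where the config file lives depends on the client, so no path is assumed here.

```python
import json


def add_scrubfile_server(config: dict) -> dict:
    """Return a copy of an MCP config with the scrubfile entry added; other servers are kept."""
    servers = dict(config.get("mcpServers", {}))
    servers["scrubfile"] = {"command": "python", "args": ["-m", "scrubfile.mcp_server"]}
    return {**config, "mcpServers": servers}


# Hypothetical existing config with one unrelated server
existing = {"mcpServers": {"other-tool": {"command": "other", "args": []}}}
print(json.dumps(add_scrubfile_server(existing), indent=2))
```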

Available tools:

Tool	Description
`redact_file`	Redact PII from a file (explicit terms or auto-detect)
`detect_pii`	Scan a file for PII without modifying it
`preview_redactions`	Preview what would be redacted (no file changes)

All MCP tool responses use masked labels ([TERM-1], [DETECTED-1]). Raw PII is never included in responses.
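
To make the masking idea concrete, the sketch below maps each term to a numbered placeholder and reports counts against the placeholder only. The label format comes from the text above; the function names and the shape of the summary are assumptions, not scrubfile's real response schema.

```python
def mask_terms(terms: list[str], prefix: str = "TERM") -> dict[str, str]:
    """Map each distinct term to a stable placeholder such as [TERM-1]."""
    labels: dict[str, str] = {}
    for term in terms:
        if term not in labels:
            labels[term] = f"[{prefix}-{len(labels) + 1}]"
    return labels


def safe_summary(terms: list[str], counts: dict[str, int]) -> list[dict]:
    """Build a report that carries labels and counts but never the raw terms."""
    labels = mask_terms(terms)
    return [{"label": labels[t], "count": counts.get(t, 0)} for t in labels]


print(safe_summary(["John Doe", "123-45-6789"], {"John Doe": 3, "123-45-6789": 2}))
# [{'label': '[TERM-1]', 'count': 3}, {'label': '[TERM-2]', 'count': 2}]
```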

CLI Reference

scrubfile <file> [OPTIONS]

Option	Short	Description
`--redact TEXT`	`-r`	PII term to redact (repeatable)
`--redact-file PATH`	`-f`	File with terms, one per line
`--output PATH`	`-o`	Output path (default: `<name>_redacted_<timestamp>.<ext>`)
`--auto`		Auto-detect PII using NLP
`--threshold FLOAT`		Confidence threshold for auto-detect (default: 0.7)
`--types TEXT`		Comma-separated entity types (e.g., `PERSON,US_SSN`)
`--thorough`		Also redact name fragments and initials
`--preview`		Show detections without redacting
`--json`		Machine-readable JSON output
`--ocr-engine TEXT`		OCR engine: `easyocr` (default) or `tesseract`
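
For scripted use, the options above can be assembled into an argument list from Python. This is a sketch under assumptions: `build_command` and `run_scrubfile` are hypothetical wrappers, only flags from the table are used, and nothing is assumed about scrubfile's exit codes or JSON schema.

```python
import subprocess


def build_command(path, terms=(), auto=False, threshold=None, types=(),
                  preview=False, json_output=False):
    """Assemble a scrubfile argument list from the documented options."""
    cmd = ["scrubfile", str(path)]
    for term in terms:
        cmd += ["-r", term]  # -r is repeatable, one per term
    if auto:
        cmd.append("--auto")
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    if types:
        cmd += ["--types", ",".join(types)]
    if preview:
        cmd.append("--preview")
    if json_output:
        cmd.append("--json")
    return cmd


def run_scrubfile(cmd):
    """Run the command and return the completed process (requires scrubfile on PATH)."""
    return subprocess.run(cmd, capture_output=True, text=True)


cmd = build_command("document.pdf", auto=True, types=["PERSON", "US_SSN"], preview=True)
print(" ".join(cmd))
# scrubfile document.pdf --auto --types PERSON,US_SSN --preview
```

Passing the arguments as a list, instead of one shell string, avoids quoting problems when a term contains spaces.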

Model Requirements

Component	Size	When needed	How to install
Python packages	~200MB	Always	`pip install scrubfile`
spaCy model	~560MB	`--auto` mode	`python -m spacy download en_core_web_lg`
EasyOCR models	~300MB	Image files	Auto-downloads on first use
PyTorch	~1.5GB	Image files	Installed with `pip install scrubfile`

If models are missing, scrubfile fails with a clear error message — it will not silently download during redaction.
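
A pipeline can check for the `--auto` components before it starts a batch. The sketch below assumes that the spaCy model, once downloaded, is importable under the name `en_core_web_lg`; `missing_modules` is a hypothetical helper and is not part of scrubfile.

```python
import importlib.util


def missing_modules(names):
    """Return the names that cannot be located as importable top-level modules."""
    return [n for n in names if importlib.util.find_spec(n) is None]


needed = ["spacy", "en_core_web_lg"]  # assumed module names for --auto mode
missing = missing_modules(needed)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All --auto components found")
```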

Best-Effort Redaction

scrubfile redacts PII from text content in supported formats. Some document elements cannot be reliably redacted:

- Excel formulas that reference cells containing PII
- Embedded objects (charts, SmartArt, OLE objects)
- Non-text elements (form fields, annotations, JavaScript)

These locations are flagged with warnings when detected. Always perform a manual review for highly sensitive documents.
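
Part of a manual review can be automated with a crude leftover scan. The sketch below opens a DOCX as a zip archive, strips the XML tags from every part, and reports any terms that still appear. `leftover_terms` is a hypothetical helper; it does not decode XML entities and will not see text inside images or embedded binary objects, so a clean result is not proof that a file is safe.

```python
import re
import zipfile


def leftover_terms(docx_path, terms):
    """Return the terms still present in any XML part of a DOCX (crude text scan)."""
    found = set()
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue
            xml = archive.read(name).decode("utf-8", errors="ignore")
            text = re.sub(r"<[^>]+>", "", xml)  # drop tags so text split across runs is rejoined
            lowered = text.lower()
            found.update(t for t in terms if t.lower() in lowered)
    return sorted(found)
```

Calling `leftover_terms("document_redacted.docx", ["John Doe"])` on a redacted file should return an empty list; anything else means the term survived somewhere in the document body, headers, footers, or properties.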

Contributing

See CONTRIBUTING.md for setup and guidelines.

License

AGPL-3.0-only — required by the PyMuPDF dependency.

Release history

1.0.1 (this version): Apr 1, 2026
1.0.0: Apr 1, 2026
