Metadata Scrubber
A privacy-focused CLI tool that removes sensitive metadata from files. Supports images, PDFs, and Microsoft Office documents. Perfect for protecting your privacy before sharing files online.
Features
- Multi-format support - Images (JPEG, PNG), PDFs, and Office docs (Word, Excel, PowerPoint)
- Concurrent processing - Process 1000+ files efficiently with ThreadPoolExecutor
- Dry-run mode - Preview what would be scrubbed without making changes
- Verification reports - Before/after comparison to confirm removal
- Smart format detection - Uses library-level format detection, not just file extensions
- Beautiful CLI - Rich progress bars and formatted output
- Privacy-first - Removes GPS coordinates, author info, timestamps, camera data
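The README does not show how format detection works internally. As a rough illustration of content-based detection, here is a minimal standard-library sketch that inspects leading magic bytes instead of trusting the extension. The function name `sniff_format` and the returned labels are hypothetical and are not part of the tool's API.

```python
from __future__ import annotations

import io
import zipfile

# Leading "magic" bytes for the non-Office formats listed above.
_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
}

# Top-level folder inside an Office Open XML ZIP container -> format label.
_OOXML_DIRS = {"word/": "word", "xl/": "excel", "ppt/": "powerpoint"}


def sniff_format(data: bytes) -> str | None:
    """Guess the format from file content, ignoring the file name."""
    for magic, name in _SIGNATURES.items():
        if data.startswith(magic):
            return name
    if data.startswith(b"PK\x03\x04"):  # ZIP local file header
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                names = zf.namelist()
        except zipfile.BadZipFile:
            return None
        for prefix, name in _OOXML_DIRS.items():
            if any(n.startswith(prefix) for n in names):
                return name
    return None
```

A file renamed from `report.docx` to `report.jpg` would still be reported as `word` by this sketch, which is the kind of mismatch extension-only checks miss.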
Supported Formats
| Category | Extensions | Metadata Removed |
|---|---|---|
| Images | .jpg, .jpeg, .png | EXIF, GPS, camera info, timestamps |
| PDF | .pdf | Author, creator, producer, dates |
| Word | .docx | Author, title, comments, keywords |
| Excel | .xlsx, .xlsm, .xltx, .xltm | Author, title, company, comments |
| PowerPoint | .pptx, .pptm, .potx, .potm | Author, title, comments, keywords |
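To make the Office rows more concrete: .docx, .xlsx, and .pptx files are ZIP packages, and fields such as author and title normally live in `docProps/core.xml`. The sketch below reads a few of them with the standard library only. It is an illustration, not the tool's implementation; `read_core_properties` and the `FIELDS` mapping are hypothetical names.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Illustrative subset of core properties (label -> qualified tag).
FIELDS = {
    "author": "dc:creator",
    "title": "dc:title",
    "keywords": "cp:keywords",
    "last_modified_by": "cp:lastModifiedBy",
}


def read_core_properties(package: bytes) -> dict[str, str]:
    """Return the non-empty core properties found in an OOXML package."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        # Raises KeyError if the package has no core properties part.
        root = ET.fromstring(zf.read("docProps/core.xml"))
    found = {}
    for label, tag in FIELDS.items():
        node = root.find(tag, NS)
        if node is not None and node.text:
            found[label] = node.text
    return found
```

Running this on a file before and after scrubbing is a quick independent check that the author field is gone.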
Quick Start
Installation
# Using uv (recommended)
uv pip install metadata-scrubber
# Or clone and install locally
git clone https://github.com/Heritage-XioN/metadata-scrubber-tool.git
cd metadata-scrubber-tool
uv sync
Basic Usage
# Read metadata from a file
mst read document.pdf
# Scrub metadata and save to output folder
mst scrub photo.jpg --output ./cleaned
# Batch process entire folder
mst scrub ./documents -r -ext docx --output ./cleaned
# Verify removal
mst verify original.jpg ./cleaned/processed_original.jpg
Commands
mst read - View Metadata
Extract and display all embedded metadata from a file.
mst read photo.jpg # Single file
mst read report.pdf # PDF file
mst read ./docs -r -ext docx # All Word docs recursively
Example output:
╭──────────────── Metadata Report ────────────────╮
│ ╭────────────────────┬────────────────────────╮ │
│ │ Property           │ Value                  │ │
│ ├────────────────────┼────────────────────────┤ │
│ │ Camera             │                        │ │
│ │   Make             │ Canon                  │ │
│ │   Model            │ Canon EOS 80D          │ │
│ │   Software         │ Adobe Photoshop        │ │
│ ├────────────────────┼────────────────────────┤ │
│ │ GPS                │                        │ │
│ │   GPSLatitude      │ 40.7128                │ │
│ │   GPSLongitude     │ -74.0060               │ │
│ ├────────────────────┼────────────────────────┤ │
│ │ Dates              │                        │ │
│ │   DateTimeOriginal │ 2024:01:15 14:30:00    │ │
│ │   created          │ 2024-01-15 14:30:00    │ │
│ ╰────────────────────┴────────────────────────╯ │
╰─────────────────────────────────────────────────╯
mst scrub - Remove Metadata
Remove sensitive metadata from files and save cleaned copies.
mst scrub photo.jpg --output ./out # Single file
mst scrub ./photos -r -ext jpg -o ./out # All JPEGs in directory
mst scrub ./docs -r -ext pdf --dry-run # Preview without changes
mst scrub ./files -r -ext xlsx -w 8 # 8 concurrent workers
Example output:
Processing 42 files with 4 workers...
⠸ Scrubbing metadata... ━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 42/42 0:00:12
╭─────────────── Summary ────────────────╮
│ Processed: 42                          │
│ Failed: 0                              │
│ Output: C:\Users\...\cleaned           │
╰────────────────────────────────────────╯
Dry-run example:
mst scrub ./photos -r -ext jpg --dry-run
DRY-RUN MODE - No files will be modified
Would process 15 files:
• photo1.jpg → processed_photo1.jpg
• photo2.jpg → processed_photo2.jpg
• vacation/beach.jpg → processed_beach.jpg
...
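The batch and dry-run behavior described above can be sketched with `concurrent.futures.ThreadPoolExecutor`, which the feature list names. Everything else here is an assumption for illustration: `plan_outputs`, `run_batch`, and the `scrub_one` callback are hypothetical names, and the lower bound of 1 worker is a guess (the README only states a default of 4 and a maximum of 16).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def plan_outputs(files: list[Path], out_dir: Path) -> dict[Path, Path]:
    """Map each input to out_dir/processed_<name>, as in the dry-run listing."""
    return {src: out_dir / f"processed_{src.name}" for src in files}


def run_batch(
    files: list[Path],
    out_dir: Path,
    scrub_one: Callable[[Path, Path], None],  # hypothetical per-file worker
    workers: int = 4,
    dry_run: bool = False,
) -> dict[str, int]:
    """Scrub files concurrently and count successes and failures."""
    plan = plan_outputs(files, out_dir)
    if dry_run:
        # Nothing is read or written; only the plan size is reported.
        return {"planned": len(plan), "processed": 0, "failed": 0}

    processed = failed = 0
    workers = max(1, min(workers, 16))  # clamp to the documented maximum
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(scrub_one, src, dst) for src, dst in plan.items()]
        for future in as_completed(futures):
            try:
                future.result()
                processed += 1
            except Exception:
                # One bad file should not abort the rest of the batch.
                failed += 1
    return {"planned": len(plan), "processed": processed, "failed": failed}
```

Because this sketch flattens every input into one output folder, two inputs with the same file name in different subdirectories would map to the same output path. A real implementation would need a policy for such collisions.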
mst verify - Verify Metadata Removal
Compare original and processed files to confirm sensitive data was removed.
mst verify original.jpg ./out/processed_original.jpg
Example output:
Comparing: test_canon.jpg → processed_test_canon.jpg
                  Verification Report
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Property         ┃ Before              ┃ After     ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Make             │ Canon               │ Removed   │
│ Model            │ Canon EOS 80D       │ Removed   │
│ Software         │ Adobe Photoshop     │ Removed   │
│ GPSLatitude      │ 40.7128             │ Removed   │
│ GPSLongitude     │ -74.0060            │ Removed   │
│ Artist           │ John Smith          │ Removed   │
│ Copyright        │ © 2024 John Smith   │ Removed   │
│ DateTimeOriginal │ 2024:01:15 14:30:00 │ Preserved │
└──────────────────┴─────────────────────┴───────────┘
Status: CLEAN - All sensitive metadata removed
Removed: 38 | Preserved: 2
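A report like the one above boils down to a key-by-key comparison of two metadata dictionaries. The following sketch shows one way to compute the rows, the counts, and the status line. The `SENSITIVE` set and the function name `compare_metadata` are hypothetical; the tool's actual list of sensitive keys is not documented here.

```python
# Hypothetical set of keys treated as sensitive; the real tool's list may differ.
SENSITIVE = {
    "Make", "Model", "Software",
    "GPSLatitude", "GPSLongitude",
    "Artist", "Copyright",
}


def compare_metadata(before: dict, after: dict) -> dict:
    """Classify every original key as Removed or Preserved."""
    rows = [
        (key, str(value), "Preserved" if key in after else "Removed")
        for key, value in before.items()
    ]
    removed = sum(1 for _, _, status in rows if status == "Removed")
    # Any sensitive key still present after scrubbing counts as a leak.
    leaked = sorted(key for key in after if key in SENSITIVE)
    return {
        "rows": rows,
        "removed": removed,
        "preserved": len(rows) - removed,
        "leaked": leaked,
        "status": "CLEAN" if not leaked else "SENSITIVE DATA REMAINS",
    }
```

In this sketch a preserved `DateTimeOriginal` does not affect the status because it is not in the assumed `SENSITIVE` set, which is consistent with the example report showing a preserved timestamp alongside a CLEAN status.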
CLI Options
| Option | Description |
|---|---|
| -r, --recursive | Process directories recursively |
| -ext, --extension | Filter by file extension (jpg, png, pdf, docx, xlsx, pptx) |
| -o, --output | Output directory for cleaned files |
| -d, --dry-run | Preview without making changes |
| -w, --workers | Number of concurrent workers (default: 4, max: 16) |
| -V, --verbose | Show detailed debug logs |
| -v, --version | Show version |
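For readers who want to see how an option set like this fits together, here is a small stand-in built with the standard library's `argparse`. The project structure lists Typer as the real CLI framework, so this is only an approximation of the table above, not the tool's code. The program name `mst-sketch`, the version string, and the minimum of 1 worker are assumptions.

```python
import argparse


def worker_count(text: str) -> int:
    """Parse --workers and enforce the documented maximum of 16."""
    value = int(text)
    if not 1 <= value <= 16:  # lower bound of 1 is an assumption
        raise argparse.ArgumentTypeError("workers must be between 1 and 16")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mst-sketch")
    parser.add_argument("path", help="file or directory to process")
    parser.add_argument("-r", "--recursive", action="store_true")
    parser.add_argument("-ext", "--extension", help="e.g. jpg, pdf, docx")
    parser.add_argument("-o", "--output", help="directory for cleaned files")
    parser.add_argument("-d", "--dry-run", action="store_true")
    parser.add_argument("-w", "--workers", type=worker_count, default=4)
    parser.add_argument("-V", "--verbose", action="store_true")
    parser.add_argument("-v", "--version", action="version", version="sketch")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Validating the worker count in a `type=` callable keeps the limit next to the option definition, so an out-of-range value is rejected before any file is touched.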
Development
Setup
git clone https://github.com/Heritage-XioN/metadata-scrubber-tool.git
cd metadata-scrubber-tool
# Install with dev dependencies
uv sync --all-extras
# Run tests
pytest
# Run linting
ruff check .
# Run type checking
mypy src
Project Structure
src/
├── main.py                     # CLI entry point (Typer app)
├── commands/
│   ├── read.py                 # Read metadata command
│   ├── scrub.py                # Scrub metadata command
│   └── verify.py               # Verify removal command
├── services/
│   ├── metadata_factory.py     # Factory for creating handlers
│   ├── metadata_handler.py     # Abstract base class
│   ├── image_handler.py        # JPEG/PNG handler
│   ├── pdf_handler.py          # PDF handler
│   ├── excel_handler.py        # Excel handler
│   ├── powerpoint_handler.py   # PowerPoint handler
│   ├── worddoc_handler.py      # Word document handler
│   ├── report_generator.py     # Verification reports
│   └── batch_processor.py      # Concurrent batch processing
└── core/
    ├── jpeg_metadata.py        # JPEG EXIF processor
    └── png_metadata.py         # PNG metadata processor
docs/
├── metadata-risks.md           # Privacy risks documentation
└── best-practices.md           # Secure file sharing guide
Documentation
- Metadata Risks - Why metadata matters for privacy
- Best Practices - Guidelines for secure file sharing
Known Limitations
File Format Support
| Category | Supported | Not Supported |
|---|---|---|
| Images | JPEG, PNG | TIFF, GIF, HEIC, WebP, RAW |
| Documents | .docx | Legacy .doc |
| Spreadsheets | .xlsx, .xlsm, .xltx, .xltm | Legacy .xls |
| Presentations | .pptx, .pptm, .potx, .potm | Legacy .ppt |
| PDFs | Standard PDFs | Encrypted/password-protected |
Known Constraints
- No in-place editing - Always creates a processed copy (by design for safety)
- Password-protected files - Cannot process encrypted documents
- PNG metadata - Many PNGs have minimal/no extractable metadata
- Embedded files - Objects embedded in Office documents are not deep-scanned
- PDF embedded images - Images inside PDFs retain their original metadata
- Large files - Files are loaded into memory; very large files may be slow
PNG Verification Behavior
When a PNG file has no EXIF metadata (only PngInfo text chunks), the scrub operation removes all text keys. Attempting to verify or read the processed file will show:
Error during verification: No metadata found in the PNG image.
This is expected behavior - the error confirms that all metadata has been successfully removed. You can also use mst read processed_file.png to verify; the same error indicates a clean file.
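If you prefer a check that does not depend on an error message, the textual chunks of a PNG can be listed directly. The sketch below walks the PNG chunk layout (length, type, data, CRC) with the standard library and returns the keywords of any `tEXt`, `zTXt`, or `iTXt` chunks. It is an independent illustration, not the tool's code; the helper names are hypothetical, and it does not look at EXIF data stored in an `eXIf` chunk.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
TEXT_CHUNKS = {b"tEXt", b"zTXt", b"iTXt"}


def iter_chunks(png: bytes):
    """Yield (type, data) for each chunk in a PNG byte string."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        ctype = png[pos + 4:pos + 8]
        yield ctype, png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def text_keys(png: bytes) -> list[str]:
    """Return keywords of all textual chunks; an empty list means none remain."""
    return [
        data.split(b"\x00", 1)[0].decode("latin-1")
        for ctype, data in iter_chunks(png)
        if ctype in TEXT_CHUNKS
    ]


def make_chunk(ctype: bytes, data: bytes) -> bytes:
    """Build a chunk with a valid CRC (used here only to create test input)."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)
```

An empty list from `text_keys` on the processed file corresponds to the "No metadata found" case described above, at least as far as text chunks are concerned.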
Future Enhancements
- HEIC/HEIF support (common on iOS devices)
- Legacy Office format support (.doc, .xls, .ppt)
- Deep scanning of embedded objects
- PDF embedded image metadata stripping
Security Considerations
- Original files are never modified - processed copies are created
- Use --dry-run to preview changes before committing
- Use mst verify to confirm sensitive data was removed
- GPS coordinates are completely stripped for privacy
- Author information is removed from all supported formats
- Always back up files before scrubbing in production
License
MIT License - See LICENSE for details.
Made with ❤️ for privacy