🔒 Metadata Scrubber

A privacy-focused CLI tool that removes sensitive metadata from files. Supports images (JPEG, PNG), PDFs, and Microsoft Office documents. Use it to strip identifying metadata before sharing files online.

Python 3.10+ | License: MIT

✨ Features

  • Multi-format support - Images (JPEG, PNG), PDFs, and Office docs (Word, Excel, PowerPoint)
  • Concurrent processing - Batch-process large file sets (1000+ files) with ThreadPoolExecutor (4 workers by default, up to 16)
  • Dry-run mode - Preview what would be scrubbed without making changes
  • Verification reports - Before/after comparison to confirm removal
  • Smart format detection - Uses library-level format detection, not just file extensions
  • Beautiful CLI - Rich progress bars and formatted output
  • Privacy-first - Removes GPS coordinates, author info, timestamps, camera data

📁 Supported Formats

| Category   | Extensions                 | Metadata Removed                   |
|------------|----------------------------|------------------------------------|
| Images     | .jpg, .jpeg, .png          | EXIF, GPS, camera info, timestamps |
| PDF        | .pdf                       | Author, creator, producer, dates   |
| Word       | .docx                      | Author, title, comments, keywords  |
| Excel      | .xlsx, .xlsm, .xltx, .xltm | Author, title, company, comments   |
| PowerPoint | .pptx, .pptm, .potx, .potm | Author, title, comments, keywords  |
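
For the Office formats, fields such as author, title, and keywords live in the `docProps/core.xml` part of the ZIP-based package. The sketch below reads them with the standard library only, to show where that data sits. It is an illustration under that assumption, not this tool's handler code, and `read_core_properties` is a hypothetical name.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by docProps/core.xml in Office Open XML packages.
NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_core_properties(package: bytes) -> dict[str, str]:
    """Return author, title, and keywords from an OOXML package, if present."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))
    wanted = {"author": "dc:creator", "title": "dc:title", "keywords": "cp:keywords"}
    found = {}
    for label, tag in wanted.items():
        element = root.find(tag, NAMESPACES)
        if element is not None and element.text:
            found[label] = element.text
    return found
```

Calling it with `Path("report.docx").read_bytes()` would return whichever of the three fields the file carries.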

🚀 Quick Start

Installation

# Using uv (recommended)
uv pip install metadata-scrubber

# Or clone and install locally
git clone https://github.com/Heritage-XioN/metadata-scrubber-tool.git
cd metadata-scrubber-tool
uv sync

Basic Usage

# Read metadata from a file
mst read document.pdf

# Scrub metadata and save to output folder
mst scrub photo.jpg --output ./cleaned

# Batch process entire folder
mst scrub ./documents -r -ext docx --output ./cleaned

# Verify removal
mst verify original.jpg ./cleaned/processed_original.jpg

📖 Commands

mst read - View Metadata

Extract and display all embedded metadata from a file.

mst read photo.jpg                      # Single file
mst read report.pdf                     # PDF file
mst read ./docs -r -ext docx            # All Word docs recursively

Example output:

╭────────────────── Metadata Report ──────────────────╮
│ ╭────────────────────┬────────────────────────────╮ │
│ │ Property           │ Value                      │ │
│ ├────────────────────┼────────────────────────────┤ │
│ │ 📷 Camera          │                            │ │
│ │   Make             │ Canon                      │ │
│ │   Model            │ Canon EOS 80D              │ │
│ │   Software         │ Adobe Photoshop            │ │
│ ├────────────────────┼────────────────────────────┤ │
│ │ 📍 GPS             │                            │ │
│ │   GPSLatitude      │ 40.7128                    │ │
│ │   GPSLongitude     │ -74.0060                   │ │
│ ├────────────────────┼────────────────────────────┤ │
│ │ 📅 Dates           │                            │ │
│ │   DateTimeOriginal │ 2024:01:15 14:30:00        │ │
│ │   created          │ 2024-01-15 14:30:00        │ │
│ ╰────────────────────┴────────────────────────────╯ │
╰─────────────────────────────────────────────────────╯
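
The GPS values in the report are decimal degrees. EXIF itself stores latitude and longitude as degrees, minutes, and seconds plus a hemisphere reference (N/S/E/W), so a reader has to convert them. A minimal sketch of that conversion follows. It is not taken from this tool's source, and `dms_to_decimal` is a hypothetical helper name.

```python
from fractions import Fraction


def dms_to_decimal(degrees, minutes, seconds, ref: str) -> float:
    """Convert degrees/minutes/seconds plus a hemisphere ref to decimal degrees."""
    value = Fraction(degrees) + Fraction(minutes) / 60 + Fraction(seconds) / 3600
    # Southern and western hemispheres are negative in decimal notation.
    if ref.upper() in ("S", "W"):
        value = -value
    return float(value)


# 40 deg 42' 46.08" N and 74 deg 0' 21.6" W, the coordinates shown in the report.
latitude = dms_to_decimal(40, 42, Fraction(4608, 100), "N")
longitude = dms_to_decimal(74, 0, Fraction(216, 10), "W")
print(round(latitude, 4), round(longitude, 4))
```

Four decimal places locate a point to within roughly ten metres, which is why these fields are worth removing before a photo is shared.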

mst scrub - Remove Metadata

Remove sensitive metadata from files and save cleaned copies.

mst scrub photo.jpg --output ./out      # Single file
mst scrub ./photos -r -ext jpg -o ./out # All JPEGs in directory
mst scrub ./docs -r -ext pdf --dry-run  # Preview without changes
mst scrub ./files -r -ext xlsx -w 8     # 8 concurrent workers

Example output:

Processing 42 files with 4 workers...

⠸ Scrubbing metadata... ━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 42/42 0:00:12

╭───────────────────── Summary ─────────────────────╮
│ ✅ Processed: 42                                  │
│ ❌ Failed:    0                                   │
│ 📁 Output:    C:\Users\...\cleaned                │
╰───────────────────────────────────────────────────╯

Dry-run example:

mst scrub ./photos -r -ext jpg --dry-run
🔍 DRY-RUN MODE - No files will be modified

Would process 15 files:
  • photo1.jpg → processed_photo1.jpg
  • photo2.jpg → processed_photo2.jpg
  • vacation/beach.jpg → processed_beach.jpg
  ...
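
The dry-run listing suggests that each output name is the source file name with a `processed_` prefix, with subfolders such as `vacation/` dropped. Assuming that naming rule, the sketch below predicts output paths and flags inputs that would map to the same name. How the tool itself handles such collisions is not stated here; `planned_output` and `find_collisions` are hypothetical helpers.

```python
from collections import Counter
from pathlib import Path


def planned_output(source: Path, output_dir: Path) -> Path:
    """Map a source file to output_dir/processed_<filename>, dropping subfolders."""
    return output_dir / f"processed_{source.name}"


def find_collisions(sources: list[Path], output_dir: Path) -> list[Path]:
    """Return output paths that more than one source file would map to."""
    counts = Counter(planned_output(source, output_dir) for source in sources)
    return sorted(path for path, count in counts.items() if count > 1)
```

Running `find_collisions` over a recursive file list before a batch scrub is a cheap way to spot same-named files in different subfolders.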

mst verify - Verify Metadata Removal

Compare original and processed files to confirm sensitive data was removed.

mst verify original.jpg ./out/processed_original.jpg

Example output:

Comparing: test_canon.jpg → processed_test_canon.jpg

                          Verification Report                          
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Property                ┃ Before                   ┃ After          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ Make                    │ Canon                    │ ✅ Removed     │
│ Model                   │ Canon EOS 80D            │ ✅ Removed     │
│ Software                │ Adobe Photoshop          │ ✅ Removed     │
│ GPSLatitude             │ 40.7128                  │ ✅ Removed     │
│ GPSLongitude            │ -74.0060                 │ ✅ Removed     │
│ Artist                  │ John Smith               │ ✅ Removed     │
│ Copyright               │ © 2024 John Smith        │ ✅ Removed     │
│ DateTimeOriginal        │ 2024:01:15 14:30:00      │ ⚪ Preserved   │
└─────────────────────────┴──────────────────────────┴────────────────┘

✅ Status: CLEAN - All sensitive metadata removed
Removed: 38 | Preserved: 2
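
The before/after report boils down to comparing two sets of property names. The following sketch shows that comparison on plain dictionaries. It is a simplified illustration, not the tool's report generator, and `compare_metadata` and `summarize` are hypothetical names.

```python
def compare_metadata(before: dict[str, str], after: dict[str, str]) -> dict[str, str]:
    """Label each original property as 'Removed' or 'Preserved'."""
    return {
        key: "Preserved" if key in after else "Removed"
        for key in before
    }


def summarize(report: dict[str, str]) -> tuple[int, int]:
    """Return (removed_count, preserved_count) for a comparison report."""
    removed = sum(1 for status in report.values() if status == "Removed")
    return removed, len(report) - removed


before = {"Make": "Canon", "GPSLatitude": "40.7128", "DateTimeOriginal": "2024:01:15 14:30:00"}
after = {"DateTimeOriginal": "2024:01:15 14:30:00"}
report = compare_metadata(before, after)
print(report, summarize(report))
```

This version only checks whether a key is still present. A stricter check could also flag properties whose value changed but was not removed.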

⚙️ CLI Options

| Option            | Description                                                |
|-------------------|------------------------------------------------------------|
| -r, --recursive   | Process directories recursively                            |
| -ext, --extension | Filter by file extension (jpg, png, pdf, docx, xlsx, pptx) |
| -o, --output      | Output directory for cleaned files                         |
| -d, --dry-run     | Preview without making changes                             |
| -w, --workers     | Number of concurrent workers (default: 4, max: 16)         |
| -V, --verbose     | Show detailed debug logs                                   |
| -v, --version     | Show version                                               |

🛠️ Development

Setup

git clone https://github.com/Heritage-XioN/metadata-scrubber-tool.git
cd metadata-scrubber-tool

# Install with dev dependencies
uv sync --all-extras

# Run tests
pytest

# Run linting
ruff check .

# Run type checking
mypy src

Project Structure

src/
├── main.py                   # CLI entry point (Typer app)
├── commands/
│   ├── read.py               # Read metadata command
│   ├── scrub.py              # Scrub metadata command
│   └── verify.py             # Verify removal command
├── services/
│   ├── metadata_factory.py   # Factory for creating handlers
│   ├── metadata_handler.py   # Abstract base class
│   ├── image_handler.py      # JPEG/PNG handler
│   ├── pdf_handler.py        # PDF handler
│   ├── excel_handler.py      # Excel handler
│   ├── powerpoint_handler.py # PowerPoint handler
│   ├── worddoc_handler.py    # Word document handler
│   ├── report_generator.py   # Verification reports
│   └── batch_processor.py    # Concurrent batch processing
└── core/
    ├── jpeg_metadata.py      # JPEG EXIF processor
    └── png_metadata.py       # PNG metadata processor

docs/
├── metadata-risks.md         # Privacy risks documentation
└── best-practices.md         # Secure file sharing guide

📚 Documentation

  • docs/metadata-risks.md - Privacy risks documentation
  • docs/best-practices.md - Secure file sharing guide


⚠️ Known Limitations

File Format Support

| Category      | Supported                  | Not Supported                |
|---------------|----------------------------|------------------------------|
| Images        | JPEG, PNG                  | TIFF, GIF, HEIC, WebP, RAW   |
| Documents     | .docx                      | Legacy .doc                  |
| Spreadsheets  | .xlsx, .xlsm, .xltx, .xltm | Legacy .xls                  |
| Presentations | .pptx, .pptm, .potx, .potm | Legacy .ppt                  |
| PDF           | Standard PDFs              | Encrypted/password-protected |

Known Constraints

  • No in-place editing - Always creates a processed copy (by design for safety)
  • Password-protected files - Cannot process encrypted documents
  • PNG metadata - Many PNGs have minimal/no extractable metadata
  • Embedded files - Objects embedded in Office documents are not deep-scanned
  • PDF embedded images - Images inside PDFs retain their original metadata
  • Large files - Files are loaded into memory; very large files may be slow

PNG Verification Behavior

When a PNG file has no EXIF metadata (only PngInfo text chunks), the scrub operation removes all text keys. Attempting to verify or read the processed file will show:

Error during verification: No metadata found in the PNG image.

This is expected behavior - the error confirms that all metadata has been successfully removed. You can also use mst read processed_file.png to verify; the same error indicates a clean file.
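
For readers who want to see what "PngInfo text chunks" are at the byte level, the sketch below walks a PNG's chunk list, reports the keywords of textual chunks (`tEXt`, `zTXt`, `iTXt`), and rebuilds the file without them. It uses only the standard library and is an illustration of the file format, not this tool's PNG processor. It does not touch EXIF (`eXIf`) chunks, and all function names are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
TEXT_CHUNK_TYPES = {b"tEXt", b"zTXt", b"iTXt"}


def iter_chunks(data: bytes):
    """Yield (chunk_type, chunk_body) pairs from raw PNG bytes."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    offset = len(PNG_SIGNATURE)
    while offset < len(data):
        (length,) = struct.unpack(">I", data[offset:offset + 4])
        chunk_type = data[offset + 4:offset + 8]
        yield chunk_type, data[offset + 8:offset + 8 + length]
        offset += 12 + length  # length + type + body + CRC


def text_keys(data: bytes) -> list[str]:
    """Return the keyword of every textual chunk (the part before the first NUL)."""
    return [
        body.split(b"\x00", 1)[0].decode("latin-1")
        for chunk_type, body in iter_chunks(data)
        if chunk_type in TEXT_CHUNK_TYPES
    ]


def make_chunk(chunk_type: bytes, body: bytes) -> bytes:
    """Serialize one chunk: length, type, body, CRC-32 over type + body."""
    crc = zlib.crc32(chunk_type + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + chunk_type + body + struct.pack(">I", crc)


def strip_text_chunks(data: bytes) -> bytes:
    """Rebuild the PNG without tEXt, zTXt, or iTXt chunks."""
    kept = [
        make_chunk(chunk_type, body)
        for chunk_type, body in iter_chunks(data)
        if chunk_type not in TEXT_CHUNK_TYPES
    ]
    return PNG_SIGNATURE + b"".join(kept)
```

An empty result from `text_keys` on a processed file corresponds to the "no metadata found" state described above.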

Future Enhancements

  • HEIC/HEIF support (common on iOS devices)
  • Legacy Office format support (.doc, .xls, .ppt)
  • Deep scanning of embedded objects
  • PDF embedded image metadata stripping

⚠️ Security Considerations

  • Original files are never modified - processed copies are created
  • Use --dry-run to preview changes before committing
  • Use mst verify to confirm sensitive data was removed
  • GPS coordinates are completely stripped for privacy
  • Author information is removed from all supported formats
  • Always back up files before scrubbing in production
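
The scrub-then-verify habit recommended above can be scripted. The sketch below builds the same `mst scrub` and `mst verify` invocations shown under Basic Usage and runs them in order. It assumes `mst` is on PATH, that the processed copy is named `processed_<filename>`, and that `mst` exits with a non-zero status on failure (not confirmed here). The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def scrub_command(source: Path, output_dir: Path) -> list[str]:
    """Build the argument list for `mst scrub` on a single file."""
    return ["mst", "scrub", str(source), "--output", str(output_dir)]


def verify_command(source: Path, output_dir: Path) -> list[str]:
    """Build the argument list for `mst verify` against the processed copy."""
    processed = output_dir / f"processed_{source.name}"
    return ["mst", "verify", str(source), str(processed)]


def scrub_and_verify(source: Path, output_dir: Path) -> None:
    """Scrub one file, then verify the result; requires `mst` on PATH."""
    for command in (scrub_command(source, output_dir), verify_command(source, output_dir)):
        # check=True assumes mst signals failure through its exit status.
        subprocess.run(command, check=True)
```

Passing arguments as a list, rather than a shell string, avoids quoting problems with file names that contain spaces.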

📄 License

MIT License - See LICENSE for details.


Made with ❤️ for privacy
