A CLI tool for extracting and analyzing metadata from PDFs, including embedded images and XMP/RDF metadata.

PDF Metadata Scanner

A Python tool to recursively scan a folder for PDF files and extract:

  • PDF metadata (Info dictionary via pikepdf)
  • XMP and RDF metadata
  • Metadata from embedded images (JPEG, PNG, TIFF — EXIF, text, and other supported fields)
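XMP packets are RDF/XML, so their simple properties can be read with the standard library alone. The following is a minimal sketch under that assumption, not the tool's own implementation: the function name xmp_simple_properties and the sample packet are hypothetical, and structured values (such as rdf:Alt lists) are deliberately ignored.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def xmp_simple_properties(xmp_xml: str) -> dict[str, str]:
    """Collect text-valued properties from every rdf:Description element.

    Hypothetical helper for illustration; keys use ElementTree's
    "{namespace}localname" notation.
    """
    root = ET.fromstring(xmp_xml)
    props: dict[str, str] = {}
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # Properties may be written as attributes on rdf:Description ...
        for name, value in desc.attrib.items():
            if not name.startswith(f"{{{RDF_NS}}}"):  # skip rdf:about etc.
                props[name] = value
        # ... or as child elements with plain text content.
        for child in desc:
            text = (child.text or "").strip()
            if text:
                props[child.tag] = text
    return props


# Made-up sample packet, used only to exercise the function.
SAMPLE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:xmp="http://ns.adobe.com/xap/1.0/"
      xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
      xmp:CreatorTool="ExampleWriter">
   <pdf:Producer>ExampleProducer</pdf:Producer>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""

if __name__ == "__main__":
    for key, value in xmp_simple_properties(SAMPLE).items():
        print(f"{key}: {value}")
```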

🛠 Features

  • Recursive folder scanning
  • Clean separation of logs (warnings/errors) and metadata output
  • Supports multiple image formats (via Pillow)
  • Handles XMP/RDF and embedded image metadata
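Recursive folder scanning can be sketched with pathlib. This is an illustration only; the helper name find_pdfs is hypothetical, and matching the .pdf extension case-insensitively is an assumption rather than documented behavior of the tool.

```python
from pathlib import Path
from typing import Iterator


def find_pdfs(folder: str) -> Iterator[Path]:
    """Yield every regular file under `folder` whose extension is .pdf.

    Hypothetical helper: the extension check ignores case (an assumption).
    """
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf":
            yield path


if __name__ == "__main__":
    for pdf_path in find_pdfs("."):
        print(pdf_path)
```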

✅ Requirements

pip install -r requirements.txt 

🚀 Usage

python scanner.py <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]

Or install the package and run the pdfscan command:

pip install .
pdfscan <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]

Arguments:

Flag        Description                                    Default
folder      Folder to recursively scan for PDFs            required
--log       Log file for warnings/errors                   scanner_warnings.log
--out       Output file for extracted metadata             pdf_metadata_output.txt
--verbose   Output logs to both file and console
--progress  Show a live progress bar while scanning PDFs
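The arguments listed above map naturally onto argparse. The sketch below is not taken from scanner.py. It assumes that --verbose and --progress are plain on/off switches that default to off, which the usage line suggests but the documentation does not state. The function name build_parser is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        prog="pdfscan",
        description="Recursively scan a folder for PDFs and extract metadata.",
    )
    parser.add_argument("folder", help="Folder to recursively scan for PDFs")
    parser.add_argument("--log", default="scanner_warnings.log",
                        help="Log file for warnings/errors")
    parser.add_argument("--out", default="pdf_metadata_output.txt",
                        help="Output file for extracted metadata")
    # Assumption: both switches take no value and are off unless given.
    parser.add_argument("--verbose", action="store_true",
                        help="Output logs to both file and console")
    parser.add_argument("--progress", action="store_true",
                        help="Show a live progress bar while scanning PDFs")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```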

🧾 Example

python scanner.py ./documents --log logs.txt --out metadata.txt
  • logs.txt: Contains only errors or warnings.
  • metadata.txt: Contains all extracted metadata.
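One way to keep the two files separate is to route warnings and errors through the logging module and write metadata with ordinary file output. This is a sketch of that idea, not the tool's actual code. The names configure_logging and write_metadata, the logger name, and the message format are all hypothetical.

```python
import logging
from typing import Iterable


def configure_logging(log_file: str, verbose: bool = False) -> logging.Logger:
    """Send WARNING and above to `log_file`, and to the console if verbose."""
    logger = logging.getLogger("pdfscan_sketch")  # hypothetical logger name
    logger.setLevel(logging.WARNING)   # info/debug messages are dropped
    logger.handlers.clear()            # avoid duplicate handlers on re-use
    logger.propagate = False           # keep output out of the root logger
    fmt = logging.Formatter("%(levelname)s: %(message)s")

    file_handler = logging.FileHandler(log_file, encoding="utf-8")
    file_handler.setFormatter(fmt)
    logger.addHandler(file_handler)

    if verbose:
        console = logging.StreamHandler()
        console.setFormatter(fmt)
        logger.addHandler(console)
    return logger


def write_metadata(out_file: str, lines: Iterable[str]) -> None:
    """Write extracted metadata lines to `out_file`, bypassing the logger."""
    with open(out_file, "w", encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + "\n")
```

Because metadata never passes through the logger, the log file holds only warnings and errors regardless of how much metadata is extracted.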

📦 Output Structure

Metadata output (--out) includes:

[PDF Metadata] ...
    /Author: John Doe
    /Title: Sample
[XMP Metadata] ...
[Image Metadata] ...
    306: 2023:12:31 12:34:56
    dpi: (300, 300)
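Numeric keys in the image section, such as 306 above, are EXIF/TIFF tag IDs; 306 is the DateTime tag. When post-processing the output, a small lookup can make such keys readable. The sketch below is a hypothetical helper for readers, not part of the tool, and it lists only a handful of tags. Pillow's PIL.ExifTags.TAGS dictionary provides a much fuller mapping.

```python
# A few baseline TIFF/EXIF tag IDs; deliberately incomplete.
TAG_NAMES = {
    271: "Make",
    272: "Model",
    305: "Software",
    306: "DateTime",
}


def label_image_field(key, value) -> str:
    """Format one image-metadata entry, adding a tag name when known.

    Hypothetical helper: non-numeric keys (for example "dpi") and
    unknown tag IDs are passed through unchanged.
    """
    if isinstance(key, int) and key in TAG_NAMES:
        return f"{key} ({TAG_NAMES[key]}): {value}"
    return f"{key}: {value}"


if __name__ == "__main__":
    print(label_image_field(306, "2023:12:31 12:34:56"))
    print(label_image_field("dpi", (300, 300)))
```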

🧪 Unit Testing

This project includes unit tests to ensure core functionality works correctly.

Running Tests

The tests use unittest, which ships with the Python standard library. Install the development dependencies first:

pip install -r requirements-dev.txt

To run the tests, execute:

python -m unittest test_scanner.py

What is Tested?

  • Extraction of PDF metadata using mocked PDF files
  • Parsing of XMP and RDF metadata
  • Extraction of image metadata from embedded images (JPEG, PNG)
  • Proper handling of non-image PDF objects
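A test against a mocked PDF might look like the following sketch. The function extract_info is a hypothetical stand-in, not a function known to exist in scanner.py; the docinfo attribute mirrors the name pikepdf uses for the Info dictionary, and the mock replaces a real PDF object so no file is needed.

```python
import unittest
from unittest.mock import MagicMock


def extract_info(pdf) -> dict[str, str]:
    """Hypothetical helper: flatten an Info-dictionary-like mapping to strings."""
    return {str(key): str(value) for key, value in pdf.docinfo.items()}


class ExtractInfoTest(unittest.TestCase):
    def test_reads_docinfo_from_mocked_pdf(self):
        fake_pdf = MagicMock()
        fake_pdf.docinfo = {"/Author": "John Doe", "/Title": "Sample"}
        self.assertEqual(
            extract_info(fake_pdf),
            {"/Author": "John Doe", "/Title": "Sample"},
        )

    def test_empty_docinfo_gives_empty_result(self):
        fake_pdf = MagicMock()
        fake_pdf.docinfo = {}
        self.assertEqual(extract_info(fake_pdf), {})
```

Saved in a test module, these cases run with python -m unittest like the existing tests.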

Adding Tests

Feel free to add more tests in test_scanner.py for new features or edge cases.

🔒 Notes

  • Some image formats in PDFs (e.g. CCITT, JBIG2) are skipped due to incompatibility.
  • PNG metadata (text fields) and EXIF from JPEG/TIFF are both supported.
  • This tool does not modify the PDFs — it only reads metadata.
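The skip rule for unsupported image encodings can be sketched as a check on the image stream's filter names. In PDF, CCITT and JBIG2 data are marked by the /CCITTFaxDecode and /JBIG2Decode filters. The helper below is hypothetical, and the tool's real skip list may include more than these two examples.

```python
# Filter names for the two encodings named in the notes above.
# Assumption: the tool's actual skip list may differ.
SKIPPED_FILTERS = {"/CCITTFaxDecode", "/JBIG2Decode"}


def should_skip_image(filters) -> bool:
    """Return True if any filter on an image stream is in the skip list.

    `filters` may be a single filter name or a sequence of names, since a
    PDF stream's /Filter entry can be either.
    """
    if isinstance(filters, str):
        filters = [filters]
    return any(str(name) in SKIPPED_FILTERS for name in filters)


if __name__ == "__main__":
    print(should_skip_image("/JBIG2Decode"))
    print(should_skip_image(["/FlateDecode", "/DCTDecode"]))
```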

📃 License

MIT License
