
A CLI tool for extracting and analyzing metadata from PDFs, including embedded images and XMP/RDF metadata.


PDF Metadata Scanner

A command-line tool to recursively scan folders for PDF files and extract:

  • PDF metadata (Info dictionary via pikepdf)
  • XMP and RDF metadata
  • Embedded image metadata (JPEG, PNG, TIFF — EXIF, text, and other supported fields)
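To make the XMP/RDF item concrete, here is a minimal sketch of pulling Dublin Core fields out of an XMP packet using only the standard library. It is not the scanner's own code: the helper name `extract_dublin_core` and the sample packet are hypothetical, and the packet is assumed to have already been read out of the PDF as a string.

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by the RDF and Dublin Core specifications.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample packet, shaped like typical XMP metadata.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Example</rdf:li></rdf:Alt></dc:title>
      <dc:creator><rdf:Seq><rdf:li>Jane Doe</rdf:li></rdf:Seq></dc:creator>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def extract_dublin_core(xmp_xml):
    """Return {"dc:<field>": [values]} for every dc:* element (hypothetical helper)."""
    root = ET.fromstring(xmp_xml)
    dc_prefix = "{" + DC_NS + "}"
    fields = {}
    for description in root.iter("{" + RDF_NS + "}Description"):
        for child in description:
            if not child.tag.startswith(dc_prefix):
                continue  # only Dublin Core properties are collected in this sketch
            name = "dc:" + child.tag[len(dc_prefix):]
            # Values usually sit in rdf:li items inside rdf:Alt, rdf:Seq, or rdf:Bag.
            values = [li.text or "" for li in child.iter("{" + RDF_NS + "}li")]
            if not values and child.text and child.text.strip():
                values = [child.text.strip()]  # plain-text property with no container
            fields[name] = values
    return fields


print(extract_dublin_core(SAMPLE_XMP))
```

Other XMP namespaces (for example `xmp:` or `pdf:`) could be handled the same way by matching on their namespace URIs.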

🛠 Features

  • 🔍 Recursive folder scanning
  • 🧼 Clean separation of metadata output and error/warning logs
  • 🖼 Embedded image metadata support via Pillow (JPEG, PNG, TIFF)
  • 📑 XMP/RDF metadata parsing
  • ⚙️ Optional progress bar and verbose logging
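As an illustration of the recursive scanning feature, the sketch below walks a folder tree with `pathlib` and yields every PDF it finds. The function name `find_pdfs` is hypothetical, and matching on a case-insensitive `.pdf` extension is an assumption; the scanner itself may select files differently.

```python
from pathlib import Path


def find_pdfs(folder):
    """Yield every file under `folder` whose extension is .pdf, ignoring case."""
    # rglob("*") descends into all subdirectories; sorting keeps the order stable.
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf":
            yield path


if __name__ == "__main__":
    for pdf_path in find_pdfs("."):
        print(pdf_path)
```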

📦 Installation

Install directly from PyPI:

pip install pdf-metadata-scanner

Or from source:

git clone https://github.com/yourname/pdf-metadata-scanner.git
cd pdf-metadata-scanner
pip install .

🚀 Usage

After installing:

pdfscan <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]

If running from source without installation:

python scanner.py <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]

Arguments:

Flag               Shorthand  Description                                   Default
folder             (none)     Folder to recursively scan for PDFs           required
--log LOG_FILE     -l         Log file for warnings/errors                  scanner_warnings.log
--out OUTPUT_FILE  -o         Output file for extracted metadata            pdf_metadata_output.txt
--verbose          -v         Output logs to both file and console          off
--progress         -p         Show a live progress bar while scanning PDFs  off
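For readers who want to see how the arguments above fit together, this is a sketch of an `argparse` parser that mirrors the documented flags, shorthands, and defaults. The function name `build_parser` and the description string are hypothetical; the tool's real parser may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented pdfscan arguments (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="pdfscan",
        description="Recursively scan a folder for PDF metadata.",
    )
    parser.add_argument("folder", help="Folder to recursively scan for PDFs")
    parser.add_argument("-l", "--log", metavar="LOG_FILE",
                        default="scanner_warnings.log",
                        help="Log file for warnings/errors")
    parser.add_argument("-o", "--out", metavar="OUTPUT_FILE",
                        default="pdf_metadata_output.txt",
                        help="Output file for extracted metadata")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Output logs to both file and console")
    parser.add_argument("-p", "--progress", action="store_true",
                        help="Show a live progress bar while scanning PDFs")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```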

🧾 Example

pdfscan ./documents --log logs.txt --out metadata.txt --verbose --progress
  • logs.txt: Contains only errors or warnings.
  • metadata.txt: Contains all extracted metadata.
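The split between the two files can be pictured with the standard `logging` module: warnings and errors go to the log file, and are echoed to the console only in verbose mode. This is a sketch under assumptions, not the scanner's implementation; the function name `configure_logging`, the logger name, and the message format are all hypothetical.

```python
import logging


def configure_logging(log_file, verbose=False):
    """Send warnings and errors to `log_file`; also echo to the console if verbose."""
    logger = logging.getLogger("pdfscan_sketch")  # hypothetical logger name
    logger.setLevel(logging.WARNING)  # drops INFO and DEBUG records
    logger.propagate = False          # keep records away from the root logger
    # Remove handlers from any earlier call so records are not written twice.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    formatter = logging.Formatter("%(levelname)s: %(message)s")
    file_handler = logging.FileHandler(log_file, encoding="utf-8")
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

    if verbose:
        console_handler = logging.StreamHandler()  # writes to stderr by default
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)
    return logger
```

Extracted metadata would be written to the output file separately, so it never mixes with these log records.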

📄 Output Format

[PDF Metadata] test.pdf
    /Author: Jane Doe
    /Title: Example Document

[XMP Metadata] test.pdf
    <dc:title>Example</dc:title>
    <dc:creator>Jane Doe</dc:creator>

[Image Metadata] test.pdf - Page 1 - Im0
    DateTimeOriginal: 2024:01:01 12:00:00
    DPI: (300, 300)
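The key/value sections above follow a simple pattern: a bracketed header naming the section and source, then one indented line per field. A minimal sketch of a formatter that produces this layout is shown below. The name `format_section` is hypothetical, and the sketch covers only the key/value sections, not the XMP section, which is shown as XML elements.

```python
def format_section(kind, source, fields):
    """Render one section: a "[kind] source" header plus indented "key: value" lines."""
    lines = [f"[{kind}] {source}"]
    lines.extend(f"    {key}: {value}" for key, value in fields.items())
    return "\n".join(lines) + "\n"


print(format_section("PDF Metadata", "test.pdf",
                     {"/Author": "Jane Doe", "/Title": "Example Document"}))
print(format_section("Image Metadata", "test.pdf - Page 1 - Im0",
                     {"DateTimeOriginal": "2024:01:01 12:00:00", "DPI": (300, 300)}))
```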

🧪 Testing

This project includes unit tests for core functionality.

Run tests:

pip install -r requirements-dev.txt
python -m unittest test_scanner.py

🔒 Notes

  • Some image formats (e.g. CCITT, JBIG2) are skipped due to decoding limitations.
  • Metadata from PNG, JPEG, and TIFF images is extracted where available.
  • The tool is read-only — it does not modify PDFs.
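One way to picture the skipping of CCITT and JBIG2 images is a check on the image stream's `/Filter` entry, where these codecs appear under the standard PDF filter names `/CCITTFaxDecode` and `/JBIG2Decode`. The sketch below is an assumption about how such a check could look, not a description of the scanner's logic; `should_skip_image` is a hypothetical helper that takes filter names as plain strings.

```python
# Standard PDF stream filter names for the two codecs mentioned in the notes.
SKIPPED_FILTERS = {"/CCITTFaxDecode", "/JBIG2Decode"}


def should_skip_image(filters):
    """Return True if any filter on the image stream is one this sketch cannot decode."""
    # A /Filter entry may be a single name or an array of names.
    if isinstance(filters, str):
        filters = [filters]
    return any(name in SKIPPED_FILTERS for name in filters)


print(should_skip_image("/DCTDecode"))                      # JPEG data
print(should_skip_image(["/FlateDecode", "/JBIG2Decode"]))  # chained filters
```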

🧾 License

MIT License
