A CLI tool for extracting and analyzing metadata from PDFs, including embedded images and XMP/RDF metadata.

PDF Metadata Scanner

A command-line tool to recursively scan folders for PDF files and extract:

PDF metadata (Info dictionary via pikepdf)
XMP and RDF metadata
Embedded image metadata (JPEG, PNG, TIFF — EXIF, text, and other supported fields)

🛠 Features

🔍 Recursive folder scanning
🧼 Clean separation of metadata output and error/warning logs
🖼 Embedded image metadata support via Pillow (JPEG, PNG, TIFF)
📑 XMP/RDF metadata parsing
⚙️ Optional progress bar and verbose logging
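
As an illustration of the recursive scanning feature, here is a minimal sketch of how PDF files under a folder might be collected with the standard library. The helper name `find_pdfs` is hypothetical and is not taken from the tool's source; the case-insensitive `.pdf` match is also an assumption.

```python
from pathlib import Path
from typing import Iterator


def find_pdfs(folder: str) -> Iterator[Path]:
    """Yield every PDF file under `folder`, including nested subfolders.

    Hypothetical helper for illustration; the real tool may walk the
    tree differently.
    """
    root = Path(folder)
    if not root.is_dir():
        raise NotADirectoryError(f"Not a folder: {folder}")
    # rglob("*") visits the whole tree. The suffix check is case-insensitive
    # (an assumption) so that files such as REPORT.PDF are also found.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf":
            yield path


if __name__ == "__main__":
    for pdf in find_pdfs("."):
        print(pdf)
```

Sorting the paths keeps the scan order stable between runs, which makes the output file easier to compare across scans.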

📦 Installation

Install directly from PyPI:

pip install pdf-metadata-scanner

Or from source:

git clone https://github.com/yourname/pdf-metadata-scanner.git
cd pdf-metadata-scanner
pip install .

🚀 Usage

After installing:

pdfscan <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]

If running from source without installation:

python scanner.py <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]

Arguments:

Argument	Shorthand	Description	Default
`folder`		Folder to recursively scan for PDFs	required
`--log LOG_FILE`	`-l`	Log file for warnings/errors	`scanner_warnings.log`
`--out OUTPUT_FILE`	`-o`	Output file for extracted metadata	`pdf_metadata_output.txt`
`--verbose`	`-v`	Output logs to both file and console	(off)
`--progress`	`-p`	Show a live progress bar while scanning PDFs	(off)
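
For readers who want to see how an interface like this maps onto code, the table above can be sketched with `argparse`. This is an assumption about structure, not the tool's actual source: the function name `build_parser` is hypothetical, and only the flag names, shorthands, and defaults come from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="pdfscan",
        description="Recursively scan a folder for PDFs and extract metadata.",
    )
    # Positional argument: required, no default.
    parser.add_argument("folder", help="Folder to recursively scan for PDFs")
    parser.add_argument(
        "-l", "--log", metavar="LOG_FILE", default="scanner_warnings.log",
        help="Log file for warnings/errors",
    )
    parser.add_argument(
        "-o", "--out", metavar="OUTPUT_FILE", default="pdf_metadata_output.txt",
        help="Output file for extracted metadata",
    )
    # Boolean switches are off unless given on the command line.
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Output logs to both file and console")
    parser.add_argument("-p", "--progress", action="store_true",
                        help="Show a live progress bar while scanning PDFs")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```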

🧾 Example

pdfscan ./documents --log logs.txt --out metadata.txt --verbose --progress

logs.txt: Contains only errors or warnings.
metadata.txt: Contains all extracted metadata.
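
The split between the log file and the metadata file, plus the `--verbose` behavior, could be implemented roughly as below. This is a sketch under assumptions: the logger name `pdfscan_sketch`, the function `configure_logging`, and the message format are all hypothetical and not taken from the tool.

```python
import logging


def configure_logging(log_file: str, verbose: bool = False) -> logging.Logger:
    """Send warnings and errors to `log_file`, and to the console if verbose."""
    logger = logging.getLogger("pdfscan_sketch")  # hypothetical logger name
    logger.setLevel(logging.WARNING)  # only warnings and errors are recorded
    logger.propagate = False  # keep records out of the root logger
    # Drop handlers from earlier calls so repeated setup does not duplicate lines.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    formatter = logging.Formatter("%(levelname)s: %(message)s")
    file_handler = logging.FileHandler(log_file, encoding="utf-8")
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

    if verbose:
        # StreamHandler writes to stderr by default, so console logs never
        # mix with metadata written elsewhere.
        console_handler = logging.StreamHandler()
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)
    return logger


if __name__ == "__main__":
    log = configure_logging("logs.txt", verbose=True)
    log.warning("example warning")
```

Because extracted metadata goes to its own output file rather than through the logger, the log file contains only problems encountered during the scan.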

📄 Output Format

[PDF Metadata] test.pdf
    /Author: Jane Doe
    /Title: Example Document

[XMP Metadata] test.pdf
    <dc:title>Example</dc:title>
    <dc:creator>Jane Doe</dc:creator>

[Image Metadata] test.pdf - Page 1 - Im0
    DateTimeOriginal: 2024:01:01 12:00:00
    DPI: (300, 300)
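
If the output file needs to be processed further, a small parser along these lines may help. It assumes the layout is exactly as in the sample above (a `[Kind] source` header line followed by indented entries); the `Section` class and `parse_sections` function are hypothetical names, and real output may contain cases this sketch does not handle.

```python
import re
from dataclasses import dataclass, field
from typing import List

# A header starts at column 0, e.g. "[PDF Metadata] test.pdf".
HEADER = re.compile(r"^\[(?P<kind>[^\]]+)\]\s+(?P<source>.+)$")


@dataclass
class Section:
    kind: str      # e.g. "PDF Metadata", "XMP Metadata", "Image Metadata"
    source: str    # text after the bracketed kind, e.g. "test.pdf - Page 1 - Im0"
    entries: List[str] = field(default_factory=list)


def parse_sections(text: str) -> List[Section]:
    """Group indented entry lines under the header line that precedes them."""
    sections: List[Section] = []
    current = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            current = Section(match["kind"], match["source"].strip())
            sections.append(current)
        elif line.strip() and current is not None:
            # Any non-blank, non-header line belongs to the current section.
            current.entries.append(line.strip())
    return sections


if __name__ == "__main__":
    with open("metadata.txt", encoding="utf-8") as handle:
        for section in parse_sections(handle.read()):
            print(section.kind, section.source, len(section.entries))
```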

🧪 Testing

This project includes unit tests for core functionality.

Run tests:

pip install -r requirements-dev.txt
python -m unittest test_scanner.py

🔒 Notes

Some image formats (e.g. CCITT, JBIG2) are skipped due to decoding limitations.
Metadata from PNG, JPEG, and TIFF images is extracted where available.
The tool is read-only — it does not modify PDFs.

🧾 License

MIT License

Release history

0.1.2 (this version): Jun 13, 2025
0.1.1: Jun 13, 2025

Download files

Source Distribution

pdf_metadata_scanner-0.1.2.tar.gz (5.2 kB), uploaded Jun 13, 2025 (Source)

Built Distribution

pdf_metadata_scanner-0.1.2-py3-none-any.whl (5.4 kB), uploaded Jun 13, 2025 (Python 3)

File details

Details for the file pdf_metadata_scanner-0.1.2.tar.gz.

File metadata

Download URL: pdf_metadata_scanner-0.1.2.tar.gz
Upload date: Jun 13, 2025
Size: 5.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for pdf_metadata_scanner-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`8cec251b3c72ed1f7b4b06994e04334ae794d19161a7f09c3379cca43fc32d3f`
MD5	`1759f1af3ee9dd512ff7a5285d8a414f`
BLAKE2b-256	`33b325b94c57f7f490fab0315b53cef3d806c39a0ee2947158834c7126ab1497`

File details

Details for the file pdf_metadata_scanner-0.1.2-py3-none-any.whl.

File metadata

Download URL: pdf_metadata_scanner-0.1.2-py3-none-any.whl
Upload date: Jun 13, 2025
Size: 5.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for pdf_metadata_scanner-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`be3398a93cd57922956b99eed797d7671009f4ce153f5af5a5819219e6994501`
MD5	`b61051aee45340e03815c4ef974695d8`
BLAKE2b-256	`db1c95271d2199d29c67f57de08b8781240f3bb0238417769a93eaeaca941734`
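
A downloaded file can be checked against the SHA256 digests listed above with the standard library. The sketch below is one way to do it; the function names `sha256_of` and `matches_published_hash` are hypothetical, and the file is assumed to be in the current directory.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and spaces."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


if __name__ == "__main__":
    # Expected value copied from the SHA256 row for the wheel above.
    expected = "be3398a93cd57922956b99eed797d7671009f4ce153f5af5a5819219e6994501"
    print(matches_published_hash("pdf_metadata_scanner-0.1.2-py3-none-any.whl", expected))
```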
