A CLI tool for extracting and analyzing metadata from PDFs, including embedded images and XMP/RDF metadata.
PDF Metadata Scanner
A command-line tool to recursively scan folders for PDF files and extract:
- PDF metadata (Info dictionary via pikepdf)
- XMP and RDF metadata
- Embedded image metadata (JPEG, PNG, TIFF — EXIF, text, and other supported fields)
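The recursive scan described above can be sketched with the standard library alone. This is a minimal illustration, not the tool's actual code: the function name `find_pdfs` is hypothetical, and matching the `.pdf` extension case-insensitively is an assumption about how files are selected.

```python
from pathlib import Path
from typing import Iterator


def find_pdfs(root: str) -> Iterator[Path]:
    """Yield every PDF file below root, in sorted path order.

    Hypothetical helper: the extension check is case-insensitive,
    which is an assumption and may differ from the real tool.
    """
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf":
            yield path
```

Calling `list(find_pdfs("./documents"))` would return the PDF paths found in `./documents` and all of its subfolders.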
🛠 Features
- 🔍 Recursive folder scanning
- 🧼 Clean separation of metadata output and error/warning logs
- 🖼 Embedded image metadata support via Pillow (JPEG, PNG, TIFF)
- 📑 XMP/RDF metadata parsing
- ⚙️ Optional progress bar and verbose logging
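One way to keep warnings and errors apart from the metadata output is a dedicated logger that writes to the log file and, in verbose mode, to the console as well. The sketch below uses only Python's standard `logging` module; the logger name `pdfscan`, the function name `configure_logging`, and the message format are assumptions, not taken from the project.

```python
import logging


def configure_logging(log_file: str, verbose: bool = False) -> logging.Logger:
    """Send warnings and errors to log_file, and to the console if verbose."""
    logger = logging.getLogger("pdfscan")  # hypothetical logger name
    logger.setLevel(logging.WARNING)       # drop info/debug messages
    logger.propagate = False               # keep messages out of the root logger

    # Remove handlers left over from an earlier call so output is not duplicated.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    formatter = logging.Formatter("%(levelname)s: %(message)s")

    file_handler = logging.FileHandler(log_file, encoding="utf-8")
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

    if verbose:
        console_handler = logging.StreamHandler()
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)

    return logger
```

Extracted metadata would then be written to the output file directly, so the log file only ever receives warnings and errors.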
📦 Installation
Install directly from PyPI:
pip install pdf-metadata-scanner
Or from source:
git clone https://github.com/yourname/pdf-metadata-scanner.git
cd pdf-metadata-scanner
pip install .
🚀 Usage
After installing:
pdfscan <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]
If running from source without installation:
python scanner.py <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]
Arguments:
| Flag | Shorthand | Description | Default |
|---|---|---|---|
| folder | | Folder to recursively scan for PDFs | required |
| --log LOG_FILE | -l | Log file for warnings/errors | scanner_warnings.log |
| --out OUTPUT_FILE | -o | Output file for extracted metadata | pdf_metadata_output.txt |
| --verbose | -v | Output logs to both file and console | (off) |
| --progress | -p | Show a live progress bar while scanning PDFs | (off) |
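For readers who want to see how such an interface maps onto code, the arguments in the table could be declared with `argparse` roughly as follows. This is a sketch built from the table only; `build_parser` is a hypothetical name, and the real `scanner.py` may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        prog="pdfscan",
        description="Recursively scan a folder for PDF metadata.",
    )
    parser.add_argument("folder", help="Folder to recursively scan for PDFs")
    parser.add_argument(
        "-l", "--log", metavar="LOG_FILE", default="scanner_warnings.log",
        help="Log file for warnings/errors",
    )
    parser.add_argument(
        "-o", "--out", metavar="OUTPUT_FILE", default="pdf_metadata_output.txt",
        help="Output file for extracted metadata",
    )
    parser.add_argument(
        "-v", "--verbose", action="store_true",
        help="Output logs to both file and console",
    )
    parser.add_argument(
        "-p", "--progress", action="store_true",
        help="Show a live progress bar while scanning PDFs",
    )
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["./documents", "-l", "logs.txt", "--progress"])
```

Here `args.out` falls back to its default because `--out` was not given, and `args.verbose` stays off.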
🧾 Example
pdfscan ./documents --log logs.txt --out metadata.txt --verbose --progress
- logs.txt: Contains only errors or warnings.
- metadata.txt: Contains all extracted metadata.
📄 Output Format
[PDF Metadata] test.pdf
/Author: Jane Doe
/Title: Example Document
[XMP Metadata] test.pdf
<dc:title>Example</dc:title>
<dc:creator>Jane Doe</dc:creator>
[Image Metadata] test.pdf - Page 1 - Im0
DateTimeOriginal: 2024:01:01 12:00:00
DPI: (300, 300)
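The key/value sections shown above (PDF Metadata and Image Metadata) follow a simple pattern: a bracketed header with the source, then one `key: value` line per field. A small formatter for that pattern might look like the sketch below; `format_section` is a hypothetical helper, and the XMP section, which prints raw XML lines, is not covered by it.

```python
from typing import Mapping


def format_section(kind: str, source: str, fields: Mapping[str, object]) -> str:
    """Render one metadata section as a header line plus key/value lines."""
    lines = [f"[{kind}] {source}"]
    lines.extend(f"{key}: {value}" for key, value in fields.items())
    return "\n".join(lines)


section = format_section(
    "Image Metadata",
    "test.pdf - Page 1 - Im0",
    {"DateTimeOriginal": "2024:01:01 12:00:00", "DPI": (300, 300)},
)
```

Printing `section` reproduces the image metadata block from the sample output.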
🧪 Testing
This project includes unit tests for core functionality.
Run tests:
pip install -r requirements-dev.txt
python -m unittest test_scanner.py
🔒 Notes
- Some image formats (e.g. CCITT, JBIG2) are skipped due to decoding limitations.
- PNG and JPEG/TIFF metadata is extracted where available.
- The tool is read-only — it does not modify PDFs.
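In PDF files, CCITT and JBIG2 image data are identified by the stream filters `/CCITTFaxDecode` and `/JBIG2Decode`. A check for skipping such images could be as simple as the sketch below. How the tool actually makes this decision is not documented here, so `should_skip_image` and its list-of-names input are assumptions for illustration.

```python
from typing import Iterable

# Stream filters for the image encodings the notes above list as skipped.
UNSUPPORTED_FILTERS = frozenset({"/CCITTFaxDecode", "/JBIG2Decode"})


def should_skip_image(filters: Iterable[str]) -> bool:
    """Return True if any filter on the image stream is unsupported.

    `filters` is assumed to be the image's /Filter entry, already
    normalized to a list of name strings.
    """
    return any(name in UNSUPPORTED_FILTERS for name in filters)
```

For example, `should_skip_image(["/DCTDecode"])` is false for an ordinary JPEG stream, while a stream filtered with `/JBIG2Decode` would be skipped.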
🧾 License
MIT License