A CLI tool for extracting and analyzing metadata from PDFs, including embedded images and XMP/RDF metadata.
PDF Metadata Scanner
A Python tool to recursively scan a folder for PDF files and extract:
- PDF metadata (Info dictionary via pikepdf)
- XMP and RDF metadata
- Metadata from embedded images (JPEG, PNG, TIFF — EXIF, text, and other supported fields)
🛠 Features
- Recursive folder scanning
- Clean separation of logs (warnings/errors) and metadata output
- Supports multiple image formats (via Pillow)
- Handles XMP/RDF and embedded image metadata
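Recursive folder scanning can be sketched with `pathlib` as follows. The function name `iter_pdfs` is hypothetical, and the case-insensitive suffix match is an assumption about how the scanner selects files.

```python
from pathlib import Path
from typing import Iterator


def iter_pdfs(folder: str) -> Iterator[Path]:
    """Yield every PDF under `folder`, at any depth, in sorted order."""
    root = Path(folder)
    if not root.is_dir():
        raise NotADirectoryError(f"not a folder: {folder}")
    # rglob("*") walks all subdirectories; the suffix check is
    # case-insensitive so "REPORT.PDF" is picked up too.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf":
            yield path
```

Using a generator means a progress display can start before the whole tree has been walked, although a total count then requires a separate pass.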
✅ Requirements
pip install -r requirements.txt
🚀 Usage
python scanner.py <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]
Or, to install the package and run it as a command:
pip install .
pdfscan <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]
Arguments:
| Flag | Description | Default |
|---|---|---|
| folder | Folder to recursively scan for PDFs | required |
| --log | Log file for warnings/errors | scanner_warnings.log |
| --out | Output file for extracted metadata | pdf_metadata_output.txt |
| --verbose | Output logs to both file and console | |
| --progress | Show a live progress bar while scanning PDFs | |
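For reference, the arguments above map onto `argparse` roughly as shown here. This is a sketch reconstructed from the table, not the scanner's actual parser; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="pdfscan",
        description="Recursively scan a folder for PDFs and extract metadata.",
    )
    parser.add_argument("folder", help="Folder to recursively scan for PDFs")
    parser.add_argument(
        "--log",
        metavar="LOG_FILE",
        default="scanner_warnings.log",
        help="Log file for warnings/errors",
    )
    parser.add_argument(
        "--out",
        metavar="OUTPUT_FILE",
        default="pdf_metadata_output.txt",
        help="Output file for extracted metadata",
    )
    # Both switches are off unless given on the command line.
    parser.add_argument("--verbose", action="store_true",
                        help="Output logs to both file and console")
    parser.add_argument("--progress", action="store_true",
                        help="Show a live progress bar while scanning PDFs")
    return parser


args = build_parser().parse_args(["./documents", "--out", "metadata.txt"])
print(args.folder, args.log, args.out, args.verbose, args.progress)
```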
🧾 Example
python scanner.py ./documents --log logs.txt --out metadata.txt
- logs.txt: Contains only errors or warnings.
- metadata.txt: Contains all extracted metadata.
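One way to keep warnings and errors out of the metadata file is to give them their own `logging` handler. The sketch below assumes a logger named `pdfscan` and a helper called `configure_logging`; both names are hypothetical, and the scanner's real setup may differ.

```python
import logging


def configure_logging(log_file: str, verbose: bool = False) -> logging.Logger:
    """Send warnings/errors to `log_file`, and to the console if `verbose`."""
    logger = logging.getLogger("pdfscan")  # hypothetical logger name
    logger.setLevel(logging.WARNING)       # INFO/DEBUG records are dropped
    logger.propagate = False               # keep records out of the root logger
    # Drop handlers from any earlier call so messages are not duplicated.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    file_handler = logging.FileHandler(log_file, encoding="utf-8")
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)
    if verbose:
        console = logging.StreamHandler()  # writes to stderr by default
        console.setFormatter(formatter)
        logger.addHandler(console)
    return logger
```

Extracted metadata is then written to the output file directly rather than through the logger, so the two streams never mix.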
📦 Output Structure
Metadata output (--out) includes:
[PDF Metadata] ...
/Author: John Doe
/Title: Sample
[XMP Metadata] ...
[Image Metadata] ...
306: 2023:12:31 12:34:56
dpi: (300, 300)
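The layout above amounts to a bracketed section title followed by one key/value pair per line. A minimal sketch of a formatter for that layout follows; `format_section` is a hypothetical name, and the text that follows each title in the sample (shown as `...`) is left out because its content is not specified here.

```python
from typing import Mapping


def format_section(title: str, fields: Mapping[object, object]) -> str:
    """Render one '[Title]' line followed by 'key: value' lines."""
    lines = [f"[{title}]"]
    lines.extend(f"{key}: {value}" for key, value in fields.items())
    return "\n".join(lines)


print(format_section("PDF Metadata", {"/Author": "John Doe", "/Title": "Sample"}))
print(format_section("Image Metadata",
                     {306: "2023:12:31 12:34:56", "dpi": (300, 300)}))
```

The numeric key 306 in the sample is the TIFF/EXIF tag ID for DateTime; EXIF fields appear under their numeric tag IDs rather than tag names.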
🧪 Unit Testing
This project includes unit tests to ensure core functionality works correctly.
Running Tests
unittest ships with the Python standard library. Install the remaining development dependencies with:
pip install -r requirements-dev.txt
To run the tests, execute:
python -m unittest test_scanner.py
What is Tested?
- Extraction of PDF metadata using mocked PDF files
- Parsing of XMP and RDF metadata
- Extraction of image metadata from embedded images (JPEG, PNG)
- Proper handling of non-image PDF objects
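As an illustration of the mocked-PDF approach, a test can replace the PDF object with a `MagicMock` so that no real file is needed. The function `extract_info` below is a hypothetical stand-in defined only for this example; the tests in test_scanner.py target the scanner's own functions, whose names may differ.

```python
import unittest
from unittest.mock import MagicMock


def extract_info(pdf) -> dict:
    """Hypothetical stand-in for the scanner's Info-dictionary reader."""
    return {str(key): str(value) for key, value in pdf.docinfo.items()}


class ExtractInfoTest(unittest.TestCase):
    def test_reads_docinfo_from_mocked_pdf(self):
        pdf = MagicMock()
        # The mock only needs the one attribute the function touches.
        pdf.docinfo = {"/Author": "John Doe", "/Title": "Sample"}
        self.assertEqual(
            extract_info(pdf),
            {"/Author": "John Doe", "/Title": "Sample"},
        )

    def test_empty_docinfo_gives_empty_dict(self):
        pdf = MagicMock()
        pdf.docinfo = {}
        self.assertEqual(extract_info(pdf), {})
```

Saved in a test module, these cases run with the same `python -m unittest` command shown above.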
Adding Tests
Feel free to add more tests in test_scanner.py for new features or edge cases.
🔒 Notes
- Some image formats in PDFs (e.g. CCITT, JBIG2) are skipped due to incompatibility.
- PNG metadata (text fields) and EXIF from JPEG/TIFF are both supported.
- This tool does not modify the PDFs — it only reads metadata.
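The skip rule in the first note can be expressed as a check on an image stream's `/Filter` entry. `/CCITTFaxDecode` and `/JBIG2Decode` are the filter names the PDF format uses for those encodings; the helper names and the assumption that the scanner decides by filter name are illustrative only.

```python
SKIPPED_FILTERS = {"/CCITTFaxDecode", "/JBIG2Decode"}


def normalize_filters(filter_entry) -> list[str]:
    """A PDF /Filter entry is absent, a single name, or an array of names."""
    if filter_entry is None:
        return []
    if isinstance(filter_entry, (list, tuple)):
        return [str(name) for name in filter_entry]
    return [str(filter_entry)]


def should_skip(filter_entry) -> bool:
    """True if any filter in the chain is one the scanner cannot handle."""
    return any(name in SKIPPED_FILTERS
               for name in normalize_filters(filter_entry))


print(should_skip("/DCTDecode"))                       # JPEG data
print(should_skip(["/FlateDecode", "/JBIG2Decode"]))   # chained filters
```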
📃 License
MIT License
File details
Details for the file pdf_metadata_scanner-0.1.1.tar.gz.
File metadata
- Download URL: pdf_metadata_scanner-0.1.1.tar.gz
- Upload date:
- Size: 5.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9b1bdfe1dd0c6a562de1ed2223cac2b6c3fb9c5cbc37e59d1208d54fb33987e2 |
| MD5 | 0f306793e374b7f61187ad26f4cadb58 |
| BLAKE2b-256 | 65b0145dcbb9781b4dd1566238048b295d82f904945e21b64fd415df085cb9cb |
File details
Details for the file pdf_metadata_scanner-0.1.1-py3-none-any.whl.
File metadata
- Download URL: pdf_metadata_scanner-0.1.1-py3-none-any.whl
- Upload date:
- Size: 5.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e3f337edb1813b3d1f0ad6465fb3afc207bade0e2ac555104fca66c36583700f |
| MD5 | 48cdbe51c0d95993d61ff9b6d60a4a7e |
| BLAKE2b-256 | 0975b0217d77afbfbc25bf1969bf201ede6396dc48af1979765b48aa080d63a0 |