
A simple PDF forensic toolkit using pdfresurrect and bash utilities


📄 pdforensic

A lightweight Python toolkit for forensic analysis of PDF files using pdfresurrect and Unix shell utilities.

pdforensic wraps common PDF forensic techniques into an easy-to-use Python and CLI interface, allowing you to extract metadata, recover previous versions, count EOF markers, and inspect version layers of PDF files.

Project Directory

.
├── bin
│   ├── pdfresurrect
│   └── pdfresurrect.1
├── LICENSE
├── pdforensic
│   └── __init__.py
├── README.md
├── setup.py
└── tests
    ├── pdf-to-test
    │   ├── pdf-to-test-a.pdf
    │   └── pdf-to-test-b.pdf
    ├── test_check_eof_markers.py
    ├── test_check_versions.py
    ├── test_retrieve_all-versions.py
    └── test_retrieve_metadata.py

The Package

This package is built for Unix-like systems, i.e., Linux and macOS.

You will require Python 3.13.x to work with this package.

This package is an ongoing experiment to understand how to check for an edited PDF and automate the process.

It is built on top of pdfresurrect, a C tool that reads a PDF at its lowest level. It extracts metadata and object streams, checks for previous versions by examining the cross-referencing of streams, and can also write out previous versions.

The pdfresurrect functionalities have been wrapped to be reusable quickly with Python.
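As an illustration of what such a wrapper might look like, the sketch below shells out to `pdfresurrect -q` (the flag that reports only the number of versions) and parses the result. The names `parse_version_count` and `count_versions_via_pdfresurrect` are hypothetical and are not pdforensic's actual internals. The parser also assumes that the version count is the last integer the tool prints.

```python
import re
import shutil
import subprocess


def parse_version_count(output: str) -> int:
    """Return the last integer found in the output of `pdfresurrect -q`.

    Assumption: the version count is the final number printed
    (for example "file.pdf: 2"). Adjust if your build prints differently.
    """
    numbers = re.findall(r"\d+", output)
    if not numbers:
        raise ValueError(f"no version count found in: {output!r}")
    return int(numbers[-1])


def count_versions_via_pdfresurrect(pdf_path: str, binary: str = "pdfresurrect") -> int:
    """Run `<binary> -q <pdf_path>` and return the reported version count."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    result = subprocess.run(
        [binary, "-q", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_version_count(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the parser easy to test without the binary installed.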

An additional check from my PDF research counts %%EOF markers. If no marker is present, the PDF is corrupted or not properly formatted. In that research, a linearized, original, or freshly saved PDF has one %%EOF marker; more than one means the PDF was modified after it was first saved and may have been tampered with.
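The marker check can be sketched in a few lines of plain Python. This is a minimal illustration, not the package's own implementation: the names `count_eof_markers_sketch` and `interpret_eof_count` are hypothetical, and the byte search would also count a `%%EOF` sequence that happens to appear inside stream data.

```python
from pathlib import Path


def count_eof_markers_sketch(pdf_path: str) -> int:
    """Count occurrences of the b"%%EOF" marker in the raw file bytes."""
    data = Path(pdf_path).read_bytes()
    return data.count(b"%%EOF")


def interpret_eof_count(count: int) -> str:
    """Map a marker count to the interpretation described above."""
    if count == 0:
        return "no %%EOF marker: corrupted or malformed"
    if count == 1:
        return "one %%EOF marker: consistent with a single save"
    return f"{count} %%EOF markers: modified after creation, possible tampering"
```

Reading the whole file into memory is fine for typical documents. For very large PDFs, a chunked read would be a reasonable alternative.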

You can therefore use these functions to build a PDF verification algorithm, provided you know what type of PDF file you will be processing, by comparing its known properties against those of new incoming PDFs.
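One possible shape for such a comparison is sketched below. It assumes both profiles are dictionaries shaped like the one `extract_pdf_metadata` returns. The function name `compare_to_baseline` and the default choice of keys are illustrative assumptions, not part of the package.

```python
def compare_to_baseline(
    baseline: dict,
    candidate: dict,
    keys: tuple = ("Producer", "PDF Version", "Versions"),
) -> dict:
    """Return {key: (expected, actual)} for every key whose values differ."""
    mismatches = {}
    for key in keys:
        expected = baseline.get(key)
        actual = candidate.get(key)
        if expected != actual:
            mismatches[key] = (expected, actual)
    return mismatches


# Hypothetical profiles: a known-good document and a new incoming one.
baseline = {"Versions": "1", "PDF Version": "1.4", "Producer": "Skia/PDF"}
incoming = {"Versions": "2", "PDF Version": "1.4", "Producer": "Skia/PDF"}

print(compare_to_baseline(baseline, incoming))  # → {'Versions': ('1', '2')}
```

An empty result means the incoming file matches the baseline on every compared property. Which properties are worth comparing depends on the kind of PDF you expect.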

🐍 Using the Python Package

You can use pdforensic directly from Python code by importing its core functions.

📦 Importable Functions

from pdforensic import (
    extract_pdf_metadata,
    recover_pdf_versions,
    count_pdf_eof_markers,
    check_no_of_versions
)

1. Extract PDF Metadata

from pdforensic import extract_pdf_metadata

metadata = extract_pdf_metadata("tests/pdf-to-test/pdf-to-test-a.pdf")
print(metadata)

Returns:

{
  'Versions': '1',
  'PDF Version': '1.4',
  'Title': 'My Document',
  'Producer': 'Skia/PDF',
  ...
}

2. Recover Previous Versions

from pdforensic import recover_pdf_versions

message = recover_pdf_versions("tests/pdf-to-test/pdf-to-test-b.pdf")
print(message)

Example output:

Recovered 2 version(s). Found in: pdf-to-test-b-versions/

3. Count %%EOF Markers

from pdforensic import count_pdf_eof_markers

count = count_pdf_eof_markers("tests/pdf-to-test/pdf-to-test-a.pdf")
print(f"EOF markers: {count}")

4. Check Number of PDF Versions

from pdforensic import check_no_of_versions

num_versions = check_no_of_versions("tests/pdf-to-test/pdf-to-test-a.pdf")
print(f"PDF contains {num_versions} version(s).")

🧪 Pro Tip: You can integrate these tools into a PDF auditing script or pipeline for digital forensics, penetration testing, academic research, or version tracking.
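A batch audit along these lines might look like the sketch below. The function `audit_pdfs` is hypothetical, and the two counting functions are passed in as parameters so the example stays self-contained. It also assumes that their return values convert cleanly to integers.

```python
from pathlib import Path
from typing import Callable, Iterable


def audit_pdfs(
    paths: Iterable[str],
    count_eof: Callable[[str], int],
    count_versions: Callable[[str], int],
) -> list:
    """Build one report row per PDF and flag files that look modified."""
    report = []
    for path in paths:
        eof = int(count_eof(path))
        versions = int(count_versions(path))
        report.append(
            {
                "file": Path(path).name,
                "eof_markers": eof,
                "versions": versions,
                # Flag anything other than exactly one marker and one version.
                "suspicious": eof != 1 or versions != 1,
            }
        )
    return report
```

With pdforensic installed, you would presumably pass `count_pdf_eof_markers` and `check_no_of_versions` as the two callables, for example `audit_pdfs(pdf_paths, count_pdf_eof_markers, check_no_of_versions)`.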

📦 Using the CLI

🔧 Command-Line Interface (CLI)

Once installed (with pip install -e .), the following CLI commands are available:

Command         Description

pdf-meta        Extract metadata from a PDF file
pdf-recover     Recover previous versions of a PDF
pdf-eof         Count %%EOF markers in a PDF
pdf-versions    Check number of versions in a PDF (via pdfresurrect's -q flag)

Example Usage

pdf-meta <pdf_path>

pdf-recover <pdf_path>

pdf-eof <pdf_path>

pdf-versions <pdf_path>

🛠️ Installation

Option 1: Clone and Install from GitHub (Recommended)

git clone https://github.com/genie360s/pdforensic.git
cd pdforensic
pip install -e .
# This installs pdforensic in editable mode, so any changes you make to the code take effect immediately.

Option 2: Install with Dev Dependencies (for testing and development)

pip install -e '.[dev]'
# This also installs the testing tools, i.e., pytest.

Option 3: Install directly via GitHub URL (no clone)

pip install git+https://github.com/genie360s/pdforensic.git

📜 License

MIT License © 2025 Alex Mkwizu
