Visual diff for born-digital PDFs — highlights changes directly on the original pages
pdfdelta
pdfdelta is a lightweight visual diff tool for born-digital PDFs. It is designed to compare revisions of academic papers and technical documents by highlighting changes directly on the original pages.
The tool generates two annotated files: deletions are marked on the old version, and additions are marked on the new version.
Capabilities
- Handles multi-column layouts and complex papers.
- Ignores spurious changes caused by text reflowing across lines, paragraphs, or page breaks.
- Is not confused by figures, tables, or math formulas that move between revisions.
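A minimal sketch of why reflow need not register as a change, assuming the comparison runs over a flat word sequence rather than over lines (the pipeline under How It Works points this way). The function name `changed_words` is hypothetical and not part of pdfdelta; only the standard-library `difflib` is used.

```python
from difflib import SequenceMatcher


def changed_words(old_text: str, new_text: str) -> tuple[list[str], list[str]]:
    """Return (deleted, added) words, ignoring where lines break."""
    # str.split() with no arguments splits on any whitespace, so a newline
    # and a space are treated the same way.
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    deleted, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.extend(old_words[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(new_words[j1:j2])
    return deleted, added


# Same sentence, wrapped differently, with one real edit (lazy -> sleepy).
old = "The quick brown fox\njumps over the lazy dog."
new = "The quick brown\nfox jumps over the sleepy dog."
print(changed_words(old, new))  # (['lazy'], ['sleepy'])
```

The moved line break produces no output at all; only the substituted word is reported.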
Installation
pip install pdfdelta
If you want to install directly from the repository:
pip install git+https://github.com/mli55/pdfdelta.git
Usage
pdfdelta old.pdf new.pdf
This writes two annotated files:
- old_marked.pdf: original pages with deletions highlighted
- new_marked.pdf: revised pages with additions highlighted
Options
| Flag | Default | Description |
|---|---|---|
| --old-out | old_marked.pdf | Output path for the annotated old PDF |
| --new-out | new_marked.pdf | Output path for the annotated new PDF |
| --opacity | 0.35 | Highlight opacity (0.0–1.0) |
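For scripted or batch use, the command line can be assembled from Python. This is a sketch only: `build_command` and `run_pdfdelta` are hypothetical helpers, not part of pdfdelta, and the code assumes each flag takes its value as the next argument (the usual convention). It uses only the flags listed above.

```python
import subprocess


def build_command(old_pdf, new_pdf, old_out=None, new_out=None, opacity=None):
    """Assemble the pdfdelta argument list; options left as None keep the CLI defaults."""
    cmd = ["pdfdelta", old_pdf, new_pdf]
    if old_out is not None:
        cmd += ["--old-out", old_out]
    if new_out is not None:
        cmd += ["--new-out", new_out]
    if opacity is not None:
        # The documented range for --opacity is 0.0 to 1.0.
        if not 0.0 <= opacity <= 1.0:
            raise ValueError("opacity must be between 0.0 and 1.0")
        cmd += ["--opacity", str(opacity)]
    return cmd


def run_pdfdelta(*args, **kwargs):
    """Run the CLI; requires pdfdelta to be installed and on PATH."""
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(build_command(*args, **kwargs), check=True)


print(build_command("v1.pdf", "v2.pdf", old_out="v1_marked.pdf", opacity=0.5))
# ['pdfdelta', 'v1.pdf', 'v2.pdf', '--old-out', 'v1_marked.pdf', '--opacity', '0.5']
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.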
Features
- Direct Annotation: Highlights changes as native PDF annotations on the original layout.
- Layout Aware: Optimized for multi-column papers and technical reports.
- Noise Reduction: Filters out visual artifacts caused by simple text reflow across lines or pages.
- Structural Support: Handles figures and tables, including ones that move between revisions.
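To illustrate what a native highlight annotation looks like in code, here is a rough sketch using PyMuPDF. How It Works names PyMuPDF for word extraction; using it for the highlight step as well is an assumption here, and `merge_boxes` and `highlight` are hypothetical names, not pdfdelta functions.

```python
def merge_boxes(boxes):
    """Union of (x0, y0, x1, y1) boxes: one rectangle covering all of them."""
    x0s, y0s, x1s, y1s = zip(*boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))


def highlight(pdf_in, pdf_out, page_number, boxes, opacity=0.35):
    """Add one highlight annotation per box to a page and save a copy."""
    import fitz  # PyMuPDF; imported lazily so merge_boxes works without it

    doc = fitz.open(pdf_in)
    page = doc[page_number]
    for box in boxes:
        annot = page.add_highlight_annot(fitz.Rect(*box))
        annot.set_opacity(opacity)
        annot.update()  # regenerate the annotation's appearance
    doc.save(pdf_out)
    doc.close()


# Two adjacent changed words on one line collapse into a single rectangle.
print(merge_boxes([(72, 100, 110, 112), (114, 100, 150, 112)]))
# (72, 100, 150, 112)
```

Because the highlight is an annotation layered on the page, the underlying text and layout of the original PDF stay untouched.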
How It Works
old.pdf      new.pdf
   │            │
   ▼            ▼
┌──────────────────┐
│  Extract words   │  PyMuPDF: word text + bounding boxes
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Global diff     │  Flatten all pages → SequenceMatcher
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Word-level diff │  Per-word & sub-word precision
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Reflow filter   │  Suppress cross-page / cross-column noise
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Annotate PDFs   │  Highlights on original pages
└────────┬─────────┘
         ▼
  old_marked.pdf
  new_marked.pdf
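The middle stages of the pipeline can be sketched with the standard library alone. This is an illustration under assumptions, not pdfdelta's actual code: the names `Word`, `flatten`, `global_diff`, and `changed_chars` are hypothetical, and the input tuples mimic the first five fields that PyMuPDF's `page.get_text("words")` returns for each word (x0, y0, x1, y1, text).

```python
from difflib import SequenceMatcher
from typing import NamedTuple


class Word(NamedTuple):
    page: int
    box: tuple  # (x0, y0, x1, y1)
    text: str


def flatten(pages):
    """pages: list of per-page lists of (x0, y0, x1, y1, text) tuples."""
    return [Word(n, tuple(w[:4]), w[4]) for n, words in enumerate(pages) for w in words]


def global_diff(old_words, new_words):
    """Return (deleted, added) Word lists from one document-wide diff."""
    matcher = SequenceMatcher(
        a=[w.text for w in old_words], b=[w.text for w in new_words], autojunk=False
    )
    deleted, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            deleted.extend(old_words[i1:i2])  # empty for pure insertions
            added.extend(new_words[j1:j2])    # empty for pure deletions
    return deleted, added


def changed_chars(old_text, new_text):
    """Sub-word step: character spans of new_text that differ from old_text."""
    matcher = SequenceMatcher(a=old_text, b=new_text, autojunk=False)
    return [(j1, j2) for tag, _, _, j1, j2 in matcher.get_opcodes()
            if tag in ("insert", "replace")]


# "a" moves from page 0 to page 1 in the new version; only one word really changes.
old_pages = [[(72, 100, 90, 112, "We"), (94, 100, 130, 112, "propose"), (134, 100, 140, 112, "a")],
             [(72, 100, 110, 112, "method.")]]
new_pages = [[(72, 100, 90, 112, "We"), (94, 100, 130, 112, "present")],
             [(72, 100, 80, 112, "a"), (84, 100, 120, 112, "method.")]]

deleted, added = global_diff(flatten(old_pages), flatten(new_pages))
print([(w.page, w.text) for w in deleted])  # [(0, 'propose')]
print([(w.page, w.text) for w in added])    # [(0, 'present')]
print(changed_chars("color", "colour"))     # [(4, 5)]
```

Each reported `Word` keeps its page number and bounding box, which is what an annotation step would need to place a highlight. The sketch has no separate reflow filter; the word that moved to the next page simply matches in the flattened sequence.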
License
MIT
File details
Details for the file pdfdelta-0.1.1.tar.gz.
File metadata
- Download URL: pdfdelta-0.1.1.tar.gz
- Upload date:
- Size: 11.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8c373ee80ad49d3e8e33717d7658885dfde3a139ec377d2edc6a9c062474a55a |
| MD5 | f67e003b1f2061843a87ab720346ecc8 |
| BLAKE2b-256 | a6b534550d55cf75a11dd702f9b356d6a40aaae221cddfb546994668ce4bc490 |
File details
Details for the file pdfdelta-0.1.1-py3-none-any.whl.
File metadata
- Download URL: pdfdelta-0.1.1-py3-none-any.whl
- Upload date:
- Size: 12.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 13a06dfabb45d6b06140aaa3f9aad09fcb0bed8c1f3fd5231a3233dd960c9e6a |
| MD5 | 40547e63f7c67c58cce63cd56a499742 |
| BLAKE2b-256 | 4e94172804520fcadb0f690352e51b1861047df456a6c29de2cd27bf15e62e06 |