Visual diff for born-digital PDFs — highlights changes directly on the original pages

pdfdelta

pdfdelta is a lightweight visual diff tool for born-digital PDFs. It is designed to compare revisions of academic papers and technical documents by highlighting changes directly on the original pages.

The tool generates two annotated files: deletions are marked on the old version, and additions are marked on the new version.

[Screenshots: old PDF with deletions highlighted; new PDF with additions highlighted]

Capabilities

  • Works with multi-column layouts and complex papers.
  • Skips spurious changes caused by text reflowing across line, paragraph, or page breaks.
  • Is not confused by figures, tables, or math formulas that move between revisions.

Installation

pip install pdfdelta

If you want to install directly from the repository:

pip install git+https://github.com/mli55/pdfdelta.git

Usage

pdfdelta old.pdf new.pdf

This writes two annotated files:

  • old_marked.pdf — original pages with deletions highlighted
  • new_marked.pdf — revised pages with additions highlighted

Options

Flag        Default          Description
--old-out   old_marked.pdf   Output path for the annotated old PDF
--new-out   new_marked.pdf   Output path for the annotated new PDF
--opacity   0.35             Highlight opacity (0.0–1.0)
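
For batch use, the command can also be driven from Python. The sketch below is a thin wrapper that relies only on the three flags listed above. The helper names `build_command` and `run_pdfdelta` and the `out_dir` parameter are hypothetical and not part of pdfdelta. It also assumes, without confirmation, that the CLI exits with a non-zero status on failure.

```python
import subprocess
from pathlib import Path


def build_command(old_pdf, new_pdf, out_dir=None, opacity=None):
    """Return the argv list for one pdfdelta run, using only the documented flags."""
    cmd = ["pdfdelta", str(old_pdf), str(new_pdf)]
    if out_dir is not None:
        # Redirect both annotated files into one directory (hypothetical convenience).
        out_dir = Path(out_dir)
        cmd += ["--old-out", str(out_dir / "old_marked.pdf")]
        cmd += ["--new-out", str(out_dir / "new_marked.pdf")]
    if opacity is not None:
        # The options table gives the valid range as 0.0 to 1.0.
        if not 0.0 <= opacity <= 1.0:
            raise ValueError("opacity must be between 0.0 and 1.0")
        cmd += ["--opacity", str(opacity)]
    return cmd


def run_pdfdelta(old_pdf, new_pdf, **kwargs):
    """Run pdfdelta; raises CalledProcessError if the exit status is non-zero (assumed)."""
    subprocess.run(build_command(old_pdf, new_pdf, **kwargs), check=True)
```

Calling `run_pdfdelta("old.pdf", "new.pdf", out_dir="review", opacity=0.5)` would then write both annotated files under `review/`. Whether pdfdelta creates a missing output directory is not documented here, so create it beforehand.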

Features

  • Direct Annotation: Highlights changes as native PDF annotations on the original layout.
  • Layout Aware: Optimized for multi-column papers and technical reports.
  • Noise Reduction: Filters out visual artifacts caused by simple text reflow across lines or pages.
  • Structural Support: Handles figures and tables that move between revisions without flagging them as changes.

How It Works

 old.pdf    new.pdf
   │           │
   ▼           ▼
┌──────────────────┐
│  Extract words   │  PyMuPDF: word text + bounding boxes
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Global diff     │  Flatten all pages → SequenceMatcher
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Word-level diff │  Per-word & sub-word precision
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Reflow filter   │  Suppress cross-page / cross-column noise
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Annotate PDFs   │  Highlights on original pages
└────────┬─────────┘
         ▼
 old_marked.pdf
 new_marked.pdf

License

MIT
