Visual diff for born-digital PDFs — highlights changes directly on the original pages

pdfdelta

pdfdelta is a lightweight visual diff tool for born-digital PDFs. It is designed to compare revisions of academic papers and technical documents by highlighting changes directly on the original pages.

The tool generates two annotated files: deletions are marked on the old version, and additions are marked on the new version.

[Screenshots: old PDF with deletions highlighted; new PDF with additions highlighted]

Capabilities

  • Handles multi-column layouts and complex papers.
  • Skips "fake" changes caused by text reflowing to a new line or across paragraph and page breaks.
  • Is not confused by figures, tables, or math formulas that move between revisions.

Installation

Via PyPI:

pip install pdfdelta

If you want to install directly from the repository:

pip install git+https://github.com/mli55/pdfdelta.git

Usage

pdfdelta old.pdf new.pdf

This writes two annotated files:

  • old_marked.pdf — original pages with deletions highlighted
  • new_marked.pdf — revised pages with additions highlighted

Options

Flag        Default          Description
--old-out   old_marked.pdf   Output path for the annotated old PDF
--new-out   new_marked.pdf   Output path for the annotated new PDF
--opacity   0.35             Highlight opacity (0.0–1.0)
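
For scripted use, such as diffing many revision pairs in one go, the command line above can be assembled and invoked from Python. The following is a minimal sketch and not part of pdfdelta itself: `build_command` and `run_pdfdelta` are hypothetical helper names, and the only things taken from this README are the `pdfdelta` executable, its two positional arguments, and the three flags listed in the table.

```python
import subprocess
from typing import List, Optional


def build_command(
    old_pdf: str,
    new_pdf: str,
    old_out: Optional[str] = None,
    new_out: Optional[str] = None,
    opacity: Optional[float] = None,
) -> List[str]:
    """Assemble the argv list for a pdfdelta call (hypothetical helper)."""
    cmd = ["pdfdelta", old_pdf, new_pdf]
    # Flags are appended only when set, so pdfdelta's own defaults apply otherwise.
    if old_out is not None:
        cmd += ["--old-out", old_out]
    if new_out is not None:
        cmd += ["--new-out", new_out]
    if opacity is not None:
        if not 0.0 <= opacity <= 1.0:
            raise ValueError("opacity must be between 0.0 and 1.0")
        cmd += ["--opacity", str(opacity)]
    return cmd


def run_pdfdelta(old_pdf: str, new_pdf: str, **options) -> None:
    """Run pdfdelta as a subprocess; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(old_pdf, new_pdf, **options), check=True)
```

For example, `run_pdfdelta("v1.pdf", "v2.pdf", opacity=0.5)` would execute `pdfdelta v1.pdf v2.pdf --opacity 0.5`, assuming the executable is on `PATH`.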

Features

  • Direct Annotation: Highlights changes as native PDF annotations on the original layout.
  • Layout Aware: Optimized for multi-column papers and technical reports.
  • Noise Reduction: Filters out visual artifacts caused by simple text reflow across lines or pages.
  • Structural Support: Tolerates figures and tables that shift position between revisions.

How It Works

 old.pdf    new.pdf
   │           │
   ▼           ▼
┌──────────────────┐
│  Extract words   │  PyMuPDF: word text + bounding boxes
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Global diff     │  Flatten all pages → SequenceMatcher
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Word-level diff │  Per-word & sub-word precision
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Reflow filter   │  Suppress cross-page / cross-column noise
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Annotate PDFs   │  Highlights on original pages
└────────┬─────────┘
         ▼
 old_marked.pdf
 new_marked.pdf
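
The diff stages in the diagram can be sketched with Python's standard `difflib.SequenceMatcher`, which the diagram names. This is an illustrative approximation, not pdfdelta's implementation: the `Word` record, the helper names, and the plain-string page inputs are assumptions, and the real tool also carries bounding boxes and applies a separate reflow filter that is not reproduced here.

```python
from difflib import SequenceMatcher
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    page: int  # zero-based page index the word was extracted from
    text: str


def flatten(pages: List[List[str]]) -> List[Word]:
    """Concatenate per-page word lists into one sequence, remembering the page."""
    return [Word(p, w) for p, words in enumerate(pages) for w in words]


def global_diff(old_pages, new_pages) -> Tuple[List[Word], List[Word]]:
    """Diff the flattened word sequences; return (deletions, additions)."""
    old, new = flatten(old_pages), flatten(new_pages)
    matcher = SequenceMatcher(
        a=[w.text for w in old], b=[w.text for w in new], autojunk=False
    )
    deletions, additions = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            deletions.extend(old[i1:i2])   # words present only in the old PDF
            additions.extend(new[j1:j2])   # words present only in the new PDF
    return deletions, additions


def changed_chars(old_word: str, new_word: str) -> List[Tuple[int, int]]:
    """Sub-word step: character ranges of new_word that are new or replaced."""
    matcher = SequenceMatcher(a=old_word, b=new_word, autojunk=False)
    return [
        (j1, j2)
        for tag, _, _, j1, j2 in matcher.get_opcodes()
        if tag in ("insert", "replace")
    ]


old_pages = [["Results", "improve", "by", "5%"], ["on", "average."]]
new_pages = [["Results", "improve"], ["by", "7%", "on", "average."]]
deletions, additions = global_diff(old_pages, new_pages)
```

In this example the word "by" moves from the first page to the second without being reported, because matching runs over the flattened sequence rather than page by page. Only `5%` and `7%` differ, and `changed_chars("5%", "7%")` narrows the addition to its first character.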

License

MIT
