Visual diff for born-digital PDFs — highlights changes directly on the original pages

pdfdelta

pdfdelta is a lightweight visual diff tool for born-digital PDFs.

Given an old and a new version of a PDF, it writes highlights directly onto the original pages so revisions are easy to review: deletions on the old file, additions on the new file.

It is mainly designed for academic papers and technical documents, where small wording changes matter and layout is part of the review process.

[Screenshots: old PDF with deletions highlighted; new PDF with additions highlighted]

Features

  • Highlights changes directly on the original PDF pages
  • Works well for born-digital PDFs such as papers, reports, and drafts
  • Handles multi-column layouts better than plain-text diff tools
  • Tries to reduce noisy highlights from simple reflow
  • Keeps the review workflow visual and page-based
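
To illustrate why reflow matters, here is a minimal sketch using only the standard library. It is not pdfdelta's own code: the helper `changed_words` and the sample pages are hypothetical. It compares a page-by-page diff with a diff over all pages flattened into one word sequence, for a case where one inserted word pushes another word onto the next page.

```python
from difflib import SequenceMatcher


def changed_words(old_words, new_words):
    """Return (deleted, added) word lists from a word-level diff."""
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    deleted, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.extend(old_words[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(new_words[j1:j2])
    return deleted, added


# Hypothetical two-page documents. In the new version the inserted word
# "now" pushes "results" from the end of page 1 to the start of page 2.
old_pages = [["we", "report", "results"], ["for", "all", "runs"]]
new_pages = [["we", "now", "report"], ["results", "for", "all", "runs"]]

# Page-by-page comparison: the reflowed word is flagged on both pages.
per_page = [changed_words(o, n) for o, n in zip(old_pages, new_pages)]

# Global comparison: flatten the pages first, so only the real edit remains.
flat_old = [w for page in old_pages for w in page]
flat_new = [w for page in new_pages for w in page]
global_diff = changed_words(flat_old, flat_new)

print(per_page)
print(global_diff)
```

In the page-by-page result, "results" appears as deleted on page 1 and added on page 2 even though the text did not change. The flattened comparison reports only the inserted word.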

Installation

If you are using the repository directly:

pip install git+https://github.com/mli55/pdfdelta.git

Usage

pdfdelta old.pdf new.pdf

This writes two annotated files:

  • old_marked.pdf — original pages with deletions highlighted
  • new_marked.pdf — revised pages with additions highlighted

Options

Flag        Default          Description
--old-out   old_marked.pdf   Output path for the annotated old PDF
--new-out   new_marked.pdf   Output path for the annotated new PDF
--opacity   0.35             Highlight opacity (0.0–1.0)
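
For scripted use, such as batch-comparing drafts, the command line can be assembled from Python. The sketch below uses only the flags listed above. The helper names `build_pdfdelta_command` and `run_pdfdelta` are hypothetical, and it assumes each flag takes its value as a separate argument (`--flag value`), which the README does not spell out.

```python
import subprocess


def build_pdfdelta_command(old_pdf, new_pdf, old_out=None, new_out=None, opacity=None):
    """Build the argument list for a pdfdelta run from the documented flags."""
    cmd = ["pdfdelta", str(old_pdf), str(new_pdf)]
    if old_out is not None:
        cmd += ["--old-out", str(old_out)]
    if new_out is not None:
        cmd += ["--new-out", str(new_out)]
    if opacity is not None:
        # The options table documents the valid range as 0.0 to 1.0.
        if not 0.0 <= opacity <= 1.0:
            raise ValueError("opacity must be between 0.0 and 1.0")
        cmd += ["--opacity", str(opacity)]
    return cmd


def run_pdfdelta(*args, **kwargs):
    """Run pdfdelta; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_pdfdelta_command(*args, **kwargs), check=True)
```

For example, `run_pdfdelta("old.pdf", "new.pdf", opacity=0.5)` would run `pdfdelta old.pdf new.pdf --opacity 0.5`, provided the `pdfdelta` executable is on the PATH.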

How It Works

 old.pdf    new.pdf
   │           │
   ▼           ▼
┌──────────────────┐
│  Extract words   │  PyMuPDF: word text + bounding boxes
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Global diff     │  Flatten all pages → SequenceMatcher
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Word-level diff │  Per-word & sub-word precision
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Reflow filter   │  Suppress cross-page / cross-column noise
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Annotate PDFs   │  Highlights on original pages
└────────┬─────────┘
         ▼
 old_marked.pdf
 new_marked.pdf
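
The pipeline above can be approximated in a few functions. This is a simplified sketch under stated assumptions, not pdfdelta's actual implementation: the `Word` record and all function names are hypothetical, the reflow filter is omitted, and `changed_spans` only finds differing character ranges without mapping them back to page coordinates. The PyMuPDF calls (`get_text("words")`, `add_highlight_annot`, `set_opacity`) are from that library's public API.

```python
from difflib import SequenceMatcher
from typing import NamedTuple


class Word(NamedTuple):
    page: int    # zero-based page index
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates
    text: str


def extract_words(pdf_path):
    """Flatten every page of a PDF into one list of Word records."""
    import fitz  # PyMuPDF; imported lazily so the diff helpers work without it
    words = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Each entry is (x0, y0, x1, y1, text, block_no, line_no, word_no).
            for x0, y0, x1, y1, text, *_ in page.get_text("words"):
                words.append(Word(page.number, (x0, y0, x1, y1), text))
    return words


def diff_words(old_words, new_words):
    """Return (deleted, added): Words present only in the old or new sequence."""
    matcher = SequenceMatcher(
        a=[w.text for w in old_words], b=[w.text for w in new_words], autojunk=False
    )
    deleted, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            deleted.extend(old_words[i1:i2])
            added.extend(new_words[j1:j2])
    return deleted, added


def changed_spans(old_text, new_text):
    """Character ranges of new_text that differ from old_text (sub-word level)."""
    matcher = SequenceMatcher(a=old_text, b=new_text, autojunk=False)
    return [(j1, j2) for tag, _, _, j1, j2 in matcher.get_opcodes()
            if tag in ("replace", "insert")]


def annotate(pdf_path, words, out_path, opacity=0.35):
    """Highlight the given Words on their original pages and save a copy."""
    import fitz
    with fitz.open(pdf_path) as doc:
        for word in words:
            annot = doc[word.page].add_highlight_annot(fitz.Rect(word.bbox))
            annot.set_opacity(opacity)
            annot.update()
        doc.save(out_path)
```

With these helpers, `deleted, added = diff_words(extract_words("old.pdf"), extract_words("new.pdf"))` followed by `annotate("old.pdf", deleted, "old_marked.pdf")` and `annotate("new.pdf", added, "new_marked.pdf")` mirrors the overall flow. Because each `Word` keeps its page and bounding box, the diff can run on the flattened text while highlights still land on the original pages.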

License

MIT
