revpdf

revpdf is a triage-and-recovery toolkit for PDFs saved with incremental updates. It helps you inspect a PDF's edit history and safely roll it back to an earlier revision when later saves added unwanted markup.

Installation

pip install revpdf

Quick Start

1. Inspect a PDF

revpdf-inspect input.pdf

This prints revision numbers, byte ranges, and counts of suspicious annotation markers.

2. List object-level changes per revision

revpdf-list input.pdf

This identifies which objects each revision appends and distinguishes likely annotation-only changes from content-affecting changes.

3. Extract a chosen revision

revpdf-extract input.pdf --revision-index 1 --output cleaned.pdf

4. Automated Sanitization (New)

revpdf-clean input.pdf --output cleaned.pdf --strategies acrobat samsung --manifest integrity.json

This uses best-guess heuristics to automatically identify and surgically remove platform-specific markup while regenerating the XRef stream for a clean file.

Forensic Features:

  • --manifest (-m): Generates a machine-readable JSON report mapping original object IDs to their cryptographic hashes.
  • Global Signature: A cumulative SHA-256 fingerprint of all kept content, providing evidence that the retained textbook data is byte-identical to the original for every object that was hashed.

Forensic Integrity Module

revpdf is designed for high-assurance environments where data provenance is critical.

Tiered Hashing

To maintain high performance on large files, the integrity engine uses a tiered approach:

  • Hashed: Text content streams, page geometry, and structural objects.
  • Skipped: Large binary assets like embedded images and fonts (can be enabled via API).

Integrity Manifest (JSON)

The manifest provides a verifiable audit trail:

  • global_signature: The unique hash for the entire "cleaned" document state.
  • object_hashes: A dictionary of New_ID -> SHA-256_Hash.
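
Before relying on a manifest, it can help to confirm that it has the shape described above. The following is a minimal sketch, not part of revpdf: it assumes the manifest is a JSON object with top-level global_signature and object_hashes keys, and that hashes are hex-encoded SHA-256 digests (the encoding is an assumption). The helper names are hypothetical, and the sample values are placeholders generated in the code itself.

```python
import hashlib
import json
import string

HEX_DIGITS = set(string.hexdigits)


def _is_sha256_hex(value) -> bool:
    """True if value looks like a hex-encoded SHA-256 digest (assumed encoding)."""
    return (
        isinstance(value, str)
        and len(value) == 64
        and all(char in HEX_DIGITS for char in value)
    )


def check_manifest_shape(manifest: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not _is_sha256_hex(manifest.get("global_signature")):
        problems.append("global_signature is not a 64-character hex string")
    hashes = manifest.get("object_hashes")
    if not isinstance(hashes, dict):
        problems.append("object_hashes is not a dictionary")
    else:
        for object_id, digest in hashes.items():
            if not _is_sha256_hex(digest):
                problems.append(f"object {object_id}: malformed hash")
    return problems


# Placeholder manifest built in code; real values come from revpdf-clean.
sample = {
    "global_signature": hashlib.sha256(b"placeholder document").hexdigest(),
    "object_hashes": {"1": hashlib.sha256(b"placeholder object").hexdigest()},
}
manifest = json.loads(json.dumps(sample))
print(check_manifest_shape(manifest) or "manifest shape looks plausible")
```

For a real file, the manifest written by --manifest integrity.json could be loaded with json.loads(Path("integrity.json").read_text()) and passed to the same function. This only checks the shape; it does not recompute or verify any hash.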

Developer SDK (v0.2.0+)

revpdf now provides a tiered Python SDK with a high-performance Rust backend. It supports asyncio for non-blocking I/O.

Object-Model API

Designed for ease of use and integration into Python workflows.

import asyncio
from revpdf import PdfDocument, Sanitizer

async def main():
    # Load document (Lazy loading handled by Rust)
    doc = await PdfDocument.open("textbook.pdf")
    
    # Apply automated sanitization strategies
    sanitizer = Sanitizer(strategies=["acrobat", "samsung"])
    removed_count = await sanitizer.apply(doc)
    print(f"Identified {removed_count} objects for removal")
    
    # Surgical Save (Modern XRef Stream regeneration)
    manifest = await doc.save("cleaned.pdf", surgical=True)
    print("Surgical Save Complete!")
    print(f"Global Document Signature: {manifest.global_signature}")
    print(f"Verified Objects: {len(manifest.object_hashes)}")

if __name__ == "__main__":
    asyncio.run(main())

The Workflow

When this works well

This method is appropriate when the PDF was modified by incremental saves. In that format, each save appends a new revision to the end of the file.

Typical signs:

  • multiple %%EOF markers in the file
  • trailer dictionaries with /Prev pointers
  • later revisions containing annotation markers such as /Type /Annot, /Subtype /Stamp, /InkList, /AAPL:AKExtras, or /PPKType (draw)

Safety Rule

Only roll back to an earlier revision if the later revisions contain unwanted annotations or annotation appearance streams and do not replace the actual textbook page content you need to keep.

Do not roll back blindly if later revisions also change:

  • page content streams
  • text objects
  • fonts
  • images
  • the page tree structure, when the change reflects real content changes

If those appear in the later revisions, you need a more careful repair strategy.

The Manual Workflow

1. Find the revision boundaries

Each incremental revision normally ends with %%EOF.

Run:

python3 - <<'PY'
from pathlib import Path

data = Path("input.pdf").read_bytes()
cursor = 0
index = 1
while True:
    pos = data.find(b"%%EOF", cursor)
    if pos == -1:
        break
    print(f"Revision {index}: EOF at byte {pos}")
    cursor = pos + 1
    index += 1
PY

If you see more than one %%EOF, the file contains multiple revisions.
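
The offsets printed above mark where each %%EOF begins. As a rough sketch, assuming revisions are delimited purely by %%EOF markers, the same scan can return (start, end) byte ranges that include the marker and one trailing end-of-line sequence. revision_ranges is a hypothetical helper, not a revpdf function, and the sample bytes are synthetic.

```python
def revision_ranges(data: bytes) -> list:
    """Return (start, end) byte ranges, one per %%EOF marker found.

    `end` is exclusive and covers the marker plus one trailing EOL, if present.
    """
    marker = b"%%EOF"
    ranges = []
    start = 0
    cursor = 0
    while True:
        pos = data.find(marker, cursor)
        if pos == -1:
            break
        end = pos + len(marker)
        # Include a single end-of-line sequence (CRLF, LF, or CR) after the marker.
        if data[end:end + 2] == b"\r\n":
            end += 2
        elif data[end:end + 1] in (b"\n", b"\r"):
            end += 1
        ranges.append((start, end))
        start = cursor = end
    return ranges


# Synthetic two-revision sample; a real file would be read with Path.read_bytes().
sample = b"%PDF-1.7\nbody one\n%%EOF\nbody two\n%%EOF\n"
for number, (start, end) in enumerate(revision_ranges(sample), start=1):
    print(f"Revision {number}: bytes {start}..{end}")
```

A raw scan like this can be fooled by a %%EOF that happens to appear inside stream data, so treat the ranges as candidates to confirm against the trailer chain.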

2. Inspect the trailer chain

Each later trailer usually points back to the previous revision's cross-reference section through a /Prev entry.

Run:

python3 - <<'PY'
from pathlib import Path

data = Path("input.pdf").read_bytes()
cursor = 0
while True:
    pos = data.find(b"%%EOF", cursor)
    if pos == -1:
        break
    snippet = data[max(0, pos - 400):pos + 20]
    print(snippet.decode("latin1", "replace"))
    print("-----")
    cursor = pos + 1
PY

This helps confirm that the file was saved incrementally rather than rewritten from scratch.
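
Reading the printed snippets by eye works for a few revisions. For longer histories, a small parser can pull out the numbers. This is a sketch under two assumptions: each revision ends with a classic startxref / offset / %%EOF sequence, and any /Prev entry sits within a few hundred bytes before it. Files that use cross-reference streams keep /Prev in the stream dictionary, which this window may miss. trailer_chain is a hypothetical helper.

```python
import re

STARTXREF = re.compile(rb"startxref\s+(\d+)\s+%%EOF")
PREV = re.compile(rb"/Prev\s+(\d+)")


def trailer_chain(data: bytes, window: int = 400) -> list:
    """Return one dict per revision with its startxref offset and /Prev value."""
    chain = []
    previous_end = 0
    for match in STARTXREF.finditer(data):
        # Look only at bytes belonging to this revision, at most `window` back.
        low = max(previous_end, match.start() - window)
        tail = data[low:match.start()]
        prev_values = PREV.findall(tail)
        chain.append({
            "startxref": int(match.group(1)),
            "prev": int(prev_values[-1]) if prev_values else None,
        })
        previous_end = match.end()
    return chain


# Synthetic sample: the offsets are placeholders, not valid xref positions.
sample = (
    b"%PDF-1.7\ntrailer\n<< /Size 5 >>\nstartxref\n9\n%%EOF\n"
    b"trailer\n<< /Size 7 /Prev 9 >>\nstartxref\n120\n%%EOF\n"
)
for number, entry in enumerate(trailer_chain(sample), start=1):
    print(f"Revision {number}: startxref={entry['startxref']} prev={entry['prev']}")
```

In an incrementally saved file, each revision after the first would normally report a /Prev value equal to the startxref of the revision before it.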

3. Search for suspicious annotation markers

Search the raw PDF bytes for common overlay markers:

rg -a -n '/Subtype /(Stamp|Ink|FreeText|Square|Circle|Highlight)|/InkList|/AAPL:AKExtras|/PPKType \(draw\)|Mobile User' input.pdf

Common signs of hand-drawn markup include:

  • /Subtype /Stamp
  • /Subtype /Ink
  • /InkList
  • /PPKType (draw)
  • /AAPL:AKExtras
  • Mobile User

4. Compare revisions, not just the whole file

The important question is not whether the PDF contains annotations somewhere. The important question is when those objects first appear.

For each appended revision, inspect whether it adds:

  • only annotation objects and annotation appearance streams
  • page /Annots references pointing to those annotations

That is usually safe to roll back.

If the appended revision adds or replaces actual page contents, treat it as unsafe for blind rollback.
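
To see when markers first appear, the counts can be taken per revision instead of per file. This is a minimal sketch that splits on %%EOF and counts object definitions and marker hits in each slice. It is a rough byte-level approximation of what revpdf-list reports, not its implementation, and it cannot tell annotation objects from content objects on its own. summarize_revisions is a hypothetical helper.

```python
import re

MARKER = re.compile(
    rb"/Subtype\s*/(?:Stamp|Ink|FreeText|Square|Circle|Highlight)"
    rb"|/InkList|/AAPL:AKExtras|/PPKType \(draw\)"
)
OBJECT_DEF = re.compile(rb"\d+\s+\d+\s+obj\b")


def summarize_revisions(data: bytes) -> list:
    """Count object definitions and marker hits in each %%EOF-delimited slice."""
    summaries = []
    start = 0
    for eof in re.finditer(rb"%%EOF", data):
        chunk = data[start:eof.end()]
        summaries.append({
            "revision": len(summaries) + 1,
            "objects": len(OBJECT_DEF.findall(chunk)),
            "markers": len(MARKER.findall(chunk)),
        })
        start = eof.end()
    return summaries


# Synthetic sample: revision 2 appends a single ink annotation.
sample = (
    b"%PDF-1.7\n1 0 obj\n<< /Type /Page >>\nendobj\n%%EOF\n"
    b"5 0 obj\n<< /Type /Annot /Subtype /Ink /InkList [] >>\nendobj\n%%EOF\n"
)
for row in summarize_revisions(sample):
    print(row)
```

A revision whose marker count jumps from zero is a candidate for closer inspection with revpdf-list before deciding on a rollback.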

5. Choose the rollback target

Choose the last revision before the unwanted markers first appear.

Example logic:

  • revision 1: no unwanted annotation markers
  • revision 2: unwanted drawing markers appear
  • revision 3: more of the same unwanted drawing markers

In that case, revision 1 is the clean rollback target.
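
The example logic can be written down as a small function. This sketch takes one marker count per revision, in order, and returns the 1-based index of the last revision before markers first appear. choose_rollback_target is a hypothetical name, and the function only encodes the rule above; it does not check whether later revisions also change real content.

```python
from typing import Optional


def choose_rollback_target(marker_counts: list) -> Optional[int]:
    """Return the 1-based index of the last clean revision, or None.

    None means either that the first revision already has markers (there is
    no clean revision to roll back to) or that no revision has markers at all
    (there is nothing to roll back).
    """
    for index, count in enumerate(marker_counts, start=1):
        if count > 0:
            # First revision with markers found; the one before it is the target.
            return index - 1 if index > 1 else None
    return None


# Counts matching the example: revision 1 clean, revisions 2 and 3 marked up.
print(choose_rollback_target([0, 3, 5]))
```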

6. Extract the earlier revision into a new file

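Once you know the correct revision boundary, copy the file only up to and including that revision's final %%EOF. The offsets printed in step 1 point at the start of each marker, so the safe end offset is that value plus 5 (the length of %%EOF), plus any end-of-line bytes that follow the marker.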

Manual example:

head -c <SAFE_END_OFFSET> input.pdf > cleaned.pdf

Do this into a new output file. Leave the original untouched.
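
The same truncation can be done in Python with a couple of guard rails. This is a sketch, not how revpdf-extract is implemented: it refuses to overwrite an existing output file and checks that the chosen offset lands just after a %%EOF marker. extract_prefix is a hypothetical helper, and the demo runs on a synthetic file in a temporary directory.

```python
import tempfile
from pathlib import Path


def extract_prefix(source: Path, dest: Path, end_offset: int) -> None:
    """Write the first end_offset bytes of source to a new file at dest."""
    data = source.read_bytes()
    if not 0 < end_offset <= len(data):
        raise ValueError("end_offset is outside the file")
    # %%EOF is 5 bytes and may be followed by up to 2 EOL bytes.
    if b"%%EOF" not in data[max(0, end_offset - 8):end_offset]:
        raise ValueError("end_offset does not fall just after an %%EOF marker")
    if dest.exists():
        raise FileExistsError(dest)
    dest.write_bytes(data[:end_offset])


# Demo on a synthetic two-revision file.
first = b"%PDF-1.7\nrevision one\n%%EOF\n"
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "input.pdf"
    dest = Path(tmp) / "cleaned.pdf"
    source.write_bytes(first + b"revision two\n%%EOF\n")
    extract_prefix(source, dest, end_offset=len(first))
    recovered = dest.read_bytes()
    print(len(recovered), "bytes written")
```

This reads the whole file into memory, which is simple but may be wasteful for very large PDFs; head -c or a chunked copy avoids that.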

7. Verify the cleaned file

Run:

pdfinfo cleaned.pdf

Then confirm the unwanted markers are gone:

python3 - <<'PY'
from pathlib import Path

data = Path("cleaned.pdf").read_bytes()
for token in [
    b"Mobile User",
    b"/PPKType (draw)",
    b"/Subtype /Stamp",
    b"/Subtype /Ink",
    b"/AAPL:AKExtras",
]:
    print(token.decode("latin1"), data.count(token))
PY

Check:

  • page count still matches what you expect
  • the cleaned file opens normally
  • the unwanted annotation markers are gone
  • the original content remains intact
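
For a rollback made by truncation, "the original content remains intact" can be checked mechanically: the cleaned file should be an exact byte prefix of the original. The sketch below performs that check and records a SHA-256 digest of the result for later reference. Both helpers are hypothetical, and the check does not apply to output from revpdf-clean, which rewrites the file.

```python
import hashlib
import tempfile
from pathlib import Path


def is_byte_prefix(original: Path, cleaned: Path) -> bool:
    """True if cleaned is non-empty and matches the start of original exactly."""
    cleaned_bytes = cleaned.read_bytes()
    with original.open("rb") as handle:
        return bool(cleaned_bytes) and handle.read(len(cleaned_bytes)) == cleaned_bytes


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


# Demo on synthetic files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "input.pdf"
    cleaned = Path(tmp) / "cleaned.pdf"
    original.write_bytes(b"%PDF-1.7\none\n%%EOF\ntwo\n%%EOF\n")
    cleaned.write_bytes(b"%PDF-1.7\none\n%%EOF\n")
    prefix_ok = is_byte_prefix(original, cleaned)
    digest = sha256_of(cleaned)
    print("prefix of original:", prefix_ok)
    print("sha256:", digest)
```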

Reusable Commands

The package includes four core commands:

  • revpdf-inspect
  • revpdf-extract
  • revpdf-list
  • revpdf-clean

Inspect a PDF

revpdf-inspect input.pdf

This prints:

  • revision number
  • revision byte range
  • end offset
  • trailer /Prev
  • trailer /Size
  • counts of suspicious annotation markers in that revision

List object-level changes per revision

revpdf-list input.pdf

This prints, for each revision:

  • object count
  • per-kind counts such as page, annot, xobject, font, and generic objects
  • whether each appended object was added, redefined, or repeated
  • a revision assessment such as likely_annotation_only or content_affecting_or_mixed
  • notable details such as /Rect, /Annots, /Contents, stream filters, compressed-object containers, and vendor markers when present

Useful options:

revpdf-list input.pdf --revision-index 3
revpdf-list input.pdf --show-baseline-objects
revpdf-list input.pdf --summary-only
revpdf-list input.pdf --json

Extract a chosen revision

revpdf-extract input.pdf --revision-index 1 --output cleaned.pdf

This writes a new PDF containing only the bytes through the selected revision.

You can also extract by byte offset:

revpdf-extract input.pdf --end-offset 229086321 --output cleaned.pdf

Practical Decision Checklist

Use rollback when all of these are true:

  • the PDF has multiple revisions
  • the unwanted changes were introduced in later revisions
  • those later revisions are annotation-only or annotation-dominant
  • the earlier revision already contains the correct textbook content

Do not use blind rollback when any of these are true:

  • the later revisions contain actual content changes you need
  • you cannot tell whether the later objects are only annotations
  • the PDF was fully rewritten instead of saved incrementally
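
For scripted triage, the checklist can be captured as an explicit gate so that a rollback is never attempted on incomplete answers. This is only an illustration of the rule above; the RollbackFacts fields and the function name are hypothetical, and filling them in still requires the inspection steps described earlier.

```python
from dataclasses import dataclass


@dataclass
class RollbackFacts:
    """Answers to the checklist questions; each must be established by inspection."""
    has_multiple_revisions: bool
    unwanted_changes_in_later_revisions: bool
    later_revisions_annotation_only_or_dominant: bool
    earlier_revision_has_correct_content: bool


def rollback_is_appropriate(facts: RollbackFacts) -> bool:
    """Rollback is appropriate only when every checklist condition holds."""
    return all((
        facts.has_multiple_revisions,
        facts.unwanted_changes_in_later_revisions,
        facts.later_revisions_annotation_only_or_dominant,
        facts.earlier_revision_has_correct_content,
    ))


# If you cannot tell whether later objects are only annotations, answer False.
uncertain = RollbackFacts(True, True, False, True)
print(rollback_is_appropriate(uncertain))
```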

Notes

  • Some tools warn about broken or invalid linearization tables. That does not automatically mean the PDF is unusable.
  • Some annotation systems render hand-drawn marks as /Stamp objects with appearance streams rather than /Ink objects. Search broadly.
  • Some editors store changed objects inside compressed object streams (/ObjStm) or xref streams. The change report (revpdf-list) now detects these and expands common Flate-compressed object streams.
  • Always work on a copy when the document is important.

Recommended Sequence

  1. Run revpdf-inspect.
  2. Run revpdf-list to see exactly which objects each revision added or redefined.
  3. Identify the first revision that introduces unwanted annotation markers.
  4. Choose between:
    • Rollback: Use revpdf-extract to truncate at a safe revision.
    • Surgical Cleanup: Use revpdf-clean to remove specific markup layers while keeping recent content.
  5. Verify the page count, metadata, and marker counts.
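
When the sequence is driven from a script, it can help to build each command line in one place. The sketch below only assembles argument lists from the commands and flags shown in this document; the function names are hypothetical, and nothing is executed. Passing a list to subprocess.run(argv, check=True) would run it, assuming the revpdf commands are on PATH.

```python
import shlex


def inspect_command(pdf: str) -> list:
    return ["revpdf-inspect", pdf]


def list_command(pdf: str) -> list:
    return ["revpdf-list", pdf]


def extract_command(pdf: str, revision_index: int, output: str) -> list:
    return ["revpdf-extract", pdf, "--revision-index", str(revision_index),
            "--output", output]


def clean_command(pdf: str, output: str, strategies: list) -> list:
    return ["revpdf-clean", pdf, "--output", output, "--strategies", *strategies]


# Print the commands for review instead of running them.
plan = [
    inspect_command("input.pdf"),
    list_command("input.pdf"),
    extract_command("input.pdf", 1, "cleaned.pdf"),
]
for argv in plan:
    print(shlex.join(argv))
```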
