revpdf
revpdf is a triage-and-recovery toolkit for PDFs saved with incremental updates. It helps you inspect a PDF's edit history and safely roll it back to an earlier revision when later saves added unwanted markup.
Installation
pip install revpdf
Quick Start
1. Inspect a PDF
revpdf-inspect input.pdf
This prints revision numbers, byte ranges, and counts of suspicious annotation markers.
2. List object-level changes per revision
revpdf-list input.pdf
This identifies which objects each revision appends and distinguishes likely annotation-only changes from content-affecting changes.
3. Extract a chosen revision
revpdf-extract input.pdf --revision-index 1 --output cleaned.pdf
4. Automated Sanitization (New)
revpdf-clean input.pdf --output cleaned.pdf --strategies acrobat samsung --manifest integrity.json
This uses best-guess heuristics to automatically identify and surgically remove platform-specific markup while regenerating the XRef stream for a clean file.
Forensic Features:
- --manifest (-m): Generates a machine-readable JSON report mapping original object IDs to their cryptographic hashes.
- Global Signature: A cumulative SHA-256 fingerprint of all kept content, which lets you check that the retained textbook data is byte-identical to the original.
Forensic Integrity Module
revpdf is designed for high-assurance environments where data provenance is critical.
Tiered Hashing
To maintain high performance on large files, the integrity engine uses a tiered approach:
- Hashed: Text content streams, page geometry, and structural objects.
- Skipped: Large binary assets like embedded images and fonts (can be enabled via API).
Integrity Manifest (JSON)
The manifest provides a verifiable audit trail:
- global_signature: The unique hash for the entire "cleaned" document state.
- object_hashes: A dictionary of New_ID -> SHA-256_Hash.
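To make the manifest's shape concrete, here is a minimal sketch that builds one with hashlib. The function name build_manifest, the input mapping, and the way the global signature is derived (SHA-256 over the per-object digests in ascending ID order) are assumptions for illustration only; how revpdf actually computes these values is not specified here.

```python
import hashlib
import json


def build_manifest(objects: dict[int, bytes]) -> dict:
    """Hash each kept object and derive one cumulative signature.

    `objects` maps a hypothetical new object ID to that object's raw bytes.
    Assumption: the global signature is SHA-256 over the per-object
    hex digests, fed in ascending ID order.
    """
    object_hashes = {}
    cumulative = hashlib.sha256()
    for obj_id in sorted(objects):
        digest = hashlib.sha256(objects[obj_id]).hexdigest()
        object_hashes[str(obj_id)] = digest
        cumulative.update(digest.encode("ascii"))
    return {
        "global_signature": cumulative.hexdigest(),
        "object_hashes": object_hashes,
    }


# Two tiny stand-in objects; real input would be the kept PDF objects.
sample = {1: b"<< /Type /Catalog >>", 2: b"<< /Type /Pages /Count 1 >>"}
manifest = build_manifest(sample)
print(json.dumps(manifest, indent=2))
```

Because the signature depends on every per-object digest, changing any kept object changes the global signature as well.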
Developer SDK (v0.2.0+)
revpdf now provides a tiered Python SDK with a high-performance Rust backend. It supports asyncio for non-blocking I/O.
Object-Model API
Designed for ease of use and integration into Python workflows.
import asyncio
from revpdf import PdfDocument, Sanitizer

async def main():
    # Load document (Lazy loading handled by Rust)
    doc = await PdfDocument.open("textbook.pdf")
    # Apply automated sanitization strategies
    sanitizer = Sanitizer(strategies=["acrobat", "samsung"])
    removed_count = await sanitizer.apply(doc)
    print(f"Identified {removed_count} objects for removal")
    # Surgical Save (Modern XRef Stream regeneration)
    manifest = await doc.save("cleaned.pdf", surgical=True)
    print("Surgical Save Complete!")
    print(f"Global Document Signature: {manifest.global_signature}")
    print(f"Verified Objects: {len(manifest.object_hashes)}")

if __name__ == "__main__":
    asyncio.run(main())
The Workflow
When this works well
This method is appropriate when the PDF was modified by incremental saves. In that format, each save appends a new revision to the end of the file.
Typical signs:
- multiple %%EOF markers in the file
- trailer dictionaries with /Prev pointers
- later revisions containing annotation markers such as /Type /Annot, /Subtype /Stamp, /InkList, /AAPL:AKExtras, or /PPKType (draw)
Safety Rule
Only roll back to an earlier revision if the later revisions contain unwanted annotations or annotation appearance streams and do not replace the actual textbook page content you need to keep.
Do not roll back blindly if later revisions also change:
- page content streams
- text objects
- fonts
- images
- page tree structure for real content changes
If those appear in the later revisions, you need a more careful repair strategy.
The Manual Workflow
1. Find the revision boundaries
Each incremental revision normally ends with %%EOF.
Run:
python3 - <<'PY'
from pathlib import Path

data = Path("input.pdf").read_bytes()
cursor = 0
index = 1
while True:
    pos = data.find(b"%%EOF", cursor)
    if pos == -1:
        break
    print(f"Revision {index}: EOF at byte {pos}")
    cursor = pos + 1
    index += 1
PY
If you see more than one %%EOF, the file contains multiple revisions.
2. Inspect the trailer chain
Each later trailer often points backward to the previous revision using /Prev.
Run:
python3 - <<'PY'
from pathlib import Path

data = Path("input.pdf").read_bytes()
cursor = 0
while True:
    pos = data.find(b"%%EOF", cursor)
    if pos == -1:
        break
    snippet = data[max(0, pos - 400):pos + 20]
    print(snippet.decode("latin1", "replace"))
    print("-----")
    cursor = pos + 1
PY
This helps confirm that the file was saved incrementally rather than rewritten from scratch.
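If reading raw trailer dumps gets tedious, the chain can also be summarized programmatically. The sketch below is one possible approach, not part of revpdf: the helper name trailer_chain and both regular expressions are illustrative assumptions. It skips outline-style entries such as "/Prev 12 0 R", which are indirect references rather than byte offsets.

```python
import re

# "/Prev 12345" as a byte offset; the lookahead rejects "/Prev 12 0 R",
# which is an indirect reference (used by outline items), not an offset.
PREV_RE = re.compile(rb"/Prev\s+(\d+)\b(?!\s+\d+\s+R\b)")
# "startxref <offset> %%EOF" closes each revision.
STARTXREF_RE = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def trailer_chain(data: bytes) -> list[dict]:
    """Return one entry per revision with its startxref and /Prev offsets."""
    chain = []
    previous_end = 0
    for match in STARTXREF_RE.finditer(data):
        # Only look at bytes appended since the previous revision ended.
        segment = data[previous_end:match.end()]
        prev_values = PREV_RE.findall(segment)
        chain.append({
            "startxref": int(match.group(1)),
            # The trailer sits at the end of the segment, so take the last hit.
            "prev": int(prev_values[-1]) if prev_values else None,
            "end_offset": match.end(),
        })
        previous_end = match.end()
    return chain


# Synthetic two-revision input; the offsets are placeholders, not real xref positions.
sample = (
    b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
    b"trailer\n<< /Size 2 >>\nstartxref\n9\n%%EOF\n"
    b"2 0 obj\n<<>>\nendobj\n"
    b"trailer\n<< /Size 3 /Prev 9 >>\nstartxref\n80\n%%EOF\n"
)
for entry in trailer_chain(sample):
    print(entry)
```

For a real file, pass Path("input.pdf").read_bytes() instead of the synthetic sample. A first revision with no /Prev followed by revisions that each carry one is the pattern you would expect from incremental saves.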
3. Search for suspicious annotation markers
Search the raw PDF bytes for common overlay markers:
rg -a -n '/Subtype /(Stamp|Ink|FreeText|Square|Circle|Highlight)|/InkList|/AAPL:AKExtras|/PPKType \(draw\)|Mobile User' input.pdf
Common signs of hand-drawn markup include:
- /Subtype /Stamp
- /Subtype /Ink
- /InkList
- /PPKType (draw)
- /AAPL:AKExtras
- Mobile User
4. Compare revisions, not just the whole file
The important question is not whether the PDF contains annotations somewhere. The important question is when those objects first appear.
For each appended revision, inspect whether it adds:
- only annotation objects and annotation appearance streams
- page /Annots references pointing to those annotations
That is usually safe to roll back.
If the appended revision adds or replaces actual page contents, treat it as unsafe for blind rollback.
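One way to see when markers first appear is to count them per revision instead of across the whole file. The following is a rough sketch under simplifying assumptions: it treats every %%EOF as a revision boundary, it only sees markers stored uncompressed, and the helper names are made up for this example.

```python
MARKERS = [
    b"/Subtype /Stamp",
    b"/Subtype /Ink",
    b"/InkList",
    b"/AAPL:AKExtras",
    b"/PPKType (draw)",
    b"Mobile User",
]


def split_revisions(data: bytes) -> list[bytes]:
    """Split raw PDF bytes into segments, each ending at a %%EOF marker."""
    segments = []
    start = 0
    while (pos := data.find(b"%%EOF", start)) != -1:
        end = pos + len(b"%%EOF")
        segments.append(data[start:end])
        start = end
    return segments


def marker_counts_per_revision(data: bytes) -> list[dict[str, int]]:
    """For each revision, report only the markers that actually occur in it."""
    report = []
    for segment in split_revisions(data):
        counts = {m.decode("latin1"): segment.count(m) for m in MARKERS}
        report.append({name: n for name, n in counts.items() if n})
    return report


# Synthetic example: a clean first revision, then an appended ink annotation.
sample = (
    b"%PDF-1.4\ncontent\n%%EOF\n"
    b"5 0 obj\n<< /Type /Annot /Subtype /Ink /InkList [] >>\nendobj\n%%EOF\n"
)
for number, counts in enumerate(marker_counts_per_revision(sample), start=1):
    print(f"revision {number}: {counts}")
```

An empty dictionary for a revision means none of the listed markers were found in the bytes that revision appended. Markers hidden inside compressed object streams will not show up in a raw byte search like this one.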
5. Choose the rollback target
Choose the last revision before the unwanted markers first appear.
Example logic:
- revision 1: no unwanted annotation markers
- revision 2: unwanted drawing markers appear
- revision 3: more of the same unwanted drawing markers
In that case, revision 1 is the clean rollback target.
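The selection rule can be written down in a few lines. This is a sketch of the logic only; choose_rollback_target is a hypothetical helper, and its input is assumed to be the total count of unwanted markers found in each revision, in order.

```python
from typing import Optional


def choose_rollback_target(marker_totals: list[int]) -> Optional[int]:
    """Return the 1-based index of the last revision before markers first appear.

    marker_totals[i] is the number of unwanted markers in revision i + 1.
    Returns None when revision 1 already contains markers (no clean target)
    or when the list is empty. If no revision contains markers, the last
    revision is returned, meaning nothing needs to be rolled back.
    """
    for index, total in enumerate(marker_totals, start=1):
        if total > 0:
            return index - 1 if index > 1 else None
    return len(marker_totals) or None


# The example from the text: markers first appear in revision 2.
print(choose_rollback_target([0, 4, 7]))
```

A None result is a signal to stop and look at a different repair strategy, because truncating the file cannot remove markers that are already present in the first revision.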
6. Extract the earlier revision into a new file
Once you know the correct revision boundary, copy the file only up to that revision's final %%EOF.
Manual example:
head -c <SAFE_END_OFFSET> input.pdf > cleaned.pdf
Do this into a new output file. Leave the original untouched.
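If you would rather compute the offset and truncate in one place, the same step can be sketched in Python. The helper names below are illustrative and not part of revpdf. The sketch keeps the end-of-line that follows %%EOF, which is a choice made here so the output ends the way a normally saved file would.

```python
def end_offset_of_revision(data: bytes, revision_index: int) -> int:
    """Return the byte offset just past the given revision's %%EOF line (1-based)."""
    cursor = 0
    for _ in range(revision_index):
        pos = data.find(b"%%EOF", cursor)
        if pos == -1:
            raise ValueError(f"file has fewer than {revision_index} revisions")
        cursor = pos + len(b"%%EOF")
    # Keep the end-of-line after the marker, if present (\r\n, \n, or \r).
    if data[cursor:cursor + 2] == b"\r\n":
        cursor += 2
    elif data[cursor:cursor + 1] in (b"\n", b"\r"):
        cursor += 1
    return cursor


def truncate_to_revision(data: bytes, revision_index: int) -> bytes:
    """Return only the bytes through the selected revision."""
    return data[:end_offset_of_revision(data, revision_index)]


# Synthetic two-revision input standing in for a real PDF.
sample = b"%PDF-1.4\nrev1\n%%EOF\nrev2 annotations\n%%EOF\n"
cleaned = truncate_to_revision(sample, 1)
print(len(sample), len(cleaned))
```

With a real file, read the bytes from input.pdf, call truncate_to_revision, and write the result to a different path such as cleaned.pdf so the original stays untouched.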
7. Verify the cleaned file
Run:
pdfinfo cleaned.pdf
Then confirm the unwanted markers are gone:
python3 - <<'PY'
from pathlib import Path

data = Path("cleaned.pdf").read_bytes()
for token in [
    b"Mobile User",
    b"/PPKType (draw)",
    b"/Subtype /Stamp",
    b"/Subtype /Ink",
    b"/AAPL:AKExtras",
]:
    print(token.decode("latin1"), data.count(token))
PY
Check:
- page count still matches what you expect
- the cleaned file opens normally
- the unwanted annotation markers are gone
- the original content remains intact
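For a rollback made by truncation, one extra sanity check is cheap: the cleaned file should be a byte-for-byte prefix of the original and should end at an %%EOF marker. The sketch below uses hypothetical helper names and applies only to truncation-based rollback, not to output that was rewritten by a cleanup tool.

```python
def is_byte_prefix(original: bytes, cleaned: bytes) -> bool:
    """True when `cleaned` is exactly the leading bytes of `original`."""
    return original.startswith(cleaned)


def ends_at_eof_marker(cleaned: bytes) -> bool:
    """True when the file ends with %%EOF, ignoring a trailing end-of-line."""
    return cleaned.rstrip(b"\r\n").endswith(b"%%EOF")


# Synthetic stand-ins for the original and the truncated output.
original = b"%PDF-1.4\nrev1\n%%EOF\nrev2 annotations\n%%EOF\n"
cleaned = b"%PDF-1.4\nrev1\n%%EOF\n"
print(is_byte_prefix(original, cleaned), ends_at_eof_marker(cleaned))
```

If either check fails, the truncation offset was probably wrong, for example cut in the middle of a revision.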
Reusable Commands
The package includes four core commands:
- revpdf-inspect
- revpdf-extract
- revpdf-list
- revpdf-clean
Inspect a PDF
revpdf-inspect input.pdf
This prints:
- revision number
- revision byte range
- end offset
- trailer /Prev
- trailer /Size
- counts of suspicious annotation markers in that revision
List object-level changes per revision
revpdf-list input.pdf
This prints, for each revision:
- object count
- per-kind counts such as page, annot, xobject, font, and generic objects
- whether each appended object was added, redefined, or repeated
- a revision assessment such as likely_annotation_only or content_affecting_or_mixed
- notable details such as /Rect, /Annots, /Contents, stream filters, compressed-object containers, and vendor markers when present
Useful options:
revpdf-list input.pdf --revision-index 3
revpdf-list input.pdf --show-baseline-objects
revpdf-list input.pdf --summary-only
revpdf-list input.pdf --json
Extract a chosen revision
revpdf-extract input.pdf --revision-index 1 --output cleaned.pdf
This writes a new PDF containing only the bytes through the selected revision.
You can also extract by byte offset:
revpdf-extract input.pdf --end-offset 229086321 --output cleaned.pdf
Practical Decision Checklist
Use rollback when all of these are true:
- the PDF has multiple revisions
- the unwanted changes were introduced in later revisions
- those later revisions are annotation-only or annotation-dominant
- the earlier revision already contains the correct textbook content
Do not use blind rollback when any of these are true:
- the later revisions contain actual content changes you need
- you cannot tell whether the later objects are only annotations
- the PDF was fully rewritten instead of saved incrementally
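The checklist reduces to a conjunction: rollback is appropriate only when every positive condition holds. A small sketch of that logic follows; the Findings class and its field names are invented for this example and do not correspond to any revpdf output format.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Hypothetical summary of what inspection turned up for one PDF."""
    revision_count: int
    unwanted_changes_in_later_revisions: bool
    later_revisions_annotation_only: bool
    earlier_revision_has_needed_content: bool


def rollback_is_appropriate(f: Findings) -> bool:
    """Mirror the checklist: every 'use rollback' condition must hold."""
    return (
        f.revision_count > 1  # a single revision means there is nothing to roll back to
        and f.unwanted_changes_in_later_revisions
        and f.later_revisions_annotation_only
        and f.earlier_revision_has_needed_content
    )


print(rollback_is_appropriate(Findings(3, True, True, True)))
print(rollback_is_appropriate(Findings(3, True, False, True)))
```

When you cannot tell whether the later objects are only annotations, set later_revisions_annotation_only to False so the function errs toward not rolling back.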
Notes
- Some tools warn about broken or invalid linearization tables. That does not automatically mean the PDF is unusable.
- Some annotation systems render hand-drawn marks as /Stamp objects with appearance streams rather than /Ink objects. Search broadly.
- Some editors store changed objects inside compressed object streams (/ObjStm) or xref streams. The change-report script now detects these and expands common Flate-compressed object streams.
- Always work on a copy when the document is important.
Recommended Sequence
- Run revpdf-inspect.
- Run revpdf-list to see exactly which objects each revision added or redefined.
- Identify the first revision that introduces unwanted annotation markers.
- Choose between:
  - Rollback: Use revpdf-extract to truncate at a safe revision.
  - Surgical Cleanup: Use revpdf-clean to remove specific markup layers while keeping recent content.
- Verify the page count, metadata, and marker counts.