Turn PDFs into editable JSON and visually matching replica PDFs.
PDFTwin
PDFTwin turns a PDF into files that are easier to work with:
- An editable JSON representation of the document
- A replica PDF that stays visually close to the original
This is useful when you want more than plain OCR text. PDFTwin keeps layout, text blocks, images, vectors, fonts, and page geometry so you can inspect, transform, compare, or regenerate the document.
What You Get
Given a file like invoice.pdf, PDFTwin can create:
invoice_twin.json
invoice_twin.pdf
The JSON is the editable version. The PDF is the visual replica.
Quick Start
After the package is published to PyPI:
pip install pdftwin
From the source repo today:
# clone this repository first
cd PDFTwin
python3 -m pip install -e .
If you want development tools too:
python3 -m pip install -e ".[dev]"
Create Editable Outputs From a PDF
pdftwin input.pdf -o /tmp
This writes:
/tmp/input_twin.pdf
/tmp/input_twin.json
If you omit -o, PDFTwin writes both files to the current folder:
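./input_twin.pdf
./input_twin.json
With the default command, -o is an output folder, not a filename. In the extract and render examples below, -o is given a filename instead.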
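The naming rule above can be sketched in a few lines of Python, which may help when a script needs to know where the outputs will land before running the tool. The function name expected_twin_paths is hypothetical and not part of PDFTwin; it only mirrors the `<name>_twin.pdf` / `<name>_twin.json` pattern shown in the examples.

```python
from pathlib import Path


def expected_twin_paths(input_pdf, output_dir="."):
    """Predict the output paths of the default `pdftwin <input> -o <folder>` call.

    Hypothetical helper: it only reproduces the naming pattern shown in the
    examples and does not call PDFTwin itself.
    """
    stem = Path(input_pdf).stem          # "input.pdf" -> "input"
    out = Path(output_dir)               # -o is a folder for the default command
    return out / f"{stem}_twin.pdf", out / f"{stem}_twin.json"


pdf_path, json_path = expected_twin_paths("input.pdf", "/tmp")
print(pdf_path.as_posix())
print(json_path.as_posix())
```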
Common Workflows
1. Create both editable JSON and a replica PDF
pdftwin contract.pdf -o ./outputs
This creates:
./outputs/contract_twin.json
./outputs/contract_twin.pdf
2. Create only the editable JSON
pdftwin extract contract.pdf -o contract_twin.json
3. Create a replica PDF from JSON
pdftwin render contract_twin.json -o recreated.pdf
4. Compare the original PDF with the replica PDF
pdftwin diff contract.pdf recreated.pdf --report diff_report.md --images diff_artifacts/
5. Inspect a PDF before processing
pdftwin inspect contract.pdf
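Workflows 2 to 4 chain naturally into an extract, render, diff round trip. The sketch below only assembles the command lines, using the subcommands and flags shown above; the helper names are hypothetical, and nothing is executed unless you pass each list to subprocess.run yourself.

```python
import shlex


def extract_cmd(pdf, json_out):
    # Workflow 2: PDF -> editable JSON
    return ["pdftwin", "extract", pdf, "-o", json_out]


def render_cmd(json_in, pdf_out):
    # Workflow 3: JSON -> replica PDF
    return ["pdftwin", "render", json_in, "-o", pdf_out]


def diff_cmd(original, replica, report, images_dir):
    # Workflow 4: compare the original with the replica
    return ["pdftwin", "diff", original, replica,
            "--report", report, "--images", images_dir]


def round_trip_commands(pdf, json_out, pdf_out, report, images_dir):
    """Return the three commands in the order they would be run."""
    return [
        extract_cmd(pdf, json_out),
        render_cmd(json_out, pdf_out),
        diff_cmd(pdf, pdf_out, report, images_dir),
    ]


for cmd in round_trip_commands("contract.pdf", "contract_twin.json",
                               "recreated.pdf", "diff_report.md",
                               "diff_artifacts/"):
    # To run for real (assumes pdftwin is installed and on PATH):
    #   subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Building argument lists rather than shell strings avoids quoting problems when file names contain spaces.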
Why Use PDFTwin Instead Of Basic OCR
Basic OCR usually gives you text only.
PDFTwin is designed to preserve document structure such as:
- Page sizes and layout
- Text spans and positions
- Images and their placements
- Vector lines and shapes
- Font information and fallback matching
That makes it better suited for rebuilding documents, document analysis, migrations, validation, and automated processing pipelines.
How It Works
PDFTwin extracts the PDF into a structured JSON model using PyMuPDF and a set of specialized agents. That JSON can then be used to create a replica PDF, inspect document structure, or compare output quality against the original.
For harder documents, the tool can optionally use LLM-assisted OCR and visual verification.
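The JSON schema is not described here, so a schema-agnostic way to look at an extracted file is to summarize its shape. The sketch below is only that: the shape function is a hypothetical helper, and the sample dictionary is invented for illustration and is not PDFTwin's actual format.

```python
import json


def shape(node, max_depth=2, depth=0):
    """Describe nested JSON data by type and size, down to max_depth levels."""
    if isinstance(node, dict):
        if depth >= max_depth:
            return f"object({len(node)} keys)"
        return {key: shape(value, max_depth, depth + 1)
                for key, value in node.items()}
    if isinstance(node, list):
        if depth >= max_depth or not node:
            return f"array({len(node)} items)"
        # Describe the first element as a representative of the list.
        return [shape(node[0], max_depth, depth + 1),
                f"... {len(node)} items total"]
    return type(node).__name__


# Invented sample, NOT the real PDFTwin schema. For a real file use:
#   with open("contract_twin.json") as fh: data = json.load(fh)
sample = {"pages": [{"width": 612, "height": 792}], "source": "contract.pdf"}
print(json.dumps(shape(sample), indent=2))
```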
Configuration
Show the current config:
pdftwin config show
If you plan to use Gemini-powered OCR or visual checks, set your API key:
export GEMINI_API_KEY="your-google-gemini-key"
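A script that wraps PDFTwin may want to check for the key before requesting the LLM-assisted features. This is a minimal sketch; gemini_key_available is a hypothetical helper, and the only name taken from the text above is the GEMINI_API_KEY variable.

```python
import os


def gemini_key_available(env=None):
    """Return True if GEMINI_API_KEY is set to a non-blank value."""
    env = os.environ if env is None else env
    return bool(env.get("GEMINI_API_KEY", "").strip())


if gemini_key_available():
    print("GEMINI_API_KEY is set")
else:
    print("GEMINI_API_KEY is not set; skip Gemini-powered OCR and visual checks")
```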
Automatic PyPI Publishing
This repository now includes GitHub Actions workflows for both CI and PyPI publishing:
- .github/workflows/ci.yml runs tests and validates the package build on pushes and pull requests
- .github/workflows/publish.yml publishes to PyPI with GitHub Trusted Publishing
Trusted Publisher Settings
In PyPI, the trusted publisher should point to:
- Owner: homerquan
- Repository: PDFTwin
- Workflow file: .github/workflows/publish.yml
- Environment: pypi
Release Flow
- Update the version in pyproject.toml and src/pdftwin/__init__.py
- Commit your changes
- Create and push a version tag such as v0.1.1
git tag v0.1.1
git push origin v0.1.1
That tag triggers the publish workflow, which:
- verifies the tag matches the package version
- runs the test suite
- builds the wheel and source distribution
- publishes to PyPI using GitHub OIDC, without a saved PyPI API token
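The first step, checking that the tag matches the package version, can be reproduced locally before pushing. The sketch below is an assumption about what such a check looks like, not the workflow's actual code; both function names are hypothetical, and the version strings are passed in rather than read from pyproject.toml or __init__.py.

```python
import re


def version_from_tag(tag):
    """Strip the leading 'v' from a tag such as 'v0.1.1'."""
    match = re.fullmatch(r"v(\d+\.\d+\.\d+)", tag)
    if match is None:
        raise ValueError(f"unexpected tag format: {tag!r}")
    return match.group(1)


def tag_matches_versions(tag, pyproject_version, init_version):
    """True only if the tag and both declared versions agree."""
    return version_from_tag(tag) == pyproject_version == init_version


print(tag_matches_versions("v0.1.1", "0.1.1", "0.1.1"))
print(tag_matches_versions("v0.1.1", "0.1.1", "0.1.0"))
```

Checking both files catches the common mistake of bumping the version in only one of them.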
Manual Build Fallback
If you ever want to build locally before releasing:
python3 -m build
python3 -m twine check dist/*
Notes
- For advanced image diffing, your system should have common Pillow dependencies available, such as libjpeg and zlib.
- Some scanned PDFs may contain a full-page image plus a text layer. PDFTwin preserves searchability while avoiding duplicated visible text in the replica PDF.
File details
Details for the file pdftwin-0.1.0.tar.gz.
File metadata
- Download URL: pdftwin-0.1.0.tar.gz
- Upload date:
- Size: 1.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dc19f285158096bb898e69f437b12c15c723febf2c537fa113d27f3dae27ea60 |
| MD5 | 094f5083ac1cc4d8a9aa538732624346 |
| BLAKE2b-256 | 989aed6a7bf09cdb1e00bbccf05a777f883bdfbbd12fcef94dc1aecdf7b6cdc7 |
File details
Details for the file pdftwin-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pdftwin-0.1.0-py3-none-any.whl
- Upload date:
- Size: 23.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6f043c679007da38010120280b9dd980aa6a450a7ede0c164ad6995a8b542d63 |
| MD5 | c0550216ebea4069b04b01bed0ca08b4 |
| BLAKE2b-256 | dce2d14b79035d6f54dca45e163ff431d4cc823cc5deab0c3685ff309e1236b7 |
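A downloaded file can be checked against the published SHA256 digest with Python's standard hashlib module. This is a generic sketch rather than anything PDFTwin ships; the helper names are hypothetical, and the example only runs the comparison if the source distribution is present in the current folder.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 listed above for pdftwin-0.1.0.tar.gz
SDIST = Path("pdftwin-0.1.0.tar.gz")
SDIST_SHA256 = "dc19f285158096bb898e69f437b12c15c723febf2c537fa113d27f3dae27ea60"

if SDIST.exists():
    print("match" if matches_published(SDIST, SDIST_SHA256) else "MISMATCH")
else:
    print(f"{SDIST} not found; download it first")
```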