Turn PDFs into editable JSON and visually matching replica PDFs.

PDFTwin

PDFTwin turns a PDF into files that are easier to work with:

An editable JSON representation of the document
A replica PDF that stays visually close to the original

This is useful when you want more than plain OCR text. PDFTwin keeps layout, text blocks, images, vectors, fonts, and page geometry so you can inspect, transform, compare, or regenerate the document.

What You Get

Given a file like invoice.pdf, PDFTwin can create:

invoice_twin.json
invoice_twin.pdf

The JSON is the editable version. The PDF is the visual replica.
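
The schema of the twin JSON is not described in this README, so the following is only a schema-agnostic sketch for getting a first look at a file such as invoice_twin.json. The helper names `outline` and `outline_file` and the depth limit are illustrative assumptions, not part of PDFTwin.

```python
import json
from pathlib import Path


def outline(node, depth=0, max_depth=2):
    """Return indented lines describing the shape of a parsed JSON value.

    Makes no assumption about PDFTwin's schema: it only reports keys,
    list lengths, and value type names down to max_depth.
    """
    pad = "  " * depth
    if isinstance(node, dict):
        lines = []
        for key, value in node.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            if depth < max_depth:
                lines.extend(outline(value, depth + 1, max_depth))
        return lines
    if isinstance(node, list):
        lines = [f"{pad}[{len(node)} items]"]
        # Describe only the first element; treating lists as homogeneous
        # is an assumption, not a guarantee.
        if node and depth < max_depth:
            lines.extend(outline(node[0], depth + 1, max_depth))
        return lines
    return []  # scalars are already described by their parent's type line


def outline_file(path):
    """Load a JSON file and return its outline as one string."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return "\n".join(outline(data))
```

Calling `print(outline_file("invoice_twin.json"))` would then list the top-level keys and the shape of the first few nesting levels, which is usually enough to decide where to start editing.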

Quick Start

Install from PyPI:

pip install pdftwin

From a source checkout:

# clone this repository first
cd PDFTwin
python3 -m pip install -e .

If you want development tools too:

python3 -m pip install -e ".[dev]"

Create Editable Outputs From a PDF

pdftwin input.pdf -o /tmp

This writes:

/tmp/input_twin.pdf
/tmp/input_twin.json

If you omit -o, PDFTwin writes both files to the current folder:

./input_twin.pdf
./input_twin.json

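With the default command, -o is an output folder, not a filename. The extract and render subcommands under Common Workflows take an output file path instead.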
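
The output naming rule above can be sketched in Python as follows. This is a minimal sketch: the helper names are hypothetical, the only CLI syntax used is the `pdftwin <input> -o <folder>` form shown above, and `run_pdftwin` assumes the `pdftwin` executable is on PATH.

```python
import subprocess
from pathlib import Path


def expected_outputs(pdf_path, out_dir="."):
    """Predict the two files PDFTwin writes for a given input PDF."""
    stem = Path(pdf_path).stem
    out = Path(out_dir)
    return out / f"{stem}_twin.pdf", out / f"{stem}_twin.json"


def run_pdftwin(pdf_path, out_dir="."):
    """Run the default command and check that both outputs appeared."""
    subprocess.run(["pdftwin", str(pdf_path), "-o", str(out_dir)], check=True)
    outputs = expected_outputs(pdf_path, out_dir)
    missing = [p for p in outputs if not p.exists()]
    if missing:
        raise FileNotFoundError(f"expected outputs not found: {missing}")
    return outputs
```

Keeping the path prediction separate from the subprocess call makes it easy to check for stale outputs before a run, or to clean them up afterwards.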

Common Workflows

1. Create both editable JSON and a replica PDF

pdftwin contract.pdf -o ./outputs

This creates:

./outputs/contract_twin.json
./outputs/contract_twin.pdf

2. Create only the editable JSON

pdftwin extract contract.pdf -o contract_twin.json

3. Create a replica PDF from JSON

pdftwin render contract_twin.json -o recreated.pdf

4. Compare the original PDF with the replica PDF

pdftwin diff contract.pdf recreated.pdf --report diff_report.md --images diff_artifacts/

5. Inspect a PDF before processing

pdftwin inspect contract.pdf
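
Workflows 2 to 4 chain naturally into a round trip: extract, optionally edit the JSON, render, then diff. The sketch below only assembles the command lines shown above. `round_trip_commands`, `run_round_trip`, and the default file names are illustrative, and nothing here edits the JSON because its schema is not documented in this README.

```python
import subprocess
from pathlib import Path


def round_trip_commands(pdf_path, work_dir="."):
    """Build the extract -> render -> diff commands for one PDF."""
    pdf = Path(pdf_path)
    work = Path(work_dir)
    twin_json = work / f"{pdf.stem}_twin.json"
    recreated = work / "recreated.pdf"
    return [
        ["pdftwin", "extract", str(pdf), "-o", str(twin_json)],
        ["pdftwin", "render", str(twin_json), "-o", str(recreated)],
        ["pdftwin", "diff", str(pdf), str(recreated),
         "--report", str(work / "diff_report.md"),
         # Trailing slash kept to match the form used in this README.
         "--images", str(work / "diff_artifacts") + "/"],
    ]


def run_round_trip(pdf_path, work_dir="."):
    """Run each step in order, stopping at the first failure."""
    for cmd in round_trip_commands(pdf_path, work_dir):
        subprocess.run(cmd, check=True)
```

Any JSON edits would go between the extract and render steps; running the diff afterwards shows how far the edited replica has drifted from the original.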

Why Use PDFTwin Instead Of Basic OCR

Basic OCR usually gives you text only.

PDFTwin is designed to preserve document structure such as:

Page sizes and layout
Text spans and positions
Images and their placements
Vector lines and shapes
Font information and fallback matching

That makes it better suited for rebuilding documents, document analysis, migrations, validation, and automated processing pipelines.

How It Works

PDFTwin extracts the PDF into a structured JSON model using PyMuPDF and a set of specialized agents. That JSON can then be used to create a replica PDF, inspect document structure, or compare output quality against the original.

For harder documents, the tool can optionally use LLM-assisted OCR and visual verification.
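
To give a feel for the raw material PyMuPDF provides to the extraction step, here is a minimal sketch that reads text spans, fonts, and positions from each page. It is not PDFTwin's extraction code: the function names and the output layout are illustrative assumptions, and only standard PyMuPDF calls (`fitz.open`, `page.rect`, `page.get_text("dict")`) are used.

```python
def spans_from_page_dict(page_dict):
    """Flatten PyMuPDF's page.get_text("dict") output into a span list."""
    spans = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                spans.append({
                    "text": span["text"],
                    "font": span["font"],
                    "size": span["size"],
                    "bbox": list(span["bbox"]),  # x0, y0, x1, y1 in points
                })
    return spans


def extract_pages(pdf_path):
    """Return one entry per page with its size and text spans."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependency

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pages.append({
                "width": page.rect.width,
                "height": page.rect.height,
                "spans": spans_from_page_dict(page.get_text("dict")),
            })
    return pages
```

A real extractor would also need images and vector drawings, which this sketch leaves out for brevity.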

Configuration

Show the current config:

pdftwin config show

If you plan to use Gemini-powered OCR or visual checks, set your API key:

export GEMINI_API_KEY="your-google-gemini-key"
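
A script that wraps PDFTwin may want to fail early when the key is absent. The following is a minimal sketch: `require_gemini_key` is a hypothetical helper, and only the GEMINI_API_KEY variable name comes from this README.

```python
import os


def require_gemini_key(env=None):
    """Return the Gemini API key or raise a clear error if it is unset."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it before enabling "
            "Gemini-powered OCR or visual checks."
        )
    return key
```

Passing a mapping as `env` keeps the helper easy to test without touching the real environment.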

Automatic PyPI Publishing

This repository includes GitHub Actions workflows for both CI and PyPI publishing:

.github/workflows/ci.yml runs tests and validates the package build on pushes and pull requests
.github/workflows/publish.yml publishes to PyPI with GitHub Trusted Publishing

Trusted Publisher Settings

In PyPI, the trusted publisher should point to:

Owner: homerquan
Repository: PDFTwin
Workflow file: .github/workflows/publish.yml
Environment: pypi

Release Flow

1. Update the version in pyproject.toml and src/pdftwin/__init__.py
2. Commit your changes
3. Create and push a version tag such as v0.1.1

git tag v0.1.1
git push origin v0.1.1

That tag triggers the publish workflow, which:

verifies the tag matches the package version
runs the test suite
builds the wheel and source distribution
publishes to PyPI using GitHub OIDC, without a saved PyPI API token

Manual Build Fallback

If you ever want to build locally before releasing:

python3 -m build
python3 -m twine check dist/*

Notes

For advanced image diffing, your system should have common Pillow dependencies available, such as libjpeg and zlib.
Some scanned PDFs may contain a full-page image plus a text layer. PDFTwin preserves searchability while avoiding duplicated visible text in the replica PDF.

This version

0.1.0

Mar 18, 2026

Download files

Source Distribution

pdftwin-0.1.0.tar.gz (1.1 MB)

Uploaded Mar 18, 2026 Source

Built Distribution

pdftwin-0.1.0-py3-none-any.whl (23.9 kB)

Uploaded Mar 18, 2026 Python 3

File details

Details for the file pdftwin-0.1.0.tar.gz.

File metadata

Download URL: pdftwin-0.1.0.tar.gz
Upload date: Mar 18, 2026
Size: 1.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for pdftwin-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`dc19f285158096bb898e69f437b12c15c723febf2c537fa113d27f3dae27ea60`
MD5	`094f5083ac1cc4d8a9aa538732624346`
BLAKE2b-256	`989aed6a7bf09cdb1e00bbccf05a777f883bdfbbd12fcef94dc1aecdf7b6cdc7`

File details

Details for the file pdftwin-0.1.0-py3-none-any.whl.

File metadata

Download URL: pdftwin-0.1.0-py3-none-any.whl
Upload date: Mar 18, 2026
Size: 23.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for pdftwin-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6f043c679007da38010120280b9dd980aa6a450a7ede0c164ad6995a8b542d63`
MD5	`c0550216ebea4069b04b01bed0ca08b4`
BLAKE2b-256	`dce2d14b79035d6f54dca45e163ff431d4cc823cc5deab0c3685ff309e1236b7`
