Turn PDFs into editable JSON and visually matching replica PDFs.

PDFTwin

PDFTwin turns a PDF into files that are easier to work with:

  • An editable JSON representation of the document
  • A replica PDF that stays visually close to the original

This is useful when you want more than plain OCR text. PDFTwin keeps layout, text blocks, images, vectors, fonts, and page geometry so you can inspect, transform, compare, or regenerate the document.

What You Get

Given a file like invoice.pdf, PDFTwin can create:

  • invoice_twin.json
  • invoice_twin.pdf

The JSON is the editable version. The PDF is the visual replica.
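
Because the JSON is plain data, a short script can edit it before a replica is rendered. The sketch below assumes nothing about PDFTwin's schema, which this page does not document: it walks any nested JSON value and replaces a substring in every string it finds. The `sample` document and its field names (`pages`, `spans`, `text`, `size`) are hypothetical stand-ins, and `replace_text` is an illustrative helper, not part of PDFTwin.

```python
import json


def replace_text(node, old, new):
    """Return a copy of a JSON value with `old` replaced by `new` in every string value."""
    if isinstance(node, str):
        return node.replace(old, new)
    if isinstance(node, list):
        return [replace_text(item, old, new) for item in node]
    if isinstance(node, dict):
        # Keys are left untouched; only values are rewritten.
        return {key: replace_text(value, old, new) for key, value in node.items()}
    return node  # numbers, booleans, and null pass through unchanged


# Hypothetical stand-in for invoice_twin.json; the real schema may differ.
sample = json.loads('{"pages": [{"spans": [{"text": "Invoice 2023", "size": 11}]}]}')
edited = replace_text(sample, "2023", "2024")
print(json.dumps(edited))
```

A blanket replacement like this also touches strings that are not visible text, such as font names, so a real script would likely restrict itself to the fields that hold page text once the actual schema is known.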

Quick Start

Install from PyPI:

pip install pdftwin

From the source repo today:

# clone this repository first
cd PDFTwin
python3 -m pip install -e .

If you want development tools too:

# quotes keep shells such as zsh from expanding the square brackets
python3 -m pip install -e ".[dev]"

Create Editable Outputs From a PDF

pdftwin input.pdf -o /tmp

This writes:

  • /tmp/input_twin.pdf
  • /tmp/input_twin.json

If you omit -o, PDFTwin writes both files to the current folder:

  • ./input_twin.pdf
  • ./input_twin.json

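For this default command, -o is an output folder, not a filename. The extract and render subcommands shown under Common Workflows take a filename instead.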
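
The naming rule above can be sketched in a few lines, which may help when a script needs to know in advance where the outputs will land. `twin_paths` is a hypothetical helper written for this example, not a PDFTwin function; it only mirrors the `<name>_twin.pdf` and `<name>_twin.json` pattern shown in the listings.

```python
from pathlib import Path


def twin_paths(input_pdf, output_dir="."):
    """Predict the two output paths of the default command for a given input."""
    stem = Path(input_pdf).stem  # "input.pdf" -> "input"
    out = Path(output_dir)
    return out / f"{stem}_twin.pdf", out / f"{stem}_twin.json"


pdf_path, json_path = twin_paths("input.pdf", "/tmp")
print(pdf_path.as_posix(), json_path.as_posix())
```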

Common Workflows

1. Create both editable JSON and a replica PDF

pdftwin contract.pdf -o ./outputs

This creates:

  • ./outputs/contract_twin.json
  • ./outputs/contract_twin.pdf

2. Create only the editable JSON

pdftwin extract contract.pdf -o contract_twin.json

3. Create a replica PDF from JSON

pdftwin render contract_twin.json -o recreated.pdf

4. Compare the original PDF with the replica PDF

pdftwin diff contract.pdf recreated.pdf --report diff_report.md --images diff_artifacts/

5. Inspect a PDF before processing

pdftwin inspect contract.pdf
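
Workflows 2 to 4 chain naturally into a round trip: extract, render, then diff. The sketch below builds those three command lines using only the subcommands and flags shown above and prints them for review. The helper names `roundtrip_commands` and `run_all` are made up for this example, and it is an assumption that pdftwin exits with a nonzero status on failure, which is what `check=True` relies on.

```python
import shlex
import subprocess
from pathlib import Path


def roundtrip_commands(pdf, replica="recreated.pdf"):
    """Build the extract, render, and diff invocations from workflows 2 to 4."""
    json_path = f"{Path(pdf).stem}_twin.json"
    return [
        ["pdftwin", "extract", pdf, "-o", json_path],
        ["pdftwin", "render", json_path, "-o", replica],
        ["pdftwin", "diff", pdf, replica,
         "--report", "diff_report.md", "--images", "diff_artifacts/"],
    ]


def run_all(commands):
    """Run each command in order; raises CalledProcessError on a nonzero exit."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = roundtrip_commands("contract.pdf")
for argv in commands:
    print(shlex.join(argv))  # preview only; call run_all(commands) to execute
```

Passing argument lists rather than one shell string avoids quoting problems when a filename contains spaces.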

Why Use PDFTwin Instead Of Basic OCR

Basic OCR usually gives you text only.

PDFTwin is designed to preserve document structure such as:

  • Page sizes and layout
  • Text spans and positions
  • Images and their placements
  • Vector lines and shapes
  • Font information and fallback matching

That makes it better suited for rebuilding documents, document analysis, migrations, validation, and automated processing pipelines.

How It Works

PDFTwin extracts the PDF into a structured JSON model using PyMuPDF and a set of specialized agents. That JSON can then be used to create a replica PDF, inspect document structure, or compare output quality against the original.

For harder documents, the tool can optionally use LLM-assisted OCR and visual verification.

Configuration

Show the current config:

pdftwin config show

If you plan to use Gemini-powered OCR or visual checks, set your API key:

export GEMINI_API_KEY="your-google-gemini-key"
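
A script that wraps pdftwin may want to confirm the key is present before starting a long job. The following is a minimal sketch of such a check; `gemini_key_status` is a hypothetical helper, and how PDFTwin itself reads or validates the variable is not described here.

```python
import os


def gemini_key_status(env=None):
    """Report whether GEMINI_API_KEY is set, without revealing its value."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        return "GEMINI_API_KEY is not set"
    return f"GEMINI_API_KEY is set ({len(key)} characters)"


print(gemini_key_status())
```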

Automatic PyPI Publishing

This repository includes GitHub Actions workflows for both CI and PyPI publishing:

  • .github/workflows/ci.yml runs tests and validates the package build on pushes and pull requests
  • .github/workflows/publish.yml publishes to PyPI with GitHub Trusted Publishing

Trusted Publisher Settings

In PyPI, the trusted publisher should point to:

  • Owner: homerquan
  • Repository: PDFTwin
  • Workflow file: .github/workflows/publish.yml
  • Environment: pypi

Release Flow

  1. Update the version in pyproject.toml and src/pdftwin/__init__.py
  2. Commit your changes
  3. Create and push a version tag such as v0.1.1:

git tag v0.1.1
git push origin v0.1.1

That tag triggers the publish workflow, which:

  • verifies the tag matches the package version
  • runs the test suite
  • builds the wheel and source distribution
  • publishes to PyPI using GitHub OIDC, without a saved PyPI API token
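
The first step, checking that the tag matches the package version, can be tried locally before pushing. The sketch below is not the workflow's actual check, which is not shown here. It assumes tags are a `v` followed by three numeric parts, and that `src/pdftwin/__init__.py` stores the version in a `__version__` variable; both are assumptions, and both helper names are hypothetical.

```python
import re


def version_from_init(source):
    """Extract the version from module source, assuming a `__version__ = "..."` line."""
    match = re.search(r'^__version__\s*=\s*["\']([^"\']+)["\']', source, re.MULTILINE)
    return match.group(1) if match else None


def tag_matches_version(tag, version):
    """Return True when a tag such as "v0.1.1" names the given package version."""
    match = re.fullmatch(r"v(\d+\.\d+\.\d+)", tag)
    return match is not None and match.group(1) == version


# Hypothetical file contents standing in for src/pdftwin/__init__.py.
init_source = '"""PDFTwin package."""\n__version__ = "0.1.1"\n'
version = version_from_init(init_source)
print(version, tag_matches_version("v0.1.1", version))
```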

Manual Build Fallback

If you ever want to build locally before releasing:

python3 -m build
python3 -m twine check dist/*

Notes

  • For advanced image diffing, your system should have common Pillow dependencies available, such as libjpeg and zlib.
  • Some scanned PDFs may contain a full-page image plus a text layer. PDFTwin preserves searchability while avoiding duplicated visible text in the replica PDF.
