Turn image-only PDFs into searchable, selectable PDFs with an OCR text layer

readablepdf

readablepdf converts image-only PDFs into searchable and selectable PDFs by adding an OCR text layer while preserving page visuals.

How it works

  1. Render each page to a PNG (pdf2image backed by Poppler).
  2. OCR each page with Tesseract in PDF mode (image layer + invisible text layer).
  3. Merge all OCR page PDFs into one output PDF.

All intermediate files are written inside an OS temporary directory and removed automatically.
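
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not readablepdf's actual code: it calls pdftoppm directly instead of going through pdf2image, and it merges with pdfunite (another Poppler tool), which is an assumption because the steps above do not say how the merge is done. All function names are hypothetical.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def render_cmd(pdf: Path, prefix: Path, dpi: int) -> list[str]:
    # pdftoppm writes one <prefix>-<page>.png file per page.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image: Path, out_base: Path, lang: str) -> list[str]:
    # The trailing "pdf" config makes Tesseract write <out_base>.pdf,
    # containing the page image plus an invisible text layer.
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf"]


def page_number(path: Path) -> int:
    # "page-07.png" -> 7, so pages sort numerically rather than as text.
    return int(path.stem.rsplit("-", 1)[1])


def make_searchable(pdf: Path, output: Path, lang: str = "eng", dpi: int = 200) -> None:
    # Everything intermediate lives in a temporary directory that is
    # deleted when the "with" block exits.
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        subprocess.run(render_cmd(pdf, work / "page", dpi), check=True)

        page_pdfs = []
        for image in sorted(work.glob("page-*.png"), key=page_number):
            out_base = work / image.stem
            subprocess.run(ocr_cmd(image, out_base, lang), check=True)
            page_pdfs.append(out_base.with_suffix(".pdf"))

        if len(page_pdfs) == 1:
            # Nothing to merge for a single-page document.
            shutil.copyfile(page_pdfs[0], output)
        else:
            # Assumption: merge with Poppler's pdfunite.
            subprocess.run(["pdfunite", *map(str, page_pdfs), str(output)], check=True)
```

Calling `make_searchable(Path("input.pdf"), Path("output_ocr.pdf"))` would need tesseract, pdftoppm, and pdfunite on PATH.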

System dependencies (Linux + macOS)

You need these binaries on the machine where you run readablepdf:

  • tesseract
  • pdftoppm (from Poppler)

Ubuntu/Debian:

sudo apt update
sudo apt install -y tesseract-ocr poppler-utils

macOS (Homebrew):

brew install tesseract poppler

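To OCR languages other than English, install the matching Tesseract language packs (for example, tesseract-ocr-fra on Ubuntu/Debian, or the tesseract-lang formula on Homebrew) and pass the language code with --lang.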
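
A quick way to confirm that both binaries are reachable before running the tool is to look them up on PATH. The helper below is a hypothetical sketch and is not part of readablepdf; the `which` parameter exists only so the lookup can be swapped out in tests.

```python
import shutil
from typing import Callable, Iterable, Optional

# The two binaries listed above.
REQUIRED = ("tesseract", "pdftoppm")


def missing_binaries(
    names: Iterable[str] = REQUIRED,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    # shutil.which returns None when a name is not found on PATH.
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("All system dependencies found.")
```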

Install and run with pipx

readablepdf is published on PyPI, so pipx can run it directly without you managing a virtualenv:

pipx run readablepdf input.pdf

Custom output/language/DPI:

pipx run readablepdf input.pdf -o output_ocr.pdf --lang eng --dpi 200
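
For readers curious how such a command line might be parsed, here is a minimal argparse sketch that accepts the invocation above. It is an illustration only: the real parser may differ, and the `--output` long form and every default value here are assumptions.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="readablepdf")
    parser.add_argument("input", type=Path, help="image-only PDF to process")
    # Assumption: "--output" as a long alias and None as the default.
    parser.add_argument("-o", "--output", type=Path, default=None,
                        help="path of the searchable PDF to write")
    # Assumption: English as the default OCR language.
    parser.add_argument("--lang", default="eng",
                        help="Tesseract language code")
    # Assumption: 200 DPI as the default render resolution.
    parser.add_argument("--dpi", type=int, default=200,
                        help="resolution used when rendering pages")
    return parser


args = build_parser().parse_args(
    ["input.pdf", "-o", "output_ocr.pdf", "--lang", "eng", "--dpi", "200"]
)
```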

Local development

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
ruff check .
ruff format --check .
pytest -q
python -m build

GitHub Actions

  • CI workflow: runs lint, format check, tests, and build on Ubuntu and macOS.
  • Publish to PyPI workflow: runs when a GitHub Release is published (or on manual trigger).

GitHub configuration needed for publish

  1. Create a repository secret named PYPI_API_TOKEN.
  2. Set it to a PyPI API token with upload permission for this project.
  3. (Optional but recommended) Create a protected environment named pypi and require approvals.

Release flow

  1. Merge to main.
  2. Create a Git tag/release (for example v0.1.1).
  3. The Publish to PyPI workflow uploads the built artifacts to PyPI, with the version derived from that tag.
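
How the version is derived from the tag is not spelled out here. One plausible approach, shown purely as an assumption, is to strip an optional refs/tags/ prefix and a leading v from the tag name. The function name is hypothetical.

```python
import re

# Accepts "v0.1.1", "0.1.1", or "refs/tags/v0.1.1".
_TAG_PATTERN = re.compile(r"^(?:refs/tags/)?v?(\d+(?:\.\d+)*)$")


def version_from_tag(tag: str) -> str:
    match = _TAG_PATTERN.match(tag.strip())
    if match is None:
        raise ValueError(f"tag does not look like a version: {tag!r}")
    return match.group(1)
```

Rejecting tags that do not match keeps a mistyped tag from producing an unexpected package version.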
