Turn image-only PDFs into searchable, selectable PDFs with an OCR text layer

readablepdf

readablepdf converts image-only PDFs into searchable and selectable PDFs by adding an OCR text layer while preserving page visuals.

How it works

  1. Render each page to a PNG (pdf2image backed by Poppler).
  2. OCR each page with Tesseract in PDF mode (image layer + invisible text layer).
  3. Merge all OCR page PDFs into one output PDF.

All intermediate files are written inside an OS temporary directory and removed automatically.
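The three steps above can be sketched in a few lines of Python. This is not readablepdf's actual code: it calls the pdftoppm and tesseract binaries directly instead of going through pdf2image, and it assumes pdfunite (also shipped with Poppler) for the merge step, which the description above does not name. All function names here are hypothetical.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def render_cmd(src: Path, prefix: Path, dpi: int) -> list[str]:
    # pdftoppm writes one PNG per page: <prefix>-1.png, <prefix>-2.png, ...
    return ["pdftoppm", "-png", "-r", str(dpi), str(src), str(prefix)]


def ocr_cmd(png: Path, lang: str) -> list[str]:
    # Tesseract's "pdf" config writes <outputbase>.pdf with an invisible text layer.
    return ["tesseract", str(png), str(png.with_suffix("")), "-l", lang, "pdf"]


def merge_cmd(pages: list[Path], dst: Path) -> list[str]:
    # Assumption: pdfunite is used for merging; readablepdf may merge differently.
    return ["pdfunite", *map(str, pages), str(dst)]


def page_number(png: Path) -> int:
    # "page-07.png" -> 7, so pages sort numerically rather than as strings.
    return int(png.stem.rsplit("-", 1)[1])


def make_searchable(src: Path, dst: Path, lang: str = "eng", dpi: int = 200) -> None:
    # Everything intermediate lives in a temporary directory that is
    # removed when the with-block exits, even if a step fails.
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(render_cmd(src, prefix, dpi), check=True)
        pngs = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        for png in pngs:
            subprocess.run(ocr_cmd(png, lang), check=True)
        pages = [png.with_suffix(".pdf") for png in pngs]
        if len(pages) == 1:
            # Nothing to merge for a single-page document.
            shutil.copyfile(pages[0], dst)
        else:
            subprocess.run(merge_cmd(pages, dst), check=True)
```

Keeping the command construction in small pure functions makes the pipeline easy to inspect without running any OCR.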

System dependencies (Linux + macOS)

You need these binaries on the machine where you run readablepdf:

  • tesseract
  • pdftoppm (from Poppler)
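To confirm both binaries are on your PATH before running, a small check like the following may help. It is a sketch that only uses the standard library; the function name missing_binaries is hypothetical and not part of readablepdf.

```python
import shutil
from typing import Callable, Iterable, Optional


def missing_binaries(
    required: Iterable[str] = ("tesseract", "pdftoppm"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    # shutil.which returns None when a program cannot be found on PATH.
    return [name for name in required if which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("All system dependencies found.")
```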

Ubuntu/Debian:

sudo apt update
sudo apt install -y tesseract-ocr poppler-utils

macOS (Homebrew):

brew install tesseract poppler

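If you need languages beyond English, install the matching Tesseract language packs (for example, tesseract-ocr-fra on Ubuntu/Debian, or tesseract-lang on Homebrew) and select them with --lang.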
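To see which language packs Tesseract can actually find, you can parse the output of tesseract --list-langs. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the language string uses Tesseract's own eng+fra syntax for multiple languages, which readablepdf's --lang option may or may not pass through unchanged.

```python
import subprocess


def parse_langs(list_langs_output: str) -> set[str]:
    # The first line is a header such as: List of available languages in "..." (2):
    lines = [line.strip() for line in list_langs_output.splitlines()]
    return {line for line in lines if line and not line.startswith("List of")}


def missing_langs(requested: str, installed: set[str]) -> list[str]:
    # Tesseract joins multiple languages with "+", e.g. "eng+fra".
    return [code for code in requested.split("+") if code not in installed]


def installed_langs() -> set[str]:
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions print the list on stderr, so read both streams.
    return parse_langs(result.stdout + result.stderr)
```

For example, missing_langs("eng+fra", installed_langs()) lists the codes you still need to install.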

Install and run with pipx

readablepdf is published on PyPI, so you can run it directly with pipx, without managing a virtualenv:

pipx run readablepdf input.pdf

Custom output/language/DPI:

pipx run readablepdf input.pdf -o output_ocr.pdf --lang eng --dpi 200
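For readers curious how such a command line maps onto Python, here is a minimal argparse sketch that accepts the same arguments. It is not readablepdf's real parser: the flag names come from the command above, but the defaults and help texts are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="readablepdf")
    parser.add_argument("input", help="image-only PDF to process")
    # Assumption: with no -o, the tool picks its own output name.
    parser.add_argument("-o", dest="output", default=None, help="output PDF path")
    # Assumed defaults; the real tool may use different values.
    parser.add_argument("--lang", default="eng", help="Tesseract language code")
    parser.add_argument("--dpi", type=int, default=200, help="render resolution")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input, args.output, args.lang, args.dpi)
```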

Local development

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
ruff check .
ruff format --check .
pytest -q
python -m build

GitHub Actions

  • CI workflow: lint + format check + tests + build on Ubuntu and macOS.
  • Publish to PyPI workflow: runs on GitHub Release publish (and manual trigger).

GitHub configuration needed for publish

  1. Create repository secret: PYPI_API_TOKEN.
  2. Set it to a PyPI API token with upload permission for this project.
  3. (Optional but recommended) create a protected environment named pypi and require approvals.

Release flow

  1. Merge to main.
  2. Create a Git tag/release (for example v0.1.1).
  3. The Publish to PyPI workflow uploads the artifacts to PyPI, with the version derived from that tag.
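How the version is derived from the tag is not spelled out here. One plausible convention, shown purely as an assumption, is to strip a leading v and require a three-part numeric version; the function name version_from_tag is hypothetical.

```python
import re

# Accepts "v0.1.1" or "0.1.1"; anything else is rejected.
_TAG_PATTERN = re.compile(r"^v?(\d+\.\d+\.\d+)$")


def version_from_tag(tag: str) -> str:
    match = _TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"tag {tag!r} does not look like a release version")
    return match.group(1)
```

Rejecting unexpected tags early keeps a mistyped tag from producing a release with a surprising version.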

Download files

Source Distribution

readablepdf-0.1.0.tar.gz (5.4 kB)

Uploaded Source

Built Distribution

readablepdf-0.1.0-py3-none-any.whl (5.3 kB)

Uploaded Python 3

File details

Details for the file readablepdf-0.1.0.tar.gz.

File metadata

  • Download URL: readablepdf-0.1.0.tar.gz
  • Upload date:
  • Size: 5.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for readablepdf-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       cae1029a2e3fd32bb5a6b8c373ef58a904a9015718b0a02f655d4636a6f875e0
MD5          f6a6386b742de1e72e5086cfe2c0bc7e
BLAKE2b-256  16fb60a48b872100feec5909f73dc0019522e78a2ef01bfa0f66ae5c45997063

Provenance

The following attestation bundles were made for readablepdf-0.1.0.tar.gz:

Publisher: publish.yml on acoomans/readablepdf

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file readablepdf-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: readablepdf-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 5.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for readablepdf-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       b666b738f4f3d07c5be8cfaf7e53d447dfb4b55116a80d5333331a6358e18285
MD5          d7826a3144ceb2c24a5f6d353010f01b
BLAKE2b-256  80e3ed1ce7cbc59bf55f7b5d0a2e04313ef0624016d0b5691defb6f0dd9d7a25

Provenance

The following attestation bundles were made for readablepdf-0.1.0-py3-none-any.whl:

Publisher: publish.yml on acoomans/readablepdf

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
