OCR plugin for claude-pdf2md — recognise text in scanned PDFs via Tesseract (and other backends) and feed it into the existing PDF→Markdown pipeline.

Maintainers

skippdot

Project description

claude-pdf2md-ocr

OCR plugin for claude-pdf2md. Detects PDF pages that have no usable text layer (pure scans, hybrid scans with minimal overlay) and fills them in using Tesseract, so that the existing claude-pdf2md pipeline — headings, lists, tables, links — works on scanned documents the same way it works on native PDFs.

Features

Auto language detection. Sniffs the document's text layer (or a cheap first-pass OCR probe when the text layer is missing) and picks a narrow Tesseract pack like ces+eng or ukr+eng. Avoids the cross-script confusion that plagues a default ukr+rus+eng model on Czech or English documents.
Post-OCR spellcheck. Cleans up the cross-script errors that do slip through (опе → one, Ме → №, Мо1224 → №1224, ІВАМ → IBAN, І → I) by running each token against a language-aware wordfreq dictionary. Conservative — only edits a token when the correction is unambiguous.
Smart page selection. By default OCR runs only on pages that are scans (no text layer, or a whole-page background image + tiny text overlay). --all-pages forces it on every page.
Heading-safe injection. OCR word heights are normalised to a per-page median so the base pipeline's heading detector stops treating every line with tall capitals as a tier-2 heading.
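As a rough illustration of the conservative spellcheck idea above, the sketch below maps Cyrillic glyphs to the Latin letters they may have been misread from and rewrites a token only when exactly one candidate is a known word. The confusion table, the tiny vocabularies, and the name fix_token are all hypothetical and are not taken from the package. The real pass is described as using a wordfreq dictionary, which this sketch replaces with hard-coded sets so it runs on its own.

```python
from itertools import product

# Hypothetical confusion table: Cyrillic glyph -> Latin letters it may stand for.
# Escapes are used so look-alike characters cannot be mistyped.
CONFUSABLE = {
    "\u043e": "o",   # Cyrillic small o
    "\u043f": "n",   # Cyrillic small pe, often misread from Latin n
    "\u0435": "e",   # Cyrillic small ie
    "\u0406": "I",   # Ukrainian capital I
    "\u0412": "B",   # Cyrillic capital ve
    "\u0410": "A",   # Cyrillic capital a
    "\u041c": "MN",  # Cyrillic capital em: could be M or N
}

# Tiny stand-in vocabularies (assumption); a real pass would consult a
# language-aware frequency dictionary instead.
LATIN_WORDS = {"one", "IBAN", "I"}
CYRILLIC_WORDS = {"\u043e\u043d\u0430"}  # a genuine Cyrillic word to leave alone


def has_cyrillic(token):
    """True if any character falls in the basic Cyrillic block."""
    return any("\u0400" <= ch <= "\u04ff" for ch in token)


def fix_token(token):
    """Return a Latin correction only when it is unambiguous, else the token."""
    if not has_cyrillic(token) or token in CYRILLIC_WORDS:
        return token
    options = []
    for ch in token:
        if ch in CONFUSABLE:
            options.append(CONFUSABLE[ch])
        elif ch.isascii():
            options.append(ch)
        else:
            return token  # a glyph with no Latin reading: not a mixed-script slip
    candidates = {"".join(combo) for combo in product(*options)}
    matches = candidates & LATIN_WORDS
    # Conservative rule: edit only when exactly one candidate is a known word.
    return matches.pop() if len(matches) == 1 else token
```

With these toy tables, the Cyrillic look-alikes of "one" and "IBAN" are rewritten, while genuine Cyrillic words and tokens with several plausible readings are left untouched.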

System dependency

Tesseract must be installed separately.

macOS

brew install tesseract tesseract-lang     # ships 100+ language packs
tesseract --list-langs | grep -E '^(ukr|rus|ces|eng)$'

Debian / Ubuntu

sudo apt-get install tesseract-ocr tesseract-ocr-ukr tesseract-ocr-rus \
                     tesseract-ocr-ces tesseract-ocr-eng
tesseract --list-langs | grep -E '^(ukr|rus|ces|eng)$'

Windows

Download the UB Mannheim Tesseract installer: https://github.com/UB-Mannheim/tesseract/wiki (64-bit recommended).
During installation, expand "Additional language data (download)" and tick Ukrainian, Russian, Czech, English (or whichever languages your documents use).
Add the install directory to PATH (default: C:\Program Files\Tesseract-OCR). The installer offers a checkbox; if missed, set it manually in System Properties → Environment Variables.

Verify in a fresh PowerShell:

tesseract --version
tesseract --list-langs
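The same check can be scripted on any platform. The sketch below is not part of the package: the helper names parse_langs, missing_langs, and installed_listing are made up for illustration. It assumes that tesseract --list-langs prints a header line starting with "List of" followed by one language code per line, which is consistent with the grep commands above.

```python
import shutil
import subprocess


def parse_langs(listing):
    """Extract language codes from `tesseract --list-langs` output."""
    langs = set()
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and the header ("List of available languages ...").
        if line and not line.lower().startswith("list of"):
            langs.add(line)
    return langs


def missing_langs(required, listing):
    """Return the required language codes absent from the listing, sorted."""
    return sorted(set(required) - parse_langs(listing))


def installed_listing():
    """Run tesseract and return its language listing as text."""
    exe = shutil.which("tesseract")
    if exe is None:
        raise RuntimeError("tesseract not found on PATH")
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some Tesseract versions print the list to stderr, so read both streams.
    return result.stdout + result.stderr
```

Calling missing_langs(["ukr", "ces", "eng"], installed_listing()) before a batch run gives an early, readable failure instead of an OCR error partway through a document.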

Install

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate
pip install claude-pdf2md-ocr                               # once on PyPI
# or, while the package is still a prototype:
pip install git+https://github.com/skippdot/claude-pdf2md-ocr

Windows

# Python 3.10+ from python.org with "Add Python to PATH" checked
py -m venv .venv
.venv\Scripts\Activate.ps1

pip install claude-pdf2md-ocr
# or the git variant:
pip install git+https://github.com/skippdot/claude-pdf2md-ocr

Usage

# Auto language detection (default). Tesseract gets a narrow pack based on
# the PDF's text layer or a cheap probe pass.
claude-pdf2md-ocr scan.pdf -o out.md

# Force an explicit Tesseract language string when you know the document:
claude-pdf2md-ocr scan.pdf -o out.md --lang ces+eng

# OCR every page, even those that already have a text layer:
claude-pdf2md-ocr scan.pdf -o out.md --all-pages

# Skip the post-OCR spellcheck pass:
claude-pdf2md-ocr scan.pdf -o out.md --no-spellcheck

Or programmatically:

from claude_pdf2md_ocr import convert_with_ocr

md = convert_with_ocr("scan.pdf")                 # auto-detect lang
md = convert_with_ocr("scan.pdf", lang="ces+eng") # explicit override
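For a folder of scans, a small wrapper along these lines may be convenient. It is a sketch, not part of the package: convert_folder is a hypothetical helper, and it assumes the converter returns the Markdown as a string, as the md = ... assignments above suggest. The converter is passed in as an argument so the helper does not depend on any signature beyond what is shown above.

```python
from pathlib import Path


def convert_folder(src_dir, out_dir, convert, **kwargs):
    """Convert every *.pdf in src_dir to a .md file in out_dir.

    `convert` is any callable taking a PDF path (plus keyword options) and
    returning Markdown text, e.g. convert_with_ocr (assumed to return str).
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(src.glob("*.pdf")):
        target = out / (pdf.stem + ".md")
        target.write_text(convert(str(pdf), **kwargs), encoding="utf-8")
        written.append(target)
    return written
```

Under those assumptions, convert_folder("scans", "out", convert_with_ocr, lang="ces+eng") would write one Markdown file per PDF and return the list of output paths.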

Status

Prototype. To be published to PyPI once accuracy on real scanned Ukrainian / Czech legal documents stabilises and the base claude-pdf2md package grows the enrichers= hook, which will collapse this package's bespoke pipeline composition into a one-liner.


Release history

0.0.2 (this version), Apr 21, 2026

Download files

Source Distribution

claude_pdf2md_ocr-0.0.2.tar.gz (17.8 kB), uploaded Apr 21, 2026

Built Distribution

claude_pdf2md_ocr-0.0.2-py3-none-any.whl (17.5 kB, Python 3), uploaded Apr 21, 2026

File details

Details for the file claude_pdf2md_ocr-0.0.2.tar.gz.

File metadata

Download URL: claude_pdf2md_ocr-0.0.2.tar.gz
Upload date: Apr 21, 2026
Size: 17.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for claude_pdf2md_ocr-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`e89c0aec4c6b217d47594057f17c325de7e54c0b2d49a0d5414b942bfb62d87d`
MD5	`30a6e7a642beae8b8ee4232a026e57aa`
BLAKE2b-256	`bf154d4d7a8e3d3c947b5db3421f9ae2f03a18ffabf12b8179c4d19e75288ca6`


Provenance

The following attestation bundles were made for claude_pdf2md_ocr-0.0.2.tar.gz:

Publisher: release.yml on skippdot/claude-pdf2md-ocr

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: claude_pdf2md_ocr-0.0.2.tar.gz
- Subject digest: e89c0aec4c6b217d47594057f17c325de7e54c0b2d49a0d5414b942bfb62d87d
- Sigstore transparency entry: 1350446273
- Sigstore integration time: Apr 21, 2026
Source repository:
- Permalink: skippdot/claude-pdf2md-ocr@425f1b2ad38895509b2db4f92280cef09186b934
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/skippdot
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@425f1b2ad38895509b2db4f92280cef09186b934
- Trigger Event: push

File details

Details for the file claude_pdf2md_ocr-0.0.2-py3-none-any.whl.

File metadata

Download URL: claude_pdf2md_ocr-0.0.2-py3-none-any.whl
Upload date: Apr 21, 2026
Size: 17.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for claude_pdf2md_ocr-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6a36fe54288a62bbd0c2af1aa5f88da7a648cc03301f9cc351c1539237e697bb`
MD5	`24fb036f9b988894072848477f5b7c64`
BLAKE2b-256	`f801e97814f47d1373148bd964ad74bf48b74b1770968ac3ec7b12dad67e11e2`


Provenance

The following attestation bundles were made for claude_pdf2md_ocr-0.0.2-py3-none-any.whl:

Publisher: release.yml on skippdot/claude-pdf2md-ocr

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: claude_pdf2md_ocr-0.0.2-py3-none-any.whl
- Subject digest: 6a36fe54288a62bbd0c2af1aa5f88da7a648cc03301f9cc351c1539237e697bb
- Sigstore transparency entry: 1350446385
- Sigstore integration time: Apr 21, 2026
Source repository:
- Permalink: skippdot/claude-pdf2md-ocr@425f1b2ad38895509b2db4f92280cef09186b934
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/skippdot
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@425f1b2ad38895509b2db4f92280cef09186b934
- Trigger Event: push
