OCR plugin for claude-pdf2md — recognise text in scanned PDFs via Tesseract (and other backends) and feed it into the existing PDF→Markdown pipeline.
claude-pdf2md-ocr
OCR plugin for claude-pdf2md.
Detects PDF pages that have no usable text layer (pure scans, hybrid scans
with minimal overlay) and fills them in using Tesseract, so that the
existing claude-pdf2md pipeline — headings, lists, tables, links — works
on scanned documents the same way it works on native PDFs.
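As a rough illustration of the scan test described above, the sketch below flags a page for OCR when it has no text layer, or when a near full-page image sits behind a very small text overlay. The `PageStats` fields, the `needs_ocr` name, and both thresholds are assumptions made for this example, not the plugin's actual internals.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page measurements collected by a PDF reader."""
    text_chars: int            # characters found in the page's text layer
    page_area: float           # page width * height, in points squared
    largest_image_area: float  # area of the biggest embedded image


def needs_ocr(page: PageStats, min_chars: int = 50, image_coverage: float = 0.9) -> bool:
    """Return True when the page looks like a scan (thresholds are guesses)."""
    if page.text_chars == 0:
        return True  # pure scan: there is no text layer to extract
    covers_page = (
        page.page_area > 0
        and page.largest_image_area / page.page_area >= image_coverage
    )
    # Hybrid scan: whole-page background image plus a minimal text overlay.
    return covers_page and page.text_chars < min_chars


a4 = 595 * 842  # A4 page area in points squared
pages = [
    PageStats(text_chars=0, page_area=a4, largest_image_area=a4),     # pure scan
    PageStats(text_chars=12, page_area=a4, largest_image_area=a4),    # hybrid scan
    PageStats(text_chars=2400, page_area=a4, largest_image_area=0),   # native PDF
]
ocr_pages = [i for i, p in enumerate(pages) if needs_ocr(p)]
print(ocr_pages)
```

Keeping the decision a pure function of a few numbers makes it easy to tune the thresholds against real documents without touching the PDF reader.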
Features
- Auto language detection. Sniffs the document's text layer (or a cheap first-pass OCR probe when the text layer is missing) and picks a narrow Tesseract pack like ces+eng or ukr+eng. Avoids the cross-script confusion that plagues a default ukr+rus+eng model on Czech or English documents.
- Post-OCR spellcheck. Cleans up the cross-script errors that do slip through (опе → one, Ме → №, Мо1224 → №1224, ІВАМ → IBAN, І → I) by running each token against a language-aware wordfreq dictionary. Conservative: only edits a token when the correction is unambiguous.
- Smart page selection. By default OCR runs only on pages that are scans (no text layer, or a whole-page background image + tiny text overlay). --all-pages forces it on every page.
- Heading-safe injection. OCR word heights are normalised to a per-page median so the base pipeline's heading detector stops treating every line with tall capitals as a tier-2 heading.
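The conservative spellcheck idea can be sketched as follows. This is a minimal illustration, not the plugin's code: the look-alike table is a small subset, `fix_token` is a hypothetical name, and the two word sets stand in for a real frequency dictionary lookup.

```python
# Cyrillic letters that OCR commonly confuses with Latin look-alikes.
# Illustrative subset only; a real table would be larger.
LOOKALIKES = str.maketrans({
    "\u043e": "o",  # Cyrillic о
    "\u043f": "n",  # Cyrillic п
    "\u0435": "e",  # Cyrillic е
    "\u0430": "a",  # Cyrillic а
    "\u0441": "c",  # Cyrillic с
    "\u0440": "p",  # Cyrillic р
    "\u0406": "I",  # Cyrillic І
})

# Tiny stand-in vocabularies (assumption); a real pass would query a
# language-aware frequency dictionary instead.
CYRILLIC_WORDS = {"\u043f\u0430\u043f\u0430"}  # "папа"
LATIN_WORDS = {"one", "nana", "I"}


def fix_token(token: str) -> str:
    """Swap a token to its Latin look-alike only when that is unambiguous."""
    candidate = token.translate(LOOKALIKES)
    if candidate == token:
        return token  # nothing in the token has a look-alike
    if not candidate.isascii():
        return token  # some letters could not be mapped: leave it alone
    if token in CYRILLIC_WORDS:
        return token  # already a valid word in its own script: ambiguous
    return candidate if candidate in LATIN_WORDS else token


tokens = ["\u043e\u043f\u0435", "\u043f\u0430\u043f\u0430", "hello"]
fixed = [fix_token(t) for t in tokens]
print(fixed)
```

The token "папа" maps cleanly to "nana", yet it is left untouched because it is already a known Cyrillic word; that is the "only when unambiguous" rule in miniature.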
System dependency
Tesseract must be installed separately.
macOS
brew install tesseract tesseract-lang # ships 100+ language packs
tesseract --list-langs | grep -E '^(ukr|rus|ces|eng)$'
Debian / Ubuntu
sudo apt-get install tesseract-ocr tesseract-ocr-ukr tesseract-ocr-rus \
tesseract-ocr-ces tesseract-ocr-eng
tesseract --list-langs | grep -E '^(ukr|rus|ces|eng)$'
Windows
- Download the UB Mannheim Tesseract installer: https://github.com/UB-Mannheim/tesseract/wiki (64-bit recommended).
- During installation, expand "Additional language data (download)" and tick Ukrainian, Russian, Czech, English (or whichever languages your documents use).
- Add the install directory to PATH (default: C:\Program Files\Tesseract-OCR). The installer offers a checkbox; if missed, set it manually in System Properties → Environment Variables.
- Verify in a fresh PowerShell:
tesseract --version
tesseract --list-langs
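The same check can be scripted on any platform. The sketch below shells out to `tesseract --list-langs` and reports which packs from a `ces+eng`-style string are missing; the helper names `parse_langs` and `missing_langs` are made up for this example.

```python
import shutil
import subprocess


def parse_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    # The first line is a header such as 'List of available languages ... (4):'
    return {ln for ln in lines if not ln.lower().startswith("list of")}


def missing_langs(wanted: str) -> set[str]:
    """Return packs from a 'ces+eng'-style string that are not installed."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some older builds print the list on stderr, so both streams are read.
    installed = parse_langs(proc.stdout + "\n" + proc.stderr)
    return set(wanted.split("+")) - installed


sample = 'List of available languages in "/usr/share/tessdata/" (3):\neng\nosd\nukr\n'
print(sorted(parse_langs(sample)))
```

Calling `missing_langs("ukr+ces+eng")` before a long batch run gives an early, readable failure instead of a Tesseract error halfway through.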
Install
macOS / Linux
python3 -m venv .venv
source .venv/bin/activate
pip install claude-pdf2md-ocr # once on PyPI
# or, while the package is still a prototype:
pip install git+https://github.com/skippdot/claude-pdf2md-ocr
Windows
# Python 3.10+ from python.org with "Add Python to PATH" checked
py -m venv .venv
.venv\Scripts\Activate.ps1
pip install claude-pdf2md-ocr
# or the git variant:
pip install git+https://github.com/skippdot/claude-pdf2md-ocr
Usage
# Auto language detection (default). Tesseract gets a narrow pack based on
# the PDF's text layer or a cheap probe pass.
claude-pdf2md-ocr scan.pdf -o out.md
# Force an explicit Tesseract language string when you know the document:
claude-pdf2md-ocr scan.pdf -o out.md --lang ces+eng
# OCR every page, even those that already have a text layer:
claude-pdf2md-ocr scan.pdf -o out.md --all-pages
# Skip the post-OCR spellcheck pass:
claude-pdf2md-ocr scan.pdf -o out.md --no-spellcheck
Or programmatically:
from claude_pdf2md_ocr import convert_with_ocr
md = convert_with_ocr("scan.pdf") # auto-detect lang
md = convert_with_ocr("scan.pdf", lang="ces+eng") # explicit override
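For a sense of how a narrow `lang` string such as `ces+eng` or `ukr+eng` might be derived from sample text, here is a minimal sketch based on script and marker-letter counts. The `pick_lang` function, the marker sets, and the 50% threshold are assumptions for illustration; the plugin's real detector may work quite differently.

```python
import unicodedata

# Letters that rarely appear outside the given language (illustrative subsets).
CZECH_MARKS = set("\u011b\u0161\u010d\u0159\u017e\u016f\u0165\u010f\u0148")  # ě š č ř ž ů ť ď ň
UKRAINIAN_MARKS = set("\u0456\u0457\u0454\u0491")  # і ї є ґ


def pick_lang(sample: str) -> str:
    """Guess a narrow Tesseract language string from a text sample."""
    text = unicodedata.normalize("NFC", sample).lower()
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "eng"  # nothing to go on, fall back to English
    cyrillic = sum(1 for c in letters if "CYRILLIC" in unicodedata.name(c, ""))
    if cyrillic / len(letters) > 0.5:
        # Mostly Cyrillic: Ukrainian-only letters decide between ukr and rus.
        primary = "ukr" if any(c in UKRAINIAN_MARKS for c in letters) else "rus"
        return f"{primary}+eng"
    if any(c in CZECH_MARKS for c in letters):
        return "ces+eng"
    return "eng"


print(pick_lang("Smlouva \u010d. 12"))                       # Czech sample
print(pick_lang("\u0414\u043e\u0433\u043e\u0432\u0456\u0440"))  # Ukrainian sample
```

The returned string has the same shape as the value passed to `lang=` in the calls above, so a guess like this could be fed straight into an explicit override.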
Status
Prototype. Published to PyPI once accuracy on real scanned Ukrainian /
Czech legal documents stabilises and the base claude-pdf2md package
grows the enrichers= hook that will collapse this package's bespoke
pipeline composition into a one-liner.
File details
Details for the file claude_pdf2md_ocr-0.0.2.tar.gz.
File metadata
- Download URL: claude_pdf2md_ocr-0.0.2.tar.gz
- Upload date:
- Size: 17.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e89c0aec4c6b217d47594057f17c325de7e54c0b2d49a0d5414b942bfb62d87d |
| MD5 | 30a6e7a642beae8b8ee4232a026e57aa |
| BLAKE2b-256 | bf154d4d7a8e3d3c947b5db3421f9ae2f03a18ffabf12b8179c4d19e75288ca6 |
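A downloaded file can be checked against the SHA256 digest listed for it. The sketch below uses only the standard library; the helper names `sha256_of` and `matches` are made up for this example, and the file is assumed to sit in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Digest published for the source distribution.
EXPECTED = "e89c0aec4c6b217d47594057f17c325de7e54c0b2d49a0d5414b942bfb62d87d"
sdist = Path("claude_pdf2md_ocr-0.0.2.tar.gz")
if sdist.exists():
    print("ok" if matches(sdist, EXPECTED) else "digest mismatch")
```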
Provenance
The following attestation bundles were made for claude_pdf2md_ocr-0.0.2.tar.gz:
Publisher: release.yml on skippdot/claude-pdf2md-ocr
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: claude_pdf2md_ocr-0.0.2.tar.gz
- Subject digest: e89c0aec4c6b217d47594057f17c325de7e54c0b2d49a0d5414b942bfb62d87d
- Sigstore transparency entry: 1350446273
- Sigstore integration time:
- Permalink: skippdot/claude-pdf2md-ocr@425f1b2ad38895509b2db4f92280cef09186b934
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/skippdot
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@425f1b2ad38895509b2db4f92280cef09186b934
- Trigger Event: push
File details
Details for the file claude_pdf2md_ocr-0.0.2-py3-none-any.whl.
File metadata
- Download URL: claude_pdf2md_ocr-0.0.2-py3-none-any.whl
- Upload date:
- Size: 17.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6a36fe54288a62bbd0c2af1aa5f88da7a648cc03301f9cc351c1539237e697bb |
| MD5 | 24fb036f9b988894072848477f5b7c64 |
| BLAKE2b-256 | f801e97814f47d1373148bd964ad74bf48b74b1770968ac3ec7b12dad67e11e2 |
Provenance
The following attestation bundles were made for claude_pdf2md_ocr-0.0.2-py3-none-any.whl:
Publisher: release.yml on skippdot/claude-pdf2md-ocr
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: claude_pdf2md_ocr-0.0.2-py3-none-any.whl
- Subject digest: 6a36fe54288a62bbd0c2af1aa5f88da7a648cc03301f9cc351c1539237e697bb
- Sigstore transparency entry: 1350446385
- Sigstore integration time:
- Permalink: skippdot/claude-pdf2md-ocr@425f1b2ad38895509b2db4f92280cef09186b934
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/skippdot
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@425f1b2ad38895509b2db4f92280cef09186b934
- Trigger Event: push