OCR plugin for claude-pdf2md — recognise text in scanned PDFs via Tesseract (and other backends) and feed it into the existing PDF→Markdown pipeline.

claude-pdf2md-ocr

OCR plugin for claude-pdf2md. Detects PDF pages that have no usable text layer (pure scans, hybrid scans with minimal overlay) and fills them in using Tesseract, so that the existing claude-pdf2md pipeline — headings, lists, tables, links — works on scanned documents the same way it works on native PDFs.
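The scan-detection rule described above can be sketched as a pure function over per-page measurements. This is a minimal illustration, not the plugin's actual code: the `PageStats` type, the helper names, and all three thresholds are hypothetical, and the measurements are assumed to come from whatever PDF library you already use.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Per-page measurements (hypothetical type, gathered by your PDF library)."""
    text_chars: int             # characters in the extractable text layer
    largest_image_ratio: float  # area of the largest image / page area, 0..1


# Hypothetical thresholds, chosen for illustration only.
MIN_TEXT_CHARS = 50        # below this the text layer is treated as unusable
FULL_PAGE_IMAGE = 0.9      # an image this large is treated as a background scan
OVERLAY_TEXT_CHARS = 200   # a scan with less text than this is a "tiny overlay"


def needs_ocr(page: PageStats) -> bool:
    """Return True for pure scans and for hybrid scans with a tiny overlay."""
    if page.text_chars < MIN_TEXT_CHARS:
        return True  # no usable text layer
    has_background_scan = page.largest_image_ratio >= FULL_PAGE_IMAGE
    return has_background_scan and page.text_chars < OVERLAY_TEXT_CHARS


def pages_to_ocr(pages: list[PageStats], all_pages: bool = False) -> list[int]:
    """Return zero-based indexes of pages to OCR; all_pages mirrors --all-pages."""
    return [i for i, p in enumerate(pages) if all_pages or needs_ocr(p)]
```

A native page with a large text layer is skipped even when it carries a full-page image, because only the combination of a background scan and very little text marks a hybrid scan.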

Features

  • Auto language detection. Sniffs the document's text layer (or a cheap first-pass OCR probe when the text layer is missing) and picks a narrow Tesseract pack like ces+eng or ukr+eng. Avoids the cross-script confusion that plagues a default ukr+rus+eng model on Czech or English documents.
  • Post-OCR spellcheck. Cleans up the cross-script errors that do slip through (опе → one, Ме → №, Мо1224 → №1224, ІВАМ → IBAN, І → I) by running each token against a language-aware wordfreq dictionary. Conservative — only edits a token when the correction is unambiguous.
  • Smart page selection. By default OCR runs only on pages that are scans (no text layer, or a whole-page background image plus a tiny text overlay). --all-pages forces it on every page.
  • Heading-safe injection. OCR word heights are normalised to a per-page median so the base pipeline's heading detector stops treating every line with tall capitals as a tier-2 heading.
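To make the conservative spellcheck idea from the list above concrete, here is a minimal sketch of a cross-script fix. It is an assumption-heavy illustration, not the plugin's implementation: the `LOOKALIKES` table and `fix_token` are hypothetical, and a plain Python set stands in for the wordfreq dictionary. A token is changed only when it is unknown as written and its Latin lookalike form is a known word.

```python
# Hypothetical lookalike table (not the plugin's real one): Cyrillic letters
# that OCR tends to emit in place of similar-looking Latin letters.
LOOKALIKES = str.maketrans({
    "\u043e": "o",  # Cyrillic small o
    "\u043f": "n",  # Cyrillic small pe, resembles Latin n in many fonts
    "\u0435": "e",  # Cyrillic small ie
    "\u0430": "a",  # Cyrillic small a
    "\u0441": "c",  # Cyrillic small es
    "\u0406": "I",  # Cyrillic capital Ukrainian I
    "\u0412": "B",  # Cyrillic capital ve
    "\u0410": "A",  # Cyrillic capital a
})


def fix_token(token: str, vocabulary: set[str]) -> str:
    """Swap Cyrillic lookalikes for Latin only when that yields a known word."""
    if token.lower() in vocabulary:
        return token  # already a valid word, leave it alone
    candidate = token.translate(LOOKALIKES)
    if candidate != token and candidate.lower() in vocabulary:
        return candidate  # original unknown, lookalike form known
    return token  # no confident correction, keep the OCR output
```

With a vocabulary containing "one", the Cyrillic-typed token "\u043e\u043f\u0435" becomes "one", while a genuine Cyrillic word that is already in the vocabulary is returned unchanged.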

System dependency

Tesseract must be installed separately.

macOS

brew install tesseract tesseract-lang     # ships 100+ language packs
tesseract --list-langs | grep -E '^(ukr|rus|ces|eng)$'

Debian / Ubuntu

sudo apt-get install tesseract-ocr tesseract-ocr-ukr tesseract-ocr-rus \
                     tesseract-ocr-ces tesseract-ocr-eng
tesseract --list-langs | grep -E '^(ukr|rus|ces|eng)$'

Windows

  1. Download the UB Mannheim Tesseract installer: https://github.com/UB-Mannheim/tesseract/wiki (64-bit recommended).

  2. During installation, expand "Additional language data (download)" and tick Ukrainian, Russian, Czech, English (or whichever languages your documents use).

  3. Add the install directory to PATH (default: C:\Program Files\Tesseract-OCR). The installer offers a checkbox; if missed, set it manually in System Properties → Environment Variables.

  4. Verify in a fresh PowerShell:

    tesseract --version
    tesseract --list-langs
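The same verification can be scripted on any platform. The sketch below shells out to `tesseract --list-langs` and reports which required language packs are missing. The function names are hypothetical, and the parser assumes the usual output shape: a "List of available languages ..." header followed by one language code per line.

```python
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks and the "List of available languages ..." header line.
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def missing_langs(required: set[str]) -> set[str]:
    """Return the required packs that the local Tesseract does not report."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Some Tesseract builds print the list to stderr, so read both streams.
    return required - parse_list_langs(result.stdout + "\n" + result.stderr)
```

Calling `missing_langs({"ukr", "rus", "ces", "eng"})` returns an empty set when all four packs are installed. It raises `FileNotFoundError` if the `tesseract` executable is not on PATH.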
    

Install

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate
pip install claude-pdf2md-ocr                               # once on PyPI
# or, while the package is still a prototype:
pip install git+https://github.com/skippdot/claude-pdf2md-ocr

Windows

# Python 3.10+ from python.org with "Add Python to PATH" checked
py -m venv .venv
.venv\Scripts\Activate.ps1

pip install claude-pdf2md-ocr
# or the git variant:
pip install git+https://github.com/skippdot/claude-pdf2md-ocr

Usage

# Auto language detection (default). Tesseract gets a narrow pack based on
# the PDF's text layer or a cheap probe pass.
claude-pdf2md-ocr scan.pdf -o out.md

# Force an explicit Tesseract language string when you know the document:
claude-pdf2md-ocr scan.pdf -o out.md --lang ces+eng

# OCR every page, even those that already have a text layer:
claude-pdf2md-ocr scan.pdf -o out.md --all-pages

# Skip the post-OCR spellcheck pass:
claude-pdf2md-ocr scan.pdf -o out.md --no-spellcheck

Or programmatically:

from claude_pdf2md_ocr import convert_with_ocr

md = convert_with_ocr("scan.pdf")                 # auto-detect lang
md = convert_with_ocr("scan.pdf", lang="ces+eng") # explicit override
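For a folder of scans, the call above can be wrapped in a small loop. This is a sketch under two assumptions: `convert_folder` is a hypothetical helper, not part of the package, and `convert_with_ocr` is assumed to return the Markdown as a string, as the `md = ...` examples suggest. The converter is passed in as a parameter so the helper does not depend on the package being importable.

```python
from pathlib import Path
from typing import Callable


def convert_folder(
    src_dir: str,
    out_dir: str,
    convert: Callable[[str], str],
) -> list[Path]:
    """Convert every *.pdf in src_dir and write <name>.md files into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        markdown = convert(str(pdf))  # e.g. convert_with_ocr
        target = out / (pdf.stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

Typical use would be `convert_folder("scans", "out", convert_with_ocr)`. To force a language, pass `lambda p: convert_with_ocr(p, lang="ces+eng")` instead.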

Status

Prototype. The package will be published to PyPI once accuracy on real scanned Ukrainian and Czech legal documents stabilises and the base claude-pdf2md package gains the enrichers= hook, which will collapse this package's bespoke pipeline composition into a one-liner.
