ocr-pdf2md

A CLI tool that converts PDF documents to clean Markdown. Handles both digital PDFs (with extractable text) and scanned/image-based PDFs via OCR.

Installation

uv tool install ocr-pdf2md

OCR requirement

For scanned or image-based PDFs, Tesseract must be installed on your system and available on your PATH.

Tesseract is only invoked when a page has little or no extractable text. Fully digital PDFs work without it.
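
The fallback rule can be pictured as a small check on the text extracted from each page. This is a minimal sketch, not the tool's actual code: the function name `needs_ocr` and the 20-character threshold are assumptions chosen for illustration.

```python
from typing import Optional


def needs_ocr(page_text: Optional[str], min_chars: int = 20) -> bool:
    """Return True when a page has too little extractable text (hypothetical rule)."""
    if page_text is None:
        return True
    # Count only non-whitespace characters so blank or near-blank pages fall back to OCR.
    visible = sum(1 for ch in page_text if not ch.isspace())
    return visible < min_chars


pages = ["", "  \n ", "This page has plenty of extractable text."]
print([needs_ocr(text) for text in pages])
```

A check like this keeps Tesseract out of the loop for fully digital documents, since no page in them would fall below the threshold.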

Usage

ocr-pdf2md input.pdf output.md
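
To convert several files, the command can be driven from a short script. The sketch below relies only on the two positional arguments shown above. The helper names `build_command` and `convert_all`, and the choice to write each Markdown file next to its PDF, are assumptions.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf_path: Path) -> List[str]:
    """Build the CLI call, writing the Markdown next to the PDF (an assumed layout)."""
    return ["ocr-pdf2md", str(pdf_path), str(pdf_path.with_suffix(".md"))]


def convert_all(directory: Path) -> None:
    """Convert every PDF in a directory, stopping on the first failure."""
    for pdf_path in sorted(directory.glob("*.pdf")):
        # check=True raises CalledProcessError if the command exits with a non-zero status.
        subprocess.run(build_command(pdf_path), check=True)
```

Calling `convert_all(Path("docs"))` would run the tool once per PDF found in `docs`.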

Features

  • Extracts text from digital PDFs using pypdf
  • Falls back to OCR (Tesseract) for scanned/image pages
  • Detects and removes repeating headers and footers
  • Identifies and reformats Table of Contents pages
  • Detects headings (ALL CAPS -> H2, Title Case -> H3)
  • Formats bullet and numbered lists
  • Rejoins hyphenated words split across lines
  • Converts Unicode characters to ASCII equivalents
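
Several of the cleanup steps listed above can be approximated with a few lines of standard-library Python. The sketch below is an illustration under stated assumptions, not the package's implementation: all function names, the 10-word heading limit, the replacement table, and the 60% repetition threshold are hypothetical.

```python
import re
import unicodedata
from collections import Counter
from typing import List


def format_heading(line: str) -> str:
    """Map ALL CAPS lines to H2 and Title Case lines to H3 (assumed heuristic)."""
    text = line.strip()
    words = text.split()
    # Skip empty lines, long lines, and lines ending in sentence punctuation.
    if not words or len(words) > 10 or text[-1] in ".,;:":
        return line
    if text.isupper():
        return "## " + text
    alpha_words = [w for w in words if w[0].isalpha()]
    if alpha_words and all(w[0].isupper() for w in alpha_words):
        return "### " + text
    return line


def rejoin_hyphenated(text: str) -> str:
    """Join words split by a hyphen at a line break, e.g. 'implemen-' + 'tation'."""
    # Limited to lowercase letters on both sides; genuine compounds such as
    # 'well-' + 'known' would still lose their hyphen with this simple rule.
    return re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)


# Assumed replacement table for common typographic characters.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-", "\u2026": "...",
}


def to_ascii(text: str) -> str:
    """Replace typographic characters, then strip accents and other non-ASCII."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # NFKD splits accented letters into base letter + combining mark;
    # encoding with errors="ignore" then drops the marks.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def strip_repeated_edges(pages: List[List[str]], min_share: float = 0.6) -> List[List[str]]:
    """Drop first/last lines that recur on at least min_share of the pages."""
    counts: Counter = Counter()
    for lines in pages:
        if lines:
            # A set avoids double-counting a one-line page.
            counts.update({lines[0].strip(), lines[-1].strip()})
    threshold = max(2, int(len(pages) * min_share))
    repeated = {line for line, n in counts.items() if line and n >= threshold}
    cleaned = []
    for lines in pages:
        body = list(lines)
        if body and body[0].strip() in repeated:
            body.pop(0)
        if body and body[-1].strip() in repeated:
            body.pop()
        cleaned.append(body)
    return cleaned
```

Exact-match counting would miss footers that change per page, such as page numbers. A fuller version would need to normalize digits before counting.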

Acknowledgments

Inspired by rubysash/pdf2md. This project adds OCR support for scanned PDFs, improved type hints, and other enhancements.

License

MIT

Download files

Source Distribution

ocr_pdf2md-0.1.0.tar.gz (6.0 kB)

Uploaded Source

Built Distribution

ocr_pdf2md-0.1.0-py3-none-any.whl (7.2 kB)

Uploaded Python 3

File details

Details for the file ocr_pdf2md-0.1.0.tar.gz.

File metadata

  • Download URL: ocr_pdf2md-0.1.0.tar.gz
  • Upload date:
  • Size: 6.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.1 (uv publish, Ubuntu 24.04)

File hashes

Hashes for ocr_pdf2md-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       86529ad337b77ebfc9d0b9d8976623d7c8ca9319baf2c3d13ca9d6a8da9e63dd
MD5          3cded1591e51666c047021e05ec0da7a
BLAKE2b-256  c4df1b47c869eae8240a4b82a0e5b27553bb557285dce52c4e2812e382df5f69

File details

Details for the file ocr_pdf2md-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: ocr_pdf2md-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 7.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.1 (uv publish, Ubuntu 24.04)

File hashes

Hashes for ocr_pdf2md-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       f6597018bb1366a3e2dd010a608a2a46f3b782486487415855d5d21b619cd7e8
MD5          6f39ac1d8c49158f24942eb235652e5e
BLAKE2b-256  23d3373154b3f358c9de53312e318fb6c6f8c95c01623f0a61231052538fa83c
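
A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. This is a minimal sketch: the helper names `sha256_of` and `verify` are hypothetical, and the files are assumed to keep their published names.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file details above.
EXPECTED_SHA256 = {
    "ocr_pdf2md-0.1.0.tar.gz":
        "86529ad337b77ebfc9d0b9d8976623d7c8ca9319baf2c3d13ca9d6a8da9e63dd",
    "ocr_pdf2md-0.1.0-py3-none-any.whl":
        "f6597018bb1366a3e2dd010a608a2a46f3b782486487415855d5d21b619cd7e8",
}


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path) -> bool:
    """Return True only if the file name is known and its digest matches."""
    expected = EXPECTED_SHA256.get(path.name)
    return expected is not None and sha256_of(path) == expected
```

For example, `verify(Path("ocr_pdf2md-0.1.0.tar.gz"))` should return True for an intact copy of the source distribution.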
