ocr-pdf2md
A CLI tool that converts PDF documents to clean Markdown. Handles both digital PDFs (with extractable text) and scanned/image-based PDFs via OCR.
Installation
uv tool install ocr-pdf2md
OCR requirement
For scanned or image-based PDFs, Tesseract must be installed on your system:
- Ubuntu/Debian: sudo apt install tesseract-ocr
- macOS: brew install tesseract
- Windows: https://github.com/tesseract-ocr/tesseract
Tesseract is only invoked when a page has little or no extractable text. Fully digital PDFs work without it.
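The per-page fallback described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function names and the 20-character threshold are assumptions, and the page text is taken as a plain string (for example, whatever pypdf's page.extract_text() returns).

```python
from __future__ import annotations


def needs_ocr(extracted_text: str | None, min_chars: int = 20) -> bool:
    """Return True when a page has too little extractable text to trust.

    min_chars is a hypothetical threshold chosen for illustration only.
    """
    text = extracted_text or ""
    # Count only non-whitespace characters so blank padding
    # does not make an empty page look like it has content.
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars


def pages_needing_ocr(page_texts: list[str | None], min_chars: int = 20) -> list[int]:
    """Return the zero-based indexes of pages that would be routed to OCR."""
    return [i for i, text in enumerate(page_texts) if needs_ocr(text, min_chars)]
```

With a check like this, a fully digital PDF yields an empty list, so Tesseract never needs to be called.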
Usage
ocr-pdf2md input.pdf output.md
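To convert a whole folder, the command above can be wrapped in a short script. This is a sketch under assumptions: it relies only on the two positional arguments shown above, and the helper names build_command and convert_all are hypothetical.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def build_command(pdf_path: Path, out_dir: Path) -> list[str]:
    """Build the ocr-pdf2md invocation for one PDF (input path, then output path)."""
    output = out_dir / pdf_path.with_suffix(".md").name
    return ["ocr-pdf2md", str(pdf_path), str(output)]


def convert_all(src_dir: Path, out_dir: Path) -> None:
    """Convert every *.pdf in src_dir, writing one .md file per PDF into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(src_dir.glob("*.pdf")):
        # check=True raises CalledProcessError if a conversion fails.
        subprocess.run(build_command(pdf, out_dir), check=True)
```

Calling convert_all(Path("pdfs"), Path("markdown")) would then run the tool once per file, assuming ocr-pdf2md is on your PATH.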
Features
- Extracts text from digital PDFs using pypdf
- Falls back to OCR (Tesseract) for scanned/image pages
- Detects and removes repeating headers and footers
- Identifies and reformats Table of Contents pages
- Detects headings (ALL CAPS -> H2, Title Case -> H3)
- Formats bullet and numbered lists
- Rejoins hyphenated words split across lines
- Cleans Unicode characters to ASCII equivalents
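Two of the features above, heading detection and rejoining hyphenated words, can be illustrated with a short sketch. The rules here are simplified assumptions rather than the tool's actual heuristics: the 12-word limit, the punctuation check, and the function names are all hypothetical.

```python
import re


def classify_heading(line: str) -> str:
    """Prefix a line with ## (ALL CAPS) or ### (Title Case), else return it unchanged."""
    stripped = line.strip()
    words = stripped.split()
    # Assumption: headings are short and do not end in sentence punctuation.
    if not words or len(words) > 12 or stripped[-1] in ".,;:":
        return stripped
    letters = [ch for ch in stripped if ch.isalpha()]
    if not letters:
        return stripped
    if all(ch.isupper() for ch in letters):
        return f"## {stripped}"
    # Simplified Title Case test: every word that begins with a letter is capitalized.
    initials = [w[0] for w in words if w[0].isalpha()]
    if initials and all(ch.isupper() for ch in initials):
        return f"### {stripped}"
    return stripped


# A letter, a hyphen, a line break, then a lowercase letter.
_HYPHEN_BREAK = re.compile(r"([A-Za-z])-\n([a-z])")


def rejoin_hyphenated(text: str) -> str:
    """Join words that were split with a hyphen at the end of a line."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)
```

A pattern this simple also merges genuine compounds that happen to break at the hyphen (such as "well-" followed by "known"), so a real implementation would likely need a stricter rule.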
Acknowledgments
Inspired by rubysash/pdf2md. This project adds OCR support for scanned PDFs, improved type hints, and other enhancements.
License
MIT
File details
Details for the file ocr_pdf2md-0.1.0.tar.gz.
File metadata
- Download URL: ocr_pdf2md-0.1.0.tar.gz
- Upload date:
- Size: 6.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.1 {"installer":{"name":"uv","version":"0.11.1","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 86529ad337b77ebfc9d0b9d8976623d7c8ca9319baf2c3d13ca9d6a8da9e63dd |
| MD5 | 3cded1591e51666c047021e05ec0da7a |
| BLAKE2b-256 | c4df1b47c869eae8240a4b82a0e5b27553bb557285dce52c4e2812e382df5f69 |
File details
Details for the file ocr_pdf2md-0.1.0-py3-none-any.whl.
File metadata
- Download URL: ocr_pdf2md-0.1.0-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.1 {"installer":{"name":"uv","version":"0.11.1","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f6597018bb1366a3e2dd010a608a2a46f3b782486487415855d5d21b619cd7e8 |
| MD5 | 6f39ac1d8c49158f24942eb235652e5e |
| BLAKE2b-256 | 23d3373154b3f358c9de53312e318fb6c6f8c95c01623f0a61231052538fa83c |