pdf2ocr

A CLI tool to apply OCR on PDF files and export to multiple formats.

📄 Features

🔍 Extracts text from scanned PDFs using Tesseract OCR with advanced image preprocessing
📘 Outputs DOCX, HTML, EPUB and searchable PDF files with preserved paragraph structure
📚 Converts DOCX to EPUB via Calibre, including metadata
📈 Displays progress bars and detailed summary logs
📂 Supports layout-preserving mode for high-fidelity PDF OCR
🖼️ Advanced image enhancement for improved OCR accuracy on distorted documents
⚡ Intelligent preprocessing with noise reduction, contrast optimization, and text sharpening

🖼️ Advanced Image Processing

pdf2ocr includes sophisticated image preprocessing to maximize OCR accuracy, especially for documents with distortions, poor contrast, or noise:

🔧 Automatic Enhancements Applied:

Grayscale Conversion - Optimizes images for text recognition
Auto Contrast - Automatically adjusts contrast for better text visibility
Noise Reduction - Removes artifacts while preserving text edges
Adaptive Histogram Equalization (CLAHE) - Enhances local contrast for varying lighting conditions
Text Sharpening - Improves character definition and clarity
Unsharp Masking - Fine-tunes text edges for optimal recognition
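To make the first two steps concrete, here is a minimal pure-Python sketch of grayscale conversion followed by a linear contrast stretch. It is a conceptual illustration only, not pdf2ocr's implementation: the function names `to_grayscale` and `auto_contrast` are hypothetical, and the tool itself relies on imaging libraries rather than nested lists.

```python
# Conceptual sketch only: grayscale conversion and auto contrast on a tiny
# image stored as rows of (R, G, B) tuples. Names here are hypothetical.

def to_grayscale(rgb_rows):
    """Convert RGB pixels to 0-255 luminance using the ITU-R 601 weights."""
    return [
        [(299 * r + 587 * g + 114 * b + 500) // 1000 for (r, g, b) in row]
        for row in rgb_rows
    ]


def auto_contrast(gray_rows):
    """Stretch values linearly so the darkest becomes 0 and the lightest 255."""
    flat = [value for row in gray_rows for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # uniform image: nothing to stretch
        return [row[:] for row in gray_rows]
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in gray_rows]


# A "faded" 2x2 scan whose values only span 100..150.
faded = [
    [(100, 100, 100), (150, 150, 150)],
    [(120, 120, 120), (140, 140, 140)],
]
gray = to_grayscale(faded)
stretched = auto_contrast(gray)
print(gray)       # [[100, 150], [120, 140]]
print(stretched)  # [[0, 255], [102, 204]]
```

After stretching, the faded values occupy the full 0-255 range, which is the effect the Auto Contrast step aims for on low-contrast scans.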

📊 Benefits:

✅ Better accuracy on scanned documents with poor quality
✅ Improved recognition of faded or low-contrast text
✅ Enhanced performance on documents with noise or artifacts
✅ Automatic fallback to basic processing when the optional libraries (scipy, scikit-image) are not installed
✅ Works with all output formats (PDF, DOCX, HTML, EPUB)
✅ Configurable quality with --dpi option (72-1200 range)

⚙️ Quality Control:

The --dpi parameter sets the resolution used when rendering PDF pages to images:

Low DPI (72-150): Faster processing, smaller memory usage, suitable for clean documents
Medium DPI (200-400): Balanced quality and performance (default: 400)
High DPI (500-1200): Maximum quality for challenging documents, slower processing
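The cost of a higher DPI can be estimated with simple arithmetic. The sketch below assumes a US Letter page (8.5 x 11 in) rendered as uncompressed 8-bit RGB; the helper `page_raster_size` is hypothetical and only illustrates why memory use grows with the square of the DPI.

```python
MIN_DPI, MAX_DPI = 72, 1200  # range documented for --dpi


def page_raster_size(dpi, width_in=8.5, height_in=11.0, bytes_per_pixel=3):
    """Return (width_px, height_px, bytes) for one page rendered at `dpi`.

    Assumes an uncompressed raster; the page size and 3 bytes per pixel
    (RGB) are illustrative defaults, not values taken from pdf2ocr.
    """
    if not MIN_DPI <= dpi <= MAX_DPI:
        raise ValueError(f"dpi must be between {MIN_DPI} and {MAX_DPI}")
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


for dpi in (150, 400, 600):
    w, h, size = page_raster_size(dpi)
    print(f"{dpi} DPI: {w} x {h} px, about {size / 2**20:.1f} MiB per page")
```

Doubling the DPI quadruples the pixel count, so moving from 150 to 600 DPI multiplies the per-page memory by 16 under these assumptions.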

💡 Note: All image enhancements are applied automatically - no configuration needed!

🚀 Quick Install & Usage

Install globally

Install pdf2ocr and use it as a command-line tool:

pip install pdf2ocr

📌 Usage Examples

Generate multiple output formats with logging:

pdf2ocr ./pdfs --docx --pdf --epub --html --dest-dir ./output --logfile pdf2ocr.log

Generate layout-preserving OCR PDFs only:

pdf2ocr ./pdfs --pdf --preserve-layout --dest-dir ./output --logfile pdf2ocr.log

Process multiple files in parallel with 8 workers:

pdf2ocr ./pdfs --docx --pdf --html --epub --workers 8 --logfile pdf2ocr.log
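As a rough picture of what a worker pool does, the following sketch fans a list of files out to a fixed number of workers using the standard library. It is an assumption-laden illustration: `ocr_one` is a hypothetical stand-in for the per-file work, and pdf2ocr may well use processes rather than threads internally.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def ocr_one(pdf_path):
    """Hypothetical stand-in for the per-file OCR work.

    Here it only derives an output file name so the sketch stays runnable.
    """
    return f"{Path(pdf_path).stem}_ocr.pdf"


def process_all(pdf_paths, workers=2):
    """Process files concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_one, pdf_paths))


print(process_all(["scan1.pdf", "reports/scan2.pdf"], workers=8))
```

The default of 2 mirrors the documented --workers default; `pool.map` keeps the output order aligned with the input list regardless of which worker finishes first.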

Enable batch processing for large PDFs to reduce memory usage:

pdf2ocr ./pdfs --pdf --batch-size 5 --logfile pdf2ocr.log  # Process 5 pages at a time
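Batching bounds memory because only one group of page images exists at a time. A minimal sketch of how page ranges for a given batch size could be computed is shown below; `page_batches` is a hypothetical helper, not part of pdf2ocr. Each range could then be passed to something like the `first_page` and `last_page` parameters of pdf2image's `convert_from_path`, though whether pdf2ocr does exactly that is an assumption.

```python
def page_batches(total_pages, batch_size=None):
    """Yield inclusive, 1-based (first_page, last_page) ranges.

    With no batch size (the documented default), the whole document
    is treated as a single range.
    """
    if total_pages <= 0:
        return
    if not batch_size or batch_size <= 0:
        yield (1, total_pages)
        return
    for first in range(1, total_pages + 1, batch_size):
        yield (first, min(first + batch_size - 1, total_pages))


print(list(page_batches(12, 5)))  # [(1, 5), (6, 10), (11, 12)]
```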

High-quality OCR with custom DPI for challenging documents:

pdf2ocr ./pdfs --pdf --dpi 600 --logfile pdf2ocr.log  # Higher DPI for better quality

Fast processing for clean documents with lower DPI:

pdf2ocr ./pdfs --pdf --dpi 150 --logfile pdf2ocr.log  # Lower DPI for faster processing

Control paragraph length with max sentences:

pdf2ocr ./pdfs --pdf --max-sentences 10  # Split paragraphs longer than 10 sentences
pdf2ocr ./pdfs --pdf --max-sentences 0   # Disable sentence-based splitting
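The effect of --max-sentences can be sketched as follows. This is a simplified illustration that assumes sentences end with `.`, `!` or `?` followed by whitespace; the function `split_paragraph` is hypothetical, and the tool's real sentence detection may differ.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def split_paragraph(text, max_sentences=15):
    """Split one paragraph into chunks of at most `max_sentences` sentences.

    A value of 0 (or less) disables splitting, matching the documented flag.
    """
    if max_sentences <= 0:
        return [text]
    sentences = [s for s in _SENTENCE_BREAK.split(text.strip()) if s]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


print(split_paragraph("One. Two! Three? Four.", max_sentences=2))
# ['One. Two!', 'Three? Four.']
```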

⚠️ When using --preserve-layout, only PDF output is supported. Other formats will be automatically disabled.

🌍 Language Support

pdf2ocr is currently optimized for Portuguese 🇧🇷🇵🇹 and uses it as the default OCR language.

You can override the language using the --lang option. Examples:

pdf2ocr ./pdfs --pdf --lang eng  # For English 🇬🇧🇺🇸

pdf2ocr ./pdfs --pdf --lang spa  # For Spanish (Español) 🇪🇸🇲🇽🇦🇷🇨🇱🇨🇴

pdf2ocr ./pdfs --pdf --lang fra  # For French (Français) 🇫🇷

To list the language codes installed on your system, run the command below:

tesseract --list-langs
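If you want to check for a language from a script before starting a long run, the command's output can be parsed. The sketch below assumes the output begins with a header line starting with "List of available languages" followed by one code per line; both helper names are hypothetical.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` output.

    Assumes a header line beginning with 'List of available languages'
    followed by one language code per line.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [line for line in lines if not line.startswith("List of available")]


def installed_languages():
    """Run Tesseract and return the installed language codes."""
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Some Tesseract builds write the list to stderr, so read both streams.
    return parse_list_langs(proc.stdout + "\n" + proc.stderr)


sample = 'List of available languages in "/usr/share/tessdata/" (3):\neng\nosd\npor\n'
print(parse_list_langs(sample))  # ['eng', 'osd', 'por']
```

Calling `installed_languages()` requires Tesseract on the PATH; the parser itself can be exercised without it, as the sample shows.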

🧱 System Requirements and Tesseract language models

Ubuntu / Debian (APT)

Install Tesseract OCR and the most common language models:

sudo apt update && sudo apt install tesseract-ocr \
    tesseract-ocr-por tesseract-ocr-eng tesseract-ocr-spa \
    tesseract-ocr-fra tesseract-ocr-ita

Or, to install all available language models:

sudo apt install tesseract-ocr-all

For optimal image processing performance, also install:

sudo apt install python3-scipy python3-skimage

Fedora / Red Hat / CentOS / AlmaLinux / Rocky Linux (DNF or YUM)

OCR requirements:

# For modern systems (DNF)
sudo dnf install tesseract poppler-utils calibre

# For older systems (YUM)
sudo yum install tesseract poppler-utils calibre

To install additional OCR language models:

sudo dnf install tesseract-langpack-por tesseract-langpack-eng \
    tesseract-langpack-spa tesseract-langpack-fra tesseract-langpack-ita

There is no equivalent to tesseract-ocr-all on Red Hat-based systems — install only the languages you need.

macOS (Homebrew)

brew install tesseract poppler
brew install --cask calibre

💡 Tip for macOS/Homebrew users:

If ebook-convert is not available after installing Calibre, add it to your PATH:

export PATH="$PATH:/Applications/calibre.app/Contents/MacOS"

To make the change permanent:

echo 'export PATH="$PATH:/Applications/calibre.app/Contents/MacOS"' >> ~/.zshrc
source ~/.zshrc

Check the Calibre installation:

ebook-convert --version
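On any platform, a quick way to confirm that the external programs are reachable is to look them up on the PATH. The sketch below uses only the standard library; the tool list is an assumption based on the packages named above (`pdftoppm` is one of the Poppler binaries that pdf2image calls), and `missing_tools` is a hypothetical helper.

```python
import shutil

# Assumed mapping of executables to the reason they are needed.
REQUIRED_TOOLS = {
    "tesseract": "OCR engine",
    "pdftoppm": "Poppler renderer used by pdf2image",
    "ebook-convert": "Calibre converter, needed only for EPUB output",
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of executables that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


for name in missing_tools():
    print(f"Missing: {name} ({REQUIRED_TOOLS[name]})")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is what you want.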

🐍 Python Setup (for development)

To use in a virtual environment:

python3 -m venv venv_pdf2ocr
source venv_pdf2ocr/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

📦 Dependencies

The tool includes advanced image processing capabilities with the following key dependencies:

Core OCR: pytesseract, pdf2image, pillow
Document Generation: python-docx, reportlab, pypdf
Advanced Image Processing: numpy, scipy, scikit-image
Progress & UI: tqdm

💡 Note: Advanced image processing dependencies (scipy, scikit-image) are optional - the tool will automatically fall back to basic processing if they're not available.
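One common way to implement this kind of optional-dependency fallback is to probe for the modules before choosing a code path. The following is a sketch of that pattern under stated assumptions, not pdf2ocr's actual logic; `processing_mode` is a hypothetical name. Note that scikit-image is imported as `skimage`.

```python
import importlib.util


def processing_mode(optional_modules=("scipy", "skimage"),
                    find_spec=importlib.util.find_spec):
    """Return 'advanced' if every optional module is importable, else 'basic'."""
    have_all = all(find_spec(name) is not None for name in optional_modules)
    return "advanced" if have_all else "basic"


print(f"Image processing mode: {processing_mode()}")
```

Probing with `find_spec` avoids importing heavy libraries just to learn whether they exist; the result depends on what is installed in the current environment.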

⚙️ Command Line Options

pdf2ocr -h

source_folder: Folder containing the input PDF files.
--dest-dir: Destination folder for output files (default: same as input).
--docx: Generate DOCX files with preserved paragraph structure.
--pdf: Generate OCR-processed PDF files.
--epub: Generate EPUB files (requires --docx; uses Calibre).
--html: Generate HTML files.
--preserve-layout: Preserve the visual layout of original documents (PDF only).
--lang: Set the OCR language code (default: por). Use tesseract --list-langs to check installed options.
--quiet: Run silently without progress output.
--summary: Display only final conversion summary.
--logfile: Path to save detailed log output (UTF-8 encoded).
--workers: Number of parallel workers for processing (default: 2).
--batch-size: Number of pages to process in each batch (disabled by default). Use this to optimize memory usage for large PDFs.
--dpi: DPI for PDF to image conversion (default: 400, range: 72-1200). Higher values improve OCR quality but increase processing time and memory usage.
--max-sentences: Max sentences per paragraph — splits overly long paragraphs (default: 15, 0 to disable).
--version: Show the program's version number and exit.
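For readers who want to see how these options fit together, here is a minimal `argparse` sketch that mirrors the documented flags, defaults, and the --preserve-layout rule. It is an approximation written from the list above, not the project's real parser; `build_parser`, `dpi_value`, and `parse` are hypothetical names.

```python
import argparse


def dpi_value(text):
    """Validate --dpi against the documented 72-1200 range."""
    value = int(text)
    if not 72 <= value <= 1200:
        raise argparse.ArgumentTypeError("--dpi must be between 72 and 1200")
    return value


def build_parser():
    parser = argparse.ArgumentParser(prog="pdf2ocr")
    parser.add_argument("source_folder", help="Folder containing the input PDFs")
    parser.add_argument("--dest-dir", help="Output folder (default: same as input)")
    for flag in ("--docx", "--pdf", "--epub", "--html",
                 "--preserve-layout", "--quiet", "--summary"):
        parser.add_argument(flag, action="store_true")
    parser.add_argument("--lang", default="por")
    parser.add_argument("--logfile")
    parser.add_argument("--workers", type=int, default=2)
    parser.add_argument("--batch-size", type=int, default=None)
    parser.add_argument("--dpi", type=dpi_value, default=400)
    parser.add_argument("--max-sentences", type=int, default=15)
    return parser


def parse(argv):
    """Parse arguments and apply the layout-preserving restriction."""
    args = build_parser().parse_args(argv)
    if args.preserve_layout:
        # Documented behavior: only PDF output is supported in this mode.
        args.docx = args.epub = args.html = False
    return args


args = parse(["./pdfs", "--pdf", "--html", "--preserve-layout", "--dpi", "600"])
print(args.pdf, args.html, args.dpi, args.lang)
```

Validating --dpi in a `type` callable lets argparse report out-of-range values as a normal usage error instead of failing later during conversion.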

🛠️ Makefile Commands

Command	Description
`make venv`	Create and set up a virtual environment (`venv_pdf2ocr`)
`make install`	Install `pdf2ocr` globally (or into the active virtualenv)
`make run`	Run `pdf2ocr` with example parameters (PDF, DOCX, EPUB, HTML)
`make test`	Run automated tests with `pytest`
`make lint`	Run `flake8` to check code quality
`make format`	Auto-format code using `black` and `isort`
`make clean`	Remove Python cache, logs, build files and other generated artifacts

📄 License

MIT

Project details

Release history

Version	Release date
1.1.2 (this version)	Apr 30, 2026
1.1.1	Apr 11, 2026
1.1.0	Apr 11, 2026
1.0.19	May 27, 2025
1.0.16	May 10, 2025
1.0.15	May 9, 2025
1.0.14	May 9, 2025
1.0.13	May 9, 2025
1.0.12	May 8, 2025
1.0.11	May 7, 2025
1.0.10	May 7, 2025
1.0.8	May 7, 2025
1.0.7	May 7, 2025
1.0.6	May 7, 2025
1.0.5	May 5, 2025
1.0.4	May 5, 2025
1.0.3	May 5, 2025

Download files

Source Distribution

pdf2ocr-1.1.2.tar.gz (48.5 kB)

Uploaded Apr 30, 2026 Source

Built Distribution

pdf2ocr-1.1.2-py3-none-any.whl (36.9 kB)

Uploaded Apr 30, 2026 Python 3

File details

Details for the file pdf2ocr-1.1.2.tar.gz.

File metadata

Download URL: pdf2ocr-1.1.2.tar.gz
Upload date: Apr 30, 2026
Size: 48.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdf2ocr-1.1.2.tar.gz
Algorithm	Hash digest
SHA256	`a13b30529b1948e5c983a4c80bf05c2828c16237dc2ffc1bf9368474da61a644`
MD5	`94f9422b23158088cebc6c21448b5689`
BLAKE2b-256	`537d3aca080d5a083f88808323a97934ddf1799d08761702cf80704fe25b5020`

Provenance

The following attestation bundles were made for pdf2ocr-1.1.2.tar.gz:

Publisher: python-publish.yml on rdantassilva/pdf2ocr

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf2ocr-1.1.2.tar.gz
- Subject digest: a13b30529b1948e5c983a4c80bf05c2828c16237dc2ffc1bf9368474da61a644
- Sigstore transparency entry: 1413467111
- Sigstore integration time: Apr 30, 2026
Source repository:
- Permalink: rdantassilva/pdf2ocr@17b18666bc3e34b81d1464dc54ffd652fad17c4d
- Branch / Tag: refs/tags/v1.1.2
- Owner: https://github.com/rdantassilva
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@17b18666bc3e34b81d1464dc54ffd652fad17c4d
- Trigger Event: release

File details

Details for the file pdf2ocr-1.1.2-py3-none-any.whl.

File metadata

Download URL: pdf2ocr-1.1.2-py3-none-any.whl
Upload date: Apr 30, 2026
Size: 36.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdf2ocr-1.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b81e63dc28386fa97a726e0402760b7f28eaa68537a56ec8a5993dc75d1d6eb8`
MD5	`b427119a53dbb7a77d4ae02d75a221d7`
BLAKE2b-256	`01e3f08c527ea2cb8fafdf223f9fe9da81201d88c0bd4790160aa3480c8043c2`

Provenance

The following attestation bundles were made for pdf2ocr-1.1.2-py3-none-any.whl:

Publisher: python-publish.yml on rdantassilva/pdf2ocr

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf2ocr-1.1.2-py3-none-any.whl
- Subject digest: b81e63dc28386fa97a726e0402760b7f28eaa68537a56ec8a5993dc75d1d6eb8
- Sigstore transparency entry: 1413467219
- Sigstore integration time: Apr 30, 2026
Source repository:
- Permalink: rdantassilva/pdf2ocr@17b18666bc3e34b81d1464dc54ffd652fad17c4d
- Branch / Tag: refs/tags/v1.1.2
- Owner: https://github.com/rdantassilva
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@17b18666bc3e34b81d1464dc54ffd652fad17c4d
- Trigger Event: release
