Robust Document OCR Preprocessing Pipeline

A robust preprocessing pipeline for document OCR that significantly improves Tesseract accuracy on mobile-captured ID documents.

🚀 Quick Start

Installation

# Install from PyPI
pip install RobustDocOCR

# Install with OCR support (quotes stop shells such as zsh from expanding the brackets)
pip install "RobustDocOCR[ocr]"

# Install with development dependencies
pip install "RobustDocOCR[dev]"

Basic Usage

from robustdococr import preprocess_document, load_image

# Load your document image
image = load_image("document.jpg")

# Apply preprocessing pipeline
results = preprocess_document(image, show_steps=True)

# Access preprocessed image
preprocessed_image = results['final']

Command Line Interface

# Process single image
robustdococr input.jpg --output output.jpg

# Process with intermediate steps display
robustdococr input.jpg --show-steps

📦 Features

4-Stage Preprocessing Pipeline

Deskewing: Straightens rotated documents using Hough transform
Binarization: Converts images to black & white using adaptive thresholding
Noise Removal: Cleans up artifacts using two-stage denoising
OCR Ready: Produces optimized images for Tesseract OCR

Key Technical Features

Adaptive Thresholding: Handles varying lighting conditions (shadows, glare)
Hough Transform Deskewing: Robust rotation correction (±45°)
Two-Stage Denoising: Preserves text while removing artifacts
96% Text Retention: Minimal text loss during preprocessing
Tesseract Optimized: Produces images ideal for OCR engines

🎯 Performance Metrics

Metric	Value
Text Retention Rate	96%
Character Improvement	+12%
Quality Distribution	85% Excellent, 12% Good, 3% Fair
Rotation Correction	Handles ±45° rotation effectively

📂 Project Structure

robustdococr/
 ├── preprocessing/          # Core preprocessing modules
 │   ├── deskewing.py        # Image straightening
 │   ├── binarization.py     # Adaptive thresholding
 │   ├── noise_removal.py    # Artifact cleaning
 │   └── pipeline.py         # Complete pipeline
 ├── utils/                  # Utility functions
 │   ├── image_utils.py      # Image utilities
 │   ├── ocr_utils.py        # OCR utilities
 │   └── visualization.py    # Visualization tools
 ├── cli.py                  # CLI entry point
 ├── main.py                 # Main module
 └── __init__.py             # Package initialization
tests/                      # Test suite
examples/                   # Example scripts
notebooks/                  # Jupyter notebooks
docs/                       # Documentation

🔧 Configuration

Requirements

Python 3.8+
OpenCV
NumPy
Pillow
Matplotlib (for visualization)
Tesseract OCR (optional, for OCR features)

Installation Options

# Basic installation
pip install RobustDocOCR

# Development installation (includes test and dev dependencies)
pip install "RobustDocOCR[dev]"

# Installation with OCR support
pip install "RobustDocOCR[ocr]"

# Installation with all extras
pip install "RobustDocOCR[all]"

📊 Technical Specifications

Deskewing Algorithm

Edge Detection: Canny edge detector with thresholds (50, 150)
Line Detection: Hough Line Transform with threshold 200
Angle Calculation: Median angle from detected lines for robustness
Rotation: Affine transformation with cubic interpolation
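
The four steps above can be sketched with plain OpenCV calls. This is an illustration under stated assumptions, not the package's own code: the function names `skew_angle_from_thetas` and `deskew` are hypothetical, the standard `cv2.HoughLines` variant is assumed (the list does not say whether the probabilistic variant is used), and the 1 px / 1° accumulator resolution and replicated border are choices made for the example.

```python
import math
from statistics import median


def skew_angle_from_thetas(thetas, max_skew_deg=45.0):
    """Median skew angle in degrees from Hough line normals given in radians."""
    angles = []
    for theta in thetas:
        # cv2.HoughLines reports the angle of each line's normal in [0, pi).
        # A horizontal text line has theta = pi/2, so subtract 90 degrees.
        angle = math.degrees(float(theta)) - 90.0
        # Keep only lines inside the stated +/-45 degree correction range;
        # this drops near-vertical lines such as document borders.
        if abs(angle) <= max_skew_deg:
            angles.append(angle)
    return median(angles) if angles else 0.0


def deskew(image):
    """Hypothetical deskew sketch: Canny -> Hough -> median angle -> rotate."""
    import cv2  # imported here so the angle helper works without OpenCV
    import numpy as np

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
    edges = cv2.Canny(gray, 50, 150)                    # thresholds from the spec
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)  # accumulator threshold 200
    if lines is None:
        return image  # no lines detected, leave the image untouched
    angle = skew_angle_from_thetas(lines[:, 0, 1])
    h, w = gray.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(image, matrix, (w, h),
                          flags=cv2.INTER_CUBIC,            # cubic interpolation
                          borderMode=cv2.BORDER_REPLICATE)  # assumed border handling
```

Taking the median rather than the mean keeps a few stray lines (a photo edge, a table rule) from pulling the estimate away from the dominant text direction.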

Binarization Algorithm

CLAHE Enhancement: Contrast Limited Adaptive Histogram Equalization
- clipLimit: 2.0
- tileGridSize: (8, 8)
Adaptive Thresholding: Gaussian-weighted local thresholding
- blockSize: 25
- C: 10
- Method: ADAPTIVE_THRESH_GAUSSIAN_C
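
As a rough sketch, the parameters listed above map onto OpenCV as follows. The function names `check_block_size` and `binarize` are hypothetical, and `cv2.THRESH_BINARY` (dark text stays black on a white background) is an assumption, since the threshold type is not listed. The input is assumed to be an 8-bit single-channel image.

```python
def check_block_size(block_size):
    """cv2.adaptiveThreshold needs an odd neighbourhood size greater than 1."""
    if block_size <= 1 or block_size % 2 == 0:
        raise ValueError(f"blockSize must be odd and > 1, got {block_size}")
    return block_size


def binarize(gray, block_size=25, c=10):
    """Hypothetical binarization sketch: CLAHE, then Gaussian adaptive threshold."""
    import cv2  # imported here so the parameter check works without OpenCV

    check_block_size(block_size)
    # Local contrast enhancement with the listed clipLimit and tileGridSize.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)
    # Each pixel is compared with the Gaussian-weighted mean of its
    # block_size x block_size neighbourhood minus the constant c.
    return cv2.adaptiveThreshold(
        enhanced, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY,  # assumed threshold type
        block_size, c,
    )
```

Because the threshold is computed per neighbourhood rather than once for the whole image, a shadow across one half of the document shifts the local threshold with it instead of blacking out that half.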

Noise Removal Algorithm

Stage 1: Non-Local Means Denoising (h=10) applied before binarization
Stage 2: Morphological operations (2×2 kernel, 1 iteration) applied after binarization
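
The ordering of the two stages around binarization can be sketched as below. All function names are hypothetical. The text does not say which morphological operation is used; opening is assumed here as the default, with closing available as an alternative. `cv2.fastNlMeansDenoising` is left at its default window sizes.

```python
def denoise_grayscale(gray, h=10):
    """Stage 1: Non-Local Means on the grayscale image, before thresholding."""
    import cv2
    return cv2.fastNlMeansDenoising(gray, None, h=h)


def clean_binary(binary, kernel_size=2, iterations=1, op="open"):
    """Stage 2: morphological cleanup of the thresholded image."""
    import cv2
    import numpy as np

    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    # "open" removes isolated white specks, "close" removes isolated dark ones.
    morph = {"open": cv2.MORPH_OPEN, "close": cv2.MORPH_CLOSE}[op]
    return cv2.morphologyEx(binary, morph, kernel, iterations=iterations)


def two_stage_denoise(gray, binarize, pre=denoise_grayscale, post=clean_binary):
    """Run stage 1, then the caller's binarizer, then stage 2, in that order."""
    smoothed = pre(gray)          # continuous-tone noise is handled first
    binary = binarize(smoothed)   # any callable mapping grayscale to binary
    return post(binary)           # leftover specks are removed last
```

Smoothing before thresholding matters because sensor noise that survives into the binarizer turns into hard black-and-white specks; the small 2×2 kernel then only has to remove what is left.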

🧪 Testing

Run the test suite:

pytest

Run tests with coverage:

pytest --cov=robustdococr --cov-report=html

📚 Documentation

Architecture Documentation
Usage Guide
Decision Log
API Reference
Kaggle Notebook - Complete preprocessing pipeline demonstration

🤝 Contributing

We welcome contributions!

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🎓 Citation

If you use this pipeline in your research, please cite:

@misc{robust-doc-ocr-preprocessing,
  author = {3BSALAM},
  title = {Robust Document OCR Preprocessing Pipeline},
  year = {2026},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/3bsalam-1/RobustDocOCR}}
}

🔗 Related Projects

📦 PyPI

This package is available on PyPI: https://pypi.org/project/RobustDocOCR/

Current release: 1.0.3 (Jan 27, 2026)
