Robust Document OCR Preprocessing Pipeline
A robust preprocessing pipeline for document OCR that significantly improves Tesseract accuracy on mobile-captured ID documents.
🚀 Quick Start
Installation
# Install from PyPI
pip install RobustDocOCR
# Install with OCR support (quotes keep shells such as zsh from expanding the brackets)
pip install "RobustDocOCR[ocr]"
# Install with development dependencies
pip install "RobustDocOCR[dev]"
Basic Usage
from robustdococr import preprocess_document, load_image
# Load your document image
image = load_image("document.jpg")
# Apply preprocessing pipeline
results = preprocess_document(image, show_steps=True)
# Access preprocessed image
preprocessed_image = results['final']
Command Line Interface
# Process single image
robustdococr input.jpg --output output.jpg
# Process with intermediate steps display
robustdococr input.jpg --show-steps
📦 Features
4-Stage Preprocessing Pipeline
- Deskewing: Straightens rotated documents using Hough transform
- Binarization: Converts images to black & white using adaptive thresholding
- Noise Removal: Cleans up artifacts using two-stage denoising
- OCR Ready: Produces optimized images for Tesseract OCR
Key Technical Features
- Adaptive Thresholding: Handles varying lighting conditions (shadows, glare)
- Hough Transform Deskewing: Robust rotation correction (±45°)
- Two-Stage Denoising: Preserves text while removing artifacts
- 96% Text Retention: Minimal text loss during preprocessing
- Tesseract Optimized: Produces images ideal for OCR engines
🎯 Performance Metrics
| Metric | Value |
|---|---|
| Text Retention Rate | 96% |
| Character Improvement | +12% |
| Quality Distribution | 85% Excellent, 12% Good, 3% Fair |
| Rotation Correction | Handles ±45° rotation effectively |
📂 Project Structure
robustdococr/
├── preprocessing/ # Core preprocessing modules
│ ├── deskewing.py # Image straightening
│ ├── binarization.py # Adaptive thresholding
│ ├── noise_removal.py # Artifact cleaning
│ └── pipeline.py # Complete pipeline
├── utils/ # Utility functions
│ ├── image_utils.py # Image utilities
│ ├── ocr_utils.py # OCR utilities
│ └── visualization.py # Visualization tools
├── cli.py # CLI entry point
├── main.py # Main module
└── __init__.py # Package initialization
tests/ # Test suite
examples/ # Example scripts
notebooks/ # Jupyter notebooks
docs/ # Documentation
🔧 Configuration
Requirements
- Python 3.8+
- OpenCV
- NumPy
- Pillow
- Matplotlib (for visualization)
- Tesseract OCR (optional, for OCR features)
Installation Options
# Basic installation
pip install RobustDocOCR
# Development installation (includes test and dev dependencies)
pip install "RobustDocOCR[dev]"
# Installation with OCR support
pip install "RobustDocOCR[ocr]"
# Installation with all extras
pip install "RobustDocOCR[all]"
📊 Technical Specifications
Deskewing Algorithm
- Edge Detection: Canny edge detector with thresholds (50, 150)
- Line Detection: Hough Line Transform with threshold 200
- Angle Calculation: Median angle from detected lines for robustness
- Rotation: Affine transformation with cubic interpolation
Binarization Algorithm
- CLAHE Enhancement: Contrast Limited Adaptive Histogram Equalization
  - clipLimit: 2.0
  - tileGridSize: (8, 8)
- Adaptive Thresholding: Gaussian-weighted local thresholding
  - blockSize: 25
  - C: 10
  - Method: ADAPTIVE_THRESH_GAUSSIAN_C
Noise Removal Algorithm
- Stage 1: Non-Local Means Denoising (h=10) applied before binarization
- Stage 2: Morphological operations (2×2 kernel, 1 iteration) applied after binarization
🧪 Testing
Run the test suite:
pytest
Run tests with coverage:
pytest --cov=robustdococr --cov-report=html
📚 Documentation
- Architecture Documentation
- Usage Guide
- Decision Log
- API Reference
- Kaggle Notebook - Complete preprocessing pipeline demonstration
🤝 Contributing
We welcome contributions!
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🎓 Citation
If you use this pipeline in your research, please cite:
@misc{robust-doc-ocr-preprocessing,
author = {3BSALAM},
title = {Robust Document OCR Preprocessing Pipeline},
year = {2026},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/3bsalam-1/RobustDocOCR}}
}
📦 PyPI
This package is available on PyPI: https://pypi.org/project/RobustDocOCR/
© 2026 Robust Document OCR Preprocessing Pipeline