Advanced PDF OCR processing with AI-powered text extraction and selectable text overlays
PDF OCR Processor
Advanced PDF processing with AI-powered OCR, text extraction, and selectable text overlays using Ollama models
Features
- AI-Powered OCR using Ollama models (llava, moondream, etc.)
- Modular Architecture with clear separation of concerns
- Multiple Output Formats:
  - SVG with selectable text overlays
  - Raw text extraction
  - JSON metadata
- Image Enhancement with multiple strategies
- Robust Error Handling with configurable retries
- Parallel Processing for batch operations
- CLI Interface with progress tracking
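The "configurable retries" feature can be pictured as a small wrapper with exponential backoff. This is a minimal sketch and not the package's implementation: `retry_call`, its parameters, and the doubling delay are assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_call(func: Callable[[], T], max_retries: int = 3,
               base_delay: float = 0.5,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying on any exception up to max_retries extra times.

    Hypothetical helper: the delay doubles after each failed attempt.
    """
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= max_retries:
                raise  # Out of retries: surface the last error.
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting; in normal use the default `time.sleep` applies.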
System Architecture
┌─────────────────────────────────────────────────┐
│                PDF OCR Processor                │
├─────────────────┬───────────────────────────────┤
│ ┌────────────┐  │  ┌─────────────────────────┐  │
│ │ PDF        │  │  │ OCRProcessor            │  │
│ │ Processor  │──┼─▶│ - Text extraction       │  │
│ └────────────┘  │  │ - Ollama integration    │  │
│                 │  └─────────────┬───────────┘  │
│ ┌────────────┐  │  ┌─────────────▼───────────┐  │
│ │ Image      │  │  │ SVG Generator           │  │
│ │ Enhancer   │──┼─▶│ - Text overlay          │  │
│ └────────────┘  │  │ - Searchable output     │  │
└─────────────────┴───────────────────────────────┘
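To make the SVG Generator's role concrete, here is a minimal sketch of a text overlay: the page image is embedded, and each recognized word is placed on top as fully transparent text, which browsers still treat as selectable and searchable. The function name, the `(text, x, y, font_size)` word format, and the layout are assumptions for illustration, not the package's actual output format.

```python
from xml.sax.saxutils import escape, quoteattr


def build_svg_page(image_href, width, height, words):
    """Return an SVG string: page image plus invisible, selectable text.

    words: iterable of (text, x, y, font_size) tuples in pixel units.
    """
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}" viewBox="0 0 {width} {height}">',
        # Plain href is SVG 2; older viewers may need xlink:href instead.
        f'  <image href={quoteattr(image_href)} width="{width}" height="{height}"/>',
    ]
    for text, x, y, size in words:
        # fill-opacity="0" hides the glyphs but keeps them selectable.
        parts.append(
            f'  <text x="{x}" y="{y}" font-size="{size}" fill-opacity="0">'
            f'{escape(text)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)
```

For example, `build_svg_page("page1.png", 2480, 3508, [("Invoice", 120, 200, 48)])` yields a single-page SVG whose only text element is the word "Invoice".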
Installation
Prerequisites
- Python 3.8+
- Ollama (for OCR processing)
- System dependencies:
# Ubuntu/Debian
sudo apt-get install -y tesseract-ocr poppler-utils
# macOS
brew install tesseract poppler
Install from source
# Clone the repository
git clone https://github.com/wronai/ocr.git
cd ocr
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # Linux/macOS
# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt # For development
Quick Start
Basic Usage
# Process a single PDF
python -m pdf_processor --input document.pdf --output output/
# Process all PDFs in a directory
python -m pdf_processor --input ./documents --output ./output --model llava:7b
# Show help
python -m pdf_processor --help
Python API
from pdf_processor import PDFProcessor
from pdf_processor.processing.pdf_processor import PDFProcessorConfig
# Configure the processor
config = PDFProcessorConfig(
    input_path="document.pdf",
    output_dir="./output",
    ocr_model="llava:7b",
    dpi=300,
    max_workers=4,
)
# Process a document
processor = PDFProcessor(config)
result = processor.process_pdf("document.pdf")
print(f"Processed {result['pages_processed']} pages")
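For batch operations, a per-document call like the one above could be fanned out over a thread pool. The sketch below is an assumption about how that might look, not the package's own batch code: `process_batch` is a hypothetical helper, and `process_one` stands in for any callable that handles a single PDF (for example `processor.process_pdf`).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(pdf_paths, process_one, max_workers=4):
    """Run process_one(path) for each path in a thread pool.

    Returns (results, errors): dicts keyed by path string. A failure in one
    document is recorded and does not stop the rest of the batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, p): str(p) for p in pdf_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going on per-file failures
                errors[path] = exc
    return results, errors
```

Threads suit this workload when most of the time is spent waiting on the OCR model server rather than computing locally; collecting errors per file lets a long batch finish even if a few documents fail.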
Configuration
Configuration File
Create a config.yaml file:
# config.yaml
input_path: ./documents # Input file or directory
output_dir: ./output # Output directory
ocr_model: llava:7b # Ollama model to use
dpi: 300 # Image resolution
max_workers: 4 # Number of worker threads
timeout: 300 # Timeout in seconds
max_retries: 3 # Max retry attempts
log_level: INFO # Logging level
log_file: pdf_processor.log # Log file path
# Image enhancement strategies
enhancement_strategies:
- original # Keep original image
- grayscale # Convert to grayscale
- adaptive_threshold # Apply adaptive thresholding
- contrast_stretch # Stretch contrast
- sharpen # Sharpen image
- denoise # Remove noise
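Once the YAML above has been parsed into a dictionary, it helps to validate it before processing starts. The following is a minimal sketch under stated assumptions: the `Settings` class, its defaults (copied from the sample file), and `settings_from_dict` are hypothetical and do not describe how the package itself loads configuration.

```python
from dataclasses import dataclass, field, fields
from typing import List

KNOWN_STRATEGIES = {"original", "grayscale", "adaptive_threshold",
                    "contrast_stretch", "sharpen", "denoise"}


@dataclass
class Settings:
    """Hypothetical mirror of the config.yaml keys shown above."""
    input_path: str = "./documents"
    output_dir: str = "./output"
    ocr_model: str = "llava:7b"
    dpi: int = 300
    max_workers: int = 4
    timeout: int = 300
    max_retries: int = 3
    log_level: str = "INFO"
    log_file: str = "pdf_processor.log"
    enhancement_strategies: List[str] = field(default_factory=lambda: ["original"])

    def __post_init__(self):
        if self.dpi <= 0 or self.max_workers <= 0 or self.timeout <= 0:
            raise ValueError("dpi, max_workers and timeout must be positive")
        if self.max_retries < 0:
            raise ValueError("max_retries must not be negative")
        unknown = set(self.enhancement_strategies) - KNOWN_STRATEGIES
        if unknown:
            raise ValueError(f"unknown enhancement strategies: {sorted(unknown)}")


def settings_from_dict(raw: dict) -> Settings:
    """Build Settings from a parsed config mapping, rejecting unknown keys."""
    allowed = {f.name for f in fields(Settings)}
    extra = set(raw) - allowed
    if extra:
        raise ValueError(f"unknown config keys: {sorted(extra)}")
    return Settings(**raw)
```

With PyYAML installed, the dictionary could come from `yaml.safe_load` applied to the opened config.yaml file; rejecting unknown keys catches typos such as `max_worker` early.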
Environment Variables
export OLLAMA_HOST="http://localhost:11434"
export OLLAMA_MODEL="llava:7b"
export LOG_LEVEL="DEBUG"
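As an illustration of how these variables might feed the Ollama integration, the sketch below builds a request for Ollama's `/api/generate` endpoint, which accepts base64-encoded images for multimodal models. The function names and the prompt text are assumptions, and the package's real client code may differ.

```python
import base64
import json
import os
import urllib.request


def build_ocr_request(image_bytes, prompt="Transcribe all text in this image."):
    """Build a urllib Request for Ollama's /api/generate endpoint.

    Host and model come from OLLAMA_HOST / OLLAMA_MODEL, with the values shown
    above as fallbacks. The prompt text is an assumption for illustration.
    """
    host = os.environ.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/")
    payload = {
        "model": os.environ.get("OLLAMA_MODEL", "llava:7b"),
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_ocr(image_bytes, timeout=300):
    """Send the request to a running Ollama server and return the text."""
    request = build_ocr_request(image_bytes)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

`run_ocr` needs a reachable Ollama server with the chosen model pulled; `build_ocr_request` can be inspected on its own without any network access.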
Advanced Usage
Processing Options
# Process with specific DPI
python -m pdf_processor --input document.pdf --output output/ --dpi 400
# Limit number of pages to process
python -m pdf_processor --input document.pdf --output output/ --max-pages 10
# Use a specific enhancement strategy
python -m pdf_processor --input document.pdf --output output/ --enhance grayscale
# Process in verbose mode
python -m pdf_processor --input document.pdf --output output/ --verbose
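The flags used in these commands map naturally onto an `argparse` parser. The sketch below is a hypothetical reconstruction from the examples in this README only; the defaults, help strings, and the restriction of `--enhance` to the documented strategy names are assumptions, and the real cli.py may define more options or different defaults.

```python
import argparse


def build_parser():
    """Hypothetical parser covering the flags used in the examples above."""
    parser = argparse.ArgumentParser(prog="pdf_processor")
    parser.add_argument("--input", required=True, help="PDF file or directory")
    parser.add_argument("--output", required=True, help="output directory")
    parser.add_argument("--model", default="llava:7b", help="Ollama model name")
    parser.add_argument("--dpi", type=int, default=300, help="render resolution")
    parser.add_argument("--max-pages", type=int, default=None,
                        help="stop after this many pages")
    parser.add_argument("--enhance", default="original",
                        choices=["original", "grayscale", "adaptive_threshold",
                                 "contrast_stretch", "sharpen", "denoise"])
    parser.add_argument("--verbose", action="store_true")
    return parser
```

Calling `build_parser().parse_args(["--input", "document.pdf", "--output", "output/", "--dpi", "400"])` returns a namespace with `dpi` set to 400 and the other options at their assumed defaults.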
Available Enhancement Strategies
- original: Keep original image (fastest)
- grayscale: Convert to grayscale (good for text-heavy documents)
- adaptive_threshold: Apply adaptive thresholding (good for low-quality scans)
- contrast_stretch: Stretch contrast to improve readability
- sharpen: Apply sharpening filter
- denoise: Remove image noise
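To show what two of these strategies do to pixel values, here is a dependency-free sketch operating on plain nested lists. It only illustrates the underlying arithmetic for `grayscale` and `contrast_stretch`; the function names and the list-based image format are assumptions, and the package presumably uses an imaging library instead.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luma values."""
    # ITU-R BT.601 luma weights, a common choice for RGB-to-gray conversion.
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_rows]


def contrast_stretch(gray_rows):
    """Linearly rescale gray values so the darkest maps to 0, the brightest to 255."""
    flat = [v for row in gray_rows for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # flat image: nothing to stretch
        return [list(row) for row in gray_rows]
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in gray_rows]
```

A faded scan whose gray values only span 50 to 150 would be spread across the full 0 to 255 range by `contrast_stretch`, which is why this strategy can help with low-contrast pages.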
Development
Project Structure
pdf_processor/
├── __init__.py              # Package initialization
├── cli.py                   # Command-line interface
├── config/                  # Configuration files
├── models/                  # Data models
│   ├── __init__.py
│   ├── ocr_result.py        # OCR result data structures
│   └── retry_config.py      # Retry configuration
├── processing/              # Core processing modules
│   ├── __init__.py
│   ├── image_enhancement.py # Image processing
│   ├── ocr_processor.py     # OCR processing
│   ├── pdf_processor.py     # Main PDF processing
│   └── svg_generator.py     # SVG output generation
└── utils/                   # Utility functions
    ├── file_utils.py        # File operations
    ├── logging_utils.py     # Logging configuration
    └── validation_utils.py  # Input validation
Running Tests
# Install test dependencies
pip install -r requirements-dev.txt
# Run all tests
pytest
# Run tests with coverage report
pytest --cov=pdf_processor --cov-report=html
Contributing
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Acknowledgments
- The Ollama team for their amazing AI models
- The PyMuPDF team for excellent PDF processing
- All contributors who have helped improve this project
Development Workflow
This project uses a script-based workflow for development tasks. All scripts are located in the scripts/ directory and can be run directly or via the Makefile.
Setup
- Clone the repository and navigate to the project directory:
  git clone https://github.com/wronai/ocr.git
  cd ocr
- Set up the development environment:
  make install-dev
  This will:
  - Create and activate a virtual environment
  - Install all development dependencies
  - Set up pre-commit hooks
Common Development Tasks
# Run tests
make test
# Run tests with coverage
make test-cov
# Format code
make format
# Run linters
make lint
# Start development server
make dev-server
# Build documentation
make docs
make docs-serve # Serve docs locally
Scripts Directory
All development and build scripts are located in the scripts/ directory. See scripts/README.md for detailed documentation of each script.
Docker Development
# Build Docker image
make docker-build
# Start services with Docker Compose
make docker-run
# Stop services
make docker-stop
When contributing, please ensure your code follows the project's coding standards and includes appropriate tests.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Changelog
See CHANGELOG.md for a list of changes in each version.
python proc.py --model llava:7b --workers 4
- View Results
  - Open output/*_complete.svg in your browser
  - Check details in output/processing_report.json
Documentation
Full documentation is available in the docs/ directory.