Advanced PDF OCR processing with AI-powered text extraction and selectable text overlays

PDF OCR Processor

Advanced PDF processing with AI-powered OCR, text extraction, and selectable text overlays using Ollama models

🚀 Features

  • AI-Powered OCR using Ollama models (llava, moondream, etc.)
  • Modular Architecture with clear separation of concerns
  • Multiple Output Formats:
    • SVG with selectable text overlays
    • Raw text extraction
    • JSON metadata
  • Image Enhancement with multiple strategies
  • Robust Error Handling with configurable retries
  • Parallel Processing for batch operations
  • CLI Interface with progress tracking

🛠️ System Architecture

┌─────────────────────────────────────────────────┐
│               PDF OCR Processor                 │
├─────────────────┬───────────────────────────────┤
│  ┌────────────┐ │  ┌─────────────────────────┐  │
│  │ PDF        │ │  │      OCRProcessor       │  │
│  │ Processor  ├─┼─▶│  - Text extraction      │  │
│  └────────────┘ │  │  - Ollama integration   │  │
│                 │  └─────────────┬───────────┘  │
│  ┌────────────┐ │  ┌─────────────▼───────────┐  │
│  │ Image      │ │  │      SVG Generator      │  │
│  │ Enhancer   ├─┼─▶│  - Text overlay         │  │
│  └────────────┘ │  │  - Searchable output    │  │
└─────────────────┴───────────────────────────────┘
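
To make the "Text overlay" and "Searchable output" items in the diagram concrete, here is a minimal sketch of how a page image and transparent text can be combined in one SVG. It is an illustration only, not the project's SVG generator: the function name `build_overlay_svg`, the `(text, x, y, font_size)` word tuples, and the file name `page_001.png` are all assumptions.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"


def build_overlay_svg(image_href, width, height, words):
    """Return SVG markup: the page image plus invisible, selectable text.

    `words` is a list of (text, x, y, font_size) tuples. This shape is an
    assumption for illustration, not the project's OCR result format.
    """
    ET.register_namespace("", SVG_NS)
    ET.register_namespace("xlink", XLINK_NS)
    svg = ET.Element(
        f"{{{SVG_NS}}}svg",
        {"width": str(width), "height": str(height),
         "viewBox": f"0 0 {width} {height}"},
    )
    # The rendered page sits underneath as a raster image.
    ET.SubElement(
        svg, f"{{{SVG_NS}}}image",
        {f"{{{XLINK_NS}}}href": image_href, "x": "0", "y": "0",
         "width": str(width), "height": str(height)},
    )
    # Fully transparent text is not painted but remains part of the document text.
    for text, x, y, size in words:
        node = ET.SubElement(
            svg, f"{{{SVG_NS}}}text",
            {"x": str(x), "y": str(y), "font-size": str(size),
             "fill": "black", "fill-opacity": "0"},
        )
        node.text = text
    return ET.tostring(svg, encoding="unicode")


markup = build_overlay_svg("page_001.png", 800, 1000, [("Invoice", 72, 96, 24)])
```

Browsers generally still let users select and search text drawn with zero fill opacity, which is why this layering is a common way to make scanned pages searchable.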

📦 Installation

Prerequisites

  • Python 3.8+
  • Ollama (for OCR processing)
  • System dependencies:
    # Ubuntu/Debian
    sudo apt-get install -y tesseract-ocr poppler-utils
    
    # macOS
    brew install tesseract poppler
    

Install from source

# Clone the repository
git clone https://github.com/wronai/ocr.git
cd ocr

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # Linux/macOS

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt  # For development

🏁 Quick Start

Basic Usage

# Process a single PDF
python -m pdf_processor --input document.pdf --output output/

# Process all PDFs in a directory
python -m pdf_processor --input ./documents --output ./output --model llava:7b

# Show help
python -m pdf_processor --help

Python API

from pdf_processor import PDFProcessor
from pdf_processor.processing.pdf_processor import PDFProcessorConfig

# Configure the processor
config = PDFProcessorConfig(
    input_path="document.pdf",
    output_dir="./output",
    ocr_model="llava:7b",
    dpi=300,
    max_workers=4
)

# Process a document
processor = PDFProcessor(config)
result = processor.process_pdf("document.pdf")
print(f"Processed {result['pages_processed']} pages")
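
For batch work, the `max_workers` setting suggests a thread pool. The sketch below shows one way to fan documents out over workers with the standard library. It is not the package's own batch code: `process_batch` and `fake_process` are hypothetical names, and `fake_process` stands in for a real per-document call such as `processor.process_pdf`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def process_batch(paths, process_one, max_workers=4):
    """Run `process_one(path)` for each path on a thread pool.

    Returns (results, errors) dicts keyed by path, so one failing
    document does not abort the rest of the batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going; record the failure
                errors[path] = exc
    return results, errors


def fake_process(path):
    """Hypothetical stand-in for a real per-document processing call."""
    if Path(path).suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF: {path}")
    return {"pages_processed": 1}


results, errors = process_batch(["a.pdf", "b.pdf", "notes.txt"], fake_process)
```

Collecting errors per path instead of raising on the first failure lets a long batch finish and report which inputs need attention.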

⚙️ Configuration

Configuration File

Create a config.yaml file:

# config.yaml
input_path: ./documents    # Input file or directory
output_dir: ./output       # Output directory
ocr_model: llava:7b        # Ollama model to use
dpi: 300                   # Image resolution
max_workers: 4             # Number of worker threads
timeout: 300               # Timeout in seconds
max_retries: 3             # Max retry attempts
log_level: INFO            # Logging level
log_file: pdf_processor.log # Log file path

# Image enhancement strategies
enhancement_strategies:
  - original            # Keep original image
  - grayscale           # Convert to grayscale
  - adaptive_threshold  # Apply adaptive thresholding
  - contrast_stretch    # Stretch contrast
  - sharpen             # Sharpen image
  - denoise             # Remove noise
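
The `max_retries` and `timeout` keys imply that a failed OCR call is attempted again. A minimal sketch of retry with exponential backoff follows. It assumes `max_retries` counts attempts after the first one, which the configuration above does not state; `call_with_retries`, `base_delay`, and `flaky_ocr` are hypothetical names.

```python
import time


def call_with_retries(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `func()`; on failure retry up to `max_retries` more times.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))


calls = {"count": 0}


def flaky_ocr():
    """Hypothetical OCR call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("model did not respond")
    return "recognized text"


# Record the delays instead of sleeping so the example runs instantly.
delays = []
text = call_with_retries(flaky_ocr, max_retries=3, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the helper easy to test; in real use the default `time.sleep` applies.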

Environment Variables

export OLLAMA_HOST="http://localhost:11434"
export OLLAMA_MODEL="llava:7b"
export LOG_LEVEL="DEBUG"
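
As a sketch of how these variables might be consumed from Python, the helper below reads them with fallback values. The function name `load_env_settings` and the defaults are assumptions taken from the examples in this README, not the package's documented behavior.

```python
import os


def load_env_settings(environ=None):
    """Read the three variables shown above, with fallback defaults."""
    env = os.environ if environ is None else environ
    return {
        # Strip a trailing slash so URLs can be joined consistently.
        "host": env.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/"),
        "model": env.get("OLLAMA_MODEL", "llava:7b"),
        "log_level": env.get("LOG_LEVEL", "INFO").upper(),
    }


settings = load_env_settings(
    {"OLLAMA_HOST": "http://localhost:11434/", "LOG_LEVEL": "debug"}
)
```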

🚀 Advanced Usage

Processing Options

# Process with specific DPI
python -m pdf_processor --input document.pdf --output output/ --dpi 400

# Limit number of pages to process
python -m pdf_processor --input document.pdf --output output/ --max-pages 10

# Use a specific enhancement strategy
python -m pdf_processor --input document.pdf --output output/ --enhance grayscale

# Process in verbose mode
python -m pdf_processor --input document.pdf --output output/ --verbose
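
When scripting these commands from Python, it can help to assemble the argument list in one place. The sketch below uses only the flags shown in this README; `build_command` is a hypothetical helper, and whether every flag can be combined in a single run is an assumption.

```python
import sys


def build_command(input_path, output_dir, model=None, dpi=None,
                  max_pages=None, enhance=None, verbose=False):
    """Assemble an argv list for `python -m pdf_processor`."""
    cmd = [sys.executable, "-m", "pdf_processor",
           "--input", str(input_path), "--output", str(output_dir)]
    optional = [("--model", model), ("--dpi", dpi),
                ("--max-pages", max_pages), ("--enhance", enhance)]
    # Only emit a flag when a value was actually supplied.
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_command("document.pdf", "output/", dpi=400,
                    enhance="grayscale", verbose=True)
# To execute it: subprocess.run(cmd, check=True)
```

Building a list rather than a shell string avoids quoting problems with paths that contain spaces.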

Available Enhancement Strategies

  • original: Keep original image (fastest)
  • grayscale: Convert to grayscale (good for text-heavy documents)
  • adaptive_threshold: Apply adaptive thresholding (good for low-quality scans)
  • contrast_stretch: Stretch contrast to improve readability
  • sharpen: Apply sharpening filter
  • denoise: Remove image noise
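
To show what two of these strategies do at the pixel level, here is a dependency-free sketch of grayscale conversion and contrast stretching on nested lists. It is illustrative only and not the project's implementation, which may use an imaging library; the function names and the luma weights (the common ITU-R BT.601 coefficients) are assumptions.

```python
def to_grayscale(rgb_pixels):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luma values."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_pixels]


def contrast_stretch(gray_pixels):
    """Linearly rescale so the darkest value maps to 0 and the brightest to 255."""
    flat = [v for row in gray_pixels for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # uniform image: nothing to stretch
        return [row[:] for row in gray_pixels]
    scale = 255 / (hi - lo)
    return [[round((v - lo) * scale) for v in row] for row in gray_pixels]


# A tiny low-contrast "scan": all values sit between 100 and 150.
page = [[(100, 100, 100), (120, 120, 120)],
        [(140, 140, 140), (150, 150, 150)]]
gray = to_grayscale(page)
stretched = contrast_stretch(gray)
```

After stretching, the values span the full 0 to 255 range, which is what makes faint text easier to separate from the background.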

🛠️ Development

Project Structure

pdf_processor/
โ”œโ”€โ”€ __init__.py          # Package initialization
โ”œโ”€โ”€ cli.py               # Command-line interface
โ”œโ”€โ”€ config/              # Configuration files
โ”œโ”€โ”€ models/              # Data models
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ”œโ”€โ”€ ocr_result.py    # OCR result data structures
โ”‚   โ””โ”€โ”€ retry_config.py  # Retry configuration
โ”œโ”€โ”€ processing/          # Core processing modules
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ”œโ”€โ”€ image_enhancement.py  # Image processing
โ”‚   โ”œโ”€โ”€ ocr_processor.py      # OCR processing
โ”‚   โ”œโ”€โ”€ pdf_processor.py      # Main PDF processing
โ”‚   โ””โ”€โ”€ svg_generator.py      # SVG output generation
โ””โ”€โ”€ utils/               # Utility functions
    โ”œโ”€โ”€ file_utils.py    # File operations
    โ”œโ”€โ”€ logging_utils.py # Logging configuration
    โ””โ”€โ”€ validation_utils.py # Input validation

Running Tests

# Install test dependencies
pip install -r requirements-dev.txt

# Run all tests
pytest

# Run tests with coverage report
pytest --cov=pdf_processor --cov-report=html

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🙏 Acknowledgments

  • The Ollama team for their amazing AI models
  • The PyMuPDF team for excellent PDF processing
  • All contributors who have helped improve this project

🛠️ Development Workflow

This project uses a script-based workflow for development tasks. All scripts are located in the scripts/ directory and can be run directly or via the Makefile.

Setup

  1. Clone the repository and navigate to the project directory:

    git clone https://github.com/wronai/ocr.git
    cd ocr
    
  2. Set up the development environment:

    make install-dev
    

    This will:

    • Create and activate a virtual environment
    • Install all development dependencies
    • Set up pre-commit hooks

Common Development Tasks

# Run tests
make test

# Run tests with coverage
make test-cov

# Format code
make format

# Run linters
make lint

# Start development server
make dev-server

# Build documentation
make docs
make docs-serve  # Serve docs locally

Scripts Directory

All development and build scripts are located in the scripts/ directory. See scripts/README.md for detailed documentation of each script.

Docker Development

# Build Docker image
make docker-build

# Start services with Docker Compose
make docker-run

# Stop services
make docker-stop

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Please ensure your code follows our coding standards and includes appropriate tests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📜 Changelog

See CHANGELOG.md for a list of changes in each version.

python proc.py --model llava:7b --workers 4
  1. View Results
    • Open output/*_complete.svg in your browser
    • Check details in output/processing_report.json

📚 Documentation

Full documentation is available in the docs/ directory.

Download files

Source Distribution

pdf_ocr_processor-2.0.3.tar.gz (64.2 kB)

Uploaded Source

Built Distribution

pdf_ocr_processor-2.0.3-py3-none-any.whl (78.3 kB)

Uploaded Python 3

File details

Details for the file pdf_ocr_processor-2.0.3.tar.gz.

File metadata

  • Download URL: pdf_ocr_processor-2.0.3.tar.gz
  • Upload date:
  • Size: 64.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.3 CPython/3.13.3 Linux/6.14.0-22-generic

File hashes

Hashes for pdf_ocr_processor-2.0.3.tar.gz
Algorithm Hash digest
SHA256 594bf07c1b5d7e9d09654ba056da00749015e5e3d6d451cc5a04d7111036aedc
MD5 fdc679db5c2132437b501bcc09d6b0ef
BLAKE2b-256 e3015cdca9c73405e0f1c4df60ffccfd511b885c4d628e05d22ebe11f491880f
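
A downloaded archive can be checked against the SHA256 value listed above with the standard library. This is a minimal sketch; `sha256_of` and `matches` are hypothetical helper names, and the file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not load into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's digest equals the published value."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("pdf_ocr_processor-2.0.3.tar.gz", "<SHA256 from the table>")` should return `True` for an intact download.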

File details

Details for the file pdf_ocr_processor-2.0.3-py3-none-any.whl.

File metadata

  • Download URL: pdf_ocr_processor-2.0.3-py3-none-any.whl
  • Upload date:
  • Size: 78.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.3 CPython/3.13.3 Linux/6.14.0-22-generic

File hashes

Hashes for pdf_ocr_processor-2.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 0a45da53ada6c1b39d7b20b01a8d6dd4449162a9eca3b01cec9648ca902eafaa
MD5 68f18a73bed371d5bf9d1fe4b1a92ae5
BLAKE2b-256 93de567cb554d08aee52f18155bbb7da949026bf806c2e895c155ca3681221b5
