Advanced PDF OCR processing with AI-powered text extraction and selectable text overlays

PDF OCR Processor

Advanced PDF processing with AI-powered OCR, text extraction, and selectable text overlays using Ollama models

🚀 Features

- AI-Powered OCR using Ollama models (llava, moondream, etc.)
- Modular Architecture with clear separation of concerns
- Multiple Output Formats:
  - SVG with selectable text overlays
  - Raw text extraction
  - JSON metadata
- Image Enhancement with multiple strategies
- Robust Error Handling with configurable retries
- Parallel Processing for batch operations
- CLI Interface with progress tracking
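
To make "configurable retries" concrete, here is a minimal sketch of a retry loop with exponential backoff. It is not taken from the package: `RetryConfig`, its field names, and `call_with_retries` are hypothetical and only illustrate the general technique.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass
class RetryConfig:
    """Hypothetical retry settings; the field names are illustrative."""
    max_retries: int = 3         # retries allowed after the first attempt
    base_delay: float = 0.5      # seconds to wait before the first retry
    backoff_factor: float = 2.0  # delay multiplier applied after each failure


def call_with_retries(func: Callable[[], T], config: RetryConfig,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying on any Exception up to config.max_retries times."""
    delay = config.base_delay
    for attempt in range(config.max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == config.max_retries:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= config.backoff_factor
    raise ValueError("max_retries must be non-negative")
```

The `sleep` parameter exists so the waiting can be swapped out in tests; in normal use the default `time.sleep` applies.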

🛠️ System Architecture

┌─────────────────────────────────────────────────┐
│               PDF OCR Processor                 │
├─────────────────┬───────────────────────────────┤
│  ┌────────────┐ │  ┌─────────────────────────┐  │
│  │ PDF        │ │  │      OCRProcessor       │  │
│  │ Processor  ├─┼─▶│  - Text extraction      │  │
│  └────────────┘ │  │  - Ollama integration   │  │
│                 │  └─────────────┬───────────┘  │
│  ┌────────────┐ │  ┌─────────────▼───────────┐  │
│  │ Image      │ │  │      SVG Generator      │  │
│  │ Enhancer   ├─┼─▶│  - Text overlay         │  │
│  └────────────┘ │  │  - Searchable output    │  │
└─────────────────┴───────────────────────────────┘
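
The SVG Generator stage in the diagram can be pictured with the sketch below: the page image is drawn first and the recognized text is placed on top with zero fill opacity, so it stays selectable and searchable without being visible. The function name, the `(text, x, y, font_size)` tuple layout, and the use of `fill-opacity` are assumptions for illustration, not the project's actual implementation.

```python
from typing import List, Tuple
from xml.sax.saxutils import escape, quoteattr

# (text, x, y, font_size) in pixel coordinates of the page image (assumed layout)
TextBlock = Tuple[str, float, float, float]


def build_svg_overlay(image_href: str, width: int, height: int,
                      blocks: List[TextBlock]) -> str:
    """Return an SVG showing the page image with invisible, selectable text on top."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}" viewBox="0 0 {width} {height}">',
        f'  <image href={quoteattr(image_href)} width="{width}" height="{height}"/>',
    ]
    for text, x, y, size in blocks:
        # fill-opacity="0" hides the glyphs while leaving them in the text layer
        parts.append(
            f'  <text x="{x}" y="{y}" font-size="{size}" fill-opacity="0">'
            f'{escape(text)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)
```

Escaping the OCR text matters here because recognized text can contain `&` or `<`, which would otherwise produce invalid XML.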

📦 Installation

Prerequisites

Python 3.8+
Ollama (for OCR processing)

System dependencies:

# Ubuntu/Debian
sudo apt-get install -y tesseract-ocr poppler-utils

# macOS
brew install tesseract poppler

Install from source

# Clone the repository
git clone https://github.com/wronai/ocr.git
cd ocr

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # Linux/macOS

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt  # For development

🏁 Quick Start

Basic Usage

# Process a single PDF
python -m pdf_processor --input document.pdf --output output/

# Process all PDFs in a directory
python -m pdf_processor --input ./documents --output ./output --model llava:7b

# Show help
python -m pdf_processor --help

Python API

from pdf_processor import PDFProcessor
from pdf_processor.processing.pdf_processor import PDFProcessorConfig

# Configure the processor
config = PDFProcessorConfig(
    input_path="document.pdf",
    output_dir="./output",
    ocr_model="llava:7b",
    dpi=300,
    max_workers=4
)

# Process a document
processor = PDFProcessor(config)
result = processor.process_pdf("document.pdf")
print(f"Processed {result['pages_processed']} pages")
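
For several documents, the `max_workers` setting suggests a worker pool. The following is a rough sketch of how a batch could be driven from calling code with the standard library; `process_batch` is a hypothetical helper and `process_one` stands in for whatever per-file function is used.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Iterable


def process_batch(paths: Iterable[Path], process_one: Callable[[Path], dict],
                  max_workers: int = 4) -> Dict[str, dict]:
    """Run process_one over every path in a thread pool and collect the results."""
    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[str(path)] = future.result()
            except Exception as exc:
                # Record the failure instead of aborting the whole batch
                results[str(path)] = {"error": str(exc)}
    return results
```

With the processor from the example above, a call might look like `process_batch(Path("./documents").glob("*.pdf"), lambda p: processor.process_pdf(str(p)))`, assuming `process_pdf` is safe to call from multiple threads.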

⚙️ Configuration

Configuration File

Create a config.yaml file:

# config.yaml
input_path: ./documents    # Input file or directory
output_dir: ./output       # Output directory
ocr_model: llava:7b        # Ollama model to use
dpi: 300                   # Image resolution
max_workers: 4             # Number of worker threads
timeout: 300               # Timeout in seconds
max_retries: 3             # Max retry attempts
log_level: INFO            # Logging level
log_file: pdf_processor.log # Log file path

# Image enhancement strategies
enhancement_strategies:
  - original            # Keep original image
  - grayscale           # Convert to grayscale
  - adaptive_threshold  # Apply adaptive thresholding
  - contrast_stretch    # Stretch contrast
  - sharpen             # Sharpen image
  - denoise             # Remove noise
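
As an illustration of how these keys might be validated once the YAML has been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`), here is a small sketch. `Settings` and `settings_from_dict` are hypothetical names, the defaults simply copy the sample file above, and the default strategy list is an assumption.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List

KNOWN_STRATEGIES = {"original", "grayscale", "adaptive_threshold",
                    "contrast_stretch", "sharpen", "denoise"}


@dataclass
class Settings:
    """Illustrative mirror of the config.yaml keys; not the package's own class."""
    input_path: str = "./documents"
    output_dir: str = "./output"
    ocr_model: str = "llava:7b"
    dpi: int = 300
    max_workers: int = 4
    timeout: int = 300
    max_retries: int = 3
    log_level: str = "INFO"
    log_file: str = "pdf_processor.log"
    enhancement_strategies: List[str] = field(default_factory=lambda: ["original"])


def settings_from_dict(raw: Dict[str, Any]) -> Settings:
    """Build Settings from a parsed mapping, rejecting unknown keys and values."""
    allowed = {f.name for f in fields(Settings)}
    unknown = set(raw) - allowed
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    settings = Settings(**raw)
    bad = set(settings.enhancement_strategies) - KNOWN_STRATEGIES
    if bad:
        raise ValueError(f"Unknown enhancement strategies: {sorted(bad)}")
    if settings.dpi <= 0 or settings.max_workers <= 0:
        raise ValueError("dpi and max_workers must be positive")
    return settings
```

Rejecting unknown keys early catches typos such as `dpis` that would otherwise be silently ignored.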

Environment Variables

export OLLAMA_HOST="http://localhost:11434"
export OLLAMA_MODEL="llava:7b"
export LOG_LEVEL="DEBUG"
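
A minimal sketch of reading these variables from Python follows. The helper name and the fallback values are assumptions; only the variable names come from the list above.

```python
import os
from typing import Mapping, Optional


def read_env_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the three variables, falling back to illustrative defaults."""
    env = os.environ if env is None else env
    return {
        # Strip a trailing slash so URLs can be joined consistently
        "ollama_host": env.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/"),
        "ollama_model": env.get("OLLAMA_MODEL", "llava:7b"),
        "log_level": env.get("LOG_LEVEL", "INFO").upper(),
    }
```

Passing a plain dictionary as `env` makes the function easy to exercise without touching the real process environment.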

🚀 Advanced Usage

Processing Options

# Process with specific DPI
python -m pdf_processor --input document.pdf --output output/ --dpi 400

# Limit number of pages to process
python -m pdf_processor --input document.pdf --output output/ --max-pages 10

# Use a specific enhancement strategy
python -m pdf_processor --input document.pdf --output output/ --enhance grayscale

# Process in verbose mode
python -m pdf_processor --input document.pdf --output output/ --verbose
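
For readers who want to see how such flags map onto parsed values, the sketch below rebuilds the options shown in this README with `argparse`. It is a reconstruction for illustration only: the real CLI in `cli.py` may define these options differently, and the defaults here are assumptions.

```python
import argparse
from typing import List, Optional

STRATEGIES = ["original", "grayscale", "adaptive_threshold",
              "contrast_stretch", "sharpen", "denoise"]


def build_parser() -> argparse.ArgumentParser:
    """Create a parser that accepts the flags used in the examples above."""
    parser = argparse.ArgumentParser(prog="pdf_processor")
    parser.add_argument("--input", required=True, help="PDF file or directory")
    parser.add_argument("--output", required=True, help="Output directory")
    parser.add_argument("--model", default="llava:7b", help="Ollama model name")
    parser.add_argument("--dpi", type=int, default=300, help="Render resolution")
    parser.add_argument("--max-pages", type=int, default=None, help="Page limit")
    parser.add_argument("--enhance", choices=STRATEGIES, default="original")
    parser.add_argument("--verbose", action="store_true")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv (or sys.argv[1:] when argv is None)."""
    return build_parser().parse_args(argv)
```

Note that `argparse` exposes `--max-pages` as the attribute `max_pages` on the returned namespace.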

Available Enhancement Strategies

- original: Keep original image (fastest)
- grayscale: Convert to grayscale (good for text-heavy documents)
- adaptive_threshold: Apply adaptive thresholding (good for low-quality scans)
- contrast_stretch: Stretch contrast to improve readability
- sharpen: Apply sharpening filter
- denoise: Remove image noise
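
To give a feel for what two of these strategies do, here is a dependency-free sketch of grayscale conversion and contrast stretching on nested lists of pixels. It shows the general techniques only; the package's own implementation is not shown in this README and most likely relies on an imaging library instead.

```python
from typing import List, Tuple

RGBImage = List[List[Tuple[int, int, int]]]  # rows of (r, g, b) pixels, 0-255
GrayImage = List[List[int]]                  # rows of intensities, 0-255


def to_grayscale(image: RGBImage) -> GrayImage:
    """Convert RGB to luminance using the ITU-R BT.601 weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in image]


def contrast_stretch(image: GrayImage) -> GrayImage:
    """Linearly rescale so the darkest pixel maps to 0 and the brightest to 255."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    span = hi - lo
    if span == 0:
        return [row[:] for row in image]  # flat image: nothing to stretch
    # Integer arithmetic with round-half-up keeps the result deterministic
    return [[((v - lo) * 255 + span // 2) // span for v in row] for row in image]
```

A faint scan whose intensities only span 50 to 150, for instance, is spread across the full 0 to 255 range, which can make text edges easier for the OCR model to pick up.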

🛠️ Development

Project Structure

pdf_processor/
├── __init__.py          # Package initialization
├── cli.py               # Command-line interface
├── config/              # Configuration files
├── models/              # Data models
│   ├── __init__.py
│   ├── ocr_result.py    # OCR result data structures
│   └── retry_config.py  # Retry configuration
├── processing/          # Core processing modules
│   ├── __init__.py
│   ├── image_enhancement.py  # Image processing
│   ├── ocr_processor.py      # OCR processing
│   ├── pdf_processor.py      # Main PDF processing
│   └── svg_generator.py      # SVG output generation
└── utils/               # Utility functions
    ├── file_utils.py    # File operations
    ├── logging_utils.py # Logging configuration
    └── validation_utils.py # Input validation

Running Tests

# Install test dependencies
pip install -r requirements-dev.txt

# Run all tests
pytest

# Run tests with coverage report
pytest --cov=pdf_processor --cov-report=html

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🙏 Acknowledgments

The Ollama team for their amazing AI models
The PyMuPDF team for excellent PDF processing
All contributors who have helped improve this project

🛠️ Development Workflow

This project uses a script-based workflow for development tasks. All scripts are located in the scripts/ directory and can be run directly or via the Makefile.

Setup

Clone the repository and navigate to the project directory:
```
git clone https://github.com/wronai/ocr.git
cd ocr
```
Set up the development environment:
```
make install-dev
```
This will:
- Create and activate a virtual environment
- Install all development dependencies
- Set up pre-commit hooks

Common Development Tasks

# Run tests
make test

# Run tests with coverage
make test-cov

# Format code
make format

# Run linters
make lint

# Start development server
make dev-server

# Build documentation
make docs
make docs-serve  # Serve docs locally

Scripts Directory

All development and build scripts are located in the scripts/ directory. See scripts/README.md for detailed documentation of each script.

Docker Development

# Build Docker image
make docker-build

# Start services with Docker Compose
make docker-run

# Stop services
make docker-stop

🤝 Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Please ensure your code follows our coding standards and includes appropriate tests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📜 Changelog

See CHANGELOG.md for a list of changes in each version.

python proc.py --model llava:7b --workers 4

View Results
- Open output/*_complete.svg in your browser
- Check details in output/processing_report.json
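
The two result locations above can also be inspected from a script. The sketch below only relies on the file names mentioned here; `collect_results` is a hypothetical helper, and nothing is assumed about the structure of the JSON report.

```python
import json
from pathlib import Path
from typing import Any, List, Optional, Tuple


def collect_results(output_dir: str) -> Tuple[List[Path], Optional[Any]]:
    """Return the combined SVG files and the parsed report, if one exists."""
    out = Path(output_dir)
    svgs = sorted(out.glob("*_complete.svg"))
    report_path = out / "processing_report.json"
    report = None
    if report_path.is_file():
        with report_path.open(encoding="utf-8") as fh:
            report = json.load(fh)
    return svgs, report
```

Calling `collect_results("output")` after a run would then give the SVG paths to open in a browser and the report contents to examine.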

📚 Documentation

Full documentation is available in the docs/ directory.

This version: 2.0.3 (Jul 11, 2025)
