
OCRRouter

A powerful Python library for converting PDFs and images to Markdown using multiple expert VLM backends

What is OCRRouter?

OCRRouter is a production-ready document processing library that converts PDFs and images to high-quality Markdown. It stands out with:

  • 6 Expert VLM Backends — Choose from MinerU, DeepSeek-OCR, DotsOCR, PaddleOCR, Hunyuan-OCR, or GeneralVLM (GPT/Claude/Gemini)
  • Composite Mode — Mix layout detection from one model with OCR from another for optimal results (unique feature!)
  • Rich Document Support — Tables, formulas, images, code blocks, lists, and complex layouts
  • Flexible APIs — Sync/async, single/batch processing, multiple output formats
  • Production Ready — Built-in observability (Langfuse), retries, error handling, debug mode

Quick Start

Installation

pip install ocrrouter

30-Second Example

from ocrrouter import process_document

# One-liner document conversion
result = process_document(
    "document.pdf",
    "output/",
    backend="deepseek",
    openai_api_key="your-api-key"
)

print(result["markdown"])

Basic Usage

from ocrrouter import DocumentPipeline, Settings

# Configure pipeline
settings = Settings(
    backend="deepseek",
    openai_base_url="https://api.example.com/v1",
    openai_api_key="your-api-key",
    output_mode="all"  # layout + OCR
)

# Process document
pipeline = DocumentPipeline(settings=settings)
result = pipeline.process("document.pdf", "output/")

# Access results
print(f"Markdown: {result['markdown'][:100]}...")
print(f"Output directory: {result['output_dir']}")

Async Processing

# Async processing for better performance
result = await pipeline.aio_process("document.pdf", "output/")

# Batch processing with concurrency control
results = await pipeline.aio_process_batch(
    ["doc1.pdf", "doc2.pdf", "doc3.pdf"],
    "output/",
    session_id="batch-001"
)

Key Features

1. Multiple Expert Backends

Each backend is optimized for different document types:

| Backend    | Layout | OCR | Best For                                      |
|------------|:------:|:---:|-----------------------------------------------|
| MinerU     | ✓      | ✓   | Academic papers, complex layouts, formulas    |
| DeepSeek   | ✓      | ✓   | General documents, efficiency, grounding mode |
| DotsOCR    | ✓      | ✓   | Flexible extraction (one-step or two-step)    |
| PaddleOCR  |        | ✓   | Fast OCR, multilingual support                |
| Hunyuan    |        | ✓   | Markdown-optimized output                     |
| GeneralVLM |        | ✓   | GPT-4V, Claude, Gemini, custom VLMs           |

2. Composite Mode (Mix & Match)

Combine the strengths of different models:

settings = Settings(
    backend="composite",
    layout_model="mineru",      # Best layout detection
    ocr_model="paddleocr",      # Fast OCR extraction
)

Why use composite mode?

  • Optimize for cost vs quality
  • Leverage each model's strengths
  • Example: MinerU's excellent layout + PaddleOCR's speed
  • 2-3x faster than single-model approaches in many cases

3. Three Output Modes

Control processing behavior:

# Full layout + OCR (default)
Settings(output_mode="all")

# Layout detection only
Settings(output_mode="layout_only")

# Direct OCR without layout analysis
Settings(output_mode="ocr_only")

4. Rich Output Formats

Multiple output files for different use cases:

  • Markdown (.md) — Human-readable converted text
  • Layout PDF (_layout.pdf) — Visual layout with bounding boxes
  • Model JSON (_model.json) — Raw model output
  • Middle JSON (_middle.json) — Processed structural data
  • Content List (_content_list.json) — Simplified flat structure
  • Images — Extracted figures, tables, equations
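The flat content list is the handiest format for programmatic post-processing. A minimal sketch of filtering it for text blocks (the `type`/`text` keys here are illustrative assumptions, not a documented schema; adjust to the JSON your backend actually emits):

```python
def extract_text_blocks(content_list):
    """Collect the text of every text-type block in a content list.

    The "type"/"text" keys are assumptions for illustration --
    check them against a real *_content_list.json before relying on this.
    """
    return [
        item["text"]
        for item in content_list
        if item.get("type") == "text" and "text" in item
    ]

# Inline sample standing in for a parsed *_content_list.json:
sample = [
    {"type": "text", "text": "Introduction"},
    {"type": "table", "html": "<table>...</table>"},
    {"type": "text", "text": "Results"},
]
print(extract_text_blocks(sample))  # ['Introduction', 'Results']
```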

Use Cases

Academic Research

Extract formulas, citations, and complex layouts from research papers and theses:

settings = Settings(
    backend="mineru",
    formula_enable=True,
    table_merge_enable=True  # Cross-page table merging
)

Business Documents

Parse invoices, contracts, and forms with table extraction:

settings = Settings(
    backend="deepseek",
    table_enable=True,
    output_mode="all"
)

Document Digitization

Batch process archives with multilingual support:

settings = Settings(
    backend="composite",
    layout_model="deepseek",
    ocr_model="paddleocr",  # Strong multilingual support
    max_concurrency=10
)

AI/ML Pipelines

Extract structured data for RAG or training:

settings = Settings(
    backend="deepseek",
    dump_content_list=True,  # Simplified JSON for ML
    dump_middle_json=True     # Structured data
)
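For RAG ingestion, the resulting Markdown usually needs to be chunked before embedding. A minimal sketch using plain paragraph packing (real pipelines often split by headings or token count instead):

```python
def chunk_markdown(markdown: str, max_chars: int = 800) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars.

    A paragraph longer than max_chars becomes its own chunk.
    """
    chunks, current = [], ""
    for para in markdown.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "# Title\n\nFirst paragraph of the converted document.\n\nSecond paragraph."
print(chunk_markdown(doc, max_chars=60))
```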

Backend Selection Guide

How to Choose?

Need layout detection + OCR?

  • Academic/Scientific → MinerU (best formula extraction)
  • General documents → DeepSeek (efficient grounding mode)
  • Flexible extraction → DotsOCR (one-step or two-step)

Need OCR only?

  • Fast processing → PaddleOCR
  • Markdown-focused → Hunyuan
  • Use GPT-4/Claude → GeneralVLM

Want to optimize cost/speed?

  • Use Composite Mode: strong layout + fast OCR

See the Backend Guide for a detailed comparison.
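The decision points above can be sketched as a small helper. This function is a hypothetical convenience mirroring the guide, not part of the OCRRouter API:

```python
def suggest_backend(needs_layout: bool, priority: str = "general") -> str:
    """Map the selection guide to a backend name (hypothetical helper)."""
    if needs_layout:
        if priority == "academic":
            return "mineru"      # best formula extraction
        if priority == "flexible":
            return "dotsocr"     # one-step or two-step extraction
        return "deepseek"        # efficient general-purpose default
    if priority == "speed":
        return "paddleocr"       # fast OCR, multilingual
    if priority == "markdown":
        return "hunyuan"         # Markdown-optimized output
    return "generalvlm"          # GPT-4V / Claude / Gemini

print(suggest_backend(needs_layout=True, priority="academic"))  # mineru
```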

Documentation

Configuration

OCRRouter uses explicit configuration (no automatic .env loading):

from ocrrouter import Settings

# Method 1: Settings object
settings = Settings(
    backend="deepseek",
    openai_api_key="your-key",
    max_concurrency=20,
    http_timeout=120,
    max_retries=3
)

# Method 2: Constructor arguments
pipeline = DocumentPipeline(
    backend="deepseek",
    openai_api_key="your-key"
)

# Method 3: Settings with overrides
pipeline = DocumentPipeline(
    settings=settings,
    max_concurrency=50  # Override
)

See the Configuration Guide for all available settings.

Advanced Features

Observability with Langfuse

from langfuse import Langfuse
from ocrrouter import DocumentPipeline, Settings

langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com"
)

settings = Settings(backend="deepseek", openai_api_key="your-key")
pipeline = DocumentPipeline(settings=settings, langfuse=langfuse)

# Traces appear in Langfuse dashboard
result = await pipeline.aio_process("document.pdf", "output/")

Error Handling & Debug Mode

settings = Settings(
    backend="deepseek",
    max_retries=5,
    debug=True,           # Save failed requests
    debug_dir="./debug"   # Debug output location
)
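The built-in retries cover transient backend errors; anything that still surfaces to the caller can be handled with a generic wrapper. A standard-library-only sketch (the exception types raised will depend on the backend, so this catches broadly):

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call fn(), retrying with exponential backoff on any exception.

    Generic sketch -- narrow the except clause to the errors your
    backend actually raises in production code.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Usage sketch (pipeline configured as above):
# result = with_retries(lambda: pipeline.process("document.pdf", "output/"))
```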

Direct Backend Access

from ocrrouter import get_backend, Settings

settings = Settings(openai_api_key="your-key")
backend = get_backend("mineru", settings=settings)

# Advanced control
middle_json, model_output = await backend.analyze(pdf_bytes, image_writer)

Examples

See docs/EXAMPLES.md for comprehensive examples including:

  • Basic document processing
  • Batch processing with concurrency
  • Composite mode configurations
  • FastAPI integration
  • Custom pipelines
  • Use case-specific recipes

Or check out the demo scripts in demo/:

  • demo/quickstart.py — Minimal example
  • demo/composite_mode.py — Composite mode showcase
  • demo/demo.py — Comprehensive demo

Requirements

  • Python 3.10, 3.11, 3.12, or 3.13
  • VLM server access (for backends requiring API calls)
  • See pyproject.toml for full dependency list

Installation

# From PyPI
pip install ocrrouter

# From source
git clone https://github.com/yourusername/ocrrouter.git
cd ocrrouter
pip install -e .

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

License

This project is licensed under the AGPL-3.0 license; see the LICENSE file for details.

Built with ❤️ for document processing needs
