A practical tool for converting PDF to Markdown
OCRRouter
A powerful Python library for converting PDFs and images to Markdown using multiple expert VLM backends
What is OCRRouter?
OCRRouter is a production-ready document processing library that converts PDFs and images to high-quality Markdown. It stands out with:
- 6 Expert VLM Backends — Choose from MinerU, DeepSeek-OCR, DotsOCR, PaddleOCR, Hunyuan-OCR, or GeneralVLM (GPT/Claude/Gemini)
- Composite Mode — Mix layout detection from one model with OCR from another for optimal results (unique feature!)
- Rich Document Support — Tables, formulas, images, code blocks, lists, and complex layouts
- Flexible APIs — Sync/async, single/batch processing, multiple output formats
- Production Ready — Built-in observability (Langfuse), retries, error handling, debug mode
Quick Start
Installation
pip install ocrrouter
30-Second Example
from ocrrouter import process_document
# One-liner document conversion
result = process_document(
"document.pdf",
"output/",
backend="deepseek",
openai_api_key="your-api-key"
)
print(result["markdown"])
Basic Usage
from ocrrouter import DocumentPipeline, Settings
# Configure pipeline
settings = Settings(
backend="deepseek",
openai_base_url="https://api.example.com/v1",
openai_api_key="your-api-key",
output_mode="all" # layout + OCR
)
# Process document
pipeline = DocumentPipeline(settings=settings)
result = pipeline.process("document.pdf", "output/")
# Access results
print(f"Markdown: {result['markdown'][:100]}...")
print(f"Output directory: {result['output_dir']}")
Async Processing
import asyncio

async def main():
    # Reuses the `pipeline` configured under Basic Usage
    # Async processing for better performance
    result = await pipeline.aio_process("document.pdf", "output/")
    # Batch processing with concurrency control
    results = await pipeline.aio_process_batch(
        ["doc1.pdf", "doc2.pdf", "doc3.pdf"],
        "output/",
        session_id="batch-001"
    )

asyncio.run(main())
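To illustrate what concurrency control means for a batch, here is a minimal, library-independent sketch using `asyncio.Semaphore`. It is not OCRRouter's internal implementation; `process_all` and `fake_convert` are hypothetical names, and `fake_convert` merely stands in for a real conversion call.

```python
import asyncio


async def process_all(paths, worker, limit=3):
    """Run worker(path) for every path, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(path):
        async with semaphore:
            return await worker(path)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(p) for p in paths))


# Stand-in for a real conversion call; it records the peak concurrency.
active = 0
peak = 0


async def fake_convert(path):
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)  # simulate network latency
    active -= 1
    return f"# {path}"


results = asyncio.run(
    process_all(["doc1.pdf", "doc2.pdf", "doc3.pdf", "doc4.pdf"], fake_convert, limit=2)
)
```

A semaphore keeps the number of in-flight requests bounded regardless of batch size, which is usually what a VLM server with rate limits needs.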
Key Features
1. Multiple Expert Backends
Each backend is optimized for different document types:
| Backend | Layout | OCR | Best For |
|---|---|---|---|
| MinerU | ✓ | ✓ | Academic papers, complex layouts, formulas |
| DeepSeek | ✓ | ✓ | General documents, efficiency, grounding mode |
| DotsOCR | ✓ | ✓ | Flexible extraction (one-step or two-step) |
| PaddleOCR | — | ✓ | Fast OCR, multilingual support |
| Hunyuan | — | ✓ | Markdown-optimized output |
| GeneralVLM | — | ✓ | GPT-4V, Claude, Gemini, custom VLMs |
2. Composite Mode (Mix & Match)
Combine the strengths of different models:
settings = Settings(
backend="composite",
layout_model="mineru", # Best layout detection
ocr_model="paddleocr", # Fast OCR extraction
)
Why use composite mode?
- Optimize for cost vs quality
- Leverage each model's strengths
- Example: MinerU's excellent layout + PaddleOCR's speed
- Often faster than a single-model pipeline (the project cites 2-3x in many cases; measure on your own documents)
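The idea behind composite mode can be sketched without the library: one callable finds regions on a page, another reads each region, and the results are joined in reading order. Everything below (`Region`, `composite_page`, the two stand-in functions) is hypothetical and only illustrates the division of labour, not how OCRRouter implements it.

```python
from dataclasses import dataclass


@dataclass
class Region:
    kind: str                          # e.g. "title", "text", "table"
    bbox: tuple[int, int, int, int]    # (x0, y0, x1, y1) in page pixels


def composite_page(page, detect_layout, recognize):
    """Detect regions with one model, then read each region with another."""
    regions = detect_layout(page)
    # Simple reading order: top-to-bottom, then left-to-right
    regions = sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0]))
    return "\n\n".join(recognize(page, r) for r in regions)


# Stand-ins for a layout model and an OCR model
def fake_layout(page):
    return [
        Region("text", (0, 100, 500, 200)),
        Region("title", (0, 0, 500, 50)),
    ]


def fake_ocr(page, region):
    return "# Title" if region.kind == "title" else "Body text."


markdown = composite_page(object(), fake_layout, fake_ocr)
print(markdown)
```

Because the two stages only share the region list, either model can be swapped independently, which is what makes the cost/quality trade-off tunable.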
3. Three Output Modes
Control processing behavior:
# Full layout + OCR (default)
Settings(output_mode="all")
# Layout detection only
Settings(output_mode="layout_only")
# Direct OCR without layout analysis
Settings(output_mode="ocr_only")
4. Rich Output Formats
Multiple output files for different use cases:
- Markdown (.md): Human-readable converted text
- Layout PDF (_layout.pdf): Visual layout with bounding boxes
- Model JSON (_model.json): Raw model output
- Middle JSON (_middle.json): Processed structural data
- Content List (_content_list.json): Simplified flat structure
- Images: Extracted figures, tables, equations
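As a rough guide to what to look for in the output directory, the sketch below derives candidate file names from the suffixes listed above. It assumes each file is named after the input's stem plus the suffix, which this README does not confirm; `expected_outputs` is a hypothetical helper, not part of the library.

```python
from pathlib import Path

# Suffixes taken from the list above; the stem-plus-suffix naming is an assumption.
OUTPUT_SUFFIXES = {
    "markdown": ".md",
    "layout_pdf": "_layout.pdf",
    "model_json": "_model.json",
    "middle_json": "_middle.json",
    "content_list": "_content_list.json",
}


def expected_outputs(source, output_dir):
    """Map each output kind to the path it would have under the assumed naming."""
    stem = Path(source).stem
    return {
        kind: Path(output_dir) / f"{stem}{suffix}"
        for kind, suffix in OUTPUT_SUFFIXES.items()
    }


paths = expected_outputs("document.pdf", "output")
for kind, path in paths.items():
    print(f"{kind}: {path}")
```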
Use Cases
Academic Research
Extract formulas, citations, and complex layouts from research papers and theses:
settings = Settings(
backend="mineru",
formula_enable=True,
table_merge_enable=True # Cross-page table merging
)
Business Documents
Parse invoices, contracts, and forms with table extraction:
settings = Settings(
backend="deepseek",
table_enable=True,
output_mode="all"
)
Document Digitization
Batch process archives with multilingual support:
settings = Settings(
backend="composite",
layout_model="deepseek",
ocr_model="paddleocr", # Strong multilingual support
max_concurrency=10
)
AI/ML Pipelines
Extract structured data for RAG or training:
settings = Settings(
backend="deepseek",
dump_content_list=True, # Simplified JSON for ML
dump_middle_json=True # Structured data
)
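For a RAG pipeline, the flat content list is typically turned into text chunks. The sketch below assumes the content list is a JSON array of objects with `type` and `text` keys; that schema is an assumption, so check a real `_content_list.json` before relying on it. `text_chunks` is a hypothetical helper.

```python
import json


def text_chunks(content_list, max_chars=200):
    """Group consecutive text entries into chunks of roughly max_chars.

    A single entry longer than max_chars is kept whole rather than split.
    """
    chunks, current = [], ""
    for entry in content_list:
        if entry.get("type") != "text":   # skip images, tables, etc.
            continue
        text = entry.get("text", "").strip()
        if not text:
            continue
        if current and len(current) + 1 + len(text) > max_chars:
            chunks.append(current)
            current = text
        else:
            current = f"{current} {text}".strip()
    if current:
        chunks.append(current)
    return chunks


# In practice this would come from json.load() on the content list file.
sample = json.loads(
    '[{"type": "text", "text": "Intro."},'
    ' {"type": "image"},'
    ' {"type": "text", "text": "Details follow."}]'
)
chunks = text_chunks(sample, max_chars=10)
print(chunks)
```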
Backend Selection Guide
How to Choose?
Need layout detection + OCR?
- Academic/Scientific → MinerU (best formula extraction)
- General documents → DeepSeek (efficient grounding mode)
- Flexible extraction → DotsOCR (one-step or two-step)
Need OCR only?
- Fast processing → PaddleOCR
- Markdown-focused → Hunyuan
- Use GPT-4/Claude → GeneralVLM
Want to optimize cost/speed?
- Use Composite Mode: strong layout + fast OCR
See Backend Guide for detailed comparison.
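The selection guide above can be written down as a small lookup, which is handy when the choice is made per document in code. This is only a restatement of the guide; `choose_backend` and its `priority` labels are hypothetical, and the function returns the display names from the table rather than configuration identifiers.

```python
def choose_backend(needs_layout, priority):
    """Return the backend name the guide suggests for a given need."""
    with_layout = {
        "academic": "MinerU",      # best formula extraction
        "general": "DeepSeek",     # efficient grounding mode
        "flexible": "DotsOCR",     # one-step or two-step
    }
    ocr_only = {
        "fast": "PaddleOCR",
        "markdown": "Hunyuan",
        "hosted_vlm": "GeneralVLM",  # GPT-4 / Claude
    }
    table = with_layout if needs_layout else ocr_only
    if priority not in table:
        raise ValueError(f"unknown priority {priority!r}; expected one of {sorted(table)}")
    return table[priority]


print(choose_backend(True, "academic"))
print(choose_backend(False, "fast"))
```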
Documentation
- Backend Guide — Model comparison and selection
- Examples — Code examples and recipes
- API Reference — Complete API documentation
- Configuration — Settings and environment variables
- Output Formats — Understanding output files
Configuration
OCRRouter uses explicit configuration (no automatic .env loading):
from ocrrouter import DocumentPipeline, Settings
# Method 1: Settings object
settings = Settings(
backend="deepseek",
openai_api_key="your-key",
max_concurrency=20,
http_timeout=120,
max_retries=3
)
# Method 2: Constructor arguments
pipeline = DocumentPipeline(
backend="deepseek",
openai_api_key="your-key"
)
# Method 3: Settings with overrides
pipeline = DocumentPipeline(
settings=settings,
max_concurrency=50 # Override
)
See Configuration Guide for all available settings.
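Since nothing is loaded from `.env` automatically, values kept in environment variables have to be read explicitly. The sketch below shows one way to do that; the `OCRROUTER_` prefix and the `settings_kwargs_from_env` helper are hypothetical, while the field names are the ones used in the examples above.

```python
import os

# Field names from the examples above, each with the type it should be cast to.
FIELDS = {
    "backend": str,
    "openai_api_key": str,
    "openai_base_url": str,
    "max_concurrency": int,
    "http_timeout": int,
    "max_retries": int,
}


def settings_kwargs_from_env(env=None, prefix="OCRROUTER_"):
    """Collect keyword arguments from variables such as OCRROUTER_BACKEND."""
    env = os.environ if env is None else env
    kwargs = {}
    for field, cast in FIELDS.items():
        raw = env.get(prefix + field.upper())
        if raw is not None:
            kwargs[field] = cast(raw)
    return kwargs


example_env = {"OCRROUTER_BACKEND": "deepseek", "OCRROUTER_MAX_CONCURRENCY": "20"}
kwargs = settings_kwargs_from_env(example_env)
print(kwargs)
# The result could then be passed on, e.g. Settings(**kwargs)
```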
Advanced Features
Observability with Langfuse
import asyncio
from langfuse import Langfuse
from ocrrouter import DocumentPipeline, Settings
langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com"
)
settings = Settings(backend="deepseek", openai_api_key="your-key")
pipeline = DocumentPipeline(settings=settings, langfuse=langfuse)
# Traces appear in the Langfuse dashboard
result = asyncio.run(pipeline.aio_process("document.pdf", "output/"))
Error Handling & Debug Mode
settings = Settings(
backend="deepseek",
max_retries=5,
debug=True, # Save failed requests
debug_dir="./debug" # Debug output location
)
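For readers unfamiliar with what a retry limit does, here is a generic retry loop with exponential backoff. It is a sketch of the general technique, not OCRRouter's retry logic; `call_with_retries` and `flaky` are hypothetical names.

```python
import time


def call_with_retries(func, max_retries=3, base_delay=0.01, sleep=time.sleep):
    """Call func(); on an exception, retry up to max_retries more times.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last exception is re-raised once the retries are used up.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in for a request that fails twice before succeeding
attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


outcome = call_with_retries(flaky, max_retries=5)
print(outcome, len(attempts))
```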
Direct Backend Access
from ocrrouter import get_backend, Settings
settings = Settings(openai_api_key="your-key")
backend = get_backend("mineru", settings=settings)
# Advanced control: call from inside an async function;
# pdf_bytes and image_writer must be supplied by the caller
middle_json, model_output = await backend.analyze(pdf_bytes, image_writer)
Examples
See docs/EXAMPLES.md for comprehensive examples including:
- Basic document processing
- Batch processing with concurrency
- Composite mode configurations
- FastAPI integration
- Custom pipelines
- Use case-specific recipes
Or check out the demo scripts in demo/:
- demo/quickstart.py: Minimal example
- demo/composite_mode.py: Composite mode showcase
- demo/demo.py: Comprehensive demo
Requirements
- Python 3.10, 3.11, 3.12, or 3.13
- VLM server access (for backends requiring API calls)
- See pyproject.toml for full dependency list
Installation
# From PyPI
pip install ocrrouter
# From source
git clone https://github.com/yourusername/ocrrouter.git
cd ocrrouter
pip install -e .
Contributing
Contributions are welcome! See CONTRIBUTING.md for guidelines.
License
This project is licensed under the AGPL-3.0 License - see the LICENSE file for details.
Support
- Issues: Report bugs and request features via GitHub Issues
- Documentation: Full documentation at docs/
- Examples: See docs/EXAMPLES.md and demo/
Built with ❤️ for document processing needs