# Docler

Abstractions & tools for OCR and document processing.
A unified Python library for document conversion and OCR that provides a consistent interface to multiple document processing providers. Extract text, images, and metadata from PDFs, images, and office documents using state-of-the-art OCR and document AI services.
## Features

- **Unified Interface**: Single API for multiple document processing providers
- **Multiple Providers**: Support for 10+ OCR and document AI services
- **Rich Output**: Extract text, images, tables, and metadata
- **Async Support**: Built-in async/await support
- **Flexible Configuration**: Provider-specific settings and preferences
- **Page Range Support**: Process specific pages of a document
- **Multi-language OCR**: Support for 100+ languages across providers
- **Structured Output**: Standardized markdown with embedded metadata
## Quick Start

```python
import asyncio

from docler import MistralConverter


async def main():
    # Create a converter (Mistral OCR here; any provider below works the same way)
    converter = MistralConverter()

    # Convert a document
    result = await converter.convert_file("document.pdf")

    print(f"Title: {result.title}")
    print(f"Content: {result.content[:500]}...")
    print(f"Images: {len(result.images)} extracted")
    print(f"Pages: {result.page_count}")


asyncio.run(main())
```
## Available OCR Converters

### Cloud API Providers

#### Azure Document Intelligence

```python
from docler import AzureConverter

converter = AzureConverter(
    endpoint="your-endpoint",
    api_key="your-key",
    model="prebuilt-layout",
)
```

#### Mistral OCR

```python
from docler import MistralConverter

converter = MistralConverter(
    api_key="your-key",
    languages=["en", "fr", "de"],
)
```

#### LlamaParse

```python
from docler import LlamaParseConverter

converter = LlamaParseConverter(
    api_key="your-key",
    adaptive_long_table=True,
)
```

#### Upstage Document AI

```python
from docler import UpstageConverter

converter = UpstageConverter(
    api_key="your-key",
    chart_recognition=True,
)
```

#### DataLab

```python
from docler import DataLabConverter

converter = DataLabConverter(
    api_key="your-key",
    use_llm=False,  # set True for higher accuracy
)
```
### Local/Self-Hosted Providers

#### Marker

```python
from docler import MarkerConverter

converter = MarkerConverter(
    dpi=192,
    use_llm=True,  # requires a local LLM setup
    llm_provider="ollama",
)
```

#### Docling

```python
from docler import DoclingConverter

converter = DoclingConverter(
    ocr_engine="easy_ocr",
    image_scale=2.0,
)
```

#### Docling Remote

```python
from docler import DoclingRemoteConverter

converter = DoclingRemoteConverter(
    endpoint="http://localhost:5001",
    pdf_backend="dlparse_v4",
)
```

#### MarkItDown (Microsoft)

```python
from docler import MarkItDownConverter

converter = MarkItDownConverter()
```
### LLM-Based Providers

#### LLM Converter

```python
from docler import LLMConverter

converter = LLMConverter(
    model="gpt-4o",  # or claude-3-5-sonnet, etc.
    system_prompt="Extract text preserving formatting...",
)
```
## Provider Comparison
| Provider | Cost/Page | Local | API Required | Best For |
|---|---|---|---|---|
| Azure | $0.0096 | ❌ | ✅ | Enterprise forms, invoices |
| Mistral | Variable | ❌ | ✅ | High-quality text extraction |
| LlamaParse | $0.0045 | ❌ | ✅ | Complex layouts, academic papers |
| Upstage | $0.01 | ❌ | ✅ | Charts, presentations |
| DataLab | $0.0015 | ❌ | ✅ | Cost-effective processing |
| Marker | Free | ✅ | ❌ | Privacy-sensitive documents |
| Docling | Free | ✅ | ❌ | Open-source processing |
| MarkItDown | Free | ✅ | ❌ | Office documents |
| LLM | Variable | ❌ | ✅ | Latest AI capabilities |
## Advanced Usage

### Directory Processing

Process entire directories with progress tracking:

```python
from docler import DirectoryConverter, MarkerConverter

base_converter = MarkerConverter()
dir_converter = DirectoryConverter(base_converter, chunk_size=10)

# Convert all supported files
results = await dir_converter.convert("./documents/")

# Or with progress tracking
async for state in dir_converter.convert_with_progress("./documents/"):
    print(f"Progress: {state.processed_files}/{state.total_files}")
    print(f"Current: {state.current_file}")
    if state.errors:
        print(f"Errors: {len(state.errors)}")
```
### Page Range Processing

Extract specific pages from a document:

```python
# Extract pages 1-5 and 10-15
converter = MistralConverter(page_range="1-5,10-15")
result = await converter.convert_file("large_document.pdf")
```
### Batch Processing

Process multiple files efficiently:

```python
files = ["doc1.pdf", "doc2.png", "doc3.docx"]
results = await converter.convert_files(files)

for file, result in zip(files, results):
    print(f"{file}: {len(result.content)} characters extracted")
```
## Output Format

All converters return a standardized `Document` object:

```python
class Document:
    content: str          # Extracted text in markdown format
    images: list[Image]   # Extracted images with metadata
    title: str            # Document title
    source_path: str      # Original file path
    mime_type: str        # File MIME type
    metadata: dict        # Provider-specific metadata
    page_count: int       # Number of pages processed
```

The markdown content includes standardized metadata comments for page breaks and structure:

```markdown
<!-- docler:page_break {"next_page":1} -->
# Document Title

Content from page 1...

<!-- docler:page_break {"next_page":2} -->
More content from page 2...
```
Installation
# Basic installation
pip install docler
# With specific provider dependencies
pip install docler[azure] # Azure Document Intelligence
pip install docler[mistral] # Mistral OCR
pip install docler[marker] # Marker PDF processing
pip install docler[all] # All providers
## Environment Variables

Configure API keys via environment variables:

```bash
export AZURE_DOC_INTELLIGENCE_ENDPOINT="your-endpoint"
export AZURE_DOC_INTELLIGENCE_KEY="your-key"
export MISTRAL_API_KEY="your-key"
export LLAMAPARSE_API_KEY="your-key"
export UPSTAGE_API_KEY="your-key"
export DATALAB_API_KEY="your-key"
```
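A missing key typically surfaces only when the first request fails, so it can be worth checking up front. This is an illustrative pre-flight sketch, not part of docler; the variable names are the ones listed above.

```python
import os

# Illustrative pre-flight check (not part of docler): report which of the
# expected API-key environment variables are unset or empty.
REQUIRED_KEYS = ["MISTRAL_API_KEY"]


def missing_keys(names: list[str] = REQUIRED_KEYS) -> list[str]:
    """Return the subset of `names` that are not set in the environment."""
    return [name for name in names if not os.environ.get(name)]
```

Call `missing_keys()` at startup and fail fast with a clear message instead of a mid-batch API error.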
## Contributing

We welcome contributions! See our contributing guidelines for details.

## License

MIT License - see LICENSE for details.

## Links

- Documentation: https://phil65.github.io/docler/
- PyPI: https://pypi.org/project/docler/
- GitHub: https://github.com/phil65/docler/
- Issues: https://github.com/phil65/docler/issues
- Discussions: https://github.com/phil65/docler/discussions

**Coming soon**: FastAPI demo with bring-your-own-keys at https://contexter.net