Abstractions & Tools for OCR / document processing

Docler

Read the documentation!

A unified Python library for document conversion and OCR that provides a consistent interface to multiple document processing providers. Extract text, images, and metadata from PDFs, images, and office documents using state-of-the-art OCR and document AI services.

Features

  • Unified Interface: Single API for multiple document processing providers
  • Multiple Providers: Support for 10+ OCR and document AI services
  • Rich Output: Extract text, images, tables, and metadata
  • Async Support: Built-in async/await support
  • Flexible Configuration: Provider-specific settings and preferences
  • Page Range Support: Process specific pages from documents
  • Multi-language OCR: Support for 100+ languages across providers
  • Structured Output: Standardized markdown with embedded metadata
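As a rough illustration of the "single API for multiple providers" idea, the sketch below defines a shared coroutine method and two toy providers. All names here (SketchDocument, SketchConverter, UppercaseConverter, ReverseConverter, convert_with) are hypothetical and are not docler's classes; only the convert_file method name is borrowed from the examples on this page.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a conversion result (not docler's class)."""

    content: str
    title: str = ""
    metadata: dict = field(default_factory=dict)


class SketchConverter(Protocol):
    """Hypothetical interface: every provider exposes the same coroutine."""

    async def convert_file(self, path: str) -> SketchDocument: ...


class UppercaseConverter:
    """Toy provider that 'converts' a file by upper-casing its name."""

    async def convert_file(self, path: str) -> SketchDocument:
        return SketchDocument(path.upper(), title=path, metadata={"provider": "upper"})


class ReverseConverter:
    """Toy provider that 'converts' a file by reversing its name."""

    async def convert_file(self, path: str) -> SketchDocument:
        return SketchDocument(path[::-1], title=path, metadata={"provider": "reverse"})


async def convert_with(converter: SketchConverter, path: str) -> SketchDocument:
    # Calling code depends only on the shared method, not on the provider.
    return await converter.convert_file(path)
```

Swapping providers then means changing only the object passed to convert_with; the calling code stays the same.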

Quick Start

import asyncio
from docler import MistralConverter

async def main():
    # Create a converter (Mistral OCR in this example)
    converter = MistralConverter()

    # Convert a document
    result = await converter.convert_file("document.pdf")

    print(f"Title: {result.title}")
    print(f"Content: {result.content[:500]}...")
    print(f"Images: {len(result.images)} extracted")
    print(f"Pages: {result.page_count}")

asyncio.run(main())

Available OCR Converters

Cloud API Providers

Azure Document Intelligence

from docler import AzureConverter

converter = AzureConverter(
    endpoint="your-endpoint",
    api_key="your-key",
    model="prebuilt-layout"
)

Mistral OCR

from docler import MistralConverter

converter = MistralConverter(
    api_key="your-key",
    languages=["en", "fr", "de"]
)

LlamaParse

from docler import LlamaParseConverter

converter = LlamaParseConverter(
    api_key="your-key",
    adaptive_long_table=True
)

Upstage Document AI

from docler import UpstageConverter

converter = UpstageConverter(
    api_key="your-key",
    chart_recognition=True
)

DataLab

from docler import DataLabConverter

converter = DataLabConverter(
    api_key="your-key",
    use_llm=False  # Set to True for higher accuracy
)

Local/Self-Hosted Providers

Marker

from docler import MarkerConverter

converter = MarkerConverter(
    dpi=192,
    use_llm=True,  # Requires local LLM setup
    llm_provider="ollama"
)

Docling

from docler import DoclingConverter

converter = DoclingConverter(
    ocr_engine="easy_ocr",
    image_scale=2.0
)

Docling Remote

from docler import DoclingRemoteConverter

converter = DoclingRemoteConverter(
    endpoint="http://localhost:5001",
    pdf_backend="dlparse_v4"
)

MarkItDown (Microsoft)

from docler import MarkItDownConverter

converter = MarkItDownConverter()

LLM-Based Providers

LLM Converter

from docler import LLMConverter

converter = LLMConverter(
    model="gpt-4o",  # or claude-3-5-sonnet, etc.
    system_prompt="Extract text preserving formatting..."
)

Provider Comparison

Provider     Cost/Page   Local   API Required   Best For
Azure        $0.0096     No      Yes            Enterprise forms, invoices
Mistral      Variable    No      Yes            High-quality text extraction
LlamaParse   $0.0045     No      Yes            Complex layouts, academic papers
Upstage      $0.01       No      Yes            Charts, presentations
DataLab      $0.0015     No      Yes            Cost-effective processing
Marker       Free        Yes     No             Privacy-sensitive documents
Docling      Free        Yes     No             Open-source processing
MarkItDown   Free        Yes     No             Office documents
LLM          Variable                           Latest AI capabilities
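To make the per-page prices easier to compare for a concrete job, here is a small estimator. The prices are copied from the table above; the function names (estimate_cost, rank_by_cost) are illustrative helpers, not part of docler, and providers listed as "Variable" are left out because the table gives no figure for them.

```python
from decimal import Decimal

# Per-page prices in USD, copied from the comparison table.
# "Variable" providers (Mistral, LLM) are omitted; free providers cost zero.
PRICE_PER_PAGE = {
    "Azure": Decimal("0.0096"),
    "LlamaParse": Decimal("0.0045"),
    "Upstage": Decimal("0.01"),
    "DataLab": Decimal("0.0015"),
    "Marker": Decimal("0"),
    "Docling": Decimal("0"),
    "MarkItDown": Decimal("0"),
}


def estimate_cost(provider: str, pages: int) -> Decimal:
    """Return the estimated cost in USD for converting `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_PAGE[provider] * pages


def rank_by_cost(pages: int) -> list[tuple[str, Decimal]]:
    """Return (provider, cost) pairs, cheapest first, ties broken by name."""
    costs = [(name, estimate_cost(name, pages)) for name in PRICE_PER_PAGE]
    return sorted(costs, key=lambda item: (item[1], item[0]))
```

Decimal is used instead of float so that small per-page prices multiply without rounding noise.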

Advanced Usage

Directory Processing

Process entire directories with progress tracking:

import asyncio

from docler import DirectoryConverter, MarkerConverter


async def main():
    base_converter = MarkerConverter()
    dir_converter = DirectoryConverter(base_converter, chunk_size=10)

    # Convert all supported files in one call
    results = await dir_converter.convert("./documents/")

    # Or iterate to receive progress updates while converting
    async for state in dir_converter.convert_with_progress("./documents/"):
        print(f"Progress: {state.processed_files}/{state.total_files}")
        print(f"Current: {state.current_file}")
        if state.errors:
            print(f"Errors: {len(state.errors)}")


asyncio.run(main())
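For readers curious how chunked conversion with progress reporting could work in principle, the following is a minimal sketch using only asyncio. It is not docler's implementation: ProgressState and convert_in_chunks are hypothetical names, the state fields mirror the ones printed above, and the assumption that chunk_size means "files converted concurrently per step" is mine.

```python
import asyncio
from collections.abc import AsyncIterator, Awaitable, Callable
from dataclasses import dataclass, field


@dataclass
class ProgressState:
    """Hypothetical progress snapshot, modelled on the fields used above."""

    total_files: int
    processed_files: int = 0
    current_file: str = ""
    errors: dict[str, str] = field(default_factory=dict)


async def convert_in_chunks(
    files: list[str],
    convert_one: Callable[[str], Awaitable[str]],
    chunk_size: int = 10,
) -> AsyncIterator[ProgressState]:
    """Convert files chunk by chunk, yielding the state after each chunk."""
    state = ProgressState(total_files=len(files))
    for start in range(0, len(files), chunk_size):
        chunk = files[start:start + chunk_size]
        # Files within a chunk run concurrently; chunks run one after another.
        outcomes = await asyncio.gather(
            *(convert_one(path) for path in chunk), return_exceptions=True
        )
        for path, outcome in zip(chunk, outcomes):
            state.processed_files += 1
            state.current_file = path
            if isinstance(outcome, Exception):
                # Record the failure and keep going with the other files.
                state.errors[path] = str(outcome)
        yield state
```

Note that the same state object is updated and yielded each time, so callers that want a history should copy the values they need during iteration.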

Page Range Processing

Extract specific pages from documents:

from docler import MistralConverter

# Extract pages 1-5 and 10-15 (run inside an async function)
converter = MistralConverter(page_range="1-5,10-15")
result = await converter.convert_file("large_document.pdf")
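The page_range string above combines comma-separated ranges. As a sketch of how such a specification can be expanded into individual page numbers, the helper below parses the same syntax. parse_page_range is a hypothetical function, not docler's parser, and it assumes 1-based, inclusive ranges, which the page does not state explicitly.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "1-5,10-15" into a sorted list of page numbers.

    Assumes 1-based, inclusive ranges; single pages such as "7" are allowed.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

Overlapping or repeated ranges are merged because the pages are collected in a set before sorting.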

Batch Processing

Process multiple files efficiently:

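# `converter` is any converter instance from the sections above;
# run this inside an async function
files = ["doc1.pdf", "doc2.png", "doc3.docx"]
results = await converter.convert_files(files)

for file, result in zip(files, results):
    print(f"{file}: {len(result.content)} characters extracted")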
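When a batch mixes good and bad inputs, it can help to cap concurrency and keep one failure from aborting the rest. The sketch below shows that pattern with plain asyncio; convert_many is a hypothetical helper that wraps any per-file coroutine, and nothing here describes how convert_files behaves internally.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Optional


async def convert_many(
    files: list[str],
    convert_one: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> list[tuple[str, Optional[str], Optional[Exception]]]:
    """Run convert_one on every file, at most max_concurrency at a time.

    Returns (path, result, error) tuples in the same order as `files`;
    exactly one of result and error is set for each file.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path: str):
        async with semaphore:  # wait for a free slot before converting
            try:
                return path, await convert_one(path), None
            except Exception as exc:  # isolate the failure to this file
                return path, None, exc

    return await asyncio.gather(*(guarded(path) for path in files))
```

With a docler converter, convert_one could be a bound method such as converter.convert_file, in which case the result element would be a Document rather than a string.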

Output Format

All converters return a standardized Document object with:

class Document:
    content: str          # Extracted text in markdown format
    images: list[Image]   # Extracted images with metadata
    title: str            # Document title
    source_path: str      # Original file path
    mime_type: str        # File MIME type
    metadata: dict        # Provider-specific metadata
    page_count: int       # Number of pages processed

The markdown content includes standardized metadata for page breaks and structure:

<!-- docler:page_break {"next_page":1} -->
# Document Title

Content from page 1...

<!-- docler:page_break {"next_page":2} -->
More content from page 2...
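Because the page-break markers are plain HTML comments with a JSON payload, the content can be split back into pages with a regular expression. The sketch below assumes the marker format is exactly as shown in the sample above; split_pages is an illustrative helper, not a docler function, and any text before the first marker is ignored.

```python
import json
import re

# Matches markers such as: <!-- docler:page_break {"next_page":1} -->
PAGE_BREAK = re.compile(r"<!--\s*docler:page_break\s+(\{.*?\})\s*-->")


def split_pages(markdown: str) -> dict[int, str]:
    """Map each page number to the markdown that follows its marker."""
    pages: dict[int, str] = {}
    matches = list(PAGE_BREAK.finditer(markdown))
    for index, match in enumerate(matches):
        page_number = json.loads(match.group(1))["next_page"]
        # A page runs until the next marker, or to the end of the text.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(markdown)
        pages[page_number] = markdown[match.end():end].strip()
    return pages
```

Applied to the sample above, this yields one entry for page 1 (the title and first paragraph) and one for page 2.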

Installation

# Basic installation
pip install docler

# With specific provider dependencies
# (quoted so that shells such as zsh do not expand the brackets)
pip install "docler[azure]"      # Azure Document Intelligence
pip install "docler[mistral]"    # Mistral OCR
pip install "docler[marker]"     # Marker PDF processing
pip install "docler[all]"        # All providers

Environment Variables

Configure API keys via environment variables:

export AZURE_DOC_INTELLIGENCE_ENDPOINT="your-endpoint"
export AZURE_DOC_INTELLIGENCE_KEY="your-key"
export MISTRAL_API_KEY="your-key"
export LLAMAPARSE_API_KEY="your-key"
export UPSTAGE_API_KEY="your-key"
export DATALAB_API_KEY="your-key"
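Before running a conversion, it can be useful to check which of these variables are actually set. The helper below is a sketch: the variable names are taken from the list above, while missing_variables and the provider labels used as dictionary keys are hypothetical and not part of docler.

```python
import os
from collections.abc import Mapping
from typing import Optional

# Variable names copied from the list above, grouped by provider.
REQUIRED_VARS = {
    "Azure": ("AZURE_DOC_INTELLIGENCE_ENDPOINT", "AZURE_DOC_INTELLIGENCE_KEY"),
    "Mistral": ("MISTRAL_API_KEY",),
    "LlamaParse": ("LLAMAPARSE_API_KEY",),
    "Upstage": ("UPSTAGE_API_KEY",),
    "DataLab": ("DATALAB_API_KEY",),
}


def missing_variables(env: Optional[Mapping[str, str]] = None) -> dict[str, list[str]]:
    """Return, per provider, the variables that are unset or empty.

    Providers with every variable set are left out of the result.
    """
    env = os.environ if env is None else env
    missing: dict[str, list[str]] = {}
    for provider, names in REQUIRED_VARS.items():
        absent = [name for name in names if not env.get(name)]
        if absent:
            missing[provider] = absent
    return missing
```

Calling missing_variables() with no argument inspects the real process environment; passing a dictionary makes the check easy to test.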

Contributing

We welcome contributions! See our contributing guidelines for details.

License

MIT License - see LICENSE for details.

Links


Coming Soon: FastAPI demo with bring-your-own-keys on https://contexter.net

Download files

Source distribution: docler-2.1.1.tar.gz (1.4 MB)
Built distribution: docler-2.1.1-py3-none-any.whl (1.4 MB, Python 3)
