
docex

Dead simple document extraction OCR powered by LLMs.

DocEx is a dead-simple, fully pluggable OCR toolkit that turns any document (PDFs, DOCX files, images, scans) into clean, structured data using any of 100+ LLMs via LiteLLM.

Features

  • 100+ LLM Models: Works with OpenAI, Anthropic, Google, Cohere, Replicate, Ollama, and many more via LiteLLM
  • Plug-and-Play: Drop DocEx into your Python project via pip
  • Visual-First Processing: Renders each page as an image, then leverages vision models to faithfully extract structured data
  • Schema-Based Extraction: Define your data structure with Pydantic and let the LLM extract exactly what you need
  • Async & Sync APIs: Use async/await or synchronous methods based on your needs
  • Extensible: Easy to add new document loaders and processors
  • Simple Configuration: Configure loaders and processors directly when instantiating them

Installation

pip install docex-llm

Note: You'll also need to install poppler-utils for PDF processing:

  • macOS: brew install poppler
  • Ubuntu/Debian: sudo apt-get install poppler-utils
  • Windows: Download from poppler website
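Because PDF rendering shells out to poppler, it can help to confirm the binaries are on your PATH before processing anything. A minimal stdlib-only check (independent of DocEx; `pdftoppm` is the poppler tool that pdf2image-style renderers typically invoke):

```python
import shutil

def poppler_available() -> bool:
    """Return True if poppler's pdftoppm binary is discoverable on PATH."""
    return shutil.which("pdftoppm") is not None

if not poppler_available():
    print("poppler not found; install it before processing PDFs")
```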

Quick Start

import asyncio
from pydantic import BaseModel
from docex import Pipeline, PDFLoader, LLMProcessor

# Define your extraction schema
class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    total_amount: float
    items: list[dict]

# Use any LiteLLM-supported model
processor = LLMProcessor(
    model="gpt-4-vision-preview",  # or "claude-3-opus", "gemini/gemini-1.5-flash", etc.
    api_key="your-api-key",  # or set via environment variable
    temperature=0.1,
    max_tokens=4096
)

# Create pipeline with configured loader
pipeline = Pipeline(
    loader=PDFLoader(dpi=300, max_pages=10),
    processor=processor
)

# await requires an async context, so wrap the call in a coroutine
async def main():
    # Process document
    result = await pipeline.process_document(
        file_path="invoice.pdf",
        schema=Invoice
    )

    # Access extracted data
    print(f"Invoice #{result.extracted_data.invoice_number}")
    print(f"Total: ${result.extracted_data.total_amount}")

asyncio.run(main())

Supported Models

DocEx supports any model available through LiteLLM, including:

  • OpenAI: GPT-4 Vision, GPT-4, GPT-3.5
  • Anthropic: Claude 3 Opus, Sonnet, Haiku
  • Google: Gemini 1.5 Pro, Flash
  • Open Source: Llama, Mistral, Mixtral via Ollama, Together, Replicate
  • And 100+ more: See LiteLLM docs for full list
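LiteLLM identifies many non-OpenAI models with a `provider/model` prefix (e.g. `gemini/gemini-1.5-flash`, `ollama/llama3`), while OpenAI-style names are passed bare. A small illustrative helper (the model strings below are examples, not an exhaustive or guaranteed list):

```python
# Example LiteLLM-style model identifiers; check the LiteLLM docs for
# the exact string your provider expects.
MODEL_EXAMPLES = {
    "openai": "gpt-4-vision-preview",
    "anthropic": "claude-3-opus-20240229",
    "google": "gemini/gemini-1.5-flash",
    "ollama": "ollama/llama3",
}

def provider_prefix(model: str):
    """Return the provider namespace if the model string is prefixed, else None."""
    return model.split("/", 1)[0] if "/" in model else None
```

Whichever string you choose is passed straight through as the `model` argument to `LLMProcessor`.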

Synchronous Usage

# Use the synchronous wrapper
result = pipeline.process_document_sync(
    file_path="document.pdf",
    schema=YourSchema
)

Loader Configuration

# Configure PDF loader
loader = PDFLoader(
    dpi=300,  # Resolution for rendering
    fmt='PNG',  # Output format
    thread_count=4,  # Parallel processing
    max_pages=10  # Limit pages to process
)

Processor Configuration

# Configure LLM processor
processor = LLMProcessor(
    model="gpt-4-vision-preview",
    api_key="your-key",
    temperature=0.2,  # Control randomness
    max_tokens=8192,  # Max output length
    system_prompt="Custom instructions...",  # Override default prompt
    litellm_params={
        "timeout": 30,
        "max_retries": 2
    }
)

Environment Variables

LiteLLM supports provider-specific environment variables:

# Provider API keys
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GEMINI_API_KEY="..."
# etc.
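When a provider key is exported in the environment, you can omit `api_key` when constructing the processor and let LiteLLM pick it up. A hedged stdlib sketch of the same lookup, useful for failing fast with a clear message (the helper name is illustrative, not part of DocEx):

```python
import os

def get_api_key(var: str = "OPENAI_API_KEY") -> str:
    """Fetch a provider API key from the environment, raising if unset."""
    key = os.environ.get(var)
    if key is None:
        raise RuntimeError(
            f"{var} is not set; export it or pass api_key explicitly"
        )
    return key
```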

Advanced Usage

# Custom system prompt for specialized extraction
processor = LLMProcessor(
    model="claude-3-opus-20240229",
    system_prompt="""You are a specialized invoice processor. 
    Focus on extracting line items with extreme precision.
    Always validate totals and tax calculations.""",
    temperature=0.0  # Deterministic output
)

# Process only first 5 pages of large documents
loader = PDFLoader(dpi=200, max_pages=5)

# Use with custom schema
class Contract(BaseModel):
    party_names: list[str]
    effective_date: str
    terms: list[dict]
    signatures: list[dict]

pipeline = Pipeline(loader=loader, processor=processor)
# await must run inside an async function (or use process_document_sync)
result = await pipeline.process_document("contract.pdf", Contract)

Development

This project uses Poetry for dependency management and packaging.

  1. Install Poetry: pip install poetry
  2. Clone the repository: git clone https://github.com/yourusername/docex.git
  3. Navigate to the project directory: cd docex
  4. Install dependencies: poetry install

Running Tests

poetry run poe test

Linting

poetry run poe lint
