Dead simple document extraction OCR powered by LLMs
Project description
docex
Dead simple document extraction OCR powered by LLMs.
DocEx is a dead-simple, fully pluggable OCR toolkit designed to turn any document—PDFs, DOCX, images, scans—into clean, structured data using any of 100+ LLM models via LiteLLM.
Features
- 100+ LLM Models: Works with OpenAI, Anthropic, Google, Cohere, Replicate, Ollama, and many more via LiteLLM
- Plug-and-Play: Drop DocEx into your Python project via pip
- Visual-First Processing: Renders each page as an image, then leverages vision models to faithfully extract structured data
- Schema-Based Extraction: Define your data structure with Pydantic and let the LLM extract exactly what you need
- Async & Sync APIs: Use async/await or synchronous methods based on your needs
- Extensible: Easy to add new document loaders and processors
- Simple Configuration: Configure loaders and processors directly when instantiating them
Installation
pip install docex-llm
Note: You'll also need to install poppler-utils for PDF processing:
- macOS:
brew install poppler - Ubuntu/Debian:
sudo apt-get install poppler-utils - Windows: Download from poppler website
Quick Start
import asyncio
from pydantic import BaseModel
from docex import Pipeline, PDFLoader, LLMProcessor
# Define your extraction schema
class Invoice(BaseModel):
invoice_number: str
vendor_name: str
total_amount: float
items: list[dict]
# Use any LiteLLM-supported model
processor = LLMProcessor(
model="gpt-4-vision-preview", # or "claude-3-opus", "gemini/gemini-1.5-flash", etc.
api_key="your-api-key", # or set via environment variable
temperature=0.1,
max_tokens=4096
)
# Create pipeline with configured loader
pipeline = Pipeline(
loader=PDFLoader(dpi=300, max_pages=10),
processor=processor
)
# Process document
result = await pipeline.process_document(
file_path="invoice.pdf",
schema=Invoice
)
# Access extracted data
print(f"Invoice #{result.extracted_data.invoice_number}")
print(f"Total: ${result.extracted_data.total_amount}")
Supported Models
DocEx supports any model available through LiteLLM, including:
- OpenAI: GPT-4 Vision, GPT-4, GPT-3.5
- Anthropic: Claude 3 Opus, Sonnet, Haiku
- Google: Gemini 1.5 Pro, Flash
- Open Source: Llama, Mistral, Mixtral via Ollama, Together, Replicate
- And 100+ more: See LiteLLM docs for full list
Synchronous Usage
# Use the synchronous wrapper
result = pipeline.process_document_sync(
file_path="document.pdf",
schema=YourSchema
)
Loader Configuration
# Configure PDF loader
loader = PDFLoader(
dpi=300, # Resolution for rendering
fmt='PNG', # Output format
thread_count=4, # Parallel processing
max_pages=10 # Limit pages to process
)
Processor Configuration
# Configure LLM processor
processor = LLMProcessor(
model="gpt-4-vision-preview",
api_key="your-key",
temperature=0.2, # Control randomness
max_tokens=8192, # Max output length
system_prompt="Custom instructions...", # Override default prompt
litellm_params={
"timeout": 30,
"max_retries": 2
}
)
Environment Variables
LiteLLM supports provider-specific environment variables:
# Provider API keys
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GEMINI_API_KEY="..."
# etc.
Advanced Usage
# Custom system prompt for specialized extraction
processor = LLMProcessor(
model="claude-3-opus-20240229",
system_prompt="""You are a specialized invoice processor.
Focus on extracting line items with extreme precision.
Always validate totals and tax calculations.""",
temperature=0.0 # Deterministic output
)
# Process only first 5 pages of large documents
loader = PDFLoader(dpi=200, max_pages=5)
# Use with custom schema
class Contract(BaseModel):
party_names: list[str]
effective_date: str
terms: list[dict]
signatures: list[dict]
pipeline = Pipeline(loader=loader, processor=processor)
result = await pipeline.process_document("contract.pdf", Contract)
Development
This project uses Poetry for dependency management and packaging.
- Install Poetry:
pip install poetry - Clone the repository:
git clone https://github.com/yourusername/docex.git - Navigate to the project directory:
cd docex - Install dependencies:
poetry install
Running Tests
poetry run poe test
Linting
poetry run poe lint
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file docex_llm-0.1.1.tar.gz.
File metadata
- Download URL: docex_llm-0.1.1.tar.gz
- Upload date:
- Size: 11.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.9.6 Darwin/23.6.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e82365b1f80de659bb43b011a4275e6bfa6c0c4cdcf853124d0d0971e2c8196a
|
|
| MD5 |
9684959d5957617a77ab55056efebe4d
|
|
| BLAKE2b-256 |
eb8e7ba1ae8a2811242f49ac5e9857fcce8fe9c858c8d8c6173d77acdc3bae8f
|
File details
Details for the file docex_llm-0.1.1-py3-none-any.whl.
File metadata
- Download URL: docex_llm-0.1.1-py3-none-any.whl
- Upload date:
- Size: 12.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.9.6 Darwin/23.6.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ddeb472e7133511455dfc2b1b1771a37b8080a3090fd9041089f4339e1d6df15
|
|
| MD5 |
8f2d7f290077055f20224ffefd07f0cb
|
|
| BLAKE2b-256 |
e52f7df7dcb09e31b2a2a50dd42976ba5c9075dc3dbfc96db9e6e3267a2d1c5d
|