# OmniDocs
Unified Python toolkit for visual document understanding
Documentation • Installation • Quick Start • Tasks • Contributing
OmniDocs provides a single, consistent API for document AI tasks: layout detection, OCR, text extraction, table parsing, structured extraction, and reading order. Swap models and backends without changing your code.
```python
result = extractor.extract(image)
```
## Why OmniDocs?

- One API — `.extract()` for every task
- Multi-backend — PyTorch, VLLM, MLX, API
- VLM API — Use any cloud VLM (Gemini, OpenRouter, Azure, OpenAI) with zero GPU
- Type-safe — Pydantic configs and outputs
- Structured extraction — Extract data into Pydantic schemas
- Production-ready — Modal deployment, batch processing
## Installation

```bash
pip install omnidocs
```

Or with uv:

```bash
uv pip install omnidocs
```
Cloud API access (Gemini, OpenRouter, Azure, OpenAI, ANANNAS AI) works out of the box — LiteLLM is included as a core dependency.
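In practice, "out of the box" means authentication is handled through the standard provider environment variables that LiteLLM reads. For example, before running any of the cloud examples (the key values below are placeholders; set only the variable for your provider):

```bash
# Placeholders — substitute your real keys
export OPENROUTER_API_KEY="sk-or-..."   # OpenRouter
export GOOGLE_API_KEY="AIza..."         # Google Gemini
```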
### Install extras

The extras are quoted so shells like zsh don't interpret the brackets:

```bash
pip install "omnidocs[pytorch]"  # Local GPU inference
pip install "omnidocs[vllm]"     # High-throughput production
pip install "omnidocs[mlx]"      # Apple Silicon
pip install "omnidocs[ocr]"      # Tesseract, EasyOCR, PaddleOCR
pip install "omnidocs[all]"      # Everything
```
### From source

```bash
git clone https://github.com/adithya-s-k/Omnidocs.git
cd Omnidocs
uv sync
```
### Flash Attention (optional, for PyTorch VLMs)

Download a pre-built wheel from the Flash Attention Releases page that matches your Python, CUDA, and PyTorch versions:

```bash
# Example: Python 3.12, CUDA 12, PyTorch 2.5
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
```
## Quick Start

### VLM API (No GPU Required)
Use any cloud VLM through a single, provider-agnostic API:
```python
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor

# Just set your env var: OPENROUTER_API_KEY, GOOGLE_API_KEY, etc.
config = VLMAPIConfig(model="openrouter/qwen/qwen3-vl-8b-instruct")
extractor = VLMTextExtractor(config=config)

result = extractor.extract("document.png", output_format="markdown")
print(result.content)
```
Works with any provider: OpenRouter, Gemini, Azure, OpenAI, ANANNAS AI, self-hosted VLLM — if it speaks the OpenAI API, it works.
### Structured Extraction
Extract typed data directly into Pydantic schemas:
```python
from pydantic import BaseModel

from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.structured_extraction import VLMStructuredExtractor

class Invoice(BaseModel):
    vendor: str
    total: float
    items: list[str]

config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMStructuredExtractor(config=config)

result = extractor.extract(
    image="invoice.png",
    schema=Invoice,
    prompt="Extract invoice details from this document.",
)
print(result.data.vendor, result.data.total)
```
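The value of schema-based extraction is that whatever the VLM returns must parse into your model, so malformed output fails loudly instead of propagating bad data downstream. A minimal illustration of that guarantee using plain Pydantic, with no OmniDocs calls:

```python
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    vendor: str
    total: float
    items: list[str]

# Well-formed data parses, with type coercion (the string "12.50" becomes 12.5)
inv = Invoice(vendor="Acme Corp", total="12.50", items=["widget", "gadget"])
assert inv.total == 12.5

# A malformed extraction raises ValidationError rather than passing through
try:
    Invoice(vendor="Acme Corp", total="n/a", items=[])
except ValidationError:
    print("rejected malformed extraction")
```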
### Text Extraction (Local GPU)
```python
from omnidocs import Document
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig

doc = Document.from_pdf("report.pdf")
extractor = QwenTextExtractor(
    backend=QwenTextVLLMConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)

result = extractor.extract(doc.get_page(0), output_format="markdown")
print(result.content)
```
### Layout Detection
```python
from omnidocs import Document
from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig

doc = Document.from_pdf("paper.pdf")
detector = DocLayoutYOLO(config=DocLayoutYOLOConfig(device="cuda"))

result = detector.extract(doc.get_page(0))
for box in result.bboxes:
    print(f"{box.label.value}: {box.confidence:.2f}")
```
### Table Extraction
```python
from omnidocs.tasks.table_extraction import TableFormerExtractor, TableFormerConfig

extractor = TableFormerExtractor(config=TableFormerConfig(device="cuda"))

# table_image: a PIL image of a cropped table region
result = extractor.extract(table_image)
df = result.to_dataframe()
html = result.to_html()
```
## Supported Tasks
| Task | Description | Output |
|---|---|---|
| Text Extraction | Convert documents to Markdown/HTML | Formatted text |
| Layout Analysis | Detect titles, tables, figures, etc. | Bounding boxes + labels |
| OCR | Extract text with coordinates | Text blocks + positions |
| Table Extraction | Parse table structure | Cells, rows, columns |
| Structured Extraction | Extract typed data into Pydantic schemas | Validated model instances |
| Reading Order | Determine logical reading sequence | Ordered elements |
## Supported Models

### Text Extraction
| Model | Backends | Notes |
|---|---|---|
| VLM API | Any cloud API | Provider-agnostic via LiteLLM |
| Qwen3-VL | PyTorch, VLLM, MLX, API | Best quality |
| MinerU VL | PyTorch, VLLM, API | Layout-aware extraction |
| Nanonets OCR2 | PyTorch, VLLM, MLX | Fast, accurate |
| Granite Docling | PyTorch, VLLM, MLX, API | IBM research model |
| DotsOCR | PyTorch, VLLM, API | Layout-aware |
### Layout Analysis
| Model | Backends | Notes |
|---|---|---|
| VLM API | Any cloud API | Custom labels support |
| DocLayoutYOLO | PyTorch | Fast (0.1s/page) |
| RT-DETR | PyTorch | Transformer-based |
| Qwen Layout | PyTorch, VLLM, MLX, API | Custom labels |
| MinerU VL Layout | PyTorch, VLLM, API | High accuracy |
### Structured Extraction
| Model | Backends | Notes |
|---|---|---|
| VLM API | Any cloud API | Pydantic schema output |
### OCR
| Model | Backends | Notes |
|---|---|---|
| Tesseract | CPU | 100+ languages |
| EasyOCR | PyTorch | 80+ languages |
| PaddleOCR | PaddlePaddle | CJK optimized |
### Table Extraction
| Model | Backends | Notes |
|---|---|---|
| TableFormer | PyTorch | Structure + content |
### Reading Order
| Model | Backends | Notes |
|---|---|---|
| Rule-based | CPU | R-tree indexing |
## VLM API Providers
Use any VLM through a single config — just change the model string:
```python
from omnidocs.vlm import VLMAPIConfig

# OpenRouter (100+ vision models)
config = VLMAPIConfig(model="openrouter/qwen/qwen3-vl-8b-instruct")

# Google Gemini
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")

# Azure OpenAI
config = VLMAPIConfig(model="azure/gpt-5-mini", api_version="2024-12-01-preview")

# OpenAI
config = VLMAPIConfig(model="openai/gpt-4o")

# Any OpenAI-compatible API (ANANNAS AI, self-hosted VLLM, etc.)
config = VLMAPIConfig(
    model="openai/model-name",
    api_base="https://your-provider.com/v1",
)
```
See the VLM API docs for full provider setup and model lists.
## Multi-Backend Support
All VLM models support multiple inference backends:
```python
# PyTorch (local GPU)
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
config = QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct", device="cuda")

# VLLM (high-throughput)
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig
config = QwenTextVLLMConfig(model="Qwen/Qwen3-VL-8B-Instruct", tensor_parallel_size=2)

# MLX (Apple Silicon)
from omnidocs.tasks.text_extraction.qwen import QwenTextMLXConfig
config = QwenTextMLXConfig(model="mlx-community/Qwen3-VL-8B-Instruct-4bit")

# API (provider-agnostic via LiteLLM)
from omnidocs.tasks.text_extraction.qwen import QwenTextAPIConfig
config = QwenTextAPIConfig(model="openrouter/qwen/qwen3-vl-8b-instruct")
```
## Document Loading
```python
from omnidocs import Document

# From file
doc = Document.from_pdf("file.pdf", page_range=(0, 9))

# From URL
doc = Document.from_url("https://arxiv.org/pdf/1706.03762")

# From images
doc = Document.from_images(["page1.png", "page2.png"])

# Access pages
image = doc.get_page(0)  # PIL Image
```
## Roadmap
See the full Roadmap for planned features.
Coming soon:
- Math Recognition (LaTeX extraction)
- Chart Understanding
- Surya OCR + Layout
## Contributing
Contributions are welcome! See our Contributing Guide to get started.
```bash
# Setup
git clone https://github.com/adithya-s-k/Omnidocs.git
cd Omnidocs && uv sync

# Test
uv run pytest tests/ -v

# Lint
uv run ruff check . && uv run ruff format .

# Docs
uv run mkdocs serve
```
## License
Apache 2.0 — See LICENSE for details.