
OmniDocs


Unified Python toolkit for visual document understanding - think Transformers for document AI

Documentation · Installation · Quick Start · Tasks · Contributing


OmniDocs provides a single, consistent API for document AI tasks: layout detection, OCR, text extraction, table parsing, structured extraction, and reading order. Swap models and backends without changing your code.

result = extractor.extract(image)

Why OmniDocs?

  • One API — .extract() for every task
  • Multi-backend — PyTorch, VLLM, MLX, API
  • VLM API — Use any cloud VLM (Gemini, OpenRouter, Azure, OpenAI) with zero GPU
  • Type-safe — Pydantic configs and outputs
  • Structured extraction — Extract data into Pydantic schemas
  • Production-ready — Modal deployment, batch processing
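
The "one API" idea — every task type exposing the same extract() call so models can be swapped without touching caller code — can be sketched in plain Python. The class names below are illustrative stand-ins, not OmniDocs' actual internals:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Result:
    content: str


class Extractor(Protocol):
    """Anything with an extract() method is interchangeable."""

    def extract(self, image: str) -> Result: ...


@dataclass
class FakeOCR:
    def extract(self, image: str) -> Result:
        return Result(content=f"ocr text from {image}")


@dataclass
class FakeLayout:
    def extract(self, image: str) -> Result:
        return Result(content=f"layout boxes for {image}")


def run(extractor: Extractor, image: str) -> str:
    # Caller code never changes when the backend is swapped.
    return extractor.extract(image).content
```

Swapping FakeOCR for FakeLayout changes nothing in run() — the same structural contract is what lets OmniDocs swap models and backends behind a stable interface.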

Installation

pip install omnidocs

Or with uv:

uv pip install omnidocs

Cloud API access (Gemini, OpenRouter, Azure, OpenAI, ANANNAS AI) works out of the box — LiteLLM is included as a core dependency.

Install extras
pip install omnidocs[pytorch]   # Local GPU inference
pip install omnidocs[vllm]      # High-throughput production
pip install omnidocs[mlx]       # Apple Silicon
pip install omnidocs[ocr]       # Tesseract, EasyOCR, PaddleOCR
pip install omnidocs[all]       # Everything
From source
git clone https://github.com/adithya-s-k/Omnidocs.git
cd Omnidocs
uv sync
Flash Attention (optional, for PyTorch VLMs)

Download a pre-built wheel from the Flash Attention releases page:

# Example: Python 3.12, CUDA 12, PyTorch 2.5
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl

Quick Start

VLM API (No GPU Required)

Use any cloud VLM through a single, provider-agnostic API:

from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor

# Just set your env var: OPENROUTER_API_KEY, GOOGLE_API_KEY, etc.
config = VLMAPIConfig(model="openrouter/qwen/qwen3-vl-8b-instruct")

extractor = VLMTextExtractor(config=config)
result = extractor.extract("document.png", output_format="markdown")
print(result.content)

Works with any provider: OpenRouter, Gemini, Azure, OpenAI, ANANNAS AI, self-hosted VLLM — if it speaks the OpenAI API, it works.

Structured Extraction

Extract typed data directly into Pydantic schemas:

from pydantic import BaseModel
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.structured_extraction import VLMStructuredExtractor

class Invoice(BaseModel):
    vendor: str
    total: float
    items: list[str]

config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMStructuredExtractor(config=config)

result = extractor.extract(
    image="invoice.png",
    schema=Invoice,
    prompt="Extract invoice details from this document.",
)
print(result.data.vendor, result.data.total)

Text Extraction (Local GPU)

from omnidocs import Document
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig

doc = Document.from_pdf("report.pdf")

extractor = QwenTextExtractor(
    backend=QwenTextVLLMConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)

result = extractor.extract(doc.get_page(0), output_format="markdown")
print(result.content)

Layout Detection

from omnidocs import Document
from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig

doc = Document.from_pdf("paper.pdf")

detector = DocLayoutYOLO(config=DocLayoutYOLOConfig(device="cuda"))
result = detector.extract(doc.get_page(0))

for box in result.bboxes:
    print(f"{box.label.value}: {box.confidence:.2f}")

Table Extraction

from omnidocs.tasks.table_extraction import TableFormerExtractor, TableFormerConfig

extractor = TableFormerExtractor(config=TableFormerConfig(device="cuda"))
result = extractor.extract(table_image)

df = result.to_dataframe()
html = result.to_html()

Supported Tasks

Task | Description | Output
Text Extraction | Convert documents to Markdown/HTML | Formatted text
Layout Analysis | Detect titles, tables, figures, etc. | Bounding boxes + labels
OCR | Extract text with coordinates | Text blocks + positions
Table Extraction | Parse table structure | Cells, rows, columns
Structured Extraction | Extract typed data into Pydantic schemas | Validated model instances
Reading Order | Determine logical reading sequence | Ordered elements

Supported Models

Text Extraction

Model | Backends | Notes
VLM API | Any cloud API | Provider-agnostic via LiteLLM
Qwen3-VL | PyTorch, VLLM, MLX, API | Best quality
MinerU VL | PyTorch, VLLM, API | Layout-aware extraction
Nanonets OCR2 | PyTorch, VLLM, MLX | Fast, accurate
Granite Docling | PyTorch, VLLM, MLX, API | IBM research model
DotsOCR | PyTorch, VLLM, API | Layout-aware

Layout Analysis

Model | Backends | Notes
VLM API | Any cloud API | Custom labels support
DocLayoutYOLO | PyTorch | Fast (0.1s/page)
RT-DETR | PyTorch | Transformer-based
Qwen Layout | PyTorch, VLLM, MLX, API | Custom labels
MinerU VL Layout | PyTorch, VLLM, API | High accuracy

Structured Extraction

Model | Backends | Notes
VLM API | Any cloud API | Pydantic schema output

OCR

Model | Backends | Notes
Tesseract | CPU | 100+ languages
EasyOCR | PyTorch | 80+ languages
PaddleOCR | PaddlePaddle | CJK optimized

Table Extraction

Model | Backends | Notes
TableFormer | PyTorch | Structure + content

Reading Order

Model | Backends | Notes
Rule-based | CPU | R-tree indexing
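
The core of rule-based reading order can be approximated with a top-to-bottom, left-to-right sort over bounding boxes. This is only a sketch of the basic heuristic, not OmniDocs' R-tree implementation:

```python
from typing import NamedTuple


class BBox(NamedTuple):
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def reading_order(boxes: list[BBox], line_tol: float = 10.0) -> list[BBox]:
    """Sort boxes top-to-bottom; boxes on roughly the same line go left-to-right.

    Bucketing the top edge by line_tol keeps small vertical jitter from
    splitting one visual line into two.
    """
    return sorted(boxes, key=lambda b: (round(b.y0 / line_tol), b.x0))
```

A real implementation also has to handle multi-column layouts, which is where spatial indexing (like the R-tree mentioned above) earns its keep.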

VLM API Providers

Use any VLM through a single config — just change the model string:

from omnidocs.vlm import VLMAPIConfig

# OpenRouter (100+ vision models)
config = VLMAPIConfig(model="openrouter/qwen/qwen3-vl-8b-instruct")

# Google Gemini
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")

# Azure OpenAI
config = VLMAPIConfig(model="azure/gpt-5-mini", api_version="2024-12-01-preview")

# OpenAI
config = VLMAPIConfig(model="openai/gpt-4o")

# Any OpenAI-compatible API (ANANNAS AI, self-hosted VLLM, etc.)
config = VLMAPIConfig(
    model="openai/model-name",
    api_base="https://your-provider.com/v1",
)

See the VLM API docs for full provider setup and model lists.
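
In the examples above, the provider is encoded as the first segment of the model string. A minimal sketch of that naming convention (illustrative only, not LiteLLM's actual routing code):

```python
def split_model_string(model: str) -> tuple[str, str]:
    """Split 'provider/model-id' into (provider, model-id).

    The model id itself may contain further slashes, e.g.
    'openrouter/qwen/qwen3-vl-8b-instruct' routes to OpenRouter
    with model id 'qwen/qwen3-vl-8b-instruct'.
    """
    provider, _, model_id = model.partition("/")
    return provider, model_id
```

This is why switching providers is a one-line change: only the prefix of the string moves, and the rest of the config stays put.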


Multi-Backend Support

All VLM models support multiple inference backends:

# PyTorch (local GPU)
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
config = QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct", device="cuda")

# VLLM (high-throughput)
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig
config = QwenTextVLLMConfig(model="Qwen/Qwen3-VL-8B-Instruct", tensor_parallel_size=2)

# MLX (Apple Silicon)
from omnidocs.tasks.text_extraction.qwen import QwenTextMLXConfig
config = QwenTextMLXConfig(model="mlx-community/Qwen3-VL-8B-Instruct-4bit")

# API (provider-agnostic via litellm)
from omnidocs.tasks.text_extraction.qwen import QwenTextAPIConfig
config = QwenTextAPIConfig(model="openrouter/qwen/qwen3-vl-8b-instruct")

Document Loading

from omnidocs import Document

# From file
doc = Document.from_pdf("file.pdf", page_range=(0, 9))

# From URL
doc = Document.from_url("https://arxiv.org/pdf/1706.03762")

# From images
doc = Document.from_images(["page1.png", "page2.png"])

# Access pages
image = doc.get_page(0)  # PIL Image

Roadmap

See the full Roadmap for planned features.

Coming soon:

  • Math Recognition (LaTeX extraction)
  • Chart Understanding
  • Surya OCR + Layout

Contributing

Contributions are welcome! See our Contributing Guide to get started.

# Setup
git clone https://github.com/adithya-s-k/Omnidocs.git
cd Omnidocs && uv sync

# Test
uv run pytest tests/ -v

# Lint
uv run ruff check . && uv run ruff format .

# Docs
uv run mkdocs serve


License

Apache 2.0 — See LICENSE for details.

