Production-ready document parsing with Vision Language Models
📄 DocVision Parser
A document parsing framework powered by Vision Language Models (VLMs) and native PDF extraction.
Overview
DocVision Parser is a robust Python library designed to extract high-quality structured text and markdown from documents (images and PDFs). It combines the speed of native PDF extraction with the reasoning power of Vision Language Models (like GPT-4o, Claude, or Llama 3.2).
The framework provides three powerful parsing modes:
- PDF (Native): Ultra-fast extraction of text and tables using deterministic rules.
- VLM Mode: High-fidelity single-shot parsing using Vision models to understand layout and context.
- Agentic Mode: A self-correcting, iterative workflow that handles long documents and complex layouts by automatically detecting truncation or repetition.
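To make the "detecting truncation or repetition" idea concrete, here is a minimal sketch of the kind of heuristics such a workflow could use. It is not DocVision's actual implementation: the function names, the table-row check, and the window and threshold values are all assumptions chosen for illustration. The only external fact relied on is that OpenAI-compatible chat APIs report `finish_reason == "length"` when a response stops at the token limit.

```python
from typing import Dict, Optional, Tuple


def looks_truncated(text: str, finish_reason: Optional[str] = None) -> bool:
    """Hypothetical heuristic: did the model stop before finishing the page?"""
    # OpenAI-compatible APIs report "length" when max_tokens was reached.
    if finish_reason == "length":
        return True
    stripped = text.rstrip()
    if not stripped:
        return False
    # A final Markdown table row that lacks its closing pipe was cut mid-row.
    last_line = stripped.splitlines()[-1].strip()
    return last_line.startswith("|") and not last_line.endswith("|")


def looks_repetitive(text: str, window: int = 3, threshold: int = 3) -> bool:
    """Hypothetical heuristic: the same run of `window` lines occurs `threshold` times."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    counts: Dict[Tuple[str, ...], int] = {}
    for i in range(len(lines) - window + 1):
        key = tuple(lines[i:i + window])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] >= threshold:
            return True
    return False
```

A loop built on checks like these would ask the model to continue when `looks_truncated` fires and to retry when `looks_repetitive` fires. Legitimately repetitive content (for example, tables with many identical rows) can trip the second check, so the thresholds would need tuning.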
Features
- Hybrid PDF Parsing: Extract native text/tables and optionally use VLM to describe charts and images in-situ.
- Agentic/Iterative Workflow: Self-correcting loop that handles model token limits and ensures complete transcription for long pages.
- Intelligent Vision Pipeline: Automatic image rotation correction, DPI management, and dynamic optimization for the best VLM input.
- Async-First: High-throughput processing with built-in concurrency control (Semaphores).
- Structured Output: Native Pydantic support for extracting structured JSON data from any document.
- Production-Ready: Automatic retries, error handling, and direct export to Markdown or JSON files.
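The semaphore-based concurrency control mentioned above follows a common asyncio pattern. The sketch below shows that pattern in isolation, assuming a hypothetical `map_bounded` helper and a generic async `worker`; it illustrates the technique and does not reproduce the library's internals.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_bounded(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    max_concurrency: int = 5,
) -> List[R]:
    """Run `worker` over `items`, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: T) -> R:
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the input items.
    return await asyncio.gather(*(run_one(item) for item in items))
```

With page rendering or VLM requests as the `worker`, the limit keeps memory use and API rate bounded while still overlapping network waits.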
Installation
Install using pip:
pip install doc-vision-parser
Or using uv (recommended):
uv add doc-vision-parser
Quick Start
Basic Usage
Initialize the DocumentParser and parse an image into Markdown.
import asyncio
from docvision import DocumentParser

async def main():
    # Initialize the parser
    parser = DocumentParser(
        vlm_base_url="https://api.openai.com/v1",
        vlm_model="gpt-4o-mini",
        vlm_api_key="your_api_key",
    )
    # Parse an image
    result = await parser.parse_image("document.jpg")
    print(result.content)
    print(f"ID: {result.id}")

if __name__ == "__main__":
    asyncio.run(main())
Parsing PDFs
The parser can handle PDFs using different strategies.
from docvision import DocumentParser, ParsingMode

async def parse_doc():
    parser = DocumentParser(vlm_base_url=..., vlm_model=..., vlm_api_key=...)
    # Mode 1: Native PDF (fastest, no Vision costs)
    results = await parser.parse_pdf("report.pdf", parsing_mode=ParsingMode.PDF)
    # Mode 2: VLM (best for complex layouts/handwriting)
    results = await parser.parse_pdf("scanned.pdf", parsing_mode=ParsingMode.VLM)
    # Mode 3: AGENTIC (self-correcting for long tables/text)
    results = await parser.parse_pdf("dense.pdf", parsing_mode=ParsingMode.AGENTIC)
    # Save results directly to file
    await parser.parse_pdf("input.pdf", save_path="./output/results.md")
Advanced Features
Structured Output (JSON)
Extract data directly into Pydantic models.
import asyncio
from typing import List

from pydantic import BaseModel
from docvision import DocumentParser

class Item(BaseModel):
    description: str
    price: float

class Invoice(BaseModel):
    invoice_no: str
    items: List[Item]

async def main():
    # Note: system_prompt is required when using structured output
    parser = DocumentParser(
        vlm_api_key="...",
        system_prompt="Extract invoice details correctly.",
    )
    result = await parser.parse_image("invoice.png", output_schema=Invoice)
    print(result.content.invoice_no)  # Content is now a Pydantic object

if __name__ == "__main__":
    asyncio.run(main())
Hybrid Parsing (Native + VLM)
Use native extraction for text but let the VLM describe the charts.
from docvision import DocumentParser, ParsingMode

async def parse_charts():
    parser = DocumentParser(
        vlm_api_key="...",
        chart_description=True,  # This enables VLM hybrid for Native Mode
    )
    # Text and tables are extracted natively, but <chart> tags
    # will contain VLM-generated descriptions.
    results = await parser.parse_pdf("chart_heavy.pdf", parsing_mode=ParsingMode.PDF)
    return results
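If you want to post-process the chart descriptions separately from the body text, a small regular expression is usually enough. The sketch below assumes the descriptions are wrapped as `<chart>...</chart>` pairs in the returned Markdown; the exact tag format is an assumption based on the comment above, and `extract_chart_descriptions` is a hypothetical helper, not part of the library.

```python
import re
from typing import List

# Assumed format: <chart> ... </chart>, possibly spanning several lines.
CHART_RE = re.compile(r"<chart>(.*?)</chart>", re.DOTALL)


def extract_chart_descriptions(markdown: str) -> List[str]:
    """Return the text inside every <chart>...</chart> pair, in document order."""
    return [match.strip() for match in CHART_RE.findall(markdown)]
```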
Configuration
The DocumentParser is configured during initialization.
| Parameter | Type | Default | Description |
|---|---|---|---|
| vlm_base_url | str | None | OpenAI-compatible API base URL. |
| vlm_model | str | None | Model name (e.g., gpt-4o). |
| vlm_api_key | str | None | Your API key. |
| temperature | float | 0.7 | Model sampling temperature. |
| max_tokens | int | 4096 | Max tokens per VLM call. |
| max_iterations | int | 3 | Max retries/loops in Agentic mode. |
| max_concurrency | int | 5 | Max concurrent pages being processed. |
| enable_rotate | bool | True | Auto-fix image orientation. |
| chart_description | bool | False | Use VLM to describe charts in Native mode. |
| render_zoom | float | 2.0 | DPI multiplier for PDF rendering. |
| debug_dir | str | None | Directory to save debug images. |
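One way to keep credentials out of source code is to assemble these parameters from environment variables. The sketch below is only a suggestion: the `DOCVISION_*` variable names and the `build_parser_kwargs` helper are hypothetical, and the lower temperature is a common choice for transcription tasks rather than a documented recommendation. The keyword names themselves come from the table above.

```python
import os
from typing import Any, Dict, Mapping, Optional


def build_parser_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Collect DocumentParser keyword arguments from environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("DOCVISION_API_KEY")  # hypothetical variable name
    if not api_key:
        raise ValueError("DOCVISION_API_KEY is not set")
    return {
        "vlm_base_url": env.get("DOCVISION_BASE_URL", "https://api.openai.com/v1"),
        "vlm_model": env.get("DOCVISION_MODEL", "gpt-4o-mini"),
        "vlm_api_key": api_key,
        "temperature": 0.0,    # assumption: deterministic output suits transcription
        "max_tokens": 4096,    # table default
        "max_iterations": 3,   # table default
        "max_concurrency": 5,  # table default
        "render_zoom": 2.0,    # table default
    }


# Usage (requires the library to be installed):
#   from docvision import DocumentParser
#   parser = DocumentParser(**build_parser_kwargs())
```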
Architecture
DocVision Parser is built for reliability and scale:
- VLMClient: Handles asynchronous communication with OpenAI/Groq/OpenRouter with built-in retries and timeout management.
- NativePDFParser: Uses pdfplumber to extract structured text and complex tables while maintaining reading order.
- ImageProcessor: A high-performance pipeline for converting PDFs and optimizing images (resizing, padding, rotating).
- AgenticWorkflow: A state-machine that manages long-running generation tasks, ensuring complete document transcription.
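The "built-in retries and timeout management" described for VLMClient can be pictured as a wrapper around each API call. The following is a minimal sketch of that general technique (per-attempt timeout plus exponential backoff with jitter), not the library's code; `call_with_retries`, its parameters, and the choice of which exceptions count as retryable are assumptions.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def call_with_retries(
    make_call: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,
    timeout: float = 30.0,
    base_delay: float = 0.5,
) -> T:
    """Await make_call() with a per-attempt timeout, retrying transient failures."""
    last_error: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            # wait_for cancels the call if it exceeds the timeout.
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff with jitter: ~0.5s, ~1s, ~2s, ...
                delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                await asyncio.sleep(delay)
    raise RuntimeError(f"VLM call failed after {attempts} attempts") from last_error
```

Passing a zero-argument callable rather than a coroutine object matters here, because a coroutine can only be awaited once and each retry needs a fresh one.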
Development
# Setup
uv sync --dev
# Run Tests
make test
# Lint & Format
make lint
make format
License
Apache 2.0 License. See LICENSE for details.
Author
Fahmi Aziz Fadhil
- GitHub: @fahmiaziz98
- Email: fahmiazizfadhil09@gmail.com