Production-ready document parsing with Vision Language Models
📄 DocVision Parser
A document parsing framework powered by Vision Language Models (VLMs) and OCR.
[!WARNING] This project is still under active development and is not ready for production environments. The API, code structure, and behavior may change at any time without prior notice. Use only in development or experimental environments.
Overview
DocVision Parser is a Python library for extracting high-quality structured text and markdown from documents (images and PDFs). It combines PaddleOCR ONNX for fast, offline text extraction with the reasoning power of Vision Language Models (GPT-4o, Claude, Llama, etc.).
Three parsing modes:
| Mode | Best For | Requires |
|---|---|---|
| BASIC_OCR | Fast offline extraction, no GPU needed | — |
| VLM | Complex layouts, handwriting, mixed content | VLM API key |
| AGENTIC | Long documents, dense tables, self-correcting | VLM API key |
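The table can be read as a simple decision rule. The sketch below encodes it in a hypothetical helper, `pick_mode`. The function, its parameters, and the local `Mode` enum are illustrative stand-ins and not part of DocVision; real code would use `ParsingMode` from the library.

```python
from enum import Enum


class Mode(Enum):
    """Local stand-in for docvision.ParsingMode so the sketch runs on its own."""
    BASIC_OCR = "basic_ocr"
    VLM = "vlm"
    AGENTIC = "agentic"


def pick_mode(has_api_key: bool, complex_layout: bool = False,
              long_or_dense: bool = False) -> Mode:
    """Map document traits to a parsing mode, following the table above."""
    if not has_api_key:
        # VLM and AGENTIC both need an API key, so OCR is the only option.
        return Mode.BASIC_OCR
    if long_or_dense:
        return Mode.AGENTIC
    if complex_layout:
        return Mode.VLM
    return Mode.BASIC_OCR
```

The ordering is a judgment call: AGENTIC is checked first because it is the mode the table recommends for the hardest documents.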
What's New in v0.3.0
- `BASIC_OCR` mode: PaddleOCR ONNX via RapidOCR, with models auto-downloaded from HuggingFace on first use. No PyTorch, no GPU required.
- Dual preprocessing pipeline: `preprocess_for_ocr` (CLAHE, deskew, DPI normalization) and `preprocess_for_vlm` (adaptive resize, rotation, crop) are now separate optimized pipelines.
- Agentic reflect pattern: a critic/refiner pair replaces the old repetition-detection loop. The critic uses Pydantic structured output for reliable evaluation.
- Multi-language OCR: English, Latin (ID/FR/DE/ES), Chinese, Korean, Arabic, Hindi, Tamil, Telugu.
- Breaking: `ParsingMode.PDF` renamed to `ParsingMode.BASIC_OCR`.
- Breaking: `process_image()` replaced by `preprocess_for_ocr()` / `preprocess_for_vlm()`.
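If mode names are stored as strings (in a config file, for example), the rename can be handled with a small lookup. This is a minimal sketch; `migrate_mode_name` is a hypothetical helper and not something the library ships.

```python
# Old member name -> new member name, taken from the breaking change listed above.
RENAMED_MODES = {"PDF": "BASIC_OCR"}


def migrate_mode_name(name: str) -> str:
    """Return the v0.3.0 name for a ParsingMode member name; unknown names pass through."""
    return RENAMED_MODES.get(name, name)
```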
Installation
pip install docvision
Or using uv (recommended):
uv add docvision
Note: OCR models (~100MB) are downloaded automatically to ~/.cache/docvision/models/ on first use.
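To check whether the models are already cached (before going offline, say), the directory can be inspected directly. The sketch below uses only the standard library; `cache_size_mb` is a hypothetical helper, and the default path is the one quoted in the note.

```python
from pathlib import Path

# Cache location quoted in the note above.
DEFAULT_CACHE = Path.home() / ".cache" / "docvision" / "models"


def cache_size_mb(cache_dir: Path = DEFAULT_CACHE) -> float:
    """Total size of all files under cache_dir in MB; 0.0 if the directory is missing."""
    if not cache_dir.is_dir():
        return 0.0
    total = sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())
    return total / 1_000_000
```

A result near zero suggests the first run will still need network access.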
Quick Start
BASIC_OCR — No API key needed
import asyncio

from docvision import DocumentParser, ParsingMode


async def main():
    parser = DocumentParser(
        ocr_language="english",  # or "latin" for Indonesian/European
    )

    # Parse a single image
    result = await parser.parse_image("document.jpg", parsing_mode=ParsingMode.BASIC_OCR)
    print(result.content)

    # Parse a PDF
    results = await parser.parse_pdf("report.pdf", parsing_mode=ParsingMode.BASIC_OCR)
    for page in results:
        print(f"Page {page.metadata['page_number']}:\n{page.content}")


asyncio.run(main())
VLM — High-fidelity parsing
import asyncio

from docvision import DocumentParser, ParsingMode


async def main():
    parser = DocumentParser(
        base_url="https://api.openai.com/v1",
        model_name="gpt-4o-mini",
        api_key="your_api_key",
    )
    result = await parser.parse_image("scanned.jpg", parsing_mode=ParsingMode.VLM)
    print(result.content)


asyncio.run(main())
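Hard-coding the API key is fine for a quick test, but it is easy to commit by accident. A minimal sketch of reading the VLM settings from the environment instead is shown below. The environment variable names and the `vlm_settings_from_env` helper are assumptions made for this example, not DocVision conventions; only the three keyword names come from the constructor call above.

```python
import os


def vlm_settings_from_env(env=None) -> dict:
    """Build the VLM keyword arguments from environment variables.

    The variable names are assumptions for this sketch.
    """
    env = os.environ if env is None else env
    api_key = env.get("DOCVISION_API_KEY")
    if not api_key:
        raise RuntimeError("DOCVISION_API_KEY is not set")
    return {
        "base_url": env.get("DOCVISION_BASE_URL", "https://api.openai.com/v1"),
        "model_name": env.get("DOCVISION_MODEL", "gpt-4o-mini"),
        "api_key": api_key,
    }
```

The result can then be unpacked into the constructor, as in `DocumentParser(**vlm_settings_from_env())`.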
AGENTIC — Self-correcting for complex documents
import asyncio

from docvision import DocumentParser, ParsingMode


async def main():
    parser = DocumentParser(
        base_url="https://api.openai.com/v1",
        model_name="gpt-4o",
        api_key="your_api_key",
        max_reflect_cycles=2,  # critic→refine cycles per page (default: 2, max recommended: 2)
    )
    results = await parser.parse_pdf(
        "dense_report.pdf",
        parsing_mode=ParsingMode.AGENTIC,
        start_page=1,
        end_page=10,
    )
    for page in results:
        print(f"Page {page.metadata['page_number']} "
              f"(critic score: {page.metadata['final_critic_score']}):\n"
              f"{page.content}")


asyncio.run(main())
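The critic score in the page metadata can be used to flag pages for manual review. The sketch below works on a stand-in `Page` class that carries only the `content` and `metadata` attributes used in the example above. The helper name and the default threshold of 8 are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Stand-in for a parse result: only the two attributes used above."""
    content: str
    metadata: dict = field(default_factory=dict)


def pages_needing_review(pages, min_score=8):
    """Return page numbers whose final critic score is below min_score.

    Pages without a score are treated as score 0, so they are flagged too.
    """
    return [
        p.metadata["page_number"]
        for p in pages
        if p.metadata.get("final_critic_score", 0) < min_score
    ]
```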
Advanced Features
Structured Output (JSON)
Extract data directly into Pydantic models using VLM mode.
import asyncio
from typing import List

from pydantic import BaseModel

from docvision import DocumentParser


class LineItem(BaseModel):
    description: str
    quantity: int
    price: float


class Invoice(BaseModel):
    invoice_no: str
    total: float
    items: List[LineItem]


async def main():
    parser = DocumentParser(
        base_url="...",
        model_name="gpt-4o",
        api_key="...",
        system_prompt="Extract all invoice fields accurately.",
    )
    result = await parser.parse_image("invoice.png", output_schema=Invoice)
    # result.content is a JSON string of the validated Invoice
    print(result.content)


asyncio.run(main())
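Because `result.content` is a JSON string, it can be checked after extraction. The sketch below verifies that the line items add up to the stated total, using only the standard library. `check_invoice_total` and the sample payload are made up for this example; with Pydantic v2 installed, `Invoice.model_validate_json(...)` would give a typed object instead of a plain dict.

```python
import json


def check_invoice_total(content: str, tolerance: float = 0.01) -> bool:
    """True if sum(quantity * price) over the items matches the stated total."""
    invoice = json.loads(content)
    computed = sum(item["quantity"] * item["price"] for item in invoice["items"])
    return abs(computed - invoice["total"]) <= tolerance


# Sample payload shaped like the Invoice schema above (values are made up).
sample = json.dumps({
    "invoice_no": "INV-001",
    "total": 25.0,
    "items": [
        {"description": "Widget", "quantity": 2, "price": 10.0},
        {"description": "Shipping", "quantity": 1, "price": 5.0},
    ],
})
```

A mismatch does not say which field is wrong, only that the page is worth a second look.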
Multi-language OCR
# Indonesian, French, German, Spanish, etc. → use "latin"
parser = DocumentParser(ocr_language="latin")

# Other scripts each have their own value:
# "chinese", "korean", "arabic", "hindi", "tamil", "telugu"
parser = DocumentParser(ocr_language="chinese")

# Custom model directory (skip auto-download)
parser = DocumentParser(
    ocr_language="english",
    ocr_model_dir="/path/to/models",
)
Save Results
# Save as Markdown
await parser.parse_pdf("input.pdf", save_path="output/result.md")
# Save as JSON
await parser.parse_pdf("input.pdf", save_path="output/result.json")
# Save to directory (auto-creates output.json inside)
await parser.parse_pdf("input.pdf", save_path="output/")
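The three calls suggest that the output format follows from `save_path`: a `.md` or `.json` suffix selects the format, and anything else is treated as a directory. The sketch below spells that rule out with `pathlib`. `resolve_save_target` is a hypothetical function written for illustration, and the library's actual resolution logic may differ.

```python
from pathlib import Path


def resolve_save_target(save_path: str) -> tuple[Path, str]:
    """Return (file path, format) for a save_path, following the three cases above."""
    path = Path(save_path)
    suffix = path.suffix.lower()
    if suffix == ".md":
        return path, "markdown"
    if suffix == ".json":
        return path, "json"
    # No recognised extension: treat the path as a directory.
    return path / "output.json", "json"
```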
Configuration
parser = DocumentParser(
    # VLM config (required for VLM and AGENTIC modes)
    base_url="https://api.openai.com/v1",
    model_name="gpt-4o",
    api_key="your_key",
    temperature=0.7,
    max_tokens=4096,
    system_prompt=None,

    # Agentic config
    max_reflect_cycles=2,     # values > 2 emit UserWarning

    # OCR config (for BASIC_OCR mode)
    ocr_language="english",   # see supported languages below
    ocr_model_dir=None,       # None = auto-download to ~/.cache/docvision/

    # Image processing
    enable_crop=True,         # crop image to content
    enable_rotate=True,       # auto-correct orientation
    enable_deskew=True,       # correct small skew angles (OCR mode)
    dpi=300,                  # PDF render DPI multiplier
    post_crop_max_size=1024,  # max image dimension for VLM input
    max_concurrency=5,        # max concurrent pages
    debug_dir=None,           # save debug images here
)
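A cap such as `max_concurrency` is commonly implemented with an `asyncio.Semaphore`. The sketch below shows that general pattern with a caller-supplied `worker` coroutine. It is not DocVision's internal code, and `process_pages` is a hypothetical name.

```python
import asyncio


async def process_pages(page_numbers, worker, max_concurrency=5):
    """Run worker(page) for every page, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(page):
        # Each task waits here until one of the max_concurrency slots is free.
        async with semaphore:
            return await worker(page)

    # gather preserves input order, so results line up with page_numbers.
    return await asyncio.gather(*(guarded(p) for p in page_numbers))
```

Lower values reduce pressure on rate-limited VLM endpoints at the cost of total wall-clock time.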
Supported OCR Languages
| Value | Covers |
|---|---|
| "english" | English |
| "latin" | Indonesian, French, German, Spanish, Portuguese, and other Latin-script languages |
| "chinese" | Simplified + Traditional Chinese |
| "korean" | Korean |
| "arabic" | Arabic |
| "hindi" | Hindi (Devanagari) |
| "tamil" | Tamil |
| "telugu" | Telugu |
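When the document language is known as an ISO 639-1 code, a small lookup can translate it to an `ocr_language` value. The mapping below is a sketch built from the table; the dictionary and `ocr_language_for` are hypothetical helpers, and only the Latin-script languages named in the table are included.

```python
# ISO 639-1 code -> ocr_language value, following the table above.
OCR_LANGUAGE_BY_ISO = {
    "en": "english",
    "id": "latin", "fr": "latin", "de": "latin", "es": "latin", "pt": "latin",
    "zh": "chinese",
    "ko": "korean",
    "ar": "arabic",
    "hi": "hindi",
    "ta": "tamil",
    "te": "telugu",
}


def ocr_language_for(iso_code: str) -> str:
    """Look up the ocr_language value for a two-letter language code."""
    try:
        return OCR_LANGUAGE_BY_ISO[iso_code.lower()]
    except KeyError:
        raise ValueError(f"no OCR language mapped for {iso_code!r}") from None
```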
Architecture
DocumentParser
├── VLMClient — async OpenAI-compatible API
├── OCREngine — PaddleOCR ONNX via RapidOCR, HuggingFace
├── ImageProcessor
│ ├── preprocess_for_ocr() — deskew, DPI normalization, CLAHE contrast
│ └── preprocess_for_vlm() — adaptive resize
└── AgenticWorkflow (LangGraph)
├── generate — initial VLM parse
├── critic — structural evaluation via Pydantic structured output
├── refine — targeted fix based on critic issues
└── complete — terminal node
Agentic reflect loop:
generate → critic ──(score ≥ 8 or max cycles)──→ complete → END
              └──(score < 8)──→ refine → critic (loop)
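The control flow of the reflect loop can be sketched in plain Python. The library builds this as a LangGraph graph; the version below only mirrors the diagram, using caller-supplied `generate`, `critic`, and `refine` functions. `reflect_loop`, its parameters, and the `(score, issues)` return shape of `critic` are assumptions for illustration. The pass score of 8 is taken from the exit condition in the diagram.

```python
def reflect_loop(generate, critic, refine, max_cycles=2, pass_score=8):
    """Run generate -> critic -> (refine -> critic)* and return (draft, score, cycles)."""
    draft = generate()
    score, issues = critic(draft)
    cycles = 0
    # Stop once the critic is satisfied or the cycle budget is spent.
    while score < pass_score and cycles < max_cycles:
        draft = refine(draft, issues)
        score, issues = critic(draft)
        cycles += 1
    return draft, score, cycles
```

Each extra cycle costs one refine call and one critic call per page, which is one reason to keep the cycle limit small.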
Development
# Setup
uv sync --dev
# Run tests
make test
# Lint & format
make lint
make format
License
Apache 2.0. See LICENSE for details.
Author
Fahmi Aziz Fadhil
- GitHub: @fahmiaziz98
- Email: fahmiazizfadhil09@gmail.com