Production-ready document parsing with Vision Language Models

📄 DocVision Parser

A framework for document parsing powered by Vision Language Models (VLMs) and native PDF extraction.

Overview

DocVision Parser is a robust Python library designed to extract high-quality structured text and markdown from documents (images and PDFs). It combines the speed of native PDF extraction with the reasoning power of Vision Language Models (like GPT-4o, Claude, or Llama 3.2).

The framework provides three powerful parsing modes:

PDF (Native): Ultra-fast extraction of text and tables using deterministic rules.
VLM Mode: High-fidelity single-shot parsing using Vision models to understand layout and context.
Agentic Mode: A self-correcting, iterative workflow that handles long documents and complex layouts by automatically detecting truncation or repetition.
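
As a rough illustration of the Agentic mode idea, the loop below keeps asking a model to continue until the output no longer looks cut off. This is a minimal sketch, not DocVision's implementation: `generate`, `looks_truncated`, `has_repetition`, and `agentic_transcribe` are hypothetical names, and both heuristics are simplifying assumptions.

```python
from typing import Callable


def looks_truncated(text: str) -> bool:
    """Heuristic (assumption): text not ending in closing punctuation was cut off."""
    stripped = text.rstrip()
    return bool(stripped) and stripped[-1] not in ".!?|`>)"


def has_repetition(text: str, window: int = 3) -> bool:
    """Heuristic (assumption): the same non-empty line `window` times in a row."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return any(
        len(set(lines[i:i + window])) == 1
        for i in range(len(lines) - window + 1)
    )


def agentic_transcribe(generate: Callable[[str], str], max_iterations: int = 3) -> str:
    """Call `generate(transcript_so_far)` until the output looks complete
    or the iteration budget is spent."""
    transcript = ""
    for _ in range(max_iterations):
        chunk = generate(transcript)
        if has_repetition(chunk):
            break  # drop a degenerate chunk instead of appending it
        transcript += chunk
        if not looks_truncated(transcript):
            break
    return transcript
```

In a real pipeline `generate` would wrap a VLM call that receives the page image plus the transcript so far; here it is left as a plain callable so the control flow can be tested without a model.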

Features

Hybrid PDF Parsing: Extract native text/tables and optionally use VLM to describe charts and images in-situ.
Agentic/Iterative Workflow: Self-correcting loop that handles model token limits and ensures complete transcription for long pages.
Intelligent Vision Pipeline: Automatic image rotation correction, DPI management, and dynamic optimization for the best VLM input.
Async-First: High-throughput processing with built-in concurrency control (Semaphores).
Structured Output: Native Pydantic support for extracting structured JSON data from any document.
Production-Ready: Automatic retries, error handling, and direct export to Markdown or JSON files.
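
The concurrency control mentioned above can be pictured with a plain `asyncio.Semaphore`. The sketch below is an assumption about the general pattern, not the library's internals; `process_pages` and `worker` are hypothetical names.

```python
import asyncio


async def process_pages(pages, worker, max_concurrency=5):
    """Run `worker(page)` for every page, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page):
        # Only `max_concurrency` coroutines can be inside this block at once
        async with semaphore:
            return await worker(page)

    # gather preserves input order, so results line up with pages
    return await asyncio.gather(*(bounded(p) for p in pages))
```

Bounding concurrency this way keeps the number of in-flight VLM requests predictable, which helps with provider rate limits.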

Installation

Install using pip:

pip install doc-vision-parser

Or using uv (recommended):

uv add doc-vision-parser

Quick Start

Basic Usage

Initialize the DocumentParser and parse an image into Markdown.

import asyncio
from docvision import DocumentParser

async def main():
    # Initialize the parser
    parser = DocumentParser(
        vlm_base_url="https://api.openai.com/v1",
        vlm_model="gpt-4o-mini",
        vlm_api_key="your_api_key"
    )

    # Parse an image
    result = await parser.parse_image("document.jpg")
    
    print(result.content)
    print(f"ID: {result.id}")

if __name__ == "__main__":
    asyncio.run(main())

Parsing PDFs

The parser can handle PDFs using different strategies.

import asyncio
from docvision import DocumentParser, ParsingMode

async def parse_doc():
    # Replace each ... with your own values (see Basic Usage)
    parser = DocumentParser(vlm_base_url=..., vlm_model=..., vlm_api_key=...)

    # Mode 1: Native PDF (fastest, no Vision costs)
    results = await parser.parse_pdf("report.pdf", parsing_mode=ParsingMode.PDF)

    # Mode 2: VLM (best for complex layouts/handwriting)
    results = await parser.parse_pdf("scanned.pdf", parsing_mode=ParsingMode.VLM)

    # Mode 3: Agentic (self-correcting for long tables/text)
    results = await parser.parse_pdf("dense.pdf", parsing_mode=ParsingMode.AGENTIC)

    # Save results directly to a file
    await parser.parse_pdf("input.pdf", save_path="./output/results.md")

if __name__ == "__main__":
    asyncio.run(parse_doc())

Advanced Features

Structured Output (JSON)

Extract data directly into Pydantic models.

import asyncio
from typing import List

from pydantic import BaseModel
from docvision import DocumentParser

class Item(BaseModel):
    description: str
    price: float

class Invoice(BaseModel):
    invoice_no: str
    items: List[Item]

async def main():
    # Note: system_prompt is required when using structured output
    parser = DocumentParser(
        vlm_api_key="...",
        system_prompt="Extract invoice details correctly."
    )

    result = await parser.parse_image("invoice.png", output_schema=Invoice)
    print(result.content.invoice_no)  # content is now a Pydantic object

if __name__ == "__main__":
    asyncio.run(main())

Hybrid Parsing (Native + VLM)

Use native extraction for text but let the VLM describe the charts.

import asyncio
from docvision import DocumentParser, ParsingMode

async def main():
    parser = DocumentParser(
        vlm_api_key="...",
        chart_description=True  # Enables VLM chart descriptions in Native mode
    )

    # Text and tables are extracted natively, but <chart> tags
    # will contain VLM-generated descriptions.
    results = await parser.parse_pdf("chart_heavy.pdf", parsing_mode=ParsingMode.PDF)

asyncio.run(main())
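
Once the page text is available as a string, the chart descriptions can be pulled out with a regular expression. This is a sketch that assumes the descriptions are wrapped in paired `<chart>...</chart>` tags; the exact tag format is not documented here, and `chart_descriptions` is a hypothetical helper.

```python
import re

# Assumption: descriptions appear between <chart> and </chart>
CHART_TAG = re.compile(r"<chart>(.*?)</chart>", re.DOTALL)


def chart_descriptions(markdown: str) -> list[str]:
    """Return the text inside each <chart>...</chart> block, whitespace-trimmed."""
    return [match.strip() for match in CHART_TAG.findall(markdown)]


page = "Revenue grew in Q3.\n<chart>\nBar chart: revenue by quarter.\n</chart>\nSee notes."
print(chart_descriptions(page))  # ['Bar chart: revenue by quarter.']
```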

Configuration

The DocumentParser is configured during initialization.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `vlm_base_url` | `str` | `None` | OpenAI-compatible API base URL. |
| `vlm_model` | `str` | `None` | Model name (e.g., `gpt-4o`). |
| `vlm_api_key` | `str` | `None` | Your API key. |
| `temperature` | `float` | `0.7` | Model sampling temperature. |
| `max_tokens` | `int` | `4096` | Max tokens per VLM call. |
| `max_iterations` | `int` | `3` | Max retries/loops in Agentic mode. |
| `max_concurrency` | `int` | `5` | Max concurrent pages being processed. |
| `enable_rotate` | `bool` | `True` | Auto-fix image orientation. |
| `chart_description` | `bool` | `False` | Use VLM to describe charts in Native mode. |
| `render_zoom` | `float` | `2.0` | DPI multiplier for PDF rendering. |
| `debug_dir` | `str` | `None` | Directory to save debug images. |
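
To get a feel for what `render_zoom` implies, the arithmetic below converts a page size in PDF points to pixels. It assumes the multiplier is applied to the PDF default of 72 points per inch, which is how zoom factors usually work in PDF renderers but is not confirmed for this library; `rendered_size` is a hypothetical helper.

```python
PDF_POINTS_PER_INCH = 72  # PDF user-space units are 1/72 inch by default


def rendered_size(width_pt: float, height_pt: float, render_zoom: float = 2.0):
    """Return (width_px, height_px, effective_dpi) for a page rendered at `render_zoom`."""
    dpi = PDF_POINTS_PER_INCH * render_zoom
    return round(width_pt * render_zoom), round(height_pt * render_zoom), dpi


# US Letter is 612 x 792 points
print(rendered_size(612, 792))  # (1224, 1584, 144.0)
```

Under that assumption, the default of `2.0` yields roughly 144 DPI images. Raising it gives the VLM more detail at the cost of larger image inputs.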

Architecture

DocVision Parser is built for reliability and scale:

VLMClient: Handles asynchronous communication with OpenAI/Groq/OpenRouter with built-in retries and timeout management.
NativePDFParser: Uses pdfplumber to extract structured text and complex tables while maintaining reading order.
ImageProcessor: A high-performance pipeline for converting PDFs and optimizing images (resizing, padding, rotating).
AgenticWorkflow: A state machine that manages long-running generation tasks, ensuring complete document transcription.
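
The retry and timeout handling attributed to VLMClient can be sketched with the standard library alone. The code below is an illustration of the general pattern (exponential backoff with jitter), not the library's actual client; `call_with_retries` and its parameters are hypothetical.

```python
import asyncio
import random


async def call_with_retries(call, max_attempts=3, base_delay=0.5, timeout=30.0):
    """Await `call()` with a timeout, retrying failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(call(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter keeps concurrent pages
            # from retrying in lockstep
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

Here `call` is any zero-argument coroutine function, for example a closure around one HTTP request to the model endpoint. Which exceptions count as retryable is a design choice; this sketch retries only timeouts and connection errors.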

Development

# Setup
uv sync --dev

# Run Tests
make test

# Lint & Format
make lint
make format

License

Apache 2.0 License. See LICENSE for details.

Author

Fahmi Aziz Fadhil

GitHub: @fahmiaziz98
Email: fahmiazizfadhil09@gmail.com

Release history

0.3.0 (Feb 27, 2026)
0.2.0 (Feb 21, 2026), this version
