Production-ready document parsing with Vision Language Models

📄 DocVision Parser

A framework for document parsing powered by Vision Language Models (VLMs) and native PDF extraction.

Python 3.10+ | License: Apache 2.0


Overview

DocVision Parser is a Python library for extracting structured text and Markdown from documents (images and PDFs). It combines the speed of native PDF extraction with the layout understanding of Vision Language Models such as GPT-4o, Claude, or Llama 3.2.

The framework provides three parsing modes:

  1. PDF Mode (Native): Fast, rule-based extraction of text and tables, with no VLM calls.
  2. VLM Mode: Single-shot parsing in which a vision model reads each page to capture layout and context.
  3. Agentic Mode: A self-correcting, iterative workflow that handles long documents and complex layouts by automatically detecting truncated or repetitive output.
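
The truncation and repetition checks behind Agentic Mode can be pictured with a short sketch. This is not DocVision's implementation: `looks_truncated`, `looks_repetitive`, `transcribe`, the three-line repetition window, and the `generate` callback are all assumptions made for illustration. The only external convention relied on is that OpenAI-compatible APIs report a `finish_reason` of `"length"` when output hits the token limit.

```python
from typing import Callable, Tuple

# Hypothetical callback: receives the text produced so far and
# returns (new_text, finish_reason) from one model call.
Generate = Callable[[str], Tuple[str, str]]


def looks_truncated(finish_reason: str) -> bool:
    # OpenAI-compatible APIs report "length" when the token limit cut the output short.
    return finish_reason == "length"


def looks_repetitive(text: str, window: int = 3) -> bool:
    # Flag output whose last `window` non-empty lines are all identical.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    tail = lines[-window:]
    return len(tail) == window and len(set(tail)) == 1


def transcribe(generate: Generate, max_iterations: int = 3) -> str:
    """Request continuations until the output is complete or the budget is spent."""
    content = ""
    for _ in range(max_iterations):
        chunk, finish_reason = generate(content)
        if looks_repetitive(chunk):
            # Discard the degenerate chunk and stop instead of appending repeated lines.
            break
        content += chunk
        if not looks_truncated(finish_reason):
            break
    return content
```

A real workflow would likely retry with an adjusted prompt instead of stopping on repetition; the sketch stops early to keep the control flow easy to follow.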

Features

  • Hybrid PDF Parsing: Extract native text and tables, and optionally use a VLM to describe charts and images in place.
  • Agentic/Iterative Workflow: Self-correcting loop that handles model token limits and ensures complete transcription for long pages.
  • Intelligent Vision Pipeline: Automatic image rotation correction, DPI management, and dynamic optimization for the best VLM input.
  • Async-First: High-throughput processing with built-in concurrency control (semaphores).
  • Structured Output: Native Pydantic support for extracting structured JSON data from any document.
  • Production-Ready: Automatic retries, error handling, and direct export to Markdown or JSON files.
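
The semaphore-based concurrency control listed above follows a common asyncio pattern, sketched here with the standard library only. `process_page` and `parse_pages` are hypothetical stand-ins, not DocVision functions; the `max_concurrency` name simply mirrors the configuration parameter of the same name.

```python
import asyncio


async def process_page(page_number: int) -> str:
    # Hypothetical stand-in for a per-page VLM call.
    await asyncio.sleep(0.01)
    return f"page {page_number}"


async def parse_pages(page_numbers: list[int], max_concurrency: int = 5) -> list[str]:
    # The semaphore lets at most `max_concurrency` pages be in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page_number: int) -> str:
        async with semaphore:
            return await process_page(page_number)

    # gather() returns results in input order, regardless of completion order.
    return await asyncio.gather(*(bounded(n) for n in page_numbers))


if __name__ == "__main__":
    print(asyncio.run(parse_pages(list(range(1, 11)), max_concurrency=3)))
```

Bounding concurrency this way keeps a long PDF from opening dozens of simultaneous API requests, which helps stay within provider rate limits.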

Installation

Install using pip:

pip install doc-vision-parser

Or using uv (recommended):

uv add doc-vision-parser

Quick Start

Basic Usage

Initialize the DocumentParser and parse an image into Markdown.

import asyncio
from docvision import DocumentParser

async def main():
    # Initialize the parser
    parser = DocumentParser(
        vlm_base_url="https://api.openai.com/v1",
        vlm_model="gpt-4o-mini",
        vlm_api_key="your_api_key"
    )

    # Parse an image
    result = await parser.parse_image("document.jpg")
    
    print(result.content)
    print(f"ID: {result.id}")

if __name__ == "__main__":
    asyncio.run(main())

Parsing PDFs

The parser can handle PDFs using different strategies.

import asyncio
from docvision import DocumentParser, ParsingMode

async def parse_doc():
    parser = DocumentParser(vlm_base_url=..., vlm_model=..., vlm_api_key=...)

    # Mode 1: Native PDF (fastest, no VLM costs)
    results = await parser.parse_pdf("report.pdf", parsing_mode=ParsingMode.PDF)

    # Mode 2: VLM (best for complex layouts and handwriting)
    results = await parser.parse_pdf("scanned.pdf", parsing_mode=ParsingMode.VLM)

    # Mode 3: Agentic (self-correcting for long tables and text)
    results = await parser.parse_pdf("dense.pdf", parsing_mode=ParsingMode.AGENTIC)

    # Save results directly to a file
    await parser.parse_pdf("input.pdf", save_path="./output/results.md")

if __name__ == "__main__":
    asyncio.run(parse_doc())

Advanced Features

Structured Output (JSON)

Extract data directly into Pydantic models.

import asyncio
from typing import List

from pydantic import BaseModel
from docvision import DocumentParser

class Item(BaseModel):
    description: str
    price: float

class Invoice(BaseModel):
    invoice_no: str
    items: List[Item]

async def main():
    # Note: system_prompt is required when using structured output
    parser = DocumentParser(
        vlm_api_key="...",
        system_prompt="Extract invoice details correctly."
    )

    result = await parser.parse_image("invoice.png", output_schema=Invoice)
    print(result.content.invoice_no)  # Content is now a Pydantic object

if __name__ == "__main__":
    asyncio.run(main())

Hybrid Parsing (Native + VLM)

Use native extraction for text but let the VLM describe the charts.

import asyncio
from docvision import DocumentParser, ParsingMode

async def main():
    parser = DocumentParser(
        vlm_api_key="...",
        chart_description=True  # Enables VLM chart descriptions in Native mode
    )

    # Text and tables are extracted natively, but <chart> tags
    # will contain VLM-generated descriptions.
    results = await parser.parse_pdf("chart_heavy.pdf", parsing_mode=ParsingMode.PDF)

if __name__ == "__main__":
    asyncio.run(main())
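
Once hybrid output is available as Markdown text, the chart descriptions can be pulled out for separate handling. The sketch below assumes each description is wrapped in paired `<chart>...</chart>` tags; the exact tag format is not documented here, and `extract_chart_descriptions` is a hypothetical helper, not part of DocVision.

```python
import re

# Assumed tag shape: <chart> ... </chart>, possibly spanning several lines.
CHART_PATTERN = re.compile(r"<chart>(.*?)</chart>", re.DOTALL)


def extract_chart_descriptions(markdown: str) -> list[str]:
    # Return the text inside each chart tag, trimmed of surrounding whitespace.
    return [match.strip() for match in CHART_PATTERN.findall(markdown)]


sample = "# Q3 Report\n\nRevenue grew.\n\n<chart>\nBar chart of revenue by quarter.\n</chart>\n"
print(extract_chart_descriptions(sample))
```

The non-greedy `(.*?)` keeps two charts on the same page from being merged into one match.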

Configuration

The DocumentParser is configured during initialization.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| vlm_base_url | str | None | OpenAI-compatible API base URL. |
| vlm_model | str | None | Model name (e.g., gpt-4o). |
| vlm_api_key | str | None | Your API key. |
| temperature | float | 0.7 | Model sampling temperature. |
| max_tokens | int | 4096 | Max tokens per VLM call. |
| max_iterations | int | 3 | Max retries/loops in Agentic mode. |
| max_concurrency | int | 5 | Max concurrent pages being processed. |
| enable_rotate | bool | True | Auto-fix image orientation. |
| chart_description | bool | False | Use VLM to describe charts in Native mode. |
| render_zoom | float | 2.0 | DPI multiplier for PDF rendering. |
| debug_dir | str | None | Directory to save debug images. |
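
As a worked example of what `render_zoom` implies: PDF page sizes are measured in points, 72 per inch, so a renderer that treats the zoom as a multiplier on that baseline produces 144 DPI at the default of 2.0. Whether DocVision uses exactly this baseline is an assumption, and `rendered_size` is a hypothetical helper written only for the arithmetic.

```python
PDF_POINTS_PER_INCH = 72  # PDF user space unit: 1 point = 1/72 inch


def rendered_size(width_pt: float, height_pt: float, render_zoom: float = 2.0) -> tuple[int, int, float]:
    """Return (width_px, height_px, effective_dpi) for a page rendered at `render_zoom`."""
    effective_dpi = PDF_POINTS_PER_INCH * render_zoom
    return round(width_pt * render_zoom), round(height_pt * render_zoom), effective_dpi


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(rendered_size(612, 792))  # (1224, 1584, 144.0)
```

Pixel count grows with the square of the zoom, so raising `render_zoom` from 2.0 to 3.0 more than doubles the image area sent to the model.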

Architecture

DocVision Parser is built for reliability and scale:

  1. VLMClient: Handles asynchronous communication with OpenAI/Groq/OpenRouter with built-in retries and timeout management.
  2. NativePDFParser: Uses pdfplumber to extract structured text and complex tables while maintaining reading order.
  3. ImageProcessor: A high-performance pipeline for converting PDFs and optimizing images (resizing, padding, rotating).
  4. AgenticWorkflow: A state machine that manages long-running generation tasks, ensuring complete document transcription.
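
The retry and timeout handling attributed to VLMClient can be pictured as exponential backoff around a per-attempt timeout. This is a generic sketch, not the library's code: `call_with_retries`, its parameters, and the choice of which exceptions to retry are all assumptions.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_retries(
    make_call: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    timeout_s: float = 30.0,
    base_delay_s: float = 0.5,
) -> T:
    """Await `make_call()` with a per-attempt timeout, backing off between failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Exponential backoff: base, 2x base, 4x base, ...
            await asyncio.sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("loop always returns or raises")
```

Passing a zero-argument factory instead of an awaitable matters here: a coroutine object can only be awaited once, so each attempt needs a fresh one.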

Development

# Setup
uv sync --dev

# Run Tests
make test

# Lint & Format
make lint
make format

License

Apache 2.0 License. See LICENSE for details.

Author

Fahmi Aziz Fadhil
