Production-ready document parsing with Vision Language Models

📄 DocVision Parser

A document-parsing framework powered by Vision Language Models (VLMs) and OCR.

Tests · PyPI version · Python 3.10+ · License: Apache 2.0


[!WARNING] This project is still under active development and is not ready for production environments. The API, code structure, and behavior may change at any time without prior notice. Use only in development or experimental environments.


Overview

DocVision Parser is a Python library for extracting high-quality structured text and markdown from documents (images and PDFs). It combines PaddleOCR ONNX for fast, offline text extraction with the reasoning power of Vision Language Models (GPT-4o, Claude, Llama, etc.).

Three parsing modes:

Mode | Best For | Requires
--- | --- | ---
BASIC_OCR | Fast offline extraction | Nothing (no GPU needed)
VLM | Complex layouts, handwriting, mixed content | VLM API key
AGENTIC | Long documents, dense tables, self-correcting | VLM API key

What's New in v0.3.0

  • BASIC_OCR mode — PaddleOCR ONNX via RapidOCR, models auto-downloaded from HuggingFace on first use. No PyTorch, no GPU required.
  • Dual preprocessing pipelines — preprocess_for_ocr (CLAHE, deskew, DPI normalization) and preprocess_for_vlm (adaptive resize, rotation, crop) are now separate, optimized pipelines.
  • Agentic reflect pattern — Critic/refiner replace the old repetition-detection loop. Critic uses Pydantic structured output for reliable evaluation.
  • Multi-language OCR — English, Latin (ID/FR/DE/ES), Chinese, Korean, Arabic, Hindi, Tamil, Telugu.
  • Breaking: ParsingMode.PDF renamed to ParsingMode.BASIC_OCR.
  • Breaking: process_image() replaced by preprocess_for_ocr() / preprocess_for_vlm().

Installation

pip install docvision

Or using uv (recommended):

uv add docvision

Note: OCR models (~100MB) are downloaded automatically to ~/.cache/docvision/models/ on first use.


Quick Start

BASIC_OCR — No API key needed

import asyncio
from docvision import DocumentParser, ParsingMode

async def main():
    parser = DocumentParser(
        ocr_language="english",  # or "latin" for Indonesian/European
    )

    # Parse a single image
    result = await parser.parse_image("document.jpg", parsing_mode=ParsingMode.BASIC_OCR)
    print(result.content)

    # Parse a PDF
    results = await parser.parse_pdf("report.pdf", parsing_mode=ParsingMode.BASIC_OCR)
    for page in results:
        print(f"Page {page.metadata['page_number']}:\n{page.content}")

asyncio.run(main())

VLM — High-fidelity parsing

import asyncio
from docvision import DocumentParser, ParsingMode

async def main():
    parser = DocumentParser(
        base_url="https://api.openai.com/v1",
        model_name="gpt-4o-mini",
        api_key="your_api_key",
    )

    result = await parser.parse_image("scanned.jpg", parsing_mode=ParsingMode.VLM)
    print(result.content)

asyncio.run(main())

AGENTIC — Self-correcting for complex documents

import asyncio
from docvision import DocumentParser, ParsingMode

async def main():
    parser = DocumentParser(
        base_url="https://api.openai.com/v1",
        model_name="gpt-4o",
        api_key="your_api_key",
        max_reflect_cycles=2,  # critic→refine cycles per page (default: 2, max recommended: 2)
    )

    results = await parser.parse_pdf(
        "dense_report.pdf",
        parsing_mode=ParsingMode.AGENTIC,
        start_page=1,
        end_page=10,
    )

    for page in results:
        print(f"Page {page.metadata['page_number']} "
              f"(critic score: {page.metadata['final_critic_score']}):\n"
              f"{page.content}")

asyncio.run(main())

Advanced Features

Structured Output (JSON)

Extract data directly into Pydantic models using VLM mode.

from typing import List

from pydantic import BaseModel

from docvision import DocumentParser

class LineItem(BaseModel):
    description: str
    quantity: int
    price: float

class Invoice(BaseModel):
    invoice_no: str
    total: float
    items: List[LineItem]

parser = DocumentParser(
    base_url="...",
    model_name="gpt-4o",
    api_key="...",
    system_prompt="Extract all invoice fields accurately.",
)

# inside an async function:
result = await parser.parse_image("invoice.png", output_schema=Invoice)
# result.content is a JSON string of the validated Invoice
print(result.content)
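Since result.content arrives as a JSON string rather than a model instance, it can be decoded with the stdlib json module (or re-validated with Pydantic's model_validate_json). The payload below is a made-up example for illustration, not actual library output:

```python
import json

# Hypothetical example of what result.content might contain for the Invoice schema.
sample = (
    '{"invoice_no": "INV-001", "total": 59.97,'
    ' "items": [{"description": "Widget", "quantity": 3, "price": 19.99}]}'
)
invoice = json.loads(sample)

# Sanity-check the extraction: line items should sum to the stated total.
total_check = sum(i["quantity"] * i["price"] for i in invoice["items"])
```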

Multi-language OCR

# Indonesian, French, German, Spanish, etc. → use "latin"
parser = DocumentParser(ocr_language="latin")

# Chinese, Korean, Arabic, Hindi, Tamil, Telugu
parser = DocumentParser(ocr_language="chinese")

# Custom model directory (skip auto-download)
parser = DocumentParser(
    ocr_language="english",
    ocr_model_dir="/path/to/models",
)

Save Results

# Save as Markdown
await parser.parse_pdf("input.pdf", save_path="output/result.md")

# Save as JSON
await parser.parse_pdf("input.pdf", save_path="output/result.json")

# Save to directory (auto-creates output.json inside)
await parser.parse_pdf("input.pdf", save_path="output/")
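The save_path convention above (extension picks the format; a bare directory gets output.json inside) can be sketched as follows. This is illustrative dispatch logic, not the library's internals, and resolve_save_target is a hypothetical helper name:

```python
from pathlib import Path

def resolve_save_target(save_path: str) -> tuple[Path, str]:
    """Illustrative dispatch for the save_path convention described above."""
    p = Path(save_path)
    if p.suffix == ".md":
        return p, "markdown"
    if p.suffix == ".json":
        return p, "json"
    # No recognized extension: treat the path as a directory and write output.json inside.
    return p / "output.json", "json"
```

For example, resolve_save_target("output/") resolves to output/output.json in JSON format.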

Configuration

parser = DocumentParser(
    # VLM config (required for VLM and AGENTIC modes)
    base_url="https://api.openai.com/v1",
    model_name="gpt-4o",
    api_key="your_key",
    temperature=0.7,
    max_tokens=4096,
    system_prompt=None,

    # Agentic config
    max_reflect_cycles=2,       # values > 2 emit UserWarning

    # OCR config (for BASIC_OCR mode)
    ocr_language="english",     # see supported languages below
    ocr_model_dir=None,         # None = auto-download to ~/.cache/docvision/

    # Image processing
    enable_crop=True,           # crop image to content
    enable_rotate=True,         # auto-correct orientation
    enable_deskew=True,         # correct small skew angles (OCR mode)
    dpi=300,                    # PDF render resolution (dots per inch)
    post_crop_max_size=1024,    # max image dimension for VLM input
    max_concurrency=5,          # max concurrent pages
    debug_dir=None,             # save debug images here
)
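The max_reflect_cycles rule noted above (values > 2 emit a UserWarning) can be illustrated with a small stand-alone sketch; check_reflect_cycles and its warning message are hypothetical, not the library's actual validation code:

```python
import warnings

def check_reflect_cycles(n: int) -> int:
    """Mirror the documented rule: values above 2 emit a UserWarning but are kept."""
    if n > 2:
        warnings.warn(
            "max_reflect_cycles > 2 gives diminishing returns and extra API cost",
            UserWarning,
        )
    return n
```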

Supported OCR Languages

Value Covers
"english" English
"latin" Indonesian, French, German, Spanish, Portuguese, and other Latin-script languages
"chinese" Simplified + Traditional Chinese
"korean" Korean
"arabic" Arabic
"hindi" Hindi (Devanagari)
"tamil" Tamil
"telugu" Telugu

Architecture

DocumentParser
├── VLMClient          — async OpenAI-compatible API
├── OCREngine          — PaddleOCR ONNX via RapidOCR; models auto-downloaded from HuggingFace
├── ImageProcessor
│   ├── preprocess_for_ocr()   — deskew, DPI normalization, CLAHE contrast
│   └── preprocess_for_vlm()   — adaptive resize, rotation, crop
└── AgenticWorkflow (LangGraph)
    ├── generate   — initial VLM parse
    ├── critic     — structural evaluation via Pydantic structured output
    ├── refine     — targeted fix based on critic issues
    └── complete   — terminal node

Agentic reflect loop:

generate → critic ──(score ≥ 8 or max cycles)──→ complete → END
               └──(score < 8)──→ refine → critic (loop)
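The reflect loop above can be sketched in plain Python. Everything here is an illustrative stand-in for the LangGraph workflow described in the Architecture section; the stub generate/critic/refine functions and the toy scoring are invented for the demo:

```python
SCORE_THRESHOLD = 8    # critic score at or above which a page is accepted
MAX_REFLECT_CYCLES = 2

def reflect_loop(generate, critic, refine, max_cycles=MAX_REFLECT_CYCLES):
    """Generate once, then alternate critic/refine until the score passes or cycles run out."""
    draft = generate()
    score, issues = critic(draft)
    cycles = 0
    while score < SCORE_THRESHOLD and cycles < max_cycles:
        draft = refine(draft, issues)
        score, issues = critic(draft)
        cycles += 1
    return draft, score

# Toy demo: the critic starts at 5 and each refine pass adds 3.
state = {"score": 5}

def fake_generate():
    return "draft v1"

def fake_critic(draft):
    return state["score"], ["table rows misaligned"]

def fake_refine(draft, issues):
    state["score"] += 3
    return draft + " (refined)"

content, final_score = reflect_loop(fake_generate, fake_critic, fake_refine)
```

One refine pass lifts the toy score from 5 to 8, so the loop accepts the page without spending the second cycle.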

Development

# Setup
uv sync --dev

# Run tests
make test

# Lint & format
make lint
make format

License

Apache 2.0. See LICENSE for details.

Author

Fahmi Aziz Fadhil
