Production-ready document parsing with Vision Language Models

📄 DocVision Parser

A document parsing framework powered by Vision Language Models (VLMs) and OCR.

Warning: This project is under active development and is not ready for production environments. The API, code structure, and behavior may change at any time without prior notice. Use it only in development or experimental environments.

Overview

DocVision Parser is a Python library for extracting high-quality structured text and markdown from documents (images and PDFs). It combines PaddleOCR ONNX for fast, offline text extraction with the reasoning power of Vision Language Models (GPT-4o, Claude, Llama, etc.).

Three parsing modes:

Mode	Best For	Requires
BASIC_OCR	Fast offline extraction, no GPU needed	—
VLM	Complex layouts, handwriting, mixed content	VLM API key
AGENTIC	Long documents, dense tables, self-correcting	VLM API key
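
The table above can be read as a small decision rule. The sketch below is one way to encode it; `choose_mode` and its flags are hypothetical helpers, not part of the docvision API, and the function returns mode names as plain strings so it runs without the library installed.

```python
def choose_mode(has_api_key: bool,
                complex_layout: bool = False,
                long_or_dense: bool = False) -> str:
    """Return the name of a parsing mode, following the table above.

    Hypothetical helper: the flags are illustrative, not docvision parameters.
    """
    if not has_api_key:
        # VLM and AGENTIC both require a VLM API key, so OCR is the only option.
        return "BASIC_OCR"
    if long_or_dense:
        # Long documents and dense tables are listed under AGENTIC.
        return "AGENTIC"
    if complex_layout:
        # Complex layouts, handwriting, and mixed content are listed under VLM.
        return "VLM"
    # Simple documents do not need a VLM call at all.
    return "BASIC_OCR"
```

With the library installed, the returned name could be resolved with `getattr(ParsingMode, choose_mode(...))`.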

What's New in v0.3.0

- BASIC_OCR mode: PaddleOCR ONNX via RapidOCR, with models auto-downloaded from HuggingFace on first use. No PyTorch or GPU required.
- Dual preprocessing pipeline: preprocess_for_ocr (CLAHE, deskew, DPI normalization) and preprocess_for_vlm (adaptive resize, rotation, crop) are now separate, optimized pipelines.
- Agentic reflect pattern: a critic/refiner pair replaces the old repetition-detection loop. The critic uses Pydantic structured output for reliable evaluation.
- Multi-language OCR: English, Latin (ID/FR/DE/ES), Chinese, Korean, Arabic, Hindi, Tamil, Telugu.
- Breaking: ParsingMode.PDF is renamed to ParsingMode.BASIC_OCR.
- Breaking: process_image() is replaced by preprocess_for_ocr() / preprocess_for_vlm().
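
If mode names from an earlier release are stored in config files or job records, the rename can be handled with a small lookup. This is only a sketch: `LEGACY_MODE_NAMES` and `migrate_mode_name` are hypothetical names, and the mapping contains just the one rename stated above.

```python
# Hypothetical migration table; only the rename documented above is included.
LEGACY_MODE_NAMES = {"PDF": "BASIC_OCR"}


def migrate_mode_name(name: str) -> str:
    """Translate a stored pre-0.3.0 mode name to its current name."""
    return LEGACY_MODE_NAMES.get(name, name)
```

Names that were not renamed pass through unchanged, so the helper is safe to apply to every stored value.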

Installation

pip install docvision

Or using uv (recommended):

uv add docvision

Note: OCR models (~100MB) are downloaded automatically to ~/.cache/docvision/models/ on first use.
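
To check whether the models are already cached (for example, before running in an offline environment), the cache directory can be inspected directly. The sketch below uses only the standard library; `cached_model_bytes` is a hypothetical helper, and the default path is the one given in the note above.

```python
from pathlib import Path
from typing import Optional


def cached_model_bytes(cache_dir: Optional[Path] = None) -> int:
    """Return the total size in bytes of files under the model cache.

    Returns 0 when the directory does not exist yet (nothing downloaded).
    """
    root = cache_dir or Path.home() / ".cache" / "docvision" / "models"
    if not root.is_dir():
        return 0
    # Walk the tree recursively and add up regular files only.
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
```

A result of 0 suggests the first parse will trigger a download.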

Quick Start

BASIC_OCR — No API key needed

import asyncio
from docvision import DocumentParser, ParsingMode

async def main():
    parser = DocumentParser(
        ocr_language="english",  # or "latin" for Indonesian/European
    )

    # Parse a single image
    result = await parser.parse_image("document.jpg", parsing_mode=ParsingMode.BASIC_OCR)
    print(result.content)

    # Parse a PDF
    results = await parser.parse_pdf("report.pdf", parsing_mode=ParsingMode.BASIC_OCR)
    for page in results:
        print(f"Page {page.metadata['page_number']}:\n{page.content}")

asyncio.run(main())
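
Because parse_image returns a single result and parse_pdf returns a list of pages, callers that handle mixed inputs may want one entry point. The following is a sketch under that assumption; `parse_any` and `IMAGE_SUFFIXES` are hypothetical, and the set of image extensions is illustrative rather than the library's supported list.

```python
import asyncio
from pathlib import Path

# Illustrative set of image extensions; adjust to what your inputs need.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


async def parse_any(parser, path: str, **kwargs) -> list:
    """Dispatch on file extension and always return a list of page results."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return list(await parser.parse_pdf(path, **kwargs))
    if suffix in IMAGE_SUFFIXES:
        # Wrap the single image result so callers can iterate uniformly.
        return [await parser.parse_image(path, **kwargs)]
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

Usage would look like `pages = await parse_any(parser, "report.pdf", parsing_mode=ParsingMode.BASIC_OCR)`.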

VLM — High-fidelity parsing

import asyncio
from docvision import DocumentParser, ParsingMode

async def main():
    parser = DocumentParser(
        base_url="https://api.openai.com/v1",
        model_name="gpt-4o-mini",
        api_key="your_api_key",
    )

    result = await parser.parse_image("scanned.jpg", parsing_mode=ParsingMode.VLM)
    print(result.content)

asyncio.run(main())
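
Hardcoding the API key is fine for a quick test, but reading it from the environment keeps it out of source control. A minimal sketch follows; the variable names `DOCVISION_API_KEY`, `DOCVISION_BASE_URL`, and `DOCVISION_MODEL` are assumptions made for this example, not names the library reads on its own.

```python
import os


def vlm_kwargs_from_env(env=os.environ) -> dict:
    """Build DocumentParser keyword arguments from environment variables.

    The variable names are hypothetical; the defaults mirror the example above.
    """
    api_key = env.get("DOCVISION_API_KEY")
    if not api_key:
        raise RuntimeError("Set DOCVISION_API_KEY before using VLM or AGENTIC mode")
    return {
        "base_url": env.get("DOCVISION_BASE_URL", "https://api.openai.com/v1"),
        "model_name": env.get("DOCVISION_MODEL", "gpt-4o-mini"),
        "api_key": api_key,
    }
```

The result can be unpacked into the constructor: `DocumentParser(**vlm_kwargs_from_env())`.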

AGENTIC — Self-correcting for complex documents

import asyncio
from docvision import DocumentParser, ParsingMode

async def main():
    parser = DocumentParser(
        base_url="https://api.openai.com/v1",
        model_name="gpt-4o",
        api_key="your_api_key",
        max_reflect_cycles=2,  # critic→refine cycles per page (default: 2, max recommended: 2)
    )

    results = await parser.parse_pdf(
        "dense_report.pdf",
        parsing_mode=ParsingMode.AGENTIC,
        start_page=1,
        end_page=10,
    )

    for page in results:
        print(f"Page {page.metadata['page_number']} "
              f"(critic score: {page.metadata['final_critic_score']}):\n"
              f"{page.content}")

asyncio.run(main())
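
The critic score in the page metadata can be used to flag pages for manual review. The sketch below assumes each result exposes a `metadata` dict with the `page_number` and `final_critic_score` keys used above; `pages_needing_review` and the default threshold of 8 are choices made for this example.

```python
def pages_needing_review(pages, min_score=8):
    """Return page numbers whose final critic score is missing or below min_score."""
    flagged = []
    for page in pages:
        score = page.metadata.get("final_critic_score")
        # Treat a missing score as unreviewed rather than as passing.
        if score is None or score < min_score:
            flagged.append(page.metadata["page_number"])
    return flagged
```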

Advanced Features

Structured Output (JSON)

Extract data directly into Pydantic models using VLM mode.

import asyncio
from typing import List

from pydantic import BaseModel

from docvision import DocumentParser

class LineItem(BaseModel):
    description: str
    quantity: int
    price: float

class Invoice(BaseModel):
    invoice_no: str
    total: float
    items: List[LineItem]

async def main():
    parser = DocumentParser(
        base_url="...",
        model_name="gpt-4o",
        api_key="...",
        system_prompt="Extract all invoice fields accurately.",
    )

    result = await parser.parse_image("invoice.png", output_schema=Invoice)
    # result.content is a JSON string of the validated Invoice
    print(result.content)

asyncio.run(main())
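
Since result.content is a JSON string, it can be checked after extraction with the standard library alone. The sketch below verifies that the line items add up to the stated total. It assumes `price` is a unit price and that `total` carries no tax or discount, which may not hold for real invoices; `invoice_total_matches` is a hypothetical helper.

```python
import json
import math


def invoice_total_matches(content: str, tolerance: float = 0.01) -> bool:
    """Check that sum(quantity * price) over the items equals the invoice total.

    Assumes unit prices and a total without tax or discounts.
    """
    data = json.loads(content)
    computed = sum(item["quantity"] * item["price"] for item in data["items"])
    return math.isclose(computed, data["total"], abs_tol=tolerance)
```

A mismatch does not prove the extraction is wrong, but it is a cheap signal that a page deserves a second look.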

Multi-language OCR

# Indonesian, French, German, Spanish, etc. → use "latin"
parser = DocumentParser(ocr_language="latin")

# Other scripts each have their own value: "chinese", "korean", "arabic", "hindi", "tamil", "telugu"
parser = DocumentParser(ocr_language="chinese")

# Custom model directory (skip auto-download)
parser = DocumentParser(
    ocr_language="english",
    ocr_model_dir="/path/to/models",
)

Save Results

# Save as Markdown
await parser.parse_pdf("input.pdf", save_path="output/result.md")

# Save as JSON
await parser.parse_pdf("input.pdf", save_path="output/result.json")

# Save to directory (auto-creates output.json inside)
await parser.parse_pdf("input.pdf", save_path="output/")
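
The three cases above follow a simple rule based on the path. The sketch below restates that rule in plain Python so it can be checked before a long parse starts; it illustrates the documented behavior and is not the library's implementation. `resolve_save_target` is a hypothetical name, and treating any path without a suffix as a directory is an assumption.

```python
from pathlib import Path


def resolve_save_target(save_path: str) -> tuple:
    """Return (output file path, format name) for a given save_path."""
    path = Path(save_path)
    suffix = path.suffix.lower()
    if suffix == ".md":
        return path, "markdown"
    if suffix == ".json":
        return path, "json"
    if suffix == "":
        # No extension: assume a directory and place output.json inside it.
        return path / "output.json", "json"
    raise ValueError(f"Unsupported output extension: {suffix!r}")
```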

Configuration

parser = DocumentParser(
    # VLM config (required for VLM and AGENTIC modes)
    base_url="https://api.openai.com/v1",
    model_name="gpt-4o",
    api_key="your_key",
    temperature=0.7,
    max_tokens=4096,
    system_prompt=None,

    # Agentic config
    max_reflect_cycles=2,       # values > 2 emit UserWarning

    # OCR config (for BASIC_OCR mode)
    ocr_language="english",     # see supported languages below
    ocr_model_dir=None,         # None = auto-download to ~/.cache/docvision/models/

    # Image processing
    enable_crop=True,           # crop image to content
    enable_rotate=True,         # auto-correct orientation
    enable_deskew=True,         # correct small skew angles (OCR mode)
    dpi=300,                    # PDF render DPI multiplier
    post_crop_max_size=1024,    # max image dimension for VLM input
    max_concurrency=5,          # max concurrent pages
    debug_dir=None,             # save debug images here
)
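
The max_concurrency setting caps how many pages are processed at once. A common way to get that behavior in asyncio is a semaphore, sketched below. This illustrates the general technique and is not taken from the docvision source; `run_bounded` is a hypothetical helper.

```python
import asyncio


async def run_bounded(jobs, max_concurrency: int = 5) -> list:
    """Run zero-argument coroutine factories with a cap on jobs in flight.

    Results are returned in the same order as the input jobs.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(job):
        # Each job waits here until one of the max_concurrency slots is free.
        async with semaphore:
            return await job()

    return await asyncio.gather(*(run_one(job) for job in jobs))
```

Lower values reduce pressure on API rate limits; higher values shorten wall-clock time when the backend can keep up.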

Supported OCR Languages

Value	Covers
`"english"`	English
`"latin"`	Indonesian, French, German, Spanish, Portuguese, and other Latin-script languages
`"chinese"`	Simplified + Traditional Chinese
`"korean"`	Korean
`"arabic"`	Arabic
`"hindi"`	Hindi (Devanagari)
`"tamil"`	Tamil
`"telugu"`	Telugu
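
When the document language is known as an ISO 639-1 code, the table above can be turned into a lookup. The mapping below covers only the languages named in the table; `OCR_LANGUAGE_BY_ISO` and `ocr_language_for` are hypothetical helpers, not part of docvision.

```python
# ISO 639-1 code -> ocr_language value, restricted to languages named in the table.
OCR_LANGUAGE_BY_ISO = {
    "en": "english",
    "id": "latin", "fr": "latin", "de": "latin", "es": "latin", "pt": "latin",
    "zh": "chinese",
    "ko": "korean",
    "ar": "arabic",
    "hi": "hindi",
    "ta": "tamil",
    "te": "telugu",
}


def ocr_language_for(iso_code: str) -> str:
    """Map a code such as "id" or "zh-TW" to an ocr_language value."""
    # Drop any region subtag ("zh-TW" -> "zh") before the lookup.
    key = iso_code.strip().lower().split("-")[0]
    try:
        return OCR_LANGUAGE_BY_ISO[key]
    except KeyError:
        raise ValueError(f"No OCR language mapped for {iso_code!r}") from None
```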

Architecture

DocumentParser
├── VLMClient          - async OpenAI-compatible API client
├── OCREngine          - PaddleOCR ONNX via RapidOCR, models from HuggingFace
├── ImageProcessor
│   ├── preprocess_for_ocr()   - deskew, DPI normalization, CLAHE contrast
│   └── preprocess_for_vlm()   - adaptive resize, rotation, crop
└── AgenticWorkflow (LangGraph)
    ├── generate   - initial VLM parse
    ├── critic     - structural evaluation via Pydantic structured output
    ├── refine     - targeted fix based on critic issues
    └── complete   - terminal node

Agentic reflect loop:

generate → critic ──(score ≥ 8 or max cycles)──→ complete → END
               └──(score < 8 and cycles remain)──→ refine → critic (loop)
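
The control flow of the reflect loop can be sketched in plain Python. This is not the LangGraph workflow itself: `reflect` is a hypothetical function, the three callables stand in for the generate, critic, and refine nodes, and a pass threshold of 8 is assumed.

```python
def reflect(generate, critic, refine, max_cycles: int = 2, pass_score: int = 8):
    """Run generate -> critic -> (refine -> critic)* and return the final state.

    critic(draft) must return (score, issues); refine(draft, issues) returns a new draft.
    Returns (final draft, final score, number of refine cycles used).
    """
    draft = generate()
    cycles = 0
    while True:
        score, issues = critic(draft)
        # Complete when the draft passes or the cycle budget is exhausted.
        if score >= pass_score or cycles >= max_cycles:
            return draft, score, cycles
        draft = refine(draft, issues)
        cycles += 1
```

With max_cycles=2 the critic runs at most three times per page: once on the initial parse and once after each refinement.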

Development

# Setup
uv sync --dev

# Run tests
make test

# Lint & format
make lint
make format

License

Apache 2.0. See LICENSE for details.

Author

Fahmi Aziz Fadhil

GitHub: @fahmiaziz98
Email: fahmiazizfadhil09@gmail.com

Release history

0.3.0 (this version): Feb 27, 2026
0.2.0: Feb 21, 2026

Download files

Source Distribution

docvision-0.3.0.tar.gz (6.9 MB)

Uploaded Feb 27, 2026 Source

Built Distribution

docvision-0.3.0-py3-none-any.whl (37.4 kB)

Uploaded Feb 27, 2026 Python 3

File details

Details for the file docvision-0.3.0.tar.gz.

File metadata

Download URL: docvision-0.3.0.tar.gz
Upload date: Feb 27, 2026
Size: 6.9 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docvision-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`a0afcc9e673057576096f207b00b52e2c0e2107f77eb3adbfba2e4380ec37017`
MD5	`fe7f30e21541fb546e1b44b75b568eb8`
BLAKE2b-256	`e4493d328e969ac6483cf0d794fe765e8f8c86f0b4bc48df0b6a5fb6ddbaed03`
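
A downloaded file can be checked against the SHA256 digest in the table above using the standard library. A minimal sketch, with `sha256_of` as a hypothetical helper name:

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For the source distribution, the check would be `sha256_of("docvision-0.3.0.tar.gz") == "a0afcc9e673057576096f207b00b52e2c0e2107f77eb3adbfba2e4380ec37017"`.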

Provenance

The following attestation bundles were made for docvision-0.3.0.tar.gz:

Publisher: publish.yml on fahmiaziz98/docvision

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docvision-0.3.0.tar.gz
- Subject digest: a0afcc9e673057576096f207b00b52e2c0e2107f77eb3adbfba2e4380ec37017
- Sigstore transparency entry: 1003786485
- Sigstore integration time: Feb 27, 2026
Source repository:
- Permalink: fahmiaziz98/docvision@ac868c4d910100e7777c0d77ca863d18d4667db1
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/fahmiaziz98
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ac868c4d910100e7777c0d77ca863d18d4667db1
- Trigger Event: release

File details

Details for the file docvision-0.3.0-py3-none-any.whl.

File metadata

Download URL: docvision-0.3.0-py3-none-any.whl
Upload date: Feb 27, 2026
Size: 37.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docvision-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f20599b791e4aa261030adc80c6fee2bd6142991319451d1f1c9dceb1f5a8407`
MD5	`afba46a09de06191874b2e5a163fc5d2`
BLAKE2b-256	`0e24ce479de9a36b98adb534f46c0f9c6312786683d63f61ff6d567f8c35e3c8`

Provenance

The following attestation bundles were made for docvision-0.3.0-py3-none-any.whl:

Publisher: publish.yml on fahmiaziz98/docvision

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docvision-0.3.0-py3-none-any.whl
- Subject digest: f20599b791e4aa261030adc80c6fee2bd6142991319451d1f1c9dceb1f5a8407
- Sigstore transparency entry: 1003786489
- Sigstore integration time: Feb 27, 2026
Source repository:
- Permalink: fahmiaziz98/docvision@ac868c4d910100e7777c0d77ca863d18d4667db1
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/fahmiaziz98
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ac868c4d910100e7777c0d77ca863d18d4667db1
- Trigger Event: release
