
Vision Parse

License: MIT | Author: Arun Brahma

🚀 Parse PDF documents into beautifully formatted markdown content using state-of-the-art Vision Language Models - all with just a few lines of code!

🎯 Introduction

Vision Parse harnesses the power of Vision Language Models to revolutionize document processing:

  • 📝 Smart Content Extraction: Intelligently identifies and extracts text, tables, and LaTeX equations with high precision
  • 🎨 Advanced Content Formatting: Preserves LaTeX equations, hyperlinks, images, and footnotes in the markdown output
  • 🤖 Multi-LLM Support: Seamlessly integrates with multiple Vision LLM providers such as OpenAI, Gemini, and Llama for optimal accuracy and speed
  • 🔄 Scanned PDF Document Processing: Converts scanned PDF documents, including their text, tables, images, and LaTeX equations, into well-structured markdown content
  • 📁 Local Model Hosting: Supports local model hosting with Ollama for secure, no-cost, private, and offline document processing

🚀 Getting Started

Prerequisites

  • 🐍 Python >= 3.9
  • 🖥️ Ollama (if you want to use local models)
  • 🤖 An API key for OpenAI or Google Gemini (only if you want to use those hosted models)

Installation

Install the core package using pip (recommended):

pip install vision-parse

Install the additional dependencies for OpenAI or Gemini:

# For OpenAI support
pip install 'vision-parse[openai]'
# For Gemini support
pip install 'vision-parse[gemini]'
# To install all the additional dependencies
pip install 'vision-parse[all]'
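To confirm the installation, a quick import smoke test works (this one-liner only checks that the package and its main class can be imported):

python -c "from vision_parse import VisionParser; print('vision-parse installed')"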

Install the package from source:

pip install 'git+https://github.com/iamarunbrahma/vision-parse.git#egg=vision-parse[all]'

Setting up Ollama (Optional)

See examples/ollama_setup.md for instructions on how to set up Ollama locally.
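For reference, the usual flow is to start the Ollama server and pull a vision model before running the parser. A minimal sketch, using the model tag from the usage examples below:

# Start the Ollama server (skip this if Ollama already runs as a background service)
ollama serve

# Download the vision model used in the usage examples
ollama pull llama3.2-vision:11b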

⌛️ Usage

Basic Example Usage

from vision_parse import VisionParser

# Initialize parser
parser = VisionParser(
    model_name="llama3.2-vision:11b", # For local models, you don't need to provide the api key
    temperature=0.4,
    top_p=0.5,
    image_mode="url", # Image mode can be "url", "base64" or None
    detailed_extraction=False, # Set to True for more detailed extraction
    enable_concurrency=False, # Set to True for parallel processing
)

# Convert PDF to markdown
pdf_path = "path/to/your/document.pdf" # local path to your pdf file
markdown_pages = parser.convert_pdf(pdf_path)

# Process results
for i, page_content in enumerate(markdown_pages):
    print(f"\n--- Page {i+1} ---\n{page_content}")
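Since convert_pdf returns one markdown string per page (as the loop above suggests), you can persist the result with the standard library alone. A minimal sketch; the output filename is just an example:

from pathlib import Path

# Join the per-page markdown strings and write them to a single file
Path("document.md").write_text("\n\n".join(markdown_pages), encoding="utf-8")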

Customize Ollama configuration for better performance

from vision_parse import VisionParser

custom_prompt = """
Strictly preserve markdown formatting during text extraction from scanned document.
"""

# Initialize parser with Ollama configuration
parser = VisionParser(
    model_name="llama3.2-vision:11b",
    temperature=0.7,
    top_p=0.6,
    num_ctx=4096,
    image_mode="base64",
    custom_prompt=custom_prompt,
    detailed_extraction=True,
    ollama_config={
        "OLLAMA_NUM_PARALLEL": "8",
        "OLLAMA_REQUEST_TIMEOUT": "240.0",
    },
    enable_concurrency=True,
)

# Convert PDF to markdown
pdf_path = "path/to/your/document.pdf"
markdown_pages = parser.convert_pdf(pdf_path)
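With concurrency enabled, the same parser instance can be reused across a batch of files. A minimal sketch, assuming your PDFs sit in a local docs/ directory (the directory name is just an example):

from pathlib import Path

# Convert every PDF under docs/ and write a .md file next to each source PDF
for pdf_file in Path("docs").glob("*.pdf"):
    pages = parser.convert_pdf(str(pdf_file))
    pdf_file.with_suffix(".md").write_text("\n\n".join(pages), encoding="utf-8")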

OpenAI or Gemini Model Usage

from vision_parse import VisionParser

# Initialize parser with OpenAI model
parser = VisionParser(
    model_name="gpt-4o",
    api_key="your-openai-api-key", # Get the OpenAI API key from https://platform.openai.com/api-keys
    temperature=0.7,
    top_p=0.4,
    image_mode="url",
    detailed_extraction=True, # Set to True for more detailed extraction
    enable_concurrency=True,
)

# Initialize parser with Google Gemini model
parser = VisionParser(
    model_name="gemini-1.5-flash",
    api_key="your-gemini-api-key", # Get the Gemini API key from https://aistudio.google.com/app/apikey
    temperature=0.7,
    top_p=0.4,
    image_mode="url",
    detailed_extraction=True, # Set to True for more detailed extraction
    enable_concurrency=True,
)
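Rather than hard-coding credentials, you can read the API key from an environment variable using only the standard library. A minimal sketch; OPENAI_API_KEY is a conventional variable name here, not something the package mandates, and omitted parameters are assumed to fall back to their defaults:

import os

from vision_parse import VisionParser

# Read the key from the environment instead of embedding it in source code
parser = VisionParser(
    model_name="gpt-4o",
    api_key=os.environ["OPENAI_API_KEY"],
    image_mode="url",
)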

✅ Supported Models

This package supports the following Vision LLM models:

  • OpenAI: gpt-4o, gpt-4o-mini
  • Google Gemini: gemini-1.5-flash, gemini-2.0-flash-exp, gemini-1.5-pro
  • Meta Llama and LLaVA via Ollama: llava:13b, llava:34b, llama3.2-vision:11b, llama3.2-vision:70b

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
