Parse PDF documents into markdown formatted content using Vision LLMs
Vision Parse
🚀 Parse PDF documents into beautifully formatted markdown content using state-of-the-art Vision Language Models - all with just a few lines of code!
🎯 Introduction
Vision Parse harnesses the power of Vision Language Models to revolutionize document processing:
- 📝 Smart Content Extraction: Intelligently identifies and extracts text, tables, and LaTeX equations with high precision
- 🎨 Advanced Content Formatting: Preserves LaTeX equations, hyperlinks, images, and footnotes in the markdown output
- 🤖 Multi-LLM Support: Seamlessly integrates with multiple Vision LLM providers such as OpenAI, Gemini, and Llama for optimal accuracy and speed
- 🔄 Scanned PDF Document Processing: Extracts text, tables, images, and LaTeX equations from scanned PDF documents into well-structured markdown content
- 📁 Local Model Hosting: Supports local model hosting with Ollama for secure, no-cost, private, and offline document processing
🚀 Getting Started
Prerequisites
- 🐍 Python >= 3.9
- 🖥️ Ollama (if you want to use local models)
- 🤖 API key for OpenAI or Google Gemini (if you want to use those hosted models)
Installation
Install the core package using pip (Recommended):

```bash
pip install vision-parse
```

Install the additional dependencies for OpenAI or Gemini:

```bash
# For OpenAI support
pip install 'vision-parse[openai]'

# For Gemini support
pip install 'vision-parse[gemini]'

# To install all the additional dependencies
pip install 'vision-parse[all]'
```

Install the package from source:

```bash
pip install 'git+https://github.com/iamarunbrahma/vision-parse.git#egg=vision-parse[all]'
```
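After installing, a quick sanity check confirms the package is importable and reports its version. This is a minimal sketch, assuming the package was installed into the currently active Python environment:

```python
from importlib.metadata import version

import vision_parse  # Import fails if the package isn't installed in this environment

# Print the installed distribution version (e.g. "0.1.9")
print(version("vision-parse"))
```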
Setting up Ollama (Optional)
See examples/ollama_setup.md for instructions on how to set up Ollama locally.
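Before parsing with a local model, it can help to confirm the Ollama server is reachable. The sketch below assumes Ollama's default local endpoint of http://localhost:11434; adjust the URL if you have changed OLLAMA_HOST:

```python
import urllib.request

# Ollama's default local endpoint; adjust if you changed OLLAMA_HOST
OLLAMA_URL = "http://localhost:11434"

try:
    with urllib.request.urlopen(OLLAMA_URL, timeout=5) as response:
        # A running Ollama server answers with a short plain-text message
        print(response.read().decode())
except OSError as exc:
    print(f"Ollama does not appear to be running at {OLLAMA_URL}: {exc}")
```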
⌛️ Usage
Basic Usage Example
```python
from vision_parse import VisionParser

# Initialize parser
parser = VisionParser(
    model_name="llama3.2-vision:11b",  # For local models, you don't need to provide the API key
    temperature=0.4,
    top_p=0.5,
    image_mode="url",  # Image mode can be "url", "base64" or None
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=False,  # Set to True for parallel processing
)

# Convert PDF to markdown
pdf_path = "path/to/your/document.pdf"  # Local path to your PDF file
markdown_pages = parser.convert_pdf(pdf_path)

# Process results
for i, page_content in enumerate(markdown_pages):
    print(f"\n--- Page {i+1} ---\n{page_content}")
```
Customize Ollama configuration for better performance
```python
from vision_parse import VisionParser

custom_prompt = """
Strictly preserve markdown formatting during text extraction from scanned document.
"""

# Initialize parser with Ollama configuration
parser = VisionParser(
    model_name="llama3.2-vision:11b",
    temperature=0.7,
    top_p=0.6,
    num_ctx=4096,
    image_mode="base64",
    custom_prompt=custom_prompt,
    detailed_extraction=True,
    ollama_config={
        "OLLAMA_NUM_PARALLEL": "8",
        "OLLAMA_REQUEST_TIMEOUT": "240.0",
    },
    enable_concurrency=True,
)

# Convert PDF to markdown
pdf_path = "path/to/your/document.pdf"
markdown_pages = parser.convert_pdf(pdf_path)
```
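The same parser instance can be reused across documents. The sketch below is a minimal illustration, assuming a local folder of PDFs (the folder path is hypothetical) and that convert_pdf returns a list of per-page strings as in the examples above:

```python
from pathlib import Path

from vision_parse import VisionParser

parser = VisionParser(
    model_name="llama3.2-vision:11b",
    image_mode="base64",
    enable_concurrency=True,  # Pages within each PDF are processed in parallel
)

pdf_dir = Path("path/to/your/pdf_folder")  # Hypothetical input folder

for pdf_file in sorted(pdf_dir.glob("*.pdf")):
    markdown_pages = parser.convert_pdf(str(pdf_file))
    print(f"{pdf_file.name}: parsed {len(markdown_pages)} pages")
```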
OpenAI or Gemini Model Usage
```python
from vision_parse import VisionParser

# Initialize parser with OpenAI model
parser = VisionParser(
    model_name="gpt-4o",
    api_key="your-openai-api-key",  # Get the OpenAI API key from https://platform.openai.com/api-keys
    temperature=0.7,
    top_p=0.4,
    image_mode="url",
    detailed_extraction=True,  # Set to True for more detailed extraction
    enable_concurrency=True,
)

# Initialize parser with Google Gemini model
parser = VisionParser(
    model_name="gemini-1.5-flash",
    api_key="your-gemini-api-key",  # Get the Gemini API key from https://aistudio.google.com/app/apikey
    temperature=0.7,
    top_p=0.4,
    image_mode="url",
    detailed_extraction=True,  # Set to True for more detailed extraction
    enable_concurrency=True,
)
```
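Hard-coding API keys is best avoided. The sketch below reads the key from an environment variable instead; the variable name OPENAI_API_KEY is chosen here for illustration, and the api_key parameter is the same one shown in the examples above:

```python
import os

from vision_parse import VisionParser

# Read the key from the environment instead of embedding it in source code
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first")

parser = VisionParser(
    model_name="gpt-4o",
    api_key=api_key,
    image_mode="url",
    detailed_extraction=True,
    enable_concurrency=True,
)
```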
✅ Supported Models
This package supports the following Vision LLM models:
- OpenAI: gpt-4o, gpt-4o-mini
- Google Gemini: gemini-1.5-flash, gemini-2.0-flash-exp, gemini-1.5-pro
- Meta Llama and LLaVA (via Ollama): llava:13b, llava:34b, llama3.2-vision:11b, llama3.2-vision:70b
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.