A tool for converting PDF documents to Markdown using OCR and vision language models

GPTParse

GPTParse is a document parser designed for Retrieval-Augmented Generation (RAG) systems. It converts PDF documents and images into Markdown using either vision language models (VLMs) or fast local processing, so the output can be fed directly into text-based workflows and applications.

With GPTParse, you can:

Convert complex PDFs and images, including those with tables, lists, and embedded images, into well-structured Markdown.
Choose between AI-powered processing (using OpenAI, Anthropic, or Google) and fast local processing.
Use GPTParse as a Python library or via a command-line interface (CLI), offering flexibility in how you integrate it into your projects.

It's as simple as:

# Convert a PDF using Vision Language Models
gptparse vision example.pdf --output_file output.md

# Convert a PDF using fast local processing (no VLM or internet connection required)
gptparse fast example.pdf --output_file output.md

# Convert using hybrid mode (combines fast and vision for better results)
gptparse hybrid example.pdf --output_file output.md

# Convert an image
gptparse vision screenshot.png --output_file output.md

Features

Convert PDFs and Images to Markdown: Transform PDF documents and image files (PNG, JPG, JPEG) into Markdown format, preserving the structure and content.
Multiple Parsing Methods: Choose among four modes: Vision Language Models (VLMs) for high-fidelity conversion, fast local processing for quick results, hybrid mode for enhanced accuracy, and OCR mode for direct text extraction.
Support for Multiple AI Providers: Seamlessly integrate with OpenAI, Anthropic, and Google AI models, selecting the one that best fits your needs.
Python Library and CLI Application: Use GPTParse within your Python applications or interact with it through the command line.
Customizable Processing Options: Configure concurrency levels, select specific pages to process, and customize system prompts to tailor the output.
Page Selection: Process entire documents or specify individual pages or ranges of pages.
Detailed Statistics: Optionally display detailed processing statistics, including token usage and processing times.

Table of Contents

Installation
- Prerequisites
Quick Start
Usage
Available Models and Providers
Examples
Contributing
License

Installation

Install GPTParse using pip:

pip install gptparse

Prerequisites

Ensure you have the following installed:

Python 3.9 or higher
Poppler: For PDF to image conversion

Installing Poppler

GPTParse relies on Poppler to convert PDF pages to images. To check whether Poppler is already installed, run pdftoppm -h in your terminal or command prompt.

Ubuntu/Debian:
```
sudo apt-get install poppler-utils
```
Arch Linux:
```
sudo pacman -S poppler
```
macOS (with Homebrew):
```
brew install poppler
```
Windows:
1. Download the latest Poppler package for Windows from oschwartz10612's builds, which are the most up to date.
2. Extract the downloaded package and move the extracted directory to your desired location.
3. Add the bin/ directory from the extracted folder to your system PATH.
4. Verify the installation by opening a new command prompt and running pdftoppm -h.

After installing Poppler, you should be ready to use GPTParse.
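
The manual check described above (running pdftoppm -h) can also be done from Python. The following is a minimal sketch, not part of GPTParse: the helper name `poppler_available` is hypothetical, and it only checks that the `pdftoppm` executable is discoverable on PATH, not that it works correctly.

```python
import shutil


def poppler_available(command: str = "pdftoppm") -> bool:
    """Return True if the given Poppler executable is found on PATH.

    Hypothetical helper for illustration; GPTParse does not expose it.
    """
    return shutil.which(command) is not None


if __name__ == "__main__":
    if poppler_available():
        print("Poppler found: pdftoppm is on PATH")
    else:
        print("Poppler not found: install it or add its bin/ directory to PATH")
```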

Quick Start

Here's how you can quickly get started with GPTParse:

# Set your API key
export OPENAI_API_KEY="your-openai-api-key"

# Convert a PDF to Markdown using Vision Language Models
gptparse vision example.pdf --output_file output.md

# Convert a PDF to Markdown using fast local processing (no VLM or internet connection required)
gptparse fast example.pdf --output_file output.md

# Convert using hybrid mode (combines fast and vision for better results)
gptparse hybrid example.pdf --output_file output.md

# Convert using OCR mode (direct text extraction)
gptparse ocr example.pdf --output_file output.md

Usage

Setting Up Environment Variables

Before using a mode that calls an AI provider, set the API key for each provider you plan to use as an environment variable:

OpenAI:

export OPENAI_API_KEY="your-openai-api-key"

Anthropic:

export ANTHROPIC_API_KEY="your-anthropic-api-key"

Google:

export GOOGLE_API_KEY="your-google-api-key"

You can set these variables in your shell profile (~/.bashrc, ~/.zshrc, etc.) or include them in your Python script before importing GPTParse.

Note: Keep your API keys secure and do not expose them in code repositories.
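
Before starting a long conversion, it can help to confirm which provider keys are actually set. The sketch below uses only the environment variable names listed above; the function name `providers_with_keys` and the mapping are illustrative and not part of GPTParse.

```python
import os

# Environment variable names as listed in this README.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def providers_with_keys(env=None):
    """Return the providers whose API key variable is set and non-empty.

    `env` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var)]


if __name__ == "__main__":
    print("Providers with an API key set:", providers_with_keys())
```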

Configuration

GPTParse allows you to set default configurations for ease of use. Use the configure command to set default values for the AI provider, model, and concurrency:

gptparse configure

You will be prompted to enter the desired provider, model, and concurrency level. The configuration is saved in ~/.gptparse_config.json.

Example:

$ gptparse configure
GPTParse Configuration
Enter new values or press Enter to keep the current values.
Current values are shown in [brackets].

AI Provider [openai]: anthropic
Default Model for anthropic [claude-3-5-sonnet-latest]: claude-3-opus-latest
Default Concurrency [10]: 5
Configuration updated successfully.

Current configuration:
  provider: anthropic
  model: claude-3-opus-latest
  concurrency: 5
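
The saved configuration can be inspected from Python. This is a sketch under assumptions: the README states only that the file lives at ~/.gptparse_config.json, so the JSON layout (a flat object with `provider`, `model`, and `concurrency` keys, mirroring the "Current configuration" output above) is a guess, and `parse_config` and `read_config` are hypothetical helpers rather than GPTParse functions.

```python
import json
from pathlib import Path

# Location stated in this README.
DEFAULT_CONFIG_PATH = Path.home() / ".gptparse_config.json"

# Assumed keys, taken from the "Current configuration" output shown above.
KNOWN_KEYS = ("provider", "model", "concurrency")


def parse_config(text):
    """Parse configuration JSON and keep only the assumed known keys."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return {key: data[key] for key in KNOWN_KEYS if key in data}


def read_config(path=DEFAULT_CONFIG_PATH):
    """Return the saved configuration, or an empty dict if no file exists."""
    path = Path(path)
    if not path.exists():
        return {}
    return parse_config(path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    print(read_config())
```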

Using GPTParse as a Python Package

Below is an example of how to use GPTParse in your Python code:

import os

# Set the API key before importing GPTParse (used by the vision and hybrid modes)
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

from gptparse.modes.vision import vision
from gptparse.modes.fast import fast
from gptparse.modes.hybrid import hybrid

# Each call below writes to its own file so the results do not overwrite one another

# Using vision mode (AI-powered)
vision_result = vision(
    concurrency=10,
    file_path="example.pdf",
    model="gpt-4o",
    output_file="vision_output.md",
    custom_system_prompt=None,
    select_pages=None,
    provider="openai",
)

# Using fast mode (no AI required)
fast_result = fast(
    file_path="example.pdf",
    output_file="fast_output.md",
    select_pages=None,
)

# Using hybrid mode (combines fast and vision)
hybrid_result = hybrid(
    concurrency=10,
    file_path="example.pdf",
    model="gpt-4o",
    output_file="hybrid_output.md",
    custom_system_prompt=None,
    select_pages=None,
    provider="openai",
)

Using GPTParse via the CLI

When using the command-line interface, you have four modes available:

Vision Mode - Uses AI models for high-quality conversion:

export OPENAI_API_KEY="your-openai-api-key"
gptparse vision example.pdf --output_file output.md --provider openai

Fast Mode - Uses local processing for quick conversion (no AI required):

gptparse fast example.pdf --output_file output.md

Hybrid Mode - Combines fast and vision modes for enhanced results:

export OPENAI_API_KEY="your-openai-api-key"
gptparse hybrid example.pdf --output_file output.md --provider openai

OCR Mode - Uses direct OCR processing for text extraction:

gptparse ocr example.pdf --output_file output.md

OCR mode accepts the following options:

--output_file: Output file name (must have a .md or .txt extension).
--abort-on-error: Stop processing if an error occurs (optional).

Vision Mode Options

--concurrency: Number of concurrent processes (default: the value set in the configuration, or 10 if none is set).
--model: Vision language model to use (overrides configured default).
--output_file: Output file name (must have a .md or .txt extension).
--custom_system_prompt: Custom system prompt for the language model.
--select_pages: Pages to process (e.g., "1,3-5,10"). Only applicable for PDF files.
--provider: AI provider to use (openai, anthropic, google).
--stats: Display detailed statistics after processing.
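
To illustrate what a concurrency limit means in practice, the sketch below converts several pages in parallel with a bounded worker pool. It is an illustration only: how GPTParse schedules work internally is not described in this README, threads are used here as a stand-in, and `convert_pages` and `fake_convert` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_pages(pages, convert_page, concurrency=10):
    """Apply convert_page to every page, running at most `concurrency` at once.

    Results come back in the same order as `pages`, regardless of which
    conversion finishes first.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(convert_page, pages))


def fake_convert(page_number):
    """Stand-in for a real per-page conversion call (for example, a VLM request)."""
    return f"## Page {page_number}"


if __name__ == "__main__":
    for markdown in convert_pages([1, 2, 3, 4], fake_convert, concurrency=2):
        print(markdown)
```

A lower limit reduces the number of simultaneous requests sent to a provider, which can help when rate limits are a concern.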

Fast Mode Options

--output_file: Output file name (must have a .md or .txt extension).
--select_pages: Pages to process (e.g., "1,3-5,10"). Only applicable for PDF files.
--stats: Display basic processing statistics.

Hybrid Mode Options

--concurrency: Number of concurrent processes (default: the value set in the configuration, or 10 if none is set).
--model: Vision language model to use (overrides configured default).
--output_file: Output file name (must have a .md or .txt extension).
--custom_system_prompt: Custom system prompt for the language model.
--select_pages: Pages to process (e.g., "1,3-5,10"). Only applicable for PDF files.
--provider: AI provider to use (openai, anthropic, google).
--stats: Display detailed statistics after processing.

OCR Mode Options

gptparse ocr example.pdf --output_file output.md

--output_file: Output file name (must have a .md or .txt extension).
--abort-on-error: Stop processing if an error occurs (optional).
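
Every mode requires the output file to end in .md or .txt. A script that builds output names programmatically can check this up front. The sketch below is a hypothetical pre-check, not GPTParse's own validation; treating the extension case-insensitively is an assumption.

```python
from pathlib import Path

# Extensions named in the option descriptions above.
ALLOWED_OUTPUT_SUFFIXES = {".md", ".txt"}


def is_valid_output_file(name):
    """Return True if `name` ends in .md or .txt (case-insensitive by assumption)."""
    return Path(name).suffix.lower() in ALLOWED_OUTPUT_SUFFIXES


if __name__ == "__main__":
    for candidate in ("output.md", "notes.txt", "output.pdf"):
        print(candidate, is_valid_output_file(candidate))
```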

Available Models and Providers

GPTParse supports multiple models from different AI providers.

OpenAI Models

gpt-4o (Default)
gpt-4o-mini

Anthropic Models

claude-3-5-sonnet-latest (Default)
claude-3-opus-latest
claude-3-sonnet-20240229
claude-3-haiku-20240307

Google Models

gemini-1.5-pro-002 (Default)
gemini-1.5-flash-002
gemini-1.5-flash-8b

To list available models for a provider in your code, you can use:

from gptparse.models.model_interface import list_available_models

# List models for a specific provider
models = list_available_models(provider='openai')
print("OpenAI models:", models)

# List all available models from all providers
all_models = list_available_models()
print("All available models:", all_models)
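
When the library is not installed, for example in a configuration script, the model lists above can be mirrored in a plain dictionary. This sketch simply copies the names from this README, so it may lag behind what `list_available_models` reports; `SUPPORTED_MODELS`, `default_model`, and `is_supported` are illustrative names, not part of GPTParse.

```python
# Model names copied from this README; the first entry per provider is the
# one marked "(Default)".
SUPPORTED_MODELS = {
    "openai": ["gpt-4o", "gpt-4o-mini"],
    "anthropic": [
        "claude-3-5-sonnet-latest",
        "claude-3-opus-latest",
        "claude-3-sonnet-20240229",
        "claude-3-haiku-20240307",
    ],
    "google": [
        "gemini-1.5-pro-002",
        "gemini-1.5-flash-002",
        "gemini-1.5-flash-8b",
    ],
}


def default_model(provider):
    """Return the default model listed for a provider (raises KeyError if unknown)."""
    return SUPPORTED_MODELS[provider][0]


def is_supported(provider, model):
    """Return True if the model appears in this README's list for the provider."""
    return model in SUPPORTED_MODELS.get(provider, [])


if __name__ == "__main__":
    print(default_model("anthropic"))
    print(is_supported("openai", "gpt-4o-mini"))
```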

Examples

Processing Specific Pages

To process only specific pages from a PDF document, use the --select_pages option:

gptparse vision example.pdf --select_pages "2,4,6-8"

This command will process pages 2, 4, 6, 7, and 8 of example.pdf.
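
The page-selection syntax can be read as a comma-separated list of single pages and inclusive ranges. The following sketch shows one way to expand such a string; it is not GPTParse's actual parser, and `parse_page_selection` is a hypothetical name.

```python
def parse_page_selection(selection):
    """Expand a string such as "2,4,6-8" into a sorted list of page numbers."""
    pages = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_page_selection("2,4,6-8"))
```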

Custom System Prompt

Provide a custom system prompt to influence the model's output:

gptparse vision example.pdf --custom_system_prompt "Please extract all text in bullet points."

Displaying Statistics

To display detailed processing statistics, use the --stats flag:

gptparse vision example.pdf --stats

Sample output:

Detailed Statistics:
File Path: example.pdf
Completion Time: 12.34 seconds
Total Pages Processed: 5
Total Input Tokens: 2500
Total Output Tokens: 3000
Total Tokens: 5500
Average Tokens per Page: 1100.00

Page-wise Statistics:
  Page 1: 600 tokens
  Page 2: 500 tokens
  Page 3: 700 tokens
  Page 4: 800 tokens
  Page 5: 400 tokens
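
The totals in the sample output are related by simple arithmetic: 2500 input tokens plus 3000 output tokens gives 5500, and 5500 divided by 5 pages gives 1100.00. The sketch below reproduces that calculation; `summarize_tokens` is a hypothetical helper, not the function GPTParse uses to build its report.

```python
def summarize_tokens(input_tokens, output_tokens, pages_processed):
    """Compute the total and per-page average token counts."""
    total = input_tokens + output_tokens
    average = total / pages_processed if pages_processed else 0.0
    return {"total_tokens": total, "average_tokens_per_page": average}


if __name__ == "__main__":
    summary = summarize_tokens(2500, 3000, 5)
    print(f"Total Tokens: {summary['total_tokens']}")
    print(f"Average Tokens per Page: {summary['average_tokens_per_page']:.2f}")
```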

Processing Images

To process an image file:

# Process a PNG file
gptparse vision screenshot.png --output_file output.md

# Process a JPG file
gptparse vision photo.jpg --output_file output.md

Supported image formats:

PNG
JPG/JPEG

Processing with OCR

To process a file using direct OCR:

# Process a PDF file with OCR
gptparse ocr document.pdf --output_file output.md

# Process an image with OCR
gptparse ocr scan.png --output_file output.md

# Process with abort-on-error flag
gptparse ocr document.pdf --output_file output.md --abort-on-error

The OCR mode supports:

PDF documents
PNG images
JPG/JPEG images
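
When processing a folder of mixed files, it is useful to filter for the supported input types first. The sketch below classifies a path by its extension using the formats listed above; `classify_input` is a hypothetical helper, and GPTParse may detect file types differently.

```python
from pathlib import Path

# Image extensions listed in this README.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def classify_input(file_path):
    """Return "pdf", "image", or "unsupported" based on the file extension."""
    suffix = Path(file_path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    return "unsupported"


if __name__ == "__main__":
    for name in ("document.pdf", "scan.png", "photo.jpg", "notes.docx"):
        print(name, "->", classify_input(name))
```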

Contributing

Contributions are welcome! If you'd like to contribute to GPTParse, please follow these steps:

Fork the repository on GitHub.
Create a new branch for your feature or bugfix.
Make your changes and ensure tests pass.
Submit a pull request with a clear description of your changes.

Please ensure that your code adheres to the existing style conventions and passes all tests.

License

GPTParse is licensed under the Apache-2.0 License. See LICENSE for more information.

Project details

Release history

0.3.0 (this version): Nov 12, 2024
0.2.0: Nov 4, 2024
0.1.2: Oct 23, 2024
0.1.1: Oct 19, 2024
0.1.0: Oct 18, 2024

Download files

Source Distribution

gptparse-0.3.0.tar.gz (27.7 kB)

Uploaded Nov 12, 2024 Source

Built Distribution

gptparse-0.3.0-py3-none-any.whl (32.1 kB)

Uploaded Nov 12, 2024 Python 3

File details

Details for the file gptparse-0.3.0.tar.gz.

File metadata

Download URL: gptparse-0.3.0.tar.gz
Upload date: Nov 12, 2024
Size: 27.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.7

File hashes

Hashes for gptparse-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`4c2df30bf8c495b093b967030b31d1a3fc3ecee43d6583bf679a11d5a9c6e69e`
MD5	`4849c86b911dbb0be082d72a2b801162`
BLAKE2b-256	`3cddd49d64fdfcf60e19f291bbdd87adc426777b7210b0812c1f7ffeaee3ccc8`

File details

Details for the file gptparse-0.3.0-py3-none-any.whl.

File metadata

Download URL: gptparse-0.3.0-py3-none-any.whl
Upload date: Nov 12, 2024
Size: 32.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.7

File hashes

Hashes for gptparse-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9ac6bc67b9b780cf96ab532d09c969ef823d67836d1c18f9f7b569e1c2235efa`
MD5	`cb4945b757d3e2f291758d2181097ba8`
BLAKE2b-256	`bf52909961781de973bd167c4826e5e9616137353cbbd1af4fdc43148c6f689f`
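
A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. This is a minimal sketch: the function names are illustrative, and the file path in the usage example assumes the archive sits in the current directory.

```python
import hashlib


def sha256_hex_of_stream(stream, chunk_size=65536):
    """Return the SHA256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def file_matches_sha256(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published value."""
    with open(path, "rb") as fh:
        return sha256_hex_of_stream(fh) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Digest copied from the table for gptparse-0.3.0.tar.gz above.
    published = "4c2df30bf8c495b093b967030b31d1a3fc3ecee43d6583bf679a11d5a9c6e69e"
    try:
        print(file_matches_sha256("gptparse-0.3.0.tar.gz", published))
    except FileNotFoundError:
        print("Download gptparse-0.3.0.tar.gz into this directory first")
```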
