
LLM OCR


Convert PDFs to markdown using Large Language Models (LLMs) with vision capabilities.

Features

  • 🔍 High-quality OCR using vision-capable LLMs
  • 📄 Batch processing of multiple PDF pages
  • 🔌 Multiple provider support (Gemini, OpenAI)
  • ⚙️ Configurable processing settings
  • 🔄 Automatic retry logic for transient errors
  • 📝 Clean markdown output

Installation

pip install ocr-llm

System Dependencies

You also need to install poppler (required for PDF processing):

# macOS
brew install poppler

# Ubuntu/Debian
sudo apt-get install poppler-utils

# Fedora/RHEL
sudo dnf install poppler-utils

Dependencies

The library requires:

  • System: poppler-utils for PDF processing
  • Python:
    • google-genai for Gemini provider
    • openai for OpenAI provider
    • pdf2image and Pillow for PDF processing
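Before running conversions, it can help to verify that poppler is actually on your PATH. This sketch is not part of the library; it simply checks for the `pdftoppm` and `pdfinfo` binaries that pdf2image invokes under the hood:

```python
import shutil

def poppler_available() -> bool:
    """Return True if the poppler CLI tools pdf2image relies on are on PATH."""
    return all(shutil.which(tool) is not None for tool in ("pdftoppm", "pdfinfo"))

if not poppler_available():
    print("poppler not found -- install it with your system package manager")
```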

Quick Start

Using OpenAI

import asyncio
from llm_ocr import LLMOCR, OpenAI

async def main():
    # Initialize OpenAI provider
    provider = OpenAI(
        api_key="your-api-key",  # Or set OPENAI_API_KEY env var
        model=OpenAI.GPT_4O_MINI
    )

    # Create OCR processor
    async with LLMOCR(provider) as ocr:
        # Convert PDF to markdown
        markdown = await ocr.convert(
            "document.pdf",
            output_path="output.md"
        )
        print(markdown)

asyncio.run(main())

Using Gemini

import asyncio
from llm_ocr import LLMOCR, Gemini

async def main():
    # Initialize Gemini provider
    provider = Gemini(
        api_key="your-api-key",  # Or set GEMINI_API_KEY env var
        model=Gemini.FLASH_2_5  # Or Gemini.PRO_2_5 for best quality
    )

    # Create OCR processor
    async with LLMOCR(provider) as ocr:
        # Convert PDF to markdown
        markdown = await ocr.convert(
            "document.pdf",
            output_path="output.md"
        )
        print(markdown)

asyncio.run(main())

Available Models

OpenAI

  • OpenAI.GPT_4O
  • OpenAI.GPT_4O_MINI (default)

Additional models: O1, O3, O4_MINI, GPT_5, GPT_5_MINI, GPT_4_1, and more.

See llm_ocr/providers/openai.py for the complete list.

Gemini

  • Gemini.PRO_2_5
  • Gemini.FLASH_2_5 (default)

Additional models: PRO_2_0, FLASH_2_0.

See llm_ocr/providers/gemini.py for the complete list.

Configuration

Customize the OCR processing with OCRConfig:

from llm_ocr import LLMOCR, OpenAI, OCRConfig

config = OCRConfig(
    dpi=300,                    # Higher DPI for better quality
    max_pages=10,               # Limit number of pages to process
    llm_batch_size=2,           # Send 2 pages to LLM at once
    convert_to_grayscale=True,  # Convert images to grayscale
    max_retries=3,              # Retry failed requests
    retry_delay=1.0,            # Wait 1 second between retries
    include_page_markers=True,  # Add page markers in output
)

provider = OpenAI()
ocr = LLMOCR(provider, config=config)

Configuration Options

| Option | Default | Description |
|---|---|---|
| `dpi` | `200` | DPI for PDF-to-image conversion (72-600) |
| `max_pages` | `None` | Maximum number of pages to process |
| `batch_size` | `5` | PDF-to-image conversion batch size |
| `llm_batch_size` | `1` | Number of pages sent to the LLM at once |
| `thread_count` | `4` | Number of threads for PDF conversion |
| `convert_to_grayscale` | `False` | Convert images to grayscale |
| `optimize_png` | `True` | Optimize PNG compression |
| `use_cropbox` | `True` | Use the PDF cropbox for conversion |
| `max_retries` | `3` | Maximum retry attempts for failed requests |
| `retry_delay` | `1.0` | Delay between retries, in seconds |
| `include_page_markers` | `False` | Add page markers to the markdown output |
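When choosing a `dpi`, keep in mind that rendered image size (and therefore upload size and token cost for vision models) scales linearly with it. A quick back-of-the-envelope calculation, using a US Letter page purely as an illustration:

```python
def rendered_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)

# A US Letter page (8.5 x 11 in) at the default 200 DPI vs. 300 DPI:
print(rendered_size(8.5, 11, 200))  # (1700, 2200)
print(rendered_size(8.5, 11, 300))  # (2550, 3300)
```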

Advanced Usage

Custom Provider Parameters

Pass additional parameters to the LLM provider:

# OpenAI with custom parameters
provider = OpenAI(
    model=OpenAI.GPT_4O,
    max_tokens=4000,
    temperature=0.0,
)

# Gemini with custom parameters
provider = Gemini(
    model=Gemini.PRO_2_5,
    temperature=0.0,
)

Processing Multiple Documents

import asyncio
from pathlib import Path
from llm_ocr import LLMOCR, OpenAI

async def process_documents():
    provider = OpenAI()

    async with LLMOCR(provider) as ocr:
        pdf_files = Path("pdfs").glob("*.pdf")

        for pdf_file in pdf_files:
            output_file = pdf_file.with_suffix(".md")
            await ocr.convert(pdf_file, output_path=output_file)
            print(f"Converted {pdf_file.name} -> {output_file.name}")

asyncio.run(process_documents())
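The loop above converts documents one at a time. If your provider's rate limits allow concurrent requests, a semaphore-bounded `asyncio.gather` pattern can speed things up. This is a generic sketch; `convert_one` is a placeholder standing in for a call to `ocr.convert`:

```python
import asyncio

async def gather_limited(coros, limit: int = 3):
    """Run coroutines concurrently, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def bounded(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(bounded(c) for c in coros))

async def convert_one(name: str) -> str:
    # Placeholder for `await ocr.convert(name, output_path=...)`
    await asyncio.sleep(0.01)
    return f"{name}: done"

results = asyncio.run(gather_limited(convert_one(f"doc{i}.pdf") for i in range(5)))
print(results)
```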

Without Context Manager

If you prefer not to use the context manager:

import asyncio
from llm_ocr import LLMOCR, OpenAI

async def main():
    provider = OpenAI()
    ocr = LLMOCR(provider)

    try:
        markdown = await ocr.convert("document.pdf")
        print(markdown)
    finally:
        await ocr.aclose()  # Don't forget to close!

asyncio.run(main())

Environment Variables

Set API keys via environment variables:

# For OpenAI
export OPENAI_API_KEY="your-openai-api-key"

# For Gemini
export GEMINI_API_KEY="your-gemini-api-key"

Then use providers without passing API keys:

# API key read from environment variable
provider = OpenAI()  # Uses OPENAI_API_KEY
# or
provider = Gemini()  # Uses GEMINI_API_KEY

Error Handling

The library fails fast: transient errors are retried automatically, and persistent failures raise an exception for you to handle:

import asyncio
from llm_ocr import LLMOCR, OpenAI, OCRConfig

async def main():
    provider = OpenAI()
    config = OCRConfig(
        max_retries=5,      # Retry up to 5 times
        retry_delay=2.0,    # Wait 2 seconds between retries
    )

    async with LLMOCR(provider, config) as ocr:
        try:
            markdown = await ocr.convert("document.pdf")
            print(markdown)
        except Exception as e:
            print(f"Failed to process document: {e}")

asyncio.run(main())
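The behavior governed by `max_retries` and `retry_delay` presumably resembles the following generic sketch (the library's internals may differ): each failed attempt waits `retry_delay` seconds, and once the retry budget is exhausted the last exception propagates.

```python
import asyncio

async def with_retries(func, *, max_retries: int = 3, retry_delay: float = 1.0):
    """Call an async function, retrying on failure with a fixed delay."""
    for attempt in range(max_retries + 1):
        try:
            return await func()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted: fail fast
            await asyncio.sleep(retry_delay)

# Demo: a flaky call that fails twice, then succeeds on the third attempt.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

result = asyncio.run(with_retries(flaky, max_retries=5, retry_delay=0.01))
print(result)  # ok
```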

License

See LICENSE file for details.
