vllmocr

OCR using LLMs

vllmocr is a command-line tool that performs Optical Character Recognition (OCR) on images and PDFs using Large Language Models (LLMs). The LLM model is prompted to return the complete text in Markdown format. vllmocr supports multiple LLM providers, including OpenAI, Anthropic, Google, and local models via Ollama. It was designed to assist with creating text versions of public domain books and historical newspaper articles.

Features

  • Image and PDF OCR: Extracts text from both images (PNG, JPG, JPEG) and PDF files.
  • Multiple LLM Providers: Supports a variety of LLMs:
    • OpenAI: GPT-4o
    • Anthropic: Claude 3 Haiku, Claude 3.5 Haiku, Claude 3 Sonnet
    • Google: Gemini 1.5 Pro
    • Ollama: (Local models) Llama3, Llama3.2-vision, MiniCPM, and other models supported by Ollama.
    • OpenRouter: Access to various models through the OpenRouter API
  • Configurable: Settings, including the LLM provider and model, can be adjusted via a configuration file or environment variables.
  • Image Preprocessing: Includes optional image rotation for improved OCR accuracy.

Installation

The recommended way to install vllmocr is using uv tool install:

uv tool install vllmocr

If you don't have uv installed, you can install it with:

curl -LsSf https://astral.sh/uv/install.sh | sh

You may need to restart your shell session for uv to be available.

Alternatively, you can use uv pip or regular pip:

uv pip install vllmocr
pip install vllmocr

Usage

vllmocr is a command-line tool that processes both images and PDFs:

vllmocr <file_path> [options]
  • <file_path>: The path to the image file (PNG, JPG, JPEG) or PDF file.

Options:

  • -o, --output: Output file name (default: auto-generated based on input filename and model).
  • -p, --provider: The LLM provider to use (openai, anthropic, google, ollama, openrouter). Defaults to anthropic.
  • -m, --model: The specific model to use (e.g., gpt-4o, haiku, llama3.2-vision, google/gemma-3-27b-it). Defaults to claude-3-5-haiku-latest.
  • -c, --custom-prompt: Custom prompt to use for the LLM.
  • --api-key: API key for the LLM provider. Overrides API keys from the config file or environment variables.
  • --rotate: Manually rotate image by specified degrees (0, 90, 180, or 270).
  • --debug: Save intermediate processing steps for debugging.
  • --log-level: Set the logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL).
  • --help: Show the help message and exit.

Examples:

vllmocr my_image.jpg -m haiku
vllmocr document.pdf -p ollama -m llama3.2-vision
vllmocr scan.jpg -p openai -m gpt-4o --rotate 90

Running vllmocr without arguments will display a help message with usage examples.

A General Note on LLMs and OCR

In my experience, only the largest of LLMs are useful for text transcription. Although vllmocr supports Ollama, I haven't found any locally-runnable models that perform adequately on my MacBook Pro with 36 GB of memory.

Most models demonstrate reasonable accuracy, though hallucinations occur most frequently when processing text that begins or ends mid-sentence. Models typically ignore word or sentence fragments at the top of the page while attempting to complete sentences that are cut off at the bottom. Hallucinations also increase when processing blurry or distorted text. No matter how you prompt them, current models remain overconfident in their ability to decipher text. Additionally, models occasionally modernize archaic spellings or formatting without indication.

A more substantial challenge arises when processing pages with more than a few hundred words, such as full newspaper or magazine pages. Larger models can output more text before degrading; this doesn't seem to be related to context window size or output restrictions, just parameter count. When overwhelmed, models frequently omit significant sections, especially with column-formatted content. To achieve the best results, I usually crop the image into smaller, manageable sections and perform OCR on each section individually. This approach dramatically improves accuracy and ensures comprehensive text capture across the entire document.
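The cropping workflow above can be sketched as follows. This is an illustrative helper, not part of vllmocr: it computes overlapping horizontal crop boxes (the overlap helps avoid losing a line of text at a strip boundary), which you could feed to an image library and then run vllmocr on each piece.

```python
def strip_boxes(width, height, n_strips, overlap=40):
    """Compute (left, top, right, bottom) boxes that split an image into
    n_strips horizontal bands, each extended by `overlap` pixels so text
    sitting on a boundary appears in both neighboring strips."""
    step = height / n_strips
    boxes = []
    for i in range(n_strips):
        top = max(0, int(i * step) - overlap)
        bottom = min(height, int((i + 1) * step) + overlap)
        boxes.append((0, top, width, bottom))
    return boxes

# Example: a 1000 x 3000 px newspaper scan cut into 3 overlapping strips.
print(strip_boxes(1000, 3000, 3, overlap=50))
```

Each box can be passed to, e.g., Pillow's `Image.crop`, and each cropped file given to vllmocr separately.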

Configuration

vllmocr can be configured using a TOML file or environment variables. The configuration file is searched for in the following locations (in order of precedence):

  1. ./config.toml (current working directory)
  2. ~/.config/vllmocr/config.toml (user's home directory)
  3. /etc/vllmocr/config.toml (system-wide)
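The search order above amounts to a first-match lookup. The sketch below is an illustrative approximation of that precedence, not vllmocr's actual code; the paths mirror the list above.

```python
import os

def find_config(cwd=None, home=None):
    """Return the first existing config.toml, or None.
    Search order mirrors vllmocr's documented precedence:
    current directory, then user config, then system-wide."""
    cwd = cwd or os.getcwd()
    home = home or os.path.expanduser("~")
    candidates = [
        os.path.join(cwd, "config.toml"),
        os.path.join(home, ".config", "vllmocr", "config.toml"),
        os.path.join("/etc", "vllmocr", "config.toml"),
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```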

config.toml (Example):

[llm]
provider = "anthropic"  # Default provider
model = "claude-3-5-haiku-latest"  # Default model for the provider

[image_processing]
rotation = 0           # Image rotation in degrees (optional)

[api_keys]
openai = "YOUR_OPENAI_API_KEY"
anthropic = "YOUR_ANTHROPIC_API_KEY"
google = "YOUR_GOOGLE_API_KEY"
openrouter = "YOUR_OPENROUTER_API_KEY"
# Ollama doesn't require an API key

Environment Variables:

You can also set API keys using environment variables:

  • VLLM_OCR_OPENAI_API_KEY
  • VLLM_OCR_ANTHROPIC_API_KEY
  • VLLM_OCR_GOOGLE_API_KEY
  • VLLM_OCR_OPENROUTER_API_KEY

Environment variables override settings in the configuration file. This is the recommended way to set API keys for security reasons. You can also pass the API key directly via the --api-key command-line option, which takes the highest precedence.
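The precedence just described (--api-key flag, then environment variable, then config file) can be sketched like this. It is purely illustrative; vllmocr's internals may differ.

```python
import os

def resolve_api_key(provider, cli_key=None, config=None):
    """Pick an API key with the documented precedence:
    --api-key flag > VLLM_OCR_<PROVIDER>_API_KEY > [api_keys] in config.toml."""
    if cli_key:
        return cli_key
    env_key = os.environ.get(f"VLLM_OCR_{provider.upper()}_API_KEY")
    if env_key:
        return env_key
    return (config or {}).get("api_keys", {}).get(provider)

# Environment beats config; the CLI flag beats both.
os.environ["VLLM_OCR_ANTHROPIC_API_KEY"] = "env-key"
cfg = {"api_keys": {"anthropic": "file-key"}}
print(resolve_api_key("anthropic", config=cfg))         # env var wins over config
print(resolve_api_key("anthropic", cli_key="cli-key"))  # CLI flag wins over env var
```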
