Gemini OCR CLI

A command-line tool for OCR processing using Google Gemini's vision capabilities. Process PDFs and images to extract text, tables, equations, and figures.

Choosing an OCR tool

This is one of five OCR CLI tools with a shared design: clean Markdown output, batch processing, and figure extraction. Pick based on your constraints:

| Tool | Engine | Runs | Cost | Best for |
|---|---|---|---|---|
| deepseek-ocr-cli | DeepSeek vision | Local (Ollama / vLLM) | Free | General-purpose local OCR with multi-backend flexibility |
| gemini-ocr-cli (this repo) | Google Gemini | Cloud API | Free tier / Pay-per-use | Fast cloud OCR with concurrent processing |
| marker-ocr-cli | Marker (Surya + Texify) | Local | Free | Academic papers with equations, tables, complex layouts |
| mistral-ocr-cli | Mistral OCR API | Cloud API | ~$1/1k pages | Structured extraction (tables, headers, footers) |
| nougat-ocr-cli | Meta Nougat | Local (GPU) | Free | Academic papers, GPU-accelerated batch processing |

Installation

Requires Python 3.11+ and a Google Gemini API key.

pip install gemini-ocr-cli

Or from source:

git clone https://github.com/r-uben/gemini-ocr-cli.git
cd gemini-ocr-cli
uv sync

Quick start

# Set your API key
export GEMINI_API_KEY="your_key_here"

# Process a single file
gemini-ocr document.pdf

# Process a directory
gemini-ocr ./documents -o ./results

# Preview what would be processed (no API calls)
gemini-ocr ./documents --dry-run

# Process 4 files concurrently
gemini-ocr ./documents -w 4

Options

Usage: gemini-ocr [OPTIONS] INPUT_PATH

Options:
  -o, --output-dir PATH           Output directory (default: <input_dir>/gemini_ocr_output/)
  --api-key TEXT                  Gemini API key (or set GEMINI_API_KEY env var)
  --model TEXT                    Model to use (default: gemini-3-flash-preview)
  --task [convert|extract|table|describe_figure]
                                  OCR task type (default: convert)
  --prompt TEXT                   Custom prompt for OCR processing

  --include-images/--no-images    Extract embedded images (default: True)
  --save-originals/--no-save-originals  Copy original images to output (default: True)

  -w, --workers N                 Concurrent workers for batch processing (default: 1)
  --reprocess                     Reprocess already-processed files
  --dry-run                       List files without calling the API
  -q, --quiet                     Suppress all output except errors
  -v, --verbose                   Enable verbose/debug output
  --info                          Show configuration and system info
  --env-file PATH                 Path to .env file
  --version                       Show version
  --help                          Show this message
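
When driving the tool from a script, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of the package: `build_command` is a hypothetical helper, and it uses only the `-o`, `-w`, and `--dry-run` flags documented above.

```python
import shlex
from pathlib import Path


def build_command(input_path, output_dir=None, workers=1, dry_run=False):
    """Assemble a gemini-ocr argument list (hypothetical helper)."""
    cmd = ["gemini-ocr", str(input_path)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if workers != 1:
        # The documented default is 1, so the flag is only added when it differs.
        cmd += ["-w", str(workers)]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


if __name__ == "__main__":
    cmd = build_command(Path("documents"), output_dir=Path("results"), workers=4, dry_run=True)
    # Print a shell-quoted form for inspection before running anything.
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; starting with `dry_run=True` previews the file list without making API calls.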

Output structure

gemini_ocr_output/
├── document_name/
│   ├── document_name.md        # OCR markdown (clean text only)
│   └── figures/                # extracted embedded images
│       ├── page1_img1.png
│       └── page2_img1.png
├── another_document/
│   └── ...
└── metadata.json               # processing stats, checksums, file list
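
For downstream scripts, the layout above can be walked with the standard library. This is a sketch under the assumption that every document folder follows the tree shown; `list_outputs` is a hypothetical helper, and the schema of metadata.json is not assumed here.

```python
import tempfile
from pathlib import Path


def list_outputs(output_root):
    """Map each document folder to its Markdown file and figure names."""
    results = {}
    for doc_dir in sorted(Path(output_root).iterdir()):
        if not doc_dir.is_dir():
            continue  # skips metadata.json and any other top-level files
        markdown = doc_dir / f"{doc_dir.name}.md"
        figures_dir = doc_dir / "figures"
        figures = (
            sorted(p.name for p in figures_dir.iterdir() if p.is_file())
            if figures_dir.is_dir()
            else []
        )
        results[doc_dir.name] = {
            "markdown": markdown if markdown.is_file() else None,
            "figures": figures,
        }
    return results


if __name__ == "__main__":
    # Build a throwaway tree that mirrors the documented layout.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "gemini_ocr_output"
        (root / "document_name" / "figures").mkdir(parents=True)
        (root / "document_name" / "document_name.md").write_text("# text\n")
        (root / "document_name" / "figures" / "page1_img1.png").write_bytes(b"")
        (root / "metadata.json").write_text("{}")
        print(list_outputs(root))
```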

API key resolution

Priority order:

  1. --api-key CLI argument
  2. GEMINI_API_KEY environment variable
  3. GOOGLE_API_KEY environment variable (fallback)
  4. .env file in current directory
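
The priority order above can be sketched as a simple fallback chain. This is an illustration, not the package's actual code: `resolve_api_key` and `read_dotenv` are hypothetical names, the .env parser is deliberately minimal, and it is an assumption that the .env file uses the same two variable names.

```python
import os
from pathlib import Path


def read_dotenv(path):
    """Parse simple KEY=VALUE lines; a deliberately minimal .env reader."""
    values = {}
    p = Path(path)
    if not p.is_file():
        return values
    for line in p.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def resolve_api_key(cli_key=None, environ=None, dotenv_path=".env"):
    """Return the first key found, following the documented priority order."""
    environ = os.environ if environ is None else environ
    if cli_key:  # 1. --api-key CLI argument
        return cli_key
    for name in ("GEMINI_API_KEY", "GOOGLE_API_KEY"):  # 2. and 3.
        if environ.get(name):
            return environ[name]
    # 4. .env file (assumed to use the same variable names)
    file_values = read_dotenv(dotenv_path)
    return file_values.get("GEMINI_API_KEY") or file_values.get("GOOGLE_API_KEY")
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.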

Configuration

Settings can also be supplied through environment variables or a .env file. The last three variables below have no CLI flag listed:

| CLI flag | Environment variable | Default |
|---|---|---|
| --api-key | GEMINI_API_KEY | (required) |
| --model | GEMINI_MODEL | gemini-3-flash-preview |
| --include-images | GEMINI_INCLUDE_IMAGES | true |
| --save-originals | GEMINI_SAVE_ORIGINAL_IMAGES | true |
| --workers | GEMINI_MAX_WORKERS | 1 |
| --verbose | GEMINI_VERBOSE | false |
| (none) | GEMINI_MAX_FILE_SIZE_MB | 50 |
| (none) | GEMINI_MAX_RETRIES | 3 |
| (none) | GEMINI_RETRY_BASE_DELAY | 1.0 |

CLI flags override environment variables when explicitly passed.
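
One way to picture that precedence (explicit flag, then environment variable, then default) is the sketch below. `resolve_setting` and `parse_bool` are hypothetical helpers, and the set of strings accepted as "true" is an assumption rather than the tool's documented behavior.

```python
import os


def parse_bool(raw):
    """Interpret common truthy spellings (assumed, not documented by the tool)."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def resolve_setting(cli_value, env_name, default, cast=str, environ=None):
    """Pick a value: explicit CLI flag first, then env var, then default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:  # flag was explicitly passed
        return cli_value
    raw = environ.get(env_name)
    if raw is not None and raw != "":
        return cast(raw)
    return default


if __name__ == "__main__":
    env = {"GEMINI_MAX_WORKERS": "8", "GEMINI_VERBOSE": "true"}
    print(resolve_setting(None, "GEMINI_MAX_WORKERS", 1, int, env))
    print(resolve_setting(4, "GEMINI_MAX_WORKERS", 1, int, env))
    print(resolve_setting(None, "GEMINI_VERBOSE", False, parse_bool, env))
```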

Development

# Install dev dependencies
uv sync --extra dev

# Run tests
uv run pytest

# Lint
uv run ruff check .

# Format
uv run ruff format .

# Type check
uv run mypy gemini_ocr/ --ignore-missing-imports

Limitations

  • Maximum file size: 50 MB (configurable via GEMINI_MAX_FILE_SIZE_MB)
  • Supported formats: PDF, JPG, JPEG, PNG, WEBP, GIF, BMP, TIFF
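
Before a large batch, inputs can be screened against these two limits. The sketch below is illustrative only: `is_processable` is a hypothetical helper, the extension set is copied from the list above (so `.tif` is not matched), and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
import tempfile
from pathlib import Path

# Extensions taken from the "Supported formats" list above.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".webp", ".gif", ".bmp", ".tiff"}
DEFAULT_MAX_FILE_SIZE_MB = 50  # documented default for GEMINI_MAX_FILE_SIZE_MB


def is_processable(path, max_size_mb=DEFAULT_MAX_FILE_SIZE_MB):
    """Return True if the file exists, has a supported extension, and fits the size limit."""
    p = Path(path)
    if not p.is_file():
        return False
    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
        return False
    # Assumption: 1 MB = 1024 * 1024 bytes.
    return p.stat().st_size <= max_size_mb * 1024 * 1024


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "scan.PNG"
        sample.write_bytes(b"\x89PNG")
        print(sample.name, is_processable(sample))
```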

License

MIT License - see LICENSE for details.
