CLI tool for OCR processing using Google Gemini's vision capabilities

Gemini OCR CLI

A command-line tool for OCR processing using Google Gemini's vision capabilities. Process PDFs and images to extract text, tables, equations, and figures.

Installation

Requires Python 3.11+ and a Google Gemini API key.

pip install gemini-ocr-cli

Or from source:

git clone https://github.com/r-uben/gemini-ocr-cli.git
cd gemini-ocr-cli
uv sync

Quick start

# Set your API key
export GEMINI_API_KEY="your_key_here"

# Process a single file
gemini-ocr document.pdf

# Process a directory
gemini-ocr ./documents -o ./results

# Preview what would be processed (no API calls)
gemini-ocr ./documents --dry-run

# Process 4 files concurrently
gemini-ocr ./documents -w 4

Options

Usage: gemini-ocr [OPTIONS] INPUT_PATH

Options:
  -o, --output-dir PATH           Output directory (default: <input_dir>/gemini_ocr_output/)
  --api-key TEXT                  Gemini API key (or set GEMINI_API_KEY env var)
  --model TEXT                    Model to use (default: gemini-3.1-flash-lite-preview)
  --task [convert|extract|table|describe_figure]
                                  OCR task type (default: convert)
  --prompt TEXT                   Custom prompt for OCR processing

  --include-images/--no-images    Extract embedded images (default: True)
  --save-originals/--no-save-originals  Copy original images to output (default: True)

  -w, --workers N                 Concurrent workers for batch processing (default: 1)
  --reprocess                     Reprocess already-processed files
  --dry-run                       List files without calling the API
  -q, --quiet                     Suppress all output except errors
  -v, --verbose                   Enable verbose/debug output
  --info                          Show configuration and system info
  --env-file PATH                 Path to .env file
  --version                       Show version
  --help                          Show this message

Output structure

gemini_ocr_output/
├── document_name/
│   ├── document_name.md        # OCR markdown (clean text only)
│   └── figures/                # extracted embedded images
│       ├── page1_img1.png
│       └── page2_img1.png
├── another_document/
│   └── ...
└── metadata.json               # processing stats, checksums, file list

API key resolution

Priority order (highest first):

  1. --api-key CLI argument
  2. GEMINI_API_KEY environment variable
  3. GOOGLE_API_KEY environment variable (fallback)
  4. .env file in current directory
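
The lookup order above can be sketched as follows. This is a minimal illustration, not the tool's actual code: `resolve_api_key` and `parse_env_file` are hypothetical names, the .env parser handles only plain KEY=VALUE lines, and it is an assumption that the .env file uses the same two variable names.

```python
import os
from pathlib import Path


def parse_env_file(path: Path) -> dict[str, str]:
    """Tiny KEY=VALUE parser; real .env loaders support more syntax."""
    values: dict[str, str] = {}
    if not path.is_file():
        return values
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def resolve_api_key(cli_key=None, environ=None, env_file=Path(".env")):
    """Return the first key found in the documented priority order, or None."""
    environ = os.environ if environ is None else environ
    if cli_key:                                        # 1. --api-key
        return cli_key
    for name in ("GEMINI_API_KEY", "GOOGLE_API_KEY"):  # 2. and 3. environment
        if environ.get(name):
            return environ[name]
    file_values = parse_env_file(env_file)             # 4. .env file
    # Assumption: the .env file uses the same variable names.
    return file_values.get("GEMINI_API_KEY") or file_values.get("GOOGLE_API_KEY")
```

Passing the environment as a plain dictionary keeps the function easy to test without touching the real process environment.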

Configuration

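The CLI options listed below can also be set via environment variables or a .env file; the last three settings have no CLI flag and are available only as environment variables: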

CLI flag           Environment variable           Default
--api-key          GEMINI_API_KEY                 (required)
--model            GEMINI_MODEL                   gemini-3.1-flash-lite-preview
--include-images   GEMINI_INCLUDE_IMAGES          true
--save-originals   GEMINI_SAVE_ORIGINAL_IMAGES    true
--workers          GEMINI_MAX_WORKERS             1
--verbose          GEMINI_VERBOSE                 false
(none)             GEMINI_MAX_FILE_SIZE_MB        50
(none)             GEMINI_MAX_RETRIES             3
(none)             GEMINI_RETRY_BASE_DELAY        1.0

CLI flags override environment variables when explicitly passed.
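
The precedence described here (explicit flag, then environment variable, then default) could look roughly like the sketch below. `resolve_setting` is a hypothetical helper written for illustration; it is not taken from the tool's source, and treating an empty variable as unset is an assumption.

```python
import os


def resolve_setting(cli_value, env_name, default, cast=str, environ=None):
    """Pick a setting: explicit CLI value, else environment variable, else default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:            # flag was explicitly passed
        return cli_value
    raw = environ.get(env_name)
    if raw is not None and raw != "":    # assumption: empty string means unset
        return cast(raw)
    return default


if __name__ == "__main__":
    # With no -w flag, falls back to GEMINI_MAX_WORKERS or the documented default of 1.
    print(resolve_setting(None, "GEMINI_MAX_WORKERS", 1, int))
```

Using `None` to mean "flag not passed" is what lets an explicit `-w 1` still override an environment value of, say, 4.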

Development

# Install dev dependencies
uv sync --extra dev

# Run tests
uv run pytest

# Lint
uv run ruff check .

# Format
uv run ruff format .

# Type check
uv run mypy gemini_ocr/ --ignore-missing-imports

Limitations

  • Maximum file size: 50 MB (configurable via GEMINI_MAX_FILE_SIZE_MB)
  • Supported formats: PDF, JPG, JPEG, PNG, WEBP, GIF, BMP, TIFF

License

MIT License - see LICENSE for details.
