stache-tools-ocr

OCR support for stache-tools CLI. Automatically detects and processes scanned PDFs and images using Tesseract.

Quick Start

# 1. Install system dependencies (one-time)
# Ubuntu/Debian:
sudo apt install ocrmypdf tesseract-ocr

# macOS:
brew install ocrmypdf

# Windows:
choco install ocrmypdf

# 2. Install Python package
pip install stache-tools-ocr

# 3. Use it automatically
stache ingest scanned.pdf
stache ingest *.jpg *.png

That's it! OCR loaders automatically register with the CLI.

Architecture

stache-tools-ocr is a thin adapter that wraps stache-ai-ocr to provide OCR capabilities for the stache-tools CLI.

Design: This package adapts stache-ai-ocr's file path interface to stache-tools' BinaryIO interface, enabling seamless integration with the stache ingest command.
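The adaptation described above can be pictured as spooling the incoming binary stream to a temporary file and passing that file's path to a path-based function. The following is a minimal sketch of that idea only, not the package's actual code: `load_via_path` and `fake_path_loader` are hypothetical names, and the real stache-ai-ocr entry point is not shown here.

```python
import io
import os
import tempfile
from typing import BinaryIO, Callable


def load_via_path(stream: BinaryIO, suffix: str, path_loader: Callable[[str], str]) -> str:
    """Write a binary stream to a temporary file and pass its path to a path-based loader."""
    # delete=False lets the loader reopen the file by name on every platform.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(stream.read())
        tmp_path = tmp.name
    try:
        return path_loader(tmp_path)
    finally:
        # Always clean up the temporary copy, even if the loader raises.
        os.remove(tmp_path)


def fake_path_loader(path: str) -> str:
    """Hypothetical stand-in for a path-based OCR call; it only reports the file size."""
    return f"{os.path.getsize(path)} bytes"


print(load_via_path(io.BytesIO(b"%PDF-1.4"), ".pdf", fake_path_loader))  # prints "8 bytes"
```

Keeping the file suffix matters in a design like this, because path-based tools often pick their handling from the extension.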

Benefits:

Single source of truth for OCR logic (in stache-ai-ocr)
Simplified installation (dependencies pulled automatically)
Rich metadata from OCR operations
Works with all stache-tools features (--parallel, --dry-run, --namespace)

Installation

Python Package

pip install stache-tools-ocr

This automatically installs stache-ai-ocr>=0.2.0 and all required dependencies (pdfplumber, ocrmypdf, tesseract bindings).

System Dependencies

You still need to install the ocrmypdf and Tesseract system binaries separately:

Ubuntu/Debian:

sudo apt install ocrmypdf tesseract-ocr

macOS:

brew install ocrmypdf

Windows:

choco install ocrmypdf

Development

cd stache-tools-ocr
pip install -e .

Usage

Once installed, OCR loaders automatically register and activate:

# Scanned PDFs automatically use OCR
stache ingest scanned.pdf

# Image OCR (requires pytesseract)
stache ingest photo.jpg screenshot.png

# Works with all CLI features
stache ingest *.pdf *.jpg --parallel 4 --namespace books

# Dry run to preview processing
stache ingest scanned.pdf image.jpg --dry-run

How It Works

PDF Loader

Priority: 10 (overrides BasicPDFLoader at priority 0)
Smart Detection: Attempts text extraction first, falls back to OCR if document appears scanned
Method: ocrmypdf CLI tool
Metadata: Provides ocr_used, ocr_failed, page_count, chars_per_page, and ocr_method
Graceful Fallback: Returns empty text with a warning if ocrmypdf is not installed
Thread Safe: Works with --parallel mode
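The "text first, OCR as fallback" behavior listed above could look roughly like the sketch below. It is an illustration under assumptions, not the real heuristic (which lives in stache-ai-ocr): the `MIN_CHARS_PER_PAGE` threshold, the `extract_with_fallback` name, and the exact meaning given to each metadata field are all hypothetical.

```python
import warnings
from typing import Callable, Optional

# Assumed threshold: fewer extracted characters per page than this counts as "scanned".
MIN_CHARS_PER_PAGE = 50


def extract_with_fallback(page_texts: list, run_ocr: Callable[[], Optional[str]]):
    """Return (text, metadata), using OCR only when the extracted text looks too sparse."""
    page_count = len(page_texts)
    total_chars = sum(len(text.strip()) for text in page_texts)
    chars_per_page = total_chars / page_count if page_count else 0.0
    metadata = {
        "page_count": page_count,
        "chars_per_page": chars_per_page,
        "ocr_used": False,
        "ocr_failed": False,
        "ocr_method": None,
    }
    if chars_per_page >= MIN_CHARS_PER_PAGE:
        # Enough embedded text: treat the PDF as digital and skip OCR.
        return "\n".join(page_texts), metadata

    ocr_text = run_ocr()  # assumed to return None when OCR is unavailable or fails
    if ocr_text is None:
        warnings.warn("OCR unavailable or failed; returning empty text")
        metadata["ocr_failed"] = True
        return "", metadata
    metadata["ocr_used"] = True
    metadata["ocr_method"] = "ocrmypdf"
    return ocr_text, metadata
```

Passing the OCR step in as a callable keeps the sketch testable without ocrmypdf installed.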

Image Loader

Priority: 5 (standard)
Formats: JPG, JPEG, PNG, TIFF, TIF, BMP, GIF
Method: pytesseract (Tesseract OCR)
Metadata: Provides ocr_used, ocr_method, image_format, image_size
Graceful Fallback: Returns empty text with a warning if pytesseract is not installed
Thread Safe: Works with --parallel mode
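The priorities listed for the two loaders suggest a "highest priority wins" selection per file extension. The sketch below illustrates that idea only; the registry structure, the `pick_loader` function, and the `OcrImageLoader` name are hypothetical, and the real stache-tools registration mechanism may differ.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoaderEntry:
    name: str
    priority: int
    extensions: frozenset


IMAGE_EXTENSIONS = frozenset({".jpg", ".jpeg", ".png", ".tiff", ".tif", ".bmp", ".gif"})

# Hypothetical registry mirroring the priorities stated in this README.
REGISTRY = [
    LoaderEntry("BasicPDFLoader", 0, frozenset({".pdf"})),
    LoaderEntry("OcrPdfLoader", 10, frozenset({".pdf"})),
    LoaderEntry("OcrImageLoader", 5, IMAGE_EXTENSIONS),  # name is an assumption
]


def pick_loader(filename: str, registry=REGISTRY) -> Optional[str]:
    """Return the name of the highest-priority loader that handles the file's extension."""
    ext = os.path.splitext(filename)[1].lower()
    candidates = [entry for entry in registry if ext in entry.extensions]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry.priority).name
```

Under these assumptions, `pick_loader("scan.pdf")` resolves to the OCR loader because its priority of 10 outranks the basic loader's 0.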

OCR Behavior

For details on OCR heuristics, timeout configuration, and complete metadata fields, see the stache-ai-ocr documentation.

Configuration

Environment Variables:

Variable	Default	Purpose
`STACHE_OCR_TIMEOUT`	300	OCR timeout in seconds
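As an illustration of how a setting like this is typically consumed, the sketch below reads the variable and falls back to the documented default of 300 seconds. The function name `ocr_timeout_seconds` and the choice to fall back on invalid or non-positive values are assumptions, not confirmed behavior of the package.

```python
import os

DEFAULT_OCR_TIMEOUT = 300  # seconds, the default documented above


def ocr_timeout_seconds(environ=None) -> int:
    """Read STACHE_OCR_TIMEOUT, falling back to the default when unset or invalid."""
    env = os.environ if environ is None else environ
    raw = env.get("STACHE_OCR_TIMEOUT", "")
    try:
        value = int(raw)
    except ValueError:
        # Assumed policy: ignore unparsable values rather than raising.
        return DEFAULT_OCR_TIMEOUT
    return value if value > 0 else DEFAULT_OCR_TIMEOUT
```

A value obtained this way could be passed as the `timeout` argument of `subprocess.run` when invoking an external OCR tool.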

Override Loaders:

# Force use of OCR loader
export STACHE_LOADER_PDF=OcrPdfLoader
stache ingest document.pdf

# Use basic loader (skip OCR)
export STACHE_LOADER_PDF=BasicPDFLoader
stache ingest document.pdf

System Requirements

Python: 3.10+
ocrmypdf: System binary (for PDF OCR)
tesseract-ocr: System binary (for image OCR; also used by ocrmypdf)
pdfplumber, pytesseract, pillow: Installed via pip

Cost & Performance

Free - no API costs
Offline - works without internet
Speed: ~1-3 seconds per page (CPU-bound)
Quality: roughly 99% accuracy on clean scans, 70-90% on poor-quality scans

Troubleshooting

"ocrmypdf not found" error:

# Verify installation (on Windows, use `where ocrmypdf` instead of `which`)
which ocrmypdf
tesseract --version

# Reinstall if needed (Ubuntu/Debian)
sudo apt install --reinstall ocrmypdf tesseract-ocr
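The same check can be done from Python in a platform-independent way with the standard library's `shutil.which`. This is a small diagnostic sketch and not part of the package; `missing_binaries` is a hypothetical helper name.

```python
import shutil


def missing_binaries(names=("ocrmypdf", "tesseract"), which=shutil.which):
    """Return the names of the required executables that are not found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result means both executables are on PATH for the Python process, which is the environment the loaders would see.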

Tesseract not found (image OCR):

# Install tesseract
# Ubuntu/Debian:
sudo apt install tesseract-ocr

# macOS:
brew install tesseract

# Windows:
choco install tesseract

OCR timeout on large PDFs:

# Increase timeout to 10 minutes
export STACHE_OCR_TIMEOUT=600
stache ingest large-document.pdf
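To pick a timeout value, a rough estimate can be derived from the per-page speed quoted under Cost & Performance (~1-3 seconds per page). The sketch below assumes the timeout applies to a whole document, which this README does not state, and the `safety_factor` of 1.5 is an arbitrary illustrative margin.

```python
import math


def pages_within(timeout_seconds: float, seconds_per_page: float = 3.0) -> int:
    """Number of pages that fit in the timeout at the given worst-case speed."""
    return int(timeout_seconds // seconds_per_page)


def suggested_timeout(page_count: int, seconds_per_page: float = 3.0, safety_factor: float = 1.5) -> int:
    """Estimate a timeout in seconds for a document, padded by a safety factor."""
    return math.ceil(page_count * seconds_per_page * safety_factor)


print(pages_within(300))       # 100 pages fit in the default timeout at 3 s/page
print(suggested_timeout(400))  # 1800 seconds for a 400-page scan
```

Under these assumptions, the default of 300 seconds covers about 100 pages at the slow end of the quoted range, so longer scans are the ones likely to need a higher value.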

Poor OCR quality:

Ensure the scan is at least 300 DPI
Pre-process the scan with image tools (deskew, denoise)
See the stache-ai-ocr docs for advanced OCR tuning

License

MIT

Project details

Version 0.1.1, released Jan 12, 2026
