
OCR support for stache-tools CLI using Tesseract

stache-tools-ocr

OCR support for stache-tools CLI. Automatically detects and processes scanned PDFs and images using Tesseract.

Quick Start

# 1. Install system dependencies (one-time)
# Ubuntu/Debian:
sudo apt install ocrmypdf tesseract-ocr

# macOS:
brew install ocrmypdf

# Windows:
choco install ocrmypdf

# 2. Install Python package
pip install stache-tools-ocr

# 3. Ingest files as usual; OCR is applied automatically
stache ingest scanned.pdf
stache ingest *.jpg *.png

That's it! OCR loaders automatically register with the CLI.

Architecture

stache-tools-ocr is a thin adapter that wraps stache-ai-ocr to provide OCR capabilities for the stache-tools CLI.

Design: This package adapts stache-ai-ocr's file path interface to stache-tools' BinaryIO interface, enabling seamless integration with the stache ingest command.
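
One common way to bridge a BinaryIO interface to a file path interface is to spool the stream to a temporary file. The sketch below illustrates that idea only; it is not taken from this package. The names load_via_temp_file, path_loader, and fake_path_loader are hypothetical, and fake_path_loader merely stands in for a path-based OCR call.

```python
import io
import os
import tempfile
from typing import BinaryIO, Callable


def load_via_temp_file(
    stream: BinaryIO,
    suffix: str,
    path_loader: Callable[[str], str],
) -> str:
    """Write a binary stream to a temp file and hand its path to path_loader."""
    # delete=False lets the file be reopened by path on every platform;
    # it is removed explicitly in the finally block instead.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with tmp:
            tmp.write(stream.read())
        return path_loader(tmp.name)
    finally:
        os.unlink(tmp.name)


def fake_path_loader(path: str) -> str:
    # Hypothetical stand-in for a path-based OCR function.
    return f"{os.path.getsize(path)} bytes"


result = load_via_temp_file(io.BytesIO(b"%PDF-1.4"), ".pdf", fake_path_loader)
```

Keeping the suffix on the temporary file matters when the path-based tool infers the file type from the extension.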

Benefits:

  • Single source of truth for OCR logic (in stache-ai-ocr)
  • Simplified installation (dependencies pulled automatically)
  • Rich metadata from OCR operations
  • Works with all stache-tools features (--parallel, --dry-run, --namespace)

Installation

Python Package

pip install stache-tools-ocr

This automatically installs stache-ai-ocr>=0.2.0 and all required Python dependencies (pdfplumber, ocrmypdf, Tesseract bindings).

System Dependencies

pip does not install the system binaries. You still need to install ocrmypdf and Tesseract OCR separately:

Ubuntu/Debian:

sudo apt install ocrmypdf tesseract-ocr

macOS:

brew install ocrmypdf

Windows:

choco install ocrmypdf

Development

cd stache-tools-ocr
pip install -e .

Usage

Once installed, OCR loaders automatically register and activate:

# Scanned PDFs automatically use OCR
stache ingest scanned.pdf

# Image OCR (requires pytesseract and the tesseract binary)
stache ingest photo.jpg screenshot.png

# Works with all CLI features
stache ingest *.pdf *.jpg --parallel 4 --namespace books

# Dry run to preview processing
stache ingest scanned.pdf image.jpg --dry-run

How It Works

PDF Loader

  • Priority: 10 (overrides BasicPDFLoader at priority 0)
  • Smart Detection: Attempts text extraction first and falls back to OCR if the document appears scanned
  • Method: ocrmypdf CLI tool
  • Metadata: Provides ocr_used, ocr_failed, page_count, chars_per_page, and ocr_method
  • Graceful Fallback: Returns empty text with a warning if ocrmypdf is not installed
  • Thread Safe: Works with --parallel mode
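
As a rough illustration of the "text first, OCR if scanned" idea, the sketch below decides from the average amount of extracted text per page. The actual heuristic lives in stache-ai-ocr and may differ; the threshold MIN_CHARS_PER_PAGE and the function names are assumptions made for this example.

```python
from typing import Sequence

# Hypothetical threshold, chosen for illustration only.
MIN_CHARS_PER_PAGE = 50


def chars_per_page(page_texts: Sequence[str]) -> float:
    """Average number of non-whitespace characters extracted per page."""
    if not page_texts:
        return 0.0
    total = sum(len("".join(text.split())) for text in page_texts)
    return total / len(page_texts)


def looks_scanned(
    page_texts: Sequence[str],
    threshold: float = MIN_CHARS_PER_PAGE,
) -> bool:
    """Treat a document as scanned when extraction yields very little text."""
    return chars_per_page(page_texts) < threshold
```

A loader built this way would run OCR only when looks_scanned returns True, so PDFs that already have a text layer skip the expensive step.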

Image Loader

  • Priority: 5 (standard)
  • Formats: JPG, JPEG, PNG, TIFF, TIF, BMP, GIF
  • Method: pytesseract (Tesseract OCR)
  • Metadata: Provides ocr_used, ocr_method, image_format, image_size
  • Graceful Fallback: Returns empty text with a warning if pytesseract is not installed
  • Thread Safe: Works with --parallel mode
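
The priorities listed for the two loaders suggest that, for a given file extension, the registered loader with the highest priority wins. The sketch below models that rule under this assumption. The LoaderEntry type, the pick_loader function, and the name OcrImageLoader are hypothetical; BasicPDFLoader and OcrPdfLoader are the names used in the Configuration section.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LoaderEntry:
    name: str
    extensions: frozenset
    priority: int


REGISTRY = [
    LoaderEntry("BasicPDFLoader", frozenset({"pdf"}), 0),
    LoaderEntry("OcrPdfLoader", frozenset({"pdf"}), 10),
    # Hypothetical name for the image loader described above.
    LoaderEntry(
        "OcrImageLoader",
        frozenset({"jpg", "jpeg", "png", "tiff", "tif", "bmp", "gif"}),
        5,
    ),
]


def pick_loader(filename: str, registry: List[LoaderEntry]) -> Optional[LoaderEntry]:
    """Return the highest-priority loader that handles the file's extension."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    candidates = [entry for entry in registry if ext in entry.extensions]
    return max(candidates, key=lambda entry: entry.priority, default=None)
```

With this registry, a .pdf file resolves to OcrPdfLoader because priority 10 beats priority 0, and a file with an unknown extension resolves to None.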

OCR Behavior

For details on OCR heuristics, timeout configuration, and complete metadata fields, see the stache-ai-ocr documentation.

Configuration

Environment Variables:

Variable            Default  Purpose
STACHE_OCR_TIMEOUT  300      OCR timeout in seconds
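
For illustration, a timeout setting like this is typically read once and converted to an integer number of seconds. The sketch below shows one way to do that. The function name ocr_timeout_seconds is hypothetical, and falling back to the default on invalid or non-positive values is an assumption, not documented behavior.

```python
import os
from typing import Mapping, Optional

DEFAULT_OCR_TIMEOUT = 300  # seconds, matching the documented default


def ocr_timeout_seconds(environ: Optional[Mapping[str, str]] = None) -> int:
    """Read STACHE_OCR_TIMEOUT, using the default when unset or invalid."""
    env = os.environ if environ is None else environ
    raw = env.get("STACHE_OCR_TIMEOUT", "")
    try:
        value = int(raw)
    except ValueError:
        # Assumption: unparsable values are ignored rather than raising.
        return DEFAULT_OCR_TIMEOUT
    return value if value > 0 else DEFAULT_OCR_TIMEOUT
```

The resulting value could then be passed as the timeout argument of subprocess.run when invoking an external OCR tool.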

Override Loaders:

# Force use of OCR loader
export STACHE_LOADER_PDF=OcrPdfLoader
stache ingest document.pdf

# Use basic loader (skip OCR)
export STACHE_LOADER_PDF=BasicPDFLoader
stache ingest document.pdf
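
Conceptually, an override like this replaces the priority-based choice with an explicitly named loader. The sketch below shows that lookup under stated assumptions: resolve_pdf_loader is a hypothetical function, and raising an error for an unknown loader name is a guess at reasonable behavior, not something confirmed by stache-tools.

```python
import os
from typing import Collection, Mapping, Optional


def resolve_pdf_loader(
    default_name: str,
    available: Collection[str],
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the loader named in STACHE_LOADER_PDF, or the default if unset."""
    env = os.environ if environ is None else environ
    override = env.get("STACHE_LOADER_PDF")
    if not override:
        return default_name
    if override not in available:
        # Assumption: an unknown loader name is treated as an error.
        raise ValueError(f"Unknown PDF loader: {override!r}")
    return override
```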

System Requirements

  • Python: 3.10+
  • ocrmypdf: System binary (for PDF OCR)
  • tesseract-ocr: System binary (for image OCR)
  • pdfplumber, pytesseract, pillow: Installed via pip

Cost & Performance

  • Free - no API costs
  • Offline - works without internet
  • Speed: ~1-3 seconds per page (CPU-bound)
  • Quality: roughly 99% accuracy on clean scans, 70-90% on poor-quality scans

Troubleshooting

"ocrmypdf not found" error:

# Verify installation
which ocrmypdf
tesseract --version

# Reinstall if needed (Ubuntu/Debian)
sudo apt install --reinstall ocrmypdf tesseract-ocr

Tesseract not found (image OCR):

# Install tesseract
# Ubuntu/Debian:
sudo apt install tesseract-ocr

# macOS:
brew install tesseract

# Windows:
choco install tesseract
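
The same checks can be run from Python on any platform. This is a small diagnostic sketch, not part of the package; missing_binaries is a hypothetical helper built on the standard library's shutil.which.

```python
import shutil
from typing import Callable, List, Optional, Sequence

REQUIRED_BINARIES = ("ocrmypdf", "tesseract")


def missing_binaries(
    names: Sequence[str] = REQUIRED_BINARIES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names of required executables not found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Missing system binaries:", ", ".join(missing))
    else:
        print("ocrmypdf and tesseract were found on PATH")
```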

OCR timeout on large PDFs:

# Increase timeout to 10 minutes
export STACHE_OCR_TIMEOUT=600
stache ingest large-document.pdf

Poor OCR quality:

  • Ensure the scan is at least 300 DPI
  • Try pre-processing with image tools (deskew, denoise)
  • See stache-ai-ocr docs for advanced OCR tuning

License

MIT
