OCR support for stache-tools CLI using Tesseract
stache-tools-ocr
OCR support for stache-tools CLI. Automatically detects and processes scanned PDFs and images using Tesseract.
Quick Start
# 1. Install system dependencies (one-time)
# Ubuntu/Debian:
sudo apt install ocrmypdf tesseract-ocr
# macOS:
brew install ocrmypdf
# Windows:
choco install ocrmypdf
# 2. Install Python package
pip install stache-tools-ocr
# 3. Use it automatically
stache ingest scanned.pdf
stache ingest *.jpg *.png
That's it! OCR loaders automatically register with the CLI.
Architecture
stache-tools-ocr is a thin adapter that wraps stache-ai-ocr to provide OCR capabilities for the stache-tools CLI.
Design: This package adapts stache-ai-ocr's file path interface to stache-tools' BinaryIO interface, enabling seamless integration with the stache ingest command.
Benefits:
- Single source of truth for OCR logic (in stache-ai-ocr)
- Simplified installation (dependencies pulled automatically)
- Rich metadata from OCR operations
- Works with all stache-tools features (--parallel, --dry-run, --namespace)
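The adapter idea described under Design can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the package's actual code: `load_via_temp_file` and `fake_path_loader` are hypothetical names, and the real stache-ai-ocr entry point is only assumed to take a file path and return text.

```python
import io
import os
import tempfile
from typing import BinaryIO, Callable


def load_via_temp_file(
    stream: BinaryIO,
    filename: str,
    path_loader: Callable[[str], str],
) -> str:
    """Bridge a BinaryIO stream to a loader that only accepts file paths."""
    # Keep the original extension so a path-based loader can detect the format.
    suffix = os.path.splitext(filename)[1]
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    try:
        # Copy the stream to disk and close the handle before loading.
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(stream.read())
        return path_loader(tmp_path)
    finally:
        # Always clean up the temporary file, even if the loader raises.
        os.remove(tmp_path)


def fake_path_loader(path: str) -> str:
    """Hypothetical stand-in for a path-based OCR function."""
    with open(path, "rb") as fh:
        return fh.read().decode("utf-8")


text = load_via_temp_file(io.BytesIO(b"hello"), "scan.pdf", fake_path_loader)
```

Writing to a temporary file costs one extra copy of the input, but it keeps the OCR logic in a single place, which matches the "single source of truth" benefit listed above.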
Installation
Python Package
pip install stache-tools-ocr
This automatically installs stache-ai-ocr>=0.2.0 and all required dependencies (pdfplumber, ocrmypdf, tesseract bindings).
System Dependencies
You still need to install the system binaries (ocrmypdf and Tesseract OCR) separately:
Ubuntu/Debian:
sudo apt install ocrmypdf tesseract-ocr
macOS:
brew install ocrmypdf
Windows:
choco install ocrmypdf
Development
cd stache-tools-ocr
pip install -e .
Usage
Once installed, OCR loaders automatically register and activate:
# Scanned PDFs automatically use OCR
stache ingest scanned.pdf
# Image OCR (requires pytesseract)
stache ingest photo.jpg screenshot.png
# Works with all CLI features
stache ingest *.pdf *.jpg --parallel 4 --namespace books
# Dry run to preview processing
stache ingest scanned.pdf image.jpg --dry-run
How It Works
PDF Loader
- Priority: 10 (overrides BasicPDFLoader at priority 0)
- Smart Detection: Attempts text extraction first, falls back to OCR if document appears scanned
- Method: ocrmypdf CLI tool
- Metadata: Provides ocr_used, ocr_failed, page_count, chars_per_page, and ocr_method
- Graceful Fallback: Returns empty text with warning if ocrmypdf not installed
- Thread Safe: Works with --parallel mode
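The "text first, OCR as fallback" decision can be illustrated with a simple density check. This is only a sketch: the function names and the 50-characters-per-page threshold are assumptions made for illustration, and the actual heuristic lives in stache-ai-ocr.

```python
def chars_per_page(page_texts: list[str]) -> float:
    """Average number of non-whitespace-trimmed characters per page."""
    if not page_texts:
        return 0.0
    total = sum(len(text.strip()) for text in page_texts)
    return total / len(page_texts)


def looks_scanned(page_texts: list[str], min_chars_per_page: float = 50.0) -> bool:
    """Guess that a PDF is scanned when plain extraction yields little text.

    The threshold is an assumed value, not one taken from stache-ai-ocr.
    """
    return chars_per_page(page_texts) < min_chars_per_page


# Text extracted per page by a plain (non-OCR) PDF reader.
digital_pdf = ["x" * 200, "y" * 100]
scanned_pdf = ["", "  "]
```

With these inputs, `looks_scanned(scanned_pdf)` is true and `looks_scanned(digital_pdf)` is false, so only the second document would skip OCR.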
Image Loader
- Priority: 5 (standard)
- Formats: JPG, JPEG, PNG, TIFF, TIF, BMP, GIF
- Method: pytesseract (Tesseract OCR)
- Metadata: Provides ocr_used, ocr_method, image_format, image_size
- Graceful Fallback: Returns empty text with warning if pytesseract not installed
- Thread Safe: Works with --parallel mode
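The priorities listed for the two loaders suggest a "highest priority wins per extension" rule. The sketch below illustrates that rule only; `LoaderEntry`, `pick_loader`, and the `OcrImageLoader` name are hypothetical, and the real stache-tools registry may resolve loaders differently.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Optional


@dataclass(frozen=True)
class LoaderEntry:
    name: str
    extensions: frozenset
    priority: int


IMAGE_EXTENSIONS = frozenset({"jpg", "jpeg", "png", "tiff", "tif", "bmp", "gif"})

LOADERS = [
    LoaderEntry("BasicPDFLoader", frozenset({"pdf"}), 0),
    LoaderEntry("OcrPdfLoader", frozenset({"pdf"}), 10),
    # "OcrImageLoader" is an assumed name for the image loader.
    LoaderEntry("OcrImageLoader", IMAGE_EXTENSIONS, 5),
]


def pick_loader(filename: str, loaders=LOADERS) -> Optional[str]:
    """Return the name of the highest-priority loader for this file type."""
    ext = PurePath(filename).suffix.lstrip(".").lower()
    candidates = [entry for entry in loaders if ext in entry.extensions]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry.priority).name
```

Under this rule a PDF resolves to the OCR loader (priority 10) rather than the basic one (priority 0), and files with no matching extension get no loader at all.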
OCR Behavior
For details on OCR heuristics, timeout configuration, and complete metadata fields, see the stache-ai-ocr documentation.
Configuration
Environment Variables:
| Variable | Default | Purpose |
|---|---|---|
| STACHE_OCR_TIMEOUT | 300 | OCR timeout in seconds |
Override Loaders:
# Force use of OCR loader
export STACHE_LOADER_PDF=OcrPdfLoader
stache ingest document.pdf
# Use basic loader (skip OCR)
export STACHE_LOADER_PDF=BasicPDFLoader
stache ingest document.pdf
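For scripts that wrap the CLI, the same settings can be read from Python. The helper names below are hypothetical, and falling back to the default on an invalid or non-positive value is an assumption rather than documented behavior.

```python
import os
from typing import Mapping, Optional

DEFAULT_OCR_TIMEOUT = 300  # seconds, per the table above


def ocr_timeout_seconds(env: Optional[Mapping[str, str]] = None) -> int:
    """Read STACHE_OCR_TIMEOUT, using the default when unset or invalid."""
    env = os.environ if env is None else env
    raw = env.get("STACHE_OCR_TIMEOUT", "").strip()
    try:
        value = int(raw)
    except ValueError:
        # Assumed behavior: ignore values that are not integers.
        return DEFAULT_OCR_TIMEOUT
    return value if value > 0 else DEFAULT_OCR_TIMEOUT


def pdf_loader_override(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the loader name set in STACHE_LOADER_PDF, or None if unset."""
    env = os.environ if env is None else env
    return env.get("STACHE_LOADER_PDF") or None
```

Passing a plain dictionary instead of `os.environ` makes these helpers easy to exercise without touching the real environment.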
System Requirements
- Python: 3.10+
- ocrmypdf: System binary (for PDF OCR)
- tesseract-ocr: System binary (for image OCR)
- pdfplumber, pytesseract, pillow: Installed via pip
Cost & Performance
- Free - no API costs
- Offline - works without internet
- Speed: ~1-3 seconds per page (CPU-bound)
- Quality: about 99% accuracy on clean scans, 70-90% on poor-quality scans
Troubleshooting
"ocrmypdf not found" error:
# Verify installation
which ocrmypdf
tesseract --version
# Reinstall if needed (Ubuntu/Debian)
sudo apt install --reinstall ocrmypdf tesseract-ocr
Tesseract not found (image OCR):
# Install tesseract
# Ubuntu/Debian:
sudo apt install tesseract-ocr
# macOS:
brew install tesseract
# Windows:
choco install tesseract
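The same checks can be run from Python before a batch ingest. This is a small preflight sketch using the standard library's `shutil.which`; the `missing_binaries` helper is a hypothetical name and is not part of stache-tools-ocr.

```python
import shutil


def missing_binaries(names=("ocrmypdf", "tesseract")) -> list[str]:
    """Return the names of required executables that are not on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_binaries()
if missing:
    print("Missing system dependencies: " + ", ".join(missing))
else:
    print("ocrmypdf and tesseract are both available")
```

An empty result means both binaries were found on PATH; it does not verify their versions or installed language packs.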
OCR timeout on large PDFs:
# Increase timeout to 10 minutes
export STACHE_OCR_TIMEOUT=600
stache ingest large-document.pdf
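To pick a timeout for a large document, a rough estimate can be derived from the page count. The sketch below uses the upper end of the ~1-3 seconds per page figure from this README; the 1.5x headroom factor and the `suggested_timeout` helper are assumptions, not part of the package.

```python
import math


def suggested_timeout(
    page_count: int,
    seconds_per_page: float = 3.0,  # upper end of the README's estimate
    headroom: float = 1.5,          # assumed safety margin
    floor: int = 300,               # never go below the default timeout
) -> int:
    """Estimate a STACHE_OCR_TIMEOUT value in seconds for a document."""
    estimate = math.ceil(page_count * seconds_per_page * headroom)
    return max(floor, estimate)


# A 200-page scan: 200 * 3.0 * 1.5 = 900 seconds.
print(f"export STACHE_OCR_TIMEOUT={suggested_timeout(200)}")
```

For a 50-page document the estimate (225 seconds) is below the default, so the default of 300 seconds is kept.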
Poor OCR quality:
- Ensure scan is at least 300 DPI
- Try pre-processing with image tools (deskew, denoise)
- See stache-ai-ocr docs for advanced OCR tuning
License
MIT