Multi-engine document OCR with cascading fallback
socr
Multi-engine document OCR with cascading fallback and quality audit.
socr orchestrates multiple OCR engines: it calls each one as a CLI subprocess, audits the output quality, and falls back to a different engine when results are poor. Each engine is a standalone CLI tool (gemini-ocr, deepseek-ocr, marker-ocr, etc.) that can also be used independently.
Install
pip install socr
# With specific engine backends (quoted so shells such as zsh do not expand the brackets)
pip install "socr[gemini]" # Google Gemini (cloud)
pip install "socr[local]" # DeepSeek + Nougat (local/free)
pip install "socr[all]" # All engines
Engines are installed separately because they have different dependencies (torch, cloud SDKs, etc.). Install only what you need.
Usage
# Process a PDF
socr paper.pdf
# Choose engine
socr paper.pdf --primary gemini
socr paper.pdf --primary marker
# Save extracted figures
socr paper.pdf --save-figures
# Batch process a directory
socr batch ~/Papers/ -o ./results/
socr batch ~/Papers/ --dry-run # preview what would be processed
socr batch ~/Papers/ --reprocess # force reprocess all
# Check which engines are available
socr engines
How it works
PDF → Primary OCR → Quality Audit → (Fallback OCR if needed) → Markdown
- Primary OCR: calls the primary engine CLI on the whole PDF
- Quality audit: runs heuristic checks (word count, garbage ratio, repetition)
- Fallback: if the audit fails, tries a different engine
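The audit step can be pictured with a small sketch. This is not socr's actual implementation: the function name `audit_markdown`, the garbage and repetition thresholds, and the exact definitions of "garbage" and "repetition" are assumptions made for illustration. Only the minimum word count of 50 mirrors the `audit_min_words` value shown under Configuration.

```python
import string
from collections import Counter


def audit_markdown(text: str, min_words: int = 50,
                   max_garbage_ratio: float = 0.3,
                   max_repetition_ratio: float = 0.5) -> list[str]:
    """Return the names of failed checks; an empty list means the text passed.

    Hypothetical heuristics, not socr's real ones.
    """
    failures = []

    # Word count: very short output usually means the engine extracted nothing useful.
    if len(text.split()) < min_words:
        failures.append("word_count")

    # Garbage ratio: share of visible characters that are neither
    # alphanumeric nor ASCII punctuation (e.g. U+FFFD replacement characters).
    visible = [c for c in text if not c.isspace()]
    garbage = [c for c in visible
               if not (c.isalnum() or c in string.punctuation)]
    if visible and len(garbage) / len(visible) > max_garbage_ratio:
        failures.append("garbage_ratio")

    # Repetition: a single line making up most of the document
    # suggests the model got stuck in a loop.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) >= 4:
        top_count = Counter(lines).most_common(1)[0][1]
        if top_count / len(lines) > max_repetition_ratio:
            failures.append("repetition")

    return failures
```

A caller would treat any non-empty result as a failed audit and move on to the fallback engine.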
Each engine is a separate CLI binary. socr calls it as a subprocess, reads the output markdown, and applies the quality pipeline.
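As a rough sketch of the subprocess-plus-fallback flow described above: the function names `run_engine_cli` and `ocr_with_fallback` are hypothetical, and the engine argument layout (`<binary> <pdf> -o <dir>`) is an assumption, so check each engine's own `--help` for its real interface. The default audit here is just a 50-word minimum standing in for the full quality audit.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_engine_cli(binary: str, pdf: Path, timeout: int = 300) -> Optional[str]:
    """Run one engine CLI and return its markdown, or None on any failure.

    The `<binary> <pdf> -o <dir>` layout is assumed for illustration only.
    """
    with tempfile.TemporaryDirectory() as out_dir:
        try:
            subprocess.run([binary, str(pdf), "-o", out_dir],
                           check=True, capture_output=True, timeout=timeout)
        except (OSError, subprocess.SubprocessError):
            # Missing binary, non-zero exit code, or timeout.
            return None
        produced = sorted(Path(out_dir).rglob("*.md"))
        return produced[0].read_text(encoding="utf-8") if produced else None


def ocr_with_fallback(
    pdf: Path,
    engines: list[str],
    run: Callable[[str, Path], Optional[str]] = run_engine_cli,
    passes_audit: Callable[[str], bool] = lambda md: len(md.split()) >= 50,
) -> Optional[tuple[str, str]]:
    """Try engines in order; return (engine, markdown) for the first result
    that passes the audit, or None if every engine fails."""
    for binary in engines:
        markdown = run(binary, pdf)
        if markdown is not None and passes_audit(markdown):
            return binary, markdown
    return None
```

Passing `run` and `passes_audit` as parameters keeps the cascade logic testable without any OCR engine installed.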
Engines
| Engine | Package | Type | Notes |
|---|---|---|---|
| Gemini | gemini-ocr-cli | Cloud | Google Gemini, ~$0.0002/page |
| Mistral | mistral-ocr-cli | Cloud | Mistral AI |
| Marker | marker-ocr-cli | Local | Layout-aware (Surya + Texify) |
| DeepSeek | deepseek-ocr-cli | Local | Via Ollama |
| Nougat | nougat-ocr-cli | Local | Academic papers, Python <3.13 |
Check availability:
$ socr engines
[+] gemini cloud, ~$0.0002/page
[+] marker local, layout-aware (Surya + Texify)
[+] mistral cloud, ~$0.001/page
[+] deepseek local via Ollama
[x] nougat local, academic papers
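If you want a similar check in your own scripts, one option is to look for the engine binaries on `PATH`. The sketch below is not how `socr engines` is implemented. The binary names `gemini-ocr`, `deepseek-ocr`, and `marker-ocr` come from the description above; `mistral-ocr` and `nougat-ocr` are assumed by analogy, and `engine_status` and `format_status` are hypothetical helpers.

```python
import shutil
from typing import Callable, Optional

# Engine name -> CLI binary. The mistral and nougat names are assumptions.
ENGINE_BINARIES = {
    "gemini": "gemini-ocr",
    "marker": "marker-ocr",
    "mistral": "mistral-ocr",    # assumed binary name
    "deepseek": "deepseek-ocr",
    "nougat": "nougat-ocr",      # assumed binary name
}


def engine_status(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> dict[str, bool]:
    """Map each engine to True if its binary is found on PATH."""
    return {engine: which(binary) is not None
            for engine, binary in ENGINE_BINARIES.items()}


def format_status(status: dict[str, bool]) -> str:
    """Render one line per engine, marking found engines with + and missing ones with x."""
    return "\n".join(f"[{'+' if ok else 'x'}] {name}"
                     for name, ok in status.items())


if __name__ == "__main__":
    print(format_status(engine_status()))
```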
CLI reference
socr process <PDF> [OPTIONS]
-o, --output-dir PATH Output directory
--primary ENGINE Primary OCR engine (gemini, marker, deepseek, etc.)
--fallback ENGINE Fallback engine
--no-audit Skip quality audit
--save-figures Save extracted figure images
--timeout SECONDS Subprocess timeout (default: 300)
--profile NAME Load ~/.config/socr/{name}.yaml
--config PATH Custom YAML config file
-q, --quiet Suppress non-error output
-v, --verbose Verbose output
--dry-run List files without processing
--reprocess Force reprocess already-done files
socr batch <DIR> [OPTIONS]
Same options as process, plus:
--limit N Process first N files
socr engines Show available engines
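To drive `socr process` from a Python script, one approach is to assemble the argument list from the flags documented above and pass it to `subprocess.run`. The helper `build_process_command` is hypothetical; socr's exit codes and console output are not documented here, so the sketch only builds and launches the command.

```python
import subprocess
from pathlib import Path
from typing import Optional, Union


def build_process_command(
    pdf: Union[str, Path],
    output_dir: Optional[Union[str, Path]] = None,
    primary: Optional[str] = None,
    fallback: Optional[str] = None,
    timeout: Optional[int] = None,
    save_figures: bool = False,
) -> list[str]:
    """Build an argv list for `socr process` using only documented flags."""
    cmd = ["socr", "process", str(pdf)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if primary is not None:
        cmd += ["--primary", primary]
    if fallback is not None:
        cmd += ["--fallback", fallback]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if save_figures:
        cmd.append("--save-figures")
    return cmd


if __name__ == "__main__":
    argv = build_process_command("paper.pdf", output_dir="results",
                                 primary="gemini", fallback="marker")
    # check=False: inspect the return code yourself rather than raising.
    completed = subprocess.run(argv, check=False)
    print("exit code:", completed.returncode)
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.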
Output
output/<doc_stem>/
├── <doc_stem>.md # OCR text
├── metadata.json # Processing stats
└── figures/ # With --save-figures
└── figure_1_page3.png
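A script that consumes these results might locate them as sketched below. The sketch assumes `<doc_stem>` is the PDF file name without its extension and that figures are PNG files, as in the example above. The helpers `expected_outputs` and `load_result` are hypothetical, and the contents of `metadata.json` are treated as an opaque dictionary because its fields are not documented here.

```python
import json
from pathlib import Path
from typing import Union


def expected_outputs(output_dir: Union[str, Path],
                     pdf: Union[str, Path]) -> dict[str, Path]:
    """Return the paths where results for one PDF are expected to be."""
    stem = Path(pdf).stem  # assumed meaning of <doc_stem>
    doc_dir = Path(output_dir) / stem
    return {
        "markdown": doc_dir / f"{stem}.md",
        "metadata": doc_dir / "metadata.json",
        "figures": doc_dir / "figures",
    }


def load_result(output_dir: Union[str, Path],
                pdf: Union[str, Path]) -> tuple[str, dict, list[Path]]:
    """Load the OCR text, the metadata (if present), and any figure paths."""
    paths = expected_outputs(output_dir, pdf)
    text = paths["markdown"].read_text(encoding="utf-8")
    metadata = {}
    if paths["metadata"].exists():
        metadata = json.loads(paths["metadata"].read_text(encoding="utf-8"))
    # The figures directory only exists when --save-figures was used.
    figures = []
    if paths["figures"].is_dir():
        figures = sorted(paths["figures"].glob("*.png"))
    return text, metadata, figures
```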
Configuration
Create ~/.config/socr/config.yaml:
primary_engine: gemini
fallback_engine: marker
timeout: 300
save_figures: false
audit_enabled: true
audit_min_words: 50
Or use profiles: save settings to ~/.config/socr/fast.yaml, then run socr paper.pdf --profile fast
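One way to picture how these settings could combine is sketched below. The precedence shown (`--config` over `--profile` over the default file, and command-line flags over file values) is an assumption, not documented behaviour. The `DEFAULTS` dictionary reuses the example values above as stand-ins; only the 300-second timeout is a documented default. `config_path` and `merge_settings` are hypothetical names, and parsing the YAML file itself is left out.

```python
from pathlib import Path
from typing import Any, Optional, Union

CONFIG_DIR = Path.home() / ".config" / "socr"

# Stand-in defaults taken from the example config; only `timeout` is documented.
DEFAULTS: dict[str, Any] = {
    "primary_engine": "gemini",
    "fallback_engine": "marker",
    "timeout": 300,
    "save_figures": False,
    "audit_enabled": True,
    "audit_min_words": 50,
}


def config_path(profile: Optional[str] = None,
                explicit: Optional[Union[str, Path]] = None) -> Path:
    """Pick the YAML file to load (assumed order: --config, --profile, default)."""
    if explicit is not None:
        return Path(explicit)
    return CONFIG_DIR / f"{profile or 'config'}.yaml"


def merge_settings(file_values: dict[str, Any],
                   cli_overrides: dict[str, Any]) -> dict[str, Any]:
    """Layer settings: defaults, then file values, then CLI flags that were set."""
    settings = dict(DEFAULTS)
    settings.update(file_values)
    # A value of None means the flag was not given on the command line.
    settings.update({k: v for k, v in cli_overrides.items() if v is not None})
    return settings
```

For example, `merge_settings({"primary_engine": "marker"}, {"timeout": 60})` keeps every default except the primary engine and the timeout.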
Engine CLIs
Each backend is an independent CLI tool:
- gemini-ocr-cli — Google Gemini
- deepseek-ocr-cli — DeepSeek via Ollama
- mistral-ocr-cli — Mistral AI
- marker-ocr-cli — Marker (Surya + Texify)
- nougat-ocr-cli — Meta Nougat
License
MIT
File details
Details for the file socr-1.0.0.tar.gz.
File metadata
- Download URL: socr-1.0.0.tar.gz
- Upload date:
- Size: 554.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dda77a37251c9e684aafe3b0c7fc1ebdf7e434ae08974b37306f2d72ee639bab |
| MD5 | 903e30ca9cbd5a1772d38cdb35cda2a6 |
| BLAKE2b-256 | 993162f752cc76bcd4e6e22959fa4695d7e72005631acc1a771835425b675763 |
File details
Details for the file socr-1.0.0-py3-none-any.whl.
File metadata
- Download URL: socr-1.0.0-py3-none-any.whl
- Upload date:
- Size: 57.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | afe5ed263de9d90d53d6a69ac6d3f4a0139b76e604299ae0ce0fdab29018b0c1 |
| MD5 | 5757bf953d8ab64581e6dad8e0b47333 |
| BLAKE2b-256 | e57f46eb55e7ed8a16d33a6ff15b3e38262d28b702d53bd5e43344dcd1d47d41 |