Multi-engine document OCR with cascading fallback

socr

Supports Python 3.11–3.12.

Multi-engine document OCR with cascading fallback and quality audit.

socr orchestrates multiple OCR engines: it calls each one as a CLI subprocess, audits the output quality, and falls back to a different engine when the result is poor. Each engine is a standalone CLI tool (gemini-ocr, deepseek-ocr, marker-ocr, etc.) that can also be used independently.

Install

pip install socr

# With specific engine backends
pip install "socr[gemini]"        # Google Gemini (cloud)
pip install "socr[local]"         # DeepSeek + Nougat (local/free)
pip install "socr[all]"           # All engines

Engines are installed separately because they have different dependencies (torch, cloud SDKs, etc.). Install only what you need.

Usage

# Process a PDF
socr paper.pdf

# Choose engine
socr paper.pdf --primary gemini
socr paper.pdf --primary marker

# Save extracted figures
socr paper.pdf --save-figures

# Batch process a directory
socr batch ~/Papers/ -o ./results/
socr batch ~/Papers/ --dry-run        # preview what would be processed
socr batch ~/Papers/ --reprocess      # force reprocess all

# Check which engines are available
socr engines

How it works

PDF → Primary OCR → Quality Audit → (Fallback OCR if needed) → Markdown
  1. Primary OCR — Calls the primary engine CLI on the whole PDF
  2. Quality audit — Heuristic checks (word count, garbage ratio, repetition)
  3. Fallback — If audit fails, tries a different engine
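
The audit step can be pictured with a minimal sketch like the one below. It is not socr's actual implementation: the function name `audit_text`, the garbage and repetition thresholds, and the exact definition of each heuristic are assumptions made for illustration. Only the three check names and the 50-word minimum (from `audit_min_words` in the configuration example) come from this document.

```python
from collections import Counter

# Punctuation that is treated as legitimate in OCR output (an assumption).
ALLOWED_PUNCT = set(".,;:!?'\"()[]{}-_/#*|$%&+=<>\\")


def audit_text(text: str, min_words: int = 50,
               max_garbage_ratio: float = 0.3,
               max_repetition_ratio: float = 0.5) -> list[str]:
    """Return the names of failed checks; an empty list means the text passed."""
    failures = []

    # Word count: very short output usually means the engine produced nothing useful.
    if len(text.split()) < min_words:
        failures.append("word_count")

    # Garbage ratio: share of visible characters that are neither
    # alphanumeric nor in the allowed punctuation set.
    visible = [c for c in text if not c.isspace()]
    garbage = [c for c in visible if not (c.isalnum() or c in ALLOWED_PUNCT)]
    if visible and len(garbage) / len(visible) > max_garbage_ratio:
        failures.append("garbage_ratio")

    # Repetition: share of non-empty lines taken up by the most common line.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) >= 4:
        top_count = Counter(lines).most_common(1)[0][1]
        if top_count / len(lines) > max_repetition_ratio:
            failures.append("repetition")

    return failures
```

Returning the list of failed checks, rather than a single boolean, makes it easy to log why a fallback was triggered.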

Each engine is a separate CLI binary. socr calls it as a subprocess, reads the output markdown, and applies the quality pipeline.
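
The subprocess-plus-fallback flow described above might be sketched as follows. This is an illustration under stated assumptions, not socr's code: `run_cli` and `cascade` are hypothetical names, the sketch reads the engine's standard output whereas socr reads an output markdown file, and the demo uses stand-in Python commands instead of real engine binaries because the engines' command-line arguments are not documented here.

```python
import subprocess
import sys
from typing import Callable, Optional


def run_cli(cmd: list[str], timeout: float = 300) -> Optional[str]:
    """Run one engine command; return its stdout, or None on failure or timeout."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        # Timeout, missing binary, or permission problem: treat as a failed engine.
        return None
    return proc.stdout if proc.returncode == 0 else None


def cascade(commands: dict[str, list[str]],
            passes_audit: Callable[[str], bool],
            timeout: float = 300) -> tuple[Optional[str], Optional[str]]:
    """Try engines in insertion order; return (name, text) for the first that passes."""
    for name, cmd in commands.items():
        text = run_cli(cmd, timeout)
        if text is not None and passes_audit(text):
            return name, text
    return None, None


# Stand-in commands so the sketch runs without any OCR engine installed.
demo = {
    "primary": [sys.executable, "-c", "print('x')"],             # too short, fails audit
    "fallback": [sys.executable, "-c", "print('word ' * 60)"],   # passes audit
}
engine, markdown = cascade(demo, lambda t: len(t.split()) >= 50)
print(engine)  # prints "fallback"
```

Treating a timeout or a missing binary the same way as a failed audit keeps the loop simple: any engine that cannot deliver usable text hands over to the next one.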

Engines

Engine    Package           Type   Notes
Gemini    gemini-ocr-cli    Cloud  Google Gemini, ~$0.0002/page
Mistral   mistral-ocr-cli   Cloud  Mistral AI
Marker    marker-ocr-cli    Local  Layout-aware (Surya + Texify)
DeepSeek  deepseek-ocr-cli  Local  Via Ollama
Nougat    nougat-ocr-cli    Local  Academic papers, Python <3.13

Check availability:

$ socr engines

  [+] gemini       cloud, ~$0.0002/page
  [+] marker       local, layout-aware (Surya + Texify)
  [+] mistral      cloud, ~$0.001/page
  [+] deepseek     local via Ollama
  [x] nougat       local, academic papers
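
For scripting, the listing above can be turned into a dictionary. The sketch below assumes that `[+]` marks an installed engine and `[x]` a missing one, and that the output format matches the sample shown here; `parse_engines` is a hypothetical helper, not part of socr.

```python
import re

# One line per engine: a "[+]" or "[x]" marker followed by the engine name.
ENGINE_LINE = re.compile(r"^\s*\[([+x])\]\s+(\S+)")


def parse_engines(output: str) -> dict[str, bool]:
    """Map engine name -> True if marked available, False otherwise."""
    status = {}
    for line in output.splitlines():
        match = ENGINE_LINE.match(line)
        if match:
            marker, name = match.groups()
            status[name] = marker == "+"
    return status


sample = """\
  [+] gemini       cloud, ~$0.0002/page
  [+] marker       local, layout-aware (Surya + Texify)
  [+] mistral      cloud, ~$0.001/page
  [+] deepseek     local via Ollama
  [x] nougat       local, academic papers
"""
available = [name for name, ok in parse_engines(sample).items() if ok]
print(available)  # prints ['gemini', 'marker', 'mistral', 'deepseek']
```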

CLI reference

socr process <PDF> [OPTIONS]
  -o, --output-dir PATH       Output directory
  --primary ENGINE             Primary OCR engine (gemini, marker, deepseek, etc.)
  --fallback ENGINE            Fallback engine
  --no-audit                   Skip quality audit
  --save-figures               Save extracted figure images
  --timeout SECONDS            Subprocess timeout (default: 300)
  --profile NAME               Load ~/.config/socr/{name}.yaml
  --config PATH                Custom YAML config file
  -q, --quiet                  Suppress non-error output
  -v, --verbose                Verbose output
  --dry-run                    List files without processing
  --reprocess                  Force reprocess already-done files

socr batch <DIR> [OPTIONS]
  Same options as process, plus:
  --limit N                    Process first N files

socr engines                   Show available engines
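
When driving socr from a script, the options above can be assembled into an argument list. The helper below is a hypothetical convenience wrapper, not part of socr; it uses only flags listed in this reference and leaves execution to the caller, for example via `subprocess.run(cmd, check=True)`.

```python
import shlex
from typing import Optional


def build_process_command(pdf: str, *,
                          output_dir: Optional[str] = None,
                          primary: Optional[str] = None,
                          fallback: Optional[str] = None,
                          timeout: Optional[int] = None,
                          save_figures: bool = False,
                          no_audit: bool = False) -> list[str]:
    """Build the argv list for `socr process` from keyword options."""
    cmd = ["socr", "process", pdf]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if primary:
        cmd += ["--primary", primary]
    if fallback:
        cmd += ["--fallback", fallback]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if save_figures:
        cmd.append("--save-figures")
    if no_audit:
        cmd.append("--no-audit")
    return cmd


cmd = build_process_command("paper.pdf", primary="gemini", fallback="marker", timeout=600)
print(shlex.join(cmd))
# prints: socr process paper.pdf --primary gemini --fallback marker --timeout 600
```

Passing a list instead of a shell string avoids quoting problems with file names that contain spaces.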

Output

output/<doc_stem>/
├── <doc_stem>.md        # OCR text
├── metadata.json        # Processing stats
└── figures/             # With --save-figures
    └── figure_1_page3.png
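
A downstream script could read this layout back as sketched below. `load_result` is a hypothetical helper; the fields inside metadata.json are not documented here, so the sketch returns the parsed JSON as-is without assuming any keys.

```python
import json
from pathlib import Path


def load_result(output_dir: str, doc_stem: str):
    """Return (markdown_text, metadata_dict, figure_paths) for one processed document."""
    doc_dir = Path(output_dir) / doc_stem
    text = (doc_dir / f"{doc_stem}.md").read_text(encoding="utf-8")
    metadata = json.loads((doc_dir / "metadata.json").read_text(encoding="utf-8"))
    # figures/ exists only when the document was processed with --save-figures.
    figures_dir = doc_dir / "figures"
    figures = sorted(figures_dir.glob("*.png")) if figures_dir.is_dir() else []
    return text, metadata, figures
```

For example, `load_result("output", "paper")` would read `output/paper/paper.md` and `output/paper/metadata.json`.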

Configuration

Create ~/.config/socr/config.yaml:

primary_engine: gemini
fallback_engine: marker
timeout: 300
save_figures: false
audit_enabled: true
audit_min_words: 50

Or use profiles: create ~/.config/socr/fast.yaml, then run:

socr paper.pdf --profile fast
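
The file locations involved can be summarized in a short sketch. The precedence shown (an explicit `--config` path first, then `--profile`, then the default config.yaml) is an assumption, since this document does not say how socr resolves conflicts; `resolve_config_path` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional

CONFIG_DIR = Path.home() / ".config" / "socr"


def resolve_config_path(config: Optional[str] = None,
                        profile: Optional[str] = None) -> Path:
    """Pick the YAML file to load (assumed order: --config, --profile, default)."""
    if config:
        return Path(config).expanduser()
    if profile:
        # --profile NAME maps to ~/.config/socr/{name}.yaml
        return CONFIG_DIR / f"{profile}.yaml"
    return CONFIG_DIR / "config.yaml"
```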

Engine CLIs

Each backend is an independent CLI tool, distributed as its own package (see the Package column in the Engines section).

License

MIT
