Multi-engine document OCR with cascading fallback

socr

Python 3.11–3.12 · License: MIT

Multi-engine document OCR with cascading fallback and quality audit.

socr orchestrates multiple OCR engines — calling each as a CLI subprocess, auditing output quality, and falling back to a different engine when results are poor. Each engine is a standalone CLI tool (gemini-ocr, deepseek-ocr, marker-ocr, etc.) that can also be used independently.

Install

pip install socr

# With specific engine backends
pip install "socr[gemini]"        # Google Gemini (cloud)
pip install "socr[local]"         # DeepSeek + Nougat (local/free)
pip install "socr[all]"           # All engines (quotes keep shells like zsh from expanding the brackets)

Engines are installed separately because they have different dependencies (torch, cloud SDKs, etc.). Install only what you need.

Usage

# Process a PDF
socr paper.pdf

# Choose engine
socr paper.pdf --primary gemini
socr paper.pdf --primary marker

# Save extracted figures
socr paper.pdf --save-figures

# Batch process a directory
socr batch ~/Papers/ -o ./results/
socr batch ~/Papers/ --dry-run        # preview what would be processed
socr batch ~/Papers/ --reprocess      # force reprocess all

# Check which engines are available
socr engines

How it works

PDF → Primary OCR → Quality Audit → (Fallback OCR if needed) → Markdown
  1. Primary OCR — Calls the primary engine CLI on the whole PDF
  2. Quality audit — Heuristic checks (word count, garbage ratio, repetition)
  3. Fallback — If audit fails, tries a different engine
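
The three audit heuristics named above could look roughly like the sketch below. It is an illustration, not socr's actual implementation: the function name `audit_text`, the allowed-punctuation set, and the garbage and repetition thresholds are all assumptions. Only the 50-word minimum mirrors the `audit_min_words` value shown in the configuration example.

```python
from collections import Counter

# Characters treated as legitimate punctuation/markup in OCR markdown (an assumption).
ALLOWED_PUNCT = set(".,;:!?'\"()[]{}-_/#*|$%&+=<>\\@^~`")


def audit_text(text, min_words=50, max_garbage_ratio=0.3, max_repeat_ratio=0.5):
    """Return (passed, reasons) for a block of OCR output (hypothetical helper)."""
    reasons = []

    # 1. Word count: near-empty output usually means the engine failed.
    words = text.split()
    if len(words) < min_words:
        reasons.append(f"too few words: {len(words)} < {min_words}")

    # 2. Garbage ratio: share of non-space characters that are neither
    #    alphanumeric nor allowed punctuation.
    chars = [c for c in text if not c.isspace()]
    if chars:
        garbage = sum(1 for c in chars if not (c.isalnum() or c in ALLOWED_PUNCT))
        ratio = garbage / len(chars)
        if ratio > max_garbage_ratio:
            reasons.append(f"garbage ratio too high: {ratio:.2f}")

    # 3. Repetition: share of non-empty lines taken by the most common line
    #    (skipped for very short outputs).
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) >= 4:
        top_count = Counter(lines).most_common(1)[0][1]
        repeat_ratio = top_count / len(lines)
        if repeat_ratio > max_repeat_ratio:
            reasons.append(f"repeated line ratio too high: {repeat_ratio:.2f}")

    return (not reasons, reasons)
```

Returning the list of reasons alongside the pass/fail flag makes it easy to log why a fallback was triggered.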

Each engine is a separate CLI binary. socr calls it as a subprocess, reads the output markdown, and applies the quality pipeline.
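
As a rough sketch of this subprocess-and-fallback flow: the binary naming (`<engine>-ocr`) follows the tool names mentioned earlier, but the argument layout (`<pdf> -o <dir>`) and the output location (`<dir>/<stem>.md`) are assumptions made for illustration, as are the function names. Check each engine CLI's own help output for its real interface.

```python
import subprocess
from pathlib import Path


def run_engine_cli(engine, pdf, out_dir, timeout=300):
    """Run one engine as a subprocess; return its markdown, or None on failure.

    ASSUMPTION: the argument layout and output path below are hypothetical.
    """
    cmd = [f"{engine}-ocr", str(pdf), "-o", str(out_dir)]
    try:
        subprocess.run(cmd, check=True, timeout=timeout, capture_output=True)
    except (OSError, subprocess.SubprocessError):
        # Missing binary, non-zero exit status, or timeout.
        return None
    md_path = Path(out_dir) / f"{Path(pdf).stem}.md"
    return md_path.read_text(encoding="utf-8") if md_path.exists() else None


def ocr_with_fallback(pdf, engines, run_engine, audit):
    """Try engines in order; return (engine, text) for the first output that
    passes the audit, or (None, None) if every engine fails."""
    for engine in engines:
        text = run_engine(engine, pdf)
        if text is not None and audit(text):
            return engine, text
    return None, None
```

A call might look like `ocr_with_fallback("paper.pdf", ["gemini", "marker"], lambda e, p: run_engine_cli(e, p, "output"), audit)`, where `audit` is any function that returns True for acceptable text. Passing the runner in as a parameter keeps the cascade logic testable without any engine installed.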

Engines

Engine    Package           Type   Notes
Gemini    gemini-ocr-cli    Cloud  Google Gemini, ~$0.0002/page
Mistral   mistral-ocr-cli   Cloud  Mistral AI
Marker    marker-ocr-cli    Local  Layout-aware (Surya + Texify)
DeepSeek  deepseek-ocr-cli  Local  Via Ollama
Nougat    nougat-ocr-cli    Local  Academic papers, Python <3.13

Check availability:

$ socr engines

  [+] gemini       cloud, ~$0.0002/page
  [+] marker       local, layout-aware (Surya + Texify)
  [+] mistral      cloud, ~$0.001/page
  [+] deepseek     local via Ollama
  [x] nougat       local, academic papers
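
A similar availability check can be approximated from Python by looking for each engine binary on `PATH`. This is only a sketch: `gemini-ocr`, `marker-ocr`, and `deepseek-ocr` are the tool names mentioned earlier, while `mistral-ocr` and `nougat-ocr` are assumed to follow the same pattern. How `socr engines` itself decides availability is not documented here.

```python
import shutil

# Engine name -> CLI binary. The last two binary names are assumptions.
ENGINE_BINARIES = {
    "gemini": "gemini-ocr",
    "marker": "marker-ocr",
    "deepseek": "deepseek-ocr",
    "mistral": "mistral-ocr",
    "nougat": "nougat-ocr",
}


def available_engines(binaries=ENGINE_BINARIES, which=shutil.which):
    """Map each engine name to True if its binary is found on PATH."""
    return {name: which(binary) is not None for name, binary in binaries.items()}


def format_report(status):
    """Render a status dict in the same [+]/[x] style as `socr engines`."""
    return "\n".join(f"  [{'+' if ok else 'x'}] {name}" for name, ok in status.items())


if __name__ == "__main__":
    print(format_report(available_engines()))
```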

CLI reference

socr process <PDF> [OPTIONS]
  -o, --output-dir PATH       Output directory
  --primary ENGINE             Primary OCR engine (gemini, marker, deepseek, etc.)
  --fallback ENGINE            Fallback engine
  --no-audit                   Skip quality audit
  --save-figures               Save extracted figure images
  --timeout SECONDS            Subprocess timeout (default: 300)
  --profile NAME               Load ~/.config/socr/{name}.yaml
  --config PATH                Custom YAML config file
  -q, --quiet                  Suppress non-error output
  -v, --verbose                Verbose output
  --dry-run                    List files without processing
  --reprocess                  Force reprocess already-done files

socr batch <DIR> [OPTIONS]
  Same options as process, plus:
  --limit N                    Process first N files

socr engines                   Show available engines
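
When driving socr from a script, the options above map naturally onto an argument list. The helper below is a hypothetical convenience wrapper (its name and keyword arguments are not part of socr); it only emits flags listed in this reference.

```python
def build_process_command(pdf, output_dir=None, primary=None, fallback=None,
                          audit=True, save_figures=False, timeout=None, profile=None):
    """Build the argument list for `socr process` from keyword options."""
    cmd = ["socr", "process", str(pdf)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if primary is not None:
        cmd += ["--primary", primary]
    if fallback is not None:
        cmd += ["--fallback", fallback]
    if not audit:
        cmd.append("--no-audit")
    if save_figures:
        cmd.append("--save-figures")
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if profile is not None:
        cmd += ["--profile", profile]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems with paths that contain spaces.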

Output

output/<doc_stem>/
├── <doc_stem>.md        # OCR text
├── metadata.json        # Processing stats
└── figures/             # With --save-figures
    └── figure_1_page3.png
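
A downstream script can read this layout back with a few lines of Python. The sketch below assumes only the file names shown in the tree; the fields inside `metadata.json` are not documented here, so it is treated as an opaque dictionary. `load_result` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_result(output_root, doc_stem):
    """Return (markdown_text, metadata_dict, figure_paths) for one document."""
    doc_dir = Path(output_root) / doc_stem
    text = (doc_dir / f"{doc_stem}.md").read_text(encoding="utf-8")
    metadata = json.loads((doc_dir / "metadata.json").read_text(encoding="utf-8"))
    # figures/ only exists when the run used --save-figures.
    figures_dir = doc_dir / "figures"
    figures = sorted(figures_dir.glob("*.png")) if figures_dir.is_dir() else []
    return text, metadata, figures
```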

Configuration

Create ~/.config/socr/config.yaml:

primary_engine: gemini
fallback_engine: marker
timeout: 300
save_figures: false
audit_enabled: true
audit_min_words: 50
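
To sanity-check a config file before a long batch run, something like the following sketch can help. It is not how socr reads its configuration: the parser handles only flat `key: value` lines like the example above (a real YAML library would be the usual choice), and the expected types are inferred from that example rather than from socr's source.

```python
# Expected value types, inferred from the example config (an assumption).
EXPECTED_TYPES = {
    "primary_engine": str,
    "fallback_engine": str,
    "timeout": int,
    "save_figures": bool,
    "audit_enabled": bool,
    "audit_min_words": int,
}


def _coerce(value):
    """Turn a raw scalar string into bool, int, or str."""
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    try:
        return int(value)
    except ValueError:
        return value


def parse_flat_config(text):
    """Parse flat `key: value` lines; comments and blank lines are ignored."""
    config = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value line: {raw!r}")
        config[key.strip()] = _coerce(value.strip())
    return config


def validate(config):
    """Return a list of problems (unknown keys or wrongly typed values)."""
    errors = []
    for key, value in config.items():
        expected = EXPECTED_TYPES.get(key)
        if expected is None:
            errors.append(f"unknown key: {key}")
        elif type(value) is not expected:
            errors.append(f"{key}: expected {expected.__name__}, got {value!r}")
    return errors
```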

Or use profiles: save a profile as ~/.config/socr/fast.yaml, then select it by name:

socr paper.pdf --profile fast

Engine CLIs

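Each backend is an independent CLI tool, distributed as its own package: gemini-ocr-cli, mistral-ocr-cli, marker-ocr-cli, deepseek-ocr-cli, and nougat-ocr-cli.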

License

MIT
