Multi-engine OCR toolkit — Tesseract, OpenAI, Mistral, Anthropic, Google Vision, ensemble voting, image preprocessing, and Swedish text post-processing

trollfab-ocr

Multi-engine OCR toolkit with quality scoring, image preprocessing, ensemble voting, and Swedish text post-processing — built for Trollfabriken AITrix AB document processing pipelines.


Engines

| Engine | Class | Backend |
|---|---|---|
| Tesseract | TesseractOCR | Local, offline (requires Tesseract binary) |
| OpenAI | VisionOCR | GPT-4o vision API |
| Mistral | MistralOCR / MistralOCRDedicated | Pixtral + dedicated OCR API |
| Anthropic | AnthropicOCR | Claude vision API |
| Google Vision | GoogleVisionOCR | Cloud Vision document_text_detection |
| Multi-engine | MultiEngineOCR | Auto-fallback orchestrator |
| Ensemble | OCREnsemble | Voting-based merger across engines |

Installation

# Core only (no external deps)
pip install trollfab-ocr

# With specific engine support
pip install "trollfab-ocr[tesseract]"    # pytesseract
pip install "trollfab-ocr[openai]"       # OpenAI GPT-4o
pip install "trollfab-ocr[mistral]"      # Mistral OCR API
pip install "trollfab-ocr[anthropic]"    # Anthropic Claude
pip install "trollfab-ocr[google]"       # Google Cloud Vision
pip install "trollfab-ocr[pdf]"          # pdfplumber (pre-OCR routing)
pip install "trollfab-ocr[preprocessing]"  # numpy + OpenCV + scipy
pip install "trollfab-ocr[all]"          # everything

Quick start

Single engine

from multi_ocr import TesseractOCR, VisionOCR, MistralOCRDedicated

# Tesseract (local)
ocr = TesseractOCR(languages=["swe", "eng"])
result = ocr.extract_text("scan.png")
print(result.text, result.quality_score)

# OpenAI GPT-4o
ocr = VisionOCR()  # uses OPENAI_API_KEY from env
result = ocr.extract_text("document.jpg")

# Mistral dedicated OCR (best for PDFs)
ocr = MistralOCRDedicated()  # uses MISTRAL_API_KEY from env
result = ocr.extract_from_pdf("report.pdf")

Auto-fallback orchestrator

from multi_ocr import MultiEngineOCR

ocr = MultiEngineOCR()  # uses all available engines
result = ocr.extract("document.png")
print(result.text, result.quality_score, result.engine_used)
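
The auto-fallback idea can be illustrated with a minimal, library-independent sketch. It is not the MultiEngineOCR implementation: the `Attempt` class, the `extract_with_fallback` function, the `(name, callable)` engine format, and the 0.7 threshold are all assumptions made for illustration. Engines are tried in order, a failing engine is skipped, and the loop stops at the first result that clears the quality threshold.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Attempt:
    """Hypothetical result container, loosely mirroring the fields shown above."""
    text: str
    quality_score: float
    engine_used: str


# Each engine is a (name, callable) pair; the callable returns (text, score).
Engine = Tuple[str, Callable[[str], Tuple[str, float]]]


def extract_with_fallback(
    path: str, engines: Sequence[Engine], min_quality: float = 0.7
) -> Optional[Attempt]:
    """Try engines in order and stop at the first one that is good enough.

    If no engine reaches min_quality, the best attempt seen is returned.
    Returns None when every engine fails.
    """
    best: Optional[Attempt] = None
    for name, run in engines:
        try:
            text, score = run(path)
        except Exception:
            # Missing binary, missing API key, network error: try the next one.
            continue
        if best is None or score > best.quality_score:
            best = Attempt(text, score, name)
        if score >= min_quality:
            break
    return best
```

Ordering the list from cheapest to most expensive engine keeps API calls to a minimum, since later engines only run when earlier ones fail or score poorly.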

Ensemble voting

from multi_ocr import OCREnsemble, TesseractOCR, VisionOCR, MistralOCR

ensemble = OCREnsemble(engines=[TesseractOCR(), VisionOCR(), MistralOCR()])
result = ensemble.extract_with_voting("scan.png")
print(result.text, result.quality_score, result.agreement_score)
print(result.engines_used, result.voting_method)
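
To make the voting idea concrete, here is a small sketch of a line-level majority vote. It is an illustration only, not the OCREnsemble or VotingStrategy code: the function name `line_majority_vote`, the positional line alignment, and the way agreement is computed are assumptions. Each engine's output is split into lines, the most common version of each line wins, and agreement is the average share of engines that voted for the winner.

```python
from collections import Counter
from itertools import zip_longest
from typing import List, Tuple


def line_majority_vote(outputs: List[str]) -> Tuple[str, float]:
    """Merge several OCR outputs line by line using a majority vote.

    Lines are aligned by position only, which is a simplification: a real
    merger would need to handle inserted or dropped lines.
    """
    split = [o.splitlines() for o in outputs]
    merged: List[str] = []
    shares: List[float] = []
    for candidates in zip_longest(*split, fillvalue=""):
        # Collapse whitespace so spacing differences do not split the vote.
        normalized = [" ".join(c.split()) for c in candidates]
        winner, votes = Counter(normalized).most_common(1)[0]
        merged.append(winner)
        shares.append(votes / len(normalized))
    agreement = sum(shares) / len(shares) if shares else 0.0
    return "\n".join(merged), agreement


if __name__ == "__main__":
    text, agreement = line_majority_vote(
        ["Hej världen\nSida 1", "Hej världen\nSida l", "Hej  världen\nSida 1"]
    )
    print(text)
    print(round(agreement, 2))
```

With three engines, a single misread line (such as `l` for `1`) is outvoted by the other two, and the lower agreement value flags that the engines disagreed somewhere.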

Image preprocessing

from multi_ocr import ImagePreprocessor, TesseractPreprocessor, tesseract_preprocess

# Full preprocessing pipeline
prep = ImagePreprocessor()
enhanced = prep.prepare_for_ocr("noisy_scan.png")

# Tesseract-optimised 5-step pipeline (steps include grayscale → binarize → denoise → enhance)
img = tesseract_preprocess("scan.jpg")

# Image quality analysis + targeted enhancement
from multi_ocr import ImageEnhancer
enhancer = ImageEnhancer()
analysis = enhancer.analyze_quality("document.jpg")
result = enhancer.enhance("document.jpg", preset="scan")

Swedish text post-processing

from multi_ocr import SwedishTextPostProcessor

pp = SwedishTextPostProcessor()
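# raw_ocr_text is the raw string returned by any engine (for example result.text)
clean = pp.process(raw_ocr_text)
# Repairs ligatures, OCR artefacts and diacritics, corrects the names of all
# 290 Swedish municipalities, fixes Swedish abbreviations, and normalizes whitespace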

Pre-OCR routing (avoid unnecessary API calls)

from multi_ocr import has_text_layer, is_simple, extract_text_layer

path = "document.pdf"
if is_simple(path):
    # Fast path: extract native text, no OCR needed
    text = extract_text_layer(path)
elif has_text_layer(path):
    # Has text but complex layout — use Mistral or Docling
    ...
else:
    # Scanned image — run full OCR pipeline
    ...
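
The three-way branch above can be sketched as a pure function over per-page text. This is a hypothetical heuristic, not how `is_simple` or `has_text_layer` are implemented: the function name `choose_route`, the route labels, and the 50-character threshold are assumptions. The per-page strings could come from, for example, pdfplumber's `page.extract_text()`.

```python
from typing import List, Optional


def choose_route(page_texts: List[Optional[str]], min_chars: int = 50) -> str:
    """Pick a processing route from the text extracted from each PDF page.

    Returns one of:
      "native_text"   - every page has a usable text layer, no OCR needed
      "layout_engine" - some pages have text, others do not (mixed document)
      "full_ocr"      - no usable text layer at all (scanned images)
    """
    if not page_texts:
        return "full_ocr"
    # A page counts as having text only if it holds a meaningful amount of it.
    usable = [t for t in page_texts if t and len(t.strip()) >= min_chars]
    if not usable:
        return "full_ocr"
    if len(usable) == len(page_texts):
        return "native_text"
    return "layout_engine"
```

The character threshold guards against pages whose "text layer" is only a page number or a stray watermark.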

Unicode cleaning

from multi_ocr import clean_unicode, repair_ligatures

text = clean_unicode(raw_text)         # NFKD → ligature repair → NFC → remove zero-width
text = repair_ligatures("\ufb01le \ufb00")  # "ﬁle ﬀ" (with ligature characters) → "file ff"
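
For readers curious about what such a cleaning chain involves, the sketch below follows the order given in the comment above (NFKD, ligature repair, NFC, zero-width removal) using only the standard library. The names `clean_unicode_sketch` and `repair_ligatures_sketch` and the exact character sets are assumptions; the package's own functions may cover more cases.

```python
import unicodedata

# Latin ligature code points and their plain-letter equivalents.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# Zero-width space, non-joiner, joiner, word joiner, and BOM.
# Mapping a code point to None makes str.translate delete it.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def repair_ligatures_sketch(text: str) -> str:
    """Replace ligature characters with their plain-letter spelling."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


def clean_unicode_sketch(text: str) -> str:
    """NFKD, then ligature repair, then NFC, then zero-width removal."""
    # NFKD already expands these ligatures; the explicit step is kept so the
    # sketch mirrors the documented order and still works if NFKD is skipped.
    text = unicodedata.normalize("NFKD", text)
    text = repair_ligatures_sketch(text)
    # NFC recombines base letters with combining marks (a + diaeresis -> ä).
    text = unicodedata.normalize("NFC", text)
    return text.translate(ZERO_WIDTH)
```

Note that NFKD is a compatibility normalization, so it also folds characters such as superscript digits into their plain forms, which may or may not be wanted for a given document type.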

LLM-based OCR enhancement

from multi_ocr import OCREnhancer

enhancer = OCREnhancer()  # uses OPENAI_API_KEY from env
corrected = enhancer.correct_errors(raw_ocr_text)
doc_type = enhancer.identify_document_type(raw_ocr_text)
structure = enhancer.extract_structure(raw_ocr_text)

Quality scoring

QualityScorer rates text on several axes (length, Swedish characters, municipal patterns, document structure) and returns a score between 0 and 1. All engines and the ensemble use it internally to select the best result.
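
As a rough illustration of multi-axis scoring, the toy function below combines the four axes named above into a weighted 0 to 1 score. It is not the QualityScorer implementation: the function name `toy_quality_score`, the weights, the thresholds, and the keyword pattern are all invented for this sketch.

```python
import re

# Hypothetical keywords standing in for "municipal patterns".
MUNICIPAL_PATTERN = re.compile(r"\b(kommun|nämnd|protokoll)\w*", re.IGNORECASE)


def toy_quality_score(text: str) -> float:
    """Return a 0-1 score from four weighted axes (all weights are assumptions)."""
    if not text.strip():
        return 0.0

    # Axis 1: length, saturating at 500 characters.
    length = min(len(text) / 500, 1.0)

    # Axis 2: share of å/ä/ö among letters, saturating at 5 percent.
    letters = [c for c in text if c.isalpha()]
    swedish = sum(c in "åäöÅÄÖ" for c in letters)
    swedish_axis = min((swedish / len(letters)) / 0.05, 1.0) if letters else 0.0

    # Axis 3: presence of a municipal keyword.
    municipal = 1.0 if MUNICIPAL_PATTERN.search(text) else 0.0

    # Axis 4: document structure, approximated by the number of non-empty lines.
    lines = [line for line in text.splitlines() if line.strip()]
    structure = min(len(lines) / 10, 1.0)

    score = 0.3 * length + 0.3 * swedish_axis + 0.2 * municipal + 0.2 * structure
    return round(score, 3)
```

Because the weights sum to 1 and every axis is clamped to the range 0 to 1, the combined score stays in that range, which makes results from different engines directly comparable.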


Environment variables

| Variable | Engine |
|---|---|
| OPENAI_API_KEY | VisionOCR, OCREnhancer |
| MISTRAL_API_KEY | MistralOCR, MistralOCRDedicated |
| ANTHROPIC_API_KEY | AnthropicOCR |
| GOOGLE_APPLICATION_CREDENTIALS | GoogleVisionOCR |
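
Before constructing API-backed engines, it can help to check which of these variables are actually set. The helper below is a hypothetical convenience, not part of the package: `configured_engines` and `ENGINE_ENV_VARS` are names made up for this sketch, and the mapping simply restates the table above.

```python
import os
from typing import List, Mapping, Optional

# Class name -> environment variable it needs, taken from the table above.
# TesseractOCR is omitted because it runs locally and needs no key.
ENGINE_ENV_VARS = {
    "VisionOCR": "OPENAI_API_KEY",
    "OCREnhancer": "OPENAI_API_KEY",
    "MistralOCR": "MISTRAL_API_KEY",
    "MistralOCRDedicated": "MISTRAL_API_KEY",
    "AnthropicOCR": "ANTHROPIC_API_KEY",
    "GoogleVisionOCR": "GOOGLE_APPLICATION_CREDENTIALS",
}


def configured_engines(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the class names whose required variable is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in ENGINE_ENV_VARS.items() if env.get(var)]


if __name__ == "__main__":
    print("Engines with credentials configured:", configured_engines())
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to exercise in tests without touching the real environment.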

Package structure

multi_ocr/
├── __init__.py              ← Public API
├── py.typed                 ← PEP 561 typed marker
├── ocr_engine.py            ← Core engines + QualityScorer + MultiEngineOCR
├── ensemble.py              ← OCREnsemble (voting-based merger)
├── voting.py                ← VotingStrategy (best_of_n / line_merge / quorum)
├── image_preprocessor.py   ← Deskew, denoise, binarize, region detection
├── image_enhancer.py        ← Quality analysis + targeted enhancement
├── tesseract_preprocessing.py ← 5-step Tesseract-optimised pipeline
├── ocr_enhancer.py          ← LLM-based correction + structure extraction
├── swedish_postprocessor.py ← 7-step Swedish OCR post-processing
├── mistral_ocr.py           ← Mistral dedicated OCR API (full PDF support)
├── google_vision_ocr.py     ← Google Cloud Vision integration
├── svg_table.py             ← SVG table generation from OCR output
├── routing.py               ← Pre-OCR routing (text-layer detection)
└── unicode_clean.py         ← Ligature repair + zero-width removal

© 2025 Trollfabriken AITrix AB — MIT License
