Multi-engine OCR toolkit — Tesseract, OpenAI, Mistral, Anthropic, Google Vision, ensemble voting, image preprocessing, and Swedish text post-processing
trollfab-ocr
Multi-engine OCR toolkit with quality scoring, image preprocessing, ensemble voting, and Swedish text post-processing — built for Trollfabriken AITrix AB document processing pipelines.
Engines
| Engine | Class | Backend |
|---|---|---|
| Tesseract | TesseractOCR | Local, offline (requires Tesseract binary) |
| OpenAI | VisionOCR | GPT-4o vision API |
| Mistral | MistralOCR / MistralOCRDedicated | Pixtral + dedicated OCR API |
| Anthropic | AnthropicOCR | Claude vision API |
| Google Vision | GoogleVisionOCR | Cloud Vision document_text_detection |
| Multi-engine | MultiEngineOCR | Auto-fallback orchestrator |
| Ensemble | OCREnsemble | Voting-based merger across engines |
Installation
# Core only (no external deps)
pip install trollfab-ocr
# With specific engine support
pip install "trollfab-ocr[tesseract]" # pytesseract
pip install "trollfab-ocr[openai]" # OpenAI GPT-4o
pip install "trollfab-ocr[mistral]" # Mistral OCR API
pip install "trollfab-ocr[anthropic]" # Anthropic Claude
pip install "trollfab-ocr[google]" # Google Cloud Vision
pip install "trollfab-ocr[pdf]" # pdfplumber (pre-OCR routing)
pip install "trollfab-ocr[preprocessing]" # numpy + OpenCV + scipy
pip install "trollfab-ocr[all]" # everything
Quick start
Single engine
from multi_ocr import TesseractOCR, VisionOCR, MistralOCRDedicated
# Tesseract (local)
ocr = TesseractOCR(languages=["swe", "eng"])
result = ocr.extract_text("scan.png")
print(result.text, result.quality_score)
# OpenAI GPT-4o
ocr = VisionOCR() # uses OPENAI_API_KEY from env
result = ocr.extract_text("document.jpg")
# Mistral dedicated OCR (best for PDFs)
ocr = MistralOCRDedicated() # uses MISTRAL_API_KEY from env
result = ocr.extract_from_pdf("report.pdf")
Auto-fallback orchestrator
from multi_ocr import MultiEngineOCR
ocr = MultiEngineOCR() # uses all available engines
result = ocr.extract("document.png")
print(result.text, result.quality_score, result.engine_used)
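To illustrate the auto-fallback idea, here is a minimal sketch and not the actual MultiEngineOCR implementation. It assumes each engine can be modelled as a callable returning a `(text, quality_score)` pair. The names `SketchResult`, `extract_with_fallback`, and the `min_quality` threshold are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

@dataclass
class SketchResult:
    # Hypothetical result type mirroring the fields printed above.
    text: str
    quality_score: float
    engine_used: str

# An "engine" in this sketch is a name plus a callable returning (text, score).
Engine = Tuple[str, Callable[[str], Tuple[str, float]]]

def extract_with_fallback(
    path: str,
    engines: Sequence[Engine],
    min_quality: float = 0.7,  # assumed threshold, not taken from the library
) -> Optional[SketchResult]:
    """Try engines in order; stop at the first result that is good enough."""
    best: Optional[SketchResult] = None
    for name, run in engines:
        try:
            text, score = run(path)
        except Exception:
            # Engine missing, misconfigured, or failed: fall through to the next.
            continue
        candidate = SketchResult(text, score, name)
        if score >= min_quality:
            return candidate
        if best is None or score > best.quality_score:
            best = candidate
    # No engine reached the threshold: return the best attempt, or None.
    return best
```

Ordering the engines from cheapest to most expensive means a local engine can answer first and the paid APIs are only called when its result scores poorly.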
Ensemble voting
from multi_ocr import OCREnsemble, TesseractOCR, VisionOCR, MistralOCR
ensemble = OCREnsemble(engines=[TesseractOCR(), VisionOCR(), MistralOCR()])
result = ensemble.extract_with_voting("scan.png")
print(result.text, result.quality_score, result.agreement_score)
print(result.engines_used, result.voting_method)
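As a rough picture of what line-level voting can look like, the sketch below takes a majority vote per line position and reports the mean agreement. It is an assumption-laden illustration and not the OCREnsemble algorithm: the function name `line_vote` is hypothetical, and it assumes all engines split the page into the same lines, which real OCR output often does not.

```python
from collections import Counter
from typing import List, Tuple

def line_vote(texts: List[str]) -> Tuple[str, float]:
    """Merge engine outputs by majority vote on each line position.

    Returns the merged text and the mean share of engines that agreed
    with the winning line (0.0 when there is nothing to vote on).
    """
    split = [t.splitlines() for t in texts]
    n_lines = max((len(lines) for lines in split), default=0)
    merged: List[str] = []
    agreements: List[float] = []
    for i in range(n_lines):
        # Engines that produced fewer lines simply do not vote on line i.
        votes = Counter(lines[i].strip() for lines in split if i < len(lines))
        line, count = votes.most_common(1)[0]
        merged.append(line)
        agreements.append(count / len(texts))
    agreement = sum(agreements) / len(agreements) if agreements else 0.0
    return "\n".join(merged), agreement
```

For example, if two of three engines read a line as `Total 100` and the third reads `Tota1 100`, the majority reading wins and that line contributes an agreement of 2/3.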
Image preprocessing
from multi_ocr import ImagePreprocessor, TesseractPreprocessor, tesseract_preprocess
# Full preprocessing pipeline
prep = ImagePreprocessor()
enhanced = prep.prepare_for_ocr("noisy_scan.png")
# Tesseract-optimised 5-step pipeline (grayscale→binarize→denoise→enhance)
img = tesseract_preprocess("scan.jpg")
# Image quality analysis + targeted enhancement
from multi_ocr import ImageEnhancer
enhancer = ImageEnhancer()
analysis = enhancer.analyze_quality("document.jpg")
result = enhancer.enhance("document.jpg", preset="scan")
Swedish text post-processing
from multi_ocr import SwedishTextPostProcessor
pp = SwedishTextPostProcessor()
clean = pp.process(raw_ocr_text)
# Repairs ligatures, OCR artefacts, and diacritics; handles the names of all
# 290 Swedish municipalities and common abbreviations; normalises whitespace
Pre-OCR routing (avoid unnecessary API calls)
from multi_ocr import has_text_layer, is_simple, extract_text_layer
path = "document.pdf"
if is_simple(path):
    # Fast path: extract native text, no OCR needed
    text = extract_text_layer(path)
elif has_text_layer(path):
    # Has text but complex layout: use Mistral or Docling
    ...
else:
    # Scanned image: run full OCR pipeline
    ...
Unicode cleaning
from multi_ocr import clean_unicode, repair_ligatures
text = clean_unicode(raw_text) # NFKD → ligature repair → NFC → remove zero-width
text = repair_ligatures("ﬁle ﬀ") # → "file ff"
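The normalisation chain named in the comment above (NFKD, then NFC, then zero-width removal) can be approximated with the standard library alone. This is a sketch under assumptions and not the package's code: `clean_unicode_sketch` and the `ZERO_WIDTH` set are hypothetical names, and the sketch relies on NFKD compatibility decomposition to expand ligatures instead of a separate repair step.

```python
import unicodedata

# Zero-width and invisible characters that commonly leak into extracted text.
# The exact set is an assumption made for this sketch.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def clean_unicode_sketch(text: str) -> str:
    # NFKD splits compatibility characters, so the "fi" ligature (U+FB01)
    # becomes the two letters "f" and "i".
    decomposed = unicodedata.normalize("NFKD", text)
    # NFC recombines base letters with their combining marks, so that
    # "a" + combining ring is stored again as a single "å".
    recomposed = unicodedata.normalize("NFC", decomposed)
    # Finally drop the invisible characters.
    return "".join(ch for ch in recomposed if ch not in ZERO_WIDTH)
```

Note that removing U+200D (zero-width joiner) also breaks up emoji sequences, which is usually acceptable for scanned documents but may not be for other text.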
LLM-based OCR enhancement
from multi_ocr import OCREnhancer
enhancer = OCREnhancer() # uses OPENAI_API_KEY from env
corrected = enhancer.correct_errors(raw_ocr_text)
doc_type = enhancer.identify_document_type(raw_ocr_text)
structure = enhancer.extract_structure(raw_ocr_text)
Quality scoring
QualityScorer rates text on multiple axes (length, Swedish characters,
municipal patterns, document structure) and returns a 0–1 score. All engines
and the ensemble use it internally to select the best result.
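A multi-axis score of this kind can be pictured as a weighted sum of per-axis values. The sketch below is hypothetical and does not reproduce QualityScorer: the function name `score_text`, the weights, the 500-character saturation point, and the use of the word "kommun" as the municipal pattern are all assumptions chosen for illustration.

```python
import re

SWEDISH_CHARS = set("åäöÅÄÖ")

def score_text(text: str) -> float:
    """Return a 0-1 score from four simple axes (all weights are assumed)."""
    if not text.strip():
        return 0.0
    # Length: longer output is treated as more complete, up to 500 characters.
    length = min(len(text) / 500, 1.0)
    # Swedish characters: their presence suggests diacritics survived OCR.
    swedish = 1.0 if any(ch in SWEDISH_CHARS for ch in text) else 0.0
    # Municipal pattern: a stand-in check for the word "kommun".
    municipal = 1.0 if re.search(r"\bkommun\b", text, re.IGNORECASE) else 0.0
    # Structure: more than one non-empty line hints at a laid-out document.
    non_empty = [line for line in text.splitlines() if line.strip()]
    structure = 1.0 if len(non_empty) > 1 else 0.0
    score = 0.4 * length + 0.2 * swedish + 0.2 * municipal + 0.2 * structure
    return min(max(score, 0.0), 1.0)
```

Because every axis is bounded by 1 and the weights sum to 1, the result always stays inside the 0–1 range, which makes scores from different engines directly comparable.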
Environment variables
| Variable | Engine |
|---|---|
| OPENAI_API_KEY | VisionOCR, OCREnhancer |
| MISTRAL_API_KEY | MistralOCR, MistralOCRDedicated |
| ANTHROPIC_API_KEY | AnthropicOCR |
| GOOGLE_APPLICATION_CREDENTIALS | GoogleVisionOCR |
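Before constructing engines it can be handy to check which of these variables are actually set. The helper below is a hypothetical convenience and not part of the package: `REQUIRED_ENV` simply restates the table above, and `configured_classes` is an assumed name.

```python
import os
from typing import List, Mapping

# Class name -> environment variable it needs, as listed in the table above.
REQUIRED_ENV = {
    "VisionOCR": "OPENAI_API_KEY",
    "OCREnhancer": "OPENAI_API_KEY",
    "MistralOCR": "MISTRAL_API_KEY",
    "MistralOCRDedicated": "MISTRAL_API_KEY",
    "AnthropicOCR": "ANTHROPIC_API_KEY",
    "GoogleVisionOCR": "GOOGLE_APPLICATION_CREDENTIALS",
}

def configured_classes(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the class names whose variable is set and non-empty."""
    return sorted(name for name, var in REQUIRED_ENV.items() if env.get(var))
```

Calling `configured_classes()` with no arguments inspects the real process environment; passing a plain dict makes the check easy to test.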
Package structure
multi_ocr/
├── __init__.py ← Public API
├── py.typed ← PEP 561 typed marker
├── ocr_engine.py ← Core engines + QualityScorer + MultiEngineOCR
├── ensemble.py ← OCREnsemble (voting-based merger)
├── voting.py ← VotingStrategy (best_of_n / line_merge / quorum)
├── image_preprocessor.py ← Deskew, denoise, binarize, region detection
├── image_enhancer.py ← Quality analysis + targeted enhancement
├── tesseract_preprocessing.py ← 5-step Tesseract-optimised pipeline
├── ocr_enhancer.py ← LLM-based correction + structure extraction
├── swedish_postprocessor.py ← 7-step Swedish OCR post-processing
├── mistral_ocr.py ← Mistral dedicated OCR API (full PDF support)
├── google_vision_ocr.py ← Google Cloud Vision integration
├── svg_table.py ← SVG table generation from OCR output
├── routing.py ← Pre-OCR routing (text-layer detection)
└── unicode_clean.py ← Ligature repair + zero-width removal
© 2025 Trollfabriken AITrix AB — MIT License