Multi-engine OCR toolkit — Tesseract, OpenAI, Mistral, Anthropic, Google Vision, ensemble voting, image preprocessing, and Swedish text post-processing

These details have not been verified by PyPI

Project description

trollfab-ocr

Multi-engine OCR toolkit with quality scoring, image preprocessing, ensemble voting, and Swedish text post-processing — built for Trollfabriken AITrix AB document processing pipelines.

Engines

Engine	Class	Backend
Tesseract	`TesseractOCR`	Local, offline (requires Tesseract binary)
OpenAI	`VisionOCR`	GPT-4o vision API
Mistral	`MistralOCR` / `MistralOCRDedicated`	Pixtral + dedicated OCR API
Anthropic	`AnthropicOCR`	Claude vision API
Google Vision	`GoogleVisionOCR`	Cloud Vision `document_text_detection`
Multi-engine	`MultiEngineOCR`	Auto-fallback orchestrator
Ensemble	`OCREnsemble`	Voting-based merger across engines

Installation

# Core only (no external deps)
pip install trollfab-ocr

# With specific engine support
pip install "trollfab-ocr[tesseract]"    # pytesseract
pip install "trollfab-ocr[openai]"       # OpenAI GPT-4o
pip install "trollfab-ocr[mistral]"      # Mistral OCR API
pip install "trollfab-ocr[anthropic]"    # Anthropic Claude
pip install "trollfab-ocr[google]"       # Google Cloud Vision
pip install "trollfab-ocr[pdf]"          # pdfplumber (pre-OCR routing)
pip install "trollfab-ocr[preprocessing]"  # numpy + OpenCV + scipy
pip install "trollfab-ocr[all]"          # everything

Quick start

Single engine

from multi_ocr import TesseractOCR, VisionOCR, MistralOCRDedicated

# Tesseract (local)
ocr = TesseractOCR(languages=["swe", "eng"])
result = ocr.extract_text("scan.png")
print(result.text, result.quality_score)

# OpenAI GPT-4o
ocr = VisionOCR()  # uses OPENAI_API_KEY from env
result = ocr.extract_text("document.jpg")

# Mistral dedicated OCR (best for PDFs)
ocr = MistralOCRDedicated()  # uses MISTRAL_API_KEY from env
result = ocr.extract_from_pdf("report.pdf")

Auto-fallback orchestrator

from multi_ocr import MultiEngineOCR

ocr = MultiEngineOCR()  # uses all available engines
result = ocr.extract("document.png")
print(result.text, result.quality_score, result.engine_used)
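
How `MultiEngineOCR` picks between engines is not spelled out above. The following is a minimal sketch of one plausible auto-fallback loop, assuming engines are tried in order and the first result that reaches a quality threshold wins. `Result`, `extract_with_fallback`, and the `min_quality` threshold are hypothetical names for illustration, not the package's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Result:
    text: str
    quality_score: float
    engine_used: str


# Hypothetical engine model: (name, callable returning (text, quality_score)).
Engine = Tuple[str, Callable[[str], Tuple[str, float]]]


def extract_with_fallback(
    path: str, engines: Sequence[Engine], min_quality: float = 0.7
) -> Optional[Result]:
    """Try engines in order; return the first good result, else the best seen."""
    best: Optional[Result] = None
    for name, run in engines:
        try:
            text, score = run(path)
        except Exception:
            # An unavailable or failing engine is skipped, not fatal.
            continue
        candidate = Result(text, score, name)
        if score >= min_quality:
            return candidate  # good enough: stop before calling later engines
        if best is None or score > best.quality_score:
            best = candidate
    return best  # None if every engine failed
```

Ordering the list from cheapest to most expensive engine means paid APIs are only called when the local result scores below the threshold.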

Ensemble voting

from multi_ocr import OCREnsemble, TesseractOCR, VisionOCR, MistralOCR

ensemble = OCREnsemble(engines=[TesseractOCR(), VisionOCR(), MistralOCR()])
result = ensemble.extract_with_voting("scan.png")
print(result.text, result.quality_score, result.agreement_score)
print(result.engines_used, result.voting_method)
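
To illustrate what a voting-based merge can look like, here is a minimal sketch of a line-by-line majority vote with an agreement score. It assumes the engines return the same line breaks, which real OCR output often does not; `line_vote` is a hypothetical helper and is not taken from `voting.py`.

```python
from collections import Counter
from typing import List, Tuple


def line_vote(outputs: List[str]) -> Tuple[str, float]:
    """Merge engine outputs line by line; return (merged_text, agreement_score)."""
    if not outputs:
        return "", 0.0
    split = [o.splitlines() for o in outputs]
    n_lines = max(len(lines) for lines in split)
    merged: List[str] = []
    agreements: List[float] = []
    for i in range(n_lines):
        # Engines that produced fewer lines simply do not vote on line i.
        candidates = [lines[i].strip() for lines in split if i < len(lines)]
        # On a tie, Counter keeps first-seen order, so earlier engines win.
        winner, votes = Counter(candidates).most_common(1)[0]
        merged.append(winner)
        agreements.append(votes / len(outputs))
    agreement = sum(agreements) / len(agreements) if agreements else 0.0
    return "\n".join(merged), agreement
```

The agreement score here is the mean share of engines that backed each winning line, so 1.0 means every engine produced identical text.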

Image preprocessing

from multi_ocr import ImagePreprocessor, TesseractPreprocessor, tesseract_preprocess

# Full preprocessing pipeline
prep = ImagePreprocessor()
enhanced = prep.prepare_for_ocr("noisy_scan.png")

# Tesseract-optimised 5-step pipeline (steps include grayscale → binarize → denoise → enhance)
img = tesseract_preprocess("scan.jpg")

# Image quality analysis + targeted enhancement
from multi_ocr import ImageEnhancer
enhancer = ImageEnhancer()
analysis = enhancer.analyze_quality("document.jpg")
result = enhancer.enhance("document.jpg", preset="scan")

Swedish text post-processing

from multi_ocr import SwedishTextPostProcessor

pp = SwedishTextPostProcessor()
clean = pp.process(raw_ocr_text)
# Repairs ligatures, OCR artefacts, and diacritical characters; covers all 290
# Swedish municipalities and Swedish abbreviations; normalises whitespace

Pre-OCR routing (avoid unnecessary API calls)

from multi_ocr import has_text_layer, is_simple, extract_text_layer

path = "document.pdf"
if is_simple(path):
    # Fast path: extract native text, no OCR needed
    text = extract_text_layer(path)
elif has_text_layer(path):
    # Has text but complex layout — use Mistral or Docling
    ...
else:
    # Scanned image — run full OCR pipeline
    ...
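
The criteria behind `is_simple` and `has_text_layer` are not documented above. As a rough sketch of how such a three-way decision could be made, the function below classifies a PDF from the text already extracted per page (for example with pdfplumber from the `[pdf]` extra). The name `route`, its return labels, and both thresholds are assumptions chosen for illustration.

```python
from typing import List


def route(
    page_texts: List[str],
    min_chars: int = 50,
    max_short_line_ratio: float = 0.5,
) -> str:
    """Classify a document as 'simple', 'complex', or 'scanned'."""
    pages_with_text = [t for t in page_texts if len(t.strip()) >= min_chars]
    if not pages_with_text:
        return "scanned"  # no usable text layer: run the full OCR pipeline
    if len(pages_with_text) < len(page_texts):
        return "complex"  # mixed native and scanned pages
    lines = [ln for t in page_texts for ln in t.splitlines() if ln.strip()]
    # Many one- or two-word lines hint at tables or multi-column layouts.
    short = sum(1 for ln in lines if len(ln.split()) <= 2)
    if lines and short / len(lines) > max_short_line_ratio:
        return "complex"
    return "simple"  # native text can be used directly, no OCR call needed
```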

Unicode cleaning

from multi_ocr import clean_unicode, repair_ligatures

text = clean_unicode(raw_text)         # NFKD → ligature repair → NFC → remove zero-width
text = repair_ligatures("ﬁle ﬀ")      # → "file ff"
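
The pipeline order noted in the comment above (NFKD, ligature repair, NFC, zero-width removal) can be reproduced with the standard library alone. This is a sketch under that assumption, not the package's implementation; the function names and the exact ligature and zero-width tables are illustrative.

```python
import unicodedata

# Common Latin ligatures (assumed subset for illustration).
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
# Zero-width characters that often leak into extracted text.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def repair_ligatures_sketch(text: str) -> str:
    """Replace ligature code points with their plain-letter spelling."""
    for ligature, replacement in LIGATURES.items():
        text = text.replace(ligature, replacement)
    return text


def clean_unicode_sketch(text: str) -> str:
    """NFKD, then ligature repair, then NFC, then zero-width removal."""
    # NFKD already splits the standard ligatures; the explicit pass is
    # kept to mirror the documented order and to allow custom mappings.
    text = unicodedata.normalize("NFKD", text)
    text = repair_ligatures_sketch(text)
    # NFC recomposes letters such as "a" + combining ring into "å".
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)
```

Finishing with NFC matters for Swedish text: NFKD splits å, ä, and ö into a base letter plus a combining mark, and leaving them decomposed would break simple string comparisons later on.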

LLM-based OCR enhancement

from multi_ocr import OCREnhancer

enhancer = OCREnhancer()  # uses OPENAI_API_KEY from env
corrected = enhancer.correct_errors(raw_ocr_text)
doc_type = enhancer.identify_document_type(raw_ocr_text)
structure = enhancer.extract_structure(raw_ocr_text)

Quality scoring

`QualityScorer` rates text on several axes (length, Swedish characters, municipal patterns, document structure) and returns a score between 0 and 1. All engines and the ensemble use it internally to select the best result.
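
As an illustration of multi-axis scoring, the sketch below combines the four axes named above into a single 0 to 1 value. The equal weights, the saturation points, and the keyword pattern are all assumptions made for this example; the real `QualityScorer` may weigh things quite differently.

```python
import re

SWEDISH_CHARS = set("åäöÅÄÖ")
# Hypothetical keywords typical of Swedish municipal documents.
MUNICIPAL_PATTERN = re.compile(
    r"\b(kommun|kommunstyrelsen|nämnd\w*|protokoll)\b", re.IGNORECASE
)


def quality_score(text: str, target_length: int = 500) -> float:
    """Return a heuristic 0-1 score; every axis contributes a quarter."""
    stripped = text.strip()
    if not stripped:
        return 0.0
    # Axis 1: length, saturating at target_length characters.
    length = min(len(stripped) / target_length, 1.0)
    # Axis 2: share of å/ä/ö among letters, saturating at 5 percent.
    letters = [c for c in stripped if c.isalpha()]
    share = sum(c in SWEDISH_CHARS for c in letters) / len(letters) if letters else 0.0
    swedish = min(share / 0.05, 1.0)
    # Axis 3: presence of a municipal keyword.
    municipal = 1.0 if MUNICIPAL_PATTERN.search(stripped) else 0.0
    # Axis 4: document structure, approximated by non-empty line count.
    lines = [ln for ln in stripped.splitlines() if ln.strip()]
    structure = min(len(lines) / 5, 1.0)
    return round(0.25 * (length + swedish + municipal + structure), 4)
```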

Environment variables

Variable	Engine
`OPENAI_API_KEY`	`VisionOCR`, `OCREnhancer`
`MISTRAL_API_KEY`	`MistralOCR`, `MistralOCRDedicated`
`ANTHROPIC_API_KEY`	`AnthropicOCR`
`GOOGLE_APPLICATION_CREDENTIALS`	`GoogleVisionOCR`
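
Before constructing engines it can help to check which credentials are present. The sketch below maps the variables in the table above to the classes that read them; `configured_engines` is a hypothetical helper and is not part of the package.

```python
import os
from typing import List, Mapping, Optional

# Class name -> environment variable, taken from the table above.
ENGINE_ENV_VARS = {
    "VisionOCR": "OPENAI_API_KEY",
    "OCREnhancer": "OPENAI_API_KEY",
    "MistralOCR": "MISTRAL_API_KEY",
    "MistralOCRDedicated": "MISTRAL_API_KEY",
    "AnthropicOCR": "ANTHROPIC_API_KEY",
    "GoogleVisionOCR": "GOOGLE_APPLICATION_CREDENTIALS",
}


def configured_engines(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """List the classes whose credential variable is set and non-empty."""
    env = os.environ if env is None else env
    return [
        name
        for name, variable in ENGINE_ENV_VARS.items()
        if env.get(variable, "").strip()
    ]
```

Calling `configured_engines()` with no argument inspects the real process environment; passing a dictionary makes the check easy to test.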

Package structure

multi_ocr/
├── __init__.py              ← Public API
├── py.typed                 ← PEP 561 typed marker
├── ocr_engine.py            ← Core engines + QualityScorer + MultiEngineOCR
├── ensemble.py              ← OCREnsemble (voting-based merger)
├── voting.py                ← VotingStrategy (best_of_n / line_merge / quorum)
├── image_preprocessor.py    ← Deskew, denoise, binarize, region detection
├── image_enhancer.py        ← Quality analysis + targeted enhancement
├── tesseract_preprocessing.py ← 5-step Tesseract-optimised pipeline
├── ocr_enhancer.py          ← LLM-based correction + structure extraction
├── swedish_postprocessor.py ← 7-step Swedish OCR post-processing
├── mistral_ocr.py           ← Mistral dedicated OCR API (full PDF support)
├── google_vision_ocr.py     ← Google Cloud Vision integration
├── svg_table.py             ← SVG table generation from OCR output
├── routing.py               ← Pre-OCR routing (text-layer detection)
└── unicode_clean.py         ← Ligature repair + zero-width removal

Project details

This version

1.0.0

May 15, 2026

Download files

Source Distribution

trollfab_ocr-1.0.0.tar.gz (49.7 kB)

Uploaded May 15, 2026 Source

Built Distribution

trollfab_ocr-1.0.0-py3-none-any.whl (53.1 kB)

Uploaded May 15, 2026 Python 3

File details

Details for the file trollfab_ocr-1.0.0.tar.gz.

File metadata

Download URL: trollfab_ocr-1.0.0.tar.gz
Upload date: May 15, 2026
Size: 49.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for trollfab_ocr-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`e6c60e0d3536cb00a4c49ad8bd484ca697d13ce0b44e3fc2343b4edaedca4e59`
MD5	`fdee6277a457f32e0e8453a2c4660ae4`
BLAKE2b-256	`3d3776a9cd7c13c3fe7d8a519d5eec6b0b23bffa0d306b9c5b8e27d93b8b3cbd`

File details

Details for the file trollfab_ocr-1.0.0-py3-none-any.whl.

File metadata

Download URL: trollfab_ocr-1.0.0-py3-none-any.whl
Upload date: May 15, 2026
Size: 53.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for trollfab_ocr-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d94082cc5dbf6077f5e9ed36e67c3e1eeb7c04f60b701228dac33a48866b441d`
MD5	`d33f973b653f209383287b370ba4bef7`
BLAKE2b-256	`e3a251438a95adf60c09428d810b89311174a93bce93702cd1ed1d672e67679a`
