
Thulium - State-of-the-art Multilingual Handwriting Text Recognition for Python




Thulium is a state-of-the-art, open-source Python library for offline handwritten text recognition (HTR) and document intelligence. Engineered for high-performance research and production deployments, Thulium provides an end-to-end processing stack from document layout analysis to language-model-enhanced decoding across 52+ languages, with comprehensive support for Latin, Cyrillic, Arabic, Devanagari, Georgian, Armenian, and other major world scripts.

Version 0.2.0 (Beta): Language parity release with first-class support for all 52+ languages. See CHANGELOG.md for details.

Overview

Thulium abstracts the complexity of modern deep learning-based OCR/HTR pipelines into a modular, extensible Python API. The library is designed to address the challenges of multilingual handwriting recognition, providing robust solutions for digitizing historical archives, processing structured forms, and building reading systems for both high-resource and low-resource languages.

The architecture emphasizes configurability and research reproducibility, enabling researchers and engineers to experiment with different model components while maintaining production-grade reliability.

Key Capabilities

  • Multilingual Deep Learning: Pluggable language profiles supporting 52+ languages across Latin, Cyrillic, Arabic, Devanagari, Georgian, Armenian, CJK, and other scripts.
  • End-to-End Pipeline: Complete processing chain including:
    • Preprocessing: Image normalization, binarization, and augmentation.
    • Segmentation: Robust line and word segmentation via U-Net architectures.
    • Recognition: CNN-RNN-CTC and Transformer-based HTR models.
    • Decoding: Greedy, beam search, and language-model-enhanced decoding.
  • Production-Ready Design: Built with modularity, extensibility, and rigorous testing practices.
  • Explainability (XAI): Built-in tools for attention visualization and confidence analysis.
  • Comprehensive CLI: Command-line interface for batch processing and evaluation.
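To make the decoding step concrete, here is a minimal sketch of greedy (best-path) CTC decoding, the simplest of the strategies listed above. It is a generic illustration in plain Python, not Thulium's implementation; the `ctc_greedy_decode` name, the toy alphabet, and the choice of index 0 for the blank symbol are assumptions made for this example.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Collapse per-frame argmax labels into text (best-path CTC decoding)."""
    # 1. Pick the highest-scoring label for every time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # 2. Merge runs of identical labels.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Drop blanks and map the remaining labels to characters.
    return "".join(alphabet[label] for label in collapsed if label != blank)


# Toy alphabet: index 0 is the blank, the rest are characters.
alphabet = ["<blank>", "a", "b"]
frames = [
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.70, 0.20],  # a (repeat, merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank (separates the two 'a' runs)
    [0.20, 0.70, 0.10],  # a
    [0.10, 0.20, 0.70],  # b
]
print(ctc_greedy_decode(frames, alphabet))  # prints "aab"
```

Beam search and language-model-enhanced decoding keep several candidate paths per time step instead of a single argmax, which generally costs more compute in exchange for better handling of ambiguous frames.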

Installation

From PyPI

pip install thulium-htr

From Source (Development)

git clone https://github.com/olaflaitinen/Thulium.git
cd Thulium
pip install -e ".[dev]"

System Requirements

  • Python 3.10 or higher
  • PyTorch 2.0 or higher
  • Optional: CUDA-compatible GPU for accelerated inference
  • Optional: poppler-utils for PDF processing
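The requirements above can be checked before installation with a short script. This is a rough sketch using only the standard library; `check_environment` is a hypothetical helper, not part of Thulium. It only tests whether `torch` is importable (it does not verify the 2.0 minimum or CUDA availability) and looks for `pdftoppm`, one of the executables shipped with poppler-utils.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Report which of the listed requirements appear to be satisfied."""
    return {
        "python>=3.10": sys.version_info >= (3, 10),
        # Presence check only; the installed PyTorch version is not inspected.
        "torch installed": importlib.util.find_spec("torch") is not None,
        # poppler-utils provides the pdftoppm executable.
        "poppler-utils (pdftoppm)": shutil.which("pdftoppm") is not None,
    }


for requirement, ok in check_environment().items():
    print(f"{requirement}: {'ok' if ok else 'missing'}")
```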

Quickstart

Python API

The high-level API automates model selection and pipeline orchestration.

from thulium.api import recognize_image

# Recognize text in an Azerbaijani document
result = recognize_image(
    path="docs/samples/handwriting.jpg",
    language="az",
    device="auto"  # Automatically uses GPU if available
)

print(f"Full Text:\n{result.full_text}")

# Inspect confidence per line
for line in result.lines:
    if line.confidence < 0.8:
        print(f"Low confidence line [{line.confidence:.2f}]: {line.text}")
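The confidence check above extends naturally to a triage step that routes uncertain lines to manual review. The sketch below assumes only what the Quickstart shows, namely that each line exposes `.text` and `.confidence`; the `Line` dataclass is a stand-in so the example runs without Thulium installed, and `split_by_confidence` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Stand-in for a recognized line; only .text and .confidence are assumed."""

    text: str
    confidence: float


def split_by_confidence(lines, threshold=0.8):
    """Return (accepted, needs_review) lists, preserving reading order."""
    accepted = [line for line in lines if line.confidence >= threshold]
    needs_review = [line for line in lines if line.confidence < threshold]
    return accepted, needs_review


lines = [
    Line("Dear Sir,", 0.97),
    Line("I wr1te to you", 0.62),
    Line("regarding the deed.", 0.91),
]
accepted, needs_review = split_by_confidence(lines)
print([line.text for line in needs_review])  # prints ['I wr1te to you']
```

The 0.8 threshold mirrors the Quickstart; a suitable value depends on the model and the material being transcribed.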

CLI Usage

Thulium includes a robust command-line interface for batch processing and evaluation.

# Basic recognition
thulium recognize my_document.jpg --language az --output result.json

# Verbose logging
thulium recognize page_01.png -l en -v

# Show version
thulium version
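For batch processing, one option is to drive the CLI from a short Python script. The sketch below uses only the `recognize` subcommand and the `--language` and `--output` flags shown above; `build_commands` and `run_batch` are hypothetical helpers, and writing one JSON file per input image is an assumption about how you want results organized.

```python
import subprocess
from pathlib import Path


def build_commands(images, output_dir, language):
    """Build one `thulium recognize` argument list per input image."""
    commands = []
    for image in images:
        image = Path(image)
        output = Path(output_dir) / f"{image.stem}.json"
        commands.append(
            ["thulium", "recognize", str(image),
             "--language", language, "--output", str(output)]
        )
    return commands


def run_batch(images, output_dir, language):
    """Run the commands one after another, stopping at the first failure."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for command in build_commands(images, output_dir, language):
        subprocess.run(command, check=True)


# Building the commands has no side effects, so they can be inspected first.
commands = build_commands(["scans/page_01.png", "scans/page_02.png"], "out", "az")
print(commands[0])
```

Calling `run_batch(sorted(Path("scans").glob("*.png")), "out", "az")` would then process every PNG in a `scans` directory, assuming the `thulium` executable is on the PATH.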

Architecture Overview

Thulium is organized into modular, composable layers to facilitate research and extension:

flowchart LR
    A[Input Image / PDF] --> B[Preprocessing]
    B --> C[Layout Segmentation]
    C --> D[Line / Word Crops]
    D --> E[HTR Model]
    E --> F[Language Model Scoring]
    F --> G[Post-processing]
    G --> H[Structured Output]

| Module | Description |
|---|---|
| thulium.api | High-level entry points for ease of use. |
| thulium.data | Loaders, transforms, and language profile registry. |
| thulium.models | PyTorch implementations of backbones, sequence heads, and decoders. |
| thulium.pipeline | Logic for chaining segmentation and recognition steps. |
| thulium.evaluation | Metrics (CER, WER, SER) and benchmarking tools. |
| thulium.xai | Explainability via attention maps and confidence analysis. |

For a detailed technical description, see the Architecture Documentation.
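The layered design can be pictured as a chain of functions, each consuming the previous stage's output. The following is a conceptual sketch only: the stage functions are toy stand-ins that operate on a plain dict rather than images, and none of the names (`compose`, `preprocess`, `segment`, `recognize`, `postprocess`) are taken from Thulium's API.

```python
from functools import reduce


def compose(*stages):
    """Chain stages left to right: the output of one feeds the next."""
    return lambda data: reduce(lambda value, stage: stage(value), stages, data)


# Toy stand-ins for the real stages; each takes and returns a plain dict.
def preprocess(doc):
    return {**doc, "normalized": True}


def segment(doc):
    # Pretend each text line is a cropped line image.
    return {**doc, "lines": doc["raw"].split("\n")}


def recognize(doc):
    return {**doc, "texts": [line.strip() for line in doc["lines"]]}


def postprocess(doc):
    return {**doc, "full_text": "\n".join(text for text in doc["texts"] if text)}


pipeline = compose(preprocess, segment, recognize, postprocess)
result = pipeline({"raw": " first line \n second line "})
print(result["full_text"])
```

Because every stage shares the same input and output shape, a single stage (for example, a different recognizer) can be swapped without touching the others, which is the property the modular layout is meant to provide.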

Language Support

Thulium supports 52+ languages across diverse writing systems. Each language is defined by a modular Language Profile in thulium.data.language_profiles.

Scandinavian Languages

| Code | Language | Script | Notes |
|---|---|---|---|
| nb | Norwegian (Bokmål) | Latin | Standard written Norwegian |
| nn | Norwegian (Nynorsk) | Latin | New Norwegian variant |
| sv | Swedish | Latin | |
| da | Danish | Latin | |
| is | Icelandic | Latin | Preserves Old Norse characters |
| fo | Faroese | Latin | |
| fi | Finnish | Latin | Finno-Ugric language |

Baltic Languages

| Code | Language | Script | Notes |
|---|---|---|---|
| lt | Lithuanian | Latin | |
| lv | Latvian | Latin | |
| et | Estonian | Latin | Finno-Ugric language |

Caucasus Region

| Code | Language | Script | Notes |
|---|---|---|---|
| az | Azerbaijani | Latin | Extended alphabet with special characters |
| tr | Turkish | Latin | Dotted/dotless i distinction |
| ka | Georgian | Georgian | Mkhedruli script |
| hy | Armenian | Armenian | Eastern Armenian alphabet |

Western Europe

| Code | Language | Script | Notes |
|---|---|---|---|
| en | English | Latin | Baseline model and configs |
| de | German | Latin | |
| fr | French | Latin | Full accent support |
| es | Spanish | Latin | |
| pt | Portuguese | Latin | |
| it | Italian | Latin | |
| nl | Dutch | Latin | |

Eastern Europe

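| Code | Language | Script | Notes |
|---|---|---|---|
| pl | Polish | Latin | |
| cs | Czech | Latin | |
| sk | Slovak | Latin | |
| hu | Hungarian | Latin | |
| ro | Romanian | Latin | |
| bg | Bulgarian | Cyrillic | |
| sr | Serbian (Cyrillic) | Cyrillic | Also supports Latin variant |
| hr | Croatian | Latin | |
| sl | Slovenian | Latin | |
| ru | Russian | Cyrillic | |
| uk | Ukrainian | Cyrillic | |
| el | Greek | Greek | |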

Middle East and Central Asia

| Code | Language | Script | Direction | Notes |
|---|---|---|---|---|
| ar | Arabic | Arabic | RTL | |
| fa | Persian (Farsi) | Arabic | RTL | |
| ur | Urdu | Arabic | RTL | |
| he | Hebrew | Hebrew | RTL | |

South Asia

| Code | Language | Script | Notes |
|---|---|---|---|
| hi | Hindi | Devanagari | |
| bn | Bengali | Bengali | |
| ta | Tamil | Tamil | |
| te | Telugu | Telugu | |
| mr | Marathi | Devanagari | |

East and Southeast Asia

| Code | Language | Script | Notes |
|---|---|---|---|
| zh | Chinese (Simplified) | Han | Common characters subset |
| ja | Japanese | Mixed | Hiragana and Katakana |
| ko | Korean | Hangul | Jamo-based |
| th | Thai | Thai | |
| vi | Vietnamese | Latin | Extensive diacritics |
| id | Indonesian | Latin | |
| ms | Malay | Latin | |

Africa

| Code | Language | Script | Notes |
|---|---|---|---|
| sw | Swahili | Latin | |
| af | Afrikaans | Latin | |

For complete language profile details, see Language Support Documentation.
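As an illustration of what a language profile registry might look like, the sketch below models a profile as a small immutable record keyed by language code. The `LanguageProfile` fields, `register`, and `get_profile` are assumptions made for this example and do not describe the actual contents of thulium.data.language_profiles; the three entries are taken from the tables above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageProfile:
    """Hypothetical profile record; the field names are illustrative only."""

    code: str
    name: str
    script: str
    direction: str = "LTR"


_REGISTRY = {}


def register(profile):
    """Add a profile to the registry, keyed by its language code."""
    _REGISTRY[profile.code] = profile
    return profile


def get_profile(code):
    """Look up a profile, failing loudly on unknown codes."""
    try:
        return _REGISTRY[code]
    except KeyError:
        raise ValueError(f"Unsupported language code: {code!r}") from None


register(LanguageProfile("az", "Azerbaijani", "Latin"))
register(LanguageProfile("ka", "Georgian", "Georgian"))
register(LanguageProfile("ar", "Arabic", "Arabic", direction="RTL"))

print(get_profile("ar").direction)  # prints "RTL"
```

Keeping direction in the profile lets downstream stages treat right-to-left scripts as data rather than special-casing individual language codes.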

Evaluation and Benchmarking

Thulium includes built-in tools for rigorous evaluation using standard metrics.

from thulium.evaluation.metrics import cer, wer

reference = "The quick brown fox"
hypothesis = "The quick brown fax"

print(f"CER: {cer(reference, hypothesis):.4f}")
print(f"WER: {wer(reference, hypothesis):.4f}")

Metrics

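  • CER (Character Error Rate): Character-level edit distance between hypothesis and reference, divided by the reference length in characters. Lower is better.
  • WER (Word Error Rate): The same ratio computed over words instead of characters. Lower is better.
  • SER (Sequence Error Rate): Fraction of sequences that do not match their reference exactly; each sequence contributes a binary match/mismatch.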

For detailed metric definitions and formulas, see Evaluation Metrics.
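For readers who want to see how these metrics are commonly computed, the sketch below implements them with a standard Levenshtein edit distance in plain Python. It is a textbook formulation and not Thulium's code; the `_sketch` suffixes mark these functions as illustrative, and details such as whitespace tokenization for WER and the handling of empty references are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0  # assumed convention for empty references
    return edit_distance(ref, hyp) / len(ref)


def cer_sketch(reference, hypothesis):
    return error_rate(reference, hypothesis)


def wer_sketch(reference, hypothesis):
    return error_rate(reference.split(), hypothesis.split())


def ser_sketch(references, hypotheses):
    """Fraction of sequences that differ from their reference."""
    pairs = list(zip(references, hypotheses))
    return sum(ref != hyp for ref, hyp in pairs) / len(pairs) if pairs else 0.0


reference = "The quick brown fox"
hypothesis = "The quick brown fax"
print(f"CER: {cer_sketch(reference, hypothesis):.4f}")  # prints "CER: 0.0526"
print(f"WER: {wer_sketch(reference, hypothesis):.4f}")  # prints "WER: 0.2500"
```

In this example one substituted character out of 19 gives a CER of about 0.05, while one wrong word out of four gives a WER of 0.25, which shows why WER tends to be much higher than CER on the same output.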

Contributing

We welcome contributions from the community, especially for adding new language profiles, model architectures, or evaluation benchmarks. Please refer to CONTRIBUTING.md for guidelines on code style, testing, and pull requests.

All contributors are expected to adhere to our Code of Conduct.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


Thulium is named after the rare earth element (atomic number 69), symbolizing the specialized, high-value nature of multilingual handwriting intelligence.
