Thulium - State-of-the-Art Multilingual Handwriting Text Recognition for Python

Thulium HTR

State-of-the-Art Multilingual Handwriting Text Recognition

Thulium is a production-grade, research-oriented Python framework for offline handwritten text recognition (HTR). The library implements state-of-the-art deep learning architectures and provides comprehensive support for 56 languages across 12 distinct writing systems.

Version 1.0.1: Production-ready release with complete language parity, SoTA architectures, and a comprehensive evaluation suite.


Overview

Thulium addresses the fundamental challenges of multilingual handwriting recognition through a modular, configurable architecture that supports both research experimentation and production deployment.

Core Capabilities

| Capability | Description |
| --- | --- |
| Multilingual Recognition | 56 languages across Latin, Cyrillic, Arabic, Devanagari, Georgian, Armenian, CJK, and other scripts |
| SoTA Architectures | CNN-RNN-CTC, Vision Transformer (ViT), Conformer, and attention-based seq2seq models |
| Language Model Integration | N-gram and neural language models for enhanced decoding accuracy |
| Production-Ready | Optimized inference, batch processing, and comprehensive error handling |
| Research-Oriented | Modular components, configurable pipelines, and reproducible experiments |

Design Principles

  1. Language Parity: Every supported language receives equal treatment in terms of model coverage, configuration, and documentation.
  2. Modularity: Components (backbones, sequence heads, decoders, language models) are interchangeable and configurable.
  3. Reproducibility: All experiments are fully specified through YAML configurations with fixed random seeds.
  4. Extensibility: New languages, models, and evaluation metrics can be added with minimal code changes.
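
The reproducibility principle above depends on seeding every random number generator from a value stored in the experiment configuration. The following is a minimal sketch of that idea, not Thulium's own code: the `seed_everything` helper and the `config` dictionary are hypothetical, and NumPy and PyTorch are seeded only if they happen to be installed.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the random number generators available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional for this sketch
    try:
        import torch
        torch.manual_seed(seed)  # seeds PyTorch generators on all devices
    except ImportError:
        pass  # PyTorch is optional for this sketch


# Hypothetical configuration, as it might look after parsing a YAML file.
config = {"seed": 1234, "language": "en"}

seed_everything(config["seed"])
first = [random.random() for _ in range(3)]

seed_everything(config["seed"])
second = [random.random() for _ in range(3)]

print(first == second)  # True: the same seed reproduces the same draws
```

Fixed seeds make random draws repeatable, but some GPU operations can still be non-deterministic, so bit-identical training runs may need additional framework settings.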

Installation

From PyPI

pip install thulium-htr

From Source

git clone https://github.com/olaflaitinen/Thulium.git
cd Thulium
pip install -e ".[dev]"

Requirements

| Requirement | Version |
| --- | --- |
| Python | 3.10+ |
| PyTorch | 2.0+ |
| CUDA (optional) | 11.8+ |
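
To check an environment against the requirements above before installing, a short script along these lines may help. It is a sketch that is not part of Thulium: `python_ok` and `describe_torch` are hypothetical helper names, and the only PyTorch attributes used are `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version from the requirements table


def python_ok(version_info=sys.version_info) -> bool:
    """Return True if the interpreter meets the minimum Python version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def describe_torch() -> str:
    """Summarize the installed PyTorch build, if there is one."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    cuda = torch.version.cuda if torch.cuda.is_available() else None
    return f"PyTorch {torch.__version__}, CUDA: {cuda or 'not available'}"


print("Python version OK:", python_ok())
print(describe_torch())
```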

Quickstart

Python API

from thulium.api import recognize_image

# Recognize handwritten text
result = recognize_image(
    path="document.jpg",
    language="en",
    device="auto"
)

print(result.full_text)

Command-Line Interface

# Basic recognition
thulium recognize document.jpg --language en --output result.json

# Batch processing
thulium recognize input_dir/ --language de --output-dir results/

# Run benchmarks
thulium benchmark run config/eval/iam_en.yaml

Architecture

Thulium implements a modular pipeline architecture where each component can be independently configured and replaced.

System Architecture

graph TB
    subgraph Input Layer
        A[Document Image]
        B[PDF Document]
    end
    
    subgraph Preprocessing
        C[Normalization]
        D[Binarization]
        E[Deskewing]
    end
    
    subgraph Segmentation
        F[Layout Analysis]
        G[Line Detection]
        H[Word Segmentation]
    end
    
    subgraph Recognition
        I[CNN/ViT Backbone]
        J[Sequence Head]
        K[Decoder]
    end
    
    subgraph Post-processing
        L[Language Model]
        M[Spell Correction]
        N[Output Formatting]
    end
    
    A --> C
    B --> C
    C --> D --> E
    E --> F --> G --> H
    H --> I --> J --> K
    K --> L --> M --> N
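
The staged flow in the diagram can be read as a chain of interchangeable callables. The sketch below illustrates that pattern under stated assumptions: `Pipeline`, `Stage`, and `make_stage` are hypothetical names, not Thulium classes, and each placeholder stage only records that it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A stage takes the document state and returns an updated state.
Stage = Callable[[dict], dict]


@dataclass
class Pipeline:
    stages: List[Stage] = field(default_factory=list)

    def run(self, document: dict) -> dict:
        # Feed the output of each stage into the next one, in order.
        for stage in self.stages:
            document = stage(document)
        return document


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that appends its name to a trace."""
    def stage(document: dict) -> dict:
        return {**document, "trace": document.get("trace", []) + [name]}
    return stage


names = ("preprocessing", "segmentation", "recognition", "postprocessing")
default = Pipeline([make_stage(n) for n in names])
result = default.run({"source": "document.jpg"})
print(result["trace"])

# Replacing one stage leaves the others untouched.
custom = Pipeline(
    default.stages[:2] + [make_stage("custom_recognition")] + default.stages[3:]
)
print(custom.run({"source": "document.jpg"})["trace"])
```

Keeping every stage behind the same call signature is what allows a single component to be swapped without changing the rest of the chain.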

Model Architecture

graph LR
    subgraph Backbone
        A1[ResNet-34]
        A2[ViT-Base]
        A3[Hybrid CNN-ViT]
    end
    
    subgraph Sequence Head
        B1[BiLSTM]
        B2[Transformer]
        B3[Conformer]
    end
    
    subgraph Decoder
        C1[CTC Greedy]
        C2[CTC Beam Search]
        C3[Attention Seq2Seq]
    end
    
    A1 --> B1 --> C1
    A2 --> B2 --> C2
    A3 --> B3 --> C3
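
Of the decoders in the diagram, CTC greedy decoding is the simplest to illustrate: take the best class for each frame, collapse consecutive repeats, then drop the blank symbol. The sketch below is a plain-Python illustration, not Thulium's decoder. It assumes the blank symbol has index 0 and that class index i > 0 maps to `alphabet[i - 1]`.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Decode per-frame class scores with the CTC best-path rule."""
    # 1. Pick the highest-scoring class for each frame.
    best = [max(range(len(frame)), key=lambda i: frame[i]) for frame in frame_scores]
    # 2. Collapse runs of the same class into a single symbol.
    collapsed = [index for index, _ in groupby(best)]
    # 3. Remove blanks and map the remaining indices to characters.
    return "".join(alphabet[i - 1] for i in collapsed if i != BLANK)


def one_hot(index: int, size: int) -> List[float]:
    """Build a score vector that puts all mass on one class."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Frame-wise classes: a a _ a b b _  (where _ is the blank)
frames = [one_hot(k, 3) for k in (1, 1, 0, 1, 2, 2, 0)]
decoded = ctc_greedy_decode(frames, alphabet="ab")
print(decoded)  # aab
```

The blank between the two runs of `a` is what lets the decoder output a doubled letter. Without it, the repeats would collapse into one.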

Module Structure

| Module | Purpose |
| --- | --- |
| thulium.api | High-level recognition API |
| thulium.models.backbones | Feature extraction (CNN, ViT) |
| thulium.models.sequence | Sequence modeling (LSTM, Transformer) |
| thulium.models.decoders | Output decoding (CTC, Attention) |
| thulium.models.language_models | Language model integration |
| thulium.pipeline | End-to-end processing pipelines |
| thulium.evaluation | Metrics and benchmarking |
| thulium.data | Data loading and language profiles |

Supported Languages

Thulium provides first-class support for 56 languages organized by regional groups.

Language Coverage by Script

| Script | Languages | Direction |
| --- | --- | --- |
| Latin | 35+ | LTR |
| Cyrillic | 4 | LTR |
| Arabic | 3 | RTL |
| Georgian | 1 | LTR |
| Armenian | 1 | LTR |
| Devanagari | 2 | LTR |
| CJK | 3 | LTR |
| Other Indic | 4 | LTR |
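
The Direction column can be cross-checked against Unicode character data. The sketch below uses the standard library's `unicodedata.bidirectional` to estimate the dominant direction of a string. It illustrates the LTR/RTL distinction only and is not how Thulium determines direction; `dominant_direction` is a hypothetical helper.

```python
import unicodedata


def dominant_direction(text: str) -> str:
    """Return "RTL" if strongly right-to-left characters outnumber left-to-right ones."""
    # Bidi classes "R" (e.g. Hebrew) and "AL" (Arabic letters) are strongly RTL.
    rtl = sum(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)
    # Bidi class "L" covers Latin, Cyrillic, Georgian, Devanagari, and similar scripts.
    ltr = sum(unicodedata.bidirectional(ch) == "L" for ch in text)
    return "RTL" if rtl > ltr else "LTR"


arabic_sample = "\u0633\u0644\u0627\u0645"    # Arabic letters seen, lam, alef, meem
hebrew_sample = "\u05e9\u05dc\u05d5\u05dd"    # Hebrew letters shin, lamed, vav, final mem
georgian_sample = "\u10d0\u10d1\u10d2"        # first three Mkhedruli letters

print(dominant_direction("handwriting"))   # LTR
print(dominant_direction(arabic_sample))   # RTL
print(dominant_direction(hebrew_sample))   # RTL
print(dominant_direction(georgian_sample)) # LTR
```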

Regional Groups

Scandinavian Languages (7)

| Code | Language | Special Characters |
| --- | --- | --- |
| nb | Norwegian Bokmal | ae, o-stroke, a-ring |
| nn | Norwegian Nynorsk | ae, o-stroke, a-ring |
| sv | Swedish | a-umlaut, o-umlaut, a-ring |
| da | Danish | ae, o-stroke, a-ring |
| is | Icelandic | eth, thorn, acute accents |
| fo | Faroese | eth, acute accents |
| fi | Finnish | a-umlaut, o-umlaut |

Baltic Languages (3)

| Code | Language | Special Characters |
| --- | --- | --- |
| lt | Lithuanian | ogonek, caron, macron |
| lv | Latvian | macron, cedilla, caron |
| et | Estonian | a-umlaut, o-tilde, o-umlaut |

Caucasus Region (4)

| Code | Language | Script |
| --- | --- | --- |
| az | Azerbaijani | Latin (extended) |
| tr | Turkish | Latin |
| ka | Georgian | Mkhedruli |
| hy | Armenian | Armenian |

Western European (7)

| Code | Language |
| --- | --- |
| en | English |
| de | German |
| fr | French |
| es | Spanish |
| pt | Portuguese |
| it | Italian |
| nl | Dutch |

Eastern European (12)

| Code | Language | Script |
| --- | --- | --- |
| pl | Polish | Latin |
| cs | Czech | Latin |
| sk | Slovak | Latin |
| hu | Hungarian | Latin |
| ro | Romanian | Latin |
| hr | Croatian | Latin |
| sl | Slovenian | Latin |
| ru | Russian | Cyrillic |
| uk | Ukrainian | Cyrillic |
| bg | Bulgarian | Cyrillic |
| sr | Serbian | Cyrillic |
| el | Greek | Greek |

Middle East (4)

| Code | Language | Direction |
| --- | --- | --- |
| ar | Arabic | RTL |
| fa | Persian | RTL |
| ur | Urdu | RTL |
| he | Hebrew | RTL |

South Asia (9)

| Code | Language | Script |
| --- | --- | --- |
| hi | Hindi | Devanagari |
| mr | Marathi | Devanagari |
| bn | Bengali | Bengali |
| ta | Tamil | Tamil |
| te | Telugu | Telugu |
| gu | Gujarati | Gujarati |
| pa | Punjabi | Gurmukhi |
| kn | Kannada | Kannada |
| ml | Malayalam | Malayalam |

East Asia (3)

| Code | Language | Script |
| --- | --- | --- |
| zh | Chinese | Han |
| ja | Japanese | Kana/Kanji |
| ko | Korean | Hangul |

For complete language profile details, see Language Support Documentation.


Evaluation Metrics

Thulium implements the standard HTR evaluation metrics defined below.

Character Error Rate (CER)

The Character Error Rate is the character-level edit (Levenshtein) distance between the recognized text and the reference, normalized by the reference length:

CER = (S + D + I) / N

Where:

  • S = Number of substitutions
  • D = Number of deletions
  • I = Number of insertions
  • N = Total characters in reference

Word Error Rate (WER)

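The Word Error Rate applies the same formula to words instead of characters, where S_w, D_w, and I_w count word substitutions, deletions, and insertions, and N_w is the number of words in the reference: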

WER = (S_w + D_w + I_w) / N_w
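
Both formulas reduce to one edit-distance computation over different token sequences. The sketch below is a standalone illustration, not the implementation in `thulium.evaluation.metrics`. The names `edit_distance`, `char_error_rate`, and `word_error_rate` are hypothetical, and words are assumed to be separated by whitespace.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: the minimum S + D + I that turns ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference tokens into nothing takes i deletions
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of r
                curr[j - 1] + 1,         # insertion of h
                prev[j - 1] + (r != h),  # substitution, or a free match
            ))
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference: str, hypothesis: str) -> float:
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


print(char_error_rate("text two", "text too"))  # 0.125 (1 substitution / 8 characters)
print(word_error_rate("text two", "text too"))  # 0.5   (1 substitution / 2 words)
```

Because insertions count as errors but do not increase N, both rates can exceed 1 when the hypothesis is much longer than the reference.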

Fairness Metrics

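To check language parity, Thulium tracks how CER spreads across the L evaluated languages, using the range and the population standard deviation:

Delta_CER = max(CER_l) - min(CER_l)
Sigma_CER = sqrt(sum((CER_l - mean_CER)^2) / L)

Here CER_l is the CER for language l, mean_CER is the average of CER_l over all L languages, and the max, min, and sum run over l. Lower values of Delta_CER and Sigma_CER indicate more balanced performance across languages.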
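
As a worked example, the two fairness metrics can be computed from the CER figures in the Per-Language Performance table under Benchmarks. The sketch below is illustrative: `delta_cer` and `sigma_cer` are hypothetical helper names, and the eight languages in that table stand in for the full language set.

```python
import math
from typing import Sequence

# CER (%) per language, copied from the Per-Language Performance table.
cer_by_language = {
    "English": 1.8,
    "German": 2.1,
    "Norwegian": 2.1,
    "Azerbaijani": 2.2,
    "Russian": 2.5,
    "Georgian": 3.5,
    "Arabic": 4.2,
    "Chinese": 5.5,
}


def delta_cer(values: Sequence[float]) -> float:
    """Range of CER across languages."""
    return max(values) - min(values)


def sigma_cer(values: Sequence[float]) -> float:
    """Population standard deviation of CER across languages."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


values = list(cer_by_language.values())
print(f"Delta_CER = {delta_cer(values):.2f}")  # Delta_CER = 3.70
print(f"Sigma_CER = {sigma_cer(values):.2f}")  # Sigma_CER = 1.22
```

Both results are in percentage points because the inputs are percentages.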

Usage

from thulium.evaluation.metrics import cer, wer, cer_wer_batch

# Single pair
error_rate = cer("reference text", "recognized text")

# Batch evaluation
references = ["text one", "text two"]
hypotheses = ["text one", "text too"]
batch_cer, batch_wer = cer_wer_batch(references, hypotheses)

Benchmarks

Per-Language Performance

| Language | Script | CER (%) | WER (%) | Model |
| --- | --- | --- | --- | --- |
| English | Latin | 1.8 | 5.2 | Latin Multilingual |
| German | Latin | 2.1 | 6.0 | Latin Multilingual |
| Norwegian | Latin | 2.1 | 5.9 | Latin Multilingual |
| Azerbaijani | Latin | 2.2 | 6.2 | Latin Multilingual |
| Russian | Cyrillic | 2.5 | 6.8 | Cyrillic Multilingual |
| Georgian | Georgian | 3.5 | 8.2 | Georgian Specialized |
| Arabic | Arabic | 4.2 | 10.5 | Arabic Multilingual |
| Chinese | Han | 5.5 | - | CJK Multilingual |

For complete benchmark results, see Benchmark Documentation.


API Reference

High-Level API

from thulium.api import recognize_image, recognize_batch

# Example inputs (placeholder file names)
path = "document.jpg"
paths = ["page_1.jpg", "page_2.jpg"]

# Single image
result = recognize_image(path, language="en", device="auto")

# Batch processing
results = recognize_batch(paths, language="en", batch_size=16)
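
When the inputs live in a directory, the list passed to `recognize_batch` can be assembled with `pathlib`. The sketch below makes several assumptions: `collect_image_paths`, `recognize_directory`, and the `IMAGE_SUFFIXES` set are hypothetical, and the only Thulium call used is `recognize_batch` with the arguments shown above. Its return value is passed through unchanged because its structure is not documented here.

```python
from pathlib import Path
from typing import List

# Assumed set of image extensions; adjust to the formats actually in use.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def collect_image_paths(directory: str) -> List[str]:
    """Return the image files directly inside `directory`, sorted by path."""
    root = Path(directory)
    return sorted(
        str(p) for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def recognize_directory(directory: str, language: str = "en", batch_size: int = 16):
    """Run batch recognition over every image in a directory."""
    # Imported here so the helper above works without Thulium installed.
    from thulium.api import recognize_batch

    paths = collect_image_paths(directory)
    results = recognize_batch(paths, language=language, batch_size=batch_size)
    return paths, results
```

Sorting the paths keeps the input order stable between runs, which makes it easier to compare outputs across experiments.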

Pipeline API

from thulium.pipeline import HTRPipeline

pipeline = HTRPipeline.from_config("config/pipelines/htr_default.yaml")
result = pipeline.process(image, language="en")

Language Profiles

from thulium.data.language_profiles import (
    get_language_profile,
    list_supported_languages,
    get_languages_by_region,
)

# Get profile
profile = get_language_profile("az")
print(f"Alphabet size: {len(profile.alphabet)}")

# List by region
scandinavian = get_languages_by_region("Scandinavia")
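
To make the idea of a language profile concrete, the sketch below models one as an immutable record with a region lookup. It is purely illustrative and is not Thulium's data model. The `LanguageProfile` fields, `profile_for`, and `codes_in_region` are hypothetical, and the alphabets contain only the basic letters of each script, built from their Unicode ranges.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LanguageProfile:
    code: str
    name: str
    script: str
    direction: str  # "LTR" or "RTL"
    region: str
    alphabet: str


# Two example entries, using values from the regional tables above.
_PROFILES: Dict[str, LanguageProfile] = {
    "ka": LanguageProfile(
        code="ka", name="Georgian", script="Mkhedruli", direction="LTR",
        region="Caucasus",
        # The 33 modern Mkhedruli letters, U+10D0 to U+10F0.
        alphabet="".join(chr(c) for c in range(0x10D0, 0x10F1)),
    ),
    "he": LanguageProfile(
        code="he", name="Hebrew", script="Hebrew", direction="RTL",
        region="Middle East",
        # Hebrew letters including final forms, U+05D0 to U+05EA.
        alphabet="".join(chr(c) for c in range(0x05D0, 0x05EB)),
    ),
}


def profile_for(code: str) -> LanguageProfile:
    """Look up a profile by language code."""
    try:
        return _PROFILES[code]
    except KeyError:
        raise ValueError(f"unsupported language code: {code!r}") from None


def codes_in_region(region: str) -> List[str]:
    """Return the sorted language codes registered for a region."""
    return sorted(p.code for p in _PROFILES.values() if p.region == region)


print(len(profile_for("ka").alphabet))  # 33
print(codes_in_region("Middle East"))   # ['he']
```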

For complete API documentation, see API Reference.


Contributing

Contributions are welcome. Please refer to CONTRIBUTING.md for guidelines.

All contributors must adhere to the Code of Conduct.


License

Apache License 2.0. See LICENSE for details.


Citation

If you use Thulium in your research, please cite:

@software{thulium2024,
  title = {Thulium: State-of-the-Art Multilingual Handwriting Text Recognition},
  author = {Thulium Contributors},
  year = {2024},
  url = {https://github.com/olaflaitinen/Thulium}
}

Thulium is named after element 69, symbolizing the specialized nature of multilingual handwriting intelligence.
