En2Th Transliterator

A Python package for transliterating English text to Thai using a ByT5 model.

Features

  • Byte-level processing: More robust against spelling variations
  • Beam search & sampling: Decoding parameters let you trade off output quality and diversity
  • Batch processing: Efficient for large-scale transliteration
  • Mixed precision (FP16): Faster inference on compatible GPUs
  • Command-line interface: Easy to use from the terminal
  • Hugging Face integration: Automatically downloads and caches the model

Installation

You can install the package via pip:

pip install en2th-transliterator

Usage

As a Python Package

Basic Usage

from en2th_transliterator import En2ThTransliterator

# Initialize with the default model
model = En2ThTransliterator()

# Transliterate a single text
thai_text = model.transliterate("hello")
print(f"Thai: {thai_text}")

Advanced Usage

from en2th_transliterator import En2ThTransliterator

# Initialize with custom parameters
model = En2ThTransliterator(
    model_path=None,  # Use default HF model
    max_length=50,
    num_beams=5,
    length_penalty=1.5,
    verbose=True,
    fp16=True  # Enable mixed precision
)

# Transliterate using sampling
thai_text = model.transliterate(
    "artificial intelligence",
    temperature=0.8,
    top_k=40,
    top_p=0.95
)
print(f"Thai: {thai_text}")

# Batch transliteration
english_texts = ["computer", "keyboard", "mouse", "monitor"]
thai_texts = model.batch_transliterate(
    english_texts,
    batch_size=2,
    temperature=0.5
)

for eng, thai in zip(english_texts, thai_texts):
    print(f"{eng} -> {thai}")
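
To give a feel for what temperature, top_k and top_p do in the sampling calls above, here is a minimal, simplified sketch on a toy list of logits. It is not the package's code: the function name sampling_pool is hypothetical, and the assumption is that the package passes these parameters through to a standard sampling routine that behaves roughly like this.

```python
import math


def sampling_pool(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} for the tokens left after filtering.

    Hypothetical helper for illustration only; real decoders work on tensors.
    """
    # Temperature rescales the logits: below 1 sharpens, above 1 flattens.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens (0 means no limit).
    if top_k > 0:
        order = order[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(sampling_pool(toy_logits, temperature=0.8, top_k=3, top_p=0.95))
```

Lower temperature and smaller top_k or top_p shrink the pool toward the single most likely token, which makes the output more deterministic; larger values admit more variety.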

Command Line Interface

Basic Usage

en2th-transliterate --text "hello"

Transliterate from a File

en2th-transliterate --file input.txt --output results.txt

Output in JSON Format

en2th-transliterate --file input.txt --format json --output results.json

Output in TSV Format

en2th-transliterate --file input.txt --format tsv --output results.tsv
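
The exact layout of the CLI's JSON and TSV output is not shown here. If you need a specific layout, one option is to build it yourself from the Python API. The sketch below uses only the standard library; the helper name dump_pairs and the JSON keys "english" and "thai" are assumptions for illustration, not the CLI's schema.

```python
import csv
import io
import json


def dump_pairs(pairs, fmt="tsv"):
    """Serialize (english, thai) pairs to a TSV or JSON string.

    Hypothetical helper; the field names are illustrative only.
    """
    if fmt == "json":
        records = [{"english": en, "thai": th} for en, th in pairs]
        # ensure_ascii=False keeps Thai characters readable instead of \u escapes.
        return json.dumps(records, ensure_ascii=False, indent=2)
    if fmt == "tsv":
        buffer = io.StringIO()
        writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
        writer.writerows(pairs)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt}")
```

With the package installed, the pairs could come from zip(english_texts, model.batch_transliterate(english_texts)); write the returned string to a file opened with encoding="utf-8".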

Using Custom Parameters

en2th-transliterate --text "hello" --fp16 --temperature 0.7 --num-beams 5

Model

The package uses a ByT5 model fine-tuned on English-to-Thai transliteration data. ByT5 operates on raw UTF-8 bytes rather than a subword vocabulary, which makes it more tolerant of spelling variations in the input.
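
As an illustration of what byte-level input looks like, the sketch below mimics how a ByT5-style tokenizer turns text into IDs: each UTF-8 byte is shifted by 3 (leaving IDs 0, 1 and 2 for the pad, end-of-sequence and unknown tokens) and an end-of-sequence ID is appended. The function name is hypothetical, and the real tokenizer from the transformers library should be used in practice.

```python
def byt5_style_ids(text):
    """Map text to ByT5-style input IDs without any vocabulary file.

    Illustration only: assumes the usual ByT5 layout of three reserved
    special-token IDs followed by the 256 byte values.
    """
    offset = 3  # IDs 0, 1, 2 are reserved for pad, eos, unk
    eos_id = 1
    return [byte + offset for byte in text.encode("utf-8")] + [eos_id]


print(byt5_style_ids("hi"))  # two ASCII bytes plus the end-of-sequence ID
print(byt5_style_ids("ก"))   # one Thai character spans three UTF-8 bytes
```

Because every possible byte has an ID, a misspelled or unusual word never falls outside the vocabulary; it simply yields a slightly different byte sequence.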

This package uses the yacht/byt5-base-en2th-transliterator model from Hugging Face Hub.

Performance Optimization

FP16 Mixed Precision

The package supports FP16 mixed precision for faster inference on compatible GPUs. This is enabled by default but can be disabled if needed:

model = En2ThTransliterator(fp16=False)

Or from the command line:

en2th-transliterate --text "hello" --no-fp16
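
If the same script has to run on machines with and without a GPU, you may prefer to decide the flag at runtime. The helper below is a sketch, not part of the package: it assumes PyTorch's CUDA check is a reasonable proxy for "compatible GPU" and returns False when PyTorch is not installed. The package may already handle CPU-only machines on its own.

```python
def should_use_fp16():
    """Return True only when PyTorch is installed and reports a CUDA device.

    Hypothetical helper for illustration; not provided by en2th-transliterator.
    """
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


# Intended use, with the package installed:
#   model = En2ThTransliterator(fp16=should_use_fp16())
print(should_use_fp16())
```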

Batch Processing

When transliterating many texts, batch_transliterate is more efficient than calling transliterate once per text:

texts = ["hello", "world", "computer", "science"]
results = model.batch_transliterate(texts, batch_size=4)
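
To make the batch_size parameter concrete, the sketch below shows the usual way a list is split into fixed-size batches. It only illustrates the idea; the helper name iter_batches is hypothetical and the package's internal batching may differ.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = ["hello", "world", "computer", "science", "keyboard"]
for batch in iter_batches(texts, batch_size=2):
    print(batch)
```

A larger batch_size means fewer forward passes but more memory per pass, so the best value depends on the available GPU or CPU memory.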

Development

Setting Up Development Environment

# Clone the repository
git clone https://github.com/tchayintr/en2th-transliterator.git
cd en2th-transliterator

# Install in development mode
pip install -e .

Running Tests

# Run the test script
python test_package.py

Building the Package

# Install build tools
pip install build twine

# Build the package
python -m build

# Upload to PyPI
python -m twine upload dist/*

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this package in your research, please cite:

@software{en2th_transliterator,
  author = {Thodsaporn Chay-intr},
  title = {En2Th Transliterator: English to Thai Transliteration using ByT5},
  year = {2025},
  url = {https://github.com/tchayintr/en2th-transliterator}
}

Acknowledgements

  • This package uses the ByT5 architecture developed by Google Research
  • The model was fine-tuned on English-Thai transliteration data
