Accurate Thai and Lao/Isan romanization for real-world text -- song lyrics, colloquial speech, and dialects.

thairom

Installation

pip install thairom

Quick Start

from thairom import romanize

# Thai
print(romanize('สวัสดีครับ'))       # sawatdee krap
print(romanize('ขอบคุณมาก'))       # khop khun mak
print(romanize('หัวใจ'))           # hua jai
print(romanize('หก'))              # hok

# Lao/Isan
print(romanize('ฮักเจ้าหลาย', lang='lo'))  # hak jao laai
print(romanize('ม่วนคัก', lang='lo'))       # muan khak

Features

  • Thai romanization using pythainlp's royin engine with word-level corrections
  • Lao/Isan dialect support for Thai-script Isan text with proper pronunciation rules (r-to-l substitution, etc.)
  • Word correction maps that fix common pythainlp errors on colloquial vocabulary, song lyrics, and everyday phrases
  • Handles real-world text -- tested against song lyrics, spoken Thai, and Isan dialect ground truth data
  • Clean output -- strips leaked Thai/Lao characters and normalizes whitespace
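The clean-output step in the last bullet can be pictured with a small regex pass. This is only a sketch under stated assumptions: clean_output is a hypothetical name, not part of thairom's documented API, and replacing leaked characters with a space (rather than deleting them) is a choice made here so that neighbouring words do not fuse.

```python
import re

# Thai occupies U+0E00-U+0E7F and Lao occupies U+0E80-U+0EFF,
# so one range covers both scripts.
_THAI_LAO = re.compile(r"[\u0E00-\u0EFF]+")
_WHITESPACE = re.compile(r"\s+")


def clean_output(romanized: str) -> str:
    """Hypothetical helper: drop leftover Thai/Lao characters,
    collapse runs of whitespace, and lowercase the result."""
    without_script = _THAI_LAO.sub(" ", romanized)
    return _WHITESPACE.sub(" ", without_script).strip().lower()


print(clean_output("hua  jai ๆ  mak"))  # hua jai mak
```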

Why thairom instead of pythainlp alone?

pythainlp's royin romanization engine is solid for formal Thai, but it struggles with colloquial speech, song lyrics, and regional dialects. thairom builds on pythainlp and fixes these gaps:

| Thai Text | pythainlp (royin) | thairom | Correct |
|---|---|---|---|
| หัวใจ | hua chai | hua jai | hua jai |
| น้ำตา | nam ta | nam ta | nam ta |
| เข้าใจ | khao chai | khao jai | khao jai |
| หก | hok | hok | hok |
| ก็ | ko | kaw | kaw |
| เวลา | wela | welaa | welaa |
| ตลอดเวลา | talot wela | talod welaa | talod welaa |
| ขอบคุณ | khop khun | khop khun | khop khun |
| ฮักเจ้าหลาย | (no Isan support) | hak jao laai | hak jao laai |

thairom also handles Isan/Lao dialect written in Thai script, which pythainlp does not support at all.

API Reference

romanize(text, lang='th')

Top-level convenience function. Dispatches to romanize_thai or romanize_lao based on lang.

Parameters:

  • text (str): Text to romanize.
  • lang (str): 'th' for Thai (default), 'lo' for Lao/Isan.

Returns: Lowercase romanized string.
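The dispatch described above might look roughly like the following. It is a minimal sketch, not thairom's source: the two stub functions stand in for romanize_thai and romanize_lao, and raising ValueError on an unknown lang is an assumption, since the documentation does not say how unsupported codes are handled.

```python
from typing import Callable, Dict


def _thai_stub(text: str) -> str:
    # Placeholder standing in for romanize_thai.
    return f"th:{len(text)}"


def _lao_stub(text: str) -> str:
    # Placeholder standing in for romanize_lao.
    return f"lo:{len(text)}"


_HANDLERS: Dict[str, Callable[[str], str]] = {
    "th": _thai_stub,
    "lo": _lao_stub,
}


def romanize_sketch(text: str, lang: str = "th") -> str:
    """Pick a handler by language code; 'th' is the default."""
    try:
        handler = _HANDLERS[lang]
    except KeyError:
        # Assumption: unknown codes are rejected rather than silently
        # falling back to Thai.
        raise ValueError(f"unsupported lang: {lang!r}") from None
    return handler(text)
```

A table of handlers keeps the default path and the Lao/Isan path symmetric, and adding another language code would be a one-line change.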

romanize_thai(text)

Romanize Thai text using pythainlp with word-level corrections from THAI_WORD_MAP.

Parameters:

  • text (str): Thai text to romanize.

Returns: Lowercase romanized string.
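One way to picture word-level correction is a dictionary lookup that takes priority over the underlying engine. The sketch below assumes the text has already been split into words and uses a dictionary-backed stand-in for the engine, so it runs without pythainlp. The function name romanize_tokens and both toy dictionaries are illustrative; their entries are taken from the comparison table above.

```python
from typing import Callable, Iterable, Mapping


def romanize_tokens(
    tokens: Iterable[str],
    corrections: Mapping[str, str],
    fallback: Callable[[str], str],
) -> str:
    """Use the correction map when a word is listed, else the fallback engine."""
    parts = []
    for token in tokens:
        if not token.strip():
            continue  # skip empty or whitespace-only tokens
        if token in corrections:
            parts.append(corrections[token])
        else:
            parts.append(fallback(token))
    return " ".join(parts)


# Toy stand-ins: a tiny correction map and a fake engine.
toy_corrections = {"หัวใจ": "hua jai", "เข้าใจ": "khao jai", "ก็": "kaw"}
toy_engine_output = {"หัวใจ": "hua chai", "น้ำตา": "nam ta"}

print(romanize_tokens(["หัวใจ", "น้ำตา"], toy_corrections, toy_engine_output.__getitem__))
# hua jai nam ta
```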

romanize_lao(text)

Romanize Isan/Lao text written in Thai script. Applies Lao pronunciation rules (e.g., initial r becomes l) and word corrections from LAO_WORD_MAP.

Parameters:

  • text (str): Isan/Lao text in Thai script.

Returns: Lowercase romanized string.
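The initial r-to-l rule can be illustrated on already-romanized text. This is a simplified sketch, not the library's implementation: r_to_l is a hypothetical name, and the rule is applied only at the start of each space-separated word, which is an assumption about where "initial" applies.

```python
import re

# \b matches at the start of a word, so only a word-initial "r" is replaced.
_INITIAL_R = re.compile(r"\br")


def r_to_l(romanized: str) -> str:
    """Replace a word-initial 'r' with 'l'; other positions are untouched."""
    return _INITIAL_R.sub("l", romanized)


print(r_to_l("rot fai"))  # lot fai
print(r_to_l("krap"))     # krap (the r is not word-initial)
```

Because the pattern only sees word starts, a syllable-initial r inside an unspaced multi-syllable word would not be changed by this sketch.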

Word Maps

The correction maps are available as importable dictionaries for inspection or extension:

from thairom.maps import THAI_WORD_MAP, LAO_WORD_MAP
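For experimenting with extra entries without touching the shipped dictionaries, a layered lookup is one option. The sketch below uses a small stand-in dictionary instead of the real THAI_WORD_MAP so that it runs on its own. The override values are hypothetical user preferences, and whether romanize would pick up changes made to the imported maps is not stated here, so that is left open.

```python
from collections import ChainMap

# Stand-in for the shipped map; entries taken from the comparison table.
shipped_map = {"หัวใจ": "hua jai", "ก็": "kaw"}

# Hypothetical user preferences layered on top.
user_overrides = {"ก็": "gor", "เวลา": "welaa"}

# ChainMap searches user_overrides first, then shipped_map,
# and never modifies either dictionary on lookup.
combined = ChainMap(user_overrides, shipped_map)

print(combined["ก็"])       # gor (the override wins)
print(combined["หัวใจ"])    # hua jai (falls through to the shipped map)
print(shipped_map["ก็"])    # kaw (the original is unchanged)
```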

Contributing

Contributions are welcome, especially additions to the word correction maps. The maps were developed using an autoresearch pipeline that scores romanization output against ground truth data. If you find a word that romanizes incorrectly:

  1. Add the word and its correct romanization to THAI_WORD_MAP or LAO_WORD_MAP in src/thairom/maps.py.
  2. Add a test case to tests/test_romanize.py.
  3. Run pytest to verify.
  4. Submit a pull request.
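Before adding an entry, it can help to see which words a romanizer currently gets wrong. The following sketch scores any callable against ground-truth pairs. Exact string match is an assumed metric, since the scoring used by the project's own pipeline is not described here, and the names score and fake_engine are illustrative.

```python
from typing import Callable, List, Sequence, Tuple


def score(
    romanizer: Callable[[str], str],
    ground_truth: Sequence[Tuple[str, str]],
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Return (accuracy, misses); each miss is (source, expected, got)."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    misses = []
    for source, expected in ground_truth:
        got = romanizer(source)
        if got != expected:
            misses.append((source, expected, got))
    accuracy = 1 - len(misses) / len(ground_truth)
    return accuracy, misses


# Fake engine reproducing two royin outputs from the comparison table.
fake_engine = {"หัวใจ": "hua chai", "หก": "hok"}.__getitem__
truth = [("หัวใจ", "hua jai"), ("หก", "hok")]

accuracy, misses = score(fake_engine, truth)
print(accuracy)  # 0.5
print(misses)    # [('หัวใจ', 'hua jai', 'hua chai')]
```

Each miss is a candidate for a new map entry plus a matching test case.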

License

MIT
