thairom

Accurate Thai and Lao/Isan romanization for real-world text -- song lyrics, colloquial speech, and dialects.

Installation

pip install thairom

Quick Start

from thairom import romanize

# Thai
print(romanize('สวัสดีครับ'))       # sawatdee krap
print(romanize('ขอบคุณมาก'))       # khop khun mak
print(romanize('หัวใจ'))           # hua jai
print(romanize('หก'))              # hok

# Lao/Isan
print(romanize('ฮักเจ้าหลาย', lang='lo'))  # hak jao laai
print(romanize('ม่วนคัก', lang='lo'))       # muan khak

Features

  • Thai romanization using pythainlp's royin engine with word-level corrections
  • Lao/Isan dialect support for Thai-script Isan text with proper pronunciation rules (r-to-l substitution, etc.)
  • Word correction maps that fix common pythainlp errors on colloquial vocabulary, song lyrics, and everyday phrases
  • Handles real-world text -- tested against song lyrics, spoken Thai, and Isan dialect ground truth data
  • Clean output -- strips leaked Thai/Lao characters and normalizes whitespace
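
The output cleanup in the last bullet can be pictured with a small standalone sketch. It is not thairom's actual implementation: the name `clean_output` and the choice of Unicode ranges (the Thai block U+0E00 to U+0E7F and the Lao block U+0E80 to U+0EFF) are assumptions made for illustration.

```python
import re

# Thai (U+0E00-U+0E7F) and Lao (U+0E80-U+0EFF) Unicode blocks.
_THAI_LAO = re.compile(r"[\u0E00-\u0EFF]+")


def clean_output(romanized: str) -> str:
    """Hypothetical helper: drop leftover Thai/Lao characters,
    collapse runs of whitespace, and lowercase the result."""
    without_script = _THAI_LAO.sub(" ", romanized)
    return " ".join(without_script.split()).lower()


print(clean_output("Hua  ใจ jai\n"))  # hua jai
```

Replacing leaked characters with a space rather than an empty string keeps neighbouring Latin words from being glued together; whether the library makes the same choice is not stated in this README.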

Why thairom instead of pythainlp alone?

pythainlp's royin romanization engine is solid for formal Thai, but it struggles with colloquial speech, song lyrics, and regional dialects. thairom builds on pythainlp and fixes these gaps:

| Thai Text | pythainlp (royin) | thairom | Correct |
|---|---|---|---|
| หัวใจ | hua chai | hua jai | hua jai |
| น้ำตา | nam ta | nam ta | nam ta |
| เข้าใจ | khao chai | khao jai | khao jai |
| หก | hok | hok | hok |
| ก็ | ko | kaw | kaw |
| เวลา | wela | welaa | welaa |
| ตลอดเวลา | talot wela | talod welaa | talod welaa |
| ขอบคุณ | khop khun | khop khun | khop khun |
| ฮักเจ้าหลาย | (no Isan support) | hak jao laai | hak jao laai |

thairom also handles Isan/Lao dialect written in Thai script, which pythainlp does not support at all.

API Reference

romanize(text, lang='th')

Top-level convenience function. Dispatches to romanize_thai or romanize_lao based on lang.

Parameters:

  • text (str): Text to romanize.
  • lang (str): 'th' for Thai (default), 'lo' for Lao/Isan.

Returns: Lowercase romanized string.

romanize_thai(text)

Romanize Thai text using pythainlp with word-level corrections from THAI_WORD_MAP.

Parameters:

  • text (str): Thai text to romanize.

Returns: Lowercase romanized string.
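
To illustrate what "word-level corrections" means here, the sketch below overlays a correction map on a base romanizer. It assumes the text is already split into words, and it uses a toy `base` function and a one-entry `corrections` dict. These names and the lookup-then-fallback order are assumptions, not thairom's internals; the sample values come from the comparison table above.

```python
from typing import Callable, Iterable, Mapping


def romanize_words(
    words: Iterable[str],
    corrections: Mapping[str, str],
    base: Callable[[str], str],
) -> str:
    """Use the correction for a known word, otherwise fall back to the base engine."""
    return " ".join(
        corrections[w] if w in corrections else base(w) for w in words
    )


# Toy stand-in for a base engine (hypothetical); it gets one word wrong.
_base_output = {"หัวใจ": "hua chai", "หก": "hok"}


def base(word: str) -> str:
    return _base_output.get(word, "")


# The correction map overrides only the word the base engine gets wrong.
corrections = {"หัวใจ": "hua jai"}

print(romanize_words(["หัวใจ", "หก"], corrections, base))  # hua jai hok
```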

romanize_lao(text)

Romanize Isan/Lao text written in Thai script. Applies Lao pronunciation rules (e.g., initial r becomes l) and word corrections from LAO_WORD_MAP.

Parameters:

  • text (str): Isan/Lao text in Thai script.

Returns: Lowercase romanized string.
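
The initial r-to-l rule mentioned above can be sketched as a post-processing step on romanized text. This is only one place such a rule could be applied: the README does not say whether thairom works on the Thai-script input or on the Latin output, and `initial_r_to_l` is a hypothetical name.

```python
import re


def initial_r_to_l(romanized: str) -> str:
    """Replace a word-initial 'r' with 'l'; an 'r' elsewhere in a word is untouched."""
    return re.sub(r"\br", "l", romanized)


print(initial_r_to_l("rot"))   # lot
print(initial_r_to_l("krap"))  # krap
```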

Word Maps

The correction maps are available as importable dictionaries for inspection or extension:

from thairom.maps import THAI_WORD_MAP, LAO_WORD_MAP
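
One way to experiment with extra entries without editing the package is to layer a local dict over the shipped map. The sketch below assumes the maps are plain dicts keyed by Thai-script word with the romanization as the value, which is consistent with the contributing steps but not confirmed here. The `try`/`except` fallback exists only so the snippet runs where thairom is not installed, and the sample entry is taken from the comparison table.

```python
try:
    from thairom.maps import THAI_WORD_MAP
except ImportError:  # thairom not installed: use an empty stand-in
    THAI_WORD_MAP = {}

# Local additions (illustrative entry).
my_overrides = {"เวลา": "welaa"}

# Later dicts win, so local entries take precedence over shipped ones.
merged = {**THAI_WORD_MAP, **my_overrides}

print(len(THAI_WORD_MAP), "shipped entries")
print(merged["เวลา"])  # welaa
```

The merged dict is a local copy, so romanize() will not see it. To change library output, the entry has to be added to the package's own map, as described under Contributing.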

Contributing

Contributions are welcome, especially additions to the word correction maps. The maps were developed using an autoresearch pipeline that scores romanization output against ground truth data. If you find a word that romanizes incorrectly:

  1. Add the word and its correct romanization to THAI_WORD_MAP or LAO_WORD_MAP in src/thairom/maps.py.
  2. Add a test case to tests/test_romanize.py.
  3. Run pytest to verify.
  4. Submit a pull request.
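
A test case for step 2 might look like the sketch below. The layout of tests/test_romanize.py is not shown in this README, so the table-driven style and the names `CASES`, `failing_cases`, and `stub` are assumptions; the expected strings come from the Quick Start examples. The romanizer is passed in as an argument so the snippet runs on its own.

```python
from typing import Callable, List, Tuple

# (input text, lang, expected romanization), taken from the Quick Start.
CASES: List[Tuple[str, str, str]] = [
    ("หัวใจ", "th", "hua jai"),
    ("หก", "th", "hok"),
    ("ม่วนคัก", "lo", "muan khak"),
]


def failing_cases(romanize: Callable[..., str]) -> List[Tuple[str, str, str]]:
    """Return (text, expected, actual) for every case the romanizer gets wrong."""
    failures = []
    for text, lang, expected in CASES:
        actual = romanize(text, lang=lang)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


# Stand-in romanizer for demonstration only; it gets หัวใจ wrong on purpose.
def stub(text: str, lang: str = "th") -> str:
    return {"หัวใจ": "hua chai", "หก": "hok", "ม่วนคัก": "muan khak"}[text]


print(failing_cases(stub))  # [('หัวใจ', 'hua jai', 'hua chai')]
```

With thairom installed, a pytest test could reduce to `assert failing_cases(romanize) == []` after `from thairom import romanize`.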

License

MIT License. See LICENSE.
