Accurate Thai and Lao/Isan romanization for real-world text -- song lyrics, colloquial speech, and dialects.

thairom

Installation

pip install thairom

Quick Start

from thairom import romanize

# Thai
print(romanize('สวัสดีครับ'))       # sawatdee krap
print(romanize('ขอบคุณมาก'))       # khop khun mak
print(romanize('หัวใจ'))           # hua jai
print(romanize('หก'))              # hok

# Lao/Isan
print(romanize('ฮักเจ้าหลาย', lang='lo'))  # hak jao laai
print(romanize('ม่วนคัก', lang='lo'))       # muan khak

Features

  • Thai romanization using pythainlp's royin engine with word-level corrections
  • Lao/Isan dialect support for Thai-script Isan text with proper pronunciation rules (r-to-l substitution, etc.)
  • Word correction maps that fix common pythainlp errors on colloquial vocabulary, song lyrics, and everyday phrases
  • Handles real-world text -- tested against song lyrics, spoken Thai, and Isan dialect ground truth data
  • Clean output -- strips leaked Thai/Lao characters and normalizes whitespace
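The clean-output step in the last bullet can be pictured with a small sketch. This is not thairom's actual code: `clean_output` is a hypothetical helper, and it assumes that "leaked" characters are any code points in the Thai (U+0E00 to U+0E7F) or Lao (U+0E80 to U+0EFF) Unicode blocks that survive romanization.

```python
import re

# Thai block is U+0E00-U+0E7F, Lao block is U+0E80-U+0EFF.
_THAI_LAO = re.compile(r"[\u0E00-\u0EFF]+")
_WHITESPACE = re.compile(r"\s+")


def clean_output(romanized: str) -> str:
    """Hypothetical helper: drop leftover Thai/Lao characters, tidy spaces."""
    # Replace with a space (not "") so neighbouring words do not merge.
    without_script = _THAI_LAO.sub(" ", romanized)
    return _WHITESPACE.sub(" ", without_script).strip().lower()


print(clean_output("hua  jai ๆ  \n khao jai"))  # hua jai khao jai
```

Replacing leaked characters with a space and collapsing whitespace afterwards keeps the two passes independent, at the cost of splitting a word if a stray character lands in the middle of it.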

Why thairom instead of pythainlp alone?

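pythainlp's royin romanization engine is solid for formal Thai, but it struggles with colloquial speech, song lyrics, and regional dialects. thairom builds on pythainlp and corrects its output where the two differ; in the comparison below, rows where both columns agree are words that royin already gets right.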

| Thai Text | pythainlp (royin) | thairom | Correct |
|---|---|---|---|
| หัวใจ | hua chai | hua jai | hua jai |
| น้ำตา | nam ta | nam ta | nam ta |
| เข้าใจ | khao chai | khao jai | khao jai |
| หก | hok | hok | hok |
| ก็ | ko | kaw | kaw |
| เวลา | wela | welaa | welaa |
| ตลอดเวลา | talot wela | talod welaa | talod welaa |
| ขอบคุณ | khop khun | khop khun | khop khun |
| ฮักเจ้าหลาย | (no Isan support) | hak jao laai | hak jao laai |

thairom also handles Isan/Lao dialect written in Thai script, which pythainlp does not support at all.

API Reference

romanize(text, lang='th')

Top-level convenience function. Dispatches to romanize_thai or romanize_lao based on lang.

Parameters:

  • text (str): Text to romanize.
  • lang (str): 'th' for Thai (default), 'lo' for Lao/Isan.

Returns: Lowercase romanized string.

romanize_thai(text)

Romanize Thai text using pythainlp with word-level corrections from THAI_WORD_MAP.

Parameters:

  • text (str): Thai text to romanize.

Returns: Lowercase romanized string.
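As a rough illustration of what "word-level corrections" means, the sketch below looks each word up in a correction map first and only falls back to a base engine when there is no entry. Everything here is a stand-in: `base_romanize`, `CORRECTIONS`, and `romanize_words` are hypothetical names, the base outputs are copied from the pythainlp (royin) column of the comparison table, and the input is assumed to be already segmented into words. How thairom tokenizes text internally is not documented in this README.

```python
from typing import Callable, Iterable, Mapping

# Stand-in for the base engine. Outputs are copied from the
# "pythainlp (royin)" column of the comparison table.
_BASE_OUTPUT = {"หัวใจ": "hua chai", "น้ำตา": "nam ta", "เวลา": "wela"}


def base_romanize(word: str) -> str:
    """Hypothetical base romanizer; returns the word unchanged if unknown."""
    return _BASE_OUTPUT.get(word, word)


# Stand-in correction map (not the real THAI_WORD_MAP).
CORRECTIONS = {"หัวใจ": "hua jai", "เวลา": "welaa"}


def romanize_words(
    words: Iterable[str],
    corrections: Mapping[str, str],
    fallback: Callable[[str], str],
) -> str:
    """Prefer a map entry for each word; otherwise ask the base engine."""
    parts = [corrections[w] if w in corrections else fallback(w) for w in words]
    return " ".join(parts).lower()


print(romanize_words(["หัวใจ", "น้ำตา", "เวลา"], CORRECTIONS, base_romanize))
# hua jai nam ta welaa
```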

romanize_lao(text)

Romanize Isan/Lao text written in Thai script. Applies Lao pronunciation rules (e.g., initial r becomes l) and word corrections from LAO_WORD_MAP.

Parameters:

  • text (str): Isan/Lao text in Thai script.

Returns: Lowercase romanized string.
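The "initial r becomes l" rule can be sketched as a regular-expression pass over already-romanized text. This is an assumption about where the rule is applied: thairom may instead work on the Thai script before romanizing, and `r_to_l` is a hypothetical name, not part of the package.

```python
import re

# Match "r" only at the start of a romanized word.
_INITIAL_R = re.compile(r"\br")


def r_to_l(romanized: str) -> str:
    """Hypothetical helper: turn word-initial r into l, leave other r's alone."""
    return _INITIAL_R.sub("l", romanized.lower())


print(r_to_l("rak ther rian"))  # lak ther lian
```

The word boundary keeps a final or medial r, as in "ther", untouched, which matches the rule as stated for initial consonants only.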

Word Maps

The correction maps are available as importable dictionaries for inspection or extension:

from thairom.maps import THAI_WORD_MAP, LAO_WORD_MAP
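One non-destructive way to extend a map is to layer your own entries over it with `collections.ChainMap` rather than mutating the imported dictionary. The sketch assumes the maps are plain dicts keyed by Thai word with romanized values; `BUILTIN_MAP` is a stand-in so the example runs without thairom installed, and the `gor` spelling is an invented house style used only for illustration. Whether `romanize` picks up changes made to the maps at runtime is not stated in this README.

```python
from collections import ChainMap

# Stand-in for thairom.maps.THAI_WORD_MAP so the sketch runs on its own.
BUILTIN_MAP = {"หัวใจ": "hua jai", "ก็": "kaw"}

# Hypothetical project-specific entries; earlier maps win on lookup.
my_overrides = {"ก็": "gor", "เวลา": "welaa"}

combined = ChainMap(my_overrides, BUILTIN_MAP)

print(combined["ก็"])      # gor (override wins)
print(combined["หัวใจ"])   # hua jai (falls through to the built-in map)
print(BUILTIN_MAP["ก็"])   # kaw (original map is untouched)
```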

Contributing

Contributions are welcome, especially additions to the word correction maps. The maps were developed using an autoresearch pipeline that scores romanization output against ground truth data. If you find a word that romanizes incorrectly:

  1. Add the word and its correct romanization to THAI_WORD_MAP or LAO_WORD_MAP in src/thairom/maps.py.
  2. Add a test case to tests/test_romanize.py.
  3. Run pytest to verify.
  4. Submit a pull request.
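A test case for step 2 might look like the sketch below. The layout of the real tests/test_romanize.py is not shown here, so the class name and the `CASES` table are hypothetical; the expected strings are taken from the examples in this README. It uses `unittest` so it stays stdlib-only, and pytest collects `unittest.TestCase` classes as well.

```python
import unittest

try:
    from thairom import romanize
except ImportError:  # lets the file import cleanly where thairom is absent
    romanize = None

# (input, lang, expected) triples taken from the README examples.
CASES = [
    ("หัวใจ", "th", "hua jai"),
    ("เข้าใจ", "th", "khao jai"),
    ("ม่วนคัก", "lo", "muan khak"),
]


@unittest.skipIf(romanize is None, "thairom is not installed")
class RomanizeRegressionTest(unittest.TestCase):
    def test_known_words(self):
        for text, lang, expected in CASES:
            # subTest reports each failing word separately.
            with self.subTest(text=text, lang=lang):
                self.assertEqual(romanize(text, lang=lang), expected)
```

Adding a newly corrected word then means appending one triple to `CASES` alongside the map entry.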

License

MIT License. See LICENSE.
