thairom
Accurate Thai and Lao/Isan romanization for real-world text -- song lyrics, colloquial speech, and dialects.
Installation
pip install thairom
Quick Start
from thairom import romanize
# Thai
print(romanize('สวัสดีครับ')) # sawatdee krap
print(romanize('ขอบคุณมาก')) # khop khun mak
print(romanize('หัวใจ')) # hua jai
print(romanize('หก')) # hok
# Lao/Isan
print(romanize('ฮักเจ้าหลาย', lang='lo')) # hak jao laai
print(romanize('ม่วนคัก', lang='lo')) # muan khak
Features
- Thai romanization using pythainlp's royin engine with word-level corrections
- Lao/Isan dialect support for Thai-script Isan text with proper pronunciation rules (r-to-l substitution, etc.)
- Word correction maps that fix common pythainlp errors on colloquial vocabulary, song lyrics, and everyday phrases
- Handles real-world text -- tested against song lyrics, spoken Thai, and Isan dialect ground truth data
- Clean output -- strips leaked Thai/Lao characters and normalizes whitespace
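The clean-output step can be pictured with a short sketch. This is not thairom's actual code: `clean_output` is a hypothetical helper, and it assumes that "leaked" characters are any code points from the Unicode Thai (U+0E00 to U+0E7F) and Lao (U+0E80 to U+0EFF) blocks that remain in the romanized string.

```python
import re

# The Thai block (U+0E00-U+0E7F) and the Lao block (U+0E80-U+0EFF) are
# adjacent, so a single character range covers both.
_THAI_LAO = re.compile(r"[\u0E00-\u0EFF]+")


def clean_output(romanized: str) -> str:
    """Hypothetical helper: drop leftover Thai/Lao characters, tidy spaces."""
    # Replace each run of script characters with a space so that the
    # neighbouring Latin tokens do not get glued together.
    without_script = _THAI_LAO.sub(" ", romanized)
    # split() with no arguments collapses any whitespace run and trims the ends.
    return " ".join(without_script.split())
```

Replacing with a space rather than an empty string is a deliberate choice in this sketch: it keeps word boundaries intact when a leaked character sits between two romanized words.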
Why thairom instead of pythainlp alone?
pythainlp's royin romanization engine is solid for formal Thai, but it struggles with colloquial speech, song lyrics, and regional dialects. thairom builds on pythainlp and fixes these gaps:
| Thai Text | pythainlp (royin) | thairom | Correct |
|---|---|---|---|
| หัวใจ | hua chai | hua jai | hua jai |
| น้ำตา | nam ta | nam ta | nam ta |
| เข้าใจ | khao chai | khao jai | khao jai |
| หก | hok | hok | hok |
| ก็ | ko | kaw | kaw |
| เวลา | wela | welaa | welaa |
| ตลอดเวลา | talot wela | talod welaa | talod welaa |
| ขอบคุณ | khop khun | khop khun | khop khun |
| ฮักเจ้าหลาย | (no Isan support) | hak jao laai | hak jao laai |
thairom also handles Isan/Lao dialect written in Thai script, which pythainlp does not support at all.
API Reference
romanize(text, lang='th')
Top-level convenience function. Dispatches to romanize_thai or romanize_lao based on lang.
Parameters:
- text (str): Text to romanize.
- lang (str): 'th' for Thai (default), 'lo' for Lao/Isan.
Returns: Lowercase romanized string.
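The dispatch described above could look roughly like the sketch below. The names `_thai_stub`, `_lao_stub`, and `romanize_sketch` are hypothetical stand-ins, not thairom's functions, and the `ValueError` for an unknown code is an assumption: this README does not say how other `lang` values are handled.

```python
from typing import Callable, Dict


def _thai_stub(text: str) -> str:
    # Stand-in for the Thai romanizer; it only tags the input.
    return f"th:{text}"


def _lao_stub(text: str) -> str:
    # Stand-in for the Lao/Isan romanizer; it only tags the input.
    return f"lo:{text}"


# Map each supported language code to its handler.
_HANDLERS: Dict[str, Callable[[str], str]] = {"th": _thai_stub, "lo": _lao_stub}


def romanize_sketch(text: str, lang: str = "th") -> str:
    """Pick a handler by language code and apply it to the text."""
    handler = _HANDLERS.get(lang)
    if handler is None:
        # Assumption: rejecting unknown codes is this sketch's choice only.
        raise ValueError(f"unsupported lang: {lang!r}")
    return handler(text)
```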
romanize_thai(text)
Romanize Thai text using pythainlp with word-level corrections from THAI_WORD_MAP.
Parameters:
- text (str): Thai text to romanize.
Returns: Lowercase romanized string.
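As a rough illustration of word-level correction, the sketch below prefers a correction map and falls back to a general engine for everything else. It is not the library's implementation: `romanize_with_corrections`, `DEMO_MAP`, and `demo_engine` are hypothetical, the input is assumed to be already segmented into words, and the fallback is a toy lookup standing in for an engine such as pythainlp's royin.

```python
from typing import Callable, Iterable, Mapping


def romanize_with_corrections(
    words: Iterable[str],
    corrections: Mapping[str, str],
    fallback: Callable[[str], str],
) -> str:
    """Romanize pre-segmented words, preferring the correction map."""
    parts = []
    for word in words:
        if word in corrections:
            # A known word: use the hand-checked romanization.
            parts.append(corrections[word])
        else:
            # An unknown word: defer to the general-purpose engine.
            parts.append(fallback(word))
    # Skip empty results so they do not leave double spaces behind.
    return " ".join(p for p in parts if p).lower()


# Hypothetical entries, using romanizations listed in the comparison table.
DEMO_MAP = {"หัวใจ": "hua jai", "ก็": "kaw"}


def demo_engine(word: str) -> str:
    # Toy stand-in for a real romanization engine.
    return {"หก": "hok"}.get(word, "")


print(romanize_with_corrections(["หัวใจ", "หก"], DEMO_MAP, demo_engine))
```

Checking the map first means a correction always wins over the engine, which is what makes a per-word override useful for colloquial vocabulary.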
romanize_lao(text)
Romanize Isan/Lao text written in Thai script. Applies Lao pronunciation rules (e.g., initial r becomes l) and word corrections from LAO_WORD_MAP.
Parameters:
- text (str): Isan/Lao text in Thai script.
Returns: Lowercase romanized string.
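The initial r-to-l rule can be sketched on already-romanized text as follows. This is an assumption-laden illustration: `apply_initial_r_to_l` is a hypothetical name, and thairom may apply the rule at a different stage (for example on the Thai script itself) or with more nuance.

```python
import re

# An "r" at the start of a token: "rak" matches, "krap" does not, because
# there is no word boundary between "k" and "r".
_INITIAL_R = re.compile(r"\br")


def apply_initial_r_to_l(romanized: str) -> str:
    """Replace token-initial 'r' with 'l' in space- or hyphen-separated text."""
    return _INITIAL_R.sub("l", romanized)
```

Because the pattern relies on word boundaries, it only sees syllable onsets that are separated by spaces or hyphens; an "r" onset inside an unseparated multi-syllable token would be left untouched.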
Word Maps
The correction maps are available as importable dictionaries for inspection or extension:
from thairom.maps import THAI_WORD_MAP, LAO_WORD_MAP
Contributing
Contributions are welcome, especially additions to the word correction maps. The maps were developed using an autoresearch pipeline that scores romanization output against ground truth data. If you find a word that romanizes incorrectly:
- Add the word and its correct romanization to THAI_WORD_MAP or LAO_WORD_MAP in src/thairom/maps.py.
- Add a test case to tests/test_romanize.py.
- Run pytest to verify.
- Submit a pull request.
License
MIT