Thai phonology engine with pluggable transliteration systems.

thaiphon

A zero-dependency Thai phonological transliteration engine. It reaches ~75 % exact-match accuracy against independent Wiktionary IPA ground truth when installed with the optional thaiphon-data-volubilis lexicon package, and ~57 % on the base engine alone (see Accuracy for details).

What it does

Thai script goes in; transliteration comes out. The engine first resolves the input into a phonological word — a tuple of syllables, each decomposed into IPA-based onset, vowel quality, vowel length, coda, and tone. Every output scheme is then a small declarative SchemeMapping that renames those phonemes into a target surface form. Add a new scheme by writing data, not code.
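
To make the "data, not code" idea concrete, here is a minimal self-contained sketch. It does not use thaiphon's real classes: Syllable, ToyScheme, and render are hypothetical stand-ins, and the lookup tables cover only the one word shown. The point is that both outputs come from the same analysis and differ only in the tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Syllable:
    """Hypothetical stand-in for one syllable of a phonological word."""
    onset: str
    vowel: str
    long: bool
    coda: str
    tone: str


@dataclass(frozen=True)
class ToyScheme:
    """Hypothetical scheme: nothing but lookup tables and a separator."""
    onsets: dict
    vowels: dict   # keyed by (vowel quality, is_long)
    codas: dict
    tones: dict
    separator: str = "-"


def render(word, scheme):
    """Rename each phoneme through the scheme's tables and join the syllables."""
    parts = []
    for syl in word:
        parts.append(
            scheme.onsets[syl.onset]
            + scheme.vowels[(syl.vowel, syl.long)]
            + scheme.codas[syl.coda]
            + scheme.tones[syl.tone]
        )
    return scheme.separator.join(parts)


# One analysis of ข้าว (/kʰaːw˥˩/), rendered through two different tables.
word = (Syllable("kʰ", "a", True, "w", "falling"),)

ascii_scheme = ToyScheme(
    onsets={"kʰ": "kh"}, vowels={("a", True): "aa"},
    codas={"w": "o"}, tones={"falling": "{F}"},
)
ipa_scheme = ToyScheme(
    onsets={"kʰ": "kʰ"}, vowels={("a", True): "aː"},
    codas={"w": "w"}, tones={"falling": "˥˩"},
)

print(render(word, ascii_scheme))  # khaao{F}
print(render(word, ipa_scheme))    # kʰaːw˥˩
```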

Three built-in renderers ship with the package:

  • ipa — IPA with tone contour digits (the default phonological representation used for the Wiktionary benchmark).
  • tlc — thai-language.com "Enhanced Phonemic" with bracketed tone tags ({M}, {L}, {H}, {F}, {R}).
  • morev — Cyrillic transliteration following the Russian-language tradition.

Custom schemes (Paiboon+, RTGS, or your own pedagogical convention) are straightforward to register via SchemeMapping.

Install

uv add thaiphon thaiphon-data-volubilis
# or
pip install thaiphon thaiphon-data-volubilis

Requires Python 3.10+. The engine itself has no runtime dependencies. thaiphon-data-volubilis is a companion package shipping a ~35 k-entry Thai lexicon derived from the VOLUBILIS Mundo Dictionary. thaiphon detects it automatically on import and uses it for word-boundary segmentation and word-specific readings that a rule-based engine can't infer from orthography alone.

The data package is CC-BY-SA 4.0, separate from the Apache-2.0 engine — installing it doesn't change the license of your own code.

Minimal install (not recommended for most users)

If there's a hard reason to keep dependencies lean:

pip install thaiphon

The engine still runs, but you give up a lot:

  • Accuracy on the Wiktionary IPA etalon drops from ~75 % to ~57 %.
  • Common words break in visible ways. ส้ม ("orange") comes out as /sa˥˩.ma˦˥/ — two syllables — when the correct reading is /som˥˩/, one closed syllable. The rules alone can't choose between "closed syllable with inherent /o/" and "two open syllables with inserted /a/"; the lexicon settles that kind of ambiguity.
  • Sanskrit-learned compounds miss their learned readings. มหาวิทยาลัย loses the /tʰa/ insertion between /wit̚/ and /jaː/. The spoken form is /wit̚.tʰa.jaː.laj/; the base engine emits /wit̚.jaː.laj/.
  • Failures are silent. The engine always returns a transliteration — no error, no flag, no way to tell at the output layer which of the ~43 % of dictionary entries fell through. A caller can't filter out bad results after the fact.

For anything user-facing, indexing, or searching, install both packages.
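
Because the degraded mode is silent, one defensive option is to check for the lexicon at startup and fail fast. The sketch below is an assumption-heavy illustration, not part of thaiphon: the import name thaiphon_data_volubilis is guessed from the distribution name, and lexicon_available and require_lexicon are hypothetical helpers.

```python
import importlib.util


def lexicon_available(module_name="thaiphon_data_volubilis"):
    """Return True if the (assumed) lexicon module can be found on the path."""
    return importlib.util.find_spec(module_name) is not None


def require_lexicon(module_name="thaiphon_data_volubilis"):
    """Raise at startup instead of serving silently degraded output."""
    if not lexicon_available(module_name):
        raise RuntimeError(
            f"{module_name} is not installed; "
            "run: pip install thaiphon-data-volubilis"
        )
```

Calling require_lexicon() once during application start-up turns a silent accuracy drop into a visible configuration error.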

Quick start

from thaiphon import (
    transcribe, transcribe_word, transcribe_sentence, analyze, list_schemes,
)

# Discover available schemes.
list_schemes()
# ('ipa', 'morev', 'tlc')

# IPA — the phonological representation.
transcribe("ภูมิ", scheme="ipa")
# '/pʰuːm˧/'

# TLC — thai-language.com convention.
transcribe("ลิฟต์", scheme="tlc")
# 'lif{H}'

# Single-word shortcut that skips sentence tokenization.
transcribe_word("สวัสดี", scheme="ipa")
# '/sa˨˩.wat̚˨˩.diː˧/'

# Sentence segmentation + rendering.
transcribe_sentence("ฉันชอบกินข้าว", scheme="ipa")
# '/t͡ɕʰan˩˩˦/ /t͡ɕʰɔːp̚˥˩/ /kin˧/ /kʰaːw˥˩/'

# Reading profile switches colloquial vs citation pronunciation — see below.
transcribe("ลิฟต์", scheme="ipa", profile="etalon_compat")
# '/lip̚˦˥/'

# Access the intermediate phonological word.
result = analyze("ผลไม้")
for syl in result.best.syllables:
    print(syl.onset, syl.vowel.symbol, syl.vowel_length.name, syl.coda, syl.tone.name)

Reading profiles

The profile argument controls how the engine handles foreign phonotactics in loanwords, especially the coda consonants that native Thai normally neutralises:

# "Lift" / elevator — final /f/ from English.
transcribe("ลิฟต์", scheme="ipa", profile="everyday")         # '/lif˦˥/'
transcribe("ลิฟต์", scheme="ipa", profile="etalon_compat")    # '/lip̚˦˥/'

# "Graph" — register-sensitive.
transcribe("กราฟ", scheme="ipa", profile="everyday")          # '/kraːp̚˨˩/'
transcribe("กราฟ", scheme="ipa", profile="careful_educated")  # '/kraːf˨˩/'

The four supported profiles:

  • everyday (default) — colloquial urban pronunciation. Preserves foreign codas in well-integrated modern loans (ลิฟต์, เคส, อีเมล) and collapses them in older or less register-sensitive words (กราฟ, กอล์ฟ, บัส).
  • careful_educated — broadcast/formal register. Preserves more foreign codas than everyday.
  • learned_full — restores full Sanskrit/Pali learned readings for Indic-derived words (ภูมิ, ปกติ).
  • etalon_compat — dictionary-citation style; collapses every foreign coda to its native-Thai equivalent. Useful for matching pronunciation dictionaries.

Preservation is per-word, driven by the lexicon and attested usage — not by blanket rules that paper over lexical convention.
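
A toy model of that per-word behaviour might look like the following. This is a sketch under stated assumptions, not the engine's implementation: PRESERVES and surface_coda are hypothetical, the table holds only the two words from the examples above, and the careful_educated entry for ลิฟต์ is inferred from "preserves more foreign codas than everyday".

```python
# Native Thai neutralisation of some foreign codas (illustrative subset).
NATIVE_CODA = {"f": "p̚", "s": "t̚", "l": "n"}

# Hypothetical per-word table: the profiles under which the foreign coda survives.
PRESERVES = {
    "ลิฟต์": {"everyday", "careful_educated"},
    "กราฟ": {"careful_educated"},
}


def surface_coda(word, foreign_coda, profile):
    """Pick the coda to emit for one loanword under one reading profile."""
    if profile == "etalon_compat":
        # Dictionary-citation style collapses every foreign coda.
        return NATIVE_CODA[foreign_coda]
    if profile in PRESERVES.get(word, set()):
        return foreign_coda
    return NATIVE_CODA[foreign_coda]


print(surface_coda("ลิฟต์", "f", "everyday"))          # f
print(surface_coda("ลิฟต์", "f", "etalon_compat"))     # p̚
print(surface_coda("กราฟ", "f", "everyday"))           # p̚
print(surface_coda("กราฟ", "f", "careful_educated"))   # f
```

The lookup is keyed by word rather than by coda, which is why two words ending in the same foreign /f/ can come out differently under the same profile.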

Accuracy

Measured by exact-match rate against independent, publicly-licensed Thai phonology corpora:

Corpus                      License       Size    Scheme  Base engine  With thaiphon-data-volubilis
kaikki.org Thai Wiktionary  CC-BY-SA 4.0  17,014  ipa     ~57 %        ~75 %
PyThaiNLP G2P (Wiktionary)  CC0           15,782  ipa     -            ~73 %

"Base engine" is pip install thaiphon on its own. The jump to 75 % comes from the thaiphon-data-volubilis lexicon (see Install) — the word-boundary and variant coverage a rule-based engine can't infer from orthography alone.

Both numbers come from public data. The primary kaikki.org etalon and PyThaiNLP's independent G2P extraction cross-check each other: a regression that one corpus misses will usually show up in the other.
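
For readers unfamiliar with the metric, exact match means the whole output string must equal the reference; a single wrong tone or vowel length counts as a miss. A minimal sketch follows. exact_match_rate is a hypothetical helper, the toy engine is a plain dictionary lookup, and the gold pairs reuse readings quoted in this README rather than actual corpus rows. Whether the real benchmark normalises strings before comparing is not stated here, so none is applied.

```python
def exact_match_rate(gold, transcribe_fn):
    """Fraction of (thai, expected_ipa) pairs that are reproduced exactly."""
    if not gold:
        raise ValueError("gold set is empty")
    hits = sum(1 for thai, expected in gold if transcribe_fn(thai) == expected)
    return hits / len(gold)


# Toy stand-in for the engine: a lookup with one deliberately wrong reading.
toy_engine = {
    "ภูมิ": "/pʰuːm˧/",
    "สวัสดี": "/sa˨˩.wat̚˨˩.diː˧/",
    "ส้ม": "/sa˥˩.ma˦˥/",   # the base-engine misreading from the Install section
}.get

gold = [
    ("ภูมิ", "/pʰuːm˧/"),
    ("สวัสดี", "/sa˨˩.wat̚˨˩.diː˧/"),
    ("ส้ม", "/som˥˩/"),
]

rate = exact_match_rate(gold, toy_engine)
print(f"{rate:.0%}")  # 67%
```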

Reproducing the numbers yourself:

The repository ships a 2,500-entry random sample of the kaikki.org Thai Wiktionary dump (seed 20260421, CC-BY-SA 4.0) as a bundled pytest fixture. Install the data package, then run the etalon tests:

# Ensure both packages are installed (see Install section).
pip install thaiphon thaiphon-data-volubilis

# Bundled sample — runs in seconds, no external download:
pytest tests/etalon/test_wiktionary_ipa_sample.py -v
# Floor: 72 %.  Measured: ~74 % on the sample.

# Full 17,014-entry measurement — download the kaikki.org dump first:
#   https://kaikki.org/dictionary/rawdata.html  (Thai entries, ~43 MB)
#   Place at ~/.cache/thaiphon/kaikki-thai.jsonl  (or set $THAIPHON_KAIKKI)
pytest tests/etalon/test_wiktionary_ipa_full.py -v
# Floor: 73 %.  Measured: ~75 % on the full corpus.

If thaiphon-data-volubilis is not installed, both etalon tests skip with a message pointing at the install command, so make test always finishes cleanly whatever your setup. See tests/README.md for more detail and tests/fixtures/README.md for fixture licensing and sampling parameters.
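
The fixed seed is what makes the bundled sample reproducible. As a rough illustration of the idea only (how the project actually drew its sample is documented in tests/fixtures/README.md, not here), a seeded draw with the standard library could look like this. draw_sample is a hypothetical helper and the entries are placeholders, not corpus rows.

```python
import random


def draw_sample(entries, size, seed):
    """Return a reproducible random sample: same inputs, same output."""
    rng = random.Random(seed)   # private generator; global random state is untouched
    return rng.sample(entries, size)


entries = [f"entry-{i}" for i in range(17_014)]   # placeholder corpus rows
first = draw_sample(entries, 2_500, seed=20260421)
second = draw_sample(entries, 2_500, seed=20260421)

assert first == second              # the same seed gives the same sample
assert len(set(first)) == 2_500     # sampling is without replacement
```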

Custom schemes

from thaiphon.model.enums import VowelLength
from thaiphon.renderers.mapping import SchemeMapping, MappingRenderer
from thaiphon.registry import RENDERERS
from thaiphon import transcribe

pedagogical = SchemeMapping(
    scheme_id="ped",
    onset_map={
        "kʰ": "kh", "k": "k", "tɕ": "j", "m": "m", "n": "n",
        # ... remaining onsets elided
    },
    vowel_map={
        ("a", VowelLength.SHORT): "a",
        ("a", VowelLength.LONG):  "aa",
        # ... remaining vowels elided
    },
    coda_map={"m": "m", "n": "n", "ŋ": "ng", "p̚": "p", "t̚": "t", "k̚": "k"},
    tone_format=lambda base, syl: base,   # no tone decoration
    syllable_separator="-",
)
RENDERERS.register("ped", lambda: MappingRenderer(pedagogical))

transcribe("สวัสดี", scheme="ped")
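
The tone_format callback above returns the base string unchanged. A scheme that does decorate tones might pass something like the sketch below instead. It assumes the (base, syl) signature shown in the example and that syl.tone.name yields names such as MID or HIGH; those names are a guess, and the syllable here is a stand-in object so the snippet runs without thaiphon installed.

```python
from types import SimpleNamespace

# TLC-style tags as listed under the built-in renderers; the key names are assumed.
TONE_TAGS = {
    "MID": "{M}", "LOW": "{L}", "HIGH": "{H}", "FALLING": "{F}", "RISING": "{R}",
}


def tagged_tone(base, syl):
    """Append a bracketed tone tag to the already-rendered syllable text."""
    return base + TONE_TAGS[syl.tone.name]


# Stand-in for a syllable object; the real one would come from the engine.
fake_syl = SimpleNamespace(tone=SimpleNamespace(name="HIGH"))
print(tagged_tone("lif", fake_syl))  # lif{H}
```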

Architecture

The thaiphon pipeline: Thai text flows through normalisation, lexicon lookup, syllabification, and rule-based derivation into the phonological word data contract, which the renderer then fans out to IPA, TLC, or Morev surface strings.

Key design decisions:

  • Zero runtime dependencies. The full phonology, rules, and lexicon ship as plain Python with no C or FFI.
  • Immutable phonological model via frozen dataclasses.
  • Lexicons (loanword overrides, Indic learned readings, royal vocabulary, etc.) are module-level Python literals — auditable and grep-able.
  • Scheme-specific editorial conventions live in the renderer layer, not the phonology. Improvements to onset/vowel/coda derivation show up in every output scheme at once.
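
What immutability via frozen dataclasses buys in practice can be shown with the standard library alone. ToySyllable below is a hypothetical three-field stand-in, not the engine's model: an attempt to edit it in place is rejected, and a derivation step produces a new value instead.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ToySyllable:
    onset: str
    vowel: str
    coda: str


original = ToySyllable(onset="l", vowel="i", coda="f")

try:
    original.coda = "p̚"        # in-place mutation is rejected
except FrozenInstanceError:
    mutated = False
else:
    mutated = True

# A derivation step returns a new value and leaves the input untouched.
collapsed = replace(original, coda="p̚")

print(mutated)          # False
print(original.coda)    # f
print(collapsed.coda)   # p̚
```

Frozen instances are also hashable by default, so they can serve as dictionary keys or set members in caches.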

Development

git clone https://github.com/5w0rdf15h/thaiphon
cd thaiphon
make dev          # install with dev dependencies via uv
make test         # pytest -q
make lint         # ruff
make typecheck    # mypy src/thaiphon
make check        # test + lint + typecheck
make format       # black + isort

The test suite is organised in two layers:

  • tests/*.py — public API surface and representative regression examples. Fast, deterministic, and doubles as living API documentation.
  • tests/rules/*.py — internal phonological-rule coverage (tone matrix, coda collapse, onset resolution, cluster strategies, lexicon membership, etc.). Exercises internal modules directly; the right place to add a regression when fixing a derivation bug.

All example words in tests are hand-curated from openly-licensed sources only — Wiktionary (CC-BY-SA 4.0) and VOLUBILIS Mundo Dictionary (CC-BY-SA 4.0). See tests/fixtures/README.md for the full provenance policy.

Credits

Includes code adapted from the PyThaiNLP project (Apache-2.0). See NOTICE for the full attribution. Optional lexicon data in thaiphon-data-volubilis derives from the VOLUBILIS Mundo Dictionary (CC-BY-SA 4.0).

Acknowledgements

Personal thanks to the teachers and principal at RTL School — Rak Thai Language School for teaching and explaining Thai to me. This engine's phonological intuitions come directly from their patient, careful instruction.

License

Apache-2.0. See LICENSE.
