
Exact-match Thai lexicon derived from the VOLUBILIS Mundo Multilingual Thai Dictionary, for use with thaiphon.


thaiphon-data-volubilis

An exact-match Thai pronunciation lexicon for the thaiphon phonological engine, derived from the VOLUBILIS Mundo Multilingual Thai Dictionary & Database by Belisan. When installed alongside thaiphon, the pipeline short-circuits on words it finds in this data and returns the exact reading instead of running the rule-based derivation.

Installing it lifts thaiphon's Wiktionary-IPA exact-match rate from ~57% on the base engine to ~75%. The gain comes from word-boundary and variant-pronunciation coverage that a rule-based engine can't infer from orthography alone.
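The short-circuit described above can be sketched as a lookup-then-fallback function. This is a minimal illustration, not thaiphon's actual pipeline: read_word, toy_lexicon, and toy_derive are hypothetical names, and the string values stand in for real phonological words.

```python
from typing import Callable, Mapping, TypeVar

T = TypeVar("T")


def read_word(word: str, lexicon: Mapping[str, T], derive: Callable[[str], T]) -> T:
    """Return the lexicon's reading when the word is listed, else derive it by rule."""
    hit = lexicon.get(word)
    if hit is not None:
        # Exact match: skip the rule-based derivation entirely.
        return hit
    return derive(word)


# Hypothetical stand-ins for the installed lexicon and the rule-based engine.
toy_lexicon = {"น้ำ": "reading from lexicon"}


def toy_derive(word: str) -> str:
    return f"derived by rule: {word}"


print(read_word("น้ำ", toy_lexicon, toy_derive))
print(read_word("กา", toy_lexicon, toy_derive))
```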

Install

pip install thaiphon-data-volubilis
# or
uv add thaiphon-data-volubilis

Python 3.10+, no runtime dependencies. thaiphon picks the lexicon up automatically on import; no configuration needed.

What's in the wheel

A single lexicon.db file: a read-only SQLite database of 84 k Thai words, each mapped to its pre-derived phonological word (syllable-segmented, with onsets, vowels, codas, and tones resolved). Keys are Thai strings; the primary-key index over thai_word gives O(log n) lookup in tens of microseconds.
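A primary-key lookup of this kind can be sketched with the standard sqlite3 module. Only the thai_word column name comes from the description above; the entries table name, the payload column, and the sample rows are assumptions made for illustration, and the real schema may differ.

```python
import sqlite3

# In-memory stand-in for lexicon.db with an assumed two-column schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries ("
    " thai_word TEXT PRIMARY KEY,"
    " payload BLOB NOT NULL"
    ") WITHOUT ROWID"
)
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [("กา", b"payload-1"), ("น้ำ", b"payload-2")],
)


def lookup(conn: sqlite3.Connection, word: str):
    """Fetch the stored payload for one exact Thai string, or None if absent."""
    row = conn.execute(
        "SELECT payload FROM entries WHERE thai_word = ?", (word,)
    ).fetchone()
    return None if row is None else row[0]


# The query plan shows whether SQLite searches the key's B-tree or scans the table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT payload FROM entries WHERE thai_word = ?", ("กา",)
).fetchall()
for row in plan:
    print(row[-1])
```

A B-tree search on the key is what gives the logarithmic lookup cost; a full scan would be linear in the number of words.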

The package exposes one Mapping[str, PhonologicalWord] named ENTRIES with the usual dict-like API (__getitem__, __contains__, get, keys, items, values, __len__, __iter__). Callers don't see the storage backend, so a future release can change it without breaking anyone.
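One way to hide a SQLite backend behind a read-only mapping is to subclass collections.abc.Mapping, which derives __contains__, get, keys, items, and values from three methods. The sketch below is not the package's implementation: SqliteLexicon, the entries table, and the payload column are hypothetical, and raw bytes stand in for PhonologicalWord values.

```python
import sqlite3
from collections.abc import Iterator, Mapping


class SqliteLexicon(Mapping[str, bytes]):
    """Read-only dict-like view over an assumed entries(thai_word, payload) table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def __getitem__(self, word: str) -> bytes:
        row = self._conn.execute(
            "SELECT payload FROM entries WHERE thai_word = ?", (word,)
        ).fetchone()
        if row is None:
            raise KeyError(word)
        return row[0]

    def __iter__(self) -> Iterator[str]:
        for (word,) in self._conn.execute("SELECT thai_word FROM entries"):
            yield word

    def __len__(self) -> int:
        return self._conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0]


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory database for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries (thai_word TEXT PRIMARY KEY, payload BLOB)")
    conn.executemany(
        "INSERT INTO entries VALUES (?, ?)", [("กา", b"a"), ("น้ำ", b"b")]
    )
    return conn


lexicon = SqliteLexicon(demo_connection())
print("กา" in lexicon, lexicon.get("missing"), len(lexicon))
```

Because callers only rely on the Mapping interface, the table layout or even the storage engine could change without touching their code.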

Memory footprint

The lexicon is memory-mapped via SQLite's mmap_size pragma and opened with immutable=1. Every process that imports the package shares the same underlying pages through the operating-system page cache. The numbers below were measured on the author's machine and compare against the previous in-memory representation:

| Measurement | Before | After |
| --- | --- | --- |
| Per-process RSS after first lookup | ~340 MiB | ~7 MiB |
| Per-process RSS after 1 k lookups | ~340 MiB | ~35 MiB |
| Wheel size | 1.6 MiB | 2.7 MiB |
| .pyc on Python 3.10 (import cost) | 117 MiB | 0 |
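The open mode named at the top of this section (immutable=1 plus the mmap_size pragma) can be reproduced with the standard sqlite3 module. This is a sketch under assumptions: open_readonly is a hypothetical helper, and the 64 MiB mmap limit is an arbitrary illustrative value, not the one the package uses.

```python
import sqlite3
from pathlib import Path


def open_readonly(path: Path, mmap_bytes: int = 64 * 1024 * 1024) -> sqlite3.Connection:
    """Open a SQLite file as immutable and ask SQLite to memory-map it."""
    # immutable=1 tells SQLite the file cannot change, so it skips locking
    # and change detection; the connection is read-only.
    uri = f"{path.resolve().as_uri()}?immutable=1"
    conn = sqlite3.connect(uri, uri=True)
    # Upper bound on how many bytes of the file SQLite may map. The effective
    # value can be lower, or zero, depending on how SQLite was compiled.
    conn.execute(f"PRAGMA mmap_size = {int(mmap_bytes)}")
    return conn
```

Mapped pages are backed by the file itself, which is why the kernel can share them between processes instead of each process holding a private copy.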

The ~35 MiB after warmup is the per-process LRU cache holding the last 10 k inflated entries. The roughly 50 MiB of physical memory that holds the mmap'd database pages is counted once by the kernel and shared by every Python worker that has imported the package.
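A bounded per-process cache of inflated entries can be sketched with functools.lru_cache. The helper below is illustrative only: make_cached_loader, fetch_raw, and decode are hypothetical names, and the package may implement its LRU differently.

```python
from functools import lru_cache
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def make_cached_loader(
    fetch_raw: Callable[[str], Optional[bytes]],
    decode: Callable[[bytes], T],
    maxsize: int = 10_000,
) -> Callable[[str], T]:
    """Wrap fetch-and-decode in an LRU cache that keeps at most `maxsize` entries."""

    @lru_cache(maxsize=maxsize)
    def load(word: str) -> T:
        raw = fetch_raw(word)
        if raw is None:
            # lru_cache does not store raised exceptions, so misses are retried.
            raise KeyError(word)
        return decode(raw)

    return load


# Toy usage: a dict stands in for the database, bytes.decode for inflation.
store = {"กา": b"one", "น้ำ": b"two"}
load = make_cached_loader(store.get, bytes.decode, maxsize=2)
print(load("กา"), load("กา"), load.cache_info().hits)
```

Capping the cache size is what keeps the per-process overhead flat no matter how many distinct words a long-running worker sees.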

What this means in practice

For multi-worker web servers (gunicorn, uvicorn with workers, uWSGI) total lexicon RSS is bounded by the size of lexicon.db plus a small per-worker LRU. It does not multiply with worker count. Dozens of workers on a single host stay practical.

Python's sqlite3 module refuses by default to use a connection from any thread other than the one that created it, so each Python thread that touches the lexicon lazily opens its own connection, stored in threading.local(). FastAPI / Starlette's threadpool, Django's async views, and any other thread model work without extra plumbing.
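The lazy per-thread connection pattern can be sketched as follows. PerThreadConnections is a hypothetical class written for illustration, not the package's internal code; it relies only on threading.local and sqlite3.connect from the standard library.

```python
import sqlite3
import threading


class PerThreadConnections:
    """Hand each thread its own SQLite connection, opened on first use."""

    def __init__(self, uri: str) -> None:
        self._uri = uri
        self._local = threading.local()

    def get(self) -> sqlite3.Connection:
        conn = getattr(self._local, "conn", None)
        if conn is None:
            # First call on this thread: open a connection and remember it
            # in thread-local storage, invisible to every other thread.
            conn = sqlite3.connect(self._uri, uri=True)
            self._local.conn = conn
        return conn


# Toy usage with an in-memory database URI.
pool = PerThreadConnections("file::memory:")
print(pool.get() is pool.get())
```

Repeated calls on one thread reuse the same connection, while a different thread calling get() receives a separate one, so no locking is needed around lookups.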

Serverless cold starts (AWS Lambda, Cloud Run, Fargate, and Cloudflare Workers when Python support allows) pay neither a .pyc unpack nor a lexicon-inflation cost. Pages of the database file are read lazily, as entries are requested.

Pre-forking servers like sync gunicorn and uWSGI inherit no SQLite handle across fork(), provided the parent process has not performed a lookup before forking: connections are opened lazily and live in threading.local, so each child opens its own on first use.

Attribution

Source data: VOLUBILIS Mundo Multilingual Thai Dictionary & Database by Belisan — https://belisan-volubilis.blogspot.com/.

License

The data is distributed under CC BY-SA 4.0, matching the upstream source license. Derivative works of this data must also be licensed under CC BY-SA 4.0 or a compatible license. See LICENSE and NOTICE for details.
