Exact-match Thai lexicon derived from the VOLUBILIS Mundo Multilingual Thai Dictionary, for use with thaiphon.
thaiphon-data-volubilis
An exact-match Thai pronunciation lexicon for the
thaiphon phonological engine,
derived from the VOLUBILIS Mundo Multilingual Thai Dictionary &
Database by Belisan. When installed alongside thaiphon, the
pipeline short-circuits on words it finds in this data and returns
the exact reading instead of running the rule-based derivation.
Installing it lifts thaiphon's Wiktionary-IPA exact-match rate from
~57 % on the base engine to ~75 %. The gain comes from word-boundary
and variant coverage that a rule-based engine can't infer from
orthography alone.
Install
pip install thaiphon-data-volubilis
# or
uv add thaiphon-data-volubilis
Python 3.10+, no runtime dependencies. thaiphon picks the lexicon up
automatically on import; no configuration needed.
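A quick way to confirm the data package is visible to the interpreter is to ask the import system for it. This is a minimal sketch; the importable module name `thaiphon_data_volubilis` is an assumption inferred from the wheel file name, and `lexicon_installed` is a hypothetical helper, not part of either package.

```python
import importlib.util


def lexicon_installed(module_name: str = "thaiphon_data_volubilis") -> bool:
    """Return True if a top-level module with this name can be imported.

    The default name is an assumption based on the wheel file name.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("lexicon installed:", lexicon_installed())
```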
What's in the wheel
A single lexicon.db file: a read-only SQLite database of 84 k Thai
words, each mapped to its pre-derived phonological word
(syllable-segmented, with onsets, vowels, codas, and tones resolved).
Keys are Thai strings; the primary-key index over thai_word gives
O(log n) lookup in tens of microseconds.
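To illustrate why a primary key on the word column makes lookups cheap, here is a small sketch against an in-memory database. The table name `lexicon` and the `payload` column are assumptions made for the example; only the `thai_word` key column is named above, and the payload bytes are placeholders rather than real entries.

```python
import sqlite3

# Hypothetical schema: only the thai_word key column comes from the README.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lexicon (thai_word TEXT PRIMARY KEY, payload BLOB NOT NULL)"
)
conn.executemany(
    "INSERT INTO lexicon VALUES (?, ?)",
    [("น้ำ", b"payload-1"), ("ข้าว", b"payload-2")],  # placeholder payloads
)


def lookup(conn: sqlite3.Connection, word: str):
    """Exact-match lookup; returns the stored payload or None."""
    row = conn.execute(
        "SELECT payload FROM lexicon WHERE thai_word = ?", (word,)
    ).fetchone()
    return None if row is None else row[0]


# Ask SQLite how it would run the query. A SEARCH step means a B-tree
# descent through the key index rather than a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT payload FROM lexicon WHERE thai_word = ?",
    ("น้ำ",),
).fetchall()
plan_details = [row[-1] for row in plan]
```

A B-tree descent touches a number of pages proportional to the tree depth, which is where the O(log n) bound comes from.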
The package exposes one Mapping[str, PhonologicalWord] named
ENTRIES with the usual dict-like API (__getitem__, __contains__,
get, keys, items, values, __len__, __iter__). Callers
don't see the storage backend, so a future release can change it
without breaking anyone.
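One way a read-only mapping can hide its storage is to subclass `collections.abc.Mapping` and implement only `__getitem__`, `__iter__`, and `__len__`; the remaining dict-like methods come from the base class. The sketch below is not the package's implementation: `SqliteLexicon`, the `lexicon` table, and the `payload` column are hypothetical, and values are placeholder bytes rather than `PhonologicalWord` objects.

```python
import sqlite3
from collections.abc import Iterator, Mapping


class SqliteLexicon(Mapping):
    """Read-only mapping view over a (hypothetical) lexicon table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def __getitem__(self, word: str) -> bytes:
        row = self._conn.execute(
            "SELECT payload FROM lexicon WHERE thai_word = ?", (word,)
        ).fetchone()
        if row is None:
            raise KeyError(word)
        return row[0]

    def __iter__(self) -> Iterator:
        for (word,) in self._conn.execute("SELECT thai_word FROM lexicon"):
            yield word

    def __len__(self) -> int:
        return self._conn.execute("SELECT COUNT(*) FROM lexicon").fetchone()[0]


# Tiny in-memory database so the example runs on its own.
_conn = sqlite3.connect(":memory:")
_conn.execute(
    "CREATE TABLE lexicon (thai_word TEXT PRIMARY KEY, payload BLOB NOT NULL)"
)
_conn.executemany(
    "INSERT INTO lexicon VALUES (?, ?)",
    [("น้ำ", b"payload-1"), ("ข้าว", b"payload-2")],
)
demo = SqliteLexicon(_conn)
```

Because `__contains__`, `get`, `keys`, `items`, and `values` are derived from the three methods above, swapping the backend only means rewriting those three.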
Memory footprint
The lexicon is memory-mapped via SQLite's mmap_size pragma and
opened with immutable=1. Every process that imports the package
shares the same underlying pages through the operating-system page
cache. Numbers from this machine, against the previous in-memory
representation:
| Measurement | Before | After |
|---|---|---|
| Per-process RSS after first lookup | ~340 MiB | ~7 MiB |
| Per-process RSS after 1 k lookups | ~340 MiB | ~35 MiB |
| Wheel size | 1.6 MiB | 2.7 MiB |
| .pyc on Python 3.10 (import cost) | 117 MiB | 0 |
The ~35 MiB after warmup is the per-process LRU of the last 10 k inflated entries. The roughly 50 MiB of physical memory holding the mmap'd database pages is counted once by the kernel and shared by every Python worker that has imported the package.
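The combination described here (an immutable, memory-mapped SQLite file plus a bounded per-process cache) can be sketched with the standard library alone. This is an illustration under assumptions, not the package's code: the helper names, the `lexicon` table, the `payload` column, and the 64 MiB mmap limit are all hypothetical, and the demo database is built in a temporary directory.

```python
import functools
import sqlite3
import tempfile
from pathlib import Path


def build_demo_db(path: Path) -> None:
    """Create a tiny stand-in for lexicon.db (hypothetical schema)."""
    conn = sqlite3.connect(str(path))
    conn.execute(
        "CREATE TABLE lexicon (thai_word TEXT PRIMARY KEY, payload BLOB NOT NULL)"
    )
    conn.execute("INSERT INTO lexicon VALUES (?, ?)", ("น้ำ", b"demo"))
    conn.commit()
    conn.close()


def open_readonly(path: Path, mmap_bytes: int = 64 * 1024 * 1024) -> sqlite3.Connection:
    """Open the file as immutable and ask SQLite to memory-map it."""
    uri = f"{path.resolve().as_uri()}?immutable=1"
    conn = sqlite3.connect(uri, uri=True)
    # SQLite may cap or ignore this value depending on how it was compiled.
    conn.execute(f"PRAGMA mmap_size = {int(mmap_bytes)}")
    return conn


def make_cached_lookup(conn: sqlite3.Connection, maxsize: int = 10_000):
    """Wrap the SQL lookup in a bounded LRU cache of recent results."""

    @functools.lru_cache(maxsize=maxsize)
    def lookup(word: str):
        row = conn.execute(
            "SELECT payload FROM lexicon WHERE thai_word = ?", (word,)
        ).fetchone()
        return None if row is None else row[0]

    return lookup


def demo():
    """Return (payload, cache hits, whether a write was rejected)."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "lexicon.db"
        build_demo_db(db_path)
        conn = open_readonly(db_path)
        try:
            lookup = make_cached_lookup(conn)
            first = lookup("น้ำ")
            lookup("น้ำ")  # second call is served from the LRU
            try:
                conn.execute("INSERT INTO lexicon VALUES ('x', x'00')")
                write_rejected = False
            except sqlite3.Error:
                write_rejected = True
            return first, lookup.cache_info().hits, write_rejected
        finally:
            conn.close()
```

The memory-mapped pages live in the operating-system page cache and can be shared between processes, while the LRU holds ordinary Python objects and is private to each process, which matches the split between the shared and per-worker figures above.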
What this means in practice
For multi-worker web servers (gunicorn, uvicorn with workers, uWSGI)
total lexicon RSS is bounded by the size of lexicon.db plus a small
per-worker LRU. It does not multiply with worker count. Dozens of
workers on a single host stay practical.
Python's sqlite3 module by default refuses to use a connection outside
the thread that created it, so each Python thread that touches the
lexicon lazily opens its own connection via threading.local(). FastAPI /
Starlette's threadpool, Django's async views, and any other thread model
work without extra plumbing.
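The per-thread connection pattern can be sketched as follows. This is a generic illustration, not the package's source: `get_connection`, `run_demo`, and the demo table are hypothetical names, and the database is a throwaway file in a temporary directory.

```python
import sqlite3
import tempfile
import threading
from pathlib import Path

_local = threading.local()


def get_connection(db_uri: str) -> sqlite3.Connection:
    """Return this thread's connection, opening it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(db_uri, uri=True)
        _local.conn = conn
    return conn


def _worker(db_uri: str, results: list) -> None:
    conn = get_connection(db_uri)
    reused = get_connection(db_uri) is conn  # same thread, same object
    row = conn.execute(
        "SELECT payload FROM lexicon WHERE thai_word = ?", ("น้ำ",)
    ).fetchone()
    results.append((conn, reused, row[0]))
    conn.close()  # close in the owning thread


def run_demo(n_threads: int = 3):
    """Return (distinct connections, per-thread reuse flags, payloads)."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "lexicon.db"
        setup = sqlite3.connect(str(db_path))
        setup.execute(
            "CREATE TABLE lexicon (thai_word TEXT PRIMARY KEY, payload BLOB NOT NULL)"
        )
        setup.execute("INSERT INTO lexicon VALUES (?, ?)", ("น้ำ", b"demo"))
        setup.commit()
        setup.close()

        db_uri = f"{db_path.resolve().as_uri()}?mode=ro"
        results: list = []
        threads = [
            threading.Thread(target=_worker, args=(db_uri, results))
            for _ in range(n_threads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        # Connection objects are still referenced here, so ids are unique.
        distinct = len({id(conn) for conn, _, _ in results})
        return distinct, [r for _, r, _ in results], [p for _, _, p in results]
```

Each thread pays the connection cost once, on its first lookup, and no locking is needed around the connection itself because it is never shared.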
Serverless cold starts (AWS Lambda, Cloud Run, Fargate, and Cloudflare
Workers when Python support allows) pay neither a .pyc unpack cost nor a
lexicon-inflation cost. Pages are read from the file only as entries are
requested.
Pre-forking servers like sync gunicorn and uWSGI inherit no SQLite
handle across fork(), since the connection lives in threading.local.
Each child opens its own on first use.
Attribution
Source data: VOLUBILIS Mundo Multilingual Thai Dictionary & Database by Belisan — https://belisan-volubilis.blogspot.com/.
License
The data is distributed under
CC BY-SA 4.0,
matching the upstream source license. Derivative works of this data
must also be licensed under CC BY-SA 4.0 or a compatible license. See
LICENSE and NOTICE for details.