Deterministic Khmer language tools: word counter, segmentation primitives, and more.
khmerthings
Deterministic Khmer language tools for Python — built as community building blocks: small, correct, dependency-free primitives you can compose into bigger systems.
No machine-learning models, no third-party NLP dependencies, no network calls. Every result is reproducible and explainable. Khmer script writes no spaces between words, so even "simple" operations like counting or sorting need real language handling — khmerthings implements that from first principles.
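To see why whitespace splitting is not enough, here is a minimal sketch in plain Python. It does not import khmerthings, and `naive_word_count` is a hypothetical helper written only for this illustration. It counts whitespace-separated chunks, which undercounts Khmer text because a whole phrase is written as one unbroken run.

```python
def naive_word_count(text: str) -> int:
    """Count whitespace-separated chunks; NOT a real Khmer word count."""
    return len(text.split())


# One unbroken run of Khmer script is seen as a single "word".
print(naive_word_count("ខ្ញុំស្រឡាញ់ភាសាខ្មែរ"))  # 1

# Mixed text: only the spaces the writer happened to type are seen.
print(naive_word_count("ខ្ញុំមានឆ្កែ ២ ក្បាល and 3 cats"))  # 6
```

For the second sentence, count_words reports 8 (see "A taste"), because the first chunk alone holds three words that no space separates.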
Tools
Each tool is available both as a Python API and a CLI subcommand, and has its own detailed document:
| Tool | CLI | Python | Docs |
|---|---|---|---|
| Word breaker: split Khmer text into words | khmerthings segment | break_words, mark_boundaries | docs/word-breaker.md |
| Word counter: count words in Khmer/mixed text | khmerthings count | count_words, analyze | docs/word-counter.md |
| Line sorter: Khmer dictionary-order sorting | khmerthings sort | sort_lines, khmer_sort_key | docs/line-sorter.md |
Install
pip install khmerthings # library
uv tool install khmerthings # global CLI
A taste
$ echo "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ" | khmerthings segment
ខ្ញុំ ស្រឡាញ់ ភាសា ខ្មែរ
from khmerthings import break_words, count_words, sort_lines
break_words("ខ្ញុំស្រឡាញ់ភាសាខ្មែរ") # ['ខ្ញុំ', 'ស្រឡាញ់', 'ភាសា', 'ខ្មែរ']
count_words("ខ្ញុំមានឆ្កែ ២ ក្បាល and 3 cats") # 8
sort_lines(["ក្រ", "កា", "កក"]) # ['កក', 'កា', 'ក្រ']
Design principles
- Deterministic: same input, same output, always. Rule- and dictionary-based algorithms only; nothing probabilistic.
- Self-contained: zero runtime dependencies. All word data is our own hand-curated set of growable wordlists: words (core vocabulary), names (people's names & titles), and modern (slang, loanwords, trending terms). 1,583 entries and growing, each verified entry by entry; no wordlist is imported wholesale.
- Lossless: no character is ever dropped; unknown Khmer spans are reported, not discarded.
- Tested first: every module ships with table-driven unit tests and invariant checks (258 tests as of v0.3.0).
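The determinism and losslessness principles lend themselves to table-driven tests with invariant checks. The sketch below shows the general shape on a toy function. `split_runs`, `is_khmer`, and `CASES` are hypothetical names for this illustration and are not taken from the khmerthings test suite, and the Khmer test is a deliberately simple range check on the main Khmer Unicode block.

```python
from itertools import groupby


def is_khmer(ch: str) -> bool:
    """True for code points in the main Khmer Unicode block (U+1780..U+17FF)."""
    return "\u1780" <= ch <= "\u17ff"


def split_runs(text: str) -> list[tuple[str, str]]:
    """Split text into maximal Khmer / non-Khmer runs, keeping every character."""
    return [
        ("khmer" if khmer else "other", "".join(chars))
        for khmer, chars in groupby(text, key=is_khmer)
    ]


# Table-driven cases: (input text, expected run kinds in order).
CASES = [
    ("", []),
    ("abc", ["other"]),
    ("\u1780\u17b6", ["khmer"]),
    ("a\u1780b", ["other", "khmer", "other"]),
]


def run_checks() -> int:
    """Run every case plus two invariants; return the number of cases checked."""
    for text, kinds in CASES:
        runs = split_runs(text)
        assert [kind for kind, _ in runs] == kinds
        # Lossless invariant: joining the spans reproduces the input exactly.
        assert "".join(span for _, span in runs) == text
        # Determinism invariant: a second call gives an identical result.
        assert split_runs(text) == runs
    return len(CASES)


print(run_checks(), "cases passed")
```

Adding a regression case is then a one-line change to the table, and the invariants apply to it automatically.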
Under the hood, the tools share deterministic primitives (character
classification, character-cluster segmentation, a cluster-keyed lexicon
trie, lossless tokenization) in src/khmerthings/ — see the module
docstrings if you want to build on them directly.
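As a rough illustration of how such primitives can fit together, the sketch below groups characters into clusters, stores a toy lexicon in a trie keyed by cluster, and segments by greedy longest match. This is an assumption-laden sketch and not the khmerthings implementation: the cluster rule is simplified, the names `clusters`, `build_trie`, and `segment` are hypothetical, and the two-word lexicon exists only for the example.

```python
COENG = "\u17d2"  # Khmer sign coeng: subscripts the consonant that follows it
END = ""          # trie marker key; a real cluster is never the empty string


def is_dependent(ch: str) -> bool:
    """Rough test for dependent vowels and signs that attach to a base character."""
    return "\u17b4" <= ch <= "\u17d3" or ch == "\u17dd"


def clusters(text: str) -> list[str]:
    """Group characters into simplified clusters (base plus attached marks)."""
    out: list[str] = []
    for ch in text:
        attach = bool(out) and (is_dependent(ch) or out[-1].endswith(COENG))
        if attach:
            out[-1] += ch
        else:
            out.append(ch)
    return out


def build_trie(words: list[str]) -> dict:
    """Build a nested-dict trie whose edges are clusters, not single characters."""
    root: dict = {}
    for word in words:
        node = root
        for cluster in clusters(word):
            node = node.setdefault(cluster, {})
        node[END] = True
    return root


def segment(text: str, trie: dict) -> list[str]:
    """Greedy longest-match segmentation; unknown clusters pass through as-is."""
    cl = clusters(text)
    out: list[str] = []
    i = 0
    while i < len(cl):
        node, best = trie, 0
        for j in range(i, len(cl)):
            node = node.get(cl[j])
            if node is None:
                break
            if END in node:
                best = j + 1 - i  # longest lexicon word seen so far
        step = best or 1          # no match: emit a single cluster
        out.append("".join(cl[i:i + step]))
        i += step
    return out


lexicon = build_trie(["ភាសា", "ខ្មែរ"])
print(segment("ភាសាខ្មែរ", lexicon))
```

Keying the trie on clusters rather than code points means a match can never end in the middle of a consonant stack, and because unmatched clusters are emitted unchanged, joining the output always reproduces the input.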
Roadmap
- ✅ Word counter, line sorter, word breaker
- ⏳ Wordlist growth across all three sources (words, names, modern): hand-curated batches each release; the accuracy lever for every dictionary-based tool
- 🔜 Spellchecker & spellfixer (engine is feasible today; waiting on lexicon coverage to make its verdicts trustworthy)
- Later: part-of-speech tagger, intent detection, paragraph categorization
Contributing
See DEVELOPMENT_GUIDE.md for setup, the architecture, the rules (determinism, self-owned data, tests first), and how to add words to the lexicon — the single most valuable contribution. Changes are tracked in CHANGELOG.md.
License