Deterministic Khmer language tools: word counter, segmentation primitives, and more.

khmerthings

Deterministic Khmer language tools for Python — built as community building blocks: small, correct, dependency-free primitives you can compose into bigger systems.

No machine-learning models, no third-party NLP dependencies, no network calls. Every result is reproducible and explainable. Khmer script writes no spaces between words, so even "simple" operations like counting or sorting need real language handling — khmerthings implements that from first principles.
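
To make the "no spaces between words" point concrete, here is a minimal standard-library sketch (it does not use khmerthings). The helper name `naive_word_count` is made up for this illustration; it shows why whitespace splitting, which works for English, treats a whole Khmer sentence as a single word.

```python
# Whitespace splitting is the usual "simple" word count.
def naive_word_count(text: str) -> int:
    return len(text.split())


english = "I love the Khmer language"
khmer = "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ"  # the same sentence, written without spaces

print(naive_word_count(english))  # 5
print(naive_word_count(khmer))    # 1: the whole sentence is one "word"
```

Any tool that relies on `str.split()` therefore undercounts Khmer text, which is the gap a dictionary-based word breaker is meant to fill.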

Tools

Each tool is available both as a Python API and a CLI subcommand, and has its own detailed document:

| Tool | CLI | Python | Docs |
| --- | --- | --- | --- |
| Word breaker: split Khmer text into words | khmerthings segment | break_words, mark_boundaries | docs/word-breaker.md |
| Word counter: count words in Khmer/mixed text | khmerthings count | count_words, analyze | docs/word-counter.md |
| Line sorter: Khmer dictionary-order sorting | khmerthings sort | sort_lines, khmer_sort_key | docs/line-sorter.md |
| Spellchecker: find Khmer misspellings and unknown words | khmerthings spellcheck | check_spelling | docs/spellcheck.md |
| Spellfixer: rewrite known misspellings to their canonical spelling | khmerthings spellfix | fix_spelling | docs/spellfix.md |
| Normalizer: spellfix and re-space into clean, ready-to-use text | khmerthings normalize | normalize_text | docs/normalize.md |

Install

pip install khmerthings          # library
uv tool install khmerthings     # global CLI

A taste

$ echo "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ" | khmerthings segment
ខ្ញុំ ស្រឡាញ់ ភាសា ខ្មែរ

from khmerthings import break_words, count_words, fix_spelling, sort_lines

break_words("ខ្ញុំស្រឡាញ់ភាសាខ្មែរ")   # ['ខ្ញុំ', 'ស្រឡាញ់', 'ភាសា', 'ខ្មែរ']
count_words("ខ្ញុំមានឆ្កែ ២ ក្បាល and 3 cats")   # 8
sort_lines(["ក្រ", "កា", "កក"])                    # ['កក', 'កា', 'ក្រ']
fix_spelling("ខ្ញុំសំរាប់ការងារ")                  # 'ខ្ញុំសម្រាប់ការងារ'

Design principles

  • Deterministic: same input, same output, always. Rule- and dictionary-based algorithms only; nothing probabilistic.
  • Self-contained: zero runtime dependencies. All word data is our own hand-curated set of growable wordlists: words (core vocabulary), names (people's names and titles), modern (slang, loanwords, trending terms), and variants (common misspellings mapped to their canonical spelling). Together they hold 1,895 entries and are growing. Each entry is verified individually, and no wordlist is imported wholesale.
  • Lossless: no character is ever dropped — unknown Khmer spans are reported, not discarded.
  • Tested first: every module ships with table-driven unit tests and invariant checks (332 tests and growing).
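
The deterministic and lossless principles can be expressed as table-driven invariant checks. The sketch below is only an illustration of that testing style: `toy_tokenize`, `CASES`, and `check_invariants` are hypothetical names, and the toy tokenizer (runs of Khmer code points, whitespace, or anything else) is a stand-in, not the khmerthings tokenizer or its test suite.

```python
import re

# Hypothetical stand-in tokenizer: every character falls into exactly one of
# three run types, so nothing can be dropped.
_RUNS = re.compile(r"[\u1780-\u17FF]+|\s+|[^\u1780-\u17FF\s]+")


def toy_tokenize(text: str) -> list[str]:
    return _RUNS.findall(text)


# Table of inputs; the same invariants are checked for every row.
CASES = [
    "",
    "ខ្ញុំមានឆ្កែ ២ ក្បាល and 3 cats",
    "  leading and trailing spaces  ",
    "mixed\nlines\tខ្មែរ",
]


def check_invariants(tokenize, cases) -> int:
    for text in cases:
        tokens = tokenize(text)
        assert "".join(tokens) == text  # lossless: no character is dropped
        assert tokenize(text) == tokens  # deterministic: same input, same output
        assert all(tokens)               # no empty tokens
    return len(cases)


print(check_invariants(toy_tokenize, CASES))  # 4
```

Because the invariants take the tokenizer as an argument, the same table could in principle be pointed at any tokenizer that returns a list of strings.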

Under the hood, the tools share deterministic primitives (character classification, character-cluster segmentation, a cluster-keyed lexicon trie, lossless tokenization) in src/khmerthings/ — see the module docstrings if you want to build on them directly.
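
As a rough illustration of two of these primitives, the sketch below segments text into character clusters and then walks a trie keyed by cluster. It rests on several assumptions: the regex is a simplified cluster definition, greedy longest match is one common dictionary strategy rather than the algorithm khmerthings is confirmed to use, and the names `clusters`, `build_trie`, and `greedy_break` are hypothetical. See the module docstrings for the real behavior.

```python
import re

# Simplified Khmer cluster (an assumption, not the library's rule): a base
# character (consonant or independent vowel), then any mix of subscript
# consonants (COENG U+17D2 + base) and dependent vowels or signs.
# Any other character falls through on its own, so nothing is dropped.
_CLUSTER = re.compile(
    r"[\u1780-\u17B3](?:\u17D2[\u1780-\u17B3]|[\u17B4-\u17D1\u17D3\u17DD])*|.",
    re.DOTALL,
)


def clusters(text: str) -> list[str]:
    return _CLUSTER.findall(text)


def build_trie(words) -> dict:
    """Build a nested-dict trie whose edges are clusters, not code points."""
    root: dict = {}
    for word in words:
        node = root
        for cluster in clusters(word):
            node = node.setdefault(cluster, {})
        node[None] = True  # end-of-word marker
    return root


def greedy_break(text: str, trie: dict) -> list[str]:
    """Greedy longest match; unknown clusters are kept as their own tokens."""
    units = clusters(text)
    tokens, i = [], 0
    while i < len(units):
        node, end, j = trie, None, i
        while j < len(units) and units[j] in node:
            node = node[units[j]]
            j += 1
            if None in node:
                end = j  # remember the longest word seen so far
        if end is None:
            end = i + 1  # no dictionary word starts here
        tokens.append("".join(units[i:end]))
        i = end
    return tokens


words = ["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"]  # toy lexicon for the demo
trie = build_trie(words)
text = "".join(words)
print(greedy_break(text, trie))  # ['ខ្ញុំ', 'ស្រឡាញ់', 'ភាសា', 'ខ្មែរ']
```

Keying the trie by cluster rather than by code point means a match can never end in the middle of a consonant stack or between a base and its vowel sign.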

Roadmap

  • ✅ Word counter, line sorter, word breaker, spellchecker & spellfixer, normalizer
  • ⏳ Wordlist growth across all four sources (words, names, modern, variants), added in hand-curated batches each release. Wordlist coverage is the main accuracy lever for every dictionary-based tool, including the spellchecker's verdicts, suggestions, and fixes
  • Later: part-of-speech tagger, intent detection, paragraph categorization

Contributing

See DEVELOPMENT_GUIDE.md for setup, the architecture, the rules (determinism, self-owned data, tests first), and how to add words to the lexicon — the single most valuable contribution. Changes are tracked in CHANGELOG.md.

License

MIT

Download files

Source Distribution

khmerthings-0.8.0.tar.gz (94.5 kB)

Uploaded Source

Built Distribution

khmerthings-0.8.0-py3-none-any.whl (34.3 kB)

Uploaded Python 3

File details

Details for the file khmerthings-0.8.0.tar.gz.

File metadata

  • Download URL: khmerthings-0.8.0.tar.gz
  • Upload date:
  • Size: 94.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.26

File hashes

Hashes for khmerthings-0.8.0.tar.gz
Algorithm Hash digest
SHA256 6c739b27f4afb4df93414185f26f8b8e0185fca71af8a1b9ac0d0233a8573fc3
MD5 092afcb161031a2df031b9258f143062
BLAKE2b-256 5d9db02dc542e342bd9902756338565b79e5e0808fb74a3a8d67cd4cf064b62a

File details

Details for the file khmerthings-0.8.0-py3-none-any.whl.

File metadata

  • Download URL: khmerthings-0.8.0-py3-none-any.whl
  • Upload date:
  • Size: 34.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.26

File hashes

Hashes for khmerthings-0.8.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ed7ff2bb240a321d56a6083e880ca652b45472a31200e00c7f8aa3c845f3d085
MD5 42197108f6143dc41ea649d83d3751d5
BLAKE2b-256 c8faea7264322e1461e79b9e69a231e8cfcc1a311b549d22d970ffc1b1bea137
