Deterministic Khmer language tools: word counter, segmentation primitives, and more.

khmerthings

Deterministic Khmer language tools for Python — built as community building blocks: small, correct, dependency-free primitives you can compose into bigger systems.

No machine-learning models, no third-party NLP dependencies, no network calls. Every result is reproducible and explainable. Khmer is written without spaces between words, so even "simple" operations like counting or sorting need real language handling; khmerthings implements that from first principles.
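To make the problem concrete, here is a minimal illustration that uses only the Python standard library, not khmerthings itself. It shows that whitespace splitting, which is enough for a rough English word count, returns a Khmer sentence as one unbroken token. The variable names are illustrative only.

```python
# Standard-library only; this does not call khmerthings.
english = "I love the Khmer language"
khmer = "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ"  # a Khmer sentence written, as usual, without spaces

# Whitespace splitting finds the English words...
english_tokens = english.split()

# ...but the Khmer sentence has no spaces, so it comes back whole.
khmer_tokens = khmer.split()

print(len(english_tokens))  # 5
print(len(khmer_tokens))    # 1
```

Any word count or word-level sort therefore has to find word boundaries first, which is the job of a segmenter.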

Tools

Each tool is available both as a Python API and a CLI subcommand, and has its own detailed document:

| Tool | CLI | Python | Docs |
|---|---|---|---|
| Word breaker: split Khmer text into words | khmerthings segment | break_words, mark_boundaries | docs/word-breaker.md |
| Word counter: count words in Khmer/mixed text | khmerthings count | count_words, analyze | docs/word-counter.md |
| Line sorter: Khmer dictionary-order sorting | khmerthings sort | sort_lines, khmer_sort_key | docs/line-sorter.md |

Install

pip install khmerthings          # library
uv tool install khmerthings     # global CLI

A taste

$ echo "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ" | khmerthings segment
ខ្ញុំ ស្រឡាញ់ ភាសា ខ្មែរ

from khmerthings import break_words, count_words, sort_lines

break_words("ខ្ញុំស្រឡាញ់ភាសាខ្មែរ")   # ['ខ្ញុំ', 'ស្រឡាញ់', 'ភាសា', 'ខ្មែរ']
count_words("ខ្ញុំមានឆ្កែ ២ ក្បាល and 3 cats")   # 8
sort_lines(["ក្រ", "កា", "កក"])                    # ['កក', 'កា', 'ក្រ']

Design principles

  • Deterministic: same input, same output, always. Rule- and dictionary-based algorithms only; nothing probabilistic.
  • Self-contained: zero runtime dependencies. All word data comes from our own hand-curated, growable wordlists: words (core vocabulary), names (people's names and titles), modern (slang, loanwords, trending terms), and variants (common misspellings mapped to their canonical spelling). Together they hold 1,895 entries and growing, each verified individually; no wordlist is imported wholesale.
  • Lossless: no character is ever dropped — unknown Khmer spans are reported, not discarded.
  • Tested first: every module ships with table-driven unit tests and invariant checks (258 tests as of v0.3.0).
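
The lossless and deterministic principles can be written down as invariants that a test can check. The following is a minimal sketch under stated assumptions: `char_class`, `tokenize`, and `is_lossless` are hypothetical names invented for this illustration, and the character classes are a simplification, not the classification khmerthings actually uses.

```python
from itertools import groupby

def char_class(ch):
    """Classify one character. The class names are illustrative assumptions."""
    if ch.isspace():
        return "space"
    if ch.isdigit():                 # covers ASCII and Khmer digits (U+17E0..U+17E9)
        return "number"
    if "\u1780" <= ch <= "\u17FF":   # Khmer Unicode block
        return "khmer"
    if ch.isalpha():
        return "letter"
    return "other"

def tokenize(text):
    """Group consecutive characters of the same class into (class, span) runs."""
    return [(cls, "".join(run)) for cls, run in groupby(text, key=char_class)]

def is_lossless(text):
    """Invariant: concatenating all spans reproduces the input exactly."""
    return "".join(span for _, span in tokenize(text)) == text

sample = "ខ្ញុំមានឆ្កែ ២ ក្បាល and 3 cats"
tokens = tokenize(sample)

# Lossless: every character of the input appears in exactly one span.
assert is_lossless(sample)
# Deterministic: running the same input again gives the same tokens.
assert tokenize(sample) == tokens
```

Because every character falls into some class, including "other", nothing can be silently dropped; an unrecognized span is still reported as a run of its own.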

Under the hood, the tools share deterministic primitives (character classification, character-cluster segmentation, a cluster-keyed lexicon trie, lossless tokenization) in src/khmerthings/ — see the module docstrings if you want to build on them directly.
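
To give a feel for how character-cluster segmentation and a cluster-keyed trie can fit together, here is a small self-contained sketch. It is not the code in src/khmerthings/: the cluster pattern is a deliberately simplified assumption, and `clusters`, `build_trie`, `longest_match`, and `break_known` are hypothetical names chosen for this example. It performs greedy longest-match lookup over clusters and keeps unknown spans in the output rather than discarding them.

```python
import re

# Simplified Khmer cluster pattern (an assumption, not the library's rules).
CLUSTER = re.compile(
    "[\u1780-\u17B3]"                   # base consonant or independent vowel
    "(?:\u17D2[\u1780-\u17A2]"          # coeng sign + subscript consonant
    "|[\u17B6-\u17D1\u17D3\u17DD])*"    # dependent vowels and diacritic signs
    "|.",                               # any other character stands alone
    re.DOTALL,
)

def clusters(text):
    """Split text into clusters; joining them restores the text."""
    return CLUSTER.findall(text)

def build_trie(words):
    """Build a nested-dict trie whose edges are clusters, not code points."""
    root = {}
    for word in words:
        node = root
        for cl in clusters(word):
            node = node.setdefault(cl, {})
        node[""] = True  # end-of-word marker; "" is never a cluster
    return root

def longest_match(trie, cls, start):
    """Return the end index of the longest word at cls[start:], or start."""
    node, best = trie, start
    for i in range(start, len(cls)):
        node = node.get(cls[i])
        if node is None:
            break
        if "" in node:
            best = i + 1
    return best

def break_known(text, trie):
    """Greedy longest-match breaking; returns (span, is_known) pairs."""
    cls, out, unknown, i = clusters(text), [], [], 0
    while i < len(cls):
        end = longest_match(trie, cls, i)
        if end > i:
            if unknown:  # flush the pending unknown span first
                out.append(("".join(unknown), False))
                unknown = []
            out.append(("".join(cls[i:end]), True))
            i = end
        else:
            unknown.append(cls[i])
            i += 1
    if unknown:
        out.append(("".join(unknown), False))
    return out

words = ["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"]  # toy lexicon for the demo
trie = build_trie(words)
result = break_known("".join(words) + "??", trie)
print(result)
```

Keying the trie by cluster rather than by code point means a match can never end in the middle of a cluster, for example between a consonant and its vowel sign. Greedy longest match is only one possible strategy; it is used here because it is short and deterministic.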

Roadmap

  • ✅ Word counter, line sorter, word breaker
  • ⏳ Wordlist growth across all four sources (words, names, modern, variants) — hand-curated batches each release; the accuracy lever for every dictionary-based tool
  • 🔜 Spellchecker & spellfixer (engine is feasible today; the variants misspelling→canonical map is its future correction table, and lexicon coverage keeps improving its verdicts)
  • Later: part-of-speech tagger, intent detection, paragraph categorization

Contributing

See DEVELOPMENT_GUIDE.md for setup, the architecture, the rules (determinism, self-owned data, tests first), and how to add words to the lexicon — the single most valuable contribution. Changes are tracked in CHANGELOG.md.

License

MIT
