Deterministic Khmer language tools: word counter, segmentation primitives, and more.

khmerthings

Deterministic Khmer language tools for Python — built as community building blocks: small, correct, dependency-free primitives you can compose into bigger systems.

No machine-learning models, no third-party NLP dependencies, no network calls. Every result is reproducible and explainable. Khmer is written without spaces between words, so even "simple" operations like counting or sorting need real language handling; khmerthings implements that handling from first principles.
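To make the "no spaces" point concrete, here is a minimal sketch using only the Python standard library (it does not call khmerthings). It shows that whitespace splitting treats a whole Khmer sentence as a single token.

```python
# Whitespace splitting cannot find Khmer word boundaries,
# because the script does not separate words with spaces.
text = "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ"  # "I love the Khmer language"

naive_tokens = text.split()

print(len(naive_tokens))  # 1: the whole sentence comes back as one token
print(len(text))          # number of code points, which is not a word count
```

Any word count built on `str.split()` would therefore report one word for this sentence, which is why a segmentation step is needed first.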

Tools

Each tool is available both as a Python API and a CLI subcommand, and has its own detailed document:

Tool | CLI | Python | Docs
Word breaker: split Khmer text into words | khmerthings segment | break_words, mark_boundaries | docs/word-breaker.md
Word counter: count words in Khmer/mixed text | khmerthings count | count_words, analyze | docs/word-counter.md
Line sorter: Khmer dictionary-order sorting | khmerthings sort | sort_lines, khmer_sort_key | docs/line-sorter.md

Install

pip install khmerthings          # library
uv tool install khmerthings     # global CLI

A taste

$ echo "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ" | khmerthings segment
ខ្ញុំ ស្រឡាញ់ ភាសា ខ្មែរ

from khmerthings import break_words, count_words, sort_lines

break_words("ខ្ញុំស្រឡាញ់ភាសាខ្មែរ")   # ['ខ្ញុំ', 'ស្រឡាញ់', 'ភាសា', 'ខ្មែរ']
count_words("ខ្ញុំមានឆ្កែ ២ ក្បាល and 3 cats")   # 8
sort_lines(["ក្រ", "កា", "កក"])                    # ['កក', 'កា', 'ក្រ']

Design principles

  • Deterministic: same input, same output, always. Rule- and dictionary-based algorithms only; nothing probabilistic.
  • Self-contained: zero runtime dependencies. All word data comes from our own hand-curated, growable wordlists: words (core vocabulary), names (people's names & titles), and modern (slang, loanwords, trending terms). Together they hold 1,583 entries and growing, each verified entry by entry; no wordlist is imported wholesale.
  • Lossless: no character is ever dropped — unknown Khmer spans are reported, not discarded.
  • Tested first: every module ships with table-driven unit tests and invariant checks (258 tests as of v0.3.0).
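The "lossless" and "deterministic" principles can be phrased as checkable invariants. The sketch below is an illustration only, not khmerthings' tokenizer: `is_khmer` and `split_spans` are hypothetical helpers that split text into runs of Khmer and non-Khmer characters, using the main Khmer Unicode block (U+1780 to U+17FF) as the test.

```python
from itertools import groupby


def is_khmer(ch):
    """True for code points in the main Khmer Unicode block (U+1780..U+17FF)."""
    return "\u1780" <= ch <= "\u17ff"


def split_spans(text):
    """Split text into maximal runs of Khmer / non-Khmer characters.

    Hypothetical helper for illustration; every input character ends up
    in exactly one span, so nothing is dropped.
    """
    return [
        ("khmer" if khmer else "other", "".join(run))
        for khmer, run in groupby(text, key=is_khmer)
    ]


sample = "ខ្ញុំមានឆ្កែ ២ ក្បាល and 3 cats"
spans = split_spans(sample)

# Lossless invariant: concatenating the spans reproduces the input exactly.
assert "".join(span for _, span in spans) == sample

# Determinism invariant: the same input always yields the same output.
assert split_spans(sample) == spans

for kind, span in spans:
    print(kind, repr(span))
```

Invariants of this shape are cheap to run against every test input, which is one way table-driven tests and invariant checks can complement each other.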

Under the hood, the tools share deterministic primitives (character classification, character-cluster segmentation, a cluster-keyed lexicon trie, lossless tokenization) in src/khmerthings/ — see the module docstrings if you want to build on them directly.
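As a rough idea of what character classification and character-cluster segmentation involve, here is a simplified sketch. It is an assumption-laden illustration, not the code in src/khmerthings/: the function names are hypothetical, and the rules cover only base characters, the coeng subscript marker (U+17D2), and dependent vowels and signs.

```python
COENG = "\u17d2"  # subscript marker: the next consonant is written below


def is_base(ch):
    """Consonants (U+1780..U+17A2) and independent vowels (U+17A3..U+17B3)."""
    return "\u1780" <= ch <= "\u17b3"


def is_khmer_mark(ch):
    """Dependent vowels and signs (U+17B6..U+17D3), including coeng."""
    return "\u17b6" <= ch <= "\u17d3"


def clusters(text):
    """Group code points into simplified Khmer character clusters.

    A character joins the previous cluster when it is a consonant that
    follows coeng, or a mark that follows a cluster starting with a base
    character. Everything else starts a new cluster, so the output always
    concatenates back to the input.
    """
    out = []
    for ch in text:
        attaches = bool(out) and (
            out[-1].endswith(COENG) and is_base(ch)
            or is_khmer_mark(ch) and is_base(out[-1][0])
        )
        if attaches:
            out[-1] += ch
        else:
            out.append(ch)
    return out


word = "\u1781\u17d2\u1798\u17c2\u179a"  # the word for "Khmer"
print(clusters(word))  # two clusters: the stacked first syllable, then the final consonant
```

Working on clusters rather than raw code points means a dictionary lookup never splits a consonant from its subscript or vowel, which is presumably why a lexicon would be keyed by cluster.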

Roadmap

  • ✅ Word counter, line sorter, word breaker
  • ⏳ Wordlist growth across all three sources (words, names, modern) — hand-curated batches each release; the accuracy lever for every dictionary-based tool
  • 🔜 Spellchecker & spellfixer (engine is feasible today; waiting on lexicon coverage to make its verdicts trustworthy)
  • Later: part-of-speech tagger, intent detection, paragraph categorization
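To see why lexicon coverage is the accuracy lever for dictionary-based tools, consider a toy greedy longest-match segmenter. This is a sketch under stated assumptions, not the library's algorithm: `segment` is a hypothetical function, it matches over code points with a plain set rather than a cluster-keyed trie, and the four-word lexicon is taken from the example sentence above.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation (illustrative only).

    Returns (token, is_known) pairs. Characters that match no lexicon
    entry are merged into unknown spans instead of being dropped.
    """
    max_len = max((len(word) for word in lexicon), default=0)
    out = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                match = candidate
                break
        if match is not None:
            out.append((match, True))
            i += len(match)
        elif out and not out[-1][1]:
            # Extend the current unknown span.
            out[-1] = (out[-1][0] + text[i], False)
            i += 1
        else:
            out.append((text[i], False))
            i += 1
    return out


full = ["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"]
sentence = "".join(full)

# With full coverage every word is recognised.
complete = segment(sentence, set(full))

# With only two of the four words known, the rest surface as unknown spans.
partial = segment(sentence, {full[0], full[2]})

print([token for token, known in complete])
print([token for token, known in partial if not known])
```

The same engine gives a trustworthy answer only when the lexicon covers the input, which matches the roadmap's reasoning for growing the wordlists before shipping a spellchecker.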

Contributing

See DEVELOPMENT_GUIDE.md for setup, the architecture, the rules (determinism, self-owned data, tests first), and how to add words to the lexicon — the single most valuable contribution. Changes are tracked in CHANGELOG.md.

License

MIT
