
Deterministic Khmer language tools: word counter, segmentation primitives, and more.


khmerthings

Deterministic Khmer language tools for Python — built as community building blocks: small, correct, dependency-free primitives you can compose into bigger systems.

No machine-learning models, no third-party NLP dependencies, no network calls. Every result is reproducible and explainable. Khmer is written without spaces between words, so even "simple" operations like counting or sorting need real language handling; khmerthings implements that from first principles.
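
To make the "no spaces" point concrete, here is a minimal sketch that uses only the Python standard library (it does not call khmerthings). It shows that whitespace splitting sees the sample sentence from this page as a single token, which is why a word counter for Khmer needs real segmentation.

```python
# Naive whitespace splitting treats an unspaced Khmer sentence as one "word".
sentence = "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ"  # the four-word sample sentence used on this page

naive_tokens = sentence.split()
print(len(naive_tokens))  # 1, although the sentence contains four words

# len() counts Unicode code points, which is neither a word count
# nor a count of visible characters.
print(len(sentence))
```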

Tools

Each tool is available both as a Python API and a CLI subcommand, and has its own detailed document:

| Tool | CLI | Python | Docs |
|---|---|---|---|
| Word breaker: split Khmer text into words | khmerthings segment | break_words, mark_boundaries | docs/word-breaker.md |
| Word counter: count words in Khmer/mixed text | khmerthings count | count_words, analyze | docs/word-counter.md |
| Line sorter: Khmer dictionary-order sorting | khmerthings sort | sort_lines, khmer_sort_key | docs/line-sorter.md |

Install

pip install khmerthings          # library
uv tool install khmerthings     # global CLI

A taste

$ echo "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ" | khmerthings segment
ខ្ញុំ ស្រឡាញ់ ភាសា ខ្មែរ

from khmerthings import break_words, count_words, sort_lines

break_words("ខ្ញុំស្រឡាញ់ភាសាខ្មែរ")   # ['ខ្ញុំ', 'ស្រឡាញ់', 'ភាសា', 'ខ្មែរ']
count_words("ខ្ញុំមានឆ្កែ ២ ក្បាល and 3 cats")   # 8
sort_lines(["ក្រ", "កា", "កក"])                    # ['កក', 'កា', 'ក្រ']

Design principles

  • Deterministic: same input, same output, always. Rule- and dictionary-based algorithms only; nothing probabilistic.
  • Self-contained: zero runtime dependencies. All word data comes from our own hand-curated, growable wordlists: words (core vocabulary), names (people's names & titles), and modern (slang, loanwords, trending terms), 802 entries and growing. Candidates are researched from public sources and verified entry by entry; no wordlist is imported wholesale.
  • Lossless: no character is ever dropped — unknown Khmer spans are reported, not discarded.
  • Tested first: every module ships with table-driven unit tests and invariant checks (258 tests as of v0.3.0).
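
As an illustration of the "deterministic" and "lossless" principles, the sketch below splits text into Khmer and non-Khmer spans and checks both invariants. It is not the library's tokenizer: `split_spans` is a hypothetical helper, and the sketch assumes that only the main Khmer Unicode block (U+1780 to U+17FF) counts as Khmer.

```python
import re

# Assumption: "Khmer" means the main Khmer block U+1780-U+17FF only.
# Every character matches one of the two alternatives, so nothing is skipped.
_SPAN = re.compile(r"[\u1780-\u17FF]+|[^\u1780-\u17FF]+")


def split_spans(text: str) -> list[tuple[str, str]]:
    """Split text into ('khmer' | 'other', span) pairs without dropping anything."""
    spans = []
    for match in _SPAN.finditer(text):
        chunk = match.group()
        kind = "khmer" if "\u1780" <= chunk[0] <= "\u17FF" else "other"
        spans.append((kind, chunk))
    return spans


text = "ខ្ញុំមានឆ្កែ ២ ក្បាល and 3 cats"
spans = split_spans(text)

# Lossless: concatenating the spans reproduces the input exactly.
assert "".join(chunk for _, chunk in spans) == text
# Deterministic: repeated calls give identical results.
assert split_spans(text) == spans
```

Checks like these two are the kind of invariant a table-driven test suite can run against every input row.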

Under the hood, the tools share deterministic primitives (character classification, character-cluster segmentation, a cluster-keyed lexicon trie, lossless tokenization) in src/khmerthings/ — see the module docstrings if you want to build on them directly.
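
To give a feel for how such primitives can compose, here is a rough, standard-library-only sketch of cluster segmentation feeding a cluster-keyed trie with greedy longest-match lookup. Everything in it is an assumption made for illustration: the helper names (`clusters`, `build_trie`, `segment`), the simplified cluster rules, the four-word lexicon, and the greedy strategy itself. The actual modules in src/khmerthings/ may work differently.

```python
COENG = "\u17D2"  # Khmer sign coeng, which introduces a subscript consonant
END = ""          # trie end-of-word marker; safe because clusters are never empty


def is_base(ch: str) -> bool:
    # Consonants and independent vowels (simplified range, an assumption).
    return "\u1780" <= ch <= "\u17B3"


def is_mark(ch: str) -> bool:
    # Dependent vowels and signs that attach to a base (simplified).
    return "\u17B4" <= ch <= "\u17D1" or ch in "\u17D3\u17DD"


def clusters(text: str) -> list[str]:
    """Group code points into character clusters; other characters stay single."""
    out, i, n = [], 0, len(text)
    while i < n:
        j = i + 1
        if is_base(text[i]):
            while j < n:
                if text[j] == COENG and j + 1 < n and is_base(text[j + 1]):
                    j += 2  # coeng plus its subscript consonant
                elif is_mark(text[j]):
                    j += 1
                else:
                    break
        out.append(text[i:j])
        i = j
    return out


def build_trie(words: list[str]) -> dict:
    """Build a nested-dict trie whose edges are clusters, not code points."""
    root: dict = {}
    for word in words:
        node = root
        for cluster in clusters(word):
            node = node.setdefault(cluster, {})
        node[END] = True
    return root


def segment(text: str, trie: dict) -> list[str]:
    """Greedy longest match; unknown clusters are kept as their own tokens."""
    cls, words, i = clusters(text), [], 0
    while i < len(cls):
        node, best, j = trie, None, i
        while j < len(cls) and cls[j] in node:
            node = node[cls[j]]
            j += 1
            if END in node:
                best = j
        if best is None:
            best = i + 1  # not in the lexicon: emit the cluster, never drop it
        words.append("".join(cls[i:best]))
        i = best
    return words


# Hypothetical four-word lexicon taken from the sample sentence on this page.
lexicon = build_trie(["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"])
print(segment("ខ្ញុំស្រឡាញ់ភាសាខ្មែរ", lexicon))
```

Keying the trie by cluster rather than by code point means a match can never end in the middle of a cluster, for example between a consonant and its subscript.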

Roadmap

  • ✅ Word counter, line sorter, word breaker
  • ⏳ Wordlist growth across all three sources (words, names, modern) — hand-curated batches each release; the accuracy lever for every dictionary-based tool
  • 🔜 Spellchecker & spellfixer (engine is feasible today; waiting on lexicon coverage to make its verdicts trustworthy)
  • Later: part-of-speech tagger, intent detection, paragraph categorization

Contributing

See DEVELOPMENT_GUIDE.md for setup, the architecture, the rules (determinism, self-owned data, tests first), and how to add words to the lexicon — the single most valuable contribution. Changes are tracked in CHANGELOG.md.

License

MIT

Release files

Version 0.4.3 is published as a source distribution and a pure-Python wheel. Both files were uploaded via uv 0.11.26 from CI on Ubuntu 24.04, without Trusted Publishing.

khmerthings-0.4.3.tar.gz (source distribution, 72.9 kB)

  • SHA256: 41a6e7c4a810ef0d3be8fb2cc74ab44b87d89cfdcdd16b2995b8cf93884750e4
  • MD5: 8b4edbc1707ed7399f9b717824195687
  • BLAKE2b-256: 4eec333cda093714b41678584eb4d68572ef66a144bd56e2873d300b087396c2

khmerthings-0.4.3-py3-none-any.whl (built distribution, Python 3, 22.2 kB)

  • SHA256: 8b063231350b6cb27328c1884b220f4681872c08aff94ff226018d262f7da41a
  • MD5: 6492f30104cb5bc637bcdd650b5177a4
  • BLAKE2b-256: 20a5b89101be2e80c2f412e7a9050219c5f4c5cfc89be8010b37aa5f52581545
