
Deterministic Khmer language tools: word counter, segmentation primitives, and more.

Project description

khmerthings

Deterministic Khmer language tools for Python — built as community building blocks: small, correct, dependency-free primitives you can compose into bigger systems.

No machine-learning models, no third-party NLP dependencies, no network calls. Every result is reproducible and explainable. Khmer is written without spaces between words, so even "simple" operations like counting or sorting need real language handling; khmerthings implements that handling from first principles.
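
To make the "no spaces" point concrete, here is a minimal sketch in plain Python. It does not call khmerthings at all; `whitespace_word_count` is a hypothetical helper name used only for illustration. It shows that whitespace splitting treats a whole Khmer sentence as a single "word".

```python
def whitespace_word_count(text: str) -> int:
    """Naive word count: split on whitespace only (hypothetical helper)."""
    return len(text.split())


# "I love the Khmer language": four words, written with no spaces between them.
khmer_sentence = "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ"

print(whitespace_word_count("I love the Khmer language"))  # 5
print(whitespace_word_count(khmer_sentence))                # 1
```

The naive count is wrong for Khmer because there are no separators to split on, so word boundaries have to come from knowledge of the script and a lexicon.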

Tools

Each tool is available both as a Python API and a CLI subcommand, and has its own detailed document:

| Tool | CLI | Python | Docs |
| --- | --- | --- | --- |
| Word breaker: split Khmer text into words | khmerthings segment | break_words, mark_boundaries | docs/word-breaker.md |
| Word counter: count words in Khmer/mixed text | khmerthings count | count_words, analyze | docs/word-counter.md |
| Line sorter: Khmer dictionary-order sorting | khmerthings sort | sort_lines, khmer_sort_key | docs/line-sorter.md |
| Spellchecker: find Khmer misspellings & unknown words | khmerthings spellcheck | check_spelling | docs/spellcheck.md |
| Spellfixer: rewrite known misspellings to canonical | khmerthings spellfix | fix_spelling | docs/spellfix.md |

Install

pip install khmerthings          # library
uv tool install khmerthings     # global CLI

A taste

$ echo "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ" | khmerthings segment
ខ្ញុំ ស្រឡាញ់ ភាសា ខ្មែរ

from khmerthings import break_words, count_words, fix_spelling, sort_lines

break_words("ខ្ញុំស្រឡាញ់ភាសាខ្មែរ")   # ['ខ្ញុំ', 'ស្រឡាញ់', 'ភាសា', 'ខ្មែរ']
count_words("ខ្ញុំមានឆ្កែ ២ ក្បាល and 3 cats")   # 8
sort_lines(["ក្រ", "កា", "កក"])                    # ['កក', 'កា', 'ក្រ']
fix_spelling("ខ្ញុំសំរាប់ការងារ")                  # 'ខ្ញុំសម្រាប់ការងារ'
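
As a rough illustration of what a variants-driven rewrite like the `fix_spelling` call above might involve, here is a minimal sketch. It is not the library's implementation: `VARIANTS`, `fix_words`, and the pre-segmented word list are all assumptions made for this example, and the single misspelling/canonical pair is taken from the example above.

```python
# Hypothetical variants table: known misspelling -> canonical spelling.
VARIANTS = {"សំរាប់": "សម្រាប់"}


def fix_words(words, variants=VARIANTS):
    """Replace each word that is a known misspelling; leave all others as-is."""
    return [variants.get(word, word) for word in words]


# Assumed segmentation of the input sentence (the real tool would compute this).
words = ["ខ្ញុំ", "សំរាប់", "ការងារ"]
print("".join(fix_words(words)))
```

Working on whole words rather than raw substrings avoids rewriting a misspelling pattern that only happens to occur inside a longer, correctly spelled word.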

Design principles

  • Deterministic: same input, same output, always. Rule- and dictionary-based algorithms only; nothing probabilistic.
  • Self-contained: zero runtime dependencies. All word data comes from our own hand-curated, growable wordlists: words (core vocabulary), names (people's names & titles), modern (slang, loanwords, trending terms), and variants (common misspellings mapped to their canonical spelling). Together they hold 1,895 entries and growing, each verified individually; no wordlist is imported wholesale.
  • Lossless: no character is ever dropped — unknown Khmer spans are reported, not discarded.
  • Tested first: every module ships with table-driven unit tests and invariant checks (332 tests and growing).
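
The determinism and losslessness principles above lend themselves to invariant checks. The following is a minimal sketch of what a table-driven check could look like, using a toy tokenizer written for this example; `tokenize`, `check_invariants`, and `CASES` are hypothetical names, not part of the khmerthings API or its test suite.

```python
import re

# Three mutually exclusive classes that together cover every character,
# so no character can fall outside a token.
TOKEN_RE = re.compile(
    r"(?P<khmer>[\u1780-\u17FF]+)"         # run of characters in the Khmer block
    r"|(?P<space>\s+)"                     # run of whitespace
    r"|(?P<other>[^\u1780-\u17FF\s]+)"     # run of anything else
)


def tokenize(text):
    """Split text into (kind, span) pairs without dropping any character."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


def check_invariants(text):
    tokens = tokenize(text)
    # Lossless: concatenating the spans reproduces the input exactly.
    assert "".join(span for _, span in tokens) == text
    # Deterministic: a second run gives the identical result.
    assert tokenize(text) == tokens


# Table of inputs: empty, mixed Khmer/Latin, Latin only, whitespace only.
CASES = ["", "ខ្ញុំមានឆ្កែ ២ ក្បាល and 3 cats", "abc", "\n\t "]

for case in CASES:
    check_invariants(case)
print("all invariants hold")
```

Adding a new edge case is then a one-line change to the table rather than a new test function.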

Under the hood, the tools share deterministic primitives (character classification, character-cluster segmentation, a cluster-keyed lexicon trie, lossless tokenization) in src/khmerthings/ — see the module docstrings if you want to build on them directly.
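
To give a feel for how character-cluster segmentation and a cluster-keyed trie can fit together, here is a simplified sketch. It is an illustration under stated assumptions, not the code in src/khmerthings/: the cluster rule is a rough approximation (a base letter followed by subscript consonants and combining marks), the matching strategy is plain greedy longest-match, and `clusters`, `build_trie`, and `break_words_sketch` are hypothetical names.

```python
import re

# Simplified cluster: a base letter (U+1780 to U+17B3), then any number of
# "coeng + letter" subscripts (U+17D2 followed by a letter) or combining
# vowels/signs. Any other character forms a cluster on its own.
CLUSTER_RE = re.compile(
    r"[\u1780-\u17B3](?:\u17D2[\u1780-\u17B3]|[\u17B4-\u17D1\u17D3\u17DD])*|.",
    re.DOTALL,
)

END = object()  # marker key: "a word ends at this trie node"


def clusters(text):
    """Split text into clusters; joining them gives back the input."""
    return CLUSTER_RE.findall(text)


def build_trie(words):
    """Build a nested-dict trie whose edges are clusters, not code points."""
    root = {}
    for word in words:
        node = root
        for cluster in clusters(word):
            node = node.setdefault(cluster, {})
        node[END] = True
    return root


def break_words_sketch(text, trie):
    """Greedy longest-match segmentation over clusters (illustrative only)."""
    units = clusters(text)
    out, i = [], 0
    while i < len(units):
        node, j, end = trie, i, None
        while j < len(units) and units[j] in node:
            node = node[units[j]]
            j += 1
            if END in node:
                end = j          # remember the longest word seen so far
        if end is None:
            end = i + 1          # unknown cluster: emit it alone, never drop it
        out.append("".join(units[i:end]))
        i = end
    return out


LEXICON = ["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"]  # toy lexicon for this example
print(break_words_sketch("ខ្ញុំស្រឡាញ់ភាសាខ្មែរ", build_trie(LEXICON)))
```

Keying the trie on clusters rather than individual code points means a match can never end in the middle of a stacked consonant or between a letter and its vowel sign.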

Roadmap

  • ✅ Word counter, line sorter, word breaker, spellchecker & spellfixer
  • ⏳ Wordlist growth across all four sources (words, names, modern, variants) — hand-curated batches each release; the accuracy lever for every dictionary-based tool, including the spellchecker's verdicts, suggestions, and fixes
  • Later: part-of-speech tagger, intent detection, paragraph categorization

Contributing

See DEVELOPMENT_GUIDE.md for setup, the architecture, the rules (determinism, self-owned data, tests first), and how to add words to the lexicon — the single most valuable contribution. Changes are tracked in CHANGELOG.md.

License

MIT

Download files

Source Distribution

khmerthings-0.7.0.tar.gz (90.5 kB)

Built Distribution

khmerthings-0.7.0-py3-none-any.whl (32.5 kB)

File details

Details for the file khmerthings-0.7.0.tar.gz.

File metadata

  • Download URL: khmerthings-0.7.0.tar.gz
  • Upload date:
  • Size: 90.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.26

File hashes

Hashes for khmerthings-0.7.0.tar.gz
Algorithm Hash digest
SHA256 a913c76b097f889df8b6ee6c1c04c3f3ee3c8ec8f9d735951e3efb825b4b6c12
MD5 f67af3ead31f2446db38aa4351bd80e6
BLAKE2b-256 22625049db02b2652f2b6b7f477abb1a273efe08fe8cbfd859deb0ee6283bbda

File details

Details for the file khmerthings-0.7.0-py3-none-any.whl.

File metadata

  • Download URL: khmerthings-0.7.0-py3-none-any.whl
  • Upload date:
  • Size: 32.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.26

File hashes

Hashes for khmerthings-0.7.0-py3-none-any.whl
Algorithm Hash digest
SHA256 46129b6f830f19326fc78d96596ef1252044941edc5f2ab798153faea635105c
MD5 607e3e2f18ef554fdb5c913186c57075
BLAKE2b-256 39c57fca3cbca63a8cf469ffd676abc8b4adfc084a496a7de7a76d74aefd587b
