This package segments Khmer text into words (inserting spaces between words) using one of two methods: compound-based or morpheme-based.

Khmer Segmenter

A lightweight Khmer word segmentation package for Python ≥ 3.10, designed for simple and efficient tokenization.

This project is adapted and simplified from the original khnlp package, with the following goals:

  • Support modern Python versions (≥ 3.10)
  • Simplified installation
  • Focus on a single task: Khmer word segmentation
  • Lightweight and easy to integrate into NLP pipelines

This package is intended as a small academic contribution to support Khmer NLP research and practical applications.

Installation

pip install khmer-segmenter

Requires:

  • Python >= 3.10

Usage

from khmer_segmenter import Tokenizer

tokenizer = Tokenizer(seg_type="com")

print(tokenizer.tokenize("សួស្ដីអ្នកទាំងអស់យើង"))
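For pipeline use, the same call can be wrapped in a small helper that processes a corpus line by line. The following is a minimal sketch, not part of the package: `segment_lines` is a hypothetical name, and because the return type of `tokenize()` is not stated here, the helper assumes it is either a space-delimited string or an iterable of tokens and handles both.

```python
from typing import Iterable, Iterator


def segment_lines(lines: Iterable[str], tokenizer) -> Iterator[str]:
    """Yield each non-empty line as space-separated tokens.

    `tokenizer` is any object exposing tokenize(text), such as the
    Tokenizer shown above. segment_lines itself is a hypothetical helper.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of passing them to the tokenizer
        result = tokenizer.tokenize(line)
        # Assumption: tokenize() returns a spaced string or an iterable of tokens.
        tokens = result.split() if isinstance(result, str) else list(result)
        yield " ".join(tokens)


try:
    from khmer_segmenter import Tokenizer

    corpus = ["សួស្ដីអ្នកទាំងអស់យើង", ""]
    for segmented in segment_lines(corpus, Tokenizer(seg_type="com")):
        print(segmented)
except ImportError:
    print("khmer-segmenter is not installed; run: pip install khmer-segmenter")
```

Writing the helper as a generator keeps memory use flat on large corpora, since only one line is held at a time.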

Segmentation Modes

The tokenizer supports two segmentation strategies:

1. Compound-based (seg_type="com")

  • Segments text into compound words
  • Suitable for general word-level NLP tasks
  • Recommended for downstream applications such as:
    • Text classification
    • Named Entity Recognition
    • Information retrieval

2. Morpheme-based (seg_type="mor")

  • Performs finer-grained segmentation
  • Splits text into smaller morphological units
  • Useful for:
    • Linguistic analysis
    • Subword modeling
    • Research-focused NLP tasks
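To see how the two modes differ on a given input, both can be run on the same text and compared side by side. This is a rough sketch under stated assumptions: `to_tokens` and `compare_modes` are hypothetical helpers, only `Tokenizer`, `seg_type`, and `tokenize` come from the usage shown earlier, and the output of `tokenize()` is normalized because its return type is not documented here.

```python
from typing import Callable


def to_tokens(result) -> list[str]:
    """Normalize tokenizer output to a list of tokens.

    Assumption: tokenize() returns either a space-delimited string or an
    iterable of strings; both are accepted.
    """
    if isinstance(result, str):
        return result.split()
    return list(result)


def compare_modes(text: str, make_tokenizer: Callable[[str], object]) -> dict[str, list[str]]:
    """Tokenize `text` once per documented seg_type value ("com" and "mor")."""
    return {
        mode: to_tokens(make_tokenizer(mode).tokenize(text))
        for mode in ("com", "mor")
    }


try:
    from khmer_segmenter import Tokenizer

    results = compare_modes(
        "សួស្ដីអ្នកទាំងអស់យើង",
        lambda mode: Tokenizer(seg_type=mode),
    )
    for mode, tokens in results.items():
        print(mode, len(tokens), tokens)
except ImportError:
    print("khmer-segmenter is not installed; run: pip install khmer-segmenter")
```

Passing a factory rather than a ready-made tokenizer keeps the helper independent of the package, which also makes it easy to exercise with a stub.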

Motivation

Khmer is a low-resource language with no explicit word boundary markers (spaces are not consistently used to separate words). This creates challenges for:

  • Automatic Speech Recognition (ASR)
  • Language Modeling
  • Machine Translation
  • Information Extraction

Existing tools for Khmer segmentation often have:

  • Limited Python version support
  • Heavy dependencies
  • Broader NLP scope than necessary

This project provides a focused, minimal, and modern alternative dedicated solely to segmentation.
