khmercut

A (fast) Khmer word segmentation toolkit.

  • A single Python file
  • Depends only on pycrfsuite
  • Includes Khmer text normalization
  • CLI support
  • Multiprocess support

pip install khmercut

Python

from khmercut import tokenize

tokenize("ឃាត់ខ្លួនជនសង្ស័យ០៤នាក់ ករណីលួចខ្សែភ្លើង នៅស្រុកព្រៃនប់")
# => ['ឃាត់ខ្លួន', 'ជនសង្ស័យ', '០៤', 'នាក់', ' ', 'ករណី', 'លួច', 'ខ្សែភ្លើង', ' ', 'នៅ', 'ស្រុក', 'ព្រៃនប់']
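
The call above returns a plain list of strings, with whitespace kept as its own token. A minimal sketch of turning that into separator-joined lines follows. The helper name `segment_lines` and its `keep_whitespace` parameter are hypothetical and not part of khmercut; the tokenizer is passed in as an argument, so the sketch only assumes a callable that maps a string to a list of strings, as `tokenize` does in the example above.

```python
from typing import Callable, Iterable, Iterator, List

# Any callable that maps a string to a list of tokens,
# e.g. khmercut's `tokenize` as shown in the example above.
Tokenizer = Callable[[str], List[str]]


def segment_lines(
    lines: Iterable[str],
    tokenizer: Tokenizer,
    separator: str = "|",
    keep_whitespace: bool = False,
) -> Iterator[str]:
    """Yield each input line with its tokens joined by `separator`.

    Hypothetical helper for illustration; not part of the khmercut API.
    """
    for line in lines:
        text = line.rstrip("\n")
        if not text:
            # Keep blank lines so output line numbers match the input.
            yield ""
            continue
        tokens = tokenizer(text)
        if not keep_whitespace:
            # Drop tokens that consist only of whitespace (e.g. ' ').
            tokens = [t for t in tokens if t.strip()]
        yield separator.join(tokens)
```

With khmercut installed, this could be used as `segment_lines(open("large_km.txt", encoding="utf-8"), tokenize)` after `from khmercut import tokenize`. Whether the CLI treats whitespace tokens the same way is not stated here, so the output may differ from what `khmercut -s "|"` writes.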

CLI

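For example, to segment large_km.txt using 20 processors, normalize the input before processing, write the output to out/, and separate tokens with "|":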

khmercut large_km.txt --jobs 20 --normalize -d out/ -s "|"

Available options

usage: khmercut [-h] [-d DIRECTORY] [-s SEPARATOR] [-j JOBS] [-q] [-n] files [files ...]

A fast Khmer word segmentation toolkit.

positional arguments:
  files                 Path to text files

optional arguments:
  -h, --help            show this help message and exit
  -d DIRECTORY, --directory DIRECTORY
                        Output folder
  -s SEPARATOR, --separator SEPARATOR
                        Specify token separator
  -j JOBS, --jobs JOBS  Number of processors
  -q, --quiet           Disable progress output
  -n, --normalize       Normalize input text before processing
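
When the CLI is driven from a script, the options listed above can be assembled into an argument list programmatically. The sketch below is one way to do that. The function name `build_khmercut_command` is hypothetical, and it uses only the flags shown in the usage text; it assumes the `khmercut` executable is on the PATH once the package is installed.

```python
from typing import List, Optional, Sequence


def build_khmercut_command(
    files: Sequence[str],
    directory: Optional[str] = None,
    separator: Optional[str] = None,
    jobs: Optional[int] = None,
    quiet: bool = False,
    normalize: bool = False,
) -> List[str]:
    """Build an argv list for the khmercut CLI from the documented options.

    Hypothetical helper for illustration; not shipped with khmercut.
    """
    if not files:
        # The usage text requires at least one input file.
        raise ValueError("at least one input file is required")

    cmd: List[str] = ["khmercut"]
    if directory is not None:
        cmd += ["--directory", directory]
    if separator is not None:
        cmd += ["--separator", separator]
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]
    if quiet:
        cmd.append("--quiet")
    if normalize:
        cmd.append("--normalize")
    # Positional file arguments go last.
    cmd += list(files)
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the standard library. Because the arguments are passed as a list rather than through a shell, a separator such as `|` needs no quoting.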
