khmercut

A (fast) Khmer word segmentation toolkit.

  • A single Python file
  • Depends only on pycrfsuite
  • Includes Khmer text normalization
  • CLI support
  • Multiprocess support

pip install khmercut

Python

from khmercut import tokenize

tokenize("ឃាត់ខ្លួនជនសង្ស័យ០៤នាក់ ករណីលួចខ្សែភ្លើង នៅស្រុកព្រៃនប់")
# => ['ឃាត់ខ្លួន', 'ជនសង្ស័យ', '០៤', 'នាក់', ' ', 'ករណី', 'លួច', 'ខ្សែភ្លើង', ' ', 'នៅ', 'ស្រុក', 'ព្រៃនប់']
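The returned list keeps whitespace as separate tokens, so a common follow-up step is to drop those and join the remaining words with a visible separator. The sketch below works on the token list copied from the example output above, so it runs without khmercut installed; `join_words` is a hypothetical helper, not part of khmercut.

```python
# Tokens copied from the example output above. In real use they would come
# from khmercut.tokenize(text).
tokens = ['ឃាត់ខ្លួន', 'ជនសង្ស័យ', '០៤', 'នាក់', ' ', 'ករណី', 'លួច', 'ខ្សែភ្លើង', ' ', 'នៅ', 'ស្រុក', 'ព្រៃនប់']


def join_words(tokens, separator="|"):
    """Drop whitespace-only tokens and join the rest (hypothetical helper)."""
    words = [token for token in tokens if token.strip()]
    return separator.join(words)


segmented = join_words(tokens)
print(segmented)
```

Whether the CLI's -s option treats whitespace tokens the same way is not stated here, so this is only one reasonable convention.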

CLI

For example:

khmercut large_km.txt --jobs 20 --normalize -d out/ -s "|"

Available options

usage: khmercut [-h] [-d DIRECTORY] [-s SEPARATOR] [-j JOBS] [-q] [-n] files [files ...]

A fast Khmer word segmentation toolkit.

positional arguments:
  files                 Path to text files

optional arguments:
  -h, --help            show this help message and exit
  -d DIRECTORY, --directory DIRECTORY
                        Output folder
  -s SEPARATOR, --separator SEPARATOR
                        Specify token separator
  -j JOBS, --jobs JOBS  Number of processors
  -q, --quiet           Disable progress output
  -n, --normalize       Normalize input text before processing
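To drive the CLI from a script, the options listed above can be assembled into an argument list. This is a minimal sketch that uses only the flags shown in the usage text; `build_khmercut_argv` is a hypothetical helper, not something khmercut provides, and the block only builds and prints the command rather than running it.

```python
import shlex


def build_khmercut_argv(files, directory=None, separator=None, jobs=None,
                        quiet=False, normalize=False):
    """Build a khmercut command line from the documented options."""
    argv = ["khmercut"]
    if directory is not None:
        argv += ["--directory", directory]
    if separator is not None:
        argv += ["--separator", separator]
    if jobs is not None:
        argv += ["--jobs", str(jobs)]
    if quiet:
        argv.append("--quiet")
    if normalize:
        argv.append("--normalize")
    # Positional file paths go last, as in the usage line.
    argv += list(files)
    return argv


argv = build_khmercut_argv(["large_km.txt"], directory="out/",
                           separator="|", jobs=20, normalize=True)
# shlex.join quotes the separator so the line is safe to paste into a shell.
print(shlex.join(argv))
```

With khmercut installed, the list could be passed to subprocess.run(argv, check=True). Passing a list instead of a shell string avoids having to quote the "|" separator by hand.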
