khmercut

A (fast) Khmer word segmentation toolkit.

A single Python file
Uses only pycrfsuite
Includes Khmer text normalization
CLI support
Multiprocess support

pip install khmercut

Python

from khmercut import tokenize

tokenize("ឃាត់ខ្លួនជនសង្ស័យ០៤នាក់ ករណីលួចខ្សែភ្លើង នៅស្រុកព្រៃនប់")
# => ['ឃាត់ខ្លួន', 'ជនសង្ស័យ', '០៤', 'នាក់', ' ', 'ករណី', 'លួច', 'ខ្សែភ្លើង', ' ', 'នៅ', 'ស្រុក', 'ព្រៃនប់']
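
The returned list keeps whitespace as separate tokens, so downstream code usually has to decide whether to keep or drop them. The following is a minimal sketch of that post-processing step. It works on the token list copied from the example output above and does not call khmercut itself. The helper `join_tokens` and its `drop_whitespace` parameter are hypothetical names introduced for illustration; they are not part of the package.

```python
# Token list copied from the example output above (no khmercut call needed).
tokens = ['ឃាត់ខ្លួន', 'ជនសង្ស័យ', '០៤', 'នាក់', ' ', 'ករណី', 'លួច',
          'ខ្សែភ្លើង', ' ', 'នៅ', 'ស្រុក', 'ព្រៃនប់']


def join_tokens(tokens, separator="|", drop_whitespace=False):
    """Hypothetical helper (not part of khmercut): join tokens into one string.

    When drop_whitespace is True, tokens made only of whitespace are skipped.
    """
    if drop_whitespace:
        tokens = [t for t in tokens if t.strip()]
    return separator.join(tokens)


# Keep only the non-whitespace tokens, e.g. for counting words.
words = [t for t in tokens if t.strip()]

print(len(tokens), len(words))
print(join_tokens(tokens, separator="|", drop_whitespace=True))
```

Whether whitespace tokens should be kept depends on the use case: keeping them lets the original spacing be restored later, while dropping them is more convenient for word counts or vocabulary building.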

CLI

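For example, to segment large_km.txt with 20 jobs, normalize the input first, write the output to out/ and separate tokens with "|":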

khmercut large_km.txt --jobs 20 --normalize -d out/ -s "|"

Available options

usage: khmercut [-h] [-d DIRECTORY] [-s SEPARATOR] [-j JOBS] [-q] [-n] files [files ...]

A fast Khmer word segmentation toolkit.

positional arguments:
  files                 Path to text files

optional arguments:
  -h, --help            show this help message and exit
  -d DIRECTORY, --directory DIRECTORY
                        Output folder
  -s SEPARATOR, --separator SEPARATOR
                        Specify token separator
  -j JOBS, --jobs JOBS  Number of processors
  -q, --quiet           Disable progress output
  -n, --normalize       Normalize input text before processing
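
To make the option list easier to read as code, here is a sketch of how a command line with these flags could be declared using Python's standard `argparse` module. It is reconstructed from the usage text above only and is not taken from the khmercut source. The function name `build_parser`, the `int` type for `--jobs`, and the absence of explicit defaults are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical reconstruction of the CLI from its help text (not khmercut's code)."""
    parser = argparse.ArgumentParser(
        prog="khmercut",
        description="A fast Khmer word segmentation toolkit.",
    )
    # One or more input files, as in "files [files ...]".
    parser.add_argument("files", nargs="+", help="Path to text files")
    parser.add_argument("-d", "--directory", help="Output folder")
    parser.add_argument("-s", "--separator", help="Specify token separator")
    # Assumption: the job count is parsed as an integer.
    parser.add_argument("-j", "--jobs", type=int, help="Number of processors")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Disable progress output")
    parser.add_argument("-n", "--normalize", action="store_true",
                        help="Normalize input text before processing")
    return parser


# Parse the example command shown in the CLI section.
args = build_parser().parse_args(
    ["large_km.txt", "--jobs", "20", "--normalize", "-d", "out/", "-s", "|"]
)
print(args.files, args.jobs, args.normalize, args.directory, args.separator)
```

In this sketch the options that are not passed stay at `None` (or `False` for the two switches); the real tool's defaults are not stated in the help text, so none are assumed here.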

Reference

Khmer language processing toolkit

Release history

0.0.2 (this version), Aug 3, 2023

0.0.1, Aug 3, 2023

Download files

Source Distribution

khmercut-0.0.2.tar.gz (5.9 MB)

Uploaded Aug 3, 2023

Built Distribution

khmercut-0.0.2-py3-none-any.whl (5.9 MB)

Uploaded Aug 3, 2023, Python 3

Hashes for khmercut-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`b28f3f29f9deba0f10ef53db38562246a5312b5e9129390a144e4da5d52f2459`
MD5	`5500b6c94ae367f01b23efc81e0b2557`
BLAKE2b-256	`e84536c9ef908b86e642bcca31b980067e04cfc91230693472005c1d16040169`

Hashes for khmercut-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8e259ca066b550653ea9792c56c45a71dd28c331ea4837c80ae871392f9eb1aa`
MD5	`310a283d3c582a4470dbb32501af6f21`
BLAKE2b-256	`23ef16987689caa762b03c71ae992cdb31d8625bc27c11daf4c44761135f87cc`