khmercut
A (fast) Khmer word segmentation toolkit.
- A single Python file
- Depends only on pycrfsuite
- Includes Khmer text normalization
- CLI support
- Multiprocess support
pip install khmercut
Python
from khmercut import tokenize
tokenize("ឃាត់ខ្លួនជនសង្ស័យ០៤នាក់ ករណីលួចខ្សែភ្លើង នៅស្រុកព្រៃនប់")
# => ['ឃាត់ខ្លួន', 'ជនសង្ស័យ', '០៤', 'នាក់', ' ', 'ករណី', 'លួច', 'ខ្សែភ្លើង', ' ', 'នៅ', 'ស្រុក', 'ព្រៃនប់']
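The returned value is a plain Python list, and the spaces in the input come back as their own tokens. The sketch below shows one way to turn such a list into a separator-delimited string. It reuses the token list from the example above so that it runs without khmercut installed; `join_tokens` is a hypothetical helper written for this illustration, not part of khmercut, and it is not necessarily how the CLI formats its output.

```python
# Token list copied from the example output above (not produced at runtime).
tokens = ['ឃាត់ខ្លួន', 'ជនសង្ស័យ', '០៤', 'នាក់', ' ', 'ករណី', 'លួច', 'ខ្សែភ្លើង', ' ', 'នៅ', 'ស្រុក', 'ព្រៃនប់']


def join_tokens(tokens, separator="|", keep_spaces=False):
    """Hypothetical helper: join word tokens with a visible separator.

    Whitespace-only tokens are dropped unless keep_spaces is True.
    """
    if not keep_spaces:
        tokens = [t for t in tokens if t.strip()]
    return separator.join(tokens)


segmented = join_tokens(tokens)
print(segmented)
```

Dropping the whitespace tokens is a design choice of this sketch; pass `keep_spaces=True` to preserve them.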
CLI
Example:
khmercut large_km.txt --jobs 20 --normalize -d out/ -s "|"
Available options
usage: khmercut [-h] [-d DIRECTORY] [-s SEPARATOR] [-j JOBS] [-q] [-n] files [files ...]

A fast Khmer word segmentation toolkit.

positional arguments:
  files                 Path to text files

optional arguments:
  -h, --help            show this help message and exit
  -d DIRECTORY, --directory DIRECTORY
                        Output folder
  -s SEPARATOR, --separator SEPARATOR
                        Specify token separator
  -j JOBS, --jobs JOBS  Number of processors
  -q, --quiet           Disable progress output
  -n, --normalize       Normalize input text before processing
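To make the option list concrete, here is a minimal sketch of an `argparse` parser that accepts the same flags and parses the example command shown earlier. It is a reconstruction from the help text only, not khmercut's actual source: the `build_parser` name, the `int` type for `--jobs`, and the absence of default values are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented khmercut options."""
    parser = argparse.ArgumentParser(
        prog="khmercut",
        description="A fast Khmer word segmentation toolkit.",
    )
    parser.add_argument("files", nargs="+", help="Path to text files")
    parser.add_argument("-d", "--directory", help="Output folder")
    parser.add_argument("-s", "--separator", help="Specify token separator")
    # Assumption: the job count is parsed as an integer.
    parser.add_argument("-j", "--jobs", type=int, help="Number of processors")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Disable progress output")
    parser.add_argument("-n", "--normalize", action="store_true",
                        help="Normalize input text before processing")
    return parser


# Same arguments as the example command:
# khmercut large_km.txt --jobs 20 --normalize -d out/ -s "|"
args = build_parser().parse_args(
    ["large_km.txt", "--jobs", "20", "--normalize", "-d", "out/", "-s", "|"]
)
print(args)
```

Reading the result back, the example command processes one file with 20 workers, normalizes the text first, writes into `out/`, and separates tokens with `|`.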