Khmer Normalizer
A missing toolkit for Khmer Natural Language Processing.
- Reorder characters
- Remove duplicate whitespace
- Remove zero-width spaces
- Remove emojis
- Fix common misspellings
- Fix Unicode issues
- Fix Khmer trailing vowels
- Replace URLs
- Apply Unicode normalization (NFKC)
- Normalize quote symbols
- Remove repeated punctuation
from khmernormalizer import normalize
text = "hello, world សួស្តីពិភពលោក !!!! 🇰🇭"
result = normalize(text)
# -> "hello, world សួស្តីពិភពលោក!"
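To give a feel for what a few of the listed steps involve, here is a minimal standard-library sketch of NFKC normalization, zero-width space removal, quote normalization, repeated-punctuation removal, and whitespace cleanup. It is not the package's implementation: `sketch_normalize`, `QUOTE_MAP`, and the chosen punctuation set are hypothetical names and choices made for illustration, and the Khmer-specific steps (character reordering, trailing vowels, misspellings) are not attempted.

```python
import re
import unicodedata

ZERO_WIDTH_SPACE = "\u200b"

# Hypothetical mapping from curly quotes to straight ASCII quotes (assumption).
QUOTE_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})

# Assumed punctuation set: a few ASCII marks plus the Khmer sign khan (។).
REPEATED_PUNCT = re.compile(r"([!?.,។])\1+")


def sketch_normalize(text: str) -> str:
    """Illustrative approximation of a few normalization steps."""
    # Unicode compatibility normalization (e.g. full-width "！" becomes "!").
    text = unicodedata.normalize("NFKC", text)
    # Drop zero-width spaces, which are invisible but break string matching.
    text = text.replace(ZERO_WIDTH_SPACE, "")
    # Map curly quotes to straight quotes.
    text = text.translate(QUOTE_MAP)
    # Reduce a run of the same punctuation mark to a single mark.
    text = REPEATED_PUNCT.sub(r"\1", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(sketch_normalize("hello,\u200b  world សួស្តីពិភពលោក !!!!"))
```

Unlike the `normalize` call shown above, this sketch leaves emojis in place and keeps the space before the final punctuation mark, so its output will generally differ from the package's.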
Download files
Source Distribution
khmernormalizer-0.0.1.tar.gz (5.7 kB)
Built Distribution
Hashes for khmernormalizer-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 67bb785c5d2dd00415508b987f61ca104daa40e5217988a68f790ea806b62c95
MD5 | 622990886f9f4663d93b54f7f6ecb678
BLAKE2b-256 | 488d4756b610d467a74d374c44fe30f953b377f402216c646dfb1e2b19fafee6
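
A downloaded wheel can be checked against the SHA256 digest listed above using Python's `hashlib`. The following is a minimal sketch; the helper names `sha256_of_file` and `matches_published_digest` are hypothetical, and the local file path is whatever location the wheel was saved to.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for khmernormalizer-0.0.1-py3-none-any.whl.
EXPECTED_SHA256 = "67bb785c5d2dd00415508b987f61ca104daa40e5217988a68f790ea806b62c95"


def sha256_of_file(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected: str = EXPECTED_SHA256) -> bool:
    """Compare a file's SHA256 digest with the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_published_digest("khmernormalizer-0.0.1-py3-none-any.whl")` should return `True` for an intact copy of the wheel, assuming the file sits in the current directory.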