A missing toolkit for Khmer Natural Language Processing.
Khmer Normalizer
- Character Reordering
- Collapse duplicate whitespace
- Remove zero width space
- Remove emojis
- Fix common misspellings
- Fix Unicode issues
- Fix Khmer trailing vowels
- URL Replacements
- Unicode Normalization (NFKC)
- Quote symbol normalization
- Remove repeated punctuation
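To make a few of the steps above concrete, here is a minimal sketch that uses only the Python standard library. It is not the package's implementation: the function name `sketch_normalize`, the quote mapping, and the punctuation set are assumptions chosen for illustration. It covers only NFKC normalization, zero width space removal, quote normalization, repeated punctuation, and duplicate whitespace.

```python
import re
import unicodedata

ZERO_WIDTH_SPACE = "\u200b"

# Hypothetical mapping from curly quotes to their ASCII equivalents.
QUOTES = str.maketrans({
    "\u201c": '"',
    "\u201d": '"',
    "\u2018": "'",
    "\u2019": "'",
})


def sketch_normalize(text: str) -> str:
    """Illustrative subset of the listed steps (not the library's code)."""
    # Unicode compatibility normalization (NFKC).
    text = unicodedata.normalize("NFKC", text)
    # Remove zero width spaces.
    text = text.replace(ZERO_WIDTH_SPACE, "")
    # Normalize curly quotes to plain ASCII quotes.
    text = text.translate(QUOTES)
    # Collapse runs of the same punctuation mark, e.g. "!!!!" becomes "!".
    # The character set here is an assumption; "។" is the Khmer full stop.
    text = re.sub(r"([!?,;:។])\1+", r"\1", text)
    # Collapse whitespace runs to a single space and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text
```

Unlike the package's `normalize`, this sketch does not reorder characters, strip emojis, replace URLs, or remove the space before punctuation.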
Installation
pip install khmernormalizer
Usage
from khmernormalizer import normalize
text = "hello, world សួស្តីពិភពលោក !!!! 🇰🇭"
result = normalize(text)
# -> "hello, world សួស្តីពិភពលោក!"
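When cleaning a corpus rather than a single string, the same call can be applied line by line. The helper below is a hypothetical sketch, not part of the package: `normalize_lines` is an invented name, and it takes the normalizer as a parameter so that it works with any `str -> str` function.

```python
from typing import Callable, Iterable, List


def normalize_lines(
    lines: Iterable[str],
    normalize_fn: Callable[[str], str],
) -> List[str]:
    """Apply normalize_fn to each line and drop lines that end up empty."""
    cleaned = []
    for line in lines:
        result = normalize_fn(line)
        if result:  # skip lines that are empty after normalization
            cleaned.append(result)
    return cleaned
```

With the package installed, `normalize_lines(lines, normalize)` would pass each line through the `normalize` function imported above.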