Khmer Normalizer

A missing toolkit for Khmer Natural Language Processing.

  • Character reordering
  • Collapse duplicate whitespace
  • Remove zero-width spaces
  • Remove emojis
  • Fix common misspellings
  • Fix Unicode issues
  • Fix Khmer trailing vowels
  • URL replacement
  • Unicode normalization (NFKC)
  • Quote symbol normalization
  • Remove repeated punctuation
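
To make a few of these steps concrete, here is a minimal standard-library sketch of NFKC normalization, zero-width space removal, whitespace collapsing, and repeated-punctuation removal. It is not khmernormalizer's implementation: the function name `basic_cleanup`, the punctuation set, and the order of the steps are assumptions made for illustration only.

```python
import re
import unicodedata

ZWSP = "\u200b"  # zero-width space, often used as an invisible word separator in Khmer text


def basic_cleanup(text: str, remove_zwsp: bool = True) -> str:
    """Hypothetical helper for illustration; NOT the library's code."""
    # NFKC folds compatibility forms, e.g. full-width digits become ASCII digits.
    text = unicodedata.normalize("NFKC", text)
    if remove_zwsp:
        text = text.replace(ZWSP, "")
    # Collapse runs of spaces and tabs into one space, keeping line breaks.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse repeats of one punctuation mark (assumed set: ! ? . , and
    # the Khmer signs U+17D4 and U+17D6).
    text = re.sub(r"([!?.,\u17d4\u17d6])\1+", r"\1", text)
    return text


print(basic_cleanup("hello    world!!!!"))  # hello world!
```

The library's own `normalize` function covers more cases than this sketch, such as character reordering and misspelling fixes, which need Khmer-specific rules.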

Installation

pip install khmernormalizer

Usage

from khmernormalizer import normalize

input_str = """
តាម៖៖​សេចក្តី​រាយ​ការណ៍​​ឲ្យ​ដឹង​ថា!!!!!
https://google.com/a?x=1
កាល 😂 ពីវេលាម៉ោង    ៗ      ប្រមាណ១១យប់ថ្ងៃទី៤ 😂😂😂😂😂 ??
កាាាាត់
មិិិិិន 
មួយរយះះះះះះះ
រយះពេល
""".strip()

# Empty replacement strings drop emojis and URLs from the output.
result = normalize(
    input_str,
    emoji_replacement="",
    remove_zwsp=True,
    url_replacement="",
)
print(result)

Result:

តាម៖សេចក្តីរាយការណ៍ឱ្យដឹងថា!

កាល ពីវេលាម៉ោងៗ ប្រមាណ១១យប់ថ្ងៃទី៤?
កាត់
មិន 
មួយរយៈ
រយៈពេល
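
The result above shows URLs and emojis being replaced and runs of a repeated vowel sign being reduced to one. The sketch below approximates those three effects with regular expressions. It is an assumption-laden illustration, not the library's code: the function name `replace_and_dedupe` and all three patterns are hypothetical and deliberately simplified, and the emoji range in particular does not cover every emoji.

```python
import re

# Assumed, simplified patterns; real URL and emoji matching is broader.
URL_RE = re.compile(r"https?://\S+")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Khmer dependent vowels and signs, U+17B6 through U+17D3.
REPEATED_SIGN_RE = re.compile(r"([\u17b6-\u17d3])\1+")


def replace_and_dedupe(text: str, url_replacement: str = "", emoji_replacement: str = "") -> str:
    """Hypothetical helper for illustration; NOT the library's code."""
    text = URL_RE.sub(url_replacement, text)
    text = EMOJI_RE.sub(emoji_replacement, text)
    # Keep a single copy of any vowel or sign that is repeated back to back.
    return REPEATED_SIGN_RE.sub(r"\1", text)


print(replace_and_dedupe("see https://google.com/a?x=1 now", url_replacement="<url>"))
# see <url> now
```

This sketch only removes duplicates. Corrections such as the change from រយះ to រយៈ in the result above are spelling fixes and would need a separate mapping.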

Download files

Source Distribution

khmernormalizer-0.0.4.tar.gz (10.0 kB)

Built Distribution

khmernormalizer-0.0.4-py3-none-any.whl (9.9 kB, Python 3)

File details

Details for the file khmernormalizer-0.0.4.tar.gz.

File metadata

  • Download URL: khmernormalizer-0.0.4.tar.gz
  • Upload date:
  • Size: 10.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.0

File hashes

Hashes for khmernormalizer-0.0.4.tar.gz
Algorithm Hash digest
SHA256 f39925b0ea420b1a27f3d6bd3add973739ebc73300bbc4893c7e703a38b4d5e8
MD5 1653842ec0bd06764466a13e42950b7f
BLAKE2b-256 ec73277eae471aad7ea0927d5b26fd482e49b6109942c0ff784be76bda204ef4

File details

Details for the file khmernormalizer-0.0.4-py3-none-any.whl.

File hashes

Hashes for khmernormalizer-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 86c0cfd1021d13b754f0d8c47b1cff950b6a16152f862a09548abd88f3eb97b0
MD5 6f66ae4299ba41463eb18395ebbb47f2
BLAKE2b-256 b4506549f7c5758f70bf6ea1a2c8160a2137de12ec1b0b9446abe116ff59c5bf
