A Khmer language processing toolkit

🏅Khmer natural language processing toolkit🏅

🎯TODO

  • Sentence Segmentation
  • Word Segmentation
  • Part of speech Tagging
  • Named Entity Recognition
  • Text classification

💪Installation

pip install khmer-nltk

🏹 Quick tour

For evaluation results of khmer-nltk's functionality, please refer to each sub-module's README.

Sentence tokenization

>>> from khmernltk import sentence_tokenize
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(sentence_tokenize(raw_text))
['ខួបឆ្នាំទី២៨!', '២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី']
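If you need to map each sentence back to its position in the original text (for highlighting or annotation, say), the offsets can be recovered from the output list. The sketch below is not part of khmer-nltk: `sentence_spans` is a hypothetical helper, and it assumes each returned sentence is a verbatim substring of the input, as in the example output above. The sentence list is pasted in as a literal so the snippet runs without the library installed.

```python
from typing import List, Tuple


def sentence_spans(text: str, sentences: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) code-point offsets of each sentence within text.

    Hypothetical helper, not a khmer-nltk function. Assumes every sentence
    appears verbatim in text, in order.
    """
    spans = []
    cursor = 0
    for sentence in sentences:
        # Search from the end of the previous match so repeated
        # sentences resolve to successive positions.
        start = text.find(sentence, cursor)
        if start == -1:
            raise ValueError(f"sentence not found in text: {sentence!r}")
        end = start + len(sentence)
        spans.append((start, end))
        cursor = end
    return spans


raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"

# Copied from the sentence_tokenize output shown above.
sentences = [
    "ខួបឆ្នាំទី២៨!",
    "២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី",
]

spans = sentence_spans(raw_text, sentences)
for start, end in spans:
    print(start, end, raw_text[start:end])
```

Offsets are Python string indices (Unicode code points), so they count Khmer combining marks individually rather than visual characters.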

Word tokenization

>>> from khmernltk import word_tokenize
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(word_tokenize(raw_text, return_tokens=True))
['ខួប', 'ឆ្នាំ', 'ទី', '២៨', '!', ' ', '២៣', ' ', 'តុលា', ' ', 'ស្មារតី', 'ផ្សះផ្សា', 'ជាតិ', 'រវាង', 'ខ្មែរ', 'និង', 'ខ្មែរ', ' ', 'ឈាន', 'ទៅ', 'បញ្ចប់', 'សង្រ្គាម', ' ', 'នាំ', 'ពន្លឺ', 'សន្តិភាព', ' ', 'និង', 'ការរួបរួម', 'ជាថ្មី']
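The token list above keeps the spaces from the input as separate `' '` tokens. If a downstream step only wants the words and punctuation, a plain list comprehension is enough. This is a minimal sketch that works on the output list shown above (pasted as a literal, so it runs without khmer-nltk installed); it uses only built-in Python and does not rely on any library option for dropping whitespace.

```python
# Copied from the word_tokenize output shown above.
tokens = [
    "ខួប", "ឆ្នាំ", "ទី", "២៨", "!", " ", "២៣", " ", "តុលា", " ",
    "ស្មារតី", "ផ្សះផ្សា", "ជាតិ", "រវាង", "ខ្មែរ", "និង", "ខ្មែរ", " ",
    "ឈាន", "ទៅ", "បញ្ចប់", "សង្រ្គាម", " ", "នាំ", "ពន្លឺ", "សន្តិភាព", " ",
    "និង", "ការរួបរួម", "ជាថ្មី",
]

# str.strip() returns an empty (falsy) string for whitespace-only tokens,
# so this keeps words, digits, and punctuation and drops the spaces.
words = [token for token in tokens if token.strip()]

print(len(tokens), len(words))  # 30 24
print(words)
```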

POS Tagging

Usage

>>> from khmernltk import pos_tag
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(pos_tag(raw_text))
[('ខួប', 'n'), ('ឆ្នាំ', 'n'), ('ទី', 'n'), ('២៨', '1'), ('!', '.'), (' ', 'n'), ('២៣', '1'), (' ', 'n'), ('តុលា', 'n'), (' ', 'n'), ('ស្មារតី', 'n'), ('ផ្សះផ្សា', 'n'), ('ជាតិ', 'n'), ('រវាង', 'o'), ('ខ្មែរ', 'n'), ('និង', 'o'), ('ខ្មែរ', 'n'), (' ', 'n'), ('ឈាន', 'v'), ('ទៅ', 'v'), ('បញ្ចប់', 'v'), ('សង្រ្គាម', 'n'), (' ', 'n'), ('នាំ', 'v'), ('ពន្លឺ', 'n'), ('សន្តិភាព', 'n'), (' ', 'n'), ('និង', 'o'), ('ការរួបរួម', 'n'), ('ជាថ្មី', 'o')]
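In the output above, the space tokens also receive a tag (`'n'`), which would inflate any per-tag statistics. The sketch below tallies tags with `collections.Counter` after skipping whitespace-only tokens. It works on the tagged list shown above, pasted as a literal so it runs without khmer-nltk installed; the tag labels are used as-is, since this section does not define their meanings.

```python
from collections import Counter

# Copied from the pos_tag output shown above.
tagged = [
    ("ខួប", "n"), ("ឆ្នាំ", "n"), ("ទី", "n"), ("២៨", "1"), ("!", "."),
    (" ", "n"), ("២៣", "1"), (" ", "n"), ("តុលា", "n"), (" ", "n"),
    ("ស្មារតី", "n"), ("ផ្សះផ្សា", "n"), ("ជាតិ", "n"), ("រវាង", "o"),
    ("ខ្មែរ", "n"), ("និង", "o"), ("ខ្មែរ", "n"), (" ", "n"), ("ឈាន", "v"),
    ("ទៅ", "v"), ("បញ្ចប់", "v"), ("សង្រ្គាម", "n"), (" ", "n"), ("នាំ", "v"),
    ("ពន្លឺ", "n"), ("សន្តិភាព", "n"), (" ", "n"), ("និង", "o"),
    ("ការរួបរួម", "n"), ("ជាថ្មី", "o"),
]

# Skip whitespace-only tokens so their 'n' tags do not distort the counts.
tag_counts = Counter(tag for word, tag in tagged if word.strip())

for tag, count in tag_counts.most_common():
    print(tag, count)
```

For this sentence the tally is 13 tokens tagged `n`, 4 each for `o` and `v`, 2 for `1`, and 1 for `.`.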

✍️ Citation

@misc{hoang-khmer-nltk,
  author = {Phan Viet Hoang},
  title = {Khmer Natural Language Processing Toolkit},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/VietHoang1512/khmer-nltk}}
}
