Skip to main content

A Khmer language processing toolkit

Project description

๐Ÿ…Khmer natural language processing toolkit๐Ÿ…

code style pre-commit release versions fownloads license

๐ŸŽฏTODO

  • Sentence Segmentation
  • Word Segmentation
  • Named Entity Recognition
  • Part of speech Tagging
  • Text classification

๐Ÿ’ชInstallation

$ pip install khmer-nltk

๐Ÿน Quick tour

To get the evaluation result of khmer-nltk's functionalities, please refer the sub-modules's readme

Sentence tokenization

>>> from khmernltk import sentence_tokenize
>>> raw_text = "แžแžฝแž”แž†แŸ’แž“แžถแŸ†แž‘แžธแŸขแŸจ! แŸขแŸฃ แžแžปแž›แžถ แžŸแŸ’แž˜แžถแžšแžแžธแž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถแž‡แžถแžแžทแžšแžœแžถแž„แžแŸ’แž˜แŸ‚แžšแž“แžทแž„แžแŸ’แž˜แŸ‚แžš แžˆแžถแž“แž‘แŸ…แž”แž‰แŸ’แž…แž”แŸ‹แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜ แž“แžถแŸ†แž–แž“แŸ’แž›แžบแžŸแž“แŸ’แžแžทแž—แžถแž– แž“แžทแž„แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜แž‡แžถแžแŸ’แž˜แžธ"
>>> print(sentence_tokenize(raw_text))
['แžแžฝแž”แž†แŸ’แž“แžถแŸ†แž‘แžธแŸขแŸจ!', 'แŸขแŸฃ แžแžปแž›แžถ แžŸแŸ’แž˜แžถแžšแžแžธแž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถแž‡แžถแžแžทแžšแžœแžถแž„แžแŸ’แž˜แŸ‚แžšแž“แžทแž„แžแŸ’แž˜แŸ‚แžš แžˆแžถแž“แž‘แŸ…แž”แž‰แŸ’แž…แž”แŸ‹แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜ แž“แžถแŸ†แž–แž“แŸ’แž›แžบแžŸแž“แŸ’แžแžทแž—แžถแž– แž“แžทแž„แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜แž‡แžถแžแŸ’แž˜แžธ']

Word tokenization

>>> from khmernltk import word_tokenize
>>> raw_text = "แžแžฝแž”แž†แŸ’แž“แžถแŸ†แž‘แžธแŸขแŸจ! แŸขแŸฃ แžแžปแž›แžถ แžŸแŸ’แž˜แžถแžšแžแžธแž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถแž‡แžถแžแžทแžšแžœแžถแž„แžแŸ’แž˜แŸ‚แžšแž“แžทแž„แžแŸ’แž˜แŸ‚แžš แžˆแžถแž“แž‘แŸ…แž”แž‰แŸ’แž…แž”แŸ‹แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜ แž“แžถแŸ†แž–แž“แŸ’แž›แžบแžŸแž“แŸ’แžแžทแž—แžถแž– แž“แžทแž„แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜แž‡แžถแžแŸ’แž˜แžธ"
>>> print(word_tokenize(raw_text, return_tokens=True))
['แžแžฝแž”', 'แž†แŸ’แž“แžถแŸ†', 'แž‘แžธ', 'แŸขแŸจ', '!', ' ', 'แŸขแŸฃ', ' ', 'แžแžปแž›แžถ', ' ', 'แžŸแŸ’แž˜แžถแžšแžแžธ', 'แž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถ', 'แž‡แžถแžแžท', 'แžšแžœแžถแž„', 'แžแŸ’แž˜แŸ‚แžš', 'แž“แžทแž„', 'แžแŸ’แž˜แŸ‚แžš', ' ', 'แžˆแžถแž“', 'แž‘แŸ…', 'แž”แž‰แŸ’แž…แž”แŸ‹', 'แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜', ' ', 'แž“แžถแŸ†', 'แž–แž“แŸ’แž›แžบ', 'แžŸแž“แŸ’แžแžทแž—แžถแž–', ' ', 'แž“แžทแž„', 'แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜', 'แž‡แžถแžแŸ’แž˜แžธ']

POS Tagging

Usage

>>> from khmernltk import pos_tag
>>> raw_text = "แžแžฝแž”แž†แŸ’แž“แžถแŸ†แž‘แžธแŸขแŸจ! แŸขแŸฃ แžแžปแž›แžถ แžŸแŸ’แž˜แžถแžšแžแžธแž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถแž‡แžถแžแžทแžšแžœแžถแž„แžแŸ’แž˜แŸ‚แžšแž“แžทแž„แžแŸ’แž˜แŸ‚แžš แžˆแžถแž“แž‘แŸ…แž”แž‰แŸ’แž…แž”แŸ‹แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜ แž“แžถแŸ†แž–แž“แŸ’แž›แžบแžŸแž“แŸ’แžแžทแž—แžถแž– แž“แžทแž„แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜แž‡แžถแžแŸ’แž˜แžธ"
>>> print(pos_tag(raw_text))
[('แžแžฝแž”', 'n'), ('แž†แŸ’แž“แžถแŸ†', 'n'), ('แž‘แžธ', 'n'), ('แŸขแŸจ', '1'), ('!', '.'), (' ', 'n'), ('แŸขแŸฃ', '1'), (' ', 'n'), ('แžแžปแž›แžถ', 'n'), (' ', 'n'), ('แžŸแŸ’แž˜แžถแžšแžแžธ', 'n'), ('แž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถ', 'n'), ('แž‡แžถแžแžท', 'n'), ('แžšแžœแžถแž„', 'o'), ('แžแŸ’แž˜แŸ‚แžš', 'n'), ('แž“แžทแž„', 'o'), ('แžแŸ’แž˜แŸ‚แžš', 'n'), (' ', 'n'), ('แžˆแžถแž“', 'v'), ('แž‘แŸ…', 'v'), ('แž”แž‰แŸ’แž…แž”แŸ‹', 'v'), ('แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜', 'n'), (' ', 'n'), ('แž“แžถแŸ†', 'v'), ('แž–แž“แŸ’แž›แžบ', 'n'), ('แžŸแž“แŸ’แžแžทแž—แžถแž–', 'n'), (' ', 'n'), ('แž“แžทแž„', 'o'), ('แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜', 'n'), ('แž‡แžถแžแŸ’แž˜แžธ', 'o')]

โœ๏ธ Citation

@misc{hoang-khmer-nltk,
  author = {Phan Viet Hoang},
  title = {Khmer Natural Language Processing Tookit},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/VietHoang1710/khmer-nltk}}
}

๐Ÿ‘จโ€๐ŸŽ“ References

๐Ÿชถ Advisor

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

khmer-nltk-1.4.tar.gz (13.7 kB view details)

Uploaded Source

Built Distribution

khmer_nltk-1.4-py3-none-any.whl (8.1 MB view details)

Uploaded Python 3

File details

Details for the file khmer-nltk-1.4.tar.gz.

File metadata

  • Download URL: khmer-nltk-1.4.tar.gz
  • Upload date:
  • Size: 13.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/52.0.0.post20210125 requests-toolbelt/0.9.1 tqdm/4.57.0 CPython/3.7.9

File hashes

Hashes for khmer-nltk-1.4.tar.gz
Algorithm Hash digest
SHA256 5d1da09fd9487bd4b8a8989a50c454bcb823d0ca240d4165d15eb7ae096c9253
MD5 d6e67d766487ef4ba6458fca65c952a7
BLAKE2b-256 68baa7bdf306f27211e9080d20d9ddacf8998e388e6c36aa798c122b5aa484db

See more details on using hashes here.

File details

Details for the file khmer_nltk-1.4-py3-none-any.whl.

File metadata

  • Download URL: khmer_nltk-1.4-py3-none-any.whl
  • Upload date:
  • Size: 8.1 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/52.0.0.post20210125 requests-toolbelt/0.9.1 tqdm/4.57.0 CPython/3.7.9

File hashes

Hashes for khmer_nltk-1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 3460e6f26cc5b7d9e45b621c4fe7bda938a794d5ad646d9b98f90f50ba938140
MD5 b00bcd6154e4c91bf3180ad48d45d592
BLAKE2b-256 df9f41ee989242f01e35dc863212a5c636db1db29850bde507bf1bb7ecf2db1a

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page