Skip to main content

A Khmer language processing toolkit

Project description

๐Ÿ…Khmer natural language processing toolkit๐Ÿ…

circleci code style pre-commit release versions fownloads license

๐ŸŽฏTODO

  • Sentence Segmentation
  • Word Segmentation
  • Part of speech Tagging
  • Named Entity Recognition
  • Text classification

๐Ÿ’ชInstallation

$ pip install khmer-nltk

๐Ÿน Quick tour

To get the evaluation result of khmer-nltk's functionalities, please refer the sub-modules's readme

Sentence tokenization

>>> from khmernltk import sentence_tokenize
>>> raw_text = "แžแžฝแž”แž†แŸ’แž“แžถแŸ†แž‘แžธแŸขแŸจ! แŸขแŸฃ แžแžปแž›แžถ แžŸแŸ’แž˜แžถแžšแžแžธแž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถแž‡แžถแžแžทแžšแžœแžถแž„แžแŸ’แž˜แŸ‚แžšแž“แžทแž„แžแŸ’แž˜แŸ‚แžš แžˆแžถแž“แž‘แŸ…แž”แž‰แŸ’แž…แž”แŸ‹แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜ แž“แžถแŸ†แž–แž“แŸ’แž›แžบแžŸแž“แŸ’แžแžทแž—แžถแž– แž“แžทแž„แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜แž‡แžถแžแŸ’แž˜แžธ"
>>> print(sentence_tokenize(raw_text))
['แžแžฝแž”แž†แŸ’แž“แžถแŸ†แž‘แžธแŸขแŸจ!', 'แŸขแŸฃ แžแžปแž›แžถ แžŸแŸ’แž˜แžถแžšแžแžธแž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถแž‡แžถแžแžทแžšแžœแžถแž„แžแŸ’แž˜แŸ‚แžšแž“แžทแž„แžแŸ’แž˜แŸ‚แžš แžˆแžถแž“แž‘แŸ…แž”แž‰แŸ’แž…แž”แŸ‹แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜ แž“แžถแŸ†แž–แž“แŸ’แž›แžบแžŸแž“แŸ’แžแžทแž—แžถแž– แž“แžทแž„แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜แž‡แžถแžแŸ’แž˜แžธ']

Word tokenization

>>> from khmernltk import word_tokenize
>>> raw_text = "แžแžฝแž”แž†แŸ’แž“แžถแŸ†แž‘แžธแŸขแŸจ! แŸขแŸฃ แžแžปแž›แžถ แžŸแŸ’แž˜แžถแžšแžแžธแž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถแž‡แžถแžแžทแžšแžœแžถแž„แžแŸ’แž˜แŸ‚แžšแž“แžทแž„แžแŸ’แž˜แŸ‚แžš แžˆแžถแž“แž‘แŸ…แž”แž‰แŸ’แž…แž”แŸ‹แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜ แž“แžถแŸ†แž–แž“แŸ’แž›แžบแžŸแž“แŸ’แžแžทแž—แžถแž– แž“แžทแž„แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜แž‡แžถแžแŸ’แž˜แžธ"
>>> print(word_tokenize(raw_text, return_tokens=True))
['แžแžฝแž”', 'แž†แŸ’แž“แžถแŸ†', 'แž‘แžธ', 'แŸขแŸจ', '!', ' ', 'แŸขแŸฃ', ' ', 'แžแžปแž›แžถ', ' ', 'แžŸแŸ’แž˜แžถแžšแžแžธ', 'แž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถ', 'แž‡แžถแžแžท', 'แžšแžœแžถแž„', 'แžแŸ’แž˜แŸ‚แžš', 'แž“แžทแž„', 'แžแŸ’แž˜แŸ‚แžš', ' ', 'แžˆแžถแž“', 'แž‘แŸ…', 'แž”แž‰แŸ’แž…แž”แŸ‹', 'แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜', ' ', 'แž“แžถแŸ†', 'แž–แž“แŸ’แž›แžบ', 'แžŸแž“แŸ’แžแžทแž—แžถแž–', ' ', 'แž“แžทแž„', 'แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜', 'แž‡แžถแžแŸ’แž˜แžธ']

POS Tagging

Usage

>>> from khmernltk import pos_tag
>>> raw_text = "แžแžฝแž”แž†แŸ’แž“แžถแŸ†แž‘แžธแŸขแŸจ! แŸขแŸฃ แžแžปแž›แžถ แžŸแŸ’แž˜แžถแžšแžแžธแž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถแž‡แžถแžแžทแžšแžœแžถแž„แžแŸ’แž˜แŸ‚แžšแž“แžทแž„แžแŸ’แž˜แŸ‚แžš แžˆแžถแž“แž‘แŸ…แž”แž‰แŸ’แž…แž”แŸ‹แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜ แž“แžถแŸ†แž–แž“แŸ’แž›แžบแžŸแž“แŸ’แžแžทแž—แžถแž– แž“แžทแž„แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜แž‡แžถแžแŸ’แž˜แžธ"
>>> print(pos_tag(raw_text))
[('แžแžฝแž”', 'n'), ('แž†แŸ’แž“แžถแŸ†', 'n'), ('แž‘แžธ', 'n'), ('แŸขแŸจ', '1'), ('!', '.'), (' ', 'n'), ('แŸขแŸฃ', '1'), (' ', 'n'), ('แžแžปแž›แžถ', 'n'), (' ', 'n'), ('แžŸแŸ’แž˜แžถแžšแžแžธ', 'n'), ('แž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถ', 'n'), ('แž‡แžถแžแžท', 'n'), ('แžšแžœแžถแž„', 'o'), ('แžแŸ’แž˜แŸ‚แžš', 'n'), ('แž“แžทแž„', 'o'), ('แžแŸ’แž˜แŸ‚แžš', 'n'), (' ', 'n'), ('แžˆแžถแž“', 'v'), ('แž‘แŸ…', 'v'), ('แž”แž‰แŸ’แž…แž”แŸ‹', 'v'), ('แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜', 'n'), (' ', 'n'), ('แž“แžถแŸ†', 'v'), ('แž–แž“แŸ’แž›แžบ', 'n'), ('แžŸแž“แŸ’แžแžทแž—แžถแž–', 'n'), (' ', 'n'), ('แž“แžทแž„', 'o'), ('แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜', 'n'), ('แž‡แžถแžแŸ’แž˜แžธ', 'o')]

โœ๏ธ Citation

@misc{hoang-khmer-nltk,
  author = {Phan Viet Hoang},
  title = {Khmer Natural Language Processing Tookit},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/VietHoang1710/khmer-nltk}}
}

๐Ÿ‘จโ€๐ŸŽ“ References

๐Ÿชถ Advisor

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

khmer-nltk-1.5.tar.gz (13.9 kB view details)

Uploaded Source

Built Distribution

khmer_nltk-1.5-py3-none-any.whl (8.1 MB view details)

Uploaded Python 3

File details

Details for the file khmer-nltk-1.5.tar.gz.

File metadata

  • Download URL: khmer-nltk-1.5.tar.gz
  • Upload date:
  • Size: 13.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/52.0.0.post20210125 requests-toolbelt/0.9.1 tqdm/4.57.0 CPython/3.7.9

File hashes

Hashes for khmer-nltk-1.5.tar.gz
Algorithm Hash digest
SHA256 fe0d0160fa99f382779052c092d9d5fbf39c138985055d1d9f79c251a4fd7962
MD5 4580ffb849e59c23a4fe21c26e7d266e
BLAKE2b-256 6b2b65a0e754b3b25cc90d56804a1271e786b298d986e57fc5daf423f855d290

See more details on using hashes here.

File details

Details for the file khmer_nltk-1.5-py3-none-any.whl.

File metadata

  • Download URL: khmer_nltk-1.5-py3-none-any.whl
  • Upload date:
  • Size: 8.1 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/52.0.0.post20210125 requests-toolbelt/0.9.1 tqdm/4.57.0 CPython/3.7.9

File hashes

Hashes for khmer_nltk-1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 07639222bb7ddeea4636b66e4df787cc7c69a3100a98f0cd865454a71204adff
MD5 776a2eea608c2dc9d259fccf6684e35f
BLAKE2b-256 16c292e0597bed896db18ca9dcf0ea2fc924b29fff5e1f42aa25370730130089

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page