A Khmer language processing toolkit

🏅Khmer natural language processing toolkit🏅

🎯TODO

  • Sentence Segmentation
  • Word Segmentation
  • Named Entity Recognition
  • Part-of-speech Tagging
  • Text classification

💪Installation

$ pip install khmer-nltk
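
Note that the distribution name on PyPI (khmer-nltk) differs from the import name used in the examples below (khmernltk). A minimal sketch for checking that the import name resolves in the current environment, using only the standard library; the helper name is_installed is hypothetical and not part of khmer-nltk:

```python
import importlib.util


def is_installed(module_name: str = "khmernltk") -> bool:
    """Return True if a top-level module with this name can be found.

    Hypothetical helper for illustration; it only looks the module up
    and does not import it.
    """
    return importlib.util.find_spec(module_name) is not None


if is_installed():
    print("khmernltk is importable")
else:
    print("khmernltk not found; try: pip install khmer-nltk")
```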

🍹 Quick tour

For evaluation results of khmer-nltk's features, please refer to each sub-module's README.

Sentence tokenization

>>> from khmernltk import sentence_tokenize
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(sentence_tokenize(raw_text))
['ខួបឆ្នាំទី២៨!', '២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី']
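
The sentence tokenizer can be chained with a word-level tokenizer to get one token list per sentence. The sketch below is an illustration rather than part of khmer-nltk: tokenize_by_sentence is a hypothetical helper that accepts any two callables, and the stand-in splitters exist only so the snippet runs without the package installed.

```python
from typing import Callable, List


def tokenize_by_sentence(
    text: str,
    split_sentences: Callable[[str], List[str]],
    split_words: Callable[[str], List[str]],
) -> List[List[str]]:
    """Split text into sentences, then split each sentence into tokens."""
    return [split_words(sentence) for sentence in split_sentences(text)]


def demo_split_sentences(text: str) -> List[str]:
    """Stand-in splitter (NOT khmer-nltk's algorithm): cut after each '!'."""
    parts = text.replace("!", "!\n").split("\n")
    return [part.strip() for part in parts if part.strip()]


def demo_split_words(sentence: str) -> List[str]:
    """Stand-in word splitter (NOT khmer-nltk's algorithm): split on spaces."""
    return sentence.split()


nested = tokenize_by_sentence(
    "one two! three four", demo_split_sentences, demo_split_words
)
print(nested)
```

With khmer-nltk installed, sentence_tokenize and a small wrapper such as lambda s: word_tokenize(s, return_tokens=True) could presumably be passed in place of the stand-ins.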

Word tokenization

>>> from khmernltk import word_tokenize
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(word_tokenize(raw_text, return_tokens=True))
['ខួប', 'ឆ្នាំ', 'ទី', '២៨', '!', ' ', '២៣', ' ', 'តុលា', ' ', 'ស្មារតី', 'ផ្សះផ្សា', 'ជាតិ', 'រវាង', 'ខ្មែរ', 'និង', 'ខ្មែរ', ' ', 'ឈាន', 'ទៅ', 'បញ្ចប់', 'សង្រ្គាម', ' ', 'នាំ', 'ពន្លឺ', 'សន្តិភាព', ' ', 'និង', 'ការរួបរួម', 'ជាថ្មី']
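
The token list above keeps the spaces from the input as separate ' ' tokens. When only the words matter, those can be filtered out afterwards. A minimal sketch, assuming a plain list of strings in the same shape as that output; drop_whitespace_tokens is a hypothetical helper, not a khmer-nltk function:

```python
from typing import List


def drop_whitespace_tokens(tokens: List[str]) -> List[str]:
    """Keep only tokens that contain at least one non-whitespace character."""
    return [token for token in tokens if token.strip()]


# Sample tokens in the same shape as the word_tokenize output.
tokens = ['ខួប', 'ឆ្នាំ', 'ទី', '២៨', '!', ' ', '២៣', ' ', 'តុលា']
words = drop_whitespace_tokens(tokens)
print(words)
```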

POS Tagging

Usage

>>> from khmernltk import pos_tag
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(pos_tag(raw_text))
[('ខួប', 'n'), ('ឆ្នាំ', 'n'), ('ទី', 'n'), ('២៨', '1'), ('!', '.'), (' ', 'n'), ('២៣', '1'), (' ', 'n'), ('តុលា', 'n'), (' ', 'n'), ('ស្មារតី', 'n'), ('ផ្សះផ្សា', 'n'), ('ជាតិ', 'n'), ('រវាង', 'o'), ('ខ្មែរ', 'n'), ('និង', 'o'), ('ខ្មែរ', 'n'), (' ', 'n'), ('ឈាន', 'v'), ('ទៅ', 'v'), ('បញ្ចប់', 'v'), ('សង្រ្គាម', 'n'), (' ', 'n'), ('នាំ', 'v'), ('ពន្លឺ', 'n'), ('សន្តិភាព', 'n'), (' ', 'n'), ('និង', 'o'), ('ការរួបរួម', 'n'), ('ជាថ្មី', 'o')]
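
In the output above, the ' ' tokens also receive a tag ('n'), which would inflate any per-tag statistics. A minimal sketch of counting tags while skipping whitespace tokens, assuming a list of (word, tag) pairs in the same shape as that output; tag_counts is a hypothetical helper, not part of khmer-nltk, and no meaning is assumed for the individual tag labels:

```python
from collections import Counter
from typing import List, Tuple


def tag_counts(tagged: List[Tuple[str, str]], skip_whitespace: bool = True) -> Counter:
    """Count how often each tag occurs in a list of (word, tag) pairs."""
    counts = Counter()
    for word, tag in tagged:
        # Whitespace-only tokens carry a tag too; optionally ignore them.
        if skip_whitespace and not word.strip():
            continue
        counts[tag] += 1
    return counts


# Sample pairs in the same shape as the pos_tag output.
tagged = [('ខួប', 'n'), ('ឆ្នាំ', 'n'), ('ទី', 'n'), ('២៨', '1'),
          ('!', '.'), (' ', 'n'), ('២៣', '1')]
counts = tag_counts(tagged)
print(counts)
```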

✍️ Citation

@misc{hoang-khmer-nltk,
  author = {Phan Viet Hoang},
  title = {Khmer Natural Language Processing Toolkit},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/VietHoang1710/khmer-nltk}}
}
