Skip to main content

A Khmer language processing toolkit

Project description

๐Ÿ…Khmer natural language processing toolkit๐Ÿ…

๐Ÿ’ช TODO:

  • Sentence Segmentation
  • Word Segmentation
  • Named Entity Recognition
  • Part of speech Tagging
  • Text classification

๐ŸŽฏ๐ŸŽฏ๐ŸŽฏ Installation

$ pip install khmer-nltk

๐Ÿน๐Ÿน๐Ÿน Quick tour:

Sentence tokenization:

>>> from khmernltk import sentence_tokenize
>>> raw_text = "แžแžฝแž”แž†แŸ’แž“แžถแŸ†แž‘แžธแŸขแŸจ! แŸขแŸฃ แžแžปแž›แžถ แžŸแŸ’แž˜แžถแžšแžแžธแž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถแž‡แžถแžแžทแžšแžœแžถแž„แžแŸ’แž˜แŸ‚แžšแž“แžทแž„แžแŸ’แž˜แŸ‚แžš แžˆแžถแž“แž‘แŸ…แž”แž‰แŸ’แž…แž”แŸ‹แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜ แž“แžถแŸ†แž–แž“แŸ’แž›แžบแžŸแž“แŸ’แžแžทแž—แžถแž– แž“แžทแž„แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜แž‡แžถแžแŸ’แž˜แžธ"
>>> print(sentence_tokenize(raw_text))
['แžแžฝแž”แž†แŸ’แž“แžถแŸ†แž‘แžธแŸขแŸจ!', 'แŸขแŸฃ แžแžปแž›แžถ แžŸแŸ’แž˜แžถแžšแžแžธแž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถแž‡แžถแžแžทแžšแžœแžถแž„แžแŸ’แž˜แŸ‚แžšแž“แžทแž„แžแŸ’แž˜แŸ‚แžš แžˆแžถแž“แž‘แŸ…แž”แž‰แŸ’แž…แž”แŸ‹แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜ แž“แžถแŸ†แž–แž“แŸ’แž›แžบแžŸแž“แŸ’แžแžทแž—แžถแž– แž“แžทแž„แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜แž‡แžถแžแŸ’แž˜แžธ']

Word tokenization:

>>> from khmernltk import word_tokenize
>>> raw_text = "แžแžฝแž”แž†แŸ’แž“แžถแŸ†แž‘แžธแŸขแŸจ! แŸขแŸฃ แžแžปแž›แžถ แžŸแŸ’แž˜แžถแžšแžแžธแž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถแž‡แžถแžแžทแžšแžœแžถแž„แžแŸ’แž˜แŸ‚แžšแž“แžทแž„แžแŸ’แž˜แŸ‚แžš แžˆแžถแž“แž‘แŸ…แž”แž‰แŸ’แž…แž”แŸ‹แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜ แž“แžถแŸ†แž–แž“แŸ’แž›แžบแžŸแž“แŸ’แžแžทแž—แžถแž– แž“แžทแž„แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜แž‡แžถแžแŸ’แž˜แžธ"
>>> print(word_tokenize(raw_text, return_tokens=True))
['แžแžฝแž”', 'แž†แŸ’แž“แžถแŸ†', 'แž‘แžธ', 'แŸขแŸจ', '!', ' ', 'แŸขแŸฃ', ' ', 'แžแžปแž›แžถ', ' ', 'แžŸแŸ’แž˜แžถแžšแžแžธ', 'แž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถ', 'แž‡แžถแžแžท', 'แžšแžœแžถแž„', 'แžแŸ’แž˜แŸ‚แžš', 'แž“แžทแž„', 'แžแŸ’แž˜แŸ‚แžš', ' ', 'แžˆแžถแž“', 'แž‘แŸ…', 'แž”แž‰แŸ’แž…แž”แŸ‹', 'แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜', ' ', 'แž“แžถแŸ†', 'แž–แž“แŸ’แž›แžบ', 'แžŸแž“แŸ’แžแžทแž—แžถแž–', ' ', 'แž“แžทแž„', 'แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜', 'แž‡แžถแžแŸ’แž˜แžธ']

POS Tagging:

Usage:

>>> from khmernltk import pos_tag
>>> raw_text = "แžแžฝแž”แž†แŸ’แž“แžถแŸ†แž‘แžธแŸขแŸจ! แŸขแŸฃ แžแžปแž›แžถ แžŸแŸ’แž˜แžถแžšแžแžธแž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถแž‡แžถแžแžทแžšแžœแžถแž„แžแŸ’แž˜แŸ‚แžšแž“แžทแž„แžแŸ’แž˜แŸ‚แžš แžˆแžถแž“แž‘แŸ…แž”แž‰แŸ’แž…แž”แŸ‹แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜ แž“แžถแŸ†แž–แž“แŸ’แž›แžบแžŸแž“แŸ’แžแžทแž—แžถแž– แž“แžทแž„แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜แž‡แžถแžแŸ’แž˜แžธ"
>>> print(pos_tag(raw_text))
[('แžแžฝแž”', 'n'), ('แž†แŸ’แž“แžถแŸ†', 'n'), ('แž‘แžธ', 'n'), ('แŸขแŸจ', '1'), ('!', '.'), (' ', 'n'), ('แŸขแŸฃ', '1'), (' ', 'n'), ('แžแžปแž›แžถ', 'n'), (' ', 'n'), ('แžŸแŸ’แž˜แžถแžšแžแžธ', 'n'), ('แž•แŸ’แžŸแŸ‡แž•แŸ’แžŸแžถ', 'n'), ('แž‡แžถแžแžท', 'n'), ('แžšแžœแžถแž„', 'o'), ('แžแŸ’แž˜แŸ‚แžš', 'n'), ('แž“แžทแž„', 'o'), ('แžแŸ’แž˜แŸ‚แžš', 'n'), (' ', 'n'), ('แžˆแžถแž“', 'v'), ('แž‘แŸ…', 'v'), ('แž”แž‰แŸ’แž…แž”แŸ‹', 'v'), ('แžŸแž„แŸ’แžšแŸ’แž‚แžถแž˜', 'n'), (' ', 'n'), ('แž“แžถแŸ†', 'v'), ('แž–แž“แŸ’แž›แžบ', 'n'), ('แžŸแž“แŸ’แžแžทแž—แžถแž–', 'n'), (' ', 'n'), ('แž“แžทแž„', 'o'), ('แž€แžถแžšแžšแžฝแž”แžšแžฝแž˜', 'n'), ('แž‡แžถแžแŸ’แž˜แžธ', 'o')]

โœ๏ธโœ๏ธโœ๏ธ Citation

@misc{hoang-khmer-nltk,
  author = {Phan Viet Hoang},
  title = {Khmer Natural Language Processing Tookit},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/VietHoang1710/khmer-nltk}}
}

๐Ÿ‘จโ€๐ŸŽ“๐Ÿ‘จโ€๐ŸŽ“๐Ÿ‘จโ€๐ŸŽ“ References:

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

khmer-nltk-1.1.tar.gz (14.0 kB view details)

Uploaded Source

Built Distribution

khmer_nltk-1.1-py3-none-any.whl (7.0 MB view details)

Uploaded Python 3

File details

Details for the file khmer-nltk-1.1.tar.gz.

File metadata

  • Download URL: khmer-nltk-1.1.tar.gz
  • Upload date:
  • Size: 14.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.7

File hashes

Hashes for khmer-nltk-1.1.tar.gz
Algorithm Hash digest
SHA256 fb25570714a89d095c148c0071b516b4149406ba4c352ce128abde417d47a05a
MD5 01ac40f7e6d5e7b91064d4a16c000c36
BLAKE2b-256 cf69fe39bdde57897ef0aa6fc1ce3744ecdb490887cba8f5c005ea366dfaf049

See more details on using hashes here.

File details

Details for the file khmer_nltk-1.1-py3-none-any.whl.

File metadata

  • Download URL: khmer_nltk-1.1-py3-none-any.whl
  • Upload date:
  • Size: 7.0 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.7

File hashes

Hashes for khmer_nltk-1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 9d094ef5f9bfe84c021935e05261cc80129191a9ed15b96c20c2674e2b4a29de
MD5 f0faf38e0741fe15b24ace3bc6233811
BLAKE2b-256 5679e9392a5e0c110720aebc47f67104a8dd344f313943747e7141fd660b8d5b

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page