newmm-tokenizer

Standalone Dictionary-based, Maximum Matching + Thai Character Cluster (newmm) tokenizer extracted from PyThaiNLP.
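
For readers unfamiliar with dictionary-based matching, the sketch below shows a toy greedy longest-match segmenter. It is not the package's implementation: newmm also uses Thai Character Cluster boundaries and a more careful search, and the function name `longest_match_tokenize` and the tiny dictionary here are hypothetical, chosen only for illustration.

```python
def longest_match_tokenize(text, dictionary):
    """Toy greedy longest-match segmentation (illustrative only, not newmm)."""
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    pos = 0
    while pos < len(text):
        longest = min(max_len, len(text) - pos)
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(longest, 0, -1):
            candidate = text[pos:pos + size]
            if candidate in dictionary:
                break
        else:
            # No dictionary entry starts here: emit a single character.
            candidate = text[pos]
        tokens.append(candidate)
        pos += len(candidate)
    return tokens


# Hypothetical four-word dictionary; a real tokenizer ships a large word list.
toy_dictionary = {"ตา", "ตาก", "ลม", "กลม"}
toy_tokens = longest_match_tokenize("ตากลม", toy_dictionary)
print(toy_tokens)
```

A purely greedy matcher commits to the longest word at each position, so it cannot reconsider an early choice when the input is ambiguous. That limitation is one reason practical tokenizers add further constraints on top of plain dictionary lookup.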

Objectives

This repository was created to reduce the overall size of the original PyThaiNLP tokenizer module. Its main objective is to segment Thai sentences into a list of words.

Supports

The module supports Python 3.7+, following the original PyThaiNLP repository.

Installation

pip install newmm-tokenizer

How to Use

from newmm_tokenizer.tokenizer import word_tokenize

text = 'เป็นเรื่องแรกที่ร้องไห้ตั้งแต่ ep 1 แล้วก็เป็นเรื่องแรกที่เลือกไม่ได้ว่าจะเชียร์พระเอกหรือพระรองดี 19...'
words = word_tokenize(text)

print(words) 
# ['เป็นเรื่อง', 'แรก', 'ที่', 'ร้องไห้', 'ตั้งแต่', ' ', 'ep', ' ', '1', ' ', 'แล้วก็', 'เป็นเรื่อง', 'แรก', 'ที่', 'เลือกไม่ได้', 'ว่า', 'จะ', 'เชียร์', 'พระเอก', 'หรือ', 'พระรอง', 'ดี', ' ', '19', '...']
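
As the output above shows, whitespace is returned as separate tokens. If only the content tokens are needed, they can be filtered afterwards. The following is a minimal sketch that works on a copy of the first few tokens from the output above, so it runs without the package installed; the helper name `drop_whitespace_tokens` is hypothetical and not part of newmm-tokenizer.

```python
def drop_whitespace_tokens(tokens):
    """Return only the tokens that contain at least one non-whitespace character."""
    return [token for token in tokens if token.strip()]


# First nine tokens copied from the example output above.
sample_tokens = ['เป็นเรื่อง', 'แรก', 'ที่', 'ร้องไห้', 'ตั้งแต่', ' ', 'ep', ' ', '1']
content_tokens = drop_whitespace_tokens(sample_tokens)
print(content_tokens)
```

In real use, the list returned by `word_tokenize(text)` would be passed to the helper in place of `sample_tokens`.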

LICENSE

Please see the license of the original PyThaiNLP project.
