newmm-tokenizer
Standalone Dictionary-based, Maximum Matching + Thai Character Cluster (newmm) tokenizer extracted from PyThaiNLP.
Objectives
This repository was created to reduce the overall size of the original PyThaiNLP tokenizer module. The main objective is to segment Thai sentences into a list of words.
Supports
The module supports Python 3.7+, following the original PyThaiNLP repository.
Installation
pip install newmm-tokenizer
How to Use
from newmm_tokenizer.tokenizer import word_tokenize
text = 'เป็นเรื่องแรกที่ร้องไห้ตั้งแต่ ep 1 แล้วก็เป็นเรื่องแรกที่เลือกไม่ได้ว่าจะเชียร์พระเอกหรือพระรองดี 19...'
words = word_tokenize(text)
print(words)
# ['เป็นเรื่อง', 'แรก', 'ที่', 'ร้องไห้', 'ตั้งแต่', ' ', 'ep', ' ', '1', ' ', 'แล้วก็', 'เป็นเรื่อง', 'แรก', 'ที่', 'เลือกไม่ได้', 'ว่า', 'จะ', 'เชียร์', 'พระเอก', 'หรือ', 'พระรอง', 'ดี', ' ', '19', '...']
LICENSE
Please see the original PyThaiNLP license here
File details
Details for the file newmm_tokenizer-0.2.2.tar.gz
File metadata
- Download URL: newmm_tokenizer-0.2.2.tar.gz
- Upload date:
- Size: 314.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | e6bd825d6a05f759be1e9be67e1d603a61f961d3fa9979d4c3af21ae576250ec
MD5 | 3eedfc650dd78720f61f28bbf032c210
BLAKE2b-256 | aca80135c90ddeaae26f1e12cfd08f4a55e4ff6b2edd5228d7eaba7a55c90f0b
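Before installing from a manually downloaded archive, you can check it against the published digests. A minimal sketch using Python's standard `hashlib` (the file name and expected digest are taken from the table above; adjust the path to wherever you saved the file):

```python
import hashlib

def sha256_of_file(path: str) -> str:
    """Compute the SHA256 hex digest of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the published digest before installing, e.g.:
# expected = "e6bd825d6a05f759be1e9be67e1d603a61f961d3fa9979d4c3af21ae576250ec"
# assert sha256_of_file("newmm_tokenizer-0.2.2.tar.gz") == expected
```

Alternatively, `pip` can enforce hashes itself via a requirements file with `--hash=sha256:...` entries (hash-checking mode).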
File details
Details for the file newmm_tokenizer-0.2.2-py3-none-any.whl
File metadata
- Download URL: newmm_tokenizer-0.2.2-py3-none-any.whl
- Upload date:
- Size: 320.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2ef54c67585d0f562650c93368a07ba0e39b5c8dc4500991135c57df58da65a5
MD5 | 82817a057f1e38346cfa1290f6514eb8
BLAKE2b-256 | 96f2e93d15afba1dec377d3a4c018ec1f75214510d6a1792ea9ecee526f5089d