newmm-tokenizer
Standalone Dictionary-based, Maximum Matching + Thai Character Cluster (newmm) tokenizer extracted from PyThaiNLP.
Objectives
This repository was created to reduce the overall size of the original PyThaiNLP tokenizer module. Its main objective is to segment Thai sentences into a list of words.
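To give a rough idea of what dictionary-based matching means, here is a toy sketch of greedy longest matching. It is not the package's implementation: the function name `longest_match_tokenize` and the tiny `toy_dictionary` are made up for illustration, and the sketch ignores Thai Character Cluster boundaries, which the newmm approach takes into account.

```python
def longest_match_tokenize(text, dictionary):
    """Greedy longest matching: at each position take the longest dictionary word.

    Hypothetical helper for illustration only; not part of newmm-tokenizer.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


# A tiny made-up dictionary; a real tokenizer ships a much larger word list.
toy_dictionary = {"พระเอก", "พระรอง", "หรือ", "ดี", "พระ"}
print(longest_match_tokenize("พระเอกหรือพระรองดี", toy_dictionary))
# ['พระเอก', 'หรือ', 'พระรอง', 'ดี']
```

A purely greedy matcher like this can split unknown text in the middle of a character cluster, which is one reason to prefer the packaged tokenizer for real input.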
Supports
The module supports Python 3.7+, following the original PyThaiNLP repository.
Installation
pip install newmm-tokenizer
How to Use
from newmm_tokenizer.tokenizer import word_tokenize
text = 'เป็นเรื่องแรกที่ร้องไห้ตั้งแต่ ep 1 แล้วก็เป็นเรื่องแรกที่เลือกไม่ได้ว่าจะเชียร์พระเอกหรือพระรองดี 19...'
words = word_tokenize(text)
print(words)
# ['เป็นเรื่อง', 'แรก', 'ที่', 'ร้องไห้', 'ตั้งแต่', ' ', 'ep', ' ', '1', ' ', 'แล้วก็', 'เป็นเรื่อง', 'แรก', 'ที่', 'เลือกไม่ได้', 'ว่า', 'จะ', 'เชียร์', 'พระเอก', 'หรือ', 'พระรอง', 'ดี', ' ', '19', '...']
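As the output above shows, spaces are returned as separate tokens. If only the words are needed, one possible post-processing step is to filter them out. The helper name `drop_whitespace` below is hypothetical and not part of the package; the token list is copied from the start of the example output so the snippet runs on its own.

```python
# Tokens copied from the beginning of the example output above.
tokens = ['เป็นเรื่อง', 'แรก', 'ที่', 'ร้องไห้', 'ตั้งแต่', ' ', 'ep', ' ', '1', ' ', 'แล้วก็']


def drop_whitespace(tokens):
    """Return only the tokens that contain a non-whitespace character.

    Hypothetical helper for illustration; not provided by newmm-tokenizer.
    """
    return [tok for tok in tokens if tok.strip()]


words_only = drop_whitespace(tokens)
print(words_only)
# ['เป็นเรื่อง', 'แรก', 'ที่', 'ร้องไห้', 'ตั้งแต่', 'ep', '1', 'แล้วก็']
```

In practice, the list returned by `word_tokenize(text)` would be passed to the helper in place of the hard-coded `tokens`.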
LICENSE
Please see the original PyThaiNLP license.
File details
Details for the file newmm_tokenizer-0.2.2.tar.gz.
File metadata
- Download URL: newmm_tokenizer-0.2.2.tar.gz
- Upload date:
- Size: 314.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e6bd825d6a05f759be1e9be67e1d603a61f961d3fa9979d4c3af21ae576250ec |
| MD5 | 3eedfc650dd78720f61f28bbf032c210 |
| BLAKE2b-256 | aca80135c90ddeaae26f1e12cfd08f4a55e4ff6b2edd5228d7eaba7a55c90f0b |
File details
Details for the file newmm_tokenizer-0.2.2-py3-none-any.whl.
File metadata
- Download URL: newmm_tokenizer-0.2.2-py3-none-any.whl
- Upload date:
- Size: 320.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2ef54c67585d0f562650c93368a07ba0e39b5c8dc4500991135c57df58da65a5 |
| MD5 | 82817a057f1e38346cfa1290f6514eb8 |
| BLAKE2b-256 | 96f2e93d15afba1dec377d3a4c018ec1f75214510d6a1792ea9ecee526f5089d |