LEKCut
LEKCut (เล็ก คัด) is a Thai tokenization library that ports the deep learning model to the onnx model.
Install
pip install lekcut
How to use
from lekcut import word_tokenize
# DeepCut model (default)
word_tokenize("ทดสอบการตัดคำ")
# output: ['ทดสอบ', 'การ', 'ตัด', 'คำ']
# AttaCut syllable + character model
word_tokenize("ทดสอบการตัดคำ", model="attacut-sc")
# output: ['ทดสอบ', 'การ', 'ตัด', 'คำ']
# AttaCut character-only model
word_tokenize("ทดสอบการตัดคำ", model="attacut-c")
# output: ['ทดสอบ', 'การ', 'ตัด', 'คำ']
# OSKut model
word_tokenize("เบียร์ยูไม่อร่อย", model="oskut")
# output: ['เบียร์', 'ยู', 'ไม่', 'อ', 'ร่อย']
# OSKut with a specific engine
word_tokenize("เบียร์ยูไม่อร่อย", model="oskut", engine="tnhc")
# output: ['เบียร์', 'ยู', 'ไม่', 'อร่อย']
# SEFR_CUT model
word_tokenize("เบียร์ยูไม่อร่อย", model="sefr-tnhc")
# output: ['เบียร์', 'ยู', 'ไม่', 'อร่อย']
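The per-model calls above can be wrapped in a small comparison helper. This is a sketch, not part of LEKCut's API; it takes the tokenizer function as an argument, so it works with `word_tokenize` or any stand-in with the same `(text, model=...)` shape:

```python
from typing import Callable, Dict, List

def compare_models(text: str,
                   tokenize: Callable[..., List[str]],
                   models: List[str]) -> Dict[str, List[str]]:
    # Run the same text through several models and collect each token
    # list, keyed by model name, so segmentations can be compared side
    # by side.
    return {m: tokenize(text, model=m) for m in models}
```

With LEKCut installed, you would pass `word_tokenize` as the `tokenize` argument and a list such as `["deepcut", "attacut-sc", "oskut"]`.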
API
word_tokenize(
text: str,
model: str = "deepcut",
path: str = "default",
providers: List[str] = None,
engine: str = "ws",
k: int = 1,
) -> List[str]
Parameters:
- text: Text to tokenize
- model: Model to use. Options: "deepcut" (default), "attacut-sc", "attacut-c", "oskut", "sefr-best", "sefr-tnhc", "sefr-ws1000"
- path: Path to a custom model file (default: "default"; applies to deepcut and attacut-* models)
- providers: List of ONNX Runtime execution providers (default: None, which uses the default CPU provider)
- engine: OSKut engine variant (applies to the "oskut" model only). Options: "ws" (default), "ws-augment-60p", "tnhc", "scads", "tl-deepcut-ws", "tl-deepcut-tnhc", "deepcut"
- k: Percentage of characters to refine for OSKut (applies to the "oskut" model only). The default value of 1 is a sentinel that lets OSKut automatically select an appropriate percentage for the engine. Pass an integer from 2 to 100 to override.
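The sentinel semantics of `k` can be made concrete with a small helper. This is a hypothetical function mirroring the documented contract, not LEKCut's internal code:

```python
from typing import Optional

def resolve_oskut_k(k: int) -> Optional[int]:
    # Interpret `k` as documented: 1 is a sentinel meaning "let OSKut
    # pick a percentage for the engine"; 2..100 is an explicit
    # percentage of characters to refine. Anything else is invalid.
    if k == 1:
        return None  # auto: OSKut selects the percentage itself
    if 2 <= k <= 100:
        return k
    raise ValueError("k must be 1 (auto) or an integer from 2 to 100")
```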
GPU Support
LEKCut supports GPU acceleration through ONNX Runtime execution providers. To use GPU acceleration:
- Install ONNX Runtime with GPU support:

pip install onnxruntime-gpu

- Use the providers parameter to specify GPU execution:

from lekcut import word_tokenize

# Use CUDA GPU
result = word_tokenize("ทดสอบการตัดคำ", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])

# Use TensorRT (if available)
result = word_tokenize("ทดสอบการตัดคำ", providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'])
Available Execution Providers:
- CPUExecutionProvider - Default CPU execution
- CUDAExecutionProvider - NVIDIA CUDA GPU acceleration
- TensorrtExecutionProvider - NVIDIA TensorRT optimization
- DmlExecutionProvider - DirectML for Windows GPU
- And more (see the ONNX Runtime documentation)
Note: The providers are tried in order, and the first available one will be used. Always include CPUExecutionProvider as a fallback.
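The in-order fallback rule can be illustrated with a short sketch. This is only illustrative (ONNX Runtime performs this selection itself when given the providers list); the function names and logic here are assumptions, not LEKCut code:

```python
from typing import List, Optional

def choose_providers(requested: Optional[List[str]],
                     available: List[str]) -> List[str]:
    # Mirror the documented fallback behavior: keep the requested
    # providers that are actually available, preserving order, and
    # always end with CPUExecutionProvider as a safety net.
    if not requested:
        return ["CPUExecutionProvider"]
    chosen = [p for p in requested if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen
```

In a real program, `available` would come from `onnxruntime.get_available_providers()`.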
Model
- deepcut - We ported the DeepCut model from tensorflow.keras to ONNX. The model and code come from DeepCut's GitHub.
- attacut-sc - We ported the AttaCut syllable + character model from PyTorch to ONNX. The model and code come from AttaCut's GitHub. Requires the ssg package for syllable tokenization.
- attacut-c - We ported the AttaCut character-only model from PyTorch to ONNX. The model and code come from AttaCut's GitHub.
- oskut - We ported the OSKut (Out-of-domain Stacked Cut) stacked ensemble models from TensorFlow/Keras to ONNX. The model and code come from OSKut's GitHub. Requires the pyahocorasick package. Supports multiple engines: ws (default), ws-augment-60p, tnhc, scads, tl-deepcut-ws, tl-deepcut-tnhc, deepcut.
- SEFR_CUT - We ported the SEFR CUT (Stacked Ensemble Filter and Refine for Word Segmentation) model from PyTorch to ONNX. The model and code come from SEFR_CUT's GitHub. Available models: "sefr-best", "sefr-tnhc", "sefr-ws1000"
Load custom model
If you have trained a custom model from DeepCut or another model that LEKCut supports, you can load it by passing its path to word_tokenize after porting the model to ONNX.
- How to train a custom model on your dataset with DeepCut - Notebook (you need to update deepcut/train.py before training the model)
How to port a model
See notebooks/
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file lekcut-1.0.0.tar.gz.
File metadata
- Download URL: lekcut-1.0.0.tar.gz
- Upload date:
- Size: 24.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5156330d6ce36812dd9fa710bc43918a3cdbff3823cef822e61bbad7728de589 |
| MD5 | 00d488d398fa00bf812ae91e0969cc47 |
| BLAKE2b-256 | 3142d5fc35d4babf2e0354c7bf74acca840a1657c38c55edc6a054041d3eb5a1 |
File details
Details for the file lekcut-1.0.0-py3-none-any.whl.
File metadata
- Download URL: lekcut-1.0.0-py3-none-any.whl
- Upload date:
- Size: 24.9 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f9b0be4bcb01215f6615efb0d44f48b96f27435264d1d45240985fe2a085201e |
| MD5 | a3412d0a72f3fb385b0d628042a6c03c |
| BLAKE2b-256 | d3415117f08cc9c9d7e83174877d509f76610eea8164dd999dfc574dedcdff81 |