Project description

LEKCut

LEKCut (เล็ก คัด) is a Thai tokenization library that ports the deep learning model to the onnx model.

Install

pip install lekcut

How to use

from lekcut import word_tokenize

# DeepCut model (default)
word_tokenize("ทดสอบการตัดคำ")
# output: ['ทดสอบ', 'การ', 'ตัด', 'คำ']

# AttaCut syllable + character model
word_tokenize("ทดสอบการตัดคำ", model="attacut-sc")
# output: ['ทดสอบ', 'การ', 'ตัด', 'คำ']

# AttaCut character-only model
word_tokenize("ทดสอบการตัดคำ", model="attacut-c")
# output: ['ทดสอบ', 'การ', 'ตัด', 'คำ']

# OSKut model
word_tokenize("เบียร์ยูไม่อร่อย", model="oskut")
# output: ['เบียร์', 'ยู', 'ไม่', 'อ', 'ร่อย']

# OSKut with a specific engine
word_tokenize("เบียร์ยูไม่อร่อย", model="oskut", engine="tnhc")
# output: ['เบียร์', 'ยู', 'ไม่', 'อร่อย']

API

word_tokenize(
    text: str,
    model: str = "deepcut",
    path: str = "default",
    providers: Optional[List[str]] = None,
    engine: str = "ws",
    k: int = 1,
) -> List[str]

Parameters:

  • text: Text to tokenize
  • model: Model to use. Options: "deepcut" (default), "attacut-sc", "attacut-c", "oskut"
  • path: Path to custom model file (default: "default", applies to deepcut and attacut-* models)
  • providers: List of ONNX Runtime execution providers (default: None, which uses the default CPU provider)
  • engine: OSKut engine variant (applies to "oskut" model only). Options: "ws" (default), "ws-augment-60p", "tnhc", "scads", "tl-deepcut-ws", "tl-deepcut-tnhc", "deepcut"
  • k: Percentage of characters to refine for OSKut (applies to "oskut" model only). The special default value of 1 is a sentinel that lets OSKut automatically select an appropriate percentage based on the engine. Pass any integer from 2 to 100 to override.
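To make the parameter contract above concrete, here is a hypothetical validation helper (not part of LEKCut) that checks model, engine, and k against the documented rules before calling word_tokenize:

```python
# Hypothetical helper mirroring the documented parameter contract.
# The option names below come from the parameter list above; the
# function itself is illustrative and not part of the LEKCut API.
MODELS = {"deepcut", "attacut-sc", "attacut-c", "oskut"}
OSKUT_ENGINES = {
    "ws", "ws-augment-60p", "tnhc", "scads",
    "tl-deepcut-ws", "tl-deepcut-tnhc", "deepcut",
}

def validate_args(model="deepcut", engine="ws", k=1):
    """Raise ValueError if arguments violate the documented contract."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if model == "oskut":
        if engine not in OSKUT_ENGINES:
            raise ValueError(f"unknown OSKut engine: {engine!r}")
        # k == 1 is the sentinel meaning "let OSKut pick"; otherwise 2-100.
        if k != 1 and not 2 <= k <= 100:
            raise ValueError("k must be 1 (auto) or an integer from 2 to 100")

validate_args(model="oskut", engine="tnhc", k=50)  # passes silently
```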

GPU Support

LEKCut supports GPU acceleration through ONNX Runtime execution providers. To use GPU acceleration:

  1. Install ONNX Runtime with GPU support:

    pip install onnxruntime-gpu
    
  2. Use the providers parameter to specify GPU execution:

    from lekcut import word_tokenize
    
    # Use CUDA GPU
    result = word_tokenize("ทดสอบการตัดคำ", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
    
    # Use TensorRT (if available)
    result = word_tokenize("ทดสอบการตัดคำ", providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'])
    

Available Execution Providers:

  • CPUExecutionProvider - Default CPU execution
  • CUDAExecutionProvider - NVIDIA CUDA GPU acceleration
  • TensorrtExecutionProvider - NVIDIA TensorRT optimization
  • DmlExecutionProvider - DirectML for Windows GPU
  • And more (see ONNX Runtime documentation)

Note: The providers are tried in order, and the first available one will be used. Always include CPUExecutionProvider as a fallback.
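The fallback behaviour can be made explicit with a small helper. The function below is a sketch (not part of LEKCut or ONNX Runtime) that filters a requested provider list against the providers actually available and always keeps CPUExecutionProvider as the final fallback:

```python
def pick_providers(requested, available):
    """Keep the requested providers that are actually available,
    in the requested order, with CPUExecutionProvider appended
    as a final fallback if it is not already present."""
    chosen = [p for p in requested if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen

# On a CPU-only machine, a CUDA request degrades gracefully:
pick_providers(
    ["CUDAExecutionProvider", "CPUExecutionProvider"],
    ["CPUExecutionProvider"],
)
# → ['CPUExecutionProvider']
```

In practice, the available list can be obtained from onnxruntime.get_available_providers() and the result passed as the providers argument to word_tokenize.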

Model

  • deepcut - We ported the deepcut model from tensorflow.keras to ONNX. The model and code come from Deepcut's GitHub.
  • attacut-sc - We ported the AttaCut syllable + character model from PyTorch to ONNX. The model and code come from AttaCut's GitHub. Requires the ssg package for syllable tokenization.
  • attacut-c - We ported the AttaCut character-only model from PyTorch to ONNX. The model and code come from AttaCut's GitHub.
  • oskut - We ported the OSKut (Out-of-domain Stacked Cut) stacked-ensemble models from TensorFlow/Keras to ONNX. The model and code come from OSKut's GitHub. Requires the pyahocorasick package. Supported engines: ws (default), ws-augment-60p, tnhc, scads, tl-deepcut-ws, tl-deepcut-tnhc, deepcut.
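The extra dependencies noted above can be summarised in a small lookup. This mapping is a sketch based solely on the list above; note that the pip package pyahocorasick is imported as ahocorasick:

```python
import importlib.util

# Extra runtime dependencies per model, as documented above.
EXTRA_DEPS = {
    "deepcut": [],
    "attacut-sc": ["ssg"],          # syllable tokenization
    "attacut-c": [],
    "oskut": ["pyahocorasick"],
}

# pip name -> import name, where they differ.
IMPORT_NAME = {"pyahocorasick": "ahocorasick"}

def missing_deps(model):
    """Return the documented extra packages not importable right now."""
    return [
        pkg for pkg in EXTRA_DEPS.get(model, [])
        if importlib.util.find_spec(IMPORT_NAME.get(pkg, pkg)) is None
    ]
```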

Load custom model

If you have trained a custom model with deepcut or another library that LEKCut supports, you can load it by passing its path to word_tokenize after porting the model to ONNX.

  • How to train a custom model on your dataset with deepcut - Notebook (you need to update deepcut/train.py before training the model)
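A sketch of how a custom model path might be checked before tokenizing. Only the path parameter itself comes from the API above; the file name and the "default" fallback handling here are assumptions for illustration:

```python
import os

def resolve_model_path(path="default"):
    """Mimic the documented default: "default" means the bundled
    model; anything else must be an existing ONNX file on disk."""
    if path == "default":
        return path  # LEKCut loads its bundled model
    if not os.path.isfile(path):
        raise FileNotFoundError(f"custom model not found: {path}")
    return path

# Hypothetical usage with a ported custom model file:
# word_tokenize("ทดสอบการตัดคำ", path=resolve_model_path("my_deepcut.onnx"))
```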

How to port a model

See notebooks/

Download files

Download the file for your platform.

Source Distribution

lekcut-1.0.0b1.tar.gz (24.9 MB)

Uploaded Source

Built Distribution

lekcut-1.0.0b1-py3-none-any.whl (24.9 MB)

Uploaded Python 3

File details

Details for the file lekcut-1.0.0b1.tar.gz.

File metadata

  • Download URL: lekcut-1.0.0b1.tar.gz
  • Upload date:
  • Size: 24.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lekcut-1.0.0b1.tar.gz

  • SHA256: 5c94a6f4ef6b70f57f98872053ea07395a6435f95be83b49b5b968fd5f85cac1
  • MD5: c038cd0b7e2b2e16f3b87e2c8bf324ea
  • BLAKE2b-256: 32f872575913dc83def3c6849b0d8f369db093dc5aeb9efaaebc0379594b6afe

File details

Details for the file lekcut-1.0.0b1-py3-none-any.whl.

File metadata

  • Download URL: lekcut-1.0.0b1-py3-none-any.whl
  • Upload date:
  • Size: 24.9 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lekcut-1.0.0b1-py3-none-any.whl

  • SHA256: 2ee8c28b97ef0f341edc017b8245c44f59cfb6a172a9a073345ad21103c3b909
  • MD5: c54ea11dbc808938400df5931f2a0e98
  • BLAKE2b-256: 2b500e1528a3b542ad2fcf5ce5f0956013af9a413c0a84131d51512d55a0acb2
