Thai Tokenizer

A fast and accurate Thai tokenization library that uses supervised BPE (byte-pair encoding), designed for full-text search applications.

Installation

pip3 install thai_tokenizer

Usage

The default set of BPE pairs is optimized for short Thai-English product descriptions.

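from thai_tokenizer import Tokenizer

tokenizer = Tokenizer()  # uses the default set of pairs
# Calling the tokenizer returns the text with spaces inserted between tokens.
tokenizer('iPad Mini 256GB เครื่องไทย')  #> 'iPad Mini 256GB เครื่อง ไทย'
# split() returns the tokens as a list.
tokenizer.split('เครื่องไทย')  #> ['เครื่อง', 'ไทย']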
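
Since the library targets full-text search, the following is a minimal sketch of how the space-separated output might feed a small inverted index. The names `build_index`, `search`, and `fake_tokenize` are hypothetical and not part of thai_tokenizer. The stand-in tokenizer only reproduces the single segmentation shown above so that the sketch runs without the package installed.

```python
from collections import defaultdict


def build_index(docs, tokenize):
    """Map each lowercased token to the set of document ids containing it.

    `tokenize` is any callable that returns a space-separated string,
    which is the shape shown above for `tokenizer(text)`.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text).split():
            index[token.lower()].add(doc_id)
    return index


def search(index, query, tokenize):
    """Return the ids of documents that contain every token of the query."""
    tokens = [t.lower() for t in tokenize(query).split()]
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical stand-in (an assumption, not the library): it only knows
# the one segmentation shown in the usage example above.
def fake_tokenize(text):
    return text.replace('เครื่องไทย', 'เครื่อง ไทย')


docs = {1: 'iPad Mini 256GB เครื่องไทย', 2: 'iPad Air 64GB'}
index = build_index(docs, fake_tokenize)
print(search(index, 'ไทย', fake_tokenize))
```

With the package installed, passing a `Tokenizer()` instance in place of `fake_tokenize` should work in the same way, assuming it returns a space-separated string as in the usage example.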

Training

See the Training guide for guidelines on training your own pairs.
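
To give a rough idea of what a set of pairs does, here is a generic sketch of applying ordered BPE merges to a string. It illustrates the general byte-pair-encoding technique only. The function `apply_merges` and the toy merge list are hypothetical, and the way thai_tokenizer actually stores, trains, or applies its pairs is not described here.

```python
def apply_merges(text, merges):
    """Segment `text` by applying ordered BPE merges to its characters.

    `merges` is a list of (left, right) symbol pairs; earlier pairs are
    applied first. This is a generic BPE illustration, not the library's
    actual implementation.
    """
    symbols = list(text)  # start from single characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Join the two symbols whenever the current pair matches.
            if (i + 1 < len(symbols)
                    and symbols[i] == left
                    and symbols[i + 1] == right):
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy merge table (an assumption, chosen for illustration only).
toy_merges = [('l', 'o'), ('lo', 'w'), ('e', 'r'), ('low', 'er')]
print(apply_merges('lowlower', toy_merges))  # ['low', 'lower']
```

Under this view, training amounts to choosing which pairs go into the table and in what order, which is why a table tuned on one kind of text may segment another kind less well.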

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT

Download files

Source Distribution

thai_tokenizer-0.2.5.tar.gz (6.5 kB)

Built Distribution

thai_tokenizer-0.2.5-py3-none-any.whl (52.5 kB, Python 3)
