Fast and Reasonably Accurate Word Tokenizer for Thai
AttaCut: Fast and Reasonably Accurate Word Tokenizer for Thai
TLDR: 3-Layer dilated CNN on character and syllable features
Installation
$ pip install attacut
Usage
Command-Line Interface
$ attacut-cli -h
AttaCut: Fast and Reasonably Accurate Tokenizer for Thai
Usage:
attacut-cli <src> [--dest=<dest>] [--model=<model>]
attacut-cli (-h | --help)
Options:
-h --help Show this screen.
--model=<model> Model to be used [default: attacut-sc].
--dest=<dest> If not specified, it'll be <src>-tokenized-by-<model>.txt
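The default output name described above can be sketched in Python. This is only an illustration, not AttaCut's own code: `default_dest` is a hypothetical helper, and it assumes that `<src>` in the pattern means the source path without its file extension.

```python
from pathlib import Path


def default_dest(src: str, model: str = "attacut-sc") -> Path:
    """Hypothetical helper: build <src>-tokenized-by-<model>.txt.

    Assumption: <src> is the source path with its extension removed,
    and the output is written next to the source file.
    """
    p = Path(src)
    return p.with_name(f"{p.stem}-tokenized-by-{model}.txt")


# Example: where the output would land for a given input path
print(default_dest("corpus/input.txt").as_posix())
```

If the actual CLI keeps the extension or writes elsewhere, passing --dest explicitly avoids depending on this naming rule.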
Higher-Level Interface
a.k.a. importing AttaCut as a module
from attacut import Tokenizer
# `txt` is a string of Thai text to tokenize
atta = Tokenizer(model="attacut-sc")
words = atta.tokenize(txt)
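For tokenizing many lines at once, a small wrapper may help. The sketch below is not part of AttaCut: `tokenize_lines` and the `|` separator are hypothetical choices, and the function takes any callable that maps a string to a list of tokens, so it can be tried without loading a model.

```python
from typing import Callable, Iterable, List


def tokenize_lines(
    lines: Iterable[str],
    tokenize: Callable[[str], List[str]],
    sep: str = "|",
) -> List[str]:
    """Tokenize each non-empty line and join its tokens with `sep`."""
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            # skip blank lines rather than passing them to the tokenizer
            continue
        results.append(sep.join(tokenize(line)))
    return results


# With AttaCut installed, the callable would presumably be the tokenizer's
# method (the method name `tokenize` is an assumption here):
#   from attacut import Tokenizer
#   atta = Tokenizer(model="attacut-sc")
#   with open("input.txt", encoding="utf-8") as f:
#       tokenized = tokenize_lines(f, atta.tokenize)
```

Passing the tokenizer in as a callable keeps the model loaded once and reused across all lines.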
Benchmark Results
Below are brief summaries. More details can be found on our benchmarking page.
Tokenization Quality
Speed
Retraining on Custom Dataset
Please refer to our retraining page.
Related Resources
Acknowledgements
This repository was initially developed by Pattarawat Chormai while interning at Dr. Attapol Thamrongrattanarit's NLP Lab, Chulalongkorn University, Bangkok, Thailand. Many people have been involved in this project. The complete list of names can be found on the Acknowledgement page.