Yet Another Tokenizer for Thai
AttaCut
TLDR: 3-Layer dilated CNN on character and syllable features
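A stack of dilated convolutions lets a shallow network see a wide character context. As a rough illustration of why three dilated layers suffice, the sketch below computes the receptive field of stacked stride-1 dilated 1-D convolutions; the kernel size and dilation rates used are illustrative assumptions, not AttaCut's actual hyperparameters.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of stacked stride-1 dilated convolutions.

    Each layer with kernel size k and dilation d widens the
    receptive field by (k - 1) * d positions.
    """
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

# Example: three layers, kernel size 3, dilations 1, 2, 4
# (assumed values for illustration only)
print(receptive_field([3, 3, 3], [1, 2, 4]))  # -> 15 characters of context
```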
Installation
# only for beta version
$ pip install attacut
Usage
Command-Line Interface
$ attacut-cli -h
AttaCut: Fast and Reasonably Accurate Tokenizer for Thai
Usage:
attacut-cli <src> [--dest=<dest>] [--model=<model>]
attacut-cli (-h | --help)
Options:
-h --help Show this screen.
--model=<model> Model to be used [default: attacut-sc].
--dest=<dest> If not specified, it'll be <src>-tokenized-by-<model>.txt
Higher-Level Interface
i.e., importing AttaCut as a module
from attacut import Tokenizer
atta = Tokenizer(model="attacut-sc")
atta.tokenize(txt)
Development
Please refer to DEVELOPMENT.md
Related Resources
Acknowledgements
- This repository was initially developed by Pattarawat Chormai while interning at Dr. Attapol Thamrongrattanarit's NLP Lab, Chulalongkorn University, Bangkok, Thailand.
- Many thanks to my colleagues at Dr. Attapol's lab, the PyThaiNLP team, Ekapol Chuangsuwanich, Noom, and Can for comments and feedback.