AttaCut: Yet Another Tokenizer for Thai
TLDR: 3-Layer dilated CNN on character and syllable features
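As a rough intuition for why a few dilated layers suffice, each dilated convolution layer with kernel size k and dilation d widens the receptive field by (k-1)·d positions. The sketch below illustrates this; the kernel sizes and dilation rates used are illustrative assumptions, not AttaCut's actual hyperparameters:

```python
# Sketch: receptive field of a stack of dilated 1-D convolutions.
# The kernel sizes and dilations here are assumed for illustration
# only; they are NOT taken from AttaCut's real configuration.

def receptive_field(kernel_sizes, dilations):
    """Positions of context seen by the top layer of a dilated CNN stack."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d  # each layer adds (k - 1) * d positions of context
    return rf

# Three layers, kernel size 3, dilation doubling each layer:
print(receptive_field([3, 3, 3], [1, 2, 4]))  # -> 15 positions of context
```

With doubling dilations the receptive field grows exponentially in depth, which is why a shallow stack can still see enough character context to place word boundaries.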
Installation
# only for beta version
$ pip install https://github.com/heytitle/attacut/archive/v0.0.3-dev.zip
Usage
Command-Line Interface
$ attacut-cli -h
AttaCut: Fast and Reasonably Accurate Tokenizer for Thai

Usage:
  attacut-cli <src> [--dest=<dest>] [--model=<model>]
  attacut-cli (-h | --help)

Options:
  -h --help         Show this screen.
  --model=<model>   Model to be used [default: attacut-sc].
  --dest=<dest>     If not specified, it'll be <src>-tokenized-by-<model>.txt
Higher-Level Interface
i.e. importing the module in Python
from attacut import Tokenizer

atta = Tokenizer(model="attacut-sc")
atta.tokenize(txt)
Development
Please refer to DEVELOPMENT.md
Related Resources
Acknowledgements
- This repository was initially developed by Pattarawat Chormai while interning at Dr. Attapol Thamrongrattanarit's NLP Lab, Chulalongkorn University, Bangkok, Thailand.
- Many thanks to my colleagues at Dr. Attapol's lab, the PyThaiNLP team, Ekapol Chuangsuwanich, Noom, and Can for comments and feedback.