Fast and Reasonably Accurate Word Tokenizer for Thai

License
- OSI Approved :: MIT License
Natural Language
- Thai
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

AttaCut: Fast and Reasonably Accurate Word Tokenizer for Thai

What does AttaCut look like?

TL;DR: A 3-layer dilated CNN on syllable and character features. It is 6x faster than DeepCut (the state of the art), while its word-level F1 (WL-f1) on the BEST corpus is 91%, only 2% lower.
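
To make the WL-f1 figure more concrete, here is a minimal sketch of one common way to compute a word-level F1 score: a predicted word counts as correct only when both of its boundaries match the reference. The function names `word_spans` and `word_level_f1` are hypothetical, and the project's own benchmark scripts may define the metric differently in detail.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a list of words into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def word_level_f1(reference: List[str], predicted: List[str]) -> float:
    """F1 over words whose start and end offsets both match the reference.

    Assumption: both segmentations cover the same underlying string.
    """
    ref = word_spans(reference)
    pred = word_spans(predicted)
    correct = len(ref & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a single misplaced boundary penalizes the two words on either side of it, which is why word-level scores are typically lower than character-level ones.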

Installation

$ pip install attacut

Note: Windows users need to install PyTorch before running the command above. See PyTorch.org for details.

Usage

Command-Line Interface

$ attacut-cli -h
AttaCut: Fast and Reasonably Accurate Word Tokenizer for Thai

Usage:
  attacut-cli <src> [--dest=<dest>] [--model=<model>]
  attacut-cli [-v | --version]
  attacut-cli [-h | --help]

Arguments:
  <src>             Path to input text file to be tokenized

Options:
  -h --help         Show this screen.
  --model=<model>   Model to be used [default: attacut-sc].
  --dest=<dest>     If not specified, it'll be <src>-tokenized-by-<model>.txt
  -v --version      Show version
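
If you want to drive the CLI from a Python script, the usage text above can be wrapped as in the following sketch. The helper names `build_attacut_command` and `run_attacut` are hypothetical and not part of AttaCut; only the `<src>`, `--dest`, and `--model` arguments shown in the help output are used.

```python
import subprocess
from typing import List, Optional


def build_attacut_command(src: str,
                          dest: Optional[str] = None,
                          model: Optional[str] = None) -> List[str]:
    """Build an argv list for `attacut-cli` from the documented arguments.

    Options left as None are omitted, so the CLI falls back to its own
    defaults (for example, model `attacut-sc`).
    """
    cmd = ["attacut-cli", src]
    if dest is not None:
        cmd.append(f"--dest={dest}")
    if model is not None:
        cmd.append(f"--model={model}")
    return cmd


def run_attacut(src: str,
                dest: Optional[str] = None,
                model: Optional[str] = None) -> None:
    """Run `attacut-cli`; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_attacut_command(src, dest, model), check=True)
```

Passing the command as a list avoids shell quoting problems when file paths contain spaces or Thai characters.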

High-Level API

from attacut import tokenize, Tokenizer

# example input; replace with your own Thai text
txt = "ทดสอบการตัดคำ"

# tokenize `txt` using our best model, `attacut-sc`
words = tokenize(txt)

# alternatively, instantiate an AttaCut tokenizer directly to choose
# between the `attacut-sc` and `attacut-c` models
atta = Tokenizer(model="attacut-sc")
words = atta.tokenize(txt)
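
For tokenizing a whole file from Python, a small wrapper such as the following sketch may be convenient. The function `tokenize_lines` and the `|` separator are assumptions made for illustration, not part of AttaCut. The tokenizer is passed in as a parameter, so the helper works with either `tokenize` or `Tokenizer(...).tokenize`.

```python
from typing import Callable, Iterable, Iterator, List


def tokenize_lines(lines: Iterable[str],
                   tokenize_fn: Callable[[str], List[str]],
                   sep: str = "|") -> Iterator[str]:
    """Tokenize each line and join its words with `sep`.

    Empty lines are passed through unchanged so that the output keeps
    the same line count as the input.
    """
    for line in lines:
        text = line.rstrip("\n")
        if not text:
            yield ""
            continue
        yield sep.join(tokenize_fn(text))


# Usage with AttaCut (requires `pip install attacut`):
#   from attacut import tokenize
#   with open("input.txt", encoding="utf-8") as f:
#       for out in tokenize_lines(f, tokenize):
#           print(out)
```

Because the helper is a generator, it processes one line at a time and does not load the whole file into memory.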

Benchmark Results

Below are brief summaries. More details can be found on our benchmarking page.

Tokenization Quality

Speed

Retraining on Custom Dataset

Please refer to our retraining page.

Related Resources

Acknowledgements

This repository was initially developed by Pattarawat Chormai while interning at Dr. Attapol Thamrongrattanarit's NLP Lab, Chulalongkorn University, Bangkok, Thailand. Many people have been involved in this project; a complete list of names can be found on the Acknowledgement page.

Release history
- 1.1.0.dev0 (pre-release): Mar 13, 2020
- 1.0.6 (this version): Nov 21, 2019
- 1.0.6.dev0 (pre-release): Nov 21, 2019
- 1.0.5: Oct 18, 2019
- 1.0.4: Oct 1, 2019
- 1.0.4.dev0 (pre-release): Oct 1, 2019
- 1.0.3: Oct 1, 2019
- 1.0.3.dev0 (pre-release): Oct 1, 2019
- 1.0.2: Sep 8, 2019
- 1.0.2.dev0 (pre-release): Sep 8, 2019
- 1.0.1: Sep 1, 2019
- 1.0.0: Sep 1, 2019
- 0.0.6.dev0 (pre-release): Aug 30, 2019
- 0.0.5.dev0 (pre-release): Aug 30, 2019
- 0.0.4.dev0 (pre-release): Aug 29, 2019
- 0.0.3.dev0 (pre-release): Aug 25, 2019
- 0.0.2.dev0 (pre-release): Aug 25, 2019

Download files

Source Distribution

attacut-1.0.6.tar.gz (1.3 MB)

Uploaded Nov 21, 2019 Source

Built Distribution

attacut-1.0.6-py3-none-any.whl (1.3 MB)

Uploaded Nov 21, 2019 Python 3

File details

Details for the file attacut-1.0.6.tar.gz.

File metadata

Download URL: attacut-1.0.6.tar.gz
Upload date: Nov 21, 2019
Size: 1.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.4

File hashes

Hashes for attacut-1.0.6.tar.gz
Algorithm	Hash digest
SHA256	`ced9d8cd6b2817f6aeb441a2919b0f2da02b432294b08dc82702b40176a79bba`
MD5	`0d32368ece14466da30601e9181e996f`
BLAKE2b-256	`9c086b905097d1cd72dabc50b867c68bda1f971412a7dfee37b5b68fae997258`
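
A downloaded file can be checked against a published digest with Python's standard `hashlib` module. The following is a minimal sketch; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage, assuming the source distribution was saved in the current directory:
#   ok = matches_published_hash(
#       "attacut-1.0.6.tar.gz",
#       "ced9d8cd6b2817f6aeb441a2919b0f2da02b432294b08dc82702b40176a79bba",
#   )
```

pip can perform the same check at install time when hashes are pinned in a requirements file.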

File details

Details for the file attacut-1.0.6-py3-none-any.whl.

File metadata

Download URL: attacut-1.0.6-py3-none-any.whl
Upload date: Nov 21, 2019
Size: 1.3 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.4

File hashes

Hashes for attacut-1.0.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d04193149476d1c371c7d0177a4363ba70d1a6d6f7d2246a577669eb4ea93f2c`
MD5	`4d731a11321e0420dff1f9add2a82371`
BLAKE2b-256	`f6564ab7204bde7468be65d047578192975035d9bc4e786990a407a28a8f75b8`
