cutkum

Thai Word-Segmentation with LSTM in Tensorflow


# Cutkum ['คัดคำ']
Cutkum ('คัดคำ') is a Python package for Thai word segmentation using a recurrent neural network (RNN), built on the TensorFlow library.

Cutkum is trained on BEST2010, a corpus of 5 million Thai words by NECTEC (https://www.nectec.or.th/). It comes with a pre-trained model and can be used right out of the box. Cutkum is still a work in progress. Evaluated on a 10% hold-out set from the BEST2010 corpus (~600,000 words), the included trained model currently achieves:

* 98.0% recall, 96.3% precision, 97.1% F-measure (character-level)
* 93.5% recall, 94.1% precision, 94.0% F-measure (word-level, same evaluation method as BEST2010)
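
To make the word-level figures concrete, here is a minimal sketch of one common way to score a segmentation: a predicted word counts as correct only when both of its character boundaries match the reference, and the F-measure is the harmonic mean of precision and recall. The function names `word_spans` and `word_level_scores` are hypothetical, and this is not the official BEST2010 scorer, whose exact rules may differ.

```python
def word_spans(words):
    """Turn a list of words into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def word_level_scores(predicted, reference):
    """Return (precision, recall, f_measure) for two segmentations of the same text.

    Illustrative only: a word is correct when both boundaries match the reference.
    """
    pred, ref = word_spans(predicted), word_spans(reference)
    correct = len(pred & ref)
    precision = float(correct) / len(pred) if pred else 0.0
    recall = float(correct) / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


reference = ['สารานุกรม', 'ไทย', 'สำหรับ', 'เยาวชน', 'ฯ']
predicted = ['สารานุกรม', 'ไทยสำหรับ', 'เยาวชน', 'ฯ']  # one missed boundary
precision, recall, f_measure = word_level_scores(predicted, reference)
```

In this toy case three of the four predicted words and three of the five reference words line up, so a single missed boundary lowers both precision and recall.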

# Update
Feb 17, 2018 - added the training script

# Requirements
* python 2.7 or 3.0+
* tensorflow 1.4+

# Installation

`cutkum` can be installed using `pip`, and the trained model can be downloaded from GitHub. The currently included model (model/lstm.l6.d2.pb) is a stacked bi-directional LSTM neural network with 6 layers.

```
pip install cutkum

# then download the trained model, either from GitHub or with wget

wget https://raw.githubusercontent.com/pucktada/cutkum/master/model/lstm.l6.d2.pb
```

# Usage

Once installed, you can use `cutkum` within your Python code to tokenize Thai sentences.

```

>>> from cutkum.tokenizer import Cutkum

>>> ck = Cutkum('lstm.l6.d2.pb')
>>> words = ck.tokenize("สารานุกรมไทยสำหรับเยาวชนฯ")

# python 3.0
>>> words
['สารานุกรม', 'ไทย', 'สำหรับ', 'เยาวชน', 'ฯ']

# python 2.7
>>> print("|".join(words))
# สารานุกรม|ไทย|สำหรับ|เยาวชน|ฯ

```
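
When segmenting many sentences, it is usually worth creating the tokenizer once and reusing it. The sketch below assumes only the `tokenize` method shown above; `segment_lines` is a hypothetical helper, and `WhitespaceTokenizer` is a stand-in so the snippet runs without the model file.

```python
def segment_lines(tokenizer, lines, sep="|"):
    """Yield each non-empty line as its words joined with `sep`.

    `tokenizer` is any object with a tokenize(text) method returning a list of words.
    """
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        yield sep.join(tokenizer.tokenize(text))


class WhitespaceTokenizer(object):
    """Stand-in for demonstration only; it just splits on whitespace."""

    def tokenize(self, text):
        return text.split()


demo = list(segment_lines(WhitespaceTokenizer(), ["a b c\n", "\n", "d e"]))

# With the real model, the same helper would be called as:
# from cutkum.tokenizer import Cutkum
# ck = Cutkum('lstm.l6.d2.pb')
# for segmented in segment_lines(ck, ["สารานุกรมไทยสำหรับเยาวชนฯ"]):
#     print(segmented)
```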

You can also use `cutkum` straight from the command line.

```
usage: cutkum [-h] [-v] -m MODEL_FILE
              (-s SENTENCE | -i INPUT_FILE | -id INPUT_DIR)
              [-o OUTPUT_FILE | -od OUTPUT_DIR] [--max | --viterbi]
```
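
The usage string reads as follows: `-m` is required, exactly one of `-s`, `-i`, or `-id` must be given, and the output and decoding options each allow at most one choice. As an illustration, an interface with that shape could be declared with `argparse` as below. This is a sketch inferred from the usage string, not cutkum's actual source; the `dest` names and help texts are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with the same shape as the usage string (illustrative only)."""
    parser = argparse.ArgumentParser(prog="cutkum")
    # The usage string lists -v but does not say what it does.
    parser.add_argument("-v", action="store_true", help="flag listed in the usage string")
    parser.add_argument("-m", dest="model_file", metavar="MODEL_FILE", required=True)

    # Exactly one input source is required.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-s", dest="sentence", metavar="SENTENCE")
    source.add_argument("-i", dest="input_file", metavar="INPUT_FILE")
    source.add_argument("-id", dest="input_dir", metavar="INPUT_DIR")

    # At most one output target.
    target = parser.add_mutually_exclusive_group()
    target.add_argument("-o", dest="output_file", metavar="OUTPUT_FILE")
    target.add_argument("-od", dest="output_dir", metavar="OUTPUT_DIR")

    # At most one decoding option.
    decoding = parser.add_mutually_exclusive_group()
    decoding.add_argument("--max", action="store_true")
    decoding.add_argument("--viterbi", action="store_true")
    return parser


args = build_parser().parse_args(["-m", "lstm.l6.d2.pb", "-s", "สารานุกรมไทยสำหรับเยาวชนฯ"])
```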

```
cutkum -m model/lstm.l6.d2.pb -s "สารานุกรมไทยสำหรับเยาวชนฯ"

# output as
สารานุกรม|ไทย|สำหรับ|เยาวชน|ฯ
```

`cutkum` can also be used to segment text within a file (with -i), or to segment all the files within a given directory (with -id).

```
cutkum -m model/lstm.l6.d2.pb -i input.txt -o output.txt
cutkum -m model/lstm.l6.d2.pb -id input_dir -od output_dir
```
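
The same directory-level workflow can be approximated from Python. The sketch below is written for Python 3 and makes several assumptions: `segment_directory` is a hypothetical helper, files are UTF-8 text, each output file keeps the name of its input file, and words are joined with `|` as in the command-line output above. The `-id`/`-od` options may behave differently in these details.

```python
import os


def segment_directory(tokenizer, input_dir, output_dir, sep="|"):
    """Segment every regular file in input_dir, line by line, into output_dir.

    `tokenizer` is any object with a tokenize(text) method returning a list of words.
    Returns the list of output paths written.
    """
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    written = []
    for name in sorted(os.listdir(input_dir)):
        src = os.path.join(input_dir, name)
        if not os.path.isfile(src):
            continue  # skip sub-directories
        dst = os.path.join(output_dir, name)
        with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
            for line in fin:
                text = line.strip()
                if text:
                    fout.write(sep.join(tokenizer.tokenize(text)) + "\n")
                else:
                    fout.write("\n")  # keep blank lines so line numbers stay aligned
        written.append(dst)
    return written


# Possible use with the real model:
# from cutkum.tokenizer import Cutkum
# segment_directory(Cutkum('lstm.l6.d2.pb'), 'input_dir', 'output_dir')
```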

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## To Do
* Improve performance with a better model and a better included trained model
* Improve speed when processing big files

# Release history

* 2.4 (Feb 22, 2018), this version
* 1.4.2 (Feb 17, 2018)
* 1.4 (Jan 26, 2018)


# Source distribution

* File: cutkum-2.4.tar.gz
* Upload date: Feb 22, 2018
* Size: 6.2 MB
* Tags: Source
* Uploaded using Trusted Publishing? No

Hashes for cutkum-2.4.tar.gz:

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `0a274c0d7ca31269756a563526d8211deb0f8614fb49bba1fdf943bf688b1ba3` |
| MD5 | `440d1a3ca636f5bdb84a54f52588ef90` |
| BLAKE2b-256 | `f694cfcd80851a04c53f24d1da2dc320ab72879ccabad03c5c5b602942245555` |
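
A downloaded archive can be checked against the SHA256 digest listed above using only the standard library. This is a minimal sketch; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the local path in the commented usage line is an assumption.

```python
import hashlib

# SHA256 digest listed above for cutkum-2.4.tar.gz
EXPECTED_SHA256 = "0a274c0d7ca31269756a563526d8211deb0f8614fb49bba1fdf943bf688b1ba3"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """True when the file's SHA256 digest equals the expected hex digest."""
    return sha256_of_file(path) == expected.lower()


# Possible use, assuming the archive was saved to the current directory:
# print(matches_published_hash("cutkum-2.4.tar.gz"))
```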
