Thai Word-Segmentation with LSTM in Tensorflow

# Cutkum ['คัดคำ']
Cutkum ('คัดคำ') is a Python library for Thai word segmentation using a recurrent neural network (RNN), built on the TensorFlow library.

Cutkum is trained on BEST2010, a 5-million-word Thai corpus from NECTEC (https://www.nectec.or.th/). It comes with a pre-trained model and can be used right out of the box. Cutkum is still a work in progress. Evaluated on a 10% hold-out set from the BEST2010 corpus (~600,000 words), the included trained model currently performs at

* 98.0% recall, 96.3% precision, 97.1% F-measure (character-level)
* 93.5% recall, 94.1% precision, 94.0% F-measure (word-level, same evaluation method as BEST2010)
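
For readers unfamiliar with these metrics, the following is a minimal sketch of how word-level precision, recall and F-measure can be computed for a segmenter. It assumes a common span-matching scheme (a predicted word counts as correct only if both its start and end offsets match a reference word); it is not the official BEST2010 scoring script, and the helper names `word_spans` and `prf` are made up for this illustration.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold_words, pred_words):
    """Return (precision, recall, F-measure) from exact span matches."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = float(len(gold & pred))
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if correct == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Reference segmentation vs. a hypothetical output that merges two words.
gold = ['สารานุกรม', 'ไทย', 'สำหรับ', 'เยาวชน', 'ฯ']
pred = ['สารานุกรม', 'ไทยสำหรับ', 'เยาวชน', 'ฯ']
precision, recall, f_measure = prf(gold, pred)
print(precision, recall, f_measure)
```

Here three of the four predicted words match the five reference words, so a single merge error lowers both precision and recall.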

# Update
Feb 17, 2018 - added the training script

# Requirements
* python = 2.7, 3.0+
* tensorflow = 1.4+

# Installation

`cutkum` can be installed using `pip`, and the trained model can be downloaded from GitHub. The currently included model (model/lstm.l6.d2.pb) is a stacked bidirectional LSTM neural network with 6 layers.

```
pip install cutkum

# then download the trained model, either from GitHub or with wget

wget https://raw.githubusercontent.com/pucktada/cutkum/master/model/lstm.l6.d2.pb
```

# Usage

Once installed, you can use `cutkum` within your Python code to tokenize Thai sentences.

```

>>> from cutkum.tokenizer import Cutkum

>>> ck = Cutkum('lstm.l6.d2.pb')
>>> words = ck.tokenize("สารานุกรมไทยสำหรับเยาวชนฯ")

# python 3.0
>>> words
['สารานุกรม', 'ไทย', 'สำหรับ', 'เยาวชน', 'ฯ']

# python 2.7
>>> print("|".join(words))
# สารานุกรม|ไทย|สำหรับ|เยาวชน|ฯ

```
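
To tokenize several sentences in one go, a small wrapper around `tokenize` may be handy. The sketch below is only an illustration: `segment_lines` is a hypothetical helper, not part of `cutkum`, and it accepts any callable that maps a string to a list of words. A whitespace splitter stands in for the real tokenizer so the snippet runs without the model file.

```python
def segment_lines(tokenize, lines, sep="|"):
    """Tokenize each line and join its words with `sep`.

    `tokenize` is any callable taking a string and returning a list of
    words. Blank lines are kept as empty strings.
    """
    out = []
    for line in lines:
        text = line.strip()
        out.append(sep.join(tokenize(text)) if text else "")
    return out


# Stand-in tokenizer so the sketch runs without the model file.
# With cutkum installed, pass Cutkum('lstm.l6.d2.pb').tokenize instead.
demo = segment_lines(str.split, ["a b c", "", "d e"])
print(demo)
```

Passing the tokenizer in as an argument means the model is loaded once and reused for every line.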

You can also use `cutkum` straight from the command line.

```
usage: cutkum [-h] [-v] -m MODEL_FILE
(-s SENTENCE | -i INPUT_FILE | -id INPUT_DIR)
[-o OUTPUT_FILE | -od OUTPUT_DIR] [--max | --viterbi]
```

```
cutkum -m model/lstm.l6.d2.pb -s "สารานุกรมไทยสำหรับเยาวชนฯ"

# output as
สารานุกรม|ไทย|สำหรับ|เยาวชน|ฯ
```


`cutkum` can also be used to segment text within a file (with -i), or to segment all the files within a given directory (with -id).

```
cutkum -m model/lstm.l6.d2.pb -i input.txt -o output.txt
cutkum -m model/lstm.l6.d2.pb -id input_dir -od output_dir
```
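
If you want to drive the command line from a Python script, the commands above can be assembled as argument lists. This is a rough sketch that uses only the flags shown in the usage text (`-m`, `-i`/`-o`, `-id`/`-od`); `build_cutkum_command` is a hypothetical helper, and actually running the command assumes `cutkum` is on your `PATH` and the model file exists.

```python
def build_cutkum_command(model_file, input_path, output_path, directory=False):
    """Return the argv list for one cutkum run.

    With directory=True the -id/-od flags are used instead of -i/-o.
    """
    in_flag, out_flag = ("-id", "-od") if directory else ("-i", "-o")
    return ["cutkum", "-m", model_file,
            in_flag, input_path, out_flag, output_path]


cmd = build_cutkum_command("model/lstm.l6.d2.pb", "input.txt", "output.txt")
print(" ".join(cmd))

# To actually run it (requires cutkum and the model file):
# import subprocess
# subprocess.check_call(cmd)
```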

## License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.

## To Do
* Improve performance, with a better model architecture and a better included trained model
* Improve the speed when processing large files

