Thai Word-Segmentation with LSTM in Tensorflow

# Cutkum ['คัดคำ']
Cutkum ('คัดคำ') is a Python tool for Thai word segmentation using a recurrent neural network (RNN), built on the TensorFlow library.

Cutkum is trained on BEST2010, a 5-million-word Thai corpus from NECTEC (https://www.nectec.or.th/). It ships with a pre-trained model and can be used out of the box. Cutkum is still a work in progress. Evaluated on a 10% hold-out set from the BEST2010 corpus (~600,000 words), the included model currently achieves:

98.0% recall, 96.3% precision, 97.1% F-measure (character-level)

93.5% recall, 94.1% precision and 94.0% F-measure (word-level -- same evaluation method as BEST2010)
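To make the word-level figures concrete, here is a minimal sketch of how precision, recall, and F-measure can be computed for a segmentation. It assumes a word counts as correct only when both of its boundaries match the reference; this is an illustration, not the BEST2010 evaluation script, and the names `word_spans` and `word_level_scores` are hypothetical.

```python
def word_spans(segmented, sep="|"):
    """Return the set of (start, end) character offsets of each word.

    Offsets are measured in the unsegmented text, so two segmentations
    of the same text can be compared span by span.
    """
    spans = set()
    start = 0
    for word in segmented.split(sep):
        end = start + len(word)
        if word:  # ignore empty pieces from doubled separators
            spans.add((start, end))
        start = end
    return spans


def word_level_scores(gold, predicted, sep="|"):
    """Return (precision, recall, f_measure) for a predicted segmentation.

    Assumption: a predicted word is correct only if its start and end
    offsets both match a word in the gold segmentation.
    """
    gold_spans = word_spans(gold, sep)
    pred_spans = word_spans(predicted, sep)
    correct = float(len(gold_spans & pred_spans))
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    gold = "สารานุกรม|ไทย|สำหรับ|เยาวชน|ฯ"
    predicted = "สารานุกรม|ไทยสำหรับ|เยาวชน|ฯ"  # one boundary missed
    print(word_level_scores(gold, predicted))
```

In the example, missing a single boundary costs two gold words (`ไทย` and `สำหรับ`), which is why word-level scores tend to be lower than character-level ones.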

# Requirements
* python 2.7 or 3.0+
* tensorflow >= 1.1

# Usage
```
usage: cutkum.py [-h] [-v] -m MODEL_FILE
                 (-d DIRECTORY | -i INPUT_FILE | -s SENTENCE) [-o OUTPUT_DIR]
                 [--max | --viterbi]
```

`cutkum.py` needs to load a trained model. The included model (model/lstm.l6.d2.pb) is a bi-directional LSTM neural network with 6 layers. `cutkum.py` can be used in three ways: (1) to segment a sentence given directly on the command line (with -s), (2) to segment the text in a file (with -i), or (3) to segment all the files within a given directory (with -d).

For example, to segment the Thai phrase `"สารานุกรมไทยสำหรับเยาวชนฯ"`, run

```
./cutkum.py -m model/lstm.l6.d2.pb -s "สารานุกรมไทยสำหรับเยาวชนฯ"
```

which produces the following word segmentation (words are separated by '|'):

```
สารานุกรม|ไทย|สำหรับ|เยาวชน|ฯ
```
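If the segmented output needs further processing, the '|'-separated line can be turned back into a list of words. The following is a small sketch, assuming Python 3 and that the original text contains no literal '|' characters; `split_segmented` is a hypothetical helper, not part of Cutkum.

```python
def split_segmented(line, sep="|"):
    """Split one line of segmented output into a list of words.

    Assumption: the separator never occurs in the original text, so
    every occurrence marks a word boundary. Empty pieces are dropped.
    """
    return [word for word in line.strip().split(sep) if word]


if __name__ == "__main__":
    words = split_segmented("สารานุกรม|ไทย|สำหรับ|เยาวชน|ฯ")
    print(len(words))  # 5
    print(words)
```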

Or, to segment a text file 'input.txt' and save the result to 'output.txt':

```
./cutkum.py -m model/lstm.l6.d2.pb -i input.txt > output.txt
```
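The same invocations can be driven from a Python script. The sketch below builds the argument list from the flags shown in the usage text (`-m`, `-s`, `-i`, `-d`, `-o`) and enforces that exactly one input source is given. The helper names are hypothetical, and it assumes that `cutkum.py` writes its result to standard output as UTF-8, as the shell redirection above suggests. The exact behavior of `-o` is not described here, so it is passed through unchanged.

```python
import subprocess


def build_cutkum_command(model_file, sentence=None, input_file=None,
                         directory=None, output_dir=None,
                         script="./cutkum.py"):
    """Build the argument list for one cutkum.py invocation.

    Exactly one of sentence (-s), input_file (-i), or directory (-d)
    must be given, mirroring the mutually exclusive group in the usage.
    """
    sources = [("-s", sentence), ("-i", input_file), ("-d", directory)]
    chosen = [(flag, value) for flag, value in sources if value is not None]
    if len(chosen) != 1:
        raise ValueError(
            "give exactly one of sentence, input_file, or directory")
    flag, value = chosen[0]
    command = [script, "-m", model_file, flag, value]
    if output_dir is not None:
        command += ["-o", output_dir]
    return command


def run_cutkum(command):
    """Run the command and return its standard output as text.

    Assumption: the output is UTF-8 encoded.
    """
    return subprocess.check_output(command).decode("utf-8")


if __name__ == "__main__":
    cmd = build_cutkum_command("model/lstm.l6.d2.pb", input_file="input.txt")
    print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with Thai text in `-s`.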


## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details

## To Do

* Improve accuracy with a better model architecture and a better included trained model
* Improve speed when processing large files
