Thai Sentence Segmenter
Project description
BoydCut: Thai Sentence Segmenter
Bidirectional LSTM-CNN Model for Thai Sentence Segmenter
Development Status
This project is part of my Master's thesis in Big Data Engineering at CITE, Dhurakij Pundit University: https://cite.dpu.ac.th/bigdata/
My Advisors
- Asst. Prof. Dr. Duangjai Jitkongchuen
- Asst. Prof. Dr. Peerasak Intarapaiboon
Requirements
- TensorFlow 2.0+
- Python 3.6.x
- pip install -r requirements
If pip install -r requirements does not work, please follow the installation steps below.
Installation steps
- pip install numpy pandas tensorflow
- pip install deepcut
- pip install pythainlp
How to use and Examples
- pip install BoydCut
- Version 1.0.0
- Notebook Example
from boydcut import BoydCut  # import path assumed; adjust to the installed package layout

boydcut = BoydCut()
sent_ls = boydcut.sentenize("ประเทศฝรั่งเศสแผ่นดินใหญ่ทอดตัวตั้งแต่ทะเลเมดิเตอร์เรเนียนจนถึงช่องแคบอังกฤษและทะเลเหนือ")
for sent in sent_ls:
    print(sent)
> <B-CLS>ประเทศฝรั่งเศส|แผ่นดิน|ใหญ่|ทอด|ตัว|ตั้งแต่|ทะเลเมดิเตอร์เรเนียน|จนถึง|ช่อง|แคบ<E-CLS>
> <B-CLS>อังกฤษ|และ|ทะเล|เหนือ<E-CLS>
boydcut = BoydCut()
sent_ls = boydcut.sentenize(['ประเทศฝรั่งเศส','แผ่นดิน','ใหญ่','ทอดตัว','ตั้งแต่',
'ทะเลเมดิเตอร์เรเนียน','จนถึง','ช่อง','แคบ',
'อังกฤษ','และ','ทะเล','เหนือ'], _tokenize=False)
for sent in sent_ls:
    print(sent)
> <B-CLS>ประเทศฝรั่งเศส|แผ่นดิน|ใหญ่|ทอด|ตัว|ตั้งแต่|ทะเลเมดิเตอร์เรเนียน|จนถึง|ช่อง|แคบ<E-CLS>
> <B-CLS>อังกฤษ|และ|ทะเล|เหนือ<E-CLS>
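If you only need plain sentence strings, the <B-CLS>/<E-CLS> markers and the "|" word separators can be stripped afterwards. The helper below is a minimal post-processing sketch based on the example output above; to_plain_text is not part of the BoydCut API.

# Rebuild a plain-text sentence from BoydCut's tagged output.
def to_plain_text(tagged_sentence):
    body = tagged_sentence.replace("<B-CLS>", "").replace("<E-CLS>", "")
    return "".join(body.split("|"))  # Thai is written without spaces between words

for sent in sent_ls:
    print(to_plain_text(sent))
> ประเทศฝรั่งเศสแผ่นดินใหญ่ทอดตัวตั้งแต่ทะเลเมดิเตอร์เรเนียนจนถึงช่องแคบ
> อังกฤษและทะเลเหนือ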
Limitations
- Document-level feeding is not available yet
- Maximum input size: 200 words per paragraph
- Use "\n" to break long text into smaller paragraphs (see the sketch below)
- Results are returned as a list: [sentence1, sentence2, sentence3, ..., sentenceN]
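Since document-level input is not supported, a long text can be split on "\n" and each paragraph fed separately. This is a minimal sketch under that assumption; the document text here is illustrative only.

# Split a long document into paragraphs and segment each one.
boydcut = BoydCut()
document = "ย่อหน้าแรกของเอกสาร\nย่อหน้าที่สองของเอกสาร"  # illustrative text
all_sentences = []
for paragraph in document.split("\n"):
    paragraph = paragraph.strip()
    if paragraph:  # skip empty paragraphs
        all_sentences.extend(boydcut.sentenize(paragraph))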
Dependencies
- POS tagging uses pythainlp.tag.pos_tag(_sentence_ls, corpus="orchid")
- Tokenization uses pythainlp.tokenize.word_tokenize(_text_ls, engine="deepcut")
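For reference, the underlying pythainlp calls can also be used directly. This standalone sketch mirrors the calls listed above; the deepcut engine requires the deepcut package installed in the installation steps.

from pythainlp.tokenize import word_tokenize
from pythainlp.tag import pos_tag

# Tokenize with the deepcut engine, then POS-tag with the ORCHID corpus.
words = word_tokenize("ประเทศฝรั่งเศสแผ่นดินใหญ่", engine="deepcut")
tags = pos_tag(words, corpus="orchid")
print(tags)  # list of (word, POS tag) tuples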
Contributor
Sorratat Sirirattanajakarin (Boyd)
- Youtube: https://youtube.com/c/BigDataRPG
- Fanpage: https://www.facebook.com/bigdatarpg/
- Medium: https://www.medium.com/bigdataeng
- Github: https://www.github.com/BigDataRPG
- Kaggle: https://www.kaggle.com/boydbigdatarpg
- Linkedin: https://www.linkedin.com/in/boyd-sorratat
- Twitter: https://twitter.com/BoydSorratat
- GoogleScholar: https://scholar.google.com/citations?user=9cIeYAgAAAAJ&hl=en
License and reference
Please cite the following paper if you use BoydCut in your research:
S. Sirirattanajakarin, D. Jitkongchuen, and P. Intarapaiboon, "BoydCut: Bidirectional LSTM-CNN Model for Thai Sentence Segmenter," 2020 1st International Conference on Big Data Analytics and Practices (IBDAP), 2020.
File details
Details for the file BoydCut-1.0.0.tar.gz.
File metadata
- Download URL: BoydCut-1.0.0.tar.gz
- Upload date:
- Size: 27.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.24.0 setuptools/50.3.0.post20201103 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.6.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | e288987015b271835467474ab861850c66f9bb2999b0c60d8c7ee67d9e1fe06a
MD5 | dba670f0ba94285ccbb04e5a5745521c
BLAKE2b-256 | 86833a574b0d3361470514a704520cca8022cab6d01fa30232c8758dbcf086f3
File details
Details for the file BoydCut-1.0.0-py3-none-any.whl.
File metadata
- Download URL: BoydCut-1.0.0-py3-none-any.whl
- Upload date:
- Size: 27.6 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.24.0 setuptools/50.3.0.post20201103 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.6.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | d44e8ab4581d1a53a05f076fc6736bd26ad89332e4f3d9d85f98d55bbf97db5a
MD5 | 8c1737f444461383961d8b1326c30859
BLAKE2b-256 | 545292799d27e575f5632643ccf4e644652d17b8e1a50e014bc9c8ef6f2d5c35