Thai Sentence Segmenter

BoydCut: Thai Sentence Segmenter

Bidirectional LSTM-CNN Model for Thai Sentence Segmenter

Development Status

This project is part of my Master's thesis in Big Data Engineering at CITE, Dhurakij Pundit University: https://cite.dpu.ac.th/bigdata/

My Advisors

  • Asst. Prof. Dr. Duangjai Jitkongchuen
  • Asst. Prof. Dr. Peerasak Intarapaiboon

Requirements

  • TensorFlow 2.0+
  • Python 3.6.x
  • pip install -r requirements

If pip install -r requirements does not work, follow the installation steps below.

Installation steps

  • pip install numpy pandas tensorflow
  • pip install deepcut
  • pip install pythainlp
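
After installing, it can help to confirm that the packages listed above are visible to the interpreter you plan to use. The following is a minimal sketch using only the standard library; find_missing is a hypothetical helper written for this README and is not part of BoydCut, and the module names are assumed to match the pip package names listed above.

```python
import importlib.util

# Import names of the packages listed in the installation steps above
# (assumed to be identical to the pip package names).
REQUIRED_MODULES = ["numpy", "pandas", "tensorflow", "deepcut", "pythainlp"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

This only checks that each module can be located. It does not check versions, so the TensorFlow 2.0+ requirement still needs to be verified separately.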

How to use and Examples

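# Assumes the BoydCut class has already been imported from the installed package.
boydcut = BoydCut()
# Adjacent string literals are joined, so no stray spaces end up inside the text.
sent_ls = boydcut.sentenize("ประเทศฝรั่งเศสแผ่นดินใหญ่ทอดตัวตั้งแต่ทะเลเมดิเตอร์"
                            "เรเนียนจนถึงช่องแคบอังกฤษและทะเลเหนือ")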
for sent in sent_ls:
    print(sent)

> <B-CLS>ประเทศฝรั่งเศส|แผ่นดิน|ใหญ่|ทอด|ตัว|ตั้งแต่|ทะเลเมดิเตอร์เรเนียน|จนถึง|ช่อง|แคบ<E-CLS>
> <B-CLS>อังกฤษ|และ|ทะเล|เหนือ<E-CLS>


# Pass a pre-tokenized word list and set _tokenize=False to skip tokenization.
boydcut = BoydCut()
sent_ls = boydcut.sentenize(['ประเทศฝรั่งเศส','แผ่นดิน','ใหญ่','ทอด','ตัว','ตั้งแต่',
                            'ทะเลเมดิเตอร์เรเนียน','จนถึง','ช่อง','แคบ',
                            'อังกฤษ','และ','ทะเล','เหนือ'], _tokenize=False)
for sent in sent_ls:
    print(sent)

> <B-CLS>ประเทศฝรั่งเศส|แผ่นดิน|ใหญ่|ทอด|ตัว|ตั้งแต่|ทะเลเมดิเตอร์เรเนียน|จนถึง|ช่อง|แคบ<E-CLS>
> <B-CLS>อังกฤษ|และ|ทะเล|เหนือ<E-CLS>
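
The printed results above wrap each sentence in <B-CLS> and <E-CLS> tags and separate words with "|". If you need plain word lists, the tags can be stripped afterwards. This is a minimal sketch, assuming each returned sentence is a string in exactly the printed format; parse_sentence is a hypothetical helper and is not part of BoydCut.

```python
B_TAG = "<B-CLS>"
E_TAG = "<E-CLS>"


def parse_sentence(tagged):
    """Strip the sentence boundary tags and split the words on '|'."""
    text = tagged.strip()
    if text.startswith(B_TAG):
        text = text[len(B_TAG):]
    if text.endswith(E_TAG):
        text = text[:-len(E_TAG)]
    # Drop empty strings so an empty sentence yields an empty list.
    return [word for word in text.split("|") if word]


example = "<B-CLS>อังกฤษ|และ|ทะเล|เหนือ<E-CLS>"
print(parse_sentence(example))
# → ['อังกฤษ', 'และ', 'ทะเล', 'เหนือ']
```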

Limitations

  • Document feeding is not available yet.
  • Maximum input length: 200 words per paragraph.
  • Use "\n" to break long text into smaller paragraphs.
  • Results are returned as a list: [sentence1, sentence2, sentence3, ..., sentenceN]
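
One way to stay within the 200-word limit is to split the input before calling sentenize. The following is a minimal sketch under stated assumptions: split_paragraphs and chunk_words are hypothetical helpers, not part of BoydCut, and the word lists are assumed to come from a tokenizer of your choice so that each chunk can be passed with _tokenize=False.

```python
def split_paragraphs(text):
    """Split text on newlines and drop empty paragraphs."""
    return [p.strip() for p in text.split("\n") if p.strip()]


def chunk_words(words, max_words=200):
    """Split a word list into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]


# Demo with placeholder tokens instead of real Thai words.
words = ["w%d" % i for i in range(450)]
chunks = chunk_words(words)
print([len(chunk) for chunk in chunks])
# → [200, 200, 50]
```

Fixed-size chunking can cut through the middle of a sentence, so splitting on "\n" at natural paragraph breaks first is preferable where the text allows it.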

Dependencies

  • POS tagging uses pythainlp.tag.pos_tag(_sentence_ls, corpus="orchid")
  • Tokenization uses pythainlp.tokenize.word_tokenize(_text_ls, engine="deepcut")

Contributor

Sorratat Sirirattanajakarin (Boyd)

License and reference

Please cite the following paper if you use BoydCut in your research:

S. Sirirattanajakarin, D. Jitkongchuen, P. Intarapaiboon. BoydCut: Bidirectional LSTM-CNN Model for Thai Sentence Segmenter. 2020 1st International Conference on Big Data Analytics and Practices (IBDAP).
