Basic NLP for Thai

Colab

https://drive.google.com/file/d/1G7OUNsCC-B5XHNd8V5Et1ZKpJp4R66hg/view?usp=share_link

Update

0.3.1

  • Add POS Tagging

0.2.7

  • Add wrap function get_ps

0.2.1

  • Add Token Identification

================================================================================

Token Identification

Example code TokenIden

from basicthainlp import TokenIden
TID = TokenIden()
textTest = "the 1..  .25 \"(12,378.36 / -78.9%) = 76,909\tcontain iphone 13 45. +-*/ -5 12.10.226.38.25 กค. สิงหา%/<=>  6 's\n"
tokenIdenList = TID.tagTokenIden(textTest)
textTokenList,tagList = TID.toTokenList(textTest,tokenIdenList)
for x, y in zip(textTokenList, tagList):
    if y != 'otherSymb' and y != 'space':
        print(x,y)
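
The filtering loop above can be wrapped in a small reusable helper. This is only a sketch: `keep_tokens` is a hypothetical name, not part of basicthainlp, and the sample tokens and tags are written by hand for illustration (they are not real output of `TokenIden`). The tag names are taken from the tag list shown later in this README.

```python
def keep_tokens(tokens, tags, skip=('otherSymb', 'space')):
    """Return (token, tag) pairs whose tag is not in `skip`."""
    return [(tok, tag) for tok, tag in zip(tokens, tags) if tag not in skip]


# Hand-written sample data, NOT produced by the library.
tokens = ['iphone', ' ', '13']
tags = ['en_char', 'space', 'digit']
print(keep_tokens(tokens, tags))
```

Passing a different `skip` tuple changes which tags are dropped, so the same helper can serve several filtering needs.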

Example code TokenIden: Add dict

The input is a folder of files, each containing a single-column list of words (one word per line).
The resulting tag matches the file name.

Example dict

input
--abbreviation.txt # the resulting tag is abbreviation
----กค.
----สค.
----กพ.

from basicthainlp import TokenIden, DictToken
TID = TokenIden()
textTest = "the 1..  .25 \"(12,378.36 / -78.9%) = 76,909\tcontain iphone 13 45. +-*/ -5 12.10.226.38.25 กค. สิงหา%/<=>  6 's\n"
tokenIdenList = TID.tagTokenIden(textTest)
# >>> Add dict 
DTK = DictToken()
DTK.readFloder('input')
tokenIdenList = DTK.rep_dictToken(textTest,tokenIdenList)
# <<< Add dict
textTokenList,tagList = TID.toTokenList(textTest,tokenIdenList)
for x, y in zip(textTokenList, tagList):
    if y != 'otherSymb' and y != 'space':
        print(x,y)
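
The `input` folder used above can be prepared by hand or with a few lines of Python. The sketch below is an assumption-laden helper, not part of basicthainlp: `write_dict_folder` is a hypothetical name, and UTF-8 encoding with one word per line is assumed from the layout described in this section.

```python
from pathlib import Path
import tempfile


def write_dict_folder(root, entries):
    """Write one `<tag>.txt` file per tag, with one word per line."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for tag, words in entries.items():
        # Assumption: the files are read as UTF-8 text.
        (root / f"{tag}.txt").write_text("\n".join(words) + "\n", encoding="utf-8")
    return root


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    folder = write_dict_folder(Path(tmp) / "input",
                               {"abbreviation": ["กค.", "สค.", "กพ."]})
    print([p.name for p in folder.iterdir()])
```

In real use, write the folder to a persistent path and pass that path to `DTK.readFloder`.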

================================================================================

Word token to Pseudo Morpheme Segmentation

- Not recommended for long Thai sentences; segment the text into words first, or use it together with Token Identification.

Example code PmSeg

from basicthainlp import PmSeg
ps = PmSeg()

textTest = 'รัฐราชการ'
data_list = ps.word2DataList(textTest)
print(data_list)
pred = ps.dataList2pmSeg(data_list)
print(list(textTest))
print(pred[0])
print(ps.pmSeg2List(list(textTest),pred[0]))

Output:

[['ร', 'Ccc'], ['ั', 'Vu'], ['ฐ', 'C'], ['ร', 'Ccc'], ['า', 'Vm'], ['ช', 'C'], ['ก', 'C'], ['า', 'Vm'], ['ร', 'Ccc']]
['ร', 'ั', 'ฐ', 'ร', 'า', 'ช', 'ก', 'า', 'ร']
['B', 'I', 'C', 'B', 'I', 'C', 'B', 'I', 'I']
['รัฐ', 'ราช', 'การ']
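
To make the relationship between the predicted tags and the final segments concrete, here is a minimal sketch of how per-character tags could be merged into segments. It is not the library's implementation of `pmSeg2List`: `merge_by_tags` is a hypothetical name, and the only assumption it relies on is that a `'B'` tag opens a new segment, which is consistent with the output printed above. The meaning of the `'I'` and `'C'` tags is not stated in this README, so the sketch simply appends those characters to the current segment.

```python
def merge_by_tags(chars, tags):
    """Group characters into segments, opening a new segment at every 'B' tag."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    segments = []
    for ch, tag in zip(chars, tags):
        if tag == 'B' or not segments:
            segments.append(ch)   # start a new segment
        else:
            segments[-1] += ch    # extend the current segment
    return segments


chars = list('รัฐราชการ')
tags = ['B', 'I', 'C', 'B', 'I', 'C', 'B', 'I', 'I']
print(merge_by_tags(chars, tags))  # ['รัฐ', 'ราช', 'การ']
```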

Example code PmSeg: use with Token Identification

from basicthainlp import PmSeg
from basicthainlp import TokenIden, DictToken
TID = TokenIden()
DTK = DictToken()
DTK.readFloder('input')
ps = PmSeg()
def get_ps(textInput):
  tokenIdenList = TID.tagTokenIden(textInput)
  tokenIdenList = DTK.rep_dictToken(textInput,tokenIdenList)
  textTokenList,tagList = TID.toTokenList(textInput,tokenIdenList)
  # ['otherSymb','mathSymb','punc','th_char','th_mym','en_char','digit','order','url','whitespace','space','newline','abbreviation','ne']
  # newTokenList = TID.replaceTag(['digit=<digit>'],textTokenList,tagList)
  newTokenList = []
  for textToken, tag in zip(textTokenList, tagList):
      if tag == 'th_char':
          data_list = ps.word2DataList(textToken)
          pred = ps.dataList2pmSeg(data_list)
          psList = ps.pmSeg2List(list(textToken),pred[0])
          newTokenList.extend(psList)
      else:
          newTokenList.append(textToken)
  return newTokenList
textTest = 'ติดตามข่าวล่าสุด 12 iphone'
print(get_ps(textTest))

Alternatively, use the wrap function provided by basicthainlp, which works the same way as the code above.

from basicthainlp import PmSeg
from basicthainlp import TokenIden, DictToken
from basicthainlp import get_ps
TID = TokenIden()
DTK = DictToken()
DTK.readFloder('input')
ps = PmSeg()
textTest = 'ติดตามข่าวล่าสุด 12 iphone'
print(get_ps(tid_cls=TID,dtk_cls=DTK,ps_cls=ps,textInput=textTest))
print(get_ps(textInput=textTest))

================================================================================

POS Tagging

POS tagging takes pseudo-morpheme (pm) tokens, tags them with parts of speech, and merges them into words.

Example code PosTag

from basicthainlp import TokenIden,DictToken
from basicthainlp import PmSeg
from basicthainlp import PosTag
TID = TokenIden()
DTK = DictToken()
DTK.readFloder('input')
PS = PmSeg()
textTest = 'จากนั้นคนร้ายก็ได้ขับมุ่งไปทางถนนเจริญกรุง' 
pos_cls = PosTag(tid_cls=TID,dtk_cls=DTK,ps_cls=PS)
ps_list,tag_list = pos_cls.tagPOS(textTest)
print(ps_list)
print(tag_list)
word_list,pos_list = pos_cls.psSeg2WS(ps_list,tag_list)
print(word_list)
print(pos_list)
