Basic NLP for Thai

Project description

Colab

https://drive.google.com/file/d/1G7OUNsCC-B5XHNd8V5Et1ZKpJp4R66hg/view?usp=share_link

Token Identification

Example code TokenIden

from basicthainlp import TokenIden
TID = TokenIden()
textTest = "the 1..  .25 \"(12,378.36 / -78.9%) = 76,909\tcontain iphone 13 45. +-*/ -5 12.10.226.38.25 กค. สิงหา%/<=>  6 's\n"
tokenIdenList = TID.tagTokenIden(textTest)
textTokenList,tagList = TID.toTokenList(textTest,tokenIdenList)
for x, y in zip(textTokenList, tagList):
    if y != 'otherSymb' and y != 'space':
        print(x,y)
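
To make the idea of token identification concrete, here is a toy regex tagger. It is only a sketch and not how basicthainlp works internally: the function name `toy_tag` and all patterns are assumptions, and only the tag names (`digit`, `en_char`, `th_char`, `space`, `otherSymb`) are borrowed from this README.

```python
import re

# Toy patterns (assumptions); the first alternative that matches wins.
TOKEN_PATTERNS = [
    ("digit", r"[0-9]+"),
    ("en_char", r"[A-Za-z]+"),
    ("th_char", r"[\u0E00-\u0E7F]+"),  # Thai Unicode block
    ("space", r" +"),
    ("otherSymb", r"."),               # fallback: any single character
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS),
    re.DOTALL,
)

def toy_tag(text):
    """Return (token, tag) pairs covering the whole input string."""
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]

for token, tag in toy_tag("iphone 13 กค."):
    if tag not in ("otherSymb", "space"):
        print(token, tag)
```

The real TokenIden distinguishes many more cases (numbers with separators, URLs, abbreviations and so on), so treat this only as an illustration of tagging spans by character class.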

Example code TokenIden: Add dict

The input is a folder of files, each containing a one-column word list (one word per line).
Each matched token is tagged with the name of the file it came from.

Example dict

input
--abbreviation.txt  # the resulting tag is abbreviation
----กค.
----สค.
----กพ.
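
The folder layout above can be created with a short script. This is a minimal sketch: the helper name `write_dict_folder` is hypothetical, and UTF-8 encoding with one word per line is an assumption about what `DictToken` expects.

```python
from pathlib import Path

def write_dict_folder(folder, word_lists):
    """Write one <tag>.txt file per tag, one word per line (UTF-8 assumed)."""
    root = Path(folder)
    root.mkdir(parents=True, exist_ok=True)
    for tag, words in word_lists.items():
        (root / f"{tag}.txt").write_text("\n".join(words) + "\n", encoding="utf-8")
    return sorted(p.name for p in root.glob("*.txt"))

# Creates input/abbreviation.txt with the three abbreviations shown above.
files = write_dict_folder("input", {"abbreviation": ["กค.", "สค.", "กพ."]})
print(files)
```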

from basicthainlp import TokenIden, DictToken
TID = TokenIden()
textTest = "the 1..  .25 \"(12,378.36 / -78.9%) = 76,909\tcontain iphone 13 45. +-*/ -5 12.10.226.38.25 กค. สิงหา%/<=>  6 's\n"
tokenIdenList = TID.tagTokenIden(textTest)
# >>> Add dict 
DTK = DictToken()
DTK.readFloder('input')
tokenIdenList = DTK.rep_dictToken(textTest,tokenIdenList)
# <<< Add dict
textTokenList,tagList = TID.toTokenList(textTest,tokenIdenList)
for x, y in zip(textTokenList, tagList):
    if y != 'otherSymb' and y != 'space':
        print(x,y)

================================================================================

Word token to Pseudo Morpheme Segmentation

- Not recommended for long Thai sentences. Segment the text into words first, or use it together with Token Identification.

Example code PmSeg

from basicthainlp import PmSeg
ps = PmSeg()

textTest = 'รัฐราชการ'
data_list = ps.word2DataList(textTest)
print(data_list)
pred = ps.dataList2pmSeg(data_list)
print(list(textTest))
print(pred[0])
print(ps.pmSeg2List(list(textTest),pred[0]))

Output:

[['ร', 'Ccc'], ['ั', 'Vu'], ['ฐ', 'C'], ['ร', 'Ccc'], ['า', 'Vm'], ['ช', 'C'], ['ก', 'C'], ['า', 'Vm'], ['ร', 'Ccc']]
['ร', 'ั', 'ฐ', 'ร', 'า', 'ช', 'ก', 'า', 'ร']
['B', 'I', 'C', 'B', 'I', 'C', 'B', 'I', 'I']
['รัฐ', 'ราช', 'การ']
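
The output above suggests how the B/I/C tags map to segments: in this example every `B` opens a new pseudo morpheme and the following `I` and `C` characters are attached to it. The sketch below reproduces that grouping in plain Python. The function name `tags_to_segments` is hypothetical, and the rule is inferred from this single example, so it is not necessarily how `pmSeg2List` is implemented.

```python
def tags_to_segments(chars, tags):
    """Group characters into segments, starting a new one at each 'B' tag."""
    segments = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not segments:
            segments.append(ch)   # open a new segment
        else:
            segments[-1] += ch    # 'I' and 'C' extend the current segment
    return segments

chars = ['ร', 'ั', 'ฐ', 'ร', 'า', 'ช', 'ก', 'า', 'ร']
tags = ['B', 'I', 'C', 'B', 'I', 'C', 'B', 'I', 'I']
print(tags_to_segments(chars, tags))
```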

Example code PmSeg: using it with Token Identification

from basicthainlp import PmSeg
from basicthainlp import TokenIden, DictToken
TID = TokenIden()
DTK = DictToken()
DTK.readFloder('input')
ps = PmSeg()
def get_ps(textInput):
    # Tag the raw text, then overlay the dictionary tags
    tokenIdenList = TID.tagTokenIden(textInput)
    tokenIdenList = DTK.rep_dictToken(textInput,tokenIdenList)
    textTokenList,tagList = TID.toTokenList(textInput,tokenIdenList)
    # ['otherSymb','mathSymb','punc','th_char','th_mym','en_char','digit','order','url','whitespace','space','newline','abbreviation','ne']
    # newTokenList = TID.replaceTag(['digit=<digit>'],textTokenList,tagList)
    newTokenList = []
    for textToken, tag in zip(textTokenList, tagList):
        if tag == 'th_char':
            # Only Thai-character tokens are split into pseudo morphemes
            data_list = ps.word2DataList(textToken)
            pred = ps.dataList2pmSeg(data_list)
            psList = ps.pmSeg2List(list(textToken),pred[0])
            newTokenList.extend(psList)
        else:
            # All other tokens are kept unchanged
            newTokenList.append(textToken)
    return newTokenList
textTest = 'ติดตามข่าวล่าสุด 12 iphone'
print(get_ps(textTest))

Alternatively, use the wrapper function provided by basicthainlp, which works the same way as the code above.

from basicthainlp import PmSeg
from basicthainlp import TokenIden, DictToken
from basicthainlp import get_ps
TID = TokenIden()
DTK = DictToken()
DTK.readFloder('input')
ps = PmSeg()
textTest = 'ติดตามข่าวล่าสุด 12 iphone'
print(get_ps(TID=TID,DTK=DTK,PS=ps,textInput=textTest))
print(get_ps(textInput=textTest))
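
The commented-out `replaceTag(['digit=<digit>'], ...)` line in the hand-written `get_ps` example hints at replacing tokens of a given tag with a placeholder. The plain-Python sketch below shows that idea on parallel token and tag lists. The function `replace_by_tag` and the sample lists are hypothetical, and the actual behaviour of `TID.replaceTag` is not documented here, so it may differ.

```python
def replace_by_tag(rules, tokens, tags):
    """Replace each token whose tag appears in rules ('tag=placeholder')."""
    mapping = dict(rule.split("=", 1) for rule in rules)
    return [mapping.get(tag, token) for token, tag in zip(tokens, tags)]

# Hypothetical sample data using tag names listed in this README.
tokens = ["iphone", " ", "13"]
tags = ["en_char", "space", "digit"]
print(replace_by_tag(["digit=<digit>"], tokens, tags))
```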

Download files

Source Distribution

basicthainlp-0.2.7.tar.gz (1.5 MB)

Uploaded Source

Built Distribution

basicthainlp-0.2.7-py3-none-any.whl (1.5 MB)

Uploaded Python 3
