Basic NLP for Thai

Project description

Colab

https://drive.google.com/file/d/1G7OUNsCC-B5XHNd8V5Et1ZKpJp4R66hg/view?usp=share_link

Token Identification

Example code TokenIden

from basicthainlp import TokenIden
TID = TokenIden()
textTest = "the 1..  .25 \"(12,378.36 / -78.9%) = 76,909\tcontain iphone 13 45. +-*/ -5 12.10.226.38.25 กค. สิงหา%/<=>  6 's\n"
tokenIdenList = TID.tagTokenIden(textTest)
textTokenList,tagList = TID.toTokenList(textTest,tokenIdenList)
for x, y in zip(textTokenList, tagList):
    if y != 'otherSymb' and y != 'space':
        print(x,y)
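
The loop above prints every token except those tagged `otherSymb` or `space`. A minimal sketch of the same filtering as a reusable helper follows; it runs without the library. The helper name `keep_tokens` and the sample token and tag lists are hand-written placeholders for illustration, not real output of `TokenIden`.

```python
def keep_tokens(tokens, tags, skip=("otherSymb", "space")):
    """Return (token, tag) pairs whose tag is not listed in `skip`."""
    return [(tok, tag) for tok, tag in zip(tokens, tags) if tag not in skip]


# Hand-written sample data (assumption): not produced by the library.
sample_tokens = ["iphone", " ", "13"]
sample_tags = ["en_char", "space", "digit"]

for tok, tag in keep_tokens(sample_tokens, sample_tags):
    print(tok, tag)
```

Passing a different `skip` tuple changes which tags are dropped without touching the loop.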

Example code TokenIden: Add dict

The input is a folder containing files; each file is a single-column list of words (one word per line).
The resulting tag matches the file name (without the extension).

Example dict

input
--abbreviation.txt # the resulting tag is abbreviation
----กค.
----สค.
----กพ.
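
The folder layout above could be created with a short script such as the following sketch. The helper name `write_dict_folder` is hypothetical, and UTF-8 encoding with one word per line is an assumption; the encoding the library expects is not stated here.

```python
from pathlib import Path


def write_dict_folder(folder, word_lists):
    """Write one file per tag (<tag>.txt), one word per line; return the paths."""
    root = Path(folder)
    root.mkdir(parents=True, exist_ok=True)
    paths = []
    for tag, words in word_lists.items():
        path = root / f"{tag}.txt"
        # Assumption: UTF-8 text, one word per line.
        path.write_text("\n".join(words) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


paths = write_dict_folder("input", {"abbreviation": ["กค.", "สค.", "กพ."]})
print([p.name for p in paths])
```

Adding another key to the mapping, for example a list of named entities, would produce a second file and therefore a second tag.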

from basicthainlp import TokenIden, DictToken
TID = TokenIden()
textTest = "the 1..  .25 \"(12,378.36 / -78.9%) = 76,909\tcontain iphone 13 45. +-*/ -5 12.10.226.38.25 กค. สิงหา%/<=>  6 's\n"
tokenIdenList = TID.tagTokenIden(textTest)
# >>> Add dict 
DTK = DictToken()
DTK.readFloder('input')
tokenIdenList = DTK.rep_dictToken(textTest,tokenIdenList)
# <<< Add dict
textTokenList,tagList = TID.toTokenList(textTest,tokenIdenList)
for x, y in zip(textTokenList, tagList):
    if y != 'otherSymb' and y != 'space':
        print(x,y)

================================================================================

Word token to Pseudo Morpheme Segmentation

- Not recommended for long Thai sentences. Segment the text into words first, or use it together with Token Identification.

Example code PmSeg

from basicthainlp import PmSeg
ps = PmSeg()

textTest = 'รัฐราชการ'
data_list = ps.word2DataList(textTest)
print(data_list)
pred = ps.dataList2pmSeg(data_list)
print(list(textTest))
print(pred[0])
print(ps.pmSeg2List(list(textTest),pred[0]))

Output

[['ร', 'Ccc'], ['ั', 'Vu'], ['ฐ', 'C'], ['ร', 'Ccc'], ['า', 'Vm'], ['ช', 'C'], ['ก', 'C'], ['า', 'Vm'], ['ร', 'Ccc']]
['ร', 'ั', 'ฐ', 'ร', 'า', 'ช', 'ก', 'า', 'ร']
['B', 'I', 'C', 'B', 'I', 'C', 'B', 'I', 'I']
['รัฐ', 'ราช', 'การ']
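
In the printed result, each `B` tag coincides with the start of a segment. A minimal sketch of how such a tag sequence could be turned into segments is shown below, assuming `B` opens a new segment and any other tag (`I`, `C`) continues the current one. This is an illustration inferred from the output, not the library's implementation of `pmSeg2List`; the name `tags_to_segments` is hypothetical.

```python
def tags_to_segments(chars, tags):
    """Join characters into segments, starting a new one at every 'B' tag."""
    segments = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not segments:
            segments.append(ch)   # open a new segment
        else:
            segments[-1] += ch    # extend the current segment
    return segments


chars = list("รัฐราชการ")
tags = ["B", "I", "C", "B", "I", "C", "B", "I", "I"]
print(tags_to_segments(chars, tags))  # ['รัฐ', 'ราช', 'การ']
```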

Example code PmSeg: use with Token Identification

from basicthainlp import PmSeg
from basicthainlp import TokenIden, DictToken
TID = TokenIden()
DTK = DictToken()
DTK.readFloder('input')
ps = PmSeg()
def get_ps(textInput):
    # Tag the tokens, then apply the dictionary tags
    tokenIdenList = TID.tagTokenIden(textInput)
    tokenIdenList = DTK.rep_dictToken(textInput,tokenIdenList)
    textTokenList,tagList = TID.toTokenList(textInput,tokenIdenList)
    # ['otherSymb','mathSymb','punc','th_char','th_mym','en_char','digit','order','url','whitespace','space','newline','abbreviation','ne']
    # newTokenList = TID.replaceTag(['digit=<digit>'],textTokenList,tagList)
    newTokenList = []
    for textToken, tag in zip(textTokenList, tagList):
        if tag == 'th_char':
            # Split only Thai-character tokens into pseudo morphemes
            data_list = ps.word2DataList(textToken)
            pred = ps.dataList2pmSeg(data_list)
            psList = ps.pmSeg2List(list(textToken),pred[0])
            newTokenList.extend(psList)
        else:
            # Keep every other token unchanged
            newTokenList.append(textToken)
    return newTokenList
textTest = 'ติดตามข่าวล่าสุด 12 iphone'
print(get_ps(textTest))
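
The commented-out `replaceTag` line in the example suggests that tokens with a given tag can be swapped for a placeholder such as `<digit>`. The sketch below is a plain-Python stand-in for that idea, not the library's `replaceTag`, whose exact behavior is not documented here. The helper name `replace_by_tag` and the sample lists are hand-written assumptions.

```python
def replace_by_tag(tokens, tags, mapping):
    """Replace each token whose tag is a key of `mapping` with its placeholder."""
    return [mapping.get(tag, tok) for tok, tag in zip(tokens, tags)]


# Hand-written sample data (assumption): not produced by the library.
tokens = ["iphone", " ", "13"]
tags = ["en_char", "space", "digit"]
print(replace_by_tag(tokens, tags, {"digit": "<digit>"}))
```

Normalizing numbers this way keeps the token positions intact while reducing vocabulary size for downstream models.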
