Basic NLP for Thai

Project description

Token Identification

Example code TokenIden

from basicthainlp import TokenIden
TID = TokenIden()
textTest = "the 1..  .25 \"(12,378.36 / -78.9%) = 76,909\tcontain iphone 13 45. +-*/ -5 12.10.226.38.25 กค. สิงหา%/<=>  6 's\n"
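# Tag the input text, then split it into parallel token and tag lists
tokenIdenList = TID.tagTokenIden(textTest)
textTokenList, tagList = TID.toTokenList(textTest, tokenIdenList)
# Print every token except those tagged 'otherSymb' or 'space'
for x, y in zip(textTokenList, tagList):
    if y not in ('otherSymb', 'space'):
        print(x, y)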

Example code TokenIden: Add dict

The input is a folder of files, each containing a single-column list of words. The resulting tag matches the file name.

Example dict

input
--abbreviation.txt    # the resulting tag is "abbreviation"
----กค.
----สค.
----กพ.
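
The folder layout above can be prepared by hand or with a few lines of Python. The following is a minimal sketch, not part of basicthainlp: the helper name `write_dict_folder` is hypothetical, and UTF-8 encoding with one word per line is an assumption based on the description above.

```python
from pathlib import Path


def write_dict_folder(folder, entries):
    """Create one text file per tag, with one word per line (hypothetical helper)."""
    root = Path(folder)
    root.mkdir(parents=True, exist_ok=True)
    for tag, words in entries.items():
        # The file name (without .txt) is the tag the words should receive.
        # UTF-8 is an assumption; the package does not state an encoding here.
        (root / f"{tag}.txt").write_text("\n".join(words) + "\n", encoding="utf-8")
    return sorted(p.name for p in root.iterdir())


# Recreate the example layout: input/abbreviation.txt with three abbreviations.
created = write_dict_folder("input", {"abbreviation": ["กค.", "สค.", "กพ."]})
print(created)
```

Adding another tag would then just mean adding another key to the dictionary, which produces another file in the same folder.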

from basicthainlp import TokenIden, DictToken
TID = TokenIden()
textTest = "the 1..  .25 \"(12,378.36 / -78.9%) = 76,909\tcontain iphone 13 45. +-*/ -5 12.10.226.38.25 กค. สิงหา%/<=>  6 's\n"
tokenIdenList = TID.tagTokenIden(textTest)
# >>> Add dict 
DTK = DictToken()
DTK.readFloder('input')
tokenIdenList = DTK.rep_dictToken(textTest,tokenIdenList)
# <<< Add dict
textTokenList,tagList = TID.toTokenList(textTest,tokenIdenList)
for x, y in zip(textTokenList, tagList):
    if y != 'otherSymb' and y != 'space':
        print(x,y)
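
The filtering loop at the end of both examples can be wrapped in a small reusable helper. This is only a sketch: `drop_tags` is a hypothetical name, and the sample lists are hand-written for illustration rather than actual output of `toTokenList`.

```python
def drop_tags(tokens, tags, skip=("otherSymb", "space")):
    """Return (token, tag) pairs whose tag is not listed in `skip`."""
    return [(tok, tag) for tok, tag in zip(tokens, tags) if tag not in skip]


# Hand-written sample lists for illustration; not actual library output.
sample_tokens = ["iphone", " ", "13"]
sample_tags = ["en_char", "space", "digit"]

kept = drop_tags(sample_tokens, sample_tags)
print(kept)
```

In real use, `textTokenList` and `tagList` from the example above would be passed in place of the sample lists.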

================================================================================

Word token to Pseudo Morpheme Segmentation

- Not recommended for long Thai sentences. Split the text into words first, or use it together with Token Identification.

Example code PmSeg

from basicthainlp import PmSeg
ps = PmSeg()

textTest = 'รัฐราชการ'
data_list = ps.word2DataList(textTest)
print(data_list)
pred = ps.dataList2pmSeg(data_list)
print(list(textTest))
print(pred[0])
print(ps.pmSeg2List(list(textTest),pred[0]))

Output:

[['ร', 'Ccc'], ['ั', 'Vu'], ['ฐ', 'C'], ['ร', 'Ccc'], ['า', 'Vm'], ['ช', 'C'], ['ก', 'C'], ['า', 'Vm'], ['ร', 'Ccc']]
['ร', 'ั', 'ฐ', 'ร', 'า', 'ช', 'ก', 'า', 'ร']
['B', 'I', 'C', 'B', 'I', 'C', 'B', 'I', 'I']
['รัฐ', 'ราช', 'การ']
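
The last two printed lines suggest how the per-character labels map to segments: each 'B' appears to open a new pseudo morpheme, and the following 'I' and 'C' characters stay in it. The sketch below reproduces that grouping in plain Python. It is an assumption drawn from this one example, not the implementation of `pmSeg2List`, and `group_by_boundary` is a hypothetical name.

```python
def group_by_boundary(chars, labels):
    """Start a new segment at each 'B'; attach any other label to the current segment."""
    segments = []
    for ch, label in zip(chars, labels):
        if label == "B" or not segments:
            segments.append(ch)   # open a new segment
        else:
            segments[-1] += ch    # 'I', 'C', or anything else continues it
    return segments


# Characters and labels copied from the printed example.
chars = ['ร', 'ั', 'ฐ', 'ร', 'า', 'ช', 'ก', 'า', 'ร']
labels = ['B', 'I', 'C', 'B', 'I', 'C', 'B', 'I', 'I']
print(group_by_boundary(chars, labels))  # ['รัฐ', 'ราช', 'การ']
```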

Example code PmSeg: use with Token Identification

from basicthainlp import PmSeg
from basicthainlp import TokenIden, DictToken
TID = TokenIden()
DTK = DictToken()
DTK.readFloder('input')
ps = PmSeg()
textTest = 'ติดตามข่าวล่าสุด 12 iphone'
tokenIdenList = TID.tagTokenIden(textTest)
tokenIdenList = DTK.rep_dictToken(textTest,tokenIdenList)
textTokenList,tagList = TID.toTokenList(textTest,tokenIdenList)
# Tag names: ['otherSymb','mathSymb','punc','th_char','th_mym','en_char','digit','order','url','whitespace','space','newline','abbreviation','ne']
# Optional: replace tokens by tag instead of segmenting them
# newTokenList = TID.replaceTag(['digit=<digit>'],textTokenList,tagList)
newTokenList = []
for textToken, tag in zip(textTokenList, tagList):
    if tag == 'th_char':
        data_list = ps.word2DataList(textToken)
        pred = ps.dataList2pmSeg(data_list)
        psList = ps.pmSeg2List(list(textToken),pred[0])
        newTokenList.extend(psList)
    else:
        newTokenList.append(textToken)
print(newTokenList)
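
The commented-out `replaceTag` call hints at rules of the form `tag=replacement`. Its exact behavior is not documented here, so the following is a standalone sketch of what such a rule could do, under that assumption. `replace_by_tag` is a hypothetical function and the sample lists are hand-written, not library output.

```python
def replace_by_tag(rules, tokens, tags):
    """Replace each token whose tag has a rule 'tag=replacement'; keep the rest."""
    # Split on the first '=' only, so replacements may themselves contain '='.
    mapping = dict(rule.split("=", 1) for rule in rules)
    return [mapping.get(tag, tok) for tok, tag in zip(tokens, tags)]


# Hand-written sample lists for illustration; not actual library output.
sample_tokens = ["iphone", " ", "13"]
sample_tags = ["en_char", "space", "digit"]

normalized = replace_by_tag(["digit=<digit>"], sample_tokens, sample_tags)
print(normalized)
```

Replacing digits with a placeholder like this is a common way to shrink the vocabulary before training or matching, which may be why the option appears next to the segmentation loop.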

Download files

Source Distribution: basicthainlp-0.2.4.tar.gz (1.5 MB)

Built Distribution: basicthainlp-0.2.4-py3-none-any.whl (1.5 MB, Python 3)
