Basic NLP for Thai

Project description

Token Identification

Example code TokenIden

from basicthainlp import TokenIden

TID = TokenIden()
textTest = "the 1..  .25 \"(12,378.36 / -78.9%) = 76,909\tcontain iphone 13 45. +-*/ -5 12.10.226.38.25 กค. สิงหา%/<=>  6 's\n"
# Identify the token types in the text
tokenIdenList = TID.tagTokenIden(textTest)
# Convert the result into parallel lists of tokens and tag names
textTokenList, tagList = TID.toTokenList(textTest, tokenIdenList)
# Print every token except those tagged 'otherSymb' or 'space'
for x, y in zip(textTokenList, tagList):
    if y != 'otherSymb' and y != 'space':
        print(x, y)
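
The filtering loop above can be factored into a small reusable helper. The following is a minimal sketch in plain Python that does not call basicthainlp; `drop_tags` is a hypothetical helper name, and the token/tag pairs are hand-written for illustration (they are not actual output of `TokenIden`), using tag names that appear elsewhere on this page.

```python
def drop_tags(tokens, tags, skip=("otherSymb", "space")):
    """Return (token, tag) pairs whose tag is not listed in `skip`."""
    skip_set = set(skip)
    return [(tok, tag) for tok, tag in zip(tokens, tags) if tag not in skip_set]


# Hand-written illustrative input, NOT real library output
tokens = ["iphone", " ", "13"]
tags = ["en_char", "space", "digit"]

kept = drop_tags(tokens, tags)
for tok, tag in kept:
    print(tok, tag)
```

In practice the two lists would come from `TID.toTokenList(...)` as in the example above.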

Example code TokenIden: Add dict

`input` is a folder whose files each contain a one-column list of words. The tag assigned to a matched word is the name of the file it came from.

Example dict

input
--abbreviation.txt    # the resulting tag is abbreviation
----กค.
----สค.
----กพ.
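
A dictionary folder with this layout can be created by hand or with a short script. The sketch below uses only the Python standard library and writes into a temporary directory; `write_dict_folder` is a hypothetical helper, and UTF-8 encoding with one word per line is an assumption about what `DictToken.readFloder` expects, not something this page confirms.

```python
import tempfile
from pathlib import Path


def write_dict_folder(root, entries):
    """Write one <tag>.txt file per tag, one word per line (assumed format)."""
    root.mkdir(parents=True, exist_ok=True)
    for tag, words in entries.items():
        # Assumption: UTF-8 text, file name (without .txt) becomes the tag
        (root / f"{tag}.txt").write_text("\n".join(words) + "\n", encoding="utf-8")
    return root


base = Path(tempfile.mkdtemp())
folder = write_dict_folder(base / "input", {"abbreviation": ["กค.", "สค.", "กพ."]})
print(sorted(p.name for p in folder.iterdir()))
```

The resulting folder path would then be passed to `DTK.readFloder(...)` as in the example that follows.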

from basicthainlp import TokenIden, DictToken
TID = TokenIden()
textTest = "the 1..  .25 \"(12,378.36 / -78.9%) = 76,909\tcontain iphone 13 45. +-*/ -5 12.10.226.38.25 กค. สิงหา%/<=>  6 's\n"
tokenIdenList = TID.tagTokenIden(textTest)
# >>> Add dict 
DTK = DictToken()
DTK.readFloder('input')
tokenIdenList = DTK.rep_dictToken(textTest,tokenIdenList)
# <<< Add dict
textTokenList,tagList = TID.toTokenList(textTest,tokenIdenList)
for x, y in zip(textTokenList, tagList):
    if y != 'otherSymb' and y != 'space':
        print(x,y)

================================================================================

Word token to Pseudo Morpheme Segmentation

- Not recommended for long Thai sentences. Segment the text into words first, or use it together with TokenIdentification.

Example code PmSeg

from basicthainlp import PmSeg

ps = PmSeg()

textTest = 'รัฐราชการ'
# Pair each character with its character-type label
data_list = ps.word2DataList(textTest)
print(data_list)
# Predict one segmentation tag per character
pred = ps.dataList2pmSeg(data_list)
print(list(textTest))
print(pred[0])
# Merge the characters into pseudo-morphemes using the predicted tags
print(ps.pmSeg2List(list(textTest), pred[0]))

Output:

[['ร', 'Ccc'], ['ั', 'Vu'], ['ฐ', 'C'], ['ร', 'Ccc'], ['า', 'Vm'], ['ช', 'C'], ['ก', 'C'], ['า', 'Vm'], ['ร', 'Ccc']]
['ร', 'ั', 'ฐ', 'ร', 'า', 'ช', 'ก', 'า', 'ร']
['B', 'I', 'C', 'B', 'I', 'C', 'B', 'I', 'I']
['รัฐ', 'ราช', 'การ']
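
Judging only from the printed lists above, each `B` tag appears to open a new segment while `I` and `C` extend the current one. The sketch below reproduces that merge step in plain Python under this assumption; `merge_by_tags` is a hypothetical function and is not the implementation of `PmSeg.pmSeg2List`.

```python
def merge_by_tags(chars, tags):
    """Group characters into segments, starting a new one at each 'B' tag."""
    segments = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not segments:
            segments.append(ch)   # open a new segment
        else:
            segments[-1] += ch    # 'I' and 'C' are assumed to continue it
    return segments


chars = list("รัฐราชการ")
tags = ["B", "I", "C", "B", "I", "C", "B", "I", "I"]
merged = merge_by_tags(chars, tags)
print(merged)
```

With the character and tag lists shown above, this yields the same three segments that the library example prints.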

Example code PmSeg: use with Token Identification

from basicthainlp import PmSeg
from basicthainlp.tokenIdentification import TokenIden, DictToken
TID = TokenIden()
DTK = DictToken()
DTK.readFloder('input')
ps = PmSeg()
textTest = 'ติดตามข่าวล่าสุด 12 iphone'
tokenIdenList = TID.tagTokenIden(textTest)
tokenIdenList = DTK.rep_dictToken(textTest,tokenIdenList)
textTokenList,tagList = TID.toTokenList(textTest,tokenIdenList)
# ['otherSymb','mathSymb','punc','th_char','th_mym','en_char','digit','order','url','whitespace','space','newline','abbreviation','ne']
# newTokenList = TID.replaceTag(['digit=<digit>'],textTokenList,tagList)
newTokenList = []
for textToken, tag in zip(textTokenList, tagList):
    if tag == 'th_char':
        data_list = ps.word2DataList(textToken)
        pred = ps.dataList2pmSeg(data_list)
        psList = ps.pmSeg2List(list(textToken),pred[0])
        newTokenList.extend(psList)
    else:
        newTokenList.append(textToken)
print(newTokenList)

Download files

Source Distribution: basicthainlp-0.2.3.tar.gz (1.5 MB)

Built Distribution: basicthainlp-0.2.3-py3-none-any.whl (1.5 MB, Python 3)
