Basic NLP for Thai
Project description
Colab
https://drive.google.com/file/d/1G7OUNsCC-B5XHNd8V5Et1ZKpJp4R66hg/view?usp=share_link
Update
0.3.1
- Add POS Tagging
0.2.7
- Add wrap function get_ps
0.2.1
- Add Token Identification
================================================================================
Token Identification
Example code TokenIden
from basicthainlp import TokenIden
TID = TokenIden()
textTest = "the 1.. .25 \"(12,378.36 / -78.9%) = 76,909\tcontain iphone 13 45. +-*/ -5 12.10.226.38.25 กค. สิงหา%/<=> 6 's\n"
tokenIdenList = TID.tagTokenIden(textTest)
textTokenList,tagList = TID.toTokenList(textTest,tokenIdenList)
for x, y in zip(textTokenList, tagList):
    if y != 'otherSymb' and y != 'space':
        print(x,y)
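To make the idea of token identification concrete, here is a minimal regex-based sketch that splits a string into typed tokens. It is an illustration only and not how basicthainlp implements TokenIden: the pattern, the function name tag_tokens, and the reduced tag set are assumptions (the tag names are borrowed from the tag list shown later in this description).

```python
import re

# Hypothetical, simplified tagger: each named group is one token type.
# The real TokenIden class recognises many more types (order, url, punc, ...).
TOKEN_PATTERN = re.compile(
    r"(?P<digit>[0-9]+)"
    r"|(?P<en_char>[A-Za-z]+)"
    r"|(?P<th_char>[\u0E00-\u0E7F]+)"   # Thai Unicode block
    r"|(?P<space> +)"
    r"|(?P<otherSymb>.)",               # fallback: any single character
    re.DOTALL,
)

def tag_tokens(text):
    """Return (token, tag) pairs; the tag is the name of the group that matched."""
    return [(m.group(), m.lastgroup) for m in TOKEN_PATTERN.finditer(text)]

pairs = tag_tokens("iphone 13 กค.")
for token, tag in pairs:
    # Same filter as the example above: hide symbols and spaces.
    if tag != 'otherSymb' and tag != 'space':
        print(token, tag)
```

Alternation order matters in such a pattern: earlier groups win, so the catch-all otherSymb group must come last.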
Example code TokenIden: Add dict
The input is a folder of files; each file is a single-column list of words (one word per line).
The resulting tag is the same as the file name.
Example dict
input
--abbreviation.txt # the resulting tag is abbreviation
----กค.
----สค.
----กพ.
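The folder layout above can be created with a few lines of standard-library Python. This is a sketch under assumptions: the helper name write_dict_folder is hypothetical, and UTF-8 encoding is assumed because the expected file encoding is not stated here.

```python
from pathlib import Path
import tempfile

def write_dict_folder(folder, dicts):
    """Write one <tag>.txt file per dictionary entry, one word per line."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for tag, words in dicts.items():
        # Assumption: UTF-8 text files; the file name (without .txt) becomes the tag.
        (folder / f"{tag}.txt").write_text("\n".join(words) + "\n", encoding="utf-8")
    return folder

# Demonstrated in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    folder = write_dict_folder(Path(tmp) / "input",
                               {"abbreviation": ["กค.", "สค.", "กพ."]})
    written = (folder / "abbreviation.txt").read_text(encoding="utf-8").splitlines()
    print(written)
```

To use the result with the example that follows, write to a folder named input in the working directory instead of a temporary one.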
from basicthainlp import TokenIden, DictToken
TID = TokenIden()
textTest = "the 1.. .25 \"(12,378.36 / -78.9%) = 76,909\tcontain iphone 13 45. +-*/ -5 12.10.226.38.25 กค. สิงหา%/<=> 6 's\n"
tokenIdenList = TID.tagTokenIden(textTest)
# >>> Add dict
DTK = DictToken()
DTK.readFloder('input')
tokenIdenList = DTK.rep_dictToken(textTest,tokenIdenList)
# <<< Add dict
textTokenList,tagList = TID.toTokenList(textTest,tokenIdenList)
for x, y in zip(textTokenList, tagList):
    if y != 'otherSymb' and y != 'space':
        print(x,y)
================================================================================
Word token to Pseudo Morpheme Segmentation
- Not recommended for long Thai sentences; segment the text into words first, or use it together with Token Identification.
Example code PmSeg
from basicthainlp import PmSeg
ps = PmSeg()
textTest = 'รัฐราชการ'
data_list = ps.word2DataList(textTest)
print(data_list)
pred = ps.dataList2pmSeg(data_list)
print(list(textTest))
print(pred[0])
print(ps.pmSeg2List(list(textTest),pred[0]))
[['ร', 'Ccc'], ['ั', 'Vu'], ['ฐ', 'C'], ['ร', 'Ccc'], ['า', 'Vm'], ['ช', 'C'], ['ก', 'C'], ['า', 'Vm'], ['ร', 'Ccc']]
['ร', 'ั', 'ฐ', 'ร', 'า', 'ช', 'ก', 'า', 'ร']
['B', 'I', 'C', 'B', 'I', 'C', 'B', 'I', 'I']
['รัฐ', 'ราช', 'การ']
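In the sample output above, every segment starts at a character tagged 'B'. The following sketch shows how a per-character tag sequence can be folded into segments. It is not the library's pmSeg2List implementation: the function name group_by_begin_tag is hypothetical, and treating 'B' as "start of a new pseudo-morpheme" is an assumption inferred from the sample output.

```python
def group_by_begin_tag(chars, tags, begin_tag="B"):
    """Concatenate characters, opening a new segment at each begin_tag."""
    segments = []
    for ch, tag in zip(chars, tags):
        if tag == begin_tag or not segments:
            segments.append(ch)       # start a new segment
        else:
            segments[-1] += ch        # extend the current segment
    return segments

chars = list("รัฐราชการ")
tags = ["B", "I", "C", "B", "I", "C", "B", "I", "I"]   # copied from the output above
segments = group_by_begin_tag(chars, tags)
print(segments)
```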
Example code PmSeg: use with Token Identification
from basicthainlp import PmSeg
from basicthainlp import TokenIden, DictToken
TID = TokenIden()
DTK = DictToken()
DTK.readFloder('input')
ps = PmSeg()
def get_ps(textInput):
    tokenIdenList = TID.tagTokenIden(textInput)
    tokenIdenList = DTK.rep_dictToken(textInput,tokenIdenList)
    textTokenList,tagList = TID.toTokenList(textInput,tokenIdenList)
    # ['otherSymb','mathSymb','punc','th_char','th_mym','en_char','digit','order','url','whitespace','space','newline','abbreviation','ne']
    # newTokenList = TID.replaceTag(['digit=<digit>'],textTokenList,tagList)
    newTokenList = []
    for textToken, tag in zip(textTokenList, tagList):
        if tag == 'th_char':
            # Only Thai-character tokens are segmented into pseudo morphemes
            data_list = ps.word2DataList(textToken)
            pred = ps.dataList2pmSeg(data_list)
            psList = ps.pmSeg2List(list(textToken),pred[0])
            newTokenList.extend(psList)
        else:
            newTokenList.append(textToken)
    return newTokenList
textTest = 'ติดตามข่าวล่าสุด 12 iphone'
print(get_ps(textTest))
Alternatively, use the wrapper function provided by basicthainlp, which works the same way as the code above.
from basicthainlp import PmSeg
from basicthainlp import TokenIden, DictToken
from basicthainlp import get_ps
TID = TokenIden()
DTK = DictToken()
DTK.readFloder('input')
ps = PmSeg()
textTest = 'ติดตามข่าวล่าสุด 12 iphone'
print(get_ps(tid_cls=TID,dtk_cls=DTK,ps_cls=ps,textInput=textTest))
print(get_ps(textInput=textTest))
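The hand-written get_ps definition earlier contains a commented-out call, TID.replaceTag(['digit=<digit>'], ...), which suggests replacing tokens of a given tag with a placeholder. How replaceTag actually behaves is not documented here, so the sketch below is only a guess at the idea: the function replace_by_tag is hypothetical, and the 'tag=placeholder' rule format is an assumption read off that comment.

```python
def replace_by_tag(rules, tokens, tags):
    """Replace each token whose tag has a rule with that rule's placeholder."""
    # Each rule is assumed to look like 'tag=placeholder', e.g. 'digit=<digit>'.
    mapping = dict(rule.split("=", 1) for rule in rules)
    return [mapping.get(tag, token) for token, tag in zip(tokens, tags)]

tokens = ["iphone", " ", "13"]
tags = ["en_char", "space", "digit"]
normalized = replace_by_tag(["digit=<digit>"], tokens, tags)
print(normalized)
```

Normalizing numbers this way is a common preprocessing step before training or applying a model, since the exact digits rarely matter.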
================================================================================
POS Tagging
POS tagging works on pseudo-morpheme (pm) tokens: the tokens are tagged, then merged into words with their POS.
Example code PosTag
from basicthainlp import TokenIden,DictToken
from basicthainlp import PmSeg
from basicthainlp import PosTag
TID = TokenIden()
DTK = DictToken()
DTK.readFloder('input')
PS = PmSeg()
textTest = 'จากนั้นคนร้ายก็ได้ขับมุ่งไปทางถนนเจริญกรุง'
pos_cls = PosTag(tid_cls=TID,dtk_cls=DTK,ps_cls=PS)
ps_list,tag_list = pos_cls.tagPOS(textTest)
print(ps_list)
print(tag_list)
word_list,pos_list = pos_cls.psSeg2WS(ps_list,tag_list)
print(word_list)
print(pos_list)