Thai Language Toolkit
Project description
TLTK is a Python package designed for Thai language processing, which includes functionalities such as syllable and word segmentation, discourse unit segmentation, POS tagging, named entity recognition, grapheme-to-phoneme conversion, IPA transcription, romanization, and more. To use TLTK, you will need to have Python 3.6 or a more recent version installed. The project is an open-source software developed at Chulalongkorn University. As of version 1.2.2, the package license has been changed to the New BSD License (BSD-3-Clause).
Input: text must be UTF-8 encoded Thai.
Updates:
Version 1.7: Introduced the spoonerism(w) module, which generates one or two spoonerisms from the input word w. This is achieved by swapping the first and last syllables, either a) preserving the initial consonant or b) preserving both the initial consonant and tone. The output is provided as a list of readings in Thai. Additionally, the dependency “sklearn” has been updated to “scikit-learn”.
Version 1.6.8: Bug fixes have been made to the “TextAna” module.
Version 1.6.7: Bug fixes have been made to the “g2p” module.
Version 1.6.6 includes a UD parser using MaltParser (https://www.maltparser.org/). To use this feature, please install MaltParser and add the line tltk.nlp.Maltparser_Path = "/path/to/maltparser-1.9.2" to your code before using MaltParser or MaltParser_wordlist. The former requires text input, while the latter requires a list of words. The UD tree generated by MaltParser is a dictionary with the following format: {'sentence': "ข้อความภาษาไทย", 'words': [{'id': nn, 'pos': POS, 'deprel': REL, 'head': HD_ID}, {…}, …]}. You can use print_dtree to print the dependency tree from the parsed result. Additionally, 'deprel' and 'SynDepth' have been added to the properties of TextAna when the option UDParse="Malt" is specified. By default, UDParse="none".
Version 1.6.5: This version includes bug fixes in the “SylAna” and “WordAna” modules, as well as a new module called “tltk.corpus.compound(x,y)”.
Version 1.6.3: Bug fixes have been made to the “g2p” module, and some features have been modified in both “WordAna” and “TextAna” modules.
Version 1.6.2: Changes have been made to the text features in this version.
Version 1.6.1: This version includes new text features, an updated Word2Vec model using ‘TNCc5model3.bin’, a change from ‘g2p_all’ to ‘th2ipa_all’, and some bug fixes.
Version 1.6: The new feature in this version is ‘TNC_tag’, which allows you to mark up Thai text in XML format.
Version 1.5.8: This version includes the addition of average reduced frequency in the TextAna module.
Version 1.5.7: Added the SylAna module, which is invoked from within WordAna; its output, a list of syllable properties, is added to the word properties. Additionally, 'th2read(text)' has been added, which shows the pronunciation in Thai written forms.
Version 1.5: This version includes the addition of the WordAna and TextAna modules. The output of WordAna is an object with word properties.
'res = tltk.nlp.TNC_tag(text, POS)' returns the XML format of Thai text as used in TNC. The POS option can be set to either "Y" or "N".
sp = tltk.nlp.SylAna(syl_form,syl_phone) => sp.form (syllable form), sp.phone (syllable sound), sp.char (number of characters in the syllable), sp.dead (indicates whether the syllable is dead or live, True/False), sp.initC (initial consonant form), sp.finalC (final consonant form), sp.vowel (vowel form), sp.tonemark (indicates the tone mark, เอก, โท, ตรี, จัตวา), sp.initPh (initial consonant sound), sp.finalPh (final consonant sound), sp.vowelPh (vowel sound), sp.tone (tone 1, 2, 3, 4, or 5), sp.leading (indicates whether the syllable is a leading syllable, True/False), sp.cluster (indicates whether the syllable has an initial cluster, True/False), sp.karan (number of characters marked with a karan marker)
wd = tltk.nlp.WordAna(w) => wd.form (word form), wd.phone (word sound), wd.char (number of characters in the word), wd.syl (number of syllables), wd.corrtone (number of tones that match the same tone marker), wd.corrfinal (number of final consonant sounds that match the final character -ก -ด -ง -น -ม -ย -ว), wd.karan (number of karan markers), wd.cluster (number of cluster consonants), wd.lead (number of leading consonants), wd.doubvowel (number of complex vowels), wd.syl_prop (a list of syllable properties)
res = tltk.nlp.TextAna(text, TextOption, WordOption) => a complex dictionary output describing the input text.
TextOption can be set to "segmented", "edu", or "par". Use "segmented" if the text is already segmented with <p>, <s>, and |. Use "edu" to apply TLTK's EDU segmentation. Use "par" to split the text into paragraphs at "\n".
WordOption can be set to “colloc” or “mm”. If the text is not yet segmented, use “colloc” or “mm” to segment the text into words using TLTK.
### properties from SylAna
form: syllable form
phone: syllable sound
char: number of characters in the syllable
dead: True|False (indicates whether the syllable is dead or live)
initC: initial consonant
finalC: final consonant
vowel: vowel form
tonemark: tone marker (values: 1, 2, 3, 4, 5)
initPh: initial sound
finalPh: final sound
vowelPh: vowel sound
tone: tone (values: 1, 2, 3, 4, 5)
leading: True|False (indicates whether the syllable is a leading syllable, e.g., in สบาย, สห)
cluster: True|False (indicates whether the syllable has a cluster consonant)
karan: character(s) marked with karan
### properties from WordAna
form: word form
phone: word sound
char: number of characters
syl: number of syllables
corrtone: number of correct tone markers (สามัญ, ่ เอก, ้ โท, ๊ ตรี, ๋ จัตวา) in both form and sound
incorrtone: number of incorrect tone markers in both form and sound
corrfinal: number of correct final consonants (-ก -ด -ง -น -ม -ย -ว)
incorrfinal: number of incorrect final consonants (excluding -ก -ด -ง -น -ม -ย -ว)
karan: number of karan markers
cluster: number of cluster consonants
lead: number of leading consonants
doubvowel: number of double vowels
### properties from TextAna
DesSpC: No. of spaces in a text
DesChaC: No. of characters in a text
DesSymbC: No. of symbols or special characters in a text
DesPC: No. of paragraphs
DesEduC: No. of edu units
DesTotW: Total number of words in a text
DesTotT: Total number of unique words (types) in a text
DesEduL: Mean length of an edu unit (in words)
DesEduLd: Standard deviation of edu length (in words)
DesWrdL: Mean length of a word (in syllables)
DesWrdLd: Standard deviation of word length (in syllables)
DesPL: Mean length of a paragraph (in words)
DesCorrToneC: Number of words with the correct tone form and tone sound
DesInCorrToneC: Number of words with incorrect tone form and/or tone sound
DesCorrFinalC: Number of words with correct final consonant (-ก -ด -ง -น -ม -ย -ว)
DesInCorrFinalC: Number of words with incorrect final consonant (not -ก -ด -ง -น -ม -ย -ว)
DesClusterC: Number of words with a consonant cluster
DesLeadC: Number of words with a leading syllable (e.g. สบาย, สห)
DesDoubVowelC: Number of words with a double vowel
DesTNCt1C: No. of words in TNC tier1 50%
DesTNCt2C: No. of words in TNC tier2 51-60%
DesTNCt3C: No. of words in TNC tier3 61-70%
DesTNCt4C: No. of words in TNC tier4 71-80%
DesTTC1: No. of words in TTC level1
DesTTC2: No. of words in TTC level2
DesTTC3: No. of words in TTC level3
DesTTC4: No. of words in TTC level4
WrdCorrTone: ratio of words with the same tone form and phone
WrdInCorrTone: ratio of words with different tone form and phone
WrdCorrFinal: ratio of words with correct final consonant -ก -ด -ง -น -ม -ย -ว
WrdInCorrFinal: ratio of words with final consonant not -ก -ด -ง -น -ม -ย -ว
WrdKaran: ratio of words with a karan
WrdCluster: ratio of words with a cluster
WrdLead: ratio of words with a leading syllable
WrdDoubVowel: ratio of words with a double vowel
WrdNEl: ratio of named entity locations
WrdNEo: ratio of named entity organizations
WrdNEp: ratio of named entity persons
WrdNeg: ratio of negations
WrdTNCt1: relative frequency of words in TNC tier 1 (/1000 words)
WrdTNCt2: relative frequency of words in TNC tier 2
WrdTNCt3: relative frequency of words in TNC tier 3
WrdTNCt4: relative frequency of words in TNC tier 4
WrdTTC1: relative frequency of words in TTC level 1
WrdTTC2: relative frequency of words in TTC level 2
WrdTTC3: relative frequency of words in TTC level 3
WrdTTC4: relative frequency of words in TTC level 4
WrdC: mean of relative frequency of content words in TTC
WrdF: mean of relative frequency of function words in TTC
WrdCF: mean of relative frequency of content/function words in TTC
WrdFrmSing: mean of relative frequency of single-word forms in TTC
WrdFrmComp: mean of relative frequency of complex/compound word forms in TTC
WrdFrmTran: mean of relative frequency of transliterated words in TTC
WrdSemSimp: mean of relative frequency of simple words in TTC
WrdSemTran: mean of relative frequency of transparent compound words in TTC
WrdSemSemi: mean of relative frequency of words in between transparent and opaque compound words in TTC
WrdSemOpaq: mean of relative frequency of opaque compound words in TTC
WrdBaseM: mean of relative frequency of basic vocab from Ministry of Education
WrdBaseT: mean of relative frequency of basic vocab from TTC & TNC < 2000
WrdTfidf: average of TF-IDF of each word (calculated from TNC)
WrdTncDisp: average of dispersion of each word (calculated from TNC)
WrdTtcDisp: average of dispersion of each word (calculated from TTC)
WrdArf: average of ARF (average reduced frequency) of each word in the text
WrdNOUN: mean of relative frequency of words with POS=NOUN
WrdVERB: mean of relative frequency of words with POS=VERB
WrdADV: mean of relative frequency of words with POS=ADV
WrdDET: mean of relative frequency of words with POS=DET
WrdADJ: mean of relative frequency of words with POS=ADJ
WrdADP: mean of relative frequency of words with POS=ADP
WrdPUNCT: mean of relative frequency of words with POS=PUNCT
WrdAUX: mean of relative frequency of words with POS=AUX
WrdSYM: mean of relative frequency of words with POS=SYM
WrdINTJ: mean of relative frequency of words with POS=INTJ
WrdCCONJ: mean of relative frequency of words with POS=CCONJ
WrdPROPN: mean of relative frequency of words with POS=PROPN
WrdNUM: mean of relative frequency of words with POS=NUM
WrdPART: mean of relative frequency of words with POS=PART
WrdPRON: mean relative frequency of words with POS=PRON
WrdSCONJ: mean relative frequency of words with POS=SCONJ
LdvTTR: type-token ratio, which is the ratio of the number of unique words (types) to the total number of words (tokens) in a text
CrfCNL: proportion of utterances having the same NOUN overlapped locally (yes or no)
CrfCVL: proportion of utterances having the same VERB overlapped locally (yes or no)
CrfCWL: proportion of utterances having the same content words overlapped locally (yes or no)
CrfCTL: proportion of utterances having content words overlapped locally (measured by the number of overlapping tokens)
wrd: dictionary where wrd[word] = freq, representing the frequency of each word in a text
wrd_arf: dictionary where wrd_arf[word] = arf, representing the average reduced frequency of each word in a text
wrd_deprel: dictionary where wrd_deprel[deprel] = freq, representing the frequency of each dependency relation (deprel) in a text
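Several of the derived features above can be recomputed directly from the wrd frequency dictionary; for instance, LdvTTR is the number of unique words (types) divided by the total number of tokens. A minimal sketch in plain Python, using invented sample counts:

```python
# Sketch: deriving the type-token ratio (LdvTTR) from a `wrd`
# frequency dictionary of the form wrd[word] = freq.

def ttr(wrd):
    """Unique words (types) divided by total word tokens."""
    tokens = sum(wrd.values())
    return len(wrd) / tokens if tokens else 0.0

wrd = {"จังหวัด": 32, "สมุทรสาคร": 16, "เปิด": 3, "ศูนย์": 13}
print(ttr(wrd))  # 4 types / 64 tokens = 0.0625
```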
Version 1.4 has been updated for gensim 4.0. Users can load a Thai corpus using Corpus(), then create a model using W2V_train() or D2V_train(), or load an existing model from W2V_load(Model_File). The pre-trained w2v model for TNC is TNCc5model2.bin. The model for EDU segmentation has been recompiled to work with the new library.
Version 1.3.8 has added spell_variants to generate all variation forms of the same pronunciation.
Version 1.3.6 has removed the “matplotlib” dependency and fixed an error with “ใคร”.
More compound words have been added to the dictionary. Versions 1.1.3-1.1.5 contained many entries that were not words and had a few errors. Those entries have been removed in later versions.
The NER tagger model has been updated by using more named entity data from the AiforThai project.
tltk.nlp : basic tools for Thai language processing.
>tltk.nlp.spoonerism(word_or_phrase): Returns one or two spoonerisms derived from the input. For example, spoonerism('แขนเป็นฟอ') returns:
=>[‘คอ-เป็น-แฝน’, ‘ขอ-เป็น-แฟน’]
>tltk.nlp.TextAna(Text, UDParse="Malt"): This function analyzes plain text by paragraph, segments words using the colloc approach, and employs MaltParser for UD parsing. The default options are TextOption="par", WordOption="colloc", and UDParse="none". If the input is already segmented with '|', use TextOption="segmented" and WordOption="segmented". If processing by EDU is preferred, set TextOption="edu".
=>output as a dict of text features described in TextAna
>tltk.nlp.TextAna2json(Text, Filename, Options) functions similarly to the above, but the results are saved to a JSON file. The Options parameter includes a Mode which can be set to “write” or “append”.
>tltk.nlp.MaltParser(Text) e.g. print_dtree(tltk.nlp.MaltParser(“เขานั่งดูหนังอยู่ที่บ้าน”))
=>
1:----เขา (PRON, nsubj - 2)
2:--นั่ง (VERB, root - 0)
3:----ดู (VERB, compound - 2)
4:------หนัง (NOUN, obj - 3)
5:------อยู่ (VERB, compound - 3)
6:----------ที่ (ADP, case - 7)
7:--------บ้าน (NOUN, obl - 5)
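The indented output above can be reproduced from the UD-tree dictionary format ({'id': nn, 'pos': POS, 'deprel': REL, 'head': HD_ID}) by counting head-chain steps up to the root. The sketch below is not TLTK's own print_dtree: the documented format carries no key for the word string itself, so a hypothetical 'form' key is added, and the hand-written parse covers only part of the sentence.

```python
# Sketch: printing a dependency tree from the UD-tree dictionary format.
# The 'form' key is an assumption; the documented keys are id/pos/deprel/head.

def depth(words, wid):
    """Number of head-chain steps from word `wid` up to the root (head == 0)."""
    by_id = {w["id"]: w for w in words}
    d = 0
    while by_id[wid]["head"] != 0:
        wid = by_id[wid]["head"]
        d += 1
    return d

def show_dtree(parse):
    for w in parse["words"]:
        indent = "--" * (depth(parse["words"], w["id"]) + 1)
        print(f'{w["id"]}:{indent}{w["form"]} ({w["pos"]}, {w["deprel"]} - {w["head"]})')

parse = {
    "sentence": "เขานั่งดูหนัง",
    "words": [
        {"id": 1, "form": "เขา", "pos": "PRON", "deprel": "nsubj", "head": 2},
        {"id": 2, "form": "นั่ง", "pos": "VERB", "deprel": "root", "head": 0},
        {"id": 3, "form": "ดู", "pos": "VERB", "deprel": "compound", "head": 2},
        {"id": 4, "form": "หนัง", "pos": "NOUN", "deprel": "obj", "head": 3},
    ],
}
show_dtree(parse)
```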
>tltk.nlp.TNC_tag(Text,POSTagOption) e.g. tltk.nlp.TNC_tag(‘นายกรัฐมนตรีกล่าวกับคนขับรถประจำทางหลวงสายสองว่า อยากวิงวอนให้ใช้ความรอบคอบ’,POS=’Y’)
=> ‘<w tran=”naa0jok3rat3tha1mon0trii0” POS=”NOUN”>นายกรัฐมนตรี</w><w tran=”klaaw1” POS=”VERB”>กล่าว</w><w tran=”kap1” POS=”ADP”>กับ</w><w tran=”khon0khap1rot3” POS=”NOUN”>คนขับรถ</w><w tran=”pra1cam0” POS=”NOUN”>ประจำ</w><w tran=”thaaN0luuaN4” POS=”NOUN”>ทางหลวง</w><w tran=”saaj4” POS=”NOUN”>สาย</w><w tran=”sOON4” POS=”NUM”>สอง</w><w tran=”waa2” POS=”SCONJ”>ว่า</w><s/><w tran=”jaak1” POS=”VERB”>อยาก</w><w tran=”wiN0wOOn0” POS=”VERB”>วิงวอน</w><w tran=”haj2” POS=”SCONJ”>ให้</w><w tran=”chaj3” POS=”VERB”>ใช้</w><w tran=”khwaam0” POS=”NOUN”>ความ</w><w tran=”rOOp2khOOp2” POS=”VERB”>รอบคอบ</w><s/>’
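Since the returned fragment is a sequence of <w> elements without a single root, it can be post-processed by wrapping it in a dummy root and parsing it with the standard library. A sketch using a shortened copy of the output above (quotes normalized to plain ASCII):

```python
# Sketch: reading TNC-style <w tran=".." POS="..">..</w> markup with
# ElementTree; the fragment is wrapped in a dummy <root> element first.
import xml.etree.ElementTree as ET

fragment = ('<w tran="naa0jok3rat3tha1mon0trii0" POS="NOUN">นายกรัฐมนตรี</w>'
            '<w tran="klaaw1" POS="VERB">กล่าว</w><s/>')
root = ET.fromstring(f"<root>{fragment}</root>")
words = [(w.text, w.get("POS"), w.get("tran")) for w in root.iter("w")]
print(words)
```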
>tltk.nlp.chunk(Text) : chunk parsing. The output includes markups for word segments (|), elementary discourse units (<u/>), pos tags (/POS),and named entities (<NEx>…</NEx>), e.g. tltk.nlp.chunk(“สำนักงานเขตจตุจักรชี้แจงว่า ได้นำป้ายประกาศเตือนปลิงไปปักตามแหล่งน้ำ ในเขตอำเภอเมือง จังหวัดอ่างทอง หลังจากนายสุกิจ อายุ 65 ปี ถูกปลิงกัดแล้วไม่ได้ไปพบแพทย์”)
=> ‘<NEo>สำนักงาน/NOUN|เขต/NOUN|จตุจักร/PROPN|</NEo>ชี้แจง/VERB|ว่า/SCONJ|<s/>/PUNCT|ได้/AUX|นำ/VERB|ป้ายประกาศ/NOUN|เตือน/VERB|ปลิง/NOUN|ไป/VERB|ปัก/VERB|ตาม/ADP|แหล่งน้ำ/NOUN|<u/>ใน/ADP|<NEl>เขต/NOUN|อำเภอ/NOUN|เมือง/NOUN|<s/>/PUNCT|จังหวัด/NOUN|อ่างทอง/PROPN|</NEl><u/>หลังจาก/SCONJ|<NEp>นาย/NOUN|สุ/PROPN|กิจ/NOUN|</NEp><s/>/PUNCT|อายุ/NOUN|<u/>65/NUM|<s/>/PUNCT|ปี/NOUN|<u/>ถูก/AUX|ปลิง/VERB|กัด/VERB|แล้ว/ADV|ไม่ได้/AUX|ไป/VERB|พบ/VERB|แพทย์/NOUN|<u/>’
>tltk.nlp.ner_tag(Text) : The output includes markups for named entities (<NEx>…</NEx>), e.g. tltk.nlp.ner_tag(“สำนักงานเขตจตุจักรชี้แจงว่า ได้นำป้ายประกาศเตือนปลิงไปปักตามแหล่งน้ำ ในเขตอำเภอเมือง จังหวัดอ่างทอง หลังจากนายสุกิจ อายุ 65 ปี ถูกปลิงกัดแล้วไม่ได้ไปพบแพทย์”)
=> ‘<NEo>สำนักงานเขตจตุจักร</NEo>ชี้แจงว่า ได้นำป้ายประกาศเตือนปลิงไปปักตามแหล่งน้ำ ใน<NEl>เขตอำเภอเมือง จังหวัดอ่างทอง</NEl> หลังจาก<NEp>นายสุกิจ</NEp> อายุ 65 ปี ถูกปลิงกัดแล้วไม่ได้ไปพบแพทย์’
>tltk.nlp.ner([(w,pos),….]) : module for named entity recognition (person, organization, location), e.g. tltk.nlp.ner([(‘สำนักงาน’, ‘NOUN’), (‘เขต’, ‘NOUN’), (‘จตุจักร’, ‘PROPN’), (‘ชี้แจง’, ‘VERB’), (‘ว่า’, ‘SCONJ’), (’<s/>’, ‘PUNCT’)])
=> [(‘สำนักงาน’, ‘NOUN’, ‘B-O’), (‘เขต’, ‘NOUN’, ‘I-O’), (‘จตุจักร’, ‘PROPN’, ‘I-O’), (‘ชี้แจง’, ‘VERB’, ‘O’), (‘ว่า’, ‘SCONJ’, ‘O’), (’<s/>’, ‘PUNCT’, ‘O’)] Named entity recognition is based on the CRF model adapted from the http://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html tutorial. The model was trained on a corpus containing 170,000 named entities. The tags used for organizations are B-O and I-O, for persons are B-P and I-P, and for locations are B-L and I-L.
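The B-/I-/O tags in this output can be collapsed into entity spans by grouping each B- tag with the I- tags that follow it. A minimal sketch (not part of TLTK), using the tag suffixes described above (O = organization, P = person, L = location):

```python
# Sketch: grouping (word, pos, BIO-tag) triples from tltk.nlp.ner into
# named-entity spans. Tags are B-X/I-X for entity type X, "O" for outside.

def bio_to_entities(triples):
    entities, current = [], None
    for word, pos, tag in triples:
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = (tag[2:], [word])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(word)
        else:  # "O", or an I- tag that does not continue the open entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, "".join(parts)) for etype, parts in entities]

res = [("สำนักงาน", "NOUN", "B-O"), ("เขต", "NOUN", "I-O"),
       ("จตุจักร", "PROPN", "I-O"), ("ชี้แจง", "VERB", "O"),
       ("ว่า", "SCONJ", "O")]
print(bio_to_entities(res))  # -> [('O', 'สำนักงานเขตจตุจักร')]
```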
>tltk.nlp.pos_tag(Text,WordSegmentOption) : word segmentation and POS tagging (using nltk.tag.perceptron), e.g. tltk.nlp.pos_tag('โปรแกรมสำหรับใส่แท็กหมวดคำภาษาไทย วันนี้ใช้งานได้บ้างแล้ว')
=> [[(‘โปรแกรม’, ‘NOUN’), (‘สำหรับ’, ‘ADP’), (‘ใส่’, ‘VERB’), (‘แท็ก’, ‘NOUN’), (‘หมวดคำ’, ‘NOUN’), (‘ภาษาไทย’, ‘PROPN’), (’<s/>’, ‘PUNCT’)], [(‘วันนี้’, ‘NOUN’), (‘ใช้งาน’, ‘VERB’), (‘ได้’, ‘ADV’), (‘บ้าง’, ‘ADV’), (‘แล้ว’, ‘ADV’), (’<s/>’, ‘PUNCT’)]]
The default word segmentation method used is “colloc” in the function word_segment(Text, “colloc”), but if the option is set to “mm”, then the function word_segment(Text, “mm”) will be used. The POS tag set used is based on the Universal POS tag set found at http://universaldependencies.org/u/pos/index.html. The nltk.tag.perceptron model is used for POS tagging, which was trained on a POS-tagged subcorpus in TNC consisting of 148,000 words.
>tltk.nlp.pos_tag_wordlist(WordLst) : Same as “tltk.nlp.pos_tag”, but the input is a word list, [w1,w2,…]
>tltk.nlp.segment(Text) : segment a paragraph into elementary discourse units (edu) marked with <u/> and segment words in each edu e.g. tltk.nlp.segment(“แต่อาจเพราะนกกินปลีอกเหลืองเป็นพ่อแม่มือใหม่ รังที่ทำจึงไม่ค่อยแข็งแรง วันหนึ่งรังก็ฉีกเกือบขาดเป็นสองท่อนห้อยต่องแต่ง ผมพยายามหาอุปกรณ์มายึดรังกลับคืนรูปทรงเดิม ขณะที่แม่นกกินปลีอกเหลืองส่งเสียงโวยวายอยู่ใกล้ ๆ แต่สุดท้ายไม่สำเร็จ สองสามวันต่อมารังที่ช่วยซ่อมก็พังไป ไม่เห็นแม่นกบินกลับมาอีกเลย”)
=> ‘แต่|อาจ|เพราะ|นกกินปลีอกเหลือง|เป็น|พ่อแม่|มือใหม่|<s/>|รัง|ที่|ทำ|จึง|ไม่|ค่อย|แข็งแรง<u/>วัน|หนึ่ง|รัง|ก็|ฉีก|เกือบ|ขาด|เป็น|สอง|ท่อน|ห้อย|ต่องแต่ง<u/>ผม|พยายาม|หา|อุปกรณ์|มา|ยึด|รัง|กลับคืน|รูปทรง|เดิม<u/>ขณะ|ที่|แม่|นกกินปลีอกเหลือง|ส่งเสียง|โวยวาย|อยู่|ใกล้|ๆ<u/>แต่|สุดท้าย|ไม่|สำเร็จ|<s/>|สอง|สาม|วัน|ต่อ|มา|รัง|ที่|ช่วย|ซ่อม|ก็|พัง|ไป<u/>ไม่|เห็น|แม่|นก|บิน|กลับ|มา|อีก|เลย<u/>’ edu segmentation is based on syllable input using RandomForestClassifier model, which is trained on an edu-segmented corpus (approx. 7,000 edus) created and used in Nalinee's thesis
>tltk.nlp.word_segment(Text,method=’mm|ngram|colloc’) : word segmentation using either maximum matching or ngram or maximum collocation approach. ‘colloc’ is used by default. Please note that the first run of ngram method would take a long time because TNC.3g will be loaded for ngram calculation. e.g.
>tltk.nlp.word_segment(‘ผู้สื่อข่าวรายงานว่านายกรัฐมนตรีไม่มาทำงานที่ทำเนียบรัฐบาล’) => ‘ผู้สื่อข่าว|รายงาน|ว่า|นายกรัฐมนตรี|ไม่|มา|ทำงาน|ที่|ทำเนียบรัฐบาล|<s/>’
>tltk.nlp.syl_segment(Text) : syllable segmentation using 3gram statistics e.g. tltk.nlp.syl_segment(‘โปรแกรมสำหรับประมวลผลภาษาไทย’)
=> ‘โปร~แกรม~สำ~หรับ~ประ~มวล~ผล~ภา~ษา~ไทย<s/>’
>tltk.nlp.word_segment_nbest(Text, N) : return the best N segmentations based on a minimum-word assumption, e.g. tltk.nlp.word_segment_nbest('คนขับรถประจำทางปรับอากาศ',10)
=> [[‘คนขับ|รถประจำทาง|ปรับอากาศ’, ‘คนขับรถ|ประจำทาง|ปรับอากาศ’, ‘คน|ขับ|รถประจำทาง|ปรับอากาศ’, ‘คน|ขับรถ|ประจำทาง|ปรับอากาศ’, ‘คนขับ|รถ|ประจำทาง|ปรับอากาศ’, ‘คนขับรถ|ประจำ|ทาง|ปรับอากาศ’, ‘คนขับ|รถประจำทาง|ปรับ|อากาศ’, ‘คนขับรถ|ประจำทาง|ปรับ|อากาศ’, ‘คน|ขับ|รถ|ประจำทาง|ปรับอากาศ’, ‘คนขับ|ร|ถ|ประจำทาง|ปรับอากาศ’]]
>tltk.nlp.g2p(Text) : return Word segments and pronunciations e.g. tltk.nlp.g2p(“สถาบันอุดมศึกษาไม่สามารถก้าวให้ทันการเปลี่ยนแปลงของตลาดแรงงาน”)
=> “สถา~บัน~อุ~ดม~ศึก~ษา|ไม่|สา~มารถ|ก้าว|ให้|ทัน|การ|เปลี่ยน~แปลง|ของ|ตลาด~แรง~งาน<tr/>sa1’thaa4~ban0~?u1~dom0~sUk1~saa4|maj2|saa4~maat2|kaaw2|haj2|than0|kaan0|pliian1~plxxN0|khOON4|ta1’laat1~rxxN0~Naan0|<s/>”
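The g2p output can be split back into words and syllables using the markers shown above: <tr/> separates the written form from the transcription, | marks word boundaries, and ~ marks syllable boundaries. A sketch on a short invented string in the same format (real output may contain additional characters inside syllables, as in sa1'thaa4 above):

```python
# Sketch: parsing the g2p output format "<written><tr/><phones><s/>",
# where | separates words and ~ separates syllables.

def parse_g2p(out):
    written, phones = out.replace("<s/>", "").split("<tr/>")
    words = [w.split("~") for w in written.strip("|").split("|")]
    prons = [p.split("~") for p in phones.strip("|").split("|")]
    return list(zip(words, prons))

pairs = parse_g2p("สา~มารถ|ก้าว<tr/>saa4~maat2|kaaw2|<s/>")
for syllables, pron in pairs:
    print(syllables, pron)
```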
>tltk.nlp.th2ipa(Text) : return Thai transcription in IPA forms e.g. tltk.nlp.th2ipa(“ลงแม่น้ำรอเดินไปหาปลา”)
=> ‘loŋ1 mɛː3.naːm4 rᴐː1 dɤːn1 paj1 haː5 plaː1 <s/>’
>tltk.nlp.th2roman(Text) : return Thai romanization according to the Royal Institute guideline, e.g. tltk.nlp.th2roman('คือเขาเดินเลยลงไปรอในแม่น้ำสะอาดไปหามะปราง')
=> ‘khue khaw doen loei long pai ro nai maenam sa-at pai ha maprang <s/>’
>tltk.nlp.th2read(Text) : convert text into Thai reading forms, e.g. th2read(‘สามารถเขียนคำอ่านภาษาไทยได้’)
=> ‘สา-มาด-เขียน-คัม-อ่าน-พา-สา-ไท-ด้าย-’
>tltk.nlp.th2ipa_all(Text) : return all transcriptions (IPA) as a list of tuple (syllable_list, transcription). Transcription is based on syllable reading rules. It could be different from th2ipa. e.g. tltk.nlp.th2ipa_all(“รอยกร่าง”)
=> [(‘รอย~กร่าง’, ‘rᴐːj1.ka2.raːŋ2’), (‘รอย~กร่าง’, ‘rᴐːj1.kraːŋ2’), (‘รอ~ยก~ร่าง’, ‘rᴐː1.jok4.raːŋ3’)]
>tltk.nlp.spell_candidates(Word) : list of possible correct words using minimum edit distance, e.g. tltk.nlp.spell_candidates(‘รักษ’)
=> [‘รัก’, ‘ทักษ’, ‘รักษา’, ‘รักษ์’]
>tltk.nlp.spell_variants(Word, InDict=”no|yes”, Karan=”exclude|include”):
This function returns a list of word variants with the same pronunciation as the input Word. The InDict parameter allows the option “yes” to save only words found in the dictionary, while the default option “no” includes all variants regardless of their dictionary status. The Karan parameter allows the option “include” to include words spelled with the karan character, while the default option “exclude” excludes them. For example, tltk.nlp.spell_variants(‘โควิด’).
=> [‘โฆวิธ’, ‘โฆวิต’, ‘โฆวิด’, ‘โฆวิท’, ‘โฆวิช’, ‘โฆวิจ’, ‘โฆวิส’, ‘โฆวิษ’, ‘โฆวิตร’, ‘โฆวิฒ’, ‘โฆวิฏ’, ‘โฆวิซ’, ‘โควิธ’, ‘โควิต’, ‘โควิด’, ‘โควิท’, ‘โควิช’, ‘โควิจ’, ‘โควิส’, ‘โควิษ’, ‘โควิตร’, ‘โควิฒ’, ‘โควิฏ’, ‘โควิซ’]
Other defined functions in the package:
>tltk.nlp.reset_thaidict() : clear dictionary content
>tltk.nlp.read_thaidict(DictFile) : add a new dictionary, e.g. tltk.nlp.read_thaidict('BEST.dict')
>tltk.nlp.check_thaidict(Word) : check whether Word exists in the dictionary
tltk.corpus : basic tools for corpus enquiry
>tltk.corpus.compound(w1, w2): Evaluates the similarity between combinations of w1 and w2, specifically w1-w2, w1-w1w2, and w2-w1w2. For instance, invoking tltk.corpus.compound(‘กลัด’,’กลุ้ม’) indicates that ‘กลัดกลุ้ม’ is more similar to ‘กลุ้ม’.
=>[((‘กลุ้ม’, ‘กลัดกลุ้ม’), 0.42245594), ((‘กลัด’, ‘กลัดกลุ้ม’), 0.09066804), ((‘กลัด’, ‘กลุ้ม’), 0.0011619462)]
>tltk.corpus.Corpus_build(DIR, filetype=”xxx”) creates a corpus as a list of paragraphs from files located in the directory specified by DIR. The default file type is .txt. However, it is important to note that the files must be pre-segmented into words, with each word separated by the | character, e.g. w1|w2|w3|w4 ….
>tltk.corpus.Corpus() creates a corpus object that has three methods:
x.frequency(C): This method returns a dictionary of word frequencies in the corpus C.
x.dispersion(C): This method returns a dispersion plot for the words in the corpus C.
x.totalword(C): This method returns the total number of words in the corpus C.
Here, C is the corpus created by Corpus_build.
>C = tltk.corpus.Corpus_build('temp/data/')
>corp = tltk.corpus.Corpus()
>print(corp.frequency(C))
> {‘จังหวัด’: 32, ‘สมุทรสาคร’: 16, ‘เปิด’: 3, ‘ศูนย์’: 13, ‘ควบคุม’: 13, ‘แจ้ง’: 16, …..}
>tltk.corpus.Xwordlist() creates a comparison object that compares two word lists A and B generated from the Corp.frequency() method. The Corp object is created from Corpus().
Four comparison methods are defined in this object:
onlyA(): This method returns the list of words that occur only in A.
onlyB(): This method returns the list of words that occur only in B.
intersect(): This method returns the list of words that occur in both A and B.
union(): This method returns the list of words that occur in either A or B (or both).
Here, c1 and c2 are Corpus() objects, Xcomp is an Xwordlist() object, and parsA and parsB are corpora created with Corpus_build(…).
For example, Xcomp.onlyA(c1.frequency(parsA), c2.frequency(parsB)).
>tltk.corpus.W2V_train(Corpus) create a model of Word2Vec. Input is a corpus created from Corpus_build.
>tltk.corpus.D2V_train(Corpus) create a model of Doc2Vec. Input is a corpus created from Corpus_build.
>tltk.corpus.TNC_load() loads TNC.3g by default. The file can be in the working directory or in the TLTK package directory.
>tltk.corpus.trigram_load(TRIGRAM) loads trigram data from another source, saved in tab-delimited format "W1\tW2\tW3\tFreq", e.g. tltk.corpus.trigram_load('TNC.3g'). 'TNC.3g' can be downloaded separately from the Thai National Corpus project.
>tltk.corpus.unigram(w1) return normalized frequency (frequency per million words) of w1 from the corpus
>tltk.corpus.bigram(w1,w2) return frequency/million of Bigram w1-w2 from the corpus e.g. tltk.corpus.bigram(“หาย”,”ดี”) => 2.331959592765809
>tltk.corpus.trigram(w1,w2,w3) return frequency/million of Trigram w1-w2-w3 from the corpus
>tltk.corpus.collocates(w, stat="chi2", direct="both", span=2, limit=10, minfq=1) : return collocates of w, where stat = {freq, mi, chi2}, direct = {left, right, both}, and span = {1, 2}. The output is a list of tuples ((w1,w2), stat), e.g. tltk.corpus.collocates("วิ่ง",limit=5)
=> [((‘วิ่ง’, ‘แจ้น’), 86633.93952758134), ((‘วิ่ง’, ‘ตื๋อ’), 77175.29122642518), ((‘วิ่ง’, ‘กระหืดกระหอบ’), 48598.79465339733), ((‘วิ่ง’, ‘ปรู๊ด’), 41111.63720974819), ((‘ลู่’, ‘วิ่ง’), 33990.56839021914)]
>tltk.corpus.W2V_load(File) load w2v model created from gensim. If no file is given, file “TNCc5model3.bin” will be loaded.
>tltk.corpus.w2v_load() by default loads the word2vec file "TNCc5model2.bin". The file can be in the working directory or in the TLTK package directory.
>tltk.corpus.w2v_exist(w) check whether w has a vector representation e.g. tltk.corpus.w2v_exist(“อาหาร”) => True
>tltk.corpus.w2v(w) return vector representation of w
>tltk.corpus.similarity(w1,w2) e.g. tltk.corpus.similarity(“อาหาร”,”อาหารว่าง”) => 0.783551877546
>tltk.corpus.similar_words(w, n=10, cutoff=0., score=”n”) e.g. tltk.corpus.similar_words(“อาหาร”,n=5, score=”y”)
=> [(‘อาหารว่าง’, 0.7835519313812256), (‘ของว่าง’, 0.7366500496864319), (‘ของหวาน’, 0.703102707862854), (‘เนื้อสัตว์’, 0.6960341930389404), (‘ผลไม้’, 0.6641997694969177)]
>tltk.corpus.outofgroup([w1,w2,w3,…]) e.g. tltk.corpus.outofgroup([“น้ำ”,”อาหาร”,”ข้าว”,”รถยนต์”,”ผัก”]) => “รถยนต์”
>tltk.corpus.analogy(w1,w2,w3,n=1) e.g. tltk.corpus.analogy(‘พ่อ’,’ผู้ชาย’,’แม่’) => [‘ผู้หญิง’]
>tltk.corpus.w2v_plot([w1,w2,w3,…]) => plot a scatter graph of w1-wn in two dimensions
>tltk.corpus.w2v_compare_color([w1,w2,w3,…]) => visualize the components of vectors w1-wn in color
>tltk.corpus.compound(w1,w2) => check a compound w1w2, whether w1 or w2 is similar to w1w2 e.g. tltk.corpus.compound(‘เล็ก’,’น้อย’) => [((‘เล็ก’, ‘น้อย’), 0.4533272), ((‘น้อย’, ‘เล็กน้อย’), 0.35492077), ((‘เล็ก’, ‘เล็กน้อย’), 0.24106339)]
Notes
The word segmentation method used is based on a maximum collocation approach, which is described in the publication “Collocation and Thai Word Segmentation” by W. Aroonmanakun (2002). This publication can be found in the Proceedings of the Fifth Symposium on Natural Language Processing & The Fifth Oriental COCOSDA Workshop, edited by Thanaruk Theeramunkong and Virach Sornlertlamvanich, and published by Sirindhorn International Institute of Technology in Pathumthani. The relevant pages are 68-75. Here is the link to the publication: http://pioneer.chula.ac.th/~awirote/ling/SNLP2002-0051c.pdf
To segment Thai texts, you can use either tltk.nlp.word_segment(Text) or tltk.nlp.syl_segment(Text). The syllable segmentation method is based on a trigram model trained on a corpus of 3.1 million syllables. The input text should be a paragraph of Thai text that may contain English text. Spaces in the paragraph should be marked as “<s/>”. Word boundaries are marked by “|”, and syllable boundaries are marked by “~”. Please note that the syllables represented here are written syllables. Some written syllables may be pronounced as two syllables. For example, “สกัด” is segmented here as one written syllable, but it is pronounced as two syllables “sa1-kat1”.
The process of determining words in a sentence is based on a combination of a dictionary and the maximum collocation strength between syllables. The standard dictionary includes many compounds and idioms, such as ‘เตาไมโครเวฟ’, ‘ไฟฟ้ากระแสสลับ’, ‘ปีงบประมาณ’, ‘อุโมงค์ใต้ดิน’, ‘อาหารจานด่วน’, ‘ปูนขาวผสมพิเศษ’, ‘เต้นแร้งเต้นกา’, etc. These will likely be segmented as one word. If your application requires the use of shortest meaningful words (i.e. ‘รถ|โดยสาร’, ‘คน|ใช้’, ‘กลาง|คืน’, ‘ต้น|ไม้’, as segmented in the BEST corpus), you can reset the default dictionary used in this package and load a new dictionary containing only simple words or the shortest meaningful words. To clear the default dictionary content, use “reset_thaidict()”. To load a new dictionary, use “read_thaidict(‘DICT_FILE’)”. A file named ‘BEST.dict’ containing a list of words compiled from the BEST corpus is included in this package.
The standard dictionary used in this package has more than 65,000 entries, including abbreviations and transliterations, compiled from various sources. Additionally, a list of 8,700 proper names such as country names, organization names, location names, animal names, plant names, food names, etc., has been added to the system’s dictionary. Examples of such proper names include ‘อุซเบกิสถาน’, ‘สำนักเลขาธิการนายกรัฐมนตรี’, ‘วัดใหญ่สุวรรณาราม’, ‘หนอนเจาะลำต้นข้าวโพด’, and ‘ปลาหมึกกระเทียมพริกไทย’.
For segmenting a specific domain text, a specialized dictionary can be used by adding it to the existing dictionary before segmenting the text. This can be done by calling read_thaidict(“SPECIALIZED_DICT”). Please note that the dictionary should be a text file in “utf-8” encoding, and each word should be on a separate line.
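As a concrete illustration of the expected file format, the snippet below writes such a dictionary file: UTF-8 text, one word per line. The file name and entries are invented, and the tltk calls appear only as comments:

```python
# Sketch: creating a specialized dictionary file for read_thaidict().
# Format per the note above: UTF-8 text, one word per line.
import os
import tempfile

entries = ["รถ", "โดยสาร", "ต้นไม้"]  # invented example entries
path = os.path.join(tempfile.mkdtemp(), "my_domain.dict")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(entries) + "\n")

# With tltk installed, one would then load it before segmenting:
#   tltk.nlp.reset_thaidict()      # optional: start from an empty dictionary
#   tltk.nlp.read_thaidict(path)   # add the specialized entries

loaded = [line.strip() for line in open(path, encoding="utf-8") if line.strip()]
print(loaded)
```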
‘Sentence segmentation’ or actually ‘EDU segmentation’ is a process of breaking a paragraph into chunks of discourse units, which are usually clauses. It is based on a RandomForestClassifier model, which is trained on an EDU-segmented corpus (8,100 EDUs) created and used in Nalinee’s thesis (http://www.arts.chula.ac.th/~ling/thesis/2556MA-LING-Nalinee.pdf). The model has an accuracy of 97.8%. The reason behind using EDUs can be found in [Aroonmanakun, W. 2007. Thoughts on Word and Sentence Segmentation in Thai. In Proceedings of the Seventh Symposium on Natural Language Processing, Dec 13-15, 2007, Pattaya, Thailand. 85-90.] [Intasaw, N. and Aroonmanakun, W. 2013. Basic Principles for Segmenting Thai EDUs. in Proceedings of 27th Pacific Asia Conference on Language, Information, and Computation, pages 491-498, Nov 22-24, 2013, Taipei.].
‘grapheme to phoneme’ (g2p), as well as IPA transcription (th2ipa) and Thai romanization (th2roman) are based on the hybrid approach presented in the paper “A Unified Model of Thai Word Segmentation and Romanization”. The Thai Royal Institute guideline for Thai romanization can be downloaded from “http://www.arts.chula.ac.th/~ling/tts/ThaiRoman.pdf”, or “http://www.royin.go.th/?page_id=619”. [Aroonmanakun, W., and W. Rivepiboon. 2004. A Unified Model of Thai Word Segmentation and Romanization. In Proceedings of The 18th Pacific Asia Conference on Language, Information and Computation, Dec 8-10, 2004, Tokyo, Japan. 205-214.] (http://www.aclweb.org/anthology/Y04-1021)
Remarks
A prototype UD parser is implemented using MaltParser (https://www.maltparser.org/). To use it, install MaltParser and add the line ‘tltk.nlp.Maltparser_Path = “/path/to/maltparser-1.9.2”’ to your code. The UD tree generated by MaltParser is a dictionary with the following format: {‘sentence’: “ข้อความภาษาไทย”, ‘words’: [{‘id’: nn, ‘pos’: POS, ‘deprel’: REL, ‘head’: HD_ID}, {…}, …]}. The model was trained on 1,114 UD trees manually analyzed from a sample of the Thai National Corpus (TNC) and is included in the TLTK package as “thamalt.mco”. Additional UD trees will be added in the future.
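The dictionary format above can be traversed directly by following the ‘head’ links. The sketch below is a hypothetical print_dtree-style helper, not TLTK's own implementation; a ‘form’ field is added here purely for display, and the POS and deprel values in the hand-made example are assumptions:

```python
# Walk a UD-tree dictionary in the documented format and render each
# word indented under its head. Illustrative only, not TLTK's print_dtree.

def deptree_lines(tree):
    words = {w["id"]: w for w in tree["words"]}
    children = {}
    for w in tree["words"]:
        children.setdefault(w["head"], []).append(w["id"])

    lines = []

    def walk(wid, depth):
        w = words[wid]
        lines.append("  " * depth + f"{w['form']} ({w['pos']}, {w['deprel']})")
        for child in children.get(wid, []):
            walk(child, depth + 1)

    for root in children.get(0, []):   # head 0 marks the sentence root
        walk(root, 0)
    return lines

# hand-made example in the documented format ('form' added for display)
tree = {
    "sentence": "ข้อความภาษาไทย",
    "words": [
        {"id": 1, "form": "ข้อความ", "pos": "NOUN", "deprel": "root", "head": 0},
        {"id": 2, "form": "ภาษาไทย", "pos": "PROPN", "deprel": "nmod", "head": 1},
    ],
}
print("\n".join(deptree_lines(tree)))
```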
The TNC Trigram data (TNC.3g) and TNC word2vec (TNCc5model3.bin) can be downloaded from the TNC website: http://www.arts.chula.ac.th/ling/tnc/searchtnc/.
The “spell_candidates” module is modified from Peter Norvig’s Python code, which can be found at http://norvig.com/spell-correct.html.
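The core of Norvig's approach, generating every string within edit distance one and keeping those found in a known-word list, can be sketched as follows. This is a simplified illustration with a toy English vocabulary; TLTK's module applies the same idea to Thai:

```python
# All strings one edit away from a word (Norvig's edits1), filtered
# against a known vocabulary to get spelling candidates.

def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def spell_candidates(word, vocab):
    if word in vocab:           # already a known word
        return [word]
    return sorted(edits1(word) & vocab)

vocab = {"spelling", "spacing", "species"}
print(spell_candidates("speling", vocab))  # ['spelling']
```

Norvig's full corrector also ranks candidates by corpus frequency and falls back to edit distance two; the sketch stops at distance one for brevity.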
The “w2v_compare_color” module is modified from http://chrisculy.net/lx/wordvectors/wvecs_visualization.html.
The BEST corpus was released by NECTEC (https://www.nectec.or.th/corpus/).
This project uses Universal POS tags. For more information, please see http://universaldependencies.org/u/pos/index.html and http://www.arts.chula.ac.th/~ling/contents/File/UD%20Annotation%20for%20Thai.pdf.
pos_tag is based on the PerceptronTagger in the nltk.tag.perceptron module, trained on manually POS-tagged TNC data (approximately 148,000 words). Its tagging accuracy is 91.68%. The NLTK PerceptronTagger is a port of the TextBlob Averaged Perceptron Tagger (https://explosion.ai/blog/part-of-speech-pos-tagger-in-python).
The named entity recognition module is a CRF model adapted from a tutorial (http://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html). The model was trained on NER data from Sasiwimon’s and Nutcha’s theses (altogether 7,354 names in a corpus of 183,300 words; http://pioneer.chula.ac.th/~awirote/Data-Nutcha.zip, http://pioneer.chula.ac.th/~awirote/Data-Sasiwimon.zip) and NER data from AIforThai (https://aiforthai.in.th/). Only valid NE files from AIforThai were used; the total number of NEs is 170,076. The model’s accuracy, 88%, is reported below.
| tag | precision | recall | f1-score | support |
|---|---|---|---|---|
| B-L | 0.56 | 0.48 | 0.52 | 27105 |
| B-O | 0.72 | 0.58 | 0.64 | 59613 |
| B-P | 0.82 | 0.83 | 0.83 | 83358 |
| I-L | 0.52 | 0.43 | 0.47 | 17859 |
| I-O | 0.67 | 0.59 | 0.63 | 67396 |
| I-P | 0.85 | 0.88 | 0.86 | 175069 |
| O | 0.92 | 0.94 | 0.93 | 1032377 |
| accuracy | | | 0.88 | 1462777 |
| macro avg | 0.72 | 0.68 | 0.70 | 1462777 |
| weighted avg | 0.87 | 0.88 | 0.88 | 1462777 |
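The tags in the table follow the usual BIO scheme: B- begins a name, I- continues it, and plain O marks tokens outside any name, with P, L, and O suffixes for person, location, and organization (so B-O/I-O, organization, are distinct from the bare O tag). Decoding such a tag sequence into entity spans can be sketched as follows; this is illustrative code, not TLTK's own:

```python
# Decode a BIO tag sequence into (entity_type, start, end) spans,
# where end is exclusive. Tag suffixes as in the table above:
# P = person, L = location, and B-O/I-O mark organizations
# (distinct from the plain 'O' outside tag).

def bio_spans(tags):
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((etype, start, i))
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        # I- tags simply extend the current span
    return spans

tags = ["B-P", "I-P", "O", "B-L", "O"]
print(bio_spans(tags))  # [('P', 0, 2), ('L', 3, 4)]
```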
Use cases
This package is free for commercial use. If you incorporate it in your work, we would appreciate it if you informed us at awirote@chula.ac.th.
BAS Web Services (https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface) used TLTK for Thai grapheme-to-phoneme conversion in their project.
Chubb Life Assurance Public Company Limited used TLTK for Thai transliteration.
ThaiRomanizationSharp, a .NET project, wraps the Thai romanization in the Thai Language Toolkit to simplify its use in other .NET projects. https://github.com/dotnetthailand/ThaiRomanizationSharp
Huawei’s Consumer Cloud Service Asia Pacific Cloud Service Business Growth Dept. used TLTK for AppSearch processing of Thai.
osml10n, a set of localization functions for OpenStreetMap data, uses TLTK for Thai transcription in cases where transcribed names are unavailable in the OpenStreetMap data itself. https://github.com/giggls/osml10n