thai2transformers

Pretraining transformer-based Thai language models


thai2transformers provides customized scripts to pretrain transformer-based masked language models on Thai text, using the following token types:

  • spm: subword-level tokens from the SentencePiece library.
  • newmm: a dictionary-based Thai word tokenizer from PyThaiNLP that uses maximal matching.
  • ssg: a CRF-based Thai syllable tokenizer [Chormai et al., 2020].
  • sefr: an ML-based Thai word tokenizer using Stacked Ensemble Filter and Refine (SEFR) [Limkonchotiwat et al., 2020], which refines probabilities from the CNN-based deepcut tokenizer. The SEFR tokenizer is loaded with engine="best".
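
To make the idea of dictionary-based segmentation concrete, here is a minimal sketch of greedy longest matching over a toy vocabulary. It is a simplification for illustration only: PyThaiNLP's actual newmm implementation is more elaborate, and the function name `longest_match_tokenize` and the four-word `toy_vocab` are hypothetical, not part of thai2transformers or PyThaiNLP.

```python
def longest_match_tokenize(text, vocab):
    """Greedy longest-match segmentation (a simplified stand-in for
    dictionary-based maximal matching). Characters that start no
    dictionary word are emitted as single-character tokens."""
    max_len = max((len(word) for word in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink the window.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # fall back to one character
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary; a real tokenizer uses a large word list.
toy_vocab = {"ภาษา", "ไทย", "ภาษาไทย", "ง่าย"}
print(longest_match_tokenize("ภาษาไทยง่าย", toy_vocab))
```

Because the longest entry wins, "ภาษาไทย" is kept as one token rather than being split into "ภาษา" and "ไทย". Greedy matching can pick a poor split when an early long match blocks a better segmentation later in the string, which is one reason production tokenizers do more than this sketch.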


Thai texts for language model pretraining


We curate a list of sources that can be used to pretrain language models. The statistics for each data source are listed in this spreadsheet.

You can also download the current version of the cleaned datasets from here.



Model pretraining and finetuning instructions:


a) Instructions for pretraining a RoBERTa BASE model on the Thai Wikipedia dump:

In this example, we demonstrate how to pretrain a RoBERTa base model on the Thai Wikipedia dump from scratch.

  1. Install required libraries: 1_installation.md

  2. Prepare thwiki dataset from Thai Wikipedia dump: 2_thwiki_data-preparation.md

  3. Tokenizer training and vocabulary building:

    a) For SentencePiece BPE (spm), word-level token (newmm), syllable-level token (syllable): 3_train_tokenizer.md

    b) For word-level token from Limkonchotiwat et al., 2020 (sefr-cut): 3b_sefr-cut_pretokenize.md

  4. Pretrain a masked language model: 4_run_mlm.md
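
For readers unfamiliar with the masked language modeling objective used in step 4, the following is a minimal sketch of how training inputs are typically corrupted. The 15% selection rate and the 80/10/10 split between mask, random token, and unchanged token are the defaults from the BERT paper and are assumptions here; the settings this project actually uses are documented in 4_run_mlm.md. The function `mask_tokens` and its parameters are hypothetical names for illustration.

```python
import random


def mask_tokens(tokens, vocab, mask_token="<mask>", mlm_prob=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would only be computed at selected positions.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mlm_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_token          # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


# Toy usage with a hypothetical vocabulary of four tokens.
toy_vocab = ["a", "b", "c", "d"]
inputs, labels = mask_tokens(toy_vocab * 5, toy_vocab, seed=0)
print(inputs)
print(labels)
```

Calling the function again with a different seed yields a different mask over the same sequence, which is the idea behind the dynamic masking described in the RoBERTa paper.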


b) Instructions for finetuning a RoBERTa model on existing Thai text classification and NER/POS tagging datasets:

In this example, we demonstrate how to finetune WangchanBERTa, a RoBERTa base model pretrained on the Thai Wikipedia dump and assorted Thai texts.

  • Finetune the model for sequence classification on existing datasets including wisesight_sentiment, wongnai_reviews, generated_reviews_enth (review star prediction), and prachathai67k: 5a_finetune_sequence_classificaition.md

  • Finetune the model for token classification (NER and POS tagging) on existing datasets including thainer and lst20: 5b_finetune_token_classificaition.md
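
Token classification datasets carry one label per word, while a subword tokenizer may split a word into several pieces. The sketch below shows one common convention for aligning the two: the first piece keeps the word's label and the remaining pieces receive -100, the default `ignore_index` of PyTorch's cross-entropy loss. Whether this project follows the same convention is not stated here, so treat it as an assumption and see 5b_finetune_token_classificaition.md for the actual procedure. `align_labels` and `toy_split` are hypothetical helpers, and `toy_split` merely cuts words into fixed-size chunks in place of a real subword tokenizer.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(words, word_labels, split_fn):
    """Expand word-level labels to subword level.

    The first subword of each word keeps the label; later subwords get
    IGNORE_INDEX so they do not contribute to the loss.
    """
    subwords, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = split_fn(word)
        if not pieces:
            continue  # skip words that produce no subwords
        subwords.extend(pieces)
        labels.extend([label] + [IGNORE_INDEX] * (len(pieces) - 1))
    return subwords, labels


def toy_split(word, size=3):
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


subwords, labels = align_labels(["Bangkok", "is"], [1, 0], toy_split)
print(subwords)  # ['Ban', 'gko', 'k', 'is']
print(labels)    # [1, -100, -100, 0]
```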



BibTeX entry and citation info

@misc{lowphansirikul2021wangchanberta,
      title={WangchanBERTa: Pretraining transformer-based Thai Language Models}, 
      author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
      year={2021},
      eprint={2101.09635},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
