
ThaiXtransformers: Use pretrained RoBERTa-based Thai language models from the VISTEC-depa AI Research Institute of Thailand.





Forked from vistec-AI/thai2transformers.

This project builds the tokenizer and the data preprocessing for the RoBERTa models from the VISTEC-depa AI Research Institute of Thailand.

Paper: WangchanBERTa: Pretraining transformer-based Thai Language Models


pip install thaixtransformers



from thaixtransformers import Tokenizer

To use a model, load its tokenizer by passing the model name.

Tokenizer(model_name) -> Tokenizer


from thaixtransformers import Tokenizer
from transformers import pipeline
from transformers import AutoModelForMaskedLM

tokenizer = Tokenizer("airesearch/wangchanberta-base-wiki-newmm")
model = AutoModelForMaskedLM.from_pretrained("airesearch/wangchanberta-base-wiki-newmm")

classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(classifier("ผมชอบ<mask>มาก ๆ"))
# output:
#    [{'score': 0.05261131376028061,
#  'token': 6052,
#  'token_str': 'อินเทอร์เน็ต',
#  'sequence': 'ผมชอบอินเทอร์เน็ตมากๆ'},
# {'score': 0.03980604186654091,
#  'token': 11893,
#  'token_str': 'อ่านหนังสือ',
#  'sequence': 'ผมชอบอ่านหนังสือมากๆ'},
#    ...]
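
As the output above shows, the fill-mask pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. The following is a minimal sketch of ranking those candidates. `top_candidates` is a hypothetical helper, not part of thaixtransformers, and the sample list only copies the two entries printed above so that the snippet runs without downloading a model.

```python
def top_candidates(results, k=1):
    """Return the k highest-scoring (token_str, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], r["score"]) for r in ranked[:k]]


# Sample copied from the output shown above (deliberately out of order);
# in practice, pass the list returned by the pipeline instead.
sample = [
    {"score": 0.03980604186654091, "token": 11893,
     "token_str": "อ่านหนังสือ", "sequence": "ผมชอบอ่านหนังสือมากๆ"},
    {"score": 0.05261131376028061, "token": 6052,
     "token_str": "อินเทอร์เน็ต", "sequence": "ผมชอบอินเทอร์เน็ตมากๆ"},
]

best = top_candidates(sample, k=1)
print(best)
```

Sorting explicitly keeps the helper independent of whatever order the pipeline happens to return.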


To preprocess data before training a model, use the preprocess module.

from thaixtransformers.preprocess import process_transformers

process_transformers(str) -> str


from thaixtransformers.preprocess import process_transformers

print(process_transformers("สวัสดี   :D"))
# output: 'สวัสดี<_>:d'
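
The example output above suggests two visible effects of `process_transformers`: a run of spaces becomes a single `<_>` token, and Latin letters are lowercased. The sketch below imitates only those two effects in plain Python to make the transformation concrete. `approx_space_and_case` is a hypothetical name and is not the library's implementation, which may apply further rules; use `process_transformers` itself for real data.

```python
import re

SPACE_TOKEN = "<_>"  # space marker seen in the example output above


def approx_space_and_case(text: str) -> str:
    """Approximate two visible effects only: collapse spaces to <_> and lowercase."""
    text = text.strip()
    # Replace each run of one or more spaces with a single space token.
    text = re.sub(r" +", SPACE_TOKEN, text)
    # Lowercasing affects Latin letters; Thai script has no letter case.
    return text.lower()


print(approx_space_and_case("สวัสดี   :D"))
```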

BibTeX entry and citation info

      title={WangchanBERTa: Pretraining transformer-based Thai Language Models}, 
      author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},


