ThaiXtransformers
Use pretrained RoBERTa-based Thai language models from the VISTEC-depa AI Research Institute of Thailand.
A fork of vistec-AI/thai2transformers.
This project provides the tokenizer and data preprocessing for RoBERTa models from the VISTEC-depa AI Research Institute of Thailand.
Paper: WangchanBERTa: Pretraining transformer-based Thai Language Models
Install
pip install thaixtransformers
Usage
Tokenizer
from thaixtransformers import Tokenizer
To use a model, load its tokenizer by model name.
Tokenizer(model_name) -> Tokenizer
Example
from thaixtransformers import Tokenizer
from transformers import pipeline
from transformers import AutoModelForMaskedLM
tokenizer = Tokenizer("airesearch/wangchanberta-base-wiki-newmm")
model = AutoModelForMaskedLM.from_pretrained("airesearch/wangchanberta-base-wiki-newmm")
classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(classifier("ผมชอบ<mask>มาก ๆ"))
# output:
# [{'score': 0.05261131376028061,
# 'token': 6052,
# 'token_str': 'อินเทอร์เน็ต',
# 'sequence': 'ผมชอบอินเทอร์เน็ตมากๆ'},
# {'score': 0.03980604186654091,
# 'token': 11893,
# 'token_str': 'อ่านหนังสือ',
# 'sequence': 'ผมชอบอ่านหนังสือมากๆ'},
# ...]
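As the output above shows, the fill-mask pipeline returns a list of dicts, each carrying a score, a token id, the predicted token string, and the filled-in sequence. Picking out the top prediction is plain Python (the values below are copied from the example output, truncated for brevity):

```python
# Fill-mask predictions as returned by the pipeline (values from the example above).
predictions = [
    {"score": 0.0526, "token": 6052, "token_str": "อินเทอร์เน็ต",
     "sequence": "ผมชอบอินเทอร์เน็ตมากๆ"},
    {"score": 0.0398, "token": 11893, "token_str": "อ่านหนังสือ",
     "sequence": "ผมชอบอ่านหนังสือมากๆ"},
]

# Select the highest-scoring candidate for the masked position.
best = max(predictions, key=lambda p: p["score"])
print(best["token_str"])  # อินเทอร์เน็ต
```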
Preprocess
If you want to preprocess data before training a model, you can use preprocess.
from thaixtransformers.preprocess import process_transformers
process_transformers(str) -> str
Example
from thaixtransformers.preprocess import process_transformers
print(process_transformers("สวัสดี :D"))
# output: 'สวัสดี<_>:d'
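From the example output, the preprocessing appears to replace spaces with a `<_>` space token and lowercase Latin characters (Thai script has no case). The sketch below is only an illustration of those two visible effects, not the library's actual implementation; use `process_transformers` itself for real data:

```python
import re


def preprocess_sketch(text: str) -> str:
    """Illustrative approximation of two effects visible in the
    process_transformers example: runs of spaces become the '<_>'
    space token, and Latin characters are lowercased."""
    text = re.sub(r" +", "<_>", text)
    return text.lower()


print(preprocess_sketch("สวัสดี :D"))  # 'สวัสดี<_>:d', matching the example above
```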
BibTeX entry and citation info
@misc{lowphansirikul2021wangchanberta,
title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
year={2021},
eprint={2101.09635},
archivePrefix={arXiv},
primaryClass={cs.CL}
}