ThaiXtransformers
Use pretrained RoBERTa-based Thai language models from the VISTEC-depa AI Research Institute of Thailand.
Forked from vistec-AI/thai2transformers.
This project provides the tokenizer and data-preprocessing utilities for the RoBERTa models from the VISTEC-depa AI Research Institute of Thailand.
Paper: WangchanBERTa: Pretraining transformer-based Thai Language Models
Install
pip install thaixtransformers
Usage
Tokenizer
from thaixtransformers import Tokenizer
To use one of the models, load its tokenizer by model name:
Tokenizer(model_name) -> Tokenizer
Example
from thaixtransformers import Tokenizer
from transformers import pipeline
from transformers import AutoModelForMaskedLM
tokenizer = Tokenizer("airesearch/wangchanberta-base-wiki-newmm")
model = AutoModelForMaskedLM.from_pretrained("airesearch/wangchanberta-base-wiki-newmm")
classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(classifier("ผมชอบ<mask>มาก ๆ"))
# output:
# [{'score': 0.05261131376028061,
# 'token': 6052,
# 'token_str': 'อินเทอร์เน็ต',
# 'sequence': 'ผมชอบอินเทอร์เน็ตมากๆ'},
# {'score': 0.03980604186654091,
# 'token': 11893,
# 'token_str': 'อ่านหนังสือ',
# 'sequence': 'ผมชอบอ่านหนังสือมากๆ'},
# ...]
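The pipeline returns a list of candidate fills sorted by score, so the best completion can be read off directly. A minimal sketch, reusing the example output above as plain Python data (no model download needed):

```python
# `results` reproduces the fill-mask output shown above.
results = [
    {"score": 0.05261131376028061, "token": 6052,
     "token_str": "อินเทอร์เน็ต", "sequence": "ผมชอบอินเทอร์เน็ตมากๆ"},
    {"score": 0.03980604186654091, "token": 11893,
     "token_str": "อ่านหนังสือ", "sequence": "ผมชอบอ่านหนังสือมากๆ"},
]

# Pick the highest-scoring candidate.
best = max(results, key=lambda r: r["score"])
print(best["sequence"])  # -> 'ผมชอบอินเทอร์เน็ตมากๆ'
```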
Preprocess
To preprocess data before training a model, use process_transformers.
from thaixtransformers.preprocess import process_transformers
process_transformers(str) -> str
Example
from thaixtransformers.preprocess import process_transformers
print(process_transformers("สวัสดี :D"))
# output: 'สวัสดี<_>:d'
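For intuition only, here is a rough sketch (not the library's implementation) of two transformations visible in the example output: lowercasing and replacing spaces with the special `<_>` token. The real process_transformers applies additional normalization rules beyond these.

```python
def sketch_preprocess(text: str) -> str:
    # Illustrative only: mimics two effects visible in the example
    # above (lowercasing, and spaces replaced by the <_> token).
    return text.lower().replace(" ", "<_>")

print(sketch_preprocess("สวัสดี :D"))  # -> 'สวัสดี<_>:d'
```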
BibTeX entry and citation info
@misc{lowphansirikul2021wangchanberta,
title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
year={2021},
eprint={2101.09635},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
File details
Details for the file thaixtransformers-0.1.0.tar.gz
File metadata
- Download URL: thaixtransformers-0.1.0.tar.gz
- Upload date:
- Size: 18.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.4
File hashes

Algorithm | Hash digest
---|---
SHA256 | bb8b49dc0660baf92e17c483e6b42346dee2df357ddc397ac5e4c12b5806443a
MD5 | 73b7f63e0efbc69c927f68587fe64cde
BLAKE2b-256 | 15eb7056a9ef57cdce0c35e914d7cfeb63facfa570e4ba84e93386df52880c19
File details
Details for the file thaixtransformers-0.1.0-py3-none-any.whl
File metadata
- Download URL: thaixtransformers-0.1.0-py3-none-any.whl
- Upload date:
- Size: 19.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.4
File hashes

Algorithm | Hash digest
---|---
SHA256 | b9373c75458075f0c534ed4ab422e5490872cc31595d6c3a2aff42ebbd4d8350
MD5 | 440a757f48080f4b0036719cab7b9d00
BLAKE2b-256 | fa5988774229cada59628e851420096dc5779009c1511696f23fe78225c3d384