
Project description

ThaiXtransformers

Open In Colab

Use pretrained RoBERTa-based Thai language models from the VISTEC-depa AI Research Institute of Thailand.

Forked from vistec-AI/thai2transformers.

This project provides the tokenizer and data preprocessing code for RoBERTa models from the VISTEC-depa AI Research Institute of Thailand.

Paper: WangchanBERTa: Pretraining transformer-based Thai Language Models

Install

pip install thaixtransformers

Usage

Tokenizer

from thaixtransformers import Tokenizer

To use a model, load its tokenizer by the model name.

Tokenizer(model_name) -> Tokenizer

Example

from thaixtransformers import Tokenizer
from transformers import pipeline
from transformers import AutoModelForMaskedLM

tokenizer = Tokenizer("airesearch/wangchanberta-base-wiki-newmm")
model = AutoModelForMaskedLM.from_pretrained("airesearch/wangchanberta-base-wiki-newmm")

classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(classifier("ผมชอบ<mask>มาก ๆ"))
# output:
# [{'score': 0.05261131376028061,
#   'token': 6052,
#   'token_str': 'อินเทอร์เน็ต',
#   'sequence': 'ผมชอบอินเทอร์เน็ตมากๆ'},
#  {'score': 0.03980604186654091,
#   'token': 11893,
#   'token_str': 'อ่านหนังสือ',
#   'sequence': 'ผมชอบอ่านหนังสือมากๆ'},
#  ...]
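
The returned tokenizer follows the standard Hugging Face tokenizer interface, so you can also encode and decode text directly. A minimal sketch (the exact token IDs depend on the model's vocabulary):

from thaixtransformers import Tokenizer

tokenizer = Tokenizer("airesearch/wangchanberta-base-wiki-newmm")

# Encode a sentence into token IDs plus an attention mask.
encoded = tokenizer("ผมชอบอ่านหนังสือ")
print(encoded["input_ids"])

# Decode the IDs back into text.
print(tokenizer.decode(encoded["input_ids"]))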

Preprocess

If you want to preprocess data before training a model, you can use the preprocess module.

from thaixtransformers.preprocess import process_transformers

process_transformers(str) -> str

Example

from thaixtransformers.preprocess import process_transformers

print(process_transformers("สวัสดี   :D"))
# output: 'สวัสดี<_>:d'
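
For example, to clean a whole corpus before tokenization, you can map process_transformers over your raw texts. A minimal sketch (raw_texts is a hypothetical list of strings):

from thaixtransformers.preprocess import process_transformers

raw_texts = ["สวัสดี   :D", "ผมชอบอ่านหนังสือ มาก ๆ"]

# Normalize each raw text before feeding it to the tokenizer.
clean_texts = [process_transformers(text) for text in raw_texts]
print(clean_texts)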

BibTeX entry and citation info

@misc{lowphansirikul2021wangchanberta,
      title={WangchanBERTa: Pretraining transformer-based Thai Language Models}, 
      author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
      year={2021},
      eprint={2101.09635},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

thaixtransformers-0.1.0.tar.gz (18.0 kB)

Uploaded Source

Built Distribution

thaixtransformers-0.1.0-py3-none-any.whl (19.1 kB)

Uploaded Python 3

File details

Details for the file thaixtransformers-0.1.0.tar.gz.

File metadata

  • Download URL: thaixtransformers-0.1.0.tar.gz
  • Upload date:
  • Size: 18.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.4

File hashes

Hashes for thaixtransformers-0.1.0.tar.gz

  • SHA256: bb8b49dc0660baf92e17c483e6b42346dee2df357ddc397ac5e4c12b5806443a
  • MD5: 73b7f63e0efbc69c927f68587fe64cde
  • BLAKE2b-256: 15eb7056a9ef57cdce0c35e914d7cfeb63facfa570e4ba84e93386df52880c19
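
To check a downloaded file against these hashes, you can compute the digest locally. A minimal sketch using Python's standard hashlib (the file path is a placeholder for wherever you saved the archive):

import hashlib

expected = "bb8b49dc0660baf92e17c483e6b42346dee2df357ddc397ac5e4c12b5806443a"

# Hash the downloaded archive and compare it with the published SHA256.
with open("thaixtransformers-0.1.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == expected, "SHA256 mismatch"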


File details

Details for the file thaixtransformers-0.1.0-py3-none-any.whl.


File hashes

Hashes for thaixtransformers-0.1.0-py3-none-any.whl

  • SHA256: b9373c75458075f0c534ed4ab422e5490872cc31595d6c3a2aff42ebbd4d8350
  • MD5: 440a757f48080f4b0036719cab7b9d00
  • BLAKE2b-256: fa5988774229cada59628e851420096dc5779009c1511696f23fe78225c3d384

