Indobenchmark toolkit for supporting IndoNLU and IndoNLG

Indobenchmark Toolkit


Indobenchmark is a collection of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG) resources for Bahasa Indonesia, developed through a collaboration between Institut Teknologi Bandung, Universitas Multimedia Nusantara, The Hong Kong University of Science and Technology, Universitas Indonesia, DeepMind, Gojek, and Prosa.AI.

Toolkit Modules

IndoNLGTokenizer

IndoNLGTokenizer is the tokenizer used by both the IndoBART and IndoGPT models. The following examples show how to use it:

  • IndoNLGTokenizer for IndoGPT
## Encode ##
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indogpt')
inputs = tokenizer.prepare_input_for_generation('hai, bagaimana kabar', model_type='indogpt', return_tensors='pt')
# inputs: {'input_ids': tensor([[    0,  4693, 39956,  1119,  3447]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

## Decode ##
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indogpt')
text = tokenizer.decode([0,  4693, 39956,  1119,  3447])
# text: '<s> hai, bagaimana kabar'
  • IndoNLGTokenizer for IndoBART
## Encode ##
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indobart')
inputs = tokenizer.prepare_input_for_generation('hai, bagaimana kabar', return_tensors='pt',
                       lang_token='[indonesian]', decoder_lang_token='[indonesian]')
# inputs: {'input_ids': tensor([    0,  4693, 39956,  1119,  3447,     2, 40002]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1])}

## Decode ##
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indobart')
text = tokenizer.decode([0,  4693, 39956,  1119,  3447, 2, 40002])
# text: '<s> hai, bagaimana kabar </s> [indonesian]'
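The id layout in the IndoBART example above can be sketched in plain Python. The BOS/EOS ids (0 and 2) and the `[indonesian]` language id (40002) are taken from the tensors shown; the helper function itself is hypothetical and not part of the toolkit API:

```python
# Hypothetical helper mirroring the IndoBART input layout shown above:
# <s> subword-ids </s> [lang]  ->  [0, ..., 2, 40002]
BOS_ID, EOS_ID = 0, 2
LANG_TOKEN_ID = {'[indonesian]': 40002}  # id taken from the example tensor

def build_indobart_input(token_ids, decoder_lang_token='[indonesian]'):
    """Wrap subword ids with BOS, EOS, and a trailing language token id."""
    return [BOS_ID] + list(token_ids) + [EOS_ID, LANG_TOKEN_ID[decoder_lang_token]]

ids = build_indobart_input([4693, 39956, 1119, 3447])
# ids == [0, 4693, 39956, 1119, 3447, 2, 40002], matching the encoded tensor above
```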

Note: IndoNLGTokenizer automatically lower-cases the input text, since the IndoBART and IndoGPT models were trained only on lower-cased text.
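The practical consequence of the lower-casing step is that case variants of the same input tokenize identically. The `pretokenize` function below is a hypothetical stand-in for the tokenizer's internal normalization, not part of the toolkit API:

```python
def pretokenize(text: str) -> str:
    # IndoBART and IndoGPT only saw lower-cased training text, so any
    # case variant of the same input should map to the same token ids.
    return text.lower()

# Both spellings normalize to the same string before subword tokenization.
assert pretokenize('Hai, Bagaimana Kabar') == pretokenize('hai, bagaimana kabar')
```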

Research Paper

IndoNLU has been accepted at AACL-IJCNLP 2020; details are in our paper: https://www.aclweb.org/anthology/2020.aacl-main.85.pdf. If you use any component of IndoNLU, including Indo4B, FastText-Indo4B, or IndoBERT, in your work, please cite the following paper:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and Xiaohong Li and Zhi Yuan Lim and Sidik Soleman and Rahmad Mahendra and Pascale Fung and Syafri Bahar and Ayu Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}

IndoNLG has been accepted at EMNLP 2021; details are in our paper: https://arxiv.org/abs/2104.08200. If you use any component of IndoNLG, including Indo4B-Plus, IndoBART, or IndoGPT, in your work, please cite the following paper:

@misc{cahyawijaya2021indonlg,
      title={IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation}, 
      author={Samuel Cahyawijaya and Genta Indra Winata and Bryan Wilie and Karissa Vincentio and Xiaohong Li and Adhiguna Kuncoro and Sebastian Ruder and Zhi Yuan Lim and Syafri Bahar and Masayu Leylia Khodra and Ayu Purwarianti and Pascale Fung},
      year={2021},
      eprint={2104.08200},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

IndoNLU and IndoNLG Models

IndoBERT and IndoBERT-lite Models

We provide 4 IndoBERT and 4 IndoBERT-lite pretrained language models [Link]

FastText (Indo4B)

We provide the full uncased FastText model file (11.9 GB) and the corresponding vector file (3.9 GB):

  • FastText model (11.9 GB) [Link]
  • Vector file (3.9 GB) [Link]

We also provide smaller FastText models with reduced vocabularies for each of the 12 downstream tasks

IndoBART and IndoGPT Models

We provide IndoBART and IndoGPT Pretrained Language Model [Link]

Download files

Download the file for your platform.

Source Distribution

indobenchmark-toolkit-0.1.7.tar.gz (13.7 kB)

Built Distribution

indobenchmark_toolkit-0.1.7-py3-none-any.whl (12.2 kB)

File details

Details for the file indobenchmark-toolkit-0.1.7.tar.gz.

File metadata

  • Download URL: indobenchmark-toolkit-0.1.7.tar.gz
  • Size: 13.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.4

File hashes

Hashes for indobenchmark-toolkit-0.1.7.tar.gz:

  • SHA256: 55917b6e818d8a6a4e949d18ee9bc9e4927d60c01adc0cea586e76eb70aba2df
  • MD5: b57f6d8285ae969ffab97b67153034e0
  • BLAKE2b-256: 69688fb3616fc2dc9ce55212f37ca6b89fb5db3c258fad820cd601d47598cd50
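The published hashes can be checked locally with the standard library before installing a downloaded archive. This is a generic sketch using Python's `hashlib`; the file path is whatever location you saved the archive to:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the hex SHA256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()

def verify_download(path: str, expected_sha256: str) -> bool:
    """Compare a downloaded file's SHA256 against the digest published above."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        # Stream in chunks so large archives need not fit in memory.
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest() == expected_sha256.lower()
```

For example, `verify_download('indobenchmark-toolkit-0.1.7.tar.gz', '55917b6e...')` should return True for an intact download.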


File details

Details for the file indobenchmark_toolkit-0.1.7-py3-none-any.whl.

File hashes

Hashes for indobenchmark_toolkit-0.1.7-py3-none-any.whl:

  • SHA256: e51a768284181dd9aefab3d298d864f7dea9225b769bf28d101bf74687c62ba3
  • MD5: 1c68892c1b5953d9d2aaf377f32dc5ff
  • BLAKE2b-256: 4d267b0cd995968abdb2132c33d01b30d2ff0b22fb898777c55e2e69f1ddb8f4

