Asian language bart models (En, Ja, Ko, Zh, ECJK)

Asian Bart

asian-bart is a package of BART models for Asian languages.
asian-bart supports English, Chinese, Korean, and Japanese, plus a combined model covering all four (ECJK).
We built asian-bart from mBART by pruning its embedding layer.
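
As a rough illustration of what embedding layer pruning means, the sketch below keeps only the embedding rows for a chosen set of token ids and builds a map from old ids to new ids. It is a toy example in plain Python; the helper name `prune_embeddings` and its `kept_ids` argument are hypothetical, and this is not the procedure actually used to build asian-bart.

```python
def prune_embeddings(embedding_rows, kept_ids):
    """Keep only the rows for `kept_ids` and remap ids to a dense range.

    Hypothetical helper for illustration; not part of asian-bart.
    """
    # Sort and deduplicate so the new ids are stable and contiguous.
    ordered = sorted(set(kept_ids))
    old_to_new = {old: new for new, old in enumerate(ordered)}
    pruned_rows = [embedding_rows[old] for old in ordered]
    return pruned_rows, old_to_new


# Toy 6-token vocabulary with 2-dimensional embeddings.
full = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1], [5.0, 5.1]]
pruned, id_map = prune_embeddings(full, kept_ids=[0, 1, 2, 5])

# Token ids in existing text must be remapped to index the smaller table.
remapped = [id_map[t] for t in [5, 2, 0]]
```

The smaller table has one row per kept token, so a model restricted to one language needs far fewer embedding parameters than the full multilingual vocabulary.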

Installation

pip install asian-bart

Model specification

ECJK model
- vocab size: 57k
- model size: 413M
- languages: En, Zh, Ja, Ko
- architecture: Transformer 12 Encoder + 12 Decoder
- name: hyunwoongko/asian-bart-ecjk

English model
- vocab size: 32k
- model size: 387M
- languages: English (en_XX)
- architecture: Transformer 12 Encoder + 12 Decoder
- name: hyunwoongko/asian-bart-en

Chinese model
- vocab size: 20k
- model size: 375M
- languages: Chinese (zh_CN)
- architecture: Transformer 12 Encoder + 12 Decoder
- name: hyunwoongko/asian-bart-zh

Japanese model
- vocab size: 13k
- model size: 368M
- languages: Japanese (ja_XX)
- architecture: Transformer 12 Encoder + 12 Decoder
- name: hyunwoongko/asian-bart-ja

Korean model
- vocab size: 8k
- model size: 363M
- languages: Korean (ko_KR)
- architecture: Transformer 12 Encoder + 12 Decoder
- name: hyunwoongko/asian-bart-ko
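
The size differences between the models are roughly what pruning the embedding table alone would produce. The sketch below checks this arithmetic, assuming a hidden size of 1024 (the mBART-large value, which this README does not state) and a single embedding matrix shared across the model. The vocabulary sizes are the rounded figures listed above, so the results are approximate.

```python
D_MODEL = 1024  # assumption: hidden size of mBART-large, not given in this README

# (vocab size, total parameters in millions), as listed in the specification.
SPECS = {
    "ecjk": (57_000, 413),
    "en": (32_000, 387),
    "zh": (20_000, 375),
    "ja": (13_000, 368),
    "ko": (8_000, 363),
}


def embedding_params_m(vocab_size, d_model=D_MODEL):
    """Parameters (in millions) of a vocab_size x d_model embedding table."""
    return vocab_size * d_model / 1e6


# Subtracting the embedding table should leave about the same Transformer body.
body_m = {
    name: total_m - embedding_params_m(vocab)
    for name, (vocab, total_m) in SPECS.items()
}
spread_m = max(body_m.values()) - min(body_m.values())

for name, value in body_m.items():
    print(f"{name}: ~{value:.1f}M non-embedding parameters")
```

Under these assumptions the remainder comes out between 354M and 355M for every model, which is consistent with only the embedding layer having been pruned.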

Usage

asian-bart is built from mBART, so inputs must follow mBART's format:
- source: text + </s> + lang_code
- target: lang_code + text + </s>
For more details, see the mBART paper.
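
The two layouts can be sketched on plain lists of token ids. The helper names `build_source` and `build_target` are hypothetical and not part of asian-bart; the ids are copied from the Korean example output in this README, and the value 2 for </s> is inferred from those outputs.

```python
EOS_ID = 2  # id of </s>, inferred from the example outputs in this README


def build_source(text_ids, lang_code_id, eos_id=EOS_ID):
    """source: text + </s> + lang_code"""
    return list(text_ids) + [eos_id, lang_code_id]


def build_target(text_ids, lang_code_id, eos_id=EOS_ID):
    """target: lang_code + text + </s>"""
    return [lang_code_id] + list(text_ids) + [eos_id]


# Ids copied from the example for "반가워" with ko_KR.
text_ids = [22880, 49591, 3901]
KO_KR = 57523

source = build_source(text_ids, KO_KR)
target = build_target(text_ids, KO_KR)
```

In practice the tokenizer produces these layouts for you; the sketch only makes the position of </s> and the language code explicit.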

Usage of tokenizer

tokenization of (single language, single text)

>>> from asian_bart import AsianBartTokenizer
>>> tokenizer = AsianBartTokenizer.from_pretrained("hyunwoongko/asian-bart-ecjk")
>>> tokenizer.prepare_seq2seq_batch(
...     src_texts="hello.",
...     src_langs="en_XX",
... )

{
  'input_ids': tensor([[37199, 35816,     2, 57521]]), 
  'attention_mask': tensor([[1, 1, 1, 1]])
}

batch tokenization of (single language, multiple texts)

>>> from asian_bart import AsianBartTokenizer
>>> tokenizer = AsianBartTokenizer.from_pretrained("hyunwoongko/asian-bart-ecjk")
>>> tokenizer.prepare_seq2seq_batch(
...     src_texts=["hello.", "how are you?", "good."],
...     src_langs="en_XX",
... )

{
  'input_ids': tensor([[37199, 35816,     2, 57521,     1,     1],
                       [38248, 46819, 39446, 36209,     2, 57521],
                       [40010, 39539,     2, 57521,     1,     1]]), 

  'attention_mask': tensor([[1, 1, 1, 1, 0, 0],
                            [1, 1, 1, 1, 1, 1],
                            [1, 1, 1, 1, 0, 0]])
}
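
The padded `input_ids` and the matching `attention_mask` in the output above can be reproduced on plain lists. This is a minimal sketch assuming right padding and a pad id of 1 (both inferred from the output above); `pad_batch` is a hypothetical helper, not part of asian-bart.

```python
PAD_ID = 1  # assumption: pad id inferred from the example output


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Unpadded rows copied from the batch example.
rows = [
    [37199, 35816, 2, 57521],
    [38248, 46819, 39446, 36209, 2, 57521],
    [40010, 39539, 2, 57521],
]
input_ids, attention_mask = pad_batch(rows)
```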

batch tokenization of (multiple languages, multiple texts)

>>> from asian_bart import AsianBartTokenizer
>>> tokenizer = AsianBartTokenizer.from_pretrained("hyunwoongko/asian-bart-ecjk")
>>> tokenizer.prepare_seq2seq_batch(
...     src_texts=["hello.", "반가워", "你好", "こんにちは"],
...     src_langs=["en_XX", "ko_KR", "zh_CN", "ja_XX"],
... )

{
  'input_ids': tensor([[37199, 35816, 39539,     2, 57521,     1,     1,     1],
                       [22880, 49591,  3901,     2, 57523,     1,     1,     1],
                       [50356,  7929,     2, 57524,     1,     1,     1,     1],
                       [42990, 19092, 51547, 36821, 33899, 37382,     2, 57522]]), 

   'attention_mask': tensor([[1, 1, 1, 1, 1, 0, 0, 0],
                             [1, 1, 1, 1, 1, 0, 0, 0],
                             [1, 1, 1, 1, 0, 0, 0, 0],
                             [1, 1, 1, 1, 1, 1, 1, 1]])
}

seq2seq tokenization of (source text, target text)

>>> from asian_bart import AsianBartTokenizer
>>> tokenizer = AsianBartTokenizer.from_pretrained("hyunwoongko/asian-bart-ecjk")
>>> tokenizer.prepare_seq2seq_batch(
...     src_texts="반가워",
...     src_langs="ko_KR",
...     tgt_texts="hello.",
...     tgt_langs="en_XX",
... )

{
  'input_ids': tensor([[22880, 49591,  3901,     2, 57523]]), 
  'attention_mask': tensor([[1, 1, 1, 1, 1]]), 
  'labels': tensor([[37199, 35816, 39539,     2, 57521]])
}

all of the batch tokenization settings above work the same way for target texts

>>> from asian_bart import AsianBartTokenizer
>>> tokenizer = AsianBartTokenizer.from_pretrained("hyunwoongko/asian-bart-ecjk")
>>> tokenizer.prepare_seq2seq_batch(
...     src_texts=["hello.", "반가워", "你好", "こんにちは"],
...     src_langs=["en_XX", "ko_KR", "zh_CN", "ja_XX"],
...     tgt_texts=["hello.", "반가워", "你好", "こんにちは"],
...     tgt_langs=["en_XX", "ko_KR", "zh_CN", "ja_XX"],
... )

{
  'input_ids': tensor([[37199, 35816, 39539,     2, 57521,     1,     1,     1],
                      [22880, 49591,  3901,     2, 57523,     1,     1,     1],
                      [50356,  7929,     2, 57524,     1,     1,     1,     1],
                      [42990, 19092, 51547, 36821, 33899, 37382,     2, 57522]]), 

  'attention_mask': tensor([[1, 1, 1, 1, 1, 0, 0, 0],
                            [1, 1, 1, 1, 1, 0, 0, 0],
                            [1, 1, 1, 1, 0, 0, 0, 0],
                            [1, 1, 1, 1, 1, 1, 1, 1]]), 

  'labels': tensor([[37199, 35816, 39539,     2, 57521,     1,     1,     1],
                    [22880, 49591,  3901,     2, 57523,     1,     1,     1],
                    [50356,  7929,     2, 57524,     1,     1,     1,     1],
                    [42990, 19092, 51547, 36821, 33899, 37382,     2, 57522]])
}

Usage of models

All interfaces are the same as those of the mBART model in Hugging Face Transformers.
The example below uses the ECJK model.
The other languages work the same way: change the name passed to from_pretrained for both the model and the tokenizer.
- English only: from_pretrained("hyunwoongko/asian-bart-en")
- Chinese only: from_pretrained("hyunwoongko/asian-bart-zh")
- Japanese only: from_pretrained("hyunwoongko/asian-bart-ja")
- Korean only: from_pretrained("hyunwoongko/asian-bart-ko")

# import modules
>>> import torch
>>> from asian_bart import AsianBartTokenizer, AsianBartForConditionalGeneration
>>> from transformers.models.bart.modeling_bart import shift_tokens_right

# create model and tokenizer
>>> tokenizer = AsianBartTokenizer.from_pretrained("hyunwoongko/asian-bart-ecjk")
>>> model = AsianBartForConditionalGeneration.from_pretrained("hyunwoongko/asian-bart-ecjk")

# tokenize texts
>>> tokens = tokenizer.prepare_seq2seq_batch(
...     src_texts="Kevin is the <mask> man in the world.",
...     src_langs="en_XX",
...     tgt_texts="Kevin is the most kind man in the world.",
...     tgt_langs="en_XX",                  
... )

>>> input_ids = tokens["input_ids"]
>>> attention_mask = tokens["attention_mask"]
>>> labels = tokens["labels"]
>>> decoder_input_ids = shift_tokens_right(labels, tokenizer.pad_token_id)

# forward pass for training
>>> outputs = model(
...     input_ids=input_ids,
...     attention_mask=attention_mask,
...     decoder_input_ids=decoder_input_ids,
... )

# compute loss
>>> lm_logits = outputs[0]
>>> loss_function = torch.nn.CrossEntropyLoss(
...     ignore_index=tokenizer.pad_token_id
... )

>>> loss = loss_function(
...     lm_logits.view(-1, lm_logits.shape[-1]), 
...     labels.view(-1)
... )

# generate text
>>> output = model.generate(
...     input_ids=input_ids,
...     attention_mask=attention_mask,
...     decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],
... )
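
The labels returned by the tokenizer end with the language code, while the decoder expects lang_code + text + </s>. The sketch below shows a shift that bridges the two on plain lists. It assumes that `shift_tokens_right` wraps the last non-pad token to the front, which is what the mBART input rule implies; the exact implementation depends on the installed transformers version. `wrap_shift_right` is a hypothetical stand-in, not the library function.

```python
def wrap_shift_right(labels, pad_id):
    """Move the last non-pad token of each row to the front, shifting the rest right.

    Hypothetical pure-Python stand-in, assuming right-padded rows.
    """
    shifted = []
    for row in labels:
        non_pad = [tok for tok in row if tok != pad_id]
        last_token = non_pad[-1]  # the language code in the label layout above
        # Drop the final position and prepend the wrapped token.
        shifted.append([last_token] + list(row[:-1]))
    return shifted


# Labels copied from the seq2seq tokenization example ("hello." with en_XX).
labels = [[37199, 35816, 39539, 2, 57521]]
decoder_input_ids = wrap_shift_right(labels, pad_id=1)

# A padded row copied from the multilingual batch example ("你好" with zh_CN).
padded = wrap_shift_right([[50356, 7929, 2, 57524, 1, 1, 1, 1]], pad_id=1)
```

With this layout, the decoder input at each position is the token the model has already seen, and the label at the same position is the token it should predict next.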

Downstream tasks

You can train various downstream tasks with asian-bart.
All classes are used the same way as their counterparts in Hugging Face Transformers.
Supported classes:
- AsianBartTokenizer
- AsianBartModel
- AsianBartForCausalLM
- AsianBartForQuestionAnswering
- AsianBartForConditionalGeneration
- AsianBartForSequenceClassification

License

Copyright 2021 Hyunwoong Ko.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Release history

- 1.0.2 (this version): Jun 10, 2021
- 1.0.1: Apr 3, 2021
- 1.0.0: Apr 1, 2021

Download files

No source distribution files are available for this release.

Built distribution: asian_bart-1.0.2-py3-none-any.whl (8.1 kB), uploaded Jun 10, 2021, Python 3

File details

Details for the file asian_bart-1.0.2-py3-none-any.whl.

File metadata

Download URL: asian_bart-1.0.2-py3-none-any.whl
Upload date: Jun 10, 2021
Size: 8.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.3

File hashes

Hashes for asian_bart-1.0.2-py3-none-any.whl
- SHA256: `93a5d5279c95067186eb6f046b962b0bcbf306b8de57a637f2cb60a0048fd6e0`
- MD5: `13af06634a0805cadcb364c2f092dfe1`
- BLAKE2b-256: `1192a4d6b8c58d735660fbae8061396df4bfa52e57f909e42bec1191ebf0ede2`
