Transformers library for KoBERT, DistilKoBERT
KoBERT-Transformers
KoBERT & DistilKoBERT on 🤗 Huggingface Transformers 🤗
The KoBERT model itself is identical to the one in the official repo. This repo was created to support all of the Huggingface tokenizer APIs.
🚨 Important! 🚨
TL;DR
1. Be sure to install transformers v2.9.1 or higher! (A quick version check follows below.)
2. For the tokenizer, use tokenization_kobert.py from this repo!
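As a sanity check for item 1, the installed version can be inspected from Python. The exact string will vary with your environment; anything at or above 2.9.1 is fine.

>>> import transformers
>>> transformers.__version__
'2.9.1'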
1. Tokenizer compatibility
Starting from v2.9.0, some of the tokenization-related APIs in Huggingface Transformers changed. The existing tokenization_kobert.py has been updated accordingly for the newer versions.
2. padding_idx issue in Embedding
Previously, BertEmbeddings in BertModel hard-coded padding_idx=0. (See the code below.)
class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
However, SentencePiece uses pad_token_id=1 and unk_token_id=0 by default (KoBERT does the same), so a BertModel that uses these values as-is runs into problems.
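The mismatch can be seen directly from the tokenizer's special token ids. This is a minimal sketch that assumes tokenization_kobert.py from this repo (the KoBertTokenizer described further below) is importable.

>>> from tokenization_kobert import KoBertTokenizer
>>> tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')
>>> tokenizer.pad_token_id, tokenizer.unk_token_id  # pad is 1 and unk is 0, not the pad=0 that BertEmbeddings assumed
(1, 0)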
Huggingface recently recognized this issue and fixed it in v2.9.0 (related PR #3793): pad_token_id=1 can now be set in the config, which resolves the problem.
class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
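As a quick check that the fix takes effect, the pad token id can be read from the hosted config. This sketch assumes transformers>=2.9.1 and that the monologg/kobert config carries pad_token_id=1 as described above.

>>> from transformers import BertConfig
>>> config = BertConfig.from_pretrained('monologg/kobert')
>>> config.pad_token_id  # picked up as padding_idx by BertEmbeddings from v2.9.0 onward
1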
However, v2.9.0 did not fix this issue for DistilBERT, ALBERT, etc., so I opened a PR to handle it directly (related PR #3965), and the fix was finally merged and released in v2.9.1.
The code below shows the difference between the previous and current versions.
# Transformers v2.7.0
>>> from transformers import BertModel, DistilBertModel
>>> model = BertModel.from_pretrained("monologg/kobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=0)
>>> model = DistilBertModel.from_pretrained("monologg/distilkobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=0)

# Transformers v2.9.1
>>> from transformers import BertModel, DistilBertModel
>>> model = BertModel.from_pretrained("monologg/kobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=1)
>>> model = DistilBertModel.from_pretrained("monologg/distilkobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=1)
KoBERT / DistilKoBERT on 🤗 Transformers 🤗
Dependencies
- torch>=1.1.0
- transformers>=2.9.1
How to Use
>>> from transformers import BertModel, DistilBertModel
>>> bert_model = BertModel.from_pretrained('monologg/kobert')
>>> distilbert_model = DistilBertModel.from_pretrained('monologg/distilkobert')
To use the tokenizer, copy the tokenization_kobert.py file into your root directory and import KoBertTokenizer.
- KoBERT and DistilKoBERT use the same tokenizer.
- The original KoBERT had an issue where special tokens were not split correctly; this has been fixed and is reflected here. (Issue link)
>>> from tokenization_kobert import KoBertTokenizer
>>> tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')  # same for monologg/distilkobert
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]
Kobert-Transformers (Pip library)
- A Python library that wraps tokenization_kobert.py
- Provides KoBERT and DistilKoBERT in the form of the Huggingface Transformers library
- Since v0.4.0, transformers v2.9.1 or higher is installed by default.
Install Kobert-Transformers
$ pip3 install kobert-transformers
How to Use
>>> import torch
>>> from kobert_transformers import get_kobert_model, get_distilkobert_model
>>> model = get_kobert_model()
>>> model.eval()
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> attention_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
>>> sequence_output, pooled_output = model(input_ids, attention_mask, token_type_ids)
>>> sequence_output[0]
tensor([[-0.2461,  0.2428,  0.2590,  ..., -0.4861, -0.0731,  0.0756],
        [-0.2478,  0.2420,  0.2552,  ..., -0.4877, -0.0727,  0.0754],
        [-0.2472,  0.2420,  0.2561,  ..., -0.4874, -0.0733,  0.0765]],
       grad_fn=<SelectBackward>)
>>> from kobert_transformers import get_tokenizer
>>> tokenizer = get_tokenizer()
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]
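Tying this back to the padding_idx discussion above, padded positions should be filled with id 1 (the pad token) rather than 0 (the unk token). A minimal sketch, assuming the pad_to_max_length argument available in transformers v2.9.x; the exact call may differ in later versions.

>>> tokenizer.encode("한국어 모델을 공유합니다.", max_length=10, pad_to_max_length=True)
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3, 1]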