Transformers library for KoBERT, DistilKoBERT

Project description

KoBERT-Transformers

KoBERT & DistilKoBERT on 🤗 Huggingface Transformers 🤗

The KoBERT model is identical to the one in the official repo. This repo was created so that KoBERT supports every API of the Huggingface tokenizer.

🚨 Important! 🚨

๐Ÿ™ TL;DR

  1. transformers ๋Š” v2.9.1 ์ด์ƒ์„ ๋ฐ˜๋“œ์‹œ ์„ค์น˜!
  2. tokenizer๋Š” ๋ณธ ๋ ˆํฌ์˜ tokenization_kobert.py๋ฅผ ์‚ฌ์šฉ!
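
If you want to fail fast on an older install, a minimal runtime check could look like the following (a sketch; the packaging dependency is an assumption on my part, not something this repo requires):

import transformers
from packaging import version

# Abort early if the installed transformers is older than v2.9.1.
assert version.parse(transformers.__version__) >= version.parse("2.9.1"), \
    "kobert-transformers expects transformers>=2.9.1"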

1. Tokenizer compatibility

Starting with v2.9.0, Huggingface Transformers changed some of its tokenization-related APIs. The existing tokenization_kobert.py has been updated accordingly to match the newer versions.

2. The padding_idx issue in Embedding

BertEmbeddings in BertModel has long hard-coded padding_idx=0. (See the code below.)

class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

However, SentencePiece defaults to pad_token_id=1 and unk_token_id=0 (KoBERT does as well), and this causes a problem for a BertModel that uses those IDs as-is.
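
The mismatch is easy to see in isolation. Below is a toy PyTorch sketch (illustration only, not KoBERT code): the row at padding_idx is zero-initialized and never receives a gradient, so hard-coding padding_idx=0 freezes the embedding of id 0 ([UNK] in KoBERT's vocab) while leaving id 1 ([PAD]) trainable.

import torch.nn as nn

# Toy illustration: padding_idx pins that embedding row to zero and blocks its gradient.
emb = nn.Embedding(8002, 768, padding_idx=0)
print(emb.weight[0].abs().sum())  # tensor(0.) -- but id 0 is [UNK] for KoBERT
# id 1 ([PAD] in KoBERT's vocab) stays trainable here, even though it is the
# row that actually should be pinned to zero.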

Huggingface also recognized the issue recently and fixed it in v2.9.0 (related PR #3793): pad_token_id=1 can now be set in the config, which makes the problem solvable.

class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
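
You can check the fix directly. A quick sketch, assuming the hosted monologg/kobert config sets pad_token_id=1 (which is what the fix relies on):

from transformers import BertConfig

# With PR #3793, BertEmbeddings reads padding_idx from the config.
config = BertConfig.from_pretrained("monologg/kobert")
print(config.pad_token_id)  # expected: 1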

However, v2.9.0 still left this issue unresolved for DistilBERT, ALBERT, etc., so I submitted a PR to handle it myself (related PR #3965), and the fix was finally merged and released in v2.9.1.

The code below shows the difference between the previous and the current version.

# Transformers v2.7.0
>>> from transformers import BertModel, DistilBertModel
>>> model = BertModel.from_pretrained("monologg/kobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=0)
>>> model = DistilBertModel.from_pretrained("monologg/distilkobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=0)


# Transformers v2.9.1
>>> from transformers import BertModel, DistilBertModel
>>> model = BertModel.from_pretrained("monologg/kobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=1)
>>> model = DistilBertModel.from_pretrained("monologg/distilkobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=1)

KoBERT / DistilKoBERT on 🤗 Transformers 🤗

Dependencies

  • torch>=1.1.0
  • transformers>=2.9.1

How to Use

>>> from transformers import BertModel, DistilBertModel
>>> bert_model = BertModel.from_pretrained('monologg/kobert')
>>> distilbert_model = DistilBertModel.from_pretrained('monologg/distilkobert')

To use the tokenizer, copy the tokenization_kobert.py file from the root directory and import KoBertTokenizer.

  • KoBERT and DistilKoBERT use the same tokenizer.
  • The original KoBERT had an issue where special tokens were not split off correctly; that has been fixed and applied here. (Issue link)
>>> from tokenization_kobert import KoBertTokenizer
>>> tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')  # same for monologg/distilkobert
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]
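
Putting the two together, here is a minimal end-to-end sketch (it assumes the tuple return signature that BertModel has in transformers v2.9.x):

import torch
from transformers import BertModel
from tokenization_kobert import KoBertTokenizer

# Tokenize a sentence and run it through KoBERT.
tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')
model = BertModel.from_pretrained('monologg/kobert')
model.eval()

# encode() adds [CLS]/[SEP] automatically.
input_ids = torch.tensor([tokenizer.encode("한국어 모델을 공유합니다.")])
with torch.no_grad():
    sequence_output, pooled_output = model(input_ids)
print(sequence_output.shape)  # expected: torch.Size([1, 9, 768])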

Kobert-Transformers (Pip library)


  • A Python library that wraps tokenization_kobert.py
  • Provides KoBERT and DistilKoBERT in the form of the Huggingface Transformers library
  • As of v0.4.0, transformers v2.9.1 or higher is installed by default.

Install Kobert-Transformers

$ pip3 install kobert-transformers

How to Use

>>> import torch
>>> from kobert_transformers import get_kobert_model, get_distilkobert_model
>>> model = get_kobert_model()
>>> model.eval()
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> attention_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
>>> sequence_output, pooled_output = model(input_ids, attention_mask, token_type_ids)
>>> sequence_output[0]
tensor([[-0.2461,  0.2428,  0.2590,  ..., -0.4861, -0.0731,  0.0756],
        [-0.2478,  0.2420,  0.2552,  ..., -0.4877, -0.0727,  0.0754],
        [-0.2472,  0.2420,  0.2561,  ..., -0.4874, -0.0733,  0.0765]],
       grad_fn=<SelectBackward>)
>>> from kobert_transformers import get_tokenizer
>>> tokenizer = get_tokenizer()
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]
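
get_distilkobert_model works the same way; note that DistilBERT has no token type embeddings, so only input_ids and attention_mask are passed. A short sketch under the same v2.9.x API assumption:

import torch
from kobert_transformers import get_distilkobert_model

# DistilBERT takes no token_type_ids; the first element of the output tuple
# is the last hidden state.
model = get_distilkobert_model()
model.eval()
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
attention_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
last_hidden_state = model(input_ids, attention_mask)[0]  # shape: (2, 3, 768)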

Download files


Source Distributions

No source distribution files are available for this release.

Built Distribution

kobert_transformers-0.4.1-py3-none-any.whl (12.3 kB, Python 3)

File details

Details for the file kobert_transformers-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: kobert_transformers-0.4.1-py3-none-any.whl
  • Upload date:
  • Size: 12.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/46.1.3.post20200330 requests-toolbelt/0.9.1 tqdm/4.27.0 CPython/3.6.6

File hashes

Hashes for kobert_transformers-0.4.1-py3-none-any.whl
  • SHA256: 518d622054ce0965c7853eb9a0710051b8db295f7f6c477f584473ba3f64330e
  • MD5: dc05ff5d8d3d4deb6886411c6b7165ba
  • BLAKE2b-256: f36df4e21513c1f26cacd68c144a428ccaa90dd92d85985e878976ebbaf06624

