
DistilKoBERT

Distillation of KoBERT

KoBERT for transformers library

>>> from transformers import BertModel
>>> model = BertModel.from_pretrained('kobert')
  • To use the tokenizer, copy the tokenization_kobert.py file from the kobert folder and import KoBertTokenizer.
>>> from tokenization_kobert import KoBertTokenizer
>>> tokenizer = KoBertTokenizer.from_pretrained('kobert')
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]
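
Putting the two snippets together, a minimal sketch of encoding a sentence and extracting its contextual embeddings could look like the following (single sentence, no padding; it assumes the same 'kobert' checkpoint as above and KoBERT's BERT-base hidden size of 768):

>>> import torch
>>> from transformers import BertModel
>>> from tokenization_kobert import KoBertTokenizer

>>> tokenizer = KoBertTokenizer.from_pretrained('kobert')
>>> model = BertModel.from_pretrained('kobert')

>>> # Tokenize with special tokens and convert to ids (batch of one, no padding)
>>> tokens = ['[CLS]'] + tokenizer.tokenize("한국어 모델을 공유합니다.") + ['[SEP]']
>>> input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

>>> # Forward pass: the first output is the last-layer hidden state for each token
>>> outputs = model(input_ids)
>>> outputs[0].shape
torch.Size([1, 9, 768])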

Pretraining DistilKoBERT

  • The original 12 layers were reduced to 3; all other configuration follows KoBERT as-is (see the sketch after this list).
  • The pretraining corpus is roughly 6GB of data (Korean Wikipedia, Namuwiki, news, etc.), trained for 2.5 epochs.
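
As a reference point, a hypothetical reconstruction of that configuration with the transformers DistilBERT classes might look like the following. The 8,002-token vocab and the BERT-base dimensions are assumptions inferred from KoBERT; the released checkpoint should be loaded through the library below, not rebuilt this way.

>>> from transformers import DistilBertConfig, DistilBertModel

>>> # Hypothetical sketch: 3 transformer layers, everything else BERT-base sized as in KoBERT
>>> config = DistilBertConfig(
...     vocab_size=8002,              # assumed KoBERT SentencePiece vocab size
...     max_position_embeddings=512,
...     n_layers=3,                   # 12 layers reduced to 3
...     n_heads=12,
...     dim=768,
...     hidden_dim=3072,
... )
>>> model = DistilBertModel(config)   # randomly initialized, not the distilled weights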

DistilKoBERT Python library

Install DistilKoBERT

$ pip3 install distilkobert

How to Use

>>> import torch
>>> from distilkobert import get_distilkobert_model

>>> model = get_distilkobert_model()
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> attention_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> last_layer_hidden_state, _ = model(input_ids, attention_mask)
>>> last_layer_hidden_state
tensor([[[-0.2155,  0.1182,  0.1865,  ..., -1.0626, -0.0747, -0.0945],
         [-0.5559, -0.1476,  0.1060,  ..., -0.3178, -0.0172, -0.1064],
         [ 0.1284,  0.2212,  0.2971,  ..., -0.4619,  0.0483,  0.3293]],

        [[ 0.0414, -0.2016,  0.2643,  ..., -0.4734, -0.9823, -0.2869],
         [ 0.2286, -0.1787,  0.1831,  ..., -0.7605, -1.0209, -0.5340],
         [ 0.2507, -0.0022,  0.4103,  ..., -0.7278, -0.9471, -0.3140]]],
       grad_fn=<AddcmulBackward>)
>>> from distilkobert import get_tokenizer
>>> tokenizer = get_tokenizer()
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]

Result on Sub-task

                              Model Size (MB)   NSMC (%)
KoBERT                        351               89.63
DistilKoBERT (3 layer)        108               88.28
DistilKoBERT (1 layer)        54                84.24
Bert-base-multilingual-cased  681               87.07
FastText                      2                 85.50
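
NSMC is a binary movie-review sentiment task; for the BERT-family models it is typically handled by fine-tuning the encoder with a small classification head. A minimal sketch under that assumption follows; the 768 hidden size, first-token pooling, and dummy inputs are illustrative, not the exact fine-tuning code behind the table.

>>> import torch
>>> import torch.nn as nn
>>> from distilkobert import get_distilkobert_model

>>> encoder = get_distilkobert_model()
>>> classifier = nn.Linear(768, 2)        # assumed hidden size 768; NSMC labels: negative / positive

>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> attention_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> labels = torch.LongTensor([1, 0])     # dummy sentiment labels

>>> # Use the hidden state of the first token as the sentence representation
>>> last_hidden_state = encoder(input_ids, attention_mask)[0]
>>> logits = classifier(last_hidden_state[:, 0])
>>> loss = nn.functional.cross_entropy(logits, labels)
>>> loss.backward()                       # an optimizer step would then update encoder and head jointly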

TBD

  • [ ] Train DistilKoALBERT
  • [x] Build API Server
  • [x] Make Dockerfile for server


Download files


Files for distilkobert, version 0.3.0

Filename                              Size     File type   Python version
distilkobert-0.3.0-py3-none-any.whl   9.0 kB   Wheel       py3
