Skip to main content

Google AI 2018 BERT pytorch implementation

Project description

BERT-pytorch

LICENSE GitHub issues GitHub stars CircleCI PyPI PyPI - Status Documentation Status

Pytorch implementation of Google AI's 2018 BERT, with simple annotation

BERT 2018 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Paper URL : https://arxiv.org/abs/1810.04805

Introduction

Google AI's BERT paper shows the amazing result on various NLP task (new 17 NLP tasks SOTA), including outperform the human F1 score on SQuAD v1.1 QA task. This paper proved that Transformer(self-attention) based encoder can be powerfully used as alternative of previous language model with proper language model training method. And more importantly, they showed us that this pre-trained language model can be transfer into any NLP task without making task specific model architecture.

This amazing result would be record in NLP history, and I expect many further papers about BERT will be published very soon.

This repo is implementation of BERT. Code is very simple and easy to understand fastly. Some of these codes are based on The Annotated Transformer

Currently this project is working on progress. And the code is not verified yet.

Installation

pip install bert-pytorch

Quickstart

NOTICE : Your corpus should be prepared with two sentences in one line with tab(\t) separator

0. Prepare your corpus

Welcome to the \t the jungle\n
I can stay \t here all night\n

or tokenized corpus (tokenization is not in package)

Wel_ _come _to _the \t _the _jungle\n
_I _can _stay \t _here _all _night\n

1. Building vocab based on your corpus

bert-vocab -c data/corpus.small -o data/vocab.small

2. Train your own BERT model

bert -c data/corpus.small -v data/vocab.small -o output/bert.model

Language Model Pre-training

In the paper, authors shows the new language model training methods, which are "masked language model" and "predict next sentence".

Masked Language Model

Original Paper : 3.3.1 Task #1: Masked LM

Input Sequence  : The man went to [MASK] store with [MASK] dog
Target Sequence :                  the                his

Rules:

Randomly 15% of input token will be changed into something, based on under sub-rules

  1. Randomly 80% of tokens, gonna be a [MASK] token
  2. Randomly 10% of tokens, gonna be a [RANDOM] token(another word)
  3. Randomly 10% of tokens, will be remain as same. But need to be predicted.

Predict Next Sentence

Original Paper : 3.3.2 Task #2: Next Sentence Prediction

Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]
Label : Is Next

Input = [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext

"Is this sentence can be continuously connected?"

understanding the relationship, between two text sentences, which is not directly captured by language modeling

Rules:

  1. Randomly 50% of next sentence, gonna be continuous sentence.
  2. Randomly 50% of next sentence, gonna be unrelated sentence.

Author

Junseong Kim, Scatter Lab (codertimo@gmail.com / junseong.kim@scatterlab.co.kr)

License

This project following Apache 2.0 License as written in LICENSE file

Copyright 2018 Junseong Kim, Scatter Lab, respective BERT contributors

Copyright (c) 2018 Alexander Rush : The Annotated Trasnformer

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bert_demo-0.0.1a4.tar.gz (3.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

bert_demo-0.0.1a4-py3-none-any.whl (3.3 kB view details)

Uploaded Python 3

File details

Details for the file bert_demo-0.0.1a4.tar.gz.

File metadata

  • Download URL: bert_demo-0.0.1a4.tar.gz
  • Upload date:
  • Size: 3.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.19

File hashes

Hashes for bert_demo-0.0.1a4.tar.gz
Algorithm Hash digest
SHA256 ec27e84f1cab3a09dc04be46d0fb4811bb2aedc194b80adb5378fab027f86da9
MD5 c96d22b53662cd842f28a83845a0cfe7
BLAKE2b-256 4b9957a00d15dc617982e095a58a16e2c18805c87e48b6773c200bc89208ae26

See more details on using hashes here.

File details

Details for the file bert_demo-0.0.1a4-py3-none-any.whl.

File metadata

  • Download URL: bert_demo-0.0.1a4-py3-none-any.whl
  • Upload date:
  • Size: 3.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.19

File hashes

Hashes for bert_demo-0.0.1a4-py3-none-any.whl
Algorithm Hash digest
SHA256 3784ecd6399265b851459a1cd5ee224e81a415af6e3b6a69705454118bae9137
MD5 46ce7636316c523c60c842413a5392a7
BLAKE2b-256 f51ebd843a4559f0b2f003842873890bb1fbe7db12c03f1cf22427d848aa480d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page