BERT implemented in Keras
Keras BERT
Implementation of BERT. Official pre-trained models can be loaded for feature extraction and prediction.
Install
```bash
pip install keras-bert
```
Usage
- Load Official Pre-trained Models
- Tokenizer
- Train & Use
- Use Warmup
- Download Pretrained Checkpoints
- Extract Features
External Links
- Kashgari is a production-ready NLP transfer learning framework for text-labeling and text-classification
- Keras ALBERT
Load Official Pre-trained Models
In the feature extraction demo, you should be able to get the same extraction results as the official model `chinese_L-12_H-768_A-12`. In the prediction demo, the missing word in the sentence can be predicted.
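A minimal sketch of loading an unpacked official checkpoint for feature extraction, assuming the standard file layout of the release archives (the directory path below is a placeholder):

```python
import os
from keras_bert import load_trained_model_from_checkpoint

# Placeholder path to an unpacked official checkpoint such as chinese_L-12_H-768_A-12
checkpoint_dir = 'xxx/chinese_L-12_H-768_A-12'
config_path = os.path.join(checkpoint_dir, 'bert_config.json')
checkpoint_path = os.path.join(checkpoint_dir, 'bert_model.ckpt')

# Build the Keras model and load the pre-trained weights for feature extraction
model = load_trained_model_from_checkpoint(config_path, checkpoint_path, training=False)
model.summary()
```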
Run on TPU
The extraction demo shows how to convert the model to one that runs on a TPU.
The classification demo shows how to apply the model to simple classification tasks.
Tokenizer
The `Tokenizer` class is used for splitting texts and generating indices:
```python
from keras_bert import Tokenizer

token_dict = {
    '[CLS]': 0,
    '[SEP]': 1,
    'un': 2,
    '##aff': 3,
    '##able': 4,
    '[UNK]': 5,
}
tokenizer = Tokenizer(token_dict)
print(tokenizer.tokenize('unaffable'))  # The result should be `['[CLS]', 'un', '##aff', '##able', '[SEP]']`
indices, segments = tokenizer.encode('unaffable')
print(indices)   # Should be `[0, 2, 3, 4, 1]`
print(segments)  # Should be `[0, 0, 0, 0, 0]`

print(tokenizer.tokenize(first='unaffable', second='钢'))
# The result should be `['[CLS]', 'un', '##aff', '##able', '[SEP]', '钢', '[SEP]']`
indices, segments = tokenizer.encode(first='unaffable', second='钢', max_len=10)
print(indices)   # Should be `[0, 2, 3, 4, 1, 5, 1, 0, 0, 0]`
print(segments)  # Should be `[0, 0, 0, 0, 0, 1, 1, 0, 0, 0]`
```
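For a real checkpoint, the token dictionary is normally built from the bundled vocab.txt rather than by hand; a short sketch using the `load_vocabulary` helper (the path is a placeholder):

```python
from keras_bert import Tokenizer, load_vocabulary

# Placeholder path to the vocabulary file of an unpacked official checkpoint
vocab_path = 'xxx/chinese_L-12_H-768_A-12/vocab.txt'

token_dict = load_vocabulary(vocab_path)  # maps each word piece to its index
tokenizer = Tokenizer(token_dict)
```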
Train & Use
```python
from tensorflow import keras
from keras_bert import get_base_dict, get_model, compile_model, gen_batch_inputs

# A toy input example
sentence_pairs = [
    [['all', 'work', 'and', 'no', 'play'], ['makes', 'jack', 'a', 'dull', 'boy']],
    [['from', 'the', 'day', 'forth'], ['my', 'arm', 'changed']],
    [['and', 'a', 'voice', 'echoed'], ['power', 'give', 'me', 'more', 'power']],
]

# Build token dictionary
token_dict = get_base_dict()  # A dict that contains some special tokens
for pairs in sentence_pairs:
    for token in pairs[0] + pairs[1]:
        if token not in token_dict:
            token_dict[token] = len(token_dict)
token_list = list(token_dict.keys())  # Used for selecting a random word

# Build & train the model
model = get_model(
    token_num=len(token_dict),
    head_num=5,
    transformer_num=12,
    embed_dim=25,
    feed_forward_dim=100,
    seq_len=20,
    pos_num=20,
    dropout_rate=0.05,
)
compile_model(model)
model.summary()


def _generator():
    while True:
        yield gen_batch_inputs(
            sentence_pairs,
            token_dict,
            token_list,
            seq_len=20,
            mask_rate=0.3,
            swap_sentence_rate=1.0,
        )


model.fit_generator(
    generator=_generator(),
    steps_per_epoch=1000,
    epochs=100,
    validation_data=_generator(),
    validation_steps=100,
    callbacks=[
        keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
    ],
)

# Use the trained model
inputs, output_layer = get_model(
    token_num=len(token_dict),
    head_num=5,
    transformer_num=12,
    embed_dim=25,
    feed_forward_dim=100,
    seq_len=20,
    pos_num=20,
    dropout_rate=0.05,
    training=False,      # The input layers and output layer will be returned if `training` is `False`
    trainable=False,     # Whether the model is trainable. The default value is the same as `training`
    output_layer_num=4,  # The number of layers whose outputs will be concatenated as a single output.
                         # Only available when `training` is `False`.
)
```
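A minimal sketch of wiring the returned `inputs` and `output_layer` into ordinary Keras models, either for plain feature extraction or with an illustrative classification head (the 2-class dense layer is an assumption, not part of keras-bert):

```python
from tensorflow import keras

# `inputs` and `output_layer` come from the `get_model(..., training=False)` call above.
extract_model = keras.models.Model(inputs=inputs, outputs=output_layer)

# Illustrative fine-tuning head (assumed 2-class task): take the [CLS] position
# from the concatenated layer outputs and attach a softmax classifier.
cls_embedding = keras.layers.Lambda(lambda x: x[:, 0])(output_layer)
probs = keras.layers.Dense(2, activation='softmax')(cls_embedding)
classifier = keras.models.Model(inputs=inputs, outputs=probs)
classifier.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```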
Use Warmup
The `AdamWarmup` optimizer is provided for warmup and decay. The learning rate will reach `lr` in `warmup_steps` steps, and decay to `min_lr` in `decay_steps` steps. There is a helper function `calc_train_steps` for calculating the two steps:
```python
import numpy as np
from keras_bert import AdamWarmup, calc_train_steps

train_x = np.random.standard_normal((1024, 100))

total_steps, warmup_steps = calc_train_steps(
    num_example=train_x.shape[0],
    batch_size=32,
    epochs=10,
    warmup_proportion=0.1,
)

optimizer = AdamWarmup(total_steps, warmup_steps, lr=1e-3, min_lr=1e-5)
```
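The optimizer can then be passed to an ordinary `compile` call; a short sketch, assuming `model` is a Keras model such as one built with `get_model` (the loss and metric are placeholders):

```python
# Illustrative only: plug the warmup optimizer into a model's compile step.
model.compile(
    optimizer=optimizer,
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)
```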
Download Pretrained Checkpoints
Several download URLs have been added. You can get the downloaded and uncompressed path of a checkpoint with:
```python
from keras_bert import get_pretrained, PretrainedList, get_checkpoint_paths

model_path = get_pretrained(PretrainedList.multi_cased_base)
paths = get_checkpoint_paths(model_path)
print(paths.config, paths.checkpoint, paths.vocab)
```
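The resolved paths can then be handed to the loading helpers; a minimal sketch, assuming the checkpoint downloaded above:

```python
from keras_bert import load_trained_model_from_checkpoint, load_vocabulary

# Reuse the resolved paths to build the model and its token dictionary
model = load_trained_model_from_checkpoint(paths.config, paths.checkpoint, training=False)
token_dict = load_vocabulary(paths.vocab)
```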
Extract Features
You can use the helper function `extract_embeddings` if the features of tokens or sentences (without further tuning) are what you need. To extract the features of all tokens:
```python
from keras_bert import extract_embeddings

model_path = 'xxx/yyy/uncased_L-12_H-768_A-12'
texts = ['all work and no play', 'makes jack a dull boy~']

embeddings = extract_embeddings(model_path, texts)
```
The returned result is a list with the same length as the input texts. Each item in the list is a numpy array truncated to the length of the input. The shapes of the outputs in this example are `(7, 768)` and `(8, 768)`.
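A quick way to confirm those shapes (5 word pieces plus `[CLS]` and `[SEP]` give 7 rows for the first sentence):

```python
for text, embedding in zip(texts, embeddings):
    # Each row is the 768-dimensional vector of one token, including [CLS] and [SEP]
    print(text, embedding.shape)  # (7, 768) and (8, 768) respectively
```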
When the inputs are paired sentences, and you need the outputs of `NSP` and the max-pooling of the last 4 layers:
```python
from keras_bert import extract_embeddings, POOL_NSP, POOL_MAX

model_path = 'xxx/yyy/uncased_L-12_H-768_A-12'
texts = [
    ('all work and no play', 'makes jack a dull boy'),
    ('makes jack a dull boy', 'all work and no play'),
]

embeddings = extract_embeddings(model_path, texts, output_layer_num=4, poolings=[POOL_NSP, POOL_MAX])
```
There are no token features in the results. The outputs of `NSP` and max-pooling will be concatenated, with the final shape `(768 x 4 x 2,)`.
The second argument of the helper function can also be a generator. To extract features from a file:
```python
import codecs
from keras_bert import extract_embeddings

model_path = 'xxx/yyy/uncased_L-12_H-768_A-12'

with codecs.open('xxx.txt', 'r', 'utf8') as reader:
    texts = map(lambda x: x.strip(), reader)
    embeddings = extract_embeddings(model_path, texts)
```
Use tensorflow.python.keras
Add `TF_KERAS=1` to environment variables to use `tensorflow.python.keras`.
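When setting the flag from Python rather than the shell, it has to be set before `keras_bert` is imported; a minimal sketch:

```python
import os

os.environ['TF_KERAS'] = '1'  # must be set before importing keras_bert

from keras_bert import get_model  # now backed by tensorflow.python.keras
```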
Source Distribution
File details
Details for the file `keras-bert-0.89.0.tar.gz`.
File metadata
- Download URL: keras-bert-0.89.0.tar.gz
- Upload date:
- Size: 25.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | d35457c91130e23fc9a97181b0eb5c7b81d6945b40edc52310f6f713ed4eaecd
MD5 | 1d7a8cb8049f41fa8281b17ad05816b7
BLAKE2b-256 | 740affc65dfa4b31942ee8348e0026d2a7ee57e1769e9266c677141a3e2cac9c