LangML (Language ModeL) is a Keras-based language model toolkit with a TensorFlow backend. It provides mainstream pre-trained language models, e.g., BERT, RoBERTa, and ALBERT, as well as their downstream application models.
Features
- Common and widely used Keras layers: CRF, attention layers, Transformer
- Pretrained language models: BERT, RoBERTa, and ALBERT, with friendly interfaces that make it easy to build downstream single-tower, shared/unshared two-tower, or multi-tower models
- Tokenizers: WPTokenizer (WordPiece) and SPTokenizer (SentencePiece)
- Baseline models: text classification and named entity recognition. No code is needed to train the baselines: just preprocess the data into the specified format and use "langml-cli" to train a model.
Installation
You can install or upgrade langml/langml-cli via the following command:
pip install -U langml
Documentation
Keras Variants
LangML supports both keras and tf.keras. Set an environment variable to choose a specific Keras variant:
export TF_KERAS=0 # use keras
export TF_KERAS=1 # use tf.keras
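Since the variant is typically resolved when langml is first imported, set the variable beforehand. As a sketch, you can also set it from Python, provided it happens before the first langml import:

import os
os.environ['TF_KERAS'] = '1'  # use tf.keras; must be set before importing langml

import langml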
NLP Baseline Models
You can train various baseline models using "langml-cli".
Usage:
$ langml-cli --help
Usage: langml [OPTIONS] COMMAND [ARGS]...
LangML client
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
baseline LangML Baseline client
Text Classification
Please prepare your data in JSONLines format, providing a text field and a label field in each line, for example:
{"text": "this is sentence1", "label": "label1"}
{"text": "this is sentence2", "label": "label2"}
Bert
$ langml-cli baseline clf bert --help
Usage: langml baseline clf bert [OPTIONS]
Options:
--backbone TEXT specify backbone: bert | roberta | albert
--epoch INTEGER epochs
--batch_size INTEGER batch size
--learning_rate FLOAT learning rate
--max_len INTEGER max len
--lowercase do lowercase
--tokenizer_type TEXT specify tokenizer type from [`wordpiece`,
`sentencepiece`]
--monitor TEXT monitor for keras callback
--early_stop INTEGER patience to early stop
--use_micro whether to use micro metrics
--config_path TEXT bert config path [required]
--ckpt_path TEXT bert checkpoint path [required]
--vocab_path TEXT bert vocabulary path [required]
--train_path TEXT train path [required]
--dev_path TEXT dev path [required]
--test_path TEXT test path
--save_dir TEXT dir to save model [required]
--verbose INTEGER 0 = silent, 1 = progress bar, 2 = one line per
epoch
--distributed_training distributed training
--distributed_strategy TEXT distributed training strategy
--help Show this message and exit.
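For example, a training run could be launched as follows (all paths and hyperparameters are placeholders; omitted options fall back to their defaults):

$ langml-cli baseline clf bert \
    --backbone bert \
    --epoch 10 \
    --batch_size 32 \
    --max_len 128 \
    --config_path /path/to/bert_config.json \
    --ckpt_path /path/to/bert_model.ckpt \
    --vocab_path /path/to/vocab.txt \
    --train_path /path/to/train.jsonl \
    --dev_path /path/to/dev.jsonl \
    --save_dir /path/to/save_dir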
BiLSTM
$ langml-cli baseline clf bilstm --help
Usage: langml baseline clf bilstm [OPTIONS]
Options:
--epoch INTEGER epochs
--batch_size INTEGER batch size
--learning_rate FLOAT learning rate
--embedding_size INTEGER embedding size
--hidden_size INTEGER hidden size of lstm
--max_len INTEGER max len
--lowercase do lowercase
--tokenizer_type TEXT specify tokenizer type from [`wordpiece`,
`sentencepiece`]
--monitor TEXT monitor for keras callback
--early_stop INTEGER patience to early stop
--use_micro whether to use micro metrics
--vocab_path TEXT vocabulary path [required]
--train_path TEXT train path [required]
--dev_path TEXT dev path [required]
--test_path TEXT test path
--save_dir TEXT dir to save model [required]
--verbose INTEGER 0 = silent, 1 = progress bar, 2 = one line per
epoch
--with_attention apply attention mechanism
--distributed_training distributed training
--distributed_strategy TEXT distributed training strategy
--help Show this message and exit.
TextCNN
$ langml-cli baseline clf textcnn --help
Usage: langml baseline clf textcnn [OPTIONS]
Options:
--epoch INTEGER epochs
--batch_size INTEGER batch size
--learning_rate FLOAT learning rate
--embedding_size INTEGER embedding size
--filter_size INTEGER filter size of convolution
--max_len INTEGER max len
--lowercase do lowercase
--tokenizer_type TEXT specify tokenizer type from [`wordpiece`,
`sentencepiece`]
--monitor TEXT monitor for keras callback
--early_stop INTEGER patience to early stop
--use_micro whether to use micro metrics
--vocab_path TEXT vocabulary path [required]
--train_path TEXT train path [required]
--dev_path TEXT dev path [required]
--test_path TEXT test path
--save_dir TEXT dir to save model [required]
--verbose INTEGER 0 = silent, 1 = progress bar, 2 = one line per
epoch
--distributed_training distributed training
--distributed_strategy TEXT distributed training strategy
--help Show this message and exit.
Named Entity Recognition
Please prepare your data in the following format: use \t to separate the entity segment and the entity type within a sentence, and use \n\n (a blank line) to separate sentences.
An English example:
I like	O
apples	Fruit

I like	O
pineapples	Fruit
A Chinese example:
我来自	O
中国	LOC

我住在	O
上海	LOC
Bert-CRF
$ langml-cli baseline ner bert-crf --help
Usage: langml baseline ner bert-crf [OPTIONS]
Options:
--backbone TEXT specify backbone: bert | roberta | albert
--epoch INTEGER epochs
--batch_size INTEGER batch size
--learning_rate FLOAT learning rate
--dropout_rate FLOAT dropout rate
--max_len INTEGER max len
--lowercase do lowercase
--tokenizer_type TEXT specify tokenizer type from [`wordpiece`,
`sentencepiece`]
--config_path TEXT bert config path [required]
--ckpt_path TEXT bert checkpoint path [required]
--vocab_path TEXT bert vocabulary path [required]
--train_path TEXT train path [required]
--dev_path TEXT dev path [required]
--test_path TEXT test path
--save_dir TEXT dir to save model [required]
--monitor TEXT monitor for keras callback
--early_stop INTEGER patience to early stop
--verbose INTEGER 0 = silent, 1 = progress bar, 2 = one line per
epoch
--distributed_training distributed training
--distributed_strategy TEXT distributed training strategy
--help Show this message and exit.
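For example (paths are placeholders; the data files follow the tab-separated format described above):

$ langml-cli baseline ner bert-crf \
    --backbone bert \
    --epoch 10 \
    --batch_size 16 \
    --max_len 128 \
    --config_path /path/to/bert_config.json \
    --ckpt_path /path/to/bert_model.ckpt \
    --vocab_path /path/to/vocab.txt \
    --train_path /path/to/train.txt \
    --dev_path /path/to/dev.txt \
    --save_dir /path/to/save_dir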
LSTM-CRF
$ langml-cli baseline ner lstm-crf --help
Usage: langml baseline ner lstm-crf [OPTIONS]
Options:
--epoch INTEGER epochs
--batch_size INTEGER batch size
--learning_rate FLOAT learning rate
--dropout_rate FLOAT dropout rate
--embedding_size INTEGER embedding size
--hidden_size INTEGER hidden size
--max_len INTEGER max len
--lowercase do lowercase
--tokenizer_type TEXT specify tokenizer type from [`wordpiece`,
`sentencepiece`]
--vocab_path TEXT vocabulary path [required]
--train_path TEXT train path [required]
--dev_path TEXT dev path [required]
--test_path TEXT test path
--save_dir TEXT dir to save model [required]
--monitor TEXT monitor for keras callback
--early_stop INTEGER patience to early stop
--verbose INTEGER 0 = silent, 1 = progress bar, 2 = one line per
epoch
--distributed_training distributed training
--distributed_strategy TEXT distributed training strategy
--help Show this message and exit.
Pretrained Language Models
langml.plm.load_albert(config_path: str, checkpoint_path: str, seq_len: Optional[int] = None, pretraining: bool = False, with_mlm: bool = True, with_nsp: bool = True, lazy_restore: bool = False, weight_prefix: Optional[str] = None, dropout_rate: float = 0.0, **kwargs) -> Union[Tuple[Models, Callable], Tuple[Models, Callable, Callable]]
Load and restore an ALBERT model.
Args:
- config_path: configuration path, str.
- checkpoint_path: checkpoint path, str.
- seq_len: sequence length, int.
- pretraining: pretraining mode, bool. Set it to True to continue pretraining a language model.
- with_mlm: use the Masked Language Model task, bool. This argument takes effect when pretraining=True.
- with_nsp: apply the Next Sentence Prediction task, bool. This argument takes effect when pretraining=True.
- lazy_restore: lazily restore the pretrained weights, bool. Set it to True when applying a distributed training strategy; the function will then return one additional callback.
- weight_prefix: prefix to prepend to weight names, Optional[str]. For an unshared two-tower / multi-tower model, set a different prefix for each tower.
- dropout_rate: dropout rate, float.
Return:
- model: an instance of keras.Model
- bert: an instance of BERT
- restore_weight_callback: a callback function that restores the model weights. It is returned only when lazy_restore=True.
Examples: refer to the load_bert examples below.
langml.plm.load_bert(config_path: str, checkpoint_path: str, seq_len: Optional[int] = None, pretraining: bool = False, with_mlm: bool = True, with_nsp: bool = True, lazy_restore: bool = False, weight_prefix: Optional[str] = None, dropout_rate: float = 0.0, **kwargs) -> Union[Tuple[Models, Callable], Tuple[Models, Callable, Callable]]
Load and restore a BERT/RoBERTa model.
Args:
- config_path: configuration path, str.
- checkpoint_path: checkpoint path, str.
- seq_len: sequence length, int.
- pretraining: pretraining mode, bool. Set it to True to continue pretraining a language model.
- with_mlm: use the Masked Language Model task, bool. This argument takes effect when pretraining=True.
- with_nsp: apply the Next Sentence Prediction task, bool. This argument takes effect when pretraining=True.
- lazy_restore: lazily restore the pretrained weights, bool. Set it to True when applying a distributed training strategy; the function will then return one additional callback.
- weight_prefix: prefix to prepend to weight names, Optional[str]. For an unshared two-tower / multi-tower model, set a different prefix for each tower.
- dropout_rate: dropout rate, float.
Return:
- model: an instance of keras.Model
- bert: an instance of BERT
- restore_weight_callback: a callback function that restores the model weights. It is returned only when lazy_restore=True.
Examples:
1. Finetune a model
import keras
import keras.layers as L
from langml.plm import load_bert

num_labels = 2  # illustrative label size

bert_model, bert = load_bert(
    config_path='/path/to/bert_config.json',
    checkpoint_path='/path/to/bert_model.ckpt'
)
# take the CLS representation
CLS = L.Lambda(lambda x: x[:, 0])(bert_model.output)
output = L.Dense(num_labels,
                 kernel_initializer=bert.initializer,
                 activation='softmax')(CLS)
train_model = keras.Model(bert_model.input, output)
train_model.summary()
train_model.compile(keras.optimizers.Adam(1e-5),
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
2. Finetune a model with distributed training
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers as L
from langml.plm import load_bert

# distributed training assumes the tf.keras variant (export TF_KERAS=1)
num_labels = 2  # illustrative label size

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    bert_model, bert, restore_weight_callback = load_bert(
        config_path='/path/to/bert_config.json',
        checkpoint_path='/path/to/bert_model.ckpt',
        lazy_restore=True
    )
    CLS = L.Lambda(lambda x: x[:, 0])(bert_model.output)
    output = L.Dense(num_labels,
                     kernel_initializer=bert.initializer,
                     activation='softmax')(CLS)
    train_model = keras.Model(bert_model.input, output)
    train_model.summary()
    train_model.compile(keras.optimizers.Adam(1e-5),
                        loss='sparse_categorical_crossentropy',
                        metrics=['accuracy'])
    # restore the pretrained weights after compile
    restore_weight_callback(bert_model)
3. Continue to pretrain a language model
import keras
from langml.plm import load_bert

bert_model, bert = load_bert(
    config_path='/path/to/bert_config.json',
    checkpoint_path='/path/to/bert_model.ckpt',
    pretraining=True,
    dropout_rate=0.2
)
bert_model.summary()
bert_model.compile(keras.optimizers.Adam(1e-5),
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
4. Finetune a two-tower model with shared weights
import keras
import keras.layers as L
from langml.plm import load_bert

num_labels = 2  # illustrative label size

# left tower: use the default input placeholders
bert_model, bert = load_bert(
    config_path='/path/to/bert_config.json',
    checkpoint_path='/path/to/bert_model.ckpt',
)
# CLS representation of the left tower
left_output = L.Lambda(lambda x: x[:, 0])(bert_model.output)
# right tower: define its own input placeholders
right_token_in = L.Input(shape=(None, ), name='Right-Input-Token')
right_segment_in = L.Input(shape=(None, ), name='Right-Input-Segment')
# outputs of the right tower (weights are shared with the left tower)
right_output = bert(inputs=[right_token_in, right_segment_in], return_model=False)
right_output = L.Lambda(lambda x: x[:, 0])(right_output)
# matching operation; your_matching_layer is a placeholder for your own function
matching = L.Lambda(your_matching_layer)([left_output, right_output])
# output
output = L.Dense(num_labels)(matching)
train_model = keras.Model(inputs=[*bert_model.input, right_token_in, right_segment_in],
                          outputs=[output])
train_model.summary()
train_model.compile(keras.optimizers.Adam(1e-5),
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
5. Finetune a two-tower model with unshared weights
import keras
import keras.layers as L
from langml.plm import load_bert

num_labels = 2  # illustrative label size

# left tower
left_bert_model, _ = load_bert(
    config_path='/path/to/bert_config.json',
    checkpoint_path='/path/to/bert_model.ckpt',
    weight_prefix='Left'
)
# CLS representation of the left tower
left_output = L.Lambda(lambda x: x[:, 0])(left_bert_model.output)
# right tower
right_bert_model, _ = load_bert(
    config_path='/path/to/bert_config.json',
    checkpoint_path='/path/to/bert_model.ckpt',
    weight_prefix='Right'
)
# CLS representation of the right tower
right_output = L.Lambda(lambda x: x[:, 0])(right_bert_model.output)
# matching operation; your_matching_layer is a placeholder for your own function
matching = L.Lambda(your_matching_layer)([left_output, right_output])
# output
output = L.Dense(num_labels)(matching)
train_model = keras.Model(inputs=[*left_bert_model.input, *right_bert_model.input],
                          outputs=[output])
train_model.summary()
train_model.compile(keras.optimizers.Adam(1e-5),
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
Tokenizers
langml.tokenizer.WPTokenizer(vocab_path: str, lowercase: bool = False)
Load a WordPiece tokenizer.
Examples:
from langml.tokenizer import WPTokenizer
tokenizer = WPTokenizer('/path/to/vocab.txt')
text = 'hello world'
tokenized = tokenizer.encode(text)
print("token_ids:", tokenized.ids)
print("segment_ids:", tokenized.segment_ids)
langml.tokenizer.SPTokenizer
Load a SentencePiece tokenizer.
Examples:
from langml.tokenizer import SPTokenizer
tokenizer = SPTokenizer('/path/to/vocab.model')
text = 'hello world'
tokenized = tokenizer.encode(text)
print("token_ids:", tokenized.ids)
print("segment_ids:", tokenized.segment_ids)
Keras Layers
langml.layers.CRF(output_dim: int, sparse_target: bool = True, **kwargs)
Args:
- output_dim: output dimension, int. It is usually equal to the tag size.
- sparse_target: whether targets are sparse, bool. Set it to True when targets are encoded as integer indices rather than one-hot vectors.
Return:
- Tensor
Examples:
import keras
import keras.layers as L
from langml.layers import CRF

vocab_size = 10000  # illustrative vocabulary size
num_labels = 10
embedding_size = 100
hidden_size = 128

# define a CRF layer
crf = CRF(num_labels)

model = keras.Sequential()
model.add(L.Embedding(vocab_size, embedding_size))
model.add(L.LSTM(hidden_size, return_sequences=True))
model.add(L.Dense(num_labels))
model.add(crf)
model.summary()
model.compile('adam', loss=crf.loss, metrics=[crf.accuracy])
langml.layers.SelfAttention(attention_units: Optional[int] = None, return_attention: bool = False, is_residual: bool = False, attention_activation: Activation = 'relu', attention_epsilon: float = 1e10, kernel_initializer: Initializer = 'glorot_normal', kernel_regularizer: Optional[Regularizer] = None, kernel_constraint: Optional[Constraint] = None, bias_initializer: Union[Initializer, str] = 'zeros', bias_regularizer: Optional[Regularizer] = None, bias_constraint: Optional[Constraint] = None, use_attention_bias: bool = True, attention_penalty_weight: float = 0.0, **kwargs)
Examples:
import keras
import keras.layers as L
from langml.layers import SelfAttention

vocab_size = 10000  # illustrative vocabulary size
num_labels = 10
embedding_size = 100
hidden_size = 128

model = keras.Sequential()
model.add(L.Embedding(vocab_size, embedding_size))
model.add(L.LSTM(hidden_size, return_sequences=True))
model.add(SelfAttention())
model.add(L.Dense(num_labels))
model.summary()
model.compile('adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
langml.layers.SelfAdditiveAttention(attention_units: Optional[int] = None, return_attention: bool = False, is_residual: bool = False, attention_activation: Activation = 'relu', attention_epsilon: float = 1e10, kernel_initializer: Initializer = 'glorot_normal', kernel_regularizer: Optional[Regularizer] = None, kernel_constraint: Optional[Constraint] = None, bias_initializer: Initializer = 'zeros', bias_regularizer: Optional[Regularizer] = None, bias_constraint: Optional[Constraint] = None, use_attention_bias: bool = True, attention_penalty_weight: float = 0.0, **kwargs)
langml.layers.ScaledDotProductAttention(return_attention: bool = False, history_only: bool = False, **kwargs)
langml.layers.MultiHeadAttention(head_num: int, return_attention: bool = False, attention_activation: Activation = 'relu', kernel_initializer: Initializer = 'glorot_normal', kernel_regularizer: Optional[Regularizer] = None, kernel_constraint: Optional[Constraint] = None, bias_initializer: Initializer = 'zeros', bias_regularizer: Optional[Regularizer] = None, bias_constraint: Optional[Constraint] = None, use_attention_bias: Optional[bool] = True, **kwargs)
langml.layers.LayerNorm(center: bool = True, scale: bool = True, epsilon: float = 1e-7, gamma_initializer: Initializer = 'ones', gamma_regularizer: Optional[Regularizer] = None, gamma_constraint: Optional[Constraint] = None, beta_initializer: Initializer = 'zeros', beta_regularizer: Optional[Regularizer] = None, beta_constraint: Optional[Constraint] = None, **kwargs)
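No standalone examples are given for the attention and normalization layers above, so here is a minimal sketch of a Transformer-style block. The sizes are illustrative, and feeding MultiHeadAttention a single tensor for self-attention is an assumption based on the signatures above:

import keras
import keras.layers as L
from langml.layers import MultiHeadAttention, LayerNorm

max_len, vocab_size, embedding_size = 128, 10000, 256  # illustrative sizes

# self-attention + residual connection + layer normalization
x_in = L.Input(shape=(max_len,), name='Input-Token')
x = L.Embedding(vocab_size, embedding_size)(x_in)
attn = MultiHeadAttention(head_num=8)(x)  # assumption: a single input tensor yields self-attention
x = LayerNorm()(L.Add()([x, attn]))
model = keras.Model(x_in, x)
model.summary()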
Save Model
langml.model.save_frozen(model: Models, fpath: str)
Freeze a model to a TensorFlow protobuf (pb) file.
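For example, to freeze a trained model (the output path is a placeholder):

from langml.model import save_frozen

# train_model is any compiled/trained keras.Model, e.g. from the examples above
save_frozen(train_model, '/path/to/frozen_model.pb')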
Reference
The implementation of the pretrained language models is inspired by CyberZHG/keras-bert and bojone/bert4keras.