A Keras-based and TensorFlow-backend language model toolkit.

LangML (Language ModeL) is a Keras-based language model toolkit with a TensorFlow backend. It provides mainstream pre-trained language models, e.g., BERT/RoBERTa/ALBERT, and their downstream application models.



  • Common and widely used Keras layers: CRF, Attentions, Transformer
  • Pretrained language models: BERT, RoBERTa, ALBERT. The interfaces are designed to make it easy to build downstream single-tower, shared/unshared two-tower, or multi-tower models.
  • Tokenizers: WPTokenizer (WordPiece), SPTokenizer (SentencePiece)
  • Baseline models: Text Classification, Named Entity Recognition. You don't need to write any code to train the baselines: preprocess the data into the required format and use "langml-cli" to train the model.


You can install or upgrade langml/langml-cli via the following command:

pip install -U langml


Keras Variants

LangML supports both keras and tf.keras. Set the TF_KERAS environment variable to choose the Keras variant.

export TF_KERAS=0 # use keras

export TF_KERAS=1 # use tf.keras
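The same switch can be set from Python instead of the shell. The sketch below assumes LangML reads TF_KERAS when it is first imported, so the variable has to be set before that import; the helper name `select_keras_variant` is hypothetical and not part of LangML.

```python
import os


def select_keras_variant(use_tf_keras: bool) -> str:
    """Set the TF_KERAS environment variable and return the value that was set."""
    value = "1" if use_tf_keras else "0"
    os.environ["TF_KERAS"] = value
    return value


# Choose tf.keras. Do this before the first `import langml`
# (assumption: the variable is read at import time).
selected = select_keras_variant(use_tf_keras=True)
print("TF_KERAS =", selected)
```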

NLP Baseline Models

You can train various baseline models using "langml-cli".


$ langml-cli --help
Usage: langml [OPTIONS] COMMAND [ARGS]...

  LangML client

  --version  Show the version and exit.
  --help     Show this message and exit.

  baseline  LangML Baseline client

Text Classification

Prepare your data in JSON Lines format, with a text field and a label field on each line, for example:

{"text": "this is sentence1", "label": "label1"}
{"text": "this is sentence2", "label": "label2"}
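If your data lives in Python objects, a file in this format could be produced along the following lines. This is a minimal sketch using only the standard library; the helper name `to_jsonl` and the output file name are hypothetical.

```python
import json
from typing import Dict, Iterable, List


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialize records to JSON Lines, one object with `text` and `label` per line."""
    lines: List[str] = []
    for i, record in enumerate(records):
        missing = {"text", "label"} - record.keys()
        if missing:
            raise ValueError(f"record {i} is missing fields: {sorted(missing)}")
        # ensure_ascii=False keeps non-English text readable in the file
        row = {"text": record["text"], "label": record["label"]}
        lines.append(json.dumps(row, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


samples = [
    {"text": "this is sentence1", "label": "label1"},
    {"text": "this is sentence2", "label": "label2"},
]
jsonl = to_jsonl(samples)
print(jsonl, end="")
# To save it (hypothetical file name):
# open("train.jsonl", "w", encoding="utf-8").write(jsonl)
```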


$ langml-cli baseline clf bert --help
Usage: langml baseline clf bert [OPTIONS]

  --backbone TEXT              specify backbone: bert | roberta | albert
  --epoch INTEGER              epochs
  --batch_size INTEGER         batch size
  --learning_rate FLOAT        learning rate
  --max_len INTEGER            max len
  --lowercase                  do lowercase
  --tokenizer_type TEXT        specify tokenizer type from [`wordpiece`,

  --monitor TEXT               monitor for keras callback
  --early_stop INTEGER         patience to early stop
  --use_micro                  whether to use micro metrics
  --config_path TEXT           bert config path  [required]
  --ckpt_path TEXT             bert checkpoint path  [required]
  --vocab_path TEXT            bert vocabulary path  [required]
  --train_path TEXT            train path  [required]
  --dev_path TEXT              dev path  [required]
  --test_path TEXT             test path
  --save_dir TEXT              dir to save model  [required]
  --verbose INTEGER            0 = silent, 1 = progress bar, 2 = one line per

  --distributed_training       distributed training
  --distributed_strategy TEXT  distributed training strategy
  --help                       Show this message and exit.
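As an illustration of how these options fit together, the sketch below assembles a training command from Python. The flag names come from the help text above; every value (paths, epoch count, batch size) is a hypothetical placeholder, and the helper `build_clf_bert_command` is not part of LangML.

```python
import shlex
from typing import Dict, List


def build_clf_bert_command(options: Dict[str, object]) -> List[str]:
    """Build an argument list for `langml-cli baseline clf bert`."""
    command = ["langml-cli", "baseline", "clf", "bert"]
    for name, value in options.items():
        if value is True:
            command.append(f"--{name}")  # boolean flag such as --lowercase
        elif value is not False and value is not None:
            command.extend([f"--{name}", str(value)])
    return command


command = build_clf_bert_command({
    "backbone": "bert",
    "epoch": 5,                 # hypothetical value
    "batch_size": 32,           # hypothetical value
    "config_path": "/path/to/bert_config.json",
    "ckpt_path": "/path/to/bert_model.ckpt",
    "vocab_path": "/path/to/vocab.txt",
    "train_path": "/path/to/train.jsonl",
    "dev_path": "/path/to/dev.jsonl",
    "save_dir": "/path/to/save_dir",
    "lowercase": True,
})
print(shlex.join(command))
# To launch training: subprocess.run(command, check=True)
```

Only the options marked [required] in the help text have to be present; the rest fall back to the CLI defaults.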


$ langml-cli baseline clf bilstm --help
Usage: langml baseline clf bilstm [OPTIONS]

  --epoch INTEGER              epochs
  --batch_size INTEGER         batch size
  --learning_rate FLOAT        learning rate
  --embedding_size INTEGER     embedding size
  --hidden_size INTEGER        hidden size of lstm
  --max_len INTEGER            max len
  --lowercase                  do lowercase
  --tokenizer_type TEXT        specify tokenizer type from [`wordpiece`,

  --monitor TEXT               monitor for keras callback
  --early_stop INTEGER         patience to early stop
  --use_micro                  whether to use micro metrics
  --vocab_path TEXT            vocabulary path  [required]
  --train_path TEXT            train path  [required]
  --dev_path TEXT              dev path  [required]
  --test_path TEXT             test path
  --save_dir TEXT              dir to save model  [required]
  --verbose INTEGER            0 = silent, 1 = progress bar, 2 = one line per

  --with_attention             apply attention mechanism
  --distributed_training       distributed training
  --distributed_strategy TEXT  distributed training strategy
  --help                       Show this message and exit.


$ langml-cli baseline clf textcnn --help
Usage: langml baseline clf textcnn [OPTIONS]

  --epoch INTEGER              epochs
  --batch_size INTEGER         batch size
  --learning_rate FLOAT        learning rate
  --embedding_size INTEGER     embedding size
  --filter_size INTEGER        filter size of convolution
  --max_len INTEGER            max len
  --lowercase                  do lowercase
  --tokenizer_type TEXT        specify tokenizer type from [`wordpiece`,

  --monitor TEXT               monitor for keras callback
  --early_stop INTEGER         patience to early stop
  --use_micro                  whether to use micro metrics
  --vocab_path TEXT            vocabulary path  [required]
  --train_path TEXT            train path  [required]
  --dev_path TEXT              dev path  [required]
  --test_path TEXT             test path
  --save_dir TEXT              dir to save model  [required]
  --verbose INTEGER            0 = silent, 1 = progress bar, 2 = one line per

  --distributed_training       distributed training
  --distributed_strategy TEXT  distributed training strategy
  --help                       Show this message and exit.

Named Entity Recognition

Prepare your data in the following format: within a sentence, put each entity segment and its entity type on one line separated by \t, and separate sentences with \n\n.

An English example:

I like    O
apples  Fruit

I like    O
pineapples  Fruit

A Chinese example:

我来自  O
中国    LOC

我住在  O
上海    LOC
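The format described above can be written and read back with a few lines of Python. This is a minimal sketch of the tab-separated, blank-line-delimited layout; the helper names `dump_ner` and `parse_ner` are hypothetical and not part of LangML.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def dump_ner(sentences: List[Sentence]) -> str:
    """Write sentences as `segment<TAB>type` lines, separated by a blank line."""
    blocks = ["\n".join(f"{segment}\t{label}" for segment, label in sentence)
              for sentence in sentences]
    return "\n\n".join(blocks) + "\n"


def parse_ner(data: str) -> List[Sentence]:
    """Parse the format back into a list of (segment, type) pairs per sentence."""
    sentences: List[Sentence] = []
    for block in data.strip().split("\n\n"):
        if not block.strip():
            continue  # skip empty input
        pairs = []
        for line in block.splitlines():
            # split on the last tab so that segments may contain spaces
            segment, label = line.rsplit("\t", 1)
            pairs.append((segment, label))
        sentences.append(pairs)
    return sentences


sentences = [
    [("I like", "O"), ("apples", "Fruit")],
    [("我来自", "O"), ("中国", "LOC")],
]
data = dump_ner(sentences)
print(data, end="")
print(parse_ner(data) == sentences)
```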


$ langml-cli baseline ner bert-crf --help
Usage: langml baseline ner bert-crf [OPTIONS]

  --backbone TEXT              specify backbone: bert | roberta | albert
  --epoch INTEGER              epochs
  --batch_size INTEGER         batch size
  --learning_rate FLOAT        learning rate
  --dropout_rate FLOAT         dropout rate
  --max_len INTEGER            max len
  --lowercase                  do lowercase
  --tokenizer_type TEXT        specify tokenizer type from [`wordpiece`,

  --config_path TEXT           bert config path  [required]
  --ckpt_path TEXT             bert checkpoint path  [required]
  --vocab_path TEXT            bert vocabulary path  [required]
  --train_path TEXT            train path  [required]
  --dev_path TEXT              dev path  [required]
  --test_path TEXT             test path
  --save_dir TEXT              dir to save model  [required]
  --monitor TEXT               monitor for keras callback
  --early_stop INTEGER         patience to early stop
  --verbose INTEGER            0 = silent, 1 = progress bar, 2 = one line per

  --distributed_training       distributed training
  --distributed_strategy TEXT  distributed training strategy
  --help                       Show this message and exit.


$ langml-cli baseline ner lstm-crf --help
Usage: langml baseline ner lstm-crf [OPTIONS]

  --epoch INTEGER              epochs
  --batch_size INTEGER         batch size
  --learning_rate FLOAT        learning rate
  --dropout_rate FLOAT         dropout rate
  --embedding_size INTEGER     embedding size
  --hidden_size INTEGER        hidden size
  --max_len INTEGER            max len
  --lowercase                  do lowercase
  --tokenizer_type TEXT        specify tokenizer type from [`wordpiece`,

  --vocab_path TEXT            vocabulary path  [required]
  --train_path TEXT            train path  [required]
  --dev_path TEXT              dev path  [required]
  --test_path TEXT             test path
  --save_dir TEXT              dir to save model  [required]
  --monitor TEXT               monitor for keras callback
  --early_stop INTEGER         patience to early stop
  --verbose INTEGER            0 = silent, 1 = progress bar, 2 = one line per

  --distributed_training       distributed training
  --distributed_strategy TEXT  distributed training strategy
  --help                       Show this message and exit.

Pretrained Language Models

langml.plm.load_albert(config_path: str, checkpoint_path: str, seq_len: Optional[int] = None, pretraining: bool = False, with_mlm: bool = True, with_nsp: bool = True, lazy_restore: bool = False, weight_prefix: Optional[str] = None, dropout_rate: float = 0.0, **kwargs) -> Union[Tuple[Models, Callable], Tuple[Models, Callable, Callable]]

Load and restore an ALBERT model.

Args:

  • config_path: config file path, str.
  • checkpoint_path: checkpoint path, str.
  • seq_len: sequence length, int.
  • pretraining: pretraining mode, bool. Set it to True to continue pretraining a language model.
  • with_mlm: whether to use the Masked Language Model task, bool. This argument only takes effect when pretraining=True.
  • with_nsp: whether to use the Next Sentence Prediction task, bool. This argument only takes effect when pretraining=True.
  • lazy_restore: whether to lazily restore the pretrained model weights, bool. Set it to True when using a distributed training strategy; the function then returns one more callback function.
  • weight_prefix: prefix added to weight names, Optional[str]. For an unshared two-tower / multi-tower model, you can give each tower a different prefix.
  • dropout_rate: dropout rate, float.

Returns:

  • model: an instance of keras.Model
  • bert: an instance of BERT
  • restore_weight_callback: a callback function that restores the model weights. It is returned only when lazy_restore=True.

Examples: refer to the load_bert examples.

langml.plm.load_bert(config_path: str, checkpoint_path: str, seq_len: Optional[int] = None, pretraining: bool = False, with_mlm: bool = True, with_nsp: bool = True, lazy_restore: bool = False, weight_prefix: Optional[str] = None, dropout_rate: float = 0.0, **kwargs) -> Union[Tuple[Models, Callable], Tuple[Models, Callable, Callable]]

Load and restore a BERT/RoBERTa model.

Args:

  • config_path: config file path, str.
  • checkpoint_path: checkpoint path, str.
  • seq_len: sequence length, int.
  • pretraining: pretraining mode, bool. Set it to True to continue pretraining a language model.
  • with_mlm: whether to use the Masked Language Model task, bool. This argument only takes effect when pretraining=True.
  • with_nsp: whether to use the Next Sentence Prediction task, bool. This argument only takes effect when pretraining=True.
  • lazy_restore: whether to lazily restore the pretrained model weights, bool. Set it to True when using a distributed training strategy; the function then returns one more callback function.
  • weight_prefix: prefix added to weight names, Optional[str]. For an unshared two-tower / multi-tower model, you can give each tower a different prefix.
  • dropout_rate: dropout rate, float.

Returns:

  • model: an instance of keras.Model
  • bert: an instance of BERT
  • restore_weight_callback: a callback function that restores the model weights. It is returned only when lazy_restore=True.


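1. finetune a model
import keras
import keras.layers as L
from langml.plm import load_bert

num_labels = 2  # hypothetical number of target classes

bert_model, bert = load_bert(
    config_path='/path/to/bert_config.json',
    checkpoint_path='/path/to/bert_model.ckpt',
)

# take the [CLS] representation (first token) as the sentence vector
CLS = L.Lambda(lambda x: x[:, 0])(bert_model.output)
# classification head; the softmax activation is an assumption
output = L.Dense(num_labels, activation='softmax')(CLS)

train_model = keras.Model(bert_model.input, output)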
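2. finetune a model under distributed training
# tf.distribute is used here, so this sketch assumes TF_KERAS=1 (tf.keras)
import tensorflow as tf
from tensorflow import keras
from langml.plm import load_bert

L = keras.layers
num_labels = 2  # hypothetical number of target classes
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # lazy_restore=True defers weight loading and returns an extra callback
    bert_model, bert, restore_weight_callback = load_bert(
        config_path='/path/to/bert_config.json',
        checkpoint_path='/path/to/bert_model.ckpt',
        lazy_restore=True,
    )

    CLS = L.Lambda(lambda x: x[:, 0])(bert_model.output)
    # classification head; the softmax activation is an assumption
    output = L.Dense(num_labels, activation='softmax')(CLS)

    train_model = keras.Model(bert_model.input, output)
    # optimizer and loss are placeholders; use your own
    train_model.compile('adam', loss='sparse_categorical_crossentropy')
    # restore weights after compile (argument assumed to be the loaded BERT model)
    restore_weight_callback(bert_model)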
3. continue to pretrain a language model
from langml.plm import load_bert

# pretraining=True builds the model in pretraining mode;
# with_mlm and with_nsp both default to True
bert_model, bert = load_bert(
    config_path='/path/to/bert_config.json',
    checkpoint_path='/path/to/bert_model.ckpt',
    pretraining=True,
)

4. finetune a two-tower model with shared weights
import keras
import keras.layers as L
from langml.plm import load_bert

# left tower
# use the default input placeholder
bert_model, bert = load_bert(
    config_path='/path/to/bert_config.json',
    checkpoint_path='/path/to/bert_model.ckpt',
)
# CLS representation
left_output = L.Lambda(lambda x: x[:, 0])(bert_model.output)

# right tower
# inputs of right tower
right_token_in = L.Input(shape=(None, ), name='Right-Input-Token')
right_segment_in = L.Input(shape=(None, ), name='Right-Input-Segment')

# outputs of right tower; calling `bert` again reuses the same weights
right_output = bert(inputs=[right_token_in, right_segment_in], return_model=False)
right_output = L.Lambda(lambda x: x[:, 0])(right_output)

# matching operation (`your_matching_layer` and `num_labels` are defined by you)
matching = L.Lambda(your_matching_layer)([left_output, right_output])

# output
output = L.Dense(num_labels)(matching)
train_model = keras.Model(inputs=(*bert_model.input, right_token_in, right_segment_in),
                          outputs=[output])

5. finetune a two-tower model with unshared weights
import keras
import keras.layers as L
from langml.plm import load_bert

# left tower
left_bert_model, _ = load_bert(
    config_path='/path/to/bert_config.json',
    checkpoint_path='/path/to/bert_model.ckpt',
    weight_prefix='Left',
)
# CLS representation
left_output = L.Lambda(lambda x: x[:, 0])(left_bert_model.output)

# right tower
right_bert_model, _ = load_bert(
    config_path='/path/to/bert_config.json',
    checkpoint_path='/path/to/bert_model.ckpt',
    weight_prefix='Right',
)
# CLS representation
right_output = L.Lambda(lambda x: x[:, 0])(right_bert_model.output)

# matching operation (`your_matching_layer` and `num_labels` are defined by you)
matching = L.Lambda(your_matching_layer)([left_output, right_output])

# output
output = L.Dense(num_labels)(matching)
# each tower has its own inputs
train_model = keras.Model(inputs=(*left_bert_model.input, *right_bert_model.input),
                          outputs=[output])



langml.tokenizer.WPTokenizer(vocab_path: str, lowercase: bool = False)

Load WordPiece Tokenizer

Examples:
from langml.tokenizer import WPTokenizer

tokenizer = WPTokenizer('/path/to/vocab.txt')

text = 'hello world'
tokenized = tokenizer.encode(text)

print("token_ids:", tokenized.ids)
print("segment_ids:", tokenized.segment_ids)


Load SentencePiece Tokenizer

Examples:
from langml.tokenizer import SPTokenizer

tokenizer = SPTokenizer('/path/to/vocab.model')

text = 'hello world'
tokenized = tokenizer.encode(text)

print("token_ids:", tokenized.ids)
print("segment_ids:", tokenized.segment_ids)

Keras Layers

langml.layers.CRF(output_dim: int, sparse_target: bool = True, **kwargs)


Args:

  • output_dim: output dimension, int. It is usually equal to the number of tags.
  • sparse_target: set sparse_target, bool. If the target is prepared as one-hot encoding, set this argument as True.


Returns:

  • Tensor


Examples:
import keras
import keras.layers as L
from langml.layers import CRF

num_labels = 10
embedding_size = 100
hidden_size = 128

# define a CRF layer
crf = CRF(num_labels)

model = keras.Sequential()
model.add(L.Embedding(num_labels, embedding_size))
model.add(L.LSTM(hidden_size, return_sequences=True))
# project to one score per tag, then add the CRF layer on top
model.add(L.Dense(num_labels))
model.add(crf)
model.compile('adam', loss=crf.loss, metrics=[crf.accuracy])

langml.layers.SelfAttention(attention_units: Optional[int] = None, return_attention: bool = False, is_residual: bool = False, attention_activation: Activation = 'relu', attention_epsilon: float = 1e10, kernel_initializer: Initializer = 'glorot_normal', kernel_regularizer: Optional[Regularizer] = None, kernel_constraint: Optional[Constraint] = None, bias_initializer: Union[Initializer, str] = 'zeros', bias_regularizer: Optional[Regularizer] = None, bias_constraint: Optional[Constraint] = None, use_attention_bias: bool = True, attention_penalty_weight: float = 0.0, **kwargs)


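Examples:
import keras
import keras.layers as L
from langml.layers import SelfAttention

num_labels = 10
embedding_size = 100
hidden_size = 128

model = keras.Sequential()
model.add(L.Embedding(num_labels, embedding_size))
model.add(L.LSTM(hidden_size, return_sequences=True))
# self-attention over the LSTM outputs (assumed to keep the sequence dimension)
model.add(SelfAttention())
model.add(L.Dense(num_labels, activation='softmax'))
model.compile('adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])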

langml.layers.SelfAdditiveAttention(attention_units: Optional[int] = None, return_attention: bool = False, is_residual: bool = False, attention_activation: Activation = 'relu', attention_epsilon: float = 1e10, kernel_initializer: Initializer = 'glorot_normal', kernel_regularizer: Optional[Regularizer] = None, kernel_constraint: Optional[Constraint] = None, bias_initializer: Initializer = 'zeros', bias_regularizer: Optional[Regularizer] = None, bias_constraint: Optional[Constraint] = None, use_attention_bias: bool = True, attention_penalty_weight: float = 0.0, **kwargs)

langml.layers.ScaledDotProductAttention(return_attention: bool = False, history_only: bool = False, **kwargs)
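For orientation, scaled dot-product attention computes softmax(QK^T / sqrt(d)) V. The sketch below is a plain-Python illustration of that formula for a single sequence, not LangML's implementation. It assumes that `history_only` means each position attends only to itself and earlier positions (a causal mask), which the signature alone does not state.

```python
import math
from typing import List

Matrix = List[List[float]]


def scaled_dot_product_attention(q: Matrix, k: Matrix, v: Matrix,
                                 history_only: bool = False) -> Matrix:
    """Compute softmax(q @ k^T / sqrt(d)) @ v for one sequence."""
    dim = len(q[0])
    output = []
    for i, query in enumerate(q):
        # similarity of this query with every key, scaled by sqrt(d)
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in k]
        if history_only:
            # assumed meaning: mask out positions after i
            scores = [s if j <= i else -math.inf for j, s in enumerate(scores)]
        # numerically stable softmax over the scores
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # weighted sum of the value vectors
        output.append([sum(w * row[c] for w, row in zip(weights, v))
                       for c in range(len(v[0]))])
    return output


x = [[1.0, 0.0], [0.0, 1.0]]
full = scaled_dot_product_attention(x, x, x)
causal = scaled_dot_product_attention(x, x, x, history_only=True)
print(full)
print(causal)
```

With the causal mask, the first position can only attend to itself, so its output equals its own value vector.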

langml.layers.MultiHeadAttention(head_num: int, return_attention: bool = False, attention_activation: Activation = 'relu', kernel_initializer: Initializer = 'glorot_normal', kernel_regularizer: Optional[Regularizer] = None, kernel_constraint: Optional[Constraint] = None, bias_initializer: Initializer = 'zeros', bias_regularizer: Optional[Regularizer] = None, bias_constraint: Optional[Constraint] = None, use_attention_bias: Optional[bool] = True, **kwargs)

langml.layers.LayerNorm(center: bool = True, scale: bool = True, epsilon: float = 1e-7, gamma_initializer: Initializer = 'ones', gamma_regularizer: Optional[Regularizer] = None, gamma_constraint: Optional[Constraint] = None, beta_initializer: Initializer = 'zeros', beta_regularizer: Optional[Regularizer] = None, beta_constraint: Optional[Constraint] = None, **kwargs)
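As a reference for what the `center`, `scale`, and `epsilon` arguments usually control, here is a plain-Python sketch of layer normalization over one feature vector. It is not LangML's implementation; in particular, adding `epsilon` inside the square root is an assumption, and the function name `layer_norm` is hypothetical.

```python
import math
from typing import List, Optional


def layer_norm(x: List[float], gamma: Optional[List[float]] = None,
               beta: Optional[List[float]] = None,
               epsilon: float = 1e-7) -> List[float]:
    """Normalize a vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    variance = sum((value - mean) ** 2 for value in x) / len(x)
    normalized = [(value - mean) / math.sqrt(variance + epsilon) for value in x]
    if gamma is not None:  # corresponds to scale=True
        normalized = [n * g for n, g in zip(normalized, gamma)]
    if beta is not None:  # corresponds to center=True
        normalized = [n + b for n, b in zip(normalized, beta)]
    return normalized


normalized = layer_norm([1.0, 2.0, 3.0])
print(normalized)
```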

Save Model

langml.model.save_frozen(model: Models, fpath: str)

Freeze a model to a TensorFlow pb file.


The implementation of the pretrained language models is inspired by CyberZHG/keras-bert and bojone/bert4keras.

