
A Keras-based, TensorFlow-backed language model toolkit.


LangML (Language ModeL) is a Keras-based, TensorFlow-backed language model toolkit that provides mainstream pre-trained language models, e.g., BERT/RoBERTa/ALBERT, and their downstream application models.



Features

  • Common and widely used Keras layers: CRF, attention, and Transformer layers (see the sketch after this list)
  • Pretrained language models: BERT, RoBERTa, and ALBERT. The interfaces are friendly designed, making it easy to implement downstream singleton, shared/unshared two-tower, or multi-tower models.
  • Tokenizers: WPTokenizer (wordpiece) and SPTokenizer (sentencepiece)
  • Baseline models: Text Classification, Named Entity Recognition, and Contrastive Learning. No code is required: just preprocess the data into a specific format and use langml-cli to train various baseline models.
  • Prompt-based tuning: PTuning
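
As a taste of the layer APIs, here is a hypothetical BiLSTM-CRF tagger. It is a sketch only: it assumes the CRF layer is importable from langml.layers and follows the keras-contrib convention of exposing companion loss and accuracy attributes; check the documentation for the exact import path and signature.

from langml import keras, L
from langml.layers import CRF  # assumption: import path

num_labels = 7  # e.g., the size of a BIO tag set
inputs = L.Input(shape=(None,), dtype='int32')
x = L.Embedding(input_dim=21128, output_dim=128, mask_zero=True)(inputs)
x = L.Bidirectional(L.LSTM(128, return_sequences=True))(x)
x = L.Dense(num_labels)(x)
crf = CRF(num_labels, sparse_target=True)  # assumption: constructor arguments
outputs = crf(x)

model = keras.Model(inputs, outputs)
# assumption: the layer carries its own loss / accuracy, keras-contrib style
model.compile(loss=crf.loss, optimizer='adam', metrics=[crf.accuracy])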

Installation

You can install or upgrade langml/langml-cli via the following command:

pip install -U langml

Quick Start

Set a Keras variant

  1. Use pure Keras (the default):
export TF_KERAS=0
  2. Use TensorFlow Keras:
export TF_KERAS=1
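
Since the variable is typically read when langml is first imported, it can also be set from Python, as long as that happens before the import (a minimal sketch; the shell exports above are the documented route):

import os

# must run before the first `import langml`: the backend is fixed at import time
os.environ['TF_KERAS'] = '1'  # '0' for pure Keras, '1' for tf.keras

from langml import keras  # now backed by tf.keras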

Load pretrained language models

from langml import WPTokenizer, SPTokenizer
from langml import load_bert, load_albert

config_path = '/path/to/bert_config.json'
checkpoint_path = '/path/to/bert_model.ckpt'
vocab_path = '/path/to/vocab.txt'      # wordpiece vocabulary
spm_path = '/path/to/spiece.model'     # sentencepiece model

# load a BERT / RoBERTa PLM
bert_model, bert = load_bert(config_path, checkpoint_path)
# load an ALBERT PLM
albert_model, albert = load_albert(config_path, checkpoint_path)
# load a wordpiece tokenizer
wp_tokenizer = WPTokenizer(vocab_path, lowercase=True)
# load a sentencepiece tokenizer
sp_tokenizer = SPTokenizer(spm_path, lowercase=True)
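
To sanity-check a loaded model, feed it a batch of token and segment ids. This is a minimal sketch assuming the keras-bert two-input convention of [token_ids, segment_ids]; the ids below spell "[CLS] this is [SEP]" in the standard BERT uncased vocabulary:

import numpy as np

token_ids = np.array([[101, 2023, 2003, 102]])
segment_ids = np.zeros_like(token_ids)

# hidden states for every token: (batch_size, seq_len, hidden_size)
hidden_states = bert_model.predict([token_ids, segment_ids])
print(hidden_states.shape)  # e.g., (1, 4, 768) for BERT-base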

Fine-tune a model

from langml import keras, L
from langml import load_bert

config_path = '/path/to/bert_config.json'
ckpt_path = '/path/to/bert_model.ckpt'
vocab_path = '/path/to/vocab.txt'

bert_model, bert_instance = load_bert(config_path, ckpt_path)
# get CLS representation
cls_output = L.Lambda(lambda x: x[:, 0])(bert_model.output)
output = L.Dense(2, activation='softmax',
                 kernel_initializer=bert_instance.initializer)(cls_output)
train_model = keras.Model(bert_model.input, output)
train_model.summary()
train_model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(1e-5))
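
Training then follows the standard Keras workflow. A minimal sketch with random toy data (real inputs would come from WPTokenizer; the two-input layout matches bert_model.input):

import numpy as np

batch_size, seq_len = 8, 32
token_ids = np.random.randint(1, 1000, size=(batch_size, seq_len))
segment_ids = np.zeros_like(token_ids)
# one-hot labels for the 2-way softmax head
labels = keras.utils.to_categorical(np.random.randint(0, 2, size=batch_size), num_classes=2)

train_model.fit([token_ids, segment_ids], labels, batch_size=4, epochs=2)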

Use langml-cli to train baseline models

  1. Text Classification
$ langml-cli baseline clf --help
  2. Named Entity Recognition
$ langml-cli baseline ner --help
  3. Contrastive Learning
$ langml-cli baseline contrastive --help
  4. Text Matching
$ langml-cli baseline matching --help

Prompt-Based Tuning

Use PTuning for text classification:

from langml.prompt import Template, PTuniningPrompt, PTuningForClassification
from langml.tokenizer import WPTokenizer

vocab_path = '/path/to/vocab.txt'

tokenizer = WPTokenizer(vocab_path, lowercase=True)

# 1. Define a template
template = Template(
    # the template tokens must exist in the vocabulary, and the mask token is required
    template=['it', 'was', '[MASK]', '.'],
    # the label tokens must also exist in the vocabulary
    label_tokens_map={
        'positive': ['good'],
        'negative': ['bad', 'terrible']
    },
    tokenizer=tokenizer
)

# 2. Define Prompt Model

bert_config_path = '/path/to/bert_config.json'
bert_ckpt_path = '/path/to/bert_model.ckpt'

prompt_model = PTuniningPrompt('bert', bert_config_path, bert_ckpt_path,
                               template, freeze_plm=False, learning_rate=5e-5, encoder='lstm')
prompt_classifier = PTuningForClassification(prompt_model, tokenizer)

# 3. Train and Infer

data = [('I do not like this food', 'negative'),
        ('I hate you', 'negative'),
        ('I like you', 'positive'),
        ('I like this food', 'positive')]

X = [d for d, _ in data]
y = [l for _, l in data]

prompt_classifier.fit(X, y, X, y, batch_size=2, epoch=50, model_path='best_model.weight')
# load pretrained model
# prompt_classifier.load('best_model.weight')
print("pred", prompt_classifier.predict('I hate you'))

Documentation

Please visit langml.readthedocs.io for the latest documentation.

Reference

The implementation of the pretrained language models is inspired by CyberZHG/keras-bert and bojone/bert4keras.
