
A Keras-based and TensorFlow-backend language model toolkit.


LangML (Language ModeL) is a Keras-based language model toolkit that runs on a TensorFlow backend. It provides mainstream pre-trained language models, such as BERT, RoBERTa, and ALBERT, together with models for their downstream applications.



Features

  • Common and widely-used Keras layers: CRF, Attentions, Transformer
  • Pretrained Language Models: BERT, RoBERTa, ALBERT. The interfaces are designed to make it easy to implement downstream singleton, shared/unshared two-tower, or multi-tower models.
  • Tokenizers: WPTokenizer (wordpiece), SPTokenizer (sentencepiece)
  • Baseline models: Text Classification, Named Entity Recognition. No code is required: preprocess the data into the expected format and use "langml-cli" to train the baseline models.
  • Prompt-Based Tuning: PTuning
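
To give a feel for what a wordpiece tokenizer such as the one listed above does, here is a minimal sketch of greedy longest-match-first subword splitting. It is an illustration only, not LangML's WPTokenizer: the function name `wordpiece_tokenize`, the `##` continuation prefix, and the toy vocabulary are assumptions made for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"un", "##aff", "##able", "play", "##ing"}

print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A real tokenizer also handles lowercasing, punctuation splitting, and special tokens, which this sketch leaves out.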

Installation

You can install or upgrade langml/langml-cli via the following command:

pip install -U langml

Quick Start

Fine-tune a model

from langml import keras, L
from langml.plm import load_bert

config_path = '/path/to/bert_config.json'
ckpt_path = '/path/to/bert_model.ckpt'
vocab_path = '/path/to/vocab.txt'

# load the pretrained BERT weights
bert_model, bert_instance = load_bert(config_path, ckpt_path)
# get CLS representation
cls_output = L.Lambda(lambda x: x[:, 0])(bert_model.output)
# two-class softmax head on top of the CLS representation
output = L.Dense(2, activation='softmax',
                 kernel_initializer=bert_instance.initializer)(cls_output)
# build the trainable model from the BERT inputs to the classification output
train_model = keras.Model(bert_model.input, output)
train_model.summary()
train_model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(1e-5))
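
Because the model above is compiled with categorical_crossentropy and ends in a 2-unit softmax, the training targets are expected to be one-hot vectors of length 2. The following is a minimal pure-Python sketch of that encoding; the helper name `to_one_hot` is hypothetical and not part of LangML. In practice `keras.utils.to_categorical` produces the same layout as an array.

```python
def to_one_hot(labels, num_classes):
    """Convert integer class ids into one-hot rows of length num_classes."""
    rows = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside [0, {num_classes})")
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


# Example: three samples with class ids 0 (first class) and 1 (second class).
targets = to_one_hot([0, 1, 1], num_classes=2)
print(targets)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```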

Use langml-cli to train baseline models

To train a BERT classifier, a single command is enough:

$ langml-cli baseline clf bert --backbone bert --config_path /path/to/bert_config.json --ckpt_path /path/to/bert_model.ckpt --vocab_path /path/to/vocab.txt --train_path /path/to/train.jsonl --dev_path /path/to/dev.jsonl --save_dir model --verbose 2
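
The command above reads its training and development data from .jsonl files. As a rough sketch of preparing such a file, the snippet below writes one JSON object per line. The field names "text" and "label" are an assumption made for this example; check the LangML documentation for the exact schema that langml-cli expects.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Assumed schema: one "text" field and one "label" field per example.
examples = [
    {"text": "I like this food", "label": "positive"},
    {"text": "I do not like this food", "label": "negative"},
]

out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.jsonl")
write_jsonl(train_path, examples)
print(read_jsonl(train_path))
```

The resulting path can then be passed to --train_path, with a second file written the same way for --dev_path.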

Prompt-Based Tuning

Use PTuning for text classification:

from langml.prompt import Template, PTuniningPrompt, PTuningForClassification
from langml.tokenizer import WPTokenizer

vocab_path = '/path/to/vocab.txt'

tokenizer = WPTokenizer(vocab_path, lowercase=True)

# 1. Define a template
template = Template(
    # template tokens must be defined in the vocabulary, and the mask token is required
    template=['it', 'was', '[MASK]', '.'],
    # label tokens must also be defined in the vocabulary
    label_tokens_map={
        'positive': ['good'],
        'negative': ['bad', 'terrible']
    },
    tokenizer=tokenizer
)

# 2. Define Prompt Model

bert_config_path = '/path/to/bert_config.json'
bert_ckpt_path = '/path/to/bert_model.ckpt'

prompt_model = PTuniningPrompt('bert', bert_config_path, bert_ckpt_path,
                               template, freeze_plm=False, learning_rate=5e-5, encoder='lstm')
prompt_classifier = PTuningForClassification(prompt_model, tokenizer)

# 3. Train and Infer

data = [('I do not like this food', 'negative'),
        ('I hate you', 'negative'),
        ('I like you', 'positive'),
        ('I like this food', 'positive')]

X = [d for d, _ in data]
y = [l for _, l in data]

prompt_classifier.fit(X, y, X, y, batch_size=2, epoch=50, model_path='best_model.weight')
# to reuse the saved weights later, load them instead of training again
# prompt_classifier.load('best_model.weight')
print("pred", prompt_classifier.predict('I hate you'))
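
The label_tokens_map in the template ties each class to one or more vocabulary tokens that may fill the [MASK] position. The sketch below illustrates the general idea of turning mask-position token probabilities into a class prediction. It is not LangML's implementation: the helper names, the choice of averaging over a label's tokens, and the probability values are all assumptions made for illustration.

```python
def score_labels(mask_token_probs, label_tokens_map):
    """Average the mask-position probability of each label's tokens."""
    scores = {}
    for label, tokens in label_tokens_map.items():
        probs = [mask_token_probs.get(token, 0.0) for token in tokens]
        scores[label] = sum(probs) / len(probs)
    return scores


def predict_label(mask_token_probs, label_tokens_map):
    """Return the label whose tokens receive the highest average probability."""
    scores = score_labels(mask_token_probs, label_tokens_map)
    return max(scores, key=scores.get)


label_tokens_map = {
    'positive': ['good'],
    'negative': ['bad', 'terrible'],
}

# Made-up probabilities a masked language model might assign at [MASK]
# for an input like "I hate you it was [MASK] ."
mask_token_probs = {'good': 0.10, 'bad': 0.50, 'terrible': 0.30, 'fine': 0.10}

print(predict_label(mask_token_probs, label_tokens_map))  # negative
```

Averaging keeps a label with many tokens from winning only because it has more of them; other aggregation choices, such as taking the maximum, are equally plausible.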

Documentation

See langml.readthedocs.io for the latest documentation.

Reference

The implementation of the pretrained language models is inspired by CyberZHG/keras-bert and bojone/bert4keras.

