
An NLP framework for composing models modularly


CCLM

Composable, Character-Level Models

What are the goals of the project?

  1. Modularity: Fine-tuning large language models is expensive. cclm seeks to decompose models into subcomponents that can be readily mixed and matched, allowing for a wider variety of sizes, architectures, and pretraining methods. Rather than fine-tuning a huge model on your own data, fit a smaller one on your dataset and combine it with off-the-shelf models.

  2. Character-level input: Many corpora used in pretraining are clean and typo-free, but much of the input in real-world applications isn't - leaving you at a disadvantage if your tokenization scheme isn't flexible enough. Using characters as input also makes it simple to define many 'heads' of a model with the same input space.

  3. Ease of use: It should be quick to get started and easy to deploy.

How does it work?

cclm aims to achieve these goals by making the model-building process composable. There are many ways to pretrain a model on text, countless corpora to train on, and each application has different needs.

cclm makes it possible to define a base input on which to build many different computational graphs, then combine them. For instance, if there is a standard, published cclm model trained with masked language modeling (MLM) on wikitext + bookcorpus, you might start with that and add a second 'tower' that uses the same base but is pretrained to extract entities from wiki-ner. By combining the two pretrained 'towers', you get a model with information from both tasks that you can then use as a starting point for your downstream model.

As the package matures, the goal is to make available many pretraining methods (starting with Masked Language Modeling) and to publish standard pretrained models (like huggingface/transformers, spacy, tensorflowhub, ...).

Basic concepts

The main output of a training job with cclm is a ComposedModel, which consists of a preprocessor that turns text into a vector[int], a base model that embeds that vector input, and one or more models that accept the output of the embedder. The ComposedModel concatenates the output from those models together to produce its final output.
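
To make that wiring concrete, here is a minimal plain-Keras sketch of the same structure. It only illustrates the idea (a shared base feeding two towers whose outputs are concatenated); it is not cclm's implementation, and all of the sizes are made up.

import tensorflow as tf

seq_len, vocab_size, emb_dim = 128, 100, 32                        # illustrative sizes
inp = tf.keras.Input(shape=(seq_len,), dtype="int32")              # the vector[int] from the preprocessor
embedded = tf.keras.layers.Embedding(vocab_size, emb_dim)(inp)     # the shared "base" embedder
tower_a = tf.keras.layers.Conv1D(64, 3, padding="same")(embedded)  # first pretrained tower
tower_b = tf.keras.layers.Conv1D(64, 3, padding="same")(embedded)  # second pretrained tower
out = tf.keras.layers.Concatenate(axis=-1)([tower_a, tower_b])     # (batch, seq_len, 64 + 64)
sketch = tf.keras.Model(inp, out)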

The package uses datasets and tokenizers from huggingface for a standard interface - but to fit models, you can pass a List[str] directly.
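
For example, a toy corpus can be as simple as a list of strings (the contents below are purely illustrative):

dataset = [
    "cclm composes character-level models",
    "real-world text often contains typos and odd formatting!!",
]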

To start, you need a Preprocessor. Currently, there is only an MLMPreprocessor that computes extra data at training time for its pretraining task, but that is subject to change.

from cclm.preprocessing import MLMPreprocessor

prep = MLMPreprocessor()  # set max_example_len to specify a maximum input length
prep.fit(dataset) # defines the characters the model knows about and the output tokens for MLM
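
As an optional sanity check, a fitted preprocessor can already turn raw text into the integer array the models consume, using the same string_to_array call that appears later in this README:

x = prep.string_to_array("an example input", prep.max_example_len)  # a vector[int] of length prep.max_example_len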

Once you have that, you can create a CCLMModelBase, which is the common base on which all the separate models will sit.

from cclm.models import CCLMModelBase

base = CCLMModelBase(preprocessor=prep)

The base doesn't need to be fit, as you can fit it while you do your first pretraining task.

Now you're ready to build your first model using a pretraining task (here, masked language modeling).

from cclm.pretraining import MaskedLanguagePretrainer

pretrainer = MaskedLanguagePretrainer(
    base=base,
    downsample_factor=16,  # how much we want to reduce the length of the input sequence
    n_strided_convs=4,  # how many strided conv layers we have. stride_len**n_strided_convs must == downsample_factor
)

pretrainer.fit(dataset, epochs=10)

The MaskedLanguagePretrainer defines a transformer model that uses strided convolutions to shorten the sequence before the transformer layer, then upsamples to restore the original length. Calling .fit() uses the MLMPreprocessor associated with the base to produce masked inputs and trains the model to identify the missing input token(s) using sampled_softmax loss or negative sampling.
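
The sketch below shows the downsample/upsample idea in plain Keras. It is not the actual MaskedLanguagePretrainer architecture, just the shape bookkeeping behind the stride_len**n_strided_convs == downsample_factor constraint, with made-up sizes.

import tensorflow as tf

seq_len, dim = 128, 64                              # illustrative sizes
stride_len, n_strided_convs = 2, 4
downsample_factor = stride_len ** n_strided_convs   # 16, matching the pretrainer above

x = inp = tf.keras.Input(shape=(seq_len, dim))
for _ in range(n_strided_convs):
    # each strided convolution shortens the sequence by a factor of stride_len
    x = tf.keras.layers.Conv1D(dim, kernel_size=3, strides=stride_len, padding="same")(x)
# ... transformer layers would operate here on the shorter sequence (seq_len // downsample_factor) ...
x = tf.keras.layers.UpSampling1D(size=downsample_factor)(x)  # restore the original length
sketch = tf.keras.Model(inp, x)                              # output shape: (None, seq_len, dim)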

Once you've trained one or more models with Pretrainer objects, you can compose them together into one model.

from cclm.models import ComposedModel

composed = ComposedModel(base, [pretrainer_a.model, pretrainer_b.model])

You can then use composed.model(x) to embed input

x = prep.string_to_array("cclm SURE is useful!!", prep.max_example_len)
emb = composed.model(x)   # has shape (1, prep.max_example_len, pretrainer_a_model_shape[-1]+pretrainer_b_model_shape[-1])

... or create a new model with something like

import tensorflow as tf

# pool the output across the character dimension
gmp = tf.keras.layers.GlobalMaxPool1D()
# add a classification head on top
d = tf.keras.layers.Dense(1, activation="sigmoid")
keras_model = tf.keras.Model(composed.model.input, d(gmp(composed.model.output)))
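
From there the classifier trains like any other Keras model. A minimal sketch, assuming a tiny labeled corpus and that the arrays returned by string_to_array can be stacked into a batch (both are assumptions for illustration, not part of cclm):

import numpy as np

texts = ["cclm SURE is useful!!", "this library is not for me"]  # toy labeled data
labels = np.array([1, 0])

# batch the preprocessed inputs, one row per string
X = np.vstack([prep.string_to_array(t, prep.max_example_len) for t in texts])

keras_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
keras_model.fit(X, labels, epochs=3, batch_size=2)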
