NLP framework for composing together models modularly

CCLM

Composable, Character-Level Models

Why cclm?

The goal of cclm is to make the deep learning model development process modular by providing abstractions for structuring a computational graph.

If we think of the ML lifecycle as producing a usable class Model that consumers can call on input to get output, then comparing the model training process to human-led software development highlights some big differences. For instance, when we retrain a model we usually change the whole model at once; imagine a developer telling you that every commit they made touched every line of code in the package. Similarly, using a pretrained model is like using a 'batteries included' framework: you likely end up inheriting a good deal of functionality you don't require, and it may be hard to customize. These differences suggest changes that could make deep learning model development easier to manage, particularly as models continue to grow in size.

How does it work?

cclm aims to achieve this by making the model building process composable. There are many ways to pretrain a model on text and effectively unlimited corpora to train on, and each application has different needs.

cclm makes it possible to define a base input on which to build many different computational graphs, then combine them. For instance, if there is a standard, published cclm model trained with masked language modeling (MLM) on (wikitext + bookcorpus), you might start with that, but add a second component to that model that uses the same base, but is pretrained to extract entities from wiki-ner. By combining the two pretrained components with a ComposedModel, you get a model with information from both tasks that you can then use as a starting point for your downstream task.

Common model components will be published onto the cclm-shelf to make it simple to mix and match capabilities.

The choice to emphasize character-level rather than arbitrary tokenization schemes is to make the input as generically useful across tasks as possible. Character-level input also makes it simpler to add realistic typos/noise to make models more robust to imperfect inputs.
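As a rough illustration of the noise point above, the sketch below perturbs a string at the character level. It is not part of cclm: the function name `add_typos`, its parameters, and the three edit types (drop, duplicate, swap with the next character) are hypothetical choices made for this example.

```python
import random
from typing import Optional


def add_typos(text: str, rate: float = 0.1, seed: Optional[int] = None) -> str:
    """Return a copy of `text` with random character-level edits.

    Hypothetical helper for illustration only; not part of cclm.
    """
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < rate:
            op = rng.choice(["drop", "duplicate", "swap"])
            if op == "drop":
                i += 1  # skip this character entirely
                continue
            if op == "duplicate":
                out.extend([chars[i], chars[i]])
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], chars[i]])  # transpose neighbours
                i += 2
                continue
        # No edit (or a swap requested on the last character): copy as-is.
        out.append(chars[i])
        i += 1
    return "".join(out)


noisy = add_typos("cclm is neat", rate=0.3, seed=0)
print(noisy)
```

Because the edits operate on single characters, the noisy string can be fed through the same character-level preprocessing as clean text, with no change to the vocabulary.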

Basic concepts

The main output of a training job with cclm is a ComposedModel, which consists of a Preprocessor that turns text into a vector[int], a base model that embeds that vector input, and one or more model components that accept the output of the embedder. The ComposedModel concatenates the output from those models together to produce its final output.
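The data flow described above can be sketched without any deep learning library. In this toy version the "embedder" and the "components" are plain functions, and every name is hypothetical rather than cclm's API. The point is only that each component reads the same embedded sequence and that the outputs are concatenated along the last (feature) axis.

```python
from typing import Callable, List

Vector = List[float]
Sequence2D = List[Vector]  # shape: (sequence_length, features)


def toy_embedder(ids: List[int]) -> Sequence2D:
    # Map each integer to a 2-dimensional "embedding" (stand-in for a real model).
    return [[float(i), float(i) / 2.0] for i in ids]


def component_a(emb: Sequence2D) -> Sequence2D:
    # Produces 1 feature per position.
    return [[sum(v)] for v in emb]


def component_b(emb: Sequence2D) -> Sequence2D:
    # Produces 3 features per position.
    return [[v[0], v[1], v[0] * v[1]] for v in emb]


def compose(
    embedder: Callable[[List[int]], Sequence2D],
    components: List[Callable[[Sequence2D], Sequence2D]],
) -> Callable[[List[int]], Sequence2D]:
    def model(ids: List[int]) -> Sequence2D:
        emb = embedder(ids)  # shared base, computed once
        outputs = [c(emb) for c in components]
        # Concatenate along the feature axis, position by position.
        return [sum((o[t] for o in outputs), []) for t in range(len(ids))]

    return model


toy_composed = compose(toy_embedder, [component_a, component_b])
out = toy_composed([1, 2, 3])
print(len(out), len(out[0]))  # 3 positions, 1 + 3 = 4 features each
```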

The package uses datasets and tokenizers from huggingface for a standard interface and to benefit from their great framework. To fit models and preprocessors, you can also pass a List[str] directly.

To start, you need a Preprocessor.

from cclm.preprocessing import Preprocessor

# dataset: a huggingface dataset or a List[str]
prep = Preprocessor()  # set max_example_len to specify a maximum input length
prep.fit(dataset)  # defines the model's vocabulary (character-level)
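To make "text into a vector[int]" concrete, here is a minimal character-level encoder written from scratch. It sketches the general idea and is not cclm's Preprocessor: the class name `ToyCharPreprocessor`, the reserved ids 0 (padding) and 1 (unknown character), and the right-padding scheme are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List


class ToyCharPreprocessor:
    """Hypothetical stand-in for a character-level preprocessor."""

    PAD, UNK = 0, 1  # reserved ids (an assumption of this sketch)

    def __init__(self, max_example_len: int = 16) -> None:
        self.max_example_len = max_example_len
        self.char_to_id: Dict[str, int] = {}

    def fit(self, texts: Iterable[str]) -> None:
        # Vocabulary = every distinct character seen, in sorted order.
        vocab = sorted({c for t in texts for c in t})
        self.char_to_id = {ch: i + 2 for i, ch in enumerate(vocab)}

    @property
    def n_chars(self) -> int:
        return len(self.char_to_id) + 2  # include PAD and UNK

    def encode(self, text: str) -> List[int]:
        # Truncate to max_example_len, map unseen characters to UNK, right-pad.
        ids = [self.char_to_id.get(c, self.UNK) for c in text[: self.max_example_len]]
        return ids + [self.PAD] * (self.max_example_len - len(ids))


toy_prep = ToyCharPreprocessor(max_example_len=8)
toy_prep.fit(["cclm", "is neat"])
print(toy_prep.encode("neat!"))  # [9, 5, 3, 11, 1, 0, 0, 0]
```

The "!" never appeared during fitting, so it maps to the unknown id, and the remaining three positions are padding.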

Once you have that, you can create an Embedder, which is the common base on which all the separate models will sit. This is a flexible class primarily responsible for holding a model that embeds a sequence of integers (representing characters) into a space the components expect. For more complicated setups, the Embedder could have a ComposedModel as its model.

from cclm.models import Embedder

embedder = Embedder(prep.max_example_len, prep.n_chars)

The embedder doesn't necessarily need to be fit by itself, as you can fit it while you do your first pretraining task.

Now you're ready to build your first model using a pretraining task (here, masked language modeling):

from cclm.pretraining import MaskedLanguagePretrainer

pretrainer = MaskedLanguagePretrainer(embedder=embedder)
pretrainer.fit(dataset, epochs=10)

The MaskedLanguagePretrainer defines a transformer-based model to do masked language modeling. Calling .fit() will use the Preprocessor to produce masked inputs and try to identify the missing input token(s) using sampled_softmax loss or negative sampling. This is just one example of a pretraining task, but others can be found in cclm.pretrainers.
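The masking step can be pictured with a small standalone sketch. This is not the MaskedLanguagePretrainer implementation: the function name `mask_sequence`, the mask id, and the 15% default masking rate are assumptions chosen for illustration, and the loss computation is omitted entirely.

```python
import random
from typing import List, Optional, Tuple


def mask_sequence(
    ids: List[int],
    mask_id: int,
    rate: float = 0.15,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int], List[int]]:
    """Replace a random subset of positions with `mask_id`.

    Returns (masked_ids, masked_positions, original_ids_at_those_positions).
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Mask at least one position whenever the sequence is non-empty.
    n_mask = max(1, round(len(ids) * rate)) if ids else 0
    positions = sorted(rng.sample(range(len(ids)), n_mask))
    masked = list(ids)
    for p in positions:
        masked[p] = mask_id
    targets = [ids[p] for p in positions]
    return masked, positions, targets


ids = [9, 5, 3, 11, 2, 6, 10]
masked, positions, targets = mask_sequence(ids, mask_id=99, rate=0.3, seed=0)
print(masked, positions, targets)
```

A model trained on this task would receive `masked` as input and be asked to predict `targets` at `positions`.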

Once you've trained one or more models using Pretrainer objects, you can compose them together into one model.

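from cclm.models import ComposedModel  # assumed to live alongside Embedder in cclm.models

# pretrainer_a and pretrainer_b are two fitted Pretrainer objects built on the same embedder
composed = ComposedModel(embedder, [pretrainer_a.model, pretrainer_b.model])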

You can then use composed.model(x) to embed input:

x = prep.string_to_array("cclm is neat", prep.max_example_len)
# emb has shape (1, prep.max_example_len, d_a + d_b), where d_a and d_b are the
# last-axis sizes of the outputs of pretrainer_a.model and pretrainer_b.model
emb = composed.model(x)

... or create a new model with something like:

import tensorflow as tf

# pool the output across the character dimension
gmp = tf.keras.layers.GlobalMaxPool1D()
# add a classification head on top
d = tf.keras.layers.Dense(1, activation="sigmoid")
keras_model = tf.keras.Model(composed.model.input, d(gmp(composed.model.output)))

Shelf

The Shelf class is used to load off-the-shelf components. These are published to a separate repo using git lfs, and are loaded with a specific tag.

import os

from cclm.preprocessing import Preprocessor
from cclm.shelf import Shelf

shelf = Shelf()
identifier = "en_wiki_clm_1"
item_type = "preprocessor"
cache_dir = ".cclm"  # local directory where fetched components are stored
shelf.fetch(identifier, item_type, tag="v0.2.1", cache_dir=cache_dir)
prep = Preprocessor(
    load_from=os.path.join(cache_dir, identifier, item_type, "cclm_config.json")
)
