grouphug

GroupHug is a library with extensions to 🤗 transformers for multitask language modelling. In addition, it contains utilities that ease data preparation, training, and inference.

Overview

The package is optimized for training a single language model to make quick and robust predictions for a wide variety of related tasks at once, as well as to investigate the regularizing effect of training a language modelling task at the same time.

You can train on multiple datasets, with each dataset containing an arbitrary subset of your tasks. Supported tasks include:

  • A single language modelling task (masked language modelling, masked token detection, or causal language modelling).
    • The included default collator handles most preprocessing for these heads automatically.
  • Any number of classification tasks, including single-label and multi-label classification and regression.
    • A utility function that automatically creates a classification head from your data.
    • Additional options such as hidden layer size, additional input variables, and class weights.
  • You can also define your own model heads.
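
To make the masked language modelling preprocessing concrete, here is a minimal sketch of the general idea in plain Python. It is not grouphug's collator: the function name `mask_tokens`, its parameters, and the token ids are hypothetical, and it uses the simplest variant, where every selected token is replaced by the mask token. The label value -100 is the index that PyTorch's cross-entropy loss ignores by default.

```python
import random

IGNORE_INDEX = -100  # positions with this label are skipped by PyTorch cross-entropy by default


def mask_tokens(token_ids, mask_token_id, special_ids=(), mask_probability=0.15, seed=None):
    """Return (masked_inputs, labels) for one sequence of token ids.

    Hypothetical helper for illustration only; not part of grouphug.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in token_ids:
        if token not in special_ids and rng.random() < mask_probability:
            inputs.append(mask_token_id)  # hide the token from the model
            labels.append(token)          # and ask the model to recover it
        else:
            inputs.append(token)          # token stays visible
            labels.append(IGNORE_INDEX)   # no loss on unmasked positions
    return inputs, labels


# Made-up token ids; 101 and 102 stand in for special tokens that must never be masked.
original = [101, 7592, 2088, 999, 102]
masked, labels = mask_tokens(original, mask_token_id=103, special_ids={101, 102}, seed=0)
print(masked)
print(labels)
```

A real collator typically also pads sequences to a common length and builds tensors; the sketch leaves that out to keep the masking step visible.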

Quick Start

The project requires Python 3.8+ and PyTorch 1.10+. To install it, run:

pip install grouphug

Documentation

Documentation can be generated from the docstrings by running make html in the docs directory. It is not yet available on a hosted site.

Example usage

import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer

from grouphug import AutoMultiTaskModel, ClassificationHeadConfig, DatasetFormatter, LMHeadConfig, MultiTaskTrainer

# load some data. 'label' gets renamed in huggingface, so it is better avoided as a feature name.
task_one = load_dataset("tweet_eval", "emoji").rename_column("label", "tweet_label")
both_tasks = pd.DataFrame({"text": ["yay :)", "booo!"], "sentiment": ["pos", "neg"], "tweet_label": [0, 14]})

# create a tokenizer
base_model = "prajjwal1/bert-tiny"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# preprocess your data: tokenization, preparing class variables
formatter = DatasetFormatter().tokenize().encode("sentiment")
# data converted to a DatasetCollection: essentially a dict of DatasetDict
data = formatter.apply({"one": task_one, "both": both_tasks}, tokenizer=tokenizer, test_size=0.05)

# define which model heads you would like
head_configs = [
    LMHeadConfig(weight=0.1),  # default is BERT-style masked language modelling
    ClassificationHeadConfig.from_data(data, "sentiment"),  # detects dimensions and type
    ClassificationHeadConfig.from_data(data, "tweet_label"),  # detects dimensions and type
]
# create the model, optionally saving the tokenizer and formatter along with it
model = AutoMultiTaskModel.from_pretrained(base_model, head_configs, formatter=formatter, tokenizer=tokenizer)
# create the trainer
trainer = MultiTaskTrainer(
    model=model,
    tokenizer=tokenizer,
    train_data=data[:, "train"],
    eval_data=data[["one"], "test"],
    eval_heads={"one": ["tweet_label"]},  # limit evaluation to one classification task
)
trainer.train()
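
The example above trains on two datasets that carry different label columns, and gives the language modelling head a weight of 0.1. One common way to combine per-head losses in such a setup is a weighted sum over the heads that have labels in the current batch. The sketch below illustrates that idea in plain Python, assuming that a head's weight scales its loss contribution. It is not grouphug's training loop: `combine_losses`, the head name "mlm", and the loss values are hypothetical.

```python
def combine_losses(head_losses, head_weights):
    """Weighted sum over the heads that produced a loss for this batch.

    head_losses: dict of head name -> loss value, or None when the batch
                 has no labels for that head.
    head_weights: dict of head name -> weight; unlisted heads default to 1.0.
    Hypothetical helper for illustration only; not part of grouphug.
    """
    total = 0.0
    for name, loss in head_losses.items():
        if loss is None:
            continue  # this batch carries no labels for the head, so it adds nothing
        total += head_weights.get(name, 1.0) * loss
    return total


weights = {"mlm": 0.1}  # mirrors LMHeadConfig(weight=0.1); the name "mlm" is made up

# A batch from dataset "one" has tweet labels but no sentiment labels.
batch_from_one = {"mlm": 2.0, "tweet_label": 1.5, "sentiment": None}
# A batch from dataset "both" has labels for every head.
batch_from_both = {"mlm": 2.0, "tweet_label": 1.5, "sentiment": 0.5}

print(combine_losses(batch_from_one, weights))
print(combine_losses(batch_from_both, weights))
```

With a small weight, the language modelling task acts mainly as a regularizer while the classification heads dominate the gradient.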

Tutorials

See examples for a few notebooks that demonstrate the key features.

Supported Models

The package has support for the following base models:

  • Bert, DistilBert, Roberta/DistilRoberta, XLM-Roberta
  • Deberta/DebertaV2
  • Electra
  • GPT2, GPT-J, GPT-NeoX, OPT

Other models can be supported by inheriting from _BaseMultiTaskModel, although the language modelling head weights may not always load.

Limitations

  • The package only supports PyTorch, and will not work with other frameworks. There are no plans to change this.
  • Grouphug was developed and tested with 🤗 transformers 4.19-4.22. We aim to test against the latest version and stay compatible with it, but we still recommend pinning the most recent version known to work.
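
As a small sketch of how a project might guard against running on an untested 🤗 transformers release, the check below compares the installed version with the 4.19-4.22 range named above. The helper names `parse_version` and `is_tested_version` are hypothetical and not part of grouphug. Only the major and minor components are compared, and the version lookup uses the standard library's importlib.metadata.

```python
from importlib import metadata

TESTED_RANGE = ((4, 19), (4, 22))  # lowest and highest (major, minor) named in the limitations list


def parse_version(version):
    """Turn a string such as '4.21.3' into (4, 21); patch and suffixes are ignored."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_tested_version(version, tested_range=TESTED_RANGE):
    """Return True when the major.minor version falls inside the tested range."""
    lowest, highest = tested_range
    return lowest <= parse_version(version) <= highest


try:
    installed = metadata.version("transformers")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
else:
    if not is_tested_version(installed):
        print(f"transformers {installed} is outside the tested range")
```

Pinning the version in your dependency file remains the more reliable safeguard; a runtime check like this only reports the mismatch.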

See the contributing page if you are interested in contributing.

License

grouphug was initially developed at Chatdesk and is licensed under the Apache 2.0 license.
