Collection of PyTorch datasets for hate speech and toxic internet discourse

Hate Datasets Compilation

Contains 17 different hate, violence, and discrimination speech datasets, along with annotations on where each was found, its data format, and the method used for collection and labeling. Each dataset is kept in its original file structure inside the data folder of its respective directory, in case a more recent version is obtained. (Links to each dataset's source are in its ABOUT.md file.)
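
For illustration, the layout described above might look like the following (the folder name is taken from one of the datasets discussed below, and the file names under data are placeholders, not the actual files):

MLMA/
    ABOUT.md        # source links, data format, collection and labeling method
    data/
        ...         # original files, kept in their original structure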

Datasets

Additional information about each dataset can be found in the corresponding ABOUT.md file.

The following datasets are implemented:

Several more already have downloaders and are close to completion.

Notes

Two of the datasets, the MLMA Dataset and the Online Intervention Dataset, contain only hateful posts, instead labeling other features such as the target of the hate. Models trained on these datasets may be biased as a result.

Installing

To use the hatecomp datasets or models, simply run the following command in a Python environment of your choice:

pip install hatecomp

If you do not already have PyTorch installed, it is recommended to install it with conda. Visit the PyTorch website for more information.
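
For example, a CPU-only install looks something like the following. This is the standard PyTorch conda command rather than anything specific to hatecomp; check the PyTorch website for the variant matching your CUDA version:

# CPU-only; see pytorch.org for GPU-enabled variants
conda install pytorch cpuonly -c pytorch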

Once the installation finishes, you can start loading datasets and models. Below are a couple of examples to get you started. For more advanced usage, please see scripts/train.py.

Examples

Here are a couple examples of how to use the hatecomp library.

Working with a Dataset

Loading datasets is very simple. Each has its own downloading script that runs lazily when you create the dataset. If you like, you can specify where the dataset should be stored and whether it should download. By default, a dataset downloads only when it cannot find the necessary files in the given location.

from hatecomp.datasets import Vicomtech

# load a dataset from the default location,
# or download the dataset in the default location
dataset = Vicomtech()
example = dataset[0]

# load a dataset from a specified location,
# or download to that location
dataset = Vicomtech(root="my/special/dataset/path")
example = dataset[0]

# only load a dataset if it can be found at the given location
dataset = Vicomtech(root="my/special/dataset/path", download=False)
example = dataset[0]

The datasets also come equipped with a couple of handy features designed especially for NLP use and convenience.

from hatecomp.datasets import Vicomtech

# Mapping a function over the dataset data (usually text, unless the dataset has already been mapped)
# Note that the map function can support batching if your mapped function supports it.
def my_tokenizing_function(some_string):
    # placeholder for a real tokenizer
    return 0

dataset = Vicomtech()
tokenized_dataset = dataset.map(function=my_tokenizing_function, batched=False)

# Splitting the dataset
train_split, test_split = tokenized_dataset.split(test_proportion=0.1)
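
If your mapped function handles batches, you can pass batched=True. Below is a minimal sketch, assuming batched mode passes your function a list of examples and expects a list of results back (the function name here is hypothetical):

from hatecomp.datasets import Vicomtech

# hypothetical batched function: receives a list of examples,
# returns a list of results of the same length
def my_batched_function(some_strings):
    return [0 for _ in some_strings]

dataset = Vicomtech()
tokenized_dataset = dataset.map(function=my_batched_function, batched=True)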

Using a model

Hatecomp also provides functionality for training models with these datasets, along with some pretrained models.

Importing a pretrained model

Loading one of our pretrained models is quite simple, requiring only the name of the appropriate dataset. The model is then downloaded into the files of the local hatecomp package. Note that this means uninstalling hatecomp will also delete the models; this is intended behavior.

from hatecomp.models import HatecompClassifier, HatecompTokenizer

# To load an already downloaded model
model = HatecompClassifier.from_hatecomp_pretrained("Vicomtech")
tokenizer = HatecompTokenizer.from_hatecomp_pretrained("Vicomtech")

# To download a model if it does not exist locally
model = HatecompClassifier.from_hatecomp_pretrained("Vicomtech", download=True)
tokenizer = HatecompTokenizer.from_hatecomp_pretrained("Vicomtech")

# Force download a model (useful if the files become corrupted for any reason)
model = HatecompClassifier.from_hatecomp_pretrained("Vicomtech", force_download=True)
tokenizer = HatecompTokenizer.from_hatecomp_pretrained("Vicomtech")

The tokenizers accept the same download and force_download flags, but if you load the tokenizer directly after the model, the files will already be present locally, since the model download retrieves the necessary data for both the model and the tokenizer.
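
For example, to force a fresh copy of the tokenizer files as well, mirroring the model flags shown above:

from hatecomp.models import HatecompTokenizer

# same flags as the model loader; force_download refetches
# the files even if they already exist locally
tokenizer = HatecompTokenizer.from_hatecomp_pretrained("Vicomtech", force_download=True)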

Training models

The process for training a model is quite simple, as there is a custom trainer class designed specifically for these datasets and models. Also included is a convenience DataLoader wrapper that handles collating hatecomp data, since hatecomp datasets return ids, which the base torch.utils.data.DataLoader does not handle by default.

import torch

from hatecomp.datasets import Vicomtech
from hatecomp.datasets.base import DataLoader

from hatecomp.training import HatecompTrainer
from hatecomp.models import HatecompClassifier, HatecompTokenizer

dataset = Vicomtech()
model = HatecompClassifier.from_huggingface_pretrained(
    "roberta-base",
    dataset.num_classes
)
tokenizer = HatecompTokenizer.from_huggingface_pretrained(
    "roberta-base",
)

train_set, test_set = dataset.split(0.1)
train_dataloader = DataLoader(train_set, batch_size=32)
test_dataloader = DataLoader(test_set, batch_size=64)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-5
)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3e-5,
    steps_per_epoch=len(train_dataloader),
    epochs=5
)

loss_function = torch.nn.functional.cross_entropy

trainer = HatecompTrainer(
    root="root_directory",
    model=model,
    tokenizer=tokenizer,
    optimizer=optimizer,
    scheduler=scheduler,
    loss_function=loss_function,
    train_dataloader=train_dataloader,
    test_dataloader=test_dataloader,
    epochs=5
)
trainer.train("cuda")

For an even more expedited training process, there is also the HatecompTrialRunner; an example of using this class can be found in scripts/hyperopt.py.

Results

Here is a list of results achieved on various datasets with Huggingface models, along with the SOTA performance (as best I could find). Since it is not always possible to find SOTA scores for obscure datasets measured with a particular metric, the hatecomp score is selected to match whatever SOTA could be found. (The links are the locations where each SOTA reference was found. If you are looking for citations, please refer to the ABOUT.md for each dataset.)

Dataset             Metric                     SOTA                      hatecomp/huggingface
Vicomtech           Accuracy                   0.79                      0.93
ZeerakTalat-NAACL   F1                         0.74                      0.94
ZeerakTalat-NLPCSS  F1                         0.53                      0.76
HASOC               F1 (Macro Average)         0.53                      0.55
TwitterSexism       F1 (Macro)                 0.87                      0.99
MLMA                F1 (Multitask Macros EN)   [0.30, 0.43, 0.18, 0.57]  [0.58, 0.64, 0.16, 0.51]

(If you know of a better SOTA than what is listed here, please create an issue or pull request.)

Also note that some of these datasets require tweet data. For these, a large number of tweet_ids return Unauthorized from the Twitter API, so the data on which the hatecomp models were trained is a subset of the total dataset. More information can be found in the following table:

Dataset             Total Size  Successfully Downloaded Tweets  Available Training Portion
ZeerakTalat-NAACL   16907       7210                            0.4264
ZeerakTalat-NLPCSS  6909        5385                            0.77941
TwitterSexism       10583       5054                            0.4775

This information is valid as of February 2022 and is subject to change as Twitter continues to lock down its API.
