Collection of PyTorch datasets for hate speech and toxic internet discourse
Hate Datasets Compilation
Contains 17 different hate, violence, and discrimination speech datasets, along with annotations on where each was found, its data format, and the method used for collection and labeling. Each dataset is kept in its original file structure and placed inside the respective folder's data directory, so it can be swapped out if a more recent version is obtained. (Links to each dataset's source are in its ABOUT.md file.)
Datasets
Additional information about each dataset can be found in the corresponding ABOUT.md file.
The following datasets are implemented:
- ZeerakTalat NAACL
- ZeerakTalat NLPCSS
- HASOC
- Vicomtech
- TwitterSexism
- MLMA
- WikiToxicity
- WikiAggression
- WikiPersonalAttacks
Several more already have downloaders and are close to completion.
Notes
Two of the datasets, the MLMA Dataset and the Online Intervention Dataset, contain only hateful posts, instead labeling other features such as the target of the hate. Models trained on these datasets may be biased because of this.
Installing
To use the hatecomp datasets or models, simply run the following command in a Python environment of your choice:
```
pip install hatecomp
```
If you do not already have PyTorch installed, it is recommended to install it using conda. Visit the PyTorch website for more information.
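For example, a typical CPU-only conda install looks like the following (the exact command varies by platform and CUDA version, so confirm it against the selector on pytorch.org):
```
conda install pytorch cpuonly -c pytorch
```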
Once it has finished downloading, you can start loading in datasets and models. Below are a couple of examples to get you started. For more advanced usage, please see `scripts/train.py`.
Examples
Here are a couple of examples of how to use the hatecomp library.
Working with a Dataset
Loading datasets is very simple. Each has its own downloading script that runs lazily when you create the dataset. If you would like, you can specify where to download to and whether the dataset should download at all. By default, datasets only download when they cannot find the necessary files in the given location.
```python
from hatecomp.datasets import Vicomtech

# load a dataset from the default location,
# or download the dataset to the default location
dataset = Vicomtech()
example = dataset[0]

# load a dataset from a specified location,
# or download to that location
dataset = Vicomtech(root="my/special/dataset/path")
example = dataset[0]

# only load a dataset if it can be found at the given location
dataset = Vicomtech(root="my/special/dataset/path", download=False)
example = dataset[0]
```
The datasets also come equipped with a couple of handy features designed especially for NLP use and convenience.
```python
from hatecomp.datasets import Vicomtech

# Mapping a function over the dataset data (usually text, unless
# the dataset has already been mapped).
# Note that the map function can support batching if your mapped
# function supports it.
def my_tokenizing_function(some_string):
    return 0

dataset = Vicomtech()
tokenized_dataset = dataset.map(function=my_tokenizing_function, batched=False)

# Splitting the dataset
train_split, test_split = tokenized_dataset.split(test_proportion=0.1)
```
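If your function can process many examples at once, you can pass `batched=True` instead. Here is a minimal sketch, assuming that in batched mode the mapped function receives a list of examples and returns a list of the same length (the exact batching contract is an assumption, not documented above):
```python
from hatecomp.datasets import Vicomtech

# Hypothetical batched mapper: takes a list of strings and returns
# a list of the same length (assumed contract for batched=True).
def my_batched_tokenizing_function(some_strings):
    return [len(string.split()) for string in some_strings]

dataset = Vicomtech()
mapped_dataset = dataset.map(function=my_batched_tokenizing_function, batched=True)
```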
Using a model
Hatecomp also provides functionality for training models with these datasets, along with some pretrained models.
Importing a pretrained model
Loading one of our pretrained models is quite simple, and only requires the name of the appropriate dataset. The model will be downloaded into the files of the local `hatecomp` package. Note that this means uninstalling `hatecomp` will also delete the models; this is intended behavior.
```python
from hatecomp.models import HatecompClassifier, HatecompTokenizer

# To load an already downloaded model
model = HatecompClassifier.from_hatecomp_pretrained("Vicomtech")
tokenizer = HatecompTokenizer.from_hatecomp_pretrained("Vicomtech")

# To download a model if it does not exist locally
model = HatecompClassifier.from_hatecomp_pretrained("Vicomtech", download=True)
tokenizer = HatecompTokenizer.from_hatecomp_pretrained("Vicomtech")

# Force download a model (useful if the files become corrupted for any reason)
model = HatecompClassifier.from_hatecomp_pretrained("Vicomtech", force_download=True)
tokenizer = HatecompTokenizer.from_hatecomp_pretrained("Vicomtech")
```
The tokenizers also have the same `download` and `force_download` flags available. However, if you are loading the tokenizer directly after the model, the files will already be installed locally, since the model download retrieves the necessary data for both the model and the tokenizer.
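For example, if you do load a tokenizer on its own, the flags work the same way as for the classifier:
```python
from hatecomp.models import HatecompTokenizer

# Download the tokenizer files if they are not already present locally
tokenizer = HatecompTokenizer.from_hatecomp_pretrained("Vicomtech", download=True)

# Re-download even if local files exist (e.g. after corruption)
tokenizer = HatecompTokenizer.from_hatecomp_pretrained("Vicomtech", force_download=True)
```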
Training models
The process for training a model is quite simple, as there is a custom trainer class designed specifically for the datasets and models. Also included is a convenience `DataLoader` wrapper, which handles collating the hatecomp data; hatecomp datasets return ids, which the base `torch.utils.data.DataLoader` does not handle by default.
```python
import torch

from hatecomp.datasets import Vicomtech
from hatecomp.datasets.base import DataLoader
from hatecomp.training import HatecompTrainer
from hatecomp.models import HatecompClassifier, HatecompTokenizer

dataset = Vicomtech()
model = HatecompClassifier.from_huggingface_pretrained(
    "roberta-base",
    dataset.num_classes,
)
tokenizer = HatecompTokenizer.from_huggingface_pretrained(
    "roberta-base",
)

train_set, test_set = dataset.split(0.1)
train_dataloader = DataLoader(train_set, batch_size=32)
test_dataloader = DataLoader(test_set, batch_size=64)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-5,
)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3e-5,
    steps_per_epoch=len(train_dataloader),
    epochs=5,
)
loss_function = torch.nn.functional.cross_entropy

trainer = HatecompTrainer(
    root="root_directory",
    model=model,
    tokenizer=tokenizer,
    optimizer=optimizer,
    scheduler=scheduler,
    loss_function=loss_function,
    train_dataloader=train_dataloader,
    test_dataloader=test_dataloader,
    epochs=5,
)
trainer.train("cuda")
```
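Once training has finished, the model and tokenizer pair can be used for inference. The following is only a sketch: it assumes `HatecompClassifier` behaves like a standard Hugging Face sequence classifier, taking tokenized inputs and returning logits, which is an assumption rather than a documented hatecomp interface:
```python
import torch

# Hypothetical inference sketch; the exact call signatures of the
# hatecomp tokenizer and classifier are assumptions here.
model.eval()
with torch.no_grad():
    inputs = tokenizer("an example string to classify")  # assumed callable
    logits = model(inputs)  # assumed to return class logits
    predicted_class = int(torch.argmax(logits, dim=-1))
print(predicted_class)
```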
For an even more expedited training process, there is also the `HatecompTrialRunner`; an example of using this class can be found in `scripts/hyperopt.py`.
Results
Here is a list of results achieved on various datasets with Hugging Face models, along with the SOTA performance (as best I could find). Since it is not always possible to find SOTA scores for obscure datasets measured with a particular metric, the hatecomp score is selected to match whatever SOTA could be found. (The links are the locations where each SOTA reference was found. If you are looking for citations, please refer to the ABOUT.md for each dataset.)
Dataset | Metric | SOTA | hatecomp/huggingface |
---|---|---|---|
Vicomtech | Accuracy | 0.79 | 0.93 |
ZeerakTalat-NAACL | F1 | 0.74 | 0.94 |
ZeerakTalat-NLPCSS | F1 | 0.53 | 0.76 |
HASOC | F1 (Macro Average) | 0.53 | 0.55 |
TwitterSexism | F1 (Macro) | 0.87 | 0.99 |
MLMA | F1 (Multitask Macros EN) | [0.30, 0.43, 0.18, 0.57] | [0.58, 0.64, 0.16, 0.51] |
(If you know of a better SOTA than what is listed here, please create an issue or pull request.)
Also note that some of these datasets require tweet data. For these, a large number of tweet_ids return Unauthorized from the Twitter API, so the data on which the hatecomp models trained is a subset of the total dataset. More information can be found in the following table:
Dataset | Total Size | Successfully Downloaded Tweets | Available Training Portion |
---|---|---|---|
ZeerakTalat-NAACL | 16907 | 7210 | 0.4264 |
ZeerakTalat-NLPCSS | 6909 | 5385 | 0.7794 |
TwitterSexism | 10583 | 5054 | 0.4775 |
This information is valid as of February 2022, and will likely change as Twitter continues to lock down its API.