Social Media NLP package for PyTorch & PyTorch Lightning.
A Social Media Natural Language Processing package for PyTorch & PyTorch Lightning.
PyTorch Gleam
PyTorch Gleam builds upon PyTorch Lightning for the specific use-case of Natural Language Processing on Social Media, such as Twitter. PyTorch Gleam strives to make Social Media NLP research easier to understand, use, and extend. Gleam contains models I use in my research, from fine-tuning a BERT-based model with Lexical, Emotion, and Semantic information in a Graph Attention Network for stance identification towards COVID-19 misinformation, to using Information Retrieval systems to identify new types of misinformation on Twitter.
About Me
My name is Maxwell Weinzierl, and I am a Natural Language Processing researcher at the Human Language Technology Research Institute (HLTRI) at the University of Texas at Dallas. I am currently working on my PhD, which focuses on COVID-19 and HPV vaccine misinformation, trust, and related topics on Social Media platforms such as Twitter. I built PyTorch Gleam to make my published research easy to reproduce and to let me iterate quickly on new research ideas.
How To Use
Step 0: Install
Simple installation from PyPI
pip install pytorch-gleam
Depending on your hardware, you may need to install CUDA drivers and a PyTorch build that matches them. See the PyTorch and PyTorch Lightning documentation for installation help.
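To confirm what actually got installed, a small check along the following lines may help. It is only a sketch: it reads distribution metadata through the standard library and assumes the usual PyPI distribution names torch and pytorch-lightning alongside pytorch-gleam. The helper name installed_version is made up for this example.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names other than pytorch-gleam are assumptions for illustration.
    for name in ("pytorch-gleam", "torch", "pytorch-lightning"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

If pytorch-gleam is reported as not installed, the pip command above most likely ran in a different Python environment.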
Step 1: Create Experiment
Create a configs folder with a YAML experiment file. Gleam utilizes PyTorch Lightning's CLI tools to configure experiments from YAML files, which enables researchers to clearly look back and identify both hyper-parameters and model code used in their experiments.
This example is from COVID-19 vaccine misinformation stance identification. The commands in the later steps expect it at configs/covid-stance.yaml:
seed_everything: 0
model:
  class_path: pytorch_gleam.modeling.models.MultiClassFrameLanguageModel
  init_args:
    learning_rate: 5e-4
    pre_model_name: digitalepidemiologylab/covid-twitter-bert-v2
    label_map:
      No Stance: 0
      Accept: 1
      Reject: 2
    threshold:
      class_path: pytorch_gleam.modeling.thresholds.MultiClassThresholdModule
    metric:
      class_path: pytorch_gleam.modeling.metrics.F1PRMultiClassMetric
      init_args:
        mode: macro
        num_classes: 3
trainer:
  max_epochs: 10
  accumulate_grad_batches: 4
  check_val_every_n_epoch: 1
  deterministic: true
  num_sanity_val_steps: 1
  checkpoint_callback: false
  callbacks:
    - class_path: pytorch_gleam.callbacks.FitCheckpointCallback
data:
  class_path: pytorch_gleam.data.datasets.MultiClassFrameDataModule
  init_args:
    batch_size: 8
    max_seq_len: 128
    label_name: misinfo
    label_map:
      No Stance: 0
      Accept: 1
      Reject: 2
    tokenizer_name: digitalepidemiologylab/covid-twitter-bert-v2
    num_workers: 8
    frame_path:
      - covid19/misinfo.json
    train_path:
      - covid19/stance-train.jsonl
    val_path:
      - covid19/stance-dev.jsonl
    test_path:
      - covid19/stance-test.jsonl
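The label map appears twice in this file (once for the model, once for the data module) and the metric separately declares the number of classes, so the three can drift apart when the file is edited. The sketch below is one way to sanity-check them before a run. It is not part of Gleam: check_label_maps is a hypothetical helper, and the requirement that class IDs run contiguously from 0 is an assumption made for this example.

```python
from typing import Dict, List


def check_label_maps(
    model_map: Dict[str, int],
    data_map: Dict[str, int],
    num_classes: int,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if model_map != data_map:
        problems.append("model and data label_map differ")
    # Assumption: class IDs are expected to be exactly 0 .. num_classes - 1.
    ids = sorted(model_map.values())
    if ids != list(range(num_classes)):
        problems.append(f"label IDs {ids} do not cover 0..{num_classes - 1}")
    return problems


# Values copied from the example configuration above.
label_map = {"No Stance": 0, "Accept": 1, "Reject": 2}
print(check_label_maps(label_map, dict(label_map), num_classes=3))
```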
Documentation on available models, datasets, and callbacks will be provided soon.
Details about how to set up YAML experiment files are provided by PyTorch Lightning's documentation.
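As a rough mental model of the class_path and init_args keys used in the experiment file, each pair names a class by its import path and supplies keyword arguments for its constructor. The sketch below illustrates that idea only. It is not PyTorch Lightning's implementation, which also validates argument types and instantiates nested entries such as threshold and metric. The instantiate helper is hypothetical, and a standard-library class stands in for a Gleam model so the example runs without Gleam installed.

```python
import importlib
from typing import Any, Dict


def instantiate(spec: Dict[str, Any]) -> Any:
    """Import the class named by class_path and call it with init_args."""
    module_name, _, class_name = spec["class_path"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    # init_args is optional, as in the threshold entry of the example config.
    return cls(**spec.get("init_args", {}))


# Stand-in spec using a standard-library class instead of a Gleam class.
spec = {"class_path": "datetime.timedelta", "init_args": {"hours": 2}}
obj = instantiate(spec)
print(type(obj).__name__, obj.total_seconds())
```

Because the class is named in the file itself, the YAML records which code was used as well as which hyper-parameters.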
Annotations for this example are provided in the VaccineLies repository, under covid19, as the CoVaxLies collection. The annotations contain only tweet IDs, so you will need to download the corresponding tweet texts through the Twitter API.
Step 2: Run Experiment
Create a models folder for your saved TensorBoard logs and model weights.
Determine the GPU ID of the GPU you would like to use (multi-GPU is supported) and provide the ID in a list, with a trailing comma if it is a single GPU ID. You can also specify just an integer, such as 1, and PyTorch Lightning will try to find a single free GPU automatically.
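The difference between the two forms is easy to miss on the command line: 1, (with the comma) names the device with ID 1, while 1 on its own asks for one GPU. The following sketch restates that rule in code. It is a simplified illustration, not PyTorch Lightning's own parser, and parse_gpus is a hypothetical name.

```python
from typing import List, Union


def parse_gpus(value: str) -> Union[int, List[int]]:
    """Interpret a --trainer.gpus value as a device-ID list or a GPU count."""
    text = value.strip()
    if "," in text:
        # A comma marks a list of device IDs; the empty piece left by a
        # trailing comma is skipped.
        return [int(part) for part in text.split(",") if part.strip()]
    # No comma: treated here as a number of GPUs, per the description above.
    return int(text)


for raw in ("1", "1,", "0,1"):
    print(repr(raw), "->", parse_gpus(raw))
```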
Run the following command to start training:
gleam fit \
--config configs/covid-stance.yaml \
--trainer.gpus 1 \
--trainer.default_root_dir models/covid-stance
Your model will train, with TensorBoard logging all metrics, and a checkpoint will be saved upon completion.
Step 3: Evaluate Experiment
You can easily evaluate your system on a test collection as follows:
gleam test \
--config configs/covid-stance.yaml \
--trainer.gpus 1 \
--trainer.default_root_dir models/covid-stance
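The example configuration selects an F1 metric in macro mode over three classes. As a reminder of what macro averaging means in general, the sketch below computes a per-class F1 score and takes the unweighted mean. It is a generic textbook version written for this README and may differ in detail from what F1PRMultiClassMetric reports. The function name and the toy labels are made up for illustration.

```python
from typing import Sequence


def macro_f1(gold: Sequence[int], pred: Sequence[int], num_classes: int) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no gold or predicted examples contributes 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Toy labels using the example label map: 0 = No Stance, 1 = Accept, 2 = Reject.
gold = [0, 1, 2, 1]
pred = [0, 1, 2, 2]
print(round(macro_f1(gold, pred, num_classes=3), 4))
```

Because every class counts equally, a rare stance such as Reject influences the score as much as a frequent one.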
Examples
These are a work-in-progress, as my original research code is a bit messy, but they will be updated soon!
COVID-19 Vaccine Misinformation Detection on Twitter
COVID-19 Vaccine Misinformation Stance Identification on Twitter
COVID-19 Misinformation Stance Identification on Twitter
Vaccine Misinformation Transfer Learning
Vaccine Hesitancy Profiling on Twitter
- TODO