A package for RST parsing

Project description

Neural-based RST Parsing And Analysis In Persuasive Discourse

This is the official PyTorch repo for RST-NN-Parser, a deep neural network model for RST parsing.

Neural-based RST Parsing And Analysis In Persuasive Discourse
Jinfen Li, Lu Xiao
W-NUT 2021

If RST-NN-Parser is helpful for your research, please consider citing our paper:

@inproceedings{li2021neural,
  title={Neural-based rst parsing and analysis in persuasive discourse},
  author={Li, Jinfen and Xiao, Lu},
  booktitle={Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)},
  pages={274--283},
  year={2021}
}

Usage via Pip Package

Install the pip package:

pip install rst-parser

Use the pip package:

from rst_parser import rst_parser
tree_results, dis_results = rst_parser.parse(["The scientific community is making significant progress in understanding climate change. Researchers have collected vast amounts of data on temperature fluctuations, greenhouse gas emissions, and sea-level rise. This data shows a clear pattern of increasing global temperatures over the past century. However, there are still debates about the causes and consequences of climate change."])

tree_results explanation

Each tree in tree_results is a binary tree whose nodes have the following attributes:
    text      # text of an EDU
    own_rel   # the node's own relation
    edu_span  # the EDU span covered by the node
    own_nuc   # the node's own nuclearity
    lnode     # left child node
    rnode     # right child node
    pnode     # parent node
    nuc_label # nuclearity label: NN, NS, or SN
    rel_label # one of the 18 relations
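
For example, a preorder traversal over these attributes prints the tree structure (a minimal sketch; it assumes attribute-style access and that lnode/rnode are None at leaf nodes; adapt if the nodes are exposed differently):

def print_tree(node, depth=0):
    # Preorder traversal of one tree from tree_results.
    # Leaves carry the EDU text; internal nodes carry the
    # nuclearity label (NN/NS/SN) and one of the 18 relations.
    if node is None:
        return
    if node.lnode is None and node.rnode is None:
        print("  " * depth + node.text)
    else:
        print("  " * depth + f"{node.nuc_label}:{node.rel_label}")
    print_tree(node.lnode, depth + 1)
    print_tree(node.rnode, depth + 1)

print_tree(tree_results[0])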

dis_results explanation

Each entry in dis_results is the corresponding parse serialized in the RST-DT .dis format:

( Root (span 1 4)
  ( Nucleus (span 1 2) (rel2par span)
    ( Nucleus (leaf 1) (rel2par span) (text _!The scientific community is making significant progress!_) )
    ( Satellite (leaf 2) (rel2par Elaboration) (text _!in understanding climate change .!_) )
  )
  ( Satellite (span 3 4) (rel2par Elaboration)
    ( Nucleus (leaf 3) (rel2par span) (text _!Researchers have collected vast amounts of data on temperature fluctuations , greenhouse gas emissions , and sea-level rise .!_) )
    ( Satellite (leaf 4) (rel2par Contrast) (text _!This data shows a clear pattern of increasing global temperatures over the past century . However , there are still debates about the causes and consequences of climate change .!_) )
  )
)
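
Since dis_results appears to hold one .dis-format string per input document (an assumption based on the example above), the parses can be written directly to files:

# Write each parse to its own .dis file; the naming scheme here is arbitrary.
for i, dis in enumerate(dis_results):
    with open(f"doc_{i}.dis", "w", encoding="utf-8") as f:
        f.write(dis)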

Usage via Source Code

Environment

Create a virtual environment:

conda create -n emo_env python=3.9.16
conda activate emo_env

Install packages via conda first and then via pip:

conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit
conda install pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt

Rename .env.example to .env and update the variable values in the file.

Neptune

Before running the code, you need to complete the following steps:

  1. Create a Neptune account and project.
  2. Edit the [NEPTUNE_API_TOKEN] and [NEPTUNE_NAME] fields in the .env file.
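
A filled-in .env might look like this (a minimal sketch with placeholder values; keep any other variables from .env.example unchanged):

NEPTUNE_API_TOKEN=your-neptune-api-token
NEPTUNE_NAME=your-workspace/your-project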

Multirun

Run a grid search over different configs with Hydra's multirun (-m) flag; the command below launches one run per seed.

python main.py -m \
    dataset=rst_dt \
    seed=0,1,2,3,4,5

Evaluate checkpoint

This command evaluates a checkpoint on the train, dev, and test sets.

python main.py \
    training=evaluate \
    training.ckpt_path=/path/to/ckpt \
    training.eval_splits=train,dev,test

Finetune checkpoint

This command continues training from an existing checkpoint (assuming a finetune option exists alongside evaluate in the training config group):

python main.py \
    training=finetune \
    training.ckpt_path=/path/to/ckpt

Offline Mode

In offline mode, results are not logged to Neptune.

python main.py logger.offline=True

Debug Mode

In debug mode, results are not logged to Neptune, and we only train/evaluate for a limited number of batches and/or epochs.

python main.py debug=True

Hydra Working Directory

Hydra changes the working directory to the path specified in configs/hydra/default.yaml. Therefore, if you save a file to the path './file.txt', it will actually be saved somewhere like logs/runs/xxxx/file.txt. This is helpful when you want to version control your saved files, but not when you want to save to a global directory. There are two ways to get the "actual" working directory:

  1. Call hydra.utils.get_original_cwd (see the sketch after this list)
  2. Use cfg.work_dir. To use this in the config, you can do something like "${data_dir}/${.dataset}/${model.arch}/"
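
For example, the first method can be used like this inside a Hydra app (a minimal sketch; get_original_cwd is part of Hydra's public API and only works while a Hydra app is running, and the file name here is arbitrary):

import os
from hydra.utils import get_original_cwd

# Build the path relative to the directory main.py was launched from,
# not the per-run directory Hydra switched into.
path = os.path.join(get_original_cwd(), "file.txt")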

Config Key

  • work_dir: the current working directory (where src/ is)
  • data_dir: where the data folder is
  • log_dir: where the log folder is (runs & multirun)
  • root_dir: where the saved ckpt & hydra config are


Example Commands

The commands below walk through a typical workflow: building a dataset, training the parser, and evaluating a trained model.

1. Build dataset

The commands below build pre-processed datasets, saved as pickle files. The model architecture is specified so that the correct tokenizer is used for pre-processing. Remember to put an xxx.yaml file in the configs/dataset folder for the dataset you want to build.

python scripts/build_dataset.py --data_dir data \
    --dataset rst_dt --arch bert-base-uncased --max_length 32 --split train

python scripts/build_dataset.py --data_dir data \
    --dataset rst_dt --arch bert-base-uncased --split dev

python scripts/build_dataset.py --data_dir data \
    --dataset rst_dt --arch bert-base-uncased --split test

If the dataset is very large, you have the option to subsample part of it for smaller-scale experiments. For example, the command below builds a train set with only 10 examples (sampled with seed 0).

python scripts/build_dataset.py --data_dir data \
    --dataset rst_dt --arch bert-base-uncased --split train \
    --num_samples 10 --seed 0

2. Train RST Parser

The command below is the most basic way to run main.py.

Note: train_batch_size and eval_batch_size are the number of documents in a batch; mini_batch_size is the number of EDUs processed at a time within a document.

python main.py -m \
    data=rst_dt \
    model.optimizer.lr=2e-5 \
    setup.train_batch_size=1 \
    setup.accumulate_grad_batches=1 \
    setup.eff_train_batch_size=1 \
    setup.eval_batch_size=1 \
    setup.mini_batch_size=10 \
    setup.num_workers=3 \
    seed=0,1,2

3. Train Model with Different Configs

Change blstm_hidden_size

python main.py -m \
    data=rst_dt \
    model.blstm_hidden_size=300 \
    model.optimizer.lr=2e-5 \
    setup.train_batch_size=1 \
    setup.accumulate_grad_batches=1 \
    setup.eff_train_batch_size=1 \
    setup.eval_batch_size=1 \
    setup.mini_batch_size=10 \
    setup.num_workers=3 \
    seed=0,1,2

4. Evaluate Model

exp_id is the folder name under your save_dir (e.g., "rst_dt_xxx"), and ckpt_path is a checkpoint under the checkpoints folder inside that exp_id folder. The results will be saved in the model_outputs folder inside the same exp_id folder.

python main.py -m \
    data=rst_dt \
    training=evaluate \
    ckpt_path=xxx \
    exp_id=xxx \
    setup.eval_batch_size=1 \
    setup.mini_batch_size=10 \
    setup.num_workers=3 \
    seed=0,1,2
