
A tool for sibylvariant transformations

Project description

Sibylvariant Transformations for Robust Text Classification

Summary

The vast majority of text transformation techniques in NLP are inherently limited in their ability to expand input space coverage due to an implicit constraint to preserve the original class label. In this work, we propose the notion of sibylvariance (SIB) to describe the broader set of transforms that relax the label-preserving constraint, knowably vary the expected class, and lead to significantly more diverse input distributions. We offer a unified framework to organize all data transformations, including two types of SIB: (1) Transmutations convert one discrete kind into another, (2) Mixture Mutations blend two or more classes together. To explore the role of sibylvariance within NLP, we implemented 41 text transformations, including several novel techniques like Concept2Sentence and SentMix. Sibylvariance also enables a unique form of adaptive training that generates new input mixtures for the most confused class pairs, challenging the learner to differentiate with greater nuance. Our experiments on six benchmark datasets strongly support the efficacy of sibylvariance for generalization performance, defect detection, and adversarial robustness.

Team

This project is developed by Professor Miryung Kim's Software Engineering and Analysis Laboratory at UCLA. If you encounter any problems, please open an issue on the project repository or contact us.

How to cite

Please refer to our Findings of ACL'22 paper, Sibylvariant Transformations for Robust Text Classification, for more details.

Bibtex

@inproceedings{harel-canada-etal-2022-sibylvariant,
    title = "Sibylvariant Transformations for Robust Text Classification",
    author = "Harel-Canada, Fabrice  and
      Gulzar, Muhammad Ali  and
      Peng, Nanyun  and
      Kim, Miryung",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-acl.140",
    doi = "10.18653/v1/2022.findings-acl.140",
    pages = "1771--1788",
    abstract = "The vast majority of text transformation techniques in NLP are inherently limited in their ability to expand input space coverage due to an implicit constraint to preserve the original class label. In this work, we propose the notion of sibylvariance (SIB) to describe the broader set of transforms that relax the label-preserving constraint, knowably vary the expected class, and lead to significantly more diverse input distributions. We offer a unified framework to organize all data transformations, including two types of SIB: (1) Transmutations convert one discrete kind into another, (2) Mixture Mutations blend two or more classes together. To explore the role of sibylvariance within NLP, we implemented 41 text transformations, including several novel techniques like Concept2Sentence and SentMix. Sibylvariance also enables a unique form of adaptive training that generates new input mixtures for the most confused class pairs, challenging the learner to differentiate with greater nuance. Our experiments on six benchmark datasets strongly support the efficacy of sibylvariance for generalization performance, defect detection, and adversarial robustness.",
}

Sibyl

Sibyl is a tool for generating new data from the data you already have. Thus far, we have implemented over 40 different text transforms and a handful of image transforms.

There are two primary kinds of transformations (illustrated in the sketch after this list):

  • Invariant (INV): transform the input, but the expected output remains the same.
    • ex. "I love NY" + Emojify = "I 💗 NY", which has an INV effect on the sentiment.
  • Sibylvariant (SIB): transform the input, and the expected output may change in some way.
    • ex. "I love NY" + ChangeAntonym = "I hate NY", which has a SIB effect on the sentiment (label inverted from positive (1) to negative (0)).

Some transformations, known as sibylvariant mixture mutations, produce soft labels. For example, within topic classification you could use SentMix to randomly combine two inputs from different classes into a single input and then shuffle the sentences around. The new input receives a soft label with probabilities weighted according to the relative length of text contributed by each source. Illustrating with AG_NEWS, mixing an article about sports with one about business might yield a soft label like [0, 0.5, 0.5, 0]. The intuition is that a human would recognize that the document covers two topics, and we should expect our models to behave similarly (i.e., predictions should be close to 50/50 on the expected topics).
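To make the weighting concrete, here is a minimal sketch of how a mixture mutation could derive its soft label. It assumes length is measured in words, which matches the SentMix output shown later; the tool's exact measure may differ.

def soft_label(text_a, label_a, text_b, label_b, num_classes):
    # Weight each class by the share of words its source text contributes.
    len_a = len(text_a.split())
    len_b = len(text_b.split())
    total = len_a + len_b
    label = [0.0] * num_classes
    label[label_a] += len_a / total
    label[label_b] += len_b / total
    return label

# AG_NEWS classes: 0=World, 1=Sports, 2=Business, 3=Sci/Tech
print(soft_label("The home team won the title", 1,
                 "Stocks fell sharply on Monday morning", 2, 4))
# -> [0.0, 0.5, 0.5, 0.0] when both sources contribute equally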

Installation

The Sibyl tool can be installed locally or via PyPI:

pip install sibyl-tool
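For a local installation, the standard pip workflow from a clone of the repository should work (assuming the packaging metadata sits at the repo root):

pip install .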

This repository contains a snapshot of the Sibyl tool at a particular point in time, roughly corresponding to the paper submission. Any future updates to the tool will be made in this repository.

Examples

Concept2Sentence

This transformation takes in a sentence, its associated integer label, and (optionally) the name of a dataset supported by huggingface/datasets. It works by extracting keyword concepts from the original sentence and passing them to a BART transformer trained to generate a new, related sentence that reflects the extracted concepts. Providing a dataset name allows the function to use transformers-interpret to identify the most critical concepts for the generative step.

from sibyl import Concept2Sentence  # import path assumed from the package name

transform = Concept2Sentence(dataset='sst2')

in_text = "a disappointment for those who love alternate versions of the bard, particularly ones that involve deep fryers and hamburgers."
in_target = 0 # (sst2 dataset 0=negative, 1=positive)

out_text = transform(in_text, in_target)

# Output:
# Extracted Concepts: ['disappointment']
# New Sentence: "disappointment is a common sight among many people."

NOTE: We highly recommend using a GPU for generative transforms like this one.

SentMix

This transformation takes in a set of inputs and targets, blends the sentences together, and generates a new target label reflecting the ratio of words contributed by each source sentence. The resulting soft label encodes mixed class membership and forces the model to differentiate classes with greater nuance.

from sibyl import SentMix  # import path assumed from the package name

transform = SentMix()

in_text = ["The characters are unlikeable and the script is awful. It's a waste of the talents of Deneuve and Auteuil.", 
           "Unwatchable. You can't even make it past the first three minutes. And this is coming from a huge Adam Sandler fan!!1",
           "An unfunny, unworthy picture which is an undeserving end to Peter Sellers' career. It is a pity this movie was ever made.",
           "I think it's one of the greatest movies which are ever made, and I've seen many... The book is better, but it's still a very good movie!",
           "The only thing serious about this movie is the humor. Well worth the rental price. I'll bet you watch it twice. It's obvious that Sutherland enjoyed his role.",
           "Touching; Well directed autobiography of a talented young director/producer. A love story with Rabin's assassination in the background. Worth seeing"
]

in_target = [0, 0, 0, 1, 1, 1] # (imdb dataset 0=negative, 1=positive)

new_text, new_target = transform((in_text, in_target), num_classes=2)
print(new_text[0])
print(new_target[0])

# Output (one possible mixture):
# "The book is better, but it's still a very good movie! It's a waste of the talents of Deneuve and Auteuil. I think it's one of the greatest movies which are ever made, and I've seen many... The characters are unlikeable and the script is awful."
# [0.4380165289256198, 0.5619834710743802]

Evaluation Code

To reproduce the evaluations performed for the paper, run the scripts provided in the eval folder after installing Sibyl. First, generate the datasets (to avoid variability from random seed selection), then run each of the rq_*.py files. Note that the full evaluation takes a very long time to run.
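For instance, after generating the datasets, a small driver like the following could run the scripts in sequence (the eval/ path comes from the text above; sequential execution is an assumption about the intended workflow):

import glob
import subprocess
import sys

# Run every research-question script in the eval folder, one after another.
for script in sorted(glob.glob("eval/rq_*.py")):
    subprocess.run([sys.executable, script], check=True)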

Download files

Download the file for your platform.

Source Distribution

sibyl_tool-0.1.11.tar.gz (171.2 kB)

Built Distribution

sibyl_tool-0.1.11-py3-none-any.whl (202.8 kB)

File details

Details for the file sibyl_tool-0.1.11.tar.gz.

File metadata

  • Download URL: sibyl_tool-0.1.11.tar.gz
  • Size: 171.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.15

File hashes

Hashes for sibyl_tool-0.1.11.tar.gz:

  • SHA256: b06f9aa47c6247c823be0a43e53f6e7a3f64502394ff19c5e7477114c891e9f7
  • MD5: 8ea454283b6b3ae13edfd0ef89589f28
  • BLAKE2b-256: 4948e3d29d2ca351986db9c4fd4d54eeb4566c0360940af6447fc9fdbf4c6c96
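To verify a downloaded sdist against the published SHA256 digest, a short Python snippet suffices:

import hashlib

# Digest published above for sibyl_tool-0.1.11.tar.gz
expected = "b06f9aa47c6247c823be0a43e53f6e7a3f64502394ff19c5e7477114c891e9f7"
with open("sibyl_tool-0.1.11.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == expected, "SHA256 mismatch: the file may be corrupted"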


File details

Details for the file sibyl_tool-0.1.11-py3-none-any.whl.

File metadata

  • Download URL: sibyl_tool-0.1.11-py3-none-any.whl
  • Size: 202.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.15

File hashes

Hashes for sibyl_tool-0.1.11-py3-none-any.whl:

  • SHA256: 1a876bbed9b8b360cfde2b424ce7f83959eba25de4512e73805be255315c19c0
  • MD5: 367038ab9891abd77f1de8f760fbd8a6
  • BLAKE2b-256: 4676d42ea621051b73faccd06d3f9c68ef7ac74b40f5fc84dfa67310edb69236

