A tool for sibylvariant transformations

Project description

Sibyl

Sibyl is a tool for generating new data from the data you already have. Transform your data in over 35 different ways, either by selecting a single transformation or by sampling n transformations at a time.

There are two primary kinds of transformations:

  • Invariant (INV): the input is transformed, but the expected output remains the same.
    • e.g. "I love NY" + Emojify = "I 💗 NY", which has an INV effect on the sentiment.
  • Sibylvariant (SIB): the input is transformed, and the expected output may change in some way.
    • e.g. "I love NY" + ChangeAntonym = "I hate NY", which has a SIB effect on the sentiment (label inverted from positive (1) to negative (0)).
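The distinction can be sketched with a toy example. The tiny emoji and antonym maps below are made up for illustration; they are not Sibyl's actual implementation of Emojify or ChangeAntonym:

```python
# Toy illustration of INV vs. SIB transforms on sentiment data (label 1 = positive).
EMOJI = {"love": "💗"}                      # surface change only
ANTONYM = {"love": "hate", "hate": "love"}  # meaning-changing swap

def emojify(text, label):
    """INV: swap words for emoji; the sentiment label is unchanged."""
    words = [EMOJI.get(w, w) for w in text.split()]
    return " ".join(words), label

def change_antonym(text, label):
    """SIB: swap words for antonyms; if anything was swapped, the label flips."""
    words = text.split()
    swapped = any(w in ANTONYM for w in words)
    new_words = [ANTONYM.get(w, w) for w in words]
    return " ".join(new_words), (1 - label if swapped else label)

print(emojify("I love NY", 1))         # ('I 💗 NY', 1)
print(change_antonym("I love NY", 1))  # ('I hate NY', 0)
```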

Some transformations also produce soft labels; these are called SIB-mix transformations. For example, in topic classification you could use SentMix to randomly combine two inputs from different classes into one input and then shuffle the sentences around. The new input receives a soft label with class probabilities weighted by the relative length of text contributed by each source. Illustrating with AG_NEWS: if you mix an article about sports with one about business, the result might have a soft label like [0, 0.5, 0.5, 0]. The intuition is that a human could recognize that a document covers two topics, and we should expect our models to behave similarly (i.e., the model's predictions should be close to 50/50 on the expected topics).
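The length-weighted soft label described above can be sketched in a few lines. The function name and the word-count weighting are illustrative (Sibyl's SentMix also shuffles sentences, which is omitted here):

```python
# Sketch of a SIB-mix soft label: weight each class by the share of
# text each source contributes to the combined input.

def mix_soft_label(text_a, label_a, text_b, label_b, num_classes=4):
    """Concatenate two inputs; weight the label by relative text length."""
    len_a, len_b = len(text_a.split()), len(text_b.split())
    total = len_a + len_b
    soft = [0.0] * num_classes
    soft[label_a] += len_a / total
    soft[label_b] += len_b / total
    return text_a + " " + text_b, soft

# Equal-length sports (class 1) and business (class 2) snippets:
_, label = mix_soft_label("team wins big game", 1, "stocks fall on news", 2)
print(label)  # [0.0, 0.5, 0.5, 0.0]
```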

Pretrained Models

We offer links to several pre-trained and fine-tuned BERT models used during our evaluation.

Examples

Here's a quick example of using a single transform:

from transforms import HomoglyphSwap

transform = HomoglyphSwap(change=0.75)
string_in = "The quick brown fox jumps over the lazy dog"
string_out = transform(string_in)
print(string_out) 

>> Tհe quіc𝒌 Ьⲅоԝn 𝚏о× ϳumрѕ оѵеⲅ 𝚝հе ⅼɑzу ԁoɡ
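Conceptually, a homoglyph swap replaces each character that has a visually similar Unicode look-alike with probability `change`. The sketch below uses a small illustrative mapping (Sibyl's actual table is much larger), not the library's implementation:

```python
import random

# Minimal homoglyph-swap sketch: each mapped character is replaced with a
# Unicode look-alike with probability `change`. The mapping is a tiny
# illustrative subset (Cyrillic/mathematical look-alikes).
HOMOGLYPHS = {"a": "ɑ", "e": "е", "o": "о", "k": "𝒌", "T": "Т"}

def homoglyph_swap(text, change=0.75, seed=None):
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in HOMOGLYPHS and rng.random() < change:
            out.append(HOMOGLYPHS[ch])
        else:
            out.append(ch)
    return "".join(out)

print(homoglyph_swap("The quick brown fox", change=1.0))
# → 'Тhе quic𝒌 brоwn fоx'  (visually similar, but different code points)
```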

Here's a quick example using transform_dataset, which uniformly samples from the taxonomized transformations so that you generate new data relevant to your particular task.

from datasets import load_dataset
from transforms import *
from utils import *

dataset = load_dataset('glue', 'sst2')
train_data = dataset['train']
train_data = train_data.rename_column('sentence', 'text')

task = 'sentiment'
tran = 'SIB'
n = 2

out = transform_dataset(
    train_data[:5], 
    num_transforms=n, 
    task=task, 
    tran=tran
)

new_text, new_label, trans_applied = out
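What transform_dataset does can be sketched as follows: uniformly sample `num_transforms` transforms from the pool matching the given task and transform type, apply them in sequence, and record which were applied. The pool and transforms below are stand-ins, not Sibyl's real taxonomy:

```python
import random

# Conceptual sketch of transform_dataset (not the real implementation).
# The pool maps (task, transform type) to candidate transforms; here the
# "transforms" are just string methods used as stand-ins.
POOL = {
    ("sentiment", "SIB"): [str.upper, str.title],
}

def transform_dataset_sketch(texts, labels, num_transforms, task, tran, seed=0):
    rng = random.Random(seed)
    new_text, new_label, applied = [], [], []
    for text, label in zip(texts, labels):
        chosen = rng.sample(POOL[(task, tran)], k=num_transforms)
        for t in chosen:
            text = t(text)
        new_text.append(text)
        new_label.append(label)  # real SIB transforms may also change the label
        applied.append([t.__name__ for t in chosen])
    return new_text, new_label, applied

texts, labels, applied = transform_dataset_sketch(
    ["a fine film"], [1], num_transforms=2, task="sentiment", tran="SIB"
)
```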

Here are some examples we've already prepared:

import pandas as pd

from utils import *

test_suites = pkl_load('assets/SST2/test_suites.pkl')
INV_test_suites = pkl_load('assets/SST2/INV_test_suites.pkl')
SIB_test_suites = pkl_load('assets/SST2/SIB_test_suites.pkl')

n = 3
df_orig = pd.DataFrame.from_dict(test_suites[0]).head(n)
df_INV  = pd.DataFrame.from_dict(INV_test_suites[0]).head(n)
df_SIB  = pd.DataFrame.from_dict({'data': SIB_test_suites[0]['data'], 
                                  'target': SIB_test_suites[0]['target'],
                                  'ts': SIB_test_suites[0]['ts']}).head(n)

df_orig.rename(columns={'data': 'original'}, inplace=True)
df_INV.rename(columns={'data': 'INV_transformed', 'ts' : 'transforms_applied'}, inplace=True)
df_SIB.rename(columns={'data': 'SIB_transformed', 'ts' : 'transforms_applied'}, inplace=True)

df = pd.concat([df_orig, df_INV, df_SIB], axis=1)

df
# Row 0
#   original:        boisterous and utterly charming (target: 1)
#   INV_transformed: boisterous robust+ious and utterly charming (target: 1, transforms_applied: ['RandomInsertion', 'RandomCharInsert'])
#   SIB_transformed: boisterous and utterly charming That being said, I loved it. 💁🏽‍♂ (target: 1, transforms_applied: ['InsertPositivePhrase', 'AddPositiveEmoji'])
# Row 1
#   original:        pathos-filled but ultimately life-affirming finale (target: 1)
#   INV_transformed: рɑthos-fіlled but սⅼtimatеly li𝚏e-/ffirmiոɡ finɑlе (target: 1, transforms_applied: ['RandomCharSubst', 'HomoglyphSwap'])
#   SIB_transformed: pathos-filled but ultimately life-affirming finale https://www.dictionary.com/browse/clunky 🙋 (target: 1, transforms_applied: ['AddNegativeLink', 'AddPositiveEmoji'])
# Row 2
#   original:        with a lower i.q. than when i had entered (target: 0)
#   INV_transformed: with a gloomy i.q. than when i had immerse (target: 0, transforms_applied: ['ChangeSynonym', 'ChangeHyponym'])
#   SIB_transformed: with a lower i.q. than when i had entered 👨‍❤‍💋‍👨 That being said, I liked it. (target: 1, transforms_applied: ['AddPositiveEmoji', 'InsertPositivePhrase'])

Colab notebooks

Because our local machines were nowhere near powerful enough for the project's computational requirements, we also created Google Colab notebooks to help parallelize our experimental evaluations.

train_SST2 https://colab.research.google.com/drive/13Gk_hDTJ25s_BPXmH6wmySIM4EI-YsEO?usp=sharing

train_AG_NEWS https://colab.research.google.com/drive/1vA8K6VX99Zmcr00nk-0-ZmITSLmWfyI-?usp=sharing

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sibyl_tool-0.1.0.tar.gz (148.6 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sibyl_tool-0.1.0-py3-none-any.whl (171.9 kB)

Uploaded Python 3

File details

Details for the file sibyl_tool-0.1.0.tar.gz.

File metadata

  • Download URL: sibyl_tool-0.1.0.tar.gz
  • Upload date:
  • Size: 148.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.9

File hashes

Hashes for sibyl_tool-0.1.0.tar.gz

  • SHA256: 36db16f5e4a3f738d51899f666ec4f6837029c6cac4ee2a953316788e98496c8
  • MD5: d042bc5217ff1a264cf15a6e064a4f73
  • BLAKE2b-256: 3955f269370a5ca60802475ed21cb1b15659df1705c21c240ff940c2b5c969c7

See more details on using hashes here.
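One common way to check a downloaded file against the published SHA256 digest is with the standard library's hashlib (the helper name below is ours, and the filename matches the source distribution above):

```python
import hashlib

def sha256sum(path):
    """Return the hex SHA256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "36db16f5e4a3f738d51899f666ec4f6837029c6cac4ee2a953316788e98496c8"
# assert sha256sum("sibyl_tool-0.1.0.tar.gz") == expected
```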

File details

Details for the file sibyl_tool-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: sibyl_tool-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 171.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.9

File hashes

Hashes for sibyl_tool-0.1.0-py3-none-any.whl

  • SHA256: 665dab332773c802a512a8afeeb2348b1a03848ad688de099c052e9bd2d4bc54
  • MD5: 01643ca61837919ca58107bbc9106f55
  • BLAKE2b-256: 90fe2b5e48ad39695b893bcbb52470e366275c5b96c3b4789a2134ddc1738582

See more details on using hashes here.
