
Corruption of text datasets; model-independent and inspired by real-world corruption causes.

Project description

Corrupted-Text: Realistic Out-of-Distribution Texts


A Python library to generate out-of-distribution text datasets. Specifically, the library applies model-independent, commonplace corruptions (not model-specific, worst-case adversarial corruptions). It thus aims to enable benchmark studies of robustness against realistic outliers.

Implemented Corruptions

Most corruptions are based on a set of common words to which a corruptor is fitted. These common words may be domain-specific; hence, the corruptor can be fitted with a base dataset from which the most common words are extracted.
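
The fitting step can be pictured as a simple frequency count over the base dataset. The following is a minimal sketch of that idea, not the library's actual implementation (function name and tokenization are illustrative assumptions):

```python
import re
from collections import Counter

def most_common_words(texts, top_k=1000):
    """Extract the top_k most frequent words from a list of strings."""
    counts = Counter()
    for text in texts:
        # Naive tokenization: lowercase letters and apostrophes only.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return [word for word, _ in counts.most_common(top_k)]

common = most_common_words(["the movie was great", "the plot was thin"], top_k=3)
# "the" and "was" appear twice each and therefore rank first.
```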

Then, the following corruptions are randomly applied on a per-word basis:

  1. Bad Autocorrection: A word is replaced with another common word to which it has a small Levenshtein distance. This mimics wrong autocorrection, as performed, for example, by "intelligent" mobile phone keyboards.
  2. Bad Autocompletion: A word is replaced with another common word that shares its starting letters. This mimics wrong autocompletion. If no common word sharing at least the first 3 letters is found, a bad autocorrection is attempted instead.
  3. Bad Synonym: A word is replaced with a synonym according to a naive, flat mapping extracted from WordNet, ignoring the context. This mimics dictionary-based translations, which are often wrong. This corruption assumes an English-language dataset.
  4. Typo: A single letter is replaced with another, randomly chosen letter.

At most one corruption is applied to any word, i.e., corruptions are not stacked on top of each other.
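
The per-word sampling described above can be sketched as follows. This is a hedged illustration of the scheme, not the library's internals; only the typo corruption is implemented here, and the other three would plug in the same way:

```python
import random

def typo(word, rng):
    """Replace a single letter with a randomly chosen letter (corruption 4)."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i + 1:]

def corrupt_words(words, severity, seed=0):
    """Corrupt roughly `severity` of the words, at most once per word."""
    rng = random.Random(seed)
    out = []
    for word in words:
        if rng.random() < severity:
            out.append(typo(word, rng))  # pick one corruption; never stack them
        else:
            out.append(word)
    return out
```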

The severity, a parameter in the half-open interval ]0, 1], steers how many corruptions are applied. It roughly corresponds to the fraction of words that are corrupted (only roughly, as not all bad autocompletion attempts succeed, and as bad synonyms sometimes consist of multiple words, which increases the number of words in the text).

Optionally, users can assign a weight to each corruption type, steering how often it is applied.
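
Weighted selection among corruption types can be pictured as a proportional draw. The weight names and values below are hypothetical assumptions for illustration; the library's actual parameter names may differ:

```python
import random

# Hypothetical weights: typos drawn twice as often as bad autocorrections.
weights = {"autocorrect": 1.0, "autocomplete": 1.0, "synonym": 0.5, "typo": 2.0}

def pick_corruption(rng, weights):
    """Sample a corruption type proportionally to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(42)
counts = {name: 0 for name in weights}
for _ in range(10000):
    counts[pick_corruption(rng, weights)] += 1
# "typo" is drawn roughly twice as often as "autocorrect".
```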

Accuracies

The following table shows the accuracy of a regular, simple transformer model on the IMDB sentiment classification dataset. Clearly, the higher the chosen corruption severity, the lower the model accuracy.

Severity | 0 (*) | 0.1  | 0.3  | 0.5  | 0.7  | 0.9  | 1 (max)
Accuracy | 0.87  | 0.81 | 0.78 | 0.75 | 0.71 | 0.66 | 0.64

(*) No corruption, original test set.

Installation

It's as simple as pip install corrupted-text.

You'll need Python >= 3.7.

Usage

Usage is very straightforward. The following example shows how to corrupt the IMDB sentiment classification dataset.

You can also run the example in Colab.

import corrupted_text  # pip install corrupted-text
import logging 
from datasets import load_dataset # pip install datasets

# Enable Detailed Logging
logging.basicConfig(level=logging.INFO)

# Load the dataset (we use huggingface-datasets, but any list of strings is fine).
nominal_train = load_dataset("imdb", split="train")["text"]
nominal_test = load_dataset("imdb", split="test")["text"]

# Fit a corruptor (we fit on the training and test set,
#   but as this takes a while, you'd want to choose a smaller subset for larger datasets)
corruptor = corrupted_text.TextCorruptor(base_dataset=nominal_test + nominal_train,
                                         cache_dir=".mycache")

# Corrupt the test set with severity 0.5. The result is again a list of corrupted strings.
imdb_corrupted = corruptor.corrupt(nominal_test, severity=0.5, seed=1)
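
Since severity only roughly matches the fraction of corrupted words, a quick sanity check is to compare each corrupted text against its original. The helper below is a standalone sketch, not part of the library:

```python
def changed_fraction(original, corrupted):
    """Fraction of word positions that differ between two texts.
    Extra words (e.g. from multi-word synonyms) count as changes."""
    a, b = original.split(), corrupted.split()
    changed = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return changed / max(len(a), 1)

# e.g. changed_fraction("a great movie", "a groat movie") -> 1/3
```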

Citation

@inproceedings{Weiss2022SimpleTip, 
  title={Simple Techniques Work Surprisingly Well for Neural Network Test Prioritization and Active Learning (Replication Paper)},
  author={Weiss, Michael and Tonella, Paolo}, 
  booktitle={Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis},
  year={2022}
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

corrupted-text-0.2.0.tar.gz (12.9 kB)

Uploaded Source

Built Distribution

corrupted_text-0.2.0-py3-none-any.whl (11.4 kB)

Uploaded Python 3

File details

Details for the file corrupted-text-0.2.0.tar.gz.

File metadata

  • Download URL: corrupted-text-0.2.0.tar.gz
  • Upload date:
  • Size: 12.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.8.2 requests/2.27.1 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.64.0 CPython/3.8.10

File hashes

Hashes for corrupted-text-0.2.0.tar.gz
Algorithm Hash digest
SHA256 ff43fd3e1c96607fd645cbb9c32132f51e639e4dbbb10f499ca03b9baf1dd1d0
MD5 315155dc9cc9f3bb3ae5017665fb28ff
BLAKE2b-256 3596d164116baba2a0190603c59c0a2eaa73b7f76566150ace4aa61c5bab095e

See more details on using hashes here.

File details

Details for the file corrupted_text-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: corrupted_text-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 11.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.8.2 requests/2.27.1 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.64.0 CPython/3.8.10

File hashes

Hashes for corrupted_text-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c35dcebf43b454aef450544108b37a5e08f4c833f4e0c3b5ee80c1844ef9cf20
MD5 aac05ba6e109e30f4d625d0de6f074aa
BLAKE2b-256 85e5aec3aa3ff91c097e6fa8853f4c7c239244310545e68c4573b0991a7bb4a5

See more details on using hashes here.
