
Text aspects for NLP models

Project description



Corrupt input text to test the robustness of NLP models.
For details refer to https://nlp-demo.readthedocs.io

Installation

pip install wild-nlp
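
Note that the distribution name on PyPI (wild-nlp) differs from the package you import (wildnlp, as in the usage example below). A quick post-install check:

python -c "import wildnlp"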

Supported aspects

Altogether, we defined and implemented 11 aspects of text corruption.

  1. Articles

    Randomly removes articles or swaps them for incorrect ones.

  2. Digits2Words

    Converts numbers into words. Handles floating-point numbers as well.

  3. Misspellings

    Misspells words appearing in the Wikipedia lists of:

    • commonly misspelled English words
    • homophones
  4. Punctuation

    Randomly adds or removes specified punctuation marks.

  5. QWERTY

    Simulates errors made while typing on a QWERTY keyboard.

  6. RemoveChar

    Randomly removes:

    • characters from words or
    • white spaces from sentences
  7. SentimentMasking

    Replaces a random, single character with, for example, an asterisk.

  8. Swap

    Randomly swaps two characters within a word, excluding punctuation.

  9. Change char

    Randomly changes characters according to a chosen dictionary; the default, 'ocr', simulates simple OCR errors.

  10. White spaces

    Randomly adds or removes white spaces (specified as a parameter).

  11. Sub string

    Randomly adds a substring to simulate more complex signs.

All aspects can be chained together with the wildnlp.aspects.utils.compose function, as shown in the sketch below and in the Usage section.
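
The result of compose appears to be a plain callable on text, so it can be applied to a single string as well as to a whole dataset. A minimal sketch; the import path wildnlp.aspects for QWERTY and Swap is an assumption, so adjust it to the documentation if it differs:

from wildnlp.aspects import QWERTY, Swap  # assumed import path
from wildnlp.aspects.utils import compose

# Aspects are applied in the order they are passed to compose:
# first QWERTY keyboard typos, then character swaps.
corrupt = compose(QWERTY(), Swap())
print(corrupt('The quick brown fox jumps over the lazy dog'))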

Supported datasets

Aspects can be applied to any text. Below is the list of datasets for which we have already implemented processing pipelines.

  1. CoNLL

    The CoNLL-2003 shared task data for language-independent named entity recognition.

  2. IMDB

    The IMDB dataset containing movie reviews for sentiment analysis. It consists of 50,000 reviews of two classes, negative and positive.

  3. SNLI

    The SNLI dataset supporting the task of natural language inference.

  4. SQuAD

    The SQuAD dataset for the machine comprehension task.
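
A hedged sketch of one such pipeline, following the SampleDataset pattern from the Usage section below. The CoNLL class name matches the list above, but the load/apply signatures and the file path are assumptions, not verified API:

from wildnlp.aspects.dummy import Reverser
from wildnlp.datasets import CoNLL  # assumed dataset class

dataset = CoNLL()
dataset.load('conll_train.txt')       # hypothetical input path
modified = dataset.apply(Reverser())  # corrupt every text in the dataset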

Usage

from wildnlp.aspects.dummy import Reverser, PigLatin
from wildnlp.aspects.utils import compose
from wildnlp.datasets import SampleDataset

# Create a dataset object and load the dataset
dataset = SampleDataset()
dataset.load()

# Create a composed corruptor function.
# Functions will be applied in the same order they appear.
composed = compose(Reverser(), PigLatin())

# Apply the function to the dataset
modified = dataset.apply(composed)
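
If compose accepts arbitrary callables mapping text to text (a guess based on the example above, not confirmed API), a hand-written corruptor can be chained with the built-in aspects:

from wildnlp.aspects.dummy import Reverser
from wildnlp.aspects.utils import compose

def shout(text):
    # Hypothetical custom aspect: upper-cases the whole text.
    return text.upper()

# Reverser runs first, then the custom callable.
composed = compose(Reverser(), shout)
modified = dataset.apply(composed)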

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wild-nlp-1.0.2.tar.gz (44.9 kB)

Uploaded Source

Built Distribution

wild_nlp-1.0.2-py3-none-any.whl (53.3 kB)

Uploaded Python 3

File details

Details for the file wild-nlp-1.0.2.tar.gz.

File metadata

  • Download URL: wild-nlp-1.0.2.tar.gz
  • Upload date:
  • Size: 44.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.14.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/3.6.8

File hashes

Hashes for wild-nlp-1.0.2.tar.gz
Algorithm Hash digest
SHA256 def51dce4d5be1644b1109798631e75e780741a7effb99ea9ecb1a1b4a860031
MD5 58a3b292d0824ed743fba421013c6b5c
BLAKE2b-256 8200a656ff3a918c6b83bff6966f99a88e523a07685f1a0001dddd93f3c7bcbb

See the PyPI documentation for more details on using hashes.

File details

Details for the file wild_nlp-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: wild_nlp-1.0.2-py3-none-any.whl
  • Upload date:
  • Size: 53.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.14.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/3.6.8

File hashes

Hashes for wild_nlp-1.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 a7105880ba002f3bb0a02340a945230fe117863316bf0f80ad1218b07a98099d
MD5 546c18a0bc18bbff626901772ab23a91
BLAKE2b-256 735516cac5d14cb71bfc31297e3d12662ab7b11bf2dd8ec4e79c648255cb1bdc

See the PyPI documentation for more details on using hashes.
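
Published digests can be checked locally with Python's standard hashlib. Only the file name and SHA256 value above come from this page; the rest is a generic sketch:

import hashlib

expected = 'a7105880ba002f3bb0a02340a945230fe117863316bf0f80ad1218b07a98099d'

# Hash the downloaded wheel and compare against the published digest.
with open('wild_nlp-1.0.2-py3-none-any.whl', 'rb') as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == expected, 'SHA256 mismatch: the file may be corrupted'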
