Natural language processing augmentation library for deep neural networks

nlpaug

This Python library helps you augment NLP data for your machine learning projects. Visit this introduction to learn about data augmentation in NLP. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters together.

Starter Guides

Augmenter

| Target | Augmenter | Action | Description |
|---|---|---|---|
| Character | RandomAug | insert | Insert character randomly |
| Character | RandomAug | substitute | Substitute character randomly |
| Character | RandomAug | swap | Swap character randomly |
| Character | RandomAug | delete | Delete character randomly |
| Character | OcrAug | substitute | Simulate OCR engine error |
| Character | KeyboardAug | substitute | Simulate keyboard distance error |
| Word | RandomWordAug | swap | Swap word randomly |
| Word | RandomWordAug | delete | Delete word randomly |
| Word | SpellingAug | substitute | Substitute word according to a spelling mistake dictionary |
| Word | WordNetAug | substitute | Substitute word according to WordNet synonyms |
| Word | WordEmbsAug | insert | Insert word randomly from word2vec, GloVe or fasttext dictionary |
| Word | WordEmbsAug | substitute | Substitute word based on word2vec, GloVe or fasttext embeddings |
| Word | TfIdfAug | insert | Insert word randomly using a trained TF-IDF model |
| Word | TfIdfAug | substitute | Substitute word based on TF-IDF score |
| Word | BertAug | insert | Insert word by feeding the surrounding words to the BERT language model |
| Word | BertAug | substitute | Substitute word by feeding the surrounding words to the BERT language model |
| Spectrogram | FrequencyMaskingAug | substitute | Set a block of values to zero along the frequency dimension |
| Spectrogram | TimeMaskingAug | substitute | Set a block of values to zero along the time dimension |
| Audio | NoiseAug | substitute | Inject noise |
| Audio | PitchAug | substitute | Adjust the audio's pitch |
| Audio | ShiftAug | substitute | Shift the time dimension forward/backward |
| Audio | SpeedAug | substitute | Adjust the audio's speed |
| Audio | CropAug | delete | Delete a segment of the audio |
| Audio | LoudnessAug | substitute | Adjust the audio's volume |
| Audio | MaskAug | substitute | Mask a segment of the audio |
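
To make the four character-level actions (insert, substitute, swap, delete) concrete, here is a minimal pure-Python sketch that applies one of them at a random position. It is not nlpaug's implementation: the function name `random_char_action` and its parameters are hypothetical and exist only for illustration.

```python
import random
import string


def random_char_action(text, action, rng=None):
    """Apply one random character-level edit to `text`.

    Hypothetical helper for illustration only; not part of nlpaug.
    """
    if rng is None:
        rng = random.Random()
    if not text:
        return text
    chars = list(text)
    i = rng.randrange(len(chars))
    if action == "insert":
        # Add a random lowercase letter before position i.
        chars.insert(i, rng.choice(string.ascii_lowercase))
    elif action == "substitute":
        # Replace the character at position i with a random lowercase letter.
        chars[i] = rng.choice(string.ascii_lowercase)
    elif action == "swap":
        # Exchange position i with a neighbour (right one, or left if i is last).
        if len(chars) > 1:
            j = i + 1 if i + 1 < len(chars) else i - 1
            chars[i], chars[j] = chars[j], chars[i]
    elif action == "delete":
        # Drop the character at position i.
        del chars[i]
    else:
        raise ValueError("unknown action: {}".format(action))
    return "".join(chars)
```

Passing a seeded `random.Random` instance as `rng` makes the edits reproducible, which is useful when comparing augmented datasets between runs.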

Flow

| Pipeline | Description |
|---|---|
| Sequential | Apply a list of augmentation functions sequentially |
| Sometimes | Apply some augmentation functions randomly |
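
The difference between the two pipelines can be sketched in a few lines of plain Python. This is an illustration of the idea, not nlpaug's own Flow classes: the function names `sequential` and `sometimes`, the parameter `p`, and the use of plain callables as augmenters are all assumptions made for this example.

```python
import random


def sequential(augmenters):
    """Return a function that applies every augmenter in order."""
    def run(data):
        for aug in augmenters:
            data = aug(data)
        return data
    return run


def sometimes(augmenters, p=0.5, rng=None):
    """Return a function that applies each augmenter with probability `p`."""
    if rng is None:
        rng = random.Random()

    def run(data):
        for aug in augmenters:
            # Each augmenter is independently kept or skipped.
            if rng.random() < p:
                data = aug(data)
        return data
    return run


# Usage: any callable that maps text to text can act as an augmenter here.
flow = sequential([str.lower, str.strip])
print(flow("  Hello World  "))  # prints "hello world"
```

With `p=1.0` the `sometimes` sketch behaves like `sequential`, and with `p=0.0` it returns its input unchanged.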

Installation

The library supports Python 3.5+ on Linux and Windows.

To install the library:

pip install nlpaug

Or install the latest version (including BETA features) directly from GitHub:

pip install git+https://github.com/makcedward/nlpaug.git

If you use BertAug, install the following dependencies as well:

pip install pytorch_pretrained_bert torch

If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first:

from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.') # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # Download fasttext model

Recent Changes

BETA Aug 16, 2019

  • Add new augmenters (CropAug, LoudnessAug, MaskAug)
  • QwertyAug is deprecated. It will be replaced by KeyboardAug
  • Remove StopWordsAug. It will be replaced by RandomWordAug
  • Code refactoring
  • Add model download functions for word2vec, GloVe and fasttext

0.0.6 Jul 29, 2019:

See changelog for more details.

Test

Word2vec, GloVe and fasttext models are used for word insertion and substitution, so the model files are required to run the test cases. Add a ".env" file to the root directory with the following content:
    -   MODEL_DIR={MODEL FILE PATH}
The model folder should have the following structure:
    -- root directory
        - glove.6B.50d.txt
        - GoogleNews-vectors-negative300.bin
        - wiki-news-300d-1M.vec
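
Before running the tests, it can help to confirm that the ".env" file and the model folder are set up as described. The following is a minimal sketch using only the standard library; the helper names `read_model_dir` and `missing_model_files` are hypothetical and are not part of nlpaug or its test suite.

```python
import os

# File names taken from the folder structure described above.
EXPECTED_FILES = [
    "glove.6B.50d.txt",
    "GoogleNews-vectors-negative300.bin",
    "wiki-news-300d-1M.vec",
]


def read_model_dir(env_path=".env"):
    """Return the MODEL_DIR value from a simple KEY=VALUE file, or None."""
    with open(env_path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            if key.strip() == "MODEL_DIR":
                return value.strip()
    return None


def missing_model_files(model_dir):
    """List the expected model files that are not present in `model_dir`."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(model_dir, name))]
```

Calling `missing_model_files(read_model_dir())` from the root directory returns an empty list when all three model files are in place.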

Research Reference

Some of the above augmenters are inspired by the following research papers. However, the implementations do not always follow the originals, for various reasons. If you need the original implementation, please refer to the original source code.

| Augmenter | Inspired by |
|---|---|
| RandomAug, SpellingAug | Y. Belinkov and Y. Bisk. Synthetic and Natural Noise Both Break Neural Machine Translation. 2017 |
| RandomAug | J. Ebrahimi, A. Rao, D. Lowd and D. Dou. HotFlip: White-Box Adversarial Examples for Text Classification. 2018 |
| RandomAug, RandomWordAug | J. Ebrahimi, D. Lowd and D. Dou. On Adversarial Examples for Character-Level Neural Machine Translation. 2018 |
| RandomAug, KeyboardAug | D. Pruthi, B. Dhingra and Z. C. Lipton. Combating Adversarial Misspellings with Robust Word Recognition. 2019 |
| RandomAug, RandomWordAug | T. Niu and M. Bansal. Adversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models. 2018 |
| RandomWordAug, WordNetAug | P. Minervini and S. Riedel. Adversarially Regularising Neural NLI Models to Integrate Logical Background Knowledge. 2018 |
| WordNetAug | X. Zhang, J. Zhao and Y. LeCun. Character-level Convolutional Networks for Text Classification. 2015 |
| WordNetAug | C. Coulombe. Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs. 2018 |
| TfIdfAug | Q. Xie, Z. Dai, E. Hovy, M. T. Luong and Q. V. Le. Unsupervised Data Augmentation. 2019 |
| WordEmbsAug | W. Y. Wang and D. Yang. That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets. 2015 |
| BertAug | S. Kobayashi. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relation. 2018 |
| FrequencyMaskingAug, TimeMaskingAug | D. S. Park, W. Chan, Y. Zhang, C. C. Chiu, B. Zoph, E. D. Cubuk and Q. V. Le. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. 2019 |

Data Source

Data captured from the internet for building augmenters and test cases.

See data source for more details.
