
Natural language processing augmentation library for deep neural networks

nlpaug

This Python library helps you augment NLP data for your machine learning projects. Visit this introduction to learn about data augmentation in NLP. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters together.

Features

  • Generate synthetic data for improving model performance without manual effort
  • Simple, easy-to-use and lightweight library. Augment data in 3 lines of code
  • Plug and play with any neural network framework (e.g. PyTorch, TensorFlow)
  • Supports textual and audio input

[Figure: Textual Data Augmentation Example]

[Figure: Acoustic Data Augmentation Example]


| Section | Description |
|---|---|
| Quick Demo | How to use this library |
| Augmenter | Introduction to all available augmentation methods |
| Installation | How to install this library |
| Recent Changes | Latest enhancements |
| Extension Reading | More real-life examples and research |
| Reference | References to external resources such as data or models |

Quick Demo

Augmenter

| Type | Target | Augmenter | Action | Description |
|---|---|---|---|---|
| Textual | Character | KeyboardAug | substitute | Simulate keyboard distance error |
| Textual | Character | OcrAug | substitute | Simulate OCR engine error |
| Textual | Character | RandomCharAug | insert, substitute, swap, delete | Apply augmentation randomly |
| Textual | Word | AntonymAug | substitute | Substitute a word with its opposite according to WordNet antonyms |
| Textual | Word | ContextualWordEmbsAug | insert, substitute | Feed surrounding words to a BERT, DistilBERT, RoBERTa or XLNet language model to find the most suitable word for augmentation |
| Textual | Word | RandomWordAug | swap, delete | Apply augmentation randomly |
| Textual | Word | SpellingAug | substitute | Substitute a word according to a spelling mistake dictionary |
| Textual | Word | SplitAug | split | Split one word into two words randomly |
| Textual | Word | SynonymAug | substitute | Substitute a similar word according to WordNet/PPDB synonyms |
| Textual | Word | TfIdfAug | insert, substitute | Use TF-IDF to decide how a word should be augmented |
| Textual | Word | WordEmbsAug | insert, substitute | Leverage word2vec, GloVe or fasttext embeddings to apply augmentation |
| Textual | Sentence | ContextualWordEmbsForSentenceAug | insert | Insert a sentence according to XLNet, GPT2 or DistilGPT2 prediction |
| Signal | Audio | CropAug | delete | Delete a segment of the audio |
| Signal | Audio | LoudnessAug | substitute | Adjust the audio's volume |
| Signal | Audio | MaskAug | substitute | Mask a segment of the audio |
| Signal | Audio | NoiseAug | substitute | Inject noise |
| Signal | Audio | PitchAug | substitute | Adjust the audio's pitch |
| Signal | Audio | ShiftAug | substitute | Shift the time dimension forward/backward |
| Signal | Audio | SpeedAug | substitute | Adjust the audio's speed |
| Signal | Audio | VtlpAug | substitute | Change the vocal tract |
| Signal | Spectrogram | FrequencyMaskingAug | substitute | Set a block of values to zero along the frequency dimension |
| Signal | Spectrogram | TimeMaskingAug | substitute | Set a block of values to zero along the time dimension |
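
To make the "keyboard distance error" idea concrete, here is a minimal from-scratch sketch of a character-level substitute action. It is not nlpaug's implementation: the `NEIGHBORS` map, the `keyboard_substitute` function and its `prob` and `seed` parameters are all hypothetical names chosen for illustration, and the key map is deliberately incomplete.

```python
import random

# Hypothetical, partial QWERTY neighbor map (illustration only, not nlpaug's data).
NEIGHBORS = {
    "a": "qwsz",
    "e": "wsdr",
    "i": "ujko",
    "o": "iklp",
    "t": "rfgy",
    "n": "bhjm",
    "s": "awedxz",
}


def keyboard_substitute(text, prob=0.3, seed=None):
    """Replace some characters with a neighboring key to mimic typing errors."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        neighbors = NEIGHBORS.get(ch.lower())
        # Only characters with known neighbors can be substituted.
        if neighbors and rng.random() < prob:
            out.append(rng.choice(neighbors))
        else:
            out.append(ch)
    return "".join(out)


original = "the quick brown fox jumps over the lazy dog"
augmented = keyboard_substitute(original, prob=0.3, seed=0)
print(original)
print(augmented)
```

The output keeps the same length as the input because the action is a substitution; insert, swap and delete actions would change the character sequence in other ways.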

Flow

| Type | Flow | Description |
|---|---|---|
| Pipeline | Sequential | Apply a list of augmentation functions sequentially |
| Pipeline | Sometimes | Apply some augmentation functions randomly |
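
The two flow types can be sketched in plain Python as follows. These classes are stand-ins written for illustration, not the library's own classes: the constructor arguments, the `prob` and `seed` parameters, and the two toy augmenter functions are assumptions, and the real Sometimes flow may choose augmenters differently.

```python
import random


class Sequential:
    """Apply every augmenter in order, feeding each output to the next."""

    def __init__(self, augmenters):
        self.augmenters = list(augmenters)

    def augment(self, data):
        for aug in self.augmenters:
            data = aug(data)
        return data


class Sometimes:
    """Apply each augmenter independently with probability `prob` (assumed behaviour)."""

    def __init__(self, augmenters, prob=0.5, seed=None):
        self.augmenters = list(augmenters)
        self.prob = prob
        self.rng = random.Random(seed)

    def augment(self, data):
        for aug in self.augmenters:
            if self.rng.random() < self.prob:
                data = aug(data)
        return data


def swap_first_two_words(text):
    """Toy augmenter: swap the first two words."""
    words = text.split()
    if len(words) >= 2:
        words[0], words[1] = words[1], words[0]
    return " ".join(words)


def drop_last_word(text):
    """Toy augmenter: delete the last word."""
    words = text.split()
    return " ".join(words[:-1]) if len(words) > 1 else text


flow = Sequential([swap_first_two_words, drop_last_word])
print(flow.augment("data augmentation helps small datasets"))
# -> augmentation data helps small
```

Treating an augmenter as any callable from data to data keeps the pipeline independent of whether the input is text or audio.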

Installation

The library supports Python 3.5+ on Linux and Windows.

To install the library:

pip install nlpaug numpy matplotlib python-dotenv

Or install the latest version (including BETA features) directly from GitHub:

pip install git+https://github.com/makcedward/nlpaug.git numpy matplotlib python-dotenv

If you use ContextualWordEmbsAug or ContextualWordEmbsForSentenceAug, install the following dependencies as well (the quotes stop the shell from treating `>` as output redirection):

pip install "torch>=1.2.0" "transformers>=2.0.0"

If you use AntonymAug or SynonymAug, install the following dependency as well:

pip install "nltk>=3.4.5"

If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first:

from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.') # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # Download fasttext model

If you use any of the audio augmenters, install the following dependency as well:

pip install "librosa>=0.7.1"
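
Before picking an augmenter, it can help to check which optional packages are importable in the current environment. The sketch below uses only the standard library; the feature-to-package mapping is copied from the installation notes above, while `OPTIONAL_DEPENDENCIES` and `missing_dependencies` are hypothetical helper names and not part of nlpaug. It checks importability only, not the minimum versions.

```python
import importlib.util

# Optional packages per feature, taken from the installation notes.
OPTIONAL_DEPENDENCIES = {
    "ContextualWordEmbsAug": ["torch", "transformers"],
    "ContextualWordEmbsForSentenceAug": ["torch", "transformers"],
    "AntonymAug": ["nltk"],
    "SynonymAug": ["nltk"],
    "audio augmenters": ["librosa"],
}


def missing_dependencies(feature):
    """Return the packages required by `feature` that cannot be imported."""
    required = OPTIONAL_DEPENDENCIES.get(feature, [])
    return [name for name in required if importlib.util.find_spec(name) is None]


for feature in OPTIONAL_DEPENDENCIES:
    missing = missing_dependencies(feature)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print("{}: {}".format(feature, status))
```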

Recent Changes

0.0.12 (Feb 5, 2020)

  • ContextualWordEmbsAug supports bert-base-multilingual-uncased (for non English inputs)
  • Fix missing library dependency #74
  • Fix single token error when using RandomWordAug #76
  • Fix replacing character in RandomCharAug error #77
  • Enhance word's augmenter to support regular expression stopwords #81
  • Enhance char's augmenter to support regular expression stopwords #86
  • KeyboardAug supports Thai language #92
  • Fix word casing issue #82

0.0.11 (Dec 6, 2019)

  • Support color noise (pink, blue, red and violet noise) in audio's NoiseAug
  • Support given background noise in audio's NoiseAug
  • Support injecting noise into only a portion of the audio in audio's NoiseAug
  • Introduce zone and coverage to all audio augmenters, so that only a portion of the audio input is augmented
  • Add VTLP augmentation methods (Audio's augmenter)
  • Adopt latest transformer's interface #59
  • Support RoBERTa (including DistilRoBERTa) and DistilBERT (ContextualWordEmbsAug)
  • Support DistilGPT2 (ContextualWordEmbsForSentenceAug)
  • Fix librosa hard dependency #62
  • Introduce optimize attribute to ContextualWordEmbsForSentenceAug #63
  • Optimize word selection for ContextualWordEmbsAug and ContextualWordEmbsForSentenceAug (Speed up around 30%)
  • Add retry mechanism into ContextualWordEmbsAug insert action #68

See changelog for more details.

Extension Reading

Reference

This library uses external data (e.g. captured from the internet), research (e.g. the ideas behind the augmenters) and models (e.g. pre-trained models). See data source for more details.

Citing

@misc{ma2019nlpaug,
  title={NLP Augmentation},
  author={Edward Ma},
  howpublished={\url{https://github.com/makcedward/nlpaug}},
  year={2019}
}

Contributions (Supporting Other Languages)

  • sakares: Add Thai support to KeyboardAug
