Natural language processing augmentation library for deep neural networks

Project description

nlpaug

nlpaug is a Python library for augmenting NLP data in your machine learning projects. Visit this introduction to learn about data augmentation in NLP. An Augmenter is the basic unit of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters together.
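The relationship between augmenters and flows can be sketched in plain Python. This is only an illustration of the idea, not nlpaug's implementation: the toy augmenters `swap_first_two_words` and `drop_last_word` and the helpers `sequential` and `sometimes` are hypothetical names that loosely mirror the Sequential and Sometimes pipelines listed in the Augmenter section.

```python
import random
from typing import Callable, List

# Assumption for this sketch: an "augmenter" is just a function from text to text.
Augmenter = Callable[[str], str]


def swap_first_two_words(text: str) -> str:
    """Toy augmenter: swap the first two words, if there are at least two."""
    words = text.split()
    if len(words) >= 2:
        words[0], words[1] = words[1], words[0]
    return " ".join(words)


def drop_last_word(text: str) -> str:
    """Toy augmenter: delete the last word, keeping at least one word."""
    words = text.split()
    return " ".join(words[:-1]) if len(words) > 1 else text


def sequential(augmenters: List[Augmenter]) -> Augmenter:
    """Flow that applies every augmenter, in order."""
    def run(text: str) -> str:
        for aug in augmenters:
            text = aug(text)
        return text
    return run


def sometimes(augmenters: List[Augmenter], p: float, rng: random.Random) -> Augmenter:
    """Flow that applies each augmenter independently with probability p."""
    def run(text: str) -> str:
        for aug in augmenters:
            if rng.random() < p:
                text = aug(text)
        return text
    return run


flow = sequential([swap_first_two_words, drop_last_word])
print(flow("the quick brown fox"))  # quick the brown
```

Treating a flow as just another augmenter means flows can be nested, for example a `sometimes` flow placed inside a `sequential` one.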

Features

  • Generate synthetic data for improving model performance without manual effort
  • Simple, easy-to-use and lightweight library. Augment data in 3 lines of code
  • Plug and play with any neural network framework (e.g. PyTorch, TensorFlow)
  • Support textual and audio input

Textual Data Augmentation Example


Acoustic Data Augmentation Example


Section | Description
Quick Demo | How to use this library
Augmenter | Introduction to all available augmentation methods
Installation | How to install this library
Recent Changes | Latest enhancements
Extension Reading | More real-life examples and research
Reference | References to external resources such as data or models

Quick Demo

Augmenter

Type | Target | Augmenter | Action | Description
Textual | Character | RandomAug | insert, substitute, swap, delete | Apply augmentation randomly
Textual | Character | OcrAug | substitute | Simulate OCR engine errors
Textual | Character | KeyboardAug | substitute | Simulate keyboard distance errors
Textual | Word | RandomWordAug | swap, delete | Apply augmentation randomly
Textual | Word | SpellingAug | substitute | Substitute words according to a spelling mistake dictionary
Textual | Word | SynonymAug | substitute | Substitute similar words according to WordNet/PPDB synonyms
Textual | Word | AntonymAug | substitute | Substitute words of opposite meaning according to WordNet antonyms
Textual | Word | SplitAug | split | Split one word into two words randomly
Textual | Word | WordEmbsAug | insert, substitute | Leverage word2vec, GloVe or fasttext embeddings to apply augmentation
Textual | Word | TfIdfAug | insert, substitute | Use TF-IDF to decide how a word should be augmented
Textual | Word | ContextualWordEmbsAug | insert, substitute | Feed surrounding words to a BERT or XLNet language model to find the most suitable word for augmentation
Textual | Sentence | ContextualWordEmbsForSentenceAug | insert | Insert a sentence according to XLNet or GPT2 prediction
Signal | Audio | NoiseAug | substitute | Inject noise
Signal | Audio | PitchAug | substitute | Adjust the audio's pitch
Signal | Audio | ShiftAug | substitute | Shift the time dimension forward/backward
Signal | Audio | SpeedAug | substitute | Adjust the audio's speed
Signal | Audio | CropAug | delete | Delete a segment of the audio
Signal | Audio | LoudnessAug | substitute | Adjust the audio's volume
Signal | Audio | MaskAug | substitute | Mask a segment of the audio
Signal | Spectrogram | FrequencyMaskingAug | substitute | Set a block of values to zero along the frequency dimension
Signal | Spectrogram | TimeMaskingAug | substitute | Set a block of values to zero along the time dimension
Pipeline | All | Sequential | | Apply a list of augmentation functions sequentially
Pipeline | All | Sometimes | | Apply some augmentation functions randomly
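To make the character-level actions listed for RandomAug (insert, substitute, swap, delete) concrete, here is a minimal pure-Python sketch. It is an illustration under simplifying assumptions (lowercase ASCII replacements, one edit per call) and not nlpaug's own code; the function name `random_char_aug` is hypothetical.

```python
import random
import string


def random_char_aug(word: str, action: str, rng: random.Random) -> str:
    """Apply one character-level action at a random position of `word`."""
    if not word:
        return word
    i = rng.randrange(len(word))
    if action == "insert":
        # Add a random lowercase letter before position i.
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if action == "substitute":
        # Replace the character at position i with a random lowercase letter.
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i + 1:]
    if action == "delete":
        # Remove the character at position i.
        return word[:i] + word[i + 1:]
    if action == "swap":
        if len(word) < 2:
            return word
        j = rng.randrange(len(word) - 1)  # swap positions j and j + 1
        chars = list(word)
        chars[j], chars[j + 1] = chars[j + 1], chars[j]
        return "".join(chars)
    raise ValueError(f"unknown action: {action}")


rng = random.Random(0)  # seeded so repeated runs give the same output
for action in ("insert", "substitute", "swap", "delete"):
    print(action, random_char_aug("augment", action, rng))
```

Passing the random generator explicitly keeps the sketch reproducible, which helps when comparing augmented datasets across runs.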

Installation

The library supports Python 3.5+ on Linux and Windows.

To install the library:

pip install nlpaug numpy matplotlib python-dotenv

Or install the latest version (including beta features) directly from GitHub:

pip install git+https://github.com/makcedward/nlpaug.git numpy matplotlib python-dotenv

If you use ContextualWordEmbsAug or ContextualWordEmbsForSentenceAug, install the following dependencies as well

pip install "torch>=1.2.0" "transformers>=2.0.0"

If you use AntonymAug or SynonymAug, install the following dependency as well

pip install "nltk>=3.4.5"

If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first

from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.') # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # Download fasttext model

If you use any of the audio augmenters, install the following dependency as well

pip install "librosa>=0.7.1"

Recent Changes

0.0.10 (Nov 4, 2019)

  • Add aug_max to control the maximum number of augmented items
  • Fix ContextualWordEmbsAug (for BERT) error when the input is longer than the maximum sequence length
  • Add substitute action to RandomWordAug
  • Fix ContextualWordEmbsAug error when no data is augmented
  • Support multi-threaded processing (CPU only) to speed up augmentation
  • Fix KeyboardAug error (#55)

See changelog for more details.
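As a rough illustration of what a cap such as aug_max does, the sketch below derives the number of items to augment from a proportion and then clamps it. Only `aug_max` is named in the notes above; the `aug_p` and `aug_min` parameters, their defaults, and the function name `augment_count` are assumptions made for this example and may not match the library's actual logic.

```python
def augment_count(size: int, aug_p: float, aug_min: int = 1, aug_max: int = 10) -> int:
    """Number of items to augment out of `size`, clamped to [aug_min, aug_max].

    Hypothetical helper: aug_p is the proportion of items to augment.
    """
    if size <= 0:
        return 0
    count = int(size * aug_p)      # proportional target, rounded down
    count = max(count, aug_min)    # augment at least aug_min items
    count = min(count, aug_max)    # but never more than aug_max
    return min(count, size)        # and never more than are available


print(augment_count(size=8, aug_p=0.3))    # 2
print(augment_count(size=200, aug_p=0.3))  # 10
print(augment_count(size=2, aug_p=0.3))    # 1
```

Without the upper bound, a purely proportional count grows with input length, so long inputs could be altered far more heavily than short ones.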

Extension Reading

Reference

This library draws on external data (e.g. collected from the internet), research (e.g. augmenter ideas from published work) and models (e.g. pre-trained models). See data source for more details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nlpaug-0.0.10.tar.gz (37.8 kB)

Uploaded Source

Built Distribution

nlpaug-0.0.10-py3-none-any.whl (83.2 kB)

Uploaded Python 3

File details

Details for the file nlpaug-0.0.10.tar.gz.

File metadata

  • Download URL: nlpaug-0.0.10.tar.gz
  • Upload date:
  • Size: 37.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.6.4

File hashes

Hashes for nlpaug-0.0.10.tar.gz
Algorithm | Hash digest
SHA256 | 595747b67ccd35300c7f086202a572822f5e041551fabbf5e20065eb9615a99b
MD5 | fb985f7d7563d0a6fb83fb1f62d63f8f
BLAKE2b-256 | 281aaf8e6f821d5e50f9eff44e59b62f677cb0044c1ebee1805cd2370f03f699

See more details on using hashes here.
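One way to check a downloaded archive against the SHA256 digest listed above is with Python's standard `hashlib` module. This is a minimal sketch: the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the local file path in the commented-out call is an assumption about where the archive was saved.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with a published digest, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# SHA256 digest listed above for nlpaug-0.0.10.tar.gz
EXPECTED = "595747b67ccd35300c7f086202a572822f5e041551fabbf5e20065eb9615a99b"

# Assumed local path; adjust to wherever the archive was downloaded.
# print(matches_published_hash("nlpaug-0.0.10.tar.gz", EXPECTED))
```

Reading in chunks keeps memory use constant regardless of file size, which matters little for a 37.8 kB archive but makes the helper reusable for large model files.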

File details

Details for the file nlpaug-0.0.10-py3-none-any.whl.

File metadata

  • Download URL: nlpaug-0.0.10-py3-none-any.whl
  • Upload date:
  • Size: 83.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.6.4

File hashes

Hashes for nlpaug-0.0.10-py3-none-any.whl
Algorithm | Hash digest
SHA256 | 4803295d955d4b59b68f8b0d5bef61c2850c8c49f66bd3a84b143e865fd394dc
MD5 | 0f950509094fe53d6d63e044ec5ebdcd
BLAKE2b-256 | 6e45ce353d60920cabe773de35ee8dac0989659c055540fa50eb0f6ac774e6f0
