Natural language processing augmentation library for deep neural networks

nlpaug

This Python library helps you augment NLP data for your machine learning projects. See the linked introduction to learn about data augmentation in NLP. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters together.

Starter Guides

Flow

| Pipeline | Description |
| --- | --- |
| Sequential | Apply list of augmentation functions sequentially |
| Sometimes | Apply some augmentation functions randomly |
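
To make the two pipeline types concrete, here is a minimal sketch written with plain Python callables. It is not nlpaug's implementation: the helper names `sequential`, `sometimes`, `upper_first` and `add_period` are hypothetical, and the per-augmenter probability `p` is an assumption about what "randomly" could mean.

```python
import random
from typing import Callable, List

Augment = Callable[[str], str]


def sequential(augmenters: List[Augment]) -> Augment:
    """Build a pipeline that applies every augmenter in order."""
    def run(text: str) -> str:
        for aug in augmenters:
            text = aug(text)
        return text
    return run


def sometimes(augmenters: List[Augment], p: float, rng: random.Random) -> Augment:
    """Build a pipeline that applies each augmenter with probability p (assumed)."""
    def run(text: str) -> str:
        for aug in augmenters:
            if rng.random() < p:
                text = aug(text)
        return text
    return run


# Hypothetical toy augmenters, used only for illustration.
def upper_first(text: str) -> str:
    return text[:1].upper() + text[1:]


def add_period(text: str) -> str:
    return text if text.endswith(".") else text + "."


pipeline = sequential([upper_first, add_period])
print(pipeline("hello world"))  # Hello world.

maybe = sometimes([upper_first, add_period], p=0.5, rng=random.Random(0))
print(maybe("hello world"))  # result depends on the random draws
```

Passing an explicit `random.Random` instance keeps the random pipeline reproducible, which helps when debugging augmented training data.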

Textual Augmenter

| Target | Augmenter | Action | Description |
| --- | --- | --- | --- |
| Character | RandomAug | insert | Insert character randomly |
| Character | RandomAug | substitute | Substitute character randomly |
| Character | RandomAug | swap | Swap character randomly |
| Character | RandomAug | delete | Delete character randomly |
| Character | OcrAug | substitute | Simulate OCR engine error |
| Character | KeyboardAug | substitute | Simulate keyboard distance error |
| Word | RandomWordAug | swap | Swap word randomly |
| Word | RandomWordAug | delete | Delete word randomly |
| Word | SpellingAug | substitute | Substitute word according to a spelling mistake dictionary |
| Word | SynonymAug | substitute | Substitute a similar word according to WordNet/PPDB synonyms |
| Word | AntonymAug | substitute | Substitute a word of opposite meaning according to WordNet antonyms |
| Word | SplitAug | split | Split one word into two words randomly |
| Word | WordEmbsAug | insert | Insert word randomly from a word2vec, GloVe or fasttext dictionary |
| Word | WordEmbsAug | substitute | Substitute word based on word2vec, GloVe or fasttext embeddings |
| Word | TfIdfAug | insert | Insert word randomly using a trained TF-IDF model |
| Word | TfIdfAug | substitute | Substitute word based on TF-IDF score |
| Word | ContextualWordEmbsAug | insert | Insert word by feeding the surrounding words to a BERT or XLNet language model |
| Word | ContextualWordEmbsAug | substitute | Substitute word by feeding the surrounding words to a BERT or XLNet language model |
| Sentence | ContextualWordEmbsForSentenceAug | insert | Insert sentence according to XLNet or GPT2 prediction |
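
The four character-level actions in the table (insert, substitute, swap, delete) can be illustrated with a short stdlib-only sketch. This is an assumption-laden illustration rather than the library's code: the function name `random_char_aug` is hypothetical, candidates are drawn from lowercase ASCII only, and swap always exchanges adjacent characters.

```python
import random
import string


def random_char_aug(word: str, action: str, rng: random.Random) -> str:
    """Apply one random character-level edit to `word` (illustrative only)."""
    if not word:
        return word
    chars = list(word)
    i = rng.randrange(len(chars))  # position to edit
    if action == "insert":
        chars.insert(i, rng.choice(string.ascii_lowercase))
    elif action == "substitute":
        chars[i] = rng.choice(string.ascii_lowercase)
    elif action == "swap":
        if len(chars) > 1:
            # Swap with the right neighbour, or the left one at the end.
            j = i + 1 if i + 1 < len(chars) else i - 1
            chars[i], chars[j] = chars[j], chars[i]
    elif action == "delete":
        del chars[i]
    else:
        raise ValueError(f"unknown action: {action}")
    return "".join(chars)


rng = random.Random(0)
for action in ("insert", "substitute", "swap", "delete"):
    print(action, "->", random_char_aug("augment", action, rng))
```

Augmenters such as OcrAug and KeyboardAug follow the same substitute pattern but, per the table, draw replacements from OCR confusions or neighbouring keys instead of a uniform alphabet.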

Signal Augmenter

| Target | Augmenter | Action | Description |
| --- | --- | --- | --- |
| Audio | NoiseAug | substitute | Inject noise |
| Audio | PitchAug | substitute | Adjust audio's pitch |
| Audio | ShiftAug | substitute | Shift time dimension forward/ backward |
| Audio | SpeedAug | substitute | Adjust audio's speed |
| Audio | CropAug | delete | Delete audio's segment |
| Audio | LoudnessAug | substitute | Adjust audio's volume |
| Audio | MaskAug | substitute | Mask audio's segment |
| Spectrogram | FrequencyMaskingAug | substitute | Set block of values to zero according to frequency dimension |
| Spectrogram | TimeMaskingAug | substitute | Set block of values to zero according to time dimension |
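
As a rough illustration of the two spectrogram rows, the sketch below zeroes a block of frequency bins or time frames in a spectrogram stored as a list of lists. It assumes rows are frequency bins and columns are time frames; the function names `frequency_mask` and `time_mask` and the explicit `start`/`width` arguments are hypothetical and are not nlpaug's API.

```python
from typing import List

# Assumed layout: rows are frequency bins, columns are time frames.
Spectrogram = List[List[float]]


def frequency_mask(spec: Spectrogram, start: int, width: int) -> Spectrogram:
    """Return a copy with frequency rows [start, start + width) set to zero."""
    return [
        [0.0] * len(row) if start <= f < start + width else list(row)
        for f, row in enumerate(spec)
    ]


def time_mask(spec: Spectrogram, start: int, width: int) -> Spectrogram:
    """Return a copy with time columns [start, start + width) set to zero."""
    return [
        [0.0 if start <= t < start + width else v for t, v in enumerate(row)]
        for row in spec
    ]


spec = [[1.0] * 5 for _ in range(4)]  # 4 frequency bins x 5 time frames

for row in frequency_mask(spec, start=1, width=2):
    print(row)  # rows 1 and 2 are all zeros
for row in time_mask(spec, start=3, width=2):
    print(row)  # columns 3 and 4 are zeros in every row
```

Both helpers return copies, so the original spectrogram stays available for producing several differently masked variants.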

Installation

The library supports Python 3.5+ on Linux and Windows.

To install the library:

pip install nlpaug numpy matplotlib python-dotenv

Or install the latest version (including BETA features) directly from GitHub:

pip install git+https://github.com/makcedward/nlpaug.git numpy matplotlib python-dotenv

If you use ContextualWordEmbsAug or ContextualWordEmbsForSentenceAug, install the following dependencies as well (the quotes keep the shell from treating `>` as a redirect):

pip install "torch>=1.2.0" "transformers>=2.0.0"

If you use AntonymAug or SynonymAug, install the following dependency as well:

pip install nltk

If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first:

from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.') # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # Download fasttext model

If you use any of the audio augmenters, install the following dependency as well:

pip install librosa

Recent Changes

0.0.9 Sep 30, 2019

  • Added Swap Mode (adjacent, middle and random) for RandomAug (character level)
  • Added SynonymAug (WordNet/ PPDB) and AntonymAug (WordNet)
  • WordNetAug is deprecated. Use SynonymAug instead
  • Introduced parameter n to return more than one augmented sample. The output format changes from text (or numpy) to a list of text (or numpy) if n > 1
  • Introduced parameter temperature in ContextualWordEmbsAug and ContextualWordEmbsForSentenceAug to control the randomness
  • aug_n parameter is deprecated and will be replaced by the top_k parameter
  • Fixed tokenization issue #48
  • Upgraded transformers dependency (or pytorch_transformer) to 2.0.0
  • Upgraded PyTorch dependency to 1.2.0
  • Added SplitAug
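
The changelog mentions `temperature` and `top_k` without describing how they interact. The sketch below shows one common way such parameters are used when sampling a word from model scores: keep the `top_k` best candidates, then sample from a temperature-scaled softmax. This is a generic illustration and an assumption, not nlpaug's implementation; the function name `sample_with_temperature` and the example scores are hypothetical.

```python
import math
import random
from typing import Dict


def sample_with_temperature(scores: Dict[str, float], temperature: float,
                            top_k: int, rng: random.Random) -> str:
    """Sample one candidate from the top_k scores using a scaled softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the top_k highest-scoring candidates.
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Divide scores by the temperature, then apply a softmax.
    # Subtracting the maximum avoids overflow in exp().
    scaled = [score / temperature for _, score in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    words = [word for word, _ in top]
    return rng.choices(words, weights=weights, k=1)[0]


# Hypothetical model scores for candidate replacement words.
scores = {"quick": 3.0, "fast": 2.5, "slow": 0.1}
rng = random.Random(0)

# Low temperature: the best candidate dominates.
print([sample_with_temperature(scores, 0.01, 3, rng) for _ in range(5)])
# Higher temperature with top_k=2: more variety, but "slow" is never chosen.
print([sample_with_temperature(scores, 2.0, 2, rng) for _ in range(5)])
```

In this scheme a lower temperature makes the output closer to always picking the best candidate, while a higher one spreads probability across the remaining candidates.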

See changelog for more details.

Source

This library uses data (e.g. captured from the internet), research (e.g. ideas behind the augmenters) and models (e.g. pre-trained models). See data source for more details.
