
Natural language processing augmentation library for deep neural networks


nlpaug

This Python library helps you augment NLP data for your machine learning projects. Visit this introduction to learn about data augmentation in NLP. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters together.

Starter Guides

Augmenter

| Target | Augmenter | Action | Description |
| --- | --- | --- | --- |
| Character | RandomAug | Insert | Insert character randomly |
| | | Substitute | Substitute character randomly |
| | | Swap | Swap character randomly |
| | | Delete | Delete character randomly |
| | OcrAug | Substitute | Simulate OCR engine error |
| | QwertyAug | Substitute | Simulate keyboard distance error |
| Word | RandomWordAug | Swap | Swap word randomly |
| | | Delete | Delete word randomly |
| | WordNetAug | Substitute | Substitute word with a WordNet synonym |
| | Word2vecAug | Insert | Insert word randomly from word2vec dictionary |
| | | Substitute | Substitute word based on word2vec embeddings |
| | GloVeAug | Insert | Insert word randomly from GloVe dictionary |
| | | Substitute | Substitute word based on GloVe embeddings |
| | FasttextAug | Insert | Insert word randomly from fasttext dictionary |
| | | Substitute | Substitute word based on fasttext embeddings |
| | BertAug | Insert | Insert word by feeding surrounding words to BERT language model |
| | | Substitute | Substitute word by feeding surrounding words to BERT language model |
| Spectrogram | FrequencyMaskingAug | Substitute | Set block of values to zero along the frequency dimension |
| | TimeMaskingAug | Substitute | Set block of values to zero along the time dimension |
| Audio | NoiseAug | Substitute | Inject noise |
| | PitchAug | Substitute | Adjust pitch |
| | ShiftAug | Substitute | Shift time dimension forward/backward |
| | SpeedAug | Substitute | Adjust speed of audio |
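
To make the character-level actions listed above concrete, here is a minimal pure-Python sketch of the four RandomAug actions (insert, substitute, swap, delete). It does not call nlpaug and is not the library's implementation. The function name `random_char_aug`, its parameters, and the restriction to lowercase ASCII letters are assumptions made only for illustration.

```python
import random
import string


def random_char_aug(text, action, rng=None):
    """Apply one random character-level edit to `text`.

    Hypothetical helper for illustration only; not part of nlpaug.
    """
    rng = rng or random.Random()
    if not text:
        return text
    chars = list(text)
    pos = rng.randrange(len(chars))  # position of the edit
    if action == "insert":
        chars.insert(pos, rng.choice(string.ascii_lowercase))
    elif action == "substitute":
        chars[pos] = rng.choice(string.ascii_lowercase)
    elif action == "swap":
        if len(chars) > 1:
            # Swap with the right neighbour; at the last position use the left one.
            other = pos + 1 if pos + 1 < len(chars) else pos - 1
            chars[pos], chars[other] = chars[other], chars[pos]
    elif action == "delete":
        del chars[pos]
    else:
        raise ValueError("unknown action: " + action)
    return "".join(chars)


rng = random.Random(0)  # seeded so repeated runs give the same edits
for action in ("insert", "substitute", "swap", "delete"):
    print(action, "->", random_char_aug("augmentation", action, rng))
```

Passing a seeded `random.Random` keeps the edits reproducible, which is convenient when comparing augmented datasets between runs.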

Flow

| Pipeline | Description |
| --- | --- |
| Sequential | Apply list of augmentation functions sequentially |
| Sometimes | Apply some augmentation functions randomly |
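
The two pipeline styles can be sketched in plain Python as function composition. This is an illustration of the idea, not nlpaug's Flow API. The helper names `sequential` and `sometimes`, the `prob` parameter, and the choice of an independent coin flip per augmenter are all assumptions; the library may select augmenters differently.

```python
import random


def sequential(augmenters):
    """Return a function that applies every augmenter in order."""
    def run(text):
        for aug in augmenters:
            text = aug(text)
        return text
    return run


def sometimes(augmenters, prob=0.5, rng=None):
    """Return a function that applies each augmenter with probability `prob`.

    The per-augmenter coin flip is an assumption made for this sketch.
    """
    rng = rng or random.Random()

    def run(text):
        for aug in augmenters:
            if rng.random() < prob:
                text = aug(text)
        return text
    return run


# Two toy augmenters standing in for real ones.
def drop_first_word(text):
    return " ".join(text.split()[1:])


def upper_case(text):
    return text.upper()


pipeline = sequential([drop_first_word, upper_case])
print(pipeline("the quick brown fox"))  # QUICK BROWN FOX

maybe = sometimes([drop_first_word, upper_case], prob=0.5, rng=random.Random(0))
print(maybe("the quick brown fox"))  # result depends on the random draws
```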

Installation

The library supports Python 3.5+ on Linux and Windows.

To install the library:

pip install nlpaug

or install the latest version (including BETA features) directly from GitHub:

pip install git+https://github.com/makcedward/nlpaug.git

Download the word2vec, GloVe or fasttext model files if you use Word2vecAug, GloVeAug or FasttextAug.

Recent Changes

0.0.4 Jun 7, 2019:

  • Added stopwords feature in character and word augmenter.
  • Added character's swap augmenter.
  • Added word's swap augmenter.
  • Added validation rule for #1.
  • Fixed BERT reverse tokenization for #2.

0.0.3 May 23, 2019: Added Speed, Noise, Shift and Pitch augmenters for Audio

0.0.2 Apr 30, 2019: Added Frequency Masking and Time Masking for Speech Recognition (Spectrogram). Added librosa library dependency for converting wav to spectrogram.

0.0.1 Mar 20, 2019: Project initialization

Test

Word2vec and GloVe models are used for word insertion and substitution. Their model files are required to run the test cases. Add a ".env" file to the root directory with the following content:
    - MODEL_DIR={MODEL FILE PATH}
The model folder should have the following structure:
    -- root directory
        - glove.6B.50d.txt
        - GoogleNews-vectors-negative300.bin
        - wiki-news-300d-1M.vec
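
Before running the tests it can help to confirm that the ".env" file and model folder are set up as described. The sketch below reads MODEL_DIR from a ".env" file with a small hand-written parser and reports which of the three listed files are missing. The function names `read_model_dir` and `missing_model_files` are hypothetical, and this is not how the nlpaug test suite itself loads the setting. The demo runs against a temporary directory rather than a real model folder.

```python
import os
import tempfile

# File names taken from the folder structure listed above.
EXPECTED_FILES = [
    "glove.6B.50d.txt",
    "GoogleNews-vectors-negative300.bin",
    "wiki-news-300d-1M.vec",
]


def read_model_dir(env_path):
    """Return the MODEL_DIR value from a .env file, or None if it is absent."""
    with open(env_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            key, sep, value = line.partition("=")
            if sep and key.strip() == "MODEL_DIR":
                return value.strip().strip('"').strip("'")
    return None


def missing_model_files(model_dir):
    """List the expected model files that are not present in `model_dir`."""
    return [
        name for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


# Demo in a throwaway directory containing only the GloVe file.
with tempfile.TemporaryDirectory() as root:
    model_dir = os.path.join(root, "models")
    os.mkdir(model_dir)
    open(os.path.join(model_dir, "glove.6B.50d.txt"), "w").close()

    env_path = os.path.join(root, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("MODEL_DIR=" + model_dir + "\n")

    found_dir = read_model_dir(env_path)
    missing = missing_model_files(found_dir)
    print("missing:", missing)
```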

Research Reference

| Augmenter | Research |
| --- | --- |
| RandomAug, QwertyAug | D. Pruthi, B. Dhingra and Z. C. Lipton. Combating Adversarial Misspellings with Robust Word Recognition. 2019 |
| WordNetAug | X. Zhang, J. Zhao and Y. LeCun. Character-level Convolutional Networks for Text Classification. 2015 |
| FrequencyMaskingAug, TimeMaskingAug | D. S. Park, W. Chan, Y. Zhang, C. C. Chiu, B. Zoph, E. D. Cubuk and Q. V. Le. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. 2019 |
