Natural language processing augmentation library for deep neural networks
nlpaug
This Python library helps you augment NLP data for your machine learning projects. Visit this introduction to learn about data augmentation in NLP. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters together.
- Data Augmentation library for Text
- Data Augmentation library for Speech Recognition
- Data Augmentation library for Audio
- Is your NLP model able to prevent adversarial attacks?
Starter Guides
- Augmentation for character and word
- Augmentation for spectrogram (audio input)
- Augmentation for audio
- How to train TF-IDF model
- How to create custom augmentation
- API Documentation
Flow
Pipeline | Description |
---|---|
Sequential | Apply a list of augmentation functions sequentially |
Sometimes | Apply some of the augmentation functions at random |
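The two flow types above can be pictured with a small sketch. This is not nlpaug's implementation: the helper names `sequential` and `sometimes`, the probability parameter `p`, and the toy augmenters are all hypothetical, and "apply each augmenter with probability p" is only one plausible reading of the Sometimes behaviour.

```python
import random


def sequential(augmenters):
    """Return a function that applies every augmenter, in order."""
    def run(text):
        for aug in augmenters:
            text = aug(text)
        return text
    return run


def sometimes(augmenters, p=0.5, rng=None):
    """Return a function that applies each augmenter with probability p."""
    rng = rng or random.Random()

    def run(text):
        for aug in augmenters:
            if rng.random() < p:
                text = aug(text)
        return text
    return run


# Two toy augmenters, used only for illustration.
def drop_last_char(text):
    return text[:-1]


def upper_first_char(text):
    return text[:1].upper() + text[1:]


pipeline = sequential([drop_last_char, upper_first_char])
print(pipeline("hello world"))  # Hello worl
```

Treating an augmenter as a plain callable keeps the two flows composable: a `sometimes` pipeline can itself be passed as one step of a `sequential` pipeline.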
Textual Augmenter
Target | Augmenter | Action | Description |
---|---|---|---|
Character | RandomAug | insert | Insert character randomly |
Character | RandomAug | substitute | Substitute character randomly |
Character | RandomAug | swap | Swap character randomly |
Character | RandomAug | delete | Delete character randomly |
Character | OcrAug | substitute | Simulate OCR engine error |
Character | KeyboardAug | substitute | Simulate keyboard distance error |
Word | RandomWordAug | swap | Swap word randomly |
Word | RandomWordAug | delete | Delete word randomly |
Word | SpellingAug | substitute | Substitute word according to a spelling mistake dictionary |
Word | WordNetAug | substitute | Substitute word according to WordNet synonyms |
Word | WordEmbsAug | insert | Insert word randomly from a word2vec, GloVe or fastText dictionary |
Word | WordEmbsAug | substitute | Substitute word based on word2vec, GloVe or fastText embeddings |
Word | TfIdfAug | insert | Insert word randomly using a trained TF-IDF model |
Word | TfIdfAug | substitute | Substitute word based on TF-IDF score |
Word | ContextualWordEmbsAug | insert | Insert word by feeding the surrounding words to a BERT or XLNet language model |
Word | ContextualWordEmbsAug | substitute | Substitute word by feeding the surrounding words to a BERT or XLNet language model |
Sentence | ContextualWordEmbsForSentenceAug | insert | Insert sentence according to GPT2 or XLNet prediction |
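To make the four character-level actions in the table concrete, here is a minimal sketch in plain Python. It is an illustration only and does not reflect how the library implements them: the function name `random_char_aug`, its parameters, and the choice of lowercase ASCII as the replacement alphabet are assumptions.

```python
import random
import string


def random_char_aug(word, action, rng):
    """Apply one random character-level edit to `word` (hypothetical helper)."""
    if not word:
        return word
    i = rng.randrange(len(word))  # position the edit applies to
    if action == "insert":
        # Add a random letter before position i.
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if action == "substitute":
        # Replace the letter at i; the random pick may equal the original.
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i + 1:]
    if action == "delete":
        # Drop the letter at i.
        return word[:i] + word[i + 1:]
    if action == "swap":
        # Exchange two adjacent letters; needs at least two characters.
        if len(word) < 2:
            return word
        j = rng.randrange(len(word) - 1)
        return word[:j] + word[j + 1] + word[j] + word[j + 2:]
    raise ValueError("unknown action: " + action)


rng = random.Random(0)  # seeded so repeated runs give the same edits
for action in ("insert", "substitute", "swap", "delete"):
    print(action, random_char_aug("augment", action, rng))
```

Passing the random generator in explicitly makes the augmentation reproducible, which helps when comparing models trained on augmented data.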
Signal Augmenter
Target | Augmenter | Action | Description |
---|---|---|---|
Audio | NoiseAug | substitute | Inject noise |
Audio | PitchAug | substitute | Adjust audio's pitch |
Audio | ShiftAug | substitute | Shift time dimension forward/backward |
Audio | SpeedAug | substitute | Adjust audio's speed |
Audio | CropAug | delete | Delete audio's segment |
Audio | LoudnessAug | substitute | Adjust audio's volume |
Audio | MaskAug | substitute | Mask audio's segment |
Spectrogram | FrequencyMaskingAug | substitute | Set block of values to zero according to frequency dimension |
Spectrogram | TimeMaskingAug | substitute | Set block of values to zero according to time dimension |
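The two spectrogram rows describe zeroing a block of values along one axis. A minimal sketch of that idea follows, assuming the spectrogram is a non-empty list of frequency rows, each a list of time-step values. The function names, the `max_width` parameter and the uniform choice of block position are assumptions for illustration, not the library's API.

```python
import random


def frequency_mask(spec, max_width, rng):
    """Return a copy of `spec` with a random block of frequency rows zeroed."""
    n_freq = len(spec)
    width = rng.randint(0, min(max_width, n_freq))  # rows to mask
    start = rng.randint(0, n_freq - width)          # first masked row
    return [
        [0.0] * len(row) if start <= f < start + width else list(row)
        for f, row in enumerate(spec)
    ]


def time_mask(spec, max_width, rng):
    """Return a copy of `spec` with a random block of time steps zeroed."""
    n_time = len(spec[0])
    width = rng.randint(0, min(max_width, n_time))  # time steps to mask
    start = rng.randint(0, n_time - width)          # first masked step
    return [
        [0.0 if start <= t < start + width else v for t, v in enumerate(row)]
        for row in spec
    ]


rng = random.Random(0)
spec = [[1.0] * 6 for _ in range(4)]  # 4 frequency bins x 6 time steps
for row in time_mask(frequency_mask(spec, 2, rng), 3, rng):
    print(row)
```

Both functions return new lists and leave the input untouched, so the original spectrogram can be reused to generate several augmented variants.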
Installation
The library supports Python 3.5+ on Linux and Windows.
To install the library:
pip install nlpaug numpy matplotlib python-dotenv
Or install the latest version (including beta features) directly from GitHub:
pip install git+https://github.com/makcedward/nlpaug.git
If you use ContextualWordEmbsAug, also install the following dependencies (the version specifiers are quoted so that the shell does not treat > as a redirection):
pip install "torch>=1.1.0" "pytorch_transformers>=1.1.0"
If you use WordNetAug, also install the following dependency:
pip install nltk
If you use WordEmbsAug (word2vec, GloVe or fastText), download a pre-trained model first:
from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.') # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # Download fasttext model
If you use any of the audio augmenters, also install the following dependency:
pip install librosa
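Because these dependencies are optional, it can help to check which ones are importable before constructing an augmenter. The sketch below uses only the standard library. The feature-to-module mapping is an assumption drawn from the instructions above, and `missing_modules` is a hypothetical helper, not part of nlpaug.

```python
import importlib.util

# Assumed mapping from feature to the top-level modules it needs.
OPTIONAL_MODULES = {
    "ContextualWordEmbsAug": ["torch"],
    "WordNetAug": ["nltk"],
    "audio augmenters": ["librosa"],
}


def missing_modules(requirements=OPTIONAL_MODULES):
    """Map each feature to the required modules that cannot be found."""
    missing = {}
    for feature, modules in requirements.items():
        absent = [m for m in modules if importlib.util.find_spec(m) is None]
        if absent:
            missing[feature] = absent
    return missing


# An empty dict means every optional dependency listed above is installed.
print(missing_modules())
```

`importlib.util.find_spec` only locates a top-level module without importing it, so the check stays fast even for heavy packages such as torch.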
Recent Changes
0.0.8 Sep 4, 2019
- BertAug is replaced by ContextualWordEmbsAug
- Support GPU (for ContextualWordEmbsAug only) #26
- Upgraded pytorch_transformers to version 1.1.0 #33
- ContextualWordEmbsAug supports both BERT and XLNet models
- Removed librosa dependency
- Added ContextualWordEmbsForSentenceAug for generating the next sentence
- Fix sampling issue #38
See changelog for more details.
Source
The library uses the following pre-trained models:
- word2vec (Google): Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean released Efficient Estimation of Word Representations in Vector Space
- GloVe (Stanford): Jeffrey Pennington, Richard Socher, and Christopher D. Manning released GloVe: Global Vectors for Word Representation
- fastText (Facebook): Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch and Armand Joulin released Advances in Pre-Training Distributed Word Representations
- BERT (Google): Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova released BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Used Hugging Face PyTorch version.
- XLNet (Google/CMU): Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le released XLNet: Generalized Autoregressive Pretraining for Language Understanding. Used Hugging Face PyTorch version.
The library also uses data collected from the internet to build augmenters and test cases. See data source for more details.
Research Reference
Some of the augmenters above are inspired by the following research papers. However, they do not always follow the original implementation. If you need the original implementation, please refer to the original source code.
- Y. Belinkov and Y. Bisk. Synthetic and Natural Noise Both Break Neural Machine Translation. 2017
- J. Ebrahimi, A. Rao, D. Lowd and D. Dou. HotFlip: White-Box Adversarial Examples for Text Classification. 2018
- J. Ebrahimi, D. Lowd and D. Dou. On Adversarial Examples for Character-Level Neural Machine Translation. 2018
- D. Pruthi, B. Dhingra and Z. C. Lipton. Combating Adversarial Misspellings with Robust Word Recognition. 2019
- T. Niu and M. Bansal. Adversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models. 2018
- P. Minervini and S. Riedel. Adversarially Regularising Neural NLI Models to Integrate Logical Background Knowledge. 2018
- X. Zhang, J. Zhao and Y. LeCun. Character-level Convolutional Networks for Text Classification. 2015
- C. Coulombe. Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs. 2018
- Q. Xie, Z. Dai, E. Hovy, M. T. Luong and Q. V. Le. Unsupervised Data Augmentation. 2019
- W. Y. Wang and D. Yang. That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets. 2015
- S. Kobayashi. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relation. 2018
- D. S. Park, W. Chan, Y. Zhang, C. C. Chiu, B. Zoph, E. D. Cubuk and Q. V. Le. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. 2019
- R. Jia and P. Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. 2017