Natural language processing augmentation library for deep neural networks
nlpaug
This Python library helps you augment NLP data for your machine learning projects. See this introduction to learn about data augmentation in NLP. Augmenter is the basic element of augmentation, while Flow is a pipeline that orchestrates multiple augmenters together.
Features
- Generate synthetic data for improving model performance without manual effort
- Simple, easy-to-use and lightweight library. Augment data in 3 lines of code
- Plug and play with any neural network framework (e.g. PyTorch, TensorFlow)
- Supports textual and audio input
Textual Data Augmentation Example
Acoustic Data Augmentation Example
Section | Description |
---|---|
Quick Demo | How to use this library |
Augmenter | Introduce all available augmentation methods |
Installation | How to install this library |
Recent Changes | Latest enhancement |
Extension Reading | More real-life examples and research |
Reference | References to external resources such as data or models |
Quick Demo
- Example of Augmentation for Textual Inputs
- Example of Augmentation for Multilingual Textual Inputs
- Example of Augmentation for Spectrogram Inputs
- Example of Augmentation for Audio Inputs
- Example of Orchestrating Multiple Augmenters
- How to train TF-IDF model
- How to create custom augmentation
- API Documentation
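The linked examples cover each input type in depth. As a minimal sketch of the basic pattern (create an augmenter, then call augment), the snippet below uses KeyboardAug from the character augmenters. The wrapper function keyboard_typos is a hypothetical helper added for illustration, and it returns the text unchanged when nlpaug is not installed. Depending on the installed nlpaug version, augment may return a string or a list of strings.

```python
def keyboard_typos(text):
    """Return `text` with simulated keyboard typos (hypothetical helper).

    Falls back to returning `text` unchanged if nlpaug is not installed.
    """
    try:
        import nlpaug.augmenter.char as nac
    except ImportError:
        return text
    aug = nac.KeyboardAug()    # character-level augmenter, substitute action
    return aug.augment(text)   # a str or a list of str, depending on the version


print(keyboard_typos("The quick brown fox jumps over the lazy dog"))
```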
Augmenter
Category | Target | Augmenter | Action | Description |
---|---|---|---|---|
Textual | Character | KeyboardAug | substitute | Simulate keyboard distance error |
Textual | Character | OcrAug | substitute | Simulate OCR engine error |
Textual | Character | RandomCharAug | insert, substitute, swap, delete | Apply augmentation randomly |
Textual | Word | AntonymAug | substitute | Substitute a word with its antonym according to WordNet |
Textual | Word | ContextualWordEmbsAug | insert, substitute | Feed surrounding words to a BERT, DistilBERT, RoBERTa or XLNet language model to find the most suitable word for augmentation |
Textual | Word | RandomWordAug | swap, delete | Apply augmentation randomly |
Textual | Word | SpellingAug | substitute | Substitute a word according to a spelling mistake dictionary |
Textual | Word | SplitAug | split | Split one word into two words randomly |
Textual | Word | SynonymAug | substitute | Substitute a similar word according to WordNet/PPDB synonyms |
Textual | Word | TfIdfAug | insert, substitute | Use TF-IDF to decide how a word should be augmented |
Textual | Word | WordEmbsAug | insert, substitute | Leverage word2vec, GloVe or fasttext embeddings to apply augmentation |
Textual | Sentence | ContextualWordEmbsForSentenceAug | insert | Insert a sentence according to XLNet, GPT2 or DistilGPT2 prediction |
Signal | Audio | CropAug | delete | Delete a segment of the audio |
Signal | Audio | LoudnessAug | substitute | Adjust the audio's volume |
Signal | Audio | MaskAug | substitute | Mask a segment of the audio |
Signal | Audio | NoiseAug | substitute | Inject noise |
Signal | Audio | PitchAug | substitute | Adjust the audio's pitch |
Signal | Audio | ShiftAug | substitute | Shift the time dimension forward/backward |
Signal | Audio | SpeedAug | substitute | Adjust the audio's speed |
Signal | Audio | VtlpAug | substitute | Change vocal tract |
Signal | Spectrogram | FrequencyMaskingAug | substitute | Set a block of values to zero along the frequency dimension |
Signal | Spectrogram | TimeMaskingAug | substitute | Set a block of values to zero along the time dimension |
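To make the Action column concrete, here is a rough, library-independent sketch of the four character-level actions (insert, substitute, swap, delete). The function apply_action and its parameters are hypothetical names chosen for illustration; this is not how nlpaug implements its augmenters, only a picture of what each action does to a single word.

```python
import random


def apply_action(word, action, rng, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Apply one character-level action to `word` at a random position."""
    if not word:
        return word
    i = rng.randrange(len(word))
    if action == "insert":
        # Add a random character before position i.
        return word[:i] + rng.choice(alphabet) + word[i:]
    if action == "substitute":
        # Replace the character at position i with a random one.
        return word[:i] + rng.choice(alphabet) + word[i + 1:]
    if action == "delete":
        # Drop the character at position i.
        return word[:i] + word[i + 1:]
    if action == "swap":
        # Exchange two adjacent characters.
        if len(word) < 2:
            return word
        j = min(i, len(word) - 2)
        return word[:j] + word[j + 1] + word[j] + word[j + 2:]
    raise ValueError("unknown action: " + action)


rng = random.Random(0)  # fixed seed so repeated runs give the same output
for action in ("insert", "substitute", "swap", "delete"):
    print(action, "->", apply_action("augment", action, rng))
```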
Flow
Type | Flow | Description |
---|---|---|
Pipeline | Sequential | Apply list of augmentation functions sequentially |
Pipeline | Sometimes | Apply some augmentation functions randomly |
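The difference between the two flows can be sketched without the library. In the example below, SequentialFlow and SometimesFlow are hypothetical stand-ins written for illustration, augmenters are plain callables, and the probability parameter p is an assumption; the library's own flow classes may differ in naming and behavior.

```python
import random


class SequentialFlow:
    """Apply every augmenter in order (illustrative stand-in)."""

    def __init__(self, augmenters):
        self.augmenters = list(augmenters)

    def augment(self, data):
        for aug in self.augmenters:
            data = aug(data)
        return data


class SometimesFlow:
    """Apply each augmenter with probability `p` (illustrative stand-in)."""

    def __init__(self, augmenters, p=0.5, seed=None):
        self.augmenters = list(augmenters)
        self.p = p
        self.rng = random.Random(seed)

    def augment(self, data):
        for aug in self.augmenters:
            if self.rng.random() < self.p:
                data = aug(data)
        return data


def shout(text):
    return text.upper()


def exclaim(text):
    return text + "!"


print(SequentialFlow([shout, exclaim]).augment("hello"))
print(SometimesFlow([shout, exclaim], p=0.5, seed=0).augment("hello"))
```

With p=1.0 the second flow behaves like the first, and with p=0.0 it returns its input unchanged.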
Installation
The library supports Python 3.5+ on Linux and Windows.
To install the library:
pip install nlpaug numpy matplotlib python-dotenv
Or install the latest version (including beta features) directly from GitHub:
pip install git+https://github.com/makcedward/nlpaug.git numpy matplotlib python-dotenv
If you use ContextualWordEmbsAug or ContextualWordEmbsForSentenceAug, also install the following dependencies:
pip install "torch>=1.2.0" "transformers>=2.0.0"
If you use AntonymAug or SynonymAug, also install the following dependency:
pip install "nltk>=3.4.5"
If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first:
from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.') # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # Download fasttext model
If you use any of the audio augmenters, also install the following dependency:
pip install "librosa>=0.7.1"
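Because these dependencies are optional, it can help to check which ones are present before choosing an augmenter. The following is a small sketch using only the standard library; the mapping and the function name missing_dependencies are illustrative, and the check only tests whether a package can be found, not whether its version meets the minimum listed above.

```python
import importlib.util

# Optional packages named in the installation notes, and what needs them.
OPTIONAL_DEPENDENCIES = {
    "torch": "ContextualWordEmbsAug, ContextualWordEmbsForSentenceAug",
    "transformers": "ContextualWordEmbsAug, ContextualWordEmbsForSentenceAug",
    "nltk": "AntonymAug, SynonymAug",
    "librosa": "audio augmenters",
}


def missing_dependencies():
    """Return {package: needed_by} for optional packages that are not installed."""
    return {
        name: needed_by
        for name, needed_by in OPTIONAL_DEPENDENCIES.items()
        if importlib.util.find_spec(name) is None
    }


for name, needed_by in sorted(missing_dependencies().items()):
    print(name, "is not installed; required by:", needed_by)
```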
Recent Changes
**0.0.12** Feb 5, 2020
- ContextualWordEmbsAug supports bert-base-multilingual-uncased (for non English inputs)
- Fix missing library dependency #74
- Fix single token error when using RandomWordAug #76
- Fix character replacement error in RandomCharAug #77
- Enhance word augmenters to support regular expression stopwords #81
- Enhance character augmenters to support regular expression stopwords #86
- KeyboardAug supports Thai language #92
- Fix word casing issue #82
**0.0.11** Dec 6, 2019
- Support color noise (pink, blue, red and violet noise) in the audio NoiseAug
- Support user-provided background noise in the audio NoiseAug
- Support injecting noise into only a portion of the audio in the audio NoiseAug
- Introduce zone and coverage to all audio augmenters, so that only a portion of the audio input is augmented
- Add VTLP augmentation method (audio augmenter)
- Adopt the latest transformers interface #59
- Support RoBERTa (including DistilRoBERTa) and DistilBERT (ContextualWordEmbsAug)
- Support DistilGPT2 (ContextualWordEmbsForSentenceAug)
- Fix librosa hard dependency #62
- Introduce the optimize attribute for ContextualWordEmbsForSentenceAug #63
- Optimize word selection for ContextualWordEmbsAug and ContextualWordEmbsForSentenceAug (around 30% faster)
- Add retry mechanism into ContextualWordEmbsAug insert action #68
See changelog for more details.
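The zone and coverage options introduced in 0.0.11 can be pictured with a small sketch. It assumes that zone is a (start, end) pair of fractions marking the part of the audio eligible for augmentation, and that coverage is the fraction of that zone actually augmented. The function names, the default values shown and the zero-masking step are illustrative assumptions, not the library's code.

```python
import random


def augment_range(num_samples, zone=(0.2, 0.8), coverage=1.0, rng=None):
    """Pick the (start, end) sample indices to augment (assumed semantics)."""
    rng = rng or random.Random()
    zone_start = int(num_samples * zone[0])
    zone_end = int(num_samples * zone[1])
    zone_size = zone_end - zone_start
    target_size = int(zone_size * coverage)
    # Place the target segment at a random offset that keeps it inside the zone.
    start = rng.randint(zone_start, zone_end - target_size)
    return start, start + target_size


def mask_segment(samples, zone=(0.2, 0.8), coverage=1.0, rng=None):
    """Zero out the selected segment, leaving the rest of the audio untouched."""
    start, end = augment_range(len(samples), zone, coverage, rng)
    return samples[:start] + [0.0] * (end - start) + samples[end:]


audio = [1.0] * 100  # stand-in for 100 audio samples
masked = mask_segment(audio, zone=(0.2, 0.8), coverage=0.5, rng=random.Random(0))
print(masked.count(0.0), "of", len(masked), "samples were masked")
```

Under these assumptions, the first and last 20% of the input are never modified, and half of the remaining 60 samples are masked.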
Extension Reading
- Data Augmentation library for Text
- Does your NLP model able to prevent adversarial attack?
- How does Data Noising Help to Improve your NLP Model?
- Data Augmentation library for Speech Recognition
- Data Augmentation library for Audio
- Unsupervised Data Augmentation
Reference
This library uses data (e.g. captured from the internet), research (e.g. ideas behind the augmenters) and models (e.g. pre-trained models) from external sources. See data source for more details.
Citing
@misc{ma2019nlpaug,
title={NLP Augmentation},
author={Edward Ma},
howpublished={\url{https://github.com/makcedward/nlpaug}},
year={2019}
}
Contributions (Supporting Other Languages)
- sakares: Add Thai support to KeyboardAug