Natural language processing augmentation library for deep neural networks
Project description
nlpaug
This Python library helps you augment NLP data for your machine learning projects. Visit this introduction to learn about Data Augmentation in NLP. Augmenter is the basic element of augmentation, while Flow is a pipeline that orchestrates multiple augmenters together.
- Data Augmentation library for Text
- Does your NLP model able to prevent adversarial attack?
- How does Data Noising Help to Improve your NLP Model?
- Data Augmentation library for Speech Recognition
- Data Augmentation library for Audio
- Unsupervised Data Augmentation
Starter Guides
- Example of Augmentation for Textual Inputs
- Example of Augmentation for Spectrogram Inputs
- Example of Augmentation for Audio Inputs
- Example of Orchestrating Multiple Augmenters
- How to train TF-IDF model
- How to create custom augmentation
- API Documentation
Flow
Pipeline | Description |
---|---|
Sequential | Apply list of augmentation functions sequentially |
Sometimes | Apply some augmentation functions randomly |
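The two pipeline behaviours in the table can be illustrated with a minimal pure-Python sketch. This is not nlpaug's own code: the helpers `sequential` and `sometimes`, the per-augmenter probability `p`, and the toy augmenters `upper` and `exclaim` are all hypothetical names chosen for illustration, and the assumption that each augmenter is skipped or applied independently is ours.

```python
import random

def sequential(augmenters):
    """Return a function that applies every augmenter, in list order."""
    def run(text):
        for aug in augmenters:
            text = aug(text)
        return text
    return run

def sometimes(augmenters, p=0.5, rng=None):
    """Return a function that applies each augmenter with probability p.

    Assumption: augmenters are selected independently of each other.
    """
    rng = rng or random.Random()
    def run(text):
        for aug in augmenters:
            if rng.random() < p:
                text = aug(text)
        return text
    return run

# Two toy augmenters (hypothetical, for illustration only).
def upper(text):
    return text.upper()

def exclaim(text):
    return text + "!"

print(sequential([upper, exclaim])("hello"))
print(sometimes([upper, exclaim], p=0.5, rng=random.Random(0))("hello"))
```

Passing a seeded `random.Random` makes a random pipeline reproducible, which is convenient when comparing augmented datasets between runs.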
Textual Augmenter
Target | Augmenter | Action | Description |
---|---|---|---|
Character | RandomAug | insert | Insert character randomly |
Character | RandomAug | substitute | Substitute character randomly |
Character | RandomAug | swap | Swap character randomly |
Character | RandomAug | delete | Delete character randomly |
Character | OcrAug | substitute | Simulate OCR engine error |
Character | KeyboardAug | substitute | Simulate keyboard distance error |
Word | RandomWordAug | swap | Swap word randomly |
Word | RandomWordAug | delete | Delete word randomly |
Word | SpellingAug | substitute | Substitute word according to spelling mistake dictionary |
Word | SynonymAug | substitute | Substitute similar word according to WordNet/PPDB synonym |
Word | AntonymAug | substitute | Substitute opposite meaning word according to WordNet antonym |
Word | SplitAug | split | Split one word into two words randomly |
Word | WordEmbsAug | insert | Insert word randomly from word2vec, GloVe or fasttext dictionary |
Word | WordEmbsAug | substitute | Substitute word based on word2vec, GloVe or fasttext embeddings |
Word | TfIdfAug | insert | Insert word randomly using a trained TF-IDF model |
Word | TfIdfAug | substitute | Substitute word based on TF-IDF score |
Word | ContextualWordEmbsAug | insert | Insert word by feeding surrounding words to a BERT or XLNet language model |
Word | ContextualWordEmbsAug | substitute | Substitute word by feeding surrounding words to a BERT or XLNet language model |
Sentence | ContextualWordEmbsForSentenceAug | insert | Insert sentence according to XLNet or GPT2 prediction |
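To make the four character-level actions in the table concrete, here is a minimal pure-Python sketch of insert, substitute, swap and delete. It does not call nlpaug and is not its implementation: `random_char_aug` is a hypothetical helper, and the choices to draw replacement characters from lowercase ASCII and to swap only adjacent characters are assumptions made for brevity.

```python
import random
import string

def random_char_aug(word, action, rng=None):
    """Apply one character-level action at a random position of `word`.

    Words shorter than two characters are returned unchanged.
    """
    rng = rng or random.Random()
    if len(word) < 2:
        return word
    chars = list(word)
    if action == "insert":
        pos = rng.randrange(len(chars) + 1)
        chars.insert(pos, rng.choice(string.ascii_lowercase))
    elif action == "substitute":
        pos = rng.randrange(len(chars))
        chars[pos] = rng.choice(string.ascii_lowercase)
    elif action == "swap":
        pos = rng.randrange(len(chars) - 1)  # swap with the right-hand neighbour
        chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    elif action == "delete":
        del chars[rng.randrange(len(chars))]
    else:
        raise ValueError(f"unknown action: {action}")
    return "".join(chars)

rng = random.Random(42)
for action in ("insert", "substitute", "swap", "delete"):
    print(action, random_char_aug("augment", action, rng))
```

The word-level actions follow the same pattern, with tokens in place of characters and a dictionary, embedding or language model supplying the candidates.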
Signal Augmenter
Target | Augmenter | Action | Description |
---|---|---|---|
Audio | NoiseAug | substitute | Inject noise |
Audio | PitchAug | substitute | Adjust audio's pitch |
Audio | ShiftAug | substitute | Shift time dimension forward/backward |
Audio | SpeedAug | substitute | Adjust audio's speed |
Audio | CropAug | delete | Delete audio's segment |
Audio | LoudnessAug | substitute | Adjust audio's volume |
Audio | MaskAug | substitute | Mask audio's segment |
Spectrogram | FrequencyMaskingAug | substitute | Set block of values to zero according to frequency dimension |
Spectrogram | TimeMaskingAug | substitute | Set block of values to zero according to time dimension |
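The two spectrogram rows describe zeroing a contiguous block along one dimension. The sketch below shows that idea on a plain list-of-lists spectrogram. It is an illustration only, not nlpaug's implementation: `mask_block`, its `axis` and `width` parameters, and the layout (rows as frequency bins, columns as time steps) are assumptions.

```python
import random

def mask_block(spec, axis, width, rng=None):
    """Return a copy of `spec` with `width` consecutive rows or columns zeroed.

    spec:  list of rows; rows index frequency bins, columns index time steps
    axis:  "frequency" zeroes whole rows, "time" zeroes whole columns
    width: number of consecutive bins or steps to mask
    """
    if axis not in ("frequency", "time"):
        raise ValueError(f"unknown axis: {axis}")
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    size = n_freq if axis == "frequency" else n_time
    width = min(width, size)
    start = rng.randrange(size - width + 1)  # random start of the masked block
    masked = [row[:] for row in spec]        # leave the input untouched
    for f in range(n_freq):
        for t in range(n_time):
            idx = f if axis == "frequency" else t
            if start <= idx < start + width:
                masked[f][t] = 0.0
    return masked

spec = [[1.0] * 5 for _ in range(4)]  # 4 frequency bins x 5 time steps
for row in mask_block(spec, "frequency", 2, random.Random(0)):
    print(row)
```

Real spectrograms are usually NumPy arrays, where the same masking reduces to a slice assignment; plain lists are used here only to keep the sketch dependency-free.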
Installation
The library supports Python 3.5+ on Linux and Windows.
To install the library:
pip install nlpaug numpy matplotlib python-dotenv
Or install the latest version (including beta features) directly from GitHub:
pip install git+https://github.com/makcedward/nlpaug.git numpy matplotlib python-dotenv
If you use ContextualWordEmbsAug or ContextualWordEmbsForSentenceAug, install the following dependencies as well (the quotes keep the shell from treating >= as a redirect):
pip install "torch>=1.2.0" "transformers>=2.0.0"
If you use AntonymAug or SynonymAug, install the following dependency as well:
pip install nltk
If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first:
from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.') # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # Download fasttext model
If you use any of the audio augmenters, install the following dependency as well:
pip install librosa
Recent Changes
0.0.9 Sep 30, 2019
- Added Swap Mode (adjacent, middle and random) for RandomAug (character level)
- Added SynonymAug (WordNet/ PPDB) and AntonymAug (WordNet)
- WordNetAug is deprecated. Use SynonymAug instead
- Introduced parameter n to return more than one augmented result. The output format changes from text (or numpy) to a list of text (or numpy) if n > 1
- Introduce parameter temperature in ContextualWordEmbsAug and ContextualWordEmbsForSentenceAug to control the randomness
- The aug_n parameter is deprecated and will be replaced by the top_k parameter
- Fixed tokenization issue #48
- Upgraded transformers dependency (formerly pytorch_transformers) to 2.0.0
- Upgraded PyTorch dependency to 1.2.0
- Added SplitAug
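The changelog mentions the n, temperature and top_k parameters without describing how they interact. As a rough illustration, the sketch below uses the standard temperature-scaled softmax with top-k truncation. Whether nlpaug computes its probabilities this way is an assumption. `sample_probs`, `sample_words` and the example scores are hypothetical, and only the return-shape rule (a single item for n = 1, a list otherwise) is taken from the changelog.

```python
import math
import random

def sample_probs(scores, temperature=1.0, top_k=None):
    """Turn raw candidate scores into sampling probabilities.

    scores: dict mapping candidate word -> raw model score (logit)
    Lower temperature sharpens the distribution; top_k keeps only the
    k highest-scoring candidates before normalising.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    items = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]
    scaled = [score / temperature for _, score in items]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return {word: w / total for (word, _), w in zip(items, weights)}

def sample_words(scores, n=1, temperature=1.0, top_k=None, rng=None):
    """Draw n candidates; return one item if n == 1, else a list."""
    rng = rng or random.Random()
    probs = sample_probs(scores, temperature, top_k)
    picks = rng.choices(list(probs), weights=list(probs.values()), k=n)
    return picks[0] if n == 1 else picks

scores = {"quick": 2.0, "fast": 1.0, "slow": 0.0}  # made-up example scores
print(sample_probs(scores, temperature=0.5, top_k=2))
print(sample_words(scores, n=3, rng=random.Random(0)))
```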
See changelog for more details.
Source
This library draws on external data (e.g. captured from the internet), research (e.g. ideas for augmenters) and models (e.g. pre-trained models). See data source for more details.