Natural language processing augmentation library for deep neural networks
nlpaug
This Python library helps you augment NLP data for your machine learning projects. Visit this introduction to learn about Data Augmentation in NLP. Augmenter is the basic element of augmentation, while Flow is a pipeline that orchestrates multiple augmenters together.
- Data Augmentation library for Text
- Data Augmentation library for Speech Recognition
- Data Augmentation library for Audio
- Is your NLP model able to prevent adversarial attacks?
Starter Guides
- Augmentation for character and word
- Augmentation for spectrogram (audio input)
- Augmentation for audio
- How to train TF-IDF model
- How to create custom augmentation
Augmenter
Target | Augmenter | Action | Description |
---|---|---|---|
Character | RandomAug | Insert | Insert character randomly |
Character | RandomAug | Substitute | Substitute character randomly |
Character | RandomAug | Swap | Swap character randomly |
Character | RandomAug | Delete | Delete character randomly |
Character | OcrAug | Substitute | Simulate OCR engine error |
Character | QwertyAug | Substitute | Simulate keyboard distance error |
Word | RandomWordAug | Swap | Swap word randomly |
Word | RandomWordAug | Delete | Delete word randomly |
Word | SpellingAug | Substitute | Substitute word according to spelling mistake dictionary |
Word | StopWordsAug | Delete | Remove stopwords randomly |
Word | WordNetAug | Substitute | Substitute word according to WordNet's synonym |
Word | Word2vecAug | Insert | Insert word randomly from word2vec dictionary |
Word | Word2vecAug | Substitute | Substitute word based on word2vec embeddings |
Word | GloVeAug | Insert | Insert word randomly from GloVe dictionary |
Word | GloVeAug | Substitute | Substitute word based on GloVe embeddings |
Word | FasttextAug | Insert | Insert word randomly from fasttext dictionary |
Word | FasttextAug | Substitute | Substitute word based on fasttext embeddings |
Word | TfIdfAug | Insert | Insert word randomly from trained TF-IDF model |
Word | TfIdfAug | Substitute | Substitute word based on TF-IDF score |
Word | BertAug | Insert | Insert word by feeding surrounding words to BERT language model |
Word | BertAug | Substitute | Substitute word by feeding surrounding words to BERT language model |
Spectrogram | FrequencyMaskingAug | Substitute | Set block of values to zero according to frequency dimension |
Spectrogram | TimeMaskingAug | Substitute | Set block of values to zero according to time dimension |
Audio | NoiseAug | Substitute | Inject noise |
Audio | PitchAug | Substitute | Adjust pitch |
Audio | ShiftAug | Substitute | Shift time dimension forward/backward |
Audio | SpeedAug | Substitute | Adjust speed of audio |
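To make the four character-level actions in the table concrete (insert, substitute, swap, delete), here is a minimal plain-Python sketch. It does not use nlpaug and is not the library's implementation; the function name `random_char_edit` and its parameters are hypothetical, chosen only for illustration.

```python
import random
import string


def random_char_edit(text, action, rng=None):
    """Apply a single random character-level edit to `text`.

    Illustrative only: this is NOT nlpaug's code. `action` is one of
    "insert", "substitute", "swap" or "delete".
    """
    if rng is None:
        rng = random.Random()
    if not text:
        return text  # nothing to edit

    chars = list(text)
    pos = rng.randrange(len(chars))

    if action == "insert":
        # Put a random lowercase letter in front of position `pos`.
        chars.insert(pos, rng.choice(string.ascii_lowercase))
    elif action == "substitute":
        # Overwrite the character at `pos` with a random lowercase letter.
        chars[pos] = rng.choice(string.ascii_lowercase)
    elif action == "swap":
        # Exchange two adjacent characters (needs at least two characters).
        if len(chars) > 1:
            pos = rng.randrange(len(chars) - 1)
            chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    elif action == "delete":
        # Drop the character at `pos`.
        del chars[pos]
    else:
        raise ValueError("unknown action: {}".format(action))

    return "".join(chars)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs give the same edits
    for action in ("insert", "substitute", "swap", "delete"):
        print(action, "->", random_char_edit("augmentation", action, rng))
```

Passing a seeded `random.Random` keeps the edits reproducible, which is convenient when comparing augmented training sets.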
Flow
Pipeline | Description |
---|---|
Sequential | Apply list of augmentation functions sequentially |
Sometimes | Apply some augmentation functions randomly |
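The two pipeline types can be pictured with a short sketch. The classes below are hypothetical stand-ins written for this illustration, not nlpaug's own `Sequential` and `Sometimes`; in particular the probability parameter `p` and the `augment` method name are assumptions.

```python
import random


class Sequential:
    """Apply every augmentation function in order (illustrative stand-in)."""

    def __init__(self, augmenters):
        self.augmenters = list(augmenters)

    def augment(self, data):
        for aug in self.augmenters:
            data = aug(data)
        return data


class Sometimes:
    """Apply each augmentation function with probability `p` (assumed parameter)."""

    def __init__(self, augmenters, p=0.5, rng=None):
        self.augmenters = list(augmenters)
        self.p = p
        self.rng = rng if rng is not None else random.Random()

    def augment(self, data):
        for aug in self.augmenters:
            # random() returns a value in [0, 1), so p=1.0 always applies
            # the function and p=0.0 never does.
            if self.rng.random() < self.p:
                data = aug(data)
        return data


def drop_vowels(text):
    """Toy augmentation: remove lowercase vowels."""
    return "".join(c for c in text if c not in "aeiou")


def shout(text):
    """Toy augmentation: upper-case the text."""
    return text.upper()


if __name__ == "__main__":
    print(Sequential([drop_vowels, shout]).augment("data augmentation"))
    print(Sometimes([drop_vowels, shout], p=0.5).augment("data augmentation"))
```

Any callable that takes the data and returns the modified data can be placed in either pipeline, so character, word and audio steps could be mixed as long as their input and output types agree.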
Installation
The library supports Python 3.5+ on Linux and Windows.
To install the library:
pip install nlpaug
Or install the latest version (including BETA features) directly from GitHub:
pip install git+https://github.com/makcedward/nlpaug.git
Download the word2vec, GloVe or fasttext files if you use Word2vecAug, GloVeAug or FasttextAug:
- word2vec(GoogleNews-vectors-negative300)
- GloVe(glove.6B.50d)
- fasttext(wiki-news-300d-1M.vec.zip)
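As a rough picture of how an embedding file can drive word substitution, the sketch below parses GloVe-style text vectors (one word per line followed by its space-separated values) and picks the nearest neighbour by cosine similarity. It is an assumption-laden illustration, not how nlpaug loads or queries these models: the helper names are hypothetical, the three 2-dimensional vectors are made up, fasttext `.vec` files start with an extra header line that this parser does not skip, and the word2vec `.bin` file is binary and needs a dedicated reader.

```python
import io
import math


def load_text_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(word, vectors):
    """Return the vocabulary word closest to `word`, excluding `word` itself."""
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    return max(scored)[1]


if __name__ == "__main__":
    # Tiny made-up vectors standing in for a real file such as glove.6B.50d.txt.
    sample = io.StringIO("quick 1.0 0.0\nfast 0.9 0.1\nslow -1.0 0.0\n")
    vectors = load_text_vectors(sample)
    print(most_similar("quick", vectors))
```

With a real file, `handle` would be the opened text file instead of the `io.StringIO` stand-in; a linear scan like this is slow on a large vocabulary, so it is only meant to show the idea.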
Recent Changes
0.0.6 Jul 29, 2019:
- Added new TF-IDF based word replacement augmenter (TfIdfAug)
- Added new spelling mistake simulation augmenter (SpellingAug)
- Added new stopword dropout augmenter (StopWordsAug)
- Fixed #14
0.0.5 Jul 2, 2019:
See changelog for more details.
Test
Word2vec, GloVe and fasttext models are used for word insertion and substitution, so their model files are required to run the test cases. Add a ".env" file to the root directory with the following content:
- MODEL_DIR={MODEL FILE PATH}
The model folder should be structured as follows:
-- root directory
- glove.6B.50d.txt
- GoogleNews-vectors-negative300.bin
- wiki-news-300d-1M.vec
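Before running the tests it can help to confirm that the ".env" file and the model folder match the layout above. The following is a small stand-alone check, assuming a plain `KEY=VALUE` file; the helper names `read_env` and `missing_model_files` are hypothetical and not part of nlpaug.

```python
import os

# File names taken from the folder structure listed above.
EXPECTED_FILES = [
    "glove.6B.50d.txt",
    "GoogleNews-vectors-negative300.bin",
    "wiki-news-300d-1M.vec",
]


def read_env(path):
    """Read simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments and malformed lines
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def missing_model_files(model_dir):
    """Return the expected model files that are not present in `model_dir`."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(model_dir, name))]


if __name__ == "__main__":
    if not os.path.isfile(".env"):
        print("No .env file found in the current directory")
    else:
        model_dir = read_env(".env").get("MODEL_DIR")
        if model_dir is None:
            print("MODEL_DIR is not set in .env")
        else:
            missing = missing_model_files(model_dir)
            print("Missing files:", missing if missing else "none")
```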
Research Reference
Data Source
Data captured from the internet for building augmenters and test cases.
See data source for more details.