Natural language processing augmentation library for deep neural networks

[![Build Status](https://travis-ci.org/makcedward/nlpaug.svg?branch=master)](https://travis-ci.org/makcedward/nlpaug)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/2d6d1d08016a4f78818161a89a2dfbfb)](https://www.codacy.com/app/makcedward/nlpaug?utm_source=github.com&utm_medium=referral&utm_content=makcedward/nlpaug&utm_campaign=Badge_Grade)
[![Codecov Badge](https://codecov.io/gh/makcedward/nlpaug/branch/master/graph/badge.svg)](https://codecov.io/gh/makcedward/nlpaug)

# nlpaug

This Python library helps you augment NLP data for your machine learning projects. For background, see this introduction to [Data Augmentation in NLP](https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28).

## Features

* Provide both character and word level augmentations which include:
    * Character Augmentation: OCR, QWERTY (Keyboard Distance), Random Behavior
    * Word Augmentation:
        * Random Behavior: RandomWord
        * Synonym: WordNet
        * Word Embeddings: [word2vec, GloVe, fasttext](https://towardsdatascience.com/3-silver-bullets-of-word-embedding-in-nlp-10fa8f50cc5a)
        * Language Models: [BERT](https://towardsdatascience.com/how-bert-leverage-attention-mechanism-and-transformer-to-learn-word-contextual-relations-5bbee1b6dbdb)
* Speech Recognition Augmentation:
    * Spectrogram: Frequency Masking, Time Masking
* Flow orchestration is supported. Flow includes:
    * Sequential: Apply data augmentations one by one
    * Sometimes: Apply some augmentations randomly
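
The two flow types can be illustrated with a minimal, library-independent sketch. It does not use the nlpaug API: `swap_adjacent_chars`, `drop_random_word`, `sequential`, and `sometimes` are hypothetical names invented here, and reading "Sometimes" as "apply each augmenter independently with probability `p`" is an assumption.

```python
import random


def swap_adjacent_chars(text, rng):
    """Toy character-level augmenter: swap one random pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def drop_random_word(text, rng):
    """Toy word-level augmenter: delete one random word."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)


def sequential(augmenters, text, rng):
    """Apply every augmenter in order, feeding each output to the next one."""
    for augment in augmenters:
        text = augment(text, rng)
    return text


def sometimes(augmenters, text, rng, p=0.5):
    """Apply each augmenter independently with probability p (assumed semantics)."""
    for augment in augmenters:
        if rng.random() < p:
            text = augment(text, rng)
    return text


rng = random.Random(0)  # fixed seed so repeated runs give the same output
sentence = "The quick brown fox jumps over the lazy dog"
pipeline = [swap_adjacent_chars, drop_random_word]
print(sequential(pipeline, sentence, rng))
print(sometimes(pipeline, sentence, rng, p=0.5))
```

Passing the random generator explicitly keeps each augmenter deterministic under a fixed seed, which makes augmented datasets reproducible.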

## Example
* How to use [pre-defined augmentation](https://github.com/makcedward/nlpaug/blob/master/example/overview.ipynb)
* How to create [custom augmentation](https://github.com/makcedward/nlpaug/blob/master/example/custom_augmenter.ipynb)
* How to use [spectrogram augmentation for speech recognition](https://github.com/makcedward/nlpaug/blob/master/example/spectrogram_augmenter.ipynb)

Frequency Masking
![Frequency Masking](https://github.com/makcedward/nlpaug/blob/master/res/spectrogram-frequency_masking.png)

Time Masking
![Time Masking](https://github.com/makcedward/nlpaug/blob/master/res/spectrogram-time_masking.png)
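
As a rough illustration of what the two masks do, here is a pure-Python sketch on a toy spectrogram stored as `spec[frequency][time]`. It is not the library's implementation: the function names, the `max_width` parameter, and the choice to fill masked cells with zeros are all assumptions made for this example.

```python
import random


def frequency_mask(spec, max_width, rng):
    """Zero out a random band of consecutive frequency rows across all time steps."""
    n_freq = len(spec)
    width = rng.randint(0, min(max_width, n_freq))
    start = rng.randint(0, n_freq - width)
    return [
        [0.0] * len(row) if start <= f < start + width else list(row)
        for f, row in enumerate(spec)
    ]


def time_mask(spec, max_width, rng):
    """Zero out a random band of consecutive time columns across all frequencies."""
    n_time = len(spec[0])
    width = rng.randint(0, min(max_width, n_time))
    start = rng.randint(0, n_time - width)
    return [
        [0.0 if start <= t < start + width else value for t, value in enumerate(row)]
        for row in spec
    ]


rng = random.Random(42)
spec = [[1.0] * 8 for _ in range(6)]  # toy input: 6 frequency bins x 8 time frames
masked = time_mask(frequency_mask(spec, max_width=2, rng=rng), max_width=3, rng=rng)
for row in masked:
    print(" ".join(str(int(value)) for value in row))
```

Both functions return a new nested list and leave the input untouched, so the original spectrogram can be reused to generate several augmented copies.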

## Installation

The library supports Python 3.5+ on Linux and Windows.

To install the library:
```bash
pip install nlpaug
```

Download the word2vec, GloVe, or fasttext files if you use `Word2VecAug`, `GloVeAug`, or the fasttext-based augmenter:
* word2vec([GoogleNews-vectors-negative300](https://code.google.com/archive/p/word2vec/))
* GloVe([glove.6B.50d](https://nlp.stanford.edu/projects/glove/))
* fasttext([wiki-news-300d-1M.vec.zip](https://fasttext.cc/docs/en/english-vectors.html))
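
To give a feel for how such embedding files can drive word substitution, the sketch below parses the plain-text vector format (one `word v1 v2 ... vn` entry per line, as used by the GloVe `.txt` and fasttext `.vec` files) and looks up the nearest neighbour by cosine similarity. It is an assumption-laden illustration, not the library's code: `load_text_vectors`, `cosine`, and `most_similar` are hypothetical helpers, and the three-dimensional vectors are made up. The word2vec `.bin` file is a binary format and needs a different loader.

```python
import io
import math


def load_text_vectors(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into a {word: vector} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 3:  # skip blank lines and a possible "count dim" header
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, top_n=3):
    """Return the top_n other words ranked by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_n]]


# Made-up toy vectors standing in for a real embedding file.
sample = io.StringIO(
    "good 0.9 0.1 0.0\n"
    "great 0.8 0.2 0.1\n"
    "bad -0.9 0.1 0.0\n"
)
vectors = load_text_vectors(sample)
print(most_similar("good", vectors, top_n=1))
```

With a real file, an open text handle can be passed in place of `sample`, for example `open(path, encoding="utf-8")`. A linear scan like this is slow on full-size vocabularies, so treat it as a conceptual sketch only.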

## Recent Changes

**0.0.1** Mar 20, 2019: Project initialization

**0.0.2** Apr 30, 2019: Added Frequency Masking and Time Masking for Speech Recognition (Spectrogram). Added librosa library dependency for converting wav to spectrogram.

## Test

Word2vec and GloVe models are used in word insertion and substitution. These model files are required to run the test cases. Add a `.env` file to the root directory with the following content:

```
MODEL_DIR={MODEL FILE PATH}
```

The model folder should have the following structure:

```
-- root directory
    - glove.6B.50d.txt
    - GoogleNews-vectors-negative300.bin
    - wiki-news-300d-1M.vec
```
