A library for augmenting text for natural language processing applications.
- Python 3
The following software packages are dependencies and will be installed automatically.
$ pip install numpy nltk gensim textblob googletrans
The following code downloads the NLTK WordNet corpus.
The following code downloads the NLTK Punkt tokenizer. This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.
The following code downloads the default NLTK part-of-speech tagger model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.
Use gensim to load a pre-trained word2vec model, such as the Google News vectors available on Google Drive.

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
Or train one from scratch using your own data or the following public dataset:
Install from pip [Recommended]
$ pip install textaugment

or install the latest version from GitHub:

$ pip install git+https://github.com/dsfsi/textaugment.git
Install from source
$ git clone git@github.com:dsfsi/textaugment.git
$ cd textaugment
$ python setup.py install
How to use
There are three types of augmentations which can be used:

- word2vec

  from textaugment import Word2vec

- wordnet

  from textaugment import Wordnet

- translate (this will require internet access)

  from textaugment import Translate
>>> from textaugment import Word2vec
>>> t = Word2vec(model='path/to/gensim/model')  # or pass a loaded gensim model object
>>> t.augment('The stories are good')
The films are good
>>> runs = 1  # Number of augmentation runs; 1 by default.
>>> v = False  # Verbose mode replaces all the words; if enabled, runs has no effect. Used in this paper: https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf
>>> p = 0.5  # Probability of success of an individual trial (0.1 < p < 1.0); 0.5 by default. Used by the geometric distribution to select words from a sentence.
>>> t = Word2vec(model='path/to/gensim/model', runs=5, v=False, p=0.5)
>>> t.augment('The stories are good')
The movies are excellent
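To illustrate what the p parameter controls, here is a standalone sketch (not textaugment's actual code) of drawing from a geometric distribution to decide how many words in a sentence are candidates for replacement:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
words = "The stories are good".split()

p = 0.5  # probability of success of an individual trial
# A geometric draw gives the number of trials until the first success
# (always >= 1); smaller p tends to produce larger draws, i.e. more
# words considered for replacement. Cap the draw at sentence length.
n_to_replace = min(int(rng.geometric(p)), len(words))
```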
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, John is walking to town
>>> v = True  # Enable verb augmentation; True by default.
>>> n = False  # Enable noun augmentation; False by default.
>>> runs = 1  # Number of times to augment a sentence; 1 by default.
>>> p = 0.5  # Probability of success of an individual trial (0.1 < p < 1.0); 0.5 by default. Used by the geometric distribution to select words from a sentence.
>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, Joseph is going to town.
>>> from textaugment import Translate
>>> src = "en"  # source language of the sentence
>>> to = "fr"  # target language
>>> t = Translate(src="en", to="fr")
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town
Please cite this paper when using this library.
MIT licensed. See the bundled LICENCE file for more details.