A Python library to augment text data using NLP.
TextGenie
TextGenie is a Python library that augments a text dataset by generating samples similar to the existing ones, giving you a more robust dataset for training better models. It also handles labeled datasets: each generated sample keeps the label of the sentence it was derived from.
It uses several Natural Language Processing methods: paraphrase generation, BERT mask filling, and conversion of passive-voice text to active voice. The library currently supports the English language.
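To make the mask-filling idea concrete, here is a minimal sketch of how masked variants of a sentence could be prepared before a BERT model predicts the missing word. The helper name masked_variants and the choice to mask every word in turn are assumptions made for illustration; TextGenie's own masking strategy may differ.

```python
def masked_variants(sentence, mask_token="[MASK]"):
    """Return one copy of the sentence per word, with that word replaced by the mask token.

    Hypothetical helper for illustration only; it is not part of the TextGenie API.
    """
    words = sentence.split()
    variants = []
    for i in range(len(words)):
        # Replace the i-th word and keep every other word as it is.
        masked = words[:i] + [mask_token] + words[i + 1:]
        variants.append(" ".join(masked))
    return variants


for variant in masked_variants("I plan to run it again"):
    print(variant)
```

A mask-filling model would then propose words for each [MASK] position, and each accepted proposal yields a new sentence.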
Installation
$ pip install textgenie
Example
from textgenie import TextGenie
textgenie = TextGenie("ramsrigouthamg/t5_paraphraser", "bert-base-uncased")
# Augment a list of sentences
sentences = ["The video was posted on Facebook by Alex.", "I plan to run it again this time"]
textgenie.magic_lamp(sentences, "paraphrase: ", n_mask_predictions=5, convert_to_active=True)
# Augment data in a txt file
textgenie.magic_lamp("sentences.txt", "paraphrase: ", n_mask_predictions=5, convert_to_active=True)
# Augment data in a csv file with labels
textgenie.magic_lamp("sentences.csv", "paraphrase: ", n_mask_predictions=5, convert_to_active=True)
Usage
- Initializing the augmentor:
textgenie = TextGenie(paraphrase_model_name="model_name", mask_model_name="model_name", spacy_model_name="model_name", device="cpu")
- Parameters:
- paraphrase_model_name:
- The name of the T5 paraphrase model.
- mask_model_name:
- The BERT model used to fill masks. Mask filling is disabled by default and can be enabled by passing the name of the BERT model to use. A list of mask filling models can be found here
- spacy_model_name:
- Name of the spaCy model. Available models can be found here. The default value is set to en.
- device:
- The device where the model will be loaded. The default value is set to cpu.
- Methods:
- augment_sent_mask_filling():
- Generate augmented data using BERT mask filling.
- augment_sent_t5():
- Generate augmented data using T5 paraphrasing model.
- convert_to_active():
- Converts a sentence to active voice if it is in passive voice; otherwise returns the sentence unchanged.
- magic_once():
- A wrapper around the augment_sent_mask_filling(), augment_sent_t5() and convert_to_active() methods. It augments a single sentence using all of the techniques above.
- magic_lamp():
- This method augments a whole dataset. Currently accepted dataset formats are: txt, csv, tsv and list.
- If the dataset is in list or txt format, a list of augmented sentences is returned. A txt file named sentences_aug.txt containing the augmented data is also saved.
- If the dataset is in csv or tsv format with labels, the dataset is augmented with each new sample keeping the label of its source sentence, and a pandas dataframe of the augmented data is returned. A csv file with the augmented output is also generated, named original_csv_file_name_aug.csv
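The label handling described for csv and tsv input can be sketched without loading any model. In the example below, fake_augment is a hypothetical stand-in for a real augmenter, the column names text and label are assumptions, and keeping the original rows in the output is a choice made for this sketch rather than confirmed TextGenie behavior.

```python
import csv
import io


def fake_augment(sentence):
    # Hypothetical stand-in for a real augmenter such as a paraphrase model.
    return [sentence.lower(), sentence.upper()]


def augment_labeled_rows(rows, augment):
    """Return (text, label) pairs where every variant inherits its source row's label."""
    out = []
    for text, label in rows:
        out.append((text, label))  # keep the original sample
        for variant in augment(text):
            if variant != text:  # skip variants identical to the source
                out.append((variant, label))
    return out


# A tiny in-memory csv with assumed column names "text" and "label".
raw_csv = (
    "text,label\n"
    '"The video was posted on Facebook by Alex.",social\n'
    '"I plan to run it again this time",plan\n'
)
reader = csv.DictReader(io.StringIO(raw_csv))
rows = [(r["text"], r["label"]) for r in reader]

augmented = augment_labeled_rows(rows, fake_augment)
for text, label in augmented:
    print(label, "|", text)
```

The point of the sketch is only that the label travels with every generated variant, so the augmented dataset stays usable for supervised training.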
License
Please check LICENSE for more details.