A Python library to augment text data using NLP.

TextGenie

TextGenie is a text data augmentation library that helps you augment your text dataset by generating similar samples, producing a more robust dataset for training better models. It also handles labeled datasets: the labels are kept in memory and carried over to the generated samples.

It uses several Natural Language Processing methods: paraphrase generation, BERT mask filling, and conversion of passive-voice text to active voice. The library currently supports the English language.
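
To make the mask filling idea concrete, here is a minimal sketch of how candidate inputs for a masked language model could be built. It assumes one word is masked at a time and uses BERT's `[MASK]` token; `masked_variants` is a hypothetical helper, and TextGenie may choose mask positions differently.

```python
def masked_variants(sentence, mask_token="[MASK]"):
    """Return one copy of the sentence per word, with that word masked."""
    words = sentence.split()
    variants = []
    for index in range(len(words)):
        # Replace only the word at `index`; keep every other word as-is.
        masked = words[:index] + [mask_token] + words[index + 1:]
        variants.append(" ".join(masked))
    return variants


variants = masked_variants("I plan to run")
for variant in variants:
    print(variant)
```

A mask filling model would then propose replacements for each `[MASK]`, and each filled-in sentence becomes a new sample.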

Installation

pip install textgenie

Example

from textgenie import TextGenie

textgenie = TextGenie("hetpandya/t5-small-tapaco", "bert-base-uncased")

# Augment a list of sentences
sentences = [
    "The video was posted on Facebook by Alex.",
    "I plan to run it again this time",
]
textgenie.magic_lamp(
    sentences, "paraphrase: ", n_mask_predictions=5, convert_to_active=True
)

# Augment data in a txt file
textgenie.magic_lamp(
    "sentences.txt", "paraphrase: ", n_mask_predictions=5, convert_to_active=True
)

# Augment data in a csv file with labels
textgenie.magic_lamp(
    "sentences.csv",
    "paraphrase: ",
    n_mask_predictions=5,
    convert_to_active=True,
    label_column="Label",
    data_column="Text",
    column_names=["Text", "Label"],
)

Examples can be found in the examples notebook.
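
To try the labeled csv call above, a small input file is needed. The following is a minimal sketch, assuming a headerless two-column file whose columns are named through column_names. The labels, the helper `write_labeled_csv` and the temporary directory are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_labeled_csv(path, rows):
    """Write (text, label) pairs to a headerless CSV file and return its path."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return path


# Hypothetical labels, used only for illustration.
rows = [
    ("The video was posted on Facebook by Alex.", "social"),
    ("I plan to run it again this time", "personal"),
]

csv_path = write_labeled_csv(Path(tempfile.mkdtemp()) / "sentences.csv", rows)

# Read the file back to confirm the layout: one text column, one label column.
with csv_path.open(newline="", encoding="utf-8") as handle:
    loaded = [tuple(row) for row in csv.reader(handle)]
```

Passing `str(csv_path)` as the first argument of magic_lamp(), together with column_names=["Text", "Label"], should then match the labeled example.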

Usage

Initializing the augmentor: textgenie = TextGenie(paraphrase_model_name="model_name", mask_model_name="model_name", spacy_model_name="model_name", device="cpu")
- Parameters:
  - paraphrase_model_name:
    - The name of the T5 paraphrase model.
    - A list of pretrained models for paraphrase generation can be found here.
  - mask_model_name:
    - The BERT model used to fill masks. Mask filling is disabled by default; to enable it, pass the name of the BERT model to use. A list of mask filling models can be found here.
  - spacy_model_name:
    - Name of the spaCy model. Available models can be found here. The default value is en_core_web_sm.
  - device:
    - The device on which the models will be loaded. The default value is cpu.
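
The constructor parameters above could be assembled as in the sketch below. It rests on two assumptions not confirmed here: that "cuda" is an accepted value for device, and that omitting mask_model_name leaves mask filling disabled. `pick_device`, `textgenie_kwargs` and `build_textgenie` are hypothetical helpers, not part of the library.

```python
def pick_device():
    """Return "cuda" if PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def textgenie_kwargs(paraphrase_model_name, mask_model_name=None,
                     spacy_model_name="en_core_web_sm"):
    """Collect keyword arguments for TextGenie, using the documented names."""
    kwargs = {
        "paraphrase_model_name": paraphrase_model_name,
        "spacy_model_name": spacy_model_name,
        "device": pick_device(),  # assumption: "cuda" is accepted
    }
    # Only pass a mask model when one is requested (assumed to be
    # disabled when the argument is left out).
    if mask_model_name is not None:
        kwargs["mask_model_name"] = mask_model_name
    return kwargs


def build_textgenie(**kwargs):
    """Create the augmentor; imported lazily so the helpers work without it."""
    from textgenie import TextGenie
    return TextGenie(**kwargs)


kwargs = textgenie_kwargs("hetpandya/t5-small-tapaco", "bert-base-uncased")
```

Calling `build_textgenie(**kwargs)` then loads the models, which may download them on first use.
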
Methods:
- augment_sent_mask_filling():
  - Generate augmented data using BERT mask filling.
  - Parameters:
    - sent:
      - The sentence to augment.
    - n_mask_predictions:
      - The number of predictions the BERT mask filling model should generate. The default value is 5.
- augment_sent_t5():
  - Generate augmented data using the T5 paraphrasing model.
  - Parameters:
    - sent:
      - The sentence to augment.
    - prefix:
      - The prefix for the T5 model input.
    - n_predictions:
      - The number of augmentations the function should return. The default value is 5.
    - top_k:
      - The number of predictions the T5 model should generate. The default value is 120.
    - max_length:
      - The maximum length of the sentence fed to the model. The default value is 256.
- convert_to_active():
  - Converts a sentence to active voice if it is in passive voice; otherwise returns the sentence unchanged.
  - Parameters:
    - sent:
      - The sentence to convert.
- magic_once():
  - This is a wrapper around the augment_sent_mask_filling(), augment_sent_t5() and convert_to_active() methods. It augments a single sentence using all of these techniques.
  - Since this method operates on individual sentences, it can be combined with other packages.
  - Parameters:
    - sent:
      - The sentence to augment.
    - paraphrase_prefix:
      - The prefix for the T5 model input.
    - n_paraphrase_predictions:
      - The number of augmentations the function should return. The default value is 5.
    - paraphrase_top_k:
      - The number of predictions the T5 model should generate. The default value is 120.
    - paraphrase_max_length:
      - The maximum length of the sentence fed to the model. The default value is 256.
    - n_mask_predictions:
      - The number of predictions the BERT mask filling model should generate. The default value is None.
    - convert_to_active:
      - Whether the sentence should be converted to active voice. The default value is True.
- magic_lamp():
  - This method augments a whole dataset. Currently accepted dataset formats are txt, csv, tsv and Python list.
  - If the dataset is a list or a txt file, a list of augmented sentences is returned. A txt file named sentences_aug.txt containing the augmented data is also saved.
  - If the dataset is a csv or tsv file with labels, the data is augmented while the labels are kept in memory for the new samples, and a pandas DataFrame of the augmented data is returned. A tsv file with the augmented output is also generated, named original_file_name_aug.tsv.
  - Parameters:
    - sentences:
      - The dataset to augment. This can be a Python list or a txt, csv or tsv file.
    - paraphrase_prefix:
      - The prefix for the T5 model input.
    - n_paraphrase_predictions:
      - The number of augmentations the function should return. The default value is 5.
    - paraphrase_top_k:
      - The number of predictions the T5 model should generate. The default value is 120.
    - paraphrase_max_length:
      - The maximum length of the sentence fed to the model. The default value is 256.
    - n_mask_predictions:
      - The number of predictions the BERT mask filling model should generate. The default value is None.
    - convert_to_active:
      - Whether the sentence should be converted to active voice. The default value is True.
    - label_column:
      - The name of the column that contains the labels. The default value is None. This parameter does not need to be set if the dataset is a Python list or a txt file.
    - data_column:
      - The name of the column that contains the text data. The default value is None. This parameter is also not required if the dataset is a Python list or a txt file.
    - column_names:
      - If the csv or tsv file does not have column names, a Python list has to be passed to name the columns. Since this function also accepts a Python list and a txt file, the default value is None, but this parameter has to be set when csv or tsv files are used.
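
The label handling described for magic_lamp() can be pictured with the sketch below: every generated sample inherits the label of the sentence it came from. This is not the library's implementation; `augment_labeled` and `toy_augment` are hypothetical. The stand-in augmenter could be swapped for a per-sentence method such as magic_once(), assuming that method returns a list of strings.

```python
def augment_labeled(rows, augment_fn):
    """Return the original (text, label) rows plus augmented copies.

    Each new sample keeps the label of the sentence it was generated from.
    """
    augmented = []
    for text, label in rows:
        augmented.append((text, label))
        for new_text in augment_fn(text):
            # Skip outputs identical to the input to avoid duplicate rows.
            if new_text != text:
                augmented.append((new_text, label))
    return augmented


def toy_augment(sentence):
    """Stand-in augmenter for illustration only; it just changes case."""
    return [sentence.lower(), sentence.upper()]


rows = [("Great movie", "pos"), ("bad plot", "neg")]
result = augment_labeled(rows, toy_augment)
for text, label in result:
    print(label, text)
```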

References

Passive To Active licensed under the Apache License 2.0

Links

Please find an in-depth explanation of the library on my blog.

License

Please check LICENSE for more details.

Release history

- 0.1.9.7 (this version): Nov 3, 2022
- 0.1.9.7b0 (pre-release): Nov 3, 2022
- 0.1.9.6: Dec 17, 2021
- 0.1.9.5: Dec 17, 2021
- 0.1.9.4: Dec 17, 2021
- 0.1.9.3: Sep 28, 2021
- 0.1.9.2: Jul 13, 2021
- 0.1.9.1: Jun 29, 2021
- 0.1.8: Jun 26, 2021
- 0.1.7: Jun 23, 2021
- 0.1.6: Jun 22, 2021
- 0.1.5: Jun 22, 2021
- 0.1.3: Jun 22, 2021
- 0.1.1: Jun 21, 2021
