A package for augmenting text data using NLP techniques directly in your pandas dataframe.

Easy Text Augmenter

Easy Text Augmenter is a Python package that augments text data directly in your pandas DataFrame using various NLP techniques. Three techniques are currently available:

  • augment_random_word
  • augment_random_character
  • augment_word_bert

Installation

!pip install easy-nlp-augmentation
import easy_text_augmenter
easy_text_augmenter.info()

How to use

augment_random_word

import pandas as pd
from easy_text_augmenter import augment_random_word

df = pd.DataFrame({
    'text': ['This is a test', 'Another test data ', 'Of course we need more data', 'Newton does not like apple', 'Hello world I am a human'],
    'label': ['A', 'A', 'B', 'B', 'A']
})
classes_to_augment = ['A', 'B']
augmented_df = augment_random_word(df, classes_to_augment, augmentation_percentage=0.8, text_column='text')
print(augmented_df)

Result:

                          text label
0               This is a test     A
1           Another test data      A
2  Of course we need more data     B
3   Newton does not like apple     B
4     Hello world I am a human     A
5             Th is is a te st     A
6                 Another data     A
7   Does not newton like apple     B

augment_random_character

from easy_text_augmenter import augment_random_character

classes_to_augment = ['A', 'B']
augmented_df = augment_random_character(df, classes_to_augment, augmentation_percentage=0.8, text_column='text')
print(augmented_df)

Result:

                          text label
0               This is a test     A
1           Another test data      A
2  Of course we need more data     B
3   Newton does not like apple     B
4     Hello world I am a human     A
5               This is a estt     A
6            Another te8t data     A
7   Newtun d0e8 not like apple     B

augment_word_bert

from easy_text_augmenter import augment_word_bert

classes_to_augment = ['A', 'B']
augmented_df = augment_word_bert(df, classes_to_augment, augmentation_percentage=0.8, text_column='text', model_path='bert-base-uncased', random_state=70)
print(augmented_df)

Result:

                                          text label
0                               This is a test     A
1                           Another test data      A
2                  Of course we need more data     B
3                   Newton does not like apple     B
4                     Hello world I am a human     A
5                         another test of data     A
6                      this term is not a test     A
7  newton does absolutely not like every apple     B

Documentation

augment_random_word

Description:

The augment_random_word function augments a specified percentage of samples in given classes of a DataFrame by randomly applying one of three augmentation techniques (swap, delete, split) to the text column.

augment_random_word(df, classes_to_augment, augmentation_percentage, text_column, random_state=42, weights=[0.5, 0.3, 0.2])

Parameters:

  • df (pandas.DataFrame): The input DataFrame containing the text data and labels.
  • classes_to_augment (list): A list of class labels that need to be augmented.
  • augmentation_percentage (float): The fraction of samples to augment from each specified class (e.g., 0.8 for 80%).
  • text_column (str): The name of the column in the DataFrame that contains the text data.
  • random_state (int, optional): A random seed used to select which rows to augment. Default is 42.
  • weights (list, optional): A list of weights to determine the probability of selecting each augmentation type. Default is [0.5, 0.3, 0.2] for swap, delete, and split, respectively.

Techniques weighted by weights, in order:

  • swap: randomly swaps words in the text.
  • delete: randomly deletes words from the text.
  • split: randomly splits words in the text.
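
To make the three techniques and the role of weights concrete, here is a minimal standard-library sketch. It is not the package's implementation: the helper names (`swap_words`, `delete_word`, `split_word`, `augment_text`) are hypothetical, and each helper changes exactly one position, which is a simplifying assumption.

```python
import random

def swap_words(words, rng):
    """Swap one randomly chosen pair of adjacent words."""
    if len(words) < 2:
        return words[:]
    i = rng.randrange(len(words) - 1)
    out = words[:]
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def delete_word(words, rng):
    """Delete one randomly chosen word, keeping at least one word."""
    if len(words) < 2:
        return words[:]
    i = rng.randrange(len(words))
    return words[:i] + words[i + 1:]

def split_word(words, rng):
    """Split one word of at least two characters into two pieces."""
    candidates = [i for i, w in enumerate(words) if len(w) >= 2]
    if not candidates:
        return words[:]
    i = rng.choice(candidates)
    cut = rng.randrange(1, len(words[i]))
    return words[:i] + [words[i][:cut], words[i][cut:]] + words[i + 1:]

# Order mirrors the documented default weights: swap, delete, split.
TECHNIQUES = [swap_words, delete_word, split_word]

def augment_text(text, rng, weights=(0.5, 0.3, 0.2)):
    """Pick one technique with probability proportional to its weight."""
    technique = rng.choices(TECHNIQUES, weights=weights, k=1)[0]
    return " ".join(technique(text.split(), rng))

rng = random.Random(42)
for _ in range(3):
    print(augment_text("Newton does not like apple", rng))
```

Passing weights such as (1, 0, 0) restricts the sketch to a single technique, which is a convenient way to inspect each one in isolation.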

Returns:

  • pandas.DataFrame: A new DataFrame with the augmented data appended to the original data.

augment_random_character

Description:

The augment_random_character function performs random character-based augmentations on specific classes of text data within a DataFrame. It uses several augmentation techniques to randomly alter characters in the text, increasing the diversity of the dataset.

augment_random_character(df, classes_to_augment, augmentation_percentage, text_column, random_state=42, weights=[0.2, 0.2, 0.2, 0.2, 0.2])

Parameters:

  • df (pd.DataFrame): The input DataFrame containing text data and their corresponding labels.
  • classes_to_augment (list): A list of class labels indicating which classes should be augmented.
  • augmentation_percentage (float): The fraction of samples in each class that should be augmented (e.g., 0.8 for 80%).
  • text_column (str): The column name in the DataFrame that contains the text data to be augmented.
  • random_state (int, optional): A random seed used to select which rows to augment. Default is 42.
  • weights (list, optional): A list of weights for each augmentation technique, used to determine the probability of choosing each technique. Default is [0.2, 0.2, 0.2, 0.2, 0.2].

Techniques weighted by weights:

  • aug_ocr: OCR-based augmentation.
  • aug_keyboard: Keyboard error simulation.
  • aug_insert: Random character insertion.
  • aug_swap: Random character swapping.
  • aug_delete: Random character deletion.
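
The character-level techniques can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration rather than the package's code: the helper names are hypothetical, `OCR_MAP` is a tiny made-up confusion table (chosen to echo substitutions like "te8t" in the sample output above), and keyboard error simulation is left out because it needs a key-adjacency table.

```python
import random
import string

# Hypothetical OCR look-alike table; real OCR error models are far larger.
OCR_MAP = {"o": "0", "l": "1", "s": "8", "b": "6"}

def ocr_substitute(word, rng):
    """Replace one character that has an OCR look-alike, if there is one."""
    positions = [i for i, ch in enumerate(word) if ch in OCR_MAP]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + OCR_MAP[word[i]] + word[i + 1:]

def insert_char(word, rng):
    """Insert one random lowercase letter at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]

def swap_chars(word, rng):
    """Swap one randomly chosen pair of adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def delete_char(word, rng):
    """Delete one character, keeping at least one."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]

# Four techniques only, so this sketch takes four weights, not five.
CHAR_TECHNIQUES = [ocr_substitute, insert_char, swap_chars, delete_char]

def augment_one_word(text, rng, weights=(0.25, 0.25, 0.25, 0.25)):
    """Apply one weighted-random character technique to one random word."""
    words = text.split()
    if not words:
        return text
    i = rng.randrange(len(words))
    technique = rng.choices(CHAR_TECHNIQUES, weights=weights, k=1)[0]
    words[i] = technique(words[i], rng)
    return " ".join(words)

rng = random.Random(42)
print(ocr_substitute("test", rng))  # te8t ('s' is the only mapped character)
print(augment_one_word("Another test data", rng))
```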

Returns:

  • pandas.DataFrame: A new DataFrame with the augmented data appended to the original data.

augment_word_bert

Description:

The augment_word_bert function augments text data in a DataFrame using a BERT-based word augmentation technique. It inserts or substitutes words in the specified text column for a given percentage of samples in the specified classes.

augment_word_bert(df, classes_to_augment, augmentation_percentage, text_column, model_path, random_state=42, weights=[0.7, 0.3])

Parameters:

  • df (pandas.DataFrame): The DataFrame containing the data to be augmented.
  • classes_to_augment (list): A list of class labels indicating which classes should be augmented.
  • augmentation_percentage (float): The percentage of samples within each class to augment (e.g., 0.2 for 20%).
  • text_column (str): The name of the column in the DataFrame that contains the text to be augmented.
  • model_path (str): The path to the pre-trained BERT model used for augmentation.
  • random_state (int, optional): A random seed used to select which rows to augment. Default is 42.
  • weights (list, optional): The weights for choosing between the insertion and substitution augmentation techniques (default is [0.7, 0.3]).

Returns:

  • pandas.DataFrame: The original DataFrame with additional augmented samples.
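
The difference between insertion and substitution can be sketched without loading a model. In the sketch below, which is an illustration and not the package's implementation, a mask token is either inserted between words or put in place of an existing word, and a predictor fills it in. The names `augment_with_mlm` and `stub_predictor` are hypothetical, the weight order (insert, then substitute) is assumed from the parameter description above, and the stub predictor stands in for BERT.

```python
import random

MASK = "[MASK]"

def mask_for_insert(words, i):
    """Insert the mask token before position i (sentence grows by one word)."""
    return words[:i] + [MASK] + words[i:]

def mask_for_substitute(words, i):
    """Replace word i with the mask token (sentence length is unchanged)."""
    return words[:i] + [MASK] + words[i + 1:]

def augment_with_mlm(text, predict_mask, rng, weights=(0.7, 0.3)):
    """Insert or substitute one word using a mask-filling callable.

    predict_mask takes the masked sentence and returns one word.
    """
    words = text.split()
    if not words:
        return text
    action = rng.choices(["insert", "substitute"], weights=weights, k=1)[0]
    if action == "insert":
        masked = mask_for_insert(words, rng.randrange(len(words) + 1))
    else:
        masked = mask_for_substitute(words, rng.randrange(len(words)))
    prediction = predict_mask(" ".join(masked))
    return " ".join(prediction if w == MASK else w for w in masked)

def stub_predictor(masked_sentence):
    """Placeholder for a masked language model; always answers 'really'."""
    return "really"

rng = random.Random(70)
print(augment_with_mlm("Newton does not like apple", stub_predictor, rng))
```

A real predictor would wrap a masked language model such as bert-base-uncased, for example through the fill-mask pipeline in Hugging Face transformers; that wiring is omitted here so the sketch runs without downloading a model.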
