A package for augmenting text data using NLP techniques directly in your pandas dataframe.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Easy Text Augmenter

Easy Text Augmenter is a Python package for augmenting text data directly in your pandas DataFrame using NLP techniques. It currently provides three techniques:

augment_random_word
augment_random_character
augment_word_bert

Installation

# Notebook cell syntax; in a shell, run the command without the leading "!"
!pip install easy-nlp-augmentation
import easy_text_augmenter
easy_text_augmenter.info()
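
Note that the distribution name on PyPI (easy-nlp-augmentation) differs from the import name (easy_text_augmenter). A minimal sketch for checking from Python whether the distribution is installed, using only the standard library. The helper name installed_version is illustrative and not part of the package:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Metadata lookups use the PyPI distribution name, not the import name.
print(installed_version("easy-nlp-augmentation"))
```

If this prints None, the install step above has not been run in the current environment.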

How to use

augment_random_word

import pandas as pd
from easy_text_augmenter import augment_random_word

df = pd.DataFrame({
    'text': ['This is a test', 'Another test data ', 'Of course we need more data', 'Newton does not like apple', 'Hello world I am a human'],
    'label': ['A', 'A', 'B', 'B', 'A']
})
classes_to_augment = ['A', 'B']
augmented_df = augment_random_word(df, classes_to_augment, augmentation_percentage=0.8, text_column='text')
print(augmented_df)

Result:

                          text label
0               This is a test     A
1           Another test data      A
2  Of course we need more data     B
3   Newton does not like apple     B
4     Hello world I am a human     A
5             Th is is a te st     A
6                 Another data     A
7   Does not newton like apple     B

augment_random_character

from easy_text_augmenter import augment_random_character

classes_to_augment = ['A', 'B']
augmented_df = augment_random_character(df, classes_to_augment, augmentation_percentage=0.8, text_column='text')
print(augmented_df)

Result:

                          text label
0               This is a test     A
1           Another test data      A
2  Of course we need more data     B
3   Newton does not like apple     B
4     Hello world I am a human     A
5               This is a estt     A
6            Another te8t data     A
7   Newtun d0e8 not like apple     B

augment_word_bert

from easy_text_augmenter import augment_word_bert

classes_to_augment = ['A', 'B']
augmented_df = augment_word_bert(df, classes_to_augment, augmentation_percentage=0.8, text_column='text', model_path='bert-base-uncased', random_state=70)
print(augmented_df)

Result:

                                          text label
0                               This is a test     A
1                           Another test data      A
2                  Of course we need more data     B
3                   Newton does not like apple     B
4                     Hello world I am a human     A
5                         another test of data     A
6                      this term is not a test     A
7  newton does absolutely not like every apple     B

Authors

Contact me at :

Documentation

augment_random_word

Description:

The augment_random_word function augments a specified percentage of samples in given classes of a DataFrame by randomly applying one of three augmentation techniques (swap, delete, split) to the text column.

augment_random_word(df, classes_to_augment, augmentation_percentage, text_column, random_state=42, weights=[0.5, 0.3, 0.2])

Parameters:

df (pandas.DataFrame): The input DataFrame containing the text data and labels.
classes_to_augment (list): A list of class labels that need to be augmented.
augmentation_percentage (float): The fraction of samples to augment from each specified class (e.g., 0.8 for 80%).
text_column (str): The name of the column in the DataFrame that contains the text data.
random_state (int, optional): A random seed used to select which rows to augment. Default is 42.
weights (list, optional): A list of weights that sets the probability of selecting each augmentation technique. Default is [0.5, 0.3, 0.2] for swap, delete, and split, respectively.

Techniques selected by weights:

swap: randomly swaps words in the text.
delete: randomly deletes a word from the text.
split: randomly splits a word in the text into two parts.

Returns:

pandas.DataFrame: A new DataFrame with the augmented data appended to the original data.
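
As a rough illustration of the three word-level techniques and the weighted choice between them, here is a standard-library sketch. It is not the package's implementation: the function names, the adjacent-word swap, and the single-operation-per-text behavior are all assumptions made for the example.

```python
import random


def swap_words(words, rng):
    """Swap two adjacent words (assumption: the package may swap differently)."""
    if len(words) < 2:
        return words[:]
    i = rng.randrange(len(words) - 1)
    out = words[:]
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def delete_word(words, rng):
    """Drop one word, but never empty the text."""
    if len(words) < 2:
        return words[:]
    i = rng.randrange(len(words))
    return words[:i] + words[i + 1:]


def split_word(words, rng):
    """Split one word of length >= 2 into two pieces."""
    candidates = [i for i, w in enumerate(words) if len(w) >= 2]
    if not candidates:
        return words[:]
    i = rng.choice(candidates)
    cut = rng.randrange(1, len(words[i]))
    return words[:i] + [words[i][:cut], words[i][cut:]] + words[i + 1:]


# Order matches the documented default weights: swap, delete, split.
TECHNIQUES = [swap_words, delete_word, split_word]


def augment_text(text, rng, weights=(0.5, 0.3, 0.2)):
    """Pick one technique according to weights and apply it once."""
    technique = rng.choices(TECHNIQUES, weights=weights, k=1)[0]
    return " ".join(technique(text.split(), rng))


rng = random.Random(42)
print(augment_text("Newton does not like apple", rng))
```

Setting a weight to 0 disables that technique, so weights=(0, 0, 1) in this sketch would only ever split words.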

augment_random_character

Description:

The augment_random_character function performs random character-based augmentations on specific classes of text data within a DataFrame. It uses several augmentation techniques to randomly alter characters in the text, increasing the diversity of the dataset.

augment_random_character(df, classes_to_augment, augmentation_percentage, text_column, random_state=42, weights=[0.2, 0.2, 0.2, 0.2, 0.2])

Parameters:

df (pd.DataFrame): The input DataFrame containing text data and their corresponding labels.
classes_to_augment (list): A list of class labels indicating which classes should be augmented.
augmentation_percentage (float): The fraction of samples in each class that should be augmented (e.g., 0.8 for 80%).
text_column (str): The column name in the DataFrame that contains the text data to be augmented.
random_state (int, optional): A random seed used to select which rows to augment. Default is 42.
weights (list, optional): A list of weights, one per augmentation technique, that sets the probability of choosing each technique. Default is [0.2, 0.2, 0.2, 0.2, 0.2].

Techniques selected by weights:

aug_ocr: OCR-based augmentation.
aug_keyboard: Keyboard error simulation.
aug_insert: Random character insertion.
aug_swap: Random character swapping.
aug_delete: Random character deletion.

Returns:

pandas.DataFrame: A new DataFrame with the augmented data appended to the original data.
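
The five character-level techniques can be pictured with the following standard-library sketch. It is illustrative only and not the package's implementation: the two lookup tables are tiny hypothetical samples, the function bodies are assumptions, and the order of TECHNIQUES simply follows the list above.

```python
import random
import string

# Tiny hypothetical lookup tables; real OCR and keyboard maps are much larger.
OCR_CONFUSIONS = {"o": "0", "l": "1", "s": "8", "i": "1"}
KEYBOARD_NEIGHBORS = {"a": "sq", "e": "wr", "o": "ip", "t": "ry"}


def _pick_index(text, rng, allowed=None):
    """Random index of a non-space character (optionally limited to allowed keys)."""
    idx = [i for i, ch in enumerate(text)
           if not ch.isspace() and (allowed is None or ch.lower() in allowed)]
    return rng.choice(idx) if idx else None


def aug_ocr(text, rng):
    """Replace a character with a look-alike, as an OCR engine might."""
    i = _pick_index(text, rng, OCR_CONFUSIONS)
    if i is None:
        return text
    return text[:i] + OCR_CONFUSIONS[text[i].lower()] + text[i + 1:]


def aug_keyboard(text, rng):
    """Replace a character with a neighboring key."""
    i = _pick_index(text, rng, KEYBOARD_NEIGHBORS)
    if i is None:
        return text
    return text[:i] + rng.choice(KEYBOARD_NEIGHBORS[text[i].lower()]) + text[i + 1:]


def aug_insert(text, rng):
    """Insert a random lowercase letter at a random position."""
    i = rng.randrange(len(text) + 1)
    return text[:i] + rng.choice(string.ascii_lowercase) + text[i:]


def aug_swap(text, rng):
    """Swap two adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def aug_delete(text, rng):
    """Delete one non-space character."""
    i = _pick_index(text, rng)
    if i is None:
        return text
    return text[:i] + text[i + 1:]


TECHNIQUES = [aug_ocr, aug_keyboard, aug_insert, aug_swap, aug_delete]


def augment_chars(text, rng, weights=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """Pick one technique according to weights and apply it once."""
    technique = rng.choices(TECHNIQUES, weights=weights, k=1)[0]
    return technique(text, rng)


rng = random.Random(42)
print(augment_chars("Another test data", rng))
```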

augment_word_bert

Description:

The augment_word_bert function augments text data in a DataFrame using a BERT-based word augmentation technique. It inserts or substitutes words in the specified text column for a given percentage of samples in the specified classes.

augment_word_bert(df, classes_to_augment, augmentation_percentage, text_column, model_path, random_state=42, weights=[0.7, 0.3])

Parameters:

df (pandas.DataFrame): The DataFrame containing the data to be augmented.
classes_to_augment (list): A list of class labels indicating which classes should be augmented.
augmentation_percentage (float): The percentage of samples within each class to augment (e.g., 0.2 for 20%).
text_column (str): The name of the column in the DataFrame that contains the text to be augmented.
model_path (str): The name or path of the pre-trained BERT model used for augmentation (the usage example passes 'bert-base-uncased').
random_state (int, optional): A random seed used to select which rows to augment. Default is 42.
weights (list, optional): The weights for choosing between the insertion and substitution augmentation techniques. Default is [0.7, 0.3].

Returns:

pandas.DataFrame: The original DataFrame with additional augmented samples.
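
To show how insertion differs from substitution, the sketch below builds only the masked input that a BERT-style masked language model would be asked to fill. It deliberately stops before the prediction step, which needs a downloaded model. The function names are hypothetical and the package may work differently internally. "[MASK]" is the mask token of bert-base-uncased.

```python
import random

MASK = "[MASK]"  # mask token used by bert-base-uncased


def mask_for_substitute(words, rng):
    """Replace one existing word with the mask token (word count unchanged)."""
    if not words:
        return words[:]
    i = rng.randrange(len(words))
    return words[:i] + [MASK] + words[i + 1:]


def mask_for_insert(words, rng):
    """Add the mask token at a new position (word count grows by one)."""
    i = rng.randrange(len(words) + 1)
    return words[:i] + [MASK] + words[i:]


def build_masked_input(text, rng, weights=(0.7, 0.3)):
    """Choose insert or substitute by weight; return (action name, masked text)."""
    actions = [("insert", mask_for_insert), ("substitute", mask_for_substitute)]
    name, fn = rng.choices(actions, weights=weights, k=1)[0]
    return name, " ".join(fn(text.split(), rng))


rng = random.Random(70)
print(build_masked_input("Newton does not like apple", rng))
```

The model's prediction for the masked position would then replace the mask token, which is why insertion lengthens the text and substitution keeps its length.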

Release history

1.6: Feb 13, 2025 (this version)
1.1: Jun 26, 2024
1.0: Jun 26, 2024

Download files

Source Distribution

easy_nlp_augmentation-1.6.tar.gz (5.8 kB view details)

Uploaded Feb 13, 2025 Source

Built Distribution

easy_nlp_augmentation-1.6-py3-none-any.whl (7.0 kB view details)

Uploaded Feb 13, 2025 Python 3

File details

Details for the file easy_nlp_augmentation-1.6.tar.gz.

File metadata

Download URL: easy_nlp_augmentation-1.6.tar.gz
Upload date: Feb 13, 2025
Size: 5.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.5

File hashes

Hashes for easy_nlp_augmentation-1.6.tar.gz
Algorithm	Hash digest
SHA256	`04c99b878cfcb4dbfd01b8b8d9fc9761d7ab10e2f0528724b021d63d0d5c1da3`
MD5	`88b135aee3326ef9aa69e8c8c4e84f0d`
BLAKE2b-256	`f611f34b07707d612c5bff19f63749218a49c39f2e0a0b1b55565e415a62e76d`

File details

Details for the file easy_nlp_augmentation-1.6-py3-none-any.whl.

File metadata

Download URL: easy_nlp_augmentation-1.6-py3-none-any.whl
Upload date: Feb 13, 2025
Size: 7.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.5

File hashes

Hashes for easy_nlp_augmentation-1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5e2e1474b34a5da0e1b38dcb4865d24904667c71da46f33e404f185c36348ae3`
MD5	`00ebca21ce9b36078d75706b732ae6df`
BLAKE2b-256	`958a45fd7277b71b3bd1c3cddb6853da4db5d23f65dd2841d13c92afe52fbf15`
