A package for augmenting text data using NLP techniques directly in your pandas dataframe.
Easy Text Augmenter
Easy Text Augmenter is a Python package for augmenting text data directly in a pandas DataFrame using several NLP techniques. Three techniques are currently available:
- augment_random_word
- augment_random_character
- augment_word_bert
Installation
!pip install easy-nlp-augmentation
import easy_text_augmenter
easy_text_augmenter.info()
How to use
augment_random_word
import pandas as pd
from easy_text_augmenter import augment_random_word
df = pd.DataFrame({
'text': ['This is a test', 'Another test data ', 'Of course we need more data', 'Newton does not like apple', 'Hello world I am a human'],
'label': ['A', 'A', 'B', 'B', 'A']
})
classes_to_augment = ['A', 'B']
augmented_df = augment_random_word(df, classes_to_augment, augmentation_percentage=0.8, text_column='text')
print(augmented_df)
Result:
text label
0 This is a test A
1 Another test data A
2 Of course we need more data B
3 Newton does not like apple B
4 Hello world I am a human A
5 Th is is a te st A
6 Another data A
7 Does not newton like apple B
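The three appended rows in the result above (two for class A, one for class B) are consistent with taking 80% of each class and truncating to a whole number of rows. The sketch below illustrates that reading with plain Python lists. The helper name pick_rows_to_augment and the truncation rule are assumptions made for illustration; the package may select rows differently.

```python
import random

def pick_rows_to_augment(labels, classes_to_augment, augmentation_percentage, random_state=42):
    """Return the row indices chosen for augmentation, sampled class by class."""
    rng = random.Random(random_state)
    picked = []
    for cls in classes_to_augment:
        # Row positions that belong to the current class.
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        # Assumption: the fractional row count is truncated, not rounded.
        n = int(len(idx) * augmentation_percentage)
        picked.extend(sorted(rng.sample(idx, n)))
    return picked

labels = ['A', 'A', 'B', 'B', 'A']
picked = pick_rows_to_augment(labels, ['A', 'B'], 0.8)
print(len(picked))  # 3
```

With three rows in class A and two in class B, 0.8 yields two and one rows respectively, which matches the number of augmented rows printed above.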
augment_random_character
from easy_text_augmenter import augment_random_character
classes_to_augment = ['A', 'B']
augmented_df = augment_random_character(df, classes_to_augment, augmentation_percentage=0.8, text_column='text')
print(augmented_df)
Result:
text label
0 This is a test A
1 Another test data A
2 Of course we need more data B
3 Newton does not like apple B
4 Hello world I am a human A
5 This is a estt A
6 Another te8t data A
7 Newtun d0e8 not like apple B
augment_word_bert
from easy_text_augmenter import augment_word_bert
classes_to_augment = ['A', 'B']
augmented_df = augment_word_bert(df, classes_to_augment, augmentation_percentage=0.8, text_column='text', model_path='bert-base-uncased', random_state=70)
print(augmented_df)
Result:
text label
0 This is a test A
1 Another test data A
2 Of course we need more data B
3 Newton does not like apple B
4 Hello world I am a human A
5 another test of data A
6 this term is not a test A
7 newton does absolutely not like every apple B
Authors
Contact me at :
Documentation
augment_random_word
Description:
The augment_random_word function augments a specified percentage of samples in given classes of a DataFrame by randomly applying one of three augmentation techniques (swap, delete, split) to the text column.
augment_random_word(df, classes_to_augment, augmentation_percentage, text_column, random_state=42, weights=[0.5, 0.3, 0.2])
Parameters:
- df (pandas.DataFrame): The input DataFrame containing the text data and labels.
- classes_to_augment (list): A list of class labels to augment.
- augmentation_percentage (float): The fraction of samples to augment from each specified class (e.g., 0.8 for 80%).
- text_column (str): The name of the column in the DataFrame that contains the text data.
- random_state (int, optional): A random seed used to select which rows to augment. Default is 42.
- weights (list, optional): A list of weights that sets the probability of selecting each augmentation technique. Default is [0.5, 0.3, 0.2] for swap, delete, and split, respectively.
Techniques weighted by weights:
- swap: randomly swaps words in the text.
- delete: randomly deletes a word from the text.
- split: randomly splits a word in the text.
Returns:
- pandas.DataFrame: A new DataFrame with the augmented data appended to the original data.
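To make the three word-level techniques and the role of weights concrete, here is a minimal standard-library sketch. It is not the package's implementation: the function names (swap_words, delete_word, split_word, augment_sentence) are hypothetical, and details such as swapping only adjacent words are assumptions.

```python
import random

def swap_words(words, rng):
    # Swap one pair of adjacent words (assumption: adjacent only).
    if len(words) < 2:
        return words[:]
    i = rng.randrange(len(words) - 1)
    out = words[:]
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def delete_word(words, rng):
    # Drop one word, but never empty the sentence.
    if len(words) < 2:
        return words[:]
    i = rng.randrange(len(words))
    return words[:i] + words[i + 1:]

def split_word(words, rng):
    # Split one word of at least two characters into two tokens.
    candidates = [i for i, w in enumerate(words) if len(w) >= 2]
    if not candidates:
        return words[:]
    i = rng.choice(candidates)
    cut = rng.randrange(1, len(words[i]))
    return words[:i] + [words[i][:cut], words[i][cut:]] + words[i + 1:]

def augment_sentence(text, weights=(0.5, 0.3, 0.2), seed=42):
    # Pick one technique with probability proportional to its weight.
    rng = random.Random(seed)
    techniques = [swap_words, delete_word, split_word]
    technique = rng.choices(techniques, weights=weights, k=1)[0]
    return " ".join(technique(text.split(), rng))

print(augment_sentence("Newton does not like apple"))
```

Setting a weight to 0 disables that technique in this sketch; for example, weights=(0, 0, 1) always splits a word.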
augment_random_character
Description:
The augment_random_character function performs random character-based augmentations on specific classes of text data within a DataFrame. It uses several augmentation techniques to randomly alter characters in the text, increasing the diversity of the dataset.
augment_random_character(df, classes_to_augment, augmentation_percentage, text_column, random_state=42, weights=[0.2, 0.2, 0.2, 0.2, 0.2])
Parameters:
- df (pandas.DataFrame): The input DataFrame containing the text data and their corresponding labels.
- classes_to_augment (list): A list of class labels indicating which classes should be augmented.
- augmentation_percentage (float): The fraction of samples in each class that should be augmented (e.g., 0.8 for 80%).
- text_column (str): The name of the column in the DataFrame that contains the text data to be augmented.
- random_state (int, optional): A random seed used to select which rows to augment. Default is 42.
- weights (list, optional): A list of weights, one per augmentation technique, that sets the probability of choosing each technique. Default is [0.2, 0.2, 0.2, 0.2, 0.2].
Techniques weighted by weights:
- aug_ocr: OCR-based augmentation.
- aug_keyboard: Keyboard error simulation.
- aug_insert: Random character insertion.
- aug_swap: Random character swapping.
- aug_delete: Random character deletion.
Returns:
- pandas.DataFrame: A new DataFrame with the augmented data appended to the original data.
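The character-level techniques can be illustrated the same way. The sketch below covers OCR-style substitution, insertion, swapping, and deletion using only the standard library; keyboard error simulation is left out because it needs a keyboard layout table. All names here are hypothetical, and OCR_LOOKALIKES is a tiny made-up confusion table, not the one the package uses.

```python
import random
import string

# Hypothetical look-alike table for OCR-style errors (illustration only).
OCR_LOOKALIKES = {"o": "0", "s": "8", "l": "1"}

def ocr_substitute(text, rng):
    # Replace one character that has a look-alike in the table.
    candidates = [i for i, ch in enumerate(text) if ch in OCR_LOOKALIKES]
    if not candidates:
        return text
    i = rng.choice(candidates)
    return text[:i] + OCR_LOOKALIKES[text[i]] + text[i + 1:]

def insert_char(text, rng):
    # Insert one random lowercase letter at a random position.
    i = rng.randrange(len(text) + 1)
    return text[:i] + rng.choice(string.ascii_lowercase) + text[i:]

def swap_chars(text, rng):
    # Swap one pair of adjacent characters.
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def delete_char(text, rng):
    # Remove one character, but never return an empty string.
    if len(text) < 2:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]

def augment_characters(text, weights=(0.25, 0.25, 0.25, 0.25), seed=42):
    # Pick one technique with probability proportional to its weight.
    rng = random.Random(seed)
    techniques = [ocr_substitute, insert_char, swap_chars, delete_char]
    technique = rng.choices(techniques, weights=weights, k=1)[0]
    return technique(text, rng)

print(augment_characters("Another test data"))
```

In this sketch, augment_characters("test", weights=(1, 0, 0, 0)) returns "te8t", because "s" is the only character of "test" present in the look-alike table.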
augment_word_bert
Description:
The augment_word_bert function augments text data in a DataFrame using a BERT-based word augmentation technique. It inserts or substitutes words in the specified text column for a given percentage of samples in the specified classes.
augment_word_bert(df, classes_to_augment, augmentation_percentage, text_column, model_path, random_state=42, weights=[0.7, 0.3])
Parameters:
- df (pandas.DataFrame): The DataFrame containing the data to be augmented.
- classes_to_augment (list): A list of class labels indicating which classes should be augmented.
- augmentation_percentage (float): The fraction of samples within each class to augment (e.g., 0.2 for 20%).
- text_column (str): The name of the column in the DataFrame that contains the text to be augmented.
- model_path (str): The path to the pre-trained BERT model used for augmentation (e.g., 'bert-base-uncased').
- random_state (int, optional): A random seed used to select which rows to augment. Default is 42.
- weights (list, optional): The weights for choosing between the insertion and substitution augmentation techniques. Default is [0.7, 0.3].
Returns:
- pandas.DataFrame: The original DataFrame with additional augmented samples.
File details
Details for the file easy_nlp_augmentation-1.6.tar.gz.
File metadata
- Download URL: easy_nlp_augmentation-1.6.tar.gz
- Upload date:
- Size: 5.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 04c99b878cfcb4dbfd01b8b8d9fc9761d7ab10e2f0528724b021d63d0d5c1da3 |
| MD5 | 88b135aee3326ef9aa69e8c8c4e84f0d |
| BLAKE2b-256 | f611f34b07707d612c5bff19f63749218a49c39f2e0a0b1b55565e415a62e76d |
File details
Details for the file easy_nlp_augmentation-1.6-py3-none-any.whl.
File metadata
- Download URL: easy_nlp_augmentation-1.6-py3-none-any.whl
- Upload date:
- Size: 7.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5e2e1474b34a5da0e1b38dcb4865d24904667c71da46f33e404f185c36348ae3 |
| MD5 | 00ebca21ce9b36078d75706b732ae6df |
| BLAKE2b-256 | 958a45fd7277b71b3bd1c3cddb6853da4db5d23f65dd2841d13c92afe52fbf15 |