
absum - Abstractive Summarization for Data Augmentation

Supports Python 3.6 and 3.7.

Introduction

Imbalanced class distribution is a common problem in machine learning. Undersampling and oversampling are two methods of addressing it. A technique such as SMOTE can be effective for oversampling, although the problem becomes more difficult with multilabel datasets. MLSMOTE has been proposed, but the high-dimensional nature of the numerical vectors created from text can sometimes make other forms of data augmentation more appealing.

absum is an NLP library that uses abstractive summarization to perform data augmentation, oversampling under-represented classes in datasets. Recent developments in abstractive summarization make this approach well suited to generating realistic data for augmentation.

It uses the latest Huggingface T5 model by default, but is designed in a modular way to allow you to use any pre-trained or out-of-the-box Transformers model capable of abstractive summarization. absum is format agnostic, expecting only a dataframe containing text and all features. It also uses multiprocessing to improve performance.

Singular summarization calls are also possible.

Algorithm

  1. Append counts, i.e. the number of rows to add for each feature, are first calculated using a ceiling threshold (see the sketch after this list). For example, if a given feature has 1000 rows and the ceiling is 100, its append count will be 0.

  2. For each feature, it then loops from an append index up to the append count calculated for that feature. The append index is stored to allow for multiprocessing.

  3. An abstractive summarization is calculated for a subset of specified size drawn from all rows that uniquely have the given feature. If multiprocessing is set, each call to abstractive summarization is stored in a task array that is later passed to a subroutine which runs the calls in parallel using the multiprocessing library, vastly reducing runtime.

  4. Each summarization is appended to a new dataframe with the respective features one-hot encoded.
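
The following is a minimal sketch of the algorithm above, assuming one-hot encoded feature columns, a text column named 'text', and a summarize callable supplied by the caller. The function and variable names are illustrative only and are not absum's internal API.

import pandas as pd

def append_counts(df, features, threshold):
    # Step 1: number of rows to add per feature, capped by the ceiling threshold.
    # A feature that already has threshold or more rows gets an append count of 0.
    return {f: max(threshold - int(df[f].sum()), 0) for f in features}

def augment(df, features, threshold, num_samples, summarize):
    new_rows = []
    for feature, count in append_counts(df, features, threshold).items():
        # Rows that uniquely have this feature (no other feature set).
        unique = df[(df[feature] == 1) & (df[features].sum(axis=1) == 1)]
        # Step 2: loop from the append index (0 here, no multiprocessing) to the append count.
        for _ in range(count):
            # Step 3: summarize a subset of the specified size.
            sample = unique.sample(min(num_samples, len(unique)))
            summary = summarize(' '.join(sample['text']))
            # Step 4: append the summary with the feature one-hot encoded.
            row = {f: 0 for f in features}
            row.update({'text': summary, feature: 1})
            new_rows.append(row)
    return pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)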

Installation

Via pip

pip install absum

From source

git clone https://github.com/aaronbriel/absum.git
pip install [--editable] .

or

pip install git+https://github.com/aaronbriel/absum.git

Usage

absum expects a DataFrame containing a text column, which defaults to 'text', with the remaining columns representing one-hot encoded features. If additional columns are present that you do not wish to be considered, you have the option to pass the specific one-hot encoded features to use as a comma-separated string in the 'features' parameter, as shown in the second example below. All available parameters are detailed in the Parameters section below.

import pandas as pd
from absum import Augmentor

csv = 'path_to_csv'
df = pd.read_csv(csv)
augmentor = Augmentor(df, text_column='review_text')
df_augmented = augmentor.abs_sum_augment()
# Store resulting dataframe as a csv
df_augmented.to_csv(csv.replace('.csv', '-augmented.csv'), encoding='utf-8', index=False)
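
If the dataframe contains extra columns that should not be treated as features, restrict augmentation by passing the one-hot encoded feature columns as a comma-separated string. A minimal sketch, in which the feature column names are hypothetical:

import pandas as pd
from absum import Augmentor

df = pd.read_csv('path_to_csv')
# Only these one-hot encoded columns are considered for augmentation.
augmentor = Augmentor(df, text_column='review_text', features='bug_report,feature_request')
df_augmented = augmentor.abs_sum_augment()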

Running singular summarization on any chunk of text is simple:

text = 'chunk_of_text_to_summarize'
augmentor = Augmentor(min_length=100, max_length=200)
output = augmentor.get_abstractive_summarization(text)

NOTE: When running any summarizations you may see the following warning message which can be ignored: "Token indices sequence length is longer than the specified maximum sequence length for this model (2987 > 512). Running this sequence through the model will result in indexing errors". For more information refer to this issue.

Parameters

Name Type Description
df (:class:pandas.DataFrame, optional, defaults to None) DataFrame containing text and one-hot encoded features.
text_column (:obj:string, optional, defaults to "text") Column in df containing text.
features (:obj:string, optional, defaults to None) Comma-separated string of features to possibly augment data for.
device (:class:torch.device, optional, 'cuda' or 'cpu') Torch device to run on; cuda if available, otherwise cpu.
model (:class:~transformers.T5ForConditionalGeneration, optional, defaults to T5ForConditionalGeneration.from_pretrained('t5-small')) Model used for abstractive summarization.
tokenizer (:class:~transformers.T5Tokenizer, optional, defaults to T5Tokenizer.from_pretrained('t5-small')) Tokenizer used for abstractive summarization.
return_tensors (:obj:str, optional, defaults to "pt") Can be set to 'tf', 'pt' or 'np' to return a TensorFlow tf.constant, PyTorch torch.Tensor or Numpy np.ndarray respectively, instead of a list of python integers.
num_beams (:obj:int, optional, defaults to 4) Number of beams for beam search. Must be between 1 and infinity; 1 means no beam search.
no_repeat_ngram_size (:obj:int, optional, defaults to 4) If set to an int > 0, all ngrams of that size can only occur once.
min_length (:obj:int, optional, defaults to 10) The minimum length of the sequence to be generated. Must be between 0 and infinity.
max_length (:obj:int, optional, defaults to 50) The maximum length of the sequence to be generated. Must be between min_length and infinity.
early_stopping (:obj:bool, optional, defaults to True) If set to True, beam search is stopped when at least num_beams sentences are finished per batch. (configuration_utils.PretrainedConfig defaults this to False; absum defaults it to True.)
skip_special_tokens (:obj:bool, optional, defaults to True) If set to True, special tokens (self.all_special_tokens) are not decoded. (The tokenizer's own default is False; absum defaults it to True.)
num_samples (:obj:int, optional, defaults to 100) Number of rows with a given feature to sample from the dataframe when generating each new sample with abstractive summarization.
threshold (:obj:int, optional, defaults to 3500) Maximum ceiling for each feature, normally the undersampling maximum.
multiproc (:obj:bool, optional, defaults to True) If set, calls to abstractive summarization are stored in an array which is then passed to run_cpu_tasks_in_parallel, improving performance through multiprocessing.
debug (:obj:bool, optional, defaults to True) If set, prints generated summarizations.
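
As a sketch of overriding these defaults, the snippet below swaps in the larger t5-base checkpoint and adjusts the generation settings; the checkpoint choice and the specific values are illustrative only, not recommendations from the project:

import pandas as pd
from transformers import T5ForConditionalGeneration, T5Tokenizer
from absum import Augmentor

df = pd.read_csv('path_to_csv')
# Illustrative overrides; see the Parameters table above for the defaults.
model = T5ForConditionalGeneration.from_pretrained('t5-base')
tokenizer = T5Tokenizer.from_pretrained('t5-base')
augmentor = Augmentor(
    df,
    text_column='review_text',
    model=model,
    tokenizer=tokenizer,
    num_beams=4,
    min_length=20,
    max_length=100,
    num_samples=50,
    threshold=1000,
    multiproc=False,
    debug=False
)
df_augmented = augmentor.abs_sum_augment()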

Citation

Please reference this library and the HuggingFace pytorch-transformers library if you use this work in a published or open-source project.
