Databalancer is a Python library for balancing imbalanced text classification datasets before model training in machine learning applications.

Databalancer

Databalancer is a Python library used in machine learning applications to balance imbalanced text classification datasets before model training.

Features

  • Databalancer can balance any imbalanced text classification dataset
  • Balancing never removes existing data; it generates new data and adds it to the dataset
  • The data generated for a class consists of paraphrases of the existing data in that class
  • By default, the paraphrases are generated with the ramsrigouthamg/t5_paraphraser model (see the official Hugging Face documentation for more about the model)
  • Databalancer also provides a function, classCountVisualization, that shows the class count distribution of a dataset
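
The balancing strategy described in the feature list can be sketched as follows. This is a minimal illustration under stated assumptions, not databalancer's actual implementation: `balance_by_paraphrasing` and `toy_paraphrase` are hypothetical names, and the toy paraphraser stands in for the T5 model so the example runs without downloading anything.

```python
from collections import Counter, defaultdict
from itertools import cycle


def toy_paraphrase(text, variant):
    # Hypothetical stand-in for a real paraphrase model (e.g. T5).
    # It only tags the text so that the example runs offline.
    return f"{text} (paraphrase {variant})"


def balance_by_paraphrasing(rows, paraphrase=toy_paraphrase):
    """Upsample every class to the size of the largest class.

    rows is a list of (text, label) pairs. Existing rows are kept;
    new rows are paraphrases of texts from the same class.
    """
    if not rows:
        return []

    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append(text)

    target = max(len(texts) for texts in by_label.values())
    balanced = list(rows)  # never drop original data

    for label, texts in by_label.items():
        missing = target - len(texts)
        # Cycle through the class's own texts until it reaches the target.
        for variant, text in zip(range(1, missing + 1), cycle(texts)):
            balanced.append((paraphrase(text, variant), label))
    return balanced


rows = [
    ("great movie", "pos"),
    ("loved it", "pos"),
    ("really enjoyable", "pos"),
    ("terrible plot", "neg"),
]
balanced = balance_by_paraphrasing(rows)
print(Counter(label for _, label in balanced))
```

In this toy run the original four rows are kept and two paraphrased "neg" rows are appended, so both classes end up with three rows.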

Installation

Install the databalancer package with pip:

 pip install databalancer

Compatibility

Databalancer is only compatible with Python 3.6.9 or above.
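
If you want to check the interpreter before installing, a small guard like the one below could help. It is only a sketch: `MIN_VERSION` and `is_supported` are illustrative names and not part of databalancer.

```python
import sys

MIN_VERSION = (3, 6, 9)  # minimum stated in the Compatibility section


def is_supported(version_info=sys.version_info):
    # Compare only (major, minor, micro) against the minimum.
    return tuple(version_info[:3]) >= MIN_VERSION


if not is_supported():
    print("databalancer requires Python 3.6.9 or above")
```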

Quick Start

The databalancer library provides two functions:

1 - classCountVisualization

2 - balanceDataset

classCountVisualization

# Import classCountVisualization from the 'databalancer' module
from databalancer import classCountVisualization

# Pass the dataset file name (here traindata.csv) to the function
classCountVisualization("traindata.csv")

Output

Imbalanced dataset pie plot
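
A pie plot like this is built from per-class row counts. The sketch below computes such counts with the standard library only. It is not databalancer's code: the helper `class_counts` and the column name `label` are assumptions made for illustration, since the expected CSV layout is not documented here.

```python
import csv
import io
from collections import Counter


def class_counts(csv_file, label_column="label"):
    # Count how many rows belong to each class.
    # label_column is an assumed column name; adjust it to your dataset.
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# In-memory CSV standing in for a file such as traindata.csv.
sample = io.StringIO(
    "text,label\n"
    "great movie,pos\n"
    "loved it,pos\n"
    "terrible plot,neg\n"
)
counts = class_counts(sample)
total = sum(counts.values())
for label, count in counts.most_common():
    print(f"{label}: {count} rows ({count / total:.0%})")
```

With a real file, pass an open file object instead, for example `open("traindata.csv", newline="")`.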

balanceDataset

# Import balanceDataset from the 'databalancer' module
from databalancer import balanceDataset

# Pass the name of the dataset to be balanced (here traindata.csv) to the balanceDataset function
balanceDataset("traindata.csv")

The code above balances the dataset and saves the result as 'balanced_data.csv' on the local machine.

To show the balanced dataset class count distribution, run the code below.

from databalancer import classCountVisualization

classCountVisualization("balanced_data.csv")

Balanced dataset pie plot
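
The feature list states that balancing keeps every existing row and only adds new ones. One way to check this yourself is sketched below. It assumes, without confirmation from the documentation, that balanced_data.csv uses the same columns in the same order as the input file; `missing_rows` is a hypothetical helper, not a databalancer function.

```python
import csv
import io
from collections import Counter


def missing_rows(original_file, balanced_file):
    # Return original rows (as a multiset) that no longer appear in the
    # balanced file. An empty result means nothing was removed.
    original = Counter(tuple(row) for row in csv.reader(original_file))
    balanced = Counter(tuple(row) for row in csv.reader(balanced_file))
    return original - balanced


# In-memory files standing in for traindata.csv and balanced_data.csv.
original = io.StringIO("text,label\ngood,pos\nfine,pos\nbad,neg\n")
balanced = io.StringIO(
    "text,label\ngood,pos\nfine,pos\nbad,neg\nawful,neg\n"
)
removed = missing_rows(original, balanced)
print("rows removed:", sum(removed.values()))
```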

Download files

Source Distribution

databalancer-0.0.8.tar.gz (174.1 kB)

Built Distribution

databalancer-0.0.8-py3-none-any.whl (9.4 kB)