tailcalib is a Python library for balancing a long-tailed / imbalanced dataset by generating synthetic datapoints, which can in turn improve the class-wise and overall test accuracy on the original dataset.

tailcalib

This package is based on the paper Feature Generation for Long-tail Classification by Rahul Vigneswaran, Marc T. Law, Vineeth N. Balasubramanian, Makarand Tapaswi.

For more detailed experiments, code, and instructions, see rahulvigneswaran/TailCalibX on GitHub.

💻 Installation

Use the package manager pip to install tailcalib.

pip install tailcalib

👨‍💻 Basic Usage

# Import
from tailcalib import tailcalib

# Initialize
a = tailcalib(base_engine="numpy")   # Options: "numpy", "pytorch"

# Random fake data (class counts are uneven only by chance)
import numpy as np
X = np.random.rand(200, 100)           # 200 samples, 100 features
y = np.random.randint(0, 10, (200,))   # labels for 10 classes

# Balancing the data using "tailcalib"
feat, lab, gen = a.generate(X=X, y=y)

# Output comparison
print(f"Before: {np.unique(y, return_counts=True)}")
print(f"After: {np.unique(lab, return_counts=True)}")

🧪 Advanced Usage

🧩 Sample code

# Import
from tailcalib import tailcalib

# Initialize
a = tailcalib(base_engine="numpy")   # Options: "numpy", "pytorch"

# Random fake data (class counts are uneven only by chance)
import numpy as np
# Train data
X_train = np.random.rand(200, 100)
y_train = np.random.randint(0, 10, (200,))
# Test data
X_test = np.random.rand(20, 100)
y_test = np.random.randint(0, 10, (20,))

# Balancing the data using "tailcalib".
# Tune the hyperparameters below to improve the generated datapoints.
feat, lab, gen = a.generate(X=X_train, y=y_train, tukey_value=1.0, alpha=0.0, topk=1, extra_points=0, shuffle=True)

# Always remember to convert the val/test data before doing validation/testing.
X_test, y_test = a.convert_others(X=X_test, y=y_test)

# Output comparison
print(f"Before: {np.unique(y_train, return_counts=True)}")
print(f"After: {np.unique(lab, return_counts=True)}")

⚙ Arguments

  • X : Features
  • y : Corresponding labels
  • tukey_value : Parameter of the Tukey transformation, which is used to make the distribution of the data closer to a normal distribution. Defaults to 1.0.
  • alpha : Controls how spread out the generated data is. Defaults to 0.0.
  • topk : Number of nearby classes taken into consideration when computing the mean and std of the newly generated data. Defaults to 1.
  • extra_points : By default, the number of datapoints to generate is decided based on the class with the most datapoints. This argument sets how many extra datapoints should be generated on top of that. Defaults to 0.
  • shuffle : Shuffles the generated and original datapoints together. Defaults to True.
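As a rough illustration of how extra_points relates to the class counts, here is a minimal sketch in plain Python. It is not tailcalib's implementation: the helper name points_to_generate is hypothetical, and it assumes that every class is topped up to the size of the largest class and that extra_points is added per class.

```python
from collections import Counter

def points_to_generate(labels, extra_points=0):
    """Hypothetical helper: synthetic points needed per class.

    Assumption: each class is filled up to the size of the largest
    class, plus `extra_points` additional points per class.
    """
    counts = Counter(labels)
    target = max(counts.values()) + extra_points
    return {cls: target - n for cls, n in sorted(counts.items())}

# A small long-tailed label set: 50, 20 and 5 samples per class.
labels = [0] * 50 + [1] * 20 + [2] * 5

print(points_to_generate(labels))                   # {0: 0, 1: 30, 2: 45}
print(points_to_generate(labels, extra_points=10))  # {0: 10, 1: 40, 2: 55}
```

Under these assumptions, the head class receives no new points by default, while the tail classes receive the most.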

📤 Returns:

  • feat_all : Tukey transformed train data + generated datapoints
  • labs_all : Corresponding labels to feat_all
  • generated_points : Dict that consists of just the generated points with class label as keys.
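The "Tukey transformed" features mentioned above can be pictured with Tukey's ladder of powers transformation. The sketch below is written in plain Python for illustration only. The function name tukey_transform is hypothetical, and whether tailcalib applies exactly this form internally is an assumption.

```python
import math

def tukey_transform(values, tukey_value=1.0):
    """Tukey's ladder of powers (illustrative sketch).

    x ** tukey_value when tukey_value != 0, log(x) when tukey_value == 0.
    Assumes non-negative inputs (strictly positive for the log case).
    """
    if tukey_value == 0:
        return [math.log(x) for x in values]
    return [x ** tukey_value for x in values]

skewed = [0.01, 0.04, 0.25, 1.0, 4.0]

# With an exponent of 1.0 the values are left unchanged.
print(tukey_transform(skewed, tukey_value=1.0))

# Exponents below 1.0 compress large values, reducing right skew.
print(tukey_transform(skewed, tukey_value=0.5))
```

In this form, an exponent of 1.0 is the identity, so smaller values are the ones that actually reshape the distribution.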

🪀 Results on a Toy Dataset

The higher the Imb ratio, the more imbalanced the dataset is. Imb ratio = maximum_sample_count/minimum_sample_count.
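The Imb ratio defined above can be computed directly from the labels. This is a small sketch in plain Python, and the function name imbalance_ratio is chosen here for illustration and is not part of tailcalib.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Imb ratio = maximum_sample_count / minimum_sample_count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

balanced = [0] * 10 + [1] * 10
long_tailed = [0] * 100 + [1] * 10 + [2] * 1

print(imbalance_ratio(balanced))     # 1.0
print(imbalance_ratio(long_tailed))  # 100.0
```

A perfectly balanced dataset has a ratio of 1.0, and the ratio grows as the tail classes shrink relative to the head class.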

Check this notebook to play with the toy example from which the plot below was generated.

📃 Citation

If you use this package in your work, please cite it as:

@inproceedings{rahul2021tailcalibX,
    title   = {{Feature Generation for Long-tail Classification}},
    author  = {Rahul Vigneswaran and Marc T. Law and Vineeth N. Balasubramanian and Makarand Tapaswi},
    booktitle = {ICVGIP},
    year = {2021}
}

👁 Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

❤ About me

Rahul Vigneswaran

✨ Extras

🐝 Long-tail buzz : If you are interested in deep learning research that involves long-tailed / imbalanced datasets, take a look at Long-tail buzz to learn about recent trending papers in this field.

📝 License

MIT
