tailcalib

tailcalib is a Python library for balancing a long-tailed (imbalanced) dataset by generating synthetic datapoints, which in turn is intended to improve the class-wise and overall test accuracy on the original dataset.

This package is based on the paper "Feature Generation for Long-tail Classification" by Rahul Vigneswaran, Marc T. Law, Vineeth N. Balasubramanian, and Makarand Tapaswi.

For more detailed experiments, code, and instructions, see rahulvigneswaran/TailCalibX.

💻 Installation

Use the package manager pip to install tailcalib.

pip install tailcalib

👨‍💻 Basic Usage

# Import
from tailcalib import tailcalib

# Initialize
a = tailcalib(base_engine="numpy")   # Options: "numpy", "pytorch"

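# Random fake data: 200 samples, 100 features, 10 classes
# (labels are drawn uniformly, so class counts differ only by chance)
import numpy as np
X = np.random.rand(200, 100)
y = np.random.randint(0, 10, (200,))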

# Balancing the data using "tailcalib"
feat, lab, gen = a.generate(X=X, y=y)

# Output comparison
print(f"Before: {np.unique(y, return_counts=True)}")
print(f"After: {np.unique(lab, return_counts=True)}")
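
Because `np.random.randint` draws labels uniformly, the fake data above is only mildly imbalanced. The following is a minimal sketch of building a visibly long-tailed toy dataset with NumPy alone; the names `num_classes`, `max_count`, and `imb_ratio` are illustrative assumptions, not part of tailcalib. The resulting `X` and `y` have the same layout as the arrays passed to `generate` above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical settings: 10 classes, 100 samples in the largest class,
# and a largest-to-smallest class size ratio of 20.
num_classes, max_count, imb_ratio = 10, 100, 20

# Per-class counts decay exponentially from max_count to max_count / imb_ratio.
decay = (1 / imb_ratio) ** (np.arange(num_classes) / (num_classes - 1))
counts = np.round(max_count * decay).astype(int)

# Labels: class c is repeated counts[c] times.
y = np.repeat(np.arange(num_classes), counts)

# Non-negative features, shifted slightly per class so classes are distinguishable.
X = rng.random((y.size, 100)) + 0.1 * y[:, None]

print(dict(zip(*np.unique(y, return_counts=True))))
```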

🧪 Advanced Usage

🧩 Sample code

# Import
from tailcalib import tailcalib

# Initialize
a = tailcalib(base_engine="numpy")   # Options: "numpy", "pytorch"

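# Random fake data (labels are drawn uniformly, so class counts differ only by chance)
import numpy as np
# Train data: 200 samples, 100 features, 10 classes
X_train = np.random.rand(200, 100)
y_train = np.random.randint(0, 10, (200,))
# Test data: 20 samples
X_test = np.random.rand(20, 100)
y_test = np.random.randint(0, 10, (20,))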

# Balancing the data using "tailcalib".
# The values below are the defaults; tune them to get better generated datapoints.
feat, lab, gen = a.generate(X=X_train, y=y_train, tukey_value=1.0, alpha=0.0, topk=1, extra_points=0, shuffle=True)

# The returned training features are Tukey transformed, so always convert
# the val/test data as well before validation/testing.
X_test, y_test = a.convert_others(X=X_test, y=y_test)

# Output comparison
print(f"Before: {np.unique(y_train, return_counts=True)}")
print(f"After: {np.unique(lab, return_counts=True)}")
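
Since the stated goal is to improve both class-wise and overall test accuracy, it helps to measure the two separately. The sketch below uses only NumPy; `classwise_accuracy` is a hypothetical helper written for illustration and is not part of tailcalib. The predictions are made-up values standing in for the output of whatever classifier is trained on the balanced features.

```python
import numpy as np

def classwise_accuracy(y_true, y_pred):
    """Return ({class: accuracy}, overall_accuracy) for integer labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = {
        int(c): float(np.mean(y_pred[y_true == c] == c))
        for c in np.unique(y_true)
    }
    overall = float(np.mean(y_pred == y_true))
    return per_class, overall

# Made-up labels and predictions: class 0 is the head, class 2 is the tail.
y_true = np.array([0, 0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 0, 1, 1, 0, 2])

per_class, overall = classwise_accuracy(y_true, y_pred)
print(per_class)  # {0: 0.75, 1: 0.5, 2: 1.0}
print(overall)
```

On imbalanced test sets the overall number is dominated by the head classes, which is why the per-class view is worth reporting alongside it.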

⚙ Arguments

X : Features
y : Corresponding labels
tukey_value : Value used by the Tukey transformation to bring the data distribution closer to a normal distribution. Defaults to 1.0.
alpha : Controls how spread out the generated data is. Defaults to 0.0.
topk : Number of nearby classes taken into account when computing the mean and std of the newly generated data. Defaults to 1.
extra_points : By default, the number of datapoints to generate is determined by the class with the most datapoints. This argument sets how many additional datapoints are generated on top of that. Defaults to 0.
shuffle : Shuffles the generated and original datapoints together. Defaults to True.
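
To give some intuition for tukey_value, here is a minimal sketch of Tukey's ladder of powers in its common textbook form: raise non-negative data to a power, or take the logarithm when the power is 0. This is an assumption about the general technique, not a copy of tailcalib's code; the package's exact handling (for example of a value of 0 or of negative features) may differ. `tukey_transform` and `skewness` are hypothetical helpers.

```python
import numpy as np

def tukey_transform(x, tukey_value=1.0):
    """Textbook Tukey ladder of powers for non-negative data (assumed form)."""
    x = np.asarray(x, dtype=float)
    if tukey_value == 0:
        return np.log(x)  # requires strictly positive inputs
    return np.power(x, tukey_value)

def skewness(x):
    """Sample skewness; 0 for a symmetric distribution."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
x = rng.exponential(size=10_000)   # heavily right-skewed, non-negative

same = tukey_transform(x, tukey_value=1.0)   # 1.0 leaves the data unchanged
less_skewed = tukey_transform(x, tukey_value=0.5)

print(skewness(x), skewness(less_skewed))
```

With the default of 1.0 the transform is the identity under this form, so smaller values are the ones that change the shape of the distribution.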

📤 Returns

feat_all : Tukey transformed train data + generated datapoints
labs_all : Labels corresponding to feat_all
generated_points : Dict containing only the generated points, with class labels as keys.
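
The sketch below illustrates how the returned dict might be inspected. It rests on two assumptions that are not confirmed by the description above: that each class is topped up to the size of the largest class plus extra_points, and that each dict value is an array of shape (n_generated, n_features). Both helper functions are hypothetical, and a mock dict stands in for the real generated_points.

```python
import numpy as np

def expected_generated_counts(y, extra_points=0):
    """Assumed rule: top every class up to the largest class size, plus extra_points."""
    classes, counts = np.unique(y, return_counts=True)
    return {int(c): int(counts.max() - n + extra_points)
            for c, n in zip(classes, counts)}

def flatten_generated(generated_points):
    """Stack a {class_label: (n_c, d) array} dict into feature and label arrays."""
    feats = np.concatenate([np.asarray(p) for p in generated_points.values()], axis=0)
    labels = np.concatenate([np.full(len(p), c) for c, p in generated_points.items()])
    return feats, labels

# Toy labels: 5 samples of class 0, 3 of class 1, 1 of class 2.
y = np.array([0] * 5 + [1] * 3 + [2] * 1)
todo = expected_generated_counts(y)
print(todo)  # {0: 0, 1: 2, 2: 4}

# Mock dict with the assumed layout (4 features per point).
mock_generated = {c: np.zeros((n, 4)) for c, n in todo.items()}
feats, labels = flatten_generated(mock_generated)
print(feats.shape, labels)
```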

🪀 Results on a Toy Dataset

The higher the Imb ratio, the more imbalanced the dataset is. Imb ratio = maximum_sample_count/minimum_sample_count.
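
The Imb ratio defined above can be computed directly from a label array. This is a small NumPy sketch; `imbalance_ratio` is a hypothetical helper, not a tailcalib function.

```python
import numpy as np

def imbalance_ratio(y):
    """maximum_sample_count / minimum_sample_count over the classes present in y."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.min()

# 100 samples of class 0, 20 of class 1, 5 of class 2.
y = np.array([0] * 100 + [1] * 20 + [2] * 5)
print(imbalance_ratio(y))  # 20.0
```

A perfectly balanced dataset has a ratio of 1.0.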

Check this notebook to play with the toy example from which the plot below was generated.

📃 Citation

If you use this package in your work, please cite it as:

@inproceedings{rahul2021tailcalibX,
    title   = {{Feature Generation for Long-tail Classification}},
    author  = {Rahul Vigneswaran and Marc T. Law and Vineeth N. Balasubramanian and Makarand Tapaswi},
    booktitle = {ICVGIP},
    year = {2021}
}

👁 Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

❤ About me

Rahul Vigneswaran

✨ Extras

🐝 Long-tail buzz : If you are interested in deep learning research involving long-tailed / imbalanced datasets, take a look at Long-tail buzz to learn about recent trending papers in this field.

📝 License

MIT

Project details

Release: 0.0.1 (Nov 8, 2021)
