
Implementation of a general interface for clustering-based oversampling algorithms.

cluster-over-sampling

Introduction

The SMOTE algorithm, like any other oversampling method based on the SMOTE data generation mechanism, creates synthetic samples along line segments that join minority class instances. SMOTE addresses only between-class imbalance, that is, the difference in the number of samples between classes.
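The line-segment mechanism described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the imbalanced-learn implementation: it assumes a single nearest neighbor instead of a random choice among k neighbors, and the helper names `nearest_neighbor` and `interpolate` are hypothetical.

```python
import random


def nearest_neighbor(x, others):
    """Return the point in `others` closest to `x` (squared Euclidean distance)."""
    return min(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))


def interpolate(x, neighbor, rng):
    """Return a random point on the segment that joins `x` and `neighbor`."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


rng = random.Random(0)

# Toy minority class with three 2D instances (made-up data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]

base = minority[0]
neighbor = nearest_neighbor(base, [p for p in minority if p is not base])
synthetic = interpolate(base, neighbor, rng)
```

Here `synthetic` lies between `[0.0, 0.0]` and its nearest neighbor `[1.0, 0.0]`, so it never leaves the segment that joins two existing minority instances.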

Within-class imbalance, where the samples of a class are unevenly distributed across the input space, can be addressed by clustering the input space and applying any oversampling algorithm to each resulting cluster with an appropriate resampling ratio. cluster-over-sampling provides a general interface for clustering-based oversampling algorithms and is compatible with scikit-learn and imbalanced-learn. SOMO [^1], KMeans-SMOTE [^2] and G-SOMO [^3] are specific realizations of this approach, and all three are provided in cluster-over-sampling. Additionally, any combination of a scikit-learn clusterer and an imbalanced-learn oversampler is supported.
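The idea of oversampling each cluster separately can be sketched as follows. This is a minimal plain-Python illustration and not the cluster-over-sampling implementation: the function `oversample_by_cluster` is hypothetical, cluster labels are assumed to be already available, and the number of samples to generate per cluster is passed in by the caller rather than derived from any particular resampling ratio.

```python
import random
from collections import defaultdict


def oversample_by_cluster(points, cluster_ids, n_new_per_cluster, seed=0):
    """Create synthetic points by interpolating only within a cluster.

    `n_new_per_cluster` maps a cluster id to the number of samples to
    generate there; choosing those numbers is left to the caller.
    """
    rng = random.Random(seed)

    # Group the minority points by their cluster id.
    groups = defaultdict(list)
    for point, cluster in zip(points, cluster_ids):
        groups[cluster].append(point)

    synthetic = []
    for cluster, n_new in n_new_per_cluster.items():
        members = groups.get(cluster, [])
        if len(members) < 2:
            continue  # interpolation needs at least two points
        for _ in range(n_new):
            a, b = rng.sample(members, 2)
            gap = rng.random()
            synthetic.append([u + gap * (v - u) for u, v in zip(a, b)])
    return synthetic


# Toy minority class split into three clusters (made-up data).
minority = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [10.0, 12.0], [50.0, 50.0]]
clusters = [0, 0, 1, 1, 2]

# Ask for more samples in cluster 1 than in cluster 0.
new_points = oversample_by_cluster(minority, clusters, {0: 1, 1: 3, 2: 2})
```

Cluster 2 holds a single point and is skipped, so four points are generated. Because interpolation stays inside a cluster, no synthetic sample falls in the empty region between clusters.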

Installation

cluster-over-sampling is currently available on PyPI, and you can install it via pip:

pip install cluster-over-sampling

The SOM clusterer requires optional dependencies:

pip install cluster-over-sampling[som]

The same applies to the Geometric SMOTE oversampler:

pip install cluster-over-sampling[gsmote]

You can also install both of them:

pip install cluster-over-sampling[all]

Usage

All the classes included in cluster-over-sampling follow the imbalanced-learn API, using the functionality of the base oversampler. Following the scikit-learn convention, the data are represented as follows:

  • Input data X: 2D array-like or sparse matrix.
  • Targets y: 1D array-like.

The clustering-based oversamplers implement a fit method to learn from X and y:

clustering_based_oversampler.fit(X, y)

They also implement a fit_resample method to resample X and y:

X_resampled, y_resampled = clustering_based_oversampler.fit_resample(X, y)
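The call pattern above can be illustrated with a toy stand-in. The class `ToyOverSampler` below is hypothetical and is not part of cluster-over-sampling: it performs no clustering and simply duplicates rows until the classes are balanced. It only shows the shape of `X` and `y` and what `fit` and `fit_resample` return.

```python
import random
from collections import Counter


class ToyOverSampler:
    """Hypothetical stand-in that mimics the fit / fit_resample call pattern."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit(self, X, y):
        # Learn how many rows each class lacks compared to the largest class.
        counts = Counter(y)
        largest = max(counts.values())
        self.n_missing_ = {label: largest - n for label, n in counts.items()}
        return self

    def fit_resample(self, X, y):
        self.fit(X, y)
        rng = random.Random(self.random_state)
        X_resampled, y_resampled = list(X), list(y)
        for label, n_missing in self.n_missing_.items():
            rows = [row for row, row_label in zip(X, y) if row_label == label]
            for _ in range(n_missing):
                # Duplicate an existing row of the under-represented class.
                X_resampled.append(rng.choice(rows))
                y_resampled.append(label)
        return X_resampled, y_resampled


X = [[0.0, 1.0], [0.2, 0.9], [0.1, 1.1], [5.0, 5.0]]  # 2D input data
y = [0, 0, 0, 1]  # 1D targets

X_resampled, y_resampled = ToyOverSampler().fit_resample(X, y)
```

After resampling, both classes have three samples, and the original rows are kept at the start of `X_resampled`.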

References

If you use cluster-over-sampling in a scientific publication, we would appreciate citations to any of the following papers:

[^1]: G. Douzas, F. Bacao, "Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning", Expert Systems with Applications, vol. 82, pp. 40-52, 2017.

[^2]: G. Douzas, F. Bacao, F. Last, "Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE", Information Sciences, vol. 465, pp. 1-20, 2018.

[^3]: G. Douzas, F. Bacao, F. Last, "G-SOMO: An oversampling approach based on self-organized maps and geometric SMOTE", Expert Systems with Applications, vol. 183, 115230, 2021.

