Oversampling for Imbalanced Learning based on K-Means and SMOTE

K-Means SMOTE is an oversampling method for class-imbalanced data. It aids classification by generating minority class samples in safe and crucial areas of the input space. The method avoids the generation of noise and effectively overcomes imbalances between and within classes.

This project is a Python implementation of k-means SMOTE. It is compatible with imbalanced-learn, a scikit-learn-contrib project.

Installation

Dependencies

The implementation is tested under Python 3.6 and works with the following dependency versions:

  • imbalanced-learn (>=0.4.0, <0.5)

  • numpy (>=1.13, <1.16)

  • scikit-learn (>=0.19.0, <0.21)
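
Because the version ranges above are fairly narrow, it can help to check an environment before installing. The following is a minimal sketch, not part of the package: the helper names `parse` and `is_supported` are hypothetical, and the parser only handles plain numeric versions such as `1.15.4` (no pre-release suffixes).

```python
# Version ranges copied from the dependency list above: (inclusive lower, exclusive upper).
REQUIRED = {
    "imbalanced-learn": ("0.4.0", "0.5"),
    "numpy": ("1.13", "1.16"),
    "scikit-learn": ("0.19.0", "0.21"),
}


def parse(version):
    """Turn '0.4.2' into (0, 4, 2); missing parts are padded with zeros."""
    parts = [int(p) for p in version.split(".")[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def is_supported(package, version):
    """Return True if `version` lies inside the range listed for `package`."""
    lower, upper = REQUIRED[package]
    return parse(lower) <= parse(version) < parse(upper)


print(is_supported("numpy", "1.15.4"))
print(is_supported("scikit-learn", "0.21"))
```

The installed version string of each package can be passed in, for example `numpy.__version__`.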

Installation

PyPI

pip install kmeans-smote

From Source

Clone this repository and install it with pip. The following commands fetch a copy from GitHub and install the package together with its dependencies:

git clone https://github.com/felix-last/kmeans_smote.git
cd kmeans_smote
pip install .

Documentation

Find the API documentation at https://kmeans_smote.readthedocs.io. As this project follows the imbalanced-learn API, the imbalanced-learn documentation might also prove helpful.

Example Usage

import numpy as np
from imblearn.datasets import fetch_datasets
from kmeans_smote import KMeansSMOTE

datasets = fetch_datasets(filter_data=['oil'])
X, y = datasets['oil']['data'], datasets['oil']['target']

# Report the class distribution before oversampling
for label, count in zip(*np.unique(y, return_counts=True)):
    print('Class {} has {} instances'.format(label, count))

kmeans_smote = KMeansSMOTE(
    kmeans_args={
        'n_clusters': 100
    },
    smote_args={
        'k_neighbors': 10
    }
)
X_resampled, y_resampled = kmeans_smote.fit_sample(X, y)

# Report the class distribution after oversampling
for label, count in zip(*np.unique(y_resampled, return_counts=True)):
    print('Class {} has {} instances after oversampling'.format(label, count))

Expected Output:

Class -1 has 896 instances
Class 1 has 41 instances
Class -1 has 896 instances after oversampling
Class 1 has 896 instances after oversampling

Take a look at imbalanced-learn pipelines for efficient usage with cross-validation.
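
The reason pipelines matter here is that oversampling should only ever touch the training part of each fold, so that the held-out part keeps the original class distribution. The sketch below illustrates that idea by hand with numpy only. It is not the imbalanced-learn pipeline API: `duplicate_minority` is a hypothetical stand-in for an oversampler such as `KMeansSMOTE`, and `folds_with_oversampling` is a hypothetical helper written for this example.

```python
import numpy as np


def duplicate_minority(X, y, rng):
    """Stand-in oversampler: repeat minority rows until both classes match."""
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    n_missing = counts.max() - counts.min()
    picks = rng.choice(np.flatnonzero(y == minority), size=n_missing)
    return np.vstack([X, X[picks]]), np.concatenate([y, y[picks]])


def folds_with_oversampling(X, y, n_splits, rng):
    """Yield (X_train, y_train, X_test, y_test) for each fold.

    Only the training part is resampled; the test part is left untouched.
    """
    order = rng.permutation(len(y))
    for test_idx in np.array_split(order, n_splits):
        train_idx = np.setdiff1d(order, test_idx)
        X_train, y_train = duplicate_minority(X[train_idx], y[train_idx], rng)
        yield X_train, y_train, X[test_idx], y[test_idx]


# Toy data: 90 majority samples (-1) and 10 minority samples (1).
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = np.array([-1] * 90 + [1] * 10)

results = list(folds_with_oversampling(X, y, n_splits=5, rng=rng))
for X_train, y_train, X_test, y_test in results:
    print('train:', np.unique(y_train, return_counts=True)[1],
          'test:', np.unique(y_test, return_counts=True)[1])
```

An imbalanced-learn pipeline applies its sampler during fitting only, which gives the same separation without a hand-written loop.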

About

K-means SMOTE works in three steps:

  1. Cluster the entire input space using k-means [1].

  2. Distribute the number of samples to generate across clusters:

    1. Filter out clusters which have a high number of majority class samples.

    2. Assign more synthetic samples to clusters where minority class samples are sparsely distributed.

  3. Oversample each filtered cluster using SMOTE [2].
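
Steps 2 and 3 above can be sketched with numpy alone. This is an illustration under stated assumptions, not the package's implementation: the cluster assignments are written by hand instead of coming from k-means, the filter rule `(majority + 1) / (minority + 1) < ratio_threshold` and the sparsity measure are assumptions made for this example, and `interpolate` picks any other minority sample in the cluster as a partner, whereas SMOTE restricts the choice to the k nearest neighbors. The names `cluster_weights`, `interpolate`, and `ratio_threshold` are hypothetical.

```python
import numpy as np


def cluster_weights(X, y, clusters, minority_label, ratio_threshold=1.0):
    """Return {cluster_id: share of synthetic samples} for kept clusters."""
    n_features = X.shape[1]
    sparsity = {}
    for cluster_id in np.unique(clusters):
        in_cluster = clusters == cluster_id
        minority = X[in_cluster & (y == minority_label)]
        n_min = len(minority)
        n_maj = int(in_cluster.sum()) - n_min
        # Step 2.1 (assumed rule): drop clusters dominated by the majority class.
        if n_min < 2 or (n_maj + 1) / (n_min + 1) >= ratio_threshold:
            continue
        # Mean pairwise Euclidean distance between the minority samples.
        diffs = minority[:, None, :] - minority[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        mean_dist = dists[np.triu_indices(n_min, k=1)].mean()
        # Step 2.2 (assumed measure): sparser clusters get a larger share.
        sparsity[int(cluster_id)] = mean_dist ** n_features / n_min
    total = sum(sparsity.values())
    if total == 0:  # degenerate clusters (or none kept): share evenly
        return {cid: 1.0 / len(sparsity) for cid in sparsity}
    return {cid: s / total for cid, s in sparsity.items()}


def interpolate(minority, n_new, rng):
    """Step 3, simplified: new points on segments between minority pairs."""
    base = rng.randint(0, len(minority), size=n_new)
    offset = rng.randint(1, len(minority), size=n_new)
    partner = (base + offset) % len(minority)  # always differs from base
    gap = rng.random_sample((n_new, 1))
    return minority[base] + gap * (minority[partner] - minority[base])


X = np.array([
    [0, 0], [0, 1], [1, 0], [0.5, 0.5],                     # cluster 0: dense minority
    [10, 10], [10, 14], [14, 10],                           # cluster 1: sparse minority
    [20, 20], [20, 21], [21, 20], [21, 21], [20.5, 20.5],   # cluster 2: mostly majority
], dtype=float)
y = np.array([1, 1, 1, -1, 1, 1, 1, -1, -1, -1, -1, 1])
clusters = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2])

weights = cluster_weights(X, y, clusters, minority_label=1)
rng = np.random.RandomState(0)
n_to_generate = 17
synthetic = {}
for cid, weight in weights.items():
    minority = X[(clusters == cid) & (y == 1)]
    synthetic[cid] = interpolate(minority, int(round(n_to_generate * weight)), rng)

print(weights)
print({cid: len(points) for cid, points in synthetic.items()})
```

In this toy data, cluster 2 is filtered out, and the sparse cluster 1 receives most of the synthetic samples because its minority points are four times farther apart than those in cluster 0.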

Contributing

Please feel free to submit an issue if things work differently than expected. Pull requests are also welcome - just make sure that tests are green by running pytest before submitting.

Citation

If you use k-means SMOTE in a scientific publication, we would appreciate citations to the following paper:

@article{kmeans_smote,
    title = {Oversampling for Imbalanced Learning Based on K-Means and SMOTE},
    author = {Last, Felix and Douzas, Georgios and Bacao, Fernando},
    year = {2017},
    archivePrefix = "arXiv",
    eprint = "1711.00837",
    primaryClass = "cs.LG"
}

References

[1] MacQueen, J. “Some Methods for Classification and Analysis of Multivariate Observations.” Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.

[2] Chawla, Nitesh V., et al. “SMOTE: Synthetic Minority Over-sampling Technique.” Journal of Artificial Intelligence Research, vol. 16, Jan. 2002, pp. 321-357, doi:10.1613/jair.953.
