Self-paced Ensemble for classification on class-imbalanced data.

Project description


"Self-paced Ensemble for Highly Imbalanced Massive Data Classification" (ICDE 2020).
[Paper] [Slides] [arXiv] [PyPI] [Documentation]

Self-paced Ensemble (SPE) is an ensemble learning framework for massive, highly imbalanced classification. It is an easy-to-use solution to class-imbalanced problems, featuring outstanding computational efficiency, good performance, and wide compatibility with different learning models. This SPE implementation supports multi-class classification.

Note: SPE is now a part of imbalanced-ensemble [Doc, PyPI]. Try it for more methods and advanced features!

Cite Us

If you find this repository helpful, please consider citing our work:

@inproceedings{liu2020self-paced-ensemble,
    title={Self-paced Ensemble for Highly Imbalanced Massive Data Classification},
    author={Liu, Zhining and Cao, Wei and Gao, Zhifeng and Bian, Jiang and Chen, Hechang and Chang, Yi and Liu, Tie-Yan},
    booktitle={2020 IEEE 36th International Conference on Data Engineering (ICDE)},
    pages={841--852},
    year={2020},
    organization={IEEE}
}

Install

It is recommended to use pip for installation.
Please make sure the latest version is installed to avoid potential problems:

$ pip install self-paced-ensemble            # normal install
$ pip install --upgrade self-paced-ensemble  # update if needed

Or you can install SPE by cloning this repository:

$ git clone https://github.com/ZhiningLiu1998/self-paced-ensemble.git
$ cd self-paced-ensemble
$ python setup.py install

The following dependencies are required:


Background

SPE performs strictly balanced under-sampling in each iteration and is therefore very computationally efficient. In addition, SPE does not rely on computing distances between samples to perform resampling, so it can be applied without modification to datasets that lack well-defined distance metrics (e.g., with categorical features or missing values). Moreover, as a generic ensemble framework, SPE can be combined with most existing learning methods (e.g., C4.5, SVM, GBDT, and neural networks) to boost their performance on imbalanced data.

Compared to existing imbalanced-learning methods, SPE works particularly well on datasets that are large-scale, noisy, and highly imbalanced (e.g., with an imbalance ratio greater than 100:1). Such data is widespread in real-world industrial applications. The figure below gives an overview of the SPE framework.

[Figure: overview of the Self-paced Ensemble framework]
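As a rough illustration of the idea above, the following sketch (not the authors' implementation; the function names, bin count, and hardness definition are simplifications chosen for this example) trains an ensemble by repeatedly under-sampling the majority class: the first round samples at random, and later rounds bin majority samples by the current ensemble's "hardness" estimate and draw evenly across bins, so each base learner sees a strictly balanced subset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

def self_paced_ensemble_sketch(X, y, n_estimators=10, n_bins=5, seed=0):
    """Train an ensemble with hardness-guided balanced under-sampling."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    min_cls, maj_cls = classes[np.argmin(counts)], classes[np.argmax(counts)]
    X_min, X_maj = X[y == min_cls], X[y == maj_cls]
    estimators = []
    for i in range(n_estimators):
        if not estimators:
            # First round: plain random under-sampling of the majority class.
            idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
        else:
            # Hardness of each majority sample = current ensemble's mean
            # predicted probability of the minority class (higher = harder).
            col = list(classes).index(min_cls)
            hardness = np.mean(
                [e.predict_proba(X_maj)[:, col] for e in estimators], axis=0)
            # Bin majority samples by hardness, then draw evenly across bins.
            bin_ids = np.digitize(hardness, np.linspace(0, 1, n_bins + 1)[1:-1])
            per_bin = max(1, len(X_min) // n_bins)
            idx_parts = []
            for b in range(n_bins):
                members = np.flatnonzero(bin_ids == b)
                if len(members) > 0:
                    take = min(per_bin, len(members))
                    idx_parts.append(rng.choice(members, size=take, replace=False))
            idx = np.concatenate(idx_parts)
        # Train a new base learner on the balanced subset.
        X_bal = np.vstack([X_min, X_maj[idx]])
        y_bal = np.concatenate(
            [np.full(len(X_min), min_cls), np.full(len(idx), maj_cls)])
        est = DecisionTreeClassifier(max_depth=3, random_state=seed + i)
        estimators.append(est.fit(X_bal, y_bal))
    return estimators

def ensemble_predict(estimators, X):
    """Soft-voting prediction over the ensemble."""
    proba = np.mean([e.predict_proba(X) for e in estimators], axis=0)
    return estimators[0].classes_[np.argmax(proba, axis=1)]
```

Because each round trains on only ~2x the minority-class size, the total training cost stays small even when the majority class is huge, which is where the efficiency claim above comes from.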

Usage

Documentation

Our SPE implementation can be used in much the same way as the ensemble classifiers in sklearn.ensemble. Detailed documentation of SelfPacedEnsembleClassifier can be found in the online API documentation.

Examples

API demo

from self_paced_ensemble import SelfPacedEnsembleClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Prepare class-imbalanced train & test data
X, y = make_classification(n_classes=2, random_state=42, weights=[0.1, 0.9])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

# Train an SPE classifier
clf = SelfPacedEnsembleClassifier(
        base_estimator=DecisionTreeClassifier(),
        n_estimators=10,
    ).fit(X_train, y_train)

# Predict with the trained SPE classifier
clf.predict(X_test)
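When evaluating any classifier on imbalanced data, plain accuracy is misleading: predicting only the majority class already scores high. Metrics such as balanced accuracy and average precision (AUPRC) are more informative. A minimal sketch, using a plain DecisionTreeClassifier as a stand-in for any fitted model:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, average_precision_score

# Imbalanced data: class 1 is the minority (~10% of samples)
X, y = make_classification(n_classes=2, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y)

clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Balanced accuracy averages recall over classes; AUPRC summarizes the
# precision-recall trade-off for the minority (positive) class.
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))
auprc = average_precision_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"balanced accuracy: {bal_acc:.3f}, AUPRC: {auprc:.3f}")
```

The same two calls work unchanged with a fitted SelfPacedEnsembleClassifier in place of the decision tree.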

Advanced usage example

Please see usage_example.ipynb.

Compare SPE with other methods

Please see comparison_example.ipynb.

Results

Dataset links: Credit Fraud, KDDCUP, Record Linkage, Payment Simulation.

[Figure: results of SPE on the four datasets above]

Comparisons of SPE with traditional resampling/ensemble methods in terms of performance & computational efficiency.

[Figures: detailed performance & computational-efficiency comparison results]

Miscellaneous

This repository contains:

  • Implementation of Self-paced Ensemble
  • Implementation of 5 ensemble-based imbalance learning baselines
    • SMOTEBoost [1]
    • SMOTEBagging [2]
    • RUSBoost [3]
    • UnderBagging [4]
    • BalanceCascade [5]
  • Implementation of resampling based imbalance learning baselines [6]
  • Additional experimental results

NOTE: The implementations of other ensemble and resampling methods are based on imbalanced-ensemble and imbalanced-learn.

References

[1] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: Improving prediction of the minority class in boosting," in European Conference on Principles of Data Mining and Knowledge Discovery. Springer, 2003, pp. 107–119.
[2] S. Wang and X. Yao, "Diversity analysis on imbalanced data sets by using ensemble models," in 2009 IEEE Symposium on Computational Intelligence and Data Mining. IEEE, 2009, pp. 324–331.
[3] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: A hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.
[4] R. Barandela, R. M. Valdovinos, and J. S. Sánchez, "New applications of ensembles of classifiers," Pattern Analysis & Applications, vol. 6, no. 3, pp. 245–256, 2003.
[5] X.-Y. Liu, J. Wu, and Z.-H. Zhou, "Exploratory undersampling for class-imbalance learning," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539–550, 2009.
[6] G. Lemaître, F. Nogueira, and C. K. Aridas, "Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning," Journal of Machine Learning Research, vol. 18, no. 17, pp. 1–5, 2017.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

self-paced-ensemble-0.1.4.tar.gz (42.4 kB)

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

self_paced_ensemble-0.1.4-py3.8.egg (97.7 kB)

self_paced_ensemble-0.1.4-py2.py3-none-any.whl (45.7 kB)

File details

Details for the file self-paced-ensemble-0.1.4.tar.gz.

File metadata

  • Download URL: self-paced-ensemble-0.1.4.tar.gz
  • Upload date:
  • Size: 42.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.24.0 setuptools/50.3.1.post20201107 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5

File hashes

Hashes for self-paced-ensemble-0.1.4.tar.gz
Algorithm Hash digest
SHA256 086392941c8413afa603a7a168c2829895c5d695324ade9d305b99bf0a7d80f8
MD5 4819c2c1ed51e5667ccc8fef72186364
BLAKE2b-256 5f01d5e1c2fdabaa442fd754a34198ac9f4c6536bf176c24dd96709b6d56073d

See more details on using hashes here.

File details

Details for the file self_paced_ensemble-0.1.4-py3.8.egg.

File metadata

  • Download URL: self_paced_ensemble-0.1.4-py3.8.egg
  • Upload date:
  • Size: 97.7 kB
  • Tags: Egg
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.24.0 setuptools/50.3.1.post20201107 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5

File hashes

Hashes for self_paced_ensemble-0.1.4-py3.8.egg
Algorithm Hash digest
SHA256 1bdfe1e3afed2dffde279230a74e9317bacbdcf6f38386391bc9c49d970025b9
MD5 343fbcefdc6d119edb6d1bdf905ee3b5
BLAKE2b-256 bb6d3d7a730143c10e44b1a2a58f16f93a691f99af240c60a8c49beb0052a28b

See more details on using hashes here.

File details

Details for the file self_paced_ensemble-0.1.4-py2.py3-none-any.whl.

File metadata

  • Download URL: self_paced_ensemble-0.1.4-py2.py3-none-any.whl
  • Upload date:
  • Size: 45.7 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.24.0 setuptools/50.3.1.post20201107 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5

File hashes

Hashes for self_paced_ensemble-0.1.4-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 c7304d31fe43fea06cef6a9306c6c375117fbaa6834c4b6af4c9e6e6ac392dc5
MD5 34b6bd40a62c320e0092af982ae150f8
BLAKE2b-256 c3eee15b8ebe9ed304ba66edac49aedc01d6d01d389c523d3f83d13d1e77bae0

See more details on using hashes here.
