Toolbox for ensemble learning on class-imbalanced datasets.
Reason this release was yanked:
Please install the latest version to avoid unexpected problems.
Project description
Imbalanced Ensemble: ensemble learning for class-imbalanced data in Python.
[Documentation]
[PyPI]
[Changelog]
[Source]
[Download]
imbalanced-ensemble (IMBENS, imported as imbalanced_ensemble) is a Python toolbox for quickly implementing and deploying ensemble learning algorithms on class-imbalanced data.
The problem of learning from imbalanced data is also known as imbalanced learning or long-tail learning (in the multi-class scenario).
IMBENS includes more than 15 ensemble imbalanced learning (EIL) algorithms, ranging from the classical SMOTEBoost (2003) and RUSBoost (2010) to the recent SPE (2020), and from resampling-based methods to cost-sensitive ensemble learning.
IMBENS is featured for:
- Unified, easy-to-use API design.
- Capable of multi-class imbalanced learning out of the box.
- Optimized performance, with parallelization via joblib where possible.
- Powerful, customizable, interactive training logging and visualizer.
- Full compatibility with other popular packages such as scikit-learn and imbalanced-learn.
API Demo:
# Train an SPE classifier
from imbalanced_ensemble.ensemble import SelfPacedEnsembleClassifier
clf = SelfPacedEnsembleClassifier(random_state=42)
clf.fit(X_train, y_train)
# Predict with an SPE classifier
clf.predict(X)
Table of Contents
- Installation
- Highlights
- List of implemented methods
- 5-min Quick Start with IMBENS
- Acknowledgements
- References
Installation
It is recommended to use pip for installation.
Please make sure the latest version is installed to avoid potential problems:
$ pip install imbalanced-ensemble # normal install
$ pip install --upgrade imbalanced-ensemble # update if needed
Or you can install imbalanced-ensemble by cloning this repository:
$ git clone https://github.com/ZhiningLiu1998/imbalanced-ensemble.git
$ cd imbalanced-ensemble
$ pip install .
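After installation, a quick import check confirms the package is available (a minimal sanity check; the __version__ attribute is assumed here, following the usual Python packaging convention):
$ python -c "import imbalanced_ensemble; print(imbalanced_ensemble.__version__)"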
imbalanced-ensemble requires the following dependencies:
- Python (>=3.6)
- numpy (>=1.16.0)
- pandas (>=1.1.3)
- scipy (>=0.19.1)
- joblib (>=0.11)
- scikit-learn (>=0.24)
- matplotlib (>=3.3.2)
- seaborn (>=0.11.0)
- tqdm (>=4.50.2)
Highlights
- Unified, easy-to-use API design.
  All ensemble learning methods implemented in IMBENS share a unified API design. Similar to sklearn, all methods have functions (e.g., fit(), predict(), predict_proba()) that allow users to deploy them with only a few lines of code.
- Extended functionalities, wider application scenarios.
  All methods in IMBENS are ready for multi-class imbalanced classification. We extend binary ensemble imbalanced learning methods so that they work in the multi-class scenario. Additionally, for supported methods, we provide more training options such as class-wise resampling control and a balancing scheduler during the ensemble training process.
- Detailed training log, quick intuitive visualization.
  We provide additional parameters (e.g., eval_datasets, eval_metrics, train_verbose) in fit() that let users control the information they want to monitor during ensemble training. We also implement an ImbalancedEnsembleVisualizer to quickly visualize the ensemble estimator(s) for further information or comparison. See an example here.
- Wide compatibility.
  IMBENS is designed to be compatible with scikit-learn (sklearn) and other compatible projects like imbalanced-learn. Users can therefore take advantage of various utilities from the sklearn community for data processing, cross-validation, hyper-parameter tuning, etc.
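As a sketch of this compatibility, an IMBENS classifier can be passed directly to sklearn's model-selection utilities (this example assumes only the public sklearn API and the SelfPacedEnsembleClassifier used elsewhere in this document):
from imbalanced_ensemble.ensemble import SelfPacedEnsembleClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# A small imbalanced dataset for demonstration
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=4,
                           weights=[0.2, 0.3, 0.5], random_state=0)

# IMBENS estimators follow the sklearn estimator protocol,
# so cross-validation works out of the box
scores = cross_val_score(SelfPacedEnsembleClassifier(random_state=0),
                         X, y, cv=5, scoring='balanced_accuracy')
print(scores.mean())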
List of implemented methods
Currently, 16 ensemble imbalanced learning methods are implemented:
- Resampling-based
  - Under-sampling + Ensemble: SelfPacedEnsembleClassifier [1], BalanceCascadeClassifier [1], BalancedRandomForestClassifier [3], EasyEnsembleClassifier [2], RUSBoostClassifier [4], UnderBaggingClassifier [5]
  - Over-sampling + Ensemble: OverBoostClassifier, SMOTEBoostClassifier [6], KmeansSMOTEBoostClassifier, OverBaggingClassifier [5], SMOTEBaggingClassifier [7]
- Reweighting-based
  - Cost-sensitive Learning: AdaCostClassifier [8], AdaUBoostClassifier [9], AsymBoostClassifier [10]
- Compatible: CompatibleAdaBoostClassifier [11], CompatibleBaggingClassifier [12]
Note: imbalanced-ensemble is still under development; please see the API reference for the latest list.
5-min Quick Start with IMBENS
A minimal working example
Taking self-paced ensemble [1] as an example, deploying it requires fewer than 10 lines of code:
>>> from imbalanced_ensemble.ensemble import SelfPacedEnsembleClassifier
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>>
>>> X, y = make_classification(n_samples=1000, n_classes=3,
... n_informative=4, weights=[0.2, 0.3, 0.5],
... random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.2, random_state=42)
>>> clf = SelfPacedEnsembleClassifier(random_state=0)
>>> clf.fit(X_train, y_train)
SelfPacedEnsembleClassifier(...)
>>> clf.predict(X_test)
array([...])
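To evaluate the fitted classifier on the held-out split, any sklearn metric can be used; a minimal sketch with balanced accuracy (a common choice for imbalanced data):
>>> from sklearn.metrics import balanced_accuracy_score
>>> score = balanced_accuracy_score(y_test, clf.predict(X_test))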
Customizing training log
All ensemble classifiers in IMBENS support customizable training logging.
The training log is controlled by 3 parameters of the fit() method: eval_datasets, eval_metrics, and train_verbose.
Read more details in the fit documentation.
Customize granularity and content of the training log
clf.fit(X_train, y_train,
        train_verbose={
            'granularity': 10,            # print a log row every 10 estimators
            'print_distribution': False,  # omit the class distribution table
            'print_metrics': True,        # report evaluation metrics
        })
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃             ┃            Data: train            ┃
┃ #Estimators ┃              Metric               ┃
┃             ┃  acc   balanced_acc   weighted_f1 ┃
┣━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃      1      ┃ 0.964     0.970          0.964    ┃
┃     10      ┃ 1.000     1.000          1.000    ┃
┃     20      ┃ 1.000     1.000          1.000    ┃
┃     30      ┃ 1.000     1.000          1.000    ┃
┃     40      ┃ 1.000     1.000          1.000    ┃
┃     50      ┃ 1.000     1.000          1.000    ┃
┣━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃    final    ┃ 1.000     1.000          1.000    ┃
┗━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Add evaluation dataset(s)
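The snippet below assumes a held-out validation set (X_valid, y_valid). One way to obtain it is a further stratified split of the training data (a sketch using sklearn's train_test_split):
from sklearn.model_selection import train_test_split

# Hold out 20% of the training data as a validation set;
# X_valid / y_valid are the names the snippet below expects
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42)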
clf.fit(X_train, y_train,
        eval_datasets={'valid': (X_valid, y_valid)},  # monitor an extra dataset named 'valid'
        train_verbose={
            'granularity': 10,
            'print_distribution': False,
            'print_metrics': True,
        })
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃             ┃           Data: train           ┃           Data: valid           ┃
┃ #Estimators ┃             Metric              ┃             Metric              ┃
┃             ┃ acc  balanced_acc  weighted_f1  ┃ acc  balanced_acc  weighted_f1  ┃
┣━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃      1      ┃ 0.939    0.961        0.940     ┃ 0.935    0.933        0.936     ┃
┃     10      ┃ 1.000    1.000        1.000     ┃ 0.971    0.974        0.971     ┃
┃     20      ┃ 1.000    1.000        1.000     ┃ 0.982    0.981        0.982     ┃
┃     30      ┃ 1.000    1.000        1.000     ┃ 0.983    0.983        0.983     ┃
┃     40      ┃ 1.000    1.000        1.000     ┃ 0.983    0.982        0.983     ┃
┃     50      ┃ 1.000    1.000        1.000     ┃ 0.983    0.982        0.983     ┃
┣━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃    final    ┃ 1.000    1.000        1.000     ┃ 0.983    0.982        0.983     ┃
┗━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Customize evaluation metric(s)
from sklearn.metrics import accuracy_score, balanced_accuracy_score

clf.fit(X_train, y_train,
        eval_datasets={'valid': (X_valid, y_valid)},
        eval_metrics={
            'acc': (accuracy_score, {}),  # each entry: (metric function, its kwargs)
            'balanced_acc': (balanced_accuracy_score, {}),
        },
        train_verbose={
            'granularity': 10,
            'print_distribution': False,
            'print_metrics': True,
        })
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃             ┃     Data: train     ┃     Data: valid     ┃
┃ #Estimators ┃       Metric        ┃       Metric        ┃
┃             ┃ acc   balanced_acc  ┃ acc   balanced_acc  ┃
┣━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━┫
┃      1      ┃ 0.942     0.961     ┃ 0.919     0.936     ┃
┃     10      ┃ 1.000     1.000     ┃ 0.976     0.976     ┃
┃     20      ┃ 1.000     1.000     ┃ 0.977     0.977     ┃
┃     30      ┃ 1.000     1.000     ┃ 0.981     0.980     ┃
┃     40      ┃ 1.000     1.000     ┃ 0.980     0.979     ┃
┃     50      ┃ 1.000     1.000     ┃ 0.981     0.980     ┃
┣━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━┫
┃    final    ┃ 1.000     1.000     ┃ 0.981     0.980     ┃
┗━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━┛
Visualize ensemble classifiers
The imbalanced_ensemble.visualizer sub-module provides an ImbalancedEnsembleVisualizer.
It can be used to visualize the ensemble estimator(s) for further information or comparison.
Read more details in the visualizer documentation.
Fit an ImbalancedEnsembleVisualizer
from imbalanced_ensemble.ensemble import SelfPacedEnsembleClassifier
from imbalanced_ensemble.ensemble import RUSBoostClassifier
from imbalanced_ensemble.ensemble import EasyEnsembleClassifier
from sklearn.tree import DecisionTreeClassifier
# Fit ensemble classifiers
init_kwargs = {'base_estimator': DecisionTreeClassifier()}
ensembles = {
'spe': SelfPacedEnsembleClassifier(**init_kwargs).fit(X_train, y_train),
'rusboost': RUSBoostClassifier(**init_kwargs).fit(X_train, y_train),
'easyens': EasyEnsembleClassifier(**init_kwargs).fit(X_train, y_train),
}
# Fit visualizer
from imbalanced_ensemble.visualizer import ImbalancedEnsembleVisualizer
visualizer = ImbalancedEnsembleVisualizer().fit(ensembles=ensembles)
Plot performance curves
fig, axes = visualizer.performance_lineplot()
Plot confusion matrices
fig, axes = visualizer.confusion_matrix_heatmap()
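Both plotting methods return standard matplotlib figure and axes objects (as the fig, axes return values above show), so the results can be saved or displayed with the usual matplotlib calls (the filename below is illustrative):
import matplotlib.pyplot as plt

# Persist the last figure to disk, then show it interactively
fig.savefig('confusion_matrix_heatmap.png', dpi=300)  # illustrative filename
plt.show()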
Acknowledgements
Many samplers and utilities are adapted from imbalanced-learn, which is an amazing project!
References
[1] Liu, Z., Cao, W., Gao, Z., Bian, J., Chen, H., Chang, Y., & Liu, T.-Y. (2020). Self-paced ensemble for highly imbalanced massive data classification. In 2020 IEEE 36th International Conference on Data Engineering (ICDE) (pp. 841-852). IEEE.
[2] Liu, X.-Y., Wu, J., & Zhou, Z.-H. (2009). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539-550.
[3] Chen, C., Liaw, A., & Breiman, L. (2004). Using random forest to learn imbalanced data. University of California, Berkeley, 110, 1-12.
[4] Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., & Napolitano, A. (2010). RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 40(1), 185-197.
[5] Maclin, R., & Opitz, D. (1997). An empirical evaluation of bagging and boosting. AAAI/IAAI, 1997, 546-551.
[6] Chawla, N. V., Lazarevic, A., Hall, L. O., & Bowyer, K. W. (2003). SMOTEBoost: Improving prediction of the minority class in boosting. In European Conference on Principles of Data Mining and Knowledge Discovery (pp. 107-119). Springer.
[7] Wang, S., & Yao, X. (2009). Diversity analysis on imbalanced data sets by using ensemble models. In 2009 IEEE Symposium on Computational Intelligence and Data Mining (pp. 324-331). IEEE.
[8] Fan, W., Stolfo, S. J., Zhang, J., & Chan, P. K. (1999). AdaCost: Misclassification cost-sensitive boosting. In ICML (Vol. 99, pp. 97-105).
[9] Karakoulas, G., & Shawe-Taylor, J. (1999). Optimizing classifiers for imbalanced training sets. Advances in Neural Information Processing Systems, 11, 253.
[10] Viola, P., & Jones, M. (2001). Fast and robust classification using asymmetric AdaBoost and a detector cascade. Advances in Neural Information Processing Systems, 14.
[11] Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.
[12] Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
[13] Lemaître, G., Nogueira, F., & Aridas, C. K. (2017). Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17), 1-5.
Download files
Source Distribution
Hashes for imbalanced-ensemble-0.1.3.tar.gz
Algorithm | Hash digest
---|---
SHA256 | a03a2490a9627cb1929a0c33a23f7dd9ddc4111b599955d337f76c3522fdae34
MD5 | 0b4bd422e9e47d45171c3dca4cd5f89b
BLAKE2b-256 | 0d2743d61902793f399a32fe39796a2147513a6d9b035ae179daf29db914812d

Built Distributions
Hashes for imbalanced_ensemble-0.1.3-py3.8.egg
Algorithm | Hash digest
---|---
SHA256 | 6e9a83e57d463e13815dcf18a7022f4dcd3b6785b841977199bce2c33fc6e3e8
MD5 | 9e1a94a39302b930bb2c9cecb49b7164
BLAKE2b-256 | cc30b4cc4c6afddfda9d34869b04608ead03ea9cbc017905bd943d2fb61e4350

Hashes for imbalanced_ensemble-0.1.3-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 7f6cc6446b95b20030552cf7930e3364b1210534fc39516400de8d59b18cc4c6
MD5 | 28a21a5142e47c84719b29c704d2af9a
BLAKE2b-256 | c9ad6796686cb941bdff073192278b334f6accedd3c3a1620674ba664c827dfa