A Python Toolbox for Machine Learning Model Combination

Project description

combo is a Python toolbox for combining or aggregating ML models and scores for various tasks, including classification, clustering, anomaly detection, and raw score combination. It has been widely used in data science competitions, such as Kaggle, and in real-world tasks.

Model and score combination can be regarded as a subtask of ensemble learning, but it often goes beyond the scope of ensemble learning. For instance, averaging the results of multiple runs of an ML model is regarded as a reliable way of reducing randomness and improving stability. See the figure below for some popular combination approaches.

Combination Framework Demo
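The averaging idea mentioned above can be illustrated with a minimal sketch. It assumes that every run of a model returns one score per sample; the simulated scores, the noise level, and the helper name average_runs are illustrative assumptions and are not part of combo.

```python
import numpy as np


def average_runs(run_scores):
    """Average score vectors from repeated runs (rows = runs, columns = samples)."""
    return np.asarray(run_scores, dtype=float).mean(axis=0)


rng = np.random.RandomState(42)
true_scores = np.linspace(0.0, 1.0, 200)  # hypothetical noise-free scores

# 10 simulated runs: the same signal plus run-specific random noise
runs = true_scores + rng.normal(scale=0.2, size=(10, 200))

# mean absolute deviation from the noise-free scores
single_run_error = np.abs(runs[0] - true_scores).mean()
averaged_error = np.abs(average_runs(runs) - true_scores).mean()
print(single_run_error, averaged_error)
```

In this simulation the averaged scores deviate less from the noise-free signal than a single run does, which is the stability benefit described above.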

combo is featured for:

  • Unified APIs, detailed documentation, and interactive examples across various algorithms.

  • Advanced models, including dynamic classifier/ensemble selection and LSCP.

  • Broad applications for classification, clustering, anomaly detection, and raw score.

  • Comprehensive coverage for supervised, unsupervised, and semi-supervised scenarios.

  • Optimized performance with JIT and parallelization when possible, using numba and joblib.

API Demo:

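from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from combo.models.stacking import Stacking

# base classifiers
classifiers = [DecisionTreeClassifier(), LogisticRegression(),
               KNeighborsClassifier(), RandomForestClassifier(),
               GradientBoostingClassifier()]

# X_train, y_train, and X_test are assumed to be loaded already
clf = Stacking(base_clfs=classifiers)  # initialize a Stacking model
clf.fit(X_train, y_train)  # Stacking is supervised, so labels are required

# predict on unseen data
y_test_labels = clf.predict(X_test)  # label prediction
y_test_proba = clf.predict_proba(X_test)  # probability prediction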

Installation

It is recommended to use pip for installation. Please make sure the latest version is installed, as combo is updated frequently:

pip install combo            # normal install
pip install --upgrade combo  # or update if needed
pip install --pre combo      # or include pre-release version for new features

Alternatively, you can clone the repository and install it from source:

git clone https://github.com/yzhao062/combo.git
cd combo
pip install .

Required Dependencies:

  • Python 3.5, 3.6, or 3.7

  • joblib

  • matplotlib

  • numpy>=1.13

  • numba>=0.35

  • scipy>=0.19.1

  • scikit_learn>=0.19.1


API Cheatsheet & Reference

Full API Reference: (https://pycombo.readthedocs.io/en/latest/api.html). API cheatsheet for most of the models:

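  • fit(X, y): Fit an estimator. y is only needed for supervised models, such as classifier combination; unsupervised models, such as clustering combination, are fitted with fit(X).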

  • predict(X): Predict on a particular sample once the estimator is fitted.

  • predict_proba(X): Predict the probability of a sample belonging to each class. Only applicable for classification tasks.
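To make the three methods above concrete, here is a minimal sketch of an object that follows the same fit / predict / predict_proba pattern. Both classes (ToyCentroidClassifier and ToyAverageAggregator) are hypothetical stand-ins written for this illustration; they are not combo classes and do not reflect how combo implements its models.

```python
import numpy as np


class ToyCentroidClassifier:
    """Tiny stand-in base model: softmax over negative distances to class centroids."""

    def __init__(self, sharpness=1.0):
        self.sharpness = sharpness

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        X = np.asarray(X, dtype=float)
        # pairwise distances, shape (n_samples, n_classes)
        dist = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        logits = -self.sharpness * dist
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        exp = np.exp(logits)
        return exp / exp.sum(axis=1, keepdims=True)


class ToyAverageAggregator:
    """Hypothetical aggregator that averages the class probabilities of its base models."""

    def __init__(self, base_estimators):
        self.base_estimators = base_estimators

    def fit(self, X, y):
        for est in self.base_estimators:
            est.fit(X, y)
        self.classes_ = self.base_estimators[0].classes_
        return self

    def predict_proba(self, X):
        return np.mean([est.predict_proba(X) for est in self.base_estimators], axis=0)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.1, 0.0], [1.0, 0.9]])

clf = ToyAverageAggregator([ToyCentroidClassifier(0.5), ToyCentroidClassifier(2.0)])
clf.fit(X_train, y_train)
labels = clf.predict(X_test)        # one label per test sample
proba = clf.predict_proba(X_test)   # one row of class probabilities per test sample
```

Because every estimator exposes the same three methods, an aggregator can be swapped in wherever a single model was used.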


Proposed Algorithms

combo groups combination frameworks by tasks.

  • For most of the tasks, the following combination methods for raw scores are feasible [7]:

    1. Averaging & Weighted Averaging

    2. Maximization

    3. Majority Vote & Weighted Majority Vote

    4. Median
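The combination methods listed above can be sketched in a few lines of NumPy. The score matrix, the weights, and the helper majority_vote are made up for illustration; this is not combo's own implementation, and ties in the vote are resolved in favor of the lowest class index as a simplifying assumption.

```python
import numpy as np

# rows = samples, columns = base estimators (hypothetical scores)
scores = np.array([[0.1, 0.3, 0.2],
                   [0.8, 0.6, 0.9],
                   [0.4, 0.5, 0.9]])
weights = np.array([0.2, 0.2, 0.6])  # hypothetical estimator weights

avg = scores.mean(axis=1)                             # averaging
w_avg = np.average(scores, axis=1, weights=weights)   # weighted averaging
maxi = scores.max(axis=1)                             # maximization
med = np.median(scores, axis=1)                       # median


def majority_vote(labels, weights=None):
    """Per-sample (weighted) majority vote over integer class labels."""
    labels = np.asarray(labels)
    n_samples, n_estimators = labels.shape
    if weights is None:
        weights = np.ones(n_estimators)
    votes = np.zeros((n_samples, labels.max() + 1))
    for j in range(n_estimators):
        # add estimator j's weight to the class it voted for, sample by sample
        votes[np.arange(n_samples), labels[:, j]] += weights[j]
    return votes.argmax(axis=1)  # ties go to the lowest class index


labels = (scores > 0.5).astype(int)        # turn scores into 0/1 votes
vote = majority_vote(labels)               # majority vote
w_vote = majority_vote(labels, weights)    # weighted majority vote
```

In this example the third sample shows how weighting changes the outcome: two of three estimators vote for class 0, but the third estimator carries enough weight to flip the weighted vote to class 1.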

Some of the methods are task-specific:

  • Classifier combination: combine multiple supervised classifiers together for training and prediction

    1. SimpleClassifierAggregator: combining classifiers by (i) (weighted) average (ii) maximization (iii) median and (iv) (weighted) majority vote

    2. Dynamic Classifier Selection & Dynamic Ensemble Selection [3] (work-in-progress)

    3. Stacking (meta ensembling): build an additional classifier to learn base estimator weights [2]

  • Cluster combination: combine and align unsupervised clustering results

    1. Clusterer Ensemble [6]

  • Anomaly detection: combine unsupervised (and supervised) outlier detectors

    1. SimpleDetectorCombination: combining outlier score results by (i) (weighted) average (ii) maximization (iii) median and (iv) (weighted) majority vote

    2. Average of Maximum (AOM) [1]

    3. Maximum of Average (MOA) [1]

    4. Thresholding

    5. Locally Selective Combination (LSCP) [4]

    6. XGBOD: a semi-supervised combination framework for outlier detection [5]
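As an illustration of the AOM and MOA entries in the list above, the following sketch splits the detectors into buckets and combines their scores in two stages. It uses fixed, contiguous buckets for reproducibility, which is a simplifying assumption; implementations often assign detectors to buckets at random. The function names and the score matrix are illustrative and are not taken from combo.

```python
import numpy as np


def standardize(scores):
    """Z-score each detector's column so that scores are comparable across detectors.

    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean(axis=0)) / scores.std(axis=0)


def aom(scores, n_buckets):
    """Average of Maximum: take the max within each bucket, then average across buckets."""
    buckets = np.array_split(np.asarray(scores, dtype=float), n_buckets, axis=1)
    return np.mean([b.max(axis=1) for b in buckets], axis=0)


def moa(scores, n_buckets):
    """Maximum of Average: average within each bucket, then take the max across buckets."""
    buckets = np.array_split(np.asarray(scores, dtype=float), n_buckets, axis=1)
    return np.max([b.mean(axis=1) for b in buckets], axis=0)


# rows = samples, columns = 4 hypothetical detectors; 2 buckets = columns (0, 1) and (2, 3)
raw_scores = np.array([[1.0, 3.0, 2.0, 2.0],
                       [4.0, 0.0, 1.0, 5.0],
                       [2.0, 2.0, 6.0, 0.0]])

aom_scores = aom(raw_scores, n_buckets=2)
moa_scores = moa(raw_scores, n_buckets=2)

# in practice, detector scores are usually standardized before being combined
aom_on_standardized = aom(standardize(raw_scores), n_buckets=2)
```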


Quick Start for Classifier Combination

“examples/classifier_comb_example.py” demonstrates the basic API of predicting with multiple classifiers. Note that the API is consistent across all other algorithms.

  1. Initialize a group of classifiers as base estimators

    # initialize a group of classifiers
    classifiers = [DecisionTreeClassifier(), LogisticRegression(),
                   KNeighborsClassifier(), RandomForestClassifier(),
                   GradientBoostingClassifier()]
  2. Initialize, fit, predict, and evaluate with a simple aggregator (average)

    from combo.models.classifier_comb import SimpleClassifierAggregator
    
    clf = SimpleClassifierAggregator(classifiers, method='average')
    clf.fit(X_train, y_train)
    y_test_predicted = clf.predict(X_test)
    evaluate_print('Combination by avg   |', y_test, y_test_predicted)
  3. See a sample output of classifier_comb_example.py

    Decision Tree        | Accuracy:0.9386, ROC:0.9383, F1:0.9521
    Logistic Regression  | Accuracy:0.9649, ROC:0.9615, F1:0.973
    K Neighbors          | Accuracy:0.9561, ROC:0.9519, F1:0.9662
    Gradient Boosting    | Accuracy:0.9605, ROC:0.9524, F1:0.9699
    Random Forest        | Accuracy:0.9605, ROC:0.961, F1:0.9693
    
    Combination by avg   | Accuracy:0.9693, ROC:0.9677, F1:0.9763
    Combination by w_avg | Accuracy:0.9781, ROC:0.9716, F1:0.9833
    Combination by max   | Accuracy:0.9518, ROC:0.9312, F1:0.9642
    Combination by w_vote| Accuracy:0.9649, ROC:0.9644, F1:0.9728
    Combination by median| Accuracy:0.9693, ROC:0.9677, F1:0.9763

Quick Start for Clustering Combination

“examples/cluster_comb_example.py” demonstrates the basic API of combining multiple base clustering estimators.

  1. Initialize a group of clustering methods as base estimators

    from combo.models.cluster_comb import ClustererEnsemble
    
    # Initialize a set of estimators
    estimators = [KMeans(n_clusters=n_clusters),
                  MiniBatchKMeans(n_clusters=n_clusters),
                  AgglomerativeClustering(n_clusters=n_clusters)]
  2. Initialize a ClustererEnsemble object and fit the model

    # combine by Clusterer Ensemble
    clf = ClustererEnsemble(estimators, n_clusters=n_clusters)
    clf.fit(X)
  3. Get the aligned results

    # generate the labels on X
    aligned_labels = clf.aligned_labels_
    predicted_labels = clf.labels_
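Cluster labels from different estimators are arbitrary names, so they have to be aligned before they can be combined. The sketch below shows one simple way to do this: try every relabeling and keep the one that agrees most with a reference clustering, then take a majority vote. It is a brute-force illustration that is only practical for a small number of clusters, and it is not necessarily how ClustererEnsemble or [6] performs the alignment; the helper names align_labels and combine_by_vote are hypothetical.

```python
from itertools import permutations

import numpy as np


def align_labels(reference, labels, n_clusters):
    """Relabel `labels` so that it agrees with `reference` on as many samples as possible."""
    reference, labels = np.asarray(reference), np.asarray(labels)
    best_perm, best_overlap = None, -1
    for perm in permutations(range(n_clusters)):
        mapped = np.array(perm)[labels]  # perm[k] is the new name of cluster k
        overlap = int((mapped == reference).sum())
        if overlap > best_overlap:
            best_perm, best_overlap = perm, overlap
    return np.array(best_perm)[labels]


def combine_by_vote(aligned, n_clusters):
    """Majority vote across aligned clusterings (rows = samples, columns = clusterers)."""
    aligned = np.asarray(aligned)
    counts = np.stack([(aligned == k).sum(axis=1) for k in range(n_clusters)], axis=1)
    return counts.argmax(axis=1)


reference = np.array([0, 0, 1, 1, 2, 2])
second = np.array([2, 2, 0, 0, 1, 1])  # same partition, different label names
third = np.array([1, 1, 2, 0, 0, 0])   # similar partition with one disagreement

aligned_second = align_labels(reference, second, n_clusters=3)
aligned_third = align_labels(reference, third, n_clusters=3)

aligned = np.column_stack([reference, aligned_second, aligned_third])
combined = combine_by_vote(aligned, n_clusters=3)
```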

An Example of Stacking

“examples/stacking_example.py” demonstrates the basic API of stacking (meta ensembling).

  1. Initialize a group of classifiers as base estimators

    # initialize a group of classifiers
    classifiers = [DecisionTreeClassifier(), LogisticRegression(),
                   KNeighborsClassifier(), RandomForestClassifier(),
                   GradientBoostingClassifier()]
  2. Initialize, fit, predict, and evaluate with Stacking

    from combo.models.stacking import Stacking
    
    clf = Stacking(base_clfs=classifiers, n_folds=4, shuffle_data=False,
                keep_original=True, use_proba=False, random_state=random_state)
    
    clf.fit(X_train, y_train)
    y_test_predict = clf.predict(X_test)
    evaluate_print('Stacking | ', y_test, y_test_predict)
  3. See a sample output of stacking_example.py

    Decision Tree        | Accuracy:0.9386, ROC:0.9383, F1:0.9521
    Logistic Regression  | Accuracy:0.9649, ROC:0.9615, F1:0.973
    K Neighbors          | Accuracy:0.9561, ROC:0.9519, F1:0.9662
    Gradient Boosting    | Accuracy:0.9605, ROC:0.9524, F1:0.9699
    Random Forest        | Accuracy:0.9605, ROC:0.961, F1:0.9693
    
    Stacking             | Accuracy:0.9868, ROC:0.9841, F1:0.9899
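The idea behind stacking can be sketched without any library model: base models produce out-of-fold predictions, and a meta-learner is fitted on those predictions to learn one weight per base model. Everything in the sketch is an assumption made for illustration: ThresholdStump is a toy base model, the meta-learner is a plain least-squares fit without an intercept, the data are invented, and options such as keep_original and use_proba from the call above are not modeled. It is not the implementation used by combo's Stacking class.

```python
import numpy as np


class ThresholdStump:
    """Toy base model (hypothetical): predicts 1 when one feature exceeds its training mean."""

    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y=None):
        self.threshold_ = X[:, self.feature].mean()
        return self

    def predict(self, X):
        return (X[:, self.feature] > self.threshold_).astype(int)


def out_of_fold_predictions(models, X, y, n_folds):
    """Predict every sample with models that were fitted without that sample."""
    indices = np.arange(len(X))
    meta = np.zeros((len(X), len(models)))
    for held_out in np.array_split(indices, n_folds):  # contiguous folds, no shuffling
        train = np.setdiff1d(indices, held_out)
        for j, model in enumerate(models):
            meta[held_out, j] = model.fit(X[train], y[train]).predict(X[held_out])
    return meta


X_train = np.array([[0.1, 0.9], [0.9, 0.2], [0.2, 0.8], [0.8, 0.1],
                    [0.3, 0.7], [0.7, 0.3], [0.0, 0.6], [1.0, 0.4]])
y_train = np.array([0, 1, 0, 1, 0, 1, 0, 1])
X_test = np.array([[0.05, 0.95], [0.95, 0.05]])

models = [ThresholdStump(feature=0), ThresholdStump(feature=1)]

# Step 1: out-of-fold predictions of the base models become the meta-features
meta_train = out_of_fold_predictions(models, X_train, y_train, n_folds=4)

# Step 2: the meta-learner learns one weight per base model (least squares, no intercept)
weights = np.linalg.lstsq(meta_train, y_train, rcond=None)[0]

# Step 3: refit the base models on all training data and combine their test predictions
meta_test = np.column_stack([m.fit(X_train, y_train).predict(X_test) for m in models])
y_test_predict = (meta_test @ weights > 0.5).astype(int)
```

Using out-of-fold predictions keeps the meta-learner from being trained on predictions that the base models made for samples they had already seen.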

Development Status

combo is currently under development as of July 24, 2019. A concrete plan has been laid out and will be implemented in the next few months.

Similar to other libraries built by us, e.g., Python Outlier Detection Toolbox (pyod), combo is intended for publication in the open-source software track of the Journal of Machine Learning Research (JMLR). A demo paper may also be submitted to AAAI or IJCAI soon as a progress update.

Watch & Star to get the latest update! Also feel free to send me an email (zhaoy@cmu.edu) for suggestions and ideas.


Reference


Download files

Source Distribution

combo-0.0.5.tar.gz (25.7 kB)
