
Ensemble model creation through Bayesian Optimization


EnsemblesOpt


Building model ensembles can be a very slow process when you want to find out which combination of base learners performs best, and training time grows with ensemble size. The problem can be stated as: given N base models, find the best-performing set of base learners of size K, with K < N; this search can be carried out efficiently with a Bayesian Optimization approach.

This repository contains a package for speeding up the search for the best base learners when building ensemble models through Bayesian Optimization, using Gaussian Processes as the surrogate model and Expected Improvement (EI), Probability of Improvement (PI), or Upper Confidence Bound (UCB) as the acquisition function, along with optimization routines built on the Optuna library.
The black-box function is defined as the N-fold cross-validation score of the chosen evaluation metric for the ensemble considered at each iteration. Each base model is mapped to an integer, and a combination of these integers is passed to the objective function for evaluation.
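To make the integer mapping concrete, here is a minimal sketch of such a black-box objective written against plain scikit-learn (an illustration of the idea only, not the package's internal code; the dataset and the model set are placeholder assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

#placeholder data and integer-to-model mapping
X, y = make_classification(n_samples=500, random_state=0)
models = {0: ExtraTreeClassifier(), 1: DecisionTreeClassifier(), 2: KNeighborsClassifier()}

def objective(combo):
    #cross-validation score of the voting ensemble encoded by a tuple of integers
    estimators = [(f"m{i}", models[i]) for i in combo]
    ensemble = VotingClassifier(estimators=estimators, voting='soft')
    return cross_val_score(ensemble, X, y, cv=5, scoring='roc_auc').mean()

print(objective((0, 2)))  #score for the {ExtraTree, KNeighbors} pair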

Install by running:

pip install EnsemblesOpt==0.1.2

Code Snippets

First, import the base models among which to search for the best ensemble of a given size:


from sklearn.tree import ExtraTreeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier

Bayesian optimization search:

#initialize the Bayesian_Voting_Ensemble
from EnsemblesOpt import Bayesian_Voting_Ensemble


BS = Bayesian_Voting_Ensemble(ensemble_size=2,
                              models_list=[ExtraTreeClassifier(),
                                           DecisionTreeClassifier(),
                                           MLPClassifier(),
                                           SGDClassifier(),
                                           KNeighborsClassifier()],
                              xi=0.05,
                              random_init_points=7,
                              maximize_obj=True,
                              scoring='roc_auc',
                              task='classification',
                              acquisition_func='EI')

#fit the Bayesian_Voting_Ensemble
Best_model = BS.fit(X, y,
                    Nfold=5,
                    n_iters=9,
                    stratify=True)
Output:
Collecting initial random points...
Searching best ensemble...
-trial  0 |Score value: 0.8626962395405989
-trial  1 |Score value: 0.8755565498352099
-trial  2 |Score value: 0.8742921444887171
-trial  3 |Score value: 0.8868338004352088
-trial  4 |Score value: 0.8562297244914867
-trial  5 |Score value: 0.8629782101656331
-trial  6 |Score value: 0.865559835850203
-trial  7 |Score value: 0.887221833391049
-trial  8 |Score value: 0.8534670721947504
-trial  9 |Score value: 0.8283346726135243
Best Ensemble:
 [LGBMClassifier(bagging_fraction=0.9861531786655775, bagging_freq=3,
               feature_fraction=0.14219334035549125,
               lambda_l1=7.009080384469092e-07, lambda_l2=5.029465681170278e-06,
               learning_rate=0.08695762873585877, max_bin=1255,
               min_child_samples=93, n_estimators=316, num_leaves=38,
               silent='warn'), GradientBoostingClassifier()] 
best score 0.887221833391049
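
To use the winning combination after the search, one option, assuming Best_model is the list of selected base learners exactly as printed above rather than an already-fitted estimator (worth verifying against the package docs), is to wrap it in a scikit-learn VotingClassifier and refit:

from sklearn.ensemble import VotingClassifier

#wrap the selected base learners in a soft-voting ensemble and refit on the full data
final_model = VotingClassifier(estimators=[(f"m{i}", m) for i, m in enumerate(Best_model)],
                               voting='soft')
final_model.fit(X, y)
probabilities = final_model.predict_proba(X)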

Common parameters for the Bayesian_Voting_Ensemble class:

Parameter Usage
"ensemble_size" Number of base estimators in the ensemble; the bigger the ensemble, the more time-consuming the search and the more complex the final model.
"models_list" List of base models. If "None" is provided, a preloaded list of models is used.
"xi" Exploration parameter; higher values lead to more explorative behaviour, lower values to more exploitative behaviour (default xi=0.01).
"random_init_points" Number of initial random points sampled from the objective function.
"maximize_obj" Whether to maximize or minimize the objective function [True or False].
"scoring" Metric to optimize.
"task" Either "classification" or "regression".
"type_p" Voting type for classification problems: 'soft' or 'hard'.
"acquisition_func" Acquisition function: "EI" (expected improvement), "PI" (probability of improvement), or "UCB" (upper confidence bound).

Common parameters for the fit method:

Parameter Usage
"X" Training dataset without the target variable.
"y" Target variable.
"n_iters" Number of optimization trials to run.
"Nfold" Number of folds for cross-validation.
"stratify" Whether to stratify CV splits by the target distribution [True or False].

The 'scoring' parameter accepts the same scoring strings as the scikit-learn API (list of available values: https://scikit-learn.org/stable/modules/model_evaluation.html).

Optuna best stacking ensemble search:

from EnsemblesOpt import Optuna_StackEnsemble_Search
from sklearn.linear_model import LogisticRegression

Opt = Optuna_StackEnsemble_Search(scoring_metric="roc_auc",
                                  direction="maximize",
                                  problem_type='classification',
                                  size_stack=2,
                                  models_list=[ExtraTreeClassifier(),
                                               DecisionTreeClassifier(),
                                               MLPClassifier(),
                                               SGDClassifier(),
                                               KNeighborsClassifier()],
                                  meta_learner=LogisticRegression())

Best_model, study = Opt.fit(X, y, n_trials=50, N_folds=3, stratify=True)
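
The mechanism is easy to picture with a small Optuna sketch (an assumption about how such a search works, not the package source; it reuses the models dictionary and X, y from the earlier sketch): each trial suggests one integer per slot in the stack, and the trial value is the cross-validated score of the resulting StackingClassifier.

import optuna
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def trial_objective(trial):
    #one integer suggestion per stack slot, each indexing into the model mapping
    idx = [trial.suggest_int(f"slot_{k}", 0, len(models) - 1) for k in range(2)]
    stack = StackingClassifier(estimators=[(f"m{k}", models[i]) for k, i in enumerate(idx)],
                               final_estimator=LogisticRegression())
    return cross_val_score(stack, X, y, cv=3, scoring='roc_auc').mean()

study = optuna.create_study(direction="maximize")
study.optimize(trial_objective, n_trials=50)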

Common parameters for the Optuna_StackEnsemble_Search class:

Parameter Usage
"size_stack" Number of base estimators in the stack; the bigger the stack, the more time-consuming the search and the more complex the final model.
"models_list" List of base models.
"scoring_metric" Metric to optimize.
"problem_type" Either "classification" or "regression".
"direction" Either "maximize" or "minimize".
"meta_learner" Meta learner for the stacked ensemble; if not provided, Optuna will search for one among the base models.

Common parameters for the fit method:

Parameter Usage
"X" Training dataset without the target variable.
"y" Target variable.
"n_trials" Number of optimization trials to run.
"N_folds" Number of folds for cross-validation.
"stratify" Whether to stratify CV splits by the target distribution [True or False].

Optuna best voting ensemble search:

from EnsemblesOpt import Optuna_VotingEnsemble_Search

Opt = Optuna_VotingEnsemble_Search(scoring_metric="roc_auc",
                                   direction="maximize",
                                   problem_type='classification',
                                   ensemble_size=2,
                                   models_list=[ExtraTreeClassifier(),
                                                DecisionTreeClassifier(),
                                                MLPClassifier(),
                                                SGDClassifier(),
                                                KNeighborsClassifier()],
                                   voting_type='soft')

Best_model, study = Opt.fit(X, y, n_trials=10,
                            N_folds=3,
                            stratify=True)
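
The second return value, study, appears to be the underlying Optuna study object (an assumption, given the library builds on Optuna), in which case the usual Optuna inspection tools apply after fitting:

print(study.best_params)  #which base-model slot assignments won
print(study.best_value)   #best cross-validated score found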

Common parameters for the Optuna_VotingEnsemble_Search class:

Parameter Usage
"ensemble_size" Number of base estimators in the ensemble; the bigger the ensemble, the more time-consuming the search and the more complex the final model.
"models_list" List of base models.
"scoring_metric" Metric to optimize.
"problem_type" Either "classification" or "regression".
"direction" Either "maximize" or "minimize".
"voting_type" Voting type: 'soft' or 'hard'.

Common parameters for the fit method:

Parameter Usage
"X" Training dataset without the target variable.
"y" Target variable.
"n_trials" Number of optimization trials to run.
"N_folds" Number of folds for cross-validation.
"stratify" Whether to stratify CV splits by the target distribution [True or False].

Optuna search for the best voting-ensemble weights:

from EnsemblesOpt import Optuna_Voting_weights_tuner

Opt = Optuna_Voting_weights_tuner(scoring_metric="roc_auc",
                                  direction="maximize",
                                  problem_type='classification',
                                  models_list=[ExtraTreeClassifier(),
                                               DecisionTreeClassifier(),
                                               MLPClassifier(),
                                               SGDClassifier(),
                                               KNeighborsClassifier()],
                                  voting_type='soft',
                                  weights_list=[1, 2, 3])

Best_model, study = Opt.fit(X, y, n_trials=10,
                            N_folds=3,
                            stratify=True)
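
The weight search is again easy to picture with a small Optuna sketch (an assumed mechanism, not the package source, reusing models and X, y from the first sketch): each trial picks one weight per base model from weights_list and scores the weighted soft-voting ensemble.

import optuna
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

weights_list = [1, 2, 3]

def trial_objective(trial):
    #one weight per base model, drawn from the candidate list
    weights = [trial.suggest_categorical(f"w_{k}", weights_list) for k in range(len(models))]
    ensemble = VotingClassifier(estimators=[(f"m{k}", m) for k, m in models.items()],
                                voting='soft',
                                weights=weights)
    return cross_val_score(ensemble, X, y, cv=3, scoring='roc_auc').mean()

study = optuna.create_study(direction="maximize")
study.optimize(trial_objective, n_trials=10)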

Common parameters for the Optuna_Voting_weights_tuner class:

Parameter Usage
"models_list" List of base models.
"scoring_metric" Metric to optimize.
"problem_type" Either "classification" or "regression".
"direction" Either "maximize" or "minimize".
"voting_type" Voting type: 'soft' or 'hard'.
"weights_list" Weights to try, as a list of integers or floats, e.g. [0, 1, 2, 3, ...].

Common parameters for the fit method:

Parameter Usage
"X" Training dataset without the target variable.
"y" Target variable.
"n_trials" Number of optimization trials to run.
"N_folds" Number of folds for cross-validation.
"stratify" Whether to stratify CV splits by the target distribution [True or False].

