A general package to handle nested cross-validation for any estimator that implements the scikit-learn estimator interface.
Nested Cross-Validation
This repository implements a general nested cross-validation function, ready to use with ANY estimator that implements the scikit-learn estimator interface.
Installing the package:
You can find the package on PyPI* and install it via pip using the following command:
pip install nestedcv
You can also install it from the wheel file on the Releases page.
* We push updates gradually; pull master from GitHub if you want the absolute latest changes.
Usage
Be mindful of the options available for NestedCV. Some cross-validation options are defined in a dictionary, cv_options.
This package works with any estimator that implements a scikit-learn wrapper, e.g. XGBoost, LightGBM, KerasRegressor, KerasClassifier, etc.
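The scikit-learn estimator interface that such wrappers implement is small: fit, predict, and the get_params/set_params machinery inherited from BaseEstimator. As an illustration of the interface only (the MeanRegressor class below is a hypothetical toy, not part of this package), any class of this shape could in principle be passed as the model:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MeanRegressor(BaseEstimator, RegressorMixin):
    """Toy estimator: always predicts the training-target mean.
    Inheriting from BaseEstimator supplies get_params/set_params."""

    def fit(self, X, y):
        self.mean_ = np.mean(y)  # learned attribute, trailing underscore by convention
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

est = MeanRegressor()
est.fit([[1], [2], [3]], [1.0, 2.0, 3.0])
print(est.predict([[4]]))  # predicts the training mean, 2.0
```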
Single algorithm
Here is a single example using Random Forest:
# Import the package and the estimator
from sklearn.ensemble import RandomForestRegressor
from nested_cv import NestedCV

# Define a parameter grid
param_grid = {
    'max_depth': [3, None],
    'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    'max_features': [50, 100, 150, 200]  # Note: you might not have that many features
}

# Define parameters for the function
# Default scoring: RMSE
nested_CV_search = NestedCV(model=RandomForestRegressor(), params_grid=param_grid,
                            outer_kfolds=5, inner_kfolds=5,
                            cv_options={'sqrt_of_score': True, 'randomized_search_iter': 30})
nested_CV_search.fit(X=X, y=y)
nested_CV_search.score_vs_variance_plot()
print('\nCumulated best parameter grid was:\n{0}'.format(nested_CV_search.best_params))
Multiple algorithms
Here is an example using Random Forest, XGBoost, and LightGBM:
import numpy as np
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import RandomForestRegressor
from nested_cv import NestedCV

models_to_run = [RandomForestRegressor(), xgb.XGBRegressor(), lgb.LGBMRegressor()]
models_param_grid = [
    {  # 1st param grid, corresponding to RandomForestRegressor
        'max_depth': [3, None],
        'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
        'max_features': [50, 100, 150, 200]
    },
    {  # 2nd param grid, corresponding to XGBRegressor
        'learning_rate': [0.05],
        'colsample_bytree': np.linspace(0.3, 0.5),
        'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
        'reg_alpha': (1, 1.2),
        'reg_lambda': (1, 1.2, 1.4)
    },
    {  # 3rd param grid, corresponding to LGBMRegressor
        'learning_rate': [0.05],
        'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
        'reg_alpha': (1, 1.2),
        'reg_lambda': (1, 1.2, 1.4)
    }
]

for i, model in enumerate(models_to_run):
    nested_CV_search = NestedCV(model=model, params_grid=models_param_grid[i],
                                outer_kfolds=5, inner_kfolds=5,
                                cv_options={'sqrt_of_score': True, 'randomized_search_iter': 30})
    nested_CV_search.fit(X=X, y=y)
    model_param_grid = nested_CV_search.best_params
    print('\nCumulated best parameter grid was:\n{0}'.format(model_param_grid))
NestedCV Parameters
| Name | Type | Description |
| --- | --- | --- |
| model | estimator | An estimator that implements the scikit-learn estimator interface. |
| params_grid | dict | Dictionary of hyperparameters for the model. |
| outer_kfolds | int | Number of outer K-partitions in KFold. |
| inner_kfolds | int | Number of inner K-partitions in KFold. |
| cv_options | dict | See the next section. |
| n_jobs | int | Number of jobs for joblib to run (multiprocessing). |
cv_options
Available options:
metric
: Callable from sklearn.metrics, default = mean_squared_error
A scoring metric used to score each model
metric_score_indicator_lower
: boolean, default = True
Whether a lower score (True) or a higher score (False) is better for the chosen metric.
sqrt_of_score
: boolean, default = False
Whether to take the square root of the score (e.g. to turn MSE into RMSE)
randomized_search
: boolean, default = True
Whether to use sklearn's randomized search (True) or grid search (False)
randomized_search_iter
: int, default = 10
Number of iterations for randomized search
recursive_feature_elimination
: boolean, default = False
Whether to perform recursive feature elimination
predict_proba
: boolean, default = False
If True, predict class probabilities instead of class labels
multiclass_average
: string, default = 'binary'
For some classification metrics with a multiclass prediction, you need to specify an average other than 'binary'
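To illustrate why an average other than 'binary' is needed: scorers in sklearn.metrics such as f1_score default to average='binary' and raise an error on multiclass targets. A small standalone sketch (the toy labels below are made up for illustration and are not from this package):

```python
from sklearn.metrics import f1_score

# Toy multiclass labels (three classes: 0, 1, 2)
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

# average='binary' (the default) would raise a ValueError here,
# so a multiclass-capable average such as 'macro' must be specified
score = f1_score(y_true, y_pred, average='macro')
print(round(score, 3))  # → 0.822 (mean of per-class F1: 1.0, 0.667, 0.8)
```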
Returns
variance
: Model variance, computed by numpy.var()
outer_scores
: A list of the outer scores, one from each outer cross-validation loop
best_inner_score_list
: A list of best inner scores for each outer loop
best_params
: All best params from each inner loop, accumulated in a dict
best_inner_params_list
: Best inner params for each outer loop as an array of dictionaries
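As a rough illustration of the variance attribute (the scores below are hypothetical, not from a real run): it is simply numpy.var applied to the outer scores, so a small value means the outer folds agree with each other:

```python
import numpy as np

# Hypothetical RMSE scores from a 5-fold outer loop
outer_scores = [3.1, 2.9, 3.4, 3.0, 3.2]

# The variance attribute reported by NestedCV corresponds to numpy.var()
variance = np.var(outer_scores)
print(round(variance, 4))  # → 0.0296, i.e. the outer folds are fairly consistent
```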
How to use the output?
We suggest looking at the best hyperparameters together with the score for each outer loop. Look at how stable the model appears to be in a nested cross-validation setting. If the outer score changes a lot, this might indicate instability in your model. In that case, start over and build a new model.
After Nested Cross-Validation?
If the results from nested cross-validation are stable, run a normal cross-validation with the same procedure as in the nested cross-validation, i.e. if you used feature selection in the nested cross-validation, you should also use it in the normal cross-validation.
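A minimal sketch of such a follow-up search, assuming the nested results were stable (plain sklearn GridSearchCV on synthetic data; the reduced grid and dataset here are hypothetical, not prescribed by this package):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Stand-in data; in practice use the same X, y as in the nested run
X, y = make_regression(n_samples=100, n_features=10, random_state=0)

# Re-use the grid that proved stable in nested CV for one final, single-level search
final_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={'max_depth': [3, None], 'n_estimators': [50, 100]},
    cv=5,
    scoring='neg_root_mean_squared_error',  # matches the package's default RMSE scoring
)
final_search.fit(X, y)
print(final_search.best_params_)
```

The point of the two-step procedure is that the nested run estimates generalization error honestly, while this final single-level search only selects the hyperparameters for the model you actually deploy.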
Limitations
- XGBoost implements an early_stopping_rounds parameter, which cannot be used with this implementation. Other similar parameters might not work in combination with this implementation; the function would have to be adapted to support special parameters like that.
What did we learn?
- Using scikit-learn led to a faster implementation, since the scikit-learn community has implemented many functions that do much of the work.
- We learned about and applied this package in our main project on house price prediction.
Why use Nested Cross-Validation?
Controlling the bias-variance trade-off is an essential and important task in machine learning, as indicated by [Cawley and Talbot, 2010]. Many articles indicate that this is possible through nested cross-validation, one of them by [Varma and Simon, 2006]. Other interesting literature on nested cross-validation includes [Varoquaux et al., 2017] and [Krstajic et al., 2014].