Compute Friedman and Popescu's H statistics, in order to look for interactions among variables in scikit-learn gradient-boosting models.
Project description
This distribution provides a Python module for computing Friedman and Popescu’s H statistics, in order to look for interactions among variables in scikit-learn gradient-boosting models (http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting).
See Jerome H. Friedman and Bogdan E. Popescu, 2008, “Predictive learning via rule ensembles”, Ann. Appl. Stat. 2:916-954, http://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908046, Section 8.1.
Installation
pip install sklearn-gbmi
Usage
Suppose gbm is a scikit-learn gradient-boosting model that has been fitted to array_or_frame, a NumPy array or pandas data frame, and indices_or_columns is a list of column indices of the array or columns of the data frame. The H statistic of the variables specified by indices_or_columns can then be computed via
from sklearn_gbmi import h
h(gbm, array_or_frame, indices_or_columns)
Alternatively, the two-variable H statistic of every pair of variables specified by indices_or_columns can be computed via
from sklearn_gbmi import h_all_pairs
h_all_pairs(gbm, array_or_frame, indices_or_columns)
(Compared to iteratively calling h, calling h_all_pairs avoids redundant computations.)
indices_or_columns is optional, with default value 'all'. If it is 'all', then all columns of array_or_frame are used.
NaN is returned if a computation is spoiled by weak main effects and rounding errors.
H varies from 0 to 1. The larger H, the stronger the evidence for an interaction among the variables.
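To make the quantity concrete, here is a minimal brute-force sketch of the two-variable H statistic as defined in Friedman and Popescu, Section 8.1: H^2 is the sum over data rows of (F_jk - F_j - F_k)^2 divided by the sum of F_jk^2, where F_j, F_k and F_jk are partial dependence functions centered to have mean zero. This is not the module's implementation. The helper names partial_dependence and h_two_way are hypothetical, and predict stands in for any function that maps an array of rows to predictions (for example, the predict method of a fitted model).

```python
import numpy as np


def partial_dependence(predict, X, cols):
    """Centered partial dependence of predict on cols, evaluated at each row of X."""
    X = np.asarray(X, dtype=float)
    values = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        X_mod = X.copy()
        # Fix the chosen columns at row i's values; average over the other columns.
        X_mod[:, cols] = X[i, cols]
        values[i] = predict(X_mod).mean()
    return values - values.mean()


def h_two_way(predict, X, j, k):
    """Brute-force two-variable H statistic for columns j and k (illustrative only)."""
    f_jk = partial_dependence(predict, X, [j, k])
    f_j = partial_dependence(predict, X, [j])
    f_k = partial_dependence(predict, X, [k])
    numerator = np.sum((f_jk - f_j - f_k) ** 2)
    denominator = np.sum(f_jk ** 2)
    if denominator == 0.0:
        # No joint effect at all, so the ratio is undefined.
        return float("nan")
    return float(np.sqrt(numerator / denominator))


# Synthetic data: three independent standard-normal columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))

# Purely additive function of columns 0 and 1: no interaction expected.
h_additive = h_two_way(lambda X: X[:, 0] + X[:, 1] + X[:, 2], X_demo, 0, 1)

# Pure product of columns 0 and 1: strong interaction expected.
h_product = h_two_way(lambda X: X[:, 0] * X[:, 1], X_demo, 0, 1)
```

In this sketch, h_additive should come out essentially zero and h_product close to 1. On finite samples the ratio can drift slightly away from the ideal range, which is one way rounding and weak main effects can spoil a computation.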
Example
See the Jupyter notebook example.ipynb (https://github.com/ralphhaygood/sklearn-gbmi/blob/master/example.ipynb) for a complete example of how to use the module.
Notes
1. Per Friedman and Popescu, only variables with strong main effects should be examined for interactions. Strengths of main effects are available as gbm.feature_importances_ once gbm has been fitted.
2. Per Friedman and Popescu, collinearity among variables can lead to interactions in gbm that are not present in the target function. To forestall such spurious interactions, check for strong correlations among variables before fitting gbm.
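The correlation check suggested in note 2 can be sketched as follows, using only NumPy. The helper name strongly_correlated_pairs and the 0.8 threshold are assumptions made for illustration; neither comes from the module or from Friedman and Popescu, and a suitable threshold depends on the data.

```python
import numpy as np


def strongly_correlated_pairs(X, threshold=0.8):
    """Return (j, k, r) for column pairs whose Pearson correlation has |r| >= threshold."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)  # columns are variables
    n_cols = corr.shape[0]
    pairs = []
    for j in range(n_cols):
        for k in range(j + 1, n_cols):
            if abs(corr[j, k]) >= threshold:
                pairs.append((j, k, float(corr[j, k])))
    return pairs


# Synthetic data: column 1 is a noisy copy of column 0; column 2 is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X_demo = np.column_stack([a, a + 0.1 * rng.normal(size=500), b])

flagged = strongly_correlated_pairs(X_demo)
```

Here flagged should contain only the pair of columns 0 and 1. Pairs flagged this way are candidates for dropping or combining before fitting gbm. Pearson correlation only detects linear dependence, so this is a first check rather than a complete test for collinearity.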