
vimpy: nonparametric variable importance assessment in python

License: MIT

Author: Brian Williamson

Introduction

In predictive modeling applications, it is often of interest to determine the relative contribution of subsets of features in explaining an outcome; this is often called variable importance. It is useful to consider variable importance as a function of the unknown, underlying data-generating mechanism rather than the specific predictive algorithm used to fit the data. This package provides functions that, given fitted values from predictive algorithms, compute nonparametric estimates of deviance- and variance-based variable importance, along with asymptotically valid confidence intervals for the true importance.
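As a rough sketch (an illustration of the general idea rather than the package's exact definition; the symbols below are ours), the variance-based importance of a feature subset s can be written as

\[ \psi_{0,s} = \frac{E_0\left[\{f_0(X) - f_{0,s}(X)\}^2\right]}{\mathrm{Var}_0(Y)}, \]

where f_0 denotes the true conditional mean of the outcome Y given the full feature vector X, and f_{0,s} denotes the conditional mean given the features with the subset s removed. Because this parameter depends only on the data-generating mechanism, any sufficiently flexible regression procedure may be used to estimate f_0 and f_{0,s}; vimpy then applies a one-step bias correction to the resulting plug-in estimator, which is what yields the asymptotically valid confidence intervals.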

Installation

You may install a stable release of vimpy using pip by running pip install vimpy from a Terminal window. Alternatively, you may install within a virtualenv environment.

You may install the current dev release of vimpy by downloading this repository directly.

Issues

If you encounter any bugs or have any specific feature requests, please file an issue.

Example

This example shows how to use vimpy in a simple setting, with simulated data and a single regression procedure. For more examples and a detailed explanation, please see the vignette for the companion R package, vimp (to come).

## load required libraries
import numpy as np
import vimpy
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

## -------------------------------------------------------------
## problem setup
## -------------------------------------------------------------
## define a function for the conditional mean of Y given X
def cond_mean(x=None):
    ## true conditional mean of Y given X; only features 1, 2, 3, 6, 7 and 11
    ## (columns 0, 1, 2, 5, 6 and 10 of x) contribute to the outcome
    f1 = np.where(np.logical_and(-2 <= x[:, 0], x[:, 0] < 2), np.floor(x[:, 0]), 0)
    f2 = np.where(x[:, 1] <= 0, 1, 0)
    f3 = np.where(x[:, 2] > 0, 1, 0)
    f6 = np.absolute(x[:, 5] / 4) ** 3
    f7 = np.absolute(x[:, 6] / 4) ** 5
    f11 = (7. / 3) * np.cos(x[:, 10] / 2)
    return f1 + f2 + f3 + f6 + f7 + f11

## create data
np.random.seed(4747)
n = 100
p = 15
s = 1 # index of the feature of interest (0-based, i.e., the second column of x)
x = np.zeros((n, p))
for i in range(p):
    x[:, i] = np.random.normal(0, 2, n)

y = cond_mean(x) + np.random.normal(0, 1, n)
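
## optional sanity check (not part of the package API): the noise variance is
## 1 by construction, so the residual variance around the true conditional
## mean should be close to 1
signal = cond_mean(x)
print(np.var(signal))      # variance explained by the true conditional mean
print(np.var(y - signal))  # should be near the noise variance of 1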

## -------------------------------------------------------------
## preliminary step: get regression estimators
## -------------------------------------------------------------
## use grid search to get optimal number of trees and learning rate
ntrees = np.arange(100, 3500, 500)
lr = np.arange(.01, .5, .05)

param_grid = [{'n_estimators':ntrees, 'learning_rate':lr}]

## set up cv objects
cv_full = GridSearchCV(GradientBoostingRegressor(loss = 'ls', max_depth = 1), param_grid = param_grid, cv = 5)
cv_small = GridSearchCV(GradientBoostingRegressor(loss = 'ls', max_depth = 1), param_grid = param_grid, cv = 5)
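
## note: this example was written against older scikit-learn, where the
## squared-error loss was named 'ls'; scikit-learn 1.0 deprecated that name
## (and 1.2 removed it) in favor of 'squared_error'. A version-agnostic
## setup might look like the following sketch:
import sklearn
loss_name = 'squared_error' if int(sklearn.__version__.split('.')[0]) >= 1 else 'ls'
cv_full = GridSearchCV(GradientBoostingRegressor(loss = loss_name, max_depth = 1), param_grid = param_grid, cv = 5)
cv_small = GridSearchCV(GradientBoostingRegressor(loss = loss_name, max_depth = 1), param_grid = param_grid, cv = 5)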

## fit the full regression
cv_full.fit(x, y)
full_fit = cv_full.best_estimator_.predict(x)

## fit the reduced regression: regress the fitted values from the full
## regression (not y itself) onto the covariates with the feature(s) in s removed
x_small = np.delete(x, s, 1) # delete the column(s) in s
cv_small.fit(x_small, full_fit)
small_fit = cv_small.best_estimator_.predict(x_small)

## -------------------------------------------------------------
## get variable importance estimates
## -------------------------------------------------------------
## set up the vimp object
vimp = vimpy.vimp_regression(y, x, full_fit, small_fit, s)
## get the naive estimator
vimp.plugin()
## get the corrected estimator
vimp.update()
vimp.onestep_based_estimator()
## get a standard error
vimp.onestep_based_se()
## get a confidence interval
vimp.get_ci()
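
## inspect the results; the attribute names below are an assumption based on
## later vimpy releases -- check dir(vimp) for the exact names in your version
print(vimp.vimp_)  # point estimate of the importance (assumed attribute name)
print(vimp.se_)    # standard error (assumed attribute name)
print(vimp.ci_)    # confidence interval (assumed attribute name)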

## -------------------------------------------------------------
## get variable importance estimates using cross-validation
## -------------------------------------------------------------
## set up V cross-validation folds, assigned uniformly at random (one simple choice)
V = 5
folds = np.random.choice(a = np.arange(V), size = n, replace = True)

full_fits = [None]*V
small_fits = [None]*V
for v in range(V):
    cv_full.fit(x[folds == v, :], y[folds == v])
    full_fits[v] = cv_full.best_estimator_.predict(x[folds == v, :])
    x_small = np.delete(x[folds == v, :], s, 1) # delete the column(s) in s
    cv_small.fit(x_small, full_fits[v])
    small_fits[v] = cv_small.best_estimator_.predict(x_small)

## set up the outcome and vimp object
ys = [y[folds == v] for v in range(V)]
vimp_cv = vimpy.cv_vim(ys, x, full_fits, small_fits, V, folds, "regression", s)
## get the naive estimator
vimp_cv.plugin()
## get the corrected estimator
vimp_cv.update()
vimp_cv.onestep_based_estimator()
## get a standard error
vimp_cv.onestep_based_se()
## get a confidence interval
vimp_cv.get_ci()
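
## as above, inspect the cross-validated results (assumed attribute names;
## check dir(vimp_cv) for the exact names in your version)
print(vimp_cv.vimp_, vimp_cv.se_, vimp_cv.ci_)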

