This package extends the KernelExplainer of the shap package, which explains the output of any machine learning model, by taking dependencies between features into account.
Shapley values for correlated features
This package contains an extension of the shap package based on the paper 'Explaining individual predictions when features are dependent: More accurate approximations to Shapley values', which describes methods to approximate Shapley values more accurately when features in the dataset are correlated.
Installation
To install the package with pip, simply run
pip install corr_shap
Alternatively, you can download the corr_shap repository and create a conda environment with
conda env create -f environment.yml
Background
SHAP
SHAP (SHapley Additive exPlanations) is a method to explain the output of a machine learning model. It uses Shapley values from game theory to compute the contribution of each input feature to the output of the model, which helps users understand the factors that influence a model's decisions. Since the computational effort to calculate Shapley values grows exponentially with the number of features, approximation methods such as Kernel SHAP are needed. See the paper 'A Unified Approach to Interpreting Model Predictions' by Scott M. Lundberg and Su-In Lee for more details on Kernel SHAP, or their SHAP Git repository for the implementation.
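To make the exponential cost concrete, here is a minimal sketch of an exact Shapley value computation. It is not part of corr_shap or shap; the function names, the toy model and the all-zero baseline are assumptions chosen purely for illustration.

```python
from itertools import combinations
from math import factorial


def exact_shapley_values(value_fn, n_features):
    """Exact Shapley values for a coalition value function.

    value_fn maps a frozenset of feature indices to a number.
    Each feature requires visiting all 2**(n_features - 1) subsets
    of the remaining features, hence the exponential cost.
    """
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight |S|! * (M - |S| - 1)! / M!
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy model f(x) = 2*x0 + 3*x1*x2, explained at x = (1, 1, 1).
# Assumption: absent features are replaced by a baseline value of 0.
def toy_value(coalition):
    x = [1.0 if j in coalition else 0.0 for j in range(3)]
    return 2 * x[0] + 3 * x[1] * x[2]


phi = exact_shapley_values(toy_value, 3)
print(phi)  # the contributions sum to f(1, 1, 1) - f(0, 0, 0) = 5
```

In this toy case the additive feature receives its full effect, while the two interacting features share the interaction term equally. With three features the loop is cheap, but the number of subsets doubles with every additional feature.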
Correlated Explainer
One disadvantage of Kernel SHAP is that it assumes all features are independent. If the features are highly correlated, the results of Kernel SHAP can be inaccurate. Therefore, Kjersti Aas, Martin Jullum and Anders Løland propose an extension of Kernel SHAP in their paper 'Explaining individual predictions when features are dependent: More accurate approximations to Shapley values'. Instead of assuming feature independence, they use either a Gaussian distribution, a Gaussian copula distribution, an empirical conditional distribution, or a combination of the empirical distribution with one of the other two. This can produce more accurate results when features are dependent.
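The Gaussian variant can be illustrated in the simplest possible setting: two jointly Gaussian features, one of which is known. The sketch below uses the standard formula for the conditional distribution of a bivariate Gaussian. It is not the CorrExplainer implementation; the function names and the numbers are assumptions for illustration only.

```python
import random
from math import sqrt


def conditional_gaussian(mu_a, mu_b, sd_a, sd_b, rho, b_value):
    """Mean and standard deviation of A given B = b_value (bivariate Gaussian)."""
    cond_mean = mu_a + rho * (sd_a / sd_b) * (b_value - mu_b)
    cond_sd = sd_a * sqrt(1.0 - rho ** 2)
    return cond_mean, cond_sd


def sample_a_given_b(n_samples, mu_a, mu_b, sd_a, sd_b, rho, b_value, seed=0):
    """Draw values for the absent feature A, conditioned on the known feature B."""
    rng = random.Random(seed)
    cond_mean, cond_sd = conditional_gaussian(mu_a, mu_b, sd_a, sd_b, rho, b_value)
    return [rng.gauss(cond_mean, cond_sd) for _ in range(n_samples)]


# Standardized features (mean 0, standard deviation 1), known feature B = 2.0.
dependent = conditional_gaussian(0.0, 0.0, 1.0, 1.0, 0.8, 2.0)
independent = conditional_gaussian(0.0, 0.0, 1.0, 1.0, 0.0, 2.0)
samples = sample_a_given_b(1000, 0.0, 0.0, 1.0, 1.0, 0.8, 2.0)

print(dependent)    # approximately (1.6, 0.6): shifted towards B, less spread
print(independent)  # (0.0, 1.0): the marginal distribution, as under independence
```

With a correlation of zero the conditional distribution collapses to the marginal one, which corresponds to the independence assumption of Kernel SHAP. With a high correlation, the sampled values for the absent feature stay close to what is plausible given the known feature.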
Their proposed method is implemented in the 'CorrExplainer' class. Depending on the chosen sampling strategy, the CorrExplainer uses one of the distributions mentioned above. If the 'default' sampling strategy is chosen, it returns the same result as the Kernel Explainer while having a faster runtime. In our comparisons (with the data sets 'adult', 'linear independent 60' and 'diabetes'), the CorrExplainer was between 6 and 19 times faster than the Kernel Explainer. However, in its current implementation it is only suitable for explaining tabular data.
Examples
Explaining a single instance
Below is a code example that shows how to use the CorrExplainer to explain a single instance of the 'adult' dataset and display the result in a bar plot.
```python
from sklearn import linear_model
from sklearn.model_selection import train_test_split
import shap
from corr_shap.CorrExplainer import CorrExplainer

# load data
x, y = shap.datasets.adult()

# train model
x_training_data, x_test_data, y_training_data, y_test_data \
    = train_test_split(x, y, test_size=0.2, random_state=0)
model = linear_model.LinearRegression()
model.fit(x_training_data, y_training_data)

# create explanation object with CorrExplainer
explainer = CorrExplainer(model.predict, x_training_data, sampling="default")
explanation = explainer(x_test_data[:1])
shap.plots.bar(explanation)
```
Explaining full 'adult' dataset
To get a sense of which features are most important across the whole dataset, and not just for a single instance, the SHAP values for each feature and each sample can be visualized in the same plot. See example code here.
Credit default data
Another example uses a credit default dataset from the rivapy package, in which the features 'income' and 'savings' are highly correlated and the model ignores the 'savings' feature. It can be found here.
Bar plot explaining a single instance:
Summary plot explaining multiple samples:
Further examples can be found in the examples folder.
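The effect seen in the credit default example can also be reproduced in closed form on a toy problem. The sketch below does not use the rivapy data or the CorrExplainer. It assumes a hypothetical linear model f(x) = beta * income that ignores 'savings', standardized jointly Gaussian features with correlation rho, and the conditional expectation E[f(x) | x_S] as the value of a coalition S. All names and numbers are made up for illustration.

```python
def two_feature_shapley(v_empty, v_income, v_savings, v_both):
    """Exact Shapley values for a game with the two players 'income' and 'savings'."""
    phi_income = 0.5 * (v_income - v_empty) + 0.5 * (v_both - v_savings)
    phi_savings = 0.5 * (v_savings - v_empty) + 0.5 * (v_both - v_income)
    return phi_income, phi_savings


def coalition_values(beta, rho, income, savings):
    """Coalition values v(S) = E[f(x) | x_S] for f(x) = beta * income.

    Assumption: both features are standardized (mean 0, variance 1) and
    jointly Gaussian, so E[income | savings] = rho * savings.
    """
    return {
        "v_empty": 0.0,                     # E[beta * income] = 0
        "v_income": beta * income,          # income known, savings irrelevant to f
        "v_savings": beta * rho * savings,  # income inferred from savings
        "v_both": beta * income,            # full prediction
    }


# rho = 0.0 mimics the independence assumption, rho = 0.9 a strong dependence.
independent = two_feature_shapley(**coalition_values(1.0, 0.0, 1.0, 1.0))
dependent = two_feature_shapley(**coalition_values(1.0, 0.9, 1.0, 1.0))

print(independent)  # 'savings' gets no credit
print(dependent)    # part of the credit moves from 'income' to 'savings'
```

In both cases the two values sum to the prediction. Under independence the ignored feature receives a contribution of zero, whereas with dependence it receives a share because knowing 'savings' carries information about 'income'.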
References
- 'A Unified Approach to Interpreting Model Predictions', Scott M. Lundberg and Su-In Lee
- 'Explaining individual predictions when features are dependent: More accurate approximations to Shapley values', Kjersti Aas, Martin Jullum and Anders Løland
- shap package
- rivapy package