This package extends the KernelExplainer of the shap package, which explains the output of any machine learning model, so that dependencies between features are taken into account.

Shapley values for correlated features

This package contains an extension of the shap package based on the paper 'Explaining individual predictions when features are dependent: More accurate approximations to Shapley values', which describes methods to approximate Shapley values more accurately when features in the dataset are correlated.

Installation

To install the package with pip, simply run

pip install corr-shap

Alternatively, you can download the corr_shap repository and create a conda environment with

conda env create -f environment.yml

Background

SHAP

SHAP (SHapley Additive exPlanations) is a method to explain the output of a machine learning model. It uses Shapley values from game theory to compute the contribution of each input feature to the output of the model, which helps users understand the factors influencing a model's decision-making process. Since the computational effort to calculate exact Shapley values grows exponentially with the number of features, approximation methods such as Kernel SHAP are needed. See the paper 'A Unified Approach to Interpreting Model Predictions' by Scott M. Lundberg and Su-In Lee for more details on Kernel SHAP or their SHAP git repo for the implementation.
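To make the exponential cost concrete, here is a minimal sketch of an exact Shapley computation that enumerates every coalition of features. It is an illustration only, not the code used by shap or corr_shap; the function name exact_shapley and the toy additive game are assumptions introduced for this example.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value_fn, n_features):
    """Exact Shapley values for a game with n_features players.

    value_fn takes a frozenset of feature indices and returns the payoff v(S).
    """
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Shapley weight |S|! * (M - |S| - 1)! / M!
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of feature i when it joins the coalition
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi


# Hypothetical additive game: each feature adds a fixed amount to the payoff.
contributions = [1.0, 2.0, 3.0]


def toy_value(coalition):
    return sum(contributions[j] for j in coalition)


phi = exact_shapley(toy_value, len(contributions))
print(phi)  # in an additive game each value matches the feature's own contribution
print(2 ** len(contributions), "distinct coalitions for", len(contributions), "features")
```

The number of coalitions doubles with every additional feature, which is why sampling-based approximations such as Kernel SHAP are used in practice.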

Correlated Explainer

One disadvantage of Kernel SHAP is that it assumes all features are independent. If the features are highly correlated, the results of Kernel SHAP can be inaccurate. Therefore, Kjersti Aas, Martin Jullum and Anders Løland propose an extension of Kernel SHAP in their paper 'Explaining individual predictions when features are dependent: More accurate approximations to Shapley values'. Instead of assuming feature independence, they use either a Gaussian distribution, a Gaussian copula distribution, an empirical conditional distribution, or a combination of the empirical distribution with one of the other two. This can produce more accurate results when features are dependent.
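As a rough illustration of the Gaussian variant, the sketch below uses the closed-form conditional distribution of a bivariate Gaussian to draw an unknown feature given a known one, next to draws from the marginal distribution that an independence assumption would use. The helper name conditional_gaussian and all numbers are hypothetical; this is not the CorrExplainer implementation, which has to handle more than two features.

```python
import math
import random


def conditional_gaussian(mu_known, mu_unknown, sd_known, sd_unknown, rho, x_known):
    """Mean and standard deviation of the unknown feature given the known one,
    for a bivariate Gaussian with correlation rho."""
    cond_mean = mu_unknown + rho * (sd_unknown / sd_known) * (x_known - mu_known)
    cond_sd = sd_unknown * math.sqrt(1.0 - rho ** 2)
    return cond_mean, cond_sd


# Hypothetical parameters: two features with correlation 0.8.
cond_mean, cond_sd = conditional_gaussian(
    mu_known=0.0, mu_unknown=0.0, sd_known=1.0, sd_unknown=2.0, rho=0.8, x_known=1.5
)

rng = random.Random(0)
# Independence assumption: ignore the known feature, sample from the marginal.
marginal_samples = [rng.gauss(0.0, 2.0) for _ in range(5)]
# Dependence-aware: sample from the conditional distribution instead.
conditional_samples = [rng.gauss(cond_mean, cond_sd) for _ in range(5)]

print("conditional mean and sd:", cond_mean, cond_sd)
print("marginal samples:   ", marginal_samples)
print("conditional samples:", conditional_samples)
```

With a high correlation the conditional distribution is shifted toward values that are plausible for the observed instance and is narrower than the marginal, so the model is evaluated on more realistic feature combinations.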

Their proposed method is implemented in the 'CorrExplainer' class. Depending on the chosen sampling strategy, the CorrExplainer uses one of the distributions mentioned above. With the 'default' sampling strategy it returns the same result as the Kernel Explainer, but with a faster runtime. In our comparisons (with the data sets 'adult', 'linear independent 60' and 'diabetes') the CorrExplainer was between 6 and 19 times faster than the Kernel Explainer. However, the current implementation is only suitable for explaining tabular data.

Examples

Explaining a single instance

Below is a code example that shows how to use the CorrExplainer to explain a single instance of the 'adult' dataset and display the result in a bar plot.

from sklearn import linear_model
from sklearn.model_selection import train_test_split
import shap
from corr_shap.CorrExplainer import CorrExplainer

# load data
x, y = shap.datasets.adult()

# train model
x_training_data, x_test_data, y_training_data, y_test_data \
    = train_test_split(x, y, test_size=0.2, random_state=0)
model = linear_model.LinearRegression()
model.fit(x_training_data, y_training_data)

# create explanation object with CorrExplainer
explainer = CorrExplainer(model.predict, x_training_data, sampling="default")
explanation = explainer(x_test_data[:1])

shap.plots.bar(explanation)

Explaining full 'adult' dataset

To get a sense of which features are most important across the whole dataset rather than for a single instance, the SHAP values for each feature and each sample can be visualized in the same plot. See example code here.
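One common way to condense per-sample SHAP values into a dataset-level ranking is the mean absolute SHAP value per feature. The sketch below shows that aggregation in plain Python; the function name mean_abs_importance, the feature labels and the numbers are made up for illustration and are not taken from the 'adult' results.

```python
def mean_abs_importance(shap_values, feature_names):
    """Rank features by their mean absolute SHAP value across all samples.

    shap_values is a list of rows, one row of per-feature SHAP values per sample.
    """
    n_samples = len(shap_values)
    totals = [0.0] * len(feature_names)
    for row in shap_values:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    ranking = [(name, total / n_samples) for name, total in zip(feature_names, totals)]
    # Most important feature first
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values for three samples and three features.
feature_names = ["feature_a", "feature_b", "feature_c"]
shap_values = [
    [0.2, -0.5, 0.1],
    [-0.4, 0.3, 0.0],
    [0.6, -0.1, -0.1],
]

ranking = mean_abs_importance(shap_values, feature_names)
for name, importance in ranking:
    print(name, round(importance, 3))
```

Taking absolute values first keeps positive and negative contributions from cancelling each other out across samples.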

Credit default data

Another example uses a credit default dataset from the rivapy package, in which the features 'income' and 'savings' are highly correlated and the model ignores the 'savings' feature. It can be found here.

Figure: bar plot explaining a single instance.

Figure: summary plot explaining multiple samples.
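The effect of correlation in the credit default example can be sketched with a two-feature toy calculation. It assumes standardized Gaussian features with a hypothetical correlation of 0.8 and a model that returns the income value and ignores savings. None of the numbers come from the rivapy dataset, and the expectations are computed analytically rather than by sampling, so this only mirrors the idea and not the CorrExplainer code.

```python
def shapley_two_features(v_empty, v_income, v_savings, v_both):
    """Exact Shapley values for two features from the four coalition values."""
    phi_income = 0.5 * (v_income - v_empty) + 0.5 * (v_both - v_savings)
    phi_savings = 0.5 * (v_savings - v_empty) + 0.5 * (v_both - v_income)
    return phi_income, phi_savings


def model(income_value, savings_value):
    # Hypothetical model that ignores the savings feature entirely
    return income_value


rho = 0.8                  # assumed correlation between income and savings
income, savings = 1.0, 1.2  # standardized values of the instance to explain
prediction = model(income, savings)

# Independence assumption: knowing savings says nothing about income,
# so the expected prediction given only savings is the mean prediction (0).
independent = shapley_two_features(
    v_empty=0.0, v_income=income, v_savings=0.0, v_both=prediction
)

# Gaussian dependence: E[income | savings] = rho * savings for standardized
# features, so knowing savings already shifts the expected prediction.
dependent = shapley_two_features(
    v_empty=0.0, v_income=income, v_savings=rho * savings, v_both=prediction
)

print("independent (income, savings):", independent)
print("dependent   (income, savings):", dependent)
```

Under the independence assumption the ignored feature receives no attribution, while the dependence-aware value function shifts part of the attribution to 'savings' because it carries information about 'income'. In both cases the two values add up to the prediction minus the mean prediction.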

Further examples can be found in the examples folder.
