A SHAP Waterfall Chart for interpreting local differences between observations

Install

Using pip (recommended)

pip install shapwaterfall

Introduction

When VMware Data Science teams present propensity-to-buy scores (estimated probabilities) from their machine learning models, stakeholders often ask why one customer's propensity to buy is higher than another's. That question was our primary motivation.

We were also concerned with recent algorithm-transparency language in the EU's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Although the 'right to explanation' is not clearly defined, we want to act in good faith by providing local explainability and interpretability between two references (observations, clients, or customers).

This chart provides local interpretability for a classification model by comparing two observations, which we internally call customers. It takes each customer's estimated probability and fills the gap between the two probabilities with SHAP values, ordered from most to least important. We prefer SHAP over alternatives such as LIME because of its solid theoretical foundation and its ability to distribute effects fairly across features.
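The idea of filling the gap between two probabilities can be illustrated with a small sketch. This is not the package's implementation: the function name `waterfall_steps`, the feature names, and all numbers are hypothetical, and it assumes SHAP values expressed in probability units so that each prediction equals a shared base value plus the sum of its SHAP values.

```python
def waterfall_steps(base_value, shap_a, shap_b, feature_names, num_features):
    """Split the gap between two predictions into per-feature steps (illustrative only)."""
    # Predictions implied by additivity: base value plus all contributions.
    pred_a = base_value + sum(shap_a)
    pred_b = base_value + sum(shap_b)
    # Per-feature difference in contribution between the two observations.
    deltas = [(name, b - a) for name, a, b in zip(feature_names, shap_a, shap_b)]
    # Largest absolute differences first, i.e. most important features first.
    deltas.sort(key=lambda item: abs(item[1]), reverse=True)
    top = deltas[:num_features]
    # Everything outside the top features is lumped into one residual step.
    other = sum(delta for _, delta in deltas[num_features:])
    steps = [("observation A", pred_a)] + top + [("other features", other)]
    return steps, pred_b


# Hypothetical SHAP values for two observations over four features.
names = ["f0", "f1", "f2", "f3"]
steps, pred_b = waterfall_steps(
    0.5, [0.10, -0.05, 0.02, 0.01], [-0.20, 0.15, 0.03, -0.01], names, 2
)
for label, value in steps:
    print(f"{label:>15}: {value:+.2f}")
print(f"{'observation B':>15}: {pred_b:+.2f}")
```

The steps start at observation A's probability and, once summed, land on observation B's probability, which is the property a waterfall chart visualizes.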

Update: the package now works for all classification models, because we added the Kernel Explainer.

The package requires a fitted classifier, training data, validation/test/scoring data, the two observations of interest (by row index), and the desired number of important features. It produces a waterfall chart.

Command

shapwaterfall(clf, X_tng, X_val, index1, index2, num_features)

Required

  • clf: a classifier fitted to the training data, X_tng. The package was originally written for tree-based classifiers.
  • X_tng: the training Data Frame used to fit the model.
  • X_val: the validation, test, or scoring Data Frame under observation. Note that the Data Frame must contain an extra column whose label is 'Reference'.
  • index1 and index2: the index numbers of the first and second observations to compare.
  • num_features: the number of important features that describe the local interpretability between the two observations.
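One way to attach the required 'Reference' column is sketched below. How the package consumes the column is not described here, so treat this as an assumption: the helper `with_reference` is hypothetical and simply adds a unique identifier per row, defaulting to the row position.

```python
import pandas as pd


def with_reference(X, reference=None):
    """Return a copy of X with a 'Reference' column appended (hypothetical helper)."""
    out = X.copy()
    # Assumption: one unique identifier per row; default to the row position.
    out["Reference"] = list(range(len(out))) if reference is None else list(reference)
    if out["Reference"].duplicated().any():
        raise ValueError("Reference values must be unique so each identifies one row")
    return out


# Tiny made-up frame standing in for real validation data.
X_val = pd.DataFrame({"mean radius": [14.1, 20.6], "mean texture": [19.3, 29.3]})
X_val = with_reference(X_val)
print(X_val.columns.tolist())
```

Copying the frame first keeps the original data untouched, which matters if the same frame is also used to score the model.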

Dependent Packages

The shapwaterfall package requires the following Python packages, shown here as they are imported:

import pandas as pd
import numpy as np
import shap
import matplotlib.pyplot as plt
import waterfall_chart
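Before calling the package, it can help to confirm that these modules are importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the list contains import names, which may differ from the names used to install the packages.

```python
from importlib.util import find_spec

# Import names taken from the list above.
REQUIRED = ["pandas", "numpy", "shap", "matplotlib", "waterfall_chart"]


def missing_modules(names=REQUIRED):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


print("Missing modules:", missing_modules())
```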

Examples

Random Forest on WI Breast Cancer Data

# Scikit-Learn WI Breast Cancer Data Example
# packages
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import shap
import matplotlib.pyplot as plt
import waterfall_chart
from shapwaterfall import shapwaterfall

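# models
# max_features="sqrt" matches what the former "auto" setting meant for classifiers; "auto" was removed in scikit-learn 1.3
rf_clf = RandomForestClassifier(n_estimators=1666, max_features="sqrt", min_samples_split=2, min_samples_leaf=2, max_depth=20, bootstrap=True, n_jobs=1)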

# load and organize Wisconsin Breast Cancer Data
data = load_breast_cancer()
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

# data splits
X_tng, X_val, y_tng, y_val = train_test_split(features, labels, test_size=0.33, random_state=42)

print(X_tng.shape) # (381, 30)
print(X_val.shape) # (188, 30)

X_tng = pd.DataFrame(X_tng)
X_tng.columns = feature_names
X_val = pd.DataFrame(X_val)
X_val.columns = feature_names

# fit classifiers and measure AUC
clf = rf_clf.fit(X_tng, y_tng)
pred_rf = clf.predict_proba(X_val)
score_rf = roc_auc_score(y_val,pred_rf[:,1])
print(score_rf, 'Random Forest AUC')

# 0.9951893425434809 Random Forest AUC

# Use Case 1
shapwaterfall(clf, X_tng, X_val, 5, 100, 5)
shapwaterfall(clf, X_tng, X_val, 100, 5, 7)

# Use Case 2
shapwaterfall(clf, X_tng, X_val, 36, 94, 5)
shapwaterfall(clf, X_tng, X_val, 94, 36, 7)
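To see the two endpoints that a chart like this connects, the predicted probabilities of the two observations can be inspected directly. The sketch below is independent of shapwaterfall: the helper `probability_gap` is hypothetical, the forest is smaller than the one above to keep it quick, and it assumes that 5 and 100 are positional row numbers in the validation data. If the package instead looks the numbers up in the 'Reference' column, select the rows that way.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the data as a Data Frame so the columns keep their feature names.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tng, X_val, y_tng, y_val = train_test_split(X, y, test_size=0.33, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tng, y_tng)


def probability_gap(clf, X, pos1, pos2):
    """Return both positive-class probabilities and their difference (second minus first)."""
    p1, p2 = clf.predict_proba(X.iloc[[pos1, pos2]])[:, 1]
    return p1, p2, p2 - p1


p1, p2, gap = probability_gap(clf, X_val, 5, 100)
print(f"row 5: {p1:.3f}, row 100: {p2:.3f}, gap: {gap:+.3f}")
```

Swapping the two positions flips the sign of the gap, which is why each use case above is shown in both orders.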

Authors

John Halstead, jhalstead@vmware.com

Rajesh Vikraman, rvikraman@vmware.com

Ravi Prasad K, rkondapalli@vmware.com
