Heuristic for quick feature selection for tabular regression/classification using Shapley values

Overview

shap-select implements a heuristic for fast feature selection for tabular regression and classification models.

The basic idea is to run a linear or logistic regression of the target on the Shapley values of the original features, on the validation set, then discard the features with negative coefficients and rank/filter the rest according to their statistical significance. For motivation and details, refer to our research paper; for a worked example, see the example notebook.
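As a rough illustration of this idea (a sketch, not the package's actual implementation), the snippet below uses only NumPy. It assumes a plain linear model, for which the Shapley value of feature i is w_i * (x_i - mean(x_i)) when features are independent, and it approximates p-values with a normal distribution rather than the t distribution. The synthetic data and all names (`make_data`, `ols`, `selected`) are hypothetical.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Four features; only the first two influence the target.
    X = rng.normal(size=(n, 4))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
    return X, y

def ols(X, y):
    """OLS of y on [1, X]; returns coefficients and t-values (intercept first)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (A.shape[0] - A.shape[1])
    cov = sigma2 * np.linalg.inv(A.T @ A)
    return beta, beta / np.sqrt(np.diag(cov))

X_train, y_train = make_data(2000)
X_val, y_val = make_data(2000)

# 1. Fit the original model once, on the training set.
w, _ = ols(X_train, y_train)

# 2. Shapley values on the validation set (exact for a linear model
#    under the feature-independence assumption stated above).
shap_val = w[1:] * (X_val - X_train.mean(axis=0))

# 3. Regress the validation target on the Shapley values.
coef, t = ols(shap_val, y_val)
coef, t = coef[1:], t[1:]  # drop the intercept

# 4. Two-sided p-values (normal approximation), then the selection rule:
#    -1 for a negative coefficient, 1 if significant, 0 otherwise.
p = np.array([math.erfc(abs(ti) / math.sqrt(2)) for ti in t])
threshold = 0.05
selected = np.where(coef < 0, -1, (p < threshold).astype(int))

print(np.column_stack([coef, t, p, selected]))
```

For a useful feature the coefficient on its Shapley values should be close to 1, since the Shapley values already carry the model's estimate of that feature's contribution.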

Earlier packages that use Shapley values for feature selection exist; the advantages of this one are:

  • Regression on the validation set to combat overfitting
  • Only a single fit of the original model needed
  • A single intuitive hyperparameter for feature selection: statistical significance
  • Bonferroni correction for multiclass classification
  • Collinearity of the (Shapley value) features addressed by repeated (linear/logistic) regression
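To make the Bonferroni point concrete, here is a minimal sketch of one common way to combine per-class results in the multiclass case. It assumes each feature has one p-value per class, and that the corrected value is the smallest of them multiplied by the number of classes and capped at 1. This is an illustration, not necessarily how the package combines them. The function name and the p-values are made up.

```python
def bonferroni_across_classes(p_by_class):
    """Combine per-class p-values into one corrected p-value per feature.

    p_by_class maps a feature name to its list of per-class p-values.
    """
    return {
        feature: min(1.0, min(ps) * len(ps))
        for feature, ps in p_by_class.items()
    }

# Hypothetical p-values for a 3-class problem.
p_by_class = {"x1": [0.001, 0.20, 0.50], "x2": [0.03, 0.04, 0.60]}
corrected = bonferroni_across_classes(p_by_class)

threshold = 0.05
significant = [f for f, p in corrected.items() if p < threshold]
print(corrected, significant)
```

Without the correction, x2 would pass the 0.05 threshold on the strength of its best class alone. Multiplying by the number of classes compensates for having tested each feature several times.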

Usage

from shap_select import shap_select
# Here model is any model supported by the shap library, fitted on a different (train) dataset
# Task can be regression, binary, or multiclass
selected_features_df = shap_select(model, X_val, y_val, task="multiclass", threshold=0.05)

   feature name    t-value  stat.significance  coefficient  selected
0            x5  20.211299           0.000000     1.052030         1
1            x4  18.315144           0.000000     0.952416         1
2            x3   6.835690           0.000000     1.098154         1
3            x2   6.457140           0.000000     1.044842         1
4            x1   5.530556           0.000000     0.917242         1
5            x6   2.390868           0.016827     1.497983         1
6            x7   0.901098           0.367558     2.865508         0
7            x8   0.563214           0.573302     1.933632         0
8            x9  -1.607814           0.107908    -4.537098        -1
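
Judging from the output above, `selected` appears to be 1 for features that pass the significance threshold, 0 for features that do not, and -1 for features with a negative coefficient. Under that reading, the kept features can be extracted as sketched below. The dataframe here is rebuilt by hand from the example output so that the snippet runs on its own; in practice you would use the dataframe returned by `shap_select` directly.

```python
import pandas as pd

# Stand-in for the dataframe returned by shap_select (values copied
# from the example output; only the columns needed here are included).
selected_features_df = pd.DataFrame(
    {
        "feature name": ["x5", "x4", "x3", "x2", "x1", "x6", "x7", "x8", "x9"],
        "stat.significance": [0.0, 0.0, 0.0, 0.0, 0.0,
                              0.016827, 0.367558, 0.573302, 0.107908],
        "selected": [1, 1, 1, 1, 1, 1, 0, 0, -1],
    }
)

# Keep only the features flagged as selected.
mask = selected_features_df["selected"] == 1
keep = selected_features_df.loc[mask, "feature name"].tolist()
print(keep)

# With real data you could then restrict the feature matrices, e.g.:
# X_train_reduced = X_train[keep]
# X_val_reduced = X_val[keep]
```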

Citation

If you use shap-select in your research, please cite our paper:

@misc{kraev2024shapselectlightweightfeatureselection,
      title={Shap-Select: Lightweight Feature Selection Using SHAP Values and Regression}, 
      author={Egor Kraev and Baran Koseoglu and Luca Traverso and Mohammed Topiwalla},
      year={2024},
      eprint={2410.06815},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.06815}, 
}
