
Feature selection using statistical significance of shap values with powershap-statistical-analysis-rs

Project description

PowerShap logo


powershap is a feature selection method that uses statistical hypothesis testing and power calculations on Shapley values, enabling fast and intuitive wrapper-based feature selection.

Installation ⚙️

pip install powershap

Usage 🛠

powershap is built to be intuitive: it supports various models, including linear, tree-based, and even deep learning models, for both classification and regression tasks.

from powershap import PowerShap
from catboost import CatBoostClassifier

X, y = ...  # your classification dataset

selector = PowerShap(
    model=CatBoostClassifier(n_estimators=250, verbose=0, use_best_model=True)
)

selector.fit(X, y)  # Fit the PowerShap feature selector
selector.transform(X)  # Reduce the dataset to the selected features
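
powershap also works for regression; the sketch below assumes the same PowerShap API as in the example above and uses CatBoostRegressor purely as an illustrative model choice (the dataset placeholder mirrors the one above):

from powershap import PowerShap
from catboost import CatBoostRegressor

X_reg, y_reg = ...  # your regression dataset (placeholder, as above)

# Same fit/transform flow as the classification example; only the model changes.
reg_selector = PowerShap(model=CatBoostRegressor(n_estimators=250, verbose=0))
reg_selector.fit(X_reg, y_reg)
X_reg_selected = reg_selector.transform(X_reg)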

Features ✨

  • default automatic mode
  • scikit-learn compatible
  • supports various models
  • insights into the feature selection method: inspect the ._processed_shaps_df attribute of a fitted PowerShap feature selector (see the snippet below).
  • tested code!
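
As a minimal sketch: assuming selector is the fitted PowerShap instance from the usage example above, the per-feature statistics can be inspected as follows (exact column names may differ between versions):

# Assumes `selector` is a fitted PowerShap instance from the usage example above.
# `_processed_shaps_df` is a DataFrame with one row per feature, holding the
# impact statistics and p-values computed during selection.
shaps_df = selector._processed_shaps_df
print(shaps_df.head())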

Benchmarks ⏱

Check out our benchmark results here.

How does it work ⁉️

Powershap is built on the core assumption that an informative feature will have a larger impact on the prediction compared to a known random feature.

  • Powershap trains multiple models with different random seeds on different subsets of the data. In each iteration, it adds a random uniform feature to the dataset before training.
  • Within a single iteration, after training a model, powershap calculates the absolute Shapley values of all features, including the random feature. If there are multiple outputs or multiple classes, powershap uses the maximum across these outputs. These values are then averaged per feature, representing the impact of that feature in the iteration.
  • After performing all iterations, each feature has an array of impacts. The impact array of each feature is then compared to the average impact of the random feature using the percentile formula to provide a p-value. This tests whether the feature has a larger impact than the random feature and yields a low p-value if so (a sketch of this procedure follows the list).
  • Powershap then outputs all features with a p-value below the provided threshold. The threshold is by default 0.01.
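
The snippet below is a rough, non-authoritative sketch of the ideas in the list above, not powershap's actual implementation: it shows how a uniform random feature could be appended, how per-iteration mean absolute Shapley values could be aggregated, and a simple empirical stand-in for the percentile-based p-value.

import numpy as np

def add_random_feature(X, seed):
    """Append a uniform random 'known noise' column to the feature DataFrame."""
    rng = np.random.default_rng(seed)
    X = X.copy()
    X["random_uniform_feature"] = rng.uniform(size=len(X))
    return X

def mean_abs_shap(shap_values):
    """Mean absolute Shapley value per feature for one iteration.
    For multi-output/multi-class arrays of shape (samples, features, outputs),
    the maximum over the outputs is taken first, as described above."""
    shap_values = np.abs(np.asarray(shap_values))
    if shap_values.ndim == 3:
        shap_values = shap_values.max(axis=-1)
    return shap_values.mean(axis=0)  # one impact value per feature

def percentile_p_value(feature_impacts, random_impacts):
    """Empirical stand-in for the percentile formula: the fraction of iterations
    in which the feature's impact does not exceed the random feature's average
    impact. A small value indicates the feature beats the noise feature."""
    feature_impacts = np.asarray(feature_impacts)
    return float((feature_impacts <= np.mean(random_impacts)).mean())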

Automatic mode 🤖

The required number of iterations and the threshold are hyperparameters of powershap. To avoid manual hyperparameter tuning, powershap by default uses an automatic mode that determines these values.

  • The automatic mode starts by executing powershap for ten iterations.
  • Then, for each feature, powershap calculates the effect size and the statistical power of the test using a Student's t power test.
  • Using the calculated effect size, powershap then calculates the required iterations to achieve a predefined power requirement. By default this is 0.99, which corresponds to a false negative probability of 0.01 (the chance of missing a truly informative feature).
  • If the required iterations are larger than the already performed iterations, powershap then further executes for the extra required iterations.
  • Afterward, powershap re-calculates the required iterations and keeps executing additional iterations until the requirement is met (see the sketch below).
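
A hedged sketch of the power calculation described above, using the Student's t power solver from statsmodels; the effect size, alpha, and exact test formulation here are illustrative assumptions, not necessarily what powershap uses internally:

import math
from statsmodels.stats.power import TTestPower

def required_iterations(effect_size, target_power=0.99, alpha=0.01):
    """Solve a one-sample Student's t power equation for the number of
    iterations needed to reach the requested power at the given alpha."""
    n = TTestPower().solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=target_power,
        alternative="larger",  # one-sided: feature impact larger than random impact
    )
    return math.ceil(n)

# Example: a moderate effect size (0.8) with the default 0.99 power requirement.
print(required_iterations(effect_size=0.8))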

Referencing our package 📝

If you use powershap in a scientific publication, we would highly appreciate citing us as:

@InProceedings{10.1007/978-3-031-26387-3_5,
author="Verhaeghe, Jarne
and Van Der Donckt, Jeroen
and Ongenae, Femke
and Van Hoecke, Sofie",
title="Powershap: A Power-Full Shapley Feature Selection Method",
booktitle="Machine Learning and Knowledge Discovery in Databases",
year="2023",
publisher="Springer International Publishing",
address="Cham",
pages="71--87",
isbn="978-3-031-26387-3"
}

The paper was presented at ECML PKDD 2022. The manuscript can be found here and on GitHub.


👤 Jarne Verhaeghe, Jeroen Van Der Donckt

License

This package is available under the MIT license. More information can be found here: https://github.com/predict-idlab/powershap/blob/main/LICENSE



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

powershap_rs-0.1.2.tar.gz (15.9 kB)

Uploaded Source

Built Distribution

powershap_rs-0.1.2-py3-none-any.whl (16.0 kB)

Uploaded Python 3

File details

Details for the file powershap_rs-0.1.2.tar.gz.

File metadata

  • Download URL: powershap_rs-0.1.2.tar.gz
  • Upload date:
  • Size: 15.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.11.7 Darwin/23.3.0

File hashes

Hashes for powershap_rs-0.1.2.tar.gz
  • SHA256: a69ebba8fc50e1fdbaf2b9474b2d612025a875612c4091cd549e46726a8bbe70
  • MD5: 56a6219ebf09f8821e789ede25f5e03b
  • BLAKE2b-256: 23af51f1533219239d976b72583356b8a6ecf7199572abde573f6c46875c2470

See more details on using hashes here.

File details

Details for the file powershap_rs-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: powershap_rs-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 16.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.11.7 Darwin/23.3.0

File hashes

Hashes for powershap_rs-0.1.2-py3-none-any.whl
  • SHA256: 748649a3f94947c17fde988ff7e89a0be29d71b8065e9925d30f4d0c4a373a32
  • MD5: 9d3131a0748773a57fd8a68725122daa
  • BLAKE2b-256: 628bc7d72823a2660c103f10eae7eb34cca5ed0cde5ee2042dd3effd9fa9b10d

See more details on using hashes here.
