Feature selection package based on SHAP and target permutation, for pandas and Spark

shapicant

shapicant is a feature selection package based on SHAP [LUN] and target permutation, for pandas and Spark.

It is inspired by PIMP [ALT], with some differences:

  • PIMP fits a probability distribution to the population of null importances or, alternatively, uses a non-parametric estimation of the PIMP p-values. Instead, shapicant only implements the non-parametric estimation.

  • For the non-parametric estimation, PIMP computes the fraction of null importances that are more extreme than the true importance (i.e. r/n). Instead, shapicant computes it as (r+1)/(n+1) [NOR].

  • PIMP uses the Gini importance of Random Forest models or the Mutual Information criterion. Instead, shapicant uses SHAP values.

  • While feature importance measures such as the Gini importance show an absolute feature importance, SHAP provides both positive and negative impacts. Instead of taking the mean absolute value of the SHAP values for each feature as feature importance, shapicant takes the mean value for positive and negative SHAP values separately. The true importance needs to be consistently higher than null importances for both positive and negative impacts. For multi-class classification, the true importance needs to be higher for at least one of the classes.

  • While feature importance measures such as the Gini importance of Random Forest models are computed on the training set, SHAP values can be computed out-of-sample. Therefore, shapicant allows them to be computed on a distinct validation set. To decide whether to compute them on the training set or on a validation set, you can refer to this discussion of “Training vs. Test Data” (it covers PFI [BRE], which is a different algorithm, but the general idea still applies).
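
The non-parametric p-value and the sign-split importance described in the list above can be sketched as follows. This is a minimal illustration, not shapicant's actual code: the function names are hypothetical, "more extreme" is read as "greater than or equal to", the sign-split averages over all samples with the opposite sign zeroed out, and taking the maximum of the two p-values is one assumed way of requiring significance for both positive and negative impacts.

```python
import numpy as np

def mean_pos_neg(shap_values):
    """Per-feature mean magnitude of positive and of negative SHAP values.

    shap_values has shape (n_samples, n_features). Assumption: values of the
    opposite sign count as zero, and the mean runs over all samples.
    """
    sv = np.asarray(shap_values, dtype=float)
    pos = np.clip(sv, 0.0, None).mean(axis=0)
    neg = np.clip(-sv, 0.0, None).mean(axis=0)
    return pos, neg

def empirical_p_value(true_importance, null_importances):
    """Empirical p-value (r + 1) / (n + 1), as in [NOR]."""
    null = np.asarray(null_importances, dtype=float)
    r = int(np.sum(null >= true_importance))  # nulls at least as extreme
    n = null.size
    return (r + 1) / (n + 1)

def combined_p_value(true_pos, null_pos, true_neg, null_neg):
    """Assumed combination: both impacts must beat their null distribution."""
    return max(empirical_p_value(true_pos, null_pos),
               empirical_p_value(true_neg, null_neg))
```

With four null importances all below the true one, r/n would give 0, whereas (r+1)/(n+1) gives 1/5, so the estimate never reports a p-value of exactly zero from a finite number of permutations.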

Permuting the response vector instead of permuting features has some advantages:

  • The dependence between predictor variables remains unchanged.

  • The number of permutations can be much smaller than the number of predictor variables for high dimensional datasets (unlike PFI [BRE]) and there is no need to add shadow features (unlike Boruta [KUR]).

  • Since the feature set does not change across iterations, the distributed implementation is more straightforward.
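
The target-permutation loop behind these points can be sketched as below. It is an illustration under stated assumptions, not the package's implementation: `abs_correlation` is a hypothetical stand-in for a SHAP-based importance (used here only to keep the example dependency-free), and `null_importances` is a hypothetical helper name. Only `y` is shuffled, so `X` and the dependence between its columns stay untouched, and the number of iterations is chosen independently of the number of features.

```python
import numpy as np

def abs_correlation(X, y):
    """Stand-in importance: |Pearson correlation| of each feature with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

def null_importances(X, y, importance_fn, n_iter, rng):
    """Importances obtained after shuffling the target n_iter times."""
    return np.stack([importance_fn(X, rng.permutation(y)) for _ in range(n_iter)])

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] + rng.normal(size=300)  # only feature 0 is informative

true_imp = abs_correlation(X, y)
null_imp = null_importances(X, y, abs_correlation, n_iter=20, rng=rng)
print(null_imp.shape)  # (20, 5): one row per permutation, one column per feature
```

Each column of `null_imp` is the null distribution for one feature, against which the corresponding entry of `true_imp` is compared.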

Installation

Dependencies

shapicant requires:

  • Python (>= 3.6)

  • shap (>= 0.36.0)

  • numpy

  • pandas

  • scikit-learn

  • tqdm

For Spark, we also need:

  • pyspark (>= 3.0)

  • pyarrow

User installation

The easiest way to install shapicant is using pip

pip install shapicant

or conda

conda install -c conda-forge shapicant

Documentation

Installation instructions, the API reference and examples can be found in the documentation.

References

[LUN]

Lundberg, S., & Lee, S.I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (pp. 4765–4774).

[ALT]

Altmann, A., Toloşi, L., Sander, O., & Lengauer, T. (2010). Permutation importance: a corrected feature importance measure. Bioinformatics, 26 (10), 1340–1347.

[NOR]

North, B. V., Curtis, D., & Sham, P. C. (2002). A note on the calculation of empirical P values from Monte Carlo procedures. American Journal of Human Genetics, 71 (2), 439–441.

[BRE]

Breiman, L. (2001). Random forests. Machine Learning, 45 (1), 5–32.

[KUR]

Kursa, M., & Rudnicki, W. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36, 1–13.
