Triglav: Iterative Refinement and Selection of Stable Features Using Shapley Values

Triglav - Feature Selection Using Iterative Refinement

Overview

Triglav (named after the Slavic god of divination) attempts to discover all relevant features using an iterative refinement approach. The approach is modeled on the method introduced in Boruta, with several modifications:

  1. Features are clustered and the impact of each cluster is assessed as the average of the Shapley scores of the features associated with each cluster.

  2. Like Boruta, a set of shadow features is created. However, an ensemble of classifiers is used to measure the Shapley scores of each real feature and its shadow counterpart, producing a distribution of scores. A Wilcoxon signed-rank test is used to determine the significance of each cluster, and p-values are adjusted to correct for multiple comparisons within each round. A cluster whose adjusted p-value falls below 'alpha' is counted as a hit.

  3. At each iteration at or after 'n_iter_fwer', two beta-binomial distributions are used to determine whether a cluster should be retained. The first distribution models the hit rate, while the second models the rejection rate. For a cluster to be selected, the probability of a hit must be significant after correcting for multiple comparisons and applying a Bonferroni correction for each iteration greater than or equal to the 'n_iter_fwer' parameter. Similar reasoning applies for a cluster to be rejected. Clusters that are neither selected nor rejected remain tentative.

  4. After the iterative refinement stage, SAGE scores can be used to select the best feature from each cluster.
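
Steps 1 and 2 above can be sketched for a single round as follows. This is a minimal illustration, not Triglav's implementation: the importance scores are synthetic stand-ins rather than real Shapley values, the cluster assignment is hypothetical, and Bonferroni is an assumed choice of multiple-comparison correction (the description above does not name one).

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Shadow features: permute each column independently, so every shadow
# keeps its marginal distribution but loses any link to the labels.
X = rng.normal(size=(50, 4))
X_shadow = rng.permuted(X, axis=0)

# Stand-in importance scores (NOT real Shapley values): one row per
# ensemble member, one column per feature. Features 0 and 1 are built
# to score higher than the shadows; features 2 and 3 are not.
n_models = 30
real = np.abs(rng.normal(loc=[1.0, 0.8, 0.1, 0.1], scale=0.05, size=(n_models, 4)))
shadow = np.abs(rng.normal(loc=0.1, scale=0.05, size=(n_models, 4)))

# Hypothetical clustering: features 0, 1 -> cluster 0; features 2, 3 -> cluster 1.
clusters = np.array([0, 0, 1, 1])
cluster_ids = np.unique(clusters)


def cluster_means(scores, clusters, cluster_ids):
    """Average per-feature scores within each cluster (one column per cluster)."""
    return np.column_stack(
        [scores[:, clusters == c].mean(axis=1) for c in cluster_ids]
    )


real_c = cluster_means(real, clusters, cluster_ids)
shadow_c = cluster_means(shadow, clusters, cluster_ids)

# One-sided paired test per cluster: are real scores larger than shadow scores?
p_values = np.array([
    wilcoxon(real_c[:, j], shadow_c[:, j], alternative="greater").pvalue
    for j in range(len(cluster_ids))
])

# Bonferroni adjustment across clusters (an assumed correction method).
alpha = 0.05
p_adjusted = np.minimum(p_values * len(p_values), 1.0)
hits = p_adjusted < alpha
```

Here cluster 0 is flagged as a hit because its averaged score exceeds the shadow score for every ensemble member. In a full run, the per-round hits would be accumulated across iterations and passed to the decision rule in step 3.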

While this method may not recover every feature important for classification, it does have some useful properties. First, by using an Extremely Randomized Trees model as the default, dependencies between features can be accounted for. Further, decision tree models partition the sample space, which can result in the selection of both globally optimal and locally optimal features. Finally, the approach identifies stable clusters of features, since only those which consistently pass the Wilcoxon signed-rank test are selected. This makes it more robust to differences in training data.
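
The stability property depends on how often a cluster is a hit across iterations. The decision rule from step 3 can be sketched with the standard library alone. This is a rough illustration under stated assumptions: the shape parameters `hit_prior` and `rej_prior`, the function names, and the use of a plain Bonferroni threshold over clusters and tested iterations are all hypothetical choices, not values or names taken from Triglav.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k, n, a, b):
    """P(K = k) for a beta-binomial with n trials and shape parameters a, b."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def upper_tail(k, n, a, b):
    """P(K >= k): chance of at least k hits under the null."""
    return sum(betabinom_pmf(i, n, a, b) for i in range(k, n + 1))


def lower_tail(k, n, a, b):
    """P(K <= k): chance of at most k hits under the null."""
    return sum(betabinom_pmf(i, n, a, b) for i in range(0, k + 1))


def decide(n_hits, n_iter, n_clusters, n_iter_fwer, alpha=0.05,
           hit_prior=(20.0, 20.0), rej_prior=(20.0, 20.0)):
    """Classify a cluster as 'selected', 'rejected', or 'tentative'.

    hit_prior and rej_prior are hypothetical shape parameters for the two
    beta-binomial distributions (hit rate and rejection rate).
    """
    if n_iter < n_iter_fwer:
        return "tentative"  # no decisions before n_iter_fwer
    # Bonferroni over clusters and over every iteration tested so far.
    n_rounds_tested = n_iter - n_iter_fwer + 1
    threshold = alpha / (n_clusters * n_rounds_tested)
    if upper_tail(n_hits, n_iter, *hit_prior) < threshold:
        return "selected"
    if lower_tail(n_hits, n_iter, *rej_prior) < threshold:
        return "rejected"
    return "tentative"
```

With these assumed parameters, a cluster that was a hit in 38 of 40 iterations is selected, one with 2 hits is rejected, and one with 20 hits stays tentative. Tightening the threshold as iterations accumulate is what keeps a cluster from being accepted on a short lucky streak.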

Install

With Conda from BioConda:

conda install -c bioconda triglav

From PyPI:

pip install triglav

From source:

git clone https://github.com/jrudar/Triglav.git
cd Triglav
pip install .
# alternatively, install into a new virtual environment
python -m venv venv
source venv/bin/activate
pip install .

Interface

An overview of the API can be found here.

Usage and Examples

Examples of how to use Triglav can be found here.

Contributing

To contribute to the development of Triglav, please read our contributing guide.

References

Coming Soon
