Triglav: Iterative Refinement and Selection of Stable Features Using Shapley Values
Triglav - Feature Selection Using Iterative Refinement
Overview
Triglav (named after the Slavic god of divination) attempts to discover all relevant features using an iterative refinement approach. The approach is based on the method introduced in Boruta, with several modifications:
- Features are clustered, and the impact of each cluster is assessed as the average of the Shapley scores of the features in that cluster.
- Like Boruta, a set of shadow features is created. However, an ensemble of classifiers is used to measure the Shapley scores of each real feature and its shadow counterpart, producing a distribution of scores. A Wilcoxon signed-rank test is used to determine the significance of each cluster, and p-values are adjusted to correct for multiple comparisons in each round. Clusters with adjusted p-values below 'alpha' are counted as hits.
- At each iteration at or beyond 'n_iter_fwer', two beta-binomial distributions are used to determine whether a cluster should be retained. The first distribution models the hit rate, while the second models the rejection rate. For a cluster to be selected, the probability of a hit must be significant after correcting for multiple comparisons and applying a Bonferroni correction for each iteration greater than or equal to the 'n_iter_fwer' parameter. A similar line of reasoning applies for a cluster to be rejected. Clusters that are neither selected nor rejected remain tentative.
- After the iterative refinement stage, SAGE scores can be used to select the best feature from each cluster.
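The per-round hit test described in the second item above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Triglav's implementation: the function names (`make_shadow`, `wilcoxon_greater`, `holm_adjust`) are hypothetical, the score lists stand in for the per-classifier average Shapley scores of a cluster and its shadow, the Wilcoxon p-value uses a normal approximation, and Holm's method is assumed for the adjustment because the text does not name one.

```python
import math
import random


def make_shadow(column, rng):
    """Return a shuffled copy of a feature column (a 'shadow' feature)."""
    shadow = list(column)
    rng.shuffle(shadow)
    return shadow


def wilcoxon_greater(real, shadow):
    """One-sided Wilcoxon signed-rank p-value for 'real scores exceed shadow scores'.

    Zero differences are dropped, ties get average ranks, and the p-value comes
    from a normal approximation with continuity correction (no tie correction
    of the variance), so it is only a rough guide for small samples.
    """
    diffs = [r - s for r, s in zip(real, shadow) if r != s]
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean - 0.5) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the normal


def holm_adjust(p_values):
    """Holm step-down adjustment (an assumed choice of correction)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[idx]))
        adjusted[idx] = running_max
    return adjusted


# Shadow creation: same values as the original column, in shuffled order.
shadow_column = make_shadow([0.2, 1.5, 3.1, 0.7], random.Random(0))

# Made-up scores from an ensemble of 20 classifiers for two clusters.
informative_real = [0.50 + 0.010 * i for i in range(20)]
informative_shadow = [0.10 + 0.005 * i for i in range(20)]
noise_real = [0.10 + 0.001 * (i + 1) * (-1) ** i for i in range(20)]
noise_shadow = [0.10] * 20

alpha = 0.05
raw = [
    wilcoxon_greater(informative_real, informative_shadow),
    wilcoxon_greater(noise_real, noise_shadow),
]
hits = [p < alpha for p in holm_adjust(raw)]
print(hits)
```

Only the first cluster, whose scores sit consistently above those of its shadow, is counted as a hit in this round. The second cluster's differences alternate in sign, so it is not.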
While this method may not identify every feature that is important for classification, it has some useful properties. First, by using an Extremely Randomized Trees model as the default, dependencies between features can be accounted for. Further, decision tree models partition the sample space, which can result in the selection of both globally optimal and locally optimal features. Finally, the approach identifies stable clusters of features, since only clusters that consistently pass the Wilcoxon signed-rank test are selected. This makes it more robust to differences in training data.
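The idea of "consistently passing" rests on the beta-binomial step described in the list above. A rough sketch of that decision rule follows. It is an assumption-laden illustration rather than Triglav's code: the names `betabinom_pmf`, `betabinom_upper_tail` and `decide` are hypothetical, the prior parameters `a=1.0, b=3.0` are made up, the same prior is reused for the hit and rejection models, and a plain Bonferroni factor (number of clusters times number of checks since 'n_iter_fwer') stands in for both corrections.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_pmf(k, n, a, b):
    """P(X = k) for a beta-binomial with n trials and shape parameters a, b."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def betabinom_upper_tail(k, n, a, b):
    """P(X >= k): probability of seeing at least k events in n trials."""
    return sum(betabinom_pmf(i, n, a, b) for i in range(k, n + 1))


def decide(hits, n_rounds, n_clusters, n_checks, alpha=0.05, a=1.0, b=3.0):
    """Classify one cluster as 'selected', 'rejected' or 'tentative'.

    n_checks is the number of iterations at or beyond n_iter_fwer so far.
    The threshold applies a Bonferroni factor for clusters and for checks.
    """
    threshold = alpha / (n_clusters * n_checks)
    misses = n_rounds - hits
    p_select = betabinom_upper_tail(hits, n_rounds, a, b)    # too many hits?
    p_reject = betabinom_upper_tail(misses, n_rounds, a, b)  # too many misses?
    if p_select < threshold:
        return "selected"
    if p_reject < threshold:
        return "rejected"
    return "tentative"


# Three clusters after 30 rounds, checked at iterations 25..30 (6 checks).
hit_counts = [30, 15, 0]
decisions = [decide(h, n_rounds=30, n_clusters=3, n_checks=6) for h in hit_counts]
print(decisions)
```

With these made-up numbers, the cluster that was a hit in every round is selected, the cluster that never was a hit is rejected, and the cluster with 15 hits out of 30 stays tentative because neither tail probability falls below the corrected threshold.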
Install
With Conda from BioConda:
conda install -c bioconda triglav
From PyPI:
pip install triglav
From source:
git clone https://github.com/jrudar/Triglav.git
cd Triglav
pip install .
# or create a virtual environment
python -m venv venv
source venv/bin/activate
pip install .
Interface
An overview of the API can be found here.
Usage and Examples
Examples of how to use Triglav can be found here.
Contributing
To contribute to the development of Triglav, please read our contributing guide.
References
Coming Soon