# fseval

Benchmarking framework for Feature Selection algorithms 🚀

A Feature Ranker benchmarking library, useful for Feature Selection and Interpretable AI methods. It can plot feature importance scores on an online dashboard and neatly integrates Hydra with wandb.
Any sklearn-style estimator can be used as a Feature Ranker. The estimator must estimate at least one of:

- Feature importance, using `feature_importances_`.
- Feature subset, using `feature_support_`.
- Feature ranking, using `feature_ranking_`.
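For illustration, a custom ranker can be as small as a class with a sklearn-style `fit` that sets one of the attributes above. The sketch below is a made-up example: the class name and the variance-based scoring are chosen for this README only and are not part of fseval.

```python
class VarianceRanker:
    """Toy ranker: scores each feature by its variance (illustrative only)."""

    def fit(self, X, y=None):
        n = len(X)
        # Per-feature variance as a stand-in importance score.
        self.feature_importances_ = []
        for j in range(len(X[0])):
            col = [row[j] for row in X]
            mean = sum(col) / n
            self.feature_importances_.append(sum((v - mean) ** 2 for v in col) / n)
        # Ranking: feature indices sorted from most to least important.
        self.feature_ranking_ = sorted(
            range(len(self.feature_importances_)),
            key=lambda j: -self.feature_importances_[j],
        )
        return self
```

Because the class exposes both `feature_importances_` and `feature_ranking_` after `fit`, it satisfies the interface described above.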
Main functionality:

- 📊 Online dashboard. Experiments can be uploaded to wandb for seamless experiment tracking and visualization. Feature importance and subset validation plots are built-in.
- 🔄 Scikit-Learn integration. Integrates nicely with sklearn. Any estimator that implements `fit` is supported.
- 🗄 Dataset adapters. Datasets can be loaded dynamically using an adapter. OpenML support is built-in.
- 🎛 Synthetic dataset generation. Synthetic datasets can be generated and configured right in the library itself.
- 📌 Relevant features ground-truth. Datasets can have ground-truth relevant features defined, so the estimated versus ground-truth feature importance is automatically plotted in the dashboard.
- ⚜️ Subset validation. Allows you to validate the quality of a feature ranking by running a validation estimator on some of the `k` best feature subsets.
- ⚖️ Bootstrapping. Allows you to approximate the stability of an algorithm by running multiple experiments on bootstrap-resampled datasets.
- ⚙️ Reproducible configs. Uses Hydra as a config parser, allowing every part of the experiment to be configured. The config can be uploaded to wandb, so the experiment can be replayed later.
Install
```shell
pip install fseval
```
Usage
fseval is run via a CLI. Example:
```shell
fseval \
  +dataset=synclf_easy \
  +estimator@ranker=chi2 \
  +estimator@validator=decision_tree
```
This runs Chi2 feature ranking on the `synclf_easy` dataset and validates feature subsets using a Decision Tree.
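Conceptually, the validation step fits the validation estimator on each of the k-best feature subsets implied by the ranking and records a score per subset size. The sketch below illustrates that idea in plain Python; the function name and the scoring callback are hypothetical, not fseval's API.

```python
def validate_subsets(ranking, X, y, fit_and_score):
    """For each k, fit/score a validator on the k best-ranked features.

    `ranking` is a list of feature indices, best first. `fit_and_score`
    is any callable (hypothetical here) returning a score for (X, y).
    """
    scores = {}
    for k in range(1, len(ranking) + 1):
        top_k = ranking[:k]  # indices of the k best features
        X_subset = [[row[j] for j in top_k] for row in X]
        scores[k] = fit_and_score(X_subset, y)
    return scores
```

Plotting such scores against k is what the dashboard's subset-validation plots show.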
To see all the configurable options, run:
```shell
fseval --help
```
Weights and Biases integration
Integration with wandb is built-in. Create an account and log in to the CLI with `wandb login`. Then, enable wandb using `callbacks="[wandb]"`:

```shell
fseval \
  callbacks="[wandb]" \
  +callbacks.wandb.project=fseval-readme \
  +dataset=synclf_easy \
  +estimator@ranker=chi2 \
  +estimator@validator=decision_tree
```
We can now explore the results on the online dashboard:
Running bootstraps
Bootstraps can be run, to approximate the stability of an algorithm. Bootstrapping works by creating multiple dataset permutations and running the algorithm on each of them. A simple way to create dataset permutations is to resample with replacement.
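Resampling with replacement can be sketched in a few lines of Python. `bootstrap_resample` is a hypothetical helper written for this README to illustrate the idea, not fseval's internal resampler.

```python
import random

def bootstrap_resample(dataset, n_bootstraps, seed=0):
    """Create `n_bootstraps` dataset permutations by sampling with replacement."""
    rng = random.Random(seed)
    permutations = []
    for _ in range(n_bootstraps):
        # Draw len(dataset) rows uniformly, repeats allowed.
        permutations.append([rng.choice(dataset) for _ in dataset])
    return permutations
```

Running the ranking algorithm on each permutation and comparing the results gives an estimate of its stability.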
In fseval, bootstrapping can be configured with `resample=bootstrap`:

```shell
fseval \
  resample=bootstrap \
  n_bootstraps=8 \
  +dataset=synclf_easy \
  +estimator@ranker=chi2 \
  +estimator@validator=decision_tree
```

This runs the entire experiment 8 times, each on a resampled dataset.
In the dashboard, plots are already set up to support bootstrapping:
*Validation results for 25 bootstraps.* ✨
Launching multiple experiments at once
To launch multiple experiments, use `--multirun`:

```shell
fseval \
  --multirun \
  +dataset=synclf_easy \
  +estimator@ranker=[boruta,featboost,chi2] \
  +estimator@validator=decision_tree
```

This launches 3 jobs, one per ranker.
See the multirun overriding syntax. For example, you can select multiple groups using `[]`, a range using `range(start, stop, step)`, and all options using `glob(*)`.
Multiprocessing
The experiment can run in parallel: the list of bootstraps is distributed over the available CPUs. To use all available processors, set `n_jobs=-1`:

```shell
fseval [...] n_jobs=-1
```

Alternatively, set `n_jobs` to the specific number of processors to use, e.g. `n_jobs=4` if you have a quad-core machine.

When using bootstraps, it can be efficient to choose an amount that is divisible by the number of CPUs:

```shell
fseval [...] resample=bootstrap n_bootstraps=8 n_jobs=4
```

This utilizes all 4 CPUs efficiently, running 2 bootstraps per processor.
Distributed processing
Since fseval uses Hydra, all of Hydra's plugins can be used as well. Some plugins for distributed processing are:

- RQ launcher. Uses Redis Queue (RQ) to launch jobs.
- Submitit launcher. Submits jobs directly to a SLURM cluster. See the example setup.

For example, to submit jobs to RQ:

```shell
fseval --multirun [...] hydra/launcher=rq
```
Configuring a Feature Ranker
Any part of the config can be overridden from the command line, including feature ranker parameters. For example:

```shell
fseval [...] +validator.classifier.estimator.criterion=entropy
```

changes the Decision Tree criterion to entropy. A hyper-parameter sweep over several values can be run like so:

```shell
fseval --multirun [...] +validator.classifier.estimator.criterion=entropy,gini
```

or, in the case of a ranker:

```shell
fseval --multirun [...] +ranker.classifier.estimator.learning_rate="range(0.1, 2.1, 0.1)"
```

This launches 20 jobs with different learning rates (this hyper-parameter applies to `+estimator@ranker=featboost`). See the multirun docs for syntax.
Config directory
The number of command-line arguments quickly adds up. Any configuration can also be loaded from a directory, passed with `--config-dir`:

```shell
fseval --config-dir ./conf
```

with the `./conf` directory containing:

```
.
└── conf
    └── experiment
        └── my_experiment_presets.yaml
```

Then, `my_experiment_presets.yaml` can contain:
```yaml
# @package _global_
defaults:
  - override /resample: bootstrap
  - override /callbacks:
      - wandb

callbacks:
  wandb:
    project: my-first-benchmark

n_bootstraps: 20
n_jobs: 4
```
This configures wandb, bootstrapping, and multiprocessing all at once. ✓ See the example config.
Also, extra estimators or datasets can be added:
```
.
└── conf
    ├── estimator
    │   └── my_custom_ranker.yaml
    └── dataset
        └── my_custom_dataset.yaml
```
We can now use the newly added estimator and dataset:

```shell
fseval --config-dir ./conf +estimator@ranker=my_custom_ranker +dataset=my_custom_dataset
```

🙌🏻 Here, `my_custom_ranker.yaml` would be any estimator definition, and `my_custom_dataset.yaml` any dataset definition.
Built-in Feature Rankers
A number of rankers are already built-in and can be used without further configuration:

Ranker | Dependency | Command line argument
---|---|---
ANOVA F-Value | - | `+estimator@ranker=anova_f_value`
Boruta | `pip install Boruta` | `+estimator@ranker=boruta`
Chi2 | - | `+estimator@ranker=chi2`
Decision Tree | - | `+estimator@ranker=decision_tree`
FeatBoost | `pip install git+https://github.com/amjams/FeatBoost.git` | `+estimator@ranker=featboost`
MultiSURF | `pip install skrebate` | `+estimator@ranker=multisurf`
Mutual Info | - | `+estimator@ranker=mutual_info`
ReliefF | `pip install skrebate` | `+estimator@ranker=relieff`
Stability Selection | `pip install git+https://github.com/dunnkers/stability-selection.git matplotlib` ℹ️ | `+estimator@ranker=stability_selection`
TabNet | `pip install pytorch-tabnet` | `+estimator@ranker=tabnet`
XGBoost | `pip install xgboost` | `+estimator@ranker=xgb`
Infinite Selection | `pip install git+https://github.com/dunnkers/infinite-selection.git` ℹ️ | `+estimator@ranker=infinite_selection`
ℹ️ This library was customized to make it compatible with the fseval pipeline.
If you would simply like to install all dependencies at once, download the fseval requirements.txt file and run `pip install -r requirements.txt`.
About
Built by Jeroen Overschie as part of a Master's Thesis.
(Data Science and Computational Complexity, University of Groningen)