
Benchmarking framework for Feature Selection algorithms 🚀


fseval


A Feature Ranker benchmarking library. Useful for Feature Selection and Interpretable AI methods. Allows plotting feature importance scores on an online dashboard. Neatly integrates Hydra with wandb.

Any sklearn-style estimator can be used as a Feature Ranker. The estimator must estimate at least one of the following (a minimal sketch follows this list):

  1. Feature importance, using feature_importances_.
  2. Feature subset, using feature_support_.
  3. Feature ranking, using feature_ranking_.
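
For example, a minimal custom ranker could look roughly like the sketch below. This is a hypothetical class (the name and the variance-based scoring are made up for illustration); it only demonstrates the sklearn-style contract of implementing fit and exposing one of the attributes listed above.

import numpy as np
from sklearn.base import BaseEstimator


class VarianceRanker(BaseEstimator):
    """Hypothetical ranker that scores each feature by its variance."""

    def fit(self, X, y=None):
        X = np.asarray(X)
        scores = X.var(axis=0)
        # Expose importance scores via the attribute from option (1) above.
        self.feature_importances_ = scores / scores.sum()
        return self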

Main functionality:

  • 📊 Online dashboard. Experiments can be uploaded to wandb for seamless experiment tracking and visualization. Feature importance and subset validation plots are built-in.
  • 🔄 Scikit-Learn integration. Integrates nicely with sklearn. Any estimator that implements fit is supported.
  • 🗄 Dataset adapters. Datasets can be loaded dynamically using an adapter. OpenML support is built-in.
  • 🎛 Synthetic dataset generation. Synthetic datasets can be generated and configured right in the library itself.
  • 📌 Relevant features ground-truth. Datasets can have ground-truth relevant features defined, so the estimated versus the ground-truth feature importance is automatically plotted in the dashboard.
  • ⚜️ Subset validation. Allows you to validate the quality of a feature ranking by running a validation estimator on several of the k best feature subsets (see the sketch after this list).
  • ⚖️ Bootstrapping. Allows you to approximate the stability of an algorithm by running multiple experiments on bootstrap resampled datasets.
  • ⚙️ Reproducible configs. Uses Hydra as a config parser, to allow configuring every part of the experiment. The config can be uploaded to wandb, so the experiment can be replayed later.
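
As a concrete illustration of the subset-validation idea, the sketch below ranks features and then scores a validation estimator on the top-k subset for several values of k. This is plain scikit-learn, not fseval's internal code.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_features=20, n_informative=4, random_state=0)
X = X - X.min(axis=0)  # chi2 requires non-negative features

# Rank features from most to least relevant according to the chi2 statistic.
scores, _ = chi2(X, y)
ranking = np.argsort(scores)[::-1]

# Validate the ranking by scoring a Decision Tree on the k best features.
for k in (1, 2, 4, 8):
    subset = ranking[:k]
    accuracy = cross_val_score(DecisionTreeClassifier(random_state=0), X[:, subset], y).mean()
    print(f"top-{k} features: accuracy={accuracy:.3f}")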

Install

pip install fseval

Usage

fseval is run via a CLI. Example:

fseval \
  +dataset=synclf_easy \
  +estimator@ranker=chi2 \
  +estimator@validator=decision_tree

This runs Chi2 feature ranking on the synclf_easy dataset and validates feature subsets using a Decision Tree.

To see all the configurable options, run:

fseval --help

Weights and Biases integration

Integration with wandb is built-in. Create an account and log in to the CLI with wandb login. Then, enable wandb using callbacks="[wandb]":

fseval \
  callbacks="[wandb]" \
  +callbacks.wandb.project=fseval-readme \
  +dataset=synclf_easy \
  +estimator@ranker=chi2 \
  +estimator@validator=decision_tree

We can now explore the results on the online dashboard.

Running bootstraps

Bootstraps can be run to approximate the stability of an algorithm. Bootstrapping works by creating multiple dataset permutations and running the algorithm on each of them. A simple way to create dataset permutations is to resample with replacement, as the sketch below illustrates.
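
The following is an illustrative sketch (not fseval's own resampling code) that uses scikit-learn's resample utility to draw a single bootstrap permutation of a dataset:

from sklearn.datasets import make_classification
from sklearn.utils import resample

X, y = make_classification(n_samples=100, random_state=0)

# Draw one bootstrap permutation: sample n rows with replacement,
# so some rows appear multiple times and others not at all.
X_boot, y_boot = resample(X, y, replace=True, random_state=1)
print(X_boot.shape)  # (100, 20): same size as the original dataset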

In fseval, bootstrapping can be configured with resample=bootstrap:

fseval \
  resample=bootstrap \
  n_bootstraps=8 \
  +dataset=synclf_easy \
  +estimator@ranker=chi2 \
  +estimator@validator=decision_tree 

This runs the entire experiment 8 times, once for each resampled dataset.

In the dashboard, plots are already set up to support bootstrapping; the example plot shows the validation results for 25 bootstraps. ✨

Launching multiple experiments at once

To launch multiple experiments, use --multirun.

fseval \
  --multirun \
  +dataset=synclf_easy \
  +estimator@ranker=[boruta,featboost,chi2] \
  +estimator@validator=decision_tree 

This launches 3 jobs, one for each ranker.

See the multirun override syntax documentation. For example, you can select multiple options using [], a range using range(start, stop, step), and all options using glob(*).

Multiprocessing

The experiment can run in parallel: the list of bootstraps is distributed over the available CPUs. To use all available processors, set n_jobs=-1:

fseval [...] n_jobs=-1

Alternatively, set n_jobs to the specific number of processors to use, e.g. n_jobs=4 on a quad-core machine.

When using bootstraps, it is efficient to choose a number of bootstraps that is divisible by the number of processors:

fseval [...] resample=bootstrap n_bootstraps=8 n_jobs=4

This keeps all 4 processors utilized efficiently, with each handling 2 of the 8 bootstraps.

Distributed processing

Since fseval uses Hydra, all Hydra plugins can also be used. Launcher plugins for distributed processing include, for example, the Joblib, Ray, RQ, and Submitit launchers.

Example:

fseval --multirun [...] hydra/launcher=rq

This submits the jobs to RQ.

Configuring a Feature Ranker

Every part of the config can be overridden as desired; this includes the configuration of feature rankers and validators. For example:

fseval [...] +validator.classifier.estimator.criterion=entropy

This changes the Decision Tree criterion to entropy. A hyper-parameter sweep over some parameters can be performed like so:

fseval --multirun [...] +validator.classifier.estimator.criterion=entropy,gini

or, in the case of a ranker:

fseval --multirun [...] +ranker.classifier.estimator.learning_rate="range(0.1, 2.1, 0.1)"

This launches 20 jobs, each with a different learning rate (this hyper-parameter applies to +estimator@ranker=featboost). See the multirun docs for the syntax.

Config directory

The number of command-line arguments quickly adds up, so any configuration can also be loaded from a directory. This is configured with --config-dir:

fseval --config-dir ./conf

With the ./conf directory containing:

.
└── conf
    └── experiment
        └── my_experiment_presets.yaml

Then, my_experiment_presets.yaml can contain:

# @package _global_
defaults:
  - override /resample: bootstrap
  - override /callbacks:
    - wandb

callbacks:
  wandb:
    project: my-first-benchmark

n_bootstraps: 20
n_jobs: 4

This configures wandb, bootstrapping, and multiprocessing. ✓ See the example config.

Also, extra estimators or datasets can be added:

.
└── conf
    ├── estimator
    │   └── my_custom_ranker.yaml
    └── dataset
        └── my_custom_dataset.yaml

We can now use the newly added estimator and dataset:

fseval --config-dir ./conf +estimator@ranker=my_custom_ranker +dataset=my_custom_dataset

🙌🏻

Here, my_custom_ranker.yaml can contain any estimator definition, and my_custom_dataset.yaml any dataset definition.

Built-in Feature Rankers

A number of rankers are already built-in and can be used without further configuration:

| Ranker | Dependency | Command line argument |
|---|---|---|
| ANOVA F-Value | - | +estimator@ranker=anova_f_value |
| Boruta | pip install Boruta | +estimator@ranker=boruta |
| Chi2 | - | +estimator@ranker=chi2 |
| Decision Tree | - | +estimator@ranker=decision_tree |
| FeatBoost | pip install git+https://github.com/amjams/FeatBoost.git | +estimator@ranker=featboost |
| MultiSURF | pip install skrebate | +estimator@ranker=multisurf |
| Mutual Info | - | +estimator@ranker=mutual_info |
| ReliefF | pip install skrebate | +estimator@ranker=relieff |
| Stability Selection | pip install git+https://github.com/dunnkers/stability-selection.git matplotlib ℹ️ | +estimator@ranker=stability_selection |
| TabNet | pip install pytorch-tabnet | +estimator@ranker=tabnet |
| XGBoost | pip install xgboost | +estimator@ranker=xgb |
| Infinite Selection | pip install git+https://github.com/dunnkers/infinite-selection.git ℹ️ | +estimator@ranker=infinite_selection |

ℹ️ These libraries were customized to make them compatible with the fseval pipeline.

If you would like to simply install all dependencies at once, download the fseval requirements.txt file and run pip install -r requirements.txt.

About

Built by Jeroen Overschie as part of a Master's thesis (Data Science and Computational Complexity, University of Groningen).
