
Target Permutation Importances (Null Importances)


[Source] [Bug Report] [Documentation] [API Reference]


Overview

Null Importances are normalized feature importance measures that correct for bias in raw feature importances. The method repeatedly permutes the target (outcome) vector to estimate, for each feature, the distribution of importances it would receive in a non-informative setting.

Overall, this package:

  1. Fits the given model class $M$ times, each with a different random_state, to obtain $M$ actual importances for feature $f$: $A_f = [a_{f_1}, a_{f_2}, \ldots, a_{f_M}]$.
  2. Fits the given model class $N$ times, each with a different random_state and a shuffled (permuted) target, to obtain $N$ random importances for feature $f$: $R_f = [r_{f_1}, r_{f_2}, \ldots, r_{f_N}]$.
  3. Computes the final importance of feature $f$ with one of several methods (a minimal numeric sketch follows this list), such as:
    • $I_f = Avg(A_f) - Avg(R_f)$
    • $I_f = Avg(A_f) / (Avg(R_f) + 1)$
    • $I_f = \text{Wasserstein Distance}(A_f, R_f)$

We want $M \ge 1$ and $N \gg 1$. With $M = 1$, the actual importances depend on a single model random_state, which is also fine.
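
For illustration only, here is a minimal numeric sketch of the subtraction and division methods above, using toy numbers rather than anything produced by the package:

import numpy as np

# Toy importances for a single feature f (illustrative numbers, not package output).
A_f = np.array([0.12, 0.11])                    # M = 2 actual runs
R_f = np.array([0.02, 0.03, 0.01, 0.02, 0.04,
                0.03, 0.02, 0.01, 0.03, 0.02])  # N = 10 shuffled-target runs

importance_subtraction = A_f.mean() - R_f.mean()     # I_f = Avg(A_f) - Avg(R_f)
importance_division = A_f.mean() / (R_f.mean() + 1)  # I_f = Avg(A_f) / (Avg(R_f) + 1)

print(importance_subtraction, importance_division)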

Not to be confused with sklearn.inspection.permutation_importance: that sklearn method permutes feature columns, whereas this package permutes the target.
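
For contrast, here is a minimal sketch of sklearn's feature-permutation method (this is sklearn's own API and is unrelated to the target permutation done by this package):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_jobs=-1).fit(X, y)

# sklearn permutes each *feature* column of a fitted model and measures the score drop.
r = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(r.importances_mean[:5])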

This method was originally proposed/implemented by:


Features

  1. Computes null importances with a single function call.
  2. Supports models that expose feature importances (e.g. coef_, feature_importances_), such as RandomForestClassifier, RandomForestRegressor, XGBClassifier, XGBRegressor, LGBMClassifier, LGBMRegressor, CatBoostClassifier, CatBoostRegressor, Lasso, LinearSVC, etc.
  3. Supports sklearn's MultiOutputClassifier and MultiOutputRegressor interfaces.
  4. Supports data in pandas.DataFrame and numpy.ndarray.
  5. Highly customizable through the exposed compute and generic_compute functions.
  6. Proven effective in Kaggle competitions and in our benchmark results below.

Benchmarks

Here are some top Kaggle solutions that used this method:

Year Competition Medal Link
2023 Predict Student Performance from Game Play Gold 3rd place solution
2019 Elo Merchant Category Recommendation Gold 16th place solution
2018 Home Credit Default Risk Gold 10th place solution

The table below shows benchmark results of running feature selection with null importances on multiple models and datasets. "Better" means it outperformed feature selection using the model's built-in feature importances. Details can be seen here.

model n_dataset n_better better %
RandomForestClassifier 10 10 100.0
RandomForestRegressor 12 8 66.67
XGBClassifier 10 7 70.0
XGBRegressor 12 7 58.33
LGBMClassifier 10 8 80.0
LGBMRegressor 12 8 66.67
CatBoostClassifier 10 6 60.0
CatBoostRegressor 12 8 66.67

Installation

pip install target-permutation-importances

or with poetry:

poetry add target-permutation-importances

If you are using Python 3.8.x and Pandas 1.x.x:

pip install target-permutation-importances==1.19.0

or with poetry:

poetry add target-permutation-importances==1.19.0

Although this package is tested with models from sklearn, xgboost, catboost, and lightgbm, none of them is a hard installation requirement. You can use this package with any model that implements the sklearn interface. For models that do not follow the sklearn interface, use the exposed generic_compute function as discussed in the Advanced Usage / Customization section.

Dependencies:

[tool.poetry.dependencies]
python = "^3.9"
numpy = "^1.23.5"
pandas = "^1.5.3"
tqdm = "^4.48.2"
beartype = "^0.14.1"
scipy = "^1.9"

Get Started (Functional APIs)

Tree Models with feature_importances_ Attribute

# import the package
import target_permutation_importances as tpi

# Prepare a dataset
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Models
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

# Convert to a pandas dataframe
Xpd = pd.DataFrame(data.data, columns=data.feature_names)

# Compute permutation importances with default settings
result_df = tpi.compute(
    model_cls=RandomForestClassifier, # The constructor/class of the model.
    model_cls_params={ # The parameters to pass to the model constructor. Update this based on your needs.
        "n_jobs": -1,
    },
    model_fit_params={}, # The parameters to pass to the model fit method. Update this based on your needs.
    X=Xpd, # pd.DataFrame, np.ndarray
    y=data.target, # pd.Series, np.ndarray
    num_actual_runs=2,
    num_random_runs=10,
    # Options: {compute_permutation_importance_by_subtraction, compute_permutation_importance_by_division}
    # Or use your own function to calculate.
    permutation_importance_calculator=tpi.compute_permutation_importance_by_subtraction,
)

print(result_df[["feature", "importance"]].sort_values("importance", ascending=False).head())

Fork the above code on Kaggle.

Outputs:

Running 2 actual runs and 10 random runs
100%|██████████| 2/2 [00:01<00:00,  1.62it/s]
100%|██████████| 10/10 [00:06<00:00,  1.46it/s]
                 feature  importance
25       worst perimeter    0.117495
22  worst concave points    0.089949
26          worst radius    0.084632
7    mean concave points    0.064289
20            worst area    0.062485
8         mean concavity    0.047122
10        mean perimeter    0.029270
5              mean area    0.014566
11           mean radius    0.014346
0             area error    0.000693
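
As a quick, illustrative follow-up (not part of the package API), the returned DataFrame can feed a simple threshold-based selection, for example keeping only features whose corrected importance is positive:

# Illustrative follow-up: keep only features whose corrected importance is positive.
selected_features = result_df.loc[result_df["importance"] > 0, "feature"].tolist()
X_selected = Xpd[selected_features]
print(f"Selected {len(selected_features)} of {Xpd.shape[1]} features")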

Linear Models with coef_ Attribute

# import the package
import target_permutation_importances as tpi

# Prepare a dataset
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Models
from sklearn.svm import LinearSVC

data = load_breast_cancer()

# Convert to a pandas dataframe
Xpd = pd.DataFrame(data.data, columns=data.feature_names)

# Compute permutation importances with default settings
result_df = tpi.compute(
    model_cls=LinearSVC, # The constructor/class of the model.
    model_cls_params={"max_iter": 1000}, # The parameters to pass to the model constructor. Update this based on your needs.
    model_fit_params={}, # The parameters to pass to the model fit method. Update this based on your needs.
    X=Xpd, # pd.DataFrame, np.ndarray
    y=data.target, # pd.Series, np.ndarray
    num_actual_runs=1,
    num_random_runs=10,
    # Options: {compute_permutation_importance_by_subtraction, compute_permutation_importance_by_division}
    # Or use your own function to calculate.
    permutation_importance_calculator=tpi.compute_permutation_importance_by_subtraction,
)

print(result_df[["feature", "importance"]].sort_values("importance", ascending=False).head(10))

Fork the above code on Kaggle.

Outputs:

              feature  importance
10     mean perimeter    0.067352
29      worst texture    0.029602
11        mean radius    0.029509
26       worst radius    0.026499
21  worst compactness    0.010139
23    worst concavity    0.009149
25    worst perimeter    0.008779
14       mean texture    0.007845
0          area error    0.007540
20         worst area    0.004508

With sklearn.multioutput

# import the package
import target_permutation_importances as tpi

# Prepare a dataset
import pandas as pd
from sklearn.datasets import make_regression

# Models
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor

# Multi target regression
X, y = make_regression(
    n_samples=100,
    n_features=20,
    n_targets=5,
)

# Compute permutation importances with default settings
result_df = tpi.compute(
    model_cls=MultiOutputRegressor,
    model_cls_params={ # The parameters to pass to the MultiOutputRegressor constructor. Update this based on your needs.
        "estimator": RandomForestRegressor(n_estimators=2),
    },
    model_fit_params={}, # The parameters to pass to the model fit method. Update this based on your needs.
    X=X, # pd.DataFrame, np.ndarray
    y=y, # pd.Series, np.ndarray
    num_actual_runs=2,
    num_random_runs=10,
    # Options: {compute_permutation_importance_by_subtraction, compute_permutation_importance_by_division}
    # Or use your own function to calculate.
    permutation_importance_calculator=tpi.compute_permutation_importance_by_subtraction,
)

print(result_df[["feature", "importance"]].sort_values("importance", ascending=False).head())

Fork the above code on Kaggle.

Outputs:

Running 2 actual runs and 10 random runs
100%|██████████| 2/2 [00:00<00:00,  4.97it/s]
100%|██████████| 10/10 [00:01<00:00,  5.10it/s]
    feature  importance
19       19    0.233211
10       10    0.058238
0         0    0.049782
17       17    0.031321
2         2    0.028209

You can find more detailed examples in the "Feature Selection Examples" section.

Customization

Changing model or parameters

You can use your own model by changing model_cls, model_cls_params, and model_fit_params. For example, to use LGBMClassifier with importance_type="gain" and colsample_bytree=0.5:

from lightgbm import LGBMClassifier

result_df = tpi.compute(
    model_cls=LGBMClassifier, # The constructor/class of the model.
    model_cls_params={ # The parameters to pass to the model constructor. Update this based on your needs.
        "n_jobs": -1,
        "importance_type": "gain",
        "colsample_bytree": 0.5,
    },
    model_fit_params={}, # The parameters to pass to the model fit method. Update this based on your needs.
    X=Xpd, # pd.DataFrame, np.ndarray
    y=data.target, # pd.Series, np.ndarray
    num_actual_runs=2,
    num_random_runs=10,
    # Options: {compute_permutation_importance_by_subtraction, compute_permutation_importance_by_division}
    # Or use your own function to calculate.
    permutation_importance_calculator=tpi.compute_permutation_importance_by_subtraction,
)

Note: Tree models are greedy. It is usually a good idea to introduce some randomness by setting the colsample_* parameters, which forces the model to explore the importance of different features. In other words, these parameters prevent a feature from being under-represented in the importance calculation merely because another, highly correlated feature is present.

Changing null importances calculation

You can pick your own calculation method by changing permutation_importance_calculator. There are 3 provided calculations:

  • tpi.compute_permutation_importance_by_subtraction
  • tpi.compute_permutation_importance_by_division
  • tpi.compute_permutation_importance_by_wasserstein_distance

You can also implement your own calculation function and pass it in. The function must follow the PermutationImportanceCalculatorType specification, which you can find in the API Reference; a hedged sketch follows below.
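
As a sketch only: assuming the calculator receives two lists of per-run importance DataFrames (each with "feature" and "importance" columns) and returns a single DataFrame, a custom calculator could look like the following. Confirm the exact PermutationImportanceCalculatorType signature in the API Reference before relying on this.

import pandas as pd

def compute_importance_by_signal_to_noise(actual_importance_dfs, random_importance_dfs):
    # Assumed inputs: lists of per-run DataFrames with "feature" and "importance" columns.
    actual_mean = pd.concat(actual_importance_dfs).groupby("feature")["importance"].mean()
    random_runs = pd.concat(random_importance_dfs).groupby("feature")["importance"]
    random_mean = random_runs.mean()
    random_std = random_runs.std()

    # Illustrative metric: how far the actual importance sits above the null mean,
    # scaled by the spread of the null distribution.
    importance = (actual_mean - random_mean) / (random_std + 1e-8)
    return importance.rename("importance").reset_index()

# Then pass it in like the built-in calculators:
# result_df = tpi.compute(..., permutation_importance_calculator=compute_importance_by_signal_to_noise)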

Advanced Customization

This package exposes tpi.generic_compute to allow advanced customization. Read the following for details:


Get Started (scikit-learn APIs)

TargetPermutationImportancesWrapper follows the scikit-learn interface and supports scikit-learn feature selection methods such as SelectFromModel:

# Import the package
import target_permutation_importances as tpi

# Prepare a dataset
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

# Models
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

# Convert to a pandas dataframe
Xpd = pd.DataFrame(data.data, columns=data.feature_names)

# Compute permutation importances with default settings
wrapped_model = tpi.TargetPermutationImportancesWrapper(
    model_cls=RandomForestClassifier, # The constructor/class of the model.
    model_cls_params={ # The parameters to pass to the model constructor. Update this based on your needs.
        "n_jobs": -1,
    },
    num_actual_runs=2,
    num_random_runs=10,
    # Options: {compute_permutation_importance_by_subtraction, compute_permutation_importance_by_division}
    # Or use your own function to calculate.
    permutation_importance_calculator=tpi.compute_permutation_importance_by_subtraction,
)
wrapped_model.fit(
    X=Xpd, # pd.DataFrame, np.ndarray
    y=data.target, # pd.Series, np.ndarray
    # And other fit parameters for the model.
)
# Get the feature importances as a pandas dataframe
result_df = wrapped_model.feature_importances_df
print(result_df[["feature", "importance"]].sort_values("importance", ascending=False).head())


# Select top-5 features with sklearn `SelectFromModel`
selector = SelectFromModel(
    estimator=wrapped_model, prefit=True, max_features=5, threshold=-np.inf
).fit(Xpd, data.target)
selected_x = selector.transform(Xpd)
print(selected_x.shape)
print(selector.get_feature_names_out())

Fork the above code on Kaggle.

Outputs:

Running 2 actual runs and 10 random runs
100%|██████████| 2/2 [00:01<00:00,  1.80it/s]
100%|██████████| 10/10 [00:06<00:00,  1.55it/s]
                 feature  importance
22       worst perimeter    0.151953
27  worst concave points    0.124407
20          worst radius    0.119090
7    mean concave points    0.098747
23            worst area    0.096943
(569, 5)
['mean concave points' 'worst radius' 'worst perimeter' 'worst area'
 'worst concave points']

Feature Selection Examples


Development Setup and Contribution Guide

Python Version

You can find the suggested development Python version in .python-version. You might consider setting up Pyenv if you want to have multiple Python versions on your machine.

Python packages

This repository is set up with Poetry. If you are not familiar with Poetry, you can find the package requirements listed in pyproject.toml. Otherwise, you can set everything up with poetry install.

Run Benchmarks

To run the benchmark locally on your machine, run make run_tabular_benchmark or python -m benchmarks.run_tabular_benchmark.

Make Changes

Follow the Make Changes guide from GitHub. Before committing or merging, please run the linters with make lint and the tests with make test.


