
This project uses Shapley values for selecting the top n features and is compatible with the scikit-learn pipeline.

Project description

Zoish

Zoish is a package built to ease machine learning development. One of its main parts is a class that uses SHAP (SHapley Additive exPlanations) for better feature selection. It is compatible with the scikit-learn pipeline. This package uses FastTreeSHAP for calculating SHAP values and SHAP for plotting.

Introduction

The ScallyShapFeatureSelector class of the Zoish package accepts various parameters, from a tree-based estimator class and its tuning parameters to a hyper-parameter optimization method (grid search, random search, or Optuna) and its settings. Samples are split into train and validation sets, and the optimizer then estimates the optimal parameters.

After that, the subset of features with the highest SHAP values is returned. This subset can be used in the subsequent steps of a scikit-learn pipeline.
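Because the selector is scikit-learn compatible, it can also be used on its own as a transformer. Below is a minimal sketch, assuming the standard fit/transform interface and data that has already been split; the full constructor arguments are shown in the Example section further down.

# Minimal standalone sketch (assumption: ScallyShapFeatureSelector follows
# the standard scikit-learn transformer interface, i.e. fit/transform).
from zoish.feature_selectors.zoish_feature_selector import ScallyShapFeatureSelector

selector = ScallyShapFeatureSelector(
    n_features=5,
    # ...remaining constructor arguments as in the full example below...
)
selector.fit(X_train, y_train)             # optimize the estimator, rank features by SHAP
X_train_top = selector.transform(X_train)  # keep only the top n features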

Installation

The Zoish package is available on PyPI and can be installed with pip:

pip install zoish

Supported estimators

  • XGBRegressor (XGBoost)
  • XGBClassifier (XGBoost)
  • RandomForestClassifier
  • RandomForestRegressor
  • CatBoostClassifier
  • CatBoostRegressor
  • BalancedRandomForestClassifier (imbalanced-learn)
  • LGBMClassifier (LightGBM)
  • LGBMRegressor (LightGBM)

Usage

  • Find the features with the highest SHAP values using a specific tree-based model after hyper-parameter optimization.
  • Plot the SHAP summary plot for the selected features.
  • Return a sorted two-column Pandas data frame with the list of features and their SHAP values.

Example

Import required libraries

from zoish.feature_selectors.zoish_feature_selector import ScallyShapFeatureSelector
import xgboost
from optuna.pruners import HyperbandPruner
from optuna.samplers import TPESampler
from sklearn.model_selection import KFold, train_test_split
import pandas as pd
from sklearn.pipeline import Pipeline
from feature_engine.imputation import (
    CategoricalImputer,
    MeanMedianImputer,
)
from category_encoders import OrdinalEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from zoish.utils.helper_funcs import catboost

Computer Hardware Data Set (a regression problem)

urldata = "https://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/machine.data"
# column names (the data set has 10 columns; ERP is the last one)
col_names = [
    "vendor name",
    "Model Name",
    "MYCT",
    "MMIN",
    "MMAX",
    "CACH",
    "CHMIN",
    "CHMAX",
    "PRP",
    "ERP",
]
# read data
data = pd.read_csv(urldata, header=None, names=col_names, sep=",")
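A quick sanity check can catch parsing problems early; the Computer Hardware data set has 209 rows and 10 columns:

# optional sanity check on the loaded frame
print(data.shape)   # expected: (209, 10)
print(data.head())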

Train test split

X = data.loc[:, data.columns != "PRP"]
y = data.loc[:, data.columns == "PRP"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Find feature types for later use

int_cols = X_train.select_dtypes(include=['int']).columns.tolist()
float_cols = X_train.select_dtypes(include=['float']).columns.tolist()
cat_cols = X_train.select_dtypes(include=['object']).columns.tolist()
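For this data set the detection is expected to come out roughly as below, assuming pandas parses the numeric columns as integers and the two name columns as strings:

# expected result for this data set (an expectation, not guaranteed):
#   cat_cols   -> ['vendor name', 'Model Name']
#   int_cols   -> ['MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX', 'ERP']
#   float_cols -> []
print(int_cols, float_cols, cat_cols)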

Define Feature selector and set its arguments

SFC_CATREG_OPTUNA = ScallyShapFeatureSelector(
    n_features=5,
    estimator=catboost.CatBoostRegressor(),
    estimator_params={
        # desired lower and upper bounds for depth
        'depth': [6, 10],
        # desired lower and upper bounds for learning_rate
        'learning_rate': [0.05, 0.1],
    },
    hyper_parameter_optimization_method="optuna",
    shap_version="v0",
    measure_of_accuracy="r2",
    list_of_obligatory_features=[],
    test_size=0.33,
    cv=KFold(n_splits=3, random_state=42, shuffle=True),
    with_shap_summary_plot=True,
    with_stratified=False,
    verbose=0,
    random_state=42,
    n_jobs=-1,
    n_iter=100,
    eval_metric=None,
    number_of_trials=20,
    sampler=TPESampler(),
    pruner=HyperbandPruner(),
)
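Each [low, high] pair in estimator_params defines a search range for the chosen optimizer. As a rough illustration of what such ranges mean in Optuna terms (an assumption about how Zoish uses them internally, not its actual code):

import optuna

# Hypothetical sketch: how [low, high] bounds map onto an Optuna search space.
# Zoish builds its objective internally; this only illustrates the idea.
def objective(trial):
    depth = trial.suggest_int("depth", 6, 10)
    learning_rate = trial.suggest_float("learning_rate", 0.05, 0.1)
    # ...fit CatBoostRegressor(depth=depth, learning_rate=learning_rate)
    # on the training split and return the validation r2 here...
    return 0.0  # placeholder score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)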

Build sklearn Pipeline

pipeline = Pipeline([
    # impute missing values in integer columns
    ('intimputer', MeanMedianImputer(
        imputation_method='median', variables=int_cols)),
    # impute missing values in categorical columns
    ('catimputer', CategoricalImputer(variables=cat_cols)),
    # encode categorical features
    ('catencoder', OrdinalEncoder()),
    # feature selection
    ('SFC_CATREG_OPTUNA', SFC_CATREG_OPTUNA),
    # add any regression model from sklearn, e.g., LinearRegression
    ('regression', LinearRegression()),
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print('r2 score : ')
print(r2_score(y_test, y_pred))
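After fitting, the selector step can be retrieved from the pipeline through scikit-learn's named_steps mapping. Note that the attribute holding the selected features below is hypothetical; check the Zoish documentation for the actual name:

# named_steps is standard scikit-learn API; the selector object is Zoish's
fitted_selector = pipeline.named_steps['SFC_CATREG_OPTUNA']

# Hypothetical attribute: the real attribute exposing the selected features
# and their SHAP values may differ; see the Zoish docs.
# print(fitted_selector.selected_features_)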

There are more examples available in the notebooks directory.

License

Licensed under the BSD 2-Clause License.
