This project uses Shapley values to select the top n features and is compatible with scikit-learn pipelines.

Project description

Zoish

Zoish is a package built to ease machine learning development. One of its main components is a class that uses SHAP (SHapley Additive exPlanations) for better feature selection. It is compatible with scikit-learn pipelines. The package uses FastTreeSHAP to calculate SHAP values and SHAP for plotting.

Introduction

The ScallyShapFeatureSelector class of the Zoish package accepts a range of parameters: a tree-based estimator class and its tuning parameters, and a search method (grid search, random search, or Optuna) together with its parameters. The samples are split into a training set and a validation set, and the optimization then estimates the best hyperparameters for the estimator.

After that, the subset of features with the highest SHAP values is returned. This subset can be passed on to the next steps of the scikit-learn pipeline.
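To make the selection step concrete, the sketch below ranks features by their mean absolute Shapley value and keeps the top n. It is an illustration only, not Zoish's implementation: Zoish relies on FastTreeSHAP for tree models, whereas this example computes exact Shapley values by enumerating coalitions, which is only practical for a handful of features. The names exact_shapley, top_n_features, and toy_model are hypothetical, as is the all-zero baseline.

```python
from itertools import combinations
from math import factorial


def exact_shapley(predict, row, baseline):
    """Exact Shapley values of `predict` at `row`, relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2**n_features, so this only suits toy inputs.
    """
    n = len(row)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without_i = [row[j] if j in coalition else baseline[j]
                             for j in range(n)]
                with_i = list(without_i)
                with_i[i] = row[i]
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values


def top_n_features(predict, rows, baseline, names, n_features):
    """Rank features by mean absolute Shapley value and keep the top n."""
    totals = [0.0] * len(names)
    for row in rows:
        for k, value in enumerate(exact_shapley(predict, row, baseline)):
            totals[k] += abs(value)
    means = [total / len(rows) for total in totals]
    ranked = sorted(zip(names, means), key=lambda pair: pair[1], reverse=True)
    return ranked[:n_features]


def toy_model(x):
    # Hypothetical model: the third input has no influence on the output.
    return 3.0 * x[0] + 1.0 * x[1] + 0.0 * x[2]


rows = [[1.0, 2.0, 5.0], [2.0, -1.0, 7.0], [-1.0, 0.5, 9.0]]
baseline = [0.0, 0.0, 0.0]
selected = top_n_features(toy_model, rows, baseline, ["a", "b", "c"], n_features=2)
print(selected)
```

The irrelevant feature "c" receives a Shapley value of zero on every row, so it is the one dropped when two features are requested.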

Installation

Zoish package is available on PyPI and can be installed with pip:

pip install zoish
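After installing, one way to confirm which release ended up in the environment is to query the package metadata from Python. The sketch below uses only the standard library (importlib.metadata, Python 3.8 or later); the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


zoish_version = installed_version("zoish")
if zoish_version is None:
    print("zoish is not installed in this environment")
else:
    print(f"zoish {zoish_version} is installed")
```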

Supported estimators

XGBRegressor (XGBoost)
XGBClassifier (XGBoost)
RandomForestClassifier
RandomForestRegressor
CatBoostClassifier
CatBoostRegressor
BalancedRandomForestClassifier
LGBMClassifier (LightGBM)
LGBMRegressor (LightGBM)

Usage

Find the features with the highest SHAP values using a specific tree-based model, after hyperparameter optimization.
Plot the SHAP summary plot for the selected features.
Return a sorted two-column pandas DataFrame listing the features and their SHAP values.

Examples

Import required libraries

from zoish.feature_selectors.optunashap import OptunaShapFeatureSelector
import xgboost
from optuna.pruners import HyperbandPruner
from optuna.samplers import TPESampler
from sklearn.model_selection import KFold, train_test_split
import pandas as pd
from sklearn.pipeline import Pipeline
from feature_engine.imputation import (
    CategoricalImputer,
    MeanMedianImputer
    )
from category_encoders import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score)
import lightgbm
import matplotlib.pyplot as plt
import optuna

Adult Data Set (a classification problem)

urldata= "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
# column names
col_names=["age", "workclass", "fnlwgt" , "education" ,"education-num",
"marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hours-per-week",
"native-country","label"
]
# read data
data = pd.read_csv(urldata,header=None,names=col_names,sep=',')
data.head()

# encode the target as 0/1; the raw labels may carry leading whitespace
data['label'] = data['label'].str.strip().map({'<=50K': 0, '>50K': 1}).astype(int)

Train test split

X = data.loc[:, data.columns != "label"]
y = data.loc[:, data.columns == "label"]

X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.33, stratify=y['label'], random_state=42)
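The stratify argument asks train_test_split to keep the class ratio roughly equal in the training and test sets, which matters for an imbalanced target such as this one. The sketch below illustrates the idea in plain Python; it is not scikit-learn's implementation, and the function stratified_split and the 75/25 toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=42):
    """Split indices so each class contributes about `test_size` of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy target with a 75/25 class imbalance
labels = [0] * 75 + [1] * 25
train_idx, test_idx = stratified_split(labels, test_size=0.33)

train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"positive share: train={train_share:.3f}, test={test_share:.3f}")
```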

Find feature types for later use

int_cols =  X_train.select_dtypes(include=['int']).columns.tolist()
float_cols =  X_train.select_dtypes(include=['float']).columns.tolist()
cat_cols =  X_train.select_dtypes(include=['object']).columns.tolist()

Define Feature selector and set its arguments

optuna_classification_lgb = OptunaShapFeatureSelector(
        # general argument setting        
        verbose=1,
        random_state=0,
        logging_basicConfig = None,
        # feature selection arguments
        n_features=4,
        list_of_obligatory_features_that_must_be_in_model=[],
        list_of_features_to_drop_before_any_selection=[],
        # estimator and its hyperparameter search space
        estimator=lightgbm.LGBMClassifier(),
        estimator_params={
        "max_depth": [4, 9],
        "reg_alpha": [0, 1],

        },
        # shap arguments
        model_output="raw", 
        feature_perturbation="interventional", 
        algorithm="auto", 
        shap_n_jobs=-1, 
        memory_tolerance=-1, 
        feature_names=None, 
        approximate=False, 
        shortcut=False, 
        plot_shap_summary=False,
        save_shap_summary_plot=True,
        path_to_save_plot = './summary_plot.png',
        shap_fig = plt.figure(),
        ## optuna params
        test_size=0.33,
        with_stratified = False,
        performance_metric = 'f1',
        # optuna study init params
        study = optuna.create_study(
            storage = None,
            sampler = TPESampler(),
            pruner= HyperbandPruner(),
            study_name  = None,
            direction = "maximize",
            load_if_exists = False,
            directions  = None,
            ),
        study_optimize_objective_n_trials=10, 

)

Build sklearn Pipeline



pipeline = Pipeline([
            # impute missing values in integer columns
            ('intimputer', MeanMedianImputer(
                imputation_method='median', variables=int_cols)),
            # impute missing values in categorical columns
            ('catimputer', CategoricalImputer(variables=cat_cols)),
            # encode categorical columns as ordinal integers
            ('catencoder', OrdinalEncoder()),
            # feature selection
            ('optuna_classification_lgb', optuna_classification_lgb),
            # classification model
            ('logistic', LogisticRegression())
 ])


pipeline.fit(X_train,y_train)
y_pred = pipeline.predict(X_test)


print('F1 score : ')
print(f1_score(y_test,y_pred))
print('Classification report : ')
print(classification_report(y_test,y_pred))
print('Confusion matrix : ')
print(confusion_matrix(y_test,y_pred))

More examples are available in the examples.

License

Licensed under the BSD 2-Clause License.

Release history

Version	Release date
5.0.4	Feb 7, 2024
5.0.3	Jan 30, 2024
5.0.2	Nov 25, 2023
5.0.1	Nov 21, 2023
5.0.0	Nov 20, 2023
4.7.0	Nov 20, 2023
4.6.0	Sep 20, 2023
4.5.0	Aug 10, 2023
4.3.0	Jul 31, 2023
4.1.0	Jul 23, 2023
3.7.1	May 17, 2023
3.7.0	May 16, 2023
3.6.1	May 17, 2023
3.6.0	May 15, 2023
3.5.0	Apr 25, 2023
3.4.0	Apr 25, 2023
3.3.0	Apr 20, 2023
3.1.0	Feb 27, 2023
2.1.0	Jan 12, 2023
2.0.1	Jan 12, 2023
1.63.0	Sep 28, 2022
1.62.0	Sep 7, 2022
1.61.0	Aug 28, 2022 (this version)
1.60.0	Aug 28, 2022
1.59.0	Aug 16, 2022
1.58.0	Aug 10, 2022
1.57.0	Jul 27, 2022
1.56.0	Jul 27, 2022
1.55.0	Jul 19, 2022
1.54.0	Jul 19, 2022
1.52.0	Jul 19, 2022
1.51.0	Jul 9, 2022
1.30.0	Jul 8, 2022
1.24.0	Jul 8, 2022
0.1.3	Jun 26, 2022
0.1.0	Jun 24, 2022

Download files

Source Distribution

zoish-1.61.0.tar.gz (152.1 kB), uploaded Aug 28, 2022, Source

Built Distribution

zoish-1.61.0-py3-none-any.whl (153.2 kB), uploaded Aug 28, 2022, Python 3

Hashes for zoish-1.61.0.tar.gz
Algorithm	Hash digest
SHA256	`5413bad72208404f09cc43ee07e8d2af6f9404bab8a14c6199f3c542fc616b18`
MD5	`d0c8dd3586f16447e6593154dc483211`
BLAKE2b-256	`0d2867353f289ad4803de4b0a9eca57755a19885674fa43c7b8b699c5598b75e`

Hashes for zoish-1.61.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`346e0ac03567ad3c397a1f3ae0a32cc8c563873caff60b88d8a9ee2730fa3310`
MD5	`9981868a90d819ee353e5a21db4e5f53`
BLAKE2b-256	`e2f755d379603c057faf93fe58c6fa974c327fed7dd154954a2aed8f4f737542`
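
The digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard hashlib module; the helper names file_digest and matches_published_digest are hypothetical, and the example assumes the archive was saved to the current directory under its original file name.

```python
import hashlib
import os


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks and return the hex digest.

    `algorithm` may be "sha256", "md5", or "blake2b-256" (BLAKE2b with a
    32-byte digest, matching the BLAKE2b-256 rows in the tables above).
    """
    if algorithm == "blake2b-256":
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published_digest(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with a published hex string, ignoring case and padding."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


if __name__ == "__main__":
    archive = "zoish-1.61.0.tar.gz"  # assumed download location
    published_sha256 = "5413bad72208404f09cc43ee07e8d2af6f9404bab8a14c6199f3c542fc616b18"
    if os.path.exists(archive):
        print("SHA256 matches:", matches_published_digest(archive, published_sha256))
    else:
        print(f"{archive} not found; download it first")
```

pip can perform a similar check itself when a requirements file pins hashes and is installed with the --require-hashes option.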