Automated Feature Selection & Feature Importance Calculation Framework

Automated Feature Selection & Importance

autofeatselect is a Python library that automates and accelerates feature selection for machine learning projects.

It calculates feature importance scores and rankings with several methods, and it detects and removes highly correlated variables.

Installation

You can install from PyPI:

pip install autofeatselect

Key Features

autofeatselect offers a wide range of features to support feature selection and importance analysis:

  • Automated Feature Selection: Various automated feature selection methods, such as LGBM Importance, XGBoost Importance, RFECV, and so on.
  • Feature Importance Analysis: Calculation and visualization of feature importance scores for different algorithms separately.
  • Correlation Analysis: Correlation analysis to identify and drop highly correlated features automatically.

Full List of Methods

Correlation Calculation Methods

  • Pearson, Spearman & Kendall Correlation Coefficients for Continuous Variables
  • Cramer's V Scores for Categorical Variables
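
To make the two kinds of correlation measure concrete, here is a minimal, dependency-free sketch of the Pearson coefficient (continuous variables) and Cramer's V (categorical variables). The function names `pearson` and `cramers_v` are illustrative and are not part of autofeatselect; the library's own implementation may differ, for example by applying a bias correction to Cramer's V.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cramers_v(a, b):
    """Cramer's V (without bias correction) of two categorical sequences."""
    n = len(a)
    observed = Counter(zip(a, b))          # contingency table counts
    row_totals, col_totals = Counter(a), Counter(b)
    min_dim = min(len(row_totals), len(col_totals)) - 1
    if min_dim == 0:
        return 0.0                          # a constant column carries no association
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            chi2 += (observed.get((r, c), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * min_dim))
```

Both functions return values whose magnitude can be compared against a threshold such as 0.9: Pearson ranges from -1 to 1, while Cramer's V ranges from 0 to 1.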

Feature Selection Methods

  • LightGBM Feature Importance Scores
  • XGBoost Feature Importance Scores (with Target Encoding for Categorical Variables)
  • Random Forest Feature Importance Scores (with Target Encoding for Categorical Variables)
  • LassoCV Coefficients (with One Hot Encoding for Categorical Variables)
  • Permutation Importance Scores (LightGBM as the estimator)
  • RFECV Rankings (LightGBM as the estimator)
  • Boruta Rankings (Random Forest as the estimator)
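
As an illustration of one method from the list above, the following sketch shows the general idea behind permutation importance: shuffle one column at a time and measure how much the score drops. It is a simplified, dependency-free version; the `model` callable, the toy data, and the accuracy metric are assumptions made for this example, whereas the library itself uses LightGBM as the estimator.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=5, seed=24):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the label depends only on the first column.
X = [[i % 2, (i * 7) % 5] for i in range(40)]
y = [row[0] for row in X]
model = lambda row: row[0]  # hypothetical stand-in for a fitted estimator

scores = permutation_importance(model, X, y)
```

In this toy setup the second column never influences the prediction, so shuffling it leaves the accuracy unchanged and its importance is zero, while shuffling the first column lowers the accuracy.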

Usage

  • Calculating Correlations & Detecting Highly Correlated Features
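# Assumed import path (not shown in the original snippet); adjust if your installed version differs.
from autofeatselect import CorrelationCalculator, FeatureSelector, AutoFeatureSelect

# X_train is assumed to be a pandas DataFrame; num_feats and cat_feats are lists of its column names.
num_static_feats = ['x1', 'x2']  # Static features to be kept regardless of the correlation results.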

corr_df_num, remove_list_num = CorrelationCalculator.numeric_correlations(X=X_train,
                                                                          features=num_feats, #List of continuous features
                                                                          static_features=num_static_feats,
                                                                          corr_method='pearson',
                                                                          threshold=0.9)

corr_df_cat, remove_list_cat = CorrelationCalculator.categorical_correlations(X=X_train,
                                                                              features=cat_feats, #List of categorical features
                                                                              static_features=None,
                                                                              threshold=0.9)
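
The roles of `threshold` and `static_features` in the calls above can be sketched as follows. This is a hypothetical selection rule written for illustration, not the library's actual logic: for every pair whose absolute correlation exceeds the threshold, one feature is marked for removal, and static features are never removed. The helper name `correlated_to_remove` and the dictionary format of `corr` are assumptions.

```python
from itertools import combinations


def correlated_to_remove(corr, features, static_features=None, threshold=0.9):
    """Pick features to drop from highly correlated pairs.

    corr maps frozenset({a, b}) to the correlation of a and b.
    Static features are never dropped, so a pair of two static
    features is left untouched even if it exceeds the threshold.
    """
    static = set(static_features or [])
    removed = []
    for a, b in combinations(features, 2):
        if a in removed or b in removed:
            continue  # this pair is already broken up
        if abs(corr[frozenset((a, b))]) > threshold:
            # Prefer dropping the later feature; fall back to the earlier one.
            candidate = b if b not in static else a
            if candidate not in static:
                removed.append(candidate)
    return removed


corr = {  # made-up correlations, for illustration only
    frozenset(("x1", "x2")): 0.95,
    frozenset(("x1", "x3")): 0.20,
    frozenset(("x2", "x3")): 0.10,
}
features = ["x1", "x2", "x3"]
```

With these made-up values, `correlated_to_remove(corr, features)` drops `x2`, while passing `static_features=["x2"]` protects it and drops `x1` instead.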
  • Calculating Single Feature Importance Score & Plot Results
#Create Feature Selection Object
feat_selector = FeatureSelector(modeling_type='classification', # 'classification' or 'regression'
                                X_train=X_train,
                                y_train=y_train,
                                X_test=None,
                                y_test=None,
                                numeric_columns=num_feats,
                                categorical_columns=cat_feats,
                                seed=24)

#Train LightGBM model & return importance results as pd.DataFrame
lgbm_importance_df = feat_selector.lgbm_importance(hyperparam_dict=None,
                                                   objective=None,
                                                   return_plot=True)


#Apply RFECV using LightGBM as the estimator & return importance results as pd.DataFrame
lgbm_hyperparams = {'learning_rate': 0.01, 'max_depth': 6, 'n_estimators': 400,
                    'num_leaves': 30, 'random_state':24, 'importance_type':'gain'
                   }
rfecv_hyperparams = {'step':3, 'min_features_to_select':5, 'cv':5}

rfecv_importance_df = feat_selector.rfecv_importance(lgbm_hyperparams=lgbm_hyperparams,
                                                     rfecv_hyperparams=rfecv_hyperparams,
                                                     return_plot=False)
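
The `step` and `min_features_to_select` entries in `rfecv_hyperparams` match the parameter names of scikit-learn's RFECV, so they presumably control the elimination schedule in the same way: `step` features are removed per iteration until `min_features_to_select` remain. The small helper below (a hypothetical name, written only for illustration) lists the subset sizes such a schedule would visit.

```python
def rfe_schedule(n_features, step=3, min_features_to_select=5):
    """Feature-subset sizes visited by recursive elimination, largest first."""
    sizes = [n_features]
    while sizes[-1] > min_features_to_select:
        # Never remove more features than needed to reach the minimum.
        n_remove = min(step, sizes[-1] - min_features_to_select)
        sizes.append(sizes[-1] - n_remove)
    return sizes
```

For example, `rfe_schedule(20)` gives `[20, 17, 14, 11, 8, 5]`. A larger `step` means fewer model fits per cross-validation fold, at the cost of a coarser search over subset sizes.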
  • Automated Correlation Analysis & Applying Multiple Feature Selection Methods
#Automated correlation analysis & applying multiple feature selection methods
feat_selector = AutoFeatureSelect(modeling_type='classification',
                                  X_train=X_train,
                                  y_train=y_train,
                                  X_test=X_test,
                                  y_test=y_test,
                                  numeric_columns=num_feats,
                                  categorical_columns=cat_feats,
                                  seed=24)

corr_features = feat_selector.calculate_correlated_features(static_features=None,
                                                            num_threshold=0.9,
                                                            cat_threshold=0.9)

feat_selector.drop_correlated_features()

final_importance_df = feat_selector.apply_feature_selection(selection_methods=['lgbm', 'xgb', 'perimp', 'rfecv', 'boruta'])
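
The layout of `final_importance_df` is not described here. One common way to compare results from several methods, whose raw scores are on different scales, is to convert each method's scores to ranks and average them. The sketch below shows that idea under stated assumptions: the function names and the scores are made up for illustration, and this is not necessarily how the library combines its results.

```python
def rank_features(scores):
    """Map feature -> rank (1 = most important) from a feature -> score dict."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {feat: pos for pos, feat in enumerate(ordered, start=1)}


def mean_rank(importances_by_method):
    """Average each feature's rank across methods; lower is better."""
    ranks = [rank_features(s) for s in importances_by_method.values()]
    features = ranks[0].keys()
    avg = {f: sum(r[f] for r in ranks) / len(ranks) for f in features}
    return sorted(avg.items(), key=lambda item: item[1])


importances = {  # made-up scores, for illustration only
    "lgbm": {"x1": 120.0, "x2": 45.0, "x3": 3.0},
    "xgb": {"x1": 0.50, "x2": 0.10, "x3": 0.40},
    "perimp": {"x1": 0.20, "x2": 0.01, "x3": 0.30},
}

ranking = mean_rank(importances)
```

Here `x1` comes first, followed by `x3` and then `x2`, even though the three methods disagree on the order of individual features.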

License

This project is completely free, open-source and licensed under the MIT license.
