
Examines the relationships between a new feature and every other feature in a dataset.

Project description

feature_test

This package provides tools to test one feature against all other features in a dataset. It is intended to help determine whether a new feature is a good candidate for addition to an established dataset used in a machine learning model. Instead of adding the feature to the dataset, running a long training and evaluation process, and then interpreting the results, users can test the feature quickly with feature_test, discard features that add little, and move forward with potentially impactful ones. The package provides the following tools to aid this determination:

  • Correlation analysis
    • Calculate the correlations
    • Identify highly correlated features
    • Categorize correlations
  • Chi-Square tests
    • Calculate the chi-square statistic and p-value
    • Categorize the chi-square result
  • Recursive feature elimination (RFE)
    • Rank features by their importance
  • Lasso regularization coefficients
    • Calculate the coefficients of a linear model using Lasso regularization
  • Ridge regularization coefficients
    • Calculate the coefficients of a linear model using Ridge regularization
  • Decision tree coefficients
    • Calculate the coefficients for each feature using a decision tree model

Special thanks to the AREN team for their guidance and to Annie Tran for open sourcing her feature test code.

Installation

python3 -m pip install feature-test

utils.Utils

class utils.Utils(X, columns)
A collection of functions that enable users to inspect a dataframe or adjust it for testing.

Parameters:

  • X: pandas.DataFrame
    • A pandas.DataFrame containing the dataset
  • columns: List
    • A list of strings corresponding to features (column names) in the dataset.
Methods
_______________________________________________________________________________________
get_columns(X: pd.DataFrame) Returns a list of column names in a dataframe.
exclude_columns(X: pd.DataFrame, columns: List) Exclude a list of columns from a dataframe.
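For orientation, the two helpers behave roughly like the plain pandas calls below. This is a sketch of equivalent behavior, not the package's source: the function bodies are assumptions based only on the descriptions above.

```python
import pandas as pd


def get_columns(X: pd.DataFrame) -> list:
    """Return the column names of X as a plain list (assumed behavior)."""
    return list(X.columns)


def exclude_columns(X: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of X without the listed columns (assumed behavior)."""
    return X.drop(columns=columns)


df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
all_columns = get_columns(df)
remaining = get_columns(exclude_columns(df, ["B"]))
print(all_columns)
print(remaining)
```

Note that `DataFrame.drop` returns a new dataframe, so the original `df` is left unchanged in this sketch.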

tests.Correlation

class tests.Correlation(X, new_feature)

A suite of functions that calculates and reports the correlation coefficient between the new feature and every other feature in a dataset.

Parameters:

  • X: pandas.DataFrame
    • A pandas.DataFrame containing the dataset
  • new_feature: string
    • A string indicating the column in X to test against all other features
Methods
_______________________________________________________________________________________
calc_corr(X: pd.DataFrame, new_feature: str) Returns a dataframe with the correlation values for each new_feature/feature combination.
similar_corr(X: pd.DataFrame, new_feature: str) Returns a list of features highly correlated with the new_feature.
categorize_correlations(X: pd.DataFrame, correlation_threshold: float = 0.6) Returns a dataframe with the correlations categorized. Possible values are high, medium, and low.
get_correlations(X: pd.DataFrame, new_feature: str) Returns a dataframe of new_feature/feature combinations, their correlations, and their correlation category.
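The kind of report these methods describe can be approximated with pandas alone. The sketch below is an illustration under assumptions, not the package's implementation: the function name `correlations_with`, the use of Pearson correlation, categorizing on the absolute value, and the `medium` cut-off of 0.3 are all hypothetical. Only the 0.6 default comes from the documented `correlation_threshold`.

```python
import numpy as np
import pandas as pd


def correlations_with(X, new_feature, high=0.6, medium=0.3):
    """Correlate new_feature with every other numeric column and label the result.

    `high` mirrors the documented correlation_threshold default (0.6).
    `medium` is a hypothetical cut-off chosen only for this sketch.
    """
    # Pearson correlation matrix, reduced to the new feature's column
    corr = X.select_dtypes("number").corr()[new_feature].drop(new_feature)
    out = pd.DataFrame({
        "feature_1": new_feature,
        "feature_2": list(corr.index),
        "corr": corr.to_numpy(),
    })
    # Categorize on the magnitude, so strong negative correlations count as high
    strength = out["corr"].abs()
    out["corr_cat"] = np.select(
        [strength >= high, strength >= medium], ["high", "medium"], default="low"
    )
    return out


df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],   # perfectly correlated with A
    "C": [5, 3, 4, 1, 2],    # strongly negatively correlated with A
    "D": [1, -1, 1, -1, 1],  # uncorrelated with A
})
report = correlations_with(df, "A")
print(report)
```

A new feature that lands in the high category against an existing feature is largely redundant with it, which is the situation `similar_corr` is described as flagging.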

tests.ChiSquare

class tests.ChiSquare(X, new_feature)

Calculates the chi-square statistic of the new feature against each categorical feature in a dataset. Also categorizes the chi-square result based on the p-value and the effect size as measured by Cramér's V.

Parameters:

  • X: pandas.DataFrame
    • A pandas.DataFrame containing the dataset
  • new_feature: string
    • A string indicating the column in X to test against all other features
Methods
_______________________________________________________________________________________
calc_chi_sq(X: pd.DataFrame, new_feature: str) Returns a dataframe that includes the chi-square result categorization.
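To make the chi-square and Cramér's V calculation concrete, here is a minimal sketch using pandas and SciPy directly. It is not the package's code: the helper name `chi_square_with_cramers_v`, the 0.05 significance level, and the returned dictionary are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def chi_square_with_cramers_v(X, new_feature, other, alpha=0.05):
    """Chi-square test of independence plus Cramér's V for two categorical columns.

    `alpha` is a hypothetical significance level chosen for this sketch.
    """
    # Contingency table of observed counts
    table = pd.crosstab(X[new_feature], X[other])
    chi2, p_value, dof, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    # Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
    cramers_v = float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else 0.0
    return {
        "chi_sq": float(chi2),
        "p_value": float(p_value),
        "cramers_v": cramers_v,
        "significant": bool(p_value < alpha),
    }


df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue"] * 10,
    "size": ["S", "S", "L", "L"] * 10,   # fully determined by color
    "shape": ["x", "y"] * 20,            # independent of color
})
dependent = chi_square_with_cramers_v(df, "color", "size")
independent = chi_square_with_cramers_v(df, "color", "shape")
print(dependent)
print(independent)
```

Reporting the effect size alongside the p-value matters because, with enough rows, even a very weak association can produce a small p-value.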

tests.FeatureSelection

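class tests.FeatureSelection(X, target)

Ranks the features in a dataset by how useful they are for predicting the target column, using recursive feature elimination, Lasso and Ridge regularization, and a decision tree model.
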
Parameters:

  • X: pandas.DataFrame
    • A pandas.DataFrame containing the dataset
  • target: string
    • A string indicating the prediction column
Methods
_______________________________________________________________________________________
rfe_rankings(X: pd.DataFrame, target: str, classifier=None) Returns a dataframe that includes the recursive feature elimination feature ranking.
lasso_rankings(X: pd.DataFrame, target: str) Returns a dataframe that includes the linear model coefficients for features after lasso regularization.
ridge_rankings(X: pd.DataFrame, target: str) Returns a dataframe that includes the linear model coefficients for features after ridge regularization.
dtree_rankings(X: pd.DataFrame, target: str) Returns a dataframe that includes the decision tree model coefficients for features.
run_feature_classifiers(X: pd.DataFrame, target: str) Returns a dataframe that includes the results for rfe_rankings, lasso_rankings, ridge_rankings, and dtree_rankings.
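The four rankings can be reproduced in spirit with scikit-learn. The sketch below is only an approximation under stated assumptions: it treats the target as continuous and uses regression estimators, the function name `rank_features`, the `alpha` values, and the output column names are hypothetical, and the package itself may use different estimators or settings.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor


def rank_features(X: pd.DataFrame, target: str) -> pd.DataFrame:
    """Score every non-target column four ways (assumed regression setting)."""
    features = X.drop(columns=[target])
    y = X[target]
    # RFE drops the weakest feature one at a time; rank 1 is the last survivor
    rfe = RFE(LinearRegression(), n_features_to_select=1).fit(features, y)
    lasso = Lasso(alpha=0.1).fit(features, y)   # L1 penalty can zero out coefficients
    ridge = Ridge(alpha=1.0).fit(features, y)   # L2 penalty shrinks coefficients
    tree = DecisionTreeRegressor(random_state=0).fit(features, y)
    return pd.DataFrame({
        "feature": list(features.columns),
        "rfe_rank": rfe.ranking_,
        "lasso_coef": lasso.coef_,
        "ridge_coef": ridge.coef_,
        "dtree_importance": tree.feature_importances_,
    })


# Synthetic data: A depends strongly on B, weakly on C, and not at all on D
rng = np.random.default_rng(0)
values = rng.standard_normal((200, 3))
df = pd.DataFrame(values, columns=["B", "C", "D"])
df["A"] = 3 * df["B"] + 0.5 * df["C"] + 0.1 * rng.standard_normal(200)

rankings = rank_features(df, "A")
print(rankings)
```

Comparing the columns side by side is the point of a combined report: a feature that ranks poorly under every method is a reasonable candidate to discard.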

Examples


from pandas import util

from feature_test.utils import Utils
from feature_test.feature_tests import Correlation, FeatureSelection, ChiSq

# Create a test dataset.
# Note: pandas.util.testing was deprecated in pandas 1.0 and removed in 2.0,
# so this call requires an older pandas release.
df = util.testing.makeDataFrame()
df.head()

feature             A          B          C          D
lRhANYYD2r   0.572559  -1.409978   0.687618  -0.923502
YzYG07kY1O   0.145629  -1.446946  -0.003526   0.304385
cT3KK078Gt  -1.007378   1.263980   1.107897   0.844689
JW4Kg2EGVo   0.536701  -1.477372  -0.866873   1.539458
2mucO1cf2Z  -1.101875   0.518555   0.384916  -0.031403

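On pandas releases where `pandas.util.testing` is no longer available, a comparable random dataset can be built with NumPy. This is a substitute sketch, not part of feature_test: the seed and the row count of 30 are arbitrary choices, and the result has a default integer index rather than the random string index shown above.

```python
import numpy as np
import pandas as pd

# Four columns of standard-normal noise, named A to D like the example dataset
rng = np.random.default_rng(seed=42)
df = pd.DataFrame(rng.standard_normal((30, 4)), columns=list("ABCD"))
print(df.head())
```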
c = Utils.get_columns(df)

['A', 'B', 'C', 'D']

corr_df = Correlation.calc_corr(df, 'A')
corr_df

feature_1  feature_2  corr
A          B          0.081662
A          C          0.203858
A          D          0.064999

rep_df = Correlation.get_correlations(df, 'A')
rep_df

feature_1  feature_2  corr      corr_cat
A          B          0.071466  low
A          C          0.105306  low
A          D          0.121130  low

ChiSq.calc_chi_sq(df, 'A')

feature_1  feature_2  chi_sq_cat
A          B          NOT SIGNIFICANT
A          C          NOT SIGNIFICANT
A          D          NOT SIGNIFICANT

FeatureSelection.rfe_rankings(df, 'A')

feature  rfe_rank
B        3
C        1
D        2

FeatureSelection.lasso_rankings(df, 'A')

feature  lasso_coef  lasso_importance
B        0.0         2.0
C        0.0         2.0
D        -0.0        2.0

FeatureSelection.ridge_rankings(df, 'A')

feature  ridge_coef  ridge_importance
B        0.050871    3.0
C        0.096362    1.0
D        -0.095220   2.0

FeatureSelection.dtree_rankings(df, 'A')

feature  random_forest_coefficient  random_forest_importance
B        0.323091                   2.0
C        0.269673                   3.0
D        0.407236                   1.0

FeatureSelection.run_feature_classifiers(df, 'A')

feature  rfe_rank  lasso_coef  lasso_importance  ridge_coef  ridge_importance  random_forest_coefficient  random_forest_importance
B        3         0.0         2.0               0.050871    3.0               0.323091                   2.0
C        1         0.0         2.0               0.096362    1.0               0.269673                   3.0
D        2         -0.0        2.0               -0.095220   2.0               0.407236                   1.0

Download files

Source Distribution: feature-test-0.1.13.tar.gz (9.7 kB)

Built Distribution: feature_test-0.1.13-py3-none-any.whl (9.0 kB, Python 3)
