Examines the relationships between a new feature and every other feature in a dataset

Project description

feature_test

This package provides tools to test one feature against all other features in a dataset. It is intended to help determine whether a new feature is a good candidate for addition to an established dataset used in a machine learning model. Instead of adding a feature to a dataset, running it through a long training and evaluation process, and then interpreting the results, users can test the feature quickly with feature_test, discard unhelpful features early, and move forward with potentially impactful ones. The package provides the following tools to aid this determination.

Correlation analysis
- Calculate the correlations
- Identify highly correlated features
- Categorize correlation
Chi-Square tests
- Calculate the chi-square statistic and p-value
- Categorize the chi-square result
Recursive feature elimination (RFE)
- Rank features by their importance
Lasso regularization coefficients
- Calculate the coefficients of a linear model using Lasso regularization
Ridge regularization coefficients
- Calculate the coefficients of a linear model using Ridge regularization
Decision tree coefficients
- Calculate the coefficients for each feature using a decision tree model

Special thanks to the AREN team for their guidance and to Annie Tran for open sourcing her feature test code.

Installation

python3 -m pip install feature-test

utils.Utils

class utils.Utils(X, columns)
A collection of functions that enables users to get data on a dataframe or adjust it for testing.

Parameters:

X: pandas.DataFrame
- A pandas.DataFrame containing the dataset
columns: List
- A list of strings corresponding to features (column names) in the dataset.

Methods
_______________________________________________________________________________________

get_columns(X: pd.DataFrame)	Returns a list of column names in a dataframe.
exclude_columns(X: pd.DataFrame, columns: List)	Exclude a list of columns from a dataframe.
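The two helpers above can be approximated with plain pandas. This is only a minimal sketch of the described behavior, not the package's own code; the function bodies and the sample dataframe are assumptions made for illustration.

```python
import pandas as pd


def get_columns(X: pd.DataFrame) -> list:
    """Return the column names of X as a plain Python list."""
    return list(X.columns)


def exclude_columns(X: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of X without the named columns (X itself is unchanged)."""
    return X.drop(columns=columns)


# Hypothetical sample data, used only to exercise the helpers.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
cols = get_columns(df)
trimmed = exclude_columns(df, ["B"])
```

Returning a new dataframe from exclude_columns, rather than modifying X in place, keeps the original dataset available for later tests.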

tests.Correlation

class tests.Correlation(X, new_feature)

A suite of functions that calculates and reports the correlation coefficient between a new feature and all other features in a dataset.

Parameters:

X: pandas.DataFrame
- A pandas.DataFrame containing the dataset
new_feature: string
- A string indicating the column in X to test against all other features

Methods
_______________________________________________________________________________________

calc_corr(X: pd.DataFrame, new_feature: str)	Returns a dataframe with the correlation values for each new_feature/feature combination.
similar_corr(X: pd.DataFrame, new_feature: str)	Returns a list of features highly correlated with the new_feature.
categorize_correlations(X: pd.DataFrame, correlation_threshold: float = 0.6)	Returns a dataframe with the correlations categorized. Possible values are high, medium, and low.
get_correlations(X: pd.DataFrame, new_feature: str)	Returns a dataframe of new_feature/feature combinations, their correlations, and their correlation category.
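To illustrate the idea behind these methods, the sketch below computes the correlation of one column with every other numeric column and then assigns a category. It is not the package's implementation: the function names, the Pearson method, the use of the absolute value, and the 0.3 boundary between low and medium are all assumptions; only the 0.6 default threshold comes from the signature above.

```python
import numpy as np
import pandas as pd


def correlations_for(X: pd.DataFrame, new_feature: str) -> pd.DataFrame:
    """Pearson correlation of new_feature with every other numeric column."""
    numeric = X.select_dtypes(include="number")
    corr = numeric.corr()[new_feature].drop(new_feature)
    return pd.DataFrame({
        "feature_1": new_feature,
        "feature_2": list(corr.index),
        "corr": corr.to_numpy(),
    })


def categorize(corr_df: pd.DataFrame, high: float = 0.6,
               medium: float = 0.3) -> pd.DataFrame:
    """Label each row high/medium/low by absolute correlation.

    The `medium` boundary is an assumed value, not taken from the package.
    """
    strength = corr_df["corr"].abs()
    category = np.select(
        [strength >= high, strength >= medium],
        ["high", "medium"],
        default="low",
    )
    return corr_df.assign(corr_cat=category)


# Hypothetical data: B is a multiple of A, C alternates in sign.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "B": [2, 4, 6, 8, 10, 12],
    "C": [1, -1, 1, -1, 1, -1],
})
report = categorize(correlations_for(df, "A"))
```

A new feature that lands in the high category against an existing feature is likely to carry redundant information, which is the situation these methods are meant to flag.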

tests.ChiSquare

class tests.ChiSquare(X, new_feature)

Calculates the chi-square statistic of the new feature against each categorical feature in a dataset. Also categorizes the chi-square result based on the p-value and the effect size, as measured by Cramér's V.

Parameters:

X: pandas.DataFrame
- A pandas.DataFrame containing the dataset
new_feature: string
- A string indicating the column in X to test against all other features

Methods
_______________________________________________________________________________________

calc_chi_sq(X: pd.DataFrame, new_feature: str)	Returns a dataframe that includes the chi-square result categorization.
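As a rough illustration of the statistics involved, the sketch below runs a chi-square test of independence on two categorical columns with scipy and derives Cramér's V from the result. It is an assumption-laden sketch rather than the package's code: the function name and sample data are hypothetical, and the thresholds the package uses to categorize results are not stated in this document, so no categorization is attempted.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def chi_square_with_cramers_v(X: pd.DataFrame, new_feature: str,
                              other: str) -> dict:
    """Chi-square test of independence plus Cramér's V effect size."""
    # Contingency table of observed counts for the two categorical columns.
    table = pd.crosstab(X[new_feature], X[other])
    # correction=False disables the Yates correction applied to 2x2 tables.
    chi2, p_value, dof, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    # Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
    cramers_v = np.sqrt(chi2 / (n * min_dim)) if min_dim > 0 else float("nan")
    return {"chi_sq": chi2, "p_value": p_value, "cramers_v": cramers_v}


# Hypothetical data in which the two columns are perfectly associated.
df = pd.DataFrame({
    "color": ["r", "r", "b", "b"] * 10,
    "shape": ["x", "x", "y", "y"] * 10,
})
result = chi_square_with_cramers_v(df, "color", "shape")
```

The p-value indicates whether an association is statistically detectable, while Cramér's V (between 0 and 1) indicates how strong it is, which is why the two are used together.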

tests.FeatureSelection

class tests.FeatureSelection(X, target)

Parameters:

X: pandas.DataFrame
- A pandas.DataFrame containing the dataset
target: string
- A string indicating the prediction column

Methods
_______________________________________________________________________________________

rfe_rankings(X: pd.DataFrame, target: str, classifier=None)	Returns a dataframe that includes the recursive feature elimination feature ranking.
lasso_rankings(X: pd.DataFrame, target: str)	Returns a dataframe that includes the linear model coefficients for features after lasso regularization.
ridge_coefficients(X: pd.DataFrame, target: str)	Returns a dataframe that includes the linear model coefficients for features after ridge regularization.
dtree_coefficients(X: pd.DataFrame, target: str)	Returns a dataframe that includes the decision tree model coefficients for features.
run_feature_classifiers(X: pd.DataFrame, target: str)	Returns a dataframe that includes the results for rfe_rankings, lasso_rankings, ridge_rankings, and dtree_rankings.
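The techniques behind these methods can be sketched with scikit-learn. The example below is not the package's implementation: the function name, the choice of estimators, the alpha values, and the synthetic data are assumptions for illustration. Note that scikit-learn decision trees expose impurity-based feature importances rather than coefficients, which is presumably what the decision tree results refer to.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor


def rank_features(X: pd.DataFrame, target: str) -> pd.DataFrame:
    """Score every non-target column with four feature-selection methods."""
    features = X.drop(columns=[target])
    y = X[target]

    # RFE drops the weakest feature each round; rank 1 is the last survivor.
    rfe = RFE(LinearRegression(), n_features_to_select=1).fit(features, y)
    # Lasso (L1) can shrink coefficients of weak features to exactly zero.
    lasso = Lasso(alpha=0.1).fit(features, y)
    # Ridge (L2) shrinks coefficients toward zero without eliminating them.
    ridge = Ridge(alpha=1.0).fit(features, y)
    # Tree importances measure each feature's share of impurity reduction.
    tree = DecisionTreeRegressor(random_state=0).fit(features, y)

    return pd.DataFrame({
        "feature": list(features.columns),
        "rfe_rank": rfe.ranking_,
        "lasso_coef": lasso.coef_,
        "ridge_coef": ridge.coef_,
        "tree_importance": tree.feature_importances_,
    })


# Hypothetical data: A depends strongly on B; C and D are pure noise.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "B": rng.normal(size=200),
    "C": rng.normal(size=200),
    "D": rng.normal(size=200),
})
df["A"] = 3 * df["B"] + rng.normal(scale=0.1, size=200)
ranked = rank_features(df, "A")
```

Comparing several methods side by side is useful because each has different blind spots: linear coefficients miss non-linear effects, while tree importances can be unstable when features are correlated.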

Examples

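from pandas import util

from feature_test.utils import Utils
from feature_test.feature_tests import Correlation, FeatureSelection, ChiSq

# Create a test dataset of random values.
# Note: pandas.util.testing is deprecated since pandas 1.0, so this helper
# may be unavailable in recent pandas releases.
df = util.testing.makeDataFrame()
df.head()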

feature	A	B	C	D
lRhANYYD2r	0.572559	-1.409978	0.687618	-0.923502
YzYG07kY1O	0.145629	-1.446946	-0.003526	0.304385
cT3KK078Gt	-1.007378	1.263980	1.107897	0.844689
JW4Kg2EGVo	0.536701	-1.477372	-0.866873	1.539458
2mucO1cf2Z	-1.101875	0.518555	0.384916	-0.031403

c = Utils.get_columns(df)

['A', 'B', 'C', 'D']

corr_df = Correlation.calc_corr(df, 'A')
corr_df

feature_1	feature_2	corr
A	B	0.081662
A	C	0.203858
A	D	0.064999

rep_df = Correlation.get_correlations(df, 'A')
rep_df

feature_1	feature_2	corr	corr_cat
A	B	0.071466	low
A	C	0.105306	low
A	D	0.121130	low

ChiSq.calc_chi_sq(df, 'A')

feature_1	feature_2	chi_sq_cat
A	B	NOT SIGNIFICANT
A	C	NOT SIGNIFICANT
A	D	NOT SIGNIFICANT

FeatureSelection.rfe_rankings(df, 'A')

feature	rfe_rank
B	3
C	1
D	2

FeatureSelection.lasso_rankings(df, 'A')

feature	lasso_coef	lasso_importance
B	0.0	2.0
C	0.0	2.0
D	-0.0	2.0

FeatureSelection.ridge_rankings(df, 'A')

feature	ridge_coef	ridge_importance
B	0.050871	3.0
C	0.096362	1.0
D	-0.095220	2.0

FeatureSelection.dtree_rankings(df, 'A')

feature	random_forest_coefficient	random_forest_importance
B	0.323091	2.0
C	0.269673	3.0
D	0.407236	1.0

FeatureSelection.run_feature_classifiers(df, 'A')

feature	rfe_rank	lasso_coef	lasso_importance	ridge_coef	ridge_importance	random_forest_coefficient	random_forest_importance
B	3	0.0	2.0	0.050871	3.0	0.323091	2.0
C	1	0.0	2.0	0.096362	1.0	0.269673	3.0
D	2	-0.0	2.0	-0.095220	2.0	0.407236	1.0

Release history

Version	Release date
0.1.21	Mar 3, 2023
0.1.20	Mar 3, 2023
0.1.19	Mar 3, 2023
0.1.18	Mar 3, 2023
0.1.17	Mar 3, 2023
0.1.16	Feb 22, 2023
0.1.15	Mar 3, 2022
0.1.14	Mar 3, 2022
0.1.13 (this version)	Mar 2, 2022

Download files

Distribution	File	Size	Uploaded
Source	feature-test-0.1.13.tar.gz	9.7 kB	Mar 2, 2022
Built (Python 3)	feature_test-0.1.13-py3-none-any.whl	9.0 kB	Mar 2, 2022

Hashes for feature-test-0.1.13.tar.gz

Algorithm	Hash digest
SHA256	`e276b66d313ebb938754ad725d33267c03cc86ef52c216c5d82e15a9413b32e6`
MD5	`cb85dfd0ee75bf84b4f6a6c929862014`
BLAKE2b-256	`10409f06772b6227ad43947847e44a6fa063ad8d2854387c558e987540ac94cf`

Hashes for feature_test-0.1.13-py3-none-any.whl

Algorithm	Hash digest
SHA256	`eff0e7c8067a1cbbd3709c581c4212d286438ee80a86bca198e66f446c7a382b`
MD5	`96c5baea8280135568186332d7be6f38`
BLAKE2b-256	`2c1ababc8e1d9687838c61c60c5f806b8ca1ff48f19abb7a436ddd8ca3f10c44`