# Multi-Collinearity Reduction

This package can be used to remove features from your dataset in a way that helps reduce multi-collinearity.

> [!NOTE]
> If you're reading this on the PyPI website, please head over to the README file on the GitHub repo for better readability.
## How to Use :rocket:
- Import the `MultiCollinearity` class into your code:

  ```python
  from multicollinearity.feature_reduction import MultiCollinearity
  ```
- Initialize the class object:

  ```python
  mc = MultiCollinearity(df, target)
  ```

  The first argument must be the pandas dataframe, and the second argument is the name of the target column in that dataframe.

  Here are all parameters available in this class:

  - `df`: Name of the pandas dataframe
  - `y_col`: Name of the target column in the input dataframe
  - `corr_plot`: If set to True, two correlation heatmaps will be created (default: False)
  - `verbose`: Set to False if you don't want any information about which features are dropped and why (default: True)
- Perform some feature reduction based on pairwise correlations:

  ```python
  df_temp = mc.pairwise_corr()
  ```

  You can inspect the content of `df_temp`, which should have fewer columns if any features were dropped due to high correlation with another feature.

  Here are all parameters available in this function:

  - `min_vars_to_keep`: The feature reduction process stops once the number of features left in the dataframe reaches this value. Change it according to your requirements. (default: 10)
  - `corr_tol`: Threshold used to decide whether a pair of features is heavily correlated. If the absolute correlation between two features is higher than this threshold, they are considered highly correlated and one of them will be dropped. (default: .9)
- Finally, perform further feature reduction based on multi-collinearity:

  ```python
  df_final = mc.multi_collin()
  ```

  The final dataframe will have the reduced set of features that you can then use for training a model.

  Here are all parameters available in this function:

  - `cond_index_tol`: The condition index threshold. If the condition index is higher than this threshold, feature reduction will commence, and it stops once the condition index falls below this threshold. (default: 30)
  - `min_vars_to_keep`: Same as above. The feature reduction process stops once the number of features left in the dataframe reaches this value. (default: 10)
## How it Works :gear:
The original idea was presented (by yours truly) at PyData 2016. The video is available online, but the sound quality is not very good.
The feature reduction is performed in two (sequential) steps:
- Feature Reduction based on Pairwise Correlations

  Consider the following example to understand how this works: two features, x1 and x3, have an absolute correlation coefficient of .91. This value is higher than the threshold of .90, so one of these two features will be dropped.

  To decide which feature should be dropped, we look at their correlation with the target column. In the example above, x1 has a higher correlation with the target, so it will be kept while x3 will be dropped.

  This process continues until either no pair of features exceeds the pairwise correlation threshold or the minimum number of features (to keep) is reached.
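The pairwise step can be sketched in plain pandas. This is an illustrative reimplementation of one iteration, not the package's actual code; the toy columns, seed, and data are made up for the example:

```python
import numpy as np
import pandas as pd

# Toy data: x3 is a noisy near-duplicate of x1, x2 is independent noise.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x2": rng.normal(size=n),
    "x3": x1 + 0.3 * rng.normal(size=n),  # highly correlated with x1
})
y = pd.Series(2 * x1 + rng.normal(size=n), name="target")

corr_tol = 0.9
corr = df.corr().abs()

# Look only at the upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
col_a, col_b = upper.stack().idxmax()

if upper.stack().max() > corr_tol:
    # Keep whichever feature of the pair correlates more strongly with the target.
    drop = col_a if abs(df[col_a].corr(y)) < abs(df[col_b].corr(y)) else col_b
    df = df.drop(columns=drop)

print(df.columns.tolist())  # the near-duplicate of x1 is gone
```

A full implementation would repeat this until no pair exceeds `corr_tol` or the minimum feature count is hit.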
- Feature Reduction based on Multi-Collinearity

  First, eigenvalue decomposition is performed on the correlation matrix. There will be as many eigenvalues and eigenvectors as there are features. If multi-collinearity is present in the dataset, at least one of these eigenvectors (aka directions) will be redundant, i.e., it explains almost no variance in the data. The redundant eigenvector can be identified by calculating a condition index for each eigenvector:

  Condition Index_i = (max(eigenvalue) / eigenvalue_i) ** .5

  The eigenvector with the highest condition index (above the threshold) is one of the culprits causing multi-collinearity in the data, so we need to discard that particular direction. But since this is an eigenvector (and not a feature in the original dataframe), we can't remove it from the original dataframe. Instead, we find the feature with the highest factor loading on that eigenvector, i.e., the one that leans the heaviest on it, and discard that feature from the dataset. We repeat this process until one of the two stopping criteria is met: `min_vars_to_keep` or `cond_index_tol`.
  Consider the following example to understand how this works: we identify u4 as the redundant eigenvector. x3 has the highest loading on u4, so we discard x3 and then iterate.
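One round of the multi-collinearity step can likewise be sketched with NumPy's eigen-decomposition. Again, this is a hedged, self-contained illustration with made-up data, not the package's code:

```python
import numpy as np
import pandas as pd

# Toy data: x3 is a near-linear combination of x1 and x2.
rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "x3": x1 + x2 + 0.05 * rng.normal(size=n),  # near-redundant feature
    "x4": rng.normal(size=n),                   # independent feature
})

corr = df.corr().to_numpy()
eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order

# Condition index of each eigen-direction: sqrt(lambda_max / lambda_i).
cond_index = np.sqrt(eigvals.max() / eigvals)

cond_index_tol = 30
worst = int(np.argmax(cond_index))  # the redundant direction (smallest eigenvalue)
dropped = None
if cond_index[worst] > cond_index_tol:
    # The feature with the largest absolute loading on that eigenvector
    # leans the heaviest on the redundant direction, so drop it.
    dropped = df.columns[int(np.argmax(np.abs(eigvecs[:, worst])))]
    df = df.drop(columns=dropped)

print("dropped:", dropped)
```

Here the near-redundant x3 drives the condition index above the threshold and carries the largest loading on the offending eigenvector, so it is the one discarded.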
> [!NOTE]
> Check out the `example.ipynb` notebook in this repo, which demonstrates the functionality of this package with the Boston Housing dataset.
## TODO :ballot_box_with_check:
- Add test cases.
- Make it possible to run this without the presence of the target variable, i.e., do this in an unsupervised way.
## SOURCE :book:
Link: https://a.co/d/7PyVHSZ