
Feature reduction using multi-collinearity


Multi-Collinearity Reduction

This package can be used to remove features from your dataset in a way that helps reduce multi-collinearity.

How to Use :rocket:

  1. Import the MultiCollinearity class into your code:

    from multicollinearity.feature_reduction import MultiCollinearity
    
  2. Initialize the class object:

    mc = MultiCollinearity(df, target)
    

The first argument is the pandas dataframe itself, and the second argument is the name of the target column in that dataframe.

Here are all parameters available in this class:

  • df: The input pandas dataframe
  • y_col: Name of the target column in the input dataframe
  • corr_plot: If set to True, two correlation heatmaps will be created (default: False)
  • verbose: Set to False if you don't want any information about which features are dropped and why (default: True)
  3. Perform some feature reduction based on pairwise correlations:

    df_temp = mc.pairwise_corr()
    

You can inspect the content of df_temp, which should have fewer columns if any features were dropped due to high correlation with another feature.

Here are all parameters available in this function:

  • min_vars_to_keep: The feature reduction process stops once the number of features left in the dataframe reaches this value. Change it according to your requirements. (default: 10)
  • corr_tol: Threshold for deciding whether a pair of features is highly correlated. If the absolute correlation between two features exceeds this value, they are considered highly correlated and one of them will be dropped. (default: .9)
  4. Finally, perform further feature reduction based on multi-collinearity:

    df_final = mc.multi_collin()
    

The final dataframe will have the reduced set of features that you can then use for training a model.

Here are all parameters available in this function:

  • cond_index_tol: The condition index threshold. If the condition index is higher than this value, feature reduction will commence, and it will stop once the condition index falls below this value. (default: 30)
  • min_vars_to_keep: Same as above: the feature reduction process stops once the number of features left in the dataframe reaches this value. Change it according to your requirements. (default: 10)

How it Works :gear:

The original idea was presented (by yours truly) at PyData 2016. The video is available online, but the sound quality is not very good.

The feature reduction is performed in two (sequential) steps:

  1. Feature Reduction based on Pairwise Correlations

Consider the following example to understand how this works: Two features, x1 and x3, have an absolute correlation coefficient of .91. This value is higher than the threshold value of .90, so one of these two features will be dropped.

Feature Reduction using Pairwise Correlations

To decide which feature should be dropped, we look at their correlation with the target column. In the example above, x1 has a higher correlation with the target, so it will be kept while x3 will be dropped.

This process continues until either all remaining pairwise correlations are below the threshold or the minimum number of features (to keep) is reached.
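The pairwise step described above can be sketched in plain pandas. This is only an illustration of the logic, not the package's actual implementation; the function name drop_pairwise_correlated and its internals are made up:

```python
import numpy as np
import pandas as pd

def drop_pairwise_correlated(df, y_col, corr_tol=0.9, min_vars_to_keep=10):
    """For each pair of features whose absolute correlation exceeds
    corr_tol, drop the one less correlated with the target."""
    features = [c for c in df.columns if c != y_col]
    corr = df[features].corr().abs()
    target_corr = df[features].corrwith(df[y_col]).abs()
    dropped = set()
    # walk the upper triangle of the correlation matrix
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            # stop once the minimum number of features is reached
            if len(features) - len(dropped) <= min_vars_to_keep:
                return df.drop(columns=list(dropped))
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > corr_tol:
                # keep whichever of the pair correlates more with the target
                dropped.add(a if target_corr[a] < target_corr[b] else b)
    return df.drop(columns=list(dropped))
```

In the x1/x3 example above, x3 would land in the dropped set because its correlation with the target is lower than x1's.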

  2. Feature Reduction based on Multi-Collinearity

First, eigenvalue decomposition is performed on the correlation matrix. There will be as many eigenvalues and eigenvectors as there are features. If multi-collinearity is present in the dataset, at least one of these eigenvectors (aka directions) will be redundant, i.e., it explains almost no variance in the data. Redundant eigenvectors can be identified by calculating the condition index for each eigenvector.

Condition Index (for eigenvector i) = (max(eigenvalue) / eigenvalue_i) ** .5
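As a quick numeric check of this formula, with made-up eigenvalues:

```python
import numpy as np

# illustrative eigenvalues of a hypothetical correlation matrix
eigenvalues = np.array([3.2, 1.1, 0.7, 0.002])

# condition index of the weakest direction
cond_index = (eigenvalues.max() / eigenvalues.min()) ** 0.5
print(round(cond_index, 1))  # 40.0 -- above the default threshold of 30
```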

The eigenvector with the highest condition index (above the threshold) is one of the culprits behind the multi-collinearity in the data, so we need to discard that direction. But since it is an eigenvector (and not a feature in the original dataframe), we can't simply remove it. Instead, we find the feature with the highest factor loading on that eigenvector, i.e., the feature that leans the heaviest on it, and discard that feature from the dataset. We repeat this process until one of the two stopping criteria is met: min_vars_to_keep or cond_index_tol.

Consider the following example to understand how this works:

We identify u4 as the redundant eigenvector. Now x3 has the highest loading on that eigenvector, so we will discard x3 and then iterate.

Feature Reduction using Multi-Collinearity
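Putting the pieces above together, the multi-collinearity step could be sketched like this. Again, this is an illustrative re-implementation of the described algorithm, not the package's actual code; the function name drop_by_condition_index is made up:

```python
import numpy as np
import pandas as pd

def drop_by_condition_index(df, y_col, cond_index_tol=30, min_vars_to_keep=10):
    """Iteratively drop the feature with the highest absolute loading
    on the eigenvector whose condition index exceeds the threshold."""
    features = [c for c in df.columns if c != y_col]
    while len(features) > min_vars_to_keep:
        corr = df[features].corr().to_numpy()
        # eigh returns eigenvalues in ascending order for a symmetric matrix
        eigvals, eigvecs = np.linalg.eigh(corr)
        # condition index of each eigenvector; the clip guards against
        # tiny negative eigenvalues caused by floating-point rounding
        cond_indices = np.sqrt(eigvals.max() / np.maximum(eigvals, 1e-12))
        if cond_indices.max() <= cond_index_tol:
            break  # no redundant directions left
        worst = int(np.argmax(cond_indices))
        # feature leaning heaviest on the redundant eigenvector
        culprit = features[int(np.argmax(np.abs(eigvecs[:, worst])))]
        features.remove(culprit)
    return df[features + [y_col]]
```

Each iteration recomputes the eigendecomposition, since dropping a feature changes the correlation matrix and hence all condition indices.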

[!NOTE] Check out the example.ipynb notebook in this repo which demonstrates the functionality of this package with the Boston Housing dataset.

TODO :ballot_box_with_check:

  1. Add test cases.
  2. Make it possible to run this without the presence of the target variable, i.e., do this in an unsupervised way.

SOURCE :book:

Classical and Modern Regression with Applications

Link: https://a.co/d/7PyVHSZ
