Identify multicollinearity issues by correlation, VIF, and visualizations.

Project description

collinearity_tool

Identify multicollinearity issues by correlation, VIF, and visualizations. This package is designed for Python beginners who want to identify multicollinearity issues by calling a simple function. It automates building a proper correlation matrix, creating a correlation heatmap, and identifying highly correlated pairs of variables. An R version of the package is also in development.

1. Description

Functions

The following four functions are in the collinearity_tool package:

  • corr_matrix: A function that returns a generic correlation matrix and a longer form one for all numerical variables in a data frame.
  • corr_heatmap: A function that returns a correlation heatmap given a dataframe.
  • vif_bar_plot: A function that returns a list containing a data frame of Variance Inflation Factors (VIF) and a bar chart of the VIFs for each explanatory variable in a multiple linear regression model.
  • col_identify: A function that identifies multicollinearity based on highly correlated pairs (using the Pearson coefficient) whose VIF values exceed a threshold.
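
The pair-screening idea described for col_identify can be sketched with pandas alone. This is a minimal illustration, not the package's implementation: the helper name `high_corr_pairs`, the 0.8 cutoff, and the output column names are all assumptions made for the example, and the VIF check is left out.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.8):
    """List variable pairs whose absolute Pearson correlation meets the threshold.

    Hypothetical helper for illustration; not part of collinearity_tool.
    """
    corr = df.select_dtypes(include="number").corr(method="pearson")
    cols = list(corr.columns)
    rows = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                rows.append(
                    {"variable_1": cols[i], "variable_2": cols[j], "correlation": r}
                )
    return pd.DataFrame(rows, columns=["variable_1", "variable_2", "correlation"])


# Synthetic data: "b" is almost a copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "a": a,
        "b": a + 0.1 * rng.normal(size=200),
        "c": rng.normal(size=200),
    }
)
pairs = high_corr_pairs(demo)
print(pairs)
```

With this synthetic data, only the a/b pair should be flagged, since c is generated independently of the other two columns.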

Package ecosystems

Motivation

This package aims to simplify the decision-making process when addressing multicollinearity. It brings several other packages together into one interface. Other multicollinearity tools exist, but they do not cover all of the components included in this tool.

For example, linear regression, plotting tools and correlation matrix packages are already part of the Python ecosystem (as part of Pandas, Scipy, and so on). What makes this package different is that it combines the tools together to create a single package that will allow the researcher to locate troublesome multicollinearity issues.

In addition, collinearity_tool helps new users who are unfamiliar with Python and its broad ecosystem to plot and detect multicollinearity without prior knowledge of plotting, calculating VIFs, or manipulating data to create plots and tables.

variance_inflation_factor()
This function is necessary to calculate VIF. It is part of the statsmodels package. The VIF score measures how well a variable can be predicted from the other explanatory variables in the dataset using linear regression. Higher values highlight multicollinearity problems. In this tool, the output is a simple data frame with two columns: feature (variable name) and VIF (VIF value).
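
To make the VIF definition concrete, here is a minimal sketch that computes it directly as 1 / (1 - R²), where R² comes from regressing each variable on all the others with an intercept. It uses NumPy least squares rather than statsmodels, so it is not the package's code path; the helper name `vif_table` is hypothetical, and only the feature and VIF column names are taken from the description above.

```python
import numpy as np
import pandas as pd


def vif_table(df):
    """Return a data frame with one VIF per numeric column.

    Hypothetical helper for illustration; not part of collinearity_tool.
    """
    X = df.select_dtypes(include="number")
    rows = []
    for col in X.columns:
        target = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        # Regress this column on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_squared = 1.0 - resid.var() / target.var()
        rows.append({"feature": col, "VIF": 1.0 / (1.0 - r_squared)})
    return pd.DataFrame(rows, columns=["feature", "VIF"])


# Synthetic data: "b" is almost a copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "a": a,
        "b": a + 0.1 * rng.normal(size=200),
        "c": rng.normal(size=200),
    }
)
vif = vif_table(demo)
print(vif)
```

The near-duplicate columns a and b should show large VIFs, while the independent column c should stay close to 1.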

scipy.stats.linregress
Scipy is a necessary package for this collinearity tool. This package conducts linear regression using linregress and provides necessary statistical information. For more information on the package, please see the following documentation.

Pandas: corr()
Pandas is another necessary package for this collinearity tool. It produces a correlation matrix using corr. The output is a DataFrame in the shape of a correlation matrix. For more information on the package, please see the following documentation.
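
As a rough sketch of the two shapes mentioned for corr_matrix (a square matrix and a longer form), the following uses only pandas. The column names variable_1, variable_2, and correlation are assumptions for the example and may not match what the package returns.

```python
import pandas as pd

# Small hand-made data set; "label" is non-numeric and is excluded.
demo = pd.DataFrame(
    {
        "height": [150.0, 160.0, 170.0, 180.0, 190.0],
        "weight": [52.0, 58.0, 69.0, 77.0, 91.0],
        "score": [3.0, 9.0, 1.0, 7.0, 5.0],
        "label": ["a", "b", "c", "d", "e"],
    }
)

# Square (wide) Pearson correlation matrix over the numeric columns.
wide = demo.select_dtypes(include="number").corr()

# Long form: one row per ordered pair of variables.
long = (
    wide.rename_axis("variable_1")
    .reset_index()
    .melt(id_vars="variable_1", var_name="variable_2", value_name="correlation")
)

print(wide)
print(long)
```

The long form is the layout that plotting libraries such as Altair generally expect for a heatmap, with one row per cell.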

Altair
Altair is a popular plotting package. It provides the necessary tools to create the heatmap for the collinearity tool. For more information on Altair and heatmaps, please refer to this example.

2. Installation

$ pip install collinearity_tool

3. Usage

collinearity_tool can be used to identify multicollinearity issues by correlation, VIF, and visualizations as follows:

import pandas as pd
import collinearity_tool.collinearity_tool as cl

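data = pd.read_csv('test.csv')  # path to your file

# Placeholder column names (assumptions, not from the package docs): replace with your own.
x = ['feature_1', 'feature_2']  # assumed: names of the explanatory variables
y = 'response'                  # assumed: name of the response variable

cl.corr_matrix(data)
cl.corr_heatmap(data)
vif = cl.vif_bar_plot(x, y, data, 6)
cl.col_identify(data, x, y)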

4. Contributors

  • Anahita Einolghozati
  • Chaoran Wang
  • Katia Aristova
  • Lisheng Mao

5. Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

6. License

collinearity_tool was created by Anahita Einolghozati, Chaoran Wang, Katia Aristova, and Lisheng Mao. It is licensed under the terms of the MIT license.

7. Credits

collinearity_tool was created with cookiecutter and the py-pkgs-cookiecutter template.
