korr

A collection of utility functions for correlation analysis.

Usage

Check the examples folder for notebooks.

Compute correlation matrix and its p-values

• pearson – Pearson/Sample correlation (interval- and ratio-scale data)

• kendall – Kendall’s tau rank correlation (ordinal data)

• spearman – Spearman’s rho rank correlation (ordinal data)

• mcc – Matthews correlation coefficient between binary variables
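To illustrate what "correlation matrix and its p-values" means, here is a minimal sketch that uses NumPy and scipy.stats.pearsonr directly. It is not korr's implementation, and the helper name pearson_matrix is hypothetical; check the notebooks in the examples folder for the actual korr calls.

```python
import numpy as np
from scipy import stats


def pearson_matrix(X):
    """Return (r, pval) matrices for the columns of a 2-D array X.

    Hypothetical helper for illustration only, not part of korr.
    """
    X = np.asarray(X, dtype=float)
    n_vars = X.shape[1]
    r = np.eye(n_vars)                  # each variable correlates perfectly with itself
    pval = np.zeros((n_vars, n_vars))   # diagonal p-values are set to 0 by convention here
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            r_ij, p_ij = stats.pearsonr(X[:, i], X[:, j])
            r[i, j] = r[j, i] = r_ij
            pval[i, j] = pval[j, i] = p_ij
    return r, pval


# Simulated data: column 1 is a noisy copy of column 0, column 2 is independent
rng = np.random.default_rng(42)
x0 = rng.normal(size=200)
X = np.column_stack([x0, x0 + 0.1 * rng.normal(size=200), rng.normal(size=200)])

r, pval = pearson_matrix(X)
print(np.round(r, 3))
print(np.round(pval, 3))
```

The same loop works for rank correlations by swapping in scipy.stats.kendalltau or scipy.stats.spearmanr, which also return a statistic and a p-value.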

EDA, Dig deeper into results

• flatten – A table (pandas) with one row for each correlation pair, listing the variable indices, correlation, and p-value. For example, look for “good” cutoffs with corr_vs_pval, then look up the variable indices with flatten.

• slice_yx – slice a correlation and p-value matrix of a (y,X) dataset into a (y,x_i) vector and (x_j, x_k) matrices

• corr_vs_pval – Histogram to find p-value cutoffs (alpha) for a) highly correlated pairs, b) unrelated pairs, c) the mixed results.

• bracket_pval – Histogram with more fine-grained p-value brackets.

• corrgram – Correlogram, heatmap of correlations with p-values in brackets
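The idea behind flatten, one row per variable pair, can be sketched with plain NumPy. This is an assumption-laden illustration: the function name flatten_pairs, the tuple layout, the example matrices, and the cutoffs are all hypothetical, and korr's flatten returns a pandas table rather than a list.

```python
import numpy as np


def flatten_pairs(r, pval):
    """List (i, j, corr, pval) for every pair i < j of symmetric matrices.

    Hypothetical helper for illustration only, not part of korr.
    """
    r = np.asarray(r)
    pval = np.asarray(pval)
    i_idx, j_idx = np.triu_indices(r.shape[0], k=1)  # upper triangle, diagonal excluded
    return [
        (int(i), int(j), float(r[i, j]), float(pval[i, j]))
        for i, j in zip(i_idx, j_idx)
    ]


# Made-up 3-variable result, for illustration only
r = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, -0.3],
              [0.1, -0.3, 1.0]])
pval = np.array([[0.0, 0.001, 0.60],
                 [0.001, 0.0, 0.04],
                 [0.60, 0.04, 0.0]])

rows = flatten_pairs(r, pval)

# Keep pairs that pass hypothetical cutoffs: |corr| >= 0.25 and p-value <= 0.05
selected = [row for row in rows if abs(row[2]) >= 0.25 and row[3] <= 0.05]
for row in selected:
    print(row)
```

Working on the upper triangle avoids listing each pair twice and skips the trivial diagonal entries.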

Utility functions

• confusion – Confusion matrix. Required for the Matthews correlation (mcc), and a bit faster than sklearn’s.
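The link between the confusion matrix and the Matthews correlation can be shown in a few lines. The sketch below uses the textbook MCC formula with NumPy; the function names confusion_counts and matthews_corr are hypothetical and do not reflect the signatures of korr's confusion or mcc.

```python
import numpy as np


def confusion_counts(a, b):
    """Return (tp, fn, fp, tn) for two binary 0/1 vectors a and b.

    Hypothetical helper for illustration only, not part of korr.
    """
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    tp = int(np.sum(a & b))      # both 1
    fn = int(np.sum(a & ~b))     # a is 1, b is 0
    fp = int(np.sum(~a & b))     # a is 0, b is 1
    tn = int(np.sum(~a & ~b))    # both 0
    return tp, fn, fp, tn


def matthews_corr(a, b):
    """Matthews correlation coefficient computed from the four counts."""
    tp, fn, fp, tn = confusion_counts(a, b)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    if denom == 0.0:
        return 0.0  # common convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom


a = [1, 1, 1, 0, 0, 0, 1, 0]
b = [1, 1, 0, 0, 0, 1, 1, 0]

print(confusion_counts(a, b))  # (3, 1, 1, 3)
print(matthews_corr(a, b))     # 0.5
```

Counting the four cells with boolean masks avoids building a general multi-class table, which is one plausible reason a dedicated binary routine can be faster than a general-purpose one.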

Parameter Stability

• bootcorr – Estimate multiple correlation matrices based on bootstrapped samples. From these you can assess how stable the correlation estimates are, i.e. how sensitive they are to in-sample variation. For example, stable estimates are good candidates for modeling, while unstable correlation pairs are good candidates for p-hacking and non-reproducibility.
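The bootstrap idea can be sketched as follows: resample the rows with replacement, recompute the correlation matrix each time, and look at the spread across draws. This is a minimal NumPy sketch under assumed settings (the name bootstrap_corr, 500 draws, Pearson correlation, simulated data); it is not korr's bootcorr.

```python
import numpy as np


def bootstrap_corr(X, n_draws=500, seed=0):
    """Pearson correlation matrices for rows resampled with replacement.

    Hypothetical helper for illustration only, not part of korr.
    Returns an array of shape (n_draws, n_vars, n_vars).
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n_obs, n_vars = X.shape
    out = np.empty((n_draws, n_vars, n_vars))
    for b in range(n_draws):
        idx = rng.integers(0, n_obs, size=n_obs)     # row indices, drawn with replacement
        out[b] = np.corrcoef(X[idx], rowvar=False)   # columns are the variables
    return out


# Simulated data: column 1 depends on column 0, column 2 is independent
rng = np.random.default_rng(1)
x0 = rng.normal(size=100)
X = np.column_stack([x0, x0 + 0.5 * rng.normal(size=100), rng.normal(size=100)])

boot = bootstrap_corr(X)
spread = boot.std(axis=0)                                # element-wise std. dev. across draws
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)  # 95% percentile interval per pair

print(np.round(spread, 3))
print(np.round(lower, 3))
print(np.round(upper, 3))
```

A pair whose percentile interval is narrow and excludes zero is a stable estimate in this sense; a wide interval that straddles zero signals an unstable one.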

Variable Selection, Search Functions

• mincorr – From all estimated correlation pairs, pick a given number (n = 3, 5, ...) of variables with low and insignificant correlations among each other. (See the binsel package for an application.)

• find_best – Find the N “best”, i.e. high and most significant, correlations

• find_worst – Find the N “worst”, i.e. insignificant/random and low, correlations

• find_unrelated – Return variable indices of unrelated pairs (in terms of insignificant p-value)
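One simple way to think about these searches is as filtering and sorting the list of variable pairs. The sketch below is an assumption, not korr's algorithm: "best" is taken to mean the largest absolute correlation among pairs with p-value at or below a hypothetical alpha, "unrelated" means p-value above alpha, and the function name rank_pairs and the example matrices are made up.

```python
import numpy as np


def rank_pairs(r, pval, n=2, alpha=0.05):
    """Return (best, unrelated) lists of (i, j) index pairs.

    Hypothetical helper for illustration only, not part of korr.
    """
    r = np.asarray(r)
    pval = np.asarray(pval)
    i_idx, j_idx = np.triu_indices(r.shape[0], k=1)  # each pair once, no diagonal
    corr = np.abs(r[i_idx, j_idx])
    p = pval[i_idx, j_idx]
    pairs = list(zip(i_idx.tolist(), j_idx.tolist()))

    # Significant pairs, strongest absolute correlation first
    significant = [k for k in range(len(pairs)) if p[k] <= alpha]
    significant.sort(key=lambda k: corr[k], reverse=True)
    best = [pairs[k] for k in significant[:n]]

    # Pairs whose p-value does not pass the cutoff
    unrelated = [pairs[k] for k in range(len(pairs)) if p[k] > alpha]
    return best, unrelated


# Made-up 4-variable result, for illustration only
r = np.array([[1.00, 0.90, 0.05, 0.40],
              [0.90, 1.00, 0.10, -0.60],
              [0.05, 0.10, 1.00, 0.02],
              [0.40, -0.60, 0.02, 1.00]])
pval = np.array([[0.0, 0.001, 0.7, 0.03],
                 [0.001, 0.0, 0.5, 0.002],
                 [0.7, 0.5, 0.0, 0.9],
                 [0.03, 0.002, 0.9, 0.0]])

best, unrelated = rank_pairs(r, pval)
print(best)       # [(0, 1), (1, 3)]
print(unrelated)  # [(0, 2), (1, 2), (2, 3)]
```

In this made-up example, variable 2 appears in every unrelated pair, which is the kind of variable a selection like mincorr would favor.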

Appendix

Installation

The korr git repo is available as a PyPI package

pip install korr

Install a virtual environment

python3.7 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir

(If your git repo is stored in a folder with whitespaces, then don’t use the subfolder .venv. Use an absolute path without whitespaces.)

Commands

• Check syntax: flake8 --ignore=F401

• Run Unit Tests: python -W ignore -m unittest discover

• Remove .pyc files: find . -type f -name "*.pyc" -exec rm -f {} +

• Remove __pycache__ folders: find . -type d -name "__pycache__" -prune -exec rm -rf {} +

Publish

pandoc README.md --from markdown --to rst -s -o README.rst
python setup.py sdist
twine upload -r pypi dist/*

Support

Please open an issue for support.

Contributing

Please contribute using GitHub Flow. Create a branch, add commits, and open a pull request.
