Fast correlations

Project description

galileo

This package contains several functions for exploratory data analysis, with a focus on association mining between pairs of variables. The methods are optimized for pandas DataFrames and are inspired by the corrcoef function provided by numpy.

Because these functions rely on native matrix-level operations provided by numpy, many are orders of magnitude faster than naive looping-based alternatives. This makes them useful for constructing large association networks or for feature extraction, which have important uses in areas such as biomarker discovery.
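To illustrate the matrix-level idea, here is a minimal sketch of pairwise Pearson correlation computed with a single matrix product instead of a double loop. The function name pairwise_pearson is hypothetical and is not part of this package; the package's actual implementation may differ.

```python
import numpy as np

def pairwise_pearson(A, B):
    """Pearson correlation between every column of A and every column of B.

    Hypothetical helper for illustration only. A has shape
    (n_samples, a_cols) and B has shape (n_samples, b_cols); neither may
    contain NaNs. Returns an array of shape (a_cols, b_cols).
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Center each column on its mean.
    A_c = A - A.mean(axis=0)
    B_c = B - B.mean(axis=0)
    # One matrix product yields every pairwise sum of cross-products.
    cross = A_c.T @ B_c
    # Root sum of squares per column, used for normalisation.
    A_norm = np.sqrt((A_c ** 2).sum(axis=0))
    B_norm = np.sqrt((B_c ** 2).sum(axis=0))
    return cross / np.outer(A_norm, B_norm)

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
B = rng.normal(size=(50, 4))
corrs = pairwise_pearson(A, B)  # shape (3, 4)
```

The loop over column pairs is replaced by `A_c.T @ B_c`, which numpy dispatches to optimized linear algebra routines.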

The current functions available are listed below by variable comparison type. Benchmarks are also provided with comparisons to an equivalent looping-based method.

Requirements: Python 3, numpy, pandas, scipy, statsmodels.

Functions

Continuous vs. continuous

mat_corrs(A, B, method="pearson")

Computes pairwise Pearson or Spearman correlations between columns of A and B, provided that there are no missing values in either matrix.
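Spearman correlation is Pearson correlation applied to ranks, so one way to get both methods from the same machinery is to rank each column first. The sketch below assumes this approach and uses a hypothetical function name, pairwise_spearman; it is not the package's own code, and the return type of mat_corrs is not documented here.

```python
import numpy as np
import pandas as pd

def pairwise_spearman(A, B):
    """Spearman correlation between columns of two NaN-free DataFrames.

    Hypothetical helper for illustration only.
    """
    # Replace values by their within-column ranks (ties get average ranks).
    ranks_a = A.rank(axis=0).to_numpy()
    ranks_b = B.rank(axis=0).to_numpy()
    # np.corrcoef treats rows as variables, so pass the transposes. It also
    # computes the A-vs-A and B-vs-B blocks, which are discarded here.
    full = np.corrcoef(ranks_a.T, ranks_b.T)
    a_cols = ranks_a.shape[1]
    return pd.DataFrame(full[:a_cols, a_cols:], index=A.columns, columns=B.columns)

rng = np.random.default_rng(0)
x = rng.normal(size=30)
A = pd.DataFrame({"x": x})
# exp(x) is a monotonically increasing transform of x; -x is decreasing.
B = pd.DataFrame({"exp_x": np.exp(x), "neg_x": -x})
rho = pairwise_spearman(A, B)
```

Because ranks are unchanged by monotonic transforms, the two entries of `rho` are 1 and -1 up to floating-point error.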

mat_corrs_nan(A, B, method="pearson")

Computes pairwise Pearson or Spearman correlations between A and the columns of B, where A is a Series and B is a DataFrame that may contain missing values.
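Missing values complicate the matrix approach because each column pair has its own set of valid rows. A minimal sketch of one way to handle this with boolean masks follows. The name series_vs_columns_pearson is hypothetical, and the sketch assumes pairwise deletion (a row is dropped for a given column of B when either value is missing), which may not match the package's behaviour.

```python
import numpy as np

def series_vs_columns_pearson(a, B):
    """Pearson correlation of vector a with each column of B, ignoring NaNs.

    Hypothetical helper for illustration only. a has shape (n,), B has
    shape (n, m). Returns an array of shape (m,).
    """
    a = np.asarray(a, dtype=float)
    B = np.asarray(B, dtype=float)
    # valid[i, j] is True when row i can be used for column j.
    valid = ~np.isnan(B) & ~np.isnan(a)[:, None]
    n = valid.sum(axis=0)
    # Per-column means over the valid rows only.
    a_mean = np.where(valid, a[:, None], 0.0).sum(axis=0) / n
    B_mean = np.where(valid, B, 0.0).sum(axis=0) / n
    # Centered values, with invalid entries zeroed so they drop out of sums.
    a_c = np.where(valid, a[:, None] - a_mean, 0.0)
    B_c = np.where(valid, B - B_mean, 0.0)
    numerator = (a_c * B_c).sum(axis=0)
    denominator = np.sqrt((a_c ** 2).sum(axis=0) * (B_c ** 2).sum(axis=0))
    return numerator / denominator

rng = np.random.default_rng(0)
a = rng.normal(size=100)
B = rng.normal(size=(100, 5))
B[rng.random(B.shape) < 0.1] = np.nan  # knock out roughly 10% of entries
r = series_vs_columns_pearson(a, B)
```

Note that the mean of `a` is recomputed per column of B, since the set of valid rows differs from column to column.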

mat_corrs_naive(A, B, method="pearson")

Same functionality as mat_corrs, but uses a double loop for direct computation of statistics.

Continuous vs. categorical

mat_mwus(A, B, use_continuity=True)

Computes pairwise Mann-Whitney U tests between columns of A (continuous samples) and B (binary samples). Assumes that neither A nor B contains missing values.
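The U statistic can be written in terms of rank sums, which makes it amenable to the same matrix treatment. The sketch below computes only the U statistic for the group where B is true; p-values and the continuity correction are left out. The function name pairwise_mwu_statistic is hypothetical and this is not necessarily how mat_mwus is implemented.

```python
import numpy as np
from scipy.stats import rankdata

def pairwise_mwu_statistic(A, B):
    """U statistic of the B == True group for every (A column, B column) pair.

    Hypothetical helper for illustration only. A has shape (n, a_cols) and
    is continuous; B has shape (n, b_cols) and is binary. Returns an array
    of shape (a_cols, b_cols).
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B).astype(bool).astype(float)
    # Rank each column of A over all samples (ties get average ranks).
    ranks = rankdata(A, axis=0)
    # Size of the True group for each column of B.
    n1 = B.sum(axis=0)
    # Sum of ranks falling in the True group, for all pairs at once.
    rank_sums = ranks.T @ B
    # U1 = R1 - n1 * (n1 + 1) / 2
    return rank_sums - n1 * (n1 + 1) / 2

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 4)).round(1)  # rounding introduces some ties
B = rng.integers(0, 2, size=(40, 3))
U = pairwise_mwu_statistic(A, B)
```

Ranking depends only on A, so it is done once per column of A rather than once per pair.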

mat_mwus_naive(A, B, use_continuity=True)

Same functionality as mat_mwus, but uses a double loop for direct computation of statistics.

Categorical vs. categorical

mat_fishers(A, B)

Computes pairwise Fisher's exact tests between columns of A and B, provided that both are boolean-castable matrices and do not contain any missing values.
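For boolean matrices, the 2×2 contingency counts of every column pair follow from one matrix product plus the column totals. The sketch below uses this to compute one-sided ("greater") Fisher p-values from the hypergeometric distribution. It is a simplification under stated assumptions: the name pairwise_fisher_greater is hypothetical, and whether mat_fishers reports one-sided or two-sided p-values is not stated here.

```python
import numpy as np
from scipy import stats

def pairwise_fisher_greater(A, B):
    """One-sided Fisher exact p-values for every (A column, B column) pair.

    Hypothetical helper for illustration only. A has shape (n, a_cols) and
    B has shape (n, b_cols); both are boolean-castable with no missing
    values. Returns an array of shape (a_cols, b_cols).
    """
    A = np.asarray(A).astype(bool).astype(int)
    B = np.asarray(B).astype(bool).astype(int)
    n = A.shape[0]
    # Count of rows where both columns are True, for all pairs at once.
    both = A.T @ B
    # Margins of the 2x2 tables, shaped for broadcasting against `both`.
    a_true = A.sum(axis=0)[:, None]
    b_true = B.sum(axis=0)[None, :]
    # P(X >= both) for X ~ Hypergeometric(population n, a_true successes,
    # b_true draws); sf(k) is P(X > k), hence the "- 1".
    return stats.hypergeom.sf(both - 1, n, a_true, b_true)

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(30, 3))
B = rng.integers(0, 2, size=(30, 4))
p = pairwise_fisher_greater(A, B)
```

The remaining three cells of each table are determined by `both` and the margins, so they never need to be materialized.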

mat_fishers_nan(A, B)

Computes pairwise Fisher's exact tests between columns of A and B. Both must be boolean-castable matrices; unlike mat_fishers, missing values are allowed.

mat_fishers_naive(A, B)

Same functionality as mat_fishers, but uses a double loop for direct computation of statistics.

Utilities

generate_test(n_samples, A_n_cols, B_n_cols, A_type="continuous", B_type="continuous", nans=False)

Generates randomly-initialized matrix pairs for testing and benchmarking.

Benchmarks

These benchmarks were run with 1,000 samples per variable (i.e. each input matrix had 1,000 rows). The number of variables in A was set to 100, and the number of variables in B was varied as shown below. The number of pairwise comparisons calculated (the product of the column counts of A and B) is also indicated.

Benchmark scripts can be found in /test/benchmarks.ipynb.
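The notebook is not reproduced here. As a rough sketch of how a loop-versus-matrix comparison can be timed, the example below uses two hypothetical stand-ins (corrs_loop and corrs_matrix) built on numpy's corrcoef rather than the package's own functions, with smaller inputs than the benchmarks so that it runs quickly.

```python
import time
import numpy as np

def corrs_loop(A, B):
    """Pearson correlations computed one column pair at a time."""
    out = np.empty((A.shape[1], B.shape[1]))
    for i in range(A.shape[1]):
        for j in range(B.shape[1]):
            out[i, j] = np.corrcoef(A[:, i], B[:, j])[0, 1]
    return out

def corrs_matrix(A, B):
    """Pearson correlations for all pairs from a single corrcoef call."""
    a_cols = A.shape[1]
    # Rows are variables for np.corrcoef; keep only the A-vs-B block.
    return np.corrcoef(A.T, B.T)[:a_cols, a_cols:]

def time_call(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 20))
B = rng.normal(size=(1000, 10))
loop_result, loop_seconds = time_call(corrs_loop, A, B)
matrix_result, matrix_seconds = time_call(corrs_matrix, A, B)
print(f"loop: {loop_seconds:.4f} s, matrix: {matrix_seconds:.4f} s")
```

Timings from this sketch depend on the machine and are not comparable to the figures reported below.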

Pearson correlations

| Column count of B | Total comparisons | Runtime, mat_corrs_naive (s) | Runtime, mat_corrs (s) | Speedup factor |
|---|---|---|---|---|
| 10 | 1,000 | 3.27 | 0.020 | ×163 |
| 100 | 10,000 | 30.55 | 0.066 | ×461 |
| 1,000 | 100,000 | 303.30 | 0.53 | ×574 |

Spearman correlations

| Column count of B | Total comparisons | Runtime, mat_corrs_naive (s) | Runtime, mat_corrs (s) | Speedup factor |
|---|---|---|---|---|
| 10 | 1,000 | 4.47 | 0.026 | ×171 |
| 100 | 10,000 | 44.55 | 0.081 | ×553 |
| 1,000 | 100,000 | 493.73 | 0.70 | ×704 |

Mann-Whitney U tests

| Column count of B | Total comparisons | Runtime, mat_mwus_naive (s) | Runtime, mat_mwus (s) | Speedup factor |
|---|---|---|---|---|
| 10 | 1,000 | 6.94 | 0.18 | ×38 |
| 100 | 10,000 | 60.68 | 1.09 | ×56 |
| 1,000 | 100,000 | 615.59 | 8.15 | ×76 |

Fisher's exact tests

| Column count of B | Total comparisons | Runtime, mat_fishers_naive (s) | Runtime, mat_fishers (s) | Speedup factor |
|---|---|---|---|---|
| 10 | 1,000 | 2.63 | 0.41 | ×6 |
| 100 | 10,000 | 25.19 | 3.78 | ×7 |
| 1,000 | 100,000 | 254.19 | 37.57 | ×7 |

Download files

Files for galileo-k, version 0.0.1
| Filename | Size | File type | Python version |
|---|---|---|---|
| galileo_k-0.0.1-py3-none-any.whl | 13.1 kB | Wheel | py3 |
| galileo-k-0.0.1.tar.gz | 10.6 kB | Source | None |
