Fast correlations
galileo
This package contains several functions for exploratory data analysis, with a focus on association mining between variable pairs. The methods are optimized for pandas DataFrames and are inspired by the corrcoef function provided by numpy.
Because these functions rely on native matrix-level operations provided by numpy, many are orders of magnitude faster than naive looping-based alternatives. This makes them useful for constructing large association networks or for feature extraction, which have important uses in areas such as biomarker discovery.
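To illustrate what a matrix-level computation looks like, here is a minimal sketch of pairwise Pearson correlations obtained from a single matrix product. The helper name pairwise_pearson is hypothetical and is not part of this package; the sketch assumes both inputs are numeric DataFrames with the same rows and no missing values.

```python
import numpy as np
import pandas as pd


def pairwise_pearson(A: pd.DataFrame, B: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation of every column of A with every column of B.

    Hypothetical helper for illustration only; assumes no missing values.
    """
    a = A.to_numpy(dtype=float)
    b = B.to_numpy(dtype=float)
    # Center each column, then scale it to unit Euclidean length.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a /= np.linalg.norm(a, axis=0)
    b /= np.linalg.norm(b, axis=0)
    # A single matrix product yields all (A column, B column) correlations.
    return pd.DataFrame(a.T @ b, index=A.columns, columns=B.columns)


rng = np.random.default_rng(0)
A = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a0", "a1", "a2"])
B = pd.DataFrame(rng.normal(size=(50, 2)), columns=["b0", "b1"])
corrs = pairwise_pearson(A, B)  # 3 x 2 DataFrame of correlations
```

Each entry of the result should agree with np.corrcoef applied to the corresponding pair of columns, without an explicit Python loop over pairs.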
The functions currently available are listed below by variable comparison type. Benchmarks against an equivalent looping-based method are also provided.
Requirements: Python 3, numpy, pandas, scipy, statsmodels.
Functions
Continuous vs. continuous
mat_corrs(A, B, method="pearson")
Computes pairwise Pearson or Spearman correlations between columns of A and B, provided that there are no missing values in either matrix.
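As a rough sketch of how the Spearman variant can reuse the same matrix approach: a Spearman correlation is a Pearson correlation computed on ranks, so ranking each column first is enough. The helper pairwise_spearman below is hypothetical and may differ from how mat_corrs is actually implemented.

```python
import numpy as np
import pandas as pd


def pairwise_spearman(A: pd.DataFrame, B: pd.DataFrame) -> pd.DataFrame:
    """Spearman correlations between columns of A and B (no missing values).

    Hypothetical helper for illustration only.
    """
    # Rank within each column; ties receive the average rank.
    ra = A.rank().to_numpy(dtype=float)
    rb = B.rank().to_numpy(dtype=float)
    # Pearson correlation on the ranks, computed with one matrix product.
    ra = ra - ra.mean(axis=0)
    rb = rb - rb.mean(axis=0)
    ra /= np.linalg.norm(ra, axis=0)
    rb /= np.linalg.norm(rb, axis=0)
    return pd.DataFrame(ra.T @ rb, index=A.columns, columns=B.columns)


rng = np.random.default_rng(0)
A = pd.DataFrame(rng.normal(size=(60, 2)), columns=["a0", "a1"])
B = pd.DataFrame(rng.normal(size=(60, 3)), columns=["b0", "b1", "b2"])
rho = pairwise_spearman(A, B)
```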
mat_corrs_nan(A, B, method="pearson")
Computes pairwise Pearson or Spearman correlations between A and the columns of B, where A is a Series and B is a DataFrame that may contain missing values.
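One way to handle missing values without looping is to build a pairwise-complete mask per column and compute the correlation from masked sums. The sketch below covers the Pearson case only; series_vs_columns_pearson is a hypothetical name, and the approach is an assumption about how such a function could work, not a description of mat_corrs_nan itself. It assumes a and B share the same row order.

```python
import numpy as np
import pandas as pd


def series_vs_columns_pearson(a: pd.Series, B: pd.DataFrame) -> pd.Series:
    """Pearson correlation of a with each column of B, ignoring rows where
    either value is missing. Hypothetical helper for illustration only."""
    x = a.to_numpy(dtype=float)[:, None]   # shape (n_samples, 1)
    Y = B.to_numpy(dtype=float)            # shape (n_samples, n_cols)
    valid = ~np.isnan(x) & ~np.isnan(Y)    # pairwise-complete mask per column
    X = np.where(valid, x, 0.0)            # broadcast to (n_samples, n_cols)
    Y = np.where(valid, Y, 0.0)
    n = valid.sum(axis=0)                  # usable sample count per column
    cov = (X * Y).sum(axis=0) - X.sum(axis=0) * Y.sum(axis=0) / n
    var_x = (X ** 2).sum(axis=0) - X.sum(axis=0) ** 2 / n
    var_y = (Y ** 2).sum(axis=0) - Y.sum(axis=0) ** 2 / n
    return pd.Series(cov / np.sqrt(var_x * var_y), index=B.columns)


rng = np.random.default_rng(1)
a = pd.Series(rng.normal(size=40))
B = pd.DataFrame(rng.normal(size=(40, 3)), columns=["b0", "b1", "b2"])
a.iloc[3] = np.nan
B.iloc[0, 0] = np.nan
B.iloc[5:8, 1] = np.nan
r = series_vs_columns_pearson(a, B)
```

The result should match pandas' own a.corr(B[col]), which also drops incomplete pairs, but it is computed for all columns at once.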
mat_corrs_naive(A, B, method="pearson")
Same functionality as mat_corrs, but uses a double loop for direct computation of statistics.
Continuous vs. categorical
mat_mwus(A, B, use_continuity=True)
Computes pairwise Mann-Whitney U tests between columns of A (continuous samples) and B (binary samples). Assumes that neither A nor B contains any missing values.
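The U statistic lends itself to the same matrix treatment: rank each column of A once, then obtain the rank sum of every binary group with one matrix product. The sketch below computes only the U statistics, not p-values or the continuity correction controlled by use_continuity. The name pairwise_mwu_statistics is hypothetical, and this is an assumed approach rather than the package's confirmed implementation.

```python
import numpy as np
import pandas as pd


def pairwise_mwu_statistics(A: pd.DataFrame, B: pd.DataFrame) -> pd.DataFrame:
    """U statistic of the B == True group for every (A column, B column) pair.

    Hypothetical helper for illustration only; assumes no missing values.
    """
    ranks = A.rank().to_numpy(dtype=float)          # average ranks within each A column
    groups = B.to_numpy(dtype=bool).astype(float)   # 1.0 where the sample is in the True group
    n_true = groups.sum(axis=0)                     # True-group size per B column
    rank_sums = ranks.T @ groups                    # rank sums for all pairs at once
    U = rank_sums - n_true * (n_true + 1) / 2       # U = R1 - n1 * (n1 + 1) / 2
    return pd.DataFrame(U, index=A.columns, columns=B.columns)


rng = np.random.default_rng(2)
A = pd.DataFrame(rng.normal(size=(30, 2)), columns=["a0", "a1"])
B = pd.DataFrame(rng.random(size=(30, 2)) < 0.5, columns=["g0", "g1"])
U_stats = pairwise_mwu_statistics(A, B)
```

For continuous data without ties, each U value equals the number of (True-group, False-group) sample pairs in which the True-group value is larger.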
mat_mwus_naive(A, B, use_continuity=True)
Same functionality as mat_mwus, but uses a double loop for direct computation of statistics.
Categorical vs. categorical
mat_fishers(A, B)
Computes pairwise Fisher's exact tests between columns of A and B, provided that both are boolean-castable matrices and do not contain any missing values.
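For boolean data, the 2×2 contingency counts of all column pairs can be read off a matrix product, and the hypergeometric tail can then be evaluated on the whole count matrix at once. The sketch below computes only the one-sided ("greater") p-value using scipy's hypergeom distribution. The name pairwise_fisher_greater is hypothetical, and whether mat_fishers works this way or which alternative it reports is not stated here, so treat this as an assumption.

```python
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact, hypergeom


def pairwise_fisher_greater(A: pd.DataFrame, B: pd.DataFrame) -> pd.DataFrame:
    """One-sided Fisher's exact p-values (enrichment of co-occurrence).

    Hypothetical helper for illustration only; assumes no missing values.
    """
    a = A.to_numpy(dtype=bool).astype(int)
    b = B.to_numpy(dtype=bool).astype(int)
    n = a.shape[0]
    both = a.T @ b                      # rows where both columns are True
    a_true = a.sum(axis=0)[:, None]     # True count per A column
    b_true = b.sum(axis=0)[None, :]     # True count per B column
    # P(X >= both) under the hypergeometric null, broadcast over all pairs.
    p = hypergeom.sf(both - 1, n, a_true, b_true)
    return pd.DataFrame(p, index=A.columns, columns=B.columns)


rng = np.random.default_rng(3)
A = pd.DataFrame(rng.random(size=(40, 2)) < 0.4, columns=["a0", "a1"])
B = pd.DataFrame(rng.random(size=(40, 3)) < 0.6, columns=["b0", "b1", "b2"])
p_values = pairwise_fisher_greater(A, B)

# Reference value for one pair, computed directly from its contingency table.
table = pd.crosstab(A["a1"], B["b2"]).reindex(
    index=[True, False], columns=[True, False], fill_value=0
)
p_reference = fisher_exact(table.to_numpy(), alternative="greater")[1]
```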
mat_fishers_nan(A, B)
Computes pairwise Fisher's exact tests between columns of A and B, provided that both are boolean-castable matrices; either may contain missing values.
mat_fishers_naive(A, B)
Same functionality as mat_fishers, but uses a double loop for direct computation of statistics.
Utilities
generate_test(n_samples, A_n_cols, B_n_cols, A_type="continuous", B_type="continuous", nans=False)
Generates randomly initialized matrix pairs for testing and benchmarking.
Benchmarks
These benchmarks were run with 1,000 samples per variable (i.e. each input matrix had 1,000 rows). The number of variables in A was set to 100, and the number of variables in B was varied as shown below. The number of pairwise comparisons calculated (the product of the column counts of A and B) is also indicated.
Benchmark scripts can be found in /test/benchmarks.ipynb.
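For readers who want to reproduce the flavor of these comparisons without the notebook, a loop-versus-matrix timing can be sketched as follows. This is not the benchmark script referenced above: it uses scipy's pearsonr in a double loop and a plain numpy matrix product as stand-ins, and it uses smaller matrices than the benchmarks so that the loop finishes quickly. Timings will vary by machine.

```python
import time

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 20))  # 1,000 samples, 20 variables (assumed sizes)
B = rng.normal(size=(1000, 10))  # 1,000 samples, 10 variables

# Looping baseline: one scipy call per column pair.
start = time.perf_counter()
looped = np.array(
    [[pearsonr(A[:, i], B[:, j])[0] for j in range(B.shape[1])]
     for i in range(A.shape[1])]
)
loop_seconds = time.perf_counter() - start

# Matrix version: z-score the columns, then take one matrix product.
start = time.perf_counter()
Az = (A - A.mean(axis=0)) / A.std(axis=0)
Bz = (B - B.mean(axis=0)) / B.std(axis=0)
vectorized = Az.T @ Bz / A.shape[0]
vectorized_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f} s, matrix: {vectorized_seconds:.4f} s")
```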
Pearson correlations
Column count of B  Total comparisons  Runtime, mat_corrs_naive, seconds  Runtime, mat_corrs, seconds  Speedup factor
10  1,000  3.27  0.020  ×163 
100  10,000  30.55  0.066  ×461 
1,000  100,000  303.30  0.53  ×574 
Spearman correlations
Column count of B  Total comparisons  Runtime, mat_corrs_naive, seconds  Runtime, mat_corrs, seconds  Speedup factor
10  1,000  4.47  0.026  ×171 
100  10,000  44.55  0.081  ×553 
1,000  100,000  493.73  0.70  ×704 
Mann-Whitney U tests
Column count of B  Total comparisons  Runtime, mat_mwus_naive, seconds  Runtime, mat_mwus, seconds  Speedup factor
10  1,000  6.94  0.18  ×38 
100  10,000  60.68  1.09  ×56 
1,000  100,000  615.59  8.15  ×76 
Fisher's exact tests
Column count of B  Total comparisons  Runtime, mat_fishers_naive, seconds  Runtime, mat_fishers, seconds  Speedup factor
10  1,000  2.63  0.41  ×6 
100  10,000  25.19  3.78  ×7 
1,000  100,000  254.19  37.57  ×7 