Skip to main content

No project description provided

Project description

many

This package serves as a general-use toolkit for frequently-implemented statistical and visual methods.

Installation

pip install many

Components

Statistical methods

The statistical methods comprise several functions for association mining between variable pairs. The methods used here are optimized for pandas DataFrames and are inspired by the corrcoef function provided by numpy.

Because these functions rely on native matrix-level operations provided by numpy, many are orders of magnitude faster than naive looping-based alternatives. This makes them useful for constructing large association networks or for feature extraction, which have important uses in areas such as biomarker discovery. All methods also return estimates of statistical significance.

In certain cases such as the computation of correlation coefficients, these vectorized methods come with the caveat of numerical instability. As a compromise, "naive" loop-based implementations are also provided for testing and comparison. It is recommended that any significant results obtained with the vectorized methods be verified with these base methods.

The current functions available are listed below by variable comparison type. Benchmarks are also provided with comparisons to the equivalent looping-based method. In all methods, a melt option is provided to return the outputs as a set of row-column variable-variable pair statistic matrices or as a single DataFrame with each statistic melted to a column.

Continuous vs. continuous

mat_corr(a_mat, b_mat, melt: bool, method: str)

Computes pairwise Pearson or Spearman correlations between columns of a_mat and b_mat, provided that there are no missing values in either matrix. method can be either "pearson" or "spearman".

mat_corr_nan(a_mat, b_mat, melt: bool, method: str)

Computes pairwise Pearson or Spearman correlations between a_mat and the columns of b_mat, provided that a_mat is a Series and b_mat is a DataFrame that may or may not contain some missing values. method can be either "pearson" or "spearman".

mat_corr_naive(a_mat, b_mat, melt: bool, method: str, pbar=False)

Same functionality as mat_corr, but uses a double loop for direct computation of statistics. method can be either "pearson" or "spearman".

Continuous vs. categorical

mat_mwu(a_mat, b_mat, melt: bool, effect: str, use_continuity=True)

Computes pairwise Mann-Whitney U tests between columns of a_mat (continuous samples) and b_mat (binary samples). Assumes that a_mat and b_mat both do not contain any missing values. effect can only be rank_biserial. use_continuity specifies whether a continuity correction should be applied.

mat_mwu_naive( a_mat, b_mat, melt: bool, effect: str, use_continuity=True, pbar=False)

Same functionality as mat_mwu, but uses a double loop for direct computation of statistics. Unlike mat_mwus, effect parameters of "mean", "median", and "rank_biserial" are all supported.

Categorical vs. categorical

mat_fisher(a_mat, b_mat, melt: bool, pseudocount=0)

Computes pairwise Fisher's exact tests between columns of a_mat and b_mat, provided that both are boolean-castable matrices and do not contain any missing values. The pseudocount parameter (which must be an integer) specifies the value that should be added to all cells of the contingency matrices.

mat_fisher_nan(a_mat, b_mat, melt: bool, pseudocount=0)

Computes pairwise Fisher's exact tests between columns of a_mat and b_mat, provided that both are boolean-castable matrices and may or may not contain missing values.

mat_fisher_naive(a_mat, b_mat, melt: bool, pseudocount=0, pbar=False)

Same functionality as mat_fisher, but uses a double loop for direct computation of statistics.

Benchmarks

Benchmarks were run with 1,000 samples per variable (i.e. setting each input matrix to have 1,000 rows). The number of variables in a_mat was set to 100, and the number of variables in b_mat was varied as shown below. The number of pairwise comparisons (equivalent to the product of the column counts of a_mat and b_mat) is also indicated.

Benchmarks were run on an i7-7700K with 16GB of 2133 MHz RAM.

mat_corr (Pearson)
Column count of b_mat Total comparisons Runtime, mat_corr_naive, seconds Runtime, mat_corr, seconds Speedup
10 1,000 0.29 0.01 ×25.86
100 10,000 2.67 0.07 ×37.37
1,000 100,000 26.36 0.35 ×74.47
mat_corr (Spearman)
Column count of b_mat Total comparisons Runtime, mat_corr_naive, seconds Runtime, mat_corr, seconds Speedup
10 1,000 0.74 0.02 ×46.74
100 10,000 7.28 0.08 ×94.21
1,000 100,000 72.06 0.38 ×190.58
mat_mwu
Column count of b_mat Total comparisons Runtime, mat_mwu_naive, seconds Runtime, mat_mwu, seconds Speedup
10 1,000 1.06 0.14 ×7.55
100 10,000 10.38 0.71 ×14.62
1,000 100,000 103.17 6.52 ×15.82
mat_fisher
Column count of b_mat Total comparisons Runtime, mat_fisher_naive, seconds Runtime, mat_fisher, seconds Speedup
10 1,000 1.18 0.24 ×5.01
100 10,000 11.72 2.42 ×4.83
1,000 100,000 116.48 23.77 ×4.90
mat_fisher_nan
Column count of b_mat Total comparisons Runtime, mat_fisher_naive, seconds Runtime, mat_fisher_nan, seconds Speedup
10 1,000 1.23 0.14 ×8.56
100 10,000 12.04 1.43 ×8.40
1,000 100,000 120.20 14.46 ×8.31

Visual methods

Development

  1. Install dependencies with poetry install
  2. Initialize environment with poetry shell
  3. Initialize pre-commit hooks with pre-commit install

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

many-0.2.3.tar.gz (19.6 kB view hashes)

Uploaded Source

Built Distribution

many-0.2.3-py3-none-any.whl (23.0 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page