Project description

many

This package serves as a general-use toolkit for frequently-implemented statistical and visual methods.

Installation

pip install many

Components

Statistical methods

The statistical methods comprise several functions for association mining between variable pairs. The methods used here are optimized for pandas DataFrames and are inspired by the corrcoef function provided by numpy.

Because these functions rely on native matrix-level operations provided by numpy, they are considerably faster than naive loop-based alternatives: roughly 5 to 190 times faster in the benchmarks below. This makes them useful for constructing large association networks or for feature extraction, which have important uses in areas such as biomarker discovery. All methods also return estimates of statistical significance.

In certain cases such as the computation of correlation coefficients, these vectorized methods come with the caveat of numerical instability. As a compromise, "naive" loop-based implementations are also provided for testing and comparison. It is recommended that any significant results obtained with the vectorized methods be verified with these base methods.
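To make the trade-off above concrete, here is a minimal sketch of the vectorized idea next to a loop-based check. It is not the package's implementation: `pearson_vectorized` and `pearson_naive` are hypothetical names, and the sketch assumes complete data with non-constant columns. The p-values come from the usual t-distribution with n - 2 degrees of freedom.

```python
import numpy as np
import pandas as pd
from scipy import stats


def pearson_vectorized(a_mat: pd.DataFrame, b_mat: pd.DataFrame):
    """Pearson r and two-sided p-values for every column pair (hypothetical helper)."""
    a = a_mat.to_numpy(dtype=float)
    b = b_mat.to_numpy(dtype=float)
    n = a.shape[0]
    # Standardize every column, then one matrix product yields all correlations.
    a_z = (a - a.mean(axis=0)) / a.std(axis=0)
    b_z = (b - b.mean(axis=0)) / b.std(axis=0)
    r = np.clip(a_z.T @ b_z / n, -1.0, 1.0)
    # Convert r to a t statistic with n - 2 degrees of freedom.
    with np.errstate(divide="ignore"):
        t = r * np.sqrt((n - 2) / (1.0 - r**2))
    p = 2.0 * stats.t.sf(np.abs(t), df=n - 2)
    r_df = pd.DataFrame(r, index=a_mat.columns, columns=b_mat.columns)
    p_df = pd.DataFrame(p, index=a_mat.columns, columns=b_mat.columns)
    return r_df, p_df


def pearson_naive(a_mat: pd.DataFrame, b_mat: pd.DataFrame):
    """Same statistics computed pair by pair with scipy, for verification."""
    r_df = pd.DataFrame(index=a_mat.columns, columns=b_mat.columns, dtype=float)
    p_df = r_df.copy()
    for i in a_mat.columns:
        for j in b_mat.columns:
            r_df.loc[i, j], p_df.loc[i, j] = stats.pearsonr(a_mat[i], b_mat[j])
    return r_df, p_df


rng = np.random.default_rng(0)
a_mat = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a0", "a1", "a2"])
b_mat = pd.DataFrame(rng.normal(size=(200, 4)), columns=["b0", "b1", "b2", "b3"])

r_fast, p_fast = pearson_vectorized(a_mat, b_mat)
r_slow, p_slow = pearson_naive(a_mat, b_mat)
print("largest |r| difference:", np.abs(r_fast.to_numpy() - r_slow.to_numpy()).max())
```

On well-behaved data the two results should agree to within floating-point error; larger gaps are the kind of discrepancy worth checking with the loop-based methods.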

The functions currently available are listed below by variable comparison type. Benchmarks comparing each one to its equivalent loop-based method are also provided. Every method takes a melt option that selects the output format: either a set of matrices, one per statistic, with rows indexed by the columns of a_mat and columns indexed by the columns of b_mat, or a single DataFrame with one row per variable pair and each statistic melted to its own column.
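The two output layouts can be illustrated with plain pandas. This is only a sketch of the reshaping, not the package's own code: `melt_stats`, the column names `a_col` and `b_col`, and the numbers in the matrices are all made up for illustration.

```python
import pandas as pd

# Matrix layout: one DataFrame per statistic, rows from a_mat, columns from b_mat.
corrs = pd.DataFrame(
    [[0.9, -0.2], [0.1, 0.5]], index=["a0", "a1"], columns=["b0", "b1"]
)
pvals = pd.DataFrame(
    [[0.001, 0.4], [0.7, 0.03]], index=["a0", "a1"], columns=["b0", "b1"]
)


def melt_stats(**matrices: pd.DataFrame) -> pd.DataFrame:
    """Stack each statistic matrix into one column of a long DataFrame."""
    melted = pd.DataFrame({name: mat.stack() for name, mat in matrices.items()})
    melted.index.names = ["a_col", "b_col"]
    return melted.reset_index()


# Melted layout: one row per (a_col, b_col) pair, one column per statistic.
melted = melt_stats(corr=corrs, pval=pvals)
print(melted)
```

The melted layout is convenient for sorting and filtering pairs, while the matrix layout is convenient for heatmaps and further matrix operations.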

Continuous vs. continuous

mat_corr(a_mat, b_mat, melt: bool, method: str)

Computes pairwise Pearson or Spearman correlations between columns of a_mat and b_mat, provided that there are no missing values in either matrix. method can be either "pearson" or "spearman".
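The relationship between the two method values can be sketched as follows: a Spearman correlation is a Pearson correlation computed on ranks. The snippet below assumes this is how a vectorized version could be built and does not reproduce the package's code; `spearman_via_ranks` is a hypothetical name.

```python
import numpy as np
from scipy import stats


def spearman_via_ranks(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Spearman correlations between columns of a and b (no missing values)."""
    # Rank each column independently; ties receive average ranks.
    a_ranks = stats.rankdata(a, axis=0)
    b_ranks = stats.rankdata(b, axis=0)
    n_a = a_ranks.shape[1]
    # corrcoef treats the columns of both inputs as variables; keep the a-vs-b block.
    return np.corrcoef(a_ranks, b_ranks, rowvar=False)[:n_a, n_a:]


rng = np.random.default_rng(0)
a = rng.normal(size=(100, 3))
b = rng.normal(size=(100, 2))
rho = spearman_via_ranks(a, b)
print(rho.shape)
```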

mat_corr_nan(a_mat, b_mat, melt: bool, method: str)

Computes pairwise Pearson or Spearman correlations between a_mat and each column of b_mat. Here a_mat must be a Series and b_mat a DataFrame; b_mat may contain missing values. method can be either "pearson" or "spearman".
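One way to handle missing values without looping is to mask them out and compute each column's statistics over its own complete pairs. The sketch below illustrates that idea for the Pearson case; `corr_series_vs_frame_nan` is a hypothetical name, and this is an assumption about a workable approach rather than the package's implementation.

```python
import numpy as np
import pandas as pd


def corr_series_vs_frame_nan(a: pd.Series, b: pd.DataFrame) -> pd.Series:
    """Pearson r between a and each column of b using pairwise-complete rows."""
    a_vals = a.to_numpy(dtype=float)[:, None]  # shape (n, 1), broadcasts over columns
    b_vals = b.to_numpy(dtype=float)  # shape (n, m)
    valid = ~np.isnan(a_vals) & ~np.isnan(b_vals)  # usable rows, per column
    n = valid.sum(axis=0)
    # Per-column means over the valid rows only.
    a_mean = np.where(valid, a_vals, 0.0).sum(axis=0) / n
    b_mean = np.where(valid, b_vals, 0.0).sum(axis=0) / n
    # Deviations from the mean, zeroed where the pair is incomplete.
    a_dev = np.where(valid, a_vals - a_mean, 0.0)
    b_dev = np.where(valid, b_vals - b_mean, 0.0)
    cov = (a_dev * b_dev).sum(axis=0)
    r = cov / np.sqrt((a_dev**2).sum(axis=0) * (b_dev**2).sum(axis=0))
    return pd.Series(r, index=b.columns)


rng = np.random.default_rng(1)
a = pd.Series(rng.normal(size=50))
b = pd.DataFrame(rng.normal(size=(50, 3)), columns=["b0", "b1", "b2"])
b.iloc[::7, 0] = np.nan  # scatter some missing values
b.iloc[3:9, 2] = np.nan
a.iloc[10] = np.nan

result = corr_series_vs_frame_nan(a, b)
print(result)
```

Each column ends up with its own effective sample size, which also matters when converting the correlations to p-values.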

mat_corr_naive(a_mat, b_mat, melt: bool, method: str, pbar=False)

Same functionality as mat_corr, but uses a double loop for direct computation of statistics. method can be either "pearson" or "spearman".

Continuous vs. categorical

mat_mwu(a_mat, b_mat, melt: bool, effect: str, use_continuity=True)

Computes pairwise Mann-Whitney U tests between columns of a_mat (continuous samples) and b_mat (binary samples). Assumes that neither a_mat nor b_mat contains missing values. The only supported value of effect is "rank_biserial". use_continuity specifies whether a continuity correction should be applied.
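For a single pair of columns, the test and the rank-biserial effect size can be sketched with scipy. The function name `mwu_rank_biserial` is hypothetical, and the sign convention (positive when the group flagged True tends to have larger values) is an assumption; the package may use the opposite sign.

```python
import numpy as np
from scipy import stats


def mwu_rank_biserial(values, groups, use_continuity=True):
    """Two-sided Mann-Whitney U p-value and rank-biserial correlation for one pair."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups, dtype=bool)
    n_pos, n_neg = groups.sum(), (~groups).sum()
    # U statistic of the True group, computed from ranks of the pooled sample.
    ranks = stats.rankdata(values)
    u_pos = ranks[groups].sum() - n_pos * (n_pos + 1) / 2
    # Rank-biserial correlation: rescales U from [0, n_pos * n_neg] to [-1, 1].
    effect = 2.0 * u_pos / (n_pos * n_neg) - 1.0
    _, pval = stats.mannwhitneyu(
        values[groups],
        values[~groups],
        use_continuity=use_continuity,
        alternative="two-sided",
    )
    return effect, pval


effect, pval = mwu_rank_biserial(
    [1, 2, 3, 4, 5, 6], [False, False, False, True, True, True]
)
print(effect, pval)
```

In this toy input every value in the True group exceeds every value in the False group, so the effect size sits at the upper bound of 1.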

mat_mwu_naive(a_mat, b_mat, melt: bool, effect: str, use_continuity=True, pbar=False)

Same functionality as mat_mwu, but uses a double loop for direct computation of statistics. Unlike mat_mwu, effect parameters of "mean", "median", and "rank_biserial" are all supported.

Categorical vs. categorical

mat_fisher(a_mat, b_mat, melt: bool, pseudocount=0)

Computes pairwise Fisher's exact tests between columns of a_mat and b_mat, provided that both are boolean-castable matrices and do not contain any missing values. The pseudocount parameter (which must be an integer) specifies the value that should be added to all cells of the contingency matrices.
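For one pair of boolean columns, the contingency table and the effect of the pseudocount can be sketched as below. `fisher_with_pseudocount` is a hypothetical helper built on scipy.stats.fisher_exact, not the package's function. A common reason to add a pseudocount is to avoid zero cells, which would otherwise give an odds ratio of zero or infinity.

```python
import numpy as np
from scipy import stats


def fisher_with_pseudocount(a, b, pseudocount=0):
    """2x2 contingency table, odds ratio, and p-value for two boolean vectors."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    table = np.array(
        [
            [np.sum(a & b), np.sum(a & ~b)],
            [np.sum(~a & b), np.sum(~a & ~b)],
        ]
    )
    # An integer pseudocount keeps the cells valid as counts for the exact test.
    table = table + int(pseudocount)
    odds_ratio, pval = stats.fisher_exact(table)
    return table, odds_ratio, pval


a = [1, 1, 1, 0, 0, 0]
b = [1, 1, 0, 0, 0, 1]
table, odds_ratio, pval = fisher_with_pseudocount(a, b)
table_pc, odds_ratio_pc, pval_pc = fisher_with_pseudocount(a, b, pseudocount=1)
print(table, odds_ratio)
print(table_pc, odds_ratio_pc)
```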

mat_fisher_nan(a_mat, b_mat, melt: bool, pseudocount=0)

Computes pairwise Fisher's exact tests between columns of a_mat and b_mat, provided that both are boolean-castable matrices and may or may not contain missing values.
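When missing values are present, one plausible treatment is to drop, for each pair of columns, the rows where either value is missing before building the table. The sketch below shows that for a single pair; `fisher_pairwise_complete` is a hypothetical name, and whether the package handles missing values in exactly this way is an assumption.

```python
import numpy as np
import pandas as pd
from scipy import stats


def fisher_pairwise_complete(a: pd.Series, b: pd.Series):
    """Fisher's exact test on the rows where both a and b are observed."""
    keep = a.notna() & b.notna()
    a_obs = a[keep].astype(bool)
    b_obs = b[keep].astype(bool)
    table = [
        [int((a_obs & b_obs).sum()), int((a_obs & ~b_obs).sum())],
        [int((~a_obs & b_obs).sum()), int((~a_obs & ~b_obs).sum())],
    ]
    odds_ratio, pval = stats.fisher_exact(table)
    return int(keep.sum()), odds_ratio, pval


a = pd.Series([1, 1, np.nan, 0, 0, 1, 0])
b = pd.Series([1, np.nan, 1, 0, 0, 0, 1])
n_used, odds_ratio, pval = fisher_pairwise_complete(a, b)
print(n_used, odds_ratio, pval)
```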

mat_fisher_naive(a_mat, b_mat, melt: bool, pseudocount=0, pbar=False)

Same functionality as mat_fisher, but uses a double loop for direct computation of statistics.

Benchmarks

Benchmarks were run with 1,000 samples per variable (i.e. setting each input matrix to have 1,000 rows). The number of variables in a_mat was set to 100, and the number of variables in b_mat was varied as shown below. The number of pairwise comparisons (equivalent to the product of the column counts of a_mat and b_mat) is also indicated.

Benchmarks were run on an i7-7700K with 16GB of 2133 MHz RAM.
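A comparison of this kind can be reproduced in outline with a small timing harness. The sketch below uses numpy stand-ins (`corr_loop` and `corr_matrix`, both hypothetical) rather than the package's functions, with 1,000 rows, 100 columns in a, and 10 columns in b. Absolute timings depend on the machine and will not match the tables below.

```python
import time

import numpy as np


def time_call(fn, *args):
    """Run fn(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def corr_loop(a, b):
    """Pearson correlations computed one column pair at a time."""
    out = np.empty((a.shape[1], b.shape[1]))
    for i in range(a.shape[1]):
        for j in range(b.shape[1]):
            out[i, j] = np.corrcoef(a[:, i], b[:, j])[0, 1]
    return out


def corr_matrix(a, b):
    """Pearson correlations for all column pairs in a single call."""
    n_a = a.shape[1]
    return np.corrcoef(a, b, rowvar=False)[:n_a, n_a:]


rng = np.random.default_rng(0)
a = rng.normal(size=(1000, 100))
b = rng.normal(size=(1000, 10))

loop_result, loop_seconds = time_call(corr_loop, a, b)
fast_result, fast_seconds = time_call(corr_matrix, a, b)
print(f"loop: {loop_seconds:.3f} s, matrix: {fast_seconds:.3f} s")
```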

`mat_corr` (Pearson)

Column count of `b_mat`	Total comparisons	Runtime, `mat_corr_naive`, seconds	Runtime, `mat_corr`, seconds	Speedup
10	1,000	0.29	0.01	×25.86
100	10,000	2.67	0.07	×37.37
1,000	100,000	26.36	0.35	×74.47

`mat_corr` (Spearman)

Column count of `b_mat`	Total comparisons	Runtime, `mat_corr_naive`, seconds	Runtime, `mat_corr`, seconds	Speedup
10	1,000	0.74	0.02	×46.74
100	10,000	7.28	0.08	×94.21
1,000	100,000	72.06	0.38	×190.58

`mat_mwu`

Column count of `b_mat`	Total comparisons	Runtime, `mat_mwu_naive`, seconds	Runtime, `mat_mwu`, seconds	Speedup
10	1,000	1.06	0.14	×7.55
100	10,000	10.38	0.71	×14.62
1,000	100,000	103.17	6.52	×15.82

`mat_fisher`

Column count of `b_mat`	Total comparisons	Runtime, `mat_fisher_naive`, seconds	Runtime, `mat_fisher`, seconds	Speedup
10	1,000	1.18	0.24	×5.01
100	10,000	11.72	2.42	×4.83
1,000	100,000	116.48	23.77	×4.90

`mat_fisher_nan`

Column count of `b_mat`	Total comparisons	Runtime, `mat_fisher_naive`, seconds	Runtime, `mat_fisher_nan`, seconds	Speedup
10	1,000	1.23	0.14	×8.56
100	10,000	12.04	1.43	×8.40
1,000	100,000	120.20	14.46	×8.31

Visual methods

Development

Install dependencies with poetry install
Initialize environment with poetry shell
Initialize pre-commit hooks with pre-commit install

Release history

Version	Release date
0.7.2	Apr 7, 2024
0.7.1	Apr 7, 2024
0.7.0	Apr 7, 2024
0.6.9	Jan 9, 2022
0.6.8	May 2, 2021
0.6.7	May 2, 2021
0.6.6	Mar 10, 2021
0.6.4	Nov 21, 2020
0.6.3	Nov 20, 2020
0.6.2	Nov 20, 2020
0.6.1	Nov 20, 2020
0.6.0	Nov 15, 2020
0.5.4	Nov 6, 2020
0.5.3	Sep 7, 2020
0.5.2	Sep 6, 2020
0.5.1	Sep 5, 2020
0.5.0	Sep 5, 2020
0.4.0	Aug 27, 2020
0.3.0	Aug 24, 2020
0.2.4	Aug 17, 2020
0.2.3 (this version)	Aug 17, 2020
0.2.2	Aug 17, 2020
0.2.1	Aug 16, 2020
0.1.0	Jul 31, 2020

Download files

Source Distribution

many-0.2.3.tar.gz (19.6 kB)

Uploaded Aug 17, 2020 Source

Built Distribution

many-0.2.3-py3-none-any.whl (23.0 kB)

Uploaded Aug 17, 2020 Python 3

Hashes for many-0.2.3.tar.gz
Algorithm	Hash digest
SHA256	`074ae1a5d9651ffc9738eeadf609ebe17ddc1ac38f17c4182c2b8f4edaebabf8`
MD5	`3d69b3f0b8e5c3b4b6acd8b3db882594`
BLAKE2b-256	`d5dae200a6dd4dcbcab0947116def1e5ecc5278d8d0e7c70036323ee74c3e037`

Hashes for many-0.2.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c490ca8610799690c90b29ef38654e4334184e521f7d0bd5760e4c63b2f419e6`
MD5	`c41ca0c73508e71c57f137649da13d6c`
BLAKE2b-256	`d7fd099f656ab7beddee28ba9a398196578dfe5ed7ff68076a599ebd2f9e9f45`
