many

This package serves as a general-use toolkit for frequently-implemented statistical and visual methods.

Installation

pip install many

Note: to use the CUDA-accelerated statistical methods (i.e. many.stats.mat_mwu_gpu), you must also separately install the cupy build that matches your CUDA version.

Components

Statistical methods

The statistical methods comprise several functions for association mining between variable pairs. The methods used here are optimized for pandas DataFrames and are inspired by the corrcoef function provided by numpy.

Because these functions rely on native matrix-level operations provided by numpy, many are orders of magnitude faster than naive looping-based alternatives. This makes them useful for constructing large association networks or for feature extraction, which have important uses in areas such as biomarker discovery. All methods also return estimates of statistical significance.
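To make the matrix-level idea concrete, here is a minimal sketch of pairwise Pearson correlation written as a single matrix product. It is an illustration only, not the package's implementation; the name pairwise_pearson and the random inputs are assumptions made for this example.

```python
import numpy as np


def pairwise_pearson(a, b):
    """Pearson r between every column of `a` and every column of `b`.

    Both inputs are (n_samples, n_vars) arrays without missing values.
    Returns an array of shape (a.shape[1], b.shape[1]).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Center each column, then scale it to unit Euclidean norm.
    a_c = a - a.mean(axis=0)
    b_c = b - b.mean(axis=0)
    a_c /= np.linalg.norm(a_c, axis=0)
    b_c /= np.linalg.norm(b_c, axis=0)
    # One matrix product yields all pairwise correlations at once.
    return a_c.T @ b_c


rng = np.random.default_rng(0)
a = rng.normal(size=(50, 3))
b = rng.normal(size=(50, 4))
vectorized = pairwise_pearson(a, b)

# Loop-based reference: one np.corrcoef call per column pair.
looped = np.array(
    [[np.corrcoef(a[:, i], b[:, j])[0, 1] for j in range(4)] for i in range(3)]
)
```

The loop makes one Python-level call per pair, while the matrix product hands the whole job to numpy in a single call, which is where the speedup comes from.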

In certain cases such as the computation of correlation coefficients, these vectorized methods come with the caveat of numerical instability. As a compromise, "naive" loop-based implementations are also provided for testing and comparison. It is recommended that any significant results obtained with the vectorized methods be verified with these base methods.
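One common source of instability in formulas built from running sums is catastrophic cancellation. The sketch below is a generic illustration using sample variance; it is not taken from the package, and whether the package's methods are affected in this particular way is an assumption. Both function names are hypothetical.

```python
def variance_one_pass(xs):
    """Sample variance from running sums: (sum(x^2) - sum(x)^2 / n) / (n - 1)."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


def variance_two_pass(xs):
    """Sample variance from deviations around the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Four values with a large common offset; the exact sample variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
stable = variance_two_pass(data)
unstable = variance_one_pass(data)
```

The two-pass version recovers the exact value, while the one-pass version subtracts two nearly equal numbers of magnitude around 4e18 and loses the answer in rounding error. Comparing a fast result against a slower direct computation, as recommended above, catches this kind of discrepancy.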

The currently available functions are listed below by variable comparison type. Benchmarks are also provided with comparisons to the equivalent looping-based method. In all methods, the melt option controls the output format: either a set of statistic matrices whose rows and columns index the variable pairs, or a single DataFrame with one row per variable pair and each statistic melted to a column.
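The two output layouts can be pictured with plain pandas. This is a sketch of the general matrix-to-long conversion, not the package's own output; the column names a_col, b_col, and corr, as well as the numbers, are made up for the example.

```python
import pandas as pd

# Hypothetical statistic matrix: rows index variables from one input,
# columns index variables from the other.
corr = pd.DataFrame(
    [[0.9, -0.2], [0.1, 0.5], [-0.7, 0.3]],
    index=["a1", "a2", "a3"],
    columns=["b1", "b2"],
)

# Long ("melted") form: one row per variable pair.
melted = (
    corr.rename_axis(index="a_col")
    .reset_index()
    .melt(id_vars="a_col", var_name="b_col", value_name="corr")
)
```

The long form is convenient for sorting and filtering pairs, while the matrix form is convenient for heatmaps and further matrix operations.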

Continuous vs. continuous

mat_corr(a_mat, b_mat, melt: bool, method: str)

Computes pairwise Pearson or Spearman correlations between columns of a_mat and b_mat, provided that there are no missing values in either matrix. method can be either "pearson" or "spearman".
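The relationship between the two method values can be sketched as follows: Spearman correlation is Pearson correlation applied to ranks. This is an illustration under stated assumptions, not the package's code. The helper names are hypothetical, and the ranking shortcut assumes there are no tied values.

```python
import numpy as np


def column_ranks(x):
    """Rank each column from 0 to n-1 (assumes no tied values)."""
    return np.argsort(np.argsort(x, axis=0), axis=0).astype(float)


def pairwise_corr(a, b, method="pearson"):
    """All column-pair correlations between two complete (n, k) arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if method == "spearman":
        a, b = column_ranks(a), column_ranks(b)
    elif method != "pearson":
        raise ValueError("method must be 'pearson' or 'spearman'")
    a_c = a - a.mean(axis=0)
    b_c = b - b.mean(axis=0)
    return (a_c.T @ b_c) / np.outer(
        np.linalg.norm(a_c, axis=0), np.linalg.norm(b_c, axis=0)
    )


rng = np.random.default_rng(1)
x = rng.normal(size=(30, 1))
y = np.exp(x)  # monotonic but nonlinear transform of x

rho = pairwise_corr(x, y, method="spearman")[0, 0]
r = pairwise_corr(x, y, method="pearson")[0, 0]
```

Because y is a monotonic function of x, the rank-based coefficient is 1 up to rounding, while the Pearson coefficient is lower since the relationship is not linear.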

mat_corr_nan(a_mat, b_mat, melt: bool, method: str)

Computes pairwise Pearson or Spearman correlations between a_mat and the columns of b_mat, where a_mat is a Series and b_mat is a DataFrame that may contain missing values. method can be either "pearson" or "spearman".
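Handling missing values means each column may use a different subset of samples. A minimal sketch of one way to do this with masks, assuming plain numpy arrays rather than pandas objects (the function name corr_with_missing is hypothetical and this is not the package's implementation):

```python
import numpy as np


def corr_with_missing(a, b):
    """Pearson r between vector `a` and each column of `b`, skipping NaNs.

    For each column, only rows where both `a` and that column are present
    contribute, so every column may use a different subset of samples.
    """
    a = np.asarray(a, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)
    valid = ~np.isnan(a) & ~np.isnan(b)          # (n, m) usable entries
    n = valid.sum(axis=0)                        # usable rows per column
    a_v = np.where(valid, a, 0.0)                # zero out unusable entries
    b_v = np.where(valid, b, 0.0)
    a_c = np.where(valid, a_v - a_v.sum(axis=0) / n, 0.0)
    b_c = np.where(valid, b_v - b_v.sum(axis=0) / n, 0.0)
    cov = (a_c * b_c).sum(axis=0)
    return cov / np.sqrt((a_c**2).sum(axis=0) * (b_c**2).sum(axis=0))


rng = np.random.default_rng(2)
a = rng.normal(size=40)
b = rng.normal(size=(40, 3))
b[[0, 5, 9], 0] = np.nan      # scatter some missing values
b[[3, 5], 2] = np.nan

result = corr_with_missing(a, b)

# Reference: drop missing rows column by column and call np.corrcoef.
expected = []
for j in range(b.shape[1]):
    keep = ~np.isnan(b[:, j])
    expected.append(np.corrcoef(a[keep], b[keep, j])[0, 1])
```

Note that the mean of a is recomputed per column, because the set of usable rows changes with each column's missing-value pattern.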

mat_corr_naive(a_mat, b_mat, melt: bool, method: str, pbar=False)

Same functionality as mat_corr, but uses a double loop for direct computation of statistics. method can be either "pearson" or "spearman".

Continuous vs. categorical

mat_mwu(a_mat, b_mat, melt: bool, effect: str, use_continuity=True)

Computes pairwise Mann-Whitney U tests between columns of a_mat (continuous samples) and b_mat (binary samples). Assumes that neither a_mat nor b_mat contains missing values. The only supported value of effect is "rank_biserial". use_continuity specifies whether a continuity correction should be applied.
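For a single pair of variables, the quantities involved can be sketched in plain Python. This is an illustration rather than the package's code: the function name is hypothetical, the p-value uses the normal approximation without a tie correction, and the sign convention of the rank-biserial effect (positive when the group flagged true tends to be larger) is an assumption.

```python
import math


def mann_whitney_u(continuous, binary, use_continuity=True):
    """U statistic, rank-biserial correlation, and two-sided p-value.

    Uses the normal approximation without a tie correction, so it is
    only a rough guide when many values are tied.
    """
    pos = [x for x, g in zip(continuous, binary) if g]
    neg = [x for x, g in zip(continuous, binary) if not g]
    n1, n2 = len(pos), len(neg)
    # U counts the (pos, neg) pairs in which the pos value is larger;
    # ties count as half a pair.
    u = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    rank_biserial = 2 * u / (n1 * n2) - 1
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    # The continuity correction shrinks |U - mean| by 0.5 before scaling.
    shift = 0.5 if use_continuity else 0.0
    z = max(abs(u - mean_u) - shift, 0.0) / sd_u
    p_value = math.erfc(z / math.sqrt(2))  # two-sided normal tail
    return u, rank_biserial, p_value


values = [1.2, 3.4, 2.2, 5.1, 4.8, 0.7, 3.9, 6.0]
groups = [0, 1, 0, 1, 1, 0, 0, 1]
u, effect, p = mann_whitney_u(values, groups)
```

Here 15 of the 16 cross-group pairs favor the group flagged 1, so U is 15 and the rank-biserial correlation is 0.875.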

mat_mwu_gpu(a_mat, b_mat, melt: bool, effect: str, use_continuity=True)

Exact same behavior as mat_mwu, with the exception that computation is accelerated via cupy.

mat_mwu_naive(a_mat, b_mat, melt: bool, effect: str, use_continuity=True, pbar=False)

Same functionality as mat_mwu, but uses a double loop for direct computation of statistics. Unlike mat_mwu, effect parameters of "mean", "median", and "rank_biserial" are all supported.

Categorical vs. categorical

mat_fisher(a_mat, b_mat, melt: bool, pseudocount=0)

Computes pairwise Fisher's exact tests between columns of a_mat and b_mat, provided that both are boolean-castable matrices and do not contain any missing values. The pseudocount parameter (which must be an integer) specifies the value that should be added to all cells of the contingency matrices.
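For one pair of boolean variables, the test and the role of the pseudocount can be sketched with the standard library alone. This is an illustration, not the package's implementation; the function name is hypothetical. In this sketch the pseudocount has to be an integer because the hypergeometric probabilities are built from binomial coefficients of the cell counts.

```python
from math import comb


def fisher_exact_2x2(a, b, pseudocount=0):
    """Two-sided Fisher's exact test for two equal-length boolean sequences.

    Returns (odds_ratio, p_value). `pseudocount` is added to every cell
    of the 2x2 contingency table before testing.
    """
    n11 = sum(1 for x, y in zip(a, b) if x and y) + pseudocount
    n10 = sum(1 for x, y in zip(a, b) if x and not y) + pseudocount
    n01 = sum(1 for x, y in zip(a, b) if not x and y) + pseudocount
    n00 = sum(1 for x, y in zip(a, b) if not x and not y) + pseudocount

    row1, col1, total = n11 + n10, n11 + n01, n11 + n10 + n01 + n00

    def prob(k):
        # Hypergeometric probability of k in the top-left cell,
        # holding the row and column totals fixed.
        return comb(col1, k) * comb(total - col1, row1 - k) / comb(total, row1)

    lo, hi = max(0, row1 + col1 - total), min(row1, col1)
    p_obs = prob(n11)
    # Two-sided p-value: total probability of tables no likelier than observed.
    p_value = sum(
        p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7)
    )
    odds_ratio = (n11 * n00) / (n10 * n01) if n10 * n01 else float("inf")
    return odds_ratio, min(p_value, 1.0)


a = [1, 1, 1, 1, 0, 0, 0, 0]
b = [1, 1, 1, 0, 1, 0, 0, 0]
odds_ratio, p_value = fisher_exact_2x2(a, b)
```

The example table is [[3, 1], [1, 3]], which gives an odds ratio of 9 and a two-sided p-value of 34/70. A nonzero pseudocount keeps the odds ratio finite when a cell would otherwise be empty.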

mat_fisher_nan(a_mat, b_mat, melt: bool, pseudocount=0)

Computes pairwise Fisher's exact tests between columns of a_mat and b_mat, provided that both are boolean-castable matrices. Unlike mat_fisher, the inputs may contain missing values.
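With missing values, each pair of columns has its own set of usable rows, and the four contingency counts can still be obtained for all pairs at once with matrix products. The sketch below assumes 0/1 numpy arrays with NaN marking missing entries; the function name pairwise_counts is hypothetical and this is not the package's code.

```python
import numpy as np


def pairwise_counts(a, b):
    """Contingency counts for every column pair of two 0/1 arrays with NaNs.

    Returns four (a_cols, b_cols) integer arrays: n11, n10, n01, n00.
    A row contributes to a pair only if both entries are present.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_true, a_false = (a == 1).astype(int), (a == 0).astype(int)
    b_true, b_false = (b == 1).astype(int), (b == 0).astype(int)
    # NaN compares unequal to both 0 and 1, so missing entries drop out
    # of every product below.
    return (
        a_true.T @ b_true,
        a_true.T @ b_false,
        a_false.T @ b_true,
        a_false.T @ b_false,
    )


a = np.array([[1, 0], [1, np.nan], [0, 1], [0, 1], [1, 0]])
b = np.array([[1], [0], [np.nan], [0], [1]])
n11, n10, n01, n00 = pairwise_counts(a, b)
```

The four counts for a pair sum to the number of rows where both variables are observed, which is 4 for the first column of a and 3 for the second in this example.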

mat_fisher_naive(a_mat, b_mat, melt: bool, pseudocount=0, pbar=False)

Same functionality as mat_fisher, but uses a double loop for direct computation of statistics.

Benchmarks

Benchmarks were run with 1,000 samples per variable (i.e. setting each input matrix to have 1,000 rows). The number of variables in a_mat was set to 100, and the number of variables in b_mat was varied as shown below. The number of pairwise comparisons (equivalent to the product of the column counts of a_mat and b_mat) is also indicated.

Benchmarks were run on an i7-7700K with 16GB of 2133 MHz RAM. GPU benchmarks were performed on a GTX 1080.
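A timing harness along these lines can be used to run a comparable measurement on other hardware. It is a sketch under stated assumptions, not the script used for the benchmarks above: the helper names are hypothetical, and only a loop-based numpy baseline is timed.

```python
import time

import numpy as np


def time_call(func, *args, repeats=3):
    """Best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def make_inputs(n_samples, n_a, n_b, seed=0):
    """Random continuous inputs with the requested shapes."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(n_samples, n_a)), rng.normal(size=(n_samples, n_b))


def looped_corr(a, b):
    """Loop-based baseline: one np.corrcoef call per column pair."""
    return np.array([[np.corrcoef(x, y)[0, 1] for y in b.T] for x in a.T])


# 1,000 samples and 100 variables in `a`, as in the setup above; the
# width of `b` is kept small here so the loop finishes quickly.
a, b = make_inputs(1000, 100, 10)
elapsed = time_call(looped_corr, a, b)
n_comparisons = a.shape[1] * b.shape[1]
```

Taking the best of several repeats reduces the influence of background load on the measurement.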

Visual methods

Continuous vs. continuous

scatter_grid(dataframe)

Plot relationships between columns in a DataFrame, coloring by density and inserting labels given a set of significant value masks.

regression(
    x, y, method, ax=None, alpha=0.5, text_pos=(0.1, 0.9), scatter_kwargs={}
)

Plot two sets of points along with their regression coefficient.

dense_regression(
    x,
    y,
    method,
    ax=None,
    colormap=None,
    cmap_offset=0,
    text_pos=(0.1, 0.9),
    scatter_kwargs={},
)

Plot two sets of points and their regression coefficient, along with density-based coloring.

dense_plot(
    x,
    y,
    text_adjust: bool,
    ax=None,
    labels_mask=None,
    labels=None,
    colormap=None,
    cmap_offset=0,
    scatter_kwargs={},
    x_offset=0,
    y_offset=0,
)

Plot two sets of points, coloring by density and inserting labels given a set of significant value masks. Density estimated by Gaussian KDE.
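Density-based coloring with a Gaussian KDE is commonly done by fitting the estimator to the points and evaluating it at those same points. The following sketch uses scipy.stats.gaussian_kde for this; it shows the general technique and is not the package's plotting code. The function name point_densities is hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde


def point_densities(x, y):
    """Estimated 2-D density at each (x, y) point via a Gaussian KDE."""
    xy = np.vstack([x, y])        # shape (2, n_points)
    return gaussian_kde(xy)(xy)   # evaluate the fitted KDE at the same points


rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = x + rng.normal(scale=0.5, size=200)

density = point_densities(x, y)
# Sort so that the densest points are drawn last and stay visible.
order = np.argsort(density)
x_sorted, y_sorted, density_sorted = x[order], y[order], density[order]
```

The sorted arrays can then be passed to a scatter plot, with density_sorted supplying the color values.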

Continuous vs. categorical

two_dists(
    binary,
    continuous,
    method,
    summary_type,
    ax=None,
    pal=["#eaeaea", "#a5dee5"],
    annotate=True,
    stripplot=False,
    seaborn_kwargs={},
    stripplot_kwargs={},
)

Compare the distributions of a continuous variable when grouped by a binary one.

multi_dists(
    continuous,
    categorical,
    count_cutoff,
    summary_type,
    ax=None,
    stripplot=False,
    order="ascending",
    newline_counts=False,
    xtick_rotation=45,
    xtick_ha="right",
    seaborn_kwargs={},
    stripplot_kwargs={},
)

Compare the distributions of a continuous variable when grouped by a categorical one.

roc_auc_curve(y, y_pred, ax=None)

Plot the ROC curve along with the AUC statistic of predictions against ground truths.
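The curve and its area can be computed by hand for a small example. This standard-library sketch illustrates the definitions only; it is not the package's code, both function names are hypothetical, and it assumes both classes are present in y.

```python
def roc_points(y_true, y_score):
    """ROC curve as a list of (false positive rate, true positive rate)."""
    n_pos = sum(1 for y in y_true if y)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(y_score, y_true), reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += 1 if label else 0
        fp += 0 if label else 1
        # Emit a point only after the last sample of each distinct score,
        # so tied scores move the curve diagonally in one step.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


y = [0, 0, 1, 1]
y_pred = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(y, y_pred)
area = auc(curve)
```

In this example three of the four positive-negative pairs are ranked correctly, so the area is 0.75.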

pr_curve(y, y_pred, ax=None)

Plot the precision-recall curve of predictions against ground truths.
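The points on a precision-recall curve come from sweeping a threshold over the predicted scores. A minimal standard-library sketch of that sweep, with a hypothetical function name and no claim to match the package's internals:

```python
def precision_recall_points(y_true, y_score):
    """(recall, precision) at each distinct score threshold, highest first."""
    n_pos = sum(1 for y in y_true if y)
    ranked = sorted(zip(y_score, y_true), reverse=True)
    points, tp = [], 0
    for i, (score, label) in enumerate(ranked, start=1):
        tp += 1 if label else 0
        # i samples are predicted positive at this threshold.
        if i == len(ranked) or ranked[i][0] != score:
            points.append((tp / n_pos, tp / i))
    return points


y = [0, 0, 1, 1]
y_pred = [0.1, 0.4, 0.35, 0.8]
pr = precision_recall_points(y, y_pred)
```

Lowering the threshold can only raise recall, while precision may move in either direction, which is why the curve is not monotonic in general.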

binary_metrics(y, y_pred)

Make several plots to evaluate a binary classifier:

1. Boxplots of predicted values
2. Violinplots of predicted values
3. ROC-AUC plot
4. Precision-recall curve

Categorical vs. categorical

binary_contingency(a, b, ax=None, heatmap_kwargs={})

Plot agreement between two binary variables, along with the odds ratio and Fisher's exact test p-value.

Development

Install dependencies with poetry install
Initialize environment with poetry shell
Initialize pre-commit hooks with pre-commit install

Release history

Version	Released
0.7.2	Apr 7, 2024
0.7.1	Apr 7, 2024
0.7.0	Apr 7, 2024
0.6.9	Jan 9, 2022
0.6.8	May 2, 2021
0.6.7	May 2, 2021
0.6.6	Mar 10, 2021
0.6.4	Nov 21, 2020
0.6.3	Nov 20, 2020
0.6.2	Nov 20, 2020
0.6.1	Nov 20, 2020
0.6.0	Nov 15, 2020
0.5.4	Nov 6, 2020
0.5.3	Sep 7, 2020
0.5.2	Sep 6, 2020 (this version)
0.5.1	Sep 5, 2020
0.5.0	Sep 5, 2020
0.4.0	Aug 27, 2020
0.3.0	Aug 24, 2020
0.2.4	Aug 17, 2020
0.2.3	Aug 17, 2020
0.2.2	Aug 17, 2020
0.2.1	Aug 16, 2020
0.1.0	Jul 31, 2020

Download files

Source Distribution

many-0.5.2.tar.gz (22.0 kB)

Uploaded Sep 6, 2020 Source

Built Distribution

many-0.5.2-py3-none-any.whl (25.3 kB)

Uploaded Sep 6, 2020 Python 3

Hashes for many-0.5.2.tar.gz
Algorithm	Hash digest
SHA256	`20e90383d12f7b58c02a2994253420b49db0cfe22fc376c9e479ccd1451ae6cb`
MD5	`0539dbac69a785694f7b34a3b3ae606f`
BLAKE2b-256	`033d5e40ee1dc63822884fe80837b6465f3551ec2bd19bde7be3eda0f0650ee2`

Hashes for many-0.5.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`41d22fa8d0903d2da65a479319db3de73b4a1cabef6eaff104b6a6404cbd9fd7`
MD5	`68595d2e664117daca5ab02920fa1a5e`
BLAKE2b-256	`17a271ee985f4a394c895cb142d5119ec4bc7b748ac95461166f66721b0d7adf`