beta-binomial based testing of count data

CORNCOB: Beta-binomial regression of count data

Introduction

This is a Python implementation of the R corncob package. It provides a beta-binomial model-based approach to hypothesis testing of count data, such as that generated by microbiome 16S rRNA or RNAseq experiments.

The method was first developed by Dr. Willis and her mentee Bryan D Martin, and is fully described in Modeling microbial abundances and dysbiosis with beta-binomial regression.

CORNCOB stands for: COunt RegressioN for Correlated Observations with the Beta-binomial

Why you should use this technique is detailed below.

Installation

Installation can be done via pip: pip install corncob

Usage

corncob -C counts.csv -VA covariates_abund.csv -VD covariates_disp.csv -O output.csv

counts.csv format

element,specimen_label_01,specimen_label_02,....
total,1244,1344,....
sv_name,20,0,.....

The header row should provide specimen (nee observation) labels.

The first data row should hold the total count across all elements for each specimen (e.g. the total count of reads for all ASVs in a 16S rRNA experiment).

Each subsequent row should be for one element (ESV, OTU, gene, etc.). The first column should be a unique identifier for that element. The remaining columns should be the counts for that element in each specimen.

Zeros are fine. Unlike other count-based methods (e.g. DESeq2), you do not need to zero-inflate, and in fact should not. Nulls or blank cells will be replaced with zeros.
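As a rough illustration of the layout described above, the following sketch writes a counts table using only the Python standard library. The helper name write_counts_csv, the element ids, and the specimen labels are all hypothetical and are not part of the corncob package. The total row is computed here as the sum over the elements in the table, which is only correct when the table contains every element observed in each specimen.

```python
import csv
import io

def write_counts_csv(counts, handle):
    """Write a counts table: header row, total row, then one row per element.

    counts: dict mapping element id -> dict mapping specimen label -> count.
    (Hypothetical helper for illustration; not part of corncob.)
    """
    # Collect specimen labels in a stable order.
    specimens = sorted({s for per_elem in counts.values() for s in per_elem})
    writer = csv.writer(handle, lineterminator="\n")
    # Header row: the first column names the element, the rest are specimen labels.
    writer.writerow(["element"] + specimens)
    # First data row: total count across all elements for each specimen.
    totals = [sum(per_elem.get(s, 0) for per_elem in counts.values())
              for s in specimens]
    writer.writerow(["total"] + totals)
    # One row per element; missing counts are written as explicit zeros.
    for element_id, per_elem in counts.items():
        writer.writerow([element_id] + [per_elem.get(s, 0) for s in specimens])

# Made-up example data.
example = {
    "sv_001": {"specimen_01": 20, "specimen_02": 0},
    "sv_002": {"specimen_01": 5, "specimen_02": 12},
}
buffer = io.StringIO()
write_counts_csv(example, buffer)
counts_csv_text = buffer.getvalue()
print(counts_csv_text)
```

Writing to a file instead of a StringIO buffer only requires passing an open file handle (opened with newline="") in place of buffer.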

covariates.csv format

cov1,cov2,cov3,...
331,0.12,0000.1,...
....

This is an exogenous matrix for the covariates. Each column corresponds to one covariate to be fitted. An intercept column is always added. Any empty cells will be replaced with zeros. If not provided, only an intercept will be used for regression.

Distinct covariate tables can be provided for abundance (-VA) and dispersion (-VD).
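A covariate table in this layout might be prepared along the following lines. This is only a sketch under two assumptions that the description above does not confirm: that each row corresponds to one specimen in the same order as the specimen columns of counts.csv, and that every cell must be numeric (so a two-level category is encoded as a 0/1 indicator). The metadata records and column names are made up for illustration.

```python
import csv
import io

# Hypothetical per-specimen metadata, listed in the same order as the
# specimen columns of counts.csv (an assumption, see the text above).
metadata = [
    {"specimen": "specimen_01", "age": 34, "group": "case"},
    {"specimen": "specimen_02", "age": 51, "group": "control"},
]

def covariate_rows(records):
    """Turn metadata records into numeric covariate rows."""
    rows = []
    for rec in records:
        # Encode the two-level group as a 0/1 indicator so every cell is numeric.
        is_case = 1 if rec["group"] == "case" else 0
        rows.append([rec["age"], is_case])
    return rows

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
# No intercept column is written: the tool is described as always adding one.
writer.writerow(["age", "is_case"])
writer.writerows(covariate_rows(metadata))
covariates_csv_text = buffer.getvalue()
print(covariates_csv_text)
```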

Output

A CSV file with a header. The header has the various attributes, in the rough format of element_id,converged,abd__cov_name__Estimate,abd__cov_name__se,abd__cov_name__t,abd__cov_name__p,...

The first block indicates if this is for abundance (abd) or dispersion (disp).

The second block is the name of the covariate, taken directly from the input files.

The third block is one of:

  • Estimate: The fitted coefficient for that covariate, for abundance or dispersion
  • se: The standard error of the estimate
  • t: The t-value, derived via the Wald statistic
  • p: The p-value (derived from the t-value). Two tailed.
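The column naming scheme above can be unpacked programmatically. The sketch below assumes the three blocks are separated by double underscores exactly as in the example header. The helper split_column, the covariate name age, the numeric values, and the way the converged cell is written are all invented for illustration and are not real output from the tool.

```python
import csv
import io

def split_column(name):
    """Split 'abd__cov_name__Estimate' into (block, covariate, statistic).

    Returns None for columns that do not follow the three-block pattern,
    such as element_id or converged. (Hypothetical helper.)
    """
    if "__" not in name:
        return None
    block, rest = name.split("__", 1)
    if "__" not in rest:
        return None
    # Split from the right so a covariate name containing '__' stays intact.
    covariate, statistic = rest.rsplit("__", 1)
    return block, covariate, statistic

# Made-up output rows in the rough format described above.
sample_output = """element_id,converged,abd__age__Estimate,abd__age__se,abd__age__t,abd__age__p
sv_001,True,0.8,0.2,4.0,0.001
sv_002,True,0.1,0.3,0.33,0.74
"""

reader = csv.DictReader(io.StringIO(sample_output))
parsed_columns = {name: split_column(name) for name in reader.fieldnames}

# Keep elements whose abundance p-value for 'age' falls below 0.05.
significant = [row["element_id"] for row in reader
               if float(row["abd__age__p"]) < 0.05]
print(parsed_columns["abd__age__p"])
print(significant)
```

When many elements are tested at once, the raw p-values would normally be adjusted for multiple comparisons before thresholding; that step is left out of this sketch.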

Why use beta-binomial regression?

Thanks to next-generation sequencing, biological scientists have an opportunity to make discoveries with count data. Whether RNAseq, ATAC-seq, microbiome 16S rRNA amplicons, or shotgun metagenomes, often the data we wish to compare to our outcomes are ultimately a big table of number of reads assigned in each specimen to a gene, locus, microbe, etc. These read-counts pose an analytic challenge.

While superficially appearing like any other continuous measure (particularly when normalized to relative abundance), these counts-of-reads are tricky. Typically the total number of reads varies a lot from specimen to specimen. Even modest noise in the most predominant features (e.g. OTUs) can wreak havoc on the precision of true features further into the tail. Read-counts are not really independent of one another; there are likely all sorts of latent covariance that we are barely beginning to understand. Compounding all this, most of these read-count matrices are sparse: most of the values are zeros.

An ideal regression method for read-count data would:

  • Be fine with zeros, even a lot of zeros, and not need to do something like "add one to each count" to work.
  • Recognize that read-counts are likely correlated with one another in ways that we do not fully understand.
  • Take advantage of the variable total read depth (total reads) per specimen.
  • Consider how our covariates (outcomes) relate not just to the relative abundance but also the variability (dispersion) of a given feature's read count.

The adaptation of beta-binomial models by Martin et al. fulfills these criteria.
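To make the criteria above concrete, here is a minimal sketch of a beta-binomial probability mass function compared with a plain binomial at the same mean. It assumes a mean/overdispersion parameterization (mu for relative abundance, phi for dispersion, with 0 < phi < 1); the parameterization and link functions actually used by this package are not described here, so treat the function and its parameter names as illustrative only.

```python
import math

def log_beta(x, y):
    """Logarithm of the Beta function, computed via log-gamma."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)

def betabinom_logpmf(w, m, mu, phi):
    """Log-probability of w reads out of m total under a beta-binomial.

    mu is the expected relative abundance and phi the overdispersion;
    both names are assumptions made for this sketch.
    """
    # Convert (mu, phi) to the usual Beta shape parameters.
    a = mu * (1 - phi) / phi
    b = (1 - mu) * (1 - phi) / phi
    log_choose = (math.lgamma(m + 1) - math.lgamma(w + 1)
                  - math.lgamma(m - w + 1))
    return log_choose + log_beta(w + a, m - w + b) - log_beta(a, b)

m, mu, phi = 50, 0.1, 0.2  # made-up read depth, abundance, and dispersion
pmf = [math.exp(betabinom_logpmf(w, m, mu, phi)) for w in range(m + 1)]

total_prob = sum(pmf)
mean = sum(w * p for w, p in enumerate(pmf))
variance = sum((w - mean) ** 2 * p for w, p in enumerate(pmf))

# Same quantities for a binomial with the same depth and mean.
binom_variance = m * mu * (1 - mu)
p_zero_binom = (1 - mu) ** m
p_zero_betabinom = pmf[0]

print(total_prob, mean, variance, binom_variance)
print(p_zero_betabinom, p_zero_binom)
```

With these made-up values the beta-binomial variance follows the closed form m * mu * (1 - mu) * (1 + (m - 1) * phi), which is considerably larger than the binomial variance, and the model assigns far more probability to a zero count. The total depth m enters the likelihood directly, so specimens with different read depths can be modeled without normalizing the counts or adding pseudocounts.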
