beta-binomial based testing of count data
Project description
CORNCOB: Beta-binomial regression of count data
Introduction
This is a python implementation of the R corncob package. This is a beta-binomal model-based approach for hypothesis testing of count data such as that generated by microbiome 16S rRNA or RNAseq experiments.
The method was first developed by Dr. Willis and her mentee Bryan D Martin, and is fully described in Modeling microbial abundances and dysbiosis with beta-binomial regression.
CORNCOB stands for: COunt RegressioN for Correlated Observations with the Beta-binomial
Why you should use this technique is detailed below.
Installation
Installation can be done via pip:
pip install corncob
Usage
corncob -C counts.csv -VA covariates_abund.csv -VD covariates_disp.csv -O output.csv
counts.csv format
element,specimen_label_01,specimen_label_02,....
total,1244,1344,....
sv_name,20,0,.....
The header row should provide specimen (nee observation) labels.
The first data row should be for the total count across all elements for each specimen (e.g. the total count of reads for all ASVs in a 16S rRNA experiment.)
Each row after should be for one element (ESV, OTU, gene, etc). The first column should be an unique identifier for that element. The remainder of columns should be the counts for that element.
Zeros are fine. Unlike other count-based metrics (i.e. DESEQ2) you do not need to zero-inflate and in fact should not. Nulls or blank cells will be replaced with zeros.
covariates.csv format
cov1,cov2,cov3,...
331,0.12,0000.1,...
....
This is an exogenous matrix for the covariates. Each column corresponds to one covariate to be fitted. An intercept column is always added. Any empty cells will be replaced with zeros. If not provided, only an intercept will be used for regression.
One can provide distinct covariate tables for abundance (-VA
) and dispersion (-VD
)
Output
A CSV file with a header. The header has the various attributes, in the rough format of
element_id,converged,abd__cov_name__Estimate,abd__cov_name__se,abd__cov_name__t,abd__cov_name__p,...
The first block indicates if this is for abundance (abd
) or dispersion (disp
).
The second block in the same of the covariate, taken directly from the input files.
The third block is
Estimate
: The fitted estimated coefficent for that covariate for abundance or dispersionse
: The standard error of the estimatet
: The t-value, derived via the Wald statisticp
: The p-value (derived from the t-value). Two tailed.
Why use beta-binomial regression?
Thanks to next-generation sequencing, biological scientists have an opportunity to make discoveries with count data. Whether RNAseq, ATAC-seq, microbiome 16S rRNA amplicons, or shotgun metagenomes, often the data we wish to compare to our outcomes are ultimately a big table of number of reads assigned in each specimen to a gene, locus, microbe, etc. These read-counts pose an analytic challenge.
While superficially appearing like any other continuous measure (particularly when normalized to relative abundance), these counts-of-reads are tricky. It is typical that the total number of reads to vary a lot from specimen-to-specimen. Even modest noise in the most predominant features (e.g. OTU) can wreak havoc on the precision of true featues further into the tail. Each read-count is not really independent; thre are likely all sorts of latent covariance we are barely beginning to understand. Compounding all this is that most of these read-count matricies are sparse; most of the values are zeros.
An ideal regression method for read-count data would:
- Be fine with zeros, even a lot of zeros, and not neet to do something like "add one to each count" to work.
- Recognize read-counts are likely correlated with one another in ways that we do not fully understand
- Take advantage of the variable total read depth (total reads) per specimen.
- Consider how our covariates (outcomes) relate not just to the relative abundance but also the variability (dispersion) of a given feature's read count.
The adaptation of beta-binomial models by Martin et al fullfils these criteria.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file corncob-0.0.1.tar.gz
.
File metadata
- Download URL: corncob-0.0.1.tar.gz
- Upload date:
- Size: 7.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.19.1 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.30.0 CPython/3.7.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 184e264c857092ce7a132fefd14d7843a0f86065e62c35645b210ded7c6cab5e |
|
MD5 | 8b4a79f404c5e26c52a11718c185cfe8 |
|
BLAKE2b-256 | d0bb7286fdc033b4c65ea7964a6874f40195a94260613720a45e9df611523b5c |
File details
Details for the file corncob-0.0.1-py3-none-any.whl
.
File metadata
- Download URL: corncob-0.0.1-py3-none-any.whl
- Upload date:
- Size: 9.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.19.1 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.30.0 CPython/3.7.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 50641b4916571d4429cadad219e848892522a3eaa759c925597a698d0e78cc53 |
|
MD5 | 56b4dc46c212e0fadaf6c46f8cc2dff3 |
|
BLAKE2b-256 | 200dd270ff7b746c28eb9d68980de514a74e2bd8db07f2216779376232ad4fcc |