A bioinformatics package to estimate the depth-complexity ratio and amplification noisiness of a sequencing experiment.
NAME
decoratio — estimate the DEpth-COmplexity RATIO and amplification noisiness of a sequencing experiment based on the PCR clone size histogram.
INSTALLATION
python3 -m pip install decoratio
Decoratio depends on numpy, scipy, and matplotlib.
SYNOPSIS
python3 -m decoratio [--model=MODEL] [...] CLONE_SIZE_HISTOGRAM.tsv
AUTHORS & CITATION
Nicolas C. Rochette, Angel G. Rivera-Colón, Julian M. Catchen
If you use Decoratio, please cite:
Rochette NC, Rivera-Colón AG, Walsh J, Sanger TJ, Campbell-Staton SC, Catchen JM (2022). On the causes, consequences, and avoidance of PCR duplicates: towards a theory of library complexity. bioRxiv. https://doi.org/10.1101/2022.10.10.511638
DESCRIPTION
PCR duplicates become prevalent in sequencing experiments when the sequencing effort exceeds the number of molecules the library was originally prepared from. The precise duplicate rate that is observed depends primarily on the ratio between the sequencing depth and the complexity of the library, and secondarily on the variance of the amplification factor among the original template molecules of the library.
This program leverages PCR duplicate patterns (namely, the relative occurrence of duplication clones of various sizes) in a sequencing dataset to jointly estimate (1) the depth-complexity ratio, and (2) a variance model for the PCR amplification of the library. As the sequencing depth should be known to the experimenter, the depth-complexity ratio is a library complexity estimate. This, together with the amplification variance model, informs on the quality of the library, making it possible to identify failed libraries and/or to compare the performance of a set of library preparation protocols.
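As a point of reference, under perfectly uniform amplification (every template molecule equally likely to be sequenced) the expected duplicate rate depends on the depth-complexity ratio r alone, as 1 - (1 - e^-r)/r. Below is a minimal sketch of this idealized baseline only, not of decoratio's fitted model, which additionally accounts for amplification noise:

import math

def idealized_duplicate_rate(r):
    """Expected duplicate rate under uniform amplification (Poisson
    sampling of templates); r is the depth-complexity ratio."""
    return 1.0 - (1.0 - math.exp(-r)) / r

# At r = 1.93 (the ratio estimated in the EXAMPLES section below), the
# uniform baseline predicts ~55.7% duplicates; noisy amplification pushes
# the observed rate higher (60.8% in that example).
print(f"{idealized_duplicate_rate(1.93):.1%}")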
Input TSV histogram
The main input for the program is a TSV table giving the number of duplication clones of each size that were observed. A duplication clone is a set of individual reads that all descend from the same template molecule and are PCR duplicates of one another. Thus, clones of size 1 correspond to singleton reads that do not have duplicates; clones of size 2 to pairs of duplicate reads; clones of size 3 to read triplets; etc.
The input file should look like this:
clone_size n_clones
1 9746142
2 7320751
3 4969106
4 3023127
5 1693158
6 893344
7 456659
...
# (Empty size classes may be omitted.)
For instructions on how to obtain this histogram from sequencing data (e.g., BAM alignment files), see the EXAMPLES section below.
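From such a table, the overall PCR duplicate rate reported by the program can also be checked directly: it is one minus the total number of clones divided by the total number of reads. A minimal sketch (assuming the two-column format above; the file name is a placeholder):

import csv

n_reads = n_clones = 0
with open("clone_sizes.tsv") as fh:
    for row in csv.DictReader(fh, delimiter="\t"):
        size, count = int(row["clone_size"]), int(row["n_clones"])
        n_clones += count        # each clone derives from one template
        n_reads += size * count  # and contributes `size` reads
print(f"{1 - n_clones / n_reads:.1%} PCR duplicates")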
Options
--model
: an amplification model specifier such as "logskewnormal" (default), "lognormal", or "inheffbeta:12"; see AMPLIFICATION MODELS below
--output-prefix
: defaults to CLONE_SIZE_HISTOGRAM.tsv.decoratio.*
--ratio-min, --ratio-max
: constrain the optimization range for the depth-complexity ratio (see the example below)
--no-plots, --tsv-out, --log-level
: miscellaneous output control
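For instance, to restrict the ratio search while keeping the default model (the values here are placeholders; options take the same key=value syntax as --model in the SYNOPSIS):

python3 -m decoratio --ratio-min=0.1 --ratio-max=10 ./clone_sizes.tsv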
Amplification models
Default settings should work well in most cases.
Three models are currently implemented for the amplification: log-skew-normal (logskewnormal, the default), log-normal (lognormal), and the empirical inherited efficiency model (inheff and inheffbeta for the original and beta distribution-based versions, respectively). All models may be specified without parameters, in which case their parameters will be jointly optimized with the depth-complexity ratio, or with user-specified parameters, in which case only the depth-complexity ratio will be optimized.
Log-skew-normal and log-normal models
The PCR duplicate rate is only sensitive to the variance of the relative amplification factors, not to the mean amplification factor. As a result, the log-normal model has only one parameter (a standard deviation) and the log-skew-normal model has two (a standard deviation and a skew). Thus, for these models the possible values take the form:
--model=lognormal
--model=lognormal:0.54
--model=logskewnormal
--model=logskewnormal:0.72:-3.4
All numerical values are given in units of the natural logarithm (this is important: e.g., an amplification factor log-distribution with a standard deviation of 0.54 in base e has a standard deviation of 0.78 in base 2).
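The conversion between log bases is a division by ln(2); a quick check of the figures above:

import math

sd_e = 0.54                # standard deviation in natural-log (base e) units
sd_2 = sd_e / math.log(2)  # the same spread expressed in log2 units
print(f"{sd_2:.2f}")       # prints 0.78, matching the example above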
Inherited efficiency models
This is the empirical model of Best et al. (2015). We provide two closely related versions: the normal distribution-based model and a stabilized model based on the beta distribution, which we recommend.
A number of PCR cycles (e.g., 12) must always be provided. The models then depend on two additional parameters, the mean efficiency of the PCR process and its standard deviation across clones. Thus, possible values take the form:
--model=inheff:12
--model=inheffbeta:12
--model=inheffbeta:12:0.76:0.14
Computations based on this model are substantially slower. The default is to perform binned simulations, as this is more computationally efficient, with a default effort of 1000 simulations for each of 1000 efficiency bins. This should work well, but can be changed with --inheff-n-efficiency-bins and --inheff-n-sims-per-bin. Binning during simulations can be deactivated with --inheff-legacy-sims. If optimizing the parameters, the default is to optimize in a one-dimensional space, but --inheff-uncouple-beta-parameters will cause the two parameters to be optimized independently (for the beta distribution version only). This may impact the precise shape of the amplification factor distribution.
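For intuition, below is a minimal Monte-Carlo sketch of the inherited-efficiency idea: each clone inherits a single per-cycle duplication probability, drawn here from a beta distribution re-parameterized from the (mean, sd) pair used above. This only illustrates the model's logic; it is not decoratio's binned implementation.

import numpy as np

rng = np.random.default_rng(1)

def inherited_efficiency_factors(n_cycles=12, mean_eff=0.76, sd_eff=0.14,
                                 n_clones=100_000):
    """One efficiency per clone, inherited by all of its copies, then
    n_cycles of stochastic doubling."""
    # Re-parameterize the beta distribution from (mean, sd) to (alpha, beta).
    nu = mean_eff * (1 - mean_eff) / sd_eff**2 - 1
    eff = rng.beta(mean_eff * nu, (1 - mean_eff) * nu, size=n_clones)
    sizes = np.ones(n_clones, dtype=np.int64)
    for _ in range(n_cycles):
        sizes += rng.binomial(sizes, eff)  # each molecule copies w.p. eff
    return sizes

factors = inherited_efficiency_factors()
# Only the *relative* spread of the factors matters for PCR duplicates:
print(f"coefficient of variation: {factors.std() / factors.mean():.2f}")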
EXAMPLES
curl -O https://bitbucket.org/rochette/decoratio/raw/HEAD/examples/brown_alone.clone_sizes.tsv
# Using the default logskewnormal amplification model:
python3 -m decoratio ./brown_alone.clone_sizes.tsv
# Using the lognormal amplification model (this is faster):
python3 -m decoratio --model=lognormal ./brown_alone.clone_sizes.tsv
# Using a user-specified amplification model:
python3 -m decoratio --model=logskewnormal:0.72:-3.4 ./brown_alone.clone_sizes.tsv
Sample output:
decoratio-1.0.0 (python-3.8, scipy-1.5.2)
Called as: decoratio ./brown_alone.clone_sizes.tsv
Outputing to: ./brown_alone.clone_sizes.tsv.decoratio.logskewnormal.*
Reading input file...
Input data shows 60.8% PCR duplicates.
Optimizing the depth-complexity ratio & amplification noise...
Optimized:
depth_complexity_ratio = 1.93
model = logskewnormal:0.719:-3.4
Model fit was 99.1%; the predicted duplicate rate was 60.7% and the saturation
75.8%; the amplification model's noisiness was 0.719 and its skewness -0.72.
Writing output files...
./brown_alone.clone_sizes.tsv.decoratio.logskewnormal.png
./brown_alone.clone_sizes.tsv.decoratio.logskewnormal.ampl_factors.png
Done.
Obtaining the input file
Using gstacks
For RAD-seq datasets analyzed using Stacks, the clone size histogram is present in one of the output files by default:
stacks-dist-extract ./gstacks.log.distribs pcr_clone_size_distrib > ./clone_sizes.tsv
python3 -m decoratio ./clone_sizes.tsv
# The program can actually be run directly on the distribs file:
python3 -m decoratio ./gstacks.log.distribs
n.b., for datasets that comprise several libraries, gstacks should be run separately on each library/set of samples for PCR duplicate and library complexity purposes.
Using samtools-markdup
prefix=./my_sample
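# Standard samtools duplicate-marking workflow: collate, fixmate -m, sort, then markdup.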
samtools view -b -f0x2 -F0x904 $prefix.bam |
samtools collate -O - |
samtools fixmate -m - - |
samtools sort - |
samtools markdup -t - $prefix.markdup.bam
# We then extract the `do:` (presumably "duplicate of") tags in the SAM records.
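# Reads lacking a do: tag (the clone originals) are kept under their own read name.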
samtools view -f0x40 $prefix.markdup.bam |
sed -E 's/.*\tdo:Z:([^\t]+).*/\1/; s/\t.*//' |
gzip > $prefix.clones.txt.gz
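# Count reads per original name (one count per clone), then tally clones per size: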
gzip -cd $prefix.clones.txt.gz |
sort | uniq -c | awk '{print $1}' |
sort -n | uniq -c |
awk 'BEGIN{print "clone_size\tn_clones"} {print $2 "\t" $1}' \
> $prefix.clone_sizes.tsv
python3 -m decoratio $prefix.clone_sizes.tsv
Using GATK/Picard
n.b., as of v2.27.4 / June 2022, Picard-MarkDuplicates appears to rely on sequence identity to identify duplicates; this may underestimate duplication levels & overestimate complexity.
prefix=./my_sample
java -jar picard-2.27.4.jar \
MarkDuplicates \
--TAG_DUPLICATE_SET_MEMBERS --READ_NAME_REGEX null \
-I $prefix.bam -O $prefix.picard-markdups.bam -M $prefix.picard-markdups.txt \
&> $prefix.picard-markdups.log
# We then extract the `DS:` (duplicate set size, i.e., clone size) tags in the SAM records.
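# Reads without a DS tag are singletons; the sed assigns them clone size 1.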
samtools view -f0x40 -F0xC04 $prefix.picard-markdups.bam |
sed -E '/\tDS:/!s/.*/1/; s/.*\tDS:i:([0-9]+).*/\1/' |
sort -n | uniq -c |
awk 'BEGIN{print "clone_size\tn_clones"} {print $2 "\t" $1}' \
> $prefix.clone_sizes.tsv
python3 -m decoratio $prefix.clone_sizes.tsv
Using UMI-tools
Handling of duplicates with UMI-tools is appropriate for experiments that make use of unique molecular identifiers (e.g., single-cell RNA-seq datasets).
Users should use the umi_tools group command, exporting the information about the size of "read groups" (i.e., PCR duplication clones) to a text file with the --group-out option.
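From that file, the clone size histogram can be tallied by counting reads per group. A minimal sketch (the column names unique_id and final_umi_count are those of recent UMI-tools versions, and the file names are placeholders; check the header of your own file):

import csv
from collections import Counter

# Clone size of each read group; keying by group id de-duplicates the reads.
group_sizes = {}
with open("grouped.tsv") as fh:  # the --group-out file
    for row in csv.DictReader(fh, delimiter="\t"):
        group_sizes[row["unique_id"]] = int(row["final_umi_count"])

hist = Counter(group_sizes.values())
with open("clone_sizes.tsv", "w") as out:
    out.write("clone_size\tn_clones\n")
    for size in sorted(hist):
        out.write(f"{size}\t{hist[size]}\n")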