Quickly get coverage statistics given reads and an assembly
Motivation
While there are tools that will calculate read-coverage statistics, they do not scale particularly well for large datasets, large sample numbers, or large reference FASTAs. Koverage is designed to place minimal burden on I/O and RAM to allow for maximum scalability.
Install
Koverage is available on PyPI and Bioconda.
We recommend creating a dedicated conda environment for the installation:
conda create -n koverage python=3.11
conda activate koverage
Install with pip:
pip install koverage
Install with Bioconda:
conda install -c bioconda koverage
Test the installation
koverage test
Developer install:
git clone https://github.com/beardymcjohnface/Koverage.git
cd Koverage
pip install -e .
Usage
Get coverage statistics from mapped reads (default method).
koverage run --reads readDir --ref assembly.fasta
Get coverage statistics using kmers (scales much better for very large reference FASTAs).
koverage run --reads readDir --ref assembly.fasta kmer
Any unrecognised arguments are passed on to Snakemake. For example, run Koverage on an HPC cluster using a Snakemake profile:
koverage run --reads readDir --ref assembly.fasta --profile mySlurmProfile
Parsing samples with --reads
You can pass either a directory of reads or a TSV file to --reads.
Note that Koverage expects your read file names to include R1 or R2, e.g. Tynes-BDA-rw-1_S14_L001_R1_001.fastq.gz or SRR7141305_R2.fastq.gz.
- Directory: Koverage will infer sample names and _R1/_R2 pairs from the filenames.
- TSV file: Koverage expects 2 or 3 columns, with column 1 being the sample name and columns 2 and 3 the read files.
More information and examples are available here
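As an illustration of the TSV layout described above, here is a minimal sketch that groups read file names into sample rows. The regular expression and the helper names (`build_sample_rows`, `rows_to_tsv`) are hypothetical and are not part of Koverage; the rule Koverage itself uses to infer sample names may differ, and in practice you would probably write full paths rather than bare file names. The sketch also assumes the TSV has no header row.

```python
import re
from collections import defaultdict

# Hypothetical pattern: everything before "_R1"/"_R2" is taken as the sample name.
# Koverage's own inference rules may differ.
READ_PATTERN = re.compile(
    r"^(?P<sample>.+?)_(?P<mate>R[12])(?:_\d+)?\.(?:fastq|fq)(?:\.gz)?$"
)


def build_sample_rows(filenames):
    """Group read files by sample and return rows of: sample, R1 file[, R2 file]."""
    mates = defaultdict(dict)
    for name in sorted(filenames):
        match = READ_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not look like R1/R2 read files
        mates[match["sample"]][match["mate"]] = name

    rows = []
    for sample, files in sorted(mates.items()):
        if "R1" not in files:
            continue  # this sketch requires an R1 file for every sample
        row = [sample, files["R1"]]
        if "R2" in files:
            row.append(files["R2"])  # third column only for paired reads
        rows.append(row)
    return rows


def rows_to_tsv(rows):
    """Serialise rows as tab-separated text, one sample per line."""
    return "".join("\t".join(row) + "\n" for row in rows)
```

Writing the result of `rows_to_tsv(...)` to a file gives a 2 or 3 column table that could then be passed to --reads.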
Test
You can test the methods with the inbuilt dataset like so.
# test default method
koverage test
# test all methods
koverage test map kmer coverm
Coverage methods
Mapping-based (default)
koverage run ...
# or
koverage run ... map
This method will map reads using minimap2 and use the mapping coordinates to calculate coverage. This method is suitable for most applications.
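To make "use the mapping coordinates to calculate coverage" concrete, the following sketch computes exact per-base depth from alignment intervals and summarises it. It is not Koverage's implementation: Koverage reports fast estimates, whereas this computes the exact quantities on a small example. The function names and the 0-based, half-open interval convention are assumptions made for illustration.

```python
from statistics import mean, median


def depth_from_alignments(contig_length, alignments):
    """Per-base depth from 0-based, half-open (start, end) alignment intervals."""
    # Difference array: +1 where an alignment starts, -1 just past where it ends.
    diff = [0] * (contig_length + 1)
    for start, end in alignments:
        start = max(0, start)
        end = min(contig_length, end)
        if start >= end:
            continue  # ignore empty or out-of-range intervals
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the difference array gives the depth at each position.
    depth = []
    running = 0
    for change in diff[:contig_length]:
        running += change
        depth.append(running)
    return depth


def summarise_depth(depth):
    """Exact mean, median, and hitrate (fraction of positions with depth > 0)."""
    return {
        "Mean": mean(depth),
        "Median": median(depth),
        "Hitrate": sum(1 for d in depth if d > 0) / len(depth),
    }
```

For example, `summarise_depth(depth_from_alignments(10, [(0, 5), (3, 8)]))` describes a 10 bp contig covered by two overlapping alignments.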
Kmer-based
koverage run ... kmer
This method calculates Jellyfish databases of the sequencing reads. It samples kmers from all reference contigs and queries them from the Jellyfish DBs to calculate coverage statistics. This method is exceptionally fast for very large reference genomes.
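The idea of sampling kmers from a contig and querying their counts can be sketched as below. This is only an illustration under stated assumptions: an in-memory `Counter` stands in for a Jellyfish database, kmers are sampled at a fixed `step` (a hypothetical parameter), and strand or canonical-kmer handling is ignored. Koverage's actual sampling scheme may differ.

```python
from collections import Counter
from statistics import mean, median


def count_kmers(reads, k):
    """Count every kmer in the reads (a stand-in for a Jellyfish database)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sample_kmer_depths(contig, kmer_counts, k, step):
    """Look up the read-derived count of every `step`-th kmer along the contig."""
    return [
        kmer_counts.get(contig[i:i + k], 0)
        for i in range(0, len(contig) - k + 1, step)
    ]


def summarise(depths):
    """Summary statistics over the sampled kmer depths."""
    return {
        "Sum": sum(depths),
        "Mean": mean(depths),
        "Median": median(depths),
        "Hitrate": sum(1 for d in depths if d > 0) / len(depths),
    }
```

Because only a sample of kmers is queried per contig, the work per contig in this sketch depends on the number of sampled kmers rather than on the number of reads.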
CoverM
koverage run ... coverm
We've included a wrapper for CoverM which you may find useful. The wrapper manually runs minimap2 and then invokes CoverM on the sorted BAM file. It then combines the output from all samples like the other methods. If you have a large tempfs/, you'll probably find it faster to run CoverM directly on your reads. CoverM is not currently available for macOS.
Outputs
Mapping-based
Default output files using fast estimations for mean, median, hitrate, and variance.
sample_coverage.tsv
Per sample and per contig counts.
Column | description |
---|---|
Sample | Sample name derived from read file name |
Contig | Contig ID from assembly FASTA |
Count | Raw mapped read count |
RPM | Reads per million |
RPKM | Reads per kilobase per million mapped reads |
RPK | Reads per kilobase |
TPM | Transcripts per million |
Mean | Estimated mean read depth |
Median | Estimated median read depth |
Hitrate | Estimated fraction of contig with depth > 0 |
Variance | Estimated read depth variance |
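For readers unfamiliar with the normalised count columns, the sketch below applies the conventional definitions of RPM, RPK, RPKM, and TPM to raw counts. It assumes the library size is the total number of mapped reads in the sample; whether Koverage uses exactly this denominator is not stated here, so treat the function (`normalise_counts`, a hypothetical name) as an illustration of the formulas only.

```python
def normalise_counts(counts, lengths):
    """Conventional RPM, RPK, RPKM, and TPM for one sample.

    counts:  contig ID -> raw mapped read count
    lengths: contig ID -> contig length in bases
    """
    # Assumption: library size is the sum of mapped read counts in this sample.
    total_reads = sum(counts.values())
    # Reads per kilobase of contig.
    rpk = {contig: n / (lengths[contig] / 1000) for contig, n in counts.items()}
    rpk_total = sum(rpk.values())

    table = {}
    for contig, n in counts.items():
        rpm = n / (total_reads / 1e6) if total_reads else 0.0
        table[contig] = {
            "Count": n,
            "RPM": rpm,
            "RPKM": rpm / (lengths[contig] / 1000),
            "RPK": rpk[contig],
            # TPM rescales RPK so that the values sum to one million per sample.
            "TPM": rpk[contig] / (rpk_total / 1e6) if rpk_total else 0.0,
        }
    return table
```

Note that two contigs with the same RPK always receive the same TPM, regardless of their raw counts.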
all_coverage.tsv
Per contig counts (all samples).
Column | description |
---|---|
Contig | Contig ID from assembly FASTA |
Count | Raw mapped read count |
RPM | Reads per million |
RPKM | Reads per kilobase per million mapped reads |
RPK | Reads per kilobase |
TPM | Transcripts per million |
Kmer-based
Outputs for kmer-based coverage metrics. Kmer outputs are gzipped as it is anticipated that this method will be used with very large reference FASTA files.
sample_kmer_coverage.NNmer.tsv.gz
Per sample and contig kmer coverage.
Column | description |
---|---|
Sample | Sample name derived from read file name |
Contig | Contig ID from assembly FASTA |
Sum | Sum of sampled kmer depths |
Mean | Mean sampled kmer depth |
Median | Median sampled kmer depth |
Hitrate | Fraction of kmers with depth > 0 |
Variance | Variance of lowest 95% of sampled kmer depths |
all_kmer_coverage.NNmer.tsv.gz
Contig kmer coverage (all samples).
Column | description |
---|---|
Contig | Contig ID from assembly FASTA |
Sum | Sum of sampled kmer depths |
Mean | Mean sampled kmer depth |
Median | Median sampled kmer depth |
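Since the kmer outputs are gzipped, they can be read without decompressing them to disk first. The sketch below uses only the Python standard library and assumes the files are tab-separated with a header row matching the column names listed above; that header assumption, and the helper names, are not confirmed by this README.

```python
import csv
import gzip


def read_coverage_table(path):
    """Read a gzipped, tab-separated coverage table into a list of dicts.

    Assumes the first row is a header (e.g. Sample, Contig, Sum, Mean, ...).
    """
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def contigs_with_hitrate_at_least(rows, threshold):
    """Return (sample, contig) pairs whose Hitrate meets the threshold."""
    return [
        (row["Sample"], row["Contig"])
        for row in rows
        if float(row["Hitrate"]) >= threshold
    ]
```

The same reader should also work for the all-samples table, although that table has no Sample or Hitrate column, so the filter above applies only to the per-sample file.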