A pipeline to align, quality control, and summarize tiled amplicon coverage (of a virus, probably) from sequencing reads.
scampiman
Rationale: We noticed that 1) tiled amplicon data can come in many forms from many technologies, and 2) errors introduced in library prep can lead to sequencing artifacts that, if not handled properly, cause problems in downstream analysis.
Requires: Reads, Reference Genome(s), Primer .bed File
Produces: Alignment Summary, Samtools Ampliconstats File, Table of Amplicon Coverage .tsv
- Input formats: .fastq or .bam
- Input logic: file(s) or directory (with files)
- Seq tech: illumina short read or ONT
- Read config: single-end or paired-end
- Align reads to reference (mappy) and filter unwanted alignments
- pysam: sort, ampliconclip, ampliconstats
- Parse ampliconstats output into table, output .tsv
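The last step, turning ampliconstats output into a per-amplicon table, can be sketched as below. This is a minimal illustration and not scampiman's actual parser. It assumes the tab-separated record layout described in the samtools ampliconstats manual (a record key such as FREADS or FDEPTH, then the file name, then one value per amplicon in order). The function names amplicon_rows and rows_to_tsv and the output column names are hypothetical.

```python
import csv
import io


def amplicon_rows(stats_text, tags=("FREADS", "FDEPTH")):
    """Collect per-amplicon values for the chosen record keys (assumed layout)."""
    per_file = {}
    for line in stats_text.splitlines():
        fields = line.split("\t")
        # Skip comments and record types we are not interested in.
        if len(fields) < 3 or fields[0] not in tags:
            continue
        tag, file_name, values = fields[0], fields[1], fields[2:]
        per_file.setdefault(file_name, {})[tag] = values

    rows = []
    for file_name, by_tag in per_file.items():
        n_amplicons = max(len(values) for values in by_tag.values())
        for i in range(n_amplicons):
            row = {"file": file_name, "amplicon": i + 1}
            for tag in tags:
                values = by_tag.get(tag, [])
                row[tag] = values[i] if i < len(values) else ""
            rows.append(row)
    return rows


def rows_to_tsv(rows, tags=("FREADS", "FDEPTH")):
    """Serialize the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["file", "amplicon", *tags],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example input, for illustration only.
EXAMPLE = (
    "# Summary statistics (made-up values)\n"
    "FREADS\tbarcode24\t120\t0\t87\n"
    "FDEPTH\tbarcode24\t98.5\t0.0\t70.2\n"
)

if __name__ == "__main__":
    print(rows_to_tsv(amplicon_rows(EXAMPLE)), end="")
```

Values are kept as strings so the table reflects the ampliconstats file exactly; converting them to numbers is left to whatever reads the .tsv.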
Alignment Filtering
Subpar alignments are filtered out before amplicon analysis is performed. This step attempts to remove artifacts that may have arisen during library preparation, for either single- or paired-end reads, and that can misrepresent amplicon diversity.
The number of removed alignments is reported in the alignment summary as 'removed_reads_primary', and the removed alignments are saved in the failed.bam output along with unmapped reads.
Single-end Reads
The filtering parameters for single-end reads are designed to correct for ligation-based errors that may occur, particularly in ONT ligation-based sequencing kits (e.g., SQK-NBD114).
The filtering parameters are as follows:
- Removes reads with supplementary alignments that overlap <50% with the primary alignment's reference region.
- Removes reads that produce supplementary alignments mapping to the same strand as the primary alignment.
Removal of these reads is important because it accounts for:
- ligation between amplicons originating from different regions of the genome.
- ligation between segments originating from different sources. For example, different barcodes of ONT kits.
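The two single-end rules above can be sketched as a small predicate. This is an illustration under stated assumptions, not scampiman's implementation. The Aln record and the function names are hypothetical, and the overlap is assumed to be measured as the fraction of the supplementary alignment's reference span that falls inside the primary alignment's span (the source does not say which span is the denominator).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aln:
    """Hypothetical minimal alignment record."""
    ctg: str     # reference sequence name
    start: int   # 0-based reference start (inclusive)
    end: int     # 0-based reference end (exclusive)
    strand: int  # +1 forward, -1 reverse


def overlap_fraction(primary, supp):
    """Fraction of the supplementary span covered by the primary span (assumed definition)."""
    if primary.ctg != supp.ctg:
        return 0.0
    shared = min(primary.end, supp.end) - max(primary.start, supp.start)
    span = supp.end - supp.start
    return max(shared, 0) / span if span > 0 else 0.0


def keep_single_end_read(primary, supplementary, min_overlap=0.5):
    """Return False if any supplementary alignment triggers one of the two rules."""
    for supp in supplementary:
        # Rule 2: supplementary alignment on the same strand as the primary.
        if supp.strand == primary.strand:
            return False
        # Rule 1: supplementary alignment overlaps the primary region by <50%.
        if overlap_fraction(primary, supp) < min_overlap:
            return False
    return True
```

Under these assumptions, a read with no supplementary alignments always passes, while a chimera joining two distant amplicons fails the overlap rule.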
Paired-end Reads
Scampiman assumes that paired-end reads were generated on an Illumina or similar platform.
The filtering parameters are as follows:
- Removes paired reads that align to the same strand.
- Removes paired reads that have unequal numbers of alignments (indicating mapping error).
- Removes paired reads where one mate is unmapped.
- Removes paired reads whose reference alignments do not overlap.
Removal of these reads is important because it accounts for:
- Illumina's platform sequencing paired reads from opposing strands of the same DNA fragment.
- potential ligation or mapping errors.
- the necessity for the entire (gap-less) amplicon to be represented in the analysis.
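The four paired-end rules above can be sketched in the same way. Again this is only an illustration. The Aln record and keep_read_pair are hypothetical names, each mate is represented by a list of its alignments, and the first alignment in each list is assumed to be the primary one used for the strand and overlap checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aln:
    """Hypothetical minimal alignment record."""
    ctg: str     # reference sequence name
    start: int   # 0-based reference start (inclusive)
    end: int     # 0-based reference end (exclusive)
    strand: int  # +1 forward, -1 reverse


def keep_read_pair(mate1_alns, mate2_alns):
    """Apply the four paired-end rules to one read pair."""
    # One mate (or both) is unmapped.
    if not mate1_alns or not mate2_alns:
        return False
    # Unequal numbers of alignments between the mates.
    if len(mate1_alns) != len(mate2_alns):
        return False
    first, second = mate1_alns[0], mate2_alns[0]  # assumed primary alignments
    # Both mates align to the same strand.
    if first.strand == second.strand:
        return False
    # Reference alignments do not overlap (different contigs never overlap).
    if first.ctg != second.ctg:
        return False
    return min(first.end, second.end) > max(first.start, second.start)
```

For example, a forward mate at 100-250 and a reverse mate at 200-350 on the same reference pass, while the same pair with the reverse mate at 300-450 is removed because the two alignments leave a gap.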
Install
Note: consider making an isolated environment (conda or venv) for scampiman.
Easiest Way
- Simply pip install with the release tarball link.
pip install https://github.com/tiszalab/scampiman/archive/refs/tags/v0.1.2.tar.gz
Alternative Methods:
- Clone this repo or download and unpack the release.
- pip install scampiman
From the terminal:
cd scampiman
pip install .
- Either method should install scampiman as a runnable command from the terminal.
Running scampiman
Highly Recommended: quality-filter reads with, e.g., fastp (short reads) or fastplong (long reads) before running scampiman!
With a directory of unaligned .bam files from an ONT run:
scampiman -r proj1/bam_pass/barcode24 -b SARS-CoV-2.ARTIC_5.3.2.primer.bed -g sars_cov2_MN908947.3.fasta -s barcode24 -o proj1_scampi -f bam -t directory -c single-end --seqtech ont
You can also specify multiple directories:
scampiman -r flowcell1/bam_pass/barcode24 flowcell2/bam_pass/barcode24 -b SARS-CoV-2.ARTIC_5.3.2.primer.bed -g sars_cov2_MN908947.3.fasta -s barcode24 -o proj1_scampi -f bam -t directory -c single-end --seqtech ont
With some unaligned .bam files from an ONT run:
scampiman -r proj1/bam_pass/barcode24/*bam -b SARS-CoV-2.ARTIC_5.3.2.primer.bed -g sars_cov2_MN908947.3.fasta -s barcode24 -o proj1_scampi -f bam -t files -c single-end --seqtech ont
It's better to use -t directory if you are using all files in a directory.
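When an ONT run has several barcode directories, the directory-mode command above can be assembled in a loop. The sketch below only builds and prints the command lines, using the flags shown in the examples above. The helper name scampiman_command, the barcode list, and the per-sample output directory names are hypothetical, since the source does not say whether several samples can share one -o directory.

```python
import shlex


def scampiman_command(reads_dir, sample, out_dir, primer_bed, reference):
    """Build the argument list for one directory of unaligned ONT .bam files."""
    return [
        "scampiman",
        "-r", reads_dir,
        "-b", primer_bed,
        "-g", reference,
        "-s", sample,
        "-o", out_dir,
        "-f", "bam",
        "-t", "directory",
        "-c", "single-end",
        "--seqtech", "ont",
    ]


if __name__ == "__main__":
    # Hypothetical barcode list and output directory naming.
    for barcode in ("barcode23", "barcode24"):
        cmd = scampiman_command(
            reads_dir=f"proj1/bam_pass/{barcode}",
            sample=barcode,
            out_dir=f"proj1_scampi_{barcode}",
            primer_bed="SARS-CoV-2.ARTIC_5.3.2.primer.bed",
            reference="sars_cov2_MN908947.3.fasta",
        )
        # Print for review; pass cmd to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.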
With .fastq files:
From a paired-end Illumina run:
scampiman -r my_fastqs/seq1.R1.fastq my_fastqs/seq1.R2.fastq -b SARS-CoV-2.ARTIC_5.3.2.primer.bed -g sars_cov2_MN908947.3.fasta -s seq1 -o proj2_scampi -f fastq -t files -c paired-end --seqtech illumina
Single-end (e.g. ONT) works too:
scampiman -r my_fastqs/seq1.ONT.fastq -b SARS-CoV-2.ARTIC_5.3.2.primer.bed -g sars_cov2_MN908947.3.fasta -s seq1 -o proj2_scampi -f fastq -t files -c single-end --seqtech ont
Keeping aligned .bam files for downstream analysis:
You will often want to keep the filtered .bam file for downstream analysis, such as determining lineage, deriving a consensus genome, or analyzing allele frequencies.
scampiman -r my_fastqs/seq1.ONT.fastq -b SARS-CoV-2.ARTIC_5.3.2.primer.bed -g sars_cov2_MN908947.3.fasta -s seq1 -o proj2_scampi -f fastq -t files -c single-end --seqtech ont --keep bam
Plotting data (not thoroughly tested/robust)
See conda environment requirements below.
This needs an index file in .xlsx format with (at least) the following header columns:
- Barcode ID
- Sample ID
Rscript scampiman/plot_script/plot_scampiman_batch1.R scampi_projects my_amplicons_projs1to4.pdf
Terminal commands to add R plotting capabilities:
conda activate scampiman
conda install -c conda-forge conda-forge::r-rprojroot conda-forge::r-tidyverse conda-forge::r-cowplot conda-forge::r-readxl