Skip to main content

A multi-sample variant calling pipeline

Project description

https://badge.fury.io/py/sequana-variant-calling.svg JOSS (journal of open source software) DOI https://github.com/sequana/variant_calling/actions/workflows/main.yml/badge.svg Python 3.11 | 3.12

This is the variant_calling pipeline from the Sequana projet

Overview:

Variant calling from FASTQ files

Input:

FASTQ files from Illumina Sequencing instrument

Output:

VCF and HTML files

Status:

production

Citation:

Cokelaer et al, (2017), ‘Sequana’: a Set of Snakemake NGS pipelines, Journal of Open Source Software, 2(16), 352, JOSS DOI https://doi:10.21105/joss.00352

Installation

You can install sequana_variant_calling pipeline using:

pip install sequana_variant_calling --upgrade

I would recommend to setup a sequana_variant_calling conda environment executing:

conda env create -f environment.yml

where the environment.yml can be found in the https://github.com/sequana/variant_calling repository.

Later, you can activate the environment as follows:

conda activate sequana_variant_calling

Note, however, that the recommended method is to use singularity/apptainer as explained here below.

Usage

sequana_variant_calling --input-directory DATAPATH --reference-file measles.fa

This creates a directory variant_calling. You just need to move into the directory and execute the script:

cd variant_calling
sh variant_calling.sh

This launch a snakemake pipeline. If you are familiar with snakemake, you can retrieve the pipeline itself and its configuration files and then execute the pipeline yourself with specific parameters:

snakemake -s variant_calling.rules -c config.yaml --cores 4 --stats stats.txt

you can also edit the profile file in .sequana/profile/config.ya,l

Or use sequanix interface.

Usage with singularity::

With singularity, initiate the working directory as follows:

sequana_variant_calling --use-singularity --singularity-prefix ~/.sequana/apptainers

Images are downloaded in a global direcory (here .sequana/apptainers) so that you can reuse them later.

and then as before:

cd variant_calling
sh variant_calling.sh

if you decide to use snakemake manually, do not forget to add singularity options:

snakemake -s variant_calling.rules -c config.yaml --cores 4 --stats stats.txt --use-singularity --singularity-prefix ~/.sequana/apptainers --singularity-args "-B /home:/home"

Requirements

If you rely on singularity/apptainer, no extra dependencies are required (expect python and https://damona.readthedocs.io). If you cannot use apptainer, you will need to install some software:

  • bwa

  • freebayes

  • picard (picard-tools)

  • sambamba

  • minimap2

  • samtools

  • snpEff you will need 5.0 or 5.1d (note the d); 5.1 does not work.

https://raw.githubusercontent.com/sequana/sequana_variant_calling/main/sequana_pipelines/variant_calling/dag.png

Details

Snakemake variant calling pipeline is based on tutorial written by Erik Garrison. Input reads (paired or single) are mapped using bwa and sorted with sambamba-sort. PCR duplicates are marked with sambamba-markdup. Freebayes is used to detect SNPs and short INDELs. The INDEL realignment and base quality recalibration are not necessary with Freebayes. For more information, please refer to a post by Brad Chapman on minimal BAM preprocessing methods.

The pipeline provides an analysis of the mapping coverage using sequana coverage. It detects and characterises automatically low and high genome coverage regions.

Detected variants are annotated with SnpEff if a GenBank file is provided. The pipeline does the database building automatically. Although most of the species should be handled automatically, some special cases such as particular codon table will required edition of the snpeff configuration file.

Finally, joint calling is also available and can be switch on if desired.

Tutorial

Let us download an ecoli reference genome and the data set used to create the assembly. All tools used here below can be installed with damona (or your favorite environment manager):

pip install damona
damona create TEST
damona activate TEST
damona install pigz
damona install sratoolkit # for fasterq-dump
damona install datasets

Then, download the data:

fasterq-dump SRR13921546
pigz SRR*fastq

and the reference genome with its annnotation:

datasets download genome accession GCF_000005845.2 --include gff3,rna,cds,protein,genome,seq-report,gbff
unzip ncbi_dataset.zip
ln -s ncbi_dataset/data/GCF_000005845.2/GCF_000005845.2_ASM584v2_genomic.fna ecoli.fa
ln -s ncbi_dataset/data/GCF_000005845.2/genomic.gff ecoli.gff

Initiate the pipeline:

sequana_variant_calling --input-directory . --reference-file ecoli.fa --aligner-choice bwa_split \
    --do-coverage --annotation-file ecoli.gff  \
    --use-apptainer --apptainer-prefix ~/.sequana/apptainers \
    --input-readtag "_[12]."

Explication:

  • we use apptainer/singularity

  • we use the reference genome ecoli.fa (–reference-file) and its annotation for SNPeff (–annotation-file)

  • we use the sequana_coverage tool (True by default) to get coverage plots.

  • we use –input-directory to indicatre where to find the input files

  • This data set is paired. In NGS, it is common to have _R1_ and _R2_ tags to differentiate the 2 files. Here the tags are _1 and _2. In sequana we define the a wildcard for the read tag. So here we tell the software that thex expected tags follow this pattern: “_[12].” and everything is then automatic.

Then follow the instructions (prepare and execute the pipeline).

You should end up with a summary.hml report.

You can browse the different samples (only one in this example) and get a table with variant calls:

https://raw.githubusercontent.com/sequana/variant_calling/refs/heads/main/doc/table.png

If you set the coverage one, (not recommended for eukaryotes), you should see this kind of plots:

https://raw.githubusercontent.com/sequana/variant_calling/refs/heads/main/doc/coverage.png

Changelog

Version

Description

1.5.0

  • use new sequana-wrappers shells

1.4.0

  • handles long reads data. Use sequana html_report to create the VCF html reports instead of wrapper. More dynamic. Updated some containers, in particular for sequana_coverage.

  • Fixed regression in bwa mapping

  • Fixed ordering of contigs on genomecov that was not sorted in the same way as samtools in some cases.

  • use new sequana-wrappers.shells (instead of wrappers)

1.3.0

  • Updated version to use latest damona containers and latest sequana version 0.19.1. added plot in HTML report with distribution of variants. added tutorial. added bwa_split and freebaye split to process ultra deep sequencing.

1.2.0

  • -Xmx8g option previously added is not robust. Does not work with snpEff 5.1 for instance.

  • add minimap aligner

  • add –nanopore and –pacbio to automatically set minimap2 as the aligner and the minimap options (map-pb or map-ont)

  • add minimap2 container.

  • add missing resources in snpeff section

1.1.2

  • add -Xmx8g option in snpeff rule at the build stage.

  • add resources (8G) in the snpeff rule at run stage

  • fix missing output_directory in sequana_coverage rule

  • fix joint calling (regression) input function and inputs

1.1.1

  • Fix regression in coverage rule

1.1.0

  • add specific apptainer for freebayes (v1.2.0)

  • Update API to use click

1.0.2

  • Fixed failure in multiqc if coverage and snpeff are off

1.0.1

  • automatically fill the bwa index algorithm and fix bwa_index rule to use the options in the config file (not the harcoded one)

1.0.0

  • use last warppers and graphviz apptainer

0.12.0

  • set all apptainers containers and add vcf to bcf conversions

  • Update rule sambamba to use latest wrappers

0.11.0

  • Add singularity containers

0.10.0

  • fully integrated sequana wrappers and simplification of HTML reports

0.9.10

  • Uses new sequana_pipetools and wrappers

0.9.5

  • fix typo in the onsuccess and update sequana requirements to use most up-to-date snakemake rules

0.9.4

  • fix typo related to the reference-file option new name not changed everyhere in the pipeline.

0.9.3

  • use new framework (faster –help, –from-project option)

  • rename –reference into –reference-file and –annotation to –annotation-file

  • add custom summary page

  • add multiqc config file

0.9.2

  • snpeff output files are renamed sample.snpeff (instead of samplesnpeff)

  • add multiqc to show sequana_coverage and snpeff summary sections

  • cleanup onsuccess section

  • more options sanity checks and options (e.g.,

  • genbank_file renamed into annotation_file in the config

  • use –legacy in freebayes options

  • fix coverage section to use new sequana api

  • add the -do-coverage, –do-joint-calling options as well as –circular and –frebayes–ploidy

0.9.1

  • Fix input-readtag, which was not populated

0.9.0

First release

Contribute & Code of Conduct

To contribute to this project, please take a look at the Contributing Guidelines first. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sequana_variant_calling-1.5.0.tar.gz (122.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sequana_variant_calling-1.5.0-py3-none-any.whl (119.9 kB view details)

Uploaded Python 3

File details

Details for the file sequana_variant_calling-1.5.0.tar.gz.

File metadata

  • Download URL: sequana_variant_calling-1.5.0.tar.gz
  • Upload date:
  • Size: 122.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sequana_variant_calling-1.5.0.tar.gz
Algorithm Hash digest
SHA256 ee9294882ef0a0070ad7fc648b05b56999df5cdbd23a45bbde5af8b4f0bf5081
MD5 65494df62fbf3a89b02d8c4a91ac1663
BLAKE2b-256 dac45f0c2d627a93e2edf2257facce36a022a07b78a8e26f98030b95b5fa1c07

See more details on using hashes here.

File details

Details for the file sequana_variant_calling-1.5.0-py3-none-any.whl.

File metadata

File hashes

Hashes for sequana_variant_calling-1.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 145b275b82aae31f2b8aead83aad4ba7832365dae5e976fbba08ccd9b5bf39b5
MD5 c1e1fbe0e97c2843d034d02a3907b4a5
BLAKE2b-256 c7e4ef540a5040d0aa2f39602dbce61724972915557dd3c7d3f2c2e4241ea431

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page