A pipeline to construct reference-free core-genome or SNP phylogenetic trees for examining prokaryote relatedness in outbreaks.
Dryad is a pipeline to construct reference-free core-genome or SNP phylogenetic trees for examining prokaryote relatedness in outbreaks. Dryad is built on Nextflow, which allows the pipeline to run in many environments using Docker or Singularity, whether locally, on an HPC system, or in the cloud. Dryad can perform a reference-free core-genome analysis based on the approach outlined by Oakeson et al., a SNP analysis using the CFSAN SNP Pipeline, or both.
Table of Contents:
Installation
Usage
Workflow outline
Core-genome
SNP
Quality assessment
Genome cluster report
Output
Dependencies
Installing Dryad
Dryad uses a combination of Nextflow and containers to function and depends on either Docker or Singularity.
Dryad can be installed with pip using pip install dryad and updated using pip install -U dryad. If you are running Dryad from the git repository, a Python dependency needs to be installed via pip using pip install -r requirements.txt.
Using the pipeline
The pipeline is designed to start from raw Illumina short reads. All reads must be in the same directory. Start the pipeline using dryad and follow the options below to select and run the appropriate pipeline.
usage: dryad [-h] [--output <output_path>] [--core-genome] [--snp] [-r <path>]
             [-ar] [--sep sep_chars] [--profile {docker,singularity}]
             [--config CONFIG] [--get_config] [--resume] [--report]
             [reads_path]

A comprehensive tree building program.

positional arguments:
  reads_path            path to the directory of raw reads in the fastq format

optional arguments:
  -h, --help            show this help message and exit
  --output <output_path>, -o <output_path>
                        path to ouput directory, default "dryad_results"
  --core-genome, -cg    construct a core-genome tree
  --snp, -s             construct a SNP tree, requires a reference sequence in
                        fasta format (-r)
  -r <path>             reference sequence for SNP pipeline
  -ar                   detect AR mechanisms
  --sep sep_chars       dryad identifies sample names from the name of the
                        read file by splitting the name on the specified
                        separating characters, default "_"
  --profile {docker,singularity}
                        specify nextflow profile, dryad will try to use docker
                        first, then singularity
  --config CONFIG, -c CONFIG
                        Nextflow custom configureation
  --get_config          get a Nextflow configuration template for dryad
  --resume              resume a previous run
  --report <path>       RMarkdown file for report.
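The --sep option describes how sample names are derived from read file names. The following is a minimal sketch of that idea, not Dryad's own code: it assumes the sample name is the first field of the file name after splitting on the separator, and the file names and the sample_name helper are hypothetical.

```python
from pathlib import Path


def sample_name(read_file: str, sep: str = "_") -> str:
    """Return the part of the file name that precedes the first separator.

    Assumption: the sample name is the first field after splitting the
    file name on `sep`. Dryad's actual rule may differ in detail.
    """
    return Path(read_file).name.split(sep)[0]


# Hypothetical Illumina-style read files for illustration only.
reads = [
    "reads/SampleA_S1_L001_R1_001.fastq.gz",
    "reads/SampleA_S1_L001_R2_001.fastq.gz",
    "reads/SampleB-1.fastq.gz",
]

for path in reads[:2]:
    print(path, "->", sample_name(path))

# A different separator, as would be passed with --sep "-".
print(reads[2], "->", sample_name(reads[2], sep="-"))
```

If two read files from the same sample do not resolve to the same name under the default separator, that is a hint that --sep needs to be set for the naming scheme in use.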
Both pipelines begin by trimming low-quality bases from the ends of the reads using Trimmomatic v0.39, removing PhiX contamination using BBTools v38.76, and assessing read quality using FastQC v0.11.8. After processing, the reads are used by each pipeline as needed.
Note: Both pipelines can be run automatically in succession using the -cg and -s parameters simultaneously.
Additional workflow parameters
To tweak the versions of software used or specific workflow parameters, you can obtain the configuration file using --get_config. Then pass the custom configuration with the --config flag when running dryad.
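As an illustration of how the options combine, the sketch below assembles a dryad command line from Python. It uses only flags listed in the usage text, where --config is the option that takes a custom Nextflow configuration. The build_dryad_command helper and the paths reads, ref.fasta, and my.config are hypothetical, and the command is only printed, not executed.

```python
import shlex
from typing import List, Optional


def build_dryad_command(
    reads_path: str,
    core_genome: bool = False,
    snp: bool = False,
    reference: Optional[str] = None,
    config: Optional[str] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for the dryad CLI (hypothetical helper)."""
    cmd = ["dryad", reads_path]
    if output:
        cmd += ["--output", output]
    if core_genome:
        cmd.append("--core-genome")
    if snp:
        # The usage text states that the SNP tree requires a reference (-r).
        if reference is None:
            raise ValueError("--snp requires a reference sequence (-r)")
        cmd += ["--snp", "-r", reference]
    if config:
        cmd += ["--config", config]
    return cmd


# Both pipelines in one run, with a customized Nextflow configuration.
cmd = build_dryad_command(
    "reads", core_genome=True, snp=True, reference="ref.fasta", config="my.config"
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run on a machine where dryad is installed. Keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.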
Workflow outline
Core Genome phylogenetic tree construction
The core genome pipeline takes the trimmed and cleaned reads and infers a phylogenetic tree that can be used for inferring outbreak relatedness. This pipeline is loosely based on the pipeline described by Oakeson et al.
Species and MLST type are predicted from the assemblies generated during the core genome pipeline, and assembly quality is evaluated.
Additionally, the core genome pipeline can be run with -ar to predict antibiotic resistance genes.
The core genome pipeline uses the following applications and pipelines:
Shovill v1.0.4 - Shovill is an assembly pipeline centered around SPAdes that alters some of the steps to get similar results in less time.
Prokka v1.14.5 - Prokka is a whole genome annotation tool that is used to annotate the coding regions of the assembly.
Roary v3.12.0 - Roary takes the annotated genomes and constructs a core gene alignment.
IQ-Tree v1.6.7 - IQ-Tree uses the core gene alignment to create a maximum likelihood phylogenetic tree bootstrapped 1000 times.
Mash v2.1 - Mash performs fast genome and metagenome distance estimation using MinHash.
MLST v2.17.6 - MLST scans contig files against PubMLST typing schemes.
QUAST v5.0.2 - QUAST evaluates genome assemblies.
AMRFinderPlus v3.1.1 - AMRFinderPlus identifies acquired antimicrobial resistance genes.
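To give a feel for the MinHash idea behind the Mash step in the list above, here is a toy sketch. It is not Mash's implementation: it uses SHA-1 instead of Mash's hash function, ignores reverse complements, and uses tiny k-mer and sketch sizes chosen only for illustration. The function names are hypothetical. The distance formula is the one published for Mash, D = -1/k * ln(2j / (1 + j)), where j is the estimated Jaccard index.

```python
import hashlib
import math


def sketch(seq: str, k: int = 5, size: int = 50) -> list:
    """Bottom-`size` MinHash sketch: the smallest hashes of all k-mers."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    hashes = {
        int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
        for km in kmers
    }
    return sorted(hashes)[:size]


def jaccard_estimate(a: list, b: list, size: int = 50) -> float:
    """Estimate the Jaccard index from two bottom-`size` sketches."""
    set_a, set_b = set(a), set(b)
    union = sorted(set_a | set_b)[:size]  # sketch of the union
    if not union:
        return 0.0
    shared = sum(1 for h in union if h in set_a and h in set_b)
    return shared / len(union)


def mash_distance(j: float, k: int = 5) -> float:
    """Convert a Jaccard estimate into a Mash-style distance."""
    if j == 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


# Toy sequences; real genomes are millions of bases long.
seq1 = "ACGTACGTTGCAAGCTTAGCGATCGATTACG"
seq2 = "ACGTACGTTGCAAGCTTAGCGATCGATTACG"
j = jaccard_estimate(sketch(seq1), sketch(seq2))
print(j, mash_distance(j))
```

Because only a fixed number of hashes is kept per genome, comparing two sketches takes the same small amount of work regardless of genome size, which is what makes this kind of species prediction fast.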
SNP phylogenetic tree construction
The SNP pipeline takes the trimmed and cleaned reads and infers a phylogenetic tree that can be used for inferring outbreak relatedness. The pipeline requires the path to the raw reads (mentioned above) and a reference genome in fasta file format.
The SNP pipeline uses the following applications and pipelines:
IQ-Tree v1.6.7 - IQ-Tree uses an alignment of the SNP sites to create a maximum likelihood phylogenetic tree bootstrapped 1000 times.
Quality Assessment
The results of quality checks from each pipeline are summarized using MultiQC v1.8.
Genome cluster report
Dryad can generate an easily attributable analysis report, the genome cluster report, from the results of the SNP and core genome pipelines using RMarkdown. Generate the report by running dryad with --report. The plotting defaults of the RMarkdown file (/report/report.Rmd) can be modified as necessary and the report rebuilt using dryad_report.
Output files
dryad_results
├── logs
│   ├── cleanedreads
│   ├── dryad_execution_report.html
│   ├── dryad_trace.txt
│   ├── fastqc
│   ├── quast
│   └── work
└── results
    ├── amrfinder
    ├── annotated
    ├── ar_predictions_binary.tsv
    ├── ar_predictions.tsv
    ├── assembled
    ├── cluster_report.pdf
    ├── core_gene_alignment.aln
    ├── core_genome_statistics.txt
    ├── core_genome.tree
    ├── mash
    ├── mlst.tsv
    ├── multiqc_data
    ├── multiqc_report.html
    ├── report_template.Rmd
    ├── snp_distance_matrix.tsv
    ├── snpma.fasta
    └── snp.tree
ar_predictions_binary.tsv - Presence/absence matrix of antibiotic resistance genes.
ar_predictions.tsv - Antibiotic resistance genes detected.
cluster_report.pdf - Genome cluster report.
core_gene_alignment.aln - Alignment of the core set of genes.
core_genome_statistics.txt - Information about the number of core genes.
core_genome.tree - The core-genome phylogenetic tree created by the core-genome pipeline.
mash/{sample}.mash.txt - Species prediction for each sample.
mlst.tsv - MLST scheme predictions.
multiqc_report.html - QC report.
report_template.Rmd - R Markdown template for generating the cluster_report.pdf
snp_distance_matrix.tsv - The SNP distances generated by the SNP pipeline.
snp.tree - The SNP tree created by the SNP pipeline.
snpma.fasta - The SNP alignment.
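The outputs listed above can be post-processed with ordinary scripting. As a hedged example, the sketch below lists sample pairs whose SNP distance falls at or below a threshold. It assumes that snp_distance_matrix.tsv is a square, tab-separated matrix with sample names in the header row and first column. That layout is an assumption and should be checked against a real file. The close_pairs helper, the sample names, and the threshold of 10 are illustrative only and not a recommended cutoff.

```python
import csv
import io
from itertools import combinations


def close_pairs(tsv_text: str, max_snps: int) -> list:
    """Return (sample_a, sample_b, distance) for pairs within max_snps.

    Assumed layout: header row of sample names, then one row per sample
    starting with its name, all tab-separated.
    """
    rows = [r for r in csv.reader(io.StringIO(tsv_text), delimiter="\t") if r]
    samples = rows[0][1:]
    dist = {row[0]: dict(zip(samples, map(int, row[1:]))) for row in rows[1:]}
    return [
        (a, b, dist[a][b])
        for a, b in combinations(samples, 2)
        if dist[a][b] <= max_snps
    ]


# Made-up matrix standing in for results/snp_distance_matrix.tsv.
example = "\n".join(
    "\t".join(fields)
    for fields in [
        ["", "A", "B", "C"],
        ["A", "0", "3", "40"],
        ["B", "3", "0", "42"],
        ["C", "40", "42", "0"],
    ]
)

for a, b, d in close_pairs(example, max_snps=10):
    print(f"{a} and {b} differ by {d} SNPs")
```

To run this against real output, read the file contents with open("dryad_results/results/snp_distance_matrix.tsv").read() and pass the text to close_pairs.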
Authors
Kelsey Florek, WSLH Bioinformatics Scientist
Abigail Shockey, WSLH Bioinformatics Fellow