Skip to main content

A pipeline to construct reference free core-genome or SNP phylogenetic trees for examining prokaryote relatedness in outbreaks.

Project description

Dryad

Latest Release
Build Status

Dryad is a pipeline to construct reference free core-genome or SNP phylogenetic trees for examining prokaryote relatedness in outbreaks. Dryad accomplishes this using NextFlow allowing the pipeline to be run in numerous environments using docker or singularity either locally or in an HPC or cloud environment. Dryad will perform both a reference free core-genome analysis based off of the approach outlined by Oakeson et. al and/or a SNP analysis using the CFSAN-SNP pipeline.

Table of Contents:

Installation
Usage
Workflow outline
Core-genome
SNP
Quality assessment
Genome cluster report
Output
Dependencies

Installing Dryad

Dryad uses a combination of nextflow and containers to function and is dependent on either Docker or Singularity.

Installing dryad can be done with pip using pip install dryad. If you are running Dryad from the git repository, a python dependency needs to be installed via pip using pip install -r requirements.txt.

Using the pipeline

The pipeline is designed to start from raw Illumina short reads. All reads must be in the same directory. Then start the pipeline using dryad and follow the options for selecting and running the appropriate pipeline.

usage: dryad [-h] [--output <output_path>] [--core-genome] [--snp] [-r <path>]
             [-ar] [--sep sep_chars] [--profile {docker,singularity}]
             [--config CONFIG] [--get_config] [--resume] [--report]
             [reads_path]

A comprehensive tree building program.

positional arguments:
  reads_path            path to the directory of raw reads in the fastq format

optional arguments:
  -h, --help            show this help message and exit
  --output <output_path>, -o <output_path>
                        path to ouput directory, default "dryad_results"
  --core-genome, -cg    construct a core-genome tree
  --snp, -s             construct a SNP tree, requires a reference sequence in
                        fasta format (-r)
  -r <path>             reference sequence for SNP pipeline
  -ar                   detect AR mechanisms
  --sep sep_chars       dryad identifies sample names from the name of the
                        read file by splitting the name on the specified
                        separating characters, default "_"
  --profile {docker,singularity}
                        specify nextflow profile, dryad will try to use docker
                        first, then singularity
  --config CONFIG, -c CONFIG
                        Nextflow custom configureation
  --get_config          get a Nextflow configuration template for dryad
  --resume              resume a previous run
  --report <path>       RMarkdown file for report.

Both pipelines begin with a quality trimming step to trim the reads of low quality bases at the end of the read using Trimmomatic v0.39, the removal of PhiX contamination using BBtools v38.76, and the assessment of read quality using FastQC v0.11.8. After processing, the reads are used by each pipeline as needed.
Note: Both pipelines can be run automatically in succession using the -cg and -s parameters simultaneously.

Additional workflow parameters

In order to tweak the versions of software used or specific workflow parameters. You can obtain the configuration file using --get_config. Then use the custom configuration with the --profile flag when running dryad.

Workflow outline

Workflow

Core Genome phylogenetic tree construction

The core genome pipeline takes the trimmed and cleaned reads and infers a phylogenetic tree that can be used for inferring outbreak relatedness. This pipeline is based loosely off of the pipeline described here by Oakeson et. al.

Species and MLST type are predicted from the assemblies generated during the core genome pipeline, and assembly quality is evaluated.

Additionally, the core genome pipeline can be run with -ar to predict antibiotic resistance genes.

The core genome pipeline uses the following applications and pipelines:

Shovill v1.0.4 Shovill is a pipeline centered around SPAdes but alters some of the steps to get similar results in less time.

Prokka v1.14.5 Prokka is a whole genome annotation tool that is used to annotate the coding regions of the assembly.

Roary v3.12.0 Roary takes the annotated genomes and constructs a core gene alignment.

IQ-Tree v1.6.7 IQ-Tree uses the core gene alignment and creates a maximum likelihood phylogenetic tree bootstraped 1000 times.

Mash v2.1 Mash performs fast genome and metagenome distance estimation using MinHash.

MLST v2.17.6 MLST scans contig files against PubMLST typing schemes.

QUAST v5.0.2 QUAST evaluates genome assemblies.

AMRFinderPlus v3.1.1 AMRFinderPlus identifies acquired antimicrobial resistance genes.

SNP phylogenetic tree construction

The SNP pipeline takes the trimmed and cleaned reads and infers a phylogenetic tree that can be used for inferring outbreak relatedness. The pipeline requires the path to the raw reads (mentioned above) and a reference genome in fasta file format.

The SNP pipeline uses the following applications and pipelines:

CFSAN SNP Pipeline v2.0.2

IQ-Tree v1.6.7 IQ-Tree uses an alignment of the SNP sites to create a maximum likelihood phylogenetic tree bootstrapped 1000 times.

Quality Assessment

The results of quality checks from each pipeline are summarized using MultiQC v1.8

Genome cluster report

Dryad can generate an easily attributable analysis report. This uses RMarkdown and the results from the SNP and core genome pipelines to generate the genome cluster report. This option can be run using --report. The plotting defaults of the RMarkdown file (/report/report.Rmd) can be modified as necessary and rebuilt using dryad_report.

Output files

dryad_results
├── logs
│   ├── cleanedreads
│   ├── dryad_execution_report.html
│   ├── dryad_trace.txt
│   ├── fastqc
│   ├── quast
│   └── work
└── results
    ├── amrfinder
    ├── annotated
    ├── ar_predictions_binary.tsv
    ├── ar_predictions.tsv
    ├── assembled
    ├── cluster_report.pdf
    ├── core_gene_alignment.aln
    ├── core_genome_statistics.txt
    ├── core_genome.tree
    ├── mash
    ├── mlst.tsv
    ├── multiqc_data
    ├── multiqc_report.html
    ├── report_template.Rmd
    ├── snp_distance_matrix.tsv
    ├── snpma.fasta
    └── snp.tree

ar_predictions_binary.tsv - Presence/absence matrix of antibiotic resistance genes.
ar_predictions.tsv - Antibiotic tesistance genes detected.
cluster_report.pdf - Genome cluster report.
core_gene_alignment.aln - Alignment of the core set of genes.
core_gene_statistics.txt - Information about the number of core genes.
core_genome_tree.tree - The core-genome phylogenetic tree created by the core-genome pipeline.
mash/{sample}.mash.txt - Species prediction for each sample.
mlst.tsv - MLST scheme predictions.
multiqc_report.html - QC report.
report_template.Rmd - R Markdown template for generating the cluster_report.pdf
snp_distance_matrix.tsv - The SNP distances generated by the SNP pipeline.
snp.tree - The SNP tree created by the SNP pipeline.
snpma.fasta - The SNP alignment.

Authors

Kelsey Florek, WSLH Bioinformatics Scientist
Abigail Shockey, WSLH Bioinformatics Fellow

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dryad-2.0.1.tar.gz (47.1 kB view hashes)

Uploaded Source

Built Distribution

dryad-2.0.1-py3-none-any.whl (64.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page