Assembling pure culture phages from both Illumina and Nanopore sequencing technology

Environment
- Console
- MacOS X
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Natural Language
- English
Operating System
- POSIX :: Linux
Programming Language
- Python :: 3.10
- Python :: 3.11
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

Sphae

Phage toolkit to detect phage candidates for phage therapy

Overview

This Snakemake workflow, built using Snaketool [https://doi.org/10.1371/journal.pcbi.1010705], assembles and annotates phage sequences. It is currently being developed for phage genomes. The steps are:

Quality control: removes adaptor sequences, low-quality reads and, optionally, host contamination.
Assembly
Contig quality checks: read coverage, viral or not, completeness, and assembly graph components.
Phage genome annotation

A complete list of the programs used for each step is given in the sphae.CITATION file.

Install

Pre-requisites

gcc
conda
libgl1-mesa-dev (Ubuntu, for Bandage)
libxcb-xinerama0 (Ubuntu, for Bandage)
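
As a quick check before installing, the two command-line prerequisites can be looked up on the PATH. This is a minimal sketch and not part of sphae: the helper name missing_tools is hypothetical, and it only covers gcc and conda. The two Ubuntu libraries needed for Bandage would have to be checked through the package manager instead.

```shell
# Hypothetical helper (not part of sphae): print each named tool
# that cannot be found on the PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the command-line prerequisites listed above.
missing=$(missing_tools gcc conda)
if [ -n "$missing" ]; then
    echo "Missing prerequisites:" $missing
else
    echo "gcc and conda found"
fi
```

An empty result from missing_tools means every named tool was found.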

Install

Setting up a new conda environment

conda create -n sphae python=3.11
conda activate sphae
conda install -n base -c conda-forge mamba #if you don't already have mamba installed

Steps for installing sphae workflow

#clone sphae repository
git clone https://github.com/linsalrob/sphae.git

#move to sphae folder
cd sphae

#install sphae
pip install -e .

#confirm the workflow is installed by running the command below
sphae --help

Installing databases

Run the command,

sphae install

This installs the databases to the directory sphae/workflow/databases

This workflow requires the following databases:

Pfam35.0 database, used to run viral_verify for contig classification
CheckV database, used to test for phage completeness
Pharokka databases
Phynteny models

This step takes approximately 1 hr 30 min and requires 9 GB of storage.
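
Because the databases need roughly 9G of storage, it can help to check free disk space before running sphae install. The sketch below is an illustration only: has_free_space is a hypothetical helper, and it assumes the 9G figure can be treated as 9 x 1024 x 1024 KiB. It reads the "available" column of POSIX df output.

```shell
# Hypothetical helper (not part of sphae): succeed if the filesystem
# holding directory $1 has at least $2 GiB available.
# With df -P, line 2 column 4 is the available space in KiB (because of -k).
has_free_space() {
    avail_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
    need_kb=$(( $2 * 1024 * 1024 ))
    [ "$avail_kb" -ge "$need_kb" ]
}

# Checks the current directory; point it at the filesystem that will
# hold sphae/workflow/databases.
if has_free_space . 9; then
    echo "Enough free space for the databases"
else
    echo "Less than 9G free; free up space before running sphae install"
fi
```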

Running the workflow

The command sphae run runs QC, assembly and annotation.

Commands to run

A single command runs all of the steps above.

#For Illumina reads, place both the forward and reverse reads in one directory
sphae run --input tests/data/illumina-subset --output example -k

#For Nanopore reads, place the reads in one directory, one file per sample
sphae run --input tests/data/nanopore-subset --sequencing longread --output example -k

#To run either command on a cluster, add --profile slurm. For instance, here is the command for long (Nanopore) reads
#Before running the command below, make sure the slurm config files are set up; here is a tutorial: https://fame.flinders.edu.au/blog/2021/08/02/snakemake-profiles-updated
sphae run --input tests/data/nanopore-subset --sequencing longread --output example --profile slurm -k
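
Since Illumina input needs both the forward and reverse reads in the same directory, a small pre-flight check can catch a missing mate before the workflow starts. This sketch rests on an assumption that is not stated above: that paired files are distinguished by _R1 and _R2 in their names. Check the files in tests/data/illumina-subset for the pattern sphae actually expects. The helper name unpaired_forward_reads is hypothetical.

```shell
# Hypothetical helper (not part of sphae): list forward-read files in
# directory $1 that have no matching reverse-read file.
# ASSUMPTION: pairs are named with _R1 / _R2; adjust to your naming scheme.
unpaired_forward_reads() {
    dir=$1
    for fwd in "$dir"/*_R1*; do
        [ -e "$fwd" ] || continue          # glob matched nothing
        name=${fwd##*/}                    # strip the directory part
        rev="$dir/$(printf '%s' "$name" | sed 's/_R1/_R2/')"
        [ -e "$rev" ] || printf '%s\n' "$name"
    done
}

reads_dir=tests/data/illumina-subset
unpaired=$(unpaired_forward_reads "$reads_dir")
if [ -n "$unpaired" ]; then
    echo "No reverse read found for: $unpaired"
fi
```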

Output

Output is saved to the example/RESULTS directory, which will contain four files:

Genome annotations in GenBank format (Phynteny output)
Genome in FASTA format (either the Pharokka output reoriented to the terminase, or the assembled viral contigs)
Circular visualization in PNG format (Pharokka output)
Genome summary file

The genome summary file includes the following information:

Sample name
Length of the genome
Coding density
Whether the assembled contig is circular (from the assembly graph)
Completeness (calculated by CheckV)
Contamination (calculated by CheckV)
Taxonomy accession ID (Pharokka output; the genome is searched against the INPHARED database using mash)
Taxa mash: the number of hashes of the assembled genome that match the accession ID/taxa name. The more matching hashes, the more likely the genome is related to the predicted taxa
Gene searches:
- Whether an integrase was found (search for the integrase gene in the annotations)
- Whether antimicrobial resistance genes were found (Pharokka search against AMR database)
- Whether any virulence factors were found (Pharokka search against virulence gene database)
- Whether any CRISPR spacers were found (Pharokka search using MinCED)
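
To make the matching-hash count in the Taxa mash field easier to compare across samples, it can be converted to a fraction. The sketch below assumes the count is written as shared/total (for example 950/1000). The summary format is not specified above, so treat this as an illustration; hash_fraction is a hypothetical helper.

```shell
# Hypothetical helper (not part of sphae): convert a "shared/total"
# hash count into a fraction with three decimal places.
# ASSUMPTION: the value looks like 950/1000; prints nothing if total is 0.
hash_fraction() {
    printf '%s\n' "$1" | awk -F/ '$2 > 0 { printf "%.3f\n", $1 / $2 }'
}

hash_fraction 950/1000
```

A fraction close to 1 means most hashes are shared, so the genome is more likely to be related to the predicted taxa.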

Issues and Questions

This is still a work in progress, so if you come across any issues or errors, report them under Issues.

Release history

1.31 (this version): Nov 30, 2023
1.3.3: May 3, 2024
1.3: Nov 29, 2023
1.2: Nov 28, 2023
1.1: Nov 21, 2023

Download files

Source Distribution

sphae-1.31.tar.gz (55.0 kB)

Uploaded Nov 30, 2023 Source

Hashes for sphae-1.31.tar.gz
Algorithm	Hash digest
SHA256	`89be2ea3ecba88fd8133ff2e397f992e0858c35c072f9ff53493289d5b0c81f9`
MD5	`c79e1b03c1f6780826cdf819714357ba`
BLAKE2b-256	`caddf0d196e089c83062bf0a32f068a52555a263a275489aa2079492b831071e`