Predict bacterial or phage sequence
Seeker is a Python library for discriminating between bacterial and phage genomes. Seeker is based on LSTM deep-learning models and does not rely on a reference genome, genomic alignment or any direct genome comparison.
Overview
This file describes a Python package that implements Seeker, an alignment-free method for discriminating between bacterial and phage DNA sequences, based on a deep learning framework [1]. This package can call classifiers that were trained with either (a) a Python Keras LSTM with an embedding layer, or (b) a Matlab-trained LSTM with a sequence input layer, which was converted to a Keras model.
If you have any trouble installing or using Seeker, please let us know by opening an issue on GitHub or emailing us (either ayal.gussow@gmail.com or noamaus@gmail.com).
Note: Seeker relies on tensorflow, which is not yet supported in python 3.8. Therefore, to use Seeker you need to use Python 3.6 or 3.7. Creating different Python environments is easy using conda (https://docs.conda.io/en/latest/).
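As a minimal sketch of the version constraint above, a script could check the interpreter before importing Seeker. The helper name is_supported_python and the SUPPORTED_VERSIONS constant are hypothetical and not part of Seeker; the version list is taken from the note above.

```python
import sys

# (major, minor) pairs this README lists as usable with Seeker.
# Assumption: only the minor version matters, not the patch level.
SUPPORTED_VERSIONS = {(3, 6), (3, 7)}


def is_supported_python(version_info=None):
    """Return True if the interpreter version is one the README lists.

    Hypothetical helper, not part of the Seeker package.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED_VERSIONS


if __name__ == "__main__":
    if not is_supported_python():
        print("Seeker's README lists Python 3.6 or 3.7; found %d.%d"
              % sys.version_info[:2])
```

If the check fails, creating a dedicated conda environment as described under Installation is the simplest fix.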
Citation
[1] Noam Auslander*, Ayal B. Gussow*#, Sean Benler, Yuri I. Wolf, Eugene V. Koonin#. Seeker: Alignment-free identification of bacteriophage genomes by deep learning. (*) These authors contributed equally; (#) corresponding authors.
Installation
Seeker requires python3, and has been tested with python3.6 and python3.7. Seeker can be installed using pip. From a terminal, run:
pip install seeker
This will install Seeker and all of its dependencies.
Installation using Conda
Conda provides an easy method to install Seeker. First, install conda or miniconda (https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html).
Then run the following commands to install seeker:
conda create --name seeker python=3.7 pip
conda activate seeker
pip install seeker
Note: If you rely on conda, any time you want to use Seeker's libraries or commands you have to first run:
conda activate seeker
Usage
The Seeker package consists of a binary that can be run from the command line and a Python library that can be incorporated into Python scripts.
Binaries
Seeker includes a binary that predicts whether an entire sequence is bacterial or phage.
To predict whether sequences are bacterial or phage, run the following from the terminal:
predict-metagenome input.fa
This will output a prediction for each sequence in input.fa along with Seeker's score. Scores are between 0 and 1. Higher scores correspond to phage predictions while lower scores correspond to bacterial predictions. Sequences with scores above 0.5 are predicted phages, while sequences with scores below 0.5 are predicted bacteria.
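The thresholding rule described above can be sketched in a few lines. This is only an illustration of the rule, not Seeker's own code: the function name label_from_score and the label strings are hypothetical, and treating a score of exactly 0.5 as phage is an assumption (it matches the "0.5 and above" wording used for meta2fasta in the Python Library section).

```python
def label_from_score(score, threshold=0.5):
    """Map a Seeker-style score in [0, 1] to a label.

    Hypothetical helper for illustration; not part of the Seeker package.
    Assumption: a score equal to the threshold counts as phage.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1, got %r" % (score,))
    return "phage" if score >= threshold else "bacteria"


# Example scores (made up for illustration, not real Seeker output).
for example_score in (0.93, 0.12):
    print(example_score, label_from_score(example_score))
```

Raising the threshold trades sensitivity for fewer false phage calls, so the right value depends on the user's goals.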
Python Library
The primary class in the Python library is SeekerFasta. SeekerFasta can load a Fasta file and score its entries using Seeker. SeekerFasta has the following parameters:
- path_or_str. Either a path to a Fasta or a Fasta string.
- LSTM_type. Which LSTM implementation to use. Options are "python", "matlab", "prophage" (not recommended). Default is "matlab".
- seeker_model. If you've already loaded a model into a SeekerModel object and prefer to use that model, you can provide it as a parameter here. Default is None, in which case the model will be loaded from file.
- load_seqs. Whether to preload all Fasta sequences to memory. Default is True.
- is_fasta_str. Set to True if you provided a Fasta string instead of a path to a Fasta file. Default is False.
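As a sketch of the is_fasta_str parameter, the following builds a Fasta string in memory and passes it to SeekerFasta instead of a file path. The helpers to_fasta_str and score_fasta_str are hypothetical and not part of Seeker; the SeekerFasta call uses only the parameters listed above and the phage_or_bacteria method from this README, and it has not been checked against any minimum sequence length Seeker may require.

```python
def to_fasta_str(records):
    """Join (name, sequence) pairs into one Fasta-formatted string.

    Hypothetical helper, not part of the Seeker package.
    """
    return "".join(">{}\n{}\n".format(name, seq) for name, seq in records)


def score_fasta_str(fasta_str):
    """Score an in-memory Fasta string with Seeker (requires Seeker installed)."""
    # Imported inside the function so the helper above works without Seeker.
    from seeker import SeekerFasta

    # is_fasta_str=True tells SeekerFasta the first argument is Fasta text,
    # not a path. All other parameters keep their defaults.
    seeker_fasta = SeekerFasta(fasta_str, is_fasta_str=True)
    return seeker_fasta.phage_or_bacteria()


# Placeholder record; substitute real contig names and sequences.
fasta_text = to_fasta_str([("contig_1", "ACGTACGTACGT")])
print(fasta_text)
```

Calling score_fasta_str(fasta_text) would then return the same kind of prediction list as the file-based usage.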
Once a Fasta is loaded, there are several functions that can be used to generate Seeker predictions. For example, to predict whether the entries of a Fasta are phage or bacteria:
from seeker import SeekerFasta
seeker_fasta = SeekerFasta("input.fa")
predictions = seeker_fasta.phage_or_bacteria() # This returns a list of phage/bacteria predictions for the Fasta
print("\n".join(predictions)) # print predictions
# To filter the Fasta file for predicted phage sequences, the following will
# create a new fasta and save it to "seeker_phage_contigs.fa" with all sequences with
# a Seeker score of 0.5 and above (threshold can be adjusted per user goals)
seeker_fasta.meta2fasta(out_fasta_path="seeker_phage_contigs.fa", threshold=0.5)
Alternatively, to predict prophages:
seeker_fasta = SeekerFasta("input.fa", LSTM_type="prophage")
seeker_fasta.save2bed("output.bed") # Save prophage coordinates to BED file
seeker_fasta.save2fasta("output.fa") # Save prophage sequences to Fasta file
NOTE: Seeker was not trained to predict prophages. The prophage model is the output of the first training step described in [1]. This model has not been tested thoroughly for prophage prediction, and its performance is affected by the prophage prediction parameters, which depend on the organism and the user's goals. For this reason, using this model for prophage detection is not recommended, except as an initial filtering step.
LSTM Models
The LSTM models can be found in the models directory.
- model.h5. Metagenome LSTM model, trained in python using Keras.
- MatModel0.h5. Metagenome LSTM model, trained in matlab.
- MatModePRO.h5. Prophage LSTM model, trained in matlab.
Datasets
Training, validation and test datasets are available from: ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/Seeker/
Contact
If you run into any issues or have any questions, feel free to open an issue on GitHub or email us at either ayal.gussow@gmail.com or noamaus@gmail.com.