Prophage finder using multiple metrics
Project description
What is PhiSpy?
PhiSpy identifies prophages in Bacterial (and probably Archaeal) genomes. Given an annotated genome it will use several approaches to identify the most likely prophage regions.
Initial versions of PhiSpy were written by
Sajia Akhter (sajia@stanford.edu) Edwards Bioinformatics Lab
Improvements, bug fixes, and other changes were made by
Katelyn McNair Edwards Bioinformatics Lab and Przemyslaw Decewicz DEMB at the University of Warsaw
Installation
Conda
The easiest way to install for all users is to use bioconda
.
conda install -c bioconda phispy
PIP
python-pip
requires a C++ compiler and the Python header files. You should be able to install it like this:
sudo apt install -y build-essential python3-dev python3-pip
python3 -m pip install --user PhiSpy
This will install PhiSpy.py
in ~/.local/bin
which should be in your $PATH
but might not be (see this detailed discussion). See the tips and tricks below for a solution to this.
Advanced Users
For advanced users, you can clone the git repository and use that (though pip
is the recommended install method).
git clone https://github.com/linsalrob/PhiSpy.git
cd PhiSpy
python3 setup.py install --user --record installed_files.txt
Note that we recommend using --record to save a list of all the files that were installed by PhiSpy
. If you ever want to uninstall it, or to remove everything to reinstall e.g. from pip
, you can simply use the contents of that file:
cat installed_files.txt | xargs rm -f
If you have root and you want to install globally, you can change the setup command.
git clone https://github.com/linsalrob/PhiSpy.git
cd PhiSpy
python3 setup.py install
For ease of use, you may wish to add the location of PhiSpy.py to your $PATH.
Software Requirements
PhiSpy requires following programs to be installed in the system. Most of these are likely already on your system or will be installed using the mechanisms above.
Python
- version 3.4 or laterBiopython
- version 1.58 or latergcc
- GNU project C and C++ compiler - version 4.4.1 or later- The
Python.h
header file. This is included inpython3-dev
that is available on most systems.
Testing PhiSpy.py
Download the Streptococcus pyogenes M1 genome
curl -Lo Streptococcus_pyogenes_M1_GAS.gb https://bit.ly/37qFArb
PhiSpy.py -o Streptococcus.phages Streptococcus_pyogenes_M1_GAS.gb
or to run it with the Streptococcus
training set:
PhiSpy.py -o Streptococcus.phages -t data/trainSet_160490.61.txt Streptococcus_pyogenes_M1_GAS.gb
This uses the GenBank
format file for Streptococcus pyogenes M1 GAS that we provide in the tests/ directory, and we use the training set for S. pyogenes M1 GAS that we have pre-calculated. This quickly identifies the four prophages in this genome, runs the repeat finder on all of them, and outputs the answers.
You will find the output files from this query in output_directory
.
Running PhiSpy.py
The simplest command is:
PhiSpy.py genbank_file -o output_directory
where:
genbank file
: The input DNA sequence file in GenBank format.output directory
: The output directory is the directory where the final output file will be created.
If you have new genome, we recommend annotating it using the RAST server or PROKKA. RAST has a server that allows you to upload and download the genome (and can handle lots of genomes), while PROKKA is stand-alone software.
gzip support
PhiSpy.py
natively supports both reading and writing files in gzip
format. If you provide a gzipped
input file, we will write a gzipped
output file.
HMM Searches
When also considering the signal from HMM profile search:
PhiSpy.py genbank_file -o output_directory --phmms hmm_db --threads 4 --color
where:
hmm_db
: reference HMM profiles database to search with genome-encoded proteins (at the moment)
Training sets were searched with pVOG database HMM profiles: AllvogHMMprofiles.tar.gz. To use it:
wget http://dmk-brain.ecn.uiowa.edu/pVOGs/downloads/All/AllvogHMMprofiles.tar.gz
tar -zxvf AllvogHMMprofiles.tar.gz
cat AllvogHMMprofiles/* > pVOGs.hmm
Then use pVOGs.hmm
as hmm_db
.
Since extra step before the regular processing of PhiSpy is performed, input genbank file
is updated and saved in output_directory
.
When --color
flag is used, additional qualifier /color
will be added in the updated GenBank file so that the user could easily distinguished proteins with hits to hmm_db
while viewing the file in Artemis
When running PhiSpy again on the same input data and with --phmms
option you can skip the search step by --skip_search
flag.
Help
For the help menu use the -h
option:
python PhiSpy.py -h
Output Files
PhiSpy
has the option of creating multiple output files with the prophage data:
- prophage_coordinates.tsv (code: 1)
This is the coordinates of each prophage identified in the genome, and their att sites (if found) in tab separated text format.
The columns of the file are:
-
- Prophage number
-
- The contig upon which the prophage resides
-
- The start location of the prophage
-
- The stop location of the prophage If we can detect the att sites, the additional columns are:
-
- start of attL;
-
- end of attL;
-
- start of attR;
-
- end of attR;
-
- sequence of attL;
-
- sequence of attR;
-
- The explanation of why this att site was chosen for this prophage.
- GenBank format output (code: 2)
We provide a duplicate GenBank record that is the same as the input record, but we have inserted the prophage information, including att sites into the record.
If the original GenBank file was provided in gzip
format this file will also be created in gzip format.
- prophage and bacterial sequences (code: 4)
PhiSpy
can automatically separate the DNA sequences into prophage and bacterial components. If this output is chosen, we generate two fasta
files. The first contains the entire genome, but the prophage regions have been masked with N
s. We explicitly chose this format for a
few reasons: (i) it is trivial to convert this format into separate contigs without the Ns but it is more complex to go from separate contigs
back to a single joined contig; (ii) when read mapping against the genome, understanding that reads map either side of a prophage maybe
important; (iii) when looking at insertion points this allows you to visualize the where the prophage was lying.
- prophage_information.tsv (code: 8)
This is a tab separated file, and is the key file to assess prophages in genomes (see assessing predictions, below). The file contains all the genes of the genome, one per line. The tenth colum represents the status of a gene. If this column is 0 then we consider this a bacterial gene. If it is non-zero it is probably a phage gene, and the higher the score the more likely we believe it is a phage gene. This is the raw data that we use to identify the prophages in your genome.
This file has 16 columns:
-
- The id of each gene;
-
- function: function of the gene (or
product
from a GenBank file);
- function: function of the gene (or
-
- contig;
-
- start: start location of the gene;
-
- stop: end location of the gene;
-
- position: a sequential number of the gene (starting at 1);
-
- rank: rank of each gene provided by random forest;
-
- my_status: status of each gene based on random forest;
-
- pp: classification of each gene based on their function;
-
- Final_status: the status of each gene. For prophages, this column has the number of the prophage as listed in prophage.tbl above; If the column contains a 0 we believe that it is a bacterial gene. Otherwise we believe that it is possibly a phage gene.
If we can detect the att sites, the additional columns are:
-
- start of attL;
-
- end of attL;
-
- start of attR;
-
- end of attR;
-
- sequence of attL;
-
- sequence of attR;
- prophage.tsv (code: 16)
This is a simpler version of the prophage_coordinates.tsv file that only has prophage number, contig, start, and stop.
- GFF3 format (code: 32)
This is the prophage information suitable for insertion into a GFF3. This is a legacy file format, however, since GFF3 is no longer widely supported, this only has the prophage coordinates. Please post an issue on GitHub if more complete GFF3 files are required.
- prophage.tbl (code: 64)
This file has two columns separated by tabs [prophage_number, location]. This is a also a legacy file that is not generated by default. The prophage number is a sequential number of the prophage (starting at 1), and the location is in the format: contig_start_stop that encompasses the prophage.
Choosing which output files are created.
We have provided the option (--output_choice
) to choose which output files are created. Each file above has a code associated with it,
and to include that file add up the codes:
Code | File |
---|---|
1 | prophage_coordinates.tsv |
2 | GenBank format output |
4 | prophage and bacterial sequences |
8 | prophage_information.tsv |
16 | prophage.tsv |
32 | GFF3 format |
64 | prophage.tbl |
So for example, if you want to get GenBank format output
(2) and prophage_information.tsv
(8), then
enter an --output_choice
of 10.
The default is 3: you will get both the prophage_coordinates.tsv
and GenBank format output
files.
If you want all files output, use --output-choice 127
.
Example Data
- Streptococcus pyogenes M1 GAS which has a single genome contig. The genome contains four prophages.
To analyze this data, you can use:
PhiSpy.py -o output_directory -t data/trainSet_160490.61.txt tests/Streptococcus_pyogenes_M1_GAS.gb.gz
And you should get a prophage table that has this information (for example, take a look at output_directory/prophage.tbl
).
Prophage number | Contig | Start | Stop |
---|---|---|---|
pp_1 | NC_002737 | 529631 | 569288 |
pp_2 | NC_002737 | 778642 | 820599 |
pp_3 | NC_002737 | 1192630 | 1222549 |
pp_4 | NC_002737 | 1775862 | 1782822 |
Assessing predictions
As with any software, it is critical that you assess the output from phispy
to see if it actually makes sense! We start be ensuring we have the prophage_information.tsv
file output (this is not output by default, and requires adding 8 to the --output-choice
flag).
That is a tab-separated text file that you can import into Microsoft Excel, LibreOffice Calc, Google Sheets, or your favorite spreadsheet viewing program.
There are a few columns that you should pay attention to:
- position (the 6th column) is the position of the gene in the genome. If you sort by this column you will always return the genome to the original order.
- Final status (the 10th column) is whether this region is predicted to be a prophage or not. The number is the prophage number. If the entry is 0 it is not a prophage.
- pp and my status (the 8th and 9th columns) are interim indicators about whether this gene is potentially part of a phage.
We recommend:
- Freeze the first row of the spreadsheet so you can see the column headers
- Sort the spreadsheet by the my status column and color any row red where the value in this column is greater than 0
- Sort the spreadsheet by the final status column and color those rows identified as a prophage green.
- Sort the spreadsheet by the position column.
Now all the prophages are colored green, while all the potential prophage genes that are not included as part of a prophage are colored red. You can easily review those non-prophage regions and determine whether you think they should be included in prophages. Note that in most cases you can adjust the phispy
parameters to include regions you think are prophages.
Note: Ensure that while you are reviewing the results, you pay particular attention to the contig column. In partial genomes, contig breaks are very often located in prophages. This is usual because prophages often contain sequences that are repeated around the genome. We have an open issue open issue to try and resolve this in a meaningful way.
Tips, Tricks, and Errors
If you are feeling lazy, you actually only need to use sudo apt install -y python3-pip; python3 -m pip install phispy
since python3-pip requires build-essential
and python3-dev
!
If you try PhiSpy.py -v
and get an error like this:
$ PhiSpy.py -v
-bash: PhiSpy.py: command not found
Then you can either use the full path:
~/.local/bin/PhiSpy.py -v
or add that location to your $PATH
:
echo "export PATH=\$HOME/.local/bin:\$PATH" >> ~/.bashrc
source ~/.bashrc
PhiSpy.py -v
Making your own training sets
If within reference datasets, close relatives to bacteria of your interest are missing, you can make your own training sets by providing at least a single genome in which you indicate prophage proteins. This is done by adding a new qualifier GenBank annotation for each CDS feature within a prophage region: /is_phage="1"
. This allows PhiSpy to distinguish the signal from bacterial/phage regions and make a training set to use afterwards during classification with random forest algorithm.
To make a training set out of your files use the PhiSpy.py
option -m
:
PhiSpy.py -o output_directory -k kmer_size -t kmers_type -g groups_file --retrain --phmms hmm_db --color --threads 4 genome.gb.gz
where:
output_directory
: a directory where are temporary and final training sets will be written.kmer_size
: is the size of kmers that will be produces. By default it's 12. If changed, remember to also change that parameter while running PhiSpy with produced training sets.kmers_type
: type of generated kmers. By default 'all' means generating kmers by 1 nt. If changed, remember to also change that parameter while running PhiSpy with produced training sets.groups_file
: a file mapping GenBank file names with extension and the name of group they will make; each file can be assigned to more than one group - take a look at how the reference data grouping file was constructed:tests/groups.txt
. Beside the flags that allow training with phmm signal, there's also a--retrain
flag. When used, it overwrites all the training sets in theoutput_directory
that will be produced while training. That includes:phage_kmers_all_wohost.txt
,trainSets_genericAll.txt
andtrainingGenome_list.txt
. The same will happen whentrainingGenome_list.txt
is missing inoutput_directory
.genome.gb.gz
is a gzip compressed GenBank file that has the/is_phage="1"
flags set. If--retrain
is not set, the script extends thetrainingGenome_list.txt
, adds new files tooutput_directory
(overwrites the ones with the same group name) and updatesphage_kmers_all_wohost.txt
.
Preparing GenBank files
- it is recommended to mark prophage proteins even from prophage remnants/disrupted regions composed of a few proteins with
/is_phage="1"
to minimize the loss of good signal, kmers in particular, - don't use too many genomes (e.g. a 100) as you may end up with a small set of phage-specific kmers,
- try to pick several genomes with different prophages to increase the diversity.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.