A Random Forest classifier to predict bacteriophage lifestyle
BACPHLIP - a bacteriophage lifestyle prediction tool
Adam J. Hockenberry and Claus O. Wilke
Reference:
Pre-print available at: https://www.biorxiv.org/content/10.1101/2020.05.13.094805v1
Overview and important caveats
The BACPHLIP software is designed to test whether a given phage genome (.fasta formatted) is likely to be either temperate (lysogenic) or virulent (lytic). The software makes this determination by searching for a particular set of what are hypothesized to be "temperate-specific" protein domains. BACPHLIP makes several assumptions that users should be aware of:
1. The user input is a phage genome (nucleotide) sequence. BACPHLIP does not check whether the input nucleotide sequence is a phage, so users are cautioned to verify this themselves prior to running BACPHLIP. Random stretches of DNA will be called virulent phages (assuming that no relevant domains are found within the random sequence), not because there is any indication that the sequence is a virulent phage, but because no data overturns the starting assumption that you provided the program with a phage (see point 3 below). Similarly strange results will occur if you provide BACPHLIP with whole bacterial chromosomes; these will likely be called temperate phages simply because several of the relevant "temperate domains" are likely to be found somewhere within the chromosome.
2. The phage genome is complete. We stress that absence of evidence is not evidence of absence. If the genome is not complete, we simply do not have enough information to determine whether lysogeny-associated protein domains occur, so incomplete or partially assembled genomes should not be used as input.
3. The default/starting assumption is that any given input file is a virulent (lytic) phage. Depending on the number and identity of the lysogeny-associated proteins that are found, this default assumption may be updated by the random forest classifier to indicate that the sequence is in fact temperate. However, if no protein domains of interest are found at all, the phage will be called virulent.
4. Users should read through all documentation prior to use, as well as the (2 page) accompanying manuscript. We have taken care to enumerate the use cases and limitations of this software package. For instance, BACPHLIP was trained on a dataset consisting almost entirely of phages from the order Caudovirales, most of which infect hosts in the orders Actinobacteria, Gammaproteobacteria, and Bacilli. We urge caution when using the software on species outside of these orders, but note that this may change as we update and expand the training set in future releases.
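Because BACPHLIP does not validate its input, a quick sanity check before running it can catch the most obvious mistakes. The following is a minimal sketch, not part of BACPHLIP: the helper names `summarize_fasta` and `looks_like_single_genome` are hypothetical, and the check assumes that a nucleotide genome contains only the characters A, C, G, T, and N (so files using other IUPAC ambiguity codes would be flagged). It cannot tell whether a sequence is actually a phage or whether the genome is complete.

```python
# Hypothetical pre-flight check; these helpers are NOT part of BACPHLIP.
NUCLEOTIDES = set("ACGTN")  # assumption: unambiguous bases plus N only


def summarize_fasta(text):
    """Return (record_count, set_of_unexpected_characters) for FASTA text."""
    records = 0
    unexpected = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            records += 1  # each header line starts a new record
        else:
            unexpected |= set(line.upper()) - NUCLEOTIDES
    return records, unexpected


def looks_like_single_genome(text):
    """True if the text holds exactly one record made of A/C/G/T/N."""
    records, unexpected = summarize_fasta(text)
    return records == 1 and not unexpected
```

To apply this to a file, read it first, for example `looks_like_single_genome(open(path).read())`. A protein FASTA file or a multi-record file would return `False` here.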
Installation
You can install BACPHLIP with pip:
pip install bacphlip
Alternatively, users can clone or download the latest GitHub repository, navigate to the directory where BACPHLIP was downloaded, and run:
pip install .
BACPHLIP has several required dependencies outside of the standard library: biopython, pandas, joblib, and scikit-learn.
Additionally, users are required to install the HMMER3 software suite (in addition to the installation routes listed on the HMMER3 website we note that this tool can also be installed via conda). By default, BACPHLIP assumes that HMMER3 is installed in the system path, but local paths may be provided as run-time flags (see below).
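Before the first run it can be useful to confirm that `hmmsearch` is visible on the system path. The sketch below uses only the Python standard library; the helper name `find_tool` is hypothetical and is not part of BACPHLIP.

```python
import shutil


def find_tool(name="hmmsearch"):
    """Return the absolute path of an executable on the PATH, or None."""
    return shutil.which(name)


path = find_tool()
if path is None:
    # In this case a local path can be passed with --local_hmmsearch.
    print("hmmsearch was not found on the system path")
else:
    print(f"hmmsearch found at {path}")
```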
Examples
The most straightforward usage of BACPHLIP is as a command line tool. The required input is a genome (nucleotide) fasta file containing one record. Assuming that /valid/path/to/a/genome.fasta exists, you can call BACPHLIP with the command:
bacphlip -i /valid/path/to/a/genome.fasta
This command should create 4 separate files in the directory of the target genome.fasta. The file genome.fasta.bacphlip contains the final model predictions (tab-separated format) as the probability of the input phage being either "Virulent" or "Temperate"; the other files append .6frame, .hmmsearch, and .hmmsearch.tsv to the genome file name. Running this command a second time, assuming the first run worked, should raise an error since the output files already exist. This behavior can be altered with a flag that forces the files to be overwritten:
bacphlip -i /valid/path/to/a/genome.fasta -f
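To decide in advance whether the overwrite flag is needed, the expected output names can be derived from the input path. This is a minimal sketch based only on the four suffixes listed above; the helper names `expected_outputs` and `existing_outputs` are hypothetical and are not part of BACPHLIP.

```python
from pathlib import Path

# Suffixes that the text above says are appended to the input file name.
SUFFIXES = (".6frame", ".hmmsearch", ".hmmsearch.tsv", ".bacphlip")


def expected_outputs(genome_path):
    """Return the four output paths expected for a given input genome."""
    return [Path(str(genome_path) + suffix) for suffix in SUFFIXES]


def existing_outputs(genome_path):
    """Return the expected output paths that are already present on disk."""
    return [p for p in expected_outputs(genome_path) if p.exists()]
```

If `existing_outputs('/valid/path/to/a/genome.fasta')` returns a non-empty list, a second run without the overwrite flag would be expected to fail.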
A path to a local HMMER3 install (specifically, the hmmsearch tool) can be specified on the command line:
bacphlip -i /valid/path/to/a/genome.fasta --local_hmmsearch /valid/path/to/hmmsearch
Users wishing to run BACPHLIP on multiple phages in batch are encouraged to use the --multi_fasta run-time flag. In this case, the input genome (nucleotide) fasta file should contain multiple sequence records (one per complete genome) with unique IDs (as parsed by biopython). BACPHLIP will create a directory named after the input file, and the intermediate files associated with each sequence record will be named from the record ID and written to this directory. The final output file will contain a single table with predictions for each genome. Assuming that multigenome.fasta exists:
bacphlip -i /valid/path/to/a/multigenome.fasta --multi_fasta
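Since the intermediate files are named from the record IDs, duplicated IDs in a multi-record file are worth catching before a run. The sketch below is a standard-library approximation, not BACPHLIP code: the helper name `duplicate_ids` is hypothetical, and it assumes that a record ID is the header text up to the first whitespace, which is how Biopython's FASTA parser assigns `record.id`.

```python
from collections import Counter


def duplicate_ids(fasta_text):
    """Return a sorted list of record IDs that occur more than once."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            # Assumption: the ID is the first whitespace-delimited token.
            ids.append(fields[0] if fields else "")
    return sorted(i for i, count in Counter(ids).items() if count > 1)
```

An empty list means every record ID is unique under this assumption.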
BACPHLIP can also be accessed and used as a Python library. From a Python interpreter, simply type:
import bacphlip
bacphlip.run_pipeline('/valid/path/to/a/genome.fasta')
A batch of input files can be run as a loop using this library functionality, which will output and save a separate prediction file (.bacphlip, a simple tab-separated format) for each input:
import bacphlip
import glob
for infile_loc in glob.glob('/valid/path/to/a/set/of/files/*.fasta'):
    bacphlip.run_pipeline(infile_loc)
Alternatively, if multiple genomes are included in the same .fasta file, they can be analyzed with the run_pipeline_multi function, which creates a single .bacphlip file containing one row for each input:
import bacphlip
multi_fasta_file = '/valid/path/to/multi.fasta'
bacphlip.run_pipeline_multi(multi_fasta_file)
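The resulting tab-separated table can be read back with the standard library. The sketch below is illustrative only and rests on assumptions that should be checked against a real output file: it assumes a header row with columns named `Virulent` and `Temperate` (the two labels mentioned above) and that the first column labels each genome. The helper name `read_predictions` is hypothetical and is not part of BACPHLIP.

```python
import csv


def read_predictions(handle):
    """Parse a tab-separated prediction table into (label, virulent, temperate, call)."""
    reader = csv.DictReader(handle, delimiter="\t")
    label_column = reader.fieldnames[0]  # assumption: first column is the label
    rows = []
    for row in reader:
        virulent = float(row["Virulent"])    # assumed column name
        temperate = float(row["Temperate"])  # assumed column name
        # Ties fall to "Virulent" in this sketch.
        call = "Temperate" if temperate > virulent else "Virulent"
        rows.append((row[label_column], virulent, temperate, call))
    return rows
```

A typical use would be `with open('/valid/path/to/multi.fasta.bacphlip') as handle: rows = read_predictions(handle)`, where the output file name is itself an assumption extrapolated from the single-genome naming described earlier.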
Finally, using BACPHLIP as a library makes individual functions available to the user in order to run, and possibly troubleshoot, single steps. For example:
import bacphlip
bacphlip.six_frame_translate( ... )
bacphlip.hmmsearch_py( ... )
bacphlip.process_hmmsearch( ... )
bacphlip.predict_lifestyle( ... )
Each function has a relevant set of arguments that should be clear from the docs. It is our hope that running BACPHLIP in this manner will give more flexibility with regard to file names and may prove useful to some users.
Next steps
We have several planned next steps, including:
- adding a tutorial for library usage as a jupyter notebook in a forthcoming examples folder.
- adding the ability to run the pipeline in a "quiet" mode
- (insert your suggestion here)
Misc
The software is provided to you under the MIT license (see file LICENSE.txt).
The most up-to-date version of this software is available at
https://github.com/adamhockenberry/bacphlip.
The development of BACPHLIP is provided in a separate repository for transparency. See bacphlip-model-dev.
Contributing
Pull requests addressing errors or adding new functionalities are welcome on GitHub. However, to be accepted, contributions must pass the pytest unit tests.