Assort phage protein sequences into phamilies using MMseqs2

phamerate

The phamerate package facilitates pham assembly using MMseqs2. Default parameters have been carefully tuned for rapid, accurate exploration of the bacteriophage protein sequence space.

Installation - Conda

The easiest way to install the phamerate package and its dependencies is through the Anaconda/Miniconda package manager:

conda create -n phamerate python=3.9 phamerate -c bioconda -c conda-forge mmseqs2=13.45111 clustalo -y

Installation - Manual

If you don't have some flavor of conda available (and don't want to install it), you can install MMseqs2 manually by following the instructions here. Clustal Omega (clustalo) is an optional dependency and can be installed manually by following the instructions here. Most modern operating systems ship with Python 3, the language this package is written in and needs in order to run. If your system does not have Python 3.6 or higher, you will need to obtain it here.

Once all that is done, you can obtain the phamerate package from PyPI using pip:

pip3 install phamerate
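
After a manual install, it may be worth confirming that the external tools are visible on your PATH before running anything. The following is a minimal sketch, assuming the executables are named mmseqs and clustalo (the names used by the standard MMseqs2 and Clustal Omega distributions). The helper name find_tools is purely illustrative and is not part of phamerate.

```python
import shutil


def find_tools(names=("mmseqs", "clustalo", "phamerate")):
    """Map each executable name to its resolved path, or None if not on PATH.

    The default names are assumptions about how the tools are installed.
    """
    return {name: shutil.which(name) for name in names}


def report(tools):
    """Print one line per tool so missing dependencies are easy to spot."""
    for name, path in tools.items():
        print(f"{name}: {path if path else 'NOT FOUND'}")
```

Calling report(find_tools()) prints one line per tool. Since clustalo is optional, a missing entry for it only matters if you plan to use the -a argument.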

Basic Usage

With all dependencies met, you can confirm that phamerate runs by printing its help menu with the -h option:

phamerate -h

Which should print something like:

usage: phamerate [-h] [--cluster-mode] [--sensitivity] [--identity] [--coverage] [--evalue] [--no-hmm] [--hmm-identity] [--hmm-coverage] [--hmm-evalue] [-c] [-v] [-o] [-t] [-a] infile [infile ...]

Assort phage protein sequences into phamilies using MMseqs2.

positional arguments:
  infile             path to input file(s) in FASTA format

optional arguments:
  -h, --help         show this help message and exit
  -c , --cpus        number of threads to use [default: 8]
  -v, --verbose      print progress messages
  -o , --outdir      path to directory where output files should go [default: /Users/your_username]
  -t , --tmpdir      path where temporary file I/O should occur [default: /tmp/phamerate]
  -a, --align-phams  use Clustal Omega to align phams (this could take awhile...)

mmseqs arguments:
  --cluster-mode     clustering algorithm [default: 0]
  --sensitivity      sensitivity: 1.0 favors speed, 7.5 favors sensitivity [default: 4.0]
  --identity         percent identity for sequence-sequence clustering [default: 0.3]
  --coverage         percent coverage for sequence-sequence clustering [default: 0.85]
  --evalue           E-value threshold for sequence-sequence clustering [default: 0.001]
  --no-hmm           skip HMM clustering
  --hmm-identity     percent identity for consensus-HMM clustering [default: 0.25]
  --hmm-coverage     percent coverage for consensus-HMM clustering [default: 0.5]
  --hmm-evalue       E-value threshold for consensus-HMM clustering [default: 0.001]

Steinegger M. and Soeding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 2017. doi: 10.1038/nbt.3988

The only required argument is the path to a single multiple-FASTA file, for example:

phamerate my_genes.faa

This will perform pham assembly, and create a directory phamily_fastas containing a multiple-FASTA file for each gene phamily in the input gene set.
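
To get a quick feel for the result, you could tally how many sequences ended up in each pham. This is a rough sketch, not part of phamerate: it assumes only that phamily_fastas holds one FASTA file per pham, and it counts every regular file in that directory without relying on any particular file naming scheme. The function names are illustrative.

```python
from pathlib import Path


def count_sequences(fasta_path):
    """Count FASTA records in one file by counting '>' header lines."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def pham_sizes(pham_dir):
    """Return {file name: number of sequences} for every file in pham_dir.

    Assumption: each regular file in pham_dir is one pham in FASTA format.
    """
    return {
        path.name: count_sequences(path)
        for path in sorted(Path(pham_dir).iterdir())
        if path.is_file()
    }
```

For example, pham_sizes("phamily_fastas") returns a dictionary that can be sorted by value to find the largest phams, or filtered for size 1 to list singleton phams.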

An alternate output path can be specified with the -o argument:

phamerate my_genes.faa -o ~/Desktop/phamerate_results

This will do the same as before, except phamily_fastas will be found in ~/Desktop/phamerate_results rather than the directory the program was invoked from.

If your dataset consists of multiple FASTA files (e.g. one file per genome), you can specify multiple input files by listing them one after another:

phamerate genome1.faa genome2.faa genome3.faa ... genomeN.faa -o ~/Desktop/phamerate_results

or if all your genomes are in the same directory:

phamerate /path/to/genome/fastas/*.faa -o ~/Desktop/phamerate_results
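
If you drive phamerate from a script rather than a shell, wildcards and ~ are not expanded for you. The sketch below shows one way to assemble the same command in Python. It uses only the infile, -o, -v and -a arguments from the help menu above. The function name build_command and its parameters are illustrative, not part of the package.

```python
import glob
import os


def build_command(pattern, outdir, verbose=False, align=False):
    """Assemble a phamerate argument list from a glob pattern.

    The shell normally expands '*.faa' and '~'; here we do both explicitly.
    """
    infiles = sorted(glob.glob(os.path.expanduser(pattern)))
    if not infiles:
        raise FileNotFoundError(f"no files match {pattern!r}")
    command = ["phamerate", *infiles, "-o", os.path.expanduser(outdir)]
    if verbose:
        command.append("-v")  # print progress messages
    if align:
        command.append("-a")  # align phams with Clustal Omega
    return command
```

The returned list can be passed to subprocess.run(command, check=True). Building the list first makes it easy to print and inspect the exact command before committing to a long run.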

If you want a multiple sequence alignment (MSA) for each pham, phamerate can produce one using clustalo; simply add the -a argument:

phamerate my_genes.faa -o ~/Desktop/phamerate_results -a -v

The -v argument will make the program print progress messages to the console as it runs:

Found 378159 translations in 1 file(s)...
Creating MMseqs2 database...
Performing sequence-sequence clustering...
Parsing first iteration phams...
Building HMMs from pre-phams...
Extracting consensus sequences from HMMs...
Performing consensus-HMM clustering...
Parsing second iteration phams...
Found 22897 phamilies in dataset...
Aligning phams with Clustal Omega...
[############                                     ] 25%

This may be especially helpful on large datasets, as the progress bar updates to show what fraction of alignments have been computed. This should give you a sense of whether you have time to make a cup of coffee while it finishes...

Advanced Usage

If you have large bacterial pan-genomes to analyze, you may find that the BLAST-based method used by Roary is too slow for your needs. In this case, phamerate may be able to help: raise the --identity threshold to 0.9 (the same 90% identity threshold used by Roary) and supply the --no-hmm argument, since you won't be searching for remote homologs:

phamerate my_genes.faa -o ~/Desktop/phamerate_results -a -v --identity 0.9 --no-hmm

As in the earlier example, -v prints progress messages as the program runs. Because --no-hmm skips HMM clustering, expect the HMM-related steps shown in that earlier output to be absent here.

Fair warning: at present, phamerate does NOT make any effort to split paralogs out of gene phamilies.
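
Until paralog splitting is available, you may want to flag phams that contain more than one gene from the same genome and review them by hand. The sketch below is one possible approach, not something phamerate does. It rests on a HYPOTHETICAL header convention in which each FASTA header begins with an identifier of the form Genome_N (genome name, underscore, gene number). Adapt genome_of to whatever your headers actually look like.

```python
from collections import Counter


def genome_of(header):
    """Extract the genome name from a FASTA header.

    HYPOTHETICAL convention: '>PhageA_42 some description' -> 'PhageA'.
    """
    gene_id = header.lstrip(">").split()[0]
    return gene_id.rsplit("_", 1)[0]


def genomes_with_multiple_members(headers):
    """Return {genome: count} for genomes contributing more than one gene
    to a single pham; these are candidate paralogs worth inspecting."""
    counts = Counter(genome_of(h) for h in headers)
    return {genome: n for genome, n in counts.items() if n > 1}
```

Feeding this the header lines of one pham's FASTA file gives an empty dictionary when every genome contributes at most one gene. A non-empty result does not prove paralogy, but it does point to phams that deserve a closer look.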

Future Releases

We would like to do the following in future releases:

  • remove paralogs from phamilies [enhancement]
  • export gene_presence_absence.csv [enhancement]
  • export summary file with numbers of core/soft-core/shell/cloud genes [enhancement]
  • export figure showing marginal pan-genome (each new genome adds...) [enhancement]
  • create tree(s) based on well-conserved genes [enhancement]
