
tree-based orthology inference


PhyloPyPruner

PhyloPyPruner is a Python package for tree-based orthology inference that is used to refine the output of a graph-based approach (e.g., OrthoMCL, OrthoFinder or HaMStR) by removing sequences related through gene duplication. In addition to filters and algorithms seen in pre-existing tools such as PhyloTreePruner, UPhO, Agalma and Phylogenomic Dataset Reconstruction, this package provides new methods for differentiating contamination-like sequences from paralogs.

By providing earlier tree-based approaches as a single executable, PhyloPyPruner also has a unique combination of features such as:

  • allowing polytomies in input trees for all paralogy pruning algorithms
  • collapsing weakly supported nodes into polytomies in combination with all paralogy pruning algorithms
  • rooting trees after monophyly masking in combination with the 'largest subtree (LS)' paralogy pruning algorithm

PhyloPyPruner is currently under active development and I would appreciate it if you try this software on your own data and leave feedback.

See the Wiki for more details.

Features

  • Remove short sequences
  • Remove sequences with a long branch length relative to others
  • Collapse weakly supported nodes into polytomies
  • Prune paralogs using one out of five methods
  • Mask monophylies by keeping the longest sequence or the sequence with the shortest pairwise distance
  • Root the tree using midpoint or outgroup rooting
  • Calculate and visualize paralogy frequency (PF), the number of paralogs for an OTU divided by the number of alignments that said OTU is present in
  • Exclude OTUs with a high PF relative to all OTUs
  • Remove OTUs with an average pairwise distance that is high relative to all OTUs
  • Ignore OTUs one-by-one during tree-based orthology inference and identify OTUs whose exclusion improves metrics of supermatrix quality such as the number of output alignments or missing data
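
As a rough illustration of the paralogy frequency (PF) definition in the list above, the following sketch divides the number of paralogs for each OTU by the number of alignments that OTU is present in. It is not PhyloPyPruner's own implementation: the function name `paralogy_frequency` and the input layout (one list of OTU labels per alignment) are hypothetical, and it assumes that every copy of an OTU beyond the first within an alignment counts as one paralog. It also assumes Python 3 division.

```python
from collections import Counter


def paralogy_frequency(alignments):
    """Return {OTU: PF} for a list of alignments.

    Each alignment is a list of OTU labels, one label per sequence.
    Assumption: extra copies of an OTU in an alignment are its paralogs.
    """
    paralogs = Counter()  # total paralogs seen per OTU
    present = Counter()   # number of alignments each OTU occurs in
    for otus in alignments:
        for otu, copies in Counter(otus).items():
            present[otu] += 1
            paralogs[otu] += copies - 1
    return {otu: paralogs[otu] / present[otu] for otu in present}


# Toy data: three alignments, OTUs A, B and C.
example = [["A", "A", "B"], ["A", "B", "B", "B"], ["A", "C"]]
pf = paralogy_frequency(example)
```

Here OTU B has two paralogs across the two alignments it occurs in, so its PF is 1.0, while C has none and gets 0.0.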

Installation

This software runs under both Python 3 and 2.7. There are no external dependencies, but the plotting library Matplotlib can be installed for generating paralog frequency plots.

You can install PhyloPyPruner using pip.

pip install --user phylopypruner

Usage

To get a list of options, run the software without any arguments or use the -h or --help flag. PhyloPyPruner requires either a multiple sequence alignment (MSA) in FASTA format together with its corresponding Newick tree, or the path to a directory containing multiple trees and alignments.

Example 1. Providing a single corresponding tree and alignment. In this case, monophyly masking is performed by keeping the sequence with the shortest pairwise distance to its sister group, and paralogy pruning is done using the largest subtree (LS) algorithm.

python -m phylopypruner --msa <filename>.fas --tree <filename>.tre

Example 2. Run PhyloPyPruner for every MSA and tree pair within the directory in <path>. Don't include orthologs with fewer than 10 OTUs, remove sequences shorter than 100 positions, collapse nodes with a support value lower than 80% into polytomies, remove branches that are more than 5 times longer than the standard deviation of all branch lengths, and remove OTUs with a paralogy frequency that is larger than 5 times the standard deviation of the paralogy frequency for all OTUs.

python -m phylopypruner --dir <path> --min-taxa 10 --min-len 100 --min-support 80 --trim-lb 5 --trim-freq-paralogs 5
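
The thresholds in Example 2 are expressed as a multiple of a standard deviation. A minimal sketch of that kind of cutoff is shown below. The helper `flag_high_values` is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption; the source does not say which estimator PhyloPyPruner uses or how it handles edge cases.

```python
import statistics


def flag_high_values(values, factor):
    """Return the keys whose value exceeds factor * standard deviation.

    values: dict mapping a label (e.g. an OTU) to a number (e.g. its PF).
    Assumption: population standard deviation over all values.
    """
    if len(values) < 2:
        return []  # no spread to compare against
    threshold = factor * statistics.pstdev(values.values())
    return sorted(key for key, value in values.items() if value > threshold)


# Toy paralogy frequencies for four OTUs.
frequencies = {"A": 0.0, "B": 0.1, "C": 0.1, "D": 2.0}
flagged_strict = flag_high_values(frequencies, 2)
flagged_lenient = flag_high_values(frequencies, 5)
```

With these toy numbers, a factor of 2 flags only OTU D, while a factor of 5 flags nothing, which shows how a larger factor makes the filter more permissive.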

Example 3. Run PhyloPyPruner for every MSA and tree pair within the directory in <path>. Mask monophylies by choosing the longest sequence, prune paralogs using the maximum inclusion (MI) algorithm, remove OTUs whose sequences have an average pairwise distance more than 10 times larger than the standard deviation of the average pairwise distance across all OTUs, generate statistics for the removal of OTUs using taxon jackknifing, and root at the outgroups in <names of outgroups>.

python -m phylopypruner --dir <path> --mask longest --prune MI --trim-divergent 10 --jackknife --outgroup <names of outgroups>

Note: Taxon jackknifing multiplies the execution time by the number of OTUs available within each input alignment.
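
The cost noted above follows from the leave-one-out structure of taxon jackknifing: the analysis is repeated once per OTU, each time with that OTU excluded. The sketch below illustrates the loop only. `taxon_jackknife`, the `run` callback and the toy metric are hypothetical stand-ins, not PhyloPyPruner's API.

```python
def taxon_jackknife(alignments, run):
    """Re-run an analysis once per OTU with that OTU excluded.

    alignments: list of lists of OTU labels, one label per sequence.
    run: callable that takes the reduced alignments and returns a metric.
    Returns {excluded OTU: metric}.
    """
    otus = sorted({otu for alignment in alignments for otu in alignment})
    results = {}
    for excluded in otus:
        reduced = [[otu for otu in alignment if otu != excluded]
                   for alignment in alignments]
        results[excluded] = run(reduced)  # one full run per OTU
    return results


def count_large_alignments(alignments, min_taxa=3):
    """Toy metric: number of alignments with at least min_taxa distinct OTUs."""
    return sum(1 for alignment in alignments if len(set(alignment)) >= min_taxa)


data = [["A", "B", "C"], ["A", "B"], ["B", "C", "D"]]
jackknife = taxon_jackknife(data, count_large_alignments)
```

With four OTUs in the toy data, the metric is computed four times, which is where the multiplied running time comes from.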

FASTA descriptions and Newick names must match and must be in one of the following formats: OTU|ID or OTU@ID, where OTU is the operational taxonomic unit (usually the species) and ID is a unique annotation or sequence identifier. For example: >Meiomenia_swedmarki|Contig00001_Hsp90. Sequence descriptions and tree names are not allowed to deviate from each other. Sequence data must consist of valid IUPAC nucleotide or amino acid sequences.
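
To check input names against this convention before a run, something like the following sketch may help. The helpers `split_name` and `compare_names` are hypothetical and not part of PhyloPyPruner; the sketch assumes that the first `|` or `@` in a name separates the OTU from the ID.

```python
import re

# OTU, then a single '|' or '@' separator, then a non-empty ID.
NAME_PATTERN = re.compile(r"^([^|@]+)[|@](.+)$")


def split_name(name):
    """Split 'OTU|ID' or 'OTU@ID' into (OTU, ID); a leading '>' is ignored."""
    match = NAME_PATTERN.match(name.strip().lstrip(">"))
    if match is None:
        raise ValueError("name is not in OTU|ID or OTU@ID format: %r" % name)
    return match.group(1), match.group(2)


def compare_names(fasta_descriptions, newick_names):
    """Return (names only in the FASTA file, names only in the tree)."""
    fasta = {name.strip().lstrip(">") for name in fasta_descriptions}
    tree = {name.strip() for name in newick_names}
    return sorted(fasta - tree), sorted(tree - fasta)


otu, seq_id = split_name(">Meiomenia_swedmarki|Contig00001_Hsp90")
only_fasta, only_tree = compare_names(
    [">Sp1|a", ">Sp2@b"], ["Sp1|a", "Sp3|c"]
)
```

Both returned lists should be empty for a valid alignment and tree pair; any entry points at a name that deviates between the two files.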

Output Example
Figure 1. Example of what the printed output looks like after running PhyloPyPruner with the --trim-freq-paralogs flag.

Output files

The following files are generated after running this program.

<output directory>/
├── <timestamp>_ppp_summary.csv
├── <timestamp>_ppp_ortho_stats.csv
├── <timestamp>_ppp_run.log
├── <timestamp>_ppp_paralog_freq.csv
├── <timestamp>_ppp_paralog_freq.png*
└── <timestamp>_orthologs/
    ├── 1_pruned.fas
    ├── 2_pruned.fas
    ├── 3_pruned.fas
    └── 4_pruned.fas
...

If <output directory> has not been specified by the --output flag, then output files will be stored within the same directory as the input alignment file(s). See the Output files section within the Wiki for a more detailed explanation of each individual output file.

* – only produced if Matplotlib is installed

Paralogy Frequency Plot
Figure 2. Example of the paralogy frequency (PF) plot.

© Kocot Lab 2018

