tree-based orthology inference
PhyloPyPruner is a Python package for tree-based orthology inference that is used to refine the output of a graph-based approach (e.g., OrthoMCL, OrthoFinder or HaMStR) by removing sequences related via gene duplication. In addition to filters and algorithms seen in pre-existing tools such as PhyloTreePruner, UPhO, Agalma and Phylogenomic Dataset Reconstruction, this package provides new methods for differentiating contamination-like sequences from paralogs.
PhyloPyPruner is currently under active development and I would appreciate it if you try this software on your own data and leave feedback.
See the Wiki for more details.
- Remove short sequences
- Remove sequences with a long branch length relative to others
- Collapse weakly supported nodes into polytomies
- Tree-based orthology inference:
- Mask monophylies by keeping the longest sequence or the sequence with the shortest pairwise distance
- Root the tree using midpoint or outgroup rooting
- Prune paralogs using one out of five methods
- Paralogy frequency (PF): Calculate the number of paralogs for an OTU divided
by the number of alignments that said OTU is present in. Visualizing PF
can help identify 'problem taxa' with contamination (or partial genome
duplications, etc). PhyloPyPruner can be configured to automatically
remove OTUs with high PF and critical OTUs can be 'protected' from
removal using the
- Trim divergent: Identify and exclude OTUs on a per-alignment basis, where the ratio between the maximum pairwise distance within the OTU and the average pairwise distance to the other sequences exceeds a user-defined divergence threshold.
- Taxon jackknifing: Exclude OTUs, one-by-one, during orthology inference. This enables the user to identify OTUs whose exclusion improves metrics of supermatrix quality such as number of alignments retained or percent missing data.
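The paralogy frequency described in the list above can be illustrated with a minimal sketch. This is not PhyloPyPruner's own code: the function name `paralogy_frequency` and the input layout (one list of OTU names per alignment) are assumptions made for illustration, and it assumes that every paralogous sequence is counted once for its OTU.

```python
from collections import Counter


def paralogy_frequency(paralogs_per_alignment, otus_per_alignment):
    """Return {OTU: number of paralogs / number of alignments containing the OTU}.

    Hypothetical helper, not part of the PhyloPyPruner API.
    paralogs_per_alignment: one list per alignment, naming the OTU of each paralog.
    otus_per_alignment: one list per alignment, naming the OTUs present in it.
    """
    paralog_counts = Counter()
    presence_counts = Counter()
    for paralogs in paralogs_per_alignment:
        paralog_counts.update(paralogs)
    for otus in otus_per_alignment:
        # An OTU counts once per alignment, however many sequences it has there.
        presence_counts.update(set(otus))
    return {
        otu: paralog_counts[otu] / presence_counts[otu]
        for otu in presence_counts
    }


# Made-up example: OTU "A" has three paralogs across three alignments.
otus = [["A", "B", "C"], ["A", "B"], ["A", "C"]]
paralogs = [["A", "A"], ["A"], []]
pf = paralogy_frequency(paralogs, otus)
```

An OTU with a PF far above the others in such a table would be a candidate 'problem taxon'.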
By providing earlier tree-based approaches as a single executable, PhyloPyPruner also has a unique combination of features such as:
- Allowing polytomies in input trees for all paralogy pruning algorithms
- Collapsing weakly supported nodes into polytomies in combination with all paralogy pruning algorithms
- Rooting trees after monophyly masking in combination with the 'largest subtree (LS)' paralogy pruning algorithm
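To make the idea of collapsing weakly supported nodes into polytomies concrete, here is a minimal sketch on a toy tree structure. The `Node` class and `collapse_weak_nodes` function are hypothetical and do not reflect PhyloPyPruner's internal data model; branch lengths are ignored for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy tree node (hypothetical, for illustration only)."""
    name: str = ""
    support: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def collapse_weak_nodes(node, min_support):
    """Replace each internal child whose support is below min_support
    with its own children, which turns the parent into a polytomy."""
    new_children = []
    for child in node.children:
        collapse_weak_nodes(child, min_support)  # handle deeper nodes first
        is_internal = bool(child.children)
        if is_internal and child.support is not None and child.support < min_support:
            new_children.extend(child.children)  # promote the grandchildren
        else:
            new_children.append(child)
    node.children = new_children
    return node


# ((A,B)60,(C,D)95); with a threshold of 80 the (A,B) node is collapsed.
root = Node(children=[
    Node(support=60, children=[Node("A"), Node("B")]),
    Node(support=95, children=[Node("C"), Node("D")]),
])
collapse_weak_nodes(root, 80)
```

After the call, the root has three children: the leaves A and B, and the well-supported (C,D) clade.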
This software runs under both Python 3 and 2.7. There are no external dependencies, but the plotting library Matplotlib can be installed for generating paralog frequency plots.
You can install PhyloPyPruner using pip.
pip install --user phylopypruner
To get a list of options, run the software without any arguments or with the
-h option. As input, PhyloPyPruner requires either a multiple sequence
alignment (MSA) in FASTA format together with its corresponding Newick tree, or
the path to a directory containing multiple trees and alignments.
Example 1. Providing a single corresponding tree and alignment. In this case monophyletic masking will be performed by choosing the sequence with the shorter pairwise distance to its sister group and paralogy pruning will be done using the largest subtree (LS) algorithm.
python -m phylopypruner --msa <filename>.fas --tree <filename>.tre
Example 2. Run PhyloPyPruner for every MSA and tree pair within the
<path>. Don't include orthologs with fewer than 10 OTUs, remove
sequences shorter than 100 positions, collapse nodes with a support value lower
than 80% into polytomies, remove branches that are 5 times longer than the
standard deviation of all branch lengths and remove OTUs with a paralogy
frequency that is larger than 5 times the standard deviation of the paralogy
frequency for all OTUs.
python -m phylopypruner --dir <path> --min-taxa 10 --min-len 100 --min-support 80 --trim-lb 5 --trim-freq-paralogs 5
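The standard-deviation thresholds used in Example 2 can be read as the following sketch. It is a literal interpretation of the description above, not the tool's implementation: the helper name `exceeds_sd_multiple` is hypothetical, and the choice of the population standard deviation is an assumption.

```python
import statistics


def exceeds_sd_multiple(values, factor):
    """Return the keys whose value is larger than factor times the
    standard deviation of all values (hypothetical helper)."""
    # Assumption: population standard deviation; the tool may use another estimator.
    threshold = factor * statistics.pstdev(values.values())
    return [key for key, value in values.items() if value > threshold]


# Made-up branch lengths; "e" is much longer than the rest.
branch_lengths = {"a": 0.10, "b": 0.12, "c": 0.09, "d": 0.11, "e": 2.0}
flagged = exceeds_sd_multiple(branch_lengths, 2)
```

The same comparison applies to the paralogy frequency threshold if the dictionary maps OTUs to their PF values instead of branches to their lengths.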
Example 3. Run PhyloPyPruner for every MSA and tree pair within the
<path>. Mask monophylies by choosing the longest sequence, prune
paralogs using the maximum inclusion (MI) algorithm, remove OTUs with sequences
with an average pairwise distance that is 10 times larger than the standard
deviation of the average pairwise distance of the sequences for all OTUs,
generate statistics for the removal of OTUs using taxon jackknifing and root at
the outgroups in
<names of outgroups>.
python -m phylopypruner --dir <path> --mask longest --prune MI --trim-divergent 10 --jackknife --outgroup <names of outgroups>
Note: Taxon jackknifing multiplies the execution time by the number of OTUs available within each input alignment.
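The cost noted above follows from the leave-one-out structure of taxon jackknifing, which a short sketch makes visible. Everything here is illustrative: `retained_alignments` is a toy stand-in for a full orthology inference run, and alignments are reduced to sets of OTU names.

```python
def retained_alignments(alignments, min_taxa):
    """Toy quality metric: number of alignments with at least min_taxa OTUs.
    Stands in for a complete orthology inference run."""
    return sum(1 for otus in alignments if len(otus) >= min_taxa)


def taxon_jackknife(alignments, min_taxa):
    """Recompute the metric once per OTU, each time with that OTU excluded."""
    all_otus = sorted(set().union(*alignments))
    results = {}
    for excluded in all_otus:  # one full rerun per OTU
        reduced = [otus - {excluded} for otus in alignments]
        results[excluded] = retained_alignments(reduced, min_taxa)
    return results


# Made-up input: three alignments described by the OTUs they contain.
alignments = [{"A", "B", "C"}, {"A", "B"}, {"B", "C", "D"}]
stats = taxon_jackknife(alignments, min_taxa=3)
```

Because the loop body repeats the whole analysis, the total running time grows linearly with the number of OTUs.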
FASTA descriptions and Newick names must match and have to be in one of the
OTU is the operational
taxonomical unit (usually the species) and
ID is a unique annotation or
sequence identifier. For example:
Sequence descriptions and tree names are not allowed to deviate from each
other. Sequence data needs to be valid IUPAC nucleotide or amino acid
Figure 1. Example of what the printed output looks like after running PhyloPyPruner with the
The following files are generated after running this program.
<output directory>/
└── phylopypruner_output/
    ├── supermatrix_stats.csv
    ├── input_alignments_stats.csv
    ├── output_alignments_stats.csv
    ├── otu_stats.csv
    ├── phylopypruner.log
    ├── paralogy_freq_plot.png*
    └── output_alignments/
        ├── 1_pruned.fas
        ├── 2_pruned.fas
        ├── 3_pruned.fas
        ...
If <output directory> has not been specified by the --output flag, then
output files will be stored within the same directory as the input alignment
file(s). See the Output files section of the Wiki for a more detailed
description of each individual output file.
* – only produced if Matplotlib is installed
Figure 2. Example of the paralogy frequency (PF) plot.
© Kocot Lab 2018