tree-based orthology inference
PhyloPyPruner
PhyloPyPruner is a tree-based orthology inference program for refining the output of a graph-based orthology inference approach. In addition to implementing previously published paralogy pruning algorithms from PhyloTreePruner, UPhO, Agalma and Phylogenomic Dataset Reconstruction, it provides methods for identifying and removing operational taxonomic units (OTUs) that display contamination-like issues.
PhyloPyPruner is currently under active development and I would appreciate it if you try this software on your own data and leave feedback.
See the Wiki for more details.
Features
- Remove short sequences
- Remove sequences with a long branch length relative to others
- Collapse weakly supported nodes into polytomies
- Prune paralogs using one of five methods
- Mask monophylies by keeping the longest sequence or the sequence with the shortest pairwise distance
- Root the tree using midpoint or outgroup rooting
- Calculate and visualize paralogy frequency (PF), the number of paralogs for an OTU divided by the number of alignments that said OTU is present in
- Exclude OTUs with a high PF relative to all OTUs
- Remove OTUs with an average pairwise distance that is high relative to all OTUs
- Ignore OTUs one-by-one during tree-based orthology inference and identify OTUs whose exclusion improves metrics of supermatrix quality such as the number of output alignments or missing data
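The paralogy frequency (PF) defined in the list above can be sketched in a few lines of Python. This is a minimal illustration and not PhyloPyPruner's own code: the function name `paralogy_frequency`, the two input dictionaries and the example counts are all assumptions made for the sketch.

```python
def paralogy_frequency(paralog_counts, alignment_counts):
    """Return PF per OTU: paralogs divided by alignments the OTU occurs in.

    Both arguments are hypothetical dictionaries keyed by OTU name.
    """
    frequencies = {}
    for otu, n_alignments in alignment_counts.items():
        if n_alignments == 0:
            continue  # PF is undefined for an OTU absent from every alignment
        frequencies[otu] = paralog_counts.get(otu, 0) / n_alignments
    return frequencies


# Made-up counts for three OTUs; OTU_C has no paralogs at all.
paralogs = {"Meiomenia_swedmarki": 6, "OTU_B": 1}
alignments = {"Meiomenia_swedmarki": 12, "OTU_B": 10, "OTU_C": 8}
pf = paralogy_frequency(paralogs, alignments)
```

An OTU without any recorded paralogs gets a PF of zero rather than being dropped, so it still contributes to any statistic computed over all OTUs.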
Installation
This software runs under both Python 3 and 2.7. There are no external dependencies, but the plotting library Matplotlib can be installed for generating paralog frequency plots.
You can install PhyloPyPruner using pip.
pip install --user phylopypruner
Usage
To get a list of options, run the software without any arguments or use the -h or --help flag. PhyloPyPruner requires either a corresponding multiple sequence alignment (MSA) in FASTA format and a Newick tree, or the path to a directory containing multiple trees and alignments.
Example 1. Providing a single corresponding tree and alignment. In this case monophyletic masking will be performed by choosing the sequence with the shorter pairwise distance to its sister group and paralogy pruning will be done using the largest subtree (LS) algorithm.
python -m phylopypruner --msa <filename>.fas --tree <filename>.tre
Example 2. Run PhyloPyPruner for every MSA and tree pair within the directory in <path>. Don't include orthologs with fewer than 10 OTUs, remove sequences shorter than 100 positions, collapse nodes with a support value lower than 80% into polytomies, remove branches that are 5 times longer than the standard deviation of all branch lengths and remove OTUs with a paralogy frequency that is larger than 5 times the standard deviation of the paralogy frequency for all OTUs.
python -m phylopypruner --dir <path> --min-taxa 10 --min-len 100 --min-support 80 --trim-lb 5 --trim-freq-paralogs 5
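Both standard-deviation thresholds in Example 2 follow the same pattern: a value is flagged when it is larger than a factor times the standard deviation of all values. The sketch below illustrates that rule as it is worded here, not PhyloPyPruner's implementation. The function name `exceeds_threshold`, the toy branch lengths and the use of the population standard deviation (`statistics.pstdev`) are assumptions; the README does not say which estimator the program uses.

```python
import statistics


def exceeds_threshold(values, factor):
    """Return the keys whose value is larger than factor * stdev of all values."""
    # Assumption: population standard deviation; the tool may use another estimator.
    cutoff = factor * statistics.pstdev(list(values.values()))
    return [key for key, value in values.items() if value > cutoff]


# Toy data: 29 ordinary branches and one very long branch.
branch_lengths = {"seq%d" % i: 0.1 for i in range(29)}
branch_lengths["long"] = 3.0

flagged = exceeds_threshold(branch_lengths, 5)
```

With only a handful of values a single outlier inflates the standard deviation so much that nothing passes a factor of 5, which is why the toy data set contains 30 branches.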
Example 3. Run PhyloPyPruner for every MSA and tree pair within the directory in <path>. Mask monophylies by choosing the longest sequence, prune paralogs using the maximum inclusion (MI) algorithm, remove OTUs whose sequences have an average pairwise distance that is 10 times larger than the standard deviation of the average pairwise distance of the sequences for all OTUs and generate statistics for the removal of OTUs using taxon jackknifing.
python -m phylopypruner --dir <path> --mask longest --prune MI --trim-divergent 10 --jackknife
Note: Taxon jackknifing multiplies the execution time by the number of OTUs available within each input alignment.
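The cost described in the note comes from the leave-one-out structure of taxon jackknifing: the whole analysis is repeated once per OTU, each time with that OTU left out. The sketch below shows only this loop. `taxon_jackknife` and the stand-in callback `count_otus` are hypothetical names, and the callback merely counts the OTUs it receives instead of performing orthology inference.

```python
def taxon_jackknife(otus, run_inference):
    """Re-run `run_inference` once per OTU, each time leaving that OTU out."""
    results = {}
    for excluded in otus:
        kept = [otu for otu in otus if otu != excluded]
        results[excluded] = run_inference(kept)
    return results


# Stand-in for a real run: records each call and returns the number of OTUs kept.
calls = []


def count_otus(kept):
    calls.append(kept)
    return len(kept)


stats = taxon_jackknife(["A", "B", "C"], count_otus)
```

Three OTUs lead to three complete runs, so the run time grows linearly with the number of OTUs.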
FASTA descriptions and Newick names must match and must be in one of the following formats: OTU|ID or OTU@ID, where OTU is the operational taxonomic unit (usually the species) and ID is a unique annotation or sequence identifier. For example: >Meiomenia_swedmarki|Contig00001_Hsp90. Sequence descriptions and tree names are not allowed to deviate from each other. Sequence data needs to be valid IUPAC nucleotide or amino acid sequences.
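The naming rules above can be checked before running the program. The following sketch splits a description into its OTU and ID parts and reports names that occur in only one of the two inputs. The function names `split_description` and `mismatched_names` are hypothetical and not part of PhyloPyPruner, and trying `|` before `@` is an assumption made for the sketch.

```python
def split_description(description):
    """Split 'OTU|ID' or 'OTU@ID' (with or without a leading '>') into two parts."""
    name = description.lstrip(">").strip()
    for delimiter in ("|", "@"):  # assumption: '|' is tried before '@'
        if delimiter in name:
            otu, identifier = name.split(delimiter, 1)
            if otu and identifier:
                return otu, identifier
    raise ValueError("expected OTU|ID or OTU@ID, got: %r" % description)


def mismatched_names(fasta_names, tree_names):
    """Return the names that are present in only one of the two collections."""
    return sorted(set(fasta_names) ^ set(tree_names))


parts = split_description(">Meiomenia_swedmarki|Contig00001_Hsp90")
```

An empty list from `mismatched_names` means that every sequence description has a matching tip in the tree and the other way around.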
Output files
The following files are generated after running this program.
<output directory>/
├── <timestamp>_ppp_summary.csv
├── <timestamp>_ppp_ortho_stats.csv
├── <timestamp>_ppp_run.log
├── <timestamp>_ppp_paralog_freq.csv
├── <timestamp>_ppp_paralog_freq.png*
└── <timestamp>_orthologs/
    ├── 1_pruned.fas
    ├── 2_pruned.fas
    ├── 3_pruned.fas
    └── 4_pruned.fas
    ...
If <output directory> has not been specified by the --output flag, then output files will be stored within the same directory as the input alignment file(s). See the Output files section within the Wiki for a more detailed explanation of each individual output file.
* – only produced if Matplotlib is installed
© Kocot Lab 2018