tree-based orthology inference

PhyloPyPruner

PhyloPyPruner is a tree-based orthology inference program for refining the output of a graph-based orthology inference approach. In addition to implementing previously published paralogy pruning algorithms seen in PhyloTreePruner, UPhO, Agalma and Phylogenomic Dataset Reconstruction, this software also provides methods for identifying and removing operational taxonomic units (OTUs) that display contamination-like issues.

PhyloPyPruner is currently under active development and I would appreciate it if you try this software on your own data and leave feedback.

See the Wiki for more details.

Features

  • Remove short sequences
  • Remove relatively long branches
  • Collapse weakly supported nodes into polytomies
  • Prune paralogs using one out of five methods
  • Measure paralogy frequency
  • Remove OTUs with relatively high paralogy frequency
  • Mask monophylies by keeping the longest sequence or the sequence with the shortest pairwise distance
  • Exclude individual OTUs entirely
  • Root trees using outgroup or midpoint rooting
  • Get rid of OTUs with sequences that display relatively high pairwise distance
  • Measure the impact of individual OTUs using taxon jackknifing

Installation

This software runs under both Python 3 and 2.7. There are no external dependencies, but the plotting library Matplotlib can be installed for generating paralog frequency plots.

You can install PhyloPyPruner using pip.

pip install --user phylopypruner

Usage

To get a list of options, run the software without any arguments or use the -h or --help flag. PhyloPyPruner requires either a multiple sequence alignment (MSA) in FASTA format together with its corresponding Newick tree, or the path to a directory containing multiple trees and alignments.

Example 1. Providing a single corresponding tree and alignment. In this case monophyletic masking will be performed by choosing the sequence with the shorter pairwise distance to its sister group and paralogy pruning will be done using the largest subtree (LS) algorithm.

python -m phylopypruner --msa <filename>.fas --tree <filename>.tre

Example 2. Run PhyloPyPruner for every MSA and tree pair within the directory in <path>. Don't include orthologs with fewer than 10 OTUs, remove sequences shorter than 100 positions, collapse nodes with a support value lower than 80% into polytomies, remove branches that are longer than 5 times the standard deviation of all branch lengths and remove OTUs with a paralogy frequency that is larger than 5 times the standard deviation of the paralogy frequency for all OTUs.

python -m phylopypruner --dir <path> --min-taxa 10 --min-len 100 --min-support 80 --trim-lb 5 --trim-freq-paralogs 5

Example 3. Run PhyloPyPruner for every MSA and tree pair within the directory in <path>. Mask monophylies by choosing the longest sequence, prune paralogs using the maximum inclusion (MI) algorithm, remove OTUs whose sequences have an average pairwise distance that is larger than 10 times the standard deviation of the average pairwise distance of the sequences for all OTUs and generate statistics for the removal of OTUs using taxon jackknifing.

python -m phylopypruner --dir <path> --mask longest --prune MI --trim-divergent 10 --jackknife

Note: Taxon jackknifing multiplies the execution time by the number of OTUs available within each input alignment.
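To see why the cost scales with the number of OTUs, here is a minimal leave-one-out sketch. The function jackknife_subsets is hypothetical and only enumerates the subsets; it assumes that each subset corresponds to one complete rerun of the analysis and does not reproduce how PhyloPyPruner implements jackknifing internally.

```python
def jackknife_subsets(otus):
    """Yield (excluded OTU, remaining OTUs), leaving out one OTU at a time."""
    unique = sorted(set(otus))
    for excluded in unique:
        yield excluded, [otu for otu in unique if otu != excluded]


# Three OTUs give three subsets, so three additional runs of the analysis.
subsets = list(jackknife_subsets(["A", "B", "C"]))
for excluded, remaining in subsets:
    print(excluded, remaining)
```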

FASTA descriptions and Newick names must match and must be in one of the following formats: OTU|ID or OTU@ID, where OTU is the operational taxonomic unit (usually the species) and ID is a unique annotation or sequence identifier. For example: >Meiomenia_swedmarki|Contig00001_Hsp90. Sequence data must be valid IUPAC nucleotide or amino acid sequences.
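The naming rules can be checked before running the program with a small sketch like the one below. None of these helpers are part of PhyloPyPruner: split_name, fasta_names and newick_names are hypothetical, and the Newick parsing assumes unquoted leaf labels in a tree with at least two leaves.

```python
import re

# OTU and ID separated by "|" or "@"; neither part may contain whitespace.
NAME = re.compile(r"^(?P<otu>[^|@\s]+)[|@](?P<id>\S+)$")


def split_name(name):
    """Split 'OTU|ID' or 'OTU@ID' into (OTU, ID); raise ValueError otherwise."""
    match = NAME.match(name)
    if match is None:
        raise ValueError("not in OTU|ID or OTU@ID format: {!r}".format(name))
    return match.group("otu"), match.group("id")


def fasta_names(fasta_text):
    """Collect the description lines (without '>') from FASTA text."""
    return {line[1:].strip() for line in fasta_text.splitlines()
            if line.startswith(">")}


def newick_names(newick_text):
    """Collect leaf labels: tokens that directly follow '(' or ','."""
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick_text))


fasta = ">Meiomenia_swedmarki|Contig00001_Hsp90\nACGT\n>Other_sp@seq2\nACGT\n"
newick = "(Meiomenia_swedmarki|Contig00001_Hsp90:0.1,Other_sp@seq2:0.2);"

# Names that occur in only one of the two files.
mismatched = fasta_names(fasta) ^ newick_names(newick)
print(sorted(mismatched))
print([split_name(name) for name in sorted(fasta_names(fasta))])
```

An empty mismatched set means every sequence has a tip in the tree and vice versa.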

Output files

The following files are generated after running this program.

<output directory>/
├── <timestamp>_ppp_summary.csv
├── <timestamp>_ppp_ortho_stats.csv
├── <timestamp>_ppp_run.log
├── <timestamp>_ppp_paralog_freq.csv
├── <timestamp>_ppp_paralog_freq.png*
└── <timestamp>_orthologs/
    ├── 1_pruned.fas
    ├── 2_pruned.fas
    ├── 3_pruned.fas
    ├── 4_pruned.fas
    └── ...

If <output directory> has not been specified by the --output flag, then output files will be stored within the same directory as the input alignment file(s). See the Output files section within the Wiki for a more detailed explanation of each individual output file.

* – only produced if Matplotlib is installed

© Kocot Lab 2018
