Distance-based phylogeny inference using a randomised divide-and-conquer method

dnctree: Randomized divide and conquer algorithm for phylogenetic trees

This is a distance-based method for inferring phylogenies, inspired by Neighbor-Joining. Its main feature is that it scales to very large input datasets because it avoids computing a full pairwise distance matrix.

The input is a multiple sequence alignment and the output is a tree in standard Newick format. The option --json-output wraps the Newick-formatted tree in a JSON document, together with some additional data describing the computation.
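Because the output is plain Newick, it can be post-processed with a few lines of Python. The following is a minimal sketch and not part of dnctree: the helper name `leaf_names` and the toy tree are made up for illustration, and the regular expression assumes simple, unquoted labels with no surrounding whitespace.

```python
import re


def leaf_names(newick: str) -> list[str]:
    """Return the leaf labels of a Newick string, in order of appearance.

    Assumes unquoted labels without whitespace. A leaf label always directly
    follows '(' or ',', whereas internal-node labels follow ')' and branch
    lengths follow ':'.
    """
    return re.findall(r"(?<=[(,])[^(),:;\s]+", newick)


toy_tree = "((A,B),(C,D));"  # hypothetical example, not dnctree output
print(leaf_names(toy_tree))
```

For anything beyond simple label extraction, a proper Newick parser is the safer choice.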

The implementation is (currently) 100% pure Python, but it can still handle very large datasets in reasonable time. See the preprint for some examples!

Algorithms

There are currently two algorithm versions implemented in dnctree.

  • The default algorithm, 'core tree', has been as accurate as Neighbor-Joining in our experiments so far, while scaling much better. We have submitted a paper on this algorithm.
  • The 'simple' algorithm (see option --simple) is much faster, but much less accurate than the core tree algorithm. The simple algorithm is described in a bioRxiv preprint.

Both algorithms use divide and conquer. Problem instances with fewer sequences than the "base-case size" (100 by default, see option --base-case-size) are solved directly with Neighbor-Joining, while larger instances are partitioned and solved recursively.
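The control flow described above can be sketched as a generic recursion. This is only an illustration of the base-case threshold, not dnctree's actual code: the function `divide_and_conquer` and the three toy callbacks (`star`, `halves`, `join`) are hypothetical stand-ins. In particular, `star` merely groups labels instead of running Neighbor-Joining, and `halves` splits the list in the middle rather than partitioning by sequence distances.

```python
from typing import Callable, List, Sequence


def divide_and_conquer(
    taxa: Sequence[str],
    base_case_size: int,
    solve_base: Callable[[Sequence[str]], str],
    partition: Callable[[Sequence[str]], List[Sequence[str]]],
    combine: Callable[[List[str]], str],
) -> str:
    """Generic recursion skeleton; all three callbacks are placeholders."""
    if base_case_size < 2:
        raise ValueError("base_case_size must be at least 2")
    # Small instances are solved directly (Neighbor-Joining in dnctree).
    if len(taxa) < base_case_size:
        return solve_base(taxa)
    # Larger instances are split and each part is solved recursively.
    subtrees = [
        divide_and_conquer(part, base_case_size, solve_base, partition, combine)
        for part in partition(taxa)
    ]
    return combine(subtrees)


def star(taxa: Sequence[str]) -> str:
    # Toy stand-in for the base-case solver: an unresolved group of labels.
    return taxa[0] if len(taxa) == 1 else "(" + ",".join(taxa) + ")"


def halves(taxa: Sequence[str]) -> List[Sequence[str]]:
    # Toy stand-in for the partitioning step: split the list in the middle.
    mid = len(taxa) // 2
    return [taxa[:mid], taxa[mid:]]


def join(subtrees: List[str]) -> str:
    # Toy stand-in for reconnecting the recursively built subtrees.
    return "(" + ",".join(subtrees) + ")"


names = [f"t{i}" for i in range(6)]
print(divide_and_conquer(names, 3, star, halves, join) + ";")
```

A smaller base-case size pushes more of the work into the recursive partitioning, while a larger one leaves more to the direct solver.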

Input formats

Fasta, Phylip, Clustal, Nexus, and Stockholm (Pfam) are the currently accepted input formats. This set is determined by what the Biopython package can read.

Example usage

dnctree testdata/s83_L500.phylip
dnctree -f phylip testdata/s83_L500.phylip   # Making it very clear input is a Phylip file
dnctree --simple  testdata/s83_L500.phylip   # Using the faster "simple" algorithm
dnctree --base-case-size 10 testdata/s83_L500.phylip  # Smaller base case: divide and conquer on inputs with 10 or more sequences

Examples with output:

$ dnctree testdata/s83_L500.phylip
((((((L26,L27),(L24,L25)),((L28,L29),(L30,L31))),(((L16,L17),(L18,L19)),((L22,L23),(L20,L21)))),((((L8,L9),(L10,L11)),((L12,L13),(L14,L15))),(((L2,L3),(L0,L1)),((L6,L7),(L4,L5))))),(((((L86,L87),(L84,L85)),((L80,L81),(L82,L83))),(((L94,L95),(L92,L93)),((L90,L91),(L88,L89)))),((((L74,L75),(L72,L73)),((L78,L79),(L76,L77))),(((L64,L65),(L66,L67)),((L68,L69),(L70,L71))))),(((((L44,L45),(L46,L47)),((L42,L43),(L40,L41))),(((L34,L35),(L32,L33)),((L38,L39),(L36,L37)))),((((L50,L51),(L48,L49)),((L54,L55),(L52,L53))),(((L58,L59),(L56,L57)),((L62,L63),(L60,L61))))));

$ dnctree --json-output --base-case-size 10 testdata/s83_L500.phylip
{
    "version": "dnctree 1.0",
    "tree": "((((((L45,L44),(L47,L46)),((L43,L42),(L40,L41))),((((L54,L55),(L52,L53)),((L50,L51),(L48,L49))),(((L58,L59),(L57,L56)),((L63,L62),(L61,L60))))),(((L35,L34),(L33,L32)),((L39,L38),(L36,L37)))),(((((L91,L90),(L89,L88)),((L94,L95),(L92,L93))),(((L81,L80),(L82,L83)),((L87,L86),(L84,L85)))),((((L78,L79),(L77,L76)),(((L69,L68),(L70,L71)),((L65,L64),(L66,L67)))),((L74,L75),(L72,L73)))),(((((L7,L6),(L4,L5)),((L3,L2),(L0,L1))),((((L21,L20),(L23,L22)),((L16,L17),(L19,L18))),(((L30,L31),(L29,L28)),((L26,L27),(L24,L25))))),(((L9,L8),(L10,L11)),((L12,L13),(L14,L15)))));",
    "infile": "testdata/s83_L500.phylip",
    "aligned": true,
    "base-case-size": 10,
    "distances-computed": 1711,
    "fraction-computed-distances": 0.375,
    "n-taxa": 96,
    "comment": "Computed 1711 distances for 96 taxa. A full distance matrix would contain 4560 pairs. Savings: 62.5 %",
    "model-name": "WAG",
    "description": "AA alignment",
    "msa-width": 500,
    "computing-time": 0.907991542
}
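The savings figure in the "comment" field can be checked from the other fields: a full distance matrix for n taxa contains n(n-1)/2 pairs. The snippet below is a small sketch that redoes this arithmetic with Python's standard json module. It embeds a trimmed copy of the report above so that it runs on its own; the variable names are chosen for illustration.

```python
import json

# Trimmed copy of the JSON report shown above.
report_text = """
{
    "n-taxa": 96,
    "distances-computed": 1711,
    "fraction-computed-distances": 0.375
}
"""

report = json.loads(report_text)
n = report["n-taxa"]

full_pairs = n * (n - 1) // 2                        # pairs in a full distance matrix
fraction = report["distances-computed"] / full_pairs
savings_percent = 100 * (1 - fraction)

print(full_pairs)                  # 4560
print(round(fraction, 3))          # 0.375
print(round(savings_percent, 1))   # 62.5
```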

Credits

  • Amy Lee Jalsenius developed and implemented the "core tree" algorithm which is now the default.
  • Mazen Mardini added PaHMM code (see https://github.com/marbogusz/paHMM-Tree), enabling experiments.
