Simulate random phylogenetic trees
Project description
Ngesh, simulation of random phylogenetic trees with characters
ngesh
is a Python library for simulating random phylogenetic trees and
related data (characters, states, branch length, etc.).
It is intended for benchmarking phylogenetic methods, especially in
historical linguistics, and for providing dummy trees
for their development and debugging. The generation of
random phylogenetic trees also goes by the name of "simulation methods
for phylogenetic trees" or just "phylogenetic tree simulation".
Please remember that ngesh
is a work-in-progress and a library intended to
be a simple drop-in for cases where random trees are needed; for complex
methods, see the alternatives listed below or consult the
bibliographic references.
In detail, with ngesh
:
- trees can be generated according to user-specified birth and death ratios (and the death ratio can be set to zero, resulting in a birth-only tree)
- speciation events default to two descendants, but the number of descendants can be randomly drawn from a user-defined Poisson process (allowing to model hard politomies)
- trees will have random topologies and, if necessary, random branch-lengths
- trees can be limited in terms of number of extant leaves, evolution time (as related to the birth and death parameters), or both
- non-extant leaves can be pruned from birth-death trees
- character evolution can be simulated in relation to branch lengths, with user-specified ratios for mutation and horizontal gene transfer, with different rates of change for each character
- trees can be generated from user-provided seeds, so that the random generation can be maintained across executions (and, in most cases, the execution should be reproducible also on different machines and different vestions of Python)
- nodes can optionally receive unique labels, either sequential ones (like "L01", "L02", and "L03"), random human-readable names (like "Sume", "Fekobir", and "Tukok"), or random biological names approximating the binomial nomenclature standard (like "Sburas wioris", "Zurbata ceglaces", and "Spellis spusso")
- trees can be returned as ETE tree objects or exported in a variety of formats, such as Newick trees, ASCII representation, tabular textual listings, etc.
Changelog
Version 0.3:
- General improvements to code quality
- Full reproducibility from seeds for the pseudo-random generators, allowing string, ints, and floats
- Changes for further integration with
abzu
andalteruphono
for simulating linguistic data
How does ngesh work?
For each tree, an event_rate
is computed from the sum of the birth
and
death
rates. At each iteration, which takes place after an
random expovariant time from the event_rate
, one of the extant nodes is
selected for an "event": either a birth or a death from the
proportion of each rate. All other extant leaves have their distances
updated with the event time.
The random labels follow the expected methods for random text generation from a set of patterns, taking care to generate names as universally readable (if not pronounceable) as possible.
missing on character generation
Installation
In any standard Python environment, ngesh
can be installed with:
pip install ngesh
The pip
installation will also fetch the dependencies ete3
and
numpy
, if necessary.
How to use
You can test your installation from the command line with the ngesh
command, which
will return a different random small birth-death tree in Newick format each time it
is called:
$ ngesh
(Saorus getes:1.31562,((Voces earas:1.07567,(Dallao spettus:0.703609,Sburas wioris:0.703609)1:0.372063)1:0.464667,(Zurbaza ceglaces:0.527431,(Amduo vizoris:0.345862,Uras wiurus:0.345862)1:0.18551)1:1.00897)1:2.1707);
$ ngesh
((Ollio zavis:0.698453,(Spectuo sicui:0.596731,((Ronis mivulis:0.0431014,Vaporus conomattas:0.0431014)1:0.413634,Rizarus urrus:0.456735)1:0.139996)1:0.101722)1:3.17827,(Deses mepus:2.22061,(Ovegpuves wiumoras:1.88469,(Easas ecdebus:0.201891,Muggas lupas:0.201891)1:1.6828)1:0.335918)1:1.65611);
The same command line tool can refer to values provided in a textual
configuration file. Here, we generate the Nexus data for a
reproducible Yule tree (note the 12345
seed)
with a birth ratio of 0.75, at least 8 leaves with "human"
labels,
and 10 presence/absence characters:
$ cat my_tree.conf
[Config]
labels=human
birth=0.666
death=0.0
output=nexus
min_leaves=8
num_chars=10
$ ngesh -c mine.conf --seed 12345
#NEXUS
begin data;
dimensions ntax=15 nchar=25;
format datatype=standard missing=? gap=-;
matrix
Foro 1011000101010010100100100
Meno 1100100101010010010010001
Vuea 1100110001010100001001001
Vegevo 1100100101010010010010100
Bufuri 1100110001010100001001001
Novake 1100110001010100001001001
Fonulip 1100110001010100001001001
Omih 1101001001010011000100100
Onegro 1101000011010010001100100
Rolsoa 1100100100110010010100100
Wigu 1101001001010010001100100
Teozu 1101001001010011000100010
Kabu 1100100101001010010100100
Timebbed 1100100101010010010010100
Okuna 1100110001010100001001001
;
end;
Parameters set in a configuration file can be overridden at the command line. The ASCII representation of the topology of the same tree can be obtained with:
$ ngesh -c mine.conf --seed 12345 -o ascii
/-Omih
|
/-| /-Onegro
| | /-|
| \-| \-Wigu
| |
| \-Teozu
|
/-| /-Kabu
| | |
| | | /-Novake
| | | |
| | | /-| /-Okuna
| | | | | /-|
| \-| | \-| \-Fonulip
| | /-| |
| | | | \-Bufuri
| | | |
--| | /-| \-Vuea
| | | |
| | | | /-Meno
| \-| \-|
| | \-Rolsoa
| |
| | /-Vegevo
| \-|
| \-Timebbed
|
\-Foro
The package is, however, designed to be used as a library. If you have PyQt5 installed (which is not listed as a dependency), the following code will pop up the ETE Tree Viewer on a random tree:
python3 -c "import ngesh ; ngesh.display_random_tree()"
{width=100%}
The main functions for generation are gen_tree()
, which returns a random
tree topology, and add_characters()
, which simulates character evolution
in a provided tree. As they are separate tasks, it is possible to just
generate a random tree or to simulate character evolution in an user
provided tree.
The full documentation is available in the functions docstring (which
can be visualized with print(ngesh.gen_tree.__doc__)
and
print(ngesh.add_characters.__doc__)
) or
directly in the source code.
The code snipped below shows a basic tree generation, character evolution,
and output flow; the parameters for generation are the same listed in
the docstrings and in the following below.
$ ipython3
In [1]: import ngesh
In [2]: tree = ngesh.gen_tree(1.0, 0.5, max_time=3.0, labels="human")
In [3]: print(tree)
/-Butobfa
/-|
| | /-Defomze
| \-|
| \-Gegme
--|
| /-Bo
| /-|
| | \-Peoni
\-|
| /-Riuzo
\-|
\-Hoale
In [4]: tree = ngesh.add_characters(tree, 10, 3.0, 1.0)
In [5]: print(ngesh.tree2nexus(tree))
#NEXUS
begin data;
dimensions ntax=7 nchar=15;
format datatype=standard missing=? gap=-;
matrix
Hoale 100111101101110
Butobfa 101011101110101
Defomze 101011110110101
Riuzo 100111101101110
Peoni 110011101110110
Bo 110011101110110
Gegme 101011101110101
;
end;
Integrating with other software
Integration is easy due to the various export functions. For example, it is possible to generate random trees with characters for which we know all details on evolution and parameters, and generate Nexus files that can be fed to phylogenetic software such as MrBayes or BEAST2 to either check how they perform or how good is our generation in terms of real data.
Let's simulate phylogenetic data for an analysis using BEAST2 through
BEASTling. We start with
a birth-death tree (lambda=0.9, mu=0.3), with at least 15 leaves, and 100
characters whose evolution is modelled with the default parameters
and a string seed "jena"
for reproducibility; the tree data is exported
in "wordlist"
format:
$ cat examples/example_ngesh.conf
[Config]
labels=human
birth=0.9
death=0.3
output=nexus
min_leaves=15
num_chars=100
$ ngesh -c examples/example.conf --seed jena > examples/example.csv
$ head examples/example.csv
Language_ID,Feature_ID,Value
Dotare,feature_0,0
Dotare,feature_1,1
Dotare,feature_2,2
Dotare,feature_3,3
Dotare,feature_4,4
Dotare,feature_5,5
Dotare,feature_6,6
Dotare,feature_7,7
Dotare,feature_8,8
We can now use a minimal BEASTling configuration and generate an XML input for BEAST2. Let's assume we want to test how well our pipeline performs when assuming a Yule tree when the data actually includes extinct taxa. The results here presented are not expected to perfect, as we will use a short chain length to make it faster and a model which is different from the assumptions used for generation (besides the fact of the default parameters for horizontal gene transfer being a bit too aggressive).
$ cat examples/example_beastling.conf
[admin]
basename=example
[MCMC]
chainlength=500000
[model example]
model=covarion
data=example.csv
$ beastling example_beastling.conf
$ beast example.xml
We can proceed normally here: use BEAST2's treeannotator
(or similar
software) to generate a summary tree,
which we store in examples/summary.nex
,
and plot the results with figtree
(or, again, similar software).
Let's plot our summary tree and compare the results with the actual topology (which we can regenerate with the earlier seed).
{width=100%}
$ ngesh -c examples/example_ngesh.conf --seed jena --output newick > examples/example.nw
{width=100%}
The results are not excellent given the limits we set for quick demonstration, but it still capture major information and subgroupings (as clearer by the radial layout below) -- manual data exploration show that at least some of the errors, including the group in the first split, are due to horizontal gene transfer. For an analysis of the inference performance we would need to improve the parameters above and repeat the analysis on a range of random trees, including studying the log of character changes (including borrowings) involved in this particular random tree.
{width=100%}
TODO: Compare trees (Robinson-Foulds symmetric difference?)
The files used and generated in this example
can be found in the /examples
directory.
Parameters for tree generation
The parameters for tree generation, as also given by ngesh -h
, are:
birth
: The tree birth rate (l)death
: The tree death rate (mu)max_time
: The stopping criterion for maximum evolution timemin_leaves
: The stopping criterion for minimum number of leaveslabels
: The model for textual generation of random labels (None
,"enum"
for a simple enumeration,"human"
for randomly generated names, and"bio"
for randomly generated specie names)num_chars
: The number of characters to be simulatedk_mut
: The character mutation gammak
parameterth_mut
: The character mutation gammath
parameterk_hgt
: The character HGT gammak
parameterth_hgt
: The character HGT gammath
parametere
: The character general mutatione
parameter
What does "ngesh" mean?
Technically it is just an unique name, but it was derived from one of the Sumerian words for "tree", ĝeš, albeit with an uncommon transcription. The name comes from the library once being a module of a larger system for simulating language evolution and benchmarking related tools, called Enki after the Sumerian god of (among many other things) language and mischief.
The intended pronounciation, as in the most accepted reconstructions, is /ŋeʃ/. But don't strees over it: it is just a unique name.
Alternatives
There are many tools for simulating phylogenetic processes in order to obtain
random phylogenetic trees. The most complete is probably the R package
TreeSim
by Tanja Stadler, which includes many flexible tree simulation functions. In
R, one can also use the rtree()
function from package ape
and the
birthdeath.tree()
one from package geiger
, as well as manually randomizing taxon
placement in cladograms.
In Python, some code similar to ngesh
and which served as initial inspiration
is provided by Marc-Rolland Noutahi on the blog post
How to simulate a phylogenetic tree ? (part 1).
A number of on-line tools are also available at the time of writing:
- T-Rex (Tree and reticulogram REConstruction at the Université du Québec à Montréal (UQAM)
- Anvi'o Server can be used on-line as a wrapper to T-Rex above
- phyloT, which by randomly sampling taxonomic names, identifiers or protein accessions can be used for the same purpose
TODO
-
Shorter-term
- Write better documentation of function parameters
- Add all still unavailable parameters to the command line tool (e.g., setting hard politomies)
- Automatically generate developer documentation (possibly with Sphinx)
- Allow generation of unlabelled trees from the command-line (a text generation model is currently mandatory)
- Look for default parameters that are more closely related to linguistic trees (or, at least, to the Indo-European one)
- Check if all outputs are complete (e.g., characters are currently missing in the Newick format)
- Add a command-line option (or a new tool) that allows to write the output to one file and the reference tree to a second one (possibly with the log of character evolution)
- Check for alternatives for the exponential correction of a character resistance to mutation (e.g. Zipf law), including separating mutation and borrowing rates
- Allow to exclude non extant taxa from horizontal gene transfer events
- Add stopping criterion on the global number of nodes (in complement to the number of extant nodes, currently implemented), either absolute or as a range
-
Longer-term
- Simulation of data problems (incomplete sampling, errors in sequencing/cognate judgment, etc.)
- Variable birth/death ratios
- Rewrite the random text generation functions, possibly as actual Python generators
- Consider replacing or complementing
expovariate()
in birth/death events with actual random Poisson sampling, allowing additional models - Build a simple website
- Implement parallel character evolution as controlled by a parameter
- Rewrite functions with too many arguments to accept dictionaries of parameters
- Implement more models for random character generation, especially those frm genetics (first candidate, a General Time Reversable model with a proportion of invariable sites and a gamma-shaped distribution of rates across sites)
- Simulate a "donation power" for taxa, making borrowing events globally more likely from a given donor (analogous to cultural influence in linguistics)
- Allow to guarantee that borrowing events will always result in altered states (it is currently possible that an event will borrow an equal state for a given character, especially considering that we favor borrowing from closer taxa)
- Implement character simulation for other datatypes, particularly from genetics (currently only standard binary presence/absence)
Gallery
{width=100%} {width=100%} {width=100%}
How to cite
If you use ngesh
, please cite it as:
Tresoldi, Tiago (2019). Ngesh, a tool for simulating random phylogenetic trees. Version 0.3. Jena. Available at: https://github.com/tresoldi/ngesh
In BibTeX:
@misc{Tresoldi2019ngesh,
author = {Tresoldi, Tiago},
title = {Ngesh, a tool for simulating random phylogenetic trees. Version 0.3},
howpublished = {\url{https://github.com/tresoldi/ngesh}},
address = {Jena},
year = {2019},
doi = {10.5281/zenodo.2619311},
}
References
-
Bailey, N. T. J. (1964). The elements of stochastic processes with applications to the natural sciences. John Wiley & Sons.
-
Foote, M., J. P. Hunter, C. M. Janis, and J. J. Sepkoski Jr. (1999). Evolutionary and preservational constraints on origins of biologic groups: Divergence times of eutherian mammals. Science 283:1310–1314.
-
Harmon, Luke J (2019). Phylogenetic Comparative Methods -- learning from trees. Available at: https://lukejharmon.github.io/pcm/chapter10_birthdeath/. Access date: 2019-03-31.
-
Noutahi, Marc-Rolland (2017). How to simulate a phylogenetic tree? (part 1). Available at: https://mrnoutahi.com/2017/12/05/How-to-simulate-a-tree/. Access date: 2019-03-31
-
Stadler, Tanja (2011). Simulating Trees with a Fixed Number of Extant Species. Systematic Biology 60.5:676-684. DOI: https://doi.org/10.1093/sysbio/syr029
Author
Tiago Tresoldi (tresoldi@shh.mpg.de)
The author was supported during development by the ERC Grant #715618 for the project CALC (Computer-Assisted Language Comparison: Reconciling Computational and Classical Approaches in Historical Linguistics), led by Johann-Mattis List.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.