Skip to main content

REvolutionH-tl: Reconstruction of Evolutionary Histories tool

Project description

REvolutionH-tl logo.

Bioinformatics tool for the reconstruction of evolutionary histories. Input: fasta files or sequence alignment hits, Output: orthology. event-labeled gene trees, and reconciliations.

Bioinformatics & complex networks lab

pipeline

Install

pip install revolutionhtl

Requirements

Python >=3.7

If you want to run sequence alignments using revolutionhtl, then install Diamond:

wget http://github.com/bbuchfink/diamond/releases/download/v2.1.8/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz

Usage

For an example with data click here.

python -m revolutionhtl <arguments>

Below are described the steps of the program, arguments needed to run REvolutionH-tl, and description of output files.

Steps

  1. Orthogroup & best hit selection. Input: alignment hits (generate this using revolutionhtl.diamond) .
  2. Orthology and gene tree reconstruction. Input: best hits (generate this at step 1).
  3. Species tree reconstruction. Input: gene trees (generate this at step 2).
  4. Tree reconciliation. Input: gene and species trees (generate this at steps 2 and 3).

Arguments

Input data (Click to expand) - -h --help
show this help message and exit

-steps [integers]
List of steps to run (default: 1 2 3 4).

-alignment_h --alignment_hits [string]
Directory containing alignment hits, the input of step 1. (default: ./).

-best_h --best_hits [string]
.tsv file containing best hits, the input of step 2. (default: use output of step 1).

-T --gene_trees [string]
.tsv file containing gene trees, the input of steps 3 and 4. (default: use output of step 2).

-S --species_tree [string]
.nhx file containing a species tree, an input of step 4. (default: use output of step 3).

File names (Click to expand) -o --output_prefix [string]
Prefix used for output files (default "tl_project").

-og --orthogroup_column [string]
Column in -best_h -T, and output files specifying orthogroups (default: OG).

-Nm --N_max [integer]
Indicates the maximum number of genes in a orthogroup, bigger orthogroups are splitted. If 0, no orthogroup is splitted. (default= 2000).

-k --k_size_partition [integer]
Integer indicatng how many best hit graphs will be processed in bunch:: first graphs with
Algorithm parameters (Click to expand) -bh_heuristic --besthit_heuristic [string]
Indicates how to normalize bit-score in step 1 (default: normal). Normal: no normalization, prt: use proteinortho auxiliary files, smallest: use length of the smallest sequence, target: use target sequence, query: use query sequence, directed: x->y hit, bidirectional: use x->y and y->x hits.
Options: normal, prt, smallest_bidirectional, smallest_directed, query_directed, target_directed, alignment_directed, query_bidirectional, target_bidirectional, alignment_bidirectional

-f --f_value [float]
Real number between 0 and 1, a parameter of step 1. Defines the adaptative threshhold as: f\*max_bit_score (default: 0.95).

-bmg_h --bmg_heuristic [string]
Comunity detection method, an heuristic of step 2. (default: Louvain).
Options: Mincut, BPMF, Karger, Greedy, Gradient_Walk, Louvain, Louvain_Obj

-bmgh_nb --bmgh_no_binary [bool]
Flag, specifies if force binary tree in step 2. (no flag: force binary, flag: do not force binary).

-stree_h --species_tree_heuristic [string]
Comunity detection method, an heuristic of step 3. (default: louvain_weight).
Options: naive, louvain, mincut, louvain_weight

-streeh_repeats --stree_heuristic_repeats [integer]
integer, specifies how many times run the heuristic of step 3. (default: 3)

-streeh_b --streeh_binary [bool]
Flag, specifies if force binary tree in step 3. (no flag: do not force binary, flag: force binary).

-streeh_ndb --streeh_no_doble_build [bool]
Flag, specifies if run build algorithm twice to obtain less resolved tree in step 3. (no flag: double build, flag: single build).

Output files

Step 1 | Best hit & orthogroup selection

Best hits

For every gene in the analysis, step 1 aims to recover all the most similar genes based on sequence similarity. This procedure uses the normalized bit-score of the alignment hits and an adaptive threshold to create a subselection of alignment hits. We call best hits to this subselection. See section 3.1.2 of reference [1] for a deeper explanation of this analysis.

A best hit $(x\rightarrow y)$ is a directed relationship from one gene $x$ to gene $y$, where the former is called query and the latter is called target.

Best hits are output as a table in a file with extension .best_hits.tsv (Click here for an example file). Each row of this table describes a best hit throughout 10 columns: "OG" indicates the ID of the orthogroup containing the best hit, the six columns columns containing "Query_" and "Target_" contain information about query and target genes. Finally, alignment statistics are showed at the columns "Alignment_length", "Bit_score", and "Evalue".

Orthogroups

An orthogroup is a collection of homologous genes, which means that they appear as leafs of the same gene tree. We say that two genes are in the same orthogroup if we can construct a path of best hits connecting them.

Orthogroups are output as a table in a file with extension .orthogroups.tsv. Here, each row indicates one orthogroup, the first column "OG" indictes to ID, the second and third columns "n_genes", and "n_sepecies" contain the number of genes in the orthogroup and the number of species represented by those genes. Then, there is one column per each species of the analysis, those columns contain the genes of the orthogroup that are present in those species.

Step 2 | Gene tree reconstruction & orthology

Gene trees

Gene trees are output in the file with extension .gene_trees.tsv, for each row of this file there is an orthogroup ID, and a gene tree in NHXX format, which is a generalized newick (Click here for a description and examples of newick format).

The inner nodes of the gene tree are labeled as speciation events with the letter "S", or as duplication events with the letter "D".

The leafs of the gene tree represent the genes of the orthogroup.

The gene trees output here are the least resolved trees that can be reconstructed from the best hits.If you want fully resolved trees, look at the output of step 4.

Orthology

Orthology is saved in the file with extension .orthologs.tsv. Each row identifies a single orthology relationship between genes in the columns "a" and "b". Additionally, the column "OG" specifies the orthogroup containing the genes.

Two genes are orthologous if they diverged at an speciation event.

Best matches

For each gene in the gene trees we return the best matches, i.e. the most evolutionarily related genes in other species. Best matches are defined with respect to a gene tree. See section 3.1.1 [1] for a deeper explanation.

A best match (x-->y) is a directed relationship from gene "x" to gene "y". The main difference with "best hits" on the previous step is that a best match is consistent with a gene tree, while a best hit is based on the bit-score of sequence alignments.

Step 3 | Species tree reconstruction

We reconstruct an species tree as a consensus of all the gene trees obtained in the step 2, the procedure is detaled at section 3.1.4 of [1]. This species tree is saved in NHXX format at the file with extension .species_tree.tsv . NHXX format is a generalized newick (Click here for a description and examples of newick format).

Step 4 | Tree reconciliation

The reconciliation shows how genes evolved across species and time. This can be represented as a map of the nodes in the gene tree to nodes and edges of the species tree. The figure below shows a reconciliation. On the left side the nodes of a gene tree are are mapped to the species tree using arrows. On the right side the reconciliation is shown explicitly by drawing the gene tree inside of the species tree. Red circles correspond to speciation events, while blue squares correspond to gene duplication. Note that duplication nodes in the gene tree are mapped to edges of the species tree, on the other hand, speciation nodes of the gene tree are mapped to nodes of the species tree.

Reconciliation

To represent a reconciliation map REvolutionH-tl uses a comma-separated list, where each element takes the form x:y, where xis a node ID in the gene tree, and yis a node ID in the species tree. In the example of the figure above we have the reconciliation map 0:0,1:1,2:1. As in the example of the figure, the reconcilied gene tree can include extra speciation events and gene loss. In section 3.1.5 of [1] is detailed the procedure to compute this map.

REvolutionH-tl performs a reconciliation of the reconstructed gene and species trees. To this end, we add a node ID to all the nodes of the gene tree and the species tree using the attribute "node_id". The species tree with nodes ID is written to the file with extension .labeled_species_tree.nhxx. On the other hand, the reconciliations are saved in the file with extension .reconciliation.tsv, this file contains the following columns:

  • OG: Specifies the orthogroup ID
  • tree: reconcilied gene tree in NHXX format.
  • reconciliation_map: a comma-separated list of pair of nodes.
  • flipped_nodes: nodes of the gene tree flipped from speciation to duplication. The number of flipped nodes is a metric for discordance between the non-reconcilied gene tree and the species tree.

References

[1] Ramirez-Rafael J. A. (2023). REvolutionH-tl: a tool for the fast reconstruction of evolutionary histories using graph theory [Master dissertation, Cinvestav Irapuato]. Avaliable at https://drive.google.com/file/d/1NckRmpvxeOdoJG3ugbZSKsHyYEua4eYG/view?usp=sharing.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

revolutionhtl-1.0.14.tar.gz (40.9 kB view hashes)

Uploaded Source

Built Distribution

revolutionhtl-1.0.14-py3-none-any.whl (45.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page