A phylogenetic and geographic analysis tool
aPhyloGeo
🌳 Multi-platform application for analyzing phylogenetic trees with climatic parameters
📝 About the project
aPhyloGeo is a bioinformatics pipeline dedicated to the analysis of phylogeography. It is an open-source, multi-platform application designed by the team of Professor Nadia Tahiri (University of Sherbrooke, Quebec, Canada) and implemented in Python. The tool builds trees from the climatic data of the regions where the samples were collected. These climatic trees are then compared, topologically and evolutionarily, against phylogenetic trees built from multiple sequence alignments (MSAs) using the Least Square (LS) metric. MSAs that yield trees with a significant LS value are then optionally saved in folders with their respective trees. The `output.csv` file contains the information for all the significant MSAs (see the Workflow section for more details).
Two methods are available for multiple sequence alignment: the pairwise2 algorithm and the pyMUSCLE5 algorithm.
💡 If you are using our algorithm in your research, please cite our recent paper: Koshkarov, A., Li, W., Luu, M. L., & Tahiri, N. (2022). Phylogeography: Analysis of genetic and climatic data of SARS-CoV-2. Proceedings of SciPy 2022, Austin, TX, USA.
Workflow
Figure 1: The workflow of the algorithm. The operations within this workflow include several blocks. The blocks are highlighted with three different colors.
- The first block (light blue) creates the trees from the climate data and validates the input parameters (see the YAML file).
- The second block (light green) creates the trees from the genetic data and validates the input parameters (see the YAML file).
- The third block (light pink) compares the phylogenetic trees (i.e., built from genetic data) with the climatic trees. This is the phylogeography step, which uses the Least Square distance (see the equation below).
$$LS(T_1, T_2) = \sum_{i=1}^{n-1} \sum_{j=i}^{n} \lvert \delta(i,j) - \xi(i,j) \rvert$$
where $T_1$ is the phylogenetic tree 1, $T_2$ is the phylogenetic tree 2, $i$ and $j$ are two species, $\delta(i,j)$ is the distance between species $i$ and species $j$ in $T_1$, $\xi(i,j)$ is the distance between species $i$ and species $j$ in $T_2$, and $n$ is the total number of species.
The third block is the most important one and the basis of this study: its results provide the user with the output data and the associated calculations.
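The LS distance defined above can be sketched in a few lines of Python. This is a minimal illustration, not the package's own implementation: it assumes the pairwise distances have already been read off each tree and stored in dictionaries keyed by species pairs. The function name `least_square_distance` and the toy values are hypothetical.

```python
from itertools import combinations


def least_square_distance(dist_1, dist_2, species):
    """Sum |delta(i, j) - xi(i, j)| over all unordered species pairs.

    The i == j terms of the formula are skipped because the distance
    from a species to itself is zero in both trees.
    """
    total = 0.0
    for a, b in combinations(species, 2):
        pair = frozenset((a, b))
        total += abs(dist_1[pair] - dist_2[pair])
    return total


# Toy pairwise distances (illustrative values, not real data).
species = ["A", "B", "C"]
tree_1 = {frozenset("AB"): 0.2, frozenset("AC"): 0.5, frozenset("BC"): 0.4}
tree_2 = {frozenset("AB"): 0.1, frozenset("AC"): 0.7, frozenset("BC"): 0.4}

ls_value = least_square_distance(tree_1, tree_2, species)
```

Keying the dictionaries by `frozenset` makes the lookup independent of the order in which the two species are given.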
Moreover, our approach is elastic: it adapts to any computer by using parallelism and the available GPUs/CPUs according to the resource usage per unit of computation (i.e., the processing of a single genetic window; see the workflow below). Multiprocessing allows multiple windows to be analyzed simultaneously and is recommended for large datasets.
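The idea of treating one window as the unit of parallel work can be sketched with Python's standard `multiprocessing` module. This is only an illustration under stated assumptions: `gc_fraction` is a placeholder for the real per-window analysis, and both function names are hypothetical rather than part of aPhyloGeo.

```python
from multiprocessing import Pool


def gc_fraction(window: str) -> float:
    """Placeholder per-window computation: fraction of G/C characters."""
    if not window:
        return 0.0
    return sum(base in "GC" for base in window.upper()) / len(window)


def analyze_windows(windows, processes=None):
    """Process every window in a worker pool; results keep the input order."""
    with Pool(processes=processes) as pool:
        return pool.map(gc_fraction, windows)


if __name__ == "__main__":
    # Toy windows, for illustration only.
    print(analyze_windows(["ATGC", "GGCC", "ATAT"], processes=2))
```

The `if __name__ == "__main__":` guard matters on platforms that start worker processes by re-importing the script, such as Windows.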
In this work, we applied software packages of the following versions: Biopython version 1.79 (BSD 3-Clause License), Bio version 1.5.2 (New BSD License), and numpy version 1.21.6 (BSD 3-Clause License).
⚒️ Installation
Linux/UNIX, macOS & Windows versions
aPhyloGeo is available as a Python script.
Prerequisites
This package uses Poetry, a dependency management and packaging tool for Python. The Poetry installation guide can be found here: Poetry Install
⚠️ For Windows installation, it is recommended to launch PowerShell in Administrator mode.
Once Poetry is installed, you can clone the repository and install the package using the following commands:
poetry install
Usage
Poetry will handle the virtual environment automatically. If you want to use the virtual environment manually, you can use the following command:
poetry shell
⚠️ Assuming Python 3.8 or higher is installed on the machine, these scripts should run well with the libraries installed.
You can also launch the package using the `make` command from your terminal when you are in the root directory. This command will use the `Makefile` to run the script. If you use the command `make clean`, it will erase the `output.csv` file previously created with the first command.
🚀 Settings
The aPhyloGeo software can be encapsulated in other applications and applied to other data by providing a YAML file. This file includes the following parameters:
- Bootstrap threshold: Threshold on the number of replicates to be generated for each sub-MSA (each position of the sliding window)
- Window length: Size of the sliding window
- Step: Sliding window advancement step
- Distance choice: Least Square (LS) distance (version 1.0); to be extended to the Robinson-Foulds (RF) metric
- Least Square distance threshold: LS distance threshold at which the results are most significant
- Alignment method: algorithm selection for sequence alignment ('1' for pairwise2 and '2' for pymuscle)
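How the window length and step parameters interact can be shown with a short sketch. It assumes 0-based start positions, exclusive end positions, and that only full-length windows are kept. These conventions, the function name `sliding_windows`, and the parameter values are illustrative assumptions, not the package's documented behavior.

```python
def sliding_windows(alignment_length: int, window_length: int, step: int):
    """Yield (start, end) positions, end exclusive, for each full window."""
    if window_length <= 0 or step <= 0:
        raise ValueError("window_length and step must be positive")
    for start in range(0, alignment_length - window_length + 1, step):
        yield start, start + window_length


# Hypothetical parameter values, for illustration only.
params = {"window_length": 200, "step": 100}
windows = list(sliding_windows(500, **params))
# windows == [(0, 200), (100, 300), (200, 400), (300, 500)]
```

A step smaller than the window length produces overlapping windows, as in this example. A step equal to the window length tiles the alignment without overlap.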
📁 Example
Description
We selected only 5 of 38 lineages with regional characteristics for further study (see Koshkarov et al., 2022). Based on location information, complete nucleotide sequencing data for these 5 lineages was collected from the NCBI Virus website. When multiple sequencing results were available for the same lineage in the same country, we selected the sequence whose collection date was closest to the earliest date presented. If there were several sequencing results for the same country on the same date, the sequence with the fewest ambiguous characters (N per nucleotide) was selected.
Although the selection of samples was based on the phylogenetic cluster of lineage and transmission, most of the sites involved represent different meteorological conditions. As shown in Figure 2, the 5 samples involved temperatures ranging from -4 °C to 32.6 °C, with an average temperature of 15.3 °C. Specific humidity ranged from 2.9 g/kg to 19.2 g/kg, with an average of 8.3 g/kg. The variability of Wind speed and All sky surface shortwave downward irradiance was relatively small across samples compared to other parameters. Wind speed ranged from 0.7 m/s to 9.3 m/s, with an average of 4.0 m/s, and All sky surface shortwave downward irradiance ranged from 0.8 kW-hr/m²/day to 8.6 kW-hr/m²/day, with an average of 4.5 kW-hr/m²/day. In contrast to the other parameters, 75% of the cities involved receive less than 2.2 mm of precipitation per day, and only 5 cities have more than 5 mm of precipitation per day. The minimum precipitation is 0 mm/day, the maximum is 12 mm/day, and the average is 2.1 mm/day.
Input
The algorithm takes two files as input with the following definitions:
- 🧬 Genetic file with fasta extension. The first file or set of files will contain the genetic sequence information of the species sets selected for the study. The file name must identify the gene; it is therefore strongly recommended to follow the nomenclature gene_name.fasta.
- ⛅ Climatic file with csv extension. The second file will contain the habitat information for the species sets selected for the study. Each row will represent the species identifier and each column will represent a climate condition.
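The climatic file layout described above (one row per species identifier, one column per climate condition) can be read with the standard `csv` module. The sketch below also derives a simple per-condition dissimilarity between species, which is one plausible input to a tree-building step. The column names, the values, and the choice of absolute difference are assumptions for illustration, not a description of how aPhyloGeo builds its climatic trees.

```python
import csv
import io
from itertools import combinations

# Illustrative content only; the column names are hypothetical.
CLIMATE_CSV = """id,temperature,humidity
s1,10.0,5.0
s2,14.5,7.0
s3,9.0,6.5
"""


def load_climate(handle, id_column="id"):
    """Return {species_id: {condition: value}} from a climatic CSV."""
    table = {}
    for row in csv.DictReader(handle):
        species_id = row.pop(id_column)
        table[species_id] = {name: float(value) for name, value in row.items()}
    return table


def pairwise_differences(table, condition):
    """Absolute difference of one climate condition for every species pair."""
    return {
        frozenset((a, b)): abs(table[a][condition] - table[b][condition])
        for a, b in combinations(sorted(table), 2)
    }


climate = load_climate(io.StringIO(CLIMATE_CSV))
temperature_gaps = pairwise_differences(climate, "temperature")
```

To read a real file, pass an open file handle (for example `open(path, newline="")`) instead of the in-memory `io.StringIO` used here.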
Output
The algorithm returns a csv file that contains information about all relevant MSAs (see the Workflow section for more details). The sliding windows of interest are those with strong bootstrap support (i.e., indicating the robustness of the tree) and high similarity to the climate condition in question (i.e., based on the LS value). Each row indicates, among other things, the name of the gene, the start and end positions of the sliding window, the average bootstrap value, the LS value, and finally the climatic condition for which this genetic region would explain the adaptation of the species to a given environment.
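The selection of windows described above can be sketched as a filter followed by a CSV write. The column names, the threshold values, and the direction of the comparisons (bootstrap at or above its threshold, LS distance at or below its threshold) are assumptions made for illustration. The actual columns of `output.csv` may differ.

```python
import csv
import io

# Hypothetical column names, for illustration only.
FIELDS = ["gene", "start", "end", "bootstrap_mean", "ls_distance", "climate_condition"]


def keep_window(row, bootstrap_threshold, ls_threshold):
    """Assumed rule: enough bootstrap support and a small enough LS distance."""
    return (
        row["bootstrap_mean"] >= bootstrap_threshold
        and row["ls_distance"] <= ls_threshold
    )


def write_output(rows, handle, bootstrap_threshold, ls_threshold):
    """Write the retained windows as CSV and return how many were kept."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    kept = [r for r in rows if keep_window(r, bootstrap_threshold, ls_threshold)]
    writer.writerows(kept)
    return len(kept)


# Toy rows (illustrative values, not real results).
rows = [
    {"gene": "gene_name", "start": 0, "end": 200, "bootstrap_mean": 85.0,
     "ls_distance": 40.0, "climate_condition": "temperature"},
    {"gene": "gene_name", "start": 100, "end": 300, "bootstrap_mean": 40.0,
     "ls_distance": 35.0, "climate_condition": "temperature"},
    {"gene": "gene_name", "start": 200, "end": 400, "bootstrap_mean": 90.0,
     "ls_distance": 120.0, "climate_condition": "temperature"},
]
buffer = io.StringIO()
n_kept = write_output(rows, buffer, bootstrap_threshold=70.0, ls_threshold=60.0)
```

Only the first toy row passes both thresholds here: the second fails on bootstrap support and the third on LS distance.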
✔️ References
1️⃣ Calculation of distance between phylogenetic trees: Least Square metric
- Cavalli-Sforza, L. L., & Edwards, A. W. (1967). Phylogenetic analysis. Models and estimation procedures. American Journal of Human Genetics, 19(3 Pt 1), 233.
- Felsenstein, J. (1997). An alternating least squares approach to inferring phylogenies from pairwise distances. Systematic Biology, 46(1), 101-111.
- Makarenkov, V., & Lapointe, F. J. (2004). A weighted least-squares approach for inferring phylogenies from incomplete distance matrices. Bioinformatics, 20(13), 2113-2121.
2️⃣ Calculation of distance between phylogenetic trees: Robinson-Foulds metric
3️⃣ Dataset full description: Analysis of genetic and climatic data of SARS-CoV-2
- Koshkarov, A., Li, W., Luu, M. L., & Tahiri, N. (2022). Phylogeography: Analysis of genetic and climatic data of SARS-CoV-2.
- Li, W., & Tahiri, N. (2023). aPhyloGeo-Covid: A Web Interface for Reproducible Phylogeographic Analysis of SARS-CoV-2 Variation using Neo4j and Snakemake.
📧 Contact
Please email us at: Nadia.Tahiri@USherbrooke.ca for any questions or feedback.