Ancestral sequence reconstruction using a tree-structured Ornstein-Uhlenbeck variational autoencoder
Project description
DRAUPNIR: beta library for performing ancestral sequence reconstruction (ASR) using a tree-structured variational autoencoder
Extra requirements for tree inference:
#These are not necessary if you have your own tree file or if you are using the default datasets
IQ-Tree: http://www.iqtree.org/doc/Quickstart
conda install -c bioconda iqtree
RapidNJ: https://birc.au.dk/software/rapidnj
conda config --add channels bioconda
conda install rapidnj
Extra requirements for fast patristic matrix construction
#Recommended if you have more than 200 sequences. The patristic matrix is constructed only once
Install R (version 4.1.2 (2021-11-01), "Bird Hippie"):
sudo apt update && sudo apt upgrade
sudo apt -y install r-base
together with the ape 5.5 and TreeDist 2.3 R packages:
install.packages(c("ape","TreeDist"))
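A patristic matrix is simply the table of pairwise path lengths (summed branch lengths) between nodes of the tree; Draupnir delegates its construction to the ape/TreeDist R packages for real trees. As a minimal illustration of what is being computed, here is a pure-Python sketch on a hand-built toy tree (the tree, node names, and branch lengths are invented for the example):

```python
# Toy tree as parent pointers with branch lengths. The patristic
# distance between two nodes is the sum of branch lengths along the
# path connecting them through their lowest common ancestor (LCA).
parent = {"A": "n1", "B": "n1", "n1": "root", "C": "root", "root": None}
branch_len = {"A": 0.1, "B": 0.2, "n1": 0.3, "C": 0.5, "root": 0.0}

def path_to_root(node):
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def patristic_distance(a, b):
    ancestors_a = path_to_root(a)
    set_a = set(ancestors_a)
    # Walk up from b until we hit a's lineage: that node is the LCA.
    node, dist_b = b, 0.0
    while node not in set_a:
        dist_b += branch_len[node]
        node = parent[node]
    lca = node
    # Sum a's branch lengths up to (but excluding) the LCA.
    dist_a = sum(branch_len[n] for n in ancestors_a[:ancestors_a.index(lca)])
    return dist_a + dist_b
```

On this toy tree, the distance between leaves A and B is 0.1 + 0.2, and between A and C it is 0.1 + 0.3 + 0.5. For hundreds of sequences the R route above is much faster, which is why it is recommended.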
Draupnir pip install
pip install draupnir
Example
See Draupnir_example.py
Which guide to use?
In our experience, use the delta_map guide; the marginal results (Test folder) are the most stable. It is recommended to run the model with both the variational and the delta_map guides and compare the outputs using their mutual information. If necessary, run the variational guide longer than the delta_map guide, since it has more parameters to infer and takes longer to converge.
What folders and files are being produced?
How long should I run my model?
- Before training:
- It is recommended to train for at least 10000 epochs on datasets with fewer than 800 leaves. See the article for inspiration; the runtimes there were extended to achieve maximum benchmarking accuracy, but that should not be necessary.
- While it is training:
- Check the Percent_ID.png plot: if the training accuracy has plateaued near 100%, run for at least ~1000 more epochs to guarantee full learning
- Check for stabilization of the error loss: ELBO_error.png
- Check for stabilization of the entropy: Entropy_convergence.png
- After training:
- Observe the latent space:
- t-SNE, UMAP and PCA plots: Is the space organized by clades? Not every data set will show tight clustering of the tree clades, but there should be some organization
- Distances_GP_VAE_z_vs_branch_lengths_Pairwise_distance_INTERNAL_and_LEAVES plot: Is there a positive correlation? If the correlation is weak but the train percent identity is high, the run is still valid
- Observe the sampled training (leaves) sequences and test (internal nodes) sequences: navigate to the Train_argmax and Test_argmax folders and look for the .fasta files
- Calculate the mutual information:
- First: run Draupnir with the delta_map guide (produces MAP & marginal results) and with the variational guide, or with just the variational guide
- Second: call draupnir.calculate_mutual_information() with the paths to the folders containing the trained runs.
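The "stabilization" checks above (ELBO_error.png, Entropy_convergence.png) amount to asking whether the trace has flattened out. A minimal sketch of such a check (not part of Draupnir; the window size and relative tolerance are arbitrary assumptions) that could be run on a logged loss trace:

```python
from statistics import mean

def has_plateaued(losses, window=100, rel_tol=1e-3):
    """Return True when the mean of the last `window` values differs
    from the previous window's mean by less than rel_tol (relative
    change), i.e. the trace has stabilized."""
    if len(losses) < 2 * window:
        return False  # not enough history to judge yet
    recent = mean(losses[-window:])
    previous = mean(losses[-2 * window:-window])
    return abs(recent - previous) <= rel_tol * max(abs(previous), 1e-12)
```

The same check applies equally to the entropy trace; eyeballing the plots as suggested above remains the simplest option.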
Datasets
#They include some additional datasets beyond the ones in the paper
#They are recommended for use with the pipeline; see datasets.py for more details
dict_urls = {
"aminopeptidase":"https://drive.google.com/drive/folders/1fLsOJbD1hczX15NW0clCgL6Yf4mnx_yl?usp=sharing",
"benchmark_randall_original_naming":"https://drive.google.com/drive/folders/1oE5-22lqcobZMIguatOU_Ki3N2Fl9b4e?usp=sharing",
"Coral_all":"https://drive.google.com/drive/folders/1IbfiM2ww5PDcDSpTjrWklRnugP8RdUTu?usp=sharing",
"Coral_Faviina":"https://drive.google.com/drive/folders/1Ehn5xNNYHRu1iaf7vS66sbAESB-dPJRx?usp=sharing",
"PDB_files_Draupnir_PF00018_116":"https://drive.google.com/drive/folders/1YJDS_oHHq-5qh2qszwk-CucaYWa9YDOD?usp=sharing",
"PDB_files_Draupnir_PF00400_185": "https://drive.google.com/drive/folders/1LTOt-dhksW1ZsBjb2uzi2NB_333hLeu2?usp=sharing",
"PF00096":"https://drive.google.com/drive/folders/103itCfxiH8jIjKYY9Cvy7pRGyDl9cnej?usp=sharing",
"PF00400":"https://drive.google.com/drive/folders/1Ql10yTItcdX93Xpz3Oh-sl9Md6pyJSZ3?usp=sharing",
"SH3_pf00018_larger_than_30aa":"https://drive.google.com/drive/folders/1Mww3uvF_WonpMXhESBl9Jjes6vAKPj5f?usp=sharing",
"simulations_blactamase_1":"https://drive.google.com/drive/folders/1ecHyqnimdnsbeoIh54g2Wi6NdGE8tjP4?usp=sharing",
"simulations_calcitonin_1":"https://drive.google.com/drive/folders/1jJ5RCfLnJyAq0ApGIPrXROErcJK3COvK?usp=sharing",
"simulations_insulin_2":"https://drive.google.com/drive/folders/1xB03AF_DYv0EBTwzUD3pj03zBcQDDC67?usp=sharing",
"simulations_PIGBOS_1":"https://drive.google.com/drive/folders/1KTzfINBVo0MqztlHaiJFoNDt5gGsc0dK?usp=sharing",
"simulations_sirtuins_1":"https://drive.google.com/drive/folders/1llT_HvcuJQps0e0RhlfsI1OLq251_s5S?usp=sharing",
"simulations_src_sh3_1":"https://drive.google.com/drive/folders/1tZOn7PrCjprPYmyjqREbW9PFTsPb29YZ?usp=sharing",
"simulations_src_sh3_2":"https://drive.google.com/drive/folders/1ji4wyUU4aZQTaha-Uha1GBaYruVJWgdh?usp=sharing",
"simulations_src_sh3_3":"https://drive.google.com/drive/folders/13xLOqW2ldRNm8OeU-bnp9DPEqU1d31Wy?usp=sharing"
}
#Datasets in the paper
Dataset | Number of leaves | Alignment length | Name |
---|---|---|---|
Randall's Coral fluorescent proteins (CFP) | 19 | 225 | benchmark_randall_original_naming |
Coral fluorescent proteins (CFP) Faviina subclade | 35 | 361 | Coral_Faviina |
Coral fluorescent proteins (CFP) subclade | 71 | 272 | Coral_all |
Simulation $\beta$-Lactamase | 32 | 314 | simulations_blactamase_1 |
Simulation Calcitonin | 50 | 71 | simulations_calcitonin_1 |
Simulation SRC-kinase SH3 domain | 100 | 63 | simulations_src_sh3_1 |
Simulation Sirtuin | 150 | 477 | simulations_sirtuins_1 |
Simulation SRC-kinase SH3 domain | 200 | 128 | simulations_src_sh3_3 |
Simulation PIGBOS | 300 | 77 | simulations_PIGBOS_1 |
Simulation Insulin | 400 | 558 | simulations_insulin_2 |
Simulation SRC-kinase SH3 domain | 800 | 99 | simulations_src_sh3_2 |
PF00400 | 125 | 138 | PF00400 |
What do the folders mean?
- If you selected delta_map guide:
- Train_Plots: Contains information related to the inference of the train sequences (the leaves). They are samples obtained by using the MAP estimates of the logits.
- Train_argmax_Plots: Single sequence per leaf, obtained by taking the most likely amino acid at each position according to the logits ("argmax the logits")
- Test_Plots: Samples for the test sequences (ancestors). In this case they contain the sequences sampled using the marginal probability approach (equation 5 in the paper)
- Test_argmax_Plots: Contains the most voted sequence from the samples in Test_Plots.
- Test2_Plots: Samples for the test sequences (ancestors). In this case they contain the sequences sampled using the MAP estimates of the logits.
- Test2_argmax_Plots: Samples for the test sequences (ancestors). In this case they contain the most likely amino acids indicated by the logits ("argmax the logits") (equation 4 in the paper)
- If you selected variational guide:
- Train_Plots: Contains information related to the inference of the train sequences (the leaves). They are samples obtained by using the MAP estimates of the logits.
- Train_argmax_Plots: Single sequence per leaf, obtained by taking the most likely amino acid at each position according to the logits ("argmax the logits")
- Test_Plots: Samples for the test sequences (ancestors). In this case they contain the sequences sampled using the full variational probability approach (equation 6 in the paper)
- Test_argmax_Plots: Contains the most voted sequence from the samples in Test_Plots.
- Test2_Plots == Test_Plots
- Test2_argmax_Plots == Test_argmax_Plots
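The "argmax the logits" step that distinguishes the *_argmax folders can be illustrated in isolation. The toy sketch below (the alphabet ordering and the logits are invented for the example; this is not Draupnir's internal representation):

```python
# Toy illustration of "argmax the logits": per alignment position,
# pick the amino acid with the highest logit. Alphabet order and
# logit values are made up for the example.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def argmax_decode(logits):
    """logits: list of per-position lists, one score per amino acid.
    Returns the single most likely sequence."""
    return "".join(AMINO_ACIDS[max(range(len(row)), key=row.__getitem__)]
                   for row in logits)
```

This collapses a distribution over sequences into one sequence per node, which is why the *_argmax folders hold a single fasta entry per leaf or ancestor while the plain folders hold samples.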
If this library is useful for your research, please cite:
@inproceedings{moreta2021ancestral,
title={Ancestral protein sequence reconstruction using a tree-structured Ornstein-Uhlenbeck variational autoencoder},
author={Moreta, Lys Sanz and R{\o}nning, Ola and Al-Sibahi, Ahmad Salim and Hein, Jotun and Theobald, Douglas and Hamelryck, Thomas},
booktitle={International Conference on Learning Representations},
year={2021}
}