
Ancestral sequence reconstruction using a tree-structured Ornstein-Uhlenbeck variational autoencoder

Project description

DRAUPNIR: "Beta library version for performing ASR using a tree-structured Variational Autoencoder"

Extra requirements for tree inference:

#These are NOT necessary if you have your own tree file or if you use the default datasets

IQ-Tree: http://www.iqtree.org/doc/Quickstart

conda install -c bioconda iqtree

RapidNJ: https://birc.au.dk/software/rapidnj

conda config --add channels bioconda
conda install rapidnj

Extra requirements for fast patristic matrix construction

#Recommended if you have more than 200 sequences. The patristic matrix is constructed only once

Install R (R version 4.1.2 (2021-11-01) -- "Bird Hippie"):

sudo apt update && sudo apt upgrade
sudo apt -y install r-base

together with the ape 5.5 and TreeDist 2.3 libraries:

install.packages(c("ape","TreeDist"))

Draupnir Python environment .yaml

https://github.com/LysSanzMoreta/DRAUPNIR_ASR/tree/main/draupnir/env/Python3_9

Draupnir pip install

pip install draupnir

Example

See Draupnir_example.py

Which guide to use?

In our experience, use the delta_map guide, since its marginal results (the Test folder) are the most stable. It is recommended to run the model with both the variational and the delta_map guides and compare their outputs using the mutual information. If necessary, run the variational guide for longer than the delta_map guide, since it has more parameters to optimize and takes longer to converge.

How long should I run my model?

  1. Before training:
    • It is recommended to train for at least 10000 epochs for datasets with fewer than 800 leaves. See the article for inspiration; the runtimes there were extended to achieve maximum benchmarking accuracy, but that should not be necessary.
  2. While it is training:
    • Check the Percent_ID.png plot: if the training accuracy has peaked at almost 100%, run for at least ~1000 more epochs to guarantee full learning
    • Check for stabilization of the error loss: ELBO_error.png
    • Check for stabilization of the entropy: Entropy_convergence.png
  3. After training:
    • Observe the latent space:

      1. t-SNE, UMAP and PCA plots: Is the latent space organized by clades? Not every dataset will show tight clustering of the tree clades, but there should be some organization

      (Latent space figure)

      2. Distances_GP_VAE_z_vs_branch_lengths_Pairwise_distance_INTERNAL_and_LEAVES plot: Is there a positive correlation? If there is no good correlation but the training percent identity is high, it is still a valid run
    • Observe the sampled training (leaves) sequences and test (internal) sequences: Navigate to the Train_argmax and Test_argmax folders and look for the .fasta files

    • Calculate mutual information:

      • First: Run Draupnir with the MAP & marginal (delta_map) version and the variational version, or just the variational version
      • Second: Use draupnir.calculate_mutual_information() with the paths to the folders containing the trained runs (see the sketch after this list).

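For illustration, a minimal sketch of the mutual information comparison mentioned above. draupnir.calculate_mutual_information() is the function named in this README, but the folder paths and the positional-argument call below are assumptions; check help(draupnir.calculate_mutual_information) for the documented signature.

import draupnir

# Hypothetical output folders from two trained runs (delta_map and variational);
# replace these with the paths of your own runs.
delta_map_run = "results/PF00096_delta_map"
variational_run = "results/PF00096_variational"

# Assumed call pattern: compare the two runs via their mutual information.
draupnir.calculate_mutual_information(delta_map_run, variational_run)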

Datasets #These are recommended for use with the pipeline; look into datasets.py for more details. A download sketch follows the table below.

dict_urls = {
        "aminopeptidase":"https://drive.google.com/drive/folders/1fLsOJbD1hczX15NW0clCgL6Yf4mnx_yl?usp=sharing",
        "benchmark_randall_original_naming":"https://drive.google.com/drive/folders/1oE5-22lqcobZMIguatOU_Ki3N2Fl9b4e?usp=sharing",
        "Coral_all":"https://drive.google.com/drive/folders/1IbfiM2ww5PDcDSpTjrWklRnugP8RdUTu?usp=sharing",
        "Coral_Faviina":"https://drive.google.com/drive/folders/1Ehn5xNNYHRu1iaf7vS66sbAESB-dPJRx?usp=sharing",
        "PDB_files_Draupnir_PF00018_116":"https://drive.google.com/drive/folders/1YJDS_oHHq-5qh2qszwk-CucaYWa9YDOD?usp=sharing",
        "PDB_files_Draupnir_PF00400_185": "https://drive.google.com/drive/folders/1LTOt-dhksW1ZsBjb2uzi2NB_333hLeu2?usp=sharing",
        "PF00096":"https://drive.google.com/drive/folders/103itCfxiH8jIjKYY9Cvy7pRGyDl9cnej?usp=sharing",
        "PF00400":"https://drive.google.com/drive/folders/1Ql10yTItcdX93Xpz3Oh-sl9Md6pyJSZ3?usp=sharing",
        "SH3_pf00018_larger_than_30aa":"https://drive.google.com/drive/folders/1Mww3uvF_WonpMXhESBl9Jjes6vAKPj5f?usp=sharing",
        "simulations_blactamase_1":"https://drive.google.com/drive/folders/1ecHyqnimdnsbeoIh54g2Wi6NdGE8tjP4?usp=sharing",
        "simulations_calcitonin_1":"https://drive.google.com/drive/folders/1jJ5RCfLnJyAq0ApGIPrXROErcJK3COvK?usp=sharing",
        "simulations_insulin_2":"https://drive.google.com/drive/folders/1xB03AF_DYv0EBTwzUD3pj03zBcQDDC67?usp=sharing",
        "simulations_PIGBOS_1":"https://drive.google.com/drive/folders/1KTzfINBVo0MqztlHaiJFoNDt5gGsc0dK?usp=sharing",
        "simulations_sirtuins_1":"https://drive.google.com/drive/folders/1llT_HvcuJQps0e0RhlfsI1OLq251_s5S?usp=sharing",
        "simulations_src_sh3_1":"https://drive.google.com/drive/folders/1tZOn7PrCjprPYmyjqREbW9PFTsPb29YZ?usp=sharing",
        "simulations_src_sh3_2":"https://drive.google.com/drive/folders/1ji4wyUU4aZQTaha-Uha1GBaYruVJWgdh?usp=sharing",
        "simulations_src_sh3_3":"https://drive.google.com/drive/folders/13xLOqW2ldRNm8OeU-bnp9DPEqU1d31Wy?usp=sharing"

    }
Dataset | Number of leaves | Alignment length | Name
Randall's Coral fluorescent proteins (CFP) | 19 | 225 | benchmark_randall_original_naming
Coral fluorescent proteins (CFP) Faviina subclade | 35 | 361 | Coral_Faviina
Coral fluorescent proteins (CFP) subclade | 71 | 272 | Coral_all
Simulation $\beta$-Lactamase | 32 | 314 | simulations_blactamase_1
Simulation Calcitonin | 50 | 71 | simulations_calcitonin_1
Simulation SRC-Kinase SH3 domain | 100 | 63 | simulations_src_sh3_1
Simulation Sirtuin | 150 | 477 | simulations_sirtuins_1
Simulation SRC-kinase SH3 domain | 200 | 128 | simulations_src_sh3_3
Simulation PIGBOS | 300 | 77 | simulations_PIGBOS_1
Simulation Insulin | 400 | 558 | simulations_insulin_2
Simulation SRC-kinase SH3 domain | 800 | 99 | simulations_src_sh3_2
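As referenced above, a minimal sketch of fetching one of these shared Google Drive folders locally. It assumes the third-party gdown package (pip install gdown), which is not a Draupnir dependency; the output directory name is arbitrary.

import gdown

# URL copied from dict_urls above (the PF00096 dataset); pick any entry you need.
url = "https://drive.google.com/drive/folders/103itCfxiH8jIjKYY9Cvy7pRGyDl9cnej?usp=sharing"

# Download the whole shared folder into a local directory of your choice.
gdown.download_folder(url=url, output="datasets/PF00096", quiet=False)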

What do the results folders mean?

  1. If you selected delta_map guide:
    1. Train_Plots: Contains information related to the inference of the train sequences (the leaves). They are samples obtained by using the marginal probability approach (equation 5 in the paper).
    2. Train_argmax_Plots: Single sequence per leaf obtained by using the most likely amino acids indicated by the marginal logits ("argmax the logits")
    3. Test_Plots: Samples for the test sequences (ancestors). In this case they contain the sequences sampled using the marginal probability approach (equation 5 in the paper)
    4. Test_argmax_Plots: Contains the most voted sequence from the samples in Test_Plots.
    5. Test2_Plots: Samples for the test sequences (ancestors). In this case they contain the sequences sampled using the MAP estimates of the logits.
    6. Test2_argmax_Plots: Samples for the test sequences (ancestors). In this case they contain the most likely amino acids indicated by the MAP logits ("argmax the logits") (equation 4 in the paper)
  2. If you selected variational guide:
    1. Train_Plots: Contains information related to the inference of the train sequences (the leaves). They are samples obtained from sampling from the variational posterior (equation 6 in the paper).
    2. Train_argmax_Plots: Single sequence per leaf obtained by using the most likely amino acids indicated by the logits ("argmax the logits")
    3. Test_Plots: Samples for the test sequences (ancestors). In this case they contain the sequences sampled using the full variational probability approach (equation 6 in the paper)
    4. Test_argmax_Plots: Contains the most voted sequence from the samples in Test_Plots.
    5. Test2_Plots == Test_Plots
    6. Test2_argmax_Plots == Test_argmax_Plots

Where are my ancestral sequences?

  • In each of these folders there should be a FASTA file whose name ends in _sampled_nodes_seq.fasta

  • Each of the sequences in the file is identified as Node-name-input-tree//Tree-level-order_sample_sample-index (see the parsing sketch below), where:

    -Node-name-input-tree: Original name of the node in the given input tree

    -Tree-level-order: Position of the node in tree-level order in the tree

    For example: Node_A1//1.0_sample_0
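A minimal sketch of locating these files and splitting a header back into its parts, assuming the naming pattern above and using only the Python standard library:

import pathlib

# Hypothetical results directory; replace with the output folder of your own run.
results_dir = pathlib.Path("results/PF00096_delta_map")

# Collect the sampled-sequence FASTA files from the Train*/Test* subfolders.
fasta_files = sorted(results_dir.glob("**/*_sampled_nodes_seq.fasta"))
print(fasta_files)

# Split a header of the form Node-name-input-tree//Tree-level-order_sample_sample-index.
header = "Node_A1//1.0_sample_0"  # example header from this README
node_name, rest = header.split("//")
tree_level_order, sample_index = rest.split("_sample_")
print(node_name, tree_level_order, sample_index)  # -> Node_A1 1.0 0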

If this library is useful for your research, please cite:

@inproceedings{moreta2021ancestral,
  title={Ancestral protein sequence reconstruction using a tree-structured Ornstein-Uhlenbeck variational autoencoder},
  author={Moreta, Lys Sanz and R{\o}nning, Ola and Al-Sibahi, Ahmad Salim and Hein, Jotun and Theobald, Douglas and Hamelryck, Thomas},
  booktitle={International Conference on Learning Representations},
  year={2021}
}

Do not hesitate to give input on how to improve the documentation of this library.

**Leave a like and subscribe ... wait, that was somewhere else ... well, a star will do ;)**


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

draupnir-0.0.31.tar.gz (3.6 MB)

Uploaded Source

Built Distribution

draupnir-0.0.31-py3-none-any.whl (4.0 MB)

Uploaded Python 3

File details

Details for the file draupnir-0.0.31.tar.gz.

File metadata

  • Download URL: draupnir-0.0.31.tar.gz
  • Upload date:
  • Size: 3.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.14

File hashes

Hashes for draupnir-0.0.31.tar.gz
Algorithm | Hash digest
SHA256 | 6e4aa0511ec9b012eb772f1b10230820ea410ac4820187b927c372d521350dad
MD5 | 82cf34be593db982b1635a24e7434bdd
BLAKE2b-256 | d9c5ec3237deead32969dde5d2374dbad61a3a07afdc9ae0c58664d9b96a024f


File details

Details for the file draupnir-0.0.31-py3-none-any.whl.

File metadata

  • Download URL: draupnir-0.0.31-py3-none-any.whl
  • Upload date:
  • Size: 4.0 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.14

File hashes

Hashes for draupnir-0.0.31-py3-none-any.whl
Algorithm | Hash digest
SHA256 | 66072fbe317c694af622c6fe091593d55f783d85e9673c5b645cd8d0ba59b32e
MD5 | 8a58f47460414f137c968fd0572f3dff
BLAKE2b-256 | 1a9ba272ed44187186697371c268e57473c4cfdec2474305038939e40444c03b

