Deep generalizable prediction of RNA secondary structure via base pair motif energy.

Heqin Zhu · Fenghe Tang · Quan Quan · Ke Chen · Peng Xiong* · S. Kevin Zhou*

Submitted

bioRxiv | PDF | GitHub | PyPI

Introduction

Deep learning methods have demonstrated great performance for RNA secondary structure prediction. However, generalizability to unseen, out-of-distribution RNA families remains an unsolved problem, which hinders further improvement in the accuracy and robustness of these methods. Here we construct a base pair motif library that enumerates the complete space of locally adjacent three-neighbor base pairs and records the thermodynamic energy of each base pair motif through de novo modeling of tertiary structures. We further develop a deep learning approach for RNA secondary structure prediction, named BPfold, which learns the relationship between an RNA sequence and the energy map of its base pair motifs. Experiments on sequence-wise and family-wise datasets demonstrate that BPfold outperforms other state-of-the-art approaches in accuracy and generalizability. We hope this work contributes to integrating physical priors and deep learning methods for the further discovery of RNA structures and functionalities.

Installation

Requirements

  • python3.8+
  • anaconda

Instructions

  1. Clone this repo.
git clone git@github.com:heqin-zhu/BPfold.git
cd BPfold
  2. Create and activate the BPfold environment.
conda env create -f BPfold_environment.yaml
conda activate BPfold
  3. Install BPfold.
pip3 install BPfold --index-url https://pypi.org/simple
  4. Download model_predict.tar.gz in releases and decompress it.
wget https://github.com/heqin-zhu/BPfold/releases/latest/download/model_predict.tar.gz
tar -xzf model_predict.tar.gz
  5. Optional: Download the datasets BPfold_data.tar.gz in releases and decompress them.
wget https://github.com/heqin-zhu/BPfold/releases/latest/download/BPfold_data.tar.gz
tar -xzf BPfold_data.tar.gz

Usage

BPfold motif library

The base pair motif library is publicly available in releases and contains motif:energy pairs. Each motif is represented as sequence_pairIdx_pairIdx-chainBreak, where pairIdx is 0-indexed, and the energy is a reference score of statistical and physical thermodynamic energy. For instance, CAAAAUG_0_6-3 -49.7835 denotes the motif CAAAAUG with a known C-G pair at indexes 0 and 6 and a chain break at position 3.

Note: The base pair motif library can be used as thermodynamic priors in other models.
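
To use the library outside BPfold, each record first has to be split into its fields. The following is a minimal sketch, assuming one record per line with the motif key and the energy separated by whitespace, as in the example above; the actual file layout in the release may differ. The names MotifEntry and parse_motif_line are hypothetical helpers, not part of BPfold.

```python
from typing import NamedTuple


class MotifEntry(NamedTuple):
    """One record of the motif library (hypothetical container, not a BPfold class)."""
    sequence: str
    pair_i: int       # 0-indexed position of the first paired base
    pair_j: int       # 0-indexed position of the second paired base
    chain_break: int  # position of the chain break
    energy: float     # reference energy score


def parse_motif_line(line: str) -> MotifEntry:
    """Parse 'sequence_pairIdx_pairIdx-chainBreak energy' (assumed whitespace-separated)."""
    key, energy = line.split()
    sequence, pair_i, rest = key.split("_")
    pair_j, chain_break = rest.split("-")
    return MotifEntry(sequence, int(pair_i), int(pair_j), int(chain_break), float(energy))


entry = parse_motif_line("CAAAAUG_0_6-3 -49.7835")
# The paired bases are the ones at the two recorded indexes.
paired_bases = entry.sequence[entry.pair_i] + entry.sequence[entry.pair_j]
print(entry, paired_bases)
```

A dictionary keyed by (sequence, pair_i, pair_j, chain_break) built from such entries is one simple way to look up energies when using the library as a prior.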

BPfold Prediction

Use BPfold to predict RNA secondary structures. Args:

  • --checkpoint_dir: required, specify checkpoint dir path.
  • --seq: specify one or more input RNA sequences.
  • --input: specify an input file of RNA sequences in .fasta (multiple sequences are supported), .bpseq, .ct, or .dbn format.
  • --output: output dir (will be created automatically), default BPfold_results.
  • --out_type: output format of RNA secondary structures, can be .csv, .bpseq, .ct, or .dbn, default .csv.

Here are some examples:
BPfold --checkpoint_dir PATH_TO_CHECKPOINT_DIR --seq GGUAAAACAGCCUGU AGUAGGAUGUAUAUG --output BPfold_results
BPfold --checkpoint_dir PATH_TO_CHECKPOINT_DIR --input examples/examples.fasta --out_type csv # (multiple sequences are supported)
BPfold --checkpoint_dir PATH_TO_CHECKPOINT_DIR --input examples/URS0000D6831E_12908_1-117.bpseq
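
For batch prediction, sequences can be collected into a FASTA file and passed with --input. The sketch below renders a standard FASTA text (a '>' header line followed by the sequence) from a name-to-sequence mapping. The helper to_fasta and the record names seq1 and seq2 are hypothetical; the two sequences are the ones from the --seq example above.

```python
from typing import Dict


def to_fasta(records: Dict[str, str]) -> str:
    """Render {name: sequence} as FASTA text: a '>' header line, then the sequence."""
    lines = []
    for name, seq in records.items():
        if not seq.isalpha():
            raise ValueError(f"sequence {name!r} contains non-letter characters")
        lines.append(f">{name}")
        lines.append(seq.upper())
    return "\n".join(lines) + "\n"


# Hypothetical record names; the sequences come from the --seq example.
fasta_text = to_fasta({"seq1": "GGUAAAACAGCCUGU", "seq2": "AGUAGGAUGUAUAUG"})
print(fasta_text, end="")
```

Writing fasta_text to a file of your choice and passing that path to --input should then behave like the examples/examples.fasta case.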
Example of BPfold prediction

Here are the outputs after running BPfold --checkpoint_dir model_predict --input examples/examples.fasta --out_type bpseq:

>> Welcome to use "BPfold" for predicting RNA secondary structure!
Loading model_predict/BPfold_1-6.pth
Loading model_predict/BPfold_2-6.pth
Loading model_predict/BPfold_3-6.pth
Loading model_predict/BPfold_4-6.pth
Loading model_predict/BPfold_5-6.pth
Loading model_predict/BPfold_6-6.pth
[      1] saved in "BPfold_results/5s_Shigella-flexneri-3.bpseq", CI=0.973
CUGGCGGCAGUUGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAACUGCCAG
(((((((.....((((((((.....((((((.............))))..))....)))))).)).((.((....((((((((...))))))))....)).))...)))))))
[      2] saved in "BPfold_results/URS0000D6831E_12908_1-117.bpseq", CI=0.915
UUAUCUCAUCAUGAGCGGUUUCUCUCACAAACCCGCCAACCGAGCCUAAAAGCCACGGUGGUCAGUUCCGCUAAAAGGAAUGAUGUGCCUUUUAUUAGGAAAAAGUGGAACCGCCUG
......((((((....(((((.......)))).)(((.((((.((......))..))))))).................))))))..(((......)))..................
Confidence indexes are saved in "BPfold_results_confidence_TIMESTR.yaml"
Program Finished!

For more help, run BPfold -h.
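
The structures printed in the example output use dot-bracket notation, where matching parentheses mark paired bases and dots mark unpaired ones. The following sketch converts such a string into a list of 0-indexed base pairs with a stack. It handles only plain parentheses (no pseudoknot brackets), and dot_bracket_to_pairs is a hypothetical helper rather than a BPfold function. The input string is a short illustrative structure, not a BPfold prediction.

```python
from typing import List, Tuple


def dot_bracket_to_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return sorted 0-indexed (i, j) pairs for a dot-bracket string using '(' and ')'."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for idx, ch in enumerate(structure):
        if ch == "(":
            stack.append(idx)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {idx}")
            pairs.append((stack.pop(), idx))
        # Any other character (e.g. '.') is treated as unpaired.
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Illustrative toy structure (15 nucleotides, 4 base pairs).
pairs = dot_bracket_to_pairs("((..((...))..))")
print(pairs)
```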

Reproduction

To reproduce all the quantitative results, we provide the secondary structures predicted by BPfold and the model parameters used in the experiments. You can either directly download the predicted secondary structures or use BPfold v0.2.4 with the trained parameters to predict them yourself, and then evaluate the predicted results.

Directly download

wget https://github.com/heqin-zhu/BPfold/releases/latest/download/BPfold_test_results.tar.gz
tar -xzf BPfold_test_results.tar.gz

Use BPfold

  1. Download model_reproduce.tar.gz in releases and decompress it.
wget https://github.com/heqin-zhu/BPfold/releases/latest/download/model_reproduce.tar.gz
tar -xzf model_reproduce.tar.gz
  2. Use BPfold v0.2.4 (pip install BPfold==0.2.4) to predict test sequences.

Evaluate

BPfold_eval --gt_dir BPfold_data --pred_dir BPfold_test_results

After running the above command for evaluation, you will see the following output:

Outputs of evaluating BPfold
Time used: 29s
[Summary] eval_BPfold_test_results.yaml
 Pred/Total num: [('PDB_test', 116, 116), ('Rfam12.3-14.10', 10791, 10791), ('archiveII', 3966, 3966), ('bpRNA', 1305, 1305), ('bpRNAnew', 5401, 5401)]
-------------------------len>600-------------------------
dataset         & num   & INF   & F1    & P     & R    \\
Rfam12.3-14.10  & 64    & 0.395 & 0.387 & 0.471 & 0.333\\
archiveII       & 55    & 0.352 & 0.311 & 0.580 & 0.242\\
------------------------len<=600-------------------------
dataset         & num   & INF   & F1    & P     & R    \\
PDB_test        & 116   & 0.817 & 0.814 & 0.840 & 0.801\\
Rfam12.3-14.10  & 10727 & 0.696 & 0.690 & 0.662 & 0.743\\
archiveII       & 3911  & 0.829 & 0.827 & 0.821 & 0.843\\
bpRNA           & 1305  & 0.670 & 0.658 & 0.599 & 0.770\\
bpRNAnew        & 5401  & 0.655 & 0.647 & 0.604 & 0.723\\
---------------------------all---------------------------
dataset         & num   & INF   & F1    & P     & R    \\
PDB_test        & 116   & 0.817 & 0.814 & 0.840 & 0.801\\
Rfam12.3-14.10  & 10791 & 0.694 & 0.689 & 0.660 & 0.741\\
archiveII       & 3966  & 0.823 & 0.820 & 0.818 & 0.834\\
bpRNA           & 1305  & 0.670 & 0.658 & 0.599 & 0.770\\
bpRNAnew        & 5401  & 0.655 & 0.647 & 0.604 & 0.723\\
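
The columns above can be read as base-pair-level scores. The sketch below assumes the usual definitions: P is precision, R is recall, F1 is their harmonic mean, and INF (interaction network fidelity) is their geometric mean. This is an illustration for a single sequence only; how BPfold_eval computes and averages these values across a dataset is not specified here, and pair_metrics is a hypothetical helper.

```python
import math
from typing import Dict, Iterable, Tuple

Pair = Tuple[int, int]


def pair_metrics(pred: Iterable[Pair], truth: Iterable[Pair]) -> Dict[str, float]:
    """Base-pair precision, recall, F1 and INF for one sequence (assumed definitions)."""
    pred_set = {tuple(sorted(p)) for p in pred}
    true_set = {tuple(sorted(p)) for p in truth}
    tp = len(pred_set & true_set)  # correctly predicted pairs
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    inf = math.sqrt(precision * recall)
    return {"P": precision, "R": recall, "F1": f1, "INF": inf}


# Toy example: three of the four true pairs are predicted, with no false positives.
truth = [(0, 14), (1, 13), (4, 10), (5, 9)]
pred = [(0, 14), (1, 13), (4, 10)]
scores = pair_metrics(pred, truth)
print(scores)
```

Because dataset-level numbers are typically averaged per sequence, the reported INF need not equal the geometric mean of the reported P and R.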

Acknowledgement

We appreciate the following open source projects:

LICENSE

MIT LICENSE

Citation

If you use our code, please consider citing our paper:

@article {Zhu2024.10.22.619430,
    author = {Zhu, Heqin and Tang, Fenghe and Quan, Quan and Chen, Ke and Xiong, Peng and Zhou, S. Kevin},
    title = {Deep generalizable prediction of RNA secondary structure via base pair motif energy},
    elocation-id = {2024.10.22.619430},
    year = {2024},
    doi = {10.1101/2024.10.22.619430},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2024/10/25/2024.10.22.619430},
    eprint = {https://www.biorxiv.org/content/early/2024/10/25/2024.10.22.619430.full.pdf},
    journal = {bioRxiv}
}
