A tool for RNA secondary structure prediction.
Deep generalizable prediction of RNA secondary structure via base pair motif energy
Heqin Zhu · Fenghe Tang · Quan Quan · Ke Chen · Peng Xiong* · S. Kevin Zhou*
Submitted
Introduction
Deep learning methods have demonstrated strong performance for RNA secondary structure prediction. However, generalizability to unseen, out-of-distribution RNA families remains a common unsolved issue, which hinders further improvement of the accuracy and robustness of deep learning methods. Here we construct a base pair motif library that enumerates the complete space of locally adjacent three-neighbor base pairs and records the thermodynamic energy of the corresponding base pair motifs through de novo modeling of tertiary structures. We further develop a deep learning approach for RNA secondary structure prediction, named BPfold, which learns the relationship between RNA sequence and the energy map of base pair motifs. Experiments on sequence-wise and family-wise datasets demonstrate the superiority of BPfold over other state-of-the-art approaches in accuracy and generalizability. We hope this work contributes to integrating physical priors and deep learning methods for the further discovery of RNA structures and functionalities.
Installation
Requirements
- Linux system
- Python 3.6+
- Anaconda
Instructions
- Clone this repo.
git clone git@github.com:heqin-zhu/BPfold.git
cd BPfold
- Create and activate BPfold environment.
conda env create -f BPfold_environment.yaml
conda activate BPfold
- Download model_predict.tar.gz from the releases and decompress it.
wget https://github.com/heqin-zhu/BPfold/releases/download/v0.1/model_predict.tar.gz
tar -xzf model_predict.tar.gz -C src/BPfold/paras
- Download the datasets BPfold_data.tar.gz from the releases and decompress them.
wget https://github.com/heqin-zhu/BPfold/releases/download/v0.1/BPfold_data.tar.gz
tar -xzf BPfold_data.tar.gz
Usage
BPfold motif library
The base pair motif library is publicly available in the releases and contains motif:energy pairs. A motif is represented as sequence_pairIdx_pairIdx-chainBreak, where each pairIdx is 0-indexed, and the energy is a reference score of statistical and physical thermodynamic energy.
For instance, CAAAAUG_0_6-3 -49.7835 represents the motif CAAAAUG, which has a known pair C-G at indexes 0 and 6, with the chainBreak lying at position 3.
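To make the record format concrete, here is a minimal Python sketch that splits one library entry into its fields. It assumes each record is a single whitespace-separated line shaped like the example above; the actual file layout in the release may differ. `MotifEntry` and `parse_motif_line` are hypothetical names used for illustration and are not part of BPfold.

```python
from typing import NamedTuple


class MotifEntry(NamedTuple):
    """One motif record (hypothetical container, not part of BPfold)."""
    sequence: str
    pair_i: int       # 0-indexed position of the first paired base
    pair_j: int       # 0-indexed position of the second paired base
    chain_break: int  # position of the chain break
    energy: float     # reference energy score


def parse_motif_line(line: str) -> MotifEntry:
    """Parse a 'sequence_pairIdx_pairIdx-chainBreak energy' record.

    Assumes the motif key and the energy are separated by whitespace,
    as in the example line of this README.
    """
    key, energy = line.split()
    sequence, pair_i, rest = key.split("_")
    pair_j, chain_break = rest.split("-")
    return MotifEntry(sequence, int(pair_i), int(pair_j),
                      int(chain_break), float(energy))


entry = parse_motif_line("CAAAAUG_0_6-3 -49.7835")
# The two indexed bases form the known pair of the motif.
print(entry.sequence[entry.pair_i], entry.sequence[entry.pair_j])  # C G
```

A dictionary keyed by the motif string would serve equally well for lookups; the named tuple is used here only to make the individual fields explicit.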
[!NOTE] The base pair motif library can be used as thermodynamic priors in other models.
BPfold Prediction
Use BPfold to predict RNA secondary structures. The following are some examples. The out_type can be `csv`, `bpseq`, `ct` or `dbn`, and defaults to `csv`.
python3 -m src.BPfold.predict --checkpoint_dir PATH_TO_CHECKPOINT --seq GGUAAAACAGCCUGU AGUAGGAUGUAUAUG --output BPfold_results
python3 -m src.BPfold.predict --checkpoint_dir PATH_TO_CHECKPOINT --input examples/examples.fasta # (multiple sequences are supported)
python3 -m src.BPfold.predict --checkpoint_dir PATH_TO_CHECKPOINT --input examples/URS0000D6831E_12908_1-117.bpseq # .bpseq, .ct, .dbn
Example of BPfold prediction
Here are the outputs after running BPfold --input examples/examples.fasta --out_type bpseq:
>> Welcome to use "BPfold" for predicting RNA secondary structure!
Loading paras/model_predict/BPfold_1-6.pth
Loading paras/model_predict/BPfold_2-6.pth
Loading paras/model_predict/BPfold_3-6.pth
Loading paras/model_predict/BPfold_4-6.pth
Loading paras/model_predict/BPfold_5-6.pth
Loading paras/model_predict/BPfold_6-6.pth
[ 1] saved in "BPfold_results/SS/5s_Shigella-flexneri-3.bpseq", CI=0.980
CUGGCGGCAGUUGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAACUGCCAG
(((((((.....((((((((.....((((((.............))))..))....)))))).)).((.((....((((((((...))))))))....)).))...)))))))
[ 2] saved in "BPfold_results/SS/URS0000D6831E_12908_1-117.bpseq", CI=0.931
UUAUCUCAUCAUGAGCGGUUUCUCUCACAAACCCGCCAACCGAGCCUAAAAGCCACGGUGGUCAGUUCCGCUAAAAGGAAUGAUGUGCCUUUUAUUAGGAAAAAGUGGAACCGCCUG
......((((((..(.(((((.......))))))(((.((((.((......))..))))))).................))))))..(((......)))..................
Finished!
For more help information, run BPfold -h.
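The dot-bracket strings printed under each sequence encode base pairs as matching parentheses. The sketch below shows one way to recover the list of paired positions from such a string. It is an illustrative helper, not BPfold code: `dot_bracket_to_pairs` is a hypothetical name, and it handles only plain `(`, `)` and `.` characters, so pseudoknot notations with other bracket types are outside its scope.

```python
from typing import List, Tuple


def dot_bracket_to_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return 0-indexed (i, j) base pairs encoded by a dot-bracket string."""
    stack: List[int] = []          # positions of unmatched '('
    pairs: List[Tuple[int, int]] = []
    for idx, char in enumerate(structure):
        if char == "(":
            stack.append(idx)
        elif char == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {idx}")
            pairs.append((stack.pop(), idx))
        elif char != ".":
            raise ValueError(f"unsupported character {char!r} at position {idx}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(dot_bracket_to_pairs("((...))"))  # [(0, 6), (1, 5)]
```

The stack mirrors the nesting of the structure: every closing bracket pairs with the most recently opened position.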
Reproduction
For reproduction of all the quantitative results, we provide the predicted secondary structures and the model parameters of BPfold used in the experiments. You can either directly download the secondary structures predicted by BPfold, or use BPfold with the trained parameters to predict these secondary structures, and then evaluate the predicted results.
Directly download
wget https://github.com/heqin-zhu/BPfold/releases/download/v0.1/BPfold_test_results.tar.gz
tar -xzf BPfold_test_results.tar.gz
Use BPfold
- Download model_reproduce.tar.gz from the releases and decompress it.
wget https://github.com/heqin-zhu/BPfold/releases/download/v0.1/model_reproduce.tar.gz
tar -xzf model_reproduce.tar.gz -C src/BPfold/paras
- Use BPfold to predict test sequences.
Evaluate
python3 -m src.BPfold.evaluate --data_dir BPfold_data --pred_dir BPfold_test_results
After running the above command for evaluation, you will see the following outputs:
Outputs of evaluating BPfold
Time used: 29s
[Summary] eval_BPfold_test_results.yaml
Pred/Total num: [('PDB_test', 116, 116), ('Rfam12.3-14.10', 10791, 10791), ('archiveII', 3966, 3966), ('bpRNA', 1305, 1305), ('bpRNAnew', 5401, 5401)]
-------------------------len>600-------------------------
dataset & num & INF & F1 & P & R \\
Rfam12.3-14.10 & 64 & 0.395 & 0.387 & 0.471 & 0.333\\
archiveII & 55 & 0.352 & 0.311 & 0.580 & 0.242\\
------------------------len<=600-------------------------
dataset & num & INF & F1 & P & R \\
PDB_test & 116 & 0.817 & 0.814 & 0.840 & 0.801\\
Rfam12.3-14.10 & 10727 & 0.696 & 0.690 & 0.662 & 0.743\\
archiveII & 3911 & 0.829 & 0.827 & 0.821 & 0.843\\
bpRNA & 1305 & 0.670 & 0.658 & 0.599 & 0.770\\
bpRNAnew & 5401 & 0.655 & 0.647 & 0.604 & 0.723\\
---------------------------all---------------------------
dataset & num & INF & F1 & P & R \\
PDB_test & 116 & 0.817 & 0.814 & 0.840 & 0.801\\
Rfam12.3-14.10 & 10791 & 0.694 & 0.689 & 0.660 & 0.741\\
archiveII & 3966 & 0.823 & 0.820 & 0.818 & 0.834\\
bpRNA & 1305 & 0.670 & 0.658 & 0.599 & 0.770\\
bpRNAnew & 5401 & 0.655 & 0.647 & 0.604 & 0.723\\
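For readers unfamiliar with the column names, the sketch below computes the four scores for a single predicted structure. It assumes that P and R stand for precision and recall over base pairs and that INF is the interaction network fidelity, taken here as the geometric mean of precision and recall. This is an illustration only: `pair_metrics` is a hypothetical helper, and the BPfold evaluation script may define or aggregate these scores differently.

```python
import math
from typing import Dict, Iterable, Tuple

Pair = Tuple[int, int]


def pair_metrics(pred: Iterable[Pair], true: Iterable[Pair]) -> Dict[str, float]:
    """Base-pair precision, recall, F1 and INF for one structure.

    Pairs are treated as unordered, so (i, j) and (j, i) are the same pair.
    """
    pred_set = {tuple(sorted(p)) for p in pred}
    true_set = {tuple(sorted(p)) for p in true}
    tp = len(pred_set & true_set)  # correctly predicted pairs
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    inf = math.sqrt(precision * recall)  # assumed definition of INF
    return {"INF": inf, "F1": f1, "P": precision, "R": recall}


scores = pair_metrics(pred=[(0, 6), (1, 5), (2, 4)], true=[(0, 6), (1, 5)])
print({name: f"{value:.3f}" for name, value in scores.items()})
# {'INF': '0.816', 'F1': '0.800', 'P': '0.667', 'R': '1.000'}
```

In this toy case two of the three predicted pairs are correct and both true pairs are recovered, which gives a precision of 2/3 and a recall of 1.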
Acknowledgement
We appreciate the following open source projects:
LICENSE
Citation
If you use our code, please consider citing our paper:
@article {Zhu2024.10.22.619430,
author = {Zhu, Heqin and Tang, Fenghe and Quan, Quan and Chen, Ke and Xiong, Peng and Zhou, S. Kevin},
title = {Deep generalizable prediction of RNA secondary structure via base pair motif energy},
elocation-id = {2024.10.22.619430},
year = {2024},
doi = {10.1101/2024.10.22.619430},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/10/25/2024.10.22.619430},
eprint = {https://www.biorxiv.org/content/early/2024/10/25/2024.10.22.619430.full.pdf},
journal = {bioRxiv}
}