Deep generalizable prediction of RNA secondary structure via base pair motif energy.

Maintainer: heqinzhu
License: MIT (OSI Approved)
Operating system: OS independent
Programming language: Python 3

Heqin Zhu · Fenghe Tang · Quan Quan · Ke Chen · Peng Xiong* · S. Kevin Zhou*

Paper | PDF | poster | GitHub | PyPI

Introduction
Installation
Usage
Reproduction
Acknowledgement
LICENSE
Citation

Introduction

Deep learning methods have demonstrated great performance for RNA secondary structure prediction. However, generalizability to unseen, out-of-distribution RNA families remains a common unsolved issue, which hinders further improvement of the accuracy and robustness of deep learning methods. Here we construct a base pair motif library that enumerates the complete space of locally adjacent three-neighbor base pairs and records the thermodynamic energy of the corresponding base pair motifs through de novo modeling of tertiary structures. We further develop a deep learning approach for RNA secondary structure prediction, named BPfold, which learns the relationship between the RNA sequence and the energy map of base pair motifs. Experiments on sequence-wise and family-wise datasets demonstrate the superiority of BPfold over other state-of-the-art approaches in accuracy and generalizability. We hope this work contributes to integrating physical priors and deep learning methods for the further discovery of RNA structures and functionalities.

Installation

Requirements

Python 3.8+
Anaconda

Use base pair motif library

pip3 install BPfold

Predict RNA secondary structure

Clone this repo

git clone git@github.com:heqin-zhu/BPfold.git
cd BPfold

Create and activate BPfold environment.

conda env create -f BPfold_environment.yaml
conda activate BPfold

Download model_predict.tar.gz from the releases page and decompress it.

wget https://github.com/heqin-zhu/BPfold/releases/latest/download/model_predict.tar.gz
tar -xzf model_predict.tar.gz

Optional (for training and evaluation): download the dataset archive BPfold_data.tar.gz from the releases page and decompress it.

wget https://github.com/heqin-zhu/BPfold/releases/latest/download/BPfold_data.tar.gz
tar -xzf BPfold_data.tar.gz

Usage

Base pair motif library

The base pair motif library is publicly available in releases and contains motif:energy pairs. Each motif is written as sequence_pairIdx_pairIdx-chainBreak, where pairIdx is 0-indexed, and the energy is a reference score of statistical and physical thermodynamic energy. For instance, CAAAAUG_0_6-3 -49.7835 denotes the motif CAAAAUG with a known C-G pair at indexes 0 and 6, a chain break at position 3, and an energy of -49.7835.
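As a minimal sketch of the naming scheme above, the following Python snippet splits one library entry into its parts. It assumes that each entry is a single line of the form `motif energy` separated by whitespace, exactly as in the example; the `MotifEntry` class and the `parse_motif_entry` function are hypothetical names used for illustration and are not part of the BPfold package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotifEntry:
    """One motif:energy record (hypothetical container, not a BPfold class)."""
    sequence: str
    pair_i: int       # 0-indexed position of the first paired base
    pair_j: int       # 0-indexed position of the second paired base
    chain_break: int  # position of the chain break
    energy: float     # reference energy score


def parse_motif_entry(line: str) -> MotifEntry:
    """Parse 'sequence_pairIdx_pairIdx-chainBreak energy' (assumed layout)."""
    key, energy = line.split()
    # Split off the chain break first, then the sequence and the two pair indexes.
    head, chain_break = key.rsplit('-', 1)
    sequence, pair_i, pair_j = head.split('_')
    return MotifEntry(sequence, int(pair_i), int(pair_j), int(chain_break), float(energy))


entry = parse_motif_entry('CAAAAUG_0_6-3 -49.7835')
print(entry)
# The two indexed bases form the known pair (C and G in this example).
print(entry.sequence[entry.pair_i], entry.sequence[entry.pair_j])
```

The real library files may use a different separator or carry extra columns, so the parsing step should be checked against the downloaded release before use.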

Note: The base pair motif library can be used as thermodynamic priors in other models.

For an input RNA sequence seq, the base pair motif energy matrix mat can be obtained directly as follows:

from BPfold.util.base_pair_motif import BPM_energy

BPM = BPM_energy()

seq = 'AUGCGUAGTa'
# Default (recommended, and what BPfold itself uses): energies normalized to [-1, 1], shape 2xLxL
mat = BPM.get_energy(seq)

# Original (unnormalized) energies, e.g. -50.3 or 49.7, shape 1xLxL
mat2 = BPM.get_energy(seq, normalize_energy=False, dispart_outer_inner=False)

BPfold for secondary structure prediction

Run command line

Args:

--checkpoint_dir: required, specify checkpoint dir path.
--seq: specify one or more input RNA sequences.
--input: specify an input file of RNA sequences in .fasta (multiple sequences are supported), .bpseq, .ct, or .dbn format.
--output: output directory (created automatically), default BPfold_results.
--out_type: output format of the RNA secondary structures, one of .csv, .bpseq, .ct, or .dbn; default .csv.

Here are some examples:

BPfold --checkpoint_dir PATH_TO_CHECKPOINT_DIR --seq GGUAAAACAGCCUGU AGUAGGAUGUAUAUG --output BPfold_results
BPfold --checkpoint_dir PATH_TO_CHECKPOINT_DIR --input examples/examples.fasta --out_type csv # (multiple sequences are supported)
BPfold --checkpoint_dir PATH_TO_CHECKPOINT_DIR --input examples/URS0000D6831E_12908_1-117.bpseq
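When the command line tool is driven from a script, it can help to assemble the argument list programmatically. The sketch below builds such a list using only the flags documented above; `build_bpfold_command` is a hypothetical helper, not part of BPfold, and whether the CLI accepts `--seq` and `--input` together is not stated here, so the helper simply forwards whatever it is given.

```python
import shlex
from typing import List, Optional, Sequence


def build_bpfold_command(
    checkpoint_dir: str,
    seqs: Optional[Sequence[str]] = None,
    input_path: Optional[str] = None,
    output: str = 'BPfold_results',
    out_type: str = 'csv',
) -> List[str]:
    """Assemble a BPfold command line from the documented flags (hypothetical helper)."""
    if not seqs and not input_path:
        raise ValueError('provide at least one of seqs or input_path')
    cmd = ['BPfold', '--checkpoint_dir', checkpoint_dir]
    if seqs:
        cmd += ['--seq', *seqs]
    if input_path:
        cmd += ['--input', input_path]
    cmd += ['--output', output, '--out_type', out_type]
    return cmd


cmd = build_bpfold_command('model_predict', input_path='examples/examples.fasta', out_type='bpseq')
print(shlex.join(cmd))
# To actually run it (requires BPfold to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.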

Example of BPfold prediction

Here are the outputs after running BPfold --checkpoint_dir model_predict --input examples/examples.fasta --out_type bpseq:

>> Welcome to use "BPfold" for predicting RNA secondary structure!
Loading model_predict/BPfold_1-6.pth
Loading model_predict/BPfold_2-6.pth
Loading model_predict/BPfold_3-6.pth
Loading model_predict/BPfold_4-6.pth
Loading model_predict/BPfold_5-6.pth
Loading model_predict/BPfold_6-6.pth
[      1] saved in "BPfold_results/1M5L.bpseq", CI=0.913
GCGCAGGACUCGGCUUCUUCGGAAGGGACGAGGGGCGC
((((....((((.(((((..)))))...))))..))))
............(..............).......... NC
((((....((((((((((..)))))..)))))..)))) MIX
[      2] saved in "BPfold_results/URS0000D6831E_12908_1-117.bpseq", CI=0.892
UUAUCUCAUCAUGAGCGGUUUCUCUCACAAACCCGCCAACCGAGCCUAAAAGCCACGGUGGUCAGUUCCGCUAAAAGGAAUGAUGUGCCUUUUAUUAGGAAAAAGUGGAACCGCCUG
......((((((.....((((.......))))..(((.((((.((......))..))))))).................))))))..(((......)))..................
..................................................................................................................... NC
......((((((.....((((.......))))..(((.((((.((......))..))))))).................))))))..(((......))).................. MIX
Confidence indexes are saved in "BPfold_results_confidence_20250915_03h19m33s.yaml"
Program Finished!

Note: Results (dbn, connects, bpseq, ...) with no tag contain the predicted canonical pairs, those tagged _nc contain the predicted non-canonical pairs, and those tagged _mix contain both canonical and non-canonical pairs (i.e., all base pairs). To ignore non-canonical pairs, pass the argument --ignore_nc to BPfold.
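To make the relation between the three structure lines concrete, the sketch below converts dot-bracket strings into sets of 0-indexed base pairs and checks that, for the 1M5L example above, the canonical and NC pairs together give the MIX structure. It is an illustrative helper rather than BPfold code: `dbn_to_pairs` is a hypothetical name, and it handles only `(`, `)`, and `.` (no pseudoknot brackets).

```python
from typing import List, Set, Tuple


def dbn_to_pairs(dbn: str) -> Set[Tuple[int, int]]:
    """Return the 0-indexed base pairs (i, j), i < j, encoded by a dot-bracket string."""
    stack: List[int] = []
    pairs: Set[Tuple[int, int]] = set()
    for i, ch in enumerate(dbn):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            if not stack:
                raise ValueError(f'unmatched ")" at position {i}')
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError(f'unmatched "(" at position {stack[-1]}')
    return pairs


# The three structure lines printed for 1M5L in the example output above.
canonical = '((((....((((.(((((..)))))...))))..))))'
non_canonical = '............(..............)..........'
mixed = '((((....((((((((((..)))))..)))))..))))'

all_pairs = dbn_to_pairs(canonical) | dbn_to_pairs(non_canonical)
print(sorted(dbn_to_pairs(non_canonical)))
print(all_pairs == dbn_to_pairs(mixed))
```

Working with pair sets rather than strings makes it straightforward to combine or compare structures without worrying about bracket nesting.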

Run command BPfold -h for more help information.

Import python code

Specify arguments:

checkpoint_dir
at least one of input_seqs (list of seqs) and input_path (fasta_path)

from BPfold.predict import BPfold_predict
from BPfold.util.RNA_kit import connects2dbn


## arguments
checkpoint_dir = '' # to be specified
input_seqs = ['GCGCAGGACUCGGCUUCUUCGGAAGGGACGAGGGGCGC', 'AUGUAUGUCCUGUCGUA']
input_path = 'examples/examples.fasta'

## init model
BPfold_predictor = BPfold_predict(checkpoint_dir)

## BPfold predict  # specify at least one of input_seqs and input_path
pred_results = BPfold_predictor.predict(input_seqs=input_seqs, input_path=input_path, ignore_nc=False)

for dic in pred_results:
    print(f'>{dic["seq_name"]}')
    print(dic["seq"])
    print(connects2dbn(dic["connects"]), f'CI={dic["CI"]:.3f}')

Results of BPfold prediction

Loading /public2/home/heqinzhu/gitrepo/RNA/SS_pred/BPfold/src/BPfold/paras/model_predict/BPfold_1-6.pth
Loading /public2/home/heqinzhu/gitrepo/RNA/SS_pred/BPfold/src/BPfold/paras/model_predict/BPfold_2-6.pth
Loading /public2/home/heqinzhu/gitrepo/RNA/SS_pred/BPfold/src/BPfold/paras/model_predict/BPfold_3-6.pth
Loading /public2/home/heqinzhu/gitrepo/RNA/SS_pred/BPfold/src/BPfold/paras/model_predict/BPfold_4-6.pth
Loading /public2/home/heqinzhu/gitrepo/RNA/SS_pred/BPfold/src/BPfold/paras/model_predict/BPfold_5-6.pth
Loading /public2/home/heqinzhu/gitrepo/RNA/SS_pred/BPfold/src/BPfold/paras/model_predict/BPfold_6-6.pth
>seq_20250929_14h23m28s_1
GCGCAGGACUCGGCUUCUUCGGAAGGGACGAGGGGCGC
((((....((((.(((((..)))))...))))..)))) CI=0.913
>seq_20250929_14h23m28s_2
AUGUAUGUCCUGUCGUA
.....((......)).. CI=0.807
>1M5L
GCGCAGGACUCGGCUUCUUCGGAAGGGACGAGGGGCGC
((((....((((.(((((..)))))...))))..)))) CI=0.913
>URS0000D6831E_12908_1-117
UUAUCUCAUCAUGAGCGGUUUCUCUCACAAACCCGCCAACCGAGCCUAAAAGCCACGGUGGUCAGUUCCGCUAAAAGGAAUGAUGUGCCUUUUAUUAGGAAAAAGUGGAACCGCCUG
......((((((.....((((.......))))..(((.((((.((......))..))))))).................))))))..(((......))).................. CI=0.892
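The dot-bracket strings printed above can also be turned into the bpseq layout mentioned earlier. The sketch below assumes the usual bpseq convention of one line per nucleotide, `index base partner`, with 1-based indexes and 0 for an unpaired base; `dbn_to_bpseq_lines` is a hypothetical helper, not a BPfold function, and it ignores pseudoknot brackets.

```python
from typing import List


def dbn_to_bpseq_lines(seq: str, dbn: str) -> List[str]:
    """Render a sequence and its dot-bracket structure as bpseq lines (1-based, 0 = unpaired)."""
    if len(seq) != len(dbn):
        raise ValueError('sequence and structure must have the same length')
    partner = [0] * len(seq)
    stack: List[int] = []
    for i, ch in enumerate(dbn):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            if not stack:
                raise ValueError(f'unmatched ")" at position {i}')
            j = stack.pop()
            # Store 1-based partner indexes in both directions.
            partner[i] = j + 1
            partner[j] = i + 1
    if stack:
        raise ValueError(f'unmatched "(" at position {stack[-1]}')
    return [f'{i + 1} {base} {partner[i]}' for i, base in enumerate(seq)]


# Second sequence from the results above.
lines = dbn_to_bpseq_lines('AUGUAUGUCCUGUCGUA', '.....((......))..')
print('\n'.join(lines))
```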

Evaluation

Specify pred_dir and gt_dir. Each directory should contain secondary structures in bpseq, ct, or dbn format.

BPfold_eval --pred_dir BPfold_results  --gt_dir PATH_TO_NATIVE_STRUCTURES

Reproduction

To reproduce all the quantitative results, we provide the predicted secondary structures and the model parameters of BPfold used in the experiments. You can either directly download the secondary structures predicted by BPfold, or use BPfold v0.2.0 with the trained parameters to predict these secondary structures yourself, and then evaluate the predicted results.

Directly download

wget https://github.com/heqin-zhu/BPfold/releases/download/v0.2/BPfold_test_results.tar.gz
tar -xzf BPfold_test_results.tar.gz

Use BPfold

Download the checkpoints of BPfold: model_reproduce.tar.gz.

wget https://github.com/heqin-zhu/BPfold/releases/download/v0.2/model_reproduce.tar.gz
tar -xzf model_reproduce.tar.gz

Install BPfold version 0.2.4.

pip install BPfold==0.2.4

Use BPfold to predict the secondary structures of the RNA sequences in the test datasets.

Evaluate

BPfold_eval --gt_dir BPfold_data --pred_dir BPfold_test_results

After running the above commands for evaluation, you will see output like the following:

Outputs of evaluating BPfold

Time used: 29s
[Summary] eval_BPfold_test_results.yaml
 Pred/Total num: [('PDB_test', 116, 116), ('Rfam12.3-14.10', 10791, 10791), ('archiveII', 3966, 3966), ('bpRNA', 1305, 1305), ('bpRNAnew', 5401, 5401)]
-------------------------len>600-------------------------
dataset         & num   & INF   & F1    & P     & R    \\
Rfam12.3-14.10  & 64    & 0.395 & 0.387 & 0.471 & 0.333\\
archiveII       & 55    & 0.352 & 0.311 & 0.580 & 0.242\\
------------------------len<=600-------------------------
dataset         & num   & INF   & F1    & P     & R    \\
PDB_test        & 116   & 0.817 & 0.814 & 0.840 & 0.801\\
Rfam12.3-14.10  & 10727 & 0.696 & 0.690 & 0.662 & 0.743\\
archiveII       & 3911  & 0.829 & 0.827 & 0.821 & 0.843\\
bpRNA           & 1305  & 0.670 & 0.658 & 0.599 & 0.770\\
bpRNAnew        & 5401  & 0.655 & 0.647 & 0.604 & 0.723\\
---------------------------all---------------------------
dataset         & num   & INF   & F1    & P     & R    \\
PDB_test        & 116   & 0.817 & 0.814 & 0.840 & 0.801\\
Rfam12.3-14.10  & 10791 & 0.694 & 0.689 & 0.660 & 0.741\\
archiveII       & 3966  & 0.823 & 0.820 & 0.818 & 0.834\\
bpRNA           & 1305  & 0.670 & 0.658 & 0.599 & 0.770\\
bpRNAnew        & 5401  & 0.655 & 0.647 & 0.604 & 0.723\\
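The column names in the output above suggest precision (P), recall (R), F1, and interaction network fidelity (INF, commonly defined as the geometric mean of precision and recall). The sketch below computes these for a single structure from sets of base pairs. It is an assumption-laden illustration: `pair_metrics` is a hypothetical helper, and the exact conventions of BPfold_eval (for example, how pairs are matched or how per-structure scores are averaged) are not described here.

```python
import math
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]


def pair_metrics(pred: Set[Pair], native: Set[Pair]) -> Dict[str, float]:
    """Compute P, R, F1, and INF for one predicted structure against its native structure."""
    tp = len(pred & native)  # correctly predicted base pairs
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(native) if native else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    inf = math.sqrt(precision * recall)  # geometric mean of precision and recall
    return {'INF': inf, 'F1': f1, 'P': precision, 'R': recall}


# Toy example: one of two predicted pairs is correct, one of two native pairs is found.
scores = pair_metrics({(0, 5), (1, 4)}, {(0, 5), (2, 3)})
print(scores)
```

Because dataset-level numbers are typically averages over structures, the INF reported for a dataset need not equal the square root of that dataset's averaged P times R.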

Acknowledgement

We appreciate the following open source projects:

LICENSE

MIT LICENSE

Citation

If you find our work helpful, please cite our paper:

@article{BPfold,
    title = {Deep generalizable prediction of {RNA} secondary structure via base pair motif energy},
    author = {Zhu, Heqin and Tang, Fenghe and Quan, Quan and Chen, Ke and Xiong, Peng and Zhou, S. Kevin},
    volume = {16},
    issn = {2041-1723},
    url = {https://doi.org/10.1038/s41467-025-60048-1},
    doi = {10.1038/s41467-025-60048-1},
    number = {1},
    journal = {Nature Communications},
    month = jul,
    year = {2025},
    pages = {5856},
}
