Structure-guided RNA foundation model.
A fully open structure-guided RNA foundation model for robust structural and functional inference
Heqin Zhu · Ruifeng Li · Feng Zhang · Fenghe Tang · Tong Ye · Xin Li · Yunjie Gu · Peng Xiong* · S. Kevin Zhou*
Submitted
Abstract
RNA language models have achieved strong performance across diverse downstream tasks by leveraging large-scale sequence data. However, RNA function is fundamentally shaped by its hierarchical structure, making the integration of structural information into pretraining essential. Existing methods often depend on noisy structural annotations or introduce task-specific biases, limiting model generalizability. Here, we introduce structRFM, a structure-guided RNA foundation model that is pretrained by implicitly incorporating large-scale base pairing interactions and sequence data, using a dynamic masking ratio to balance nucleotide-level and structure-level masking. structRFM learns joint knowledge of sequential and structural data, producing versatile representations (classification-level, sequence-level, and pairwise matrix features) that support broad downstream adaptations. structRFM ranks among the top models in zero-shot homology classification across fifteen biological language models, and sets new benchmarks for secondary structure prediction, achieving F1 scores of 0.873 on ArchiveII and 0.641 on the bpRNA-TS0 dataset. structRFM further enables robust and reliable tertiary structure prediction, with consistent improvements in both 3D accuracy and extracted 2D structures. In functional tasks such as internal ribosome entry site identification, structRFM achieves a 49% performance gain. These results demonstrate the effectiveness of structure-guided pretraining and highlight a promising direction for developing multi-modal RNA language models in computational biology.
Installation
Requirements
- Python 3.8+
- Anaconda
Instructions
- Clone this repo.
git clone git@github.com:heqin-zhu/structRFM.git
cd structRFM
- Create and activate conda environment.
conda env create -f environment.yaml
conda activate structRFM
- Install structRFM.
pip3 install structRFM
- Download and decompress pretrained structRFM (305 M).
wget https://github.com/heqin-zhu/structRFM/releases/latest/download/structRFM_checkpoint.tar.gz
tar -xzf structRFM_checkpoint.tar.gz
- Set the environment variable structRFM_checkpoint.
export structRFM_checkpoint=PATH_TO_CHECKPOINT # modify ~/.bashrc for permanent setting
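Before running the demo, it can help to confirm that the variable is visible to Python and points at an existing location. The following is a minimal sketch using only the standard library. The helper name `resolve_checkpoint` is hypothetical (it is not part of the structRFM package), and the sketch assumes the checkpoint is a file or directory on local disk.

```python
import os
from pathlib import Path


def resolve_checkpoint(var_name="structRFM_checkpoint"):
    """Return the checkpoint path from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of the structRFM package.
    """
    value = os.environ.get(var_name)
    if not value:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    path = Path(value).expanduser()
    if not path.exists():
        raise FileNotFoundError(f"{var_name} points to a missing path: {path}")
    return path
```

Calling `resolve_checkpoint()` in a fresh shell is a quick way to see whether the `export` line took effect or still needs to be added to `~/.bashrc`.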
Pretraining
Download sequence-structure dataset
The pretraining sequence-structure dataset is constructed using RNAcentral and BPfold. We keep only sequences of length at most 512, resulting in about 21 million sequence-structure pairs. It can be downloaded from Zenodo (4.5 GB).
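The length filter described above can be sketched as follows. This is an illustration only, not the authors' preprocessing code. The function name `filter_pairs`, the `(name, sequence, dot-bracket structure)` record layout, the reading of the limit as "at most 512", the extra check that sequence and structure have equal length, and the toy records are all assumptions.

```python
MAX_LEN = 512  # length limit stated in the dataset description


def filter_pairs(records, max_len=MAX_LEN):
    """Keep (name, sequence, structure) records that fit the length limit.

    Assumption: records whose sequence and dot-bracket structure differ in
    length are also dropped, since they cannot be aligned position by position.
    """
    kept = []
    for name, seq, struct in records:
        if len(seq) <= max_len and len(seq) == len(struct):
            kept.append((name, seq, struct))
    return kept


# Toy records, made up for illustration.
toy = [
    ("short", "GGGAAACCC", "(((...)))"),
    ("too_long", "A" * 600, "." * 600),
    ("mismatch", "ACGU", "..."),
]
kept_names = [name for name, _, _ in filter_pairs(toy)]
print(kept_names)  # ['short']
```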
Run pretraining
Modify variables USER_DIR, PROGRAM_DIR, DATA_DIR, and OUT_DIR in run.sh, then run:
bash ./run.sh --print --batch_size 128 --epoch 100 --lr 0.0001 --tag mlm --mlm_structure
Extract RNA sequence features
demo.py
import os
from structRFM.infer import structRFM_infer
from_pretrained = os.getenv('structRFM_checkpoint')
model = structRFM_infer(from_pretrained=from_pretrained, max_length=514)
seq = 'AGUACGUAGUA'
output_attentions = True
print('seq len:', len(seq))
# (1+L+1)x 768, [CLS] seq [SEP]
features, attentions = model.extract_feature(seq, return_all=True, output_attentions=output_attentions)
# feat tuple: layer=12, tuple[i]: batch x L x hidden_dim(=768)
last_feat = features[-1]
# classification feature, 1x768
cls_feat = last_feat[0,:] # 1x768
# sequence feature, Lx768
feat1d = last_feat[1:-1, :] # Lx768
# matrix_feature, LxL
feat2d = feat1d @ feat1d.transpose(-1,-2) # LxL
print('classification feature:', cls_feat.shape)
print('sequence feature:', feat1d.shape)
print('matrix feature:', feat2d.shape)
# atten tuple: layer=12, tuple[i]: batch x head(=12) x L x L
# remove special tokens
attentions = tuple([atten[:, :, 1:-1, 1:-1] for atten in attentions])
print('attentions', len(attentions), attentions[0].shape)
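The indexing in demo.py can be checked without a checkpoint. The sketch below uses plain Python lists as a stand-in for one layer's output of shape (1+L+1) x hidden_dim. The tiny hidden size, the fill values, and the helper names `strip_special` and `gram_matrix` are illustrative assumptions; the real model returns tensors instead of lists.

```python
def strip_special(rows):
    """Drop the first ([CLS]) and last ([SEP]) rows, leaving L rows."""
    return rows[1:-1]


def gram_matrix(rows):
    """Pairwise dot products: entry (i, j) is <rows[i], rows[j]>, shape L x L."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


seq = "AGUACGUAGUA"
hidden_dim = 4  # stand-in for 768, kept small for readability

# One row per token: [CLS], the L nucleotides, [SEP]. Values are arbitrary.
last_feat = [[float(i + j) for j in range(hidden_dim)] for i in range(len(seq) + 2)]

cls_feat = last_feat[0]              # classification feature
feat1d = strip_special(last_feat)    # sequence feature, L x hidden_dim
feat2d = gram_matrix(feat1d)         # matrix feature, L x L

print(len(cls_feat))                 # 4
print(len(feat1d), len(feat1d[0]))   # 11 4
print(len(feat2d), len(feat2d[0]))   # 11 11
```

Because the matrix feature is a product of the sequence feature with its own transpose, it is symmetric: entry (i, j) equals entry (j, i).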
Downstream Tasks
Download all data (3.7 GB) and checkpoints (2.2 GB) from Zenodo, then place them in the corresponding folder for each task.
- Zero-shot inference
- Structure prediction
- Function prediction
Acknowledgement
We appreciate the following open-source projects for their valuable contributions:
LICENSE
Citation
If you find our work helpful, please cite our paper:
@article {structRFM,
author = {Zhu, Heqin and Li, Ruifeng and Zhang, Feng and Tang, Fenghe and Ye, Tong and Li, Xin and Gu, Yujie and Xiong, Peng and Zhou, S Kevin},
title = {A fully-open structure-guided RNA foundation model for robust structural and functional inference},
elocation-id = {2025.08.06.668731},
year = {2025},
doi = {10.1101/2025.08.06.668731},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/08/07/2025.08.06.668731},
journal = {bioRxiv}
}