
Structure-guided RNA foundation model.

Project description

A fully open structure-guided RNA foundation model for robust structural and functional inference

Heqin Zhu · Ruifeng Li · Feng Zhang · Fenghe Tang
Tong Ye · Xin Li · Yunjie Gu · Peng Xiong* · S. Kevin Zhou*

Submitted

bioRxiv | PDF | GitHub | PyPI

Overview

Abstract

RNA language models have achieved strong performance across diverse downstream tasks by leveraging large-scale sequence data. However, RNA function is fundamentally shaped by its hierarchical structure, making the integration of structural information into pretraining essential. Existing methods often depend on noisy structural annotations or introduce task-specific biases, limiting model generalizability. Here, we introduce structRFM, a structure-guided RNA foundation model pretrained by implicitly incorporating large-scale base-pairing interactions and sequence data, using a dynamic masking ratio to balance nucleotide-level and structure-level masking. structRFM learns joint knowledge of sequential and structural data, producing versatile representations (classification-level, sequence-level, and pairwise-matrix features) that support broad downstream adaptations. structRFM ranks among the top models in zero-shot homology classification across fifteen biological language models, and sets new benchmarks for secondary structure prediction, achieving F1 scores of 0.873 on ArchiveII and 0.641 on bpRNA-TS0. structRFM further enables robust and reliable tertiary structure prediction, with consistent improvements in both 3D accuracy and extracted 2D structures. In functional tasks such as internal ribosome entry site (IRES) identification, structRFM achieves a 49% performance gain. These results demonstrate the effectiveness of structure-guided pretraining and highlight a promising direction for developing multi-modal RNA language models in computational biology.
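
The abstract's dynamic masking ratio, which balances nucleotide-level and structure-level masking, can be illustrated with a simplified sketch. This is an assumption-laden toy version, not the actual pretraining code: `mixed_mask`, its `struct_ratio` parameter, and the use of `N` as a mask token are all hypothetical stand-ins.

```python
import random

def mixed_mask(seq, structure, struct_ratio, mask_frac=0.15, rng=None):
    """Toy mix of structure-level and nucleotide-level masking.

    struct_ratio controls what fraction of masked positions are drawn
    from base-paired sites (marked '(' or ')' in the dot-bracket
    structure) versus arbitrary sites. Masked positions are replaced
    with 'N' here; a real MLM pipeline would use a [MASK] token id.
    """
    rng = rng or random.Random(0)
    n = len(seq)
    paired = [i for i, c in enumerate(structure) if c in '()']
    n_mask = max(1, int(n * mask_frac))
    n_struct = min(len(paired), int(round(n_mask * struct_ratio)))
    chosen = set(rng.sample(paired, n_struct))
    rest = [i for i in range(n) if i not in chosen]
    chosen |= set(rng.sample(rest, n_mask - n_struct))
    return ''.join('N' if i in chosen else c for i, c in enumerate(seq))

seq = 'AGUACGUAGUAAGUACGUAG'   # 20 nt toy sequence
db  = '((((........))))....'  # toy dot-bracket structure
print(mixed_mask(seq, db, struct_ratio=1.0))  # all masks on paired sites
```

Varying `struct_ratio` during training would shift the objective between plain nucleotide-level MLM (ratio 0) and purely structure-focused masking (ratio 1).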

Key Achievements

  • Zero-shot homology classification: Top-ranked among 15 biological language models.
  • Secondary structure prediction: Sets new state-of-the-art performance (F1 of 0.873 on ArchiveII and 0.641 on bpRNA-TS0).
  • Tertiary structure prediction: The derived method Zfold improves RNA-Puzzles accuracy by 19% over AlphaFold3.
  • Functional inference: Boosts the F1 score by 49% on IRES identification.
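
The F1 scores quoted above follow the standard secondary-structure evaluation: predicted base pairs are compared as a set against reference base pairs. A minimal reference implementation of that metric (not taken from the structRFM codebase):

```python
def base_pair_f1(pred_pairs, true_pairs):
    """F1 over sets of base pairs (i, j) with i < j, as commonly used
    for secondary structure benchmarks such as ArchiveII and bpRNA-TS0."""
    pred, true = set(pred_pairs), set(true_pairs)
    tp = len(pred & true)  # correctly predicted pairs
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)
```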

Installation

Requirements

  • Python 3.8+
  • Anaconda (or Miniconda)

Instructions

  1. Clone this repo.
git clone https://github.com/heqin-zhu/structRFM.git
cd structRFM
  2. Create and activate the conda environment.
conda env create -f structRFM_environment.yaml
conda activate structRFM
  3. Download and decompress the pretrained structRFM checkpoint (305 MB).
wget https://github.com/heqin-zhu/structRFM/releases/latest/download/structRFM_checkpoint.tar.gz
tar -xzf structRFM_checkpoint.tar.gz
  4. Set the environment variable structRFM_checkpoint.
export structRFM_checkpoint=PATH_TO_CHECKPOINT # modify ~/.bashrc for a permanent setting

Usage

Extract RNA sequence features

import os

from structRFM.infer import structRFM_infer

from_pretrained = os.getenv('structRFM_checkpoint')
model_paras = dict(max_length=514, dim=768, layer=12, num_attention_heads=12)
model = structRFM_infer(from_pretrained=from_pretrained, **model_paras)

seq = 'AGUACGUAGUA'

print('seq len:', len(seq))
feat_dic = model.extract_feature(seq)
for k, v in feat_dic.items():
    print(k, v.shape)

'''
seq len: 11
cls_feat torch.Size([768])
seq_feat torch.Size([11, 768])
mat_feat torch.Size([11, 11])
'''
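
The pairwise `mat_feat` output is an L×L matrix over sequence positions. One plausible way to turn such a matrix into discrete base pairs is greedy thresholding with a one-pair-per-nucleotide constraint; this sketch is an illustration on a synthetic matrix and assumes (hypothetically) that higher scores mean higher pairing propensity, which may differ from how structRFM's actual downstream heads post-process the feature:

```python
import numpy as np

def pairs_from_matrix(mat, threshold=0.5):
    """Greedily extract base pairs (i, j) from a pairwise score matrix,
    highest score first, allowing each position at most one partner."""
    m = (mat + mat.T) / 2  # enforce symmetry
    n = m.shape[0]
    used, pairs = set(), []
    # candidate pairs with a minimum hairpin loop of 3 unpaired bases
    idx = [(i, j) for i in range(n) for j in range(i + 4, n)]
    for i, j in sorted(idx, key=lambda p: m[p], reverse=True):
        if m[i, j] < threshold:
            break  # remaining candidates score even lower
        if i not in used and j not in used:
            pairs.append((i, j))
            used |= {i, j}
    return pairs
```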

Build structRFM for finetuning

import os

from structRFM.model import get_structRFM
from structRFM.data import preprocess_and_load_dataset, get_mlm_tokenizer

from_pretrained = os.getenv('structRFM_checkpoint')

tokenizer = get_mlm_tokenizer(max_length=514)
model = get_structRFM(
    dim=768,
    layer=12,
    num_attention_heads=12,
    from_pretrained=from_pretrained,
    pretrained_length=None,
    max_length=514,
    tokenizer=tokenizer,
)
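
For finetuning, a small task head is typically attached on top of the backbone's features. The sketch below is a framework-agnostic (NumPy) illustration of a softmax classification head over the 768-d classification-level feature; the class name and shapes are hypothetical and not part of the structRFM API:

```python
import numpy as np

class ClassifierHead:
    """Minimal linear-plus-softmax head over a 768-d [CLS]-style
    feature -- an illustrative stand-in for the kind of module one
    would attach to the backbone for sequence-level classification."""

    def __init__(self, dim=768, num_classes=2, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=dim ** -0.5, size=(dim, num_classes))
        self.b = np.zeros(num_classes)

    def __call__(self, cls_feat):
        logits = cls_feat @ self.W + self.b
        e = np.exp(logits - logits.max())  # numerically stable softmax
        return e / e.sum()

head = ClassifierHead()
probs = head(np.random.default_rng(1).normal(size=768))
```

In practice this head would be a `torch.nn.Module` trained jointly with (or on top of a frozen) `get_structRFM` backbone.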

Pretraining

Download sequence-structure dataset

The pretraining sequence-structure dataset is constructed from RNAcentral and BPfold. We filter out sequences longer than 512 nucleotides, resulting in about 21 million sequence-structure pairs. The dataset can be downloaded from Zenodo (4.5 GB).
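
The length-filtering step described above amounts to dropping any record whose sequence exceeds the cap. A trivial sketch (the record format here, `(sequence, dot_bracket)` tuples, is an assumption):

```python
def filter_by_length(records, max_len=512):
    """Keep only (sequence, structure) pairs whose sequence fits within
    the pretraining length cap."""
    return [(seq, db) for seq, db in records if len(seq) <= max_len]
```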

Run pretraining

  • Modify the variables USER_DIR and PROGRAM_DIR in scripts/run.sh.
  • Specify DATA_PATH and run_name in the command below.

Then run:

bash scripts/run.sh --batch_size 96 --epoch 100 --lr 0.0001 --tag mlm --mlm_structure --max_length 514 --model_scale base --data_path DATA_PATH --run_name structRFM_512

For more information, run python3 main.py -h.

Downstream Tasks

Download all data (3.7 GB) and checkpoints (2.2 GB) from Zenodo, and place them into the corresponding folder of each task.

Acknowledgement

We appreciate the following open-source projects for their valuable contributions:

License

MIT License

Citation

If you find our work helpful, please cite our paper:

@article{structRFM,
    author = {Zhu, Heqin and Li, Ruifeng and Zhang, Feng and Tang, Fenghe and Ye, Tong and Li, Xin and Gu, Yujie and Xiong, Peng and Zhou, S Kevin},
    title = {A fully-open structure-guided RNA foundation model for robust structural and functional inference},
    elocation-id = {2025.08.06.668731},
    year = {2025},
    doi = {10.1101/2025.08.06.668731},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/08/07/2025.08.06.668731},
    journal = {bioRxiv}
}
