
Structure-guided RNA foundation model.

Project description

structRFM: Structure-guided RNA Foundation Model

License: MIT Python 3.8+ PyTorch

bioRxiv | PDF | GitHub | PyPI

Overview

structRFM is a fully open-source structure-guided RNA foundation model that integrates sequence and structural knowledge through innovative pre-training strategies. By leveraging 21 million sequence-structure pairs and a novel Structure-guided Masked Language Modeling (SgMLM) approach, structRFM achieves state-of-the-art performance across a broad spectrum of RNA structural and functional inference tasks, setting new benchmarks for reliability and generalizability.

Figure: Overview of architecture and downstream applications

Key Features

  • Structure-Guided Pre-Training: SgMLM strategy dynamically balances sequence-level and structure-level masking, capturing base-pair interactions without task-specific biases.
  • Multi-Source Structure Ensemble: MUSES (Multi-source ensemble of secondary structures) integrates thermodynamics-based, probability-based, and deep learning-based predictors to mitigate annotation biases.
  • Versatile Feature Output: Generates classification-level, sequence-level, and pairwise matrix features to support sequence-wise, nucleotide-wise, and structure-wise tasks.
  • State-of-the-Art Performance: Achieves state-of-the-art results on zero-shot, secondary structure prediction, tertiary structure prediction, and function prediction tasks.
  • Zero-Shot Capability: Ranks in the top 4 for zero-shot homology classification across the Rfam and ArchiveII datasets, with strong secondary structure prediction without labeled data.
  • Long RNA Handling: Overlapping sliding window strategy enables high-accuracy classification of long non-coding RNAs (lncRNAs) up to 3,000 nt.
  • Fully Open Resources: 21M sequence-structure dataset, pre-trained models, and fine-tuned checkpoints are publicly available for the research community.
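The overlapping sliding-window strategy for long RNAs can be sketched as follows. This is an illustrative sketch only: the window size and stride here are assumptions, not the exact values structRFM uses internally.

```python
def sliding_windows(seq, window=512, stride=256):
    """Split a long RNA sequence into overlapping windows.

    window and stride are illustrative values. Returns a list of
    (start_offset, subsequence) tuples that together cover the
    full sequence; consecutive windows overlap by window - stride nt.
    """
    if len(seq) <= window:
        return [(0, seq)]
    chunks = []
    start = 0
    while start + window < len(seq):
        chunks.append((start, seq[start:start + window]))
        start += stride
    chunks.append((start, seq[start:]))  # final (possibly shorter) window
    return chunks

# Example: a 1200-nt sequence yields four overlapping windows.
long_seq = 'ACGU' * 300  # 1200 nt
windows = sliding_windows(long_seq)
```

Per-window features can then be extracted independently and merged (e.g. averaged over the overlaps) to score a lncRNA far longer than the model's maximum input length.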

Quick Start

Pre-trained Model

AutoModel and AutoTokenizer

Requirements: pip install transformers

import os

from transformers import AutoModel, AutoTokenizer

model_path = 'heqin-zhu/structRFM'
# model_path = os.getenv('structRFM_checkpoint')

model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# single sequence
seq = 'GUCCCAACUCUUGCGGGGAGGGAU'
inputs = tokenizer(seq, return_tensors="pt")
outputs = model(**inputs)
print('>>> single seq, length:', len(seq))
for k, v in outputs.items():
    print(k, v.shape)
print(outputs.last_hidden_state.shape)

# batch mode
seqs = ["GUCCCAA", 'AGUGUUG', 'AUGUAGUTCUN'] # the last entry deliberately includes non-standard characters T and N
inputs = tokenizer(
             seqs,
             add_special_tokens=True,
             max_length=512,
             padding='max_length',
             truncation=True,
             return_tensors='pt'
        )
outputs = model(**inputs) # note that the output sequential features are padded to max-length
print('>>> batch seqs, batch:', len(seqs))
for k, v in outputs.items():
    print(k, v.shape)

'''
>>> single seq, length: 24
last_hidden_state torch.Size([1, 24, 768])
pooler_output torch.Size([1, 768])
torch.Size([1, 24, 768])
>>> batch seqs, batch: 3
last_hidden_state torch.Size([3, 512, 768])
pooler_output torch.Size([3, 768])
'''
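Because batch outputs are padded to max_length, averaging last_hidden_state directly would mix padding positions into the sequence embedding. A minimal, dependency-free sketch of attention-mask-weighted mean pooling is shown below (with torch tensors the same idea is `(h * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)`); the list-based layout here is only for illustration.

```python
def masked_mean_pool(hidden, mask):
    """Mean-pool per-token embeddings, ignoring padded positions.

    hidden: [seq_len][dim] list of per-token vectors for one sequence.
    mask:   [seq_len] list of 1 (real token) / 0 (padding), as in the
            tokenizer's attention_mask.
    Returns a [dim] pooled vector averaged over real tokens only.
    """
    dim = len(hidden[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(hidden, mask):
        if m:
            count += 1
            for i in range(dim):
                total[i] += vec[i]
    return [t / count for t in total]

# Two real tokens plus one padded position; padding must not shift the mean.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(hidden, mask)  # [2.0, 3.0]
```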

Preparation-1

  1. Install packages:
pip install transformers structRFM BPfold
  2. Download and decompress the pre-trained structRFM checkpoint (~300 MB):
wget https://github.com/heqin-zhu/structRFM/releases/latest/download/structRFM_checkpoint.tar.gz
tar -xzf structRFM_checkpoint.tar.gz
  3. Set the environment variable structRFM_checkpoint:
export structRFM_checkpoint=PATH_TO_CHECKPOINT # modify ~/.bashrc for a permanent setting

Wrapped features

Requirements: refer to Preparation-1

Use structRFM_infer to extract different features.

import os

from structRFM.infer import structRFM_infer

from_pretrained = os.getenv('structRFM_checkpoint')
model_paras = dict(max_length=514, dim=768, layer=12, num_attention_heads=12)
model = structRFM_infer(from_pretrained=from_pretrained, **model_paras)

seq = 'AGUACGUAGUA'

print('seq len:', len(seq))
feat_dic = model.extract_feature(seq)
for k, v in feat_dic.items():
    print(k, v.shape)

'''
seq len: 11
cls_feat torch.Size([768])
seq_feat torch.Size([11, 768])
mat_feat torch.Size([11, 11])
'''
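The pairwise mat_feat is an L x L matrix of contact-style scores. As a hedged illustration of how such a matrix might be consumed (the threshold, symmetrization, and minimum loop size below are assumptions for the sketch, not structRFM's decoding rule), one could extract candidate base pairs like this:

```python
def matrix_to_pairs(mat, threshold=0.5, min_loop=3):
    """Extract candidate base pairs (i, j), i < j, from an L x L score matrix.

    threshold and the minimum hairpin-loop size are illustrative choices.
    Scores are symmetrized by averaging mat[i][j] and mat[j][i].
    """
    L = len(mat)
    pairs = []
    for i in range(L):
        for j in range(i + min_loop + 1, L):
            score = 0.5 * (mat[i][j] + mat[j][i])
            if score > threshold:
                pairs.append((i, j))
    return pairs

# Toy 5x5 matrix with one strong symmetric contact between positions 0 and 4.
mat = [[0.0] * 5 for _ in range(5)]
mat[0][4] = mat[4][0] = 0.9
pairs = matrix_to_pairs(mat)  # [(0, 4)]
```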

Building Model and Tokenizer

Requirements: refer to Preparation-1

import os

from structRFM.model import get_structRFM
from structRFM.data import preprocess_and_load_dataset, get_mlm_tokenizer

from_pretrained = os.getenv('structRFM_checkpoint') # may be None to initialize from scratch

tokenizer = get_mlm_tokenizer(max_length=514)
model = get_structRFM(dim=768, layer=12, num_attention_heads=12, from_pretrained=from_pretrained, pretrained_length=None, max_length=514, tokenizer=tokenizer)

Pre-training and Fine-tuning

Download sequence-structure dataset

The pre-training sequence-structure dataset is constructed from RNAcentral and BPfold. We retain sequences of at most 512 nt, resulting in about 21 million sequence-structure pairs. The dataset can be downloaded from Zenodo (4.5 GB).

Or use huggingface to load datasets (under construction):

# pip install datasets
from datasets import load_dataset
dataset = load_dataset("heqin-zhu/structRFM-dataset")
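The length filter described above amounts to a single pass over the records. A minimal sketch, assuming each record is a (sequence, structure) tuple (the field layout is a hypothetical stand-in for the actual dataset schema):

```python
def filter_by_length(records, max_len=512):
    """Keep only sequence-structure pairs whose sequence is at most max_len nt.

    records: iterable of (sequence, structure) tuples; this layout is an
    illustrative assumption about the dataset format.
    """
    return [(seq, struct) for seq, struct in records if len(seq) <= max_len]

# A 400-nt record passes the filter; an 800-nt record is dropped.
records = [('ACGU' * 100, '.' * 400), ('ACGU' * 200, '.' * 800)]
kept = filter_by_length(records)
```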

Preparation-2

Prepare structRFM environment

  1. Clone the GitHub repo.
git clone https://github.com/heqin-zhu/structRFM.git
cd structRFM
  2. Create and activate the conda environment.
conda env create -f structRFM_environment.yaml
conda activate structRFM

Run Pre-training

  • Modify the variables USER_DIR and PROGRAM_DIR in scripts/run.sh.
  • Specify DATA_PATH and run_name in the command below.

Then run:

bash scripts/run.sh --batch_size 96 --epoch 100 --lr 0.0001 --tag mlm --mlm_structure --max_length 514 --model_scale base --data_path DATA_PATH --run_name structRFM_512

For more information, run python3 main.py -h.

Run Fine-tuning

Requirements: refer to Preparation-2

Download all data (3.7 GB) and task-specific checkpoints from Zenodo, then place them into the corresponding folder of each task.

structRFM Inference

Requirements: refer to Preparation-2

structRFM for RNA secondary structure prediction

Download a fine-tuned structRFM checkpoint from the releases page to use as CHECKPOINT_PATH:

# Fine-tuned on bpRNA1m
wget https://github.com/heqin-zhu/structRFM/releases/latest/download/structRFM_SSP_bpRNA1m.pt

# Fine-tuned on RNAStrAlign
wget https://github.com/heqin-zhu/structRFM/releases/latest/download/structRFM_SSP_RNAStrAlign.pt

# Fine-tuned on All datasets (TODO)

Specify FASTA_PATH (multiple sequences supported) and CHECKPOINT_PATH, then run the following command:

python3 scripts/structRFM_SSP.py --gpu 0 --output_format bpseq --checkpoint_path CHECKPOINT_PATH --input_fasta FASTA_PATH --output_dir structRFM_SSP_results

[!NOTE] --output_format: output format of the predicted RNA secondary structures; one of csv, bpseq, ct, or dbn (default: csv).
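The bpseq output can be consumed with a small parser: each line holds a 1-based position, its base, and the position of its pairing partner (0 for unpaired). A minimal sketch:

```python
def parse_bpseq(text):
    """Parse bpseq-format text into (sequence, pairs).

    Each non-comment line is '<pos> <base> <partner>', 1-based, with
    partner 0 for unpaired. Returns the sequence string and a list of
    (i, j) pairs with i < j (still 1-based), each pair recorded once.
    """
    seq = []
    pairs = []
    for line in text.strip().splitlines():
        if line.startswith('#'):
            continue
        pos, base, partner = line.split()
        seq.append(base)
        i, j = int(pos), int(partner)
        if j > 0 and i < j:  # skip the mirrored (j, i) entry
            pairs.append((i, j))
    return ''.join(seq), pairs

bpseq = """1 G 5
2 U 0
3 C 0
4 C 0
5 C 1"""
sequence, pairs = parse_bpseq(bpseq)  # ('GUCCC', [(1, 5)])
```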

Acknowledgement

We appreciate the following open-source projects for their valuable contributions:

LICENSE

MIT LICENSE

Citation

If you find our work helpful, please cite our paper:

@article{structRFM,
    author = {Zhu, Heqin and Li, Ruifeng and Zhang, Feng and Tang, Fenghe and Ye, Tong and Li, Xin and Gu, Yujie and Xiong, Peng and Zhou, S Kevin},
    title = {A fully-open structure-guided RNA foundation model for robust structural and functional inference},
    elocation-id = {2025.08.06.668731},
    year = {2025},
    doi = {10.1101/2025.08.06.668731},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/08/07/2025.08.06.668731},
    journal = {bioRxiv}
}

Project details


Download files

Download the file for your platform.

Source Distribution

structrfm-0.0.9.tar.gz (26.0 kB)

Uploaded Source

Built Distribution


structrfm-0.0.9-py3-none-any.whl (25.3 kB)

Uploaded Python 3

File details

Details for the file structrfm-0.0.9.tar.gz.

File metadata

  • Download URL: structrfm-0.0.9.tar.gz
  • Size: 26.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for structrfm-0.0.9.tar.gz:

  • SHA256: c41cbcb7d4f937b983e77f54350a41baa8a427c756eb980dfa12051f96851d41
  • MD5: 64297508d33cc21801e7e29bfe9bee13
  • BLAKE2b-256: cb99a04609acd5c203967192fe08973c65767d86143be8feb83a68674968610b


Provenance

The following attestation bundles were made for structrfm-0.0.9.tar.gz:

Publisher: publish.yml on heqin-zhu/structRFM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file structrfm-0.0.9-py3-none-any.whl.

File metadata

  • Download URL: structrfm-0.0.9-py3-none-any.whl
  • Size: 25.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for structrfm-0.0.9-py3-none-any.whl:

  • SHA256: 8364c5c28acab9ba1866acbba64484b90dab58adea6f059ee8a8e20badaac049
  • MD5: 61e5e4fe47b81c40779eafe451a373e0
  • BLAKE2b-256: d473b7c4b02910b623a5b940690ebca75ee83802e7112c568f285c96111e02e9


Provenance

The following attestation bundles were made for structrfm-0.0.9-py3-none-any.whl:

Publisher: publish.yml on heqin-zhu/structRFM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
