Structure-guided RNA foundation model.
Project description
Overview
structRFM is a fully open-source structure-guided RNA foundation model that integrates sequence and structural knowledge through innovative pre-training strategies. By leveraging 21 million sequence-structure pairs and a novel Structure-guided Masked Language Modeling (SgMLM) approach, structRFM achieves state-of-the-art performance across a broad spectrum of RNA structural and functional inference tasks, setting new benchmarks for reliability and generalizability.
Key Features
- Structure-Guided Pre-Training: SgMLM strategy dynamically balances sequence-level and structure-level masking, capturing base-pair interactions without task-specific biases (a toy illustration of the masking idea follows this list).
- Multi-Source Structure Ensemble: MUSES (Multi-source ensemble of secondary structures) integrates thermodynamics-based, probability-based, and deep learning-based predictors to mitigate annotation biases.
- Versatile Feature Output: Generates classification-level, sequence-level, and pairwise matrix features to support sequence-wise, nucleotide-wise, and structure-wise tasks.
- State-of-the-Art Performance: Achieves state-of-the-art results on zero-shot, secondary structure prediction, tertiary structure prediction, and function prediction tasks.
- Zero-Shot Capability: Ranks in the top 4 for zero-shot homology classification on the Rfam and ArchiveII datasets, and delivers strong secondary structure prediction without labeled data.
- Long RNA Handling: An overlapping sliding-window strategy enables high-accuracy classification of long non-coding RNAs (lncRNAs) up to 3,000 nt (see the illustrative sketch after the Wrapped features example below).
- Fully Open Resources: 21M sequence-structure dataset, pre-trained models, and fine-tuned checkpoints are publicly available for the research community.
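The structure-guided masking idea from the first feature above can be sketched in a few lines. This is a toy illustration only, not the actual SgMLM implementation: the mask rate, the paired/unpaired split, and the use of `N` as a mask placeholder are assumptions made for illustration.

import random

def toy_structure_guided_mask(seq, dotbracket, mask_rate=0.15, struct_frac=0.5):
    """Toy sketch: mask some positions preferentially at base-paired sites
    (taken from a dot-bracket annotation) and the rest uniformly at random.
    All ratios and the 'N' mask placeholder are illustrative assumptions."""
    paired = [i for i, c in enumerate(dotbracket) if c in '()']
    n_mask = max(1, int(mask_rate * len(seq)))
    n_struct = min(len(paired), int(struct_frac * n_mask))
    chosen = set(random.sample(paired, n_struct))          # structure-level masking
    rest = [i for i in range(len(seq)) if i not in chosen]
    chosen |= set(random.sample(rest, n_mask - n_struct))  # sequence-level masking
    return ''.join('N' if i in chosen else b for i, b in enumerate(seq))

print(toy_structure_guided_mask('GUCCCAACUCUUGCGGGGAGGGAU', '..(((..........))).....'))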
Quick Start
Pre-trained Model
AutoModel and AutoTokenizer
Requirements: pip install transformers
import os
from transformers import AutoModel, AutoTokenizer
model_path = 'heqin-zhu/structRFM'
# model_path = os.getenv('structRFM_checkpoint')
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# single sequence
seq = 'GUCCCAACUCUUGCGGGGAGGGAU'
inputs = tokenizer(seq, return_tensors="pt")
outputs = model(**inputs)
print('>>> single seq, length:', len(seq))
for k, v in outputs.items():
    print(k, v.shape)
print(outputs.last_hidden_state.shape)
# batch mode
seqs = ["GUCCCAA", 'AGUGUUG', 'AUGUAGUTCUN']
inputs = tokenizer(
    seqs,
    add_special_tokens=True,
    max_length=512,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)
outputs = model(**inputs) # note that the output sequential features are padded to max-length
print('>>> batch seqs, batch:', len(seqs))
for k, v in outputs.items():
    print(k, v.shape)
'''
>>> single seq, length: 24
last_hidden_state torch.Size([1, 24, 768])
pooler_output torch.Size([1, 768])
torch.Size([1, 24, 768])
>>> batch seqs, batch: 3
last_hidden_state torch.Size([3, 512, 768])
pooler_output torch.Size([3, 768])
'''
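When working with the padded batch output above, per-sequence embeddings can be recovered by pooling only over real tokens. A minimal sketch, reusing `inputs` and `outputs` from the batch example and assuming the standard Hugging Face `attention_mask` convention (mean pooling is just one possible choice; special tokens are included here):

import torch
# Mean-pool last_hidden_state over the non-pad positions of each sequence.
mask = inputs['attention_mask'].unsqueeze(-1).float()   # [batch, max_length, 1]
pooled = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
print(pooled.shape)  # torch.Size([3, 768])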
Preparation-1
- Install packages
pip install transformers structRFM BPfold
- Download and decompress pretrained structRFM (~300 MB).
wget https://github.com/heqin-zhu/structRFM/releases/latest/download/structRFM_checkpoint.tar.gz
tar -xzf structRFM_checkpoint.tar.gz
- Set environment variable `structRFM_checkpoint`.
export structRFM_checkpoint=PATH_TO_CHECKPOINT # modify ~/.bashrc for permanent setting
Wrapped features
Requirements: refer to Preparation-1
Use `structRFM_infer` to extract classification-level, sequence-level, and pairwise matrix features.
import os
from structRFM.infer import structRFM_infer
from_pretrained = os.getenv('structRFM_checkpoint')
model_paras = dict(max_length=514, dim=768, layer=12, num_attention_heads=12)
model = structRFM_infer(from_pretrained=from_pretrained, **model_paras)
seq = 'AGUACGUAGUA'
print('seq len:', len(seq))
feat_dic = model.extract_feature(seq)
for k, v in feat_dic.items():
    print(k, v.shape)
'''
seq len: 11
cls_feat torch.Size([768])
seq_feat torch.Size([11, 768])
mat_feat torch.Size([11, 11])
'''
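The overlapping sliding-window strategy mentioned under Key Features (for lncRNAs up to 3,000 nt) can be sketched on top of `extract_feature`, reusing `model` from the snippet above. This is an illustrative sketch only: the window size, stride, and mean pooling of `cls_feat` are assumptions, not necessarily the released pipeline.

import torch

def long_rna_embedding(seq, window=512, stride=256):
    """Illustrative sliding-window embedding for long RNAs: embed overlapping
    windows with extract_feature and mean-pool their cls_feat vectors.
    Window size, stride, and pooling choice are assumptions."""
    starts = list(range(0, max(1, len(seq) - window + 1), stride))
    if starts[-1] + window < len(seq):           # make sure the tail is covered
        starts.append(len(seq) - window)
    feats = [model.extract_feature(seq[s:s + window])['cls_feat'] for s in starts]
    return torch.stack(feats).mean(dim=0)        # one 768-d vector for the RNA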
Building Model and Tokenizer
Requirements: refer to Preparation-1
import os
from structRFM.model import get_structRFM
from structRFM.data import preprocess_and_load_dataset, get_mlm_tokenizer
from_pretrained = os.getenv('structRFM_checkpoint')  # or None to initialize from scratch
tokenizer = get_mlm_tokenizer(max_length=514)
model = get_structRFM(dim=768, layer=12, num_attention_heads=12, from_pretrained=from_pretrained, pretrained_length=None, max_length=514, tokenizer=tokenizer)
Pre-training and Fine-tuning
Download sequence-structure dataset
The pre-training sequence-structure dataset is constructed from RNAcentral and BPfold. We filter out sequences longer than 512 nt, resulting in about 21 million sequence-structure pairs. It can be downloaded from Zenodo (4.5 GB).
Alternatively, load the dataset with Hugging Face `datasets` (under construction):
# pip install datasets
from datasets import load_dataset
dataset = load_dataset("heqin-zhu/structRFM-dataset")
Preparation-2
Prepare structRFM environment
- Clone GitHub repo.
git clone https://github.com/heqin-zhu/structRFM.git
cd structRFM
- Create and activate conda environment.
conda env create -f structRFM_environment.yaml
conda activate structRFM
Run Pre-training
- Modify variables `USER_DIR` and `PROGRAM_DIR` in `scripts/run.sh`.
- Specify `DATA_PATH` and `run_name` in the following command.
Then run:
bash scripts/run.sh --batch_size 96 --epoch 100 --lr 0.0001 --tag mlm --mlm_structure --max_length 514 --model_scale base --data_path DATA_PATH --run_name structRFM_512
For more information, run `python3 main.py -h`.
Run Fine-tuning
Requirements: refer to Preparation-2
Download all data (3.7 GB) and task-specific checkpoints from Zenodo, then place them into the corresponding folder of each task.
- Zero-shot inference
- Structure prediction
- Function prediction
structRFM Inference
Requirements: refer to Preparation-2
structRFM for RNA secondary structure prediction
Download a fine-tuned structRFM checkpoint from the releases and use it as `CHECKPOINT_PATH`:
# Fine-tuned on bpRNA1m
wget https://github.com/heqin-zhu/structRFM/releases/latest/download/structRFM_SSP_bpRNA1m.pt
# Fine-tuned on RNAStrAlign
wget https://github.com/heqin-zhu/structRFM/releases/latest/download/structRFM_SSP_RNAStrAlign.pt
# Fine-tuned on All datasets (TODO)
Specify `FASTA_PATH` (multiple sequences supported) and `CHECKPOINT_PATH`, then run the following command:
python3 scripts/structRFM_SSP.py --gpu 0 --output_format bpseq --checkpoint_path CHECKPOINT_PATH --input_fasta FASTA_PATH --output_dir structRFM_SSP_results
[!NOTE]
`--output_format`: output format of RNA secondary structures; one of `.csv`, `.bpseq`, `.ct`, or `.dbn` (default: `.csv`).
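To consume the predicted structures downstream, a bpseq file (one line per nucleotide: `index base pair_index`, with 0 for unpaired) can be read back into base pairs with a small helper. The output file name below is a hypothetical example:

def read_bpseq(path):
    """Parse a bpseq file into (sequence, list of 0-based base pairs)."""
    seq, pairs = [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3 or not parts[0].isdigit():
                continue                 # skip header/comment lines
            i, base, j = int(parts[0]), parts[1], int(parts[2])
            seq.append(base)
            if j > i:                    # record each pair once
                pairs.append((i - 1, j - 1))
    return ''.join(seq), pairs

seq, pairs = read_bpseq('structRFM_SSP_results/example.bpseq')  # hypothetical output file
print(len(seq), pairs[:5])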
Acknowledgement
We appreciate the following open-source projects for their valuable contributions:
LICENSE
Citation
If you find our work helpful, please cite our paper:
@article {structRFM,
author = {Zhu, Heqin and Li, Ruifeng and Zhang, Feng and Tang, Fenghe and Ye, Tong and Li, Xin and Gu, Yujie and Xiong, Peng and Zhou, S Kevin},
title = {A fully-open structure-guided RNA foundation model for robust structural and functional inference},
elocation-id = {2025.08.06.668731},
year = {2025},
doi = {10.1101/2025.08.06.668731},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/08/07/2025.08.06.668731},
journal = {bioRxiv}
}
Project details
Download files
Source Distribution: structrfm-0.0.9.tar.gz (26.0 kB)
Built Distribution: structrfm-0.0.9-py3-none-any.whl (25.3 kB)
File details
Details for the file structrfm-0.0.9.tar.gz.
File metadata
- Download URL: structrfm-0.0.9.tar.gz
- Upload date:
- Size: 26.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c41cbcb7d4f937b983e77f54350a41baa8a427c756eb980dfa12051f96851d41 |
| MD5 | 64297508d33cc21801e7e29bfe9bee13 |
| BLAKE2b-256 | cb99a04609acd5c203967192fe08973c65767d86143be8feb83a68674968610b |
Provenance
The following attestation bundles were made for structrfm-0.0.9.tar.gz:
- Publisher: publish.yml on heqin-zhu/structRFM
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: structrfm-0.0.9.tar.gz
- Subject digest: c41cbcb7d4f937b983e77f54350a41baa8a427c756eb980dfa12051f96851d41
- Sigstore transparency entry: 852782382
- Permalink: heqin-zhu/structRFM@09b7b541887260bc9c0aebf30aaa5af4a65cebe0
- Branch / Tag: refs/tags/v0.0.9
- Owner: https://github.com/heqin-zhu
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@09b7b541887260bc9c0aebf30aaa5af4a65cebe0
- Trigger Event: push
File details
Details for the file structrfm-0.0.9-py3-none-any.whl.
File metadata
- Download URL: structrfm-0.0.9-py3-none-any.whl
- Upload date:
- Size: 25.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8364c5c28acab9ba1866acbba64484b90dab58adea6f059ee8a8e20badaac049 |
| MD5 | 61e5e4fe47b81c40779eafe451a373e0 |
| BLAKE2b-256 | d473b7c4b02910b623a5b940690ebca75ee83802e7112c568f285c96111e02e9 |
Provenance
The following attestation bundles were made for structrfm-0.0.9-py3-none-any.whl:
- Publisher: publish.yml on heqin-zhu/structRFM
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: structrfm-0.0.9-py3-none-any.whl
- Subject digest: 8364c5c28acab9ba1866acbba64484b90dab58adea6f059ee8a8e20badaac049
- Sigstore transparency entry: 852782408
- Permalink: heqin-zhu/structRFM@09b7b541887260bc9c0aebf30aaa5af4a65cebe0
- Branch / Tag: refs/tags/v0.0.9
- Owner: https://github.com/heqin-zhu
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@09b7b541887260bc9c0aebf30aaa5af4a65cebe0
- Trigger Event: push