RNA Foundation Model (rna-fm): Pretrained language models for RNAs. From CUHK AIH Lab.

RNA-FM

Update March 2024: CDS-FM, a foundation model pre-trained on coding sequences (CDS) in mRNA, is now released! The model takes CDSs as input and represents them with contextual embeddings, benefiting mRNA- and protein-related tasks.

This repository contains code and pre-trained models for the RNA foundation model (RNA-FM). RNA-FM outperforms all tested single-sequence RNA language models across a variety of structure prediction tasks as well as several function-related tasks. You can find more details about RNA-FM in our paper, "Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions" (Chen et al., 2022).

Overview

Citation
@article{chen2022interpretable,
  title={Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions},
  author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and King, Irwin and others},
  journal={arXiv preprint arXiv:2204.00300},
  year={2022}
}
Create Environment with Conda

First, download the repository and create the environment.

git clone https://github.com/ml4bio/RNA-FM.git
cd ./RNA-FM
conda env create -f environment.yml

Then, activate the "RNA-FM" environment and enter the workspace.

conda activate RNA-FM
cd ./redevelop

Access pre-trained models.

Download the pre-trained models from this gdrive link and place the .pth files in the pretrained folder.

Apply RNA-FM with Existing Scripts.

1. Embedding Extraction.

python launch/predict.py --config="pretrained/extract_embedding.yml" \
--data_path="./data/examples/example.fasta" --save_dir="./results" \
--save_frequency 1 --save_embeddings

RNA-FM embeddings of shape (L, 640), where L is the sequence length, are saved in $save_dir/representations.
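The --data_path argument above points to a FASTA file. If your sequences live in Python rather than on disk, the following minimal sketch writes them out and reads them back. The helper names write_fasta and read_fasta are hypothetical and not part of this repository; the records use the same (name, sequence) pairs as the Python API shown later in this README.

```python
from typing import Iterable, Iterator, Tuple


def write_fasta(records: Iterable[Tuple[str, str]], path: str) -> None:
    """Write (name, sequence) pairs as a FASTA file, one sequence per line."""
    with open(path, "w") as handle:
        for name, seq in records:
            handle.write(f">{name}\n{seq}\n")


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs, joining sequences wrapped over several lines."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if name is not None:
                    yield name, "".join(chunks)
                header = line[1:].split()
                name, chunks = (header[0] if header else ""), []
            else:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

For example, write_fasta([("RNA1", "GGGUGCGAUC")], "my_input.fasta") produces a file that can be passed via --data_path, and list(read_fasta("my_input.fasta")) recovers the same pairs.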

For CDS-FM, pass an extra argument, MODEL.BACKBONE_NAME:

python launch/predict.py --config="pretrained/extract_embedding.yml" \
--data_path="./data/examples/example.fasta" --save_dir="./results" \
--save_frequency 1 --save_embeddings --save_embeddings_format raw MODEL.BACKBONE_NAME cds-fm

2. Downstream Prediction - RNA secondary structure.

python launch/predict.py --config="pretrained/ss_prediction.yml" \
--data_path="./data/examples/example.fasta" --save_dir="./results" \
--save_frequency 1

The predicted probability maps are saved as .npy files, and the post-processed binary predictions are saved as .ct files. You can find them in $save_dir/r-ss.
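To inspect a predicted structure programmatically, the .ct output can be parsed into a sequence and a list of base pairs. The sketch below assumes the files follow the common connectivity-table layout (a header line starting with the sequence length, then one row per base with the pairing partner in the fifth column, 0 meaning unpaired). This layout has not been checked against the files this script writes, and parse_ct is a hypothetical helper, not a function of this repository.

```python
from typing import List, Tuple


def parse_ct(text: str) -> Tuple[str, List[Tuple[int, int]]]:
    """Return (sequence, base pairs) from the contents of a .ct file.

    Pairs are 1-based (i, j) tuples with i < j.
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    n = int(lines[0][0])  # header: sequence length, then a free-form title
    bases, pairs = [], []
    for fields in lines[1 : n + 1]:
        idx, base, partner = int(fields[0]), fields[1], int(fields[4])
        bases.append(base)
        if partner > idx:  # record each pair once; 0 means unpaired
            pairs.append((idx, partner))
    return "".join(bases), pairs
```

Reading a file is then a matter of calling parse_ct(open(path).read()) on one of the files in $save_dir/r-ss.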

3. Online Version - RNA-FM server.

If you have any trouble deploying RNA-FM locally, you can use its online version, the RNA-FM server. You can submit jobs on the server and download the results afterwards, without setting up an environment or using your own computational resources.

Quick Start for Further Development.

PyTorch is a prerequisite and must be installed to use this repository. If you only want to use the pre-trained language model, you can install rna-fm in your own environment with pip. You can either install rna-fm from PyPI:

pip install rna-fm

or install rna-fm from GitHub:

cd ./RNA-FM
pip install .

After installation, you can load the RNA-FM and extract its embeddings with the following code:

import torch
import fm

# Load RNA-FM model
model, alphabet = fm.pretrained.rna_fm_t12()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disables dropout for deterministic results

# Prepare data
data = [
    ("RNA1", "GGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCCCU"),
    ("RNA2", "GGGUGUCGCUCAGUUGGUAGAGUGCUUGCCUGGCAUGCAAGAAACCUUGGUUCAAUCCCCAGCACUGCA"),
    ("RNA3", "CGAUUCNCGUUCCC--CCGCCUCCA"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Extract embeddings (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[12])
token_embeddings = results["representations"][12]
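A common next step is to reduce the per-token embeddings to one vector per sequence. The sketch below does this by mean pooling with NumPy. It assumes an ESM-style token layout in which position 0 holds a start token and the bases occupy positions 1 to len(seq), followed by an end token and padding. That layout is an assumption and should be checked against batch_tokens before relying on it; mean_pool is a hypothetical helper, not part of the fm package.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, sequences: list) -> np.ndarray:
    """Average per-token embeddings into one vector per sequence.

    token_embeddings: array of shape (batch, padded_length, dim).
    sequences: the raw sequence strings, used only for their lengths.
    Assumes position 0 is a start token, so bases occupy 1..len(seq).
    """
    pooled = [
        token_embeddings[i, 1 : len(seq) + 1].mean(axis=0)
        for i, seq in enumerate(sequences)
    ]
    return np.stack(pooled)
```

With the variables from the snippet above, this could be called as mean_pool(token_embeddings.numpy(), batch_strs), giving an array of shape (3, 640). For a codon-level model, the number of tokens per sequence would presumably be len(seq) // 3 rather than len(seq).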

More tutorials can be found at https://ml4bio.github.io/RNA-FM/. The related notebooks are stored in the tutorials folder.

For CDS-FM, the above code needs a slight revision. Note that the length of each input RNA sequence should be a multiple of 3 so that the sequence can be tokenized into a series of codons (3-mers).

import torch
import fm

# Load CDS-FM model
model, alphabet = fm.pretrained.cds_fm_t12()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disables dropout for deterministic results

# Prepare data
data = [
    ("CDS1", "AUGGGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCCCUA"),
    ("CDS2", "AUGGGGUGUCGCUCAGUUGGUAGAGUGCUUGCCUGGCAUGCAAGAAACCUUGGUUCAAUCCCCAGCACUGCA"),
    ("CDS3", "AUGCGAUUCNCGUUCCC--CCGCCUCC"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Extract embeddings (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[12])
token_embeddings = results["representations"][12]
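The multiple-of-3 requirement for CDS-FM inputs can be checked before the data reaches the batch converter. The following sketch illustrates the constraint by splitting a sequence into consecutive 3-mers and rejecting lengths that do not fit. It is not the tokenizer used inside the fm package, and split_codons is a hypothetical helper name.

```python
from typing import List


def split_codons(seq: str) -> List[str]:
    """Split a coding sequence into consecutive 3-mers, rejecting bad lengths."""
    if len(seq) % 3 != 0:
        raise ValueError(f"sequence length {len(seq)} is not a multiple of 3")
    return [seq[i : i + 3] for i in range(0, len(seq), 3)]
```

Running split_codons over every sequence in data before calling batch_converter surfaces malformed inputs early, with a clear error message.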

Related RNA Language Models (BERT-style)

| Shorthand | Code | Subject | Layers | Embed Dim | Max Length | Input | Token | Dataset | Description | Year | Publisher |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RNA-FM | Yes | ncRNA | 12 | 640 | 1024 | Seq | base | RNAcentral 19 (23 million samples) | The first general-purpose RNA language model | 2022.04 | arXiv/bioRxiv |
| RNABERT | Yes | ncRNA | 6 | 120 | 440 | Seq | base | RNAcentral (762370) & Rfam 14.3 dataset (trained with partial MSA) | Specialized in structural alignment and clustering | 2022.02 | NAR Genomics and Bioinformatics |
| UNI-RNA | No | RNA | 24 | 1280 | $\infty$ | Seq | base | RNAcentral & nt & GWH (1 billion) | A general model with larger scale and datasets than RNA-FM | 2023.07 | bioRxiv |
| RNA-MSM | Yes | ncRNA | 12 | 768 | 1024 | MSA | base | 4069 RNA families from Rfam 14.7 | A model that uses evolutionary information from MSAs directly | 2023.03 | NAR |
| SpliceBERT | Yes | pre-mRNA | 6 | 1024 | 512 | Seq | base | 2 million precursor messenger RNA (pre-mRNA) | Specialized in RNA splicing of pre-mRNA | 2023.05 | bioRxiv |
| CodonBERT | No | mRNA CDS | 12 | 768 | 512*2 | Seq | codon (3mer) | 10 million mRNAs from NCBI | Focuses only on the CDS of mRNA, without UTRs | 2023.09 | bioRxiv |
| UTR-LM | Yes | mRNA 5'UTR | 6 | 128 | $\infty$ | Seq | base | 700K 5'UTRs from Ensembl & eGFP & mCherry & Cao | Used for 5'UTR and mRNA expression related tasks | 2023.10 | bioRxiv |
| 3UTRBERT | Yes | mRNA 3'UTR | 12 | 768 | 512 | Seq | k-mer | 20,362 3'UTRs | Used for 3'UTR mediated gene regulation tasks | 2023.09 | bioRxiv |
| BigRNA | No | DNA | - | - | - | Seq | - | thousands of genome-matched datasets | tissue-specific RNA expression, splicing, microRNA sites, and RNA binding protein | 2023.09 | bioRxiv |

Citations

If you find the models useful in your research, we ask that you cite the relevant paper:

For RNA-FM:

@article{chen2022interpretable,
  title={Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions},
  author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and King, Irwin and others},
  journal={arXiv preprint arXiv:2204.00300},
  year={2022}
}

The model in this repository builds on the esm sequence modeling framework, and we use the fairseq sequence modeling framework to train our RNA language model. We greatly appreciate these two excellent works!

License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

Download files

Source Distribution

rna-fm-0.2.0.tar.gz (36.8 kB)

Uploaded Source

Built Distribution

rna_fm-0.2.0-py3-none-any.whl (44.4 kB)

Uploaded Python 3

File details

Details for the file rna-fm-0.2.0.tar.gz.

File metadata

  • Download URL: rna-fm-0.2.0.tar.gz
  • Size: 36.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.8.8

File hashes

Hashes for rna-fm-0.2.0.tar.gz
Algorithm Hash digest
SHA256 4d02779132f8bd910833f3e20c3d08d441da715fa432ca0dc714b582a81aaa03
MD5 f37449b78dbeb9ec7950b92456483306
BLAKE2b-256 402ce15eadcc21c8cbcf324fb44cddca0f2014a6d1e0b566e2c3f61bee10ad1d
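
A downloaded file can be checked against the SHA256 digest listed above using Python's standard hashlib module. This is a minimal sketch; sha256_of is a hypothetical helper name, and the file path in the usage note is an assumed local download location.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For the source distribution, the result of sha256_of("rna-fm-0.2.0.tar.gz") should equal the SHA256 value in the table above; a mismatch indicates a corrupted or altered download.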

File details

Details for the file rna_fm-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: rna_fm-0.2.0-py3-none-any.whl
  • Size: 44.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.8.8

File hashes

Hashes for rna_fm-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 575d14f30274c927c9e5c33f613f5687a6071d463c230d69350f85f03d23966c
MD5 1d44856704ceccf99690b3d30098f996
BLAKE2b-256 94dad6f9eab6c1211aff033ca5879752219f6dcf83b68c36eee8bb5e04f0c827
