
Benchmarking suite for mRNA property prediction.

Reason this release was yanked:

Broken.


mRNABench

This repository contains a workflow to benchmark the embedding quality of genomic foundation models on (m)RNA-specific tasks. mRNABench contains a catalogue of datasets and training-split logic that can be used to evaluate the embedding quality of several catalogued models.

Jump to: Model Catalog, Dataset Catalog

Setup

Several configurations of the mRNABench are available.

Datasets Only

If you are interested in the benchmark datasets only, you can run:

pip install mrna-bench

Full Version

The inference-capable version of mRNABench that can generate embeddings using Orthrus, DNA-BERT2, NucleotideTransformer, RNA-FM, and HyenaDNA can be installed as shown below. Note that this requires PyTorch version 2.2.2 with CUDA 12.1 and Triton uninstalled (due to a DNA-BERT2 issue).

conda create --name mrna_bench python=3.10
conda activate mrna_bench

pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install "mrna-bench[base_models]"
pip uninstall triton

Inference with other models will require the installation of the model's dependencies first, which are usually listed on the model's GitHub page (see below).
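
The version constraints above can be checked after installation. The following is a minimal sketch using only the standard library; the helper names installed_version and check_full_install are hypothetical and are not part of mRNABench.

```python
from importlib import metadata
from typing import List, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_full_install() -> List[str]:
    """Collect human-readable problems with the 'Full Version' environment."""
    problems = []
    torch_version = installed_version("torch")
    if torch_version is None:
        problems.append("torch is not installed")
    elif not torch_version.startswith("2.2.2"):
        # CUDA wheels report versions such as "2.2.2+cu121".
        problems.append(f"torch is {torch_version}, expected 2.2.2")
    if installed_version("triton") is not None:
        problems.append("triton is still installed; run `pip uninstall triton`")
    return problems


for problem in check_full_install():
    print("WARNING:", problem)
```

An empty result only means the pinned packages look right; it does not verify that CUDA 12.1 is usable on the machine.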

Post-install

After installation, please run the following in Python to set where data associated with the benchmarks will be stored.

import mrna_bench as mb

path_to_dir_to_store_data = "DESIRED_PATH"
mb.update_data_path(path_to_dir_to_store_data)

Usage

Datasets can be retrieved using:

import mrna_bench as mb

dataset = mb.load_dataset("go-mf")
data_df = dataset.data_df

The mRNABench can also be used to test out common genomic foundation models:

import torch

import mrna_bench as mb
from mrna_bench.embedder import DatasetEmbedder
from mrna_bench.linear_probe import LinearProbe

device = torch.device("cuda")

dataset = mb.load_dataset("go-mf")
model = mb.load_model("Orthrus", "orthrus-large-6-track", device)

embedder = DatasetEmbedder(model, dataset)
embeddings = embedder.embed_dataset()
embeddings = embeddings.detach().cpu().numpy()

prober = LinearProbe(
    dataset=dataset,
    embeddings=embeddings,
    task="multilabel",
    target_col="target",
    split_type="homology"
)

metrics = prober.run_linear_probe()
print(metrics)
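
For readers unfamiliar with the term, a linear probe fits a linear model on top of frozen embeddings, so the score reflects how much task information the embeddings already expose. The toy below illustrates the idea in pure Python with a single binary label and hand-written 2-dimensional "embeddings"; it is an illustrative sketch and not the implementation behind LinearProbe. A multilabel task can be seen as repeating this once per label.

```python
import math


def train_logistic_probe(embeddings, labels, lr=0.5, steps=200):
    """Fit the weights and bias of a logistic-regression probe by gradient descent."""
    dim = len(embeddings[0])
    n = len(embeddings)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            pred = 1.0 / (1.0 + math.exp(-logit))
            err = pred - y  # gradient of the log loss w.r.t. the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi / n
            grad_b += err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict(weights, bias, x):
    """Predict label 1 when the linear score is positive, else 0."""
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)


# Toy stand-ins for frozen embeddings; the embedding model is never updated.
toy_embeddings = [[-1.0, -0.8], [-0.9, -1.1], [-1.2, -0.7],
                  [1.0, 0.9], [0.8, 1.1], [1.1, 0.7]]
toy_labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic_probe(toy_embeddings, toy_labels)
correct = sum(predict(weights, bias, x) == y
              for x, y in zip(toy_embeddings, toy_labels))
accuracy = correct / len(toy_labels)
print(accuracy)
```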

Also see the scripts/ folder for example scripts that use Slurm to embed dataset chunks in parallel to reduce runtime, as well as an example of multi-seed linear probing.
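
The two ideas mentioned here, splitting a dataset into chunks for parallel jobs and summarizing results over several seeds, can be sketched as follows. This is not the code in scripts/; the function names, the dataset size, and the metric name "auroc" are assumptions made for illustration. SLURM_ARRAY_TASK_ID is the environment variable Slurm sets for each task of a job array.

```python
import os
import statistics


def chunk_bounds(n_items, n_chunks, chunk_id):
    """Return the half-open [start, end) index range covered by one chunk.

    The first `n_items % n_chunks` chunks receive one extra item, so chunk
    sizes differ by at most one.
    """
    base, extra = divmod(n_items, n_chunks)
    start = chunk_id * base + min(chunk_id, extra)
    end = start + base + (1 if chunk_id < extra else 0)
    return start, end


def summarize_seeds(metrics_per_seed):
    """Mean and sample standard deviation of each metric across seeds."""
    names = metrics_per_seed[0].keys()
    return {
        name: (
            statistics.mean(m[name] for m in metrics_per_seed),
            statistics.stdev(m[name] for m in metrics_per_seed),
        )
        for name in names
    }


# Each Slurm array task would embed only its own slice of the dataset.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
start, end = chunk_bounds(n_items=10, n_chunks=3, chunk_id=task_id % 3)
print(f"task {task_id} embeds rows [{start}, {end})")

# Hypothetical per-seed metrics, as might be collected from repeated probing.
summary = summarize_seeds([{"auroc": 0.8}, {"auroc": 0.9}])
print(summary)
```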

Model Catalog

The current models catalogued are:

| Model Name | Model Versions | Description | Citation | Supported by base_models |
| --- | --- | --- | --- | --- |
| Orthrus | orthrus_large_6_track, orthrus_base_4_track | Mamba-based RNA FM pre-trained using contrastive learning. | code, paper | Yes |
| AIDO.RNA | aido_rna_650m, aido_rna_650m_cds, aido_rna_1b600m, aido_rna_1b600m_cds | Encoder Transformer-based RNA FM pre-trained using MLM on 42M ncRNA sequences. Version that is domain adapted to CDS is available. | paper | |
| RNA-FM | rna-fm, mrna-fm | Transformer-based RNA FM pre-trained using MLM on 23M ncRNA sequences. mRNA-FM trained on CDS using codon tokenizer. | github | Yes |
| DNABERT2 | dnabert2 | Transformer-based DNA FM pre-trained using MLM on multispecies genomic dataset. Uses BPE and other modern architectural improvements for efficiency. | github | Yes |
| NucleotideTransformer | 2.5b-multi-species, 2.5b-1000g, 500m-human-ref, 500m-1000g, v2-50m-multi-species, v2-100m-multi-species, v2-250m-multi-species, v2-500m-multi-species | Transformer-based DNA FM pre-trained using MLM on a variety of possible datasets at various model sizes. Sequence is tokenized using 6-mers. | github | Yes |
| HyenaDNA | hyenadna-large-1m-seqlen-hf, hyenadna-medium-450k-seqlen-hf, hyenadna-medium-160k-seqlen-hf, hyenadna-small-32k-seqlen-hf, hyenadna-tiny-16k-seqlen-d128-hf | Hyena-based DNA FM pre-trained using NTP on the human reference genome. Available at various model sizes and pretraining sequence contexts. | github | Yes |

Adding a new model

All models should inherit from the template EmbeddingModel. Each model file should lazily load its dependencies within its __init__ method, so that each model can be used individually without installing the dependencies of every other model. Models must implement get_model_short_name(model_version), which returns the internal name for the model. This name must be unique for every model version and must not contain underscores. Models should implement either embed_sequence or embed_sequence_sixtrack (see code for method signatures). New models should be added to MODEL_CATALOG.
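
The requirements above can be illustrated with a self-contained toy. EmbeddingModelStub is a hypothetical stand-in for the real EmbeddingModel template, whose constructor and method signatures are not reproduced here and may differ; ToyModel, its nucleotide-frequency "embedding", and check_short_names are likewise invented for illustration.

```python
import importlib


class EmbeddingModelStub:
    """Hypothetical, simplified stand-in for mRNABench's EmbeddingModel."""

    def __init__(self, model_version, device):
        self.model_version = model_version
        self.device = device
        self.short_name = self.get_model_short_name(model_version)


class ToyModel(EmbeddingModelStub):
    @staticmethod
    def get_model_short_name(model_version):
        # Short names must not contain underscores, so swap them for hyphens.
        return model_version.replace("_", "-")

    def __init__(self, model_version, device="cpu"):
        super().__init__(model_version, device)
        # Lazy import: the dependency is resolved only when this model is
        # constructed. `collections` stands in for a heavy, model-specific
        # package that other models should not need installed.
        self._collections = importlib.import_module("collections")

    def embed_sequence(self, sequence):
        # Toy "embedding": nucleotide frequencies in a fixed order.
        counts = self._collections.Counter(sequence)
        n = max(len(sequence), 1)
        return [counts[base] / n for base in "ACGU"]


def check_short_names(model_cls, versions):
    """Verify the naming rules: no underscores, unique per model version."""
    names = [model_cls.get_model_short_name(v) for v in versions]
    if any("_" in name for name in names):
        raise ValueError("short names must not contain underscores")
    if len(set(names)) != len(names):
        raise ValueError("short names must be unique per model version")
    return names


model = ToyModel("toy_model_v1")
print(model.short_name, model.embed_sequence("AACG"))
print(check_short_names(ToyModel, ["toy_model_v1", "toy_model_v2"]))
```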

Dataset Catalog

The current datasets catalogued are:

| Dataset Name | Catalogue Identifier | Description | Tasks | Citation |
| --- | --- | --- | --- | --- |
| GO Molecular Function | go-mf | Classification of the molecular function of a transcript's product as defined by the GO Resource. | multilabel | website |
| Mean Ribosome Load (Sugimoto) | mrl-sugimoto | Mean Ribosome Load per transcript isoform as measured in Sugimoto et al. 2022. | regression | paper |
| RNA Half-life (Human) | rnahl-human | RNA half-life of human transcripts collected by Agarwal et al. 2022. | regression | paper |
| RNA Half-life (Mouse) | rnahl-mouse | RNA half-life of mouse transcripts collected by Agarwal et al. 2022. | regression | paper |
| Protein Subcellular Localization | prot-loc | Subcellular localization of transcript protein product defined in Protein Atlas. | multilabel | website |
| Protein Coding Gene Essentiality | pcg-ess | Essentiality of PCGs as measured by CRISPR knockdown. Log-fold expression and binary essentiality available on several cell lines. | regression, classification | paper |

Adding a new dataset

New datasets should inherit from BenchmarkDataset. Dataset names cannot contain underscores. Each new dataset should download its raw data and process it into a dataframe by overriding process_raw_data. This dataframe should store one transcript per row, with the sequence string-encoded in the sequence column. If homology splitting is required, a gene column containing gene names is also required. Six-track embedding additionally requires the columns cds and splice. The target column can have any name, since it is specified at probing time. New datasets should be added to DATASET_CATALOG.
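
The column requirements above can be expressed as a small check, together with a simplified illustration of why the gene column matters for splitting. Both helpers are hypothetical and use plain dictionaries in place of a dataframe. The split shown groups transcripts by gene name only, which is an assumption: the actual homology split in mRNABench may additionally group related genes.

```python
import hashlib


def missing_columns(columns, homology_split=False, six_track=False):
    """List required columns that are absent, given the intended usage."""
    needed = {"sequence"}
    if homology_split:
        needed |= {"gene"}
    if six_track:
        needed |= {"cds", "splice"}
    return sorted(needed - set(columns))


def gene_grouped_split(rows, test_fraction=0.2):
    """Assign whole genes to train or test.

    Hashing the gene name makes the assignment deterministic and keeps all
    transcripts of one gene on the same side of the split.
    """
    train, test = [], []
    for row in rows:
        digest = hashlib.sha256(row["gene"].encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") / 2**32
        (test if bucket < test_fraction else train).append(row)
    return train, test


# Toy rows: one transcript per row, sequences stored as strings.
rows = [
    {"sequence": "AUGGCC", "gene": f"GENE{i // 2}", "target": i % 2}
    for i in range(20)
]

print(missing_columns(rows[0].keys(), homology_split=True, six_track=True))
train, test = gene_grouped_split(rows)
shared = {r["gene"] for r in train} & {r["gene"] for r in test}
print(len(train), len(test), shared)
```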
