Skip to main content

Genome modeling across all domains of life

Project description

Evo 2: Genome modeling and design across all domains of life

Evo 2

Evo 2 is a state of the art DNA language model for long context modeling and design. Evo 2 models DNA sequences at single-nucleotide resolution at up to 1 million base pair context length using the StripedHyena 2 architecture. Evo 2 was pretrained using Savanna. Evo 2 was trained autoregressively on OpenGenome2, a dataset containing 8.8 trillion tokens from all domains of life.

We describe Evo 2 in the preprint: "Genome modeling and design across all domains of life with Evo 2".

[!NOTE]

  • New Evo 2 20B checkpoint: 40B-level performance with double the speed, read more here.
  • Light install for 7B models: installation option compatible with more hardware, see Installation.

Contents

Setup

This repo is for running Evo 2 locally for inference or generation, using our Vortex inference code. For training and finetuning, see the section here. You can run Evo 2 without any installation using the Nvidia Hosted API. You can also self-host an instance using Nvidia NIM. See the Nvidia NIM section for more information.

Requirements

Evo 2 is built on the Vortex inference repo, see the Vortex github for more details and Docker option.

System requirements

  • [OS] Linux (official) or WSL2 (limited support)
  • [Software]
    • CUDA: 12.1+ with compatible NVIDIA drivers
    • cuDNN: 9.3+
    • Compiler: GCC 9+ or Clang 10+ with C++17 support
    • Python 3.11 or 3.12
  • Recommended Torch 2.6.x or 2.7.x

FP8 and Transformer Engine requirements

The 40B, 20B, and 1B models require FP8 via Transformer Engine for numerical accuracy and a Nvidia Hopper GPU. The 7B models can run in bfloat16 without Transformer Engine on any supported GPU.

Model FP8 (Transformer Engine) Required
evo2_7b / evo2_7b_262k / evo2_7b_base No
evo2_20b Yes
evo2_40b / evo2_40b_base Yes
evo2_1b_base Yes

Always validate model outputs after configuration changes or on different hardware by using the tests.

Installation

Full install

Install Transformer Engine and Flash Attention first, then install Evo 2. We recommend using conda to install Transformer Engine:

conda install -c nvidia cuda-nvcc cuda-cudart-dev
conda install -c conda-forge transformer-engine-torch=2.3.0
pip install flash-attn==2.8.0.post2 --no-build-isolation
pip install evo2

Light install (7B models only, no Transformer Engine)

Evo 2 7B models can run without Transformer Engine or FP8-capable hardware. If you run into issues installing Flash Attention, see the Flash Attention GitHub for system requirements and troubleshooting.

# A compatible PyTorch must be installed before flash attention, for example: pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128
pip install flash-attn==2.8.0.post2 --no-build-isolation
pip install evo2

From source

git clone https://github.com/arcinstitute/evo2
cd evo2
pip install -e .

Verify installation

python -m evo2.test.test_evo2_generation --model_name evo2_7b  # or evo2_1b_base, evo2_20b, evo2_40b

Docker

Evo 2 can be run using Docker (shown below), Singularity, or Apptainer.

docker build -t evo2 .
docker run -it --rm --gpus '"device=0"' -v ./huggingface:/root/.cache/huggingface evo2 bash

Note: The volume mount (-v) preserves downloaded models between container runs and specifies where they are saved.

Once inside the container:

python -m evo2.test.test_evo2_generation --model_name evo2_7b

Usage

Checkpoints

We provide the following model checkpoints, hosted on HuggingFace:

Checkpoint Name Description
evo2_40b 40B parameter model with 1M context
evo2_20b 20B parameter model with 1M context
evo2_7b 7B parameter model with 1M context
evo2_40b_base 40B parameter model with 8K context
evo2_7b_base 7B parameter model with 8K context
evo2_1b_base Smaller 1B parameter model with 8K context
evo2_7b_262k 7B parameter model with 262K context
evo2_7b_microviridae 7B parameter base model fine-tuned on Microviridae genomes

Note: The 40B model requires multiple H100 GPUs. Vortex automatically handles device placement, splitting the model across available CUDA devices.

Forward

Evo 2 can be used to score the likelihoods across a DNA sequence.

import torch
from evo2 import Evo2

evo2_model = Evo2('evo2_7b')

sequence = 'ACGT'
input_ids = torch.tensor(
    evo2_model.tokenizer.tokenize(sequence),
    dtype=torch.int,
).unsqueeze(0).to('cuda:0')

outputs, _ = evo2_model(input_ids)
logits = outputs[0]

print('Logits: ', logits)
print('Shape (batch, length, vocab): ', logits.shape)

Embeddings

Evo 2 embeddings can be saved for use downstream. We find that intermediate embeddings work better than final embeddings, see our paper for details.

import torch
from evo2 import Evo2

evo2_model = Evo2('evo2_7b')

sequence = 'ACGT'
input_ids = torch.tensor(
    evo2_model.tokenizer.tokenize(sequence),
    dtype=torch.int,
).unsqueeze(0).to('cuda:0')

layer_name = 'blocks.28.mlp.l3'

outputs, embeddings = evo2_model(input_ids, return_embeddings=True, layer_names=[layer_name])

print('Embeddings shape: ', embeddings[layer_name].shape)

Generation

Evo 2 can generate DNA sequences based on prompts.

from evo2 import Evo2

evo2_model = Evo2('evo2_7b')

output = evo2_model.generate(prompt_seqs=["ACGT"], n_tokens=400, temperature=1.0, top_k=4)

print(output.sequences[0])

Notebooks

We provide example notebooks.

The BRCA1 scoring notebook shows zero-shot BRCA1 variant effect prediction. This example includes a walkthrough of:

  • Performing zero-shot BRCA1 variant effect predictions using Evo 2
  • Reference vs alternative allele normalization

The generation notebook shows DNA sequence completion with Evo 2. This example shows:

  • DNA prompt based generation and 'DNA autocompletion'
  • How to get and prompt using phylogenetic species tags for generation

The exon classifier notebook demonstrates exon classification using Evo 2 embeddings. This example shows:

  • Running the Evo 2 based exon classifier
  • Performance metrics and visualization

The sparse autoencoder (SAE) notebook explores interpretable features learned by Evo 2. This example includes:

  • Running and visualizing Evo 2 SAE features
  • Demonstrating SAE features on a part of the E. coli genome

Nvidia NIM

Evo 2 is available on Nvidia NIM and hosted API.

The quickstart guides users through running Evo 2 on the NVIDIA NIM using a python or shell client after starting NIM. An example python client script is shown below. This is the same way you would interact with the Nvidia hosted API.

#!/usr/bin/env python3
import requests
import os
import json
from pathlib import Path

key = os.getenv("NVCF_RUN_KEY") or input("Paste the Run Key: ")

r = requests.post(
    url=os.getenv("URL", "https://health.api.nvidia.com/v1/biology/arc/evo2-40b/generate"),
    headers={"Authorization": f"Bearer {key}"},
    json={
        "sequence": "ACTGACTGACTGACTG",
        "num_tokens": 8,
        "top_k": 1,
        "enable_sampled_probs": True,
    },
)

if "application/json" in r.headers.get("Content-Type", ""):
    print(r, "Saving to output.json:\n", r.text[:200], "...")
    Path("output.json").write_text(r.text)
elif "application/zip" in r.headers.get("Content-Type", ""):
    print(r, "Saving large response to data.zip")
    Path("data.zip").write_bytes(r.content)
else:
    print(r, r.headers, r.content)

Very long sequences

You can use Savanna or Nvidia BioNemo for embedding long sequences. Vortex can currently compute over very long sequences via teacher prompting, however please note that forward pass on long sequences may currently be slow.

Dataset

The OpenGenome2 dataset used for pretraining Evo2 is available on HuggingFace . Data is available either as raw fastas or as JSONL files which include preprocessing and data augmentation.

Training and Finetuning

Evo 2 was trained using Savanna, an open source framework for training alternative architectures.

To train or finetune Evo 2, you can use Savanna or Nvidia BioNemo which provides a Evo 2 finetuning tutorial here.

Citation

If you find these models useful for your research, please cite the relevant papers

@article {Brixi2025.02.18.638918,
	author = {Brixi, Garyk and Durrant, Matthew G and Ku, Jerome and Poli, Michael and Brockman, Greg and Chang, Daniel and Gonzalez, Gabriel A and King, Samuel H and Li, David B and Merchant, Aditi T and Naghipourfar, Mohsen and Nguyen, Eric and Ricci-Tam, Chiara and Romero, David W and Sun, Gwanggyu and Taghibakshi, Ali and Vorontsov, Anton and Yang, Brandon and Deng, Myra and Gorton, Liv and Nguyen, Nam and Wang, Nicholas K and Adams, Etowah and Baccus, Stephen A and Dillmann, Steven and Ermon, Stefano and Guo, Daniel and Ilango, Rajesh and Janik, Ken and Lu, Amy X and Mehta, Reshma and Mofrad, Mohammad R.K. and Ng, Madelena Y and Pannu, Jaspreet and Re, Christopher and Schmok, Jonathan C and St. John, John and Sullivan, Jeremy and Zhu, Kevin and Zynda, Greg and Balsam, Daniel and Collison, Patrick and Costa, Anthony B. and Hernandez-Boussard, Tina and Ho, Eric and Liu, Ming-Yu and McGrath, Tom and Powell, Kimberly and Burke, Dave P. and Goodarzi, Hani and Hsu, Patrick D and Hie, Brian},
	title = {Genome modeling and design across all domains of life with Evo 2},
	elocation-id = {2025.02.18.638918},
	year = {2025},
	doi = {10.1101/2025.02.18.638918},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/02/21/2025.02.18.638918},
	eprint = {https://www.biorxiv.org/content/early/2025/02/21/2025.02.18.638918.full.pdf},
	journal = {bioRxiv}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

evo2-0.5.3.tar.gz (45.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

evo2-0.5.3-py3-none-any.whl (44.8 kB view details)

Uploaded Python 3

File details

Details for the file evo2-0.5.3.tar.gz.

File metadata

  • Download URL: evo2-0.5.3.tar.gz
  • Upload date:
  • Size: 45.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for evo2-0.5.3.tar.gz
Algorithm Hash digest
SHA256 9a8d139d62092b4ca575fe768d4a297505a442053d53483471274fb2593bbe65
MD5 dcbe96d7623315dde782d3ca6c38f1da
BLAKE2b-256 88d8af983ab2530a5ca783ccf6d80202267fb464b14dc338ffa15c76e745fa5b

See more details on using hashes here.

File details

Details for the file evo2-0.5.3-py3-none-any.whl.

File metadata

  • Download URL: evo2-0.5.3-py3-none-any.whl
  • Upload date:
  • Size: 44.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for evo2-0.5.3-py3-none-any.whl
Algorithm Hash digest
SHA256 a6379e4c82589e92c340c268aa05574e5e45d46c6d713457aa5b466fcd489276
MD5 49f02f661d6a53c85b2f82bc8df98ceb
BLAKE2b-256 2ff09aee16cfca3714201f2b03f93456e675343666ef8e84564eeb1038c43c2f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page