Evo: DNA foundation modeling from molecular to genome scale

Evo is a biological foundation model capable of long-context modeling and design. Evo uses the StripedHyena architecture to enable modeling of sequences at a single-nucleotide, byte-level resolution with near-linear scaling of compute and memory relative to context length. Evo has 7 billion parameters and is trained on OpenGenome, a prokaryotic whole-genome dataset containing ~300 billion tokens.
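To make "single-nucleotide, byte-level resolution" concrete, here is a minimal sketch of byte-level tokenization, where each character of a DNA string becomes one token. It assumes the tokenizer simply maps each character to its UTF-8 byte value. The helper names `tokenize` and `detokenize` are illustrative, not Evo's API, and the real tokenizer may also reserve special token IDs.

```python
def tokenize(sequence: str) -> list[int]:
    """Map each character to its UTF-8 byte value (one token per nucleotide)."""
    return list(sequence.encode("utf-8"))


def detokenize(token_ids: list[int]) -> str:
    """Invert tokenize() for plain ASCII sequences."""
    return bytes(token_ids).decode("utf-8")


token_ids = tokenize("ACGT")
print(token_ids)              # [65, 67, 71, 84]
print(detokenize(token_ids))  # ACGT
```

Under this assumption a sequence of N nucleotides yields exactly N tokens, so the context lengths quoted for the checkpoints translate directly into nucleotides.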

We describe Evo in the paper “Sequence modeling and design from molecular to genome scale with Evo”.

We provide the following model checkpoints:

Checkpoint Name | Description
--- | ---
evo-1-8k-base | A model pretrained with 8,192 context. We use this model as the base model for molecular-scale finetuning tasks.
evo-1-131k-base | A model pretrained with 131,072 context using evo-1-8k-base as the base model. We use this model to reason about and generate sequences at the genome scale.
evo-1-8k-crispr | A model finetuned using evo-1-8k-base as the base model to generate CRISPR-Cas systems.
evo-1-8k-transposon | A model finetuned using evo-1-8k-base as the base model to generate IS200/IS605 transposons.

News

We identified and fixed an issue related to a wrong permutation of some projections, which affects generation quality. To use the new model revision with HuggingFace, please load as follows:

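from transformers import AutoConfig, AutoModelForCausalLM

model_name = 'togethercomputer/evo-1-8k-base'

# Pin revision "1.1_fix" for both the config and the weights.
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision="1.1_fix")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    trust_remote_code=True,
    revision="1.1_fix"
)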

Setup

Requirements

Evo is based on StripedHyena.

Evo uses FlashAttention-2, which may not work on all GPU architectures. Please consult the FlashAttention GitHub repository for the current list of supported GPUs.

Make sure to install the correct PyTorch version on your system.
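As a quick sanity check before installing, the sketch below reports the local PyTorch version, CUDA version, and GPU compute capability. The minimum capability of (8, 0) is an assumption used for illustration only. The FlashAttention repository remains the authority on which GPUs are supported. The helper names are hypothetical and not part of Evo.

```python
def meets_min_capability(capability, minimum=(8, 0)):
    """Return True if a (major, minor) compute capability is at least `minimum`.

    The default minimum is an assumed threshold, not an official requirement.
    """
    return tuple(capability) >= tuple(minimum)


def describe_environment():
    """Summarize the local PyTorch/CUDA setup as a short string."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__}, no CUDA device visible"
    capability = torch.cuda.get_device_capability(0)
    ok = meets_min_capability(capability)
    return (
        f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}, "
        f"compute capability {capability[0]}.{capability[1]}, "
        f"meets assumed minimum: {ok}"
    )


print(describe_environment())
```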

Installation

You can install Evo using pip:

pip install evo-model

or directly from the GitHub source:

git clone https://github.com/evo-design/evo.git
cd evo/
pip install .

We recommend that you install the PyTorch library first, before installing all other dependencies (due to dependency issues of the flash-attn library; see, e.g., this issue).

One of our example scripts, demonstrating how to go from generating sequences with Evo to folding proteins (scripts/generation_to_folding.py), further requires the installation of prodigal. We have created an environment.yml file for this:

conda env create -f environment.yml
conda activate evo-design

Usage

Below is an example of how to download Evo and use it locally through the Python API.

from evo import Evo
import torch

device = 'cuda:0'

evo_model = Evo('evo-1-131k-base')
model, tokenizer = evo_model.model, evo_model.tokenizer
model.to(device)
model.eval()

sequence = 'ACGT'
input_ids = torch.tensor(
    tokenizer.tokenize(sequence),
    dtype=torch.int,
).to(device).unsqueeze(0)

with torch.no_grad():
    logits, _ = model(input_ids) # (batch, length, vocab)

print('Logits: ', logits)
print('Shape (batch, length, vocab): ', logits.shape)

An example of batched inference can be found in scripts/example_inference.py.
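Batching sequences of different lengths generally requires padding them to a common length. The sketch below illustrates that idea only and is not the code in scripts/example_inference.py. The byte-value token IDs and the default `pad_id=1` are assumptions, and `pad_batch` is a hypothetical helper; consult the tokenizer for the actual padding ID.

```python
def pad_batch(token_id_lists, pad_id=1):
    """Right-pad token ID lists to a common length; also return the true lengths."""
    max_len = max(len(ids) for ids in token_id_lists)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in token_id_lists]
    lengths = [len(ids) for ids in token_id_lists]
    return padded, lengths


# Token IDs here are byte values of the characters (an assumption).
batch = [list(b"ACGT"), list(b"AC")]
padded, lengths = pad_batch(batch)
print(padded)   # [[65, 67, 71, 84], [65, 67, 1, 1]]
print(lengths)  # [4, 2]
```

Keeping the true lengths alongside the padded batch makes it possible to ignore logits at padded positions afterwards.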

We provide an example script that prompts the model and samples a set of sequences conditioned on the prompt.

python -m scripts.generate \
    --model-name 'evo-1-131k-base' \
    --prompt ACGT \
    --n-samples 10 \
    --n-tokens 100 \
    --temperature 1. \
    --top-k 4 \
    --device cuda:0
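The `--temperature` and `--top-k` flags correspond to standard sampling controls. The following is a generic sketch of temperature-scaled top-k sampling for a single step, written without dependencies. It is not the implementation in scripts/generate, and `sample_top_k` is a hypothetical helper.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=4, rng=random):
    """Sample one token index from `logits` using temperature and top-k filtering."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature-scale, subtracting the max for numerical stability.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # choices() normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 0.5, -1.0, 0.1, 3.0]
rng = random.Random(0)
print([sample_top_k(logits, temperature=1.0, top_k=2, rng=rng) for _ in range(5)])
```

Lower temperatures concentrate probability on the highest-scoring tokens, while `top_k` bounds how many candidates can be drawn at all. With a four-letter nucleotide alphabet, a top-k of 4 would keep every nucleotide as a candidate under this scheme.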

We also provide an example script for using the model to score the log-likelihoods of a set of sequences.

python -m scripts.score \
    --input-fasta examples/example_seqs.fasta \
    --output-tsv scores.tsv \
    --model-name 'evo-1-131k-base' \
    --device cuda:0
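To illustrate what a sequence log-likelihood is, the sketch below sums next-token log-probabilities from per-position logits of shape (length, vocab). It is a dependency-free illustration of the general technique, not the code in scripts/score. Whether that script sums or averages, and how it handles special tokens, is not specified here. Both helper names are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_norm for x in row]


def sequence_log_likelihood(logits, token_ids):
    """Sum log p(token[t+1] | tokens[:t+1]) using logits of shape (length, vocab)."""
    total = 0.0
    for t in range(len(token_ids) - 1):
        # Logits at position t predict the token at position t + 1.
        total += log_softmax(logits[t])[token_ids[t + 1]]
    return total


# Toy example: uniform logits over a 4-token vocabulary.
toy_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
toy_tokens = [0, 1, 2]
print(sequence_log_likelihood(toy_logits, toy_tokens))  # equals 2 * log(1/4)
```

In this sketch the first token has no preceding context and therefore contributes no term, so a sequence of N tokens yields N - 1 log-probabilities.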

HuggingFace

Evo is integrated with HuggingFace.

from transformers import AutoConfig, AutoModelForCausalLM

model_name = 'togethercomputer/evo-1-8k-base'

model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision="1.1_fix")
model_config.use_cache = True

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=model_config,
    trust_remote_code=True,
    revision="1.1_fix"
)

Together API

Evo is available through Together AI with a web UI, where you can generate DNA sequences with a chat-like interface.

For more detailed or batch workflows, you can call the Together API directly. A simple example follows.

import openai
import os

# Read your Together API key from the environment (export TOGETHER_API_KEY first).
TOGETHER_API_KEY = os.environ["TOGETHER_API_KEY"]

client = openai.OpenAI(
  api_key=TOGETHER_API_KEY,
  base_url='https://api.together.xyz',
)

chat_completion = client.chat.completions.create(
  messages=[
    {
      "role": "system",
      "content": ""
    },
    {
      "role": "user",
      "content": "ACGT", # Prompt the model with a sequence.
    }
  ],
  model="togethercomputer/evo-1-131k-base",
  max_tokens=128, # Sample some number of new tokens.
  logprobs=True
)
print(
    chat_completion.choices[0].logprobs.token_logprobs,
    chat_completion.choices[0].message.content
)

Dataset

The OpenGenome dataset for pretraining Evo is available at Hugging Face datasets.

Citation

Please cite the following publication when referencing Evo.

@article{nguyen2024sequence,
   author = {Eric Nguyen and Michael Poli and Matthew G. Durrant and Brian Kang and Dhruva Katrekar and David B. Li and Liam J. Bartie and Armin W. Thomas and Samuel H. King and Garyk Brixi and Jeremy Sullivan and Madelena Y. Ng and Ashley Lewis and Aaron Lou and Stefano Ermon and Stephen A. Baccus and Tina Hernandez-Boussard and Christopher Ré and Patrick D. Hsu and Brian L. Hie},
   title = {Sequence modeling and design from molecular to genome scale with Evo},
   journal = {Science},
   volume = {386},
   number = {6723},
   pages = {eado9336},
   year = {2024},
   doi = {10.1126/science.ado9336},
   URL = {https://www.science.org/doi/abs/10.1126/science.ado9336},
}
