Python package for generation of protein sequences and evolutionary alignments via generative language models
Project description
Dayhoff
Dayhoff is an Atlas of both protein sequence data and generative language models — a centralized resource that brings together 3.34 billion protein sequences across 1.7 billion clusters of metagenomic and natural protein sequences (GigaRef), 46 million structure-based synthetic sequences (BackboneRef), and 16 million multiple sequence alignments (OpenProteinSet). These models can natively predict zero-shot mutation effects on fitness, scaffold structural motifs by conditioning on evolutionary or structural context, and perform guided generation of novel proteins within specified families. Learning from metagenomic and structure-based synthetic data from the Dayhoff Atlas increased the cellular expression rates of generated proteins, highlighting the real-world value of expanding the scale, diversity, and novelty of protein sequence data.
The Dayhoff architecture is a hybrid of state-space Mamba layers and Transformer self-attention, interleaved with Mixture-of-Experts modules to maximize capacity while preserving efficiency. It natively handles long contexts, allowing both single sequences and unrolled MSAs to be modeled. Trained with an autoregressive objective in both N→C and C→N directions, Dayhoff supports order-agnostic infilling and scales to billions of parameters.
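The two training directions can be illustrated with a plain-Python sketch. The sequence and the reversal below are illustrative only; the actual tokenization is defined by the Dayhoff tokenizer, not by this snippet.

```python
# Sketch: a single protein sequence can be presented to the model in
# either direction, since Dayhoff is trained both N->C and C->N.
seq = "MKTAYIAKQR"           # hypothetical protein sequence, N->C order
n_to_c = list(seq)           # forward presentation (N->C)
c_to_n = list(seq[::-1])     # reverse presentation (C->N)

print(n_to_c[0], c_to_n[0])  # M R
```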
If you use the code, data, models, or results, please cite our preprint.
Table of Contents
- Dayhoff
- Usage
- Installation
- Data and Model availability
- Unconditional generation
- Homolog-conditioned generation
- Analysis
- Out-of-scope use cases
- Responsible AI
- Contributing
- Trademarks
Usage
The simplest way to use these models and datasets is via the HuggingFace interface. Alternatively, you can install this package or use our Docker image. Either way, you will need PyTorch, mamba-ssm, causal-conv1d, and flash-attn.
Prerequisites
Requirements:
- PyTorch: 2.7.1
- CUDA 12.8 or above
We recommend using uv and creating a clean environment.
uv venv dayhoff
source dayhoff/bin/activate
In that new environment, install PyTorch 2.7.1.
uv pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
Now, we need to install mamba-ssm, flash-attn, causal-conv1d, and their prerequisites.
uv pip install wheel packaging
uv pip install --no-build-isolation flash-attn causal-conv1d mamba-ssm
To import from HuggingFace, you will need to install these versions:
uv pip install datasets==3.2.0  # for HF datasets
uv pip install transformers==4.51.0
uv pip install huggingface_hub~=0.34.4
Now, you can simply import the models or datasets into your code.
from transformers import SuppressTokensLogitsProcessor
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from datasets import load_dataset
model = AutoModelForCausalLM.from_pretrained('microsoft/Dayhoff-3b-GR-HM-c')
tokenizer = AutoTokenizer.from_pretrained('microsoft/Dayhoff-3b-GR-HM-c',
trust_remote_code=True)
gigaref_clustered_train = load_dataset("microsoft/DayhoffDataset",
name="gigaref_no_singletons",
split="train")
Installation
Now, we can either install from PyPI:
uv pip install dayhoff
Or, to be able to run the example scripts, clone the repo and install it in editable mode:
git clone https://github.com/microsoft/dayhoff.git
cd dayhoff
uv pip install -e .
Docker
For a fully functional containerized environment without needing to install dependencies manually, you can use the provided Docker image instead:
docker pull samirchar/dayhoff:latest
docker run -it samirchar/dayhoff:latest
Data and model availability
All Dayhoff models are available on Azure AI Foundry.
Additionally, all Dayhoff models are also hosted on Hugging Face 🤗. All datasets used in the paper, with the exception of OpenProteinSet, are available on Hugging Face in three formats: FASTA, Arrow, and JSONL.
GigaRef, BackboneRef, and DayhoffRef are available under the CC BY license.
Datasets
Training datasets
The Dayhoff models were trained on the Dayhoff Atlas, with varying data mixes which include:
UniRef50 (UR50) - dataset from UniProt, clustered at 50% sequence identity, contains only cluster representatives.
- Splits: train (25 GB), test (26 MB), valid (26 MB)
UniRef90 (UR90) - dataset from UniProt, clustered at 90% sequence identity, contains cluster representatives and members.
- Splits: train (83 GB), test (90 MB), valid (87 MB)
GigaRef (GR) – 3.34B protein sequences across 1.7B clusters of metagenomic and natural protein sequences. There are two subsets of GigaRef:
- GigaRef-clusters (GR) - Only includes cluster representatives and members, no singletons
- Splits: train (433 GB), test (22 MB)
- GigaRef-singletons (GR-s) - Only includes singletons
- Splits: train (282 GB)
BackboneRef (BR) – 46M structure-derived synthetic sequences from ca. 240,000 de novo backbones, with three subsets containing 10M sequences each:
- BackboneRef unfiltered (BRu) – 10M sequences randomly sampled from all 46M designs.
- Splits: train (3 GB)
- BackboneRef quality (BRq) – 10M sequences sampled from 127,633 backbones whose average self-consistency RMSD ≤ 2 Å.
- Splits: train (3 GB)
- BackboneRef novelty (BRn) – 10M sequences from 138,044 backbones with a max TM-score < 0.5 to any natural structure.
- Splits: train (3 GB)
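As a sketch of how the BRq and BRn subsets described above differ, the following applies the stated thresholds (self-consistency RMSD ≤ 2 Å for quality, max TM-score < 0.5 for novelty) to a few made-up backbone records; the field names and values are invented for illustration.

```python
# Hypothetical backbone records; only the thresholds come from the text.
backbones = [
    {"id": "bb1", "sc_rmsd": 1.4, "max_tm": 0.62},
    {"id": "bb2", "sc_rmsd": 2.8, "max_tm": 0.41},
    {"id": "bb3", "sc_rmsd": 1.1, "max_tm": 0.33},
]

brq = [b["id"] for b in backbones if b["sc_rmsd"] <= 2.0]  # quality filter
brn = [b["id"] for b in backbones if b["max_tm"] < 0.5]    # novelty filter

print(brq)  # ['bb1', 'bb3']
print(brn)  # ['bb2', 'bb3']
```

Note that the two filters are independent: a backbone such as `bb3` above can pass both.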
OpenProteinSet (HM) – 16 million precomputed MSAs, covering 16M sequences from UniClust30 and 140,000 PDB chains.
DayhoffRef
Given the potential for generative models to expand the space of proteins and their functions, we used the Dayhoff models to generate DayhoffRef, a PLM-generated database of synthetic protein sequences.
DayhoffRef: dataset of 16 million synthetic protein sequences generated by the Dayhoff models: Dayhoff-3b-UR90, Dayhoff-3b-GR-HM, Dayhoff-3b-GR-HM-c, and Dayhoff-170m-UR50-BRn.
- Splits: train (5 GB)
Loading datasets in HuggingFace
Below are some examples on how to load the datasets using load_dataset in HuggingFace:
gigaref_clustered_train = load_dataset("microsoft/DayhoffDataset",
name="gigaref_no_singletons",
split="train")
uniref50_train = load_dataset("microsoft/DayhoffDataset",
name="uniref50",
split = "train")
backboneref_novelty = load_dataset("microsoft/DayhoffDataset",
name="backboneref",
split = "BBR_n")
dayhoffref = load_dataset("microsoft/DayhoffDataset",
name="dayhoffref",
split = "train")
For the largest datasets, consider using streaming=True.
Models
Weights are available for the following models, as described in the paper.
170M parameter models
- Dayhoff-170m-UR50: A 170M parameter model trained on UniRef50 cluster representatives
- Dayhoff-170m-UR90: A 170M parameter model trained on UniRef90 members sampled by UniRef50 cluster
- Dayhoff-170m-GR: A 170M parameter model trained on members sampled from GigaRef clusters
- Dayhoff-170m-BRu: A 170M parameter model trained on UniRef50 cluster representatives and samples from unfiltered BackboneRef
- Dayhoff-170m-BRq: A 170M parameter model trained on UniRef50 cluster representatives and samples from quality-filtered BackboneRef
- Dayhoff-170m-BRn: A 170M parameter model trained on UniRef50 cluster representatives and samples from novelty-filtered BackboneRef
3B parameter models
- Dayhoff-3b-UR90: A 3B parameter model trained on UniRef90 members sampled by UniRef50 cluster
- Dayhoff-3b-GR-HM: A 3B parameter model trained on members sampled from GigaRef clusters and homologs from OpenProteinSet
- Dayhoff-3b-GR-HM-c: A 3B parameter model trained on members sampled from GigaRef clusters and homologs from OpenProteinSet and subsequently cooled using UniRef90 members sampled by UniRef50 cluster and homologs from OpenProteinSet.
Unconditional generation
For most cases, use examples/generate.py to generate new protein sequences. Below is a sample command to generate 10 sequences of at most 100 residues each and write them to a fasta file in the directory generations/:
python examples/generate.py generations/ --model-name Dayhoff-170m-UR50-BBR-n --max-length 100 --n-generations 10 --temp 1.0 --min-p 0.0 --random-seed 1 --gpu 0
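The output is standard FASTA, so a minimal stdlib parser suffices for inspecting generations downstream. The headers in the example string below are placeholders, not the actual headers generate.py emits.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)   # sequences may wrap across lines
    if header is not None:
        yield header, "".join(chunks)

example = ">gen_0\nMKTAYIAK\nQR\n>gen_1\nMGSSHHHH\n"
records = dict(read_fasta(example))
print(records["gen_0"])  # MKTAYIAKQR
```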
Homolog-conditioned generation
examples/generate.py includes an option to pass a fasta file, in which case it performs sequence generation conditioned on the sequences in that file. The order of the conditioning sequences is randomly shuffled for each generation.
python examples/generate.py generations/ --fasta-file example.fasta --model-name Dayhoff-3b-GR-HM-c --max-length 128 --n-generations 10 --temp 1.0 --min-p 0.0 --random-seed 1 --gpu 0
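The per-generation shuffling can be sketched as follows; the separator character and prompt layout here are placeholders for illustration, not the actual Dayhoff conditioning format.

```python
import random

# Each generation sees the homologs from the fasta file in a fresh order.
homologs = ["MKTAYIAK", "MKSAYVAK", "MRTAYIGK"]  # made-up conditioning set

def build_prompt(seqs, rng):
    order = seqs[:]            # copy so the source list is untouched
    rng.shuffle(order)         # reshuffled for every generation
    return "|".join(order)     # '|' is a placeholder separator

rng = random.Random(1)
prompts = {build_prompt(homologs, rng) for _ in range(20)}
print(len(prompts) > 1)  # True: the ordering varies across generations
```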
Zero-shot fitness scoring
examples/score.py will compute forward and backward average log-likelihoods for every sequence in a fasta file.
python examples/score.py example.fasta output_dir/ --model-name Dayhoff-3b-GR-HM-c --gpu 0
Analysis scripts
The following scripts were used to conduct the analyses described in the paper.
Generation:
Dataset analysis:
- clusters.py
- gigaref.py
- gigaref_clusters.py
- gigaref_singles.py
- gigaref_to_jsonl.py
- create_fasta_sample.py
- extract_test_fastas.py
- plot_metrics.py
- sample-clustered-splits.py
- sample_uniref.py
Perplexity:
Sequence fidelity (via folding and inverse folding):
Distributional embedding analysis (via FPD and PNMMD):
Pfam annotation:
DayhoffRef compilation:
ProteinGym evals:
Scaffolding (Details in README.md in scaffolding/):
Evolution guided generation:
Cas9 evals:
Out-of-Scope Use Cases
This model should not be used to generate anything that is not a protein sequence or a set of homologous protein sequences. It is not intended for natural language or other biological sequences, such as DNA.
Responsible AI Considerations
The intended use of this model is to generate high-quality, realistic protein sequences or sets of homologous protein sequences. Sequences can be generated from scratch or conditioned on partial sequences in both N→C and C→N directions.
Risks and limitations: Not all sequences are guaranteed to be realistic. It remains difficult to generate high-quality sequences with no sequence homology to any natural sequence.
The code and datasets released in this repository are provided for research and development use only. They are not intended for use in clinical decision-making or for any other clinical use, and the performance of these models for clinical use has not been established. You bear sole responsibility for any use of these models, data and software, including incorporation into any product intended for clinical use.
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
File details
Details for the file dayhoff-0.1.1.tar.gz.
File metadata
- Download URL: dayhoff-0.1.1.tar.gz
- Upload date:
- Size: 31.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 39a07365bee115ad9c3fd43c6ed0ed51e769acb3e41f18d1c5395e00ecbc7df0 |
| MD5 | f5414f9d836bea513537194a147a72fb |
| BLAKE2b-256 | 336858ee2bd71b1a24755ee3ff45172079a3de0c72c2d319bf29ca809f7f5fac |
File details
Details for the file dayhoff-0.1.1-py3-none-any.whl.
File metadata
- Download URL: dayhoff-0.1.1-py3-none-any.whl
- Upload date:
- Size: 27.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e1a2d51aa381019ef98b34f117511773a92bd1cc0f64826657ec678caf6c745c |
| MD5 | b0fe39e1ada27720dd58d86585dea679 |
| BLAKE2b-256 | cd7a89017733a67ac21590d28742f31a6c1c6e5c50e8cb174d32e211afb8d545 |