Evolutionary Scale Modeling (esm): Pretrained language models for proteins. From Facebook AI Research.

Project description

Evolutionary Scale Modeling

This repository contains code and pre-trained weights for Transformer protein language models from Facebook AI Research, including our state-of-the-art ESM-1b protein language model. The models are described in detail in our paper, "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences" (Rives et al., 2019), which first proposed protein language modeling with Transformers.

Citation

@article{rives2019biological,
  author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob},
  title={Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences},
  year={2019},
  doi={10.1101/622803},
  url={https://www.biorxiv.org/content/10.1101/622803v4},
  journal={bioRxiv}
}

Table of contents

Comparison to related works
Usage
Benchmarks
- Comparison on several tasks
Available Models and Datasets
- Pre-trained Models
- ESM Structural Split Dataset
Citations
License

What's New

Dec 2020: Self-Attention Contacts for all pre-trained models (see Rao et al. 2020)
Dec 2020: Added new pre-trained model ESM-1b (see Rives et al. 2019 Appendix B)
Dec 2020: ESM Structural Split Dataset (see Rives et al. 2019 Appendix A.10)

Comparison to related works

Model	Pre-training	Params	SSP	Contact
UniRep	UR50*	18M	58.4	21.9
SeqVec	UR50*	93M	62.1	29.0
TAPE	PFAM*	38M	58.0	23.2
ProtBert-BFD	BFD*	420M	70.0	50.3
LSTM biLM (S)	UR50/S	28M	60.4	24.1
LSTM biLM (L)	UR50/S	113M	62.4	27.8
Transformer-6	UR50/S	43M	62.0	30.2
Transformer-12	UR50/S	85M	65.4	37.7
Transformer-34	UR100	670M	64.3	32.7
Transformer-34	UR50/S	670M	69.2	50.2
ESM-1b	UR50/S	650M	71.6	56.9

Comparison to related protein language models. SSP: secondary structure Q8 accuracy (%) on CB513. Contact: Top-L long-range contact precision (%) on the RaptorX test set.

* Pre-training datasets from related works differ from ours.
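
For readers unfamiliar with the contact metric, the sketch below shows one common way to compute Top-L long-range precision. It relies on conventions that are not stated in this README: L is the sequence length, and a pair (i, j) is treated as long range when |i - j| is at least `min_sep` (24 is a frequently used value, assumed here). The function name and the toy inputs are illustrative only and are not part of this repository.

```python
def top_l_long_range_precision(scores, contacts, min_sep=24):
    """Precision of the L highest-scoring long-range residue pairs.

    scores:   L x L nested list of predicted contact scores (higher = more likely).
    contacts: L x L nested list of booleans, True where two residues are in contact.
    min_sep:  minimum sequence separation |i - j| (24 is an assumed convention).
    """
    length = len(scores)
    # Collect each unordered pair (i < j) once, keeping only long-range pairs.
    pairs = [
        (scores[i][j], i, j)
        for i in range(length)
        for j in range(i + min_sep, length)
    ]
    if not pairs:
        return 0.0
    # Keep the L highest-scoring pairs (fewer if not enough pairs exist).
    top = sorted(pairs, key=lambda p: p[0], reverse=True)[:length]
    hits = sum(1 for _, i, j in top if contacts[i][j])
    return hits / len(top)


# Toy example: 6 residues, contacts at separation 2 or 3, min_sep lowered to 2.
n = 6
true_contacts = [[abs(i - j) in (2, 3) for j in range(n)] for i in range(n)]
perfect = [[1.0 if true_contacts[i][j] else 0.0 for j in range(n)] for i in range(n)]
inverted = [[0.0 if true_contacts[i][j] else 1.0 for j in range(n)] for i in range(n)]

precision_perfect = top_l_long_range_precision(perfect, true_contacts, min_sep=2)
precision_inverted = top_l_long_range_precision(inverted, true_contacts, min_sep=2)
print(precision_perfect, precision_inverted)  # 1.0 0.5
```

In the toy case the inverted predictor still scores 0.5 because only three long-range pairs are non-contacts, so three of its top six pairs are true contacts.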

Usage

Quick Start

As a prerequisite, you must have PyTorch 1.5 or later installed to use this repository.

You can either work in the root of this repository, or use this one-liner for installation:

$ pip install git+https://github.com/facebookresearch/esm.git

We also support PyTorch Hub, which removes the need to clone and/or install this repository yourself:

import torch
model, alphabet = torch.hub.load("facebookresearch/esm", "esm1b_t33_650M_UR50S")

Then, you can load and use a pretrained model as follows:

import torch
import esm

# Load ESM-1b model
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disable dropout for deterministic results

# Prepare data (first 2 sequences from ESMStructuralSplitDataset superfamily / 4)
data = [
    ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Extract per-residue representations (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)
token_representations = results["representations"][33]

# Generate per-sequence representations via averaging
# NOTE: token 0 is always a beginning-of-sequence token, so the first residue is token 1.
sequence_representations = []
for i, (_, seq) in enumerate(data):
    sequence_representations.append(token_representations[i, 1 : len(seq) + 1].mean(0))

# Look at the unsupervised self-attention map contact predictions
import matplotlib.pyplot as plt
for (_, seq), attention_contacts in zip(data, results["contacts"]):
    plt.matshow(attention_contacts[: len(seq), : len(seq)])
    plt.title(seq)
    plt.show()
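
The slice `1 : len(seq) + 1` in the averaging step matters once a batch holds sequences of different lengths: position 0 is the beginning-of-sequence token, and positions after the last residue may hold other special or padding tokens (the padding layout here is an assumption for illustration). The toy sketch below uses plain Python lists in place of tensors to show which positions contribute to the mean; `mean_over_residues` is a hypothetical helper, not a function from this repository.

```python
def mean_over_residues(token_vectors, seq_len):
    """Average the per-token vectors of one sequence, residues only.

    token_vectors: list of equal-length vectors, one per token position.
                   Position 0 is the beginning-of-sequence token.
    seq_len:       number of residues in the (unpadded) sequence.
    """
    residues = token_vectors[1 : seq_len + 1]  # skip BOS, drop trailing tokens
    dim = len(residues[0])
    return [sum(vec[d] for vec in residues) / len(residues) for d in range(dim)]


# Toy "batch row": BOS, three residues, then two trailing padding positions.
row = [
    [9.0, 9.0],  # BOS token (excluded)
    [1.0, 2.0],  # residue 1
    [3.0, 4.0],  # residue 2
    [5.0, 6.0],  # residue 3
    [0.0, 0.0],  # padding (excluded)
    [0.0, 0.0],  # padding (excluded)
]
pooled = mean_over_residues(row, seq_len=3)
print(pooled)  # [3.0, 4.0]
```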

Compute embeddings in bulk from FASTA

We provide a script that efficiently extracts embeddings in bulk from a FASTA file. A CUDA device is optional and will be auto-detected. The following command extracts embeddings from layers 0, 32, and 33 (the final layer) of the ESM-1b model for a FASTA file:

$ python extract.py esm1b_t33_650M_UR50S examples/some_proteins.fasta my_reprs/ \
    --repr_layers 0 32 33 --include mean per_tok

The directory my_reprs/ now contains one .pt file per FASTA sequence; use torch.load() to load them. extract.py has flags that determine what is included in each .pt file:

--repr_layers (default: final only) selects which layers to include embeddings from.
--include specifies what embeddings to save. You can use the following:
- per_tok includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
- mean includes the embeddings averaged over the full sequence, per layer.
- bos includes the embeddings from the beginning-of-sequence token. (NOTE: Don't use with the pre-trained models - we trained without bos-token supervision)
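
If you would rather stay in Python than call the script, the `(label, sequence)` pairs expected by the batch converter in the Quick Start can be read from a FASTA file in a few lines. This is a minimal sketch and not the parser used by extract.py; the function name `read_fasta` is hypothetical, and it assumes a plain-text FASTA file in which header lines start with `>`.

```python
import io


def read_fasta(handle):
    """Return a list of (label, sequence) tuples from an open FASTA text stream."""
    records = []
    label, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:], []
        elif label is not None:
            chunks.append(line)  # sequences may wrap across several lines
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


# Usage with an in-memory file; pass an opened FASTA file for real data.
example = io.StringIO(">protein1\nMKTVRQ\nERLKSI\n>protein2\nKALTAR\n")
data = read_fasta(example)
print(data)  # [('protein1', 'MKTVRQERLKSI'), ('protein2', 'KALTAR')]
```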

Notebooks

Variant prediction - using the embeddings

To help you get started with using the embeddings, this Jupyter notebook tutorial shows how to train a variant predictor using embeddings from ESM-1. You can adopt a similar protocol to train a model for any downstream task, even with limited data. First, obtain the embeddings for examples/P62593.fasta, either by downloading the precomputed embeddings as instructed in the notebook or by running the following:

# Obtain the embeddings
$ python extract.py esm1_t34_670M_UR50S examples/P62593.fasta examples/P62593_reprs/ \
    --repr_layers 34 --include mean

Then, follow the remaining instructions in the tutorial. You can also run the tutorial in a Colab notebook.

ESMStructuralSplitDataset and self-attention contact prediction

This Jupyter notebook tutorial shows how to load and index the ESMStructuralSplitDataset and how to compute the self-attention map contact predictions, as described in our paper "Transformer protein language models are unsupervised structure learners".

Available Models and Datasets

Pre-trained Models

Shorthand	Full Name	#layers	#params	Dataset	Embedding Dim	Model URL
ESM-1b	esm1b_t33_650M_UR50S	33	650M	UR50/S	1280	https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
ESM1-main	esm1_t34_670M_UR50S	34	670M	UR50/S	1280	https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50S.pt
	esm1_t34_670M_UR50D	34	670M	UR50/D	1280	https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50D.pt
	esm1_t34_670M_UR100	34	670M	UR100	1280	https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR100.pt
	esm1_t12_85M_UR50S	12	85M	UR50/S	768	https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t12_85M_UR50S.pt
	esm1_t6_43M_UR50S	6	43M	UR50/S	768	https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t6_43M_UR50S.pt
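
The full names in the table appear to follow a regular pattern: model family, `t` followed by the number of layers, parameter count, and training dataset. That pattern is inferred from the table rather than documented, so the helper below (`parse_model_name`, a hypothetical name) is only a sketch. One possible use is picking the final layer index to pass as `repr_layers`.

```python
import re

# Pattern inferred from the names in the table above; not an official format.
NAME_PATTERN = re.compile(
    r"^(?P<family>esm1b?)_t(?P<layers>\d+)_(?P<params>\d+M)_(?P<dataset>\w+)$"
)


def parse_model_name(name):
    """Split a full model name into family, layer count, size, and dataset."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    info = match.groupdict()
    info["layers"] = int(info["layers"])
    return info


info = parse_model_name("esm1b_t33_650M_UR50S")
print(info["layers"])  # 33
```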

ESM Structural Split Dataset

This is a five-fold cross validation dataset of protein domain structures that can be used to measure generalization of representations across different levels of structural dissimilarity. The dataset implements structural holdouts at the family, superfamily, and fold level. The SCOPe database is used to classify domains. Independently for each level of structural hold-out, the domains are split into 5 equal sets, i.e. five sets of folds, superfamilies, or families. This ensures that for each of the five partitions, structures having the same classification do not appear in both the train and test sets. For a given classification level each structure appears in a test set once, so that in the cross validation experiment each of the structures will be evaluated exactly once.

The dataset provides 3D coordinates, distance maps, and secondary structure labels. For further details on the construction of the dataset see Rives et al. 2019 Appendix A.10.
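
The hold-out scheme can be illustrated with a small sketch: group labels (for example superfamily identifiers) are partitioned into five sets, and every domain follows its group, so a group never appears on both sides of a split. The function name, the round-robin assignment, and the toy labels are assumptions for illustration; the actual splits are the ones shipped with the dataset.

```python
def grouped_folds(domain_to_group, n_folds=5):
    """Yield (train, test) domain lists so that no group is split across them."""
    groups = sorted(set(domain_to_group.values()))
    # Assign whole groups to folds round-robin (an arbitrary, illustrative choice).
    fold_of_group = {g: i % n_folds for i, g in enumerate(groups)}
    for fold in range(n_folds):
        test = [d for d, g in domain_to_group.items() if fold_of_group[g] == fold]
        train = [d for d, g in domain_to_group.items() if fold_of_group[g] != fold]
        yield train, test


# Toy data: ten domains spread over five hypothetical superfamilies.
domains = {f"d{i}": f"sf{i % 5}" for i in range(10)}
folds = list(grouped_folds(domains))
for train, test in folds:
    train_groups = {domains[d] for d in train}
    test_groups = {domains[d] for d in test}
    assert not train_groups & test_groups  # no group appears on both sides
```

Because every group is assigned to exactly one fold, each domain lands in a test set exactly once across the five folds.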

This Jupyter notebook tutorial shows how to load and index the ESMStructuralSplitDataset.

ESMStructuralSplitDataset downloads splits and pkl when it is initialized. We also provide MSAs for each of the domains. The data can also be downloaded directly from the URLs below.

Name	Description	URL
splits	train/valid splits	https://dl.fbaipublicfiles.com/fair-esm/structural-data/splits.tar.gz
pkl	pkl objects containing sequence, SSP labels, distance map, and 3D coordinates	https://dl.fbaipublicfiles.com/fair-esm/structural-data/pkl.tar.gz
msas	a3m files containing MSA for each domain	https://dl.fbaipublicfiles.com/fair-esm/structural-data/msas.tar.gz

Citations

If you find the models useful in your research, we ask that you cite the following paper:

@article{rives2019biological,
  author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob},
  title={Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences},
  year={2019},
  doi={10.1101/622803},
  url={https://www.biorxiv.org/content/10.1101/622803v4},
  journal={bioRxiv}
}

For the self-attention contact prediction, see the following paper (biorxiv preprint):

@article{rao2020transformer,
  author = {Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander},
  title={Transformer protein language models are unsupervised structure learners},
  year={2020},
  doi={10.1101/2020.12.15.422761},
  url={https://www.biorxiv.org/content/10.1101/2020.12.15.422761v1},
  journal={bioRxiv}
}

Much of this code builds on the fairseq sequence modeling framework. We use fairseq internally for our protein language modeling research. We highly recommend trying it out if you'd like to pre-train protein language models from scratch.

License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

Project details

Release history

Version	Release date
2.0.0	Nov 1, 2022
1.0.3	Oct 18, 2022
1.0.2	Aug 23, 2022
1.0.0	Oct 18, 2022
0.5.0	Aug 12, 2022
0.4.2	Apr 7, 2022
0.4.0	Jul 12, 2021
0.3.1	Mar 26, 2021
0.3.0	Mar 26, 2021
0.2.0 (this version)	Mar 26, 2021
0.1.1	Mar 26, 2021
0.1.0	Mar 26, 2021

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

fair_esm-0.2.0-py3-none-any.whl (30.0 kB)

Uploaded Mar 26, 2021 Python 3

File details

Details for the file fair_esm-0.2.0-py3-none-any.whl.

File metadata

Download URL: fair_esm-0.2.0-py3-none-any.whl
Upload date: Mar 26, 2021
Size: 30.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/51.1.1 requests-toolbelt/0.9.1 tqdm/4.55.0 CPython/3.8.5

File hashes

Hashes for fair_esm-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ef9d6a1dbc5f72c35bbef915d55449e8286d7da2f12fdfe0b644372c6a69dc7c`
MD5	`6df623337551d8ab3e0f83d9ff2d0535`
BLAKE2b-256	`f72c3e266873a3381fd3f5335ee619f74ffc371e54e3aa269fe01f6e726bf6fe`
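
A downloaded wheel can be checked against the SHA256 digest listed above using only the Python standard library. This is a minimal sketch: the helper names are hypothetical, and the file path in the usage note assumes the wheel was saved under its original name in the current directory.

```python
import hashlib

# SHA256 digest copied from the table above.
EXPECTED_SHA256 = "ef9d6a1dbc5f72c35bbef915d55449e8286d7da2f12fdfe0b644372c6a69dc7c"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file at `path` has the expected SHA256 digest."""
    return sha256_of_file(path) == expected
```

For example, `matches_expected("fair_esm-0.2.0-py3-none-any.whl")` should return True for an intact download of this release.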

fair-esm 0.2.0
