Evolutionary Scale Modeling (ESM)
=================================

Pretrained language models for proteins
This repository contains a PyTorch implementation of, and pre-trained weights for, the Transformer protein language models from Facebook AI Research described in `"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences" (Rives et al., 2019)`_:
.. code-block:: bibtex

    @article{rives2019biological,
        author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob},
        title={Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences},
        year={2019},
        doi={10.1101/622803},
        url={https://www.biorxiv.org/content/10.1101/622803v3},
        journal={bioRxiv}
    }
.. _"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences" (Rives et al., 2019): https://doi.org/10.1101/622803
Quickstart
----------
As a prerequisite, you must have PyTorch 1.5 or later installed to use this repository. A CUDA device is optional and will be auto-detected.
You can either work in the root of this repository, or use this one-liner for installation:
.. code-block:: bash

    $ pip install git+https://github.com/facebookresearch/esm.git
Then, you can load and use a pretrained model as follows:
.. code-block:: python

    import torch
    import esm

    # Load 34 layer model
    model, alphabet = esm.pretrained.esm1_t34_670M_UR50S()
    batch_converter = alphabet.get_batch_converter()

    # Prepare data (two protein sequences)
    data = [("protein1", "MYLYQKIKN"), ("protein2", "MNAKYD")]
    batch_labels, batch_strs, batch_tokens = batch_converter(data)

    # Extract per-residue embeddings (on CPU)
    with torch.no_grad():
        results = model(batch_tokens, repr_layers=[34])
    token_embeddings = results["representations"][34]

    # Generate per-sequence embeddings via averaging
    # NOTE: token 0 is always a beginning-of-sequence token, so the first residue is token 1.
    sequence_embeddings = []
    for i, (_, seq) in enumerate(data):
        sequence_embeddings.append(token_embeddings[i, 1:len(seq) + 1].mean(0))
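Running on a GPU is optional; if a CUDA device is present, you can move the model and tokens over in the usual PyTorch way. The following is a minimal sketch of that variation, not additional ESM API:

.. code-block:: python

    # Optional: run the same forward pass on a CUDA device, if one is available.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    with torch.no_grad():
        results = model(batch_tokens.to(device), repr_layers=[34])
    token_embeddings = results["representations"][34].cpu()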
We also support PyTorch Hub, which removes the need to clone and/or install this repository yourself:
.. code-block:: python

    import torch

    model, alphabet = torch.hub.load("facebookresearch/esm", "esm1_t34_670M_UR50S")
FASTA embedding extractor
-------------------------
For your convenience, we have provided a script that efficiently extracts embeddings in bulk from a FASTA file:
.. code-block:: bash

    # Extract embeddings (layers 0, 32 and 34) for a FASTA file from the 34-layer model
    $ python extract.py esm1_t34_670M_UR50S examples/some_proteins.fasta my_reprs/ \
        --repr_layers 0 32 34 --include mean per_tok

    # my_reprs/ now contains one ".pt" file per FASTA sequence; use torch.load() to load them.
    # extract.py has flags that determine what's included in the ".pt" file:
    # --repr_layers (default: final layer only) selects which layers to include embeddings from.
    # --include specifies which embeddings to save. You can use the following:
    #   * per_tok includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
    #   * mean includes the embeddings averaged over the full sequence, per layer.
    #   * bos includes the embeddings from the beginning-of-sequence token.
    #     (NOTE: Don't use with the pre-trained models - we trained without bos-token supervision.)
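To give a rough idea of how the saved files can be consumed downstream, here is a small sketch. The dictionary layout (a ``mean_representations`` entry keyed by layer number) is an assumption based on the flags above, and ``my_reprs/some_protein.pt`` is a hypothetical filename; inspect ``.keys()`` on your own files to confirm the layout.

.. code-block:: python

    import torch

    # Hypothetical example: load one of the ".pt" files written by extract.py.
    # The filename and the "mean_representations" key are assumptions; check the
    # dictionary keys on your own output files.
    reprs = torch.load("my_reprs/some_protein.pt")
    print(reprs.keys())

    # Mean (per-sequence) embedding from the final layer of the 34-layer model.
    mean_embedding = reprs["mean_representations"][34]
    print(mean_embedding.shape)  # expected: torch.Size([1280]) for esm1_t34_670M_UR50S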
Tutorial
--------
|ImageLink|_

.. |ImageLink| image:: https://colab.research.google.com/assets/colab-badge.svg
.. _ImageLink: https://colab.research.google.com/github/facebookresearch/esm/blob/master/examples/variant_prediction.ipynb
To help you get started, we provide a `Jupyter notebook tutorial`__ demonstrating how to train a variant predictor using embeddings from ESM. You can adopt a similar protocol to train a model for any downstream task, even with limited data.

__ examples/variant_prediction.ipynb

First, obtain the embeddings for ``examples/P62593.fasta``, either by downloading the `precomputed embeddings`__ as instructed in the notebook or by running the following:

__ https://dl.fbaipublicfiles.com/fair-esm/examples/P62593_reprs.tar.gz

.. code-block:: bash

    # Obtain the embeddings
    $ python extract.py esm1_t34_670M_UR50S examples/P62593.fasta examples/P62593_reprs/ \
        --repr_layers 34 --include mean
Then, follow the remaining instructions in the tutorial. You can also run the tutorial in a `colab notebook`__.

__ https://colab.research.google.com/github/facebookresearch/esm/blob/master/examples/variant_prediction.ipynb
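As an illustration of the kind of downstream model the notebook builds, the sketch below fits a simple ridge regression on the mean embeddings. It assumes scikit-learn is installed, that each ``.pt`` file stores a ``mean_representations`` dict keyed by layer, and that you have already paired embedding files with measured activity values; the notebook shows the exact data handling, so treat this only as a rough outline.

.. code-block:: python

    import numpy as np
    import torch
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    def fit_variant_predictor(embedding_paths, targets, layer=34):
        """Sketch: ridge regression on mean ESM embeddings (not the notebook's exact code)."""
        # Stack one mean embedding per variant; the "mean_representations" key is an assumption.
        X = np.stack([torch.load(p)["mean_representations"][layer].numpy() for p in embedding_paths])
        y = np.asarray(targets, dtype=np.float32)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        predictor = Ridge().fit(X_train, y_train)
        print("held-out R^2:", predictor.score(X_test, y_test))
        return predictor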
Available models
----------------
The following table lists the pretrained models available for use.
See also Table 1 in `the paper`_.
+-----------+---------------------+---------+---------+---------+---------------+----------------+-----------------------------------------------------------------------+
| Shorthand | Full Name           | #layers | #params | Dataset | Embedding Dim | Perplexity/ECE | Model URL                                                             |
+-----------+---------------------+---------+---------+---------+---------------+----------------+-----------------------------------------------------------------------+
| ESM1-main | esm1_t34_670M_UR50S | 34      | 670M    | UR50/S  | 1280          | 8.54           | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50S.pt  |
+-----------+---------------------+---------+---------+---------+---------------+----------------+-----------------------------------------------------------------------+
|           | esm1_t34_670M_UR50D | 34      | 670M    | UR50/D  | 1280          | 8.46           | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50D.pt  |
+-----------+---------------------+---------+---------+---------+---------------+----------------+-----------------------------------------------------------------------+
|           | esm1_t34_670M_UR100 | 34      | 670M    | UR100   | 1280          | 10.32          | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR100.pt  |
+-----------+---------------------+---------+---------+---------+---------------+----------------+-----------------------------------------------------------------------+
|           | esm1_t12_85M_UR50S  | 12      | 85M     | UR50/S  | 768           | 10.45          | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t12_85M_UR50S.pt   |
+-----------+---------------------+---------+---------+---------+---------------+----------------+-----------------------------------------------------------------------+
|           | esm1_t6_43M_UR50S   | 6       | 43M     | UR50/S  | 768           | 11.79          | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t6_43M_UR50S.pt    |
+-----------+---------------------+---------+---------+---------+---------------+----------------+-----------------------------------------------------------------------+
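Each model can be loaded through ``esm.pretrained`` by its full name, following the same pattern as the quickstart above; for example, the 6-layer model is a much lighter option for prototyping (its final layer index is 6 and its embedding dimension is 768, per the table). A minimal sketch, assuming the same loading pattern holds for every name in the table:

.. code-block:: python

    import torch
    import esm

    # Smaller 6-layer model: faster to download and run; 768-dim embeddings (see table above).
    model, alphabet = esm.pretrained.esm1_t6_43M_UR50S()
    batch_converter = alphabet.get_batch_converter()

    _, _, batch_tokens = batch_converter([("protein1", "MYLYQKIKN")])
    with torch.no_grad():
        results = model(batch_tokens, repr_layers=[6])  # final layer of the 6-layer model
    token_embeddings_768 = results["representations"][6]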
Comparison to related work
--------------------------
This table compares to related pre-training methods, and corresponds to Table 8 in `the paper`_.
The last 3 columns are the major benchmark results:

- RH: Remote Homology at the fold level, using the Hit-10 metric on SCOP.
- SSP: Secondary structure Q8 accuracy on CB513.
- Contact: Top-L long-range contact precision on the RaptorX test set from `Wang et al. (2017)`_.
.. _the paper: https://doi.org/10.1101/622803
+----------------+--------------+--------+------+------+---------+
| Model          | Pre-training | Params | RH   | SSP  | Contact |
+----------------+--------------+--------+------+------+---------+
| `UniRep`_      |              | 18M    | .527 | 58.4 | 21.9    |
+----------------+--------------+--------+------+------+---------+
| `SeqVec`_      |              | 93M    | .545 | 62.1 | 29.0    |
+----------------+--------------+--------+------+------+---------+
| `TAPE`_        |              | 38M    | .581 | 58.0 | 23.2    |
+----------------+--------------+--------+------+------+---------+
| LSTM biLM (S)  | UR50/S       | 28M    | .558 | 60.4 | 24.1    |
+----------------+--------------+--------+------+------+---------+
| LSTM biLM (L)  | UR50/S       | 113M   | .574 | 62.4 | 27.8    |
+----------------+--------------+--------+------+------+---------+
| Transformer-6  | UR50/S       | 43M    | .653 | 62.0 | 30.2    |
+----------------+--------------+--------+------+------+---------+
| Transformer-12 | UR50/S       | 85M    | .639 | 65.4 | 37.7    |
+----------------+--------------+--------+------+------+---------+
| Transformer-34 | UR100        | 670M   | .599 | 64.3 | 32.7    |
+----------------+--------------+--------+------+------+---------+
| Transformer-34 | UR50/S       | 670M   | .639 | 69.2 | 50.2    |
+----------------+--------------+--------+------+------+---------+
.. _Wang et al. (2017): https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005324
.. _UniRep: https://www.nature.com/articles/s41592-019-0598-1
.. _SeqVec: https://github.com/rostlab/SeqVec
Performance on TAPE benchmark
-----------------------------
We evaluated our best-performing model on the `TAPE`_ benchmark (Rao et al., 2019), finding that our neural embeddings perform similarly to or better than alignment-based methods.

.. _TAPE: https://github.com/songlab-cal/tape
+--------------------+------+------+-----------------+--------------+-----------+-------------+
| Model              | SS3  | SS8  | Remote homology | Fluorescence | Stability | Contact     |
+--------------------+------+------+-----------------+--------------+-----------+-------------+
| ESM (best neural)  | 0.82 | 0.67 | 0.33            | 0.68         | 0.71      | (0.61)*     |
+--------------------+------+------+-----------------+--------------+-----------+-------------+
| TAPE (best neural) | 0.75 | 0.59 | 0.26            | 0.68         | 0.73      | 0.4         |
+--------------------+------+------+-----------------+--------------+-----------+-------------+
| TAPE (alignment)   | 0.8  | 0.63 | 0.09            | N/A          | N/A       | 0.64        |
+--------------------+------+------+-----------------+--------------+-----------+-------------+

\* Not comparable: ESM (best neural) uses a linear projection on the features (the contact head available in the PyTorch version of TAPE), but the results from the TAPE paper use a ResNet head. See the previous table for a rigorous comparison of ESM and TAPE in a fair benchmarking setup.
Reference
---------
If you find the model useful in your research, we ask that you cite the following paper:
.. code-block:: bibtex

    @article{rives2019biological,
        author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob},
        title={Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences},
        year={2019},
        doi={10.1101/622803},
        url={https://www.biorxiv.org/content/10.1101/622803v3},
        journal={bioRxiv}
    }
Additionally, much of this code hails from the excellent `fairseq`_ sequence modeling framework; we have released this standalone model to facilitate more lightweight and flexible usage. We encourage those who wish to pretrain protein language models from scratch to use fairseq.
.. _fairseq: https://github.com/pytorch/fairseq
License
-------
This source code is licensed under the MIT license found in the ``LICENSE`` file in the root directory of this source tree.