A pipeline for protein embedding generation and visualization

Bio Embeddings

Resources to learn about bio_embeddings:

Project aims:

  • Facilitate the use of language-model-based biological sequence representations for transfer learning by providing a single, consistent, close-to-zero-friction interface
  • Reproducible workflows
  • Depth of representation (different models from different labs, trained on different datasets for different purposes)
  • Extensive examples, handling of complexity for users (e.g. CUDA OOM abstraction), and well-documented warnings and error messages

The project includes:

  • General-purpose Python embedders based on open models trained on biological sequences (SeqVec, ProtTrans, UniRep, ...)
  • A pipeline which:
    • embeds sequences into matrix-representations (per-amino-acid) or vector-representations (per-sequence) that can be used to train learning models or for analytical purposes
    • projects per-sequence embeddings into lower-dimensional representations using UMAP or t-SNE (for lightweight data handling and visualizations)
    • visualizes low dimensional sets of per-sequence embeddings onto 2D and 3D interactive plots (with and without annotations)
    • extracts annotations from per-sequence and per-amino-acid embeddings using supervised (when available) and unsupervised approaches (e.g. by network analysis)
  • A webserver that wraps the pipeline into a distributed API for scalable and consistent workflows
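
The difference between the matrix (per-amino-acid) and vector (per-sequence) representations mentioned above can be illustrated with a small sketch. It assumes that the per-sequence vector is obtained by averaging over residues, which is one common reduction; whether a given embedder in this package reduces in exactly this way is an assumption. The function name `reduce_per_sequence` and the toy numbers are made up for illustration.

```python
from statistics import fmean


def reduce_per_sequence(per_residue):
    """Average an L x D per-residue matrix into one D-dimensional vector.

    Hypothetical helper: mean pooling is assumed here, not taken from the
    package itself.
    """
    if not per_residue:
        raise ValueError("need at least one residue embedding")
    # zip(*rows) iterates over columns, i.e. one embedding dimension at a time.
    return [fmean(column) for column in zip(*per_residue)]


# Toy stand-in for a real embedding: 3 residues, 4 dimensions each.
toy_embedding = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [5.0, 3.0, 2.0, 2.0],
]

per_sequence = reduce_per_sequence(toy_embedding)
print(len(toy_embedding), "residues ->", len(per_sequence), "dimensions")
```

The per-sequence vector has a fixed length regardless of sequence length, which is what makes it convenient as input to projection methods such as UMAP or t-SNE.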

Installation

You can install bio_embeddings via pip or use it via docker.

Pip

Install the pipeline like so:

pip install bio-embeddings[all]

To get the latest features, please install the pipeline like so:

pip install -U "bio-embeddings[all] @ git+https://github.com/sacdallago/bio_embeddings.git"

Docker

We provide a docker image at ghcr.io/bioembeddings/bio_embeddings. Simple usage example:

docker run --rm --gpus all \
    -v "$(pwd)/examples/docker":/mnt \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    ghcr.io/bioembeddings/bio_embeddings:v0.1.6 /mnt/config.yml

See the docker example in the examples folder for instructions. You can also use ghcr.io/bioembeddings/bio_embeddings:latest which is built from the latest commit.

Installation notes:

bio_embeddings was developed for Unix machines with GPU capabilities and CUDA installed. If your setup diverges from this, you may encounter some inconsistencies (e.g. speed is significantly affected by the absence of a GPU and CUDA). For Windows users, we strongly recommend the use of Windows Subsystem for Linux.
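
Before starting a long run, it can help to check whether a CUDA-capable GPU is likely to be visible. The following is a minimal sketch, not part of bio_embeddings: the function name `cuda_looks_available` is hypothetical, it relies on `torch.cuda.is_available()` when PyTorch is importable, and it otherwise falls back to the rough heuristic of looking for the `nvidia-smi` executable on the PATH.

```python
import importlib.util
import shutil


def cuda_looks_available():
    """Best-effort check for a usable CUDA setup (hypothetical helper)."""
    if importlib.util.find_spec("torch") is not None:
        # PyTorch is installed, so ask it directly.
        import torch

        return bool(torch.cuda.is_available())
    # Fallback heuristic: the NVIDIA driver tools are on the PATH.
    # This only hints at a GPU; it does not prove CUDA will work.
    return shutil.which("nvidia-smi") is not None


if cuda_looks_available():
    print("A CUDA device appears to be available.")
else:
    print("No CUDA device detected; expect embedding to be much slower.")
```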

What model is right for you?

Each model has its strengths and weaknesses (speed, specificity, memory footprint, ...). There is no "one-size-fits-all" model, and we encourage you to try at least two different models when attempting a new exploratory project.

The models prottrans_bert_bfd, prottrans_albert_bfd, seqvec and prottrans_xlnet_uniref100 were all trained with the goal of systematic predictions. From this pool, we believe the optimal model to be prottrans_bert_bfd, followed by seqvec, which has been established for longer and uses a different principle (LSTM vs Transformer).

Usage and examples

We highly recommend checking out the examples folder for pipeline examples, and the notebooks folder for post-processing of pipeline runs and general-purpose use of the embedders.

After having installed the package, you can:

  1. Use the pipeline like:

    bio_embeddings config.yml
    

    A blueprint of the configuration file, and an example setup can be found in the examples directory of this repository.

  2. Use the general purpose embedder objects via python, e.g.:

    from bio_embeddings.embed import SeqVecEmbedder
    
    embedder = SeqVecEmbedder()
    
    embedding = embedder.embed("SEQVENCE")
    

    More examples can be found in the notebooks folder of this repository.
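
As a rough illustration of the configuration file used in step 1, the sketch below writes a small `config.yml` from Python. The layout (a `global` section with `sequences_file` and `prefix`, followed by one section per stage with `type`, `protocol` and `depends_on`) is recalled from the project's examples and should be treated as an assumption; the stage names, the file names and the helper `write_config` are made up for illustration. The blueprint in the examples directory remains the authoritative reference.

```python
from pathlib import Path
from textwrap import dedent

# Assumed layout: a `global` section plus one section per pipeline stage.
# Key names are an assumption; check them against the blueprint in the
# examples directory before relying on them.
CONFIG = dedent("""\
    global:
      sequences_file: sequences.fasta
      prefix: example_run
    seqvec_embeddings:
      type: embed
      protocol: seqvec
      reduce: true
    umap_projections:
      type: project
      protocol: umap
      depends_on: seqvec_embeddings
""")


def write_config(path="config.yml"):
    """Write the example configuration to `path` and return the path."""
    target = Path(path)
    target.write_text(CONFIG, encoding="utf-8")
    return target


if __name__ == "__main__":
    print(CONFIG)
```

Calling `write_config()` and then running `bio_embeddings config.yml` would, under these assumptions, embed the sequences in `sequences.fasta` with SeqVec and project the per-sequence embeddings with UMAP.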

Cite

While we are working on a proper publication, if you are already using this tool, we would appreciate it if you could cite the following poster:

Dallago C, Schütze K, Heinzinger M et al. bio_embeddings: python pipeline for fast visualization of protein features extracted by language models [version 1; not peer reviewed]. F1000Research 2020, 9(ISCB Comm J):876 (poster) (doi: 10.7490/f1000research.1118163.1)

Contributors

  • Christian Dallago (lead)
  • Konstantin Schütze
  • Tobias Olenyi
  • Michael Heinzinger

Development status

Pipeline stages
Web server (unpublished)
  • SeqVec supervised predictions
  • Bert supervised predictions
  • SeqVec unsupervised predictions for GO: CC, BP,..
  • Bert unsupervised predictions for GO: CC, BP,..
  • SeqVec unsupervised predictions for SwissProt (just a link to the 1st-k-nn)
  • Bert unsupervised predictions for SwissProt (just a link to the 1st-k-nn)
General purpose embedders
