Language models for Biological Sequence Transformation and Evolutionary Representation.

LBSTER 🦞

lobster is a "batteries included" language model library for proteins and other biological sequences. Led by Nathan Frey, Taylor Joren, Aya Abdelsalam Ismail, Joseph Kleinhenz and Allen Goodman, with many valuable contributions from contributors across Prescient Design, Genentech.

This repository contains training code and access to pre-trained language models for biological sequence data.

Usage

Table of contents

Why you should use LBSTER
Citations
Install instructions
Models
Notebooks
Training and inference
Contributing

Why you should use LBSTER

LBSTER is built for pre-training models quickly from scratch. It is "batteries included." This is most useful if you need to control the pre-training data mixture and embedding space, or want to experiment with novel pre-training objectives and fine-tuning strategies.
LBSTER is a living, open-source library that will be periodically updated with new code and pre-trained models from the Frey Lab at Prescient Design, Genentech. The Frey Lab works on real therapeutic molecule design problems and LBSTER models and capabilities reflect the demands of real-world drug discovery campaigns.
LBSTER is built with beignet, a standard library for biological research, and integrated with cortex, a modular framework for multitask modeling, guided generation, and multi-modal models.
LBSTER supports concepts; we have a concept-bottleneck protein language model we refer to as CB-LBSTER, which supports 718 concepts.

Citations

If you use the code and/or models, please cite the relevant papers. For the lbster code base cite: Cramming Protein Language Model Training in 24 GPU Hours

@article{Frey2024.05.14.594108,
	author = {Frey, Nathan C. and Joren, Taylor and Ismail, Aya Abdelsalam and Goodman, Allen and Bonneau, Richard and Cho, Kyunghyun and Gligorijevi{\'c}, Vladimir},
	title = {Cramming Protein Language Model Training in 24 GPU Hours},
	elocation-id = {2024.05.14.594108},
	year = {2024},
	doi = {10.1101/2024.05.14.594108},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108},
	eprint = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108.full.pdf},
	journal = {bioRxiv}
}

For the cb-lbster code base cite: Concept Bottleneck Language Models for Protein Design

@article{ismail2024conceptbottlenecklanguagemodels,
      title={Concept Bottleneck Language Models For protein design}, 
      author={Aya Abdelsalam Ismail and Tuomas Oikarinen and Amy Wang and Julius Adebayo and Samuel Stanton and Taylor Joren and Joseph Kleinhenz and Allen Goodman and Héctor Corrada Bravo and Kyunghyun Cho and Nathan C. Frey},
      year={2024},
      eprint={2411.06090},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2411.06090}, 
}

Install

Clone the repo, cd into it, and run mamba env create -f env.yml. Then, from the root of the repo, run

pip install -e .

Main models you should use

Pretrained Models

Masked LMs

Shorthand	#params	Dataset	Description	Model checkpoint
Lobster_24M	24 M	uniref50	24M-parameter protein masked language model trained on uniref50	lobster_24M
Lobster_150M	150 M	uniref50	150M-parameter protein masked language model trained on uniref50	lobster_150M

CB LMs

Shorthand	#params	Dataset	Description	Model checkpoint
cb_Lobster_24M	24 M	uniref50+SwissProt	24M-parameter protein concept bottleneck model with 718 concepts	cb_lobster_24M
cb_Lobster_150M	150 M	uniref50+SwissProt	150M-parameter protein concept bottleneck model with 718 concepts	cb_lobster_150M
cb_Lobster_650M	650 M	uniref50+SwissProt	650M-parameter protein concept bottleneck model with 718 concepts	cb_lobster_650M
cb_Lobster_3B	3 B	uniref50+SwissProt	3B-parameter protein concept bottleneck model with 718 concepts	cb_lobster_3B

Loading a pre-trained model

from lobster.model import LobsterPMLM, LobsterPCLM, LobsterCBMPMLM

# Load pre-trained models by checkpoint name
masked_language_model = LobsterPMLM("asalam91/lobster_24M")
concept_bottleneck_masked_language_model = LobsterCBMPMLM("asalam91/cb_lobster_24M")
# Load a causal language model from a local checkpoint file
causal_language_model = LobsterPCLM.load_from_checkpoint("<path to ckpt>")

3D, cDNA, and dynamic models use the same classes.

Models

LobsterPMLM: masked language model (BERT-style encoder-only architecture)
LobsterCBMPMLM: concept bottleneck masked language model (BERT-style encoder-only architecture with a concept bottleneck and a linear decoder)
LobsterPCLM: causal language model (Llama-style decoder-only architecture)
LobsterPLMFold: structure prediction language models (pre-trained encoder + structure head)

Notebooks

Representation learning

Check out the Jupyter notebook tutorial for an example of how to extract embedding representations from different models.

Concept Interventions

Check out the Jupyter notebook tutorial for an example of how to intervene on different concepts with our concept-bottleneck model class.
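To make the idea of an intervention concrete, here is a toy sketch in plain Python. It is not lobster's API: the concept names, weights, and the `decode` and `intervene` helpers are all hypothetical, and the sketch ignores everything in the real model except the linear decoder that sits on top of the concept layer. It only shows why overwriting one concept activation shifts the output logits in a predictable way.

```python
def decode(concepts, weights, bias):
    """Linear decoder: one logit per output token from the concept activations."""
    return [
        sum(w * c for w, c in zip(row, concepts)) + b
        for row, b in zip(weights, bias)
    ]


def intervene(concepts, index, value):
    """Return a copy of the concept activations with one concept overwritten."""
    edited = list(concepts)
    edited[index] = value
    return edited


# Hypothetical toy setup: 3 concepts, 2 output tokens.
concept_names = ["hydrophobicity", "charge", "aromaticity"]
concepts = [0.2, -0.5, 0.1]
weights = [[1.0, 0.0, 2.0], [0.0, 3.0, -1.0]]  # rows are tokens, columns are concepts
bias = [0.0, 0.5]

charge = concept_names.index("charge")
before = decode(concepts, weights, bias)
after = decode(intervene(concepts, charge, 1.0), weights, bias)

# With a linear decoder, each logit moves by (decoder weight) x (change in concept).
shift = [a - b for a, b in zip(after, before)]
```

Because the decoder is linear, the first token's logit is unchanged (its weight on "charge" is zero) while the second moves by 3.0 × 1.5.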

Training and inference

Embedding

The entrypoint lobster_embed is the main driver for embedding sequences and accepts parameters using Hydra syntax. The available parameters for configuration can be found by running lobster_embed --help or by looking in the src/lobster/hydra_config directory.

To embed a fasta file of sequences using a pre-trained model on an interactive GPU node, cd into the root dir of this repo and do

lobster_embed data.path_to_fasta="test_data/query.fasta" checkpoint="path_to_checkpoint.ckpt"

This will generate a dataframe of embeddings and also log them to wandb.
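If your sequences live in Python rather than in a file, a small helper can write the FASTA input that `data.path_to_fasta` points to. The sketch below uses only the standard library; `write_fasta`, `read_fasta`, and the allowed alphabet are illustrative assumptions, not part of lobster, and the tokenizer may accept a different set of residue symbols.

```python
from pathlib import Path
import tempfile

# Assumed alphabet: the 20 standard amino acids plus X for unknown residues.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYX")


def write_fasta(records, path):
    """Write (identifier, sequence) pairs to a FASTA file and return its path."""
    lines = []
    for identifier, sequence in records:
        sequence = sequence.strip().upper()
        unknown = set(sequence) - ALLOWED
        if unknown:
            raise ValueError(f"{identifier}: unexpected residues {sorted(unknown)}")
        lines.append(f">{identifier}")
        lines.append(sequence)
    path = Path(path)
    path.write_text("\n".join(lines) + "\n")
    return path


def read_fasta(path):
    """Parse a FASTA file into a list of (identifier, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


records = [("seq1", "MKTAYIAKQR"), ("seq2", "GSHMLEDPVX")]
fasta_path = write_fasta(records, Path(tempfile.mkdtemp()) / "query.fasta")
```

The resulting path can then be passed as `data.path_to_fasta=...` on the `lobster_embed` command line shown above.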

Regression and classification

For robust multitask modeling, we recommend using lobster with cortex. For simple baselines using lobster embeddings, use lobster.model.LinearProbe and lobster.model.LobsterMLP.

Likelihoods

Likelihoods from an autoregressive LobsterPCLM or pseudo-log-likelihoods ("naturalness") from a LobsterPMLM can be computed for a list of sequences using

model.naturalness(sequences)
model.likelihood(sequences)
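For readers unfamiliar with the two quantities, the sketch below computes them with toy stand-in models in plain Python. The `toy_*` functions are hypothetical and simply count residues; they are not lobster models. The sketch also makes no claim about how `naturalness` or `likelihood` normalize or exponentiate their outputs, which the text above does not specify. It only illustrates the textbook definitions: a pseudo-log-likelihood masks each position in turn and scores the true residue given all the others, while an autoregressive log-likelihood scores each residue given only the prefix before it.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_masked_distribution(sequence, position):
    """Hypothetical stand-in for a masked LM: P(residue | sequence with one position masked).

    Uses add-one-smoothed residue counts from the unmasked positions.
    """
    context = sequence[:position] + sequence[position + 1:]
    total = len(context) + len(AMINO_ACIDS)
    return {aa: (context.count(aa) + 1) / total for aa in AMINO_ACIDS}


def pseudo_log_likelihood(sequence, masked_distribution=toy_masked_distribution):
    """Sum over positions of log P(true residue | all other residues)."""
    return sum(
        math.log(masked_distribution(sequence, i)[residue])
        for i, residue in enumerate(sequence)
    )


def toy_next_residue_distribution(prefix):
    """Hypothetical stand-in for a causal LM: P(next residue | prefix)."""
    total = len(prefix) + len(AMINO_ACIDS)
    return {aa: (prefix.count(aa) + 1) / total for aa in AMINO_ACIDS}


def log_likelihood(sequence, next_distribution=toy_next_residue_distribution):
    """Sum over positions of log P(residue at i | residues before i)."""
    return sum(
        math.log(next_distribution(sequence[:i])[residue])
        for i, residue in enumerate(sequence)
    )


pll = pseudo_log_likelihood("AAAA")
ll = log_likelihood("AA")
```

The masked score needs one model call per position, whereas the causal score can be read off a single left-to-right pass, which is the practical difference between the two.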

Training from scratch

The entrypoint lobster_train is the main driver for training and accepts parameters using Hydra syntax. The available parameters for configuration can be found by running lobster_train --help or by looking in the src/lobster/hydra_config directory.

To train an MLM on a fasta file of sequences on an interactive GPU node, cd into the root dir of this repo and do

lobster_train data.path_to_fasta="test_data/query.fasta" logger=csv paths.root_dir="."
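When launching many runs from a script, it can help to assemble the Hydra `key=value` overrides programmatically. The following is a minimal sketch using only the standard library; `hydra_command` is a hypothetical helper, not something lobster provides, and the override keys are just the ones from the command above.

```python
import shlex


def hydra_command(entrypoint, overrides):
    """Build an argv list of `key=value` Hydra overrides for an entrypoint."""
    return [entrypoint] + [f"{key}={value}" for key, value in overrides.items()]


argv = hydra_command(
    "lobster_train",
    {
        "data.path_to_fasta": "test_data/query.fasta",
        "logger": "csv",
        "paths.root_dir": ".",
    },
)

# Shell-safe string form, handy for logging what was launched.
command_line = shlex.join(argv)
print(command_line)

# To actually launch the run (requires lobster to be installed):
# import subprocess
# subprocess.run(argv, check=True)
```

Passing the list form to `subprocess.run` avoids shell quoting issues when a path contains spaces.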

Contributing

Contributions are welcome! We ask that all users and contributors remember that the LBSTER team are all full-time drug hunters, and our open-source efforts are a labor of love because we care deeply about open science and scientific progress.

Install dev requirements and pre-commit hooks

python -m pip install -r requirements-dev.in
pre-commit install

Testing

python -m pytest -v --cov-report term-missing --cov=./lobster ./tests
