
AbLang2: An antibody-specific language model focusing on non-germline (NGL) prediction.

AbLang-2

Addressing the antibody germline bias and its effect on language models for improved antibody design

DOI:10.1101/2024.02.02.578678

Motivation: The versatile binding properties of antibodies have made them an extremely important class of biotherapeutics. However, therapeutic antibody development is a complex, expensive and time-consuming task, with the final antibody needing to not only have strong and specific binding, but also be minimally impacted by any developability issues. The success of transformer-based language models in protein sequence space and the availability of vast amounts of antibody sequences have led to the development of many antibody-specific language models to help guide antibody discovery and design. Antibody diversity primarily arises from V(D)J recombination, mutations within the CDRs, and/or from a small number of mutations away from the germline outside the CDRs. Consequently, a significant portion of the variable domain of all natural antibody sequences remains germline. This affects the pre-training of antibody-specific language models, where this facet of the sequence data introduces a prevailing bias towards germline residues. This poses a challenge, as mutations away from the germline are often vital for generating specific and potent binding to a target, meaning that language models need to be able to suggest key mutations away from germline.

Results: In this study, we explore the implications of the germline bias, examining its impact on both general-protein and antibody-specific language models. We develop and train a series of new antibody-specific language models optimised for predicting non-germline residues. We then compare our final model, AbLang-2, with current models and show how it suggests a diverse set of valid mutations with high cumulative probability. AbLang-2 is trained on both unpaired and paired data, and is freely available (https://github.com/oxpig/AbLang2.git).

Availability and implementation: AbLang2 is a Python package available at https://github.com/oxpig/AbLang2.git.


Install AbLang2

AbLang2 is freely available and can be installed with pip.

    pip install ablang2

or directly from GitHub.

    pip install -U git+https://github.com/oxpig/AbLang2.git

NB: If you want to have your returned output aligned (i.e. use the argument "align=True"), you need to manually install Pandas and a version of ANARCI in the same environment. ANARCI can also be installed using bioconda; however, this version is maintained by a third party.

    conda install -c bioconda anarci
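
Before using "align=True", it can help to confirm that both optional dependencies are importable in the current environment. The following is a minimal sketch using only the standard library. The helper name `missing_optional_deps` is hypothetical and not part of AbLang2, and the module names `pandas` and `anarci` are assumptions about how the two packages are exposed for import.

```python
from importlib.util import find_spec


def missing_optional_deps(modules=("pandas", "anarci")):
    """Return the names of the given top-level modules that cannot be found.

    Hypothetical helper (not part of AbLang2). The default module names are
    assumptions about how Pandas and ANARCI are exposed for import.
    """
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_optional_deps()
    if missing:
        print("Install before using align=True:", ", ".join(missing))
    else:
        print("Optional dependencies for align=True are available.")
```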

AbLang2 use cases

AbLang2 can be used in different ways and for a variety of use cases. The central building blocks are the tokenizer, AbRep, and AbLang.

  • Tokenizer: Converts sequences and amino acids to tokens, and vice versa
  • AbRep: Generates residue embeddings from tokens
  • AbLang: Generates amino acid likelihoods from tokens

    import torch
    import ablang2

    # Download and initialise the model
    ablang = ablang2.pretrained(model_to_use='ablang2-paired', random_init=False, ncpu=1, device='cpu')

    seq = [
        'EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCARDVPGHGAAFMDVWGTGTTVTVSS',  # The heavy chain (VH) needs to be the first element
        'DIQLTQSPLSLPVTLGQPASISCRSSQSLEASDTNIYLSWFQQRPGQSPRRLIYKISNRDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK',  # The light chain (VL) needs to be the second element
    ]

    # Tokenize input sequences
    seqs = [f"{seq[0]}|{seq[1]}"]  # Input needs to be a list, with | used to separate the VH and VL
    tokenized_seq = ablang.tokenizer(seqs, pad=True, w_extra_tkns=False, device="cpu")

    # Generate rescodings (residue embeddings); no gradients are needed for inference
    with torch.no_grad():
        rescoding = ablang.AbRep(tokenized_seq).last_hidden_states

    # Generate logits/likelihoods
    with torch.no_grad():
        likelihoods = ablang.AbLang(tokenized_seq)
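
When many antibodies need preparing, the `VH|VL` input convention used above can be wrapped in a small helper. This is a sketch, not part of the AbLang2 API: `format_paired` is a hypothetical name, and restricting input to the 20 standard amino acid letters is an assumption made here (the tokenizer may accept additional symbols).

```python
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def format_paired(vh, vl):
    """Return 'VH|VL' with the heavy chain first.

    Hypothetical helper, not part of AbLang2. Restricting input to the 20
    standard amino acid letters is an assumption made for this sketch.
    """
    cleaned = []
    for label, chain in (("VH", vh), ("VL", vl)):
        chain = chain.strip().upper()
        if not chain:
            raise ValueError(f"{label} sequence is empty")
        unexpected = sorted(set(chain) - STANDARD_AMINO_ACIDS)
        if unexpected:
            raise ValueError(f"{label} contains unexpected characters: {unexpected}")
        cleaned.append(chain)
    return "|".join(cleaned)


# Shortened, illustrative fragments only (not full variable domains)
pairs = [("EVQLLESGG", "DIQLTQSPL")]
seqs = [format_paired(vh, vl) for vh, vl in pairs]
print(seqs)
```

The resulting list has the same shape as the hand-built `seqs` list and can be passed to `ablang.tokenizer` in the same way.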

We have built a wrapper for specific use cases, which can be explored via the following Jupyter notebook.

Citation

@article{Olsen2024,
  title={Addressing the antibody germline bias and its effect on language models for improved antibody design},
  author={Tobias H. Olsen and Iain H. Moal and Charlotte M. Deane},
  journal={bioRxiv},
  doi={10.1101/2024.02.02.578678},
  year={2024}
}
