
AbLang2: An antibody-specific language model focusing on non-germline (NGL) residue prediction.

Project description


AbLang-2

Addressing the antibody germline bias and its effect on language models for improved antibody design

DOI:10.1101/2024.02.02.578678

Motivation: The versatile binding properties of antibodies have made them an extremely important class of biotherapeutics. However, therapeutic antibody development is a complex, expensive and time-consuming task, with the final antibody needing to not only have strong and specific binding, but also be minimally impacted by any developability issues. The success of transformer-based language models in protein sequence space and the availability of vast amounts of antibody sequences have led to the development of many antibody-specific language models to help guide antibody discovery and design. Antibody diversity primarily arises from V(D)J recombination, mutations within the CDRs, and/or from a small number of mutations away from the germline outside the CDRs. Consequently, a significant portion of the variable domain of all natural antibody sequences remains germline. This affects the pre-training of antibody-specific language models, where this facet of the sequence data introduces a prevailing bias towards germline residues. This poses a challenge, as mutations away from the germline are often vital for generating specific and potent binding to a target, meaning that language models need to be able to suggest key mutations away from germline.

Results: In this study, we explore the implications of the germline bias, examining its impact on both general-protein and antibody-specific language models. We develop and train a series of new antibody-specific language models optimised for predicting non-germline residues. We then compare our final model, AbLang-2, with current models and show how it suggests a diverse set of valid mutations with high cumulative probability. AbLang-2 is trained on both unpaired and paired data, and is freely available (https://github.com/oxpig/AbLang2.git).

Availability and implementation: AbLang2 is a Python package available at https://github.com/oxpig/AbLang2.git.

TCRLang-Paired: The AbLang2 architecture can be initialised with model weights trained on paired TCR sequences. This model can be used on TCR sequences in the same way as AbLang2 is used on antibody sequences. The only missing functionality is the align command. The generation of sequence and residue encodings, as well as masking, works in the same way. For an example, please see the notebook.


Install AbLang2

AbLang2 is freely available and can be installed with pip:

    pip install ablang2

or directly from GitHub:

    pip install -U git+https://github.com/oxpig/AbLang2.git

NB: If you want to have your returned output aligned (i.e. use the argument "align=True"), you need to manually install Pandas and a version of ANARCI in the same environment. ANARCI can also be installed using bioconda; however, this version is maintained by a third party.

    conda install -c bioconda anarci
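
Before passing "align=True", it can help to confirm that both optional dependencies are importable in the active environment. The check below is a minimal sketch using only the standard library. It assumes the import names are `pandas` and `anarci`; `missing_align_dependencies` is a hypothetical helper and not part of AbLang2.

```python
import importlib.util


def missing_align_dependencies(modules=("pandas", "anarci")):
    """Return the names of modules that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_align_dependencies()
    if missing:
        print("align=True is likely to fail; missing:", ", ".join(missing))
    else:
        print("pandas and ANARCI were found; align=True should be usable.")
```

The check only looks for the modules without importing them, so it is cheap to run at the top of a script or notebook.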

AbLang2 use cases

AbLang2 can be used in different ways and for a variety of use cases. The central building blocks are the tokenizer, AbRep, and AbLang.

  • Tokenizer: Converts sequences and amino acids to tokens, and vice versa
  • AbRep: Generates residue embeddings from tokens
  • AbLang: Generates amino acid likelihoods from tokens

    import torch
    import ablang2

    # Download and initialise the model
    ablang = ablang2.pretrained(model_to_use='ablang2-paired', random_init=False, ncpu=1, device='cpu')

    seq = [
        'EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCARDVPGHGAAFMDVWGTGTTVTVSS',  # The heavy chain (VH) needs to be the first element
        'DIQLTQSPLSLPVTLGQPASISCRSSQSLEASDTNIYLSWFQQRPGQSPRRLIYKISNRDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK',  # The light chain (VL) needs to be the second element
    ]

    # Tokenize input sequences
    seqs = [f"{seq[0]}|{seq[1]}"]  # Input needs to be a list, with | used to separate the VH and VL
    tokenized_seq = ablang.tokenizer(seqs, pad=True, w_extra_tkns=False, device="cpu")

    # Generate rescodings (per-residue embeddings)
    with torch.no_grad():
        rescoding = ablang.AbRep(tokenized_seq).last_hidden_states

    # Generate logits/likelihoods
    with torch.no_grad():
        likelihoods = ablang.AbLang(tokenized_seq)
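
The model output above holds one score per vocabulary token at each position. Assuming these are unnormalised logits (the comment above describes them as logits/likelihoods), a softmax over the last axis turns them into probabilities that can be ranked. The sketch below uses plain Python and made-up numbers over a hypothetical four-letter alphabet, so it runs without downloading the model; `softmax` and `top_residues` are illustrative helpers, not AbLang2 functions. With real tensors, `torch.softmax(likelihoods, dim=-1)` performs the same normalisation.

```python
import math


def softmax(logits):
    """Convert one position's logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_residues(logits, alphabet, k=2):
    """Return the k most probable residues at one position, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(alphabet, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical toy alphabet and logits for a single position (not model output).
toy_alphabet = ["A", "G", "S", "T"]
toy_logits = [2.0, 0.5, 1.0, -1.0]

for residue, prob in top_residues(toy_logits, toy_alphabet, k=2):
    print(f"{residue}: {prob:.3f}")
```

The real vocabulary is likely to contain non-residue tokens as well, so the mapping from index to residue should be taken from the tokenizer rather than assumed.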

We have built a wrapper for specific use cases, which can be explored via the following Jupyter notebook.

Citation

@article{Olsen2024,
  title={Addressing the antibody germline bias and its effect on language models for improved antibody design},
  author={Tobias H. Olsen and Iain H. Moal and Charlotte M. Deane},
  journal={bioRxiv},
  doi={10.1101/2024.02.02.578678},
  year={2024}
}
