AbLang: A language model for antibodies.


DOI:10.1101/2022.01.20.477061

Motivation: General protein language models have been shown to summarize the semantics of protein sequences into representations that are useful for state-of-the-art predictive methods. However, for antibody-specific problems, such as restoring residues lost due to sequencing errors, a model trained solely on antibodies may be more powerful. Antibodies are one of the few protein types where the volume of sequence data needed for such language models is available, e.g. in the Observed Antibody Space (OAS) database.

Results: Here, we introduce AbLang, a language model trained on the antibody sequences in the OAS database. We demonstrate the power of AbLang by using it to restore missing residues in antibody sequence data, a key issue with B-cell receptor repertoire sequencing: for example, over 40% of OAS sequences are missing the first 15 amino acids. AbLang restores the missing residues of antibody sequences better than using IMGT germlines or the general protein language model ESM-1b. Further, AbLang does not require knowledge of the germline of the antibody and is seven times faster than ESM-1b.

Availability and implementation: AbLang is a Python package available at https://github.com/oxpig/AbLang.


Install AbLang

AbLang is freely available and can be installed with pip.

    pip install ablang

or directly from GitHub.

    pip install -U git+https://github.com/oxpig/AbLang.git

NB: If you use the argument "align=True", you need to manually install a version of ANARCI in the same environment. ANARCI can also be installed using bioconda; however, this version is maintained by a third party.

    conda install -c bioconda anarci
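Because "align=True" only works when ANARCI is importable in the same environment, it can help to check for it before running a long job. The following is a minimal sketch, not part of AbLang: the helper name `can_use_align` is hypothetical, and it assumes that ANARCI installs a top-level Python module named `anarci`.

```python
import importlib.util


def can_use_align(module_name: str = "anarci") -> bool:
    """Return True if the given module can be found in this environment.

    Assumption: ANARCI exposes a top-level module called "anarci".
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if can_use_align():
        print("ANARCI found: align=True should be usable.")
    else:
        print("ANARCI not found: install it before passing align=True.")
```

The check only confirms that the module can be located; it does not verify that ANARCI's own dependencies are working.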

AbLang use cases

A Jupyter notebook showing the different use cases of AbLang and its building blocks can be found here.

Currently, AbLang can be used to generate three different representations/encodings for antibody sequences.

  1. Res-codings: These encodings are 768 values for each residue, useful for residue-specific predictions.

  2. Seq-codings: These encodings are 768 values for each sequence, useful for sequence-specific predictions. Because every sequence gets an encoding of the same length, these encodings also remove the need to align antibody sequences.

  3. Res-likelihoods: These encodings are the likelihoods of each amino acid at each position in a given antibody sequence, useful for exploring possible mutations. The order of amino acids follows the ablang vocabulary.
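As a rough sketch of how the three representations might be requested in one go, the helper below calls a model once per mode and collects the results. The mode strings `rescoding`, `seqcoding` and `likelihood` are not stated in this text; they are assumptions that should be checked against the installed AbLang version. The helper name `encode_all` is hypothetical.

```python
from typing import Any, Callable, Dict, List

# Assumption: these mode strings match the installed AbLang version.
MODES = ("rescoding", "seqcoding", "likelihood")


def encode_all(model: Callable[..., Any], seqs: List[str]) -> Dict[str, Any]:
    """Call the model once per representation and collect results by mode."""
    return {mode: model(seqs, mode=mode) for mode in MODES}
```

With a real model, this would be called with the object returned by `ablang.pretrained("heavy")` and a list of sequence strings. Taking the model as a plain callable keeps the helper easy to test with a stand-in function.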

These representations can be used for a plethora of antibody design applications. As an example, we have used the res-likelihoods from AbLang to restore residues that are missing from antibody sequences, either because of sequencing errors, such as ambiguous bases, or because of the limitations of the sequencing techniques used.
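To illustrate the idea of restoring residues from per-position likelihoods, here is a toy sketch that fills each masked position with its highest-scoring amino acid. It is not AbLang's implementation: the function name `fill_masked` is hypothetical, the scores are invented for illustration, and a dictionary per position is used for readability, whereas AbLang returns likelihoods ordered by its own vocabulary.

```python
from typing import Dict, List


def fill_masked(seq: str, likelihoods: List[Dict[str, float]], mask: str = "*") -> str:
    """Replace each masked position with its most likely amino acid."""
    if len(seq) != len(likelihoods):
        raise ValueError("need one likelihood entry per residue")
    restored = []
    for residue, scores in zip(seq, likelihoods):
        if residue == mask:
            # Pick the amino acid with the highest score at this position.
            restored.append(max(scores, key=scores.get))
        else:
            # Known residues are kept unchanged.
            restored.append(residue)
    return "".join(restored)


# Toy example: invented scores for a three-residue fragment.
toy_scores = [
    {"E": 0.9, "Q": 0.1},
    {"V": 0.8, "L": 0.2},
    {"Q": 0.7, "K": 0.3},
]
print(fill_masked("EV*", toy_scores))  # prints "EVQ"
```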

Antibody sequence restoration

Restoration of antibody sequences can be done using the "restore" mode as seen below.

    import ablang

    heavy_ablang = ablang.pretrained("heavy") # Use "light" if you are working with light chains
    heavy_ablang.freeze()

    seqs = [
        'EV*LVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
        '*************PGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNK*YADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTL*****',
    ]

    heavy_ablang(seqs, mode='restore')

The output of the above is seen below.

    array(['EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
           'QVQLVESGGGVVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS'],
          dtype='<U121')
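When inspecting restored output, it can be useful to list which positions were filled in. The sketch below compares the first masked input sequence with its restored counterpart from the example above. The helper name `restored_positions` is hypothetical, and positions are reported as 0-based indices.

```python
from typing import List, Tuple


def restored_positions(masked: str, restored: str, mask: str = "*") -> List[Tuple[int, str]]:
    """Return (index, residue) pairs for every position that was masked."""
    if len(masked) != len(restored):
        raise ValueError("sequences must have the same length")
    return [
        (i, new)
        for i, (old, new) in enumerate(zip(masked, restored))
        if old == mask
    ]


masked = 'EV*LVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS'
restored = 'EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS'

print(restored_positions(masked, restored))  # prints [(2, 'Q')]
```

This comparison only applies when the input and output have the same length, as in the "restore" mode without alignment.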

For restoration of an unknown number of missing residues at the ends of antibody sequences, the "align" parameter can be set to True.

    seqs = [
        'EV*LVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
        'PGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNK*YADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTL',
    ]

    heavy_ablang(seqs, mode='restore', align=True)

The output of the above is seen below.

    array(['EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
           'QVQLVESGGGVVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS'],
          dtype='<U121')

Citation

@article{Olsen2022,
  title={AbLang: An antibody language model for completing antibody sequences},
  author={Tobias H. Olsen and Iain H. Moal and Charlotte M. Deane},
  journal={bioRxiv},
  doi={10.1101/2022.01.20.477061},
  year={2022}
}
