Skip to main content

Search OAS for similar sequences

Project description


KA-Search: Rapid and exact sequence identity search of known antibodies

In antibody therapeutic drug discovery, finding antibodies with similar sequence, or complementary determining regions (CDR) as an antibody of interest is a convenient method for understanding ones drug. A simple and often used approach for finding similar sequences is using sequence identity, either for the whole sequence or CDRs. However, with the current and ever-growing size of available antibody sequences, searching against all known antibodies is becoming ever the more difficult. The Observed Antibody Space (OAS) database currently contains 1.7 billion antibody sequences. A tool optimized for this specific task is therefore relevant for speeding up this process and making it feasible.

Here, we introduce Known Antibody Search (KA-Search), a tool for rapid search of similar whole chain, CDRs and CDR3 antibodies in the OAS database. We demonstrate how with a prepared target dataset, KA-Search can be used to find the N most similar antibodies from a dataset of 1.7 billion antibodies within ~22 mins on 6 CPUs. Further, for convenience, a user can create their own database to search against, utilizing the same functionality and speed on in-house data.


Install KA-Search

KA-Search is freely available and can be installed with pip.

    pip install kasearch

or directly from github.

    pip install -U git+

Searching with KA-Search

A Jupyter notebook showcasing KA-Search can be found here.

Align query sequence

Align query sequences using the AlignSequences class.

from kasearch import AlignSequences, SearchDB, ExtractMetadata, PrepareDB

raw_queries = [ 'QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS', 'EVQLQQSGTVLARPGASVKMSCEASGYTFTNYWMHWVKQRPGQGLEWIGAIYPGNSDTSYIQKFKGKAKLTAVTSTTSVYMELSSLTNEDSAVYYCTLYDGYYVFAYWGQGTLVTVSA',
]

query_db = AlignSequences(raw_queries, # Sequences as strings to align.
                          n_jobs=1     # Allocated number for jobs/threads for the search.
                         )
query_db.db.aligned_seqs[0]

The alignment should look like this.

array([[81, 86, 75,  0, 76, 81, 69, 83, 71, 65,  0, 69, 76, 65, 82, 80,
        71, 65, 83, 86, 75, 76, 83, 67, 75, 65, 83, 71, 89, 84, 70,  0,
         0,  0,  0,  0,  0,  0,  0,  0, 84, 78, 89, 87, 77, 81,  0, 87,
        86, 75, 81,  0, 82,  0, 80,  0, 71,  0, 81,  0,  0, 71,  0, 76,
        68,  0, 87, 73, 71, 65, 73, 89, 80, 71,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0, 68, 71, 78, 84, 82, 89,  0,  0, 84,  0,  0,
        72,  0,  0, 75, 70,  0,  0, 75,  0,  0,  0, 71, 75, 65, 84, 76,
        84, 65,  0, 68,  0,  0,  0, 75,  0, 83,  0,  0, 83, 83,  0,  0,
         0,  0, 84,  0, 65, 89, 77, 81, 76, 83, 83, 76, 65, 83,  0, 69,
        68, 83, 71, 86, 89, 89, 67, 65, 82, 71, 69, 71, 78,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0, 89, 65, 87, 70, 65, 89, 87, 71,  0, 81,
        71, 84, 84, 86, 84, 86, 83, 83],
       [69, 86, 81,  0, 76, 81, 81, 83, 71, 84,  0, 86, 76, 65, 82, 80,
        71, 65, 83, 86, 75, 77, 83, 67, 69, 65, 83, 71, 89, 84, 70,  0,
         0,  0,  0,  0,  0,  0,  0,  0, 84, 78, 89, 87, 77, 72,  0, 87,
        86, 75, 81,  0, 82,  0, 80,  0, 71,  0, 81,  0,  0, 71,  0, 76,
        69,  0, 87, 73, 71, 65, 73, 89, 80, 71,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0, 78, 83, 68, 84, 83, 89,  0,  0, 73,  0,  0,
        81,  0,  0, 75, 70,  0,  0, 75,  0,  0,  0, 71, 75, 65, 75, 76,
        84, 65,  0, 86,  0,  0,  0, 84,  0, 83,  0,  0, 84, 84,  0,  0,
         0,  0, 83,  0, 86, 89, 77, 69, 76, 83, 83, 76, 84, 78,  0, 69,
        68, 83, 65, 86, 89, 89, 67, 84, 76, 89, 68, 71, 89,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0, 89, 86, 70, 65, 89, 87, 71,  0, 81,
        71, 84, 76, 86, 84, 86, 83, 65]], dtype=int8)

Initiate target database

Next, you initiate the target database to search against. Default is OAS, however, a description for creating your own database to search against can be found further down.

DB_PATH = "../data/db-example"

oasdb = SearchDB(database_path=DB_PATH,  # DB path. Default will be to download a prepared version of OAS.
                 allowed_chain='Any',    # Search against a specific chain. Default is any chain.
                 allowed_species='Any',  # Search against a specific species. Default is any species.
                 n_jobs=5                # Allocated number for jobs/threads for the search.
                )

Search database with aligned query

The initiated database can now be searched using an aligned query.

oasdb.search(query_db.db.aligned_seqs[0], # Input can only be a single aligned sequence at a time.
             keep_best_n=2,               # You can define how many most similar sequences to return
             reset_best=True              # In cases where you want to search with the same sequence against multiple databases, you might want to not reset_best. (True is default)
            )

search_oas.current_best_identities

Returning the following best identities

array([[0.79510413, 0.78571429, 0.83333333],
       [0.79161837, 0.78571429, 0.78409091]])

Extract meta data from the N best results

The meta data of the N best matches can then be extracted using the N best indexes.

To get the meta data of the N best sequences from a specific region, the correct index needs to be specified.

  • 0: whole sequence
  • 1: CDRs
  • 2: CDR3

e.g. "oasdb.current_best_ids[:,0]" returns meta data for the N best whole sequence identities.

NB: The column "sequence_alignment_aa" holds the antibody sequence.

meta_db = ExtractMetadata()

n_best_sequences = meta_db.get_meta(oasdb.current_best_ids[:,0])
n_best_sequences

Returns

array([[0.79510413, 0.78571429, 0.83333333],
       [0.79161837, 0.78571429, 0.78409091]])

Create custom database

To create your own database you first need to create a csv file in the OAS format. For an example file, look at data/custom-data-example.csv. This file consists of a dictionary containing the metadata in the first line and then rows of the individual sequences afterwards. Only the Species and Chain is strictly needed in the metadata, and only the amino acids sequence of the antibodies is required for each antibody sequence.

1. Format your data into OAS files

import json
import os
import pandas as pd

from kasearch.species_anarci import number
from kasearch.merge_db import merge_files

metadata = {"Species":"Human", "Chain":"Heavy"}
metadata = pd.Series(name=json.dumps(metadata), dtype='object')
seqsdata = pd.DataFrame([["EVQLVESGGGLAKPGGSLRLHCAASGFAFSSYWMNWVRQAPGKRLEWVSAINLGGGLTYYAASVKGRFTISRDNSKNTLSLQMNSLRAEDTAVYYCATDYCSSTYCSPVGDYWGQGVLVTVSS"],
                          ["EVQLVQSGAEVKRPGESLKISCKTSGYSFTSYWISWVRQMPGKGLEWMGAIDPSDSDTRYNPSFQGQVTISADKSISTAYLQWSRLKASDTATYYCAIKKYCTGSGCRRWYFDLWGPGT"]
                         ], columns = ['sequence_alignment_aa'])
                         
save_file = "../data/custom-data-example-2.csv"
metadata.to_csv(save_file, index=False)
seqsdata.to_csv(save_file, index=False, mode='a')

2. Turn your OAS formatted files into a custom database

After creating all the files you want to include in the new database, you can run the following code to create the database.

db_folder = "../data/my_db"
db_files = ['../data/custom-data-example.csv']

newDB = PrepareDB(db_path=db_folder)

db_dict = {}
for num, data_unit_file in enumerate(db_files):

    db_dict[num] = data_unit_file
    
    metadata = json.loads(','.join(pd.read_csv(data_unit_file, nrows=0).columns))
    seqsdata = pd.read_csv(data_unit_file, header=1, usecols=['sequence_alignment_aa']).iloc[:,0].values
    numbered_sequences = [number(sequence, allowed_species=None)[0] for sequence in seqsdata]

    newDB.prepare_database(numbered_sequences, 
                           file_id=num, 
                           chain=metadata['Chain'], 
                           species=metadata['Species'])
    
newDB.save_database()

merge_files(db_folder)

with open(os.path.join(db_folder, "my_db_id_to_study.txt"), "w") as handle: handle.write(str(db_dict))

3. Initiate the search class with your custom database

mydb = SearchDB(database_path=db_folder,      # Path to your database. Default will be to download a prepared version of OAS.
                 allowed_chain='Heavy',       # Search against a specific chain. Default is any chain.
                 allowed_species='Any',       # Search against a specific species. Default is any species.
                 n_jobs=1                     # Allocated number for jobs/threads for the search.
                )

Citation

Work in preparation

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kasearch-0.0.2.tar.gz (63.5 kB view hashes)

Uploaded source

Built Distribution

kasearch-0.0.2-py3-none-any.whl (66.3 kB view hashes)

Uploaded py3

Supported by

AWS AWS Cloud computing Datadog Datadog Monitoring Facebook / Instagram Facebook / Instagram PSF Sponsor Fastly Fastly CDN Google Google Object Storage and Download Analytics Huawei Huawei PSF Sponsor Microsoft Microsoft PSF Sponsor NVIDIA NVIDIA PSF Sponsor Pingdom Pingdom Monitoring Salesforce Salesforce PSF Sponsor Sentry Sentry Error logging StatusPage StatusPage Status page