# SCEPTR

Fast and performant TCR representation model
> [!NOTE]
> The latest version of SCEPTR no longer supports Python versions earlier than 3.9.
Simple Contrastive Embedding of the Primary sequence of T cell Receptors (SCEPTR) is a BERT-like attention model trained on T cell receptor (TCR) data. It maps TCRs to vector representations, which can be used for downstream TCR and TCR repertoire analysis such as TCR clustering or classification.
## Installation

### From PyPI (Recommended)

Coming soon.

### From Source
> [!IMPORTANT]
> To install `sceptr` from source, you must have `git-lfs` installed and set up on your system. This is because you must be able to download the trained model weights directly from the Git LFS servers during your install.
#### Using pip

From your Python environment, run the following, replacing `<VERSION_TAG>` with the appropriate version specifier (e.g. `v1.0.0-alpha.1`). The latest release tags can be found in the 'releases' section of the GitHub repository page.

```shell
pip install git+https://github.com/yutanagano/sceptr.git@<VERSION_TAG>
```
#### Manual install

You can also clone the repository, and from within your Python environment, navigate to the project root directory and run:

```shell
pip install .
```

Note that even for manual installation, you still need `git-lfs` to properly de-reference the stub files at `git clone` time.
#### Troubleshooting

A recent security update to `git` has caused difficulties when cloning repositories that rely on `git-lfs`. This can result in an error message along the lines of:

```
fatal: active `post-checkout` hook found during `git clone`
```

If this happens, you can temporarily set the `GIT_CLONE_PROTECTION_ACTIVE` environment variable to `false` by prepending `GIT_CLONE_PROTECTION_ACTIVE=false` to the install command, like below:

```shell
GIT_CLONE_PROTECTION_ACTIVE=false pip install git+https://github.com/yutanagano/sceptr.git@<VERSION_TAG>
```

This is a known issue for `git` version 2.45.1, and is fixed from version 2.45.2 onward.
## Prescribed data format

> [!IMPORTANT]
> SCEPTR only recognises TCR V/J gene symbols that are IMGT-compliant and known to be functional (i.e. known pseudogenes or ORFs are not allowed). For easy standardisation of TCR gene nomenclature in your data, as well as filtering your data for functional V/J genes, check out `tidytcells`.
SCEPTR expects to receive TCR data in the form of pandas `DataFrame` instances. All TCR data should therefore be represented as a `DataFrame` with the following structure and data types. The column order is irrelevant, and each row should represent one TCR. Incomplete rows are allowed (e.g. only beta chain data available), as long as the SCEPTR variant being used has at least some partial information to go on.
| Column name | Column datatype | Column contents |
|---|---|---|
| TRAV | `str` | IMGT symbol for the alpha chain V gene |
| CDR3A | `str` | Amino acid sequence of the alpha chain CDR3, including the first C and last W/F residues, in all caps |
| TRAJ | `str` | IMGT symbol for the alpha chain J gene |
| TRBV | `str` | IMGT symbol for the beta chain V gene |
| CDR3B | `str` | Amino acid sequence of the beta chain CDR3, including the first C and last W/F residues, in all caps |
| TRBJ | `str` | IMGT symbol for the beta chain J gene |
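As a concrete illustration, a table in the prescribed format could be built like this (the gene symbols and CDR3 sequences below are illustrative example values, not curated data):

```python
import pandas as pd

# Illustrative TCR table in the prescribed format.
# The second row is beta-chain only, which is allowed as an incomplete row.
tcrs = pd.DataFrame(
    {
        "TRAV": ["TRAV38-1*01", None],
        "CDR3A": ["CAHRSAGGGTSYGKLTF", None],
        "TRAJ": ["TRAJ52*01", None],
        "TRBV": ["TRBV2*01", "TRBV28*01"],
        "CDR3B": ["CASSEARGGVEKLFF", "CASSLGTDTQYF"],
        "TRBJ": ["TRBJ1-4*01", "TRBJ2-3*01"],
    }
)
```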
## Usage

### Functional API (`sceptr.sceptr`)

The eponymous `sceptr` submodule is the easiest way to use SCEPTR. It loads the default SCEPTR variant (currently `ab_sceptr`) and exposes its methods directly as module-level functions.
> [!TIP]
> To use the functional API, import the `sceptr` submodule like so:
>
> ```python
> from sceptr import sceptr
> ```
>
> Attempting to access the submodule as an attribute of the top-level module:
>
> ```python
> import sceptr
> sceptr.sceptr.calc_vector_representations()  # ...do something...
> ```
>
> will result in an error.
#### `sceptr.sceptr.calc_vector_representations(instances: DataFrame) -> ndarray`

Map a table of TCRs provided as a pandas `DataFrame` in the above format to a set of vector representations.

Parameters:

- instances (`DataFrame`): DataFrame in the prescribed format.

Returns:

A 2D numpy `ndarray` object where every row vector corresponds to a row in the original TCR `DataFrame`. The returned array will have shape (N, D), where N is the number of TCRs in the input data and D is the dimensionality of the SCEPTR model.
#### `sceptr.sceptr.calc_cdist_matrix(anchors: DataFrame, comparisons: DataFrame) -> ndarray`

Generate a cdist matrix between two collections of TCRs.

Parameters:

- anchors (`DataFrame`): DataFrame in the prescribed format, representing TCRs from collection A.
- comparisons (`DataFrame`): DataFrame in the prescribed format, representing TCRs from collection B.

Returns:

A 2D numpy `ndarray` representing a cdist matrix between TCRs from collection A and B. The returned array will have shape (X, Y), where X is the number of TCRs in collection A and Y is the number of TCRs in collection B.
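Once you have a cdist matrix, nearest-neighbour queries reduce to an argmin over its rows. A stand-alone sketch, using a small synthetic matrix in place of SCEPTR's output:

```python
import numpy as np

# Synthetic stand-in for a cdist matrix of 3 anchor TCRs x 4 comparison TCRs.
cdist_matrix = np.array(
    [
        [0.9, 0.2, 0.7, 0.5],
        [0.1, 0.8, 0.6, 0.4],
        [0.3, 0.9, 0.05, 0.6],
    ]
)

# For each anchor TCR, the index of its nearest comparison TCR.
nearest = cdist_matrix.argmin(axis=1)
print(nearest)  # [1 0 2]
```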
#### `sceptr.sceptr.calc_pdist_vector(instances: DataFrame) -> ndarray`

Generate a pdist vector of distances between each pair of TCRs in the input data.

Parameters:

- instances (`DataFrame`): DataFrame in the prescribed format.

Returns:

A 1D numpy `ndarray` representing a pdist vector of distances between each pair of TCRs in the input data. The returned array will have shape (1/2 * N * (N-1),), where N is the number of TCRs in the input data.
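A condensed pdist vector can be unpacked into a symmetric N x N distance matrix. A sketch with a synthetic vector, assuming the conventional row-major pair ordering (0,1), (0,2), ..., (N-2, N-1) used by SciPy's `pdist`:

```python
import numpy as np

N = 4
# Stand-in for a pdist vector of length N * (N - 1) / 2.
pdist_vector = np.arange(N * (N - 1) // 2, dtype=float)

square = np.zeros((N, N))
i, j = np.triu_indices(N, k=1)  # pair indices in (0,1), (0,2), ... order
square[i, j] = pdist_vector
square += square.T  # mirror into the lower triangle; diagonal stays 0
```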
### Loading specific SCEPTR variants (`sceptr.variant`)

Because SCEPTR is still a project in development, there exist multiple variants of the model. For more curious users, these model variants are available to load and use through the `sceptr.variant` submodule. The submodule exposes functions, each named after a particular model variant, which when called return a `Sceptr` object corresponding to the selected variant. This `Sceptr` object then has the methods `calc_pdist_vector`, `calc_cdist_matrix`, and `calc_vector_representations` available, with function signatures exactly as defined above for the functional API in the `sceptr.sceptr` submodule.
#### Paired-chain variants

| Name | Description |
|---|---|
| `sceptr.variant.default` | default model used by the functional API |
| `sceptr.variant.mlm_only` | default model trained without autocontrastive learning |
| `sceptr.variant.left_aligned` | similar to default model but with learnable token embeddings and a sinusoidal position information embedding method more similar to the original NLP BERT/transformer models |
| `sceptr.variant.left_aligned_mlm_only` | left-aligned variant trained without autocontrastive learning |
| `sceptr.variant.cdr3_only` | only uses the CDR3 loops as input |
| `sceptr.variant.cdr3_only_mlm_only` | only uses CDR3 loops as input, and did not receive autocontrastive learning |
| `sceptr.variant.large` | larger variant with model dimensionality 128 |
| `sceptr.variant.small` | smaller variant with model dimensionality 32 |
| `sceptr.variant.tiny` | smaller variant with model dimensionality 16 |
| `sceptr.variant.blosum` | variant using BLOSUM62 embeddings instead of one-hot |
| `sceptr.variant.average_pooling` | variant using the average-pooling method to generate the TCR representation vector |
| `sceptr.variant.shuffled_data` | variant trained on the Tanno et al. dataset with randomised alpha/beta pairing |
| `sceptr.variant.synthetic_data` | variant trained using synthetic TCR sequences generated by OLGA |
| `sceptr.variant.dropout_noise_only` | variant trained without residue/chain dropping during autocontrastive learning |
| `sceptr.variant.finetuned` | variant fine-tuned using supervised contrastive learning for six pMHCs with peptides GILGFVFTL, NLVPMVATV, SPRWYFYYL, TFEYVSQPFLMDLE, TTDPSFLGRY and YLQPRTFLL (from VDJdb) |
#### Single-chain variants

| Name | Description |
|---|---|
| `sceptr.variant.a_sceptr` | alpha-chain only variant |
| `sceptr.variant.b_sceptr` | beta-chain only variant |