Python bindings for TCR trie search

TCRtrie

TCRtrie is a tool for approximate search in TCR repertoires based on a CDR3 index. It can be used both for searching user-provided repertoires and for searching the VDJdb database.

The library supports two search modes:

  • edit-distance-based search with bounded substitutions, insertions, and deletions;
  • matrix-based search where substitutions are scored using an amino acid substitution matrix.
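To make the first mode concrete, here is a minimal pure-Python sketch of what the bounds mean for a single query/target pair. It is an illustration only, not the library's trie-based algorithm. The function name within_bounds is hypothetical, and the convention that an insertion is an extra residue in the target relative to the query is an assumption.

```python
from functools import lru_cache


def within_bounds(query, target, max_sub, max_ins, max_del, max_edits):
    """Return True if target is reachable from query within the given edit bounds."""

    @lru_cache(maxsize=None)
    def go(i, j, n_sub, n_ins, n_del):
        # Reject any path that exceeds a per-operation bound or the total bound.
        if n_sub > max_sub or n_ins > max_ins or n_del > max_del:
            return False
        if n_sub + n_ins + n_del > max_edits:
            return False
        if i == len(query) and j == len(target):
            return True
        if i < len(query) and j < len(target):
            # A match is free; a mismatch costs one substitution.
            cost = 0 if query[i] == target[j] else 1
            if go(i + 1, j + 1, n_sub + cost, n_ins, n_del):
                return True
        # Insertion (assumed convention): the target has an extra residue.
        if j < len(target) and go(i, j + 1, n_sub, n_ins + 1, n_del):
            return True
        # Deletion (assumed convention): a query residue is absent from the target.
        if i < len(query) and go(i + 1, j, n_sub, n_ins, n_del + 1):
            return True
        return False

    return go(0, 0, 0, 0, 0)


# One substitution (E -> A) is allowed by max_sub=1.
print(within_bounds("CASSEGTDGYTF", "CASSAGTDGYTF", 1, 0, 0, 1))  # True
# The same pair is rejected when no edits of any kind are allowed.
print(within_bounds("CASSEGTDGYTF", "CASSAGTDGYTF", 0, 0, 0, 0))  # False
```

The per-operation bounds and the total bound are checked independently, so a pair can satisfy each individual limit and still be rejected by the total.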

Main objects

VDJdb

VDJdb is a lazily initialized object: the database is loaded on first use. It is not callable, so use it directly (VDJdb.search(...)) rather than instantiating it (VDJdb()).
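The general pattern behind such an object can be sketched as follows. This is a generic illustration of lazy initialization in Python, not TCRtrie's actual implementation; LazyObject, FakeDatabase, and load_database are hypothetical names.

```python
class LazyObject:
    """Hypothetical stand-in that builds its target on first attribute access."""

    def __init__(self, factory):
        self._factory = factory
        self._target = None

    def __getattr__(self, name):
        # __getattr__ runs only for attributes that normal lookup cannot find,
        # so _factory and _target never reach this method.
        if self._target is None:
            self._target = self._factory()
        return getattr(self._target, name)


class FakeDatabase:
    """Tiny placeholder for an expensive-to-load database."""

    def __init__(self):
        self.rows = ["CASSEGTDGYTF", "CAISTGDSNQPQHF"]

    def search(self, query):
        return [row for row in self.rows if row == query]


calls = []  # records each time the loader actually runs


def load_database():
    calls.append("loaded")
    return FakeDatabase()


db = LazyObject(load_database)
print(len(calls))                 # 0: nothing has been loaded yet
print(db.search("CASSEGTDGYTF"))  # ['CASSEGTDGYTF']: first use triggers loading
print(len(calls))                 # 1
```

Because the class defines no __call__ method, writing db() raises TypeError, which mirrors the "not callable" behaviour described above.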

Trie

Trie is the low-level class for building an index from your own AIRR-like TSV file or from in-memory sequence arrays.

Examples

Search in VDJdb with edit distance

from tcrtrie import VDJdb

df = VDJdb.search(  # returns pandas.DataFrame
    query="CASSEGTDGYTF",
    maxSubstitution=2,
    maxInsertion=1,
    maxDeletion=1,
    maxEdits=2,
    vGeneFilter="TRBV19*01",
    jGeneFilter="TRBJ1-2*01",
    numThreads=8,
    detailed=True,
)

df

Batch search in VDJdb

from tcrtrie import VDJdb

VDJdb.searchForAll(
    queries=["CASSEGTDGYTF", "CAISTGDSNQPQHF"],
    maxSubstitution=2,
    maxInsertion=1,
    maxDeletion=1,
    maxEdits=2,
    vGeneFilters=["TRBV19*01", "TRBV6-6*01"],
    jGeneFilters=["TRBJ1-2*01", "TRBJ1-5*01"],
    numThreads=8,
    detailed=True,
)

Search in VDJdb with a substitution matrix

from tcrtrie import VDJdb

VDJdb.searchWithMatrix(
    query="CASSEGTDGYTF",
    maxCost=12,
    detailed=True,
)

Build a trie from your own TSV file

from tcrtrie import Trie

trie = Trie("my_repertoire.tsv")

hits = trie.SearchIndices(
    query="CASSEGTDGYTF",
    maxSubstitution=2,
    maxInsertion=1,
    maxDeletion=1,
    maxEdits=2,
)

hits

Load a custom substitution matrix

from tcrtrie import Trie

trie = Trie("my_repertoire.tsv")
trie.LoadSubstitutionMatrix("my_matrix.txt", delimiter="", gapFactor=1.5)

hits = trie.SearchIndicesWithMatrix(
    query="CASSEGTDGYTF",
    maxCost=12,
)

hits

Installation

Requirements

  • Python 3.8+
  • pip
  • C++17-compatible compiler
  • CMake

On Windows, you may need to install Microsoft C++ Build Tools if no suitable compiler is available.

Install from PyPI

pip install tcrtrie

Install directly from GitHub

python -m pip install git+https://github.com/MikePodsytnik/TCRtrie@0.2.0-tcrtriepy

Clone and build locally

git clone --branch TCRtriePy https://github.com/MikePodsytnik/TCRtrie.git
cd TCRtrie
python -m pip install --upgrade pip setuptools wheel scikit-build-core pybind11
python -m pip install .

VDJdb database management

TCRtriePy does not update VDJdb automatically, so that scientific results cannot change silently. Database updates are performed explicitly with a command-line tool.

The tcrtrie-vdjdb-update command downloads and installs a selected VDJdb release into the local cache (~/.cache/tcrtrie/vdjdb). The cached version is then used by the VDJdb object in Python.

Install the latest available VDJdb release:

tcrtrie-vdjdb-update

Install a specific VDJdb version (recommended for reproducibility):

tcrtrie-vdjdb-update --tag 2025-12-29

List available VDJdb releases:

tcrtrie-vdjdb-update --list | head -n 10

After updating the database, restart the Python process or Jupyter kernel to ensure the new version is used.
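To see what is currently in the local cache, a small helper like the one below can list the directory. This is a sketch that only assumes the cache path quoted above; the helper name list_cache is hypothetical, and nothing is assumed about the names or layout of the files inside.

```python
from pathlib import Path

# Cache location quoted in the text above.
DEFAULT_CACHE = Path.home() / ".cache" / "tcrtrie" / "vdjdb"


def list_cache(cache_dir=DEFAULT_CACHE):
    """Return the sorted entry names in the cache directory, or [] if it is missing."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    return sorted(entry.name for entry in cache_dir.iterdir())


# An empty list means no release has been installed in the default location.
print(list_cache())
```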

Input data format

Repertoire TSV format

Trie expects an AIRR-like tab-separated file.

Required column:

  • junction_aa

Optional columns:

  • v_call
  • j_call
  • __group_id

Minimal example:

junction_aa	v_call	j_call
CASSEGTDGYTF	TRBV19*01	TRBJ1-2*01
CAISTGDSNQPQHF	TRBV6-6*01	TRBJ1-5*01
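If your repertoire lives in memory rather than in a file, the same layout can be produced with the standard csv module. The sketch below is one possible way to do it; the helper name to_tsv is hypothetical, and only the column names come from this document.

```python
import csv
import io

records = [
    {"junction_aa": "CASSEGTDGYTF", "v_call": "TRBV19*01", "j_call": "TRBJ1-2*01"},
    {"junction_aa": "CAISTGDSNQPQHF", "v_call": "TRBV6-6*01", "j_call": "TRBJ1-5*01"},
]


def to_tsv(rows, columns=("junction_aa", "v_call", "j_call")):
    """Serialize dict rows to tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(columns), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = to_tsv(records)
print(tsv_text)

# To save it for later use (the file name is just an example):
# with open("my_repertoire.tsv", "w", newline="") as handle:
#     handle.write(tsv_text)
```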

Notes:

  • the file must be tab-separated;
  • amino acid sequences are read from junction_aa;
  • sequences must be uppercase and contain only the 20 standard amino acid symbols (ACDEFGHIKLMNPQRSTVWY);
  • v_call and j_call are optional, but they are required if you want to use V/J filtering.
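The rules in these notes can be checked before building an index. The following is a minimal pre-flight sketch using only the standard library; check_repertoire and its need_vj flag are hypothetical names, and the check is independent of whatever validation the library performs itself.

```python
import csv
import io

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def check_repertoire(tsv_text, need_vj=False):
    """Return a list of problems found in a repertoire TSV given as text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    columns = reader.fieldnames or []
    problems = []

    # junction_aa is always required; v_call/j_call only for V/J filtering.
    required = ["junction_aa"] + (["v_call", "j_call"] if need_vj else [])
    for name in required:
        if name not in columns:
            problems.append(f"missing column: {name}")
    if "junction_aa" not in columns:
        return problems

    for row_no, row in enumerate(reader, start=1):
        seq = row["junction_aa"] or ""
        # Lowercase letters and non-standard symbols both fail the subset test.
        if not seq or not set(seq) <= STANDARD_AA:
            problems.append(f"data row {row_no}: invalid junction_aa {seq!r}")
    return problems


good = "junction_aa\tv_call\tj_call\nCASSEGTDGYTF\tTRBV19*01\tTRBJ1-2*01\n"
print(check_repertoire(good, need_vj=True))  # []
```

A comma-separated file fails the first check, because its whole header is read as a single column name.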

Substitution matrix format

The substitution matrix must be a square matrix over amino acid symbols.

Requirements:

  • all 20 standard amino acids must be present;
  • the matrix may have 20 labels, or 21 if a gap symbol (-) is provided explicitly;
  • row and column labels must match;
  • diagonal values must be strictly greater than every other value in the corresponding row and column.

Whitespace-separated matrices are supported by default. If your matrix uses another separator, pass it via the delimiter argument.

Example:

   A  R  N  D ...
A  4 -1 -2 -2 
R -1  5  0 -2 
N -2  0  6  1 
D -2 -2  1  6 
...
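The requirements above can be checked on a whitespace-separated matrix with a short script. This is a sketch under stated assumptions, not the library's own loader: parse_matrix, check_matrix, and demo_matrix_text are hypothetical helpers, "labels must match" is interpreted as "same labels in the same order", and the diagonal rule is applied to amino acid labels only (not to an explicit gap label).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_matrix(text):
    """Parse a whitespace-separated matrix into (labels, {(row, col): score})."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    labels = lines[0]
    row_labels = [fields[0] for fields in lines[1:]]
    if row_labels != labels:
        raise ValueError("row and column labels must match")
    scores = {}
    for fields in lines[1:]:
        row_label, values = fields[0], fields[1:]
        if len(values) != len(labels):
            raise ValueError(f"row {row_label}: expected {len(labels)} values")
        for col_label, value in zip(labels, values):
            scores[(row_label, col_label)] = float(value)
    return labels, scores


def check_matrix(labels, scores):
    """Return a list of violated requirements (empty if none were found)."""
    problems = []
    missing = sorted(set(AMINO_ACIDS) - set(labels))
    if missing:
        problems.append("missing amino acids: " + "".join(missing))
    for a in labels:
        if a not in AMINO_ACIDS:
            continue  # assumption: the diagonal rule is not applied to '-'
        diag = scores[(a, a)]
        for b in labels:
            if b != a and (scores[(a, b)] >= diag or scores[(b, a)] >= diag):
                problems.append(f"diagonal of {a} is not strictly greatest")
                break
    return problems


def demo_matrix_text(diagonal=4, off_diagonal=-1):
    """Build a synthetic 20x20 matrix for demonstration (not a real scoring matrix)."""
    header = "  " + " ".join(AMINO_ACIDS)
    rows = [
        f"{a} " + " ".join(str(diagonal if a == b else off_diagonal) for b in AMINO_ACIDS)
        for a in AMINO_ACIDS
    ]
    return "\n".join([header] + rows)


labels, scores = parse_matrix(demo_matrix_text())
print(check_matrix(labels, scores))  # []
```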

How gap scores are handled

If the matrix already contains a gap label (-), its values are used directly.

If the matrix does not contain a gap label, the gap score for each amino acid a is synthesized from its diagonal score: gap(a) = -score(a, a) * gapFactor. The same value is written to both a -> - and - -> a. By default, gapFactor=1.0.

After that, the score matrix is converted into an internal non-negative cost matrix used by matrix-based search.
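The gap rule described above can be written out in a few lines. This sketch covers only the gap synthesis step. The conversion to the internal cost matrix is not specified here and is therefore not shown. The function name add_gap_scores and the dictionary representation of the matrix are assumptions made for illustration.

```python
def add_gap_scores(scores, labels, gap_factor=1.0):
    """Return a copy of scores with gap entries set to -score(a, a) * gap_factor."""
    extended = dict(scores)
    if "-" in labels:
        # The matrix already provides gap values, so they are used as-is.
        return extended
    for a in labels:
        gap = -scores[(a, a)] * gap_factor
        extended[(a, "-")] = gap  # a -> -
        extended[("-", a)] = gap  # - -> a
    # The gap-to-gap entry is left out because the text does not define it.
    return extended


toy_scores = {("A", "A"): 4, ("A", "R"): -1, ("R", "A"): -1, ("R", "R"): 5}
with_gaps = add_gap_scores(toy_scores, ["A", "R"], gap_factor=1.5)
print(with_gaps[("A", "-")])  # -6.0
print(with_gaps[("-", "R")])  # -7.5
```

With gapFactor=1.5, opening a gap opposite an amino acid is penalized 1.5 times as heavily as that residue's self-match is rewarded.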
