Python bindings for TCR trie search
TCRtrie
TCRtrie is a tool for approximate search in TCR repertoires based on a CDR3 index. It can be used both for searching user-provided repertoires and for searching the VDJdb database.
The library supports two search modes:
- edit-distance-based search with bounded substitutions, insertions, and deletions;
- matrix-based search where substitutions are scored using an amino acid substitution matrix.
The core idea is to build a trie index over CDR3 amino acid sequences and use it for fast approximate matching.
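To make the idea concrete, here is a minimal pure-Python sketch of a trie with bounded approximate search. It is an illustration only, not TCRtrie's actual implementation: the names `TrieNode`, `build_trie`, and `search_indices` are hypothetical, and the convention that an "insertion" is an extra residue in the indexed sequence (and a "deletion" a query residue missing from it) is an assumption.

```python
from typing import Dict, List, Set, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.indices: List[int] = []  # positions of sequences that end at this node


def build_trie(sequences: List[str]) -> TrieNode:
    """Insert every sequence into a character trie, remembering its index."""
    root = TrieNode()
    for index, seq in enumerate(sequences):
        node = root
        for ch in seq:
            node = node.children.setdefault(ch, TrieNode())
        node.indices.append(index)
    return root


def search_indices(root: TrieNode, query: str, max_sub: int, max_ins: int,
                   max_del: int, max_edits: int) -> List[int]:
    """Return sorted indices of sequences reachable within the edit bounds."""
    hits: Set[int] = set()
    seen: Set[Tuple[int, int, int, int, int]] = set()

    def walk(node: TrieNode, pos: int, subs: int, ins: int, dels: int) -> None:
        # Prune any branch that exceeds a per-operation or total budget.
        if (subs > max_sub or ins > max_ins or dels > max_del
                or subs + ins + dels > max_edits):
            return
        state = (id(node), pos, subs, ins, dels)
        if state in seen:
            return
        seen.add(state)
        if pos == len(query):
            hits.update(node.indices)
        for ch, child in node.children.items():
            if pos < len(query):
                # Match costs nothing; a mismatch is one substitution.
                walk(child, pos + 1, subs + (ch != query[pos]), ins, dels)
            # Assumed convention: extra residue in the indexed sequence.
            walk(child, pos, subs, ins + 1, dels)
        if pos < len(query):
            # Assumed convention: query residue absent from the indexed sequence.
            walk(node, pos + 1, subs, ins, dels + 1)

    walk(root, 0, 0, 0, 0)
    return sorted(hits)


sequences = ["CASSEGTDGYTF", "CASSLGTDGYTF", "CAISTGDSNQPQHF"]
root = build_trie(sequences)
# Allow a single substitution and nothing else.
print(search_indices(root, "CASSEGTDGYTF", max_sub=1, max_ins=0, max_del=0, max_edits=1))  # prints [0, 1]
```

Sharing prefixes is what makes this fast: a branch is abandoned as soon as its prefix has used up the edit budget, so sequences below it are never compared individually.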
Main objects
VDJdb
VDJdb is a lazy object.
It is initialized on first use and is not callable.
Use it like this:
from tcrtrie import VDJdb
VDJdb.search(...)
Do not write:
VDJdb()
Trie
Trie is the low-level class for building an index from your own AIRR-like TSV file or from in-memory sequence arrays.
Examples
Search in VDJdb with edit distance
from tcrtrie import VDJdb
df = VDJdb.search(  # returns a pandas.DataFrame
    query="CASSEGTDGYTF",
    maxSubstitution=2,
    maxInsertion=1,
    maxDeletion=1,
    maxEdits=2,
    vGeneFilter="TRBV19*01",
    jGeneFilter="TRBJ1-2*01",
    numThreads=8,
    detailed=True,
)
df
Batch search in VDJdb
from tcrtrie import VDJdb
VDJdb.searchForAll(
    queries=["CASSEGTDGYTF", "CAISTGDSNQPQHF"],
    maxSubstitution=2,
    maxInsertion=1,
    maxDeletion=1,
    maxEdits=2,
    vGeneFilters=["TRBV19*01", "TRBV6-6*01"],
    jGeneFilters=["TRBJ1-2*01", "TRBJ1-5*01"],
    numThreads=8,
    detailed=True,
)
Search in VDJdb with a substitution matrix
from tcrtrie import VDJdb
VDJdb.searchWithMatrix(
    query="CASSEGTDGYTF",
    maxCost=12,
    detailed=True,
)
Build a trie from your own TSV file
from tcrtrie import Trie
trie = Trie("my_repertoire.tsv")
hits = trie.SearchIndices(
    query="CASSEGTDGYTF",
    maxSubstitution=2,
    maxInsertion=1,
    maxDeletion=1,
    maxEdits=2,
)
hits
Load a custom substitution matrix
from tcrtrie import Trie
trie = Trie("my_repertoire.tsv")
trie.LoadSubstitutionMatrix("my_matrix.txt", delimiter="", gapFactor=1.5)
hits = trie.SearchIndicesWithMatrix(
    query="CASSEGTDGYTF",
    maxCost=12,
)
hits
Installation
Requirements
- Python 3.8+
- pip
- C++17-compatible compiler
- CMake
On Windows, you may need to install Microsoft C++ Build Tools if no suitable compiler is available.
Install from PyPI
pip install tcrtrie
Install directly from GitHub
python -m pip install git+https://github.com/MikePodsytnik/TCRtrie@0.2.0-tcrtriepy
Clone and build locally
git clone --branch TCRtriePy https://github.com/MikePodsytnik/TCRtrie.git
cd TCRtrie
python -m pip install --upgrade pip setuptools wheel scikit-build-core pybind11
python -m pip install .
VDJdb database management
To avoid silent changes in scientific results, TCRtriePy never updates VDJdb automatically. Database updates are performed explicitly via a command-line tool.
The tcrtrie-vdjdb-update command downloads a selected VDJdb release and installs it into the local cache (~/.cache/tcrtrie/vdjdb). The cached version is then used by the VDJdb object in Python.
Install the latest available VDJdb release:
tcrtrie-vdjdb-update
Install a specific VDJdb version (recommended for reproducibility):
tcrtrie-vdjdb-update --tag 2025-12-29
List available VDJdb releases:
tcrtrie-vdjdb-update --list | head -n 10
After updating the database, restart the Python process or Jupyter kernel to ensure the new version is used.
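If you want to pin the database version from a setup script rather than from the shell, the command can be wrapped in a few lines of Python. This is a minimal sketch: the helper names `build_update_command` and `update_vdjdb` are hypothetical, and only the `tcrtrie-vdjdb-update` command and its `--tag` option described above are used.

```python
import subprocess
from typing import List, Optional


def build_update_command(tag: Optional[str] = None) -> List[str]:
    """Build the argument list; without a tag the latest release is installed."""
    command = ["tcrtrie-vdjdb-update"]
    if tag is not None:
        command += ["--tag", tag]
    return command


def update_vdjdb(tag: Optional[str] = None) -> None:
    """Run the updater and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_update_command(tag), check=True)


# Example (commented out because it downloads data):
# update_vdjdb("2025-12-29")
```

Recording the tag next to your analysis code makes it clear which VDJdb release produced a given result.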
Input data format
Repertoire TSV format
Trie expects an AIRR-like tab-separated file.
Required column:
junction_aa
Optional columns:
v_call
j_call
__group_id
Minimal example:
junction_aa v_call j_call
CASSEGTDGYTF TRBV19*01 TRBJ1-2*01
CAISTGDSNQPQHF TRBV6-6*01 TRBJ1-5*01
Notes:
- the file must be tab-separated;
- amino acid sequences are read from junction_aa;
- sequences must be uppercase and contain only the 20 standard amino acid symbols (ACDEFGHIKLMNPQRSTVWY);
- v_call and j_call are optional, but they are required if you want to use V/J filtering.
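The rules above can be checked before building an index. The following is a small standalone sketch using only the standard library; `validate_repertoire` is a hypothetical helper, not part of tcrtrie, and it only checks the conditions listed in this section.

```python
import csv
import io
from typing import List, TextIO

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def validate_repertoire(handle: TextIO, need_vj: bool = False) -> List[str]:
    """Return a list of human-readable problems; an empty list means the file looks fine."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = reader.fieldnames or []
    problems: List[str] = []

    # junction_aa is always required; v_call/j_call only when V/J filtering is planned.
    required = ["junction_aa"] + (["v_call", "j_call"] if need_vj else [])
    for column in required:
        if column not in columns:
            problems.append(f"missing column: {column}")
    if "junction_aa" not in columns:
        return problems

    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        seq = row["junction_aa"] or ""
        if not seq or not set(seq) <= AMINO_ACIDS:
            problems.append(f"line {line_no}: invalid junction_aa {seq!r}")
    return problems


example = "junction_aa\tv_call\tj_call\nCASSEGTDGYTF\tTRBV19*01\tTRBJ1-2*01\n"
print(validate_repertoire(io.StringIO(example), need_vj=True))  # prints []
```

For a file on disk, pass `open("my_repertoire.tsv", newline="")` instead of the `io.StringIO` object.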
Substitution matrix format
The substitution matrix must be a square matrix over amino acid symbols.
Requirements:
- all 20 standard amino acids must be present;
- the matrix may contain either 20 labels, or 21 labels if the gap symbol (-) is provided explicitly;
- row and column labels must match;
- diagonal values must be strictly greater than every other value in the corresponding row and column.
Whitespace-separated matrices are supported by default.
If your matrix uses another separator, pass it via the delimiter argument.
Example:
A R N D ...
A 4 -1 -2 -2
R -1 5 0 -2
N -2 0 6 1
D -2 -2 1 6
...
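The requirements above can be verified with a short script before the matrix is passed to the library. This is a sketch under assumptions: `parse_matrix` is a hypothetical helper, not tcrtrie's loader, and it applies the diagonal rule to the 20 amino acid rows only, since the text does not say how a gap row should be treated.

```python
from typing import Dict, Optional

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_matrix(text: str, delimiter: Optional[str] = None) -> Dict[str, Dict[str, float]]:
    """Parse a labelled square matrix; delimiter=None splits on any whitespace."""
    rows = [[cell.strip() for cell in line.split(delimiter)]
            for line in text.splitlines() if line.strip()]
    # A delimited header may start with an empty corner cell; drop empties.
    col_labels = [cell for cell in rows[0] if cell]
    row_labels = [row[0] for row in rows[1:]]

    if row_labels != col_labels:
        raise ValueError("row and column labels must match")
    if set(AMINO_ACIDS) - set(col_labels):
        raise ValueError("all 20 standard amino acids must be present")
    if set(col_labels) - set(AMINO_ACIDS) - {"-"}:
        raise ValueError("only amino acid labels and '-' are allowed")

    scores: Dict[str, Dict[str, float]] = {}
    for row in rows[1:]:
        values = row[1:]
        if len(values) != len(col_labels):
            raise ValueError("matrix must be square")
        scores[row[0]] = {col: float(v) for col, v in zip(col_labels, values)}

    # Diagonal must be strictly greater than the rest of its row and column.
    for a in AMINO_ACIDS:
        for b in col_labels:
            if a != b and (scores[a][a] <= scores[a][b] or scores[a][a] <= scores[b][a]):
                raise ValueError(f"diagonal for {a} is not strictly the largest")
    return scores


# Toy matrix for demonstration only: 4 on the diagonal, -1 elsewhere.
labels = list(AMINO_ACIDS)
toy = "\n".join([" ".join(labels)] +
                [a + " " + " ".join("4" if a == b else "-1" for b in labels) for a in labels])
matrix = parse_matrix(toy)
print(matrix["A"]["A"], matrix["A"]["R"])  # prints 4.0 -1.0
```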
How gap scores are handled
If the matrix already contains -, those values are used directly.
If the matrix has no gap row and column, the gap score for amino acid a is derived from its diagonal score: gap(a) = -score(a, a) * gapFactor.
The same synthesized value is then used for both directions, a -> - and - -> a.
By default, gapFactor=1.0.
After that, the score matrix is converted into an internal non-negative cost matrix used by matrix-based search.
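As a worked illustration of the gap rule, the sketch below synthesizes gap scores for a matrix stored as nested dictionaries. The function name `add_gap_scores` is hypothetical and the code is not taken from tcrtrie; it implements only the formula gap(a) = -score(a, a) * gapFactor stated above. The gap-to-gap entry and the later conversion to a cost matrix are left out because this document does not specify them.

```python
from typing import Dict

Matrix = Dict[str, Dict[str, float]]


def add_gap_scores(scores: Matrix, gap_factor: float = 1.0, gap: str = "-") -> Matrix:
    """Return a copy of the matrix with gap scores filled in when they are absent."""
    result: Matrix = {a: dict(row) for a, row in scores.items()}
    if gap in scores:
        # Explicit gap values take precedence and are used as provided.
        return result
    result[gap] = {}
    for a in scores:
        value = -scores[a][a] * gap_factor
        result[a][gap] = value   # a -> -
        result[gap][a] = value   # - -> a
    return result


# Two-letter toy matrix, for illustration only.
toy_scores = {"A": {"A": 4, "R": -1}, "R": {"A": -1, "R": 5}}
with_gaps = add_gap_scores(toy_scores, gap_factor=1.5)
print(with_gaps["A"]["-"], with_gaps["-"]["R"])  # prints -6.0 -7.5
```

A larger gapFactor makes insertions and deletions more expensive relative to substitutions.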
Project details
Source Distribution
File details
Details for the file tcrtrie-0.2.0.tar.gz.
File metadata
- Download URL: tcrtrie-0.2.0.tar.gz
- Upload date:
- Size: 35.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e0fedb61ce25d6fb6d11ce72886b27aff3f3a6b99d8803e92ade93c1c61d9a33 |
| MD5 | 06b4e7447dc0970e009e75291cc19f41 |
| BLAKE2b-256 | d99dc7afb2463f3e65e896a5949da3194c8b88738eebbe76fbe9a9264f1db767 |