T-cell receptor selection for TCR-T studies from antigen-specific culture and scRNA/VDJ sequencing

TCRsift

Select antigen-specific TCRs from single-cell sequencing data.

pip install tcrsift
tcrsift run --sample-sheet samples.yaml -o results/

Architecture
Installation
Quick Start
Sample Sheet Format
Core Pipeline Steps
Supplementary Tools
Workflows
API Reference
Output Files

Architecture

CellRanger VDJ + GEX  -->  tcrsift run  -->  clonotypes.csv
                              |
        load -> phenotype -> clonotype -> filter -> annotate -> assemble

Supplementary: load-sct, annotate-gex, match-til, til-clonotype, til-select, unify

Key Data Structures

Stage	Format	Description
Load → Phenotype	`AnnData`	Per-cell data with expression matrix + VDJ annotations in `.obs`
Clonotype → Assemble	`DataFrame`	Per-clonotype data with aggregated statistics

The transition from AnnData to DataFrame happens at the clonotype aggregation step. After that point, all operations work on clonotype-level DataFrames.

CellRanger Requirements

TCRsift expects standard 10x Genomics CellRanger output directories:

VDJ Directory (from cellranger vdj):

vdj_outs/
├── filtered_contig_annotations.csv   # Required: contig info per cell
├── clonotypes.csv                    # Optional: CellRanger clonotype calls
├── consensus_annotations.csv         # Optional: for sequence assembly
└── filtered_contig.fasta             # Optional: for native leader extraction

Required columns in filtered_contig_annotations.csv:

barcode, chain (TRA/TRB)
cdr3 (amino acid sequence)
v_gene, j_gene, c_gene
umis, reads
productive, full_length
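
A quick pre-flight check for these columns can be sketched as below. This is only an illustration: the helper name `missing_contig_columns` is hypothetical and not part of TCRsift, and it inspects nothing beyond the CSV header.

```python
import csv
import io

# Columns listed above as required in filtered_contig_annotations.csv.
REQUIRED_COLUMNS = [
    "barcode", "chain", "cdr3", "v_gene", "j_gene", "c_gene",
    "umis", "reads", "productive", "full_length",
]


def missing_contig_columns(handle):
    """Return the required columns absent from the CSV header (hypothetical helper)."""
    header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [col for col in REQUIRED_COLUMNS if col not in present]


# In-memory example; in practice, open the real CSV file instead.
example = io.StringIO("barcode,chain,cdr3,v_gene,j_gene,umis,reads,productive\n")
print(missing_contig_columns(example))
```

Running a check like this before `tcrsift load` may give a clearer error than a failure deep inside the pipeline.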

GEX Directory (from cellranger count):

gex_outs/
├── filtered_feature_bc_matrix.h5     # Preferred: HDF5 format
└── filtered_feature_bc_matrix/       # Alternative: MTX directory
    ├── matrix.mtx.gz
    ├── features.tsv.gz
    └── barcodes.tsv.gz

The GEX matrix must contain the T cell marker genes used for phenotyping (CD3D, CD3E, CD3G, CD4, CD8A, CD8B). Both gene names and Ensembl IDs are supported.

Barcode Matching: CellRanger VDJ and GEX may use different barcode suffixes (e.g., ACGT-1 vs ACGT-2). TCRsift strips suffixes and matches on the core barcode.
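
The suffix-stripping idea can be sketched as follows. The function names `core_barcode` and `match_barcodes` are hypothetical, and the sketch assumes the suffix is always a trailing `-<digits>`; TCRsift's actual matching logic may differ.

```python
def core_barcode(barcode):
    """Drop a trailing '-<digits>' suffix, e.g. 'ACGT-1' -> 'ACGT'."""
    head, sep, tail = barcode.rpartition("-")
    if sep and tail.isdigit():
        return head
    return barcode


def match_barcodes(vdj_barcodes, gex_barcodes):
    """Map each VDJ barcode to the GEX barcode that shares its core sequence."""
    gex_by_core = {core_barcode(b): b for b in gex_barcodes}
    return {
        b: gex_by_core[core_barcode(b)]
        for b in vdj_barcodes
        if core_barcode(b) in gex_by_core
    }


pairs = match_barcodes(["ACGT-1", "TTTT-1"], ["ACGT-2", "GGGG-2"])
print(pairs)
```

One caveat of matching on the core alone: if several GEM wells are pooled into one sample, identical core barcodes from different wells would collide, so a check for duplicate cores may be worth adding in that situation.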

Installation

pip install tcrsift

Or install from source:

git clone https://github.com/pirl-unc/tcrsift.git
cd tcrsift
pip install -e .

Optional Dependencies

# Common add-on bundles
pip install "tcrsift[reports,assembly,excel]"

# For PDF report generation
pip install "tcrsift[reports]"
brew install wkhtmltopdf  # macOS

# For constant region sequences from Ensembl
pip install "tcrsift[assembly]"
pyensembl install --release 93 --species human

# For SCT Excel input files
pip install "tcrsift[excel]"

Quick Start

Run the Complete Pipeline

tcrsift run \
    --sample-sheet samples.yaml \
    --output-dir results/ \
    --vdjdb /path/to/vdjdb

This runs: load → phenotype → clonotype → filter → annotate → assemble

With Configuration File

# Generate example config with all defaults
tcrsift generate-config -o my_config.yaml

# Edit config, then run
tcrsift run --config my_config.yaml --sample-sheet samples.yaml -o results/

Python API

import tcrsift

# Load samples from sample sheet
adata = tcrsift.load_samples("samples.yaml")

# Phenotype cells (CD4/CD8 classification)
adata = tcrsift.phenotype_cells(adata)

# Aggregate to clonotypes
clonotypes = tcrsift.aggregate_clonotypes(adata)

# Filter by expansion
filtered = tcrsift.filter_clonotypes(clonotypes, method="threshold", tcell_type="cd8")

# Annotate with VDJdb
annotated = tcrsift.annotate_clonotypes(filtered, vdjdb_path="/path/to/vdjdb")

# Assemble full sequences
assembled = tcrsift.assemble_full_sequences(annotated, include_constant=True)

Sample Sheet Format

TCRsift accepts sample sheets in CSV or YAML format.

YAML Format

samples:
  # Culture sample with peptide stimulation
  - sample: "Patient1_Culture"
    vdj_dir: "/data/patient1/vdj"
    gex_dir: "/data/patient1/gex"
    antigen_type: "short_peptide"
    antigen_name: "CMV pp65 495-503"
    epitope_sequence: "NLVPMVATV"
    mhc_allele: "HLA-A*02:01"
    source: "culture"

  # TIL sample (no antigen info needed)
  - sample: "Patient1_TIL"
    vdj_dir: "/data/patient1_til/vdj"
    source: "til"
    tissue: "tumor"

CSV Format

sample,vdj_dir,gex_dir,antigen_type,source
Patient1_Culture,/data/patient1/vdj,/data/patient1/gex,short_peptide,culture
Patient1_TIL,/data/patient1_til/vdj,,,til

Required Fields

Field	Required	Description
`sample`	Yes	Unique sample identifier
`vdj_dir`	Yes*	Path to CellRanger VDJ output
`gex_dir`	No	Path to CellRanger GEX output
`source`	No	Sample type: `culture`, `til`, `tetramer`, `sct`
`patient_id`	No	Donor / patient identifier — enables per-donor analysis
`enrichment_method`	No	Free-form method label (e.g. `AIMpos`, `tetpos`) — enables per-method outputs
`timepoint`	No	Free-form timepoint label (e.g. `D7`, `D14`) — enables per-timepoint filtering
`apc_type`	No	APC identity (e.g. `mDC`, `B-LCL`) — enables per-APC filtering

*At least one of vdj_dir or gex_dir is required.
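
The two hard requirements (a unique `sample` and at least one of `vdj_dir` or `gex_dir`) can be checked on a CSV sheet with a few lines of standard-library Python. This is a sketch under those assumptions only; `validate_sample_rows` is a hypothetical helper, not a TCRsift function.

```python
import csv
import io


def validate_sample_rows(handle):
    """Return a list of error strings for a CSV sample sheet (hypothetical helper)."""
    errors = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        sample = (row.get("sample") or "").strip()
        if not sample:
            errors.append(f"line {line_no}: missing sample")
        elif sample in seen:
            errors.append(f"line {line_no}: duplicate sample {sample!r}")
        seen.add(sample)
        has_vdj = bool((row.get("vdj_dir") or "").strip())
        has_gex = bool((row.get("gex_dir") or "").strip())
        if not (has_vdj or has_gex):
            errors.append(f"line {line_no}: need vdj_dir or gex_dir")
    return errors


sheet = io.StringIO(
    "sample,vdj_dir,gex_dir,antigen_type,source\n"
    "Patient1_Culture,/data/patient1/vdj,/data/patient1/gex,short_peptide,culture\n"
    "Patient1_TIL,/data/patient1_til/vdj,,,til\n"
)
print(validate_sample_rows(sheet))
```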

Multi-donor / multi-method designs

When patient_id and enrichment_method are populated, several behaviors are enabled automatically:

Per-(donor, method) aggregations on clonotypes (e.g. n_methods_per_donor, max_methods_per_donor)
Per-(donor, method) ranked CSVs under data/filtered_by_method/
Per-donor method × method overlap matrices + recovery panel (see Filter Clonotypes)
Per-donor FDR scope as the default when donors are unrelated (see --fdr-scope below)

For cohorts where donors share antigen + MHC + experimental cohort and a unified FDR ranking is biologically valid, set the YAML root-level flag:

donors_share_antigen: true
samples:
  - sample: ...
    patient_id: ...
    enrichment_method: ...

This locks --fdr-scope auto resolution to global instead of per-donor.

Antigen Types

Antigen Type	Expected T Cell	Description
`short_peptide`	CD8	8-11aa peptides (direct MHC-I binding)
`long_peptide`	mixed	15-25+aa (requires processing)
`whole_protein`	mixed	Full protein antigens
`tetramer_mhc1`	CD8	MHC-I tetramer selection
`tetramer_mhc2`	CD4	MHC-II tetramer selection
`sct`	CD8	Single-chain trimer (pMHC-I fusion)

Core Pipeline Steps

1. Load Data

Loads CellRanger VDJ and GEX outputs, extracts T cell markers (CD3, CD4, CD8), and combines into a unified AnnData object.

tcrsift load --sample-sheet samples.yaml -o loaded.h5ad

What happens:

Reads filtered_contig_annotations.csv for VDJ data
Reads filtered_feature_bc_matrix.h5 for gene expression
Matches barcodes between VDJ and GEX
Extracts CD3D/E/G, CD4, CD8A/B expression per cell
Pivots VDJ to get one row per cell with TRA/TRB info

2. Phenotype Cells

Classifies each cell as CD4+ or CD8+ based on gene expression ratios.

tcrsift phenotype -i loaded.h5ad -o phenotyped.h5ad --cd4-cd8-ratio 3.0

Classification logic:

Confident CD8+: (CD8A + CD8B + 1) / (CD4 + 1) > ratio (default: 3.0)
Confident CD4+: (CD4 + 1) / (CD8A + CD8B + 1) > ratio
Likely CD8+: CD8 > 0 and CD4 = 0 (any CD8 without CD4)
Likely CD4+: CD4 > 0 and CD8 = 0 (any CD4 without CD8)
Unknown: Similar expression or both near zero
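
The rules above can be written as a small function. This sketch assumes the rules are evaluated in the order listed and that "CD8" means CD8A + CD8B; whether TCRsift applies them to raw or normalized counts is not stated here, and `classify_tcell` is a hypothetical name.

```python
def classify_tcell(cd4, cd8a, cd8b, ratio=3.0):
    """Classify one cell from its CD4, CD8A and CD8B expression values."""
    cd8 = cd8a + cd8b
    # Pseudocount of 1 avoids division by zero and damps low-count noise.
    if (cd8 + 1) / (cd4 + 1) > ratio:
        return "Confident CD8+"
    if (cd4 + 1) / (cd8 + 1) > ratio:
        return "Confident CD4+"
    # Below the ratio threshold, fall back to exclusive expression.
    if cd8 > 0 and cd4 == 0:
        return "Likely CD8+"
    if cd4 > 0 and cd8 == 0:
        return "Likely CD4+"
    return "Unknown"


for counts in [(0, 5, 3), (0, 1, 0), (2, 0, 0), (1, 1, 0)]:
    print(counts, classify_tcell(*counts))
```

Note that with the default ratio of 3.0 and the strict inequality, a cell with CD4 = 2 and no CD8 gives exactly 3.0 and therefore lands in "Likely CD4+" rather than "Confident CD4+".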

3. Aggregate Clonotypes

Groups cells by CDR3 sequences into clonotypes with aggregated statistics.

tcrsift clonotype -i phenotyped.h5ad -o clonotypes.csv --group-by CDR3ab

Grouping options:

CDR3ab: Match by both alpha and beta CDR3 (strict pairing)
CDR3b_only: Match by beta chain only (allows alpha variation)

Output columns:

CDR3ab: Unique identifier (CDR3_alpha_CDR3_beta)
cell_count: Number of cells with this TCR
frequency: Proportion of total cells
Tcell_type_consensus: Most common phenotype
samples: Which samples contain this clone
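
As a rough illustration of how per-cell records collapse into these columns, here is a standard-library sketch for the CDR3ab grouping. The per-cell field names `Tcell_type` and `sample` are assumptions, frequency is assumed to be relative to all input cells, and `aggregate_clonotypes_sketch` is not the library function.

```python
from collections import Counter, defaultdict


def aggregate_clonotypes_sketch(cells):
    """Group per-cell dicts into per-clonotype rows keyed by CDR3ab."""
    groups = defaultdict(list)
    for cell in cells:
        key = f"{cell['CDR3_alpha']}_{cell['CDR3_beta']}"
        groups[key].append(cell)

    total = len(cells)
    rows = []
    for key, members in groups.items():
        phenotypes = Counter(c["Tcell_type"] for c in members)
        rows.append({
            "CDR3ab": key,
            "cell_count": len(members),
            "frequency": len(members) / total,
            "Tcell_type_consensus": phenotypes.most_common(1)[0][0],
            "samples": ";".join(sorted({c["sample"] for c in members})),
        })
    # Largest clones first.
    return sorted(rows, key=lambda r: r["cell_count"], reverse=True)


cells = [
    {"CDR3_alpha": "CAVA", "CDR3_beta": "CASSB", "Tcell_type": "CD8", "sample": "S1"},
    {"CDR3_alpha": "CAVA", "CDR3_beta": "CASSB", "Tcell_type": "CD8", "sample": "S2"},
    {"CDR3_alpha": "CAVX", "CDR3_beta": "CASSY", "Tcell_type": "CD4", "sample": "S1"},
]
for row in aggregate_clonotypes_sketch(cells):
    print(row)
```

Grouping by CDR3b_only would change only the key (beta CDR3 alone), which merges cells that share a beta chain but differ in alpha.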

4. Filter Clonotypes

Applies tiered filtering to prioritize expanded clones.

tcrsift filter -i clonotypes.csv -o filtered/ --method threshold --tcell-type cd8

Tier thresholds (default):

Tier	Min Cells	Min Frequency	Max Conditions
1	10	1%	2
2	5	0.5%	3
3	3	0.1%	5
4	2	0.05%	10
5	2	0%	unlimited
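
One plausible reading of the table is that a clone receives the strictest tier whose three conditions it satisfies. The sketch below encodes that reading; the first-match rule, the inclusive comparisons, and the name `assign_tier` are assumptions rather than documented TCRsift behavior.

```python
import math

# (tier, min_cells, min_frequency, max_conditions), copied from the table above.
DEFAULT_TIERS = [
    (1, 10, 0.01, 2),
    (2, 5, 0.005, 3),
    (3, 3, 0.001, 5),
    (4, 2, 0.0005, 10),
    (5, 2, 0.0, math.inf),
]


def assign_tier(cell_count, frequency, n_conditions, tiers=DEFAULT_TIERS):
    """Return the first (strictest) tier whose thresholds are all met, else None."""
    for tier, min_cells, min_freq, max_cond in tiers:
        if (cell_count >= min_cells
                and frequency >= min_freq
                and n_conditions <= max_cond):
            return tier
    return None


print(assign_tier(cell_count=12, frequency=0.02, n_conditions=1))
print(assign_tier(cell_count=12, frequency=0.02, n_conditions=4))
```

The second call shows the effect of the Max Conditions column: a large, frequent clone still drops to a lower tier when it appears in more conditions than the stricter tiers allow.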

FDR scope (`--fdr-scope`)

Controls the null distribution the FDR-tier filter computes against:

Value	Behavior	Output
`auto` (default)	Resolves to `per-donor` for multi-donor cohorts unless `donors_share_antigen` is set, else `global`	depends on resolution
`global`	Single FDR ranking pooled across all samples (pre-0.8.2 behavior)	`filtered_tier{1..5}.csv`
`per-donor`	Each donor gets its own FDR null from its own pooled-across-methods samples	`filtered_tier{1..5}_<donor>.csv`
`per-sample`	Each sample gets its own FDR null	`filtered_tier{1..5}_<sample>.csv`

Default flip (0.8.2): Multi-donor cohorts now get per-donor scope by default. Without donors_share_antigen: true on the sheet, pooling unrelated donors into one FDR ranking silently biases the result — a clone abundant in one donor and absent from another is ranked against the wrong distribution. Users running v0.7.x → v0.8.x against multi-donor sheets will see different tier files.

# Multi-donor unrelated → per-donor by default
tcrsift run --sample-sheet sheet.yaml -o results/

# Force the old behavior:
tcrsift run --sample-sheet sheet.yaml -o results/ --fdr-scope global
# or on the sheet:
# donors_share_antigen: true
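
The `auto` resolution described in the table can be summarized in a few lines. This is a sketch of the documented rule only; `resolve_fdr_scope` is a hypothetical name, and "multi-donor" is assumed to mean more than one distinct non-empty `patient_id`.

```python
VALID_SCOPES = {"auto", "global", "per-donor", "per-sample"}


def resolve_fdr_scope(requested, donor_ids, donors_share_antigen=False):
    """Resolve an --fdr-scope value to a concrete scope."""
    if requested not in VALID_SCOPES:
        raise ValueError(f"unknown FDR scope: {requested!r}")
    if requested != "auto":
        return requested  # explicit choices are taken as-is
    n_donors = len({d for d in donor_ids if d})
    if n_donors > 1 and not donors_share_antigen:
        return "per-donor"
    return "global"


print(resolve_fdr_scope("auto", ["P1", "P2"]))
print(resolve_fdr_scope("auto", ["P1", "P2"], donors_share_antigen=True))
```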

Method × method overlap + recovery (multi-method designs)

When enrichment_method is populated, tcrsift run emits additional outputs that report per-(donor, method) performance directly, so users do not have to reconstruct it from delimited strings:

data/clone_sample_long.csv                  # long-format (clone, sample) table
data/filtered_by_method/<donor>__<m>.csv    # top-N clones per (donor, method)
data/method_overlap_<donor>.csv             # method × method overlap matrix
plots/method_overlap_<donor>.png            # heatmap
data/method_recovery.csv                    # per-(donor, method) recovery of tier1
plots/method_recovery.png                   # paired-by-donor bar chart

The overlap similarity metric is configurable with --method-overlap-similarity {jaccard,dice,count} (default: jaccard). None of these outputs are produced when enrichment_method is not populated, so single-method designs behave as before.
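
For reference, the three metric names correspond to standard set-overlap measures. The sketch below uses the textbook definitions on sets of clone identifiers; how TCRsift handles empty sets or ties is not documented here, so returning 0.0 for two empty sets is an assumption, and the function names are hypothetical.

```python
def overlap(a, b, metric="jaccard"):
    """Overlap between two collections of clone IDs."""
    a, b = set(a), set(b)
    shared = len(a & b)
    if metric == "count":
        return shared                                 # |A ∩ B|
    if metric == "jaccard":
        union = len(a | b)
        return shared / union if union else 0.0       # |A ∩ B| / |A ∪ B|
    if metric == "dice":
        total = len(a) + len(b)
        return 2 * shared / total if total else 0.0   # 2|A ∩ B| / (|A| + |B|)
    raise ValueError(f"unknown metric: {metric!r}")


def overlap_matrix(clones_by_method, metric="jaccard"):
    """Build a method x method matrix as nested dicts."""
    methods = sorted(clones_by_method)
    return {
        m1: {m2: overlap(clones_by_method[m1], clones_by_method[m2], metric)
             for m2 in methods}
        for m1 in methods
    }


by_method = {"AIMpos": {"c1", "c2", "c3"}, "tetpos": {"c2", "c3", "c4"}}
print(overlap_matrix(by_method, metric="jaccard"))
```

Jaccard and Dice are both normalized to the range 0 to 1, whereas `count` scales with repertoire size, which matters when methods recover very different numbers of clones.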

5. Annotate Clonotypes

Matches against public TCR databases to identify known specificities.

tcrsift annotate -i filtered/tier1.csv -o annotated.csv \
    --vdjdb /path/to/vdjdb \
    --iedb /path/to/iedb

Supported databases:

VDJdb: Curated TCR-epitope pairs
IEDB: Immune Epitope Database
CEDAR: Cancer Epitope Database and Analysis Resource

Viral flagging: Clones matching CMV, EBV, HIV, Influenza, etc. are flagged as is_viral=True for review.

Reference-database cache (`tcrsift data`)

When --vdjdb / --iedb / --cedar are omitted, tcrsift annotate and tcrsift run auto-discover databases from a managed cache directory. Resolution order (first wins):

explicit --cache-dir flag
TCRSIFT_DATA_DIR env var
$XDG_CACHE_HOME/tcrsift
~/.cache/tcrsift
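
The resolution order can be expressed as a short function. This is a sketch of the documented precedence, not TCRsift's code; `resolve_cache_dir` is a hypothetical name, and empty environment values are assumed to count as unset.

```python
import os
from pathlib import Path


def resolve_cache_dir(cache_dir=None, env=None):
    """Pick the cache directory; the first configured source wins."""
    env = os.environ if env is None else env
    if cache_dir:                                  # 1. explicit --cache-dir
        return Path(cache_dir)
    if env.get("TCRSIFT_DATA_DIR"):                # 2. TCRSIFT_DATA_DIR
        return Path(env["TCRSIFT_DATA_DIR"])
    if env.get("XDG_CACHE_HOME"):                  # 3. $XDG_CACHE_HOME/tcrsift
        return Path(env["XDG_CACHE_HOME"]) / "tcrsift"
    return Path.home() / ".cache" / "tcrsift"      # 4. ~/.cache/tcrsift


print(resolve_cache_dir(env={"XDG_CACHE_HOME": "/scratch/cache"}))
```

Passing `env` explicitly keeps the function easy to test without modifying the real process environment.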

Manage the cache with the tcrsift data subcommand:

tcrsift data list                  # show cached DBs, sizes, paths
tcrsift data download              # auto-fetch supported DBs (VDJdb + IEDB)
tcrsift data download --db iedb    # fetch one DB
tcrsift data download --force      # refresh
tcrsift data clear --db iedb       # remove one DB
tcrsift data clear                 # remove all

VDJdb is fetched via the GitHub releases API for the latest version; IEDB uses the stable receptor-full-v3 download. CEDAR has no automated download — tcrsift data list shows where to drop it manually. Every downloaded DB gets a _meta.json sidecar with source URL, timestamp, size, and sha256.

# Fresh install: cache the DBs once, then forget about paths.
tcrsift data download
tcrsift run --sample-sheet samples.yaml -o results/    # cache picked up automatically

6. Assemble Full Sequences

Builds full-length TCR sequences with leader peptides and constant regions.

tcrsift assemble -i annotated.csv -o full_sequences.csv \
    --alpha-leader CD28 --beta-leader CD8A --include-constant

Sequence structure:

[Leader] + [V(D)J variable region] + [Constant region]

Single-chain construct:
[Beta full] + [T2A linker] + [Alpha full]

Leader options: CD8A, CD28, IgK, TRAC, TRBC, or from_contig (extract native)
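
The sequence structure is plain concatenation, which the sketch below makes explicit. The leader, variable and constant strings are toy placeholders rather than real sequences; the T2A peptide shown is the commonly cited 18-residue form, and the exact linker TCRsift emits (for example with an added GSG spacer) may differ. Function names are hypothetical.

```python
# Commonly cited T2A peptide; treat as an assumption about the actual linker used.
T2A_AA = "EGRGSLLTCGDVEENPGP"


def assemble_chain(leader, variable, constant=""):
    """[Leader] + [V(D)J variable region] + [Constant region]."""
    return leader + variable + constant


def assemble_single_chain(beta_full, alpha_full, linker=T2A_AA):
    """[Beta full] + [linker] + [Alpha full]."""
    return beta_full + linker + alpha_full


# Toy placeholder strings, not real TCR sequences.
alpha_full = assemble_chain("MLEADA", "AVAR", "CONSTA")
beta_full = assemble_chain("MLEADB", "BVAR", "CONSTB")
print(assemble_single_chain(beta_full, alpha_full))
```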

Supplementary Tools

These tools handle data outside the standard CellRanger workflow.

Load SCT Data

Loads TCR data from SCT (single-chain trimer) platform Excel files.

tcrsift load-sct -i sct_data.xlsx -o sct_clonotypes.csv --aggregate

When to use: You have SCT platform data (pMHC tetramer with paired TCR sequencing) that wasn't processed through CellRanger.

Quality filters applied:

high_quality: SNR ≥ 2.0, reads ≥ 10 per chain, mutation match
chosen: Stricter criteria (SNR ≥ 3.4, reads ≥ 50, comPACT match)

TIL Matching (Automatic)

When you include TIL samples in your sample sheet with source: til, the run command automatically detects them and adds TIL matching columns to culture clonotypes.

# samples.yaml - TIL samples are auto-detected
samples:
  - sample: "Culture_Pool1"
    vdj_dir: "/data/culture/vdj"
    source: "culture"
  - sample: "Patient1_TIL"
    vdj_dir: "/data/til/vdj"
    source: "til"  # This sample will be used for TIL matching

tcrsift run --sample-sheet samples.yaml -o results/
# TIL matching happens automatically - no extra flags needed!

# You can also explicitly specify TIL samples (overrides auto-detection):
tcrsift run --sample-sheet samples.yaml -o results/ --til-samples Patient1_TIL

Use either --til-samples or repeat --til-sample, not both.

Output columns added:

til_match (bool): Clone found in TIL
til_cell_count: Number of TIL cells with this TCR
til_frequency: Frequency in TIL repertoire

TIL samples are excluded from culture clonotype aggregation and are only used for matching.

Why TIL matching matters: Clones that appear in both antigen-stimulated culture AND tumor tissue provide orthogonal evidence of tumor-reactivity.
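
Conceptually, the three TIL columns come from counting how often each culture clonotype appears among TIL cells. The sketch below assumes exact matching on the `CDR3ab` identifier and a frequency relative to all TIL cells; `add_til_columns` is a hypothetical helper, not the library's `match_til`.

```python
from collections import Counter


def add_til_columns(culture_clonotypes, til_cell_clonotypes):
    """Annotate culture clonotype rows with TIL match statistics.

    til_cell_clonotypes holds one CDR3ab identifier per TIL cell.
    """
    til_counts = Counter(til_cell_clonotypes)
    total = sum(til_counts.values())
    annotated = []
    for row in culture_clonotypes:
        n = til_counts.get(row["CDR3ab"], 0)
        annotated.append({
            **row,
            "til_match": n > 0,
            "til_cell_count": n,
            "til_frequency": n / total if total else 0.0,
        })
    return annotated


culture = [{"CDR3ab": "CAVA_CASSB"}, {"CDR3ab": "CAVX_CASSY"}]
til_cells = ["CAVA_CASSB", "CAVA_CASSB", "CAVQ_CASSZ", "CAVQ_CASSZ"]
for row in add_til_columns(culture, til_cells):
    print(row)
```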

Match TIL (Cross-Run)

Use match-til only when TIL data was processed in a separate pipeline run.

tcrsift match-til \
    -i culture_clonotypes.csv \
    --til-h5ad til_processed.h5ad \
    -o matched.csv

# Or provide multiple TIL samples directly (no sample sheet):
tcrsift match-til \
    -i culture_clonotypes.csv \
    -o matched.csv \
    --til-sample T1=csv:/path/to/til_t1.csv \
    --til-sample T2=h5ad:/path/to/til_t2.h5ad \
    --til-sample T3=vdj:/path/to/til_t3_vdj_outs

When to use:

TIL from a different patient or experiment
Retrospective matching against archived TIL data
TIL processed with different parameters

TIL-Only Aggregation

For TIL-only studies (for example, multiple TIL timepoints), aggregate one or more TIL sources directly into clonotype-level counts/frequencies:

tcrsift til-clonotype -o til_clonotypes.csv \
    --til-sample T1=csv:/path/to/til_t1.csv \
    --til-sample T2=h5ad:/path/to/til_t2.h5ad \
    --til-sample T3=vdj:/path/to/til_t3_vdj_outs

This creates a harmonized clonotype table with:

til_cell_count and til_frequency (combined across all TIL samples)
til_cell_count.{sample} and til_frequency.{sample} (per-sample columns)

TIL-Only Clone Prioritization (`til-select`)

For 10x VDJ + GEX tumor timepoints, use til-select to prioritize clones with CD8 bias plus enrichment/immunogenic/cytolytic branch signals.

Input layout per timepoint:

consensus_annotations.<TP>.csv
clonotypes.<TP>.csv
filtered_contig_annotations.<TP>.csv
sample_filtered_feature_bc_matrix.<TP>.h5

Example (compatible with pfo004/full-length-tcrs-in-TILs/data):

tcrsift til-select \
  --data-dir ~/code/pfo-analysis/pfo004/full-length-tcrs-in-TILs/data \
  --vdjdb ~/code/pfo-analysis/pfo004/full-length-tcrs-in-TILs/data/vdjdb.txt \
  --iedb ~/code/pfo-analysis/pfo004/full-length-tcrs-in-TILs/data/iedb_tcr_full_v3.tsv \
  --cedar ~/code/pfo-analysis/pfo004/full-length-tcrs-in-TILs/data/cedar_tcr_full_v3.tsv \
  --rank-by marker_score_z_mean \
  --verbose

Key outputs in figures/:

abTCR_master_table.csv
abTCR_annotated.csv
selection_masks.csv
subset_*.csv
selection_funnel.png
selected_clones_report.pdf
marker_cells_<TP>.csv, marker_clonotype_scores_<TP>.csv

Legacy v2 CSV compatibility:

til-select writes v2-compatible CSV schemas and column ordering by default.
Using the same inputs/options as v2/harmonize_abtcr_timepoints.py, CSV outputs are expected to match exactly.
Figure files (.png, .pdf) are not expected to be byte-identical across runs/environments.

Annotate with Gene Expression (`annotate-gex`)

Adds gene expression data from a 10x HDF5 file to TCR DataFrames.

annotate vs annotate-gex:

Command	Data Source	Purpose
`annotate`	Public databases (VDJdb, IEDB)	Label clonotypes with known epitope specificities
`annotate-gex`	10x HDF5 expression file	Add per-cell gene expression values

When GEX data is available:

Standard pipeline: If gex_dir is in your sample sheet, GEX is loaded automatically at the load step and used for CD4/CD8 phenotyping
VDJ-only workflows: Use annotate-gex to add expression from a separate HDF5 file

When to use annotate-gex:

You loaded VDJ-only data (no gex_dir in sample sheet)
You have a separate 10x HDF5 file with expression data
You want genes beyond the default CD3/CD4/CD8 markers

# Add per-cell expression from HDF5 file
tcrsift annotate-gex \
    -i cells.csv \
    --gex-file filtered_feature_bc_matrix.h5 \
    -o cells_with_gex.csv

# Add GEX and aggregate to clonotype level
tcrsift annotate-gex \
    -i cells.csv \
    --gex-file filtered_feature_bc_matrix.h5 \
    --aggregate \
    --cd4-cd8-counts \
    -o clonotype_gex.csv

# Custom gene list
tcrsift annotate-gex \
    -i cells.csv \
    --gex-file matrix.h5 \
    --genes "GZMA,GZMB,PRF1,IFNG,TNF" \
    -o cytotoxicity_markers.csv

Output columns:

gex.{GENE}: Expression per cell
gex.{GENE}.sum, gex.{GENE}.mean: Aggregated per clonotype (with --aggregate)
gex.n_reads, gex.n_genes, gex.pct_mito: QC metrics per cell
CD4_only.count, CD8_only.count: Cells with exclusive expression (with --cd4-cd8-counts)

Note: For most workflows, use gex_dir in your sample sheet and the standard pipeline will handle GEX automatically during loading.

Python API:

from tcrsift import augment_with_gex, aggregate_gex_by_clonotype, compute_cd4_cd8_counts

cells_df = augment_with_gex(cells_df, "filtered_feature_bc_matrix.h5")
clonotype_gex = aggregate_gex_by_clonotype(cells_df, group_col="CDR3_pair")
cd4_cd8 = compute_cd4_cd8_counts(cells_df, group_col="CDR3_pair")

Unify Multiple Experiments

Merges clonotype data from multiple independent pipeline runs into a unified table.

tcrsift unify \
    -i til_results/clonotypes.csv culture_results/clonotypes.csv sct_clonotypes.csv \
    -o unified.csv

Note: Inputs can be standard TCRsift clonotype outputs (CDR3ab) or SCT-style tables (CDR3_pair); tcrsift unify will normalize identifiers automatically.

When to use: You have results from multiple independent runs and want to compare or combine them.

run vs unify:

Scenario	Use
One patient, culture + TIL in same sample sheet	`run` (TIL auto-detected)
One patient, culture + TIL processed separately	`match-til`
TIL-only, one or more tumor timepoints (10x VDJ+GEX)	`til-select`
Multiple patients or experiments	`unify`
Comparing results across different data sources	`unify`

Output includes:

Prefixed columns from each source (e.g., TIL.cell_count, Culture.cell_count)
Occurrence flags (occurs_in_TIL, occurs_in_Culture)
Combined statistics (combined.total_cells.count)
Phenotype confidence based on combined evidence
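
The column naming pattern can be illustrated with a small outer merge on the clonotype identifier. This sketch covers only the prefixed counts, occurrence flags and the combined total; it assumes each input row has `CDR3ab` and `cell_count`, and `unify_sketch` is a hypothetical stand-in for the real merge.

```python
def unify_sketch(experiments):
    """Merge [(rows, name), ...] into one row per clonotype."""
    names = [name for _, name in experiments]
    unified = {}
    for rows, name in experiments:
        for row in rows:
            entry = unified.setdefault(row["CDR3ab"], {"CDR3ab": row["CDR3ab"]})
            col = f"{name}.cell_count"
            entry[col] = entry.get(col, 0) + row["cell_count"]

    for entry in unified.values():
        for name in names:
            # Clones missing from a source get a zero count and a False flag.
            count = entry.setdefault(f"{name}.cell_count", 0)
            entry[f"occurs_in_{name}"] = count > 0
        entry["combined.total_cells.count"] = sum(
            entry[f"{name}.cell_count"] for name in names
        )
    return list(unified.values())


til = [{"CDR3ab": "A_B", "cell_count": 4}]
culture = [{"CDR3ab": "A_B", "cell_count": 10}, {"CDR3ab": "X_Y", "cell_count": 2}]
for row in unify_sketch([(til, "TIL"), (culture, "Culture")]):
    print(row)
```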

Generate Mnemonic Names

Creates pronounceable names from CDR3 sequences for easier reference. Similar sequences produce similar names, making it easy to spot related clonotypes.

tcrsift mnemonic -i clonotypes.csv -o clonotypes_named.csv

Example output:

CDR3_beta	mnemonic_name
CASSLGQAYEQYF	Laigqaye Qoy
CASSLAGAYEQYF	Lagaye Qoy
CASSIRASYEQYF	Irasye Qoy
CASSIRANYEQYF	Iranye Qoy

Common conserved prefixes (CASS, CAV) and suffixes (F) are stripped to focus on the variable region. Inserted vowels use diphthongs (ai, oo, ei) to distinguish from original amino acid vowels (A, E, I, Y).
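
The first step, stripping conserved ends, can be sketched as below. Only the prefixes and suffix named above are handled, the vowel-insertion step is not reproduced, and `variable_core` is a hypothetical name; the real command may recognize more prefixes.

```python
CONSERVED_PREFIXES = ("CASS", "CAV")
CONSERVED_SUFFIXES = ("F",)


def variable_core(cdr3):
    """Strip one conserved prefix and one conserved suffix from a CDR3."""
    core = cdr3
    for prefix in CONSERVED_PREFIXES:
        if core.startswith(prefix):
            core = core[len(prefix):]
            break
    for suffix in CONSERVED_SUFFIXES:
        if core.endswith(suffix):
            core = core[:-len(suffix)]
            break
    return core


for cdr3 in ["CASSLGQAYEQYF", "CASSIRASYEQYF", "CASSIRANYEQYF"]:
    print(cdr3, "->", variable_core(cdr3))
```

The last two sequences reduce to cores that differ at a single position, which is consistent with their mnemonic names in the table differing by one letter.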

Workflows

Standard Single-Experiment Analysis

tcrsift run --sample-sheet samples.yaml -o results/ --vdjdb /path/to/vdjdb

Culture + TIL Together

When TIL and culture samples are in the same sample sheet, TIL matching happens automatically:

# samples.yaml
samples:
  - sample: "Culture_Pool1"
    vdj_dir: "/data/culture/vdj"
    gex_dir: "/data/culture/gex"
    source: "culture"
  - sample: "TIL"
    vdj_dir: "/data/til/vdj"
    source: "til"  # Auto-detected for TIL matching

tcrsift run --sample-sheet samples.yaml -o results/
# No --til-samples flag needed - auto-detected from source: til

This matches culture clones against the TIL samples and adds the til_match, til_cell_count, and til_frequency columns.

Multi-Source Unification

When combining data from different sources processed separately:

# Process each source
tcrsift til-clonotype -o til_results/clonotypes.csv --sample-sheet til_samples.yaml
tcrsift run --sample-sheet culture_samples.yaml -o culture_results/
tcrsift load-sct -i sct_data.xlsx -o sct_clonotypes.csv --aggregate

# Unify
tcrsift unify \
    -i til_results/clonotypes.csv culture_results/clonotypes.csv sct_clonotypes.csv \
    -o unified_clonotypes.csv

API Reference

Data Loading

from tcrsift import load_samples, load_cellranger_vdj, load_cellranger_gex

# Load all samples from sample sheet
adata = load_samples("samples.yaml")

# Load individual CellRanger outputs
vdj_df = load_cellranger_vdj("/path/to/vdj", sample_name="S1")
adata = load_cellranger_gex("/path/to/gex", sample_name="S1")

Phenotyping

from tcrsift import phenotype_cells, filter_by_tcell_type, get_phenotype_summary

# Classify cells
adata = phenotype_cells(adata, cd4_cd8_ratio=3.0)

# Filter to CD8+ only
cd8_cells = filter_by_tcell_type(adata, tcell_type="cd8")

# Get summary by sample
summary = get_phenotype_summary(adata)

Clonotyping

from tcrsift import aggregate_clonotypes, get_clonotype_summary

# Aggregate by CDR3 pair
clonotypes = aggregate_clonotypes(adata, group_by="CDR3ab", min_umi=2)

# Get summary
summary = get_clonotype_summary(clonotypes)

Filtering

from tcrsift import filter_clonotypes, split_by_tier

# Filter with default tiers
filtered = filter_clonotypes(clonotypes, method="threshold", tcell_type="cd8")

# Split into separate DataFrames by tier
tier_dfs = split_by_tier(filtered)

Annotation

from tcrsift import annotate_clonotypes, load_vdjdb

# Load database
vdjdb = load_vdjdb("/path/to/vdjdb")

# Annotate clonotypes
annotated = annotate_clonotypes(
    clonotypes,
    vdjdb_path="/path/to/vdjdb",
    match_by="CDR3ab",
    exclude_viral=True,
)

TIL Matching

from tcrsift import match_til, get_til_summary

# Match culture clones against TIL data
matched = match_til(culture_clonotypes, til_adata, match_by="CDR3ab")

# Get recovery statistics
summary = get_til_summary(matched)

SCT Data

from tcrsift import load_sct, aggregate_sct, get_sct_specificities

# Load and filter SCT data
df = load_sct("sct_data.xlsx", min_snr=2.0, min_reads_per_chain=10)
hq = df[df.high_quality]

# Aggregate to clonotypes
clonotypes = aggregate_sct(df)

# Get specificity mapping
specificities = get_sct_specificities(clonotypes)

Multi-Experiment Unification

from tcrsift import merge_experiments, add_phenotype_confidence

# Prepare experiments
experiments = [
    (til_clonotypes, "TIL"),
    (culture_clonotypes, "Culture"),
]

# Merge with occurrence flags and combined stats
unified = merge_experiments(experiments, add_occurrence_flags=True)

# Add phenotype confidence
unified = add_phenotype_confidence(unified, ratio_threshold=10.0)

Sequence Assembly

from tcrsift import assemble_full_sequences, export_fasta

# Assemble with leaders and constant regions
assembled = assemble_full_sequences(
    clonotypes,
    alpha_leader="CD28",
    beta_leader="CD8A",
    include_constant=True,
    linker="T2A",
)

# Export FASTA
export_fasta(assembled, "sequences.fasta", sequence_col="single_chain_aa")

Output Files

clonotypes.csv

Column	Description
`CDR3ab`	Unique identifier (CDR3_alpha_CDR3_beta)
`CDR3_alpha`	Alpha chain CDR3 sequence
`CDR3_beta`	Beta chain CDR3 sequence
`cell_count`	Number of cells
`frequency`	Proportion of total cells
`Tcell_type_consensus`	Consensus T cell type
`tier`	Quality tier (1-5)
`db_match`	Matched in public database
`is_viral`	Viral specificity flag
`n_donors`, `n_methods_per_donor`, `max_methods_per_donor`	Multi-donor / multi-method aggregations (populated when `patient_id` + `enrichment_method` are set)

Multi-donor / multi-method outputs

Emitted by tcrsift run when the corresponding sheet fields are populated:

File	When	Description
`data/clone_sample_long.csv`	always	One row per (clone, sample) pair — easier to pivot than the semicolon-delimited `samples` column
`data/filtered_by_method/<donor>__<method>.csv`	`enrichment_method` set	Top-N clones ranked within each (donor, method); `--per-method-top-n N`
`data/method_overlap_<donor>.csv`	`enrichment_method` set, ≥2 methods/donor	method × method overlap matrix; metric via `--method-overlap-similarity`
`plots/method_overlap_<donor>.png`	as above, with plots enabled	seaborn heatmap
`data/method_recovery.csv`	`enrichment_method` set	Long `[donor, method, recovered, total, fraction]` — "how does each method perform"
`plots/method_recovery.png`	as above, with plots enabled	Paired-by-donor bar chart, methods sorted by cross-donor mean recovery
`data/filtered_tier{N}_<donor>.csv`	`--fdr-scope per-donor` (default for multi-donor)	Tier files split by donor instead of global tiered files

full_sequences.csv

Column	Description
`CDR3ab`	Clonotype identifier
`alpha_full_aa`	Full alpha chain (leader + VDJ + constant)
`beta_full_aa`	Full beta chain
`single_chain_aa`	Beta-2A-Alpha construct
`single_chain_nt`	DNA sequence

Documentation

Full documentation: https://pirl-unc.github.io/tcrsift/

Contributing

See CONTRIBUTING.md for guidelines.

License

Apache License 2.0. See LICENSE for details.

Citation

@software{tcrsift,
  author = {Rubinsteyn, Alex},
  title = {TCRsift: T-cell receptor selection from antigen-specific culture},
  url = {https://github.com/pirl-unc/tcrsift},
  year = {2024}
}

