Predict cancer epitopes from cancer sequence data
Topiary
Topiary predicts which peptides from protein sequences will be presented by MHC molecules, making them potential T-cell epitopes. It is used in cancer immunotherapy research to find mutant peptides (neoantigens) that the immune system could target.
Core idea: Given protein sequences + HLA alleles + an MHC binding predictor, Topiary scans all possible peptides and returns those predicted to bind MHC, ranked by binding strength.
Topiary can start from several types of input:
- Somatic variants (VCF/MAF) — the original use case: find mutant peptides from cancer sequencing data
- Protein sequences (FASTA/CSV) — scan full-length proteins with a sliding window
- Peptide lists (FASTA/CSV) — score specific peptides directly, no sliding window
- Gene/transcript IDs — pull sequences from Ensembl automatically
- Built-in gene sets — cancer-testis antigens (CTA), tissue-expressed genes
How it works
- Get protein sequences — from variants (via varcode), FASTA/CSV files, or Ensembl lookups
- Generate candidate peptides — sliding window over proteins, or use peptides as-is
- Predict MHC binding — via mhctools (NetMHCpan, MHCflurry, etc.)
- Filter and rank — by binding affinity, percentile rank, presentation score, or custom expressions
- Annotate — with gene/transcript info, mutation positions, RNA expression levels
For variant inputs, Topiary also filters by RNA expression and identifies which predicted epitopes actually overlap the mutation.
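The candidate-generation step can be pictured as a plain sliding window over each protein. The sketch below is illustrative only and is not Topiary's implementation: `sliding_window_peptides` is a hypothetical helper, and its default lengths simply mirror the `--mhc-epitope-lengths 8,9,10,11` example.

```python
from typing import Iterator, Sequence, Tuple


def sliding_window_peptides(
    protein: str, lengths: Sequence[int] = (8, 9, 10, 11)
) -> Iterator[Tuple[int, str]]:
    """Yield (offset, peptide) for every window of each requested length."""
    for length in lengths:
        # A protein of n residues has n - length + 1 windows of this length;
        # the range is empty when the protein is shorter than the window.
        for offset in range(len(protein) - length + 1):
            yield offset, protein[offset:offset + length]


# Toy 12-residue sequence, purely illustrative.
candidates = list(sliding_window_peptides("MAALSGGGGGAE", lengths=(8, 9)))
print(len(candidates))  # 5 windows of length 8 plus 4 of length 9
```

Each candidate keeps its 0-based offset, which is the kind of information the `peptide_offset` output column carries.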
Installation
pip install topiary
For Ensembl-based features (variant annotation, gene lookups), download reference data:
# GRCh38 (hg38) — most common
pyensembl install --release 93 --species human
# GRCh37 (hg19) — if your variants use this reference
pyensembl install --release 75 --species human
For cancer-testis antigen and tissue expression features:
pip install pirlygenes
Quick start
Command line
Scan a FASTA file for MHC binders:
topiary \
--fasta proteins.fasta \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01,HLA-B*07:02 \
--ic50-cutoff 500 \
--output-csv results.csv
Score specific peptides (no sliding window):
topiary \
--peptide-csv peptides.csv \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--ic50-cutoff 500 \
--output-csv results.csv
Find neoantigen candidates from somatic variants:
topiary \
--vcf somatic.vcf \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01,HLA-B*07:02 \
--ic50-cutoff 500 \
--percentile-cutoff 2.0 \
--rna-gene-fpkm-tracking-file genes.fpkm_tracking \
--rna-min-gene-expression 4.0 \
--only-novel-epitopes \
--output-csv epitopes.csv
Scan cancer-testis antigens, excluding peptides found in vital organs:
topiary \
--cta \
--exclude-tissues heart_muscle lung liver \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--ic50-cutoff 500 \
--output-csv cta_epitopes.csv
Python API
from topiary import TopiaryPredictor, Affinity, Presentation
from mhctools import NetMHCpan
# Set up predictor with filtering
predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01", "HLA-B*07:02"],
    filter=(Affinity <= 500) | (Presentation.rank <= 2.0),
    rank_by=[Presentation.score, Affinity.score],
)
# Scan protein sequences (sliding window)
df = predictor.predict_from_named_sequences({
    "BRAF_V600E": "MAALSGGGGG...LATEKSRWSG",
    "TP53_R248W": "MEEPQSDPSV...ALPQHAHAQM",
})
# Score specific peptides (no sliding window)
df = predictor.predict_from_named_peptides({
    "peptide_1": "YLQLVFGIEV",
    "peptide_2": "LLFNILGGWV",
})
# From somatic variants (requires varcode)
from varcode import load_vcf
variants = load_vcf("somatic.vcf")
df = predictor.predict_from_variants(variants)
Input modes
Sequence and peptide files
| Flag | Format | Behavior |
|---|---|---|
| --fasta FILE | FASTA with full-length proteins | Sliding-window scan |
| --peptide-fasta FILE | FASTA where each entry is one peptide | Scored as-is |
| --sequence-csv FILE | CSV with sequence column (+ optional name) | Sliding-window scan |
| --peptide-csv FILE | CSV with peptide column (+ optional name) | Scored as-is |
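Both FASTA flags boil down to name-to-sequence pairs, the same shape that `predict_from_named_sequences` and `predict_from_named_peptides` accept in the Python API. A minimal stand-alone parser is sketched below. `read_fasta` is a hypothetical helper, and taking the first whitespace-delimited header token as the record name is an assumption, not necessarily how Topiary names records.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text lines into {record_name: sequence}."""
    chunks: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            # Positional fallback for an empty header (illustrative choice).
            current = tokens[0] if tokens else f"record_{len(chunks) + 1}"
            chunks[current] = []
        elif current is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


example = [">peptide_1 demo", "YLQLV", "FGIEV", ">peptide_2", "LLFNILGGWV"]
named = read_fasta(example)
print(named)
```

The resulting dictionary could then be passed to whichever `predict_from_named_*` method matches the intended behavior (sliding window or scored as-is).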
Gene and transcript lookups
These pull protein sequences from Ensembl automatically:
| Flag | Example |
|---|---|
| --gene-names NAME [NAME ...] | --gene-names BRAF TP53 EGFR |
| --gene-ids ID [ID ...] | --gene-ids ENSG00000157764 |
| --transcript-ids ID [ID ...] | --transcript-ids ENST00000288602 |
| --ensembl-proteome | Scan the entire Ensembl proteome |
| --cta | Cancer-testis antigen genes (requires pirlygenes) |
| --ensembl-release N | Use a specific Ensembl release (default: 93 for human) |
For gene lookups, Topiary uses the longest protein-coding transcript per gene.
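The selection rule can be sketched without touching Ensembl at all. In the example below, `TranscriptRecord` is a hypothetical stand-in (not a pyensembl class), the transcript IDs are placeholders, and two points are assumptions: "longest" is measured by protein length, and ties go to the first transcript encountered.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TranscriptRecord:
    # Hypothetical stand-in for an Ensembl transcript.
    transcript_id: str
    biotype: str
    protein_sequence: Optional[str]


def longest_coding_transcript(
    transcripts: List[TranscriptRecord],
) -> Optional[TranscriptRecord]:
    """Return the protein-coding transcript with the longest protein, if any."""
    coding = [
        t for t in transcripts
        if t.biotype == "protein_coding" and t.protein_sequence
    ]
    if not coding:
        return None
    # max() keeps the first maximal element, which gives the tie-break above.
    return max(coding, key=lambda t: len(t.protein_sequence))


toy_gene = [
    TranscriptRecord("T1", "protein_coding", "MAALSGGG"),
    TranscriptRecord("T2", "protein_coding", "MAALSGGGGGAE"),
    TranscriptRecord("T3", "retained_intron", None),
]
best = longest_coding_transcript(toy_gene)
print(best.transcript_id)  # T2
```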
Genomic variants
| Flag | Description |
|---|---|
| --vcf FILE | VCF file of somatic variants |
| --maf FILE | TCGA MAF file |
| --variant CHR POS REF ALT | Individual variant (requires --ensembl-version) |
| --protein-change GENE CHANGE | Direct protein change, e.g. --protein-change EGFR T790M |
Multiple input flags can be combined in a single run.
MHC binding prediction
You must specify a predictor and alleles:
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01,HLA-B*07:02
Supported predictors: netmhcpan, netmhc, netmhciipan, netmhccons, mhcflurry, random, and IEDB web API variants (netmhcpan-iedb, netmhccons-iedb, smm-iedb, smm-pmbec-iedb).
Alleles can be specified as a comma-separated list (--mhc-alleles) or one per line in a file (--mhc-alleles-file).
Peptide lengths: --mhc-epitope-lengths 8,9,10,11 (defaults come from the predictor).
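For readers building the same inputs in Python, the three textual formats above reduce to simple list parsing. The helpers below are hypothetical and only illustrate the shapes involved; they perform no allele-name validation.

```python
from typing import Iterable, List


def alleles_from_string(text: str) -> List[str]:
    """Split a comma-separated allele list, as passed to --mhc-alleles."""
    return [part.strip() for part in text.split(",") if part.strip()]


def alleles_from_lines(lines: Iterable[str]) -> List[str]:
    """Read one allele per line, skipping blank lines."""
    return [line.strip() for line in lines if line.strip()]


def lengths_from_string(text: str) -> List[int]:
    """Split a comma-separated list of peptide lengths into integers."""
    return [int(part) for part in text.split(",") if part.strip()]


alleles = alleles_from_string("HLA-A*02:01,HLA-B*07:02")
lengths = lengths_from_string("8,9,10,11")
print(alleles, lengths)
```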
Filtering and ranking
Simple cutoffs
--ic50-cutoff 500 # Keep peptides with IC50 <= 500 nM
--percentile-cutoff 2.0 # Keep peptides with percentile rank <= 2.0
--presentation-cutoff 2.0 # Keep peptides with presentation rank <= 2.0
--filter-logic any # "any" (OR, default) or "all" (AND)
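The difference between the two `--filter-logic` settings is easiest to see on a few rows. The sketch below is a hypothetical re-implementation for illustration, not Topiary's code. It borrows the `affinity` and `percentile_rank` column names from the output table, and the numeric values are made up.

```python
from typing import Dict, Optional


def passes_cutoffs(
    row: Dict[str, float],
    ic50_cutoff: Optional[float] = None,
    percentile_cutoff: Optional[float] = None,
    logic: str = "any",
) -> bool:
    """Apply the enabled cutoffs and combine them with OR ("any") or AND ("all")."""
    checks = []
    if ic50_cutoff is not None:
        checks.append(row["affinity"] <= ic50_cutoff)
    if percentile_cutoff is not None:
        checks.append(row["percentile_rank"] <= percentile_cutoff)
    if not checks:
        return True  # no cutoff requested, so nothing is filtered out
    return any(checks) if logic == "any" else all(checks)


# Toy predictions with invented numbers.
rows = [
    {"peptide": "YLQLVFGIEV", "affinity": 120.0, "percentile_rank": 3.5},
    {"peptide": "LLFNILGGWV", "affinity": 900.0, "percentile_rank": 1.0},
    {"peptide": "AAAAAAAAA", "affinity": 5000.0, "percentile_rank": 40.0},
]
kept_any = [r["peptide"] for r in rows if passes_cutoffs(r, 500, 2.0, "any")]
kept_all = [r["peptide"] for r in rows if passes_cutoffs(r, 500, 2.0, "all")]
print(kept_any, kept_all)
```

With these toy numbers, "any" keeps the first two peptides (each passes one cutoff), while "all" keeps none because no peptide passes both.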
Expression-based ranking
--rank-by pMHC_presentation,pMHC_affinity
Sort surviving peptides by presentation score, breaking ties with affinity.
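A tie-breaking sort of this kind can be sketched with a tuple key. The column names `presentation_score` and `affinity_score` are hypothetical, the values are invented, and the sketch assumes that higher scores are better for both kinds.

```python
# Toy rows with invented scores; the first two tie on presentation score.
rows = [
    {"peptide": "YLQLVFGIEV", "presentation_score": 0.9, "affinity_score": 0.4},
    {"peptide": "LLFNILGGWV", "presentation_score": 0.9, "affinity_score": 0.7},
    {"peptide": "AAAAAAAAA", "presentation_score": 0.2, "affinity_score": 0.95},
]

# Sort by presentation first; the affinity score only matters within ties.
ranked = sorted(
    rows,
    key=lambda r: (r["presentation_score"], r["affinity_score"]),
    reverse=True,
)
order = [r["peptide"] for r in ranked]
print(order)
```

The third peptide has the highest affinity score but still ranks last, because the affinity score is consulted only when presentation scores are equal.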
Advanced filter expressions
--ranking "affinity <= 500 | presentation.rank <= 2"
Python API expressions
from topiary import Affinity, Presentation, RankingStrategy, TopiaryPredictor
from mhctools import NetMHCpan
# Combine filters with | (OR) or & (AND)
my_filter = (Affinity <= 500) | (Presentation.rank <= 2.0)
# Composite scoring
my_score = 0.5 * Affinity.score + 0.5 * Presentation.score
predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01"],
    filter=my_filter,
    rank_by=[my_score],
)
Available prediction kinds: Affinity, Presentation, Processing, Stability. Each has .value, .rank, and .score attributes.
Exclusion filtering
For direct sequence/peptide inputs, you can exclude peptides that also appear in reference proteomes — useful for finding tumor-specific or pathogen-specific peptides:
--exclude-ensembl # Exclude peptides in the human Ensembl proteome
--exclude-non-cta # Exclude non-CTA proteins (requires pirlygenes)
--exclude-tissues heart_muscle lung # Exclude genes expressed in these tissues
--exclude-fasta reference.fasta # Exclude peptides in custom reference sequences
--exclude-mode substring # "substring" (default) or "exact"
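One plausible reading of the two exclusion modes is sketched below. This is an assumption about the semantics rather than a description of Topiary's code: "substring" drops a peptide that occurs anywhere inside a reference sequence, while "exact" drops it only when it equals a reference entry. `is_excluded` is a hypothetical helper and the sequences are toy data.

```python
from typing import Iterable


def is_excluded(peptide: str, references: Iterable[str], mode: str = "substring") -> bool:
    """Decide whether a peptide should be dropped given reference sequences."""
    refs = list(references)
    if mode == "substring":
        # Python's `in` on strings is a substring test.
        return any(peptide in ref for ref in refs)
    if mode == "exact":
        return peptide in set(refs)
    raise ValueError(f"unknown exclude mode: {mode!r}")


reference = ["MKTAYLQLVFGIEVKK"]  # toy reference protein
print(is_excluded("YLQLVFGIEV", reference, "substring"))  # True
print(is_excluded("YLQLVFGIEV", reference, "exact"))      # False
```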
Region restriction
Limit prediction to specific protein regions (only applies to sequence inputs, not peptides):
--regions spike:319-541 nucleocapsid:0-50
Format: name:start-end (0-based, half-open interval).
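Because the interval is 0-based and half-open, it maps directly onto Python slicing. The parser below is a hypothetical sketch of that convention, not Topiary's own parsing code.

```python
from typing import Tuple


def parse_region(spec: str) -> Tuple[str, int, int]:
    """Parse 'name:start-end' into (name, start, end)."""
    # rpartition splits on the last colon, so the name is everything before it.
    name, _, interval = spec.rpartition(":")
    start_text, _, end_text = interval.partition("-")
    start, end = int(start_text), int(end_text)
    if not name or start < 0 or end < start:
        raise ValueError(f"invalid region: {spec!r}")
    return name, start, end


name, start, end = parse_region("spike:319-541")
print(name, end - start)  # spike 222

# Half-open slicing on a toy sequence: positions 2, 3 and 4 are kept.
print("ABCDEFGHIJ"[2:5])  # CDE
```

A region such as `nucleocapsid:0-50` therefore covers the first 50 residues, at indices 0 through 49.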
RNA expression filtering
For variant-based workflows, filter by gene or transcript expression:
--rna-gene-fpkm-tracking-file genes.fpkm_tracking
--rna-min-gene-expression 4.0
--rna-transcript-fpkm-tracking-file isoforms.fpkm_tracking
--rna-min-transcript-expression 1.5
Also supports StringTie GTF format: --rna-transcript-fpkm-gtf-file.
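The gene-level filter amounts to reading an expression table and keeping genes above a threshold. The sketch below assumes a Cufflinks-style tab-separated file with `gene_id` and `FPKM` columns and treats the threshold as inclusive; both points are assumptions, `expressed_gene_ids` is a hypothetical helper, and the FPKM values are invented.

```python
import csv
import io
from typing import Set, TextIO


def expressed_gene_ids(handle: TextIO, min_fpkm: float) -> Set[str]:
    """Return gene IDs whose FPKM is at or above the threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene_id"] for row in reader if float(row["FPKM"]) >= min_fpkm}


# Toy table standing in for genes.fpkm_tracking (values are made up).
toy_table = (
    "tracking_id\tgene_id\tgene_short_name\tFPKM\n"
    "ENSG00000157764\tENSG00000157764\tBRAF\t12.5\n"
    "ENSG00000141510\tENSG00000141510\tTP53\t0.8\n"
)
expressed = expressed_gene_ids(io.StringIO(toy_table), min_fpkm=4.0)
print(expressed)
```

Variants in genes missing from the returned set would then be dropped before epitope prediction.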
Output
--output-csv results.csv # CSV output
--output-html results.html # HTML table
--output-csv-sep "\t" # Use tab separator
--subset-output-columns peptide allele affinity # Select columns
--rename-output-column value ic50 # Rename columns
Output columns
All predictions: source_sequence_name, peptide, peptide_offset, peptide_length, allele, kind, score, value, affinity, percentile_rank, prediction_method_name
Variant predictions add: variant, gene, gene_id, transcript_id, transcript_name, effect, effect_type, contains_mutant_residues, mutation_start_in_peptide, mutation_end_in_peptide
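The relationship between `peptide_offset`, `peptide_length`, and the mutation columns can be sketched as an interval intersection. The helper below is hypothetical and assumes 0-based, half-open coordinates throughout. It does not handle zero-width intervals such as pure deletions, which need separate treatment.

```python
from typing import Optional, Tuple


def mutation_in_peptide(
    peptide_offset: int,
    peptide_length: int,
    mutation_start: int,
    mutation_end: int,
) -> Optional[Tuple[int, int]]:
    """Return the mutated interval in peptide coordinates, or None if no overlap."""
    peptide_end = peptide_offset + peptide_length
    # Clip the protein-level mutation interval to the peptide window.
    start = max(mutation_start, peptide_offset)
    end = min(mutation_end, peptide_end)
    if start >= end:
        return None
    return start - peptide_offset, end - peptide_offset


# A single substituted residue at protein index 599 (0-based), e.g. residue 600.
print(mutation_in_peptide(595, 9, 599, 600))  # (4, 5)
print(mutation_in_peptide(600, 9, 599, 600))  # None
```

A `None` result corresponds to a peptide that does not contain any mutant residue, which is what `--only-novel-epitopes` is meant to filter out.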
Built-in protein sources (Python API)
The topiary.sources module provides functions for loading protein sequences from Ensembl and PirlyGenes:
from topiary.sources import (
    ensembl_proteome,
    sequences_from_gene_names,
    sequences_from_gene_ids,
    sequences_from_transcript_ids,
    cta_sequences,
    non_cta_sequences,
    tissue_expressed_sequences,
    available_tissues,
)
# All return dict[name -> amino_acid_sequence]
seqs = sequences_from_gene_names(["BRAF", "TP53", "EGFR"])
cta = cta_sequences()
tissues = available_tissues() # list of tissue names