Predict cancer epitopes from cancer sequence data

Topiary

Topiary predicts which peptides from protein sequences will be presented by MHC molecules, making them potential T-cell epitopes. It is used in cancer immunotherapy research to find mutant peptides (neoantigens) that the immune system could target.

Core idea: Given protein sequences + HLA alleles + an MHC binding predictor, Topiary scans all possible peptides and returns those predicted to bind MHC, ranked by binding strength.

Topiary can start from several types of input:

  • Somatic variants (VCF/MAF) — the original use case: find mutant peptides from cancer sequencing data
  • Protein sequences (FASTA/CSV) — scan full-length proteins with a sliding window
  • Peptide lists (FASTA/CSV) — score specific peptides directly, no sliding window
  • Gene/transcript IDs — pull sequences from Ensembl automatically
  • Built-in gene sets — cancer-testis antigens (CTA), tissue-expressed genes

How it works

  1. Get protein sequences — from variants (via varcode), FASTA/CSV files, or Ensembl lookups
  2. Generate candidate peptides — sliding window over proteins, or use peptides as-is
  3. Predict MHC binding — via mhctools (NetMHCpan, MHCflurry, etc.)
  4. Filter and rank — by binding affinity, percentile rank, presentation score, or custom expressions
  5. Annotate — with gene/transcript info, mutation positions, RNA expression levels
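
Step 2 can be pictured with a short sketch. This is not Topiary's own code, only a minimal illustration of a sliding window over one protein; the function name `sliding_window_peptides` and the toy sequence are made up for the example.

```python
def sliding_window_peptides(sequence, lengths=(8, 9, 10, 11)):
    """Yield (offset, peptide) for every window of each requested length.

    Hypothetical helper for illustration; offsets are 0-based.
    """
    for k in lengths:
        # range() is empty when the sequence is shorter than k
        for offset in range(len(sequence) - k + 1):
            yield offset, sequence[offset:offset + k]


protein = "MAALSGGGGGAEPG"  # made-up 14-residue toy sequence
candidates = list(sliding_window_peptides(protein, lengths=(8, 9)))

print(len(candidates))  # 7 windows of length 8 plus 6 of length 9
print(candidates[0])
```

Each candidate keeps its offset so that later steps can map a peptide back to its position in the source protein.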

For variant inputs, Topiary also filters by RNA expression and identifies which predicted epitopes actually overlap the mutation.
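
The mutation-overlap check can be sketched as an interval intersection. The helper below is hypothetical and assumes 0-based, half-open protein coordinates; the returned triple only mirrors the names of the `contains_mutant_residues`, `mutation_start_in_peptide`, and `mutation_end_in_peptide` output columns, and Topiary's own conventions may differ.

```python
def mutation_overlap(peptide_offset, peptide_length, mut_start, mut_end):
    """Return (contains_mutant_residues, start_in_peptide, end_in_peptide).

    Assumption: all intervals are 0-based and half-open, in protein coordinates.
    """
    peptide_end = peptide_offset + peptide_length
    start = max(peptide_offset, mut_start)
    end = min(peptide_end, mut_end)
    if start >= end:
        # The peptide window does not touch the mutated residues
        return False, None, None
    # Convert protein coordinates to peptide-relative coordinates
    return True, start - peptide_offset, end - peptide_offset


# A 9-mer starting at offset 595 covers a single mutated residue at position 599
print(mutation_overlap(595, 9, 599, 600))
# A 9-mer starting at offset 610 lies entirely downstream of the mutation
print(mutation_overlap(610, 9, 599, 600))
```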

Installation

pip install topiary

For Ensembl-based features (variant annotation, gene lookups), download reference data:

# GRCh38 (hg38) — most common
pyensembl install --release 93 --species human

# GRCh37 (hg19) — if your variants use this reference
pyensembl install --release 75 --species human

For cancer-testis antigen and tissue expression features:

pip install pirlygenes

For tab completion of command-line arguments (bash/zsh/fish):

pip install 'topiary[completion]'
activate-global-python-argcomplete

Quick start

Command line

Scan a FASTA file for MHC binders:

topiary \
  --fasta proteins.fasta \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01,HLA-B*07:02 \
  --ic50-cutoff 500 \
  --output-csv results.csv

Score specific peptides (no sliding window):

topiary \
  --peptide-csv peptides.csv \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --ic50-cutoff 500 \
  --output-csv results.csv

Find neoantigen candidates from somatic variants:

topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01,HLA-B*07:02 \
  --ic50-cutoff 500 \
  --percentile-cutoff 2.0 \
  --rna-gene-fpkm-tracking-file genes.fpkm_tracking \
  --rna-min-gene-expression 4.0 \
  --only-novel-epitopes \
  --output-csv epitopes.csv

Scan cancer-testis antigens, excluding peptides found in vital organs:

topiary \
  --cta \
  --exclude-tissues heart_muscle lung liver \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --ic50-cutoff 500 \
  --output-csv cta_epitopes.csv

Python API

from topiary import TopiaryPredictor, Affinity, Presentation
from mhctools import NetMHCpan

# Set up predictor with filtering
predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01", "HLA-B*07:02"],
    filter=(Affinity <= 500) | (Presentation.rank <= 2.0),
    rank_by=[Presentation.score, Affinity.score],
)

# Scan protein sequences (sliding window)
df = predictor.predict_from_named_sequences({
    "BRAF_V600E": "MAALSGGGGG...LATEKSRWSG",
    "TP53_R248W": "MEEPQSDPSV...ALPQHAHAQM",
})

# Score specific peptides (no sliding window)
df = predictor.predict_from_named_peptides({
    "peptide_1": "YLQLVFGIEV",
    "peptide_2": "LLFNILGGWV",
})

# From somatic variants (requires varcode)
from varcode import load_vcf
variants = load_vcf("somatic.vcf")
df = predictor.predict_from_variants(variants)

Input modes

Sequence and peptide files

Flag                  Format                                      Behavior
--fasta FILE          FASTA with full-length proteins             Sliding-window scan
--peptide-fasta FILE  FASTA where each entry is one peptide       Scored as-is
--sequence-csv FILE   CSV with sequence column (+ optional name)  Sliding-window scan
--peptide-csv FILE    CSV with peptide column (+ optional name)   Scored as-is
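
As a rough illustration of the peptide CSV layout described above, the sketch below builds a file with a `peptide` column and an optional `name` column using the standard library. The column order and the use of an in-memory buffer are choices made for the example; to write a real file, replace the buffer with `open("peptides.csv", "w", newline="")`.

```python
import csv
import io

rows = [
    {"name": "peptide_1", "peptide": "YLQLVFGIEV"},
    {"name": "peptide_2", "peptide": "LLFNILGGWV"},
]

# Write to an in-memory buffer so the example has no side effects
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "peptide"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```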

Gene and transcript lookups

These pull protein sequences from Ensembl automatically:

Flag                          Example or description
--gene-names NAME [NAME ...]  --gene-names BRAF TP53 EGFR
--gene-ids ID [ID ...]        --gene-ids ENSG00000157764
--transcript-ids ID [ID ...]  --transcript-ids ENST00000288602
--ensembl-proteome            Scan the entire Ensembl proteome
--cta                         Cancer-testis antigen genes (requires pirlygenes)
--ensembl-release N           Use a specific Ensembl release (default: 93 for human)

For gene lookups, Topiary uses the longest protein-coding transcript per gene.
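
The selection rule can be sketched as follows. The `Transcript` record here is a hypothetical stand-in, not a pyensembl class, and measuring "longest" by protein length is an assumption of this example rather than something stated above.

```python
from typing import NamedTuple, Optional


class Transcript(NamedTuple):
    """Hypothetical record used only for this illustration."""
    transcript_id: str
    biotype: str
    protein_sequence: Optional[str]


def longest_protein_coding(transcripts):
    """Pick the protein-coding transcript with the longest protein, or None."""
    coding = [
        t for t in transcripts
        if t.biotype == "protein_coding" and t.protein_sequence
    ]
    if not coding:
        return None
    return max(coding, key=lambda t: len(t.protein_sequence))


toy_gene = [
    Transcript("T1", "protein_coding", "MAAL"),
    Transcript("T2", "protein_coding", "MAALSGG"),
    Transcript("T3", "retained_intron", None),
]
best = longest_protein_coding(toy_gene)
print(best.transcript_id)
```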

Genomic variants

Flag                          Description
--vcf FILE                    VCF file of somatic variants
--maf FILE                    TCGA MAF file
--variant CHR POS REF ALT     Individual variant (requires --ensembl-version)
--protein-change GENE CHANGE  Direct protein change, e.g. --protein-change EGFR T790M

Multiple input flags can be combined in a single run.

MHC binding prediction

You must specify a predictor and alleles:

--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01,HLA-B*07:02

Supported predictors: netmhcpan, netmhc, netmhciipan, netmhccons, mhcflurry, random, and IEDB web API variants (netmhcpan-iedb, netmhccons-iedb, smm-iedb, smm-pmbec-iedb).

Alleles can be specified as a comma-separated list (--mhc-alleles) or one per line in a file (--mhc-alleles-file).
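
The two allele input styles carry the same information. The sketch below shows one plausible way to normalize both into a list; the helper names are hypothetical and whitespace handling is an assumption.

```python
def parse_allele_list(text):
    """Split a comma-separated allele string, dropping empty entries."""
    return [allele.strip() for allele in text.split(",") if allele.strip()]


def parse_allele_lines(lines):
    """Read one allele per line, skipping blank lines."""
    return [line.strip() for line in lines if line.strip()]


from_flag = parse_allele_list("HLA-A*02:01,HLA-B*07:02")
from_file = parse_allele_lines(["HLA-A*02:01\n", "\n", "HLA-B*07:02\n"])
print(from_flag)
print(from_file)
```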

Peptide lengths: --mhc-epitope-lengths 8,9,10,11 (defaults come from the predictor).

Filtering and ranking

Simple cutoffs

--ic50-cutoff 500            # Keep peptides with IC50 <= 500 nM
--percentile-cutoff 2.0      # Keep peptides with percentile rank <= 2.0
--presentation-cutoff 2.0    # Keep peptides with presentation rank <= 2.0
--filter-logic any           # "any" (OR, default) or "all" (AND)
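
The difference between the two logic modes can be shown with a small sketch. The row layout and the `passes` helper are made up for illustration; only the `<=` comparison and the OR/AND semantics come from the options above.

```python
def passes(row, cutoffs, logic="any"):
    """Apply '<=' cutoffs to one prediction row, combined with OR or AND."""
    checks = [row[column] <= limit for column, limit in cutoffs.items()]
    return any(checks) if logic == "any" else all(checks)


cutoffs = {"affinity": 500, "percentile_rank": 2.0}
# Made-up values: strong affinity, but percentile rank above the cutoff
row = {"peptide": "YLQLVFGIEV", "affinity": 120.0, "percentile_rank": 3.1}

print(passes(row, cutoffs, logic="any"))  # kept: the affinity cutoff is met
print(passes(row, cutoffs, logic="all"))  # dropped: percentile rank fails
```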

Expression-based ranking

--rank-by pMHC_presentation,pMHC_affinity

Sort surviving peptides by presentation score, breaking ties with affinity.
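
Tie-breaking of this kind amounts to sorting on a tuple of keys. The sketch below uses made-up rows and hypothetical column names, and assumes that a higher score is better for both keys.

```python
rows = [
    {"peptide": "AAAAAAAAA", "presentation_score": 0.90, "affinity_score": 0.40},
    {"peptide": "CCCCCCCCC", "presentation_score": 0.90, "affinity_score": 0.70},
    {"peptide": "DDDDDDDDD", "presentation_score": 0.95, "affinity_score": 0.10},
]

# Primary key first, tie-breaker second; reverse=True puts high scores on top
ranked = sorted(
    rows,
    key=lambda r: (r["presentation_score"], r["affinity_score"]),
    reverse=True,
)
print([r["peptide"] for r in ranked])
```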

Advanced filter expressions

--ranking "affinity <= 500 | presentation.rank <= 2"

Python API expressions

from topiary import TopiaryPredictor, Affinity, Presentation, RankingStrategy
from mhctools import NetMHCpan

# Combine filters with | (OR) or & (AND)
my_filter = (Affinity <= 500) | (Presentation.rank <= 2.0)

# Composite scoring
my_score = 0.5 * Affinity.score + 0.5 * Presentation.score

predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01"],
    filter=my_filter,
    rank_by=[my_score],
)

Available prediction kinds: Affinity, Presentation, Processing, Stability. Each has .value, .rank, and .score attributes.
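
To show how expressions such as `(Affinity <= 500) | (Presentation.rank <= 2.0)` can be built from ordinary Python operators, here is a toy sketch based on operator overloading. The `Kind` and `Predicate` classes are hypothetical stand-ins written for this example and are not Topiary's implementation.

```python
class Predicate:
    """Wraps a row -> bool function and supports | and & composition."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        return Predicate(lambda row: self.fn(row) or other.fn(row))

    def __and__(self, other):
        return Predicate(lambda row: self.fn(row) and other.fn(row))

    def __call__(self, row):
        return self.fn(row)


class Kind:
    """Toy stand-in for a prediction kind, keyed by a column name."""

    def __init__(self, column):
        self.column = column

    def __le__(self, cutoff):
        return Predicate(lambda row: row[self.column] <= cutoff)


affinity = Kind("affinity")
presentation_rank = Kind("presentation_rank")

# Parentheses are required: | binds more tightly than <= in Python
keep = (affinity <= 500) | (presentation_rank <= 2.0)

print(keep({"affinity": 800.0, "presentation_rank": 1.5}))
print(keep({"affinity": 800.0, "presentation_rank": 5.0}))
```

The parentheses around each comparison matter in real filter expressions too, because Python evaluates `|` and `&` before comparison operators.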

Exclusion filtering

For direct sequence/peptide inputs, you can exclude peptides that also appear in reference proteomes — useful for finding tumor-specific or pathogen-specific peptides:

--exclude-ensembl                    # Exclude peptides in the human Ensembl proteome
--exclude-non-cta                    # Exclude non-CTA proteins (requires pirlygenes)
--exclude-tissues heart_muscle lung  # Exclude genes expressed in these tissues
--exclude-fasta reference.fasta      # Exclude peptides in custom reference sequences
--exclude-mode substring             # "substring" (default) or "exact"
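
The two exclusion modes can be contrasted with a short sketch. The helper and sequences are made up; in particular, reading "exact" as "the peptide equals a whole reference entry" is an assumption of this example.

```python
def exclude_peptides(peptides, reference_sequences, mode="substring"):
    """Drop peptides that occur in the reference set.

    substring: drop a peptide found anywhere inside a reference sequence.
    exact: drop a peptide only if it equals a whole reference sequence
           (assumed meaning, for illustration).
    """
    if mode == "exact":
        reference = set(reference_sequences)
        return [p for p in peptides if p not in reference]
    return [
        p for p in peptides
        if not any(p in ref for ref in reference_sequences)
    ]


reference = ["MKTAYIAKQR", "QISFVKSHFS"]  # made-up reference sequences
peptides = ["TAYIAKQ", "MKTAYIAKQR", "WWWWWWWW"]

print(exclude_peptides(peptides, reference, mode="substring"))
print(exclude_peptides(peptides, reference, mode="exact"))
```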

Region restriction

Limit prediction to specific protein regions (only applies to sequence inputs, not peptides):

--regions spike:319-541 nucleocapsid:0-50

Format: name:start-end (0-based, half-open interval).
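
Because the interval is 0-based and half-open, it maps directly onto a Python slice. The parser below is a hypothetical sketch of the `name:start-end` format, not Topiary's own parsing code.

```python
def parse_region(spec):
    """Parse 'name:start-end' into (name, start, end) with basic validation."""
    name, _, interval = spec.rpartition(":")
    start_text, _, end_text = interval.partition("-")
    start, end = int(start_text), int(end_text)
    if not name or start < 0 or end <= start:
        raise ValueError(f"bad region spec: {spec!r}")
    return name, start, end


name, start, end = parse_region("spike:319-541")
print(name, start, end)

# A half-open interval [319, 541) covers end - start residues
toy_sequence = "A" * 1000  # placeholder sequence, not a real protein
region = toy_sequence[start:end]
print(len(region))
```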

RNA expression filtering

For variant-based workflows, filter by gene or transcript expression:

--rna-gene-fpkm-tracking-file genes.fpkm_tracking
--rna-min-gene-expression 4.0
--rna-transcript-fpkm-tracking-file isoforms.fpkm_tracking
--rna-min-transcript-expression 1.5

Also supports StringTie GTF format: --rna-transcript-fpkm-gtf-file.
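
The gene-level threshold can be illustrated with a small sketch. It assumes a tab-separated tracking file whose header includes `gene_id` and `FPKM` columns, as in Cufflinks output; the rows are made up, only a subset of columns is shown, and treating the minimum as inclusive is an assumption.

```python
import csv
import io

# Made-up excerpt with a reduced set of columns
tracking_text = (
    "tracking_id\tgene_id\tgene_short_name\tFPKM\n"
    "ENSG00000157764\tENSG00000157764\tBRAF\t12.5\n"
    "ENSG00000141510\tENSG00000141510\tTP53\t0.8\n"
)


def expressed_gene_ids(handle, min_fpkm):
    """Return the set of gene IDs whose FPKM is at least min_fpkm."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene_id"] for row in reader if float(row["FPKM"]) >= min_fpkm}


expressed = expressed_gene_ids(io.StringIO(tracking_text), min_fpkm=4.0)
print(expressed)
```

Variants in genes outside the returned set would then be dropped before epitope prediction.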

Output

--output-csv results.csv          # CSV output
--output-html results.html        # HTML table
--output-csv-sep "\t"             # Use tab separator
--subset-output-columns peptide allele affinity  # Select columns
--rename-output-column value ic50               # Rename columns

Output columns

All predictions: source_sequence_name, peptide, peptide_offset, peptide_length, allele, kind, score, value, affinity, percentile_rank, prediction_method_name

Variant predictions add: variant, gene, gene_id, transcript_id, transcript_name, effect, effect_type, contains_mutant_residues, mutation_start_in_peptide, mutation_end_in_peptide
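
Once results are written as CSV, they can be post-processed with the standard library. The sketch below uses made-up rows containing a few of the columns listed above and keeps the strongest-binding allele (lowest `affinity`) for each peptide.

```python
import csv
import io

# Made-up rows for illustration; real output has more columns
results_text = (
    "source_sequence_name,peptide,allele,affinity\n"
    "seq1,YLQLVFGIEV,HLA-A*02:01,35.0\n"
    "seq1,YLQLVFGIEV,HLA-B*07:02,9200.0\n"
    "seq1,LLFNILGGWV,HLA-A*02:01,410.0\n"
)

best = {}  # peptide -> (allele, ic50)
for row in csv.DictReader(io.StringIO(results_text)):
    ic50 = float(row["affinity"])
    current = best.get(row["peptide"])
    # Lower IC50 means stronger predicted binding
    if current is None or ic50 < current[1]:
        best[row["peptide"]] = (row["allele"], ic50)

print(best)
```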

Built-in protein sources (Python API)

The topiary.sources module provides functions for loading protein sequences from Ensembl and PirlyGenes:

from topiary.sources import (
    ensembl_proteome,
    sequences_from_gene_names,
    sequences_from_gene_ids,
    sequences_from_transcript_ids,
    cta_sequences,
    non_cta_sequences,
    tissue_expressed_sequences,
    available_tissues,
)

# All return dict[name -> amino_acid_sequence]
seqs = sequences_from_gene_names(["BRAF", "TP53", "EGFR"])
cta = cta_sequences()
tissues = available_tissues()  # list of tissue names
