Skip to main content

Predict cancer epitopes from cancer sequence data

Project description

Tests Coverage Status PyPI

Topiary

Predict which peptides from protein sequences will be presented by MHC molecules, making them potential T-cell epitopes. Used in cancer immunotherapy research to find mutant peptides (neoantigens) that the immune system could target.

Installation

pip install topiary

Topiary 4.9.0 and later require mhctools>=3.7.0.

For variant annotation and gene lookups, download Ensembl reference data:

pyensembl install --release 110 --species human

Predicting MHC binding

The simplest use case: score a handful of peptides against one or more HLA alleles.

from topiary import TopiaryPredictor
from mhctools import NetMHCpan

predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01"],
)

df = predictor.predict_from_named_peptides({
    "peptide_1": "YLQLVFGIEV",
    "peptide_2": "LLFNILGGWV",
    "peptide_3": "GILGFVFTL",
})

The result is a DataFrame with one row per peptide-allele-kind combination. Each row has score (normalized, higher = better), value (raw prediction, e.g. IC50 nM), and percentile_rank (lower = better).

To scan full-length proteins with a sliding window instead:

df = predictor.predict_from_named_sequences({
    "BRAF_V600E": "MAALSGGGGG...LATEKSRWSG",
})

Multiple models can run together. The DataFrame will have rows for each model's prediction kinds:

from mhctools import NetMHCpan, MHCflurry

predictor = TopiaryPredictor(
    models=[NetMHCpan, MHCflurry],
    alleles=["HLA-A*02:01", "HLA-B*07:02"],
)

Filtering and ranking

Topiary has an expression language for filtering and ranking predictions. It works identically as Python objects and as CLI strings.

Filters

Keep only peptides that pass a threshold:

from topiary import TopiaryPredictor, Affinity, Presentation
from mhctools import NetMHCpan

predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01"],
    filter_by=Affinity <= 500,                       # IC50 nM cutoff
)

# Or with presentation score:
predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01"],
    filter_by=(Affinity <= 500) | (Presentation.rank <= 2.0),  # keep if either passes
)

Prediction kinds and fields

Accessor Aliases What it measures
Affinity ba, aff, ic50 Binding affinity (IC50 nM)
Presentation el Eluted ligand / presentation score
Stability pMHC complex stability
Processing Antigen processing / cleavage

Each kind has three fields:

  • .value — raw prediction (e.g. IC50 in nM). Default when no field is specified, so Affinity <= 500 means Affinity.value <= 500.
  • .rank — percentile rank (lower = better).
  • .score — normalized score (higher = better).

Ranking with transforms

Sort surviving peptides with sort_by. Transforms normalize heterogeneous scores to a common scale:

from topiary import TopiaryPredictor, Affinity, Presentation, mean
from mhctools import NetMHCpan, MHCflurry

predictor = TopiaryPredictor(
    models=[NetMHCpan, MHCflurry],
    alleles=["HLA-A*02:01"],
    filter_by=Affinity <= 500,
    sort_by=mean(
        Affinity["netmhcpan"].logistic(350, 150),
        Affinity["mhcflurry"].logistic(350, 150),
    ),
)

When multiple models produce the same kind, qualify with brackets: Affinity["netmhcpan"].

Available transforms:

Transform What it does
.descending_cdf(mean, std) Lower input -> higher output (for IC50, rank)
.ascending_cdf(mean, std) Higher input -> higher output (for scores)
.logistic(midpoint, width) Sigmoid normalization
.clip(lo, hi) Clamp to range
.hinge() max(0, x)
.log() / .log2() / .log10() / .log1p() Logarithms
.sqrt() / .exp() Square root, exponential
abs(...) Absolute value

Aggregations: mean(), geomean(), minimum(), maximum(), median()

String expressions

Every filter and ranking expression has a string form for use at the CLI or in configuration:

Python String
Affinity <= 500 affinity <= 500 / ba <= 500
Affinity.rank <= 2 affinity.rank <= 2
Presentation.score >= 0.5 el.score >= 0.5
(Affinity <= 500) | (Presentation.rank <= 2) affinity <= 500 | el.rank <= 2
Affinity["netmhcpan"] <= 500 netmhcpan_ba <= 500
0.5 * Affinity.score + 0.5 * Presentation.score 0.5 * affinity.score + 0.5 * presentation.score
Affinity.logistic(350, 150) affinity.logistic(350, 150)
mean(Affinity.score, Presentation.score) mean(affinity.score, presentation.score)
Column("gene_tpm") >= 5 gene_tpm >= 5
Column("gene_tpm").log() gene_tpm.log()

Column references

Any DataFrame column can be used directly in expressions — expression data, peptide properties, custom annotations. Unknown identifiers are automatically treated as column references:

from topiary import Affinity, Column

# As a filter — bare name or Column() both work
filter_by = Column("gene_tpm") >= 5.0

# As a sorting signal
sort_by = 0.5 * Affinity.logistic(350, 150) - 0.1 * Column("cysteine_count")

# Transforms work on columns too
sort_by = Column("gene_tpm").log1p()

In CLI strings, bare names work directly:

--filter-by "ba <= 500 & gene_tpm >= 5"
--sort-by "affinity.logistic(350, 150) + 0.1 * gene_tpm.log1p()"

The explicit column(name) syntax also works: gene_tpm >= 5.

Misspelled column names get a helpful error at evaluation time: Column 'hydrophobicty' not found. Did you mean: ['hydrophobicity']?

Scope prefixes

The wt. prefix reads predictions for the wildtype peptide at the same position (variant workflows only). Use it for differential binding:

from topiary import Affinity, wt

sort_by = Affinity.score - wt.Affinity.score  # mutant advantage

shuffled. and self. prefixes work the same way for shuffled-decoy and self-proteome contexts.

Peptide-level expressions

Len() reads the peptide length. Count("C") counts amino acid occurrences:

from topiary import Len, Count, wt

sort_by = Count("C") - wt.Count("C")   # gained/lost cysteines vs wildtype

Expression data

Load gene-level, transcript-level, or variant-level quantification to annotate predictions. Expression values become DataFrame columns that you can reference in ranking and filtering.

How it works

Each --*-expression flag takes a spec string: [name:]file[:id_col[:val_col]].

  1. The file is auto-detected (Salmon .sf, Kallisto abundance.tsv, RSEM .genes.results / .isoforms.results, StringTie .gtf, Cufflinks .fpkm_tracking) or treated as generic TSV/CSV.
  2. The loaded columns are joined onto the prediction DataFrame by gene_id, transcript_id, or variant respectively.
  3. Value columns are prefixed with the name (default: gene, transcript, or variant) and lowercased. For example, Salmon's TPM column loaded with --gene-expression becomes gene_tpm.

Python API

import pandas as pd
from topiary import TopiaryPredictor, Affinity, Column
from topiary.rna import load_expression_from_spec
from mhctools import NetMHCpan
from varcode import load_vcf

# Load expression data
gene_name, gene_id_col, gene_df = load_expression_from_spec(
    "salmon_quant.sf", default_name="gene"
)
# gene_df has columns: ["Name", "TPM"]
# gene_name = "gene", gene_id_col = "Name"

expression_data = {
    "gene": [(gene_name, gene_id_col, gene_df)],
    "transcript": [],
    "variant": [],
}

predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01"],
    filter_by=(Affinity <= 500) & (Column("gene_tpm") >= 5.0),
)

variants = load_vcf("somatic.vcf")
df = predictor.predict_from_variants(
    variants,
    expression_data=expression_data,
)
# df now has a "gene_tpm" column from the Salmon file

Column naming

The name prefix and the original column name are combined as {prefix}_{column}, both lowercased:

Flag File Original column Result column
--gene-expression quant.sf Salmon TPM gene_tpm
--gene-expression quant.sf Salmon NumReads gene_numreads
--transcript-expression abundance.tsv Kallisto tpm transcript_tpm
--variant-expression isovar.tsv generic num_alt_reads variant_num_alt_reads
--gene-expression mygene:quant.sf Salmon TPM mygene_tpm

The join keys are implicit: gene-level data joins on gene_id, transcript-level on transcript_id, variant-level on variant. These columns are present in variant-pipeline output.

Transcript selection

When transcript-level expression data is provided, Topiary uses it to select the highest-expressed transcript per variant (instead of the default priority-based selection). This matches the behavior of the legacy --rna-transcript-fpkm-tracking-file flag.

Auto-detected formats

Extension / pattern Format Default ID column Default value column
.sf Salmon Name TPM
abundance*.tsv (with target_id + tpm header) Kallisto target_id tpm
.genes.results RSEM (gene) gene_id TPM
.isoforms.results RSEM (transcript) transcript_id TPM
.gtf StringTie reference_id TPM (or FPKM)
.fpkm_tracking Cufflinks tracking_id FPKM
anything else generic first column all numeric columns

Override defaults with the full spec: name:file:id_col:val_col.

CLI usage

Score peptides

topiary \
  --peptide-csv peptides.csv \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --output-csv results.csv

The CSV needs a peptide column (and optionally name). Each peptide is scored as-is.

Scan proteins

topiary \
  --fasta proteins.fasta \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01,HLA-B*07:02 \
  --ic50-cutoff 500 \
  --output-csv results.csv

Filter and rank

# Filter: keep if affinity OR presentation passes
topiary \
  --fasta proteins.fasta \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --filter-by "ba <= 500 | el.rank <= 2" \
  --output-csv results.csv

# Rank: composite score from two models
topiary \
  --fasta antigens.fasta \
  --mhc-predictor netmhcpan mhcflurry \
  --mhc-alleles HLA-A*02:01 \
  --filter-by "ba <= 500" \
  --sort-by "mean(affinity['netmhcpan'].logistic(350, 150), affinity['mhcflurry'].logistic(350, 150))" \
  --output-csv results.csv

Neoantigen discovery from somatic variants

topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01,HLA-B*07:02 \
  --filter-by "ba <= 500 | el.rank <= 2" \
  --sort-by "0.6 * affinity.descending_cdf(500, 200) + 0.4 * presentation.score" \
  --only-novel-epitopes \
  --output-csv epitopes.csv

Add expression data

# Gene-level expression from Salmon (auto-detected)
topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --gene-expression salmon_quant.sf \
  --filter-by "ba <= 500 & gene_tpm >= 5" \
  --only-novel-epitopes \
  --output-csv results.csv

# Transcript-level from Kallisto
topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --transcript-expression kallisto/abundance.tsv \
  --filter-by "ba <= 500" \
  --sort-by "affinity.descending_cdf(500, 200) + 0.1 * transcript_tpm.log1p()" \
  --output-csv results.csv

# Variant-level read support from isovar
topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --variant-expression isovar_output.tsv \
  --filter-by "ba <= 500 & variant_num_alt_reads >= 3" \
  --output-csv results.csv

# Combine all three levels
topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --gene-expression salmon_quant.sf \
  --transcript-expression kallisto/abundance.tsv \
  --variant-expression isovar_output.tsv \
  --filter-by "ba <= 500 & gene_tpm >= 5 & variant_num_alt_reads >= 3" \
  --output-csv results.csv

# Override auto-detection with explicit spec
topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --gene-expression mygene:custom_quant.tsv:ensembl_id:tpm_value \
  --output-csv results.csv

Expression flags are only valid with the variant pipeline (--vcf, --maf, --variant). They are rejected with direct sequence inputs (--fasta, --peptide-csv, etc.) because those outputs lack the gene_id / transcript_id / variant columns needed for joining.

Gene and transcript lookups

Pull protein sequences from Ensembl:

topiary --gene-names BRAF TP53 EGFR --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --gene-ids ENSG00000157764 --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --transcript-ids ENST00000288602 --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --ensembl-proteome --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --cta --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01  # cancer-testis antigens

Restrict to protein regions

topiary \
  --fasta proteins.fasta \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --regions spike:319-541 nucleocapsid:0-50 \
  --output-csv results.csv

Format: NAME:START-END (0-indexed, half-open).

Exclusion filtering

Remove peptides found in reference proteomes — for tumor-specific or pathogen-specific peptide selection:

--exclude-ensembl                    # human Ensembl proteome
--exclude-non-cta                    # non-CTA proteins (requires pirlygenes)
--exclude-tissues heart_muscle lung  # genes expressed in these tissues
--exclude-fasta reference.fasta      # custom reference sequences
--exclude-mode substring             # "substring" (default) or "exact"

MHC prediction models

Specify one or more predictors with --mhc-predictor and alleles with --mhc-alleles:

--mhc-predictor netmhcpan mhcflurry --mhc-alleles HLA-A*02:01,HLA-B*07:02

All predictors come from mhctools.

With mhctools 3.7.0+, upstream predictor parsing supports multiple predictors in one CLI invocation, so commands like --mhc-predictor netmhcpan42 bigmhc-el are supported directly. Topiary keeps its higher-level --filter-by / --sort-by DSL on top of that lower-level predictor interface. Topiary's ranking/filtering DSL is also compatible with the simplified mhctools 3.7.0+ kind constants API. NetChop and Pepsickle behavior follows the upstream changes as well: improved NetChop error handling and Pepsickle's epitope-focused model selection.

CLI name Predicts
netmhcpan affinity + presentation (auto-detects installed version)
netmhcpan4 / netmhcpan41 / netmhcpan42 specific NetMHCpan 4.x versions
netmhcpan4-ba / netmhcpan4-el single-mode (binding affinity or eluted ligand)
mhcflurry affinity + presentation + processing
mixmhcpred presentation
netmhciipan / netmhciipan4 / netmhciipan43 MHC class II
bigmhc / bigmhc-el / bigmhc-im presentation / immunogenicity
netmhcstabpan pMHC stability
pepsickle / netchop proteasomal cleavage
netmhcpan-iedb / netmhccons-iedb / smm-iedb IEDB web API (no local install)
random random predictions (for testing)

Peptide lengths: --mhc-epitope-lengths 8,9,10,11 (defaults come from the predictor).

Output

--output-csv results.csv
--output-html results.html
--output-csv-sep "\t"
--subset-output-columns peptide allele affinity
--rename-output-column value ic50

All predictions: source_sequence_name, peptide, peptide_offset, peptide_length, allele, kind, score, value, affinity, percentile_rank, prediction_method_name

Variant predictions add: variant, gene, gene_id, transcript_id, transcript_name, effect, effect_type, contains_mutant_residues, mutation_start_in_peptide, mutation_end_in_peptide

Expression data adds: columns named {prefix}_{column} as described in Column naming, e.g. gene_tpm, transcript_tpm, variant_num_alt_reads.

Peptide properties

Compute amino acid properties and use them in ranking:

from topiary import TopiaryPredictor, Affinity, Column
from topiary.properties import add_peptide_properties

df = predictor.predict_from_named_sequences(seqs)
df = add_peptide_properties(df, groups=["manufacturability"])

# Properties become ranking signals
score = Affinity.logistic(350, 150) - 0.1 * Column("cysteine_count")
# CLI: --sort-by "affinity.logistic(350, 150) - 0.1 * cysteine_count"

Named groups:

  • "core" — charge, hydrophobicity, aromaticity, molecular_weight
  • "manufacturability" — core + cysteine_count, instability_index, max_7mer_hydrophobicity, difficult_nterm/cterm, asp_pro_bonds
  • "immunogenicity" — core + tcr_charge, tcr_aromaticity, tcr_hydrophobicity (TCR-facing positions for MHC-I)

Protein sources (Python API)

from topiary.sources import (
    ensembl_proteome,
    sequences_from_gene_names,
    cta_sequences,
    non_cta_sequences,
    tissue_expressed_sequences,
    available_tissues,
)

# All return dict[name -> amino_acid_sequence]
seqs = sequences_from_gene_names(["BRAF", "TP53", "EGFR"])
cta = cta_sequences()                          # cancer-testis antigens
heart = tissue_expressed_sequences(["heart_muscle"])
print(available_tissues())                      # list tissue names

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

topiary-4.9.0.tar.gz (102.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

topiary-4.9.0-py3-none-any.whl (117.0 kB view details)

Uploaded Python 3

File details

Details for the file topiary-4.9.0.tar.gz.

File metadata

  • Download URL: topiary-4.9.0.tar.gz
  • Upload date:
  • Size: 102.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for topiary-4.9.0.tar.gz
Algorithm Hash digest
SHA256 1b116cc69449f768397d2b5e51d30992460c31ee816a704bd4e93a364c036968
MD5 ed206f05c7eb059244b2eaab52cdf940
BLAKE2b-256 e2615738cfb1b80d9d5b4ea4c9bece8ebf9b86271e78e10b46accaa29f7d0eb8

See more details on using hashes here.

File details

Details for the file topiary-4.9.0-py3-none-any.whl.

File metadata

  • Download URL: topiary-4.9.0-py3-none-any.whl
  • Upload date:
  • Size: 117.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for topiary-4.9.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e1c199fb7acdf1f3d195074d8993f4c2f0381f45fed7b2c14c89e117ff7ef0ec
MD5 01c012ede1730c354052af85c0497dea
BLAKE2b-256 0b809fb9836c535019b504d5b4da9f8761e020d37ef419c8be58516ef60de08a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page