Predict cancer epitopes from cancer sequence data
Project description
Topiary
Predict which peptides from protein sequences will be presented by MHC molecules, making them potential T-cell epitopes. Used in cancer immunotherapy research to find mutant peptides (neoantigens) that the immune system could target.
Installation
pip install topiary
Topiary 4.9.0 and later require mhctools>=3.7.0.
For variant annotation and gene lookups, download Ensembl reference data:
pyensembl install --release 110 --species human
Predicting MHC binding
The simplest use case: score a handful of peptides against one or more HLA alleles.
from topiary import TopiaryPredictor
from mhctools import NetMHCpan
predictor = TopiaryPredictor(
models=[NetMHCpan],
alleles=["HLA-A*02:01"],
)
df = predictor.predict_from_named_peptides({
"peptide_1": "YLQLVFGIEV",
"peptide_2": "LLFNILGGWV",
"peptide_3": "GILGFVFTL",
})
The result is a DataFrame with one row per peptide-allele-kind combination. Each row has score (normalized, higher = better), value (raw prediction, e.g. IC50 nM), and percentile_rank (lower = better).
To scan full-length proteins with a sliding window instead:
df = predictor.predict_from_named_sequences({
"BRAF_V600E": "MAALSGGGGG...LATEKSRWSG",
})
Multiple models can run together. The DataFrame will have rows for each model's prediction kinds:
from mhctools import NetMHCpan, MHCflurry
predictor = TopiaryPredictor(
models=[NetMHCpan, MHCflurry],
alleles=["HLA-A*02:01", "HLA-B*07:02"],
)
Filtering and ranking
Topiary has an expression language for filtering and ranking predictions. It works identically as Python objects and as CLI strings.
Filters
Keep only peptides that pass a threshold:
from topiary import TopiaryPredictor, Affinity, Presentation
from mhctools import NetMHCpan
predictor = TopiaryPredictor(
models=[NetMHCpan],
alleles=["HLA-A*02:01"],
filter_by=Affinity <= 500, # IC50 nM cutoff
)
# Or with presentation score:
predictor = TopiaryPredictor(
models=[NetMHCpan],
alleles=["HLA-A*02:01"],
filter_by=(Affinity <= 500) | (Presentation.rank <= 2.0), # keep if either passes
)
Prediction kinds and fields
| Accessor | Aliases | What it measures |
|---|---|---|
Affinity |
ba, aff, ic50 |
Binding affinity (IC50 nM) |
Presentation |
el |
Eluted ligand / presentation score |
Stability |
pMHC complex stability | |
Processing |
Antigen processing / cleavage |
Each kind has three fields:
.value— raw prediction (e.g. IC50 in nM). Default when no field is specified, soAffinity <= 500meansAffinity.value <= 500..rank— percentile rank (lower = better)..score— normalized score (higher = better).
Ranking with transforms
Sort surviving peptides with sort_by. Transforms normalize heterogeneous scores to a common scale:
from topiary import TopiaryPredictor, Affinity, Presentation, mean
from mhctools import NetMHCpan, MHCflurry
predictor = TopiaryPredictor(
models=[NetMHCpan, MHCflurry],
alleles=["HLA-A*02:01"],
filter_by=Affinity <= 500,
sort_by=mean(
Affinity["netmhcpan"].logistic(350, 150),
Affinity["mhcflurry"].logistic(350, 150),
),
)
When multiple models produce the same kind, qualify with brackets: Affinity["netmhcpan"].
Available transforms:
| Transform | What it does |
|---|---|
.descending_cdf(mean, std) |
Lower input -> higher output (for IC50, rank) |
.ascending_cdf(mean, std) |
Higher input -> higher output (for scores) |
.logistic(midpoint, width) |
Sigmoid normalization |
.clip(lo, hi) |
Clamp to range |
.hinge() |
max(0, x) |
.log() / .log2() / .log10() / .log1p() |
Logarithms |
.sqrt() / .exp() |
Square root, exponential |
abs(...) |
Absolute value |
Aggregations: mean(), geomean(), minimum(), maximum(), median()
String expressions
Every filter and ranking expression has a string form for use at the CLI or in configuration:
| Python | String |
|---|---|
Affinity <= 500 |
affinity <= 500 / ba <= 500 |
Affinity.rank <= 2 |
affinity.rank <= 2 |
Presentation.score >= 0.5 |
el.score >= 0.5 |
(Affinity <= 500) | (Presentation.rank <= 2) |
affinity <= 500 | el.rank <= 2 |
Affinity["netmhcpan"] <= 500 |
netmhcpan_ba <= 500 |
0.5 * Affinity.score + 0.5 * Presentation.score |
0.5 * affinity.score + 0.5 * presentation.score |
Affinity.logistic(350, 150) |
affinity.logistic(350, 150) |
mean(Affinity.score, Presentation.score) |
mean(affinity.score, presentation.score) |
Column("gene_tpm") >= 5 |
gene_tpm >= 5 |
Column("gene_tpm").log() |
gene_tpm.log() |
Column references
Any DataFrame column can be used directly in expressions — expression data, peptide properties, custom annotations. Unknown identifiers are automatically treated as column references:
from topiary import Affinity, Column
# As a filter — bare name or Column() both work
filter_by = Column("gene_tpm") >= 5.0
# As a sorting signal
sort_by = 0.5 * Affinity.logistic(350, 150) - 0.1 * Column("cysteine_count")
# Transforms work on columns too
sort_by = Column("gene_tpm").log1p()
In CLI strings, bare names work directly:
--filter-by "ba <= 500 & gene_tpm >= 5"
--sort-by "affinity.logistic(350, 150) + 0.1 * gene_tpm.log1p()"
The explicit column(name) syntax also works: gene_tpm >= 5.
Misspelled column names get a helpful error at evaluation time: Column 'hydrophobicty' not found. Did you mean: ['hydrophobicity']?
Scope prefixes
The wt. prefix reads predictions for the wildtype peptide at the same position (variant workflows only). Use it for differential binding:
from topiary import Affinity, wt
sort_by = Affinity.score - wt.Affinity.score # mutant advantage
shuffled. and self. prefixes work the same way for shuffled-decoy and self-proteome contexts.
Peptide-level expressions
Len() reads the peptide length. Count("C") counts amino acid occurrences:
from topiary import Len, Count, wt
sort_by = Count("C") - wt.Count("C") # gained/lost cysteines vs wildtype
Expression data
Load gene-level, transcript-level, or variant-level quantification to annotate predictions. Expression values become DataFrame columns that you can reference in ranking and filtering.
How it works
Each --*-expression flag takes a spec string: [name:]file[:id_col[:val_col]].
- The file is auto-detected (Salmon
.sf, Kallistoabundance.tsv, RSEM.genes.results/.isoforms.results, StringTie.gtf, Cufflinks.fpkm_tracking) or treated as generic TSV/CSV. - The loaded columns are joined onto the prediction DataFrame by
gene_id,transcript_id, orvariantrespectively. - Value columns are prefixed with the name (default:
gene,transcript, orvariant) and lowercased. For example, Salmon'sTPMcolumn loaded with--gene-expressionbecomesgene_tpm.
Python API
import pandas as pd
from topiary import TopiaryPredictor, Affinity, Column
from topiary.rna import load_expression_from_spec
from mhctools import NetMHCpan
from varcode import load_vcf
# Load expression data
gene_name, gene_id_col, gene_df = load_expression_from_spec(
"salmon_quant.sf", default_name="gene"
)
# gene_df has columns: ["Name", "TPM"]
# gene_name = "gene", gene_id_col = "Name"
expression_data = {
"gene": [(gene_name, gene_id_col, gene_df)],
"transcript": [],
"variant": [],
}
predictor = TopiaryPredictor(
models=[NetMHCpan],
alleles=["HLA-A*02:01"],
filter_by=(Affinity <= 500) & (Column("gene_tpm") >= 5.0),
)
variants = load_vcf("somatic.vcf")
df = predictor.predict_from_variants(
variants,
expression_data=expression_data,
)
# df now has a "gene_tpm" column from the Salmon file
Column naming
The name prefix and the original column name are combined as {prefix}_{column}, both lowercased:
| Flag | File | Original column | Result column |
|---|---|---|---|
--gene-expression quant.sf |
Salmon | TPM |
gene_tpm |
--gene-expression quant.sf |
Salmon | NumReads |
gene_numreads |
--transcript-expression abundance.tsv |
Kallisto | tpm |
transcript_tpm |
--variant-expression isovar.tsv |
generic | num_alt_reads |
variant_num_alt_reads |
--gene-expression mygene:quant.sf |
Salmon | TPM |
mygene_tpm |
The join keys are implicit: gene-level data joins on gene_id, transcript-level on transcript_id, variant-level on variant. These columns are present in variant-pipeline output.
Transcript selection
When transcript-level expression data is provided, Topiary uses it to select the highest-expressed transcript per variant (instead of the default priority-based selection). This matches the behavior of the legacy --rna-transcript-fpkm-tracking-file flag.
Auto-detected formats
| Extension / pattern | Format | Default ID column | Default value column |
|---|---|---|---|
.sf |
Salmon | Name |
TPM |
abundance*.tsv (with target_id + tpm header) |
Kallisto | target_id |
tpm |
.genes.results |
RSEM (gene) | gene_id |
TPM |
.isoforms.results |
RSEM (transcript) | transcript_id |
TPM |
.gtf |
StringTie | reference_id |
TPM (or FPKM) |
.fpkm_tracking |
Cufflinks | tracking_id |
FPKM |
| anything else | generic | first column | all numeric columns |
Override defaults with the full spec: name:file:id_col:val_col.
CLI usage
Score peptides
topiary \
--peptide-csv peptides.csv \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--output-csv results.csv
The CSV needs a peptide column (and optionally name). Each peptide is scored as-is.
Scan proteins
topiary \
--fasta proteins.fasta \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01,HLA-B*07:02 \
--ic50-cutoff 500 \
--output-csv results.csv
Filter and rank
# Filter: keep if affinity OR presentation passes
topiary \
--fasta proteins.fasta \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--filter-by "ba <= 500 | el.rank <= 2" \
--output-csv results.csv
# Rank: composite score from two models
topiary \
--fasta antigens.fasta \
--mhc-predictor netmhcpan mhcflurry \
--mhc-alleles HLA-A*02:01 \
--filter-by "ba <= 500" \
--sort-by "mean(affinity['netmhcpan'].logistic(350, 150), affinity['mhcflurry'].logistic(350, 150))" \
--output-csv results.csv
Neoantigen discovery from somatic variants
topiary \
--vcf somatic.vcf \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01,HLA-B*07:02 \
--filter-by "ba <= 500 | el.rank <= 2" \
--sort-by "0.6 * affinity.descending_cdf(500, 200) + 0.4 * presentation.score" \
--only-novel-epitopes \
--output-csv epitopes.csv
Add expression data
# Gene-level expression from Salmon (auto-detected)
topiary \
--vcf somatic.vcf \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--gene-expression salmon_quant.sf \
--filter-by "ba <= 500 & gene_tpm >= 5" \
--only-novel-epitopes \
--output-csv results.csv
# Transcript-level from Kallisto
topiary \
--vcf somatic.vcf \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--transcript-expression kallisto/abundance.tsv \
--filter-by "ba <= 500" \
--sort-by "affinity.descending_cdf(500, 200) + 0.1 * transcript_tpm.log1p()" \
--output-csv results.csv
# Variant-level read support from isovar
topiary \
--vcf somatic.vcf \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--variant-expression isovar_output.tsv \
--filter-by "ba <= 500 & variant_num_alt_reads >= 3" \
--output-csv results.csv
# Combine all three levels
topiary \
--vcf somatic.vcf \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--gene-expression salmon_quant.sf \
--transcript-expression kallisto/abundance.tsv \
--variant-expression isovar_output.tsv \
--filter-by "ba <= 500 & gene_tpm >= 5 & variant_num_alt_reads >= 3" \
--output-csv results.csv
# Override auto-detection with explicit spec
topiary \
--vcf somatic.vcf \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--gene-expression mygene:custom_quant.tsv:ensembl_id:tpm_value \
--output-csv results.csv
Expression flags are only valid with the variant pipeline (--vcf, --maf, --variant). They are rejected with direct sequence inputs (--fasta, --peptide-csv, etc.) because those outputs lack the gene_id / transcript_id / variant columns needed for joining.
Gene and transcript lookups
Pull protein sequences from Ensembl:
topiary --gene-names BRAF TP53 EGFR --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --gene-ids ENSG00000157764 --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --transcript-ids ENST00000288602 --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --ensembl-proteome --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --cta --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01 # cancer-testis antigens
Restrict to protein regions
topiary \
--fasta proteins.fasta \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--regions spike:319-541 nucleocapsid:0-50 \
--output-csv results.csv
Format: NAME:START-END (0-indexed, half-open).
Exclusion filtering
Remove peptides found in reference proteomes — for tumor-specific or pathogen-specific peptide selection:
--exclude-ensembl # human Ensembl proteome
--exclude-non-cta # non-CTA proteins (requires pirlygenes)
--exclude-tissues heart_muscle lung # genes expressed in these tissues
--exclude-fasta reference.fasta # custom reference sequences
--exclude-mode substring # "substring" (default) or "exact"
MHC prediction models
Specify one or more predictors with --mhc-predictor and alleles with --mhc-alleles:
--mhc-predictor netmhcpan mhcflurry --mhc-alleles HLA-A*02:01,HLA-B*07:02
All predictors come from mhctools.
With mhctools 3.7.0+, upstream predictor parsing supports multiple
predictors in one CLI invocation, so commands like
--mhc-predictor netmhcpan42 bigmhc-el are supported directly. Topiary keeps
its higher-level --filter-by / --sort-by DSL on top of that lower-level
predictor interface. Topiary's ranking/filtering DSL is also compatible with
the simplified mhctools 3.7.0+ kind constants API. NetChop and Pepsickle
behavior follows the upstream changes as well: improved NetChop error handling
and Pepsickle's epitope-focused model selection.
| CLI name | Predicts |
|---|---|
netmhcpan |
affinity + presentation (auto-detects installed version) |
netmhcpan4 / netmhcpan41 / netmhcpan42 |
specific NetMHCpan 4.x versions |
netmhcpan4-ba / netmhcpan4-el |
single-mode (binding affinity or eluted ligand) |
mhcflurry |
affinity + presentation + processing |
mixmhcpred |
presentation |
netmhciipan / netmhciipan4 / netmhciipan43 |
MHC class II |
bigmhc / bigmhc-el / bigmhc-im |
presentation / immunogenicity |
netmhcstabpan |
pMHC stability |
pepsickle / netchop |
proteasomal cleavage |
netmhcpan-iedb / netmhccons-iedb / smm-iedb |
IEDB web API (no local install) |
random |
random predictions (for testing) |
Peptide lengths: --mhc-epitope-lengths 8,9,10,11 (defaults come from the predictor).
Output
--output-csv results.csv
--output-html results.html
--output-csv-sep "\t"
--subset-output-columns peptide allele affinity
--rename-output-column value ic50
All predictions: source_sequence_name, peptide, peptide_offset, peptide_length, allele, kind, score, value, affinity, percentile_rank, prediction_method_name
Variant predictions add: variant, gene, gene_id, transcript_id, transcript_name, effect, effect_type, contains_mutant_residues, mutation_start_in_peptide, mutation_end_in_peptide
Expression data adds: columns named {prefix}_{column} as described in Column naming, e.g. gene_tpm, transcript_tpm, variant_num_alt_reads.
Peptide properties
Compute amino acid properties and use them in ranking:
from topiary import TopiaryPredictor, Affinity, Column
from topiary.properties import add_peptide_properties
df = predictor.predict_from_named_sequences(seqs)
df = add_peptide_properties(df, groups=["manufacturability"])
# Properties become ranking signals
score = Affinity.logistic(350, 150) - 0.1 * Column("cysteine_count")
# CLI: --sort-by "affinity.logistic(350, 150) - 0.1 * cysteine_count"
Named groups:
"core"— charge, hydrophobicity, aromaticity, molecular_weight"manufacturability"— core + cysteine_count, instability_index, max_7mer_hydrophobicity, difficult_nterm/cterm, asp_pro_bonds"immunogenicity"— core + tcr_charge, tcr_aromaticity, tcr_hydrophobicity (TCR-facing positions for MHC-I)
Protein sources (Python API)
from topiary.sources import (
ensembl_proteome,
sequences_from_gene_names,
cta_sequences,
non_cta_sequences,
tissue_expressed_sequences,
available_tissues,
)
# All return dict[name -> amino_acid_sequence]
seqs = sequences_from_gene_names(["BRAF", "TP53", "EGFR"])
cta = cta_sequences() # cancer-testis antigens
heart = tissue_expressed_sequences(["heart_muscle"])
print(available_tissues()) # list tissue names
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file topiary-4.11.0.tar.gz.
File metadata
- Download URL: topiary-4.11.0.tar.gz
- Upload date:
- Size: 116.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
48d0569764f5b14e1f38902ba2476096accc87290fd7dff5fef73f7479e457c1
|
|
| MD5 |
59a35c1fdb359f4a6699117c83a637d2
|
|
| BLAKE2b-256 |
3132c14290a982db897b2de6c25278fbf06395098a22554a90c80e3a53da2081
|
File details
Details for the file topiary-4.11.0-py3-none-any.whl.
File metadata
- Download URL: topiary-4.11.0-py3-none-any.whl
- Upload date:
- Size: 133.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f57a6c0bbd3132a744a7a8f4f061e0d923560ff0573d1d762f4d44d9c26afb06
|
|
| MD5 |
1e32ba8ca3e495a0a1dc9b8b06b0b7c9
|
|
| BLAKE2b-256 |
3c6a0406ba2899d0d7ddb7cf5ad1d87d2b192befde2cb4f2e075f5016257183f
|