Gene lists for cancer immunotherapy expression analysis

pirlygenes

Gene lists related to cancer immunotherapy

TCR-T

Clinical trials

Last updated: September 17th, 2024

Sources:

CAR-T

Approved therapies

Last updated: September 17th, 2024

Sources:

Multi-specific antibodies and T-cell engagers

Clinical trials

Last updated: September 11th, 2024

Sources:

Antibody-drug conjugates (ADCs)

Approved

Last updated: September 19th, 2024

Sources:

Clinical trials

Last updated: September 11th, 2024

Sources:

Radioligand therapies (RLTs)

Current target list

Last updated: February 11th, 2026

Sources:

Methodology:

  • pirlygenes/data/radioligand-targets.csv is a curated target-level list (gene targets, Ensembl IDs, and target status buckets) intended to power gene-set visualization while trial-level v1.4.0 curation is in progress.

CLI plotting notes:

  • Treatment plots now include a Radio category label (capitalized consistently with other treatment labels).
  • Use --label-genes to force annotation of genes that should always be text-labeled, for example: --label-genes FAP,CD276.
  • PNG output defaults are larger/higher resolution (--plot-height 12.0, --plot-aspect 1.4, --output-dpi 300), and can be overridden from CLI.

Cancer-testis antigens (CTAs)

Last updated: March 23rd, 2026

Quick start

from pirlygenes.gene_sets_cancer import (
    CTA_gene_names,              # recommended: filtered, reproductive-restricted CTAs
    CTA_gene_ids,                # same, as Ensembl gene IDs
    CTA_unfiltered_gene_names,   # full superset from all source databases
    CTA_unfiltered_gene_ids,     # same, as Ensembl gene IDs
    CTA_evidence,                # full DataFrame with all evidence columns
)

# Default: expressed, reproductive-restricted CTAs (~257 genes)
cta_genes = CTA_gene_names()

# Full unfiltered superset from all sources (~358 genes)
all_ctas = CTA_unfiltered_gene_names()

# Partition ALL protein-coding genes into CTA / never-expressed / non-CTA
from pirlygenes.gene_sets_cancer import CTA_partition_gene_ids
p = CTA_partition_gene_ids()   # p.cta, p.cta_never_expressed, p.non_cta

# Evidence table with per-gene HPA tissue restriction data
df = CTA_evidence()

Pipeline overview

The CTA gene set is built as an unbiased union of genes from multiple CT antigen databases and literature sources, then systematically filtered using Human Protein Atlas tissue expression data.

Step 1: Collect — union of protein-coding CT genes from multiple source databases (358 genes):

| Source | Genes | Reference |
| --- | --- | --- |
| CTpedia | 167 | Almeida et al. 2009, NAR |
| CTexploreR/CTdata | 62 new | Loriot et al. 2025, PLOS Genetics |
| Protein-level CT genes (136 total, 46 overlap) | 89 new | da Silva et al. 2017, Oncotarget |
| EWSR1-FLI1 CT gene binding sites | 12 | Gallegos et al. 2019, Mol Cell Biol |
| Meiosis, piRNA, spermatogenesis genes | 28 | Multiple sources (see docs) |

Each gene is tracked with a source_databases column indicating which databases include it (CTpedia, CTexploreR_CT, CTexploreR_CTP, daSilva2017, daSilva2017_protein). Only protein-coding genes (Ensembl biotype) are included. Genes with outdated HGNC symbols are renamed to current symbols with old names kept as aliases.
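As a minimal sketch of the collect step, the union with per-gene provenance can be built as below. This is not the package's own code: the helper `union_with_sources` is hypothetical, and the gene memberships in `toy_sources` are illustrative rather than real database content.

```python
from collections import defaultdict

# Toy inputs: which gene sits in which source is illustrative only.
toy_sources = {
    "CTpedia": ["MAGEA4", "CTAG1B", "PRAME"],
    "CTexploreR_CT": ["MAGEA4", "DAZL"],
    "daSilva2017": ["PRAME", "DAZL"],
}


def union_with_sources(source_lists):
    """Merge per-source gene lists into {gene: "SourceA;SourceB"}."""
    sources_by_gene = defaultdict(list)
    for source_name, genes in source_lists.items():
        for gene in genes:
            # Record each source once per gene, in the order sources are given.
            if source_name not in sources_by_gene[gene]:
                sources_by_gene[gene].append(source_name)
    return {gene: ";".join(names) for gene, names in sorted(sources_by_gene.items())}


merged = union_with_sources(toy_sources)
for gene, sources in merged.items():
    print(gene, sources)
```

Each gene appears once in the result, and the joined string plays the role of the source_databases column.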

Step 2: Annotate — each gene is scored against Human Protein Atlas v23 tissue expression:

  • RNA: HPA RNA tissue consensus (rna_tissue_consensus.tsv) — normalized transcripts per million (nTPM) across 50 normal tissues
  • Protein: HPA normal tissue IHC (normal_tissue.tsv) — immunohistochemistry detection levels (Not detected / Low / Medium / High) across 63 tissues with antibody reliability scores (Enhanced / Supported / Approved / Uncertain)
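The two HPA inputs above are long-format TSV files, so annotation starts by regrouping rows per gene. The sketch below assumes the column names `Gene name`, `Tissue`, `nTPM`, and `Level`; check them against the downloaded files. The rows are toy data, and both loader functions are hypothetical helpers, not part of pirlygenes.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for rna_tissue_consensus.tsv and normal_tissue.tsv.
RNA_TSV = (
    "Gene\tGene name\tTissue\tnTPM\n"
    "ENSG_TOY_1\tGENE_A\ttestis\t35.2\n"
    "ENSG_TOY_1\tGENE_A\tliver\t0.3\n"
)
IHC_TSV = (
    "Gene\tGene name\tTissue\tCell type\tLevel\tReliability\n"
    "ENSG_TOY_1\tGENE_A\ttestis\tcells in seminiferous ducts\tHigh\tEnhanced\n"
    "ENSG_TOY_1\tGENE_A\tliver\thepatocytes\tNot detected\tEnhanced\n"
)


def load_rna(handle):
    """Return {gene: {tissue: nTPM}} from long-format RNA rows."""
    ntpm = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        ntpm[row["Gene name"]][row["Tissue"]] = float(row["nTPM"])
    return dict(ntpm)


def load_protein_detected(handle):
    """Return {gene: set of tissues with Low/Medium/High IHC detection}."""
    detected = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["Level"] in {"Low", "Medium", "High"}:
            detected[row["Gene name"]].add(row["Tissue"])
    return dict(detected)


rna = load_rna(io.StringIO(RNA_TSV))
protein = load_protein_detected(io.StringIO(IHC_TSV))
print(rna["GENE_A"], protein["GENE_A"])
```

With real files, replace the `io.StringIO` objects with file handles opened on the downloaded TSVs.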

Step 3: Filter — protein-coding + tiered thresholds based on protein antibody confidence (278 of 358 pass):

| Protein evidence | Deflated RNA threshold |
| --- | --- |
| Enhanced (orthogonal validation) | ≥ 80% |
| Supported (consistent characterization) | ≥ 90% |
| Approved (basic validation) | ≥ 95% |
| Uncertain or no protein data | ≥ 99% |

Genes with protein detected in non-reproductive tissues always fail. Thymus is excluded from all restriction checks (AIRE-driven mTEC expression is expected for CTAs).
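The filter rule described above can be sketched as a single function. This is an illustration under stated assumptions, not the package's implementation: `passes_filter` is a hypothetical name, tissue names are assumed to be lowercase strings, and the deflated fraction is assumed to be a number between 0 and 1.

```python
REPRODUCTIVE = {"testis", "ovary", "placenta"}

# Minimum deflated reproductive RNA fraction per antibody reliability tier.
THRESHOLDS = {"Enhanced": 0.80, "Supported": 0.90, "Approved": 0.95}
DEFAULT_THRESHOLD = 0.99  # Uncertain or no protein data


def passes_filter(reliability, deflated_frac, protein_tissues, biotype="protein_coding"):
    """Apply the protein-coding check, the protein veto, then the tiered RNA threshold."""
    if biotype != "protein_coding":
        return False
    # Thymus is ignored; any other non-reproductive protein detection fails the gene.
    somatic = set(protein_tissues) - REPRODUCTIVE - {"thymus"}
    if somatic:
        return False
    return deflated_frac >= THRESHOLDS.get(reliability, DEFAULT_THRESHOLD)


print(passes_filter("Enhanced", 0.85, {"testis"}))           # strong antibody, lenient cutoff
print(passes_filter("Uncertain", 0.85, set()))               # weak evidence, strict cutoff
print(passes_filter("Enhanced", 1.0, {"testis", "liver"}))   # somatic protein detection
```

The better the antibody evidence for reproductive restriction, the more RNA signal outside reproductive tissues is tolerated.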

Gene set counts

| Function | Description | Count |
| --- | --- | --- |
| CTA_gene_names() | Recommended default. Expressed, reproductive-restricted CTAs | ~257 |
| CTA_never_expressed_gene_names() | CTAs from databases but no HPA expression (max nTPM < 2, no protein) | ~21 |
| CTA_filtered_gene_names() | All filter-passing CTAs (= expressed + never_expressed) | ~278 |
| CTA_excluded_gene_names() | CTAs that fail filter (somatic expression) | ~80 |
| CTA_unfiltered_gene_names() | Full superset from all source databases | 358 |
| CTA_evidence() | Full DataFrame with all evidence columns | 358 rows |
| CTA_partition_gene_ids() | Partition all protein-coding genes (dataclass with .cta, .cta_never_expressed, .non_cta sets) | ~20k |
| CTA_partition_gene_names() | Same, as gene symbols | ~20k |
| CTA_partition_dataframes() | Same, as DataFrames with evidence columns | ~20k |
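The counts above are related by simple set arithmetic. The sketch below uses toy gene names and assumes the expressed, never-expressed, and excluded sets are disjoint, which the descriptions imply but do not state outright.

```python
# Toy gene sets standing in for the three disjoint CTA categories.
expressed = {"GENE_A", "GENE_B", "GENE_C"}
never_expressed = {"GENE_D"}
excluded = {"GENE_E", "GENE_F"}

filtered = expressed | never_expressed   # all filter-passing CTAs
unfiltered = filtered | excluded         # full superset

# The same arithmetic applied to the approximate counts reported above.
reported_filtered = 257 + 21
reported_unfiltered = reported_filtered + 80

print(len(filtered), len(unfiltered))
print(reported_filtered, reported_unfiltered)
```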

Evidence columns

Each gene in cancer-testis-antigens.csv carries identity and HPA-derived evidence:

| Column | Description |
| --- | --- |
| Ensembl_Gene_ID | Ensembl gene ID (validated against release 112) |
| source_databases | Semicolon-separated list of source databases (CTpedia, CTexploreR_CT, CTexploreR_CTP, daSilva2017) |
| biotype | Ensembl gene biotype (must be protein_coding to pass filter) |
| Canonical_Transcript_ID | Longest protein-coding transcript (Ensembl 112) |
| protein_reproductive | IHC detected only in {testis, ovary, placenta} (excl. thymus), or "no data" |
| protein_thymus | IHC detected in thymus |
| protein_reliability | Best HPA antibody reliability: Enhanced / Supported / Approved / Uncertain / "no data" |
| protein_strict_expression | Semicolon-separated tissues with IHC detection (excl. thymus) |
| rna_reproductive | All tissues with ≥1 nTPM (excl. thymus) are in {testis, ovary, placenta} |
| rna_thymus | Thymus nTPM ≥ 1 |
| rna_reproductive_frac | Fraction of total nTPM (excl. thymus) in core reproductive tissues |
| rna_deflated_reproductive_frac | (1 + Σ_repro max(0, nTPM−1)) / (1 + Σ_all max(0, nTPM−1)) |
| rna_deflated_reproductive_and_thymus_frac | Same but thymus added to reproductive numerator |
| rna_80/90/95/99_pct_filter | Whether deflated reproductive fraction ≥ threshold |
| filtered | Final inclusion flag (see tiered thresholds above) |

For full details on the curation process, evidence columns, and filter logic, see docs/cta-curation.md.
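To make the RNA flag definitions concrete, here is a hedged sketch that derives them for one gene from a tissue-to-nTPM mapping. The function `rna_flags` is hypothetical, and the expanded names `rna_80_pct_filter` through `rna_99_pct_filter` are an assumed reading of the shorthand `rna_80/90/95/99_pct_filter`.

```python
REPRODUCTIVE = {"testis", "ovary", "placenta"}


def rna_flags(ntpm_by_tissue, deflated_frac):
    """Derive boolean RNA evidence flags for one gene."""
    # Tissues at or above 1 nTPM, with thymus left out of the restriction check.
    expressed = {
        tissue
        for tissue, value in ntpm_by_tissue.items()
        if value >= 1.0 and tissue != "thymus"
    }
    flags = {
        # Vacuously True when no tissue reaches 1 nTPM.
        "rna_reproductive": expressed <= REPRODUCTIVE,
        "rna_thymus": ntpm_by_tissue.get("thymus", 0.0) >= 1.0,
    }
    for pct in (80, 90, 95, 99):
        flags[f"rna_{pct}_pct_filter"] = deflated_frac >= pct / 100
    return flags


flags = rna_flags({"testis": 40.0, "thymus": 2.5, "liver": 0.4}, deflated_frac=0.97)
for name, value in flags.items():
    print(name, value)
```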

Deflated RNA metric

The deflated metric max(0, nTPM − 1) per tissue suppresses low-level basal transcription noise before computing the reproductive fraction. A +1 pseudocount on numerator and denominator prevents 0/0 for very-low-expression genes. Example: CTCFL/BORIS has raw reproductive fraction 54% (diluted by sub-1 nTPM noise across ~40 tissues) but deflated fraction 100% (only testis has ≥1 nTPM).
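A small worked example shows the effect of deflation. The nTPM values are invented for illustration and are not the measured CTCFL profile. Thymus handling is left out for simplicity, and both function names are hypothetical.

```python
REPRODUCTIVE = {"testis", "ovary", "placenta"}


def raw_fraction(ntpm_by_tissue):
    """Share of total nTPM found in reproductive tissues."""
    total = sum(ntpm_by_tissue.values())
    repro = sum(v for t, v in ntpm_by_tissue.items() if t in REPRODUCTIVE)
    return repro / total if total else 0.0


def deflated_fraction(ntpm_by_tissue):
    """(1 + sum_repro max(0, nTPM - 1)) / (1 + sum_all max(0, nTPM - 1))."""
    numerator = 1.0
    denominator = 1.0
    for tissue, value in ntpm_by_tissue.items():
        excess = max(0.0, value - 1.0)  # expression above the 1 nTPM noise floor
        denominator += excess
        if tissue in REPRODUCTIVE:
            numerator += excess
    return numerator / denominator


# One strongly expressing tissue plus 40 tissues of sub-1 nTPM background.
profile = {"testis": 30.0}
profile.update({f"somatic_{i}": 0.6 for i in range(40)})

raw = raw_fraction(profile)
deflated = deflated_fraction(profile)
print(f"raw={raw:.3f} deflated={deflated:.3f}")
```

The background adds 24 nTPM to the raw denominator, pulling the raw fraction down to 30/54, while every background tissue contributes zero after deflation. A gene with no expression at all evaluates to 1/1 under this formula, which is why the pseudocount matters.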

Class I MHC antigen presentation

Last updated: July 21st, 2018

Sources:

Interferon-gamma response

Last updated: July 21st, 2018

Sources:

  • Interferon Receptor Signaling Pathways Regulating PD-L1 and PD-L2 Expression
    • "JAK1/JAK2-STAT1/STAT2/STAT3-IRF1 axis primarily regulates PD-L1 expression, with IRF1 binding to its promoter"
    • "PD-L2 responded equally to interferon beta and gamma and is regulated through both IRF1 and STAT3, which bind to the PD-L2 promoter"
    • "the suppressor of cytokine signaling protein family (SOCS; mostly SOCS1 and SOCS3) are involved in negative feedback regulation of cytokines that signal mainly through JAK2 binding, thereby modulating the activity of both STAT1 and STAT3"
  • Mutations Associated with Acquired Resistance to PD-1 Blockade in Melanoma
    • "resistance-associated loss-of-function mutations in the genes encoding interferon-receptor–associated Janus kinase 1 (JAK1) or Janus kinase 2 (JAK2), concurrent with deletion of the wild-type allele"
  • SOCS, inflammation, and cancer
    • "Abnormal expression of SOCS1 and SOCS3 in cancer cells has been reported in human carcinoma associated with dysregulation of signals from cytokine receptors"

Recurrently mutated cancer genes

Last updated: July 21st, 2018

Cancer genes and recurrent mutations extracted from Comprehensive Characterization of Cancer Driver Genes and Mutations.

Genes extracted from Table S1 into cancer-driver-genes.csv. Mutations extracted from Table S4 into cancer-driver-variants.csv.

Both datasets were annotated with Ensembl IDs using Ensembl release 92.
