Curated cancer reference data: ontology, TMB, incidence/mortality, and expression

oncoref

Curated cancer reference data — cancer-type ontology, tumor mutational burden (TMB), incidence/mortality, checkpoint-inhibitor (ICI) response, per-cohort RNA-seq expression, and cancer-testis antigens — behind one small Python API, a data fetch/cache CLI, and a set of reference plots.

oncoref is the base layer

oncoref is designed as the base layer of the openvax/PIRL stack — the intended single upstream source of truth for cancer reference data, meant to become a shared dependency of pirlygenes, trufflepig, and tsarina. Adoption is still in progress — most of these don't depend on it yet. Architecturally it stays at the bottom: it depends only on pandas / numpy / pyarrow / PyYAML, it never imports its consumers (data and logic flow only downward), and it owns these definitions rather than mirroring them from elsewhere.

Anything that needs to know about

  • gene expression of cancer samples — per-cohort RNA-seq in a normalized, comparable space: summary stats, tail-weighted percentiles, and medoid/exemplar samples per cancer type/subtype;
  • HPA protein / RNA normal-tissue expression;
  • the definition of cancer-testis antigens — the HPA tissue-restriction call over the candidate list (HPA-only; no MS/peptide layer);
  • the ontology of cancer types — codes, the parent/child hierarchy, subtypes, families, characteristic driver fusions, and the cross-cutting MSI/POLE/HPV groupings; and
  • checkpoint-inhibitor response rates and TMB per cancer type

depends on oncoref — including pirlygenes (gene-set curation/analysis), tsarina (personalized target selection), hitlist (panel selection), trufflepig (sample classification), and anything else downstream.

Everything keys on the cancer-type registry. The small curated tables ship in the wheel; the heavy per-cohort expression bundle downloads on first use from oncoref's own GitHub Release.
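
The download-on-first-use behaviour can be pictured with a minimal sketch. This is not oncoref's implementation: `ensure_bundle`, the `fetch` callback, the `"v1"` version string, and the `bundle.parquet` file name are all hypothetical names chosen for illustration, and the network download is replaced by a stand-in so the snippet runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def ensure_bundle(cache_root: Path, version: str, fetch: Callable[[Path], None]) -> Path:
    """Return the cached bundle path, calling ``fetch`` only if the file is missing.

    Hypothetical helper: oncoref's real cache layout and download code may differ.
    """
    target = cache_root / version / "bundle.parquet"
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(target)  # a real package would download from a release URL here
    return target


# Stand-in for the network download; records each call it receives.
calls: List[Path] = []


def fake_fetch(dest: Path) -> None:
    calls.append(dest)
    dest.write_bytes(b"placeholder")


cache_root = Path(tempfile.mkdtemp())
first = ensure_bundle(cache_root, "v1", fake_fetch)   # file missing: fetch runs
second = ensure_bundle(cache_root, "v1", fake_fetch)  # file present: served from cache
print(first == second, len(calls))
```

Keying the cache directory on a version string is one simple way to let stale versions be pruned later without touching the current bundle.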

Install

pip install oncoref

Python API

The flat oncoref namespace remains available for compatibility and quick interactive use. For new code, prefer the semantic submodules documented in docs/api.md; they make it clearer whether you are working with the cancer ontology, cohorts, ICI response, CTA coverage, generic antigen-panel coverage, or CTA-specific peptides.

import oncoref as od

od.resolve_cancer_type("prostate")        # -> "PRAD"
od.cancer_type_info("SARC_RMS_ARMS")      # full registry record + burden + tmb
od.cancer_tmb("LUAD_EGFR")                # 6.9  (inherited from LUAD)
od.cancer_burden("pancreas", metric="us_mortality_pct")
od.burden_category("SARC_OS")             # -> "bone_and_joint" (incidence/mortality bucket)
od.cancer_ici_response("SKCM")            # 42  (anti-PD-1 ORR %; fallback aPD-1 → aPD-L1 → combo)
od.cancer_ici_response("SKCM", regimen="PD-1+CTLA-4")   # 57.6  (pin a regimen)

# Cancer-testis antigens (HPA-derived tissue-restriction):
od.cta_gene_names()                       # expressed CTA symbols (MAGEA4, CT83, …)
od.cta_evidence()                         # full HPA restriction table

# Per-cohort expression percentiles (downloads the data bundle on first use):
od.cohort_gene_percentiles("PRAD")        # per-gene p0…p100 vector (within-cohort)
od.within_sample_top_fraction("PRAD")     # per-gene fraction of samples where the gene is in the top 5% (within-sample)
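
To make the two expression views concrete, here is a small standard-library sketch of what a within-cohort percentile vector and a within-sample top fraction could look like. It rests on assumptions: the gene names and values are made up, percentiles use plain linear interpolation, and "top 5%" is read as a gene's rank among all genes in one sample. The definitions oncoref actually uses may differ.

```python
from typing import Dict, List


def percentile_vector(values: List[float], n_points: int = 101) -> List[float]:
    """p0..p100 of ``values`` by linear interpolation between order statistics."""
    ordered = sorted(values)
    last = len(ordered) - 1
    out = []
    for i in range(n_points):
        pos = last * i / (n_points - 1)  # fractional index into the sorted values
        lo = int(pos)
        hi = min(lo + 1, last)
        out.append(ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo))
    return out


def top_fraction(expr: Dict[str, List[float]], top: float = 0.05) -> Dict[str, float]:
    """Fraction of samples in which each gene ranks in the top ``top`` share of genes."""
    genes = list(expr)
    n_samples = len(next(iter(expr.values())))
    n_top = max(1, int(len(genes) * top))  # keep at least one gene per sample
    hits = {g: 0 for g in genes}
    for s in range(n_samples):
        ranked = sorted(genes, key=lambda g: expr[g][s], reverse=True)
        for g in ranked[:n_top]:
            hits[g] += 1
    return {g: hits[g] / n_samples for g in genes}


# Toy genes-by-samples table (invented values, three samples per gene).
toy = {
    "GENE_A": [10.0, 50.0, 30.0],
    "GENE_B": [5.0, 60.0, 1.0],
    "GENE_C": [0.0, 2.0, 4.0],
}
pcts = percentile_vector(toy["GENE_A"])
fractions = top_fraction(toy)
print(pcts[0], pcts[50], pcts[100])
print(fractions)
```

The first view compares a gene against itself across the cohort; the second compares it against the other genes inside each sample, so it is unaffected by per-sample scaling.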

Domains

  • Cancer ontology (oncoref.cancer_ontology): cancer_type_registry, resolve_cancer_type, cancer_type_info, tree/family/lineage helpers, viral_status, fusion_status.
  • Cohorts (oncoref.cohorts): cohort_registry, cohort_aggregates, cohort_source_version, and mixture-cohort helpers.
  • TMB: cancer_tmb, cancer_tmb_df (parent-chain inheritance).
  • Incidence / mortality: cancer_burden, burden_category (ACS / GLOBOCAN).
  • Checkpoint response (oncoref.ici_response): regimen-aware ORR anchors, anti-PD-1 shortcuts, endpoint estimates, and pooled response summaries.
  • Expression: cohort_gene_percentiles, within_sample_top_fraction, representative_cohort_samples over the lazy-downloaded per-cohort bundle.
  • Clean TPM / normalization: oncoref.normalization for the 16/9/75 compartment transform and oncoref.gene_families for clean-TPM censored compartment IDs plus the biological housekeeping denominator panel.
  • Cancer-testis antigens (oncoref.cta): cta_gene_names/cta_gene_ids, cta_evidence, synthesize_restriction (HPA-only tissue-restriction; MS evidence stays in the target-selection layer).
  • CTA coverage / peptides: oncoref.cta_coverage for patient coverage and oncoref.cta_peptides for CTA-specific 9-mer count maps and load.
  • Generic antigen-panel coverage: oncoref.antigen_coverage for explicit non-CTA panels.
  • HPA normal tissue: hpa_rna_consensus, hpa_normal_tissue (IHC), hpa_single_cell, and per-gene lookups (gene_tissue_ntpm, gene_protein_tissues, gene_cell_type_ntpm) over HPA v23, fetched on demand (oncoref hpa fetch).
  • Genome reference: canonical_gene_id_and_name, find_gene_id_by_name, find_gene_name_from_ensembl_{gene,transcript}_id, aggregate_gene_expression (pyensembl-backed symbol ↔ Ensembl-ID resolution). pyensembl ships with the package, but resolution needs a downloaded human release once: pyensembl install --release 111 --species homo_sapiens (the accessors return None until then).
  • Plots (pip install oncoref[plots]) — oncoref.plots.apd1_vs_tmb, apd1_orr_bars, incidence_vs_mortality, the CTA/coverage figures, and oncoref.cta_curation_plots.render.
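
Two lookup behaviours mentioned in this README, parent-chain TMB inheritance and the regimen fallback for ICI response, can be sketched with toy tables. This is an illustration under stated assumptions, not oncoref's code: the dictionaries, the function names `inherited_tmb` and `ici_response`, and the regimen labels "PD-1" and "PD-L1" are hypothetical. Only the figures 6.9, 42, and 57.6 and the "PD-1+CTLA-4" label come from the examples in this README.

```python
from typing import Dict, Optional, Sequence

# Toy registry: child code -> parent code (None marks a root).
PARENT: Dict[str, Optional[str]] = {"LUAD": None, "LUAD_EGFR": "LUAD"}
# Only the parent carries its own TMB value in this toy table.
TMB: Dict[str, float] = {"LUAD": 6.9}


def inherited_tmb(code: str) -> Optional[float]:
    """Walk up the parent chain until a code with its own TMB value is found."""
    current: Optional[str] = code
    while current is not None:
        if current in TMB:
            return TMB[current]
        current = PARENT.get(current)
    return None


# Toy objective-response-rate table in percent; regimen labels are illustrative.
ORR: Dict[str, Dict[str, float]] = {"SKCM": {"PD-1": 42.0, "PD-1+CTLA-4": 57.6}}
FALLBACK_ORDER: Sequence[str] = ("PD-1", "PD-L1", "PD-1+CTLA-4")


def ici_response(code: str, regimen: Optional[str] = None) -> Optional[float]:
    """Return the pinned regimen's ORR, or the first one available in fallback order."""
    by_regimen = ORR.get(code, {})
    if regimen is not None:
        return by_regimen.get(regimen)
    for name in FALLBACK_ORDER:
        if name in by_regimen:
            return by_regimen[name]
    return None


print(inherited_tmb("LUAD_EGFR"))
print(ici_response("SKCM"))
print(ici_response("SKCM", "PD-1+CTLA-4"))
```

In both cases a missing entry resolves to the nearest available value (the parent's TMB, or the next regimen in the order) rather than raising, which is why a subtype such as LUAD_EGFR can report its parent's figure.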

CLI

oncoref cancer-type prostate     # registry info as JSON
oncoref tmb LUAD_EGFR            # 6.9
oncoref ici SKCM                # 42  (--regimen to pin, --all-regimens to compare)
oncoref burden pancreas --metric us_mortality_pct
oncoref cta --count             # number of expressed CTAs
oncoref plot apd1-vs-tmb --out apd1_vs_tmb.png
oncoref plot patient-coverage --gene-set cta --out coverage_out
oncoref plot cta-curation --out cta_curation_out

# expression-bundle cache (per-cohort expression):
oncoref cache fetch             # download the ~340 MB bundle
oncoref cache status            # which bundle paths are cached (no download)
oncoref cache dir               # where the data bundle is cached
oncoref cache prune --yes       # delete stale version caches
oncoref hpa fetch               # download HPA reference data (RNA / IHC / single-cell)
oncoref version
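
Because `oncoref cancer-type` is described as printing registry info as JSON, the CLI can in principle be scripted from Python. The sketch below assumes the JSON is written to stdout; the wrapper name `cancer_type_record`, the injectable `run` parameter, and the `{"code": "PRAD"}` payload are hypothetical, and a fake runner stands in for the real command so the snippet runs without oncoref installed.

```python
import json
import subprocess
from typing import Any, Callable, Dict


def cancer_type_record(name: str, run: Callable[..., Any] = subprocess.run) -> Dict[str, Any]:
    """Run ``oncoref cancer-type <name>`` and parse stdout as JSON (assumed format)."""
    result = run(
        ["oncoref", "cancer-type", name],
        capture_output=True,
        text=True,
        check=True,  # raise if the command exits with a non-zero status
    )
    return json.loads(result.stdout)


class _FakeResult:
    # Made-up payload; the real record's fields are not shown in this README.
    stdout = '{"code": "PRAD"}'


def fake_run(cmd, **kwargs):
    """Stand-in for subprocess.run that returns a canned result."""
    return _FakeResult()


record = cancer_type_record("prostate", run=fake_run)
print(record["code"])
```

With oncoref installed, calling `cancer_type_record("prostate")` without the `run` argument would invoke the real CLI instead of the stand-in.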

Development

./develop.sh   # editable install with dev extras
./format.sh    # ruff format
./lint.sh      # ruff check + format --check
./test.sh      # lint + pytest with coverage

License

Apache 2.0.


