Skip to main content

Curated cancer reference data: ontology, TMB, incidence/mortality, and expression

Project description

oncoref

Tests PyPI

Curated cancer reference data — cancer-type ontology, tumor mutational burden (TMB), incidence/mortality, checkpoint-inhibitor (ICI) response, per-cohort RNA-seq expression, HPA normal-tissue expression, and HPA-derived cancer-testis antigen references — behind one small Python API, a data fetch/cache CLI, and a set of reference plots.

Where oncoref fits

The openvax/PIRL tools are split by ownership boundary, not by file format. Newcomers should think of the stack as three layers:

  • oncoref is the empirical base: what is true, measured, or canonically named about cancer types, cohorts, genes, and reference datasets. It owns gene identity and canonicalization, the cancer-type ontology/registry, expression reference data and normalization, epidemiology, TMB, ICI/aPD1 response, and source-anchored CTA facts. If a row has an n, a confidence interval, a source cohort, or a PMID/DOI anchoring a measurement, it is usually an oncoref fact.
  • pirlygenes owns curated gene sets and panels: which genes are useful for a purpose. That includes lineage/family/compartment/supertype panels, discriminators, surfaceome, TME and stem-cell markers, response-signature panels, target-to-therapy registries, and other opinionated selections keyed to oncoref cancer codes and gene IDs. An empty set can be a valid pirlygenes answer.
  • trufflepig owns per-sample interpretation: QC narration, library-prep/source warnings, deconvolution, scoring, and rule tables that fire against one tumor sample.

That division keeps the dependency direction simple: pirlygenes and trufflepig can depend on oncoref, but oncoref never imports its consumers. When a shared fact is wrong, incomplete, or poorly anchored, fix it in oncoref. When a panel or rule is purpose-specific, keep it in the downstream package and key it to oncoref's ontology and gene identifiers.

oncoref is the base layer

oncoref is designed as the base layer of the openvax/PIRL stack — the intended upstream home for shared cancer reference mechanics and data, meant to become a common dependency of pirlygenes, trufflepig, and tsarina. Adoption is staged: downstream packages can delegate parity-clean primitives while keeping their own curated tables, packaged artifacts, and compatibility APIs until those surfaces are ready to move. Architecturally oncoref stays at the bottom: it depends only on pandas / numpy / pyarrow / PyYAML, it never imports its consumers (data and logic flow only downward), and shared definitions should be fixed or exposed here rather than reimplemented separately downstream.

Use oncoref for shared questions about

  • gene expression of cancer samples — per-cohort RNA-seq in a normalized, comparable space: summary stats, tail-weighted percentiles, and medoid/exemplar samples per cancer type/subtype. Downstream packages may still keep packaged expression artifacts and compatibility wrappers while parity checks converge;
  • HPA protein / RNA normal-tissue expression;
  • the HPA-derived cancer-testis antigen call — the HPA tissue-restriction call over the candidate list (HPA-only; no pirlygenes therapy/MS curation layer);
  • the ontology of cancer types — codes, the parent/child hierarchy, subtypes, families, characteristic driver fusions, and the cross-cutting MSI/POLE/HPV groupings; and
  • checkpoint-inhibitor response rates and TMB per cancer type.

Everything keys on the cancer-type registry. The small curated tables ship in the wheel; the heavy per-cohort expression bundle downloads on first use from oncoref's own GitHub Release.

Install

pip install oncoref

Python API

The flat oncoref namespace remains available for compatibility and quick interactive use. For new code, prefer the semantic submodules in docs/api.md; they make it clearer whether you are working with the cancer ontology, cohorts, ICI response, CTA coverage, generic antigen-panel coverage, or CTA-specific peptides.

import oncoref as od

od.resolve_cancer_type("prostate")        # -> "PRAD"
od.cancer_type_info("SARC_RMS_ARMS")      # full registry record + burden + tmb
od.cancer_tmb("LUAD_EGFR")                # 6.9  (inherited from LUAD)
od.cancer_burden("pancreas", metric="us_mortality_pct")
od.burden_category("SARC_OS")             # -> "bone_and_joint" (incidence/mortality bucket)
od.cancer_ici_response("SKCM")            # 42  (anti-PD-1 ORR %; fallback aPD-1 → aPD-L1 → combo)
od.cancer_ici_response("SKCM", regimen="PD-1+CTLA-4")   # 57.6  (pin a regimen)

# Cancer-testis antigens (HPA-derived tissue-restriction):
od.cta_gene_names()                       # expressed CTA symbols (MAGEA4, CT83, …)
od.cta_evidence()                         # full HPA restriction table

# Per-cohort expression percentiles (downloads the data bundle on first use):
od.cohort_gene_percentiles("PRAD")        # per-gene p0…p100 vector (within-cohort)
od.within_sample_top_fraction("PRAD")     # per-gene frac of samples top-5% (within-sample)

Domains

  • Cancer ontologyoncoref.cancer_ontology: cancer_type_registry, resolve_cancer_type, cancer_type_records, cancer_type_codes, cancer_type_path, cancer_type_reference_data, tree/family/lineage helpers, molecular subtype groups, source-scoped evidence resolution, matched normal tissue helpers, viral_status, fusion_status.
  • Cohortsoncoref.cohorts: cohort_registry, cohort_aggregates, cohort_source_version, and mixture-cohort helpers.
  • TMBcancer_tmb, cancer_tmb_df (parent-chain inheritance).
  • Incidence / mortalitycancer_burden, burden_category (ACS / GLOBOCAN).
  • Checkpoint responseoncoref.ici_response: regimen-aware ORR anchors, anti-PD-1 shortcuts, endpoint estimates, and pooled response summaries.
  • Expressioncohort_gene_percentiles, within_sample_top_fraction, representative_cohort_samples over the lazy-downloaded per-cohort bundle.
  • Clean TPM / normalizationoncoref.normalization for the 16/9/75 compartment transform and oncoref.gene_families for clean-TPM censored compartment IDs plus the biological housekeeping denominator panel.
  • Cancer-testis antigensoncoref.cta: cta_gene_names/cta_gene_ids, cta_evidence, synthesize_restriction (HPA-only tissue-restriction; MS evidence stays in the target-selection layer).
  • CTA coverage / peptidesoncoref.cta_coverage for patient coverage and oncoref.cta_peptides for CTA-specific 9-mer count maps and load.
  • Generic antigen-panel coverageoncoref.antigen_coverage for explicit non-CTA gene lists supplied by the caller. This computes coverage for a panel; it does not make oncoref the owner of downstream panel curation.
  • HPA normal tissuehpa_rna_consensus, hpa_normal_tissue (IHC), hpa_single_cell, and per-gene lookups (gene_tissue_ntpm, gene_protein_tissues, gene_cell_type_ntpm) over HPA v23, fetched on demand (oncoref data fetch hpa).
  • Genome referencecanonical_gene_id_and_name, find_gene_id_by_name, find_gene_name_from_ensembl_{gene,transcript}_id, aggregate_gene_expression (pyensembl-backed symbol ↔ Ensembl-ID resolution). pyensembl ships with the package, but resolution needs a downloaded human release once: pyensembl install --release 111 --species homo_sapiens (the accessors return None until then).
  • Legacy/compat response signaturesoncoref.response_signatures ships a small historical checkpoint-response signature surface used by oncoref plots. Treat it as transitional: new therapy-response signature panels belong in pirlygenes, and this small surface should not be extended in oncoref unless it is recast as source-anchored empirical fact/provenance rows.
  • Plots (pip install oncoref[plots]) — oncoref.plots.apd1_vs_tmb, apd1_orr_bars, incidence_vs_mortality, the CTA/coverage figures, and oncoref.cta_curation_plots.render.

CLI

oncoref cancer-type prostate     # registry info as JSON
oncoref tmb LUAD_EGFR            # 6.9
oncoref ici SKCM                # 42  (--regimen to pin, --all-regimens to compare)
oncoref burden pancreas --metric us_mortality_pct
oncoref cta --count             # number of expressed CTAs
oncoref plot apd1-vs-tmb --out apd1_vs_tmb.png
oncoref plot patient-coverage --gene-set cta --out coverage_out
oncoref plot cta-curation --out cta_curation_out

# managed data downloads/cache:
oncoref data list               # every wheel/bundle/HPA/source dataset
oncoref data status bundle      # expression-bundle cache state (no download)
oncoref data metadata           # package/data/cache/release contract JSON
oncoref data fetch bundle       # download the large expression bundle
oncoref data fetch hpa          # download HPA reference data (RNA / IHC / single-cell)
oncoref data dir bundle         # where the data bundle is cached
oncoref data prune --yes        # delete stale bundle version caches
oncoref version

Development

./develop.sh   # editable install with dev extras
./format.sh    # ruff format
./lint.sh      # ruff check + format --check
./test.sh      # lint + pytest with coverage

License

Apache 2.0.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

oncoref-1.8.66.tar.gz (1.8 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

oncoref-1.8.66-py3-none-any.whl (1.7 MB view details)

Uploaded Python 3

File details

Details for the file oncoref-1.8.66.tar.gz.

File metadata

  • Download URL: oncoref-1.8.66.tar.gz
  • Upload date:
  • Size: 1.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for oncoref-1.8.66.tar.gz
Algorithm Hash digest
SHA256 585e73da70c39e97a912e50bc82726383451f4d2437b1e9343bad59a9b838e33
MD5 b6f585ef9cad1934a3fd1a681c003b0d
BLAKE2b-256 6e0292a7469d50bdfd44c6ff1df2ccb750b828cbac3c1dc9c7ab574d14e29785

See more details on using hashes here.

File details

Details for the file oncoref-1.8.66-py3-none-any.whl.

File metadata

  • Download URL: oncoref-1.8.66-py3-none-any.whl
  • Upload date:
  • Size: 1.7 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for oncoref-1.8.66-py3-none-any.whl
Algorithm Hash digest
SHA256 d9e4696d3d3d0e53b82f960cf4064b55431a1f334e111d589a6d2d1ed03d5608
MD5 50f72162e0351adfa3a86f185d7cdab2
BLAKE2b-256 147abc30118d3352e904950b65c6eab71934b20201b865963fc77f7bccd5557f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page