Curated cancer reference data: ontology, TMB, incidence/mortality, and expression
oncoref
Curated cancer reference data — cancer-type ontology, tumor mutational burden (TMB), incidence/mortality, checkpoint-inhibitor (ICI) response, per-cohort RNA-seq expression, HPA normal-tissue expression, and HPA-derived cancer-testis antigen references — behind one small Python API, a data fetch/cache CLI, and a set of reference plots.
oncoref is the base layer
oncoref is designed as the base layer of the openvax/PIRL stack — the
intended upstream home for shared cancer reference mechanics and data, meant to
become a common dependency of
pirlygenes,
trufflepig, and
tsarina. Adoption is staged: downstream
packages can delegate parity-clean primitives while keeping their own curated
tables, packaged artifacts, and compatibility APIs until those surfaces are
ready to move. Architecturally oncoref stays at the bottom: it depends only on
pandas / numpy / pyarrow / PyYAML, it never imports its consumers (data and
logic flow only downward), and shared definitions should be fixed or exposed
here rather than reimplemented separately downstream.
Use oncoref for shared questions about
- gene expression of cancer samples — per-cohort RNA-seq in a normalized, comparable space: summary stats, tail-weighted percentiles, and medoid/exemplar samples per cancer type/subtype. Downstream packages may still keep packaged expression artifacts and compatibility wrappers while parity checks converge;
- HPA protein / RNA normal-tissue expression;
- the HPA-derived cancer-testis antigen call — the HPA tissue-restriction call over the candidate list (HPA-only; no pirlygenes therapy/MS curation layer);
- the ontology of cancer types — codes, the parent/child hierarchy, subtypes, families, characteristic driver fusions, and the cross-cutting MSI/POLE/HPV groupings; and
- checkpoint-inhibitor response rates and TMB per cancer type.
Everything keys on the cancer-type registry. The small curated tables ship in the wheel; the heavy per-cohort expression bundle downloads on first use from oncoref's own GitHub Release.
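The "downloads on first use" behaviour can be pictured as a small versioned cache. The sketch below is not oncoref's implementation: `ensure_cached`, `fake_fetch`, the directory layout, and the file name are all hypothetical, and the downloader is a stand-in that writes a few bytes instead of contacting a GitHub Release.

```python
from pathlib import Path
import tempfile


def ensure_cached(cache_dir, version, filename, fetch):
    """Return the cached file path, calling fetch(dest) only when it is missing."""
    dest = Path(cache_dir) / version / filename
    if dest.exists():
        return dest  # cache hit: no download
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    fetch(tmp)         # hypothetical downloader writes to a temporary name
    tmp.replace(dest)  # rename last, so a partial download is never served
    return dest


calls = []


def fake_fetch(path):
    """Stand-in for a real download; records the call and writes toy bytes."""
    calls.append(path.name)
    path.write_bytes(b"toy bundle")


with tempfile.TemporaryDirectory() as d:
    first = ensure_cached(d, "v1", "bundle.parquet", fake_fetch)
    second = ensure_cached(d, "v1", "bundle.parquet", fake_fetch)
    print(first == second, len(calls))  # True 1
```

Keying the cache directory on a version string is one way stale caches could later be identified and pruned; whether oncoref lays out its cache this way is an assumption.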
Install
pip install oncoref
Python API
The flat oncoref namespace remains available for compatibility and quick
interactive use. For new code, prefer the semantic submodules documented in
docs/api.md; they make it clearer whether you are working with
the cancer ontology, cohorts, ICI response, CTA coverage, generic antigen-panel
coverage, or CTA-specific peptides.
import oncoref as od
od.resolve_cancer_type("prostate") # -> "PRAD"
od.cancer_type_info("SARC_RMS_ARMS") # full registry record + burden + tmb
od.cancer_tmb("LUAD_EGFR") # 6.9 (inherited from LUAD)
od.cancer_burden("pancreas", metric="us_mortality_pct")
od.burden_category("SARC_OS") # -> "bone_and_joint" (incidence/mortality bucket)
od.cancer_ici_response("SKCM") # 42 (anti-PD-1 ORR %; fallback aPD-1 → aPD-L1 → combo)
od.cancer_ici_response("SKCM", regimen="PD-1+CTLA-4") # 57.6 (pin a regimen)
# Cancer-testis antigens (HPA-derived tissue-restriction):
od.cta_gene_names() # expressed CTA symbols (MAGEA4, CT83, …)
od.cta_evidence() # full HPA restriction table
# Per-cohort expression percentiles (downloads the data bundle on first use):
od.cohort_gene_percentiles("PRAD") # per-gene p0…p100 vector (within-cohort)
od.within_sample_top_fraction("PRAD") # per-gene frac of samples top-5% (within-sample)
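The comments above mention two lookup rules: a subtype inheriting TMB from its parent, and an ICI response that falls back across regimens unless one is pinned. A minimal, self-contained sketch of both ideas follows. It does not use oncoref itself. The codes and the 6.9 / 42 / 57.6 values are copied from the comments above, while the table layout, the function names, and the regimen keys "PD-1" and "PD-L1" are assumptions made for illustration.

```python
# Toy registry: child code -> parent code (None marks a root).
PARENT = {"LUAD": None, "LUAD_EGFR": "LUAD", "SKCM": None}
# Only LUAD carries its own TMB value in this toy table.
TMB = {"LUAD": 6.9}
# Objective response rate (%) per regimen; regimen keys are hypothetical.
ORR = {"SKCM": {"PD-1": 42.0, "PD-1+CTLA-4": 57.6}}
# Assumed fallback order: anti-PD-1, then anti-PD-L1, then the combination.
FALLBACK_ORDER = ("PD-1", "PD-L1", "PD-1+CTLA-4")


def inherited_value(code, table, parent=PARENT):
    """Walk up the parent chain until some ancestor has its own value."""
    seen = set()  # guards against an accidental cycle in the toy registry
    while code is not None and code not in seen:
        if code in table:
            return table[code]
        seen.add(code)
        code = parent.get(code)
    return None


def ici_response(code, regimen=None, orr=ORR):
    """Return the pinned regimen's ORR, or the first one found in fallback order."""
    by_regimen = orr.get(code, {})
    if regimen is not None:
        return by_regimen.get(regimen)
    for name in FALLBACK_ORDER:
        if name in by_regimen:
            return by_regimen[name]
    return None


print(inherited_value("LUAD_EGFR", TMB))            # 6.9
print(ici_response("SKCM"))                         # 42.0
print(ici_response("SKCM", regimen="PD-1+CTLA-4"))  # 57.6
```

Pinning a regimen deliberately skips the fallback in this sketch, so a missing regimen yields None rather than a value from a different treatment.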
Domains
- Cancer ontology (oncoref.cancer_ontology): cancer_type_registry, resolve_cancer_type, cancer_type_records, cancer_type_codes, cancer_type_path, cancer_type_reference_data, tree/family/lineage helpers, molecular subtype groups, source-scoped evidence resolution, matched normal tissue helpers, viral_status, fusion_status.
- Cohorts (oncoref.cohorts): cohort_registry, cohort_aggregates, cohort_source_version, and mixture-cohort helpers.
- TMB: cancer_tmb, cancer_tmb_df (parent-chain inheritance).
- Incidence / mortality: cancer_burden, burden_category (ACS / GLOBOCAN).
- Checkpoint response (oncoref.ici_response): regimen-aware ORR anchors, anti-PD-1 shortcuts, endpoint estimates, and pooled response summaries.
- Expression: cohort_gene_percentiles, within_sample_top_fraction, representative_cohort_samples over the lazy-downloaded per-cohort bundle.
- Clean TPM / normalization: oncoref.normalization for the 16/9/75 compartment transform and oncoref.gene_families for clean-TPM censored compartment IDs plus the biological housekeeping denominator panel.
- Cancer-testis antigens (oncoref.cta): cta_gene_names / cta_gene_ids, cta_evidence, synthesize_restriction (HPA-only tissue-restriction; MS evidence stays in the target-selection layer).
- CTA coverage / peptides: oncoref.cta_coverage for patient coverage and oncoref.cta_peptides for CTA-specific 9-mer count maps and load.
- Generic antigen-panel coverage: oncoref.antigen_coverage for explicit non-CTA panels.
- HPA normal tissue: hpa_rna_consensus, hpa_normal_tissue (IHC), hpa_single_cell, and per-gene lookups (gene_tissue_ntpm, gene_protein_tissues, gene_cell_type_ntpm) over HPA v23, fetched on demand (oncoref data fetch hpa).
- Genome reference: canonical_gene_id_and_name, find_gene_id_by_name, find_gene_name_from_ensembl_{gene,transcript}_id, aggregate_gene_expression (pyensembl-backed symbol ↔ Ensembl-ID resolution). pyensembl ships with the package, but resolution needs a downloaded human release once: pyensembl install --release 111 --species homo_sapiens (the accessors return None until then).
- Plots (pip install oncoref[plots]): oncoref.plots.apd1_vs_tmb, apd1_orr_bars, incidence_vs_mortality, the CTA/coverage figures, and oncoref.cta_curation_plots.render.
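The Expression entry distinguishes a within-cohort view (how a gene's values are distributed across samples) from a within-sample view (how often a gene ranks near the top of a sample's own genes). The toy sketch below illustrates that distinction only. It is not oncoref's code: the matrix values are made up, the helper names and signatures are hypothetical, and a top share of 25% stands in for 5% because the toy samples have only four genes.

```python
import math

# Toy expression matrix: sample -> {gene: TPM}. All values are made up.
COHORT = {
    "s1": {"A": 10.0, "B": 1.0, "C": 5.0, "D": 0.0},
    "s2": {"A": 20.0, "B": 2.0, "C": 50.0, "D": 0.0},
    "s3": {"A": 30.0, "B": 3.0, "C": 6.0, "D": 1.0},
}


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in 0..100) of an ascending list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def within_cohort_percentiles(cohort, gene, qs=(0, 50, 100)):
    """Percentiles of one gene's values across all samples in the cohort."""
    values = sorted(sample[gene] for sample in cohort.values())
    return [percentile(values, q) for q in qs]


def within_sample_top_fraction(cohort, gene, top=0.25):
    """Fraction of samples in which `gene` is among that sample's top genes."""
    hits = 0
    for sample in cohort.values():
        n_top = max(1, math.ceil(len(sample) * top))
        ranked = sorted(sample, key=sample.get, reverse=True)
        hits += gene in ranked[:n_top]
    return hits / len(cohort)


print(within_cohort_percentiles(COHORT, "A"))            # [10.0, 20.0, 30.0]
print(round(within_sample_top_fraction(COHORT, "A"), 3))  # 0.667
print(round(within_sample_top_fraction(COHORT, "C"), 3))  # 0.333
```

Gene C has the single highest value in the cohort yet tops only one of three samples, which is why the two views can rank genes differently.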
CLI
oncoref cancer-type prostate # registry info as JSON
oncoref tmb LUAD_EGFR # 6.9
oncoref ici SKCM # 42 (--regimen to pin, --all-regimens to compare)
oncoref burden pancreas --metric us_mortality_pct
oncoref cta --count # number of expressed CTAs
oncoref plot apd1-vs-tmb --out apd1_vs_tmb.png
oncoref plot patient-coverage --gene-set cta --out coverage_out
oncoref plot cta-curation --out cta_curation_out
# managed data downloads/cache:
oncoref data list # every wheel/bundle/HPA/source dataset
oncoref data status bundle # expression-bundle cache state (no download)
oncoref data fetch bundle # download the large expression bundle
oncoref data fetch hpa # download HPA reference data (RNA / IHC / single-cell)
oncoref data dir bundle # where the data bundle is cached
oncoref data prune --yes # delete stale bundle version caches
oncoref version
Development
./develop.sh # editable install with dev extras
./format.sh # ruff format
./lint.sh # ruff check + format --check
./test.sh # lint + pytest with coverage
License
Apache 2.0.
File details
Details for the file oncoref-1.8.48.tar.gz.
File metadata
- Download URL: oncoref-1.8.48.tar.gz
- Upload date:
- Size: 1.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f71c467fb7edbc15979fa89e24600fa1afb9f589cad678411da5d90fb8e46e50 |
| MD5 | bcf9e0dbe23dcd509d9ddc1eb1b098c1 |
| BLAKE2b-256 | 32ce0ab70d9d57aea951b8c287e0aa950762a6dcb4eb6cd51809495cdbac025d |
File details
Details for the file oncoref-1.8.48-py3-none-any.whl.
File metadata
- Download URL: oncoref-1.8.48-py3-none-any.whl
- Upload date:
- Size: 1.7 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 714db581594fa4487d7b7519d20a8431392d743dc255512612a22194b0d448ef |
| MD5 | fa5fb38505e33fa7ef4b4401b62cac7a |
| BLAKE2b-256 | 52ff13e75a855ef2c42c052a5158b8bea166aca4dde60866b3e2288706158c02 |