Curated cancer reference data: ontology, TMB, incidence/mortality, and expression
oncodata
Curated cancer reference data — cancer-type ontology, tumor mutational burden (TMB), incidence/mortality, anti-PD-1 response, per-cohort RNA-seq expression, and cancer-testis antigens — behind one small Python API, a data fetch/cache CLI, and a set of reference plots.
oncodata is the base layer
oncodata is the foundation of the openvax/PIRL dependency pyramid — the
single upstream source of truth for cancer reference data. It depends only on
pandas / numpy / pyarrow / PyYAML, and it never imports its consumers: data
and logic flow only downward. It does not mirror these definitions from anywhere;
it owns them.
Anything that needs to know about
- gene expression of cancer samples — per-cohort RNA-seq in a normalized, comparable space: summary stats, tail-weighted percentiles, and medoid/exemplar samples per cancer type/subtype;
- Human Protein Atlas (HPA) protein / RNA normal-tissue expression;
- the definition of cancer-testis antigens — the HPA tissue-restriction call over the candidate list (HPA-only; no MS/peptide layer);
- the ontology of cancer types — codes, the parent/child hierarchy, subtypes, families, characteristic driver fusions, and the cross-cutting MSI/POLE/HPV groupings; and
- anti-PD-1 response rates and TMB per cancer type
depends on oncodata — including pirlygenes (gene-set curation/analysis),
tsarina (personalized target selection), hitlist (panel selection),
trufflepig (sample classification), and anything else downstream.
Everything keys on the cancer-type registry. The small curated tables ship in the wheel; the heavy per-cohort expression bundle downloads on first use from oncodata's own GitHub Release.
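To make "keys on the cancer-type registry" concrete, here is a minimal sketch of alias resolution and parent-chain inheritance over a toy registry. The `REGISTRY` layout, its field names, and the helpers `resolve` and `inherited` are hypothetical and only illustrate the idea; they are not oncodata's actual schema or implementation. The one value used (TMB 6.9 for LUAD, inherited by LUAD_EGFR) comes from the API examples in this README.

```python
from typing import Optional

# Hypothetical toy registry: code -> record. Field names are assumptions
# made for illustration, not oncodata's real schema.
REGISTRY = {
    "LUAD": {"parent": None, "aliases": ["lung adenocarcinoma"], "tmb": 6.9},
    "LUAD_EGFR": {"parent": "LUAD", "aliases": [], "tmb": None},
    "PRAD": {"parent": None, "aliases": ["prostate"], "tmb": None},
}


def resolve(name: str) -> str:
    """Map a registry code or alias (case-insensitive) to a registry code."""
    key = name.strip()
    if key.upper() in REGISTRY:
        return key.upper()
    for code, record in REGISTRY.items():
        if key.lower() in (alias.lower() for alias in record["aliases"]):
            return code
    raise KeyError(f"unknown cancer type: {name!r}")


def inherited(code: str, field: str) -> Optional[float]:
    """Return the first non-missing `field` value walking up the parent chain."""
    seen = set()  # guards against accidental cycles in the hierarchy
    current: Optional[str] = code
    while current is not None and current not in seen:
        seen.add(current)
        value = REGISTRY[current][field]
        if value is not None:
            return value
        current = REGISTRY[current]["parent"]
    return None


print(resolve("prostate"))                      # PRAD
print(inherited(resolve("LUAD_EGFR"), "tmb"))   # 6.9, taken from the LUAD parent
```

Keeping inheritance in a single lookup helper means a subtype only stores a value when it genuinely differs from its parent.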
Install
```sh
pip install oncodata
```
Python API
```python
import oncodata as cd

cd.resolve_cancer_type("prostate")       # -> "PRAD"
cd.cancer_type_info("SARC_RMS_ARMS")     # full registry record + burden + tmb
cd.cancer_tmb("LUAD_EGFR")               # 6.9 (inherited from LUAD)
cd.cancer_burden("pancreas", metric="us_mortality_pct")
cd.burden_category("SARC_OS")            # -> "bone_and_joint"
cd.cancer_apd1_response("SKCM")          # 42 (anti-PD-1 monotherapy ORR %)

# Cancer-testis antigens (HPA-derived tissue-restriction):
cd.CTA_gene_names()                      # expressed CTA symbols (MAGEA4, CT83, …)
cd.CTA_evidence()                        # full HPA restriction table

# Per-cohort expression percentiles (downloads the data bundle on first use):
cd.cohort_gene_percentiles("PRAD")       # per-gene p0…p100 vector (within-cohort)
cd.within_sample_top_fraction("PRAD")    # per-gene frac of samples top-5% (within-sample)
```
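The two expression summaries above answer different questions, and a toy example may help tell them apart. The sketch below is one plausible reading of "within-cohort percentiles" and "within-sample top fraction", computed on made-up numbers with the standard library only. The data, the helper names `gene_percentiles` and `top_fraction`, and the tie and rounding choices are all assumptions; oncodata's exact definitions (including its tail-weighted percentiles) are not specified here.

```python
import math
import statistics

# Toy expression matrix: sample -> {gene: value}. All values are made up.
EXPR = {
    "s1": {"A": 10.0, "B": 1.0, "C": 0.0, "D": 5.0},
    "s2": {"A": 8.0, "B": 2.0, "C": 9.0, "D": 4.0},
    "s3": {"A": 12.0, "B": 0.5, "C": 1.0, "D": 6.0},
}


def gene_percentiles(expr, gene):
    """Within-cohort view: p0..p100 of one gene across samples (101 values)."""
    values = sorted(sample[gene] for sample in expr.values())
    inner = statistics.quantiles(values, n=100, method="inclusive")  # p1..p99
    return [values[0], *inner, values[-1]]


def top_fraction(expr, top=0.05):
    """Within-sample view: per gene, the fraction of samples in which the gene
    ranks in the top `top` share of that sample's genes."""
    genes = sorted(next(iter(expr.values())))
    k = max(1, math.ceil(top * len(genes)))  # keep at least one gene per sample
    counts = dict.fromkeys(genes, 0)
    for sample in expr.values():
        ranked = sorted(genes, key=lambda g: sample[g], reverse=True)
        for gene in ranked[:k]:
            counts[gene] += 1
    return {gene: counts[gene] / len(expr) for gene in genes}


p = gene_percentiles(EXPR, "A")
print(p[0], p[50], p[100])   # 8.0 10.0 12.0
print(top_fraction(EXPR))    # A is the top gene in 2 of 3 samples, C in 1 of 3
```

The first function compares a gene with itself across samples; the second compares genes with each other inside each sample and then aggregates.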
Domains
- Ontology: `cancer_type_registry`, `resolve_cancer_type`, `cancer_type_info`, `cancer_types_in_family`, `viral_status`, `fusion_status`, and the cohort vocabulary (`cohort_registry`, `cohort_aggregates`).
- TMB: `cancer_tmb`, `cancer_tmb_df` (parent-chain inheritance).
- Incidence / mortality: `cancer_burden`, `burden_category` (ACS / GLOBOCAN).
- Anti-PD-1 response: `cancer_apd1_response` (monotherapy ORR per type).
- Expression: `cohort_gene_percentiles`, `within_sample_top_fraction`, `representative_cohort_samples` over the lazy-downloaded per-cohort bundle.
- Cancer-testis antigens: `CTA_gene_names` / `CTA_gene_ids`, `CTA_evidence`, `synthesize_restriction` (HPA-only tissue-restriction; MS evidence stays in the target-selection layer).
- HPA normal tissue: `hpa_rna_consensus`, `hpa_normal_tissue` (IHC), `hpa_single_cell`, and per-gene lookups (`gene_tissue_ntpm`, `gene_protein_tissues`, `gene_cell_type_ntpm`) over HPA v23, fetched on demand (`oncodata sources fetch`).
- Genome reference: `canonical_gene_id_and_name`, `find_gene_id_by_name`, `find_gene_name_from_ensembl_{gene,transcript}_id`, `aggregate_gene_expression` (pyensembl-backed symbol ↔ Ensembl-ID resolution). pyensembl ships with the package, but resolution needs a downloaded human release once: `pyensembl install --release 111 --species homo_sapiens` (the accessors return `None` until then).
- Peptides: `cta_specific_9mer_counts`, `cta_specific_9mer_load` (per-cohort mean per-patient CTA-specific 9-mer load). CTA-specific 9-mers are those found in a CTA protein but in no non-CTA protein, enumerated from the reference proteome and cached per release.
- Plots (`pip install oncodata[plots]`): `oncodata.plots.apd1_vs_tmb`, `apd1_orr_bars`, `incidence_vs_mortality`, and the CTA/coverage figures.
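The Peptides definition above (9-mers found in a CTA protein but in no non-CTA protein) reduces to a set difference. The following is a minimal sketch on a toy proteome; the gene names, sequences, and the helpers `kmers` and `cta_specific_kmers` are invented for illustration and are not oncodata's implementation. It does not cover the per-patient load, whose weighting is not described here.

```python
def kmers(seq: str, k: int = 9) -> set[str]:
    """All length-k substrings of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cta_specific_kmers(
    proteome: dict[str, str], cta_genes: set[str], k: int = 9
) -> dict[str, set[str]]:
    """Per CTA gene: k-mers that occur in no non-CTA protein."""
    non_cta: set[str] = set()
    for gene, seq in proteome.items():
        if gene not in cta_genes:
            non_cta |= kmers(seq, k)
    return {
        gene: kmers(proteome[gene], k) - non_cta
        for gene in cta_genes
        if gene in proteome
    }


# Toy proteome with invented names and sequences. NORM1 shares a stretch
# with CTA1, so the 9-mers inside that stretch are not CTA-specific.
PROTEOME = {
    "CTA1": "MKTAYIAKQRQISFVK",
    "NORM1": "AYIAKQRQISGG",
}

specific = cta_specific_kmers(PROTEOME, cta_genes={"CTA1"})
print(len(specific["CTA1"]))  # 6 of the 8 CTA1 9-mers survive the filter
```

Note that under this definition a 9-mer shared between two CTA proteins still counts as CTA-specific, because only non-CTA proteins are subtracted.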
CLI
```sh
oncodata cancer-type prostate                 # registry info as JSON
oncodata tmb LUAD_EGFR                        # 6.9
oncodata apd1 SKCM                            # 42
oncodata burden pancreas --metric us_mortality_pct
oncodata cta --count                          # number of expressed CTAs
oncodata plot apd1-vs-tmb --out apd1_vs_tmb.png

# data bundle (per-cohort expression):
oncodata fetch                                # download the ~340 MB bundle
oncodata status                               # which bundle paths are cached (no download)
oncodata cache-dir                            # where the data bundle is cached
oncodata prune --yes                          # delete stale version caches
oncodata version
```
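As a rough mental model for the bundle commands, the sketch below shows how pruning stale version caches could work if each bundle version lived in its own subdirectory of the cache root. That layout, the helper names `stale_versions` and `prune`, and the dry-run behavior without `yes=True` are all assumptions made for illustration; the real cache layout and the behavior of `oncodata prune` without `--yes` are not documented here.

```python
import shutil
import tempfile
from pathlib import Path


def stale_versions(cache_root: Path, current: str) -> list[Path]:
    """Version directories under the cache root other than the current one."""
    return sorted(
        p for p in cache_root.iterdir() if p.is_dir() and p.name != current
    )


def prune(cache_root: Path, current: str, yes: bool = False) -> list[Path]:
    """Return the stale directories; delete them only when `yes` is set."""
    stale = stale_versions(cache_root, current)
    if yes:
        for path in stale:
            shutil.rmtree(path)
    return stale


# Demo in a throwaway directory with hypothetical version names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for version in ("v1", "v2", "v3"):
        (root / version).mkdir()
    would_delete = [p.name for p in prune(root, current="v3")]        # dry run
    deleted = [p.name for p in prune(root, current="v3", yes=True)]   # removes
    remaining = sorted(p.name for p in root.iterdir())

print(would_delete, deleted, remaining)  # ['v1', 'v2'] ['v1', 'v2'] ['v3']
```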
Development
```sh
./develop.sh   # editable install with dev extras
./format.sh    # ruff format
./lint.sh      # ruff check + format --check
./test.sh      # lint + pytest with coverage
```
License
Apache 2.0.
File details
Details for the file oncodata-1.4.0.tar.gz.
File metadata
- Download URL: oncodata-1.4.0.tar.gz
- Upload date:
- Size: 1.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 233f319067e772f85b7299230d8e498e7eb821de63928dbfe5ec9cf629efaf63 |
| MD5 | 345acc9e3a4b90dbda45ac44b6530d0a |
| BLAKE2b-256 | 310d141d76f75763631528db524ed17259ca957fc70bd386f4e443c756c110da |
File details
Details for the file oncodata-1.4.0-py3-none-any.whl.
File metadata
- Download URL: oncodata-1.4.0-py3-none-any.whl
- Upload date:
- Size: 1.0 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 92a268f7d8007390800e0d9e97b88f1c7b9cb7d2df82fe55b6942a4b19ac0e9e |
| MD5 | 493946296a38fb3d6bb1e5b35c60d931 |
| BLAKE2b-256 | bcb9e78eb95331dd7e29e52fde2e578f34bdd56a7b33625c9f6581f5bf462be9 |