TESSERA: a foundation model for the cancer genome.

Project description

TESSERA logo

Tumour Embeddings via Self-Supervised Encoding and Reconstruction of Alterations
A foundation model for the cancer genome.


TESSERA is a self-supervised foundation model jointly pretrained on somatic single-nucleotide variants (SNVs) and copy-number alterations (CNAs) from the TCGA Pan-Cancer Atlas. A single learned representation, produced once and reused without retraining, supports variant pathogenicity prediction, pan-cancer tumour-type classification, unsupervised molecular subtyping, prognostic stratification, and counterfactual treatment-effect estimation.

This repository contains the reference implementation, the pretrained-weights pointer, the inference utilities described in the accompanying paper, and the end-to-end analysis pipelines that reproduce every panel of Figures 1-6 and Supplementary Figures 1-12.

Quick start

The fastest way to use TESSERA is via the public inference API on Hugging Face; no local installation required. Upload SNV and/or CNA data, get back per-variant predictions and embeddings:

🔗 Inference API: huggingface.co/spaces/JW-Sidhom-Lab/tessera (coming soon)

From Python (pip install gradio_client):

import time
from gradio_client import Client, handle_file

client = Client("JW-Sidhom-Lab/tessera")        # the public Spaces URL also works

# Submit returns (status_html, job_id) immediately; inference runs async
_, job_id = client.predict(
    handle_file("snv.csv"),         # SNV CSV; or None
    handle_file("cna.csv"),         # CNA CSV; or None. At least one required.
    True,                           # apply TCGA quantile normalization to CNA
    "you@example.com",              # email address for the download link
    "GRCh37",                       # genome assembly: "GRCh37" or "GRCh38"
    api_name="/submit",
)

# Poll for completion (the same URL also gets emailed when the job finishes)
while True:
    status = client.predict(job_id, api_name="/status")
    if status["status"] in ("done", "failed"):
        break
    time.sleep(10)

print(status["url"])    # 24h pre-signed S3 download URL with the result ZIP
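For longer-running jobs, the polling loop above can be wrapped in a small helper with a timeout. This is a generic sketch, not part of the package: `get_status` stands in for any zero-argument callable (e.g. a lambda around `client.predict(job_id, api_name="/status")`) that returns the status dict shown above.

```python
import time

def poll_until_done(get_status, timeout_s=3600, interval_s=10):
    """Poll a status callable until the job reaches a terminal state.

    `get_status` is any zero-argument callable returning a dict with a
    "status" key. Returns the final status dict, or raises TimeoutError
    if the job is still running when the deadline passes.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status["status"] in ("done", "failed"):
            return status
        time.sleep(interval_s)
    raise TimeoutError(f"job still running after {timeout_s} s")
```

Usage: `status = poll_until_done(lambda: client.predict(job_id, api_name="/status"))`.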

The API serves the foundation-model outputs only (per-token embeddings + per-token reconstruction predictions, returned as .npy files inside the result ZIP). Downstream task heads (tumour-type classifier, treatment-effect score) are available on request under a Data Use Agreement.
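Once the result ZIP is downloaded, the `.npy` arrays inside can be enumerated with the standard library before loading them with `numpy.load`. The member names inside the ZIP are not documented here, so this sketch simply lists whatever arrays are present:

```python
import zipfile

def list_npy_members(zip_path):
    """Return the .npy member names inside a TESSERA result ZIP.

    The exact member names are not specified here; this just enumerates
    the .npy arrays the archive contains, each of which can then be
    extracted and loaded with numpy.load.
    """
    with zipfile.ZipFile(zip_path) as zf:
        return [name for name in zf.namelist() if name.endswith(".npy")]
```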

CSV column conventions:

  • SNV: Tumor_Sample_Barcode, Chromosome (no chr prefix), Start_Position, Reference_Allele, Tumor_Seq_Allele2, plus either vaf or both t_alt_count + t_ref_count. Single-base substitutions only.
  • CNA: Tumor_Sample_Barcode, Chromosome, Start, End, Segment_Mean (log2 ratio); optional LOH column triggers the with-LoH model variant.
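A minimal SNV table following these conventions can be produced with nothing but the standard library. The records below are made-up examples; when only read counts are available, `vaf` can be derived as `t_alt_count / (t_alt_count + t_ref_count)`:

```python
import csv

SNV_COLUMNS = ["Tumor_Sample_Barcode", "Chromosome", "Start_Position",
               "Reference_Allele", "Tumor_Seq_Allele2", "vaf"]

def write_snv_csv(path, records):
    """Write single-base-substitution records to CSV, deriving vaf.

    Each record supplies t_alt_count / t_ref_count; vaf is computed as
    alt / (alt + ref). Sample barcodes and positions here are
    illustrative only.
    """
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=SNV_COLUMNS)
        writer.writeheader()
        for rec in records:
            alt, ref = rec.pop("t_alt_count"), rec.pop("t_ref_count")
            rec["vaf"] = round(alt / (alt + ref), 4)
            writer.writerow(rec)

write_snv_csv("snv.csv", [
    {"Tumor_Sample_Barcode": "TCGA-XX-0001", "Chromosome": "7",
     "Start_Position": 140453136, "Reference_Allele": "A",
     "Tumor_Seq_Allele2": "T", "t_alt_count": 30, "t_ref_count": 70},
])
```

Note the `Chromosome` values carry no `chr` prefix, per the convention above.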

Local installation

For users who want to run inference offline or integrate TESSERA into a custom pipeline:

pip install tessera-foundation

The first call to tessera.featurize (below) downloads the reference genome (~3 GB) and the requested model weights from Hugging Face Hub on demand and caches both, so you don't need a separate setup step.

To reproduce the manuscript or retrain from scratch, clone the repo for the analysis scripts and the FASTA bootstrap helper:

git clone https://github.com/JW-Sidhom-Lab/tessera.git
cd tessera
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
bash tessera/ref_genomes/download_ref_genomes.sh

requirements.txt covers the foundation-model package, all manuscript-reproduction scripts (pretraining, classifiers, prognostic / predictive-biomarker analyses), and the Gradio inference API. A slimmer subset for deploying only the inference API is at inference_api/requirements.txt.

Weights are hosted on Hugging Face Hub at huggingface.co/JW-Sidhom-Lab/tessera-foundation under CC-BY-NC-4.0. The shortest path from raw dataframes to feature tensors is the featurize one-liner, which downloads weights on first call (cached afterwards), lifts non-hg19 coordinates, builds the dataset, and runs both per-modality feature heads:

import tessera

result = tessera.featurize(
    snv_df=snv_df,                      # columns: Tumor_Sample_Barcode, Chromosome, Start_Position,
                                        #          Reference_Allele, Tumor_Seq_Allele2, vaf
    cna_df=cna_df,                      # columns: Tumor_Sample_Barcode, Chromosome, Start, End, Segment_Mean
    variant="joint_snv_cna_noloh",      # or "joint_snv_cna" for the with-LoH variant
    from_assembly="GRCh38",             # "GRCh37" / "hg19" is a no-op; otherwise UCSC liftover runs
)

result.snv_features      # (n_variants, 1169)  per-variant embeddings, row-aligned with result.snv_table
result.cna_features      # (n_segments, 688)   per-segment embeddings, row-aligned with result.cna_table
result.liftover_stats    # {"snv": {"n_in", "n_out", "n_dropped"}, "cna": {...}}
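The liftover_stats dict makes it easy to sanity-check how many records survived coordinate conversion. A small helper (not part of the package, written against the dict shape shown above) might look like:

```python
def liftover_drop_rates(liftover_stats):
    """Return the fraction of records dropped per modality.

    Expects the {"snv": {"n_in", "n_out", "n_dropped"}, "cna": {...}}
    shape; modalities that were not supplied can simply be absent.
    """
    rates = {}
    for modality, stats in liftover_stats.items():
        n_in = stats["n_in"]
        rates[modality] = stats["n_dropped"] / n_in if n_in else 0.0
    return rates
```

A high drop rate usually points at a mismatched `from_assembly` rather than genuinely unmappable coordinates.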

For finer-grained control, the individual building blocks are also exposed:

from tessera import load_pretrained, lift_snv, lift_cna

model = load_pretrained("joint_snv_cna_noloh")    # download + instantiate, ~3 s cold
snv_df, _ = lift_snv(snv_df, from_assembly="GRCh38")    # identity if from_assembly=="GRCh37"
cna_df, _ = lift_cna(cna_df, from_assembly="GRCh38")
result = model.featurize(snv_df=snv_df, cna_df=cna_df)  # repeat without re-downloading

UCSC chain files are downloaded on first use and cached at ~/.cache/pyliftover/; offline environments can point the loader at a bundled chain file via the chain_file= argument or the TESSERA_LIFTOVER_CHAIN environment variable.
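In an air-gapped environment, the chain file can be pinned via the environment variable mentioned above, set before any tessera call that triggers liftover. The path below is a placeholder for a locally bundled copy:

```python
import os

# Point TESSERA's liftover at a locally bundled UCSC chain file instead
# of downloading on first use; the path is a placeholder for your own copy.
os.environ["TESSERA_LIFTOVER_CHAIN"] = "/opt/chains/hg38ToHg19.over.chain.gz"
```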

Reproducing the manuscript

Every published panel is backed by a script in this repository. The pipeline runs in three stages:

  1. Data preparation (data/): per-cohort download instructions, source-table provenance, and the create_training_data*.py / build_<cohort>_metadata.py builders that turn raw releases into the analysis-ready CSVs.
  2. Foundation-model pretraining (scripts/tcga_pancan_*/): trains the SNV models, the CNA models, and the joint SNV+CNA InfoNCE-aligned foundation model on the TCGA Pan-Cancer Atlas.
  3. Downstream analyses (scripts/): variant-pathogenicity (Fig. 1 h-o), cross-platform validation (Fig. 1 f-g, Fig. 2 d), tumour-type classification (Fig. 3, Fig. 4 b-e), prognostic UMAP + joint Cox (Fig. 5), doubly-robust counterfactual treatment-effect (Fig. 6 a-m), and DepMap cell-line transfer (Fig. 6 n).

scripts/README.md and data/README.md hold the full per-directory tables mapping each script and cohort to its manuscript figure.

Repository layout

tessera/
├── tessera/                        # foundation-model package
│   ├── base.py                     # BaseModel: shared data + training infrastructure
│   ├── input_keys.py               # input-key helpers
│   ├── model.py                    # TESSERA: foundation-model class
│   ├── data/
│   │   └── preprocessing.py        # SNV/CNA tokenization, FASTA lookup, sample bagging
│   ├── layers/                     # custom Keras layers (attention, masking, MIL, ...)
│   ├── training/                   # training utilities (callbacks, losses, schedules)
│   └── ref_genomes/                # reference-genome download script + indices
├── data/                           # per-cohort data preparation pipelines (data/README.md)
├── scripts/                        # analysis pipelines backing the manuscript figures (scripts/README.md)
└── README.md

Citing TESSERA

If you use TESSERA in your work, please cite:

citation pending publication

A BibTeX entry will be added on acceptance.

License

This repository is distributed under the PolyForm Noncommercial License 1.0.0 (see LICENSE). Use is permitted for academic research, education, public-research-organization use, and personal experimentation; commercial use is not permitted without a separate license. Pretrained foundation-model weights are released on the Hugging Face Hub under CC-BY-NC-4.0 (non-commercial, attribution required). Pretrained weights for downstream clinical task heads (CRC and PDAC treatment-effect models) remain available on request under a Data Use Agreement. Patents covering clinical applications of TESSERA are assigned to NewYork-Presbyterian; commercial licensing inquiries should be directed to NYP's technology transfer office.

Lab

TESSERA is developed in the JW Sidhom Lab at Weill Cornell Medicine.

For questions, collaborations, or commercial-licensing inquiries, contact the corresponding author.

Project details


Download files

Download the file for your platform.

Source Distribution

tessera_foundation-0.1.1.tar.gz (143.2 kB)

Uploaded Source

Built Distribution

tessera_foundation-0.1.1-py3-none-any.whl (152.3 kB)

Uploaded Python 3

File details

Details for the file tessera_foundation-0.1.1.tar.gz.

File metadata

  • Download URL: tessera_foundation-0.1.1.tar.gz
  • Size: 143.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.6

File hashes

Hashes for tessera_foundation-0.1.1.tar.gz
Algorithm Hash digest
SHA256 3d088ebe88312fdb0dedf260a4806d03582b1fe0c038c0986c5059e2418b4f0c
MD5 bcd311edc81fec94dcf570f2ffcec810
BLAKE2b-256 776269b06548ae1a3739662c7647a8a2b37237af6fa3f1eb24b6ca8e86728b89

File details

Details for the file tessera_foundation-0.1.1-py3-none-any.whl.

File hashes

Hashes for tessera_foundation-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 fbe81a5e688147fa4763d61c5142194c064609abf685ef37b93d2922fc4f9531
MD5 dc0ff7ad54a2815578055d51590aa44c
BLAKE2b-256 86fef526e2740d7e6f0daf23bb3402bbe6f014a93862e455aac0744e00b732ee
