Streamline conversion of clinical and genomic data into cBioPortal-compatible formats

cBioFormatter

A Python package for streamlined preparation and formatting of clinical and molecular genomic data for upload to cBioPortal.

Overview

cBioFormatter simplifies the process of converting your genomic data into cBioPortal-compatible formats. Designed for data scientists with basic Python knowledge, this package handles all the complexity of cBioPortal file formatting, validation, and metadata generation.

What it does:

  • Converts clinical data (patient and sample attributes) into cBioPortal format
  • Processes VCF files into MAF format for mutation data
  • Generates all required metadata files automatically
  • Validates your study using cBioPortal's official validator
  • Creates case lists for sample grouping
  • Uploads studies and gene panels into a running cBioPortal instance (optional)
  • Fetches public studies and gene panels from the cBioPortal datahub (optional)

What you need:

  • Basic Python knowledge (pandas DataFrames, module imports)
  • Your clinical data (Excel, CSV, database query, anything that can be converted to a pandas DataFrame)
  • VCF files for mutation data (optional)
  • vcf2maf installed (for VCF processing, optional)

Installation

pip install cbioformatter

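Additional requirements:

  • vcf2maf (optional, for VCF processing)
  • Docker with a running cBioPortal instance (optional, only for upload features)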

Development

For local development, clone the repository and install in editable mode with dev dependencies.

Using uv (recommended)

uv is a fast Python package manager. If you don't have it installed:

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Then set up the project:

git clone https://github.com/getwilds/cbioformatter.git
cd cbioformatter
uv sync --extra dev

To run commands in the virtual environment:

uv run pytest              # Run tests
uv run pytest --cov        # Run tests with coverage
uv run ruff check .        # Run linter
uv run ruff format .       # Format code
uv run ipython             # Interactive Python shell (or: uv run python)

Using pip

git clone https://github.com/getwilds/cbioformatter.git
cd cbioformatter
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e ".[dev]"

To run tests and linting:

pytest                     # Run tests
pytest --cov               # Run tests with coverage
ruff check .               # Run linter
ruff format .              # Format code
ipython                    # Interactive Python shell (or: python)

Quick Start

Basic Study with Clinical Data Only

import pandas as pd
from cbioformatter import ClinicalStudy

# Prepare your sample-level clinical data
# (typically loaded from a CSV, Excel file, or database query)
sample_df = pd.DataFrame({
    'SAMPLE_ID': ['S001', 'S002', 'S003'],
    'PATIENT_ID': ['P001', 'P001', 'P002'],
    'TUMOR_TYPE': ['Primary', 'Metastasis', 'Primary'],
    'AGE_AT_DIAGNOSIS': [45, 45, 67]
})

# sample_df looks like:
# | SAMPLE_ID | PATIENT_ID | TUMOR_TYPE | AGE_AT_DIAGNOSIS |
# |-----------|------------|------------|------------------|
# | S001      | P001       | Primary    | 45               |
# | S002      | P001       | Metastasis | 45               |
# | S003      | P002       | Primary    | 67               |

# Prepare your patient-level clinical data (optional)
patient_df = pd.DataFrame({
    'PATIENT_ID': ['P001', 'P002'],
    'SEX': ['Female', 'Male'],
    'ETHNICITY': ['Hispanic', 'Asian']
})

# patient_df looks like:
# | PATIENT_ID | SEX    | ETHNICITY |
# |------------|--------|-----------|
# | P001       | Female | Hispanic  |
# | P002       | Male   | Asian     |

# Create and validate the study
study = ClinicalStudy(
    study_id="brca_ocdo_2026",
    name="Breast Cancer Study (Office of the Chief Data Officer 2026)",
    description="Clinical and genomic data from breast cancer patients",
    cancer_type="brca",  # must be a valid cBioPortal cancer type
    genome_build="GRCh38",  # e.g. "hg19", "hg38", "GRCh37", or "GRCh38"
    sample_data=sample_df,
    patient_data=patient_df  # optional
)

# Validate the study (generates temp files, runs validator, cleans up)
result = study.validate()

if result.is_valid:
    print("✓ Study is valid!")
    print(f"Validation report: {result.report_path}")
    
    # Write files to disk; write_files() returns the study directory path
    study_dir = study.write_files(output_dir="./my_studies")
    print(f"Study files written to: {study_dir}")
else:
    print("✗ Validation failed. Check the report for details:")
    print(f"Report: {result.report_path}")

Study with Mutation Data

import pandas as pd
from cbioformatter import ClinicalStudy

# Add VCF file paths to your sample DataFrame
sample_df = pd.DataFrame({
    'SAMPLE_ID': ['S001', 'S002', 'S003'],
    'PATIENT_ID': ['P001', 'P001', 'P002'],
    'TUMOR_TYPE': ['Primary', 'Metastasis', 'Primary'],
    'VCF_PATH': [
        '/data/vcf/S001.vcf',
        '/data/vcf/S002.vcf',
        None  # This sample has no mutation data
    ]
})

# The rest is identical - mutation data is automatically detected
study = ClinicalStudy(
    study_id="brca_ocdo_2026",
    name="Breast Cancer Study (Office of the Chief Data Officer 2026)",
    description="Clinical and genomic data from breast cancer patients",
    cancer_type="brca",
    genome_build="GRCh38",
    sample_data=sample_df
)

result = study.validate()
if result.is_valid:
    study.write_files(output_dir="./my_studies")

Uploading to a Local cBioPortal Instance

If you're running a local cBioPortal instance (via Docker), you can upload the study directly:

# After write_files() has produced a study directory:
study_dir = study.write_files(output_dir="./my_studies")

study.upload(
    study_dir,
    url="http://localhost:8080/",   # your cBioPortal URL
    container="cbioportal",          # docker-compose service name
)

Upload is an optional advanced step that requires a running cBioPortal instance and Docker. Validation (study.validate()) requires neither; it runs locally, so newcomers can format and validate studies without setting up infrastructure.

Fetching Public Studies from the cBioPortal Datahub

from cbioformatter import fetch_datahub_study, fetch_datahub_panel

# Download a public study (returns the path to the extracted directory)
study_dir = fetch_datahub_study("msk_impact_2017", output_dir="./studies")

# Download a gene panel definition
panel_file = fetch_datahub_panel("impact341", output_dir="./panels")

This is useful for seeding a fresh cBioPortal instance with reference data, or for round-tripping public studies through cbioformatter for testing.

Features

Clinical Data Handling

Required columns:

  • SAMPLE_ID in sample DataFrame (must be unique)
  • PATIENT_ID in patient DataFrame if provided (must be unique)

Smart defaults:

  • If patient_data is not provided, it's auto-generated from unique PATIENT_ID values in sample_data
  • If PATIENT_ID column is missing from sample_data, each sample is assigned its own patient (PATIENT_ID = SAMPLE_ID)
  • Column names are automatically cleaned for cBioPortal compatibility while preserving display names
  • Data types are automatically inferred: NUMBER (int/float), BOOLEAN (bool), STRING (everything else)
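The defaults above can be sketched in a few lines. This is an illustration only, not cbioformatter's implementation: plain lists of dicts stand in for DataFrames to keep the example dependency-free, and the helper names infer_datatype and apply_defaults are hypothetical.

```python
def infer_datatype(values):
    """Return a cBioPortal datatype label for one column, ignoring missing values."""
    present = [v for v in values if v is not None]
    # bool is a subclass of int in Python, so it has to be tested first.
    if present and all(isinstance(v, bool) for v in present):
        return "BOOLEAN"
    if present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        return "NUMBER"
    return "STRING"


def apply_defaults(sample_rows):
    """Fill in PATIENT_ID when the column is absent and derive a patient table."""
    has_patient_column = any("PATIENT_ID" in row for row in sample_rows)
    if has_patient_column:
        filled = [dict(row) for row in sample_rows]
    else:
        # No PATIENT_ID column: each sample becomes its own patient.
        filled = [{**row, "PATIENT_ID": row["SAMPLE_ID"]} for row in sample_rows]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    patient_ids = list(dict.fromkeys(row["PATIENT_ID"] for row in filled))
    return filled, [{"PATIENT_ID": pid} for pid in patient_ids]


samples = [
    {"SAMPLE_ID": "S001", "PATIENT_ID": "P001", "AGE_AT_DIAGNOSIS": 45},
    {"SAMPLE_ID": "S002", "PATIENT_ID": "P001", "AGE_AT_DIAGNOSIS": 45},
    {"SAMPLE_ID": "S003", "PATIENT_ID": "P002", "AGE_AT_DIAGNOSIS": 67},
]
filled, patients = apply_defaults(samples)
age_type = infer_datatype([row["AGE_AT_DIAGNOSIS"] for row in filled])
```

Checking for bool before int matters because isinstance(True, int) is true in Python; without that ordering a boolean column would be labeled NUMBER.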

Validation:

  • Ensures all SAMPLE_ID values are unique
  • Ensures all PATIENT_ID values are unique (if patient data provided)
  • Validates referential integrity (all patient IDs in samples exist in patient data)
  • Failures raise clear exceptions with specific issues identified
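To check your data before constructing a study, the same three rules can be approximated as follows. This is a sketch under assumptions: check_clinical_ids is a hypothetical helper, rows are plain dicts rather than DataFrame rows, and the message wording simply mirrors the Troubleshooting section.

```python
from collections import Counter


def check_clinical_ids(sample_rows, patient_rows=None):
    """Return a list of human-readable problems; an empty list means no issues."""
    problems = []

    sample_counts = Counter(row["SAMPLE_ID"] for row in sample_rows)
    dup_samples = sorted(sid for sid, n in sample_counts.items() if n > 1)
    if dup_samples:
        problems.append(f"SAMPLE_ID duplicates found: {dup_samples}")

    if patient_rows is not None:
        patient_counts = Counter(row["PATIENT_ID"] for row in patient_rows)
        dup_patients = sorted(pid for pid, n in patient_counts.items() if n > 1)
        if dup_patients:
            problems.append(f"PATIENT_ID duplicates found: {dup_patients}")

        # Referential integrity: every patient referenced by a sample must exist.
        referenced = {row["PATIENT_ID"] for row in sample_rows if "PATIENT_ID" in row}
        for pid in sorted(referenced - set(patient_counts)):
            problems.append(f"PATIENT_ID '{pid}' not found in patient data")

    return problems


issues = check_clinical_ids(
    [{"SAMPLE_ID": "S001", "PATIENT_ID": "P001"},
     {"SAMPLE_ID": "S001", "PATIENT_ID": "P123"}],
    [{"PATIENT_ID": "P001"}],
)
```

Collecting every problem before reporting, rather than stopping at the first, makes it easier to fix a spreadsheet in one pass.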

Mutation Data Processing

Input: VCF files (one per sample)

How it works:

  1. Add a VCF_PATH column to your sample_data DataFrame with file paths
  2. VCF files are automatically converted to MAF format using vcf2maf
  3. All MAF files are concatenated into a single mutation file
  4. Sample IDs are correctly mapped to Tumor_Sample_Barcode
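Steps 3 and 4 can be pictured with the sketch below. It does not run vcf2maf and is not the package's code: it assumes the per-sample MAF text already exists, that every file shares the same column order, and that the hypothetical helper concat_mafs is free to overwrite Tumor_Sample_Barcode with the sample ID.

```python
def concat_mafs(maf_text_by_sample):
    """Merge per-sample MAF text into one table, one header, barcodes set per sample."""
    header = None
    merged_rows = []
    for sample_id, text in maf_text_by_sample.items():
        # Drop blank lines and '#' comment lines such as '#version 2.4'.
        lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
        if not lines:
            continue
        columns = lines[0].split("\t")
        if header is None:
            header = columns
        elif columns != header:
            raise ValueError(f"MAF columns for {sample_id} differ from the first file")
        barcode_index = columns.index("Tumor_Sample_Barcode")
        for line in lines[1:]:
            fields = line.split("\t")
            fields[barcode_index] = sample_id  # map the row to its SAMPLE_ID
            merged_rows.append(fields)
    if header is None:
        return ""
    return "\n".join("\t".join(row) for row in [header] + merged_rows) + "\n"


maf_a = (
    "#version 2.4\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Classification\n"
    "TP53\tTUMOR\tMissense_Mutation\n"
)
maf_b = (
    "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Classification\n"
    "BRCA1\tTUMOR\tNonsense_Mutation\n"
)
merged = concat_mafs({"S001": maf_a, "S002": maf_b})
```

Real MAF files carry many more columns; the three used here are only enough to show the header handling and the barcode rewrite.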

Flexible data availability:

  • If VCF_PATH column is missing entirely → no mutation data included
  • If some samples have VCF paths and others don't → mutation data included only for samples with valid paths
  • At least one valid VCF path must be provided if the column exists
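The three rules above amount to a small decision function. The sketch below is an approximation with a hypothetical name (select_vcf_samples); it treats any non-empty path as usable and leaves out the file-existence check a real implementation would presumably perform.

```python
def select_vcf_samples(sample_rows):
    """Map SAMPLE_ID to VCF path for the samples that have mutation data."""
    if not any("VCF_PATH" in row for row in sample_rows):
        return {}  # no VCF_PATH column at all: no mutation data
    with_paths = {
        row["SAMPLE_ID"]: row["VCF_PATH"]
        for row in sample_rows
        if row.get("VCF_PATH")  # skips None and empty strings
    }
    if not with_paths:
        raise ValueError("VCF_PATH column is present but no sample has a VCF path")
    return with_paths


selected = select_vcf_samples([
    {"SAMPLE_ID": "S001", "VCF_PATH": "/data/vcf/S001.vcf"},
    {"SAMPLE_ID": "S002", "VCF_PATH": "/data/vcf/S002.vcf"},
    {"SAMPLE_ID": "S003", "VCF_PATH": None},
])
```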

Requirements:

  • vcf2maf must be installed (see installation guide)
  • VCF files must match the specified genome build (GRCh37 or GRCh38)
  • Reference genome files for vcf2maf (users provide their own reference path)

Study Validation

The validate() method:

  1. Creates temporary files in cBioPortal format
  2. Runs the official cBioPortal validator (from cBioPortal datahub-study-curation-tools)
  3. Generates an HTML validation report
  4. Cleans up temporary files
  5. Returns a validation result object

Validation result object:

result.is_valid      # True if validation passed (clean or warnings-only)
result.report_path   # Path to HTML validation report
result.errors        # Errors AND/OR warnings emitted by the validator

is_valid is True for a clean validation and for warnings-only results; in the warnings-only case, result.errors is populated and write_files(validate=True) proceeds with a UserWarning. Errors (validator exit code 1 or 2) raise ValidationError from write_files(validate=True) and study files are not written.
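The behavior described above can be summarized as a gate in front of the file writer. The sketch below is a stand-in, not the package's code: the local ValidationError class and the gate_write function are hypothetical, and only the treatment of exit codes 1 and 2 as errors comes from the text.

```python
import warnings


class ValidationError(Exception):
    """Local stand-in for the exception the text says write_files() raises."""


ERROR_EXIT_CODES = {1, 2}  # validator exit codes treated as errors, per the text


def gate_write(exit_code, messages):
    """Return True if files may be written; warn or raise as described above."""
    if exit_code in ERROR_EXIT_CODES:
        raise ValidationError(f"validator reported errors: {messages}")
    if messages:
        # Warnings-only result: still valid, but surface the messages.
        warnings.warn(f"validator reported warnings: {messages}", UserWarning)
    return True


ok_to_write = gate_write(0, [])
```

Callers who want warnings to block a release can promote them with warnings.simplefilter("error", UserWarning) before writing.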

Validator acquisition: The cBioPortal validator is AGPL-3.0 licensed and lives in a separate repository, so cbioformatter does not bundle it. On first validate() call, the validator is cloned into ~/.cache/cbioformatter/validator/ (~5 MB, requires git and internet). Subsequent calls reuse the cache.

For air-gapped or CI environments, pre-clone the validator and set CBIOFORMATTER_VALIDATOR_PATH:

git clone --depth 1 https://github.com/cBioPortal/datahub-study-curation-tools.git
export CBIOFORMATTER_VALIDATOR_PATH=$(pwd)/datahub-study-curation-tools/validation/validator
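For orientation, the lookup order implied above (environment variable first, then the cache directory) might look like the sketch below. The function name resolve_validator_dir is hypothetical, and the package's actual resolution logic may differ.

```python
import os
from pathlib import Path


def resolve_validator_dir(environ=None):
    """Pick the validator directory: explicit override first, cache otherwise."""
    environ = os.environ if environ is None else environ
    override = environ.get("CBIOFORMATTER_VALIDATOR_PATH")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "cbioformatter" / "validator"


validator_dir = resolve_validator_dir()
```

Passing a dict as environ makes the function easy to exercise in tests without touching the real environment.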

File Output

The write_files() method generates a complete cBioPortal study directory:

my_studies/
└── brca_ocdo_2026/
    ├── meta_study.txt
    ├── meta_clinical_patient.txt
    ├── data_clinical_patient.txt
    ├── meta_clinical_sample.txt
    ├── data_clinical_sample.txt
    ├── meta_mutations.txt              # if mutation data provided
    ├── data_mutations.txt              # if mutation data provided
    └── case_lists/
        ├── cases_all.txt
        └── cases_sequenced.txt         # if mutation data provided

Parameters:

  • output_dir (default: ".") - Base directory for output. Study files are created in {output_dir}/{study_id}/
  • validate (default: True) - If True, runs validation before writing files. Set to False to skip validation (use with caution).
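If you want to confirm after writing that nothing is missing, the layout above can be turned into a checklist. The helper expected_study_files below is hypothetical and simply encodes the tree shown in this section.

```python
from pathlib import PurePosixPath


def expected_study_files(output_dir, study_id, has_mutations):
    """List the relative paths the study directory should contain."""
    study_dir = PurePosixPath(output_dir) / study_id
    names = [
        "meta_study.txt",
        "meta_clinical_patient.txt",
        "data_clinical_patient.txt",
        "meta_clinical_sample.txt",
        "data_clinical_sample.txt",
        "case_lists/cases_all.txt",
    ]
    if has_mutations:
        names += [
            "meta_mutations.txt",
            "data_mutations.txt",
            "case_lists/cases_sequenced.txt",
        ]
    return [str(study_dir / name) for name in names]


expected = expected_study_files("./my_studies", "brca_ocdo_2026", has_mutations=True)
```

Comparing this list against the files actually on disk (for example with pathlib.Path.exists) gives a quick sanity check before an upload.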

Uploading to cBioPortal (Optional)

For users running their own cBioPortal instance, cbioformatter can push studies and gene panels directly into the running server. This is a fully optional advanced feature — the formatting and validation features above work standalone.

Requirements:

  • A running cBioPortal instance (typically via Docker)
  • Docker accessible on your machine (docker compose available in your PATH)
  • The host directory containing your study must be bind-mounted into the cBioPortal container

Uploading a study:

study.upload(
    study_dir,                          # Path returned by write_files()
    url="http://localhost:8080/",       # cBioPortal instance URL
    container="cbioportal",             # docker-compose service name
    mount_path="/study",                # path inside container where study_dir is mounted
)

The upload() method invokes metaImport.py inside the cBioPortal container and returns a result object with the import status and a link to the HTML report.

Uploading a gene panel:

from cbioformatter import upload_gene_panel

upload_gene_panel(
    panel_file="./panels/data_gene_panel_impact341.txt",
    container="cbioportal",
    mount_path="/study",
)

Gene panels are study-independent reference data — they need to be loaded into cBioPortal before any studies that reference them.

Environment variable defaults:

  • CBIOPORTAL_URL — overrides the default url
  • CBIOPORTAL_CONTAINER — overrides the default container
  • CBIOPORTAL_MOUNT_PATH — overrides the default mount_path
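One plausible reading of these defaults is sketched below: an explicit argument wins, then the environment variable, then the built-in default. That precedence is an assumption, and resolve_upload_settings is a hypothetical helper rather than part of the package.

```python
import os

# Parameter name -> (environment variable, built-in default from the API reference)
UPLOAD_DEFAULTS = {
    "url": ("CBIOPORTAL_URL", "http://localhost:8080/"),
    "container": ("CBIOPORTAL_CONTAINER", "cbioportal"),
    "mount_path": ("CBIOPORTAL_MOUNT_PATH", "/study"),
}


def resolve_upload_settings(environ=None, **explicit):
    """Resolve url/container/mount_path from arguments, environment, or defaults."""
    environ = os.environ if environ is None else environ
    settings = {}
    for name, (env_var, fallback) in UPLOAD_DEFAULTS.items():
        if explicit.get(name) is not None:
            settings[name] = explicit[name]  # assumed: explicit argument wins
        else:
            settings[name] = environ.get(env_var, fallback)
    return settings


settings = resolve_upload_settings(
    environ={"CBIOPORTAL_CONTAINER": "cbioportal-dev"},
    url="http://portal.example.org/",
)
```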

Fetching from the cBioPortal Datahub (Optional)

The cBioPortal datahub hosts public studies and gene panel definitions. cbioformatter provides utilities to download them:

from cbioformatter import fetch_datahub_study, fetch_datahub_panel

study_dir = fetch_datahub_study("msk_impact_2017", output_dir="./studies")
panel_file = fetch_datahub_panel("impact341", output_dir="./panels")

These functions return local paths; they do not automatically upload the fetched data. Combine with upload() / upload_gene_panel() for a complete fetch-and-load workflow.

API Reference

ClinicalStudy

ClinicalStudy(
    study_id: str,
    name: str,
    description: str,
    cancer_type: str,
    genome_build: str,
    sample_data: pd.DataFrame,
    patient_data: pd.DataFrame = None
)

Parameters:

  • study_id: Unique identifier for the study (no spaces, lowercase recommended)
  • name: Human-readable study name
  • description: Brief description of the study
  • cancer_type: Valid cBioPortal cancer type (see cBioPortal documentation)
  • genome_build: Reference genome build. Accepts UCSC names ("hg19", "hg38", "mm10") or NCBI/Ensembl aliases ("GRCh37", "GRCh38", "GRCm38"); aliases are translated to the UCSC form on write since cBioPortal's validator only accepts UCSC names
  • sample_data: pandas DataFrame with sample-level clinical attributes. Must include SAMPLE_ID. Optionally includes PATIENT_ID and VCF_PATH
  • patient_data: Optional pandas DataFrame with patient-level clinical attributes. Must include PATIENT_ID if provided
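The alias translation described for genome_build can be illustrated with a small lookup. This sketch assumes exact, case-sensitive matching, and the helper name to_ucsc_build is hypothetical.

```python
UCSC_NAMES = {"hg19", "hg38", "mm10"}
ALIASES = {"GRCh37": "hg19", "GRCh38": "hg38", "GRCm38": "mm10"}


def to_ucsc_build(genome_build):
    """Translate an NCBI/Ensembl build name to its UCSC form; pass UCSC names through."""
    if genome_build in UCSC_NAMES:
        return genome_build
    try:
        return ALIASES[genome_build]
    except KeyError:
        accepted = sorted(UCSC_NAMES | set(ALIASES))
        raise ValueError(
            f"Unsupported genome build {genome_build!r}; expected one of {accepted}"
        ) from None


reference_genome = to_ucsc_build("GRCh38")
```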

Methods:

validate()

Validates the study using cBioPortal's official validator.

Returns: ValidationResult object with:

  • is_valid (bool): True if validation passed (clean or warnings-only)
  • report_path (str): Path to HTML validation report
  • errors (list): Errors and/or warnings emitted by the validator

write_files(output_dir=".", validate=True)

Writes all study files to disk.

Parameters:

  • output_dir (str): Base output directory (default: current directory)
  • validate (bool): If True, runs validation before writing files (default: True)

Returns: Path to the created study directory ({output_dir}/{study_id}/)

Raises:

  • ValidationError if validate=True and the cBioPortal validator reports errors. Study files are not written. Pass validate=False to skip validation.

upload(study_dir, url=..., container=..., mount_path=..., report_dir=None)

Uploads a written study directory into a running cBioPortal instance via metaImport.py.

Parameters:

  • study_dir (str | Path): Path to the study directory produced by write_files()
  • url (str): cBioPortal instance URL (default: "http://localhost:8080/", or $CBIOPORTAL_URL)
  • container (str): Name of the cbioportal docker-compose service (default: "cbioportal", or $CBIOPORTAL_CONTAINER)
  • mount_path (str): Path inside the container where study_dir's parent is bind-mounted (default: "/study", or $CBIOPORTAL_MOUNT_PATH)
  • report_dir (str | Path, optional): Where to save the HTML import report (default: alongside study_dir)
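Because mount_path refers to where the parent of study_dir is mounted, the study's location inside the container follows from the directory name alone. The sketch below shows that mapping; container_study_path is a hypothetical helper, and the bind mount in the comment is an example setup, not a requirement stated by the package.

```python
from pathlib import Path, PurePosixPath


def container_study_path(study_dir, mount_path="/study"):
    """Translate a host study directory to its path inside the container."""
    # Only the final directory name survives; the host parent maps to mount_path.
    return str(PurePosixPath(mount_path) / Path(study_dir).name)


# Assumed setup: host ./my_studies is bind-mounted at /study in the container.
inside = container_study_path("./my_studies/brca_ocdo_2026")
```

If the import cannot find the study, a mismatch between the bind mount and mount_path is a likely cause worth checking first.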

Returns: UploadResult object with:

  • success (bool): Whether import succeeded
  • report_path (str): Path to HTML import report
  • errors (list): List of import errors if upload failed

Raises:

  • RuntimeError if Docker is not running or the container cannot be reached

Module-level functions

upload_gene_panel(panel_file, container=..., mount_path=...)

Imports a single gene panel definition file into a running cBioPortal instance via importGenePanel.pl.

Parameters:

  • panel_file (str | Path): Path to the panel definition file
  • container (str): docker-compose service name (default: "cbioportal", or $CBIOPORTAL_CONTAINER)
  • mount_path (str): Path inside the container where panel_file's parent is mounted (default: "/study", or $CBIOPORTAL_MOUNT_PATH)

fetch_datahub_study(study_id, output_dir=".")

Downloads and extracts a public study from the cBioPortal datahub.

Parameters:

  • study_id (str): Datahub study ID (e.g., "msk_impact_2017", "chol_tcga")
  • output_dir (str | Path): Where to extract the study (default: current directory)

Returns: Path to the extracted study directory

Raises:

  • ValueError if the study ID is not found in the datahub

fetch_datahub_panel(panel_name, output_dir=".")

Downloads a public gene panel definition from the cBioPortal datahub.

Parameters:

  • panel_name (str): Datahub panel name (e.g., "impact341", "impact468")
  • output_dir (str | Path): Where to save the file (default: current directory)

Returns: Path to the downloaded panel file

Raises:

  • ValueError if the panel name is not found in the datahub

Example Workflow

See the example notebook for a complete walkthrough using simulated data.

Supported Data Types (Current Version)

  • ✅ Clinical data (patient and sample attributes)
  • ✅ Mutation data (VCF → MAF conversion)
  • ⏳ Copy number alterations (CNA) - planned for future release
  • ⏳ Gene expression data - planned for future release
  • ⏳ Methylation data - planned for future release

Supported Workflows

  • ✅ Format clinical and genomic data into cBioPortal-compatible files
  • ✅ Validate study files locally (no cBioPortal instance required)
  • ✅ Upload studies into a running cBioPortal instance (optional)
  • ✅ Import gene panel definitions (optional)
  • ✅ Fetch public studies and panels from the cBioPortal datahub (optional)

Requirements

  • Python 3.10+
  • pandas
  • vcf2maf (optional, for VCF processing)
  • Docker with a running cBioPortal instance (optional, only for upload features)

External Tools

This package relies on the following external tools for mutation data processing:

vcf2maf (optional, for VCF processing):

  • Required only if you're including mutation data from VCF files
  • See vcf2maf installation guide for setup instructions
  • Requires a reference genome (GRCh37 or GRCh38)

Troubleshooting

Common Issues

"SAMPLE_ID duplicates found"

  • Ensure all values in your SAMPLE_ID column are unique
  • Check for accidentally duplicated rows in your data

"PATIENT_ID 'P123' not found in patient data"

  • Every patient ID referenced in sample data must exist in patient data
  • If you didn't provide patient data, this shouldn't happen (it's auto-generated)

"VCF file not found: /path/to/file.vcf"

  • Check that all file paths in the VCF_PATH column are correct
  • Ensure files are accessible from your current working directory

"vcf2maf not found"

  • Install vcf2maf following the installation guide
  • Ensure vcf2maf is available in your PATH
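A quick way to test the PATH condition from Python is shutil.which. The executable names tried below are assumptions (installations differ in whether the script is exposed as vcf2maf.pl or vcf2maf), and find_vcf2maf is a hypothetical helper.

```python
import shutil


def find_vcf2maf(candidates=("vcf2maf.pl", "vcf2maf")):
    """Return the first candidate executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


location = find_vcf2maf()
if location is None:
    print("vcf2maf was not found on PATH")
else:
    print(f"vcf2maf found at {location}")
```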

Validation fails with complex errors

  • Review the HTML validation report at the path provided
  • Common issues: incorrect cancer type, malformed column names, missing required fields

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details.

Citation

If you use cBioFormatter in your research, please mention the GitHub repository:

cBioFormatter: https://github.com/getwilds/cbioformatter

Future aim: We plan to submit cBioFormatter to the Journal of Open Source Software (JOSS) for peer review. Once published, a formal citation will be provided here.

Contact

Fred Hutch users:

External users:

Acknowledgments

  • Built to support the Fred Hutch Cancer Center cBioPortal instance
  • Uses cBioPortal's official validation tools
  • Part of the WILDS ecosystem
