Streamline conversion of clinical and genomic data into cBioPortal-compatible formats
cBioFormatter
A Python package for streamlined preparation and formatting of clinical and molecular genomic data for upload to cBioPortal.
Overview
cBioFormatter simplifies the process of converting your genomic data into cBioPortal-compatible formats. Designed for data scientists with basic Python knowledge, this package handles all the complexity of cBioPortal file formatting, validation, and metadata generation.
What it does:
- Converts clinical data (patient and sample attributes) into cBioPortal format
- Processes VCF files into MAF format for mutation data
- Generates all required metadata files automatically
- Validates your study using cBioPortal's official validator
- Creates case lists for sample grouping
- Uploads studies and gene panels into a running cBioPortal instance (optional)
- Fetches public studies and gene panels from the cBioPortal datahub (optional)
What you need:
- Basic Python knowledge (pandas DataFrames, module imports)
- Your clinical data (Excel, CSV, database query, anything that can be converted to a pandas DataFrame)
- VCF files for mutation data (optional)
- vcf2maf installed (for VCF processing, optional)
Installation
pip install cbioformatter
Additional requirements:
- vcf2maf (for mutation data processing, if using VCF files) - see vcf2maf installation guide
Development
For local development, clone the repository and install in editable mode with dev dependencies.
Using uv (recommended)
uv is a fast Python package manager. If you don't have it installed:
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
Then set up the project:
git clone https://github.com/getwilds/cbioformatter.git
cd cbioformatter
uv sync --extra dev
To run commands in the virtual environment:
uv run pytest # Run tests
uv run pytest --cov # Run tests with coverage
uv run ruff check . # Run linter
uv run ruff format . # Format code
uv run ipython # Interactive Python shell (or: uv run python)
Using pip
git clone https://github.com/getwilds/cbioformatter.git
cd cbioformatter
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -e ".[dev]"
To run tests and linting:
pytest # Run tests
pytest --cov # Run tests with coverage
ruff check . # Run linter
ruff format . # Format code
ipython # Interactive Python shell (or: python)
Quick Start
Basic Study with Clinical Data Only
import pandas as pd
from cbioformatter import ClinicalStudy
# Prepare your sample-level clinical data
# (typically loaded from a CSV, Excel file, or database query)
sample_df = pd.DataFrame({
'SAMPLE_ID': ['S001', 'S002', 'S003'],
'PATIENT_ID': ['P001', 'P001', 'P002'],
'TUMOR_TYPE': ['Primary', 'Metastasis', 'Primary'],
'AGE_AT_DIAGNOSIS': [45, 45, 67]
})
# sample_df looks like:
# | SAMPLE_ID | PATIENT_ID | TUMOR_TYPE | AGE_AT_DIAGNOSIS |
# |-----------|------------|------------|------------------|
# | S001 | P001 | Primary | 45 |
# | S002 | P001 | Metastasis | 45 |
# | S003 | P002 | Primary | 67 |
# Prepare your patient-level clinical data (optional)
patient_df = pd.DataFrame({
'PATIENT_ID': ['P001', 'P002'],
'SEX': ['Female', 'Male'],
'ETHNICITY': ['Hispanic', 'Asian']
})
# patient_df looks like:
# | PATIENT_ID | SEX | ETHNICITY |
# |------------|--------|-----------|
# | P001 | Female | Hispanic |
# | P002 | Male | Asian |
# Create and validate the study
study = ClinicalStudy(
study_id="brca_ocdo_2026",
name="Breast Cancer Study (Office of the Chief Data Officer 2026)",
description="Clinical and genomic data from breast cancer patients",
cancer_type="brca", # must be a valid cBioPortal cancer type
genome_build="GRCh38", # UCSC names ("hg19", "hg38", "mm10") or aliases ("GRCh37", "GRCh38", "GRCm38")
sample_data=sample_df,
patient_data=patient_df # optional
)
# Validate the study (generates temp files, runs validator, cleans up)
result = study.validate()
if result.is_valid:
    print("✓ Study is valid!")
    print(f"Validation report: {result.report_path}")
    # Write files to disk; write_files() returns the study directory path
    study_dir = study.write_files(output_dir="./my_studies")
    print(f"Study files written to: {study_dir}")
else:
    print("✗ Validation failed. Check the report for details:")
    print(f"Report: {result.report_path}")
Study with Mutation Data
# Add VCF file paths to your sample DataFrame
sample_df = pd.DataFrame({
'SAMPLE_ID': ['S001', 'S002', 'S003'],
'PATIENT_ID': ['P001', 'P001', 'P002'],
'TUMOR_TYPE': ['Primary', 'Metastasis', 'Primary'],
'VCF_PATH': [
'/data/vcf/S001.vcf',
'/data/vcf/S002.vcf',
None # This sample has no mutation data
]
})
# The rest is identical - mutation data is automatically detected
study = ClinicalStudy(
study_id="brca_ocdo_2026",
name="Breast Cancer Study (Office of the Chief Data Officer 2026)",
description="Clinical and genomic data from breast cancer patients",
cancer_type="brca",
genome_build="GRCh38",
sample_data=sample_df
)
result = study.validate()
if result.is_valid:
    study.write_files(output_dir="./my_studies")
Uploading to a Local cBioPortal Instance
If you're running a local cBioPortal instance (via Docker), you can upload the study directly:
# After write_files() has produced a study directory:
study_dir = study.write_files(output_dir="./my_studies")
study.upload(
study_dir,
url="http://localhost:8080/", # your cBioPortal URL
container="cbioportal", # docker-compose service name
)
Upload is an optional advanced step — it requires a running cBioPortal instance and Docker. Validation (study.validate()) does not require any of this; it runs locally so newcomers can format and validate without setting up infrastructure.
Fetching Public Studies from the cBioPortal Datahub
from cbioformatter import fetch_datahub_study, fetch_datahub_panel
# Download a public study (returns the path to the extracted directory)
study_dir = fetch_datahub_study("msk_impact_2017", output_dir="./studies")
# Download a gene panel definition
panel_file = fetch_datahub_panel("impact341", output_dir="./panels")
Useful for seeding a fresh cBioPortal instance with reference data, or for round-tripping public studies through cbioformatter for testing.
Features
Clinical Data Handling
Required columns:
- SAMPLE_ID in sample DataFrame (must be unique)
- PATIENT_ID in patient DataFrame if provided (must be unique)
Smart defaults:
- If patient_data is not provided, it's auto-generated from unique PATIENT_ID values in sample_data
- If PATIENT_ID column is missing from sample_data, each sample is assigned its own patient (PATIENT_ID = SAMPLE_ID)
- Column names are automatically cleaned for cBioPortal compatibility while preserving display names
- Data types are automatically inferred: NUMBER (int/float), BOOLEAN (bool), STRING (everything else)
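The defaults above can be illustrated with a minimal sketch on plain Python values. The helper names (clean_column_name, infer_datatype, derive_patients) and the exact cleaning rule (uppercase, non-alphanumerics collapsed to underscores) are assumptions for illustration, not cBioFormatter's actual implementation.

```python
import re


def clean_column_name(name: str) -> str:
    """Hypothetical cleaner: uppercase, runs of non-alphanumerics become '_'."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip()).strip("_")
    return cleaned.upper()


def infer_datatype(values) -> str:
    """Map a column's non-missing values to NUMBER, BOOLEAN, or STRING."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, bool) for v in present):
        return "BOOLEAN"
    # bool is a subclass of int, so it is excluded explicitly from NUMBER.
    if present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        return "NUMBER"
    return "STRING"


def derive_patients(sample_rows):
    """Build minimal patient rows; a sample without PATIENT_ID becomes its own patient."""
    patients = {}
    for row in sample_rows:
        patient_id = row.get("PATIENT_ID") or row["SAMPLE_ID"]
        patients.setdefault(patient_id, {"PATIENT_ID": patient_id})
    return list(patients.values())


print(clean_column_name("Age at Diagnosis"))  # AGE_AT_DIAGNOSIS
print(infer_datatype([45, 45, 67]))           # NUMBER
```

Checking bool before the numeric types matters because Python treats True and False as integers.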
Validation:
- Ensures all SAMPLE_ID values are unique
- Ensures all PATIENT_ID values are unique (if patient data provided)
- Validates referential integrity (all patient IDs in samples exist in patient data)
- Failures raise clear exceptions with specific issues identified
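As a rough sketch of the checks listed above, the following helper collects problems instead of raising, so every issue can be reported at once. The function name check_clinical_ids is hypothetical, and the message wording only loosely mirrors the errors quoted in Troubleshooting.

```python
from collections import Counter


def check_clinical_ids(sample_rows, patient_rows=None):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []

    # Uniqueness of SAMPLE_ID
    sample_counts = Counter(row["SAMPLE_ID"] for row in sample_rows)
    duplicate_samples = sorted(sid for sid, n in sample_counts.items() if n > 1)
    if duplicate_samples:
        problems.append(f"SAMPLE_ID duplicates found: {duplicate_samples}")

    if patient_rows is not None:
        # Uniqueness of PATIENT_ID in the patient table
        patient_counts = Counter(row["PATIENT_ID"] for row in patient_rows)
        duplicate_patients = sorted(pid for pid, n in patient_counts.items() if n > 1)
        if duplicate_patients:
            problems.append(f"PATIENT_ID duplicates found: {duplicate_patients}")

        # Referential integrity: every patient referenced by a sample must exist
        known = set(patient_counts)
        referenced = {row["PATIENT_ID"] for row in sample_rows if row.get("PATIENT_ID")}
        for patient_id in sorted(referenced - known):
            problems.append(f"PATIENT_ID {patient_id!r} not found in patient data")

    return problems
```

Running the same checks on your own DataFrame rows before constructing a study can make failures easier to trace back to the source data.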
Mutation Data Processing
Input: VCF files (one per sample)
How it works:
- Add a VCF_PATH column to your sample_data DataFrame with file paths
- VCF files are automatically converted to MAF format using vcf2maf
- All MAF files are concatenated into a single mutation file
- Sample IDs are correctly mapped to Tumor_Sample_Barcode
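The concatenation step can be sketched as follows, assuming every per-sample MAF shares the same columns and already contains a Tumor_Sample_Barcode column. concat_mafs is a hypothetical helper for illustration; the package's internal logic may differ.

```python
import csv
import io


def concat_mafs(maf_text_by_sample):
    """Merge per-sample MAF text into one table, stamping each row with its sample ID."""
    out = io.StringIO()
    writer = None
    for sample_id, text in maf_text_by_sample.items():
        # Drop blank lines and comment lines such as "#version 2.4".
        lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
        reader = csv.DictReader(lines, delimiter="\t")
        for row in reader:
            row["Tumor_Sample_Barcode"] = sample_id
            if writer is None:
                # The header is written once, taken from the first file with rows.
                writer = csv.DictWriter(
                    out,
                    fieldnames=reader.fieldnames,
                    delimiter="\t",
                    lineterminator="\n",
                )
                writer.writeheader()
            writer.writerow(row)
    return out.getvalue()


maf_a = "#version 2.4\nHugo_Symbol\tTumor_Sample_Barcode\nTP53\tTUMOR\n"
maf_b = "Hugo_Symbol\tTumor_Sample_Barcode\nBRCA1\tTUMOR\n"
merged = concat_mafs({"S001": maf_a, "S002": maf_b})
```

Overwriting Tumor_Sample_Barcode per file is what ties each mutation row back to the SAMPLE_ID used in the clinical data.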
Flexible data availability:
- If VCF_PATH column is missing entirely → no mutation data included
- If some samples have VCF paths and others don't → mutation data included only for samples with valid paths
- At least one valid VCF path must be provided if the column exists
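The three cases above amount to a small decision rule. The sketch below is an assumption about how such a rule could look: select_vcf_samples is a hypothetical name, and "valid" is simplified to "non-empty" (it does not check that the file exists).

```python
def select_vcf_samples(sample_rows):
    """Return {SAMPLE_ID: path} for samples with a VCF path, or None if there is no VCF_PATH column."""
    if not any("VCF_PATH" in row for row in sample_rows):
        return None  # column absent: the study carries no mutation data

    with_vcf = {
        row["SAMPLE_ID"]: row["VCF_PATH"]
        for row in sample_rows
        if row.get("VCF_PATH")
    }
    if not with_vcf:
        raise ValueError("VCF_PATH column is present but no sample has a path")
    return with_vcf


rows = [
    {"SAMPLE_ID": "S001", "VCF_PATH": "/data/vcf/S001.vcf"},
    {"SAMPLE_ID": "S002", "VCF_PATH": "/data/vcf/S002.vcf"},
    {"SAMPLE_ID": "S003", "VCF_PATH": None},  # no mutation data for this sample
]
print(select_vcf_samples(rows))
# {'S001': '/data/vcf/S001.vcf', 'S002': '/data/vcf/S002.vcf'}
```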
Requirements:
- vcf2maf must be installed (see installation guide)
- VCF files must match the specified genome build (GRCh37 or GRCh38)
- Reference genome files for vcf2maf (users provide their own reference path)
Study Validation
The validate() method:
- Creates temporary files in cBioPortal format
- Runs the official cBioPortal validator (from cBioPortal datahub-study-curation-tools)
- Generates an HTML validation report
- Cleans up temporary files
- Returns a validation result object
Validation result object:
result.is_valid # True if validation passed (clean or warnings-only)
result.report_path # Path to HTML validation report
result.errors # Errors AND/OR warnings emitted by the validator
is_valid is True for a clean validation and for warnings-only results; in the warnings-only case, result.errors is populated and write_files(validate=True) proceeds with a UserWarning. Errors (validator exit code 1 or 2) raise ValidationError from write_files(validate=True) and study files are not written.
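The outcome rules above can be summarized in a small sketch. It assumes the validator signals a warnings-only run with exit code 3 (the convention used by cBioPortal's validateData.py, as far as we know) and defines a stand-in ValidationError so the snippet runs on its own; interpret_exit_code is a hypothetical name, not part of the package.

```python
import warnings


class ValidationError(Exception):
    """Stand-in for the package's ValidationError, defined here so the sketch is self-contained."""


def interpret_exit_code(code: int) -> bool:
    """Return is_valid for a validator exit code, warning or raising as described above."""
    if code == 0:
        return True  # clean validation
    if code == 3:
        # Assumed warnings-only code: still valid, but the caller is notified.
        warnings.warn("validator reported warnings; files are still written", UserWarning)
        return True
    if code in (1, 2):
        raise ValidationError(f"validator exited with code {code}; study files not written")
    raise RuntimeError(f"unexpected validator exit code: {code}")
```

In practice this means a warnings-only study still needs result.errors reviewed, even though is_valid is True.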
Validator acquisition: The cBioPortal validator is AGPL-3.0 licensed and lives in a separate repository, so cbioformatter does not bundle it. On first validate() call, the validator is cloned into ~/.cache/cbioformatter/validator/ (~5 MB, requires git and internet). Subsequent calls reuse the cache.
For air-gapped or CI environments, pre-clone the validator and set CBIOFORMATTER_VALIDATOR_PATH:
git clone --depth 1 https://github.com/cBioPortal/datahub-study-curation-tools.git
export CBIOFORMATTER_VALIDATOR_PATH=$(pwd)/datahub-study-curation-tools/validation/validator
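The lookup order described above (environment variable first, then the cache directory) might look roughly like this. resolve_validator_dir is a hypothetical helper; only the variable name and the cache location come from the text above.

```python
import os
from pathlib import Path


def resolve_validator_dir(environ=None) -> Path:
    """Prefer CBIOFORMATTER_VALIDATOR_PATH; otherwise fall back to the default cache directory."""
    environ = os.environ if environ is None else environ
    override = environ.get("CBIOFORMATTER_VALIDATOR_PATH")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "cbioformatter" / "validator"


print(resolve_validator_dir({"CBIOFORMATTER_VALIDATOR_PATH": "/opt/validator"}))
```

Passing the environment as a parameter keeps the function easy to test in CI without mutating os.environ.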
File Output
The write_files() method generates a complete cBioPortal study directory:
my_studies/
└── brca_ocdo_2026/
    ├── meta_study.txt
    ├── meta_clinical_patient.txt
    ├── data_clinical_patient.txt
    ├── meta_clinical_sample.txt
    ├── data_clinical_sample.txt
    ├── meta_mutations.txt # if mutation data provided
    ├── data_mutations.txt # if mutation data provided
    └── case_lists/
        ├── cases_all.txt
        └── cases_sequenced.txt # if mutation data provided
Parameters:
- output_dir (default: ".") - Base directory for output. Study files are created in {output_dir}/{study_id}/
- validate (default: True) - If True, runs validation before writing files. Set to False to skip validation (use with caution).
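To give a feel for what two of the generated files contain, here is a minimal sketch that renders meta_study.txt and case_lists/cases_all.txt as text. The field names follow cBioPortal's published file formats as we understand them; the helper names and the case list name and description strings are illustrative assumptions, not the package's actual output.

```python
def render_meta_study(study_id, name, description, cancer_type, reference_genome):
    """Render meta_study.txt as 'key: value' lines."""
    fields = {
        "type_of_cancer": cancer_type,
        "cancer_study_identifier": study_id,
        "name": name,
        "description": description,
        "reference_genome": reference_genome,  # UCSC form, e.g. "hg38"
    }
    return "".join(f"{key}: {value}\n" for key, value in fields.items())


def render_cases_all(study_id, sample_ids):
    """Render case_lists/cases_all.txt; sample IDs are tab-separated on one line."""
    fields = {
        "cancer_study_identifier": study_id,
        "stable_id": f"{study_id}_all",
        "case_list_name": "All samples",
        "case_list_description": f"All samples ({len(sample_ids)} samples)",
        "case_list_ids": "\t".join(sample_ids),
    }
    return "".join(f"{key}: {value}\n" for key, value in fields.items())


print(render_cases_all("brca_ocdo_2026", ["S001", "S002", "S003"]))
```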
Uploading to cBioPortal (Optional)
For users running their own cBioPortal instance, cbioformatter can push studies and gene panels directly into the running server. This is a fully optional advanced feature — the formatting and validation features above work standalone.
Requirements:
- A running cBioPortal instance (typically via Docker)
- Docker accessible on your machine (docker compose available in your PATH)
- The host directory containing your study must be bind-mounted into the cBioPortal container
Uploading a study:
study.upload(
study_dir, # Path returned by write_files()
url="http://localhost:8080/", # cBioPortal instance URL
container="cbioportal", # docker-compose service name
mount_path="/study", # path inside container where study_dir's parent is mounted
)
The upload() method invokes metaImport.py inside the cBioPortal container and returns a result object with the import status and a link to the HTML report.
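Because the import runs inside the container, the host path has to be translated to the path the container sees. A minimal sketch of that translation, assuming the parent of study_dir is what gets bind-mounted at mount_path (container_study_path is a hypothetical helper):

```python
from pathlib import Path, PurePosixPath


def container_study_path(study_dir, mount_path="/study") -> str:
    """Map a host study directory to its location inside the container."""
    # Only the final directory name survives; everything above it is replaced by mount_path.
    return str(PurePosixPath(mount_path) / Path(study_dir).name)


print(container_study_path("./my_studies/brca_ocdo_2026"))  # /study/brca_ocdo_2026
```

If an import fails with a "directory not found" style error, comparing this expected path against the container's actual mounts is a reasonable first check.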
Uploading a gene panel:
from cbioformatter import upload_gene_panel
upload_gene_panel(
panel_file="./panels/data_gene_panel_impact341.txt",
container="cbioportal",
mount_path="/study",
)
Gene panels are study-independent reference data — they need to be loaded into cBioPortal before any studies that reference them.
Environment variable defaults:
- CBIOPORTAL_URL - overrides the default url
- CBIOPORTAL_CONTAINER - overrides the default container
- CBIOPORTAL_MOUNT_PATH - overrides the default mount_path
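One plausible way to resolve these settings is sketched below. The precedence shown (explicit argument, then environment variable, then built-in default) is an assumption; the text above only states that the variables override the defaults. resolve_setting and DEFAULTS are hypothetical names.

```python
import os

# Setting name -> (environment variable, built-in default)
DEFAULTS = {
    "url": ("CBIOPORTAL_URL", "http://localhost:8080/"),
    "container": ("CBIOPORTAL_CONTAINER", "cbioportal"),
    "mount_path": ("CBIOPORTAL_MOUNT_PATH", "/study"),
}


def resolve_setting(name, explicit=None, environ=None):
    """Explicit argument wins, then the environment variable, then the built-in default."""
    environ = os.environ if environ is None else environ
    env_var, fallback = DEFAULTS[name]
    if explicit is not None:
        return explicit
    return environ.get(env_var, fallback)


print(resolve_setting("container", environ={"CBIOPORTAL_CONTAINER": "cbio-dev"}))  # cbio-dev
```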
Fetching from the cBioPortal Datahub (Optional)
The cBioPortal datahub hosts public studies and gene panel definitions. cbioformatter provides utilities to download them:
from cbioformatter import fetch_datahub_study, fetch_datahub_panel
study_dir = fetch_datahub_study("msk_impact_2017", output_dir="./studies")
panel_file = fetch_datahub_panel("impact341", output_dir="./panels")
These functions return local paths; they do not automatically upload the fetched data. Combine with upload() / upload_gene_panel() for a complete fetch-and-load workflow.
API Reference
ClinicalStudy
ClinicalStudy(
study_id: str,
name: str,
description: str,
cancer_type: str,
genome_build: str,
sample_data: pd.DataFrame,
patient_data: pd.DataFrame = None
)
Parameters:
- study_id: Unique identifier for the study (no spaces, lowercase recommended)
- name: Human-readable study name
- description: Brief description of the study
- cancer_type: Valid cBioPortal cancer type (see cBioPortal documentation)
- genome_build: Reference genome build. Accepts UCSC names ("hg19", "hg38", "mm10") or NCBI/Ensembl aliases ("GRCh37", "GRCh38", "GRCm38"); aliases are translated to the UCSC form on write since cBioPortal's validator only accepts UCSC names
- sample_data: pandas DataFrame with sample-level clinical attributes. Must include SAMPLE_ID. Optionally includes PATIENT_ID and VCF_PATH
- patient_data: Optional pandas DataFrame with patient-level clinical attributes. Must include PATIENT_ID if provided
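The genome_build alias translation can be pictured as a simple lookup. This is a sketch only: to_ucsc is a hypothetical name, and raising ValueError for unknown builds is an assumption about how invalid input might be handled.

```python
UCSC_NAMES = {"hg19", "hg38", "mm10"}
UCSC_ALIASES = {"GRCh37": "hg19", "GRCh38": "hg38", "GRCm38": "mm10"}


def to_ucsc(genome_build: str) -> str:
    """Return the UCSC name for a genome build, translating NCBI/Ensembl aliases."""
    if genome_build in UCSC_NAMES:
        return genome_build
    try:
        return UCSC_ALIASES[genome_build]
    except KeyError:
        raise ValueError(f"unsupported genome build: {genome_build!r}") from None


print(to_ucsc("GRCh38"))  # hg38
```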
Methods:
validate()
Validates the study using cBioPortal's official validator.
Returns: ValidationResult object with:
- is_valid (bool): Whether validation passed (clean or warnings-only)
- report_path (str): Path to HTML validation report
- errors (list): Errors and/or warnings emitted by the validator
write_files(output_dir=".", validate=True)
Writes all study files to disk.
Parameters:
- output_dir (str): Base output directory (default: current directory)
- validate (bool): If True, runs validation before writing files (default: True)
Returns: Path to the created study directory ({output_dir}/{study_id}/)
Raises:
- ValidationError if validate=True and the cBioPortal validator reports errors. Study files are not written. Pass validate=False to skip validation.
upload(study_dir, url=..., container=..., mount_path=..., report_dir=None)
Uploads a written study directory into a running cBioPortal instance via metaImport.py.
Parameters:
- study_dir (str | Path): Path to the study directory produced by write_files()
- url (str): cBioPortal instance URL (default: "http://localhost:8080/", or $CBIOPORTAL_URL)
- container (str): Name of the cbioportal docker-compose service (default: "cbioportal", or $CBIOPORTAL_CONTAINER)
- mount_path (str): Path inside the container where study_dir's parent is bind-mounted (default: "/study", or $CBIOPORTAL_MOUNT_PATH)
- report_dir (str | Path, optional): Where to save the HTML import report (default: alongside study_dir)
Returns: UploadResult object with:
- success (bool): Whether import succeeded
- report_path (str): Path to HTML import report
- errors (list): List of import errors if upload failed
Raises:
- RuntimeError if Docker is not running or the container cannot be reached
Module-level functions
upload_gene_panel(panel_file, container=..., mount_path=...)
Imports a single gene panel definition file into a running cBioPortal instance via importGenePanel.pl.
Parameters:
- panel_file (str | Path): Path to the panel definition file
- container (str): docker-compose service name (default: "cbioportal", or $CBIOPORTAL_CONTAINER)
- mount_path (str): Path inside the container where panel_file's parent is mounted (default: "/study", or $CBIOPORTAL_MOUNT_PATH)
fetch_datahub_study(study_id, output_dir=".")
Downloads and extracts a public study from the cBioPortal datahub.
Parameters:
- study_id (str): Datahub study ID (e.g., "msk_impact_2017", "chol_tcga")
- output_dir (str | Path): Where to extract the study (default: current directory)
Returns: Path to the extracted study directory
Raises:
- ValueError if the study ID is not found in the datahub
fetch_datahub_panel(panel_name, output_dir=".")
Downloads a public gene panel definition from the cBioPortal datahub.
Parameters:
- panel_name (str): Datahub panel name (e.g., "impact341", "impact468")
- output_dir (str | Path): Where to save the file (default: current directory)
Returns: Path to the downloaded panel file
Raises:
- ValueError if the panel name is not found in the datahub
Example Workflow
See the example notebook for a complete walkthrough using simulated data.
Supported Data Types (Current Version)
- ✅ Clinical data (patient and sample attributes)
- ✅ Mutation data (VCF → MAF conversion)
- ⏳ Copy number alterations (CNA) - planned for future release
- ⏳ Gene expression data - planned for future release
- ⏳ Methylation data - planned for future release
Supported Workflows
- ✅ Format clinical and genomic data into cBioPortal-compatible files
- ✅ Validate study files locally (no cBioPortal instance required)
- ⏳ Upload studies into a running cBioPortal instance - planned
- ⏳ Import gene panel definitions - planned
- ⏳ Fetch public studies and panels from the cBioPortal datahub - planned
Requirements
- Python 3.10+
- pandas
- vcf2maf (optional, for VCF processing)
- Docker with a running cBioPortal instance (optional, only for upload features)
External Tools
This package relies on the following external tools for mutation data processing:
vcf2maf (optional, for VCF processing):
- Required only if you're including mutation data from VCF files
- See vcf2maf installation guide for setup instructions
- Requires a reference genome (GRCh37 or GRCh38)
Troubleshooting
Common Issues
"SAMPLE_ID duplicates found"
- Ensure all values in your SAMPLE_ID column are unique
- Check for accidentally duplicated rows in your data
"PATIENT_ID 'P123' not found in patient data"
- Every patient ID referenced in sample data must exist in patient data
- If you didn't provide patient data, this shouldn't happen (it's auto-generated)
"VCF file not found: /path/to/file.vcf"
- Check that all file paths in the VCF_PATH column are correct
- Ensure files are accessible from your current working directory
"vcf2maf not found"
- Install vcf2maf following the installation guide
- Ensure vcf2maf is available in your PATH
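A quick way to confirm the PATH requirement is to look the executable up from Python. The sketch below assumes the tool is installed under the name vcf2maf.pl (the script name used by the upstream project) or vcf2maf; adjust the candidates to match your installation. find_vcf2maf is a hypothetical helper.

```python
import shutil


def find_vcf2maf(candidates=("vcf2maf.pl", "vcf2maf")):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


location = find_vcf2maf()
print(location or "vcf2maf not found on PATH")
```

Running this in the same environment (virtualenv, conda env, or container) that runs cBioFormatter matters, since PATH can differ between shells.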
Validation fails with complex errors
- Review the HTML validation report at the path provided
- Common issues: incorrect cancer type, malformed column names, missing required fields
Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
License
MIT License - see LICENSE for details.
Citation
If you use cBioFormatter in your research, please mention the GitHub repository:
cBioFormatter: https://github.com/getwilds/cbioformatter
Future aim: We plan to submit cBioFormatter to the Journal of Open Source Software (JOSS) for peer review. Once published, a formal citation will be provided here.
Contact
Fred Hutch users:
- FH-Data Slack: #cbioportal-support channel (or reach out to Taylor Firman or Emma Bishop)
- Research Computing Data House Call
External users:
- Email: wilds@fredhutch.org
- Issues: GitHub Issues
- Questions: GitHub Discussions
Acknowledgments
- Built to support the Fred Hutch Cancer Center cBioPortal instance
- Uses cBioPortal's official validation tools
- Part of the WILDS ecosystem