Streamline conversion of clinical and genomic data into cBioPortal-compatible formats

cBioFormatter

A Python package for streamlined preparation and formatting of clinical and molecular genomic data for upload to cBioPortal.

Overview

cBioFormatter simplifies the process of converting your genomic data into cBioPortal-compatible formats. Designed for data scientists with basic Python knowledge, this package handles all the complexity of cBioPortal file formatting, validation, and metadata generation.

What it does:

Converts clinical data (patient and sample attributes) into cBioPortal format
Processes VCF files into MAF format for mutation data
Generates all required metadata files automatically
Validates your study using cBioPortal's official validator
Creates case lists for sample grouping
Uploads studies and gene panels into a running cBioPortal instance (optional)
Fetches public studies and gene panels from the cBioPortal datahub (optional)

What you need:

Basic Python knowledge (pandas DataFrames, module imports)
Your clinical data (Excel, CSV, database query, anything that can be converted to a pandas DataFrame)
VCF files for mutation data (optional)
vcf2maf installed (for VCF processing, optional)

Installation

pip install cbioformatter

Additional requirements:

vcf2maf (for mutation data processing, if using VCF files) - see vcf2maf installation guide

Development

For local development, clone the repository and install in editable mode with dev dependencies.

Using uv (recommended)

uv is a fast Python package manager. If you don't have it installed:

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Then set up the project:

git clone https://github.com/getwilds/cbioformatter.git
cd cbioformatter
uv sync --extra dev

To run commands in the virtual environment:

uv run pytest              # Run tests
uv run pytest --cov        # Run tests with coverage
uv run ruff check .        # Run linter
uv run ruff format .       # Format code
uv run ipython             # Interactive Python shell (or: uv run python)

Using pip

git clone https://github.com/getwilds/cbioformatter.git
cd cbioformatter
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e ".[dev]"

To run tests and linting:

pytest                     # Run tests
pytest --cov               # Run tests with coverage
ruff check .               # Run linter
ruff format .              # Format code
ipython                    # Interactive Python shell (or: python)

Quick Start

Basic Study with Clinical Data Only

import pandas as pd
from cbioformatter import ClinicalStudy

# Prepare your sample-level clinical data
# (typically loaded from a CSV, Excel file, or database query)
sample_df = pd.DataFrame({
    'SAMPLE_ID': ['S001', 'S002', 'S003'],
    'PATIENT_ID': ['P001', 'P001', 'P002'],
    'TUMOR_TYPE': ['Primary', 'Metastasis', 'Primary'],
    'AGE_AT_DIAGNOSIS': [45, 45, 67]
})

# sample_df looks like:
# | SAMPLE_ID | PATIENT_ID | TUMOR_TYPE | AGE_AT_DIAGNOSIS |
# |-----------|------------|------------|------------------|
# | S001      | P001       | Primary    | 45               |
# | S002      | P001       | Metastasis | 45               |
# | S003      | P002       | Primary    | 67               |

# Prepare your patient-level clinical data (optional)
patient_df = pd.DataFrame({
    'PATIENT_ID': ['P001', 'P002'],
    'SEX': ['Female', 'Male'],
    'ETHNICITY': ['Hispanic', 'Asian']
})

# patient_df looks like:
# | PATIENT_ID | SEX    | ETHNICITY |
# |------------|--------|-----------|
# | P001       | Female | Hispanic  |
# | P002       | Male   | Asian     |

# Create and validate the study
study = ClinicalStudy(
    study_id="brca_ocdo_2026",
    name="Breast Cancer Study (Office of the Chief Data Officer 2026)",
    description="Clinical and genomic data from breast cancer patients",
    cancer_type="brca",  # must be a valid cBioPortal cancer type
    genome_build="GRCh38",  # UCSC names ("hg19", "hg38", "mm10") or aliases ("GRCh37", "GRCh38", "GRCm38")
    sample_data=sample_df,
    patient_data=patient_df  # optional
)

# Validate the study (generates temp files, runs validator, cleans up)
result = study.validate()

if result.is_valid:
    print("✓ Study is valid!")
    print(f"Validation report: {result.report_path}")
    
    # Write files to disk
    study.write_files(output_dir="./my_studies")
    print("Study files written to: ./my_studies/brca_ocdo_2026/")
else:
    print("✗ Validation failed. Check the report for details:")
    print(f"Report: {result.report_path}")

Study with Mutation Data

import pandas as pd
from cbioformatter import ClinicalStudy

# Add VCF file paths to your sample DataFrame
sample_df = pd.DataFrame({
    'SAMPLE_ID': ['S001', 'S002', 'S003'],
    'PATIENT_ID': ['P001', 'P001', 'P002'],
    'TUMOR_TYPE': ['Primary', 'Metastasis', 'Primary'],
    'VCF_PATH': [
        '/data/vcf/S001.vcf',
        '/data/vcf/S002.vcf',
        None  # This sample has no mutation data
    ]
})

# The rest is identical - mutation data is automatically detected
study = ClinicalStudy(
    study_id="brca_ocdo_2026",
    name="Breast Cancer Study (Office of the Chief Data Officer 2026)",
    description="Clinical and genomic data from breast cancer patients",
    cancer_type="brca",
    genome_build="GRCh38",
    sample_data=sample_df
)

result = study.validate()
if result.is_valid:
    study.write_files(output_dir="./my_studies")

Uploading to a Local cBioPortal Instance

If you're running a local cBioPortal instance (via Docker), you can upload the study directly:

# write_files() returns the path to the study directory it created:
study_dir = study.write_files(output_dir="./my_studies")

study.upload(
    study_dir,
    url="http://localhost:8080/",   # your cBioPortal URL
    container="cbioportal",          # docker-compose service name
)

Upload is an optional advanced step that requires a running cBioPortal instance and Docker. Validation (study.validate()) requires neither: it runs locally, so newcomers can format and validate studies without setting up infrastructure.

Fetching Public Studies from the cBioPortal Datahub

from cbioformatter import fetch_datahub_study, fetch_datahub_panel

# Download a public study (returns the path to the extracted directory)
study_dir = fetch_datahub_study("msk_impact_2017", output_dir="./studies")

# Download a gene panel definition
panel_file = fetch_datahub_panel("impact341", output_dir="./panels")

Useful for seeding a fresh cBioPortal instance with reference data, or for round-tripping public studies through cbioformatter for testing.

Features

Clinical Data Handling

Required columns:

SAMPLE_ID in sample DataFrame (must be unique)
PATIENT_ID in patient DataFrame if provided (must be unique)

Smart defaults:

If patient_data is not provided, it's auto-generated from unique PATIENT_ID values in sample_data
If PATIENT_ID column is missing from sample_data, each sample is assigned its own patient (PATIENT_ID = SAMPLE_ID)
Column names are automatically cleaned for cBioPortal compatibility while preserving display names
Data types are automatically inferred: NUMBER (int/float), BOOLEAN (bool), STRING (everything else)
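
The type inference rule above can be sketched in plain Python. This is an illustration only, not cbioformatter's implementation: infer_cbioportal_type is a hypothetical helper, and treating an all-missing column as STRING is an assumption.

```python
def infer_cbioportal_type(values):
    """Return "NUMBER", "BOOLEAN", or "STRING" for a column's values.

    Hypothetical helper for illustration; not part of cbioformatter's API.
    """
    # Ignore missing values when deciding the type.
    present = [v for v in values if v is not None]
    if not present:
        return "STRING"  # assumption: an all-missing column falls back to STRING
    # bool is a subclass of int in Python, so test for it first.
    if all(isinstance(v, bool) for v in present):
        return "BOOLEAN"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "NUMBER"
    return "STRING"


columns = {
    "AGE_AT_DIAGNOSIS": [45, 45, 67],
    "IS_METASTATIC": [False, True, False],
    "TUMOR_TYPE": ["Primary", "Metastasis", "Primary"],
}
inferred = {name: infer_cbioportal_type(vals) for name, vals in columns.items()}
print(inferred)
```

Checking for bool before int matters because isinstance(True, int) is True in Python; without that ordering, boolean columns would be labeled NUMBER.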

Validation:

Ensures all SAMPLE_ID values are unique
Ensures all PATIENT_ID values are unique (if patient data provided)
Validates referential integrity (all patient IDs in samples exist in patient data)
Failures raise clear exceptions with specific issues identified
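
To see what these checks amount to, here is a minimal dependency-free sketch that works on lists of dicts instead of DataFrames. The function name and the message wording are assumptions modeled on the messages in the Troubleshooting section; the package itself raises exceptions rather than returning a list.

```python
from collections import Counter


def check_clinical_ids(sample_rows, patient_rows=None):
    """Return a list of human-readable problems (empty if none).

    Hypothetical helper for illustration; not cbioformatter's API.
    """
    problems = []

    # Uniqueness of SAMPLE_ID.
    sample_counts = Counter(row["SAMPLE_ID"] for row in sample_rows)
    dup_samples = sorted(s for s, n in sample_counts.items() if n > 1)
    if dup_samples:
        problems.append(f"SAMPLE_ID duplicates found: {dup_samples}")

    if patient_rows is not None:
        # Uniqueness of PATIENT_ID in the patient table.
        patient_counts = Counter(row["PATIENT_ID"] for row in patient_rows)
        dup_patients = sorted(p for p, n in patient_counts.items() if n > 1)
        if dup_patients:
            problems.append(f"PATIENT_ID duplicates found: {dup_patients}")

        # Referential integrity: every patient referenced by a sample must exist.
        # A sample without PATIENT_ID falls back to its own SAMPLE_ID.
        known = set(patient_counts)
        referenced = {row.get("PATIENT_ID", row["SAMPLE_ID"]) for row in sample_rows}
        for pid in sorted(referenced - known):
            problems.append(f"PATIENT_ID {pid!r} not found in patient data")

    return problems


samples = [
    {"SAMPLE_ID": "S001", "PATIENT_ID": "P001"},
    {"SAMPLE_ID": "S002", "PATIENT_ID": "P001"},
    {"SAMPLE_ID": "S003", "PATIENT_ID": "P003"},
]
patients = [{"PATIENT_ID": "P001"}, {"PATIENT_ID": "P002"}]
problems = check_clinical_ids(samples, patients)
print(problems)
```

Running a check like this on your own data before constructing a ClinicalStudy can make it easier to locate the offending rows.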

Mutation Data Processing

Input: VCF files (one per sample)

How it works:

Add a VCF_PATH column to your sample_data DataFrame with file paths
VCF files are automatically converted to MAF format using vcf2maf
All MAF files are concatenated into a single mutation file
Sample IDs are correctly mapped to Tumor_Sample_Barcode
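
The concatenation and barcode-mapping steps can be pictured with a small sketch. It assumes every per-sample MAF has the same columns, includes a Tumor_Sample_Barcode column, and may start with comment lines such as "#version 2.4"; concat_mafs is a hypothetical name, and the package's own merging logic may differ.

```python
import csv
import io


def concat_mafs(maf_texts_by_sample):
    """Merge per-sample MAF texts into one tab-separated MAF string.

    Hypothetical helper for illustration; assumes identical columns per file.
    """
    out = io.StringIO()
    writer = None
    for sample_id, text in maf_texts_by_sample.items():
        # Drop blank lines and comment lines such as "#version 2.4".
        lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
        reader = csv.DictReader(lines, delimiter="\t")
        for row in reader:
            if writer is None:
                # The header is written once, when the first data row is seen.
                writer = csv.DictWriter(
                    out,
                    fieldnames=reader.fieldnames,
                    delimiter="\t",
                    lineterminator="\n",
                )
                writer.writeheader()
            # Map the row to the study's sample ID.
            row["Tumor_Sample_Barcode"] = sample_id
            writer.writerow(row)
    return out.getvalue()


maf_a = "#version 2.4\nHugo_Symbol\tTumor_Sample_Barcode\nTP53\tTUMOR\n"
maf_b = "#version 2.4\nHugo_Symbol\tTumor_Sample_Barcode\nPIK3CA\tTUMOR\nBRCA1\tTUMOR\n"
merged = concat_mafs({"S001": maf_a, "S002": maf_b})
print(merged)
```

The point of the barcode rewrite is that every row in the merged file must carry a SAMPLE_ID that exists in the clinical sample data.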

Flexible data availability:

If VCF_PATH column is missing entirely → no mutation data included
If some samples have VCF paths and others don't → mutation data included only for samples with valid paths
At least one valid VCF path must be provided if the column exists
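
These three rules can be expressed as a short sketch over plain dict rows. select_vcf_paths is a hypothetical helper, and it only checks that a path is present, not that the file exists on disk.

```python
def select_vcf_paths(sample_rows):
    """Map SAMPLE_ID -> VCF path for samples that have one.

    Returns None when no row has a VCF_PATH key (no mutation data).
    Raises ValueError when the column exists but every value is empty.
    Hypothetical helper for illustration; not cbioformatter's API.
    """
    if not any("VCF_PATH" in row for row in sample_rows):
        return None
    paths = {
        row["SAMPLE_ID"]: row["VCF_PATH"]
        for row in sample_rows
        if row.get("VCF_PATH")
    }
    if not paths:
        raise ValueError("VCF_PATH column is present but contains no paths")
    return paths


rows = [
    {"SAMPLE_ID": "S001", "VCF_PATH": "/data/vcf/S001.vcf"},
    {"SAMPLE_ID": "S002", "VCF_PATH": "/data/vcf/S002.vcf"},
    {"SAMPLE_ID": "S003", "VCF_PATH": None},  # no mutation data for this sample
]
vcf_paths = select_vcf_paths(rows)
print(vcf_paths)
```

In a DataFrame, missing values usually appear as NaN rather than None, so a pandas-based version of this filter would need pd.notna() instead of a truthiness check.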

Requirements:

vcf2maf must be installed (see installation guide)
VCF files must match the specified genome build (GRCh37 or GRCh38)
Reference genome files for vcf2maf (users provide their own reference path)

Study Validation

The validate() method:

Creates temporary files in cBioPortal format
Runs the official cBioPortal validator (from cBioPortal datahub-study-curation-tools)
Generates an HTML validation report
Cleans up temporary files
Returns a validation result object

Validation result object:

result.is_valid      # True if validation passed (clean or warnings-only)
result.report_path   # Path to HTML validation report
result.errors        # Errors AND/OR warnings emitted by the validator

is_valid is True for a clean validation and for warnings-only results; in the warnings-only case, result.errors is populated and write_files(validate=True) proceeds with a UserWarning. Errors (validator exit code 1 or 2) raise ValidationError from write_files(validate=True) and study files are not written.
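
Because is_valid alone does not distinguish a clean run from a warnings-only run, it can help to classify the result explicitly. The sketch below uses stand-in objects with the same attribute names as the result object described above; describe_validation is a hypothetical helper.

```python
from types import SimpleNamespace


def describe_validation(result):
    """Classify a validation result as 'clean', 'warnings', or 'errors'."""
    if not result.is_valid:
        return "errors"
    # is_valid is True here; a non-empty errors list means warnings only.
    return "warnings" if result.errors else "clean"


# Stand-ins for real results, used only to exercise the helper.
clean = SimpleNamespace(is_valid=True, errors=[])
warned = SimpleNamespace(is_valid=True, errors=["example warning text"])
failed = SimpleNamespace(is_valid=False, errors=["example error text"])

outcomes = [describe_validation(r) for r in (clean, warned, failed)]
print(outcomes)
```

With a real study, you would pass the object returned by study.validate() and open result.report_path whenever the outcome is not "clean".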

Validator acquisition: The cBioPortal validator is AGPL-3.0 licensed and lives in a separate repository, so cbioformatter does not bundle it. On first validate() call, the validator is cloned into ~/.cache/cbioformatter/validator/ (~5 MB, requires git and internet). Subsequent calls reuse the cache.

For air-gapped or CI environments, pre-clone the validator and set CBIOFORMATTER_VALIDATOR_PATH:

git clone --depth 1 https://github.com/cBioPortal/datahub-study-curation-tools.git
export CBIOFORMATTER_VALIDATOR_PATH=$(pwd)/datahub-study-curation-tools/validation/validator
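
The lookup order described here (environment variable first, then the cache directory) might look roughly like the following. This is a sketch under that assumption, not the package's code; resolve_validator_dir is a hypothetical name, and the actual cloning step is omitted.

```python
import os
from pathlib import Path

# Cache location stated in this README.
DEFAULT_CACHE = Path.home() / ".cache" / "cbioformatter" / "validator"


def resolve_validator_dir(environ=None):
    """Return (directory, source) where source is 'env' or 'cache'.

    Hypothetical helper for illustration; not cbioformatter's API.
    """
    environ = os.environ if environ is None else environ
    override = environ.get("CBIOFORMATTER_VALIDATOR_PATH")
    if override:
        return Path(override).expanduser(), "env"
    return DEFAULT_CACHE, "cache"


from_env = resolve_validator_dir({"CBIOFORMATTER_VALIDATOR_PATH": "/opt/validator"})
from_cache = resolve_validator_dir({})
print(from_env, from_cache)
```

In CI, setting the variable once in the job environment avoids a network clone on every run.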

File Output

The write_files() method generates a complete cBioPortal study directory:

my_studies/
└── brca_ocdo_2026/
    ├── meta_study.txt
    ├── meta_clinical_patient.txt
    ├── data_clinical_patient.txt
    ├── meta_clinical_sample.txt
    ├── data_clinical_sample.txt
    ├── meta_mutations.txt      # if mutation data provided
    ├── data_mutations.txt      # if mutation data provided
    ├── case_lists/
    │   ├── cases_all.txt
    │   └── cases_sequenced.txt          # if mutation data provided

Parameters:

output_dir (default: ".") - Base directory for output. Study files are created in {output_dir}/{study_id}/
validate (default: True) - If True, runs validation before writing files. Set to False to skip validation (use with caution).
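
For orientation, the case list files in the tree above follow cBioPortal's "key: value" case list layout. The sketch below renders one such file; the field names come from cBioPortal's file format documentation, while render_case_list and the exact name and description text are assumptions, so the files cbioformatter writes may differ in wording.

```python
def render_case_list(study_id, sample_ids, suffix="all", name="All samples"):
    """Render the text of a cBioPortal case list file.

    Hypothetical helper for illustration; not cbioformatter's API.
    """
    lines = [
        f"cancer_study_identifier: {study_id}",
        f"stable_id: {study_id}_{suffix}",
        f"case_list_name: {name}",
        f"case_list_description: {name} ({len(sample_ids)} samples)",
        # Sample IDs are separated by tabs on a single line.
        "case_list_ids: " + "\t".join(sample_ids),
    ]
    return "\n".join(lines) + "\n"


cases_all = render_case_list("brca_ocdo_2026", ["S001", "S002", "S003"])
print(cases_all)
```

A cases_sequenced.txt file would use the same layout with a "_sequenced" stable_id suffix and only the samples that have mutation data.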

Uploading to cBioPortal (Optional)

For users running their own cBioPortal instance, cbioformatter can push studies and gene panels directly into the running server. This is a fully optional advanced feature — the formatting and validation features above work standalone.

Requirements:

A running cBioPortal instance (typically via Docker)
Docker accessible on your machine (docker compose available in your PATH)
The host directory containing your study must be bind-mounted into the cBioPortal container

Uploading a study:

study.upload(
    study_dir,                          # Path returned by write_files()
    url="http://localhost:8080/",       # cBioPortal instance URL
    container="cbioportal",             # docker-compose service name
    mount_path="/study",                # path inside container where study_dir is mounted
)

The upload() method invokes metaImport.py inside the cBioPortal container and returns a result object with the import status and a link to the HTML report.

Uploading a gene panel:

from cbioformatter import upload_gene_panel

upload_gene_panel(
    panel_file="./panels/data_gene_panel_impact341.txt",
    container="cbioportal",
    mount_path="/study",
)

Gene panels are study-independent reference data — they need to be loaded into cBioPortal before any studies that reference them.

Environment variable defaults:

CBIOPORTAL_URL — overrides the default url
CBIOPORTAL_CONTAINER — overrides the default container
CBIOPORTAL_MOUNT_PATH — overrides the default mount_path
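
One way to picture how these variables interact with function arguments is the sketch below. The precedence shown (explicit argument, then environment variable, then built-in default) is an assumption rather than documented behavior, and resolve_upload_settings is a hypothetical helper.

```python
import os

# Built-in defaults as listed in the API reference.
BUILT_IN_DEFAULTS = {
    "url": "http://localhost:8080/",
    "container": "cbioportal",
    "mount_path": "/study",
}
ENV_NAMES = {
    "url": "CBIOPORTAL_URL",
    "container": "CBIOPORTAL_CONTAINER",
    "mount_path": "CBIOPORTAL_MOUNT_PATH",
}


def resolve_upload_settings(environ=None, **explicit):
    """Resolve url/container/mount_path (assumed order: argument, env, default)."""
    environ = os.environ if environ is None else environ
    settings = {}
    for key, default in BUILT_IN_DEFAULTS.items():
        if explicit.get(key) is not None:
            settings[key] = explicit[key]
        else:
            settings[key] = environ.get(ENV_NAMES[key], default)
    return settings


env = {"CBIOPORTAL_URL": "http://portal.example:8080/"}
resolved = resolve_upload_settings(environ=env, container="cbioportal-dev")
print(resolved)
```

Setting the variables once in your shell profile or docker-compose environment keeps upload calls short across scripts.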

Fetching from the cBioPortal Datahub (Optional)

The cBioPortal datahub hosts public studies and gene panel definitions. cbioformatter provides utilities to download them:

from cbioformatter import fetch_datahub_study, fetch_datahub_panel

study_dir = fetch_datahub_study("msk_impact_2017", output_dir="./studies")
panel_file = fetch_datahub_panel("impact341", output_dir="./panels")

These functions return local paths; they do not automatically upload the fetched data. Combine with upload() / upload_gene_panel() for a complete fetch-and-load workflow.

API Reference

ClinicalStudy

ClinicalStudy(
    study_id: str,
    name: str,
    description: str,
    cancer_type: str,
    genome_build: str,
    sample_data: pd.DataFrame,
    patient_data: pd.DataFrame = None
)

Parameters:

study_id: Unique identifier for the study (no spaces, lowercase recommended)
name: Human-readable study name
description: Brief description of the study
cancer_type: Valid cBioPortal cancer type (see cBioPortal documentation)
genome_build: Reference genome build. Accepts UCSC names ("hg19", "hg38", "mm10") or NCBI/Ensembl aliases ("GRCh37", "GRCh38", "GRCm38"); aliases are translated to the UCSC form on write since cBioPortal's validator only accepts UCSC names
sample_data: pandas DataFrame with sample-level clinical attributes. Must include SAMPLE_ID. Optionally includes PATIENT_ID and VCF_PATH
patient_data: Optional pandas DataFrame with patient-level clinical attributes. Must include PATIENT_ID if provided
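
The alias translation for genome_build amounts to a small lookup. The sketch below shows the mapping between the names listed above; to_ucsc_build is a hypothetical helper, and how cbioformatter reports an unsupported build is an assumption.

```python
UCSC_NAMES = {"hg19", "hg38", "mm10"}
ALIASES = {"GRCh37": "hg19", "GRCh38": "hg38", "GRCm38": "mm10"}


def to_ucsc_build(genome_build):
    """Translate a genome build name to its UCSC form.

    Hypothetical helper for illustration; not cbioformatter's API.
    """
    if genome_build in UCSC_NAMES:
        return genome_build
    try:
        return ALIASES[genome_build]
    except KeyError:
        raise ValueError(f"Unsupported genome build: {genome_build!r}") from None


translated = [to_ucsc_build(b) for b in ("GRCh37", "GRCh38", "GRCm38", "hg19")]
print(translated)
```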

Methods:

`validate()`

Validates the study using cBioPortal's official validator.

Returns: ValidationResult object with:

is_valid (bool): True if validation passed (clean or warnings-only)
report_path (str): Path to HTML validation report
errors (list): Errors and/or warnings emitted by the validator; can be non-empty even when is_valid is True (warnings-only)

`write_files(output_dir=".", validate=True)`

Writes all study files to disk.

Parameters:

output_dir (str): Base output directory (default: current directory)
validate (bool): If True, runs validation before writing files (default: True)

Returns: Path to the created study directory ({output_dir}/{study_id}/)

Raises:

ValidationError if validate=True and the cBioPortal validator reports errors. Study files are not written. Pass validate=False to skip validation.

`upload(study_dir, url=..., container=..., mount_path=..., report_dir=None)`

Uploads a written study directory into a running cBioPortal instance via metaImport.py.

Parameters:

study_dir (str | Path): Path to the study directory produced by write_files()
url (str): cBioPortal instance URL (default: "http://localhost:8080/", or $CBIOPORTAL_URL)
container (str): Name of the cbioportal docker-compose service (default: "cbioportal", or $CBIOPORTAL_CONTAINER)
mount_path (str): Path inside the container where study_dir's parent is bind-mounted (default: "/study", or $CBIOPORTAL_MOUNT_PATH)
report_dir (str | Path, optional): Where to save the HTML import report (default: alongside study_dir)

Returns: UploadResult object with:

success (bool): Whether import succeeded
report_path (str): Path to HTML import report
errors (list): List of import errors if upload failed

Raises:

RuntimeError if Docker is not running or the container cannot be reached
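
Since mount_path is where study_dir's parent directory is bind-mounted, the study's location inside the container follows from the directory name. The sketch below shows that path arithmetic; container_study_path is a hypothetical helper, and the package may compute the path differently.

```python
from pathlib import Path, PurePosixPath


def container_study_path(study_dir, mount_path="/study"):
    """Return where study_dir appears inside the container.

    Assumes the parent of study_dir is bind-mounted at mount_path.
    Hypothetical helper for illustration; not cbioformatter's API.
    """
    return PurePosixPath(mount_path) / Path(study_dir).name


inside = container_study_path("./my_studies/brca_ocdo_2026")
print(inside)
```

For example, with ./my_studies on the host bind-mounted at /study, the study written to ./my_studies/brca_ocdo_2026 would be visible inside the container at /study/brca_ocdo_2026. If the import cannot find your files, comparing this path with your docker-compose volume configuration is a good first check.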

Module-level functions

`upload_gene_panel(panel_file, container=..., mount_path=...)`

Imports a single gene panel definition file into a running cBioPortal instance via importGenePanel.pl.

Parameters:

panel_file (str | Path): Path to the panel definition file
container (str): docker-compose service name (default: "cbioportal", or $CBIOPORTAL_CONTAINER)
mount_path (str): Path inside the container where panel_file's parent is mounted (default: "/study", or $CBIOPORTAL_MOUNT_PATH)

`fetch_datahub_study(study_id, output_dir=".")`

Downloads and extracts a public study from the cBioPortal datahub.

Parameters:

study_id (str): Datahub study ID (e.g., "msk_impact_2017", "chol_tcga")
output_dir (str | Path): Where to extract the study (default: current directory)

Returns: Path to the extracted study directory

Raises:

ValueError if the study ID is not found in the datahub

`fetch_datahub_panel(panel_name, output_dir=".")`

Downloads a public gene panel definition from the cBioPortal datahub.

Parameters:

panel_name (str): Datahub panel name (e.g., "impact341", "impact468")
output_dir (str | Path): Where to save the file (default: current directory)

Returns: Path to the downloaded panel file

Raises:

ValueError if the panel name is not found in the datahub

Example Workflow

See the example notebook for a complete walkthrough using simulated data.

Supported Data Types (Current Version)

✅ Clinical data (patient and sample attributes)
✅ Mutation data (VCF → MAF conversion)
⏳ Copy number alterations (CNA) - planned for future release
⏳ Gene expression data - planned for future release
⏳ Methylation data - planned for future release

Supported Workflows

✅ Format clinical and genomic data into cBioPortal-compatible files
✅ Validate study files locally (no cBioPortal instance required)
✅ Upload studies into a running cBioPortal instance (optional)
✅ Import gene panel definitions (optional)
✅ Fetch public studies and panels from the cBioPortal datahub (optional)

Requirements

Python 3.10+
pandas
vcf2maf (optional, for VCF processing)
Docker with a running cBioPortal instance (optional, only for upload features)

External Tools

This package relies on the following external tools for mutation data processing:

vcf2maf (optional, for VCF processing):

Required only if you're including mutation data from VCF files
See vcf2maf installation guide for setup instructions
Requires a reference genome (GRCh37 or GRCh38)

Troubleshooting

Common Issues

"SAMPLE_ID duplicates found"

Ensure all values in your SAMPLE_ID column are unique
Check for accidentally duplicated rows in your data

"PATIENT_ID 'P123' not found in patient data"

Every patient ID referenced in sample data must exist in patient data
If you didn't provide patient data, this shouldn't happen (it's auto-generated)

"VCF file not found: /path/to/file.vcf"

Check that all file paths in the VCF_PATH column are correct
Ensure files are accessible from your current working directory

"vcf2maf not found"

Install vcf2maf following the installation guide
Ensure vcf2maf is available in your PATH
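
A quick way to confirm what Python can see on your PATH is shutil.which from the standard library. The executable names tried below ("vcf2maf.pl", then "vcf2maf") are assumptions about how the tool is installed on your system, and find_on_path is a hypothetical helper.

```python
import shutil


def find_on_path(candidates):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        location = shutil.which(name)
        if location:
            return location
    return None


# Assumed executable names; adjust to match your installation.
vcf2maf_location = find_on_path(["vcf2maf.pl", "vcf2maf"])
if vcf2maf_location is None:
    print("vcf2maf was not found on PATH")
else:
    print(f"vcf2maf found at: {vcf2maf_location}")
```

Run this from the same environment (virtualenv, conda environment, or notebook kernel) in which you use cbioformatter, since PATH can differ between them.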

Validation fails with complex errors

Review the HTML validation report at the path provided
Common issues: incorrect cancer type, malformed column names, missing required fields

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details.

Citation

If you use cBioFormatter in your research, please mention the GitHub repository:

cBioFormatter: https://github.com/getwilds/cbioformatter

Future aim: We plan to submit cBioFormatter to the Journal of Open Source Software (JOSS) for peer review. Once published, a formal citation will be provided here.

Contact

Fred Hutch users:

FH-Data Slack: #cbioportal-support channel (or reach out to Taylor Firman or Emma Bishop)
Research Computing Data House Call

External users:

Email: wilds@fredhutch.org
Issues: GitHub Issues
Questions: GitHub Discussions

Acknowledgments

Built to support the Fred Hutch Cancer Center cBioPortal instance
Uses cBioPortal's official validation tools
Part of the WILDS ecosystem

Project details

Maintainers

tefirman

Release history

This version

0.1.0

May 15, 2026

Download files

Source Distribution

cbioformatter-0.1.0.tar.gz (21.7 kB)

Uploaded May 15, 2026 Source

Built Distribution

cbioformatter-0.1.0-py3-none-any.whl (21.7 kB)

Uploaded May 15, 2026 Python 3

File details

Details for the file cbioformatter-0.1.0.tar.gz.

File metadata

Download URL: cbioformatter-0.1.0.tar.gz
Upload date: May 15, 2026
Size: 21.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for cbioformatter-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e5d7f02527b622ecf8b6e46b3a07821556970a4c114c089a93cb8c53bb2e2f6a`
MD5	`a97840b1ef736fa905dc28a1be6628e3`
BLAKE2b-256	`a9883b73ffecb85a82387097a3e6646150ee9f99957391837c9576b5249c02a1`

Provenance

The following attestation bundles were made for cbioformatter-0.1.0.tar.gz:

Publisher: publish.yml on getwilds/cbioformatter

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cbioformatter-0.1.0.tar.gz
- Subject digest: e5d7f02527b622ecf8b6e46b3a07821556970a4c114c089a93cb8c53bb2e2f6a
- Sigstore transparency entry: 1546252011
- Sigstore integration time: May 15, 2026
Source repository:
- Permalink: getwilds/cbioformatter@2a3b4cbb5109d62c3529c9a7cf4f6523b975a333
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/getwilds
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2a3b4cbb5109d62c3529c9a7cf4f6523b975a333
- Trigger Event: release

File details

Details for the file cbioformatter-0.1.0-py3-none-any.whl.

File metadata

Download URL: cbioformatter-0.1.0-py3-none-any.whl
Upload date: May 15, 2026
Size: 21.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for cbioformatter-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5595288076250f454551b8c853184ef0140bebf0617148003e11aef33c8d38b7`
MD5	`6a2334c6e3d025af69230fd9e8221147`
BLAKE2b-256	`085a61a658c594fb1ef8e2dbda0c3b77edeb21b8552ab0b246457f9420d761b6`

Provenance

The following attestation bundles were made for cbioformatter-0.1.0-py3-none-any.whl:

Publisher: publish.yml on getwilds/cbioformatter

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cbioformatter-0.1.0-py3-none-any.whl
- Subject digest: 5595288076250f454551b8c853184ef0140bebf0617148003e11aef33c8d38b7
- Sigstore transparency entry: 1546252013
- Sigstore integration time: May 15, 2026
Source repository:
- Permalink: getwilds/cbioformatter@2a3b4cbb5109d62c3529c9a7cf4f6523b975a333
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/getwilds
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2a3b4cbb5109d62c3529c9a7cf4f6523b975a333
- Trigger Event: release
