
SurvSet: An open-source time-to-event dataset repository

SurvSet is the first open-source time-to-event dataset repository. Its goal is to allow researchers and practitioners to benchmark machine learning models and assess statistical methods. All datasets in the repository are consistently formatted to enable rapid prototyping and inference. SurvSet originated as a test bed for the regularity conditions of the False Positive Control Lasso.

While SurvSet is designed for Python, the formatted datasets can also be found in comma-separated format within this folder. SurvSet currently has 76 datasets, which vary in dimensionality (see the figure below): the collection includes high-dimensional genomics datasets (p >> n) like gse1992 as well as long-and-skinny datasets (n >> p) like hdfail.

Installation

SurvSet can be installed with pip for Python 3: pip install SurvSet. You can run python3 -m SurvSet to verify that the package was installed without errors. Note that pandas and numpy will be installed as dependencies (see PyPI for more details).
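
After installation, a minimal smoke test is to import the loader class documented below (a sketch; it assumes nothing beyond SurvLoader and its df_ds meta-information table):

from SurvSet.data import SurvLoader
loader = SurvLoader()
# df_ds holds one row of meta-information per dataset
print(len(loader.df_ds))  # 76 datasets at the time of writing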

Dataset structure and origin

Most of SurvSet's datasets come from existing R packages. The accompanying arXiv paper provides a full list of package sources and references. Datasets are loaded from the main class SurvLoader with the load_dataset method, which returns a pandas DataFrame with the following column structure:

  1. pid: the unique observation identifier (especially relevant for time-varying datasets)
  2. event: a binary event indicator (1 == the event occurred)
  3. time: time to event/censoring (or start time if time2 exists)
  4. time2: end of the interval [time, time2) when there are time-varying features
  5. num_{}: prefix implies a continuous feature
  6. fac_{}: prefix implies a categorical feature

Currently 7 datasets have time-varying features. Some datasets include the same feature as both a continuous and a categorical column; this was done for features that are plausibly ordinal.
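
As a short sketch of how the prefix convention is used in practice (the ova dataset is the same one loaded in the simple example below):

from SurvSet.data import SurvLoader
loader = SurvLoader()
df, ref = loader.load_dataset(ds_name='ova').values()
# Split the feature columns by their prefix
cn_num = [c for c in df.columns if c.startswith('num_')]  # continuous features
cn_fac = [c for c in df.columns if c.startswith('fac_')]  # categorical features
print('ova: %i continuous, %i categorical features' % (len(cn_num), len(cn_fac)))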

Figure: Dataset dimensionality

Usage (simple)

Print the list of datasets that can be loaded and load the ova dataset.

from SurvSet.data import SurvLoader
loader = SurvLoader()
# List of available datasets and meta-info
print(loader.df_ds.head())
# Load dataset and its reference
df, ref = loader.load_dataset(ds_name='ova').values()
print(df.head())
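
The meta-information table also flags which datasets have time-varying features; this is the same is_td column used to filter them out in the complex example below:

from SurvSet.data import SurvLoader
loader = SurvLoader()
# Names of the 7 datasets with time-varying features (is_td flag)
ds_td = loader.df_ds[loader.df_ds['is_td']]['ds'].to_list()
print(ds_td)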

Usage (complex)

The example below shows a simple machine learning pipeline that fits an ElasticNet CoxPH model to each of the (non-time-varying) datasets. To run the code, please install the appropriate packages: conda install -c bcg_gamma -c conda-forge scikit-learn=1.0.2 sklearndf=2.0 scikit-survival=0.17.0 plotnine=0.8.0.

import os
import numpy as np
import pandas as pd
import plotnine as pn
from SurvSet.data import SurvLoader
from sksurv.util import Surv
from sksurv.metrics import concordance_index_censored as concordance
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector
from sklearndf.pipeline import PipelineDF
from sklearndf.transformation import OneHotEncoderDF, ColumnTransformerDF, SimpleImputerDF, StandardScalerDF

# (i) Set up feature transformer pipeline
enc_fac = PipelineDF(steps=[('ohe', OneHotEncoderDF(sparse=False, drop=None, handle_unknown='ignore'))])
sel_fac = make_column_selector(pattern='^fac\\_')
enc_num = PipelineDF(steps=[('impute', SimpleImputerDF(strategy='median')), ('scale', StandardScalerDF())])
sel_num = make_column_selector(pattern='^num\\_')
# Combine both
enc_df = ColumnTransformerDF(transformers=[('ohe', enc_fac, sel_fac),('s', enc_num, sel_num)])

# (ii) Run on datasets
alpha = 0.1
senc = Surv()
loader = SurvLoader()
ds_lst = loader.df_ds[~loader.df_ds['is_td']]['ds'].to_list()  # Remove datasets with time-varying covariates
n_ds = len(ds_lst)
holder_cindex = np.zeros([n_ds, 3])
for i, ds in enumerate(ds_lst):
    print('Dataset %s (%i of %i)' % (ds, i+1, n_ds))
    # Meta-information for this dataset (not used downstream; shown for reference)
    anno = loader.df_ds.query('ds == @ds').T.to_dict()
    anno = anno[list(anno)[0]]
    df, ref = loader.load_dataset(ds).values()
    # Random stratified split
    df_train, df_test = train_test_split(df, stratify=df['event'], random_state=1, test_size=0.3)
    # Fit encoder
    enc_df.fit(df_train)
    # Sanity check
    cn_prefix = enc_df.feature_names_original_.str.split('_', n=1, expand=True)[0].unique()
    assert all([cn in ['fac', 'num'] for cn in cn_prefix])
    # Prepare numpy arrays
    X_train = enc_df.transform(df_train)
    So_train = senc.from_arrays(df_train['event'].astype(bool), df_train['time'])
    X_test = enc_df.transform(df_test)
    # Fit model
    mdl = CoxnetSurvivalAnalysis(normalize=True)
    mdl.fit(X=X_train, y=So_train)
    scores_test = mdl.predict(X_test)
    res_test = df_test[['event','time']].assign(scores=scores_test)
    So_test = senc.from_arrays(res_test['event'].astype(bool), res_test['time'])
    conc_test = concordance(So_test['event'], So_test['time'], res_test['scores'])[0]
    # Bootstrap an 80% CI for the concordance (10th to 90th percentiles)
    n_bs = 250
    holder_bs = np.zeros(n_bs)
    for j in range(n_bs):
        res_bs = res_test.groupby(['event']).sample(frac=1,replace=True,random_state=j)
        So_bs = senc.from_arrays(res_bs['event'].astype(bool), res_bs['time'])
        conc_bs = concordance(So_bs['event'], So_bs['time'], res_bs['scores'])[0]
        holder_bs[j] = conc_bs
    lb, ub = np.quantile(holder_bs, [alpha,1-alpha])
    holder_cindex[i] = [conc_test, lb, ub]

# (iii) Merge results & plot
df_cindex = pd.DataFrame(holder_cindex, columns=['cindex', 'lb', 'ub'])
df_cindex.insert(0, 'ds', ds_lst)
ds_ord = df_cindex.sort_values('cindex')['ds'].values
df_cindex['ds'] = pd.Categorical(df_cindex['ds'], ds_ord)

gg_cindex = (pn.ggplot(df_cindex, pn.aes(y='cindex',x='ds')) + 
    pn.theme_bw() + pn.coord_flip() + 
    pn.geom_point(size=2) + 
    pn.geom_linerange(pn.aes(ymin='lb', ymax='ub')) + 
    pn.labs(y='Concordance') + 
    pn.geom_hline(yintercept=0.5,linetype='--', color='red') + 
    pn.theme(axis_title_y=pn.element_blank()))
gg_cindex
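
Since alpha = 0.1, the plotted intervals span the 10th to 90th bootstrap percentiles (an 80% CI), and the red dashed line at 0.5 marks the concordance expected from a random ranking.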


Adding new datasets

If you are interested in contributing to SurvSet, or know of other open-source time-to-event datasets that would make useful additions, please contact me. If you would like to see these datasets added quickly, please directly modify the data-generating process found in SurvSet/_datagen/pipeline.sh and create a pull request.

How to cite

If you use SurvSet in your research or project, please cite the following:

@article{drysdale2022,
  title={{SurvSet}: An open-source time-to-event dataset repository},
  author={Drysdale, Erik},
  journal={arXiv preprint arXiv:2203.03094},
  year={2022}
}
