
GlucoseDao's fork of GlucoBench by IrinaStatsLab


GlucoBench

The official implementation of the paper "GlucoBench: Curated List of Continuous Glucose Monitoring Datasets with Prediction Benchmarks." If you found our work interesting and plan to re-use the code, please cite us as:

@article{sergazinov2023glucobench,
  author  = {Renat Sergazinov and Valeriya Rogovchenko and Elizabeth Chun and Nathaniel Fernandes and Irina Gaynanova},
  title   = {GlucoBench: Curated List of Continuous Glucose Monitoring Datasets with Prediction Benchmarks},
  journal = {arXiv},
  year    = {2023},
}

Dependencies

We recommend setting up a clean Python environment with conda by running conda create -n glucobench python=3.10. Then install all dependencies by running pip install -r requirements.txt.

To run the Latent ODE model, additionally install torchdiffeq.

Code organization

The code is organized as follows:

  • bin/: training commands for all models
  • config/: configuration files for all datasets
  • data_formatter/
    • base.py: performs all pre-processing for all CGM datasets
  • exploratory_analysis/: notebooks with processing steps for pulling the data and converting to .csv files
  • lib/
    • gluformer/: model implementation
    • latent_ode/: model implementation
    • *.py: hyper-parameter tuning, training, validation, and testing scripts
  • output/: hyper-parameter optimization and testing logs
  • paper_results/: code for producing tables and plots, found in the paper
  • utils/: helper functions for model training and testing
  • raw_data.zip: web-pulled CGM data (processed using exploratory_analysis)
  • environment.yml: conda environment file

Data

The datasets are distributed under the licenses listed in the table below.

Dataset    License               Number of patients  CGM frequency
Colas      Creative Commons 4.0  208                 5 minutes
Dubosson   Creative Commons 4.0  9                   5 minutes
Hall       Creative Commons 4.0  57                  5 minutes
Broll      GPL-2                 5                   5 minutes
Weinstock  Creative Commons 4.0  200                 5 minutes

To process the data, follow the instructions in the exploratory_analysis/ folder. Processed datasets should be saved in the raw_data/ folder. We provide examples in the raw_data.zip file.

How to reproduce results?

Setting up the environment

We recommend setting up a clean Python environment using Conda. Follow these steps:

  1. Create a new environment named glucobench with Python 3.10 by running:

    conda create -n glucobench python=3.10
    
  2. Activate the environment with:

    conda activate glucobench
    
  3. Install all required dependencies by running:

    pip install -r requirements.txt
    
  4. (Optional) To confirm that you're installing in the correct environment, run:

    which pip
    

    This should display the path to the pip executable within the glucobench environment.

Changing the configs

The config/ folder stores the best hyper-parameters (selected by Optuna) for each dataset and model, as well as the dataset-specific parameters for interpolation, dropping, splitting, and scaling. To train and evaluate a model with these defaults, run (substituting the desired model and dataset names):

python ./lib/model.py --dataset dataset --use_covs False --optuna False
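The dataset-specific pre-processing blocks in these configs follow a simple YAML layout. A hypothetical sketch is shown below; the key names and values are illustrative, so consult the actual files under config/ for the real schema:

```yaml
# Illustrative config fragment -- not the repository's actual keys.
interpolation_params:
  gap_threshold: 45        # max gap (minutes) to fill by interpolation
  min_drop_length: 240     # drop segments shorter than this
split_params:
  test_percent_subjects: 0.1
  length_segment: 240
scaling_params:
  scaler: None             # e.g. None / MinMax / Standard
```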

Changing the hyper-parameters

To change the search grid for hyper-parameters, modify the ./lib/model.py file: in the objective() function, adjust the trial.suggest_* calls to set the desired ranges. Then re-run the hyper-parameter optimization with:

python ./lib/model.py --dataset dataset --use_covs False --optuna True
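For example, widening the learning-rate range would look like the sketch below. The parameter names and the train_and_validate stub are hypothetical; the real objective() in ./lib/model.py trains the model and returns a validation score:

```python
def train_and_validate(hidden_size, lr, dropout):
    # Stub standing in for the real training + validation run,
    # which would return a validation loss for these hyper-parameters.
    return 1.0 / hidden_size + lr + dropout

def objective(trial):
    # Adjust the ranges in these trial.suggest_* calls to change the search grid.
    hidden_size = trial.suggest_int("hidden_size", 32, 256)
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)   # widened range
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    return train_and_validate(hidden_size, lr, dropout)
```

Optuna then minimizes (or maximizes) the returned score over trials drawn from these ranges.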

How to work with the repository?

We provide a detailed example of the workflow in the example.ipynb notebook. Below, we also offer some general suggestions, in order of increasing complexity.

Just the data

To start experimenting with the data, we can run the following command:

import yaml
from data_formatter.base import DataFormatter

dataset = 'colas'  # any dataset with a config: colas, dubosson, hall, broll, weinstock

with open(f'./config/{dataset}.yaml', 'r') as f:
    config = yaml.safe_load(f)
formatter = DataFormatter(config)

This snippet creates an object of class DataFormatter, which automatically pre-processes the data upon initialization. The pre-processing steps can be controlled via the config/ files. The DataFormatter object exposes the following attributes:

  1. formatter.train_data: training data (as pandas.DataFrame)
  2. formatter.val_data: validation data
  3. formatter.test_data: testing (in-distribution and out-of-distribution) data
     i. formatter.test_data.loc[~formatter.test_data.index.isin(formatter.test_idx_ood)]: in-distribution testing data
     ii. formatter.test_data.loc[formatter.test_data.index.isin(formatter.test_idx_ood)]: out-of-distribution testing data
  4. formatter.data: unscaled full data
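For instance, the two test splits can be separated with the index mask shown above. A self-contained sketch, with a toy DataFrame standing in for formatter.test_data and a toy list standing in for formatter.test_idx_ood:

```python
import pandas as pd

# Synthetic stand-ins so the indexing pattern can be shown
# without loading a real CGM dataset.
test_data = pd.DataFrame({"gl": [110, 95, 180, 140]}, index=[0, 1, 2, 3])
test_idx_ood = [2, 3]  # indices flagged as out-of-distribution

mask = test_data.index.isin(test_idx_ood)
test_id = test_data.loc[~mask]   # in-distribution test split
test_ood = test_data.loc[mask]   # out-of-distribution test split
```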

Integration with PyTorch

Training models with PyTorch typically boils down to (1) defining a Dataset class with __getitem__() method, (2) wrapping it into a DataLoader, (3) defining a torch.nn.Module class with forward() method that implements the model, and (4) optimizing the model with torch.optim in a training loop.
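As a minimal illustration of these four steps, here is a sketch with a toy sliding-window Dataset and a linear model standing in for the repository's components (the window sizes and model are purely illustrative):

```python
import torch
from torch.utils.data import Dataset, DataLoader

# (1) A Dataset with __getitem__: slide a window over one glucose series.
class WindowDataset(Dataset):
    def __init__(self, series, in_len, out_len):
        self.series, self.in_len, self.out_len = series, in_len, out_len

    def __len__(self):
        return len(self.series) - self.in_len - self.out_len + 1

    def __getitem__(self, i):
        x = self.series[i : i + self.in_len]
        y = self.series[i + self.in_len : i + self.in_len + self.out_len]
        return x, y

series = torch.arange(100, dtype=torch.float32)            # toy "CGM" signal
# (2) Wrap it into a DataLoader.
loader = DataLoader(WindowDataset(series, in_len=12, out_len=3), batch_size=8)

# (3) A stand-in forecaster; the repository's models implement forward() instead.
model = torch.nn.Linear(12, 3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# (4) One pass of the training loop.
for x, y in loader:
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
```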

Parts (1) and (2) crucially depend on the definition of the Dataset class: given the data in table format (e.g. formatter.train_data), how do we sample input-output pairs and pass along the covariate information? The Dataset classes adapted from the Darts library offer one way to wrap the data. They differ in what information is provided to the model:

  1. SamplingDatasetPast: supports only past covariates
  2. SamplingDatasetDual: supports only future-known covariates
  3. SamplingDatasetMixed: supports both past and future-known covariates

Below we give an example of loading the data and wrapping it into a Dataset:

from utils.darts_processing import load_data
from utils.darts_dataset import SamplingDatasetDual

# Illustrative window sizes; pick values to match the chosen model and config.
in_len, out_len, max_samples_per_ts = 96, 12, 100

formatter, series, scalers = load_data(seed=0,
                                       dataset=dataset,
                                       use_covs=True,
                                       cov_type='dual',
                                       use_static_covs=True)
dataset_train = SamplingDatasetDual(series['train']['target'],
                                    series['train']['future'],
                                    output_chunk_length=out_len,
                                    input_chunk_length=in_len,
                                    use_static_covariates=True,
                                    max_samples_per_ts=max_samples_per_ts)

Parts (3) and (4) are model-specific, so we omit their discussion. For inspiration, we suggest taking a look at the lib/gluformer/model.py and lib/latent_ode/trainer_glunet.py files.
