Scalable Tabularization of MEDS format Time-Series data

Scalable tabularization and tabular feature usage utilities over generic MEDS datasets

This repository provides utilities and scripts to run limited automatic tabular ML pipelines for generic MEDS datasets.

Installation

Pip Install

pip install meds-tab

Local Install

# clone the git repo
pip install .

Usage

This repository consists of two key pieces:

  1. Constructing and efficiently loading tabular (flat, non-longitudinal) summary features that describe patient records in MEDS over arbitrary time windows (e.g. 1 year, 6 months) either backwards or forwards in time from a given index date. Only "look-back" windows are appropriate for future-event prediction tasks, so only look-back windows are currently implemented.
  2. Running a basic XGBoost AutoML pipeline over these tabular features to predict arbitrary binary classification or regression downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced -- what is more advanced is the efficient construction, storage, and loading of tabular features for the candidate AutoML models, enabling a far more extensive search over different featurization strategies.
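
As a rough illustration of the first piece, the sketch below summarizes a toy event list over a look-back window ending at an index date. It uses only the Python standard library and is not the package's implementation: the tuple layout, the `<code>/<aggregation>` column names, and the helper `lookback_features` are all assumptions made for this example.

```python
from datetime import datetime, timedelta

# Toy MEDS-like events: (patient_id, timestamp, code, numerical_value).
# This tuple layout is an assumption made for the example.
EVENTS = [
    (1, datetime(2020, 1, 1), "HR", 80.0),
    (1, datetime(2020, 6, 1), "HR", 90.0),
    (1, datetime(2020, 12, 1), "LAB", 1.5),
    (2, datetime(2020, 11, 15), "HR", 70.0),
]


def lookback_features(events, index_date, window):
    """Hypothetical helper: summarize events in (index_date - window, index_date].

    Returns the column names and one fixed-width row per patient.
    """
    start = index_date - window
    # Every row shares the same columns, whether or not a code was observed.
    codes = sorted({code for _, _, code, _ in events})
    columns = [f"{code}/{agg}" for code in codes for agg in ("count", "sum")]
    rows = {}
    for patient_id, timestamp, code, value in events:
        row = rows.setdefault(patient_id, dict.fromkeys(columns, 0.0))
        if start < timestamp <= index_date:
            row[f"{code}/count"] += 1
            row[f"{code}/sum"] += value
    return columns, rows


columns, rows = lookback_features(EVENTS, datetime(2021, 1, 1), timedelta(days=180))
for patient_id, row in sorted(rows.items()):
    print(patient_id, [row[name] for name in columns])
```

Each patient ends up with exactly one row and the same set of columns, with zeros where a code did not occur inside the window. Changing `window` yields a different featurization of the same records.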

Scripts and Examples

See tests/test_integration.py for an example of the end-to-end pipeline being run on synthetic data. This script is a functional test that is also run with pytest to verify the correctness of the algorithm.

For an end-to-end example over MIMIC-IV, see the companion repository.

Core CLI Scripts Overview

  1. meds-tab-describe: Processes MEDS data shards to compute the frequencies of the different code types:

    • time-series codes (codes with timestamps)
    • time-series numerical values (codes with timestamps and numerical values)
    • static codes (codes without timestamps)
    • static numerical codes (codes without timestamps but with numerical values).
  2. meds-tab-tabularize-static: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique patient_id and timestamp combination, so a patient's static features are repeated in every row for that patient.

  3. meds-tab-tabularize-time-series: Iterates through combinations of a shard, window_size, and aggregation to generate feature vectors that aggregate patient data for each unique patient_id x timestamp combination. Both this stage and the previous one use sparse matrix formats to handle the computational and storage demands of rolling-window calculations on large datasets. Parallelization is supported through Hydra's --multirun flag and the joblib launcher.

  4. meds-tab-cache-task: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a directory of labeled data with three columns (patient_id, timestamp, label), structured similarly to the MEDS_cohort_dir.

  5. meds-tab-xgboost: Trains an XGBoost model using user-specified parameters.

  6. meds-tab-xgboost-sweep: Conducts an Optuna hyperparameter sweep over window_sizes, aggregations, and min_code_inclusion_frequency to find the settings that give the best model performance.
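
To make the label-alignment step (meds-tab-cache-task) more concrete, here is a minimal sketch of matching each label to the nearest prior event for the same patient. It is a standard-library illustration, not the package's code: the helper `align_labels`, the in-memory data layout, and the choice to treat an event at exactly the label's timestamp as "prior" are assumptions.

```python
from bisect import bisect_right
from datetime import datetime

# Timestamps of tabularized rows per patient, sorted ascending (toy data).
ROW_TIMES = {
    1: [datetime(2020, 1, 1), datetime(2020, 6, 1), datetime(2020, 12, 1)],
    2: [datetime(2020, 11, 15)],
}

# Task labels as (patient_id, timestamp, label), mirroring the three columns above.
LABELS = [
    (1, datetime(2020, 7, 1), 1),
    (1, datetime(2019, 1, 1), 0),  # precedes every event for patient 1
    (2, datetime(2020, 11, 15), 0),
]


def align_labels(row_times, labels):
    """Hypothetical helper: attach each label to the latest row at or before it."""
    aligned = []
    for patient_id, label_time, label in labels:
        times = row_times.get(patient_id, [])
        # bisect_right returns how many rows have a timestamp <= label_time.
        position = bisect_right(times, label_time)
        if position == 0:
            continue  # no prior event, so there is no row to attach the label to
        aligned.append((patient_id, times[position - 1], label))
    return aligned


aligned = align_labels(ROW_TIMES, LABELS)
for entry in aligned:
    print(entry)
```

In this sketch, labels with no earlier event are dropped rather than matched forward in time, so a label is never paired with features observed after it.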

What do you mean "tabular pipelines"? Isn't all structured EHR data already tabular?

This is a common misconception. Tabular data is data that can be organized in a consistent, logical set of rows and columns such that the entirety of a "sample" or "instance" for modeling or analysis is contained in a single row, and the set of columns that may be observed (there can be missingness) is consistent across all rows. Structured EHR data does not satisfy this definition: different patients have different numbers of observations of medical codes and values at different timestamps, so the data cannot simultaneously satisfy the (1) "single row, single instance", (2) "consistent set of columns", and (3) "logical" requirements. Thus, when we say this pipeline produces a "tabular" view of MEDS data, we mean a dataset that does satisfy these constraints, which explicitly involves summarizing the patient data over various historical or future windows in time to produce a single row per patient with a consistent, logical set of columns (though there may still be missingness).
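
The distinction can be seen on a toy example. The sketch below is illustrative only, with made-up records and names: it shows that raw long-format records spread one patient over a varying number of rows, while a count summary gives every patient one row with the same columns.

```python
from collections import Counter

# Toy long-format records: (patient_id, code). Illustrative data only.
RECORDS = [(1, "HR"), (1, "HR"), (1, "LAB"), (2, "HR")]

# Long format: the number of rows differs per patient, so one row is not one instance.
rows_per_patient = Counter(patient_id for patient_id, _ in RECORDS)

# Summarized view: one row per patient, one column per code, zero if never observed.
columns = sorted({code for _, code in RECORDS})
table = {patient_id: [0] * len(columns) for patient_id in rows_per_patient}
for patient_id, code in RECORDS:
    table[patient_id][columns.index(code)] += 1

print(dict(rows_per_patient))
print(columns, table)
```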

Download files

Source Distribution

meds_tab-0.0.1.tar.gz (44.5 kB)

Uploaded Source

Built Distribution

meds_tab-0.0.1-py3-none-any.whl (41.7 kB)

Uploaded Python 3
