An efficient, flexible PyTorch dataset class for MEDS data.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

MEDS TorchData: A PyTorch Dataset Class for MEDS Datasets

🚀 Quick Start

Step 1: Install

pip install meds-torch-data

Step 2: Data Tensorization

[!WARNING] If your dataset is not sharded by split, you need to run a reshard-to-split stage first! You can enable this by adding the do_reshard=True argument to the command below.

If your input MEDS dataset lives in $MEDS_ROOT and you want to store your pre-processed files in $PYD_ROOT, you run:

MTD_preprocess MEDS_dataset_dir="$MEDS_ROOT" output_dir="$PYD_ROOT"
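
If you prefer to launch the pre-processing from a Python script, the command above can be assembled as an argument list. This is only a minimal sketch: the helper name build_preprocess_command and the example paths are hypothetical, and only the MEDS_dataset_dir, output_dir, and do_reshard arguments shown in this README are used.

```python
import shlex


def build_preprocess_command(meds_root: str, output_dir: str, do_reshard: bool = False) -> list:
    """Build the argument list for the MTD_preprocess command shown above.

    Hypothetical helper for illustration; it is not part of meds-torch-data.
    """
    cmd = ["MTD_preprocess", f"MEDS_dataset_dir={meds_root}", f"output_dir={output_dir}"]
    if do_reshard:
        # Only needed when the input dataset is not already sharded by split.
        cmd.append("do_reshard=True")
    return cmd


cmd = build_preprocess_command("/path/to/meds", "/path/to/tensorized", do_reshard=True)
print(shlex.join(cmd))

# To actually run it (requires meds-torch-data to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```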

Step 3: Use the dataset

To use a dataset, you need to (1) define your configuration object and (2) create the dataset object. The only required configuration parameters are tensorized_cohort_dir, which points to the root directory containing the pre-processed data on disk ($PYD_ROOT in the above example), and max_seq_len, which is the maximum sequence length you want to use for your model. Here's an example:

import os
from meds_torchdata import MEDSPytorchDataset, MEDSTorchDataConfig

cfg = MEDSTorchDataConfig(tensorized_cohort_dir=os.environ["PYD_ROOT"], max_seq_len=512)
pyd = MEDSPytorchDataset(cfg, split="train")

If you want to use a specific binary classification task, you can add the task_labels_dir parameter to the configuration object. This should point to a directory containing the sharded MEDS label format parquet files for the labels. The sharding scheme is arbitrary and will not be reflected in the dataset.

That's it!

[!NOTE] Only binary classification tasks are supported at this time. If you need multi-class classification or other kinds of tasks, please file a GitHub issue.

📚 Documentation

Design Principles

A good PyTorch dataset class should:

Be easy to use
Have a minimal, constant resource footprint (memory, CPU, start-up time) during model training and inference, regardless of the overall dataset size.
Perform as much work as possible in static, reusable dataset pre-processing, rather than upon construction or in the __getitem__ method.
Induce effectively negligible computational overhead in the __getitem__ method relative to model training.
Be easily configurable, with a simple, consistent API, and cover the most common use-cases.
Encourage efficient use of GPU resources in the resulting batches.
Be comprehensively documented, tested, and benchmarked for performance implications so users can use it reliably and effectively.

To achieve this, MEDS TorchData leverages the following design principles:

Lazy Loading: Data is loaded only when needed, and only the data needed for the current batch is loaded.
Efficient Loading: Data is loaded efficiently leveraging the HuggingFace Safetensors library for raw IO through the nested, ragged interface encoded in the Nested Ragged Tensors library.
Configurable, Transparent Pre-processing: Mandatory data pre-processing prior to effective use in this library is managed through a simple MEDS-Transforms pipeline which can be run on any MEDS dataset, after any model-specific pre-processing, via a transparent configuration file.
Continuous Integration: The library is continuously tested and benchmarked for performance implications, and the results are available to users.
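
The lazy-loading idea in the list above can be sketched with a toy dataset. Everything here is an assumption made for illustration (the preprocess function, the LazyDataset class, the JSON file layout, and the toy records); the library itself stores data via Safetensors and Nested Ragged Tensors, not JSON. The point is only the division of labor: a one-off pre-processing step writes per-subject files, construction reads a small index, and each item is read from disk only when requested.

```python
import json
import tempfile
from pathlib import Path


def preprocess(records: dict, out_dir: Path) -> None:
    """One-off, reusable step: write one small file per subject plus an index."""
    for subject_id, codes in records.items():
        (out_dir / f"{subject_id}.json").write_text(json.dumps(codes))
    (out_dir / "index.json").write_text(json.dumps(sorted(records)))


class LazyDataset:
    """Toy dataset whose construction cost does not depend on the data size."""

    def __init__(self, root: Path):
        self.root = root
        # Construction reads only the lightweight index, never the subject data.
        self.index = json.loads((root / "index.json").read_text())
        self.n_reads = 0  # counts how many subject files have been loaded

    def __len__(self) -> int:
        return len(self.index)

    def __getitem__(self, i: int) -> list:
        # Load exactly one subject's data, and only when it is requested.
        self.n_reads += 1
        return json.loads((self.root / f"{self.index[i]}.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    preprocess({1: [5, 3, 10], 2: [5, 1]}, root)  # toy records
    ds = LazyDataset(root)
    reads_after_init = ds.n_reads
    first_item = ds[0]
    reads_after_access = ds.n_reads
```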

Examples and Detailed Usage

To see how this works, let's look at some examples. These examples will be powered by some synthetic data defined as "fixtures" in this package's pytest stack; namely, we'll use the following fixtures:

simple_static_MEDS: This will point to a Path containing a simple MEDS dataset.
simple_static_MEDS_dataset_with_task: This will point to a Path containing a simple MEDS dataset with a boolean-value task defined. The core data is the same between both the simple_static_MEDS and this dataset, but the latter has a task defined.
tensorized_MEDS_dataset: This will point to a Path containing the tensorized and schema files for the simple_static_MEDS dataset.
tensorized_MEDS_dataset_with_task: This will point to a tuple containing:
- A Path containing the tensorized and schema files for the simple_static_MEDS_dataset_with_task dataset
- A Path pointing to the root task directory for the dataset
- The specific task name for the dataset. Task label files will be stored in a subdir of the root task directory with this name.

You can find these in either the conftest.py file for this repository or the meds_testing_helpers package, which this package leverages for testing.

Synthetic Data

To start, let's take a look at this synthetic data. It is sharded by split, and we'll look at the train split first, which has two shards (we convert to polars just for prettier printing). It has four subjects across the two shards:

>>> import polars as pl
>>> from meds_testing_helpers.dataset import MEDSDataset
>>> D = MEDSDataset(root_dir=simple_static_MEDS)
>>> train_0 = pl.from_arrow(D.data_shards["train/0"])
>>> train_0
shape: (30, 4)
┌────────────┬─────────────────────┬────────────────────┬───────────────┐
│ subject_id ┆ time                ┆ code               ┆ numeric_value │
│ ---        ┆ ---                 ┆ ---                ┆ ---           │
│ i64        ┆ datetime[μs]        ┆ str                ┆ f32           │
╞════════════╪═════════════════════╪════════════════════╪═══════════════╡
│ 239684     ┆ null                ┆ EYE_COLOR//BROWN   ┆ null          │
│ 239684     ┆ null                ┆ HEIGHT             ┆ 175.271118    │
│ 239684     ┆ 1980-12-28 00:00:00 ┆ DOB                ┆ null          │
│ 239684     ┆ 2010-05-11 17:41:51 ┆ ADMISSION//CARDIAC ┆ null          │
│ 239684     ┆ 2010-05-11 17:41:51 ┆ HR                 ┆ 102.599998    │
│ …          ┆ …                   ┆ …                  ┆ …             │
│ 1195293    ┆ 2010-06-20 20:24:44 ┆ HR                 ┆ 107.699997    │
│ 1195293    ┆ 2010-06-20 20:24:44 ┆ TEMP               ┆ 100.0         │
│ 1195293    ┆ 2010-06-20 20:41:33 ┆ HR                 ┆ 107.5         │
│ 1195293    ┆ 2010-06-20 20:41:33 ┆ TEMP               ┆ 100.400002    │
│ 1195293    ┆ 2010-06-20 20:50:04 ┆ DISCHARGE          ┆ null          │
└────────────┴─────────────────────┴────────────────────┴───────────────┘
>>> train_1 = pl.from_arrow(D.data_shards["train/1"])
>>> train_1
shape: (14, 4)
┌────────────┬─────────────────────┬───────────────────────┬───────────────┐
│ subject_id ┆ time                ┆ code                  ┆ numeric_value │
│ ---        ┆ ---                 ┆ ---                   ┆ ---           │
│ i64        ┆ datetime[μs]        ┆ str                   ┆ f32           │
╞════════════╪═════════════════════╪═══════════════════════╪═══════════════╡
│ 68729      ┆ null                ┆ EYE_COLOR//HAZEL      ┆ null          │
│ 68729      ┆ null                ┆ HEIGHT                ┆ 160.395309    │
│ 68729      ┆ 1978-03-09 00:00:00 ┆ DOB                   ┆ null          │
│ 68729      ┆ 2010-05-26 02:30:56 ┆ ADMISSION//PULMONARY  ┆ null          │
│ 68729      ┆ 2010-05-26 02:30:56 ┆ HR                    ┆ 86.0          │
│ …          ┆ …                   ┆ …                     ┆ …             │
│ 814703     ┆ 1976-03-28 00:00:00 ┆ DOB                   ┆ null          │
│ 814703     ┆ 2010-02-05 05:55:39 ┆ ADMISSION//ORTHOPEDIC ┆ null          │
│ 814703     ┆ 2010-02-05 05:55:39 ┆ HR                    ┆ 170.199997    │
│ 814703     ┆ 2010-02-05 05:55:39 ┆ TEMP                  ┆ 100.099998    │
│ 814703     ┆ 2010-02-05 07:02:30 ┆ DISCHARGE             ┆ null          │
└────────────┴─────────────────────┴───────────────────────┴───────────────┘
>>> sorted(set(train_0["subject_id"].unique()) | set(train_1["subject_id"].unique()))
[68729, 239684, 814703, 1195293]

`MEDSTorchDataConfig` Configuration Object

Full API documentation for the configuration object can be found here.

The configuration object contains two kinds of parameters: Data processing parameters and file paths. Data processing parameters include:

max_seq_len: The maximum sequence length to use for the model.
seq_sampling_strategy: The strategy to use when sampling sub-sequences to return for input sequences longer than max_seq_len.
static_inclusion_mode: The mode to use when including static data in the output.
batch_mode: Whether to return sequences at the measurement level ("SM") or the event level ("SEM"). Note that here, we use "measurement" to refer to a single row (observation) in the raw MEDS data, and "event" to refer to all measurements taken at a single time-point.

Of these, seq_sampling_strategy and static_inclusion_mode are restricted: they must be members of the SubsequenceSamplingStrategy and StaticInclusionMode StrEnums, respectively:

seq_sampling_strategy: One of ["random", "to_end", "from_start"] (defaults to "random").
static_inclusion_mode: One of ["include", "omit"] (defaults to "include").
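
To make the three sampling strategies concrete, here is a minimal sketch on a plain Python list. The function sample_subsequence is hypothetical and is not the library's implementation; in particular, it assumes that "random" selects a contiguous window with a uniformly random start.

```python
import random


def sample_subsequence(seq, max_seq_len, strategy="random", rng=None):
    """Return at most max_seq_len contiguous items of seq (illustrative only)."""
    if len(seq) <= max_seq_len:
        # Short sequences are returned whole, whatever the strategy.
        return list(seq)
    if strategy == "from_start":
        start = 0
    elif strategy == "to_end":
        start = len(seq) - max_seq_len
    elif strategy == "random":
        rng = rng or random.Random()
        # Assumption: a contiguous window with a uniformly random start index.
        start = rng.randint(0, len(seq) - max_seq_len)
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return list(seq[start : start + max_seq_len])


events = list(range(8))
print(sample_subsequence(events, 5, "from_start"))  # [0, 1, 2, 3, 4]
print(sample_subsequence(events, 5, "to_end"))      # [3, 4, 5, 6, 7]
window = sample_subsequence(events, 5, "random", rng=random.Random(0))
```

Only "random" depends on the random number generator; the other two strategies always return the same window for a given sequence.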

File path parameters include:

tensorized_cohort_dir: The directory containing the tensorized data.
task_labels_dir: The directory containing the task labels files.

It also provides a convenient property to get the vocab size for the dataset, given by the vocab indices in the tensorized metadata. Let's start by building a configuration object for this data and inspecting some of its file-path-related properties and helpers:

>>> from meds_torchdata import MEDSTorchDataConfig
>>> cfg = MEDSTorchDataConfig(tensorized_MEDS_dataset, max_seq_len=5)
>>> cfg.tensorized_cohort_dir
PosixPath('/tmp/tmp...')
>>> cfg.schema_dir
PosixPath('/tmp/tmp.../tokenization/schemas')
>>> print(sorted(list(cfg.schema_fps)))
[('held_out/0', PosixPath('/tmp/tmp.../tokenization/schemas/held_out/0.parquet')),
 ('train/0', PosixPath('/tmp/tmp.../tokenization/schemas/train/0.parquet')),
 ('train/1', PosixPath('/tmp/tmp.../tokenization/schemas/train/1.parquet')),
 ('tuning/0', PosixPath('/tmp/tmp.../tokenization/schemas/tuning/0.parquet'))]
>>> print(cfg.task_labels_dir)
None
>>> print(cfg.task_labels_fps)
None
>>> print(cfg.vocab_size)
12

If we specify a task_labels_dir parameter, the config operates in task-specific mode. This allows us to use the task-specific helpers, but it also requires setting seq_sampling_strategy to "to_end", since you shouldn't try to predict a downstream task without leveraging the most recent data.

>>> cohort_dir, tasks_dir, task_name = tensorized_MEDS_dataset_with_task
>>> cfg = MEDSTorchDataConfig(
...     cohort_dir, max_seq_len=5, task_labels_dir=(tasks_dir / task_name)
... )
Traceback (most recent call last):
    ...
ValueError: Not sampling data till the end of the sequence when predicting for a specific task is not
permitted! This is because there is no use-case we know of where you would want to do this. If you disagree,
please let us know via a GitHub issue.
>>> cfg = MEDSTorchDataConfig(
...     cohort_dir, max_seq_len=5, task_labels_dir=(tasks_dir / task_name), seq_sampling_strategy="to_end"
... )
>>> cfg.task_labels_dir
PosixPath('/tmp/tmp.../task_labels/boolean_value_task')
>>> print(list(cfg.task_labels_fps))
[PosixPath('/tmp/tmp.../task_labels/boolean_value_task/labels_A.parquet.parquet'),
 PosixPath('/tmp/tmp.../task_labels/boolean_value_task/labels_B.parquet.parquet')]
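
The rule demonstrated above (task-specific mode requires "to_end" sampling) is the kind of check that fits naturally in a dataclass's __post_init__. The sketch below uses a hypothetical ToyConfig class to illustrate the pattern; it is not the source of MEDSTorchDataConfig, and the error messages are made up.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ToyConfig:
    """Hypothetical stand-in config that enforces the task-mode rule."""

    tensorized_cohort_dir: Path
    max_seq_len: int
    seq_sampling_strategy: str = "random"
    task_labels_dir: Optional[Path] = None

    def __post_init__(self):
        allowed = {"random", "to_end", "from_start"}
        if self.seq_sampling_strategy not in allowed:
            raise ValueError(f"seq_sampling_strategy must be one of {sorted(allowed)}")
        # In task-specific mode, only sampling up to the end of the sequence is allowed.
        if self.task_labels_dir is not None and self.seq_sampling_strategy != "to_end":
            raise ValueError("task-specific mode requires seq_sampling_strategy='to_end'")


try:
    ToyConfig(Path("cohort"), max_seq_len=5, task_labels_dir=Path("labels"))
    rejected = False
except ValueError:
    rejected = True

ok_cfg = ToyConfig(
    Path("cohort"), max_seq_len=5, task_labels_dir=Path("labels"), seq_sampling_strategy="to_end"
)
```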

Based on the seq_sampling_strategy, batch_mode, and max_seq_len parameters, the configuration object also has the process_dynamic_data helper function to slice the subject's dynamic data appropriately. This function is used internally, and you will not need to use it yourself.

`MEDSPytorchDataset` Dataset Class

Full API documentation for the dataset class can be found here.

Now let's build a dataset object from the synthetic data.

Dataset "Schema"

When we build a PyTorch dataset from it for training, with no task specified, the length will be four, as it will correspond to each of the four subjects in the train split. The index variable contains the list of subject IDs and the end of the allowed region of reading for the dataset. We can also see it in dataframe format via the schema_df:

>>> from meds_torchdata import MEDSPytorchDataset
>>> cfg = MEDSTorchDataConfig(tensorized_cohort_dir=tensorized_MEDS_dataset, max_seq_len=5)
>>> pyd = MEDSPytorchDataset(cfg, split="train")
>>> len(pyd)
4
>>> pyd.index
[(239684, 6), (1195293, 8), (68729, 3), (814703, 3)]
>>> pyd.schema_df
shape: (4, 2)
┌────────────┬─────────────────┐
│ subject_id ┆ end_event_index │
│ ---        ┆ ---             │
│ i64        ┆ u32             │
╞════════════╪═════════════════╡
│ 239684     ┆ 6               │
│ 1195293    ┆ 8               │
│ 68729      ┆ 3               │
│ 814703     ┆ 3               │
└────────────┴─────────────────┘

Note that the index is in terms of event indices, not measurement indices: it counts the unique timestamps in each subject's data, not the individual measurements. We can validate that against the raw data. To do so, we'll define two simple helpers: get_event_indices, which groups by the subject_id and time columns and calculates the event index of each group within its subject, and get_event_bounds, which shows us the minimum and maximum such index per subject.

>>> def get_event_indices(df: pl.DataFrame) -> pl.DataFrame:
...     return (
...         df
...         .group_by("subject_id", "time", maintain_order=True).agg(pl.len().alias("n_measurements"))
...         .with_row_index()
...         .select(
...             "subject_id", "time",
...             (pl.col("index") - pl.col("index").min().over("subject_id")).alias("event_idx"),
...             "n_measurements",
...         )
...     )
>>> def get_event_bounds(df: pl.DataFrame) -> pl.DataFrame:
...     return (
...         get_event_indices(df)
...         .with_columns(
...             pl.col("event_idx").max().over("subject_id").alias("max_event_idx")
...         )
...         .filter((pl.col("event_idx") == 0) | (pl.col("event_idx") == pl.col("max_event_idx")))
...         .select("subject_id", "event_idx", "time")
...     )
>>> get_event_bounds(train_0)
shape: (4, 3)
┌────────────┬───────────┬─────────────────────┐
│ subject_id ┆ event_idx ┆ time                │
│ ---        ┆ ---       ┆ ---                 │
│ i64        ┆ u32       ┆ datetime[μs]        │
╞════════════╪═══════════╪═════════════════════╡
│ 239684     ┆ 0         ┆ null                │
│ 239684     ┆ 6         ┆ 2010-05-11 19:27:19 │
│ 1195293    ┆ 0         ┆ null                │
│ 1195293    ┆ 8         ┆ 2010-06-20 20:50:04 │
└────────────┴───────────┴─────────────────────┘
>>> get_event_bounds(train_1)
shape: (4, 3)
┌────────────┬───────────┬─────────────────────┐
│ subject_id ┆ event_idx ┆ time                │
│ ---        ┆ ---       ┆ ---                 │
│ i64        ┆ u32       ┆ datetime[μs]        │
╞════════════╪═══════════╪═════════════════════╡
│ 68729      ┆ 0         ┆ null                │
│ 68729      ┆ 3         ┆ 2010-05-26 04:51:52 │
│ 814703     ┆ 0         ┆ null                │
│ 814703     ┆ 3         ┆ 2010-02-05 07:02:30 │
└────────────┴───────────┴─────────────────────┘

The schema changes to reflect the different split if we change the split:

>>> pyd_tuning = MEDSPytorchDataset(cfg, split="tuning")
>>> pyd_tuning.schema_df
shape: (1, 2)
┌────────────┬─────────────────┐
│ subject_id ┆ end_event_index │
│ ---        ┆ ---             │
│ i64        ┆ u32             │
╞════════════╪═════════════════╡
│ 754281     ┆ 3               │
└────────────┴─────────────────┘
>>> pyd_held_out = MEDSPytorchDataset(cfg, split="held_out")
>>> pyd_held_out.schema_df
shape: (1, 2)
┌────────────┬─────────────────┐
│ subject_id ┆ end_event_index │
│ ---        ┆ ---             │
│ i64        ┆ u32             │
╞════════════╪═════════════════╡
│ 1500733    ┆ 5               │
└────────────┴─────────────────┘

If you use a non-existent split or have something misconfigured, you'll get an error upon Dataset creation:

>>> pyd_bad = MEDSPytorchDataset(cfg, split="bad_split")
Traceback (most recent call last):
    ...
FileNotFoundError: No schema files found in /tmp/.../tokenization/schemas! If your data is not sharded by
split, this error may occur because this codebase does not handle non-split sharded data. See Issue #79 for
tracking this issue.

We can also inspect the schema for a dataset built with downstream task labels:

>>> cohort_dir, tasks_dir, task_name = tensorized_MEDS_dataset_with_task
>>> cfg_with_task = MEDSTorchDataConfig(
...     cohort_dir, max_seq_len=5, task_labels_dir=(tasks_dir / task_name), seq_sampling_strategy="to_end"
... )
>>> pyd_with_task = MEDSPytorchDataset(cfg_with_task, split="train")
>>> pyd_with_task.schema_df
shape: (13, 4)
┌────────────┬─────────────────┬─────────────────────┬───────────────┐
│ subject_id ┆ end_event_index ┆ prediction_time     ┆ boolean_value │
│ ---        ┆ ---             ┆ ---                 ┆ ---           │
│ i64        ┆ u32             ┆ datetime[μs]        ┆ bool          │
╞════════════╪═════════════════╪═════════════════════╪═══════════════╡
│ 239684     ┆ 3               ┆ 2010-05-11 18:00:00 ┆ false         │
│ 239684     ┆ 4               ┆ 2010-05-11 18:30:00 ┆ true          │
│ 239684     ┆ 5               ┆ 2010-05-11 19:00:00 ┆ true          │
│ 1195293    ┆ 3               ┆ 2010-06-20 19:30:00 ┆ false         │
│ 1195293    ┆ 4               ┆ 2010-06-20 20:00:00 ┆ true          │
│ …          ┆ …               ┆ …                   ┆ …             │
│ 68729      ┆ 2               ┆ 2010-05-26 04:00:00 ┆ true          │
│ 68729      ┆ 2               ┆ 2010-05-26 04:30:00 ┆ true          │
│ 814703     ┆ 2               ┆ 2010-02-05 06:00:00 ┆ false         │
│ 814703     ┆ 2               ┆ 2010-02-05 06:30:00 ┆ true          │
│ 814703     ┆ 2               ┆ 2010-02-05 07:00:00 ┆ true          │
└────────────┴─────────────────┴─────────────────────┴───────────────┘
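
One way to read the end_event_index column in the task schema above is as the index of the last event that occurs no later than the prediction time. The sketch below reproduces the values shown for subject 68729 under that reading. It is an illustration, not the library's code: the function is hypothetical, and treating an event exactly at the prediction time as included is an assumption.

```python
from bisect import bisect_right
from datetime import datetime

# Timestamped events for subject 68729, taken from the tables above. Event index 0 is
# the static (null-time) event, so the first timestamped event has index 1.
event_times = [
    datetime(1978, 3, 9),              # event 1: DOB
    datetime(2010, 5, 26, 2, 30, 56),  # event 2: admission
    datetime(2010, 5, 26, 4, 51, 52),  # event 3: last event
]


def end_event_index(event_times, prediction_time):
    """Index of the last event at or before prediction_time (inclusive by assumption)."""
    # bisect_right counts the timestamped events <= prediction_time. Because the static
    # event occupies index 0, that count is also the index of the last usable event.
    return bisect_right(event_times, prediction_time)


idx_0400 = end_event_index(event_times, datetime(2010, 5, 26, 4, 0))
idx_0430 = end_event_index(event_times, datetime(2010, 5, 26, 4, 30))
print(idx_0400, idx_0430)  # 2 2
```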

Returned items

While the raw data has codes as strings, they are naturally converted to integers when embedded in the PyTorch dataset. This happens during the aforementioned tensorization step. We can see how the codes are mapped to integers by looking at the output code metadata of that step:

>>> code_metadata = pl.read_parquet(tensorized_MEDS_dataset.joinpath("metadata/codes.parquet"))
>>> code_metadata.select("code", "code/vocab_index")
shape: (11, 2)
┌───────────────────────┬──────────────────┐
│ code                  ┆ code/vocab_index │
│ ---                   ┆ ---              │
│ str                   ┆ u8               │
╞═══════════════════════╪══════════════════╡
│ ADMISSION//CARDIAC    ┆ 1                │
│ ADMISSION//ORTHOPEDIC ┆ 2                │
│ ADMISSION//PULMONARY  ┆ 3                │
│ DISCHARGE             ┆ 4                │
│ DOB                   ┆ 5                │
│ …                     ┆ …                │
│ EYE_COLOR//BROWN      ┆ 7                │
│ EYE_COLOR//HAZEL      ┆ 8                │
│ HEIGHT                ┆ 9                │
│ HR                    ┆ 10               │
│ TEMP                  ┆ 11               │
└───────────────────────┴──────────────────┘
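
The visible rows of this table are consistent with indices being assigned in sorted code order starting at 1, which leaves 0 free for padding. The sketch below illustrates that scheme on a few toy codes; the function body is an assumption based on that observation, not the actual fit_vocabulary_indices stage.

```python
def fit_vocabulary_indices(codes):
    """Assign 1-based indices to unique codes in sorted order; 0 is left for padding."""
    return {code: i for i, code in enumerate(sorted(set(codes)), start=1)}


# Toy input: repeated codes receive a single shared index.
vocab = fit_vocabulary_indices(["HR", "TEMP", "DOB", "HR", "DISCHARGE"])
encoded = [vocab[c] for c in ["DOB", "HR", "TEMP", "DISCHARGE"]]
print(vocab)
print(encoded)
```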

We can see these vocab indices being used if we look at some elements of the PyTorch dataset. Note that some elements of the returned dictionaries are JointNestedRaggedTensorDict objects, so we'll define a small function that uses a pretty-printing helper from the associated library to display our outputs. Note that we'll also reduce precision in the numeric values to make the output more readable.

>>> from nested_ragged_tensors.ragged_numpy import pprint_dense
>>> def print_element(el: dict):
...     for k, v in el.items():
...         print(f"{k} ({type(v).__name__}):")
...         if k == "dynamic":
...             pprint_dense(v.to_dense())
...         else:
...             print(v)
>>> print_element(pyd[2])
static_code (list):
[8, 9]
static_numeric_value (list):
[nan, -0.5438239574432373]
dynamic (JointNestedRaggedTensorDict):
code
[ 5  3 10 11  4]
.
numeric_value
[        nan         nan -1.4474752  -0.34049404         nan]
.
time_delta_days
[           nan 1.17661045e+04 0.00000000e+00 0.00000000e+00
 9.78703722e-02]

We can also look at what would be returned if we had included a task in the dataset:

>>> print_element(pyd_with_task[0])
static_code (list):
[7, 9]
static_numeric_value (list):
[nan, 1.5770268440246582]
dynamic (JointNestedRaggedTensorDict):
code
[ 1 10 11 10 11]
.
numeric_value
[       nan -0.5697369 -1.2714673 -0.4375474 -1.1680276]
.
time_delta_days
[1.0726737e+04 0.0000000e+00 0.0000000e+00 4.8263888e-03 0.0000000e+00]
boolean_value (bool):
False

We can see in this case that the boolean_value field is included in the output, capturing the task label.

The contents of pyd[2] are stable, because index element 2, (68729, 3), corresponds to a subject whose full dynamic sequence (the five measurements shown above) fits within our max_seq_len of 5, so no sub-sequence sampling is needed.

>>> print_element(pyd[2])
static_code (list):
[8, 9]
static_numeric_value (list):
[nan, -0.5438239574432373]
dynamic (JointNestedRaggedTensorDict):
code
[ 5  3 10 11  4]
.
numeric_value
[        nan         nan -1.4474752  -0.34049404         nan]
.
time_delta_days
[           nan 1.17661045e+04 0.00000000e+00 0.00000000e+00
 9.78703722e-02]

If we sampled a different subject, one whose sequence is longer than max_seq_len, the output we'd get would depend on the config.seq_sampling_strategy option, and could be non-deterministic. By default, this is set to "random", so we'll get a random sub-sequence of length 5 each time. Here, so that this code is deterministic, we'll use _seeded_getitem, an internal, seeded version of the __getitem__ call.

>>> print_element(pyd._seeded_getitem(1, seed=0))
static_code (list):
[6, 9]
static_numeric_value (list):
[nan, 0.06802856922149658]
dynamic (JointNestedRaggedTensorDict):
code
[10 11 10 11 10]
.
numeric_value
[-0.04626633  0.69391906 -0.30007038  0.79735875 -0.31064537]
.
time_delta_days
[0.01888889 0.         0.0084838  0.         0.01167824]
>>> print_element(pyd._seeded_getitem(1, seed=1))
static_code (list):
[6, 9]
static_numeric_value (list):
[nan, 0.06802856922149658]
dynamic (JointNestedRaggedTensorDict):
code
[10 11 10 11 10]
.
numeric_value
[ 0.03833488  0.79735875  0.33972722  0.7456389  -0.04626633]
.
time_delta_days
[0.00115741 0.         0.01373843 0.         0.01888889]

Of course, if we set seq_sampling_strategy to something other than "random", this non-determinism would disappear:

>>> cfg_from_start = MEDSTorchDataConfig(
...     tensorized_cohort_dir=tensorized_MEDS_dataset, max_seq_len=5, seq_sampling_strategy="from_start"
... )
>>> pyd_from_start = MEDSPytorchDataset(cfg_from_start, split="train")
>>> print_element(pyd_from_start[1])
static_code (list):
[6, 9]
static_numeric_value (list):
[nan, 0.06802856922149658]
dynamic (JointNestedRaggedTensorDict):
code
[ 5  1 10 11 10]
.
numeric_value
[        nan         nan -0.23133166  0.79735875  0.03833488]
.
time_delta_days
[          nan 1.1688809e+04 0.0000000e+00 0.0000000e+00 1.1574074e-03]
>>> print_element(pyd_from_start[1])
static_code (list):
[6, 9]
static_numeric_value (list):
[nan, 0.06802856922149658]
dynamic (JointNestedRaggedTensorDict):
code
[ 5  1 10 11 10]
.
numeric_value
[        nan         nan -0.23133166  0.79735875  0.03833488]
.
time_delta_days
[          nan 1.1688809e+04 0.0000000e+00 0.0000000e+00 1.1574074e-03]

Batches, Collation, and Dataloaders

We can also examine not just individual elements but full batches, which we can access with the appropriate collate function via the built-in get_dataloader method. Here, we'll treat these outputs like dictionaries, but they are actually dataclass objects that have some additional properties we can use to access shapes and validate data. See the API documentation on the batch class for more information.

>>> batches = [batch for batch in pyd.get_dataloader(batch_size=2)]
>>> print_element(batches[1])
code (Tensor):
tensor([[ 5,  3, 10, 11,  4],
        [ 5,  2, 10, 11,  4]])
numeric_value (Tensor):
tensor([[ 0.0000,  0.0000, -1.4475, -0.3405,  0.0000],
        [ 0.0000,  0.0000,  3.0047,  0.8491,  0.0000]])
numeric_value_mask (Tensor):
tensor([[False, False,  True,  True, False],
        [False, False,  True,  True, False]])
time_delta_days (Tensor):
tensor([[0.0000e+00, 1.1766e+04, 0.0000e+00, 0.0000e+00, 9.7870e-02],
        [0.0000e+00, 1.2367e+04, 0.0000e+00, 0.0000e+00, 4.6424e-02]])
static_code (Tensor):
tensor([[8, 9],
        [8, 9]])
static_numeric_value (Tensor):
tensor([[ 0.0000, -0.5438],
        [ 0.0000, -1.1012]])
static_numeric_value_mask (Tensor):
tensor([[False,  True],
        [False,  True]])
>>> print_element(next(iter(pyd_with_task.get_dataloader(batch_size=2))))
code (Tensor):
tensor([[ 1, 10, 11, 10, 11],
        [11, 10, 11, 10, 11]])
numeric_value (Tensor):
tensor([[ 0.0000e+00, -5.6974e-01, -1.2715e+00, -4.3755e-01, -1.1680e+00],
        [-1.2715e+00, -4.3755e-01, -1.1680e+00,  1.3220e-03, -1.3749e+00]])
numeric_value_mask (Tensor):
tensor([[False,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True]])
time_delta_days (Tensor):
tensor([[1.0727e+04, 0.0000e+00, 0.0000e+00, 4.8264e-03, 0.0000e+00],
        [0.0000e+00, 4.8264e-03, 0.0000e+00, 2.5544e-02, 0.0000e+00]])
static_code (Tensor):
tensor([[7, 9],
        [7, 9]])
static_numeric_value (Tensor):
tensor([[0.0000, 1.5770],
        [0.0000, 1.5770]])
static_numeric_value_mask (Tensor):
tensor([[False,  True],
        [False,  True]])
boolean_value (Tensor):
tensor([False,  True])
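
In the batches above, missing numeric values appear as 0.0 with a False entry in the matching mask. The following pure-Python sketch shows that convention, along with right-padding of shorter sequences. The collate function here is hypothetical and operates on lists rather than tensors; it is not the library's collate function.

```python
import math


def collate(sequences, pad_value=0.0):
    """Pad ragged numeric sequences to a dense batch and build a validity mask."""
    max_len = max(len(s) for s in sequences)
    values, mask = [], []
    for seq in sequences:
        row_vals, row_mask = [], []
        for v in seq:
            is_present = not math.isnan(v)
            # Missing (NaN) values are stored as pad_value and flagged False in the mask.
            row_vals.append(v if is_present else pad_value)
            row_mask.append(is_present)
        n_pad = max_len - len(seq)
        # Shorter sequences are right-padded; padded positions are also masked out.
        values.append(row_vals + [pad_value] * n_pad)
        mask.append(row_mask + [False] * n_pad)
    return values, mask


nan = float("nan")
values, mask = collate([[nan, nan, -1.4475, -0.3405, nan], [nan, 0.5]])
print(values)
print(mask)
```

Carrying the mask alongside the values lets a model tell a true 0.0 from a missing or padded position.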

Thus far, our examples have all worked with the default config object, which sets (among other things) the default output to be at a measurement level, rather than an event level, by virtue of setting batch_mode to SM. Let's see what happens if we change that:

>>> pyd.config.batch_mode = "SEM"
>>> print_element(pyd[2])
static_code (list):
[8, 9]
static_numeric_value (list):
[nan, -0.5438239574432373]
dynamic (JointNestedRaggedTensorDict):
time_delta_days
[           nan 1.17661045e+04 9.78703722e-02]
.
---
.
dim1/mask
[[ True False False]
 [ True  True  True]
 [ True False False]]
.
code
[[ 5  0  0]
 [ 3 10 11]
 [ 4  0  0]]
.
numeric_value
[[        nan  0.          0.        ]
 [        nan -1.4474752  -0.34049404]
 [        nan  0.          0.        ]]
>>> batches = [batch for batch in pyd.get_dataloader(batch_size=2)]
>>> print_element(batches[1])
code (Tensor):
tensor([[[ 5,  0,  0],
         [ 3, 10, 11],
         [ 4,  0,  0]],
<BLANKLINE>
        [[ 5,  0,  0],
         [ 2, 10, 11],
         [ 4,  0,  0]]])
numeric_value (Tensor):
tensor([[[ 0.0000,  0.0000,  0.0000],
         [ 0.0000, -1.4475, -0.3405],
         [ 0.0000,  0.0000,  0.0000]],
<BLANKLINE>
        [[ 0.0000,  0.0000,  0.0000],
         [ 0.0000,  3.0047,  0.8491],
         [ 0.0000,  0.0000,  0.0000]]])
numeric_value_mask (Tensor):
tensor([[[False,  True,  True],
         [False,  True,  True],
         [False,  True,  True]],
<BLANKLINE>
        [[False,  True,  True],
         [False,  True,  True],
         [False,  True,  True]]])
time_delta_days (Tensor):
tensor([[0.0000e+00, 1.1766e+04, 9.7870e-02],
        [0.0000e+00, 1.2367e+04, 4.6424e-02]])
event_mask (Tensor):
tensor([[True, True, True],
        [True, True, True]])
static_code (Tensor):
tensor([[8, 9],
        [8, 9]])
static_numeric_value (Tensor):
tensor([[ 0.0000, -0.5438],
        [ 0.0000, -1.1012]])
static_numeric_value_mask (Tensor):
tensor([[False,  True],
        [False,  True]])
>>> pyd_with_task.config.batch_mode = "SEM"
>>> print_element(next(iter(pyd_with_task.get_dataloader(batch_size=2))))
code (Tensor):
tensor([[[ 5,  0,  0],
         [ 1, 10, 11],
         [10, 11,  0],
         [ 0,  0,  0]],
<BLANKLINE>
        [[ 5,  0,  0],
         [ 1, 10, 11],
         [10, 11,  0],
         [10, 11,  0]]])
numeric_value (Tensor):
tensor([[[ 0.0000e+00,  0.0000e+00,  0.0000e+00],
         [ 0.0000e+00, -5.6974e-01, -1.2715e+00],
         [-4.3755e-01, -1.1680e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00]],
<BLANKLINE>
        [[ 0.0000e+00,  0.0000e+00,  0.0000e+00],
         [ 0.0000e+00, -5.6974e-01, -1.2715e+00],
         [-4.3755e-01, -1.1680e+00,  0.0000e+00],
         [ 1.3220e-03, -1.3749e+00,  0.0000e+00]]])
numeric_value_mask (Tensor):
tensor([[[False,  True,  True],
         [False,  True,  True],
         [ True,  True,  True],
         [ True,  True,  True]],
<BLANKLINE>
        [[False,  True,  True],
         [False,  True,  True],
         [ True,  True,  True],
         [ True,  True,  True]]])
time_delta_days (Tensor):
tensor([[0.0000e+00, 1.0727e+04, 4.8264e-03, 0.0000e+00],
        [0.0000e+00, 1.0727e+04, 4.8264e-03, 2.5544e-02]])
event_mask (Tensor):
tensor([[ True,  True,  True, False],
        [ True,  True,  True,  True]])
static_code (Tensor):
tensor([[7, 9],
        [7, 9]])
static_numeric_value (Tensor):
tensor([[0.0000, 1.5770],
        [0.0000, 1.5770]])
static_numeric_value_mask (Tensor):
tensor([[False,  True],
        [False,  True]])
boolean_value (Tensor):
tensor([False,  True])
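
The relationship between the two batch modes can be illustrated by regrouping the measurement-level ("SM") output of pyd[2] into its event-level ("SEM") form. This sketch assumes that a measurement starts a new event whenever its time delta is non-zero (or NaN, as for the first measurement). The library reads both views from its stored nested ragged tensors rather than regrouping like this, so treat the function as an explanatory device only.

```python
def group_measurements_into_events(codes, time_deltas, pad=0):
    """Regroup a non-empty measurement-level sequence into padded events."""
    events, event_deltas = [], []
    for code, delta in zip(codes, time_deltas):
        # NaN compares unequal to 0, so the first (NaN-delta) measurement opens an event.
        if not events or delta != 0:
            events.append([code])
            event_deltas.append(delta)
        else:
            # A zero delta means "same timestamp as the previous measurement".
            events[-1].append(code)
    width = max(len(e) for e in events)
    padded = [e + [pad] * (width - len(e)) for e in events]
    mask = [[True] * len(e) + [False] * (width - len(e)) for e in events]
    return padded, mask, event_deltas


nan = float("nan")
codes = [5, 3, 10, 11, 4]                               # pyd[2] in "SM" mode
deltas = [nan, 11766.1045, 0.0, 0.0, 0.0978703722]      # time_delta_days
padded, mask, event_deltas = group_measurements_into_events(codes, deltas)
print(padded)  # [[5, 0, 0], [3, 10, 11], [4, 0, 0]]
print(mask)    # [[True, False, False], [True, True, True], [True, False, False]]
```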

Data Tensorization and Pre-processing Details

Full documentation for the preprocessing pipeline can be found here.

The MTD_preprocess command uses Hydra to manage its configuration and to run from the command line. You can see the available options by running the command with the --help flag:

== MTD_preprocess ==

MTD_preprocess is a command line tool for pre-processing MEDS data for use with meds_torchdata.

== Config ==

This is the config generated for this run:

MEDS_dataset_dir: ???
output_dir: ???
stage_runner_fp: null
do_overwrite: false
do_reshard: false
log_dir: ${output_dir}/.logs

You can override everything using the hydra `key=value` syntax; for example:

MTD_preprocess MEDS_dataset_dir=/path/to/dataset output_dir=/path/to/output do_overwrite=True

The MTD_preprocess command runs the following pre-processing stages:

fit_normalization: Fitting necessary parameters for normalization from the raw data (e.g., the mean and standard deviation of the numeric_value field).
fit_vocabulary_indices: Assigning unique vocabulary indices to each unique code in the data so that they can be transformed to numerical indices for tensorization.
normalization: Normalizing the data using the parameters fit in the fit_normalization stage to have a mean of 0 and a standard deviation of 1.
tokenization: Producing the schema files necessary for the tensorization stage.
tensorization: Producing the nested ragged tensor views of the data.
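
The fit_normalization and normalization stages listed above can be sketched as a fit step followed by an apply step. The version below is a simplified illustration with toy values: it assumes statistics are fit per code and uses the population standard deviation, neither of which is confirmed by this README, and the function bodies are not the MEDS-Transforms implementation.

```python
from collections import defaultdict
from statistics import mean, pstdev


def fit_normalization(rows):
    """Fit (mean, std) of numeric_value for each code, skipping missing values."""
    by_code = defaultdict(list)
    for code, value in rows:
        if value is not None:
            by_code[code].append(value)
    return {code: (mean(vals), pstdev(vals)) for code, vals in by_code.items()}


def normalize(rows, params):
    """Apply the fitted parameters so each code has mean 0 and standard deviation 1."""
    out = []
    for code, value in rows:
        if value is None:
            out.append((code, None))  # codes without a numeric value pass through
        else:
            mu, sigma = params[code]
            # Guard against a zero standard deviation (constant-valued codes).
            out.append((code, (value - mu) / sigma if sigma > 0 else 0.0))
    return out


rows = [("HR", 100.0), ("HR", 110.0), ("TEMP", 98.0), ("TEMP", 100.0), ("DISCHARGE", None)]
params = fit_normalization(rows)
normalized = normalize(rows, params)
print(normalized)
```

Separating the fit from the application is what allows parameters to be fit once and then applied consistently to every shard.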

[!NOTE] If you would like additional normalization options to be supported, please comment on the upstream issue in MEDS-Transforms, and file an issue here to capture supporting additional options cleanly going forward.

[!NOTE] You should perform any additional, model-specific pre-processing on the data prior to running the MTD_preprocess command for your specific use-case. For example, if you wish to perform additional pre-processing such as:

- Dropping numeric values entirely and converting to quantile-modified codes
- Dropping infrequent codes or aggregating codes into higher-order categories
- Restricting subjects to a specific time-window
- Dropping subjects with infrequent values
- Occluding outlier numeric values

you should perform these steps on the raw MEDS data prior to running the tensorization command. This ensures that the data is modified as you desire in an efficient, transparent way and that the tensorization step works with data in its final format, avoiding discrepancies in code vocabulary and similar issues.

Advanced features

You can also use this package natively with Hydra in modeling applications by adding meds_torchdata.MEDSTorchDataConfig to the Hydra config store. This will allow you to use it as though it were a fully defined .yaml configuration file in your application configuration. To do this, run MEDSTorchDataConfig.add_to_config_store() in your application, passing the name of the config group in which you plan to use the config.

E.g., if you have a config file like this:

dataset:
  _target_: meds_torchdata.MEDSPytorchDataset
  config: MEDSTorchDataConfig

Then in your main application, prior to @hydra.main, you can run:

from meds_torchdata.config import MEDSTorchDataConfig

MEDSTorchDataConfig.add_to_config_store("dataset/config")

This will add the MEDSTorchDataConfig to the Hydra config store in the nested dataset/config group, which will allow you to override its parameters from the command line and instantiate it into object form natively.

Testing Models that Use this Package

If you use this package to build your model, we also expose some pytest fixtures that you can use to test it. They are similar to the four fixtures used above in the Examples and Detailed Usage section. Namely, they are:

- tensorized_MEDS_dataset: A fixture that points to a Path containing the tensorized and schema files for the simple_static_MEDS dataset.
- tensorized_MEDS_dataset_with_task: A fixture that points to a tuple containing:
  - A Path containing the tensorized and schema files for the simple_static_MEDS_dataset_with_task dataset.
  - A Path pointing to the root task directory for the dataset.
  - The specific task name for the dataset. Task label files will be stored in a subdir of the root task directory with this name.
- sample_pytorch_dataset: This will yield a MEDSPytorchDataset object built using the tensorized_MEDS_dataset data, without a downstream task.
- sample_pytorch_dataset_with_task: This will yield a MEDSPytorchDataset object built using the tensorized_MEDS_dataset_with_task data, with the associated downstream task.

You can rely on these fixtures to test your model in the normal way, directly having your model train using input batches derived from the sample datasets.

Performance

See https://mmcdermott.github.io/meds-torch-data/dev/bench/ for performance benchmarks for all commits in this repository. See here for the benchmarking script. Note that these benchmarks are likely to change over time, so they should be judged relative to the content of the associated commits, not in absolute terms (e.g., we are likely to benchmark on more, or more complex, synthetic data over time).

Release history

Version	Release date
0.9.0	Apr 21, 2026
0.8.2	Apr 20, 2026
0.8.1	Apr 19, 2026
0.8.0	Apr 19, 2026
0.7.0	Apr 16, 2026
0.6.6	Nov 5, 2025
0.6.5	Sep 14, 2025
0.6.4	Sep 14, 2025
0.6.3	Jun 6, 2025
0.6.2	Jun 3, 2025
0.6.1	May 11, 2025
0.6.0 (this version)	May 6, 2025
0.5.1	Apr 18, 2025
0.5	Apr 15, 2025
0.4.1	Apr 15, 2025
0.4	Mar 25, 2025
0.3.1	Mar 19, 2025
0.3	Mar 18, 2025
0.1.11	Sep 14, 2025

Download files

Source Distribution

meds_torch_data-0.6.0.tar.gz (83.2 kB)

Uploaded May 6, 2025 Source

Built Distribution

meds_torch_data-0.6.0-py3-none-any.whl (48.3 kB)

Uploaded May 6, 2025 Python 3

File details

Details for the file meds_torch_data-0.6.0.tar.gz.

File metadata

Download URL: meds_torch_data-0.6.0.tar.gz
Upload date: May 6, 2025
Size: 83.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for meds_torch_data-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`c65a5ebd723185194144746a80fb1db577d2f185bd533c47eeeec6723d1f9203`
MD5	`6b3121aad2b5458643f6ddd81fc5dff8`
BLAKE2b-256	`16530dc58ff68f01642f6e502c2127fa553404b5d173cfa108ee2bcc6f817f60`

Provenance

The following attestation bundles were made for meds_torch_data-0.6.0.tar.gz:

Publisher: python-build.yaml on mmcdermott/meds-torch-data

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: meds_torch_data-0.6.0.tar.gz
- Subject digest: c65a5ebd723185194144746a80fb1db577d2f185bd533c47eeeec6723d1f9203
- Sigstore transparency entry: 207555911
- Sigstore integration time: May 6, 2025
Source repository:
- Permalink: mmcdermott/meds-torch-data@854cc62b577552e04dce62ccee391a279d85763e
- Branch / Tag: refs/tags/0.6.0
- Owner: https://github.com/mmcdermott
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-build.yaml@854cc62b577552e04dce62ccee391a279d85763e
- Trigger Event: push

File details

Details for the file meds_torch_data-0.6.0-py3-none-any.whl.

File metadata

Download URL: meds_torch_data-0.6.0-py3-none-any.whl
Upload date: May 6, 2025
Size: 48.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for meds_torch_data-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2afee051204cce6ebeea951a84294f87859862cf07a91543dc870a5df2d34019`
MD5	`c92dbefb9f7068211bcdd6d4313a5d1a`
BLAKE2b-256	`113e5843689cc854491938820f1eafe95e79a1522f5d053ffd505d51b3f47659`

Provenance

The following attestation bundles were made for meds_torch_data-0.6.0-py3-none-any.whl:

Publisher: python-build.yaml on mmcdermott/meds-torch-data

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: meds_torch_data-0.6.0-py3-none-any.whl
- Subject digest: 2afee051204cce6ebeea951a84294f87859862cf07a91543dc870a5df2d34019
- Sigstore transparency entry: 207555916
- Sigstore integration time: May 6, 2025
Source repository:
- Permalink: mmcdermott/meds-torch-data@854cc62b577552e04dce62ccee391a279d85763e
- Branch / Tag: refs/tags/0.6.0
- Owner: https://github.com/mmcdermott
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-build.yaml@854cc62b577552e04dce62ccee391a279d85763e
- Trigger Event: push
