A simple auto-regressive, 'everything-is-code' style model for MEDS datasets

These details have been verified by PyPI

Maintainers

mmd_pypi

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

MEDS "Everything-is-code" Autoregressive Model

A MEDS, "Everything-is-code" style Autoregressive Generative Model, capable of zero-shot inference.

This is based on the MEDS-Torch model of the same name.

Installation

pip install MEDS-EIC-AR

Optional Dependencies

WandB

If you want to use WandB for logging, you can install it via:

pip install MEDS-EIC-AR[wandb]

MLFlow

If you want to use MLFlow for logging, you can install it via:

pip install MEDS-EIC-AR[mlflow]

This also installs psutil and pynvml so that MLFlow can track system CPU and GPU resources. System metrics tracking is enabled by default and can be controlled via the MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING environment variable. See the MLFlow documentation for more details.

Flash Attention

To use flash attention, you must install the flash-attn package separately, after installing MEDS-EIC-AR. This can often be done via:

pip install flash-attn --no-build-isolation

If you encounter errors, see the flash-attn package documentation.

Usage

1. Pre-process your data

You have three directories:

- $RAW_MEDS_DIR -- The raw MEDS data directory that you want to pre-process.
- $INTERMEDIATE_DIR -- An intermediate directory where the partially processed data will be stored prior to tokenization and tensorization.
- $FINAL_DATA_DIR -- The final output directory where the tokenized and tensorized data will be stored. This directory is suitable for use in loading the data with meds-torch-data.

Run:

MEICAR_process_data input_dir="$RAW_MEDS_DIR" \
    intermediate_dir="$INTERMEDIATE_DIR" \
    output_dir="$FINAL_DATA_DIR"

Note: If your data is not already sharded by split, add the do_reshard=True command line parameter to the MEICAR_process_data command. This makes the system reshard the data so that it is sub-sharded by split before pre-processing begins.

You can also run this in demo mode, which lowers the filtering thresholds significantly so the script does not filter out all data:

MEICAR_process_data ... do_demo=True

You can exert more fine-grained control over the filtering with the following environment variables:

- MIN_SUBJECTS_PER_CODE: The number of subjects in which a given code must be observed for it to be included in the final vocabulary. Note that this excludes some sentinel codes, which are always retained.
- MIN_EVENTS_PER_SUBJECT: The number of events a subject must have to be included in the final dataset.
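
If you drive the pre-processing step from Python rather than from a shell, the command and environment described above could be assembled roughly as follows. This is a minimal sketch: the helper names build_process_data_command and build_filter_env are hypothetical, and the threshold values 10 and 5 are purely illustrative, not recommended settings. Only the command name, its parameters, and the two environment variable names come from the instructions above.

```python
import os
import shlex


def build_process_data_command(raw_meds_dir, intermediate_dir, final_data_dir,
                               do_reshard=False, do_demo=False):
    """Assemble the MEICAR_process_data argument list (hypothetical helper)."""
    cmd = [
        "MEICAR_process_data",
        f"input_dir={raw_meds_dir}",
        f"intermediate_dir={intermediate_dir}",
        f"output_dir={final_data_dir}",
    ]
    if do_reshard:
        cmd.append("do_reshard=True")  # needed when data is not sharded by split
    if do_demo:
        cmd.append("do_demo=True")  # lowers the filtering thresholds
    return cmd


def build_filter_env(min_subjects_per_code=None, min_events_per_subject=None,
                     base_env=None):
    """Copy an environment and set the filtering variables that were given."""
    env = dict(os.environ if base_env is None else base_env)
    if min_subjects_per_code is not None:
        env["MIN_SUBJECTS_PER_CODE"] = str(min_subjects_per_code)
    if min_events_per_subject is not None:
        env["MIN_EVENTS_PER_SUBJECT"] = str(min_events_per_subject)
    return env


cmd = build_process_data_command("raw", "intermediate", "final", do_reshard=True)
# base_env={} keeps this example small; in practice leave it as None so that
# PATH and other variables are inherited from the current process.
env = build_filter_env(min_subjects_per_code=10, min_events_per_subject=5,
                       base_env={})
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, env=env, check=True)
```

Passing the arguments as a list avoids shell quoting issues with directory names that contain spaces.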

2. Pre-train the model

You can pre-train the model using the MEICAR_pretrain command. To use this, let us assume you have a new directory to store the pretrained model artifacts called $PRETRAINED_MODEL_DIR. Then, you can run:

MEICAR_pretrain datamodule.config.tensorized_cohort_dir="$FINAL_DATA_DIR" \
    output_dir="$PRETRAINED_MODEL_DIR" \
    datamodule.batch_size=32

to train the model for 10 epochs.

This uses a Hydra configuration system, with the root config located in the _pretrain.yaml file. You can override any of the nested configuration parameters on the command line (as shown above via datamodule.config.tensorized_cohort_dir), though you will more likely materialize an experimental configuration file to disk in yaml form and override the config path and name directly in the normal Hydra manner.
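
To illustrate how nested parameters map onto dotted command-line overrides, here is a small sketch that flattens a nested dictionary into key=value strings. The helper name to_hydra_overrides and the example paths are hypothetical; only the parameter names are taken from the command above. Values containing spaces or special characters would need additional quoting, which this sketch does not handle.

```python
def to_hydra_overrides(config, prefix=""):
    """Flatten a nested dict into dotted key=value override strings."""
    overrides = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted path.
            overrides.extend(to_hydra_overrides(value, dotted))
        elif value is None:
            overrides.append(f"{dotted}=null")  # Hydra spells None as null
        else:
            overrides.append(f"{dotted}={value}")
    return overrides


# Illustrative values only; the paths are placeholders.
experiment = {
    "datamodule": {
        "config": {"tensorized_cohort_dir": "/data/final"},
        "batch_size": 32,
    },
    "output_dir": "/models/pretrained",
}
overrides = to_hydra_overrides(experiment)
print(["MEICAR_pretrain", *overrides])
```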

Warning: Tests here only validate that the model runs without errors and (in demo mode) runs without producing NaNs or invalid values. The model has not yet been assessed to ensure that it trains to convergence.

3. Zero-shot Inference

Zero-shot inference consists of two steps:

- Given a task cohort and a pre-trained model, for each sample in the task cohort, use the pre-trained model to generate future trajectories forward from that sample's inputs and save them to disk in a pseudo-MEDS format.
- Resolve these generated trajectories into concrete, probabilistic predictions for the task cohort.
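
As a rough intuition for the second step, one natural way to turn several sampled trajectories into a probability is to report the fraction of trajectories in which the task outcome occurs. The sketch below shows only that arithmetic. It is an assumption about what "resolving" could look like, not the package's method (the package does not yet implement this step), and the function name empirical_probability is hypothetical.

```python
def empirical_probability(trajectory_outcomes):
    """Fraction of generated trajectories in which the outcome occurred.

    trajectory_outcomes: one boolean per generated trajectory for a single
    task sample. How each boolean is derived from a trajectory is task
    specific and is not covered here.
    """
    outcomes = list(trajectory_outcomes)
    if not outcomes:
        raise ValueError("At least one trajectory outcome is required.")
    return sum(bool(o) for o in outcomes) / len(outcomes)


# 20 trajectories for one task sample, 5 of which contain the outcome.
outcomes = [True] * 5 + [False] * 15
print(empirical_probability(outcomes))  # 0.25
```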

3.1 Generate Trajectories for a task spec.

You can directly generate trajectories using the MEICAR_generate_trajectories command. This requires a few more configuration parameters than the pre-training step, so let's go through those:

- You need to specify the task labels directory in the datamodule.config.task_labels_dir parameter.
- You need to specify the model initialization directory in the model_initialization_dir parameter. This is the output directory of the pre-train step.
- You need to specify how you want to trade off between the allowed input context size and the maximum possible generated trajectory length. A larger context lets the model use more of the patient's record, while a longer generated trajectory lets you predict further into the future. This can be configured with one of three parameters in the seq_lens part of the config. If you set:
  - seq_lens.generation_context_size, that will be the maximum length of the input context, and the remaining length of the pretrained model's maximum sequence length will be used for generation.
  - seq_lens.max_generated_trajectory_len, that will be the maximum length of the generated trajectory, and the remaining length of the pretrained model's maximum sequence length will be used for the input.
  - seq_lens.frac_seq_len_as_context, that will be the fraction of the pretrained model's maximum sequence length that will be used for the input context, and the remaining length will be used for generation. This is set by default to 0.25, which means that 25% of the maximum sequence length will be used for the input context, and 75% will be used for generation. If you wish to use one of the other two parameters on the command line, be sure to set this one to null to disable it.
- Lastly, you need to specify for which splits you wish to generate samples and how many trajectories per task sample you wish to generate. You can do this via the inference.generate_for_splits and inference.N_trajectories_per_task_sample parameters. The former is a list of splits to generate for and the latter is the number of trajectories to generate per task sample. The default is to generate 20 trajectories for each task sample in the tuning and held-out splits.
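
The arithmetic behind the three seq_lens options can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the function name resolve_seq_lens is hypothetical, the maximum sequence length of 1024 is an example value, and the rounding used for the fractional mode is a guess.

```python
def resolve_seq_lens(max_seq_len, generation_context_size=None,
                     max_generated_trajectory_len=None,
                     frac_seq_len_as_context=None):
    """Return (context_len, generated_len) given exactly one of three settings."""
    given = [
        v for v in (generation_context_size, max_generated_trajectory_len,
                    frac_seq_len_as_context)
        if v is not None
    ]
    if len(given) != 1:
        raise ValueError("Set exactly one of the three seq_lens parameters.")

    if generation_context_size is not None:
        context_len = generation_context_size
    elif max_generated_trajectory_len is not None:
        context_len = max_seq_len - max_generated_trajectory_len
    else:
        if not 0 < frac_seq_len_as_context < 1:
            raise ValueError("The fraction must be strictly between 0 and 1.")
        # Rounding to the nearest integer is an assumption of this sketch.
        context_len = int(round(max_seq_len * frac_seq_len_as_context))

    generated_len = max_seq_len - context_len
    if context_len <= 0 or generated_len <= 0:
        raise ValueError("Both the context and the generation need room.")
    return context_len, generated_len


print(resolve_seq_lens(1024, frac_seq_len_as_context=0.25))       # (256, 768)
print(resolve_seq_lens(1024, generation_context_size=512))        # (512, 512)
print(resolve_seq_lens(1024, max_generated_trajectory_len=1000))  # (24, 1000)
```

In every mode the two lengths sum to the pretrained model's maximum sequence length, which is why setting more than one parameter at once would be contradictory.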

After these are set, you can run the following command to generate trajectories for a task cohort:

MEICAR_generate_trajectories \
    output_dir="$GENERATED_TRAJECTORIES_DIR" \
    model_initialization_dir="$PRETRAINED_MODEL_DIR" \
    datamodule.config.tensorized_cohort_dir="$FINAL_DATA_DIR" \
    datamodule.config.task_labels_dir="$TASK_ROOT_DIR/$TASK_NAME" \
    datamodule.batch_size=32

This will generate trajectories for the task cohort and save them in the format: $GENERATED_TRAJECTORIES_DIR/$SPLIT/$SAMPLE.parquet.
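
Given that layout, the generated files can be enumerated per split with the standard library alone. The sketch below only lists paths and does not read the parquet contents; the helper name list_trajectory_files is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def list_trajectory_files(generated_trajectories_dir):
    """Group <root>/<split>/<sample>.parquet files by split directory name."""
    root = Path(generated_trajectories_dir)
    files_by_split = defaultdict(list)
    # One directory level for the split, then one parquet file per sample.
    for path in sorted(root.glob("*/*.parquet")):
        files_by_split[path.parent.name].append(path)
    return dict(files_by_split)


if __name__ == "__main__":
    import sys

    # Usage: python list_trajectories.py "$GENERATED_TRAJECTORIES_DIR"
    for split, paths in list_trajectory_files(sys.argv[1]).items():
        print(f"{split}: {len(paths)} trajectory files")
```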

See the documentation for format_trajectories for more details on the format of the generated trajectories.

Warning: The tests here only validate that this runs without errors and produces trajectory files that are valid, non-identical across different samples, and containing the right subjects. It has not yet been assessed to ensure full correctness.

3.2 Resolve Trajectories into Predictions.

Not yet implemented.

Documentation

Configuration and Controlling Model Structure

This model is configured via Hydra and PyTorch Lightning. The configuration structure of this repository is as follows:

>>> print_directory("./src/MEDS_EIC_AR/configs", config=PrintConfig(file_extension=".yaml"))
├── _demo_generate_trajectories.yaml
├── _demo_pretrain.yaml
├── _generate_trajectories.yaml
├── _pretrain.yaml
├── datamodule
│   ├── default.yaml
│   ├── generate_trajectories.yaml
│   └── pretrain.yaml
├── inference
│   ├── default.yaml
│   └── demo.yaml
├── lightning_module
│   ├── LR_scheduler
│   │   ├── cosine_annealing_warm_restarts.yaml
│   │   ├── get_cosine_schedule_with_warmup.yaml
│   │   ├── one_cycle_LR.yaml
│   │   └── reduce_LR_on_plateau.yaml
│   ├── default.yaml
│   ├── demo.yaml
│   ├── metrics
│   │   └── default.yaml
│   ├── model
│   │   ├── default.yaml
│   │   ├── demo.yaml
│   │   └── small.yaml
│   └── optimizer
│       ├── adam.yaml
│       └── adamw.yaml
└── trainer
    ├── callbacks
    │   ├── default.yaml
    │   ├── early_stopping.yaml
    │   ├── learning_rate_monitor.yaml
    │   └── model_checkpoint.yaml
    ├── default.yaml
    ├── demo.yaml
    └── logger
        ├── csv.yaml
        ├── mlflow.yaml
        └── wandb.yaml
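
Directories in this tree that hold several alternative yaml files act as Hydra config groups, and Hydra's usual syntax for choosing an option from a group is group/path=option. The sketch below builds such an override and checks the option against names copied from the tree above. The helper name group_override is hypothetical, and whether a given group is already in the root config's defaults list (which decides whether a plain override or a + prefixed one is needed) is not stated here, so treat the output as a starting point.

```python
# Option names copied from the configuration tree above.
CONFIG_GROUPS = {
    "lightning_module/LR_scheduler": [
        "cosine_annealing_warm_restarts",
        "get_cosine_schedule_with_warmup",
        "one_cycle_LR",
        "reduce_LR_on_plateau",
    ],
    "lightning_module/model": ["default", "demo", "small"],
    "lightning_module/optimizer": ["adam", "adamw"],
    "trainer/logger": ["csv", "mlflow", "wandb"],
}


def group_override(group, option):
    """Build a Hydra config-group override string such as 'trainer/logger=wandb'."""
    if group not in CONFIG_GROUPS:
        raise KeyError(f"Unknown config group: {group}")
    if option not in CONFIG_GROUPS[group]:
        raise ValueError(f"{option!r} is not an option of {group}")
    return f"{group}={option}"


print(group_override("trainer/logger", "wandb"))
print(group_override("lightning_module/optimizer", "adamw"))
```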

Output Files

The output files of the pre-training step are stored in the directory specified by the output_dir parameter and take the following structure:

>>> print_directory(pretrained_model)
├── .logs
│   ├── .hydra
│   │   ├── config.yaml
│   │   ├── hydra.yaml
│   │   └── overrides.yaml
│   └── __main__.log
├── best_model.ckpt
├── checkpoints
│   ├── epoch=0-step=1.ckpt
│   ├── epoch=0-step=2.ckpt
│   ├── epoch=1-step=3.ckpt
│   ├── epoch=1-step=4.ckpt
│   └── last.ckpt
├── config.yaml
└── loggers
    └── csv
        └── version_0
            ├── hparams.yaml
            └── metrics.csv
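
Based on the layout above, a script that consumes the pre-training output could locate a checkpoint as sketched below. The helper name find_checkpoint is hypothetical, and the fallback order (best_model.ckpt first, then checkpoints/last.ckpt) is a design choice of this sketch, not behavior documented for the package.

```python
from pathlib import Path


def find_checkpoint(output_dir, prefer_best=True):
    """Return the path of best_model.ckpt or checkpoints/last.ckpt, if present."""
    root = Path(output_dir)
    candidates = [root / "best_model.ckpt", root / "checkpoints" / "last.ckpt"]
    if not prefer_best:
        candidates.reverse()
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"No checkpoint found under {root}")


if __name__ == "__main__":
    import sys

    # Usage: python find_checkpoint.py "$PRETRAINED_MODEL_DIR"
    print(find_checkpoint(sys.argv[1]))
```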

Release history

- 0.1.11: Nov 5, 2025
- 0.1.10: Jun 4, 2025
- 0.1.9: Jun 4, 2025
- 0.1.8: Jun 4, 2025
- 0.1.7: Jun 4, 2025
- 0.1.6: May 29, 2025 (this version)
- 0.1.5: May 11, 2025
- 0.1.4: May 11, 2025
- 0.1.3: May 8, 2025
- 0.1.2: May 8, 2025
- 0.1.1: May 7, 2025
- 0.1.0: May 6, 2025
- 0.0.2: Apr 29, 2025
- 0.0.1: Apr 21, 2025

Download files

Source Distribution

meds_eic_ar-0.1.6.tar.gz (53.3 kB)

Uploaded May 29, 2025 Source

Built Distribution

meds_eic_ar-0.1.6-py3-none-any.whl (55.9 kB)

Uploaded May 29, 2025 Python 3

File details

Details for the file meds_eic_ar-0.1.6.tar.gz.

File metadata

Download URL: meds_eic_ar-0.1.6.tar.gz
Upload date: May 29, 2025
Size: 53.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for meds_eic_ar-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`f1101e1a1ca4d5e04292cfafc43260ec386147c485bab764627bb86a7a44d995`
MD5	`608ae9d58c433774ad5de8ae7d2a3d82`
BLAKE2b-256	`1959fbd919c66be54a59ed0efcaacdce6e742bc5d462bcf028ae806dc8147d11`

Provenance

The following attestation bundles were made for meds_eic_ar-0.1.6.tar.gz:

Publisher: python-build.yaml on mmcdermott/MEDS_EIC_AR

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: meds_eic_ar-0.1.6.tar.gz
- Subject digest: f1101e1a1ca4d5e04292cfafc43260ec386147c485bab764627bb86a7a44d995
- Sigstore transparency entry: 223924540
- Sigstore integration time: May 29, 2025
Source repository:
- Permalink: mmcdermott/MEDS_EIC_AR@e2ea6d532489ccce7a6ba4573201ba612d4c827c
- Branch / Tag: refs/tags/0.1.6
- Owner: https://github.com/mmcdermott
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-build.yaml@e2ea6d532489ccce7a6ba4573201ba612d4c827c
- Trigger Event: push

File details

Details for the file meds_eic_ar-0.1.6-py3-none-any.whl.

File metadata

Download URL: meds_eic_ar-0.1.6-py3-none-any.whl
Upload date: May 29, 2025
Size: 55.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for meds_eic_ar-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ceeb93034ca5c72fa7218f545693289c3c71d4169973dd3bd4377778fbb53376`
MD5	`3669cd04e18bcf82c3e4942b4535b9b9`
BLAKE2b-256	`f30ddeba8a21646565c6ce962ff3f9a74b30d7d25e81a36ca3cb53a0a130ad2b`

Provenance

The following attestation bundles were made for meds_eic_ar-0.1.6-py3-none-any.whl:

Publisher: python-build.yaml on mmcdermott/MEDS_EIC_AR

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: meds_eic_ar-0.1.6-py3-none-any.whl
- Subject digest: ceeb93034ca5c72fa7218f545693289c3c71d4169973dd3bd4377778fbb53376
- Sigstore transparency entry: 223924544
- Sigstore integration time: May 29, 2025
Source repository:
- Permalink: mmcdermott/MEDS_EIC_AR@e2ea6d532489ccce7a6ba4573201ba612d4c827c
- Branch / Tag: refs/tags/0.1.6
- Owner: https://github.com/mmcdermott
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-build.yaml@e2ea6d532489ccce7a6ba4573201ba612d4c827c
- Trigger Event: push
