A framework for compiling simple, MapReduce-style pipelines over MEDS datasets.
MEDS-Transforms: Build and run complex pipelines over MEDS datasets via simple parts
MEDS-Transforms is a Python package for assembling complex data pre-processing workflows over MEDS datasets. To do this, you define a pipeline as a series of stages, each with its own arguments, then run the pipeline over your dataset. This allows the community to curate a library of shared stages for common operations, such as filtering, normalization, outlier detection, and more, which can be used to build novel pipelines for diverse use cases. Learn more below to see how MEDS-Transforms can help you build your data pipelines!
🚀 Quick Start
1. Install via pip:

```bash
pip install MEDS-transforms
```
2. Craft a pipeline YAML file:
```yaml
input_dir: $MEDS_ROOT
output_dir: $PIPELINE_OUTPUT
description: Your special pipeline
stages:
  - filter_subjects:
      min_events_per_subject: 5
  - add_time_derived_measurements:
      age:
        DOB_code: MEDS_BIRTH
        age_code: AGE
        age_unit: years
      time_of_day:
        time_of_day_code: TIME_OF_DAY
        endpoints: [6, 12, 18, 24]
  - fit_outlier_detection:
      _base_stage: aggregate_code_metadata
      aggregations:
        - values/n_occurrences
        - values/sum
        - values/sum_sqd
  - occlude_outliers:
      stddev_cutoff: 1
  - fit_normalization:
      _base_stage: aggregate_code_metadata
      aggregations:
        - code/n_occurrences
        - code/n_subjects
        - values/n_occurrences
        - values/sum
        - values/sum_sqd
  - fit_vocabulary_indices
  - normalization
```
This pipeline will:
- Filter subjects to only those with at least 5 events (unique timestamps).
- Add codes and values for the subject's age and the time-of-day of each unique measurement.
- Fit statistics to recognize and occlude outliers over the numeric values.
- Remove numeric values that are more than 1 standard deviation away from the mean.
- Fit statistics to normalize the numeric values.
- Assign codes to unique vocabulary indices in preparation for modeling.
- Normalize the codes and numeric values to proper numeric form for modeling.
Save your pipeline YAML file on disk at `$PIPELINE_YAML`.
3. Run the pipeline
In the terminal, run:

```bash
MEDS_transform-pipeline pipeline_config_fp="$PIPELINE_YAML"
```

After you do, you will see output files stored in `$PIPELINE_OUTPUT` with the results of each stage of the pipeline, stored in stage-specific directories, and the global output in `$PIPELINE_OUTPUT/data` and `$PIPELINE_OUTPUT/metadata` (for data and metadata outputs, respectively). That's it!
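As a rough sketch of the resulting layout (the stage directory names below mirror the stage names in the example pipeline and are assumptions for illustration, not an exhaustive listing):

```
$PIPELINE_OUTPUT/
├── data/                            # final transformed data shards
├── metadata/                        # final metadata outputs
├── filter_subjects/                 # per-stage intermediate outputs
├── add_time_derived_measurements/
├── ...
└── normalization/
```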
4. Do even more!
Beyond just running a simple pipeline over the built-in stages, you can also do things like
- Define your own stages or use stages from other packages!
- Run your pipeline in parallel or across a Slurm cluster with stage-specific compute and memory requirements!
- Use meta-stage functionality like Match-Revise to dynamically control how your stage is run over different parts of the data!
To understand these capabilities and more, read the full documentation.
Examples of MEDS-Transforms in Action:
See any of the projects below to understand how to use MEDS-Transforms in different ways!
[!NOTE] If your package uses MEDS-Transforms, please submit a PR to add it to this list!
Detailed Documentation
Read the full API documentation for technical details.
Design Philosophy
MEDS-Transforms is built around the following design philosophy:
The MEDS format
MEDS-Transforms is built for use with MEDS datasets. This format is an incredibly simple, usable, and powerful format for representing electronic health record (EHR) datasets for use in machine learning or artificial intelligence applications.
Pipelines are Composed from Modular Stages
Any complex data pre-processing pipeline should be expressible as a series of simpler, interoperable stages. Expressing complex pipelines in this way allows the MEDS community to curate a library of "pre-processing stages" which can be used within the community to build novel, complex pipelines.
Stages should be Simple, Testable, and Interoperable
Each stage of a pipeline should be simple, testable, and (where possible) interoperable with other stages. This helps the community ensure correctness of pipelines and develop new tools in an efficient, reliable manner. It also helps researchers break down complex operations into simpler conceptual pieces. See the documentation on MEDS-Transforms Stages for more details on how to define your own stages!
Pipelines should be Defined via Readable, Comprehensive Configuration Files
Complex pipelines should also be communicable to other researchers, so that we can easily reproduce others' results, understand their work, and iterate on it. This is best enabled when pipelines can be defined by clear, simple configuration files over this shared library of stages. MEDS-Transforms realizes this with our pipeline configuration specification, shown above. See the full pipeline configuration documentation for more details.
Pipelines should Scale with Compute Resources to Arbitrary Dataset Sizes
Just as the MEDS format is designed to enable easy scaling of datasets through sharding, MEDS-Transforms is built around a mapreduce paradigm to enable easy scaling of pipelines to arbitrary dataset sizes by parallelizing operations across the input datasets' shards. Check out the mapreduce helpers MEDS-Transforms exposes for your use in downstream pipelines.
Data is the Interface
Just as MEDS is a data standard, MEDS-Transforms embodies the principle that data, rather than Python objects, should be the interface between pipeline components wherever possible. To that end, each MEDS-Transforms stage can be run as a standalone script that writes transformed files to disk, which subsequent stages then read. This means that you can easily run multiple MEDS-Transforms pipelines in sequence to combine operations across different packages or use cases, and seamlessly resume pipelines after interruptions or failures from the partially completed stage outputs.
[!NOTE] This does introduce some performance limitations, which we are working to address; follow Issue #56 to track updates on this!
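As a minimal sketch of this data-as-interface pattern, a second pipeline can simply point its `input_dir` at the first pipeline's output. The paths below are illustrative assumptions, and whether a downstream pipeline should read the root output directory or its `data/` subdirectory may depend on your stages:

```yaml
# second_pipeline.yaml -- a sketch of chaining pipelines via data on disk.
# Paths and the choice of input directory are illustrative assumptions.
input_dir: $PIPELINE_OUTPUT          # output of the first pipeline
output_dir: $SECOND_PIPELINE_OUTPUT
stages:
  - fit_vocabulary_indices
  - normalization
```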
Running MEDS-Transforms Pipelines
Parallelization
MEDS-Transforms pipelines can be run in serial mode or with controllable parallelization via Hydra launchers. Because of the library's core design principle, parallelizing a stage is as simple as launching it multiple times with near-identical arguments to spin up more workers, and those workers can be launched in whatever mode you like over a networked file system. For example, the default supported modes include:
- Local parallelism via the `joblib` Hydra launcher, which can be used to run multiple copies of the same script in parallel on a single machine.
- Slurm parallelism via the `submitit` Hydra launcher, which can be used to run multiple copies of the same script in parallel on a cluster.

[!NOTE] The `joblib` and `submitit` Hydra launchers are optional dependencies of this package. To install them, you can run `pip install MEDS-transforms[local_parallelism]` or `pip install MEDS-transforms[slurm_parallelism]`, respectively.
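For example, a local parallel launch might look like the sketch below, which uses Hydra's standard multirun mode with the `joblib` launcher. The `worker` sweep override is an assumption used purely to illustrate spinning up multiple identical workers; consult the parallelization documentation for the exact override names.

```bash
# A sketch (not authoritative): launch 4 identical workers of the pipeline locally
# via Hydra multirun with the joblib launcher. The "worker" override name is an
# assumption for illustration.
MEDS_transform-pipeline --multirun \
    pipeline_config_fp="$PIPELINE_YAML" \
    worker="range(0,4)" \
    hydra/launcher=joblib
```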
Building MEDS-Transforms Pipelines
Defining your own stages
Overview
MEDS-Transforms is built so that you and other users can define your own stages and export them from your own packages. When you define a stage in your package, you simply "register" it as a `MEDS_transforms.stages.Stage` object via a `MEDS_transforms.stages` plugin in your package's entry points, and MEDS-Transforms will be able to find it and use it in pipelines, tests, and more.
Concretely, to define a function that you want to run as a MEDS-Transforms stage, you simply:
1. Use the `Stage.register` helper. E.g., in `my_package/my_stage.py`:

```python
from omegaconf import DictConfig

from MEDS_transforms.stages import Stage


@Stage.register
def main(cfg: DictConfig):
    # Do something with the MEDS data
    pass
```
2. Add your stage as a `MEDS_transforms.stages` entry point. E.g., in your `pyproject.toml` file:

```toml
[project.entry-points."MEDS_transforms.stages"]
my_stage = "my_package.my_stage:main"
```
Stage types
MEDS-Transforms supports several different types of stages, which are listed in the `StageType` `StrEnum`. These are:
- `MAP` stages, which apply an operation to each data shard in the input and save the output to the same shard name in the output folder.
- `MAPREDUCE` stages, which apply a metadata extraction operation to each shard in the input, then reduce those outputs to a single metadata file, which is merged with the input metadata and written to the output.
- `MAIN` stages, which do not fall into either of the above categories and are simply run as standalone scripts without additional modification. `MAIN` stages cannot use things like the "Match-Revise" protocol.
`MAP` and `MAPREDUCE` stages take in map and reduce functions; these functions can be direct functions that apply to each shard, but more commonly they are "functors" that take as input the configuration parameters or other consistently typed and annotated information and build the specific functions that are to be applied. MEDS-Transforms can reliably bind these functors to the particular pipeline parameters to streamline your ability to register stages. See the `bind_compute_fn` function to better understand how this works and how to ensure your stages will be appropriately recognized in downstream usage.
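For instance, a functor-style `MAP` stage might look like the sketch below, where the registered function receives the stage configuration and returns the per-shard function to apply. The parameter name `stage_cfg`, the configuration key `codes_to_drop`, and the exact binding behavior are assumptions here; see `bind_compute_fn` for the authoritative details.

```python
# A sketch of a functor-style MAP stage: the registered function builds and returns
# the function that MEDS-Transforms applies to each data shard. Names flagged as
# assumptions in the text above are illustrative only.
import polars as pl
from omegaconf import DictConfig

from MEDS_transforms.stages import Stage


@Stage.register
def drop_codes(stage_cfg: DictConfig):
    """Build a per-shard function that drops measurements with configured codes."""
    codes_to_drop = list(stage_cfg.get("codes_to_drop", []))

    def compute_fn(shard: pl.LazyFrame) -> pl.LazyFrame:
        # Keep only rows whose code is not in the configured drop list.
        return shard.filter(~pl.col("code").is_in(codes_to_drop))

    return compute_fn
```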
Stage registration configuration
Stages are registered via the `Stage.register` method, which can be used as a function or a decorator.
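As a brief illustration (leaving any registration options aside, and assuming the call form mirrors standard decorator semantics), the two usages are equivalent:

```python
from MEDS_transforms.stages import Stage


# As a decorator:
@Stage.register
def my_stage_a(cfg):
    ...


# As a plain function call over an existing function:
def my_stage_b(cfg):
    ...

my_stage_b = Stage.register(my_stage_b)
```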
Defining your own pipelines
In addition to writing your own scripts, you can also allow users to reference your pipeline configuration files directly from your package by ensuring they are included in your packaged files. Users can then refer to them by using the `pkg://` syntax in specifying the pipeline configuration file path, rather than an absolute path on disk. For example:

```bash
MEDS_transform-pipeline pipeline_fp="pkg://my_package.my_pipeline.yaml"
```
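How you include the YAML in your packaged files depends on your build backend; with setuptools, for example, a minimal sketch might be:

```toml
# A sketch (setuptools-specific): ship my_pipeline.yaml inside my_package so it can
# be referenced via pkg://my_package.my_pipeline.yaml at runtime.
[tool.setuptools.package-data]
my_package = ["my_pipeline.yaml"]
```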
Meta-stage functionality
Currently, the only supported meta-stage functionality is the "Match-Revise" protocol, which allows you to dynamically control how your stage is run over different parts of the data. This is useful for things like extraction of numerical values based on a collection of regular expressions, filtering different subsets of the data with different criteria, etc.
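As a rough, non-authoritative sketch of what a Match-Revise configuration can look like (the `_match_revise` and `_matcher` key names and the matching semantics shown are assumptions for illustration; consult the Match-Revise documentation for the real specification):

```yaml
# A hypothetical sketch only: apply occlude_outliers with different cutoffs to
# different subsets of the data. Key names (_match_revise, _matcher) and matching
# semantics are assumptions for illustration.
- occlude_outliers:
    _match_revise:
      - _matcher: {code: "LAB//HEART_RATE"}
        stddev_cutoff: 2
      - _matcher: {code: "LAB//BLOOD_PRESSURE"}
        stddev_cutoff: 4
```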
Testing Support
Given the critical importance of testing in the MEDS-Transforms library, we have built-in support for you to test your derived stages via a semi-automated, clear pipeline that will aid you in both writing tests and ensuring your stages are understandable to your users.
Roadmap & Contributing
MEDS-Transforms has several key current priorities:
- Improve the quality of the documentation and tutorials.
- Improve the performance of the library, in particular by removing the requirement that every stage write its outputs to disk and read its inputs from disk, and by addressing the fact that polars is not as efficient in low-resource settings.
- Improve the usability and clarity of the core components of this library, both conceptually and technically; this includes things like removing the distinction between data and metadata stages, ensuring all stages have a clear output schema, supporting reduce- or metadata- only stages, etc.
- Support more parallelization and scheduling systems, such as LSF, Spark, and more.
See the GitHub Issues to see all open issues we're considering. If you have an idea for a new feature, please open an issue to discuss it with us!
Contributions are very welcome; please follow the MEDS Organization's Contribution Guide if you submit a PR.