MEDS-Transforms: Build and run complex pipelines over MEDS datasets via simple parts
MEDS-Transforms is a Python package for assembling complex data pre-processing workflows over MEDS datasets. To do this, you define a pipeline as a series of stages, each with its own arguments, then run the pipeline over your dataset. This allows the community to curate a library of shared stages for common operations, such as filtering, normalization, outlier detection, and more, which can be used to build novel pipelines for diverse use cases. Learn more below to see how MEDS-Transforms can help you build your data pipelines!
🚀 Quick Start
1. Install via pip:

```bash
pip install MEDS-transforms
```
2. Craft a pipeline YAML file:

```yaml
input_dir: $MEDS_ROOT
output_dir: $PIPELINE_OUTPUT
description: Your special pipeline

stages:
  - filter_subjects:
      min_events_per_subject: 5
  - add_time_derived_measurements:
      age:
        DOB_code: MEDS_BIRTH
        age_code: AGE
        age_unit: years
      time_of_day:
        time_of_day_code: TIME_OF_DAY
        endpoints: [6, 12, 18, 24]
  - fit_outlier_detection:
      _base_stage: aggregate_code_metadata
      aggregations:
        - values/n_occurrences
        - values/sum
        - values/sum_sqd
  - occlude_outliers:
      stddev_cutoff: 1
  - fit_normalization:
      _base_stage: aggregate_code_metadata
      aggregations:
        - code/n_occurrences
        - code/n_subjects
        - values/n_occurrences
        - values/sum
        - values/sum_sqd
  - fit_vocabulary_indices
  - normalization
```
This pipeline will:
- Filter subjects to only those with at least 5 events (unique timestamps).
- Add codes and values for the subject's age and the time-of-day of each unique measurement.
- Fit statistics to recognize and occlude outliers over the numeric values.
- Remove numeric values that are more than 1 standard deviation away from the mean.
- Fit statistics to normalize the numeric values.
- Assign codes to unique vocabulary indices in preparation for modeling.
- Normalize the codes and numeric values to proper numeric form for modeling.
Save your pipeline YAML file on disk at $PIPELINE_YAML.
3. Run the pipeline

In the terminal, run

```bash
MEDS_transform-pipeline "$PIPELINE_YAML"
```
The runner creates a `.logs` folder inside the pipeline's `output_dir` and marks stages as complete by placing `<stage>.done` files in that folder. A `_all_stages.done` file is written when the entire pipeline finishes. Re-running the command will skip any stages that already have corresponding `.done` files.
You can optionally supply a stage runner YAML as a second argument to control how each stage is launched (for example, providing parallelization options or custom stage scripts). Any additional arguments are forwarded to the stage invocations using Hydra override syntax, allowing you to tweak stage parameters on the command line.
After running, you will see output files stored in $PIPELINE_OUTPUT with the results of each stage of the
pipeline in stage-specific directories, and the global output in $PIPELINE_OUTPUT/data and
$PIPELINE_OUTPUT/metadata (for data and metadata outputs, respectively). That's it!
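If you want to sanity-check a run programmatically, the completion markers and output folders described above are easy to inspect. Here is a minimal sketch, assuming your `output_dir` was set to `pipeline_output` and that data shards are parquet files, per the MEDS format:

```python
from pathlib import Path

output_dir = Path("pipeline_output")  # whatever you set as output_dir in your pipeline YAML

# Stage completion markers live in the .logs folder; re-runs skip any stage
# whose .done file already exists.
logs_dir = output_dir / ".logs"
print("Completed stages:", sorted(p.stem for p in logs_dir.glob("*.done")))
print("Pipeline finished:", (logs_dir / "_all_stages.done").exists())

# Final global outputs land in the data/ and metadata/ folders.
print("Data shards:", sorted((output_dir / "data").rglob("*.parquet")))
print("Metadata files:", sorted((output_dir / "metadata").iterdir()))
```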
4. Do even more!
Beyond just running a simple pipeline over the built-in stages, you can also do things like
- Define your own stages or use stages from other packages!
- Run your pipeline in parallel or across a Slurm cluster with stage-specific compute and memory requirements!
- Use meta-stage functionality like Match-Revise to dynamically control how your stage is run over different parts of the data!
To understand these capabilities and more, read the full documentation.
Examples of MEDS-Transforms in Action:
See any of the below projects to understand how to use MEDS-Transforms in different ways!
[!NOTE] If your package uses MEDS-Transforms, please submit a PR to add it to this list!
Detailed Documentation
Read the full API documentation for technical details.
Design Philosophy
MEDS-Transforms is built around the following design philosophy:
The MEDS format
MEDS-Transforms is built for use with MEDS datasets. MEDS is an incredibly simple, usable, and powerful format for representing electronic health record (EHR) data for use in machine learning and artificial intelligence applications.
Pipelines are Composed from Modular Stages
Any complex data pre-processing pipeline should be expressible as a series of simpler, interoperable stages. Expressing complex pipelines in this way allows the MEDS community to curate a library of "pre-processing stages" which can be used within the community to build novel, complex pipelines.
Stages should be Simple, Testable, and Interoperable
Each stage of a pipeline should be simple, testable, and (where possible) interoperable with other stages. This helps the community ensure correctness of pipelines and develop new tools in an efficient, reliable manner. It also helps researchers break down complex operations into simpler conceptual pieces. See the documentation on MEDS-Transforms Stages for more details on how to define your own stages!
Pipelines should be Defined via Readable, Comprehensive Configuration Files
Complex pipelines should also be communicable to other researchers, so that we can easily reproduce others' results, understand their work, and iterate on it. This is best enabled when pipelines can be defined by clear, simple configuration files over this shared library of stages. MEDS-Transforms realizes this with our pipeline configuration specification, shown above. See the full pipeline configuration documentation for more details.
Pipelines should Scale with Compute Resources to Arbitrary Dataset Sizes
Just as the MEDS format is designed to enable easy scaling of datasets through sharding, MEDS-Transforms is built around a mapreduce paradigm to enable easy scaling of pipelines to arbitrary dataset sizes by parallelizing operations across the input datasets' shards. Check out the mapreduce helpers MEDS-Transforms exposes for your use in downstream pipelines.
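As a conceptual illustration of that paradigm only (this is a sketch of the pattern, not the library's actual helper API), a map stage over shards looks roughly like:

```python
from collections.abc import Callable
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def map_stage(
    map_fn: Callable[[Path, Path], None], input_dir: Path, output_dir: Path
) -> list[Path]:
    """Conceptual sketch: apply map_fn independently to every input shard.

    Because shards are independent files, each call can run in a separate
    worker process (or a separate Slurm job) with no coordination beyond the
    shared file system; MAPREDUCE stages then add a reduction step that
    merges the per-shard outputs.
    """
    shards = sorted(input_dir.rglob("*.parquet"))
    outputs = [output_dir / shard.relative_to(input_dir) for shard in shards]
    with ProcessPoolExecutor() as pool:
        list(pool.map(map_fn, shards, outputs))  # one task per shard
    return outputs
```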
Data is the Interface
Much as MEDS is a data standard, MEDS-Transforms embodies the principle that data, rather than Python objects, should be the interface between pipeline components wherever possible. To that end, each MEDS-Transforms stage can be run as a standalone script that writes transformed files to disk, which subsequent stages read. This means you can easily run multiple MEDS-Transforms pipelines in sequence to combine operations across different packages or use cases, and seamlessly resume pipelines after interruptions or failures from the partially completed stage outputs.
[!NOTE] This does cause some performance limitations, which we are solving; follow Issue #56 to track updates on this!
Running MEDS-Transforms Pipelines
Parallelization
MEDS-Transforms pipelines can be run serially or with controllable parallelization via Hydra launchers. Because each stage is a standalone script, parallelizing is as simple as launching multiple copies of that stage with near-identical arguments to spin up additional workers, in whatever mode you like over a networked file system. For example, default supported modes include:
- Local parallelism via the `joblib` Hydra launcher, which runs multiple copies of the same script in parallel on a single machine.
- Slurm parallelism via the `submitit` Hydra launcher, which runs multiple copies of the same script in parallel across a cluster.
[!NOTE] The `joblib` and `submitit` Hydra launchers are optional dependencies of this package. To install them, run `pip install MEDS-transforms[local_parallelism]` or `pip install MEDS-transforms[slurm_parallelism]`, respectively.
Building MEDS-Transforms Pipelines
Defining your own stages
Overview
MEDS-Transforms is built so that you and other users can define your own stages and export them from your own packages. When you define a stage in your package, you simply "register" it as a `MEDS_transforms.stages.Stage` object via a `MEDS_transforms.stages` plugin in your package's entry points, and MEDS-Transforms will be able to find it and use it in pipelines, tests, and more.
Concretely, to define a function that you want to run as a MEDS-Transforms stage, you simply:
1. Use the `Stage.register` helper. E.g., in `my_package/my_stage.py`:

```python
from omegaconf import DictConfig

from MEDS_transforms.stages import Stage


@Stage.register
def main(cfg: DictConfig):
    # Do something with the MEDS data
    pass
```
2. Add your stage as a `MEDS_transforms.stages` entry point. E.g., in your `pyproject.toml` file:

```toml
[project.entry-points."MEDS_transforms.stages"]
my_stage = "my_package.my_stage:main"
```
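Once your package is installed, your stage is discoverable via the standard Python entry-point machinery; for example, you can list every stage registered in the current environment using only the standard library:

```python
from importlib.metadata import entry_points

# Any installed package can contribute stages to the MEDS_transforms.stages group.
for ep in entry_points(group="MEDS_transforms.stages"):
    print(f"{ep.name} -> {ep.value}")
```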
Stage types
MEDS-Transforms supports several different types of stages, which are listed in the `StageType` StrEnum. These are:

- `MAP` stages, which apply an operation to each data shard in the input and save the output to the same shard name in the output folder.
- `MAPREDUCE` stages, which apply a metadata extraction operation to each shard in the input, then reduce those outputs to a single metadata file, which is merged with the input metadata and written to the output.
- `MAIN` stages, which do not fall into either of the above categories and are simply run as standalone scripts without additional modification. `MAIN` stages cannot use things like the "Match-Revise" protocol.
`MAP` and `MAPREDUCE` stages take in map (and, for the latter, reduce) functions. These can be direct functions that apply to each shard but, more commonly, they are "functors" that take as input the configuration parameters or other consistently typed and annotated information and build the specific functions to be applied. MEDS-Transforms can reliably bind these functors to the particular pipeline parameters to streamline your ability to register stages; see the sketch after this paragraph, and see the `bind_compute_fn` function to better understand how this works and how to ensure your stages will be appropriately recognized in downstream usage.
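For example, a map functor might look like the following minimal sketch. The function name, the `value_cutoff` parameter, and the config access pattern are illustrative assumptions, not the library's prescribed signature; the point is only the general pattern (configuration in, per-shard transform out):

```python
from collections.abc import Callable

import polars as pl
from omegaconf import DictConfig


def occlude_extreme_values_fntr(
    stage_cfg: DictConfig,
) -> Callable[[pl.LazyFrame], pl.LazyFrame]:
    """Illustrative functor: consumes the stage's configuration and returns
    the concrete per-shard transform to be mapped over the data shards."""
    cutoff = stage_cfg.value_cutoff  # hypothetical stage parameter

    def compute_fn(df: pl.LazyFrame) -> pl.LazyFrame:
        # Null out numeric values whose magnitude exceeds the configured cutoff.
        return df.with_columns(
            pl.when(pl.col("numeric_value").abs() <= cutoff)
            .then(pl.col("numeric_value"))
            .otherwise(None)
            .alias("numeric_value")
        )

    return compute_fn
```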
Stage registration configuration
Stages are registered via the `Stage.register` method, which can be used as a function or a decorator; both forms are sketched below.
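In sketch form, assuming only the behavior shown above (i.e., that `Stage.register` accepts the stage's main function), the two usages are:

```python
from omegaconf import DictConfig

from MEDS_transforms.stages import Stage


# As a decorator:
@Stage.register
def my_decorated_stage(cfg: DictConfig):
    ...


# Equivalently, as a plain function call on an already-defined function:
def my_other_stage(cfg: DictConfig):
    ...


my_other_stage = Stage.register(my_other_stage)
```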
Defining your own pipelines
In addition to writing your own scripts, you can also let users reference your pipeline configuration files directly from your package by ensuring they are included in your packaged files. Users can then refer to them using the `pkg://` syntax when specifying the pipeline configuration file path, rather than an absolute path on disk. For example:

```bash
MEDS_transform-pipeline pkg://my_package.my_pipeline.yaml
```
Meta-stage functionality
Currently, the only supported meta-stage functionality is the "Match-Revise" protocol, which allows you to dynamically control how your stage is run over different parts of the data. This is useful for things like extraction of numerical values based on a collection of regular expressions, filtering different subsets of the data with different criteria, etc.
Testing Support
Given the critical importance of testing in the MEDS-Transforms library, we have built-in support for you to test your derived stages via a semi-automated, clear pipeline that will aid you in both writing tests and ensuring your stages are understandable to your users.
Example plugin package
This repository includes a minimal example of a downstream package that depends on MEDS-Transforms; you can find it under `examples/simple_example_pkg`.
The package registers an `identity_stage` via an entry point and ships a simple `identity_pipeline.yaml` that exercises the stage. After installing the package locally, you can run the pipeline with:

```bash
MEDS_transform-pipeline "pkg://simple_example_pkg.pipelines/identity_pipeline.yaml"
```
You can also provide a stage runner configuration to control options like parallelization, as well as pipeline-specific overrides (e.g., the input and output directories) via this syntax; for example:

```bash
MEDS_transform-pipeline "...pipeline.yaml" --stage_runner_fp "stage_runner.yaml" --overrides "input_dir=foo"
```

See `tests/test_example_pkg.py` for an automated demonstration of this setup.
Roadmap & Contributing
MEDS-Transforms has several key current priorities:
- Improve the quality of the documentation and tutorials.
- Improve the performance of the library, especially by eliminating the requirement that every stage write its outputs to and read its inputs from disk, and by addressing polars' inefficiency in low-resource settings.
- Improve the usability and clarity of the library's core components, both conceptually and technically; this includes things like removing the distinction between data and metadata stages, ensuring all stages have a clear output schema, supporting reduce- or metadata-only stages, etc.
- Support more parallelization and scheduling systems, such as LSF, Spark, and more.
See the GitHub Issues to see all open issues we're considering. If you have an idea for a new feature, please open an issue to discuss it with us!
Contributions are very welcome; please follow the MEDS Organization's Contribution Guide if you submit a PR.
Note that contributions undergo pre-commit checks (`pre-commit run --all`), tests (`pytest`), and documentation generation (check locally via `mkdocs serve`).