A framework for compiling simple, MapReduce-style pipelines over MEDS datasets.
MEDS-Transforms: Build and run complex pipelines over MEDS datasets via simple parts
MEDS-Transforms is a Python package for assembling complex data pre-processing workflows over MEDS datasets. To do this, you define a pipeline as a series of stages, each with its own arguments, then run the pipeline over your dataset. This allows the community to curate a library of shared stages for common operations, such as filtering, normalization, outlier detection, and more, which can be used to build novel pipelines for diverse use cases. Learn more below to see how MEDS-Transforms can help you build your data pipelines!
🚀 Quick Start
1. Install via pip:

```bash
pip install MEDS-transforms
```
2. Craft a pipeline YAML file:
```yaml
input_dir: $MEDS_ROOT
output_dir: $PIPELINE_OUTPUT
description: Your special pipeline
stages:
  - filter_subjects:
      min_events_per_subject: 5
  - add_time_derived_measurements:
      age:
        DOB_code: MEDS_BIRTH
        age_code: AGE
        age_unit: years
      time_of_day:
        time_of_day_code: TIME_OF_DAY
        endpoints: [6, 12, 18, 24]
  - fit_outlier_detection:
      _base_stage: aggregate_code_metadata
      aggregations:
        - values/n_occurrences
        - values/sum
        - values/sum_sqd
  - occlude_outliers:
      stddev_cutoff: 1
  - fit_normalization:
      _base_stage: aggregate_code_metadata
      aggregations:
        - code/n_occurrences
        - code/n_subjects
        - values/n_occurrences
        - values/sum
        - values/sum_sqd
  - fit_vocabulary_indices
  - normalization
```
This pipeline will:
- Filter subjects to only those with at least 5 events (unique timestamps).
- Add codes and values for the subject's age and the time-of-day of each unique measurement.
- Fit statistics to recognize and occlude outliers over the numeric values.
- Remove numeric values that are more than 1 standard deviation away from the mean.
- Fit statistics to normalize the numeric values.
- Assign codes to unique vocabulary indices in preparation for modeling.
- Normalize the codes and numeric values to proper numeric form for modeling.
Save your pipeline YAML file on disk at `$PIPELINE_YAML`.
3. Run the pipeline
In the terminal, run:

```bash
MEDS_transform-pipeline pipeline_config_fp="$PIPELINE_YAML"
```

After you do, you will see output files stored in `$PIPELINE_OUTPUT` with the results of each stage of the pipeline, stored in stage-specific directories, and the global output in `$PIPELINE_OUTPUT/data` and `$PIPELINE_OUTPUT/metadata` (for data and metadata outputs, respectively). That's it!
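As a rough sketch of the resulting layout (the stage directory names below mirror the stage names in the example pipeline and are assumptions for illustration, not an exhaustive listing):

```
$PIPELINE_OUTPUT/
├── data/                            # final transformed data shards
├── metadata/                        # final metadata outputs
├── filter_subjects/                 # per-stage intermediate outputs
├── add_time_derived_measurements/
├── ...
└── normalization/
```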
4. Do even more!
Beyond just running a simple pipeline over the built-in stages, you can also do things like
- Define your own stages or use stages from other packages!
- Run your pipeline in parallel or across a Slurm cluster with stage-specific compute and memory requirements!
- Use meta-stage functionality like Match-Revise to dynamically control how your stage is run over different parts of the data!
To understand these capabilities and more, read the full documentation.
Examples of MEDS-Transforms in Action:
See any of the projects below to understand how to use MEDS-Transforms in different ways!
[!NOTE] If your package uses MEDS-Transforms, please submit a PR to add it to this list!
Detailed Documentation
Read the full API documentation for technical details.
Design Philosophy
MEDS-Transforms is built around the following design philosophy:
The MEDS format
MEDS-Transforms is built for use with MEDS datasets. This format is an incredibly simple, usable, and powerful format for representing electronic health record (EHR) datasets for use in machine learning or artificial intelligence applications.
Pipelines are Composed from Modular Stages
Any complex data pre-processing pipeline should be expressible as a series of simpler, interoperable stages. Expressing complex pipelines in this way allows the MEDS community to curate a library of "pre-processing stages" which can be used within the community to build novel, complex pipelines.
Stages should be Simple, Testable, and Interoperable
Each stage of a pipeline should be simple, testable, and (where possible) interoperable with other stages. This helps the community ensure correctness of pipelines and develop new tools in an efficient, reliable manner. It also helps researchers break down complex operations into simpler conceptual pieces. See the documentation on MEDS-Transforms Stages for more details on how to define your own stages!
Pipelines should be Defined via Readable, Comprehensive Configuration Files
Complex pipelines should also be communicable to other researchers, so that we can easily reproduce others' results, understand their work, and iterate on it. This is best enabled when pipelines can be defined by clear, simple configuration files over this shared library of stages. MEDS-Transforms realizes this with our pipeline configuration specification, shown above. See the full pipeline configuration documentation for more details.
Pipelines should Scale with Compute Resources to Arbitrary Dataset Sizes
Just as the MEDS format is designed to enable easy scaling of datasets through sharding, MEDS-Transforms is built around a mapreduce paradigm to enable easy scaling of pipelines to arbitrary dataset sizes by parallelizing operations across the input datasets' shards. Check out the mapreduce helpers MEDS-Transforms exposes for your use in downstream pipelines.
Data is the Interface
Just as MEDS is a data standard, MEDS-Transforms embodies the principle that data, rather than Python objects, should be the interface between pipeline components wherever possible. To that end, each MEDS-Transforms stage can be run as a standalone script that writes transformed files to disk, which subsequent stages then read. This means that you can easily run multiple MEDS-Transforms pipelines in sequence to combine operations across different packages or use cases, and seamlessly resume pipelines after interruptions or failures from the partially completed stage outputs.
[!NOTE] This does introduce some performance limitations, which we are working to address; follow Issue #56 to track updates on this!
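As a minimal sketch of this data-as-interface pattern, a second pipeline can simply point its `input_dir` at the first pipeline's output. The paths below are illustrative assumptions, and whether a downstream pipeline should read the root output directory or its `data/` subdirectory may depend on your stages:

```yaml
# second_pipeline.yaml -- a sketch of chaining pipelines via data on disk.
# Paths and the choice of input directory are illustrative assumptions.
input_dir: $PIPELINE_OUTPUT          # output of the first pipeline
output_dir: $SECOND_PIPELINE_OUTPUT
stages:
  - fit_vocabulary_indices
  - normalization
```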
Running MEDS-Transforms Pipelines
Parallelization
MEDS-Transforms pipelines can be run in serial mode or with controllable parallelization via Hydra launchers. Because of the library's core design principle, parallelizing a stage is as simple as launching it multiple times with near-identical arguments to spin up more workers, and those workers can be launched in whatever mode you like over a networked file system. For example, the default supported modes include:
- Local parallelism via the `joblib` Hydra launcher, which can be used to run multiple copies of the same script in parallel on a single machine.
- Slurm parallelism via the `submitit` Hydra launcher, which can be used to run multiple copies of the same script in parallel on a cluster.

[!NOTE] The `joblib` and `submitit` Hydra launchers are optional dependencies of this package. To install them, you can run `pip install MEDS-transforms[local_parallelism]` or `pip install MEDS-transforms[slurm_parallelism]`, respectively.
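For example, a local parallel launch might look like the sketch below, which uses Hydra's standard multirun mode with the `joblib` launcher. The `worker` sweep override is an assumption used purely to illustrate spinning up multiple identical workers; consult the parallelization documentation for the exact override names.

```bash
# A sketch (not authoritative): launch 4 identical workers of the pipeline locally
# via Hydra multirun with the joblib launcher. The "worker" override name is an
# assumption for illustration.
MEDS_transform-pipeline --multirun \
    pipeline_config_fp="$PIPELINE_YAML" \
    worker="range(0,4)" \
    hydra/launcher=joblib
```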
Building MEDS-Transforms Pipelines
Defining your own stages
Overview
MEDS-Transforms is built so that you and other users can define your own stages and export them from your own packages. When you define a stage in your package, you simply "register" it as a `MEDS_transforms.stages.Stage` object via a `MEDS_transforms.stages` plugin in your package's entry points, and MEDS-Transforms will be able to find it and use it in pipelines, tests, and more.
Concretely, to define a function that you want to run as a MEDS-Transforms stage, you simply:
1. Use the `Stage.register` helper. E.g., in `my_package/my_stage.py`:

```python
from omegaconf import DictConfig

from MEDS_transforms.stages import Stage


@Stage.register
def main(cfg: DictConfig):
    # Do something with the MEDS data
    pass
```
2. Add your stage as a `MEDS_transforms.stages` entry point. E.g., in your `pyproject.toml` file:

```toml
[project.entry-points."MEDS_transforms.stages"]
my_stage = "my_package.my_stage:main"
```
Stage types
MEDS-Transforms supports several different types of stages, which are listed in the `StageType` `StrEnum`. These are:
- `MAP` stages, which apply an operation to each data shard in the input and save the output to the same shard name in the output folder.
- `MAPREDUCE` stages, which apply a metadata extraction operation to each shard in the input, then reduce those outputs to a single metadata file, which is merged with the input metadata and written to the output.
- `MAIN` stages, which do not fall into either of the above categories and are simply run as standalone scripts without additional modification. `MAIN` stages cannot use things like the "Match-Revise" protocol.
`MAP` and `MAPREDUCE` stages take in map and reduce functions; these functions can be direct functions that apply to each shard, but more commonly they are "functors" that take as input the configuration parameters or other consistently typed and annotated information and build the specific functions that are to be applied. MEDS-Transforms can reliably bind these functors to the particular pipeline parameters to streamline your ability to register stages. See the `bind_compute_fn` function to better understand how this works and how to ensure your stages will be appropriately recognized in downstream usage.
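For instance, a functor-style `MAP` stage might look like the sketch below, where the registered function receives the stage configuration and returns the per-shard function to apply. The parameter name `stage_cfg`, the configuration key `codes_to_drop`, and the exact binding behavior are assumptions here; see `bind_compute_fn` for the authoritative details.

```python
# A sketch of a functor-style MAP stage: the registered function builds and returns
# the function that MEDS-Transforms applies to each data shard. Names flagged as
# assumptions in the text above are illustrative only.
import polars as pl
from omegaconf import DictConfig

from MEDS_transforms.stages import Stage


@Stage.register
def drop_codes(stage_cfg: DictConfig):
    """Build a per-shard function that drops measurements with configured codes."""
    codes_to_drop = list(stage_cfg.get("codes_to_drop", []))

    def compute_fn(shard: pl.LazyFrame) -> pl.LazyFrame:
        # Keep only rows whose code is not in the configured drop list.
        return shard.filter(~pl.col("code").is_in(codes_to_drop))

    return compute_fn
```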
Stage registration configuration
Stages are registered via the `Stage.register` method, which can be used as a function or a decorator.
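As a brief illustration (leaving any registration options aside, and assuming the call form mirrors standard decorator semantics), the two usages are equivalent:

```python
from MEDS_transforms.stages import Stage


# As a decorator:
@Stage.register
def my_stage_a(cfg):
    ...


# As a plain function call over an existing function:
def my_stage_b(cfg):
    ...

my_stage_b = Stage.register(my_stage_b)
```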
Defining your own pipelines
In addition to writing your own scripts, you can also allow users to reference your pipeline configuration files directly from your package by ensuring they are included in your packaged files. Users can then refer to them by using the `pkg://` syntax in specifying the pipeline configuration file path, rather than an absolute path on disk. For example:

```bash
MEDS_transform-pipeline pipeline_fp="pkg://my_package.my_pipeline.yaml"
```
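How you include the YAML in your packaged files depends on your build backend; with setuptools, for example, a minimal sketch might be:

```toml
# A sketch (setuptools-specific): ship my_pipeline.yaml inside my_package so it can
# be referenced via pkg://my_package.my_pipeline.yaml at runtime.
[tool.setuptools.package-data]
my_package = ["my_pipeline.yaml"]
```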
Meta-stage functionality
Currently, the only supported meta-stage functionality is the "Match-Revise" protocol, which allows you to dynamically control how your stage is run over different parts of the data. This is useful for things like extraction of numerical values based on a collection of regular expressions, filtering different subsets of the data with different criteria, etc.
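As a rough, non-authoritative sketch of what a Match-Revise configuration can look like (the `_match_revise` and `_matcher` key names and the matching semantics shown are assumptions for illustration; consult the Match-Revise documentation for the real specification):

```yaml
# A hypothetical sketch only: apply occlude_outliers with different cutoffs to
# different subsets of the data. Key names (_match_revise, _matcher) and matching
# semantics are assumptions for illustration.
- occlude_outliers:
    _match_revise:
      - _matcher: {code: "LAB//HEART_RATE"}
        stddev_cutoff: 2
      - _matcher: {code: "LAB//BLOOD_PRESSURE"}
        stddev_cutoff: 4
```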
Testing Support
Given the critical importance of testing in the MEDS-Transforms library, we have built-in support for you to test your derived stages via a semi-automated, clear pipeline that will aid you in both writing tests and ensuring your stages are understandable to your users.
Roadmap & Contributing
MEDS-Transforms has several key current priorities:
- Improve the quality of the documentation and tutorials.
- Improve the performance of the library, in particular by removing the requirement that every stage write its outputs to disk and read its inputs from disk, and by addressing the fact that polars is not as efficient in low-resource settings.
- Improve the usability and clarity of the core components of this library, both conceptually and technically; this includes things like removing the distinction between data and metadata stages, ensuring all stages have a clear output schema, supporting reduce- or metadata- only stages, etc.
- Support more parallelization and scheduling systems, such as LSF, Spark, and more.
See the GitHub Issues to see all open issues we're considering. If you have an idea for a new feature, please open an issue to discuss it with us!
Contributions are very welcome; please follow the MEDS Organization's Contribution Guide if you submit a PR.