MEDS ETL building support leveraging MEDS-Transforms.

These details have been verified by PyPI

Maintainers

mmd_pypi

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

MEDS-Extract

MEDS Extract is a Python package that leverages the MEDS-Transforms framework to build efficient, reproducible ETL (Extract, Transform, Load) pipelines for converting raw electronic health record (EHR) data into the standardized MEDS format. If your dataset consists of files containing patient observations with timestamps, codes, and values, MEDS Extract can automatically convert your raw data into a compliant MEDS dataset in an efficient, scalable, and communicable way.

🚀 Quick Start

1. Install via `pip`:

pip install MEDS-extract

[!NOTE] MEDS Extract v0.2.0 uses meds v0.3.3 and MEDS-Transforms v0.4.0. MEDS Extract v0.3.0 uses meds v0.4.0 and MEDS-Transforms v0.5.0. Hotfixes will be released within those namespaces as required. Older versions may be supported in the v0.1.0 namespace.
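
To see which of the versions in the note above you actually have, a small check like the following may help. It is only a sketch: the distribution names `meds` and `MEDS-transforms` are assumptions inferred from the note, and only `MEDS-extract` is confirmed by the install command.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names are assumptions; only "MEDS-extract" appears in the
# install command above.
PACKAGES = ("MEDS-extract", "meds", "MEDS-transforms")


def installed_versions(names=PACKAGES):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Compare the printed versions against the pairings listed in the note before debugging any schema mismatch.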

2. Prepare your raw data

Ensure your data meets these requirements:

- File-based: Data stored in .csv, .csv.gz, or .parquet files. These may be stored locally or in the cloud, though intermediate processing currently must be done locally.
- Comprehensive Rows: Each file contains a dataframe structure where each row contains all required information to produce one or more MEDS events at full temporal granularity, without additional joining or merging.
- Integer subject IDs: The subject_id column must contain integer values (int64). Convert string IDs to integers before running the pipeline.

If these requirements are not met, you may need to perform some pre-processing steps to convert your raw data into an accepted format, though typically these are very minor (e.g., joining across a join key, converting time deltas into timestamps, etc.).
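
As one example of such pre-processing, the sketch below replaces string subject IDs with integers. It is an illustration only, not part of MEDS Extract: the function name, the `patient_id` column, and the numbering scheme (consecutive integers in order of first appearance) are all assumptions.

```python
def assign_integer_ids(rows, id_col="patient_id", mapping=None):
    """Replace string IDs in `rows` (a list of dicts) with integer IDs.

    New IDs are numbered consecutively in order of first appearance. Pass the
    returned mapping back in when converting further files so that the same
    subject always receives the same integer.
    """
    mapping = {} if mapping is None else mapping
    converted = []
    for row in rows:
        raw_id = row[id_col]
        if raw_id not in mapping:
            mapping[raw_id] = len(mapping) + 1
        converted.append({**row, id_col: mapping[raw_id]})
    return converted, mapping


# Hypothetical rows, as they might come out of csv.DictReader.
admissions = [
    {"patient_id": "A-17", "admission_type": "ER"},
    {"patient_id": "B-02", "admission_type": "ELECTIVE"},
    {"patient_id": "A-17", "admission_type": "ER"},
]
converted, id_map = assign_integer_ids(admissions)
```

Saving `id_map` alongside the converted files keeps the conversion reversible and consistent across tables.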

3. Create an event configuration file

Create a YAML file (e.g., event_config.yaml) that tells MEDS Extract how to interpret your raw data:

# Global subject ID column (can be overridden per file)
subject_id_col: patient_id

# File-level configurations
patients:
  subject_id_col: MRN # This file has a different subject ID column
  demographics: # One kind of event in this file.
    code:
      - DEMOGRAPHIC
      - col(gender)
    time:       # Static event
    race: race
    ethnicity: ethnicity

admissions:
  admission: # One kind of event in this file.
    code:
      - HOSPITAL_ADMISSION
      - col(admission_type)
    time: col(admit_datetime)
    time_format: '%Y-%m-%d %H:%M:%S'
    department: department # Extra columns get tracked
    insurance: insurance

  discharge: # A different kind of event in this file.
    code:
      - HOSPITAL_DISCHARGE
      - col(discharge_location)
    time: col(discharge_datetime)
    time_format: '%Y-%m-%d %H:%M:%S'

lab_results:
  lab:
    code:
      - LAB
      - col(test_name)
      - col(units)
    time: col(result_datetime)
    time_format: '%Y-%m-%d %H:%M:%S'
    numeric_value: result_value # This will get converted to a numeric
    text_value: result_text # This will get converted to a string

4. Assemble your pipeline configuration

Beyond the event configuration file, you also need to specify which pipeline stages to run. You do this through a standard MEDS-Transforms pipeline configuration file; a typical example follows. Values like $RAW_INPUT_DIR are placeholders for your own paths or environment variables and should be replaced with real values:

input_dir: $RAW_INPUT_DIR
output_dir: $PIPELINE_OUTPUT

description: This pipeline extracts a dataset to MEDS format.

etl_metadata:
  dataset_name: $DATASET_NAME
  dataset_version: $DATASET_VERSION

# Points to the event conversion yaml file defined above.
event_conversion_config_fp: ???
# The shards mapping is stored in the root of the final output directory.
shards_map_fp: ${output_dir}/metadata/.shards.json

# Used if you need to load input files from cloud storage.
cloud_io_storage_options: {}

stages:
  - shard_events:
      data_input_dir: ${input_dir}
  - split_and_shard_subjects
  - convert_to_subject_sharded
  - convert_to_MEDS_events
  - merge_to_MEDS_cohort
  - extract_code_metadata
  - finalize_MEDS_metadata
  - finalize_MEDS_data

Save it on disk to $PIPELINE_YAML (e.g., pipeline_config.yaml).
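
If you keep the configuration as a template, one way to fill in the `$NAME` placeholders is Python's `string.Template`. This is a sketch under assumptions: the helper name and the example paths are hypothetical, and the template is shortened to three lines. `safe_substitute` is used so that `${output_dir}`, which looks like a config-level interpolation rather than a placeholder, is left untouched.

```python
from string import Template

# Shortened template; a real file would contain the full configuration above.
PIPELINE_TEMPLATE = """\
input_dir: $RAW_INPUT_DIR
output_dir: $PIPELINE_OUTPUT
shards_map_fp: ${output_dir}/metadata/.shards.json
"""


def render_pipeline_config(template_text, values):
    """Fill $NAME placeholders; names missing from `values` are left as-is."""
    return Template(template_text).safe_substitute(values)


rendered = render_pipeline_config(
    PIPELINE_TEMPLATE,
    {"RAW_INPUT_DIR": "/data/raw", "PIPELINE_OUTPUT": "/data/meds"},  # example paths
)
```

Because missing names are silently kept, it is worth checking the rendered text for leftover upper-case placeholders before writing it to disk.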

[!NOTE] A pipeline with these defaults is provided in MEDS_extract.configs._extract. You can reference it directly using the package path with the pkg:// prefix in the runner command:

MEDS_transform-pipeline pipeline_config_fp=pkg://MEDS_extract.configs._extract

This avoids needing a local copy on disk.

5. Run the extraction pipeline

MEDS-Extract does not have a stand-alone CLI runner. Instead, you run it through the MEDS-Transforms pipeline runner and point it at your own pipeline configuration file, either as a path on disk (as below) or via the pkg:// package syntax.

MEDS_transform-pipeline pipeline_config_fp="$PIPELINE_YAML"

The result is an extracted MEDS dataset in the specified output directory.
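
If you would rather launch the runner from a Python script, a thin wrapper around the command above might look like the following. The helper names are hypothetical; only the `MEDS_transform-pipeline` executable and the `pipeline_config_fp` argument come from this README.

```python
import shlex
import subprocess


def build_pipeline_command(pipeline_yaml):
    """Return the argv list for the runner command shown above."""
    return ["MEDS_transform-pipeline", f"pipeline_config_fp={pipeline_yaml}"]


def run_pipeline(pipeline_yaml):
    """Run the pipeline; raises CalledProcessError on a non-zero exit code."""
    cmd = build_pipeline_command(pipeline_yaml)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

Calling `run_pipeline("pipeline_config.yaml")` requires the runner to be installed and on your `PATH`; passing the arguments as a list avoids shell quoting problems with paths that contain spaces.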

📊 Real-World Examples

MEDS Extract has been successfully used to convert several major EHR datasets, including MIMIC-IV.

📖 Event Configuration Deep Dive

The event configuration file is the heart of MEDS Extract. Here's how it works:

Basic Structure

relative_table_file_stem:
  event_name:
    code: [required] How to construct the event code
    time: [required] Timestamp column (set to null for static events)
    time_format: [optional] Format string for parsing timestamps
    property_name: column_name  # Additional properties to extract
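
The structure above can be checked before running anything. The sketch below works on the configuration as an already-loaded Python dictionary and reports events that lack a key marked `[required]`. It is an illustration, not the package's own validation, and the function name is hypothetical.

```python
REQUIRED_KEYS = ("code", "time")


def find_missing_keys(event_config):
    """Return a list of (file_stem, event_name, missing_key) tuples."""
    problems = []
    for file_stem, events in event_config.items():
        if not isinstance(events, dict):
            continue  # top-level scalar, e.g. the global subject_id_col
        for event_name, spec in events.items():
            if not isinstance(spec, dict):
                continue  # file-level scalar, e.g. a subject_id_col override
            for key in REQUIRED_KEYS:
                if key not in spec:
                    problems.append((file_stem, event_name, key))
    return problems


# A hypothetical configuration, equivalent to what a YAML loader would return.
config = {
    "subject_id_col": "patient_id",
    "patients": {
        "subject_id_col": "MRN",
        "demographics": {"code": ["DEMOGRAPHIC", "col(gender)"], "time": None},
    },
    "admissions": {"admission": {"code": "ADMISSION"}},
}
problems = find_missing_keys(config)
```

A static event passes the check because its `time` key is present with a null value; only a missing key is reported.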

Code Construction

Event codes can be built in several ways:

# Simple string literal
vitals:
  heart_rate:
    code: "HEART_RATE"

# Column reference
vitals:
  heart_rate:
    code: col(measurement_type)

# Composite codes (joined with "//")
vitals:
  heart_rate:
    code:
      - "VITAL_SIGN"
      - col(measurement_type)
      - col(units)
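
To make the three forms concrete, the sketch below resolves a code specification against a single row and joins the parts with "//". It mirrors the semantics described here and is not the package's implementation; the function name is hypothetical, and handling of missing values is left out.

```python
import re

# Matches column references of the form col(column_name).
COL_PATTERN = re.compile(r"^col\((.+)\)$")


def build_code(code_spec, row):
    """Resolve a literal, a col(...) reference, or a list of both into a code."""
    parts = code_spec if isinstance(code_spec, list) else [code_spec]
    resolved = []
    for part in parts:
        match = COL_PATTERN.match(part)
        # Column references are looked up in the row; literals are kept as-is.
        resolved.append(str(row[match.group(1)]) if match else part)
    return "//".join(resolved)


row = {"measurement_type": "HR", "units": "bpm"}
literal = build_code("HEART_RATE", row)
column = build_code("col(measurement_type)", row)
composite = build_code(["VITAL_SIGN", "col(measurement_type)", "col(units)"], row)
```

For this row the three specifications resolve to `HEART_RATE`, `HR`, and `VITAL_SIGN//HR//bpm` respectively.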

Time Handling

# Simple datetime column
lab_results:
  lab:
    time: col(result_time)

# Custom time format
lab_results:
  lab:
    time: col(result_time)
    time_format: "%m/%d/%Y %H:%M"

# Multiple format attempts
lab_results:
  lab:
    time: col(result_time)
    time_format:
      - "%Y-%m-%d %H:%M:%S"
      - "%m/%d/%Y %H:%M"

# Static events (no time)
demographics:
  gender:
    time: null
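
The "multiple format attempts" case can be pictured as trying each format in order until one parses. The sketch below does this with the standard library's `datetime.strptime`; the package's own parser may behave differently, and the function name is hypothetical.

```python
from datetime import datetime

FORMATS = ["%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M"]


def parse_time(value, formats=FORMATS):
    """Return the first successful parse, or None for a static event."""
    if value is None:
        return None  # static events carry no timestamp
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f"{value!r} matches none of {formats}")


iso_style = parse_time("2024-01-02 03:04:05")
us_style = parse_time("01/02/2024 03:04")
```

Order matters when formats are ambiguous (for example day-first versus month-first dates), so list the most common format first.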

Subject ID Configuration

# Global default
subject_id_col: patient_id

# File-specific override
admissions:
  subject_id_col: hadm_id
  admission:
    code: ADMISSION
    # ...
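
The precedence between the two settings can be sketched as a simple lookup: a file-level `subject_id_col` wins, otherwise the global one applies. The function name and the final fallback to `subject_id` are assumptions made for illustration.

```python
def subject_id_col_for(event_config, file_stem, default="subject_id"):
    """Return the subject ID column that applies to one input file."""
    file_block = event_config.get(file_stem, {})
    if isinstance(file_block, dict) and "subject_id_col" in file_block:
        return file_block["subject_id_col"]  # file-specific override
    # Fall back to the global setting, then to an assumed default name.
    return event_config.get("subject_id_col", default)


config = {
    "subject_id_col": "patient_id",
    "admissions": {"subject_id_col": "hadm_id", "admission": {"code": "ADMISSION"}},
    "lab_results": {"lab": {"code": "LAB"}},
}
```

With this configuration, `admissions` resolves to `hadm_id` and `lab_results` falls back to `patient_id`.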

Metadata Linking

For datasets with separate metadata tables:

lab_results:
  lab:
    code:
      - LAB
      - col(itemid)
    time: col(charttime)
    numeric_value: valuenum
    _metadata:
      input_file: d_labitems
      code_columns:
        - itemid
      properties:
        label: label
        fluid: fluid
        category: category
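
Conceptually, the `_metadata` block describes a join: each row of the metadata table is keyed by the same code the events use and contributes the listed properties. The sketch below shows that idea on plain dictionaries. It assumes that each `properties` entry maps an output name to a source column and that the code prefix matches the event's literal parts; neither is confirmed here, and the function name is hypothetical.

```python
def link_metadata(metadata_rows, code_prefix, code_columns, properties):
    """Map each composite code to the metadata properties of its row."""
    linked = {}
    for row in metadata_rows:
        # Build the code the same way the event does: literal, then columns.
        code = "//".join([code_prefix, *(str(row[c]) for c in code_columns)])
        linked[code] = {out: row[src] for out, src in properties.items()}
    return linked


# A hypothetical row from the d_labitems table.
d_labitems = [
    {"itemid": 50912, "label": "Creatinine", "fluid": "Blood", "category": "Chemistry"},
]
code_metadata = link_metadata(
    d_labitems,
    code_prefix="LAB",
    code_columns=["itemid"],
    properties={"label": "label", "fluid": "fluid", "category": "category"},
)
```

The resulting key, `LAB//50912`, is the same string an event with `itemid` 50912 would carry as its code, which is what makes the link possible.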

🛠️ Troubleshooting

Performance Optimization

- Manually pre-shard your input data if you have very large files. You can then configure your pipeline to skip the row-sharding stage and start directly with the convert_to_subject_sharded stage.
- Use parallel processing for faster extraction via the typical MEDS-Transforms parallelization options.
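
As a rough illustration of pre-sharding, the sketch below splits one large CSV into fixed-size pieces with the standard library. The shard naming and output layout are hypothetical; this README does not specify the layout the pipeline expects when the row-sharding stage is skipped, so check that before relying on it.

```python
import csv
from itertools import islice
from pathlib import Path


def shard_csv(src, out_dir, rows_per_shard):
    """Split `src` into numbered CSV shards, repeating the header in each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        shard_idx = 0
        while True:
            # Read at most rows_per_shard rows without loading the whole file.
            chunk = list(islice(reader, rows_per_shard))
            if not chunk:
                break
            shard_fp = out_dir / f"{Path(src).stem}_{shard_idx}.csv"
            with open(shard_fp, "w", newline="") as out:
                writer = csv.writer(out)
                writer.writerow(header)
                writer.writerows(chunk)
            written.append(shard_fp)
            shard_idx += 1
    return written
```

Only one chunk is held in memory at a time, so `rows_per_shard` bounds the memory use regardless of the size of the source file.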

Future Roadmap

- Incorporating more of the common pre-MEDS and joining logic into this repository.
- Automatic support for running in "demo mode" for testing and validation.
- Better examples and documentation for common use cases, including incorporating data cleaning stages after the core extraction.
- Providing a default runner or multiple default pipeline files for user convenience.

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for more details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

MEDS Extract builds on the MEDS-Transforms framework and the MEDS standard. Special thanks to:

- The MEDS community for developing the standard
- Contributors to MEDS-Transforms for the underlying infrastructure
- Healthcare institutions sharing their data for research

📖 Citation

If you use MEDS Extract in your research, please cite:

@software{meds_extract2024,
  title={MEDS Extract: ETL Pipelines for Converting EHR Data to MEDS Format},
  author={McDermott, Matthew and contributors},
  year={2024},
  url={https://github.com/mmcdermott/MEDS_extract}
}

Ready to standardize your EHR data? Start with our Quick Start guide or explore our examples directory for real-world configurations.

Release history

- 0.6.1: Apr 23, 2026
- 0.6.0: Apr 10, 2026
- 0.5.0: Nov 5, 2025
- 0.4.1: Jul 17, 2025
- 0.4.0: Jul 10, 2025 (this version)
- 0.3.0: May 9, 2025
- 0.2.0: May 9, 2025
- 0.1.3: Apr 5, 2025
- 0.1.2: Apr 4, 2025
- 0.1.1: Apr 4, 2025
- 0.1: Apr 4, 2025

Download files

Source Distribution

meds_extract-0.4.0.tar.gz (94.6 kB)

Uploaded Jul 10, 2025 Source

Built Distribution

meds_extract-0.4.0-py3-none-any.whl (48.4 kB)

Uploaded Jul 10, 2025 Python 3

File details

Details for the file meds_extract-0.4.0.tar.gz.

File metadata

Download URL: meds_extract-0.4.0.tar.gz
Upload date: Jul 10, 2025
Size: 94.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for meds_extract-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`9867415a21e3a9bfc65af4be88a7aad1821c6f2fc914d74b771fbc538327e325`
MD5	`7a21d124bf97fe32799ee244fdc2d19d`
BLAKE2b-256	`7eee1abc0a7191b720f2586f41a70cc44261b356dd0616df9c4e104c336f8c6a`
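
To check a downloaded copy of the source distribution against the SHA256 digest listed above, something like the following should work. The function name and the local file path are assumptions; the expected digest is copied from the table.

```python
import hashlib

# SHA256 digest for meds_extract-0.4.0.tar.gz, copied from the table above.
EXPECTED_SHA256 = "9867415a21e3a9bfc65af4be88a7aad1821c6f2fc914d74b771fbc538327e325"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file at `path` has the expected SHA256 digest."""
    return sha256_of(path) == expected
```

For example, `matches_expected("meds_extract-0.4.0.tar.gz")` should return `True` for an intact download in the current directory.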

Provenance

The following attestation bundles were made for meds_extract-0.4.0.tar.gz:

Publisher: python-build.yaml on mmcdermott/MEDS_extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: meds_extract-0.4.0.tar.gz
- Subject digest: 9867415a21e3a9bfc65af4be88a7aad1821c6f2fc914d74b771fbc538327e325
- Sigstore transparency entry: 270369112
- Sigstore integration time: Jul 10, 2025
Source repository:
- Permalink: mmcdermott/MEDS_extract@4552e35a2883d3f51aea450314e06ddbd6011f3e
- Branch / Tag: refs/tags/0.4.0
- Owner: https://github.com/mmcdermott
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-build.yaml@4552e35a2883d3f51aea450314e06ddbd6011f3e
- Trigger Event: push

File details

Details for the file meds_extract-0.4.0-py3-none-any.whl.

File metadata

Download URL: meds_extract-0.4.0-py3-none-any.whl
Upload date: Jul 10, 2025
Size: 48.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for meds_extract-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`942c10d0b1c6029ddc1a65d2a121f71103a2160d1f826933eba3b2044f2f788d`
MD5	`67423354d050644a1ec1281c970098a0`
BLAKE2b-256	`f393c39973d74f8ca37cfb5d7214bf10dccfa4528876c12d4c078cb0baabfbda`

Provenance

The following attestation bundles were made for meds_extract-0.4.0-py3-none-any.whl:

Publisher: python-build.yaml on mmcdermott/MEDS_extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: meds_extract-0.4.0-py3-none-any.whl
- Subject digest: 942c10d0b1c6029ddc1a65d2a121f71103a2160d1f826933eba3b2044f2f788d
- Sigstore transparency entry: 270369119
- Sigstore integration time: Jul 10, 2025
Source repository:
- Permalink: mmcdermott/MEDS_extract@4552e35a2883d3f51aea450314e06ddbd6011f3e
- Branch / Tag: refs/tags/0.4.0
- Owner: https://github.com/mmcdermott
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-build.yaml@4552e35a2883d3f51aea450314e06ddbd6011f3e
- Trigger Event: push
