A data standard for working with event stream data

meds_etl

A collection of ETLs from common data formats to Medical Event Data Standard (MEDS)

This package currently supports:

  • MIMIC-IV
  • OMOP v5.4
  • MEDS Unsorted, an unsorted version of MEDS

Setup

Install the package

pip install meds_etl

Backends

ETLs are one of the most computationally heavy components of MEDS, so efficiency is very important.

MEDS-ETL has several parallel implementations of core algorithms to balance the tradeoff between efficiency and ease of use.

Most commands take an additional --backend parameter that lets users switch between backends.

We currently support two backends: polars (the default) and cpp.

Backend information:

  • polars (default backend): A Python-only implementation that requires only polars to run. Its main drawback is that it is rather inefficient. It's recommended to use as few shards as possible while still avoiding out-of-memory errors.

  • cpp: A custom C++ backend. This backend is very efficient, but might not run on all platforms and has a limited feature set. It's recommended to use the same number of shards as you have CPUs available.

If you want to use the cpp backend, make sure to install meds_etl with the correct optional dependencies.

# For the cpp backend
pip install "meds_etl[cpp]"
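
As a rough illustration of the backend guidance above, the following sketch builds an ETL command line with an explicit --backend value and suggests a starting shard count. The helper names build_etl_command and suggested_shards and the example paths are hypothetical and not part of meds_etl. The option that sets the shard count is not covered here, so the sketch only computes a number and does not pass it to the command.

```python
import os
import shlex


def build_etl_command(tool, source, output, backend="polars"):
    """Return the argv list for an ETL command with an explicit backend."""
    if backend not in ("polars", "cpp"):
        raise ValueError(f"unsupported backend: {backend!r}")
    return [tool, source, output, "--backend", backend]


def suggested_shards(backend, cpu_count=None):
    """Starting point for the shard count, following the guidance above."""
    if backend == "cpp":
        # cpp: match the number of available CPUs.
        return cpu_count or os.cpu_count() or 1
    # polars: start as low as possible and raise only on out-of-memory errors.
    return 1


# Hypothetical source and output paths, for illustration only.
cmd = build_etl_command("meds_etl_mimic", "mimic-iv", "mimic-meds", backend="cpp")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run once meds_etl is installed with the matching optional dependencies.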

MIMIC-IV

In order to run the MIMIC-IV ETL, simply run the following command:

meds_etl_mimic [PATH_TO_SOURCE_MIMIC] [PATH_TO_OUTPUT]

where [PATH_TO_SOURCE_MIMIC] is a download of MIMIC-IV and [PATH_TO_OUTPUT] will be the destination path for the MEDS dataset.

OMOP

In order to run the OMOP ETL, simply run the following command:

meds_etl_omop [PATH_TO_SOURCE_OMOP] [PATH_TO_OUTPUT]

where [PATH_TO_SOURCE_OMOP] is a folder containing csv files (optionally gzipped) for an OMOP dataset and [PATH_TO_OUTPUT] will be the destination path for the MEDS dataset. Each OMOP table should either be a csv file with the table name (such as person.csv) or a folder with the table name containing csv files.
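
The two accepted layouts can be checked before starting a long ETL run. The sketch below is only a reading of the description above, not the lookup logic meds_etl_omop itself uses; the helper name find_omop_table is hypothetical, and the assumption that gzipped files end in .csv.gz is ours.

```python
from pathlib import Path


def find_omop_table(source_dir, table):
    """Return the csv files for one OMOP table, or an empty list if absent."""
    root = Path(source_dir)
    # Layout 1: a single file named after the table, optionally gzipped
    # (the .csv.gz suffix is an assumption).
    for name in (f"{table}.csv", f"{table}.csv.gz"):
        candidate = root / name
        if candidate.is_file():
            return [candidate]
    # Layout 2: a folder named after the table that contains csv files.
    folder = root / table
    if folder.is_dir():
        return sorted(
            p for p in folder.iterdir() if p.name.endswith((".csv", ".csv.gz"))
        )
    return []
```

For example, find_omop_table("omop_source", "person") returns the path to person.csv if that file exists, or the csv files inside a person/ folder otherwise.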

Unit tests

Tests can be run from the project root with the following command:

pytest -v

Tests requiring data will be skipped unless the tests/data/ folder is populated first.

To download the testing data, run the following command from the project root:

# Download the MIMIC-IV-Demo dataset (v2.2) to a tests/data/ directory
wget -r -N -c --no-host-directories --cut-dirs=1 -np -P tests/data https://physionet.org/files/mimic-iv-demo/2.2/

MEDS Unsorted

MEDS itself can be a bit tricky to generate as it has ordering and shard location requirements for events (events for a particular subject must be sorted by time and can only be in one shard).

In order to make it easier to generate MEDS, this package provides a special MEDS Unsorted schema and ETLs that transform between MEDS Unsorted and MEDS. The idea is that instead of writing a complex MEDS ETL, users can instead write a simpler ETL to MEDS Unsorted and then use this package as a final stage.

MEDS Unsorted is simply MEDS without the ordering and shard requirements for events, with the name of the data folder changed from "data" to "unsorted_data".

In order to convert a MEDS Unsorted dataset into MEDS, simply run the following command:

meds_etl_sort meds_unsorted meds

where meds_unsorted is a folder containing MEDS Unsorted data and meds is the target folder to store the MEDS dataset in.
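
The two requirements that the sorting stage restores (each subject in exactly one shard, and each subject's events ordered by time) can be sketched in plain Python. This is a toy illustration and not how meds_etl_sort is implemented. The field names subject_id, time, and code, the modulo shard assignment, and the assumption that every event has a non-null time are all illustrative choices.

```python
from collections import defaultdict
from datetime import datetime


def sort_into_shards(events, num_shards):
    """Group events so each subject lives in one shard, sorted by time."""
    shards = defaultdict(list)
    for event in events:
        # Every event for a subject maps to the same shard index.
        shards[event["subject_id"] % num_shards].append(event)
    for rows in shards.values():
        # Within a shard, order by subject and then by time.
        rows.sort(key=lambda e: (e["subject_id"], e["time"]))
    return dict(shards)


# Events arrive in arbitrary order, as MEDS Unsorted allows.
unsorted_events = [
    {"subject_id": 2, "time": datetime(2020, 1, 5), "code": "B"},
    {"subject_id": 1, "time": datetime(2020, 3, 1), "code": "C"},
    {"subject_id": 1, "time": datetime(2020, 1, 1), "code": "A"},
]
shards = sort_into_shards(unsorted_events, num_shards=2)
```

An ETL that targets MEDS Unsorted can skip this step entirely and emit events in whatever order the source provides.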

Troubleshooting

Polars incompatible with Mac M1

If you get this error when running meds_etl:

RuntimeWarning: Missing required CPU features.

The following required CPU features were not detected:
    avx, fma
Continuing to use this version of Polars on this processor will likely result in a crash.
Install the `polars-lts-cpu` package instead of `polars` to run Polars with better compatibility.

Then you'll need to run the following:

pip uninstall polars
pip install polars-lts-cpu
