An ETL to convert OMOP data to the MEDS format.

Project description

MEDS OMOP ETL

An ETL pipeline for transforming Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) datasets into the MEDS format using the MEDS-Transforms library. We gratefully acknowledge the developers of the first OMOP MEDS ETL, which inspired this package and can be found here: https://github.com/Medical-Event-Data-Standard/meds_etl.

We currently support OMOP 5.3 and 5.4 datasets. Earlier versions might work but are untested and are probably rarely used in practice anymore. Please open a pull request if you want to add support for earlier versions.

Setup

First install the package:

pip install OMOP_MEDS

Then set the following environment variables and run the pipeline:

export DATASET_NAME="Your_OMOP_Dataset_Name" # e.g. MIMIC_IV_OMOP
export OMOP_VERSION="5.3" # or 5.4
export RAW_INPUT_DIR="path/to/your/input"
export ROOT_OUTPUT_DIR="/path/to/your/output"
OMOP_MEDS raw_input_dir=$RAW_INPUT_DIR root_output_dir=$ROOT_OUTPUT_DIR

To try with the MIMIC-IV OMOP demo dataset (this downloads a version to your local machine), you can run:

OMOP_MEDS raw_input_dir=path/to/your/input root_output_dir=/path/to/your/output do_download=True ++do_demo=True

Example config for an OMOP dataset:

dataset_name: MIMIC_IV_OMOP
raw_dataset_version: 1.0
omop_version: 5.3

urls:
  dataset:
    - https://physionet.org/content/mimic-iv-demo-omop/0.9/
    - url: EXAMPLE_CONTROLLED_URL
      username: ${oc.env:DATASET_DOWNLOAD_USERNAME}
      password: ${oc.env:DATASET_DOWNLOAD_PASSWORD}
  demo:
    - https://physionet.org/content/mimic-iv-demo-omop/0.9/
  common:
    - EXAMPLE_SHARED_URL # Often used for shared metadata files

Run this with:

OMOP_MEDS ++DATASET_CFG=your_config.yaml raw_input_dir=path/to/your/input root_output_dir=/path/to/your/output \
do_download=True
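
The example config reads the credentials for the controlled URL from the environment through the `${oc.env:...}` interpolations, so those variables are expected to be set before a run with do_download=True. A minimal pre-flight sketch in Python, using only the two variable names from the config above; the helper name `missing_credentials` is hypothetical and not part of OMOP_MEDS:

```python
import os

# Variable names taken from the example config above.
REQUIRED_VARS = ("DATASET_DOWNLOAD_USERNAME", "DATASET_DOWNLOAD_PASSWORD")


def missing_credentials(env=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Set these before downloading controlled data:", ", ".join(missing))
    else:
        print("Download credentials found in the environment.")
```

Running a check like this before the ETL gives a readable message up front instead of a configuration error during the download step.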

Differences with the original meds_etl_omop

This package is designed as a more flexible and configurable alternative to the original meds_etl_omop package. We make a few important choices that affect your downstream training and task definitions:

  • We use the mapped concepts by default. These are more standardized across datasets and, for large health systems, can be cleaner, especially if you are working with a limited tokenizer on a large dataset. You can still use the source concepts by setting ++prefer_source=True.
  • We use more tables than the original meds_etl_omop package, which can lead to more complete patient histories. Watch for potential information leakage. You can change your table configs in pre_MEDS.yaml and event_configs.yaml.
  • This package is more resource intensive. Please adjust your n_shards and watch your memory usage.

Pre-MEDS settings

The following settings can be used to configure the pre-MEDS steps.

OMOP_MEDS \
	root_output_dir=/sc/arion/projects/hpims-hpi/projects/foundation_models_ehr/cohorts/meds_debug/small_demo \
	raw_input_dir=/sc/arion/projects/hpims-hpi/projects/foundation_models_ehr/cohorts/full_omop \
	do_download=False ++do_overwrite=True ++limit_subjects=50

The options in this example are:

  • root_output_dir: Set the root output directory.
  • raw_input_dir: Path to the raw input directory.
  • do_download: Set to False to skip downloading the dataset.
  • ++do_overwrite: Set to True to overwrite existing files.
  • ++limit_subjects: Limit the number of subjects to process.

MEDS-transforms settings

If you want to convert a large dataset, you can parallelize the MEDS-Transforms stage, which is the step that takes the longest.

For local parallelization, install the hydra-joblib-launcher package:

pip install hydra-joblib-launcher --upgrade

Then set the number of workers as an environment variable:

export N_WORKERS=8

Moreover, you can set the number of subjects per shard to balance the parallelization overhead based on how many subjects you have in your dataset:

export N_SUBJECTS_PER_SHARD=100000
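
To get a feel for how these two settings interact, the sketch below estimates the number of shards for a given cohort size and compares it with the worker count. It assumes each shard holds at most N_SUBJECTS_PER_SHARD subjects; the function name and the example cohort size are illustrative and not part of the package:

```python
import math


def estimate_shards(n_subjects, n_subjects_per_shard):
    """Estimate shard count, assuming each shard holds at most n_subjects_per_shard subjects."""
    if n_subjects_per_shard <= 0:
        raise ValueError("n_subjects_per_shard must be positive")
    return math.ceil(n_subjects / n_subjects_per_shard)


if __name__ == "__main__":
    n_workers = 8                    # matches N_WORKERS above
    n_subjects_per_shard = 100_000   # matches N_SUBJECTS_PER_SHARD above
    n_subjects = 1_000_000           # hypothetical cohort size

    n_shards = estimate_shards(n_subjects, n_subjects_per_shard)
    print(f"{n_shards} shards for {n_workers} workers")
    if n_shards < n_workers:
        print("Fewer shards than workers: some workers would sit idle.")
```

If the estimate is lower than the worker count, a smaller N_SUBJECTS_PER_SHARD may spread the work more evenly, at the cost of more per-shard overhead.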

The MIMIC-IV OMOP Dataset

We use the demo dataset for MIMIC-IV in the OMOP format, which is a subset of the MIMIC-IV dataset. The version downloaded from PhysioNet does not include the standard dictionary linking definitions but should otherwise be functional.

Particularities

  • Care site is added to the visit as text
  • Add support for care_site table (visit_detail)

Citation

If you use this dataset, please use the citation link on GitHub.

Download files

Source Distribution

omop_meds-0.1.0.tar.gz (912.3 kB)

Uploaded Source

Built Distribution

omop_meds-0.1.0-py3-none-any.whl (30.6 kB)

Uploaded Python 3

File details

Details for the file omop_meds-0.1.0.tar.gz.

File metadata

  • Download URL: omop_meds-0.1.0.tar.gz
  • Upload date:
  • Size: 912.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for omop_meds-0.1.0.tar.gz
Algorithm Hash digest
SHA256 1b88372560b31721901fe788948e79865d1513f3fecabb761a562aab6b6cf319
MD5 2d7193f3b53ee6821d51a1e76ac8c06b
BLAKE2b-256 c207660a6324291365d20fc11b47f3e16ad19136eb1d6e4cbfab2d5ecc4f1fca
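
A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using Python's standard hashlib module; the function name `sha256_of` is illustrative, and the script assumes the path to the downloaded source distribution is passed as the first command-line argument:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 published above for omop_meds-0.1.0.tar.gz
EXPECTED = "1b88372560b31721901fe788948e79865d1513f3fecabb761a562aab6b6cf319"

if __name__ == "__main__":
    import sys

    # Assumes the path to the downloaded sdist is the first argument.
    actual = sha256_of(sys.argv[1])
    print("OK" if actual == EXPECTED else f"MISMATCH: {actual}")
```

Reading in chunks keeps memory use flat regardless of file size; the same function works for the wheel if EXPECTED is swapped for its digest.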

Provenance

The following attestation bundles were made for omop_meds-0.1.0.tar.gz:

Publisher: python-build.yaml on rvandewater/OMOP_MEDS

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file omop_meds-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: omop_meds-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 30.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for omop_meds-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 559f7e8823cd1ae3467215439b2a622a9c68454e207231405d950949b60f30a7
MD5 6fee7f7cc9558a7be26f892f6c2dd65f
BLAKE2b-256 7cc332cc3292d4fff92d32f355506a2551e8e85f7423ad878a1fdd9fb2f19420

Provenance

The following attestation bundles were made for omop_meds-0.1.0-py3-none-any.whl:

Publisher: python-build.yaml on rvandewater/OMOP_MEDS

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
