An ETL to convert OMOP data to the MEDS format.

MEDS OMOP ETL

An ETL pipeline for transforming Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) datasets into the MEDS format using the MEDS-Transforms library. We gratefully acknowledge the developers of the first OMOP MEDS ETL, which inspired this project and can be found here: https://github.com/Medical-Event-Data-Standard/meds_etl.

We currently support OMOP 5.3 and 5.4 datasets. Earlier versions might work but are untested and are probably no longer used in practice. Please open a pull request if you want to add support for earlier versions.

Setup

First install the package:

pip install OMOP_MEDS

Then set the environment variables for your dataset and run the ETL:

export DATASET_NAME="Your_OMOP_Dataset_Name" # e.g. MIMIC_IV_OMOP
export OMOP_VERSION="5.3" # or 5.4
export RAW_INPUT_DIR="path/to/your/input"
export ROOT_OUTPUT_DIR="/path/to/your/output"
OMOP_MEDS raw_input_dir=$RAW_INPUT_DIR root_output_dir=$ROOT_OUTPUT_DIR
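
The setup step above can also be scripted. The following is a minimal sketch, not part of the package: `build_omop_meds_command` is a hypothetical helper that checks the variables exported above and assembles only the two overrides shown in the command (`raw_input_dir` and `root_output_dir`). It warns, rather than fails, on OMOP versions other than 5.3 and 5.4, since earlier versions are untested but might work.

```python
import warnings

# Versions this README lists as supported; others are untested.
TESTED_OMOP_VERSIONS = {"5.3", "5.4"}


def build_omop_meds_command(env):
    """Return the OMOP_MEDS argument list for an environment-like mapping.

    Hypothetical helper: it only assembles the overrides shown in the README.
    """
    for key in ("RAW_INPUT_DIR", "ROOT_OUTPUT_DIR"):
        if not env.get(key):
            raise ValueError(f"{key} is not set")
    version = env.get("OMOP_VERSION")
    if version not in TESTED_OMOP_VERSIONS:
        warnings.warn(f"OMOP version {version!r} is untested")
    return [
        "OMOP_MEDS",
        f"raw_input_dir={env['RAW_INPUT_DIR']}",
        f"root_output_dir={env['ROOT_OUTPUT_DIR']}",
    ]


# Placeholder values copied from the export lines above.
example_env = {
    "DATASET_NAME": "MIMIC_IV_OMOP",
    "OMOP_VERSION": "5.3",
    "RAW_INPUT_DIR": "path/to/your/input",
    "ROOT_OUTPUT_DIR": "/path/to/your/output",
}
command = build_omop_meds_command(example_env)
print(" ".join(command))
```

In a real script you could pass `os.environ` instead of `example_env` and hand the resulting list to `subprocess.run`.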

To try with the MIMIC-IV OMOP demo dataset (this downloads a version to your local machine), you can run:

OMOP_MEDS raw_input_dir=path/to/your/input root_output_dir=/path/to/your/output do_download=True ++do_demo=True

Example config for an OMOP dataset:

dataset_name: MIMIC_IV_OMOP
raw_dataset_version: 1.0
omop_version: 5.3

urls:
  dataset:
    - https://physionet.org/content/mimic-iv-demo-omop/0.9/
    - url: EXAMPLE_CONTROLLED_URL
      username: ${oc.env:DATASET_DOWNLOAD_USERNAME}
      password: ${oc.env:DATASET_DOWNLOAD_PASSWORD}
  demo:
    - https://physionet.org/content/mimic-iv-demo-omop/0.9/
  common:
    - EXAMPLE_SHARED_URL # Often used for shared metadata files

Run this with:

OMOP_MEDS ++DATASET_CFG=your_config.yaml raw_input_dir=path/to/your/input root_output_dir=/path/to/your/output \
do_download=True
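
The `urls` section of the example config mixes two entry shapes: a plain URL string, and a mapping with `url`, `username`, and `password`, where the credentials use the `${oc.env:...}` interpolation syntax to read environment variables. The sketch below only illustrates that shape using the standard library. How OMOP_MEDS itself processes these entries is not shown in this README, so `resolve_env` and `normalize_url_entries` are hypothetical names, and the credential values are placeholders.

```python
import re

# Matches the ${oc.env:NAME} placeholders used in the example config.
ENV_PATTERN = re.compile(r"\$\{oc\.env:([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value, env):
    """Replace ${oc.env:NAME} with env[NAME]; raises KeyError if NAME is unset."""
    return ENV_PATTERN.sub(lambda match: env[match.group(1)], value)


def normalize_url_entries(entries, env):
    """Turn plain-string and mapping entries into one uniform dict shape."""
    normalized = []
    for entry in entries:
        if isinstance(entry, str):
            # Public URL: no credentials needed.
            normalized.append({"url": entry, "username": None, "password": None})
        else:
            normalized.append(
                {
                    "url": entry["url"],
                    "username": resolve_env(entry["username"], env),
                    "password": resolve_env(entry["password"], env),
                }
            )
    return normalized


# The `urls.dataset` list from the example config, written as Python data.
dataset_entries = [
    "https://physionet.org/content/mimic-iv-demo-omop/0.9/",
    {
        "url": "EXAMPLE_CONTROLLED_URL",
        "username": "${oc.env:DATASET_DOWNLOAD_USERNAME}",
        "password": "${oc.env:DATASET_DOWNLOAD_PASSWORD}",
    },
]
# Placeholder credentials; in practice these come from the real environment.
fake_env = {
    "DATASET_DOWNLOAD_USERNAME": "example_user",
    "DATASET_DOWNLOAD_PASSWORD": "example_password",
}
resolved = normalize_url_entries(dataset_entries, fake_env)
```

The practical point is that `DATASET_DOWNLOAD_USERNAME` and `DATASET_DOWNLOAD_PASSWORD` need to be exported before running with a config that references them.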

Differences with the original meds_etl_omop

This package is designed as a more flexible and configurable alternative to the original meds_etl_omop package. We make a few important choices that have impact on your downstream training and task definitions:

  • We use the mapped concepts by default. These are more standardized across datasets and, for large health systems, can be cleaner, especially if you are working with a limited tokenizer on a large dataset. You can still use the source concepts by setting ++prefer_source=True.
  • We use more tables than the original meds_etl_omop package, which can lead to more complete patient histories. Watch for potential information leakage. You can change your table configs in pre_MEDS.yaml and event_configs.yaml.
  • This package is more resource intensive. Please adjust your n_shards and watch your memory usage.

Pre-MEDS settings

The following settings can be used to configure the pre-MEDS steps.

OMOP_MEDS \
	root_output_dir=/sc/arion/projects/hpims-hpi/projects/foundation_models_ehr/cohorts/meds_debug/small_demo \
	raw_input_dir=/sc/arion/projects/hpims-hpi/projects/foundation_models_ehr/cohorts/full_omop \
	do_download=False ++do_overwrite=True ++limit_subjects=50
  • root_output_dir: Set the root output directory.
  • raw_input_dir: Path to the raw input directory.
  • do_download: Set to False to skip downloading the dataset.
  • ++do_overwrite: Set to True to overwrite existing files.
  • ++limit_subjects: Limit the number of subjects to process.

MEDS-transforms settings

If you want to convert a large dataset, you can parallelize the MEDS-Transforms stage, which is the step that takes the longest.

Local parallelization uses the hydra-joblib-launcher package. Install it first:

pip install hydra-joblib-launcher --upgrade

Then set the number of workers as an environment variable:

export N_WORKERS=8

Moreover, you can set the number of subjects per shard to balance the parallelization overhead based on how many subjects you have in your dataset:

export N_SUBJECTS_PER_SHARD=100000
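
To see how these two settings interact, here is a small arithmetic sketch. It assumes that a shard is the unit of work handed to a worker, which this README does not state explicitly, so treat the numbers as a rough guide. `shard_plan` is a hypothetical helper and the subject count of 250,000 is made up.

```python
def shard_plan(n_subjects, n_subjects_per_shard, n_workers):
    """Return (number of shards, workers that can be busy at once).

    Assumes each shard is processed by one worker at a time.
    """
    # Integer ceiling division: a partially filled last shard still counts.
    n_shards = -(-n_subjects // n_subjects_per_shard)
    busy_workers = min(n_shards, n_workers)
    return n_shards, busy_workers


# 250,000 subjects with the settings shown above: only 3 shards,
# so at most 3 of the 8 workers have something to do.
coarse = shard_plan(250_000, 100_000, 8)

# Smaller shards give every worker something to do: 25 shards for 8 workers.
fine = shard_plan(250_000, 10_000, 8)

print(coarse, fine)  # (3, 3) (25, 8)
```

Under this assumption, a shard size that yields at least as many shards as workers keeps all workers occupied, while very small shards add per-shard overhead.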

The MIMIC-IV OMOP Dataset

We use the demo dataset for MIMIC-IV in the OMOP format, which is a subset of the MIMIC-IV dataset. This dataset, downloaded from PhysioNet, does not include the standard dictionary linking definitions but should otherwise be functional.

Particularities

  • Care site is added to the visit as text
  • Add support for care_site table (visit_detail)

Citation

If you use this dataset, please use the citation link on GitHub.

Download files

Source Distribution

omop_meds-0.2.0.tar.gz (971.7 kB view details)

Uploaded Source

Built Distribution

omop_meds-0.2.0-py3-none-any.whl (37.2 kB view details)

Uploaded Python 3

File details

Details for the file omop_meds-0.2.0.tar.gz.

File metadata

  • Download URL: omop_meds-0.2.0.tar.gz
  • Upload date:
  • Size: 971.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for omop_meds-0.2.0.tar.gz
Algorithm Hash digest
SHA256 d156b76102380ff8233e1142fd15b92cb2ae76418613305aea556e4b9a34bc33
MD5 48363a2fd573089db51c78388dca0add
BLAKE2b-256 fe184ce89d2eea3bc0ef1f3349ab21168b275908c5e71768f9a41bdd60e6e882

Provenance

The following attestation bundles were made for omop_meds-0.2.0.tar.gz:

Publisher: python-build.yaml on rvandewater/OMOP_MEDS

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file omop_meds-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: omop_meds-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 37.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for omop_meds-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b25408395305f01515c1c9c9d89a1ff514a9585f31784342e38673a62706c9ea
MD5 dc1170cbb5488701822d50dd176fc96f
BLAKE2b-256 16ff58b1199bba1522e4bb9cce00df1370946bb927ac0b9c7f4bea0463d24581

Provenance

The following attestation bundles were made for omop_meds-0.2.0-py3-none-any.whl:

Publisher: python-build.yaml on rvandewater/OMOP_MEDS

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
