An ETL to convert OMOP data to the MEDS format.
MEDS OMOP ETL
An ETL pipeline for transforming Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) datasets into the MEDS format using the MEDS-Transforms library. We gratefully acknowledge the developers of the first OMOP MEDS ETL, which inspired this work and can be found here: https://github.com/Medical-Event-Data-Standard/meds_etl.
We currently support OMOP 5.3 and 5.4 datasets. Earlier versions might work but are untested and are probably no longer used in practice. Please open a pull request if you want to add support for earlier versions.
- More information about OMOP can be found here: https://ohdsi.github.io/CommonDataModel/
- More information about MEDS can be found here: https://medical-event-data-standard.github.io/
Setup
First install the package:
pip install OMOP_MEDS
Then:
export DATASET_NAME="Your_OMOP_Dataset_Name" # e.g. MIMIC_IV_OMOP
export OMOP_VERSION="5.3" # or 5.4
export RAW_INPUT_DIR="path/to/your/input"
export ROOT_OUTPUT_DIR="/path/to/your/output"
OMOP_MEDS raw_input_dir=$RAW_INPUT_DIR root_output_dir=$ROOT_OUTPUT_DIR
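If you launch the ETL from a Python script rather than a shell, the same invocation can be assembled from the environment variables above. The following is a minimal sketch: the helper `build_omop_meds_command` is hypothetical and not part of the package. It only builds the argument list for the `OMOP_MEDS` command shown above and fails early if a required variable is missing.

```python
import os


def build_omop_meds_command(env=None):
    """Assemble the OMOP_MEDS argument list from environment variables.

    Hypothetical helper for illustration; it mirrors the shell command above.
    """
    env = os.environ if env is None else env
    required = ("RAW_INPUT_DIR", "ROOT_OUTPUT_DIR")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return [
        "OMOP_MEDS",
        f"raw_input_dir={env['RAW_INPUT_DIR']}",
        f"root_output_dir={env['ROOT_OUTPUT_DIR']}",
    ]


# Demonstration with an explicit mapping instead of the real environment.
example_env = {
    "RAW_INPUT_DIR": "path/to/your/input",
    "ROOT_OUTPUT_DIR": "/path/to/your/output",
}
cmd = build_omop_meds_command(example_env)
print(" ".join(cmd))
```

To run the ETL for real, pass the result of `build_omop_meds_command()` to `subprocess.run(..., check=True)` after exporting the variables.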
To try with the MIMIC-IV OMOP demo dataset (this downloads a version to your local machine), you can run:
OMOP_MEDS raw_input_dir=path/to/your/input root_output_dir=/path/to/your/output do_download=True ++do_demo=True
Example config for an OMOP dataset:
dataset_name: MIMIC_IV_OMOP
raw_dataset_version: 1.0
omop_version: 5.3
urls:
  dataset:
    - https://physionet.org/content/mimic-iv-demo-omop/0.9/
    - url: EXAMPLE_CONTROLLED_URL
      username: ${oc.env:DATASET_DOWNLOAD_USERNAME}
      password: ${oc.env:DATASET_DOWNLOAD_PASSWORD}
  demo:
    - https://physionet.org/content/mimic-iv-demo-omop/0.9/
  common:
    - EXAMPLE_SHARED_URL # Often used for shared metadata files
Run this with:
OMOP_MEDS ++DATASET_CFG=your_config.yaml raw_input_dir=path/to/your/input root_output_dir=/path/to/your/output \
do_download=True
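The `${oc.env:...}` entries in the example config are OmegaConf environment interpolations: the credentials are read from environment variables when the config is resolved, so they never need to be written into the YAML file. The sketch below imitates that behavior with the standard library only, to show what the resolved values look like. It is a simplified stand-in, not OmegaConf's implementation; `resolve_env_placeholders` and the sample credentials are made up for illustration.

```python
import os
import re

# Matches placeholders of the form ${oc.env:VARIABLE_NAME}
_ENV_PATTERN = re.compile(r"\$\{oc\.env:([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_placeholders(value, env=None):
    """Replace ${oc.env:NAME} placeholders with values from the environment."""
    env = os.environ if env is None else env

    def _lookup(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"Environment variable {name!r} is not set")
        return env[name]

    return _ENV_PATTERN.sub(_lookup, value)


# Illustrative credentials; in practice these come from your shell environment.
fake_env = {
    "DATASET_DOWNLOAD_USERNAME": "alice",
    "DATASET_DOWNLOAD_PASSWORD": "s3cret",
}
entry = {
    "url": "EXAMPLE_CONTROLLED_URL",
    "username": "${oc.env:DATASET_DOWNLOAD_USERNAME}",
    "password": "${oc.env:DATASET_DOWNLOAD_PASSWORD}",
}
resolved = {key: resolve_env_placeholders(val, fake_env) for key, val in entry.items()}
print(resolved["username"])  # alice
```

In practice this means exporting `DATASET_DOWNLOAD_USERNAME` and `DATASET_DOWNLOAD_PASSWORD` before running the command with `do_download=True`.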
Differences with the original meds_etl_omop
This package is designed as a more flexible and configurable alternative to the original meds_etl_omop package.
We make a few important choices that affect your downstream training and task definitions:
- We use the mapped concepts by default, which are more standardized across datasets and, for large health systems, can be cleaner, especially if you are working with a limited tokenizer on a large dataset. You can still use the source concepts by setting ++prefer_source=True.
- We use more tables than the original meds_etl_omop package, which can lead to more complete patient histories. Watch for potential information leakage. You can change your table configs in pre_MEDS.yaml and event_configs.yaml.
- This package is more resource intensive; please adjust your n_shards and watch your memory usage.
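To make the mapped-versus-source choice concrete: OMOP clinical tables carry both a standard (mapped) concept column such as `condition_concept_id` and a source column such as `condition_source_concept_id`. The sketch below illustrates the idea behind `++prefer_source=True` on a single row. It is not the package's implementation; the function `select_concept_id`, the fallback to the mapped concept when the source concept is missing or 0, and the concept IDs are all assumptions for illustration.

```python
def select_concept_id(row, table_prefix, prefer_source=False):
    """Pick the concept ID to emit for one OMOP row (hypothetical sketch).

    row: dict of column name -> value for a single record.
    table_prefix: column prefix such as "condition", "drug", or "measurement".
    """
    mapped = row.get(f"{table_prefix}_concept_id")
    source = row.get(f"{table_prefix}_source_concept_id")
    # Assumption: fall back to the mapped concept when the source concept is
    # absent or 0 (OMOP uses concept ID 0 for "no matching concept").
    if prefer_source and source:
        return source
    return mapped


# Illustrative IDs only, not real vocabulary entries.
row = {"condition_concept_id": 1001, "condition_source_concept_id": 2002}
print(select_concept_id(row, "condition"))                      # 1001
print(select_concept_id(row, "condition", prefer_source=True))  # 2002
```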
Pre-MEDS settings
The following settings can be used to configure the pre-MEDS steps.
OMOP_MEDS \
root_output_dir=/sc/arion/projects/hpims-hpi/projects/foundation_models_ehr/cohorts/meds_debug/small_demo \
raw_input_dir=/sc/arion/projects/hpims-hpi/projects/foundation_models_ehr/cohorts/full_omop \
do_download=False ++do_overwrite=True ++limit_subjects=50
- root_output_dir: Set the root output directory.
- raw_input_dir: Path to the raw input directory.
- do_download: Set to False to skip downloading the dataset.
- ++do_overwrite: Set to True to overwrite existing files.
- ++limit_subjects: Limit the number of subjects to process.
MEDS-transforms settings
If you want to convert a large dataset, you can parallelize the MEDS-Transforms stage, which is the step that takes the longest.
Local parallelization relies on the hydra-joblib-launcher package, which you can install or upgrade with:
pip install hydra-joblib-launcher --upgrade
Then, you can set the number of workers as an environment variable:
export N_WORKERS=8
Moreover, you can set the number of subjects per shard to balance the parallelization overhead based on how many subjects you have in your dataset:
export N_SUBJECTS_PER_SHARD=100000
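As a rough guide for choosing these two values, the number of shards is approximately the subject count divided by the shard size, rounded up. With fewer shards than workers, some workers have nothing to do. The sketch below shows this arithmetic; `estimate_n_shards` is a hypothetical helper and the subject count is made up. The actual sharding done by MEDS-Transforms may differ, so treat the result as an estimate only.

```python
def estimate_n_shards(n_subjects, n_subjects_per_shard):
    """Estimate the shard count as ceil(n_subjects / n_subjects_per_shard)."""
    if n_subjects < 0 or n_subjects_per_shard <= 0:
        raise ValueError("n_subjects must be >= 0 and n_subjects_per_shard > 0")
    # Integer ceiling division avoids floating-point rounding on large counts.
    return -(-n_subjects // n_subjects_per_shard)


n_workers = 8
n_shards = estimate_n_shards(250_000, 100_000)
print(n_shards)  # 3
if n_shards < n_workers:
    print(f"Only {n_shards} shards for {n_workers} workers; "
          "a smaller N_SUBJECTS_PER_SHARD would keep more workers busy.")
```

In this example, lowering the shard size to 25000 would give 10 shards, enough to occupy all 8 workers at the cost of more per-shard overhead.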
The MIMIC-IV OMOP Dataset
We use the demo dataset for MIMIC-IV in the OMOP format, which is a subset of the MIMIC-IV dataset. The version downloaded from PhysioNet does not include the standard dictionary linking definitions but should otherwise be functional.
Particularities
- Care site is added to the visit as text
- Add support for care_site table (visit_detail)
Citation
If you use this dataset, please use the citation link on GitHub.