Skip to main content

An ETL pipeline to transform your EMP data to OMOP.

Project description

Rabbit in a Blender

Rabbit in a Blender is an ETL pipeline CLI to transform your EMR data to OMOP.

Why the name 'Rabbit in a Blender'? It stays in the rabbit theme of the OHDSI tools, and an ETL pipeline is like putting all your data in a blender.

No rabbits were harmed during the development of this tool!

Introduction

Extract-Transform-Load (ETL) processes are very complex and are mainly crafted by highly skilled data engineers. The process of transforming the electronic medical record (EMR) data into the observational medical outcomes partnership (OMOP) common data model (CDM) is no exception. The mapping process of the source values to standard concepts is mostly done by subject matter experts, who lack the knowledge of programming the ETL process. Wouldn’t it be nice if we could drastically simplify the ETL process, so that you don’t need seasoned data engineers to start the OMOP CDM journey. Imagine that you just save your queries, Usagi comma separated value (CSV) text files and custom concept CSV’s on disk, and run a command line interface (CLI) tool that does all the ETL magic automatically.

Concept

The main strength of the CDM is its simplified scheme. This scheme is a relational data model, where each table has a primary key and can have foreign keys to other tables. Because of the relational data model, we can extract the dependencies of the tables from the scheme. For example, the provider table is dependent on the care_site table, which is in its turn dependent on the location table. If we flatten that dependency graph, we have a sequence of ETL steps that we need to follow to have consistent data in our OMOP CDM. These ETL steps can be automated, so a hospital can focus its resources on the queries and the mapping of the concepts. The automated ETL consists of multiple tasks. It needs to execute queries, add custom concepts, apply the Usagi source to concept mapping, and do a lot of housekeeping. An example of that housekeeping is the autonumbering of the OMOP CDM primary keys, for which the ETL process needs to maintain a swap table that holds the key of the source table and the generated sequential number of the CDM table’s primary key. Another example of the housekeeping is the upload and processing of the Usagi CSV’s and also the upload and parsing of the custom concept CSV’s. In an ETL process data is divided in zones (cfr. the zones in a data lake). The raw zone holds the source data (for example the data from the EMR), the work zone holds all the house keeping tables of the ETL process and the gold zone holds our final OMOP CDM. After designing the architecture, the implementation needs to be developed. We have two options to choose from: configuration and convention as design paradigm. We choose convention over configuration, because it decreases the number of decisions the user has to make and eliminates the complexity. As convention a specific folder structure is adopted (see our mappings as example). A folder is created for each OMOP CDM table, where the SQL queries are stored to fill up the specific CDM table. In the table folders we also have for each concept column a sub folder. Those concept column sub folders hold our Usagi CSV’s (files ending with _usagi.csv). We also have a custom folder in the concept column sub folder, that holds the custom concept CSV’s (files ending with _concept.csv). With this convention in place, our ETL CLI tool has everything it needs to do its magic. One final requirement we want to build in the ETL CLI tool, is that each ETL step is an atomic operation, it either fails or succeeds, so that there is no possibility to corrupt the final OMOP CDM data.

Notes on Use

You will need to run the cleanup command when concept mappings change in your existing Usagi CSV's. The cleanup is not necessary when you add new queries or add additional Usagi mappings.

The measurement table has the measurement_event_id field, the observation table has the observation_event_id field, the cost table has the cost_event_id field, the episode_event table has the event_id field and the fact_relationship table has the fact_id_1 and fact_id_2 fields. All those fields are foreign keys to almost all OMOP CMD tables. Put the source id in the event_id field and the reffered table in the field_concept_id field. An example of how to implement this in the sql-file:

  • Linking two people with a personal relationship in the fact_relationship table:
select distinct
  'person' as domain_concept_id_1  -- foreign table name as string
  ,pr.person_nr_1 as fact_id_1	-- same key as used as when adding the person to the person table
  ,'person' as domain_concept_id_2  -- foreign table name as string	
  ,pr.person_nr_1 as fact_id_2	-- same key as used as when adding the person to the person table
  ,pr.relationship as relationship_concept_id  -- column with sourceCodes specifying the relationship and mapped in a _usagi.csv file in the relationship_concept_id folder
from person_relationships pr

The custom concepts get added to the source_to_concept_map table. The concept_code is used as the source_code, the newly assigned (>2.000.000.000) concept_id is used as the target_concept_id. The custom concepts are also added to the usagi table, using the concept_code as sourceCode and the newly assigned concept_id as the conceptId. The custom concepts can be used as mapping targets in two ways:

  • No explicit mapping in usagi table: the source code (as assigned in your ETL sql file) is mapped to the custom concept where it equals the concept_code
  • Explicit mapping in usagi: map a sourceCode in the usagi table to a custom concept by setting its targetConceptId equal to the concept_code of the custom concept.

Examples:

  • mapping "Cemiplimab", with source_code M1:

    • Add custom concept for cemiplimab with concept_code = M1, it gets a assigned a concept_id automatically, no usagi-entry required
    • Optionally add an usagi-entry for clarity, mapping sourceCode = M1 to targetConceptId = M1, the concept_id gets filled in automatically
  • mapping "Atezolizumab + Cemiplimab", with source_code C12:

    • Add custom concept for cemiplimab with concept_code = M1, it gets a assigned a concept_id automatically
    • Add two usagi-entries: mapping sourceCode C12 to targetConceptId = 1792776 (standard code for atezolizumab) and to targetConceptId = M1

    ALTERNATIVELY:

    • Add custom concept for cemiplimab with concept_code = C12, it gets a assigned a concept_id automatically, no usagi-entry required
    • Add only one usagi-entry, mapping sourceCode C12 to targetConceptId = 1792776 (standard code for atezolizumab), the second mapping of C12 to custom concept with concept_code = C12 is done automatically.

For the moment we only implemented a BigQuery backend for the ETL process, because this is what our hospital uses. Other database technologies as ETL backend can be implemented.

Installation

pip install Rabbit-in-a-Blender

Config

With the addition of additional database engines, we switched to a ini config file for database specific configurations. This makes the CLI arguments less cumbersome.

RiaB searches for the ini config file by using the following cascade:

  1. CLI --config argument
  2. RIAB_CONFIG environment variable (the RIAB_CONFIG environment variable can also be placed in a .env file in the current folder)
  3. riab.ini in the current folder

Below an example of a config:

[RiaB]
db_engine=BigQuery

[BigQuery]
credentials_file=service_account.json
; Optional
; The credentials file must be a service account key, stored authorized user credentials, external account credentials, or impersonated service account credentials. (see https://google-auth.readthedocs.io/en/master/reference/google.auth.html#google.auth.load_credentials_from_file)
; Alternatively, you can also use 'Application Default Credentials' (ADC) (see https://cloud.google.com/sdk/gcloud/reference/auth/application-default/login)
location=EU
; Location where to run the BigQuery jobs. Must match the location of the datasets used in the query. (important for GDPR)
project_raw=my_omop_project
; Optional
; Can be handy if you use jinja templates for your ETL queries (ex if you are using development-staging-production environments). Must have the following format: PROJECT_ID
dataset_work=my_omop_project.work
; The dataset that will hold RiaB's housekeeping tables. Must have the following format: PROJECT_ID.DATASET_ID
dataset_omop=my_omop_project.omop
; The dataset that will hold the OMOP tables. Must have the following format: PROJECT_ID.DATASET_ID
dataset_dqd=my_omop_project.dqd
; The dataset that will hold the data quality tables. Must have the following format: PROJECT_ID.DATASET_ID
dataset_achilles=my_omop_project.achilles
; The dataset that will hold the data achilles tables. Must have the following format: PROJECT_ID.DATASET_ID
bucket=gs://my_omop_bucket/upload
; The Cloud Storage bucket uri, that will hold the uploaded Usagi and custom concept files. (the uri has format 'gs://{bucket_name}/{bucket_path}')

CLI Usage

  • Options:
    command help
    -h, --help Show help message and exit
    -v, --verbose Verbose logging (logs are also writen to a log file in the systems tmp folder)
    --config Optional path to the ini config file that holds the database engine configuration. Alternatively set the RIAB_CONFIG environment variable, pointing to the ini file. Or place a riab.ini file in the current directory.
  • ETL Commands:

    command help
    -cd, --create-db Create the OMOP CDM tables
    -cf, --create-folders [PATH] Create the ETL folder structure that will hold your queries, Usagi CSV's an custom concept CSV's.
    -i, --import-vocabularies [VOCABULARIES_ZIP_FILE] Extracts the vocabulary zip file (downloaded from the Athena website) and imports it into the OMOP CDM database.
    -r [PATH], --run-etl [PATH] Runs the ETL, pass the path to ETL folder structure that holds your queries, Usagi CSV's an custom concept CSV's.
    -c, --cleanup [TABLE] Cleanup all the OMOP tables, or just one. Be aware that the cleanup of a single table can screw up foreign keys! For instance cleaning up only the 'Person' table, will result in clicical results being mapped to the wrong persons!!!!
    -dq, --data-quality Check the data quality and store the results.
    -dqd, --data-quality-dashnoard View the results of the data quality checks. (UNDER DEVELOPMENT)
  • Run ETL specific command options (-r [PATH], --run-etl [PATH]):

    command help
    -t [TABLE], --table [TABLE] Do only ETL on this specific OMOP CDM table
    -q [PATH], --only-query [PATH] Do ETL for a specified sql file in the CDM folder structure. (ex: measurement/lab_measurements.sql)
    -s, --skip-usagi-and-custom-concept-upload Skips the parsing and uploading of the Usagi and custom concept CSV's. Skipping results in a significant speed boost.
  • Data quality specific command options (-dq, --data-quality):

    command help
    --json [PATH] Save the data quality result as JSON file for use in the OHDSI Data Quality Dashboard.
  • Data quality dashboard specific command options (-dqd, --data-quality-check):

    command help
    --port [PORT] The port the dashboard schould listen on.

CLI Examples

Create the OMOP CDM database:

riab --create-db

Import your downloaded vocabularies (from Athena) zip file:

riab --import-vocabularies ./vocabulary-2022-07-28.zip

Create the ETL folder structure:

riab --create-folders ./OMOP_CDM

Run full ETL:

riab --run-etl ./OMOP-CDM

Run ETL on one table:

riab --run-etl ./OMOP-CDM \
  --table provider

Run ETL without re-upload of Usagi CSV's and custom concept CSV's:

riab --run-etl ./OMOP-CDM \
  --skip-usagi-and-custom-concept-upload

Run ETL for a specified sql file in the CDM folder structure. (ex: measurement/lab_measurements.sql)

riab --run-etl ./OMOP-CDM \
  --skip-usagi-and-custom-concept-upload \
  --only-sql measurement/lab_measurements.sql

Cleanup all tables:

riab --cleanup

Cleanup one table (example provider table):

riab --cleanup provider

Data quality check:

riab --data-quality

Data quality check (export result to JSON):

riab --data-quality
  --json dqd_result.json

Data quality dashboard (default port = 8050):

riab --data-quality-dashboard
  --port 8888

BigQuery

There are 2 ways to authenticate with GCP:

  • Use a Service Account key file with --google-credentials-file cli option

  • When developing or testing you can use Application Default Credentials (ADC)

    Install the gcloud CLI!

    For example for windows run the folowing powershell script:

    (New-Object Net.WebClient).DownloadFile("https://dl.google.com/dl/cloudsdk/channels/rapid/GoogleCloudSDKInstaller.exe", "$env:Temp\GoogleCloudSDKInstaller.exe")
    & $env:Temp\GoogleCloudSDKInstaller.exe
    

    Authenticate:

    # login
    gcloud auth application-default login
    # set our project
    PROJECT_ID="our_omop_etl_project_id_on_GCP" #you need to change this
    gcloud config set project ${PROJECT_ID}
    

    More info can also be found in the Python API for GCP authentication

TODO

  • improve documentation
  • fix visits and observation period to be data driven (based on other clinical tables in OMOP)
  • add concept_relationship and concept_ancestor tables for custom concepts to framework
  • generate ERA tables (drug_era, dose_era and condition_era)
  • improve error logging: check for duplicate concept codes in custom concept files

Authors

License

Copyright © 2022, RADar-AZDelta. Released under the GNU General Public License v3.0.


Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

Rabbit-in-a-Blender-0.0.35.tar.gz (502.6 kB view hashes)

Uploaded Source

Built Distribution

Rabbit_in_a_Blender-0.0.35-py3-none-any.whl (747.0 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page