
Spark ETL tools for generating the CEHR-BERT and CEHR-GPT pre-training and fine-tuning data

Project description

cehrbert_data

cehrbert_data is the ETL tool that generates the pre-training and fine-tuning datasets for CEHR-BERT, a large language model developed for structured EHR data. The work was published at https://proceedings.mlr.press/v158/pang21a.html.

Patient Representation

For each patient, all medical codes are aggregated and ordered chronologically into a single sequence. To incorporate temporal information, we insert an artificial time token (ATT) between two neighboring visits based on their time interval:

  1. less than 28 days: the ATT takes the form $W_n$, where n is the week number ranging from 0-3 (e.g. $W_1$);
  2. between 28 days and 365 days: the ATT takes the form $M_n$, where n is the month number ranging from 1-11 (e.g. $M_{11}$);
  3. beyond 365 days: an LT (Long Term) token is inserted.

In addition, we add two special tokens, VS and VE, to represent the start and the end of a visit and explicitly define the visit segment; all concepts associated with a visit are subsumed by its VS and VE.
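The ATT bucketing above can be sketched as a small Python function. This is an illustration only: the exact interval boundaries live in the ETL code, and the clipping of the month number into 1-11 here is an assumption made to match the stated range.

```python
def att_token(days_between: int) -> str:
    """Map the interval (in days) between two neighboring visits
    to an artificial time token (ATT). Illustrative bucketing."""
    if days_between < 28:
        # week number 0-3, e.g. W1
        return f"W{days_between // 7}"
    if days_between < 365:
        # month number clipped to 1-11 (assumed), e.g. M11
        return f"M{min(max(days_between // 30, 1), 11)}"
    # beyond a year: long-term token
    return "LT"
```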

(Figure: patient representation)

Pre-requisite

The project is built with Python 3.10, and the project dependencies need to be installed.

Create a new Python virtual environment

python3.10 -m venv .venv;
source .venv/bin/activate;

Build the project

pip install -e .

Download jtds-1.3.1.jar and copy it into the Spark jars folder of the Python virtual environment

cp jtds-1.3.1.jar .venv/lib/python3.10/site-packages/pyspark/jars/

Instructions for Use

1. Download OMOP tables as parquet files

We created a Spark app to download OMOP tables from SQL Server as parquet files. You need to adjust the properties in db_properties.ini to match your database setup.

PYTHONPATH=./:$PYTHONPATH spark-submit tools/download_omop_tables.py -c db_properties.ini -tc person visit_occurrence condition_occurrence procedure_occurrence drug_exposure measurement observation_period concept concept_relationship concept_ancestor -o ~/Documents/omop_test/

We have prepared a Synthea dataset with 1M patients for testing, which you can download at omop_synthea.tar.gz

tar -xvf omop_synthea.tar.gz -C ~/Documents/omop_test/

2. Generate training data for CEHR-BERT

We order the patient events chronologically and put all data points in a single sequence. We insert the artificial tokens VS (visit start) and VE (visit end) at the start and end of each visit. In addition, we insert artificial time tokens (ATTs) between visits to indicate the time intervals between them. This approach allows us to apply BERT to structured EHR data as-is. Conceptually, the sequence looks like [VS] [V1] [VE] [ATT] [VS] [V2] [VE], where [V1] and [V2] represent the lists of concepts associated with those visits.

PYTHONPATH=./:$PYTHONPATH spark-submit spark_apps/generate_training_data.py -i ~/Documents/omop_test/ -o ~/Documents/omop_test/cehr-bert -tc condition_occurrence procedure_occurrence drug_exposure -d 1985-01-01 --is_new_patient_representation -iv
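The sequence construction described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual Spark implementation; the ATT bucketing boundaries are assumptions made to match the stated ranges.

```python
from datetime import date

def att_token(days: int) -> str:
    # illustrative ATT bucketing; exact boundaries live in the ETL code
    if days < 28:
        return f"W{days // 7}"
    if days < 365:
        return f"M{min(max(days // 30, 1), 11)}"
    return "LT"

def build_sequence(visits):
    """visits: list of (visit_date, [concept codes]), sorted chronologically.
    Wraps each visit in VS/VE and inserts an ATT between neighboring visits."""
    tokens, prev_date = [], None
    for visit_date, concepts in visits:
        if prev_date is not None:
            tokens.append(att_token((visit_date - prev_date).days))
        tokens += ["VS", *concepts, "VE"]
        prev_date = visit_date
    return tokens

# two visits 40 days apart -> [VS] [V1] [VE] [M1] [VS] [V2] [VE]
seq = build_sequence([
    (date(2020, 1, 1), ["c1", "c2"]),
    (date(2020, 2, 10), ["c3"]),
])
```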

3. Generate the HF readmission prediction task

If you don't have your own OMOP instance, we have provided a sample of patient sequence data generated using Synthea at sample/hf_readmission in the repo.

PYTHONPATH=./:$PYTHONPATH spark-submit spark_apps/prediction_cohorts/hf_readmission.py -c hf_readmission -i ~/Documents/omop_test/ -o ~/Documents/omop_test/cehr-bert -dl 1985-01-01 -du 2020-12-31 -l 18 -u 100 -ow 360 -ps 0 -pw 30 --is_new_patient_representation
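The -pw 30 flag suggests a 30-day prediction window after the index event. A simplified, hypothetical sketch of how such a readmission label could be derived is below; the actual cohort logic lives in hf_readmission.py, so the function name and rule here are illustrative only.

```python
from datetime import date

def readmission_label(discharge_date, later_admissions, prediction_window_days=30):
    """Label 1 if any later admission starts within the prediction window
    after discharge, else 0. Hypothetical sketch mirroring the -pw flag."""
    for admission_date in later_admissions:
        gap = (admission_date - discharge_date).days
        if 0 < gap <= prediction_window_days:
            return 1
    return 0
```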

Contact us

If you have any questions, feel free to contact us at CEHR-BERT@lists.cumc.columbia.edu

Citation

Please cite the following work in papers that use this tool:

Chao Pang, Xinzhuo Jiang, Krishna S. Kalluri, Matthew Spotnitz, RuiJun Chen, Adler Perotte, and Karthik Natarajan. "Cehr-bert: Incorporating temporal information from structured ehr data to improve prediction tasks." In Proceedings of Machine Learning for Health, volume 158 of Proceedings of Machine Learning Research, pages 239–260. PMLR, 04 Dec 2021.

Download files


Source Distribution

cehrbert_data-0.1.2.tar.gz (425.3 kB)

Uploaded Source

Built Distribution


cehrbert_data-0.1.2-py3-none-any.whl (95.3 kB)

Uploaded Python 3

File details

Details for the file cehrbert_data-0.1.2.tar.gz.

File metadata

  • Download URL: cehrbert_data-0.1.2.tar.gz
  • Upload date:
  • Size: 425.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for cehrbert_data-0.1.2.tar.gz
Algorithm Hash digest
SHA256 1b6b0b582df35a05e18d40b90d961602b25e398ddcccf58af40b9265a49985da
MD5 27e0dfcdf513ca2ab08f7e61a830fea1
BLAKE2b-256 e30f8ebf3a7e6371eba8b9c56fa337dfc78420572402af0a461a915a29efeb05


Provenance

The following attestation bundles were made for cehrbert_data-0.1.2.tar.gz:

Publisher: python-build.yml on knatarajan-lab/cehrbert_data

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file cehrbert_data-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: cehrbert_data-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 95.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for cehrbert_data-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 d493bc03e7cecda715411b4986fb24a84e9aa493605800692503d2eb85f3b5fe
MD5 72322d4cd0b458e165a8c071ad50b719
BLAKE2b-256 c884fbac735344e680b9f4210dadbd7e8c27dd13731b6463ab234f865c6fe74f


Provenance

The following attestation bundles were made for cehrbert_data-0.1.2-py3-none-any.whl:

Publisher: python-build.yml on knatarajan-lab/cehrbert_data

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
