This package lets you load a data dictionary and map CUIs to defined fields using the UMLS API or MetaMap API from NLM, or a semantic search pipeline backed by a Pinecone vector database.
data-dictionary-cui-mapping
This package lets you load a data dictionary and semi-automatically query appropriate UMLS concepts using the UMLS API, the MetaMap API, and/or semantic search through a custom Pinecone vector database.
Prerequisites
- For UMLS API and MetaMap API, you will need to have an account with the UMLS API and/or MetaMap API. You can sign up for an account here: https://www.nlm.nih.gov/research/umls/index.html
- For Semantic Search with Pinecone, you will need to have an account with Pinecone. You can sign up for an account here: https://www.pinecone.io/. Please reach out to me if you would like temporary access to my Pinecone index to explore these embeddings.
Installation
Install data-dictionary-cui-mapping from PyPI with the package manager pip, or pip install directly from the GitHub repo.
pip install data-dictionary-cui-mapping
#pip install git+https://github.com/kevon217/data-dictionary-cui-mapping.git
Input: Data Dictionary
Below is a sample data dictionary format that can be used as input for this package.
variable name | title | permissible value descriptions |
---|---|---|
AgeYrs | Age in years | |
CaseContrlInd | Case control indicator | Case;Control;Unknown |
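As a minimal sketch of reading this format, the snippet below assumes the data dictionary is saved as CSV with the three columns above and that permissible value descriptions are semicolon-delimited, as in the sample. The `load_data_dictionary` helper is hypothetical and is not part of ddcuimap.

```python
import csv
import io

# Sample rows mirroring the table above, stored as CSV text for illustration.
SAMPLE_CSV = """variable name,title,permissible value descriptions
AgeYrs,Age in years,
CaseContrlInd,Case control indicator,Case;Control;Unknown
"""


def load_data_dictionary(handle):
    """Read data dictionary rows and split permissible value descriptions on ';'.

    Hypothetical helper for illustration; not part of ddcuimap.
    """
    rows = []
    for row in csv.DictReader(handle):
        raw = (row.get("permissible value descriptions") or "").strip()
        # An empty cell becomes an empty list rather than [""].
        row["permissible value descriptions"] = [
            part.strip() for part in raw.split(";") if part.strip()
        ]
        rows.append(row)
    return rows


data_dictionary = load_data_dictionary(io.StringIO(SAMPLE_CSV))
for entry in data_dictionary:
    print(entry["variable name"], entry["permissible value descriptions"])
```

Splitting the delimited cell up front makes it easier to query each permissible value description separately.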
Configuration Files
To run and customize these pipelines, you will need to create or edit the YAML configuration files located in configs. Run configurations are saved and can be reloaded.
├───ddcuimap
│ ├───configs
│ │ │ config.yaml
│ │ │ __init__.py
│ │ │
│ │ ├───apis
│ │ │ __init__.py
│ │ │ config_metamap_api.yaml
│ │ │ config_pinecone_api.yaml
│ │ │ config_umls_api.yaml
│ │ │
│ │ ├───custom
│ │ │ de.yaml
│ │ │ hydra_base.yaml
│ │ │ pvd.yaml
│ │ │ title_def.yaml
│ │ │
│ │ ├───semantic_search
│ │ │ embeddings.yaml
UMLS API and MetaMap Batch Queries
Import modules
# import batch_query_pipeline modules from metamap OR umls package
from ddcuimap.metamap import batch_query_pipeline as mm_bqp
from ddcuimap.umls import batch_query_pipeline as umls_bqp
# import helper functions for loading, viewing, composing configurations for pipeline run
from ddcuimap.utils import helper
from omegaconf import OmegaConf
# import modules to create data dictionary with curated CUIs and check the file for missing mappings
from ddcuimap.curation import create_dictionary_import_file
from ddcuimap.curation import check_cuis
Load/edit configuration files
cfg = helper.compose_config.fn(overrides=["custom=de", "apis=config_metamap_api"]) # custom config for MetaMap on data element 'title' column
# cfg = helper.compose_config.fn(overrides=["custom=de", "apis=config_umls_api"]) # custom config for UMLS API on data element 'title' column
# cfg = helper.compose_config.fn(overrides=["custom=pvd", "apis=config_metamap_api"]) # custom config for MetaMap on 'permissible value descriptions' column
# cfg = helper.compose_config.fn(overrides=["custom=pvd", "apis=config_umls_api"]) # custom config for UMLS API on 'permissible value descriptions' column
cfg.apis.user_info.email = '' # enter your email
cfg.apis.user_info.apiKey = '' # enter your api key
print(OmegaConf.to_yaml(cfg))
Step 1: Run batch query pipeline
df_final_mm = mm_bqp.run_mm_batch(cfg) # run MetaMap batch query pipeline
# df_final_umls = umls_bqp.run_umls_batch(cfg) # run UMLS API batch query pipeline
Step 2: *Manual curation step in Excel file
*See the curation example in notebooks/examples_files/DE_Step-1_curation_keepCol.xlsx
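To illustrate what the curation step hands to the next one, here is a sketch that assumes the curated candidates have been read into rows with a hypothetical `keep` column marking the CUIs a curator accepted. The example file name suggests such a column, but the actual column names, the placeholder `CXXXXXXX` candidate, and the `collapse_curated` helper are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical curated candidates: one row per (variable, candidate CUI).
# A non-empty "keep" cell means the curator accepted the candidate.
candidates = [
    {"variable name": "AgeYrs", "cui": "C1510829", "name": "Age-Years", "keep": "1"},
    {"variable name": "AgeYrs", "cui": "C0001779", "name": "Age", "keep": "1"},
    {"variable name": "AgeYrs", "cui": "CXXXXXXX", "name": "(rejected candidate)", "keep": ""},
    {"variable name": "CaseContrlInd", "cui": "C0007328", "name": "Case-Control Studies", "keep": "1"},
]


def collapse_curated(rows):
    """Keep curator-accepted rows and join their CUIs and names per variable with ';'."""
    kept = defaultdict(lambda: {"cuis": [], "names": []})
    for row in rows:
        if row["keep"].strip():
            kept[row["variable name"]]["cuis"].append(row["cui"])
            kept[row["variable name"]]["names"].append(row["name"])
    return {
        variable: {"cuis": ";".join(v["cuis"]), "names": ";".join(v["names"])}
        for variable, v in kept.items()
    }


curated = collapse_curated(candidates)
print(curated)
```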
Step 3: Create data dictionary import file
cfg = helper.load_config.fn(helper.choose_file.fn("Load config file from Step 1"))
create_dictionary_import_file.create_dd_file(cfg)
Step 4: Check curated CUI mappings
cfg = helper.load_config.fn(helper.choose_file.fn("Load config file from Step 2"))
check_cuis.check_cuis(cfg)
Output: Data Dictionary + CUIs
Below is the final output of the data dictionary with curated CUIs.
variable name | title | data element concept identifiers | data element concept names | data element terminology sources | permissible values | permissible value descriptions | permissible value output codes | permissible value concept identifiers | permissible value concept names | permissible value terminology sources |
---|---|---|---|---|---|---|---|---|---|---|
AgeYrs | Age in years | C1510829;C0001779 | Age-Years;Age | UMLS;UMLS | | | | | | |
CaseContrlInd | Case control indicator | C0007328 | Case-Control Studies | UMLS | Case;Control;Unknown | Case;Control;Unknown | 1;2;999 | C1706256;C4553389;C0439673 | Clinical Study Case;Study Control;Unknown | UMLS;UMLS;UMLS |
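The semicolon-delimited cells in this output appear to be positional, with each permissible value lining up with one concept identifier. The sketch below shows one way to sanity-check a row under that assumption. It is not the logic used by `check_cuis`; the `find_problems` helper is hypothetical, and the one-CUI-per-permissible-value rule is an assumption.

```python
import re

# UMLS CUIs are the letter 'C' followed by seven digits.
CUI_PATTERN = re.compile(r"^C\d{7}$")


def split_cell(cell):
    """Split a semicolon-delimited cell, dropping empty parts."""
    return [part.strip() for part in cell.split(";") if part.strip()]


def find_problems(row):
    """Return a list of human-readable problems found in one output row."""
    problems = []
    de_cuis = split_cell(row["data element concept identifiers"])
    if not de_cuis:
        problems.append("missing data element CUI")
    values = split_cell(row["permissible values"])
    pv_cuis = split_cell(row["permissible value concept identifiers"])
    # Assumption: exactly one CUI per permissible value.
    if len(values) != len(pv_cuis):
        problems.append(
            f"{len(values)} permissible values but {len(pv_cuis)} CUIs"
        )
    for cui in de_cuis + pv_cuis:
        if not CUI_PATTERN.match(cui):
            problems.append(f"malformed CUI: {cui}")
    return problems


complete_row = {
    "data element concept identifiers": "C0007328",
    "permissible values": "Case;Control;Unknown",
    "permissible value concept identifiers": "C1706256;C4553389;C0439673",
}
incomplete_row = {
    "data element concept identifiers": "",
    "permissible values": "Case;Control;Unknown",
    "permissible value concept identifiers": "C1706256;C4553389",
}
print(find_problems(complete_row))
print(find_problems(incomplete_row))
```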
Semantic Search with SentenceTransformers Batch Queries
More documentation to come... Basic pipeline is described below:
Subset/Embed/Upsert UMLS Metathesaurus for Pinecone vector database
Step 1: Subset local copy of UMLS Metathesaurus
Step 2: Embed UMLS CUI names and definitions and format metadata
Step 3: Upsert embeddings and metadata into Pinecone index
Query UMLS Metathesaurus vector database with data dictionary embeddings
Step 1: Embed data dictionary fields
Step 2: Batch Query data dictionary against CUI names and definitions in Pinecone index
Step 3: Evaluate/Curate Results
Step 4: Create data dictionary based on curation
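The query half of this pipeline boils down to a nearest-neighbor search over embeddings. The sketch below shows the idea with made-up three-dimensional vectors standing in for SentenceTransformers embeddings and a plain dictionary standing in for the Pinecone index. It makes no Pinecone or SentenceTransformers calls, and the vectors, `toy_index`, and `query_top_k` are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy stand-in for a vector index: CUI -> metadata plus a made-up embedding.
toy_index = {
    "C0001779": {"name": "Age", "vector": [0.9, 0.1, 0.0]},
    "C0007328": {"name": "Case-Control Studies", "vector": [0.0, 0.2, 0.9]},
    "C0439673": {"name": "Unknown", "vector": [0.1, 0.9, 0.1]},
}


def query_top_k(query_vector, index, k=2):
    """Return the k most similar entries as (score, cui, name), best first."""
    scored = [
        (cosine_similarity(query_vector, entry["vector"]), cui, entry["name"])
        for cui, entry in index.items()
    ]
    scored.sort(reverse=True)
    return scored[:k]


# Made-up embedding for the data dictionary title "Age in years".
query = [0.8, 0.2, 0.1]
results = query_top_k(query, toy_index, k=2)
for score, cui, name in results:
    print(f"{cui}\t{name}\t{score:.3f}")
```

The ranked candidates returned for each field are what a curator would then review in the evaluate/curate step.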
Acknowledgements
The MetaMap API code included is from Will J Roger's repository: https://github.com/lhncbc/skr_web_python_api
Special thanks to Olga Vovk, Henry Ogoe, and Sofia Syed for their guidance, feedback, and testing of this package.
License