This package loads a data dictionary and maps its fields to UMLS concept unique identifiers (CUIs) using the UMLS API or MetaMap API from NLM, or a semantic search pipeline backed by a Pinecone vector database.
data-dictionary-cui-mapping
This package assists with mapping a user's data dictionary fields to UMLS concepts. It is designed to be modular and flexible to allow for different configurations and use cases.
Roughly, the high-level steps are as follows:
- Configure yaml files
- Load in data dictionary
- Preprocess desired columns
- Query for UMLS concepts using any or all of the following pipeline modules:
  - umls (UMLS API)
  - metamap (MetaMap API)
  - semantic_search (relies on access to a custom Pinecone vector database)
  - hydra_search (combines any combination of the above three modules)
- Manually curate/select concepts in Excel
- Create data dictionary file with new UMLS concept fields
Prerequisites
- For the UMLS API and MetaMap API, you will need an account with the UMLS API and/or MetaMap API. You can sign up here: https://www.nlm.nih.gov/research/umls/index.html
- For Semantic Search with Pinecone, you will need to have an account with Pinecone. You can sign up for an account here: https://www.pinecone.io/. Please reach out to me if you would like temporary access to my Pinecone index to explore these embeddings.
Installation
Use the package manager pip to install data-dictionary-cui-mapping from PyPI, or install it directly from the GitHub repo. The project uses poetry for packaging and dependency management.
pip install data-dictionary-cui-mapping
#pip install git+https://github.com/kevon217/data-dictionary-cui-mapping.git
Input: Data Dictionary
Below is a sample data dictionary format (.csv) that can be used as input for this package:
| variable name | title | permissible value descriptions |
|---|---|---|
| AgeYrs | Age in years | |
| CaseContrlInd | Case control indicator | Case;Control;Unknown |
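As a minimal sketch, a file in this format can be read with Python's standard library. This assumes the column names shown above and that permissible value descriptions are separated by semicolons, as in the sample. The function name `read_data_dictionary` is hypothetical and not part of this package.

```python
import csv
import io

# Inline stand-in for a .csv file on disk (same columns as the sample above).
SAMPLE_CSV = """variable name,title,permissible value descriptions
AgeYrs,Age in years,
CaseContrlInd,Case control indicator,Case;Control;Unknown
"""


def read_data_dictionary(handle):
    """Read data dictionary rows and split the semicolon-delimited values."""
    rows = []
    for row in csv.DictReader(handle):
        raw = (row.get("permissible value descriptions") or "").strip()
        # An empty cell means the variable has no enumerated values.
        row["permissible value descriptions"] = raw.split(";") if raw else []
        rows.append(row)
    return rows


rows = read_data_dictionary(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["variable name"], row["permissible value descriptions"])
```

For a real file, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` wrapper.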
Configuration Files
In order to run and customize these pipelines, you will need to create/edit yaml configuration files located in configs. Run configurations are saved and can be reloaded.
├───ddcuimap
│ ├───configs
│ │ │ config.yaml
│ │ │ __init__.py
│ │ │
│ │ ├───apis
│ │ │ __init__.py
│ │ │ config_metamap_api.yaml
│ │ │ config_pinecone_api.yaml
│ │ │ config_umls_api.yaml
│ │ │
│ │ ├───custom
│ │ │ de.yaml
│ │ │ hydra_base.yaml
│ │ │ pvd.yaml
│ │ │ title_def.yaml
│ │ │
│ │ ├───semantic_search
│ │ │ embeddings.yaml
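The configuration code later in this README passes override strings of the form `group=option`, such as `custom=de`. Assuming `helper.compose_config` follows Hydra's config-group convention (an assumption; this README does not state it), each override would select one yaml file in the tree above. The sketch below only illustrates that mapping; `resolve_override` is a hypothetical helper, not part of the package.

```python
from pathlib import PurePosixPath

# Root of the config tree shown above.
CONFIG_ROOT = PurePosixPath("ddcuimap/configs")


def resolve_override(override, root=CONFIG_ROOT):
    """Map a 'group=option' override to the yaml file it would select."""
    group, sep, option = override.partition("=")
    if not sep or not group or not option:
        raise ValueError(f"expected 'group=option', got {override!r}")
    return root / group / f"{option}.yaml"


for item in ["custom=title_def", "semantic_search=embeddings", "apis=config_pinecone_api"]:
    print(item, "->", resolve_override(item))
```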
CUI Batch Query Pipelines
STEP-1A: RUN BATCH QUERY PIPELINE
IMPORT PACKAGES
# from ddcuimap.umls import batch_query_pipeline as umls_bqp
# from ddcuimap.metamap import batch_query_pipeline as mm_bqp
# from ddcuimap.semantic_search import batch_hybrid_query_pipeline as ss_bqp
from ddcuimap.hydra_search import batch_hydra_query_pipeline as hs_bqp
from ddcuimap.utils import helper
from omegaconf import OmegaConf
LOAD/EDIT CONFIGURATION FILES
cfg_hydra = helper.compose_config(overrides=["custom=hydra_base"])
# cfg_umls = helper.compose_config(overrides=["custom=de", "apis=config_umls_api"])
cfg_mm = helper.compose_config(overrides=["custom=de", "apis=config_metamap_api"])
cfg_ss = helper.compose_config(
    overrides=[
        "custom=title_def",
        "semantic_search=embeddings",
        "apis=config_pinecone_api",
    ]
)
# # UMLS API CREDENTIALS
# cfg_umls.apis.umls.user_info.apiKey = ''
# cfg_umls.apis.umls.user_info.email = ''
# # MetaMap API CREDENTIALS
# cfg_mm.apis.metamap.user_info.apiKey = ''
# cfg_mm.apis.metamap.user_info.email = ''
#
# # Pinecone API CREDENTIALS
# cfg_ss.apis.pinecone.index_info.apiKey = ''
# cfg_ss.apis.pinecone.index_info.environment = ''
print(OmegaConf.to_yaml(cfg_hydra))
RUN BATCH QUERY PIPELINE
# df_umls, cfg_umls = umls_bqp.run_umls_batch(cfg_umls)
# df_mm, cfg_mm = mm_bqp.run_mm_batch(cfg_mm)
# df_ss, cfg_ss = ss_bqp.run_hybrid_ss_batch(cfg_ss)
df_hydra, cfg_step1 = hs_bqp.run_hydra_batch(cfg_hydra, cfg_umls=None, cfg_mm=cfg_mm, cfg_ss=cfg_ss)
print(df_hydra.head())
STEP-1B: *MANUAL CURATION STEP IN EXCEL
CURATION/SELECTION
*see curation example in notebooks/examples_files/DE_Step-1_curation_keepCol.xlsx
STEP-2A: CREATE DATA DICTIONARY IMPORT FILE
IMPORT CURATION MODULES
from ddcuimap.curation import create_dictionary_import_file
from ddcuimap.curation import check_cuis
from ddcuimap.utils import helper
CREATE DATA DICTIONARY IMPORT FILE
cfg_step1 = helper.load_config(helper.choose_file("Load config file from Step 1"))
df_dd = create_dictionary_import_file.create_dd_file(cfg_step1)
print(df_dd.head())
STEP-2B: CHECK CUIS IN DATA DICTIONARY IMPORT FILE
CHECK CUIS
cfg_step2 = helper.load_config(helper.choose_file("Load config file from Step 2"))
df_check = check_cuis.check_cuis(cfg_step2)
print(df_check.head())
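UMLS CUIs have a fixed shape: the letter C followed by seven digits (for example, C0001779). The sketch below performs only that format check on a semicolon-delimited cell. It is an illustration, not a description of what `check_cuis.check_cuis` does internally, which this README does not specify. The name `find_malformed_cuis` is hypothetical.

```python
import re

# A CUI is the letter "C" followed by exactly seven digits.
CUI_PATTERN = re.compile(r"C\d{7}")


def find_malformed_cuis(cell):
    """Return entries in a semicolon-delimited cell that are not C + 7 digits."""
    entries = [part.strip() for part in cell.split(";") if part.strip()]
    return [entry for entry in entries if not CUI_PATTERN.fullmatch(entry)]


print(find_malformed_cuis("C1510829;C0001779"))
print(find_malformed_cuis("C0007328; c0439673;C123"))
```

A check like this can catch typos introduced during manual curation in Excel, but it cannot tell whether a well-formed CUI actually exists in the Metathesaurus.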
Output: Data Dictionary + CUIs
Below is a sample modified data dictionary with curated CUIs, produced by:
- Running Steps 1-2 on title and taking the generated output dictionary file, then
- Running Steps 1-2 again on permissible value descriptions to get the final output dictionary file.
| variable name | title | data element concept identifiers | data element concept names | data element terminology sources | permissible values | permissible value descriptions | permissible value output codes | permissible value concept identifiers | permissible value concept names | permissible value terminology sources |
|---|---|---|---|---|---|---|---|---|---|---|
| AgeYrs | Age in years | C1510829;C0001779 | Age-Years;Age | UMLS;UMLS | | | | | | |
| CaseContrlInd | Case control indicator | C0007328 | Case-Control Studies | UMLS | Case;Control;Unknown | Case;Control;Unknown | 1;2;999 | C1706256;C4553389;C0439673 | Clinical Study Case;Study Control;Unknown | UMLS;UMLS;UMLS |
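In the sample row, the semicolon-delimited permissible value columns appear to line up by position: the first description goes with the first CUI and the first concept name, and so on. Assuming that alignment holds (it is inferred from the sample, not documented here), the cells can be paired as in this sketch. `pair_permissible_values` is a hypothetical helper.

```python
def pair_permissible_values(descriptions, cuis, names):
    """Zip three semicolon-delimited cells into per-value records."""
    split = [cell.split(";") for cell in (descriptions, cuis, names)]
    # Refuse to pair cells whose entry counts differ.
    if len({len(parts) for parts in split}) != 1:
        raise ValueError("cells do not contain the same number of entries")
    return [
        {"description": d, "cui": c, "name": n}
        for d, c, n in zip(*split)
    ]


records = pair_permissible_values(
    "Case;Control;Unknown",
    "C1706256;C4553389;C0439673",
    "Clinical Study Case;Study Control;Unknown",
)
for record in records:
    print(record)
```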
Semantic Search with SentenceTransformers Batch Queries
More documentation to come. The basic pipeline is described below:
Subset/Embed/Upsert UMLS Metathesaurus for Pinecone vector database
Step 1: Subset local copy of UMLS Metathesaurus
Step 2: Embed UMLS CUI names and definitions and format metadata
Step 3: Upsert embeddings and metadata into Pinecone index
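Steps 2 and 3 can be sketched as shaping each concept into a record with `id`, `values`, and `metadata` keys (the dictionary form Pinecone's upsert call accepts) and grouping the records into batches. The vectors below are placeholder numbers rather than SentenceTransformers output, no Pinecone client is called, and the batch size and metadata fields are assumptions chosen for illustration.

```python
from itertools import islice


def to_records(concepts):
    """Convert (cui, name, vector) tuples into upsert-style dicts."""
    return [
        {"id": cui, "values": vector, "metadata": {"name": name}}
        for cui, name, vector in concepts
    ]


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


# Toy 3-dimensional vectors standing in for real embeddings.
concepts = [
    ("C0001779", "Age", [0.1, 0.3, 0.5]),
    ("C0007328", "Case-Control Studies", [0.7, 0.2, 0.1]),
    ("C0439673", "Unknown", [0.0, 0.9, 0.4]),
]
batches = list(batched(to_records(concepts), size=2))
for batch in batches:
    print([record["id"] for record in batch])
```

Each batch would then be passed to the index's upsert call in turn.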
Query UMLS Metathesaurus vector database with data dictionary embeddings
Step 1: Embed data dictionary fields
Step 2: Batch Query data dictionary against CUI names and definitions in Pinecone index
Step 3: Evaluate/Curate Results
Step 4: Create data dictionary based on curation
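Conceptually, the batch query in Step 2 is a nearest-neighbour search: each embedded data dictionary field is compared against stored CUI embeddings, and the closest matches are kept for curation. The sketch below shows this with cosine similarity over toy 3-dimensional vectors in plain Python. The actual pipeline relies on SentenceTransformers embeddings and a Pinecone index, and the similarity metric of that index is not stated in this README. `top_k_cuis` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_cuis(query, index, k=2):
    """Return the k (cui, score) pairs most similar to the query vector."""
    scored = [(cui, cosine(query, vector)) for cui, vector in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy stand-in for the vector database: CUI -> embedding.
index = {
    "C0001779": [0.9, 0.1, 0.0],
    "C0007328": [0.1, 0.9, 0.1],
    "C0439673": [0.0, 0.2, 0.9],
}
matches = top_k_cuis([0.8, 0.2, 0.1], index, k=2)
for cui, score in matches:
    print(cui, round(score, 3))
```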
Acknowledgements
The MetaMap API code included is from Will J Roger's repository: https://github.com/lhncbc/skr_web_python_api
Special thanks to Olga Vovk, Henry Ogoe, and Sofia Syed for their guidance, feedback, and testing of this package.
License