This package allows you to load in a data dictionary and map CUIs to defined fields using the UMLS API, the MetaMap API from NLM, or a semantic search pipeline built on a Pinecone vector database.
data-dictionary-cui-mapping
This package assists with mapping a user's data dictionary fields to UMLS concepts. It is designed to be modular and flexible to allow for different configurations and use cases.
Roughly, the high-level steps are as follows:
- Configure yaml files
- Load in data dictionary
- Preprocess desired columns
- Query for UMLS concepts using any or all of the following pipeline modules:
- umls (UMLS API)
- metamap (MetaMap API)
- semantic_search (relies on access to a custom Pinecone vector database)
- hydra_search (runs any combination of the above three modules)
- Manually curate/select concepts in Excel
- Create data dictionary file with new UMLS concept fields
Prerequisites
- For the UMLS API and MetaMap API, you will need an NLM UMLS account. You can sign up for one here: https://www.nlm.nih.gov/research/umls/index.html
- For Semantic Search with Pinecone, you will need a Pinecone account. You can sign up for one here: https://www.pinecone.io/. Please reach out to me if you would like temporary access to my Pinecone index to explore these embeddings.
Installation
Use the package manager pip to install data-dictionary-cui-mapping from PyPI, or pip install it directly from the GitHub repo. The project uses poetry for packaging and dependency management.
pip install data-dictionary-cui-mapping
#pip install git+https://github.com/kevon217/data-dictionary-cui-mapping.git
Input: Data Dictionary
Below is a sample data dictionary format (.csv) that can be used as input for this package:
variable name | title | permissible value descriptions |
---|---|---|
AgeYrs | Age in years | |
CaseContrlInd | Case control indicator | Case;Control;Unknown |
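For example, assuming the file is saved as data_dictionary.csv (a hypothetical name), it can be loaded and inspected with pandas before configuring the pipelines:

import pandas as pd

# Hypothetical input filename; use the path to your own data dictionary
df_input = pd.read_csv("data_dictionary.csv")

# Columns such as 'title' and 'permissible value descriptions' are the
# fields typically targeted for preprocessing and CUI queries
print(df_input.columns.tolist())
print(df_input.head())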
Configuration Files
To run and customize these pipelines, you will need to create/edit the yaml configuration files located in ddcuimap/configs. Run configurations are saved and can be reloaded in later steps.
├───ddcuimap
│ ├───configs
│ │ │ config.yaml
│ │ │ __init__.py
│ │ │
│ │ ├───apis
│ │ │ __init__.py
│ │ │ config_metamap_api.yaml
│ │ │ config_pinecone_api.yaml
│ │ │ config_umls_api.yaml
│ │ │
│ │ ├───custom
│ │ │ de.yaml
│ │ │ hydra_base.yaml
│ │ │ pvd.yaml
│ │ │ title_def.yaml
│ │ │
│ │ ├───semantic_search
│ │ │ embeddings.yaml
CUI Batch Query Pipelines
STEP-1A: RUN BATCH QUERY PIPELINE
IMPORT PACKAGES
# from ddcuimap.umls import batch_query_pipeline as umls_bqp
# from ddcuimap.metamap import batch_query_pipeline as mm_bqp
# from ddcuimap.semantic_search import batch_hybrid_query_pipeline as ss_bqp
from ddcuimap.hydra_search import batch_hydra_query_pipeline as hs_bqp
from ddcuimap.utils import helper
from omegaconf import OmegaConf
LOAD/EDIT CONFIGURATION FILES
cfg_hydra = helper.compose_config(overrides=["custom=hydra_base"])
# cfg_umls = helper.compose_config(overrides=["custom=de", "apis=config_umls_api"])
cfg_mm = helper.compose_config(overrides=["custom=de", "apis=config_metamap_api"])
cfg_ss = helper.compose_config(
overrides=[
"custom=title_def",
"semantic_search=embeddings",
"apis=config_pinecone_api",
]
)
# # UMLS API CREDENTIALS
# cfg_umls.apis.umls.user_info.apiKey = ''
# cfg_umls.apis.umls.user_info.email = ''
# # MetaMap API CREDENTIALS
# cfg_mm.apis.metamap.user_info.apiKey = ''
# cfg_mm.apis.metamap.user_info.email = ''
#
# # Pinecone API CREDENTIALS
# cfg_ss.apis.pinecone.index_info.apiKey = ''
# cfg_ss.apis.pinecone.index_info.environment = ''
print(OmegaConf.to_yaml(cfg_hydra))
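Rather than hard-coding credentials as in the commented lines above, one option is to read them from environment variables; a minimal sketch (the variable names are illustrative):

import os

# Illustrative environment variable names; use whatever your setup provides
cfg_mm.apis.metamap.user_info.apiKey = os.environ.get("METAMAP_API_KEY", "")
cfg_mm.apis.metamap.user_info.email = os.environ.get("METAMAP_EMAIL", "")
cfg_ss.apis.pinecone.index_info.apiKey = os.environ.get("PINECONE_API_KEY", "")
cfg_ss.apis.pinecone.index_info.environment = os.environ.get("PINECONE_ENVIRONMENT", "")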
RUN BATCH QUERY PIPELINE
# df_umls, cfg_umls = umls_bqp.run_umls_batch(cfg_umls)
# df_mm, cfg_mm = mm_bqp.run_mm_batch(cfg_mm)
# df_ss, cfg_ss = ss_bqp.run_hybrid_ss_batch(cfg_ss)
df_hydra, cfg_step1 = hs_bqp.run_hydra_batch(cfg_hydra, cfg_umls=None, cfg_mm=cfg_mm, cfg_ss=cfg_ss)
print(df_hydra.head())
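The returned df_hydra is a regular pandas DataFrame, so it can also be exported for a quick look before the manual curation step; a minimal sketch (the filename is hypothetical):

# Export a copy of the combined query results for ad-hoc review
df_hydra.to_csv("hydra_query_results_preview.csv", index=False)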
STEP-1B: *MANUAL CURATION STEP IN EXCEL
CURATION/SELECTION
*see curation example in notebooks/examples_files/DE_Step-1_curation_keepCol.xlsx
STEP-2A: CREATE DATA DICTIONARY IMPORT FILE
IMPORT CURATION MODULES
from ddcuimap.curation import create_dictionary_import_file
from ddcuimap.curation import check_cuis
from ddcuimap.utils import helper
CREATE DATA DICTIONARY IMPORT FILE
cfg_step1 = helper.load_config(helper.choose_file("Load config file from Step 1"))
df_dd = create_dictionary_import_file.create_dd_file(cfg_step1)
print(df_dd.head())
STEP-2B: CHECK CUIS IN DATA DICTIONARY IMPORT FILE
CHECK CUIS
cfg_step2 = helper.load_config(helper.choose_file("Load config file from Step 2"))
df_check = check_cuis.check_cuis(cfg_step2)
print(df_check.head())
Output: Data Dictionary + CUIs
Below is a sample modified data dictionary with curated CUIs after:
- Running Steps 1-2 on the title column and taking the generated output dictionary file, then
- Running Steps 1-2 again on the permissible value descriptions column to get the final output dictionary file.
variable name | title | data element concept identifiers | data element concept names | data element terminology sources | permissible values | permissible value descriptions | permissible value output codes | permissible value concept identifiers | permissible value concept names | permissible value terminology sources |
---|---|---|---|---|---|---|---|---|---|---|
AgeYrs | Age in years | C1510829;C0001779 | Age-Years;Age | UMLS;UMLS | ||||||
CaseContrlInd | Case control indicator | C0007328 | Case-Control Studies | UMLS | Case;Control;Unknown | Case;Control;Unknown | 1;2;999 | C1706256;C4553389;C0439673 | Clinical Study Case;Study Control;Unknown | UMLS;UMLS;UMLS |
Semantic Search with SentenceTransformers Batch Queries
More documentation to come. The basic pipeline is described below, with simplified, illustrative sketches after each stage:
Subset/Embed/Upsert UMLS Metathesaurus for Pinecone vector database
Step 1: Subset local copy of UMLS Metathesaurus
Step 2: Embed UMLS CUI names and definitions and format metadata
Step 3: Upsert embeddings and metadata into Pinecone index
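As a simplified, dense-only illustration of steps 2-3 (not the package's internal code; the index name, model choice, and metadata fields are hypothetical), embedding and upserting might look like this with sentence-transformers and the pinecone client:

import pinecone
from sentence_transformers import SentenceTransformer

# Hypothetical credentials and index name; see configs/apis/config_pinecone_api.yaml
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("umls-subset")

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

# Example records: (CUI, text to embed, metadata)
records = [
    ("C0001779", "Age", {"name": "Age", "source": "UMLS"}),
    ("C0007328", "Case-Control Studies", {"name": "Case-Control Studies", "source": "UMLS"}),
]
vectors = [(cui, model.encode(text).tolist(), meta) for cui, text, meta in records]
index.upsert(vectors=vectors)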
Query UMLS Metathesaurus vector database with data dictionary embeddings
Step 1: Embed data dictionary fields
Step 2: Batch Query data dictionary against CUI names and definitions in Pinecone index
Step 3: Evaluate/Curate Results
Step 4: Create data dictionary based on curation
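And a correspondingly simplified sketch of the query side (the actual pipeline runs a hybrid batch query and also handles evaluation and curation output):

import pinecone
from sentence_transformers import SentenceTransformer

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("umls-subset")            # hypothetical index name
model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

# Embed a data dictionary field and retrieve candidate CUIs
query_vector = model.encode("Age in years").tolist()
results = index.query(vector=query_vector, top_k=5, include_metadata=True)
for match in results["matches"]:
    print(match["id"], round(match["score"], 3), match["metadata"].get("name"))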
Acknowledgements
The MetaMap API code included is from Will J Rogers' repository: https://github.com/lhncbc/skr_web_python_api
Special thanks to Olga Vovk, Henry Ogoe, and Sofia Syed for their guidance, feedback, and testing of this package.
License