Semantic retrieval for auditing and expanding ICD-based phenotypes in EHR biobanks.

Phecoder: semantic retrieval for auditing and expanding ICD-based phenotypes in EHR biobanks

Overview

Phecoder maps clinical phenotypes (Phecodes) to diagnosis (ICD) codes using pretrained text embedding models. It evaluates multiple embedding models and ensemble methods to find the most relevant diagnosis codes for each phenotype.
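To make the idea of embedding-based retrieval concrete, here is a minimal sketch of ranking ICD codes by cosine similarity to a phenotype vector. It is an illustration only, not Phecoder's internal code: the helper names `cosine` and `rank_codes` are hypothetical, and the three-dimensional vectors are made up (real embedding models produce vectors with hundreds of dimensions).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_codes(phenotype_vec, icd_vecs, top_k=3):
    """Return the top_k (icd_code, similarity) pairs, most similar first."""
    scored = [(code, cosine(phenotype_vec, vec)) for code, vec in icd_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy "embeddings", invented for illustration
icd_vecs = {
    "F32.9": [0.9, 0.1, 0.0],
    "I10": [0.0, 0.2, 0.9],
    "F41.1": [0.6, 0.6, 0.1],
}
phenotype_vec = [1.0, 0.2, 0.0]  # stands in for the embedding of "Depression"

ranking = rank_codes(phenotype_vec, icd_vecs)
for code, score in ranking:
    print(f"{code}\t{score:.3f}")
```

In practice the vectors come from a pretrained text embedding model applied to the phenotype string and to each ICD description.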

Installing Phecoder

Note: Python >=3.10 is required.

As a user

python -m venv venv
source venv/bin/activate
pip install phecoder

PyTorch with CUDA

If you want to use a GPU with CUDA, install a PyTorch build that matches your CUDA version before installing Phecoder. See PyTorch - Get Started for installation instructions.
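A quick way to confirm that the installed PyTorch build can see a GPU is sketched below. The helper name `detect_device` is hypothetical; `torch.cuda.is_available()` is the standard PyTorch call. The function returns None when PyTorch is not installed, so it is safe to run before setting anything up.

```python
def detect_device():
    """Return "cuda" if PyTorch can use a GPU, "cpu" if not, or None if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return None
    return "cuda" if torch.cuda.is_available() else "cpu"

device = detect_device()
print(f"PyTorch device: {device}")
```

If this reports "cpu" on a machine with a GPU, the installed PyTorch build most likely does not match the local CUDA version.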

As a developer

Phecoder is developed with Poetry; see Poetry - Installation for how to install it. Then clone the repository and install the dependencies:

git clone https://github.com/DiseaseNeuroGenomics/phecoder.git
cd phecoder
poetry install

Quick Start

Workflow example with default settings

With its default settings, Phecoder uses the best-performing ensemble from our study.

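import os

# Setup: set the Hugging Face cache directory before importing Phecoder,
# because Hugging Face libraries read HF_HOME when they are first imported
os.environ["HF_HOME"] = "./hf-home"

import pandas as pd
from phecoder import Phecoder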

# Initialize
pc = Phecoder(
    phecodes=["Suicidal ideation", "Depression", "Anxiety"],
    output_dir="./results",
    icd_cache_dir="./icd_cache"
)

# Run pipeline
pc.run()
pc.build_ensemble()

# Load results into dataframe
results = pc.load_results('ensemble-zsum')
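The result name 'ensemble-zsum' is not explained in this README. Assuming it refers to z-score-sum fusion (an assumption, not confirmed here), the general technique can be sketched as follows: standardise each model's similarity scores, then sum the standardised scores per ICD code. The helpers `zscore` and `zsum` and all scores below are hypothetical.

```python
from statistics import mean, pstdev

def zscore(scores):
    """Standardise one model's {icd_code: similarity} scores to mean 0 and unit variance."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {code: 0.0 for code in scores}
    return {code: (s - mu) / sigma for code, s in scores.items()}

def zsum(per_model_scores):
    """Sum z-scores across models and return (icd_code, total) pairs, best first."""
    totals = {}
    for scores in per_model_scores:
        for code, z in zscore(scores).items():
            totals[code] = totals.get(code, 0.0) + z
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)

# Invented similarity scores from two models for the same phenotype
model_a = {"F32.9": 0.82, "F41.1": 0.55, "I10": 0.10}
model_b = {"F32.9": 0.40, "F41.1": 0.45, "I10": 0.05}

fused = zsum([model_a, model_b])
for code, total in fused:
    print(f"{code}\t{total:+.3f}")
```

Standardising first keeps a model with a wider raw score range from dominating the sum.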

A more detailed example

1. Setup and Import

import os

# Set the Hugging Face cache directory (optional but recommended).
# Do this before importing Phecoder: Hugging Face libraries read HF_HOME at import time.
os.environ["HF_HOME"] = "./hf-home"

import pandas as pd
from phecoder import Phecoder, load_icd_df

2. Define Directories

output_dir = "./results"  # Results saved here
icd_cache_dir = "./icd_cache"  # ICD embeddings cached here (optional, reusable across runs)

3. Load ICD Codes

Your ICD data must have the columns icd_code and icd_string.

icd_df = load_icd_df()  # loads the default ICD table; or load your own in the format below

Example format (essential columns):

icd_code   icd_string
E11.9      Type 2 diabetes mellitus without complications
I10        Essential (primary) hypertension
J45.909    Unspecified asthma, uncomplicated
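If you bring your own ICD table, a small check before handing it to Phecoder can catch format problems early. The sketch below is a hypothetical helper, not part of Phecoder: `prepare_icd_df` verifies that the two required columns exist, trims whitespace, and drops empty or duplicated rows. Whether Phecoder performs similar checks itself is not stated here.

```python
import pandas as pd

REQUIRED_COLUMNS = ("icd_code", "icd_string")

def prepare_icd_df(raw):
    """Validate and tidy a custom ICD table (hypothetical helper, not part of Phecoder)."""
    missing = [col for col in REQUIRED_COLUMNS if col not in raw.columns]
    if missing:
        raise ValueError(f"ICD table is missing required columns: {missing}")
    # Drop rows with missing values in the required columns, keep any extra columns
    tidy = raw.dropna(subset=list(REQUIRED_COLUMNS)).copy()
    for col in REQUIRED_COLUMNS:
        tidy[col] = tidy[col].astype(str).str.strip()
    # Remove rows with empty descriptions, then keep the first row per code
    tidy = tidy[tidy["icd_string"] != ""]
    return tidy.drop_duplicates(subset="icd_code").reset_index(drop=True)

raw = pd.DataFrame({
    "icd_code": ["E11.9", "I10", "I10", "J45.909"],
    "icd_string": [
        "Type 2 diabetes mellitus without complications",
        "Essential (primary) hypertension ",
        "Essential (primary) hypertension",
        "Unspecified asthma, uncomplicated",
    ],
})
icd_df = prepare_icd_df(raw)
print(icd_df)
```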

4. Define Phenotype(s)

# Single phenotype
phenotype = "Eating disorders"

# OR multiple phenotypes
phenotypes = ["Eating disorders", "Type 2 diabetes", "Hypertension"]

# OR DataFrame with phecode and description
phecode_df = pd.DataFrame({
    'phecode': ['250.2', '401.1'],
    'phecode_string': ['Type 2 diabetes', 'Hypertension']
})

5. Choose Models

# Light model (fast, ~80MB)
models = ["sentence-transformers/all-MiniLM-L6-v2"]

# OR clinically-trained model (better for medical text, ~440MB)
models = ["FremyCompany/BioLORD-2023"]

# OR multiple models (for ensemble)
models = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "FremyCompany/BioLORD-2023",
    "NeuML/pubmedbert-base-embeddings"
]

# OR use a preset from our evaluation
models = "preset:best_single"  # best single model
models = "preset:best_ensemble"  # best set of models for ensemble (same as default)

6. Initialize Phecoder

pc = Phecoder(
    icd_df=icd_df,
    phecodes=phenotype,                  # string, list of strings, or DataFrame with a "phecode" column
    models=models,
    output_dir=output_dir,
    icd_cache_dir=icd_cache_dir,         # Optional: cache ICD embeddings for reuse
    st_search_kwargs={"top_k": 100},     # Return the top 100 ICD codes per phenotype
)

7. Run Pipeline

# Option 1: Run directly (models auto-download if needed)
pc.run()

# Option 2: Pre-download models, then run (useful for batch jobs)
pc.download_models()  # Optional: explicitly download models first
pc.run()

# Build ensemble (combines multiple models using reciprocal rank fusion)
pc.build_ensemble(
    method="rrf",
    method_kwargs={"k": 60},
    name="ens:rrf60"
)
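For readers unfamiliar with reciprocal rank fusion, the sketch below shows the standard formulation: each code receives 1 / (k + rank) from every model that ranks it, and the contributions are summed. It illustrates the general technique with k=60 as in the call above. The function `reciprocal_rank_fusion` and the rankings are hypothetical, and Phecoder's own implementation may differ in details such as tie handling.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each code scores sum(1 / (k + rank)) over the lists it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, code in enumerate(ranking, start=1):
            scores[code] = scores.get(code, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Invented rankings from three models for "Eating disorders", best first
model_a = ["F50.9", "F50.2", "R63.0"]
model_b = ["F50.2", "F50.9", "F50.00"]
model_c = ["F50.2", "R63.0", "F50.9"]

fused = reciprocal_rank_fusion([model_a, model_b, model_c], k=60)
for code, score in fused:
    print(f"{code}\t{score:.5f}")
```

Because only ranks are used, models with differently scaled similarity scores can be combined without normalisation. A larger k flattens the difference between top and lower ranks.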

8. Load Results

# Load all results (individual models + ensemble)
results = pc.load_results()

# Load ensemble results only
ensemble_results = pc.load_results(
    models=['ens:rrf60'],
    include_ensembles=True
)
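Once results are loaded, a typical audit step is to compare the retrieved ICD codes for a phenotype against an existing curated mapping. The sketch below assumes you have already extracted the retrieved codes for one phenotype into a plain list (the column layout of the results DataFrame is not documented here, so none is assumed). The helper `compare_with_curated` and both code lists are hypothetical.

```python
def compare_with_curated(retrieved_codes, curated_codes):
    """Split codes into confirmed, candidate additions, and curated codes that were not retrieved."""
    retrieved = set(retrieved_codes)
    curated = set(curated_codes)
    return {
        "confirmed": sorted(retrieved & curated),
        "candidates_to_add": sorted(retrieved - curated),
        "not_retrieved": sorted(curated - retrieved),
    }

# Invented example for "Eating disorders"
retrieved_codes = ["F50.2", "F50.9", "F50.81"]
curated_codes = ["F50.00", "F50.2", "F50.9"]

report = compare_with_curated(retrieved_codes, curated_codes)
for label, codes in report.items():
    print(f"{label}: {codes}")
```

Candidate additions still need clinical review; semantic similarity alone does not establish that a code belongs in a phenotype definition.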


Tips

  • The first run is slower: models are downloaded and embeddings are computed
  • Subsequent runs are fast: ICD embeddings are cached and reused
  • Use icd_cache_dir to share ICD embeddings across multiple projects
  • Start with light models for testing, then switch to clinical models for production
  • Ensembles typically outperform individual models
  • For batch jobs, pre-download models with pc.download_models() to separate download time from computation
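The caching tips above rest on a simple idea: embeddings for a fixed list of ICD descriptions and a fixed model never change, so they can be computed once and read from disk afterwards. The sketch below illustrates that idea in general terms. It does not describe Phecoder's actual cache layout; `cache_key`, `cached_embeddings`, and `toy_embed` are hypothetical, and the "embedding" is a stand-in for a real model.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cache_key(model_name, texts):
    """Hash the model name and the texts so any change produces a new cache entry."""
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf-8"))
    for text in texts:
        digest.update(b"\x00")
        digest.update(text.encode("utf-8"))
    return digest.hexdigest()

def cached_embeddings(model_name, texts, embed_fn, cache_dir):
    """Return (vectors, cache_hit), computing and storing the vectors only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(model_name, texts)}.json"
    if path.exists():
        return json.loads(path.read_text()), True
    vectors = embed_fn(texts)
    path.write_text(json.dumps(vectors))
    return vectors, False

calls = []

def toy_embed(texts):
    """Stand-in for an embedding model; records each call so reuse can be observed."""
    calls.append(len(texts))
    return [[float(len(t)), float(t.count(" "))] for t in texts]

with tempfile.TemporaryDirectory() as tmp:
    texts = ["Essential (primary) hypertension", "Unspecified asthma, uncomplicated"]
    first, hit_first = cached_embeddings("toy-model", texts, toy_embed, tmp)
    second, hit_second = cached_embeddings("toy-model", texts, toy_embed, tmp)

print(f"first call cached: {hit_first}, second call cached: {hit_second}")
```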

See also

For more information on how the default ICD file was created, see ICD Data Preparation.

For best results, use the actual ICD codes and descriptions from your biobank/EHR dataset.

The semantic matching works best when it operates on the same code descriptions that exist in your data. If your EHR uses specific phrasings or truncated descriptions, provide those exact strings rather than standard reference descriptions. This ensures the ranked results directly correspond to codes available in your dataset.

Support: If you have questions, open a GitHub Issue or email jamie.bennett@mssm.edu.

Citations

If you use Phecoder in research, please cite our preprint on medRxiv:

Phecoder: semantic retrieval for auditing and expanding ICD-based phenotypes in EHR biobanks. Jamie J. R. Bennett, Simone Tomasi, Sonali Gupta, VA Million Veteran Program, Georgios Voloudakis, Panos Roussos, David Burstein (2026). doi: https://doi.org/10.64898/2026.01.08.26343725.
