Semantic retrieval for auditing and expanding ICD-based phenotypes in EHR biobanks.

Phecoder: semantic retrieval for auditing and expanding ICD-based phenotypes in EHR biobanks

Overview

Phecoder maps clinical phenotypes (Phecodes) to diagnosis (ICD) codes using pretrained text embedding models. It evaluates multiple embedding models and ensemble methods to find the most relevant diagnosis codes for each phenotype.
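
The retrieval idea can be illustrated with a minimal sketch. The three-dimensional vectors below are made-up stand-ins for real embeddings, and `cosine` and `rank_codes` are hypothetical helper names, not part of the Phecoder API. The sketch assumes cosine similarity as the relevance score, which is a common choice for text embeddings but is not confirmed here as Phecoder's exact scoring.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_codes(query_vec, code_vecs):
    """Return (icd_code, score) pairs sorted from most to least similar."""
    scored = [(code, cosine(query_vec, vec)) for code, vec in code_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for embeddings of ICD descriptions (illustrative only).
icd_vectors = {
    "E11.9": [0.9, 0.1, 0.0],    # Type 2 diabetes mellitus without complications
    "I10": [0.1, 0.9, 0.1],      # Essential (primary) hypertension
    "J45.909": [0.0, 0.2, 0.9],  # Unspecified asthma, uncomplicated
}
phenotype_vector = [0.8, 0.2, 0.1]  # stand-in for the embedding of "Type 2 diabetes"

ranking = rank_codes(phenotype_vector, icd_vectors)
print(ranking[0][0])  # the ICD code closest to the phenotype vector
```

In a real run the vectors come from a pretrained embedding model, and the ranking covers every ICD code in the table.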

Installing Phecoder

Note: Python >=3.10 is required.

As a user

python -m venv venv
source venv/bin/activate
pip install phecoder

PyTorch with CUDA

To use a GPU with CUDA, install a PyTorch build that matches your CUDA version before installing Phecoder. See PyTorch - Get Started for installation instructions.
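
A quick way to confirm that the installed PyTorch build can see a GPU is sketched below. `pick_device` is a hypothetical helper written for this example, not a Phecoder function, and how Phecoder itself selects a device is not described here.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment
    return "cuda" if torch.cuda.is_available() else "cpu"

print(pick_device())
```

If this prints `cpu` on a machine with a GPU, the installed PyTorch build most likely does not match the local CUDA version.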

As a developer

Phecoder is developed using Poetry. See Poetry - Installation for how to install Poetry. Then clone the repository and install the dependencies from inside it:

git clone https://github.com/DiseaseNeuroGenomics/phecoder.git
cd phecoder
poetry install

Quick Start

Workflow example with default settings

With its default settings, Phecoder uses the ensemble that performed best in our study.

import os
from phecoder import Phecoder

# Setup
os.environ["HF_HOME"] = "./hf-home"

# Initialize
pc = Phecoder(
    phecodes=["Suicidal ideation", "Depression", "Anxiety"],
    output_dir="./results",
    icd_cache_dir="./icd_cache"
)

# Run pipeline
pc.run()
pc.build_ensemble()

# Load results into dataframe
results = pc.load_results('ensemble-zsum')

A more detailed example

1. Setup and Import

import os
import pandas as pd
from phecoder import Phecoder, load_icd_df

# Set Hugging Face cache directory (optional but recommended)
os.environ["HF_HOME"] = "./hf-home"

2. Define Directories

output_dir = "./results"  # Results saved here
icd_cache_dir = "./icd_cache"  # ICD embeddings cached here (optional, reusable across runs)

3. Load ICD Codes

Your ICD data must include the columns icd_code and icd_string.

icd_df = load_icd_df()  # loads the default ICD table; alternatively, load your own in the format below

Example format (essential columns):

icd_code	icd_string
E11.9	Type 2 diabetes mellitus without complications
I10	Essential (primary) hypertension
J45.909	Unspecified asthma, uncomplicated
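
Before passing your own ICD table to Phecoder, it can help to check that the required columns are present. The sketch below does this for a tab-separated table using only the standard library. `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names for this example and are not part of the Phecoder API.

```python
import csv
import io

REQUIRED_COLUMNS = {"icd_code", "icd_string"}

def missing_columns(tsv_text):
    """Return the required column names absent from a tab-separated table's header."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])

sample = (
    "icd_code\ticd_string\n"
    "E11.9\tType 2 diabetes mellitus without complications\n"
    "I10\tEssential (primary) hypertension\n"
)
print(missing_columns(sample))  # an empty set means the header is complete
```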

4. Define Phenotype(s)

# Single phenotype
phenotype = "Eating disorders"

# OR multiple phenotypes
phenotypes = ["Eating disorders", "Type 2 diabetes", "Hypertension"]

# OR DataFrame with phecode and description
phecode_df = pd.DataFrame({
    'phecode': ['250.2', '401.1'],
    'phecode_string': ['Type 2 diabetes', 'Hypertension']
})

5. Choose Models

# Light model (fast, ~80MB)
models = ["sentence-transformers/all-MiniLM-L6-v2"]

# OR clinically-trained model (better for medical text, ~440MB)
models = ["FremyCompany/BioLORD-2023"]

# OR multiple models (for ensemble)
models = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "FremyCompany/BioLORD-2023",
    "NeuML/pubmedbert-base-embeddings"
]

# OR use a preset according to our evaluation
models = "preset:best_single"  # best single model
models = "preset:best_ensemble"  # best set of models for ensemble (same as default)

6. Initialize Phecoder

pc = Phecoder(
    icd_df=icd_df,
    phecodes=phenotype,                  # string, list of strings, or dataframe with "phecode" column
    models=models,
    output_dir=output_dir,
    icd_cache_dir=icd_cache_dir,         # Optional: cache ICD embeddings for reuse
    st_search_kwargs={"top_k": 100},     # Return top 100 ICD codes per phenotype
)
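
The effect of `top_k` can be sketched as a simple truncation of a scored list. The scores below are made up, and `top_k_codes` is a hypothetical helper for illustration rather than the code path Phecoder uses internally.

```python
import heapq

def top_k_codes(scores, top_k=100):
    """Keep the top_k highest-scoring ICD codes, best first."""
    return heapq.nlargest(top_k, scores.items(), key=lambda item: item[1])

scores = {"E11.9": 0.98, "I10": 0.36, "J45.909": 0.17}  # made-up similarity scores
print(top_k_codes(scores, top_k=2))
```

A larger `top_k` keeps more candidate codes for review at the cost of more low-relevance hits near the bottom of the list.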

7. Run Pipeline

# Option 1: Run directly (models auto-download if needed)
pc.run()

# Option 2: Pre-download models, then run (useful for batch jobs)
pc.download_models()  # Optional: explicitly download models first
pc.run()

# Build ensemble (combines multiple models using reciprocal rank fusion)
pc.build_ensemble(
    method="rrf",
    method_kwargs={"k": 60},
    name="ens:rrf60"
)
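
For readers unfamiliar with reciprocal rank fusion, here is a minimal sketch of the standard formula, in which each code receives the sum of 1 / (k + rank) over all model rankings. It assumes that `k` in `method_kwargs` plays the role of this constant and that ranks start at 1. `reciprocal_rank_fusion` is a hypothetical function written for this example, and Phecoder's own implementation may differ in details such as tie handling.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each item scores sum(1 / (k + rank)), rank starting at 1."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, code in enumerate(ranking, start=1):
            fused[code] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Made-up rankings from three models for one phenotype (best match first).
model_a = ["E11.9", "E11.8", "I10"]
model_b = ["E11.8", "E11.9", "E13.9"]
model_c = ["E11.9", "E13.9", "E11.8"]

fused = reciprocal_rank_fusion([model_a, model_b, model_c], k=60)
print([code for code, _ in fused])
```

Because only ranks enter the formula, models whose similarity scores live on different scales can be combined without normalization.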

8. Load Results

# Load all results (individual models + ensemble)
results = pc.load_results()

# Load ensemble results only
ensemble_results = pc.load_results(
    models=['ens:rrf60'],
    include_ensembles=True
)

Tips

The first run is slower: models are downloaded and ICD embeddings are computed.
Subsequent runs are faster: ICD embeddings are cached and reused.
Use icd_cache_dir to share ICD embeddings across multiple projects.
Start with a light model for testing, then switch to a clinical model for production.
Ensembles typically outperform individual models.
For batch jobs, pre-download models with pc.download_models() to separate download time from computation.
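
The reason cached embeddings make later runs faster can be shown with a small sketch of an on-disk cache keyed by model name and input texts. The file layout, the hashing scheme, and the names `cache_path`, `cached_embeddings`, and `fake_embed` are all assumptions made for this example and do not describe how Phecoder stores its cache.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cache_path(cache_dir, model_name, texts):
    """Build a cache file path keyed by the model name and the exact texts embedded."""
    payload = json.dumps({"model": model_name, "texts": list(texts)}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.json"

def cached_embeddings(cache_dir, model_name, texts, embed_fn):
    """Load embeddings from disk if present; otherwise compute and store them."""
    path = cache_path(cache_dir, model_name, texts)
    if path.exists():
        return json.loads(path.read_text())
    vectors = embed_fn(texts)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(vectors))
    return vectors

calls = []  # records how often the (fake) model is actually invoked

def fake_embed(texts):
    """Stand-in for a real embedding model; returns one toy vector per text."""
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]

with tempfile.TemporaryDirectory() as tmp:
    texts = ["Essential (primary) hypertension"]
    first = cached_embeddings(tmp, "toy-model", texts, fake_embed)
    second = cached_embeddings(tmp, "toy-model", texts, fake_embed)

print(len(calls))  # number of times the embedding function ran
```

The second call reads from disk instead of invoking the model, which is the same trade-off that makes a shared icd_cache_dir worthwhile across projects.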

Citations

If you use Phecoder in research, please cite our preprint on medRxiv:

Phecoder: semantic retrieval for auditing and expanding ICD-based phenotypes in EHR biobanks. Jamie J. R. Bennett, Simone Tomasi, Sonali Gupta, VA Million Veteran Program, Georgios Voloudakis, Panos Roussos, David Burstein (2026). doi: https://doi.org/10.64898/2026.01.08.26343725.

This version: 0.2.0 (released Feb 13, 2026)

Download files

Source Distribution

phecoder-0.2.0.tar.gz (1.6 MB view details)

Uploaded Feb 13, 2026 Source

Built Distribution

phecoder-0.2.0-py3-none-any.whl (1.6 MB view details)

Uploaded Feb 13, 2026 Python 3

File details

Details for the file phecoder-0.2.0.tar.gz.

File metadata

Download URL: phecoder-0.2.0.tar.gz
Upload date: Feb 13, 2026
Size: 1.6 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for phecoder-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`5853dee5cf300ec3b1395bc411646fc01d5df0d86261d2bf7b75f11ebd420e87`
MD5	`7e70e75a3e2cefa6ca28d8bca3fa77e3`
BLAKE2b-256	`a49ede9057687b01401bc1a267a1efcbd01bc5106cabdc4dd2987235345edd23`

File details

Details for the file phecoder-0.2.0-py3-none-any.whl.

File metadata

Download URL: phecoder-0.2.0-py3-none-any.whl
Upload date: Feb 13, 2026
Size: 1.6 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for phecoder-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f3c65e9824bbbdaaf0b32cef7351dd8f917318a04e900a0e502438d9b2d17746`
MD5	`f5035993eb94d56ec79590f88c120378`
BLAKE2b-256	`c585a3d6c03d76da35135c2fd24dd54686882faec967311290b228efdd98e921`
