
Variational Latent Phenotype Inference (vLPI)

The vlpi software package implements the latent phenotype model described in Blair et al., which infers the cryptic, quantitative traits that underlie a set of observed disease symptoms. Details concerning the implementation of the model and inference algorithm can be found in the Supplementary Materials of Blair et al. Please contact david.blair@ucsf.edu with any questions.

Installation

The software package can be installed using pip by running the following command:

pip install vlpi
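
After installation, the package should import cleanly. A quick sanity check from Python (nothing more than an import test):

import vlpi
from vlpi.vLPI import vLPI
print(vlpi.__file__)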

Use

The software package is broken into two parts. The first implements a data structure (the ClinicalDataset class) that efficiently stores and manipulates large clinical datasets. It is essentially a sparse binary array with added functionality that automates many tasks, such as constructing training and validation splits and converting among different symptom encodings. A second class (ClinicalDatasetSampler) efficiently generates random subsets of a ClinicalDataset, which is important for training.

The second part of the software package, the vLPI class, implements the model fitting itself using a stochastic, amortized variational inference algorithm (see Blair et al. for details). It requires a ClinicalDataset passed in the form of a ClinicalDatasetSampler. Below, we provide an example of how to use the software package by simulating a relatively simple symptom dataset. Further details regarding the package and its functionality can be found in the source code documentation associated with the individual functions and classes.
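
In outline, the two pieces fit together as follows. This is a condensed sketch of the example worked through below, where clinicalData stands in for a populated ClinicalDataset:

from vlpi.data.ClinicalDataset import ClinicalDataset,ClinicalDatasetSampler
from vlpi.vLPI import vLPI

# clinicalData: a populated ClinicalDataset (see the example below)
sampler = ClinicalDatasetSampler(clinicalData,0.75,returnArrays='Torch')
model = vLPI(sampler,10)  # 10 = initial number of latent phenotypes
model.FitModel(batch_size=1000)
riskFunction = model.ReturnComponents()  # inferred symptom risk function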

Simulation Example

First, we will import the functions required for dataset simulation and inference from the vlpi package. Note that torch and string are imported simply to assist with the simulation.

from vlpi.data.ClinicalDataset import ClinicalDataset,ClinicalDatasetSampler
from vlpi.data.ClinicalDataSimulator import ClinicalDataSimulator
from vlpi.vLPI import vLPI
import torch
import string

seed=torch.manual_seed(1023)

Using the ClinicalDataSimulator class, simulating a clinical dataset of binary symptoms is straightforward. Additional details regarding the functionality of ClinicalDataSimulator are provided in the source code.

numberOfSamples=50000
numberOfSymptoms=20

rareDiseaseFrequency=0.001
numLatentPhenotypes=2
simulator = ClinicalDataSimulator(numberOfSymptoms,numLatentPhenotypes,rareDiseaseFrequency)
simulatedData=simulator.GenerateClinicalData(numberOfSamples)

The simulatedData returned by ClinicalDataSimulator is a nested dictionary of results, but to use the vLPI model, we need to get this information into a ClinicalDataset. First, we initialize an empty dataset, which by default is constructed using the full ICD10-CM codebook.

clinicalData = ClinicalDataset()

Most applications, however, do not require the complete codebook. In fact, we recommend against trying to fit this model to more than 10-100 symptoms, as the model is unlikely to reliably tease apart such complex latent structure. Therefore, a ClinicalDataset can be manipulated a priori such that it aligns with a different encoding, which is what we do below. Note that the ClinicalDataset class can also read a dataset directly from a text file (see ClinicalDataset.ReadDatasetFromFile), and this should work for arbitrary encodings as long as the ClinicalDataset class is set up appropriately. However, we have only tested this function on raw, ICD10-CM encoded datasets.

allICDCodes = list(clinicalData.dxCodeToDataIndexMap.keys())
symptomConversionMap=dict(zip(allICDCodes[0:numberOfSymptoms],string.ascii_uppercase[0:numberOfSymptoms]))
clinicalData.ConstructNewDataArray(symptomConversionMap)
print(clinicalData.dxCodeToDataIndexMap)
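
As noted above, a dataset can also be read directly from a text file. The call below is a hypothetical sketch: the file name is illustrative, and the exact arguments should be checked against the ClinicalDataset.ReadDatasetFromFile docstring.

# Hypothetical sketch; file name and arguments are illustrative, see the docstring
rawData = ClinicalDataset()
rawData.ReadDatasetFromFile('icd10_encoded_dataset.txt')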

Now, we can load our simulated symptom data into the ClinicalDataset. Note that the empty list argument would normally contain a list of covariate names; since we didn't simulate any covariates, there are no names to provide.

clinicalData.LoadFromArrays(simulatedData['incidence_data'],simulatedData['covariate_data'],[],catCovDicts=None, arrayType = 'Torch')

The next step is to construct a ClinicalDatasetSampler from the ClinicalDataset, which enables the stochastic sampling required for inference. To do so, you must specify a fraction of the dataset to withhold for testing/validation. Note that ClinicalDatasets and ClinicalDatasetSamplers can also be written to disk, so that the same dataset and sampler can be reloaded to ensure reproducibility (a hypothetical sketch follows the code below).

training_data_fraction=0.75
sampler = ClinicalDatasetSampler(clinicalData,training_data_fraction,returnArrays='Torch')
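
As mentioned, both classes can be written to disk and reloaded. The method names and file names below are assumptions based on the note above; confirm the exact API in the source code documentation.

# Hypothetical sketch; method names and file names are illustrative
clinicalData.WriteToDisk('simulated_dataset.pth')
sampler.WriteToDisk('simulated_sampler.pth')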

Now, we're ready to perform model inference. Technically, only the initial number of latent phenotypes needs to be specified, although there are additional optional arguments as well (see the source code). We have only tested the model on symptom sets that likely contain <10 latent phenotypes. The model may be effective at inferring more complex structures, but we have not thoroughly tested this.

infNumberOfLatentPhenotypes=10
vlpiModel= vLPI(sampler,infNumberOfLatentPhenotypes)

Fitting the model is very straightforward, although there are multiple hyperparameters (learning rate, batch size, max epochs, etc.) that can be changed from their baseline values. The default hyperparameters and the different options used in practice are described in Blair et al. We always recommend saving any model that was successfully fit using the vLPI.PackageModel function.

inference_output = vlpiModel.FitModel(batch_size=1000,errorTol=(1.0/numberOfSamples),verbose=False)
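
Following the recommendation above, the fitted model can now be saved with vLPI.PackageModel. The file name, and the assumption that the function takes a path argument, are illustrative; see the docstring for the exact signature.

# Save the fitted model; the path argument is an assumption, see the docstring
vlpiModel.PackageModel('fitted_vlpi_model.pth')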

Great! Now we can check model fit with a little bit of visualization. We'll use the matplotlib and seaborn plotting libraries for this, which are not included with the vlpi package. First, we track the loss function (the negative evidence lower bound, or -ELBO) on the held-out testing data. We drop the first 20 epochs because they obscure the rest of the data.

import seaborn as sns
from matplotlib import cm
import numpy as np
import matplotlib.pyplot as plt
sns.set(context='talk',color_codes=True,style='ticks',font='Arial',font_scale=2.5,rc={'axes.linewidth':5,"font.weight":"bold",'axes.labelweight':"bold",'xtick.major.width':4,'xtick.minor.width': 2})
cmap = cm.get_cmap('viridis', 12)
color_list=[cmap(x) for x in [0.0,0.1,0.25,0.5,0.75,0.9,1.0]]

sns.lineplot(x=range(1,len(inference_output[2])+1)[20:],y=inference_output[2][20:],color=color_list[0],lw=3.0)
o=plt.xlabel('Epoch')
o=plt.ylabel('-ELBO')

[Figure: -ELBO on the held-out testing data vs. training epoch]

Convergence appears to have been attained, but we can also compare the simulated and inferred symptom risk functions. If the inference algorithm is converging to the correct mode, then the two functions should be nearly identical.

f,axes = plt.subplots(1, 2,figsize=(16,8))
for ax in axes:
    ax.tick_params(axis='x',which='both',bottom=False,top=False,left=False,right=False,labelbottom=False)
    ax.tick_params(axis='y',which='both',left=False,right=False,bottom=False,top=False,labelleft=False)

infRiskFunc=vlpiModel.ReturnComponents()
simRiskFunc=simulatedData['model_params']['latentPhenotypeEffects']
im1=axes[0].imshow(simRiskFunc,cmap=cmap)
im2=axes[1].imshow(infRiskFunc,cmap=cmap)
o=axes[0].set_title('Simulated Symptom\nRisk Function',fontsize=36)
o=axes[1].set_title('Inferred Symptom\nRisk Function',fontsize=36)

[Figure: simulated (left) vs. inferred (right) symptom risk functions]

Note that the inference algorithm automatically selects the appropriate number of latent phenotypes by zeroing out the parts of the risk function that correspond to unneeded components. As a final step, we can compare the inferred latent phenotypes themselves. In this case, we simply match the simulated and inferred latent phenotypes visually, based on the risk functions depicted above, but there are formal ways to align matrices of parameters (orthogonal Procrustes analysis; see Blair et al.).
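
As an aside, here is a sketch of such an alignment using scipy's orthogonal_procrustes (scipy is not a vlpi dependency, and zero-padding the simulated risk function to equalize shapes is our assumption, not the procedure from Blair et al.):

import numpy as np
from scipy.linalg import orthogonal_procrustes

sim = np.asarray(simRiskFunc)   # shape: (2, numberOfSymptoms)
inf_ = np.asarray(infRiskFunc)  # shape: (10, numberOfSymptoms)
simPadded = np.zeros_like(inf_)
simPadded[:sim.shape[0]] = sim

# solves min_R ||inf_.T @ R - simPadded.T||_F over orthogonal R
R, _ = orthogonal_procrustes(inf_.T, simPadded.T)
alignedComponents = R.T @ inf_  # inferred components rotated toward the simulated basis

Returning to the visual comparison: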

inferredCrypticPhenotypes=vlpiModel.ComputeEmbeddings((simulatedData['incidence_data'],simulatedData['covariate_data']))
f,axes = plt.subplots(1, 2,figsize=(16,8))
f.tight_layout(pad=3.0)

sns.scatterplot(x=simulatedData['latent_phenotypes'][:,0],y=inferredCrypticPhenotypes[:,-1],color=color_list[0],ax=axes[0])
sns.scatterplot(x=simulatedData['latent_phenotypes'][:,1],y=inferredCrypticPhenotypes[:,-4],color=color_list[2],ax=axes[1])
axes[0].plot([-3,3],[-3,3],'--',lw=5.0,color='r')
axes[1].plot([-3,3],[-3,3],'--',lw=5.0,color='r')

o=axes[0].set_xlabel('Simulated Latent\nPhenotype 1',fontsize=20)
o=axes[1].set_xlabel('Simulated Latent\nPhenotype 2',fontsize=20)

o=axes[0].set_ylabel('Inferred Latent\nPhenotype 10',fontsize=20)
o=axes[1].set_ylabel('Inferred Latent\nPhenotype 7',fontsize=20)

[Figure: simulated vs. inferred latent phenotypes, with y=x reference lines]

Clearly, the inferred and simulated latent phenotypes are highly correlated. However, there is a fair amount of noise associated with the inferred latent phenotypes, as well as floor/ceiling effects. These reflect the loss of information that occurs when continuous traits are transformed into noisy, binary symptoms. The noise is greatly reduced by simulating datasets with hundreds of symptoms, although this is not a realistic clinical scenario.
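
To put a number on that correlation, here is a quick sketch using scipy.stats.spearmanr (rank correlation is fairly robust to the floor/ceiling effects noted above; scipy is not bundled with vlpi):

from scipy.stats import spearmanr

# rank correlation between the first simulated latent phenotype and its inferred match
rho, pval = spearmanr(np.asarray(simulatedData['latent_phenotypes'][:,0]),
                      np.asarray(inferredCrypticPhenotypes[:,-1]))
print(rho, pval)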
