CLI tool to run a PheWAS

PyPheWAS package

A Python script I use to run PheWAS analyses. Full documentation can be found in the online docs.

Summary

This repository contains a CLI tool, implemented in Python, for running a PheWAS analysis. It supports both PheCode 1.2 and PheCode X (read about each here). The package is based on the PheTK package but offers the model flexibility that I wanted and produces more verbose output, reporting the betas and standard errors for all predictors. The PyPheWAS package supports both logistic and linear regression. Additionally, it falls back to Firth regression when a perfect separation error is encountered in the logistic model.
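
To make "perfect separation" concrete, the sketch below checks whether a single numeric predictor completely splits cases from controls. It is not taken from the package: the function name `is_perfectly_separated` is hypothetical, and a model with several covariates can separate in ways this one-variable check would not detect.

```python
def is_perfectly_separated(x, y):
    """Return True if a threshold on x splits y == 0 from y == 1 exactly.

    Hypothetical helper for illustration only; not part of the package.
    """
    controls = [xi for xi, yi in zip(x, y) if yi == 0]
    cases = [xi for xi, yi in zip(x, y) if yi == 1]
    if not controls or not cases:
        # Only one outcome class is present, so there is nothing to separate.
        return False
    # Separation: every case lies strictly above (or below) every control.
    return max(controls) < min(cases) or max(cases) < min(controls)


# Every case has a larger predictor value than every control.
separated = is_perfectly_separated([1, 2, 3, 10, 11], [0, 0, 0, 1, 1])
# The two groups overlap, so an ordinary logistic fit is possible.
overlapping = is_perfectly_separated([1, 2, 3, 4, 11], [0, 1, 0, 0, 1])
```

When the groups separate like this, the ordinary maximum-likelihood estimate of the coefficient does not stay finite, whereas Firth's penalized likelihood still yields a finite estimate.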

Installation

This code is hosted on PyPI and can be installed using a package manager such as conda or pip. If using pip, it is recommended to first create a virtual environment with venv and then install the program into it. Use one of the following sets of commands to install the program.

Using venv and pip:

python3 -m venv pyphewas-venv

source pyphewas-venv/bin/activate

pip install pyphewas-package

Using conda:

conda create -n pyphewas_env python=3.13 -y

conda activate pyphewas_env

pip install pyphewas-package
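
One way to confirm that the install landed in the active environment is to ask Python for the installed version of the distribution. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the distribution name is the one used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("pyphewas-package"))
```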

If you would like to install the PyPheWAS package from source, it is available on GitHub. It is recommended to use PDM to install the project. To install the PyPheWAS package from source using PDM, run the following command:

pdm install

If you want to install the program from source without PDM, you must first install the dependencies listed in the pyproject.toml file using pip. You can then run the source file directly, which is located at './src/pyphewas/run_PheWAS.py'.

Required Inputs

  • --counts: filepath to a comma-separated file where each row has an individual ID, a phecode ID, and the number of times that individual has that phecode in their medical record.

  • --covariate-file: filepath to a comma-separated file that lists the covariates and predictor for each individual. The individuals listed in the covariate file will be the individuals in the cohort. Note: if the '--flip-predictor-and-outcome' flag is used, the predictor variable is assumed to be the outcome in the model.

  • --covariate-list: Space-separated list of covariates to use in the model. All of these covariates must be present in the covariate file and must be spelled exactly the same; otherwise the code will crash.

  • --phecode-version: String telling which version of phecodes to use. This argument helps with mapping the PheCode ID to a description. The allowed values are "phecodeX", "phecode1.2", and "phecodeX_who". Most users will only need either the "phecodeX" or the "phecode1.2" option.
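
As a rough illustration of the two input files described above, the following sketch writes a tiny counts file and covariate file and then checks that every covariate intended for --covariate-list is a column in the covariate file. The counts header names `phecode` and `count`, the phecode IDs, and all data values are placeholders chosen for this example, not names confirmed by the package; `person_id` and `status` are the documented defaults for --sample-col and --status-col.

```python
import csv
import tempfile
from pathlib import Path

# Placeholder data. The "phecode" and "count" header names are assumptions.
counts_rows = [
    {"person_id": "1", "phecode": "250.2", "count": 3},
    {"person_id": "1", "phecode": "401.1", "count": 1},
    {"person_id": "2", "phecode": "250.2", "count": 2},
]
covariate_rows = [
    {"person_id": "1", "status": 1, "EHR_GENDER": 0, "age": 54, "unique_phecode_count": 2},
    {"person_id": "2", "status": 0, "EHR_GENDER": 1, "age": 61, "unique_phecode_count": 1},
]


def write_csv(path, rows):
    """Write a list of dicts as a comma-separated file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


out_dir = Path(tempfile.mkdtemp())
write_csv(out_dir / "counts.csv", counts_rows)
write_csv(out_dir / "covariates.csv", covariate_rows)

# Every name passed to --covariate-list must match a header exactly.
covariate_list = ["EHR_GENDER", "age", "unique_phecode_count"]
with open(out_dir / "covariates.csv", newline="") as handle:
    header = next(csv.reader(handle))
missing = [name for name in covariate_list if name not in header]
print("missing covariates:", missing)
```

Running a check like this before launching the analysis catches misspelled covariate names early instead of at model-fitting time.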

Optional Inputs

Although these arguments are not required for runtime, some combination of them will generally be used to make the analysis either more rigorous, more robust, or more fine tuned for the exact question being asked.

  • --min-phecode-count: Minimum number of times a phecode must appear in an individual's record for that individual to be considered a case for the phecode. Default value is 2. Under default settings, individuals with exactly 1 occurrence of the phecode are excluded from the regression for that phecode. If this value is set to 1, no individuals are excluded.

  • --min-case-count: Minimum number of cases a phecode has to have to be included in the analysis. The default value is 20. There is no rigorous testing behind this value, only convention. For more rigorous results, a more conservative value of 100 may be ideal.

  • --status-col: column name for the column in the covariate file that has the predictor case/control status. Default value is "status"

  • --sample-col: column name for the column in the covariates file that has the individual ids. Default value is "person_id"

  • --output: filename to write the output to. The output will be written as a tab-separated file. If the filename ends in gz, the file will be gzipped; otherwise it will be uncompressed. Default value is test_output.txt.

  • --phecode-descriptions: filepath to a comma separated file that lists the phecode ID and the corresponding phecode name. There are default description files stored in the './src/phecode_maps/' folder if you wish to see example files that are currently used in the code. The phecode ID is expected to be the first column while the phecode description is expected to be the 4th column.

  • --cpus: Number of cpus to use during the analysis. Default value is 1.

  • --max-iterations: Maximum number of iterations the regression may use to converge. If the model has not converged when this threshold is reached, a ConvergenceWarning will be raised. If you find that many phecodes are not converging, it is recommended to increase this value. Default value is 200.

  • --flip-predictor-and-outcome: Depending on the analysis, you may want the status column in the covariate file to be a predictor or to be the outcome. If you want the status to be the outcome then you can supply this flag as '--flip-predictor-and-outcome'. When the status is the outcome, then the case/control status for the individual phecodes will become the predictor.

  • --run-sex-specific: Depending on the analysis, you may also want to restrict the analysis to a sex-stratified cohort. This flag is one of three that must be used together to stratify the analysis (the others are '--male-as-one' and '--sex-col'). Allowed values are 'male-only' and 'female-only'.

  • --male-as-one: If the '--run-sex-specific' flag is used then this flag also has to be passed indicating if males were coded as 1 and females as 0 or vice versa. You could pass this flag as '--male-as-one' to indicate that males were coded as 1. The default value is True although this flag will be ignored if the '--run-sex-specific' flag is not provided.

  • --sex-col: Column name of the column in the covariate file containing Sex or Gender information. This flag is required if the '--run-sex-specific' flag was used. Values should be coded numerically as 0 or 1.

  • --model: Whether to run a linear model or a logistic model for the regression. Default value is 'logistic'. Allowed values are 'linear' and 'logistic'.

  • --firth-max-iterations: Maximum number of iterations allowed for the Firth regression model to converge. Default value is 50.
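
The interaction between --min-phecode-count and --min-case-count can be sketched as follows. This is an illustration of the rules as described above, not the package's own code: the function names are hypothetical, and treating every count between 1 and the threshold as excluded (rather than only a count of exactly 1) is an assumption extrapolated from the documented default of 2.

```python
from collections import Counter


def classify(count, min_phecode_count=2):
    """Label one individual for one phecode from their occurrence count."""
    if count >= min_phecode_count:
        return "case"
    if count > 0:
        # Some evidence of the phecode, but below the case threshold.
        return "excluded"
    return "control"


def phecodes_to_test(counts, min_phecode_count=2, min_case_count=20):
    """Return phecodes with enough cases to be analysed.

    counts: iterable of (person_id, phecode, count) tuples.
    """
    cases = Counter(
        phecode
        for _person, phecode, count in counts
        if classify(count, min_phecode_count) == "case"
    )
    return sorted(p for p, n in cases.items() if n >= min_case_count)


# Placeholder data: "250.2" has two cases, "401.1" has one.
example = [
    ("a", "250.2", 3),
    ("b", "250.2", 2),
    ("c", "250.2", 1),
    ("a", "401.1", 5),
]
print(classify(1))                                   # excluded
print(classify(1, min_phecode_count=1))              # case
print(phecodes_to_test(example, min_case_count=2))   # ['250.2']
```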

Example Command

Non-sex-stratified with parallelization:

pyphewas \
    --counts counts.csv \
    --covariate-file covariates.csv \
    --min-phecode-count 2 \
    --status-col status \
    --sample-col person_id \
    --covariate-list EHR_GENDER age unique_phecode_count \
    --min-case-count 100 \
    --cpus 25 \
    --output output.txt.gz \
    --phecode-version phecodeX
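
Because the output is tab-separated and gzipped only when the filename ends in gz, a results file such as output.txt.gz from the command above can be loaded with the standard library alone. This is a minimal sketch; `read_results` is a hypothetical helper, and it makes no assumption about the column names in the output.

```python
import csv
import gzip


def read_results(path):
    """Read a tab-separated results file, gzipped if the name ends in 'gz'."""
    opener = gzip.open if str(path).endswith("gz") else open
    with opener(path, "rt", newline="") as handle:
        # Each row becomes a dict keyed by whatever header the file contains.
        return list(csv.DictReader(handle, delimiter="\t"))


# Example call once the analysis has finished:
# rows = read_results("output.txt.gz")
```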

Sex-stratified with parallelization:

pyphewas \
    --counts counts.csv \
    --covariate-file covariates.csv \
    --min-phecode-count 2 \
    --status-col status \
    --sample-col person_id \
    --covariate-list age unique_phecode_count \
    --min-case-count 100 \
    --cpus 25 \
    --output output.txt.gz \
    --phecode-version phecodeX \
    --flip-predictor-and-outcome \
    --run-sex-specific female-only \
    --male-as-one True \
    --sex-col EHR_GENDER
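
The three sex-stratification flags in the command above combine roughly as in the sketch below, which keeps only the covariate rows belonging to the requested stratum. It illustrates the documented behaviour and is not the package's implementation; `restrict_to_sex` and the sample rows are hypothetical.

```python
def restrict_to_sex(rows, sex_col, run_sex_specific, male_as_one=True):
    """Keep only rows whose 0/1 sex code matches the requested stratum."""
    if run_sex_specific not in ("male-only", "female-only"):
        raise ValueError("expected 'male-only' or 'female-only'")
    male_code = 1 if male_as_one else 0
    wanted = male_code if run_sex_specific == "male-only" else 1 - male_code
    return [row for row in rows if int(row[sex_col]) == wanted]


# Placeholder covariate rows with males coded as 1.
cohort = [
    {"person_id": "1", "EHR_GENDER": "0"},
    {"person_id": "2", "EHR_GENDER": "1"},
    {"person_id": "3", "EHR_GENDER": "0"},
]
females = restrict_to_sex(cohort, "EHR_GENDER", "female-only", male_as_one=True)
print([row["person_id"] for row in females])   # ['1', '3']
```

Note that the sex-stratified command leaves EHR_GENDER out of --covariate-list, presumably because the column is constant within a single-sex cohort.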

Note on parallelization: In this package, the logistic model is faster and more memory efficient than the linear model. In testing the linear model, each process (one per CPU requested on the command line) used 10-16 GB of RAM and the full run took ~60 minutes. The logistic model ran on 30 GB of RAM in total with 15 CPUs in about 30 minutes. Both runs used a cohort of ~1.6 million individuals. You can gauge how the linear model will perform on your machine by running it with 2 CPUs on about 250 phecodes and checking the memory used by each Python process.
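
For planning a linear-model run, the per-process figures above can be turned into a rough total. The sketch below simply multiplies the reported 10-16 GB range by the number of CPUs; it assumes a cohort of similar size (~1.6 million individuals), and `linear_model_memory_gb` is a hypothetical helper, so treat the result as a starting point rather than a guarantee.

```python
def linear_model_memory_gb(cpus, per_process_gb=(10, 16)):
    """Return a (low, high) estimate of total RAM in GB for the linear model."""
    low, high = per_process_gb
    return cpus * low, cpus * high


print(linear_model_memory_gb(2))   # (20, 32)
```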
