CLI tool to run a PheWAS
PyPheWAS package
A Python script I use to run PheWAS analyses. Full documentation is available in the online docs.
Summary
This repository contains a CLI tool, implemented in Python, for running a PheWAS analysis. The tool supports both PheCode 1.2 and PheCode X (read about each here). It is based on the PheTK package but offers the model flexibility I wanted and produces more verbose output, reporting the betas and standard errors for all predictors. The PyPheWAS package supports both logistic and linear regression. Additionally, it falls back to Firth regression when the logistic model encounters a perfect separation error.
Installation
This code is hosted on PyPI and can be installed with pip, either inside a venv virtual environment or inside a conda environment. If using pip on its own, it is recommended to first create a virtual environment with venv and then install the program into it. Use one of the following sets of commands to install the program.
Using venv and pip:
python3 -m venv pyphewas-venv
source pyphewas-venv/bin/activate
pip install pyphewas-package
Using conda:
conda create -n pyphewas_env python=3.13 -y
conda activate pyphewas_env
pip install pyphewas-package
If you would like to install the PyPheWAS package from source, it is available on GitHub. It is recommended to use PDM to install the project. To do so, run the following command from the cloned repository:
pdm install
If you want to install the program from source without PDM, you must first install the dependencies listed in the pyproject.toml file using pip. You can then run the source file, which is located at './src/pyphewas/run_PheWAS.py'.
Required Inputs
- --counts: Filepath to a comma-separated file where each row has an individual ID, a phecode ID, and the number of times that individual has that phecode in their medical record.
- --covariate-file: Filepath to a comma-separated file that lists the covariates and predictor for each individual. The individuals listed in the covariate file define the cohort. Note: if the '--flip-predictor-and-outcome' flag is used, the predictor variable is treated as the outcome in the model.
- --covariate-list: Space-separated list of covariates to use in the model. Every covariate must be present in the covariate file and spelled exactly as it appears there; otherwise the code will crash.
- --phecode-version: String indicating which version of phecodes to use. This argument helps with mapping the PheCode ID to a description. The allowed values are "phecodeX", "phecode1.2", and "phecodeX_who". Most users will only need either the "phecodeX" or the "phecode1.2" option.
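As a rough illustration of the two required files, the sketch below writes tiny example inputs with Python's standard csv module. The person_id and status column names follow the documented defaults, and age and EHR_GENDER are borrowed from the example commands in this README. The counts-file header names (phecode, count), the placeholder phecode IDs (PHE_A, PHE_B), and all values are assumptions made for this example; the headers the package actually expects may differ, so check the online docs.

```python
import csv
from pathlib import Path


def write_example_inputs(directory):
    """Write tiny illustrative counts and covariate files and return their paths."""
    directory = Path(directory)
    counts_path = directory / "counts.csv"
    covariates_path = directory / "covariates.csv"

    # One row per (individual, phecode) pair with the number of occurrences.
    # "phecode" and "count" are assumed header names; PHE_A/PHE_B are placeholders.
    counts_rows = [
        {"person_id": "1", "phecode": "PHE_A", "count": 3},
        {"person_id": "1", "phecode": "PHE_B", "count": 1},
        {"person_id": "2", "phecode": "PHE_A", "count": 2},
    ]
    with open(counts_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["person_id", "phecode", "count"])
        writer.writeheader()
        writer.writerows(counts_rows)

    # One row per cohort member: ID, predictor status, then the covariates.
    covariate_rows = [
        {"person_id": "1", "status": 1, "age": 54, "EHR_GENDER": 0},
        {"person_id": "2", "status": 0, "age": 61, "EHR_GENDER": 1},
        {"person_id": "3", "status": 0, "age": 47, "EHR_GENDER": 0},
    ]
    with open(covariates_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["person_id", "status", "age", "EHR_GENDER"]
        )
        writer.writeheader()
        writer.writerows(covariate_rows)

    return counts_path, covariates_path
```

With files shaped like these, the matching covariate argument would be '--covariate-list age EHR_GENDER'. Individual 3 has no rows in the counts file but is still part of the cohort because the covariate file defines it.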
Optional Inputs
Although these arguments are not required for runtime, some combination of them will generally be used to make the analysis either more rigorous, more robust, or more fine tuned for the exact question being asked.
- --min-phecode-count: Minimum number of times a phecode must appear in an individual's record for that individual to be considered a case for the phecode. Default value is 2. Under default settings, all individuals with exactly 1 occurrence of the phecode are excluded from the regression. If this value is set to 1, no individuals are excluded.
- --min-case-count: Minimum number of cases a phecode must have to be included in the analysis. The default value is 20. There is no rigorous testing behind this value, only convention. For more rigorous results, a more conservative value of 100 may be ideal.
- --status-col: Name of the column in the covariate file that holds the predictor case/control status. Default value is "status".
- --sample-col: Name of the column in the covariate file that holds the individual IDs. Default value is "person_id".
- --output: Filename to write the output to. The output is written as a tab-separated file. If the filename ends in gz, the file is gzipped; otherwise it is written uncompressed. Default value is test_output.txt.
- --phecode-descriptions: Filepath to a comma-separated file that lists each phecode ID and its corresponding phecode name. The default description files used by the code are stored in the './src/phecode_maps/' folder and can serve as examples. The phecode ID is expected in the first column and the phecode description in the 4th column.
- --cpus: Number of CPUs to use during the analysis. Default value is 1.
- --max-iterations: Maximum number of iterations the regression may use to converge. If the model has not converged when this threshold is reached, a ConvergenceWarning is raised. If you find that many phecodes are not converging, it is recommended to increase this value. Default value is 200.
- --flip-predictor-and-outcome: Depending on the analysis, you may want the status column in the covariate file to be either a predictor or the outcome. To make the status the outcome, supply this flag as '--flip-predictor-and-outcome'. When the status is the outcome, the case/control status for each individual phecode becomes the predictor.
- --run-sex-specific: Depending on the analysis, you may also want to restrict the analysis to a sex-stratified cohort. This is one of three flags that must be used together to stratify the analysis. Allowed values are 'male-only' and 'female-only'.
- --male-as-one: If the '--run-sex-specific' flag is used, this flag must also be passed to indicate whether males were coded as 1 and females as 0, or vice versa. Pass this flag as '--male-as-one' to indicate that males were coded as 1. The default value is True, although this flag is ignored if the '--run-sex-specific' flag is not provided.
- --sex-col: Name of the column in the covariate file containing sex or gender information. This flag is required if the '--run-sex-specific' flag is used. Values should be coded numerically as 0 or 1.
- --model: Whether to run a linear or a logistic model for the regression. Allowed values are 'linear' and 'logistic'. Default value is 'logistic'.
- --firth-max-iterations: Maximum number of iterations the Firth regression model may use to converge. Default value is 50.
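The interaction between '--min-phecode-count' and '--min-case-count' can be sketched as below. This is a minimal illustration of the rules as described in this README, not the package's own code: the function names are hypothetical, and treating individuals with zero occurrences as controls is an assumption that the text above implies but does not state outright.

```python
def classify_individuals(phecode_counts, cohort_ids, min_phecode_count=2):
    """Split a cohort into cases, controls, and excluded individuals for one phecode.

    phecode_counts maps an individual ID to the number of times the phecode
    appears in that individual's record; missing IDs count as zero.
    """
    cases, controls, excluded = set(), set(), set()
    for person in cohort_ids:
        occurrences = phecode_counts.get(person, 0)
        if occurrences >= min_phecode_count:
            cases.add(person)
        elif occurrences == 0:
            # Assumption: individuals who never have the phecode are controls.
            controls.add(person)
        else:
            # Has the phecode, but fewer times than the threshold: left out.
            excluded.add(person)
    return cases, controls, excluded


def passes_case_threshold(cases, min_case_count=20):
    """Return True if the phecode has enough cases to be analyzed."""
    return len(cases) >= min_case_count
```

With the default threshold of 2, an individual with a single occurrence lands in the excluded set. With a threshold of 1, that individual becomes a case and the excluded set is always empty, matching the description of '--min-phecode-count' above.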
Example Command
Non-sex-stratified with parallelization:
pyphewas \
--counts counts.csv \
--covariate-file covariates.csv \
--min-phecode-count 2 \
--status-col status \
--sample-col person_id \
--covariate-list EHR_GENDER age unique_phecode_count \
--min-case-count 100 \
--cpus 25 \
--output output.txt.gz \
--phecode-version phecodeX
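Once a run like this finishes, the results file can be loaded for inspection. The sketch below relies only on what this README states about the output (tab-separated, gzipped when the filename ends in gz). It assumes the file starts with a header row, and the helper name read_results is hypothetical; the actual column names are not listed here, so none are hard-coded.

```python
import csv
import gzip


def read_results(path):
    """Return the rows of a tab-separated results file as a list of dicts."""
    # Mirror the documented rule: a filename ending in "gz" is gzip-compressed.
    opener = gzip.open if str(path).endswith("gz") else open
    with opener(path, "rt", newline="") as handle:
        # Assumption: the first line is a header naming each column.
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, read_results("output.txt.gz") would return one dictionary per result row, keyed by whatever column names the header contains.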
Sex-stratified with parallelization:
pyphewas \
--counts counts.csv \
--covariate-file covariates.csv \
--min-phecode-count 2 \
--status-col status \
--sample-col person_id \
--covariate-list age unique_phecode_count \
--min-case-count 100 \
--cpus 25 \
--output output.txt.gz \
--phecode-version phecodeX \
--flip-predictor-and-outcome \
--run-sex-specific female-only \
--male-as-one True \
--sex-col EHR_GENDER
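How the three sex-stratification flags combine can be sketched as a small filter. This is an illustration of the documented semantics only: the function name is hypothetical and the package's internal implementation may differ.

```python
def keep_for_sex_specific_run(sex_value, run_sex_specific, male_as_one=True):
    """Return True if an individual with this 0/1 sex code belongs in the stratum."""
    if run_sex_specific not in ("male-only", "female-only"):
        raise ValueError("run_sex_specific must be 'male-only' or 'female-only'")
    if sex_value not in (0, 1):
        raise ValueError("sex values are expected to be coded as 0 or 1")

    # Work out which numeric code means "male" under the chosen coding.
    male_code = 1 if male_as_one else 0
    is_male = sex_value == male_code
    return is_male if run_sex_specific == "male-only" else not is_male
```

Under the settings in the command above ('female-only' with males coded as 1), only individuals whose EHR_GENDER value is 0 would be kept. The command also leaves EHR_GENDER out of '--covariate-list', which is consistent with that column being constant within a single-sex cohort.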
Note on parallelization: In this package, the logistic model is faster and more memory efficient than the linear model. In testing, each linear-model process (one per CPU requested on the command line) used 10-16 GB of RAM and the full run took ~60 minutes. The logistic model used 30 GB of RAM in total across 15 CPUs and finished in 30 minutes. Both runs used a cohort of ~1.6 million individuals. To gauge how the linear model will perform on your machine, run it with 2 CPUs on about 250 phecodes and check how much memory each Python process uses.
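Given a measured per-process memory figure, a back-of-the-envelope estimate of a safe '--cpus' value might look like the sketch below. It assumes memory use scales roughly linearly with the number of processes, which the note above suggests but does not guarantee. The default of 16 GB is simply the upper end of the range reported for the linear model on ~1.6 million individuals, and the function name is hypothetical.

```python
def max_parallel_processes(available_gb, per_process_gb=16.0, requested_cpus=None):
    """Estimate how many worker processes fit in the available memory."""
    if available_gb <= 0 or per_process_gb <= 0:
        raise ValueError("memory figures must be positive")

    # Whole processes only: round down so the estimate stays within budget.
    fit = int(available_gb // per_process_gb)
    if requested_cpus is not None:
        fit = min(fit, requested_cpus)
    return fit
```

For instance, a machine with 64 GB free and processes that each peak at 16 GB would fit 4 processes under this estimate. Replace the default with the figure you observe in your own 2-CPU test run.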