PheTK - The Phenotype Toolkit
The official repository of PheTK, a fast python library for Phenome Wide Association Studies (PheWAS) utilizing both phecode 1.2 and phecodeX 1.0.
Reference: Tam C Tran, David J Schlueter, Chenjie Zeng, Huan Mo, Robert J Carroll, Joshua C Denny, PheWAS analysis on large-scale biobank data with PheTK, Bioinformatics, Volume 41, Issue 1, January 2025, btae719, https://doi.org/10.1093/bioinformatics/btae719
Contact: PheTK@mail.nih.gov
Releases: check GitHub Releases for the latest versions and changelogs.
🆕 WHAT'S NEW IN v0.2
Major updates in this release:
- Cox regression support - Added survival analysis capabilities alongside logistic regression
- dsub integration - Built-in support for distributed computing on Google Cloud Platform
- Forest plot visualization - New main visualization option alongside Manhattan plots
- PEP-compliant naming - Changed to lowercase package/module names (affects import syntax)
- Expanded CLI support - Added command-line interfaces for cohort and phecode modules
- Simplified CLI commands - Added entry points for easier CLI usage (e.g., phetk phewas instead of python3 -m phetk.phewas)
- Enhanced user experience - Various improvements for clarity and usability
NOTE: If you are using PheTK v0.2+, please upgrade to the latest version using pip install phetk --upgrade to fix a bug in the controls selection in Cox regression.
Version 0.1.47 is the last stable release of the 0.1 series. Users can continue to use this version, and the previous README file can be found here.
QUICK LINKS
- Installation
- 1-minute PheWAS demo
- PheTK description
- Usage examples
- System requirements & computing resources
- Platform specific tutorial(s):
- All of Us: Tutorial notebooks - Interactive Jupyter notebooks demonstrating PheTK usage on the All of Us Researcher Workbench with various analysis examples. Please note that all examples require All of Us registered user access.
- Changelogs and releases: from v0.1.45, please use GitHub Releases for the latest versions and changelogs. Legacy changelogs were archived in CHANGELOG.md.
- Resource to learn about PheWAS and phecode: The PheWAS Catalog.
1. INSTALLATION
Using pip
The latest version (v0.2+) of PheTK can be installed with pip from a terminal (note that the lowercase package name "phetk" applies from version 0.2 onward):
pip install phetk --upgrade
Users can also pin a specific version, e.g., the last stable release of the 0.1 series (note: use "PheTK" instead of "phetk" for version 0.1):
pip install PheTK==0.1.47
To check the currently installed version:
pip show phetk | grep Version
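The same check can be done from Python, for example inside a notebook where grep is not available. The sketch below relies only on the standard-library importlib.metadata module; the helper name installed_version is ours for illustration and is not part of PheTK.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="phetk"):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    # Prints the version string, or None when phetk is not installed.
    print(installed_version("phetk"))
```

Returning None instead of raising keeps the check usable in setup cells that decide whether to run pip install.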
Using Docker
Please refer to https://hub.docker.com/r/phetk/phetk/tags for the latest docker images.
docker pull phetk/phetk:latest
2. 1-MINUTE PHEWAS DEMO
Users can run the quick 1-minute PheWAS demo with the following command in a terminal:
phetk demo
Or in Jupyter Notebook:
from phetk import demo
demo.run()
The example files (example_cohort.tsv, example_phecode_counts.tsv, and example_phewas_results.tsv) generated by this demo should appear in the user's current working directory.
Users new to PheWAS can explore these files to get a sense of what data are used or generated in PheWAS with PheTK.
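One way to take a first look at the demo outputs is to print the header and a few rows of each file. The sketch below uses only the standard-library csv module and makes no assumption about column names; the helper preview_tsv is hypothetical and not part of PheTK.

```python
import csv
import os
from itertools import islice


def preview_tsv(path, n_rows=5):
    """Return (header, rows) with the first `n_rows` data rows of a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])          # first line holds the column names
        rows = list(islice(reader, n_rows))
    return header, rows


if __name__ == "__main__":
    # File names as produced by the demo in the current working directory.
    for name in (
        "example_cohort.tsv",
        "example_phecode_counts.tsv",
        "example_phewas_results.tsv",
    ):
        if not os.path.exists(name):
            print(f"{name}: not found, run the demo first")
            continue
        header, rows = preview_tsv(name)
        print(name, header)
        for row in rows:
            print("  ", row)
```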
3. DESCRIPTIONS
PheTK is a fast python library for Phenome Wide Association Studies (PheWAS) utilizing both phecode 1.2 and phecodeX 1.0.
Figure: Standard PheWAS workflow. Green italicized text marks PheTK module names. Black components are currently supported by PheTK; gray ones are not.
All of Us: the All of Us Research Program (https://allofus.nih.gov/)
4. USAGE
For detailed usage examples and documentation for each module, please refer to the individual module documentation:
- Cohort module - Generate genetic cohorts and add covariates
- Phecode module - Map ICD codes to phecodes and generate counts
- PheWAS module - Run PheWAS analysis with logistic or Cox regression
- Plot module - Generate Manhattan plots and other visualizations
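To make the phecode step concrete, the toy sketch below shows the general idea of mapping ICD events to phecodes and counting them per person. It is not PheTK's implementation or API: the function count_phecodes, the lookup table, and the placeholder codes (ICD_A, PHE_1, and so on) are all assumptions made for illustration, and PheTK's actual counting rules may differ.

```python
from collections import Counter

# Placeholder ICD -> phecode lookup; real mapping tables ship with PheTK.
TOY_ICD_TO_PHECODE = {
    "ICD_A": "PHE_1",
    "ICD_B": "PHE_1",   # several ICD codes can map to the same phecode
    "ICD_C": "PHE_2",
}


def count_phecodes(events, mapping=TOY_ICD_TO_PHECODE):
    """Count mapped phecode occurrences per (person_id, phecode).

    `events` is an iterable of (person_id, icd_code) pairs.
    ICD codes missing from `mapping` are skipped.
    """
    counts = Counter()
    for person_id, icd_code in events:
        phecode = mapping.get(icd_code)
        if phecode is not None:
            counts[(person_id, phecode)] += 1
    return dict(counts)


if __name__ == "__main__":
    events = [(1, "ICD_A"), (1, "ICD_B"), (1, "ICD_C"), (2, "ICD_X")]
    print(count_phecodes(events))
```

In a PheWAS, per-person counts like these are what determine who is treated as a case or a control for each phecode.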
5. SYSTEM REQUIREMENTS
PheTK was developed for efficient processing of large data while being resource-friendly. It was tested on different platforms from laptops to different cloud environments.
General Requirements
PheTK's resource requirements vary by usage context. The information in this section is tailored towards cloud computing platforms where large biobanks are often hosted.
- All PheTK functions run on standard machines, except by_genotype() in the Cohort module, which requires a Spark cluster (dataproc VM)
- Both logistic regression and Cox regression scale with CPU counts for faster processing. See figure S2 below from the PheTK publication for more information. In our experience, 4 CPU machines are the most cost-efficient, especially for large-scale analyses.
- For an end-to-end pipeline, the system requirements should be based on the most demanding step. For example, for the All of Us data v8, a VM with 16 CPUs and 104 GB RAM plus 2 dataproc workers at default settings should work; users who only need to run the PheWAS analysis can use a much lower configuration, as shown in figure S2.
Figure S2: Logistic regression performance benchmarks from PheTK publication showing scalability with different CPU configurations and cohort sizes.
PheWAS Module - Logistic Regression
- Minimal resources required - Can run efficiently on lightweight configurations
- Minimum tested configuration: GCP X-highcpu-4 (4 vCPUs, 8GB RAM, X=GCP machine type, e.g., c2d) or equivalent
- Uses multithreading for parallel processing with lower memory overhead
PheWAS Module - Cox Regression
- Slightly higher resources required - Uses multiprocessing which demands more memory
- Minimum tested configuration: GCP X-standard-4 (4 vCPUs, 16GB RAM, X=GCP machine type, e.g., c2d) or equivalent
- The additional memory accommodates the multiprocessing overhead for survival analysis
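The minimum tested configurations listed above for logistic and Cox regression can be encoded as a small pre-flight check. This is a sketch under stated assumptions: the helper meets_minimum and the MIN_TESTED table are ours, not part of PheTK, and RAM is passed in explicitly because detecting it portably needs platform-specific code.

```python
import os

# Minimum tested configurations quoted in this section: (vCPUs, RAM in GB).
MIN_TESTED = {
    "logistic": (4, 8),   # highcpu-4 class machine
    "cox": (4, 16),       # standard-4 class machine
}


def meets_minimum(method, vcpus, ram_gb):
    """Return True if the machine is at least as large as the minimum tested one."""
    min_cpus, min_ram = MIN_TESTED[method]
    return vcpus >= min_cpus and ram_gb >= min_ram


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    # Replace ram_gb with the memory of your VM.
    for method in MIN_TESTED:
        print(method, meets_minimum(method, cpus, ram_gb=16))
```

A machine below these sizes may still work; the table only records what was tested.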
Phecode Module (ICD Code Mapping)
- Memory requirements scale with cohort size - Large cohorts require higher memory configurations
- Recommended: For All of Us database v8 with over 500k participants, phecode mapping could be done with a 16 vCPU 104GB RAM machine.