PheTK - The Phenotype Toolkit

License: GPL v3

The official repository of PheTK, a fast Python library for Phenome-Wide Association Studies (PheWAS) that supports both phecode 1.2 and phecodeX 1.0.

Reference: Tam C Tran, David J Schlueter, Chenjie Zeng, Huan Mo, Robert J Carroll, Joshua C Denny, PheWAS analysis on large-scale biobank data with PheTK, Bioinformatics, Volume 41, Issue 1, January 2025, btae719, https://doi.org/10.1093/bioinformatics/btae719

Contact: PheTK@mail.nih.gov

Releases: check GitHub Releases for the latest versions and changelogs.

🆕 WHAT'S NEW IN v0.2

Major updates in this release:

  • Cox regression support - Added survival analysis capabilities alongside logistic regression
  • dsub integration - Built-in support for distributed computing on Google Cloud Platform
  • Forest plot visualization - New main visualization option alongside Manhattan plots
  • PEP-compliant naming - Changed to lowercase package/module names (affects import syntax)
  • Expanded CLI support - Added command-line interfaces for cohort and phecode modules
  • Simplified CLI commands - Added entry points for easier CLI usage (e.g., phetk phewas instead of python3 -m phetk.phewas)
  • Enhanced user experience - Various improvements for clarity and usability
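The two CLI forms mentioned above (phetk phewas and python3 -m phetk.phewas) can be chosen between programmatically. The following is a minimal sketch, not part of PheTK: the helper name phetk_command is hypothetical, and it only assumes that the phetk entry point is on PATH when v0.2+ is installed.

```python
import shutil
import sys


def phetk_command(module: str) -> list[str]:
    """Build an argv list for a PheTK CLI module (hypothetical helper).

    Prefers the `phetk` entry point when it is found on PATH and
    falls back to the `python -m phetk.<module>` form otherwise.
    """
    if shutil.which("phetk"):
        return ["phetk", module]
    return [sys.executable, "-m", f"phetk.{module}"]


if __name__ == "__main__":
    # Prints the command that would be used on this machine.
    print(phetk_command("phewas"))
```

The returned list can be passed to subprocess.run together with whatever arguments the chosen module accepts.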

NOTE: If you are using PheTK v0.2+, please upgrade to the latest version using pip install phetk --upgrade to fix a bug in the controls selection in Cox regression.

📋 View full changelog

Version 0.1.47 is the last stable release of version 0.1. Users can continue to use this version, and the previous README file remains available for reference.


1. INSTALLATION

Using pip

The latest version (v0.2+) of PheTK can be installed with pip in a terminal (note that the lowercase package name "phetk" applies from version 0.2 onward):

pip install phetk --upgrade

Users can also specify a version, e.g., the last stable release of version 0.1 (note: use "PheTK" instead of "phetk" for version 0.1):

pip install PheTK==0.1.47

To check the currently installed version:

pip show phetk | grep Version
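Where grep is not available (for example, in a notebook or a Windows shell), the same check can be done from Python with the standard library. This is a minimal sketch; the helper name installed_version is hypothetical, and only importlib.metadata is used.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "phetk"):
    """Return the installed version string, or None if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "phetk is not installed")
```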

Using Docker

Please refer to https://hub.docker.com/r/phetk/phetk/tags for the latest docker images.

docker pull phetk/phetk:latest

2. 1-MINUTE PHEWAS DEMO

Users can run the quick 1-minute PheWAS demo with the following command in a terminal:

phetk demo

Or in Jupyter Notebook:

from phetk import demo

demo.run()

The example files generated by this demo (example_cohort.tsv, example_phecode_counts.tsv, and example_phewas_results.tsv) are written to the user's current working directory. Users who are new to PheWAS can explore these files to get a sense of the data used and generated in a PheWAS with PheTK.
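One quick way to explore the demo files is to print their column names and row counts. The sketch below uses only the standard library and makes no assumption about which columns the files contain; the helper name summarize_tsv is hypothetical.

```python
import csv
from pathlib import Path

# File names as listed in the demo description above.
DEMO_FILES = [
    "example_cohort.tsv",
    "example_phecode_counts.tsv",
    "example_phewas_results.tsv",
]


def summarize_tsv(path):
    """Return (column_names, number_of_data_rows) for a tab-separated file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])        # first line holds the column names
        n_rows = sum(1 for _ in reader)  # remaining lines are data rows
    return header, n_rows


if __name__ == "__main__":
    for name in DEMO_FILES:
        if Path(name).exists():
            header, n_rows = summarize_tsv(name)
            print(f"{name}: {n_rows} rows, columns = {header}")
        else:
            print(f"{name}: not found (run the demo first)")
```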

3. DESCRIPTIONS

PheTK is a fast Python library for Phenome-Wide Association Studies (PheWAS) that supports both phecode 1.2 and phecodeX 1.0.

Figure: PheWAS workflow and PheTK modules. The figure shows a standard PheWAS workflow. Green italicized text denotes PheTK module names. Black components are supported by PheTK, while gray ones are not currently supported.

All of Us: the All of Us Research Program (https://allofus.nih.gov/)

4. USAGE

For detailed usage examples and documentation for each module, please refer to the individual module documentation:

  • Cohort module - Generate genetic cohorts and add covariates
  • Phecode module - Map ICD codes to phecodes and generate counts
  • PheWAS module - Run PheWAS analysis with logistic or Cox regression
  • Plot module - Generate Manhattan plots and other visualizations

5. SYSTEM REQUIREMENTS

PheTK was developed to process large data efficiently while remaining resource-friendly. It has been tested on platforms ranging from laptops to a variety of cloud environments.

General Requirements

PheTK's resource requirements vary by usage context. The information in this section is tailored towards cloud computing platforms where large biobanks are often hosted.

  • All PheTK functions run on standard machines, except by_genotype() in the Cohort module, which requires a Spark cluster (Dataproc VM).
  • Both logistic regression and Cox regression scale with CPU count for faster processing. See Figure S2 below, from the PheTK publication, for more information. In our experience, 4-CPU machines are the most cost-efficient, especially for large-scale analyses.
  • For an end-to-end pipeline, the system requirements should be based on the most demanding step. For example, for All of Us data v8, a VM with 16 CPUs and 104 GB RAM plus 2 Dataproc workers at default settings should work; users who only need to run the PheWAS analysis can use a much lower configuration, as shown in Figure S2.

Figure S2: Logistic regression performance benchmarks from the PheTK publication, showing scalability with different CPU configurations and cohort sizes.

PheWAS Module - Logistic Regression

  • Minimal resources required - Can run efficiently on lightweight configurations
  • Minimum tested configuration: GCP X-highcpu-4 (4 vCPUs, 8 GB RAM; X is the GCP machine series, e.g., c2d) or equivalent
  • Uses multithreading for parallel processing with lower memory overhead

PheWAS Module - Cox Regression

  • Slightly higher resources required - Uses multiprocessing which demands more memory
  • Minimum tested configuration: GCP X-standard-4 (4 vCPUs, 16 GB RAM; X is the GCP machine series, e.g., c2d) or equivalent
  • The additional memory accommodates the multiprocessing overhead for survival analysis
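The minimum tested configurations for logistic and Cox regression above can be encoded as a small lookup for a pre-flight check before launching a job. This is an illustrative sketch only: the names MIN_TESTED and meets_minimum are hypothetical and not part of PheTK, and the numbers are simply the ones quoted in this section.

```python
import os

# Minimum tested configurations quoted in this section: (vCPUs, RAM in GB).
MIN_TESTED = {
    "logistic": (4, 8),
    "cox": (4, 16),
}


def meets_minimum(method: str, cpus: int, ram_gb: float) -> bool:
    """Check a machine spec against the minimum tested configuration."""
    min_cpus, min_ram_gb = MIN_TESTED[method]
    return cpus >= min_cpus and ram_gb >= min_ram_gb


if __name__ == "__main__":
    # os.cpu_count() reports the CPUs visible to Python on this machine.
    print("CPUs visible to Python:", os.cpu_count())
    print("4 vCPUs / 8 GB OK for Cox?", meets_minimum("cox", cpus=4, ram_gb=8))
```

A 4 vCPU, 8 GB machine passes the logistic check but not the Cox check, which matches the note that multiprocessing demands more memory.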

Phecode Module (ICD Code Mapping)

  • Memory requirements scale with cohort size - Large cohorts require higher memory configurations
  • Recommended: For All of Us database v8 with over 500k participants, phecode mapping could be done with a 16 vCPU 104GB RAM machine.
