KGWAS

Genetics discovery powered by functional genomics knowledge graph

Genome-wide association studies (GWASs) have identified tens of thousands of disease-associated variants and provided critical insights into developing effective treatments. However, limited sample sizes have hindered the discovery of variants for less common and rare diseases. Here, we introduce KGWAS, a novel geometric deep learning method that leverages a massive functional knowledge graph across variants and genes to significantly improve detection power in small-cohort GWASs.

Installation

Install PyTorch Geometric by following its installation instructions, then run:

pip install KGWAS

Core KGWAS API Usage

from kgwas import KGWAS, KGWAS_Data
data = KGWAS_Data(data_path = './data') ## initialize KGWAS data class with data path

data.load_kg() ## load the knowledge graph
data.load_external_gwas(PATH) ## load the GWAS file
data.process_gwas_file() ## process the GWAS file
data.prepare_split() ## prepare the train/val/test split

run = KGWAS(data, device = 'cuda:0', seed = 1) ## initialize KGWAS model
run.initialize_model()

run.train(epoch = 10) ## train the model

Data download

To keep the user experience fast, KGWAS defaults to a fast mode, which uses Enformer embeddings for variant features and ESM embeddings for gene features (instead of baselineLD for variants and PoPS for genes, which are large files). In fast mode you do not need to download any data yourself; the KGWAS API automatically downloads the relevant files. This mode can be used to apply KGWAS to your own GWAS sumstats.

Use data.download_all_data() if you want to (1) use the full mode of KGWAS (i.e. larger node embeddings), (2) access the null/causal simulations, (3) access the 21 subsampled GWAS sumstats across various sample sizes, (4) analyze the KGWAS sumstats for subsampled data, or (5) analyze the KGWAS sumstats for all UKBB ICD10 diseases. Note that the download is large (around 70GB) and may take a while.
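
Because the full download is around 70GB, it can help to check free disk space before calling data.download_all_data(). The sketch below uses only the Python standard library; the helper name has_free_space and the 70 GB default threshold are illustrative assumptions and not part of the KGWAS API.

```python
import shutil


def has_free_space(path: str, required_gb: float = 70.0) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9  # decimal gigabytes


if __name__ == "__main__":
    # Hypothetical usage: check the current directory before a full download.
    if has_free_space("."):
        print("Enough free space for the full KGWAS download.")
    else:
        print("Less than ~70 GB free; consider pointing data_path elsewhere.")
```

The unzipped data may need more room than the archive itself, so treat the threshold as a lower bound.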

Tutorial

Notebook | Try on Colab | Description
Introduction & Apply KGWAS to your own sumstats | TODO | Tutorial on key KGWAS API and functionalities.
Use alternative variant/gene/program embedding | TODO | Tutorial on using alternative variant/gene/program embedding.
Simulation analysis | TODO | Tutorial on the simulation analysis.
Subsampling analysis | TODO | Tutorial on the subsampling analysis.

Extended API Usage

KGWAS_Data class

data = KGWAS_Data(data_path = './data')

  • data_path: specify the path to the data folder. If not specified, the default path is ./data. If you use the full mode, unzip the data and use the path to the unzipped folder.

data.load_kg(snp_init_emb = 'enformer', go_init_emb = 'random', gene_init_emb = 'esm', sample_edges = False, sample_ratio = 1): load KGWAS knowledge graph and node embeddings

  • snp_init_emb: specify the variant embedding method. Options are enformer (default), baselineLD, SLDSC, cadd, kg, random
  • go_init_emb: specify the gene ontology embedding method. Options are random (default), biogpt, kg
  • gene_init_emb: specify the gene embedding method. Options are esm (default), pops_expression, pops, kg, random
  • sample_edges: whether to sample edges from the knowledge graph. Default is False
  • sample_ratio: the ratio of edges to sample. Default is 1
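
Since the embedding options are plain strings, a typo may only surface once the graph starts loading. As a minimal sketch, the option names listed above can be checked up front; the helper validate_load_kg_kwargs is hypothetical and not part of the KGWAS API, and it only checks the three embedding arguments.

```python
# Option values copied from the parameter list above.
LOAD_KG_OPTIONS = {
    "snp_init_emb": {"enformer", "baselineLD", "SLDSC", "cadd", "kg", "random"},
    "go_init_emb": {"random", "biogpt", "kg"},
    "gene_init_emb": {"esm", "pops_expression", "pops", "kg", "random"},
}


def validate_load_kg_kwargs(**kwargs):
    """Raise ValueError for an undocumented embedding option; return kwargs unchanged otherwise."""
    for name, value in kwargs.items():
        allowed = LOAD_KG_OPTIONS.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(
                f"{name}={value!r} is not a documented option; expected one of {sorted(allowed)}"
            )
    return kwargs


# Hypothetical usage with a KGWAS_Data instance named `data`:
# data.load_kg(**validate_load_kg_kwargs(snp_init_emb="baselineLD", gene_init_emb="pops"))
```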

data.load_external_gwas(path, seed = 42): load external/your own GWAS file

  • path: specify the path to the GWAS file. The expected columns are CHR, SNP, P, and N; the SNP column should contain rs IDs.
  • seed: specify the seed for the data split. Default is 42
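
Before passing a file to data.load_external_gwas, it may be worth checking that it has the expected columns and rs IDs. The sketch below is a standalone check using the standard library only. It assumes a tab-delimited text file with a header row, which the source does not specify, and the helper name check_sumstats is hypothetical.

```python
import csv
import re

REQUIRED_COLUMNS = ("CHR", "SNP", "P", "N")  # columns listed in the text above
RSID_PATTERN = re.compile(r"^rs\d+$")


def check_sumstats(path, delimiter="\t"):
    """Return a list of problems found in a sumstats file (empty list if none)."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        header = reader.fieldnames or []
        missing = [col for col in REQUIRED_COLUMNS if col not in header]
        if missing:
            return [f"missing columns: {', '.join(missing)}"]
        for line_no, row in enumerate(reader, start=2):  # line 1 is the header
            snp = row["SNP"] or ""
            if not RSID_PATTERN.match(snp):
                problems.append(f"line {line_no}: SNP {snp!r} is not an rs ID")
            try:
                p_value = float(row["P"])
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: P is not numeric")
                continue
            if not 0.0 <= p_value <= 1.0:
                problems.append(f"line {line_no}: P={p_value} is outside [0, 1]")
    return problems


# Hypothetical usage: only load the file if no problems are reported.
# if not check_sumstats(PATH):
#     data.load_external_gwas(PATH)
```

Pass a different delimiter (for example "," or " ") if your sumstats file is not tab-delimited.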

data.load_full_gwas(pheno, seed): load full-cohort GWAS files already run in KGWAS. Note that this requires full data download.

  • pheno: specify the phenotype to load. Use data.get_pheno_list() to see all available phenotypes.

data.load_gwas_subsample(pheno, sample_size, seed): load subsampled GWAS files already run in KGWAS. Note that this requires full data download.

  • pheno: specify the phenotype to load. Use data.get_pheno_list()["21_indep_traits"] to see all available phenotypes.
  • sample_size: specify the sample size to load. Available values are 1000, 2500, 5000, 7500, 10000, 50000, 100000, and 200000.
  • seed: specify the seed for the data split. Available values are 1, 2, 3, 4, and 5.
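
To sweep over every documented sample size and seed, the combinations can be enumerated up front. This is a minimal sketch; the helper subsample_grid and the placeholder phenotype name are hypothetical, and real phenotype names should come from data.get_pheno_list()["21_indep_traits"].

```python
from itertools import product

# Values copied from the parameter list above.
SAMPLE_SIZES = [1000, 2500, 5000, 7500, 10000, 50000, 100000, 200000]
SEEDS = [1, 2, 3, 4, 5]


def subsample_grid(phenos):
    """Return every (pheno, sample_size, seed) combination as a list of tuples."""
    return list(product(phenos, SAMPLE_SIZES, SEEDS))


grid = subsample_grid(["EXAMPLE_PHENO"])  # placeholder, not a real phenotype name
print(len(grid))  # 8 sample sizes x 5 seeds = 40 combinations per phenotype

# Hypothetical usage with a KGWAS_Data instance named `data`:
# for pheno, sample_size, seed in grid:
#     data.load_gwas_subsample(pheno, sample_size, seed)
```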

data.load_simulation_gwas(simulation_type, seed): load the null and causal simulation data

  • simulation_type: specify the simulation type. Options are null and causal.
  • seed: specify the seed for the data split. Values range from 1 to 500.

data.process_gwas_file(): process the GWAS file for training

data.prepare_split(test_set_fraction_data = 0.05): prepare the train/val/test split

  • test_set_fraction_data: specify the fraction of data to use as the test set. Default is 0.05

KGWAS class

run = KGWAS(data, weight_bias_track = False, device = 'cuda', proj_name = 'KGWAS', exp_name = 'KGWAS', seed = 42): initialize KGWAS model

  • data: specify the KGWAS data class
  • weight_bias_track: whether to track training with Weights & Biases. Default is False
  • device: specify the device to run the model. Default is cuda
  • proj_name: specify the project name. Default is KGWAS
  • exp_name: specify the experiment name. Default is KGWAS
  • seed: specify the seed for the model. Default is 42

run.initialize_model(gnn_num_layers = 2, gnn_hidden_dim = 128, gnn_backbone = 'GAT', gnn_aggr = 'sum', gat_num_head = 1): initialize the KGWAS model

  • gnn_num_layers: specify the number of GNN layers. Default is 2
  • gnn_hidden_dim: specify the hidden dimension of the GNN. Default is 128
  • gnn_backbone: specify the GNN backbone. Options are GAT (default), GCN, SAGE, SGC
  • gnn_aggr: specify the GNN aggregation method. Options are sum (default), mean, min, max, cat
  • gat_num_head: specify the number of GAT heads. Default is 1

run.load_pretrained(path): load pretrained model

  • path: specify the path to the pretrained model

run.train(batch_size = 512, num_workers = 6, lr = 1e-4, weight_decay = 5e-4, epoch = 10, save_best_model = False, save_name = None, data_to_cuda = False): train the model

  • batch_size: specify the batch size. Default is 512. If you get a CUDA out-of-memory (OOM) error, reduce the batch size.
  • num_workers: specify the number of workers for data loading. Default is 6
  • lr: specify the learning rate. Default is 1e-4
  • weight_decay: specify the weight decay. Default is 5e-4
  • epoch: specify the number of epochs. Default is 10
  • save_best_model: whether to save the best model. Default is False
  • save_name: specify the name to save the model. Default is run.exp_name
  • data_to_cuda: whether to move the data to CUDA. Default is False. Setting it to True speeds up training but uses somewhat more CUDA memory.
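
The advice to reduce batch_size on a CUDA OOM error can be automated with a small retry loop. The sketch below is generic and does not import KGWAS: train_with_backoff is a hypothetical helper that takes any training callable accepting a batch_size keyword. It relies on PyTorch raising a RuntimeError whose message contains "out of memory". Whether a KGWAS run can be safely retried after a failed train call is not confirmed by the source, so treat this as a starting point.

```python
def train_with_backoff(train_fn, batch_size=512, min_batch_size=32, **kwargs):
    """Call train_fn(batch_size=..., **kwargs), halving batch_size after each OOM error.

    Returns the batch size that succeeded. Re-raises any other error, and the OOM
    error itself once halving would drop below min_batch_size.
    """
    while True:
        try:
            train_fn(batch_size=batch_size, **kwargs)
            return batch_size
        except RuntimeError as err:
            is_oom = "out of memory" in str(err).lower()
            if not is_oom or batch_size // 2 < min_batch_size:
                raise
            batch_size //= 2  # retry with half the batch size


# Hypothetical usage with a KGWAS instance named `run`:
# used = train_with_backoff(run.train, batch_size=512, epoch=10)
```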

Cite Us

@misc{kgwas,
      title={Small-cohort GWAS discovery with AI over massive functional genomics knowledge graph},
      author={Kexin Huang and Tony Zeng and Soner Koc and Alexandra Pettet and Jingtian Zhou and Mika Jain and Dongbo Sun and Camilo Ruiz and Hongyu Ren and Laurence Howe and Tom Richardson and Adrian Cortes and Katie Aiello and Kim Branson and Andreas Pfenning and Jesse Engreitz and Martin Jinye Zhang and Jure Leskovec},
      year={2024}
}

