AoUPRS is a Python module for calculating Polygenic Risk Scores (PRS) specific to the All of Us study

AoUPRS

Overview

AoUPRS is a Python module designed for calculating Polygenic Risk Scores (PRS) specific to the All of Us study. This tool leverages Hail, a scalable framework for exploring and analyzing genomic data, to provide efficient PRS calculations.

AoUPRS provides two approaches for PRS calculation (see the publication cited under Citation for details):

Approach 1: Using Hail Dense MatrixTable (MT)

Approach 2: Using Hail Sparse Variant Dataset (VDS)

⚠️ Dataset Compatibility Update (v8)

🧬 AoUPRS now supports both v7 and v8 of the All of Us Controlled Tier WGS dataset.

🔄 Key change in v8: The GT field is no longer present in the VDS.
✅ AoUPRS now infers genotype calls using the new fields:

  • LGT – local genotype index
  • LA – local alleles array

This ensures seamless and cost-effective PRS calculation using the new, sparser v8 VDS format.
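
The relationship between LGT and LA can be illustrated in plain Python. This is only a conceptual sketch, not AoUPRS's internal code: the function name `local_to_global` and the example values are hypothetical. Each index in LGT points into that entry's LA array, and each element of LA is an index into the variant's full allele list (0 being the reference). In Hail itself, `hl.vds.lgt_to_gt` performs this conversion on expressions.

```python
def local_to_global(lgt, la):
    """Map local genotype indices (LGT) to global allele indices via LA."""
    if lgt is None:
        return None  # missing call stays missing
    return tuple(la[i] for i in lgt)


# Hypothetical site with four alleles; this sample's entry only lists
# the reference (global index 0) and the third alternate (global index 3).
la = [0, 3]
print(local_to_global((0, 1), la))  # (0, 3): heterozygous ref / third alt
print(local_to_global((1, 1), la))  # (3, 3): homozygous third alt
```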

Installation

To install AoUPRS from PyPI, run the following command:

pip install AoUPRS

Dependencies

AoUPRS requires the following Python packages:

  • hail
  • gcsfs
  • pandas

These dependencies will be installed automatically when you install AoUPRS.

Usage

  1. Set up your AoU cloud analysis environment by selecting the "Hail Genomic Analysis" environment and allocating the required resources.

    How to set up a Dataproc cluster:

    • Hail MT: Requires more resources. In our experience, you need to allocate 300 workers. This is expensive, but it ends up saving time and money because the kernel crashes with fewer resources.

      Cost when running: $72.91 per hour
      Main node: 4 CPUs, 15 GB RAM, 150 GB disk
      Workers (300): 4 CPUs, 15 GB RAM, 150 GB disk

    • Hail VDS: The default resources usually suffice. If you have a large score and want it to run faster, add preemptible workers, which are much cheaper.

      Cost when running: $0.73 per hour
      Main node: 4 CPUs, 15 GB RAM, 150 GB disk
      Workers (2): 4 CPUs, 15 GB RAM, 150 GB disk
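
To compare the two configurations for a planned run, the hourly rates listed above can be turned into a rough estimate. This is a back-of-the-envelope sketch: the function name `estimate_cost` and the two-hour runtime are hypothetical, and actual Workbench billing may differ.

```python
MT_RATE_USD_PER_HOUR = 72.91   # Hail MT cluster with 300 workers (rate quoted above)
VDS_RATE_USD_PER_HOUR = 0.73   # Hail VDS cluster with 2 workers (rate quoted above)


def estimate_cost(rate_per_hour, hours):
    """Return the estimated cluster cost in USD for a run lasting `hours`."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return round(rate_per_hour * hours, 2)


# Hypothetical two-hour run on each configuration
print("MT: ", estimate_cost(MT_RATE_USD_PER_HOUR, 2))
print("VDS:", estimate_cost(VDS_RATE_USD_PER_HOUR, 2))
```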

Note: AoUPRS can save output files locally or to the cloud. We recommend always saving to the cloud, because local files are deleted when the Hail environment is deleted.

  2. To query the Variant Annotation Table before calculating a PRS from Hail VDS, so that only variants present in the callset are included, follow this notebook.

  3. Import the packages

    To use AoUPRS, first import it along with the supporting packages:

import AoUPRS
import os
import pandas as pd
import numpy as np
from datetime import datetime
import gcsfs
import glob
import hail as hl
  4. Initialize Hail

hl.init(tmp_dir='hail_temp/', default_reference='GRCh38')

  5. Define the bucket

bucket = os.getenv("WORKSPACE_BUCKET")

  6. Read the Hail MT / VDS

# Hail MT

mt_wgs_path = os.getenv("WGS_ACAF_THRESHOLD_MULTI_HAIL_PATH")
mt = hl.read_matrix_table(mt_wgs_path)

# Hail VDS

vds_srwgs_path = os.getenv("WGS_VDS_PATH")
vds = hl.vds.read_vds(vds_srwgs_path)

  7. Drop flagged srWGS samples

    AoU provides a table listing samples that are flagged as part of the sample outlier QC for the srWGS SNP and Indel joint callset.

    Read more: How the All of Us Genomic data are organized

# Read flagged samples (this path is for the v7 dataset; adjust it for the dataset version you are using)

flagged_samples_path = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/relatedness/relatedness_flagged_samples.tsv"

# Save flagged samples locally

!gsutil -u $$GOOGLE_PROJECT cat $flagged_samples_path > flagged_samples.csv

# Import flagged samples into a hail table

flagged_samples = hl.import_table(flagged_samples_path, key='sample_id')

# Drop flagged samples from the main Hail dataset

## If Hail MT
mt = mt.anti_join_cols(flagged_samples)

## If Hail VDS:
vds_no_flag = hl.vds.filter_samples(vds, flagged_samples, keep=False)

  8. Define the sample

# For MT:

## `sample_ids` is assumed to be a pandas DataFrame with a 'person_id' column
## Convert the sample IDs to a Python set of strings
subset_sample_ids_set = set(map(str, sample_ids['person_id'].tolist()))
## Filter samples
mt = mt.filter_cols(hl.literal(subset_sample_ids_set).contains(mt.s))

# For VDS:

## Import the sample as a Hail table
sample_needed_ht = hl.import_table('sample_ids.csv', delimiter=',', key='person_id')
## Filter samples
vds_subset = hl.vds.filter_samples(vds_no_flag, sample_needed_ht, keep=True)
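
The VDS branch above reads `sample_ids.csv` with a `person_id` column. One way to produce such a file is sketched below using only the standard library. The helper name `write_sample_ids` is hypothetical and not part of AoUPRS. IDs are written as strings, which matches both the `map(str, ...)` conversion in the MT branch and the default string typing of `hl.import_table`.

```python
import csv


def write_sample_ids(person_ids, path="sample_ids.csv"):
    """Write unique IDs as strings under a 'person_id' header; return the count."""
    unique_ids = sorted({str(pid) for pid in person_ids})
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["person_id"])
        writer.writerows([pid] for pid in unique_ids)
    return len(unique_ids)
```

For example, `write_sample_ids(sample_ids['person_id'])` would write the IDs from the DataFrame used in the MT branch, assuming that DataFrame exists in your notebook.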

  9. Prepare the PRS weight table

    The weight table must have these columns:

    ["chr", "bp", "effect_allele", "noneffect_allele", "weight"]

    The table below shows an example of a PRS weight table:

    chr   bp          effect_allele   noneffect_allele   weight
    2     202881162   C               T                  1.57E-01
    14    996676      C               T                  6.77E-02
    14    99667605    C               T                  6.77E-02
    6     12903725    G               A                  1.13E-01
    13    110308365   G               A                  6.77E-02
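
Before passing a file to AoUPRS, it can help to confirm that it has the required columns. The following is a minimal sketch using only the standard library. The function name `check_weight_table` is hypothetical, the input is assumed to be comma-separated, and treating repeated variants as a problem is an assumption of this sketch rather than a documented AoUPRS rule.

```python
import csv
import io

REQUIRED_COLUMNS = ["chr", "bp", "effect_allele", "noneffect_allele", "weight"]


def check_weight_table(handle):
    """Return a list of problems found in a CSV weight table (empty if none)."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    problems, seen = [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        key = (row["chr"], row["bp"], row["effect_allele"], row["noneffect_allele"])
        if key in seen:
            problems.append(f"line {line_no}: duplicate variant {key}")
        seen.add(key)
        try:
            float(row["weight"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: non-numeric weight {row['weight']!r}")
    return problems


example = io.StringIO(
    "chr,bp,effect_allele,noneffect_allele,weight\n"
    "2,202881162,C,T,1.57E-01\n"
    "6,12903725,G,A,1.13E-01\n"
)
print(check_weight_table(example))  # [] when no problems are found
```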
# Prepare PRS weight table using function 'prepare_prs_table'

AoUPRS.prepare_prs_table('PGS######_table.csv',
'PGS######_weight_table.csv', bucket=bucket)

# Read PRS weight table ('PGS######' is a placeholder for your score ID)

with gcsfs.GCSFileSystem().open('PGS######_weight_table.csv', 'rb') as gcs_file:
    weights_table = pd.read_csv(gcs_file)

  10. Calculate PRS

# Define paths

prs_identifier = 'PGS######'
pgs_weight_path = 'PGS######_weight_table.csv'
output_path = 'PGS######'

# Calculate PRS

## MT:
AoUPRS.calculate_prs_mt(mt, prs_identifier, pgs_weight_path, output_path, bucket=None, save_found_variants=False)

## VDS:
AoUPRS.calculate_prs_vds(vds_subset, prs_identifier, pgs_weight_path, output_path, bucket=bucket, save_found_variants=True)
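
For orientation, a polygenic risk score is conventionally the sum, over the variants in the weight table, of each weight multiplied by the number of effect alleles a sample carries. The sketch below shows that arithmetic for a single sample in plain Python. It is a conceptual illustration, not the AoUPRS or Hail implementation: the function names, the data layout, and the choice to skip variants without a call are all assumptions of this example.

```python
def effect_allele_dosage(genotype, effect_allele):
    """Count copies of the effect allele in a genotype given as allele strings."""
    return sum(1 for allele in genotype if allele == effect_allele)


def polygenic_score(genotypes, weights):
    """Sum weight * dosage over variants present in both inputs.

    genotypes: {(chr, bp): (allele1, allele2)}
    weights:   {(chr, bp): (effect_allele, weight)}
    """
    score = 0.0
    for variant, (effect_allele, weight) in weights.items():
        genotype = genotypes.get(variant)
        if genotype is None:
            continue  # no call for this variant; skipped in this sketch
        score += weight * effect_allele_dosage(genotype, effect_allele)
    return score


# Two variants from the example weight table, with made-up genotypes
weights = {("2", 202881162): ("C", 0.157), ("6", 12903725): ("G", 0.113)}
genotypes = {("2", 202881162): ("C", "T"), ("6", 12903725): ("G", "G")}
print(polygenic_score(genotypes, weights))  # 0.157 * 1 + 0.113 * 2
```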

Example Notebooks

For detailed examples, refer to the Jupyter notebooks in the notebooks directory. These notebooks demonstrate step by step how to use the AoUPRS package to calculate PRS.

🚀 Try it on the All of Us Researcher Workbench

This tool is live and fully executable in a public workspace:

🔗 Launch AoUPRS on the All of Us Researcher Workbench

You can explore, duplicate, and run the included notebooks with no setup required.

License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

If you use AoUPRS in your research, please cite:

Khattab A, Chen S-F, Wineinger N, Torkamani A. AoUPRS: A Cost-Effective and Versatile PRS Calculator for the All of Us Program. BMC Genomics. 2025;26:412. https://doi.org/10.1186/s12864-025-11693-9

Author

Ahmed Khattab

