AoUPRS
🌟 Overview
AoUPRS is a Python module designed for calculating Polygenic Risk Scores (PRS) on the All of Us dataset.
It leverages Hail, a scalable framework for genomic data, to provide efficient and cost-effective PRS calculations.
AoUPRS provides two approaches:
- MatrixTable (MT) – dense representation
- Variant Dataset (VDS) – sparse representation (recommended for v8)
📄 Publication in BMC Genomics (2025)
⚠️ Dataset Compatibility (v7 & v8)
- AoUPRS supports both v7 and v8 Controlled Tier WGS datasets.
- Key change in v8: the `GT` field was removed. AoUPRS now reconstructs genotypes using `LGT` (local genotype index) and `LA` (local alleles array).
This ensures PRS calculations remain accurate and efficient with the sparser v8 VDS format.
🚀 Performance Update (v8)
Why is PRS calculation slower in v8?
- v7 ≈ 1.7B variants
- v8 ≈ 4.9B variants (3× larger)
- Bigger callset = heavier interval queries, more I/O, more shuffle overhead.
Chunking
- Large scores (>1M SNPs) must be chunked.
- Example runtimes [8 CPU / 52 GB driver, 2/10 workers (4 CPU / 15 GB RAM / 150 GB disk)]:
- 50k SNPs → 7–8 min
- 100k SNPs → 15 min
- 150k SNPs → 52 min ⚠️ nonlinear slowdown
👉 Best practice: chunk_size=50000
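To illustrate what chunking amounts to (the helper below is a hypothetical sketch, not AoUPRS's API), a large weight table can be split into 50k-row chunks with pandas:

```python
import pandas as pd

def split_weight_table(df: pd.DataFrame, chunk_size: int = 50000) -> list:
    """Split a PRS weight table into chunks of at most `chunk_size` rows."""
    return [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

# Toy example: 120k rows -> chunks of 50k, 50k, 20k
weights = pd.DataFrame({"chr": "1", "bp": range(120000),
                        "effect_allele": "A", "noneffect_allele": "G",
                        "weight": 0.01})
chunks = split_weight_table(weights)
print([len(c) for c in chunks])  # [50000, 50000, 20000]
```

Passing `chunk_size=50000` to `calculate_prs_vds` has AoUPRS do this internally.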
Workers
- Scaling up workers does not help.
- Sweet spot: 10 preemptible workers.
- More workers = more shuffle overhead + stragglers.
Cost Example
- Master: 8 CPU / 52 GB RAM
- Workers: 2/10 × (4 CPU / 15 GB RAM / 150 GB disk)
- Cost: ~$1.95/hour running, ~$0.11/hour paused
- ~1M SNP PRS (20 chunks) ≈ 3h wall time, ~$6
🛠️ Resume & Checkpointing
AoUPRS now supports chunked execution with resume:
- Each chunk is saved immediately (`PGS######_chunkN.csv`).
- If the environment crashes, rerunning will skip completed chunks and continue.
- At the end, all chunks are merged into the final PRS results file.
👉 This makes long PRS runs on v8 robust and restartable.
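The resume logic boils down to checking which chunk files already exist. The sketch below is illustrative only (the directory, PGS identifier, and naming scheme are assumptions, not AoUPRS internals):

```python
import os

def pending_chunks(output_dir: str, prs_id: str, n_chunks: int) -> list:
    """Return chunk indices that have no saved CSV yet (hypothetical naming scheme)."""
    return [i for i in range(1, n_chunks + 1)
            if not os.path.exists(os.path.join(output_dir, f"{prs_id}_chunk{i}.csv"))]

# Toy example: chunks 1 and 3 already saved, so only chunk 2 remains
os.makedirs("demo_out", exist_ok=True)
for i in (1, 3):
    open(os.path.join("demo_out", f"PGS000001_chunk{i}.csv"), "w").close()
print(pending_chunks("demo_out", "PGS000001", 3))  # [2]
```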
🔧 Installation
To install AoUPRS, run the following command:
pip install AoUPRS
Dependencies
AoUPRS requires the following Python packages:
- hail
- gcsfs
- pandas
These dependencies will be installed automatically when you install AoUPRS.
📘 Usage Guide
1. Setup Environment
Select Hail Genomic Analysis environment.
Dataproc cluster options:
- Hail MT
- Requires many resources (not recommended for v8)
- ~300 workers
- Cost: ~$72.91/hour
- Hail VDS (recommended)
- Works reliably with modest resources
- Best setup for large PRS runs:
- Master node: 8 CPUs / 52 GB RAM
- Workers: 10 × (4 CPUs / 15 GB RAM / 150 GB disk, preemptible)
- Cost: ~$1.95/hour running, ~$0.11/hour paused
- Runtime example: ~1M SNP PRS (20 × 50k chunks) ≈ 3 hours wall time, ~$6 total
👉 AoUPRS gives you the option to save the output files locally or to the cloud. We recommend always saving to the cloud, since local files are deleted when the Hail environment is deleted.
2. Query VAT (optional)
Before calculating PRS, you may want to restrict your weight table to variants that are actually present in the All of Us Variant Annotation Table (VAT).
This ensures you are only scoring variants found in the callset.
📓 Example notebook:
Query VAT and filter PRS weights
👉 Skipping this step will still work, but may include variants not found in AoU.
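The VAT filtering step is, in essence, an inner join of the weight table against the variants present in the callset. A minimal pandas sketch, with made-up stand-in data (the real VAT is queried as shown in the linked notebook):

```python
import pandas as pd

# Hypothetical stand-ins: a small weight table and a (chr, bp) extract from the VAT
weights = pd.DataFrame({
    "chr": ["1", "2", "2"], "bp": [100, 200, 300],
    "effect_allele": ["A", "C", "G"], "noneffect_allele": ["G", "T", "A"],
    "weight": [0.10, 0.05, 0.02],
})
vat_variants = pd.DataFrame({"chr": ["1", "2"], "bp": [100, 300]})

# Keep only weight rows whose (chr, bp) appear in the VAT extract
filtered = weights.merge(vat_variants, on=["chr", "bp"], how="inner")
print(len(filtered))  # 2
```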
3. Import Packages
import AoUPRS
import os
import pandas as pd
import numpy as np
from datetime import datetime
import gcsfs
import glob
import hail as hl
4. Initiate Hail
hl.init(tmp_dir='hail_temp/', default_reference='GRCh38')
5. Define Bucket
bucket = os.getenv("WORKSPACE_BUCKET")
6. Read Hail MT / VDS
Hail MT
mt_wgs_path = os.getenv("WGS_ACAF_THRESHOLD_MULTI_HAIL_PATH")
mt = hl.read_matrix_table(mt_wgs_path)
Hail VDS
vds_srwgs_path = os.getenv("WGS_VDS_PATH")
vds = hl.vds.read_vds(vds_srwgs_path)
7. Drop Flagged srWGS samples (optional)
AoU provides a table listing samples that are flagged as part of the sample outlier QC for the srWGS SNP and Indel joint callset.
Read more: How the All of Us Genomic data are organized
# Read flagged samples
flagged_samples_path = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/relatedness/relatedness_flagged_samples.tsv"
# (Optional) Save a local copy for inspection
!gsutil -u $GOOGLE_PROJECT cat $flagged_samples_path > flagged_samples.tsv
# Import flagged samples into a hail table
flagged_samples = hl.import_table(flagged_samples_path, key='sample_id')
# Drop flagged samples from the main dataset
## If Hail MT
mt = mt.anti_join_cols(flagged_samples)
## If Hail VDS:
vds_no_flag = hl.vds.filter_samples(vds, flagged_samples, keep=False)
8. Filter to Your Samples
Hail MT
## Convert the person IDs to a Python set (assumes a DataFrame `sample_ids` with a 'person_id' column)
subset_sample_ids_set = set(map(str, sample_ids['person_id'].tolist()))
## Filter samples
mt = mt.filter_cols(hl.literal(subset_sample_ids_set).contains(mt.s))
Hail VDS:
## Import the sample as a Hail table
sample_needed_ht = hl.import_table('sample_ids.csv', delimiter=',', key='person_id')
## Filter samples
vds_subset = hl.vds.filter_samples(vds_no_flag, sample_needed_ht, keep=True)
9. Prepare PRS Weight Table
The weight table must have these columns:
["chr", "bp", "effect_allele", "noneffect_allele", "weight"]
The table below shows an example PRS weight table:
| chr | bp | effect_allele | noneffect_allele | weight |
|---|---|---|---|---|
| 2 | 202881162 | C | T | 1.57E-01 |
| 14 | 996676 | C | T | 6.77E-02 |
| 2 | 202881162 | C | T | 1.57E-01 |
| 14 | 99667605 | C | T | 6.77E-02 |
| 6 | 12903725 | G | A | 1.13E-01 |
| 13 | 110308365 | G | A | 6.77E-02 |
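Before calling `prepare_prs_table`, it can help to confirm your table actually has the required columns. This small check is illustrative, not part of AoUPRS:

```python
import pandas as pd

REQUIRED = ["chr", "bp", "effect_allele", "noneffect_allele", "weight"]

def check_weight_table(df: pd.DataFrame) -> None:
    """Raise if any required weight-table column is missing."""
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"weight table is missing columns: {missing}")

df = pd.DataFrame([["2", 202881162, "C", "T", 0.157]], columns=REQUIRED)
check_weight_table(df)  # passes silently
print("ok")
```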
# Prepare PRS weight table using function 'prepare_prs_table'
AoUPRS.prepare_prs_table('PGS######_table.csv',
'PGS######_weight_table.csv', bucket=bucket)
# Read PRS weight table
with gcsfs.GCSFileSystem().open('PGS######_weight_table.csv', 'rb') as gcs_file:
PGS######_weights_table = pd.read_csv(gcs_file)
10. Calculate PRS
prs_identifier = "PGS######"
pgs_weight_path = "PGS######_weight_table.csv"
output_path = "PGS######"
Hail MT
AoUPRS.calculate_prs_mt(mt, prs_identifier,
pgs_weight_path, output_path,
bucket=None, save_found_variants=False)
Hail VDS
AoUPRS.calculate_prs_vds(vds_subset, prs_identifier,
pgs_weight_path, output_path,
bucket=bucket, save_found_variants=True,
chunk_size=50000) # ✅ recommended
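AoUPRS merges the per-chunk outputs for you at the end of the run. Conceptually, that merge is a per-sample sum of partial scores, as in this toy sketch (column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical per-chunk outputs: per-sample partial scores that sum to the final PRS
chunk1 = pd.DataFrame({"person_id": [1, 2], "prs": [0.5, 0.1]})
chunk2 = pd.DataFrame({"person_id": [1, 2], "prs": [0.2, 0.3]})

merged = (pd.concat([chunk1, chunk2])
            .groupby("person_id", as_index=False)["prs"].sum())
print([round(x, 1) for x in merged["prs"]])  # [0.7, 0.4]
```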
📓 Example Notebooks
We provide ready-to-use Jupyter notebooks that demonstrate step-by-step how to run AoUPRS:
- AoUPRS with Hail VDS: example of filtering PRS weights using the Variant Annotation Table (VAT) and calculating scores.
- Other AoUPRS notebooks: full collection of usage examples, including MT and VDS approaches.
👉 You can also try AoUPRS directly on the All of Us Researcher Workbench:
🔗 Launch AoUPRS on the All of Us Researcher Workbench
You can explore, duplicate, and run the included notebooks.
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
📚 Citation
If you use AoUPRS in your research, please cite:
Khattab A, Chen S-F, Wineinger N, Torkamani A. AoUPRS: A Cost-Effective and Versatile PRS Calculator for the All of Us Program. BMC Genomics. 2025;26:412. https://doi.org/10.1186/s12864-025-11693-9
Author
Ahmed Khattab