AoUPRS is a Python module for calculating Polygenic Risk Scores (PRS) specific to the All of Us study

AoUPRS

🌟 Overview

AoUPRS is a Python module designed for calculating Polygenic Risk Scores (PRS) on the All of Us dataset.
It leverages Hail, a scalable framework for genomic data, to provide efficient and cost-effective PRS calculations.

AoUPRS provides two approaches:

MatrixTable (MT) – dense representation
Variant Dataset (VDS) – sparse representation (recommended for v8)

📄 Publication in BMC Genomics (2025)

⚠️ Dataset Compatibility (v7 & v8)

AoUPRS supports both v7 and v8 Controlled Tier WGS datasets.
Key change in v8: the GT field was removed. AoUPRS now reconstructs genotypes using:
- LGT (local genotype, whose allele indices point into LA)
- LA (local alleles array, listing which of the site's alleles are present for that sample)

This ensures PRS calculations remain accurate and efficient with the sparser v8 VDS format.
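
As a rough illustration of the LGT/LA idea, the sketch below maps a local genotype back to global allele indices in plain Python. The function name `local_to_global_gt` and the example values are hypothetical and this is not AoUPRS's own code. Hail also ships a helper for this conversion, `hl.vds.lgt_to_gt`; whether AoUPRS calls it internally is not stated here.

```python
from typing import Optional, Sequence, Tuple


def local_to_global_gt(
    lgt: Optional[Tuple[int, int]],
    la: Sequence[int],
) -> Optional[Tuple[int, int]]:
    """Map a local genotype (indices into LA) to global allele indices."""
    if lgt is None:  # a missing call stays missing
        return None
    # Each local index selects one entry of LA; sort because the call is unphased.
    return tuple(sorted(la[i] for i in lgt))


# Hypothetical sample: its local alleles are the reference (0) and global allele 3.
la = [0, 3]
het = local_to_global_gt((0, 1), la)      # heterozygous: reference / allele 3
hom_alt = local_to_global_gt((1, 1), la)  # homozygous for allele 3
print(het, hom_alt)
```

The point of the local encoding is that each sample only stores the alleles it actually carries, which keeps multi-allelic sites compact in the sparse VDS.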

🚀 Performance Update (v8)

Why is PRS calculation slower in v8?

v7 ≈ 1.7B variants
v8 ≈ 4.9B variants (roughly 3× larger)
Bigger callset = heavier interval queries, more I/O, more shuffle overhead.

Chunking

Large scores (>1M SNPs) must be chunked.
Example runtimes [driver: 8 CPU / 52 GB; 2 standard + 10 preemptible workers (4 CPU / 15 GB RAM / 150 GB disk each)]:
- 50k SNPs → 7–8 min
- 100k SNPs → 15 min
- 150k SNPs → 52 min ⚠️ nonlinear slowdown

👉 Best practice: chunk_size=50000
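
To make the chunking idea concrete, here is a minimal sketch of splitting a weight table into consecutive 50,000-row pieces. The helper `iter_chunks` is hypothetical and only illustrates the slicing; AoUPRS performs its own chunking when `chunk_size` is passed.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(rows: Sequence[T], chunk_size: int = 50_000) -> Iterator[List[T]]:
    """Yield consecutive slices of at most chunk_size rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(rows), chunk_size):
        yield list(rows[start:start + chunk_size])


# Row indices stand in for the variants of a hypothetical 1,000,000-SNP score.
variants = range(1_000_000)
chunk_sizes = [len(chunk) for chunk in iter_chunks(variants, chunk_size=50_000)]
print(len(chunk_sizes), "chunks")
```

With the runtimes quoted above, keeping every chunk at 50k SNPs avoids the nonlinear slowdown seen at 150k.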

Workers

Adding more workers beyond this point does not speed things up.
Sweet spot: 10 preemptible workers.
More workers mean more shuffle overhead and more stragglers.

Cost Example

Master: 8 CPU / 52 GB RAM
Workers: 2 standard + 10 preemptible × (4 CPU / 15 GB RAM / 150 GB disk)
Cost: ~$1.95/hour running, ~$0.11/hour paused
~1M SNP PRS (20 chunks) ≈ 3h wall time, ~$6
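
The figures above can be checked with a few lines of arithmetic. The sketch uses only the rates and runtimes quoted in this section; the function and variable names are illustrative.

```python
RUNNING_RATE_USD_PER_HOUR = 1.95  # cluster running
PAUSED_RATE_USD_PER_HOUR = 0.11   # cluster paused


def estimate_cost(running_hours: float, paused_hours: float = 0.0) -> float:
    """Total cost in USD for a given number of running and paused hours."""
    return (running_hours * RUNNING_RATE_USD_PER_HOUR
            + paused_hours * PAUSED_RATE_USD_PER_HOUR)


n_chunks = 20                # 1M SNPs / 50k SNPs per chunk
minutes_per_chunk = (7, 8)   # quoted runtime range for one 50k-SNP chunk
chunk_hours = tuple(n_chunks * m / 60 for m in minutes_per_chunk)

cost_3h = estimate_cost(3.0)
print(f"chunk time: {chunk_hours[0]:.1f}-{chunk_hours[1]:.1f} h, cost for 3 h: ${cost_3h:.2f}")
```

Twenty chunks at 7 to 8 minutes each come to roughly 2.3 to 2.7 hours of scoring, which is in line with the quoted ~3 hours of wall time, and 3 hours at $1.95/hour is about $6.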

🛠️ Resume & Checkpointing

AoUPRS now supports chunked execution with resume:

Each chunk is saved immediately (PGS######_chunkN.csv).
If the environment crashes, rerunning the same call skips completed chunks and continues with the rest.
At the end, all chunks are merged into the final PRS results file.

👉 This makes long PRS runs on v8 robust and restartable.
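
The resume pattern described above can be sketched in plain Python as follows. This is not AoUPRS's internal code: `run_with_resume`, the `person_id`/`score` column names, and the identifier `PGS000000` are assumptions made for illustration, and the merge step assumes that per-chunk scores over disjoint variant sets simply add up.

```python
import csv
import os
import tempfile
from typing import Callable, Dict, List


def run_with_resume(
    n_chunks: int,
    score_chunk: Callable[[int], Dict[str, float]],
    out_dir: str,
    prefix: str = "PGS000000",  # hypothetical score identifier
) -> Dict[str, float]:
    """Score each chunk once, skip chunks whose CSV already exists, then merge."""
    for i in range(1, n_chunks + 1):
        path = os.path.join(out_dir, f"{prefix}_chunk{i}.csv")
        if os.path.exists(path):
            continue  # finished in an earlier run
        scores = score_chunk(i)
        tmp_path = path + ".tmp"
        with open(tmp_path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["person_id", "score"])
            writer.writerows(scores.items())
        os.replace(tmp_path, path)  # rename last, so a crash never leaves a partial chunk

    # Merge: sum the per-chunk scores for every sample.
    totals: Dict[str, float] = {}
    for i in range(1, n_chunks + 1):
        with open(os.path.join(out_dir, f"{prefix}_chunk{i}.csv"), newline="") as fh:
            for row in csv.DictReader(fh):
                totals[row["person_id"]] = totals.get(row["person_id"], 0.0) + float(row["score"])
    return totals


calls: List[int] = []


def fake_score(i: int) -> Dict[str, float]:
    """Stand-in for the expensive per-chunk scoring step."""
    calls.append(i)
    return {"A": 0.5 * i, "B": 1.0}


with tempfile.TemporaryDirectory() as out_dir:
    first = run_with_resume(3, fake_score, out_dir)
    second = run_with_resume(3, fake_score, out_dir)  # rerun: nothing is recomputed
print(calls, first)
```

Writing each chunk to a temporary name and renaming it afterwards is one way to make sure a file that exists is always a complete chunk.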

🔧 Installation

To install AoUPRS from PyPI, run the following command:

pip install AoUPRS

Dependencies

AoUPRS requires the following Python packages:

hail
gcsfs
pandas

These dependencies will be installed automatically when you install AoUPRS.

📘 Usage Guide

1. Setup Environment

Select the Hail Genomic Analysis environment.

Dataproc cluster options:

Hail MT
- Requires many resources (not recommended for v8)
- ~300 workers
- Cost: ~$72.91/hour
Hail VDS (recommended)
- Works reliably with modest resources
- Best setup for large PRS runs:
  - Master node: 8 CPUs / 52 GB RAM
  - Workers: 10 × (4 CPUs / 15 GB RAM / 150 GB disk, preemptible)
- Cost: ~$1.95/hour running, ~$0.11/hour paused
- Runtime example: ~1M SNP PRS (20 × 50k chunks) ≈ 3 hours wall time, ~$6 total

👉 AoUPRS can save output files either locally or to the cloud. We recommend always saving to the cloud, because local files are deleted when the Hail environment is deleted.

2. Query VAT (optional)

Before calculating PRS, you may want to restrict your weight table to variants that are actually present in the All of Us Variant Annotation Table (VAT).
This ensures you are only scoring variants found in the callset.

📓 Example notebook:
Query VAT and filter PRS weights

👉 If you skip this step, the calculation still runs, but the weight table may include variants that are not in the AoU callset.
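
Conceptually, the filtering step keeps only the weight rows whose site appears in the callset. The sketch below shows that with plain Python sets; `filter_weights_to_callset` and the `vat_sites` set are hypothetical, the VAT query itself is not shown, and a real match would normally also compare alleles.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def filter_weights_to_callset(
    weights: Iterable[Dict[str, Any]],
    callset_sites: Set[Site],
) -> List[Dict[str, Any]]:
    """Keep only weight rows whose (chr, bp) is present in the callset."""
    return [
        row for row in weights
        if (str(row["chr"]), int(row["bp"])) in callset_sites
    ]


weights = [
    {"chr": "2", "bp": 202881162, "effect_allele": "C", "noneffect_allele": "T", "weight": 0.157},
    {"chr": "6", "bp": 12903725, "effect_allele": "G", "noneffect_allele": "A", "weight": 0.113},
    {"chr": "13", "bp": 110308365, "effect_allele": "G", "noneffect_allele": "A", "weight": 0.0677},
]
# Hypothetical result of a VAT query: only two of the three sites are in the callset.
vat_sites = {("2", 202881162), ("13", 110308365)}

kept = filter_weights_to_callset(weights, vat_sites)
print(f"kept {len(kept)} of {len(weights)} variants")
```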

3. Import Packages

import AoUPRS
import os
import pandas as pd
import numpy as np
from datetime import datetime
import gcsfs
import glob
import hail as hl

4. Initiate Hail

hl.init(tmp_dir='hail_temp/', default_reference='GRCh38')

5. Define Bucket

bucket = os.getenv("WORKSPACE_BUCKET")
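
`os.getenv` returns `None` when `WORKSPACE_BUCKET` is not set, for example outside the Researcher Workbench. A small guard like the one below makes that case explicit. The helper `resolve_output_path` and the bucket name are hypothetical; how AoUPRS itself combines `bucket` and file names is not shown here.

```python
import os
from typing import Optional


def resolve_output_path(name: str, bucket: Optional[str]) -> str:
    """Return a path under the bucket, or the plain local name if no bucket is given."""
    if bucket:
        return f"{bucket.rstrip('/')}/{name.lstrip('/')}"
    return name


bucket = os.getenv("WORKSPACE_BUCKET")
if bucket is None:
    print("WORKSPACE_BUCKET is not set; outputs would be written locally")

cloud = resolve_output_path("prs/scores.csv", "gs://example-bucket/")  # hypothetical bucket
local = resolve_output_path("prs/scores.csv", None)
print(cloud, local)
```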

6. Read Hail MT / VDS

Hail MT

mt_wgs_path = os.getenv("WGS_ACAF_THRESHOLD_MULTI_HAIL_PATH")
mt = hl.read_matrix_table(mt_wgs_path)

Hail VDS

vds_srwgs_path = os.getenv("WGS_VDS_PATH")
vds = hl.vds.read_vds(vds_srwgs_path)

7. Drop Flagged srWGS samples (optional)

AoU provides a table listing samples that are flagged as part of the sample outlier QC for the srWGS SNP and Indel joint callset.

# Read flagged samples

flagged_samples_path = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/relatedness/relatedness_flagged_samples.tsv"

# Save flagged samples locally

!gsutil -u $$GOOGLE_PROJECT cat $flagged_samples_path > flagged_samples.csv

# Import flagged samples into a hail table

flagged_samples = hl.import_table(flagged_samples_path, key='sample_id')

# Drop flagged samples from the main Hail dataset

## If Hail MT
mt = mt.anti_join_cols(flagged_samples)

## If Hail VDS:
vds_no_flag = hl.vds.filter_samples(vds, flagged_samples, keep=False)

8. Filter to Your Samples

Hail MT

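## Load your cohort's sample IDs (assumes sample_ids.csv has a 'person_id' column, as in the VDS example)
sample_ids = pd.read_csv('sample_ids.csv')
## Convert the sample IDs to a Python set of strings
subset_sample_ids_set = set(map(str, sample_ids['person_id'].tolist()))
## Filter samples
mt = mt.filter_cols(hl.literal(subset_sample_ids_set).contains(mt.s))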

Hail VDS:

## Import the sample IDs as a Hail table
sample_needed_ht = hl.import_table('sample_ids.csv', delimiter=',', key='person_id')
## Filter samples (if you skipped step 7, pass vds instead of vds_no_flag)
vds_subset = hl.vds.filter_samples(vds_no_flag, sample_needed_ht, keep=True)

9. Prepare PRS Weight Table

The weight table must have these columns:

["chr", "bp", "effect_allele", "noneffect_allele", "weight"]

Example PRS weight table:

chr	bp	effect_allele	noneffect_allele	weight
2	202881162	C	T	1.57E-01
14	996676	C	T	6.77E-02
14	99667605	C	T	6.77E-02
6	12903725	G	A	1.13E-01
13	110308365	G	A	6.77E-02
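
Before handing a table to AoUPRS, it can help to check that the required columns are present and that positions and weights parse as numbers. The sketch below does this with the standard `csv` module. The helper `load_weight_rows` is hypothetical and the comma delimiter is an assumption based on the `.csv` file names used in this guide.

```python
import csv
import io
from typing import Any, Dict, List, TextIO

REQUIRED_COLUMNS = ["chr", "bp", "effect_allele", "noneffect_allele", "weight"]


def load_weight_rows(handle: TextIO) -> List[Dict[str, Any]]:
    """Read a weight table, checking required columns and numeric fields."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"weight table is missing columns: {missing}")
    rows = []
    for rec in reader:
        rows.append({
            "chr": rec["chr"],
            "bp": int(rec["bp"]),              # raises ValueError on a bad position
            "effect_allele": rec["effect_allele"],
            "noneffect_allele": rec["noneffect_allele"],
            "weight": float(rec["weight"]),    # accepts notation such as 1.57E-01
        })
    return rows


sample = io.StringIO(
    "chr,bp,effect_allele,noneffect_allele,weight\n"
    "2,202881162,C,T,1.57E-01\n"
    "6,12903725,G,A,1.13E-01\n"
)
rows = load_weight_rows(sample)
print(len(rows), "rows loaded")
```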

# Prepare PRS weight table using function 'prepare_prs_table'

AoUPRS.prepare_prs_table('PGS######_table.csv',
                         'PGS######_weight_table.csv', bucket=bucket)

# Read PRS weight table (assumes it was written under the workspace bucket)

with gcsfs.GCSFileSystem().open(f'{bucket}/PGS######_weight_table.csv', 'rb') as gcs_file:
    weights_table = pd.read_csv(gcs_file)

10. Calculate PRS

prs_identifier = "PGS######"
pgs_weight_path = "PGS######_weight_table.csv"
output_path = "PGS######"

Hail MT

AoUPRS.calculate_prs_mt(mt, prs_identifier,
                        pgs_weight_path, output_path,
                        bucket=None, save_found_variants=False)

Hail VDS

AoUPRS.calculate_prs_vds(vds_subset, prs_identifier,
                         pgs_weight_path, output_path,
                         bucket=bucket, save_found_variants=True,
                         chunk_size=50000)  # ✅ recommended
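
For orientation, the quantity being computed is the standard additive score: for each sample, the sum over variants of weight × number of copies of the effect allele. The sketch below shows that definition in plain Python for a single sample. The names are hypothetical and it is not the AoUPRS implementation; in particular, it treats missing calls as contributing zero and does not model sparse VDS details such as sites where a sample has no stored entry.

```python
from typing import Dict, Optional, Tuple

Site = Tuple[str, int]                # (chromosome, position)
Genotype = Optional[Tuple[str, str]]  # pair of allele strings, or None if missing


def effect_allele_dosage(gt: Genotype, effect_allele: str) -> int:
    """Count copies (0, 1 or 2) of the effect allele in a diploid call."""
    if gt is None:
        return 0  # assumption: missing calls contribute nothing
    return sum(1 for allele in gt if allele == effect_allele)


def polygenic_score(
    genotypes: Dict[Site, Genotype],
    weights: Dict[Site, Tuple[str, float]],
) -> float:
    """Sum of weight * effect-allele dosage over all variants in the score."""
    total = 0.0
    for site, (effect_allele, weight) in weights.items():
        total += weight * effect_allele_dosage(genotypes.get(site), effect_allele)
    return total


weights = {
    ("2", 202881162): ("C", 0.157),
    ("6", 12903725): ("G", 0.113),
}
genotypes = {
    ("2", 202881162): ("C", "T"),  # one copy of the effect allele
    ("6", 12903725): ("G", "G"),   # two copies
}
score = polygenic_score(genotypes, weights)
print(round(score, 3))
```

Because the score is a plain sum over variants, per-chunk scores over disjoint variant sets can be added together, which is what makes chunked execution possible.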

📓 Example Notebooks

We provide ready-to-use Jupyter notebooks that demonstrate step-by-step how to run AoUPRS:

- AoUPRS with Hail VDS: example of filtering PRS weights using the Variant Annotation Table (VAT) and calculating scores.
- Other AoUPRS notebooks: full collection of usage examples, including MT and VDS approaches.

👉 You can also try AoUPRS directly on the All of Us Researcher Workbench:
🔗 Launch AoUPRS on the All of Us Researcher Workbench

You can explore, duplicate, and run the included notebooks.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

If you use AoUPRS in your research, please cite:

Khattab A, Chen S-F, Wineinger N, Torkamani A. AoUPRS: A Cost-Effective and Versatile PRS Calculator for the All of Us Program. BMC Genomics. 2025;26:412. https://doi.org/10.1186/s12864-025-11693-9

Author

Ahmed Khattab
