just-prs
A Polars-bio based tool to compute Polygenic Risk Scores (PRS) from the PGS Catalog.
Features
- PRSCatalog — high-level class for searching scores, computing PRS, and estimating percentiles using cleaned bulk metadata (no REST API calls needed)
- Cleanup pipeline — normalizes genome builds (hg19/hg38/NCBI36 → GRCh37/GRCh38/GRCh36), renames columns to snake_case, parses performance metric strings into structured numeric fields
- HuggingFace sync — cleaned metadata parquets are published to just-dna-seq/polygenic_risk_scores and auto-downloaded on first use
- Compute PRS for one or many scores against a VCF file
- Search and inspect PGS Catalog scores and traits via the REST API
- Bulk download the entire PGS Catalog metadata (all ~5,000+ scores) via EBI FTP — one HTTP request per sheet, not hundreds of API pages
- Stream harmonized scoring files directly from EBI FTP without storing intermediate .gz files
- All data saved as Parquet for fast, efficient downstream analysis with Polars
Validation against PLINK2
Our PRS computation is validated against PLINK2 --score on real genomic data. The integration test suite downloads a whole-genome VCF from Zenodo, computes PRS for multiple GRCh38 scores using both just-prs and PLINK2, and asserts agreement:
| PGS ID | just-prs | PLINK2 | Relative diff | Variants matched |
|---|---|---|---|---|
| PGS000001 | 0.030123 | 0.030123 | 6.5e-7 | 51 / 77 |
| PGS000002 | -0.137089 | -0.137089 | 1.1e-7 | 51 / 77 |
| PGS000003 | 0.588127 | 0.588127 | 8.1e-9 | 51 / 77 |
| PGS000004 | -0.7158 | -0.7158 | 3.1e-16 | 170 / 313 |
| PGS000005 | -0.8903 | -0.8903 | 5.0e-16 | 170 / 313 |
All differences are within floating-point precision. PLINK2 is auto-downloaded if not already installed, so the tests run on any Linux, macOS, or Windows machine:
uv run pytest tests/test_plink.py -v
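The agreement criterion amounts to a plain relative-difference check. A minimal sketch with values from the table above (the test suite's actual helper names and tolerances may differ):

```python
def relative_diff(a: float, b: float) -> float:
    """Relative difference between two PRS values, guarding against zero."""
    denom = max(abs(a), abs(b))
    return abs(a - b) / denom if denom else 0.0

# Values from the PGS000001 row of the table above
just_prs_score = 0.030123
plink2_score = 0.030123
assert relative_diff(just_prs_score, plink2_score) < 1e-6
```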
Installation
Requires Python ≥ 3.14. Uses uv for dependency management.
From PyPI:
uv add just-prs
# or
pip install just-prs
From source:
git clone https://github.com/antonkulaga/just-prs
cd just-prs
uv sync
For the optional web UI: pip install just-prs[ui] or uv sync --all-packages when developing from source.
The CLI is available as both just-prs and prs.
CLI Reference
Top-level commands
prs --help
prs compute --help
prs catalog --help
prs compute — Compute PRS for a VCF
prs compute --vcf sample.vcf.gz --pgs-id PGS000001
prs compute --vcf sample.vcf.gz --pgs-id PGS000001,PGS000002,PGS000003
prs compute --vcf sample.vcf.gz --pgs-id PGS000001 --build GRCh37 --output results.json
Options:
| Flag | Default | Description |
|---|---|---|
| --vcf / -v | — | Path to VCF file (required) |
| --pgs-id / -p | — | Comma-separated PGS ID(s) (required) |
| --build / -b | GRCh38 | Genome build |
| --cache-dir | ~/.cache/just-prs/scores | Cache directory for scoring files |
| --output / -o | — | Save results as JSON |
prs catalog scores — Search and inspect scores (REST API)
prs catalog scores list # first 100 scores
prs catalog scores list --all # every score in catalog
prs catalog scores search --term "breast cancer"
prs catalog scores info PGS000001
prs catalog traits — Search and inspect traits (REST API)
prs catalog traits search --term "diabetes"
prs catalog traits info EFO_0001645
prs catalog download — Download a single scoring file
Downloads the harmonized .txt.gz scoring file for one score and caches it locally.
prs catalog download PGS000001
prs catalog download PGS000001 --output-dir ./my_scores --build GRCh37
prs catalog bulk — Bulk FTP downloads (fast, parquet output)
These commands use the EBI FTP HTTPS mirror via fsspec to download pre-built catalog-wide files directly — far faster than paginating the REST API.
prs catalog bulk metadata — All catalog metadata as parquet
Downloads the PGS Catalog bulk metadata CSVs and converts each to a parquet file. The full catalog (~5,000+ scores) downloads in seconds as a single HTTP request per sheet.
# Download all 7 metadata sheets → ./output/pgs_metadata/*.parquet
prs catalog bulk metadata
# Download only the scores sheet
prs catalog bulk metadata --sheet scores
# Specify output directory; force re-download
prs catalog bulk metadata --output-dir /data/pgs --overwrite
Available sheets:
| Sheet | Contents |
|---|---|
| scores | All PGS scores and their metadata |
| publications | Publication sources for each PGS |
| efo_traits | Ontology-mapped trait information |
| score_development_samples | GWAS and training samples |
| performance_metrics | Evaluation performance metrics |
| evaluation_sample_sets | Evaluation sample set descriptions |
| cohorts | Cohort information |
Options:
| Flag | Default | Description |
|---|---|---|
| --output-dir / -o | ./output/pgs_metadata | Directory for parquet output |
| --sheet / -s | all sheets | Single sheet name to download |
| --overwrite | False | Re-download existing files |
prs catalog bulk scores — All scoring files as parquet
Streams each harmonized scoring file from EBI FTP and saves it as a parquet file
(with an added pgs_id column). No intermediate .gz files are written to disk.
# Download ALL ~5,000+ scoring files (GRCh38) → ./output/pgs_scores/PGS######.parquet
prs catalog bulk scores
# Download a specific subset
prs catalog bulk scores --ids PGS000001,PGS000002,PGS000003
# GRCh37 build, custom output dir
prs catalog bulk scores --build GRCh37 --output-dir /data/scores
# Force re-download of existing files
prs catalog bulk scores --ids PGS000001 --overwrite
Options:
| Flag | Default | Description |
|---|---|---|
| --output-dir / -o | ./output/pgs_scores | Directory for parquet output |
| --build / -b | GRCh38 | Genome build (GRCh37 or GRCh38) |
| --ids | all | Comma-separated PGS IDs to download |
| --overwrite | False | Re-download existing parquet files |
prs catalog bulk clean-metadata — Build cleaned metadata parquets
Downloads raw metadata from EBI FTP, runs the cleanup pipeline (genome build normalization, column renaming, metric parsing, performance flattening), and saves three cleaned parquet files.
# Build cleaned parquets → ./output/pgs_metadata/
prs catalog bulk clean-metadata
# Custom output directory
prs catalog bulk clean-metadata --output-dir /data/cleaned
Output files:
| File | Contents |
|---|---|
| scores.parquet | All PGS scores with snake_case columns, normalized genome builds |
| performance.parquet | Performance metrics joined with evaluation samples, parsed numeric columns |
| best_performance.parquet | One best row per PGS ID (largest sample, European-preferred) |
Options:
| Flag | Default | Description |
|---|---|---|
| --output-dir / -o | ./output/pgs_metadata | Directory for cleaned parquet output |
prs catalog bulk push-hf — Push cleaned parquets to HuggingFace
Uploads cleaned metadata parquets to a HuggingFace dataset repository. Builds them first if not already present. Token is read from .env file or HF_TOKEN environment variable.
# Push to default repo (just-dna-seq/polygenic_risk_scores)
prs catalog bulk push-hf
# Push from a custom directory to a custom repo
prs catalog bulk push-hf --output-dir /data/cleaned --repo my-org/my-dataset
Options:
| Flag | Default | Description |
|---|---|---|
| --output-dir / -o | ./output/pgs_metadata | Directory containing cleaned parquets |
| --repo / -r | just-dna-seq/polygenic_risk_scores | HuggingFace dataset repo ID |
prs catalog bulk pull-hf — Pull cleaned parquets from HuggingFace
Downloads cleaned metadata parquets from a HuggingFace dataset repository. Useful for bootstrapping a local cache without running the cleanup pipeline.
# Pull to default directory
prs catalog bulk pull-hf
# Pull to custom directory from custom repo
prs catalog bulk pull-hf --output-dir /data/cleaned --repo my-org/my-dataset
Options:
| Flag | Default | Description |
|---|---|---|
| --output-dir / -o | ./output/pgs_metadata | Directory to save pulled parquets |
| --repo / -r | just-dna-seq/polygenic_risk_scores | HuggingFace dataset repo ID |
prs catalog bulk ids — List all PGS IDs
Fetches pgs_scores_list.txt from EBI FTP (one request) and prints every PGS ID.
prs catalog bulk ids
prs catalog bulk ids | wc -l # count total scores
Python API
Bulk FTP downloads (just_prs.ftp)
from just_prs.ftp import (
list_all_pgs_ids,
download_metadata_sheet,
download_all_metadata,
stream_scoring_file,
download_scoring_as_parquet,
bulk_download_scoring_parquets,
)
from pathlib import Path
# Full ID list in one request
ids = list_all_pgs_ids() # ['PGS000001', 'PGS000002', ...]
# All score metadata as a Polars DataFrame, saved to parquet
df = download_metadata_sheet("scores", Path("./output/pgs_metadata/scores.parquet"))
# All 7 sheets at once
sheets = download_all_metadata(Path("./output/pgs_metadata"))
# Stream a scoring file as a LazyFrame (no local .gz written)
lf = stream_scoring_file("PGS000001", genome_build="GRCh38")
# Download one scoring file as parquet (adds pgs_id column)
path = download_scoring_as_parquet("PGS000001", Path("./output/pgs_scores"))
# Bulk download a list (or all) scoring files as parquet
paths = bulk_download_scoring_parquets(Path("./output/pgs_scores"), pgs_ids=["PGS000001", "PGS000002"])
paths = bulk_download_scoring_parquets(Path("./output/pgs_scores")) # all ~5000+
REST API client (just_prs.catalog)
from just_prs.catalog import PGSCatalogClient
with PGSCatalogClient() as client:
score = client.get_score("PGS000001")
results = client.search_scores("breast cancer", limit=10)
trait = client.get_trait("EFO_0001645")
for score in client.iter_all_scores(page_size=100):
print(score.id, score.trait_reported)
PRSCatalog — search, compute, and percentile (just_prs.prs_catalog)
PRSCatalog is the recommended high-level interface. It persists three cleaned parquet files locally and loads them on access using a three-tier fallback chain: local files → HuggingFace pull → raw FTP download + cleanup. All lookups, searches, and PRS computations use cleaned data with no per-score REST API calls.
from just_prs import PRSCatalog
catalog = PRSCatalog() # uses ~/.cache/just-prs by default
# Browse cleaned scores (genome builds normalized, snake_case columns)
scores_df = catalog.scores(genome_build="GRCh38").collect()
# Search across pgs_id, name, trait_reported, and trait_efo
results = catalog.search("breast cancer", genome_build="GRCh38").collect()
# Get cleaned metadata for a single score
info = catalog.score_info_row("PGS000001") # dict or None
# Best performance metric per score (largest sample, European-preferred)
best = catalog.best_performance(pgs_id="PGS000001").collect()
# Compute PRS (trait lookup from cached metadata, not REST API)
result = catalog.compute_prs(vcf_path="sample.vcf.gz", pgs_id="PGS000001")
print(result.score, result.match_rate)
# Batch computation
results = catalog.compute_prs_batch(
vcf_path="sample.vcf.gz",
pgs_ids=["PGS000001", "PGS000002"],
)
# Percentile estimation (AUROC-based or explicit mean/std)
pct = catalog.percentile(prs_score=1.5, pgs_id="PGS000014")
pct = catalog.percentile(prs_score=1.5, pgs_id="PGS000014", mean=0.0, std=1.0)
# Build cleaned parquets explicitly (download from FTP + cleanup)
paths = catalog.build_cleaned_parquets(output_dir=Path("./output/pgs_metadata"))
# {'scores': Path('output/pgs_metadata/scores.parquet'), 'performance': ..., 'best_performance': ...}
# Push cleaned parquets to HuggingFace
catalog.push_to_hf() # token from .env / HF_TOKEN
catalog.push_to_hf(token="hf_...", repo_id="my-org/my-dataset")
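For the explicit mean/std path, the percentile presumably follows from a normal approximation of the population PRS distribution. A minimal sketch of that idea (not the library's internals):

```python
from math import erf, sqrt

def normal_percentile(prs_score: float, mean: float, std: float) -> float:
    """Percentile of a PRS under a normal(mean, std) population model."""
    z = (prs_score - mean) / std
    # Standard normal CDF via erf, scaled to a 0-100 percentile
    return 100.0 * 0.5 * (1.0 + erf(z / sqrt(2.0)))

normal_percentile(1.5, 0.0, 1.0)  # ~93.3: higher than ~93% of the population
```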
HuggingFace sync (just_prs.hf)
from just_prs.hf import push_cleaned_parquets, pull_cleaned_parquets
from pathlib import Path
# Push cleaned parquets to HF dataset repo
push_cleaned_parquets(Path("./output/pgs_metadata")) # default: just-dna-seq/polygenic_risk_scores
# Pull cleaned parquets from HF
downloaded = pull_cleaned_parquets(Path("./local_cache"))
# [Path('local_cache/scores.parquet'), Path('local_cache/performance.parquet'), ...]
Cleanup pipeline (just_prs.cleanup)
The cleanup functions can be used independently of PRSCatalog:
from just_prs.cleanup import clean_scores, clean_performance_metrics, parse_metric_string
from just_prs.ftp import download_metadata_sheet
from pathlib import Path
# Clean scores: rename columns, normalize genome builds
raw_df = download_metadata_sheet("scores", Path("./output/pgs_metadata/scores_raw.parquet"))
cleaned_lf = clean_scores(raw_df) # LazyFrame with snake_case columns
# Parse a metric string
parse_metric_string("1.55 [1.52,1.58]")
# {'estimate': 1.55, 'ci_lower': 1.52, 'ci_upper': 1.58, 'se': None}
# Clean performance metrics: parse strings, join with evaluation samples
perf_df = download_metadata_sheet("performance_metrics", Path("./output/pgs_metadata/perf.parquet"))
eval_df = download_metadata_sheet("evaluation_sample_sets", Path("./output/pgs_metadata/eval.parquet"))
cleaned_perf_lf = clean_performance_metrics(perf_df, eval_df)
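A metric string like "1.55 [1.52,1.58]" can be parsed with a single regex. A minimal re-implementation sketch in the spirit of parse_metric_string (the real function handles more formats, e.g. standard errors, and also returns an se field):

```python
import re

def parse_estimate_with_ci(s: str):
    """Parse 'estimate [lower,upper]' into a dict, or None if it doesn't match."""
    m = re.fullmatch(r"\s*(-?[\d.]+)\s*\[\s*(-?[\d.]+)\s*,\s*(-?[\d.]+)\s*\]\s*", s)
    if m is None:
        return None
    est, lo, hi = (float(g) for g in m.groups())
    return {"estimate": est, "ci_lower": lo, "ci_upper": hi}

parse_estimate_with_ci("1.55 [1.52,1.58]")
# {'estimate': 1.55, 'ci_lower': 1.52, 'ci_upper': 1.58}
```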
Low-level PRS computation (just_prs.prs)
from pathlib import Path
from just_prs.prs import compute_prs, compute_prs_batch
result = compute_prs(
vcf_path=Path("sample.vcf.gz"),
scoring_file="PGS000001", # PGS ID, local path, or LazyFrame
genome_build="GRCh38",
)
print(result.score, result.match_rate)
results = compute_prs_batch(
vcf_path=Path("sample.vcf.gz"),
pgs_ids=["PGS000001", "PGS000002"],
)
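Conceptually, an additive PRS is a weighted sum of effect-allele dosages over the variants matched between the VCF and the scoring file. A schematic of that standard model, not the library's actual implementation (the division by allele count mirrors PLINK2's default --score averaging):

```python
def additive_prs(dosages: list[float], weights: list[float]) -> float:
    """Average-per-allele PRS: sum(dosage_i * weight_i) / (2 * n_variants).

    Dosages are effect-allele counts in [0, 2] for a diploid sample; the
    2 * n_variants denominator mirrors PLINK2's default --score averaging.
    """
    assert len(dosages) == len(weights)
    total = sum(d * w for d, w in zip(dosages, weights))
    return total / (2 * len(dosages))

additive_prs([0, 1, 2], [0.1, -0.2, 0.05])
```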
Web UI (prs-ui)
An interactive Reflex web application for browsing PGS Catalog data and computing PRS scores.
cd prs-ui
uv run reflex run
The UI has three tabs:
Metadata Sheets
Browse all 7 PGS Catalog metadata sheets (Scores, Publications, EFO Traits, etc.) in a MUI DataGrid with server-side filtering and sorting. Select rows with checkboxes and download their scoring files to the local cache with a single Download Selected button.
Scoring File
Stream any harmonized scoring file by PGS ID directly from EBI FTP and view it in the grid. Select the genome build (GRCh37 / GRCh38) before loading.
Compute PRS
End-to-end PRS computation workflow:
- Upload a VCF — drag-and-drop or browse; genome build is auto-detected from ##reference and ##contig headers
- Load Scores — fetches the PGS Catalog scores metadata, pre-filtered by the detected (or manually selected) genome build. Scores are shown in a paginated, searchable table
- Select scores — use checkboxes to pick individual scores, or "Select All" to select everything matching the current search
- Compute — runs PRS for each selected score against the uploaded VCF and shows results with match rates, effect sizes, and classification metrics from PGS Catalog evaluation studies
Configuration:
| Environment variable | Default | Description |
|---|---|---|
| PRS_CACHE_DIR | ~/.cache/just-prs | Root directory for cached metadata and scoring files |
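For example, to point the UI at a shared cache location before launching (the path below is an arbitrary example):

```shell
# Reuse previously downloaded metadata and scoring files across runs
export PRS_CACHE_DIR=/data/just-prs-cache
```

Set this in the shell before running `uv run reflex run`.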
Data sources
- PGS Catalog REST API: https://www.pgscatalog.org/rest/
- EBI FTP bulk downloads: https://ftp.ebi.ac.uk/pub/databases/spot/pgs/
- PGS Catalog download documentation: https://www.pgscatalog.org/downloads/
- Cleaned metadata parquets on HuggingFace: https://huggingface.co/datasets/just-dna-seq/polygenic_risk_scores