Dagster pipeline for computing PRS reference distributions from the 1000G panel

prs-pipeline

Dagster pipeline for computing PRS reference distributions from the 1000 Genomes reference panel.

Overview

This pipeline downloads the PGS Catalog 1000G reference panel (~7 GB), computes polygenic risk scores for all 2,504 reference individuals across all PGS Catalog scores, aggregates per-superpopulation distribution statistics, and pushes reference_distributions.parquet to HuggingFace (just-dna-seq/prs-percentiles).
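The scoring and aggregation steps described above can be illustrated with a minimal sketch. It assumes a PRS is the sum of effect weight times effect-allele dosage, and it summarizes scores per superpopulation with a few basic statistics. The function names, the toy variants, and the choice of statistics are hypothetical and do not come from the pipeline's own code, which works on the full panel via compute_reference_prs_batch().

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def score_individual(dosages, weights):
    """Sum effect_weight * effect-allele dosage; missing variants count as 0."""
    return sum(w * dosages.get(variant, 0) for variant, w in weights.items())


def aggregate_by_superpopulation(scores, superpop_of):
    """Group per-individual scores by superpopulation and summarize each group."""
    groups = defaultdict(list)
    for sample_id, score in scores.items():
        groups[superpop_of[sample_id]].append(score)
    return {
        pop: {
            "n": len(values),
            "mean": mean(values),
            "std": pstdev(values),  # population standard deviation
            "median": median(values),
        }
        for pop, values in groups.items()
    }


# Toy data (hypothetical): two variants, four individuals, two superpopulations.
weights = {"rs1": 0.5, "rs2": -0.2}
genotypes = {
    "S1": {"rs1": 2, "rs2": 0},
    "S2": {"rs1": 1, "rs2": 1},
    "S3": {"rs1": 0, "rs2": 2},
    "S4": {"rs1": 1, "rs2": 0},
}
superpop_of = {"S1": "EUR", "S2": "EUR", "S3": "AFR", "S4": "AFR"}

scores = {sid: score_individual(d, weights) for sid, d in genotypes.items()}
stats = aggregate_by_superpopulation(scores, superpop_of)
```

In the real pipeline the same grouping would run once per PGS ID, producing one row of statistics per score and superpopulation.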

End users of just-prs pull this small parquet file automatically via PRSCatalog.reference_distributions().
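One way a reference distribution can be used downstream is to place an individual's score on a percentile scale. The sketch below assumes the distribution is summarized by a mean and standard deviation and applies a normal approximation. Whether just-prs uses this approach, empirical quantiles, or something else is not stated here, and percentile_from_reference is a hypothetical helper, not part of the just-prs API.

```python
from statistics import NormalDist


def percentile_from_reference(score, ref_mean, ref_std):
    """Approximate percentile of `score` under a normal fit to the reference."""
    if ref_std <= 0:
        raise ValueError("reference standard deviation must be positive")
    return 100.0 * NormalDist(mu=ref_mean, sigma=ref_std).cdf(score)


# Hypothetical reference statistics for one PGS ID in one superpopulation.
at_mean = percentile_from_reference(0.65, ref_mean=0.65, ref_std=0.35)
one_sd_above = percentile_from_reference(1.00, ref_mean=0.65, ref_std=0.35)
```

A score equal to the reference mean lands at the 50th percentile; a score one standard deviation above it lands near the 84th.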

Running

cd prs-pipeline
uv run dagster dev -m prs_pipeline.definitions

Then open http://localhost:3000 in your browser.

Assets

| Asset | Group | Description |
|---|---|---|
| ebi_reference_panel_fingerprint | download | HTTP fingerprint for freshness tracking of the remote reference panel |
| ebi_scoring_files_fingerprint | download | HTTP fingerprint for the remote scoring file manifest |
| scoring_files | download | Bulk-download all harmonized PGS scoring .txt.gz files from EBI FTP |
| scoring_files_parquet | compute | Convert all .txt.gz scoring files to spec-driven parquet caches (zstd-9, embedded headers). Deletes .txt.gz after verified conversion to save ~5.5 GB disk space. Tracks per-file failures in conversion_failures.parquet |
| reference_panel | download | Download + extract reference panel binary files (.pgen/.pvar/.psam) |
| reference_scores | compute | Score all PGS IDs against the reference panel via compute_reference_prs_batch() |
| hf_prs_percentiles | upload | Enrich distributions with metadata and push to HuggingFace |
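The two fingerprint assets track whether a remote resource has changed without downloading it again. A minimal sketch of that idea follows, assuming the fingerprint is a hash over a few HTTP response headers (ETag, Last-Modified, Content-Length). The header selection and the function names are assumptions for illustration; the pipeline's actual fingerprint may be built differently.

```python
import hashlib

# Headers assumed to change whenever the remote file changes (an assumption).
FINGERPRINT_HEADERS = ("etag", "last-modified", "content-length")


def http_fingerprint(headers):
    """Hash selected response headers into a stable hex fingerprint."""
    normalized = {k.lower(): str(v).strip() for k, v in headers.items()}
    parts = [f"{name}={normalized.get(name, '')}" for name in FINGERPRINT_HEADERS]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def is_stale(previous_fingerprint, headers):
    """True when the remote headers no longer match the stored fingerprint."""
    return previous_fingerprint != http_fingerprint(headers)


# Hypothetical header values, as a HEAD request might return them.
first = {
    "ETag": '"abc123"',
    "Last-Modified": "Mon, 01 Jan 2024 00:00:00 GMT",
    "Content-Length": "7000000000",
}
stored = http_fingerprint(first)
changed = dict(first, **{"Last-Modified": "Tue, 02 Jan 2024 00:00:00 GMT"})
```

Comparing the stored fingerprint against fresh headers lets a downstream download asset be re-run only when is_stale(stored, headers) returns True.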

File details

Details for the file prs_pipeline-0.2.3.tar.gz.

File metadata

  • Download URL: prs_pipeline-0.2.3.tar.gz
  • Size: 27.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for prs_pipeline-0.2.3.tar.gz
Algorithm Hash digest
SHA256 a5b5aa63d91120eca6e7185147b99e55855ba1c3706136ac4e566706b8563f8c
MD5 ca022370bf2c97888f1a36a45746e252
BLAKE2b-256 0865d22c3fddd2aa7f44ff4968a514e4baa820fbabb28cadbce812c3c35f1805

File details

Details for the file prs_pipeline-0.2.3-py3-none-any.whl.

File metadata

  • Download URL: prs_pipeline-0.2.3-py3-none-any.whl
  • Size: 30.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for prs_pipeline-0.2.3-py3-none-any.whl
Algorithm Hash digest
SHA256 65630d941d7be4b19e9b33de7558178582664f72d1de7e23bf3df54b9e945125
MD5 640489f42898f6dd9df94783866975b7
BLAKE2b-256 a356740158cc3c883397a529ccbcc17c08c5a108ded787913a5d01b1d32a560c
