Biobanking data processing, annotation, and association workflows

Biobanking

Systematic collection, processing, storage, and analysis of biological samples and associated health records for medical research.

Supported pipelines

Preprocess

Contains biobank-specific modules for EHR data collection, cleaning, and processing.

QC (Under construction)

Will contain biobank-specific modules for variant quality control and filtering.

Annotation (Under construction)

Will contain biobank-specific modules for variant annotation.

Association

Contains biobank-specific modules for genotype-phenotype association tests.

Supported biobanks

All of Us

The All of Us biobank pairs whole-genome sequencing with electronic health record data for more than 400k individuals and continues to expand.

UK Biobank (Under construction)

The UK Biobank pairs whole-genome sequencing with electronic health record data for ~500k participants.

AoU REGENIE workflow

The All of Us association utilities currently support a packaged regenie workflow with three Step 2 modes:

  • Burden association testing
  • Mask-only runs for writing burden-mask PLINK datasets
  • Interaction testing using the same burden inputs and optional interaction flags

The workflow implementation lives in src/biobanking/workflows/regenie.wdl, and the Python utilities live in src/biobanking/association/aou.py.

The tracking model is phenotype-centered:

  • Step 1 is tracked once per phenotype prefix
  • Step 2 burden and interaction runs are tracked separately by mode under each phenotype prefix
  • Mask runs are tracked separately under a top-level mask namespace
  • Workflow metadata is written locally and synced to the workspace bucket

This keeps LOCO and prediction reuse aligned with the phenotype definition rather than with any specific burden or interaction run, while allowing masks to remain phenotype-independent artifacts.
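
The tracking model above can be sketched as a small key-building helper. This is only an illustration: the names Step2Mode, step1_key, and step2_key, and the "phenotypes" and "masks" path segments, are hypothetical and are not taken from src/biobanking/association/aou.py.

```python
from enum import Enum
from pathlib import PurePosixPath
from typing import Optional


class Step2Mode(str, Enum):
    """The three Step 2 modes described above (hypothetical enum)."""
    BURDEN = "burden"
    MASK = "mask"
    INTERACTION = "interaction"


def step1_key(phenotype_prefix: str) -> PurePosixPath:
    # Step 1 is tracked once per phenotype prefix.
    return PurePosixPath("phenotypes") / phenotype_prefix / "step1"


def step2_key(
    mode: Step2Mode,
    run_name: str,
    phenotype_prefix: Optional[str] = None,
) -> PurePosixPath:
    if mode is Step2Mode.MASK:
        # Mask runs are phenotype-independent, so they live under
        # a top-level namespace instead of a phenotype prefix.
        return PurePosixPath("masks") / run_name
    if phenotype_prefix is None:
        raise ValueError(f"{mode.value} runs need a phenotype prefix")
    # Burden and interaction runs are tracked by mode under the phenotype.
    return (
        PurePosixPath("phenotypes") / phenotype_prefix / "step2"
        / mode.value / run_name
    )
```

Because the keys are POSIX-style paths, the same relative key could in principle name both a local metadata file and an object in the workspace bucket.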

Recommended usage pattern

  • Run or reuse Step 1 once per phenotype prefix.
  • Use burden runs for standard gene-based tests.
  • Use mask runs to materialize chromosome-wide burden-mask PLINK files from a universal dummy phenotype stored at data/associations/masks/<burden_type>/dummy.tsv.gz, without phenotype covariates.
  • Use interaction runs only after Step 1 exists for the phenotype prefix you are testing.
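
As a rough sketch of the mask-run input mentioned above, the helper below builds the dummy phenotype path and writes a placeholder file in the FID/IID/phenotype layout that regenie phenotype files use. The function names, the DUMMY column name, and the alternating 0/1 values are assumptions made for illustration; the contents of the packaged dummy file are not documented here.

```python
import csv
import gzip
from pathlib import Path
from typing import Iterable


def dummy_phenotype_path(
    burden_type: str,
    root: Path = Path("data/associations/masks"),
) -> Path:
    # Mirrors data/associations/masks/<burden_type>/dummy.tsv.gz
    return root / burden_type / "dummy.tsv.gz"


def write_dummy_phenotype(sample_ids: Iterable[str], out_path: Path) -> Path:
    """Write a gzipped TSV with FID, IID, and one placeholder column."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(out_path, "wt", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["FID", "IID", "DUMMY"])
        for index, sample_id in enumerate(sample_ids):
            # Placeholder values only (assumption): alternate 0 and 1.
            writer.writerow([sample_id, sample_id, index % 2])
    return out_path
```

No covariate columns are written, in line with mask runs not using phenotype covariates.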

More detailed usage examples are in docs/workflows.md.

Validate WDL

Before submitting workflows through Cromwell, validate the WDL locally with womtool. From the repository root, run:

java -jar data/tools/womtool.jar validate src/biobanking/workflows/regenie.wdl

If womtool.jar is not present yet, place it under data/tools/ in the repository and rerun the validation command before submitting updated workflow code.
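
The same check can be scripted, for example as a pre-submission step. The sketch below is not part of the package: womtool_validate_command and validate_wdl are hypothetical helpers, and it assumes womtool reports a failed validation through a non-zero exit code.

```python
import subprocess
from pathlib import Path
from typing import List


def womtool_validate_command(jar: Path, wdl: Path) -> List[str]:
    # Same invocation as the manual command, built as an argument list.
    return ["java", "-jar", str(jar), "validate", str(wdl)]


def validate_wdl(
    jar: Path = Path("data/tools/womtool.jar"),
    wdl: Path = Path("src/biobanking/workflows/regenie.wdl"),
) -> bool:
    """Return True when womtool exits with status 0 (assumed to mean valid)."""
    if not jar.is_file():
        raise FileNotFoundError(
            f"{jar} not found; place womtool.jar under data/tools/ first"
        )
    result = subprocess.run(
        womtool_validate_command(jar, wdl),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface whatever womtool printed so the failure is visible.
        print(result.stderr or result.stdout)
    return result.returncode == 0
```

Calling validate_wdl() from the repository root uses the default paths shown above.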

Internal use

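# Install or upgrade the build and upload tooling
python -m pip install -U pip build twine
# Remove stale build artifacts (Linux)
rm -rf dist build *.egg-info src/*.egg-info
# Remove stale build artifacts (Windows PowerShell)
Remove-Item -Recurse -Force -ErrorAction SilentlyContinue dist, build, *.egg-info, src\*.egg-info
# Build the source distribution and wheel
python -m build
# Install the freshly built wheel
python -m pip install dist/biobanking-0.0.16-py3-none-any.whl
# Validate the workflow before release
java -jar data/tools/womtool.jar validate src/biobanking/workflows/regenie.wdl
# Smoke-test the main imports
python -c "from biobanking.association.aou import REGENIE; regenie = REGENIE(); from biobanking.preprocess.aou.measurements import save_measurements_in_wide_format; print('import ok')"
# Upload to PyPI
twine upload dist/*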
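
To avoid hard-coding the wheel file name in the install step, the built wheel can be located by parsing file names in dist/. This is a minimal sketch that relies on the standard wheel naming scheme (distribution, version, optional build tag, Python tag, ABI tag, platform tag); parse_wheel_name and find_wheel are hypothetical helpers, not part of the package.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_name(filename: str) -> WheelName:
    """Split a wheel file name into its standard components."""
    name = Path(filename).name
    if not name.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = name[: -len(".whl")].split("-")
    if len(parts) == 6:
        # Drop the optional build tag, which follows the version.
        del parts[2]
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel file name: {filename}")
    return WheelName(*parts)


def find_wheel(dist_dir: Path, distribution: str, version: str) -> Optional[Path]:
    """Return the first wheel in dist_dir matching the name and version."""
    for wheel in sorted(dist_dir.glob("*.whl")):
        parsed = parse_wheel_name(wheel.name)
        if parsed.distribution == distribution and parsed.version == version:
            return wheel
    return None
```

For example, find_wheel(Path("dist"), "biobanking", "0.0.16") returns the wheel path if the build step produced one, and None otherwise.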
