GENBoostGPU

Genomic Elastic Net Boosting on GPU (GENBoostGPU)

GENBoostGPU provides a scalable framework for running elastic net regression with boosting across thousands of CpG sites, leveraging GPU acceleration with RAPIDS cuML, CuPy, and cuDF.
It supports SNP preprocessing, cis-window filtering, LD clumping, missing data imputation, and phenotype integration — all optimized for large-scale epigenomics.


Features

  • Window-based orchestration:
    • run_windows_with_dask coordinates execution across one or more GPUs using Dask.
    • Handles batch scheduling of thousands of genomic windows.
  • Single-window analysis:
    • run_single_window executes boosting elastic net on one genomic region.
    • Accepts pre-loaded arrays (CuPy) or file paths (PLINK, phenotype tables).
  • GPU-accelerated boosting elastic net:
    • Iterative boosting with cuML ElasticNet and final Ridge refit.
    • Early stopping based on stability of variance explained.
  • Automated SNP preprocessing:
    • Zero-variance SNP filtering
    • Missing genotype imputation
    • LD clumping (PLINK-like) with CuPy
    • Cis-window SNP filtering
  • Hyperparameter optimization:
    • Optuna-based tuning of ElasticNet (alpha, l1_ratio)
    • Ridge regression tuning with delayed evaluation
    • Optional manual cross-validation for custom grids
  • Scalability:
    • Dask orchestration for multiple GPUs (LocalCUDACluster)
    • Single-GPU fallback for smaller jobs
  • Flexible outputs:
    • SNP betas, heritability estimates, variance explained
    • Window-level summary tables (.parquet)
    • Intermediate ridge/elastic net models for reproducibility
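
Three of the preprocessing steps listed above (cis-window filtering, missing genotype imputation, and zero-variance filtering) can be illustrated with a minimal CPU sketch. This is not GENBoostGPU's implementation, which runs on CuPy arrays; the function names, the use of None for missing calls, and mean imputation as the fill strategy are assumptions made for illustration. LD clumping is not covered here.

```python
from statistics import mean, pvariance


def impute_mean(column):
    """Replace missing calls (None) with the mean of the observed calls."""
    observed = [g for g in column if g is not None]
    fill = float(mean(observed)) if observed else 0.0
    return [fill if g is None else float(g) for g in column]


def preprocess(snps, positions, cis_start, cis_end):
    """Keep SNPs inside [cis_start, cis_end] that still vary after imputation.

    snps: one genotype column per SNP (0/1/2, or None for a missing call).
    positions: base-pair position of each SNP, in the same order as snps.
    Returns (kept_positions, kept_columns).
    """
    kept_pos, kept_cols = [], []
    for pos, column in zip(positions, snps):
        if not (cis_start <= pos <= cis_end):  # cis-window filter
            continue
        filled = impute_mean(column)           # missing genotype imputation
        if pvariance(filled) == 0.0:           # zero-variance filter
            continue
        kept_pos.append(pos)
        kept_cols.append(filled)
    return kept_pos, kept_cols


snps = [
    [0, 1, 2, None],  # varies, one missing call: kept
    [1, 1, 1, 1],     # monomorphic: dropped
    [2, 0, 1, 1],     # outside the window: dropped
]
positions = [120_000, 200_000, 900_000]
kept_pos, kept_cols = preprocess(snps, positions, 10_000, 510_000)
print(kept_pos, kept_cols)
```

Filtering by window first keeps the more expensive per-SNP steps off variants that would be discarded anyway.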

Installation

GENBoostGPU is available on PyPI.
It requires Python ≥3.10 and an NVIDIA GPU with CUDA 12.x.

pip install genboostgpu

For development (from source):

git clone https://github.com/heart-gen/GENBoostGPU.git
cd GENBoostGPU
poetry install

Usage

GENBoostGPU can be used either for large-scale orchestration (many genomic windows across one or more GPUs) or for single-window testing/debugging.


Example 1: Run a Single Window

The simplest entry point is run_single_window, which takes either:

  • File paths (PLINK genotypes + phenotype file + phenotype ID), or
  • Pre-loaded CuPy arrays for genotypes and phenotypes.

from genboostgpu.vmr_runner import run_single_window

result = run_single_window(
    chrom=21,
    start=10_000,
    end=510_000,
    geno_path="data/chr21_subset.bed",
    pheno_path="data/phenotypes.tsv",
    pheno_id="pheno_379",
    outdir="results",
    n_iter=50,
    n_trials=10
)

print(result)

Output is a Python dictionary, e.g.:

{
  "chrom": 21,
  "start": 10000,
  "end": 510000,
  "num_snps": 742,
  "final_r2": 0.34,
  "h2_unscaled": 0.29,
  "n_iter": 37
}

This produces:

  • Window-level summary (Python dict)
  • Saved results (.parquet, betas, heritability estimates) in results/
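
In the sample output, n_iter is 37 although the call requested n_iter=50, which would be consistent with the early stopping on stability of variance explained listed under Features. The exact rule GENBoostGPU applies is not described here; the sketch below shows one common form of such a rule, with tol and patience as hypothetical parameters.

```python
def stopping_iteration(r2_history, tol=1e-3, patience=3):
    """Return the 1-based iteration at which boosting would stop.

    Stops once variance explained has changed by less than `tol` for
    `patience` consecutive iterations; otherwise runs to the end.
    Both thresholds are illustrative, not GENBoostGPU defaults.
    """
    stable = 0
    for i in range(1, len(r2_history)):
        if abs(r2_history[i] - r2_history[i - 1]) < tol:
            stable += 1
        else:
            stable = 0  # any larger change resets the counter
        if stable >= patience:
            return i + 1
    return len(r2_history)


r2 = [0.10, 0.20, 0.28, 0.33, 0.3401, 0.3402, 0.3403, 0.3404, 0.3404]
print(stopping_iteration(r2))
```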

Example 2: Running on VMR Data

REGION=caudate python examples/vmr_test_caudate.py

Script outline (examples/vmr_test_caudate.py):

from genboostgpu.orchestration import run_windows_with_dask

df = run_windows_with_dask(
    windows, error_regions=error_regions,
    outdir="results", window_size=500_000,
    n_iter=100, n_trials=20, use_window=True,
    save=True, prefix="vmr"
)

This runs boosting elastic net across all VMR-defined windows for the chosen region.
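
The outline above does not show how windows is built. As a rough sketch only, the snippet below derives one fixed-width window per region, assuming window_size is the total width (as in Example 1, where end minus start is 500,000) and that the window is centred on the region midpoint. The dict layout reuses the chrom, start, and end names from run_single_window; the structure that run_windows_with_dask actually expects may differ.

```python
def cis_window(chrom, region_start, region_end, window_size=500_000):
    """Return a window of total width `window_size` centred on the
    midpoint of a region, with the start clipped at position 0."""
    midpoint = (region_start + region_end) // 2
    start = max(0, midpoint - window_size // 2)
    return {"chrom": chrom, "start": start, "end": start + window_size}


# Hypothetical regions as (chrom, start, end) tuples.
regions = [(21, 255_000, 265_000), (21, 100_000, 102_000)]
windows = [cis_window(c, s, e) for c, s, e in regions]
print(windows)
```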


Example 3: Running on Simulated Data

NUM_SAMPLES=100 python examples/simu_test_100n.py

Script outline (examples/simu_test_100n.py):

from genboostgpu.orchestration import run_windows_with_dask

df = run_windows_with_dask(
    windows, outdir="results", window_size=500_000,
    n_iter=100, n_trials=10, use_window=False,
    save=True, prefix="simu_100"
)

This runs boosting elastic net across synthetic SNP–phenotype pairs for benchmarking.
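
For readers who want a feel for what a synthetic SNP–phenotype pair looks like, here is a small stdlib-only sketch. It is not the simulation used by examples/simu_test_100n.py; the function name, the allele-frequency range, the number of causal SNPs, and the target heritability are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import pvariance


def simulate(n_samples=100, n_snps=20, n_causal=3, h2=0.3, seed=0):
    """Simulate 0/1/2 genotypes and a phenotype with heritability near h2."""
    rng = random.Random(seed)
    mafs = [rng.uniform(0.05, 0.5) for _ in range(n_snps)]
    # Each genotype is the count of minor alleles over two independent draws.
    geno = [
        [(rng.random() < maf) + (rng.random() < maf) for maf in mafs]
        for _ in range(n_samples)
    ]
    causal = rng.sample(range(n_snps), n_causal)
    betas = {j: rng.gauss(0, 1) for j in causal}
    genetic = [sum(betas[j] * row[j] for j in causal) for row in geno]
    # Scale the noise so that var_g / (var_g + var_e) is roughly h2.
    sd_e = sqrt(pvariance(genetic) * (1 - h2) / h2)
    pheno = [g + rng.gauss(0, sd_e) for g in genetic]
    return geno, pheno, causal


geno, pheno, causal = simulate(n_samples=100)
print(len(geno), len(geno[0]), sorted(causal))
```

Fixing the seed makes benchmark runs comparable across machines and library versions.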


GPU Scaling

  • On a single GPU: runs without a Dask cluster.
  • On multiple GPUs: run_windows_with_dask automatically launches a LocalCUDACluster and distributes windows across devices.
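
The orchestration pattern (map a per-window function over many windows, then gather the rows into one summary) can be pictured without a GPU. The sketch below swaps in Python's standard-library thread pool for Dask and LocalCUDACluster, and fit_window is a placeholder rather than the real per-window fit; the window dict layout is likewise an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_window(window):
    """Placeholder for the per-window work; returns one summary row."""
    return {**window, "width": window["end"] - window["start"]}


def run_windows(windows, n_workers=2):
    """Map fit_window over all windows and gather rows in input order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(fit_window, windows))


windows = [
    {"chrom": 21, "start": s, "end": s + 500_000}
    for s in range(0, 2_000_000, 500_000)
]
rows = run_windows(windows)
print(len(rows), rows[0])
```

Because each window is independent, the same map-and-gather shape carries over when the workers are GPUs instead of threads.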

Citation

If you use GENBoostGPU in your research, please cite:

Alexis Bennett and Kynon J.M. Benjamin. GENBoostGPU: GPU-accelerated elastic net boosting for large-scale epigenomics. DOI: 10.5281/zenodo.17238798


License

GENBoostGPU is licensed under the GPL-3.0 license. See the LICENSE file for details.

