Subgenome-aware scalable LMM + DL prior conditional lift for allopolyploid GWAS
HomoeoGWAS
Subgenome-aware mixed-model GWAS for allopolyploid crops, with an optional zero-shot deep-learning prior.
HomoeoGWAS runs GWAS on allopolyploid crops (wheat, cotton, rapeseed, oat,
strawberry, …) by modelling each subgenome explicitly. A new species is added
through a single YAML config, with no changes to the framework code. The only
requirement is that the subgenomes can be told apart, either through a
homoeologous chromosome naming scheme or through a chrom_map; the optional
deep-learning prior additionally needs a reference FASTA.
It combines:
- Subgenome-partitioned linear mixed model: y = Xβ + u_A + u_B [+ u_D …] + ε, with a per-subgenome GRM fit by REML and an optional leave-one-chromosome-out (LOCO) correction.
- Optional homoeolog interaction kernel: K_hom = K_A ⊙ K_B [⊙ K_D] for cross-subgenome epistasis.
- Optional zero-shot deep-learning prior: PlantCaduceus + AgroNT log-likelihoods fused with the GWAS p-value to re-rank candidate loci.
- CPU and dual-GPU backends for the per-SNP scan, scaling to tens of millions of markers.
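To make the kernel notation concrete, here is a minimal NumPy sketch of a per-subgenome GRM and the Hadamard-product homoeolog kernel. It is an illustration under stated assumptions, not the package's implementation: the function names `vanraden_grm` and `homoeolog_kernel` are hypothetical, the GRM uses the common VanRaden scaling on 0/1/2 dosages, and "pairwise-mean" for four or more subgenomes is read here as the mean of all pairwise Hadamard products.

```python
from itertools import combinations

import numpy as np


def vanraden_grm(dosages):
    """GRM from an (n_samples, n_markers) matrix of 0/1/2 allele dosages.

    Assumption: VanRaden-style scaling; the package may normalise differently.
    """
    G = np.asarray(dosages, dtype=float)
    p = G.mean(axis=0) / 2.0                 # per-marker allele frequency
    keep = (p > 0.0) & (p < 1.0)             # drop monomorphic markers
    Z = G[:, keep] - 2.0 * p[keep]           # centre each marker
    denom = 2.0 * np.sum(p[keep] * (1.0 - p[keep]))
    return Z @ Z.T / denom


def homoeolog_kernel(grms):
    """Combine per-subgenome GRMs into one interaction kernel.

    2-3 subgenomes: full Hadamard (element-wise) product.
    4+ subgenomes: mean of pairwise Hadamard products (an assumed reading
    of "pairwise-mean").
    """
    mats = list(grms.values())
    if len(mats) < 2:
        raise ValueError("K_hom needs at least two subgenomes")
    if len(mats) <= 3:
        K = mats[0].copy()
        for M in mats[1:]:
            K = K * M                        # element-wise, not matrix product
        return K
    pairs = [a * b for a, b in combinations(mats, 2)]
    return np.mean(pairs, axis=0)


# Toy hexaploid example: 20 samples, 200 random markers per subgenome.
rng = np.random.default_rng(0)
n_samples = 20
geno = {sg: rng.integers(0, 3, size=(n_samples, 200)) for sg in ("A", "B", "D")}
grms = {sg: vanraden_grm(G) for sg, G in geno.items()}
K_hom = homoeolog_kernel(grms)
```

Because each GRM is positive semi-definite, their Hadamard product is as well (Schur product theorem), so K_hom can enter the mixed model as an additional variance component alongside the per-subgenome kernels.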
Quick start
# 1. Install (CPU)
pip install homoeogwas # or: pip install -e ".[dev]" from a checkout
# 2. Verify the install end-to-end (~2 s): synthesise a tiny dataset + run a fit
homoeogwas demo --keep # prints acceptance checks + lists the outputs
# 3. Run on your own data
homoeogwas validate -c my_run.yaml # check config + input paths first
homoeogwas fit -c my_run.yaml -o results/my_run
# (Optional) GPU extras for the per-SNP scan + DL prior
pip install "homoeogwas[gpu]"
See examples/minimal/ for the demo dataset + an annotated
config, and the I/O contract for input/output formats. CLI
subcommands: fit, validate, demo, split, interact.
Containers
# Docker — CPU (bundles plink2 + bcftools, so split/VCF -> fit all work)
docker build -t homoeogwas:cpu .
docker run --rm homoeogwas:cpu demo
docker run --rm -v "$PWD":/work -w /work homoeogwas:cpu fit -c run.yaml
# Docker — GPU (per-SNP scan + DL prior; CUDA 12.1)
docker build -f Dockerfile.gpu -t homoeogwas:gpu .
docker run --rm --gpus all -v "$PWD":/work -w /work homoeogwas:gpu fit -c run.yaml --backend gpu
# Apptainer / Singularity (HPC, no root) — convert the Docker image
apptainer build homoeogwas.sif docker-daemon://homoeogwas:cpu
apptainer run homoeogwas.sif demo
Pass --build-arg PIP_INDEX_URL=<mirror> to build through a faster pip mirror.
How it works
flowchart LR
G["VCF / PLINK genotypes"] -->|split| S["per-subgenome<br/>genotype sets"]
S --> K["per-subgenome GRM<br/>(+ optional K_hom)"]
K --> R["multi-kernel REML<br/>mixed model"]
R --> SC["per-SNP scan<br/>(CPU / GPU, LOCO)"]
SC --> O["sumstats + QQ +<br/>Manhattan + λ_GC"]
SC -. optional .-> D["DL-prior<br/>re-ranking"]
D --> RR["re-ranked candidates"]
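The λ_GC reported at the end of the pipeline is, by convention, the genomic inflation factor: the median observed 1-df chi-square statistic divided by its expected median under the null. The sketch below shows that conventional calculation using only the standard library; the function name `lambda_gc` is hypothetical, and whether diagnostics.py computes it exactly this way is an assumption.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364231195724


def lambda_gc(pvalues):
    """Genomic inflation factor from two-sided p-values in (0, 1]."""
    nd = NormalDist()
    chi2 = []
    for p in pvalues:
        if not 0.0 < p <= 1.0:
            raise ValueError(f"p-value out of range: {p}")
        # A two-sided p maps to |z| = -inv_cdf(p / 2); chi2 (1 df) = z ** 2.
        chi2.append(nd.inv_cdf(p / 2.0) ** 2)
    return median(chi2) / CHI2_1DF_MEDIAN


# Well-calibrated scan: evenly spaced p-values, lambda_GC close to 1.
uniform = [(i + 0.5) / 1000 for i in range(1000)]
calibrated = lambda_gc(uniform)

# Inflated scan: p-values skewed towards zero, lambda_GC above 1.
inflated = lambda_gc([p * p for p in uniform])
```

A value near 1 suggests the mixed model has absorbed population structure and relatedness; values well above 1 usually point to residual confounding.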
Adding a new species
Any allopolyploid is supported through configuration alone:
- Copy an existing configs/species/*.yaml and edit subgenomes, the chromosome naming / chrom_map, the reference assembly path, and ploidy. The schema in src/homoeogwas/species_config.py validates it.
- homoeogwas split --species <yaml> --vcf <in.vcf.gz> --out-dir ... splits the markers into per-subgenome genotype sets. K_hom auto-selects its form for the subgenome count (full Hadamard for 2–3; pairwise-mean for 4+ to stay full-rank).
- homoeogwas fit --config <run.yaml> runs the mixed-model scan; the optional DL-prior step additionally needs the species reference FASTA.
No Python is edited at any step. Diploids can run the mixed model, but the
homoeolog kernel K_hom is not meaningful for them.
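As a rough illustration of what "distinguishable subgenomes" means in practice, the sketch below assigns chromosomes to subgenomes either from a wheat-style name suffix (chr1A, chr1B, chr1D) or from an explicit mapping. Everything here is hypothetical: the function names, the regular expression, and the dictionary layout are assumptions for illustration and do not reproduce the schema in species_config.py or the logic of homoeogwas split.

```python
import re

# Matches names ending in <number><letter>, e.g. "chr1A" or "7D".
_SUFFIX = re.compile(r"(\d+)([A-Za-z])$")


def assign_subgenome(chrom, subgenomes, chrom_map=None):
    """Return the subgenome label for a chromosome, or None if unassigned.

    An explicit chrom_map entry wins; otherwise fall back to the name suffix.
    """
    if chrom_map and chrom in chrom_map:
        return chrom_map[chrom]
    match = _SUFFIX.search(chrom)
    if match and match.group(2).upper() in subgenomes:
        return match.group(2).upper()
    return None


def split_markers(markers, subgenomes, chrom_map=None):
    """Group (chrom, pos) markers by subgenome; collect the rest separately."""
    groups = {sg: [] for sg in subgenomes}
    unassigned = []
    for chrom, pos in markers:
        sg = assign_subgenome(chrom, subgenomes, chrom_map)
        if sg is None:
            unassigned.append((chrom, pos))
        else:
            groups[sg].append((chrom, pos))
    return groups, unassigned


# Wheat-style naming needs no map; unplaced scaffolds stay unassigned.
wheat = [("chr1A", 100), ("chr1B", 250), ("chr7D", 42), ("chrUn", 7)]
wheat_groups, wheat_rest = split_markers(wheat, ["A", "B", "D"])

# Prefix-style names (letter before the number) need an explicit map.
cotton_map = {"A01": "A", "D01": "D"}
cotton_groups, cotton_rest = split_markers(
    [("A01", 10), ("D01", 20)], ["A", "D"], chrom_map=cotton_map
)
```

Markers on unplaced scaffolds fall into the unassigned list here; how the real splitter treats them is not stated above.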
Tested species
The framework has been run end-to-end on five crops, from tetraploid (4n) to octoploid (8n), through the same code path; this list is illustrative, not a limit on supported species.
| Species | Subgenomes | Reference assembly |
|---|---|---|
| Wheat (Triticum aestivum) | AABBDD (6n) | IWGSC RefSeq v1.0 |
| Cotton (Gossypium hirsutum) | AADD (4n) | HBAU NDM8 |
| Rapeseed (Brassica napus) | AACC (4n) | Darmor v4.1 |
| Oat (Avena sativa) | AACCDD (6n) | OT3098 v2 |
| Strawberry (Fragaria × ananassa) | octoploid (4 subgenomes) | NIHHS Seolhyang |
Package layout
src/homoeogwas/
├── species_config.py # config schema (pydantic)
├── species_split.py # VCF -> per-subgenome genotype splitter
├── grm.py # per-subgenome and LOCO GRMs
├── kernel.py # K_pool (additive) and K_hom (homoeolog) kernels
├── lmm.py # multi-kernel REML mixed model
├── gp.py # GBLUP prediction + cross-validation
├── scan.py # per-SNP scan (CPU + dual-GPU, LOCO)
├── diagnostics.py # lambda_GC, QQ, retained-fraction checks
├── calibration.py # null-simulation type-I error
├── sim.py # power-vs-FDR simulation
├── interact.py # homoeolog-pair interaction scan
├── cli.py # command-line interface
└── io.py # genotype I/O
Testing
pytest -m "not gpu and not slow" # CPU suite (~3-5 min): 287 passed + 1 skipped
pytest -m "not slow" # + GPU tests (needs torch)
pytest # full suite incl. simulation benchmarks
CI runs ruff + the CPU test suite on Python 3.10 / 3.11 / 3.12.
Reproducing the paper
The analysis code, configs, and figure pipeline for the manuscript live under
reproducibility/. Large inputs (data/) and intermediate
outputs (results/) are not tracked; see reproducibility/paper/ for how to
fetch the public datasets and regenerate the figures, and
reproducibility/paper/scripts/reproduce_baselines.sh to clone the external
benchmark tools.
Status
This is research software released alongside a manuscript in preparation (target Nature Communications). The package and its tests are stable; the biological associations in the paper are the subject of that manuscript and should be cited from it once published.
Citation
@unpublished{homoeogwas2026,
title = {HomoeoGWAS: subgenome-aware mixed-model GWAS for allopolyploid crops},
author = {Yang, Shipeng},
year = {2026},
note = {Manuscript in preparation},
url = {https://github.com/Shipeng-Yang/HomoeoGWAS},
}
See CITATION.cff for machine-readable metadata.
License
MIT — see LICENSE.