Subgenome-aware scalable LMM + DL prior conditional lift for allopolyploid GWAS

HomoeoGWAS

Subgenome-aware mixed-model GWAS for allopolyploid crops, with an optional zero-shot deep-learning prior.

HomoeoGWAS runs GWAS on allopolyploid crops (wheat, cotton, rapeseed, oat, strawberry, …) by modelling each subgenome explicitly. A new species is added through a single YAML config, with no framework code changes. The only requirement is that the subgenomes are distinguishable, either through a homoeologous chromosome naming scheme or through an explicit chrom_map; the optional deep-learning prior additionally needs a reference FASTA.

It combines:

  1. Subgenome-partitioned linear mixed model: y = Xβ + u_A + u_B [+ u_D …] + ε, with a per-subgenome GRM fit by REML and an optional leave-one-chromosome-out (LOCO) correction.
  2. Optional homoeolog interaction kernel K_hom = K_A ⊙ K_B [⊙ K_D] for cross-subgenome epistasis.
  3. Optional zero-shot deep-learning prior — PlantCaduceus + AgroNT log-likelihoods fused with the GWAS p-value to re-rank candidate loci.
  4. CPU and dual-GPU backends for the per-SNP scan, scaling to tens of millions of markers.
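
The kernels in items 1 and 2 can be illustrated with a small NumPy sketch. It is not the package's implementation: the VanRaden-style standardisation, the function names `grm` and `hadamard_kernel`, and the toy dosage data are all assumptions made for illustration.

```python
import numpy as np

def grm(genotypes: np.ndarray) -> np.ndarray:
    """VanRaden-style GRM from an (n_samples, n_markers) dosage matrix in {0, 1, 2}.

    Assumption: the package may standardise markers differently.
    """
    freq = genotypes.mean(axis=0) / 2.0            # per-marker allele frequency
    keep = (freq > 0.0) & (freq < 1.0)             # drop monomorphic markers
    centred = genotypes[:, keep] - 2.0 * freq[keep]
    scale = 2.0 * np.sum(freq[keep] * (1.0 - freq[keep]))
    return centred @ centred.T / scale

def hadamard_kernel(kernels: list) -> np.ndarray:
    """Element-wise (Hadamard) product of per-subgenome GRMs: K_A ⊙ K_B [⊙ K_D]."""
    out = np.ones_like(kernels[0])
    for k in kernels:
        out = out * k
    return out

rng = np.random.default_rng(0)
n_samples = 8
# Hypothetical toy dosages for two subgenomes, A and B.
geno = {name: rng.integers(0, 3, size=(n_samples, 50)).astype(float)
        for name in ("A", "B")}
K = {name: grm(g) for name, g in geno.items()}     # one GRM per subgenome
K_hom = hadamard_kernel(list(K.values()))          # homoeolog interaction kernel
```

Because each GRM is positive semi-definite, their Hadamard product is too (Schur product theorem), so K_hom can enter the mixed model as an additional variance component.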

Quick start

# 1. Install (CPU)
pip install homoeogwas            # or: pip install -e ".[dev]" from a checkout

# 2. Verify the install end-to-end (~2 s): synthesise a tiny dataset + run a fit
homoeogwas demo --keep            # prints acceptance checks + lists the outputs

# 3. Run on your own data
homoeogwas validate -c my_run.yaml    # check config + input paths first
homoeogwas fit -c my_run.yaml -o results/my_run

# (Optional) GPU extras for the per-SNP scan + DL prior
pip install "homoeogwas[gpu]"

See examples/minimal/ for the demo dataset + an annotated config, and the I/O contract for input/output formats. CLI subcommands: fit, validate, demo, split, interact.

Containers

# Docker — CPU (bundles plink2 + bcftools, so split/VCF -> fit all work)
docker build -t homoeogwas:cpu .
docker run --rm homoeogwas:cpu demo
docker run --rm -v "$PWD":/work -w /work homoeogwas:cpu fit -c run.yaml

# Docker — GPU (per-SNP scan + DL prior; CUDA 12.1)
docker build -f Dockerfile.gpu -t homoeogwas:gpu .
docker run --rm --gpus all -v "$PWD":/work -w /work homoeogwas:gpu fit -c run.yaml --backend gpu

# Apptainer / Singularity (HPC, no root) — convert the Docker image
apptainer build homoeogwas.sif docker-daemon://homoeogwas:cpu
apptainer run homoeogwas.sif demo

Pass --build-arg PIP_INDEX_URL=<mirror> to build through a faster pip mirror.

How it works

flowchart LR
    G["VCF / PLINK genotypes"] -->|split| S["per-subgenome<br/>genotype sets"]
    S --> K["per-subgenome GRM<br/>(+ optional K_hom)"]
    K --> R["multi-kernel REML<br/>mixed model"]
    R --> SC["per-SNP scan<br/>(CPU / GPU, LOCO)"]
    SC --> O["sumstats + QQ +<br/>Manhattan + &lambda;_GC"]
    SC -. optional .-> D["DL-prior<br/>re-ranking"]
    D --> RR["re-ranked candidates"]
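
The λ_GC diagnostic at the end of the scan is the genomic inflation factor. A minimal standard-library sketch of the usual definition follows, assuming two-sided p-values from a 1-df test; the function name `lambda_gc` is hypothetical and the package's own routine in diagnostics.py may differ in detail.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def lambda_gc(p_values):
    """Genomic inflation factor: median observed chi-square / expected median.

    Expects p-values in (0, 1]; a p-value of exactly 0 has no finite quantile.
    """
    norm = NormalDist()
    # A two-sided p-value maps to a 1-df chi-square via the squared normal quantile.
    chi2 = [norm.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN

# Evenly spaced p-values mimic a well-calibrated null scan.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
null_lambda = lambda_gc(null_p)

# Halving every p-value mimics an inflated scan.
inflated_lambda = lambda_gc([p * 0.5 for p in null_p])
```

A value close to 1 indicates a calibrated scan; values well above 1 suggest residual confounding that the kinship correction did not absorb.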

Adding a new species

Any allopolyploid is supported through configuration alone:

  1. Copy an existing configs/species/*.yaml and edit subgenomes, the chromosome naming / chrom_map, the reference assembly path, and ploidy. The schema in src/homoeogwas/species_config.py validates it.
  2. homoeogwas split --species <yaml> --vcf <in.vcf.gz> --out-dir ... splits the markers into per-subgenome genotype sets. K_hom auto-selects its form for the subgenome count (full Hadamard for 2–3; pairwise-mean for 4+ to stay full-rank).
  3. homoeogwas fit --config <run.yaml> runs the mixed-model scan; the optional DL-prior step additionally needs the species reference FASTA.
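
To make the splitting step concrete, here is a rough sketch of how markers could be assigned to subgenomes from either a naming convention or an explicit chrom_map. It is illustrative only and is not needed to use the tool: the functions `assign_subgenome` and `split_markers`, the trailing-letter naming rule, and the `scaffold_7` entry are hypothetical, and the real splitter works on VCF records through the YAML config.

```python
import re
from collections import defaultdict

def assign_subgenome(chrom, subgenomes, chrom_map=None):
    """Return the subgenome label for a chromosome name, or None if unassigned."""
    if chrom_map and chrom in chrom_map:
        return chrom_map[chrom]                    # an explicit mapping wins
    # Fallback (assumed convention): names such as "chr1A" or "3D",
    # where the trailing letter is the subgenome.
    match = re.fullmatch(r"(?:chr)?\d+([A-Za-z])", chrom)
    if match and match.group(1).upper() in subgenomes:
        return match.group(1).upper()
    return None

def split_markers(markers, subgenomes, chrom_map=None):
    """Group (chrom, pos) markers by subgenome; unassigned markers go under None."""
    groups = defaultdict(list)
    for chrom, pos in markers:
        label = assign_subgenome(chrom, subgenomes, chrom_map)
        groups[label].append((chrom, pos))
    return dict(groups)

markers = [("chr1A", 1200), ("chr1B", 845), ("chr1D", 3300),
           ("chrUn", 17), ("scaffold_7", 99)]
groups = split_markers(markers, subgenomes={"A", "B", "D"},
                       chrom_map={"scaffold_7": "B"})
```

Markers that match neither the naming rule nor the map end up under `None`, which makes unplaced scaffolds easy to spot before fitting.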

No Python is edited at any step. Diploids can run the mixed model, but the homoeolog kernel K_hom is not meaningful for them.
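
The form selection for K_hom described in step 2 might look roughly like the sketch below. The reading of "pairwise-mean" as the average of all pairwise Hadamard products is an assumption, as are the function name `homoeolog_kernel` and the toy kernels.

```python
from itertools import combinations
import numpy as np

def homoeolog_kernel(kernels):
    """Full Hadamard product for 2-3 kernels; mean of pairwise products for 4+."""
    if len(kernels) < 2:
        # A single (diploid) kernel has no homoeolog partner to interact with.
        raise ValueError("K_hom needs at least two subgenome kernels")
    if len(kernels) <= 3:
        out = np.ones_like(kernels[0])
        for k in kernels:
            out = out * k
        return out
    # Assumed interpretation of "pairwise-mean": average K_i ⊙ K_j over all pairs.
    pairs = [a * b for a, b in combinations(kernels, 2)]
    return np.mean(pairs, axis=0)

rng = np.random.default_rng(1)

def toy_kernel(n=6, m=40):
    """Hypothetical positive semi-definite kernel standing in for a subgenome GRM."""
    z = rng.standard_normal((n, m))
    return z @ z.T / m

four = [toy_kernel() for _ in range(4)]
K_hom2 = homoeolog_kernel(four[:2])    # two subgenomes: K_1 ⊙ K_2
K_hom4 = homoeolog_kernel(four)        # four subgenomes: mean of the 6 pairwise products
```

Multiplying many kernels element-wise drives off-diagonal entries toward zero, which is one plausible reason to switch to an average of pairwise products once there are four or more subgenomes.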

Tested species

The framework has been run end-to-end on five crops, from tetraploid to octoploid, through the same code path; this list is illustrative, not a limit on supported species.

| Species | Subgenomes | Reference assembly |
| --- | --- | --- |
| Wheat (Triticum aestivum) | AABBDD (6n) | IWGSC RefSeq v1.0 |
| Cotton (Gossypium hirsutum) | AADD (4n) | HBAU NDM8 |
| Rapeseed (Brassica napus) | AACC (4n) | Darmor v4.1 |
| Oat (Avena sativa) | AACCDD (6n) | OT3098 v2 |
| Strawberry (Fragaria × ananassa) | octoploid (4 subgenomes) | NIHHS Seolhyang |

Package layout

src/homoeogwas/
├── species_config.py   # config schema (pydantic)
├── species_split.py    # VCF -> per-subgenome genotype splitter
├── grm.py              # per-subgenome and LOCO GRMs
├── kernel.py           # K_pool (additive) and K_hom (homoeolog) kernels
├── lmm.py              # multi-kernel REML mixed model
├── gp.py               # GBLUP prediction + cross-validation
├── scan.py             # per-SNP scan (CPU + dual-GPU, LOCO)
├── diagnostics.py      # lambda_GC, QQ, retained-fraction checks
├── calibration.py      # null-simulation type-I error
├── sim.py              # power-vs-FDR simulation
├── interact.py         # homoeolog-pair interaction scan
├── cli.py              # command-line interface
└── io.py               # genotype I/O

Testing

pytest -m "not gpu and not slow"   # CPU suite (~3-5 min): 287 passed + 1 skipped
pytest -m "not slow"               # + GPU tests (needs torch)
pytest                             # full suite incl. simulation benchmarks

CI runs ruff + the CPU test suite on Python 3.10 / 3.11 / 3.12.

Reproducing the paper

The analysis code, configs, and figure pipeline for the manuscript live under reproducibility/. Large inputs (data/) and intermediate outputs (results/) are not tracked; see reproducibility/paper/ for how to fetch the public datasets and regenerate the figures, and reproducibility/paper/scripts/reproduce_baselines.sh to clone the external benchmark tools.

Status

This is research software released alongside a manuscript in preparation (target Nature Communications). The package and its tests are stable; the biological associations in the paper are the subject of that manuscript and should be cited from it once published.

Citation

@unpublished{homoeogwas2026,
  title  = {HomoeoGWAS: subgenome-aware mixed-model GWAS for allopolyploid crops},
  author = {Yang, Shipeng},
  year   = {2026},
  note   = {Manuscript in preparation},
  url    = {https://github.com/Shipeng-Yang/HomoeoGWAS},
}

See CITATION.cff for machine-readable metadata.

License

MIT — see LICENSE.
