Federated genome-wide association study pipeline built with Flower and PLINK

Federated GWAS Pipeline

This repository implements a federated pipeline for Genome-Wide Association Studies (GWAS) using Flower, PLINK, and custom privacy-preserving protocols. The pipeline supports multi-stage, multi-client GWAS with reproducible outputs and structured logging.

For release verification steps, see RELEASE.md. For implementation details and change history, see CURRENT_VERSION.md.


Environment Setup

Option 1: uv (recommended)

Install uv:

curl -LsSf https://astral.sh/uv/install.sh | sh

Sync dependencies (Python 3.11+):

uv sync --python 3.11

Optional dev dependencies:

uv sync --dev

Option 2: Conda

conda create -n fedgwas python=3.11 -y
conda activate fedgwas
pip install -e .
pip install -U "flwr[simulation]"

PLINK

  • Requires PLINK 1.9+.
  • Download the binary for your OS and make sure plink is on your PATH, or set its location in each client's config.yaml (the plink.path key, if configured).
  • Toy reference files are under plink/; production runs use experiment data under experiments/.
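
The lookup order described above (configured path first, then PATH) can be sketched as a small pre-flight check. This is a minimal illustration, not the pipeline's own resolution logic: resolve_plink is a hypothetical helper, and its configured_path argument stands in for whatever value a client config.yaml provides.

```python
import shutil
from pathlib import Path
from typing import Optional


def resolve_plink(configured_path: Optional[str] = None) -> Optional[str]:
    """Return a PLINK binary path, or None if none is found.

    `configured_path` is a stand-in for a value read from a client
    config.yaml (e.g. a plink.path key); how the pipeline actually reads
    that key is an assumption of this sketch.
    """
    if configured_path:
        candidate = Path(configured_path).expanduser()
        # Only accept the configured value if it points at an existing file.
        if candidate.is_file():
            return str(candidate)
    # Otherwise fall back to whatever `plink` resolves to on PATH.
    return shutil.which("plink")


if __name__ == "__main__":
    found = resolve_plink()
    print(found if found else "PLINK not found: install PLINK 1.9+ or set the path")
```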

Quick Start (Recommended: tiny_even)

The default Flower config in pyproject.toml points to experiments/correctness/tiny_even/configs (2 clients, tiny synthetic data).

Repository layout (experiments)

experiments/correctness/tiny_even/
├── config.yaml
├── configs/
│   ├── server/config.yaml
│   ├── center_1/config.yaml
│   └── center_2/config.yaml
├── data/tiny/
│   ├── center_1/          # PLINK .bed/.bim/.fam per client
│   ├── center_2/
│   └── centralized_baseline/   # after generate_baseline
└── results_2/             # gitignored; current shipped config output

Config templates: configs/config_template.yaml.

1. Generate synthetic data (if not present)

python pipeline/simulation/simulated_data/generate_synthetic_data.py \
  --scale tiny \
  --partition-strategy even \
  --seed 42 \
  --output-dir experiments/correctness/tiny_even/data
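
After generation, each client directory should hold complete PLINK filesets (.bed/.bim/.fam, per the layout above). A quick check could look like the sketch below. The incomplete_filesets helper is hypothetical and assumes all three files of a fileset share one prefix within the same directory.

```python
from pathlib import Path

PLINK_SUFFIXES = (".bed", ".bim", ".fam")


def incomplete_filesets(client_dir: str) -> dict[str, list[str]]:
    """Map each fileset prefix in `client_dir` to the suffixes it is missing.

    Filesets that have all three files are left out of the result, so an
    empty dict means nothing incomplete was found.
    """
    found: dict[str, set[str]] = {}
    for path in Path(client_dir).iterdir():
        if path.suffix in PLINK_SUFFIXES:
            found.setdefault(path.stem, set()).add(path.suffix)
    return {
        prefix: [s for s in PLINK_SUFFIXES if s not in suffixes]
        for prefix, suffixes in sorted(found.items())
        if len(suffixes) < len(PLINK_SUFFIXES)
    }


if __name__ == "__main__":
    base = Path("experiments/correctness/tiny_even/data/tiny")
    for center in ("center_1", "center_2"):
        if (base / center).is_dir():
            print(center, incomplete_filesets(str(base / center)) or "ok")
```

An empty result also occurs when the directory contains no PLINK files at all, so treat it as "nothing incomplete" rather than proof that data exists.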

2. Generate centralized baseline

python experiments/tools/generate_baseline.py \
  experiments/correctness/tiny_even/config.yaml

3. Run federated pipeline (simulation)

flwr run . local-simulation --stream

Override rounds or config path:

flwr run . local-simulation --stream --run-config \
  'simulation=true num-server-rounds=100 config_path="experiments/correctness/tiny_even/configs"'
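
When runs are launched from a script, the --run-config string can be assembled programmatically. The sketch below reproduces the format used in the command above (space-separated key=value pairs, lowercase booleans, double-quoted strings). format_run_config is a hypothetical helper, and it does not escape quotes embedded in string values.

```python
def format_run_config(overrides: dict[str, object]) -> str:
    """Render overrides in the key=value form shown in the command above."""
    parts = []
    for key, value in overrides.items():
        # bool must be tested before int, since bool is a subclass of int.
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            rendered = f'"{value}"'
        parts.append(f"{key}={rendered}")
    return " ".join(parts)


if __name__ == "__main__":
    run_config = format_run_config(
        {
            "simulation": True,
            "num-server-rounds": 100,
            "config_path": "experiments/correctness/tiny_even/configs",
        }
    )
    # Argument list suitable for subprocess.run(); no shell quoting needed.
    print(["flwr", "run", ".", "local-simulation", "--stream", "--run-config", run_config])
```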

Results are written under each client's logs/ and intermediate/ directories (paths set in per-center config.yaml). The shipped tiny configs currently write under experiments/correctness/tiny_even/results_2/; use the paths in the active center and server config files as the source of truth.

4. Retention (optional, automatic)

The experiment config.yaml may set retention.tier to minimal, standard, or research. When auto_apply_on_complete: true is set, the server prunes non-essential artifacts after the run. To apply retention manually instead (shown here as a dry run):

python experiments/tools/apply_run_retention.py \
  experiments/correctness/tiny_even/results_2 \
  --config-path experiments/correctness/tiny_even/configs \
  --dry-run

See RELEASE.md for tier definitions.
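
A retention block can be validated before a long run so that a typo in the tier name surfaces early. The sketch below assumes the settings are nested as retention.tier and retention.auto_apply_on_complete in an already-parsed config dict. That nesting and the parse_retention helper are assumptions of this example; only the three tier names come from the text above.

```python
from typing import Optional

VALID_TIERS = ("minimal", "standard", "research")


def parse_retention(config: dict) -> tuple[Optional[str], bool]:
    """Return (tier, auto_apply) from a parsed experiment config.

    Assumes a layout like:
        retention:
          tier: standard
          auto_apply_on_complete: true
    A missing retention block yields (None, False).
    """
    retention = config.get("retention") or {}
    tier = retention.get("tier")
    if tier is not None and tier not in VALID_TIERS:
        raise ValueError(f"retention.tier must be one of {VALID_TIERS}, got {tier!r}")
    auto_apply = bool(retention.get("auto_apply_on_complete", False))
    return tier, auto_apply


if __name__ == "__main__":
    example = {"retention": {"tier": "standard", "auto_apply_on_complete": True}}
    print(parse_retention(example))
```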

5. Evaluate against baseline

python experiments/tools/evaluation/evaluate_all.py \
  experiments/correctness/tiny_even/results_2 \
  --baseline experiments/correctness/tiny_even/data/tiny/centralized_baseline \
  --king

See experiments/correctness/tiny_even/README.md for expected metrics and success criteria. If you changed the output paths in the active configs, pass that results directory instead.


Documentation Site

The Docusaurus site is isolated under website/ and reads Markdown from the repository-level docs/ directory.

cd website
npm install
npm run start   # serve the site locally for development
npm run build   # produce the static production build

Three-Node Cluster Deployment

For Matpool or any 3-node layout (1 SuperLink + 2 SuperNodes), use the bundled scripts and guide:

bash cluster_deployment/scripts/setup-cluster-node.sh   # each node
bash cluster_deployment/scripts/cluster-verify-data.sh --scale tiny --client-id 1  # each client
cluster_deployment/scripts/cluster-run-app.sh \
  --server-ip <SERVER_IP> --scale tiny --rounds 20

Performance scales (small and medium) are defined in experiments/performance/scales.yaml, with per-scale READMEs under small_even/ and medium_even/.


Local Deployment Mode

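Local deployment requires a SuperLink, two SuperNodes, and a flwr run invocation. The SuperLink and SuperNodes are long-running processes, so start each in its own terminal: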

flower-superlink --insecure
flower-supernode --insecure --superlink 127.0.0.1:9092 --clientappio-api-address 127.0.0.1:9094 \
  --node-config 'partition-id=0 num-partitions=2 config-file="experiments/correctness/tiny_even/configs/center_1/config.yaml"'
flower-supernode --insecure --superlink 127.0.0.1:9092 --clientappio-api-address 127.0.0.1:9095 \
  --node-config 'partition-id=1 num-partitions=2 config-file="experiments/correctness/tiny_even/configs/center_2/config.yaml"'
flwr run . local-deployment --stream
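
The two SuperNode commands differ only in partition ID, ClientAppIo port, and config file, so they can be generated from a list of center configs. The sketch below uses only the flags shown above. supernode_commands is a hypothetical helper, and assigning consecutive ports starting at 9094 simply mirrors the example rather than any documented convention.

```python
def supernode_commands(
    center_configs: list[str],
    superlink: str = "127.0.0.1:9092",
    first_port: int = 9094,
) -> list[list[str]]:
    """Build one flower-supernode argument list per center config file."""
    total = len(center_configs)
    commands = []
    for partition_id, config_file in enumerate(center_configs):
        node_config = (
            f"partition-id={partition_id} num-partitions={total} "
            f'config-file="{config_file}"'
        )
        commands.append(
            [
                "flower-supernode",
                "--insecure",
                "--superlink", superlink,
                # One ClientAppIo port per node, counted up from first_port.
                "--clientappio-api-address", f"127.0.0.1:{first_port + partition_id}",
                "--node-config", node_config,
            ]
        )
    return commands


if __name__ == "__main__":
    base = "experiments/correctness/tiny_even/configs"
    for cmd in supernode_commands(
        [f"{base}/center_1/config.yaml", f"{base}/center_2/config.yaml"]
    ):
        print(cmd)
```

Each argument list can be passed to subprocess.Popen to start the node in the background.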

Advanced: Real-World Experiments

Larger studies (e.g. a 1000 Genomes subset) live under experiments/real_world/1000genomes/. They require downloading and preparing the data, take longer to run, and need config_path overridden:

flwr run . local-simulation --stream --run-config \
  'config_path="experiments/real_world/1000genomes/configs"'

Manuscript figures and prior run outputs under experiments/real_world/1000genomes/manuscript/ are research artifacts and are not required for the default release path.


Output and Logs

  • Per-client intermediate_dir and log_dir are defined in each center config.yaml.
  • Directories are cleared at the start of each client run to avoid stale artifacts.
  • Stage progress and errors go to per-client log files under each configured output.log_dir.
  • Inspect PLINK outputs (.assoc.logistic, .imiss, .frq, KING kinship files) directly under each client's logs/.
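
To get a quick inventory of what a client produced, the PLINK outputs listed above can be grouped by suffix. This is a minimal sketch: plink_outputs is a hypothetical helper, the directory passed in should be the log_dir from the active center config, and KING kinship files are left out because their file names are not specified here.

```python
from pathlib import Path

# Suffixes named in the list above; KING outputs are intentionally omitted.
PLINK_OUTPUT_SUFFIXES = (".assoc.logistic", ".imiss", ".frq")


def plink_outputs(log_dir: str) -> dict[str, list[str]]:
    """Group file names found under `log_dir` (recursively) by PLINK suffix."""
    grouped: dict[str, list[str]] = {suffix: [] for suffix in PLINK_OUTPUT_SUFFIXES}
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        for suffix in PLINK_OUTPUT_SUFFIXES:
            if path.name.endswith(suffix):
                grouped[suffix].append(path.name)
                break
    return grouped


if __name__ == "__main__":
    import sys

    for suffix, names in plink_outputs(sys.argv[1] if len(sys.argv) > 1 else ".").items():
        print(f"{suffix}: {len(names)} file(s)")
```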

Federated Protocol (Summary)

  1. Key exchange — ECC public keys via server relay
  2. Sync — Encrypted seed broadcast (server cannot decrypt)
  3. Local / global QC — Encrypted QC shares; exclusion list computed client-side
  4. Iterative KING — Chunked kinship with cross-client anonymized IDs
  5. Local LR + filtering — Tokenized insignificant SNPs
  6. Iterative LR — Chunked association on filtered data

Full stage contracts and privacy model: CURRENT_VERSION.md.
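
The summary does not say how the cross-client anonymized IDs in stage 4 are derived. One common construction, shown here purely as an illustration and not as this pipeline's method, is a keyed hash: clients that share a secret seed (such as one the server relays but cannot decrypt) map the same sample ID to the same opaque token, while anyone without the seed cannot recompute it. anonymize_ids and its arguments are hypothetical.

```python
import hashlib
import hmac


def anonymize_ids(sample_ids: list[str], shared_seed: bytes) -> dict[str, str]:
    """Map each sample ID to an HMAC-SHA256 token keyed by `shared_seed`.

    Illustrative only: the real pipeline's ID scheme is documented in
    CURRENT_VERSION.md and may differ.
    """
    return {
        sample_id: hmac.new(
            shared_seed, sample_id.encode("utf-8"), hashlib.sha256
        ).hexdigest()
        for sample_id in sample_ids
    }


if __name__ == "__main__":
    seed = b"example-seed-not-for-production"
    for original, token in anonymize_ids(["FAM1_IND1", "FAM1_IND2"], seed).items():
        print(original, "->", token[:16], "...")
```

Because the mapping is deterministic for a given seed, two clients can compare tokens without exchanging raw sample IDs.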


Troubleshooting

  • PLINK not found — Install PLINK 1.9+ and verify plink is on PATH or configured in config.yaml.
  • Wrong config — Check config_path in pyproject.toml or pass --run-config.
  • Empty results — Ensure data and baseline exist under experiments/correctness/tiny_even/data/.
  • Reproducibility — Use fixed seeds in data generation and consistent config_path across runs.

Contributing

Open issues or pull requests for bug fixes, improvements, or new features.

Acknowledgments

Built with Flower, PLINK, and open-source Python tools.
