Genome-wide association analysis toolkit

G2PInsight

What it is: G2PInsight is a command-line pipeline for genotype-to-phenotype (G2P) analysis. You provide genotype data (VCF or PLINK) and a phenotype file; the tool prepares a training matrix, fits machine-learning models, explains SNP importance with SHAP, and can score new samples — no coding required.

What it is for: breeding and genetics studies where you want to predict traits, compare ML methods, find important genomic regions, and generate publication-ready figures.

How to use it: run four stages in order (predict is optional):

preprocess  →  train or train-all  →  visualize  →  predict (optional)

Supports classification and regression, optional GWAS/LD SNP filtering in preprocess, and bundled PLINK / GEMMA binaries.

Table of Contents

- What you get
- Quick Start
- Pipeline
- Installation
- Commands
  - preprocess
  - train
  - train-all
  - predict
  - visualize
- Output layout
- FAQ
- Developer guide
- License

What you get

Stage	What it does	Main outputs
preprocess	Match samples, QC SNPs, optional GWAS/LD, build training matrix	`_train_data.txt` (tab-delimited), `_metadata.json`, phenotype plot
train	Fit one model: tuning, CV, test metrics, SHAP	`{Model}_model.pkl`, metrics, SHAP files
train-all	Fit all applicable models, pick the best on the test set	Best model folder + comparison JSON
visualize	Genome-wide importance, performance curves, SHAP plots	PNG / HTML under `visualize/`
predict	Score new samples with a saved model	`*_predictions.tsv`

Models: LightGBM, RandomForest, XGBoost, SVM, CatBoost, Logistic (classification only). train-all skips Logistic for regression.

Important conventions

- train / train-all need the *_metadata.json written by preprocess (passed with -j), not just a raw matrix path.
- GWAS/LD filtering runs in preprocess only; training does not re-filter SNPs.
- Wait for preprocess to finish (log line: Metadata file generated successfully) before training. In batch jobs, use set -e so the script stops if preprocess fails.
- Large SNP counts: use preprocess -f 2–4; for train-all, prefer --parallel_models 1 unless job memory is very large (each parallel worker copies the full matrix).
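
The memory note above can be made concrete with rough arithmetic. The sketch below is only an estimate under stated assumptions: it treats the matrix as a dense array with a fixed number of bytes per genotype (8 here, as for float64) and counts one copy in the parent plus one per parallel worker. The in-memory layout G2PInsight actually uses is not stated in this README, and `estimate_matrix_gib` is a hypothetical helper, not part of the tool.

```python
def estimate_matrix_gib(n_samples, n_snps, bytes_per_value=8, parallel_models=1):
    """Rough RAM estimate (GiB) for a dense sample x SNP matrix.

    Assumption: one full copy in the parent process plus one full copy
    per parallel worker. Real usage depends on dtype and model internals.
    """
    one_copy_bytes = n_samples * n_snps * bytes_per_value
    copies = 1 + parallel_models
    return one_copy_bytes * copies / 1024 ** 3


# Illustrative sizes only: 2,000 samples x 1,000,000 SNPs.
for workers in (1, 4):
    gib = estimate_matrix_gib(2_000, 1_000_000, parallel_models=workers)
    print(f"--parallel_models {workers}: roughly {gib:.1f} GiB for matrix copies")
```

If the estimate exceeds the job's memory limit, reducing the SNP count in preprocess (-f 2–4) shrinks every copy at once, which is why it is usually the first thing to try.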

Quick Start

# 1. Preprocess
G2PInsight preprocess --vcf data/genotype.vcf -p data/phenotype.txt -o results/

# 2. Train (check metadata exists first)
test -f results/preprocess/results_metadata.json && echo "OK"
G2PInsight train -j results/preprocess/results_metadata.json -m LightGBM -o results/

# 3. Visualize
G2PInsight visualize \
  -i results/train/LightGBM/LightGBM_shap_values.txt \
  -I results/train/LightGBM/LightGBM_plotting_data.json \
  -o results/plot/lightgbm

Whole-genome or millions of SNPs — preprocess with filtering, then train conservatively:

G2PInsight preprocess --bfile genotype -p pheno.txt -o results/ -f 4
G2PInsight train-all -j results/preprocess/results_metadata.json -o results/ \
  --parallel_models 1 --threads 4 --n_folds 3

Pipeline

  genotype + phenotype
         │
         ▼
   preprocess  ──►  *_train_data.txt + *_metadata.json
         │
         ▼
   train / train-all  ──►  model.pkl, metrics, SHAP, plotting JSON
         │
         ├──► visualize  ──►  PNG / HTML
         └──► predict    ──►  predictions.tsv

Use -o my_project/ as the project folder. Preprocess writes my_project/preprocess/my_project_metadata.json — that is the path for train -j.
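
For batch scripts it can help to compute that path rather than type it. This is a minimal sketch, assuming the prefix is simply the name of the output folder, as in the two examples in this README (results/ and my_project/); `expected_metadata_path` is a hypothetical helper and not part of G2PInsight.

```python
from pathlib import Path


def expected_metadata_path(output_dir):
    """Return <output>/preprocess/<prefix>_metadata.json.

    Assumption: <prefix> equals the output folder name, as in the README
    examples (results/ -> results_metadata.json).
    """
    out = Path(output_dir)
    # Path("results/").name is "results"; resolve() covers "." and similar.
    prefix = out.name or out.resolve().name
    return out / "preprocess" / f"{prefix}_metadata.json"


meta = expected_metadata_path("my_project/")
status = "found" if meta.is_file() else "missing (run preprocess first)"
print(meta.as_posix(), "->", status)
```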

Installation

Requirements: Python 3.8–3.12, Linux (recommended) or macOS. PLINK and GEMMA are bundled.

pip install G2PInsight
pip install "G2PInsight[bed]"   # strongly recommended for large preprocess jobs
# pip install "G2PInsight[all]"  # bed-reader + psutil

From source:

git clone https://github.com/chenrf0407/G2P_tool.git
cd G2P_tool && pip install -e ".[bed]"

G2PInsight --version
G2PInsight -h

Without bed-reader, preprocess falls back to slower PLINK --recodeA (--genotype-backend raw).

Commands

1. preprocess

What it is

Turns VCF / PLINK genotypes plus a two-column phenotype file into a sample × SNP training matrix and a metadata JSON that downstream stages require.

What you get

File	Purpose
`{prefix}_train_data.txt`	Tab-delimited matrix (`sample` index, SNP columns, `phenotype`); may be `.txt.gz` if large
`{prefix}_metadata.json`	Paths, sample list, SNP count, task type — required for train
`phenotype_distribution_pie.png`	Classification: class proportions
`phenotype_distribution_histogram.png`	Regression: trait distribution (after sample matching)

SNP columns are named chr_position (e.g. 1_123456).
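
As an illustration of that layout, the sketch below parses a tiny made-up matrix with the standard library. It assumes a header row whose first column is `sample` and last column is `phenotype`; the values are invented, and a real `.txt.gz` file would need `gzip.open(path, "rt")` instead of the in-memory string used here.

```python
import csv
import io

# Tiny stand-in for {prefix}_train_data.txt (invented values, not real output).
demo = (
    "sample\t1_123456\t1_234567\t2_5000\tphenotype\n"
    "S1\t0\t1\t2\t10.5\n"
    "S2\t2\t0\t1\t12.0\n"
)


def split_snp_name(name):
    """Split a 'chr_position' column name into (chromosome, position)."""
    # rsplit keeps any underscores that belong to the chromosome name.
    chrom, pos = name.rsplit("_", 1)
    return chrom, int(pos)


reader = csv.reader(io.StringIO(demo), delimiter="\t")
header = next(reader)
snp_columns = [c for c in header if c not in ("sample", "phenotype")]
rows = list(reader)

print(len(rows), "samples,", len(snp_columns), "SNP columns")
print([split_snp_name(c) for c in snp_columns])
```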

How to run

G2PInsight preprocess (--bfile|--file|--vcf) <genotype> -p <phenotype> -o <output> [options]

Typical run (GWAS + LD feature selection):

G2PInsight preprocess --vcf data.vcf -p pheno.txt -o results/ -f 4

Large genome (fast bed read, lower memory):

G2PInsight preprocess --bfile genotype -p pheno.txt -o results/ -f 4 --no-cache

Key options

Option	Default	What it controls
`-f`	`1`	`1` none · `2` GWAS · `3` LD · `4` GWAS→LD
`--no-filter-snps`	off	Skip MAF / missingness QC (not for whole-genome)
`--genotype-backend`	`auto`	`auto` = direct `.bed` read when bed-reader is installed, otherwise `raw`; `bed` = direct `.bed` read; `raw` = PLINK `--recodeA`
`--no-cache`	off	Less memory on huge jobs
`--parallel-chr-recode`	off	Parallel recode (raw backend only)

Inputs: exactly one of --bfile, --file, --vcf. Phenotype: two columns, no header (sample ID + value).
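
A quick check of the phenotype file before a long preprocess run can save time. The sketch below is a hypothetical validator, not part of G2PInsight: it assumes the two columns are separated by whitespace (the README does not name the delimiter) and flags header-like lines only indirectly, by requiring exactly two fields and unique sample IDs.

```python
import io


def read_phenotype(handle):
    """Parse a two-column, header-less phenotype file into {sample_id: value}."""
    phenotypes = {}
    for line_no, line in enumerate(handle, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(fields)}")
        sample_id, value = fields
        if sample_id in phenotypes:
            raise ValueError(f"line {line_no}: duplicate sample ID {sample_id!r}")
        phenotypes[sample_id] = value
    return phenotypes


# Invented example content; pass open("pheno.txt") for a real file.
demo = io.StringIO("S1\t10.5\nS2\t12.0\nS3\t9.8\n")
print(read_phenotype(demo))
```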

2. train

What it is

Trains one ML model from *_metadata.json: load matrix → split samples → optional hyperparameter search → K-fold CV on training split → final fit → test metrics → SHAP.

What you get

File	Purpose
`{Model}_model.pkl`	Saved model for predict
`{Model}_metrics.json`	Held-out test set metrics (main result)
`{Model}_cv_results.json`, `{Model}_cv_oof.tsv`	CV on training split only
`{Model}_shap_values.txt`	Genome-wide importance (tab-delimited, `.txt` extension)
`{Model}_shap_dependence.tsv`	Per-sample SHAP for dependence plots
`{Model}_plotting_data.json`	Index for visualize -I

How to run

G2PInsight train -j <metadata.json> -m <model> -o <output_dir> [options]

# Default (no hyperparameter search)
G2PInsight train -j results/preprocess/results_metadata.json -m LightGBM -o results/

# Large SNP set
G2PInsight train -j results/preprocess/results_metadata.json -m LightGBM -o results/ \
  --n_folds 3 --threads 4

# Enable hyperparameter search
G2PInsight train -j results/preprocess/results_metadata.json -m LightGBM -o results/ \
  --hyperparameter_search

Evaluation (default): 80% train / 20% test. Optional tuning and K-fold CV use the 80% only; {Model}_metrics.json reports the 20% test set.

Flow (default): K-fold CV on train split (model defaults) → final fit → SHAP. With --hyperparameter_search: hold-out tuning on the train split → CV → final fit → SHAP.
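
The evaluation scheme above can be sketched in plain Python to show which samples each step sees. This is an illustration only: the shuffling, seed, and any stratification G2PInsight applies are not described in this README, and `holdout_split` and `kfold` are hypothetical helpers.

```python
import random


def holdout_split(sample_ids, test_fraction=0.2, seed=0):
    """Shuffle sample IDs and return (train_ids, test_ids)."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


def kfold(train_ids, n_folds=5):
    """Yield (fit_ids, validation_ids) pairs drawn from the training split only."""
    for k in range(n_folds):
        val = train_ids[k::n_folds]
        val_set = set(val)
        fit = [s for s in train_ids if s not in val_set]
        yield fit, val


samples = [f"S{i}" for i in range(100)]
train_ids, test_ids = holdout_split(samples)
print(len(train_ids), "train /", len(test_ids), "test")

for fit, val in kfold(train_ids, n_folds=5):
    # Test samples never appear in any CV fold.
    assert not set(val) & set(test_ids)
    print(len(fit), "fit /", len(val), "validation")
```

Because the hold-out samples stay outside tuning and CV, the numbers in {Model}_metrics.json are the ones to quote as the main result; CV results describe the training split only.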

Key options

Option	Default	Notes
`-m`	—	`LightGBM`, `RandomForest`, `XGBoost`, `SVM`, `CatBoost`, `Logistic`
`--n_folds`	`5`	CV folds on training split
`--hyperparameter_search`	off	Opt-in RandomizedSearchCV; off by default for speed
`--threads`	`1`	Per-model CPU threads
`--shap_dependence_top`	`0`	Cap SNPs in dependence file (`0` = all)
`--ignore-warnings`	off	Hide `[WARNING]` logs and Python warnings (errors still shown)
`--group-file`	—	Per-group 70/15/15 train/val/test
`--train-ids-file` / `--test-ids-file`	—	Custom hold-out split

Before loading data, train checks metadata and logs memory guidance.

3. train-all

What it is

Trains every model for your task on the same data split. The parent reads the matrix once; each worker process runs the full pipeline for one model. After all finish, the best model (by held-out test performance) is kept; other model folders are removed.

What you get

File	Purpose
`{BestModel}/`	Winning model: `.pkl`, metrics, SHAP, plotting files
`best_model_info.json`	Which model won and the metric value
`model_comparison_report.json`	Success/fail, timings, settings
`all_models_cv_results.json`	All models’ CV results (saved before cleanup)

Best model rule: classification → AUC (else accuracy); regression → Pearson r on test set.
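
For regression, the rule above amounts to ranking models by Pearson r between observed and predicted test-set values. The sketch below shows that calculation with invented numbers; it is not the tool's code, and `pearson_r` and `pick_best` are hypothetical helpers.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def pick_best(scores):
    """Return the model name with the highest test-set score."""
    return max(scores, key=scores.get)


# Invented test-set values and predictions, for illustration only.
y_test = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "LightGBM": [1.1, 1.9, 3.2, 3.9],
    "SVM": [2.0, 2.5, 2.4, 3.0],
}
scores = {name: pearson_r(y_test, pred) for name, pred in predictions.items()}
print(pick_best(scores), {k: round(v, 3) for k, v in scores.items()})
```

Note that Pearson r is undefined when a model predicts a constant value (zero variance), so this sketch would raise a division error in that case.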

How to run

G2PInsight train-all -j <metadata.json> -o <output_dir> [options]

# Recommended default
G2PInsight train-all -j results/preprocess/results_metadata.json -o results/ \
  --parallel_models 1 --threads 4

# Millions of SNPs
G2PInsight train-all -j results/preprocess/results_metadata.json -o results/ \
  --parallel_models 1 --threads 4 --n_folds 3

# Opt-in hyperparameter search
G2PInsight train-all -j results/preprocess/results_metadata.json -o results/ \
  --hyperparameter_search --parallel_models 1 --threads 4

Shares train options except -m. Extra: --parallel_models (default 1), --ignore-warnings. On very wide matrices, high --parallel_models needs substantial RAM (each worker copies the full matrix).

4. predict

What it is

Applies a trained .pkl model to new samples. Aligns features to training SNPs (missing → zero-filled).
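
The alignment step can be pictured as follows. This is a simplified sketch of the idea rather than the tool's implementation: it reorders one sample's genotypes to the training SNP list, fills SNPs absent from the new data with zero, and drops SNPs the model never saw. The SNP names and genotypes are invented, and `align_to_training` is a hypothetical helper.

```python
def align_to_training(sample_genotypes, training_snps, fill_value=0):
    """Order genotypes by the training SNP list; absent SNPs get fill_value."""
    return [sample_genotypes.get(snp, fill_value) for snp in training_snps]


training_snps = ["1_123456", "1_234567", "2_5000"]    # hypothetical training features
new_sample = {"1_123456": 2, "2_5000": 1, "3_42": 1}  # 1_234567 missing, 3_42 extra

row = align_to_training(new_sample, training_snps)
matched = sum(snp in new_sample for snp in training_snps)
print(row, f"({matched}/{len(training_snps)} training SNPs found)")
```

When few training SNPs are found in the new data, most of the row is zero-filled, which is one way predictions can end up nearly constant.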

What you get

<output_dir>/predict/{Model}_predictions.tsv — columns sample, prediction (plus prob_class_* for classification).

How to run

G2PInsight predict -i <input> -m <model.pkl> -o <output_dir> [--task_type <type>]

Input (`-i`)	Notes
Tab-delimited matrix `.txt` / `.txt.gz`	Same format as preprocess output
PLINK binary prefix	`.bed/.bim/.fam` — training SNPs extracted first
VCF `.vcf` / `.vcf.gz`	Converted internally; training SNPs extracted first

G2PInsight predict -i new_genotype -m results/train/LightGBM/LightGBM_model.pkl -o results/ --task_type regression

Use the same genome build and SNP naming as training (*_training_features.json).

5. visualize

What it is

Builds figures from training outputs — genome-wide SHAP, top-SNPs bar chart, performance curves, CV diagnostics. Training writes data files; visualize renders PNG/HTML.

What you get

Under <parent_of_output_prefix>/visualize/:

With `-i` (SHAP)	With `-I` (plotting JSON)
Genome-wide importance (static + interactive)	Held-out test performance curves
Top-SNPs bar chart	Test metrics bar chart
SHAP summary & dependence (needs `*_shap_dependence.tsv`)	CV fold metrics, grouped hold-out bars

How to run

G2PInsight visualize [-i <shap_values.txt>] [-I <plotting_data.json>] -o <prefix> [--top_snps N]

At least one of -i or -I is required.

G2PInsight visualize -I results/train/LightGBM/LightGBM_plotting_data.json -o results/plot/perf
G2PInsight visualize -i results/train/LightGBM/LightGBM_shap_values.txt -o results/plot/shap --top_snps 20

*_shap_values.txt columns: feature, importance_abs, effect (1 / -1).

For huge SNP sets, set --shap_dependence_top N during training, then --top_snps ≤ N here.
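
To look at the top SNPs outside of visualize, the file can be read with the standard library. The sketch below assumes the file has a header row with the three column names listed above; the rows are invented, and `top_snps` is a hypothetical helper.

```python
import csv
import io

# Invented stand-in for {Model}_shap_values.txt; assumes a header row.
demo = (
    "feature\timportance_abs\teffect\n"
    "1_123456\t0.031\t1\n"
    "2_5000\t0.120\t-1\n"
    "1_234567\t0.007\t1\n"
)


def top_snps(handle, n):
    """Return the n rows with the largest importance_abs."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    rows.sort(key=lambda r: float(r["importance_abs"]), reverse=True)
    return rows[:n]


for r in top_snps(io.StringIO(demo), 2):
    direction = "+" if float(r["effect"]) > 0 else "-"
    print(r["feature"], r["importance_abs"], direction)
```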

Output layout

results/
├── preprocess/
│   ├── results_train_data.txt[.gz]
│   ├── results_metadata.json       ← train -j
│   └── phenotype_distribution_*.png
├── train/
│   ├── LightGBM/                   ← or only BestModel/ after train-all
│   │   ├── LightGBM_model.pkl
│   │   ├── LightGBM_metrics.json
│   │   ├── LightGBM_shap_values.txt
│   │   ├── LightGBM_plotting_data.json
│   │   └── ...
│   ├── best_model_info.json        ← train-all
│   ├── model_comparison_report.json
│   └── all_models_cv_results.json
├── predict/
│   └── LightGBM_predictions.tsv
└── plot/visualize/
    └── *.png / *.html

Temp files under {output}/tmp/ are removed after successful runs.

FAQ

train requires metadata — why?

train and train-all need *_metadata.json from preprocess. It records the matrix path, samples, SNP count, and task type.

Where is the metadata file?

<output>/preprocess/{prefix}_metadata.json. Example: -o results/ → results/preprocess/results_metadata.json.

train-all keeps only one model folder

This is by design: only the model with the best held-out test metric is kept. See best_model_info.json for the winner and all_models_cv_results.json for the CV results of every model.

train-all fails or workers crash

Symptom: Worker process crashed, all models failed, or OOM under high --parallel_models.

What to do:

- ≥ 500k SNPs: prefer --parallel_models 1 unless memory is abundant.
- Re-run preprocess with -f 2 or -f 4 if too many SNPs were kept.
- Use --parallel_models 1 --n_folds 3 --threads 4 (omit --hyperparameter_search).
- Increase job memory, or test one model: G2PInsight train -m LightGBM ...

Training or load is very slow

- Keep fewer SNPs in preprocess (-f 2–4).
- Uncompressed *_train_data.txt loads faster than .gz.
- Keep the default (no --hyperparameter_search) and lower --n_folds.

SHAP dependence file is huge

Pass --shap_dependence_top N in train / train-all (e.g. 50). Genome-wide *_shap_values.txt is unchanged.

Preprocess slow or killed

- Do not use --no-filter-snps on whole-genome data; use -f 2–4.
- Install with pip install "G2PInsight[bed]" (or pip install bed-reader) for the fast auto backend.
- Add --no-cache; for the raw backend, try --parallel-chr-recode.

predict: constant values or no feature match

- Use the same genome build and chr_position SNP naming as training.
- Check that sample IDs match.
- For VCF/PLINK input, training SNPs are extracted automatically; make sure those SNPs exist in the new data.

visualize: missing `effect` column

Use training-generated *_shap_values.txt with feature, importance_abs, effect.

Which preprocess backend was used?

Log lines: Genotype conversion backend: bed (fast) or raw (PLINK --recodeA).

Developer guide

assocG2P/G2PInsight/
├── main.py
└── bin/
    ├── preprocess.py
    ├── modeltraining.py
    ├── visualization.py
    ├── gemma_gwas.py
    └── plink_ld.py

pip install -e ".[bed]"
G2PInsight preprocess -h

License

MIT License
