Genome-wide association analysis toolkit

G2PInsight Genomic Analysis Tool

G2PInsight is a command-line toolkit for genotype-to-phenotype association analysis. It provides an end-to-end workflow covering data preprocessing, model training, prediction, and visualization, with optional GWAS/LD-based feature selection during preprocessing.

Project Overview
Core Features
System Requirements
Installation
Quick Start
Command Reference
Output Files
FAQ
Developer Guide
License

Project Overview

G2PInsight standardizes the following workflow:

Align and clean genotype and phenotype data.
Optionally apply GWAS/LD-based feature selection during preprocessing.
Train classification or regression models.
Export model metrics, feature importance, and plotting assets.
Run prediction on new samples using trained models.

This tool is suitable for bioinformatics and agricultural genomics applications where reproducible GWAS-oriented ML workflows are required.

Core Features

1) Data Preprocessing

Supported genotype inputs: VCF (.vcf/.vcf.gz), PLINK binary (.bed/.bim/.fam), PLINK text (.ped/.map).
Automatic sample matching and phenotype cleaning (missing/abnormal/non-numeric handling).
Automatic task-type inference (classification / regression).
Optional SNP quality filtering (MAF, GENO).
Feature-selection modes: no selection / GWAS / LD / GWAS+LD.
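The SNP quality filters can be illustrated with a minimal sketch. It assumes genotypes are coded as alternate-allele counts (0/1/2) with None for missing calls, and that MAF and GENO follow the usual PLINK-style definitions (minor allele frequency and per-SNP missing rate). The thresholds 0.05 and 0.1 and both function names are hypothetical; the README does not state the values G2PInsight uses.

```python
from typing import List, Optional, Tuple

# 0, 1, 2 = alternate-allele count; None = missing call (assumed coding)
Genotype = Optional[int]


def snp_stats(calls: List[Genotype]) -> Tuple[float, float]:
    """Return (minor allele frequency, missing rate) for one non-empty SNP column."""
    observed = [g for g in calls if g is not None]
    missing_rate = 1.0 - len(observed) / len(calls)
    if not observed:
        return 0.0, missing_rate
    # Each diploid sample carries two alleles.
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1.0 - alt_freq), missing_rate


def passes_filter(calls: List[Genotype],
                  min_maf: float = 0.05,      # hypothetical threshold
                  max_missing: float = 0.1    # hypothetical threshold
                  ) -> bool:
    """Keep a SNP only if it is common enough and well genotyped."""
    maf, missing_rate = snp_stats(calls)
    return maf >= min_maf and missing_rate <= max_missing
```

A monomorphic SNP (all calls 0) has MAF 0 and is dropped under these assumed thresholds, as is a SNP with more than 10% missing calls.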

2) Model Training

Supported models: LightGBM, RandomForest, XGBoost, SVM, CatBoost, Logistic.
Randomized hyperparameter search + cross-validation.
Saves model artifacts, metrics, feature importance, SHAP outputs, and plotting data.
train and train-all require preprocess-generated *_metadata.json as input.

3) Model Inference

Input support: training-matrix format (.txt/.txt.gz) or VCF (temporary conversion is handled automatically).
Prediction feature alignment is enforced against training features.
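What feature alignment means in practice can be sketched as follows. This is an illustration, not the tool's implementation: the function name is hypothetical, and raising an error on missing features is an assumption (G2PInsight may handle that case differently). The training feature order would come from a file such as <Model>_training_features.json.

```python
from typing import Dict, List


def align_features(row: Dict[str, float], training_features: List[str]) -> List[float]:
    """Reorder one sample's features to the training order.

    Features absent from training are dropped; a training feature that is
    absent from the sample raises ValueError (assumed behavior).
    """
    missing = [name for name in training_features if name not in row]
    if missing:
        raise ValueError(
            f"{len(missing)} training feature(s) missing from input, e.g. {missing[0]}"
        )
    return [row[name] for name in training_features]
```

Aligning by name rather than by position guards against a prediction file whose SNP columns are ordered differently from the training matrix.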

4) Visualization

Two visualization input modes:
- Feature-importance file (genome-wide scatter plot).
- plotting_data.npz (performance and CV training curves).
Output formats: static PNG for all plots, plus interactive HTML for feature-importance plots.

System Requirements

Python: 3.8 - 3.12
OS: Linux / macOS (recommended)
Shell: Bash

Check Python version:

python --version
# or
python3 --version
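The same check can be scripted, for example in a setup or CI step. The sketch below assumes the supported range is inclusive at both ends (3.8 through 3.12); the function name is hypothetical.

```python
import sys

SUPPORTED_MIN = (3, 8)   # from the requirements above
SUPPORTED_MAX = (3, 12)  # assumed inclusive


def is_supported(version=None) -> bool:
    """Return True if the (major, minor) version is within the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX
```

Calling is_supported() with no argument checks the running interpreter.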

Installation

Option 1: Install with project script (recommended on Linux/macOS)

chmod +x tools.sh
./tools.sh

Option 2: Install from wheel

pip install G2PInsight-1.0.0-py3-none-any.whl

Option 3: Install from source

pip install .

Verify installation

G2PInsight --version
G2PInsight -h

Quick Start

# 1) Preprocess
G2PInsight preprocess -g genotype.vcf -p phenotype.txt -o preprocessed/

# 2) Train (metadata-driven)
G2PInsight train -j preprocessed/preprocess/preprocessed_metadata.json -m LightGBM -o results/

# 3) Visualize feature importance
G2PInsight visualize -i results/train/LightGBM/LightGBM_feature_importance.txt -o result_plot

Note: train and train-all must use preprocess-generated *_metadata.json.
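For scripted runs, the three quick-start steps can be assembled as argument lists before being handed to a process runner. The sketch below only builds the commands and executes nothing. The function names are hypothetical, the metadata path is passed in explicitly because it depends on the output prefix, and the feature-importance path follows the layout described under Output Files.

```python
import shlex
from typing import List


def quick_start_commands(genotype: str, phenotype: str, preprocess_out: str,
                         metadata_json: str, results_dir: str,
                         model: str = "LightGBM") -> List[List[str]]:
    """Build argv lists for preprocess -> train -> visualize (nothing is run)."""
    # Layout assumed from the Output Files section: <output>/train/<Model>/...
    importance_file = (
        f"{results_dir.rstrip('/')}/train/{model}/{model}_feature_importance.txt"
    )
    return [
        ["G2PInsight", "preprocess", "-g", genotype, "-p", phenotype,
         "-o", preprocess_out],
        ["G2PInsight", "train", "-j", metadata_json, "-m", model,
         "-o", results_dir],
        ["G2PInsight", "visualize", "-i", importance_file, "-o", "result_plot"],
    ]


def as_shell(argv: List[str]) -> str:
    """Render one argv list as a copy-pasteable shell command."""
    return shlex.join(argv)
```

Each list can be passed to subprocess.run(argv, check=True) so that a failing step stops the pipeline before the next one starts.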

Command Reference

1. preprocess (data preprocessing)

Purpose: convert genotype and phenotype inputs into a model-ready training matrix, with optional GWAS/LD feature selection.

Usage

G2PInsight preprocess \
  -g <genotype_input> \
  -p <phenotype.txt> \
  -o <output_path> \
  [-f <1|2|3|4>] \
  [--gwas_pvalue <float>] \
  [--ld-config "<window_kb>,<window_variants>,<r2_threshold>"] \
  [--no-filter-snps]

Parameters

Parameter	Required	Default	Description
`-g, --genotype`	Yes	-	Genotype input path (VCF/PLINK)
`-p, --phenotype`	Yes	-	Phenotype file path (at least two columns: sample, phenotype)
`-o, --output`	Yes	-	Output directory or output prefix
`-f, --feature_selection_mode`	No	`1`	1=no selection, 2=GWAS, 3=LD, 4=GWAS+LD
`--gwas_pvalue`	No	`0.01`	GWAS significance threshold (effective for mode 2/4)
`--ld-config`	No	`"50,5,0.2"`	LD config: window_kb, window_variants, r² threshold (effective for mode 3/4)
`--no-filter-snps`	No	`False`	Disable SNP quality filtering
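The --ld-config value is a single comma-separated string, so it is easy to get wrong. The sketch below shows one way to validate such a string before launching a run. The class and function names are hypothetical, and treating the first two fields as integers is an assumption; the tool's own parser may accept other forms.

```python
from typing import NamedTuple


class LDConfig(NamedTuple):
    window_kb: int         # assumed integer
    window_variants: int   # assumed integer
    r2_threshold: float


def parse_ld_config(text: str = "50,5,0.2") -> LDConfig:
    """Parse '<window_kb>,<window_variants>,<r2_threshold>' into typed fields."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated values, got {len(parts)}")
    config = LDConfig(int(parts[0]), int(parts[1]), float(parts[2]))
    # r² is a squared correlation, so it must lie in [0, 1].
    if not 0.0 <= config.r2_threshold <= 1.0:
        raise ValueError("r2_threshold must be between 0 and 1")
    return config
```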

Example

G2PInsight preprocess -g data.vcf -p pheno.txt -o out/ -f 1
G2PInsight preprocess -g data.vcf -p pheno.txt -o out/ -f 4 --gwas_pvalue 0.01 --ld-config "50,5,0.2"

2. train (single-model training)

Purpose: train one selected model and export model artifacts, metrics, and plotting data.

Usage

G2PInsight train \
  -j <preprocess_metadata.json> \
  -m <LightGBM|RandomForest|XGBoost|SVM|CatBoost|Logistic> \
  -o <output_dir> \
  [--task_type <classification|regression>] \
  [--n_folds <int>] \
  [--random_state <int>] \
  [--feature_importance]

Parameters

Parameter	Required	Default	Description
`-j, --json`	Yes	-	Preprocess-generated `*_metadata.json`
`-m, --model`	Yes	-	Model name
`-o, --output_dir`	Yes	-	Output directory
`--task_type`	No	Auto	Optional explicit task type
`--n_folds`	No	`5`	Number of CV folds
`--random_state`	No	`42`	Random seed
`--feature_importance`	No	`False`	Trigger feature-importance output flow

Example

G2PInsight train -j out/preprocess/out_metadata.json -m LightGBM -o results/

3. train-all (all-model training)

Purpose: train all supported models in parallel and produce comparison outputs.

Usage

G2PInsight train-all \
  -j <preprocess_metadata.json> \
  -o <output_dir> \
  [--task_type <classification|regression>] \
  [--n_folds <int>] \
  [--random_state <int>] \
  [--feature_importance]

Important behavior

After training all models, the current implementation keeps only the best-performing model's directory and deletes the others. It also writes best_model_info.json.

Example

G2PInsight train-all -j out/preprocess/out_metadata.json -o results/

4. predict (model inference)

Purpose: predict phenotypes using a trained .pkl model.

Usage

G2PInsight predict \
  -i <input_data.txt|input_data.vcf|input_data.vcf.gz> \
  -m <model.pkl> \
  -o <output_dir> \
  [--task_type <classification|regression>]

Parameters

Parameter	Required	Default	Description
`-i, --input`	Yes	-	Prediction input (training matrix or VCF)
`-m, --model`	Yes	-	Path to trained model file (`.pkl`)
`-o, --output_dir`	Yes	-	Output directory (used for temp conversion when input is VCF)
`--task_type`	No	-	Optional task type

Output location

Prediction results are written to the model directory:

{model_dir}/{model_type}_predictions.tsv
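Because predictions are written next to the model rather than under -o, a wrapper script may want to compute the expected path in advance. The sketch below assumes the model file follows the <Model>_model.pkl naming shown under Output Files and that model_type equals that <Model> prefix; the function name is hypothetical.

```python
from pathlib import Path

MODEL_SUFFIX = "_model.pkl"  # naming assumed from the Output Files section


def predictions_path(model_file: str) -> Path:
    """Return {model_dir}/{model_type}_predictions.tsv for a given model file."""
    model_path = Path(model_file)
    if not model_path.name.endswith(MODEL_SUFFIX):
        raise ValueError(f"expected a file named <Model>{MODEL_SUFFIX}")
    model_type = model_path.name[: -len(MODEL_SUFFIX)]
    return model_path.parent / f"{model_type}_predictions.tsv"
```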

Example

G2PInsight predict -i new_data.txt -m results/train/LightGBM/LightGBM_model.pkl -o pred/

5. visualize (result visualization)

Purpose: generate feature-importance plots or model-performance plots.

Usage

G2PInsight visualize \
  [-i <feature_importance.txt>] \
  [-I <plotting_data.npz>] \
  -o <output_prefix>

Parameters

Parameter	Required	Description
`-i, --importance`	No	Feature-importance file
`-I, --indicator`	No	`plotting_data.npz` file from training outputs
`-o, --output`	Yes	Output prefix

Feature-importance format requirements

Recommended input: <Model>_feature_importance.txt generated by training.

Required columns:

feature (e.g., 1_12345 or chr1_12345)
importance_abs (or importance)
effect (1 or -1)
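A hand-built importance file can be checked against these requirements before plotting. The sketch below assumes the file is tab-delimited with a header row, which the README does not state explicitly; the function names are hypothetical.

```python
import csv
from typing import List, TextIO, Tuple


def split_feature_id(feature: str) -> Tuple[str, int]:
    """Split '1_12345' or 'chr1_12345' into (chromosome, position)."""
    chrom, sep, pos = feature.rpartition("_")
    if not sep or not chrom:
        raise ValueError(f"feature id {feature!r} is not <chrom>_<pos>")
    return chrom, int(pos)


def read_importance(handle: TextIO) -> List[Tuple[str, int, float, int]]:
    """Validate the required columns and return (chrom, pos, importance, effect)."""
    reader = csv.DictReader(handle, delimiter="\t")  # delimiter is an assumption
    columns = set(reader.fieldnames or [])
    if not {"feature", "effect"} <= columns:
        raise ValueError("missing required column: feature and/or effect")
    if not columns & {"importance_abs", "importance"}:
        raise ValueError("missing importance_abs (or importance) column")
    value_col = "importance_abs" if "importance_abs" in columns else "importance"
    rows = []
    for row in reader:
        chrom, pos = split_feature_id(row["feature"])
        effect = int(float(row["effect"]))
        if effect not in (1, -1):
            raise ValueError(f"effect must be 1 or -1, got {row['effect']!r}")
        rows.append((chrom, pos, float(row[value_col]), effect))
    return rows
```

Splitting on the last underscore keeps chromosome names that themselves contain underscores intact.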

Example

G2PInsight visualize -i results/train/LightGBM/LightGBM_feature_importance.txt -o plot
G2PInsight visualize -I results/train/LightGBM/LightGBM_plotting_data.npz -o plot

Output Files

preprocess

Typical location: <output>/preprocess/

<prefix>_train_data.txt
<prefix>_metadata.json
phenotype distribution plot(s), depending on task type

train

Typical location: <output>/train/<Model>/

<Model>_model.pkl
<Model>_metrics.json
<Model>_cv_results.json
<Model>_training_features.json
<Model>_feature_importance.txt
<Model>_shap_values.txt
<Model>_plotting_data.npz

train-all

Typical location: <output>/train/

best model directory (the current implementation removes the other model directories)
best_model_info.json
model_comparison_report.json

predict

Typical location: the directory that contains the model file

<Model>_predictions.tsv

visualize

Typical location: <output_parent>/visualize/

<prefix>_importance_static.png
<prefix>_importance_interactive.html
<prefix>_performance_curves.png
<prefix>_cv_training_curves.png

FAQ

1) `train` requires metadata input

train and train-all require preprocess-generated *_metadata.json via -j.

2) Cannot find preprocess outputs

Check <output>/preprocess/ for <prefix>_train_data.txt and <prefix>_metadata.json.
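This check can be automated. The sketch below assumes -o was given as a directory, so that outputs land in <output>/preprocess/; if -o was used as a prefix, the location may differ. The function name is hypothetical.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple

META_SUFFIX = "_metadata.json"


def find_preprocess_outputs(output_dir: str) -> Dict[str, Tuple[Path, Optional[Path]]]:
    """Map each prefix to (metadata file, training matrix or None if absent)."""
    preprocess_dir = Path(output_dir) / "preprocess"
    found = {}
    for meta in sorted(preprocess_dir.glob(f"*{META_SUFFIX}")):
        prefix = meta.name[: -len(META_SUFFIX)]
        train_data = preprocess_dir / f"{prefix}_train_data.txt"
        found[prefix] = (meta, train_data if train_data.exists() else None)
    return found
```

An empty result means no *_metadata.json exists under <output>/preprocess/, so train and train-all have nothing to read.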

3) `visualize` complains about missing `effect`

The input file does not meet the required 3-column schema. Use training-generated <Model>_feature_importance.txt.

4) Why does `train-all` keep only one model directory?

This is the current behavior: it selects the best model and removes the rest.

5) Why are prediction results not under `-o`?

The current implementation saves prediction outputs in the model directory, not under -o. The -o directory is used mainly for the temporary files created when a VCF input is converted.

Developer Guide

Project Structure

G2PInsight/
├── G2PInsight/
│   ├── main.py
│   └── bin/
│       ├── preprocess.py
│       ├── modeltraining.py
│       ├── gemma_gwas.py
│       ├── plink_ld.py
│       ├── visualization.py
│       └── font_utils.py
├── pyproject.toml
├── setup.py
└── README.md

Local Development Setup

git clone <your-repo-url>
cd G2PInsight
pip install -e .

Style Recommendations

Follow PEP 8.
Document parameters and return values for new public interfaces.
Keep README synchronized with CLI behavior whenever arguments or outputs change.

License

This project is distributed under the MIT License. See LICENSE for details.

Release

Version 1.0.0, released May 8, 2026.

Download files

Source Distribution

g2pinsight-1.0.0.tar.gz (12.7 MB)

Uploaded May 8, 2026 Source

Built Distribution

g2pinsight-1.0.0-py3-none-any.whl (12.8 MB)

Uploaded May 8, 2026 Python 3

File details

Details for the file g2pinsight-1.0.0.tar.gz.

File metadata

Download URL: g2pinsight-1.0.0.tar.gz
Upload date: May 8, 2026
Size: 12.7 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.19

File hashes

Hashes for g2pinsight-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`6787b40fbfe2e9f0a26f27e3b21273b21d46be8ccd0838d6ec091ac9c253c7b9`
MD5	`6f78d9ed108695d8169ef345f0c43fb2`
BLAKE2b-256	`1abcbcc88941f8dee3e081b7df51984ee330147f9d55dec694121e697cea54ec`

File details

Details for the file g2pinsight-1.0.0-py3-none-any.whl.

File metadata

Download URL: g2pinsight-1.0.0-py3-none-any.whl
Upload date: May 8, 2026
Size: 12.8 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.19

File hashes

Hashes for g2pinsight-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b6ec61295f3d5d604728a4fb14b7df600b8773e9e4ce5640387317a83a4fff3d`
MD5	`30ac391b241ce194eef0f9b42d5dbda0`
BLAKE2b-256	`26dcf86fb23af4799cfd6d62c2e664821d4eb61b275b47714c370ab30bb97252`
