Genome-wide association analysis toolkit
G2PInsight Genomic Analysis Tool
G2PInsight is a command-line toolkit for genotype-to-phenotype association analysis. It provides an end-to-end workflow covering data preprocessing, model training, prediction, and visualization, with optional GWAS/LD-based feature selection during preprocessing.
Table of Contents
- Project Overview
- Core Features
- System Requirements
- Installation
- Quick Start
- Command Reference
- Output Files
- FAQ
- Developer Guide
- License
Project Overview
G2PInsight standardizes the following workflow:
- Align and clean genotype and phenotype data.
- Optionally apply GWAS/LD-based feature selection during preprocessing.
- Train classification or regression models.
- Export model metrics, feature importance, and plotting assets.
- Run prediction on new samples using trained models.
This tool is suitable for bioinformatics and agricultural genomics applications where reproducible GWAS-oriented ML workflows are required.
Core Features
1) Data Preprocessing
- Supported genotype inputs: VCF (.vcf/.vcf.gz), PLINK binary (.bed/.bim/.fam), PLINK text (.ped/.map).
- Automatic sample matching and phenotype cleaning (missing/abnormal/non-numeric handling).
- Automatic task-type inference (classification/regression).
- Optional SNP quality filtering (MAF, GENO).
- Feature-selection modes: no selection / GWAS / LD / GWAS+LD.
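The sample matching, phenotype cleaning, and task-type inference listed above can be pictured with the following minimal sketch. It is not taken from G2PInsight's source: the function names, the rule of dropping non-numeric or non-finite phenotypes, and the `max_classes` heuristic are all assumptions made for illustration.

```python
import math
from typing import Dict, Iterable, List, Tuple


def clean_phenotypes(rows: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Keep only samples whose phenotype parses as a finite number."""
    cleaned: Dict[str, float] = {}
    for sample, raw in rows:
        try:
            value = float(raw)
        except ValueError:
            continue  # non-numeric entries such as "NA" are dropped
        if math.isfinite(value):  # also drops "nan" and "inf"
            cleaned[sample] = value
    return cleaned


def match_samples(genotype_samples: List[str], phenotypes: Dict[str, float]) -> List[str]:
    """Return samples present in both inputs, in genotype order."""
    return [s for s in genotype_samples if s in phenotypes]


def infer_task_type(values: Iterable[float], max_classes: int = 10) -> str:
    """Hypothetical heuristic: few distinct integer-valued labels -> classification."""
    distinct = set(values)
    if len(distinct) <= max_classes and all(v.is_integer() for v in distinct):
        return "classification"
    return "regression"


if __name__ == "__main__":
    pheno = clean_phenotypes([("s1", "1"), ("s2", "NA"), ("s3", "0")])
    print(match_samples(["s3", "s5", "s1"], pheno))
    print(infer_task_type(pheno.values()))
```

The tool's real inference rule may differ, which is why train and predict also accept an explicit --task_type.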
2) Model Training
- Supported models: LightGBM, RandomForest, XGBoost, SVM, CatBoost, Logistic.
- Randomized hyperparameter search + cross-validation.
- Saves model artifacts, metrics, feature importance, SHAP outputs, and plotting data.
- train and train-all require preprocess-generated *_metadata.json as input.
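For readers unfamiliar with the terms, randomized hyperparameter search with cross-validation amounts to drawing random parameter combinations and scoring each one on held-out folds. The sketch below shows only the two building blocks, using the standard library; the function names and the example search space are hypothetical, and G2PInsight's actual search spaces and splitting strategy are not documented here.

```python
import random
from typing import Dict, List, Sequence, Tuple


def kfold_indices(n_samples: int, n_folds: int = 5,
                  random_state: int = 42) -> List[Tuple[List[int], List[int]]]:
    """Shuffle sample indices once, then build (train, validation) index lists per fold."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        validation = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, validation))
    return splits


def sample_params(space: Dict[str, Sequence], n_iter: int,
                  random_state: int = 42) -> List[Dict]:
    """Draw n_iter random parameter combinations from a discrete search space."""
    rng = random.Random(random_state)
    return [
        {name: rng.choice(list(options)) for name, options in space.items()}
        for _ in range(n_iter)
    ]


if __name__ == "__main__":
    # Hypothetical search space; not the one used by the tool.
    for params in sample_params({"max_depth": [3, 5, 7], "n_estimators": [100, 300]}, 3):
        print(params, "evaluated on", len(kfold_indices(20)), "folds")
```

The defaults of 5 folds and seed 42 mirror the documented defaults of --n_folds and --random_state.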
3) Model Inference
- Input support: training-matrix format (.txt/.txt.gz) or VCF (temporary conversion is handled automatically).
- Prediction feature alignment is enforced against training features.
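Feature alignment means that the prediction input is reordered to the exact feature list the model was trained on. A minimal sketch follows, assuming that missing training features are treated as an error and extra input features are ignored; the real tool may handle either case differently, and `align_features` is a hypothetical name.

```python
from typing import Dict, List


def align_features(row: Dict[str, float], training_features: List[str]) -> List[float]:
    """Return the row's values in training-feature order.

    Assumption: a missing training feature is an error; extra features are dropped.
    """
    missing = [f for f in training_features if f not in row]
    if missing:
        raise ValueError(
            f"{len(missing)} training feature(s) missing from input, e.g. {missing[0]}"
        )
    return [row[f] for f in training_features]


if __name__ == "__main__":
    sample = {"1_200": 2, "1_100": 0, "2_5": 1}
    print(align_features(sample, ["1_100", "1_200"]))
```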
4) Visualization
- Two visualization input modes:
  - Feature-importance file (genome-wide scatter plot).
  - plotting_data.npz (performance and CV training curves).
- Outputs static PNG and interactive HTML (for feature-importance plotting).
System Requirements
- Python: 3.8-3.12
- OS: Linux / macOS (recommended)
- Shell: Bash
Check Python version:
python --version
# or
python3 --version
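The same check can be scripted, for example in a setup or CI step. This is a small sketch that only encodes the 3.8-3.12 range stated above; `is_supported` is a hypothetical helper, not part of G2PInsight.

```python
import sys

SUPPORTED_MIN = (3, 8)
SUPPORTED_MAX = (3, 12)


def is_supported(version=None) -> bool:
    """True if the (major, minor) version lies within the documented range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported() else "outside the documented range"
    print(f"Python {current} is {status}")
```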
Installation
Option 1: Install with project script (recommended on Linux/macOS)
chmod +x tools.sh
./tools.sh
Option 2: Install from wheel
pip install G2PInsight-1.0.0-py3-none-any.whl
Option 3: Install from source
pip install .
Verify installation
G2PInsight --version
G2PInsight -h
Quick Start
# 1) Preprocess
G2PInsight preprocess -g genotype.vcf -p phenotype.txt -o preprocessed/
# 2) Train (metadata-driven)
G2PInsight train -j preprocessed/preprocess/preprocessed_metadata.json -m LightGBM -o results/
# 3) Visualize feature importance
G2PInsight visualize -i results/train/LightGBM/LightGBM_feature_importance.txt -o result_plot
Note: train and train-all must use preprocess-generated *_metadata.json.
Command Reference
1. preprocess (data preprocessing)
Purpose: convert genotype + phenotype inputs into a model-ready training matrix, with optional GWAS/LD feature selection.
Usage
G2PInsight preprocess \
-g <genotype_input> \
-p <phenotype.txt> \
-o <output_path> \
[-f <1|2|3|4>] \
[--gwas_pvalue <float>] \
[--ld-config "<window_kb>,<window_variants>,<r2_threshold>"] \
[--no-filter-snps]
Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| -g, --genotype | Yes | - | Genotype input path (VCF/PLINK) |
| -p, --phenotype | Yes | - | Phenotype file path (at least two columns: sample, phenotype) |
| -o, --output | Yes | - | Output directory or output prefix |
| -f, --feature_selection_mode | No | 1 | 1=no selection, 2=GWAS, 3=LD, 4=GWAS+LD |
| --gwas_pvalue | No | 0.01 | GWAS significance threshold (effective for mode 2/4) |
| --ld-config | No | "50,5,0.2" | LD config: window_kb, window_variants, r² threshold (effective for mode 3/4) |
| --no-filter-snps | No | False | Disable SNP quality filtering |
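To make clear what the SNP quality filtering disabled by --no-filter-snps refers to, here is a rough sketch of MAF (minor allele frequency) and GENO (per-SNP missing-call rate) checks. The 0/1/2 genotype coding, the function names, and the thresholds 0.05 and 0.1 are assumptions chosen for illustration; G2PInsight's actual thresholds are not stated in this document.

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # assumed coding: 0/1/2 alternate-allele count, None = missing


def snp_stats(calls: List[Genotype]) -> Tuple[float, float]:
    """Return (minor allele frequency, missing-call rate) for one SNP."""
    if not calls:
        raise ValueError("no genotype calls supplied")
    observed = [g for g in calls if g is not None]
    missing_rate = 1.0 - len(observed) / len(calls)
    if not observed:
        return 0.0, missing_rate
    alt_freq = sum(observed) / (2 * len(observed))  # two alleles per diploid sample
    return min(alt_freq, 1.0 - alt_freq), missing_rate


def passes_filter(calls: List[Genotype], min_maf: float = 0.05,
                  max_missing: float = 0.1) -> bool:
    """Keep a SNP only if it is variable enough and rarely missing (hypothetical thresholds)."""
    maf, missing = snp_stats(calls)
    return maf >= min_maf and missing <= max_missing


if __name__ == "__main__":
    print(snp_stats([0, 1, 2, None]))
    print(passes_filter([0, 0, 0, 0]))
```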
Example
G2PInsight preprocess -g data.vcf -p pheno.txt -o out/ -f 1
G2PInsight preprocess -g data.vcf -p pheno.txt -o out/ -f 4 --gwas_pvalue 0.01 --ld-config "50,5,0.2"
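The --ld-config value is a single comma-separated string, so it is easy to get the order or types wrong. The sketch below shows one way such a string could be validated before launching a run. `LDConfig` and `parse_ld_config` are hypothetical names, and the choice of a float window, an integer variant count, and an r² range of 0 to 1 is an assumption rather than the tool's documented parsing rule.

```python
from typing import NamedTuple


class LDConfig(NamedTuple):
    window_kb: float
    window_variants: int
    r2_threshold: float


def parse_ld_config(text: str = "50,5,0.2") -> LDConfig:
    """Parse '<window_kb>,<window_variants>,<r2_threshold>' into typed fields."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 3:
        raise ValueError("expected '<window_kb>,<window_variants>,<r2_threshold>'")
    window_kb = float(parts[0])
    window_variants = int(parts[1])
    r2_threshold = float(parts[2])
    if not 0.0 <= r2_threshold <= 1.0:
        raise ValueError("r2_threshold must be between 0 and 1")
    return LDConfig(window_kb, window_variants, r2_threshold)


if __name__ == "__main__":
    print(parse_ld_config("50,5,0.2"))
```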
2. train (single-model training)
Purpose: train one selected model and export model artifacts, metrics, and plotting data.
Usage
G2PInsight train \
-j <preprocess_metadata.json> \
-m <LightGBM|RandomForest|XGBoost|SVM|CatBoost|Logistic> \
-o <output_dir> \
[--task_type <classification|regression>] \
[--n_folds <int>] \
[--random_state <int>] \
[--feature_importance]
Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| -j, --json | Yes | - | Preprocess-generated *_metadata.json |
| -m, --model | Yes | - | Model name |
| -o, --output_dir | Yes | - | Output directory |
| --task_type | No | Auto | Optional explicit task type |
| --n_folds | No | 5 | Number of CV folds |
| --random_state | No | 42 | Random seed |
| --feature_importance | No | False | Trigger feature-importance output flow |
Example
G2PInsight train -j out/preprocess/out_metadata.json -m LightGBM -o results/
3. train-all (all-model training)
Purpose: train all supported models in parallel and produce comparison outputs.
Usage
G2PInsight train-all \
-j <preprocess_metadata.json> \
-o <output_dir> \
[--task_type <classification|regression>] \
[--n_folds <int>] \
[--random_state <int>] \
[--feature_importance]
Important behavior
After all models have been trained, the current implementation keeps only the best-performing model's directory and removes the others. It also exports best_model_info.json.
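Because only one model directory survives, a follow-up script cannot hard-code the model name. A minimal sketch for locating whatever model file remains, assuming the <output>/train/<Model>/<Model>_model.pkl layout described under Output Files (`surviving_models` is a hypothetical helper):

```python
from pathlib import Path
from typing import List


def surviving_models(output_dir: str) -> List[Path]:
    """List model files left under <output_dir>/train/ after train-all."""
    train_dir = Path(output_dir) / "train"
    if not train_dir.is_dir():
        return []
    # One level of model directories, each holding <Model>_model.pkl (assumed layout).
    return sorted(train_dir.glob("*/*_model.pkl"))


if __name__ == "__main__":
    for model_path in surviving_models("results"):
        print("best model:", model_path)
```

The returned path can then be passed to predict via -m.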
Example
G2PInsight train-all -j out/preprocess/out_metadata.json -o results/
4. predict (model inference)
Purpose: predict phenotypes using a trained .pkl model.
Usage
G2PInsight predict \
-i <input_data.txt|input_data.vcf|input_data.vcf.gz> \
-m <model.pkl> \
-o <output_dir> \
[--task_type <classification|regression>]
Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| -i, --input | Yes | - | Prediction input (training matrix or VCF) |
| -m, --model | Yes | - | Path to trained model file (.pkl) |
| -o, --output_dir | Yes | - | Output directory (used for temp conversion when input is VCF) |
| --task_type | No | - | Optional task type |
Output location
Prediction results are written to the model directory:
{model_dir}/{model_type}_predictions.tsv
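Since the results land next to the model rather than under -o, it can help to compute the expected path up front. The sketch below assumes the model file follows the <Model>_model.pkl naming shown under Output Files and that {model_type} equals that <Model> prefix; both are assumptions, and `expected_predictions_path` is a hypothetical helper.

```python
from pathlib import Path


def expected_predictions_path(model_path: str) -> Path:
    """Derive {model_dir}/{model_type}_predictions.tsv from a <Model>_model.pkl path."""
    model_file = Path(model_path)
    suffix = "_model.pkl"
    if not model_file.name.endswith(suffix):
        raise ValueError(f"expected a file named <Model>{suffix}, got {model_file.name}")
    model_type = model_file.name[: -len(suffix)]
    return model_file.parent / f"{model_type}_predictions.tsv"


if __name__ == "__main__":
    print(expected_predictions_path("results/train/LightGBM/LightGBM_model.pkl"))
```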
Example
G2PInsight predict -i new_data.txt -m results/train/LightGBM/LightGBM_model.pkl -o pred/
5. visualize (result visualization)
Purpose: generate feature-importance plots or model-performance plots.
Usage
G2PInsight visualize \
[-i <feature_importance.txt>] \
[-I <plotting_data.npz>] \
-o <output_prefix>
Parameters
| Parameter | Required | Description |
|---|---|---|
| -i, --importance | No | Feature-importance file |
| -I, --indicator | No | plotting_data.npz file from training outputs |
| -o, --output | Yes | Output prefix |
Feature-importance format requirements
Recommended input: <Model>_feature_importance.txt generated by training.
Required columns:
- feature (e.g., 1_12345 or chr1_12345)
- importance_abs (or importance)
- effect (1 or -1)
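A hand-made importance file can be checked against these requirements before plotting. The sketch below assumes a tab-delimited file with a header row and reads feature names as <chromosome>_<position>; the delimiter, the helper names, and the stripping of a leading "chr" are assumptions, not documented behavior.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple


def parse_feature(name: str) -> Tuple[str, int]:
    """Split '1_12345' or 'chr1_12345' into (chromosome, position)."""
    chrom, _, pos = name.rpartition("_")
    if not chrom or not pos.isdigit():
        raise ValueError(f"unexpected feature name: {name!r}")
    if chrom.lower().startswith("chr") and len(chrom) > 3:
        chrom = chrom[3:]  # normalise 'chr1' to '1' (illustrative choice)
    return chrom, int(pos)


def check_importance_table(lines: Iterable[str], delimiter: str = "\t") -> List[Dict[str, str]]:
    """Validate the required columns and values; return the parsed rows."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    columns = set(reader.fieldnames or [])
    for required in ("feature", "effect"):
        if required not in columns:
            raise ValueError(f"missing required column: {required}")
    if not {"importance_abs", "importance"} & columns:
        raise ValueError("missing importance_abs (or importance) column")
    rows = list(reader)
    for row in rows:
        parse_feature(row["feature"])
        if float(row["effect"]) not in (1.0, -1.0):
            raise ValueError(f"effect must be 1 or -1, got {row['effect']!r}")
    return rows


if __name__ == "__main__":
    demo = "feature\timportance_abs\teffect\n1_12345\t0.5\t1\nchr2_99\t0.1\t-1\n"
    print(len(check_importance_table(io.StringIO(demo))), "rows look valid")
```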
Example
G2PInsight visualize -i results/train/LightGBM/LightGBM_feature_importance.txt -o plot
G2PInsight visualize -I results/train/LightGBM/LightGBM_plotting_data.npz -o plot
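The array names stored in plotting_data.npz are not documented here. If the file is a standard NumPy .npz archive (an assumption), it is a zip file containing one .npy member per array, so the names can be listed with the standard library alone. `list_npz_arrays` is a hypothetical helper.

```python
import zipfile
from typing import List


def list_npz_arrays(path: str) -> List[str]:
    """Return the array names stored in an .npz archive without loading the data."""
    with zipfile.ZipFile(path) as archive:
        return sorted(
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        )
```

For example, `list_npz_arrays("results/train/LightGBM/LightGBM_plotting_data.npz")` would show which arrays the performance plots are built from.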
Output Files
preprocess
Typical location: <output>/preprocess/
- <prefix>_train_data.txt
- <prefix>_metadata.json
- phenotype distribution plot(s), depending on task type
train
Typical location: <output>/train/<Model>/
- <Model>_model.pkl
- <Model>_metrics.json
- <Model>_cv_results.json
- <Model>_training_features.json
- <Model>_feature_importance.txt
- <Model>_shap_values.txt
- <Model>_plotting_data.npz
train-all
Typical location: <output>/train/
- best model directory (other model directories may be removed by current implementation)
- best_model_info.json
- model_comparison_report.json
predict
Typical location: model directory
- <Model>_predictions.tsv
visualize
Typical location: <output_parent>/visualize/
- <prefix>_importance_static.png
- <prefix>_importance_interactive.html
- <prefix>_performance_curves.png
- <prefix>_cv_training_curves.png
FAQ
1) train requires metadata input
train and train-all require preprocess-generated *_metadata.json via -j.
2) Cannot find preprocess outputs
Check <output>/preprocess/ for <prefix>_train_data.txt and <prefix>_metadata.json.
3) visualize complains about missing effect
The input file does not meet the required 3-column schema. Use training-generated <Model>_feature_importance.txt.
4) Why does train-all keep only one model directory?
This is the current behavior: it selects the best model and removes the rest.
5) Why are prediction results not under -o?
In the current implementation, prediction outputs are saved in the model directory. -o is mainly used for the temporary files created when VCF input is converted.
Developer Guide
Project Structure
G2PInsight/
├── G2PInsight/
│   ├── main.py
│   └── bin/
│       ├── preprocess.py
│       ├── modeltraining.py
│       ├── gemma_gwas.py
│       ├── plink_ld.py
│       ├── visualization.py
│       └── font_utils.py
├── pyproject.toml
├── setup.py
└── README.md
Local Development Setup
git clone <your-repo-url>
cd G2PInsight
pip install -e .
Style Recommendations
- Follow PEP 8.
- Document parameters and return values for new public interfaces.
- Keep README synchronized with CLI behavior whenever arguments or outputs change.
License
This project is distributed under the MIT License. See LICENSE for details.