A benchmarking toolkit for genomic prediction with multiple methods and LLM-powered analysis

GPBench

GPBench is a benchmarking toolkit for genomic prediction. This repository reimplements and integrates many commonly used methods, including classic linear statistical approaches and machine learning / deep learning methods: rrBLUP, GBLUP, BayesA/B/C, SVR, Random Forest, XGBoost, LightGBM, DeepGS, DL_GWAS, G2PDeep, MVP, DNNGP, SoyDNGP, DeepCCR, EIR, Cropformer, GEFormer, CropARNet, etc.

Project Website: https://www.sdu-idea.cn/GPBench/

Key Features

Implements multiple genomic prediction methods and reproducible experimental workflows
Supports GPU-accelerated deep learning methods (using PyTorch)
Unified data loading and 10-fold cross-validation pipeline
Outputs standardized evaluation metrics (PCC, MAE, MSE, R2) and per-fold predictions
LLM-powered analysis tool (gp_agent_tool): Analyzes dataset characteristics, finds similar datasets, and recommends suitable genomic prediction methods based on historical experimental experience

Important Structure

data/: Example/real dataset directory; each species/dataset is a subfolder (e.g., data/Cotton/) containing:
- genotype.npz: genotype matrix (typically saved as a NumPy array)
- phenotype.npz: phenotype data (contains phenotype matrix and phenotype names)
method_reg/: subdirectories with implementations for each method (each method usually contains a main runner script plus hyperparameter/utility scripts)
result/: default output directory for experimental results
gp_agent_tool/: LLM-powered dataset analysis and method recommendation tool (see Dataset Analysis Tool section)
environment.yml: dependency file for creating a conda environment (recommended)

Environment Setup (recommended: conda)

The repository includes an environment.yml; creating and activating a conda environment from it is the recommended setup:

# On a machine with conda:
conda env create -f environment.yml
conda activate Benchmark

Notes:

environment.yml contains most dependencies (including CUDA / cuDNN related packages and pip list) and is suitable for GPU-enabled environments (the file references CUDA 11.8 and matching RAPIDS/torch/cupy versions).
Ensure the target machine has an NVIDIA driver compatible with CUDA 11.8/12.
If you cannot use the environment file directly, you can install main dependencies into an existing Python environment as needed:

pip install -U numpy pandas scikit-learn torch torchvision optuna psutil xgboost lightgbm

(Warning: the above is a simplified installation; some packages may need additional configuration on GPU systems or certain platforms.)

Data Format and Preparation

Each species folder should contain genotype.npz and phenotype.npz.
genotype.npz usually stores a 2D array (number of samples × number of SNPs).
phenotype.npz typically includes two arrays: the phenotype matrix (number of samples × number of phenotypes) and a list of phenotype names.

Quickly view phenotype names for a dataset (e.g., Cotton):

python - <<'PY'
import numpy as np
obj = np.load('data/Cotton/phenotype.npz')
print(obj['arr_1'])
PY
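
To check both files together, a minimal sketch follows. It assumes the arrays were saved positionally, so NumPy named them arr_0 and arr_1 (consistent with the snippet above, which reads the names from arr_1), and that the genotype matrix is the first array in its archive. The helper name load_dataset is hypothetical and not part of GPBench; the demo writes a small synthetic dataset to a temporary directory so it runs without the real data.

```python
import os
import tempfile

import numpy as np


def load_dataset(species_dir):
    """Load genotype and phenotype arrays from one species folder.

    Assumes positional npz keys (arr_0, arr_1); adjust if your files differ.
    """
    with np.load(os.path.join(species_dir, "genotype.npz")) as g:
        genotype = g[g.files[0]]  # assumption: first array is the SNP matrix
    with np.load(os.path.join(species_dir, "phenotype.npz")) as p:
        phenotype = p["arr_0"]  # samples x phenotypes
        names = [str(n) for n in p["arr_1"]]
    if genotype.shape[0] != phenotype.shape[0]:
        raise ValueError("genotype and phenotype have different sample counts")
    return genotype, phenotype, names


# Synthetic stand-in for data/<Species>/ so the sketch runs anywhere.
demo_dir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
np.savez(os.path.join(demo_dir, "genotype.npz"), rng.integers(0, 3, size=(20, 50)))
np.savez(
    os.path.join(demo_dir, "phenotype.npz"),
    rng.normal(size=(20, 2)),
    np.array(["TraitA", "TraitB"]),
)

X, Y, trait_names = load_dataset(demo_dir)
print(X.shape, Y.shape, trait_names)
```

For a real dataset, replace demo_dir with a path such as data/Cotton. If the phenotype names were stored as a Python object array, np.load may additionally need allow_pickle=True.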

Quick Start (example with a method)

Most methods have a main script under method_reg/<Method>/. Scripts usually accept parameters like --methods, --species, --phe, --data_dir, --result_dir, etc. Example:

# 1) Activate the environment
conda activate Benchmark

# 2) Run a single phenotype with DeepCCR (note: include trailing slash after --species)
python method_reg/DeepCCR/DeepCCR.py \
	--methods DeepCCR/ \
	--species Cotton/ \
	--phe FibLen_17_18 \
	--data_dir data/ \
	--result_dir result/

Common optional arguments (may vary across scripts):

--epoch: number of training epochs (example scripts often default to 1000)
--batch_size: batch size
--lr: learning rate
--patience: early stopping patience

You can inspect the argparse help for the specific script in the method directory:

python method_reg/DeepCCR/DeepCCR.py -h

Dataset Analysis Tool (gp_agent_tool)

The gp_agent_tool is an LLM-powered analysis tool that performs comprehensive dataset analysis and automatically recommends suitable genomic prediction methods. It analyzes your dataset characteristics, computes statistical features, finds similar datasets from historical experiments, and provides evidence-based method recommendations.

Features

Dataset statistical analysis: Automatically computes and analyzes dataset statistics including sample size, marker count, phenotype distribution, missing rates, and statistical properties
Similar dataset discovery: Finds datasets with similar statistical distributions to your query dataset from historical experimental databases
Method recommendation: Recommends genomic prediction methods that have shown best performance on similar datasets based on historical experience
Bilingual support: Supports both Chinese and English queries and analysis
Experience-based insights: Leverages comprehensive historical experimental results to provide evidence-based analysis and recommendations

Prerequisites

LLM Configuration: Create a configuration file at gp_agent_tool/config/config.json with your LLM API settings:

{
  "llm": {
    "model": "gpt-4o-mini",
    "api_key": "YOUR_OPENAI_API_KEY",
    "base_url": "https://api.openai.com/v1",
    "timeout_seconds": 60,
    "max_retries": 3
  },
  "codegen_llm": {
    "model": "gpt-4o-mini",
    "api_key": "YOUR_OPENAI_API_KEY",
    "base_url": "https://api.openai.com/v1",
    "timeout_seconds": 60,
    "max_retries": 3
  },
  "multimodal_llm": {
    "model": "qwen-vl-max",
    "api_key": "YOUR_DASHSCOPE_API_KEY"
  }
}

Important: Please replace the api_key fields in the configuration file with your own API keys:

Replace YOUR_OPENAI_API_KEY in llm and codegen_llm with your OpenAI API key
Replace YOUR_DASHSCOPE_API_KEY in multimodal_llm with your Alibaba Cloud DashScope API key

You can obtain API keys from the following URLs:

OpenAI API key: https://platform.openai.com/api-keys
Alibaba Cloud DashScope API key: https://dashscope.console.aliyun.com/apiKey
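
Before running the tool, it can help to confirm that no placeholder keys remain in the file. The following is a minimal sketch, not part of gp_agent_tool: the function name find_config_problems is hypothetical, and treating all three sections as required is an assumption based on the example configuration above.

```python
import json
import os
import tempfile

# Assumption: the tool expects all three sections from the example config.
REQUIRED_SECTIONS = ("llm", "codegen_llm", "multimodal_llm")


def find_config_problems(path):
    """Return a list of human-readable problems found in the config file."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    problems = []
    for section in REQUIRED_SECTIONS:
        settings = config.get(section)
        if not isinstance(settings, dict):
            problems.append(f"missing section: {section}")
            continue
        key = settings.get("api_key", "")
        if not key or key.startswith("YOUR_"):
            problems.append(f"{section}.api_key is empty or still a placeholder")
        if not settings.get("model"):
            problems.append(f"{section}.model is not set")
    return problems


# Demo on a throwaway file; point this at gp_agent_tool/config/config.json instead.
demo = {
    "llm": {"model": "gpt-4o-mini", "api_key": "YOUR_OPENAI_API_KEY"},
    "codegen_llm": {"model": "gpt-4o-mini", "api_key": "sk-example"},
}
demo_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(demo_path, "w", encoding="utf-8") as fh:
    json.dump(demo, fh)

config_problems = find_config_problems(demo_path)
for problem in config_problems:
    print(problem)
```

The demo file deliberately leaves one placeholder key and omits one section, so both kinds of problem are reported.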

Additional Dependencies: Install required packages for the tool:

pip install langchain langgraph openai

Usage

Basic Usage

Run the tool from the project root directory:

cd gp_agent_tool
python main.py \
  -q "Based on existing models, summarize the patterns in the mkg trait of cattle." \
  -o result.json

Or, to analyze a specific dataset, pass its directory with -d:

python main.py \
  -d ../data/Rapeseed \
  -q "Recommend the best methods for this dataset" \
  -o result.json

Command-line Arguments

-d / --dataset (optional): Path to the dataset directory containing genotype.npz and phenotype.npz. The tool will analyze this dataset to compute statistical features. If not provided, analysis and recommendations are based on the complete experience table only.
-q / --user-query (required): Your analysis requirement or question description (supports both Chinese and English). Examples: "分析这个数据集的特征" / "Analyze this dataset and recommend methods" / "What methods work best for binary phenotypes?"
-m / --mask (optional): Specify a species/phenotype (e.g., Rapeseed/FloweringTime) to mask in the reference experience database, preventing "answer leakage" when evaluating on known datasets.
-o / --output (optional): Path to save the analysis result as a JSON file. If not provided, results are printed to the terminal.

Dataset Analysis Features

When a dataset path is provided, the tool automatically computes the following statistical features:

Sample information: Total samples, valid samples, missing rate
Marker information: Number of markers, genotype statistics (mean, std, missing rate, MAF)
Phenotype statistics: Mean, std, min, max, median, skewness, kurtosis
Data type information: Genotype and phenotype data types, binary phenotype detection
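
To make the listed features concrete, here is a rough sketch of how such statistics could be computed with NumPy. It is not the tool's own code. It assumes genotypes are coded as 0/1/2 allele dosages with NaN for missing calls, uses population (not sample) moments, and the function names are hypothetical.

```python
import numpy as np


def genotype_summary(genotype):
    """Summarize a samples x markers matrix coded 0/1/2 with NaN for missing."""
    g = np.asarray(genotype, dtype=float)
    # Allele frequency per marker: mean dosage / 2, ignoring missing calls.
    freq = np.nanmean(g, axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)
    return {
        "n_samples": g.shape[0],
        "n_markers": g.shape[1],
        "missing_rate": float(np.isnan(g).mean()),
        "mean_maf": float(np.mean(maf)),
    }


def phenotype_summary(values):
    """Moment-based summary of one phenotype, dropping missing values."""
    y = np.asarray(values, dtype=float)
    valid = y[~np.isnan(y)]
    mean, std = valid.mean(), valid.std()
    z = (valid - mean) / std  # undefined for a constant phenotype (std == 0)
    return {
        "total": int(y.size),
        "valid": int(valid.size),
        "missing_rate": 1.0 - valid.size / y.size,
        "mean": float(mean),
        "std": float(std),
        "min": float(valid.min()),
        "max": float(valid.max()),
        "median": float(np.median(valid)),
        "skewness": float(np.mean(z ** 3)),
        "kurtosis": float(np.mean(z ** 4) - 3.0),  # excess kurtosis
        "is_binary": bool(np.unique(valid).size == 2),
    }


geno = np.array([[0, 1, 2], [0, 1, np.nan], [0, 2, 2], [2, 1, 2]])
pheno = np.array([1.0, 2.0, np.nan, 3.0])
print(genotype_summary(geno))
print(phenotype_summary(pheno))
```

The values reported by gp_agent_tool may differ if it uses other conventions, for example sample moments or a different missing-value code.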

Example Output

The tool returns a JSON object with two main sections:

{
  "similar_datasets": {
    "items": ["Chickpea/Days_to_0.5_flowering", "Cotton/FibLen_17_18"],
    "reason": "These datasets have similar statistical distributions..."
  },
  "methods": {
    "items": ["GBLUP", "XGBoost", "LightGBM"],
    "reason": "Based on historical experience, these methods showed best performance on similar datasets..."
  }
}
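
A result in this shape can be consumed with the standard json module. The sketch below parses a payload that mirrors the structure above and splits each similar dataset into a (species, phenotype) pair; the function name summarize_result is hypothetical.

```python
import json

# Example payload mirroring the structure shown above.
raw = """
{
  "similar_datasets": {"items": ["Cotton/FibLen_17_18"], "reason": "..."},
  "methods": {"items": ["GBLUP", "XGBoost"], "reason": "..."}
}
"""


def summarize_result(result):
    """Return (species, phenotype) pairs and the recommended method list."""
    items = result["similar_datasets"]["items"]
    pairs = [tuple(item.split("/", 1)) for item in items]
    methods = list(result["methods"]["items"])
    return pairs, methods


pairs, methods = summarize_result(json.loads(raw))
print(pairs)
print(methods)
```

When the tool was run with -o result.json, load the file with json.load instead of parsing a string.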

Analysis Workflow

When you provide a dataset path, the tool performs the following analysis steps:

Dataset feature extraction: Computes statistical features from your dataset (phenotype mean, std, skewness, kurtosis, sample size, marker count, etc.)
Similar dataset matching: Compares your dataset features with historical datasets to find the most similar ones
Experience table filtering: Filters the historical experience table to include only results from similar datasets
Method analysis and recommendation: Analyzes which methods performed best on similar datasets and recommends them with detailed reasoning
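
The matching and filtering steps can be pictured with a simple nearest-neighbour sketch. This is an illustration only: the tool is LLM-powered and its actual matching logic may work quite differently. The dataset names, feature values, and PCC scores below are made up for the example, and nearest_datasets is a hypothetical helper.

```python
import math


def nearest_datasets(query, history, k=2):
    """Rank historical datasets by Euclidean distance on std-scaled features."""
    keys = sorted(query)
    spread = {}
    for key in keys:
        column = [feats[key] for feats in history.values()] + [query[key]]
        mean = sum(column) / len(column)
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
        spread[key] = std or 1.0  # avoid dividing by zero for constant features

    def distance(feats):
        return math.sqrt(sum(
            ((feats[key] - query[key]) / spread[key]) ** 2 for key in keys
        ))

    return sorted(history, key=lambda name: distance(history[name]))[:k]


# Made-up feature values and experience rows, for illustration only.
history = {
    "SpeciesA/Trait1": {"n_samples": 1200, "n_markers": 50000, "skewness": 0.1},
    "SpeciesB/Trait2": {"n_samples": 3000, "n_markers": 20000, "skewness": 0.9},
    "SpeciesC/Trait3": {"n_samples": 300, "n_markers": 400000, "skewness": -1.5},
}
experience = [
    {"dataset": "SpeciesA/Trait1", "method": "GBLUP", "pcc": 0.61},
    {"dataset": "SpeciesA/Trait1", "method": "XGBoost", "pcc": 0.58},
    {"dataset": "SpeciesC/Trait3", "method": "DNNGP", "pcc": 0.47},
]
query = {"n_samples": 1100, "n_markers": 48000, "skewness": 0.2}

similar = nearest_datasets(query, history, k=1)
filtered = [row for row in experience if row["dataset"] in similar]
best = max(filtered, key=lambda row: row["pcc"])
print(similar, best["method"])
```

Scaling each feature by its standard deviation keeps large-valued features such as marker count from dominating the distance.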

Use Cases

General method query: Query methods based on specific criteria without providing a dataset:

python main.py \
  -q "What methods work best for small sample sizes?" \
  -o result.json

Evaluation mode with masking: When evaluating on a known dataset, mask it to avoid bias in the analysis:

python main.py \
  -d ../data/Rapeseed \
  -q "Analyze this dataset and recommend appropriate algorithms." \
  -m Rapeseed/FloweringTime \
  -o result.json
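
Conceptually, masking removes the matching rows from the reference experience before any recommendation is made. A minimal sketch of that idea follows, assuming the experience table can be viewed as rows with species and phenotype fields; the row layout and the apply_mask helper are hypothetical and do not reflect the tool's internal schema.

```python
def apply_mask(rows, mask):
    """Drop experience rows that match a 'Species/Phenotype' mask."""
    species, _, phenotype = mask.partition("/")
    return [
        row for row in rows
        if not (row["species"] == species and row["phenotype"] == phenotype)
    ]


# Hypothetical experience rows, for illustration only.
rows = [
    {"species": "Rapeseed", "phenotype": "FloweringTime", "method": "GBLUP"},
    {"species": "Rapeseed", "phenotype": "OtherTrait", "method": "GBLUP"},
    {"species": "Cotton", "phenotype": "FibLen_17_18", "method": "DeepCCR"},
]

masked = apply_mask(rows, "Rapeseed/FloweringTime")
print([f"{r['species']}/{r['phenotype']}" for r in masked])
```

Only the exact species/phenotype pair is removed here; other traits of the same species stay available as reference.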

Output Description

Each method run creates a directory under result/ named by method/species/phenotype, e.g., result/DeepCCR/Cotton/<PHENO>/.
Per-fold prediction results are typically saved as fold{n}.csv, containing Y_test and Y_pred columns.
The script prints or saves average evaluation metrics at the end: PCC (Pearson correlation coefficient), MAE, MSE, R2, along with runtime and memory/GPU usage.
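
If you want to recompute the averages from the saved per-fold files, the sketch below shows one way, assuming each fold file is a CSV with Y_test and Y_pred columns as described above. The helper names are hypothetical, R2 is computed as 1 - SS_res / SS_tot, and the demo writes two tiny synthetic folds so it runs without real results.

```python
import csv
import glob
import os
import tempfile

import numpy as np


def fold_metrics(csv_path):
    """Compute PCC, MAE, MSE and R2 from one fold file with Y_test/Y_pred columns."""
    with open(csv_path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    y_true = np.array([float(r["Y_test"]) for r in rows])
    y_pred = np.array([float(r["Y_pred"]) for r in rows])
    residual = y_true - y_pred
    mse = float(np.mean(residual ** 2))
    return {
        "PCC": float(np.corrcoef(y_true, y_pred)[0, 1]),
        "MAE": float(np.mean(np.abs(residual))),
        "MSE": mse,
        "R2": 1.0 - mse / float(np.var(y_true)),
    }


def average_metrics(result_dir):
    """Average each metric over every fold*.csv file in result_dir."""
    paths = sorted(glob.glob(os.path.join(result_dir, "fold*.csv")))
    per_fold = [fold_metrics(p) for p in paths]
    return {name: float(np.mean([m[name] for m in per_fold])) for name in per_fold[0]}


# Two tiny synthetic folds so the sketch runs without real results.
demo_dir = tempfile.mkdtemp()
folds = {
    "fold0.csv": [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)],
    "fold1.csv": [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)],
}
for name, pairs in folds.items():
    with open(os.path.join(demo_dir, name), "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Y_test", "Y_pred"])
        writer.writerows(pairs)

avg = average_metrics(demo_dir)
print(avg)
```

For real runs, point average_metrics at a directory such as result/DeepCCR/Cotton/<PHENO>/. Individual scripts may compute their metrics slightly differently, so small discrepancies from the printed values are possible.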

Full Dataset Link

Species dataset: contains genotype and phenotype data for 16 species.

Running Tips & Troubleshooting

For GPU usage, activate the environment (conda activate Benchmark) and make sure CUDA drivers are available; torch.cuda.is_available() should return True.
If you encounter memory or GPU OOM issues, try reducing --batch_size or disabling some parallel settings in scripts.
If running on CPU-only systems, some GPU-specific methods (RAPIDS or GPU-only implementations) may be unavailable or require alternative implementations.
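
A quick way to check which device is available is sketched below. The pick_device helper is hypothetical and not part of GPBench; it falls back to the CPU when PyTorch is missing or no GPU is visible.

```python
def pick_device():
    """Return "cuda" when PyTorch sees a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed; deep learning methods will not run
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

If this prints cpu on a GPU machine, check the NVIDIA driver and that the installed torch build matches the CUDA version in environment.yml.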

Contributing & Contact

Contributions via issues and PRs are welcome. Please describe changes and testing in PRs.
Contact: open an Issue in the repository or reach the repository owner (GitHub user: xwzhang2118).

