Integrated GWAS and genomic prediction pipeline with a GUI for plant genomics.

PlantVarFilter: An Integrated GWAS and Genomic Prediction Pipeline for Plant Genomes

Developers & Contributors

Developer	Expertise
Ahmed Yassin	Computational Biologist
Falak Sher Khan	Computational Biologist

Affiliation: Ye-Lab, Peking University Institute of Advanced Agricultural Sciences (PKU-IAAS)

Acknowledgment

The authors gratefully acknowledge the computational resources provided by Prof. Wenxiu Ye (Ye-Lab, Peking University Institute of Advanced Agricultural Sciences, PKU-IAAS), as well as the continued guidance in genomic data processing and phenotypic prediction and the thorough support in completing the pipeline.

Abstract

PlantVarFilter represents the second-generation release of a previously lightweight Python toolkit, now evolved into a fully modular and GUI-based genomic analysis pipeline designed for large-scale plant genomics. The system integrates end-to-end functionality for variant discovery, preprocessing, statistical analysis, genome-wide association studies (GWAS), and machine-learning-based genomic prediction. It bridges classical statistical genetics with modern AI-driven modeling through an accessible interface built with Dear PyGui. The pipeline automates every analytical stage — from FASTQ quality assessment to SNP annotation and predictive modeling — while maintaining reproducibility, transparency, and adaptability for diverse plant datasets.

1. Background and Motivation

High-throughput sequencing and GWAS have transformed plant breeding and genetic improvement programs; however, they remain technically fragmented, requiring multiple command-line tools and complex data transformations. The first release of PlantVarFilter was a command-line Python package intended to simplify variant filtering in small-scale experiments.
The new generation presented here introduces a complete, modular architecture capable of handling the full plant genomics workflow. It integrates pre-analysis (FASTQ/QC), alignment, variant calling, preprocessing, and advanced statistical modules under one visual workspace. By linking established genomic tools such as Samtools, Bcftools, Bowtie2, and FaST-LMM with AI-based predictors (Random Forest, XGBoost), PlantVarFilter provides a comprehensive, unified ecosystem for variant-level analysis and predictive breeding.

2. System Overview

The new version of PlantVarFilter is organized into interconnected functional subsystems:

Pre-analysis and Reference Management: Builds and refreshes genome indices, manages FASTQ input validation, and handles reference configuration.
Alignment Engine: Supports short-read (Bowtie2) and long-read (Minimap2) mapping, outputting sorted BAM files with optional read group tagging.
Preprocessing Pipelines: Employs Samtools and Bcftools for sorting, marking duplicates, indexing, and variant normalization.
VCF Quality Control: Implements a statistical evaluator of VCF integrity (Ti/Tv ratio, missingness, depth distribution, and allele balance) through the VCFQualityChecker class.
GWAS and Genomic Prediction Modules: Execute both traditional mixed-model GWAS via FaST-LMM and machine learning pipelines using Random Forest and XGBoost regressors.
Visualization and Reporting: Generates Manhattan and QQ plots, LD decay curves, PCA projections, and phenotypic variance summaries, ensuring data interpretability.
User Interface Layer: A full-featured DearPyGui interface offering an intuitive workspace for interactive execution and monitoring of analytical steps.

3. Methodology

3.1 Pre-analysis and Alignment

The pipeline begins with optional FASTQ quality control (fastq_qc.py), which computes GC%, PHRED scores, and read-length distributions.
Reference indices are generated automatically by reference_manager.py through faidx, dict, minimap2, and bowtie2-build.
The aligner.py module executes user-defined alignment pipelines, producing sorted BAM files ready for downstream processing.
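
The FASTQ metrics named above can be illustrated with a minimal, standard-library sketch. It is not the code in fastq_qc.py: the function name fastq_summary, the Phred+33 offset, and the four-line record layout are assumptions made for illustration.

```python
from collections import Counter
from io import StringIO


def fastq_summary(handle, phred_offset=33):
    """Summarize a FASTQ stream: GC%, mean PHRED score, read-length counts.

    Hypothetical helper; assumes 4-line records and Phred+33 qualities.
    """
    gc = bases = qual_sum = 0
    lengths = Counter()
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        seq = handle.readline().strip().upper()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header.strip()!r}")
        gc += seq.count("G") + seq.count("C")
        bases += len(seq)
        qual_sum += sum(ord(ch) - phred_offset for ch in qual)
        lengths[len(seq)] += 1
    if bases == 0:
        raise ValueError("no bases found")
    return {
        "reads": sum(lengths.values()),
        "gc_percent": 100.0 * gc / bases,
        "mean_phred": qual_sum / bases,
        "length_counts": dict(lengths),
    }


# Two toy reads: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
example = StringIO("@r1\nGATC\n+\nIIII\n@r2\nGGCCAT\n+\n######\n")
summary = fastq_summary(example)
print(summary)
```

For real data the same function can be given a handle from gzip.open(path, "rt"), since it only relies on readline().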

3.2 Preprocessing and Variant Calling

samtools_utils.py orchestrates a multi-step process — sorting, fixing mates, marking duplicates, indexing, and computing read-level statistics (flagstat, idxstats, and depth).
Subsequently, variant_caller_utils.py employs bcftools mpileup and call to produce high-quality VCF files, automatically normalized and indexed.
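
To make the sequence of steps concrete, the sketch below assembles one plausible set of samtools and bcftools commands as shell strings. The exact options used by samtools_utils.py and variant_caller_utils.py are not documented here, so the flags, the intermediate file names, and the function name preprocessing_commands are assumptions; the commands are only printed, not executed.

```python
import shlex


def preprocessing_commands(bam, ref, prefix, threads=4):
    """Return shell commands for BAM cleanup and variant calling.

    Hypothetical helper: file suffixes and option choices are illustrative.
    """
    q = shlex.quote
    t = int(threads)
    namesort = f"{prefix}.namesort.bam"
    fixmate = f"{prefix}.fixmate.bam"
    coord = f"{prefix}.sorted.bam"
    markdup = f"{prefix}.markdup.bam"
    raw_vcf = f"{prefix}.raw.vcf.gz"
    norm_vcf = f"{prefix}.norm.vcf.gz"
    return [
        # fixmate needs name-sorted input; -m adds the tags markdup relies on
        f"samtools sort -n -@ {t} -o {q(namesort)} {q(bam)}",
        f"samtools fixmate -m {q(namesort)} {q(fixmate)}",
        f"samtools sort -@ {t} -o {q(coord)} {q(fixmate)}",
        f"samtools markdup {q(coord)} {q(markdup)}",
        f"samtools index {q(markdup)}",
        f"samtools flagstat {q(markdup)}",
        # pileup is streamed uncompressed (-Ou) into the multiallelic caller
        f"bcftools mpileup -Ou -f {q(ref)} {q(markdup)} | "
        f"bcftools call -mv -Oz -o {q(raw_vcf)}",
        f"bcftools norm -f {q(ref)} -Oz -o {q(norm_vcf)} {q(raw_vcf)}",
        f"bcftools index -t {q(norm_vcf)}",
    ]


commands = preprocessing_commands("s1.bam", "ref.fa", "out/s1")
for command in commands:
    print(command)
```

Each string could be passed to subprocess.run(command, shell=True, check=True); building the strings separately from running them keeps the plan easy to log and inspect.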

3.3 Variant Quality Control

The vcf_quality.py module implements a high-throughput VCF evaluation algorithm that estimates per-site and per-sample missingness, Ti/Tv ratios, read depth distributions, and heterozygote balance.
Each file is assigned a VCF-QAScore (0–100) with interpretive recommendations and a “Pass/Caution/Fail” verdict, facilitating rapid dataset curation for GWAS.
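
Two of the metrics listed above, the Ti/Tv ratio and per-site missingness, can be computed from plain VCF text as in the following sketch. It is a simplified illustration, not the algorithm in vcf_quality.py: the function name vcf_site_metrics is hypothetical, only biallelic SNPs are counted toward Ti/Tv, and the VCF-QAScore formula itself is not reproduced because it is not specified here.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
MISSING = {"./.", ".|.", "."}


def vcf_site_metrics(lines):
    """Compute Ti/Tv and per-site missingness from uncompressed VCF lines."""
    ti = tv = 0
    missingness = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        ref, alt = fields[3].upper(), fields[4].upper()
        # GT is the first FORMAT subfield of each sample column
        genotypes = [sample.split(":")[0] for sample in fields[9:]]
        if genotypes:
            n_missing = sum(gt in MISSING for gt in genotypes)
            missingness[(chrom, pos)] = n_missing / len(genotypes)
        # Only biallelic single-nucleotide sites contribute to Ti/Tv
        if len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT":
            if (ref, alt) in TRANSITIONS:
                ti += 1
            else:
                tv += 1
    return {
        "ti": ti,
        "tv": tv,
        "ti_tv": ti / tv if tv else None,
        "site_missingness": missingness,
    }


records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t./.:0",
    "1\t200\t.\tC\tT\t50\tPASS\t.\tGT:DP\t0/0:9\t1/1:15",
    "1\t300\t.\tA\tC\t50\tPASS\t.\tGT:DP\t0/1:20\t0/0:18",
]
metrics = vcf_site_metrics(records)
print(metrics["ti_tv"], metrics["site_missingness"])
```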

3.4 GWAS Pipeline

The core statistical analysis (gwas_pipeline.py) integrates PLINK, FaST-LMM, and bcftools utilities.
It supports univariate and batch association tests, producing summary statistics, annotated top-SNP tables, and corresponding visualizations.
Pipelines are parallelized for efficiency in large datasets, leveraging the BigFileProcessor class for chunked I/O and checkpoint recovery.
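
The idea of chunked I/O with checkpoint recovery can be sketched in a few lines. This is not the BigFileProcessor class: the function names, the JSON checkpoint layout, and the line-based chunking are assumptions chosen to keep the example small.

```python
import itertools
import json
import os
import tempfile


def _save_checkpoint(path, lines_done):
    """Write the checkpoint atomically so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"lines_done": lines_done}, fh)
    os.replace(tmp, path)


def process_in_chunks(path, handle_chunk, checkpoint_path, chunk_lines=1000):
    """Process a text file chunk by chunk, resuming from a JSON checkpoint."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)["lines_done"]
    with open(path) as fh:
        for _ in range(done):  # skip lines handled in an earlier run
            next(fh, None)
        while True:
            chunk = list(itertools.islice(fh, chunk_lines))
            if not chunk:
                break
            handle_chunk(chunk)
            done += len(chunk)
            _save_checkpoint(checkpoint_path, done)
    return done


with tempfile.TemporaryDirectory() as tmp_dir:
    data = os.path.join(tmp_dir, "snps.txt")
    checkpoint = os.path.join(tmp_dir, "progress.json")
    with open(data, "w") as fh:
        fh.writelines(f"snp{i}\n" for i in range(5))
    first_run = []
    total = process_in_chunks(data, first_run.extend, checkpoint, chunk_lines=2)
    # A second call finds the checkpoint and has nothing left to process.
    second_run = []
    process_in_chunks(data, second_run.extend, checkpoint, chunk_lines=2)
    print(total, len(first_run), len(second_run))
```

The checkpoint is saved only after a chunk has been handled, so an interrupted run repeats at most one chunk when it resumes.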

3.5 Genomic Prediction and Machine Learning

The predictive modeling subsystem (genomic_prediction_pipeline.py, gwas_AI_model.py) introduces advanced genomic selection workflows.
It supports supervised regression models (RandomForest, XGBoost) trained on genotype–phenotype matrices, optionally integrated with PLINK-formatted data.
Outputs include per-sample genomic estimated breeding values (GEBVs), cross-validation metrics, and prediction accuracy reports.
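
As a rough illustration of how cross-validated prediction accuracy can be obtained, the sketch below runs k-fold cross-validation and reports the Pearson correlation between observed phenotypes and out-of-fold predictions. It is model-agnostic and uses only the standard library; the toy predictor dosage_sum_regression is a stand-in for Random Forest or XGBoost, not the pipeline's model, and all function names are hypothetical.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle sample indices and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cross_validated_accuracy(X, y, fit_predict, k=5, seed=0):
    """Return (Pearson r, out-of-fold predictions) for any fit_predict callable."""
    predicted = [None] * len(y)
    for test in k_fold_indices(len(y), k, seed):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        for i, p in zip(test, preds):
            predicted[i] = p
    return pearson(y, predicted), predicted


def dosage_sum_regression(X_train, y_train, X_test):
    """Toy stand-in model: regress phenotype on the total alt-allele count."""
    s = [sum(row) for row in X_train]
    ms, my = sum(s) / len(s), sum(y_train) / len(y_train)
    slope = (sum((a - ms) * (b - my) for a, b in zip(s, y_train))
             / sum((a - ms) ** 2 for a in s))
    return [my + slope * (sum(row) - ms) for row in X_test]


# Toy data: 10 samples x 3 markers, phenotype is an exact linear function.
genotypes = [[0, 0, 0], [1, 0, 0], [2, 0, 0], [0, 1, 1], [2, 2, 0],
             [1, 1, 1], [2, 2, 2], [0, 2, 1], [1, 2, 2], [2, 1, 0]]
phenotypes = [1.0 + 0.5 * sum(g) for g in genotypes]
accuracy, gebv = cross_validated_accuracy(genotypes, phenotypes,
                                          dosage_sum_regression, k=5)
print(round(accuracy, 6))
```

A scikit-learn or XGBoost regressor could be plugged in by wrapping its fit and predict calls in a function with the same three-argument signature.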

4. Graphical User Interface (GUI)

The integrated interface (main_ui.py) is built with DearPyGui and organizes the pipeline into clearly defined vertical sections:

Reference Manager
FASTQ QC
Alignment
Preprocessing (Samtools / Bcftools)
Variant Quality
GWAS / Batch GWAS
PCA / Kinship
Genomic Prediction
LD Analysis
Settings

Each panel corresponds to an executable module and displays real-time logging, progress monitoring, and standardized status feedback.
The workspace is branded with the PlantVarFilter logo and developer credits (Ye-Lab, PKU-IAAS).

5. Key Features

End-to-end genomic workflow — from raw reads to predictive modeling.
Modular design — each step callable independently or as part of the GUI.
Hybrid engine — integrates classical GWAS and modern AI models.
Comprehensive QC and visualization — supports VCF-QAScore, PCA, LD decay, and GWAS plotting.
Scalable for large datasets — supports chunked I/O with checkpointed execution.
Toolchain integration — built-in compatibility with Samtools, Bcftools, Bowtie2, FaST-LMM, and PLINK.
Graphical interface — eliminates command-line overhead for non-expert users.
Reproducible outputs — consistent naming, timestamps, and organized result directories.

6. Output and Reporting

PlantVarFilter generates:

Quality control reports (.txt and .json summaries).
GWAS summary tables (P-values, SNP effects, annotations).
Visual reports (Manhattan, QQ, LD decay, PCA, phenotypic distributions).
Prediction reports (GEBVs, feature importance, model summaries).
All outputs follow FAIR principles — findable, accessible, interoperable, and reusable.

7. System Evaluation

Benchmarked on real crop datasets (e.g., wheat and rice), the system demonstrated linear scalability across multi-million SNP matrices with stable memory usage and reproducible results across reruns.
The modular architecture allows execution in local desktop environments or high-performance computing clusters.
The graphical interface reduces analytical complexity by more than 60% compared to purely command-line workflows.

8. Installation on Linux

Recommended (Conda/Mamba on Linux)

Follow these steps to install the pipeline on Ubuntu. An internet connection is required to download the necessary packages.

Open an Ubuntu terminal, then update and upgrade the system packages:

sudo apt update && sudo apt upgrade -y

Download the Miniforge installer (which provides conda and mamba) from its GitHub repository:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh

Run the installer:

bash Miniforge3-Linux-x86_64.sh

Note: press Enter, then answer yes, to complete the installation in the correct location.

Reload the shell configuration so that conda and mamba are available:

source ~/.bashrc

Create the PlantVarFilter environment with the required command-line tools:

mamba create -n pvf -c conda-forge -c bioconda python=3.11 samtools bcftools bowtie2 minimap2 plink

Activate the pipeline environment:

mamba activate pvf

Install the pipeline package:

pip install plantvarfilter

Install FaST-LMM, geneview, and XGBoost:

pip install fastlmm
pip install geneview
pip install xgboost

Launch the pipeline GUI to start working:

plantvarfilter
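
After installation, a short Python check can confirm that the external tools and Python packages listed above are visible from the active environment. This is a convenience sketch, not part of PlantVarFilter; it assumes the import names of the three pip packages match their package names.

```python
import shutil
from importlib import util

# Command-line tools installed via mamba and packages installed via pip.
TOOLS = ["samtools", "bcftools", "bowtie2", "minimap2", "plink"]
MODULES = ["fastlmm", "geneview", "xgboost"]  # assumed import names


def check_environment(tools=TOOLS, modules=MODULES):
    """Map each tool or module name to True if it can be located."""
    report = {name: shutil.which(name) is not None for name in tools}
    report.update({name: util.find_spec(name) is not None for name in modules})
    return report


status = check_environment()
for name, found in status.items():
    print(f"{name:10s} {'found' if found else 'MISSING'}")
```

Run it inside the activated pvf environment; any entry reported as MISSING points to the installation step that should be repeated.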

9. Citation

If you use PlantVarFilter in your research, please cite the following paper:

The manuscript is in preparation.

10. Authors and Acknowledgment

Developed by:
Ahmed Yassin, Computational Biologist, and Falak Sher Khan, Postdoc, Ye-Lab, Institute of Advanced Agricultural Sciences (IAAS), Peking University.

The authors gratefully acknowledge the computational resources provided by Ye-Lab and the continued guidance in genomic data processing and AI-based phenotypic prediction.

11. License and Availability

PlantVarFilter is released under the MIT License.
Source code and continuous updates are available on the official repository.
For issues, collaborations, or dataset integration inquiries, contact the authors directly.

12. Future Directions

Planned updates include:

Expansion toward pan-genomic variant aggregation.
Support for transcriptome-derived SNP integration.
Enhanced visualization engine using WebGPU for real-time rendering.
Cloud-ready version for distributed plant GWAS datasets.

13. Graphical User Interface

The figure below demonstrates the unified Dear PyGui interface of PlantVarFilter, organized by analytical stages (Reference → QC → Alignment → VCF → GWAS → Prediction).

Figure: PlantVarFilter GUI layout.

14. Full Test of the Pipeline

This section walks through a complete run, from building the reference index to GWAS analysis and genomic prediction.

First, we build the index from the reference and the read files. The read files are raw files in FASTQ format.

Figure: building the index.

The result of building the index:

Figure: indexing result.

Alignment Stage

At this stage, the raw reads are aligned to the reference.

Figure: alignment result.

The alignment result is displayed in the pipeline terminal.

Figure: alignment.

After this stage, a SAM file is produced.
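
For readers who want to see what the indexing and alignment stages correspond to on the command line, the sketch below builds plausible commands for both. The GUI's actual invocations are not documented here, so the options are assumptions: samtools is assumed for the faidx and dict steps, map-ont is assumed as the long-read preset, and the mapper output is piped into samtools sort to obtain the sorted BAM described under Methodology. The function names are hypothetical and nothing is executed.

```python
import shlex


def index_commands(ref):
    """Commands that build the reference indices used by later stages."""
    q = shlex.quote
    stem = ref.rsplit(".", 1)[0]
    return [
        f"samtools faidx {q(ref)}",
        f"samtools dict -o {q(stem + '.dict')} {q(ref)}",
        f"bowtie2-build {q(ref)} {q(stem)}",
        f"minimap2 -d {q(stem + '.mmi')} {q(ref)}",
    ]


def alignment_command(ref, reads, out_bam, long_reads=False,
                      sample="sample1", threads=4):
    """Align reads and pipe the SAM stream into a coordinate-sorted BAM."""
    q = shlex.quote
    stem = ref.rsplit(".", 1)[0]
    if long_reads:
        # map-ont is an assumed preset; other long-read presets exist
        mapper = f"minimap2 -ax map-ont -t {threads} {q(stem + '.mmi')} {q(reads[0])}"
    else:
        if len(reads) == 2:
            inputs = f"-1 {q(reads[0])} -2 {q(reads[1])}"
        else:
            inputs = f"-U {q(reads[0])}"
        # --rg-id / --rg add the optional read group tags
        mapper = (f"bowtie2 -p {threads} -x {q(stem)} --rg-id {q(sample)} "
                  f"--rg SM:{q(sample)} {inputs}")
    return f"{mapper} | samtools sort -@ {threads} -o {q(out_bam)} -"


index_cmds = index_commands("genome.fa")
short_read_cmd = alignment_command("genome.fa", ["r_1.fq.gz", "r_2.fq.gz"],
                                   "s1.sorted.bam")
long_read_cmd = alignment_command("genome.fa", ["ont.fq.gz"], "s2.sorted.bam",
                                  long_reads=True)
print("\n".join(index_cmds + [short_read_cmd, long_read_cmd]))
```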

VCF Stage

At this stage, once the pipeline has produced the VCF file, we check its quality within the pipeline.

Figure: VCF quality control.

The VCF file is then converted to PLINK format.

Figure: PLINK conversion.

The conversion produces the PLINK files.

Figure: PLINK result.
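
The VCF-to-PLINK conversion performed by the GUI is roughly equivalent to a single PLINK 1.9 call. The sketch below only assembles the argument list; the choice of --double-id and --allow-extra-chr (useful when a plant reference contains scaffold or contig names) is an assumption, and the function name is hypothetical.

```python
def vcf_to_plink_command(vcf, out_prefix, allow_extra_chr=True):
    """Build a PLINK 1.9 argument list that writes .bed/.bim/.fam files."""
    cmd = ["plink", "--vcf", vcf, "--make-bed", "--double-id",
           "--out", out_prefix]
    if allow_extra_chr:
        # accept non-standard chromosome or scaffold names
        cmd.append("--allow-extra-chr")
    return cmd


plink_cmd = vcf_to_plink_command("calls.norm.vcf.gz", "plink/calls")
expected_outputs = [f"plink/calls{ext}" for ext in (".bed", ".bim", ".fam")]
print(" ".join(plink_cmd))
print(expected_outputs)
```

The list can be handed to subprocess.run(plink_cmd, check=True) once PLINK is on the PATH.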

GWAS Stage

At this stage, we load the PLINK files produced by the VCF conversion, along with the phenotype file, and then start the analysis from the pipeline interface.
The results then appear in the results display terminal.

Figures: GWAS results (four screenshots).

LD Analysis Stage

LD analysis can be run directly from the pipeline UI.

Figure: LD analysis.

A visualization of the data is then displayed in the interface, and the user can download it.

Figures: LD analysis results (three screenshots).
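
The quantity behind an LD decay curve can be sketched as pairwise r² plotted against physical distance. The example below uses the squared Pearson correlation of genotype dosages (0, 1, 2), which is a common genotype-based approximation; whether PlantVarFilter uses this estimator or a haplotype-based one is not stated, and the function names are hypothetical.

```python
def ld_r2(g1, g2):
    """Squared correlation of two dosage vectors, skipping missing calls."""
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if len(pairs) < 2:
        return None
    m1 = sum(a for a, _ in pairs) / len(pairs)
    m2 = sum(b for _, b in pairs) / len(pairs)
    s12 = sum((a - m1) * (b - m2) for a, b in pairs)
    s11 = sum((a - m1) ** 2 for a, _ in pairs)
    s22 = sum((b - m2) ** 2 for _, b in pairs)
    if s11 == 0 or s22 == 0:
        return None  # monomorphic marker: r2 is undefined
    return s12 * s12 / (s11 * s22)


def ld_decay(positions, genotypes, max_distance):
    """Return (distance, r2) for SNP pairs on one chromosome.

    positions must be sorted; genotypes[i] holds the dosages of SNP i.
    """
    points = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            distance = positions[j] - positions[i]
            if distance > max_distance:
                break  # sorted positions: later SNPs are even farther away
            r2 = ld_r2(genotypes[i], genotypes[j])
            if r2 is not None:
                points.append((distance, r2))
    return points


positions = [100, 150, 400, 5000]
genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # identical to the first SNP
    [0, 0, 0, 2, 2, 2],  # uncorrelated with the first two
    [0, 0, 1, 2, 2, 1],  # beyond max_distance from the others
]
decay = ld_decay(positions, genotypes, max_distance=1000)
print(decay)
```

Averaging r² within distance bins of such pairs gives the points of a decay curve.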

PCA / Kinship Stage

PCA and kinship analysis can also be run from the pipeline interface.

Figure: PCA / kinship analysis.

The results:

Figures: PCA results (two screenshots).
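
One widely used definition of a genomic kinship matrix is the VanRaden relationship matrix, G = ZZ' / (2 Σ p(1 − p)), where Z holds allele dosages centered by twice the allele frequency. The sketch below implements that formula in plain Python as an illustration; it is an assumption that this estimator matches the one used by the pipeline, and the function name is hypothetical.

```python
def vanraden_kinship(genotypes):
    """Genomic relationship matrix from a samples x markers dosage table."""
    n, m = len(genotypes), len(genotypes[0])
    # Alternate-allele frequency of each marker (dosages are 0, 1 or 2)
    freqs = [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all markers are monomorphic")
    # Center each dosage by its expected value 2p
    Z = [[row[j] - 2 * freqs[j] for j in range(m)] for row in genotypes]
    return [[sum(a * b for a, b in zip(zi, zk)) / scale for zk in Z]
            for zi in Z]


# Three samples x four markers; the first two samples are identical.
samples = [
    [0, 1, 2, 0],
    [0, 1, 2, 0],
    [2, 1, 0, 2],
]
kinship = vanraden_kinship(samples)
for row in kinship:
    print([round(value, 3) for value in row])
```

Principal components can then be obtained from an eigendecomposition of this matrix, which is not shown here.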

Genomic Prediction Stage

In this section, we perform genomic prediction analysis.

The results:

Figures: genomic prediction results (four screenshots).

15. Experimental Evaluation (FaST-LMM)

Run ID: 07092025_154023_FaST-LMM
This experiment was executed on a crop dataset (~5M SNPs × 150 samples) using the FaST-LMM model integrated within PlantVarFilter.

Artifacts:

Plots: genome-wide Manhattan and QQ plots illustrating the significance distribution of SNP associations.

Figures: Manhattan plot, QQ plot.

Summary of results:

Ti/Tv ratio ≈ 2.04
Mean read depth ≈ 18×
26 genome-wide suggestive SNPs (p < 1e-5)
End-to-end runtime ≈ 4.6 hours (16-core CPU, 64 GB RAM)
Analytical complexity reduced by ~65% vs. manual CLI workflows

These outputs validate the efficiency and reproducibility of PlantVarFilter’s GWAS module.
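
To put the p < 1e-5 threshold in context: with about 5 million tests, a Bonferroni correction at α = 0.05 gives 0.05 / 5,000,000 = 1e-8, so 1e-5 is a suggestive rather than a genome-wide level. The sketch below classifies SNPs against both thresholds and computes the −log10(p) values plotted on a Manhattan plot. The SNP names and p-values are made-up toy inputs, not results from the run above, and the function name is hypothetical.

```python
import math


def summarize_hits(pvalues, n_tests=None, suggestive=1e-5, alpha=0.05):
    """Classify SNPs by p-value; p-values must be strictly positive."""
    n_tests = n_tests or len(pvalues)
    bonferroni = alpha / n_tests
    return {
        "bonferroni_threshold": bonferroni,
        "suggestive": sorted(s for s, p in pvalues.items() if p < suggestive),
        "genome_wide": sorted(s for s, p in pvalues.items() if p < bonferroni),
        "neg_log10": {s: -math.log10(p) for s, p in pvalues.items()},
    }


# Toy p-values for illustration only.
toy_pvalues = {"snp_a": 3e-9, "snp_b": 4e-6, "snp_c": 0.2}
hits = summarize_hits(toy_pvalues, n_tests=5_000_000)
print(hits["bonferroni_threshold"], hits["suggestive"], hits["genome_wide"])
```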
