PANICLE: Python Algorithms for Nucleotide-phenotype Inference and Chromosome-wide Locus Evaluation - A comprehensive GWAS pipeline with Numba JIT acceleration
PANICLE is a Python package for genome-wide association studies (GWAS). It implements GLM, MLM, FarmCPU, and BLINK. PANICLE aims for speeds comparable to or better than other implementations while supporting multiple input data formats, providing quality-of-life features (native effective marker number testing, leave-one-chromosome-out MLM, calculation of resampling model inclusion probabilities, etc.), and allowing modern GWAS algorithms to be integrated natively into Python-based data analysis pipelines and ecosystems.
Key Features
- Multiple Algorithms: GLM, MLM, FarmCPU, BLINK
- Supported Genotype Formats: VCF/BCF, PLINK, HapMap, CSV/TSV with optional caching of genotype data in binary during initial run (speeds future data loading dramatically).
- Robustness: Graceful handling of missing data.
Installation
Requires Python 3.9+.
pip install panicle
With optional dependencies for PLINK format support:
pip install panicle[plink]
Or install all optional dependencies:
pip install panicle[all]
Development Installation
To install from source for development:
git clone https://github.com/jschnable/PANICLE.git
cd PANICLE
pip install -e .[all]
Dependencies
Core dependencies (installed automatically):
- numpy≥1.19.0
- scipy≥1.6.0
- pandas≥1.2.0
- h5py≥3.0.0 (HDF5 support)
- matplotlib≥3.3.0 (plotting)
- numba≥0.50.0 (JIT compilation for performance)
- cyvcf2≥0.30.0 (fast VCF/BCF parsing)
Optional dependencies:
- bed-reader≥1.0.0: PLINK .bed/.bim/.fam format support (pip install panicle[plink])
- joblib≥1.0.0: Parallel processing for LOCO methods (pip install panicle[parallel])
Quick Start (Python API)
from panicle import PANICLE
# Run GWAS with a single function call
results = PANICLE(
    phe="data/phenotype.csv",
    geno="data/genotypes.vcf.gz",
    map_data="data/map.csv",  # optional for VCF (map is extracted automatically)
    n_pcs=3,                  # compute 3 genotype PCs internally
    method=["GLM", "MLM", "FarmCPU"]
)
# Results are also saved to CSV files automatically
For more control over data loading, use the loader functions directly:
from panicle import load_genotype_vcf, load_phenotype_file, match_individuals
from panicle import PANICLE_MLM, PANICLE_K_VanRaden, PANICLE_PCA
# Load data
genotype, sample_ids, marker_map = load_genotype_vcf("data/genotypes.vcf.gz")
phenotypes = load_phenotype_file("data/phenotype.csv")
# Align samples and loop over traits
for trait in phenotypes.columns[1:]:
    phe_trait = phenotypes[["ID", trait]].dropna()
    phe_aligned, _, geno_idx, _ = match_individuals(phe_trait, sample_ids)
    geno_subset = genotype.subset_individuals(geno_idx)
    phe_array = phe_aligned.values  # (n, 2) array: [ID, value]
    K = PANICLE_K_VanRaden(geno_subset)
    results = PANICLE_MLM(phe=phe_array, geno=geno_subset, K=K)
    df = results.to_dataframe()
    print(f"{trait}: {(df['P'] < 5e-8).sum()} significant markers")
CLI Usage (Quick Start)
The run_GWAS.py script provides a command-line interface for batch processing.
python scripts/run_GWAS.py \
    --phenotype data/phenotype.csv \
    --genotype data/genotypes.vcf.gz \
    --traits Trait1,Trait2 \
    --methods GLM,MLM,FarmCPU,BLINK \
    --n-pcs 5 \
    --compute-effective-tests \
    --outputs manhattan qq significant_marker_pvalues \
    --outputdir ./results
For a small demo dataset included in the repo, see examples/EXAMPLE_DATA.md and try:
python scripts/run_GWAS.py \
    --phenotype examples/example_phenotypes.csv \
    --genotype examples/example_genotypes.vcf.gz \
    --traits PlantHeight \
    --methods GLM \
    --outputdir ./results
Parameters
| Argument | Description | Default |
|---|---|---|
| --phenotype | Path to phenotype CSV/TSV (must contain ID column). | Required |
| --phenotype-id-column | ID column name in phenotype file. | ID |
| --genotype | Path to genotype VCF/BCF/CSV. | Required |
| --map | Optional map file (MARKER, CHROM, POS). Legacy SNP is also accepted. Recommended for numeric CSV/TSV and LOCO methods. | None |
| --format | Genotype format override: vcf, plink, hapmap, csv, tsv, numeric. | Auto |
| --traits | Comma-separated list of columns to analyze. | All numeric |
| --methods | GWAS methods: GLM, MLM, BAYESLOCO, FarmCPU, BLINK, FarmCPUResampling. | GLM,MLM,FarmCPU |
| --n-pcs | Number of Principal Components for population structure. | 3 |
| --compute-effective-tests | Calculate Effective Marker Number (Me) and use it for Bonferroni correction. | False |
| --alpha | Significance level (e.g., 0.05). Threshold = alpha / Me (or M). | 0.05 |
| --significance | Fixed p-value threshold (overrides Bonferroni). | None |
| --n-eff | Effective number of markers (overrides Me). | None |
| --covariates | External covariate file. | None |
| --covariate-columns | Comma-separated covariate column names. | All except ID |
| --covariate-id-column | ID column name in covariate file. | ID |
| --max-iterations | Max iterations for FarmCPU/BLINK. | 10 |
| --max-genotype-dosage | Max dosage (e.g., 2 for diploid). | 2.0 |
| --outputdir | Output directory. | ./GWAS_results |
| --outputs | Outputs to generate: all_marker_pvalues, significant_marker_pvalues, manhattan, qq (see docs/output_files.md). | All |
| --include-standard-errors | Include {METHOD}_SE columns in merged result CSV outputs. | False |
Other useful filters:
- --max-missing (default 1.0), --min-maf (default 0.0)
- --drop-monomorphic / --keep-monomorphic
- --snps-only, --no-split-multiallelic
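To make the missingness and MAF filters concrete, here is a minimal NumPy sketch of how such thresholds are commonly applied to a dosage matrix. The function name `marker_filter_mask`, the use of NaN for missing calls, and the inclusive comparisons are assumptions for illustration; PANICLE's own filtering code may differ in these details.

```python
import numpy as np

def marker_filter_mask(G, max_missing=1.0, min_maf=0.0, max_dosage=2.0):
    """Return a boolean mask of markers to keep (hypothetical helper).

    G is an (n_samples, n_markers) dosage matrix with NaN for missing calls.
    """
    missing_rate = np.isnan(G).mean(axis=0)          # fraction of missing calls per marker
    alt_freq = np.nanmean(G, axis=0) / max_dosage    # allele frequency from observed calls
    maf = np.minimum(alt_freq, 1.0 - alt_freq)       # minor allele frequency
    return (missing_rate <= max_missing) & (maf >= min_maf)

# Four samples, four markers: monomorphic, common, rarer, and half-missing
G = np.array([
    [0.0, 1.0, 2.0, np.nan],
    [0.0, 1.0, 2.0, 0.0],
    [0.0, 2.0, 2.0, np.nan],
    [0.0, 0.0, 1.0, 2.0],
])
keep = marker_filter_mask(G, max_missing=0.2, min_maf=0.05)
print(keep)
```

With the defaults (1.0 and 0.0), every marker in this toy matrix passes, which matches the idea that the filters are opt-in.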
Python API Usage
Integrate PANICLE into scripts or Jupyter Notebooks via the GWASPipeline class.
from panicle.pipelines.gwas import GWASPipeline
# 1. Initialize
pipeline = GWASPipeline(output_dir="./results")
# 2. Load Data (Auto-caches for speed)
pipeline.load_data(
    phenotype_file="data/phenotype.csv",
    genotype_file="data/genotype.vcf.gz",
    map_file="data/genotype.map",  # Optional unless format lacks positions
    trait_columns=["Height", "Yield"],
    loader_kwargs={'compute_effective_tests': True}  # Enable Me calculation
)
# 3. Pre-process
pipeline.align_samples()
pipeline.compute_population_structure(n_pcs=5)
# 4. Run Analysis (runs in parallel by default)
pipeline.run_analysis(
    methods=['GLM', 'MLM', 'FARMCPU', 'BLINK'],
    alpha=0.05
)
Input Formats
Phenotype & Covariates
CSV or TSV files with an ID column and numeric columns for traits/covariates. PANICLE auto-detects ID columns named ID, id, IID, sample, Sample, Taxa, taxa, Genotype, genotype, Accession, accession (if multiple, it uses the leftmost). If none match, it uses the first column. Use --phenotype-id-column (or --covariate-id-column) to specify a custom ID column name.
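The ID column rule described above (leftmost recognized name wins, otherwise fall back to the first column) can be sketched in a few lines. `detect_id_column` is a hypothetical helper written for this illustration, not a function exported by PANICLE; only the list of recognized names comes from the text above.

```python
from typing import Sequence

# Recognized ID column names, as listed above
ID_CANDIDATES = {
    "ID", "id", "IID", "sample", "Sample", "Taxa", "taxa",
    "Genotype", "genotype", "Accession", "accession",
}

def detect_id_column(columns: Sequence[str]) -> str:
    """Pick the ID column from a file header (hypothetical helper)."""
    for name in columns:            # scan left to right so the leftmost match wins
        if name in ID_CANDIDATES:
            return name
    return columns[0]               # no recognized name: use the first column

header = ["Plot", "Taxa", "ID", "Height"]
print(detect_id_column(header))
```

Here both Taxa and ID are recognized, and Taxa is chosen because it appears first.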
Genotype
- VCF/BCF: .vcf, .vcf.gz, .bcf (Preferred for performance).
- CSV/TSV: Numeric matrix (rows=samples, cols=markers) + genetic map file with MARKER, CHROM, and POS columns (legacy SNP and aliases like Chr, Pos are accepted).
- PLINK: .bed + .bim + .fam.
- HapMap: .hmp.txt.
Performance notes: VCF is typically the slowest format on the first run, but PANICLE caches parsed marker data so subsequent loads are competitive with other formats. On the first run, BCF is roughly 2x faster than VCF and PLINK/bed is roughly 4x faster than VCF (exact speedups depend on marker count, sample size, and hardware).
Tips
- Effective Tests: Use --compute-effective-tests to calculate a less stringent, more accurate Bonferroni threshold based on marker linkage (Me).
- Genotype Subsetting: If you align or filter samples manually, use GenotypeMatrix.subset_individuals(...) to preserve pre-imputed fast paths.
Documentation & Examples
Documentation
Detailed documentation is available in the docs/ directory:
- Quick Start Guide - Get up and running in 5 minutes
- API Reference - Complete API documentation for all functions and classes
- Output Files - Understanding result file formats and columns
Interactive Tutorial
- Sorghum GWAS Tutorial - Jupyter notebook with complete GWAS workflow
Example Scripts
The examples/ directory contains runnable example scripts with included test data:
| Example | Description |
|---|---|
| 01_basic_gwas.py | Simplest GWAS with GLM |
| 02_mlm_with_structure.py | MLM with population structure correction |
| 04_with_covariates.py | Including external covariates |
| 05_reading_results.py | Analyzing and visualizing results |
| 06_farmcpu_resampling.py | FarmCPU resampling with RMIP output |
Run any example:
cd examples
python 01_basic_gwas.py
Algorithms
GLM
General Linear Model for fast single-marker association testing. Uses the Frisch-Waugh-Lovell (FWL) theorem combined with QR decomposition for computational efficiency. The algorithm residualizes the phenotype and genotypes against the covariate matrix (PCs + intercept), then computes per-marker regression statistics in vectorized batches. GLM is the fastest GWAS method but may generate overly optimistic significance values.
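The residualize-then-regress idea can be illustrated with plain NumPy. This is a minimal sketch under stated assumptions, not PANICLE's Numba-accelerated implementation: `glm_fwl` and its arguments are names invented here, genotypes are assumed complete and non-monomorphic, and all markers are processed in a single batch.

```python
import numpy as np

def glm_fwl(y, G, C):
    """Per-marker OLS slopes and t statistics via FWL residualization.

    y: (n,) phenotype; G: (n, m) dosages; C: (n, q) covariates
    including an intercept column. Illustrative helper only.
    """
    n, q = C.shape
    Q, _ = np.linalg.qr(C)                     # orthonormal basis of the covariate space
    y_res = y - Q @ (Q.T @ y)                  # phenotype residualized on covariates
    G_res = G - Q @ (Q.T @ G)                  # all markers residualized at once
    gg = np.einsum("ij,ij->j", G_res, G_res)   # per-marker sum of squares
    gy = G_res.T @ y_res                       # per-marker cross product with phenotype
    beta = gy / gg
    dof = n - q - 1                            # covariates plus one marker
    rss = y_res @ y_res - beta * gy            # residual SS of each one-marker model
    se = np.sqrt(rss / dof / gg)
    return beta, beta / se, dof

rng = np.random.default_rng(0)
n, m = 200, 5
C = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 PCs
G = rng.integers(0, 3, size=(n, m)).astype(float)
y = 0.5 * G[:, 2] + C @ np.array([1.0, 0.3, -0.2]) + rng.normal(size=n)
beta, t_stats, dof = glm_fwl(y, G, C)
print(np.round(beta, 3), np.round(t_stats, 2))
```

Each slope equals the coefficient you would get from fitting the full model (covariates plus that one marker) separately, but the covariates are factorized only once. P-values follow from the t distribution with `dof` degrees of freedom.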
MLM
Mixed Linear Model accounting for population structure and cryptic relatedness via a kinship matrix.
Key design decisions:
- LOCO by default: Leave-One-Chromosome-Out kinship avoids proximal contamination (testing a marker against a kinship matrix that includes that marker), increasing power to detect true associations.
- Eigenspace transformation: Data is transformed via eigendecomposition of the kinship matrix, converting the correlated mixed model into an equivalent weighted least squares problem.
- REML variance components: Heritability (h²) is estimated using Brent's method optimization of the REML likelihood.
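As a rough illustration of the LOCO idea in the list above, the sketch below builds one VanRaden-style kinship matrix per chromosome, each excluding that chromosome's markers. The function name `loco_kinships` and the subtraction shortcut (compute the genome-wide cross product once, then remove one chromosome's contribution) are assumptions made for this example; they are not taken from PANICLE's source.

```python
import numpy as np

def loco_kinships(G, chrom):
    """Leave-one-chromosome-out VanRaden kinship (illustrative helper).

    G: (n, m) dosages coded 0/1/2 with no missing values.
    chrom: (m,) chromosome label of each marker.
    """
    p = G.mean(axis=0) / 2.0              # allele frequency per marker
    Z = G - 2.0 * p                       # centred genotypes
    scale = 2.0 * p * (1.0 - p)           # per-marker term of the VanRaden denominator
    total_zz = Z @ Z.T                    # genome-wide cross product, computed once
    total_scale = scale.sum()
    kinships = {}
    for c in np.unique(chrom):
        on_c = chrom == c
        Zc = Z[:, on_c]
        # Remove this chromosome's contribution rather than recomputing from scratch
        kinships[c] = (total_zz - Zc @ Zc.T) / (total_scale - scale[on_c].sum())
    return kinships

rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(20, 30)).astype(float)
chrom = np.repeat([1, 2, 3], 10)
K_loco = loco_kinships(G, chrom)
print({int(c): K.shape for c, K in K_loco.items()})
```

When testing a marker on chromosome 2, the model would use the matrix built without chromosome 2, so the marker never contributes to the kinship it is tested against.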
When map data is available, PANICLE's pipeline MLM path uses LOCO kinship and applies exact LRT refinement to top hits by default. LRT re-estimates variance components per marker, with a GEMMA-inspired derivative solver available for faster exact refinement versus the legacy bounded-Brent optimizer.
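The eigenspace transformation and REML step can be sketched as follows. This is a simplified illustration under assumptions: it uses one genome-wide kinship matrix (no LOCO), estimates h² once under the null model, and then tests a marker by weighted least squares. The names `reml_h2` and `mlm_marker_test` are hypothetical, and SciPy's bounded Brent-type scalar minimizer stands in for whatever optimizer settings PANICLE uses; the per-marker LRT refinement is not shown.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def reml_h2(y, X, K):
    """Estimate h2 by REML after rotating into the eigenbasis of K."""
    n, p = X.shape
    s, U = np.linalg.eigh(K)                 # K = U diag(s) U^T
    y_r, X_r = U.T @ y, U.T @ X              # rotated data have diagonal covariance

    def neg_reml(h2):
        d = h2 * s + (1.0 - h2)              # per-observation variance, up to sigma^2
        Xw = X_r / d[:, None]
        XtDX = X_r.T @ Xw
        beta = np.linalg.solve(XtDX, Xw.T @ y_r)
        rss = np.sum((y_r - X_r @ beta) ** 2 / d)
        _, logdet = np.linalg.slogdet(XtDX)
        return 0.5 * ((n - p) * np.log(rss / (n - p)) + np.sum(np.log(d)) + logdet)

    opt = minimize_scalar(neg_reml, bounds=(1e-4, 1 - 1e-4), method="bounded")
    return opt.x, s, U

def mlm_marker_test(y, X, g, h2, s, U):
    """Weighted least squares test of marker g in the rotated space."""
    w = 1.0 / np.sqrt(h2 * s + (1.0 - h2))   # whitening weights
    Xa = np.column_stack([U.T @ X, U.T @ g]) * w[:, None]
    ya = (U.T @ y) * w
    coef, *_ = np.linalg.lstsq(Xa, ya, rcond=None)
    resid = ya - Xa @ coef
    dof = len(y) - Xa.shape[1]
    cov = np.linalg.inv(Xa.T @ Xa) * (resid @ resid / dof)
    return coef[-1], coef[-1] / np.sqrt(cov[-1, -1])

rng = np.random.default_rng(0)
n, m = 120, 400
Z = rng.normal(size=(n, m))                  # stand-in for standardized genotypes
K = Z @ Z.T / m
y = 1.0 + Z @ rng.normal(size=m) / np.sqrt(m) + rng.normal(size=n)
X = np.ones((n, 1))                          # intercept only
h2_hat, s, U = reml_h2(y, X, K)
beta, t = mlm_marker_test(y, X, Z[:, 0], h2_hat, s, U)
print(f"h2 estimate: {h2_hat:.2f}, marker t statistic: {t:.2f}")
```

The eigendecomposition is the expensive step and is done once per kinship matrix; after rotation, each candidate h² costs only elementwise operations plus a small linear solve.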
FarmCPU
Fixed and random model Circulating Probability Unification. FarmCPU iteratively alternates between a fixed-effect model (GLM) and random-effect model to identify associated markers while controlling for polygenic background. FarmCPU can often detect more independent loci linked to variation in the same trait since it controls for the impact of each significant signal when determining the significance of other signals.
This means FarmCPU will NOT produce the "towers" most of us expect from classical Manhattan plots, which result from many different markers in LD with the same causal variant. Instead, it typically identifies only one marker per signal: once the effect of that marker is controlled for, the significance of any markers in LD with it declines to baseline levels.
FarmCPU Citation: Liu, X., Huang, M., Fan, B., Buckler, E. S., & Zhang, Z. (2016). Iterative usage of fixed and random effect models for powerful and efficient genome-wide association studies. PLoS genetics, 12(2), e1005767.
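A small simulation shows why towers disappear once a lead marker is used as a covariate. This is only an illustration of the conditioning effect, not the FarmCPU algorithm itself; the data are simulated and all names (`t_stat`, `causal`, `linked`) are made up for the example.

```python
import numpy as np

def t_stat(y, X):
    """t statistic for the last column of design matrix X under OLS."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[-1] / np.sqrt(cov[-1, -1])

rng = np.random.default_rng(1)
n = 500
causal = rng.integers(0, 3, size=n).astype(float)

# A neighbouring marker in strong LD: about 10% of its calls are redrawn at random
redraw = rng.random(n) < 0.1
linked = np.where(redraw, rng.integers(0, 3, size=n), causal).astype(float)

y = 1.0 * causal + rng.normal(size=n)        # only `causal` affects the trait
ones = np.ones(n)

# Single-marker test: the linked marker looks strongly associated
t_marginal = t_stat(y, np.column_stack([ones, linked]))
# Same test with the causal marker as a covariate: the signal collapses
t_conditional = t_stat(y, np.column_stack([ones, causal, linked]))
print(f"|t| alone: {abs(t_marginal):.1f}, |t| given causal marker: {abs(t_conditional):.1f}")
```

In a single-marker scan both markers would sit in the same tower; after conditioning, only the marker already in the model carries the signal.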
BLINK
Bayesian-information and Linkage-disequilibrium Iteratively Nested Keyway. BLINK builds on FarmCPU's iterative framework but uses BIC-based model selection to optimize the pseudo-QTN set. Like FarmCPU, BLINK can often identify more independent causal variants from the same phenotype/genotype set than GLM or MLM. Also like FarmCPU, it will typically identify only one significant marker per causal variant, so its Manhattan plots lack the expected "towers" formed by groups of markers that are all in LD.
Blink Citation: Huang, M., Liu, X., Zhou, Y., Summers, R. M., & Zhang, Z. (2019). BLINK: a package for the next level of genome-wide association studies with both individuals and markers in the millions. Gigascience, 8(2), giy154.
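To give a feel for BIC-based selection, the sketch below scores nested linear models that add candidate markers one at a time and keeps the prefix with the lowest BIC. It is a simplified illustration, assuming candidates arrive already ordered by prior evidence; the helper names are hypothetical and BLINK's actual pseudo-QTN procedure includes additional steps (such as LD-based pruning) not shown here.

```python
import numpy as np

def bic(y, X):
    """Gaussian BIC of an OLS fit: n*ln(RSS/n) + k*ln(n)."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coef) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

def select_prefix_by_bic(y, candidates):
    """Return how many leading candidate columns minimize BIC, plus all scores."""
    design = np.ones((len(y), 1))            # start from the intercept-only model
    scores = [bic(y, design)]
    for j in range(candidates.shape[1]):
        design = np.column_stack([design, candidates[:, j]])
        scores.append(bic(y, design))
    return int(np.argmin(scores)), scores

rng = np.random.default_rng(2)
n = 400
candidates = rng.integers(0, 3, size=(n, 5)).astype(float)
# Only the first two candidates have real effects in this simulation
y = 1.0 * candidates[:, 0] + 0.8 * candidates[:, 1] + rng.normal(size=n)

k_best, scores = select_prefix_by_bic(y, candidates)
print(k_best, np.round(scores, 1))
```

Each extra marker must improve the fit by more than ln(n) on the BIC scale to be retained, which is what discourages adding markers that merely echo a signal already in the model.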
Effective Marker Number Estimates
PANICLE includes a Python-based implementation of the effective marker number estimation method implemented in GEC. The method accounts for linkage disequilibrium between markers to provide a less conservative multiple testing correction than standard Bonferroni.
GEC citation: Li MX, Yeung JM, Cherny SS, Sham PC. Evaluating the effective numbers of independent tests and significant p-value thresholds in commercial genotyping arrays and public imputation reference datasets. Hum Genet. 2012 May;131(5):747-56.
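The effect on the significance threshold is simple arithmetic: the threshold is alpha divided by the number of tests, so replacing the raw marker count M with a smaller Me raises the threshold. In the sketch below, M is the marker count of the benchmark dataset in this README, while the Me value is a made-up placeholder chosen only to show the calculation; real Me values come from the estimation step and depend on the LD structure of your data.

```python
def bonferroni_threshold(alpha: float, n_tests: float) -> float:
    """Per-test p-value threshold for a family-wise error rate of alpha."""
    return alpha / n_tests

alpha = 0.05
m_total = 5_751_024        # raw marker count (M)
m_effective = 1_000_000    # hypothetical Me, for illustration only

thr_all = bonferroni_threshold(alpha, m_total)
thr_eff = bonferroni_threshold(alpha, m_effective)
print(f"alpha / M  = {thr_all:.2e}")
print(f"alpha / Me = {thr_eff:.2e}")
```

A marker with a p-value between the two thresholds would be declared significant under the Me-based correction but not under the raw-count correction.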
Benchmarks
Benchmarks use traits measured on 862 samples, each scored for 5,751,024 markers, and were run on an Apple M4 CPU (cached VCF).
Data Loading
| Step | Time |
|---|---|
| Genotype loading | 1.34s |
| Phenotype loading | 0.005s |
| Sample alignment | 11.12s |
| PCA (3 components) | 2.08s |
| Total | 14.55s |
Note: First run with a given genetic marker file requires substantial time for parsing (≈9 minutes for 5M markers scored for 1000 individuals); subsequent runs use binary cache and load in seconds.
Analysis Times (5.75M markers, 862 samples; excludes data loading/result writing)
| Method | Time | Notes |
|---|---|---|
| GLM | 8.94s | ~643K markers/second |
| MLM | 28.18s | LOCO kinship precompute +15.95s = 44.13s total |
| FarmCPU | 41.90s | 10 max iterations |
| BLINK | 60.81s | 10 max iterations |
Scaling by Marker Count (862 samples; includes cached load, alignment, PCA, kinship where relevant)
| Markers | GLM | MLM | FarmCPU | BLINK |
|---|---|---|---|---|
| 50,000 | 12.09s | 12.86s | 12.29s | 12.42s |
| 500,000 | 12.78s | 15.72s | 14.66s | 15.74s |
| 5,000,000 | 19.49s | 47.12s | 46.37s | 58.60s |
License
Distributed under the MIT license. See LICENSE.
Disclaimer: This is an independent Python implementation of algorithms developed by others. Any errors are mine alone. -James