PredictMix: integrated polygenic + clinical disease risk prediction pipeline

PredictMix

Integrated Polygenic + Clinical Disease Risk Prediction Pipeline

Overview

PredictMix is a modular and extensible machine-learning pipeline for integrated disease risk prediction, built to combine:

  • Polygenic Risk Scores (PRS)
  • Clinical variables
  • Environmental and lifestyle factors
  • Feature selection algorithms
  • Multiple ML models
  • Explainability (LIME-ready architecture)
  • Publication-grade visualizations

Originally motivated by genomic studies on sickle cell disease and population stratification in African cohorts, the tool applies to any tabular dataset with a binary disease outcome.

PredictMix is designed for:

  • Researchers in statistical genetics, epidemiology, and AI-driven clinical modeling
  • Large-scale biobank analyses (e.g., UKB, CKB, H3Africa)
  • Rare disease prediction and stratification
  • Integrative genomic & clinical prediction studies

Key Features

🔬 End-to-End Prediction Pipeline

  • Automated train/test split
  • Cross-validation (configurable)
  • Multiple models (logistic regression, SVM, Random Forest, MLP, ensemble)
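As a rough illustration of how the listed model families could be combined, cross-validated, and scored on a held-out split, here is a minimal scikit-learn sketch. It is not PredictMix's actual implementation: the soft-voting ensemble, the synthetic data, the 80/20 split, and all hyperparameters are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a PRS + clinical feature table (assumption).
X, y = make_classification(n_samples=300, n_features=12, n_informative=5, random_state=0)

# Hold out a stratified test set; the 20% size is an arbitrary choice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Soft voting averages the predicted probabilities of the base models.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
    ],
    voting="soft",
)
model = make_pipeline(StandardScaler(), ensemble)

# Cross-validate on the training portion only, then fit and score the test set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")

model.fit(X_train, y_train)
risk_proba = model.predict_proba(X_test)[:, 1]
print(f"mean CV AUC: {cv_auc.mean():.3f}, test samples scored: {len(risk_proba)}")
```

Putting the scaler inside the pipeline keeps scaling parameters from leaking between CV folds.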

🧬 Multi-modal Feature Integration

  • PRS + clinical + environmental + biochemical data
  • Flexible column configuration
  • Optional genotype-derived features

🔍 Feature Selection Methods

  • none
  • lasso
  • elasticnet
  • tree (Random Forest importance)
  • chi2
  • pca
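To make the idea of a `lasso` selector concrete, the sketch below keeps the k features with the largest L1-penalized logistic regression coefficients using scikit-learn. This is one plausible reading of the option, not a description of PredictMix's internals; the synthetic data, the value of `C`, and the use of `SelectFromModel` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real feature matrix (assumption).
X, y = make_classification(n_samples=200, n_features=30, n_informative=6, random_state=0)
X = StandardScaler().fit_transform(X)  # L1 penalties are scale-sensitive

k = 10  # plays the role of --n-features in this sketch
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)

# threshold=-inf disables the importance cutoff, so max_features alone
# decides: the k features with the largest |coefficient| are kept.
selector = SelectFromModel(l1_model, max_features=k, threshold=-np.inf)
X_selected = selector.fit_transform(X, y)

kept = np.flatnonzero(selector.get_support())
print(X_selected.shape, kept)
```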

📊 Advanced Plotting Suite

Generate high-quality figures from prediction outputs:

  • ROC curve
  • Precision–Recall curve
  • Histograms (all + class-stratified)
  • Scatter risk vs class
  • Confusion matrix heatmap
  • Calibration curves
  • Volcano plot for GWAS summary statistics
  • Batch “generate all plots” mode

📦 PyPI Installation & CLI-first Design

PredictMix is simple to install and use:

pip install predictmix
predictmix --help

Requirements

  • Python 3.8+

The following dependencies are installed automatically by pip:

  • numpy
  • pandas
  • scikit-learn
  • scipy
  • joblib
  • pyyaml
  • typer
  • matplotlib
  • lime
  • typing_extensions

Installation

Stable Release (PyPI)

pip install predictmix

From Source (Development)

git clone https://github.com/EtienneNtumba/predictmix.git
cd predictmix
pip install -e .

Command-Line Usage

Run:

predictmix --help

You will see something like:

Usage: predictmix [OPTIONS] COMMAND [ARGS]...

Commands:
  train        Train a PredictMix model on a dataset.
  predict      Apply a trained model to new data.
  plot         Generate visualization plots from predictions.
  plot-volcano Create volcano plots for GWAS summary statistics.

1. Train a Model

Basic Usage

predictmix train DATA.csv --model ensemble --feature-selection lasso --n-features 150

Training Options

Option                    Description                                           Default
--config, -c              Load YAML config instead of CLI options               None
--model, -m               Model: logreg, svm, rf, mlp, ensemble                 ensemble
--feature-selection, -f   FS method: none, lasso, elasticnet, tree, chi2, pca   lasso
--n-features, -k          Number of features to keep                            100
--target-column, -y       Target (label) column name (0/1)                      y
--output-dir, -o          Output directory                                      predictmix_output
--export-predictions      CSV path for y_true, risk_proba, split                <output_dir>/predictions.csv
--plots/--no-plots        Automatically generate ROC & PR plots                 --no-plots
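For scripted runs, the options in this table can be assembled from Python and passed to the CLI with `subprocess`. The sketch below uses only flags documented here; the helper name `build_train_command`, the file `data.csv`, and the per-model output directories are hypothetical. Commands are executed only if `predictmix` is on the PATH and the input file exists.

```python
import os
import shutil
import subprocess


def build_train_command(data_path, model="ensemble", feature_selection="lasso",
                        n_features=100, target_column="y",
                        output_dir="predictmix_output"):
    """Assemble a `predictmix train` command from the documented options."""
    return [
        "predictmix", "train", str(data_path),
        "--model", model,
        "--feature-selection", feature_selection,
        "--n-features", str(n_features),
        "--target-column", target_column,
        "--output-dir", output_dir,
    ]


# One run per model, each writing to its own (hypothetical) directory.
commands = [
    build_train_command("data.csv", model=m, output_dir=f"predictmix_output_{m}")
    for m in ("logreg", "rf", "ensemble")
]

for cmd in commands:
    print(" ".join(cmd))
    # Skip execution when the CLI or the input file is unavailable.
    if shutil.which("predictmix") and os.path.exists("data.csv"):
        subprocess.run(cmd, check=True)
```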

Training Output

By default, training creates:

predictmix_output/
│
├── predictmix_model.joblib   # Trained model
├── config.json               # Configuration snapshot
├── metrics.json              # CV + test metrics
└── predictions.csv           # y_true, risk_proba, split

metrics.json

{
  "cv": {
    "accuracy": ...,
    "auc": ...,
    "precision_macro": ...,
    "recall_macro": ...,
    "f1_macro": ...
  },
  "test": {
    "accuracy": ...,
    "auc": ...,
    "precision_macro": ...,
    "recall_macro": ...,
    "f1_macro": ...
  }
}
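A quick way to inspect this file is to compare each cross-validation metric with its test-set counterpart. The sketch below parses an inline JSON string with the same keys; the numbers are placeholders invented for illustration, not results from PredictMix.

```python
import json

# Placeholder values for illustration only; real values come from training.
example = """
{
  "cv":   {"accuracy": 0.80, "auc": 0.86, "precision_macro": 0.79,
           "recall_macro": 0.78, "f1_macro": 0.78},
  "test": {"accuracy": 0.77, "auc": 0.83, "precision_macro": 0.76,
           "recall_macro": 0.75, "f1_macro": 0.75}
}
"""
# With a real run, read predictmix_output/metrics.json instead.
metrics = json.loads(example)

# A large CV-minus-test gap can hint at overfitting or an unlucky split.
gaps = {
    name: round(metrics["cv"][name] - metrics["test"][name], 4)
    for name in metrics["cv"]
}
for name, gap in gaps.items():
    print(f"{name:16s} cv={metrics['cv'][name]:.2f} "
          f"test={metrics['test'][name]:.2f} gap={gap:+.2f}")
```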

predictions.csv

Column       Description
y_true       True binary label (0/1)
risk_proba   Predicted probability for class 1
split        "train_cv" for CV, "test" for test set
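Because the file carries both labels and probabilities, metrics can be recomputed per split without retraining. The sketch below computes a rank-based AUC (the probability that a randomly chosen case scores higher than a randomly chosen control) for the `test` rows using only the standard library. The inline rows are made up to match the documented layout.

```python
import csv
import io

# Tiny made-up file in the documented layout (values are illustrative).
example_csv = """y_true,risk_proba,split
0,0.10,train_cv
1,0.80,train_cv
0,0.20,test
1,0.90,test
0,0.60,test
1,0.55,test
"""


def auc_from_rows(rows):
    """Rank-based AUC: P(case score > control score), ties count as 0.5."""
    pos = [float(r["risk_proba"]) for r in rows if int(float(r["y_true"])) == 1]
    neg = [float(r["risk_proba"]) for r in rows if int(float(r["y_true"])) == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rows = list(csv.DictReader(io.StringIO(example_csv)))
test_rows = [r for r in rows if r["split"] == "test"]
test_auc = auc_from_rows(test_rows)
print(f"test AUC = {test_auc:.3f}")
```

The pairwise loop is quadratic in the number of samples, which is fine for small files; for large cohorts `sklearn.metrics.roc_auc_score` is the more practical choice.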

2. Predict on New Samples

Usage

predictmix predict MODEL_PATH DATA.csv --output predictions_new.csv

Arguments

Argument     Description
MODEL_PATH   Path to predictmix_model.joblib from training
DATA         CSV/Parquet with new individuals (no label column required)

Options

Option         Description                     Default
--output, -o   CSV file to write predictions   predictmix_predictions.csv

The output file will contain all original columns plus:

Column       Description
risk_proba   Predicted probability for the positive class
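Since the original columns are preserved, the output can be ranked directly by `risk_proba`. The following sketch sorts a made-up prediction file and lists the highest-risk individuals; the `sample_id` column and all values are hypothetical stand-ins for whatever columns the input contained.

```python
import csv
import io

# Made-up prediction output; `sample_id` is a hypothetical original column.
example_csv = """sample_id,prs,age,risk_proba
S1,0.12,35,0.08
S2,1.45,29,0.91
S3,-0.34,41,0.15
S4,1.10,33,0.67
"""

rows = list(csv.DictReader(io.StringIO(example_csv)))

# Sort individuals from highest to lowest predicted risk.
ranked = sorted(rows, key=lambda r: float(r["risk_proba"]), reverse=True)
top_two = [r["sample_id"] for r in ranked[:2]]
print(top_two)
```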

3. Generate Plots from Predictions

Usage

predictmix plot predictions.csv --kind all --output-dir predictmix_plots

Arguments

Argument   Description
RESULTS    CSV file with at least y_true and risk_proba columns

Options

Option             Description                                            Default
--kind, -k         Plot type: rocpr, hist, scatter, heatmap, calib, all   all
--output-dir, -o   Directory for plot PNGs                                predictmix_plots

Generated Plots (for --kind all)

  • roc_curve.png – ROC curve
  • pr_curve.png – Precision–Recall curve
  • hist_risk_all.png – Risk distribution (all samples)
  • hist_risk_by_class.png – Risk distribution by class
  • scatter_risk_vs_class.png – Scatter of risk vs. true class
  • confusion_heatmap.png – Confusion matrix heatmap
  • calibration_curve.png – Calibration (reliability) curve
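For readers unfamiliar with calibration (reliability) curves, the sketch below shows the underlying computation in plain Python: predictions are grouped into equal-width probability bins, and each bin's mean predicted risk is compared with its observed case fraction. The bin count and the toy data are assumptions; this does not reproduce PredictMix's plotting code.

```python
def calibration_bins(y_true, risk_proba, n_bins=5):
    """Return (mean predicted risk, observed case fraction, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, risk_proba):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    curve = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            frac_pos = sum(y for y, _ in members) / len(members)
            curve.append((mean_pred, frac_pos, len(members)))
    return curve


# Toy labels and probabilities, invented for the example.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
risk_proba = [0.05, 0.15, 0.30, 0.35, 0.65, 0.70, 0.90, 0.95]

curve = calibration_bins(y_true, risk_proba)
for mean_pred, frac_pos, n in curve:
    print(f"predicted={mean_pred:.2f} observed={frac_pos:.2f} n={n}")
```

A well-calibrated model has points close to the diagonal, where predicted and observed values agree.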

4. Volcano Plot for GWAS Summary Statistics

Usage

predictmix plot-volcano gwas_summary.csv --effect-col beta --pval-col pval --output volcano.png

Arguments

Argument   Description
summary    GWAS-like summary statistics CSV file

Options

Option         Description                                     Default
--effect-col   Name of effect-size column (e.g. beta, logOR)   beta
--pval-col     Name of p-value column                          pval
--output, -o   Output PNG for volcano plot                     predictmix_volcano.png

The input file must contain the columns named by --effect-col and --pval-col.
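A volcano plot places each variant at (effect size, -log10 p-value). The sketch below computes those coordinates from a small made-up summary file and flags variants below the conventional genome-wide significance threshold of 5e-8. The `snp` column, the example rows, and the helper `volcano_points` are hypothetical, and the threshold PredictMix itself uses (if any) is not stated in this document.

```python
import csv
import io
import math

# Made-up summary statistics; only `beta` and `pval` are used.
example_csv = """snp,beta,pval
rs1,0.42,1e-9
rs2,-0.05,0.3
rs3,-0.31,2e-8
rs4,0.02,0.04
"""


def volcano_points(rows, effect_col="beta", pval_col="pval", alpha=5e-8):
    """Map each variant to (effect, -log10 p, is_significant)."""
    points = []
    for row in rows:
        p = float(row[pval_col])
        if not 0 < p <= 1:
            continue  # skip invalid p-values; log10(0) is undefined
        points.append((float(row[effect_col]), -math.log10(p), p < alpha))
    return points


rows = list(csv.DictReader(io.StringIO(example_csv)))
points = volcano_points(rows)
for effect, neg_log_p, significant in points:
    print(f"effect={effect:+.2f} -log10(p)={neg_log_p:.2f} significant={significant}")
```

The first two tuple elements can be passed as x and y to any scatter-plot routine, with the flag used for coloring.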


Input Data Format

Minimum Required Columns for Training

  • One binary label column (e.g. y, case_control)
  • One or more numeric feature columns (PRS, clinical variables, labs, etc.)

Example data.csv

y,prs,age,bmi,family_history,hbF,env_score
0,0.12,35,22.5,0,0.15,0.3
1,1.45,29,27.1,1,0.08,0.7
0,-0.34,41,24.8,0,0.20,0.2
1,1.10,33,26.3,1,0.05,0.8

If your label column has a different name (e.g. case_control), pass it with --target-column:

predictmix train data.csv --target-column case_control ...
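Before training, it can help to check that a file meets the minimum requirements listed above. The following standard-library sketch verifies that the target column exists, that its values are 0/1, and that every other column parses as a number. The helper `check_training_table` is hypothetical and is not part of PredictMix, which may apply different or additional checks.

```python
import csv
import io

# The example table from this section.
example_csv = """y,prs,age,bmi,family_history,hbF,env_score
0,0.12,35,22.5,0,0.15,0.3
1,1.45,29,27.1,1,0.08,0.7
0,-0.34,41,24.8,0,0.20,0.2
1,1.10,33,26.3,1,0.05,0.8
"""


def check_training_table(rows, target_column="y"):
    """Return a list of problems; an empty list means the basic checks passed."""
    if not rows:
        return ["file has no data rows"]
    if target_column not in rows[0]:
        return [f"missing target column {target_column!r}"]
    problems = []
    features = [c for c in rows[0] if c != target_column]
    if not features:
        problems.append("no feature columns")
    labels = {r[target_column] for r in rows}
    if not labels <= {"0", "1"}:
        problems.append(f"non-binary labels: {sorted(labels)}")
    for col in features:
        try:
            for r in rows:
                float(r[col])
        except (TypeError, ValueError):
            problems.append(f"non-numeric values in column {col!r}")
    return problems


rows = list(csv.DictReader(io.StringIO(example_csv)))
problems = check_training_table(rows)
print(problems or "basic checks passed")
```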

Project Structure (Simplified)

src/predictmix/
├── __init__.py
├── cli.py                # Command-line interface (Typer)
├── config.py             # Config dataclass
├── data.py               # Input loading & preprocessing
├── feature_selection.py  # Feature selection methods
├── models.py             # Model factory (logreg, SVM, RF, MLP, ensemble)
├── pipeline.py           # High-level training & prediction pipeline
├── plots.py              # All plotting utilities (ROC, PR, hist, heatmap, volcano)
└── prs.py                # PRS-related utilities (optional/extensible)

Authors

Primary Developer

Etienne Ntumba Kabongo
McGill University, Montréal, Canada
Email: etienne.kabongo@mcgill.ca

Scientific Supervisor

Prof. Emile R. Chimusa
Northumbria University, United Kingdom
Email: emile.chimusa@northumbria.ac.uk


License

This project is distributed under the MIT License. See the LICENSE file for details.


How to Cite PredictMix

If you use PredictMix in research, please cite:

Ntumba Kabongo E., Chimusa E.R., PredictMix: an integrated polygenic–clinical machine learning pipeline for disease risk prediction, 2025.


Future Extensions

  • SHAP explainability and global/local feature importance
  • Multi-class classification support
  • Deep learning-based models
  • Integration with PRS-CS, LDpred and other PRS frameworks
  • Automated genotype ingestion and variant-annotation hooks
  • Nextflow and Snakemake wrappers for large-scale HPC deployments
  • Model cards and interactive interpretability dashboards
