PredictMix: integrated polygenic + clinical disease risk prediction pipeline
Developed by:
- Etienne Ntumba Kabongo, McGill University
- Email: etienne.kabongo@mcgill.ca
- Prof. Emile R. Chimusa, Northumbria University
Overview
PredictMix is a modular and extensible machine-learning pipeline for integrated disease risk prediction, built to combine:
- Polygenic Risk Scores (PRS)
- Clinical variables
- Environmental and lifestyle factors
- Feature selection algorithms
- Multiple ML models
- Explainability (LIME-ready architecture)
- Publication-grade visualizations
Originally motivated by genomic studies of sickle cell disease and population stratification in African cohorts, the tool generalizes to any dataset that calls for binary disease risk prediction.
PredictMix is designed for:
- Researchers in statistical genetics, epidemiology, and AI-driven clinical modeling
- Large-scale biobank analyses (e.g., UKB, CKB, H3Africa)
- Rare disease prediction and stratification
- Integrative genomic & clinical prediction studies
Key Features
🔬 End-to-End Prediction Pipeline
- Automated train/test split
- Cross-validation (configurable)
- Multiple models (logistic regression, SVM, Random Forest, MLP, ensemble)
🧬 Multi-modal Feature Integration
- PRS + clinical + environmental + biochemical data
- Flexible column configuration
- Optional genotype-derived features
🔍 Feature Selection Methods
- none
- lasso
- elasticnet
- tree (Random Forest importance)
- chi2
- pca
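To make the tree option more concrete, here is a minimal sketch of importance-based selection written directly against scikit-learn. It is not PredictMix's internal code: the synthetic data, the variable names, and the choice of `SelectFromModel` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for a PRS + clinical feature matrix (not real data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Number of features to keep (plays the role of --n-features).
k = 5

# Rank features by Random Forest importance and keep the top k.
# threshold=-np.inf disables the importance cut-off so only max_features applies.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(forest, max_features=k, threshold=-np.inf)
X_selected = selector.fit_transform(X, y)

# Column indices of the retained features.
kept = np.flatnonzero(selector.get_support())
print(X_selected.shape, kept)
```

In a real analysis the selector should be fitted on training folds only, so that the held-out data does not influence which features are kept.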
📊 Advanced Plotting Suite
Generate high-quality figures from prediction outputs:
- ROC curve
- Precision–Recall curve
- Histograms (all + class-stratified)
- Scatter risk vs class
- Confusion matrix heatmap
- Calibration curves
- Volcano plot for GWAS summary statistics
- Batch “generate all plots” mode
📦 PyPI Installation & CLI-first Design
PredictMix is simple to install and use:
pip install predictmix
predictmix --help
Requirements
- Python 3.8+
Installed automatically when using pip:
- numpy
- pandas
- scikit-learn
- scipy
- joblib
- pyyaml
- typer
- matplotlib
- lime
- typing_extensions
Installation
Stable Release (PyPI)
pip install predictmix
From Source (Development)
git clone https://github.com/EtienneNtumba/predictmix.git
cd predictmix
pip install -e .
Command-Line Usage
Run:
predictmix --help
You will see something like:
Usage: predictmix [OPTIONS] COMMAND [ARGS]...
Commands:
train Train a PredictMix model on a dataset.
predict Apply a trained model to new data.
plot Generate visualization plots from predictions.
plot-volcano Create volcano plots for GWAS summary statistics.
1. Train a Model
Basic Usage
predictmix train DATA.csv --model ensemble --feature-selection lasso --n-features 150
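For readers who want a mental model of what a run like this involves, the sketch below walks through a train/test split, cross-validation, and a soft-voting ensemble using plain scikit-learn. It assumes a split, cross-validate, fit workflow; the estimators, fold count, and synthetic data are illustrative choices and may differ from what PredictMix does internally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-outcome data standing in for DATA.csv (not real data).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)

# Hold out 20% of samples, preserving the case/control ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Soft voting averages the predicted probabilities of the member models.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)

# 5-fold cross-validated AUC on the training portion only.
cv_auc = cross_val_score(ensemble, X_train, y_train, cv=5, scoring="roc_auc")

# Fit on all training data, then score the held-out samples.
ensemble.fit(X_train, y_train)
risk_proba = ensemble.predict_proba(X_test)[:, 1]
print(cv_auc.mean(), risk_proba[:5])
```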
Training Options
| Option | Description | Default |
|---|---|---|
| --config, -c | Load YAML config instead of CLI options | None |
| --model, -m | Model: logreg, svm, rf, mlp, ensemble | ensemble |
| --feature-selection, -f | FS method: none, lasso, elasticnet, tree, chi2, pca | lasso |
| --n-features, -k | Number of features to keep | 100 |
| --target-column, -y | Target (label) column name (0/1) | y |
| --output-dir, -o | Output directory | predictmix_output |
| --export-predictions | CSV path for y_true, risk_proba, split | <output_dir>/predictions.csv |
| --plots/--no-plots | Automatically generate ROC & PR plots | --no-plots |
Training Output
By default, training creates:
predictmix_output/
│
├── predictmix_model.joblib # Trained model
├── config.json # Configuration snapshot
├── metrics.json # CV + test metrics
└── predictions.csv # y_true, risk_proba, split
metrics.json
{
"cv": {
"accuracy": ...,
"auc": ...,
"precision_macro": ...,
"recall_macro": ...,
"f1_macro": ...
},
"test": {
"accuracy": ...,
"auc": ...,
"precision_macro": ...,
"recall_macro": ...,
"f1_macro": ...
}
}
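Because the file keeps cross-validation and test metrics side by side, a quick comparison of the two can flag overfitting. The helper below is a small sketch using only the standard library; the function names are hypothetical and the numbers are illustrative placeholders, not PredictMix results.

```python
import json


def load_metrics(path):
    """Return the parsed dictionary from a metrics.json file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def cv_test_gap(metrics, key="auc"):
    """Return the CV value minus the test value for one metric.

    A large positive gap suggests the model does worse on held-out data.
    """
    return metrics["cv"][key] - metrics["test"][key]


# Illustrative values only; use load_metrics("predictmix_output/metrics.json")
# to read the file written by a real training run.
example = {"cv": {"auc": 0.80}, "test": {"auc": 0.75}}
gap = cv_test_gap(example)
print(f"CV - test AUC gap: {gap:.2f}")
```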
predictions.csv
| Column | Description |
|---|---|
| y_true | True binary label (0/1) |
| risk_proba | Predicted probability for class 1 |
| split | "train_cv" for CV, "test" for test set |
2. Predict on New Samples
Usage
predictmix predict MODEL_PATH DATA.csv --output predictions_new.csv
Arguments
| Argument | Description |
|---|---|
| MODEL_PATH | Path to predictmix_model.joblib from training |
| DATA | CSV/Parquet with new individuals (no label column required) |
Options
| Option | Description | Default |
|---|---|---|
| --output, -o | CSV file to write predictions | predictmix_predictions.csv |
The output file will contain all original columns plus:
| Column | Description |
|---|---|
| risk_proba | Predicted probability for the positive class |
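Since the output keeps the original columns next to risk_proba, downstream triage can be done with a few lines of pandas. The sketch below ranks individuals by predicted risk and flags those above a cut-off. The `sample_id` column, the rows, and the 0.5 threshold are assumptions for illustration; a real study should choose its own threshold.

```python
import pandas as pd

# Made-up rows in the documented output layout (original columns + risk_proba).
scored = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3", "s4"],  # hypothetical original column
        "age": [35, 29, 41, 33],
        "risk_proba": [0.12, 0.81, 0.35, 0.66],
    }
)

# Assumption: 0.5 is only a placeholder cut-off.
threshold = 0.5

# Highest predicted risk first, then flag rows at or above the cut-off.
ranked = scored.sort_values("risk_proba", ascending=False).reset_index(drop=True)
ranked["high_risk"] = ranked["risk_proba"] >= threshold
print(ranked)
```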
3. Generate Plots from Predictions
Usage
predictmix plot predictions.csv --kind all --output-dir predictmix_plots
Arguments
| Argument | Description |
|---|---|
| RESULTS | CSV file with at least y_true and risk_proba columns |
Options
| Option | Description | Default |
|---|---|---|
| --kind, -k | rocpr, hist, scatter, heatmap, calib, all | all |
| --output-dir, -o | Directory for plot PNGs | predictmix_plots |
Generated Plots (for --kind all)
- roc_curve.png – ROC curve
- pr_curve.png – Precision–Recall curve
- hist_risk_all.png – Risk distribution (all samples)
- hist_risk_by_class.png – Risk distribution by class
- scatter_risk_vs_class.png – Scatter of risk vs. true class
- confusion_heatmap.png – Confusion matrix heatmap
- calibration_curve.png – Calibration (reliability) curve
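If you need a custom figure, the same predictions file can be plotted by hand. The following sketch draws a ROC curve with scikit-learn and matplotlib from a table in the y_true / risk_proba / split layout. The rows are made up, restricting to the test split is an assumption rather than documented behaviour of the plot command, and the output file name is hypothetical.

```python
import io
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # headless backend; no display required
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import auc, roc_curve

# Made-up rows in the documented predictions.csv layout (not real results).
csv_text = """y_true,risk_proba,split
0,0.10,train_cv
1,0.80,train_cv
0,0.20,test
1,0.90,test
0,0.40,test
1,0.35,test
"""
preds = pd.read_csv(io.StringIO(csv_text))

# Assumption: evaluate on the held-out rows only.
test_rows = preds[preds["split"] == "test"]

# False/true positive rates across thresholds, and the area under the curve.
fpr, tpr, _ = roc_curve(test_rows["y_true"], test_rows["risk_proba"])
roc_auc = auc(fpr, tpr)

fig, ax = plt.subplots(figsize=(4, 4))
ax.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
ax.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend(loc="lower right")

# Write the figure to the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "roc_curve_sketch.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)
```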
4. Volcano Plot for GWAS Summary Statistics
Usage
predictmix plot-volcano gwas_summary.csv --effect-col beta --pval-col pval --output volcano.png
Arguments
| Argument | Description |
|---|---|
| summary | GWAS-like summary statistics CSV file |
Options
| Option | Description | Default |
|---|---|---|
| --effect-col | Name of effect-size column (e.g. beta, logOR) | beta |
| --pval-col | Name of p-value column | pval |
| --output, -o | Output PNG for volcano plot | predictmix_volcano.png |
The input file must contain the columns named by --effect-col and --pval-col.
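A volcano plot places the effect size on the x-axis and -log10(p) on the y-axis, so strongly associated variants rise to the top. The sketch below computes those coordinates with the standard library only. The `volcano_points` helper, the `snp` column, the example rows, and the 5e-8 significance cut-off (a common genome-wide convention) are assumptions for illustration, not PredictMix defaults.

```python
import csv
import io
import math


def volcano_points(rows, effect_col="beta", pval_col="pval", alpha=5e-8):
    """Return (effect, -log10(p), is_significant) for each summary-statistics row."""
    points = []
    for row in rows:
        effect = float(row[effect_col])
        pval = float(row[pval_col])
        # -log10 is undefined at 0, so reject values outside (0, 1].
        if not 0.0 < pval <= 1.0:
            raise ValueError(f"p-value out of range: {pval}")
        points.append((effect, -math.log10(pval), pval < alpha))
    return points


# Made-up summary statistics; the "snp" column is hypothetical.
summary_text = """snp,beta,pval
rs_a,0.50,1e-10
rs_b,-0.02,0.3
rs_c,-0.40,1e-9
"""
rows = list(csv.DictReader(io.StringIO(summary_text)))
points = volcano_points(rows)

# Names of variants passing the assumed significance cut-off.
significant = [row["snp"] for row, pt in zip(rows, points) if pt[2]]
print(significant)
```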
Input Data Format
Minimum Required Columns for Training
- One binary label column (e.g. y, case_control)
- One or more numeric feature columns (PRS, clinical variables, labs, etc.)
Example data.csv
y,prs,age,bmi,family_history,hbF,env_score
0,0.12,35,22.5,0,0.15,0.3
1,1.45,29,27.1,1,0.08,0.7
0,-0.34,41,24.8,0,0.20,0.2
1,1.10,33,26.3,1,0.05,0.8
If your label column has a different name (e.g. case_control), pass it with --target-column:
predictmix train data.csv --target-column case_control ...
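Before training, it can help to confirm that a table meets the format requirements above. The check below is a minimal sketch in pandas; `check_training_table` is a hypothetical helper, not part of the PredictMix API, and it only tests for a 0/1 label column and numeric feature columns.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_training_table(df, target_column="y"):
    """Return a list of problems; an empty list means the table matches the format."""
    if target_column not in df.columns:
        return [f"missing label column: {target_column}"]
    problems = []
    # The label column must contain only 0 and 1.
    labels = set(df[target_column].dropna().unique())
    if not labels <= {0, 1}:
        problems.append(f"label column is not binary 0/1: {sorted(labels, key=str)}")
    # Every remaining column is treated as a feature and must be numeric.
    features = df.drop(columns=[target_column])
    if features.shape[1] == 0:
        problems.append("no feature columns")
    non_numeric = [c for c in features.columns if not is_numeric_dtype(features[c])]
    if non_numeric:
        problems.append(f"non-numeric feature columns: {non_numeric}")
    return problems


# The example data.csv from this section.
data_text = """y,prs,age,bmi,family_history,hbF,env_score
0,0.12,35,22.5,0,0.15,0.3
1,1.45,29,27.1,1,0.08,0.7
0,-0.34,41,24.8,0,0.20,0.2
1,1.10,33,26.3,1,0.05,0.8
"""
df = pd.read_csv(io.StringIO(data_text))
problems = check_training_table(df)
print(problems)
```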
Project Structure (Simplified)
src/predictmix/
├── __init__.py
├── cli.py # Command-line interface (Typer)
├── config.py # Config dataclass
├── data.py # Input loading & preprocessing
├── feature_selection.py # Feature selection methods
├── models.py # Model factory (logreg, SVM, RF, MLP, ensemble)
├── pipeline.py # High-level training & prediction pipeline
├── plots.py # All plotting utilities (ROC, PR, hist, heatmap, volcano)
└── prs.py # PRS-related utilities (optional/extensible)
Authors
Primary Developer
Etienne Ntumba Kabongo
McGill University, Montréal, Canada
Email: etienne.kabongo@mcgill.ca
Scientific Supervisor
Prof. Emile R. Chimusa
Northumbria University, United Kingdom
Email: emile.chimusa@northumbria.ac.uk
License
This project is distributed under the MIT License. See the LICENSE file for details.
How to Cite PredictMix
If you use PredictMix in research, please cite:
Ntumba Kabongo E., Chimusa E.R., PredictMix: an integrated polygenic–clinical machine learning pipeline for disease risk prediction, 2025.
Future Extensions
- SHAP explainability and global/local feature importance
- Multi-class classification support
- Deep learning-based models
- Integration with PRS-CS, LDpred and other PRS frameworks
- Automated genotype ingestion and variant-annotation hooks
- Nextflow and Snakemake wrappers for large-scale HPC deployments
- Model cards and interactive interpretability dashboards