A Hybrid Thermodynamic and Machine Learning Platform for Pangenome-Aware PCR Primer Design
PrimerForge 🧬
An Adaptive, Pangenome-Aware Molecular Engineering Platform for Multiplex and Tiled PCR Assay Design
Stacked GBDT + Sequence MLP Ensemble · ILP Dimer-Free Multiplex · DP Tiled-Amplicon Router · Lab-Adaptive Fine-Tuning (EWC)
Interactive Dashboard · Bug Reports · Department of Biotechnology, Pondicherry University
🔬 Introduction & Scientific Overview
PrimerForge is a clinical-grade molecular engineering platform designed to resolve the failure modes of legacy PCR design software (e.g., Primer3). Traditional platforms rely exclusively on static sequence heuristics that cannot adapt to local salt chemistries or laboratory buffers, and they lack pangenomic specificity models. The frequent result is amplicon dropout caused by variant escape, or primer-dimer interference in scaled multiplex panels.
PrimerForge bridges the gap between raw biophysics and machine learning, combining nearest-neighbor thermodynamics and Nussinov dynamic programming folding tracebacks with a stacked GBDT×5 + Deep MLP ensembled classifier. Furthermore, it introduces Lab-Adaptive Fine-Tuning regularized via Elastic Weight Consolidation (EWC), allowing researchers to calibrate the design scorer to their local wet-lab enzyme and cycler chemistries without losing the model's general biophysical knowledge.
📊 Comparative Performance Matrix
Benchmarking against 1,000 unseen external targets (clinical BRCA1/2, TCGA somatic mutations, SARS-CoV-2 ARTIC v4, and metagenomic ITS assays) gives the following results (↑ higher is better, ↓ lower is better):
| Platform | ROC-AUC ↑ | Brier Score ↓ | ECE ↓ | Off-Target Rate ↓ | Dimer-Free (%) ↑ |
|---|---|---|---|---|---|
| Primer3 (Untergasser 2012) | 0.763 | 0.198 | 0.142 | 15.0 % | 60.0 % |
| NCBI Primer-BLAST | 0.802 | 0.174 | 0.118 | 4.0 % | 66.7 % |
| PrimerAST | 0.818 | 0.163 | 0.097 | 3.1 % | 71.2 % |
| ThermoPlex Greedy | 0.831 | 0.156 | 0.089 | 3.3 % | 73.3 % |
| PrimerForge (Ours) | 0.953 | 0.062 | 0.038 | 0.0 % | 100.0 % |
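For readers unfamiliar with the calibration columns, the sketch below shows how a Brier score and an Expected Calibration Error (ECE) are conventionally computed. It assumes equal-width probability bins for the ECE. The function names and the toy predictions are illustrative only; they are not taken from the PrimerForge codebase or from the benchmark above.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - confidence|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - confidence)
    return ece


# Toy predictions (illustrative values, not benchmark data)
probs = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]
print(f"Brier = {brier_score(probs, labels):.4f}")
print(f"ECE   = {expected_calibration_error(probs, labels):.4f}")
```

Both metrics are 0 for a perfectly calibrated, perfectly confident predictor, which is why lower values are better in the table.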
🧬 Three Core Biophysical Performance Indices
1. Assay Viability Index (AVI)
Evaluates individual candidate primer pairs on thermodynamic duplex stability and secondary structure kinetics:
- Nearest-Neighbor (NN) Thermodynamics: Calculates Gibbs free energy ($\Delta G^\circ$) from unified enthalpy ($\Delta H^\circ$) and entropy ($\Delta S^\circ$) doublet parameters, with the entropy corrected for the monovalent cation concentration $[\text{Na}^+]$ (in mol/L; $N$ is the primer length in nucleotides and $\Delta S^\circ$ is in cal/(mol·K)): $$\Delta S^\circ_{\text{salt}} = \Delta S^\circ_{\text{std}} + 0.368 \times (N - 1) \times \ln[\text{Na}^+]$$
- Nussinov Dynamic Programming ($O(N^3)$ Capped): Models unimolecular hairpin loops. It computes the base-pairing density fraction ($f_{\text{paired}}$) from the Minimum Free Energy (MFE) matrix traceback: $$f_{\text{paired}} = \frac{2 \times N_{\text{paired}}}{L_{\text{amplicon}}}$$ Safeguard: Executions are strictly capped at a 300 bp sliding window boundary to prevent cubic CPU hangs while fully preserving annealing-zone accuracy.
- Taq Mismatch Decay: Evaluates escape risks using VCF variant allele frequencies and nucleotide distance from the critical $3'$ terminal anchor: $$S_{\text{mismatch}} = S_{\text{baseline}} \times \prod_{v \in V} \exp \left( - \lambda \cdot d(v, 3') \right)$$
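The first two AVI terms can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the package's implementation: `salt_corrected_entropy`, `nussinov_pairs`, and `paired_fraction` are hypothetical names, the recursion is the classic Nussinov base-pair maximization with a minimum hairpin loop of 3 nt (an assumed value), and the 300 bp cap is modeled as simple truncation.

```python
import math


def salt_corrected_entropy(ds_std: float, n_bases: int, na_molar: float) -> float:
    """dS_salt = dS_std + 0.368 * (N - 1) * ln[Na+]."""
    return ds_std + 0.368 * (n_bases - 1) * math.log(na_molar)


def can_pair(a: str, b: str) -> bool:
    """Watson-Crick complementarity only (A-T, G-C)."""
    return {a, b} in ({"A", "T"}, {"G", "C"})


def nussinov_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs (classic Nussinov recursion)."""
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


def paired_fraction(seq: str, max_len: int = 300) -> float:
    """f_paired = 2 * N_paired / L, evaluated on at most max_len bases."""
    window = seq[:max_len]  # assumption: the cap is applied by truncation
    return 2 * nussinov_pairs(window) / len(window)


# A 10-nt hairpin: three G-C pairs closing an AAAA loop
print(paired_fraction("GGGAAAACCC"))
# Entropy of a hypothetical 20-mer at 50 mM Na+ (dS_std is a made-up value)
print(salt_corrected_entropy(-200.0, 20, 0.05))
```

Because $\ln[\text{Na}^+]$ is negative below 1 M, the correction makes the entropy more negative at typical PCR salt concentrations, which lowers the predicted duplex stability.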
2. Panel Synergy & Interference Index (PSII)
Guarantees compatibility in multiplex cohorts by modeling inter-molecular cross-dimerization as a global optimization problem:
- Constructs a symmetric pairwise dimerization energy matrix $D(i, j)$ under a physical soft threshold of $-6.0\text{ kcal/mol}$: $$D(i, j) = \max \left( 0, - \Delta G^\circ_{\text{cross}}(i, j) - 6.0\text{ kcal/mol} \right)$$
- Formulates a global Integer Linear Program (ILP), solved via PuLP and the COIN-OR CBC solver, that selects the panel $P$ maximizing total predicted success $S_{\text{ML}}$ minus a $\beta$-weighted dimerization penalty, while enforcing melting temperature ($T_m$) uniformity across every selected pair: $$\max_{P} \quad \sum_{i \in P} S_{\text{ML}}(i) - \beta \sum_{i, j \in P,\ i < j} D(i, j) \quad \text{s.t.} \quad |T_m(i) - T_m(j)| \le \Delta T_{m,\text{max}} \quad \forall\, i, j \in P$$
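To make the objective concrete, the sketch below evaluates it by exhaustive enumeration on a four-primer toy problem. It deliberately does not use PuLP or CBC: brute force is only practical for tiny candidate sets, but it optimizes the same score-minus-penalty objective under the same $T_m$ constraint. The function names, the fixed panel size, and all numeric values are assumptions made for illustration.

```python
from itertools import combinations


def dimer_penalty(dg_cross: float, threshold: float = -6.0) -> float:
    """D(i, j) = max(0, -dG_cross - 6.0): zero unless dG is below -6.0 kcal/mol."""
    return max(0.0, -dg_cross + threshold)


def best_panel(scores, tm, dg, size, beta=1.0, max_tm_spread=2.0):
    """Score every panel of `size` primers; return (objective, panel) or None."""
    best = None
    for panel in combinations(range(len(scores)), size):
        pairs = list(combinations(panel, 2))
        if any(abs(tm[i] - tm[j]) > max_tm_spread for i, j in pairs):
            continue  # violates the Tm-uniformity constraint
        objective = sum(scores[i] for i in panel) - beta * sum(
            dimer_penalty(dg[i][j]) for i, j in pairs
        )
        if best is None or objective > best[0]:
            best = (objective, panel)
    return best


# Toy inputs (made-up values): ML scores, Tm in deg C, cross-dimer dG in kcal/mol
scores = [0.95, 0.90, 0.85, 0.80]
tm = [60.0, 60.5, 59.5, 64.0]
dg = [
    [0.0, -9.0, -4.0, -3.0],
    [-9.0, 0.0, -5.0, -2.0],
    [-4.0, -5.0, 0.0, -3.5],
    [-3.0, -2.0, -3.5, 0.0],
]
objective, panel = best_panel(scores, tm, dg, size=2)
print(panel, round(objective, 2))  # primers 0 and 2 are selected
```

Primers 0 and 1 have the two highest scores, but their cross-dimer at -9.0 kcal/mol costs a penalty of 3.0, so the pair (0, 2) wins. In an actual ILP the product of two selection variables has to be linearized with an auxiliary pair variable, which is what makes the solver-based formulation scale beyond toy sizes.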
3. Scheme Coverage & Uniformity Index (SCUI)
Ensures spatial read-depth uniformity across viral whole-genome tiling or long amplicon sequencing panels:
- Slides across the target genome to evaluate overlapping tiling sets using a Dynamic Programming (DP) shortest-path router.
- Minimizes the spatial Coefficient of Variation ($CV_P$) of amplicon success probabilities, where $\mu_P$ and $\sigma_P$ are the mean and population standard deviation of $S_{\text{ML}}$ over the $N$ tiles, so that no PCR segment stalls ($S_{\text{ML}}(i) < 0.50$): $$CV_P = \frac{\sigma_P}{\mu_P} = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^N (S_{\text{ML}}(i) - \mu_P)^2}}{\mu_P}$$
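A simplified version of the routing idea is sketched below. It is not the package's router: as a stand-in objective it uses a shortest-path DP with edge cost $-\log S_{\text{ML}}$ (so the chosen chain maximizes the product of success probabilities), drops tiles scoring below 0.50, and then reports $CV_P$ for the resulting scheme. `route_tiling`, `coefficient_of_variation`, the interval convention, and the toy amplicons are all hypothetical.

```python
import math


def coefficient_of_variation(success_probs):
    """CV_P = sigma_P / mu_P with the population standard deviation (1/N)."""
    n = len(success_probs)
    mu = sum(success_probs) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in success_probs) / n)
    return sigma / mu


def route_tiling(amplicons, genome_len, min_overlap=0, min_score=0.50):
    """Pick a chain of overlapping tiles covering [0, genome_len), or None.

    amplicons: (start, end, S_ML) tuples describing half-open intervals.
    """
    tiles = sorted(a for a in amplicons if a[2] >= min_score)
    # cost[k]: cheapest chain that starts at position 0 and ends with tile k
    cost = [math.inf] * len(tiles)
    prev = [None] * len(tiles)
    for k, (start, end, score) in enumerate(tiles):
        step = -math.log(score)
        if start == 0:
            cost[k] = step
        for m in range(k):  # tiles are sorted, so predecessors come first
            m_end = tiles[m][1]
            overlaps = m_end - start >= min_overlap
            if overlaps and m_end < end and cost[m] + step < cost[k]:
                cost[k], prev[k] = cost[m] + step, m
    finals = [
        k for k, t in enumerate(tiles)
        if t[1] >= genome_len and cost[k] < math.inf
    ]
    if not finals:
        return None  # no gap-free route exists
    k = min(finals, key=lambda idx: cost[idx])
    route = []
    while k is not None:
        route.append(tiles[k])
        k = prev[k]
    return route[::-1]


# Toy candidates on a 1,000 bp template (made-up coordinates and scores)
candidates = [
    (0, 400, 0.95), (0, 380, 0.60), (350, 750, 0.90),
    (300, 700, 0.40), (700, 1000, 0.92), (720, 1000, 0.99),
]
route = route_tiling(candidates, genome_len=1000, min_overlap=20)
cv = coefficient_of_variation([score for _, _, score in route])
print(route, round(cv, 4))
```

A router that minimizes $CV_P$ directly would need a different cost structure, since the coefficient of variation is not additive over tiles; the log-probability cost used here is additive, which is what makes the simple DP valid.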
🛠️ System Architecture
                                     [ Target Sequence ]
                                              │
                                              ▼
                               ┌──────────────────────────────┐
                               │      Biophysics Engine       │
                               │    Nearest-Neighbor dG +     │
                               │  Nussinov MFE (O(N^3) Cap)   │
                               └──────────────┬───────────────┘
                                              │
                                              ▼
                               ┌──────────────────────────────┐
                               │      Specificity Engine      │
                               │     minimap2 alignment +     │
                               │     Taq 3' Variant Decay     │
                               └──────────────┬───────────────┘
                                              │
                                              ▼
                               ┌──────────────────────────────┐
                               │     Stacked ML Ensemble      │
                               │   GBDT×5 + Torch Deep MLP    │
                               │   Platt Calibration + SHAP   │
                               └──────┬───────────────┬───────┘
                                      │               │
               ┌──────────────────────┘               └──────────────────────┐
               ▼                                                             ▼
    ┌──────────────────────────────┐                             ┌──────────────────────────────┐
    │  Multiplex Optimizer (ILP)   │                             │ Tiled Amplicon Router        │
    │  Minimizes cross-dimers via  │                             │ Dynamic Programming shortest │
    │  global symmetric matrix     │                             │ path coverage optimizer      │
    └──────────┬───────────────────┘                             └───────────┬──────────────────┘
               │                                                             │
               └──────────────────────────────┬──────────────────────────────┘
                                              ▼
                               ┌──────────────────────────────┐
                               │    Lab Fine-Tuning (EWC)     │
                               │  Adapts to buffer & enzyme   │
                               │  via quadratic weight        │
                               │  constraint                  │
                               └──────────────┬───────────────┘
                                              ▼
                               [ Clinical Diagnostic Reports ]
                               [ (AVI, PSII, SCUI Verdicts)  ]
📦 Directory Structure
- primerforge/: Main package containing all biophysical and machine learning scoring algorithms.
  - biophysics.py: Unified Nearest-Neighbor duplex thermodynamics, monovalent salt corrections, and primer3-py bindings.
  - secondary_structure.py: Nussinov dynamic programming for unimolecular minimum-free-energy folding, with a capped loop.
  - specificity.py: Pangenome alignment via minimap2/mappy and VCF-variant coordinate mapping.
  - ml_scorer.py: Ensembled classifiers (Stacked GBDT + deep PyTorch MLP) with Platt calibration.
  - optimizer.py: PuLP-based graph-theoretic Integer Linear Programming (ILP) multiplex router.
  - continual_learner.py: Elastic Weight Consolidation (EWC) transfer learning regularizer.
- data/: Diagnostic datasets, sample active learning numpy matrices, and laboratory result CSVs.
- models/: Pre-trained neural networks and ensembled gradient boosters.
- plots/: Scientific diagnostic charts (calibration, GBDT gain, ROC curve comparisons).
- tests/: Standard unit and integration test suites.
- web_server.py: Streamlit dashboard implementation.
- fine_tune.py: EWC transfer learning pipeline CLI.
- make_publication_package.py: Archival packaging utility for Zenodo/submission bundles.
🚀 Installation & Setup
📦 Standard PyPI Installation
To install the latest release directly from PyPI:
pip install primerforge-py
🛠️ Developer / Standalone Installation
For local development, manual customizations, or running the Streamlit web dashboard, the following prerequisites are required:
- Python 3.11 or 3.12
- Poetry (for environment management and dependency locking)
# Clone the repository
git clone https://github.com/Rashidmstar12/PrimerForge.git
cd PrimerForge
# Install the dependencies including development and test modules
poetry install
# Validate CLI execution
poetry run primerforge --help
💻 CLI Usage & Quickstart
1. Standard Single-Locus Design
Generates high-viability primer pairs for a specific target sequence:
poetry run primerforge design \
--target "CACCATTGGCAATGAGCGGTTCCGCTGCCCTGAGGCACTCTTCCAGCCTTCCTTCCTGGGCATGGAGTCCT" \
--num-return 5
2. Pangenome & Variant-Aware Design
Filters primers against background genomes and variant populations to mitigate escape dropouts:
poetry run primerforge design \
--target "TARGET_SEQUENCE" \
--pangenome data/pangenome_variants.fasta \
--vcf data/population_variants.vcf \
--maf 0.01
3. Dimer-Free Multiplex Selection (ILP, up to 24-plex)
Assembles compatible cohorts utilizing graph-theoretic ILP optimization:
poetry run primerforge design \
--target "TARGET_SEQUENCE" \
--multiplex \
--num-return 12
4. Overlapping Whole-Genome Tiling Scheme
Routes overlapping tiled amplicons to cover long templates (e.g. viral genomes) with uniform read depth:
poetry run primerforge design \
--target "LONG_VIRAL_GENOME" \
--tiled \
--num-return 10
5. Lab-Adaptive EWC Fine-Tuning
Adapts the biophysical scoring ensemble to your laboratory's unique buffer, cycler block, or enzyme specifics:
# Provide a CSV with columns: forward_seq, reverse_seq, Ct (or success/efficiency)
poetry run python fine_tune.py \
--csv data/sample_lab_data.csv \
--out models/my_lab_model
# Predict future assays using your customized model
poetry run primerforge design \
--target "TARGET_SEQUENCE" \
--model-dir models/my_lab_model
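The regularizer behind this step can be illustrated with the EWC penalty from Kirkpatrick et al. (2017), $\frac{\lambda}{2}\sum_i F_i\,(\theta_i - \theta^*_i)^2$, which is added to the loss on the new lab data. The sketch below is a plain-Python illustration with hypothetical function names and made-up numbers; it does not reproduce the internals of fine_tune.py or continual_learner.py.

```python
def ewc_penalty(params, anchor_params, fisher, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def ewc_gradient(params, anchor_params, fisher, lam):
    """Gradient of the penalty: lam * F_i * (theta_i - theta*_i)."""
    return [lam * f * (p - a) for p, a, f in zip(params, anchor_params, fisher)]


# Hypothetical two-parameter model
anchor = [1.0, -2.0]    # theta*: weights of the pre-trained general model
fisher = [10.0, 0.1]    # diagonal Fisher information per weight
current = [1.5, -1.0]   # weights after a few fine-tuning steps on lab data

print(ewc_penalty(current, anchor, fisher, lam=2.0))
print(ewc_gradient(current, anchor, fisher, lam=2.0))
```

The first weight has moved only half as far as the second, yet it contributes almost all of the penalty because its Fisher value is 100 times larger. This is how EWC lets unimportant weights adapt to local chemistry while holding the weights that encode general biophysical knowledge close to their pre-trained values.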
💻 Programmatic Library Integration (API)
PrimerForge is built as a modular Python library. Under the hood it uses primer3 (via the primer3-py package) for biophysical candidate generation and wraps it with pangenome-aware specificity checking and stacked machine learning scoring.
You can import PrimerForge's core engines directly in your own tools:
from primerforge import BiophysicsEngine, MLScorer, MultiplexOptimizer
# 1. Initialize biophysics engine with standard salt concentrations
engine = BiophysicsEngine(opt_tm=60.0, salt_monovalent=50.0)
# 2. Design candidate primer pairs using wrapped primer3 bindings
candidates = engine.generate_candidates(
    target_sequence="CACCATTGGCAATGAGCGGTTCCGCTGCCCTGAGGCACTCTTCCAGCCTTCCTTCCTGGGCATGGAGTCCT",
    num_return=5
)
# 3. Load the stacked machine learning success predictor
scorer = MLScorer(model_path="models/primerforge_lightgbm.model")
# 4. Score pairs and run Integer Linear Programming (ILP) panel optimization
evaluated_pairs = []
for pair in candidates:
    prob = scorer.predict_success(pair)
    evaluated_pairs.append({
        "pair": pair,
        "predicted_success": prob,
        "is_valid": True,
        "off_targets": 0
    })
# 5. Math optimization: select compatible dimer-free multiplex panels
optimizer = MultiplexOptimizer(engine)
selected_panel, obj_val = optimizer.optimize_panel(
    evaluated_pairs, max_plex=3, delta_g_threshold=-4.5
)
print(f"Optimal panel assembled with {len(selected_panel)} compatible loci!")
🖥️ Interactive Web Server
Exposes the full molecular engineering platform as a high-contrast interactive dashboard. To start the dashboard locally:
poetry run streamlit run web_server.py
This launches the server on port 8504 (or on Streamlit's default address, http://localhost:8501).
Tab layout:
- 🎯 Single-Locus Design: Standard biophysical parsing, Platt sigmoid calibration curves, and game-theoretic SHAP explainability charts.
- 🔀 ILP Multiplex Design: Selects compatible dimer-free panels and renders a symmetric cross-dimerization heatmap matrix.
- 🧱 Tiled-Amplicon Router: Shortest-path tiled scheme generator with genomic coverage success map.
- 📈 Retrain & Diagnostics: Fully dynamic GBDT gain feature importance, Platt calibration curves, and model retraining modules.
- 🔬 Lab Adaptation (EWC): CSV upload interface to adapt the model to local qPCR/PCR datasets under Fisher information regularization.
🧪 Testing, Quality Control, & CI/CD
We enforce robust software engineering standards with a rigorous pipeline:
# Run the complete test suite (122 / 122 passes)
poetry run pytest tests/ --cov=primerforge -v
# Run type checker
poetry run mypy primerforge/
# Format code
poetry run black primerforge/ tests/
# Run linter
poetry run flake8 primerforge/ tests/
🤝 Authors & Contact
- Rashid Kadayil (ORCID: 0009-0009-6398-4557, Corresponding Author)
- Sivaranjani Chanemougame (ORCID: 0009-0005-2014-5439)
- Affiliation: Department of Biotechnology, Pondicherry University, Puducherry, India
- Correspondence: rashidmstar@gmail.com
📚 Citations & Academic References
If you utilize the PrimerForge platform or its biophysical methodologies in your research, please cite our preprint:
@article{kadayil2026primerforge,
title = {PrimerForge: An Adaptive, Pangenome-Aware Molecular Engineering Platform for Multiplex and Tiled PCR Assay Design},
author = {Kadayil, Rashid and Chanemougame, Sivaranjani},
journal = {bioRxiv},
year = {2026},
doi = {10.1101/2026.05.30.XXXXXX}
}
Key Biophysical Literature:
- Breslauer et al. (1986). Predicting DNA duplex stability from the base sequence. PNAS, 83(11), 3746-3750.
- SantaLucia (1998). A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. PNAS, 95(4), 1460-1465.
- Nussinov & Jacobson (1980). Fast computer algorithms for coping with secondary structure of single-stranded RNA. PNAS, 77(11), 6309-6313.
- Owczarzy et al. (2008). Predicting stability of DNA duplexes in solutions containing magnesium and monovalent cations. Biochemistry, 47(19), 5336-5353.
- Kirkpatrick et al. (2017). Overcoming catastrophic forgetting in neural networks. PNAS, 114(13), 3521-3526.
- Lundberg & Lee (2017). A unified approach to interpreting model predictions. NeurIPS, 30, 4765-4774.
- Untergasser et al. (2012). Primer3—new capabilities and interfaces. NAR, 40(15), e115.
📄 License
PrimerForge is open-source software distributed under the MIT License.