PaRMMoSaHN - Pangenome Reference-based Metabolic Modelling by Saving Homolog Networks
Project description
PaRMMoSaHN
Pangenome Reference-based Metabolic Modelling by Saving Homolog Networks
Overview
PaRMMoSaHN is a Python-based pipeline that bridges partitioned pangenome graphs with high-fidelity metabolic reconstruction.
Instead of building genome-scale metabolic models (GEMs) for hundreds of strains from scratch -- which is computationally expensive and produces inconsistently gap-filled models -- PaRMMoSaHN builds a single high-quality pan-model from the pangenome and rapidly derives strain-specific models by mapping individual strains to this reference via homology networks.
Why this approach?
Existing tools either reconstruct each strain independently (gapseq, CarveMe) or require a manually curated species-specific reference (Bactabolize, KpSC pan). PaRMMoSaHN is the first species-agnostic, automated pipeline that leverages PPanGGOLiN pangenome partitioning to build a shared metabolic reference, then projects it onto strains using DIAMOND sequence homology. This produces models that are:
- Consistent -- all strains are derived from the same metabolic reference, eliminating artificial variation from independent gap-filling
- Fast -- strain projection takes minutes instead of hours per genome
- Scalable --
.donecheckpointing and isolated evaluation processes handle hundreds of strains
Pipeline
Genomes (.gbff)
|
v
[Step 1] PPanGGOLiN pangenome construction
| -> soft-core protein families FASTA
v
[Step 2] gapseq metabolic pathway prediction on pangenome
| -> pan-model reaction table
v
[Step 3] Per-strain model derivation
| DIAMOND blastp -> filter pan-model -> gapseq draft + gapfill
| -> one SBML model per strain
v
[Step 4] Automated curation (memote evaluation + duplicate/imbalance fixes)
| Optional: pause for manual curation via Excel spreadsheet
v
[Step 5b] ModelPolisher (optional side-step, --polish; Docker/Podman/Apptainer)
|
[Step 5] Annotation enrichment (NCBI protein, BRENDA, RHEA)
|
v
[Step 6] Gather & convert (SBML, JSON, MATLAB)
|
v
[Step 7] Final memote FAIR-compliance evaluation
Output structure
output/
├── 01-pangenome/ # PPanGGOLiN pangenome + soft-core FASTA
│ └── pangenome_meta.json # n_genomes, soft-core threshold, cluster params
├── 02-panmodel/ # gapseq reaction/pathway tables
├── 03-strain_models/ # Per-strain DIAMOND matches, proteomes, draft models
├── 04-curated_models/ # Models after automated curation
│ └── curation_application_report.tsv # Per-row hit counts across strains (v0.2.1)
├── 05-annotated_models/ # SBML enriched with NCBI/BRENDA/RHEA annotations (Step 5)
├── 05b-polished_models/ # ModelPolisher output (optional side-step, --polish)
├── 06-final_models/ # Final models in XML, JSON, and MATLAB formats
├── 07-memote_reports/ # draft/ and final/ memote HTML+JSON reports
├── curation_template.xlsx # Memote-derived spreadsheet for manual curation
├── pipeline_summary.json # Run metadata, parameters, provenance, model statistics
├── run.log # Full INFO-level run log
└── errors.log # WARNING+ messages (only if errors occurred)
The pipeline_summary.json file includes a provenance block (SHA-256 of
the medium CSV and curation database, soft-core threshold actually used,
external tool versions) and a run_environment block (host CPU/RAM/OS), so
a reviewer can verify a run is reproducible without re-running anything.
Installation
PaRMMoSaHN orchestrates several bioinformatics tools that cannot be installed via pip alone. Use Conda/Mamba to set up the environment.
1. Create the Conda environment
# mamba is recommended for faster dependency resolution
mamba env create -f environment.yml
conda activate parmmosahn_env
2. Install PaRMMoSaHN
From PyPI (recommended):
pip install parmmosahn
This installs the Python orchestrator and its Python dependencies only. The
external tools (PPanGGOLiN, gapseq, DIAMOND) come from the Conda environment in
step 1 — verify them with parmmosahn doctor.
For development, from a clone:
pip install -e ".[dev]"
Or directly from GitHub:
pip install "git+https://github.com/robbedewin/PaRMMoSaHN.git"
3. Verify the installation
parmmosahn doctor
This checks that all required external tools (PPanGGOLiN, gapseq, DIAMOND)
and optional container engines (Docker, Podman, Apptainer) are available,
and reports host CPU/RAM with a recommended memote-worker count for the
evaluate step (helpful on memory-constrained hosts such as default WSL2
installations, where the default worker count can trigger
BrokenProcessPool errors).
Quick Start
Workflow A: Full automated pipeline
parmmosahn run \
-g /path/to/genomes/ \
-o ./results/ \
-l clostridiales \
-m medium.csv \
-t 14 --parallel-strains 2
Required inputs:
| Option | Description |
|---|---|
-g, --genomes |
Directory containing annotated genomes in GenBank format (.gbff) |
-o, --output |
Output base directory |
-l, --label |
Label for the pangenome (used in filenames) |
-m, --medium |
Growth medium CSV for gap-filling (gapseq format) |
Optional parameters:
| Option | Default | Description |
|---|---|---|
-t, --threads |
75% of CPUs | Total CPU threads |
--parallel-strains |
auto (threads // 4) |
Strains processed in parallel in Step 3; each worker gets threads / N CPU threads (peaks ~1.5 GB RAM each) |
--diamond-bits |
150 | DIAMOND blastp bitscore threshold |
--gapseq-bits |
150 | gapseq pathway search bitscore threshold |
--biomass |
pos |
Biomass reaction type (pos or neg, matches gapseq's gram-stain templates) |
--add-unique |
off | Include unique/cloud genes (singletons) in the pan-model |
--soft-core |
2/N | Override the soft-core frequency threshold (fraction in 0–1) |
--polish |
off | Enable ModelPolisher (requires Docker/Podman/Apptainer; see ModelPolisher below) |
-e, --engine |
docker |
Container engine when --polish is enabled |
-c, --curation-db |
none | Path to an existing curation spreadsheet (skips auto-template generation) |
--pause-for-curation |
off | Pause after Step 3 for manual curation; resume with parmmosahn project --resume |
Workflow B: Human-in-the-loop curation
For maximum model quality, pause the pipeline after draft model evaluation, manually review the curation spreadsheet, then resume:
# Step 1: Run pipeline and pause for manual curation
parmmosahn run \
-g ./genomes/ -o ./results/ -l my_species -m medium.csv \
--pause-for-curation
# -> Edit results/curation_template.xlsx in Excel
# Fill the 'duplicate_reactions' and 'curated_imbalances' sheets
# Step 2: Resume pipeline with your curation decisions
parmmosahn project --resume -o ./results/
The curation template has three sheets:
- duplicate_reactions -- pairs of duplicate reactions with decision options (keep 1, keep 2, drop both, keep both)
- curated_imbalances -- mass/charge-imbalanced reactions with a column for corrected formulas
- ignored_imbalances -- reactions to leave intact despite imbalance (with justification)
Workflow C: Rapid projection of new isolates
If you already have a pan-model and sequenced new strains, bypass the pangenome construction:
parmmosahn project \
-g ./new_strains/ \
-f ./results/01-pangenome/my_species.faa \
-r ./results/02-panmodel/my_species-all-Reactions.tbl \
-m medium.csv \
-o ./projection/
Analysis
Once you have strain-specific models, PaRMMoSaHN provides built-in analysis commands under parmmosahn analyze to explore metabolic diversity, validate predictions, and compare strains. These analyses are what make the models scientifically useful -- raw SBML files only become insights when you interrogate them.
Pan-reactome characterization
The pan-reactome is the union of all metabolic reactions across your strains, analogous to the pangenome but at the metabolic level. Characterizing it reveals which metabolic capabilities are universally conserved (core), which are shared by subsets of strains (accessory), and which are strain-specific (unique). This is the central scientific output of a pangenome-scale metabolic study.
parmmosahn analyze panreactome \
-M ./results/06-final_models/ \
-o ./analysis/panreactome.tsv \
--plot
Outputs:
panreactome.tsv-- per-reaction classification (core/accessory/unique) with strain presencepanreactome_summary.tsv-- pan-reactome size, core/accessory/unique counts, model size statisticspanreactome_jaccard.tsv-- pairwise Jaccard similarity matrix between strainspanreactome_accumulation.tsv-- reaction accumulation curve (pan-reactome growth with added genomes)panreactome_dendrogram.png-- hierarchical clustering of strains by metabolic similarity (with--plot)
The accumulation curve shows whether the pan-reactome is "open" (still growing) or "closed" (saturated), which has implications for how representative your strain collection is.
By default a reaction is classified core if it occurs in more than 99% of strains and unique (cloud) if it occurs in fewer than 5%; adjust these cutoffs with --core-threshold and --cloud-threshold.
Phenotype validation
Model predictions are only as trustworthy as their agreement with experimental data. The validate command compares in silico growth predictions against experimental growth phenotypes (e.g., Biolog plates, carbon source utilization assays) and computes standard classification metrics.
parmmosahn analyze validate \
-M ./results/06-final_models/ \
-p phenotypes.csv \
-o ./analysis/validation.tsv
The phenotype file can be in matrix format (rows = carbon sources, columns = strains, values = 1/0) or long format (columns: strain, carbon_source, growth). Carbon sources should be specified as exchange reaction IDs (e.g., EX_glc__D_e0).
Output metrics: accuracy, sensitivity, specificity, and Matthews Correlation Coefficient (MCC). Per-strain per-substrate results are written to the TSV for detailed inspection.
FBA summary
Run Flux Balance Analysis on all models to compare baseline growth rates and active exchange reactions:
parmmosahn analyze fba \
-M ./results/06-final_models/ \
-o ./analysis/fba_summary.tsv \
-m medium.csv
Auxotrophy screening
Systematically knock out each medium component to identify predicted auxotrophies -- strains that cannot grow without a specific nutrient. This reveals metabolic dependencies that may reflect ecological niche adaptation or gene loss events.
parmmosahn analyze auxotrophy \
-M ./results/06-final_models/ \
-o ./analysis/auxotrophy.tsv \
-m medium.csv
Gene essentiality prediction
Perform single-gene deletion FBA to predict which genes are essential for growth. This can be validated against experimental transposon library (Tn-seq) data and highlights potential drug targets in pathogens.
parmmosahn analyze essentiality \
-M ./results/06-final_models/ \
-o ./analysis/essentiality.tsv
Reaction heatmap
Build a binary presence/absence matrix of all reactions across strains, optionally with a clustered heatmap visualization:
parmmosahn analyze heatmap \
-M ./results/06-final_models/ \
-o ./analysis/heatmap.tsv \
--plot
Prune dry-run (diagnostic)
Read-only diagnostic — modifies no models. This reports what could be pruned; the destructive
prunestep it scaffolds is still on the roadmap.
Report dead-end metabolites (those with zero producers or zero consumers) and reactions whose participants are all dead-end, per model. Dead-end metabolites are one structural source of the cohort-wide blocked-reaction baseline, so this gives a quick, non-destructive estimate of how many reactions are unambiguously prunable before committing to a clean-up pass.
parmmosahn analyze prune-report \
-M ./results/06-final_models/ \
-o ./analysis/prune_report.tsv
Output: one row per model with dead-end metabolite counts (dead_end_metabolites, dead_end_fraction), unambiguously-prunable reaction counts (reactions_all_dead_end, prunable_fraction), plus an FBA growth_rate_baseline and fba_status so you can confirm the models still grow.
Modular usage
Each pipeline step can be run independently:
parmmosahn pan -g ./genomes/ -l my_label -t 8 # Steps 1-2 only
parmmosahn strains -g ./genomes/ -f soft.faa \
-r rxns.tbl -m medium.csv # Step 3 only
parmmosahn evaluate -M ./models/ -o ./reports/ # Memote evaluation
parmmosahn annotate -M ./models/ -o ./annotated/ # FAIR annotation
parmmosahn gather -M ./models/ -o ./final/ # Format conversion
parmmosahn curate -s model.xml -o out/ -d curation.xlsx # Apply curations
parmmosahn polish -M ./models/ -o ./polished/ # ModelPolisher
Run parmmosahn --help for a full list of commands and options.
Configuration
PaRMMoSaHN supports YAML configuration files as an alternative to CLI flags
for the run, pan, and project commands. Generate a template:
parmmosahn init-config -o my_config.yml
The generated template documents every config-loadable key. Then use it:
# Full pipeline from a config file (all required args may come from config)
parmmosahn run --config my_config.yml
# Or mix CLI overrides with config defaults
parmmosahn run --config my_config.yml -o ./results/ -t 24
CLI arguments always override config file values. All options can also be set
via environment variables prefixed with PARMMOSAHN_ (e.g.,
PARMMOSAHN_THREADS=32).
ModelPolisher
ModelPolisher (v2.1-beta) enriches SBML models with BiGG database annotations and standardised identifiers. It runs inside a container, so it requires Docker, Podman, or Apptainer on the host.
ModelPolisher is OFF by default, both because of the container dependency
and because the bundled beta version of ModelPolisher uses a fragile network
fetch at startup. Enable it explicitly with --polish, optionally picking an
engine with -e:
# Default container engine (docker)
parmmosahn run ... --polish
# Podman (rootless, HPC-friendly)
parmmosahn run ... --polish -e podman
# Apptainer/Singularity (HPC clusters)
parmmosahn run ... --polish -e apptainer
If no container engine is detected at runtime, Step 5b is silently skipped and the pipeline continues with the pre-polish models.
Note: ModelPolisher v2.1 (stable release) has a known bug with a broken URL pattern regex in the DataONE namespace and crashes at startup. PaRMMoSaHN bundles the v2.1-beta which works correctly. SBML headers are temporarily downgraded from L3V2 to L3V1 for beta compatibility.
Known Limitations
- Reaction pre-filtering does not evaluate GPR rules. If a reaction requires multiple gene subunits (AND rule), it may be included even if only one subunit has a homolog. The downstream
gapseq draftstep partially mitigates this. - Single medium for all strains. Gap-filling uses one medium specification. Strains from different niches may need different media. Use
parmmosahn projectto re-derive models with alternative media. - Generic biomass composition. Biomass reactions use gapseq's default gram-positive or gram-negative templates rather than species-specific composition. This is consistent with other automated tools (CarveMe, Bactabolize).
- Soft-core threshold (2/N). The default threshold excludes genes present in only one genome. For very small collections (N < 5), consider using
--add-uniqueor adjusting--soft-core.
Citation
If you use PaRMMoSaHN in your research, please cite:
@software{dewin2026parmmosahn,
author = {De Win, Robbe and De Vrieze, Lucas},
title = {PaRMMoSaHN: Pangenome Reference-based Metabolic Modelling by Saving Homolog Networks},
version = {0.3.0},
year = {2026},
url = {https://github.com/robbedewin/PaRMMoSaHN}
}
License
This project is licensed under the MIT License -- see the LICENSE file for details.
Acknowledgments
This work was developed as part of a Master's thesis project at KU Leuven in the laboratory of Prof. Masschelein, in collaboration with the VIB-KU Leuven Center for Microbiology.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file parmmosahn-0.3.0.tar.gz.
File metadata
- Download URL: parmmosahn-0.3.0.tar.gz
- Upload date:
- Size: 325.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
633c39b7b46995177875bae4a343ab8467f1f2470de70aea0dcdcf8ad485e29f
|
|
| MD5 |
a54fbcaa77a3e4aaa447726428a2383a
|
|
| BLAKE2b-256 |
543ae0043502c6da8c5fb9cbc55e7b4534dc78ea0042d4cdd7680e5943f85301
|
Provenance
The following attestation bundles were made for parmmosahn-0.3.0.tar.gz:
Publisher:
ci.yml on robbedewin/PaRMMoSaHN
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
parmmosahn-0.3.0.tar.gz -
Subject digest:
633c39b7b46995177875bae4a343ab8467f1f2470de70aea0dcdcf8ad485e29f - Sigstore transparency entry: 1634863523
- Sigstore integration time:
-
Permalink:
robbedewin/PaRMMoSaHN@254977922a34f1925f52ae99fc99de6b851fbf5e -
Branch / Tag:
refs/tags/v0.3.0 - Owner: https://github.com/robbedewin
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
ci.yml@254977922a34f1925f52ae99fc99de6b851fbf5e -
Trigger Event:
push
-
Statement type:
File details
Details for the file parmmosahn-0.3.0-py3-none-any.whl.
File metadata
- Download URL: parmmosahn-0.3.0-py3-none-any.whl
- Upload date:
- Size: 285.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4970c9f8c21e48954e528da270f03fa4b9e0f5f62a61dc1046873fc0f1ec6bf1
|
|
| MD5 |
09bd305e2503791e2a776acaa8c9fe2b
|
|
| BLAKE2b-256 |
f5fe9ace9481407e85811e0ccc1bab8a6eb60cfb7d68ddf272635db9225a39af
|
Provenance
The following attestation bundles were made for parmmosahn-0.3.0-py3-none-any.whl:
Publisher:
ci.yml on robbedewin/PaRMMoSaHN
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
parmmosahn-0.3.0-py3-none-any.whl -
Subject digest:
4970c9f8c21e48954e528da270f03fa4b9e0f5f62a61dc1046873fc0f1ec6bf1 - Sigstore transparency entry: 1634863611
- Sigstore integration time:
-
Permalink:
robbedewin/PaRMMoSaHN@254977922a34f1925f52ae99fc99de6b851fbf5e -
Branch / Tag:
refs/tags/v0.3.0 - Owner: https://github.com/robbedewin
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
ci.yml@254977922a34f1925f52ae99fc99de6b851fbf5e -
Trigger Event:
push
-
Statement type: