Snakemake scheduler plugin: cascade scheduler (MILP/HEFT/GNNRL) with self-calibrating runtimes, digital twin run history, and GNNRL workflow-specific training
Snakemake Scheduler Plugin: GrapheonRL
Cascade scheduler for heterogeneous HPC workflows. It reimplements and extends the design of milp_snakemake_scheduler on the Snakemake 9 plugin interface, adding HEFT ordering, a GNNRL policy, and self-calibrating runtimes. It does not import from milp_snakemake_scheduler but shares its config file format for cross-plugin compatibility.
What this plugin provides
Scheduling algorithms (cascade order)
MILP with node placement (primary for small subgraphs, <=30 jobs)
Assigns each job to a specific node using binary ILP variables. Formulation: x[j][n] is binary (job j on node n), start and end times are continuous, and the makespan is minimized. Feature compatibility is enforced: GPU jobs go only to GPU nodes, and per-node core and memory capacity is respected. Falls back to RCPSP ordering when nodes are not configured. Uses the PuLP/CBC solver. MILP is an exact solver: its result is accepted unconditionally when feasible, with no quality-gate comparison against HEFT. Benchmark verification: on the 10-job known-optimal test workflow, the plugin MILP equals the time-indexed MILP certified optimal (14s).
GNNRL (experimental, <=300 jobs)
3-layer message-passing network with symmetric neighbourhood aggregation and
LayerNorm. Scores task priorities from 12 task features (cores, memory,
duration, indegree, outdegree, level, upward rank, GPU flag, etc.) and 6 node
features (cores, memory, storage, bandwidth, speed, utilisation). Global
one-shot inference on the full DAG at startup; result cached for all scheduling
rounds (O(n log n) per round). GPU-aware node assignment during inference.
Quality gate: generic pretrained model uses tight threshold (×1.01),
workflow-trained model uses relaxed threshold (×1.05). Training: BC warm-start
from HEFT teacher, then PPO fine-tuning with reward
-(beta * makespan + alpha * resource_waste * makespan) (default: alpha=0,
beta=1). Note: PPO training does not pass graph edges to the update step (each
training step operates on node features only); the graph structure influences
inference but not the gradient updates. Pre-trained model shipped;
workflow-specific fine-tuning available via --scheduler-grapheonrl-train.
HEFT (fallback, any size)
HEFT-inspired critical-path ordering (Topcuoglu et al. 2002). Computes upward rank using calibrated per-rule durations (single-machine estimate) and schedules tasks in decreasing rank order. Node assignment uses greedy earliest-finish-time per compatible node. Used when MILP is not configured and GNNRL is not loaded or fails the quality gate.
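The upward-rank ordering used by the HEFT fallback can be illustrated with a minimal sketch. It assumes the single-machine estimate mentioned above, so communication costs are left out, and the task names, durations, and the `upward_ranks` helper are hypothetical rather than taken from the plugin's code.

```python
from functools import lru_cache


def upward_ranks(durations, successors):
    """rank(t) = duration(t) + max(rank(s) for each successor s of t)."""

    @lru_cache(maxsize=None)
    def rank(task):
        # Exit tasks have no successors, so the max defaults to 0.
        return durations[task] + max(
            (rank(s) for s in successors.get(task, ())), default=0.0
        )

    return {task: rank(task) for task in durations}


# Hypothetical diamond-shaped DAG: a -> {b, c} -> d
durations = {"a": 3.0, "b": 2.0, "c": 4.0, "d": 1.0}
successors = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}

ranks = upward_ranks(durations, successors)
# Schedule in decreasing rank order, as described for HEFT above.
order = sorted(ranks, key=ranks.get, reverse=True)
print(ranks)
print(order)
```

The recursion is fine for small graphs; for DAGs with very long chains an iterative pass in reverse topological order would avoid Python's recursion limit.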
Self-calibrating runtime estimation
On every execution the plugin measures actual per-rule wall-clock time from
Snakemake's benchmark: directive TSV files (most accurate) or from
SLURM sacct (for cluster runs), falling back to timing between scheduling
rounds. Results are stored in scheduler_config.yaml [rules] and loaded at
the next startup so HEFT's critical-path computation uses real measured
durations instead of the cores-as-proxy fallback.
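As a rough illustration of the benchmark-file path, the sketch below averages the wall-clock column of a Snakemake benchmark TSV. It assumes the usual layout with a header row and an `s` column holding seconds; averaging over repeats and the helper name `mean_wallclock_seconds` are assumptions, not the plugin's documented behaviour.

```python
import csv
import io
from statistics import mean


def mean_wallclock_seconds(tsv_text):
    """Average the 's' (wall-clock seconds) column of a benchmark TSV."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return mean(float(row["s"]) for row in rows)


# Hypothetical benchmark content with two repeats of the same rule.
sample = "s\th:m:s\tmax_rss\n12.5\t0:00:12\t210.4\n13.5\t0:00:13\t208.9\n"
print(mean_wallclock_seconds(sample))
```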
Node placement and SLURM steering
When system_profile.json is configured (same format as
milp_snakemake_scheduler), the MILP assigns each job to a specific node.
The plugin then attempts to set job.resources.slurm_partition to steer
the SLURM executor to the correct partition. Controllable via
scheduler_config.yaml [grapheonrl.node_assignment].
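The feature-compatibility rule stated for the MILP (GPU jobs only on GPU nodes, per-node core and memory capacity respected) can be pictured as a pre-filter that decides which x[j][n] variables are allowed to be 1. The following is a minimal sketch under that reading; the `Job` and `Node` classes and their field names are hypothetical and do not mirror the system_profile.json schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    cores: int
    mem_mb: int
    needs_gpu: bool = False


@dataclass(frozen=True)
class Node:
    name: str
    cores: int
    mem_mb: int
    has_gpu: bool = False


def compatible_nodes(job, nodes):
    """Return the names of nodes that satisfy the job's requirements."""
    return [
        n.name
        for n in nodes
        if n.cores >= job.cores
        and n.mem_mb >= job.mem_mb
        and (n.has_gpu or not job.needs_gpu)
    ]


# Hypothetical two-node cluster.
nodes = [Node("cpu01", 16, 64_000), Node("gpu01", 8, 32_000, has_gpu=True)]
print(compatible_nodes(Job("align", 12, 48_000), nodes))
print(compatible_nodes(Job("train", 4, 16_000, needs_gpu=True), nodes))
```

A job with an empty compatibility list cannot be placed at all, which is one situation where a scheduler would need to report infeasibility or fall back.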
Digital twin
On every execution, the plugin writes .snakemake/grapheonrl/dag_export.json
capturing the task graph, HEFT oracle schedule, per-rule calibrated duration
history, and a run log that accumulates across runs (run_history[],
rule_stats, best_run). Used for GNNRL warm-start training and offline
analysis. Use --scheduler-grapheonrl-export-dag PATH to write to a custom path
instead. Disable with --scheduler-grapheonrl-disable-auto-twin true.
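For offline analysis, the exported JSON can be inspected with a few lines of Python. This sketch only assumes that `run_history`, `rule_stats`, and `best_run` are top-level keys; the shape of their contents is not documented here, so the example counts entries without reading their fields. The `summarize_twin` helper is hypothetical.

```python
import json


def summarize_twin(twin):
    """Count entries under the key names listed in the README."""
    return {
        "runs_recorded": len(twin.get("run_history", [])),
        "rules_tracked": len(twin.get("rule_stats", {})),
        "has_best_run": twin.get("best_run") is not None,
    }


# Hypothetical stand-in for the contents of
# .snakemake/grapheonrl/dag_export.json (entry contents are left empty).
twin = json.loads(
    '{"run_history": [{}, {}], "rule_stats": {"align": {}}, "best_run": {}}'
)
print(summarize_twin(twin))
```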
Installation
# From GitHub (recommended)
pip install git+https://github.com/AasishKumarSharma/snakemake-scheduler-plugin-grapheonrl.git
# Or clone for development
git clone https://github.com/AasishKumarSharma/snakemake-scheduler-plugin-grapheonrl
cd snakemake-scheduler-plugin-grapheonrl
pip install -e .
Dependencies installed automatically: torch, numpy, pulp, pyyaml.
Quick start
# Default: cascade with auto-training. GNNRL trains on the workflow before
# the first scheduling round, then executes using the trained model.
# The digital twin is updated automatically after each run.
snakemake --cores 8 --scheduler grapheonrl
# Explicit training control: more iterations for stronger policy
snakemake --cores 8 --scheduler grapheonrl \
--scheduler-grapheonrl-train-iters 200
# Disable auto-training (use generic pretrained model + quality gate)
snakemake --cores 8 --scheduler grapheonrl \
--scheduler-grapheonrl-disable-auto-train true
# HEFT only (fastest, no GNNRL inference)
snakemake --cores 8 --scheduler grapheonrl --scheduler-grapheonrl-strategy heft
# Export digital twin to a custom path
snakemake --cores 8 --scheduler grapheonrl \
--scheduler-grapheonrl-export-dag dag_export/dag_export.json
# With heterogeneous cluster (system_profile.json)
snakemake --cores 8 --scheduler grapheonrl \
--scheduler-grapheonrl-node-config system_profile.json
Configuration files
Two configuration files are used. Each is searched for in this order: current working directory, Snakefile directory, ~/.snakemake/, then the packaged default.
Both use the same format as milp_snakemake_scheduler for cross-plugin compatibility.
scheduler_config.yaml - solver and calibration settings.
Copy the template from the repo root and edit as needed.
system_profile.json - cluster node definitions (clusters -> nodes ->
resources/features/properties). Copy from repo root and add your nodes.
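The lookup order described above can be sketched as a first-match search over four candidate directories. This is an illustration only: the `find_config` function and its parameters are hypothetical, and the plugin's actual resolution code may differ.

```python
import tempfile
from pathlib import Path


def find_config(filename, snakefile_dir, package_dir, cwd=None, home=None):
    """Return the first existing candidate in the documented order, or None."""
    candidates = [
        Path(cwd or Path.cwd()) / filename,
        Path(snakefile_dir) / filename,
        Path(home or Path.home()) / ".snakemake" / filename,
        Path(package_dir) / filename,
    ]
    for path in candidates:
        if path.is_file():
            return path
    return None


# Demonstration in a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("cwd", "workflow", "home/.snakemake", "pkg"):
        (root / name).mkdir(parents=True)
    # Provide the file in the Snakefile directory and as a package default.
    (root / "workflow" / "scheduler_config.yaml").write_text("grapheonrl: {}\n")
    (root / "pkg" / "scheduler_config.yaml").write_text("grapheonrl: {}\n")
    found = find_config(
        "scheduler_config.yaml",
        snakefile_dir=root / "workflow",
        package_dir=root / "pkg",
        cwd=root / "cwd",
        home=root / "home",
    )
    found_in = found.parent.name
    print(found_in)
```

Because the Snakefile directory is checked before the package default, the copy next to the workflow is the one picked up in this demonstration.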
Key scheduler_config.yaml sections (see template for all options):
grapheonrl:
strategy: cascade # milp | heft | gnnrl | cascade
milp_threshold: 30 # max remaining jobs for MILP
gnnrl_threshold: 300 # max remaining jobs for GNNRL
quality_gate: 1.01 # GNNRL only: accept if makespan <= HEFT * quality_gate
node_assignment:
enabled: true # write slurm_partition to job resources
mode: best_fit
history:
max_days: 90
max_entries: 100
calibration:
use_sacct: true
use_benchmark_files: true
training:
alpha: 0.0 # resource utilization penalty weight (0 = makespan only)
beta: 1.0 # makespan weight
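The two thresholds above determine which algorithms the cascade considers for a given number of remaining jobs. A minimal sketch of that selection follows; the function name and the `gnnrl_loaded` flag are hypothetical, and how the plugin moves from one candidate to the next (for example after a solver timeout or a rejected quality gate) is not modelled.

```python
def cascade_candidates(remaining_jobs, milp_threshold=30,
                       gnnrl_threshold=300, gnnrl_loaded=True):
    """List the algorithms to try, in cascade order, ending with HEFT."""
    order = []
    if remaining_jobs <= milp_threshold:
        order.append("milp")
    if gnnrl_loaded and remaining_jobs <= gnnrl_threshold:
        order.append("gnnrl")
    order.append("heft")  # fallback, applicable at any size
    return order


print(cascade_candidates(10))
print(cascade_candidates(120))
print(cascade_candidates(5000))
```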
Settings reference
All CLI flags: --scheduler-grapheonrl-<name>
| Flag | Default | Description |
|---|---|---|
| strategy | cascade | Algorithm: cascade, heft, gnnrl (experimental), milp, priority |
| disable-auto-train | false | Disable auto-training; use generic pretrained model |
| disable-auto-twin | false | Disable automatic digital twin updates |
| train | false | Force train GNNRL before executing (explicit override) |
| train-iters | 50 | PPO iterations (50=seconds, 200=minutes) |
| train-after | None | Auto-train after N runs in digital twin |
| export-dag | None | Path to write/update digital twin JSON (overrides auto path) |
| model-path | None | Explicit model path (auto-discovered if omitted) |
| gnnrl-threshold | 300 | Skip GNNRL above this job count |
| milp-threshold | 30 | Skip MILP above this job count |
| milp-timeout | 10.0 | MILP solver timeout in seconds |
| node-config | None | Path to system_profile.json |
Tests
bash tests/run_all_tests.sh --quick # 17 checks (~3 min, includes integration + optimality)
bash tests/run_all_tests.sh # 24 checks (~10 min)
python tests/test_gnnrl.py # 112 GNNRL + infrastructure checks
python tests/test_doc_claims.py # 97 doc-claim verification checks
python tests/test_integration.py # 58 end-to-end integration checks (~3 min)
Assessment of GNNRL (Experimental)
Workflow-trained GNNRL (default behavior)
By default, when no trained model exists for the current workflow, the plugin automatically trains GNNRL before the first scheduling round. The trained model learns the workflow's critical-path structure, resource requirements, and task ordering from BC warm-start (HEFT teacher) followed by PPO fine-tuning.
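The PPO reward used during fine-tuning, -(beta * makespan + alpha * resource_waste * makespan), is simple enough to write out directly. In the sketch below the function name is hypothetical, and treating `resource_waste` as a fraction between 0 and 1 is an assumption made for the example.

```python
def ppo_reward(makespan, resource_waste, alpha=0.0, beta=1.0):
    """Reward = -(beta * makespan + alpha * resource_waste * makespan)."""
    return -(beta * makespan + alpha * resource_waste * makespan)


# With the defaults (alpha=0, beta=1) only the makespan matters.
print(ppo_reward(120.0, 0.25))
# A non-zero alpha additionally penalizes wasted resources.
print(ppo_reward(120.0, 0.25, alpha=0.5))
```

Because the waste term is multiplied by the makespan, raising alpha scales the penalty with schedule length rather than adding a fixed cost.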
Benchmark results on rnc workflows. For rnc50 to rnc300 the values are gaps from the certified MILP optimal; the rnc5000 rows compare GNNRL against HEFT directly.
| Scale | HEFT | GNNRL (workflow-trained) |
|---|---|---|
| rnc50 | 1.2% gap | 0.3-0.5% gap |
| rnc100 | 1.0-1.5% gap | 0.3-0.6% gap |
| rnc300 | 0.9-2.1% gap | 0.4-0.8% gap |
| rnc5000 homo | baseline | +2.2% improvement over HEFT |
| rnc5000 hetero | 209,839 obj | Self-iter300: 208,969 obj (beats HEFT) |
Workflow-trained GNNRL is within 0.3-0.8% of certified optimal at small-to-medium scale and consistently outperforms HEFT at large scale. This is the intended operating mode.
Generic pretrained model (auto-training disabled)
When --scheduler-grapheonrl-disable-auto-train true is set, the shipped generic model
runs without workflow-specific training. The generic model was trained on 3000
synthetic DAGs and evaluated on 300 held-out DAGs:
| Metric | Value |
|---|---|
| Win rate vs HEFT | 44.3% |
| Average relative improvement | -3.6% |
| Cases >5% better than HEFT | 15% |
| Cases >5% worse than HEFT | 26.7% |
The generic model is unreliable on unseen workflows. The quality gate (×1.01 vs HEFT simulation) guards against bad generic-model decisions; HEFT runs as fallback when the gate rejects. Do not use the generic model as your primary scheduler.
Acceptance logic
MILP is always accepted when feasible. It is an exact solver - applying a quality gate would mean rejecting a provably optimal solution. Benchmark verification: on the 10-job known-optimal test workflow, the plugin MILP equals the time-indexed MILP certified optimal (14s).
GNNRL (workflow-trained) uses a relaxed quality gate (×1.05 vs HEFT simulation) because the simulation consistently underestimates the trained model's actual benefit.
GNNRL (generic pretrained) uses a tight quality gate (×1.01 vs HEFT simulation) because the generic model is unreliable on unseen workflows.
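The two gate factors can be summarized in a short sketch. It assumes the gate compares simulated makespans as in the config comment (accept if makespan <= HEFT * quality_gate); the function name and the example numbers are hypothetical.

```python
def accept_gnnrl(gnnrl_makespan, heft_makespan, workflow_trained):
    """Apply the relaxed (x1.05) or tight (x1.01) quality gate."""
    gate = 1.05 if workflow_trained else 1.01
    return gnnrl_makespan <= heft_makespan * gate


# A GNNRL schedule simulated at 3% worse than HEFT:
print(accept_gnnrl(103.0, 100.0, workflow_trained=True))   # passes the relaxed gate
print(accept_gnnrl(103.0, 100.0, workflow_trained=False))  # rejected by the tight gate
```

When the gate rejects, HEFT runs as the fallback, as described above.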
Related work
- milp_snakemake_scheduler: original MILP scheduler (old Snakemake API). GrapheonRL reimplements its design for the Snakemake 9 plugin interface.
Citation
If you use this plugin, please cite the papers it is based on:
@inproceedings{sharma2025grapheonrl,
title = {Grapheon {RL}: A Graph Neural Network and Reinforcement Learning
Framework for Constraint and Data-Aware Workflow Mapping and
Scheduling in Heterogeneous {HPC} Systems},
author = {Sharma, Aasish Kumar and Kunkel, Julian},
booktitle = {Proceedings of the 2025 IEEE 49th Annual Computers, Software,
and Applications Conference (COMPSAC)},
pages = {489--494},
year = {2025},
doi = {10.1109/COMPSAC65507.2025.00341}
}
@inproceedings{sharma2025workflow,
title = {Workflow-Driven Modeling for the Compute Continuum: An
Optimization Approach to Automated System and Workload Scheduling},
author = {Sharma, Aasish Kumar and Boehme, Christian and Gel{\ss}, Patrick
and Yahyapour, Ramin and Kunkel, Julian},
booktitle = {Proceedings of the 2025 IEEE 49th Annual Computers, Software,
and Applications Conference (COMPSAC)},
pages = {2170--2175},
year = {2025},
doi = {10.1109/COMPSAC65507.2025.00343}
}
Author
Aasish Kumar Sharma, Institute of Computer Science, GWDG, University of Gottingen
License
MIT License