Snakemake scheduler plugin: cascade scheduler (MILP/HEFT/GNNRL) with self-calibrating runtimes, digital twin run history, and GNNRL workflow-specific training

Snakemake Scheduler Plugin: GrapheonRL

Cascade scheduler for heterogeneous HPC workflows. It reimplements and extends the design of milp_snakemake_scheduler on the Snakemake 9 plugin interface, adding HEFT ordering, a GNNRL policy, and self-calibrating runtimes. It does not import from milp_snakemake_scheduler, but it shares that project's config file format for cross-plugin compatibility.

What this plugin provides

Scheduling algorithms (cascade order)

MILP with node placement (primary for small subgraphs, <=30 jobs)

Assigns each job to a specific node using binary ILP variables. Formulation: x[j][n] is binary (1 if job j runs on node n), start and end times are continuous, and the makespan is minimized. Feature compatibility is enforced: GPU jobs go only to GPU nodes, and per-node core and memory capacity is respected. Falls back to RCPSP ordering when no nodes are configured. Uses the PuLP/CBC solver. MILP is an exact solver, so its result is accepted unconditionally when feasible, with no quality-gate comparison against HEFT. Benchmark verification: on the 10-job known-optimal test workflow, the plugin MILP matches the certified optimum of the time-indexed MILP (14s).
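The feature-compatibility rule can be illustrated with a minimal sketch. It only shows which (job, node) pairs would be allowed to have x[j][n] = 1; it is not the plugin's MILP and does not model time or concurrent jobs on one node. The `Job` and `Node` classes and their field names are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical job description; field names are assumptions.
    name: str
    cores: int
    mem_mb: int
    needs_gpu: bool = False


@dataclass(frozen=True)
class Node:
    # Hypothetical node description; field names are assumptions.
    name: str
    cores: int
    mem_mb: int
    has_gpu: bool = False


def compatible_nodes(job: Job, nodes: list[Node]) -> list[str]:
    """Return the names of nodes on which the job could be placed.

    A node qualifies if it has enough cores and memory for the job alone
    and offers a GPU whenever the job requires one.
    """
    return [
        n.name
        for n in nodes
        if n.cores >= job.cores
        and n.mem_mb >= job.mem_mb
        and (n.has_gpu or not job.needs_gpu)
    ]


nodes = [Node("cpu01", 32, 128_000), Node("gpu01", 16, 64_000, has_gpu=True)]
align = Job("align", cores=24, mem_mb=48_000)
train = Job("train", cores=8, mem_mb=32_000, needs_gpu=True)

print(compatible_nodes(align, nodes))  # ['cpu01']
print(compatible_nodes(train, nodes))  # ['gpu01']
```

In a MILP, pairs outside this list would simply have their x[j][n] variable fixed to 0 or omitted.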

GNNRL (experimental, <=300 jobs)

A 3-layer message-passing network with symmetric neighbourhood aggregation and LayerNorm. It scores task priorities from 12 task features (cores, memory, duration, indegree, outdegree, level, upward rank, GPU flag, etc.) and 6 node features (cores, memory, storage, bandwidth, speed, utilisation). Inference runs once on the full DAG at startup, and the result is cached for all scheduling rounds (O(n log n) per round). Node assignment during inference is GPU-aware. Quality gate: the generic pretrained model uses a tight threshold (×1.01), and a workflow-trained model uses a relaxed threshold (×1.05). Training: BC warm-start from a HEFT teacher, then PPO fine-tuning with reward -(beta * makespan + alpha * resource_waste * makespan) (default: alpha=0, beta=1). Note: PPO training does not pass graph edges to the update step, so each training step operates on node features only; the graph structure influences inference but not the gradient updates. A pre-trained model is shipped, and workflow-specific fine-tuning is available via --scheduler-grapheonrl-train.
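The reward formula quoted above can be written out as a small function. This is a direct transcription of the expression with the stated defaults; the function name and the assumption that `resource_waste` is a fraction between 0 and 1 are illustrative, not taken from the plugin.

```python
def ppo_reward(
    makespan: float,
    resource_waste: float,
    alpha: float = 0.0,
    beta: float = 1.0,
) -> float:
    """Reward = -(beta * makespan + alpha * resource_waste * makespan).

    resource_waste is assumed here to be a fraction in [0, 1].
    """
    return -(beta * makespan + alpha * resource_waste * makespan)


# With the defaults (alpha=0, beta=1) the reward is the negative makespan.
print(ppo_reward(120.0, 0.25))             # -120.0
# A non-zero alpha adds a waste penalty that scales with the makespan.
print(ppo_reward(120.0, 0.25, alpha=0.5))  # -135.0
```

Because the waste term is multiplied by the makespan, both parts of the reward are in the same unit, and alpha acts as a dimensionless trade-off weight.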

HEFT (fallback, any size)

HEFT-inspired critical-path ordering (Topcuoglu et al. 2002). Computes the upward rank of each task from calibrated per-rule durations (single-machine estimate) and schedules tasks in decreasing rank order. Node assignment is greedy: each task goes to the compatible node with the earliest finish time. HEFT is used when MILP is not configured and GNNRL is either not loaded or fails the quality gate.
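As a minimal sketch of the upward-rank step, assuming zero communication cost between tasks and a single duration per task (the single-machine estimate mentioned above), the rank of a task is its duration plus the largest rank among its successors. The task names and durations below are made up for illustration.

```python
def upward_ranks(
    durations: dict[str, float],
    successors: dict[str, list[str]],
) -> dict[str, float]:
    """Compute rank_u(t) = duration(t) + max(rank_u(s) for s in successors(t)).

    Exit tasks (no successors) get their own duration as rank.
    Communication costs are assumed to be zero in this sketch.
    """
    ranks: dict[str, float] = {}

    def rank(task: str) -> float:
        if task not in ranks:
            ranks[task] = durations[task] + max(
                (rank(s) for s in successors.get(task, [])), default=0.0
            )
        return ranks[task]

    for task in durations:
        rank(task)
    return ranks


# Diamond-shaped DAG: a -> {b, c} -> d
durations = {"a": 3.0, "b": 2.0, "c": 4.0, "d": 1.0}
successors = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

ranks = upward_ranks(durations, successors)
order = sorted(ranks, key=ranks.get, reverse=True)

print(ranks)  # {'d': 1.0, 'b': 3.0, 'c': 5.0, 'a': 8.0}
print(order)  # ['a', 'c', 'b', 'd']
```

Sorting by decreasing rank always yields a valid topological order when durations are positive, since a task's rank strictly exceeds that of each successor. The recursive form is fine for small graphs; a very deep DAG would need an iterative traversal.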

Self-calibrating runtime estimation

On every execution the plugin measures actual per-rule wall-clock time from Snakemake's benchmark: directive TSV files (most accurate) or from SLURM sacct (for cluster runs), falling back to timing between scheduling rounds. Results are stored in scheduler_config.yaml [rules] and loaded at the next startup so HEFT's critical-path computation uses real measured durations instead of the cores-as-proxy fallback.
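For illustration, here is a minimal sketch of reading wall-clock times from a Snakemake benchmark TSV, whose `s` column holds the running time in seconds. Averaging the repeats is an assumption made for this example; the README does not state how the plugin aggregates measurements, and the sample data is invented.

```python
import csv
import io
from statistics import mean


def mean_wallclock_seconds(tsv_text: str) -> float:
    """Average the 's' (seconds) column of a Snakemake benchmark TSV."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return mean(float(row["s"]) for row in rows)


# Invented sample with two benchmark repeats (only some columns shown).
sample = (
    "s\th:m:s\tmax_rss\n"
    "12.5\t0:00:12\t101.2\n"
    "13.5\t0:00:13\t99.8\n"
)

print(mean_wallclock_seconds(sample))  # 13.0
```

For a real file, the text would come from something like `Path("benchmarks/align.tsv").read_text()`, where the path is whatever the rule's benchmark: directive specifies.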

Node placement and SLURM steering

When system_profile.json is configured (same format as milp_snakemake_scheduler), the MILP assigns each job to a specific node. The plugin then attempts to set job.resources.slurm_partition to steer the SLURM executor to the correct partition. Controllable via scheduler_config.yaml [grapheonrl.node_assignment].

Digital twin

On every execution, the plugin writes .snakemake/grapheonrl/dag_export.json capturing the task graph, HEFT oracle schedule, per-rule calibrated duration history, and a run log that accumulates across runs (run_history[], rule_stats, best_run). Used for GNNRL warm-start training and offline analysis. Use --scheduler-grapheonrl-export-dag PATH to write to a custom path instead. Disable with --scheduler-grapheonrl-disable-auto-twin true.

Installation

# From GitHub (recommended)
pip install git+https://github.com/AasishKumarSharma/snakemake-scheduler-plugin-grapheonrl.git

# Or clone for development
git clone https://github.com/AasishKumarSharma/snakemake-scheduler-plugin-grapheonrl
cd snakemake-scheduler-plugin-grapheonrl
pip install -e .

Dependencies installed automatically: torch, numpy, pulp, pyyaml.

Quick start

# Default: cascade with auto-training. GNNRL trains on the workflow before
# the first scheduling round, then executes using the trained model.
# The digital twin is updated automatically after each run.
snakemake --cores 8 --scheduler grapheonrl

# Explicit training control: more iterations for stronger policy
snakemake --cores 8 --scheduler grapheonrl \
    --scheduler-grapheonrl-train-iters 200

# Disable auto-training (use generic pretrained model + quality gate)
snakemake --cores 8 --scheduler grapheonrl \
    --scheduler-grapheonrl-disable-auto-train true

# HEFT only (fastest, no GNNRL inference)
snakemake --cores 8 --scheduler grapheonrl --scheduler-grapheonrl-strategy heft

# Export digital twin to a custom path
snakemake --cores 8 --scheduler grapheonrl \
    --scheduler-grapheonrl-export-dag dag_export/dag_export.json

# With heterogeneous cluster (system_profile.json)
snakemake --cores 8 --scheduler grapheonrl \
    --scheduler-grapheonrl-node-config system_profile.json

Configuration files

Each of the two configuration files is searched for in this order, and the first match is used: current working directory, Snakefile directory, ~/.snakemake/, package default. Both files use the same format as milp_snakemake_scheduler for cross-plugin compatibility.
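The search order can be sketched as a first-match lookup over candidate directories. The function name and parameters are hypothetical, and the plugin's real lookup may differ in details such as how the package default is located.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_config(
    name: str,
    cwd: Path,
    snakefile_dir: Path,
    package_default_dir: Path,
    home: Optional[Path] = None,
) -> Optional[Path]:
    """Return the first existing file: CWD > snakefile dir > ~/.snakemake/ > package default."""
    home = Path.home() if home is None else Path(home)
    candidates = [
        Path(cwd),
        Path(snakefile_dir),
        home / ".snakemake",
        Path(package_default_dir),
    ]
    for directory in candidates:
        path = directory / name
        if path.is_file():
            return path
    return None


# Demonstration in a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for sub in ("cwd", "workflow", "home/.snakemake", "pkg"):
        (root / sub).mkdir(parents=True)
    # The file exists in the Snakefile directory and as a package default.
    (root / "workflow" / "scheduler_config.yaml").write_text("grapheonrl: {}\n")
    (root / "pkg" / "scheduler_config.yaml").write_text("grapheonrl: {}\n")

    found = find_config(
        "scheduler_config.yaml",
        cwd=root / "cwd",
        snakefile_dir=root / "workflow",
        package_default_dir=root / "pkg",
        home=root / "home",
    )
    print(found.parent.name)  # workflow
```

The Snakefile directory wins here because the working directory has no copy of the file and the package default is only consulted last.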

scheduler_config.yaml - solver and calibration settings. Copy the template from the repo root and edit as needed.

system_profile.json - cluster node definitions (clusters -> nodes -> resources/features/properties). Copy from repo root and add your nodes.

Key scheduler_config.yaml sections (see template for all options):

grapheonrl:
  strategy: cascade          # cascade | milp | heft | gnnrl | priority
  milp_threshold: 30         # max remaining jobs for MILP
  gnnrl_threshold: 300       # max remaining jobs for GNNRL
  quality_gate: 1.01         # GNNRL only: accept if makespan <= HEFT * quality_gate
  node_assignment:
    enabled: true            # write slurm_partition to job resources
    mode: best_fit
  history:
    max_days: 90
    max_entries: 100
  calibration:
    use_sacct: true
    use_benchmark_files: true
  training:
    alpha: 0.0   # resource utilization penalty weight (0 = makespan only)
    beta:  1.0   # makespan weight
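To show how the two thresholds might interact under the cascade strategy, here is a simplified sketch. It assumes the cascade tries MILP first, then GNNRL, and falls through to HEFT; the function and its boolean inputs are hypothetical and stand in for the plugin's actual solver calls and gate checks.

```python
def choose_algorithm(
    n_jobs: int,
    milp_feasible: bool,
    gnnrl_loaded: bool,
    gnnrl_passes_gate: bool,
    milp_threshold: int = 30,
    gnnrl_threshold: int = 300,
) -> str:
    """Pick a scheduler for the remaining jobs (simplified cascade sketch)."""
    # MILP: small subgraphs only; accepted whenever a feasible solution exists.
    if n_jobs <= milp_threshold and milp_feasible:
        return "milp"
    # GNNRL: medium graphs, only if a model is loaded and the gate accepts it.
    if n_jobs <= gnnrl_threshold and gnnrl_loaded and gnnrl_passes_gate:
        return "gnnrl"
    # HEFT: fallback for any size.
    return "heft"


print(choose_algorithm(12, True, True, True))      # milp
print(choose_algorithm(120, False, True, True))    # gnnrl
print(choose_algorithm(120, False, True, False))   # heft
print(choose_algorithm(5000, False, True, True))   # heft
```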

Settings reference

All settings are available as CLI flags of the form --scheduler-grapheonrl-<name>, where <name> is the flag name in the table below.

Flag	Default	Description
`strategy`	`cascade`	Algorithm: `cascade`, `heft`, `gnnrl` (experimental), `milp`, `priority`
`disable-auto-train`	`false`	Disable auto-training; use generic pretrained model
`disable-auto-twin`	`false`	Disable automatic digital twin updates
`train`	`false`	Force train GNNRL before executing (explicit override)
`train-iters`	`50`	PPO training iterations (50 takes seconds, 200 takes minutes)
`train-after`	None	Auto-train after N runs in digital twin
`export-dag`	None	Path to write/update digital twin JSON (overrides auto path)
`model-path`	None	Explicit model path (auto-discovered if omitted)
`gnnrl-threshold`	`300`	Skip GNNRL above this job count
`milp-threshold`	`30`	Skip MILP above this job count
`milp-timeout`	`10.0`	MILP solver timeout in seconds
`node-config`	None	Path to system_profile.json

Tests

bash tests/run_all_tests.sh --quick   # 17 checks (~3 min, includes integration + optimality)
bash tests/run_all_tests.sh           # 24 checks (~10 min)
python tests/test_gnnrl.py            # 112 GNNRL + infrastructure checks
python tests/test_doc_claims.py       # 97 doc-claim verification checks
python tests/test_integration.py      # 58 end-to-end integration checks (~3 min)

Assessment of GNNRL (Experimental)

Workflow-trained GNNRL (default behavior)

By default, when no trained model exists for the current workflow, the plugin automatically trains GNNRL before the first scheduling round. The trained model learns the workflow's critical-path structure, resource requirements, and task ordering from BC warm-start (HEFT teacher) followed by PPO fine-tuning.

Benchmark results on rnc workflows (gap from certified MILP optimal):

Scale	HEFT gap	GNNRL (workflow-trained) gap
rnc50	1.2%	0.3-0.5%
rnc100	1.0-1.5%	0.3-0.6%
rnc300	0.9-2.1%	0.4-0.8%
rnc5000 (homogeneous)	baseline	2.2% improvement over HEFT
rnc5000 (heterogeneous)	209,839 (objective value)	Self-iter300: 208,969 (objective value; beats HEFT)

Workflow-trained GNNRL is within 0.3-0.8% of certified optimal at small-to-medium scale and consistently outperforms HEFT at large scale. This is the intended operating mode.
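For the heterogeneous rnc5000 row, the table gives absolute objective values rather than a percentage. As a worked example, and assuming a lower objective is better (consistent with the "beats HEFT" remark), the relative gain can be computed as follows. The function name is illustrative.

```python
def relative_gain_percent(baseline: float, candidate: float) -> float:
    """Percentage by which candidate improves on baseline (lower is better)."""
    return 100.0 * (baseline - candidate) / baseline


# Objective values from the rnc5000 heterogeneous row: HEFT vs. GNNRL.
gain = relative_gain_percent(209_839, 208_969)
print(f"{gain:.2f}%")  # 0.41%
```

This puts the heterogeneous result at roughly 0.4% better than HEFT.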

Generic pretrained model (auto-training disabled)

When --scheduler-grapheonrl-disable-auto-train true is set, the shipped generic model runs without workflow-specific training. The generic model was trained on 3000 synthetic DAGs and evaluated on 300 held-out DAGs:

Metric	Value
Win rate vs HEFT	44.3%
Average relative improvement	-3.6% (worse than HEFT on average)
Cases >5% better than HEFT	15%
Cases >5% worse than HEFT	26.7%

The generic model is unreliable on unseen workflows. The quality gate (×1.01 vs HEFT simulation) guards against bad generic-model decisions; HEFT runs as fallback when the gate rejects. Do not use the generic model as your primary scheduler.

Acceptance logic

MILP is always accepted when feasible. It is an exact solver - applying a quality gate would mean rejecting a provably optimal solution. Benchmark verification: on the 10-job known-optimal test workflow, the plugin MILP equals the time-indexed MILP certified optimal (14s).

GNNRL (workflow-trained) uses a relaxed quality gate (×1.05 vs HEFT simulation) because the simulation consistently underestimates the trained model's actual benefit.

GNNRL (generic pretrained) uses a tight quality gate (×1.01 vs HEFT simulation) because the generic model is unreliable on unseen workflows.
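The two gate thresholds can be expressed as a single comparison against the simulated HEFT makespan. This is a minimal sketch of the rule as described; the function and constant names are hypothetical.

```python
GATE_WORKFLOW_TRAINED = 1.05  # relaxed gate for workflow-trained models
GATE_GENERIC = 1.01           # tight gate for the generic pretrained model


def passes_quality_gate(
    gnnrl_makespan: float,
    heft_makespan: float,
    workflow_trained: bool,
) -> bool:
    """Accept the GNNRL schedule if its makespan <= HEFT makespan * gate."""
    gate = GATE_WORKFLOW_TRAINED if workflow_trained else GATE_GENERIC
    return gnnrl_makespan <= heft_makespan * gate


# A schedule 3% slower than HEFT in simulation:
print(passes_quality_gate(103.0, 100.0, workflow_trained=True))   # True
print(passes_quality_gate(103.0, 100.0, workflow_trained=False))  # False
```

The same simulated result is accepted for a workflow-trained model and rejected for the generic one, in which case HEFT runs as the fallback.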

Related work

milp_snakemake_scheduler: original MILP scheduler (old Snakemake API). GrapheonRL reimplements its design for the Snakemake 9 plugin interface.

Citation

If you use this plugin, please cite the papers it is based on:

@inproceedings{sharma2025grapheonrl,
  title     = {Grapheon {RL}: A Graph Neural Network and Reinforcement Learning
               Framework for Constraint and Data-Aware Workflow Mapping and
               Scheduling in Heterogeneous {HPC} Systems},
  author    = {Sharma, Aasish Kumar and Kunkel, Julian},
  booktitle = {Proceedings of the 2025 IEEE 49th Annual Computers, Software,
               and Applications Conference (COMPSAC)},
  pages     = {489--494},
  year      = {2025},
  doi       = {10.1109/COMPSAC65507.2025.00341}
}

@inproceedings{sharma2025workflow,
  title     = {Workflow-Driven Modeling for the Compute Continuum: An
               Optimization Approach to Automated System and Workload Scheduling},
  author    = {Sharma, Aasish Kumar and Boehme, Christian and Gel{\ss}, Patrick
               and Yahyapour, Ramin and Kunkel, Julian},
  booktitle = {Proceedings of the 2025 IEEE 49th Annual Computers, Software,
               and Applications Conference (COMPSAC)},
  pages     = {2170--2175},
  year      = {2025},
  doi       = {10.1109/COMPSAC65507.2025.00343}
}

Author

Aasish Kumar Sharma, Institute of Computer Science, GWDG, University of Göttingen

License

MIT License

Project details

Version 0.6.0, released May 20, 2026.
