Snakemake scheduler plugin: cascade scheduler (MILP/HEFT/GNNRL) with self-calibrating runtimes, digital twin run history, and GNNRL workflow-specific training

Snakemake Scheduler Plugin: GrapheonRL

Cascade scheduler for heterogeneous HPC workflows. Reimplements and extends the design of milp_snakemake_scheduler using the Snakemake 9 plugin interface, with added HEFT ordering, GNNRL policy, and self-calibrating runtimes. Does not import from milp_snakemake_scheduler; shares its config file format for cross-plugin compatibility.

What this plugin provides

Scheduling algorithms (cascade order)

MILP with node placement (primary for small subgraphs, <=30 jobs)

Assigns each job to a specific node using binary ILP variables. Formulation: x[j][n] is binary (job j on node n), start and end times are continuous, and the makespan is minimized. Feature compatibility is enforced: GPU jobs go only to GPU nodes, and core and memory capacity is respected per node. Falls back to RCPSP ordering when no nodes are configured. Uses the PuLP/CBC solver.

MILP is an exact solver: its schedule is accepted unconditionally when feasible, with no quality-gate comparison against HEFT. Benchmark verification: on the 10-job known-optimal test workflow, the plugin MILP equals the time-indexed MILP certified optimal (14s).

GNNRL (experimental, <=300 jobs)

A 3-layer message-passing network with symmetric neighbourhood aggregation and LayerNorm. It scores task priorities from 12 task features (cores, memory, duration, indegree, outdegree, level, upward rank, GPU flag, etc.) and 6 node features (cores, memory, storage, bandwidth, speed, utilisation). Inference runs once over the full DAG at startup, and the result is cached for all scheduling rounds (O(n log n) per round). Node assignment during inference is GPU-aware.

Quality gate: the generic pretrained model uses a tight threshold (×1.01); a workflow-trained model uses a relaxed threshold (×1.05).

Training: behaviour-cloning (BC) warm-start from a HEFT teacher, then PPO fine-tuning with reward -(beta * makespan + alpha * resource_waste * makespan) (default: alpha=0, beta=1). Note that PPO training does not pass graph edges to the update step: each training step operates on node features only, so the graph structure influences inference but not the gradient updates. A pre-trained model is shipped; workflow-specific fine-tuning is available via --scheduler-grapheonrl-train.

HEFT (fallback, any size)

HEFT-inspired critical-path ordering (Topcuoglu et al. 2002). Computes each task's upward rank from calibrated per-rule durations (single-machine estimate) and schedules tasks in decreasing rank order. Node assignment is greedy: each task goes to the compatible node with the earliest finish time. Used when MILP is not configured and GNNRL is not loaded or fails the quality gate.
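
As an illustration of the upward-rank ordering used by the HEFT stage, here is a minimal sketch. It assumes a single-machine duration per task and ignores communication costs; the function names (upward_ranks, heft_order) and the example task names are hypothetical and are not the plugin's API.

```python
from functools import lru_cache


def upward_ranks(durations, successors):
    """Upward rank of a task = its own duration + the longest downstream path."""

    @lru_cache(maxsize=None)
    def rank(task):
        # Exit tasks have no successors, so their tail contributes 0.
        tail = max((rank(s) for s in successors.get(task, ())), default=0.0)
        return durations[task] + tail

    return {task: rank(task) for task in durations}


def heft_order(durations, successors):
    """Tasks sorted by decreasing upward rank (ties broken by name)."""
    ranks = upward_ranks(durations, successors)
    return sorted(durations, key=lambda t: (-ranks[t], t))


# Hypothetical 4-task workflow: align -> {sort, qc} -> report
durations = {"align": 4.0, "sort": 2.0, "qc": 1.0, "report": 3.0}
successors = {"align": ["sort", "qc"], "sort": ["report"], "qc": ["report"]}

print(upward_ranks(durations, successors))
print(heft_order(durations, successors))
```

With strictly positive durations, a decreasing-rank order is always a valid topological order, which is why the ranking can double as a scheduling priority. The recursive form is fine for small DAGs; a very deep graph would need an iterative traversal instead.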

Self-calibrating runtime estimation

On every execution the plugin measures actual per-rule wall-clock time from Snakemake's benchmark: directive TSV files (most accurate) or from SLURM sacct (for cluster runs), falling back to timing between scheduling rounds. Results are stored in scheduler_config.yaml [rules] and loaded at the next startup so HEFT's critical-path computation uses real measured durations instead of the cores-as-proxy fallback.
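
To make the benchmark-file path concrete, the sketch below reads the wall-clock column `s` that Snakemake writes to benchmark TSV files and averages it over repeats. How the plugin itself aggregates repeats is not stated here, so the averaging and the function name mean_wall_seconds are assumptions for illustration only.

```python
import csv
import io
from statistics import mean


def mean_wall_seconds(tsv_text):
    """Average the `s` (wall-clock seconds) column of one benchmark TSV.

    Returns None when the file contains no data rows.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    values = [float(row["s"]) for row in reader if row.get("s")]
    return mean(values) if values else None


# Made-up sample with two repeats; real files carry more columns.
sample = "s\th:m:s\tmax_rss\n12.5\t0:00:12\t101.2\n13.5\t0:00:13\t99.8\n"
print(mean_wall_seconds(sample))
```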

Node placement and SLURM steering

When system_profile.json is configured (same format as milp_snakemake_scheduler), the MILP assigns each job to a specific node. The plugin then attempts to set job.resources.slurm_partition to steer the SLURM executor to the correct partition. Controllable via scheduler_config.yaml [grapheonrl.node_assignment].
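
One way to write down the node-placement MILP is sketched below. Only the binary variable x[j][n], the continuous start and end times, and the makespan objective come from the description in this README; the symbols J (jobs), N (nodes), E (DAG edges), d_j (calibrated duration of job j), s_j, e_j and C_max are notation introduced for this sketch. Per-node core and memory capacity constraints are left out because their exact form depends on how overlapping jobs are modelled, which is not specified here.

```latex
\begin{align}
\min\ & C_{\max} \\
\text{s.t.}\ & \sum_{n \in N} x_{jn} = 1 && \forall j \in J \\
& x_{jn} = 0 && \text{if node } n \text{ lacks a feature required by job } j \\
& e_j = s_j + d_j && \forall j \in J \\
& s_j \ge e_i && \forall (i, j) \in E \\
& C_{\max} \ge e_j && \forall j \in J \\
& x_{jn} \in \{0, 1\}, \quad s_j \ge 0, \quad e_j \ge 0 && \forall j \in J,\ n \in N
\end{align}
```

The first constraint places every job on exactly one node, the second encodes feature compatibility (for example, GPU jobs only on GPU nodes), and the precedence constraint keeps a job from starting before its predecessors finish.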

Digital twin

On every execution, the plugin writes .snakemake/grapheonrl/dag_export.json capturing the task graph, HEFT oracle schedule, per-rule calibrated duration history, and a run log that accumulates across runs (run_history[], rule_stats, best_run). Used for GNNRL warm-start training and offline analysis. Use --scheduler-grapheonrl-export-dag PATH to write to a custom path instead. Disable with --scheduler-grapheonrl-disable-auto-twin true.

Installation

# From GitHub (recommended)
pip install git+https://github.com/AasishKumarSharma/snakemake-scheduler-plugin-grapheonrl.git

# Or clone for development
git clone https://github.com/AasishKumarSharma/snakemake-scheduler-plugin-grapheonrl
cd snakemake-scheduler-plugin-grapheonrl
pip install -e .

Dependencies installed automatically: torch, numpy, pulp, pyyaml.

Quick start

# Default: cascade with auto-training. GNNRL trains on the workflow before
# the first scheduling round, then executes using the trained model.
# The digital twin is updated automatically after each run.
snakemake --cores 8 --scheduler grapheonrl

# Explicit training control: more iterations for stronger policy
snakemake --cores 8 --scheduler grapheonrl \
    --scheduler-grapheonrl-train-iters 200

# Disable auto-training (use generic pretrained model + quality gate)
snakemake --cores 8 --scheduler grapheonrl \
    --scheduler-grapheonrl-disable-auto-train true

# HEFT only (fastest, no GNNRL inference)
snakemake --cores 8 --scheduler grapheonrl --scheduler-grapheonrl-strategy heft

# Export digital twin to a custom path
snakemake --cores 8 --scheduler grapheonrl \
    --scheduler-grapheonrl-export-dag dag_export/dag_export.json

# With heterogeneous cluster (system_profile.json)
snakemake --cores 8 --scheduler grapheonrl \
    --scheduler-grapheonrl-node-config system_profile.json

Configuration files

The plugin reads two configuration files. Each is looked up in this order: current working directory, Snakefile directory, ~/.snakemake/, then the package default. Both use the same format as milp_snakemake_scheduler for cross-plugin compatibility.
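
The lookup order can be sketched as a first-match search over candidate directories. This is an illustration only: find_config and its parameters are hypothetical names, and the plugin's actual resolution code may differ.

```python
from pathlib import Path


def find_config(name, snakefile_dir, package_default_dir, cwd=None, home=None):
    """Return the first existing copy of `name` in the documented order, or None."""
    cwd = Path(cwd) if cwd else Path.cwd()
    home = Path(home) if home else Path.home()
    candidates = [
        cwd / name,                        # 1. current working directory
        Path(snakefile_dir) / name,        # 2. directory of the Snakefile
        home / ".snakemake" / name,        # 3. per-user directory
        Path(package_default_dir) / name,  # 4. default shipped with the package
    ]
    return next((path for path in candidates if path.is_file()), None)
```

For example, find_config("scheduler_config.yaml", "workflow/", "pkg_defaults/") returns the copy in the working directory if one exists there, and otherwise moves down the list.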

scheduler_config.yaml - solver and calibration settings. Copy the template from the repo root and edit as needed.

system_profile.json - cluster node definitions (clusters -> nodes -> resources/features/properties). Copy from repo root and add your nodes.

Key scheduler_config.yaml sections (see template for all options):

grapheonrl:
  strategy: cascade          # milp | heft | gnnrl | cascade
  milp_threshold: 30         # max remaining jobs for MILP
  gnnrl_threshold: 300       # max remaining jobs for GNNRL
  quality_gate: 1.01         # GNNRL only: accept if makespan <= HEFT * quality_gate
  node_assignment:
    enabled: true            # write slurm_partition to job resources
    mode: best_fit
  history:
    max_days: 90
    max_entries: 100
  calibration:
    use_sacct: true
    use_benchmark_files: true
  training:
    alpha: 0.0   # resource utilization penalty weight (0 = makespan only)
    beta:  1.0   # makespan weight
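
The training weights alpha and beta enter the PPO reward -(beta * makespan + alpha * resource_waste * makespan). A small sketch of that formula follows; the function name ppo_reward is hypothetical, and treating resource_waste as a fraction between 0 and 1 is an assumption, since its scale is not specified here.

```python
def ppo_reward(makespan, resource_waste, alpha=0.0, beta=1.0):
    """Reward = -(beta * makespan + alpha * resource_waste * makespan)."""
    return -(beta * makespan + alpha * resource_waste * makespan)


# Defaults (alpha=0, beta=1): the reward is simply the negative makespan.
print(ppo_reward(120.0, 0.25))             # -120.0
# A non-zero alpha adds a waste penalty that scales with the makespan.
print(ppo_reward(120.0, 0.25, alpha=0.5))  # -135.0
```

Because the waste term is multiplied by the makespan, both terms share the same unit, so alpha acts as a relative weight rather than an absolute one.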

Settings reference

All CLI flags: --scheduler-grapheonrl-<name>

Flag                Default  Description
strategy            cascade  Algorithm: cascade, heft, gnnrl (experimental), milp, priority
disable-auto-train  false    Disable auto-training; use generic pretrained model
disable-auto-twin   false    Disable automatic digital twin updates
train               false    Force train GNNRL before executing (explicit override)
train-iters         50       PPO iterations (50=seconds, 200=minutes)
train-after         None     Auto-train after N runs in digital twin
export-dag          None     Path to write/update digital twin JSON (overrides auto path)
model-path          None     Explicit model path (auto-discovered if omitted)
gnnrl-threshold     300      Skip GNNRL above this job count
milp-threshold      30       Skip MILP above this job count
milp-timeout        10.0     MILP solver timeout in seconds
node-config         None     Path to system_profile.json

Tests

bash tests/run_all_tests.sh --quick   # 17 checks (~3 min, includes integration + optimality)
bash tests/run_all_tests.sh           # 24 checks (~10 min)
python tests/test_gnnrl.py            # 112 GNNRL + infrastructure checks
python tests/test_doc_claims.py       # 97 doc-claim verification checks
python tests/test_integration.py      # 58 end-to-end integration checks (~3 min)

Assessment of GNNRL (Experimental)

Workflow-trained GNNRL (default behavior)

By default, when no trained model exists for the current workflow, the plugin automatically trains GNNRL before the first scheduling round. The trained model learns the workflow's critical-path structure, resource requirements, and task ordering from BC warm-start (HEFT teacher) followed by PPO fine-tuning.

Benchmark results on rnc workflows (gap from certified MILP optimal):

Scale           HEFT gap     GNNRL (workflow-trained) gap
rnc50           1.2%         0.3-0.5%
rnc100          1.0-1.5%     0.3-0.6%
rnc300          0.9-2.1%     0.4-0.8%
rnc5000 homo    baseline     +2.2% improvement over HEFT
rnc5000 hetero  209,839 obj  Self-iter300: 208,969 obj (beats HEFT)

Workflow-trained GNNRL is within 0.3-0.8% of certified optimal at small-to-medium scale and consistently outperforms HEFT at large scale. This is the intended operating mode.
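
For the rnc5000 hetero row, the two objective values can be turned into a relative improvement with a short calculation. The sketch assumes a lower objective is better, which the "beats HEFT" note implies; relative_improvement is a hypothetical helper name.

```python
def relative_improvement(baseline, candidate):
    """Fraction by which `candidate` is lower than `baseline`."""
    return (baseline - candidate) / baseline


heft_obj = 209_839   # HEFT objective, rnc5000 hetero
gnnrl_obj = 208_969  # workflow-trained GNNRL objective (Self-iter300)

print(f"{relative_improvement(heft_obj, gnnrl_obj):.2%}")  # 0.41%
```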

Generic pretrained model (auto-training disabled)

When --scheduler-grapheonrl-disable-auto-train true is set, the shipped generic model runs without workflow-specific training. The generic model was trained on 3000 synthetic DAGs and evaluated on 300 held-out DAGs:

Metric                        Value
Win rate vs HEFT              44.3%
Average relative improvement  -3.6%
Cases >5% better than HEFT    15%
Cases >5% worse than HEFT     26.7%

The generic model is unreliable on unseen workflows. The quality gate (×1.01 vs HEFT simulation) guards against bad generic-model decisions; HEFT runs as fallback when the gate rejects. Do not use the generic model as your primary scheduler.

Acceptance logic

MILP is always accepted when feasible. It is an exact solver - applying a quality gate would mean rejecting a provably optimal solution. Benchmark verification: on the 10-job known-optimal test workflow, the plugin MILP equals the time-indexed MILP certified optimal (14s).

GNNRL (workflow-trained) uses a relaxed quality gate (×1.05 vs HEFT simulation) because the simulation consistently underestimates the trained model's actual benefit.

GNNRL (generic pretrained) uses a tight quality gate (×1.01 vs HEFT simulation) because the generic model is unreliable on unseen workflows.
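
Taken together, the acceptance rules can be sketched as a single decision function. This is a simplified illustration built from the thresholds and gate values documented in this README; choose_schedule and its parameters are hypothetical, and the plugin's real control flow (for example, how a forced strategy interacts with the thresholds) may differ.

```python
def choose_schedule(n_jobs, milp_makespan, gnnrl_makespan, heft_makespan,
                    workflow_trained, milp_threshold=30, gnnrl_threshold=300):
    """Name the stage whose schedule the cascade would accept.

    Pass None as a makespan when that stage is unavailable or infeasible.
    """
    # MILP: accepted unconditionally when it is in range and feasible.
    if n_jobs <= milp_threshold and milp_makespan is not None:
        return "milp"
    # GNNRL: must pass the quality gate against the HEFT simulation.
    gate = 1.05 if workflow_trained else 1.01
    if (n_jobs <= gnnrl_threshold and gnnrl_makespan is not None
            and gnnrl_makespan <= heft_makespan * gate):
        return "gnnrl"
    # HEFT: fallback for every other case.
    return "heft"


print(choose_schedule(10, 14.0, None, 15.0, workflow_trained=False))
print(choose_schedule(100, None, 103.0, 100.0, workflow_trained=True))
print(choose_schedule(100, None, 103.0, 100.0, workflow_trained=False))
```

The second and third calls differ only in the gate: a GNNRL schedule 3% longer than the HEFT simulation passes the relaxed ×1.05 gate but fails the tight ×1.01 one.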

Related work

  • milp_snakemake_scheduler: original MILP scheduler (old Snakemake API). GrapheonRL reimplements its design for the Snakemake 9 plugin interface.

Citation

If you use this plugin, please cite the papers it is based on:

@inproceedings{sharma2025grapheonrl,
  title     = {Grapheon {RL}: A Graph Neural Network and Reinforcement Learning
               Framework for Constraint and Data-Aware Workflow Mapping and
               Scheduling in Heterogeneous {HPC} Systems},
  author    = {Sharma, Aasish Kumar and Kunkel, Julian},
  booktitle = {Proceedings of the 2025 IEEE 49th Annual Computers, Software,
               and Applications Conference (COMPSAC)},
  pages     = {489--494},
  year      = {2025},
  doi       = {10.1109/COMPSAC65507.2025.00341}
}

@inproceedings{sharma2025workflow,
  title     = {Workflow-Driven Modeling for the Compute Continuum: An
               Optimization Approach to Automated System and Workload Scheduling},
  author    = {Sharma, Aasish Kumar and Boehme, Christian and Gel{\ss}, Patrick
               and Yahyapour, Ramin and Kunkel, Julian},
  booktitle = {Proceedings of the 2025 IEEE 49th Annual Computers, Software,
               and Applications Conference (COMPSAC)},
  pages     = {2170--2175},
  year      = {2025},
  doi       = {10.1109/COMPSAC65507.2025.00343}
}

Author

Aasish Kumar Sharma, Institute of Computer Science, GWDG, University of Göttingen

License

MIT License
