Experiment orchestration toolkit for Slurm-based training and evaluation workflows.

slurmforge

TL;DR

Define experiments in YAML → generate reproducible Slurm jobs.

sforge init
sforge validate
sforge generate
sbatch runs/.../sbatch/*.sh

slurmforge is a Slurm-native experiment orchestration toolkit designed for large-scale training workflows.

It helps you:

  • expand experiment sweeps from a single config
  • generate reproducible Slurm batch jobs
  • manage training + evaluation pipelines with minimal boilerplate

It takes one experiment config, expands a sweep, resolves train and eval commands, groups runs by final Slurm resource shape, and materializes the batch records and sbatch files needed for execution.
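
For orientation, a minimal config has roughly this shape (an abridged sketch assembled from the field references later in this README, not the complete schema):

model:
  name: "my_model"
  script: "train.py"

run:
  args:
    lr: 0.001

cluster:
  partition: "your_partition"
  gpus_per_node: 1

output:
  base_output_dir: "./runs"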

Why slurmforge?

Compared to ad-hoc bash scripts or manual sbatch workflows:

  • structured experiment definition (YAML instead of shell glue)
  • deterministic sweep expansion
  • built-in retry and replay support
  • explicit separation of planning vs execution

Unlike general-purpose orchestration tools, slurmforge is designed specifically for Slurm environments.

Architecture (High-Level)

flowchart TD
    A[Config YAML] --> B[Planning - Build Run Graph]
    B --> C[Materialization - Generate sbatch]
    C --> D[Execution - Runtime helpers]
    D --> E[Slurm Cluster]
    E --> F[Results / Logs]

This separation ensures reproducibility and easier debugging.

Who This Is For

The sections that follow are intended for users who are new to slurmforge.

You do not need to understand the internal planner or executor model to start. The intended workflow is:

  1. keep your real training code in your own project directory
  2. generate a starter project with sforge init
  3. either edit the generated starter scripts or point the config at your existing train and eval entrypoints
  4. run sforge validate
  5. run sforge generate
  6. submit the generated sbatch files

Install

The primary distribution path is a source-checkout install from GitHub.

Source installation is recommended to ensure compatibility with your local environment and Slurm setup.

Use any Python 3.10 or newer interpreter.

Create the virtual environment outside the source checkout so the package directory stays clean. Source-checkout installs intentionally use the active environment's local build toolchain instead of an isolated build environment.

git clone <repo-url>
cd slurmforge
python -m venv ../slurmforge_venv
source ../slurmforge_venv/bin/activate
python -m pip install --no-build-isolation .

Main CLI:

sforge --help

Most users only need sforge. The low-level runtime helpers are invoked automatically by generated batch scripts.

Quick Start

The recommended newcomer path is init.

Create a starter project scaffold (interactive wizard):

sforge init

Or specify the type and output directory directly:

sforge init script --out ./demo_project
cd ./demo_project

Validate the config first:

sforge validate --config ./experiment.yaml

Preview the generated batch:

sforge generate --config ./experiment.yaml --dry_run

Generate the batch files:

sforge generate --config ./experiment.yaml

Generated batches persist the slurmforge version that planned them. After upgrading slurmforge, older batches may still execute or replay with compatibility warnings instead of a hard stop. For new submissions after an upgrade, regenerate the batch so planning and execution use the same installed version.

Then submit the generated Slurm scripts under:

runs/<project>/<experiment>/batch_<name>/sbatch/

Connect A Starter To Your Code

There are two normal ways to adapt a starter project:

  1. Replace the generated train.py, eval.py, or train_adapter.py bodies with your real logic.
  2. Keep the generated experiment.yaml, but change model.script and eval.script to point at entrypoint scripts that already exist in your project.

model.script should point to the script that launches training, not to a module that only defines layers or model classes.

Typical direct-entrypoint edit:

model:
  name: "my_model"
  script: "train.py"

eval:
  enabled: true
  script: "eval.py"

Typical existing-project edit:

model:
  name: "my_model"
  script: "src/train_my_model.py"

eval:
  enabled: true
  script: "tools/run_eval.py"

If you use model_cli, make sure the script named by model.script accepts the arguments declared under run.args.
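
For example, given run.args like the following (hypothetical argument names), the script named by model.script needs to accept the corresponding command-line flags, here something like --lr and --batch_size, assuming the usual key-to-flag translation:

model:
  name: "my_model"
  script: "train.py"

run:
  args:
    lr: 0.001
    batch_size: 64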

Starter Modes

Use init when you want a starter project scaffold.

init takes two orthogonal choices: type (how your training code is invoked) and profile (cluster complexity).

sforge init                          # interactive wizard
sforge init script                   # script type, starter profile (default)
sforge init script   --profile hpc   # script type, hpc profile
sforge init command
sforge init command  --profile hpc
sforge init registry
sforge init registry --profile hpc
sforge init adapter
sforge init adapter  --profile hpc

Types:

  • script — train.py-style script; slurmforge manages args and submission
  • command — wraps a complete shell command in Slurm
  • registry — uses a shared team model registry
  • adapter — interface bridge script (advanced)

Profiles:

  • starter — single GPU, minimal config; runnable immediately after filling in 4 fields
  • hpc — multi-GPU, sweep, eval, artifact sync; includes placeholders for cluster account, environment activation, and data paths that you replace before execution

Typical generated files:

  • experiment.yaml
  • README.md
  • runs/
  • type-specific files such as train.py, eval.py, train_adapter.py, or models.yaml

Run sforge init --help to see the full usage.

Raw YAML References

Use examples when you want to inspect or export the raw YAML reference files.

List available examples:

sforge examples list

Show one example:

sforge examples show script_hpc

Export one example:

sforge examples export script_hpc --out ./experiment.yaml

examples is the raw YAML layer. init is the recommended starter-project layer built around those YAML definitions.

Runtime Internals

sforge-run-plan-executor, sforge-artifact-sync, sforge-write-train-outputs, and sforge-write-attempt-result are low-level runtime helpers.

Most users do not call them directly. Generated batch scripts and debugging workflows use them to execute one run record, resolve train outputs for eval handoff, collect artifacts into the result directory, and persist structured attempt_result.json metadata after train/eval finishes.
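
For orientation, the body of a generated sbatch script has roughly this shape (an illustrative sketch, not the exact generated contents):

#!/bin/bash
#SBATCH --array=0-7
#SBATCH --partition=your_partition
# 1. bootstrap env.modules / env.conda_activate / env.venv_activate
# 2. invoke sforge-run-plan-executor for this array task's run record
# 3. sforge-write-train-outputs, sforge-artifact-sync, and
#    sforge-write-attempt-result then handle eval handoff, artifact
#    collection, and attempt_result.json persistence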

Core Commands

Validate a config without generating a batch:

sforge validate --config /path/to/experiment.yaml

Generate a batch:

sforge generate --config /path/to/experiment.yaml

Preview without writing files:

sforge generate --config /path/to/experiment.yaml --dry_run

Override config values from the CLI:

sforge generate \
  --config /path/to/experiment.yaml \
  --set run.args.lr=0.003 \
  --set cluster.mem=80G

Retry failed runs from an existing batch:

sforge rerun --from /path/to/batch_root

Replay a specific persisted run:

sforge replay --from-run /path/to/batch_root/runs/run_001_abcd1234

Replay directly from a snapshot file:

sforge replay --from-snapshot /path/to/run_snapshot.json

Replay selected runs from a batch:

sforge replay --from-batch /path/to/batch_root --run_id r1 --run_id r2

Replay selected runs by both id and index:

sforge replay --from-batch /path/to/batch_root --run_id r1 --run_index 1

replay --from-batch replays every run by default. Repeat --run_id or --run_index to narrow the selection. If you pass both flags, slurmforge uses intersection semantics: a run must match the selected ids and the selected indices.
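
For example (hypothetical ids and indices):

sforge replay --from-batch /path/to/batch_root \
  --run_id r1 --run_id r2 \
  --run_index 0
# replays only the run that is both in {r1, r2} and at index 0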

When retries find a checkpoint under the previous run's result directory, the rebuilt run will:

  • export AI_INFRA_RESUME_FROM_CHECKPOINT
  • pass --resume_from_checkpoint ... for structured modes that slurmforge controls

Checkpoint resume selection is deterministic, not heuristic:

  • if job-*/meta/checkpoint_state.json exists, rerun uses it as the authoritative latest checkpoint pointer
  • otherwise slurmforge scans discovered checkpoint files and selects the highest parseable step number from the filename
  • if multiple checkpoint candidates exist and none expose a parseable step number, rerun fails instead of guessing from file modification time

In practice, that means your training outputs should do one of these:

  • update job-*/meta/checkpoint_state.json whenever a new latest checkpoint is committed
  • or name checkpoint files with a stable step number such as global_step_1200, step1200, checkpoint-1200, or ckpt_1200
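
For example, a result directory laid out like this (hypothetical paths) satisfies the second option; rerun would select global_step_1200.pt as the latest checkpoint:

checkpoints/global_step_800.pt
checkpoints/global_step_1200.pt   # highest parseable step wins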

Use replay when you want an exact user-directed replay source. Use rerun when you want status-based retry selection plus automatic checkpoint resume injection.

Inspect run status:

sforge status --from /path/to/batch_root

If squeue / sacct are available on the machine where you run status, slurmforge will use Slurm-native job states to distinguish pending, running, and terminal scheduler states before falling back to local logs.

Path Rules

  • --config is required for validate and generate
  • relative paths inside the config resolve against project_root
  • by default, project_root is the directory that contains the config file
  • --project_root lets you override that explicitly
  • validate and generate use the same --set and --project_root semantics
  • replay and rerun restore the original planning root from persisted run metadata; if the project moved, pass --project_root

Fields typically resolved relative to project_root:

  • model_registry.registry_file
  • model.script
  • model.yaml
  • launcher.workdir
  • run.workdir
  • run.adapter.script
  • eval.script
  • eval.workdir
  • output.base_output_dir
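
For example (hypothetical paths), with --config /home/me/proj/experiment.yaml and no --project_root override:

model:
  script: "src/train.py"       # resolves to /home/me/proj/src/train.py

eval:
  script: "tools/run_eval.py"  # resolves to /home/me/proj/tools/run_eval.py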

Choosing A Train Mode

The package supports three internal train modes, each corresponding to an init type:

  • command (sforge init command): run an existing command exactly as provided; slurmforge does not rewrite it into torchrun or infer a distributed launcher topology from it
  • model_cli (sforge init script or sforge init registry): build the train command from model and run.args
  • adapter (sforge init adapter): call a bridge script that translates slurmforge inputs to some external system

Recommended order for new users:

  1. sforge init command if you only want to wrap an existing command quickly
  2. sforge init script as the default structured path
  3. sforge init registry when a team wants a shared model catalog
  4. sforge init adapter only for advanced or non-standard integrations

If you use model.script directly, the default assumption is ddp_supported: true. Set model.ddp_supported: false explicitly for single-process-only scripts.
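
For example:

model:
  name: "my_model"
  script: "train.py"
  ddp_supported: false   # forces single mode regardless of GPU count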

Use command mode only when your command text already expresses the launcher you want. If you need slurmforge to manage torchrun, GPU process counts, or multi-node Slurm launch details, use script or adapter init types.

Advanced Configuration

Hyperparameter Sweep

sweep expands the full grid (Cartesian product) of all declared axes. Each combination becomes one independent Slurm task.

Flat grid (shared_axes only):

sweep:
  enabled: true
  max_runs: 20            # optional cap on total runs
  shared_axes:
    run.args.lr:          [1e-4, 5e-5, 1e-5]
    run.args.batch_size:  [64, 128]

Named cases — each case can have its own fixed values (set) and additional axes:

sweep:
  enabled: true
  shared_axes:
    run.args.lr: [1e-4, 5e-5]
  cases:
    - name: "case_1"
      set:
        run.args.optimizer: "adam"
    - name: "case_2"
      set:
        run.args.optimizer: "sgd"
      axes:
        run.args.epochsize: [10, 20, 40]

Each case is multiplied with shared_axes independently, so the total number of runs equals len(shared_axes_product) × sum(len(case_product) for each case).
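
Applied to the example above:

# shared_axes product: lr in {1e-4, 5e-5}        → 2
# case_1 axes product: (no extra axes)           → 1
# case_2 axes product: epochsize in {10, 20, 40} → 3
# total runs = 2 × (1 + 3) = 8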

max_runs truncates the final expansion deterministically if set.

Dot-path keys in shared_axes, set, and axes must not overlap, either within a single case or between shared_axes and a case.


Inline Evaluation

eval runs inside the same Slurm job immediately after training completes.

eval:
  enabled: true
  script: "eval.py"
  workdir: "."
  launch_mode: "inherit"   # auto / ddp / single / inherit (inherit = use same launcher as train)
  pass_run_args: true       # pass run.args to eval script as --run_args_json
  run_args_flag: "run_args_json"
  pass_model_overrides: false
  model_overrides_flag: "model_overrides_json"
  args:                     # extra eval-only args
    test_split: 0.02
  launcher:
    distributed:
      master_port: 29900    # separate port to avoid conflict with train launcher
      extra_torchrun_args: []
  train_outputs:
    checkpoint_policy: "latest"   # latest / best / explicit
    # explicit_checkpoint: "checkpoints/step_5000.pt"  # only when policy=explicit

eval.command can be used instead of eval.script for an arbitrary shell command. When using eval.command, eval.external_runtime is required and eval.args/pass_run_args/pass_model_overrides are not available.
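
A sketch of the eval.command form (assuming eval.external_runtime mirrors the external_runtime mapping shown under command mode below):

eval:
  enabled: true
  command: "bash tools/run_eval.sh --ckpt latest"
  external_runtime:
    nnodes: 1
    nproc_per_node: 1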


Email Notifications

notify:
  enabled: true
  email: "you@example.com"
  when: "afterany"    # after / afterany / afterok / afternotok

when uses Slurm dependency vocabulary: afterany sends on any completion, afterok only on success, afternotok only on failure.


Automatic GPU Allocation

When resources.auto_gpu: true, slurmforge estimates the GPU count per job from model memory heuristics and sets cluster.gpus_per_node automatically.

resources:
  auto_gpu: true
  gpu_estimator: "heuristic"
  target_mem_per_gpu_gb: 80    # target memory per GPU in GB
  safety_factor: 1.15          # multiply estimated memory by this factor (>= 1.0)
  min_gpus_per_job: 1
  max_gpus_per_job: 8
  max_available_gpus: 8

cluster:
  gpus_per_node: "auto"        # set to "auto" to let resources block drive this
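
As a rough illustration (the exact estimator is internal; this assumes it behaves like estimated memory × safety factor divided by the per-GPU target, rounded up and clamped):

# estimated model memory 180 GB:
#   180 × 1.15 = 207 GB
#   ceil(207 / 80) = 3 GPUs per job, clamped to [min_gpus_per_job, max_gpus_per_job]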

Distributed Launcher

Full torchrun-based distributed config:

launcher:
  mode: "auto"          # auto selects ddp when ddp_supported=true and gpus_per_node > 1
  python_bin: "python3"
  workdir: "."
  distributed:
    nnodes: 1
    nproc_per_node: "auto"      # int or "auto" (matches gpus_per_node)
    master_port: 29500
    port_offset: "auto"         # int or "auto" (avoids port collisions across array tasks)
    extra_torchrun_args:
      - "--rdzv_backend=c10d"
      - "--max_restarts=2"

Set model.ddp_supported: false to force single mode regardless of GPU count. Set model.ddp_required: true to fail fast if DDP cannot be selected.


Cluster Configuration

cluster:
  partition: "your_partition"
  account: "my_account"
  qos: "high_priority"         # optional QoS override
  time_limit: "04:00:00"       # or "2-00:00:00" for 2 days
  nodes: 1
  gpus_per_node: 4
  cpus_per_task: 8
  mem: "64G"                   # "0" = unlimited
  constraint: "a100|h100"      # optional node constraint
  extra_sbatch_args:            # raw #SBATCH directives
    - "--exclude=node001,node002"
    - "--reservation=my_reservation"

Cross-Batch Slurm Dependencies

output.dependencies injects --dependency flags into every generated array job, so you can chain batches without editing the generated sbatch files by hand.

output:
  base_output_dir: "./runs"
  batch_name: "finetune_v2"
  dependencies:
    afterok:
      - "4512345"    # Slurm job IDs from a previous batch
      - "4512346"
    afterany:
      - "4512347"

Supported dependency types: after, afterany, afterok, afternotok.
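
To collect the job IDs of a previous batch, standard Slurm tooling is enough; sbatch --parsable prints the submitted job ID (paths here are hypothetical):

sbatch --parsable runs/myproj/exp/batch_pretrain/sbatch/array_job.sh
# → 4512345; list it under output.dependencies in the next batch's config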


Artifact Collection

slurmforge collects artifacts from the run working directory into the result directory after each job.

artifacts:
  checkpoint_globs:
    - "checkpoints/**/*.pt"
    - "checkpoints/**/*.ckpt"
  eval_csv_globs:
    - "eval_csv/**/*.csv"
  eval_image_globs:
    - "eval_images/**/*.png"
    - "eval_images/**/*.pdf"
  extra_globs:
    - "logs/**/*.log"

Validation Policies

Control how slurmforge handles various warnings and errors:

validation:
  cli_args: "warn"          # warn / error / ignore — unknown CLI args in run.args
  topology_errors: "error"  # error / warn / off    — DDP topology mismatches
  resource_warnings: "warn" # warn / error / off    — GPU/memory estimation warnings
  runtime_preflight: "error"  # error / warn / off    — script existence checks

Command Mode with External Runtime

Use command mode to wrap an arbitrary shell command. external_runtime declares the topology slurmforge uses when injecting the command into a Slurm array.

run:
  command: "bash scripts/train.sh --config cfg.yaml"
  command_mode: "argv"      # argv (shell-escaped) / raw (shell expansion enabled)
  external_runtime:
    nnodes: 1
    nproc_per_node: 4

command_mode: raw passes the command string to bash without escaping — useful for pipes and redirects, but disables slurmforge's argument safety checks.
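
For example, a piped command that needs command_mode: raw (illustrative):

run:
  command: "python train.py 2>&1 | tee train.log"
  command_mode: "raw"
  external_runtime:
    nnodes: 1
    nproc_per_node: 1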


Adapter Mode

adapter mode calls a bridge script that translates slurmforge's structured inputs to an external training system.

run:
  adapter:
    script: "train_adapter.py"
    pass_run_args: true
    run_args_flag: "run_args_json"
    pass_model_overrides: true
    model_overrides_flag: "model_overrides_json"
    ddp_supported: false
    ddp_required: false
  args:
    lr: 0.004

launcher:
  mode: "auto"

The adapter script receives run.args as a JSON blob via --run_args_json and run.model_overrides via --model_overrides_json.
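
Concretely, the adapter invocation resembles the following (illustrative values; exact quoting and argument order may differ):

python train_adapter.py \
  --run_args_json '{"lr": 0.004}' \
  --model_overrides_json '{}'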

Notes

  • batch materialization is always array-based in the current contract; output.dispatch_mode has been removed
  • output.dependencies can add external Slurm dependencies such as afterok or afterany to every generated array job when you need cross-batch sequencing
  • notify.when uses the same Slurm dependency vocabulary as batch submission dependencies
  • eval currently runs inline inside the same generated job as train; output.dependencies is a batch-level Slurm dependency feature, not a per-run train→eval stage DAG
  • eval.train_outputs controls how slurmforge selects the checkpoint handed off from train to eval; it must be a mapping, e.g. {checkpoint_policy: latest}; supported policies are latest, best, and explicit
  • sweep is always matrix expansion; valid top-level keys are enabled, max_runs, shared_axes, and cases; there is no sweep.method or sweep.params key
  • your train and eval scripts must exist on a Slurm-visible filesystem
  • generated array jobs bootstrap env.modules, env.conda_activate, and env.venv_activate before invoking sforge-run-plan-executor; that activated runtime environment must expose sforge-run-plan-executor, sforge-artifact-sync, sforge-write-train-outputs, and sforge-write-attempt-result on compute nodes
  • generate persists run metadata so rerun can replay without package-local path guesses
  • eval artifact fallback scans both train and eval workdirs

Maintenance Policy

This project is currently maintained on a best-effort basis.
Responses to issues and pull requests may be delayed.

Pull requests are welcome for:

  • bug fixes
  • documentation improvements

New features may not be accepted unless aligned with the project scope.

Development

python -m pip install --no-build-isolation '.[dev]'
pytest -q

Author and Maintainer

Created and maintained by Xin Li.
