Experiment orchestration toolkit for Slurm-based training and evaluation workflows.

slurmforge

slurmforge is a Slurm-native stage-batch system for AI training and evaluation workflows.

It focuses on a small CLI surface:

sforge init
sforge init --list-templates
sforge init --template train-eval --output ./demo --force
cd demo
sforge validate --config experiment.yaml
sforge estimate --config experiment.yaml
sforge plan train --config experiment.yaml --dry-run=full --output plan.audit.json
sforge plan eval --config experiment.yaml --checkpoint /path/to/model.pt --input-name model_input
sforge plan run --config experiment.yaml
sforge train --config experiment.yaml --dry-run=full
sforge eval --config experiment.yaml --checkpoint /path/to/model.pt
sforge run --config experiment.yaml
sforge pipeline resume /path/to/pipeline-root
sforge status --from /path/to/root --reconcile
sforge status --from /path/to/root --reconcile --sacct-profile portable
sforge resubmit --from /path/to/root --stage eval --query state=failed

Pipelines started with sforge run auto-advance to completion via a dependency-free watchdog (configurable under orchestration.control.auto_advance). sforge pipeline resume advances a pipeline once on demand, which is useful when auto-advance is disabled or when advancement is driven externally by cron/scrontab.
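
For externally driven advancement, a small wrapper around the resume command may be convenient. The sketch below is only an illustration: the helper names build_resume_command and resume_once are hypothetical, and it assumes sforge is on PATH and signals failure with a non-zero exit code.

```python
import subprocess


def build_resume_command(pipeline_root: str) -> list[str]:
    """Return the argv for one on-demand pipeline advance."""
    return ["sforge", "pipeline", "resume", pipeline_root]


def resume_once(pipeline_root: str) -> int:
    """Run a single resume step and return the CLI's exit code."""
    # check=False: let the caller (for example a cron wrapper) decide how to
    # react to a non-zero exit code instead of raising here.
    completed = subprocess.run(build_resume_command(pipeline_root), check=False)
    return completed.returncode
```

A cron or scrontab entry could call resume_once("/path/to/pipeline-root") periodically and log the returned exit code.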

Install

python -m venv .venv
source .venv/bin/activate
python -m pip install -e '.[dev]'

Start

Create a starter project instead of writing YAML from scratch:

sforge init

For scripts or CI, choose a template explicitly:

sforge init --template train-eval --output ./demo --force

This writes ./demo/experiment.yaml, ./demo/CONFIG.sforge.md, ./demo/README.sforge.md, and the template's stage scripts.

Available starter templates:

  • train-eval: train produces a checkpoint; eval consumes the upstream output.
  • train-only: one train stage with a checkpoint output.
  • eval-checkpoint: one eval stage that consumes an explicit checkpoint path.

The generated train.py and eval.py are structured as integration scaffolds:

  • SECTION A - SlurmForge contract: injected CLI args and environment contract.
  • SECTION B - Your model code: model construction, data loading, training, and eval logic to replace.
  • SECTION C - Output contract: checkpoint and metrics files declared by the YAML.
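
The three sections could be laid out roughly as follows. This is a sketch of the layout, not the generated file: the flag names (--output-dir, --epochs), the output file names, and the toy training loop are all assumptions made for illustration. The real injected arguments and declared outputs come from the generated scaffold and experiment.yaml.

```python
import argparse
import json
from pathlib import Path


def parse_contract(argv=None):
    # SECTION A - contract: these flag names are hypothetical placeholders;
    # the generated scaffold defines the real injected args.
    parser = argparse.ArgumentParser()
    parser.add_argument("--output-dir", type=Path, required=True)
    parser.add_argument("--epochs", type=int, default=1)
    return parser.parse_args(argv)


def train(epochs):
    # SECTION B - your model code: a stand-in loop that moves one weight
    # halfway toward the target value 1.0 on every epoch.
    weight, loss = 0.0, None
    for _ in range(epochs):
        weight += 0.5 * (1.0 - weight)
        loss = (1.0 - weight) ** 2
    return {"weight": weight}, {"epochs": epochs, "loss": loss}


def write_outputs(output_dir, state, metrics):
    # SECTION C - output contract: file names are assumptions and must match
    # whatever the YAML declares.
    output_dir.mkdir(parents=True, exist_ok=True)
    checkpoint_path = output_dir / "checkpoint.json"
    checkpoint_path.write_text(json.dumps(state))
    metrics_path = output_dir / "metrics.json"
    metrics_path.write_text(json.dumps(metrics))
    return checkpoint_path, metrics_path


def main(argv=None):
    args = parse_contract(argv)
    state, metrics = train(args.epochs)
    return write_outputs(args.output_dir, state, metrics)
```

Keeping Section B behind a single function call means the model code can be swapped out without touching the argument parsing or the output files.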

Minimal Workflow

sforge validate --config experiment.yaml
sforge run --config experiment.yaml --dry-run=full
sforge run --config experiment.yaml --emit-only
sforge run --config experiment.yaml
sforge status --from ./runs/<project>/<experiment>/<pipeline-root> --reconcile
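
In a script or CI job, the first four commands can be chained so that a failing step stops the sequence. The sketch below assumes each sforge command exits non-zero on failure; run_steps and the injectable runner parameter are hypothetical helpers, not part of slurmforge.

```python
import subprocess

CONFIG = "experiment.yaml"

# The same commands as above, in order; each must succeed before the next runs.
STEPS = [
    ["sforge", "validate", "--config", CONFIG],
    ["sforge", "run", "--config", CONFIG, "--dry-run=full"],
    ["sforge", "run", "--config", CONFIG, "--emit-only"],
    ["sforge", "run", "--config", CONFIG],
]


def run_steps(steps, runner=subprocess.run):
    """Run steps in order; return the index of the first failing step, or None."""
    for index, cmd in enumerate(steps):
        result = runner(cmd, check=False)
        if result.returncode != 0:
            return index
    return None
```

Calling run_steps(STEPS) returns None when every step succeeded; the runner argument exists only so the sequence can be exercised without a Slurm cluster.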

By default, status --reconcile falls back across sacct profiles. On clusters whose sacct supports a different set of fields, pin the query with --sacct-profile or --sacct-fields.
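
To make the idea of profile fallback concrete, here is one way such a fallback could work. It is not slurmforge's implementation: the profile contents are guesses (only the name "portable" appears in the CLI examples), and query_with_fallback is a hypothetical helper. The sacct options used (-j, --format, --parsable2, --noheader) are standard.

```python
import subprocess

# Hypothetical profiles, richest first. "portable" borrows the profile name
# from the CLI examples, but its field list here is a guess.
PROFILES = [
    ("rich", ["JobID", "State", "ExitCode", "Elapsed", "MaxRSS"]),
    ("portable", ["JobID", "State", "ExitCode"]),
]


def sacct_command(job_id, fields):
    """Build a sacct call that prints the given fields as '|'-separated rows."""
    return ["sacct", "-j", str(job_id), "--format=" + ",".join(fields),
            "--parsable2", "--noheader"]


def query_with_fallback(job_id, profiles=PROFILES, runner=subprocess.run):
    """Try each profile in order; return (name, rows) from the first accepted one."""
    for name, fields in profiles:
        result = runner(sacct_command(job_id, fields),
                        capture_output=True, text=True, check=False)
        if result.returncode == 0:
            rows = [dict(zip(fields, line.split("|")))
                    for line in result.stdout.splitlines() if line]
            return name, rows
    raise RuntimeError("no sacct profile was accepted")
```

Pinning a profile corresponds to passing a single-entry profiles list, which removes the fallback and makes a field mismatch fail loudly.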

Use sforge train for train-only configs and sforge eval --checkpoint /path/to/model.pt for eval-only configs.

Docs

Development

ruff check src tests
pytest -q

Download files

Source Distribution

slurmforge-1.2.3.tar.gz (183.6 kB)

Uploaded Source

Built Distribution

slurmforge-1.2.3-py3-none-any.whl (316.9 kB)

Uploaded Python 3

File details

Details for the file slurmforge-1.2.3.tar.gz.

File metadata

  • Download URL: slurmforge-1.2.3.tar.gz
  • Size: 183.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for slurmforge-1.2.3.tar.gz
  • SHA256: ab105cb66f35be5010ae42f5037a2a707a7ea85956ccd545f66c2b9bf471e32b
  • MD5: 1776b955b3c807d38c8f437b8efdb977
  • BLAKE2b-256: 2c453a3fbf204f58d99e7013492679fca1378f834e3bb1be0706bb3196ac6ff5

Provenance

The following attestation bundles were made for slurmforge-1.2.3.tar.gz:

Publisher: publish.yml on Sean-XinLi/slurmforge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file slurmforge-1.2.3-py3-none-any.whl.

File metadata

  • Download URL: slurmforge-1.2.3-py3-none-any.whl
  • Size: 316.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for slurmforge-1.2.3-py3-none-any.whl
  • SHA256: b7b96615fb1ca957fcc3e8ecb5842462e89f3a597970ce7884ce2ef70d45bc08
  • MD5: 7ba6ffa6fa790fd4f13d38ec0466f607
  • BLAKE2b-256: 8346742c98d898872b99a1ecb82d6e6dde01df9c4da056491af131375e19affd

Provenance

The following attestation bundles were made for slurmforge-1.2.3-py3-none-any.whl:

Publisher: publish.yml on Sean-XinLi/slurmforge
