
ML experiment launcher for local, SLURM, and SSH environments


Chester

Chester (chester-ml on PyPI) is a Python experiment launcher for ML workflows. Define your training function and parameter sweep — Chester handles dispatching jobs to local subprocesses, SSH servers, or SLURM clusters, with Singularity container support, code syncing, and reproducibility snapshots baked in.

Installation

pip install chester-ml
# or
uv add chester-ml

Quick Start

1. Create .chester/config.yaml in your project root:

log_dir: data
package_manager: uv

backends:
  local:
    type: local
    prepare: .chester/backends/local/prepare.sh

  myserver:
    type: ssh
    host: myserver                       # SSH alias from ~/.ssh/config
    remote_dir: /home/user/myproject
    prepare: .chester/backends/myserver/prepare.sh

  mycluster:
    type: slurm
    host: mycluster
    remote_dir: /home/user/myproject
    prepare: .chester/backends/mycluster/prepare.sh
    slurm:
      partition: gpu
      time: "24:00:00"
      gpus: 1
      cpus_per_gpu: 8
      mem_per_gpu: 32G
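
The slurm block above sets backend-wide defaults, and individual run_experiment_lite() calls can override them per experiment. Conceptually this is a shallow merge where the per-call values win; the sketch below illustrates the idea and is not Chester's implementation:

```python
def merge_slurm(defaults: dict, overrides: dict) -> dict:
    """Per-experiment SLURM values win over the backend's defaults."""
    return {**defaults, **overrides}

base = {'partition': 'gpu', 'time': '24:00:00', 'gpus': 1}
# A long run keeps the partition and GPU count but extends the time limit.
print(merge_slurm(base, {'time': '48:00:00'}))
```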

2. Write a launcher:

from chester.run_exp import run_experiment_lite, VariantGenerator, detect_local_gpus, flush_backend

def run_task(variant, log_dir, exp_name):
    print(f"lr={variant['lr']}, batch={variant['batch_size']}")
    # ... your training code ...

vg = VariantGenerator()
vg.add('lr', [1e-3, 1e-4])
vg.add('batch_size', [32, 64])

for v in vg.variants():
    run_experiment_lite(
        stub_method_call=run_task,
        variant=v,
        mode='local',        # or 'myserver', 'mycluster'
        exp_prefix='sweep',
        max_num_processes=max(1, len(detect_local_gpus())),
    )

flush_backend('local')       # no-op for local; required after loop for batch SSH mode
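
In batch SSH mode, submissions accumulate across the loop and are only dispatched, one per GPU, when flush_backend() runs. The queue-then-assign behaviour can be pictured with this sketch (assumed round-robin semantics; not Chester's implementation):

```python
class BatchBackend:
    """Accumulate submitted jobs, then assign them to GPUs on flush."""

    def __init__(self, gpu_ids):
        self.gpu_ids = list(gpu_ids)
        self.pending = []

    def submit(self, job):
        # Nothing is launched yet; the job just joins the queue.
        self.pending.append(job)

    def flush(self):
        # Assign queued jobs round-robin over the available GPUs.
        assignments = [(self.gpu_ids[i % len(self.gpu_ids)], job)
                       for i, job in enumerate(self.pending)]
        self.pending.clear()
        return assignments
```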

3. Run:

python launcher.py           # local
python launcher.py myserver  # SSH
python launcher.py mycluster # SLURM

Features

  • Three backend types: local subprocess, SSH (nohup), SLURM (sbatch)
  • Singularity on all backends: GPU passthrough, persistent overlays, per-container prepare.sh
  • VariantGenerator: cartesian product sweeps, dependent parameters, order="serial" (multi-step single job) and order="dependent" (chained SLURM jobs)
  • Hydra integration: pass parameters as key=value overrides with OmegaConf interpolation support
  • Git snapshot: saves git_info.json + git_diff.patch per run for full reproducibility
  • Submodule commit pinning: pin specific submodule commits per job via remote git worktrees
  • SSH batch-GPU mode: accumulate jobs across variants, fire one per GPU on flush_backend()
  • Extra sync dirs: rsync additional paths (datasets, checkpoints) to remote before submission
  • Per-experiment SLURM overrides: tune time, gpus, mem_per_gpu, etc. per run_experiment_lite() call
  • Graceful Ctrl+C: local kills subprocesses and stops the queue; remote detaches and lets jobs keep running
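
For intuition, the cartesian-product behaviour of VariantGenerator can be approximated in a few lines of plain Python (a sketch of the semantics, not Chester's code):

```python
from itertools import product

def variants(params):
    """Yield one dict per point in the cartesian product of the value lists."""
    keys = list(params)
    for combo in product(*(params[k] for k in keys)):
        yield dict(zip(keys, combo))

sweep = {'lr': [1e-3, 1e-4], 'batch_size': [32, 64]}
print(list(variants(sweep)))  # 4 variants: 2 lrs x 2 batch sizes
```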

Documentation

Full reference in docs/:

  • Configuration: .chester/config.yaml (all fields, global singularity block, YAML anchors)
  • Backends: local, SSH, and SLURM options, batch-GPU mode, extra sync dirs
  • Singularity: mounts, overlays, PID namespace, fakeroot, runtime override
  • Parameter Sweeps: VariantGenerator, serial/dependent ordering, derive, flush_backend
  • Hydra: hydra_enabled, flags, OmegaConf interpolations
  • Git Snapshot: git_info.json, git_diff.patch, submodule tracking, recovery
  • Submodule Pinning: per-job submodule commit pinning via worktrees
  • Examples: annotated real-world config patterns

Example Configs

See docs/examples/ for annotated real-world configs.

Project Layout

myproject/
├── .chester/
│   ├── config.yaml                    # Main config
│   └── backends/
│       ├── local/
│       │   └── prepare.sh             # Local env setup
│       ├── mycluster/
│       │   └── prepare.sh             # Cluster setup (modules, paths)
│       └── myserver/
│           └── prepare.sh             # SSH server setup
├── launchers/
│   └── launch_sweep.py
└── src/

Chester searches for .chester/config.yaml upward from the current directory, stopping at the .git root. Override with $CHESTER_CONFIG_PATH.
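
The upward search amounts to something like the following (a pure-Python sketch of the described behaviour, not Chester's actual code; it also ignores the $CHESTER_CONFIG_PATH override):

```python
from pathlib import Path
from typing import Optional

def find_config(start: Path) -> Optional[Path]:
    """Walk upward from `start` looking for .chester/config.yaml.

    Stops after the directory containing .git (the repo root), so the
    search never escapes the current repository.
    """
    for d in [start, *start.parents]:
        candidate = d / '.chester' / 'config.yaml'
        if candidate.is_file():
            return candidate
        if (d / '.git').exists():
            return None  # reached the repo root without finding a config
    return None
```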

License

MIT

