Job management tool for running and monitoring jobs with dependencies

🌳 JRun

Submit & track job-trees on SLURM with one command.

MIT License  Python 3.9+  SLURM


stateDiagram-v2
    state "✅ 6867196<br/><code>python train.py --lr 0.001 --mo…</code>" as S6867196
    state "✅ 6867197<br/><code>python train.py --lr 0.001 --mo…</code>" as S6867197
    state "✅ 6867198<br/><code>python train.py --lr 0.01 --mod…</code>" as S6867198
    state "✅ 6867199<br/><code>python train.py --lr 0.01 --mod…</code>" as S6867199
    state "✅ 6867200<br/><code>python train.py --lr 0.1 --mode…</code>" as S6867200
    state "✅ 6867201<br/><code>python train.py --lr 0.1 --mode…</code>" as S6867201
    state "✅ 6867202<br/><code>python find_best.py --metric ev…</code>" as S6867202
    state "⏸️ 6867203<br/><code>python test.py --model best_mod…</code>" as S6867203
    state "⏸️ 6867204<br/><code>python create_report.py --resul…</code>" as S6867204
    S6867196 --> S6867202
    S6867197 --> S6867202
    S6867198 --> S6867202
    S6867199 --> S6867202
    S6867200 --> S6867202
    S6867201 --> S6867202
    S6867196 --> S6867203
    S6867197 --> S6867203
    S6867198 --> S6867203
    S6867199 --> S6867203
    S6867200 --> S6867203
    S6867201 --> S6867203
    S6867202 --> S6867203
    S6867196 --> S6867204
    S6867197 --> S6867204
    S6867198 --> S6867204
    S6867199 --> S6867204
    S6867200 --> S6867204
    S6867201 --> S6867204
    S6867202 --> S6867204
    S6867203 --> S6867204

Installation

pip install -e . # editable install

Usage

# Submit a workflow from YAML file
jrun submit --file workflow.yaml

# Check job statuses
jrun status

# Submit a single job
jrun sbatch --cpus-per-task=4 --mem=16G --wrap="python train.py"

Quick start

Define a tree of jobs

# Define tree
group:
  name: "test"
  type: sequential
  jobs:
    - group:
        type: parallel
        jobs:
          - job:
              preamble: cpu
              command: "echo 'python train.py'"

          - job:
              preamble: cpu
              command: "echo 'python eval.py'"
    - job:
        preamble: cpu
        command: "echo 'python make_report.py'"

# Define preambles
preambles:
  cpu:
    - "#!/bin/bash"
    - "#SBATCH --cpus-per-task=4"
    - "#SBATCH --mem=8G"
    - "#SBATCH --output=slurm/slurm-%j.out"
    - "#SBATCH --error=slurm/slurm-%j.err"
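
As a rough illustration of how a named preamble and a job command could combine into a batch script, here is a minimal sketch. It assumes the script is simply the preamble lines followed by the command; `render_script` is a hypothetical helper for this example, not part of JRun's API.

```python
from typing import Dict, List


def render_script(preambles: Dict[str, List[str]], preamble: str, command: str) -> str:
    """Join the named preamble lines and the command into one script (assumed layout)."""
    if preamble not in preambles:
        raise KeyError(f"unknown preamble: {preamble!r}")
    return "\n".join([*preambles[preamble], command]) + "\n"


# Mirrors part of the `cpu` preamble from the YAML above.
preambles = {
    "cpu": [
        "#!/bin/bash",
        "#SBATCH --cpus-per-task=4",
        "#SBATCH --mem=8G",
    ]
}

script = render_script(preambles, "cpu", "echo 'python train.py'")
print(script)
```

Keeping SBATCH directives in a shared preamble means resource settings are changed in one place rather than in every job.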

Submit tree and visualize

$ jrun submit --file path/to/job/tree.yaml
$ jrun viz # add `--mode mermaid` for mermaid diagram
Job Dependencies:
========================================
6866829 []: (COMPLETED): echo 'python train.py'
6866830 []: (COMPLETED): echo 'python eval.py'
6866831 []: (PENDING): echo 'python make_report.py' <- 6866829, 6866830
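
The dependency list above can be reproduced with a small tree walk. The sketch below is an assumption about the semantics, not JRun's implementation: children of a `parallel` group share the same parents, and each child of a `sequential` group waits for the jobs of the child before it. `flatten` and the integer indices (standing in for SLURM job IDs) are hypothetical.

```python
from typing import Dict, List, Tuple

Job = Tuple[str, List[int]]  # (command, indices of the jobs it waits for)


def flatten(node: Dict, parents: List[int], out: List[Job]) -> List[int]:
    """Append the jobs under `node` to `out`; return the indices of its final jobs."""
    if "job" in node:
        out.append((node["job"]["command"], list(parents)))
        return [len(out) - 1]
    group = node["group"]
    if group["type"] == "parallel":
        leaves: List[int] = []
        for child in group["jobs"]:
            # Every parallel child starts from the same parents.
            leaves += flatten(child, parents, out)
        return leaves
    if group["type"] == "sequential":
        for child in group["jobs"]:
            # Each child waits for the final jobs of the previous child.
            parents = flatten(child, parents, out)
        return parents
    raise ValueError(f"unsupported group type: {group['type']!r}")


# The quick-start tree, written as the dict a YAML loader would produce.
tree = {"group": {"name": "test", "type": "sequential", "jobs": [
    {"group": {"type": "parallel", "jobs": [
        {"job": {"preamble": "cpu", "command": "echo 'python train.py'"}},
        {"job": {"preamble": "cpu", "command": "echo 'python eval.py'"}},
    ]}},
    {"job": {"preamble": "cpu", "command": "echo 'python make_report.py'"}},
]}}

jobs: List[Job] = []
flatten(tree, [], jobs)
for index, (command, deps) in enumerate(jobs):
    print(index, command, "<-", deps)
```

Here the report job ends up waiting on both parallel jobs, matching the `<- 6866829, 6866830` entry in the output above.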

Workflow Types

Parameter Sweeps

group:
  name: "sweep-example"
  type: sweep
  preamble: base
  sweep:
    lr: [0.001, 0.01, 0.1]
    model: ["resnet", "vgg"]
  sweep_template: "python train.py --lr {lr} --model {model}"

This creates 6 jobs (3 × 2 combinations) automatically.
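
The 3 × 2 expansion is a Cartesian product over the sweep values. A minimal sketch of that idea, assuming the template is filled with `str.format`-style placeholders (`expand_sweep` is a hypothetical helper, not JRun's own function):

```python
from itertools import product
from typing import Dict, List


def expand_sweep(sweep: Dict[str, list], template: str) -> List[str]:
    """Return one command per combination of sweep values."""
    keys = list(sweep)
    return [
        template.format(**dict(zip(keys, values)))
        for values in product(*(sweep[key] for key in keys))
    ]


commands = expand_sweep(
    {"lr": [0.001, 0.01, 0.1], "model": ["resnet", "vgg"]},
    "python train.py --lr {lr} --model {model}",
)
for command in commands:
    print(command)
```

Each extra sweep key multiplies the job count, so a third key with four values would give 24 jobs.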

Parallel Jobs

group:
  name: "parallel-example"
  type: parallel
  jobs:
    - job:
        preamble: base
        command: "python train_model_a.py"
    - job:
        preamble: base
        command: "python train_model_b.py"

Link jobs with group ids

# Use `{group_id}` in commands to link jobs
group:
  name: "main"
  type: parallel
  jobs:
    - group:
        type: sweep
        preamble: gpu
        sweep:
          lr: [5e-4, 1e-4, 5e-5]
        sweep_template: "python train.py --lr {lr} --group_id {group_id}"  # (e.g., aaa-bbb-ccc)

    - job:
        preamble: cpu
        command: "python eval.py --group_id {group_id}" # (e.g., aaa-bbb)
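
One way a script such as `eval.py` might use these ids is to treat its own `group_id` as a prefix and collect the runs nested under it. This is only a sketch under that assumption: the example comments suggest ids are hyphen-separated segments, but how JRun actually builds them is not shown here, and `runs_in_group` is a hypothetical helper.

```python
from typing import List


def runs_in_group(run_ids: List[str], group_id: str) -> List[str]:
    """Return the run ids equal to `group_id` or nested under it, compared segment by segment."""
    prefix = group_id.split("-")
    return [run for run in run_ids if run.split("-")[: len(prefix)] == prefix]


# Hypothetical ids in the style of the comments above.
run_ids = ["aaa-bbb-ccc", "aaa-bbb-ddd", "aaa-bbx-ccc"]
matched = runs_in_group(run_ids, "aaa-bbb")
print(matched)
```

Comparing whole segments rather than raw string prefixes stops `aaa-bb` from accidentally matching `aaa-bbb-ccc`.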

🌳 JRun Features & Status

Current Features

  • Submit job trees from YAML files (jrun submit --file workflow.yaml)
  • Monitor job status with visualization (jrun status, jrun viz)
  • Parameter sweeps and parallel job execution
  • Job graph visualization
  • CLI filtering (jrun status --filter status=COMPLETED)
  • Job retry (jrun retry JOB_ID)
  • Job delete subgraph
  • Web app
  • Improve visual for loop group
  • Improve dependency (make markovian)
  • View job logs in browser
  • Delete by node
  • Add sweep_idx

Planned Features

  • Bugfix: retry not auto updating old job id to new id in deps table
  • Update afterany (allow some parent group failures)
  • Update node color code (mixed with pending should be blue/active)

Requirements

  • Python 3.9+
  • SLURM environment
  • PyYAML >= 6.0
  • tabulate >= 0.9.0
