Job management tool for running and monitoring jobs with dependencies
🌳 JRun
Submit & track job-trees on SLURM with one command.
stateDiagram-v2
state "✅ 6867196<br/><code>python train.py --lr 0.001 --mo…</code>" as S6867196
state "✅ 6867197<br/><code>python train.py --lr 0.001 --mo…</code>" as S6867197
state "✅ 6867198<br/><code>python train.py --lr 0.01 --mod…</code>" as S6867198
state "✅ 6867199<br/><code>python train.py --lr 0.01 --mod…</code>" as S6867199
state "✅ 6867200<br/><code>python train.py --lr 0.1 --mode…</code>" as S6867200
state "✅ 6867201<br/><code>python train.py --lr 0.1 --mode…</code>" as S6867201
state "✅ 6867202<br/><code>python find_best.py --metric ev…</code>" as S6867202
state "⏸️ 6867203<br/><code>python test.py --model best_mod…</code>" as S6867203
state "⏸️ 6867204<br/><code>python create_report.py --resul…</code>" as S6867204
S6867196 --> S6867202
S6867197 --> S6867202
S6867198 --> S6867202
S6867199 --> S6867202
S6867200 --> S6867202
S6867201 --> S6867202
S6867196 --> S6867203
S6867197 --> S6867203
S6867198 --> S6867203
S6867199 --> S6867203
S6867200 --> S6867203
S6867201 --> S6867203
S6867202 --> S6867203
S6867196 --> S6867204
S6867197 --> S6867204
S6867198 --> S6867204
S6867199 --> S6867204
S6867200 --> S6867204
S6867201 --> S6867204
S6867202 --> S6867204
S6867203 --> S6867204
Installation
pip install -e . # editable install
Usage
# Submit a workflow from YAML file
jrun submit --file workflow.yaml
# Check job statuses
jrun status
# Submit a single job
jrun sbatch --cpus-per-task=4 --mem=16G --wrap="python train.py"
Quick start
Define a tree of jobs
# Define tree
group:
  name: "test"
  type: sequential
  jobs:
    - group:
        type: parallel
        jobs:
          - job:
              preamble: cpu
              command: "echo 'python train.py'"
          - job:
              preamble: cpu
              command: "echo 'python eval.py'"
    - job:
        preamble: cpu
        command: "echo 'python make_report.py'"
# Define preambles
preambles:
  cpu:
    - "#!/bin/bash"
    - "#SBATCH --cpus-per-task=4"
    - "#SBATCH --mem=8G"
    - "#SBATCH --output=slurm/slurm-%j.out"
    - "#SBATCH --error=slurm/slurm-%j.err"
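A preamble is a named list of script header lines, and each job refers to one by name. As a rough illustration of how such a list could be combined with a job's command into a batch script, here is a minimal sketch. The function name `render_script` is hypothetical and this is not taken from JRun's source; it only assumes that the preamble lines come first and the command last.

```python
from typing import List


def render_script(preamble_lines: List[str], command: str) -> str:
    """Join the preamble lines and the job command into one script body.

    Assumption: the preamble is emitted verbatim, one entry per line,
    followed by the command on its own line.
    """
    return "\n".join([*preamble_lines, command]) + "\n"


# The `cpu` preamble from the YAML above
cpu_preamble = [
    "#!/bin/bash",
    "#SBATCH --cpus-per-task=4",
    "#SBATCH --mem=8G",
    "#SBATCH --output=slurm/slurm-%j.out",
    "#SBATCH --error=slurm/slurm-%j.err",
]

script = render_script(cpu_preamble, "echo 'python train.py'")
print(script)
```

Keeping the `#SBATCH` directives directly under the shebang matters, because SLURM stops reading directives at the first executable line of the script.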
Submit tree and visualize
$ jrun submit --file path/to/job/tree.yaml
$ jrun viz # add `--mode mermaid` for mermaid diagram
Job Dependencies:
========================================
6866829 []: (COMPLETED): echo 'python train.py'
6866830 []: (COMPLETED): echo 'python eval.py'
6866831 []: (PENDING): echo 'python make_report.py' <- 6866829, 6866830
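The listing above is consistent with a simple rule: jobs in a parallel group share the same parents, while each step of a sequential group waits for the jobs of the step before it. The sketch below applies that rule to a tree shaped like the quick-start YAML. It is an illustration under that assumption, not JRun's actual resolver; the `flatten` helper and the use of list indices in place of SLURM job ids are hypothetical.

```python
from typing import List, Tuple

Job = Tuple[str, List[int]]  # (command, indices of the jobs it waits on)


def flatten(node: dict, parents: List[int], out: List[Job]) -> List[int]:
    """Append the jobs under `node` to `out` and return the indices of
    the jobs that a following sequential step would have to wait on."""
    if "job" in node:
        out.append((node["job"]["command"], list(parents)))
        return [len(out) - 1]

    group = node["group"]
    if group["type"] == "parallel":
        # Every child starts from the same parents.
        leaves: List[int] = []
        for child in group["jobs"]:
            leaves += flatten(child, parents, out)
        return leaves
    if group["type"] == "sequential":
        # Each child waits on whatever the previous child produced.
        current = list(parents)
        for child in group["jobs"]:
            current = flatten(child, current, out)
        return current
    raise ValueError(f"unsupported group type: {group['type']}")


tree = {
    "group": {
        "name": "test",
        "type": "sequential",
        "jobs": [
            {"group": {"type": "parallel", "jobs": [
                {"job": {"preamble": "cpu", "command": "echo 'python train.py'"}},
                {"job": {"preamble": "cpu", "command": "echo 'python eval.py'"}},
            ]}},
            {"job": {"preamble": "cpu", "command": "echo 'python make_report.py'"}},
        ],
    }
}

jobs: List[Job] = []
flatten(tree, [], jobs)
for index, (command, deps) in enumerate(jobs):
    print(index, command, "<-", deps)
```

With this input the two `echo` jobs for training and evaluation have no parents and the report job waits on both, which mirrors the `<- 6866829, 6866830` entry in the output above.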
Workflow Types
Parameter Sweeps
group:
  name: "sweep-example"
  type: sweep
  preamble: base
  sweep:
    lr: [0.001, 0.01, 0.1]
    model: ["resnet", "vgg"]
  sweep_template: "python train.py --lr {lr} --model {model}"
This creates 6 jobs (3 × 2 combinations) automatically.
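The count of 6 comes from taking the Cartesian product of the sweep values and filling the template once per combination. A minimal sketch of that expansion follows. The `expand_sweep` function is a hypothetical name used only for illustration, not JRun's API.

```python
from itertools import product
from typing import Dict, List


def expand_sweep(sweep: Dict[str, list], template: str) -> List[str]:
    """Fill `template` once for every combination of sweep values."""
    names = list(sweep)
    combos = product(*(sweep[name] for name in names))
    return [template.format(**dict(zip(names, values))) for values in combos]


commands = expand_sweep(
    {"lr": [0.001, 0.01, 0.1], "model": ["resnet", "vgg"]},
    "python train.py --lr {lr} --model {model}",
)
for command in commands:
    print(command)
```

Adding a third parameter with four values would multiply the job count to 24, so sweep sizes grow quickly.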
Parallel Jobs
group:
  name: "parallel-example"
  type: parallel
  jobs:
    - job:
        preamble: base
        command: "python train_model_a.py"
    - job:
        preamble: base
        command: "python train_model_b.py"
Link jobs with group ids
# Use `{group_id}` in commands to link jobs
group:
  name: "main"
  type: parallel
  jobs:
    - group:
        type: sweep
        preamble: gpu
        sweep:
          lr: [5e-4, 1e-4, 5e-5]
        sweep_template: "python train.py --lr {lr} --group_id {group_id}" # (e.g., aaa-bbb-ccc)
    - job:
        preamble: cpu
        command: "python eval.py --group_id {group_id}" # (e.g., aaa-bbb)
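The example ids in the comments (`aaa-bbb-ccc` for the sweep jobs, `aaa-bbb` for the evaluation job) suggest that a nested group's id extends its parent's id by one segment. If that holds, which is an assumption and not something stated above, a downstream script could select the runs that belong to its group by prefix. The `in_group` helper below is hypothetical and only sketches that idea.

```python
def in_group(run_group_id: str, query_group_id: str) -> bool:
    """Return True if `run_group_id` equals `query_group_id` or is nested
    under it, assuming ids are '-'-separated paths (an assumption)."""
    return (
        run_group_id == query_group_id
        or run_group_id.startswith(query_group_id + "-")
    )


# Hypothetical ids recorded by three training runs
recorded = ["aaa-bbb-ccc", "aaa-bbb-ddd", "aaa-bbbb-ccc"]
matching = [gid for gid in recorded if in_group(gid, "aaa-bbb")]
print(matching)
```

Matching on `query_group_id + "-"` rather than on the bare prefix avoids picking up unrelated ids that merely start with the same characters.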
🌳 JRun Features & Status
Current Features
- Submit job trees from YAML files (`jrun submit --file workflow.yaml`)
- Monitor job status with visualization (`jrun status`, `jrun viz`)
- Parameter sweeps and parallel job execution
- Job graph visualization
- CLI filtering (`jrun status --filter status=COMPLETED`)
- Job retry (`jrun retry JOB_ID`)
- Job delete subgraph
- Web app
- Improve visual for loop group
- Improve dependency (make markovian)
- View job logs in browser
- Delete by node
- Add sweep_idx
Planned Features
- Bugfix: retry not auto updating old job id to new id in deps table
- Update afterany (allow some parent group failures)
- Update node color code (mixed with pending should be blue/active)
Requirements
- Python 3.6+
- SLURM environment
- PyYAML >= 6.0
- tabulate >= 0.9.0