`snapbatch` is a replacement for `sbatch` that creates a snapshot of the current working directory and then submits the command to `sbatch`.
SnapBATCH
Motivation
On Slurm, if you change your code while a job is still waiting in the queue, the job runs the modified version when it finally launches. This is usually not what you want.
`snapbatch` replaces `sbatch` to solve this problem.
Install
pip install snapbatch
Usage
snapbatch [-J your_job_name] [OPTIONS(1)...] [ : [OPTIONS(N)...]] script(0) [args(0)...]
`snapbatch` is a replacement for `sbatch` that creates a snapshot of the current working directory and submits the command to `sbatch`.
This command simply:
- commits the dirty changes to files tracked by git AND all untracked .py/.sh files to a new branch.
- mirrors this branch to the path given by the environment variable `SNAPBATCH_PATH`, which defaults to `~/snapbatches`. (This uses `git worktree`, which makes it easier to merge/commit/find/diff in these new workspaces than directly copying would.)
- runs `sbatch --chdir /copied_path/relative/path {--arg xxx ...} (the following args to snapbatch)`
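The last step can be pictured with a small sketch. It is not the package's actual code: the function name `build_sbatch_command`, its parameters, and the example paths are all hypothetical, and it only illustrates how a `--chdir` target could be derived so that the job starts from the same relative position inside the mirrored copy.

```python
from pathlib import PurePosixPath
import shlex


def build_sbatch_command(repo_root, cwd, snapshot_root, sbatch_args):
    """Return an sbatch argv that runs from the mirrored copy of cwd.

    Hypothetical helper for illustration only; snapbatch's real
    implementation may differ.
    """
    repo_root = PurePosixPath(repo_root)
    cwd = PurePosixPath(cwd)
    # Position of the current directory inside the repository, e.g. "exp".
    relative = cwd.relative_to(repo_root)
    # The same relative position inside the snapshot directory.
    mirrored_cwd = PurePosixPath(snapshot_root) / relative
    # Remaining arguments are passed through to sbatch unchanged.
    return ["sbatch", "--chdir", str(mirrored_cwd), *sbatch_args]


# Example paths are made up.
cmd = build_sbatch_command(
    repo_root="/home/me/project",
    cwd="/home/me/project/exp",
    snapshot_root="/home/me/snapbatches/project-snap",
    sbatch_args=["-J", "train", "run.sh", "--lr", "0.1"],
)
print(shlex.join(cmd))
```

Because the job's working directory points into the snapshot, later edits to the original checkout cannot affect a job that is still queued.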
Purge branches
First move or delete the ~/snapbatches directory manually (this step is too dangerous to automate), then run the following command inside the git working directory:
snapbatch_purge [n]
It keeps the last n snapbatch branches (default: 0, which removes all of them).
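The "keep the last n" rule can be sketched as follows. This is only an illustration under assumptions: the function `branches_to_purge` and the branch names are hypothetical, and the list is assumed to be ordered from oldest to newest. How `snapbatch_purge` actually names and orders its branches is not stated here.

```python
def branches_to_purge(branches, keep=0):
    """Return the branches to delete, keeping the newest `keep` entries.

    `branches` is assumed to be ordered oldest to newest
    (hypothetical helper, not snapbatch's real code).
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    if keep == 0:
        # Default behaviour: nothing is kept.
        return list(branches)
    # Everything except the last `keep` items; empty if keep >= len(branches).
    return list(branches[:-keep])


# Made-up branch names for the example.
snapshots = ["snap-a", "snap-b", "snap-c"]
print(branches_to_purge(snapshots, keep=1))
```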
Author: mingding.thu dot gmail.com
Other tools
snapbatch-dryrun [-J your_job_name] [OPTIONS(1)...] [ : [OPTIONS(N)...]] script(0) [args(0)...]
Only mirrors the code and prints the `sbatch` command.
snapbatch-rsc [-J your_job_name] [OPTIONS(1)...] [ : [OPTIONS(N)...]] script(0) [args(0)...]
Submits the job to the FAIR RSC cluster on the dev server.
snapbatch-launch
Motivation
Sometimes we develop code on a SLURM cluster and want to run it on another cluster that has no job management system.
`snapbatch-launch` first mirrors the code, then launches a Python or shell file on multiple machines with SLURM / torchrun environment variables set, as if the processes had been launched by `srun` / `torchrun`.
Usage
First set the environment variable `SNAPBATCH_PATH` to a path on a shared filesystem.
snapbatch-launch [-h] [-H HOSTFILE] [-J JOB_NAME] [--job-id JOB_ID] [--chdir CHDIR] [--env_style {slurm,torchrun}] [-i INCLUDE] [-e EXCLUDE] [--num_nodes NUM_NODES] [--num_gpus NUM_GPUS] [--master_port MASTER_PORT] [--master_addr MASTER_ADDR] [--launcher LAUNCHER] [--launcher_args LAUNCHER_ARGS] [--force_multi]
user_script ...(user_args)
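To make the two `--env_style` choices more concrete, here is a rough sketch of the kind of per-process environment each style implies. The variable names are the standard ones set by `torchrun` and by Slurm's `srun`. Which of them `snapbatch-launch` actually sets is an assumption, and the function `rank_environments` with its parameters is hypothetical.

```python
def rank_environments(hosts, num_gpus, master_addr, master_port,
                      env_style="torchrun", job_id="0", job_name="job"):
    """Build one (host, env) pair per process, one process per GPU.

    Hypothetical illustration; not snapbatch-launch's real code.
    """
    world_size = len(hosts) * num_gpus
    result = []
    for node_rank, host in enumerate(hosts):
        for local_rank in range(num_gpus):
            # Global rank counts processes node by node.
            rank = node_rank * num_gpus + local_rank
            if env_style == "torchrun":
                env = {
                    "RANK": str(rank),
                    "LOCAL_RANK": str(local_rank),
                    "WORLD_SIZE": str(world_size),
                    "LOCAL_WORLD_SIZE": str(num_gpus),
                    "MASTER_ADDR": master_addr,
                    "MASTER_PORT": str(master_port),
                }
            elif env_style == "slurm":
                env = {
                    "SLURM_PROCID": str(rank),
                    "SLURM_LOCALID": str(local_rank),
                    "SLURM_NODEID": str(node_rank),
                    "SLURM_NTASKS": str(world_size),
                    "SLURM_NNODES": str(len(hosts)),
                    "SLURM_JOB_ID": str(job_id),
                    "SLURM_JOB_NAME": job_name,
                }
            else:
                raise ValueError(f"unknown env_style: {env_style}")
            result.append((host, env))
    return result


# Made-up host names and port for the example.
envs = rank_environments(["node0", "node1"], num_gpus=2,
                         master_addr="node0", master_port=29500)
for host, env in envs:
    print(host, env["RANK"], env["LOCAL_RANK"])
```

A script that reads only `RANK` / `WORLD_SIZE` or only `SLURM_PROCID` / `SLURM_NTASKS` can then run unchanged, which is the point of pretending to be `srun` or `torchrun`.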
Logs
`snapbatch-launch` creates a subfolder `snapbatch_backup_logs` under the mirrored working directory. It captures and saves the output of each rank (`rank_i.log`).
Run `tail -f .../snapbatch_backup_logs/rank_0.log` to see the real-time output of rank 0.
Stop
The code is adapted from DeepSpeed and is based on pdsh. Because of a limitation of pdsh, you need to kill the processes on each node manually. For example:
pdsh -w ssh:node[0-1] "ps -ef | grep jobname | awk '{print \$2}' | xargs kill -9"