HyperPod elastic agent
The HyperPod elastic agent is an extension of PyTorch’s ElasticAgent. It orchestrates the lifecycle of the training workers in each container and communicates with the HyperPod training operator. To use the HyperPod training operator, you must install the HyperPod elastic agent into your training image before you submit and run jobs with the operator. The following Dockerfile installs the elastic agent, and its entrypoint script uses hyperpodrun to create the job launcher. For more information about the HyperPod training operator, including how to install and use it, see the official AWS documentation.
RUN pip install hyperpod-elastic-agent
ENTRYPOINT ["entrypoint.sh"]
# entrypoint.sh
...
# --rdzv-backend is optional; "..." stands for other torchrun arguments.
# --pre-train-* is the pre-training argument group; --post-train-* is the post-training argument group.
hyperpodrun --nnodes=node_count --nproc-per-node=proc_count \
    --rdzv-backend hyperpod \
    ... \
    --pre-train-script pre.sh --pre-train-args "pre_1 pre_2 pre_3" \
    --post-train-script post.sh --post-train-args "post_1 post_2 post_3" \
    training.py --script-args
You can now submit jobs with kubectl.
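The entrypoint shown earlier hard-codes its arguments. As a rough sketch only, the same command line could be assembled from environment variables so that one image serves several jobs. The variable names used here (NNODES, NPROC_PER_NODE, PRE_TRAIN_SCRIPT, PRE_TRAIN_ARGS, POST_TRAIN_SCRIPT, POST_TRAIN_ARGS, DRY_RUN) and the launch function are hypothetical and not part of the package; only the hyperpodrun flags come from this README.

```shell
#!/bin/sh
# Sketch of an entrypoint.sh that assembles the hyperpodrun command line.
# Every environment variable name below is a hypothetical choice for this
# example; only the hyperpodrun flags themselves come from the README.
set -eu

launch() {
    # On entry, "$@" is the training script followed by its own arguments.
    # Flags are prepended in reverse order so they end up in the order
    # shown in the README: launcher, node flags, pre-train, post-train, script.
    if [ -n "${POST_TRAIN_SCRIPT:-}" ]; then
        if [ -n "${POST_TRAIN_ARGS:-}" ]; then
            set -- --post-train-args "$POST_TRAIN_ARGS" "$@"
        fi
        set -- --post-train-script "$POST_TRAIN_SCRIPT" "$@"
    fi
    if [ -n "${PRE_TRAIN_SCRIPT:-}" ]; then
        if [ -n "${PRE_TRAIN_ARGS:-}" ]; then
            set -- --pre-train-args "$PRE_TRAIN_ARGS" "$@"
        fi
        set -- --pre-train-script "$PRE_TRAIN_SCRIPT" "$@"
    fi
    set -- hyperpodrun "--nnodes=${NNODES:-1}" \
        "--nproc-per-node=${NPROC_PER_NODE:-1}" "$@"

    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run (the default here): print one argument per line.
        printf '%s\n' "$@"
    else
        # Replace the shell with the launcher so it receives signals directly.
        exec "$@"
    fi
}

launch training.py --script-args
```

With DRY_RUN left at its default, the script only prints the arguments it would pass, which makes the quoting of multi-word values such as the pre-train arguments easy to inspect. Setting DRY_RUN=0 makes the sketch exec the assembled command instead.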
HyperPod elastic agent arguments
The HyperPod elastic agent supports all of the arguments described in the official PyTorch Elastic Agent documentation. It also accepts the following additional arguments:
| Argument | Description | Default Value |
|---|---|---|
| --shutdown-signal | Signal to send to workers for shutdown (SIGTERM or SIGKILL) | "SIGKILL" |
| --shutdown-timeout | Timeout in seconds between SIGTERM and SIGKILL signals | 30 |
| --server-host | Agent server address | "0.0.0.0" |
| --server-port | Agent server port | 8080 |
| --server-log-level | Agent server log level | "info" |
| --server-shutdown-timeout | Server shutdown timeout in seconds | 300 |
| --pre-train-script | Path to pre-training script | None |
| --pre-train-args | Arguments for pre-training script | None |
| --post-train-script | Path to post-training script | None |
| --post-train-args | Arguments for post-training script | None |
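The --shutdown-timeout description refers to a delay between SIGTERM and SIGKILL. The sketch below illustrates that general escalation pattern in plain shell: ask a process to stop, wait up to a timeout, then force it. It is not the agent's implementation, and the stop_worker function and its arguments are hypothetical names chosen for this example.

```shell
#!/bin/sh
# Sketch of the SIGTERM -> wait -> SIGKILL escalation pattern.
# This is NOT the agent's implementation; stop_worker is a hypothetical helper.

stop_worker() {
    sw_pid=$1        # process to stop
    sw_timeout=$2    # seconds to wait after SIGTERM before sending SIGKILL

    # Ask the process to shut down; if it is already gone, there is nothing to do.
    kill -TERM "$sw_pid" 2>/dev/null || return 0

    sw_elapsed=0
    while [ "$sw_elapsed" -lt "$sw_timeout" ]; do
        # kill -0 sends no signal; it only checks that the PID still exists.
        kill -0 "$sw_pid" 2>/dev/null || return 0
        sleep 1
        sw_elapsed=$((sw_elapsed + 1))
    done

    # The timeout elapsed and the PID still exists: force termination.
    kill -KILL "$sw_pid" 2>/dev/null || true
}

# Example: start a background process and stop it with a 5 second timeout.
sleep 30 &
stop_worker "$!" 5
```

A process that handles SIGTERM promptly exits during the wait loop, while one that ignores SIGTERM is killed once the timeout elapses.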
Download files
Source Distribution
File details
Details for the file hyperpod_elastic_agent-1.0.0.tar.gz.
File metadata
- Download URL: hyperpod_elastic_agent-1.0.0.tar.gz
- Upload date:
- Size: 43.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d05bdcac1485c1d1dbce8304e4579020f8a83704aaa94ccc2e81e34f16b2f1e2 |
| MD5 | 6d82861e900b98d16c55ef360a5b8288 |
| BLAKE2b-256 | c6f241608314bbd6cd16db0131d9b5b5e66fb9ff35e39581d7e199c11410e0db |