HyperPod elastic agent

The HyperPod elastic agent is an extension of PyTorch's ElasticAgent. It orchestrates the lifecycle of the training workers in each container and communicates with the HyperPod training operator. To use the HyperPod training operator, you must install the HyperPod elastic agent into your training image before you submit and run jobs with the operator. The following Dockerfile installs the elastic agent, and its entrypoint script uses hyperpodrun to create the job launcher. For more information about the HyperPod training operator, including how to install and use it, see the official AWS documentation.

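# Dockerfile
RUN pip install hyperpod-elastic-agent

ENTRYPOINT ["entrypoint.sh"]

# entrypoint.sh
...
# --rdzv-backend is optional. Other torchrun args can be added alongside it.
# --pre-train-script and --pre-train-args form the pre-training arg group.
# --post-train-script and --post-train-args form the post-training arg group.
# Comments are kept above the command because a comment after a trailing
# backslash would break the line continuation.
hyperpodrun --nnodes=node_count --nproc-per-node=proc_count \
            --rdzv-backend hyperpod \
            --pre-train-script pre.sh --pre-train-args "pre_1 pre_2 pre_3" \
            --post-train-script post.sh --post-train-args "post_1 post_2 post_3" \
            training.py --script-args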

You can now submit jobs with kubectl.
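
The launcher in entrypoint.sh starts training.py once per worker process. The following is a minimal sketch of how such a script might discover its place in the job. It assumes the agent exports the same RANK, LOCAL_RANK, and WORLD_SIZE environment variables that torchrun sets for its workers. This page does not state that, so treat it as an assumption. WorkerEnv and read_worker_env are hypothetical names used only for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerEnv:
    """Position of one worker process within the job (hypothetical helper type)."""
    rank: int        # global index of this worker across all nodes
    local_rank: int  # index of this worker on its own node
    world_size: int  # total number of workers in the job


def read_worker_env(environ=os.environ) -> WorkerEnv:
    """Read torchrun-style rank variables, falling back to single-process defaults.

    Assumption: the HyperPod elastic agent sets these variables the way torchrun does.
    """
    return WorkerEnv(
        rank=int(environ.get("RANK", "0")),
        local_rank=int(environ.get("LOCAL_RANK", "0")),
        world_size=int(environ.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    env = read_worker_env()
    print(f"worker {env.rank} of {env.world_size} (local rank {env.local_rank})")
```

The defaults let the same script run as a plain single process during local debugging, outside any launcher.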

HyperPod elastic agent arguments

The HyperPod elastic agent supports all of the arguments described in the official PyTorch Elastic Agent documentation. It also accepts the following additional arguments:

| Argument | Description | Default Value |
| --- | --- | --- |
| --shutdown-signal | Signal to send to workers for shutdown (SIGTERM or SIGKILL) | "SIGKILL" |
| --shutdown-timeout | Timeout in seconds between SIGTERM and SIGKILL signals | 30 |
| --server-host | Agent server address | "0.0.0.0" |
| --server-port | Agent server port | 8080 |
| --server-log-level | Agent server log level | "info" |
| --server-shutdown-timeout | Server shutdown timeout in seconds | 300 |
| --pre-train-script | Path to pre-training script | None |
| --pre-train-args | Arguments for pre-training script | None |
| --post-train-script | Path to post-training script | None |
| --post-train-args | Arguments for post-training script | None |
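
The descriptions of --shutdown-signal and --shutdown-timeout suggest a common two-stage shutdown: ask the worker to exit with SIGTERM, then force it with SIGKILL if it is still running after the timeout. The following is a rough sketch of that pattern using only the Python standard library. It is not the agent's actual implementation. stop_worker is a hypothetical helper, and the way it interprets the two options is an assumption based on the table. It targets POSIX systems.

```python
import signal
import subprocess
import sys


def stop_worker(proc: subprocess.Popen,
                shutdown_signal: str = "SIGKILL",
                shutdown_timeout: float = 30.0) -> int:
    """Stop a worker process and return its exit code (hypothetical helper).

    With "SIGTERM", the worker gets shutdown_timeout seconds to exit on its
    own before SIGKILL is sent. With "SIGKILL", it is killed immediately.
    """
    if proc.poll() is not None:
        return proc.returncode  # worker has already exited
    if shutdown_signal == "SIGTERM":
        proc.send_signal(signal.SIGTERM)
        try:
            return proc.wait(timeout=shutdown_timeout)
        except subprocess.TimeoutExpired:
            pass  # worker did not exit in time; escalate below
    proc.kill()  # sends SIGKILL on POSIX
    return proc.wait()


if __name__ == "__main__":
    # Stand-in for a training worker: a child process that just sleeps.
    worker = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    code = stop_worker(worker, shutdown_signal="SIGTERM", shutdown_timeout=5)
    print(f"worker exited with code {code}")
```

On POSIX, a negative return code from Popen means the process was terminated by that signal number, which shows whether the worker stopped on SIGTERM or had to be killed.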


