HyperPod elastic agent

The HyperPod elastic agent is an extension of PyTorch’s ElasticAgent. It orchestrates the lifecycle of the training workers in each container and communicates with the HyperPod training operator. To use the HyperPod training operator, you must install the HyperPod elastic agent into your training image before you submit and run jobs with the operator. The following Dockerfile excerpt installs the elastic agent, and the accompanying entrypoint.sh uses hyperpodrun as the job launcher. For more information about the HyperPod training operator, including how to install and use it, see the official AWS documentation.

# Dockerfile
RUN pip install hyperpod-elastic-agent

ENTRYPOINT ["entrypoint.sh"]

# entrypoint.sh
...
# --rdzv-backend is optional. The "..." line stands for other torchrun args.
# --pre-train-script/--pre-train-args form the pre-train argument group;
# --post-train-script/--post-train-args form the post-train argument group.
hyperpodrun --nnodes=node_count --nproc-per-node=proc_count \
            --rdzv-backend hyperpod \
            ... \
            --pre-train-script pre.sh --pre-train-args "pre_1 pre_2 pre_3" \
            --post-train-script post.sh --post-train-args "post_1 post_2 post_3" \
            training.py --script-args

You can now submit jobs with kubectl.
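
Because the operator relies on the agent being present in the training image, an entrypoint script can fail fast when the launcher is missing. The following is a minimal sketch and not part of the package: the helper name require_cmd is hypothetical, and it relies only on the POSIX `command -v` builtin.

```shell
#!/bin/sh
# Hypothetical helper (not part of hyperpod-elastic-agent): succeed only if
# the named command can be found on PATH.
require_cmd() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "error: '$1' not found; is it installed in the image?" >&2
        return 1
    fi
}

# Example check against a command that exists in any POSIX environment.
require_cmd sh && echo "sh is available"
```

In an entrypoint.sh, a line such as `require_cmd hyperpodrun || exit 1` placed before the launch command would stop the container early with a readable message.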

Note: When invoking hyperpodrun in a container entrypoint, ensure that hyperpodrun either runs as PID 1 or can receive signals (specifically SIGTERM) forwarded from PID 1 in the container, so that it can terminate gracefully during container or pod teardown. You can do this in any of the following ways:

Use exec when specifying ENTRYPOINT in shell form, for example: ENTRYPOINT exec hyperpodrun .....
Use the exec form to specify the container entrypoint, for example: ENTRYPOINT ["hyperpodrun", .....]
Use an init process that forwards signals to its child processes, for example: tini
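
The exec option works because exec replaces the shell with the launched command instead of forking a child, so the command keeps the shell's PID (PID 1 when that shell is the container's entrypoint). The sketch below illustrates this with plain sh standing in for hyperpodrun; it does not invoke hyperpodrun itself.

```shell
#!/bin/sh
# Each wrapper shell prints its own PID and then launches a child command
# that prints its PID.
# With `exec`, the child replaces the wrapper and keeps the wrapper's PID.
pids_with_exec=$(sh -c 'echo $$; exec sh -c "echo \$\$"')
# Without `exec`, the child is forked and gets a new PID. The trailing `:`
# keeps the child from being the wrapper's last command.
pids_without_exec=$(sh -c 'echo $$; sh -c "echo \$\$"; :')

# The unquoted expansions join the two PIDs onto one line for display.
echo "with exec:    $(echo $pids_with_exec)"
echo "without exec: $(echo $pids_without_exec)"
```

The first line of output shows the same PID twice, and the second shows two different PIDs. In a container, the same mechanism is what lets a command started with exec receive SIGTERM directly.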

HyperPod elastic agent arguments

The HyperPod elastic agent supports all of the arguments described in the official PyTorch Elastic Agent documentation. It also accepts the following additional arguments:

Argument	Description	Default Value
--shutdown-signal	Signal to send to workers for shutdown (SIGTERM or SIGKILL)	"SIGKILL"
--shutdown-timeout	Timeout in seconds between SIGTERM and SIGKILL signals	15
--server-host	Agent server address	"0.0.0.0"
--server-port	Agent server port	8080
--server-log-level	Agent server log level	"info"
--server-shutdown-timeout	Server shutdown timeout in seconds	300
--pre-train-script	Path to pre-training script	None
--pre-train-args	Arguments for pre-training script	None
--post-train-script	Path to post-training script	None
--post-train-args	Arguments for post-training script	None
--inprocess-restart	Flag specifying whether to use the inprocess_restart feature	false
--inprocess-timeout	Time in seconds the Agent waits for workers to come to barrier before triggering a process level restart	None
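
The --shutdown-timeout description suggests a common escalation pattern: send SIGTERM, wait for a grace period, then send SIGKILL. The sketch below shows that generic pattern in shell to make the timing concrete. It is an illustration only, not the agent's implementation; stop_with_timeout and the sleep-based worker are hypothetical stand-ins.

```shell
#!/bin/sh
# Generic SIGTERM-then-SIGKILL escalation (illustrative; not the agent's code).
# Usage: stop_with_timeout PID SECONDS
stop_with_timeout() {
    pid=$1
    timeout=$2
    kill -TERM "$pid" 2>/dev/null || return 0    # process already gone
    elapsed=0
    while kill -0 "$pid" 2>/dev/null; do         # still alive?
        if [ "$elapsed" -ge "$timeout" ]; then
            kill -KILL "$pid" 2>/dev/null || true
            break
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    wait "$pid" 2>/dev/null || true              # reap it if it is our child
}

# Stand-in worker that ignores SIGTERM, so the SIGKILL fallback is exercised.
sh -c 'trap "" TERM; exec sleep 30' >/dev/null 2>&1 &
worker=$!
sleep 1                                          # let the worker install its trap
stop_with_timeout "$worker" 2
if kill -0 "$worker" 2>/dev/null; then
    echo "worker still running"
else
    echo "worker stopped"
fi
```

A worker that handles SIGTERM promptly exits during the grace period and never receives SIGKILL, which is why a longer timeout gives training processes more room to checkpoint or clean up.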

Release history

1.1.2

Mar 12, 2026

1.1.0

Dec 3, 2025

1.0.0

Jun 30, 2025

Download files

Source Distribution

hyperpod_elastic_agent-1.1.0.tar.gz (102.7 kB)

Uploaded Dec 3, 2025 Source

File details

Details for the file hyperpod_elastic_agent-1.1.0.tar.gz.

File metadata

Download URL: hyperpod_elastic_agent-1.1.0.tar.gz
Upload date: Dec 3, 2025
Size: 102.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.24

File hashes

Hashes for hyperpod_elastic_agent-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c632de0c180d54504533b7cd4e22bad9c3030c651f075196d8c7ead15219d49b`
MD5	`1f4057b6089eea9db8ec97739653083b`
BLAKE2b-256	`0c8363ed4acf176b08c222b9c22b1de5a39af522ae77aa9f99f07dd71f1e36b7`
