Overview

xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE. xpk handles the "multihost pods" of TPUs and GPUs (HGX H100) as first-class citizens.

xpk decouples provisioning capacity from running jobs. There are two structures: clusters (provisioned VMs) and workloads (training jobs). Clusters represent the physical resources you have available. Workloads represent training jobs -- at any time some of these will be completed, others will be running and some will be queued, waiting for cluster resources to become available.

The ideal workflow starts by provisioning clusters for all of the ML hardware you have reserved. Then, without re-provisioning, you submit jobs as needed. Because there is no re-provisioning between jobs, and because Docker containers ship with pre-installed dependencies and ahead-of-time compilation, queued jobs run with minimal start times. Further, because workloads return hardware to the shared pool when they complete, developers achieve better use of finite hardware resources, and automated tests can run overnight while resources tend to be underutilized.
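
As a sketch, a full session might look like the following (the workload name and training script here are placeholders; each command is documented in the sections below):

    # Provision capacity once.
    python3 xpk.py cluster create \
    --cluster xpk-test --tpu-type=v5litepod-16 --num-slices=4
    # Submit training jobs as needed, without re-provisioning.
    python3 xpk.py workload create \
    --workload my-training-job --command "bash train.sh" \
    --cluster xpk-test --tpu-type=v5litepod-16
    # Check job status, then release the capacity when finished.
    python3 xpk.py workload list --cluster xpk-test
    python3 xpk.py cluster delete --cluster xpk-test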

Installation

To install xpk, run the following command:

pip install xpk

XPK for Large Scale (>1k VMs)

Follow the instructions in xpk-large-scale-guide.sh to use xpk with a GKE cluster of more than 1,000 VMs. The guide walks through setting up a GKE cluster with large-scale training and high-throughput support, and running jobs with xpk. We recommend copying the commands manually, one step at a time, and verifying the output of each step.

Example usages:

To get started, be sure to set your GCP Project and Zone as usual via gcloud config set.
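
For example (the project ID and zone below are placeholders):

    gcloud config set project your-gcp-project
    gcloud config set compute/zone us-central2-b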

Below are reference commands. A typical journey starts with a Cluster Create followed by many Workload Creates. To understand the state of the system, you can use the Cluster List and Workload List commands. Finally, clean up with a Cluster Delete.

Cluster Create

  • Cluster Create (provision on-demand capacity):

    python3 xpk.py cluster create \
    --cluster xpk-test --tpu-type=v5litepod-16 \
    --num-slices=4
    
  • Cluster Create (provision reserved capacity):

    python3 xpk.py cluster create \
    --cluster xpk-test --tpu-type=v5litepod-256 \
    --num-slices=2 \
    --custom-tpu-nodepool-arguments="--reservation-affinity=specific --reservation=RESERVATION_ID"
    
  • Cluster Create can be called again with the same --cluster name to modify the number of slices or retry failed steps.

    For example, if a user creates a cluster with 4 slices:

    python3 xpk.py cluster create \
    --cluster xpk-test --tpu-type=v5litepod-16 \
    --num-slices=4
    

    and then recreates the cluster with 8 slices, the command will rerun and create the 4 new slices:

    python3 xpk.py cluster create \
    --cluster xpk-test --tpu-type=v5litepod-16 \
    --num-slices=8
    

    If the user then recreates the cluster with 6 slices, the command will rerun and delete 2 slices. The command warns the user before deleting slices; use --force to skip the prompt.

    python3 xpk.py cluster create \
    --cluster xpk-test --tpu-type=v5litepod-16 \
    --num-slices=6
    
    # Skip delete prompts using --force.
    
    python3 xpk.py cluster create --force \
    --cluster xpk-test --tpu-type=v5litepod-16 \
    --num-slices=6
    

Cluster Delete

  • Cluster Delete (deprovision capacity):

    python3 xpk.py cluster delete \
    --cluster xpk-test
    

Cluster List

  • Cluster List (see provisioned capacity):

    python3 xpk.py cluster list
    

Cluster Describe

  • Cluster Describe (see capacity):

    python3 xpk.py cluster describe \
    --cluster xpk-test
    

Cluster Cacheimage

  • Cluster Cacheimage (enables faster start times):

    python3 xpk.py cluster cacheimage \
    --cluster xpk-test --docker-image gcr.io/your_docker_image
    

Workload Create

  • Workload Create (submit training job):

    python3 xpk.py workload create \
    --workload xpk-test-workload --command "echo goodbye" --cluster \
    xpk-test --tpu-type=v5litepod-16
    

Workload Priority and Preemption

  • Set the priority level of your workload with --priority=LEVEL

    We have five priorities defined: [very-low, low, medium, high, very-high]. The default priority is medium.

    Priority determines:

    1. Order of queued jobs.

      Queued jobs are ordered by very-low < low < medium < high < very-high

    2. Preemption of lower priority workloads.

      A higher-priority job will evict lower-priority jobs. Evicted jobs are returned to the queue and re-hydrate appropriately.

    General Example:

    python3 xpk.py workload create \
    --workload xpk-test-medium-workload --command "echo goodbye" --cluster \
    xpk-test --tpu-type=v5litepod-16 --priority=medium
    

Workload Delete

  • Workload Delete (delete training job):

    python3 xpk.py workload delete \
    --workload xpk-test-workload --cluster xpk-test
    

Workload List

  • Workload List (see training jobs):

    python3 xpk.py workload list \
    --cluster xpk-test
    
  • Example Workload List Output:

    The below example shows five jobs in different states:

    • user-first-job-failed: filter-status is FINISHED and FAILED.
    • user-second-job-success: filter-status is FINISHED and SUCCESSFUL.
    • user-third-job-running: filter-status is RUNNING.
    • user-fourth-job-in-queue: filter-status is QUEUED.
    • user-fifth-job-preempted: filter-status is QUEUED.
    Jobset Name                     Created Time           Priority   TPU VMs Needed   TPU VMs Running/Ran   TPU VMs Done      Status     Status Message                                                  Status Time
    user-first-job-failed           2023-1-1T1:00:00Z      medium     4                4                     <none>            Finished   JobSet failed                                                   2023-1-1T1:05:00Z
    user-second-job-success         2023-1-1T1:10:00Z      medium     4                4                     4                 Finished   JobSet finished successfully                                    2023-1-1T1:14:00Z
    user-third-job-running          2023-1-1T1:15:00Z      medium     4                4                     <none>            Admitted   Admitted by ClusterQueue cluster-queue                          2023-1-1T1:16:00Z
    user-fourth-job-in-queue        2023-1-1T1:16:05Z      medium     4                <none>                <none>            Admitted   couldn't assign flavors to pod set slice-job: insufficient unused quota for google.com/tpu in flavor 2xv4-8, 4 more needed   2023-1-1T1:16:10Z
    user-fifth-job-preempted        2023-1-1T1:10:05Z      low        4                <none>                <none>            Evicted    Preempted to accommodate a higher priority Workload             2023-1-1T1:10:00Z
    
  • Workload List supports filtering, so you can view only the jobs that match your criteria.

    • Filter by Status: filter-by-status

    Filter the workload list by the status of each job. Status can be one of: EVERYTHING, FINISHED, RUNNING, QUEUED, FAILED, SUCCESSFUL
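
    For example, to list only queued workloads (following the same flag pattern as the filter-by-job example below):

    python3 xpk.py workload list \
    --cluster xpk-test --filter-by-status=QUEUED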

    • Filter by Job: filter-by-job

    Filter the workload list by the name of a job.

    python3 xpk.py workload list \
    --cluster xpk-test --filter-by-job=$USER
    

How to add docker images to an xpk workload

By default, xpk workload create layers the local directory (--script-dir) into the base docker image (--base-docker-image) and runs the workload command. If you don't want this layering behavior, you can use --docker-image directly. Do not mix arguments from the two flows in the same command.

Recommended / Default Docker Flow: --base-docker-image and --script-dir

This flow pulls the --script-dir into the --base-docker-image and runs the new docker image.

  • The below arguments are optional. By default, xpk pulls the local directory into a generic base docker image.

    • --base-docker-image sets the base image that xpk will start with.

    • --script-dir sets which directory to pull into the image. This defaults to the current working directory.

    See python3 xpk.py workload create --help for more info.

  • Example with defaults which pulls the local directory into the base image:

    echo -e '#!/bin/bash \n echo "Hello world from a test script!"' > test.sh
    python3 xpk.py workload create --cluster xpk-test \
    --workload xpk-test-workload-base-image --command "bash test.sh" \
    --tpu-type=v5litepod-16 --num-slices=1
    
  • Recommended Flow For Normal Sized Jobs (fewer than 10k accelerators):

    python3 xpk.py workload create --cluster xpk-test \
    --workload xpk-test-workload-base-image --command "bash custom_script.sh" \
    --base-docker-image=gcr.io/your_dependencies_docker_image \
    --tpu-type=v5litepod-16 --num-slices=1
    

Optional Direct Docker Image Configuration: --docker-image

If you want to set the docker image directly rather than layering in the current working directory, set --docker-image to the image to be used in the workload.

  • Running with --docker-image:

    python3 xpk.py workload create --cluster xpk-test \
    --workload xpk-test-workload-base-image --command "bash test.sh" \
    --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image
    
  • Recommended Flow For Large Sized Jobs (more than 10k accelerators):

    python3 xpk.py cluster cacheimage \
    --cluster xpk-test --docker-image gcr.io/your_docker_image
    # Run workload create with the same image.
    python3 xpk.py workload create --cluster xpk-test \
    --workload xpk-test-workload-base-image --command "bash test.sh" \
    --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image
    

More advanced facts:

  • Workload create accepts --docker-name and --docker-image flags. By using custom images you can achieve very fast boots and hence very fast feedback.

  • Workload create accepts an --env-file flag for specifying the container's environment from a file. Usage is the same as Docker's --env-file flag (see the sketch after this list).

  • Workload create accepts a --debug-dump-gcs flag, which takes a path to a GCS bucket. Passing this flag sets XLA_FLAGS='--xla_dump_to=/tmp/xla_dump/' and uploads the HLO dumps from each worker to the specified GCS bucket (see the sketch after this list).
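
As a minimal sketch combining these two flags (the env file name, variable, and bucket path are placeholders; per the note above, the env file follows Docker's one VARIABLE=value per line format):

    # env.txt contains one variable per line, e.g.:
    #   MY_SETTING=value
    python3 xpk.py workload create --cluster xpk-test \
    --workload xpk-test-workload --command "bash test.sh" \
    --tpu-type=v5litepod-16 --num-slices=1 \
    --env-file=env.txt \
    --debug-dump-gcs=gs://your-bucket/xla-dumps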

Troubleshooting

Invalid machine type for CPUs.

XPK creates a regional GKE cluster. If you see an error like

Invalid machine type e2-standard-32 in zone $ZONE_NAME

select a CPU type that exists in all zones in the region:

# Find CPU Types supported in zones.
gcloud compute machine-types list --zones=$ZONE_LIST
# Adjust default cpu machine type.
python3 xpk.py cluster create --cluster-cpu-machine-type=CPU_TYPE ...
