gpu_tester

Gpu tester finds all your bad gpus.

Works on slurm.

Features:

  • runs a forward pass on each GPU
  • checks for GPUs returning incorrect results
  • checks for GPUs failing due to ECC errors

Roadmap:

  • sanity check forward speed
  • sanity check broadcast speed

Install

Create a venv:

python3 -m venv .env
source .env/bin/activate
pip install -U pip

Then:

pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu116
pip install gpu_tester

Python examples

See the project's examples for calling this as a library.

Output

Output looks like this:

job succeeded
0 have incorrect results, 1 have gpu errors and 319 succeeded
incorrect results:
[]
gpu errors:
[['gpu_error', 'compute-od-gpu-st-p4d-24xlarge-156', '3']]
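
The failing hosts in a report like the one above can be collected into a value for the exclude option. The following is a minimal sketch, assuming every non-empty list line in the report holds entries of the form [kind, hostname, index] as in the sample; failing_hosts is a hypothetical helper, not part of gpu_tester, and the meaning of the third field is not stated here.

```python
import ast
from typing import List


def failing_hosts(report: str) -> List[str]:
    """Return the sorted, de-duplicated hostnames found in the report's list lines."""
    hosts = set()
    for line in report.splitlines():
        line = line.strip()
        if not line.startswith("[["):
            continue  # skip text lines and empty lists such as "[]"
        for entry in ast.literal_eval(line):
            # Assumed entry layout: [kind, hostname, index]
            hosts.add(entry[1])
    return sorted(hosts)


sample = """job succeeded
0 have incorrect results, 1 have gpu errors and 319 succeeded
incorrect results:
[]
gpu errors:
[['gpu_error', 'compute-od-gpu-st-p4d-24xlarge-156', '3']]
"""

# A comma-separated host list is one form Slurm accepts for excluding nodes.
print(",".join(failing_hosts(sample)))
```

The joined string can then be passed as the exclude argument on the next run, so that known-bad nodes are skipped.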

Recommended testing strategy

Pair based strategy

The easiest way to quickly spot broken nodes is the pair-based strategy. It runs many two-node jobs in parallel and finds which nodes can talk to each other. Here is one example:

gpu_tester --nodes 2 --parallel-tests 50 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 45 --exclude 'gpu-st-p4d-24xlarge-[66]'

All at once strategy

Once you have validated that this works, you may want to try the DDP strategy over all nodes, e.g.:

gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 300 --exclude 'gpu-st-p4d-24xlarge-[66]'

Simple forward

If you want to only validate the forward functionality of gpus and not the communication, you may use:

gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "simple_forward" --job_timeout 50 --exclude 'gpu-st-p4d-24xlarge-[66]'
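
The three strategies above differ only in a few option values. As a rough sketch, the commands can be generated from one table and launched from Python; STRATEGIES and build_command are hypothetical names used for illustration, and the option spellings and values are copied from the example commands above.

```python
import shlex

# Option values taken from the example commands in this section.
STRATEGIES = {
    "pair": {"nodes": 2, "parallel-tests": 50, "test_kind": "ddp", "job_timeout": 45},
    "all_at_once": {"nodes": 100, "parallel-tests": 1, "test_kind": "ddp", "job_timeout": 300},
    "simple_forward": {"nodes": 100, "parallel-tests": 1, "test_kind": "simple_forward", "job_timeout": 50},
}


def build_command(strategy, partition="gpu", job_comment="laion", exclude=None):
    """Return the gpu_tester argument list for one of the strategies above."""
    argv = ["gpu_tester"]
    for option, value in STRATEGIES[strategy].items():
        argv += [f"--{option}", str(value)]
    argv += ["--partition", partition, "--job_comment", job_comment]
    if exclude:
        argv += ["--exclude", exclude]
    return argv


cmd = build_command("pair", exclude="gpu-st-p4d-24xlarge-[66]")
print(shlex.join(cmd))  # shell-quoted form, for copy and paste
# To actually submit the jobs: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with bracketed node ranges.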

API

This module exposes a single function gpu_tester which takes the same arguments as the command line tool:

  • cluster the cluster. (default slurm)
  • job_name slurm job name. (default gpu_tester)
  • partition slurm partition. (default compute-od-gpu)
  • gpu_per_node number of GPUs per node. (default 8)
  • nodes number of gpu nodes. (default 1)
  • output_folder the output folder. (default None which means current folder / results)
  • job_timeout job timeout (default 150 seconds)
  • job_comment optional comment arg given to slurm (default None)
  • job_account optional account arg given to slurm (default None)
  • test_kind simple_forward or ddp. simple_forward is a quick forward test; ddp uses PyTorch DDP to check the GPU interconnect (default simple_forward)
  • parallel_tests number of tests to run in parallel. Recommended to use that with nodes == 2 to test pair by pair (default 1)
  • nodelist node whitelist, example 'gpu-st-p4d-24xlarge-[66-67]' (default None)
  • exclude node blacklist, example 'gpu-st-p4d-24xlarge-[66-67]' (default None)
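
Calling the function from Python might look like the sketch below. It assumes the function can be imported as from gpu_tester import gpu_tester (inferred from the sentence above, not confirmed here) and that it accepts the listed parameters as keyword arguments. pair_test_kwargs and run_pair_test are hypothetical helpers; the import sits inside run_pair_test so the argument-building part runs without the package or a Slurm cluster.

```python
def pair_test_kwargs(partition, exclude=None):
    """Keyword arguments for a pair-by-pair DDP test, using parameters from the list above."""
    kwargs = {
        "cluster": "slurm",
        "partition": partition,
        "nodes": 2,            # test nodes two at a time
        "parallel_tests": 50,  # number of pair tests to run in parallel
        "test_kind": "ddp",    # exercise the GPU interconnect
        "job_timeout": 45,
    }
    if exclude is not None:
        kwargs["exclude"] = exclude
    return kwargs


def run_pair_test():
    # Assumed import path; needs gpu_tester installed and access to a Slurm cluster.
    from gpu_tester import gpu_tester

    gpu_tester(**pair_test_kwargs("gpu", exclude="gpu-st-p4d-24xlarge-[66]"))


print(pair_test_kwargs("gpu"))
```

Parameters left out of the dictionary fall back to the defaults listed above.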

For development

Either locally, or in gitpod (do export PIP_USER=false there)

Setup a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

to run tests:

pip install -r requirements-test.txt

then

make lint
make test

You can use make black to reformat the code

python -m pytest -x -s -v tests -k "dummy" to run a specific test
