gpu_tester
GPU tester finds all your bad GPUs.
It works on Slurm clusters.
Features:
- runs a forward pass on each GPU
- checks for GPUs returning incorrect results
- checks for GPUs failing due to ECC errors
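To illustrate the incorrect-results check, here is a minimal sketch of one way per-GPU outputs could be compared against a known reference value. It is not the package's actual implementation: the helper name find_incorrect, the (hostname, gpu_index) keys, the sample values, and the tolerance are all assumptions made for illustration.

```python
import math


def find_incorrect(results, expected, rel_tol=1e-5):
    """Return the (hostname, gpu_index) keys whose output deviates from expected.

    Hypothetical helper: `results` maps (hostname, gpu_index) to the scalar
    produced by a deterministic forward pass on that GPU.
    """
    return [
        key
        for key, value in sorted(results.items())
        # math.isclose is False for NaN, so NaN outputs are flagged as well.
        if not math.isclose(value, expected, rel_tol=rel_tol)
    ]


# Made-up sample data: two GPUs agree with the reference, one does not.
results = {
    ("node-a", 0): 0.731059,
    ("node-a", 1): 0.731059,
    ("node-b", 0): 0.912345,  # deliberately wrong value for illustration
}
bad = find_incorrect(results, expected=0.731059)
print(bad)
```

Comparing against a fixed reference rather than a majority vote keeps the check meaningful even when most GPUs in a small job are faulty.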
Roadmap:
- sanity check forward speed
- sanity check broadcast speed
Install
pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu116
then
pip install gpu_tester
Python examples
Check out the project's examples to see how to call this as a library.
Output
Output looks like this:
job succeeded
0 have incorrect results, 1 have gpu errors and 319 succeeded
incorrect results:
[]
gpu errors:
[['gpu_error', 'compute-od-gpu-st-p4d-24xlarge-156', '3']]
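Each entry in the lists above appears to have the form [status, hostname, GPU index]. As a minimal sketch, assuming you have copied such a list into Python (whether gpu_tester returns it in this form is not confirmed here), the following groups failing GPUs by node so that bad hosts can be identified quickly. The helper name group_failures_by_node is hypothetical.

```python
from collections import defaultdict


def group_failures_by_node(entries):
    """Map each hostname to the sorted list of GPU indices reported for it."""
    by_node = defaultdict(list)
    for _status, hostname, gpu_index in entries:
        # The index is printed as a string in the output, so convert it.
        by_node[hostname].append(int(gpu_index))
    return {host: sorted(ids) for host, ids in by_node.items()}


# Entry copied from the sample output above.
gpu_errors = [["gpu_error", "compute-od-gpu-st-p4d-24xlarge-156", "3"]]
failures = group_failures_by_node(gpu_errors)
print(failures)
```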
API
This module exposes a single function gpu_tester
which takes the same arguments as the command line tool:
- cluster the cluster type to run on. (default slurm)
- job_name the Slurm job name. (default gpu_tester)
- partition the Slurm partition. (default compute-od-gpu)
- gpu_per_node number of GPUs per node. (default 8)
- nodes number of GPU nodes. (default 1)
- output_folder the output folder. (default None, which means current folder / results)
- job_timeout job timeout in seconds. (default 300)
- sbatch_args optional additional sbatch and srun args, for example "--comment Laion". (default None)
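The argument list above can be sketched as a set of keyword arguments. This is a minimal sketch, not the package's own code: the helper build_args is hypothetical, the Python types of the defaults (for example integers for gpu_per_node and job_timeout) are assumptions, and the import path in the closing comment is assumed rather than confirmed.

```python
# Defaults as documented in the argument list; value types are assumed.
DEFAULTS = {
    "cluster": "slurm",
    "job_name": "gpu_tester",
    "partition": "compute-od-gpu",
    "gpu_per_node": 8,
    "nodes": 1,
    "output_folder": None,
    "job_timeout": 300,
    "sbatch_args": None,
}


def build_args(**overrides):
    """Merge overrides into the documented defaults, rejecting unknown names."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise TypeError(f"unknown arguments: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


args = build_args(nodes=4, sbatch_args="--comment Laion")
print(args)

# On a machine with gpu_tester installed and Slurm access, the call would be:
#   from gpu_tester import gpu_tester  # assumed import path
#   gpu_tester(**args)
```

Validating names before submission catches typos such as gpus_per_node locally instead of after the job has been queued.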
For development
Either locally, or in Gitpod (run export PIP_USER=false there).
Set up a virtualenv:
python3 -m venv .env
source .env/bin/activate
pip install -e .
To run tests:
pip install -r requirements-test.txt
then
make lint
make test
You can use make black to reformat the code.
To run a specific test:
python -m pytest -x -s -v tests -k "dummy"