GPU contorl and failure notification/recovery

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

GPU Monitoring and Failure Notification/Recovery

While in minging crytocurrency with GPU, although the GPU has built-in thermal control, but it might not be fitted in some environment and cause hash rate decreases or even GPU faults. This project provides a tool to:

According the detected temperature to change fan speed
Execute custom action scripts while current temperature of GPU(s) is over a defined threshold
Execute custom action scripts while current hash rate of GPU(s) is under a defined threshold

Currently, the example of action script are based on the miner's remote managemen API, for different miner, it might be needed to customize locally.

Environment and Installation

Intel CPU/16G RAM Ubuntu 18.4/Python3

Environment setup

For configuring Nvidia's GPU (ie. fan), it is needed to set the following environment.

In ~/.profile and/or ~/.bashrc, add:

export DISPLAY=:0
export NO_AT_BRIDGE=1
export XAUTHORITY=/var/run/lightdm/root/:0

In some shell script examples, it is needed to invoke the miner executive:

export PHOENIX_PATH=<path-to-phoenixminer-root>
export NSF_PATH=<path-to-nsfxminer-root>

and also the wallet and pool address:

export ETH_WALLET=<your-ethereum-wallet-address> # 0x
export ETH_POOL=<your-favorite-pool> # ex. asia1.ethermine.org:4444

Installation from PyPI

Setup Python vitual environment:

cd gpuctl
python3 -m venv venv
source ./venv/bin/activate

pip insall gpuctl

Installation from source

git clone https://github.com/Ed-Yang/gpuctl

Setup Python vitual environment:

cd gpuctl
python3 -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt

Install gpuctl:

pip install .

After completed the above procedure, before you run the gpuctl, you only need to to run:

source ./venv/bin/activate

Interaction with miners

For the the under rate detection feature, it need to retreive the current hashrate of miner through the network management API of miner.

The following exapmple setup a netowrk management port on TCP port 3333.

ethminer/nsfminer, additional parameters:

--HWMON 2 --api-port -3333
or
--HWMON 2 --api-bind 127.0.0.1:3333
Phoenixminer, additional parameters (check section 3):

-cdm 2 -cdmport 3333

To test if the miner's network management function is correctly enabled:

```shell
./scripts/get-rate.sh 3333
```

Example of script for running PhonenixMiner and nsfminer

For utilizing the miner's script, it have to define:

```
export PHOENIX_PATH=<path-to-phoenixminer-root>
export NSF_PATH=<path-to-nsfxminer-root>

export ETH_WALLET=<your-ethereum-wallet-address> # 0x
export ETH_POOL=<your-favorite-pool> # ex. asia1.ethermine.org:4444
```

pm-dev.sh <n>: start PhoenixMiner on device <n>, n is starting from zero
pm-all.sh : start PhoenixMiner all GPU devices
nsf-dev.sh <n>: start nsfminer on device <n>, n is starting from zero
nsf-all.sh : start PhoenixMiner all GPU devices

GpuCtl

Some parameters are applying to every GPU on whole system (like interval, curve, etc.), if it is necessary to provide specific setting for a GPU, it is able to run seperate gpuctl instance with expected parametets.

Usage

gpuctl --help

usage: gpuctl [-h] [-l] [-s SLOTS] [-a] [-n] [--interval INTERVAL]
            [--wait WAIT] [--set-speed [0-100]] [-f FAN] [-d DELTA]
            [--curve CURVE] [--temp TEMP] [--tas TAS] [--scan] [-v] [-V]

optional arguments:
-h, --help            show this help message and exit
-l, --list            list all GPU cards
-s SLOTS, --slots SLOTS
                        use PCI slot name to locate GPU (ie.
                        0000:01:00.0/0000:01:00.1)
-a, --amd             only use AMD GPU
-n, --nvidia          only use Nvidia GPU
--interval INTERVAL   monitoring interval
--wait WAIT           seconds before report failure
--set-speed [0-100]   set the fan speed (0~100)
-f FAN, --fan FAN     if temperature is exceed than FAN once, activate fan
                        control (default:70)
-d DELTA, --delta DELTA
                        set fan speed if temperature diff % is over DELTA
                        (defaut:2)
--curve CURVE         set temp/fan-speed curve (ie. 0:0/10:10/80:100)
--temp TEMP           over temperature action threshold
--tas TAS             over temperature action script
--scan                list miner through network inquiry
-v, --verbose         show debug message
-V, --version         show version info

Slot Name

The slot name of each GPU card could be found by using "lspci -D" command. In the following output, the slot name of AMD GPU card is "0000:01:00.0".

```shell
lspci -D
```

```shell
0000:01:00.0 VGA compatible controller....
```

Action scripts

A few examples of action script are provided for reference, besides it is feasible to write a script to send syslog, email or telegram message, etc.

scripts/gpu-failure.sh: shutdown system and schedule reboot
scripts/restart.sh: restart miner
scripts/reboot.sh: reboot rig

If a failure is detected, the gpuctl will invoke the given script with slot name as argument.

Examples

Example 1) List status of GPU cards

gpuctl --list

ID Slot Name    Vendor   PCI-ID      Temp. Fan  PWR    Working
-- ------------ -------- ----------- ----- ---- ------ -------
1 0000:01:00.0 NVIDIA   [10DE:1C03]  54c  10%  73.63w True
2 0000:0b:00.0 AMD      [1002:67DF]  61c  60%  80.00w True
3 0000:0d:00.0 NVIDIA   [10DE:1C03]  47c   0%  76.18w True

Retreive GPU's status through network management API:

Note, if the miner is running as root priviledge, it must invoke with "sudo".

sudo gpuctl --scan

Miner : nsfminer-1.3.12+, tcp port: 3333, pid: 10208
Uptime: 960s
Rate(kh) Temp Fan  
======== ==== ==== 
18800  58c  11%
26939  60c  60%
19933  48c   0%

Example 2) Set the fan speed of all GPUs to 50%

sudo gpuctl --set-speed 50

ID Slot Name    Vendor   PCI-ID      Temp. Fan  PWR     Working
-- ------------ -------- ----------- ----- ---- ------- -------
1 0000:01:00.0  NVIDIA   [10DE:1C03]   61c  50%  74.76w True
2 0000:0b:00.0  AMD      [1002:67DF]   78c  47%  81.00w True
3 0000:0d:00.0  NVIDIA   [10DE:1C03]   49c  50%  72.60w True

Example 3) If the temperature of a GPU is over 65c, activate the fan speed control for the specific GPU.

sudo gpuctl --fan 65

ID Slot Name    Vendor   PCI-ID      Temp. Fan  PWR     Working
-- ------------ -------- ----------- ----- ---- ------- -------
1 0000:01:00.0 NVIDIA   [10DE:1C03]   60c  50%  76.08w True
2 0000:0b:00.0 AMD      [1002:67DF]   77c  80%  81.00w True
3 0000:0d:00.0 NVIDIA   [10DE:1C03]   49c  50%  74.41w True


gpuctl: started

12:22:25 INFO     query intervel: 5 wait-time: 120
12:22:25 INFO     temperature threshold: 85
12:22:25 INFO     fan control threshold: 65
12:22:25 INFO     script: None


12:22:30 INFO     [0000:0b:00.0/AMD] current temp. 75c set speed 84%
12:22:35 INFO     [0000:0b:00.0/AMD] current temp. 72c set speed 72%
12:22:50 INFO     [0000:0b:00.0/AMD] current temp. 76c set speed 84%
12:23:00 INFO     [0000:0b:00.0/AMD] current temp. 73c set speed 72%
12:23:10 INFO     [0000:0b:00.0/AMD] current temp. 70c set speed 72%
12:23:36 INFO     [0000:0b:00.0/AMD] current temp. 67c set speed 61%

Example 4) nsfminer If the temeprature of a GPU is over 75c for 10s, call action script "restart.sh" to restart miner

Note, the Phoenix might not be corectly restared through network command, so this mechanism only suitable for nsfminer.

sudo gpuctl --temp 75 --wait 10 --tas ./scripts/restart.sh

ID Slot Name    Vendor   PCI-ID      Temp. Fan  PWR     Working
-- ------------ -------- ----------- ----- ---- ------- -------
1 0000:01:00.0 NVIDIA   [10DE:1C03]   52c  11%  73.90w True
2 0000:0b:00.0 AMD      [1002:67DF]   56c  60%  80.00w True
3 0000:0d:00.0 NVIDIA   [10DE:1C03]   46c   0%  75.48w True


gpuctl: started

12:49:00 INFO     query intervel: 5 wait-time: 10
12:49:00 INFO     temperature threshold: 75
12:49:00 INFO     fan control threshold: None
12:49:00 INFO     script: ./scripts/restart.sh


12:51:07 WARNING  [0000:0b:00.0/AMD] over temperature 75c, exec ./scripts/restart.sh
12:51:07 INFO     exec script ./scripts/restart.sh 0000:0b:00.0 no_wait False

ID Slot Name    Vendor   PCI-ID      Temp. Fan  PWR     Working
-- ------------ -------- ----------- ----- ---- ------- -------
1 0000:0b:00.0 AMD      [1002:67DF]   78c  60%  82.00w True

In the nsfminer's terminal:

12:51:07 miner API : Method miner_restart requested
12:51:07 miner Restart miners...
12:51:07 miner Shutting down miners...
12:51:07 miner Spinning up miners...

Example 5) For every GPU, if its temeprature is over 75c for 10s, call failure action script

Note, invoking gpu-failure.sh will shutdown the system and schedulee to restart at 5 minutes later.

sudo gpuctl --temp 75 --wait 10 --tas ./scripts/gpu-failure.sh

ID Slot Name    Vendor   PCI-ID      Temp. Fan  PWR     Working
-- ------------ -------- ----------- ----- ---- ------- -------
1 0000:01:00.0 NVIDIA   [10DE:1C03]   56c  11%  75.99w True
2 0000:0b:00.0 AMD      [1002:67DF]   61c  60%  81.00w True
3 0000:0d:00.0 NVIDIA   [10DE:1C03]   48c   0%  75.49w True


gpuctl: started

13:11:17 INFO     query intervel: 5 wait-time: 10
13:11:17 INFO     temperature threshold: 75
13:11:17 INFO     fan control threshold: None
13:11:17 INFO     script: ./scripts/gpu-failure.sh


13:12:23 WARNING  [0000:0b:00.0/AMD] over temperature 75c, exec ./scripts/gpu-failure.sh
13:12:23 INFO     exec script ./scripts/gpu-failure.sh 0000:0b:00.0 no_wait False

EthCtl

Usage

ethctl --help

usage: ethctl [-h] [-l] [-b BASE] [--interval INTERVAL] [-w WAIT] [-t TEMP]
            [-r RATE] [--rmode RMODE] [-d DELAY] [-s SCRIPT] [-v] [-V]

optional arguments:
-h, --help            show this help message and exit
-l, --list            list all miners
-b BASE, --base BASE  tcp port offset, ie. device N is listened on base + N
--interval INTERVAL   monitoring interval
-w WAIT, --wait WAIT  count down interval
-t TEMP, --temp TEMP  over temperature action threshold
-r RATE, --rate RATE  under rate threshold (default: 2000 kh)
--rmode RMODE         failure restart, 0: none, 1: net restart, 2: kill
-d DELAY, --delay DELAY
                        delay before calling the action script
-s SCRIPT, --script SCRIPT
                        calling to script on failure
-v, --verbose         show debug message
-V, --version         show version info

Examples

Example 1) List all miner with network management function enabled

ethctl --list

Miner : nsfminer-1.3.12+, tcp port: 3333, pid: 19526
Uptime: 92100s
Rate(kh) Temp Fan  
======== ==== ==== 
   18794  58c  50%
   26624  62c  47%
   19898  47c  50%

Example 2) PhoenixMiner if the highest of temeprature of miner's GPU(s) is over 75c, or the lowest hashrate under 2 Mh/s call for 120s, kill the miner's process and call action script to restart miner

Use PhoenixMiner 5.3b as example:

Start miner:

```shell
./scripts/pm-all.sh
```

Open a new terminal and run:

```shell
ethctl -t 75 -r 2000 --rmode 2 -d 120 -s ./scripts/pm-all.sh
```

```shell
Miner : 'PM 5.3b'   
Uptime: 0s
Rate(kh) Temp Fan  
======== ==== ==== 
       0  48c  48%


ethctl: started

23:29:52 INFO     query intervel: 5 wait-time: 120
23:29:52 INFO     temperature: 75c hashrate: 2000 kh/s
23:29:52 INFO     restart mode: 2 script: ./scripts/pm-all.sh


23:30:43 INFO     add miner 'PM 5.3b - ETH':3333 pid 27381
23:33:20 WARNING  'PM 5.3b - ETH':3333 dev 0 rate 0 under threshold
23:33:20 WARNING  'PM 5.3b - ETH':3333 dev 0 temp [72] rate [0]
23:33:20 INFO     'PM 5.3b - ETH':3333 dev 0 restarting pid 27381 mode 2 
23:33:20 INFO     'PM 5.3b - ETH':3333 miner removed
23:33:20 INFO     'PM 5.3b - ETH':3333 delay 120 exec ./scripts/pm-all.sh 0 3333 72 0
```

Example 3) nsfminer if the highest of temeprature of miner's GPU(s) is over 75c, or the lowest hashrate under 2 Mh/s call for 60s, restart the miner's process

Use PhoenixMiner 5.3b as example:

Start miner:

```shell
./scripts/nsf-dev.sh 1
```

Open a new terminal and run:

```shell
ethctl -t 75 -r 2000 --rmode 2 -w 30 --delay 120 -s ./scripts/nsf-all.sh
```

```shell
Miner : 'PM 5.3b'   
Uptime: 0s
Rate(kh) Temp Fan  
======== ==== ==== 
    0  48c  48%


ethctl: started

23:29:52 INFO     query intervel: 5 wait-time: 120
23:29:52 INFO     temperature: 75c hashrate: 2000 kh/s
23:29:52 INFO     restart mode: 2 script: ./scripts/pm-all.sh


23:30:43 INFO     add miner 'PM 5.3b - ETH':3333 pid 27381
23:33:20 WARNING  'PM 5.3b - ETH':3333 dev 0 rate 0 under threshold
23:33:20 WARNING  'PM 5.3b - ETH':3333 dev 0 temp [72] rate [0]
23:33:20 INFO     'PM 5.3b - ETH':3333 dev 0 restarting pid 27381 mode 2 
23:33:20 INFO     'PM 5.3b - ETH':3333 miner removed
23:33:20 INFO     'PM 5.3b - ETH':3333 delay 120 exec ./scripts/pm-all.sh 0 3333 72 0
```

Run Test Cause
```
python3 -m unittest discover tests
```

Diagnostics

Monitor AMD GPU card
```
# HiveOS
sudo watch -c -n 2 amd-info
```

Monitor Nvidia GPU card

# HiveOS
sudo watch -c -n 2 nvidia-info

sudo watch -c -n 2 nvidia-smi

Q/A

nvidia Unable to init server: Could not connect: Connection refused

In ~/.profile and/or ~/.bashrc, add:
```
export DISPLAY=:0
export XAUTHORITY=/var/run/lightdm/root/:0
```
(nvidia-settings:15781): dbind-WARNING **: 04:46:56.622....

In ~/.profile and/or ~/.bashrc, add:
```
export NO_AT_BRIDGE=1
```

Reference

Project details

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

0.3.6

Apr 17, 2021

0.3.5

Apr 14, 2021

0.3.4

Apr 11, 2021

0.3.3

Apr 11, 2021

0.3.2

Apr 11, 2021

0.3.1

Apr 11, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gpuctl-0.3.6.tar.gz (22.3 kB view hashes)

Uploaded Apr 17, 2021 Source

Built Distribution

gpuctl-0.3.6-py3-none-any.whl (26.7 kB view hashes)

Uploaded Apr 17, 2021 Python 3

Hashes for gpuctl-0.3.6.tar.gz

Hashes for gpuctl-0.3.6.tar.gz
Algorithm	Hash digest
SHA256	`d15b6397fc3923e20f1aad5c1db32a9cfc8d8c2c4279ab4adbbafa622c2960ea`
MD5	`15976d48e785e5e19dca8737bd830f22`
BLAKE2b-256	`baf39309d284dfee67fc4cf5972b650fea80f9f5285767abccaa9723435a0c9f`

Hashes for gpuctl-0.3.6-py3-none-any.whl

Hashes for gpuctl-0.3.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e841703f3ddb8bccf755280441f6f787b998d2f70068f24f4b0243ba6074baad`
MD5	`d57dee2655aadb7bad1dc96606eeea91`
BLAKE2b-256	`e042fc818a8d5cd60d584307770dc5f514d355751092baeb554aa06f199bae2b`