rocm-smi-exporter
Export rocm-smi metrics as prometheus metrics
Docker
docker build . -t powerml/rocm-smi-exporter
# Log in with the powerml repo credentials from 1Password, then push
docker login
docker push powerml/rocm-smi-exporter
# -v /opt:/opt makes the ROCm runtime libraries available to the container
docker run -d --rm --name=smi-exporter -p 9001:9001 \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    -v /opt:/opt powerml/rocm-smi-exporter
# You should see the ROCM_* prefixed metrics
curl localhost:9001/metrics
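The curl check can also be scripted. The sketch below is a minimal, stdlib-only illustration, assuming the exporter is reachable at http://localhost:9001/metrics as in the docker run command. The helper names `filter_rocm_metrics` and `fetch_metrics` are hypothetical and not part of this package.

```python
from urllib.request import urlopen


def filter_rocm_metrics(exposition_text, prefix="ROCM_"):
    """Return the sample lines whose metric name starts with `prefix`.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    """
    lines = []
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(prefix):
            lines.append(line)
    return lines


def fetch_metrics(url="http://localhost:9001/metrics", timeout=5.0):
    """Fetch the raw text served by the exporter (the URL is an assumption)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the container running, `filter_rocm_metrics(fetch_metrics())` should return a non-empty list. An empty list suggests the exporter cannot see the GPUs or the ROCm libraries under /opt.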
Integration with kube-prometheus-stack
This section assumes kube-prometheus-stack is already deployed on your Kubernetes cluster. Verify that with:
helm list -n <namespace-of-kube-prometheus-stack>
Deploy exporter sidecar container
NOTE: You'll need an imagePullSecret to pull powerml/rocm-smi-exporter.
First, deploy the exporter as a sidecar container in the pods that request AMD GPUs.
Use the template below. You'll need to add:
- rocm-smi-exporter-sidecar to your pod's containers section
- host-opt to your pod's volumes section
apiVersion: v1
kind: Pod
metadata:
  name: {{ pod-name }}
  labels:
    exporter: rocm-smi
spec:
  containers:
    - name: {{ main-container-name }}
      image: {{ main-image }}
      resources:
        limits:
          amd.com/gpu: {{ count }}
    # Add the following sidecar container to your pod
    - name: rocm-smi-exporter-sidecar
      image: powerml/rocm-smi-exporter
      # This port must be the same as the promPort
      args: ["lamini-rocm-smi-exporter", "--port=9001"]
      imagePullPolicy: Always
      ports:
        - containerPort: 9001
          name: prometheus
      volumeMounts:
        - mountPath: /opt
          name: host-opt
  volumes:
    # Add an additional volume for the sidecar container
    - name: host-opt
      hostPath:
        # This is the ROCm installation directory
        path: /opt
        type: Directory
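If you generate pod manifests programmatically, the same additions can be applied to a manifest held as a Python dict. This is a rough sketch, not tooling shipped with the exporter: the function name `add_exporter_sidecar` is hypothetical, and the field values are copied from the template above.

```python
import copy

SIDECAR_NAME = "rocm-smi-exporter-sidecar"
VOLUME_NAME = "host-opt"


def add_exporter_sidecar(pod, port=9001):
    """Return a copy of `pod` with the exporter label, sidecar and volume added."""
    patched = copy.deepcopy(pod)

    # Label that the Prometheus scrape job filters on.
    labels = patched.setdefault("metadata", {}).setdefault("labels", {})
    labels["exporter"] = "rocm-smi"

    spec = patched.setdefault("spec", {})
    containers = spec.setdefault("containers", [])
    # Only append the sidecar once, so the function is safe to re-run.
    if not any(c.get("name") == SIDECAR_NAME for c in containers):
        containers.append({
            "name": SIDECAR_NAME,
            "image": "powerml/rocm-smi-exporter",
            "args": ["lamini-rocm-smi-exporter", f"--port={port}"],
            "imagePullPolicy": "Always",
            "ports": [{"containerPort": port, "name": "prometheus"}],
            "volumeMounts": [{"mountPath": "/opt", "name": VOLUME_NAME}],
        })

    volumes = spec.setdefault("volumes", [])
    if not any(v.get("name") == VOLUME_NAME for v in volumes):
        volumes.append({
            "name": VOLUME_NAME,
            # /opt on the host is the ROCm installation directory.
            "hostPath": {"path": "/opt", "type": "Directory"},
        })
    return patched
```

The input dict is left untouched, and calling the function on an already patched manifest returns it unchanged.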
Upgrade kube-prometheus-stack helm values to pick up metrics from the sidecar
In your kube-prometheus-stack's values.yaml file, add the following rocm-smi-exporter job to the prometheus.prometheusSpec.additionalScrapeConfigs section. You can find the location of this section in the example in the official repo.
An example is provided below.
# This is a sample configuration.
# Save this to values.yaml and upgrade with helm.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: rocm-smi-exporter
        scrape_interval: 1s
        metrics_path: /metrics
        scheme: http
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_exporter]
            action: keep
            regex: rocm-smi
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_node_name]
            target_label: node
          - source_labels: [__meta_kubernetes_pod_ip]
            target_label: instance
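To see what these relabel rules do to a discovered pod, the sketch below simulates them in plain Python. It is a simplified model of Prometheus relabeling that covers only the `keep` action and the default `replace` action with its default regex. The function name `relabel` and the pod values in the usage note are made up for illustration.

```python
import re

# The relabel_configs from the sample values.yaml, written as Python dicts.
RELABEL_RULES = [
    {"source_labels": ["__meta_kubernetes_pod_label_exporter"],
     "action": "keep", "regex": "rocm-smi"},
    {"source_labels": ["__meta_kubernetes_pod_name"], "target_label": "pod"},
    {"source_labels": ["__meta_kubernetes_namespace"], "target_label": "namespace"},
    {"source_labels": ["__meta_kubernetes_pod_node_name"], "target_label": "node"},
    {"source_labels": ["__meta_kubernetes_pod_ip"], "target_label": "instance"},
]


def relabel(discovered, rules=RELABEL_RULES):
    """Apply a simplified subset of Prometheus relabeling to one target.

    Returns the final label dict, or None if a `keep` rule drops the target.
    """
    labels = dict(discovered)
    for rule in rules:
        # Source label values are joined with ";" (the default separator).
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        action = rule.get("action", "replace")
        if action == "keep":
            # The regex is anchored, so the whole value must match.
            if re.fullmatch(rule["regex"], value) is None:
                return None
        elif action == "replace":
            # With the default regex and replacement the value is copied as-is;
            # an empty result removes the target label.
            if value:
                labels[rule["target_label"]] = value
            else:
                labels.pop(rule["target_label"], None)
        else:
            raise ValueError(f"unsupported action in this sketch: {action}")
    # Labels starting with "__" are dropped once relabeling has finished.
    return {k: v for k, v in labels.items() if not k.startswith("__")}
```

A pod labeled `exporter: rocm-smi` ends up with `pod`, `namespace`, `node` and `instance` labels, while a pod without that label returns `None`, meaning Prometheus does not scrape it.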
Save the above YAML to values.yaml and upgrade your own Helm release. First, list the releases:
helm list -n <namespace>
The output lists each release with its name and chart version. Note down the release name and version; in this example they are prometheus and 61.3.0.
Then upgrade the release with the command below:
helm upgrade <release-name> prometheus-community/kube-prometheus-stack -f values.yaml --version <version>
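The release name and chart version can also be read from `helm list -n <namespace> -o json` instead of by eye. The sketch below assumes the `name` and `chart` fields of that JSON output, with `chart` formatted as `<chart-name>-<version>`. The helper names `find_release` and `upgrade_command` are hypothetical.

```python
import json
import re
import shlex

# Splits e.g. "kube-prometheus-stack-61.3.0" into chart name and version.
CHART_RE = re.compile(r"^(?P<name>.+)-(?P<version>\d+\.\d+\.\d+.*)$")


def find_release(helm_list_json, chart_name="kube-prometheus-stack"):
    """Return (release_name, chart_version) for the first matching chart, or None."""
    for release in json.loads(helm_list_json):
        match = CHART_RE.match(release.get("chart", ""))
        if match and match.group("name") == chart_name:
            return release["name"], match.group("version")
    return None


def upgrade_command(release_name, version, values_file="values.yaml"):
    """Build the helm upgrade command line shown above as a shell-safe string."""
    argv = [
        "helm", "upgrade", release_name,
        "prometheus-community/kube-prometheus-stack",
        "-f", values_file,
        "--version", version,
    ]
    return shlex.join(argv)
```

The returned string can be reviewed before running it. Add `-n <namespace>` yourself if the release does not live in your current namespace.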
Verify in Grafana
Open Grafana and click Explore -> Prometheus -> Select Metric -> ROCM_*. You should see the captured metrics:
METRIC_GPU_UTIL = "ROCM_SMI_DEV_GPU_UTIL"
METRIC_GPU_MEM_TOTAL = "ROCM_SMI_DEV_GPU_MEM_TOTAL"
METRIC_GPU_MEM_USED = "ROCM_SMI_DEV_GPU_MEM_USED"
METRIC_GPU_MEM_UTIL = "ROCM_SMI_DEV_MEM_UTIL"
METRIC_GPU_POWER = "ROCM_SMI_DEV_POWER"
METRIC_GPU_CU_OCCUPANCY = "ROCM_SMI_DEV_CU_OCCUPANCY"
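The same metrics can be checked without Grafana through the Prometheus HTTP API. The sketch below is a minimal stdlib example of an instant query against `/api/v1/query`. The Prometheus base URL is an assumption about your setup (for example a port-forwarded `http://localhost:9090`), and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, metric):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urlencode({"query": metric})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def parse_instant_vector(body):
    """Turn an instant-query JSON response into (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each sample carries its label set and a [timestamp, "value"] pair.
    return [(sample["metric"], float(sample["value"][1])) for sample in data["result"]]


def query_metric(base_url, metric="ROCM_SMI_DEV_GPU_UTIL", timeout=5.0):
    """Run the instant query against a reachable Prometheus server."""
    with urlopen(build_query_url(base_url, metric), timeout=timeout) as response:
        return parse_instant_vector(response.read().decode("utf-8"))
```

An empty result list means Prometheus has no samples for that metric yet, which usually points at the scrape job or the pod label rather than at the exporter itself.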
You can import the GPUs Grafana Dashboard into your Grafana instance.
Extra
- Add args to systemd service
  - The python code accepts --port and other arguments
  - If needed, set its value when launching systemd service