rocm-smi-exporter

Export rocm-smi metrics as prometheus metrics

Docker

docker build . -t powerml/rocm-smi-exporter

# Log in with the powerml repo credentials stored in 1Password, then push the image
docker login
docker push powerml/rocm-smi-exporter

# -v /opt:/opt makes the host's ROCm runtime libraries available to the container
docker run -d --rm --name=smi-exporter -p 9001:9001 \
--device=/dev/kfd --device=/dev/dri --group-add video \
-v /opt:/opt powerml/rocm-smi-exporter

# You should see the ROCM_* prefixed metrics
curl localhost:9001/metrics
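
To run the same check from a script, the response can be filtered for the ROCM_ prefix. The sketch below assumes the exporter serves the standard Prometheus text exposition format at the URL used above. The helper names rocm_metric_lines and fetch_metrics are made up for illustration and are not part of the package.

```python
import urllib.request


def rocm_metric_lines(exposition_text, prefix="ROCM_"):
    """Return the sample lines whose metric name starts with `prefix`.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    """
    samples = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(prefix):
            samples.append(line)
    return samples


def fetch_metrics(url="http://localhost:9001/metrics", timeout=5):
    """Fetch the raw metrics text; requires the exporter container to be running."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With the container running, rocm_metric_lines(fetch_metrics()) should return a non-empty list. An empty list suggests the exporter cannot see the GPUs or the ROCm libraries.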

Integration with kube-prometheus-stack

This section assumes kube-prometheus-stack is already deployed on your Kubernetes cluster. Verify that with:

helm list -n <namespace-of-kube-prometheus-stack>

Deploy exporter sidecar container

NOTE: You need an imagePullSecret to pull powerml/rocm-smi-exporter.

First, deploy the exporter as a sidecar container in each pod that requests AMD GPUs. Use the template below. You'll need to add:

  • rocm-smi-exporter-sidecar to your pod's containers section
  • host-opt to your pod's volumes section

apiVersion: v1
kind: Pod
metadata:
  name: {{ pod-name }}
  labels:
    exporter: rocm-smi
spec:
  containers:
  - name: {{ main-container-name }}
    image: {{ main-image }}
    resources:
      limits:
        amd.com/gpu: {{ count }}
  # Add the following sidecar container to your pod
  - name: rocm-smi-exporter-sidecar
    image: powerml/rocm-smi-exporter
    # This port must be the same as the promPort
    args: ["lamini-rocm-smi-exporter", "--port=9001"]
    imagePullPolicy: Always
    ports:
    - containerPort: 9001
      name: prometheus
    volumeMounts:
    - mountPath: /opt
      name: host-opt
  volumes:
  # Add this additional volume for the sidecar container
  - name: host-opt
    hostPath:
      # This is the ROCm installation directory
      path: /opt
      type: Directory
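
If you generate pod manifests in code, the same additions can be applied programmatically. The following is a rough sketch and not part of the project: add_rocm_sidecar is a hypothetical helper that takes a pod manifest as a plain Python dict (for example, one loaded from YAML) and adds the sidecar container, the host-opt volume, and the exporter: rocm-smi label from the template above.

```python
import copy

# Values copied from the pod template above.
SIDECAR = {
    "name": "rocm-smi-exporter-sidecar",
    "image": "powerml/rocm-smi-exporter",
    "args": ["lamini-rocm-smi-exporter", "--port=9001"],
    "imagePullPolicy": "Always",
    "ports": [{"containerPort": 9001, "name": "prometheus"}],
    "volumeMounts": [{"mountPath": "/opt", "name": "host-opt"}],
}
HOST_OPT_VOLUME = {
    "name": "host-opt",
    "hostPath": {"path": "/opt", "type": "Directory"},
}


def add_rocm_sidecar(pod):
    """Return a copy of `pod` with the sidecar, volume and label added.

    Hypothetical helper. It is idempotent: applying it twice adds nothing new.
    """
    pod = copy.deepcopy(pod)
    pod.setdefault("metadata", {}).setdefault("labels", {})["exporter"] = "rocm-smi"
    spec = pod.setdefault("spec", {})
    containers = spec.setdefault("containers", [])
    if not any(c.get("name") == SIDECAR["name"] for c in containers):
        containers.append(copy.deepcopy(SIDECAR))
    volumes = spec.setdefault("volumes", [])
    if not any(v.get("name") == HOST_OPT_VOLUME["name"] for v in volumes):
        volumes.append(copy.deepcopy(HOST_OPT_VOLUME))
    return pod
```

The input dict is left untouched, so the original manifest can still be used for pods that do not request GPUs.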

Upgrade kube-prometheus-stack helm values to pick up metrics from the sidecar

In your kube-prometheus-stack's values.yaml file, add the following rocm-smi-exporter job to the prometheus.prometheusSpec.additionalScrapeConfigs section. You can find the location of this section in the example values file in the official repo. An example is provided below.

# This is a sample configuration.
# Save this to values.yaml and upgrade with helm.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: rocm-smi-exporter
        scrape_interval: 1s
        metrics_path: /metrics
        scheme: http
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_label_exporter]
          action: keep
          regex: rocm-smi
        - source_labels: [__meta_kubernetes_pod_name]
          target_label: pod
        - source_labels: [__meta_kubernetes_namespace]
          target_label: namespace
        - source_labels: [__meta_kubernetes_pod_node_name]
          target_label: node
        - source_labels: [__meta_kubernetes_pod_ip]
          target_label: instance
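
To see what the relabel rules above do to a discovered pod, here is a simplified model in Python. It is a sketch for illustration only, not Prometheus's implementation: it handles just the keep action and the default replace action with the default regex, and the relabel function name is made up. Like Prometheus, it matches the regex against the whole value.

```python
import re

# The relabel_configs from the sample values.yaml above, as Python data.
RELABEL_CONFIGS = [
    {"source_labels": ["__meta_kubernetes_pod_label_exporter"],
     "action": "keep", "regex": "rocm-smi"},
    {"source_labels": ["__meta_kubernetes_pod_name"], "target_label": "pod"},
    {"source_labels": ["__meta_kubernetes_namespace"], "target_label": "namespace"},
    {"source_labels": ["__meta_kubernetes_pod_node_name"], "target_label": "node"},
    {"source_labels": ["__meta_kubernetes_pod_ip"], "target_label": "instance"},
]


def relabel(labels, configs=RELABEL_CONFIGS):
    """Apply a simplified subset of relabeling; return None if the target is dropped."""
    labels = dict(labels)
    for cfg in configs:
        # Multiple source labels are joined with ";" (the Prometheus default separator).
        value = ";".join(labels.get(name, "") for name in cfg["source_labels"])
        action = cfg.get("action", "replace")
        if action == "keep":
            # The regex must match the entire value, hence fullmatch.
            if re.fullmatch(cfg["regex"], value) is None:
                return None
        elif action == "replace":
            # Simplification: with the default regex and replacement the value is copied.
            if value:
                labels[cfg["target_label"]] = value
            else:
                labels.pop(cfg["target_label"], None)
        else:
            raise ValueError(f"unsupported action in this sketch: {action}")
    return labels
```

A pod labeled exporter: rocm-smi is kept and gains pod, namespace, node and instance labels. Any other pod returns None, meaning it is not scraped by this job.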

Save the YAML above to values.yaml, then look up the Helm release you want to upgrade:

helm list -n <namespace>

The output lists your releases. Note down the release name and chart version; in the original example they are prometheus and 61.3.0.

Then upgrade the release with the command below:

helm upgrade <release-name> prometheus-community/kube-prometheus-stack -n <namespace> -f values.yaml --version <version>

Verify in Grafana

Open Grafana and click Explore -> Prometheus -> Select Metric -> ROCM_*. You should see the captured metrics:

METRIC_GPU_UTIL = "ROCM_SMI_DEV_GPU_UTIL"
METRIC_GPU_MEM_TOTAL = "ROCM_SMI_DEV_GPU_MEM_TOTAL"
METRIC_GPU_MEM_USED = "ROCM_SMI_DEV_GPU_MEM_USED"
METRIC_GPU_MEM_UTIL = "ROCM_SMI_DEV_MEM_UTIL"
METRIC_GPU_POWER = "ROCM_SMI_DEV_POWER"
METRIC_GPU_CU_OCCUPANCY = "ROCM_SMI_DEV_CU_OCCUPANCY"
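
The same metrics can also be read outside Grafana through the Prometheus HTTP API. The sketch below assumes Prometheus is reachable at a base URL you supply; the http://localhost:9090 default (for example via a port-forward) is an assumption. It also assumes the pod label is present, as added by the relabel config earlier. The function names are illustrative only.

```python
import json
import urllib.parse
import urllib.request

METRIC_GPU_UTIL = "ROCM_SMI_DEV_GPU_UTIL"


def avg_by_pod(metric):
    """Build a PromQL expression averaging `metric` per pod."""
    return f"avg by (pod) ({metric})"


def instant_query_url(promql, base_url="http://localhost:9090"):
    """Build the URL for an instant query against /api/v1/query."""
    query_string = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def run_query(promql, base_url="http://localhost:9090", timeout=10):
    """Run the query and return the result list; needs a reachable Prometheus."""
    with urllib.request.urlopen(instant_query_url(promql, base_url), timeout=timeout) as resp:
        return json.load(resp)["data"]["result"]
```

For example, run_query(avg_by_pod(METRIC_GPU_UTIL)) returns one entry per pod that the rocm-smi-exporter job scrapes.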

You can import the GPUs Grafana dashboard into your Grafana instance.

Extra

  • Add args to the systemd service
    • The Python code accepts --port and other arguments
    • If needed, set their values when launching the systemd service
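
One way to pass --port through systemd is a drop-in file that overrides ExecStart. The sketch below only renders the drop-in text. The executable path and the idea of writing the result under the unit's .d directory are assumptions about your installation, and render_dropin is a hypothetical helper, not something shipped with the package.

```python
def render_dropin(exec_path, port):
    """Render a systemd drop-in that overrides ExecStart with an explicit --port.

    `exec_path` is the absolute path of the exporter executable on your host
    (an assumption; check where your installation placed it).
    """
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return (
        "[Service]\n"
        # An empty ExecStart= clears the command defined in the main unit file.
        "ExecStart=\n"
        f"ExecStart={exec_path} --port={port}\n"
    )
```

After saving the rendered text as a drop-in for your unit, run systemctl daemon-reload and restart the service for the new port to take effect.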

Download files

Source distribution: rocm_smi_exporter-0.0.1a4.tar.gz (4.7 kB)

Built distribution: rocm_smi_exporter-0.0.1a4-py3-none-any.whl (5.2 kB, Python 3)