
rocm-smi-exporter

Export rocm-smi metrics as prometheus metrics

Design

  • Raw metrics from devices are obtained with pyrsmi, a Python wrapper around ROCm's C API for accessing device metrics
  • Raw metrics are converted into Prometheus format.
  • The full list of exported metrics is defined in the source code.
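To illustrate the conversion step, the sketch below renders raw per-device readings into the Prometheus text exposition format. The helper function and the sample values are hypothetical; the real exporter obtains its readings per device via pyrsmi.

```python
# Sketch of the raw-reading -> Prometheus-format conversion step.
# The readings dict is hypothetical sample data; the real exporter
# collects these values per device via pyrsmi.

def to_prometheus_text(metric_name: str, readings: dict) -> str:
    """Render one gauge metric in Prometheus text exposition format."""
    lines = [f"# TYPE {metric_name} gauge"]
    for gpu_index, value in sorted(readings.items()):
        # One sample line per GPU, labeled with the device index
        lines.append(f'{metric_name}{{gpu="{gpu_index}"}} {value}')
    return "\n".join(lines) + "\n"

# Hypothetical utilization percentages for two GPUs
print(to_prometheus_text("ROCM_SMI_DEV_GPU_UTIL", {0: 45.0, 1: 80.0}))
```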

Install

# Public Pypi package
pip install rocm-smi-exporter==0.0.1a6

# Private Docker image
powerml/rocm-smi-exporter:0.0.1a6

Known issues

On K8s, exporter cannot be deployed as sidecar of a Pod requesting GPUs

When this exporter is deployed as a sidecar container of a pod that requests AMD GPUs, the exporter sidecar has no access to the GPUs (which are accessible to the other containers).

ROCm/k8s-device-plugin manages AMD GPUs on K8s; we suspect its setup confuses librocm_smi64.so, which this exporter uses to collect metrics.

Working scenarios

Deploy exporter as systemd service for infrastructure monitoring

See systemd
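A minimal unit file for this scenario might look like the sketch below. The ExecStart path, port, and unit name are assumptions; adjust them to match where pip installed the entry point on your host.

```ini
# /etc/systemd/system/rocm-smi-exporter.service (sketch; paths assumed)
[Unit]
Description=rocm-smi-exporter
After=network.target

[Service]
# Entry point installed by the pip package; the path may differ on your host
ExecStart=/usr/local/bin/lamini-rocm-smi-exporter --port=9001
Restart=always

[Install]
WantedBy=multi-user.target
```

After saving the file, reload and start it with systemctl daemon-reload and systemctl enable --now rocm-smi-exporter.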

Call exporter API directly in your python application

You can invoke the GPU metrics exporting code directly inside your Python application and point Prometheus at it with a matching scrape configuration.
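For this scenario, a minimal static scrape configuration could look like the sketch below; the host and port (9001) are assumptions and must match wherever your application serves the metrics.

```yaml
# Sample static scrape config for an application that embeds the exporter.
# Target host and port are assumptions; adjust to your deployment.
scrape_configs:
  - job_name: rocm-smi-exporter
    scrape_interval: 1s
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:9001"]
```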

[DO NOT USE, DOESN'T WORK] Integration with kube-prometheus-stack

NOTE: powerml/rocm-smi-exporter:0.0.1a6 is a private image; you may need to build your own image from the pip package.

Assume there is already a kube-prometheus-stack deployment on your Kubernetes cluster. Verify that with:

helm list -n <namespace-of-kube-prometheus-stack>

Deploy exporter sidecar container

NOTE: You'll need an imagePullSecret to be able to pull powerml/rocm-smi-exporter.

First, you'll need to deploy the exporter as a sidecar container in your pods that request AMD GPUs. Use the template below. You'll need to add:

  • rocm-smi-exporter-sidecar to your pod's containers section
  • host-opt to your pod's volumes section
apiVersion: v1
kind: Pod
metadata:
  name: {{ pod-name }}
  labels:
    exporter: rocm-smi
spec:
  containers:
  - name: {{ main-container-name }}
    image: {{ main-image }}
    resources:
      limits:
        amd.com/gpu: {{ count }}
  # Add the following sidecar container to your pod
  - name: rocm-smi-exporter-sidecar
    image: powerml/rocm-smi-exporter
    # This port must be the same as the promPort
    args: ["lamini-rocm-smi-exporter", "--port=9001"]
    imagePullPolicy: Always
    ports:
    - containerPort: 9001
      name: prometheus
    volumeMounts:
    - mountPath: /opt
      name: host-opt
  volumes:
  # Add an additional volume for the sidecar container
  - name: host-opt
    hostPath:
      # This is ROCm installation directory
      path: /opt
      type: Directory

Upgrade kube-prometheus-stack helm values to pick up metrics from the sidecar

In your kube-prometheus-stack's values.yaml file, add the following rocm-smi-exporter job to the prometheus.prometheusSpec.additionalScrapeConfigs section. You can find the location in the official repo's example. An example is provided below.

# This is a sample configuration.
# Save this to values.yaml and upgrade with helm.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: rocm-smi-exporter
        scrape_interval: 1s
        metrics_path: /metrics
        scheme: http
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_label_exporter]
          action: keep
          regex: rocm-smi
        - source_labels: [__meta_kubernetes_pod_name]
          target_label: pod
        - source_labels: [__meta_kubernetes_namespace]
          target_label: namespace
        - source_labels: [__meta_kubernetes_pod_node_name]
          target_label: node
        - source_labels: [__meta_kubernetes_pod_ip]
          target_label: instance

Save the above yaml to values.yaml and upgrade your own helm release.

helm list -n <namespace>

You'll see output like the release list shown by the command above. Note down the release name and version; in this example they are prometheus and 61.3.0.

Then upgrade the release with the command below:

helm upgrade <release-name> prometheus-community/kube-prometheus-stack -f values.yaml --version <version>

Verify in Grafana

Open Grafana, click Explore -> Prometheus -> Select Metric -> ROCM_*, you should see the captured metrics:

METRIC_GPU_UTIL = "ROCM_SMI_DEV_GPU_UTIL"
METRIC_GPU_MEM_TOTAL = "ROCM_SMI_DEV_GPU_MEM_TOTAL"
METRIC_GPU_MEM_USED = "ROCM_SMI_DEV_GPU_MEM_USED"
METRIC_GPU_MEM_UTIL = "ROCM_SMI_DEV_MEM_UTIL"
METRIC_GPU_POWER = "ROCM_SMI_DEV_POWER"
METRIC_GPU_CU_OCCUPANCY = "ROCM_SMI_DEV_CU_OCCUPANCY"
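As a starting point, the PromQL sketches below use the metric names listed above together with the node and pod labels produced by the relabel_configs in the scrape configuration:

```promql
# Average GPU utilization per node
avg by (node) (ROCM_SMI_DEV_GPU_UTIL)

# Fraction of GPU memory in use, per pod
ROCM_SMI_DEV_GPU_MEM_USED / ROCM_SMI_DEV_GPU_MEM_TOTAL
```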


You can import GPUs Grafana Dashboard into your Grafana.

Reference

  • Add args to systemd service
    • The Python code accepts --port and other arguments
    • If needed, set their values when launching the systemd service
  • amd/amd-smi-exporter is AMD's "semi-official" exporter for collecting metrics from AMD CPUs, APUs, and GPUs. To us it has a few major issues:
    • It's not focused on GPUs, so it may take much longer than we can wait for it to mature, and new GPU features will always take longer to land.
    • It seems experimental, with no dedicated staffing from AMD
  • nvidia/dcgm-exporter is the mature solution from NVIDIA; it focuses on GPUs and already works
    • We use it as a template for architecting rocm-smi-exporter; parts related to standard K8s features can be borrowed directly
