
Last9 GPU Telemetry: Vendor-Agnostic GPU Monitoring for AI Clusters


DCGM exporter tells you a GPU is hot. It won't tell you whose job is frying it.

Most GPU observability stops at the hardware — utilization, temperature, ECC — and hands you a gpu.uuid with no answer to the only question that matters: who's paying for this idle H100?

l9gpu closes the loop. One agent per node emits vendor-neutral OTLP with workload attribution baked in — Kubernetes pod, namespace, deployment; Slurm job, user, partition. You point it at any OTLP backend and get per-team, per-job, per-model accounting without building a pipeline.

It works on NVIDIA, AMD, and Intel Gaudi today. It will keep working on whatever comes next because it emits OpenTelemetry, not a bespoke format. There's no vendor backend in the agent itself. That's deliberate.


Quick Start — Kubernetes

# Classic Helm repo
helm repo add l9gpu https://last9.github.io/gpu-telemetry
helm install l9gpu l9gpu/l9gpu -n monitoring --create-namespace \
  --set monitoring.sink=otel \
  --set monitoring.cluster=my-cluster \
  --set otlpSecretName=l9gpu-otlp

# or OCI
helm install l9gpu oci://ghcr.io/last9/charts/l9gpu --version 0.1.0 -n monitoring

Create the OTLP secret first:

kubectl create secret generic l9gpu-otlp -n monitoring \
  --from-literal=OTEL_EXPORTER_OTLP_ENDPOINT=<your-otlp-endpoint> \
  --from-literal=OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <your-token>"

AMD / Gaudi nodes: --set collectors.nvidia=false --set collectors.amd=true (or collectors.gaudi=true).

Full Helm guide: docs/HELM.md. Topology examples (EKS + DCGM, multi-GPU, sidecar collector): deploy/helm/l9gpu/examples/.

Quick Start — Bare Metal / systemd

pip install l9gpu
export OTEL_EXPORTER_OTLP_ENDPOINT=<your-otlp-endpoint>
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <your-token>"

l9gpu nvml_monitor  --sink otel --cluster my-cluster  # NVIDIA
l9gpu amd_monitor   --sink otel --cluster my-cluster  # AMD
l9gpu gaudi_monitor --sink otel --cluster my-cluster  # Intel Gaudi

Sanity-check without OTLP: l9gpu nvml_monitor --sink stdout --once.

systemd unit files: systemd/.


What l9gpu is not

  • Not a Prometheus exporter. It emits OTLP. Your Collector handles Prometheus scraping if you want it.
  • Not a backend. l9gpu exports standard OTLP to whatever speaks OTLP. There's no Last9 lock-in in the agent.
  • Not a DCGM replacement. DCGM profiling (SM occupancy, tensor pipe, NVLink) is complementary — bundle both through one Collector pipeline.
  • Not only NVIDIA. AMD MI300X / MI325X and Intel Gaudi 2/3 are first-class.

Architecture

(Diagram: l9gpu architecture flow)

Collectors on each node normalize NVML / DCGM / amdsmi / hl-smi into the gpu.* OTel namespace and ship OTLP to a Collector. The Collector enriches with k8sprocessor or slurmprocessor and fans out.
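The normalization step can be pictured as a straightforward field mapping. Below is a minimal Python sketch, not l9gpu's actual collector code: the raw field names (`util_gpu_percent`, `fb_used_bytes`, `temp_c`) are invented stand-ins for what an NVML-style sample might contain.

```python
# Illustrative only: raw field names are assumptions, not l9gpu's real schema.

def normalize_nvml(sample: dict) -> dict:
    """Map one vendor-specific sample onto vendor-neutral gpu.* names."""
    return {
        "gpu.utilization": sample["util_gpu_percent"],
        "gpu.memory.used": sample["fb_used_bytes"],
        "gpu.temperature": sample["temp_c"],
        "attributes": {"gpu.uuid": sample["uuid"], "gpu.vendor": "nvidia"},
    }

raw = {"util_gpu_percent": 87, "fb_used_bytes": 64 * 2**30,
       "temp_c": 71, "uuid": "GPU-abc123"}
point = normalize_nvml(raw)
```

An amdsmi or hl-smi collector would apply the same idea with its own raw field names, which is what keeps the downstream `gpu.*` namespace vendor-neutral.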

Every cycle (default 60s) emits metrics (one OTLP gauge per GPU per metric) and logs (one OTLP log per GPU per cycle with the full snapshot — useful for backends that prefer log-shaped events or for replaying history).
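The per-cycle fan-out described above can be sketched like this; the dict and JSON-string shapes are illustrative stand-ins for OTLP gauge points and log records, not l9gpu's internal types.

```python
import json
import time

def emit_cycle(gpus):
    """One collection cycle: a gauge point per GPU per metric, plus one
    log record per GPU carrying the full snapshot (illustrative shapes)."""
    ts = int(time.time())
    gauges, logs = [], []
    for g in gpus:
        for name, value in g["metrics"].items():
            gauges.append({"name": name, "value": value, "ts": ts,
                           "attributes": {"gpu.uuid": g["uuid"]}})
        # Full snapshot as one log-shaped event per GPU per cycle.
        logs.append(json.dumps({"gpu.uuid": g["uuid"], "ts": ts, **g["metrics"]}))
    return gauges, logs

snapshot = [{"uuid": "GPU-abc",
             "metrics": {"gpu.utilization": 87, "gpu.temperature": 71}}]
gauges, logs = emit_cycle(snapshot)
# One GPU with two metrics -> two gauge points and one log record.
```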

Full walk-through: docs/ARCHITECTURE.md.


Workload attribution

Kubernetes: k8sprocessor enriches each GPU data point with k8s.pod.name, k8s.namespace.name, k8s.deployment.name, k8s.job.name, cloud.availability_zone, and cloud.region. Setup, RBAC, label allow-lists: docs/K8S_WORKLOAD_ATTRIBUTION.md.
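Conceptually, the enrichment merges workload metadata into each data point's attributes. A minimal sketch, assuming a hypothetical `pod_meta` shape for illustration (this is not the processor's API, which is Go code inside the Collector):

```python
def enrich(attrs: dict, pod_meta: dict) -> dict:
    """Merge k8s workload metadata into a GPU data point's attributes.
    pod_meta's keys ("pod", "namespace", "deployment") are assumptions."""
    out = dict(attrs)
    out["k8s.pod.name"] = pod_meta["pod"]
    out["k8s.namespace.name"] = pod_meta["namespace"]
    if "deployment" in pod_meta:
        out["k8s.deployment.name"] = pod_meta["deployment"]
    return out

point = enrich({"gpu.uuid": "GPU-abc"},
               {"pod": "vllm-0", "namespace": "inference", "deployment": "vllm"})
```

The result is that every `gpu.*` series carries both the hardware identity and the workload identity, which is what makes per-team and per-job accounting a simple group-by.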

Slurm: slurmprocessor attaches slurm.job.id, slurm.user, slurm.account, partition, and QoS:

processors:
  slurm:
    cache_duration: 60
    cache_filepath: /tmp/slurmprocessor_cache.json
    query_slurmctld: false

Full config: slurmprocessor/README.md.


Dashboards & alerts

Pre-built Grafana dashboards in dashboards/grafana/ — multi-cluster fleet, per-pod workload, health/reliability (ECC, throttling, XID), DCGM profiling, inference engines (vLLM, SGLang, TGI, Triton, NIM), fleet efficiency / idle detection.

Alert rules in alerts/prometheus/ (17 PrometheusRule CRDs) and alerts/grafana/. Enable via Helm: helm upgrade --set alerts.enabled=true ….


Pre-built collector

Skip ocb and run a ready-made Collector with k8sprocessor + slurmprocessor baked in:

docker run --rm -v $PWD/config.yaml:/etc/l9gpu/config.yaml:ro \
  ghcr.io/last9/l9gpu-collector:latest --config=/etc/l9gpu/config.yaml

Details and binary/tarball install: docs/COLLECTOR.md.


Components

Directory        Language  Role
l9gpu/           Python    Node-level collector (DaemonSet / systemd). Emits OTLP metrics + logs.
k8sprocessor/    Go        OTel Collector processor. Enriches with K8s pod / workload / cloud metadata.
slurmprocessor/  Go        OTel Collector processor. Enriches with Slurm job metadata.
k8shelper/       Go        Shared K8s API helper library.
shelper/         Go        Shared Slurm helper library.

Hardware support

NVIDIA A100, H100 / H200, B200 / GB200, T4, A10, L4 (NVML + DCGM) · AMD MI300X, MI325X (amdsmi) · Intel Gaudi 2, Gaudi 3 (hl-smi).

Full metric catalog with units and attributes: docs/METRICS.md.


Demo

One-command EKS stack — vLLM + SGLang + TGI + Triton alongside l9gpu NVML, DCGM, cost, fleet-health, and per-engine monitors:

./deploy/demo/launch.sh



Contributing

PRs welcome. See CONTRIBUTING.md for dev setup, tests, and PR flow. By contributing you agree your work is licensed under the same terms as the rest of the project. Security reports: SECURITY.md.

Credits & attribution

l9gpu (the Python package), shelper, and slurmprocessor are derived from Meta's facebookresearch/gcm project (MIT and Apache-2.0). We extended them with Kubernetes workload attribution, AMD / Intel Gaudi collectors, vLLM / SGLang / TGI / Triton / NIM monitors, cost and fleet-health signals, and OTLP-native export. k8shelper/ and k8sprocessor/ are original Last9 work. See NOTICE for the full breakdown.

License

MIT for l9gpu, k8shelper, k8sprocessor. Apache-2.0 for slurmprocessor, shelper. Each subdirectory carries its own LICENSE where it differs from the repo root.

Download files

Download the file for your platform. If you're not sure which to choose, see PyPI's guide to installing packages.

Source Distribution

l9gpu-0.2.0.tar.gz (708.6 kB)

Built Distribution

If you're not sure about the file name format, see PyPI's guide to wheel file names.

l9gpu-0.2.0-py3-none-any.whl (711.9 kB)

File details

Details for the file l9gpu-0.2.0.tar.gz.

File metadata

  • Download URL: l9gpu-0.2.0.tar.gz
  • Upload date:
  • Size: 708.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for l9gpu-0.2.0.tar.gz
  SHA256       360414c86d40688caf063e96af44090d5048177d0be104368fa7c09236edfea5
  MD5          cccd6f8bc1c9deda6a46d093969ecd04
  BLAKE2b-256  ba2fc95b8b5b0804064625b57e3fae0b90d6d628323df4e8b24c1b249a76f807

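To verify a downloaded artifact against the SHA256 digest published above, a small stdlib-only check is enough (adjust the path to wherever you saved the tarball):

```python
import hashlib

def sha256_hex(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

# Expected digest for l9gpu-0.2.0.tar.gz, from the table above.
EXPECTED = "360414c86d40688caf063e96af44090d5048177d0be104368fa7c09236edfea5"
# assert sha256_hex("l9gpu-0.2.0.tar.gz") == EXPECTED
```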

Provenance

The following attestation bundles were made for l9gpu-0.2.0.tar.gz:

Publisher: release.yml on last9/gpu-telemetry

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file l9gpu-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: l9gpu-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 711.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for l9gpu-0.2.0-py3-none-any.whl
  SHA256       95af19e3f1249ed0c74535c6702dfa0098c856c7fc78b788f749aa776c90c95a
  MD5          dec8fcc722487b586ff0bf24e5c34c97
  BLAKE2b-256  e64641514972663837faa2fd038c5a16edb5f6aa277b92e822b878ce68a7c10b


Provenance

The following attestation bundles were made for l9gpu-0.2.0-py3-none-any.whl:

Publisher: release.yml on last9/gpu-telemetry

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
