Last9 GPU Telemetry: Vendor-Agnostic GPU Monitoring for AI Clusters
DCGM exporter tells you a GPU is hot. It won't tell you whose job is frying it.
Most GPU observability stops at the hardware — utilization, temperature, ECC —
and hands you a gpu.uuid with no answer to the only question that matters:
who's paying for this idle H100?
l9gpu closes the loop. One agent per node emits vendor-neutral OTLP with
workload attribution baked in — Kubernetes pod, namespace, deployment;
Slurm job, user, partition. You point it at any OTLP backend and get
per-team, per-job, per-model accounting without building a pipeline.
It works on NVIDIA, AMD, and Intel Gaudi today. It will keep working on whatever comes next because it emits OpenTelemetry, not a bespoke format. There's no vendor backend in the agent itself. That's deliberate.
Quick Start — Kubernetes
# Classic Helm repo
helm repo add l9gpu https://last9.github.io/gpu-telemetry
helm install l9gpu l9gpu/l9gpu -n monitoring --create-namespace \
  --set monitoring.sink=otel \
  --set monitoring.cluster=my-cluster \
  --set otlpSecretName=l9gpu-otlp
# or OCI
helm install l9gpu oci://ghcr.io/last9/charts/l9gpu --version 0.1.0 -n monitoring
Create the OTLP secret first:
kubectl create secret generic l9gpu-otlp -n monitoring \
  --from-literal=OTEL_EXPORTER_OTLP_ENDPOINT=<your-otlp-endpoint> \
  --from-literal=OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <your-token>"
AMD / Gaudi nodes: --set collectors.nvidia=false --set collectors.amd=true
(or collectors.gaudi=true).
Full Helm guide: docs/HELM.md. Topology examples
(EKS + DCGM, multi-GPU, sidecar collector): deploy/helm/l9gpu/examples/.
Quick Start — Bare Metal / systemd
pip install l9gpu
export OTEL_EXPORTER_OTLP_ENDPOINT=<your-otlp-endpoint>
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <your-token>"
l9gpu nvml_monitor --sink otel --cluster my-cluster # NVIDIA
l9gpu amd_monitor --sink otel --cluster my-cluster # AMD
l9gpu gaudi_monitor --sink otel --cluster my-cluster # Intel Gaudi
Sanity-check without OTLP: l9gpu nvml_monitor --sink stdout --once.
systemd unit files: systemd/.
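As a sketch of what a unit might look like, here is a minimal service for the NVIDIA collector. The paths and the `EnvironmentFile` location are assumptions for illustration; use the units shipped in systemd/ for real deployments.

```ini
# /etc/l9gpu illustrative unit — paths are assumptions, see systemd/ for the
# units that ship with the project.
[Unit]
Description=l9gpu GPU telemetry agent (NVIDIA)
After=network-online.target
Wants=network-online.target

[Service]
# File holding OTEL_EXPORTER_OTLP_ENDPOINT / OTEL_EXPORTER_OTLP_HEADERS;
# this location is an assumption.
EnvironmentFile=/etc/l9gpu/otlp.env
ExecStart=/usr/local/bin/l9gpu nvml_monitor --sink otel --cluster my-cluster
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```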
What l9gpu is not
- Not a Prometheus exporter. It emits OTLP. Your Collector handles Prometheus scraping if you want it.
- Not a backend. l9gpu exports standard OTLP to whatever speaks OTLP. There's no Last9 lock-in in the agent.
- Not a DCGM replacement. DCGM profiling (SM occupancy, tensor pipe, NVLink) is complementary — bundle both through one Collector pipeline.
- Not only NVIDIA. AMD MI300X / MI325X and Intel Gaudi 2/3 are first-class.
Architecture
Collectors on each node normalize NVML / DCGM / amdsmi / hl-smi into the
gpu.* OTel namespace and ship OTLP to a Collector. The Collector enriches
with k8sprocessor or slurmprocessor
and fans out.
Every cycle (default 60s) emits metrics (one OTLP gauge per GPU per metric) and logs (one OTLP log per GPU per cycle with the full snapshot — useful for backends that prefer log-shaped events or for replaying history).
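To make the per-cycle output concrete, here is an illustrative sketch (plain Python, no OTel SDK) of the two shapes a single GPU produces each cycle. The metric names and values below are invented for illustration; the real catalog is in docs/METRICS.md.

```python
# One gauge data point per GPU per metric per cycle (illustrative shape;
# metric names and values here are invented, not the shipped catalog).
gauge_points = [
    {"name": "gpu.utilization", "value": 0.93, "unit": "1",
     "attributes": {"gpu.uuid": "GPU-1234", "gpu.index": 0}},
    {"name": "gpu.memory.used", "value": 68.2e9, "unit": "By",
     "attributes": {"gpu.uuid": "GPU-1234", "gpu.index": 0}},
]

# One log record per GPU per cycle carrying the full snapshot, so log-first
# backends can reconstruct every metric from a single event.
log_record = {
    "body": "gpu snapshot",
    "attributes": {
        "gpu.uuid": "GPU-1234",
        "gpu.utilization": 0.93,
        "gpu.memory.used": 68.2e9,
    },
}

# Every gauge value is recoverable from the snapshot log.
for point in gauge_points:
    assert log_record["attributes"][point["name"]] == point["value"]
```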
Full walk-through: docs/ARCHITECTURE.md.
Workload attribution
Kubernetes — k8sprocessor enriches each GPU data
point with k8s.pod.name, k8s.namespace.name, k8s.deployment.name,
k8s.job.name, cloud.availability_zone, cloud.region. Setup,
RBAC, label allow-lists: docs/K8S_WORKLOAD_ATTRIBUTION.md.
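Conceptually, enrichment is an attribute merge: the processor resolves which pod owns the GPU and stamps that pod's metadata onto every data point. A hypothetical sketch of that merge (the lookup mechanism and these values are assumptions, not the processor's actual implementation):

```python
def enrich(point: dict, pod_meta: dict) -> dict:
    """Return a copy of a data point with k8s.* attributes merged in."""
    enriched = dict(point)
    enriched["attributes"] = {**point["attributes"], **pod_meta}
    return enriched

# A bare data point as the node agent emits it (values invented).
point = {"name": "gpu.utilization", "value": 0.12,
         "attributes": {"gpu.uuid": "GPU-1234"}}

# Pod metadata the processor would resolve; attribute names from the
# section above, values hypothetical.
pod_meta = {
    "k8s.pod.name": "trainer-7f9c",
    "k8s.namespace.name": "ml-research",
    "k8s.deployment.name": "trainer",
}

enriched = enrich(point, pod_meta)
assert enriched["attributes"]["k8s.namespace.name"] == "ml-research"
assert point["attributes"] == {"gpu.uuid": "GPU-1234"}  # original untouched
```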
Slurm — slurmprocessor attaches slurm.job.id,
slurm.user, slurm.account, partition, QoS:
processors:
  slurm:
    cache_duration: 60
    cache_filepath: /tmp/slurmprocessor_cache.json
    query_slurmctld: false
Full config: slurmprocessor/README.md.
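Wired into a Collector pipeline, the processor sits between receiver and exporter. The receiver and exporter blocks below are placeholders for your own configuration, not a shipped config:

```yaml
# Illustrative Collector pipeline with the slurm processor in the path;
# receiver/exporter settings are placeholders.
receivers:
  otlp:
    protocols:
      grpc:
processors:
  slurm:
    cache_duration: 60
    cache_filepath: /tmp/slurmprocessor_cache.json
    query_slurmctld: false
exporters:
  otlphttp:
    endpoint: https://your-otlp-endpoint
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [slurm]
      exporters: [otlphttp]
```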
Dashboards & alerts
Pre-built Grafana dashboards in dashboards/grafana/ —
multi-cluster fleet, per-pod workload, health/reliability (ECC, throttling,
XID), DCGM profiling, inference engines (vLLM, SGLang, TGI, Triton, NIM),
fleet efficiency / idle detection.
Alert rules in alerts/prometheus/ (17 PrometheusRule
CRDs) and alerts/grafana/. Enable via Helm:
helm upgrade --set alerts.enabled=true ….
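For a feel of what the bundled rules look like, here is a hypothetical idle-GPU rule in PrometheusRule form. The metric name (as it would appear after Prometheus export), threshold, and durations are assumptions, not the shipped rule; see alerts/prometheus/ for the real ones.

```yaml
# Hypothetical example in the style of the bundled PrometheusRule CRDs.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: l9gpu-idle-example
spec:
  groups:
    - name: gpu-efficiency
      rules:
        - alert: GPUIdleTooLong
          # Metric name assumed; check your exporter's naming.
          expr: avg_over_time(gpu_utilization[30m]) < 0.05
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "GPU idle for 30m; check the owning pod / Slurm job"
```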
Pre-built collector
Skip ocb and run a ready-made Collector with k8sprocessor +
slurmprocessor baked in:
docker run --rm -v $PWD/config.yaml:/etc/l9gpu/config.yaml:ro \
ghcr.io/last9/l9gpu-collector:latest --config=/etc/l9gpu/config.yaml
Details and binary/tarball install: docs/COLLECTOR.md.
Components
| Directory | Language | Role |
|---|---|---|
| l9gpu/ | Python | Node-level collector (DaemonSet / systemd). Emits OTLP metrics + logs. |
| k8sprocessor/ | Go | OTel Collector processor. Enriches with K8s pod / workload / cloud metadata. |
| slurmprocessor/ | Go | OTel Collector processor. Enriches with Slurm job metadata. |
| k8shelper/ | Go | Shared K8s API helper library. |
| shelper/ | Go | Shared Slurm helper library. |
Hardware support
NVIDIA A100, H100 / H200, B200 / GB200, T4, A10, L4 (NVML + DCGM) · AMD MI300X, MI325X (amdsmi) · Intel Gaudi 2, Gaudi 3 (hl-smi).
Full metric catalog with units and attributes: docs/METRICS.md.
Demo
One-command EKS stack — vLLM + SGLang + TGI + Triton alongside l9gpu NVML, DCGM, cost, fleet-health, and per-engine monitors:
./deploy/demo/launch.sh
Documentation
- Architecture — system design, topology, data flow
- Metrics reference — every metric, unit, attribute
- Integration guide — PromQL, OTel Collector recipes, cloud notes
- K8s workload attribution — RBAC, enrichment, label allow-lists
- Scaling — cardinality management for large fleets
- GPU & LLM observability — vLLM / NIM / Triton specifics
- Helm install · Pre-built collector
- AWS testing cookbook — end-to-end EC2 and EKS walk-through
- l9gpu CLI reference · slurmprocessor · shelper
Contributing
PRs welcome. See CONTRIBUTING.md for dev setup, tests,
and PR flow. By contributing you agree your work is licensed under the same
terms as the rest of the project. Security reports: SECURITY.md.
Credits & attribution
l9gpu (the Python package), shelper, and slurmprocessor are derived
from Meta's facebookresearch/gcm
project (MIT and Apache-2.0). We extended them with Kubernetes workload
attribution, AMD / Intel Gaudi collectors, vLLM / SGLang / TGI / Triton /
NIM monitors, cost and fleet-health signals, and OTLP-native export.
k8shelper/ and k8sprocessor/ are original Last9 work. See
NOTICE for the full breakdown.
License
MIT for l9gpu, k8shelper, k8sprocessor. Apache-2.0 for slurmprocessor,
shelper. Each subdirectory carries its own LICENSE where it differs from
the repo root.
Download files
File details
Details for the file l9gpu-0.2.0.tar.gz.
File metadata
- Download URL: l9gpu-0.2.0.tar.gz
- Size: 708.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 360414c86d40688caf063e96af44090d5048177d0be104368fa7c09236edfea5 |
| MD5 | cccd6f8bc1c9deda6a46d093969ecd04 |
| BLAKE2b-256 | ba2fc95b8b5b0804064625b57e3fae0b90d6d628323df4e8b24c1b249a76f807 |
Provenance
The following attestation bundles were made for l9gpu-0.2.0.tar.gz:
Publisher: release.yml on last9/gpu-telemetry
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: l9gpu-0.2.0.tar.gz
- Subject digest: 360414c86d40688caf063e96af44090d5048177d0be104368fa7c09236edfea5
- Sigstore transparency entry: 1341669845
- Permalink: last9/gpu-telemetry@a146f500a6ab42c43cdc41ea8d3639ef8a96ae56
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/last9
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a146f500a6ab42c43cdc41ea8d3639ef8a96ae56
- Trigger Event: push
File details
Details for the file l9gpu-0.2.0-py3-none-any.whl.
File metadata
- Download URL: l9gpu-0.2.0-py3-none-any.whl
- Size: 711.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 95af19e3f1249ed0c74535c6702dfa0098c856c7fc78b788f749aa776c90c95a |
| MD5 | dec8fcc722487b586ff0bf24e5c34c97 |
| BLAKE2b-256 | e64641514972663837faa2fd038c5a16edb5f6aa277b92e822b878ce68a7c10b |
Provenance
The following attestation bundles were made for l9gpu-0.2.0-py3-none-any.whl:
Publisher: release.yml on last9/gpu-telemetry
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: l9gpu-0.2.0-py3-none-any.whl
- Subject digest: 95af19e3f1249ed0c74535c6702dfa0098c856c7fc78b788f749aa776c90c95a
- Sigstore transparency entry: 1341669846
- Permalink: last9/gpu-telemetry@a146f500a6ab42c43cdc41ea8d3639ef8a96ae56
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/last9
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a146f500a6ab42c43cdc41ea8d3639ef8a96ae56
- Trigger Event: push