
Last9 GPU Telemetry: Vendor-Agnostic GPU Monitoring for AI Clusters


DCGM exporter tells you a GPU is hot. It won't tell you whose job is frying it.

Most GPU observability stops at the hardware — utilization, temperature, ECC — and hands you a gpu.uuid with no answer to the only question that matters: who's paying for this idle H100?

l9gpu closes the loop. One agent per node emits vendor-neutral OTLP with workload attribution baked in — Kubernetes pod, namespace, deployment; Slurm job, user, partition. You point it at any OTLP backend and get per-team, per-job, per-model accounting without building a pipeline.

It works on NVIDIA, AMD, and Intel Gaudi today. It will keep working on whatever comes next because it emits OpenTelemetry, not a bespoke format. There's no vendor backend in the agent itself. That's deliberate.


Quick Start — Kubernetes

# Classic Helm repo
helm repo add l9gpu https://last9.github.io/gpu-telemetry
helm install l9gpu l9gpu/l9gpu -n monitoring --create-namespace \
  --set monitoring.sink=otel \
  --set monitoring.cluster=my-cluster \
  --set otlpSecretName=l9gpu-otlp

# or OCI
helm install l9gpu oci://ghcr.io/last9/charts/l9gpu --version 0.1.0 -n monitoring

Create the OTLP secret first:

kubectl create secret generic l9gpu-otlp -n monitoring \
  --from-literal=OTEL_EXPORTER_OTLP_ENDPOINT=<your-otlp-endpoint> \
  --from-literal=OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <your-token>"

AMD / Gaudi nodes: --set collectors.nvidia=false --set collectors.amd=true (or collectors.gaudi=true).

Full Helm guide: docs/HELM.md. Topology examples (EKS + DCGM, multi-GPU, sidecar collector): deploy/helm/l9gpu/examples/.

Quick Start — Bare Metal / systemd

pip install l9gpu
export OTEL_EXPORTER_OTLP_ENDPOINT=<your-otlp-endpoint>
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <your-token>"

l9gpu nvml_monitor  --sink otel --cluster my-cluster  # NVIDIA
l9gpu amd_monitor   --sink otel --cluster my-cluster  # AMD
l9gpu gaudi_monitor --sink otel --cluster my-cluster  # Intel Gaudi

Sanity-check without OTLP: l9gpu nvml_monitor --sink stdout --once.

systemd unit files: systemd/.


What l9gpu is not

  • Not a Prometheus exporter. It emits OTLP. Your Collector handles Prometheus scraping if you want it.
  • Not a backend. l9gpu exports standard OTLP to whatever speaks OTLP. There's no Last9 lock-in in the agent.
  • Not a DCGM replacement. DCGM profiling (SM occupancy, tensor pipe, NVLink) is complementary — bundle both through one Collector pipeline.
  • Not only NVIDIA. AMD MI300X / MI325X and Intel Gaudi 2/3 are first-class.

Architecture

(Diagram: l9gpu architecture flow)

Collectors on each node normalize NVML / DCGM / amdsmi / hl-smi into the gpu.* OTel namespace and ship OTLP to a Collector. The Collector enriches with k8sprocessor or slurmprocessor and fans out.
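The normalization step can be pictured as a straightforward field mapping. Below is a minimal Python sketch, not l9gpu's actual collector code: the raw field names (`util_gpu_percent`, `fb_used_bytes`, `temp_c`) are invented stand-ins for what an NVML-style sample might contain.

```python
# Illustrative only: raw field names are assumptions, not l9gpu's real schema.

def normalize_nvml(sample: dict) -> dict:
    """Map one vendor-specific sample onto vendor-neutral gpu.* names."""
    return {
        "gpu.utilization": sample["util_gpu_percent"],
        "gpu.memory.used": sample["fb_used_bytes"],
        "gpu.temperature": sample["temp_c"],
        "attributes": {"gpu.uuid": sample["uuid"], "gpu.vendor": "nvidia"},
    }

raw = {"util_gpu_percent": 87, "fb_used_bytes": 64 * 2**30,
       "temp_c": 71, "uuid": "GPU-abc123"}
point = normalize_nvml(raw)
```

An amdsmi or hl-smi collector would apply the same idea with its own raw field names, which is what keeps the downstream `gpu.*` namespace vendor-neutral.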

Every cycle (default 60s) emits metrics (one OTLP gauge per GPU per metric) and logs (one OTLP log per GPU per cycle with the full snapshot — useful for backends that prefer log-shaped events or for replaying history).
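The per-cycle fan-out described above can be sketched like this; the dict and JSON-string shapes are illustrative stand-ins for OTLP gauge points and log records, not l9gpu's internal types.

```python
import json
import time

def emit_cycle(gpus):
    """One collection cycle: a gauge point per GPU per metric, plus one
    log record per GPU carrying the full snapshot (illustrative shapes)."""
    ts = int(time.time())
    gauges, logs = [], []
    for g in gpus:
        for name, value in g["metrics"].items():
            gauges.append({"name": name, "value": value, "ts": ts,
                           "attributes": {"gpu.uuid": g["uuid"]}})
        # Full snapshot as one log-shaped event per GPU per cycle.
        logs.append(json.dumps({"gpu.uuid": g["uuid"], "ts": ts, **g["metrics"]}))
    return gauges, logs

snapshot = [{"uuid": "GPU-abc",
             "metrics": {"gpu.utilization": 87, "gpu.temperature": 71}}]
gauges, logs = emit_cycle(snapshot)
# One GPU with two metrics -> two gauge points and one log record.
```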

Full walk-through: docs/ARCHITECTURE.md.


Workload attribution

Kubernetes: k8sprocessor enriches each GPU data point with k8s.pod.name, k8s.namespace.name, k8s.deployment.name, k8s.job.name, cloud.availability_zone, and cloud.region. Setup, RBAC, label allow-lists: docs/K8S_WORKLOAD_ATTRIBUTION.md.
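Conceptually, the enrichment merges workload metadata into each data point's attributes. A minimal sketch, assuming a hypothetical `pod_meta` shape for illustration (this is not the processor's API, which is Go code inside the Collector):

```python
def enrich(attrs: dict, pod_meta: dict) -> dict:
    """Merge k8s workload metadata into a GPU data point's attributes.
    pod_meta's keys ("pod", "namespace", "deployment") are assumptions."""
    out = dict(attrs)
    out["k8s.pod.name"] = pod_meta["pod"]
    out["k8s.namespace.name"] = pod_meta["namespace"]
    if "deployment" in pod_meta:
        out["k8s.deployment.name"] = pod_meta["deployment"]
    return out

point = enrich({"gpu.uuid": "GPU-abc"},
               {"pod": "vllm-0", "namespace": "inference", "deployment": "vllm"})
```

The result is that every `gpu.*` series carries both the hardware identity and the workload identity, which is what makes per-team and per-job accounting a simple group-by.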

Slurm: slurmprocessor attaches slurm.job.id, slurm.user, slurm.account, partition, and QoS:

processors:
  slurm:
    cache_duration: 60
    cache_filepath: /tmp/slurmprocessor_cache.json
    query_slurmctld: false

Full config: slurmprocessor/README.md.


Dashboards & alerts

Pre-built Grafana dashboards in dashboards/grafana/ — multi-cluster fleet, per-pod workload, health/reliability (ECC, throttling, XID), DCGM profiling, inference engines (vLLM, SGLang, TGI, Triton, NIM), fleet efficiency / idle detection.

Alert rules in alerts/prometheus/ (17 PrometheusRule CRDs) and alerts/grafana/. Enable via Helm: helm upgrade --set alerts.enabled=true ….


Pre-built collector

Skip ocb and run a ready-made Collector with k8sprocessor + slurmprocessor baked in:

docker run --rm -v $PWD/config.yaml:/etc/l9gpu/config.yaml:ro \
  ghcr.io/last9/l9gpu-collector:latest --config=/etc/l9gpu/config.yaml

Details and binary/tarball install: docs/COLLECTOR.md.


Components

Directory        Language  Role
l9gpu/           Python    Node-level collector (DaemonSet / systemd). Emits OTLP metrics + logs.
k8sprocessor/    Go        OTel Collector processor. Enriches with K8s pod / workload / cloud metadata.
slurmprocessor/  Go        OTel Collector processor. Enriches with Slurm job metadata.
k8shelper/       Go        Shared K8s API helper library.
shelper/         Go        Shared Slurm helper library.

Hardware support

NVIDIA A100, H100 / H200, B200 / GB200, T4, A10, L4 (NVML + DCGM) · AMD MI300X, MI325X (amdsmi) · Intel Gaudi 2, Gaudi 3 (hl-smi).

Full metric catalog with units and attributes: docs/METRICS.md.


Demo

One-command EKS stack — vLLM + SGLang + TGI + Triton alongside l9gpu NVML, DCGM, cost, fleet-health, and per-engine monitors:

./deploy/demo/launch.sh



Contributing

PRs welcome. See CONTRIBUTING.md for dev setup, tests, and PR flow. By contributing you agree your work is licensed under the same terms as the rest of the project. Security reports: SECURITY.md.

Credits & attribution

l9gpu (the Python package), shelper, and slurmprocessor are derived from Meta's facebookresearch/gcm project (MIT and Apache-2.0). We extended them with Kubernetes workload attribution, AMD / Intel Gaudi collectors, vLLM / SGLang / TGI / Triton / NIM monitors, cost and fleet-health signals, and OTLP-native export. k8shelper/ and k8sprocessor/ are original Last9 work. See NOTICE for the full breakdown.

License

MIT for l9gpu, k8shelper, k8sprocessor. Apache-2.0 for slurmprocessor, shelper. Each subdirectory carries its own LICENSE where it differs from the repo root.

Download files

Download the file for your platform. If you're not sure which to choose, see PyPI's guide to installing packages.

Source Distribution

l9gpu-0.2.0.tar.gz (708.6 kB)

Built Distribution

If you're not sure about the file name format, see PyPI's guide to wheel file names.

l9gpu-0.2.0-py3-none-any.whl (711.9 kB)

File details

Details for the file l9gpu-0.2.0.tar.gz.

File metadata

  • Download URL: l9gpu-0.2.0.tar.gz
  • Upload date:
  • Size: 708.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for l9gpu-0.2.0.tar.gz
  SHA256       360414c86d40688caf063e96af44090d5048177d0be104368fa7c09236edfea5
  MD5          cccd6f8bc1c9deda6a46d093969ecd04
  BLAKE2b-256  ba2fc95b8b5b0804064625b57e3fae0b90d6d628323df4e8b24c1b249a76f807

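To verify a downloaded artifact against the SHA256 digest published above, a small stdlib-only check is enough (adjust the path to wherever you saved the tarball):

```python
import hashlib

def sha256_hex(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

# Expected digest for l9gpu-0.2.0.tar.gz, from the table above.
EXPECTED = "360414c86d40688caf063e96af44090d5048177d0be104368fa7c09236edfea5"
# assert sha256_hex("l9gpu-0.2.0.tar.gz") == EXPECTED
```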

Provenance

The following attestation bundles were made for l9gpu-0.2.0.tar.gz:

Publisher: release.yml on last9/gpu-telemetry

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file l9gpu-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: l9gpu-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 711.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for l9gpu-0.2.0-py3-none-any.whl
  SHA256       95af19e3f1249ed0c74535c6702dfa0098c856c7fc78b788f749aa776c90c95a
  MD5          dec8fcc722487b586ff0bf24e5c34c97
  BLAKE2b-256  e64641514972663837faa2fd038c5a16edb5f6aa277b92e822b878ce68a7c10b


Provenance

The following attestation bundles were made for l9gpu-0.2.0-py3-none-any.whl:

Publisher: release.yml on last9/gpu-telemetry

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
