matcha

GPU energy observability for AI training

Measure GPU energy per training run and per step. No code changes. Structured output for any observability stack.


Install

pip install usematcha

Requires Linux, an NVIDIA GPU with drivers installed, and Python 3.9+.

Quick start

Prefix your training command with matcha run:

matcha run torchrun --standalone --nproc_per_node=8 train_gpt.py

Your training runs at full speed. matcha prints one line at the end:

matcha_energy gpus:8x NVIDIA H100 80GB HBM3 total:778168J (216.16Wh) duration:203.1s avg_power:3832W peak_power:4120W samples:2031

No code changes. No config files. Works with any training script.
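If you want the summary line as numbers (for example to check it in CI), a small parser is enough. The following is a minimal sketch written against the example line above; the regular expression and the function name parse_summary are assumptions for illustration, not part of matcha, and the real line format may differ between versions.

```python
import re

# Example summary line copied from the quick start above.
SUMMARY = (
    "matcha_energy gpus:8x NVIDIA H100 80GB HBM3 total:778168J (216.16Wh) "
    "duration:203.1s avg_power:3832W peak_power:4120W samples:2031"
)

# Hypothetical pattern, inferred from the example line only.
PATTERN = re.compile(
    r"matcha_energy gpus:(?P<gpu_count>\d+)x (?P<gpu_name>.+?) "
    r"total:(?P<total_j>[\d.]+)J \((?P<total_wh>[\d.]+)Wh\) "
    r"duration:(?P<duration_s>[\d.]+)s "
    r"avg_power:(?P<avg_power_w>[\d.]+)W "
    r"peak_power:(?P<peak_power_w>[\d.]+)W "
    r"samples:(?P<samples>\d+)"
)


def parse_summary(line):
    """Return the summary fields as a dict, or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "gpu_count": int(d["gpu_count"]),
        "gpu_name": d["gpu_name"],
        "total_j": float(d["total_j"]),
        "total_wh": float(d["total_wh"]),
        "duration_s": float(d["duration_s"]),
        "avg_power_w": float(d["avg_power_w"]),
        "peak_power_w": float(d["peak_power_w"]),
        "samples": int(d["samples"]),
    }


summary = parse_summary(SUMMARY)

# Cross-check the units: 1 Wh = 3600 J, and average power = energy / duration.
wh_from_joules = summary["total_j"] / 3600.0
avg_from_totals = summary["total_j"] / summary["duration_s"]
print(round(wh_from_joules, 2), round(avg_from_totals))
```

For anything beyond a quick check, the structured JSONL output is likely a more stable interface than scraping this line.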


Commands

matcha run — total energy, zero overhead

Launches your command, polls GPU power in the background, prints a summary when it finishes. Training runs natively — no stdout interception, no performance impact.

matcha run python train.py
matcha run torchrun --standalone --nproc_per_node=8 train_gpt.py
matcha run deepspeed --num_gpus=8 train.py --deepspeed ds_config.json

matcha wrap — per-step energy

Parses your training stdout for step markers (step 10, iter 10, step:10/1000, [10/1000], etc.) and appends energy data to each step line.

matcha wrap torchrun --standalone --nproc_per_node=8 train_gpt.py
step:1/20000 train_loss:6.9357 train_time:612ms step_avg:612.00ms energy:2354.0J/step avg_power:3847W peak_power:4120W
step:2/20000 train_loss:16.7414 train_time:831ms step_avg:721.50ms energy:3012.6J/step avg_power:3625W peak_power:3998W
step:3/20000 train_loss:8.7524 train_time:1258ms step_avg:783.40ms energy:3472.8J/step avg_power:3610W peak_power:3890W
...
matcha_energy gpus:8x NVIDIA H100 80GB HBM3 total:778168J (216.16Wh) duration:203.1s avg_power:3832W peak_power:4120W samples:2031
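To make the step-marker idea concrete, here is a minimal sketch of how a line could be matched and annotated. The patterns cover only the four marker styles listed above, and the names STEP_PATTERNS, find_step, and annotate are hypothetical; matcha's actual matching rules are not shown in this document and may be broader or stricter.

```python
import re

# Hypothetical patterns for the marker styles named above:
#   "step 10", "iter 10", "step:10/1000", "[10/1000]"
STEP_PATTERNS = [
    re.compile(r"\b(?:step|iter(?:ation)?)\s*[:=]?\s*(\d+)(?:\s*/\s*\d+)?", re.IGNORECASE),
    re.compile(r"\[\s*(\d+)\s*/\s*\d+\s*\]"),
]


def find_step(line):
    """Return the step number found in a log line, or None if there is none."""
    for pattern in STEP_PATTERNS:
        m = pattern.search(line)
        if m:
            return int(m.group(1))
    return None


def annotate(line, energy_j, avg_power_w, peak_power_w):
    """Append energy fields in the same layout as the example output above."""
    return (
        f"{line} energy:{energy_j:.1f}J/step "
        f"avg_power:{avg_power_w:.0f}W peak_power:{peak_power_w:.0f}W"
    )


line = "step:1/20000 train_loss:6.9357 train_time:612ms step_avg:612.00ms"
if find_step(line) is not None:
    print(annotate(line, 2354.0, 3847.0, 4120.0))
```

Note that a field such as step_avg:612.00ms is not mistaken for a marker, because the pattern requires digits directly after the keyword and an optional separator.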

matcha monitor — live per-GPU dashboard

A drop-in replacement for running watch nvidia-smi in a second terminal. Shows per-GPU power, utilization, temperature, memory, and a running total.

matcha monitor
matcha monitor --gpus 0,1,2,3 --interval 500

Structured output for observability

matcha emits JSONL records (session_start, step, session_end) ready to stream into ClickHouse, Grafana, or any logging pipeline. Enable with --json / --output on run or wrap:

matcha wrap --output run.jsonl \
    --label team=capacity --label config=lr_3e-4 --label seed=42 \
    torchrun --standalone --nproc_per_node=8 train_gpt.py

Training stdout passes through untouched. Records are appended to run.jsonl:

{"type":"session_start","ts":"2026-04-17T15:19:18.004Z","run_id":"3f7a9b1c4e2d","matcha_version":"0.2.1","hostname":"h100-node-4","driver_version":"535.104.12","interval_ms":100,"energy_source":"counter","gpus":[{"idx":0,"uuid":"GPU-f364...","name":"NVIDIA H100 80GB HBM3"}, ... ],"cmd":["torchrun","--standalone","--nproc_per_node=8","train_gpt.py"],"labels":{"team":"capacity","config":"lr_3e-4","seed":"42"}}
{"type":"step","ts":"2026-04-17T15:19:18.616Z","run_id":"3f7a9b1c4e2d","step":1,"step_gap":1,"energy_j":2354.0,"energy_per_step_j":2354.0,"duration_s":0.612,"avg_power_w":3847.0,"peak_power_w":4120.0,"gpus":[{"idx":0,"energy_j":323.5,"avg_power_w":528.6,"peak_power_w":540.0}, ... ]}
{"type":"session_end","ts":"2026-04-17T15:22:41.104Z","run_id":"3f7a9b1c4e2d","total_energy_j":778168.0,"energy_wh":216.16,"duration_s":203.1,"avg_power_w":3832.0,"peak_power_w":4120.0,"total_samples":2031,"total_steps":20000,"energy_per_step_j":38.91,"energy_source":"counter","gpus":[ ... ]}

Ingest example — ClickHouse:

cat run.jsonl | clickhouse-client --query "INSERT INTO energy_steps FORMAT JSONEachRow"
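If you would rather inspect the records in Python before loading them anywhere, the file can be read one line at a time. This is a minimal sketch using only the field names visible in the example records above; the sample data is an abridged copy of those records, and the function name summarize is an illustration, not part of matcha.

```python
import io
import json

# Abridged copies of the example records above (most fields omitted for brevity).
SAMPLE = io.StringIO(
    '{"type":"session_start","run_id":"3f7a9b1c4e2d","labels":{"team":"capacity"}}\n'
    '{"type":"step","run_id":"3f7a9b1c4e2d","step":1,"energy_j":2354.0,"duration_s":0.612}\n'
    '{"type":"session_end","run_id":"3f7a9b1c4e2d","total_energy_j":778168.0,"duration_s":203.1}\n'
)


def summarize(stream):
    """Collect labels, summed step energy, and the session total from JSONL lines."""
    labels, step_energy_j, n_steps, total_j = {}, 0.0, 0, None
    for raw in stream:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        rec = json.loads(raw)
        kind = rec.get("type")
        if kind == "session_start":
            labels = rec.get("labels", {})
        elif kind == "step":
            step_energy_j += rec["energy_j"]
            n_steps += 1
        elif kind == "session_end":
            total_j = rec["total_energy_j"]
    return {
        "labels": labels,
        "steps": n_steps,
        "step_energy_j": step_energy_j,
        "total_energy_j": total_j,
    }


result = summarize(SAMPLE)
print(result)
```

To read a real run, pass an open file instead of the sample, for example summarize(open("run.jsonl")).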

Flags

Flag               Description
--json             Emit structured JSONL records.
--output PATH      Write JSONL to a file (implies --json). Required when using --json with wrap.
--label KEY=VALUE  Attach a label to the run. Repeatable.
--run-id ID        Stable run identifier. Also honors MATCHA_RUN_ID. Auto-generated if unset.
--gpus             GPUs to measure: all, a single index (0), or a list (0,1,2,3). Default: all visible GPUs.
--interval         Peak-power poll interval in ms. Default: 100. Energy comes from the hardware counter and does not depend on this interval.

Multi-GPU

matcha auto-detects every visible GPU and reports summed totals plus per-GPU breakdowns in structured output. The per-GPU arrays make straggler detection a one-query affair.

# 8xH100 — auto-detects all 8
matcha run torchrun --standalone --nproc_per_node=8 train_gpt.py

# Subset
matcha run --gpus 0,1,2,3 torchrun ...

# Single GPU
matcha run --gpus 0 torchrun ...

Each step record carries a gpus: [{idx, energy_j, avg_power_w, peak_power_w}, ...] array alongside the totals. Useful for:

  • Straggler detection: one rank consistently drawing ~30% less power than its peers usually means a stuck collective, a thermally throttled card, or a PCIe link degraded to Gen3.
  • DP / PP / TP fingerprinting: the per-GPU power pattern over time shows which parallelism strategy is actually running.
  • Rank-0 asymmetry: rank 0 carries extra work from checkpoint I/O and from originating collectives, so some difference is expected; the per-GPU data lets you confirm it stays bounded.
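As an illustration of the straggler check, the sketch below flags any GPU whose average power sits well below the median of its peers in one step record. The 30% threshold mirrors the rule of thumb above; the function name find_stragglers and all power values except GPU 0's are made up for this example.

```python
from statistics import median

# A "gpus" array shaped like the one in a step record.
# Values are illustrative; only idx 0 matches the example record above.
gpus = [
    {"idx": 0, "avg_power_w": 528.6},
    {"idx": 1, "avg_power_w": 531.2},
    {"idx": 2, "avg_power_w": 525.0},
    {"idx": 3, "avg_power_w": 530.1},
    {"idx": 4, "avg_power_w": 350.0},  # deliberately low
    {"idx": 5, "avg_power_w": 527.4},
    {"idx": 6, "avg_power_w": 529.9},
    {"idx": 7, "avg_power_w": 526.3},
]


def find_stragglers(gpu_records, threshold=0.30):
    """Return indices of GPUs drawing more than `threshold` below the median power."""
    med = median(g["avg_power_w"] for g in gpu_records)
    if med <= 0:
        return []
    cutoff = (1.0 - threshold) * med
    return [g["idx"] for g in gpu_records if g["avg_power_w"] < cutoff]


stragglers = find_stragglers(gpus)
print(stragglers)
```

A single low step is rarely meaningful; in practice you would likely require the same GPU to be flagged across many consecutive steps before acting on it.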

How it works

matcha reads energy directly from NVML's hardware accumulator (nvmlDeviceGetTotalEnergyConsumption, available on Volta+). Per-step and session energy are exact counter deltas — millijoule-precise, no integration error. A background poller (default 100 ms) plus boundary reads at each step transition track peak power. Pre-Volta GPUs fall back to trapezoidal integration of polled samples. The training process runs natively — matcha never touches stdin, stdout (in run mode), your model, or your training loop.

  • run does not intercept the child's stdout — it's as close to zero-overhead as reasonable.
  • wrap pipes the child's stdout to detect step boundaries, then appends energy data inline or emits structured records.
  • monitor samples directly without launching a child process.
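The two energy paths described above can be sketched in a few lines. This is an illustration of the arithmetic only, not matcha's code: the function names are hypothetical, and the counter values are passed in as plain numbers so the example runs without a GPU. With the nvidia-ml-py bindings, the counter read would be pynvml.nvmlDeviceGetTotalEnergyConsumption(handle), which reports millijoules.

```python
def counter_delta_j(start_mj, end_mj):
    """Energy between two hardware-counter reads, converted from mJ to J."""
    return (end_mj - start_mj) / 1000.0


def trapezoid_energy_j(samples):
    """Fallback: integrate polled power over time with the trapezoidal rule.

    `samples` is a list of (time_seconds, power_watts) pairs sorted by time.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


# Counter path: two reads taken at the step boundaries (illustrative values).
step_energy = counter_delta_j(1_000_000, 3_354_000)

# Fallback path: three power samples taken 0.5 s apart (illustrative values).
polled = [(0.0, 100.0), (0.5, 200.0), (1.0, 200.0)]
fallback_energy = trapezoid_energy_j(polled)

print(step_energy, fallback_energy)
```

The counter path does not depend on how often power is polled, whereas the fallback's accuracy depends on the sampling interval, since any power spike between two samples is missed.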

Compatibility

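  • Hardware: verified on NVIDIA H100. Should work with any GPU supported by NVML (A100, L4, L40S, Blackwell, and others); the hardware energy counter requires Volta or newer, and older GPUs use the integration fallback.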
  • Frameworks: framework-agnostic — torchrun, deepspeed, accelerate, or plain python.
  • Multi-node: matcha runs per-node and emits per-node records; aggregate with labels.node=... or hostname downstream.
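For multi-node runs, downstream aggregation can be as simple as summing each node's session total. The sketch below assumes one JSONL file per node, already parsed into lists of records, and reads hostname from session_start and total_energy_j from session_end as in the example records above. The second hostname, its energy value, and the function names are hypothetical.

```python
def node_summary(records):
    """Return (hostname, total_energy_j) for one node's records."""
    hostname, total_j = None, 0.0
    for rec in records:
        if rec["type"] == "session_start":
            hostname = rec.get("hostname")
        elif rec["type"] == "session_end":
            total_j = rec["total_energy_j"]
    return hostname, total_j


def aggregate(nodes):
    """Return per-node totals keyed by hostname, plus the cluster-wide sum."""
    per_node = dict(node_summary(records) for records in nodes)
    return per_node, sum(per_node.values())


# Two nodes' records, abridged. The second node is made up for illustration.
node_a = [
    {"type": "session_start", "hostname": "h100-node-4"},
    {"type": "session_end", "total_energy_j": 778168.0},
]
node_b = [
    {"type": "session_start", "hostname": "h100-node-5"},
    {"type": "session_end", "total_energy_j": 770000.0},
]

per_node, cluster_total_j = aggregate([node_a, node_b])
print(per_node, cluster_total_j)
```

Passing the same --run-id (or MATCHA_RUN_ID) on every node would give the records a shared key for grouping in a database instead.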

Why

Frontier training runs burn hundreds of MWh. The gap between teams that optimize for energy-per-step and those that don't is measured in millions of dollars per training run. matcha makes that number visible without changing your training code.

8xH100 training run — 1 hour:
  Energy cost:   $0.26 (2.16 kWh @ $0.12/kWh)
  Compute cost:  $23.20 (RunPod @ $23.20/hr)

  → Electricity is about 1% of the rental bill, so the payoff from lower energy per step is mostly that steps usually finish faster, which means less rental time.

Built by

Keeya Labs · Docs · GitHub

License

Apache 2.0 — see LICENSE.
