GPU energy observability for AI training workloads

matcha

GPU energy observability for AI training.

Measure energy per training run and per step — from NVML's hardware counter, not sampled power. Zero-code CLI, Python API, and HuggingFace Trainer callback. Structured output for any observability stack.


Install

pip install usematcha

Linux, Python 3.9+, NVIDIA GPU with drivers installed.

Quickstart

matcha run torchrun --standalone --nproc_per_node=8 train_gpt.py
matcha_energy gpus:8x NVIDIA H100 80GB HBM3 total:778168J (216.16Wh) duration:203.1s avg_power:3832W peak_power:4120W samples:2031

No code changes. No config files. Works with any training script.
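The derived figures in the summary line follow from the total energy and the duration. As a quick sanity check, the sketch below recomputes them from the numbers printed in the sample output above; the variable names are illustrative and not part of matcha.

```python
# Figures copied from the sample matcha_energy line above.
total_j = 778168      # total energy in joules
duration_s = 203.1    # wall-clock duration in seconds (printed to 0.1 s)

# 1 Wh = 3600 J, so divide joules by 3600 to get watt-hours.
energy_wh = total_j / 3600

# Average power is total energy over elapsed time (J / s = W).
avg_power_w = total_j / duration_s

print(f"{energy_wh:.2f} Wh, about {avg_power_w:.0f} W average")
```

The recomputed average lands within a watt or so of the reported avg_power, since the printed duration is itself rounded.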


Three ways to use it

matcha exposes one measurement engine through three surfaces. All three read the same NVML hardware counter and emit the same StepResult / SessionResult shape.

CLI — zero-code, wraps any training command.

matcha run  python train.py                         # total energy
matcha wrap python train.py                         # per-step energy
matcha monitor                                      # live dashboard

See docs/playbooks/cli for diff, JSONL output, and multi-run comparison.

Python API — opt-in, for framework integrations and notebook work.

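import matcha

# num_steps and train_step() come from your own training code.
with matcha.session() as s:
    for i in range(num_steps):
        with s.step(i):
            train_step()

# Session totals are available once the session has closed.
print(s.result.total_energy_j, s.result.energy_wh)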

See docs/playbooks/python-api for explicit lifecycle, custom metrics, and multi-GPU details.

HuggingFace Trainer callback — drop-in for the Trainer loop.

from transformers import Trainer

from matcha.callbacks import StepEnergyCallback

# model and args are your usual Trainer inputs.
trainer = Trainer(model=model, args=args, callbacks=[StepEnergyCallback()])
trainer.train()

Per-step energy flows into the Trainer's log dict — visible in stdout, TensorBoard, and WandB automatically. Install with pip install 'usematcha[hf]'.

See docs/playbooks/huggingface for DDP, failure modes, and config.


Observability

Structured output plugs into the stack you already have.

  • JSONL: --output run.jsonl writes session_start / step / session_end records with per-GPU breakdowns. Stream into ClickHouse, DuckDB, or any log pipeline.
  • Prometheus: --prometheus :9400 exposes a /metrics endpoint with step-level and GPU-live gauges, plus training metrics auto-extracted from stdout.
  • OpenTelemetry: --otlp URL pushes the same metric set to Grafana Cloud, Honeycomb, Datadog, or any OTel collector. Install with pip install 'usematcha[otlp]'.

Metric names match across Prometheus and OTLP so dashboards port between deployments.
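Because the JSONL output is one JSON object per line, it can be post-processed with the standard library alone. The sketch below sums energy over step records. The field names "type" and "energy_j" are assumptions made for illustration, not matcha's documented schema, and the sample records are made up; check a real run.jsonl for the actual keys.

```python
import json
from io import StringIO


def total_step_energy(lines, type_key="type", energy_key="energy_j"):
    """Sum energy over step records in a JSONL stream.

    type_key and energy_key are ASSUMED field names, not a documented schema.
    Returns (total_joules, number_of_step_records).
    """
    total = 0.0
    steps = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get(type_key) == "step":
            total += float(record[energy_key])
            steps += 1
    return total, steps


# Made-up records for illustration only; real files come from --output run.jsonl.
sample = StringIO(
    '{"type": "session_start"}\n'
    '{"type": "step", "energy_j": 380.5}\n'
    '{"type": "step", "energy_j": 379.5}\n'
    '{"type": "session_end"}\n'
)
total_j, n_steps = total_step_energy(sample)
print(total_j, n_steps)
```

To read a real file, pass an open file handle instead of the StringIO object and adjust the two key arguments to match the records.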


Multi-GPU

matcha auto-detects every visible GPU and reports summed totals plus a per-GPU breakdown in every record. The per-GPU arrays make straggler detection a one-query affair — one rank consistently drawing ~30% less power usually means a stuck collective, a thermally throttled card, or a PCIe link degraded to Gen3.

matcha run --gpus 0,1,2,3 torchrun ...
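The straggler heuristic described above can be sketched as a comparison against the median rank. This is a minimal illustration, assuming you have already extracted one average-power value per GPU from the per-GPU arrays; the function name, the 0.7 threshold (mirroring the "~30% less" rule of thumb), and the sample values are hypothetical.

```python
from statistics import median


def find_stragglers(avg_power_w, threshold=0.7):
    """Return indices of GPUs whose average power is below threshold * median."""
    baseline = median(avg_power_w)
    return [i for i, p in enumerate(avg_power_w) if p < threshold * baseline]


# Made-up per-GPU average power in watts, for illustration only.
per_gpu_power = [480.0, 475.0, 482.0, 330.0, 478.0, 481.0, 479.0, 476.0]
stragglers = find_stragglers(per_gpu_power)
print(stragglers)
```

Using the median rather than the mean keeps the baseline stable when one card is far off the pack.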

How it works

matcha reads energy directly from NVML's hardware accumulator (nvmlDeviceGetTotalEnergyConsumption, Volta+). Per-step and session energy are exact counter deltas — millijoule-precise, no integration error. A background poller plus boundary reads at each step transition track peak power. Pre-Volta GPUs fall back to trapezoidal integration. Training runs natively; matcha never touches your model or training loop.
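The two measurement strategies named above differ in where the error comes from. The sketch below contrasts them in plain Python: a delta between two readings of a cumulative millijoule counter (the unit NVML reports for total energy consumption) versus trapezoidal integration over sampled power. It is not matcha's implementation; the function names and input values are made up for illustration.

```python
def energy_from_counter(start_mj, end_mj):
    """Energy in joules from two readings of a cumulative millijoule counter."""
    return (end_mj - start_mj) / 1000.0


def energy_trapezoid(timestamps_s, power_w):
    """Approximate energy in joules by trapezoidal integration of power samples."""
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Area of the trapezoid between two consecutive samples.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Made-up counter readings at the start and end of a step.
counter_j = energy_from_counter(1_000_000, 1_450_000)

# Made-up power samples taken every 0.5 s.
t = [0.0, 0.5, 1.0, 1.5]
p = [300.0, 320.0, 280.0, 300.0]
trapz_j = energy_trapezoid(t, p)

print(counter_j, trapz_j)
```

The counter delta depends only on the two boundary reads, while the trapezoidal estimate assumes power changes linearly between samples and so misses any spike or dip that falls between them.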

Full design in ARCHITECTURE.md.


Documentation  ·  Changelog  ·  Architecture  ·  Contributing  ·  Security

Built by Keeya Labs. Apache 2.0.
