GPU energy observability for AI training workloads

matcha

GPU energy observability for AI training.

Measure energy per training run and per step. On NVIDIA and Apple Silicon the reading comes from the GPU's hardware energy counter rather than sampled power; on AMD and Intel it is currently integrated from polled power. Zero-code CLI, Python API, and HuggingFace Trainer callback. Structured output for any observability stack.

Install

pip install usematcha

Python 3.9+. Linux or macOS. One supported GPU: NVIDIA (NVML), AMD (rocm-smi), Intel (xpu-smi), or Apple Silicon (IOReport — no sudo, no extra deps). Auto-detects at start; override with MATCHA_BACKEND=nvml|rocm|intel|apple on multi-vendor hosts.
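When matcha is launched from a Python script on a multi-vendor host, the backend override can be pinned through the child process environment. The sketch below is illustrative only: env_with_backend is a hypothetical helper, not part of matcha; the only things taken from the text above are the MATCHA_BACKEND variable and its four values.

```python
import os

# Backend names taken from the MATCHA_BACKEND=nvml|rocm|intel|apple list above.
VALID_BACKENDS = ("nvml", "rocm", "intel", "apple")


def env_with_backend(backend, base_env=None):
    """Return a copy of the environment with MATCHA_BACKEND pinned.

    Hypothetical helper for illustration; matcha itself only reads the variable.
    """
    if backend not in VALID_BACKENDS:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {VALID_BACKENDS}"
        )
    env = dict(os.environ if base_env is None else base_env)
    env["MATCHA_BACKEND"] = backend
    return env


# Possible usage (requires matcha to be installed and a train.py to exist):
#   import subprocess
#   subprocess.run(["matcha", "run", "python", "train.py"],
#                  env=env_with_backend("nvml"), check=True)
```

Validating the name before launching fails fast on a typo instead of relying on how the tool handles an unknown value.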

Quickstart

matcha run torchrun --standalone --nproc_per_node=8 train_gpt.py

matcha_energy gpus:8x NVIDIA H100 80GB HBM3 total:778168J (216.16Wh) duration:203.1s avg_power:3832W peak_power:4120W samples:2031

Same command on a MacBook (M-series) against an MLX training script:

matcha_energy gpus:Apple M4 total:4449J (1.24Wh) duration:837.9s avg_power:5W peak_power:19W samples:8066

No code changes. No config files. Works with any training script.
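The summary line is regular enough to parse in a script, for example to sanity-check the joule-to-watt-hour conversion. The following is a rough sketch whose regular expression is inferred only from the two sample lines above; the real format may carry fields or variations not shown here, and for machine consumption the structured outputs described under Observability are the better route.

```python
import re

# Pattern inferred from the sample output lines; treat it as an assumption.
SUMMARY_RE = re.compile(
    r"matcha_energy gpus:(?P<gpus>.+?) total:(?P<total_j>[\d.]+)J "
    r"\((?P<total_wh>[\d.]+)Wh\) duration:(?P<duration_s>[\d.]+)s "
    r"avg_power:(?P<avg_power_w>[\d.]+)W peak_power:(?P<peak_power_w>[\d.]+)W "
    r"samples:(?P<samples>\d+)"
)


def parse_summary(line):
    """Parse one matcha_energy summary line into a dict of typed values."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("not a matcha_energy summary line")
    fields = match.groupdict()
    return {
        "gpus": fields["gpus"],
        "total_j": float(fields["total_j"]),
        "total_wh": float(fields["total_wh"]),
        "duration_s": float(fields["duration_s"]),
        "avg_power_w": float(fields["avg_power_w"]),
        "peak_power_w": float(fields["peak_power_w"]),
        "samples": int(fields["samples"]),
    }


line = (
    "matcha_energy gpus:8x NVIDIA H100 80GB HBM3 total:778168J (216.16Wh) "
    "duration:203.1s avg_power:3832W peak_power:4120W samples:2031"
)
summary = parse_summary(line)

# 1 Wh = 3600 J, so the reported Wh figure can be recomputed from joules.
recomputed_wh = round(summary["total_j"] / 3600.0, 2)

# Average power is energy over time; the printed duration is rounded,
# so the recomputed value only approximates the reported one.
recomputed_avg_w = summary["total_j"] / summary["duration_s"]
```

For both sample lines the recomputed watt-hours match the printed value, and the recomputed average power lands within a watt of the reported figure.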

Three ways to use it

matcha exposes one measurement engine through three surfaces. All three read the active vendor's hardware counter (NVML on NVIDIA, IOReport on Apple Silicon) or polled power (AMD, Intel) and emit the same StepResult / SessionResult shape — including a backend field so multi-vendor fleets slice cleanly.

CLI — zero-code, wraps any training command.

matcha run  python train.py                         # total energy
matcha wrap python train.py                         # per-step energy
matcha monitor                                      # live dashboard

See docs/playbooks/cli for diff, JSONL output, and multi-run comparison.

Python API — opt-in, for framework integrations and notebook work.

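import matcha

# num_steps and train_step() come from your own training script.
with matcha.session() as s:
    for i in range(num_steps):
        with s.step(i):  # energy measured inside this block is attributed to step i
            train_step()

# Session totals, in joules and watt-hours.
print(s.result.total_energy_j, s.result.energy_wh)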

See docs/playbooks/python-api for explicit lifecycle, custom metrics, and multi-GPU details.

HuggingFace Trainer callback — drop-in for the Trainer loop.

from transformers import Trainer

from matcha.callbacks import StepEnergyCallback

# model and args are your usual Trainer inputs.
trainer = Trainer(model=model, args=args, callbacks=[StepEnergyCallback()])
trainer.train()

Per-step energy flows into the Trainer's log dict — visible in stdout, TensorBoard, and WandB automatically. Install with pip install 'usematcha[hf]'.

See docs/playbooks/huggingface for DDP, failure modes, and config.

Observability

Structured output plugs into the stack you already have.

JSONL — --output run.jsonl writes session_start / step / session_end records with per-GPU breakdowns. Stream into ClickHouse, DuckDB, or any log pipeline.
Prometheus — --prometheus :9400 exposes a /metrics endpoint with step-level and GPU-live gauges, plus training metrics auto-extracted from stdout.
OpenTelemetry — --otlp URL pushes the same metric set to Grafana Cloud, Honeycomb, Datadog, or any OTel collector. Install with pip install 'usematcha[otlp]'.

Metric names match across Prometheus and OTLP so dashboards port between deployments.
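As an illustration of consuming the JSONL output, the sketch below sums per-step energy from a stream of records. The record types (session_start, step, session_end) and the backend field come from the descriptions above; every other field name (type, step, energy_j, gpu_energy_j, total_energy_j) is a placeholder assumption, so check a real run.jsonl for the actual schema before relying on it.

```python
import io
import json

# Synthetic stand-in for run.jsonl. The field names here are assumptions
# made for illustration, not matcha's documented schema.
SAMPLE = io.StringIO(
    '{"type": "session_start", "backend": "nvml"}\n'
    '{"type": "step", "step": 0, "energy_j": 120.5, "gpu_energy_j": [60.0, 60.5]}\n'
    '{"type": "step", "step": 1, "energy_j": 118.0, "gpu_energy_j": [59.0, 59.0]}\n'
    '{"type": "session_end", "total_energy_j": 238.5}\n'
)


def step_energy_by_index(lines):
    """Map step index to energy in joules, ignoring non-step records."""
    energy = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        record = json.loads(raw)
        if record.get("type") == "step":
            energy[record["step"]] = record["energy_j"]
    return energy


steps = step_energy_by_index(SAMPLE)
total_j = sum(steps.values())
```

With a real file, replace SAMPLE with open("run.jsonl") and adjust the keys to whatever the records actually contain.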

Multi-GPU

matcha auto-detects every visible GPU and reports summed totals plus a per-GPU breakdown in every record. The per-GPU arrays make straggler detection a one-query affair — one rank consistently drawing ~30% less power usually means a stuck collective, a thermally throttled card, or a PCIe link degraded to Gen3.

matcha run --gpus 0,1,2,3 torchrun ...
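The straggler check described above can be approximated in a few lines once the per-GPU power samples are in hand. This is a minimal sketch on made-up numbers: find_stragglers and its 0.7 threshold are illustrative choices derived from the "~30% less power" rule of thumb, not something matcha ships.

```python
from statistics import mean, median


def find_stragglers(per_gpu_power_w, threshold=0.7):
    """Return indices of GPUs whose mean power is below threshold * median.

    per_gpu_power_w: one list of power samples (watts) per GPU.
    """
    means = [mean(samples) for samples in per_gpu_power_w]
    baseline = median(means)  # median is robust to a single slow rank
    return [i for i, m in enumerate(means) if m < threshold * baseline]


# Hypothetical samples for four GPUs; GPU 2 draws roughly 32% less than its peers.
power = [
    [480.0, 490.0, 485.0],
    [478.0, 492.0, 488.0],
    [330.0, 325.0, 335.0],
    [482.0, 487.0, 491.0],
]
stragglers = find_stragglers(power)
```

Comparing against the median rather than the mean keeps one slow card from dragging the baseline down toward itself.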

How it works

One engine, four backends.

NVIDIA (NVML, Volta+). Reads nvmlDeviceGetTotalEnergyConsumption — a millijoule-precise cumulative energy counter. Per-step and session energy are exact counter deltas (no integration error, zero per-step overhead). Pre-Volta cards fall back to trapezoidal integration of polled power.
Apple Silicon (IOReport). Reads Darwin's IOReport framework directly via stdlib ctypes (/usr/lib/libIOReport.dylib). Same semantic class as NVML — cumulative millijoule GPU counter — so energy_source="counter" on M-series too. No sudo, no powermetrics subprocess, no extra pip deps. Step boundaries force a fresh IOReport sample so per-step attribution is counter-exact even for sub-100 ms steps.
AMD (rocm-smi) / Intel (xpu-smi). Wraps the vendor CLI in a cached refresher thread; for now, energy is computed by trapezoidal integration of polled power. Counter-based paths via amdsmi and Level Zero are planned.
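The two energy sources named in the list above differ only in arithmetic. The sketch below shows both on plain numbers, with no GPU access: a counter delta (as with readings from nvmlDeviceGetTotalEnergyConsumption, which reports millijoules) and trapezoidal integration of polled power. The function names and sample values are illustrative, not matcha internals.

```python
def energy_from_counter(start_mj, end_mj):
    """Energy in joules from two readings of a cumulative millijoule counter."""
    return (end_mj - start_mj) / 1000.0


def energy_from_power_samples(samples):
    """Trapezoidal integration of (time_s, power_w) samples, in joules."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Area of one trapezoid: mean of the two powers times the interval.
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


# Counter path: two hypothetical readings taken at the step boundaries.
counter_j = energy_from_counter(1_000_000, 1_350_000)

# Polled path: power ramps from 100 W to 200 W, then holds for one second.
polled_j = energy_from_power_samples([(0.0, 100.0), (1.0, 200.0), (2.0, 200.0)])
```

Here both paths give 350 J because the power curve is piecewise linear between samples. Real workloads spike between polls, which is where the integration error of the polled path comes from and why the counter delta is preferred when the hardware provides one.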

A background poller plus boundary reads at each step transition track peak power on every backend. Training runs natively; matcha never touches your model or training loop.
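One way to picture the poller-plus-boundary-reads arrangement is a shared peak tracker fed from two places: a background thread and the step boundary in the main thread. The sketch below uses a fake power reader and invented names (PeakTracker, poll); it illustrates the pattern under those assumptions and is not matcha's implementation.

```python
import threading


class PeakTracker:
    """Thread-safe running maximum of observed power, in watts."""

    def __init__(self):
        self._lock = threading.Lock()
        self.peak_w = 0.0

    def observe(self, watts):
        with self._lock:
            if watts > self.peak_w:
                self.peak_w = watts


def poll(read_power_w, tracker, stop, interval_s=0.1):
    """Sample power until the stop event is set."""
    while not stop.is_set():
        tracker.observe(read_power_w())
        stop.wait(interval_s)  # sleeps, but wakes immediately on stop


# Fake reader standing in for a vendor power query; it ends the demo
# by setting the stop event once its readings are exhausted.
readings = iter([210.0, 340.0, 295.0])
stop = threading.Event()


def fake_read():
    value = next(readings, None)
    if value is None:
        stop.set()
        return 0.0
    return value


tracker = PeakTracker()
worker = threading.Thread(target=poll, args=(fake_read, tracker, stop, 0.001))
worker.start()

# Boundary read from the main thread, e.g. at a step transition.
tracker.observe(365.0)

worker.join()
```

Feeding both sources into one tracker means a spike caught only at a step boundary still counts toward the reported peak.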

Full design in ARCHITECTURE.md.

Documentation · Changelog · Architecture · Contributing · Security

Built by Keeya Labs. Apache 2.0.

Release history

0.3.1 (Apr 29, 2026)
0.3.0 (Apr 20, 2026), this version
0.3.0rc1 (Apr 20, 2026), pre-release
0.2.4 (Apr 18, 2026)
0.2.3 (Apr 18, 2026)
0.2.2 (Apr 17, 2026)
0.2.1 (Apr 17, 2026)
0.2.0 (Apr 17, 2026)
0.1.0 (Apr 5, 2026)

Download files

Source Distribution

usematcha-0.3.0.tar.gz (52.4 kB)
Upload date: Apr 20, 2026
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

Algorithm	Hash digest
SHA256	`d6eedbb79a46fab36528ed319969b6c45b8a15260af1949f1deaf5a21c4cbd79`
MD5	`bc5023ba0668a71fdd1ae7c455fdfa2b`
BLAKE2b-256	`f6e7078bf5d89dd5707e2da7bda9a803f118626f8f939b8d90cbabaa82fdb3b4`

Built Distribution

usematcha-0.3.0-py3-none-any.whl (61.3 kB)
Upload date: Apr 20, 2026
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

Algorithm	Hash digest
SHA256	`2f01c1c0089e47e4cd0fccdfe81b56d31f907cd6564106de789a2d7b85cbff5a`
MD5	`7bbfb0cd28ea179001cb7f71c07be298`
BLAKE2b-256	`4b5babb54278d00fbb6818462dbe077f19c760cd0ca96c77cc28b9e427430088`
