Real-time training observability cards for Metaflow — framework-agnostic, zero infra
metaflow-traincard
See your training run live — loss curves, GPU usage, and checkpoints — without leaving Metaflow.
The problem
LLM fine-tuning jobs run for hours with zero visibility. By the time you know your loss is diverging or your GPU is idle, the run has been wasting compute for hours. TensorBoard requires a separate server; W&B requires credentials and an account — and neither result is versioned with the run itself.
Quick start
pip install metaflow-traincard
from metaflow import FlowSpec, step, card
from metaflow_traincard import Reporter


class MyFlow(FlowSpec):
    @card(type="traincard")
    @step
    def train(self):
        reporter = Reporter()
        # `loader` and `train_step` are your own data loader and training function
        for step_num, batch in enumerate(loader):
            loss = train_step(batch)
            reporter.metric("loss", loss, step_num)
        reporter.finish()
        # Store the final state as an artifact so the card can render it
        self.traincard_state = reporter.get_state()
        self.next(self.end)
Open the card in the Metaflow UI — live loss curves, GPU bars, and checkpoint history are waiting.
Install
# Core
pip install metaflow-traincard
# With HuggingFace Trainer integration
pip install "metaflow-traincard[hf]"
Usage
Raw PyTorch
reporter = Reporter(
    output_dir="/tmp/traincard",  # where metrics are buffered
    flush_interval=5,             # seconds between background flushes
    rank=0,                       # distributed rank (non-zero ranks are silent)
    world_size=1,
)
reporter.metric("train/loss", loss, step=global_step)
reporter.metric("train/learning_rate", lr, step=global_step)
reporter.system({
    "gpu_utilization": [88.0, 83.0],
    "gpu_memory_used_gb": [18.5, 18.2],
    "gpu_memory_total_gb": [24.0, 24.0],
    "cpu_percent": 35.0,
    "ram_used_gb": 42.1,
    "ram_total_gb": 64.0,
})
reporter.checkpoint("/tmp/ckpt-100", metadata={"eval_loss": 1.38, "epoch": 2})
reporter.finish()
self.traincard_state = reporter.get_state()
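In a distributed job, the `rank` and `world_size` arguments usually come from the launcher rather than being hard-coded. The sketch below assumes a launcher such as `torchrun`, which exports `RANK` and `WORLD_SIZE` as environment variables; `rank_and_world_size` is a hypothetical helper and is not part of metaflow-traincard.

```python
import os
from typing import Mapping, Tuple


def rank_and_world_size(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Read the distributed rank and world size, defaulting to a single process."""
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    return rank, world_size
```

The two values can then be passed straight through, for example `Reporter(rank=rank, world_size=world_size)`.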
HuggingFace Trainer
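from transformers import Trainer
from metaflow_traincard import HFTrainCardCallback

# Keep a reference to the callback so its reporter can be read after training
callback = HFTrainCardCallback()
trainer = Trainer(
    model=model,
    args=training_args,
    callbacks=[callback],
)
trainer.train()
self.traincard_state = callback.reporter.get_state()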
HFTrainCardCallback maps all Trainer events automatically — on_log → metrics, on_save → checkpoints, on_evaluate → eval phase, on_train_end → finish. GPU/CPU telemetry is sampled every 10 seconds via pynvml + psutil (both optional).
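To illustrate what optional telemetry sampling can look like, here is a minimal sketch that collects one sample with psutil and pynvml and quietly skips whichever is missing. The function name `sample_system` is hypothetical, the key names simply mirror the `reporter.system(...)` example above, and this is not the callback's actual sampler.

```python
from typing import Any, Dict


def sample_system() -> Dict[str, Any]:
    """Collect one telemetry sample; fields are omitted when a library is unavailable."""
    sample: Dict[str, Any] = {}

    try:
        import psutil  # optional dependency

        mem = psutil.virtual_memory()
        sample["cpu_percent"] = psutil.cpu_percent(interval=None)
        sample["ram_used_gb"] = round(mem.used / 1e9, 2)
        sample["ram_total_gb"] = round(mem.total / 1e9, 2)
    except ImportError:
        pass  # psutil not installed: leave CPU/RAM fields out

    try:
        import pynvml  # optional dependency

        pynvml.nvmlInit()
        try:
            util, used, total = [], [], []
            for i in range(pynvml.nvmlDeviceGetCount()):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                util.append(float(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu))
                mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
                used.append(round(mem_info.used / 1e9, 2))
                total.append(round(mem_info.total / 1e9, 2))
            sample["gpu_utilization"] = util
            sample["gpu_memory_used_gb"] = used
            sample["gpu_memory_total_gb"] = total
        finally:
            pynvml.nvmlShutdown()
    except Exception:
        pass  # pynvml missing, no NVIDIA driver, or no GPU: leave GPU fields out

    return sample
```

A dict produced this way has the same shape as the argument to `reporter.system(...)` in the Raw PyTorch example.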
Preview the card locally
from metaflow import Flow, namespace
from metaflow_traincard import render_state

namespace(None)  # search runs across all namespaces
state = Flow("MyFlow").latest_run["train"].task["traincard_state"].data
with open("card.html", "w") as f:
    f.write(render_state(state))
# open card.html in a browser
How it works
The Reporter writes metrics and telemetry to a local events.jsonl log via a background thread, flushing an atomic latest.json snapshot every few seconds. On step completion, get_state() returns the full in-memory state dict, which is stored as the traincard_state artifact. The TrainCard renderer reads that artifact and produces a self-contained HTML page — Chart.js charts, GPU utilization bars, checkpoint table, and log viewer — served by Metaflow's card system.
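The buffering pattern described above can be sketched roughly as follows. `MiniReporter` is a hypothetical toy class written for illustration, not the real `Reporter`: it keeps the full state in memory, appends pending events to `events.jsonl` from a background thread, and leaves out the `latest.json` snapshot, telemetry, and checkpoints.

```python
import json
import os
import threading


class MiniReporter:
    """Toy illustration of in-memory state plus a background JSONL flush."""

    def __init__(self, output_dir: str, flush_interval: float = 5.0):
        os.makedirs(output_dir, exist_ok=True)
        self._events_path = os.path.join(output_dir, "events.jsonl")
        self._lock = threading.Lock()
        self._pending = []              # events not yet written to disk
        self._state = {"metrics": {}}   # full in-memory history
        self._interval = flush_interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def metric(self, name: str, value: float, step: int) -> None:
        with self._lock:
            self._pending.append({"name": name, "value": value, "step": step})
            self._state["metrics"].setdefault(name, []).append([step, value])

    def _run(self) -> None:
        # Event.wait returns True once stop is set, which ends the loop
        while not self._stop.wait(self._interval):
            self.flush()

    def flush(self) -> None:
        # Writing under the lock keeps the sketch simple; a real
        # implementation might swap the buffer out and write outside it.
        with self._lock:
            batch, self._pending = self._pending, []
            if batch:
                with open(self._events_path, "a") as f:
                    for event in batch:
                        f.write(json.dumps(event) + "\n")

    def finish(self) -> None:
        self._stop.set()
        self._thread.join()
        self.flush()  # write anything logged since the last background flush

    def get_state(self) -> dict:
        with self._lock:
            return json.loads(json.dumps(self._state))  # deep copy
```

Because `metric()` only appends to in-memory lists, the training loop never waits on disk I/O for longer than one flush takes.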
Crash safety: latest.json is written via tmp-then-rename, and a SIGTERM handler flushes state before the process exits.
Resume detection: if the same output_dir exists from a prior run, metric history is loaded and a visual discontinuity marker is inserted in each chart.
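The tmp-then-rename write, the SIGTERM flush, and the reload of a prior snapshot can be sketched as below. All three function names are hypothetical and the snapshot layout is an assumption; the sketch only shows the general technique, not the package's internals.

```python
import json
import os
import signal
import sys
import tempfile


def write_snapshot(path: str, state: dict) -> None:
    """Write state to path so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # Atomic as long as both paths live on the same filesystem
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_previous(path: str) -> dict:
    """Return the snapshot from a prior run, or an empty dict if there is none."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def install_sigterm_flush(path: str, get_state) -> None:
    """On SIGTERM, write one last snapshot and exit (call from the main thread)."""

    def handler(signum, frame):
        write_snapshot(path, get_state())
        sys.exit(128 + signum)

    signal.signal(signal.SIGTERM, handler)
```

Creating the temporary file in the same directory as the target matters: a rename across filesystems is not atomic.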
Card sections
| Section | What it shows |
|---|---|
| Status header | Phase badge (TRAINING / EVALUATING / SAVING / DONE), step, epoch, elapsed time |
| Training Metrics | Live Chart.js line charts — loss, eval loss, LR, grad norm, tokens/sec, any custom metric |
| System Telemetry | Per-GPU utilization bars, VRAM used/total, temperature, CPU %, RAM, disk throughput |
| Checkpoints | Table of saved checkpoints — step, size, age, metadata; BEST badge on lowest eval loss |
| Logs | Searchable tail of recent log lines; errors and warnings highlighted |
| Failure Summary | Exception type, message, traceback toggle, OOM warning (shown only on crash) |
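As a rough illustration of the BEST badge rule (lowest eval loss), the sketch below picks the winning row from a list of checkpoint records. The record shape, a dict with a `metadata` entry holding `eval_loss`, is an assumption based on the `reporter.checkpoint(...)` call shown earlier, and `best_checkpoint_index` is a hypothetical helper.

```python
from typing import Optional, Sequence


def best_checkpoint_index(checkpoints: Sequence[dict]) -> Optional[int]:
    """Return the index of the checkpoint with the lowest eval_loss, if any has one."""
    best_i, best_loss = None, float("inf")
    for i, ckpt in enumerate(checkpoints):
        loss = (ckpt.get("metadata") or {}).get("eval_loss")
        # Checkpoints without a numeric eval_loss (or with NaN) are skipped
        if isinstance(loss, (int, float)) and loss < best_loss:
            best_i, best_loss = i, loss
    return best_i
```

Ties go to the earliest checkpoint, since only a strictly lower loss replaces the current best.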
Full card (system telemetry · checkpoints · logs)
Development
git clone https://github.com/npow/metaflow-traincard
cd metaflow-traincard
pip install -e ".[dev]"
pytest tests/ -v
License