Real-time training observability cards for Metaflow — framework-agnostic, zero infra

metaflow-traincard

See your training run live — loss curves, GPU usage, and checkpoints — without leaving Metaflow.

The problem

LLM fine-tuning jobs run for hours with zero visibility. By the time you notice that the loss is diverging or the GPU is idle, the run has already wasted hours of compute. TensorBoard requires a separate server, and W&B requires an account and credentials; neither stores its results versioned alongside the run itself.

Quick start

pip install metaflow-traincard

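from metaflow import FlowSpec, step, card
from metaflow_traincard import Reporter

class MyFlow(FlowSpec):

    # Metaflow requires every flow to begin with a step named "start"
    @step
    def start(self):
        self.next(self.train)

    @card(type="traincard")
    @step
    def train(self):
        reporter = Reporter()

        # loader and train_step are placeholders for your own data loader and training step
        for step_num, batch in enumerate(loader):
            loss = train_step(batch)
            reporter.metric("loss", loss, step_num)

        reporter.finish()
        # Store the final state as an artifact so the card can render it
        self.traincard_state = reporter.get_state()
        self.next(self.end)

    # ...and to finish with a step named "end"
    @step
    def end(self):
        pass

if __name__ == "__main__":
    MyFlow()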

Open the card in the Metaflow UI — live loss curves, GPU bars, and checkpoint history are waiting.

Install

# Core
pip install metaflow-traincard

# With HuggingFace Trainer integration
pip install "metaflow-traincard[hf]"

Usage

Raw PyTorch

reporter = Reporter(
    output_dir="/tmp/traincard",  # where metrics are buffered
    flush_interval=5,             # seconds between background flushes
    rank=0,                       # distributed rank (non-zero ranks are silent)
    world_size=1,
)

reporter.metric("train/loss", loss, step=global_step)
reporter.metric("train/learning_rate", lr, step=global_step)
reporter.system({
    "gpu_utilization": [88.0, 83.0],
    "gpu_memory_used_gb": [18.5, 18.2],
    "gpu_memory_total_gb": [24.0, 24.0],
    "cpu_percent": 35.0,
    "ram_used_gb": 42.1,
    "ram_total_gb": 64.0,
})
reporter.checkpoint("/tmp/ckpt-100", metadata={"eval_loss": 1.38, "epoch": 2})
reporter.finish()

self.traincard_state = reporter.get_state()
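
In a distributed job, the rank and world_size arguments have to come from the launcher. A minimal sketch, assuming the job is started with torchrun, which exports RANK and WORLD_SIZE to each worker process; distributed_identity is a hypothetical helper and not part of metaflow-traincard:

```python
import os

def distributed_identity(env=None):
    """Return (rank, world_size) from torchrun-style environment variables.

    Hypothetical helper for illustration. Falls back to a single-process
    setup (rank 0 of 1) when the variables are absent.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return rank, world_size

rank, world_size = distributed_identity()
# These values could then be passed on as Reporter(rank=rank, world_size=world_size)
```

Since non-zero ranks are silent, every worker can run the same reporting code and only rank 0 produces output.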

HuggingFace Trainer

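from transformers import Trainer
from metaflow_traincard import HFTrainCardCallback

# Keep a reference to the callback so its reporter can be read after training
callback = HFTrainCardCallback()

trainer = Trainer(
    model=model,
    args=training_args,
    callbacks=[callback],
)
trainer.train()
self.traincard_state = callback.reporter.get_state()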

HFTrainCardCallback maps all Trainer events automatically — on_log → metrics, on_save → checkpoints, on_evaluate → eval phase, on_train_end → finish. GPU/CPU telemetry is sampled every 10 seconds via pynvml + psutil (both optional).
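
For raw PyTorch loops, a similar sample can be collected by hand and passed to reporter.system(). The following is a rough sketch rather than the callback's actual implementation: sample_system is a hypothetical helper, the dictionary keys mirror the reporter.system() example above, and the byte-to-GB conversion (dividing by 1024**3) is an assumption. Both psutil and pynvml are treated as optional.

```python
def sample_system():
    """Collect one telemetry sample; missing dependencies are skipped."""
    gib = 1024 ** 3  # assumption: "GB" figures are reported in binary gigabytes
    stats = {}

    try:
        import psutil  # optional: CPU and RAM figures
        mem = psutil.virtual_memory()
        # With interval=None the first call returns 0.0; later calls report
        # utilization since the previous call.
        stats["cpu_percent"] = psutil.cpu_percent(interval=None)
        stats["ram_used_gb"] = round(mem.used / gib, 1)
        stats["ram_total_gb"] = round(mem.total / gib, 1)
    except ImportError:
        pass

    try:
        import pynvml  # optional: NVIDIA GPU figures
        pynvml.nvmlInit()
        try:
            util, used, total = [], [], []
            for i in range(pynvml.nvmlDeviceGetCount()):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                util.append(float(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu))
                mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
                used.append(round(mem_info.used / gib, 1))
                total.append(round(mem_info.total / gib, 1))
            stats["gpu_utilization"] = util
            stats["gpu_memory_used_gb"] = used
            stats["gpu_memory_total_gb"] = total
        finally:
            pynvml.nvmlShutdown()
    except Exception:
        # pynvml is not installed, or no NVIDIA driver is available: skip GPU stats
        pass

    return stats
```

Calling reporter.system(sample_system()) every few seconds from the training loop would approximate what the callback does automatically.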

Preview the card locally

from metaflow import Flow, namespace
from metaflow_traincard import render_state

namespace(None)  # search all namespaces, not only your own
state = Flow("MyFlow").latest_run["train"].task["traincard_state"].data
with open("card.html", "w") as f:
    f.write(render_state(state))
# then open card.html in a browser

How it works

The Reporter writes metrics and telemetry to a local events.jsonl log via a background thread, flushing an atomic latest.json snapshot every few seconds. On step completion, get_state() returns the full in-memory state dict, which is stored as the traincard_state artifact. The TrainCard renderer reads that artifact and produces a self-contained HTML page — Chart.js charts, GPU utilization bars, checkpoint table, and log viewer — served by Metaflow's card system.
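
The write path described above can be sketched in a few dozen lines. This is an illustrative stand-in, not the real Reporter: the MiniReporter class, its state layout ({"metrics": {name: [[step, value], ...]}}) and the event fields are all assumptions; only the file names events.jsonl and latest.json come from the description.

```python
import json
import os
import threading

class MiniReporter:
    """Illustrative stand-in for the flush loop; not the real Reporter."""

    def __init__(self, output_dir, flush_interval=5.0):
        self.output_dir = output_dir
        self.flush_interval = flush_interval
        os.makedirs(output_dir, exist_ok=True)
        self._lock = threading.Lock()
        self._pending = []             # events not yet written to disk
        self._state = {"metrics": {}}  # full in-memory state
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def metric(self, name, value, step):
        event = {"type": "metric", "name": name, "value": float(value), "step": step}
        with self._lock:
            self._pending.append(event)
            self._state["metrics"].setdefault(name, []).append([step, float(value)])

    def _flush(self):
        # Take the buffered events and a snapshot under the lock, then do I/O outside it
        with self._lock:
            pending, self._pending = self._pending, []
            snapshot = json.dumps(self._state)
        with open(os.path.join(self.output_dir, "events.jsonl"), "a") as f:
            for event in pending:
                f.write(json.dumps(event) + "\n")
        # Write the snapshot to a temp file, then rename it over latest.json
        latest = os.path.join(self.output_dir, "latest.json")
        with open(latest + ".tmp", "w") as f:
            f.write(snapshot)
        os.replace(latest + ".tmp", latest)

    def _run(self):
        # Event.wait() returns True once finish() sets the stop flag
        while not self._stop.wait(self.flush_interval):
            self._flush()

    def finish(self):
        self._stop.set()
        self._thread.join()
        self._flush()  # final flush so nothing buffered is lost

    def get_state(self):
        with self._lock:
            return json.loads(json.dumps(self._state))  # deep copy
```

Keeping disk I/O on a background thread means metric() only appends to a list under a lock, so the training loop is never blocked on a write.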

Crash safety: latest.json is written via tmp-then-rename, and a SIGTERM handler flushes state before the process exits.

Resume detection: if the same output_dir exists from a prior run, metric history is loaded and a visual discontinuity marker is inserted in each chart.
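
A rough sketch of the SIGTERM flush and resume detection, under stated assumptions: both function names are hypothetical, latest.json is assumed to hold {"metrics": {name: [[step, value], ...]}}, and the discontinuity marker is modelled as a None entry, which a charting layer could render as a break in the line. The package's real file layout and marker may differ.

```python
import json
import os
import signal
import sys

def install_sigterm_flush(flush):
    """Call flush() before exiting when the process receives SIGTERM.

    Must be called from the main thread, as signal.signal() requires.
    """
    def handler(signum, frame):
        flush()
        sys.exit(128 + signum)  # conventional exit status for a signal
    signal.signal(signal.SIGTERM, handler)

def load_previous_history(output_dir):
    """Return metric history left by an earlier run in output_dir, or {}."""
    path = os.path.join(output_dir, "latest.json")
    if not os.path.exists(path):
        return {}  # fresh run: nothing to resume
    with open(path) as f:
        metrics = json.load(f).get("metrics", {})
    # Append None to each series to mark the gap between the old run
    # and the resumed one.
    return {name: points + [None] for name, points in metrics.items()}
```

A reporter would call load_previous_history(output_dir) once at start-up and install_sigterm_flush with its own flush method.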

Card sections

Section	What it shows
Status header	Phase badge (TRAINING / EVALUATING / SAVING / DONE), step, epoch, elapsed time
Training Metrics	Live Chart.js line charts — loss, eval loss, LR, grad norm, tokens/sec, any custom metric
System Telemetry	Per-GPU utilization bars, VRAM used/total, temperature, CPU %, RAM, disk throughput
Checkpoints	Table of saved checkpoints — step, size, age, metadata; BEST badge on lowest eval loss
Logs	Searchable tail of recent log lines; errors and warnings highlighted
Failure Summary	Exception type, message, traceback toggle, OOM warning (shown only on crash)
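
The BEST badge rule (lowest eval loss) amounts to a simple selection over checkpoint metadata. A small sketch, assuming each checkpoint is a dict with a "path" and a "metadata" mapping like the one passed to reporter.checkpoint(); best_checkpoint and the record layout are hypothetical, and the second and third entries are made-up sample data:

```python
def best_checkpoint(checkpoints):
    """Return the checkpoint with the lowest eval_loss, or None if none report one."""
    scored = [c for c in checkpoints if "eval_loss" in c.get("metadata", {})]
    if not scored:
        return None
    return min(scored, key=lambda c: c["metadata"]["eval_loss"])

checkpoints = [
    {"path": "/tmp/ckpt-100", "metadata": {"eval_loss": 1.38, "epoch": 2}},
    {"path": "/tmp/ckpt-200", "metadata": {"eval_loss": 1.21, "epoch": 4}},  # illustrative
    {"path": "/tmp/ckpt-300", "metadata": {}},                               # no eval yet
]
print(best_checkpoint(checkpoints)["path"])  # /tmp/ckpt-200
```

Checkpoints saved without an eval_loss are ignored rather than treated as best or worst.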

[Screenshot: TrainCard status header and metric charts]

[Screenshot: TrainCard full view (system telemetry · checkpoints · logs)]

Development

git clone https://github.com/npow/metaflow-traincard
cd metaflow-traincard
pip install -e ".[dev]"
pytest tests/ -v

License

Apache 2.0

Project details

This version

0.1.0

Feb 28, 2026

Download files

Source Distribution

metaflow_traincard-0.1.0.tar.gz (29.1 kB)

Uploaded Feb 28, 2026 Source

Built Distribution

metaflow_traincard-0.1.0-py3-none-any.whl (23.1 kB)

Uploaded Feb 28, 2026 Python 3

File details

Details for the file metaflow_traincard-0.1.0.tar.gz.

File metadata

Download URL: metaflow_traincard-0.1.0.tar.gz
Upload date: Feb 28, 2026
Size: 29.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for metaflow_traincard-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f564173d52924a578eb4ed22e377cc7859e070da7cf435801998e3ca96d5ce65`
MD5	`c9bee9bd64f4f114630e24a91d6eb68c`
BLAKE2b-256	`599cdba244dd2709beb9521d9001871d0b5cc2b41771930276dbb946a502aa14`

Provenance

The following attestation bundles were made for metaflow_traincard-0.1.0.tar.gz:

Publisher: publish.yml on npow/metaflow-traincard

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: metaflow_traincard-0.1.0.tar.gz
- Subject digest: f564173d52924a578eb4ed22e377cc7859e070da7cf435801998e3ca96d5ce65
- Sigstore transparency entry: 1004778952
- Sigstore integration time: Feb 28, 2026
Source repository:
- Permalink: npow/metaflow-traincard@319c71c95a652944d3d24244657ac7da4dc169eb
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/npow
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@319c71c95a652944d3d24244657ac7da4dc169eb
- Trigger Event: push

File details

Details for the file metaflow_traincard-0.1.0-py3-none-any.whl.

File metadata

Download URL: metaflow_traincard-0.1.0-py3-none-any.whl
Upload date: Feb 28, 2026
Size: 23.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for metaflow_traincard-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6309114cdc4763c8ad93f5fa33424921a22d7ce13892f608adce034dc12f8c65`
MD5	`1c92e0f75f869ba98ee03c59f7d103ae`
BLAKE2b-256	`77e095564e357e25f69160fff8fc4756ffab0d88439803b501df6675eaba804c`

Provenance

The following attestation bundles were made for metaflow_traincard-0.1.0-py3-none-any.whl:

Publisher: publish.yml on npow/metaflow-traincard

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: metaflow_traincard-0.1.0-py3-none-any.whl
- Subject digest: 6309114cdc4763c8ad93f5fa33424921a22d7ce13892f608adce034dc12f8c65
- Sigstore transparency entry: 1004778954
- Sigstore integration time: Feb 28, 2026
Source repository:
- Permalink: npow/metaflow-traincard@319c71c95a652944d3d24244657ac7da4dc169eb
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/npow
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@319c71c95a652944d3d24244657ac7da4dc169eb
- Trigger Event: push
