Automatic observability for Metaflow — step duration, CPU, memory, disk, and GPU metrics via OpenTelemetry
metaflow-observability
Get production metrics for every Metaflow step — without changing your flow code.
The problem
When a Metaflow pipeline slows down or crashes in production, you have no time-series data to tell you whether it was CPU saturation, a memory spike, a disk bottleneck, or a GPU stall. You're left digging through logs after the fact. Metaflow's built-in tooling gives you per-run artifacts and cards, but nothing you can alert on or trend over time.
Quick start
pip install metaflow-observability
from metaflow import FlowSpec, step
from metaflow.decorators import observability

class MyFlow(FlowSpec):
    # Metaflow requires every flow to begin with a step named "start".
    @step
    def start(self):
        self.next(self.train)

    @observability
    @step
    def train(self):
        ...  # your code; metrics are collected automatically
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MyFlow()
Metrics are exported via OpenTelemetry. Point them at Prometheus + Grafana with:
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
python flow.py run
Install
# Core (CPU, memory, disk, duration)
pip install metaflow-observability
# With GPU support (NVIDIA only, requires CUDA drivers)
pip install "metaflow-observability[gpu]"
Usage
Zero-config with Prometheus
Add @observability to any step. By default, metrics are exposed on a Prometheus scrape endpoint on port 8000.
@observability
@step
def preprocess(self):
    ...
Custom OTel backend
Use any OpenTelemetry-compatible backend (Grafana Cloud, Datadog, Honeycomb, etc.) via standard OTel environment variables:
export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.example.com
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <token>"
GPU metrics
Install the GPU extra and run on a CUDA-enabled machine — GPU utilization and memory are collected automatically per device, tagged with gpu_index.
pip install "metaflow-observability[gpu]"
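The per-device collection described above could look roughly like the sketch below. It assumes the NVML Python bindings (`pynvml`, shipped in the `nvidia-ml-py` package) as the data source; the README does not say which library the GPU extra actually uses, and `read_gpu_samples` is a hypothetical helper name, not part of this package's API.

```python
def read_gpu_samples():
    """Return one dict per visible NVIDIA GPU, or [] when NVML is unavailable."""
    try:
        import pynvml  # assumption: NVML bindings from the nvidia-ml-py package
        pynvml.nvmlInit()
    except Exception:
        # pynvml is not installed, or no NVIDIA driver is present: skip GPU metrics.
        return []
    try:
        samples = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            samples.append({
                "gpu_index": index,                       # tag used to tell devices apart
                "utilization_pct": util.gpu,              # percent of time the GPU was busy
                "memory_used_mb": mem.used / (1024 * 1024),
            })
        return samples
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    for sample in read_gpu_samples():
        print(sample)
```

On a machine without a GPU the function returns an empty list, which mirrors the idea that GPU metrics are optional.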
How it works
@observability wraps task_pre_step / task_post_step / task_exception hooks in Metaflow's decorator API. Before your step code runs, it starts background threads that sample CPU%, RSS memory, disk I/O throughput, and (optionally) GPU utilization at 1-second intervals. When the step finishes, samples are aggregated and exported as OpenTelemetry instruments:
| Metric | Instrument | Tags |
|---|---|---|
| step.duration | Histogram (seconds) | step, flow, run_id, retry |
| step.cpu.pct | Gauge (avg / max / p95) | same |
| step.memory.mb | Gauge (avg / max RSS) | same |
| step.disk.read_bytes | Counter | same |
| step.disk.write_bytes | Counter | same |
| step.disk.read_throughput | Gauge (MB/s) | same |
| step.disk.write_throughput | Gauge (MB/s) | same |
| step.gpu.utilization | Gauge | + gpu_index |
| step.gpu.memory.used_mb | Gauge | + gpu_index |
| step.retries | Gauge | same |
| step.failures | Counter | same |
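To make the sample-then-aggregate idea concrete, here is a minimal standard-library sketch of a background sampler and an avg / max / p95 reduction. It is an illustration under assumptions, not the extension's actual code: `IntervalSampler` and `summarize` are hypothetical names, and the nearest-rank percentile is one reasonable choice among several.

```python
import threading
import time
from statistics import mean


class IntervalSampler:
    """Calls read_fn on a background thread at a fixed interval and keeps the values."""

    def __init__(self, read_fn, interval_s=1.0):
        self._read_fn = read_fn
        self._interval_s = interval_s
        self._samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait acts as an interruptible sleep: it returns True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self._samples.append(self._read_fn())

    def start(self):
        self._thread.start()

    def stop(self):
        """Stop sampling, wait for the thread to exit, and return the collected values."""
        self._stop.set()
        self._thread.join()
        return list(self._samples)


def summarize(samples):
    """Reduce raw samples to avg / max / p95 using a nearest-rank percentile."""
    if not samples:
        return {"avg": None, "max": None, "p95": None}
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), always >= 1
    return {"avg": mean(ordered), "max": ordered[-1], "p95": ordered[rank - 1]}


if __name__ == "__main__":
    # Stand-in metric source; a real sampler would read CPU%, RSS, or disk counters here.
    sampler = IntervalSampler(read_fn=time.process_time, interval_s=0.05)
    sampler.start()          # roughly what would happen before the step body runs
    time.sleep(0.3)          # the step body
    print(summarize(sampler.stop()))  # roughly what would happen after the step finishes
```

Using a `threading.Event` for the sleep lets `stop()` return promptly instead of waiting out a full interval.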
Configuration
All configuration is via standard OpenTelemetry environment variables. No extension-specific config needed.
| Variable | Purpose |
|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP endpoint for traces and metrics |
| OTEL_EXPORTER_OTLP_HEADERS | Auth headers (e.g., Authorization=Bearer ...) |
| OTEL_SERVICE_NAME | Service name tag on all metrics |

If none of these variables is set, metrics are printed to stdout via the OTel console exporter (useful for local debugging).
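As a rough illustration of how these variables could be resolved at startup, the sketch below reads them from the environment and falls back to a console exporter. The function names are hypothetical, the rule "no endpoint means console" is an assumption drawn from the sentence above, and the header parsing is simplified (it does not URL-decode values). The `unknown_service` default follows the OpenTelemetry convention for an unset service name.

```python
import os


def parse_otlp_headers(raw):
    """Parse 'k1=v1,k2=v2' (the OTEL_EXPORTER_OTLP_HEADERS format) into a dict."""
    headers = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue  # skip empty or malformed entries
        key, value = pair.split("=", 1)  # split once so '=' inside the value survives
        headers[key.strip()] = value.strip()
    return headers


def resolve_export_config(env=None):
    """Hypothetical helper: decide where metrics go based on standard OTel variables."""
    env = os.environ if env is None else env
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    return {
        # Assumption: without an endpoint, fall back to printing metrics to stdout.
        "exporter": "otlp" if endpoint else "console",
        "endpoint": endpoint,
        "headers": parse_otlp_headers(env.get("OTEL_EXPORTER_OTLP_HEADERS", "")),
        "service_name": env.get("OTEL_SERVICE_NAME", "unknown_service"),
    }


if __name__ == "__main__":
    print(resolve_export_config())
```

Passing a plain dict as `env` makes the resolution easy to check without touching the real process environment.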
Development
git clone https://github.com/npow/metaflow-observability
cd metaflow-observability
pip install -e ".[dev]"
# Run tests
pytest
# Lint + format
ruff check src tests
ruff format src tests
# Type check
mypy
CI runs the full suite across Python 3.9, 3.10, 3.11, and 3.12 on every push.
License