
Automatic observability for Metaflow — step duration, CPU, memory, disk, and GPU metrics via OpenTelemetry


metaflow-observability


Get production metrics for every Metaflow step — without changing your flow code.

The problem

When a Metaflow pipeline slows down or crashes in production, you have no time-series data to tell you whether it was CPU saturation, a memory spike, a disk bottleneck, or a GPU stall. You're left digging through logs after the fact. Metaflow's built-in tooling gives you per-run artifacts and cards, but nothing you can alert on or trend over time.

Quick start

pip install metaflow-observability

from metaflow import FlowSpec, step
from metaflow.decorators import observability

class MyFlow(FlowSpec):

    @observability
    @step
    def train(self):
        ...  # your code — metrics collected automatically
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MyFlow()

Metrics are exported via OpenTelemetry. Point them at Prometheus + Grafana with:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
python flow.py run

Install

# Core (CPU, memory, disk, duration)
pip install metaflow-observability

# With GPU support (NVIDIA only, requires CUDA drivers)
pip install "metaflow-observability[gpu]"

Usage

Zero-config with Prometheus

Add @observability to any step. By default, metrics are scraped via a Prometheus endpoint on port 8000.

@observability
@step
def preprocess(self):
    ...
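
To sanity-check the endpoint locally, fetch `http://localhost:8000/metrics` and look for the step metrics. The sketch below shows roughly what one scraped sample looks like in the Prometheus text exposition format; the metric and label names here are illustrative assumptions, not the extension's exact output.

```python
# Sketch: what one sample on the Prometheus scrape endpoint might look like.
# Metric and label names below are illustrative assumptions.

def prometheus_line(name: str, labels: dict, value: float) -> str:
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = prometheus_line(
    "step_cpu_pct",
    {"flow": "MyFlow", "step": "preprocess", "run_id": "1699"},
    42.5,
)
print(line)
# step_cpu_pct{flow="MyFlow",run_id="1699",step="preprocess"} 42.5
```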

Custom OTel backend

Use any OpenTelemetry-compatible backend (Grafana Cloud, Datadog, Honeycomb, etc.) via standard OTel environment variables:

export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.example.com
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <token>"
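
Per the OpenTelemetry specification, `OTEL_EXPORTER_OTLP_HEADERS` is a comma-separated list of `key=value` pairs. A small stdlib sketch of how such a value is parsed (the helper name is ours, not the extension's):

```python
def parse_otlp_headers(raw: str) -> dict:
    """Parse the comma-separated key=value format of OTEL_EXPORTER_OTLP_HEADERS."""
    headers = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue
        key, _, value = pair.partition("=")  # split on the first '=' only
        headers[key.strip()] = value.strip()
    return headers

parsed = parse_otlp_headers("Authorization=Bearer abc123,x-tenant=ml-team")
print(parsed)  # {'Authorization': 'Bearer abc123', 'x-tenant': 'ml-team'}
```

Splitting on the first `=` only is what keeps values like `Bearer abc123` intact.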

GPU metrics

Install the GPU extra and run on a CUDA-enabled machine — GPU utilization and memory are collected automatically per device, tagged with gpu_index.

pip install "metaflow-observability[gpu]"
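
Per-device NVIDIA collection of this kind is typically done through NVML. A minimal sketch, assuming the `pynvml` bindings (the extension's actual implementation may differ); it degrades to an empty list when no NVIDIA driver or bindings are present:

```python
# Sketch: per-device GPU sampling via NVML (pynvml), tagged with gpu_index.
# This assumes the pynvml package; the extension's internals may differ.

def collect_gpu_samples() -> list:
    """Return one {gpu_index, utilization_pct, memory_used_mb} dict per device."""
    try:
        import pynvml
        pynvml.nvmlInit()
    except Exception:  # no pynvml / no NVIDIA driver: collect nothing
        return []
    samples = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            samples.append({
                "gpu_index": i,
                "utilization_pct": util.gpu,
                "memory_used_mb": mem.used / 1e6,
            })
    finally:
        pynvml.nvmlShutdown()
    return samples

gpu_samples = collect_gpu_samples()
print(gpu_samples)  # [] on machines without an NVIDIA GPU
```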

How it works

@observability wraps task_pre_step / task_post_step / task_exception hooks in Metaflow's decorator API. Before your step code runs, it starts background threads that sample CPU%, RSS memory, disk I/O throughput, and (optionally) GPU utilization at 1-second intervals. When the step finishes, samples are aggregated and exported as OpenTelemetry instruments:

| Metric | Instrument | Tags |
| --- | --- | --- |
| step.duration | Histogram (seconds) | step, flow, run_id, retry |
| step.cpu.pct | Gauge (avg / max / p95) | same |
| step.memory.mb | Gauge (avg / max RSS) | same |
| step.disk.read_bytes | Counter | same |
| step.disk.write_bytes | Counter | same |
| step.disk.read_throughput | Gauge (MB/s) | same |
| step.disk.write_throughput | Gauge (MB/s) | same |
| step.gpu.utilization | Gauge | same + gpu_index |
| step.gpu.memory.used_mb | Gauge | same + gpu_index |
| step.retries | Gauge | same |
| step.failures | Counter | same |
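
The background-sampling loop described under "How it works" can be sketched with the standard library alone. This is an illustration, not the extension's actual class; `sample_fn` stands in for whatever probe (a psutil CPU% reading, an NVML call) supplies the raw values:

```python
import itertools
import statistics
import threading
import time

class StepSampler:
    """Sketch of a background sampling loop with avg/max/p95 aggregation.

    Illustrative only; sample_fn stands in for a real probe (e.g. psutil CPU%).
    """

    def __init__(self, sample_fn, interval_s=1.0):
        self._sample_fn = sample_fn
        self._interval_s = interval_s
        self._samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Take a sample, then sleep; Event.wait() lets stop() wake us promptly.
        while not self._stop.is_set():
            self._samples.append(self._sample_fn())
            self._stop.wait(self._interval_s)

    def start(self):
        self._thread.start()

    def stop(self):
        """Stop sampling and aggregate the collected samples."""
        self._stop.set()
        self._thread.join()
        ordered = sorted(self._samples)
        return {
            "avg": statistics.fmean(ordered),
            "max": ordered[-1],
            "p95": ordered[int(0.95 * (len(ordered) - 1))],
        }

# Demo with a canned sequence instead of a live CPU probe.
readings = itertools.cycle([10.0, 20.0, 30.0, 40.0])
sampler = StepSampler(lambda: next(readings), interval_s=0.001)
sampler.start()
time.sleep(0.02)
agg = sampler.stop()
print(agg)  # -> {'avg': ..., 'max': ..., 'p95': ...}
```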

Configuration

All configuration is via standard OpenTelemetry environment variables. No extension-specific config needed.

| Variable | Purpose |
| --- | --- |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP endpoint for traces and metrics |
| OTEL_EXPORTER_OTLP_HEADERS | Auth headers (e.g., Authorization=Bearer ...) |
| OTEL_SERVICE_NAME | Service name tag on all metrics |

If OTEL_EXPORTER_OTLP_ENDPOINT is not set, metrics are printed to stdout via the OTel console exporter (useful for local debugging).
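
The fallback amounts to a single check of the endpoint variable at startup; roughly (function name and return values are ours, for illustration):

```python
def choose_exporter(env):
    """Sketch of the documented fallback: OTLP when an endpoint is configured,
    the console exporter otherwise. Names are illustrative."""
    return "otlp" if env.get("OTEL_EXPORTER_OTLP_ENDPOINT") else "console"

mode_unset = choose_exporter({})
mode_set = choose_exporter({"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317"})
print(mode_unset, mode_set)  # console otlp
```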

Development

git clone https://github.com/npow/metaflow-observability
cd metaflow-observability
pip install -e ".[dev]"

# Run tests
pytest

# Lint + format
ruff check src tests
ruff format src tests

# Type check
mypy

CI runs the full suite across Python 3.9, 3.10, 3.11, and 3.12 on every push.

License

Apache-2.0

Download files

Source distribution: metaflow_observability-0.1.0.tar.gz (16.4 kB)

  • Uploaded via: twine/6.1.0 on CPython/3.13.7 (Trusted Publishing)
  • SHA256: 6883bcfe07a1587ece692bb46dac0cf0338f1d921aa943cdadafd313ba3efb90
  • MD5: ecd6aec5f5cd262b66e4a60fe411191f
  • BLAKE2b-256: 0d1f9a048ab392c9eeeffd74b1aa07ba2f1d04f247c6d4416ba777cf2d4198f8
  • Provenance: attested by publish.yml on npow/metaflow-observability

Built distribution: metaflow_observability-0.1.0-py3-none-any.whl (16.2 kB)

  • SHA256: 4d322c89d40c2fbf92f7a77e1a77d1904dc894802e26ec857d58eb99c24d007f
  • MD5: 99169c5901607b06d4e0b285c2d3f77d
  • BLAKE2b-256: f182766185a463640b8257c487e19cbf097b4f758718bf8252e60b0d18b7a986
  • Provenance: attested by publish.yml on npow/metaflow-observability
