
Automatic observability for Metaflow — step duration, CPU, memory, disk, and GPU metrics via OpenTelemetry


metaflow-observability


Get production metrics for every Metaflow step — without changing your flow code.

The problem

When a Metaflow pipeline slows down or crashes in production, you have no time-series data to tell you whether it was CPU saturation, a memory spike, a disk bottleneck, or a GPU stall. You're left digging through logs after the fact. Metaflow's built-in tooling gives you per-run artifacts and cards, but nothing you can alert on or trend over time.

Quick start

pip install metaflow-observability

from metaflow import FlowSpec, step
from metaflow.decorators import observability

class MyFlow(FlowSpec):

    @observability
    @step
    def train(self):
        ...  # your code — metrics collected automatically
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MyFlow()

Metrics are exported via OpenTelemetry. Point them at Prometheus + Grafana with:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
python flow.py run

Install

# Core (CPU, memory, disk, duration)
pip install metaflow-observability

# With GPU support (NVIDIA only, requires CUDA drivers)
pip install "metaflow-observability[gpu]"

Usage

Zero-config with Prometheus

Add @observability to any step. By default, metrics are scraped via a Prometheus endpoint on port 8000.

@observability
@step
def preprocess(self):
    ...
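
To sanity-check the endpoint locally, fetch `http://localhost:8000/metrics` and look for the step metrics. The sketch below shows roughly what one scraped sample looks like in the Prometheus text exposition format; the metric and label names here are illustrative assumptions, not the extension's exact output.

```python
# Sketch: what one sample on the Prometheus scrape endpoint might look like.
# Metric and label names below are illustrative assumptions.

def prometheus_line(name: str, labels: dict, value: float) -> str:
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = prometheus_line(
    "step_cpu_pct",
    {"flow": "MyFlow", "step": "preprocess", "run_id": "1699"},
    42.5,
)
print(line)
# step_cpu_pct{flow="MyFlow",run_id="1699",step="preprocess"} 42.5
```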

Custom OTel backend

Use any OpenTelemetry-compatible backend (Grafana Cloud, Datadog, Honeycomb, etc.) via standard OTel environment variables:

export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.example.com
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <token>"
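
Per the OpenTelemetry specification, `OTEL_EXPORTER_OTLP_HEADERS` is a comma-separated list of `key=value` pairs. A small stdlib sketch of how such a value is parsed (the helper name is ours, not the extension's):

```python
def parse_otlp_headers(raw: str) -> dict:
    """Parse the comma-separated key=value format of OTEL_EXPORTER_OTLP_HEADERS."""
    headers = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue
        key, _, value = pair.partition("=")  # split on the first '=' only
        headers[key.strip()] = value.strip()
    return headers

parsed = parse_otlp_headers("Authorization=Bearer abc123,x-tenant=ml-team")
print(parsed)  # {'Authorization': 'Bearer abc123', 'x-tenant': 'ml-team'}
```

Splitting on the first `=` only is what keeps values like `Bearer abc123` intact.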

GPU metrics

Install the GPU extra and run on a CUDA-enabled machine — GPU utilization and memory are collected automatically per device, tagged with gpu_index.

pip install "metaflow-observability[gpu]"
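
Per-device NVIDIA collection of this kind is typically done through NVML. A minimal sketch, assuming the `pynvml` bindings (the extension's actual implementation may differ); it degrades to an empty list when no NVIDIA driver or bindings are present:

```python
# Sketch: per-device GPU sampling via NVML (pynvml), tagged with gpu_index.
# This assumes the pynvml package; the extension's internals may differ.

def collect_gpu_samples() -> list:
    """Return one {gpu_index, utilization_pct, memory_used_mb} dict per device."""
    try:
        import pynvml
        pynvml.nvmlInit()
    except Exception:  # no pynvml / no NVIDIA driver: collect nothing
        return []
    samples = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            samples.append({
                "gpu_index": i,
                "utilization_pct": util.gpu,
                "memory_used_mb": mem.used / 1e6,
            })
    finally:
        pynvml.nvmlShutdown()
    return samples

gpu_samples = collect_gpu_samples()
print(gpu_samples)  # [] on machines without an NVIDIA GPU
```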

How it works

@observability wraps task_pre_step / task_post_step / task_exception hooks in Metaflow's decorator API. Before your step code runs, it starts background threads that sample CPU%, RSS memory, disk I/O throughput, and (optionally) GPU utilization at 1-second intervals. When the step finishes, samples are aggregated and exported as OpenTelemetry instruments:

| Metric | Instrument | Tags |
| --- | --- | --- |
| step.duration | Histogram (seconds) | step, flow, run_id, retry |
| step.cpu.pct | Gauge (avg / max / p95) | same |
| step.memory.mb | Gauge (avg / max RSS) | same |
| step.disk.read_bytes | Counter | same |
| step.disk.write_bytes | Counter | same |
| step.disk.read_throughput | Gauge (MB/s) | same |
| step.disk.write_throughput | Gauge (MB/s) | same |
| step.gpu.utilization | Gauge | same + gpu_index |
| step.gpu.memory.used_mb | Gauge | same + gpu_index |
| step.retries | Gauge | same |
| step.failures | Counter | same |
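
The background-sampling loop described under "How it works" can be sketched with the standard library alone. This is an illustration, not the extension's actual class; `sample_fn` stands in for whatever probe (a psutil CPU% reading, an NVML call) supplies the raw values:

```python
import itertools
import statistics
import threading
import time

class StepSampler:
    """Sketch of a background sampling loop with avg/max/p95 aggregation.

    Illustrative only; sample_fn stands in for a real probe (e.g. psutil CPU%).
    """

    def __init__(self, sample_fn, interval_s=1.0):
        self._sample_fn = sample_fn
        self._interval_s = interval_s
        self._samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Take a sample, then sleep; Event.wait() lets stop() wake us promptly.
        while not self._stop.is_set():
            self._samples.append(self._sample_fn())
            self._stop.wait(self._interval_s)

    def start(self):
        self._thread.start()

    def stop(self):
        """Stop sampling and aggregate the collected samples."""
        self._stop.set()
        self._thread.join()
        ordered = sorted(self._samples)
        return {
            "avg": statistics.fmean(ordered),
            "max": ordered[-1],
            "p95": ordered[int(0.95 * (len(ordered) - 1))],
        }

# Demo with a canned sequence instead of a live CPU probe.
readings = itertools.cycle([10.0, 20.0, 30.0, 40.0])
sampler = StepSampler(lambda: next(readings), interval_s=0.001)
sampler.start()
time.sleep(0.02)
agg = sampler.stop()
print(agg)  # -> {'avg': ..., 'max': ..., 'p95': ...}
```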

Configuration

All configuration is via standard OpenTelemetry environment variables. No extension-specific config needed.

| Variable | Purpose |
| --- | --- |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP endpoint for traces and metrics |
| OTEL_EXPORTER_OTLP_HEADERS | Auth headers (e.g., Authorization=Bearer ...) |
| OTEL_SERVICE_NAME | Service name tag on all metrics |

If OTEL_EXPORTER_OTLP_ENDPOINT is not set, metrics are printed to stdout via the OTel console exporter (useful for local debugging).
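
The fallback amounts to a single check of the endpoint variable at startup; roughly (function name and return values are ours, for illustration):

```python
def choose_exporter(env):
    """Sketch of the documented fallback: OTLP when an endpoint is configured,
    the console exporter otherwise. Names are illustrative."""
    return "otlp" if env.get("OTEL_EXPORTER_OTLP_ENDPOINT") else "console"

mode_unset = choose_exporter({})
mode_set = choose_exporter({"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317"})
print(mode_unset, mode_set)  # console otlp
```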

Development

git clone https://github.com/npow/metaflow-observability
cd metaflow-observability
pip install -e ".[dev]"

# Run tests
pytest

# Lint + format
ruff check src tests
ruff format src tests

# Type check
mypy

CI runs the full suite across Python 3.9, 3.10, 3.11, and 3.12 on every push.

License

Apache-2.0

Download files

Source distribution: metaflow_observability-0.1.0.tar.gz (16.4 kB)

  • Uploaded via: twine/6.1.0 on CPython/3.13.7 (Trusted Publishing)
  • SHA256: 6883bcfe07a1587ece692bb46dac0cf0338f1d921aa943cdadafd313ba3efb90
  • MD5: ecd6aec5f5cd262b66e4a60fe411191f
  • BLAKE2b-256: 0d1f9a048ab392c9eeeffd74b1aa07ba2f1d04f247c6d4416ba777cf2d4198f8
  • Provenance: attested by publish.yml on npow/metaflow-observability

Built distribution: metaflow_observability-0.1.0-py3-none-any.whl (16.2 kB)

  • SHA256: 4d322c89d40c2fbf92f7a77e1a77d1904dc894802e26ec857d58eb99c24d007f
  • MD5: 99169c5901607b06d4e0b285c2d3f77d
  • BLAKE2b-256: f182766185a463640b8257c487e19cbf097b4f758718bf8252e60b0d18b7a986
  • Provenance: attested by publish.yml on npow/metaflow-observability
