Inephany client library to use Metrana.

Metrana Client Library

Metrana is a metrics tracking client for ML/RL training runs. It provides a simple API to log metrics from training loops to the Metrana ingestion service. Logging is non-blocking: a Rust-backed engine batches points, streams them over gRPC, applies backpressure, and retries transient failures on a background thread, so your training loop is not slowed down.

Installation

pip install metrana

Requires Python 3.10+.

Supported platforms. metrana depends on the native metrana-logging-engine, which ships prebuilt wheels (there is no source distribution) for: Linux x86-64 and aarch64 (glibc manylinux_2_28 and musl musllinux_1_2), macOS x86-64 and Apple Silicon, and Windows x86-64. On any other platform (e.g. Windows on ARM, or an older-than-manylinux_2_28 Linux) pip install will fail to find a compatible wheel — open an issue if you need a target added.

To log RL environment video with metrana.log_rendering(), install the optional rendering extra (pulls in PyAV for client-side H.264 encoding):

pip install 'metrana[rendering]'

Quick Start

ML training run

import metrana

metrana.init(
    api_key="your-api-key",
    workspace_name="my-workspace",
    project_name="my-project",
    run_name="run-001",
)

for step in range(num_steps):
    ...
    metrana.log("loss", loss)                      # one metric
    metrana.log({"accuracy": acc, "lr": lr})       # several at once

metrana.close()   # flush and shut down — always call this

RL training run

import metrana

metrana.init(api_key="...", workspace_name="ws", project_name="proj", run_name="rl-001")

for rl_step in range(num_updates):
    ...
    # one metric per episode
    metrana.log_rl_episode("episode_return", ep_return, rl_step=rl_step, episode=episode)

    # per-environment-step metrics for a batch of envs at once
    metrana.log_rl_environment_step(
        "reward", rewards, rl_step=rl_step, env_id=env_ids, episode=episodes
    )

metrana.close()

Logging standard metrics

metrana.log(metric_name, value, *, step=None, timestamp=None, scale=None, labels=None, evaluation=False) logs one or more points to a series. Unless scale is given, it uses the default ML-step scale.

Values may be a scalar or an array — a Python list or any NumPy / PyTorch / JAX / TensorFlow tensor (any float dtype, on any device); arrays are converted to contiguous NumPy once before crossing into the engine:

metrana.log("loss", 0.5)                       # one point
metrana.log("loss", [0.51, 0.49, 0.48])        # three points (bulk)
metrana.log("grad_norm", grad_norm_tensor)     # torch/np/jax/tf tensor

Multiple metrics at once — pass a mapping (values may themselves be scalars or arrays):

metrana.log({"loss": loss, "accuracy": acc})

Steps

A series is identified by (metric_name, scale, labels), and each series has its own step axis. The step argument controls it:

step value        meaning
None (default)    auto-increment from the series' last step
a single int      the step of the first point; further points in the call continue from it
a sequence/array  an explicit step per point (length must match value)

metrana.log("loss", 0.5)                        # auto: next step
metrana.log("loss", [a, b, c], step=100)        # steps 100, 101, 102
metrana.log("loss", [a, b, c], step=[10, 20, 30])  # explicit steps

Timestamps work the same way via timestamp (Unix milliseconds): None lets the server stamp on arrival; a single int applies to every point; a sequence gives one per point.
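
The step rules in the table above can be summarized in a small stand-alone sketch. resolve_steps is a hypothetical helper, not part of metrana, and it assumes that a series with no history starts auto-incrementing at 0; the real engine may differ in details such as validation.

```python
from collections.abc import Sequence


def resolve_steps(last_step, n_points, step=None):
    """Return the step assigned to each of n_points points (illustrative only)."""
    if step is None:
        # Auto-increment from the series' last step (assumed to start at 0).
        first = 0 if last_step is None else last_step + 1
        return list(range(first, first + n_points))
    if isinstance(step, int):
        # A single int is the step of the first point; the rest continue from it.
        return list(range(step, step + n_points))
    if isinstance(step, Sequence):
        # An explicit step per point; the length must match the values.
        if len(step) != n_points:
            raise ValueError("step length must match the number of values")
        return list(step)
    raise TypeError("step must be None, an int, or a sequence of ints")


print(resolve_steps(last_step=41, n_points=3))                     # [42, 43, 44]
print(resolve_steps(last_step=41, n_points=3, step=100))           # [100, 101, 102]
print(resolve_steps(last_step=41, n_points=3, step=[10, 20, 30]))  # [10, 20, 30]
```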

Scale, labels, and evaluation

These three arguments shape the series identity:

  • scale: the step scale (a StandardMetricScale value: "ML_STEP", "EPISODE", "ENVIRONMENT_STEP"). None defaults to ML_STEP. Only log / log_distributed take it; the RL helpers fix their own scale.
  • labels: a dict[str, str] that, together with the name and scale, identifies the series. Two points with the same name but different labels go to different series.
  • evaluation: a shorthand that adds the label {"evaluation": "true"} (unless you already set the evaluation key in labels), so evaluation points form a series distinct from otherwise identically-identified training points.

metrana.log("reward", train_reward)                       # training series
metrana.log("reward", eval_reward, evaluation=True)        # distinct eval series
metrana.log("reward", r, labels={"policy": "greedy"})      # distinct labelled series
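
One way to picture series identity is as a hashable key built from the three parts. The sketch below is only a mental model: series_key is a hypothetical function, and how metrana represents a series internally is not documented here.

```python
def series_key(metric_name, scale=None, labels=None, evaluation=False):
    """Build a hashable identity for a series (illustrative, not metrana's code)."""
    effective_labels = dict(labels or {})
    if evaluation:
        # The shorthand only adds the label if the caller has not set it already.
        effective_labels.setdefault("evaluation", "true")
    effective_scale = "ML_STEP" if scale is None else scale
    return (metric_name, effective_scale, tuple(sorted(effective_labels.items())))


train_series = series_key("reward")
eval_series = series_key("reward", evaluation=True)
greedy_series = series_key("reward", labels={"policy": "greedy"})
print(len({train_series, eval_series, greedy_series}))  # 3 distinct series

# Setting the label by hand is equivalent to the shorthand.
print(series_key("reward", labels={"evaluation": "true"}) == eval_series)  # True
```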

Retrieving the last step

metrana.get_last_step(metric_name, scale=None, labels=None) returns the last step logged for a series, or None if nothing has been logged for it (pass the same scale / labels you logged it with). The value is seeded from the server at init(), so after a restart or resume you can continue from where the run left off. This is useful when you want explicit steps but need to know the current position:

last = metrana.get_last_step("loss")
next_step = 0 if last is None else last + 1
metrana.log("loss", loss, step=next_step)

Closing

metrana.close() flushes queued points and shuts the background engine down. Always call it: the engine runs on a daemon thread, so if the interpreter exits without close(), queued-but-unsent points are lost (a warning is emitted at exit). Call it directly at the end of the run, or put it in the finally clause of a try/finally so it still runs when training raises.

close() flushes for up to close_timeout seconds (default 15). To abandon a run immediately, dropping anything not yet sent, pass metrana.close(close_timeout=0) — pair it with init(skip_drain_render_on_close=True) if you also want queued rendering frames dropped rather than encoded.
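
If you prefer a with block over a hand-written try/finally, a thin wrapper is enough. The sketch below is not part of metrana: tracked_run is a hypothetical helper, and it takes the init and close callables as arguments so that the example runs with stand-ins and no backend.

```python
from contextlib import contextmanager


@contextmanager
def tracked_run(init, close, **init_kwargs):
    """Call init(**init_kwargs) on entry and close() on exit, even on error."""
    init(**init_kwargs)
    try:
        yield
    finally:
        close()


# Stand-ins so the sketch runs without a backend; in real use pass
# metrana.init and metrana.close instead.
calls = []


def fake_init(**kwargs):
    calls.append("init")


def fake_close():
    calls.append("close")


try:
    with tracked_run(fake_init, fake_close, run_name="run-001"):
        raise RuntimeError("training crashed")
except RuntimeError:
    pass

print(calls)  # ['init', 'close']
```

The exception still propagates out of the with block; the wrapper only guarantees that close runs first.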

Logging RL metrics

RL metrics use a two-level step: a major rl_step (the training/update step, which must not decrease) plus a minor step. Two helpers cover the common scales; the scale is implied by the function, so you never pass it explicitly. Both helpers also accept labels and evaluation, which behave exactly as for standard metrics.

Per-episode metrics

metrana.log_rl_episode(metric_name, value, rl_step, episode=None, env_id=None, labels=None, evaluation=False) — the episode is the minor step. Omit it to auto-increment from the last logged episode:

metrana.log_rl_episode("episode_return", ep_return, rl_step=rl_step, episode=episode)

Per-environment-step metrics

metrana.log_rl_environment_step(metric_name, value, rl_step, env_id=None, episode=None, labels=None, evaluation=False) logs within-episode steps. The env-step (minor) axis is assigned automatically and resets whenever episode changes, so you supply the episode each point belongs to, not the env-step.
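
The reset behaviour of the minor axis can be illustrated for a single environment. This is a sketch under assumptions: assign_env_steps is a hypothetical helper, and it assumes the env-step counter restarts at 0 for each new episode, which the text above does not state explicitly.

```python
def assign_env_steps(episodes):
    """Assign a within-episode step to each point from its episode number."""
    env_steps = []
    current_episode = None
    next_step = 0
    for episode in episodes:
        if episode != current_episode:
            # A new episode restarts the minor axis (assumed to start at 0).
            current_episode = episode
            next_step = 0
        env_steps.append(next_step)
        next_step += 1
    return env_steps


print(assign_env_steps([7, 7, 7, 8, 8]))  # [0, 1, 2, 0, 1]
```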

It is vectorized over environments. Pass a list of env ids and a matching value block:

  • single env: env_id="env0", value a scalar or 1D [T] array;
  • many envs: env_id=["env0", "env1", ...] (length M), value a 1D [M] (one point each) or 2D [M, T] array.

episode (and timestamp) broadcast: each may be a scalar, 1D, or 2D array matching the value.

# 8 envs, one reward each at this rl_step
metrana.log_rl_environment_step("reward", rewards_8, rl_step=rl_step,
                                env_id=env_ids, episode=episodes)

# 8 envs x 128 timesteps in one call
metrana.log_rl_environment_step("reward", rewards_8x128, rl_step=rl_step,
                                env_id=env_ids, episode=episodes_8x128)
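
To check a call against the shape rules before logging, it can help to work out how many points each environment would receive. The helper below is hypothetical and works on shape tuples only, so it runs without NumPy; the errors it raises are this sketch's own, not metrana's.

```python
def points_per_env(env_id, value_shape):
    """Return {env_id: number_of_points} for a value of the given shape.

    value_shape is a tuple: () for a scalar, (T,) or (M,) for 1D, (M, T) for 2D.
    """
    if isinstance(env_id, str):
        # Single env: a scalar is one point, a 1D [T] array is T points.
        if len(value_shape) > 1:
            raise ValueError("a single env takes a scalar or a 1D [T] value")
        return {env_id: value_shape[0] if value_shape else 1}
    env_ids = list(env_id)
    if len(value_shape) not in (1, 2) or value_shape[0] != len(env_ids):
        raise ValueError("many envs take a 1D [M] or 2D [M, T] value, M = len(env_id)")
    # 1D [M]: one point per env; 2D [M, T]: T points per env.
    per_env = 1 if len(value_shape) == 1 else value_shape[1]
    return {e: per_env for e in env_ids}


env_ids = [f"env{i}" for i in range(8)]
print(points_per_env("env0", (128,)))             # {'env0': 128}
print(points_per_env(env_ids, (8,))["env3"])      # 1
print(points_per_env(env_ids, (8, 128))["env3"])  # 128
```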

metrana.get_env_last_rl_step_and_episode(env_id) returns (last_rl_step, last_episode) for an environment (either may be None) — handy for computing explicit steps after a resume.

Run configuration and attributes

metrana.log_config({"optimizer": {"name": "adam", "lr": 3e-4}, "batch_size": 256})
metrana.set_tags(["baseline", "v2"])      # replace the tag set
metrana.add_tags(["ablation"])            # add without removing
metrana.remove_tags(["baseline"])         # remove
metrana.set_description("LR sweep, seed 0")

config passed to init() is logged the same way (nested dicts/lists flatten under config/). These run-level attributes — along with the git commit SHA and any tags/description given to init() — are applied only by the process that creates the run, so distributed siblings that resume it never clobber them.
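
As a rough picture of the flattening, the sketch below turns a nested config into /-delimited paths under config/. flatten_config is a hypothetical helper, and addressing list elements by their index is an assumption; the paths metrana actually produces for lists may differ.

```python
def flatten_config(value, prefix="config"):
    """Flatten nested dicts/lists into {"/"-delimited path: leaf value}."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, (list, tuple)):
        # Assumption: list elements are addressed by their index.
        items = enumerate(value)
    else:
        return {prefix: value}
    flat = {}
    for key, child in items:
        flat.update(flatten_config(child, f"{prefix}/{key}"))
    return flat


config = {"optimizer": {"name": "adam", "lr": 3e-4}, "layers": [64, 32]}
for path, leaf in flatten_config(config).items():
    print(path, "=", leaf)
```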

For arbitrary run attributes use metrana.log_attributes(prefix_path, value). For per-environment RL attributes (a distinct, env-scoped kind) use metrana.log_env_attributes(env_id, value, episode=None).

Environment renderings

metrana.log_rendering(frame, rl_step, episode, env_id=None) appends a frame to a per-(env_id, episode) H.264 .mp4, encoded on a dedicated background thread (never blocks the training loop).

  • frame: a uint8 NumPy array, (H, W, 3) RGB or (H, W) / (H, W, 1) grayscale. Width and height must be even (libx264 yuv420p).
  • When the (env_id, episode) pair changes, the open encoder for that env is closed and a new one opened for the next episode.

Configure via init(): rendering_output_dir, rendering_fps, rendering_max_concurrent_encoders, rendering_queue_max_size, skip_drain_render_on_close, rendering_close_timeout. Requires the rendering extra.
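
The frame constraints are cheap to check up front, before a frame reaches the encoder thread. The sketch below is a hypothetical pre-flight check, not metrana's own validation; it takes the shape and dtype name rather than an array so that it runs without NumPy.

```python
def check_frame(shape, dtype_name):
    """Raise ValueError if a frame with this shape/dtype breaks the rules above."""
    if dtype_name != "uint8":
        raise ValueError("frame must be uint8")
    if len(shape) == 2:
        height, width = shape
    elif len(shape) == 3 and shape[2] in (1, 3):
        height, width, _ = shape
    else:
        raise ValueError("frame must be (H, W), (H, W, 1) or (H, W, 3)")
    if height % 2 or width % 2:
        raise ValueError("width and height must be even")
    return "rgb" if len(shape) == 3 and shape[2] == 3 else "grayscale"


print(check_frame((240, 320, 3), "uint8"))  # rgb
print(check_frame((64, 64), "uint8"))       # grayscale
```

With a NumPy array you would call check_frame(frame.shape, frame.dtype.name).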

Naming rules

  • Metric names identify a series together with the scale and labels. Use /-delimited prefixes to group related series (e.g. train/loss, eval/loss are distinct series); labels and the evaluation shorthand are an alternative way to split a name into distinct series.
  • Environment ids appear in URLs, so they must be URL-safe segments.
  • Config / attribute paths are /-delimited; keys must be non-empty and contain only the characters [a-zA-Z0-9._:/-].
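
A path can be checked against the last rule with a short regular expression. The sketch below rests on an assumption: it treats / purely as the delimiter and each segment between slashes as a key. is_valid_path is a hypothetical helper, not a metrana function.

```python
import re

# Characters allowed inside a key (hyphen placed last so it is literal).
_KEY_CHARS = re.compile(r"[a-zA-Z0-9._:-]+")


def is_valid_path(path):
    """True if every "/"-delimited key is non-empty and uses only allowed characters."""
    return all(_KEY_CHARS.fullmatch(key) for key in path.split("/"))


print(is_valid_path("optimizer/lr"))   # True
print(is_valid_path("optimizer//lr"))  # False (empty key)
print(is_valid_path("batch size"))     # False (space is not allowed)
```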

Distributed logging

When several processes (e.g. distributed-training ranks) log into one run, three pieces matter:

1. They must agree on the run. Every process that should share a run needs the same orchestration_id. If you don't pass one, it is resolved automatically from METRANA_ORCHESTRATION_ID, then the framework job ids TORCHELASTIC_RUN_ID / SLURM_JOB_ID / RAY_JOB_ID, then a random token (which only descendants that inherit the environment will match). The resolved value is published back to METRANA_ORCHESTRATION_ID so forked/spawned children inherit it. With resume_strategy="never" (the default), the first process creates the run and the rest resume it by matching this identifier; a genuinely different job hitting the same run name errors instead of corrupting it.

# torchrun / Slurm / Ray: nothing to do — the framework job id is picked up automatically.
metrana.init(api_key="...", workspace_name="ws", project_name="proj", run_name="run")

# Custom launcher: pass a token shared by all workers of the job.
metrana.init(..., orchestration_id="job-2025-06-23-abc")
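
The resolution order described in point 1 can be sketched as a plain function over an environment mapping. This illustrates the documented order only; resolve_orchestration_id is a hypothetical name, and the format of the random fallback token is an assumption.

```python
import secrets

_FALLBACK_VARS = (
    "METRANA_ORCHESTRATION_ID",
    "TORCHELASTIC_RUN_ID",
    "SLURM_JOB_ID",
    "RAY_JOB_ID",
)


def resolve_orchestration_id(explicit=None, env=None):
    """Pick the orchestration id and publish it to env for child processes."""
    env = {} if env is None else env
    resolved = explicit
    if resolved is None:
        for name in _FALLBACK_VARS:
            if env.get(name):
                resolved = env[name]
                break
    if resolved is None:
        # Last resort: a random token, shared only with processes that inherit env.
        resolved = secrets.token_hex(8)
    env["METRANA_ORCHESTRATION_ID"] = resolved
    return resolved


slurm_env = {"SLURM_JOB_ID": "4242"}
print(resolve_orchestration_id(env=slurm_env))  # 4242
print(slurm_env["METRANA_ORCHESTRATION_ID"])    # 4242
```

Passing os.environ as env would mirror the inheritance behaviour: children forked or spawned afterwards see the published value.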

2. Choose the right log function for shared series. Use metrana.log_distributed(...) (instead of metrana.log(...)) when multiple processes write to the same series, for example all ranks logging a global loss. It uses unordered merge semantics, so concurrent writers don't conflict. Provide an explicit step (the global training step) so points from different ranks align on the same axis:

metrana.log_distributed("loss", loss, step=global_step)

Use plain metrana.log(...) for series owned by a single writer (it is ordered and can auto-increment). Pin logger_id (e.g. one per rank) if you want the backend to distinguish a restarted writer from a genuinely new concurrent one.

3. RL metrics need an exclusive owner per environment. The RL functions (metrana.log_rl_episode(...) and metrana.log_rl_environment_step(...)) are ordered per env_id series, so a given environment must be logged by exactly one process. When you shard environments across ranks (e.g. a vectorized env split over workers), make sure each process writes only its own subset of env_ids — two processes logging the same environment race and silently lose steps/points. Plain metrana.log(...) / metrana.log_distributed(...) float series have no such restriction (log_distributed is explicitly built for many writers on one series).
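
One simple way to give every environment exactly one owner is a round-robin split by rank. The sketch below is a generic partitioning scheme, not something metrana provides; env_ids_for_rank is a hypothetical helper, and rank and world_size would come from your launcher.

```python
def env_ids_for_rank(all_env_ids, rank, world_size):
    """Return the env ids owned by this rank (round-robin split)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return all_env_ids[rank::world_size]


all_env_ids = [f"env{i}" for i in range(8)]
shards = [env_ids_for_rank(all_env_ids, r, world_size=3) for r in range(3)]
print(shards[0])  # ['env0', 'env3', 'env6']

# Every env has exactly one owner: the shards are disjoint and cover all envs.
assert sorted(e for shard in shards for e in shard) == sorted(all_env_ids)
```

Each process then passes only its own shard as env_id to the RL helpers.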

Guarantees and retries

By default Metrana favors never blocking your training loop over guaranteeing delivery. Know the trade-offs:

  • Backpressure (backpressure_strategy, default "drop_new"): when the in-process queue is full, an enqueue waits up to enqueue_timeout_secs (default 0.1) for room, then:
    • "drop_new" — drops the new points (default; protects the loop, can lose data under sustained pressure);
    • "block" — waits indefinitely (no loss from queue pressure, but can stall the loop);
    • "raise" — raises MetranaEventQueueFullError.
  • Retries (max_send_retries, default 60): failed sends are retried with exponential backoff (starting at send_retry_initial_backoff_secs and capped at send_retry_max_backoff_secs). After the limit the batch is dropped. Set max_send_retries=None to retry indefinitely (no loss from transient outages, at the cost of unbounded buffering).
  • Errors (error_strategy, default "warn"): how background errors surface — "silent", "warn", "raise_on_log" (raised on the next log call), or "raise_on_close". Drain the engine's sender errors yourself with metrana.check_sender_errors(). (Rendering/encoding errors follow the same strategy but surface on log_rendering() / close().)
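
The three backpressure strategies can be mimicked with the standard library's bounded queue. This is a behavioural sketch only: the real queue lives in the Rust engine, and enqueue and QueueFullError are hypothetical stand-ins (the latter for MetranaEventQueueFullError).

```python
import queue


class QueueFullError(RuntimeError):
    """Stand-in for MetranaEventQueueFullError."""


def enqueue(q, point, strategy="drop_new", timeout=0.1):
    """Try to enqueue a point; return True if queued, False if dropped."""
    try:
        q.put(point, timeout=timeout)   # wait briefly for room
        return True
    except queue.Full:
        if strategy == "drop_new":
            return False                # shed the new point
        if strategy == "block":
            q.put(point)                # wait until a consumer frees a slot
            return True
        if strategy == "raise":
            raise QueueFullError("event queue is full") from None
        raise ValueError(f"unknown strategy: {strategy!r}")


q = queue.Queue(maxsize=2)
results = [enqueue(q, i, timeout=0.01) for i in range(3)]
print(results)  # [True, True, False]
```

Nothing consumes from the queue here, so the third point finds it full and is dropped; with "block" the same call would wait until a consumer made room.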

Points can be lost only when: backpressure is drop_new and the queue stays full past the timeout; or max_send_retries is finite and a failure persists past it; or the process exits without metrana.close() (the daemon engine thread is killed with points still queued).
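
To get a feel for how long a finite max_send_retries keeps a batch alive during an outage, you can add up a capped exponential backoff schedule. The numbers below are hypothetical: the default backoff values are not listed here, and the doubling factor is an assumption about how the backoff grows.

```python
def backoff_delays(max_retries, initial_secs, max_secs, factor=2.0):
    """Delay before each retry: exponential growth, capped at max_secs."""
    delays = []
    delay = initial_secs
    for _ in range(max_retries):
        delays.append(min(delay, max_secs))
        delay *= factor
    return delays


# Hypothetical settings: 0.5 s initial backoff, 30 s cap, doubling each time.
delays = backoff_delays(max_retries=60, initial_secs=0.5, max_secs=30.0)
print(delays[:8])   # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
print(sum(delays))  # 1651.5
```

Under these assumed settings a batch survives an outage of roughly 27.5 minutes before it is dropped; an outage longer than that needs a higher limit or max_send_retries=None.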

To prioritize delivery over loop latency:

metrana.init(
    ...,
    backpressure_strategy="block",   # never drop on queue pressure
    max_send_retries=None,           # retry transient failures forever
    queue_capacity=100_000,          # more headroom before backpressure kicks in
)
# ... and always call metrana.close() (give it a generous close_timeout).

Tuning and observability

  • max_pending_requests (default 30): in-flight streaming requests — raise it to push more throughput when the backend is the bottleneck.
  • queue_capacity (default 10_000): in-process point buffer depth.
  • batch_max_age_secs (default 1.0): how long points wait to coalesce into a batch before sending.
  • max_msg_size: max serialized request size in bytes.

metrana.get_metrics() returns a point-in-time snapshot of the engine's self-metrics. For each data kind (float_points, rl_float_points, attribute_updates, env_attribute_updates) it reports how much was added (attempted), enqueued, sent (server-acked), and dropped (shed under backpressure or after exhausting retries), plus transport/health counters (connection_attempts, requests_sent, send_errors, errors_reported, errors_evicted).

Comparing added vs sent tells you whether anything was lost:

m = metrana.get_metrics()
print(m.float_points_added, m.float_points_sent, m.float_points_dropped)

The counters are monotonic, so diff two snapshots over a window to get rates (e.g. for a periodic health log):

import time

prev = metrana.get_metrics()
time.sleep(10)
now = metrana.get_metrics()
sent_per_sec = (now.float_points_sent - prev.float_points_sent) / 10
backlog = now.float_points_added - now.float_points_sent      # added but not yet acked
if now.float_points_dropped > prev.float_points_dropped:
    print("data loss in the last window — raise queue_capacity / max_pending_requests or retries")

Environment Variables

Variable                             Equivalent init() argument
METRANA_API_KEY                      api_key
METRANA_ORCHESTRATION_ID             orchestration_id
METRANA_BACKPRESSURE_STRATEGY        backpressure_strategy
METRANA_ERROR_MODES                  error_strategy
METRANA_RESUME_STRATEGY              resume_strategy
METRANA_LOG_LEVEL                    log_level
METRANA_EVENT_QUEUE_MAX_SIZE         queue_capacity
METRANA_SKIP_DRAIN_RENDER_ON_CLOSE   skip_drain_render_on_close
METRANA_RENDERING_CLOSE_TIMEOUT      rendering_close_timeout
