Inephany client library to use Metrana.
Metrana Client Library
Metrana is a metrics tracking client for ML/RL training runs. It provides a simple API to log metrics from training loops to the Metrana ingestion service. Logging is non-blocking: a Rust-backed engine batches points, streams them over gRPC, applies backpressure, and retries transient failures on a background thread, so your training loop is not slowed down.
Installation
pip install metrana
Requires Python 3.10+.
Supported platforms. metrana depends on the native metrana-logging-engine, which ships
prebuilt wheels only (there is no source distribution) for: Linux x86-64 and aarch64 (glibc
manylinux_2_28 and musl musllinux_1_2), macOS x86-64 and Apple Silicon, and Windows x86-64. On any
other platform (e.g. Windows on ARM, or a Linux whose glibc predates manylinux_2_28), pip install
will fail to find a compatible wheel; open an issue if you need a target added.
To log RL environment video with metrana.log_rendering(), install the optional rendering
extra (pulls in PyAV for client-side H.264 encoding):
pip install 'metrana[rendering]'
Quick Start
ML training run
import metrana

metrana.init(
    api_key="your-api-key",
    workspace_name="my-workspace",
    project_name="my-project",
    run_name="run-001",
)

for step in range(num_steps):
    ...
    metrana.log("loss", loss)  # one metric
    metrana.log({"accuracy": acc, "lr": lr})  # several at once

metrana.close()  # flush and shut down; always call this
RL training run
import metrana

metrana.init(api_key="...", workspace_name="ws", project_name="proj", run_name="rl-001")

for rl_step in range(num_updates):
    ...
    # one metric per episode
    metrana.log_rl_episode("episode_return", ep_return, rl_step=rl_step, episode=episode)
    # per-environment-step metrics for a batch of envs at once
    metrana.log_rl_environment_step(
        "reward", rewards, rl_step=rl_step, env_id=env_ids, episode=episodes
    )

metrana.close()
Logging standard metrics
metrana.log(metric_name, value, *, step=None, timestamp=None, scale=None, labels=None, evaluation=False)
logs to the default ML-step scale.
Values may be a scalar or an array — a Python list or any NumPy / PyTorch / JAX / TensorFlow tensor (any float dtype, on any device); arrays are converted to contiguous NumPy once before crossing into the engine:
metrana.log("loss", 0.5) # one point
metrana.log("loss", [0.51, 0.49, 0.48]) # three points (bulk)
metrana.log("grad_norm", grad_norm_tensor) # torch/np/jax/tf tensor
Multiple metrics at once — pass a mapping (values may themselves be scalars or arrays):
metrana.log({"loss": loss, "accuracy": acc})
Steps
A series is identified by (metric_name, scale, labels), and each series has its own step axis. The
step argument controls it:
| step value | meaning |
|---|---|
| None (default) | auto-increment from the series' last step |
| a single int | the step of the first point; further points in the call continue from it |
| a sequence/array | an explicit step per point (length must match value) |
metrana.log("loss", 0.5) # auto: next step
metrana.log("loss", [a, b, c], step=100) # steps 100, 101, 102
metrana.log("loss", [a, b, c], step=[10, 20, 30]) # explicit steps
Timestamps work the same way via timestamp (Unix milliseconds): None lets the server stamp
on arrival; a single int applies to every point; a sequence gives one per point.
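To make the three cases in the table concrete, here is a minimal pure-Python sketch of how a step argument could be expanded into one step per point. The function name resolve_steps is hypothetical and this is not the library's implementation; starting a fresh series at step 0 is an assumption.

```python
from collections.abc import Sequence


def resolve_steps(step, n_points, last_step=None):
    """Expand a `step` argument into one step per point (illustrative only)."""
    if step is None:
        # Auto-increment from the series' last step.
        # Assumption: a series with no history starts at step 0.
        start = 0 if last_step is None else last_step + 1
        return list(range(start, start + n_points))
    if isinstance(step, int):
        # A single int is the step of the first point; the rest continue from it.
        return list(range(step, step + n_points))
    if isinstance(step, Sequence):
        # One explicit step per point: the length must match the values.
        if len(step) != n_points:
            raise ValueError(f"expected {n_points} steps, got {len(step)}")
        return list(step)
    raise TypeError("step must be None, an int, or a sequence of ints")


print(resolve_steps(None, 3, last_step=4))  # [5, 6, 7]
print(resolve_steps(100, 3))                # [100, 101, 102]
print(resolve_steps([10, 20, 30], 3))       # [10, 20, 30]
```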
Scale, labels, and evaluation
These three arguments shape the series identity:
- scale: the step scale (a StandardMetricScale value: "ML_STEP", "EPISODE", "ENVIRONMENT_STEP"). None defaults to ML_STEP. Only log / log_distributed take it; the RL helpers fix their own scale.
- labels: a dict[str, str] that, together with the name and scale, identifies the series. Two points with the same name but different labels go to different series.
- evaluation: a shorthand that adds the label {"evaluation": "true"} (unless you already set the evaluation key in labels), so evaluation points form a series distinct from otherwise identically-identified training points.
metrana.log("reward", train_reward) # training series
metrana.log("reward", eval_reward, evaluation=True) # distinct eval series
metrana.log("reward", r, labels={"policy": "greedy"}) # distinct labelled series
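The identity rule above can be modelled as a hashable key. The sketch below is an illustration of the documented semantics only; series_key is a hypothetical helper, not part of metrana, and the key layout is an assumption.

```python
def series_key(metric_name, scale=None, labels=None, evaluation=False):
    """Build a hashable (name, scale, labels) identity for a series (illustrative)."""
    merged = dict(labels or {})
    if evaluation:
        # The shorthand only adds the label when the caller has not set it already.
        merged.setdefault("evaluation", "true")
    # None falls back to the default ML-step scale.
    return (metric_name, scale or "ML_STEP", frozenset(merged.items()))


train = series_key("reward")
evaluation = series_key("reward", evaluation=True)
greedy = series_key("reward", labels={"policy": "greedy"})
print(train != evaluation, train != greedy)  # True True
```

Because the labels are part of the key, any lookup for a series (for example its last step) has to use the same scale and labels that were used when logging.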
Retrieving the last step
metrana.get_last_step(metric_name, scale=None, labels=None) returns the last step logged for a
series, or None (pass the same scale / labels you logged it with). This is
seeded from the server at init(), so after a restart or resume you can continue from where the run
left off — useful when you want explicit steps but need to know the current position:
last = metrana.get_last_step("loss")
next_step = 0 if last is None else last + 1
metrana.log("loss", loss, step=next_step)
Closing
metrana.close() flushes queued points and shuts the background engine down. Always call it:
the engine runs on a daemon thread, so if the interpreter exits without close(), queued-but-unsent
points are lost (a warning is emitted at exit). Call it directly at the end of the script, or put it
in a try/finally so it also runs when the training loop raises.
close() flushes for up to close_timeout seconds (default 15). To abandon a run immediately,
dropping anything not yet sent, pass metrana.close(close_timeout=0) — pair it with
init(skip_drain_render_on_close=True) if you also want queued rendering frames dropped rather than
encoded.
Logging RL metrics
RL metrics use a two-level step: a major rl_step (the training/update step, which must not
decrease) plus a minor step. Two helpers cover the common scales; the scale is implied by the
function, so you never pass it explicitly. Both helpers also accept labels and evaluation, which
behave exactly as for standard metrics.
Per-episode metrics
metrana.log_rl_episode(metric_name, value, rl_step, episode=None, env_id=None, labels=None, evaluation=False)
— the episode is the minor step. Omit it to auto-increment from the last logged episode:
metrana.log_rl_episode("episode_return", ep_return, rl_step=rl_step, episode=episode)
Per-environment-step metrics
metrana.log_rl_environment_step(metric_name, value, rl_step, env_id=None, episode=None, labels=None, evaluation=False)
logs within-episode steps. The env-step (minor) axis is assigned automatically and resets whenever
episode changes, so you supply the episode each point belongs to, not the env-step.
It is vectorized over environments. Pass a list of env ids and a matching value block:
- single env: env_id="env0", value a scalar or 1D [T] array;
- many envs: env_id=["env0", "env1", ...] (length M), value a 1D [M] (one point each) or 2D [M, T] array.

episode (and timestamp) broadcast: a scalar, 1D, or 2D matching the value.
# 8 envs, one reward each at this rl_step
metrana.log_rl_environment_step("reward", rewards_8, rl_step=rl_step,
env_id=env_ids, episode=episodes)
# 8 envs x 128 timesteps in one call
metrana.log_rl_environment_step("reward", rewards_8x128, rl_step=rl_step,
env_id=env_ids, episode=episodes_8x128)
metrana.get_env_last_rl_step_and_episode(env_id) returns (last_rl_step, last_episode) for an
environment (either may be None) — handy for computing explicit steps after a resume.
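The automatic env-step axis described in this section (it advances within an episode and resets when the episode changes) can be pictured with a small counter. This is a sketch of the documented behaviour, not the engine's code; EnvStepCounter is a hypothetical name and numbering from 0 is an assumption.

```python
class EnvStepCounter:
    """Assign within-episode steps per environment (illustrative only)."""

    def __init__(self):
        # env_id -> (current episode, next env-step to hand out)
        self._state = {}

    def assign(self, env_id, episodes):
        """Return one env-step per point, given the episode of each point."""
        current_episode, next_step = self._state.get(env_id, (None, 0))
        steps = []
        for episode in episodes:
            if episode != current_episode:
                # A new episode restarts the minor axis (assumed to start at 0).
                current_episode, next_step = episode, 0
            steps.append(next_step)
            next_step += 1
        self._state[env_id] = (current_episode, next_step)
        return steps


counter = EnvStepCounter()
print(counter.assign("env0", [0, 0, 0, 1, 1]))  # [0, 1, 2, 0, 1]
print(counter.assign("env0", [1, 2]))           # [2, 0]
```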
Run configuration and attributes
metrana.log_config({"optimizer": {"name": "adam", "lr": 3e-4}, "batch_size": 256})
metrana.set_tags(["baseline", "v2"]) # replace the tag set
metrana.add_tags(["ablation"]) # add without removing
metrana.remove_tags(["baseline"]) # remove
metrana.set_description("LR sweep, seed 0")
config passed to init() is logged the same way (nested dicts/lists flatten under config/).
These run-level attributes — along with the git commit SHA and any tags/description given to
init() — are applied only by the process that creates the run, so distributed siblings that
resume it never clobber them.
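As an illustration of how a nested config could flatten under config/, here is a minimal sketch. flatten_config is a hypothetical helper, and using the list index as the path segment for list items is an assumption; the library's actual path layout may differ.

```python
def flatten_config(value, prefix="config"):
    """Flatten nested dicts/lists into '/'-delimited paths (illustrative only)."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, (list, tuple)):
        # Assumption: list items are addressed by their index.
        items = enumerate(value)
    else:
        # Leaf value: the accumulated prefix is its full path.
        return {prefix: value}
    flat = {}
    for key, child in items:
        flat.update(flatten_config(child, f"{prefix}/{key}"))
    return flat


config = {"optimizer": {"name": "adam", "lr": 3e-4}, "batch_size": 256}
for path, leaf in flatten_config(config).items():
    print(path, "=", leaf)
```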
For arbitrary run attributes use metrana.log_attributes(prefix_path, value). For
per-environment RL attributes (a distinct, env-scoped kind) use
metrana.log_env_attributes(env_id, value, episode=None).
Environment renderings
metrana.log_rendering(frame, rl_step, episode, env_id=None) appends a frame to a per-(env_id, episode) H.264 .mp4, encoded on a dedicated background thread (never blocks the training loop).
- frame: a uint8 NumPy array, (H, W, 3) RGB or (H, W) / (H, W, 1) grayscale. Width and height must be even (libx264 yuv420p).
- When the (env_id, episode) pair changes, the open encoder for that env is closed and a new one opened for the next episode.
Configure via init(): rendering_output_dir, rendering_fps, rendering_max_concurrent_encoders,
rendering_queue_max_size, skip_drain_render_on_close, rendering_close_timeout. Requires the
rendering extra.
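Since odd frame sizes are rejected, it can help to check a frame's shape before logging it. The sketch below works on plain shape tuples so it runs without NumPy; check_frame_shape and even_crop are hypothetical helpers written from the constraints listed above, not functions provided by metrana.

```python
def check_frame_shape(shape, dtype="uint8"):
    """Validate a frame shape against the documented constraints (illustrative)."""
    if dtype != "uint8":
        raise TypeError(f"frame must be uint8, got {dtype}")
    if len(shape) == 2:
        height, width, channels = shape[0], shape[1], 1  # (H, W) grayscale
    elif len(shape) == 3 and shape[2] in (1, 3):
        height, width, channels = shape                  # (H, W, 1) or (H, W, 3)
    else:
        raise ValueError(f"unsupported frame shape {shape}")
    if height % 2 or width % 2:
        raise ValueError(f"width and height must be even, got {width}x{height}")
    return height, width, channels


def even_crop(shape):
    """Largest even (H, W) that fits inside the frame, for cropping odd sizes."""
    height, width = shape[0], shape[1]
    return height - height % 2, width - width % 2


print(check_frame_shape((64, 64, 3)))  # (64, 64, 3)
print(even_crop((65, 63, 3)))          # (64, 62)
```

With a NumPy array this could be used as `check_frame_shape(frame.shape, str(frame.dtype))`, cropping with `frame[:h, :w]` for the `(h, w)` returned by `even_crop(frame.shape)` when the size is odd.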
Naming rules
- Metric names identify a series together with the scale and labels. Use /-delimited prefixes to group related series (e.g. train/loss, eval/loss are distinct series); labels and the evaluation shorthand are an alternative way to split a name into distinct series.
- Environment ids appear in URLs, so they must be URL-safe segments.
- Config / attribute paths are /-delimited; keys must be non-empty and contain only [a-zA-Z0-9._-:/].
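A client-side pre-check for config / attribute paths might look like the sketch below. It reflects one reading of the rule above: the path is split on "/", and every key must be non-empty and made of letters, digits, ".", "_", "-" or ":". Both that reading and the helper name is_valid_path are assumptions; the server remains the authority on what it accepts.

```python
import re

# One key between "/" separators: letters, digits, ".", "_", ":", "-".
_KEY_RE = re.compile(r"[a-zA-Z0-9._:\-]+")


def is_valid_path(path):
    """Return True if every '/'-separated key is non-empty and uses allowed characters."""
    return all(_KEY_RE.fullmatch(key) for key in path.split("/"))


print(is_valid_path("config/optimizer/lr"))  # True
print(is_valid_path("config//lr"))           # False (empty key)
print(is_valid_path("config/learning rate")) # False (space not allowed)
```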
Distributed logging
When several processes (e.g. distributed-training ranks) log into one run, three things matter:
1. They must agree on the run. Every process that should share a run needs the same
orchestration_id. If you don't pass one, it is resolved automatically from
METRANA_ORCHESTRATION_ID, then the framework job ids TORCHELASTIC_RUN_ID / SLURM_JOB_ID /
RAY_JOB_ID, then a random token (which only descendants that inherit the environment will match).
The resolved value is published back to METRANA_ORCHESTRATION_ID so forked/spawned children
inherit it. With resume_strategy="never" (the default), the first process creates the run and the
rest resume it by matching this identifier; a genuinely different job hitting the same run name errors
instead of corrupting it.
# torchrun / Slurm / Ray: nothing to do — the framework job id is picked up automatically.
metrana.init(api_key="...", workspace_name="ws", project_name="proj", run_name="run")
# Custom launcher: pass a token shared by all workers of the job.
metrana.init(..., orchestration_id="job-2025-06-23-abc")
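The resolution order from point 1 can be sketched as a small function. This is an illustration of the documented precedence, not the library's code: resolve_orchestration_id is a hypothetical name, the token format is an assumption, and publishing an explicitly passed id back to the environment is also an assumption.

```python
import os
import secrets

# Checked in order when no explicit id is passed.
_FALLBACK_VARS = (
    "METRANA_ORCHESTRATION_ID",
    "TORCHELASTIC_RUN_ID",
    "SLURM_JOB_ID",
    "RAY_JOB_ID",
)


def resolve_orchestration_id(explicit=None, environ=None):
    """Pick the orchestration id and publish it for child processes (illustrative)."""
    environ = os.environ if environ is None else environ
    resolved = explicit
    if resolved is None:
        for name in _FALLBACK_VARS:
            if environ.get(name):
                resolved = environ[name]
                break
    if resolved is None:
        # Last resort: a random token that only descendants inheriting the
        # environment will see.
        resolved = secrets.token_hex(8)
    # Publish so forked/spawned children resolve the same value.
    environ["METRANA_ORCHESTRATION_ID"] = resolved
    return resolved


env = {"SLURM_JOB_ID": "4242"}
print(resolve_orchestration_id(environ=env))  # 4242
print(env["METRANA_ORCHESTRATION_ID"])        # 4242
```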
2. Choose the right log function for shared series. Use metrana.log_distributed(...) (instead
of metrana.log(...)) when multiple processes write to the same series — for example all ranks
logging a global loss. It uses unordered, merge semantics so concurrent writers don't conflict.
Provide an explicit step (the global training step) so points from different ranks align on the
same axis:
metrana.log_distributed("loss", loss, step=global_step)
Use plain metrana.log(...) for series owned by a single writer (it is ordered and can
auto-increment). Pin logger_id (e.g. one per rank) if you want the backend to distinguish a
restarted writer from a genuinely new concurrent one.
3. RL metrics need an exclusive owner per environment. The RL functions
(metrana.log_rl_episode(...) and metrana.log_rl_environment_step(...)) are ordered per
env_id series, so a given environment must be logged by exactly one process. When you shard
environments across ranks (e.g. a vectorized env split over workers), make sure each process writes
only its own subset of env_ids — two processes logging the same environment race and silently lose
steps/points. Plain metrana.log(...) / metrana.log_distributed(...) float series have no such
restriction (log_distributed is explicitly built for many writers on one series).
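For the exclusive-owner rule in point 3, one simple way to shard environments is a strided split by rank, which gives every process a disjoint subset. The helper env_ids_for_rank is hypothetical, and how rank and world size are obtained depends on your launcher.

```python
def env_ids_for_rank(env_ids, rank, world_size):
    """Return the disjoint subset of env ids that this rank should log."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    # Strided split: rank r takes items r, r + world_size, r + 2 * world_size, ...
    return list(env_ids[rank::world_size])


all_envs = [f"env{i}" for i in range(8)]
for rank in range(3):
    print(rank, env_ids_for_rank(all_envs, rank, world_size=3))
# 0 ['env0', 'env3', 'env6']
# 1 ['env1', 'env4', 'env7']
# 2 ['env2', 'env5']
```

Each rank then passes only its own subset as env_id to the RL helpers, so no environment has two writers.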
Guarantees and retries
By default Metrana favors never blocking your training loop over guaranteeing delivery. Know the trade-offs:
- Backpressure (backpressure_strategy, default "drop_new"): when the in-process queue is full, an enqueue waits up to enqueue_timeout_secs (default 0.1) for room, then:
  - "drop_new": drops the new points (default; protects the loop, can lose data under sustained pressure);
  - "block": waits indefinitely (no loss from queue pressure, but can stall the loop);
  - "raise": raises MetranaEventQueueFullError.
- Retries (max_send_retries, default 60): failed sends are retried with exponential backoff (send_retry_initial_backoff_secs → send_retry_max_backoff_secs). After the limit the batch is dropped. Set max_send_retries=None to retry indefinitely (no loss from transient outages, at the cost of unbounded buffering).
- Errors (error_strategy, default "warn"): how background errors surface: "silent", "warn", "raise_on_log" (raised on the next log call), or "raise_on_close". Drain the engine's sender errors yourself with metrana.check_sender_errors(). (Rendering/encoding errors follow the same strategy but surface on log_rendering() / close().)
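The three backpressure strategies can be modelled with the standard library's bounded queue. This is a sketch of the described behaviour only: the real engine is Rust-backed, and enqueue and QueueFullError here are hypothetical stand-ins (the latter for MetranaEventQueueFullError).

```python
import queue


class QueueFullError(Exception):
    """Stand-in for the error raised by the "raise" strategy."""


def enqueue(q, item, strategy="drop_new", timeout=0.1):
    """Try to enqueue; return True if the item was queued, False if dropped."""
    try:
        # Wait up to `timeout` seconds for room in the bounded queue.
        q.put(item, timeout=timeout)
        return True
    except queue.Full:
        if strategy == "drop_new":
            return False                 # shed the new item, keep the loop moving
        if strategy == "raise":
            raise QueueFullError("queue stayed full past the timeout")
        if strategy == "block":
            q.put(item)                  # wait indefinitely for room
            return True
        raise ValueError(f"unknown strategy {strategy!r}")


q = queue.Queue(maxsize=1)
print(enqueue(q, "point-1"))                # True
print(enqueue(q, "point-2", timeout=0.01))  # False (dropped)
```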
Points can be lost only when: backpressure is drop_new and the queue stays full past the
timeout; or max_send_retries is finite and a failure persists past it; or the process exits
without metrana.close() (the daemon engine thread is killed with points still queued).
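To get a feel for how long a batch survives before it is dropped, the retry delays can be tabulated. The sketch below assumes a doubling factor with no jitter and uses made-up backoff values; the engine's actual growth factor and the defaults of send_retry_initial_backoff_secs / send_retry_max_backoff_secs are not stated here, so treat the numbers as illustrative.

```python
def backoff_schedule(initial_secs, max_secs, max_retries, factor=2.0):
    """Delay before each retry: exponential growth capped at max_secs."""
    delays = []
    delay = initial_secs
    for _ in range(max_retries):
        delays.append(min(delay, max_secs))
        delay *= factor
    return delays


schedule = backoff_schedule(initial_secs=0.5, max_secs=4.0, max_retries=5)
print(schedule)       # [0.5, 1.0, 2.0, 4.0, 4.0]
print(sum(schedule))  # 11.5 seconds of waiting before the batch would be dropped
```

An outage that lasts longer than the summed delays (plus the time spent on the failed attempts themselves) would exhaust a finite max_send_retries.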
To prioritize delivery over loop latency:
metrana.init(
    ...,
    backpressure_strategy="block",  # never drop on queue pressure
    max_send_retries=None,          # retry transient failures forever
    queue_capacity=100_000,         # more headroom before backpressure kicks in
)
# ... and always call metrana.close() (give it a generous close_timeout).
Tuning and observability
- max_pending_requests (default 30): in-flight streaming requests; raise it to push more throughput when the backend is the bottleneck.
- queue_capacity (default 10_000): in-process point buffer depth.
- batch_max_age_secs (default 1.0): how long points wait to coalesce into a batch before sending.
- max_msg_size: max serialized request size in bytes.
metrana.get_metrics() returns a point-in-time snapshot of the engine's self-metrics. For each data
kind (float_points, rl_float_points, attribute_updates, env_attribute_updates) it reports how
much was added (attempted), enqueued, sent (server-acked), and dropped (shed under
backpressure or after exhausting retries), plus transport/health counters (connection_attempts,
requests_sent, send_errors, errors_reported, errors_evicted).
Comparing added vs sent tells you whether anything was lost:
m = metrana.get_metrics()
print(m.float_points_added, m.float_points_sent, m.float_points_dropped)
The counters are monotonic, so diff two snapshots over a window to get rates (e.g. for a periodic health log):
import time
prev = metrana.get_metrics()
time.sleep(10)
now = metrana.get_metrics()
sent_per_sec = (now.float_points_sent - prev.float_points_sent) / 10
backlog = now.float_points_added - now.float_points_sent # added but not yet acked
if now.float_points_dropped > prev.float_points_dropped:
    print("data loss in the last window: raise queue_capacity / max_pending_requests or retries")
Environment Variables
| Variable | Equivalent init() argument |
|---|---|
| METRANA_API_KEY | api_key |
| METRANA_ORCHESTRATION_ID | orchestration_id |
| METRANA_BACKPRESSURE_STRATEGY | backpressure_strategy |
| METRANA_ERROR_MODES | error_strategy |
| METRANA_RESUME_STRATEGY | resume_strategy |
| METRANA_LOG_LEVEL | log_level |
| METRANA_EVENT_QUEUE_MAX_SIZE | queue_capacity |
| METRANA_SKIP_DRAIN_RENDER_ON_CLOSE | skip_drain_render_on_close |
| METRANA_RENDERING_CLOSE_TIMEOUT | rendering_close_timeout |