Rundial Python SDK (non-blocking ingest with bounded spool and ergonomic run API)

rundial

pip install rundial

Phase 4 introduces a non-blocking metrics transport with:

  • bounded in-memory queue on the training thread
  • background flush worker
  • bounded disk spool (default enabled)
  • gzip compression in worker transport (threshold-based)
  • retry with exponential backoff + jitter
  • diagnostics counters for dropped/accepted/retried points
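The retry item above can be illustrated with a minimal sketch of capped exponential backoff with full jitter. The function name, the base and cap values, and the choice of full jitter are assumptions made for illustration; the SDK's actual retry constants are not documented here.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one sleep duration (seconds) per retry attempt.

    Hypothetical helper: the ceiling doubles on each attempt up to `cap`,
    and the actual delay is drawn uniformly from [0, ceiling] (full jitter).
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(5)):
        print(f"retry {attempt}: sleep {delay:.3f}s")
```

Jitter spreads retries from many training processes over time so that they do not hit a recovering server at the same instant.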

CLI quickstart

Installing rundial also installs the rundial CLI:

rundial init --endpoint http://127.0.0.1:8787
rundial auth whoami
rundial target ls
rundial doctor

Operational commands:

rundial workspace ls
rundial project ls --workspace default-workspace

rundial run start --workspace default-workspace --project default-project --name baseline-001 --kind training
rundial run list --workspace default-workspace --project default-project --status running
rundial run status run_...
rundial run finish run_... --state completed

rundial metrics tail run_... --workspace default-workspace --project default-project --keys train/loss
rundial metrics export run_... \
  --workspace default-workspace \
  --project default-project \
  --keys train/loss \
  --format csv \
  --out metrics.csv
rundial logs tail run_... --workspace default-workspace --project default-project --min-level info
rundial logs export run_... \
  --workspace default-workspace \
  --project default-project \
  --format json \
  --out logs.json

All commands accept the global --json flag for machine-readable output. Exit codes are stable: 0 for success, 1 for a command or transport error, and 2 for an authentication/authorization failure.
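Because the exit codes are stable, a wrapper script can branch on them. The sketch below is one way to do that from Python; the helper names are hypothetical, and placing --json at the end of the command line is an assumption.

```python
import subprocess

# Meanings taken from the exit-code contract described above.
EXIT_MEANINGS = {
    0: "success",
    1: "command or transport error",
    2: "authentication/authorization failure",
}


def describe_exit(code):
    """Map a rundial CLI exit code to a human-readable meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_cli(args):
    """Run the rundial CLI and return (exit_code, meaning, stdout).

    Assumption: the global --json flag may be appended after the subcommand.
    """
    proc = subprocess.run(
        ["rundial", *args, "--json"], capture_output=True, text=True
    )
    return proc.returncode, describe_exit(proc.returncode), proc.stdout
```

For example, `run_cli(["auth", "whoami"])` would let a CI job distinguish a bad API key (2) from an unreachable endpoint (1).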

To run the CLI operational smoke test, start the open-core stack and then run:

RUNDIAL_API_KEY=rdk_... python user_tests/cli_operational_parity_smoke.py \
  --workspace default-workspace \
  --project default-project

Start a run with workspace/project strings only:

rundial run start \
  --endpoint http://127.0.0.1:8787 \
  --workspace default-workspace \
  --project default-project \
  --name baseline-001 \
  --kind training

Config precedence:

  1. CLI flags
  2. env vars (RUNDIAL_API_KEY, RUNDIAL_ENDPOINT, RUNDIAL_WORKSPACE, RUNDIAL_PROJECT)
  3. ~/.config/rundial/config.toml
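The precedence order above can be sketched as a small lookup function. This is an illustration only: the function name and the config-file key names are assumptions, and the real resolver may differ in how it treats empty values.

```python
import os

# Environment variable names as listed in the precedence rules above.
ENV_NAMES = {
    "api_key": "RUNDIAL_API_KEY",
    "endpoint": "RUNDIAL_ENDPOINT",
    "workspace": "RUNDIAL_WORKSPACE",
    "project": "RUNDIAL_PROJECT",
}


def resolve_setting(name, cli_flags, file_config, environ=None):
    """Resolve one setting: CLI flag first, then env var, then config file.

    `cli_flags` and `file_config` are plain dicts; the key names used for
    the config file are hypothetical.
    """
    environ = os.environ if environ is None else environ
    if cli_flags.get(name) is not None:
        return cli_flags[name]
    env_value = environ.get(ENV_NAMES[name])
    if env_value:
        return env_value
    return file_config.get(name)
```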

Quick start (recommended)

import rundial as rd

with rd.init(
    workspace="team-alpha",
    project="mnist-demo",
    name="baseline-001",
    kind="training",
    endpoint="http://127.0.0.1:8787",
    api_key="rdk_...",
    mode="online",
) as run:
    run.log({"train/loss": 0.42, "train/acc": 0.91}, step=1)
    run.log_metric("eval/loss", value=0.31, time_ms=1_760_000_000_000)
    run.log_text("starting eval loop", level="info")
    run.checkpoint("checkpoints/model.pt", step=1)

# `run.finish()` / `run.close()` finalize the run as `completed`.
# Use `run.fail(...)` or `run.abort(...)` for explicit terminal outcomes.

This slug-first mode resolves workspace/project to the canonical internal run target before run start. Use kind="agent" or kind="eval" for agent and evaluation runs; the default is kind="training".

Logs and console capture

run.log_text(message, level="info") shares the same bounded, non-blocking queue as metric logging. Messages are capped at 8 KiB, truncated lines are flagged, and queue drops are visible through run.diagnostics().
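As a rough sketch of the message cap, the helper below truncates a message and reports whether truncation happened. It assumes the 8 KiB limit is measured in UTF-8 bytes, which the text above does not state; the function name is hypothetical.

```python
MAX_MESSAGE_BYTES = 8 * 1024  # 8 KiB cap from the description above


def cap_message(message, limit=MAX_MESSAGE_BYTES):
    """Return (text, truncated) with text limited to `limit` UTF-8 bytes."""
    data = message.encode("utf-8")
    if len(data) <= limit:
        return message, False
    # errors="ignore" drops a trailing partial multi-byte character so the
    # result is always valid text.
    return data[:limit].decode("utf-8", errors="ignore"), True
```

The returned flag plays the role of the "truncated" marker on a log line, and a counter incremented alongside it would correspond to a diagnostics value.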

import rundial as rd

with rd.init(
    workspace="team-alpha",
    project="mnist-demo",
    name="logs-demo",
    endpoint="http://127.0.0.1:8787",
    api_key="rdk_...",
    capture_console=True,
) as run:
    print("stdout is mirrored into Rundial logs")
    run.log_text("manual warning", level="warn")

capture_console=True tees stdout as info and stderr as error. The caller still writes to the original stream, and when the bounded queue is full Rundial drops the line and counts the drop instead of blocking the training process.
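The drop-and-count behavior can be sketched with the standard library's bounded queue. The class below is an illustration under the assumption of a single producer thread; it is not the SDK's internal queue, and the names are hypothetical.

```python
import queue


class DropCountingQueue:
    """Bounded queue that never blocks the producer; overflow is counted."""

    def __init__(self, maxsize):
        self._queue = queue.Queue(maxsize=maxsize)
        self.accepted = 0
        self.dropped = 0

    def offer(self, item):
        """Enqueue without blocking; return False if the item was dropped."""
        try:
            self._queue.put_nowait(item)
        except queue.Full:
            self.dropped += 1
            return False
        self.accepted += 1
        return True

    def drain(self):
        """Remove and return everything currently queued (worker side)."""
        items = []
        while True:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                return items
```

The key design choice is `put_nowait`: the training thread pays a constant cost per line, and pressure shows up as a counter rather than as latency.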

Lightweight traces

Trace spans use the same non-blocking ingest worker and disk spool as metrics and logs. Attributes and events are normalized in the worker; large prompt, completion, or tool-output values above 16 KiB are uploaded through the artifact pipe and replaced on the span with a small evidence reference.
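To make the large-value handling concrete, here is a minimal sketch of replacing oversized string attributes with a small reference. The reference shape, the `upload` callable, and the use of a SHA-256 digest as the key are all assumptions for illustration; only the 16 KiB threshold comes from the text above.

```python
import hashlib

LARGE_VALUE_BYTES = 16 * 1024  # threshold from the description above


def offload_large_attrs(attrs, upload, limit=LARGE_VALUE_BYTES):
    """Return a copy of `attrs` with oversized strings replaced by references.

    `upload(digest, data)` is a hypothetical callable standing in for the
    artifact pipe.
    """
    normalized = {}
    for key, value in attrs.items():
        if isinstance(value, str):
            data = value.encode("utf-8")
            if len(data) > limit:
                digest = hashlib.sha256(data).hexdigest()
                upload(digest, data)
                # Hypothetical reference shape kept on the span.
                normalized[key] = {"evidence_ref": digest, "bytes": len(data)}
                continue
        normalized[key] = value
    return normalized
```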

import rundial as rd

with rd.init(
    workspace="team-alpha",
    project="mnist-demo",
    name="agent-demo",
    kind="agent",
    endpoint="http://127.0.0.1:8787",
    api_key="rdk_...",
) as run:
    with run.trace("planner.step", attrs={"phase": "plan"}) as span:
        span.event("prompt.ready", {"tokens": 128})
        span.set_attrs({"model": "example-model"})

    run.tool_call("search", input={"q": "Ada Lovelace"}, output="large tool output...")

Artifacts and checkpoints

run.log_artifact(path_or_dir, name="checkpoint") enqueues artifact work and returns before hashing or uploading files. A dedicated background uploader handles manifest hashing, pre-signed upload URLs, multipart uploads for large files, and finalization without sharing the metrics/log worker.

import rundial as rd

with rd.init(workspace="team-alpha", project="mnist-demo", api_key="rdk_...") as run:
    run.log_artifact("outputs/eval-report", name="eval-report")
    run.checkpoint("checkpoints/model.pt", step=100, keep_last=5)
    rd.checkpoint("checkpoints/model.pt", step=101)

Artifact upload jobs are journaled in the SDK spool directory and retried by the next client process if an upload is interrupted. run.checkpoint(...) and the current-run convenience rd.checkpoint(...) use artifact type checkpoint, alias latest, and a server-enforced keep-last retention policy. The API default is to keep the latest 5 finalized checkpoints per run/name when a client omits the hint; pass keep_last=K to tune it for a checkpoint call.
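The keep-last policy is enforced by the server, but its effect can be sketched in a few lines. The function below assumes "latest" is ordered by step, which is an assumption; the server may order by finalization time instead.

```python
def split_retention(checkpoint_steps, keep_last=5):
    """Split finalized checkpoint steps into (kept, expired) lists.

    Illustrative only: mirrors a keep-last-K policy for one run/name pair.
    """
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    ordered = sorted(checkpoint_steps)
    return ordered[-keep_last:], ordered[:-keep_last]
```

With the default of 5, seven checkpoints at steps 100 through 700 would keep 300 through 700 and expire 100 and 200.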

To consume an artifact from another run, record lineage and download through a blocking handle:

import rundial as rd

with rd.init(workspace="team-alpha", project="mnist-demo", api_key="rdk_...") as run:
    artifact = run.use_artifact("checkpoint:latest")
    artifact.download("inputs/checkpoint")

Lineage UI is still in progress for the v1 artifact milestone.

Media

run.log(...) accepts image and table helper values for common visual inspection workflows. Media bytes ride the artifact uploader, while Rundial stores only a bounded manifest row for querying and display.

import rundial as rd

with rd.init(workspace="team-alpha", project="mnist-demo", api_key="rdk_...") as run:
    run.log({"samples": rd.Image("outputs/sample-grid.png", caption="validation samples")}, step=10)
    run.log(
        {
            "predictions": rd.Table(
                columns=["id", "label", "score"],
                rows=[["img-1", "cat", 0.91], ["img-2", "dog", 0.87]],
            )
        },
        step=10,
    )

rd.Image(...) accepts filesystem paths, PIL-like objects with save(...), and uint8 numpy-like arrays shaped (height, width), (height, width, 1), (height, width, 3), or (height, width, 4). Array and PIL-like serialization happens in the artifact worker, not inside run.log(...). File-backed media jobs are replayable through the artifact journal; generated media is best-effort until the worker materializes the generated file.
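The accepted array layouts can be expressed as a small pre-check that a caller might run before logging. This helper is hypothetical and not part of the SDK; requiring positive dimensions is an added assumption.

```python
def is_supported_image_array(shape, dtype_name):
    """Check an array's shape and dtype against the layouts listed above.

    `shape` is a tuple such as `array.shape`; `dtype_name` is a string such
    as `str(array.dtype)`.
    """
    if dtype_name != "uint8":
        return False
    if any(dim <= 0 for dim in shape):
        return False
    if len(shape) == 2:  # (height, width) grayscale
        return True
    if len(shape) == 3:  # (height, width, channels)
        return shape[2] in (1, 3, 4)
    return False
```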

Framework integrations

Install optional framework adapters only when you need them:

pip install "rundial[integrations]"

| Framework | Import | What it maps |
| --- | --- | --- |
| PyTorch Lightning | from rundial.integrations import RundialLogger | hyperparams to run config, metrics to run.log(...), checkpoints to artifacts |
| Hugging Face Transformers | from rundial.integrations import RundialCallback | Trainer args/model config to run config, logs/eval metrics to run.log(...), saved checkpoints to artifacts |
| Keras | from rundial.integrations import RundialKerasCallback | fit/optimizer params to run config, epoch/batch metrics to run.log(...), checkpoint paths to artifacts |

The base rundial install has no hard framework dependencies. Adapter imports remain safe without Lightning, Transformers, or Keras installed; installing the extra provides the native callback base classes for framework type checks.
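One common way to keep adapter imports safe without the framework installed is an optional base class. The sketch below shows that general pattern; it is not taken from the rundial source, and `optional_base` and `ExampleCallback` are hypothetical names.

```python
import importlib


def optional_base(module_name, class_name):
    """Return the framework's base class if importable, else plain `object`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return object
    return getattr(module, class_name, object)


# keras.callbacks.Callback is the native base when Keras is installed;
# otherwise the adapter still imports and falls back to `object`.
_KerasBase = optional_base("keras.callbacks", "Callback")


class ExampleCallback(_KerasBase):
    """Illustrative adapter: forwards epoch metrics as a plain dict."""

    def on_epoch_end(self, epoch, logs=None):
        return dict(logs or {})
```

With the framework installed, `isinstance` checks against the native base class succeed; without it, the module still imports cleanly.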

W&B compatibility

For common W&B-style training scripts, swap only the import line:

import rundial.compat.wandb as wandb

The shim supports wandb.init, wandb.log, wandb.config, wandb.finish, run.summary, wandb.Image, wandb.Table, wandb.watch, wandb.define_metric, and wandb.login. Unsupported symbols raise NotImplementedError with a pointer to the compatibility table in docs/wandb-compat.md.
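The fail-loudly behavior for unsupported symbols can be sketched with `__getattr__`. This is an illustration of the idea, not the shim's actual implementation; the class name and constructor are hypothetical.

```python
class CompatShim:
    """Expose supported callables; raise NotImplementedError for the rest."""

    def __init__(self, implementations):
        self._implementations = dict(implementations)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            # Keep private/dunder probes (copy, pickle, hasattr) well-behaved.
            raise AttributeError(name)
        try:
            return self._implementations[name]
        except KeyError:
            raise NotImplementedError(
                f"wandb.{name} is not supported by this shim; "
                "see docs/wandb-compat.md"
            ) from None
```

Raising NotImplementedError instead of AttributeError makes an unsupported call fail with a message that names the compatibility table, rather than looking like a typo.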

Resume existing runs

Use run_id with an explicit resume mode when restarting a crashed or interrupted job:

import rundial as rd

with rd.init(
    workspace="team-alpha",
    project="mnist-demo",
    run_id="run_abc123",
    resume="allow",
    endpoint="http://127.0.0.1:8787",
    api_key="rdk_...",
) as run:
    run.log({"train/loss": 0.38}, step=50)

Resume modes:

  • resume="never" (default): create run_id only if it does not already exist.
  • resume="allow": attach to a running run or create it if missing; terminal runs are not reopened.
  • resume="must": require an existing run; terminal runs are explicitly reopened as running.
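The three modes can be summarized as a decision function. This is a sketch of the rules as stated above, with two labeled assumptions: the action names are hypothetical, and "allow" on a terminal run is modeled as an error because the text only says such runs are not reopened.

```python
def plan_resume(mode, existing_state):
    """Return the action for a run_id given its current server-side state.

    `existing_state` is None (no such run), "running", or any terminal
    state such as "completed". Returns "create", "attach", "reopen", or
    "error".
    """
    if mode not in ("never", "allow", "must"):
        raise ValueError(f"unknown resume mode: {mode!r}")
    if existing_state is None:
        return "error" if mode == "must" else "create"
    if mode == "never":
        return "error"  # run_id already exists
    if existing_state == "running":
        return "attach"
    # Terminal run: only "must" reopens it. Treating "allow" as an error
    # here is an assumption.
    return "reopen" if mode == "must" else "error"
```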

Duplicate steps are resolved at query time. Rundial keeps raw metric rows append-only, but series queries show the latest accepted value per (runId, metricKey, step) using ingest time, with a stable row-id tie breaker. This keeps training-loop ingest fast while resumed curves remain monotonic by step.
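The query-time rule can be sketched as a reduction over raw rows. The field names below are hypothetical, and the tie-break direction (higher row id wins when ingest times are equal) is an assumption; the text above only says the tie breaker is stable.

```python
def latest_per_step(rows):
    """Pick one row per (run_id, key, step): latest ingest time wins.

    Each row is a dict with run_id, key, step, value, ingest_ms, row_id.
    Raw rows are left untouched, matching the append-only storage model.
    """
    winners = {}
    for row in rows:
        ident = (row["run_id"], row["key"], row["step"])
        current = winners.get(ident)
        rank = (row["ingest_ms"], row["row_id"])
        if current is None or rank > (current["ingest_ms"], current["row_id"]):
            winners[ident] = row
    return sorted(
        winners.values(), key=lambda r: (r["run_id"], r["key"], r["step"])
    )
```

A resumed job that re-logs step 50 therefore replaces the earlier value in the displayed series without any write-time deduplication.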

Discovery helpers

import rundial as rd

client = rd.Client(
    endpoint="http://127.0.0.1:8787",
    api_key="rdk_...",
    spool_enabled=False,
    start_worker_on_init=False,
)
print(client.whoami())
print(client.list_workspaces())
print(client.list_projects("default-workspace"))
client.close(timeout_seconds=0.1, drain=False)

If the server does not expose /api/v1/runs/resolve-target, slug-first run start fails with an actionable stale-build error. Rebuild or restart the API server and retry.

Runtime notes

  • run.log() / run.log_metric() are non-blocking and never perform network or disk I/O on the calling thread.
  • run.log_text() and opt-in console capture use the same non-blocking queue and expose log_lines_truncated, dropped_log_lines_queue_full, and dropped_log_lines_invalid diagnostics.
  • system metrics are sampled by a background thread by default and logged as ordinary system/* metrics; pass system_metrics=False to rd.init(...) to opt out, or system_metrics_interval_seconds=... to tune the cadence (minimum 2 seconds).
  • run.finish() / run.close() flush and finalize the run; use client.close(...) when you only want to release the client transport.
  • NaN and infinite metric values are dropped without raising, counted in run.diagnostics().non_finite_dropped, and warn once per metric key.
  • disk spool is enabled by default at .rundial_spool and is bounded by size/age.
  • if disk spool writes fail, fallback memory buffering stays bounded and drops oldest points.
  • close() returns within the requested timeout plus a bounded transport wait; when it cannot send all pending points before the deadline, un-sent points are handed to the disk spool and re-sent by the next process.
  • run.diagnostics().pending_spooled_batches reports durable batches waiting for delivery.
  • worker transport can gzip large payloads (gzip_enabled, gzip_min_bytes).
  • use run.diagnostics() to inspect queue pressure, retries, and drop counters.
  • modes:
    • online (default): upload in background with retries/spool fallback
    • offline: buffer to spool only (no upload attempts)
    • disabled: safe no-op logging for tests and dry-runs
  • distributed policy:
    • distributed="rank0" (default): only rank 0 emits logs
    • distributed="all": all ranks emit logs (use with caution for cardinality/volume)
  • rank detection uses common env vars (RANK, LOCAL_RANK, SLURM_PROCID, etc.); override explicitly with distributed_rank=<int>.
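The non-finite handling described in the runtime notes above (drop without raising, count, warn once per key) can be sketched as follows. The class name and warning text are hypothetical; only the counter name non_finite_dropped comes from the notes.

```python
import math
import warnings


class NonFiniteFilter:
    """Drop NaN/inf metric values, count them, and warn once per key."""

    def __init__(self):
        self.non_finite_dropped = 0
        self._warned_keys = set()

    def accept(self, key, value):
        """Return True if the value should be logged, False if dropped."""
        if math.isfinite(value):
            return True
        self.non_finite_dropped += 1
        if key not in self._warned_keys:
            self._warned_keys.add(key)
            warnings.warn(
                f"dropping non-finite value for metric {key!r}",
                RuntimeWarning,
                stacklevel=2,
            )
        return False
```

Warning once per key keeps a diverging loss from flooding stderr while the counter still records every dropped point.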

Backward-compatible low-level API

from rundial_sdk import RundialClient

RundialClient remains supported for advanced/manual lifecycle control.

Benchmark guardrail

Run the Phase 4 benchmark/guardrail script:

bun run test:phase4:sdk:benchmark

The command validates hot-path latency and bounded spool behavior under sustained retryable failures.

Download files

Source Distribution

rundial-1.0.0rc1.tar.gz (88.0 kB)

Uploaded Source

Built Distribution

rundial-1.0.0rc1-py3-none-any.whl (94.2 kB)

Uploaded Python 3

File details

Details for the file rundial-1.0.0rc1.tar.gz.

File metadata

  • Download URL: rundial-1.0.0rc1.tar.gz
  • Upload date:
  • Size: 88.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rundial-1.0.0rc1.tar.gz
Algorithm Hash digest
SHA256 875a95342f39b7543a3b8a95c706e0eb527152c651065f1e544a0d49cc1de571
MD5 53973703b49ba86623e892f0760ff19e
BLAKE2b-256 4908c8e3775834417e04369e9b13be7edc176288d3d9f287a26fdbf985eda81b

Provenance

The following attestation bundles were made for rundial-1.0.0rc1.tar.gz:

Publisher: release.yml on rundial-dev/rundial

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rundial-1.0.0rc1-py3-none-any.whl.

File metadata

  • Download URL: rundial-1.0.0rc1-py3-none-any.whl
  • Upload date:
  • Size: 94.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rundial-1.0.0rc1-py3-none-any.whl
Algorithm Hash digest
SHA256 42dd3d2eb6ca9903d8bd0149de1c2ae8bd5048778995989193e9840854b45468
MD5 e74d282690261261f66bca6927feb0c1
BLAKE2b-256 4e90260b06dfffc559adbce777ac202b211215104e6ffd52b5d2ab4f892dd581

Provenance

The following attestation bundles were made for rundial-1.0.0rc1-py3-none-any.whl:

Publisher: release.yml on rundial-dev/rundial

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
