Shared OpenTelemetry library for NVIDIA NeMo ecosystem (Megatron-LM, NeMo-RL, NeMo-Gym)

NeMo Lens

Requires Python 3.13+.

Early development: This library is under active development. Expect breaking changes between releases.

Shared OpenTelemetry instrumentation library for the NVIDIA NeMo ecosystem (Megatron-LM, NeMo-RL, NeMo-Gym).

Provides unified tracing, metrics, and log bridging across distributed training jobs. Instrumentation is cheap when disabled: group-gated calls (managed_span, @trace_fn) cost only a single frozenset lookup when their span group is off. In that case managed_span yields None (its body still runs) and @trace_fn simply calls the wrapped function. span_cm is always on and is not gated. Only opentelemetry-api (a no-op without the SDK) is required at import time; the full SDK loads only on exporting ranks.
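The disabled-path behavior described above can be pictured with a small stand-in. This is a sketch of the gating idea only, not the NeMo Lens implementation: ENABLED_GROUPS, gated_span, and the dict used as a fake span are hypothetical names invented for illustration.

```python
from contextlib import contextmanager

# Hypothetical set of enabled span groups; the real library derives this from its config.
ENABLED_GROUPS = frozenset({"job", "checkpoint", "evaluate"})


@contextmanager
def gated_span(group, name, **attrs):
    """Illustrative stand-in for a group-gated span context manager."""
    if group not in ENABLED_GROUPS:  # the single frozenset lookup
        yield None  # disabled: no span is created, but the body still runs
        return
    span = {"name": name, "attributes": dict(attrs)}  # placeholder for a real span
    yield span


ran = []
with gated_span("step", "train.step", iteration=0) as span:
    ran.append(span)  # runs even though the 'step' group is off

with gated_span("job", "train.job") as span:
    ran.append(span["name"])

print(ran)  # [None, 'train.job']
```

Because the disabled branch yields None, callers guard attribute writes with a check such as `if span:`.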

Install

pip install nemo-lens           # API only — no-op at runtime, no SDK overhead
pip install 'nemo-lens[sdk]'    # adds SDK + OTLP exporters, required on exporting ranks

Quickstart

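import os

from nemo.lens import NemoLensConfig, setup_telemetry, managed_span

# Assumption: rank and world size come from torchrun-style environment variables.
rank = int(os.environ.get('RANK', '0'))
world_size = int(os.environ.get('WORLD_SIZE', '1'))
steps = 100  # placeholder step count

config = NemoLensConfig.from_env()
handle = setup_telemetry(config, rank=rank, world_size=world_size)

try:
    for i in range(steps):
        # 'step' is the span group, 'train.step' is the span name
        with managed_span('step', 'train.step', iteration=i) as span:
            loss = train_step()  # train_step is your own training function
            if span:  # span is None when the 'step' group is disabled
                span.set_attribute('loss', loss)
finally:
    handle.shutdown()  # always shut telemetry down, even if training raises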

Enable with environment variables:

NEMO_LENS_ENABLED=1
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
NEMO_LENS_SPAN_GROUPS=per_step   # includes the 'step' group used above (default={job,checkpoint,evaluate} omits it)

Three instrumentation primitives

| Primitive | Use when |
| --- | --- |
| managed_span(group, name, **attrs) | Context manager; group-gated, yields None when disabled |
| @trace_fn(group, name) | Decorator; same gating, no re-indentation |
| span_cm(name, tracer=...) | Always-on context manager; use for top-level spans |
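The "no re-indentation" point is easiest to see with a decorator applied to existing functions. The sketch below uses a hypothetical stand-in decorator, gated_trace, with a hypothetical ENABLED_GROUPS set and a CALLS list in place of real spans. It illustrates the gating pattern and is not the library's code.

```python
import functools

ENABLED_GROUPS = frozenset({"checkpoint"})  # hypothetical enabled groups
CALLS = []  # records which span names would have been opened


def gated_trace(group, name):
    """Illustrative stand-in for a group-gated tracing decorator."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if group not in ENABLED_GROUPS:
                return fn(*args, **kwargs)  # disabled: just call the function
            CALLS.append(name)  # a real implementation would open a span here
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@gated_trace("checkpoint", "checkpoint.save")
def save_checkpoint(step):
    return f"ckpt-{step}"


@gated_trace("step", "train.step")
def train_step(step):
    return step * 2


print(save_checkpoint(10), train_step(3), CALLS)
```

The function bodies stay at their original indentation, and only the call in the enabled group is recorded.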

Distributed training

By default, only one rank exports: the single_rank strategy, which selects the last rank. To change this, set NEMO_LENS_EXPORT_STRATEGY:

NEMO_LENS_EXPORT_STRATEGY=all_ranks            # every rank
NEMO_LENS_EXPORT_STRATEGY=sampled              # fraction via NEMO_LENS_EXPORT_SAMPLE_RATE
NEMO_LENS_EXPORT_STRATEGY=first_rank_per_node  # one rank per node (LOCAL_RANK=0)

Custom strategies (your own rank-selection logic) are supported via register_export_strategy — see docs/user-guide/custom-strategies.md.
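The rank-selection rules behind the built-in strategies can be sketched as a single predicate. This is an illustration under stated assumptions, not the library's code: should_export and its parameters are hypothetical, and the sampled branch in particular assumes a deterministic per-rank draw, which may differ from what NeMo Lens actually does.

```python
import random


def should_export(strategy, rank, world_size, local_rank=0, sample_rate=0.1, seed=0):
    """Return True if this rank would export under the named strategy (illustrative)."""
    if strategy == "single_rank":
        return rank == world_size - 1  # the last rank, matching the stated default
    if strategy == "all_ranks":
        return True
    if strategy == "first_rank_per_node":
        return local_rank == 0
    if strategy == "sampled":
        # Assumption: a seeded per-rank draw, so the decision is reproducible
        # for a given rank; the library's sampling scheme may differ.
        return random.Random(seed * 1_000_003 + rank).random() < sample_rate
    raise ValueError(f"unknown strategy: {strategy!r}")


world_size = 8
exporters = [r for r in range(world_size) if should_export("single_rank", r, world_size)]
print(exporters)  # [7]
```

A custom strategy registered through register_export_strategy would presumably encode a similar per-rank decision; see the custom-strategies guide for the actual interface.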

Local observability stack

docker compose -f docker-compose.otel.yml up -d
# Jaeger   → http://localhost:16686
# Grafana  → http://localhost:3000
# Kibana   → http://localhost:5601
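Once the stack is up, a quick reachability check of the UI ports listed above can confirm the containers are listening. A minimal sketch using only the standard library; port_open is a hypothetical helper, and the ports are the ones shown in the comments above.

```python
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


for name, port in {"Jaeger": 16686, "Grafana": 3000, "Kibana": 5601}.items():
    print(f"{name:8s} localhost:{port} reachable={port_open('localhost', port)}")
```

A successful TCP connection only shows that something is listening on the port; it does not verify that the service behind it is healthy.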

Development

git clone <repo-url> && cd lens
uv venv && uv pip install -e . --group dev
pre-commit install
pytest

Docs

To build and serve the full documentation locally, run cd docs && make serve (requires pip install --group docs -e . first).
