Shared OpenTelemetry library for NVIDIA NeMo ecosystem (Megatron-LM, NeMo-RL, NeMo-Gym)
Early development: This library is under active development. Expect breaking changes between releases.
Shared OpenTelemetry instrumentation library for the NVIDIA NeMo ecosystem (Megatron-LM, NeMo-RL, NeMo-Gym).
It provides unified tracing, metrics, and log bridging across distributed training jobs. Instrumentation is cheap when disabled: group-gated calls (managed_span, @trace_fn) cost a single frozenset lookup when their span group is off. In that case managed_span yields None (the body of the with block still runs) and @trace_fn calls the wrapped function directly. span_cm is always on and is not gated. Only opentelemetry-api, which is a no-op without an SDK, is required at import time; the full SDK loads only on exporting ranks.
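The gating behaviour described above can be illustrated with a small standalone sketch. This is not the nemo-lens implementation: gated_span, _ENABLED_GROUPS, and _RecordingSpan are hypothetical names, and the "span" only stores attributes. It shows one frozenset membership test deciding whether a span is created, with the with body running either way.

```python
from contextlib import contextmanager

# Hypothetical module-level state: the span groups that are switched on.
_ENABLED_GROUPS = frozenset({"job", "checkpoint", "evaluate"})


class _RecordingSpan:
    """Stand-in for a real span; it only stores attributes."""

    def __init__(self, name, attrs):
        self.name = name
        self.attributes = dict(attrs)

    def set_attribute(self, key, value):
        self.attributes[key] = value


@contextmanager
def gated_span(group, name, **attrs):
    """Yield a span if `group` is enabled, otherwise yield None."""
    if group not in _ENABLED_GROUPS:  # the single frozenset lookup
        yield None                    # the body of the `with` block still runs
        return
    # A real implementation would start the span here and end it on exit.
    yield _RecordingSpan(name, attrs)


results = []
with gated_span("step", "train.step", iteration=0) as span:
    results.append("ran")             # executes even though 'step' is off
    if span:
        span.set_attribute("loss", 0.5)
step_span = span

with gated_span("job", "train.job") as span:
    if span:
        span.set_attribute("status", "ok")
job_span = span
```

Because the disabled path yields None, callers guard attribute writes with if span:, as the Quickstart does.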
Install
pip install nemo-lens # API only — no-op at runtime, no SDK overhead
pip install 'nemo-lens[sdk]' # adds SDK + OTLP exporters, required on exporting ranks
Quickstart
from nemo.lens import NemoLensConfig, setup_telemetry, managed_span

# rank, world_size, steps, and train_step() come from your training script.
config = NemoLensConfig.from_env()
handle = setup_telemetry(config, rank=rank, world_size=world_size)
try:
    for i in range(steps):
        with managed_span('step', 'train.step', iteration=i) as span:
            loss = train_step()
            if span:  # span is None when the 'step' group is disabled
                span.set_attribute('loss', loss)
finally:
    handle.shutdown()
Enable with environment variables:
NEMO_LENS_ENABLED=1
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
NEMO_LENS_SPAN_GROUPS=per_step # includes the 'step' group used above (default={job,checkpoint,evaluate} omits it)
Three instrumentation primitives
| Primitive | Use when |
|---|---|
| managed_span(group, name, **attrs) | Context manager; group-gated, yields None when disabled |
| @trace_fn(group, name) | Decorator; same gating, no re-indentation |
| span_cm(name, tracer=...) | Always-on context manager; use for top-level spans |
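To make the decorator row concrete, here is a minimal sketch of a group-gated decorator in plain Python. It is an illustration under assumptions, not the library's @trace_fn: gated_trace, ENABLED_GROUPS, and span_log are hypothetical, and a list of start/end events stands in for real spans.

```python
import functools

ENABLED_GROUPS = frozenset({"checkpoint"})  # hypothetical enabled set
span_log = []  # records (name, event) pairs in place of real spans


def gated_trace(group, name):
    """Decorator factory: trace calls only when `group` is enabled."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if group not in ENABLED_GROUPS:
                return fn(*args, **kwargs)  # disabled: plain call, no span
            span_log.append((name, "start"))
            try:
                return fn(*args, **kwargs)
            finally:
                span_log.append((name, "end"))
        return wrapper
    return decorate


@gated_trace("checkpoint", "checkpoint.save")
def save_checkpoint(step):
    return f"ckpt-{step}"


@gated_trace("step", "train.step")
def train_step(x):
    return x * 2
```

Decorating the function keeps its body at the original indentation level, which is the practical difference from wrapping the body in a with block.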
Distributed training
By default only one rank exports (single_rank, last rank). Change with:
NEMO_LENS_EXPORT_STRATEGY=all_ranks # every rank
NEMO_LENS_EXPORT_STRATEGY=sampled # fraction via NEMO_LENS_EXPORT_SAMPLE_RATE
NEMO_LENS_EXPORT_STRATEGY=first_rank_per_node # one rank per node (LOCAL_RANK=0)
Custom strategies (your own rank-selection logic) are supported via register_export_strategy — see docs/user-guide/custom-strategies.md.
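The built-in strategies amount to a per-rank yes/no decision. The sketch below is a rough model of that decision, not the library's code: should_export and its parameters are hypothetical, and the sampled branch assumes a deterministic hash of the rank, which may differ from how nemo-lens actually samples.

```python
import zlib


def should_export(strategy, rank, world_size, local_rank=0, sample_rate=1.0):
    """Return True if this rank should export telemetry (illustrative only)."""
    if strategy == "single_rank":
        return rank == world_size - 1   # default: the last rank
    if strategy == "all_ranks":
        return True
    if strategy == "first_rank_per_node":
        return local_rank == 0          # one exporter per node
    if strategy == "sampled":
        # Assumption: pick ranks deterministically by hashing the rank id,
        # so the same ranks export on every run.
        bucket = zlib.crc32(str(rank).encode()) % 10_000
        return bucket < sample_rate * 10_000
    raise ValueError(f"unknown strategy: {strategy!r}")


# Example: 2 nodes with 4 ranks each (local_rank = rank % 4).
world = 8
single = [r for r in range(world) if should_export("single_rank", r, world)]
per_node = [
    r for r in range(world)
    if should_export("first_rank_per_node", r, world, local_rank=r % 4)
]
```

With this layout, single contains only rank 7 and per_node contains ranks 0 and 4.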
Local observability stack
docker compose -f docker-compose.otel.yml up -d
# Jaeger → http://localhost:16686
# Grafana → http://localhost:3000
# Kibana → http://localhost:5601
Development
git clone <repo-url> && cd lens
uv venv && uv pip install -e . --group dev
pre-commit install
pytest
Docs
Full documentation: cd docs && make serve (requires pip install --group docs -e .).