
Training job telemetry using OneLogger library.


One Logger Training Telemetry

One Logger Training Telemetry is built on top of one-logger-core, using it to collect telemetry data on training jobs. It includes:

  • Predefined spans, events, and attributes for a typical training job.
  • Easy integration with several training frameworks.

Concepts

Similar to one-logger-core, this library uses concepts inspired by OpenTelemetry (spans, events, and attributes). To illustrate these concepts, the table below shows an example structure (parent-child span relationships) for a typical training job with synchronous checkpoint saving (each cell of the table represents a span, which is a child of the span in the cell to its left). Note that this is just an example; the exact structure of spans is determined by the actual structure of the code, which is reflected in the timing and order in which the training job calls the telemetry callbacks (or context managers).

| Level 1 | Level 2 | Level 3 | Level 4 |
| --- | --- | --- | --- |
| APPLICATION span | DIST_INIT span | | |
| | DATA_LOADER_INIT span | | |
| | CHECKPOINT_LOAD span | | |
| | MODEL_INIT span | | |
| | OPTIMIZER_INIT span | | |
| | TRAINING_LOOP span | TRAINING_SINGLE_ITERATION span* | DATA_LOADING span |
| | | | MODEL_FORWARD span |
| | | | MODEL_BACKWARD span |
| | | | OPTIMIZER_UPDATE span |
| | | CHECKPOINT_SAVE_SYNC span* | |
| | VALIDATION_LOOP span* | VALIDATION_SINGLE_ITERATION span | |
| | TESTING_LOOP span | TESTING_SINGLE_ITERATION span | |

NOTE: "*" means that the parent span can have multiple instances of this span. For example, training loop has multiple spans of type TRAINING_SINGLE_ITERATION (one for each iteration of training).

That is, the application (represented by the APPLICATION span) includes a training loop and an optional testing loop (each represented as a child span of the application span). The training loop span, in turn, has several child spans, one per operation it performs. This structure allows us to reason about the relationships between operations and to attach metrics/attributes to each one.

When a training job is integrated with this library, the above spans are created automatically. Moreover, for some of the spans, a set of predefined attributes (e.g., metrics and timing data) is collected and reported to the telemetry backend:

| Span Name | Predefined Span Attributes |
| --- | --- |
| TRAINING_LOOP | TrainingLoopAttributes |
| CHECKPOINT_SAVE_SYNC | CheckpointSaveSpanAttributes |
| CHECKPOINT_SAVE_ASYNC | CheckpointSaveSpanAttributes |

Moreover, for each span, several events are triggered (again, automatically, when a training job integrates with the library). The table below shows the predefined events created for each span type and the predefined attributes collected and reported for each event:

| Span Name | Event Name | Event Attributes Class Name |
| --- | --- | --- |
| APPLICATION | SPAN_START | timestamp |
| | ONE_LOGGER_INITIALIZATION | OneLoggerInitializationAttributes |
| | SPAN_STOP | timestamp |
| TRAINING_LOOP | SPAN_START | timestamp |
| | TRAINING_METRICS_UPDATE* | TrainingMetricsUpdateAttributes |
| | SPAN_STOP | timestamp |
| TRAINING_SINGLE_ITERATION | None | Instead of collecting metrics on each iteration, we use the TRAINING_MULTI_ITERATION_METRICS_UPDATE event of the training loop to control the amount of data sent to the backends. |
| VALIDATION_LOOP | SPAN_START | timestamp |
| | VALIDATION_METRICS_UPDATE* | ValidationMetricsUpdateAttributes |
| | SPAN_STOP | timestamp |
| CHECKPOINT_SAVE_ASYNC or CHECKPOINT_SAVE_SYNC | SPAN_START | timestamp |
| | SAVE_CHECKPOINT_SUCCESS | SaveCheckpointSuccessEventAttributes |
| | SYNC_CHECKPOINT_METRICS_UPDATE | SyncCheckpointMetricsUpdateAttributes |
| | SPAN_STOP | timestamp |
| TESTING_LOOP | SPAN_START | timestamp |
| | TESTING_METRICS_UPDATE | TestingMetricsUpdateAttributes |
| | SPAN_STOP | timestamp |
> **_NOTE:_** "*" means that the event can be emitted multiple times during the lifetime of its span. For example, the training loop span emits multiple TRAINING_METRICS_UPDATE events (reported periodically rather than once per loop).
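The rationale for the TRAINING_SINGLE_ITERATION row (no per-iteration events) is throttling. A toy sketch of the idea (an illustration only, not the library's code; the event name and the `log_every_n_train_iterations` knob are taken from the tables and config shown in this document):

```python
# Toy throttling sketch: instead of one event per iteration, emit a single
# aggregated metrics-update event every N iterations.
LOG_EVERY_N = 10  # cf. the log_every_n_train_iterations config parameter
updates = []

for iteration in range(1, 36):
    # ... training step for this iteration ...
    if iteration % LOG_EVERY_N == 0:
        updates.append(("TRAINING_METRICS_UPDATE", iteration))

print(updates)  # 3 aggregated events instead of 35 per-iteration ones
```

This keeps the volume of data sent to the telemetry backends proportional to `1/N` of the iteration count rather than to the iteration count itself.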

Integration with One Logger Training Telemetry

There are two ways to integrate with this library:

  • If you are using a training framework that we support, you can simply use our glue code. For example, if you are using PyTorch Lightning, all you need to do is wrap your trainer class in our OneLoggerPTLTrainer class. In these cases, we use the callback mechanism of the underlying framework (Lightning in this case) to create spans and events and to collect attributes.

  • If you have a custom training job or are using a framework that we don't support yet, you can simply use our telemetry APIs (context managers or callbacks) to tell the library which parts of your code correspond to the major training stages (the training loop, checkpoint loading or saving code, etc.). With this, the library will automatically create several spans, keep track of relevant metrics, and export them for you.

  • For more advanced use cases, you can also call the core one logger API directly (e.g., using timed_span).

Below, we will go into details for each of the above approaches.

Integration using context managers

Use the context managers defined in src/one_logger/training_telemetry/api/context.py to demarcate your main function, training loop, validation loop, etc. See the example code at src/one_logger/training_telemetry/docs/example.py. Below is a simplified version of that code:

```python
# Initialize the telemetry provider with a default configuration.
base_config = OneLoggerConfig(
    application_name="test_app",
    session_tag_or_fn="test_session",
    world_size_or_fn=5,
)

training_config = TrainingTelemetryConfig(
    world_size_or_fn=5,
    is_log_throughput_enabled_or_fn=True,
    flops_per_sample_or_fn=100,
    global_batch_size_or_fn=32,
    log_every_n_train_iterations=10,
    perf_tag_or_fn="test_perf",
)

# Configure the telemetry library before starting the main() function.
(TrainingTelemetryProvider.instance()
    .with_base_config(base_config)
    .with_exporter(FileExporter(file_path=Path("training_telemetry.json")))
    .configure_provider())

@application()  # <---- telemetry context manager
def main() -> None:
    # Set the training telemetry config after on_app_start has been called.
    TrainingTelemetryProvider.instance().set_training_telemetry_config(training_config)

    ...

    with training_loop(train_iterations_start=0):  # <---- telemetry context manager
        for epoch in range(num_epochs):
            for batch_idx, (inputs, targets) in enumerate(dataloader):
                with training_iteration():  # <---- telemetry context manager
                    ...

main()
```

Integration using callbacks

You can get training telemetry data by calling the callbacks defined in src/one_logger/training_telemetry/api/callbacks.py.

Here is a simplified example:

```python
def main() -> None:
    # Configure the telemetry library.
    (TrainingTelemetryProvider.instance()
        .with_base_config(base_config)
        .with_exporter(FileExporter(file_path=Path("training_telemetry.json")))
        .configure_provider())

    on_app_start()  # <---- callback

    # Set the training telemetry config after on_app_start is called.
    TrainingTelemetryProvider.instance().set_training_telemetry_config(training_config)

    ...

    on_train_start(train_iterations_start=0)  # <---- callback
    for epoch in range(num_epochs):
        for batch_idx, (inputs, targets) in enumerate(dataloader):
            on_training_single_iteration_start()  # <---- callback
            ...
            on_training_single_iteration_end()  # <---- callback

    on_app_end()  # <---- callback
```

Note that you can combine this approach with calling the core one logger API if you need to. See "Integration using one logger core API" for more info.

Integration using one logger core API

The one logger training telemetry library is built on top of the core one logger library, so you have full access to the core API. Specifically:

  • You can use the Span object created by the training context managers and then add attributes or create events on it.
  • You can use the timed_span API to create your own spans.

For example:
```python
def main() -> None:
    # Configure the telemetry library.
    (TrainingTelemetryProvider.instance()
        .with_base_config(base_config)
        .with_exporter(FileExporter(file_path=Path("training_telemetry.json")))
        # If you are creating custom spans, make sure you set export_customization_mode
        # and span_name_filter such that those spans are exported.
        .with_export_customization(export_customization_mode=ExportCustomizationMode.xxxx,
                                   span_name_filter=[...])
        .configure_provider())

    with application() as app_span:  # <---- start the application span
        # Set the training telemetry config after on_app_start is called.
        TrainingTelemetryProvider.instance().set_training_telemetry_config(training_config)

        with training_loop(train_iterations_start=0) as training_span:  # <---- access the span created by the context manager
            ...
            training_span.add_attribute("my_custom_attribute", "my_custom_value")  # <---- add a custom attribute

        with timed_span("my_custom_span", span_attributes=Attributes({"my_custom_attribute": "my_custom_value"})):
            # This code block is considered "my_custom_span".
            ...
            TrainingTelemetryProvider.instance().recorder.event(Event.create(...))  # <---- fire a custom event
```
                            

Comparison

The table below helps you choose the best integration approach based on your requirements.

| Using a supported training framework? | Need to define custom spans/events/attributes?* | Recommendation |
| --- | --- | --- |
| Y | N | Use framework-level integration. |
| Y | Y | Use framework-level integration along with context managers for custom spans/events. |
| N | N | Use callbacks or context managers. The former have the advantage of separating the telemetry code from the training/model code (i.e., all telemetry code is encapsulated in callback functions). |
| N | Y | Use context managers: callbacks exist only for predefined spans/events, and the context managers (timed_span) are more readable and less error-prone than calling the recorder API directly. |

  • \* Custom spans/events/attributes are spans and events that are not predefined in the library (see the StandardTrainingJobSpanName and StandardTrainingJobEventName enums and the attributes defined in attributes.py for a list of the predefined ones). You may need to define custom spans/events/attributes if you want to collect telemetry data beyond what the predefined ones collect.
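The claim that context managers are less error-prone than calling the recorder API directly can be seen in a toy sketch (pure Python, not the library's recorder): a context manager guarantees the stop event even when the span body raises, whereas manually paired start/stop recorder calls are easy to get wrong on error paths.

```python
import time
from contextlib import contextmanager

RECORDS = []  # (span_name, event_name, timestamp) tuples

@contextmanager
def timed_span(name):
    """Toy timed span: SPAN_STOP is recorded even if the body raises."""
    RECORDS.append((name, "SPAN_START", time.time()))
    try:
        yield
    finally:
        RECORDS.append((name, "SPAN_STOP", time.time()))

try:
    with timed_span("my_custom_span"):
        raise RuntimeError("simulated failure inside the span")
except RuntimeError:
    pass

print([(n, e) for n, e, _ in RECORDS])
# [('my_custom_span', 'SPAN_START'), ('my_custom_span', 'SPAN_STOP')]
```

With manual recorder calls, the stop call after the raising statement would simply never run, leaving an unterminated span in the exported data.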

Configuration

Lazy initialization of training telemetry config

TrainingTelemetryConfig encapsulates both the configuration knobs for telemetry and some data about the training job that is needed for telemetry. The user is expected to provide the config at job startup. Since some training job properties may not be known at job start time (e.g., the batch size may be determined only after initializing the data reader), a subset of the config parameters can be provided after startup using set_training_telemetry_config:

```python
# at the start of the job
TrainingTelemetryProvider.instance().with_base_config(base_config).with_exporter(exporter).configure_provider()

# ...

# later, once the values are known (e.g., after initializing the data reader)
training_config = TrainingTelemetryConfig(
    perf_tag_or_fn="new_perf_tag",
    world_size_or_fn=8,
    global_batch_size_or_fn=64,
)
TrainingTelemetryProvider.instance().set_training_telemetry_config(training_config)
# start of the training loop
```
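The `_or_fn` suffix on these fields suggests each accepts either a concrete value or a zero-argument callable that is resolved lazily once the value is known. A minimal sketch of that pattern (an assumption about the design, not the library's actual implementation):

```python
# Sketch of the "_or_fn" pattern (illustrative assumption, not the library's code):
# a field holds either a concrete value or a zero-argument callable resolved lazily.
def resolve(value_or_fn):
    return value_or_fn() if callable(value_or_fn) else value_or_fn

state = {}  # filled in as the job learns its runtime properties
config = {
    "world_size_or_fn": 8,                                   # known at startup
    "global_batch_size_or_fn": lambda: state["batch_size"],  # known only later
}

state["batch_size"] = 64  # e.g., determined after initializing the data reader
resolved = {key: resolve(value) for key, value in config.items()}
print(resolved)  # {'world_size_or_fn': 8, 'global_batch_size_or_fn': 64}
```

This is why late-known properties such as the batch size can be supplied as callables at startup and still end up with correct values by the time telemetry is reported.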

perf_tag and session_tag

One Logger is meant to add instrumentation to applications to track their performance. One important usage of One Logger is to identify significant, unexpected performance changes. That is, we run an application once and collect baseline performance data; then, every time we run the application again, we can compare its performance data against the baseline. These comparisons are only useful if we can differentiate between cases where changes in performance are expected and those where they are not.

Another use case is to measure the impact of a change in code, job configuration, or execution environment on the performance of an application.

To make this easy, one logger supports tagging each run with extra metadata that allows meaningful comparisons of runs. The perf_tag and session_tag parameters exist for this reason. The user of the library can set the appropriate values as part of configuring one logger; those values are exported alongside the telemetry data to a telemetry backend. When interpreting, analyzing, or aggregating telemetry data, these tags provide extra context about the job, making it possible to

  • flag anomalies in the performance (and unexpected performance degradation).
  • track progress of the application even if the progress is made by different jobs across different machines or clusters.
  • track the performance of the application over time and correlate performance changes with code changes.
  • and more.

Below, we will explain the semantics of perf_tag and session_tag and the relationship between them. We expect the user to ensure these values are set correctly for each execution of their job.

perf_tag: used to identify jobs whose performance is expected to be comparable. This means jobs with the same perf_tag must be performing similar tasks and are using the same code, config, and resources (or only differ in ways that are not expected to impact the performance).

session_tag: used to identify jobs that together contribute to the same task. This means the jobs are "logically" part of a single larger job (e.g., a hypothetical long running job that is split into multiple jobs due to resource constraints or resuming after a failure).

Let's use a few examples to illustrate the usage of these knobs.

Imagine we have a model training application. A user downloads a snapshot of the code of this application (say a git branch at a certain commit) and runs the application on some hardware with some configuration (number of GPUs, batch size, etc). Let's assume the user needs to run 1000 iterations of training to complete the task (train a model with acceptable accuracy). Now let's go through a few scenarios:

Scenario 1: The user runs the job. It completes without a problem and fully trains the model. A week later, the user changes the model architecture significantly and then runs the job again. Due to the fundamental change in the job code, the two runs are not expected to have similar performance characteristics. In this case, the user should assign different values to "session_tag" across the two jobs because the two runs are independent of each other (they are independent training sessions, each training the model from scratch to completion). Moreover, the user must assign different values to "perf_tag" because the two runs are not expected to have similar performance characteristics due to the changes in model code.

Scenario 2: Similar to scenario 1, except that for the second execution of the job, instead of changing the model architecture, the user allocates more resources to the job (the code remains the same). In this case, the user should assign different values to "session_tag" across the two jobs because the two runs are independent of each other (they are independent training sessions, each training the model from scratch to completion). Moreover, the user must assign different values to "perf_tag" because the two runs are not expected to have similar performance characteristics due to the change in resources.

Scenario 3: The user runs the job to completion. The next day, there is an OS upgrade performed on the cluster to apply a security patch, which in theory should not impact the performance of the jobs on that cluster. The user runs the job again without changing the code or config. In this scenario: since the two runs are independently training the model from scratch to completion (in other words, are not part of the same training session), each run should get a different value for session_tag. However, since the two executions used the same code, config, and execution environment, they must have the same perf_tag.

Scenario 4: The user runs the job but it fails at iteration 100 (e.g., due to a hardware issue, a scheduling constraint causing the job to be evicted, or a small bug in the model code). The user fixes the issue and runs the job again. In this scenario, the fix is not expected to significantly change the performance characteristics of the job. Since the user is using training checkpoints, the second run resumes training from iteration 90, when the last checkpoint was saved. Once the second run completes, we have a fully trained model. In this scenario, the two runs are logically part of the same task (the same training session) and are expected to have the same performance characteristics, as we didn't change the code, hardware, or configs in any way that is expected to impact performance. So the two runs should have the same perf_tag and session_tag values.

Scenario 5: The user runs the job but it fails at iteration 100 due to an issue in the model code. To fix the issue, the user makes a change to the model code that, in addition to fixing the bug, significantly speeds up training (e.g., an unnecessary loop is removed). The user runs the job again and, due to checkpointing, the second run starts from iteration 90, when the last checkpoint was saved. In this scenario, the two runs are logically part of the same task (training the model from scratch to completion) but are not expected to have the same performance characteristics due to the above-mentioned change in code. In this case, the two runs should have the same session_tag but different values for perf_tag.

In summary, when configuring one logger for a particular application,

  • Change the value of perf_tag whenever a change is made that is expected to alter the performance characteristics of the application (changes in code, config/resources, or execution environment).

  • Use a unique value for session_tag for each single logical execution of your application (if a single logical execution is spread across multiple physical jobs due to interruptions in execution, all those jobs must have the same session_tag).
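One way to keep these rules straight is to derive perf_tag mechanically from the factors that are expected to affect performance, so that runs differing only in performance-neutral ways share a tag. This is a hypothetical helper, not part of the library:

```python
import hashlib

def perf_tag(code_version: str, config: dict, resources: dict) -> str:
    """Hypothetical helper: identical performance-relevant inputs -> identical tag."""
    key = repr((code_version, sorted(config.items()), sorted(resources.items())))
    return hashlib.sha256(key.encode()).hexdigest()[:12]

run_a = perf_tag("commit-abc", {"batch_size": 32}, {"gpus": 8})
run_b = perf_tag("commit-abc", {"batch_size": 32}, {"gpus": 8})   # OS patch only (scenario 3)
run_c = perf_tag("commit-abc", {"batch_size": 32}, {"gpus": 16})  # more resources (scenario 2)

print(run_a == run_b, run_a == run_c)  # True False
```

The scenarios above map onto this directly: runs a and b share a tag because nothing performance-relevant changed, while run c gets a new tag because its resources differ. session_tag, by contrast, would be chosen per logical training session rather than derived from these inputs.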
