
Training job telemetry using OneLogger library.


One Logger Training Telemetry

One Logger Training Telemetry is built on top of one-logger-core, using it to collect telemetry data on training jobs. It includes:

  • Predefined spans, events, and attributes for a typical training job.
  • Easy integration with several training frameworks.

Concepts

Similar to one-logger-core, this library uses concepts inspired by OpenTelemetry (spans, events, and attributes). To illustrate these concepts, the table below shows an example structure (parent-child span relationships) for a typical training job with synchronous checkpoint saving (each cell of the table represents a span, which is a child of the span in the cell to its left). Note that this is just an example; the exact structure of spans is determined by the actual structure of the code, which is reflected in the timing and order in which the training job calls the telemetry callbacks (or context managers).

| Level 1 | Level 2 | Level 3 | Level 4 |
| --- | --- | --- | --- |
| APPLICATION span | DIST_INIT span | | |
| | DATA_LOADER_INIT span | | |
| | CHECKPOINT_LOAD span | | |
| | MODEL_INIT span | | |
| | OPTIMIZER_INIT span | | |
| | TRAINING_LOOP span | TRAINING_SINGLE_ITERATION span* | DATA_LOADING span |
| | | | MODEL_FORWARD span |
| | | | MODEL_BACKWARD span |
| | | | OPTIMIZER_UPDATE span |
| | | CHECKPOINT_SAVE_SYNC span* | |
| | VALIDATION_LOOP span* | VALIDATION_SINGLE_ITERATION span | |
| | TESTING_LOOP span | TESTING_SINGLE_ITERATION span | |

NOTE: "*" means that the parent span can have multiple instances of this span. For example, training loop has multiple spans of type TRAINING_SINGLE_ITERATION (one for each iteration of training).

That is, the application (represented by the APPLICATION span) includes a training loop and an optional testing loop (each represented as a child span of the application span). The training loop span, in turn, has several child spans, one per operation it performs. This structure allows us to reason about the relationships between operations and to attach metrics/attributes to each one.

When a training job is integrated with this library, the above spans are created automatically. Moreover, for some of the spans, a set of predefined attributes (e.g., metrics and timing data) is collected and reported to the telemetry backend:

| Span Name | Predefined Span Attributes |
| --- | --- |
| TRAINING_LOOP | TrainingLoopAttributes |
| CHECKPOINT_SAVE_SYNC | CheckpointSaveSpanAttributes |
| CHECKPOINT_SAVE_ASYNC | CheckpointSaveSpanAttributes |

Moreover, for each span, several events are triggered (again, automatically, when a training job integrates with the library). The table below shows the predefined events created for each span type and the predefined attributes collected and reported for each event:

| Span Name | Event Name | Event Attributes Class Name |
| --- | --- | --- |
| APPLICATION | SPAN_START | timestamp |
| | ONE_LOGGER_INITIALIZATION | OneLoggerInitializationAttributes |
| | SPAN_STOP | timestamp |
| TRAINING_LOOP | SPAN_START | timestamp |
| | TRAINING_METRICS_UPDATE* | TrainingMetricsUpdateAttributes |
| | SPAN_STOP | timestamp |
| TRAINING_SINGLE_ITERATION | None | Instead of collecting metrics on each iteration, we use the TRAINING_MULTI_ITERATION_METRICS_UPDATE event of the training loop to control the amount of data sent to the backends. |
| VALIDATION_LOOP | SPAN_START | timestamp |
| | VALIDATION_METRICS_UPDATE* | ValidationMetricsUpdateAttributes |
| | SPAN_STOP | timestamp |
| CHECKPOINT_SAVE_ASYNC or CHECKPOINT_SAVE_SYNC | SPAN_START | timestamp |
| | SAVE_CHECKPOINT_SUCCESS | SaveCheckpointSuccessEventAttributes |
| | SYNC_CHECKPOINT_METRICS_UPDATE | SyncCheckpointMetricsUpdateAttributes |
| | SPAN_STOP | timestamp |
| TESTING_LOOP | SPAN_START | timestamp |
| | TESTING_METRICS_UPDATE | TestingMetricsUpdateAttributes |
| | SPAN_STOP | timestamp |
> **_NOTE:_** "*" means that the event can be emitted multiple times during the lifetime of its span. For example, the training loop span emits multiple TRAINING_METRICS_UPDATE events (reported periodically rather than once per loop).
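The rationale for the TRAINING_SINGLE_ITERATION row (no per-iteration events) is throttling. A toy sketch of the idea (an illustration only, not the library's code; the event name and the `log_every_n_train_iterations` knob are taken from the tables and config shown in this document):

```python
# Toy throttling sketch: instead of one event per iteration, emit a single
# aggregated metrics-update event every N iterations.
LOG_EVERY_N = 10  # cf. the log_every_n_train_iterations config parameter
updates = []

for iteration in range(1, 36):
    # ... training step for this iteration ...
    if iteration % LOG_EVERY_N == 0:
        updates.append(("TRAINING_METRICS_UPDATE", iteration))

print(updates)  # 3 aggregated events instead of 35 per-iteration ones
```

This keeps the volume of data sent to the telemetry backends proportional to `1/N` of the iteration count rather than to the iteration count itself.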

Integration with One Logger Training Telemetry

There are two ways to integrate with this library:

  • If you are using a training framework that we support, you can simply use our glue code. For example, if you are using PyTorch Lightning, all you need to do is wrap your trainer class in our OneLoggerPTLTrainer class. In these cases, we use the callback mechanism of the underlying framework (Lightning in this case) to create spans and events and to collect attributes.

  • If you have a custom training job or are using a framework that we don't support yet, you can simply use our telemetry APIs (context managers or callbacks) to tell the library which parts of your code correspond to the major training stages (the training loop, checkpoint loading or saving code, etc.). With this, the library will automatically create several spans, keep track of relevant metrics, and export them for you.

  • For more advanced use cases, you can also call the core one logger API directly (e.g., using timed_span).

Below, we will go into details for each of the above approaches.

Integration using context managers

Use the context managers defined in src/one_logger/training_telemetry/api/context.py to demarcate your main function, training loop, validation loop, etc. See the example code at src/one_logger/training_telemetry/docs/example.py. Below is a simplified version of that code:

```python
# Initialize the telemetry provider with a default configuration.
base_config = OneLoggerConfig(
    application_name="test_app",
    session_tag_or_fn="test_session",
    world_size_or_fn=5,
)

training_config = TrainingTelemetryConfig(
    world_size_or_fn=5,
    is_log_throughput_enabled_or_fn=True,
    flops_per_sample_or_fn=100,
    global_batch_size_or_fn=32,
    log_every_n_train_iterations=10,
    perf_tag_or_fn="test_perf",
)

# Configure the telemetry library before starting the main() function.
(TrainingTelemetryProvider.instance()
    .with_base_config(base_config)
    .with_exporter(FileExporter(file_path=Path("training_telemetry.json")))
    .configure_provider())

@application()  # <---- telemetry context manager
def main() -> None:
    # Set the training telemetry config after on_app_start has been called.
    TrainingTelemetryProvider.instance().set_training_telemetry_config(training_config)

    ...

    with training_loop(train_iterations_start=0):  # <---- telemetry context manager
        for epoch in range(num_epochs):
            for batch_idx, (inputs, targets) in enumerate(dataloader):
                with training_iteration():  # <---- telemetry context manager
                    ...

main()
```

Integration using callbacks

You can get training telemetry data by calling the callbacks defined in src/one_logger/training_telemetry/api/callbacks.py.

Here is a simplified example:

```python
def main() -> None:
    # Configure the telemetry library.
    (TrainingTelemetryProvider.instance()
        .with_base_config(base_config)
        .with_exporter(FileExporter(file_path=Path("training_telemetry.json")))
        .configure_provider())

    on_app_start()  # <---- callback

    # Set the training telemetry config after on_app_start is called.
    TrainingTelemetryProvider.instance().set_training_telemetry_config(training_config)

    ...

    on_train_start(train_iterations_start=0)  # <---- callback
    for epoch in range(num_epochs):
        for batch_idx, (inputs, targets) in enumerate(dataloader):
            on_training_single_iteration_start()  # <---- callback
            ...
            on_training_single_iteration_end()  # <---- callback

    on_app_end()  # <---- callback
```

Note that you can combine this approach with calling the core one logger API if you need to. See "Integration using one logger core API" for more info.

Integration using one logger core API

The one logger training telemetry library is built on top of the core one logger library, so you have full access to the core API. Specifically:

  • You can use the Span object created by the training context managers and then add attributes or create events on it.
  • You can use the timed_span API to create your own spans.

For example:
```python
def main() -> None:
    # Configure the telemetry library.
    (TrainingTelemetryProvider.instance()
        .with_base_config(base_config)
        .with_exporter(FileExporter(file_path=Path("training_telemetry.json")))
        # If you are creating custom spans, make sure you set export_customization_mode
        # and span_name_filter such that those spans are exported.
        .with_export_customization(export_customization_mode=ExportCustomizationMode.xxxx,
                                   span_name_filter=[...])
        .configure_provider())

    with application() as app_span:  # <---- start the application span
        # Set the training telemetry config after on_app_start is called.
        TrainingTelemetryProvider.instance().set_training_telemetry_config(training_config)

        with training_loop(train_iterations_start=0) as training_span:  # <---- access the span created by the context manager
            ...
            training_span.add_attribute("my_custom_attribute", "my_custom_value")  # <---- add a custom attribute

        with timed_span("my_custom_span", span_attributes=Attributes({"my_custom_attribute": "my_custom_value"})):
            # This code block is considered "my_custom_span".
            ...
            TrainingTelemetryProvider.instance().recorder.event(Event.create(...))  # <---- fire a custom event
```
                            

Comparison

The table below helps you choose the best integration approach based on your requirements.

| Using a supported training framework? | Need to define custom spans/events/attributes?* | Recommendation |
| --- | --- | --- |
| Y | N | Use framework-level integration. |
| Y | Y | Use framework-level integration along with context managers for custom spans/events. |
| N | N | Use callbacks or context managers. The former have the advantage of separating the telemetry code from the training/model code (i.e., all telemetry code is encapsulated in callback functions). |
| N | Y | Use context managers: callbacks exist only for predefined spans/events, and the context managers (timed_span) are more readable and less error-prone than calling the recorder API directly. |

  • \* Custom spans/events/attributes are spans and events that are not predefined in the library (see the StandardTrainingJobSpanName and StandardTrainingJobEventName enums and the attributes defined in attributes.py for a list of the predefined ones). You may need to define custom spans/events/attributes if you want to collect telemetry data beyond what the predefined ones collect.
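The claim that context managers are less error-prone than calling the recorder API directly can be seen in a toy sketch (pure Python, not the library's recorder): a context manager guarantees the stop event even when the span body raises, whereas manually paired start/stop recorder calls are easy to get wrong on error paths.

```python
import time
from contextlib import contextmanager

RECORDS = []  # (span_name, event_name, timestamp) tuples

@contextmanager
def timed_span(name):
    """Toy timed span: SPAN_STOP is recorded even if the body raises."""
    RECORDS.append((name, "SPAN_START", time.time()))
    try:
        yield
    finally:
        RECORDS.append((name, "SPAN_STOP", time.time()))

try:
    with timed_span("my_custom_span"):
        raise RuntimeError("simulated failure inside the span")
except RuntimeError:
    pass

print([(n, e) for n, e, _ in RECORDS])
# [('my_custom_span', 'SPAN_START'), ('my_custom_span', 'SPAN_STOP')]
```

With manual recorder calls, the stop call after the raising statement would simply never run, leaving an unterminated span in the exported data.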

Configuration

Lazy initialization of training telemetry config

TrainingTelemetryConfig encapsulates both the configuration knobs for telemetry and some data about the training job that is needed for telemetry. The user is expected to provide the config at job startup. Since some training job properties may not be known at job start time (e.g., the batch size may be determined only after initializing the data reader), a subset of the config parameters can be provided after startup using set_training_telemetry_config:

```python
# at the start of the job
TrainingTelemetryProvider.instance().with_base_config(base_config).with_exporter(exporter).configure_provider()

# ...

# later, once the values are known (e.g., after initializing the data reader)
training_config = TrainingTelemetryConfig(
    perf_tag_or_fn="new_perf_tag",
    world_size_or_fn=8,
    global_batch_size_or_fn=64,
)
TrainingTelemetryProvider.instance().set_training_telemetry_config(training_config)
# start of the training loop
```
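The `_or_fn` suffix on these fields suggests each accepts either a concrete value or a zero-argument callable that is resolved lazily once the value is known. A minimal sketch of that pattern (an assumption about the design, not the library's actual implementation):

```python
# Sketch of the "_or_fn" pattern (illustrative assumption, not the library's code):
# a field holds either a concrete value or a zero-argument callable resolved lazily.
def resolve(value_or_fn):
    return value_or_fn() if callable(value_or_fn) else value_or_fn

state = {}  # filled in as the job learns its runtime properties
config = {
    "world_size_or_fn": 8,                                   # known at startup
    "global_batch_size_or_fn": lambda: state["batch_size"],  # known only later
}

state["batch_size"] = 64  # e.g., determined after initializing the data reader
resolved = {key: resolve(value) for key, value in config.items()}
print(resolved)  # {'world_size_or_fn': 8, 'global_batch_size_or_fn': 64}
```

This is why late-known properties such as the batch size can be supplied as callables at startup and still end up with correct values by the time telemetry is reported.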

perf_tag and session_tag

One Logger is meant to add instrumentation to applications to track their performance. One important usage of One Logger is to identify significant, unexpected performance changes. That is, we run an application once and collect baseline performance data; then, every time we run the application again, we can compare its performance data against the baseline. These comparisons are only useful if we can differentiate between cases where changes in performance are expected and those where they are not.

Another use case is to measure the impact of a change in code, job configuration, or execution environment on the performance of an application.

To make this easy, one logger supports tagging each run with extra metadata that allows meaningful comparisons of runs. The perf_tag and session_tag parameters exist for this reason. The user of the library can set the appropriate values as part of configuring one logger; those values are exported alongside the telemetry data to a telemetry backend. When interpreting, analyzing, or aggregating telemetry data, these tags provide extra context about the job, making it possible to

  • flag anomalies in the performance (and unexpected performance degradation).
  • track progress of the application even if the progress is made by different jobs across different machines or clusters.
  • track the performance of the application over time and correlate performance changes with code changes.
  • and more.

Below, we will explain the semantics of perf_tag and session_tag and the relationship between them. We expect the user to ensure these values are set correctly for each execution of their job.

perf_tag: used to identify jobs whose performance is expected to be comparable. This means jobs with the same perf_tag must be performing similar tasks and are using the same code, config, and resources (or only differ in ways that are not expected to impact the performance).

session_tag: used to identify jobs that together contribute to the same task. This means the jobs are "logically" part of a single larger job (e.g., a hypothetical long running job that is split into multiple jobs due to resource constraints or resuming after a failure).

Let's use a few examples to illustrate the usage of these knobs.

Imagine we have a model training application. A user downloads a snapshot of the code of this application (say a git branch at a certain commit) and runs the application on some hardware with some configuration (number of GPUs, batch size, etc). Let's assume the user needs to run 1000 iterations of training to complete the task (train a model with acceptable accuracy). Now let's go through a few scenarios:

Scenario 1: The user runs the job. It completes without a problem and fully trains the model. A week later, the user changes the model architecture significantly and then runs the job again. Due to the fundamental change in the job code, the two runs are not expected to have similar performance characteristics. In this case, the user should assign different values to "session_tag" across the two jobs because the two runs are independent of each other (they are independent training sessions, each training the model from scratch to completion). Moreover, the user must assign different values to "perf_tag" because the two runs are not expected to have similar performance characteristics due to the changes in model code.

Scenario 2: Similar to scenario 1, except that for the second execution of the job, instead of changing the model architecture, the user allocates more resources to the job (the code remains the same). In this case, the user should assign different values to "session_tag" across the two jobs because the two runs are independent of each other (they are independent training sessions, each training the model from scratch to completion). Moreover, the user must assign different values to "perf_tag" because the two runs are not expected to have similar performance characteristics due to the change in resources.

Scenario 3: The user runs the job to completion. The next day, there is an OS upgrade performed on the cluster to apply a security patch, which in theory should not impact the performance of the jobs on that cluster. The user runs the job again without changing the code or config. In this scenario: since the two runs are independently training the model from scratch to completion (in other words, are not part of the same training session), each run should get a different value for session_tag. However, since the two executions used the same code, config, and execution environment, they must have the same perf_tag.

Scenario 4: The user runs the job but it fails at iteration 100 (e.g., due to a hardware issue, a scheduling constraint causing the job to be evicted, or a small bug in the model code). The user fixes the issue and runs the job again. In this scenario, the fix is not expected to significantly change the performance characteristics of the job. Since the user is using training checkpoints, the second run resumes training from iteration 90, when the last checkpoint was saved. Once the second run completes, we have a fully trained model. In this scenario, the two runs are logically part of the same task (the same training session) and are expected to have the same performance characteristics, as we didn't change the code, hardware, or configs in any way that is expected to impact performance. So the two runs should have the same perf_tag and session_tag values.

Scenario 5: The user runs the job but it fails at iteration 100 due to an issue in the model code. To fix the issue, the user makes a change to the model code that, in addition to fixing the bug, significantly speeds up training (e.g., an unnecessary loop is removed). The user runs the job again and, due to checkpointing, the second run starts from iteration 90, when the last checkpoint was saved. In this scenario, the two runs are logically part of the same task (training the model from scratch to completion) but are not expected to have the same performance characteristics due to the above-mentioned change in code. In this case, the two runs should have the same session_tag but different values for perf_tag.

In summary, when configuring one logger for a particular application,

  • Change the value of perf_tag whenever a change is made that is expected to alter the performance characteristics of the application (changes in code, config/resources, or execution environment).

  • Use a unique value for session_tag for each single logical execution of your application (if a single logical execution is spread across multiple physical jobs due to interruptions in execution, all those jobs must have the same session_tag).
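One way to keep these rules straight is to derive perf_tag mechanically from the factors that are expected to affect performance, so that runs differing only in performance-neutral ways share a tag. This is a hypothetical helper, not part of the library:

```python
import hashlib

def perf_tag(code_version: str, config: dict, resources: dict) -> str:
    """Hypothetical helper: identical performance-relevant inputs -> identical tag."""
    key = repr((code_version, sorted(config.items()), sorted(resources.items())))
    return hashlib.sha256(key.encode()).hexdigest()[:12]

run_a = perf_tag("commit-abc", {"batch_size": 32}, {"gpus": 8})
run_b = perf_tag("commit-abc", {"batch_size": 32}, {"gpus": 8})   # OS patch only (scenario 3)
run_c = perf_tag("commit-abc", {"batch_size": 32}, {"gpus": 16})  # more resources (scenario 2)

print(run_a == run_b, run_a == run_c)  # True False
```

The scenarios above map onto this directly: runs a and b share a tag because nothing performance-relevant changed, while run c gets a new tag because its resources differ. session_tag, by contrast, would be chosen per logical training session rather than derived from these inputs.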
