Utilities for monitoring training of large foundation models

Training Telemetry

A Python library that records events, metrics, and errors during model training in standardized formats:

structured key=value logs
JSON for text files
OpenTelemetry (OTEL) traces and logs
NVTX code markers for NVIDIA Nsight Systems

Overview

This library provides a standard format for logging events, metrics and errors that existing frameworks and applications for training large AI models can adopt. As a result, the runtime performance and errors of training jobs can be monitored in a consistent manner without impacting training performance. Time spans show in detail how each training process spends its time during startup, training and checkpoint saving. After an application fails, its errors can be analyzed and correlated with infrastructure events to give users more actionable information.

This library is lightweight and intentionally has very few dependencies, which eases integration with training frameworks that typically carry a long list of dependencies of their own. The API is provided at two levels:

A context-based API, where monitoring can be done via context managers or function decorators
A low-level recorder API with start/stop/event/error functions for callback implementations and other low-level requirements
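
The relationship between the two levels can be illustrated with a minimal sketch. This is not the library's implementation: `SketchRecorder` and `timed_span` are hypothetical names invented here, and a span is modeled as a plain dict. The point is only that a context manager can be a thin wrapper around explicit start/stop calls.

```python
import time
from contextlib import contextmanager


class SketchRecorder:
    """Hypothetical low-level recorder with explicit start/stop calls."""

    def __init__(self):
        self.finished = []  # completed spans, stored as plain dicts

    def start(self, name):
        # A span is modeled as a dict with a name, a start time and metrics.
        return {"name": name, "start": time.perf_counter(), "metrics": {}}

    def stop(self, span):
        # Record the elapsed time and keep the finished span.
        span["duration_s"] = time.perf_counter() - span["start"]
        self.finished.append(span)


@contextmanager
def timed_span(recorder, name):
    """Hypothetical context-level wrapper over the recorder above."""
    span = recorder.start(name)
    try:
        yield span
    finally:
        # Runs on normal exit and on exceptions, so the span always closes.
        recorder.stop(span)


recorder = SketchRecorder()
with timed_span(recorder, "training_iteration") as span:
    span["metrics"]["loss"] = 0.25  # attach a metric while the span is open
```

Callback-style code that cannot wrap its work in a `with` block would call `start` and `stop` directly, which is the use case the low-level API is described as serving.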

The following events are currently supported:

Application runtime and application-specific metrics
Training loop progress and timing
Individual iteration metrics, including loss, accuracy, TFLOPS, consumed samples, forward and backward times
Checkpoint saves, including global and local checkpoints, async and sync checkpoint strategies
Errors and exceptions
Model validation and testing
Custom metrics and events

Events are logged by one or more of the following backends:

A Python logger backend, logging events as messages using a logger at INFO level with structured log format
A file logger backend, where each event is logged as a one-line JSON object
An OpenTelemetry backend, where each event is converted to a span and sent to the OTEL collector
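
To make the first two backends more concrete, here is a rough sketch of how one event could be rendered as a structured key=value message and as a one-line JSON object. The field names, the `telemetry_sketch` logger name and the exact layout are assumptions for illustration; the library's actual output format may differ.

```python
import json
import logging


def to_key_value(event):
    """Render an event dict as a flat key=value line."""
    return " ".join(f"{key}={value}" for key, value in event.items())


def to_json_line(event):
    """Render an event dict as a single-line JSON object."""
    return json.dumps(event, separators=(",", ":"))


# Hypothetical event with illustrative field names.
event = {"event": "training_iteration", "current_iteration": 10, "loss": 0.25}

# Python logger style: one structured message at INFO level.
logging.basicConfig(level=logging.INFO)
logging.getLogger("telemetry_sketch").info(to_key_value(event))

# File logger style: one JSON object per line, ready to append to a file.
json_line = to_json_line(event)
```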

Events have metrics attached to them. A special class of events, error events, captures error messages and stack traces.

Key Features

Context managers for timing code blocks
Event recording with customizable metrics
Exception handling and error reporting
Flexible backend system for storing/analyzing telemetry data as log messages, JSON objects or OTEL traces
Low overhead monitoring

Installation

The library package is available on the public PyPI repository. Install it with pip:

pip install aidot-training-telemetry

If using Poetry, run the following command:

poetry add aidot-training-telemetry
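
One way to confirm that the installation succeeded is to query the installed distribution's version through the standard library. The helper name `installed_version` is made up for this sketch; `importlib.metadata` is part of Python 3.8 and later.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("aidot-training-telemetry"))
```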

Usage

Using the context API, decorate the application's main function as follows:

def get_application_metrics():
    return ApplicationMetrics.create(
        rank=get_rank_index(),
        world_size=get_rank_count(),
        node_name="localhost",
        timezone=str(get_localzone()),
        total_iterations=num_epochs * len(dataloader),
        checkpoint_enabled=True,
        checkpoint_strategy="sync",
    )


@application_running(metrics=get_application_metrics())
def main():
    [...]

This will capture any exceptions not handled by the application, and log them as an error event before re-raising them.
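
The capture-and-re-raise behavior can be sketched with a plain decorator. This is an illustration of the pattern only, not the code behind `application_running`: the `report_errors` decorator, the `error_events` list standing in for a backend, and the event field names are all hypothetical.

```python
import functools
import traceback

error_events = []  # stand-in for a real backend


def report_errors(func):
    """Record an error event for any unhandled exception, then re-raise."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Capture the message and the full stack trace as an error event.
            error_events.append(
                {
                    "event": "error",
                    "error_type": type(exc).__name__,
                    "message": str(exc),
                    "stack_trace": traceback.format_exc(),
                }
            )
            raise  # bare raise keeps the original exception and traceback

    return wrapper


@report_errors
def failing_main():
    raise ValueError("bad batch size")


try:
    failing_main()
except ValueError:
    pass  # the caller still sees the original exception
```

Because the exception is re-raised unchanged, the application's exit behavior stays the same; the telemetry only adds a record of what failed.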

For the training loop and iterations:

with training_iteration() as training_iteration_span:
    [...]
    training_iteration_span.add_metrics(
        IterationMetrics.create(
            current_iteration=current_iteration,
            num_iterations=len(dataloader),
            loss=loss.item(),
            accuracy=accuracy.item(),
        )
    )

For checkpoint monitoring:

with checkpoint_save() as checkpoint_save_span:
    [...]
    checkpoint_save_span.add_metrics(
        CheckpointSaveMetrics.create(
            checkpoint_type=CheckPointType.LOCAL,
            current_iteration=current_iteration,
            checkpoint_directory=temp_dir,
            checkpoint_filename=os.path.basename(checkpoint_file_name),
        )
    )

For a concrete example, refer to the torch example or the usage examples.

It's also possible to create spans and events manually; refer to the recorder API for details.

Release history

1.1.2 (Dec 1, 2025), this version
1.1.0 (Sep 25, 2025)
1.0.1 (Sep 2, 2025)
1.0.0 (Sep 2, 2025)

Download files

Source Distribution

aidot_training_telemetry-1.1.2.tar.gz (38.3 kB view details)

Uploaded Dec 1, 2025 Source

Built Distribution

aidot_training_telemetry-1.1.2-py3-none-any.whl (55.9 kB view details)

Uploaded Dec 1, 2025 Python 3

File details

Details for the file aidot_training_telemetry-1.1.2.tar.gz.

File metadata

Download URL: aidot_training_telemetry-1.1.2.tar.gz
Upload date: Dec 1, 2025
Size: 38.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.10.19

File hashes

Hashes for aidot_training_telemetry-1.1.2.tar.gz
Algorithm	Hash digest
SHA256	`f7bb1426982ae8cb7b15063312bdd625ccdec3dd74282cb54038bc4242d0126c`
MD5	`56b79c8dcb456f31ece3433b1093c367`
BLAKE2b-256	`d0c80d625aec233f68856cf7e87263a7b7efa41877b2db0a56ced03abaca4960`
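
A downloaded file can be checked against the published SHA256 digest with the standard library. The function names below are made up for this sketch, and the file path passed in is whatever location the archive was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, `expected_hex` would be the SHA256 value listed in the table above.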

File details

Details for the file aidot_training_telemetry-1.1.2-py3-none-any.whl.

File metadata

Download URL: aidot_training_telemetry-1.1.2-py3-none-any.whl
Upload date: Dec 1, 2025
Size: 55.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.10.19

File hashes

Hashes for aidot_training_telemetry-1.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`40bb482cae6aef37aaf51ad510219446810fc30386642171991a92e984bd38fc`
MD5	`3b2b684ce626328aa7ce5043c75e0e72`
BLAKE2b-256	`cc10738b9e386916c099f34ca48d8011fc8c832438e73fd04960155ce2d0447b`
