
Wrappers that make it easy to enable training job telemetry for a set of supported training frameworks.


PyTorch Lightning Integration

The one_logger_pytorch_lightning_integration library provides an easy way to add telemetry to applications that use PyTorch Lightning for training. The integration works through Lightning's callback mechanism. Because Lightning's callback API doesn't cover everything we need, the one_logger_pytorch_lightning_integration library supplements it in order to support async checkpointing and a few app lifecycle events that the Lightning callback API does not expose.

Minimum Requirements

  • python version >= 3.9, < 3.14.
  • torch version >= 2.8.0.
  • pytorch-lightning version >= 2.5.3.
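
Once these requirements are met, install the integration from PyPI (pip accepts the distribution name with either hyphens or underscores):

    pip install nv-one-logger-pytorch-lightning-integration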

Integrate nv-one-logger into a PTL application via hook_trainer_cls

hook_trainer_cls adds telemetry hooks to your Trainer. With this method, several training events are captured automatically; however, you still need to call the one logger API explicitly for other events. See the explicit vs implicit section below for more details.

    TrainingTelemetryProvider.instance().with_base_config(config).with_exporter(exporter).configure_provider()
    ...
    HookedTrainer, nv_one_logger_callback = hook_trainer_cls(Trainer, TrainingTelemetryProvider.instance())
    # Instantiate "HookedTrainer" passing it the same parameters you would pass to the regular Lightning Trainer.
    trainer = HookedTrainer(
        max_epochs=NUM_EPOCHS,
        limit_train_batches=NUM_TRAIN_BATCHES,
        limit_val_batches=NUM_VAL_BATCHES,
        devices=NUM_DEVICES,
        # No need to pass the one logger callback; it is added automatically.
        # You can pass any other callbacks you need.
        callbacks=[...],
        ...
    )

    # You can now use the "trainer" instance the same way you use a regular Lightning trainer.
    ...
    # You can also use the callback returned by hook_trainer_cls (or get it via the
    # "nv_one_logger_callback" property of the trainer) to invoke on_xxx methods that are
    # not part of the Lightning Callback interface, such as on_model_init_start,
    # on_model_init_end, on_dataloader_init_start, etc.

    # Note that nv_one_logger_callback == trainer.nv_one_logger_callback

    nv_one_logger_callback.on_app_end()

Explicit vs Implicit Telemetry Collection

As mentioned above, thanks to the integration with Lightning's built-in callback mechanism, when your code calls trainer.fit, several training-related spans (e.g., the training loop, training iterations, validation iterations) are captured and reported to NV one logger training telemetry implicitly (without you having to write any extra code).

However, several lifecycle events are not captured by the Lightning callback mechanism; if you want to collect telemetry for those events, you need to call the corresponding TimeEventCallback on_xxx methods explicitly.

The table below shows which spans are captured implicitly and which ones require an explicit call.

| Span | How to collect data? |
| --- | --- |
| APPLICATION | Explicit: you need to call on_app_end yourself; on_app_start is called automatically. |
| TRAINING_LOOP | Implicit via trainer.fit() |
| TRAINING_SINGLE_ITERATION | Implicit via trainer.fit() |
| VALIDATION_LOOP | Implicit via trainer.fit() |
| VALIDATION_SINGLE_ITERATION | Implicit via trainer.fit() |
| TESTING_LOOP | Explicit: call on_testing_start and on_testing_end |
| DATA_LOADER_INIT | Explicit: call on_dataloader_init_start and on_dataloader_init_end |
| MODEL_INIT | Explicit: call on_model_init_start and on_model_init_end |
| OPTIMIZER_INIT | Explicit: call on_optimizer_init_start and on_optimizer_init_end |
| CHECKPOINT_LOAD | Explicit: call on_load_checkpoint_start and on_load_checkpoint_end |
| CHECKPOINT_SAVE_SYNC | Implicit. Works both for checkpoints saved automatically by trainer.fit() and checkpoints saved explicitly during training. |
| CHECKPOINT_SAVE_ASYNC | Implicit. Works both for checkpoints saved automatically by trainer.fit() and checkpoints saved explicitly during training. |
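
For example, to capture the TESTING_LOOP span you bracket the test run with the corresponding on_xxx calls on the callback. A minimal sketch, assuming trainer, model, and nv_one_logger_callback are set up as in the snippet above, and test_loader is a hypothetical DataLoader for your test set:

    # TESTING_LOOP is not covered by Lightning's callback API, so record it explicitly.
    nv_one_logger_callback.on_testing_start()
    trainer.test(model, dataloaders=test_loader)
    nv_one_logger_callback.on_testing_end()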

Full Example

Below is a simple training application that shows how you can enable telemetry for Lightning by adding a few lines of code.

import os
from pathlib import Path

import torch
from pytorch_lightning import LightningModule, Trainer

from nv_one_logger.training_telemetry.api.training_telemetry_provider import TrainingTelemetryProvider
from nv_one_logger.training_telemetry.integration.pytorch_lightning import hook_trainer_cls
# Import paths for OneLoggerConfig, TrainingTelemetryConfig, and FileExporter are
# assumed here; adjust them to match your installed nv-one-logger version.
from nv_one_logger.api.config import OneLoggerConfig
from nv_one_logger.training_telemetry.api.config import TrainingTelemetryConfig
from nv_one_logger.exporter.file_exporter import FileExporter

class SimpleModel(LightningModule):
    def __init__(self, learning_rate=1e-3):
        super().__init__()
        self.learning_rate = learning_rate
        self.model = torch.nn.Sequential(
            torch.nn.Linear(10, 20),
            torch.nn.ReLU(),
            torch.nn.Linear(20, 1)
        )
        
    def forward(self, x):
        return self.model(x)
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = torch.nn.functional.mse_loss(y_hat, y)
        self.log('train_loss', loss)
        return loss
    
    def configure_optimizers(self):
        # The hooked trainer exposes the one logger callback as a property;
        # use it to record the OPTIMIZER_INIT span explicitly.
        nv_one_logger_callback = self.trainer.nv_one_logger_callback
        nv_one_logger_callback.on_optimizer_init_start()
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        nv_one_logger_callback.on_optimizer_init_end()

        return optimizer

def main():
    # 1. Configure the TrainingTelemetryProvider with a base config and an exporter
    base_config = OneLoggerConfig(
        application_name="test_app",
        session_tag_or_fn="test_session",
        world_size_or_fn=5,
    )
    
    training_config = TrainingTelemetryConfig(
        world_size_or_fn=5,
        is_log_throughput_enabled_or_fn=True,
        flops_per_sample_or_fn=100,
        global_batch_size_or_fn=32,
        log_every_n_train_iterations=10,
        perf_tag_or_fn="test_perf",
    )
    exporter = FileExporter(file_path=Path("training_telemetry.json"))
    TrainingTelemetryProvider.instance().with_base_config(base_config).with_exporter(exporter).configure_provider()

    # 2. Create and hook the Trainer class. hook_trainer_cls returns the hooked
    # Trainer class along with the one logger callback (which is also available
    # via the trainer's nv_one_logger_callback property).
    train_iterations_target = 100
    num_devices = 1
    HookedTrainer, nv_one_logger_callback = hook_trainer_cls(Trainer, TrainingTelemetryProvider.instance())
    trainer = HookedTrainer(
        max_steps=train_iterations_target,
        devices=num_devices,
    )

    # 3. Set training telemetry config after on_app_start is called
    TrainingTelemetryProvider.instance().set_training_telemetry_config(training_config)
    
    # 4. Initialize model with OneLogger hooks and timestamps
    nv_one_logger_callback.on_model_init_start()
    model = SimpleModel()
    nv_one_logger_callback.on_model_init_end()

    # 5. Load a checkpoint if one exists, recording the CHECKPOINT_LOAD span around the load
    if os.path.exists("pretrained.ckpt"):
        nv_one_logger_callback.on_load_checkpoint_start()
        model = SimpleModel.load_from_checkpoint("pretrained.ckpt")
        nv_one_logger_callback.on_load_checkpoint_end()

    # 6. Create dummy dataset
    nv_one_logger_callback.on_dataloader_init_start()
    train_dataset = torch.utils.data.TensorDataset(
        torch.randn(1000, 10),
        torch.randn(1000, 1)
    )
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=32,
        shuffle=True
    )
    nv_one_logger_callback.on_dataloader_init_end()
    
    # 7. Start training
    trainer.fit(model, train_loader)

    # 8. End application
    nv_one_logger_callback.on_app_end()

if __name__ == "__main__":
    main() 
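
Running this example writes the collected telemetry to training_telemetry.json via the FileExporter configured in step 1.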
