
Wrappers that make it easy to enable training job telemetry for a set of supported training frameworks.

Project description

PyTorch Lightning Integration

The one_logger_pytorch_lightning_integration library provides an easy way to add telemetry to applications that use PyTorch Lightning for training. The integration works through Lightning's callback mechanism. Because Lightning's callback API doesn't cover everything we need, the library supplements it to support async checkpointing and some application lifecycle events that the callback API doesn't expose.

Minimum Requirements

  • python version >= 3.9, < 3.14.
  • torch version >= 2.8.0.
  • pytorch-lightning version >= 2.5.3.
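
The integration is published on PyPI (as the file names below show) and can be installed with pip:

    pip install nv-one-logger-pytorch-lightning-integration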

Integrating nv-one-logger into a PTL application via hook_trainer_cls

hook_trainer_cls adds telemetry hooks to your Trainer class. With this method, several training events are captured automatically; however, you still need to call the one logger API explicitly for other events. See the Explicit vs Implicit Telemetry Collection section below for details.

    TrainingTelemetryProvider.instance().with_base_config(config).with_exporter(exporter).configure_provider()
    ...
    HookedTrainer, nv_one_logger_callback = hook_trainer_cls(Trainer, TrainingTelemetryProvider.instance())
    # Instantiate "HookedTrainer" with the same parameters you would pass to the regular Lightning Trainer.
    trainer = HookedTrainer(
        max_epochs=NUM_EPOCHS,
        limit_train_batches=NUM_TRAIN_BATCHES,
        limit_val_batches=NUM_VAL_BATCHES,
        devices=NUM_DEVICES,
        # No need to pass the one logger callback; it is added automatically.
        # You can pass any other callbacks you need.
        callbacks=[...],
        ...
    )

    # You can now use the "trainer" instance the same way you would use a regular Lightning trainer.
    ...
    # You can also use the callback returned by hook_trainer_cls (or get it via the "nv_one_logger_callback" property of the trainer)
    # to invoke on_xxx methods that are not part of the Lightning Callback interface, such as on_model_init_start, on_model_init_end,
    # on_dataloader_init_start, etc.
    
    # Note that nv_one_logger_callback == trainer.nv_one_logger_callback
    
    nv_one_logger_callback.on_app_end()

Explicit vs Implicit Telemetry Collection

As mentioned above, thanks to the integration with Lightning's built-in callback mechanism, when your code calls trainer.fit, several training-related spans (e.g., the training loop, training iterations, validation iterations) are captured and reported to NV one logger training telemetry implicitly, without any extra code on your part.

However, several lifecycle events are not captured by the Lightning callback mechanism; if you are interested in collecting telemetry on those events, you need to call the corresponding TimeEventCallback on_xxx methods explicitly.

The table below shows which spans are captured implicitly and which require an explicit call.

Span                          How to collect data?
APPLICATION                   Explicit: call on_app_end (on_app_start is called automatically).
TRAINING_LOOP                 Implicit via trainer.fit()
TRAINING_SINGLE_ITERATION     Implicit via trainer.fit()
VALIDATION_LOOP               Implicit via trainer.fit()
VALIDATION_SINGLE_ITERATION   Implicit via trainer.fit()
TESTING_LOOP                  Explicit: call on_testing_start and on_testing_end
DATA_LOADER_INIT              Explicit: call on_dataloader_init_start and on_dataloader_init_end
MODEL_INIT                    Explicit: call on_model_init_start and on_model_init_end
OPTIMIZER_INIT                Explicit: call on_optimizer_init_start and on_optimizer_init_end
CHECKPOINT_LOAD               Explicit: call on_load_checkpoint_start and on_load_checkpoint_end
CHECKPOINT_SAVE_SYNC          Implicit. Works for checkpoints saved automatically by trainer.fit() as well as checkpoints saved explicitly during training.
CHECKPOINT_SAVE_ASYNC         Implicit. Works for checkpoints saved automatically by trainer.fit() as well as checkpoints saved explicitly during training.
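
For example, the TESTING_LOOP span requires explicit calls around the test run. Below is a minimal sketch, assuming a hooked trainer instance named trainer (as created in the previous section), where model and test_loader are placeholders for your LightningModule and test DataLoader:

    # Explicitly open and close the TESTING_LOOP span around trainer.test().
    nv_one_logger_callback = trainer.nv_one_logger_callback
    nv_one_logger_callback.on_testing_start()
    trainer.test(model, dataloaders=test_loader)
    nv_one_logger_callback.on_testing_end()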

Full Example

Below is a simple training application that shows how you can enable telemetry for a Lightning app by adding a few lines of code.

import os
from pathlib import Path

import torch
from pytorch_lightning import LightningModule, Trainer

# NOTE: the exact module paths for the config and exporter classes may vary
# between nv-one-logger versions; adjust these imports to match your installation.
from nv_one_logger.api.config import OneLoggerConfig
from nv_one_logger.exporter.file_exporter import FileExporter
from nv_one_logger.training_telemetry.api.config import TrainingTelemetryConfig
from nv_one_logger.training_telemetry.api.training_telemetry_provider import TrainingTelemetryProvider
from nv_one_logger.training_telemetry.integration.pytorch_lightning import hook_trainer_cls

class SimpleModel(LightningModule):
    def __init__(self, learning_rate=1e-3):
        super().__init__()
        self.learning_rate = learning_rate
        self.model = torch.nn.Sequential(
            torch.nn.Linear(10, 20),
            torch.nn.ReLU(),
            torch.nn.Linear(20, 1)
        )
        
    def forward(self, x):
        return self.model(x)
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = torch.nn.functional.mse_loss(y_hat, y)
        self.log('train_loss', loss)
        return loss
    
    def configure_optimizers(self):
        # The hooked trainer exposes the one logger callback as a property;
        # inside a LightningModule, reach it via self.trainer.
        nv_one_logger_callback = self.trainer.nv_one_logger_callback
        nv_one_logger_callback.on_optimizer_init_start()
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        nv_one_logger_callback.on_optimizer_init_end()
        
        return optimizer

def main():
    # 1. Configure the TrainingTelemetryProvider
    base_config = OneLoggerConfig(
        application_name="test_app",
        session_tag_or_fn="test_session",
        world_size_or_fn=5,
    )
    
    training_config = TrainingTelemetryConfig(
        world_size_or_fn=5,
        is_log_throughput_enabled_or_fn=True,
        flops_per_sample_or_fn=100,
        global_batch_size_or_fn=32,
        log_every_n_train_iterations=10,
        perf_tag_or_fn="test_perf",
    )
    exporter = FileExporter(file_path=Path("training_telemetry.json"))
    TrainingTelemetryProvider.instance().with_base_config(base_config).with_exporter(exporter).configure_provider()

    # 2. Create and hook the Trainer class. hook_trainer_cls returns the hooked
    # class along with the one logger callback (see the previous section).
    HookedTrainer, nv_one_logger_callback = hook_trainer_cls(Trainer, TrainingTelemetryProvider.instance())
    trainer = HookedTrainer(
        max_steps=100,  # target number of training iterations
        devices=1,      # number of devices to train on
    )
    # (The callback is also available as trainer.nv_one_logger_callback.)

    # 3. Set training telemetry config after on_app_start is called
    TrainingTelemetryProvider.instance().set_training_telemetry_config(training_config)
    
    # 4. Initialize model with OneLogger hooks and timestamps
    nv_one_logger_callback.on_model_init_start()
    model = SimpleModel()
    nv_one_logger_callback.on_model_init_end()

    # 5. Load checkpoint if one exists, wrapping the load in the CHECKPOINT_LOAD span
    if os.path.exists("pretrained.ckpt"):
        nv_one_logger_callback.on_load_checkpoint_start()
        model = SimpleModel.load_from_checkpoint("pretrained.ckpt")
        nv_one_logger_callback.on_load_checkpoint_end()

    # 6. Create dummy dataset
    nv_one_logger_callback.on_dataloader_init_start()
    train_dataset = torch.utils.data.TensorDataset(
        torch.randn(1000, 10),
        torch.randn(1000, 1)
    )
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=32,
        shuffle=True
    )
    nv_one_logger_callback.on_dataloader_init_end()
    
    # 7. Start training
    trainer.fit(model, train_loader)

    # 8. End application
    nv_one_logger_callback.on_app_end()

if __name__ == "__main__":
    main() 
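
Running this script produces training_telemetry.json (the path passed to FileExporter above), which you can inspect to verify the exported spans and events.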
