Wrappers that simplify enabling training job telemetry for a set of supported training frameworks.
PyTorch Lightning Integration
The one_logger_pytorch_lightning_integration library provides an easy way to add telemetry to applications that use PyTorch Lightning for training.
The integration works through Lightning's callback mechanism.
Because Lightning's callback API doesn't support everything we need, the one_logger_pytorch_lightning_integration library supplements
the Lightning callback mechanism to support async checkpointing and some application lifecycle events not covered by the Lightning callback API.
Minimum Requirements
- python version >= 3.9, < 3.14
- torch version >= 2.8.0
- pytorch-lightning version >= 2.5.3
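Assuming the package is published on PyPI under the name nv-one-logger-pytorch-lightning-integration, it can be installed with pip: `pip install nv-one-logger-pytorch-lightning-integration`.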
Integrate nv-one-logger into a PTL application via hook_trainer_cls
hook_trainer_cls adds telemetry hooks to your Trainer.
Using this method, several of the training events are captured automatically. However, you still need to explicitly call the one logger API for other events; see the Explicit vs Implicit Telemetry Collection section below for more details.
```python
TrainingTelemetryProvider.instance().with_base_config(config).with_exporter(exporter).configure_provider()
...
HookedTrainer, nv_one_logger_callback = hook_trainer_cls(Trainer, TrainingTelemetryProvider.instance())

# Instantiate "HookedTrainer" passing it the same parameters you would pass to the regular Lightning Trainer.
trainer = HookedTrainer(
    max_epochs=NUM_EPOCHS,
    limit_train_batches=NUM_TRAIN_BATCHES,
    limit_val_batches=NUM_VAL_BATCHES,
    devices=NUM_DEVICES,
    # No need to pass the one logger callback; it is added automatically.
    # You can pass any other callbacks that you need.
    callbacks=[...],
    # ... plus any other Trainer arguments you need.
)

# You can now use the "trainer" instance the same way you would use a regular Lightning Trainer.
...

# You can also use the callback returned by hook_trainer_cls (or get it via the "nv_one_logger_callback" property
# of the trainer) to invoke on_xxx methods that are not part of the Lightning Callback interface, such as
# on_model_init_start, on_model_init_end, on_dataloader_init_start, etc.
# Note that nv_one_logger_callback == trainer.nv_one_logger_callback
nv_one_logger_callback.on_app_end()
```
Explicit vs Implicit Telemetry Collection
As mentioned above, thanks to the integration with Lightning's built-in callback mechanism, when your code
calls trainer.fit, several training-related spans (e.g., the training loop, training iterations, validation iterations, etc.)
are captured and reported to NV One Logger training telemetry implicitly (without the need for you to write any extra code).
However, several lifecycle events are not captured by the Lightning callback mechanism; therefore, if you are interested in collecting
telemetry on those events, you need to call the corresponding TimeEventCallback on_xxx methods explicitly.
The table below shows which spans are captured implicitly and which ones require an explicit call.
| Span | How to collect data? |
|---|---|
| APPLICATION | You need to explicitly call on_app_end. on_app_start is called automatically. |
| TRAINING_LOOP | Implicit via trainer.fit() |
| TRAINING_SINGLE_ITERATION | Implicit via trainer.fit() |
| VALIDATION_LOOP | Implicit via trainer.fit() |
| VALIDATION_SINGLE_ITERATION | Implicit via trainer.fit() |
| TESTING_LOOP | Explicit: call on_testing_start and on_testing_end |
| DATA_LOADER_INIT | Explicit: call on_dataloader_init_start and on_dataloader_init_end |
| MODEL_INIT | Explicit: call on_model_init_start and on_model_init_end |
| OPTIMIZER_INIT | Explicit: call on_optimizer_init_start and on_optimizer_init_end |
| CHECKPOINT_LOAD | Explicit: call on_load_checkpoint_start and on_load_checkpoint_end |
| CHECKPOINT_SAVE_SYNC | Implicit. Works both for checkpoints saved automatically by trainer.fit() and checkpoints saved explicitly during training. |
| CHECKPOINT_SAVE_ASYNC | Implicit. Works both for checkpoints saved automatically by trainer.fit() and checkpoints saved explicitly during training. |
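For the explicit spans, you bracket the relevant code with the corresponding on_xxx calls. Below is a minimal sketch of instrumenting the TESTING_LOOP span; it assumes trainer was built from a class returned by hook_trainer_cls as shown above, and that model and test_loader are names defined elsewhere in your application.

```python
# Minimal sketch: explicitly instrumenting the TESTING_LOOP span.
# Assumes "trainer" comes from a HookedTrainer class (see above) and that
# "model" and "test_loader" are defined elsewhere in your application.
nv_one_logger_callback = trainer.nv_one_logger_callback

nv_one_logger_callback.on_testing_start()     # opens the TESTING_LOOP span
trainer.test(model, dataloaders=test_loader)
nv_one_logger_callback.on_testing_end()       # closes the TESTING_LOOP span
```

The other explicit spans (MODEL_INIT, DATA_LOADER_INIT, OPTIMIZER_INIT, CHECKPOINT_LOAD) are instrumented the same way, as shown in the full example below.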
Full Example
Below is a simple training application that shows how you can enable telemetry for Lightning by adding a few lines of code.
```python
import os
from pathlib import Path

import torch
from pytorch_lightning import LightningModule, Trainer

from nv_one_logger.training_telemetry.api.training_telemetry_provider import TrainingTelemetryProvider
from nv_one_logger.training_telemetry.integration.pytorch_lightning import hook_trainer_cls
# Also import OneLoggerConfig, TrainingTelemetryConfig, and FileExporter from the
# corresponding nv_one_logger modules of your installed version.


class SimpleModel(LightningModule):
    def __init__(self, learning_rate=1e-3):
        super().__init__()
        self.learning_rate = learning_rate
        self.model = torch.nn.Sequential(
            torch.nn.Linear(10, 20),
            torch.nn.ReLU(),
            torch.nn.Linear(20, 1),
        )

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = torch.nn.functional.mse_loss(y_hat, y)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        nv_one_logger_callback = self.trainer.nv_one_logger_callback
        nv_one_logger_callback.on_optimizer_init_start()
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        nv_one_logger_callback.on_optimizer_init_end()
        return optimizer


def main():
    # Example values for this toy script.
    train_iterations_target = 100
    num_devices = 1

    # 1. Configure the TrainingTelemetryProvider
    base_config = OneLoggerConfig(
        application_name="test_app",
        session_tag_or_fn="test_session",
        world_size_or_fn=5,
    )
    training_config = TrainingTelemetryConfig(
        world_size_or_fn=5,
        is_log_throughput_enabled_or_fn=True,
        flops_per_sample_or_fn=100,
        global_batch_size_or_fn=32,
        log_every_n_train_iterations=10,
        perf_tag_or_fn="test_perf",
    )
    exporter = FileExporter(file_path=Path("training_telemetry.json"))
    TrainingTelemetryProvider.instance().with_base_config(base_config).with_exporter(exporter).configure_provider()

    # 2. Create and hook the Trainer class
    HookedTrainer = hook_trainer_cls(Trainer, TrainingTelemetryProvider.instance())
    trainer = HookedTrainer(
        max_steps=train_iterations_target,
        devices=num_devices,
    )
    nv_one_logger_callback = trainer.nv_one_logger_callback

    # 3. Set the training telemetry config after on_app_start is called
    TrainingTelemetryProvider.instance().set_training_telemetry_config(training_config)

    # 4. Initialize the model with OneLogger hooks and timestamps
    nv_one_logger_callback.on_model_init_start()
    model = SimpleModel()
    nv_one_logger_callback.on_model_init_end()

    # 5. Load a checkpoint if needed
    nv_one_logger_callback.on_load_checkpoint_start()
    if os.path.exists("pretrained.ckpt"):
        model = SimpleModel.load_from_checkpoint("pretrained.ckpt")
    nv_one_logger_callback.on_load_checkpoint_end()

    # 6. Create a dummy dataset
    nv_one_logger_callback.on_dataloader_init_start()
    train_dataset = torch.utils.data.TensorDataset(
        torch.randn(1000, 10),
        torch.randn(1000, 1),
    )
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=32,
        shuffle=True,
    )
    nv_one_logger_callback.on_dataloader_init_end()

    # 7. Start training
    trainer.fit(model, train_loader)

    # 8. End the application
    nv_one_logger_callback.on_app_end()


if __name__ == "__main__":
    main()
```