
Wandb Offline Sync Hook

A convenient way to trigger synchronizations to wandb if your compute nodes don't have internet!


🤔 What is this?

  • ✅ You use wandb/Weights & Biases to record your machine learning trials?
  • ✅ Your ML experiments run on compute nodes without internet access (for example, using a batch system)?
  • ✅ Your compute nodes and your head/login node (with internet) have access to a shared file system?

Then this package might be useful. For alternatives, see below.

What you might have been doing so far

You have probably been using export WANDB_MODE="offline" on the compute nodes and then running something like

cd /.../result_dir/
for d in $(ls -t -d */); do cd $d; wandb sync --sync-all; cd ..; done

from your head node (with internet access) every now and then. However, this is obviously not very satisfying, as it doesn't update live. Sure, you could wrap this in a while true loop, but if you have a lot of trials in your directory, every pass will take a long time, cause unnecessary network traffic, and it's just not very elegant.

How does wandb-osh solve the problem?

  1. You add a hook that is called every time an epoch concludes (that is, when we want to trigger a sync).
  2. You start the wandb-osh script on your head node with internet access. This script will then trigger wandb sync upon request from one of the compute nodes.

How is this implemented?

Very simple: Every time an epoch concludes, the hook gets called and creates a file in the communication directory (~/.wandb_osh_communication by default). The wandb-osh script that is running on the head node (with internet) reads these files and performs the synchronization.
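
To make this concrete, here is a minimal sketch of the idea (illustrative only, not the actual wandb-osh source; the file naming, the .command suffix, and the function names request_sync/watch_and_sync are made up for this example):

import hashlib
import subprocess
import time
from pathlib import Path

COMM_DIR = Path.home() / ".wandb_osh_communication"  # default location mentioned above


def request_sync(run_dir: str) -> None:
    """Compute-node side: ask the head node to sync run_dir."""
    COMM_DIR.mkdir(parents=True, exist_ok=True)
    # one file per run directory, so repeated requests simply overwrite each other
    name = hashlib.md5(run_dir.encode()).hexdigest() + ".command"
    (COMM_DIR / name).write_text(run_dir)


def watch_and_sync(poll_interval: float = 10.0) -> None:
    """Head-node side: poll the communication directory and sync each requested run."""
    while True:
        for command_file in COMM_DIR.glob("*.command"):
            target = command_file.read_text().strip()
            subprocess.run(["wandb", "sync", target], check=False)
            command_file.unlink(missing_ok=True)
        time.sleep(poll_interval)

Since the communication happens purely through the shared file system, the compute nodes themselves never need a network connection.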

What alternatives are there?

With ray tune, you can use your ray head node as the place to synchronize from (rather than deploying it via the batch system as well, as the current docs suggest). See the note below or my demo repository. Similar strategies might be possible for wandb as well (let me know!).

📦 Installation

pip3 install wandb-osh

For completeness, the optional extras lightning and ray are also available, but they only ensure that the corresponding package is installed. For example,

pip3 install 'wandb-osh[lightning]'

also installs pytorch lightning if it is not already present, but has no other effect.

For development, make sure to also include the testing extra:

pip3 install --editable '.[testing]'

🔥 Running it!

Two steps: Set up the hook, then run the script from your head node.

Step 1: Setting up the hook

With pure wandb

Let's adapt the simple pytorch example from the wandb docs (it only takes 3 lines!):

import torch.nn.functional as F
import wandb
from wandb_osh.hooks import TriggerWandbSyncHook  # <-- New!


trigger_sync = TriggerWandbSyncHook()  # <-- New!

wandb.init(config=args, mode="offline")  # args, train_loader, optimizer as in the wandb docs example

model = ...  # set up your model

# Magic
wandb.watch(model, log_freq=100)

model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()
    if batch_idx % args.log_interval == 0:
        wandb.log({"loss": loss})
        trigger_sync()  # <-- New!
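
If your compute nodes and head node don't share $HOME, you can point the hook at a directory on the shared file system instead of the default ~/.wandb_osh_communication. The snippet below assumes the constructor argument is called communication_dir and that /shared is your shared mount; check the documentation and wandb-osh --help for the matching option on the head-node side.

# /shared is a placeholder for your shared file system;
# the communication_dir keyword is an assumption -- check the wandb-osh docs
trigger_sync = TriggerWandbSyncHook(communication_dir="/shared/wandb_osh_communication")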

With pytorch lightning

Simply add the TriggerWandbSyncLightningCallback to your list of callbacks and you're good to go!

from wandb_osh.lightning_hooks import TriggerWandbSyncLightningCallback  # <-- New!
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning import Trainer

logger = WandbLogger(
    project="project",
    group="group",
    offline=True,
)

model = MyLightningModule()
trainer = Trainer(
    logger=logger,
    callbacks=[TriggerWandbSyncLightningCallback()]  # <-- New!
)
trainer.fit(model, train_dataloader, val_dataloader)

With ray tune

Note: With ray tune, you might not need this package! While the approach suggested in the ray tune SLURM docs deploys the ray head on a worker node as well (so it doesn't have internet), this actually isn't needed. Instead, you can run the ray head and the tuning script on the head node and only submit batch jobs for your workers. In this way, wandb will be called from the head node and internet access is no problem there. For more information on this approach, take a look at my demo repository.

You probably already use the WandbLoggerCallback callback. We simply add a second callback for wandb-osh (it only takes two new lines!):

import os

# the ray imports below assume ray 2.x; exact paths may differ in other ray versions
from ray import tune
from ray.air import RunConfig
from ray.air.integrations.wandb import WandbLoggerCallback

from wandb_osh.ray_hooks import TriggerWandbSyncRayHook  # <-- New!


os.environ["WANDB_MODE"] = "offline"

callbacks = [
    WandbLoggerCallback(...),  # <-- ray tune documentation tells you about this
    TriggerWandbSyncRayHook(),  # <-- New!
]

tuner = tune.Tuner(
    trainable,
    tune_config=...,
    run_config=RunConfig(
        ...,
        callbacks=callbacks,
    ),
)

With anything else

Simply take the TriggerWandbSyncHook class and use it as a callback in your training loop (as in the wandb example above), passing the directory that wandb is syncing to as an argument.
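
For instance, a plain training loop might look like this (a rough sketch: train_one_epoch and n_epochs are placeholders, and the exact directory to pass can differ between wandb versions; wandb.run.dir usually points at the files/ subfolder inside the offline run directory, so its parent is what wandb sync operates on):

from pathlib import Path

import wandb
from wandb_osh.hooks import TriggerWandbSyncHook

trigger_sync = TriggerWandbSyncHook()

wandb.init(mode="offline")
run_dir = Path(wandb.run.dir).parent  # the offline-run-* directory

for epoch in range(n_epochs):
    train_one_epoch(...)  # your training code (placeholder)
    wandb.log({"epoch": epoch})
    trigger_sync(run_dir)  # pass the directory that wandb is syncing to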

Step 2: Running the script on the head node

After installation, you should have a wandb-osh script in your $PATH. Simply call it like this:

wandb-osh

The output will look something like this:
INFO: Starting to watch /home/kl5675/.wandb_osh_command_dir
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/b1f60706 ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/92a3ef1b ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/a2caa9c0 ... done.

Take a look at wandb-osh --help or check the documentation for all command line options. You can add options to the wandb sync call by placing them after --. For example

wandb-osh -- --sync-all

❓ Q & A

I get the warning "wandb: NOTE: use wandb sync --sync-all to sync 1 unsynced runs from local directory."

You can start wandb-osh with wandb-osh -- --sync-all to always synchronize all available runs.

How can I suppress logging messages (e.g., warnings about syncing not being fast enough)?

import wandb_osh

# for wandb_osh.__version__ >= 1.2.0
wandb_osh.set_log_level("ERROR")

🧰 Development setup

pip3 install pre-commit
pre-commit install

💖 Contributing

Your help is greatly appreciated! Suggestions, bug reports and feature requests are best opened as GitHub issues. You are also very welcome to submit a pull request!

Bug reports and pull requests are credited with the help of the allcontributors bot.

  • Barthelemy Meynard-Piganeau: 🐛
  • MoH-assan: 🐛
  • Cedric Leonard: 💻 🐛

