Trigger wandb offline syncs from a compute node without internet
Wandb Offline Sync Hook
A convenient way to trigger synchronizations to wandb if your compute nodes don't have internet!
🤔 What is this?
- ✅ You use wandb/Weights & Biases to record your machine learning trials?
- ✅ Your ML experiments run on compute nodes without internet access (for example, using a batch system)?
- ✅ Your compute nodes and head nodes have access to a shared file system?
Then this package can be useful.
What you might have been doing so far
So far you have probably set export WANDB_MODE="offline" and then run something like
for d in $(ls -t -d */); do cd "$d"; wandb sync --sync-all; cd ..; done
in the result directory on your head node (with internet access) every now and then to sync all runs.
However, this is not very satisfying because it doesn't update live.
Sure, you could throw this in a while True loop, but if you have a lot of trials in your directory, this will take forever and cause unnecessary network traffic, and it's just not very elegant.
How does wandb-osh solve the problem?
- You add a hook that is called every time an epoch concludes (that is, when we want to trigger a sync).
- You start the wandb-osh script on your head node with internet access. This script will now trigger wandb sync upon request from one of the compute nodes.
How is this implemented?
Very simple: Every time an epoch concludes, the hook gets called and creates a file in a communication directory (~/.wandb_osh_communication by default). wandb-osh scans the communication directory and reads synchronization instructions from such files.
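The file-based handshake described above can be sketched in a few lines. This is a minimal illustration and not the actual wandb-osh code: the function names `request_sync` and `collect_requests`, the `.command` file suffix and the file contents are all assumptions made for the example.

```python
import tempfile
from pathlib import Path


def request_sync(comm_dir, run_dir):
    """Compute-node side: leave a small file that names the run directory to sync."""
    comm_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one command file per run directory,
    # so repeated requests for the same run overwrite each other.
    command_file = comm_dir / f"{run_dir.name}.command"
    command_file.write_text(str(run_dir.resolve()))
    return command_file


def collect_requests(comm_dir):
    """Head-node side: read every pending request, then delete its file."""
    targets = []
    for command_file in sorted(comm_dir.glob("*.command")):
        targets.append(Path(command_file.read_text()))
        command_file.unlink()
    return targets


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        comm_dir = Path(tmp) / "communication"
        run_dir = Path(tmp) / "run_a"
        run_dir.mkdir()
        request_sync(comm_dir, run_dir)
        # A real watcher would now call `wandb sync` for each returned directory.
        print(collect_requests(comm_dir))
```

Because both sides only touch files, the approach needs nothing beyond the shared file system: the compute node never opens a network connection, and the head node only syncs runs that actually asked for it.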
📦 Installation
pip3 install wandb-osh
If you want to use this package with ray, use
pip3 install 'wandb-osh[ray]'
🔥 Running it!
Two steps: Set up the hook, then run the script from your head node.
Step 1: Setting up the hook
With wandb
Let's adapt the simple pytorch example from the wandb docs (it only takes 3 lines!):
import torch.nn.functional as F
import wandb
from wandb_osh.hooks import TriggerWandbSyncHook  # <-- New!

trigger_sync = TriggerWandbSyncHook()  # <-- New!

wandb.init(config=args, mode="offline")

model = ...  # set up your model

# Magic
wandb.watch(model, log_freq=100)

model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()
    if batch_idx % args.log_interval == 0:
        wandb.log({"loss": loss})
        trigger_sync()  # <-- New!
With ray tune
You probably already use the WandbLoggerCallback callback. We simply add a second callback for wandb-osh (it only takes one new line!):
import os
from wandb_osh.ray_hooks import TriggerWandbSyncRayHook

os.environ["WANDB_MODE"] = "offline"

callbacks = [
    WandbLoggerCallback(...),  # <-- ray tune documentation tells you about this
    TriggerWandbSyncRayHook(),  # <-- New!
]

tuner = tune.Tuner(
    trainable,
    tune_config=...,
    run_config=RunConfig(
        ...,
        callbacks=callbacks,
    ),
)
With anything else
Simply take the TriggerWandbSyncHook class and use it as a callback in your training loop (as in the wandb example above), passing the directory that wandb is syncing to as an argument.
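The callback pattern can be sketched as follows. To keep the example runnable without wandb installed, `fake_hook` is a hypothetical stand-in for a `TriggerWandbSyncHook` instance, and `train` and the run directory are illustrative names rather than anything from the package.

```python
from pathlib import Path


def train(n_epochs, run_dir, on_epoch_end):
    """Toy training loop that calls a callback after every epoch."""
    for epoch in range(n_epochs):
        # ... one epoch of training and offline logging would go here ...
        on_epoch_end(run_dir)  # request a sync of this run's directory


# Hypothetical stand-in for the real hook: it only records what it was asked to sync.
requested = []


def fake_hook(logdir):
    requested.append(logdir)


train(n_epochs=3, run_dir=Path("wandb/offline-run-example"), on_epoch_end=fake_hook)
print(len(requested))
```

In a real script, the callback would presumably be the hook instance instead of `fake_hook`, with the directory that your framework hands to wandb.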
Step 2: Running the script on the head node
After installation, you should have a wandb-osh script in your $PATH. Simply call it like this:
wandb-osh
The output will look something like this:
INFO: Starting to watch /home/kl5675/.wandb_osh_command_dir
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/b1f60706 ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/92a3ef1b ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/a2caa9c0 ... done.
Take a look at wandb-osh --help for all command line options.
You can also add options to the wandb sync call by placing them after --. For example:
wandb-osh -- --sync-all
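The `--` convention is a common way to forward options to a wrapped command. The sketch below shows the general idea only; it is not taken from the wandb-osh source, and `split_passthrough` and `build_sync_command` are hypothetical helper names.

```python
def split_passthrough(argv):
    """Split argv at the first "--" into (own options, options to forward)."""
    if "--" in argv:
        idx = argv.index("--")
        return list(argv[:idx]), list(argv[idx + 1:])
    return list(argv), []


def build_sync_command(run_dir, forwarded):
    """Hypothetical helper: assemble a `wandb sync` call with forwarded options."""
    return ["wandb", "sync", *forwarded, run_dir]


# Mirrors the call `wandb-osh -- --sync-all` (argv without the program name).
own, forwarded = split_passthrough(["--", "--sync-all"])
command = build_sync_command("path/to/run", forwarded)
print(command)
```

Everything before `--` stays with the wrapper script, and everything after it is handed on untouched, so the wrapper never has to know which options `wandb sync` supports.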
🧰 Development setup
pip3 install pre-commit
pre-commit install
💖 Contributing
Your help is greatly appreciated! Suggestions, bug reports and feature requests are best opened as GitHub issues. You are also very welcome to submit a pull request!
Bug reports and pull requests are credited with the help of the allcontributors bot.