# Wandb Offline Sync Hook

Trigger wandb offline syncs from a compute node without internet!

## 🤔 What is this?
- ✅ You use `wandb`/Weights & Biases to record your machine learning trials?
- ✅ Your ML experiments run on compute nodes without internet access (for example, using a batch system)?
- ✅ Your compute nodes and your head/login node (with internet) have access to a shared file system?

Then this package can be useful.
## What you might have been doing so far

You have probably been setting `export WANDB_MODE="offline"` on the compute nodes and then running something like

```bash
cd /.../result_dir/
for d in $(ls -t -d */); do cd "$d"; wandb sync --sync-all; cd ..; done
```

from your head node (with internet access) every now and then.

However, this is not very satisfying, as it doesn't update live. Sure, you could wrap it in a `while true` loop, but if you have many trials in your directory, each pass takes a long time, causes unnecessary network traffic, and it's just not very elegant.
## How does `wandb-osh` solve the problem?

- You add a hook that is called every time an epoch concludes (that is, whenever we want to trigger a sync).
- You start the `wandb-osh` script on your head node with internet access. This script will then trigger `wandb sync` upon request from one of the compute nodes.
## How is this implemented?

Very simply: every time an epoch concludes, the hook gets called and creates a file in a communication directory (`~/.wandb_osh_communication` by default). The `wandb-osh` script on the head node scans the communication directory and reads synchronization instructions from these files.
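The scheme can be sketched in a few lines of plain Python. This is a hypothetical stand-in for illustration only, not the actual `wandb-osh` source; the file name, the `.command` suffix, and the one-file-per-run layout are assumptions:

```python
import tempfile
from pathlib import Path

# Illustrative stand-in for the mechanism described above (NOT the actual
# wandb-osh source): the compute-node hook writes a small file into a shared
# communication directory; the head-node script scans that directory and
# syncs the run directory named in each file.


def trigger_sync(comm_dir: Path, run_dir: str) -> None:
    """Compute-node side: request a sync of run_dir via the shared directory."""
    comm_dir.mkdir(parents=True, exist_ok=True)
    # One file per run; the file body holds the directory to sync.
    (comm_dir / "my_run.command").write_text(run_dir)


def scan_and_sync(comm_dir: Path) -> list[str]:
    """Head-node side: collect pending requests (here `wandb sync` would run)."""
    handled = []
    for request in sorted(comm_dir.glob("*.command")):
        handled.append(request.read_text())
        request.unlink()  # mark the request as handled
    return handled


with tempfile.TemporaryDirectory() as tmp:
    comm = Path(tmp) / "wandb_osh_communication"
    trigger_sync(comm, "/results/run_1/wandb")
    print(scan_and_sync(comm))  # ['/results/run_1/wandb']
```

Because the two sides only ever touch files on the shared file system, no network connection between compute node and head node is needed.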
## 📦 Installation

```bash
pip3 install wandb-osh
```

If you want to use this package with `ray`, use

```bash
pip3 install 'wandb-osh[ray]'
```
## 🔥 Running it!

Two steps: set up the hook, then run the script from your head node.

### Step 1: Setting up the hook

#### With wandb

Let's adapt the simple pytorch example from the wandb docs (it only takes 3 new lines!):
```python
import wandb
import torch.nn.functional as F

from wandb_osh.hooks import TriggerWandbSyncHook  # <-- New!

trigger_sync = TriggerWandbSyncHook()  # <-- New!

wandb.init(config=args, mode="offline")
model = ...  # set up your model

# Magic
wandb.watch(model, log_freq=100)

model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()
    if batch_idx % args.log_interval == 0:
        wandb.log({"loss": loss})
        trigger_sync()  # <-- New!
```
#### With ray tune

You probably already use the `WandbLoggerCallback` callback. We simply add a second callback for `wandb-osh` (it only takes one new line!):
```python
import os

from wandb_osh.ray_hooks import TriggerWandbSyncRayHook

os.environ["WANDB_MODE"] = "offline"

callbacks = [
    WandbLoggerCallback(...),  # <-- the ray tune documentation tells you about this
    TriggerWandbSyncRayHook(),  # <-- New!
]

tuner = tune.Tuner(
    trainable,
    tune_config=...,
    run_config=RunConfig(
        ...,
        callbacks=callbacks,
    ),
)
```
#### With anything else

Simply take the `TriggerWandbSyncHook` class and use it as a callback in your training loop (as in the `wandb` example above), passing the directory that `wandb` is syncing to as an argument.
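The generic pattern might look like the sketch below. Note this is a sketch only: the stub class stands in for `wandb_osh.hooks.TriggerWandbSyncHook` so the snippet runs without the package, and the `communication_dir` keyword, the file name it writes, and the `wandb_dir` path are all assumptions for illustration:

```python
import tempfile
from pathlib import Path


class TriggerWandbSyncHook:
    """Stub standing in for wandb_osh.hooks.TriggerWandbSyncHook (illustration only)."""

    def __init__(self, communication_dir: Path):  # assumed keyword, for illustration
        self.communication_dir = Path(communication_dir)
        self.communication_dir.mkdir(parents=True, exist_ok=True)

    def __call__(self, wandb_dir: str) -> None:
        # The real hook signals wandb-osh on the head node to sync wandb_dir.
        (self.communication_dir / "sync.command").write_text(str(wandb_dir))


# --- generic training loop ---
comm_dir = Path(tempfile.mkdtemp())  # stand-in for ~/.wandb_osh_communication
trigger_sync = TriggerWandbSyncHook(communication_dir=comm_dir)

wandb_dir = "/results/my_run/wandb"  # wherever wandb writes the offline run
for epoch in range(3):
    ...  # your training and logging code
    trigger_sync(wandb_dir)  # <-- call once per epoch
```

With the real package, you would drop the stub class and import `TriggerWandbSyncHook` from `wandb_osh.hooks` instead.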
### Step 2: Running the script on the head node

After installation, you should have a `wandb-osh` script in your `$PATH`. Simply call it like this:

```bash
wandb-osh
```
The output will look something like this:

```text
INFO: Starting to watch /home/kl5675/.wandb_osh_command_dir
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/b1f60706 ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/92a3ef1b ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/a2caa9c0 ... done.
```
Take a look at `wandb-osh --help` or check the documentation for all command line options.

You can pass options to the `wandb sync` call by placing them after `--`. For example:

```bash
wandb-osh -- --sync-all
```
## ❓ Q & A

**I get the warning "wandb: NOTE: use wandb sync --sync-all to sync 1 unsynced runs from local directory."**

You can start `wandb-osh` as `wandb-osh -- --sync-all` to always synchronize all available runs.
## 🧰 Development setup

```bash
pip3 install pre-commit
pre-commit install
```
## 💖 Contributing

Your help is greatly appreciated! Suggestions, bug reports, and feature requests are best opened as GitHub issues. You are also very welcome to submit a pull request!

Bug reports and pull requests are credited with the help of the allcontributors bot.