Jupyter extension for interactive distributed PyTorch training
nbdistributed
A library for distributed PyTorch execution in Jupyter notebooks with seamless REPL-like behavior.
In Development Note:
This library is being built to support my new course and, as a result, changes constantly. For now it is "stable enough", but as I find new features the course needs, I will have to expand the framework.
As a result, it is not open to contributions at this time.
Features
- Seamless Distributed Execution: Run PyTorch code across multiple GPUs directly from Jupyter notebooks
- REPL-like Behavior: See results immediately without explicit print statements
- Automatic GPU Management: Smart allocation of GPUs to worker processes
- Interactive Development: Real-time feedback and error reporting
- IDE Support: Namespace synchronization for code completion and type hints
- Robust Process Management: Graceful startup, monitoring, and shutdown
Installation
pip install nbdistributed
Quick Start
- Import and initialize in your Jupyter notebook:
%load_ext nbdistributed
%dist_init -n 4 # Start 4 worker processes
- Run code on all workers:
import torch
print(f"Rank {rank} running on {torch.cuda.get_device_name()}")
- Run code on specific ranks:
%%rank[0,1]
print(f"Running on rank {rank}")
Architecture
The library consists of four main components:
1. Magic Commands (magic.py)
- Provides IPython magic commands for interaction
- Manages automatic distributed execution
- Handles namespace synchronization
- Key commands:
- %dist_init: Initialize workers
- %%distributed: Execute on all ranks
- %%rank[n]: Execute on specific ranks
- %sync: Synchronize workers
- %dist_status: Show worker status
- %dist_mode: Toggle automatic mode
- %dist_shutdown: Clean shutdown
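How a header such as `%%rank[0,1]` is turned into a set of target workers is not spelled out here. The sketch below shows one way a bracketed rank list could be parsed and validated. The helper name `parse_rank_spec` and its error messages are hypothetical and are not taken from nbdistributed's source.

```python
import re
from typing import List

# Hypothetical helper for illustration; not nbdistributed's actual parser.
# Accepts a bracketed, comma-separated list of non-negative integers.
_RANK_SPEC = re.compile(r"^\[\s*(\d+(?:\s*,\s*\d+)*)\s*\]$")


def parse_rank_spec(spec: str, world_size: int) -> List[int]:
    """Turn a spec such as "[0,1]" into a sorted list of unique, valid ranks."""
    match = _RANK_SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"Malformed rank spec: {spec!r}")
    # int() tolerates surrounding whitespace, so " 0 " parses as 0.
    ranks = sorted({int(part) for part in match.group(1).split(",")})
    out_of_range = [r for r in ranks if r >= world_size]
    if out_of_range:
        raise ValueError(f"Ranks {out_of_range} are outside 0..{world_size - 1}")
    return ranks
```

Rejecting out-of-range ranks up front means a typo in the cell header fails immediately instead of waiting on a worker that does not exist.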
2. Worker Process (worker.py)
- Runs on each GPU/CPU
- Executes distributed PyTorch code
- Maintains isolated Python namespace
- Features:
- REPL-like output capturing
- Error handling and reporting
- GPU device management
- Namespace synchronization
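REPL-like output capturing can be approximated with the standard library alone. The sketch below is an assumption about the general technique, not the library's actual worker code: it splits off a trailing expression with `ast`, executes the rest, and echoes the expression's value the way an interactive prompt would. The function name `run_cell` is hypothetical.

```python
import ast
import contextlib
import io


def run_cell(source: str, namespace: dict) -> str:
    """Execute a cell in `namespace` and return everything it printed.

    If the cell ends with a bare expression, its repr is appended to the
    output, mimicking an interactive prompt.
    """
    tree = ast.parse(source, mode="exec")
    last_expr = None
    # Split off a trailing expression so it can be evaluated separately.
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        last_expr = ast.Expression(tree.body.pop().value)
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(tree, "<cell>", "exec"), namespace)
        if last_expr is not None:
            value = eval(compile(last_expr, "<cell>", "eval"), namespace)
            if value is not None:  # a prompt does not echo None
                print(repr(value))
    return buffer.getvalue()
```

Because the namespace dictionary is passed in by the caller, variables defined in one cell remain available to the next, which matches the isolated per-worker namespace described above.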
3. Process Manager (process_manager.py)
- Manages worker lifecycle
- Handles GPU assignments
- Monitors process health
- Provides:
- Clean process startup
- Status monitoring
- Graceful shutdown
- GPU utilization tracking
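The lifecycle above (start, monitor, shut down) can be sketched with `subprocess`. This is a minimal illustration under stated assumptions, not the contents of process_manager.py: the class name `WorkerPool`, the stand-in worker command that only sleeps, and the use of the `RANK`, `WORLD_SIZE`, and `CUDA_VISIBLE_DEVICES` environment variables to configure each worker are all choices made for this example.

```python
import os
import subprocess
import sys
from typing import Dict, List, Optional

# Stand-in worker for illustration: each "worker" simply sleeps.
WORKER_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


class WorkerPool:
    """Start, monitor, and stop a fixed set of worker processes."""

    def __init__(self) -> None:
        self.procs: Dict[int, subprocess.Popen] = {}

    def start(self, num_workers: int, gpu_ids: Optional[List[int]] = None) -> None:
        for rank in range(num_workers):
            env = dict(os.environ, RANK=str(rank), WORLD_SIZE=str(num_workers))
            if gpu_ids:
                # Assumes one GPU id per rank; the worker then sees only that GPU.
                env["CUDA_VISIBLE_DEVICES"] = str(gpu_ids[rank])
            self.procs[rank] = subprocess.Popen(WORKER_CMD, env=env)

    def status(self) -> Dict[int, str]:
        # poll() returns None while a process is still running.
        return {
            rank: "running" if proc.poll() is None else f"exited({proc.returncode})"
            for rank, proc in self.procs.items()
        }

    def shutdown(self, timeout: float = 5.0) -> None:
        for proc in self.procs.values():
            if proc.poll() is None:
                proc.terminate()  # polite request first
        for proc in self.procs.values():
            try:
                proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                proc.kill()  # force-stop a worker that ignored terminate()
                proc.wait()
        self.procs.clear()
```

Terminating first and killing only after a timeout gives well-behaved workers a chance to release GPU memory before they exit.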
4. Communication Manager (communication.py)
- Coordinates inter-process communication
- Uses ZMQ for efficient messaging
- Features:
- Asynchronous message handling
- Reliable message delivery
- Timeout management
- Worker targeting
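Worker targeting and timeout handling can be illustrated without a messaging dependency. The sketch below deliberately swaps in `queue.Queue` and threads from the standard library in place of ZMQ sockets and worker processes, so it shows only the control flow, not the library's wire protocol. The `Coordinator` class and its method names are hypothetical.

```python
import queue
import threading
from typing import Dict, Iterable, Optional


class Coordinator:
    """Toy message hub: one inbox per rank and one shared reply queue."""

    def __init__(self, world_size: int) -> None:
        self.inboxes = {rank: queue.Queue() for rank in range(world_size)}
        self.replies: "queue.Queue" = queue.Queue()

    def start(self) -> None:
        # Daemon threads stand in for worker processes in this sketch.
        for rank in self.inboxes:
            threading.Thread(
                target=self._worker_loop, args=(rank,), daemon=True
            ).start()

    def _worker_loop(self, rank: int) -> None:
        while True:
            msg = self.inboxes[rank].get()  # block until a message arrives
            self.replies.put((rank, f"rank {rank} ran: {msg}"))

    def send(
        self,
        msg: str,
        ranks: Optional[Iterable[int]] = None,
        timeout: float = 2.0,
    ) -> Dict[int, str]:
        """Send `msg` to the given ranks (default: all) and collect replies."""
        targets = sorted(self.inboxes) if ranks is None else sorted(set(ranks))
        for rank in targets:
            self.inboxes[rank].put(msg)
        results: Dict[int, str] = {}
        for _ in targets:
            try:
                # Wait up to `timeout` seconds for each outstanding reply.
                rank, reply = self.replies.get(timeout=timeout)
            except queue.Empty:
                missing = sorted(set(targets) - set(results))
                raise TimeoutError(f"No reply from ranks {missing}") from None
            results[rank] = reply
        return results
```

Keying the results by rank lets the caller report which workers answered and, on timeout, which ones did not.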
Usage Examples
Basic Distributed Training
%dist_init -n 2 # Start 2 workers
import torch
import torch.distributed as dist
# Create tensor on each GPU
x = torch.randn(100, 100).cuda()
# All-reduce across GPUs
dist.all_reduce(x)
print(f"Rank {rank}: {x.mean():.3f}") # Same value on all ranks
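Every rank prints the same mean because `dist.all_reduce` defaults to a sum: each rank's tensor is replaced by the elementwise total across all ranks. The plain-Python sketch below simulates that behavior on lists so it can be checked without GPUs. The function `simulated_all_reduce` is illustrative only and is not part of PyTorch or nbdistributed.

```python
from typing import List


def simulated_all_reduce(per_rank_values: List[List[float]]) -> List[List[float]]:
    """Return what each rank holds after a SUM all-reduce.

    `per_rank_values[r]` is the data held by rank r before the call.
    Afterwards every rank holds the same elementwise total.
    """
    total = [sum(column) for column in zip(*per_rank_values)]
    # Each rank gets its own copy of the reduced result.
    return [list(total) for _ in per_rank_values]
```

With two ranks holding `[1.0, 2.0]` and `[3.0, 4.0]`, both end up with `[4.0, 6.0]`, so any statistic computed from the result agrees across ranks.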
Selective Execution
%%rank[0]
# Only runs on rank 0
model = torch.nn.Linear(10, 10).cuda()
print("Model created on rank 0")
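# In another cell (runs on all ranks):
# Broadcast model parameters from rank 0 to all ranks.
# Note: every rank needs its own `model` with the same architecture before
# this cell runs; the cell above defines it on rank 0 only.
for param in model.parameters():
    dist.broadcast(param.data, src=0)
print(f"Rank {rank} received model")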
GPU Information
%dist_status
# Shows:
# - Process status
# - GPU assignments
# - Memory usage
# - Device names
Advanced Features
1. GPU Assignment
Specify exact GPU-to-rank mapping:
%dist_init -n 4 -g "0,1,2,3" # Assign specific GPUs
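As an illustration of what a GPU-to-rank mapping looks like, the sketch below turns a string such as `"0,1,2,3"` into a dictionary where the i-th GPU id goes to rank i. The helper name `assign_gpus` is hypothetical, and raising an error when fewer GPUs than workers are listed is an assumption of this sketch; the library may handle that case differently.

```python
from typing import Dict


def assign_gpus(num_workers: int, gpu_spec: str) -> Dict[int, int]:
    """Map rank i to the i-th GPU id in a comma-separated spec."""
    gpu_ids = [int(token) for token in gpu_spec.split(",") if token.strip()]
    if len(gpu_ids) < num_workers:
        # Assumption for this sketch: refuse rather than share GPUs silently.
        raise ValueError(
            f"{num_workers} workers requested but only {len(gpu_ids)} GPU ids given"
        )
    return {rank: gpu_ids[rank] for rank in range(num_workers)}
```

The order of the ids matters: `assign_gpus(2, "3,1")` places rank 0 on GPU 3 and rank 1 on GPU 1.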
2. Namespace Synchronization
The library automatically syncs worker namespaces to enable IDE features:
- Code completion
- Type hints
- Variable inspection
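One plausible shape for the information sent back to the notebook is a summary of each worker variable's name and type. The sketch below is a guess at the general idea rather than the library's mechanism; `summarize_namespace` is a hypothetical helper.

```python
import types
from typing import Dict


def summarize_namespace(namespace: dict) -> Dict[str, str]:
    """Describe user-visible variables as {name: type name}."""
    summary = {}
    for name, value in namespace.items():
        # Skip private names and imported modules; they add noise to completion.
        if name.startswith("_") or isinstance(value, types.ModuleType):
            continue
        summary[name] = type(value).__name__
    return summary
```

Sending type names instead of the objects themselves keeps the payload small and avoids moving GPU tensors back to the notebook process.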
3. Error Handling
Errors are caught and reported with:
- Full traceback
- Rank information
- GPU context
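A report carrying those three pieces of information could be assembled as follows. This is a minimal sketch, assuming the worker knows its own rank and device string; the function name `run_with_report` and the dictionary keys are hypothetical, not the library's message format.

```python
import traceback
from typing import Dict


def run_with_report(source: str, namespace: dict, rank: int, device: str) -> Dict[str, object]:
    """Run code and return a status report tagged with rank and device."""
    report: Dict[str, object] = {"rank": rank, "device": device}
    try:
        exec(source, namespace)
    except Exception as exc:
        report["status"] = "error"
        report["error"] = type(exc).__name__
        # format_exc() captures the full traceback of the active exception.
        report["traceback"] = traceback.format_exc()
    else:
        report["status"] = "ok"
    return report
```

Catching the exception inside the worker keeps the process alive, so one failing cell does not take down the whole distributed session.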
4. Process Recovery
To recover from errors, reset the environment and start fresh workers:
%dist_reset # Complete environment reset
%dist_init # Start fresh