Jupyter extension for interactive distributed PyTorch training

nbdistributed

A library for distributed PyTorch execution in Jupyter notebooks with seamless REPL-like behavior.

In Development Note:

This library is being built to support my new course, so it changes constantly. It is "stable enough" for now, but I will keep expanding the framework as the course needs new features.

As a result, it is not open to contributions at this time.

Features

  • Seamless Distributed Execution: Run PyTorch code across multiple GPUs directly from Jupyter notebooks
  • REPL-like Behavior: See results immediately without explicit print statements
  • Automatic GPU Management: Smart allocation of GPUs to worker processes
  • Interactive Development: Real-time feedback and error reporting
  • IDE Support: Namespace synchronization for code completion and type hints
  • Robust Process Management: Graceful startup, monitoring, and shutdown

Installation

pip install nbdistributed

Quick Start

  1. Import and initialize in your Jupyter notebook:
%load_ext nbdistributed
%dist_init -n 4  # Start 4 worker processes
  2. Run code on all workers:
import torch
print(f"Rank {rank} running on {torch.cuda.get_device_name()}")
  3. Run code on specific ranks:
%%rank[0,1]
print(f"Running on rank {rank}")

Architecture

The library consists of four main components:

1. Magic Commands (magic.py)

  • Provides IPython magic commands for interaction
  • Manages automatic distributed execution
  • Handles namespace synchronization
  • Key commands:
    • %dist_init: Initialize workers
    • %%distributed: Execute on all ranks
    • %%rank[n]: Execute on specific ranks
    • %sync: Synchronize workers
    • %dist_status: Show worker status
    • %dist_mode: Toggle automatic mode
    • %dist_shutdown: Clean shutdown
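
To make the %%rank[n] syntax concrete, here is a minimal sketch of how a selector such as [0,1] could be turned into a list of ranks. The helper parse_rank_spec and its world_size parameter are hypothetical and exist only for illustration; the parser in magic.py may accept other forms and validate differently.

```python
def parse_rank_spec(spec, world_size):
    """Turn a selector like "[0,1]" into a sorted list of unique ranks.

    Hypothetical helper for illustration; not part of nbdistributed's API.
    """
    text = spec.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"expected a selector like [0,1], got {spec!r}")
    ranks = set()
    for item in text[1:-1].split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma such as "[0, 1,]"
        rank = int(item)
        if not 0 <= rank < world_size:
            raise ValueError(f"rank {rank} is outside 0..{world_size - 1}")
        ranks.add(rank)
    if not ranks:
        raise ValueError("no ranks selected")
    return sorted(ranks)


print(parse_rank_spec("[0,1]", world_size=4))  # [0, 1]
```

Rejecting out-of-range ranks before any code is sent keeps a typo in the selector from silently running a cell on fewer workers than intended.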

2. Worker Process (worker.py)

  • Runs on each GPU/CPU
  • Executes distributed PyTorch code
  • Maintains isolated Python namespace
  • Features:
    • REPL-like output capturing
    • Error handling and reporting
    • GPU device management
    • Namespace synchronization
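
REPL-like output capturing is commonly built by splitting off a cell's final expression and evaluating it separately. The sketch below shows that general technique with the standard library; run_cell is a hypothetical name, and worker.py may well do this differently.

```python
import ast
import contextlib
import io


def run_cell(source, namespace):
    """Run a cell and return (captured stdout, repr of the last expression).

    Illustrative only: a hypothetical helper, not nbdistributed's worker code.
    """
    tree = ast.parse(source, mode="exec")
    last_expr = None
    # If the cell ends with a bare expression, evaluate it separately so its
    # value can be shown, the way an interactive prompt would.
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        last_expr = ast.Expression(tree.body.pop().value)

    buffer = io.StringIO()
    value = None
    with contextlib.redirect_stdout(buffer):
        exec(compile(tree, "<cell>", "exec"), namespace)
        if last_expr is not None:
            value = eval(compile(last_expr, "<cell>", "eval"), namespace)
    return buffer.getvalue(), None if value is None else repr(value)


namespace = {}
printed, shown = run_cell("x = 2\nprint('hi')\nx + 1", namespace)
print(printed, shown)
```

Because the namespace dict is passed in and mutated, variables defined in one cell stay available to the next, which is what gives each worker its own persistent state.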

3. Process Manager (process_manager.py)

  • Manages worker lifecycle
  • Handles GPU assignments
  • Monitors process health
  • Provides:
    • Clean process startup
    • Status monitoring
    • Graceful shutdown
    • GPU utilization tracking
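
The lifecycle described above (start, monitor, shut down gracefully) can be sketched with the standard subprocess module. The function names are hypothetical, and the child processes only sleep as stand-ins for real workers; process_manager.py is not shown in this document and may be structured differently.

```python
import subprocess
import sys


def start_workers(num_workers):
    """Launch placeholder worker processes, one per rank."""
    # Each child just sleeps, standing in for a long-running worker.
    command = [sys.executable, "-c", "import time; time.sleep(60)"]
    return {rank: subprocess.Popen(command) for rank in range(num_workers)}


def worker_status(workers):
    """Report 'running' or 'exited(<code>)' for each rank."""
    status = {}
    for rank, proc in workers.items():
        code = proc.poll()  # None while the process is still alive
        status[rank] = "running" if code is None else f"exited({code})"
    return status


def shutdown_workers(workers, grace_seconds=5.0):
    """Ask every worker to stop, then force-kill any that do not exit in time."""
    for proc in workers.values():
        proc.terminate()
    for proc in workers.values():
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    workers = start_workers(2)
    print(worker_status(workers))
    shutdown_workers(workers)
    print(worker_status(workers))
```

Terminating first and killing only after a grace period gives workers a chance to release resources before they are forced to stop.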

4. Communication Manager (communication.py)

  • Coordinates inter-process communication
  • Uses ZMQ for efficient messaging
  • Features:
    • Asynchronous message handling
    • Reliable message delivery
    • Timeout management
    • Worker targeting
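
Worker targeting and timeout management can be illustrated without ZMQ at all. The sketch below deliberately swaps in standard-library queues and threads for sockets and processes, so it shows only the control flow (send to selected ranks, wait for one reply each, give up at a deadline) and not how communication.py is implemented. All names are hypothetical.

```python
import queue
import threading
import time


def worker_loop(rank, inbox, outbox):
    """Reply to each message until a None sentinel arrives."""
    while True:
        message = inbox.get()
        if message is None:
            break
        outbox.put((rank, f"rank {rank} handled {message!r}"))


def send_to_ranks(inboxes, outbox, message, target_ranks, timeout=2.0):
    """Send message to the chosen ranks and wait for one reply from each."""
    for rank in target_ranks:
        inboxes[rank].put(message)
    replies = {}
    deadline = time.monotonic() + timeout
    while len(replies) < len(target_ranks):
        remaining = deadline - time.monotonic()
        try:
            if remaining <= 0:
                raise queue.Empty
            rank, text = outbox.get(timeout=remaining)
        except queue.Empty:
            missing = sorted(set(target_ranks) - set(replies))
            raise TimeoutError(f"no reply from ranks {missing}") from None
        replies[rank] = text
    return replies


def demo():
    """Start three in-process workers and message only ranks 0 and 2."""
    inboxes = {rank: queue.Queue() for rank in range(3)}
    outbox = queue.Queue()
    threads = [
        threading.Thread(target=worker_loop, args=(rank, inboxes[rank], outbox))
        for rank in inboxes
    ]
    for thread in threads:
        thread.start()
    replies = send_to_ranks(inboxes, outbox, "x = 1", target_ranks=[0, 2])
    for inbox in inboxes.values():
        inbox.put(None)  # tell every worker to stop
    for thread in threads:
        thread.join()
    return replies


print(demo())
```

Using a single deadline for the whole batch, rather than a fresh timeout per reply, bounds the total wait no matter how many ranks are targeted.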

Usage Examples

Basic Distributed Training

%dist_init -n 2  # Start 2 workers

import torch
import torch.distributed as dist

# Create tensor on each GPU
x = torch.randn(100, 100).cuda()

# All-reduce across GPUs
dist.all_reduce(x)
print(f"Rank {rank}: {x.mean():.3f}")  # Same value on all ranks
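
Every rank prints the same mean because dist.all_reduce defaults to a sum reduction: each rank ends up holding the elementwise sum of all ranks' tensors. The pure-Python simulation below mimics that result with plain lists and no real communication; simulated_all_reduce is a hypothetical name used only here.

```python
def simulated_all_reduce(per_rank_values):
    """Return what each rank holds after a sum all-reduce.

    per_rank_values[i] is the list of numbers held by rank i before the call.
    """
    total = [sum(column) for column in zip(*per_rank_values)]
    # Every rank receives its own copy of the same summed result.
    return [list(total) for _ in per_rank_values]


before = [[1.0, 2.0], [10.0, 20.0]]  # rank 0 and rank 1
after = simulated_all_reduce(before)
print(after)  # [[11.0, 22.0], [11.0, 22.0]]
```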

Selective Execution

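%%rank[0]
# Only runs on rank 0
model = torch.nn.Linear(10, 10).cuda()
print("Model created on rank 0")

# In another cell (runs on all ranks):
# Every rank needs its own `model` object to receive into, so the other
# ranks must have created one in an earlier cell before this runs.
# Broadcast model parameters from rank 0 to all ranks
for param in model.parameters():
    dist.broadcast(param.data, src=0)
print(f"Rank {rank} received model")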

GPU Information

%dist_status
# Shows:
# - Process status
# - GPU assignments
# - Memory usage
# - Device names

Advanced Features

1. GPU Assignment

Specify exact GPU-to-rank mapping:

%dist_init -n 4 -g "0,1,2,3"  # Assign specific GPUs
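
As a rough illustration of what such a mapping means, the sketch below turns the string passed to -g into a rank-to-GPU dictionary. It assumes the i-th ID in the list goes to rank i and that each rank gets exactly one GPU; both are assumptions made for this example, and build_gpu_map is a hypothetical helper rather than the library's code.

```python
def build_gpu_map(num_workers, gpu_spec):
    """Map each rank to one GPU ID taken from a comma-separated string.

    Assumption for illustration: position i in the string belongs to rank i.
    """
    gpu_ids = [int(part) for part in gpu_spec.split(",") if part.strip()]
    if len(gpu_ids) < num_workers:
        raise ValueError(
            f"{num_workers} workers requested but only {len(gpu_ids)} GPU IDs given"
        )
    return {rank: gpu_ids[rank] for rank in range(num_workers)}


print(build_gpu_map(2, "2,3"))  # {0: 2, 1: 3}
```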

2. Namespace Synchronization

The library automatically syncs worker namespaces to enable IDE features:

  • Code completion
  • Type hints
  • Variable inspection
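
One plausible way to support these features is for each worker to send back a lightweight summary of its namespace instead of the objects themselves. The sketch below illustrates that idea only; summarize_namespace is a hypothetical helper, and the document does not describe the library's actual synchronization format.

```python
import types


def summarize_namespace(namespace):
    """Return {variable name: type name} for user-visible variables."""
    summary = {}
    for name, value in namespace.items():
        # Skip private names and imported modules; they add noise to completion.
        if name.startswith("_") or isinstance(value, types.ModuleType):
            continue
        summary[name] = type(value).__name__
    return summary


worker_namespace = {"lr": 0.01, "epochs": 3, "_cache": {}, "types": types}
print(summarize_namespace(worker_namespace))  # {'lr': 'float', 'epochs': 'int'}
```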

3. Error Handling

Errors are caught and reported with:

  • Full traceback
  • Rank information
  • GPU context
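
A minimal sketch of how a worker could attach rank information to a captured traceback is shown below. It uses only the standard traceback module and leaves GPU context out; run_with_report and the shape of the returned dictionary are assumptions made for this example.

```python
import traceback


def run_with_report(source, namespace, rank):
    """Execute source; return None on success or an error report on failure."""
    try:
        exec(compile(source, f"<rank {rank} cell>", "exec"), namespace)
    except Exception as exc:
        return {
            "rank": rank,
            "error": type(exc).__name__,
            "traceback": traceback.format_exc(),  # full traceback as text
        }
    return None


report = run_with_report("1 / 0", {}, rank=1)
print(f"Rank {report['rank']} failed with {report['error']}")
print(report["traceback"])
```

Returning the traceback as text keeps the report easy to send between processes, since exception objects themselves are not always serializable.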

4. Process Recovery

The library provides robust error recovery:

%dist_reset    # Complete environment reset
%dist_init     # Start fresh
