Jupyter extension for interactive distributed PyTorch training

nbdistributed

A library for distributed PyTorch execution in Jupyter notebooks with seamless REPL-like behavior.

In Development Note:

This library is being built to support my new course, so it changes constantly. For now it is "stable enough", but I will expand the framework as I find new features the course needs.

As a result it is not open to contributions at this time.

Features

Seamless Distributed Execution: Run PyTorch code across multiple GPUs directly from Jupyter notebooks
REPL-like Behavior: See results immediately without explicit print statements
Automatic GPU Management: Smart allocation of GPUs to worker processes
Interactive Development: Real-time feedback and error reporting
IDE Support: Namespace synchronization for code completion and type hints
Robust Process Management: Graceful startup, monitoring, and shutdown

Installation

pip install nbdistributed

Quick Start

Import and initialize in your Jupyter notebook:

%load_ext nbdistributed
%dist_init -n 4  # Start 4 worker processes

Run code on all workers:

import torch
print(f"Rank {rank} running on {torch.cuda.get_device_name()}")

Run code on specific ranks:

%%rank[0,1]
print(f"Running on rank {rank}")

Architecture

The library consists of four main components:

1. Magic Commands (`magic.py`)

Provides IPython magic commands for interaction
Manages automatic distributed execution
Handles namespace synchronization
Key commands:
- %dist_init: Initialize workers
- %%distributed: Execute on all ranks
- %%rank[n]: Execute on specific ranks
- %sync: Synchronize workers
- %dist_status: Show worker status
- %dist_mode: Toggle automatic mode
- %dist_shutdown: Clean shutdown
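
As a rough illustration of rank targeting, the sketch below shows one way a bracketed argument such as `[0,1]` could be turned into a list of worker ranks. The helper name `parse_rank_spec` and its `world_size` parameter are hypothetical and are not taken from `magic.py`; the actual parsing may differ.

```python
import re
from typing import List


def parse_rank_spec(spec: str, world_size: int) -> List[int]:
    """Parse a spec such as "[0,1]" into a sorted list of unique ranks.

    Hypothetical helper for illustration only; not nbdistributed's API.
    """
    match = re.fullmatch(r"\s*\[([\d\s,]*)\]\s*", spec)
    if match is None:
        raise ValueError(f"expected a spec like '[0,1]', got {spec!r}")
    # Drop empty tokens (e.g. from a trailing comma) and de-duplicate.
    ranks = sorted({int(tok) for tok in match.group(1).split(",") if tok.strip()})
    if not ranks:
        raise ValueError("rank spec is empty")
    for r in ranks:
        if r >= world_size:
            raise ValueError(f"rank {r} is out of range for {world_size} workers")
    return ranks


print(parse_rank_spec("[0,1]", world_size=4))  # [0, 1]
```

Validating against the number of workers up front means a typo in the rank list fails immediately in the notebook instead of waiting on a worker that does not exist.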

2. Worker Process (`worker.py`)

Runs on each GPU/CPU
Executes distributed PyTorch code
Maintains isolated Python namespace
Features:
- REPL-like output capturing
- Error handling and reporting
- GPU device management
- Namespace synchronization
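
To make the "REPL-like output capturing" idea concrete, here is a minimal sketch of how a worker could run a cell in its own namespace and echo the value of a trailing expression, the way an interactive prompt does. It uses only the standard library; `run_cell` and `worker_namespace` are hypothetical names, and this is not necessarily how `worker.py` does it.

```python
import ast
import contextlib
import io


def run_cell(source: str, namespace: dict) -> str:
    """Execute `source` in `namespace` and return what a REPL would display."""
    tree = ast.parse(source, mode="exec")
    last_expr = None
    # If the cell ends with a bare expression, split it off so its value
    # can be evaluated and shown without an explicit print().
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        last_expr = ast.Expression(body=tree.body.pop().value)

    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(tree, "<cell>", "exec"), namespace)
        if last_expr is not None:
            value = eval(compile(last_expr, "<cell>", "eval"), namespace)
            if value is not None:
                print(repr(value))
    return buffer.getvalue()


worker_namespace = {"rank": 0}  # each worker would keep its own dict
print(run_cell("x = 20 + 1\nx * 2", worker_namespace), end="")  # 42
```

Because the namespace is an ordinary dict that persists between calls, variables defined in one cell stay available to later cells on the same worker.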

3. Process Manager (`process_manager.py`)

Manages worker lifecycle
Handles GPU assignments
Monitors process health
Provides:
- Clean process startup
- Status monitoring
- Graceful shutdown
- GPU utilization tracking
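
The lifecycle described above (startup, status monitoring, graceful shutdown) can be sketched with the standard `subprocess` module. This is an assumption-laden illustration: the function names are hypothetical, the workers are placeholder processes that simply sleep, and `process_manager.py` may be organized quite differently.

```python
import subprocess
import sys

# Placeholder worker: a real worker would run the execution loop instead.
WORKER_CODE = "import time; time.sleep(60)"


def start_workers(num_workers):
    """Launch one child Python process per rank."""
    return [
        subprocess.Popen([sys.executable, "-c", WORKER_CODE])
        for _ in range(num_workers)
    ]


def worker_status(procs):
    """Report 'running' or 'exited' for each rank."""
    return {
        rank: "running" if proc.poll() is None else "exited"
        for rank, proc in enumerate(procs)
    }


def shutdown_workers(procs, timeout=5.0):
    """Ask every worker to stop, then force-kill any that ignore the request."""
    for proc in procs:
        proc.terminate()
    for proc in procs:
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


procs = start_workers(2)
print(worker_status(procs))
shutdown_workers(procs)
print(worker_status(procs))
```

Terminating first and killing only after a timeout gives each worker a chance to release its resources before it is forced to stop.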

4. Communication Manager (`communication.py`)

Coordinates inter-process communication
Uses ZMQ for efficient messaging
Features:
- Asynchronous message handling
- Reliable message delivery
- Timeout management
- Worker targeting
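
The sketch below illustrates worker targeting and timeout handling. Note the substitution: it uses standard-library queues and threads as a stand-in for ZMQ sockets and separate processes, so that it runs without extra dependencies. The class and function names are hypothetical and do not reflect the contents of `communication.py`.

```python
import queue
import threading


class InProcessBus:
    """Toy message bus: one inbox per worker rank plus a shared reply queue."""

    def __init__(self, num_workers):
        self.inboxes = {rank: queue.Queue() for rank in range(num_workers)}
        self.replies = queue.Queue()

    def send(self, message, ranks=None):
        """Deliver `message` to the given ranks (all ranks if None)."""
        targets = list(self.inboxes) if ranks is None else list(ranks)
        for rank in targets:
            self.inboxes[rank].put(message)
        return targets

    def gather(self, ranks, timeout=2.0):
        """Collect one reply per rank; `timeout` applies to each wait."""
        pending, results = set(ranks), {}
        while pending:
            try:
                rank, payload = self.replies.get(timeout=timeout)
            except queue.Empty:
                raise TimeoutError(f"no reply from ranks {sorted(pending)}") from None
            results[rank] = payload
            pending.discard(rank)
        return results


def worker_loop(bus, rank):
    """Reply to every message until a None sentinel arrives."""
    while True:
        message = bus.inboxes[rank].get()
        if message is None:
            break
        bus.replies.put((rank, f"rank {rank} handled {message!r}"))


bus = InProcessBus(num_workers=3)
threads = [
    threading.Thread(target=worker_loop, args=(bus, r), daemon=True)
    for r in range(3)
]
for t in threads:
    t.start()

targets = bus.send("ping", ranks=[0, 2])  # rank 1 is not targeted
print(bus.gather(targets))

bus.send(None)  # sentinel: stop every worker
for t in threads:
    t.join()
```

Raising an error that names the silent ranks, instead of blocking forever, keeps the notebook responsive when a worker hangs or crashes.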

Usage Examples

Basic Distributed Training

%dist_init -n 2  # Start 2 workers

import torch
import torch.distributed as dist

# Create tensor on each GPU
x = torch.randn(100, 100).cuda()

# All-reduce across GPUs
dist.all_reduce(x)
print(f"Rank {rank}: {x.mean():.3f}")  # Same value on all ranks
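
To see why every rank prints the same mean, it helps to simulate the reduction without GPUs. `dist.all_reduce` sums by default (`ReduceOp.SUM`), so afterwards each rank holds the elementwise sum of all ranks' tensors. The pure-Python function below is a hypothetical stand-in that models this with lists; it does not call `torch.distributed`.

```python
def simulate_all_reduce_sum(per_rank_values):
    """Return what each rank would hold after a SUM all-reduce."""
    # Sum position by position across ranks...
    total = [sum(column) for column in zip(*per_rank_values)]
    # ...then hand every rank its own copy of the result.
    return [list(total) for _ in per_rank_values]


before = [[1.0, 2.0], [3.0, 4.0]]  # rank 0 and rank 1
after = simulate_all_reduce_sum(before)
print(after)  # [[4.0, 6.0], [4.0, 6.0]]
```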

Selective Execution

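%%rank[0]
# Only runs on rank 0
model = torch.nn.Linear(10, 10).cuda()
print("Model created on rank 0")

# In another cell (runs on all ranks):
# Broadcast model parameters from rank 0 to all ranks.
# Note: every rank needs its own `model` of the same shape before this cell runs;
# the cell above defines it on rank 0 only.
for param in model.parameters():
    dist.broadcast(param.data, src=0)
print(f"Rank {rank} received model")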

GPU Information

%dist_status
# Shows:
# - Process status
# - GPU assignments
# - Memory usage
# - Device names

Advanced Features

1. GPU Assignment

Specify exact GPU-to-rank mapping:

%dist_init -n 4 -g "0,1,2,3"  # Assign specific GPUs
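
As a sketch of what a GPU-to-rank mapping might look like, the function below turns a comma-separated GPU list into a rank-to-device table. The name `assign_gpus`, the round-robin fallback when there are more workers than GPUs, and the CPU fallback when there are none are all assumptions made for illustration; the library's actual policy is not documented here.

```python
from typing import Dict, Optional


def assign_gpus(
    num_workers: int, gpu_spec: Optional[str], visible_gpus: int
) -> Dict[int, Optional[int]]:
    """Map each rank to a GPU index, or to None when no GPU is available."""
    if gpu_spec:
        # Explicit mapping, e.g. "0,1,2,3".
        gpu_ids = [int(tok) for tok in gpu_spec.split(",") if tok.strip()]
    else:
        # No spec given: consider every visible GPU.
        gpu_ids = list(range(visible_gpus))
    if not gpu_ids:
        return {rank: None for rank in range(num_workers)}
    # Assumed policy: wrap around when workers outnumber GPUs.
    return {rank: gpu_ids[rank % len(gpu_ids)] for rank in range(num_workers)}


print(assign_gpus(4, "0,1,2,3", visible_gpus=4))  # {0: 0, 1: 1, 2: 2, 3: 3}
print(assign_gpus(4, None, visible_gpus=2))       # {0: 0, 1: 1, 2: 0, 3: 1}
```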

2. Namespace Synchronization

The library automatically syncs worker namespaces to enable IDE features:

Code completion
Type hints
Variable inspection

3. Error Handling

Errors are caught and reported with:

Full traceback
Rank information
GPU context
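
A report with these three pieces of information could be assembled roughly as follows. This is a minimal sketch using the standard `traceback` module; `run_and_report` and the dictionary keys are hypothetical and do not describe the library's actual message format.

```python
import traceback


def run_and_report(source: str, namespace: dict, rank: int, device: str) -> dict:
    """Run `source`; on failure, return the error with rank and device attached."""
    try:
        exec(source, namespace)
        return {"rank": rank, "ok": True}
    except Exception as exc:
        return {
            "rank": rank,
            "ok": False,
            "device": device,  # e.g. "cuda:1"; supplied by the caller
            "error": f"{type(exc).__name__}: {exc}",
            "traceback": traceback.format_exc(),
        }


report = run_and_report("1 / 0", {}, rank=1, device="cuda:1")
print(report["error"])  # ZeroDivisionError: division by zero
```

Returning the failure as data instead of letting the exception propagate means one failing rank can be reported alongside the results of the ranks that succeeded.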

4. Process Recovery

The library provides robust error recovery:

%dist_reset    # Complete environment reset
%dist_init     # Start fresh

Release history

0.1.0 (Jun 26, 2025, this version)
0.0.1 (May 29, 2025)

Download files

Source Distribution

nbdistributed-0.1.0.tar.gz (35.8 kB)

Uploaded Jun 26, 2025 Source

Built Distribution

nbdistributed-0.1.0-py3-none-any.whl (37.8 kB)

Uploaded Jun 26, 2025 Python 3

File details

Details for the file nbdistributed-0.1.0.tar.gz.

File metadata

Download URL: nbdistributed-0.1.0.tar.gz
Upload date: Jun 26, 2025
Size: 35.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for nbdistributed-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`426199be51a48f36a07fef79c532d03143dc1f942026693a3aae65c27a8cec16`
MD5	`0049ed0e0039c29507c75c71288f8ef3`
BLAKE2b-256	`5e6e0ae97b27b2a69555e3b3f907fc5d5749971c188889ed257cbf7ae5c919d0`

File details

Details for the file nbdistributed-0.1.0-py3-none-any.whl.

File metadata

Download URL: nbdistributed-0.1.0-py3-none-any.whl
Upload date: Jun 26, 2025
Size: 37.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for nbdistributed-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c5bb3a2f0cb6ed783c780671b4e95c914d8dbc5b8ec5312bf4835590242d074e`
MD5	`cbc34c116df6c37d66ec4122863bd54b`
BLAKE2b-256	`3a355e6bc502aedc3ac893980ea26debcd844c45217e18df10c992fb2516f8e1`
