DeepScale
A Python library for large-scale deep learning training across thousands of GPUs, designed for training massive models like Large Language Models (LLMs) and other billion-parameter architectures.
Features
- Massive Scale Training: Support for training across thousands of GPUs
- LLM Training: Specialized utilities for Large Language Model training
- Distributed Training: Advanced distributed training strategies
- Model Parallelism: Pipeline and tensor parallelism for massive models
- Memory Optimization: Techniques for training billion-parameter models
- Multi-Node Support: Cross-node communication and synchronization
- Gradient Scaling: Efficient gradient accumulation and synchronization
Installation
pip install deepscale
Quick Start
Large-Scale Training
import torch
import torch.nn as nn
from deepscale import DistributedTrainer, ModelParallel, get_device_info

# Create a large model (e.g., transformer-based LLM)
class LargeLanguageModel(nn.Module):
    def __init__(self, vocab_size=50000, d_model=2048, num_layers=24):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=16),
            num_layers=num_layers
        )
        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        x = self.transformer(x)
        return self.output_proj(x)

# Initialize large model
model = LargeLanguageModel()
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# Set up distributed training across thousands of GPUs
num_gpus = 1000  # Example: 1000 GPUs
device_ids = list(range(num_gpus))

# Initialize distributed trainer
trainer = DistributedTrainer(
    model=model,
    num_gpus=num_gpus,
    batch_size_per_gpu=4,
    gradient_accumulation_steps=8
)

# Calculate effective batch size
effective_batch_size = trainer.calculate_effective_batch_size()
print(f"Effective batch size across {num_gpus} GPUs: {effective_batch_size:,}")

# Set up model parallelism for massive models
model_parallel = ModelParallel(model, device_ids=device_ids)
parallel_model = model_parallel.setup_model_parallel()

# Get cluster information
device_info = get_device_info()
print(f"CUDA devices available: {device_info['cuda_device_count']}")
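The effective batch size printed above is conventionally the product of the number of data-parallel replicas, the per-GPU batch size, and the number of gradient accumulation steps. The sketch below works through that arithmetic in plain Python. It is an assumption that `calculate_effective_batch_size()` follows this convention; the helper `effective_batch_size_for` and its `model_parallel_size` parameter are hypothetical and not part of DeepScale.

```python
def effective_batch_size_for(num_gpus, batch_size_per_gpu,
                             gradient_accumulation_steps,
                             model_parallel_size=1):
    """Samples contributing to one optimizer step (conventional formula)."""
    if num_gpus % model_parallel_size != 0:
        raise ValueError("num_gpus must be divisible by model_parallel_size")
    # GPUs that cooperate on one model replica process the same samples,
    # so only the number of replicas multiplies the batch.
    data_parallel_replicas = num_gpus // model_parallel_size
    return (data_parallel_replicas
            * batch_size_per_gpu
            * gradient_accumulation_steps)


# Same settings as the Quick Start example: 1000 GPUs, 4 per GPU, 8 steps.
pure_dp = effective_batch_size_for(1000, 4, 8)
with_mp = effective_batch_size_for(1000, 4, 8, model_parallel_size=8)
print(f"Pure data parallelism: {pure_dp:,}")
print(f"With 8-way model parallelism: {with_mp:,}")
```

Under this convention, adding model parallelism shrinks the effective batch, because fewer independent replicas consume data per step.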
LLM Training at Scale
from deepscale import LLMTrainer, PipelineParallel, GradientScaling

# Initialize LLM trainer for massive scale
trainer = LLMTrainer(
    model_size="70B",  # 70 billion parameters
    num_gpus=2048,  # 2048 GPUs
    sequence_length=4096,
    batch_size_per_gpu=1,
    gradient_checkpointing=True
)

# Set up pipeline parallelism for the LLM
pipeline_parallel = PipelineParallel(
    model=trainer.model,
    num_stages=64,  # Split across 64 pipeline stages
    micro_batch_size=1
)

# Configure gradient scaling for stability
gradient_scaler = GradientScaling(
    initial_scale=2**16,
    growth_factor=2.0,
    backoff_factor=0.5
)

# Start training
print(f"Training {trainer.model_size} parameter model on {trainer.num_gpus} GPUs")
print(f"Effective batch size: {trainer.effective_batch_size:,}")
print(f"Pipeline stages: {pipeline_parallel.num_stages}")
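The `initial_scale`, `growth_factor`, and `backoff_factor` arguments resemble the dynamic loss scaling scheme commonly used in mixed-precision training (PyTorch's `torch.cuda.amp.GradScaler` takes similar parameters). In that scheme the loss is multiplied by a scale, the scale is reduced and the step skipped when gradients overflow, and the scale grows again after a run of clean steps. The following is a minimal pure-Python sketch of that logic, not DeepScale's implementation; the class `DynamicLossScale` and the `growth_interval` parameter are assumptions made for illustration.

```python
import math


class DynamicLossScale:
    """Toy dynamic loss scaler (hypothetical, not DeepScale's GradientScaling)."""

    def __init__(self, initial_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = float(initial_scale)
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval  # assumed parameter
        self._clean_steps = 0

    def update(self, grads):
        """Adjust the scale; return True if the optimizer step should run."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            # Shrink the scale and skip this step.
            self.scale *= self.backoff_factor
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            # Enough consecutive clean steps: try a larger scale.
            self.scale *= self.growth_factor
            self._clean_steps = 0
        return True


scaler = DynamicLossScale(growth_interval=2)
applied = [
    scaler.update([0.1, -0.3]),          # clean step
    scaler.update([float("inf"), 0.2]),  # overflow: scale halves, step skipped
    scaler.update([0.1, 0.1]),           # clean step
    scaler.update([0.2, 0.0]),           # second clean step: scale doubles
]
print(applied, scaler.scale)
```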
Memory Optimization for Massive Models
from deepscale import MemoryOptimizer, ModelSharding, ZeroOptimizer

# `trainer` is the LLMTrainer instance created in the previous example

# Optimize memory for billion-parameter models
memory_optimizer = MemoryOptimizer(
    model=trainer.model,
    offload_optimizer=True,
    offload_params=True,
    cpu_offload=True
)

# Shard model across multiple GPUs
model_sharding = ModelSharding(
    model=trainer.model,
    sharding_strategy="tensor_parallel",
    num_shards=8
)

# Zero redundancy optimizer
zero_optimizer = ZeroOptimizer(
    model=trainer.model,
    stage=2,  # ZeRO-2
    partition_optimizer=True,
    partition_gradients=True
)

# Calculate memory savings
memory_savings = memory_optimizer.calculate_memory_savings()
print(f"Memory savings: {memory_savings:.2f} GB")
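To get a feel for the numbers involved, the sketch below uses the commonly cited ZeRO accounting for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for fp32 optimizer state. Stage 1 partitions the optimizer state, stage 2 also partitions gradients, and stage 3 also partitions the weights. Activations, buffers, and fragmentation are ignored. The helper `zero_model_state_gb` is hypothetical, and nothing here is a claim about what `calculate_memory_savings()` actually computes.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory in GB (1 GB = 1e9 bytes)."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params_bytes = 2.0 * num_params   # fp16 weights
    grads_bytes = 2.0 * num_params    # fp16 gradients
    optim_bytes = 12.0 * num_params   # fp32 weights, momentum, variance
    if stage >= 1:
        optim_bytes /= num_gpus       # optimizer state is partitioned
    if stage >= 2:
        grads_bytes /= num_gpus       # gradients are partitioned too
    if stage >= 3:
        params_bytes /= num_gpus      # weights are partitioned too
    return (params_bytes + grads_bytes + optim_bytes) / 1e9


num_params = 70e9  # matches model_size="70B" in the LLM example
num_gpus = 2048    # assumes the data-parallel degree equals the GPU count
baseline = zero_model_state_gb(num_params, num_gpus, stage=0)
for stage in (1, 2, 3):
    per_gpu = zero_model_state_gb(num_params, num_gpus, stage)
    print(f"ZeRO-{stage}: {per_gpu:,.1f} GB per GPU "
          f"(baseline {baseline:,.0f} GB, saves {baseline - per_gpu:,.1f} GB)")
```

When pipeline or tensor parallelism is also in use, the data-parallel degree is smaller than the GPU count, so the per-GPU savings from partitioning shrink accordingly.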
API Reference
DistributedTrainer
- __init__(model, num_gpus, batch_size_per_gpu, gradient_accumulation_steps): Initialize distributed trainer
- calculate_effective_batch_size(): Calculate effective batch size across all GPUs
- setup_distributed_training(): Set up distributed training environment
- train_step(data, labels): Perform one training step
- get_training_stats(): Get training statistics
LLMTrainer
- __init__(model_size, num_gpus, sequence_length, batch_size_per_gpu): Initialize LLM trainer
- setup_llm_training(): Set up LLM-specific training configuration
- train_on_batch(batch): Train on a single batch
- get_model_size_info(): Get detailed model size information
PipelineParallel
- __init__(model, num_stages, micro_batch_size): Initialize pipeline parallelism
- setup_pipeline(): Set up pipeline stages
- forward_pipeline(input_data): Forward pass through pipeline
- get_pipeline_efficiency(): Calculate pipeline efficiency
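As a rough guide to what pipeline efficiency means, a GPipe-style synchronous schedule with p stages and m micro-batches keeps each stage busy for m of the m + p - 1 time slots, giving an ideal efficiency of m / (m + p - 1). The sketch below evaluates that formula, assuming equal stage times and no communication cost. Whether `get_pipeline_efficiency()` uses this estimate is an assumption; the helper `ideal_pipeline_efficiency` is hypothetical.

```python
def ideal_pipeline_efficiency(num_stages, num_micro_batches):
    """Fraction of time slots in which a stage does useful work."""
    if num_stages < 1 or num_micro_batches < 1:
        raise ValueError("num_stages and num_micro_batches must be >= 1")
    # Each stage idles for (num_stages - 1) slots while the pipeline
    # fills and drains; the remaining slots are productive.
    return num_micro_batches / (num_micro_batches + num_stages - 1)


# 64 stages, as in the LLM example, with varying micro-batch counts.
for micro_batches in (8, 64, 512):
    eff = ideal_pipeline_efficiency(64, micro_batches)
    print(f"{micro_batches:>3} micro-batches: {eff:.1%} efficient")
```

With many stages, the number of micro-batches per step has to be well above the stage count before the fill-and-drain overhead becomes small.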
MemoryOptimizer
- __init__(model, offload_optimizer, offload_params, cpu_offload): Initialize memory optimizer
- optimize_memory(): Apply memory optimizations
- calculate_memory_savings(): Calculate memory savings
- get_memory_usage(): Get current memory usage
ModelSharding
- __init__(model, sharding_strategy, num_shards): Initialize model sharding
- shard_model(): Shard model across devices
- get_shard_info(): Get sharding information
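One common form of tensor-parallel sharding splits a linear layer's weight matrix along its output dimension, lets each device compute its slice of the output, and then gathers the slices. The pure-Python sketch below illustrates that idea on a toy layer with 8 shards. It makes no claim about how `shard_model()` works internally; `matvec` and `shard_rows` are hypothetical helpers.

```python
def matvec(weight, x):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def shard_rows(weight, num_shards):
    """Split the output rows into contiguous, equally sized shards."""
    if len(weight) % num_shards != 0:
        raise ValueError("output dimension must be divisible by num_shards")
    size = len(weight) // num_shards
    return [weight[i * size:(i + 1) * size] for i in range(num_shards)]


# A toy layer: 8 output features, 4 input features.
weight = [[(r + 1) * (c + 1) for c in range(4)] for r in range(8)]
x = [1, 0, -1, 2]

shards = shard_rows(weight, num_shards=8)            # one row per "device"
partial_outputs = [matvec(shard, x) for shard in shards]
gathered = [y for part in partial_outputs for y in part]  # all-gather step

# The sharded computation reproduces the unsharded result.
assert gathered == matvec(weight, x)
print(gathered)
```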
ZeroOptimizer
- __init__(model, stage, partition_optimizer, partition_gradients): Initialize ZeRO optimizer
- setup_zero(): Set up ZeRO optimization
- get_zero_stats(): Get ZeRO optimization statistics
Requirements
- Python 3.8+
- PyTorch 1.9.0+
- NumPy 1.21.0+
- NCCL (for multi-GPU communication)
- CUDA 11.0+ (for GPU training)
- OpenMPI (for multi-node training)
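A quick way to check some of these requirements on a given machine is sketched below. It uses only the standard library plus PyTorch calls that exist in current releases (`torch.cuda.is_available()`, `torch.distributed.is_available()`, `torch.distributed.is_nccl_available()`), and it degrades gracefully when PyTorch is not installed. The function `check_environment` is a hypothetical helper, not part of DeepScale, and it does not verify NumPy, CUDA toolkit versions, or OpenMPI.

```python
import importlib
import sys


def check_environment(min_python=(3, 8)):
    """Collect a few facts relevant to the requirements list."""
    report = {
        "python_ok": sys.version_info >= min_python,
        "torch_version": None,
        "cuda_available": False,
        "nccl_available": False,
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return report  # PyTorch missing: leave the defaults in place
    report["torch_version"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    # Some builds ship without distributed support, so guard the NCCL check.
    if torch.distributed.is_available():
        report["nccl_available"] = torch.distributed.is_nccl_available()
    return report


report = check_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```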
Development
To install the development dependencies:
pip install -e ".[dev]"
Run tests:
pytest
Format code:
black src/
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Changelog
0.1.0 (2024-01-XX)
- Initial release
- Large-scale distributed training support
- LLM training utilities
- Pipeline and tensor parallelism
- Memory optimization for billion-parameter models
- Multi-node training capabilities