A Python library for large-scale deep learning training across thousands of GPUs for LLMs and other massive models

DeepScale

A Python library for large-scale deep learning training across thousands of GPUs, designed for training massive models like Large Language Models (LLMs) and other billion-parameter architectures.

Features

Massive Scale Training: Support for training across thousands of GPUs
LLM Training: Specialized utilities for Large Language Model training
Distributed Training: Advanced distributed training strategies
Model Parallelism: Pipeline and tensor parallelism for massive models
Memory Optimization: Techniques for training billion-parameter models
Multi-Node Support: Cross-node communication and synchronization
Gradient Scaling: Efficient gradient accumulation and synchronization

Installation

pip install deepscale

Quick Start

Large-Scale Training

import torch
import torch.nn as nn
from deepscale import DistributedTrainer, ModelParallel, get_device_info

# Create a large model (e.g., transformer-based LLM)
class LargeLanguageModel(nn.Module):
    def __init__(self, vocab_size=50000, d_model=2048, num_layers=24):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=16),
            num_layers=num_layers
        )
        self.output_proj = nn.Linear(d_model, vocab_size)
    
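    def forward(self, x):
        # x: token IDs of shape (seq_len, batch). nn.TransformerEncoderLayer
        # defaults to batch_first=False, so the sequence dimension comes first.
        x = self.embedding(x)        # (seq_len, batch, d_model)
        x = self.transformer(x)      # (seq_len, batch, d_model)
        return self.output_proj(x)   # (seq_len, batch, vocab_size)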

# Initialize large model
model = LargeLanguageModel()
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# Set up distributed training across thousands of GPUs
num_gpus = 1000  # Example: 1000 GPUs
device_ids = list(range(num_gpus))

# Initialize distributed trainer
trainer = DistributedTrainer(
    model=model,
    num_gpus=num_gpus,
    batch_size_per_gpu=4,
    gradient_accumulation_steps=8
)

# Calculate effective batch size
effective_batch_size = trainer.calculate_effective_batch_size()
print(f"Effective batch size across {num_gpus} GPUs: {effective_batch_size:,}")

# Set up model parallelism for massive models
model_parallel = ModelParallel(model, device_ids=device_ids)
parallel_model = model_parallel.setup_model_parallel()

# Get cluster information
device_info = get_device_info()
print(f"CUDA devices available: {device_info['cuda_device_count']}")
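
The effective batch size printed above can be checked by hand. The sketch below assumes pure data parallelism, where every GPU holds a full replica and processes its own micro-batch; the helper `effective_batch_size` is illustrative and is not part of the deepscale API, and what `calculate_effective_batch_size()` returns internally is not documented here.

```python
def effective_batch_size(num_gpus: int, batch_size_per_gpu: int,
                         gradient_accumulation_steps: int = 1) -> int:
    """Samples contributing to one optimizer step under pure data parallelism.

    Hypothetical helper for illustration only, not a deepscale function.
    """
    if min(num_gpus, batch_size_per_gpu, gradient_accumulation_steps) < 1:
        raise ValueError("all arguments must be positive integers")
    return num_gpus * batch_size_per_gpu * gradient_accumulation_steps


# Values taken from the Quick Start example: 1000 GPUs, 4 samples per GPU,
# 8 gradient accumulation steps.
print(f"{effective_batch_size(1000, 4, 8):,}")  # 32,000
```

When model or pipeline parallelism is also in use, only the number of data-parallel replicas (not the total GPU count) multiplies the batch size, so the figure would be smaller.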

LLM Training at Scale

from deepscale import LLMTrainer, PipelineParallel, GradientScaling

# Initialize LLM trainer for massive scale
trainer = LLMTrainer(
    model_size="70B",  # 70 billion parameters
    num_gpus=2048,     # 2048 GPUs
    sequence_length=4096,
    batch_size_per_gpu=1,
    gradient_checkpointing=True
)

# Set up pipeline parallelism for the LLM
pipeline_parallel = PipelineParallel(
    model=trainer.model,
    num_stages=64,  # Split across 64 pipeline stages
    micro_batch_size=1
)

# Configure gradient scaling for stability
gradient_scaler = GradientScaling(
    initial_scale=2**16,
    growth_factor=2.0,
    backoff_factor=0.5
)

# Start training
print(f"Training {trainer.model_size} parameter model on {trainer.num_gpus} GPUs")
print(f"Effective batch size: {trainer.effective_batch_size:,}")
print(f"Pipeline stages: {pipeline_parallel.num_stages}")
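
The `initial_scale`, `growth_factor`, and `backoff_factor` arguments above match the usual parameters of dynamic loss scaling for mixed-precision training. The sketch below shows that technique in plain Python under stated assumptions: the class `DynamicLossScaler` and its `growth_interval` parameter are hypothetical names chosen for illustration, and nothing here is confirmed to mirror how `GradientScaling` behaves internally.

```python
import math
from typing import Iterable


class DynamicLossScaler:
    """Illustrative dynamic loss scaler (not the deepscale implementation)."""

    def __init__(self, initial_scale: float = 2.0 ** 16,
                 growth_factor: float = 2.0, backoff_factor: float = 0.5,
                 growth_interval: int = 2000):
        self.scale = float(initial_scale)
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval  # assumed parameter
        self._good_steps = 0

    def update(self, grads: Iterable[float]) -> bool:
        """Adjust the scale; return True if the optimizer step should run."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow: shrink the scale and skip this step.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A long run of finite gradients: try a larger scale.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True


scaler = DynamicLossScaler(growth_interval=2)
print(scaler.update([0.1, float("inf")]), scaler.scale)  # False 32768.0
```

The loss is multiplied by `scale` before the backward pass and the gradients are divided by it before the optimizer step, which keeps small half-precision gradients from underflowing to zero.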

Memory Optimization for Massive Models

from deepscale import MemoryOptimizer, ModelSharding, ZeroOptimizer

# Optimize memory for billion-parameter models
# (`trainer` is the LLMTrainer instance created in the previous example)
memory_optimizer = MemoryOptimizer(
    model=trainer.model,
    offload_optimizer=True,
    offload_params=True,
    cpu_offload=True
)

# Shard model across multiple GPUs
model_sharding = ModelSharding(
    model=trainer.model,
    sharding_strategy="tensor_parallel",
    num_shards=8
)

# Zero redundancy optimizer
zero_optimizer = ZeroOptimizer(
    model=trainer.model,
    stage=2,  # ZeRO-2
    partition_optimizer=True,
    partition_gradients=True
)

# Calculate memory savings
memory_savings = memory_optimizer.calculate_memory_savings()
print(f"Memory savings: {memory_savings:.2f} GB")
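
To give a rough sense of what ZeRO partitioning saves, the sketch below applies the standard accounting for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. The function `zero_bytes_per_gpu` is a hypothetical helper, activations and buffers are ignored, and it is not known whether `calculate_memory_savings()` uses the same model.

```python
def zero_bytes_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Estimate per-GPU bytes for model state under ZeRO stages 0 to 3.

    Hypothetical helper; `num_gpus` is the data-parallel degree.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2.0 * num_params   # fp16 weights
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 weights, momentum, variance (Adam)
    if stage >= 1:
        optim /= num_gpus       # ZeRO-1 partitions optimizer state
    if stage >= 2:
        grads /= num_gpus       # ZeRO-2 also partitions gradients
    if stage >= 3:
        params /= num_gpus      # ZeRO-3 also partitions parameters
    return params + grads + optim


for stage in range(4):
    gb = zero_bytes_per_gpu(70e9, 2048, stage) / 1e9
    print(f"ZeRO-{stage}: {gb:,.1f} GB per GPU")
```

For the 70B example this gives about 1,120 GB per GPU without partitioning and about 140.5 GB at stage 2, treating all 2048 GPUs as data-parallel ranks, which is a simplification when pipeline or tensor parallelism is also used.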

API Reference

DistributedTrainer

__init__(model, num_gpus, batch_size_per_gpu, gradient_accumulation_steps): Initialize distributed trainer
calculate_effective_batch_size(): Calculate effective batch size across all GPUs
setup_distributed_training(): Set up distributed training environment
train_step(data, labels): Perform one training step
get_training_stats(): Get training statistics

LLMTrainer

__init__(model_size, num_gpus, sequence_length, batch_size_per_gpu, gradient_checkpointing): Initialize LLM trainer
setup_llm_training(): Set up LLM-specific training configuration
train_on_batch(batch): Train on a single batch
get_model_size_info(): Get detailed model size information

PipelineParallel

__init__(model, num_stages, micro_batch_size): Initialize pipeline parallelism
setup_pipeline(): Set up pipeline stages
forward_pipeline(input_data): Forward pass through pipeline
get_pipeline_efficiency(): Calculate pipeline efficiency
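
As a rough guide to what pipeline efficiency means, the sketch below uses the common idealized estimate for a GPipe-style schedule: with p equally sized stages and m micro-batches, the pipeline is busy for m of every m + p - 1 time slots. The function name is hypothetical, communication cost is ignored, and whether `get_pipeline_efficiency()` uses this formula is an assumption.

```python
def pipeline_efficiency(num_stages: int, num_micro_batches: int) -> float:
    """Idealized fraction of time each stage does useful work.

    Hypothetical helper assuming equal-cost stages and no communication.
    """
    if num_stages < 1 or num_micro_batches < 1:
        raise ValueError("both arguments must be positive integers")
    return num_micro_batches / (num_micro_batches + num_stages - 1)


# 64 stages as in the Quick Start; the micro-batch counts are examples.
for m in (64, 256, 1024):
    print(f"{m} micro-batches: {pipeline_efficiency(64, m):.1%}")
```

With many stages, the number of micro-batches per step has to be several times the stage count before the idle "bubble" becomes small.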

MemoryOptimizer

__init__(model, offload_optimizer, offload_params, cpu_offload): Initialize memory optimizer
optimize_memory(): Apply memory optimizations
calculate_memory_savings(): Calculate memory savings
get_memory_usage(): Get current memory usage

ModelSharding

__init__(model, sharding_strategy, num_shards): Initialize model sharding
shard_model(): Shard model across devices
get_shard_info(): Get sharding information
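
The "tensor_parallel" strategy named in the Quick Start usually refers to splitting individual weight matrices across devices. The pure-Python sketch below illustrates one such scheme, a column-parallel linear layer, where each shard holds a slice of the output columns and the partial outputs are concatenated. All names are hypothetical and this is not presented as how `shard_model()` works.

```python
from typing import List

Matrix = List[List[float]]


def split_columns(weight: Matrix, num_shards: int) -> List[Matrix]:
    """Split a (in_features, out_features) matrix into column blocks."""
    cols = len(weight[0])
    if cols % num_shards != 0:
        raise ValueError("out_features must be divisible by num_shards")
    width = cols // num_shards
    return [[row[i * width:(i + 1) * width] for row in weight]
            for i in range(num_shards)]


def matvec(x: List[float], weight: Matrix) -> List[float]:
    """Compute x @ weight for a single input vector."""
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(len(weight[0]))]


def column_parallel_forward(x: List[float], shards: List[Matrix]) -> List[float]:
    """Each shard would run on its own device; here they run in a loop."""
    out: List[float] = []
    for shard in shards:
        out.extend(matvec(x, shard))  # concatenate partial outputs
    return out


weight = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0]]
x = [1.0, 1.0]
shards = split_columns(weight, num_shards=2)
print(column_parallel_forward(x, shards) == matvec(x, weight))  # True
```

Because every shard needs the full input vector but produces only part of the output, this layout trades extra communication for lower per-device weight memory.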

ZeroOptimizer

__init__(model, stage, partition_optimizer, partition_gradients): Initialize ZeRO optimizer
setup_zero(): Set up ZeRO optimization
get_zero_stats(): Get ZeRO optimization statistics

Requirements

Python 3.8+
PyTorch 1.9.0+
NumPy 1.21.0+
NCCL (for multi-GPU communication)
CUDA 11.0+ (for GPU training)
OpenMPI (for multi-node training)
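
A quick way to check part of this list before installing is sketched below. It uses only the standard library, and the helper name `check_environment` is made up for this example. It verifies the Python version and whether packages are importable; it does not check package versions, NCCL, CUDA, or OpenMPI.

```python
import importlib.util
import sys
from typing import Dict, Sequence, Tuple


def check_environment(min_python: Tuple[int, int] = (3, 8),
                      packages: Sequence[str] = ("torch", "numpy")) -> Dict[str, bool]:
    """Report whether the interpreter and top-level packages are available.

    Hypothetical helper; pass top-level package names only.
    """
    report = {"python": sys.version_info[:2] >= tuple(min_python)}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for item, ok in check_environment().items():
    print(f"{item}: {'ok' if ok else 'missing'}")
```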

Development

To install the development dependencies:

pip install -e ".[dev]"

Run tests:

pytest

Format code:

black src/

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

0.1.0 (2024-01-XX)

Initial release
Large-scale distributed training support
LLM training utilities
Pipeline and tensor parallelism
Memory optimization for billion-parameter models
Multi-node training capabilities

Project details

This version

0.1.0

Oct 29, 2025

Download files

Source Distribution

deepscale-0.1.0.tar.gz (14.0 kB)

Uploaded Oct 29, 2025 Source

Built Distribution

deepscale-0.1.0-py3-none-any.whl (11.9 kB)

Uploaded Oct 29, 2025 Python 3

File details

Details for the file deepscale-0.1.0.tar.gz.

File metadata

Download URL: deepscale-0.1.0.tar.gz
Upload date: Oct 29, 2025
Size: 14.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.18

File hashes

Hashes for deepscale-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`98127b5d31e32e172b4bc5d7f9de1dbf24cad8d648b92b4cc96a5cd5b0a9137a`
MD5	`a9c70f582801735128cdb7edf4268f37`
BLAKE2b-256	`59ad4089ec8ec13609cbc6b675f64b506c94b7122fcd3ac856a87528f78fc833`

File details

Details for the file deepscale-0.1.0-py3-none-any.whl.

File metadata

Download URL: deepscale-0.1.0-py3-none-any.whl
Upload date: Oct 29, 2025
Size: 11.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.18

File hashes

Hashes for deepscale-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c2bf51407a8a6c4aadb62ae69a5606644fe63f6a34942e21e2049eb8b0d7b1c6`
MD5	`f0e4df89859a6706cd4d95c9b9456c3f`
BLAKE2b-256	`761598c7bcce53ea2773b3a8a1e1ca0d59421f1704f5bb8bdc4dedded6fa5db2`
