RAM is All You Need

Development Status
- 4 - Beta
Intended Audience
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

RamTorch

RAM is All You Need - A PyTorch library for memory-efficient deep learning that enables training and inference of large models that don't fit in GPU memory.

Overview

RamTorch provides CPU-GPU hybrid implementations of neural network components that keep parameters in CPU memory and transfer them to the GPU on demand. Only the parameters currently in use occupy GPU memory, which reduces GPU memory usage, while asynchronous CUDA streams and batching keep computation efficient.

Key Features

Memory-Efficient Linear Layers: Parameters stored on CPU with on-demand GPU transfer
Asynchronous CUDA Streams: Overlaps computation with data transfer for minimal latency
ZeRO-Style Distributed Training:
- ZeRO-1: Optimizer state sharding across multiple GPUs
- ZeRO-2: Gradient sharding with automatic reduction
Shared CPU Memory: Multi-GPU workers share the same CPU tensor storage
Drop-in Replacement: Compatible with existing PyTorch code

Installation

pip install ramtorch

Or install from source:

git clone https://github.com/lodestone-rock/RamTorch.git
cd RamTorch
pip install -e .

Quick Start

Basic Usage

Replace torch.nn.Linear with ramtorch.Linear for automatic memory optimization:

import torch
from ramtorch import Linear

# Standard PyTorch approach (high GPU memory usage)
# linear = torch.nn.Linear(1000, 1000)

# RamTorch approach (low GPU memory usage)
linear = Linear(1000, 1000, device="cuda")

# Use exactly like a normal PyTorch layer
x = torch.randn(32, 1000, device="cuda")
output = linear(x)  # Parameters automatically transferred from CPU to GPU

Building Models

import torch.nn as nn
from ramtorch import Linear

class MemoryEfficientModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            Linear(1000, 2000),
            nn.ReLU(),
            Linear(2000, 2000),
            nn.ReLU(),
            Linear(2000, 100)
        )
    
    def forward(self, x):
        return self.layers(x)

model = MemoryEfficientModel()
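As a rough illustration of what offloading saves for the model above, the parameter memory can be estimated by hand. This is a back-of-the-envelope sketch, not a measurement: it assumes float32 parameters, counts weights and biases only (no activations, gradients, or transfer buffers), and assumes an idealized schedule in which a single layer's parameters are on the GPU at a time. The helper names are made up for this example.

```python
# Parameter-memory estimate for the three Linear layers above.
# Assumptions: float32 parameters, one layer resident on the GPU at a time.
LAYER_SHAPES = [(1000, 2000), (2000, 2000), (2000, 100)]  # (in_features, out_features)
BYTES_PER_PARAM = 4  # float32


def layer_param_count(in_features: int, out_features: int) -> int:
    """Weight matrix plus bias vector of one linear layer."""
    return in_features * out_features + out_features


def resident_bytes(shapes) -> int:
    """All layers kept on the GPU at once, as with standard nn.Linear."""
    return sum(layer_param_count(i, o) for i, o in shapes) * BYTES_PER_PARAM


def streamed_peak_bytes(shapes) -> int:
    """Only the largest single layer on the GPU at any moment (idealized)."""
    return max(layer_param_count(i, o) for i, o in shapes) * BYTES_PER_PARAM


print(f"resident:      {resident_bytes(LAYER_SHAPES) / 1e6:.1f} MB")       # 24.8 MB
print(f"streamed peak: {streamed_peak_bytes(LAYER_SHAPES) / 1e6:.1f} MB")  # 16.0 MB
```

For a model this small the saving is modest; the gap widens as the number of layers grows, because the streamed peak depends on the largest layer rather than on the total.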

Converting Existing Models

Use the helper function to automatically replace all Linear layers:

from ramtorch.helpers import replace_linear_with_ramtorch

# Your existing PyTorch model
model = YourExistingModel()

# Replace all nn.Linear layers with RamTorch Linear layers
model = replace_linear_with_ramtorch(model, rank=0)
model = model.to("cuda:0")

Multi-GPU Training with ZeRO-1 and ZeRO-2

RamTorch implements ZeRO-style optimizations in which multiple GPU workers share the same CPU parameter storage, so parameters are stored once in CPU memory rather than once per worker:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, DistributedSampler

from ramtorch import AdamW
from ramtorch.helpers import replace_linear_with_ramtorch
from ramtorch.zero1 import create_zero_param_groups, broadcast_zero_params
from ramtorch.zero2 import setup_grad_sharding_hooks

def train(rank, world_size, model):
    # Setup distributed process group. The default env:// rendezvous needs
    # MASTER_ADDR and MASTER_PORT; these single-node defaults are assumptions.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    
    # Replace Linear layers with RamTorch (shares CPU memory across workers)
    model = replace_linear_with_ramtorch(model, rank)
    model.to(rank)
    
    # Setup ZeRO-1: Shard optimizer states across workers
    # Each worker only maintains optimizer states for a subset of parameters
    all_params = list(model.parameters())
    param_groups = [{'params': all_params, 'lr': 1e-3, 'weight_decay': 0.01}]
    rank_param_groups = create_zero_param_groups(param_groups, world_size)
    
    # Setup ZeRO-2: Shard gradients across workers
    # Gradients are partitioned and only linked on the worker responsible for them
    setup_grad_sharding_hooks(rank_param_groups, rank)
    
    # Each worker's optimizer only handles its shard
    optimizer = AdamW(rank_param_groups[rank])
    
    # Scheduler works normally
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    
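    # Setup distributed data loading
    dataset = YourDataset()
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    
    # Placeholders: pick the loss function and epoch count for your task
    criterion = torch.nn.CrossEntropyLoss()
    num_epochs = 10
    
    # Training loop
    for epoch in range(num_epochs):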
        sampler.set_epoch(epoch)  # Important for proper shuffling
        
        for batch in loader:
            inputs, targets = batch
            inputs = inputs.to(rank)
            targets = targets.to(rank)
            
            # Forward and backward pass
            outputs = model(inputs)
            loss = criterion(outputs, targets) / world_size  # Scale loss
            
            # Synchronize before backward to ensure all workers are ready
            torch.cuda.synchronize()
            loss.backward()
            torch.cuda.synchronize()
            
            # Update parameters (each worker updates only its shard)
            optimizer.step()
            
            # Broadcast updated parameters to all workers
            # RamTorch parameters are already shared via CPU memory,
            # but standard PyTorch parameters need explicit broadcasting
            broadcast_zero_params(rank_param_groups)
            
            scheduler.step()
            
            # IMPORTANT: Use model.zero_grad(), not optimizer.zero_grad()
            # Each worker handles partial gradients, so we need to zero
            # gradients at the model level to properly clear all workers' buffers
            model.zero_grad()
            
            torch.cuda.synchronize()
    
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    
    # Model must be instantiated BEFORE spawning GPU workers
    # RamTorch shares CPU tensors across workers, so the model needs to exist
    # in the parent process before the workers are spawned to enable proper memory sharing
    model = YourModel()
    
    mp.spawn(train, args=(world_size, model), nprocs=world_size)
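The training loop divides the loss by world_size before calling backward. A minimal sketch of why, in plain Python with a one-parameter model: if the gradient reduction across workers is a sum (an assumption here; the text above only says the reduction is automatic), then scaling each worker's loss by 1/world_size makes the reduced gradient equal to the average over workers. The toy loss, the sample values, and all names below are made up for illustration.

```python
def local_grad(w: float, x: float, y: float) -> float:
    """Derivative with respect to w of the squared error (w*x - y)**2."""
    return 2.0 * x * (w * x - y)


w = 0.5
# One (x, y) sample per worker; four workers in this toy setup.
samples = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
world_size = len(samples)

# Scaling the loss by 1/world_size scales each worker's gradient by the same factor.
scaled_grads = [local_grad(w, x, y) / world_size for x, y in samples]

# A sum-reduction across workers then yields the mean of the unscaled gradients.
reduced = sum(scaled_grads)
mean_grad = sum(local_grad(w, x, y) for x, y in samples) / world_size

print(reduced, mean_grad)
```

Without the scaling, a summed gradient would grow with the number of workers, which would effectively multiply the learning rate by world_size.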

Performance Considerations

When to Use RamTorch

Best suited for:

Large models that don't fit in GPU memory
Multi-GPU training where memory is the bottleneck
Inference scenarios with memory constraints
Training with limited GPU memory but abundant CPU memory and bandwidth
Distributed training of models with many parameters

Less suitable for:

Small models that fit comfortably in GPU memory
Scenarios where CPU-GPU bandwidth is the bottleneck
Real-time applications requiring minimal latency
Single-batch inference where transfer overhead dominates
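A quick way to judge whether a model "fits comfortably" is to estimate its training state. The sketch below assumes plain float32 training with an Adam-style optimizer, which needs about 16 bytes per parameter (4 for the weight, 4 for the gradient, 8 for the two moment estimates) before any activations are counted. The function names and the 24 GiB GPU are illustrative assumptions, not part of RamTorch.

```python
BYTES_WEIGHT = 4       # float32 parameter
BYTES_GRAD = 4         # float32 gradient
BYTES_ADAM_STATE = 8   # two float32 moment estimates per parameter


def training_state_bytes(num_params: int) -> int:
    """Weights + gradients + Adam moments, excluding activations."""
    return num_params * (BYTES_WEIGHT + BYTES_GRAD + BYTES_ADAM_STATE)


def fits_on_gpu(num_params: int, gpu_bytes: int) -> bool:
    """True if the training state alone fits; activations need extra room."""
    return training_state_bytes(num_params) <= gpu_bytes


GPU_BYTES = 24 * 1024**3  # hypothetical 24 GiB card

for name, n in [("100M params", 100_000_000), ("7B params", 7_000_000_000)]:
    gib = training_state_bytes(n) / 1024**3
    print(f"{name}: {gib:.1f} GiB of training state, fits: {fits_on_gpu(n, GPU_BYTES)}")
```

Models on the "does not fit" side of this estimate are the ones where offloading parameters to CPU memory is most likely to pay off.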

Optimization Tips

Use Larger Batch Sizes: Helps amortize transfer costs across more computation
Mixed Precision Training: Combine with torch.amp (torch.cuda.amp in older PyTorch releases) for additional memory savings
Strategic Placement: Use RamTorch layers for the largest components only
Gradient Checkpointing: Combine with activation checkpointing to further reduce memory
Multi-GPU Setup: RamTorch's shared CPU memory makes multi-GPU training particularly efficient
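The batch-size tip can be made concrete with a simple cost model. The weight transfer for a layer costs the same regardless of batch size, while the compute grows linearly with it, so the transfer-to-compute ratio falls as the batch grows. The bandwidth and throughput figures below are round hypothetical numbers, not measurements of any particular hardware, and the function name is made up for this sketch.

```python
def transfer_to_compute_ratio(
    batch_size: int,
    num_params: int = 4_000_000,           # e.g. a 2000 x 2000 linear layer
    bytes_per_param: int = 4,              # float32
    bandwidth_bytes_per_s: float = 25e9,   # hypothetical CPU-to-GPU bandwidth
    flops_per_s: float = 50e12,            # hypothetical sustained GPU throughput
) -> float:
    """Time spent moving weights divided by time spent on the forward matmul."""
    transfer_s = num_params * bytes_per_param / bandwidth_bytes_per_s
    # Roughly 2 FLOPs (multiply + add) per weight per sample in the forward pass.
    compute_s = 2 * num_params * batch_size / flops_per_s
    return transfer_s / compute_s


for batch_size in (32, 256, 2048, 16384):
    ratio = transfer_to_compute_ratio(batch_size)
    print(f"batch {batch_size:>5}: transfer/compute = {ratio:.2f}")
```

A ratio well above 1 means the GPU mostly waits on transfers; once it drops to around 1 or below, the transfer can in principle be hidden behind computation.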

Architecture

CPU-Offloaded Linear Layer

The CPU-offloaded linear layer is the core of RamTorch. It:

Stores parameters in CPU memory with share_memory_() for zero-copy sharing across processes
Asynchronously transfers weights to GPU during forward pass using dedicated CUDA streams
Uses CUDA events for proper stream synchronization
Automatically cleans up GPU memory after computation
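The benefit of overlapping transfers with computation can be sketched with a small timing model. It is an idealized illustration rather than a description of RamTorch's actual scheduler: it assumes the next layer's weights are prefetched on a separate stream while the current layer computes, with at most one layer prefetched ahead. Function names and timings are hypothetical.

```python
def serial_time(transfer, compute):
    """No overlap: every layer waits for its own transfer, then computes."""
    if len(transfer) != len(compute):
        raise ValueError("need one transfer time and one compute time per layer")
    return sum(transfer) + sum(compute)


def overlapped_time(transfer, compute):
    """One-layer-ahead prefetch: transfer of layer i+1 runs during compute of layer i."""
    if len(transfer) != len(compute) or not transfer:
        raise ValueError("need one transfer time and one compute time per layer")
    total = transfer[0]  # the first transfer cannot be hidden
    for i in range(len(compute) - 1):
        # The next layer starts when both this compute and its own transfer are done.
        total += max(compute[i], transfer[i + 1])
    return total + compute[-1]


# Per-layer times in milliseconds (made-up values).
transfer_ms = [2.0, 2.0, 2.0]
compute_ms = [3.0, 3.0, 3.0]

print("serial:    ", serial_time(transfer_ms, compute_ms))
print("overlapped:", overlapped_time(transfer_ms, compute_ms))
```

When compute per layer exceeds transfer per layer, only the first transfer is exposed; when transfers dominate, the total approaches the sum of transfer times, which matches the "less suitable" cases listed above.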

Memory Flow

                    ┌─────────────────────────┐
                    │   CPU Memory (Shared)   │
                    │  Parameters stored once │
                    └────────────┬────────────┘
                                 │
                    ┌────────────┴────────────┐
                    │                         │
            ┌───────▼────────┐       ┌────────▼────────┐
            │  GPU Worker 0  │       │  GPU Worker 1   │
            │ (Async Stream) │       │ (Async Stream)  │
            └───────┬────────┘       └────────┬────────┘
                    │                         │
                    │   Compute on GPU        │
                    │                         │
            ┌───────▼────────┐       ┌───────▼─────────┐
            │    Result 0    │       │    Result 1     │
            └────────────────┘       └─────────────────┘
                    │                         │
                    └────────────┬────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │   Cleanup GPU Memory    │
                    └─────────────────────────┘

ZeRO-Style Sharding

┌─────────────────────────────────────────────────────┐
│              Model Parameters (CPU Shared)          │
│  [P₀, P₁, P₂, P₃, P₄, P₅, P₆, P₇, P₈, P₉, ...]      │
└─────────────────────────────────────────────────────┘
                         │
        ┌────────────────┼────────────────┐
        │                │                │
┌───────▼────────┐ ┌─────▼────────┐ ┌─────▼────────┐
│ GPU Worker 0   │ │ GPU Worker 1 │ │ GPU Worker 2 │
│ (CPU mem map)  │ │(CPU mem map) │ │(CPU mem map) │
│ Optimizer for: │ │ Optimizer for│ │ Optimizer for│
│  P₀, P₁, P₂    │ │  P₃, P₄, P₅  │ │  P₆, P₇, P₈  │
│                │ │              │ │              │
│ Gradients for: │ │ Gradients for│ │ Gradients for│
│  P₀, P₁, P₂    │ │  P₃, P₄, P₅  │ │  P₆, P₇, P₈  │
└────────────────┘ └──────────────┘ └──────────────┘
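The assignment shown in the diagram, where each worker owns a contiguous block of parameters, can be sketched in a few lines. This is a hypothetical illustration of contiguous sharding only; create_zero_param_groups may partition parameters differently (for example by size), and partition_contiguous is a name made up for this example.

```python
def partition_contiguous(items, world_size):
    """Split items into world_size contiguous shards whose sizes differ by at most one."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    base, extra = divmod(len(items), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional item each.
        size = base + (1 if rank < extra else 0)
        shards.append(items[start:start + size])
        start += size
    return shards


params = [f"P{i}" for i in range(9)]
shards = partition_contiguous(params, world_size=3)

for rank, shard in enumerate(shards):
    print(f"worker {rank}: optimizer state and gradients for {shard}")
```

Each worker then builds its optimizer over its own shard only, which is what reduces optimizer-state and gradient memory by roughly a factor of world_size per worker.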

Contributing

We welcome contributions! Please see our contributing guidelines for details.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Citation

If you use RamTorch in your research, please cite:

@software{ramtorch2025,
  author = {Lodestone},
  title = {RamTorch: Memory-Efficient Deep Learning with CPU-GPU Hybrid Architecture},
  url = {https://github.com/lodestone-rock/RamTorch},
  year = {2025}
}

Acknowledgments

RamTorch is built on PyTorch's automatic differentiation and CUDA stream management, and is inspired by Microsoft's ZeRO optimizer and the DeepSpeed library.

Release history

Version	Release date
0.2.3 (this version)	Apr 11, 2026
0.2.2	Nov 9, 2025
0.2.1	Nov 6, 2025
0.2.0	Nov 2, 2025
0.1.9	Oct 17, 2025
0.1.8	Oct 15, 2025
0.1.7	Oct 12, 2025
0.1.6	Oct 9, 2025
0.1.5	Sep 26, 2025
0.1.4	Sep 22, 2025
0.1.3	Sep 20, 2025
0.1.2	Sep 20, 2025
0.1.1	Sep 19, 2025
0.1.0	Sep 18, 2025

Download files

Download the file for your platform.

Source Distribution

ramtorch-0.2.3.tar.gz (28.7 kB view details)

Uploaded Apr 11, 2026 Source

Built Distribution

ramtorch-0.2.3-py3-none-any.whl (30.4 kB view details)

Uploaded Apr 11, 2026 Python 3

File details

Details for the file ramtorch-0.2.3.tar.gz.

File metadata

Download URL: ramtorch-0.2.3.tar.gz
Upload date: Apr 11, 2026
Size: 28.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for ramtorch-0.2.3.tar.gz
Algorithm	Hash digest
SHA256	`525a0703cfd72cb244b4d3c9e3bcb05968be690017dcb62aa7f96aabd5bb54f0`
MD5	`c38bfad7aa027a83a6655fae1afa01b8`
BLAKE2b-256	`c0a33063708bf8d0d8f8fba65aea2b4458831928e941761fc63f0084d149d75e`

File details

Details for the file ramtorch-0.2.3-py3-none-any.whl.

File metadata

Download URL: ramtorch-0.2.3-py3-none-any.whl
Upload date: Apr 11, 2026
Size: 30.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for ramtorch-0.2.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f8f24d830c930b1d3bd10070d3080ab17c8fc828046a0296d2ebe1db6143d05f`
MD5	`7e87ce9bd7497297fbdf11d06c865b2c`
BLAKE2b-256	`d015dd2b60e5c844033837331a12e797daa55a86458271b054b9ef95ef13933b`
