
RamTorch

RAM is All You Need - A PyTorch library for memory-efficient deep learning that enables training and inference of large models that don't fit in GPU memory.

Overview

RamTorch provides CPU-GPU hybrid implementations of neural network components that keep parameters in CPU memory and transfer them to the GPU on demand. This dramatically reduces GPU memory usage while preserving throughput: asynchronous CUDA streams overlap weight transfers with computation, and the number of in-flight transfers is throttled to keep transfer overhead bounded.
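
To make the memory savings concrete, the sketch below compares allocated GPU memory for a dense torch.nn.Linear against its RamTorch counterpart (constructor as shown in Quick Start; exact numbers depend on dtype and allocator behavior):

import torch
import ramtorch.modules as ram_modules

# A 4096x4096 fp32 weight is ~64 MiB, and nn.Linear allocates it on the GPU
dense = torch.nn.Linear(4096, 4096, device="cuda")
print(torch.cuda.memory_allocated() / 2**20, "MiB")  # ~64 MiB

del dense
torch.cuda.empty_cache()

# The RamTorch layer keeps its weight in CPU RAM until the forward pass
bounced = ram_modules.Linear(4096, 4096, device="cuda")
print(torch.cuda.memory_allocated() / 2**20, "MiB")  # near zero before any forward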

Key Features

  • Memory-Efficient Linear Layers: CPU-stored parameters with on-demand GPU transfer
  • Asynchronous CUDA Streams: Overlap computation with data transfer for minimal latency
  • ZeRO-1 Optimizer Support: Distributed optimizer state sharding across multiple GPUs
  • Drop-in Replacement: Compatible with existing PyTorch code

Installation

pip install ramtorch

Or install from source:

git clone https://github.com/lodestone-rock/RamTorch.git
cd RamTorch
pip install -e .

Quick Start

Basic Usage

Replace torch.nn.Linear with ramtorch.modules.Linear for automatic memory optimization:

import torch
import ramtorch.modules as ram_modules

# Standard PyTorch approach (high GPU memory usage)
# linear = torch.nn.Linear(1000, 1000)

# RamTorch approach (low GPU memory usage)
linear = ram_modules.Linear(1000, 1000, device="cuda")

# Use exactly like a normal PyTorch layer
x = torch.randn(32, 1000, device="cuda")
output = linear(x)  # Parameters automatically transferred from CPU to GPU

Building Models

import torch.nn as nn
import ramtorch.modules as ram_modules

class MemoryEfficientModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            ram_modules.Linear(1000, 2000),
            nn.ReLU(),
            ram_modules.Linear(2000, 2000),
            nn.ReLU(),
            ram_modules.Linear(2000, 100)
        )
    
    def forward(self, x):
        return self.layers(x)

model = MemoryEfficientModel()
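
Continuing the snippet above, the model trains like any other nn.Module; a brief sketch of one forward/backward step (assuming the layers default to the current CUDA device, as the example implies):

import torch

x = torch.randn(32, 1000, device="cuda")
loss = model(x).sum()  # toy scalar objective for illustration
loss.backward()        # gradients flow back to the CPU-resident weights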

ZeRO-1 Optimizer Sharding

For distributed training with optimizer state sharding:

import torch
import torch.distributed as dist
from ramtorch.zero1 import create_zero_param_groups, broadcast_zero_params

# Initialize distributed training
dist.init_process_group(backend='nccl')
model = YourModel()
all_params = list(model.parameters())
rank = dist.get_rank()
world_size = dist.get_world_size()

# Create ZeRO-1 sharded optimizer
param_groups = [{'params': all_params, 'lr': 1e-3, 'weight_decay': 0.01}]
sharded_groups, owner_ranks = create_zero_param_groups(param_groups, rank, world_size)
optimizer = torch.optim.AdamW(sharded_groups)

# Scheduler works normally with sharded optimizer
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        # Forward/backward with gradient accumulation
        # (split_batch is a user-supplied helper that yields micro-batches)
        for micro_batch in split_batch(batch):
            loss = model(micro_batch)
            loss.backward()

        # All-reduce gradients across ranks (a reference implementation is sketched below)
        all_reduce_gradients(all_params)
        
        # Each rank updates only its owned parameters
        optimizer.step()
        
        # Broadcast updated parameters from owners to all ranks
        broadcast_zero_params(all_params, owner_ranks)
        
        model.zero_grad()
        scheduler.step()
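
The all_reduce_gradients call above is left to the user. A minimal reference implementation that sums gradients across ranks and averages them (the function name matches the placeholder in the loop; it is not a RamTorch API):

import torch.distributed as dist

def all_reduce_gradients(params):
    """Sum each gradient across all ranks, then divide by world size."""
    world_size = dist.get_world_size()
    for p in params:
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)

Note that NCCL only reduces CUDA tensors; if gradients live in CPU memory alongside RamTorch's parameters, a helper like this needs a gloo process group or an explicit device round-trip.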

Performance Considerations

When to Use RamTorch

Best suited for:

  • Large models that don't fit in GPU memory
  • Inference scenarios with memory constraints
  • Training with limited GPU memory but abundant CPU memory
  • Distributed training with many parameters

Less suitable for:

  • Small models that fit comfortably in GPU memory
  • Scenarios where CPU-GPU bandwidth is the bottleneck
  • Real-time applications requiring minimal latency

Optimization Tips

  1. Use Larger Batch Sizes: Helps amortize transfer costs
  2. Configure MAX_INFLIGHT: Tune based on your GPU memory availability
  3. Mixed Precision: Combine with torch.cuda.amp for additional memory savings (see the sketch after this list)
  4. Strategic Placement: Use RamTorch layers for the largest components only
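
As a sketch of tip 3, a RamTorch layer can be wrapped in an autocast region like any other module (assuming, per the drop-in claim above, that it exposes the usual nn.Module interface):

import torch
import ramtorch.modules as ram_modules

model = ram_modules.Linear(4096, 4096, device="cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(64, 4096, device="cuda")
target = torch.randn(64, 4096, device="cuda")

with torch.cuda.amp.autocast():  # half-precision activations
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()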

Architecture

CPU Bouncing Linear Layer

  1. Stores parameters in CPU memory (with share_memory_() for multiprocessing support)
  2. Asynchronously transfers weights to GPU during forward pass
  3. Uses CUDA events for proper stream synchronization
  4. Automatically throttles transfers to prevent memory overflow
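
An illustrative sketch of this bouncing pattern, written with plain PyTorch stream primitives (it mirrors the technique, not RamTorch's actual internals):

import torch

transfer_stream = torch.cuda.Stream()

def bounced_linear(x, weight_cpu, bias_cpu):
    # Start the host-to-device copy on a dedicated transfer stream
    with torch.cuda.stream(transfer_stream):
        w = weight_cpu.to("cuda", non_blocking=True)
        b = bias_cpu.to("cuda", non_blocking=True)
        ready = torch.cuda.Event()
        ready.record(transfer_stream)
    # The compute stream waits only on this layer's transfer, not the whole queue
    torch.cuda.current_stream().wait_event(ready)
    # A real implementation would also cap in-flight transfers (cf. MAX_INFLIGHT)
    return torch.nn.functional.linear(x, w, b)

# Pinned CPU memory makes non_blocking copies truly asynchronous
weight = torch.randn(2000, 1000).pin_memory()
bias = torch.randn(2000).pin_memory()
x = torch.randn(32, 1000, device="cuda")
y = bounced_linear(x, weight, bias)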

Memory Flow

CPU Memory (Parameters) → Transfer Stream → GPU Memory (Computation) → Result
    ↑                                                                    ↓
    └──────────────────── Cleanup after computation ←────────────────────┘

Contributing

We welcome contributions! Please see our contributing guidelines for details.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Citation

If you use RamTorch in your research, please cite:

@software{ramtorch2025,
  author = {Lodestone},
  title = {RamTorch: Memory-Efficient Deep Learning with CPU-GPU Hybrid Architecture},
  url = {https://github.com/lodestone-rock/RamTorch},
  year = {2025}
}

Acknowledgments

Built on top of PyTorch's excellent automatic differentiation and CUDA stream management capabilities.

