RamTorch
RAM is All You Need - A PyTorch library for memory-efficient deep learning that enables training and inference of large models that don't fit in GPU memory.
Overview
RamTorch provides CPU-GPU hybrid implementations of neural network components that keep parameters in CPU memory and transfer them to GPU on-demand. This approach dramatically reduces GPU memory usage while maintaining computational efficiency through asynchronous CUDA streams and intelligent batching.
Key Features
- Memory-Efficient Linear Layers: CPU-stored parameters with on-demand GPU transfer
- Asynchronous CUDA Streams: Overlap computation with data transfer for minimal latency
- ZeRO-1 Optimizer Support: Distributed optimizer state sharding across multiple GPUs
- Drop-in Replacement: Compatible with existing PyTorch code
Installation
pip install ramtorch
Or install from source:
git clone https://github.com/lodestone-rock/RamTorch.git
cd RamTorch
pip install -e .
Quick Start
Basic Usage
Replace torch.nn.Linear with ramtorch.modules.Linear for automatic memory optimization:
import torch
from ramtorch import Linear
# Standard PyTorch approach (high GPU memory usage)
# linear = torch.nn.Linear(1000, 1000)
# RamTorch approach (low GPU memory usage)
linear = Linear(1000, 1000, device="cuda")
# Use exactly like a normal PyTorch layer
x = torch.randn(32, 1000, device="cuda")
output = linear(x) # Parameters automatically transferred from CPU to GPU
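Because the weight and bias live in CPU RAM, constructing the layer barely touches GPU memory. A quick sanity check (illustrative, assuming a CUDA device is available):
print(f"{torch.cuda.memory_allocated() / 1e6:.1f} MB allocated on GPU")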
Building Models
import torch.nn as nn
from ramtorch import Linear
class MemoryEfficientModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            Linear(1000, 2000),
            nn.ReLU(),
            Linear(2000, 2000),
            nn.ReLU(),
            Linear(2000, 100),
        )

    def forward(self, x):
        return self.layers(x)
model = MemoryEfficientModel()
ZeRO-1 Optimizer Sharding
For distributed training with optimizer state sharding:
import torch
import torch.distributed as dist
from ramtorch.zero1 import create_zero_param_groups, broadcast_zero_params

# Initialize distributed training
dist.init_process_group(backend='nccl')

model = YourModel()
all_params = list(model.parameters())
rank = dist.get_rank()
world_size = dist.get_world_size()

# Create ZeRO-1 sharded parameter groups: each rank owns one shard
param_groups = [{'params': all_params, 'lr': 1e-3, 'weight_decay': 0.01}]
rank_param_groups = create_zero_param_groups(param_groups, world_size)
optimizer = torch.optim.AdamW(rank_param_groups[rank])  # optimize only this rank's shard

# Scheduler works normally with the sharded optimizer
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        # Forward/backward with gradient accumulation
        for micro_batch in split_batch(batch):
            loss = model(micro_batch)
            loss.backward()

        # All-reduce gradients across ranks (not provided by RamTorch;
        # a sketch is given below)
        all_reduce_gradients(all_params)

        # Each rank updates only the parameters it owns
        optimizer.step()

        # Broadcast updated parameters from their owners to all ranks
        broadcast_zero_params(rank_param_groups)

        # Use model.zero_grad(), not optimizer.zero_grad(): each rank's
        # optimizer only sees its own shard, so it would leave the other
        # ranks' gradients in place
        model.zero_grad()
    scheduler.step()
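The loop above calls two helpers that are left to the user. A minimal sketch of both (the names and signatures are the placeholders used above, not part of RamTorch; it assumes gradients are averaged across ranks and that the batch is a tensor that splits along its first dimension):
def all_reduce_gradients(params):
    # Sum gradients across all ranks, then average, so every rank holds
    # the same global gradient before updating its own shard
    world_size = dist.get_world_size()
    for p in params:
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)

def split_batch(batch, num_micro_batches=4):
    # Hypothetical helper: split a batch tensor into micro-batches
    # along dim 0 for gradient accumulation
    return torch.chunk(batch, num_micro_batches, dim=0)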
Performance Considerations
When to Use RamTorch
Best suited for:
- Large models that don't fit in GPU memory
- Inference scenarios with memory constraints
- Training with limited GPU memory but abundant CPU memory
- Distributed training with many parameters
Less suitable for:
- Small models that fit comfortably in GPU memory
- Scenarios where CPU-GPU bandwidth is the bottleneck
- Real-time applications requiring minimal latency
Optimization Tips
- Use Larger Batch Sizes: Helps amortize transfer costs
- Mixed Precision: Combine with torch.cuda.amp for additional memory savings (see the sketch below)
- Strategic Placement: Use RamTorch layers for the largest components only
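As an illustration of the mixed-precision tip, a minimal autocast step (a sketch assuming RamTorch's Linear composes with torch.cuda.amp the same way torch.nn.Linear does; model, optimizer, loss_fn, batch, and targets are placeholders):
scaler = torch.cuda.amp.GradScaler()
with torch.cuda.amp.autocast():
    output = model(batch)           # forward runs in reduced precision where safe
    loss = loss_fn(output, targets)
scaler.scale(loss).backward()       # scale loss to avoid fp16 gradient underflow
scaler.step(optimizer)              # unscales gradients, then steps
scaler.update()
optimizer.zero_grad()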
Architecture
CPU Bouncing Linear Layer
- Stores parameters in CPU memory (with share_memory_() for multiprocessing)
- Asynchronously transfers weights to the GPU during the forward pass
- Uses CUDA events for proper stream synchronization
Memory Flow
CPU Memory (Parameters) → Transfer Stream → GPU Memory (Computation) → Result
        ↑                                                ↓
        └────────── Cleanup after computation ←──────────┘
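In code, the bounce pattern looks roughly like this (an illustrative sketch, not RamTorch's actual implementation; it assumes the CPU tensors are pinned so the host-to-device copies can be asynchronous):
import torch
import torch.nn.functional as F

transfer_stream = torch.cuda.Stream()

def bounced_linear(x, weight_cpu, bias_cpu):
    done = torch.cuda.Event()
    with torch.cuda.stream(transfer_stream):
        # Asynchronous H2D copies on the side stream (needs pinned memory)
        w = weight_cpu.to("cuda", non_blocking=True)
        b = bias_cpu.to("cuda", non_blocking=True)
        done.record()
    # The compute stream waits only for this transfer, not the whole device
    torch.cuda.current_stream().wait_event(done)
    out = F.linear(x, w, b)
    return out  # w and b go out of scope here, freeing their GPU copies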
Contributing
We welcome contributions! Please see our contributing guidelines for details.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Citation
If you use RamTorch in your research, please cite:
@software{ramtorch2025,
  author = {Lodestone},
  title  = {RamTorch: Memory-Efficient Deep Learning with CPU-GPU Hybrid Architecture},
  url    = {https://github.com/lodestone-rock/RamTorch},
  year   = {2025}
}
Acknowledgments
Built on top of PyTorch's excellent automatic differentiation and CUDA stream management capabilities.
Download files
File details
Details for the file ramtorch-0.1.5.tar.gz.
File metadata
- Download URL: ramtorch-0.1.5.tar.gz
- Upload date:
- Size: 16.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 60c7f895f571c7374c675370613f18f8128a232561cbcee9d5fdce17644439ea |
| MD5 | e430ae6ecaa79c40b166e2beb06d6e6c |
| BLAKE2b-256 | 0caee8bf4debcbd8d85654dc8e2b27dfd2db8fcf91e982211000866b03da892c |
File details
Details for the file ramtorch-0.1.5-py3-none-any.whl.
File metadata
- Download URL: ramtorch-0.1.5-py3-none-any.whl
- Upload date:
- Size: 24.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c1fac3918f3227f4727ce8e3003bc2d302061748e39b93994d5e426ed2284732 |
| MD5 | 5f84a465a2ca94ca469832289efd59b7 |
| BLAKE2b-256 | 186f4811a5f8b9d3be486de07c7ea1678bf0c3cab4400f1837c124418f9d60a2 |