Remote GPU — pure Python client, zero dependencies, auto CPU fallback
Table of Contents
| # | Section |
|---|---|
| 1 | Distinctive Capabilities |
| 2 | Comparative Advantage |
| 3 | Architecture at a Glance |
| 4 | Usage — Step by Step |
| 5 | Performance Benchmarks |
| 6 | Distinctive Capabilities (Persian) |
| 7 | Comparative Advantage (Persian) |
| 8 | Usage — Step by Step (Persian) |
1. Distinctive Capabilities
RemoteCUDA is not merely another remote GPU tool — it is a complete distributed execution framework that operates transparently beneath your existing PyTorch codebase.
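A quick illustration of what "zero-change" means in practice: the snippet below is ordinary PyTorch, with no RemoteCUDA import, decorator, or context manager. Assuming a RemoteCUDA server is reachable on the network, the `.cuda()` calls execute remotely; the code itself contains nothing library-specific.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Plain PyTorch: no RemoteCUDA imports, decorators, or context managers.
model = nn.Linear(784, 10).cuda()          # intercepted transparently
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(64, 784).cuda()            # only the mini-batch crosses the network
y = torch.randint(0, 10, (64,)).cuda()

loss = F.cross_entropy(model(x), y)        # forward pass runs on the remote GPU
loss.backward()                            # gradients flow back over the wire
optimizer.step()
```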
Capability Matrix
| Capability | Description | Impact |
|---|---|---|
| Zero-Change Integration | No import rewrites, no decorators, no context managers required. Your model.cuda() calls work exactly as written. | Eliminates migration cost entirely. |
| Network Auto-Discovery | UDP multicast-based server detection. Clients locate GPU servers without IP addresses, configuration files, or environment variables. | Zero-configuration deployment. |
| Multi-GPU Scheduling | Four distinct scheduling strategies — Adaptive, Load-Balancing, Memory-Aware, Latency-Optimized — with automatic strategy switching based on workload profiling. | Optimal resource utilization without manual tuning. |
| Local Dataset Retention | Datasets remain on the client filesystem. Only the active mini-batch traverses the network. A 512MB LRU/LFU tensor cache eliminates redundant transfers. | 50GB dataset → zero upfront transfer. |
| Asynchronous Stream Pipelining | Data transfer and GPU computation overlap via CUDA stream semantics. While batch N computes on the GPU, batch N+1 is already in flight over the network. | Hides network latency behind computation. |
| Automatic Failover | GPU health monitored via heartbeat protocol. Failed tasks automatically reassigned to healthy GPUs with configurable retry logic (default: 3 attempts). | Graceful degradation under hardware failure. |
| Heterogeneous GPU Support | Mixed GPU architectures (RTX 4090 + A100 + RTX 3090) within a single pool. Scheduler accounts for compute capability, memory capacity, and current utilization. | Leverage all available hardware simultaneously. |
| No CUDA on Client | Client requires only PyTorch CPU build. No NVIDIA drivers, no CUDA toolkit, no GPU hardware. | Works on any laptop, including ultrabooks. |
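To ground the caching row in the matrix, here is a minimal sketch of a byte-bounded LRU tensor cache in the spirit of the 512MB cache described above. The class and method names are illustrative inventions, not RemoteCUDA's actual internals.

```python
from collections import OrderedDict
from typing import Optional

import torch

class LRUTensorCache:
    """Illustrative byte-bounded LRU cache (names are hypothetical)."""

    def __init__(self, capacity_bytes: int = 512 * 1024 * 1024) -> None:
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()        # key -> tensor, oldest first

    def get(self, key: str) -> Optional[torch.Tensor]:
        if key not in self.entries:
            return None                     # miss: caller must send the tensor
        self.entries.move_to_end(key)       # mark as most recently used
        return self.entries[key]

    def put(self, key: str, tensor: torch.Tensor) -> None:
        size = tensor.element_size() * tensor.nelement()
        while self.used + size > self.capacity and self.entries:
            _, evicted = self.entries.popitem(last=False)   # evict LRU entry
            self.used -= evicted.element_size() * evicted.nelement()
        self.entries[key] = tensor
        self.used += size
```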
Scheduling Strategies
| Strategy | Behavior |
|---|---|
| Adaptive (default) | Profiles each task; routes memory-heavy ops to GPUs with the most free VRAM and latency-sensitive ops to GPUs with the lowest historical response time. |
| Load-Balancing | Distributes tasks evenly across all available GPUs. Optimal for homogeneous clusters running uniform workloads (e.g., hyperparameter sweeps). |
| Memory-Aware | Routes each task to the GPU with the most free memory. Essential for large models (LLMs, diffusion models) where VRAM is the primary constraint. |
| Latency-Optimized | Routes to GPUs with the lowest historical average forward-pass time. Ideal for real-time inference serving where tail latency matters. |
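As a rough sketch of how two of these strategies could be expressed (all names here are hypothetical, not RemoteCUDA's internal API), assuming each server reports free VRAM and a rolling latency average:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GPUStatus:
    server: str           # e.g. "192.168.1.20:55555" (illustrative)
    device_index: int
    free_vram_bytes: int
    avg_latency_ms: float

def route_memory_aware(task_bytes: int, pool: List[GPUStatus]) -> GPUStatus:
    """Memory-Aware: pick the GPU with the most free VRAM that fits the task."""
    candidates = [g for g in pool if g.free_vram_bytes >= task_bytes]
    if not candidates:
        raise RuntimeError("no GPU has enough free memory for this task")
    return max(candidates, key=lambda g: g.free_vram_bytes)

def route_latency_optimized(pool: List[GPUStatus]) -> GPUStatus:
    """Latency-Optimized: pick the GPU with the lowest historical latency."""
    return min(pool, key=lambda g: g.avg_latency_ms)
```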
2. Comparative Advantage
RemoteCUDA occupies a unique position in the design space. Unlike SSH-based approaches (which require code relocation) and unlike simpler CUDA-forwarding tools (which lack intelligence), RemoteCUDA provides transparent interception combined with sophisticated resource management.
Feature-by-Feature Comparison
| Feature | RemoteCUDA | SSH + SCP | VSCode Remote | Chidori GPU | Tensorlink |
|---|---|---|---|---|---|
| Zero code changes | Yes | No | No | Yes | Yes |
| Local dataset access | Yes | No | No | Yes | Yes |
| Network auto-discovery | Yes | No | No | No | No |
| Single-command server setup | Yes | No | No | Yes | No |
| Multi-GPU parallelism | Yes | No | No | No | No |
| Intelligent load balancing | Yes | No | No | No | No |
| Automatic failover | Yes | No | No | No | No |
| Multiple scheduling strategies | Yes (4) | No | No | No | No |
| Tensor caching layer | Yes | No | No | No | No |
| Async stream pipelining | Yes | No | No | No | No |
| Heterogeneous GPU support | Yes | No | No | No | No |
| No CUDA on client | Yes | Yes | No | Yes | Yes |
| pip installable | Yes | N/A | N/A | Yes | No |
| Windows support | Yes | No | No | Yes | Yes |
| Open source (MIT) | Yes | N/A | No | Yes | Yes |
Positioning
Ranked along the intelligence/automation axis (highest first):

- RemoteCUDA: Transparent + Smart
- Chidori GPU: Transparent only
- VSCode Remote: Manual + Remote
- SSH + SCP: Manual + Relocation
RemoteCUDA is the only solution that combines zero code impact (transparent CUDA interception) with intelligent resource orchestration (scheduling, caching, failover, pipelining).
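Of those four properties, failover is the easiest to sketch. The loop below is a hypothetical approximation of "reassign to a healthy GPU, default 3 attempts" from the capability matrix; `task.run` and `gpu.alive` are placeholder names, not real API.

```python
import time

def run_with_failover(task, pool, max_attempts=3, heartbeat_timeout=5.0):
    """Illustrative retry loop: reassign a failed task to a healthy GPU.

    task.run(gpu) and gpu.alive(timeout) are hypothetical placeholders for
    'execute on this GPU' and 'heartbeat still responding'.
    """
    last_error = None
    for attempt in range(max_attempts):
        healthy = [g for g in pool if g.alive(timeout=heartbeat_timeout)]
        if not healthy:
            time.sleep(1.0)                         # wait for a server to return
            continue
        gpu = healthy[attempt % len(healthy)]       # rotate across healthy GPUs
        try:
            return task.run(gpu)
        except ConnectionError as err:
            last_error = err                        # failed mid-task; try the next GPU
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error
```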
3. Architecture at a Glance
The client machine (no GPU) runs your unmodified code against the RemoteCUDA stack:

- Your code (`model.cuda()`, `tensor.to()`, `loss.backward()`) → AutoCUDAHook (interception) → GPUScheduler (strategy selection) → GPUPool (server connections)
- Supporting layers: TensorCache (512MB LRU), StreamManager (async pipelining), TensorBridge (tensor lifecycle)

Transport between client and servers:

- TCP/IP over 1–10 Gbps Ethernet; binary protocol, zlib/lz4 compressed
- Discovery via UDP multicast (239.255.100.100)

GPU Servers 1…N, each with discovery enabled:

- One GPUWorker per physical GPU, e.g. two RTX 4090s on server 1, two RTX 3090s on server 2, two A100s on server N
Key Design Decisions:
- One Worker per GPU: Each physical GPU gets its own worker thread, memory manager, and CUDA stream. No cross-GPU contention.
- Binary Protocol: Custom 16-byte header with CRC32 integrity check. Payload compressed with zlib (level 3) by default; supports lz4 and zstd as optional accelerators.
- UDP Discovery: Servers announce presence via multicast; clients listen passively. No central registry. Servers can join or leave dynamically.
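A minimal sketch of what framing a message could look like under these decisions. The 16-byte header size, CRC32 check, and zlib level 3 come from the text above; the exact field order and magic bytes are assumptions, so treat this as illustrative rather than the actual wire format.

```python
import struct
import zlib

HEADER_FMT = "!4sHHII"   # magic, msg type, flags, payload len, CRC32 = 16 bytes
MAGIC = b"RCUD"          # assumed magic bytes, not the real ones

def pack_message(msg_type: int, payload: bytes) -> bytes:
    body = zlib.compress(payload, level=3)           # zlib level 3, as documented
    crc = zlib.crc32(body) & 0xFFFFFFFF
    header = struct.pack(HEADER_FMT, MAGIC, msg_type, 0, len(body), crc)
    return header + body

def unpack_message(data: bytes):
    magic, msg_type, flags, length, crc = struct.unpack(HEADER_FMT, data[:16])
    body = data[16:16 + length]
    if magic != MAGIC or zlib.crc32(body) & 0xFFFFFFFF != crc:
        raise ValueError("corrupt frame")            # integrity check failed
    return msg_type, zlib.decompress(body)
```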
4. Usage — Step by Step
4.1 Server Setup (GPU Machine)
Prerequisites: Python 3.8+, PyTorch with CUDA, NVIDIA drivers.
```bash
# Install
pip install remotecuda

# Start the service
remotecuda start
```

Expected output:

```
RemoteCUDA GPU Service
GPU: NVIDIA GeForce RTX 4090 (24.0 GB)
Port: 55555
Status: Ready — Broadcasting on network
Discovery: ON (UDP multicast)
```

The GPU is now available on the network.
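On the client side, discovery can be pictured as a passive multicast listener. The group address 239.255.100.100 comes from the architecture overview; the UDP port and the assumption that one packet equals one announcement are illustrative choices for this sketch.

```python
import socket
import struct

GROUP = "239.255.100.100"   # multicast group from the architecture overview
PORT = 55555                # assumed: same port as the TCP service

def listen_for_servers(timeout: float = 5.0):
    """Passively collect GPU-server announcements (illustrative only)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    # Join the multicast group on all interfaces.
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(timeout)
    servers = set()
    try:
        while True:
            data, (addr, _) = sock.recvfrom(4096)   # one announcement per packet
            servers.add(addr)
    except socket.timeout:
        pass                                        # quiet period: stop collecting
    finally:
        sock.close()
    return servers
```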