
Remote GPU — pure Python client, zero dependencies, auto CPU fallback

Project description

RemoteCUDA

Remote GPU. Zero Code Changes.

Remote GPU. Without changing even a single line of code.


PyPI Python License Downloads



Table of Contents

| # | Section |
|---|---------|
| 1 | Distinctive Capabilities |
| 2 | Comparative Advantage |
| 3 | Architecture at a Glance |
| 4 | Usage — Step by Step |
| 5 | Performance Benchmarks |
| 6 | Distinctive Capabilities (Persian) |
| 7 | Comparative Advantage (Persian) |
| 8 | Usage — Step by Step (Persian) |



1. Distinctive Capabilities

RemoteCUDA is not merely another remote GPU tool — it is a complete distributed execution framework that operates transparently beneath your existing PyTorch codebase.

Capability Matrix

| Capability | Description | Impact |
|---|---|---|
| Zero-Change Integration | No import rewrites, no decorators, no context managers required. Your `model.cuda()` calls work exactly as written. | Eliminates migration cost entirely. |
| Network Auto-Discovery | UDP multicast-based server detection. Clients locate GPU servers without IP addresses, configuration files, or environment variables. | Zero-configuration deployment. |
| Multi-GPU Scheduling | Four distinct scheduling strategies (Adaptive, Load-Balancing, Memory-Aware, Latency-Optimized) with automatic strategy switching based on workload profiling. | Optimal resource utilization without manual tuning. |
| Local Dataset Retention | Datasets remain on the client filesystem; only the active mini-batch traverses the network. A 512 MB LRU/LFU tensor cache eliminates redundant transfers. | 50 GB dataset → zero upfront transfer. |
| Asynchronous Stream Pipelining | Data transfer and GPU computation overlap via CUDA stream semantics: while batch N computes on the GPU, batch N+1 is already in flight over the network. | Hides network latency behind computation. |
| Automatic Failover | GPU health is monitored via a heartbeat protocol. Failed tasks are automatically reassigned to healthy GPUs with configurable retry logic (default: 3 attempts). | Graceful degradation under hardware failure. |
| Heterogeneous GPU Support | Mixed GPU architectures (RTX 4090 + A100 + RTX 3090) within a single pool. The scheduler accounts for compute capability, memory capacity, and current utilization. | Leverage all available hardware simultaneously. |
| No CUDA on Client | The client requires only the PyTorch CPU build: no NVIDIA drivers, no CUDA toolkit, no GPU hardware. | Works on any laptop, including ultrabooks. |
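In spirit, zero-change integration works by intercepting the CUDA placement calls user code already makes. The sketch below illustrates the idea with a hypothetical stand-in class and hook function; these are not RemoteCUDA's actual internals:

```python
# Illustrative stand-in (hypothetical) -- NOT the real torch.Tensor.
class FakeTensor:
    def __init__(self, data):
        self.data = data
        self.device = "cpu"

    def cuda(self):
        # Original behavior: move to a local GPU (fails with no GPU present).
        raise RuntimeError("no local CUDA device")

def install_remote_hook(tensor_cls, send_to_server):
    """Replace .cuda() so it ships the tensor to a remote GPU instead."""
    original = tensor_cls.cuda

    def remote_cuda(self):
        send_to_server(self.data)       # transfer the payload over the network
        self.device = "remote:cuda:0"   # track the remote placement
        return self

    tensor_cls.cuda = remote_cuda
    return original  # keep a handle so the hook can be uninstalled later

# Usage: once the hook is installed, unmodified '.cuda()' calls are redirected.
sent = []
install_remote_hook(FakeTensor, sent.append)
t = FakeTensor([1, 2, 3]).cuda()   # no exception: the call was intercepted
```

Because the patch happens at import time, the calling code never changes, which is what makes the integration cost zero.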

Scheduling Strategies

| Strategy | Behavior |
|---|---|
| Adaptive (default) | Profiles each task; routes memory-heavy ops to GPUs with the most free VRAM and latency-sensitive ops to GPUs with the lowest historical response time. |
| Load-Balancing | Distributes tasks evenly across all available GPUs. Optimal for homogeneous clusters running uniform workloads (e.g., hyperparameter sweeps). |
| Memory-Aware | Routes each task to the GPU with the most free memory. Essential for large models (LLMs, diffusion models) where VRAM is the primary constraint. |
| Latency-Optimized | Routes to GPUs with the lowest historical average forward-pass time. Ideal for real-time inference serving where tail latency matters. |
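Conceptually, each strategy reduces to a routing function over per-GPU telemetry. A minimal sketch follows; the field names and the `memory_heavy` profiling hint are assumptions for illustration, not the actual scheduler API:

```python
# Hypothetical per-GPU telemetry records (illustrative field names).
gpus = [
    {"id": 0, "free_vram_gb": 4.0,  "avg_latency_ms": 12.0, "active_tasks": 3},
    {"id": 1, "free_vram_gb": 20.0, "avg_latency_ms": 30.0, "active_tasks": 1},
    {"id": 2, "free_vram_gb": 9.0,  "avg_latency_ms": 8.0,  "active_tasks": 2},
]

def pick_gpu(gpus, strategy, task=None):
    """Route a task according to one of the four strategies described above."""
    if strategy == "memory_aware":
        return max(gpus, key=lambda g: g["free_vram_gb"])
    if strategy == "latency_optimized":
        return min(gpus, key=lambda g: g["avg_latency_ms"])
    if strategy == "load_balancing":
        return min(gpus, key=lambda g: g["active_tasks"])
    # Adaptive: route per task, using a profiled hint about its memory demand.
    if task and task.get("memory_heavy"):
        return max(gpus, key=lambda g: g["free_vram_gb"])
    return min(gpus, key=lambda g: g["avg_latency_ms"])

chosen = pick_gpu(gpus, "memory_aware")   # → the GPU with 20 GB free
```

The adaptive strategy is simply a per-task dispatch between the memory-aware and latency-optimized rules, which is why it is a sensible default.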




2. Comparative Advantage

RemoteCUDA occupies a unique position in the design space. Unlike SSH-based approaches (which require code relocation) and unlike simpler CUDA-forwarding tools (which lack intelligence), RemoteCUDA provides transparent interception combined with sophisticated resource management.

Feature-by-Feature Comparison

| Feature | RemoteCUDA | SSH + SCP | VSCode Remote | Chidori GPU | Tensorlink |
|---|---|---|---|---|---|
| Zero code changes | Yes | No | No | Yes | Yes |
| Local dataset access | Yes | No | No | Yes | Yes |
| Network auto-discovery | Yes | No | No | No | No |
| Single-command server setup | Yes | No | No | Yes | No |
| Multi-GPU parallelism | Yes | No | No | No | No |
| Intelligent load balancing | Yes | No | No | No | No |
| Automatic failover | Yes | No | No | No | No |
| Multiple scheduling strategies | Yes (4) | No | No | No | No |
| Tensor caching layer | Yes | No | No | No | No |
| Async stream pipelining | Yes | No | No | No | No |
| Heterogeneous GPU support | Yes | No | No | No | No |
| No CUDA on client | Yes | Yes | No | Yes | Yes |
| pip installable | Yes | N/A | N/A | Yes | No |
| Windows support | Yes | No | No | Yes | Yes |
| Open source (MIT) | Yes | N/A | No | Yes | Yes |

Positioning

Intelligence / Automation
  ▲
  │  ┌──────────────┐
  │  │  RemoteCUDA  │  ← Transparent + Smart
  │  └──────────────┘
  │  ┌──────────────┐
  │  │ Chidori GPU  │  ← Transparent only
  │  └──────────────┘
  │  ┌──────────────┐
  │  │    VSCode    │  ← Manual + Remote
  │  │    Remote    │
  │  └──────────────┘
  │  ┌──────────────┐
  │  │  SSH + SCP   │  ← Manual + Relocation
  │  └──────────────┘
  └──────────────────────────────►

RemoteCUDA is the only solution that combines zero code impact (transparent CUDA interception) with intelligent resource orchestration (scheduling, caching, failover, pipelining).




3. Architecture at a Glance

┌──────────────────────────────────────────────────────────────────────────────┐
│                           CLIENT MACHINE (No GPU)                            │
│                                                                              │
│   Your Code            RemoteCUDA Stack                                      │
│  ┌─────────────┐   ┌─────────────────────────────────────────────────────┐   │
│  │model.cuda()─┼──→│  AutoCUDAHook  →  GPUScheduler  →  GPUPool          │   │
│  │tensor.to()  │   │  (intercept)      (strategy)      (connections)     │   │
│  │loss.back()  │   └─────────────────────────────────────────────────────┘   │
│  └─────────────┘   ┌─────────────────────────────────────────────────────┐   │
│                    │  TensorCache   │  StreamManager  │  TensorBridge    │   │
│                    │  (512MB LRU)   │  (async pipes)  │  (lifecycle)     │   │
│                    └─────────────────────────────────────────────────────┘   │
└────────────────────────────────────┬─────────────────────────────────────────┘
                                     │  TCP/IP (1–10 Gbps Ethernet)
                                     │  Protocol: Binary, zlib/lz4 compressed
                                     │  Discovery: UDP Multicast (239.255.100.100)
                                     │
          ┌──────────────────────────┼──────────────────────────┐
          ▼                          ▼                          ▼
┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
│   GPU Server 1   │      │   GPU Server 2   │      │   GPU Server N   │
│  ┌────────────┐  │      │  ┌────────────┐  │      │  ┌────────────┐  │
│  │ GPUWorker 0│  │      │  │ GPUWorker 0│  │      │  │ GPUWorker 0│  │
│  │ (RTX 4090) │  │      │  │ (RTX 3090) │  │      │  │ (A100)     │  │
│  ├────────────┤  │      │  ├────────────┤  │      │  ├────────────┤  │
│  │ GPUWorker 1│  │      │  │ GPUWorker 1│  │      │  │ GPUWorker 1│  │
│  │ (RTX 4090) │  │      │  │ (RTX 3090) │  │      │  │ (A100)     │  │
│  └────────────┘  │      │  └────────────┘  │      │  └────────────┘  │
│  Discovery: ON   │      │  Discovery: ON   │      │  Discovery: ON   │
└──────────────────┘      └──────────────────┘      └──────────────────┘
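The StreamManager's overlap of transfer and compute amounts to double buffering. The sketch below shows the idea with plain Python threads; the real implementation uses CUDA streams, and the function names here are illustrative:

```python
import threading
import queue

def pipelined_run(batches, transfer, compute, depth=2):
    """Overlap network transfer with computation: while batch N computes,
    batch N+1 is already being transferred (a sketch of stream pipelining)."""
    inflight = queue.Queue(maxsize=depth)  # bounded: at most `depth` in flight

    def producer():
        for b in batches:
            inflight.put(transfer(b))   # ship the batch toward the remote GPU
        inflight.put(None)              # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (item := inflight.get()) is not None:
        results.append(compute(item))   # "GPU" works while producer transfers
    return results

out = pipelined_run(range(4), transfer=lambda b: b, compute=lambda b: b * 2)
```

The bounded queue is what hides latency: transfer of the next batch proceeds concurrently with computation on the current one, without letting transfers run unboundedly ahead.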

Key Design Decisions:

  • One Worker per GPU: Each physical GPU gets its own worker thread, memory manager, and CUDA stream. No cross-GPU contention.
  • Binary Protocol: Custom 16-byte header with CRC32 integrity check. Payload compressed with zlib (level 3) by default; supports lz4 and zstd as optional accelerators.
  • UDP Discovery: Servers announce presence via multicast; clients listen passively. No central registry. Servers can join or leave dynamically.
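A frame in such a protocol might look as follows. The exact field layout and magic bytes are assumptions for illustration; only the 16-byte header, CRC32 integrity check, and zlib level-3 compression come from the description above:

```python
import struct
import zlib

MAGIC = b"RCUD"   # hypothetical magic bytes identifying a frame
# magic(4) + version(1) + flags(1) + msg_type(2) + payload_len(4) + crc32(4)
HEADER = struct.Struct("!4s B B H I I")   # 16 bytes total, network byte order

def pack_frame(msg_type, payload, version=1, flags=0):
    """Compress the payload with zlib level 3 and prepend a 16-byte header
    carrying a CRC32 of the compressed bytes."""
    body = zlib.compress(payload, 3)
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return HEADER.pack(MAGIC, version, flags, msg_type, len(body), crc) + body

def unpack_frame(frame):
    """Validate magic and CRC32, then return (msg_type, decompressed payload)."""
    magic, version, flags, msg_type, length, crc = HEADER.unpack(frame[:16])
    body = frame[16:16 + length]
    if magic != MAGIC or zlib.crc32(body) & 0xFFFFFFFF != crc:
        raise ValueError("corrupt frame")
    return msg_type, zlib.decompress(body)

msg_type, payload = unpack_frame(pack_frame(7, b"tensor bytes" * 100))
```

Putting the CRC over the compressed body means corruption is detected before any decompression is attempted, and the fixed-size header lets a receiver read exactly 16 bytes to learn how much more to expect.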



4. Usage — Step by Step

4.1 Server Setup (GPU Machine)

Prerequisites: Python 3.8+, PyTorch with CUDA, NVIDIA drivers.

# Install
pip install remotecuda

# Start the service
remotecuda start



 RemoteCUDA GPU Service
 GPU:        NVIDIA GeForce RTX 4090 (24.0 GB)
 Port:       55555
 Status:     Ready  Broadcasting on network
 Discovery:  ON (UDP multicast)

 GPU is now available on the network.
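Discovery is push-based: the server above announces itself on the multicast group while clients listen passively. A sketch of what the announcement cycle could look like; the JSON schema and the discovery port are illustrative assumptions, and only the group address 239.255.100.100 comes from the architecture notes:

```python
import json

MCAST_GROUP = "239.255.100.100"   # multicast group from the architecture notes
MCAST_PORT = 55556                # hypothetical discovery port

def encode_announce(gpu_name, vram_gb, tcp_port):
    """Build the datagram a server would periodically multicast."""
    return json.dumps({"svc": "remotecuda", "gpu": gpu_name,
                       "vram_gb": vram_gb, "port": tcp_port}).encode()

def decode_announce(datagram):
    """Parse a received datagram; ignore unrelated multicast traffic."""
    msg = json.loads(datagram.decode())
    if msg.get("svc") != "remotecuda":
        return None
    return msg["gpu"], msg["vram_gb"], msg["port"]

# A real server would loop `sock.sendto(encode_announce(...), (MCAST_GROUP,
# MCAST_PORT))`; a client would join the group and call decode_announce on
# every datagram it receives.
info = decode_announce(encode_announce("RTX 4090", 24.0, 55555))
```

Because clients only listen, no registry or configuration is needed: a server that starts announcing simply appears in the pool, and one that stops announcing ages out.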



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

remotecuda-5.0.0.tar.gz (96.4 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

remotecuda-5.0.0-py3-none-any.whl (96.5 kB)

Uploaded Python 3

File details

Details for the file remotecuda-5.0.0.tar.gz.

File metadata

  • Download URL: remotecuda-5.0.0.tar.gz
  • Upload date:
  • Size: 96.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.12

File hashes

Hashes for remotecuda-5.0.0.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | e626b7e3c20027f0b24121ceb6c6a2ed462f69cd8bae39f2cc01c00f6879374e |
| MD5 | 10c8a9453c4ffa13772cbc4bf4b875f6 |
| BLAKE2b-256 | 9c2b9549d44c8e9c9511ffed9a1ee455da98ef1996921b2a59aff60c44ad134d |

See more details on using hashes here.

File details

Details for the file remotecuda-5.0.0-py3-none-any.whl.

File metadata

  • Download URL: remotecuda-5.0.0-py3-none-any.whl
  • Upload date:
  • Size: 96.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.12

File hashes

Hashes for remotecuda-5.0.0-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | f5f573a0f8b04ffc9ef28885885611ca846063778f2e3ad01a5461a0de8e708c |
| MD5 | 8252d92f2a0d77829f894caee6df4043 |
| BLAKE2b-256 | 0f12d25ca21c7deda1a6830bbb8eb8fef102407730c75729394fc63c050663f3 |

See more details on using hashes here.
