Distributed GPU inference engine for heterogeneous prosumer hardware
Hyperlane
Hyperlane enables single-command LLM inference across multiple consumer GPUs on a local area network. It uses pipeline parallelism to split large models layer-wise and coordinates execution with minimal network overhead.
Architecture
Hyperlane is split into two core components:
hyperlane_worker (C++/CUDA)
High-performance inference server with:
- Control Plane: gRPC service for model shard loading and orchestration
- Execution Engine: ONNX Runtime with CUDA provider for layer execution
- Data Plane: Async TCP sockets for inter-worker tensor transmission
- Service Discovery: Avahi/mDNS registration with GPU stats in TXT records
- Optimization: FP16→INT4 quantization, async CUDA streams, pinned memory DMA
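The wire format of the data plane is not specified here. As a hedged illustration, one common way to send tensors over a raw TCP stream is length-prefixed framing: a fixed-size header carrying the payload length, followed by the payload bytes. The header layout, the `MAGIC` tag, and the function names below are assumptions made for this sketch, not Hyperlane's actual protocol.

```python
import struct

# Hypothetical header: 4-byte magic, 1-byte dtype code, 4-byte element count,
# 4-byte payload length, all in network byte order (no padding).
HEADER = struct.Struct("!4sBII")
MAGIC = b"HLT0"  # assumed tag, not taken from the Hyperlane source


def encode_frame(dtype_code: int, num_elements: int, payload: bytes) -> bytes:
    """Prefix a tensor payload with a fixed-size header."""
    return HEADER.pack(MAGIC, dtype_code, num_elements, len(payload)) + payload


def decode_frames(buffer: bytearray):
    """Yield (dtype_code, num_elements, payload) for each complete frame.

    Consumed bytes are removed from `buffer`. A trailing partial frame is
    left in place so the caller can append more socket data and retry.
    """
    while len(buffer) >= HEADER.size:
        magic, dtype_code, num_elements, length = HEADER.unpack_from(buffer)
        if magic != MAGIC:
            raise ValueError("stream out of sync: bad magic")
        end = HEADER.size + length
        if len(buffer) < end:
            break  # payload not fully received yet
        payload = bytes(buffer[HEADER.size:end])
        del buffer[:end]
        yield dtype_code, num_elements, payload
```

TCP delivers a byte stream rather than messages, so a receiver typically appends each `recv` result to a buffer and drains complete frames from it, as `decode_frames` does.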
hyperlane_client (Python)
User-friendly orchestration library with:
- Discovery: zeroconf-based auto-discovery of workers on the LAN
- Sharding: Knapsack partitioning algorithm for layer distribution
- ONNX Export: torch.onnx.export for shard compilation
- gRPC Orchestration: Async client for worker control
- pybind11 Bridge: Zero-copy tensor socket for low-overhead inference start-up
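To make the sharding step concrete, here is a minimal sketch that assigns contiguous layer ranges to workers in proportion to their free VRAM. It is a simpler proportional heuristic, not the knapsack algorithm the client uses, and it assumes every layer costs about the same memory. The function name `split_layers` and its inputs are hypothetical.

```python
def split_layers(num_layers: int, free_vram_mb: list[int]) -> list[range]:
    """Assign contiguous layer ranges to workers in proportion to free VRAM.

    Assumes uniform per-layer memory cost; a real partitioner would use
    measured per-layer sizes.
    """
    if num_layers < len(free_vram_mb):
        raise ValueError("more workers than layers")
    total = sum(free_vram_mb)
    if total <= 0:
        raise ValueError("no free VRAM reported")
    shards, start, seen = [], 0, 0
    for i, vram in enumerate(free_vram_mb):
        seen += vram
        workers_left = len(free_vram_mb) - i - 1
        # Rounding the cumulative share keeps shard sizes summing to num_layers.
        end = round(num_layers * seen / total)
        # Give every worker at least one layer and leave enough for the rest.
        end = max(end, start + 1)
        end = min(end, num_layers - workers_left)
        shards.append(range(start, end))
        start = end
    return shards
```

For example, `split_layers(32, [8000, 8000, 12000])` gives the third worker the largest shard because it reports the most free memory. Contiguous ranges matter for pipeline parallelism: each worker must hold a consecutive block of layers so activations flow through the workers in order.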
Pipeline Parallelism
Models are partitioned layer-wise (not tensor-wise), so only the activations at each shard boundary cross the network:
Input → GPU A (Layers 1-10) → GPU B (Layers 11-20) → GPU C (Layers 21-32) → Output
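Each GPU:
- Executes its layers on the input tensor (FP16/FP32)
- Quantizes the output to INT4 in VRAM
- Copies the result to pinned host memory with an async device-to-host (D2H) transfer
- Sends it over a TCP socket (pinned buffers are DMA-friendly)
- The next GPU receives it, issues an async host-to-device (H2D) copy, and dequantizes
Overlapping these stages reduces pipeline bubbles and the time spent blocked on the network.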
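The quantize and dequantize steps can be illustrated in plain Python. On the worker this runs as CUDA kernels; the sketch below only shows the arithmetic, assuming symmetric per-tensor quantization with a single scale and two 4-bit codes packed per byte. The actual scheme in `cuda_ops.cu` may differ (for example, per-group scales), and the function names are hypothetical.

```python
def quantize_int4(values: list[float]) -> tuple[bytes, float]:
    """Symmetric per-tensor INT4: map floats to integer codes in [-7, 7]."""
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / 7 if peak > 0 else 1.0
    # Mask to 4 bits so negative codes are stored in two's complement.
    codes = [max(-7, min(7, round(v / scale))) & 0x0F for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad to an even count so codes pack two per byte
    packed = bytes((hi << 4) | lo for lo, hi in zip(codes[0::2], codes[1::2]))
    return packed, scale


def dequantize_int4(packed: bytes, scale: float, count: int) -> list[float]:
    """Unpack two 4-bit codes per byte and rescale them to floats."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            signed = nibble - 16 if nibble >= 8 else nibble  # two's complement
            out.append(signed * scale)
    return out[:count]
```

The receiver needs the scale and the element count along with the packed bytes, so both would have to travel with the tensor. The rounding error per value is at most half the scale, which is why accuracy should be checked per model.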
Installation
System Requirements
- Ubuntu 20.04+ (Linux only)
- NVIDIA GPU with CUDA Compute Capability 6.0+
- CUDA Toolkit 11.8+
- cuDNN 8.0+
- gRPC + protobuf
- ONNX Runtime with CUDA provider
- Avahi (mDNS)
Quick Setup
# Install system dependencies
sudo apt-get update
sudo apt-get install -y \
    build-essential cmake \
    protobuf-compiler libprotobuf-dev \
    libgrpc++-dev protobuf-compiler-grpc \
    libavahi-client-dev libavahi-common-dev \
    python3-dev python3-pip
# Download/install CUDA and cuDNN (if not present)
# https://developer.nvidia.com/cuda-toolkit
# https://developer.nvidia.com/cudnn
# Install onnxruntime-gpu
pip install onnxruntime-gpu
# Clone and build
git clone <repo> hyperlane
cd hyperlane
bash build.sh
Quick Start
Start a Worker
# On GPU machine 1 (port 50051 default)
./hyperlane_worker/build/hyperlane_worker
# On GPU machine 2 (port 50052)
./hyperlane_worker/build/hyperlane_worker 50052
# On GPU machine 3 (port 50053)
./hyperlane_worker/build/hyperlane_worker 50053
Workers will register themselves via Avahi/mDNS and expose GPU stats.
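On the client side, mDNS TXT records arrive as raw byte key/value pairs (python-zeroconf exposes them as a dict of `bytes` keys with `bytes` or `None` values). A minimal sketch of turning them into typed GPU stats follows. The key names `gpu_name`, `vram_total_mb`, and `vram_free_mb` are hypothetical; check the worker's registration code for the keys it actually publishes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GpuStats:
    name: str
    vram_total_mb: int
    vram_free_mb: int


def parse_txt_record(properties: dict[bytes, Optional[bytes]]) -> GpuStats:
    """Decode mDNS TXT key/value pairs (raw bytes) into typed GPU stats.

    Missing or empty values fall back to defaults instead of raising.
    """
    def text(key: str, default: str) -> str:
        raw = properties.get(key.encode())
        return raw.decode("utf-8", errors="replace") if raw else default

    return GpuStats(
        name=text("gpu_name", "unknown"),            # hypothetical key
        vram_total_mb=int(text("vram_total_mb", "0")),  # hypothetical key
        vram_free_mb=int(text("vram_free_mb", "0")),    # hypothetical key
    )
```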
Load and Run a Model
import asyncio
from hyperlane_client import DiscoveryManager, AutoDistributedModel

async def main():
    # Discover workers on LAN
    discovery = DiscoveryManager()
    discovery.start_discovery()
    await asyncio.sleep(2)  # Wait for discovery
    print(f"Found {len(discovery.discovered_workers)} workers")

    # Load model and auto-shard across workers
    model = AutoDistributedModel.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        discovery
    )

    # Run inference
    output = model.generate("What is machine learning?", max_tokens=128)
    print(output)

    discovery.stop_discovery()

asyncio.run(main())
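A fixed two-second sleep can be too short on a busy network and needlessly long on a quiet one. As a sketch, a polling helper can wait until enough workers have appeared or a deadline passes. `wait_for_workers` is a hypothetical helper written for this example and is not part of `hyperlane_client`; it only needs a callable that returns the current worker count.

```python
import asyncio
import time


async def wait_for_workers(count_workers, minimum: int, timeout: float = 10.0,
                           poll_interval: float = 0.25) -> int:
    """Poll `count_workers()` until it reports at least `minimum` workers.

    Returns the final count; raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        found = count_workers()
        if found >= minimum:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(f"only {found} of {minimum} workers discovered")
        await asyncio.sleep(poll_interval)
```

With the discovery object from the example above, a call might look like `await wait_for_workers(lambda: len(discovery.discovered_workers), minimum=3)`.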
File Structure
.
├── hyperlane_worker/              # C++/CUDA inference server
│   ├── CMakeLists.txt
│   ├── include/                   # Headers
│   │   ├── worker.h               # Main orchestrator
│   │   ├── service_impl.h         # gRPC service
│   │   ├── onnx_session.h         # ONNX Runtime wrapper
│   │   ├── tensor_sender.h        # Async sender
│   │   ├── tensor_receiver.h      # Async receiver
│   │   └── cuda_ops.h             # Quantization kernels
│   └── src/                       # Implementations
│       ├── main.cc
│       ├── worker.cc
│       ├── service_impl.cc
│       ├── onnx_session.cc
│       ├── tensor_sender.cc
│       ├── tensor_receiver.cc
│       ├── cuda_ops.cu
│       └── service_discovery.cc
├── hyperlane_client/              # Python orchestration
│   ├── setup.py
│   ├── generate_grpc.py           # Proto compilation script
│   ├── hyperlane_client/
│   │   ├── __init__.py
│   │   ├── discovery.py           # Zeroconf discovery
│   │   ├── orchestrator.py        # Model sharding & deployment
│   │   └── grpc_client.py         # Worker gRPC client
│   └── pybind/
│       ├── CMakeLists.txt
│       ├── tensor_socket.h        # pybind11 wrapper
│       ├── tensor_socket.cpp      # Implementation
│       └── tensor_socket.py       # Python interface
├── proto/
│   └── service.proto              # gRPC service definition
├── build.sh                       # Build script
└── README.md                      # This file
Performance Considerations
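- Quantization: INT4 cuts transmitted tensor size 4x relative to FP16 (8x relative to FP32), but verify accuracy on your models.
- Network: 1 Gbps LAN is sufficient for most prosumer setups because only shard-boundary activations are transmitted; higher bandwidth mainly helps long prompts.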
- Pinned Memory: Allocate based on available system RAM (512 MB default per worker).
- CUDA Streams: Overlap D2H, socket send, and next H2D for minimal latency.
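A back-of-the-envelope calculation shows why the network claim is plausible. The sketch below estimates the wire time for one activation tensor crossing one pipeline boundary, using the hidden size of Llama-2-7B (4096). It ignores protocol overhead, quantization scale metadata, and round-trip latency, so treat the numbers as lower bounds. The function name is hypothetical.

```python
def transfer_time_ms(tokens: int, hidden_size: int, bits_per_value: float,
                     link_gbps: float) -> float:
    """Wire time in ms for one activation tensor crossing one boundary."""
    payload_bits = tokens * hidden_size * bits_per_value
    return payload_bits / (link_gbps * 1e9) * 1e3


HIDDEN = 4096  # hidden size of Llama-2-7B

for label, tokens in [("decode, 1 token", 1), ("prefill, 2048 tokens", 2048)]:
    fp16 = transfer_time_ms(tokens, HIDDEN, 16, link_gbps=1.0)
    int4 = transfer_time_ms(tokens, HIDDEN, 4, link_gbps=1.0)
    print(f"{label}: FP16 {fp16:.3f} ms, INT4 {int4:.3f} ms")
```

At 1 Gbps a single decode step moves only a few kilobytes per hop (well under 0.1 ms of wire time), so per-token latency is dominated by round-trip latency rather than bandwidth. A 2048-token prefill is the heavier case: about 134 ms per hop in FP16 versus about 34 ms in INT4.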
Troubleshooting
Worker doesn't start
# Check CUDA availability
nvidia-smi
# Check port
lsof -i :50051
# Check Avahi daemon
systemctl status avahi-daemon
Discovery fails
# Ensure Avahi is running
sudo systemctl start avahi-daemon
# Check mDNS resolution
avahi-browse -art
gRPC connection refused
# Verify worker is running and listening
netstat -tlnp | grep hyperlane_worker
# Check firewall
sudo ufw allow 50051:50100/tcp
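When the shell tools are not available on the client machine, reachability of a worker's gRPC port can also be checked from Python. This is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper, and a successful TCP connect only shows that something is listening, not that the gRPC service is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False
```

For example, `port_is_open("192.168.1.20", 50051)` (the address is a placeholder) returning `False` from the client while the worker reports it is listening points at a firewall or routing problem rather than at the worker itself.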
Development
Build just the worker
cd hyperlane_worker
mkdir -p build && cd build
cmake .. && cmake --build .
Build just the Python client
cd hyperlane_client
pip install -e .
Run tests (TODO)
pytest tests/
Roadmap
- TensorRT engine support (beyond ONNX)
- Dynamic layer redistribution based on load
- Speculative decoding for batch inference
- Quantization-aware training (QAT)
- Web dashboard for monitoring
- Support for other accelerators (AMD, Intel)
License
MIT (or chosen license)
Contributing
Pull requests welcome. Please ensure:
- Code follows project style (clang-format for C++, black for Python)
- All async code is non-blocking
- Performance-critical paths are validated on real hardware