Distributed GPU inference engine for heterogeneous prosumer hardware

Hyperlane

Distributed GPU Inference Engine for Heterogeneous Prosumer Hardware

Hyperlane runs LLM inference across multiple consumer GPUs on a local area network from a single command. It uses pipeline parallelism to split large models layer-wise across workers and keeps network overhead low by sending only quantized activations between them.

Architecture

Hyperlane is split into two core components:

hyperlane_worker (C++/CUDA)

High-performance inference server with:

Control Plane: gRPC service for model shard loading and orchestration
Execution Engine: ONNX Runtime with CUDA provider for layer execution
Data Plane: Async TCP sockets for inter-worker tensor transmission
Service Discovery: Avahi/mDNS registration with GPU stats in TXT records
Optimization: FP16→INT4 quantization, async CUDA streams, pinned memory DMA
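
As an illustration of what the data plane has to carry, here is a minimal sketch of length-prefixed tensor framing over a byte stream. The wire layout, the `MAGIC` value, the dtype codes, and the function names are all assumptions made for this example; the worker's actual protocol is implemented in C++ and is not described here.

```python
import struct

# Hypothetical wire format (an assumption, not Hyperlane's actual protocol):
# 4-byte magic | 1-byte dtype code | 1-byte rank | rank x uint32 dims
# | uint64 payload length | payload bytes
MAGIC = b"HLT0"
_PREFIX = struct.Struct("!4sBB")  # network byte order, no padding


def encode_frame(dtype_code: int, shape: tuple, payload: bytes) -> bytes:
    """Serialize one tensor into a self-describing frame."""
    prefix = _PREFIX.pack(MAGIC, dtype_code, len(shape))
    dims = struct.pack(f"!{len(shape)}I", *shape)
    length = struct.pack("!Q", len(payload))
    return prefix + dims + length + payload


def decode_frame(buf: bytes):
    """Parse a frame produced by encode_frame; returns (dtype_code, shape, payload)."""
    magic, dtype_code, rank = _PREFIX.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("unexpected magic bytes")
    offset = _PREFIX.size
    shape = struct.unpack_from(f"!{rank}I", buf, offset)
    offset += 4 * rank
    (length,) = struct.unpack_from("!Q", buf, offset)
    offset += 8
    payload = buf[offset:offset + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return dtype_code, shape, payload
```

An explicit payload length lets the receiver read exactly one tensor from a TCP stream, which has no message boundaries of its own.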

hyperlane_client (Python)

User-friendly orchestration library with:

Discovery: zeroconf-based auto-discovery of workers on the LAN
Sharding: Knapsack partitioning algorithm for layer distribution
ONNX Export: torch.onnx.export for shard compilation
gRPC Orchestration: Async client for worker control
pybind11 Bridge: Zero-copy tensor socket for low-overhead inference startup
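
To make the sharding step concrete, here is a small sketch that assigns contiguous layer ranges to workers in proportion to their memory capacity. It is a simple greedy proportional split, not the knapsack algorithm the client uses, and the function name and inputs (`layer_sizes`, `capacities`) are hypothetical.

```python
def partition_layers(layer_sizes, capacities):
    """Split layers into contiguous ranges, one per worker, in pipeline order.

    layer_sizes: memory footprint of each layer (any consistent unit).
    capacities:  usable memory of each worker (same unit).
    Returns a list of half-open (start, end) layer index ranges.
    Greedy sketch only; not Hyperlane's actual knapsack partitioner.
    """
    total = sum(layer_sizes)
    cap_total = sum(capacities)
    if not capacities or cap_total <= 0:
        raise ValueError("no worker capacity available")

    ranges = []
    start = 0      # first layer not yet assigned
    assigned = 0   # total size assigned to earlier workers
    cum_cap = 0    # total capacity of workers seen so far
    for i, cap in enumerate(capacities):
        cum_cap += cap
        # Cumulative size this worker should reach for a proportional split.
        target = total * cum_cap / cap_total
        is_last = i == len(capacities) - 1
        end, used = start, 0
        while end < len(layer_sizes):
            size = layer_sizes[end]
            if used + size > cap:
                break  # would exceed this worker's memory
            if not is_last and assigned + used + size > target:
                break  # leave the rest for later workers
            used += size
            end += 1
        ranges.append((start, end))
        assigned += used
        start = end

    if start != len(layer_sizes):
        raise ValueError("layers do not fit into the available workers")
    return ranges
```

For example, `partition_layers([1] * 32, [10, 10, 12])` returns `[(0, 10), (10, 20), (20, 32)]`, a 10/10/12 split of 32 equally sized layers.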

Pipeline Parallelism

Models are partitioned layer-wise (not tensor-wise), so only the activations at each shard boundary cross the network, rather than partial results for every layer:

Input → GPU A (Layers 1-10) → GPU B (Layers 11-20) → GPU C (Layers 21-32) → Output

Each GPU:

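Executes its layers on the input tensor (FP16/FP32)
Quantizes the output to INT4 in VRAM
Copies the quantized tensor to pinned host memory asynchronously (D2H)
Sends it to the next worker over a TCP socket (DMA-friendly)
The next GPU receives the tensor, copies it to device memory asynchronously (H2D), and dequantizes it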

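Because the copies and socket sends are asynchronous, a GPU can start on its next input while its previous output is still in flight. This overlap reduces pipeline bubbles and time spent blocked on the network.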
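
The quantize and dequantize steps can be illustrated on the CPU with a few lines of plain Python. This sketch assumes symmetric per-tensor quantization with a single scale and two 4-bit values packed per byte. Hyperlane's CUDA kernels may use a different scheme (for example per-channel or per-group scales), and the function names are hypothetical.

```python
def quantize_int4(values):
    """Quantize floats to signed 4-bit integers in [-8, 7], packed two per byte.

    Assumed scheme: symmetric, one scale for the whole tensor.
    Returns (packed_bytes, scale).
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    if len(q) % 2:
        q.append(0)  # pad to an even count so every byte holds two values
    packed = bytes(
        ((q[i] & 0xF) << 4) | (q[i + 1] & 0xF) for i in range(0, len(q), 2)
    )
    return packed, scale


def dequantize_int4(packed, scale, count):
    """Unpack two's-complement nibbles and rescale; returns `count` floats."""
    out = []
    for byte in packed:
        for nibble in (byte >> 4, byte & 0xF):
            signed = nibble - 16 if nibble >= 8 else nibble
            out.append(signed * scale)
    return out[:count]
```

With this scheme the round-trip error per element is at most half the scale, so tensors with a few large outliers lose more precision on their small values.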

Installation

System Requirements

Ubuntu 20.04+ (Linux only)
NVIDIA GPU with CUDA Compute Capability 6.0+
CUDA Toolkit 11.8+
cuDNN 8.0+
gRPC + protobuf
ONNX Runtime with CUDA provider
Avahi (mDNS)

Quick Setup

# Install system dependencies
sudo apt-get update
sudo apt-get install -y \
  build-essential cmake \
  protobuf-compiler libprotobuf-dev \
  libgrpc++-dev protobuf-compiler-grpc \
  libavahi-client-dev libavahi-common-dev \
  python3-dev python3-pip

# Download/install CUDA and cuDNN (if not present)
# https://developer.nvidia.com/cuda-toolkit
# https://developer.nvidia.com/cudnn

# Install onnxruntime-gpu
pip install onnxruntime-gpu

# Clone and build
git clone <repo> hyperlane
cd hyperlane
bash build.sh

Quick Start

Start a Worker

# On GPU machine 1 (port 50051 default)
./hyperlane_worker/build/hyperlane_worker

# On GPU machine 2 (port 50052)
./hyperlane_worker/build/hyperlane_worker 50052

# On GPU machine 3 (port 50053)
./hyperlane_worker/build/hyperlane_worker 50053

Workers will register themselves via Avahi/mDNS and expose GPU stats.
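
On the client side, mDNS TXT records arrive as a mapping from byte-string keys to byte-string (or missing) values. The sketch below shows how such a mapping could be turned into a typed record. The key names (`gpu_name`, `vram_total_mb`, `vram_free_mb`) and the `GpuStats` class are assumptions for illustration; the keys the worker actually publishes are not documented here.

```python
from dataclasses import dataclass


@dataclass
class GpuStats:
    name: str
    vram_total_mb: int
    vram_free_mb: int


def parse_txt_properties(properties):
    """Convert a TXT-record mapping (bytes -> bytes or None) into GpuStats.

    The key names below are hypothetical, not Hyperlane's actual TXT keys.
    """
    def text(key, default):
        raw = properties.get(key)
        return raw.decode("utf-8") if raw is not None else default

    return GpuStats(
        name=text(b"gpu_name", "unknown"),
        vram_total_mb=int(text(b"vram_total_mb", "0")),
        vram_free_mb=int(text(b"vram_free_mb", "0")),
    )
```

Missing or empty keys fall back to defaults, so a worker that publishes partial stats can still be listed.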

Load and Run a Model

import asyncio
from hyperlane_client import DiscoveryManager, AutoDistributedModel

async def main():
    # Discover workers on LAN
    discovery = DiscoveryManager()
    discovery.start_discovery()
    await asyncio.sleep(2)  # Wait for discovery
    
    print(f"Found {len(discovery.discovered_workers)} workers")
    
    # Load model and auto-shard across workers
    model = AutoDistributedModel.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        discovery
    )
    
    # Run inference
    output = model.generate("What is machine learning?", max_tokens=128)
    print(output)
    
    discovery.stop_discovery()

asyncio.run(main())

File Structure

.
├── hyperlane_worker/              # C++/CUDA inference server
│   ├── CMakeLists.txt
│   ├── include/                   # Headers
│   │   ├── worker.h              # Main orchestrator
│   │   ├── service_impl.h        # gRPC service
│   │   ├── onnx_session.h        # ONNX Runtime wrapper
│   │   ├── tensor_sender.h       # Async sender
│   │   ├── tensor_receiver.h     # Async receiver
│   │   └── cuda_ops.h            # Quantization kernels
│   └── src/                       # Implementations
│       ├── main.cc
│       ├── worker.cc
│       ├── service_impl.cc
│       ├── onnx_session.cc
│       ├── tensor_sender.cc
│       ├── tensor_receiver.cc
│       ├── cuda_ops.cu
│       └── service_discovery.cc
├── hyperlane_client/              # Python orchestration
│   ├── setup.py
│   ├── generate_grpc.py          # Proto compilation script
│   ├── hyperlane_client/
│   │   ├── __init__.py
│   │   ├── discovery.py          # Zeroconf discovery
│   │   ├── orchestrator.py       # Model sharding & deployment
│   │   └── grpc_client.py        # Worker gRPC client
│   └── pybind/
│       ├── CMakeLists.txt
│       ├── tensor_socket.h       # pybind11 wrapper
│       ├── tensor_socket.cpp     # Implementation
│       └── tensor_socket.py      # Python interface
├── proto/
│   └── service.proto             # gRPC service definition
├── build.sh                       # Build script
└── README.md                      # This file

Performance Considerations

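Quantization: INT4 reduces tensor size 4x relative to FP16 (8x relative to FP32), but verify accuracy on your models.
Network: A 1 Gbps LAN is sufficient for most prosumer setups because only shard-boundary activations cross the network; faster links reduce per-hop transfer time, which matters most for long prompts.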
Pinned Memory: Allocate based on available system RAM (512 MB default per worker).
CUDA Streams: Overlap D2H, socket send, and next H2D for minimal latency.
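
A back-of-the-envelope estimate helps when judging whether a link is fast enough. The sketch below computes the raw transfer time for one hop from the activation size and link speed. It ignores TCP overhead, latency, and quantization metadata such as scales, so treat the result as a lower bound; the function name is hypothetical.

```python
def hop_transfer_seconds(hidden_size, tokens, bits_per_value, link_gbps):
    """Lower-bound time to move one activation tensor across one pipeline hop.

    hidden_size:    width of the activation vector per token
    tokens:         number of token positions sent at once
    bits_per_value: 16 for FP16, 4 for INT4
    link_gbps:      nominal link speed in gigabits per second
    """
    payload_bytes = hidden_size * tokens * bits_per_value / 8
    bytes_per_second = link_gbps * 1e9 / 8
    return payload_bytes / bytes_per_second


# Llama-2-7B has a hidden size of 4096.
decode_int4 = hop_transfer_seconds(4096, 1, 4, 1.0)       # one token, INT4
decode_fp16 = hop_transfer_seconds(4096, 1, 16, 1.0)      # one token, FP16
prefill_int4 = hop_transfer_seconds(4096, 2048, 4, 1.0)   # 2048-token prompt
```

At 1 Gbps a single INT4 token is 2048 bytes and takes about 16 microseconds per hop, while a 2048-token prompt takes about 34 milliseconds, which is why link speed matters more during prefill than during token-by-token decoding.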

Troubleshooting

Worker doesn't start

# Check CUDA availability
nvidia-smi

# Check port
lsof -i :50051

# Check Avahi daemon
systemctl status avahi-daemon

Discovery fails

# Ensure Avahi is running
sudo systemctl start avahi-daemon

# Check mDNS resolution
avahi-browse -a

gRPC connection refused

# Verify worker is running and listening (use `ss -tlnp` if netstat is not installed)
netstat -tlnp | grep hyperlane_worker

# Check firewall
sudo ufw allow 50051:50100/tcp

Development

Build just the worker

cd hyperlane_worker
mkdir -p build && cd build
cmake .. && cmake --build .

Build just the Python client

cd hyperlane_client
# Editable install; replaces the deprecated `python3 setup.py develop`
pip install -e .

Run tests (TODO)

pytest tests/

Roadmap

TensorRT engine support (beyond ONNX)
Dynamic layer redistribution based on load
Speculative decoding for batch inference
Quantization-aware training (QAT)
Web dashboard for monitoring
Support for other accelerators (AMD, Intel)

License

MIT (or chosen license)

Contributing

Pull requests welcome. Please ensure:

Code follows project style (clang-format for C++, black for Python)
All async code is non-blocking
Performance-critical paths are validated on real hardware

Project details

This version

0.1.0

Nov 15, 2025

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

hyperlane-0.1.0-py3-none-any.whl (10.3 kB view details)

Uploaded Nov 15, 2025 Python 3

File details

Details for the file hyperlane-0.1.0-py3-none-any.whl.

File metadata

Download URL: hyperlane-0.1.0-py3-none-any.whl
Upload date: Nov 15, 2025
Size: 10.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for hyperlane-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c3363ec0f7c400c8f86fc6ea1be059b4ac787731207d02e28644992a4a8989a4`
MD5	`34a27177c64a4d78b1e3f0245a228b41`
BLAKE2b-256	`a26da4a600ba017869e4600700ddc0b73ba256962108fc03c695f23f1f76f54c`
