High-performance Rust-based load balancer for vLLM with multiple routing algorithms and prefill-decode disaggregation support

vLLM Router

A high-performance, lightweight request-forwarding system for large-scale vLLM deployments, providing advanced load balancing methods and prefill/decode disaggregation support.

Key Features

Core Architecture: Request routing framework and async processing patterns
Load Balancing: Multiple algorithms (cache-aware, power of two, consistent hashing, random, round robin)
Prefill-Decode Disaggregation: Specialized routing for separated processing phases
Service Discovery: Kubernetes-native worker management and health monitoring
Enterprise Features: Circuit breakers, retry logic, metrics collection

Quick Start

Prerequisites

Rust and Cargo:

# Install rustup (Rust installer and version manager)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Follow the installation prompts, then reload your shell
source $HOME/.cargo/env

# Verify installation
rustc --version
cargo --version

# Install protobuf compiler (on Ubuntu/Debian)
sudo apt-get update
sudo apt-get install -y protobuf-compiler libprotobuf-dev

Python with pip installed

Installation & Basic Usage

Rust Binary

# Build Rust components
cargo build --release

Python Package

Install from PyPI

pip install vllm-router

To build from source:

pip install setuptools-rust wheel build
python -m build
pip install dist/*.whl

# Rebuild & reinstall in one step during development
python -m build && pip install --force-reinstall dist/*.whl

Usage Examples

Standard Data Parallelism Routing

# Launch the router with data parallelism (8 replicas per worker URL)
# When --intra-node-data-parallel-size > 1, the router automatically creates DP-aware workers
./target/release/vllm-router \
    --worker-urls http://worker1:8000 http://worker2:8000 \
    --policy consistent_hash \
    --intra-node-data-parallel-size 8

# Alternative: using cargo run
cargo run --release -- \
    --worker-urls http://worker1:8000 http://worker2:8000 \
    --policy consistent_hash \
    --intra-node-data-parallel-size 8

# Alternative: using python launcher
vllm-router \
    --worker-urls http://worker1:8000 http://worker2:8000 \
    --policy consistent_hash \
    --intra-node-data-parallel-size 8
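
As a rough illustration of what "DP-aware workers" could mean, the sketch below expands each worker URL into one logical target per data-parallel rank. The `DPWorker` type, the `expand_dp_workers` helper, and the rank numbering are assumptions made for illustration; they are not taken from the router's source.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DPWorker:
    """Hypothetical logical worker: one data-parallel rank behind a worker URL."""
    url: str
    dp_rank: int


def expand_dp_workers(worker_urls: List[str], dp_size: int) -> List[DPWorker]:
    """Create one logical worker per (URL, rank) pair."""
    if dp_size < 1:
        raise ValueError("dp_size must be at least 1")
    return [DPWorker(url, rank) for url in worker_urls for rank in range(dp_size)]


workers = expand_dp_workers(["http://worker1:8000", "http://worker2:8000"], 8)
print(len(workers))  # 2 URLs x 8 ranks = 16 logical workers
```

With the flags from the commands above, a policy such as consistent_hash would then choose among 16 logical targets rather than 2 URLs, under this assumed model.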

Prefill-Decode Disaggregation

# When vLLM runs the NIXL connector, prefill/decode URLs are required.
# See a working example in scripts/llama3.1/ folder.
cargo run --release -- \
    --policy consistent_hash \
    --vllm-pd-disaggregation \
    --prefill http://127.0.0.1:8081 \
    --prefill http://127.0.0.1:8082 \
    --decode http://127.0.0.1:8083 \
    --decode http://127.0.0.1:8084 \
    --decode http://127.0.0.1:8085 \
    --decode http://127.0.0.1:8086 \
    --host 127.0.0.1 \
    --port 8090 \
    --intra-node-data-parallel-size 1

# When vLLM runs the NCCL connector, ZMQ based discovery is supported.
# See a working example in scripts/install.sh
cargo run --release -- \
    --policy consistent_hash \
    --vllm-pd-disaggregation \
    --vllm-discovery-address 0.0.0.0:30001 \
    --host 0.0.0.0 \
    --port 10001 \
    --prefill-policy consistent_hash \
    --decode-policy consistent_hash
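
To make the idea of disaggregated routing concrete, here is a minimal sketch in which each request is assigned one worker from the prefill pool and one from the decode pool, with the two pools selected independently. The `PDRouterSketch` class is hypothetical, and round robin is used only to keep the example short; the router's real selection logic depends on the configured policies.

```python
from itertools import cycle
from typing import Iterator, List, Tuple


class PDRouterSketch:
    """Hypothetical router that picks one prefill and one decode worker per request."""

    def __init__(self, prefill_urls: List[str], decode_urls: List[str]) -> None:
        if not prefill_urls or not decode_urls:
            raise ValueError("both pools need at least one worker")
        # Each pool keeps its own rotation, so the pools can differ in size.
        self._prefill: Iterator[str] = cycle(prefill_urls)
        self._decode: Iterator[str] = cycle(decode_urls)

    def route(self) -> Tuple[str, str]:
        """Return the (prefill, decode) pair for the next request."""
        return next(self._prefill), next(self._decode)


router = PDRouterSketch(
    prefill_urls=["http://127.0.0.1:8081", "http://127.0.0.1:8082"],
    decode_urls=[f"http://127.0.0.1:{port}" for port in range(8083, 8087)],
)
for _ in range(3):
    print(router.route())
```

Keeping separate state per pool mirrors the separate --prefill-policy and --decode-policy flags shown above.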

Configuration

Authentication

Enable bearer-token validation by listing one or more validation URLs (comma-separated) in .env via API_KEY_VALIDATION_URLS, or by passing --api-key-validation-urls on the command line. When set, all HTTP endpoints require an Authorization: Bearer <token> header, and a token is accepted only when the validation request returns HTTP 200.

# .env
API_KEY_VALIDATION_URLS=https://codebase.helmholtz.cloud/api/v4/user

# CLI override
vllm-router --api-key-validation-urls https://codebase.helmholtz.cloud/api/v4/user
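
On the client side, a request to an authenticated router needs to carry the bearer token. The sketch below builds such a request with Python's standard library without sending it. The router address `http://router:8000`, the token value, and the `build_chat_request` helper are placeholders chosen for illustration.

```python
import json
import urllib.request


def build_chat_request(router_url: str, token: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request carrying a bearer token."""
    body = json.dumps({
        "model": "llama-3",
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{router_url}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder router address and token.
req = build_chat_request("http://router:8000", "my-token", "Hello!")
print(req.get_header("Authorization"))
# To send it: urllib.request.urlopen(req)
```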

Metrics

The Prometheus metrics endpoint is available at 127.0.0.1:29000 by default.

# Custom metrics configuration
vllm-router \
    --worker-urls http://localhost:8080 http://localhost:8081 \
    --prometheus-host 0.0.0.0 \
    --prometheus-port 9000
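
For quick inspection without a Prometheus server, the text exposition format can be parsed directly. The following sketch handles only simple `name{labels} value` lines and assumes no trailing timestamps. The metric names in the sample are invented for illustration; the router's actual metric names may differ.

```python
from typing import Dict


def parse_prometheus_text(text: str) -> Dict[str, float]:
    """Parse simple `name{labels} value` lines, skipping comments and blank lines."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumes no trailing timestamp, so the last token is the sample value.
        series, _, value = line.rpartition(" ")
        samples[series.strip()] = float(value)
    return samples


# Hypothetical sample output; the real metric names may differ.
sample = """\
# HELP example_requests_total Total requests (illustrative name)
# TYPE example_requests_total counter
example_requests_total{worker="http://localhost:8080"} 42
example_active_workers 2
"""
print(parse_prometheus_text(sample))
```

In practice the input text would come from an HTTP GET against the metrics address; the conventional `/metrics` path is an assumption here, since the text above only gives the host and port.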

Retries and Circuit Breakers

Retry Configuration

Retries are enabled by default with exponential backoff and jitter:

vllm-router \
  --worker-urls http://localhost:8080 http://localhost:8081 \
  --retry-max-retries 3 \
  --retry-initial-backoff-ms 100 \
  --retry-max-backoff-ms 10000 \
  --retry-backoff-multiplier 2.0 \
  --retry-jitter-factor 0.1

Circuit Breaker Configuration

Circuit breakers protect workers and provide automatic recovery:

vllm-router \
  --worker-urls http://localhost:8080 http://localhost:8081 \
  --cb-failure-threshold 5 \
  --cb-success-threshold 2 \
  --cb-timeout-duration-secs 30 \
  --cb-window-duration-secs 60

Circuit Breaker State Machine:

Closed → Open after N consecutive failures (failure-threshold)
Open → HalfOpen after timeout (timeout-duration-secs)
HalfOpen → Closed after M consecutive successes (success-threshold)
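
The three transitions above can be sketched as a small state machine. This is a simplified model, not the router's implementation: it ignores the failure window (--cb-window-duration-secs), and it assumes that any failure in the HalfOpen state reopens the circuit, which the text above does not specify. The class and method names are hypothetical.

```python
import time
from typing import Callable


class CircuitBreakerSketch:
    """Minimal Closed -> Open -> HalfOpen -> Closed state machine."""

    def __init__(self, failure_threshold: int = 5, success_threshold: int = 2,
                 timeout_secs: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_secs = timeout_secs
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        """Return True if a request may be sent to the worker."""
        if self.state == "open" and self._clock() - self._opened_at >= self.timeout_secs:
            self.state = "half_open"  # timeout elapsed: let probe requests through
            self._successes = 0
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        if self.state == "half_open":
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = "closed"

    def record_failure(self) -> None:
        self._successes = 0
        if self.state == "half_open":
            self._trip()  # assumption: any failure while probing reopens the circuit
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self._failures = 0
        self._opened_at = self._clock()
```

The injectable `clock` argument exists only so the timeout can be exercised in tests without sleeping.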

Retry Policy: Retries on HTTP status codes 408/429/500/502/503/504, with backoff/jitter between attempts.
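
A minimal sketch of this policy follows, using the flag values from the retry example as defaults. The status code set comes from the line above; the exact delay formula (capped exponential growth with symmetric jitter) is an assumption, and the function names are hypothetical.

```python
import random
from typing import Optional

RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def is_retryable(status: int) -> bool:
    """Return True for the status codes the retry policy lists."""
    return status in RETRYABLE_STATUS


def backoff_ms(attempt: int, initial_ms: float = 100.0, max_ms: float = 10_000.0,
               multiplier: float = 2.0, jitter_factor: float = 0.1,
               rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (0-based), with +/- jitter."""
    rng = rng or random.Random()
    # Exponential growth, capped at max_ms.
    base = min(initial_ms * multiplier ** attempt, max_ms)
    # Assumed jitter model: uniform in [-jitter_factor, +jitter_factor] of the base.
    jitter = base * jitter_factor * rng.uniform(-1.0, 1.0)
    return max(0.0, base + jitter)


for attempt in range(4):
    print(attempt, round(backoff_ms(attempt), 1))
```

Jitter spreads out retries from many clients so they do not hit a recovering worker at the same instant.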

Request ID Tracking

Track requests across distributed systems with configurable headers:

# Use custom request ID headers
vllm-router \
    --worker-urls http://localhost:8080 \
    --request-id-headers x-trace-id x-request-id

Default headers: x-request-id, x-correlation-id, x-trace-id, request-id
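
One plausible way to resolve a request ID from these headers is to take the first configured header that is present and fall back to a generated ID. The sketch below assumes the listed order is the priority order and that a missing ID is replaced by a fresh UUID; neither detail is confirmed by the text above, and `resolve_request_id` is a hypothetical helper.

```python
import uuid
from typing import Mapping, Sequence, Tuple

DEFAULT_REQUEST_ID_HEADERS = ("x-request-id", "x-correlation-id", "x-trace-id", "request-id")


def resolve_request_id(headers: Mapping[str, str],
                       candidates: Sequence[str] = DEFAULT_REQUEST_ID_HEADERS) -> Tuple[str, bool]:
    """Return (request_id, was_generated), checking candidate headers in order."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {name.lower(): value for name, value in headers.items()}
    for name in candidates:
        value = lowered.get(name.lower())
        if value:
            return value, False
    # Assumption: when no header is present, a fresh ID is generated.
    return uuid.uuid4().hex, True


print(resolve_request_id({"X-Trace-Id": "abc-123"}))
```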

Load Balancing Policies

The router supports multiple load balancing policies:

Policy	Description	Session Affinity	Use Case
`round_robin`	Sequential distribution across workers	No	General purpose, even distribution
`random`	Uniform random selection	No	Simple deployments
`consistent_hash`	Routes same session/user to same worker	Yes	Multi-turn chat, KV cache reuse
`power_of_two`	Picks least loaded of two random workers	No	Load-sensitive workloads
`cache_aware`	Optimizes for prefix cache hits	Yes	Repeated prompts, few-shot
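
To illustrate the `power_of_two` row, here is a minimal sketch of the "power of two choices" technique: sample two workers at random and send the request to the less loaded one. What counts as "load" (for example, in-flight requests) is an assumption, and `pick_power_of_two` is a hypothetical name.

```python
import random
from typing import Dict, Optional


def pick_power_of_two(loads: Dict[str, int], rng: Optional[random.Random] = None) -> str:
    """Sample two distinct workers at random and return the less loaded one."""
    rng = rng or random.Random()
    workers = list(loads)
    if not workers:
        raise ValueError("no workers available")
    if len(workers) == 1:
        return workers[0]
    first, second = rng.sample(workers, 2)
    return first if loads[first] <= loads[second] else second


# Hypothetical in-flight request counts per worker.
loads = {"http://worker1:8000": 5, "http://worker2:8000": 1, "http://worker3:8000": 9}
print(pick_power_of_two(loads))
```

Because only two workers are compared per request, the most loaded worker is never chosen when there are at least two workers, yet the router avoids scanning the whole pool.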

# Example: Using consistent_hash with HTTP header for session affinity
curl -X POST http://router:8000/v1/chat/completions \
  -H "X-Session-ID: my-session-123" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3", "messages": [{"role": "user", "content": "Hello!"}]}'

For detailed configuration options, hash key priorities, and usage examples, see Load Balancing Documentation.
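
For readers unfamiliar with consistent hashing, the sketch below shows the general technique with a hash ring and virtual nodes: the same session key always maps to the same worker, and adding a worker only moves the keys that the new worker takes over. The `HashRingSketch` class, the choice of SHA-256, and the number of virtual nodes are illustrative assumptions, not the router's actual implementation.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash(key: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRingSketch:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, workers: List[str], vnodes: int = 64) -> None:
        if not workers:
            raise ValueError("at least one worker is required")
        # Each worker is placed on the ring at `vnodes` pseudo-random points.
        self._ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{worker}#{i}"), worker) for worker in workers for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def worker_for(self, session_key: str) -> str:
        """Map a session key to the first ring point at or after its hash."""
        index = bisect.bisect_left(self._points, _hash(session_key)) % len(self._ring)
        return self._ring[index][1]


ring = HashRingSketch(["http://worker1:8000", "http://worker2:8000"])
print(ring.worker_for("my-session-123") == ring.worker_for("my-session-123"))
```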

Advanced Features

Kubernetes Service Discovery

Automatic worker discovery and management in Kubernetes environments.

Basic Service Discovery

vllm-router \
    --service-discovery \
    --selector app=vllm-worker role=inference \
    --service-discovery-namespace default

Command Line Arguments Reference

Service Discovery

--service-discovery: Enable Kubernetes service discovery
--service-discovery-port: Port for worker URLs (default: 8000)
--service-discovery-namespace: Kubernetes namespace to watch
--selector: Label selectors for regular mode (format: key1=value1 key2=value2)
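
The selector format above follows the usual equality-based label matching: a pod qualifies when it carries every listed label with the given value. The sketch below parses the `key1=value1 key2=value2` format and applies that rule. The helper names are hypothetical, and the `http://<pod-ip>:<port>` URL shape is an assumption based on the --service-discovery-port description.

```python
from typing import Dict


def parse_selector(selector: str) -> Dict[str, str]:
    """Parse 'key1=value1 key2=value2' into a dict."""
    pairs: Dict[str, str] = {}
    for item in selector.split():
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"invalid selector term: {item!r}")
        pairs[key] = value
    return pairs


def matches(pod_labels: Dict[str, str], selector: Dict[str, str]) -> bool:
    """A pod matches when it carries every selector label with the same value."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def worker_url(pod_ip: str, port: int = 8000) -> str:
    # Assumption: discovered workers are addressed as http://<pod-ip>:<port>.
    return f"http://{pod_ip}:{port}"


selector = parse_selector("app=vllm-worker role=inference")
pod_labels = {"app": "vllm-worker", "role": "inference", "zone": "a"}
if matches(pod_labels, selector):
    print(worker_url("10.0.0.12"))
```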

Development

Troubleshooting

VSCode Rust Analyzer Issues: Set rust-analyzer.linkedProjects to the absolute path of Cargo.toml:

{
  "rust-analyzer.linkedProjects": ["/workspaces/vllm/vllm-router/Cargo.toml"]
}

CI/CD Pipeline

The continuous integration pipeline includes comprehensive testing, benchmarking, and publishing:

Build & Test

Build Wheels: Uses cibuildwheel for manylinux x86_64 packages
Build Source Distribution: Creates source distribution for pip fallback
Rust HTTP Server Benchmarking: Performance testing of router overhead
Basic Inference Testing: End-to-end validation through the router
PD Disaggregation Testing: Benchmark and sanity checks for prefill-decode load balancing

Publishing

PyPI Publishing: Wheels and source distributions published when version changes in pyproject.toml
Container Images: Docker images published using /docker/Dockerfile.router

Acknowledgement

This project is a fork of SGLang Model Gateway, and we would like to explicitly acknowledge and thank the original authors for their work. At this stage, our fork includes only minimal changes to preserve the existing interface and ensure compatibility with vLLM. We anticipate further divergence as we pursue the roadmap we have in mind, which is the reason for creating the fork.

Release history

0.1.14

Apr 24, 2026

0.1.13

Apr 23, 2026

This version

0.1.12

Mar 6, 2026

0.1.11 yanked

Mar 4, 2026

Reason this release was yanked:

https://github.com/vllm-project/router/issues/100

0.1.10

Feb 4, 2026

0.1.9

Jan 21, 2026

