Skip to main content

A minimal, high-performance large language model (LLM) inference engine implementing vLLM in Rust.

Project description

🚀 vLLM.rs

Blazing-fast LLM inference in pure Rust. No PyTorch. No Python runtime. Just fast, portable, production-ready inference.

English | 简体中文


✨ Why vLLM.rs?

  • Zero Python dependencies — Pure Rust backend, no PyTorch, no CUDA Python bindings.
  • FastNative Flash Attention, FlashInfer, CUDA Graphs, continuous batching, prefix caching, and PD disaggregation. Up to 175 tok/s decode speed for 30B+ models on consumer GPUs.
  • Tiny footprint — Core scheduling + attention logic in < 5000 lines of Rust.
  • Cross-platform — CUDA (Linux/Windows), Metal (macOS). Same binary, same API.
  • Production-ready — OpenAI/Anthropic-compatible APIs, built-in ChatGPT-style Web UI, MCP tool calling, structured outputs, embedding + tokenizer endpoints.
  • Aggressive KV compression — TurboQuant (2–4 bit KV cache) extends context up to 4.3× with minimal quality loss. Run 30B+ MoE models with millions of context on single 24/32 GB GPUs.
  • Lightweight Python bindings — Optional PyO3 wheel when you need a Python entry point.

Quick Start

Option A — 🚀 Rust (recommended)

# Prerequisites: Rust compiler and CUDA Toolkit (if not installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
sudo apt-get install -y git build-essential libssl-dev pkg-config

# Repo for install
export VLLM_RS_REPO="https://github.com/guoqingbao/vllm.rs"

# 1. Install (one-time, remove `flashinfer` and `cutlass` features on SM_70/SM_75, e.g., V100)
cargo install --git $VLLM_RS_REPO vllm-rs --features cuda,nccl,flashinfer,cutlass

# or, git clone and install from local source code
# git clone $VLLM_RS_REPO && cd vllm.rs
# ./build.sh --install --features cuda,nccl,flashinfer,cutlass

# 2. Run
vllm-rs --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo4 --ui-server

# local model
# vllm-rs --w /home/Qwen3.6-35B-A3B --d 0,1 --ui-server

# 3. Vibe Coding Client (optinal)
cargo install xbot # config to use local Base URL

Open http://IP:8001 for the built-in chat UI, or use http://IP:8000/v1/ as API server Base URL.

Optionally add --kvcache-dtype to compress KV cache and extend context:

Flag (--kvcache-dtype) Compression Quality GPU Requirement
(default) 1× (BF16) Baseline All
fp8 Near-lossless SM70+ / Apple M1
turbo8 2.6× 79–100% throughput SM70+ / Apple M1
turbo4 3.7× Best balance SM70+ / Apple M1
turbo3 4.7× Max compression SM70+

Option B — 📦 Python (pip install)

  • 💡Turing/V100 (SM70/SM75), Hopper (SM90) / Blackwell (SM100+): download wheel from GitHub Releases;
# Metal (macOS) / Ampere (SM80, A100)
pip install vllm_rs
python3 -m vllm_rs.server --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo4 --ui-server

Option C — Install with Docker

  • 💡Change sm_xx to sm_70/sm_75 (Turing/V100, remove flashinfer and cutlass features), sm_80/sm_89 (Ampere), sm_90 (Hopper), sm_100/sm_120 (Blackwell)
# Example: Hopper (SM_90, CUDA 13.0.0), append extra argument 1 for rust crate mirror (Chinese Mainland)
./build_docker.sh "cuda,nccl,flashinfer,cutlass" sm_90 13.0.0

See Docker guide →


📈 Performance

V100-32G, A100-40G, Hopper-80G and RTX 5090

Model Format Size Decoding Speed
Ministral-3-3B (Multimodal) ISQ (BF16→Q4K) 3B 171.92 tokens/s
Qwen3-VL-8B-Instruct (Multimodal) Q8_0 8B 105.31 tokens/s
Llama-3.1-8B ISQ (BF16→Q4K) 8B 120.74 tokens/s
DeepSeek-R1-0528-Qwen3-8B Q4_K_M 8B 124.87 tokens/s
GLM-4-9B-0414 Q4_K_M 9B 70.38 tokens/s
QwQ-32B Q4_K_M 32B 41.36 tokens/s
Qwen3-30B-A3B NVFP4 30B (MoE) 175.30 tokens/s (RTX 5090)
Qwen3-30B-A3B NVFP4 30B (MoE) 67.10 tokens/s (V100, Software FP4)
Qwen3.5-27B (Multimodal) Q4_K_M 27B (Dense) 45.20 tokens/s
Qwen3.5-27B/Qwen3.6-27B FP8 27B (Dense) 42 tokens/s (Hopper)
Qwen3.6-35B-A3B (Multimodal) FP8 35B (MoE) 102 tokens/s (Hopper)
GLM4.7 Flash NVFP4 30B (MoE) 79 tokens/s (Hopper, Software FP4)
Gemma4-31B ISQ (BF16→Q4K) 31B (Dense) 41 tokens/s (Hopper)
Gemma4-26B-A4B NVFP4 26B (MoE) 131 tokens/s (RTX 5090)
MiniMax-M2.5 NVFP4 229B (MoE) 62 tokens/s (Hopper, Software FP4, TP=2)
Apple Silicon (M4)
Model Batch Size Output Tokens Time (s) Throughput (tokens/s)
Qwen3-0.6B (BF16) 128 63488 83.13s 763.73
Qwen3-0.6B (BF16) 32 15872 23.53s 674.43
Qwen3-0.6B (BF16) 1 456 9.23s 49.42
Qwen3-4B (Q4_K_M) 1 1683 52.62s 31.98
Qwen3-8B (Q2_K) 1 1300 80.88s 16.07
Qwen3.5-4B (Q3_K_M) 1 1592 69.04s 23.06
Qwen3.5-2B (NVFP4) 1 1883 60.76s 30.99
Qwen3.5-2B (NVFP4) 2 3942 81.96s 48.10

Full benchmarks →


🧠 Supported Models

  • ✅ LLaMa (LLaMa2, LLaMa3, LLaMa4, IQuest-Coder)
  • ✅ Qwen (Qwen2, Qwen3)
  • ✅ Qwen2/Qwen3 MoE
  • ✅ Qwen3 Next
  • ✅ Qwen3.5/3.6 Dense/MoE (27B, 35B, 122B, 397B, Multimodal model)
  • ✅ Mistral v1, v2
  • ✅ Mistral-3-VL Reasoning (3B, 8B, 14B, Multimodal model)
  • ✅ GLM4 (0414, Not ChatGLM)
  • ✅ GLM4 MoE (4.6/4.7)
  • ✅ GLM4.7 Flash
  • ✅ DeepSeek V3/R1/V3.2
  • ✅ Phi3 / Phi4 (Phi-3, Phi-4, Phi-4-mini, etc.)
  • ✅ Gemma3/Gemma4 (Multimodal model)
  • ✅ Qwen3-VL (Dense, Multimodal model)
  • ✅ MiroThinker-v1.5 (30B, 235B)

Formats: Safetensors (BF16, FP8-blockwise, GPTQ, AWQ, MXFP4, NVFP4) | GGUF (all quant types) | ISQ (on-the-fly quantization)

TurboQuant KV Cache — Run 30B+ Models on Consumer GPUs

TurboQuant compresses KV cache to 2–4 bits via Walsh-Hadamard transform rotation + per-head absmax quantization. Max context tokens with turbo4:

Model KV budget BF16 turbo4 Gain
Qwen3.6-35B-A3B (NVFP4) 7 GB (24 GB GPU) 700k 2.7M 3.9×
15 GB (32 GB GPU) 1.5M 5.8M 3.9×
Qwen3.6-27B (FP8) 7 GB 112k 434k 3.9×
15 GB 240k 930k 3.9×
Qwen3-30B-A3B (Q4_K_M) 7 GB 74k 281k 3.8×
15 GB 160k 602k 3.8×
Gemma4-26B-A4B (NVFP4) 7 GB 32k 125k 3.9×
15 GB 70k 271k 3.9×

Hybrid models (Qwen3.6) have fewer full attention layers, making TurboQuant especially effective. MLA models (DeepSeek, GLM4.7 Flash) use fp8 instead. The KV budget in the table is the theoretical maximum; actual usage can only utilize up to 90% of the KV budget (--kv-fraction 0.9), leaving room for runtime and batching buffers.

# 35B MoE on single 24/32 GB GPU
vllm-rs --m unsloth/Qwen3.6-35B-A3B-NVFP4 --kvcache-dtype turbo4

# Production precision
vllm-rs --m Qwen/Qwen3.6-35B-A3B-FP8 --kvcache-dtype fp8

# 27B Dense + turbo4
vllm-rs --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo4

# 30B MoE GGUF + turbo4
vllm-rs --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF \
  --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --kvcache-dtype turbo4

📘 Usage (Rust)

Installation

CUDA (Linux)
# Prerequisites
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
sudo apt-get install -y git build-essential libssl-dev pkg-config

# Optional: CUDA toolkit + NCCL
sudo apt-get install -y cuda-nvcc-12-9 cuda-nvrtc-dev-12-9 libcublas-dev-12-9 libcurand-dev-12-9
sudo apt-get install -y libnccl2 libnccl-dev

# Build & install
./build.sh --install --features cuda,nccl,flashinfer,cutlass
# Flash Attention backend alternative:
./build.sh --install --features cuda,nccl,flashattn,cutlass
# V100 / older (no flash backends):
./build.sh --install --features cuda,nccl
Metal (macOS)
# Install Xcode command-line tools first
cargo install --features metal
Docker
# sm_80 = A100, sm_90 = Hopper, sm_120 = Blackwell
./build_docker.sh "cuda,nccl,flashinfer,cutlass,python" sm_80 12.9.0 0
# Production image with Flash Attention:
./build_docker.sh --prod "cuda,nccl,flashattn,cutlass,python" sm_90 13.0.0

See Docker guide →

Running Models

  • 💡By default, vllm-rs starts an OpenAI-compatible API server at http://localhost:8000. Add --ui-server to also launch the built-in ChatGPT-style Web UI at http://localhost:8001.

  • 💡For built within Docker, refer Run vLLM.rs in docker →

# FP8 model (sm90+ with cutlass) + web UI
vllm-rs --m Qwen/Qwen3.6-27B-FP8 --ui-server

# Unquantized Safetensors (multi-GPU)
vllm-rs --d 0,1 --m Qwen/Qwen3-30B-A3B-Instruct-2507 --kvcache-dtype fp8

# ISQ on-the-fly quantization
vllm-rs --m Qwen/Qwen3.6-35B-A3B --isq q4k

# NVFP4 model
vllm-rs --m unsloth/Qwen3.6-27B-NVFP4

# MXFP4
vllm-rs --m olka-fi/Qwen3.5-4B-MXFP4

# GGUF model (4-bit KvCache)
vllm-rs --m unsloth/Qwen3.5-27B-GGUF --f Qwen3.5-27B-Q4_K_M.gguf --kvcache-dtype turbo4

# FP8 on Metal
vllm-rs --m Qwen/Qwen3.5-4B-FP8

# Gemma4 26B (NVFP4)
vllm-rs --m unsloth/gemma-4-26b-a4b-it-NVFP4

# MLA model (GLM4.7 Flash)
vllm-rs --m GadflyII/GLM-4.7-Flash-NVFP4

# Interactive CLI chat
vllm-rs --i --m unsloth/Qwen3.5-27B-GGUF --f Qwen3.5-27B-Q4_K_M.gguf
ISQ (on-the-fly quantization) + KV cache compression
# ISQ Q4K + FP8 KV cache
vllm-rs --m Qwen/Qwen3.6-35B-A3B --isq q4k --kvcache-dtype fp8

# ISQ Q4K + TurboQuant KV cache
vllm-rs --m Qwen/Qwen3.6-35B-A3B --isq q4k --kvcache-dtype turbo4

# Metal ISQ
vllm-rs --w /path/Qwen3-4B --isq q6k
GGUF models
# Single GPU — GGUF
vllm-rs --m unsloth/Qwen3.5-27B-GGUF --f Qwen3.5-27B-Q4_K_M.gguf

# Multi-GPU — GGUF
vllm-rs --d 0,1 --f /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf
TurboQuant KV cache (2–4 bit) — see TurboQuant section
# turbo4: 4-bit K+V — 3.7× compression, best tradeoff
vllm-rs --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo4

# turbo3: 3-bit K + 4-bit V — 4.7× compression
vllm-rs --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo3

# turbo8: FP8 K + 4-bit V — 2.6× compression, highest quality
vllm-rs --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo8

# 35B MoE (NVFP4 + turbo4) — fits on single 24 GB GPU
vllm-rs --m unsloth/Qwen3.6-35B-A3B-NVFP4 --kvcache-dtype turbo4

# 30B MoE (GGUF Q4_K_M + turbo4) — consumer GPU
vllm-rs --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF \
  --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --kvcache-dtype turbo4
Multimodal models (Qwen3-VL, Gemma4, Mistral3-VL)
# Upload images via built-in Chat UI or send image_url in API requests

# Qwen3.6 35B MoE (FP8, multimodal)
vllm-rs --m Qwen/Qwen3.6-35B-A3B-FP8 --ui-server

# Qwen3-VL 8B (GGUF)
vllm-rs --m unsloth/Qwen3-VL-8B-Instruct-GGUF --f Qwen3-VL-8B-Instruct-Q8_0.gguf --ui-server

# Gemma4 26B MoE (NVFP4, multimodal)
vllm-rs --m unsloth/gemma-4-26b-a4b-it-NVFP4 --ui-server

# Mistral-3 VL 3B (BF16, multimodal)
vllm-rs --m mistralai/Ministral-3-3B --ui-server

📘 Usage (Python)

Running Models

# FP8 model + web UI
python3 -m vllm_rs.server --m Qwen/Qwen3.6-27B-FP8 --ui-server

# Unquantized Safetensors (multi-GPU)
python3 -m vllm_rs.server --m Qwen/Qwen3.5-122B-A10B --d 0,1 --kvcache-dtype fp8

# ISQ on-the-fly quantization
python3 -m vllm_rs.server --w /path/Qwen3.6-35B-A3B --isq q4k --d 0 --kvcache-dtype turbo8

# NVFP4 / MXFP4
python3 -m vllm_rs.server --m unsloth/Qwen3.6-27B-NVFP4
python3 -m vllm_rs.server --m olka-fi/Qwen3.5-4B-MXFP4
python3 -m vllm_rs.server --m GadflyII/GLM-4.7-Flash-NVFP4

# GGUF
python3 -m vllm_rs.server --m unsloth/Qwen3.5-27B-GGUF --f Qwen3.5-27B-Q4_K_M.gguf

# Multimodal
python3 -m vllm_rs.server --m Qwen/Qwen3.6-35B-A3B-FP8 --kvcache-dtype fp8

# GPTQ / AWQ
python3 -m vllm_rs.server --w /home/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4-Marlin

See more Python examples →

Build Python wheel from source
pip install maturin maturin[patchelf]

# FlashInfer backend (SM80+)
./build.sh --release --features cuda,nccl,flashinfer,cutlass,python

# Flash Attention backend
./build.sh --release --features cuda,nccl,flashattn,cutlass,python

# macOS Metal
maturin build --release --features metal,python

# Install
pip install target/wheels/vllm_rs-*.whl --force-reinstall

🔀 Prefill-Decode Disaggregation

Split prefill (prompt processing) and decode (token generation) across GPUs or machines. Eliminates decode stalls during long-context prefilling. PD Server and PD Client must use same KvCache type (--kvcache-dtype). API request(s) must send to PD Client and the PD Server only process internal prefill requests sent from PD Client.

Mode Config Use Case
Local IPC (default, no flag) Same machine, CUDA
File IPC --pd-url file:///path Docker containers, shared volume
Remote TCP --pd-url tcp://host:port Different machines

Local IPC (multirank)

# PD Server (prefill GPU, default port 7000)
vllm-rs --d 0,1 --m Qwen/Qwen3-30B-A3B-Instruct-2507 --pd-server

# PD Client (decode GPU + API)
vllm-rs --d 2,3 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --ui-server --port 8000 --pd-client

Multinode (tcp mode)

# Server machine (192.168.1.100)
target/release/vllm-rs --d 0,1 --m Qwen/... --pd-server --pd-url tcp://0.0.0.0:8100

# Client machine
target/release/vllm-rs --d 0,1 --w /path/... --pd-client --pd-url tcp://192.168.1.100:8100 --ui-server --port 8000

Metal/macOS requires --pd-url (no LocalIPC support).

Multi-container(file:// mode)
mkdir -p /tmp/pd-sockets

# Server container
docker run --gpus '"device=0,1"' -v /tmp/pd-sockets:/sockets ...
target/release/vllm-rs --d 0,1 --m Qwen/... --pd-server --pd-url file:///sockets

# Client container
docker run --gpus '"device=2,3"' -v /tmp/pd-sockets:/sockets ...
target/release/vllm-rs --d 0,1 --w /path/... --pd-client --pd-url file:///sockets --ui-server --port 8000

🔌 MCP Tool Calling

vllm-rs --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF \
  --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --ui-server --mcp-config ./mcp.json

MCP documentation →


🔌 Structured Outputs

Constraint-based generation via llguidance — Lark grammars, regex, JSON Schema.

Structured outputs documentation →


📚 Documentation

Guide Description
Get Started Build, run, and configure
Docker Container builds and deployment
Performance Full benchmark tables
Prefix Cache Automatic KV cache reuse
Multimodal Vision-language models
Embedding Text embedding API
Tokenizer API Tokenize / detokenize endpoints
Tool Parsing Tool call detection and parsing
MCP Integration Model Context Protocol
Guided Decoding Structured outputs
Rust Crate Use as a library
Add a Model Port a new architecture (AI-assisted)
Test a Model Validate model quality (AI-assisted)

Using Agents under vLLM.rs backend: xbot · OpenCode · Kilo Code · Claude Code · Goose


⚙️ CLI Reference

Flag Description
--m HuggingFace model ID (auto-download)
--w Local Safetensors model path
--f GGUF file path (or filename when --m is given)
--d Device IDs (e.g. --d 0,1)
--ui-server API server + built-in ChatGPT-style web UI
--server API server only (no web UI)
--i Interactive CLI chat
--isq On-the-fly quantization: q2k, q3k, q4k, q5k, q6k, q8_0
--kvcache-dtype KV cache quantization: fp8, turbo8, turbo4, turbo3
--max-num-seqs Max concurrent requests (default: 32, macOS: 8)
--max-tokens Max tokens per response (default: 16384)
--kv-fraction GPU memory fraction for KV cache
--cpu-mem-fold CPU swap memory ratio (default: 0.2)
--pd-server Run as PD prefill server
--pd-client Run as PD decode client
--pd-url PD connection URL (tcp://, http://, file://)
--disable-prefix-cache Disable prefix caching
--prefix-cache-max-tokens Cap prefix cache size
--disable-cuda-graph Disable CUDA graph capture
--yarn-scaling-factor YARN RoPE context extension factor
--temperature Sampling temperature (0–1)
--top-k / --top-p Top-k / nucleus sampling
--presence-penalty Penalize repeated tokens (−2 to 2)
--frequency-penalty Penalize frequent tokens (−2 to 2)
--mcp-config MCP servers JSON config
--mcp-command / --mcp-args Single MCP server command + args

📽️ Demo


🛠️ Roadmap

  • Batched inference (Metal)
  • GGUF format support
  • FlashAttention (CUDA)
  • CUDA Graph
  • OpenAI-compatible API (streaming support)
  • Continuous batching
  • Multi-gpu inference (Safetensors, GPTQ, AWQ, GGUF)
  • Speedup prompt processing on Metal/macOS
  • Chunked Prefill
  • Prefix cache (available on CUDA when prefix-cache enabled)
  • Model loading from hugginface hub
  • Model loading from ModelScope (China)
  • Prefix cache for Metal/macOS
  • FP8 KV Cache (CUDA, all backends including FlashInfer on SM80+)
  • FP8 KV Cache (Metal)
  • FP8 KV Cache (with FlashInfer, SM80+)
  • TurboQuant KV Cache (2-4 bit compression with WHT rotation)
  • FP8 Models (CUDA: MoE, Dense; Metal: Dense)
  • Additional model support (Kimi K2, GLM 5.1 etc.)
  • CPU KV Cache Offloading
  • Prefill-decode Disaggregation (CUDA)
  • Prefill-decode Disaggregation (Metal)
  • Built-in ChatGPT-like Web Server
  • Embedding API
  • Tokenize/Detokenize API
  • MCP Integration & Tool Calling
  • Prefix Caching
  • Claude/Anthropic-compatible API Server
  • Support CUDA 13
  • Support FlashInfer backend
  • Support DeepGEMM backend (Hopper)
  • MXFP4/NVFP4 Model Support
  • Support Turboquant (4-bit, 3-bit) KvCache
  • TentorRT-LLM

📚 References

Star History

Star History Chart

Like this project? Give it a ⭐ and contribute!

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

vllm_rs-0.11.5-cp38-abi3-macosx_11_0_arm64.whl (11.2 MB view details)

Uploaded CPython 3.8+macOS 11.0+ ARM64

File details

Details for the file vllm_rs-0.11.5-cp38-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for vllm_rs-0.11.5-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 892470cc26d2cc45c2f5b048b74c70895f221454d11cd7228b1b7d100e8f1f0c
MD5 8709ce31642cf83039b1f0f718f9cc5a
BLAKE2b-256 2b9c3918782cba82e55c8fe97e949e67140d71d20395da4c6f52b0875d3055a5

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page