A minimal, high-performance large language model (LLM) inference engine implementing vLLM in Rust.

These details have not been verified by PyPI

Project links

Project description

🚀 vLLM.rs

Blazing-fast LLM inference in pure Rust. No PyTorch. No Python runtime. Just fast, portable, production-ready inference.

English | 简体中文

✨ Why vLLM.rs?

Zero Python dependencies — Pure Rust backend, no PyTorch, no CUDA Python bindings.
Fast — Native Flash Attention, FlashInfer, CUDA Graphs, continuous batching, prefix caching, and PD disaggregation. Up to 175 tok/s decode speed for 30B+ models on consumer GPUs.
Tiny footprint — Core scheduling + attention logic in < 5000 lines of Rust.
Cross-platform — CUDA (Linux/Windows), Metal (macOS). Same binary, same API.
Production-ready — OpenAI/Anthropic-compatible APIs, built-in ChatGPT-style Web UI, MCP tool calling, structured outputs, embedding + tokenizer endpoints.
Aggressive KV compression — TurboQuant (2–4 bit KV cache) extends context up to 4.3× with minimal quality loss. Run 30B+ MoE models with millions of context on single 24/32 GB GPUs.
Lightweight Python bindings — Optional PyO3 wheel when you need a Python entry point.

Quick Start

Option A — 🚀 Rust (recommended)

# Prerequisites: Rust compiler and CUDA Toolkit (if not installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
sudo apt-get install -y git build-essential libssl-dev pkg-config

# Repo for install
export VLLM_RS_REPO="https://github.com/guoqingbao/vllm.rs"

# 1. Install (one-time, remove `flashinfer` and `cutlass` features on SM_70/SM_75, e.g., V100)
cargo install --git $VLLM_RS_REPO vllm-rs --features cuda,nccl,flashinfer,cutlass

# or, git clone and install from local source code
# git clone $VLLM_RS_REPO && cd vllm.rs
# ./build.sh --install --features cuda,nccl,flashinfer,cutlass

# 2. Run
vllm-rs --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo4 --ui-server

# local model
# vllm-rs --w /home/Qwen3.6-35B-A3B --d 0,1 --ui-server

# 3. Vibe Coding Client (optinal)
cargo install xbot # config to use local Base URL

Open http://IP:8001 for the built-in chat UI, or use http://IP:8000/v1/ as API server Base URL.

Optionally add --kvcache-dtype to compress KV cache and extend context:

Flag (`--kvcache-dtype`)	Compression	Quality	GPU Requirement
(default)	1× (BF16)	Baseline	All
`fp8`	2×	Near-lossless	SM70+ / Apple M1
`turbo8`	2.6×	79–100% throughput	SM70+ / Apple M1
`turbo4`	3.7×	Best balance	SM70+ / Apple M1
`turbo3`	4.7×	Max compression	SM70+

Option B — 📦 Python (`pip install`)

💡Turing/V100 (SM70/SM75), Hopper (SM90) / Blackwell (SM100+): download wheel from GitHub Releases;

# Metal (macOS) / Ampere (SM80, A100)
pip install vllm_rs
python3 -m vllm_rs.server --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo4 --ui-server

Option C — Install with Docker

💡Change sm_xx to sm_70/sm_75 (Turing/V100, remove flashinfer and cutlass features), sm_80/sm_89 (Ampere), sm_90 (Hopper), sm_100/sm_120 (Blackwell)

# Example: Hopper (SM_90, CUDA 13.0.0), append extra argument 1 for rust crate mirror (Chinese Mainland)
./build_docker.sh "cuda,nccl,flashinfer,cutlass" sm_90 13.0.0

See Docker guide →

📈 Performance

V100-32G, A100-40G, Hopper-80G and RTX 5090

Model	Format	Size	Decoding Speed
Ministral-3-3B (Multimodal)	ISQ (BF16→Q4K)	3B	171.92 tokens/s
Qwen3-VL-8B-Instruct (Multimodal)	Q8_0	8B	105.31 tokens/s
Llama-3.1-8B	ISQ (BF16→Q4K)	8B	120.74 tokens/s
DeepSeek-R1-0528-Qwen3-8B	Q4_K_M	8B	124.87 tokens/s
GLM-4-9B-0414	Q4_K_M	9B	70.38 tokens/s
QwQ-32B	Q4_K_M	32B	41.36 tokens/s
Qwen3-30B-A3B	NVFP4	30B (MoE)	175.30 tokens/s (RTX 5090)
Qwen3-30B-A3B	NVFP4	30B (MoE)	67.10 tokens/s (V100, Software FP4)
Qwen3.5-27B (Multimodal)	Q4_K_M	27B (Dense)	45.20 tokens/s
Qwen3.5-27B/Qwen3.6-27B	FP8	27B (Dense)	42 tokens/s (Hopper)
Qwen3.6-35B-A3B (Multimodal)	FP8	35B (MoE)	102 tokens/s (Hopper)
GLM4.7 Flash	NVFP4	30B (MoE)	79 tokens/s (Hopper, Software FP4)
Gemma4-31B	ISQ (BF16→Q4K)	31B (Dense)	41 tokens/s (Hopper)
Gemma4-26B-A4B	NVFP4	26B (MoE)	131 tokens/s (RTX 5090)
MiniMax-M2.5	NVFP4	229B (MoE)	62 tokens/s (Hopper, Software FP4, TP=2)

Apple Silicon (M4)

Model	Batch Size	Output Tokens	Time (s)	Throughput (tokens/s)
Qwen3-0.6B (BF16)	128	63488	83.13s	763.73
Qwen3-0.6B (BF16)	32	15872	23.53s	674.43
Qwen3-0.6B (BF16)	1	456	9.23s	49.42
Qwen3-4B (Q4_K_M)	1	1683	52.62s	31.98
Qwen3-8B (Q2_K)	1	1300	80.88s	16.07
Qwen3.5-4B (Q3_K_M)	1	1592	69.04s	23.06
Qwen3.5-2B (NVFP4)	1	1883	60.76s	30.99
Qwen3.5-2B (NVFP4)	2	3942	81.96s	48.10

Full benchmarks →

🧠 Supported Models

✅ LLaMa (LLaMa2, LLaMa3, LLaMa4, IQuest-Coder)
✅ Qwen (Qwen2, Qwen3)
✅ Qwen2/Qwen3 MoE
✅ Qwen3 Next
✅ Qwen3.5/3.6 Dense/MoE (27B, 35B, 122B, 397B, Multimodal model)
✅ Mistral v1, v2
✅ Mistral-3-VL Reasoning (3B, 8B, 14B, Multimodal model)
✅ GLM4 (0414, Not ChatGLM)
✅ GLM4 MoE (4.6/4.7)
✅ GLM4.7 Flash
✅ DeepSeek V3/R1/V3.2
✅ Phi3 / Phi4 (Phi-3, Phi-4, Phi-4-mini, etc.)
✅ Gemma3/Gemma4 (Multimodal model)
✅ Qwen3-VL (Dense, Multimodal model)
✅ MiroThinker-v1.5 (30B, 235B)

Formats: Safetensors (BF16, FP8-blockwise, GPTQ, AWQ, MXFP4, NVFP4) | GGUF (all quant types) | ISQ (on-the-fly quantization)

TurboQuant KV Cache — Run 30B+ Models on Consumer GPUs

TurboQuant compresses KV cache to 2–4 bits via Walsh-Hadamard transform rotation + per-head absmax quantization. Max context tokens with turbo4:

Model	KV budget	BF16	turbo4	Gain
Qwen3.6-35B-A3B (NVFP4)	7 GB (24 GB GPU)	700k	2.7M	3.9×
	15 GB (32 GB GPU)	1.5M	5.8M	3.9×
Qwen3.6-27B (FP8)	7 GB	112k	434k	3.9×
	15 GB	240k	930k	3.9×
Qwen3-30B-A3B (Q4_K_M)	7 GB	74k	281k	3.8×
	15 GB	160k	602k	3.8×
Gemma4-26B-A4B (NVFP4)	7 GB	32k	125k	3.9×
	15 GB	70k	271k	3.9×

Hybrid models (Qwen3.6) have fewer full attention layers, making TurboQuant especially effective. MLA models (DeepSeek, GLM4.7 Flash) use fp8 instead. The KV budget in the table is the theoretical maximum; actual usage can only utilize up to 90% of the KV budget (--kv-fraction 0.9), leaving room for runtime and batching buffers.

# 35B MoE on single 24/32 GB GPU
vllm-rs --m unsloth/Qwen3.6-35B-A3B-NVFP4 --kvcache-dtype turbo4

# Production precision
vllm-rs --m Qwen/Qwen3.6-35B-A3B-FP8 --kvcache-dtype fp8

# 27B Dense + turbo4
vllm-rs --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo4

# 30B MoE GGUF + turbo4
vllm-rs --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF \
  --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --kvcache-dtype turbo4

📘 Usage (Rust)

Installation

CUDA (Linux)

# Prerequisites
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
sudo apt-get install -y git build-essential libssl-dev pkg-config

# Optional: CUDA toolkit + NCCL
sudo apt-get install -y cuda-nvcc-12-9 cuda-nvrtc-dev-12-9 libcublas-dev-12-9 libcurand-dev-12-9
sudo apt-get install -y libnccl2 libnccl-dev

# Build & install
./build.sh --install --features cuda,nccl,flashinfer,cutlass
# Flash Attention backend alternative:
./build.sh --install --features cuda,nccl,flashattn,cutlass
# V100 / older (no flash backends):
./build.sh --install --features cuda,nccl

Metal (macOS)

# Install Xcode command-line tools first
cargo install --features metal

Docker

# sm_80 = A100, sm_90 = Hopper, sm_120 = Blackwell
./build_docker.sh "cuda,nccl,flashinfer,cutlass,python" sm_80 12.9.0 0
# Production image with Flash Attention:
./build_docker.sh --prod "cuda,nccl,flashattn,cutlass,python" sm_90 13.0.0

See Docker guide →

Running Models

💡By default, vllm-rs starts an OpenAI-compatible API server at http://localhost:8000. Add --ui-server to also launch the built-in ChatGPT-style Web UI at http://localhost:8001.
💡For built within Docker, refer Run vLLM.rs in docker →

# FP8 model (sm90+ with cutlass) + web UI
vllm-rs --m Qwen/Qwen3.6-27B-FP8 --ui-server

# Unquantized Safetensors (multi-GPU)
vllm-rs --d 0,1 --m Qwen/Qwen3-30B-A3B-Instruct-2507 --kvcache-dtype fp8

# ISQ on-the-fly quantization
vllm-rs --m Qwen/Qwen3.6-35B-A3B --isq q4k

# NVFP4 model
vllm-rs --m unsloth/Qwen3.6-27B-NVFP4

# MXFP4
vllm-rs --m olka-fi/Qwen3.5-4B-MXFP4

# GGUF model (4-bit KvCache)
vllm-rs --m unsloth/Qwen3.5-27B-GGUF --f Qwen3.5-27B-Q4_K_M.gguf --kvcache-dtype turbo4

# FP8 on Metal
vllm-rs --m Qwen/Qwen3.5-4B-FP8

# Gemma4 26B (NVFP4)
vllm-rs --m unsloth/gemma-4-26b-a4b-it-NVFP4

# MLA model (GLM4.7 Flash)
vllm-rs --m GadflyII/GLM-4.7-Flash-NVFP4

# Interactive CLI chat
vllm-rs --i --m unsloth/Qwen3.5-27B-GGUF --f Qwen3.5-27B-Q4_K_M.gguf

ISQ (on-the-fly quantization) + KV cache compression

# ISQ Q4K + FP8 KV cache
vllm-rs --m Qwen/Qwen3.6-35B-A3B --isq q4k --kvcache-dtype fp8

# ISQ Q4K + TurboQuant KV cache
vllm-rs --m Qwen/Qwen3.6-35B-A3B --isq q4k --kvcache-dtype turbo4

# Metal ISQ
vllm-rs --w /path/Qwen3-4B --isq q6k

GGUF models

# Single GPU — GGUF
vllm-rs --m unsloth/Qwen3.5-27B-GGUF --f Qwen3.5-27B-Q4_K_M.gguf

# Multi-GPU — GGUF
vllm-rs --d 0,1 --f /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf

TurboQuant KV cache (2–4 bit) — see TurboQuant section

# turbo4: 4-bit K+V — 3.7× compression, best tradeoff
vllm-rs --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo4

# turbo3: 3-bit K + 4-bit V — 4.7× compression
vllm-rs --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo3

# turbo8: FP8 K + 4-bit V — 2.6× compression, highest quality
vllm-rs --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo8

# 35B MoE (NVFP4 + turbo4) — fits on single 24 GB GPU
vllm-rs --m unsloth/Qwen3.6-35B-A3B-NVFP4 --kvcache-dtype turbo4

# 30B MoE (GGUF Q4_K_M + turbo4) — consumer GPU
vllm-rs --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF \
  --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --kvcache-dtype turbo4

Multimodal models (Qwen3-VL, Gemma4, Mistral3-VL)

# Upload images via built-in Chat UI or send image_url in API requests

# Qwen3.6 35B MoE (FP8, multimodal)
vllm-rs --m Qwen/Qwen3.6-35B-A3B-FP8 --ui-server

# Qwen3-VL 8B (GGUF)
vllm-rs --m unsloth/Qwen3-VL-8B-Instruct-GGUF --f Qwen3-VL-8B-Instruct-Q8_0.gguf --ui-server

# Gemma4 26B MoE (NVFP4, multimodal)
vllm-rs --m unsloth/gemma-4-26b-a4b-it-NVFP4 --ui-server

# Mistral-3 VL 3B (BF16, multimodal)
vllm-rs --m mistralai/Ministral-3-3B --ui-server

📘 Usage (Python)

Running Models

# FP8 model + web UI
python3 -m vllm_rs.server --m Qwen/Qwen3.6-27B-FP8 --ui-server

# Unquantized Safetensors (multi-GPU)
python3 -m vllm_rs.server --m Qwen/Qwen3.5-122B-A10B --d 0,1 --kvcache-dtype fp8

# ISQ on-the-fly quantization
python3 -m vllm_rs.server --w /path/Qwen3.6-35B-A3B --isq q4k --d 0 --kvcache-dtype turbo8

# NVFP4 / MXFP4
python3 -m vllm_rs.server --m unsloth/Qwen3.6-27B-NVFP4
python3 -m vllm_rs.server --m olka-fi/Qwen3.5-4B-MXFP4
python3 -m vllm_rs.server --m GadflyII/GLM-4.7-Flash-NVFP4

# GGUF
python3 -m vllm_rs.server --m unsloth/Qwen3.5-27B-GGUF --f Qwen3.5-27B-Q4_K_M.gguf

# Multimodal
python3 -m vllm_rs.server --m Qwen/Qwen3.6-35B-A3B-FP8 --kvcache-dtype fp8

# GPTQ / AWQ
python3 -m vllm_rs.server --w /home/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4-Marlin

See more Python examples →

Build Python wheel from source

pip install maturin maturin[patchelf]

# FlashInfer backend (SM80+)
./build.sh --release --features cuda,nccl,flashinfer,cutlass,python

# Flash Attention backend
./build.sh --release --features cuda,nccl,flashattn,cutlass,python

# macOS Metal
maturin build --release --features metal,python

# Install
pip install target/wheels/vllm_rs-*.whl --force-reinstall

🔀 Prefill-Decode Disaggregation

Split prefill (prompt processing) and decode (token generation) across GPUs or machines. Eliminates decode stalls during long-context prefilling. PD Server and PD Client must use same KvCache type (--kvcache-dtype). API request(s) must send to PD Client and the PD Server only process internal prefill requests sent from PD Client.

Mode	Config	Use Case
Local IPC	(default, no flag)	Same machine, CUDA
File IPC	`--pd-url file:///path`	Docker containers, shared volume
Remote TCP	`--pd-url tcp://host:port`	Different machines

Local IPC (multirank)

# PD Server (prefill GPU, default port 7000)
vllm-rs --d 0,1 --m Qwen/Qwen3-30B-A3B-Instruct-2507 --pd-server

# PD Client (decode GPU + API)
vllm-rs --d 2,3 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --ui-server --port 8000 --pd-client

Multinode (tcp mode)

# Server machine (192.168.1.100)
target/release/vllm-rs --d 0,1 --m Qwen/... --pd-server --pd-url tcp://0.0.0.0:8100

# Client machine
target/release/vllm-rs --d 0,1 --w /path/... --pd-client --pd-url tcp://192.168.1.100:8100 --ui-server --port 8000

Metal/macOS requires --pd-url (no LocalIPC support).

Multi-container（file:// mode）

mkdir -p /tmp/pd-sockets

# Server container
docker run --gpus '"device=0,1"' -v /tmp/pd-sockets:/sockets ...
target/release/vllm-rs --d 0,1 --m Qwen/... --pd-server --pd-url file:///sockets

# Client container
docker run --gpus '"device=2,3"' -v /tmp/pd-sockets:/sockets ...
target/release/vllm-rs --d 0,1 --w /path/... --pd-client --pd-url file:///sockets --ui-server --port 8000

🔌 MCP Tool Calling

vllm-rs --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF \
  --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --ui-server --mcp-config ./mcp.json

MCP documentation →

🔌 Structured Outputs

Constraint-based generation via llguidance — Lark grammars, regex, JSON Schema.

Structured outputs documentation →

📚 Documentation

Guide	Description
Get Started	Build, run, and configure
Docker	Container builds and deployment
Performance	Full benchmark tables
Prefix Cache	Automatic KV cache reuse
Multimodal	Vision-language models
Embedding	Text embedding API
Tokenizer API	Tokenize / detokenize endpoints
Tool Parsing	Tool call detection and parsing
MCP Integration	Model Context Protocol
Guided Decoding	Structured outputs
Rust Crate	Use as a library
Add a Model	Port a new architecture (AI-assisted)
Test a Model	Validate model quality (AI-assisted)

Using Agents under vLLM.rs backend: xbot · OpenCode · Kilo Code · Claude Code · Goose

⚙️ CLI Reference

Flag	Description
`--m`	HuggingFace model ID (auto-download)
`--w`	Local Safetensors model path
`--f`	GGUF file path (or filename when `--m` is given)
`--d`	Device IDs (e.g. `--d 0,1`)
`--ui-server`	API server + built-in ChatGPT-style web UI
`--server`	API server only (no web UI)
`--i`	Interactive CLI chat
`--isq`	On-the-fly quantization: `q2k`, `q3k`, `q4k`, `q5k`, `q6k`, `q8_0`
`--kvcache-dtype`	KV cache quantization: `fp8`, `turbo8`, `turbo4`, `turbo3`
`--max-num-seqs`	Max concurrent requests (default: 32, macOS: 8)
`--max-tokens`	Max tokens per response (default: 16384)
`--kv-fraction`	GPU memory fraction for KV cache
`--cpu-mem-fold`	CPU swap memory ratio (default: 0.2)
`--pd-server`	Run as PD prefill server
`--pd-client`	Run as PD decode client
`--pd-url`	PD connection URL (`tcp://`, `http://`, `file://`)
`--disable-prefix-cache`	Disable prefix caching
`--prefix-cache-max-tokens`	Cap prefix cache size
`--disable-cuda-graph`	Disable CUDA graph capture
`--yarn-scaling-factor`	YARN RoPE context extension factor
`--temperature`	Sampling temperature (0–1)
`--top-k` / `--top-p`	Top-k / nucleus sampling
`--presence-penalty`	Penalize repeated tokens (−2 to 2)
`--frequency-penalty`	Penalize frequent tokens (−2 to 2)
`--mcp-config`	MCP servers JSON config
`--mcp-command` / `--mcp-args`	Single MCP server command + args

📽️ Demo

🛠️ Roadmap

Batched inference (Metal)
GGUF format support
FlashAttention (CUDA)
CUDA Graph
OpenAI-compatible API (streaming support)
Continuous batching
Multi-gpu inference (Safetensors, GPTQ, AWQ, GGUF)
Speedup prompt processing on Metal/macOS
Chunked Prefill
Prefix cache (available on CUDA when prefix-cache enabled)
Model loading from hugginface hub
Model loading from ModelScope (China)
Prefix cache for Metal/macOS
FP8 KV Cache (CUDA, all backends including FlashInfer on SM80+)
FP8 KV Cache (Metal)
FP8 KV Cache (with FlashInfer, SM80+)
TurboQuant KV Cache (2-4 bit compression with WHT rotation)
FP8 Models (CUDA: MoE, Dense; Metal: Dense)
Additional model support (Kimi K2, GLM 5.1 etc.)
CPU KV Cache Offloading
Prefill-decode Disaggregation (CUDA)
Prefill-decode Disaggregation (Metal)
Built-in ChatGPT-like Web Server
Embedding API
Tokenize/Detokenize API
MCP Integration & Tool Calling
Prefix Caching
Claude/Anthropic-compatible API Server
Support CUDA 13
Support FlashInfer backend
Support DeepGEMM backend (Hopper)
MXFP4/NVFP4 Model Support
Support Turboquant (4-bit, 3-bit) KvCache
TentorRT-LLM

📚 References

Candle-vLLM
Python nano-vllm

Star History

Like this project? Give it a ⭐ and contribute!

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.11.5

May 21, 2026

0.11.4

May 21, 2026

0.11.3

May 15, 2026

0.11.1

May 13, 2026

0.11.0

May 11, 2026

0.10.10

May 1, 2026

0.10.9

Apr 27, 2026

0.10.8

Apr 22, 2026

0.10.6

Apr 20, 2026

0.10.5

Apr 17, 2026

0.10.4

Apr 16, 2026

0.10.3

Apr 9, 2026

0.9.16

Apr 1, 2026

0.9.14

Mar 21, 2026

0.9.12

Mar 19, 2026

0.9.9

Mar 12, 2026

0.9.8

Mar 11, 2026

0.9.7

Feb 28, 2026

0.9.4

Feb 20, 2026

0.9.3

Feb 18, 2026

0.9.2

Feb 14, 2026

0.8.12

Feb 2, 2026

0.8.11

Jan 28, 2026

0.8.10

Jan 28, 2026

0.8.9

Jan 28, 2026

0.8.7

Jan 26, 2026

0.8.6

Jan 23, 2026

0.8.5

Jan 23, 2026

0.8.4

Jan 21, 2026

0.8.3

Jan 21, 2026

0.8.2

Jan 21, 2026

0.8.1

Jan 21, 2026

0.8.0

Jan 20, 2026

0.7.16

Jan 15, 2026

0.7.14

Jan 9, 2026

0.7.13

Jan 8, 2026

0.7.12

Jan 7, 2026

0.7.11

Jan 7, 2026

0.7.10

Jan 6, 2026

0.7.9

Jan 6, 2026

0.7.8

Jan 6, 2026

0.7.5

Jan 5, 2026

0.7.2

Jan 1, 2026

0.7.1

Jan 1, 2026

0.6.18

Dec 30, 2025

0.6.17

Dec 30, 2025

0.6.15

Dec 27, 2025

0.6.14

Dec 27, 2025

0.6.13

Dec 27, 2025

0.6.12

Dec 25, 2025

0.6.11

Dec 25, 2025

0.6.10

Dec 24, 2025

0.6.9

Dec 24, 2025

0.6.8

Dec 24, 2025

0.6.7

Dec 24, 2025

0.6.5

Dec 23, 2025

0.6.3

Dec 22, 2025

0.6.2

Dec 22, 2025

0.6.1

Dec 20, 2025

0.6.0

Dec 19, 2025

0.5.23

Dec 19, 2025

0.5.22

Dec 18, 2025

0.5.21

Dec 18, 2025

0.5.19

Dec 18, 2025

0.5.15

Dec 6, 2025

0.5.13

Dec 6, 2025

0.5.12

Dec 5, 2025

0.5.11

Dec 4, 2025

0.5.10

Dec 3, 2025

0.5.9

Nov 28, 2025

0.5.8

Nov 28, 2025

0.5.7

Nov 28, 2025

0.5.6

Nov 27, 2025

0.5.5

Nov 27, 2025

0.5.3

Nov 27, 2025

0.5.2

Nov 26, 2025

0.5.1

Nov 25, 2025

0.5.0

Nov 25, 2025

0.4.18

Nov 25, 2025

0.4.17

Nov 24, 2025

0.4.16

Nov 24, 2025

0.4.15

Nov 24, 2025

0.4.14

Nov 24, 2025

0.4.13

Nov 23, 2025

0.4.12

Nov 23, 2025

0.4.11

Nov 23, 2025

0.4.10

Nov 23, 2025

0.4.9

Nov 23, 2025

0.4.8

Nov 22, 2025

0.4.7

Nov 22, 2025

0.4.5

Nov 21, 2025

0.4.4

Nov 20, 2025

0.4.3

Nov 20, 2025

0.4.2

Nov 19, 2025

0.4.1

Nov 15, 2025

0.4.0

Nov 14, 2025

0.3.19

Nov 11, 2025

0.3.16

Nov 4, 2025

0.3.15

Nov 4, 2025

0.3.13

Oct 22, 2025

0.3.10

Oct 17, 2025

0.3.9

Oct 16, 2025

0.3.8

Oct 16, 2025

0.3.6

Oct 15, 2025

0.3.2

Oct 11, 2025

0.3.1

Oct 10, 2025

0.2.19

Sep 26, 2025

0.2.18

Sep 24, 2025

0.2.17

Sep 20, 2025

0.2.8

Sep 15, 2025

0.2.7

Sep 12, 2025

0.2.1

Aug 28, 2025

0.2.0

Aug 27, 2025

0.1.26

Aug 26, 2025

0.1.15

Aug 13, 2025

0.1.14

Aug 12, 2025

0.1.13

Aug 10, 2025

0.1.12

Aug 9, 2025

0.1.11

Aug 8, 2025

0.1.10

Aug 4, 2025

0.1.9

Aug 4, 2025

0.1.8

Jul 25, 2025

0.1.6

Jul 21, 2025

0.1.4

Jul 18, 2025

0.1.3

Jul 18, 2025

0.1.2

Jul 18, 2025

0.1.1

Jul 18, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

vllm_rs-0.11.5-cp38-abi3-macosx_11_0_arm64.whl (11.2 MB view details)

Uploaded May 21, 2026 CPython 3.8+macOS 11.0+ ARM64

File details

Details for the file vllm_rs-0.11.5-cp38-abi3-macosx_11_0_arm64.whl.

File metadata

Download URL: vllm_rs-0.11.5-cp38-abi3-macosx_11_0_arm64.whl
Upload date: May 21, 2026
Size: 11.2 MB
Tags: CPython 3.8+, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.9.0

File hashes

Hashes for vllm_rs-0.11.5-cp38-abi3-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`892470cc26d2cc45c2f5b048b74c70895f221454d11cd7228b1b7d100e8f1f0c`
MD5	`8709ce31642cf83039b1f0f718f9cc5a`
BLAKE2b-256	`2b9c3918782cba82e55c8fe97e949e67140d71d20395da4c6f52b0875d3055a5`

See more details on using hashes here.

vllm-rs 0.11.5

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

🚀 vLLM.rs

✨ Why vLLM.rs?

Quick Start

Option A — 🚀 Rust (recommended)

Option B — 📦 Python (pip install)

Option C — Install with Docker

📈 Performance

🧠 Supported Models

TurboQuant KV Cache — Run 30B+ Models on Consumer GPUs

📘 Usage (Rust)

Installation

Running Models

📘 Usage (Python)

Running Models

🔀 Prefill-Decode Disaggregation

🔌 MCP Tool Calling

🔌 Structured Outputs

📚 Documentation

⚙️ CLI Reference

📽️ Demo

🛠️ Roadmap

📚 References

Star History

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distributions

Built Distribution

File details

File metadata

File hashes

Option B — 📦 Python (`pip install`)