
🚀 vLLM.rs – A Minimalist vLLM in Rust

A blazing-fast ⚡, lightweight Rust 🦀 implementation of vLLM.


English | 简体中文 |

✨ Key Features

  • 🔧 Pure Rust Backend – Absolutely no PyTorch required
  • 🚀 High Performance (with Context-cache and PD Disaggregation) – Outperforms its Python counterparts
  • 🧠 Minimalist Core – Core logic written in **~2,000 lines** of clean Rust
  • 💻 Cross-Platform – Supports CUDA (Linux/Windows) and Metal (macOS)
  • 🤖 Built-in Chatbot/API Server – Native Rust server for both CUDA and Metal
  • 🐍 Lightweight Python Interface – PyO3-powered bindings for chat completion
  • 🤝 Open for Contributions – PRs, issues, and stars are welcome!

Chat Performance

A100 (Single Card, 40G)

| Model | Format | Size | Decoding Speed |
|---|---|---|---|
| Llama-3.1-8B | ISQ (BF16→Q4K) | 8B | 90.19 tokens/s |
| DeepSeek-R1-Distill-Llama-8B | Q2_K | 8B | 94.47 tokens/s |
| DeepSeek-R1-0528-Qwen3-8B | Q4_K_M | 8B | 95 tokens/s |
| GLM-4-9B-0414 | Q4_K_M | 9B | 70.38 tokens/s |
| QwQ-32B | Q4_K_M | 32B | 35.69 tokens/s |
| Qwen3-30B-A3B | Q4_K_M | 30B (MoE) | 75.91 tokens/s |

Performance of vLLM.rs on Metal (Apple Silicon, M4)

Models: Qwen3-0.6B (BF16), Qwen3-4B (Q4_K_M), Qwen3-8B (Q2_K); Concurrent Requests: 1 - 128; Max Model Length: 512 - 2048; Max Output Tokens / Request: 512 - 2048;

| Model | Batch Size | Output Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|---|
| Qwen3-0.6B (BF16) | 128 | 63488 | 83.13 | 763.73 |
| Qwen3-0.6B (BF16) | 32 | 15872 | 23.53 | 674.43 |
| Qwen3-0.6B (BF16) | 1 | 456 | 9.23 | 49.42 |
| Qwen3-4B (Q4_K_M) | 1 | 1683 | 52.62 | 31.98 |
| Qwen3-8B (Q2_K) | 1 | 1300 | 80.88 | 16.07 |
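The throughput figures above are simply output tokens divided by wall-clock time. A quick sanity check of a few rows (values copied from the table):

```python
# Verify throughput = output tokens / time for a few benchmark rows.
rows = [
    ("Qwen3-0.6B (BF16), batch 128", 63488, 83.13, 763.73),
    ("Qwen3-0.6B (BF16), batch 32", 15872, 23.53, 674.43),
    ("Qwen3-8B (Q2_K), batch 1", 1300, 80.88, 16.07),
]
for name, tokens, secs, reported in rows:
    computed = tokens / secs
    # Allow small rounding differences against the reported figure.
    assert abs(computed - reported) < 0.5, name
    print(f"{name}: {computed:.2f} tokens/s")
```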

Performance Comparison

Model: Qwen3-0.6B (BF16); Concurrent Requests: 256; Max Model Length: 1024; Max Output Tokens / Request: 1024

| Inference Engine | Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|
| vLLM (RTX 4070, reference) | 133,966 | 98.37 | 1361.84 |
| Nano-vLLM (RTX 4070, reference) | 133,966 | 93.41 | 1434.13 |
| vLLM.rs (A100) | 262,144 | 23.88 | 10977.55 (40%+ speedup) |
| Nano-vLLM (A100) | 262,144 | 34.22 | 7660.26 |
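The "40%+ speedup" figure follows directly from the A100 rows, since both engines generated the same number of tokens:

```python
# Speedup of vLLM.rs over Nano-vLLM on the same A100 workload (262,144 tokens).
vllm_rs_tps = 262144 / 23.88    # vLLM.rs throughput, tokens/s
nano_vllm_tps = 262144 / 34.22  # Nano-vLLM throughput, tokens/s
speedup = vllm_rs_tps / nano_vllm_tps - 1.0  # ≈ 0.43
print(f"speedup: {speedup:.1%}")
```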

Reproducible steps

🧠 Supported Architectures

  • ✅ LLaMa (LLaMa2, LLaMa3)
  • ✅ Qwen (Qwen2, Qwen3)
  • ✅ Qwen2 Moe
  • ✅ Qwen3 Moe
  • ✅ Mistral
  • ✅ GLM4 (0414, Not ChatGLM)

Supports both Safetensors (including GPTQ and AWQ formats) and GGUF formats.

📽️ Demo Video

Watch it in action 🎉

📘 Usage in Rust

Run with `--i` for interactive chat 🤖, `--server` for server mode 🌐, `--m` to specify a Hugging Face model ID, `--w` to specify a local safetensors model path, or `--f` to specify a local GGUF file:

# Naive CUDA chat mode (single card only, optional `--fp8-kvcache`)
cargo run --release --features cuda,nccl -- --i --d 0 --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --context-cache

# Multi-GPU chat mode (+Flash Attention, +CUDA graph; this script helps build the runner)
./run.sh --release --features cuda,nccl,graph,flash-attn -- --i --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 262144 --context-cache

# Multi-GPU server mode (unquantized models)
./run.sh --release --features cuda,nccl,graph,flash-attn -- --d 0,1,2,3 --w /path/Qwen3-30B-A3B-Instruct-2507 --max-model-len 262144 --max-num-seqs 2 --server --port 8000

# Multi-GPU server mode (load as Q4K format, with `--fp8-kvcache` or `--context-cache`)
./run.sh --release --features cuda,nccl,graph,flash-attn -- --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 262144 --max-num-seqs 2 --server --port 8000 --fp8-kvcache

# Multi-GPU server mode (with `--context-cache`, Flash Attention used in both prefill/decode, long time to build)
./run.sh --release --features cuda,nccl,flash-context -- --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 262144 --max-num-seqs 2 --server --port 8000 --context-cache

# macOS chat mode (Metal, or `--server` for server mode)
cargo run --release --features metal -- --i --f /path/DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf

# macOS (Metal, ISQ)
cargo run --release --features metal -- --i --w /path/Qwen3-0.6B --isq q4k --context-cache

🚀 Prefill-decode Disaggregation

# Start the PD server (`--port` not required since it does not respond to requests directly)
# Option 1: Rust
./run.sh --release --features cuda,nccl,graph,flash-attn -- --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 262144 --max-num-seqs 2 --server --pd-server
# Option 2: Python (dependencies: pip install vllm_rs fastapi uvicorn)
python3 -m vllm_rs.server --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 262144 --max-num-seqs 2 --d 0,1 --pd-server

# Start the corresponding PD client
# Option 1: Rust
./run.sh --release --features cuda,nccl,graph,flash-attn -- --d 2,3 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 200000 --max-num-seqs 2 --server --port 8000 --pd-client
# Option 2: Python
python3 -m vllm_rs.server --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 200000 --max-num-seqs 2 --d 2,3 --port 8000 --pd-client

# The PD server and client must use the same number of ranks and the same model
# (formats may differ, e.g., unquantized safetensors on the server but GGUF on the client).

# If `--pd-url` is provided (e.g., server side: 0.0.0.0:8100, client side: server_ip:8100),
# the PD server binds to the given address and the PD client connects to it, so the two can
# run on different machines. This feature is experimental. On Metal, `--pd-url` is required
# because the platform does not support LocalIPC for transferring GPU memory handles.

📘 Usage in Python

📦 Install with pip

💡 1. A manual build is required for CUDA compute capability < 8.0 (e.g., V100, which has no flash-attn support)

💡 2. The prebuilt package includes a native context-cache feature that does not rely on Flash Attention; a manual build is required for the flash-context feature

python3 -m pip install vllm_rs fastapi uvicorn

🌐✨ API Server Mode

💡 You can use any client compatible with the OpenAI API.

🤖 Here is the client-side usage of the context cache

# Start OpenAI API Server (default http://0.0.0.0:8000)
# openai.base_url = "http://localhost:8000/v1/"
# openai.api_key = "EMPTY"

# Local gguf file (`--f`), max output tokens for each request (`--max-tokens`), FP8 KV Cache (`--fp8-kvcache`, slight accuracy degradation)
python -m vllm_rs.server --f /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --host 0.0.0.0 --port 8000 --max-tokens 32768 --max-model-len 128000 --fp8-kvcache

# Use model weights from huggingface (`--m`: model_id, `--f`: gguf file)
python -m vllm_rs.server --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --host 0.0.0.0 --port 8000

# Multi-GPU (`--d`)
python -m vllm_rs.server --f /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --d 0,1 --host 0.0.0.0 --port 8000 --max-model-len 64000

# Multi-GPU for a safetensors model: local path (`--w`) with in-situ quantization to Q4K during loading (enables maximum context length)
python -m vllm_rs.server --w /path/Qwen3-30B-A3B-Instruct-2507 --d 0,1 --host 0.0.0.0 --port 8000 --isq q4k --max-model-len 262144 --max-num-seqs 1

# Multi-GPU inference + context caching for a GGUF model.
# To cache context, include a `session_id` in the `extra_body` field of requests made
# through the OpenAI API. Keep the same session_id for the whole conversation and use
# a new one for each new conversation; unused session caches are cleared.
# No other API settings need to change.
python -m vllm_rs.server --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --d 0,1 --host 0.0.0.0 --port 8000 --max-model-len 64000 --max-num-seqs 8 --context-cache
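As noted above, context caching is keyed by a `session_id` sent through the OpenAI API's `extra_body` (which the official client merges into the top-level request JSON). Below is a minimal client sketch using only the standard library; the endpoint path follows the OpenAI convention, and the `model` value and helper names are illustrative assumptions:

```python
import json
import uuid
from urllib.request import Request, urlopen

# One session_id per conversation; reuse it across turns so the server can
# reuse the cached context, and start a fresh id for a new conversation.
SESSION_ID = str(uuid.uuid4())

def build_request(messages, base_url="http://localhost:8000/v1"):
    """Build an OpenAI-compatible chat request carrying the session_id."""
    payload = {
        "model": "default",        # placeholder; use the served model's name
        "messages": messages,
        "session_id": SESSION_ID,  # extra field that enables the context cache
    }
    return Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer EMPTY"},
    )

def chat(messages):
    """Send one chat turn and return the assistant's reply text."""
    with urlopen(build_request(messages)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With the `openai` Python package, the equivalent is passing `extra_body={"session_id": ...}` to `client.chat.completions.create(...)`.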

Interactive Chat and Completion

# Interactive chat
# Load with model id
python -m vllm_rs.chat --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --fp8-kvcache

# Local GGUF file on the second device (`--d 1`)
python -m vllm_rs.chat --d 1 --f /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf

# Load an unquantized safetensors model as GGUF-quantized (e.g., q4k), with maximum model context length
python -m vllm_rs.chat --d 0 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 262144 --max-num-seqs 1 --max-tokens 16384

# Enable context cache for fast response (CUDA)
python -m vllm_rs.chat --d 0,1 --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --max-model-len 262144 --max-num-seqs 1 --context-cache

# ISQ q4k (macOS/Metal recommended, optional `--context-cache`)
python -m vllm_rs.chat --w /path/Qwen3-0.6B --isq q4k

# Chat completion
python -m vllm_rs.completion --f /path/qwq-32b-q4_k_m.gguf --prompts "How are you? | How to make money?"

# Chat completion (Multi-GPU, CUDA)
python -m vllm_rs.completion --w /home/GLM-4-9B-0414 --d 0,1 --batch 8 --max-model-len 1024 --max-tokens 1024

🐍 Python API

from vllm_rs import Engine, EngineConfig, SamplingParams, Message

cfg = EngineConfig(weight_path="/path/Qwen3-8B-Q2_K.gguf", max_model_len=4096)
engine = Engine(cfg, "bf16")
params = SamplingParams(temperature=0.6, max_tokens=256)
prompt = engine.apply_chat_template([Message("user", "How are you?")], True)

# Synchronous generation for batched input (one SamplingParams per prompt)
outputs = engine.generate_sync([params, params], [prompt, prompt])
print(outputs)

params.session_id = xxx  # pass a session id to use the context cache
# Streaming generation for a single request
(seq_id, prompt_length, stream) = engine.generate_stream(params, prompt)
for item in stream:
    # item.datatype == "TOKEN"
    print(item.data)

🔨 Build Python Package from source (Optional)

⚠️ The first build may take time if Flash Attention is enabled.

⚠️ When enabling context caching or multi-GPU inference, you also need to compile the Runner (using build.sh or run.sh).

🛠️ Prerequisites

Building steps

  1. Install Maturin
# install build dependencies (Linux)
sudo apt install libssl-dev pkg-config -y
pip install maturin
pip install maturin[patchelf]  # patchelf extra is Linux-only
  2. Build the Python package
# Naive CUDA (single GPU only) 
maturin build --release --features cuda,python

# Naive CUDA (+CUDA Graph, experimental)
./build.sh --release --features cuda,graph,python

# CUDA (with context-cache and FP8 KV Cache, no Flash Attention) 
./build.sh --release --features cuda,nccl,python

# CUDA (+Flash Attention, only used in prefill stage) 
./build.sh --release --features cuda,nccl,flash-attn,python

# CUDA (+Flash Attention, used in both prefill and decode stage, long time to build) 
./build.sh --release --features cuda,nccl,flash-context,python

# macOS (Metal, single GPU only, with Context-cache and FP8 kvcache)
maturin build --release --features metal,python
  3. Install the packages
# the package you built
pip install target/wheels/vllm_rs-*-cp38-abi3-*.whl --force-reinstall
pip install fastapi uvicorn

⚙️ Command Line Arguments

| Flag | Description |
|---|---|
| --m | Hugging Face model ID |
| --w | Path to a safetensors model |
| --f | GGUF filename (when a model ID is given) or a local GGUF file path |
| --d | Device ID(s) (e.g., --d 0 or --d 0,1) |
| --max-num-seqs | Maximum number of concurrent requests (default: 32, 8 on macOS) |
| --max-tokens | Max tokens per response (default: 4096, up to max_model_len) |
| --batch | Benchmark only (overrides --max-num-seqs and ignores --prompts) |
| --prompts | Prompts separated by \| |
| --dtype | KV cache dtype: bf16 (default), f16, or f32 |
| --isq | Load an unquantized model in a GGUF-quantized format such as q2k, q4k, etc. |
| --temperature | Controls randomness: lower (0.0) → deterministic, higher (1.0) → creative/random |
| --top-k | Limits choices to the top k highest-probability tokens; smaller k → more stable, larger k → more random |
| --top-p | Chooses the smallest set of tokens whose cumulative probability ≥ p. Typical range: 0.8 to 0.95 |
| --presence-penalty | Controls whether the model avoids reusing tokens that have already appeared. Range [-2, 2]: higher positive values → more likely to introduce new tokens; negative values → more likely to repeat previously used tokens |
| --frequency-penalty | Reduces the probability of tokens that appear too often. Range [-2, 2]: higher positive values → stronger penalty for frequently repeated tokens; negative values → encourage more repetition |
| --server | Server mode in the Rust CLI; in Python, use python -m vllm_rs.server |
| --fp8-kvcache | Use FP8 KV cache (when flash-context is not enabled) |
| --cpu-mem-fold | Ratio of CPU KV cache size to GPU KV cache size (default: 1.0, range: 0.1 to 10.0) |
| --pd-server | With PD Disaggregation, run this instance as the PD server (used only for prefill) |
| --pd-client | With PD Disaggregation, run this instance as the PD client (sends long-context prefill requests to the PD server) |
| --pd-url | With PD Disaggregation, communicate over TCP/IP (used when the PD server and client are on different machines) |
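The sampling flags compose in a standard order. The sketch below is a schematic illustration of how temperature, top-k, and top-p filter a next-token distribution, not vLLM.rs's actual implementation (the function name is hypothetical):

```python
import math

def sample_filter(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Schematic next-token filtering: temperature scaling, then top-k,
    then top-p (nucleus) truncation. Returns surviving (index, prob) pairs."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = [l / max(temperature, 1e-6) for l in logits]
    # Softmax over the scaled logits.
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = sorted(((i, e / total) for i, e in enumerate(exps)),
                   key=lambda x: -x[1])
    # Top-k: keep only the k highest-probability tokens.
    if top_k > 0:
        probs = probs[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability >= p.
    kept, cum = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    return kept

# Low temperature sharpens toward the argmax; with top_p=0.5 only it survives.
print(sample_filter([2.0, 1.0, 0.1], temperature=0.1, top_p=0.5))
```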

🗜️ In-Situ Quantization (GGUF Conversion during loading)

💡 Run any unquantized model in a GGUF-quantized format; note that --isq options other than q4k and q8_0 may take a few minutes to convert.

# macOS
cargo run --release --features metal -- --w /path/Qwen3-0.6B/ --isq q4k --prompts "How are you today?"

# CUDA
cargo run --release --features cuda,flash-attn -- --w /path/Qwen3-8B/ --isq q4k --prompts "How are you today?"

📌 Project Status

🚧 Under active development – breaking changes may occur!

🛠️ Roadmap

  • Batched inference (Metal)
  • GGUF format support
  • FlashAttention (CUDA)
  • CUDA Graph
  • OpenAI-compatible API (streaming support)
  • Continuous batching
  • Multi-GPU inference (Safetensors, GPTQ, AWQ, GGUF)
  • Speedup prompt processing on Metal/macOS
  • Chunked Prefill
  • Session-based context cache (available on CUDA when context-cache enabled)
  • Model loading from the Hugging Face Hub
  • Model loading from ModelScope (China)
  • Context cache for Metal/macOS
  • FP8 KV Cache (CUDA)
  • FP8 KV Cache (Metal)
  • FP8 KV Cache (with Flash-Attn)
  • Additional model support (GLM 4.6, Kimi K2 Thinking, etc.)
  • CPU KV Cache Offloading
  • Prefill-decode Disaggregation (CUDA)
  • Prefill-decode Disaggregation (Metal)

💡 Like this project? Give it a ⭐ and contribute!
